Ahh, statistical hypothesis testing—it's a cornerstone of classical statistical inference. 
It can also seem a bit daunting when presented as a list of [over 100 different methods](https://en.wikipedia.org/wiki/Category:Statistical_tests) 
from which the user is expected to choose the right procedure based on their application domain, data, and question.
But it doesn't have to be that way.

*It turns out that many traditional frequentist tests can be viewed as special cases of regression models.*

This is a powerful and liberating idea, because it can help us transition from choosing from the zoo of canned methods to building our own bespoke analyses, tailored to our exact use case.
Adopting this perspective will allow us to

- transcend rote memorization of which test should be applied where and instead reason from first principles
- make our assumptions clear and explicit 
- make extensions like incorporating covariates and interactions more natural

So the plan for this post is to take a look at a few of the most common statistical tests, and formulate each one as a regression model. We'll implement each one using its dedicated stats library method and also a generic regression model.

Let's roll!

![Flow chart for choosing an appropriate statistical test](stat-flow-chart.png "")

I've got a quick one for you today; here's the key idea up front.

*Many common statistical tests are equivalent to tests on the coefficients of regression models.*

This is useful to know because it helps collapse the multiplicity of different statistical tests down into a simple process of writing a model for the data generating process and then doing inference on the parameter(s) of interest.


## Two-Sample t-test

Here's the setup. You have two populations or processes $Y_0$ and $Y_1$, and you want to know whether their true means $\mu_0$ and $\mu_1$ are equal or not. We assume that the two processes are Gaussian with equal but unknown variance $\sigma^2$. So far we have

$$ E[Y_0] = \mu_0$$
$$ E[Y_1] = \mu_1 $$
$$ Var[Y_0] = Var[Y_1] = \sigma^2 $$

To do inference on the potential difference in means, you draw samples from each process or "group" ($n_0$ samples from group 0 and $n_1$ samples from group 1) and compute the sample means $\bar{Y}_0$ and $\bar{Y}_1$. Your test statistic $t$ for the two-sample t-test is

$$t =  \frac{\bar{Y}_1 - \bar{Y}_0}{\sqrt{\hat{\sigma}_{\text{pool}}^2 (1/n_0+1/n_1)}} $$

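where $\hat{\sigma}_{\text{pool}}^2$ is the pooled sample variance, i.e. the average of the two within-group sample variances $s_0^2$ and $s_1^2$, weighted by $n_0-1$ and $n_1-1$, so that $\hat{\sigma}_{\text{pool}}^2 = \frac{(n_0-1)s_0^2 + (n_1-1)s_1^2}{n_0+n_1-2}$. Note that this is not the variance of the concatenated samples, which would also pick up any difference between the group means.
Under the null hypothesis where $\mu_0 = \mu_1$, this test statistic's sampling distribution is a Student's t-distribution with $n_0+n_1-2$ degrees of freedom.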

Having horrifying flashbacks to your intro to stats class yet? Good. Let's look at it from another angle.

Let's write this data generating process down like a linear model.

$$ Y = \beta_0 + \beta_1 X + \epsilon $$

where $X \in \{0,1\}$ indexes the two groups and $\epsilon \sim N(0,\sigma^2)$. Look what happens when we take the conditional expectation and conditional variance of $Y$ given $X$.

$$ E[Y|X=0] = \beta_0 $$
$$ E[Y|X=1] = \beta_0 + \beta_1 $$
$$ Var[Y|X] = \sigma^2 $$

So in this setup, the difference in means is given by

$$ E[Y|X=1] - E[Y|X=0] = \beta_1 $$

---

Then we derive the mean and the variance of $\hat{\beta}_1$ in the regression formulation and the mean and variance of the difference in sample means in the t-test formulation, showing they're the same.

## Two-Sample t-test

So we're going to do a detailed analytical breakdown of the two-sample t-test from two perspectives—the classical setup and a linear regression reformulation. In each case we'll break the approach down into these items: data generating process, estimator, expectation and variance of the estimator, test statistic, and sampling distribution of the test statistic. You can use this kind of breakdown to understand pretty much any classical statistical test. In this case, the point is to clearly show that the classical t-test and the linear regression formulation yield identical tests. 

### The Classical t-test Approach

**The data generating process**

You have two populations or processes $Y_0$ and $Y_1$, and you want to know whether their true means $\mu_0$ and $\mu_1$ are equal. We assume that both processes are Gaussian with equal but unknown variance $\sigma^2$:

$$ Y_0 \sim N(\mu_0, \sigma^2), \quad Y_1 \sim N(\mu_1, \sigma^2) $$

**The estimator**

You draw $n_0$ samples from group 0 and $n_1$ samples from group 1 for a total of $n=n_0+n_1$ samples, and compute the sample means $\bar{Y}_0$ and $\bar{Y}_1$. Your estimator for the difference in means is simply:

$$\hat{\delta} = \bar{Y}_1 - \bar{Y}_0$$

**Expectation of the estimator**

Since $E[\bar{Y}_0] = \mu_0$ and $E[\bar{Y}_1] = \mu_1$, we have:

$$E[\hat{\delta}] = E[\bar{Y}_1 - \bar{Y}_0] = \mu_1 - \mu_0$$

So $\hat{\delta}$ is an unbiased estimator of the true difference in means.

**Standard error of the estimator**

The sample means are independent, so:

$$Var[\hat{\delta}] = Var[\bar{Y}_1] + Var[\bar{Y}_0] = \frac{\sigma^2}{n_1} + \frac{\sigma^2}{n_0} = \sigma^2\left(\frac{1}{n_1} + \frac{1}{n_0}\right)$$

Since we don't know $\sigma^2$, we estimate it with the pooled sample variance:

$$\hat{\sigma}_{\text{pooled}}^2 = \frac{(n_0-1)s_0^2 + (n_1-1)s_1^2}{n_0 + n_1 - 2}$$

where $s_0^2$ and $s_1^2$ are the sample variances for each group. This gives us the estimated standard error:

$$SE(\hat{\delta}) = \sqrt{\hat{\sigma}_{\text{pooled}}^2\left(\frac{1}{n_0} + \frac{1}{n_1}\right)}$$

**The test statistic**

We form the test statistic by dividing our estimator by its standard error:

$$t = \frac{\hat{\delta}}{SE(\hat{\delta})} = \frac{\bar{Y}_1 - \bar{Y}_0}{\sqrt{\hat{\sigma}_{\text{pooled}}^2 (1/n_0 + 1/n_1)}}$$

**Sampling distribution**

Under the null hypothesis $H_0: \mu_1 = \mu_0$, this test statistic follows a Student's t-distribution with $n_0 + n_1 - 2$ degrees of freedom.
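As a rough check on the recipe above, here is a minimal sketch that computes each piece by hand using only Python's standard library. The data values and the helper name `pooled_t_test` are made up for illustration; they don't come from a real dataset or from any stats library.

```python
from math import sqrt
from statistics import mean, variance

# Made-up illustrative samples (an assumption, not data from the post).
y0 = [4.1, 5.0, 3.8, 4.6, 5.2]        # group 0
y1 = [5.9, 6.3, 5.1, 6.8, 6.0, 5.7]   # group 1


def pooled_t_test(y0, y1):
    """Hypothetical helper: classical equal-variance two-sample t statistic."""
    n0, n1 = len(y0), len(y1)

    # Estimator: difference in sample means.
    delta_hat = mean(y1) - mean(y0)

    # statistics.variance is the sample variance (n - 1 denominator),
    # which is what the pooled variance formula expects.
    pooled_var = ((n0 - 1) * variance(y0) + (n1 - 1) * variance(y1)) / (n0 + n1 - 2)

    # Estimated standard error of the difference in means.
    se = sqrt(pooled_var * (1 / n0 + 1 / n1))

    # Test statistic and its degrees of freedom under the null.
    return delta_hat, se, delta_hat / se, n0 + n1 - 2


delta_hat, se, t_stat, df = pooled_t_test(y0, y1)
print(f"delta_hat={delta_hat:.4f}  se={se:.4f}  t={t_stat:.4f}  df={df}")
```

Turning the statistic into a p-value would additionally require the CDF of the t-distribution with `df` degrees of freedom, which this sketch leaves out.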

Having horrifying flashbacks to your intro to stats class yet? No worries. Let's look at it from a new perspective.


### The Regression Approach

**The data generating process**

We can express the exact same data generating process as a linear regression model. Stack all observations into a single length-$n$ vector $Y$ and create a dummy variable $X \in \{0,1\}$ indexing which group each observation came from:

$$ Y = \beta_0 + \beta_1 X + \epsilon $$

where $\epsilon \sim N(0, \sigma^2)$.

Taking conditional expectations:

$$ E[Y|X=0] = \beta_0 = \mu_0 $$
$$ E[Y|X=1] = \beta_0 + \beta_1 = \mu_1 $$

So we can see that $\beta_1 = \mu_1 - \mu_0$, meaning the regression coefficient $\beta_1$ directly represents the difference in population means.

**The estimator**

The ordinary least squares estimator for $\beta_1$ is:

$$\hat{\beta}_1 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^{n} (X_i - \bar{X})^2}$$

For our dummy variable, where $\bar{X} = n_1/(n_0 + n_1)$, and after some algebra that you can crank through on your own, this simplifies to:

$$\hat{\beta}_1 = \bar{Y}_1 - \bar{Y}_0$$

Well look at that—the regression coefficient estimate is exactly the difference in sample means!
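If you'd rather not crank through it, here is one way the algebra can go. This is a sketch, writing $n = n_0 + n_1$ and using the fact that $X_i^2 = X_i$ for a 0/1 dummy. Because $\sum_i (X_i - \bar{X}) = 0$, the $\bar{Y}$ term drops out of the numerator:

$$
\begin{aligned}
\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})
  &= \sum_{i=1}^{n} (X_i - \bar{X})\,Y_i
   = n_1 \bar{Y}_1 - \frac{n_1}{n}\left(n_0 \bar{Y}_0 + n_1 \bar{Y}_1\right)
   = \frac{n_0 n_1}{n}\left(\bar{Y}_1 - \bar{Y}_0\right) \\
\sum_{i=1}^{n} (X_i - \bar{X})^2
  &= \sum_{i=1}^{n} X_i^2 - n\bar{X}^2
   = n_1 - \frac{n_1^2}{n}
   = \frac{n_0 n_1}{n}
\end{aligned}
$$

Dividing the first line by the second cancels the common factor $n_0 n_1 / n$ and leaves $\bar{Y}_1 - \bar{Y}_0$.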

**Expectation of the estimator**

By the properties of OLS under our model assumptions:

$$E[\hat{\beta}_1] = \beta_1 = \mu_1 - \mu_0$$

So $\hat{\beta}_1$ is also an unbiased estimator of the difference in means.

**Standard error of the estimator**

The standard error formula for an OLS coefficient is:

$$SE(\hat{\beta}_1) = \sqrt{\hat{\sigma}^2 \cdot \frac{1}{\sum_{i=1}^{n}(X_i - \bar{X})^2}}$$

where $\hat{\sigma}^2$ is the residual variance from the regression:

$$\hat{\sigma}^2 = \frac{1}{n_0 + n_1 - 2}\sum_{i=1}^{n}(Y_i - \hat{Y}_i)^2$$

For our dummy variable, it turns out that:
- The residual variance $\hat{\sigma}^2$ equals the pooled variance $\hat{\sigma}_{\text{pooled}}^2$
- The sum $\sum_{i=1}^{n}(X_i - \bar{X})^2 = \frac{n_0 n_1}{n_0 + n_1}$
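
Both facts can be checked directly. Here's a sketch, writing $n = n_0 + n_1$. For the sum, $X_i^2 = X_i$ for a 0/1 dummy, so

$$\sum_{i=1}^{n}(X_i - \bar{X})^2 = \sum_{i=1}^{n} X_i^2 - n\bar{X}^2 = n_1 - \frac{n_1^2}{n} = \frac{n_0 n_1}{n_0 + n_1}$$

For the residual variance, the OLS intercept is $\hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{X} = \bar{Y}_0$, so the fitted value is $\bar{Y}_0$ for every observation in group 0 and $\hat{\beta}_0 + \hat{\beta}_1 = \bar{Y}_1$ for every observation in group 1. The residual sum of squares therefore splits by group:

$$\sum_{i=1}^{n}(Y_i - \hat{Y}_i)^2 = \sum_{i:\,X_i=0}(Y_i - \bar{Y}_0)^2 + \sum_{i:\,X_i=1}(Y_i - \bar{Y}_1)^2 = (n_0-1)s_0^2 + (n_1-1)s_1^2$$

Dividing by $n_0 + n_1 - 2$ gives exactly $\hat{\sigma}_{\text{pooled}}^2$.
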

Substituting these:

$$SE(\hat{\beta}_1) = \sqrt{\hat{\sigma}_{\text{pooled}}^2 \cdot \frac{n_0 + n_1}{n_0 n_1}} = \sqrt{\hat{\sigma}_{\text{pooled}}^2\left(\frac{1}{n_0} + \frac{1}{n_1}\right)}$$

This is exactly the same standard error we got from the classical approach.
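To see the match numerically, here is a minimal sketch that fits the dummy-variable regression with the closed-form OLS formulas above and compares it to the classical calculation. It uses only the standard library so the arithmetic stays visible; the data values are made up for illustration.

```python
from math import isclose, sqrt
from statistics import mean, variance

# Made-up illustrative samples (an assumption, not data from the post).
y0 = [4.1, 5.0, 3.8, 4.6, 5.2]        # group 0
y1 = [5.9, 6.3, 5.1, 6.8, 6.0, 5.7]   # group 1
n0, n1 = len(y0), len(y1)
n = n0 + n1

# Regression view: stack the observations and build the 0/1 dummy.
y = y0 + y1
x = [0.0] * n0 + [1.0] * n1

x_bar, y_bar = mean(x), mean(y)
sxx = sum((xi - x_bar) ** 2 for xi in x)
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

beta1_hat = sxy / sxx                    # OLS slope
beta0_hat = y_bar - beta1_hat * x_bar    # OLS intercept

# Residual variance with n - 2 residual degrees of freedom.
residuals = [yi - (beta0_hat + beta1_hat * xi) for xi, yi in zip(x, y)]
sigma2_hat = sum(r ** 2 for r in residuals) / (n - 2)
se_beta1 = sqrt(sigma2_hat / sxx)
t_regression = beta1_hat / se_beta1

# Classical view: difference in means over the pooled standard error.
pooled_var = ((n0 - 1) * variance(y0) + (n1 - 1) * variance(y1)) / (n - 2)
se_delta = sqrt(pooled_var * (1 / n0 + 1 / n1))
t_classical = (mean(y1) - mean(y0)) / se_delta

print("sum of squares matches n0*n1/n:", isclose(sxx, n0 * n1 / n))
print("residual variance matches pooled:", isclose(sigma2_hat, pooled_var))
print("standard errors match:", isclose(se_beta1, se_delta))
print("t statistics match:", isclose(t_regression, t_classical))
```

The comparisons use `math.isclose` rather than `==` because the two routes reach the same quantity through different floating-point operations.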

**The test statistic**

We form the test statistic by dividing our coefficient estimate by its standard error:

$$t = \frac{\hat{\beta}_1}{SE(\hat{\beta}_1)} = \frac{\bar{Y}_1 - \bar{Y}_0}{\sqrt{\hat{\sigma}_{\text{pooled}}^2 (1/n_0 + 1/n_1)}}$$

**Sampling distribution**

Under the null hypothesis $H_0: \beta_1 = 0$, this test statistic follows a Student's t-distribution with $n_0 + n_1 - 2$ degrees of freedom (the residual degrees of freedom from the regression).

### The Punchline

See what just happened? The two approaches give us:
- The same point estimate: $\hat{\delta} = \hat{\beta}_1 = \bar{Y}_1 - \bar{Y}_0$
- The same standard error: $\sqrt{\hat{\sigma}_{\text{pooled}}^2(1/n_0 + 1/n_1)}$
- The same test statistic: $t = \frac{\bar{Y}_1 - \bar{Y}_0}{\sqrt{\hat{\sigma}_{\text{pooled}}^2 (1/n_0 + 1/n_1)}}$
- The same sampling distribution: $t_{n_0+n_1-2}$
- Therefore, the same p-value

The two-sample t-test **is** a linear regression with a dummy variable. They're not similar procedures or approximations of each other—they're mathematically identical. Every single calculation matches. The t-test is just a special case of regression that got its own name and recipe because it's so commonly used.