# The F-test for comparing nested linear models

Recall the setup of the $F$ test.  We are comparing two nested linear models:

$$
\begin{align*}
\textrm{Null Hypothesis:} \hphantom{dd} Y &= \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_q X_q +  0 X_{q+1} + \dots +0 X_p + \epsilon\\
\textrm{Alternative Hypothesis:} \hphantom{dd} Y &= \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_q X_q +  \beta_{q+1} X_{q+1} + \dots +\beta_p X_p + \epsilon
\end{align*}
$$

where $\epsilon \sim \mathcal{N}(0, \sigma^2)$.

Define the $F$-statistic

$$
F = \frac{(\textrm{RSS}_0 - \textrm{RSS}_a) / (p-q)}{\textrm{RSS}_a/ (n - p - 1)}
$$

where $\textrm{RSS}_0$ is the sum of the squared residuals from the fitted reduced model (the null hypothesis) and $\textrm{RSS}_a$ is the sum of the squared residuals from the fitted full model (the alternative hypothesis).
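
As a concrete illustration, here is a minimal sketch of computing this statistic with NumPy. The helper names `rss` and `f_statistic`, the simulated data, and the convention that the design matrix carries a leading column of ones are all assumptions made for this example, not part of the derivation.

```python
import numpy as np

def rss(X, y):
    """Residual sum of squares from the least-squares fit of y on the columns of X."""
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta_hat
    return float(resid @ resid)

def f_statistic(X_full, y, q):
    """F-statistic for H0: the last p - q slope coefficients are zero.

    X_full is n x (p + 1) with a leading column of ones
    (assumed to have full column rank).
    """
    n, p_plus_1 = X_full.shape
    p = p_plus_1 - 1
    X_red = X_full[:, : q + 1]   # intercept plus the first q predictors
    rss_0 = rss(X_red, y)        # reduced (null) model
    rss_a = rss(X_full, y)       # full (alternative) model
    return ((rss_0 - rss_a) / (p - q)) / (rss_a / (n - p - 1))

# Hypothetical data generated under the null: the last two slopes are zero.
rng = np.random.default_rng(0)
n, p, q = 50, 4, 2
X_full = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
beta_true = np.array([1.0, 2.0, -1.0, 0.0, 0.0])
y = X_full @ beta_true + rng.normal(scale=1.5, size=n)

F = f_statistic(X_full, y, q)
print(F)
```

Because the reduced model is nested in the full one, `rss_0` can never be smaller than `rss_a`, so the statistic is non-negative.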

We will show that, *if the null hypothesis is true*, $F \sim  F_{p-q, n-p-1}$, where the latter is [Snedecor's F-distribution](https://en.wikipedia.org/wiki/F-distribution): $F_{d_1,d_2}$ is the distribution of $\frac{U_1/d_1}{U_2/d_2}$, where $U_1$ and $U_2$ are independent $\chi^2$ random variables with $d_1$ and $d_2$ degrees of freedom respectively.
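
Before going through the argument, the claim can be checked empirically. The sketch below simulates many data sets under the null hypothesis and compares the resulting $F$ values with the reference distribution from `scipy.stats.f`. The design matrix, coefficients, and sample sizes are arbitrary choices for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n, p, q, sigma = 30, 5, 2, 2.0
X_full = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
X_red = X_full[:, : q + 1]
beta_red = rng.normal(size=q + 1)

def residual_ss(X, Y):
    """Column-wise RSS when each column of Y is regressed on X."""
    coef, *_ = np.linalg.lstsq(X, Y, rcond=None)
    return np.sum((Y - X @ coef) ** 2, axis=0)

n_sims = 20000
# Each column of Y is one simulated data set generated under the null.
Y = (X_red @ beta_red)[:, None] + rng.normal(scale=sigma, size=(n, n_sims))
rss_0 = residual_ss(X_red, Y)
rss_a = residual_ss(X_full, Y)
F_sims = ((rss_0 - rss_a) / (p - q)) / (rss_a / (n - p - 1))

# Compare the simulated values with the F_{p-q, n-p-1} distribution.
ks = stats.kstest(F_sims, stats.f(p - q, n - p - 1).cdf)
rejection_rate = np.mean(F_sims > stats.f.ppf(0.95, p - q, n - p - 1))
print(ks.statistic, rejection_rate)
```

If the claim holds, the Kolmogorov-Smirnov statistic should be small and the rejection rate at the 95% quantile should be close to 0.05.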

Our main tool will be the following definition/theorem pair (which is a simplified form of Theorem 3.8.2 from the textbook "The Coordinate-Free Approach to Linear Models" by Wichura):

**Definition** We say that a random vector $X$ in an inner product space $V$ is **normally distributed with variance $\sigma^2$** if, for any unit vector $v \in V$, the real random variable $\langle v, X \rangle \sim \mathcal{N}(0,\sigma^2)$.  In this case we write $X \sim \mathcal{N}(0,\sigma^2 I_{V})$.

**Theorem** (Geometric Form of Cochran's Theorem): If $X \sim \mathcal{N}(0,\sigma^2 I_{V})$ is normally distributed on an inner product space $V$, then for any subspace $U \subset V$ we have that $\textrm{Proj}_U(X)$ and $\textrm{Proj}_{U^\perp}(X)$ are independent and are both normally distributed with the same variance $\sigma^2$.


It follows immediately from the definition that if $U$ is a subspace of $V$ then $\textrm{Proj}_U(X)$ is normally distributed on $U$ with the same variance since for any unit vector $u \in U$, $\langle u, \textrm{Proj}_U(X)\rangle = \langle u, X \rangle \sim \mathcal{N}(0,\sigma^2)$.

The more difficult part of the theorem is proving independence.  Building up to this occupies most of Chapter 3 of the aforementioned textbook, so we will not prove it here.  A good intuition is that the multivariate normal distribution on $\mathbb{R}^n$ with covariance matrix $\sigma^2 I_n$ has p.d.f.

$$
f(x) = \frac{1}{(2\pi \sigma^2)^{n/2}} \exp\left( -\frac{1}{2\sigma^2} |x|^2\right) = \prod_{i=1}^n \frac{1}{\sqrt{2\pi \sigma^2}} \exp\left(-\frac{1}{2\sigma^2} x_i^2\right)
$$

This shows that the coordinates of $X$ with respect to the standard basis are independent and normal.  Since the distribution is invariant under the action of $\mathcal{O}(n)$ (the p.d.f. only depends on $|x|$), we can see that the coordinates with respect to any orthonormal basis would also be independent and normal with the same variance.  The result above just generalizes this to an inner product space $V$ instead of $\mathbb{R}^n$.
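
This invariance is easy to see numerically. The following sketch draws samples from $\mathcal{N}(0, \sigma^2 I_n)$, expresses them in a randomly chosen orthonormal basis (obtained here from a QR factorization, which is just one convenient way to get such a basis), and estimates the covariance of the new coordinates. It only checks that the coordinates are uncorrelated with common variance; for jointly normal coordinates that is equivalent to independence.

```python
import numpy as np

rng = np.random.default_rng(2)
n, sigma, n_draws = 4, 1.5, 200_000

# Rows of X are independent draws from N(0, sigma^2 I_n).
X = rng.normal(scale=sigma, size=(n_draws, n))

# Columns of Q form an orthonormal basis of R^n.
Q, _ = np.linalg.qr(rng.normal(size=(n, n)))

# Coordinates of each draw with respect to the columns of Q.
coords = X @ Q

# Sample covariance of the new coordinates; expected to be near sigma^2 * I_n.
cov = np.cov(coords, rowvar=False)
print(np.round(cov, 2))
```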

With this we can make sense of the $F$-test as follows.

![F-test picture](math_hour_assets/F_test_picture.png)

In the picture the ambient space is the space of observations $\mathbb{R}^n$.

* $y_{\textrm{True}} = X \beta_{\textrm{True}} \in \textrm{Reduced}$ (under the null hypothesis).
* $y_{\textrm{obs}} - y_{\textrm{True}}$ is drawn from $\mathcal{N}(0, \sigma^2 I_n)$.
* $y_{\textrm{obs}} - \hat{y}_{\textrm{Red}} \in \textrm{Reduced}^\perp$ is the vector of residuals from the fit reduced model.
* $y_{\textrm{obs}} - \hat{y}_{\textrm{Full}} \in \textrm{Full}^\perp$ is the vector of residuals from the fit full model.
* $\hat{y}_{\textrm{Full}} - \hat{y}_{\textrm{Red}} \in \textrm{Full} \cap \textrm{Reduced}^\perp$ is the difference between the two fitted models.


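Under the null hypothesis $y_{\textrm{True}} \in \textrm{Reduced}$, so $y_{\textrm{True}}$ is sent to zero by the projections onto $\textrm{Full} \cap \textrm{Reduced}^\perp$ and onto $\textrm{Full}^\perp$.  Hence $\hat{y}_{\textrm{Full}} - \hat{y}_{\textrm{Red}}$ is the projection of the noise vector $y_{\textrm{obs}} - y_{\textrm{True}}$ onto $\textrm{Full} \cap \textrm{Reduced}^\perp$, and $y_{\textrm{obs}} - \hat{y}_{\textrm{Full}}$ is its projection onto $\textrm{Full}^\perp$.  By geometric Cochran's theorem, $\hat{y}_{\textrm{Full}} - \hat{y}_{\textrm{Red}}$ is normally distributed on $\textrm{Full} \cap \textrm{Reduced}^\perp$, which has dimension $p - q$, and $y_{\textrm{obs}} - \hat{y}_{\textrm{Full}}$ is normally distributed on $\textrm{Full}^\perp$, which has dimension $n - p - 1$ (assuming the design matrix has full column rank).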

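Thus $|\hat{y}_{\textrm{Full}} - \hat{y}_{\textrm{Red}}|^2 \sim \sigma^2\chi^2_{p-q}$ and $|y_{\textrm{obs}} - \hat{y}_{\textrm{Full}}|^2 \sim \sigma^2 \chi^2_{n-p-1}$.  These are independent: the first depends only on the projection of $y_{\textrm{obs}} - y_{\textrm{True}}$ onto $\textrm{Full}$, the second only on its projection onto $\textrm{Full}^\perp$, and the theorem says these two projections are independent.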
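
These distributional claims can be sanity-checked by simulation. The sketch below builds the orthogonal projectors onto the two model subspaces (via QR, one of several possible constructions), simulates data under the null, and compares the average squared lengths of the two legs with $\sigma^2$ times their degrees of freedom. The sample correlation is only a rough proxy for independence, and all numerical settings are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)
n, p, q, sigma, n_sims = 25, 4, 1, 2.0, 50_000
X_full = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
X_red = X_full[:, : q + 1]

def projector(X):
    """Orthogonal projection matrix onto the column space of X."""
    Q, _ = np.linalg.qr(X)
    return Q @ Q.T

P_full, P_red = projector(X_full), projector(X_red)

# Simulate under the null: y_true lies in the reduced subspace.
y_true = X_red @ rng.normal(size=q + 1)
Y = y_true[:, None] + rng.normal(scale=sigma, size=(n, n_sims))

# Squared lengths of the two legs, one value per simulated data set.
leg_in_full = np.sum(((P_full - P_red) @ Y) ** 2, axis=0)  # |yhat_Full - yhat_Red|^2
leg_perp = np.sum((Y - P_full @ Y) ** 2, axis=0)           # |y_obs - yhat_Full|^2

print(leg_in_full.mean() / sigma**2, "vs", p - q)
print(leg_perp.mean() / sigma**2, "vs", n - p - 1)
print(np.corrcoef(leg_in_full, leg_perp)[0, 1])
```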

The unknown factor $\sigma^2$ cancels in the ratio, so the test statistic

$$
\frac{|\hat{y}_{\textrm{Full}} - \hat{y}_{\textrm{Red}}|^2 / (p-q)}{|y_{\textrm{obs}} - \hat{y}_{\textrm{Full}}|^2 / (n-p-1)} \sim F_{p-q, n-p-1}
$$

Notice that this test statistic is just $\cot^2(\theta)$ (scaled by the constant $\frac{n-p-1}{p-q}$), where $\theta$ is the angle between $y_{\textrm{obs}} - \hat{y}_{\textrm{Red}}$ and the subspace $\textrm{Full}$.  This is a reasonable test statistic: when the angle $\theta$ is small, the observations are "unreasonably close" to the subspace $\textrm{Full}$ for a true model that belongs to $\textrm{Reduced}$.  When $\theta$ is small, $\cot^2(\theta)$ is large, so large values of $\cot^2(\theta)$ are "suspicious" if the data generating process was the reduced rather than the full model.
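
To make the angle interpretation concrete, here is a small sketch that builds the right triangle for one simulated data set, computes $\theta$ from the side lengths, and checks that the scaled $\cot^2(\theta)$ agrees with the statistic computed from residual sums of squares. The variable names (`hyp`, `adj`, `opp`) and the simulated data are assumptions for this example.

```python
import numpy as np

rng = np.random.default_rng(4)
n, p, q = 40, 3, 1
X_full = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
X_red = X_full[:, : q + 1]
y_obs = X_red @ np.array([1.0, -2.0]) + rng.normal(size=n)

def fitted(X, y):
    """Least-squares fitted values, i.e. the projection of y onto col(X)."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return X @ coef

y_hat_full, y_hat_red = fitted(X_full, y_obs), fitted(X_red, y_obs)

hyp = y_obs - y_hat_red        # hypotenuse, lies in Reduced^perp
adj = y_hat_full - y_hat_red   # leg inside Full (projection of hyp onto Full)
opp = y_obs - y_hat_full       # leg perpendicular to Full

# Angle between the hypotenuse and the subspace Full.
theta = np.arccos(np.linalg.norm(adj) / np.linalg.norm(hyp))

F_from_angle = (n - p - 1) / (p - q) / np.tan(theta) ** 2
F_from_rss = ((hyp @ hyp - opp @ opp) / (p - q)) / ((opp @ opp) / (n - p - 1))
print(theta, F_from_angle, F_from_rss)
```

The two values agree up to floating-point error because `adj` and `opp` are orthogonal, which is the same fact that drives the Pythagorean rewriting of the numerator.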

Rewriting the numerator by using the Pythagorean theorem gives

$$
\frac{\left( |y_{\textrm{obs}} - \hat{y}_{\textrm{Red}}|^2 - |y_{\textrm{obs}} - \hat{y}_{\textrm{Full}}|^2 \right)/ (p-q)}{|y_{\textrm{obs}} - \hat{y}_{\textrm{Full}}|^2 / (n-p-1)} \sim F_{p-q, n-p-1}
$$

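This is the result we were seeking: since $|y_{\textrm{obs}} - \hat{y}_{\textrm{Red}}|^2 = \textrm{RSS}_0$ and $|y_{\textrm{obs}} - \hat{y}_{\textrm{Full}}|^2 = \textrm{RSS}_a$, the left-hand side is exactly the $F$-statistic defined at the start.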

The key thing making all of this work is that, while we do not know $y_{\textrm{True}}$, we *do* know that $y_{\textrm{obs}} - y_{\textrm{True}} \sim \mathcal{N}(0, \sigma^2 I_n)$, and hence that its projection onto $\textrm{Reduced}^\perp$ is normally distributed on this subspace.  Since we actually have access to each of the vectors $y_{\textrm{obs}}$, $\hat{y}_{\textrm{Red}}$, and $\hat{y}_{\textrm{Full}}$, we can compute with them.  By the geometric form of Cochran's theorem, the two legs of this triangle are independent and normally distributed in the subspaces where they live.  This immediately leads to their squared lengths being $\chi^2$ distributed (up to the common factor $\sigma^2$) and the ratio of these squared lengths, each divided by its degrees of freedom, being $F$ distributed with the appropriate number of degrees of freedom.