# Chapter 5 - Linear Model Theory

Joshua French

To open this information in an interactive Colab notebook, click the Open in Colab graphic below.

<a href="https://colab.research.google.com/github/jfrench/LinearRegression/blob/master/notebooks/05-linear-model-theory-notebook.ipynb"> <img src="https://colab.research.google.com/assets/colab-badge.svg"> </a>

------------------------------------------------------------------------

# Basic theoretical results for linear models

In this chapter we discuss many basic theoretical results for linear models.

We assume the responses can be modeled as

$$
Y_i=\beta_0+\beta_1 x_{i,1} +\ldots + \beta_{p-1}x_{i,p-1}+\epsilon_i,\quad i=1,2,\ldots,n,
$$

or using matrix formulation, as

$$
\mathbf{y} = \mathbf{X}\boldsymbol{\beta}+\boldsymbol{\epsilon}.
$$

# Standard assumptions

We assume that the components of our linear model have the characteristics previously described in Chapter 3. We also need to make several specific assumptions about the errors.

**Error Assumption 1**

The mean of the errors is zero conditional on the value of the regressors.

This means that

$$E(\epsilon_i \mid \mathbb{X} = \mathbf{x}_i)=0, i=1,2,\ldots,n,$$

or using matrix notation,

$$
E(\boldsymbol{\epsilon}\mid \mathbf{X}) = \mathbf{0}_{n\times 1},
$$

where “$\mid \mathbf{X}$” is notation meaning “conditional on knowing the regressor values for all observations”.

**Error Assumption 2**

The errors have constant variances and are uncorrelated, conditional on knowing the regressors, i.e.,

$$
\mathrm{var}(\epsilon_i\mid \mathbb{X}=\mathbf{x}_i) = \sigma^2, \quad i=1,2,\ldots,n,
$$

and

$$
\mathrm{cov}(\epsilon_i, \epsilon_j\mid \mathbf{X}) = 0, \quad i,j=1,2,\ldots,n,\quad i\neq j.
$$

In matrix notation, this is stated as

$$
\mathrm{var}(\boldsymbol{\epsilon} \mid {\mathbf{X}})=\sigma^2\mathbf{I}_{n\times n}.
$$

**Error Assumption 3**

The errors are identically distributed. This may be written as

$$
\epsilon_i \sim F, i=1,2,\ldots,n,
$$

where $F$ is some arbitrary distribution.

**Error Assumption 4**

In practice, it is common to assume the errors have a normal (Gaussian) distribution.

**Assumptions 1-4 combined**

Two uncorrelated random variables that are jointly normal are also independent (but uncorrelated variables are not generally independent for other distributions).

Putting assumptions 1-4 together, we have that

$$
\epsilon_1,\epsilon_2,\ldots,\epsilon_n \mid \mathbf{X}\stackrel{i.i.d.}{\sim} \mathsf{N}(0, \sigma^2),
$$

or using matrix notation,

$$
\boldsymbol{\epsilon}\mid \mathbf{X}\sim \mathsf{N}(\mathbf{0}_{n\times 1},\sigma^2 \mathbf{I}_{n\times n}).
$$
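As a minimal sketch of what these assumptions look like for simulated data, the NumPy code below draws errors from $\mathsf{N}(0,\sigma^2)$ and builds $\mathbf{y}=\mathbf{X}\boldsymbol{\beta}+\boldsymbol{\epsilon}$. The sample size, coefficients, regressors, and $\sigma$ are hypothetical values chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)

n, sigma = 200, 1.5                     # hypothetical sample size and error sd
beta = np.array([1.0, 2.0, -0.5])       # hypothetical (beta_0, beta_1, beta_2)

# Design matrix: a column of ones plus two hypothetical regressors.
X = np.column_stack([np.ones(n), rng.uniform(0, 10, n), rng.normal(size=n)])

# Errors consistent with Assumptions 1-4: i.i.d. N(0, sigma^2).
eps = rng.normal(loc=0.0, scale=sigma, size=n)

# Responses generated by the linear model y = X beta + eps.
y = X @ beta + eps
```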

In summary, our error assumptions are:

1.  $E(\epsilon_i \mid \mathbb{X}=\mathbf{x}_i)=0$ for $i=1,2,\ldots,n$.
2.  $\mathrm{var}(\epsilon_i\mid \mathbb{X}=\mathbf{x}_i)=\sigma^2$ for $i=1,2,\ldots,n$.
3.  $\mathrm{cov}(\epsilon_i,\epsilon_j\mid \mathbf{X})=0$ for $i\neq j$ with $i,j=1,2,\ldots,n$.
4.  $\epsilon_i$ has a normal distribution for $i=1,2,\ldots,n$.

**Summary of results**

------------------------------------------------------------------------

Combining these results with our linear model, we have:

1.  $\mathbf{y}\mid \mathbf{X}\sim \mathsf{N}(\mathbf{X}\boldsymbol{\beta}, \sigma^2 \mathbf{I}_{n\times n})$.
2.  $\hat{\boldsymbol{\beta}}\mid \mathbf{X}\sim \mathsf{N}(\boldsymbol{\beta}, \sigma^2(\mathbf{X}^T\mathbf{X})^{-1})$.
3.  $\hat{\boldsymbol{\epsilon}}\mid \mathbf{X}\sim \mathsf{N}(\mathbf{0}_{n\times 1}, \sigma^2 (\mathbf{I}_{n\times n} - \mathbf{H}))$, where $\mathbf{H}=\mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T$.
4.  $\hat{\boldsymbol{\beta}}$ has the minimum variance among all unbiased estimators of $\boldsymbol{\beta}$ with the additional assumptions that the model is correct and $\mathbf{X}$ is full-rank.

We prove these results in the sections below. To simplify the derivations below, we let $\mathbf{I}=\mathbf{I}_{n\times n}$.
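Before turning to the proofs, a quick simulation can make result 2 plausible. The sketch below holds a hypothetical design matrix fixed, simulates many response vectors under the error assumptions, and compares the empirical mean and covariance of $\hat{\boldsymbol{\beta}}$ with $\boldsymbol{\beta}$ and $\sigma^2(\mathbf{X}^T\mathbf{X})^{-1}$. All numeric settings are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(123)
n, sigma, reps = 50, 2.0, 20000          # hypothetical settings
beta = np.array([1.0, -2.0])             # hypothetical true coefficients
X = np.column_stack([np.ones(n), rng.uniform(0, 5, n)])  # held fixed

# Each row of Y is one simulated response vector y = X beta + eps.
eps = rng.normal(scale=sigma, size=(reps, n))
Y = X @ beta + eps

# OLS estimates for every replicate: each row of B is one beta_hat.
A = np.linalg.solve(X.T @ X, X.T)        # (X^T X)^{-1} X^T, shape (2, n)
B = Y @ A.T

emp_mean = B.mean(axis=0)                # should be close to beta
emp_cov = np.cov(B, rowvar=False)        # should be close to theory_cov
theory_cov = sigma**2 * np.linalg.inv(X.T @ X)
```

Agreement between `emp_cov` and `theory_cov` is only approximate and improves as `reps` grows.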

**Results for $\mathbf{y}$**

------------------------------------------------------------------------

For our given linear model and under the assumptions summarized previously, our response variable has mean

$$
E(\mathbf{y}\mid \mathbf{X})=\mathbf{X}\boldsymbol{\beta}.
$$

*Proof:*
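A sketch of one way to complete this argument, using $\mathbf{y}=\mathbf{X}\boldsymbol{\beta}+\boldsymbol{\epsilon}$, Error Assumption 1, and the fact that $\mathbf{X}\boldsymbol{\beta}$ is constant once we condition on $\mathbf{X}$:

$$
\begin{aligned}
E(\mathbf{y}\mid \mathbf{X}) &= E(\mathbf{X}\boldsymbol{\beta}+\boldsymbol{\epsilon}\mid \mathbf{X}) \\
&= \mathbf{X}\boldsymbol{\beta} + E(\boldsymbol{\epsilon}\mid \mathbf{X}) \\
&= \mathbf{X}\boldsymbol{\beta} + \mathbf{0}_{n\times 1} \\
&= \mathbf{X}\boldsymbol{\beta}.
\end{aligned}
$$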

</br>  
</br>  
</br>  
</br>  
</br>  
</br>  
</br>  
</br>  
</br>

$\vphantom{blank}$

For the variance of the response:

$$
\mathrm{var}(\mathbf{y}\mid \mathbf{X})=\sigma^2 \mathbf{I}.
$$

*Proof:*
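One possible argument, sketched under Error Assumption 2: adding the conditionally constant vector $\mathbf{X}\boldsymbol{\beta}$ to $\boldsymbol{\epsilon}$ does not change the variance, so

$$
\begin{aligned}
\mathrm{var}(\mathbf{y}\mid \mathbf{X}) &= \mathrm{var}(\mathbf{X}\boldsymbol{\beta}+\boldsymbol{\epsilon}\mid \mathbf{X}) \\
&= \mathrm{var}(\boldsymbol{\epsilon}\mid \mathbf{X}) \\
&= \sigma^2 \mathbf{I}.
\end{aligned}
$$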

</br>  
</br>  
</br>  
</br>  
</br>  
</br>  
</br>  
</br>  
</br>

$\vphantom{blank}$

The response variable has the following distribution:

$$
\mathbf{y}\mid \mathbf{X}\sim \mathsf{N}(\mathbf{X}\boldsymbol{\beta}, \sigma^2 \mathbf{I}).
$$

*Proof:*
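A brief sketch, relying on the standard property that an affine transformation of a multivariate normal vector is again multivariate normal: if $\mathbf{z}\sim \mathsf{N}(\boldsymbol{\mu},\boldsymbol{\Sigma})$, then for a constant vector $\mathbf{a}$ and constant matrix $\mathbf{B}$,

$$
\mathbf{a}+\mathbf{B}\mathbf{z}\sim \mathsf{N}(\mathbf{a}+\mathbf{B}\boldsymbol{\mu}, \mathbf{B}\boldsymbol{\Sigma}\mathbf{B}^T).
$$

Conditional on $\mathbf{X}$, take $\mathbf{z}=\boldsymbol{\epsilon}$, $\mathbf{a}=\mathbf{X}\boldsymbol{\beta}$, and $\mathbf{B}=\mathbf{I}$. Since $\boldsymbol{\epsilon}\mid \mathbf{X}\sim \mathsf{N}(\mathbf{0}_{n\times 1},\sigma^2\mathbf{I})$, it follows that $\mathbf{y}=\mathbf{X}\boldsymbol{\beta}+\boldsymbol{\epsilon}$ is conditionally normal with mean $\mathbf{X}\boldsymbol{\beta}$ and variance $\sigma^2\mathbf{I}$.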

</br>  
</br>  
</br>  
</br>  
</br>  
</br>  
</br>  
</br>  
</br>

$\vphantom{blank}$

**Results for $\hat{\boldsymbol{\beta}}$**

------------------------------------------------------------------------

The OLS estimator for $\boldsymbol{\beta}$ is

$$
\hat{\boldsymbol{\beta}}=(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}.
$$

This is an unbiased estimator for $\boldsymbol{\beta}$, i.e.,

$$
E(\hat{\boldsymbol{\beta}}\mid \mathbf{X})=\boldsymbol{\beta}.
$$

*Proof:*
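A sketch of the argument, using $E(\mathbf{y}\mid \mathbf{X})=\mathbf{X}\boldsymbol{\beta}$ and the fact that $(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T$ is constant once we condition on $\mathbf{X}$:

$$
\begin{aligned}
E(\hat{\boldsymbol{\beta}}\mid \mathbf{X}) &= E((\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}\mid \mathbf{X}) \\
&= (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T E(\mathbf{y}\mid \mathbf{X}) \\
&= (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{X}\boldsymbol{\beta} \\
&= \boldsymbol{\beta}.
\end{aligned}
$$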

</br>  
</br>  
</br>  
</br>  
</br>  
</br>  
</br>  
</br>  
</br>

$\vphantom{blah}$

The OLS estimator $\hat{\boldsymbol{\beta}}$ has variance

$$
\mathrm{var}(\hat{\boldsymbol{\beta}}\mid \mathbf{X})=\sigma^2(\mathbf{X}^T\mathbf{X})^{-1}.
$$

*Proof:*
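One way to derive this, assuming the standard property $\mathrm{var}(\mathbf{A}\mathbf{y}\mid \mathbf{X})=\mathbf{A}\,\mathrm{var}(\mathbf{y}\mid \mathbf{X})\,\mathbf{A}^T$ for a conditionally constant matrix $\mathbf{A}$, together with the symmetry of $(\mathbf{X}^T\mathbf{X})^{-1}$:

$$
\begin{aligned}
\mathrm{var}(\hat{\boldsymbol{\beta}}\mid \mathbf{X}) &= (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\,\mathrm{var}(\mathbf{y}\mid \mathbf{X})\,[(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T]^T \\
&= (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T(\sigma^2\mathbf{I})\mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1} \\
&= \sigma^2(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1} \\
&= \sigma^2(\mathbf{X}^T\mathbf{X})^{-1}.
\end{aligned}
$$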

</br>  
</br>  
</br>  
</br>  
</br>  
</br>  
</br>  
</br>  
</br>

$\vphantom{blah}$

The OLS estimator $\hat{\boldsymbol{\beta}}$ has the following distribution:

$$
\hat{\boldsymbol{\beta}}\mid \mathbf{X}\sim \mathsf{N}(\boldsymbol{\beta}, \sigma^2(\mathbf{X}^T\mathbf{X})^{-1}).
$$

*Proof:*
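A short sketch: conditional on $\mathbf{X}$, the estimator $\hat{\boldsymbol{\beta}}=(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}$ is a linear function of $\mathbf{y}$, and $\mathbf{y}\mid \mathbf{X}$ is multivariate normal. A linear function of a multivariate normal vector is also multivariate normal, so $\hat{\boldsymbol{\beta}}\mid \mathbf{X}$ is normal with the mean $\boldsymbol{\beta}$ and variance $\sigma^2(\mathbf{X}^T\mathbf{X})^{-1}$ stated above.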

</br>  
</br>  
</br>  
</br>  
</br>  
</br>  
</br>  
</br>  
</br>

**Results for the residuals**

------------------------------------------------------------------------

The residual vector can be expressed in various equivalent ways, such as

$$
\begin{aligned}
\hat{\boldsymbol{\epsilon}} &= \mathbf{y}-\hat{\mathbf{y}} \\
&= \mathbf{y}-\mathbf{X}\hat{\boldsymbol{\beta}}.
\end{aligned}
$$

The **hat** matrix is denoted as:

$$
\mathbf{H}=\mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T.
$$

Thus, using the substitution $\hat{\boldsymbol{\beta}}=(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}$ and the definition for $\mathbf{H}$, we see that:

$$
\begin{aligned}
\hat{\boldsymbol{\epsilon}} &= \mathbf{y}-\mathbf{X}\hat{\boldsymbol{\beta}} \\ 
&= \mathbf{y} - \mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y} \\
&= \mathbf{y} - \mathbf{H}\mathbf{y} \\
&= (\mathbf{I}-\mathbf{H})\mathbf{y}.
\end{aligned}
$$
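The equivalence of these expressions can be checked numerically. The sketch below uses a hypothetical design matrix and response vector (simulated values, not data from the text) and compares $\mathbf{y}-\mathbf{X}\hat{\boldsymbol{\beta}}$ with $(\mathbf{I}-\mathbf{H})\mathbf{y}$.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 30
# Hypothetical full-rank design matrix and response vector.
X = np.column_stack([np.ones(n), rng.normal(size=n), rng.normal(size=n)])
y = rng.normal(size=n)

# Hat matrix H = X (X^T X)^{-1} X^T, computed with a linear solve.
H = X @ np.linalg.solve(X.T @ X, X.T)
I_n = np.eye(n)

# OLS estimate from the normal equations (X^T X) b = X^T y.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

resid_direct = y - X @ beta_hat      # y - X beta_hat
resid_hat = (I_n - H) @ y            # (I - H) y

print(np.allclose(resid_direct, resid_hat))
```

Forming the $n\times n$ matrix $\mathbf{H}$ explicitly is convenient for illustration but is usually avoided for large $n$.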

The hat matrix is an important theoretical matrix, as it projects $\mathbf{y}$ onto the space spanned by the columns of $\mathbf{X}$.

The hat matrix $\mathbf{H}$ is symmetric and idempotent.

*Proof:*
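A sketch of both properties, using $(\mathbf{A}\mathbf{B}\mathbf{C})^T=\mathbf{C}^T\mathbf{B}^T\mathbf{A}^T$ and the symmetry of $(\mathbf{X}^T\mathbf{X})^{-1}$. For symmetry,

$$
\begin{aligned}
\mathbf{H}^T &= [\mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T]^T \\
&= \mathbf{X}[(\mathbf{X}^T\mathbf{X})^{-1}]^T\mathbf{X}^T \\
&= \mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T \\
&= \mathbf{H}.
\end{aligned}
$$

For idempotency,

$$
\begin{aligned}
\mathbf{H}\mathbf{H} &= \mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T \\
&= \mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T \\
&= \mathbf{H}.
\end{aligned}
$$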

</br>  
</br>  
</br>  
</br>  
</br>  
</br>  
</br>  
</br>  
</br>

The matrix $\mathbf{I} - \mathbf{H}$ is symmetric and idempotent.

*Proof:*
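One possible argument, relying on the symmetry and idempotency of $\mathbf{H}$:

$$
\begin{aligned}
(\mathbf{I}-\mathbf{H})^T &= \mathbf{I}^T-\mathbf{H}^T = \mathbf{I}-\mathbf{H}, \\
(\mathbf{I}-\mathbf{H})(\mathbf{I}-\mathbf{H}) &= \mathbf{I}-\mathbf{H}-\mathbf{H}+\mathbf{H}\mathbf{H} \\
&= \mathbf{I}-2\mathbf{H}+\mathbf{H} \\
&= \mathbf{I}-\mathbf{H}.
\end{aligned}
$$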

</br>  
</br>  
</br>  
</br>  
</br>  
</br>  
</br>  
</br>  
</br>

Under the assumptions we discussed previously, the residuals have mean

$$
E(\hat{\boldsymbol{\epsilon}}\mid \mathbf{X})=\mathbf{0}_{n\times 1}.
$$

*Proof:*
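A sketch of the argument, using $\hat{\boldsymbol{\epsilon}}=(\mathbf{I}-\mathbf{H})\mathbf{y}$ and the identity $\mathbf{H}\mathbf{X}=\mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{X}=\mathbf{X}$:

$$
\begin{aligned}
E(\hat{\boldsymbol{\epsilon}}\mid \mathbf{X}) &= (\mathbf{I}-\mathbf{H})E(\mathbf{y}\mid \mathbf{X}) \\
&= (\mathbf{I}-\mathbf{H})\mathbf{X}\boldsymbol{\beta} \\
&= \mathbf{X}\boldsymbol{\beta}-\mathbf{H}\mathbf{X}\boldsymbol{\beta} \\
&= \mathbf{X}\boldsymbol{\beta}-\mathbf{X}\boldsymbol{\beta} \\
&= \mathbf{0}_{n\times 1}.
\end{aligned}
$$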

</br>  
</br>  
</br>  
</br>  
</br>  
</br>  
</br>  
</br>  
</br>

$\vphantom{blank}$

The residuals have variance

$$
\mathrm{var}(\hat{\boldsymbol{\epsilon}}\mid \mathbf{X})=\sigma^2 (\mathbf{I} - \mathbf{H}).
$$

*Proof:*
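One way to see this, using $\hat{\boldsymbol{\epsilon}}=(\mathbf{I}-\mathbf{H})\mathbf{y}$ and the fact that $\mathbf{I}-\mathbf{H}$ is symmetric and idempotent:

$$
\begin{aligned}
\mathrm{var}(\hat{\boldsymbol{\epsilon}}\mid \mathbf{X}) &= (\mathbf{I}-\mathbf{H})\,\mathrm{var}(\mathbf{y}\mid \mathbf{X})\,(\mathbf{I}-\mathbf{H})^T \\
&= (\mathbf{I}-\mathbf{H})(\sigma^2\mathbf{I})(\mathbf{I}-\mathbf{H}) \\
&= \sigma^2(\mathbf{I}-\mathbf{H})(\mathbf{I}-\mathbf{H}) \\
&= \sigma^2(\mathbf{I}-\mathbf{H}).
\end{aligned}
$$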

</br>  
</br>  
</br>  
</br>  
</br>  
</br>  
</br>  
</br>  
</br>

The residuals have the following distribution:

$$
\hat{\boldsymbol{\epsilon}}\mid \mathbf{X}\sim \mathsf{N}(\mathbf{0}_{n\times 1}, \sigma^2 (\mathbf{I} - \mathbf{H})).
$$

*Proof:*
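A short sketch: conditional on $\mathbf{X}$, the residual vector $\hat{\boldsymbol{\epsilon}}=(\mathbf{I}-\mathbf{H})\mathbf{y}$ is a linear function of the multivariate normal vector $\mathbf{y}$, so it is also multivariate normal, with the mean and variance derived above. When $\mathbf{X}$ has full column rank $p$, the matrix $\mathbf{I}-\mathbf{H}$ has rank $n-p$, so this is a singular normal distribution.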

</br>  
</br>  
</br>  
</br>  
</br>  
</br>  
</br>  
</br>  
</br>

$\vphantom{blank}$

The RSS can be represented as

$$
RSS=\mathbf{y}^T(\mathbf{I}-\mathbf{H})\mathbf{y}.
$$

*Proof:*
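A sketch, assuming the RSS is defined as the sum of the squared residuals, $RSS=\sum_{i=1}^n \hat{\epsilon}_i^2=\hat{\boldsymbol{\epsilon}}^T\hat{\boldsymbol{\epsilon}}$, and using the symmetry and idempotency of $\mathbf{I}-\mathbf{H}$:

$$
\begin{aligned}
RSS &= \hat{\boldsymbol{\epsilon}}^T\hat{\boldsymbol{\epsilon}} \\
&= [(\mathbf{I}-\mathbf{H})\mathbf{y}]^T(\mathbf{I}-\mathbf{H})\mathbf{y} \\
&= \mathbf{y}^T(\mathbf{I}-\mathbf{H})^T(\mathbf{I}-\mathbf{H})\mathbf{y} \\
&= \mathbf{y}^T(\mathbf{I}-\mathbf{H})(\mathbf{I}-\mathbf{H})\mathbf{y} \\
&= \mathbf{y}^T(\mathbf{I}-\mathbf{H})\mathbf{y}.
\end{aligned}
$$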

</br>  
</br>  
</br>  
</br>  
</br>  
</br>  
</br>  
</br>  
</br>

# The Gauss-Markov Theorem

Suppose we will fit the regression model:

$$
\mathbf{y}=\mathbf{X}\boldsymbol{\beta} + \boldsymbol{\epsilon}.
$$

Assume that

1.  $E(\boldsymbol{\epsilon}\mid \mathbf{X}) = \mathbf{0}_{n\times 1}$.
2.  $\mathrm{var}(\boldsymbol{\epsilon}\mid \mathbf{X}) = \sigma^2 \mathbf{I}$, i.e., the errors have constant variance and are uncorrelated.
3.  $E(\mathbf{y}\mid \mathbf{X})=\mathbf{X}\boldsymbol{\beta}$
4.  $\mathbf{X}$ is a full-rank matrix.

Then the **Gauss-Markov theorem** states that the OLS estimator of $\boldsymbol{\beta}$,

$$
\hat{\boldsymbol{\beta}}=(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y},
$$

has the minimum variance among all unbiased estimators of $\boldsymbol{\beta}$ and this estimator is unique.

Some comments:

-   Assumption 3 guarantees that we have hypothesized the correct model, i.e., that we have included exactly the correct regressors in our model. Not only are we fitting a linear model to the data, but our hypothesized model is actually correct.
-   Assumption 4 ensures that the OLS estimator can be computed (otherwise, there is no unique solution).
-   The Gauss-Markov theorem only applies to unbiased estimators of $\boldsymbol{\beta}$. Biased estimators could have a smaller variance.
-   The Gauss-Markov theorem states that no unbiased estimator of $\boldsymbol{\beta}$ can have a smaller variance than $\hat{\boldsymbol{\beta}}$.
-   The OLS estimator uniquely has the minimum variance property, meaning that if $\tilde{\boldsymbol{\beta}}$ is another unbiased estimator of $\boldsymbol{\beta}$ and $\mathrm{var}(\tilde{\boldsymbol{\beta}}) = \mathrm{var}(\hat{\boldsymbol{\beta}})$, then in fact the two estimators are identical and $\tilde{\boldsymbol{\beta}}=\hat{\boldsymbol{\beta}}$.

We do not prove this theorem.
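Although no proof is given, the conclusion can be illustrated numerically for one competing estimator. The sketch below compares OLS with a weighted least squares estimator that uses arbitrary positive weights. This competitor is a hypothetical choice made for illustration and is itself a linear unbiased estimator. The code relies on the fact that a linear estimator $\mathbf{C}\mathbf{y}$ has variance $\sigma^2\mathbf{C}\mathbf{C}^T$ when $\mathrm{var}(\mathbf{y}\mid \mathbf{X})=\sigma^2\mathbf{I}$. It checks a single example and is not a substitute for the theorem.

```python
import numpy as np

rng = np.random.default_rng(2)
n, sigma2 = 40, 1.0                      # hypothetical settings
X = np.column_stack([np.ones(n), rng.normal(size=n)])

# OLS: beta_hat = A y with A = (X^T X)^{-1} X^T.
A = np.linalg.solve(X.T @ X, X.T)

# Competing estimator: beta_tilde = B y with B = (X^T W X)^{-1} X^T W,
# where W = diag(w) holds arbitrary (hypothetical) positive weights.
w = rng.uniform(0.5, 2.0, size=n)
XtW = X.T * w                            # X^T W via broadcasting
B = np.linalg.solve(XtW @ X, XtW)

# Both estimators are unbiased because A X = B X = I.
# Under var(y | X) = sigma^2 I, a linear estimator C y has variance sigma^2 C C^T.
var_ols = sigma2 * A @ A.T
var_alt = sigma2 * B @ B.T

# The theorem implies var_alt - var_ols is positive semidefinite,
# so its eigenvalues should be nonnegative up to rounding error.
eigvals = np.linalg.eigvalsh(var_alt - var_ols)
print(eigvals)
```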