# Linear Model Confidence Intervals

In this notebook we will establish:

* Confidence intervals for linear regression parameters $\beta_i$
* Confidence intervals for conditional outcome means (i.e. confidence intervals for $\mathbb{E}(Y | x)$)
* Prediction intervals for conditional outcomes (i.e. intervals that contain a new outcome $Y|x$ with a prescribed probability)

## Setting up notation

We presume that $Y|x = x^\top \beta + \epsilon$ where $x \in \mathbb{R}^{p+1}$ has been augmented with an initial $1$ and $\epsilon \sim \mathcal{N}(0, \sigma^2)$.

We are given an $n \times (p+1)$ design matrix of features $X$. We are thinking of these $n$ observed feature vectors $x_i$ as fixed.

The vector of outcomes $\vec{Y}_\textrm{obs}$ is then a random vector in $\mathbb{R}^n$ which is distributed  $\vec{Y}_\textrm{obs} \sim \mathcal{N}(X\beta, \sigma^2 I_n)$.

## Confidence intervals for linear regression parameters

Since $\vec{Y}_\textrm{obs}$ is a random vector, the vector of *fitted* linear regression parameters $\hat{\beta}$ is a random vector in $\mathbb{R}^{p+1}$.  Moreover, since $\hat{\beta} = (X^\top X)^{-1}X^\top \vec{Y}_{\textrm{obs}}$ is a linear transformation of the multivariate normal vector $\vec{Y}_\textrm{obs}$, we know that $\hat{\beta}$ is itself multivariate normally distributed.

The mean of $\hat{\beta}$ is $(X^\top X)^{-1}X^\top (X\beta) = \beta$.

The covariance matrix of $\hat{\beta}$ is

$$
\begin{align*}
\operatorname{Cov}(\hat{\beta})
&= \operatorname{Cov}((X^\top X)^{-1}X^\top \vec{Y}_\textrm{obs})\\
&= (X^\top X)^{-1}X^\top \operatorname{Cov} (\vec{Y}_\textrm{obs}) ((X^\top X)^{-1}X^\top )^\top\\
&= (X^\top X)^{-1}X^\top \sigma^2 I_n X (X^\top X)^{-1}\\
&= \sigma^2 (X^\top X)^{-1}
\end{align*}
$$

Thus we have that $\hat{\beta} \sim \mathcal{N}(\beta, \sigma^2 (X^\top X)^{-1})$.
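The distributional claim above can be checked numerically. The following is a minimal simulation sketch, not part of the derivation: the sample size, true parameters, noise level, and random seed are all hypothetical values chosen for illustration. It holds $X$ fixed, redraws $\vec{Y}_\textrm{obs}$ many times, and compares the empirical mean and covariance of $\hat{\beta}$ with $\beta$ and $\sigma^2 (X^\top X)^{-1}$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical problem sizes and parameters, chosen only for illustration.
n, p = 50, 2
sigma = 1.5
beta = np.array([1.0, -2.0, 0.5])  # intercept first

# Fixed design matrix, augmented with a leading column of ones.
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
XtX_inv = np.linalg.inv(X.T @ X)

# Covariance predicted by the derivation: sigma^2 (X^T X)^{-1}.
cov_theory = sigma**2 * XtX_inv

# Redraw Y_obs many times with X held fixed; each row of Y is one draw.
n_sims = 20_000
Y = X @ beta + sigma * rng.normal(size=(n_sims, n))

# Row k of beta_hats is the least-squares fit to row k of Y
# (the transpose of (X^T X)^{-1} X^T Y_k, using symmetry of (X^T X)^{-1}).
beta_hats = Y @ X @ XtX_inv

mean_empirical = beta_hats.mean(axis=0)
cov_empirical = np.cov(beta_hats, rowvar=False)

print("empirical mean of beta_hat:", mean_empirical)
print("max |cov_empirical - cov_theory|:", np.abs(cov_empirical - cov_theory).max())
```

With enough simulated draws, the empirical mean should sit close to `beta` and the largest entrywise covariance discrepancy should be small relative to the entries of `cov_theory`.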

Note that $\hat{\beta}_i = e_i^\top \hat{\beta} \sim \mathcal{N}(\beta_i, \sigma^2 e_i^\top (X^\top X)^{-1} e_i) = \mathcal{N}(\beta_i, \sigma^2 (X^\top X)^{-1}_{ii})$.

If we somehow knew the population variance $\sigma^2$ we could immediately use this to construct confidence intervals for $\beta_i$ using quantiles of the standard normal as follows:

Since $\hat{\beta}_i \sim \mathcal{N}(\beta_i, \sigma^2 (X^\top X)^{-1}_{ii})$, we have that $\frac{\hat{\beta}_i - \beta_i}{\sigma \sqrt{(X^\top X)^{-1}_{ii}}} \sim \mathcal{N}(0,1)$.  Thus, with probability $1-\alpha$,

$$
-z_{1-\alpha/2}< \frac{\hat{\beta}_i - \beta_i}{\sigma \sqrt{(X^\top X)^{-1}_{ii}}} < z_{1-\alpha/2}
$$

where $z_{1-\alpha/2}$ is the $(1-\alpha/2)$ quantile of the standard normal. Rearranging gives the $(1-\alpha)$ confidence interval $\hat{\beta}_i \pm \sigma \sqrt{(X^\top X)^{-1}_{ii}}\, z_{1-\alpha/2}$ for $\beta_i$.
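To see what the "$(1-\alpha)$" in this known-$\sigma$ interval means in practice, here is a small coverage simulation. It is only a sketch under stated assumptions: the design, the true $\beta$, the value of $\sigma$ (treated as known), and the seed are hypothetical. For each simulated draw of $\vec{Y}_\textrm{obs}$ it checks whether $\beta_i$ lies within $\hat{\beta}_i \pm \sigma \sqrt{(X^\top X)^{-1}_{ii}}\, z_{1-\alpha/2}$.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Hypothetical setup; sigma is treated as known in this section.
n, p = 40, 2
sigma = 2.0
beta = np.array([0.5, 1.0, -1.0])
alpha = 0.05

X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
XtX_inv = np.linalg.inv(X.T @ X)

# Standard normal quantile z_{1 - alpha/2}.
z = stats.norm.ppf(1 - alpha / 2)

# One half-width per coefficient: sigma * sqrt((X^T X)^{-1}_{ii}) * z.
half_width = sigma * np.sqrt(np.diag(XtX_inv)) * z

# Simulate many datasets with X fixed and refit each one.
n_sims = 20_000
Y = X @ beta + sigma * rng.normal(size=(n_sims, n))
beta_hats = Y @ X @ XtX_inv

# Fraction of simulations in which each interval contains the true beta_i.
covered = np.abs(beta_hats - beta) < half_width
coverage = covered.mean(axis=0)
print("empirical coverage per coefficient:", coverage)
```

Each entry of `coverage` should be close to $1 - \alpha = 0.95$, up to Monte Carlo error.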

Unfortunately we do not know the population variance $\sigma^2$!  The best we can reasonably expect is to approximate it with the unbiased estimate $\hat{\sigma}^2 = \frac{1}{n-p-1} \|\vec{Y}_{\textrm{obs}}- X\hat{\beta}\|^2$.

![Confidence Interval Geometry](math_hour_assets/conf_int_geometry.png)


By the geometric version of Cochran's theorem, $\hat{\sigma}^2 \sim \sigma^2 \chi^2_{n-p-1}/(n-p-1)$ and $X\hat{\beta} - X\beta$ (and thus $\hat{\beta} - \beta$) is independent of $\hat{\sigma}^2$.
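The two consequences of Cochran's theorem used here can be sanity-checked by simulation. The sketch below uses hypothetical sizes, parameters, and seed. It verifies that $(n-p-1)\hat{\sigma}^2/\sigma^2$ has mean and variance matching a $\chi^2_{n-p-1}$ variable, and that $\hat{\sigma}^2$ is empirically uncorrelated with each $\hat{\beta}_i$. Zero correlation is only a necessary condition for independence, so this is a plausibility check rather than a proof.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical setup for illustration.
n, p = 30, 3
sigma = 1.2
beta = np.array([2.0, -1.0, 0.0, 0.5])

X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
XtX_inv = np.linalg.inv(X.T @ X)
dof = n - p - 1

# Simulate many datasets with X fixed and refit each one.
n_sims = 20_000
Y = X @ beta + sigma * rng.normal(size=(n_sims, n))
beta_hats = Y @ X @ XtX_inv

# Residuals and the unbiased variance estimate for each simulated dataset.
residuals = Y - beta_hats @ X.T
sigma2_hats = (residuals**2).sum(axis=1) / dof

# dof * sigma2_hat / sigma^2 should behave like chi^2_dof:
# mean dof and variance 2 * dof.
scaled = dof * sigma2_hats / sigma**2
print("mean:", scaled.mean(), "expected:", dof)
print("variance:", scaled.var(), "expected:", 2 * dof)

# Sample correlation between sigma2_hat and each coefficient estimate.
corrs = np.array(
    [np.corrcoef(sigma2_hats, beta_hats[:, j])[0, 1] for j in range(p + 1)]
)
print("correlations with beta_hat components:", corrs)
```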

So we have that

$$
\frac{\hat{\beta}_i - \beta_i}{\hat{\sigma} \sqrt{(X^\top X)^{-1}_{ii}}} \sim \frac{\sigma \mathcal{N}(0,1)}{\sigma \sqrt{\chi^2_{n-p-1}/(n-p-1)}} \sim t_{n-p-1}
$$

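We thus obtain the $(1-\alpha)$ confidence interval $\hat{\beta}_i \pm \hat{\sigma} \sqrt{(X^\top X)^{-1}_{ii}}\, t_{1-\alpha/2, n-p-1}$ for $\beta_i$, where $t_{1-\alpha/2, n-p-1}$ is the $(1-\alpha/2)$ quantile of the $t$ distribution with $n-p-1$ degrees of freedom.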
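As a concrete illustration, the $t$-based interval for each coefficient can be computed directly from the formulas in this section. The helper name `coef_confidence_intervals` and the synthetic data are hypothetical, and the code assumes that `X` already contains the leading column of ones. It inverts $X^\top X$ explicitly to mirror the notation; for ill-conditioned designs a QR or least-squares solver would usually be a safer choice.

```python
import numpy as np
from scipy import stats


def coef_confidence_intervals(X, y, alpha=0.05):
    """Return (beta_hat, lower, upper) with a (1 - alpha) interval per coefficient.

    X is assumed to already include the leading column of ones.
    """
    n, k = X.shape                      # k = p + 1
    XtX_inv = np.linalg.inv(X.T @ X)
    beta_hat = XtX_inv @ X.T @ y

    # Unbiased variance estimate with n - p - 1 degrees of freedom.
    residuals = y - X @ beta_hat
    dof = n - k
    sigma_hat = np.sqrt(residuals @ residuals / dof)

    # t quantile t_{1 - alpha/2, n - p - 1}.
    t_crit = stats.t.ppf(1 - alpha / 2, df=dof)

    half_width = sigma_hat * np.sqrt(np.diag(XtX_inv)) * t_crit
    return beta_hat, beta_hat - half_width, beta_hat + half_width


# Synthetic data with hypothetical true parameters (1.0, 2.0, -0.5).
rng = np.random.default_rng(3)
n = 25
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([1.0, 2.0, -0.5]) + 0.8 * rng.normal(size=n)

beta_hat, lower, upper = coef_confidence_intervals(X, y)
for j in range(X.shape[1]):
    print(f"beta_{j}: {beta_hat[j]:.3f}  [{lower[j]:.3f}, {upper[j]:.3f}]")
```

Because $t_{1-\alpha/2, n-p-1} > z_{1-\alpha/2}$ for every finite number of degrees of freedom, these intervals use a larger multiplier than the known-$\sigma$ intervals, which accounts for the extra uncertainty from estimating $\sigma$.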

## Confidence Intervals for Conditional Outcome Means

Let $x_0 \in \mathbb{R}^{p+1}$ be a fixed new input, augmented with an initial $1$ as before.  We want a $(1-\alpha)$ confidence interval for $\mathbb{E}(Y|x_0) = x_0^\top \beta$.

We can re-use most of the same ingredients from above:

Since $\hat{\beta} \sim \mathcal{N}(\beta, \sigma^2 (X^\top X)^{-1})$, we know that $x_0^\top \hat{\beta} \sim \mathcal{N}(x_0^\top \beta, \sigma^2 x_0^\top(X^\top X)^{-1} x_0)$.

So $\frac{x_0^\top \hat{\beta} - x_0^\top \beta}{\sqrt{x_0^\top(X^\top X)^{-1} x_0}} \sim \sigma \mathcal{N}(0, 1)$.

On the other hand, we have already shown that $\hat{\sigma}^2 \sim \sigma^2 \chi^2_{n-p-1}/(n-p-1)$, independently of $\hat{\beta}$.

So we obtain that 

$$
\frac{x_0^\top \hat{\beta} - x_0^\top \beta}{\hat{\sigma} \sqrt{x_0^\top(X^\top X)^{-1} x_0}} \sim t_{n-p-1}
$$

which leads to the confidence interval $$x_0^\top \hat{\beta} \pm \hat{\sigma} \sqrt{x_0^\top(X^\top X)^{-1} x_0}\; t_{1-\alpha/2, n-p-1}$$
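A minimal sketch of this computation follows, using the standard error $\hat{\sigma} \sqrt{x_0^\top (X^\top X)^{-1} x_0}$ for the estimated mean. The function name `mean_confidence_interval`, the simple-regression data, and the two query points are hypothetical choices for illustration; `x0` is assumed to carry the leading $1$.

```python
import numpy as np
from scipy import stats


def mean_confidence_interval(X, y, x0, alpha=0.05):
    """Return (lower, upper) of a (1 - alpha) interval for E(Y | x0) = x0^T beta.

    X and x0 are assumed to already include the leading 1.
    """
    n, k = X.shape                      # k = p + 1
    XtX_inv = np.linalg.inv(X.T @ X)
    beta_hat = XtX_inv @ X.T @ y

    residuals = y - X @ beta_hat
    dof = n - k
    sigma_hat = np.sqrt(residuals @ residuals / dof)

    # Quadratic form x0^T (X^T X)^{-1} x0; note the square root below.
    quad_form = x0 @ XtX_inv @ x0
    t_crit = stats.t.ppf(1 - alpha / 2, df=dof)

    center = x0 @ beta_hat
    half_width = sigma_hat * np.sqrt(quad_form) * t_crit
    return center - half_width, center + half_width


# Hypothetical simple-regression data: y = 3 + 0.7 x + noise.
rng = np.random.default_rng(4)
n = 30
x = rng.uniform(0, 10, size=n)
X = np.column_stack([np.ones(n), x])
y = 3.0 + 0.7 * x + rng.normal(scale=1.0, size=n)

x_near = np.array([1.0, x.mean()])   # at the centre of the observed inputs
x_far = np.array([1.0, 25.0])        # far outside the observed range

lo_near, hi_near = mean_confidence_interval(X, y, x_near)
lo_far, hi_far = mean_confidence_interval(X, y, x_far)
print("width at mean of x:", hi_near - lo_near)
print("width at x = 25:   ", hi_far - lo_far)
```

In simple regression the quadratic form equals $1/n + (x_0 - \bar{x})^2 / \sum_i (x_i - \bar{x})^2$, so the interval is narrowest at $\bar{x}$ and widens as the query point moves away from the observed inputs.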

## Prediction Intervals for Conditional Outcomes

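The conditional outcome at $x_0$ is distributed as $y_0 = x_0^\top\beta + \epsilon$ with $\epsilon \sim \mathcal{N}(0, \sigma^2)$, where $\epsilon$ is independent of $\vec{Y}_\textrm{obs}$ (and hence of $\hat{\beta}$ and $\hat{\sigma}$).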

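Thus $x_0^\top\hat{\beta} - y_0 = (x_0^\top\hat{\beta} - x_0^\top\beta) - \epsilon \sim \mathcal{N}(0, \sigma^2 (1+x_0^\top(X^\top X)^{-1} x_0))$, since $\epsilon$ is independent of $\hat{\beta}$ and the variances of independent terms add.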

So we obtain

$$
\frac{x_0^\top\hat{\beta} - y_0}{\hat{\sigma} \sqrt{1+x_0^\top(X^\top X)^{-1} x_0}} \sim t_{n-p-1}
$$

which leads to the prediction interval $$x_0^\top \hat{\beta} \pm \hat{\sigma} \sqrt{1+ x_0^\top(X^\top X)^{-1} x_0}\; t_{1-\alpha/2, n-p-1}$$
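The difference between an interval for the mean $x_0^\top\beta$ and an interval for a new outcome $y_0$ shows up clearly in simulation. The sketch below uses hypothetical parameters, design, query point, and seed. It uses the half-width $\hat{\sigma}\sqrt{1 + x_0^\top (X^\top X)^{-1} x_0}\; t_{1-\alpha/2, n-p-1}$ for the prediction interval and, for comparison, the half-width without the "$1+$" term that is appropriate for the conditional mean. It then measures how often each interval contains a freshly drawn $y_0$.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)

# Hypothetical setup for illustration.
n, p = 20, 1
sigma = 1.0
beta = np.array([1.0, 2.0])
alpha = 0.05

X = np.column_stack([np.ones(n), rng.uniform(-1, 1, size=n)])
x0 = np.array([1.0, 0.5])            # new input, with the leading 1

XtX_inv = np.linalg.inv(X.T @ X)
dof = n - p - 1
t_crit = stats.t.ppf(1 - alpha / 2, df=dof)
quad_form = x0 @ XtX_inv @ x0        # x0^T (X^T X)^{-1} x0

# Simulate many training sets with X fixed and refit each one.
n_sims = 20_000
Y = X @ beta + sigma * rng.normal(size=(n_sims, n))
beta_hats = Y @ X @ XtX_inv
sigma_hats = np.sqrt(((Y - beta_hats @ X.T) ** 2).sum(axis=1) / dof)
centers = beta_hats @ x0             # x0^T beta_hat for each simulation

# One fresh outcome at x0 per simulation, independent of the training data.
y0 = x0 @ beta + sigma * rng.normal(size=n_sims)

pi_half = sigma_hats * np.sqrt(1 + quad_form) * t_crit   # prediction interval
ci_half = sigma_hats * np.sqrt(quad_form) * t_crit       # interval for the mean

pi_coverage = np.mean(np.abs(y0 - centers) < pi_half)
ci_coverage_of_y0 = np.mean(np.abs(y0 - centers) < ci_half)
print("prediction interval coverage of y0:", pi_coverage)
print("mean-interval coverage of y0:      ", ci_coverage_of_y0)
```

The prediction interval should cover $y_0$ in roughly a fraction $1-\alpha$ of simulations, while the narrower interval for the mean covers it far less often, because it ignores the noise $\epsilon$ in the new outcome.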


