# Simple Linear Model

## One predictor variable

**Video lecture: https://youtu.be/rOA5YXfCP1w**

Let $\{(x_i, y_i)\}_{i=1}^n$ be a set of observed values. We want to fit a linear function to the data.

Let $Y_i = b_0 + b_1x_i + \epsilon_i$, where the $\epsilon_i$ are iid $N(0, \sigma^2)$, the $x_i$ are known predictor values, and the $Y_i$ are the observed responses. An equivalent, centered form of the model is $Y_i = b_0^{*} + b_1(x_i - \bar{x}) + \epsilon_i$, where $b_0^{*} = b_0 + b_1\bar{x}$.


### Likelihood function:

$$
L(b_0, b_1, \sigma^2) = \prod_{i=1}^n\frac{1}{\sqrt{2\pi\sigma^2}}e^{-\frac{[Y_i-b_0-b_1x_i]^2}{2\sigma^2}}
$$

Maximizing the likelihood function is equivalent to minimizing:
$$
-\log{L(b_0, b_1, \sigma^2)} = \frac{n}{2}\log{(2\pi\sigma^2)} + {\frac{\sum[Y_i-b_0-b_1x_i]^2}{2\sigma^2}}
$$

For any fixed $\sigma^2$, this is in turn equivalent to minimizing:

$$
H(b_0, b_1) = \sum[Y_i-b_0-b_1x_i]^2
$$


The maximum likelihood method is therefore equivalent to the least squares method for estimating the parameters $b_0, b_1$.


Setting the partial derivatives of $\log{L(b_0, b_1, \sigma^2)}$ with respect to $b_0, b_1, \sigma^2$ to zero, we get:

$$
\begin{align}
\hat{b}_1 &= \frac{\sum(Y_i-\bar{Y})(x_i-\bar{x})}{\sum(x_i-\bar{x})^2} = \frac{\sum{Y_i(x_i-\bar{x})}}{\sum(x_i-\bar{x})^2} = \sum{k_iY_i} \\
\hat{b}_0 &= \bar{Y}-\hat{b}_1\bar{x}  \\
\hat{b}_0^{*} &= \hat{b}_0 + \hat{b}_1\bar{x} = \bar{Y} \\
\hat{\sigma}^{2} &= \frac{1}{n}\sum{(Y_i- \hat{Y}_i)^2} = \frac{1}{n}\sum{\hat{e}_i^2} = \frac{\text{SSE}}{n} 
\end{align}
$$

Where **SSE** is called **Sum of Squared Errors**, and

$$
\begin{align}
k_i &= \frac{x_i-\bar{x}}{\sum(x_i-\bar{x})^2} \\
\sum{k_i} &= 0 \\
\sum{k_ix_i} &= 1 \\
\sum{k_i^2} &= \frac{1}{\sum(x_i-\bar{x})^2}
\end{align}
$$

$$
\begin{align}
\hat{Y_i} &= \hat{b}_0 + \hat{b}_1x_i \\
\hat{e}_i &= Y_i- \hat{Y}_i \\
e_i &= Y_i- \mu[Y_i] = Y_i - b_0 - b_1x_i
\end{align}
$$
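
As a minimal sketch of the estimators above, the following NumPy snippet computes $\hat{b}_0$, $\hat{b}_1$, $\hat{\sigma}^2$ and the weights $k_i$ directly from their formulas. The data values and all variable names are made up for illustration and do not come from the lecture.

```python
import numpy as np

# Made-up illustrative data (an assumption, not from the text).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])
n = len(x)

x_bar, y_bar = x.mean(), y.mean()
sxx = np.sum((x - x_bar) ** 2)

# Weights k_i, so that b1_hat = sum(k_i * Y_i).
k = (x - x_bar) / sxx

b1_hat = np.sum(k * y)
b0_hat = y_bar - b1_hat * x_bar
b0_star_hat = b0_hat + b1_hat * x_bar  # equals y_bar

y_hat = b0_hat + b1_hat * x            # fitted values
residuals = y - y_hat                  # e_hat_i
sse = np.sum(residuals ** 2)
sigma2_hat = sse / n                   # maximum likelihood estimate of sigma^2

print(b0_hat, b1_hat, sigma2_hat)
print(k.sum(), np.sum(k * x), np.sum(k ** 2) * sxx)
```

The second `print` line checks the three identities for $k_i$ numerically: up to floating-point rounding, the values should be $0$, $1$ and $1$.
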
### Expected value of $\hat{b}_0, \hat{b}_1$:

$$
\begin{align}
\mu[\hat{b}_1] &= \sum{\mu[k_iY_i]} = \sum{k_i(b_0+b_1x_i)} = b_0\sum{k_i} + b_1\sum{k_ix_i} = b_1 \\
\mu[\hat{b}_0] &= \mu[\bar{Y}-\hat{b}_1\bar{x}] = \frac{1}{n}\sum{\mu[Y_i]} - \mu[\hat{b}_1]\bar{x} = b_0 + b_1\bar{x} - b_1\bar{x} = b_0
\end{align}
$$


### Theorem

$$
[\hat{b}_0^{*}, \hat{b}_1]^{\intercal} \sim N\bigg([b_0^{*}, b_1]^{\intercal}, \sigma^2\begin{bmatrix}\frac{1}{n} & 0 \\ 
0 & \frac{1}{\sum(x_i - \bar{x})^2} \end{bmatrix}\bigg)
$$


Proof:

Since $\hat{b}_0^{*}, \hat{b}_1$ are both linear combinations of the independent normal r.v.s $Y_i$, they are jointly normal. It therefore suffices to show that their covariance is $0$ and to compute their means and variances:

$$
Cov(\hat{b}_0^{*}, \hat{b}_1) = Cov(\bar{Y}, \sum{k_iY_i}) = \frac{1}{n}\sum_{i=1}^n{k_iCov\Big(\sum_{j \neq i}{Y_j} + Y_i, Y_i\Big)} = \frac{1}{n}\sigma^2\sum{k_i} = 0
$$

$$
\begin{align}
\mu[\hat{b}_0^{*}] &= \mu[\bar{Y}] = \frac{1}{n}\sum{\big[b_0^{*} + b_1(x_i-\bar{x})\big]} = b_0^{*} \\
\sigma^2[\hat{b}_0^{*}] &= \frac{\sigma^2}{n}  \\
\mu[\hat{b}_1] &= \sum{k_i \mu[Y_i]} = \sum{k_i(b_0^{*} + b_1(x_i-\bar{x}))} = \frac{b_1\sum{(x_i-\bar{x})^2}}{\sum(x_i-\bar{x})^2} = b_1 \\
\sigma^2[\hat{b}_1] &= \sum{k_i^2 \sigma^2[Y_i]} = \frac{\sigma^2}{\sum(x_i-\bar{x})^2} \tag{1}
\end{align}
$$

$\blacksquare$
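
The theorem can be checked empirically with a small Monte Carlo simulation. This is only a sketch: the true parameters, the design points, the seed and the number of replications are arbitrary assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical true parameters and design, chosen only for illustration.
b0, b1, sigma = 1.0, 2.0, 0.5
x = np.linspace(0.0, 4.0, 9)
n = len(x)
x_bar = x.mean()
sxx = np.sum((x - x_bar) ** 2)
k = (x - x_bar) / sxx

# Simulate many response vectors Y = b0 + b1 * x + eps, one per row.
reps = 100_000
eps = rng.normal(0.0, sigma, size=(reps, n))
Y = b0 + b1 * x + eps

b0_star_hat = Y.mean(axis=1)  # estimator of b0* = b0 + b1 * x_bar
b1_hat = Y @ k                # estimator of b1

# Empirical 2x2 covariance matrix versus the matrix stated in the theorem.
emp_cov = np.cov(b0_star_hat, b1_hat)
theory_cov = sigma ** 2 * np.array([[1.0 / n, 0.0], [0.0, 1.0 / sxx]])

print(emp_cov)
print(theory_cov)
```

The empirical covariance matrix should be close to the theoretical one, with off-diagonal entries near zero; the agreement is only approximate because of simulation noise.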

### Variance of $\hat{b}_0$

$$
\begin{equation}
\sigma^2[\hat{b}_0] = \sigma^2[\hat{b}_0^{*} - \bar{x}\hat{b}_1] = \frac{1}{n^2} \sum{\sigma^2[Y_i]} + \bar{x}^2\sigma^2[\hat{b}_1] = \sigma^2\bigg[\frac{1}{n} + \frac{\bar{x}^2}{\sum{(x_i-\bar{x})^2}} \bigg]
\end{equation}
$$
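
The variance of a difference normally carries a covariance term. It is absent here because $Cov(\hat{b}_0^{*}, \hat{b}_1) = 0$, as shown in the proof of the theorem. Written out in full, the step is:

$$
\begin{align}
\sigma^2[\hat{b}_0^{*} - \bar{x}\hat{b}_1] &= \sigma^2[\hat{b}_0^{*}] + \bar{x}^2\sigma^2[\hat{b}_1] - 2\bar{x}\,Cov(\hat{b}_0^{*}, \hat{b}_1) \\
&= \frac{\sigma^2}{n} + \frac{\bar{x}^2\sigma^2}{\sum(x_i-\bar{x})^2} - 0
\end{align}
$$
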
### Quadratic decomposition

$$
\begin{align}
\sum(Y_i - b_0^{*} - b_1(x_i - \bar{x}))^2 &= \sum\big[ (\hat{b}_0^{*} -b_0^{*}) + (\hat{b}_1 - b_1)(x_i - \bar{x}) + (Y_i - \hat{b}_0^{*} - \hat{b}_1(x_i-\bar{x})) \big]^2 \\
&= n(\hat{b}_0^{*} -b_0^{*})^2 + (\hat{b}_1 - b_1)^2\sum(x_i-\bar{x})^2 + n\hat{\sigma}^2 \\
Q &= Q_1 + Q_2 + Q_3
\end{align}
$$
Where $Q = \sum(Y_i - b_0^{*} - b_1(x_i - \bar{x}))^2$, $Q_1 = n(\hat{b}_0^{*} -b_0^{*})^2$, $Q_2 = (\hat{b}_1 - b_1)^2\sum(x_i-\bar{x})^2$, $Q_3 = n\hat{\sigma}^2 = \sum{\hat{e}_i^2} = \text{SSE}$. 


$Q, Q_1, Q_2, Q_3$ are quadratic forms in the r.v.s $Z_i = Y_i - b_0^{*} - b_1(x_i-\bar{x})$, $i = 1, 2, ..., n$, because $Q = \sum{Z_i^2}$, $Q_1 = n\bar{Z}^2$, $Q_2 = \big( \sum{k_iZ_i} \big)^2\sum(x_i-\bar{x})^2$, and $Q_3 = \sum_{i=1}^n\big( Z_i - \bar{Z} - (\sum_{j=1}^nk_jZ_j)(x_i-\bar{x})  \big)^2$.
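
The decomposition $Q = Q_1 + Q_2 + Q_3$ is an algebraic identity, so it can be verified numerically on any sample. The sketch below does this for one simulated data set; the true parameters (needed to form $Q$, although unknown in practice), the design and the seed are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical true parameters, used only to form Q (unknown in practice).
b0, b1, sigma = 1.0, 2.0, 0.5
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
n = len(x)
x_bar = x.mean()
sxx = np.sum((x - x_bar) ** 2)
b0_star = b0 + b1 * x_bar

y = b0 + b1 * x + rng.normal(0.0, sigma, size=n)

b1_hat = np.sum((x - x_bar) * y) / sxx
b0_star_hat = y.mean()

Q = np.sum((y - b0_star - b1 * (x - x_bar)) ** 2)
Q1 = n * (b0_star_hat - b0_star) ** 2
Q2 = (b1_hat - b1) ** 2 * sxx
Q3 = np.sum((y - b0_star_hat - b1_hat * (x - x_bar)) ** 2)  # SSE

# The same pieces written in terms of Z_i.
Z = y - b0_star - b1 * (x - x_bar)
k = (x - x_bar) / sxx
Q1_from_Z = n * Z.mean() ** 2
Q2_from_Z = np.sum(k * Z) ** 2 * sxx

print(Q, Q1 + Q2 + Q3)
```

Both printed numbers should agree up to floating-point rounding.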

### Theorem

$$
\frac{Q_3}{\sigma^2} = \frac{\sum_{i=1}^n(Y_i - \hat{b}_0^{*} - \hat{b}_1(x_i - \bar{x}))^2}{\sigma^2} = \frac{\sum_{i=1}^n{\hat{e}_i^2}}{\sigma^2} \sim \chi^2(n-2)
$$


Proof: 

- $\frac{Y_i - b_0^{*} - b_1(x_i - \bar{x})}{\sigma} \sim N(0, 1)$ independently for each $i$, so $\frac{Q}{\sigma^2}$ is a sum of squares of $n$ independent standard normal r.v.s; therefore $\frac{Q}{\sigma^2} \sim \chi^2(n)$
- $\frac{\sqrt{n}(\hat{b}_0^{*} -b_0^{*})}{\sigma} \sim N(0, 1)$, therefore $\frac{Q_1}{\sigma^2} \sim \chi^2(1)$
- $\frac{\sqrt{\sum(x_i-\bar{x})^2}(\hat{b}_1 - b_1)}{\sigma} \sim N(0, 1)$, therefore $\frac{Q_2}{\sigma^2} \sim \chi^2(1)$
- $Q_3$ is a non-negative quadratic form and $Q = Q_1 + Q_2 + Q_3$; therefore, by the **Quadratic Form theorem**, $\frac{Q_3}{\sigma^2} \sim \chi^2(n-2)$

$\blacksquare$

### Expected value of $\hat{\sigma}^2$

$$
\mu[\hat{\sigma}^2] = \frac{1}{n}\mu[\sum_{i=1}^n{\hat{e}_i^2}] = \frac{1}{n}\mu[Q_3] = \frac{\sigma^2}{n}\mu[\frac{Q_3}{\sigma^2}] = \frac{n-2}{n}\sigma^2
$$

Since $\frac{n-2}{n}\sigma^2 \ne \sigma^2$, $\hat{\sigma}^2$ is a biased estimator of $\sigma^2$.

However, if we define the **Mean Squared Error(MSE)** as: 

$$
\begin{equation}
\text{MSE} = \frac{n\hat{\sigma}^2}{n-2} = \frac{Q_3}{n-2} = \frac{\text{SSE}}{n-2}  = \frac{\sum_{i=1}^n{\hat{e}_i^2}}{n-2} = \frac{\sum_{i=1}^n{(Y_i-\hat{Y_i})^2}}{n-2}
\end{equation}
$$
We get 

$$
\begin{align}
\mu[MSE] &=  \mu[\frac{\text{SSE}}{\sigma^2} \frac{\sigma^2}{(n-2)}] = \frac{\sigma^2}{n-2}(n-2) = \sigma^2 \\
\sigma^2[MSE] &= \sigma^2[\frac{\text{SSE}}{\sigma^2} \frac{\sigma^2}{(n-2)}] = \frac{2(n-2)\sigma^4}{(n-2)^2} = \frac{2\sigma^4}{n-2}
\end{align}
$$
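
A short simulation can illustrate the bias of $\hat{\sigma}^2$ and the mean and variance of $\text{MSE}$. The sketch below assumes arbitrary true parameters, design points and seed, none of which come from the lecture; the comparison is approximate because of simulation noise.

```python
import numpy as np

rng = np.random.default_rng(2)

b0, b1, sigma = 1.0, 2.0, 1.5  # hypothetical true values
x = np.arange(1.0, 9.0)        # n = 8 design points
n = len(x)
x_bar = x.mean()
sxx = np.sum((x - x_bar) ** 2)

# One simulated sample per row.
reps = 200_000
Y = b0 + b1 * x + rng.normal(0.0, sigma, size=(reps, n))

b1_hat = (Y @ (x - x_bar)) / sxx
b0_hat = Y.mean(axis=1) - b1_hat * x_bar
fitted = b0_hat[:, None] + b1_hat[:, None] * x
sse = np.sum((Y - fitted) ** 2, axis=1)

sigma2_hat = sse / n   # maximum likelihood estimator (biased)
mse = sse / (n - 2)    # bias-corrected estimator

print(sigma2_hat.mean(), (n - 2) / n * sigma ** 2)  # simulated vs (n-2)/n * sigma^2
print(mse.mean(), sigma ** 2)                       # simulated vs sigma^2
print(mse.var(), 2 * sigma ** 4 / (n - 2))          # simulated vs 2 sigma^4 / (n-2)
```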

### Confidence Intervals

#### The $1-\alpha$ confidence interval of $b_1$

We estimate the variance of $\hat{b}_1$ by replacing the unknown $\sigma^2$ with its unbiased estimator $\text{MSE}$: $s^2[{\hat{b}_1}] = \frac{\text{MSE}}{\sum(x_i-\bar{x})^2}$. $s^2[{\hat{b}_1}]$ is called the sample variance of $\hat{b}_1$.

$$
\displaystyle\frac{\hat{b}_1-b_1}{s[\hat{b}_1]} =  \frac{(\hat{b}_1-b_1)/\sigma[\hat{b}_1]}{s[\hat{b}_1]/\sigma[\hat{b}_1]}
$$
The numerator is a standard normal r.v. The denominator is:
$$
\frac{s[\hat{b}_1]}{\sigma[\hat{b}_1]} = \sqrt{\frac{\frac{\text{MSE}}{\sum(x_i-\bar{x})^2}}{\frac{\sigma^2}{\sum(x_i-\bar{x})^2}}} = \sqrt{\frac{\frac{\text{SSE}}{\sigma^2}}{n-2}} \sim \sqrt{\frac{\chi^2(n-2)}{n-2}}
$$

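So $\frac{\hat{b}_1-b_1}{s[\hat{b}_1]} = \frac{z}{\sqrt{\frac{\chi^2(n-2)}{n-2}}} \sim t(n-2)$, where the numerator and denominator are independent because $\hat{b}_1$ and $\text{SSE}$ are independent. The ratio therefore follows exactly a Student's t distribution with $n-2$ degrees of freedom.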


Therefore the $1-\alpha$ confidence interval is:

$$
\begin{equation}
\hat{b}_1 \pm t(1-\alpha/2; n-2)s[\hat{b}_1]
\end{equation}
$$

#### The $1-\alpha$ confidence interval of $b_0$

$\hat{b}_0 = \hat{b}_0^{*} - \bar{x}\hat{b}_1$ is a linear function of the normal r.v.s $\hat{b}_0^{*}, \hat{b}_1$ and is therefore normal. As with $\hat{b}_1$, we replace the variance of $\hat{b}_0$ with its sample variance $s^2[\hat{b}_0] = \text{MSE} \bigg[ \frac{1}{n} + \frac{\bar{x}^2}{\sum(x_i-\bar{x})^2} \bigg]$ and we get:

$$
\begin{equation}
\hat{b}_0 \pm t(1-\alpha/2; n-2)s[\hat{b}_0]
\end{equation}
$$
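
The two intervals for $b_1$ and $b_0$ can be computed as follows. This is a sketch on made-up data, with $\alpha = 0.05$ assumed; the t quantile $t(1-\alpha/2; n-2)$ is taken from `scipy.stats.t.ppf`.

```python
import numpy as np
from scipy import stats

# Made-up illustrative data (an assumption, not from the text).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])
n = len(x)
alpha = 0.05

x_bar = x.mean()
sxx = np.sum((x - x_bar) ** 2)
b1_hat = np.sum((x - x_bar) * y) / sxx
b0_hat = y.mean() - b1_hat * x_bar

sse = np.sum((y - b0_hat - b1_hat * x) ** 2)
mse = sse / (n - 2)

# Estimated standard deviations s[b1_hat] and s[b0_hat].
s_b1 = np.sqrt(mse / sxx)
s_b0 = np.sqrt(mse * (1.0 / n + x_bar ** 2 / sxx))

# Quantile t(1 - alpha/2; n - 2).
t_crit = stats.t.ppf(1 - alpha / 2, df=n - 2)

ci_b1 = (b1_hat - t_crit * s_b1, b1_hat + t_crit * s_b1)
ci_b0 = (b0_hat - t_crit * s_b0, b0_hat + t_crit * s_b0)
print(ci_b1)
print(ci_b0)
```

As a cross-check, `s_b1` should match the `stderr` field returned by `scipy.stats.linregress(x, y)`.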

#### The $1-\alpha$ confidence interval of the mean response $\mu[Y_h]$ when $x = x_h$



$\hat{Y}_h = \hat{b}_0 + \hat{b}_1x_h$ is normal because it is a linear function of the jointly normal r.v.s $(\hat{b}_0, \hat{b}_1)$. Moreover, $\hat{Y}_h$ is an unbiased point estimator of $\mu[Y_h] = b_0 + b_1x_h$:

$$
\begin{align}
\mu[\hat{Y}_h] &= \mu[\hat{b}_0] + \mu[\hat{b}_1]x_h = b_0 + b_1x_h = \mu[Y_h] \\
\sigma^2[\hat{Y}_h] &= \sigma^2[\hat{b}_0^{*} + \hat{b}_1(x_h-\bar{x})] = \sigma^2[\hat{b}_0^{*}] + \sigma^2[\hat{b}_1](x_h-\bar{x})^2 = \sigma^2 \times \bigg[\frac{1}{n} + \frac{(x_h-\bar{x})^2}{\sum(x_i-\bar{x})^2}\bigg]
\end{align}
$$

Replacing $\sigma^2$ with its MSE estimator, we get an unbiased estimate of this variance:

$$
s^2[\hat{Y}_h] = \text{MSE} \times \bigg[\frac{1}{n} + \frac{(x_h-\bar{x})^2}{\sum(x_i-\bar{x})^2}\bigg]
$$

Similarly, the $1-\alpha$ confidence interval of $\mu[Y_h]$ is:

$$
\begin{equation}
\hat{Y}_h \pm t(1-\alpha/2; n-2)s[\hat{Y}_h]
\end{equation}
$$


#### The $1-\alpha$ confidence interval of a new observation $Y_h$ when $x = x_h$

Let $Y_h$ be a new observation and let $\hat{e}_h = Y_h-\hat{Y}_h$. Because $\hat{Y}_h = \hat{b}_0 + \hat{b}_1x_h$ is a linear function of $Y_1, Y_2,...,Y_n$, which are independent of $Y_h = b_0+b_1x_h + \epsilon_h$, $\hat{Y}_h$ and $Y_h$ are independent of each other.

$$
\begin{align}
\sigma^2[\hat{e}_h] &= \sigma^2[Y_h] + \sigma^2[\hat{Y}_h] = \sigma^2 + \sigma^2\times\bigg[ \frac{1}{n} + \frac{(x_h-\bar{x})^2}{\sum{(x_i-\bar{x})^2}} \bigg] = \sigma^2\times\bigg[ 1 + \frac{1}{n} + \frac{(x_h-\bar{x})^2}{\sum{(x_i-\bar{x})^2}} \bigg] \\
s^2[\hat{e}_h] &= \text{MSE}\times\bigg[ 1 + \frac{1}{n} + \frac{(x_h-\bar{x})^2}{\sum{(x_i-\bar{x})^2}} \bigg]
\end{align}
$$

Since $Y_h, \hat{Y}_h$ are independent normal r.v.s, $\hat{e}_h$ is normal with mean $0$, and it can be proved in the same way as before that $\frac{\hat{e}_h}{s[\hat{e}_h]} \sim t(n-2)$. The $1-\alpha$ confidence interval of $Y_h$ (usually called a prediction interval) is:

$$
\begin{equation}
\hat{Y}_h\pm t(1-\alpha/2; n-2)s[\hat{e}_h]
\end{equation}
$$
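
The interval for the mean response $\mu[Y_h]$ and the interval for a single new observation $Y_h$ differ only in the extra $1$ inside the variance factor. The sketch below computes both at one point $x_h$; the helper name `intervals_at`, the data and the choice $x_h = 3.5$ are hypothetical and used only for illustration.

```python
import numpy as np
from scipy import stats


def intervals_at(x, y, x_h, alpha=0.05):
    """Return (y_hat_h, ci_mean, pred_int) at x_h for a simple linear fit."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(x)
    x_bar = x.mean()
    sxx = np.sum((x - x_bar) ** 2)

    b1_hat = np.sum((x - x_bar) * y) / sxx
    b0_hat = y.mean() - b1_hat * x_bar
    mse = np.sum((y - b0_hat - b1_hat * x) ** 2) / (n - 2)

    y_hat_h = b0_hat + b1_hat * x_h
    var_factor = 1.0 / n + (x_h - x_bar) ** 2 / sxx
    s_mean = np.sqrt(mse * var_factor)          # s[Y_hat_h]
    s_pred = np.sqrt(mse * (1.0 + var_factor))  # s[e_hat_h]

    t_crit = stats.t.ppf(1 - alpha / 2, df=n - 2)
    ci_mean = (y_hat_h - t_crit * s_mean, y_hat_h + t_crit * s_mean)
    pred_int = (y_hat_h - t_crit * s_pred, y_hat_h + t_crit * s_pred)
    return y_hat_h, ci_mean, pred_int


# Made-up illustrative data (an assumption, not from the text).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
y_hat_h, ci_mean, pred_int = intervals_at(x, y, x_h=3.5)
print(y_hat_h, ci_mean, pred_int)
```

Both intervals are centered at $\hat{Y}_h$, and the interval for the new observation is always the wider of the two.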

#### The $1-\alpha$ confidence interval of the mean $\bar{Y}_h$ of $m$ new observations when $x = x_h$

Let $Y_{h_1}, Y_{h_2},..., Y_{h_m}$ be $m$ new, not yet observed responses; we want an interval for their unknown mean $\bar{Y}_h = \frac{\sum_{i=1}^m{Y_{h_i}}}{m}$. Note that the $Y_{h_i}$ are independent of each other and also independent of $\hat{Y}_h = \hat{b}_0 + \hat{b}_1x_h$, which is a linear function of $Y_1, Y_2, ..., Y_n$.

Let $\hat{r}_h = \hat{Y}_h - \bar{Y}_h$, $\hat{r}_h$ is normal with mean and variance:

$$
\begin{align}
\mu[\hat{r}_h] &= \mu[\hat{Y}_h] - \mu[\bar{Y}_h] = b_0 + b_1x_h -\frac{m(b_0+b_1x_h)}{m} = 0 \\
\sigma^2[\hat{r}_h] &= \sigma^2[\hat{Y}_h] + \sigma^2[\bar{Y}_h] = \sigma^2 \times \bigg[ \frac{1}{n} + \frac{(x_h-\bar{x})^2}{\sum(x_i-\bar{x})^2} \bigg] + \frac{\sigma^2}{m} = \sigma^2 \times \bigg[ \frac{1}{m} + \frac{1}{n} + \frac{(x_h-\bar{x})^2}{\sum(x_i-\bar{x})^2} \bigg] \\
s^2[\hat{r}_h] &= \text{MSE} \times \bigg[ \frac{1}{m} + \frac{1}{n} + \frac{(x_h-\bar{x})^2}{\sum(x_i-\bar{x})^2} \bigg]
\end{align}
$$

Therefore $\frac{\hat{r}_h}{s[\hat{r}_h]} \sim t(n-2)$. And the $1-\alpha$ confidence interval of $\bar{Y}_h$ is:

$$
\begin{equation}
\hat{Y}_h \pm t(1-\alpha/2; n-2)s[\hat{r}_h] 
\end{equation}
$$

#### The Working-Hotelling $1-\alpha$ confidence band of the regression line $\mu[Y_h] = b_0 + b_1x_h$ for all $x_h$

$$
\hat{Y}_h \pm Ws[\hat{Y}_h]
$$
where $W^2 = 2F(1-\alpha; 2, n-2)$.
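
As a sketch, the band can be evaluated on a grid of $x_h$ values, with the $F$ quantile taken from `scipy.stats.f.ppf`. The data, the grid and $\alpha = 0.05$ are assumptions made for illustration. The multiplier $W$ is also compared with the pointwise multiplier $t(1-\alpha/2; n-2)$.

```python
import numpy as np
from scipy import stats

# Made-up illustrative data (an assumption, not from the text).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])
n = len(x)
alpha = 0.05

x_bar = x.mean()
sxx = np.sum((x - x_bar) ** 2)
b1_hat = np.sum((x - x_bar) * y) / sxx
b0_hat = y.mean() - b1_hat * x_bar
mse = np.sum((y - b0_hat - b1_hat * x) ** 2) / (n - 2)

# Working-Hotelling multiplier: W^2 = 2 F(1 - alpha; 2, n - 2).
W = np.sqrt(2.0 * stats.f.ppf(1 - alpha, 2, n - 2))
# Pointwise multiplier used for a single x_h.
t_crit = stats.t.ppf(1 - alpha / 2, n - 2)

# Evaluate the band on a grid of x_h values.
x_grid = np.linspace(x.min(), x.max(), 5)
y_hat = b0_hat + b1_hat * x_grid
s_y_hat = np.sqrt(mse * (1.0 / n + (x_grid - x_bar) ** 2 / sxx))
band_lower = y_hat - W * s_y_hat
band_upper = y_hat + W * s_y_hat

print(W, t_crit)
print(np.column_stack([x_grid, band_lower, band_upper]))
```

Because the band has to cover the whole line at once, $W$ is larger than the pointwise t multiplier, and the band is narrowest at $x_h = \bar{x}$.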

### Decision making

#### Testing $H_0: b_1 = 0$ vs $H_\alpha: b_1 \ne 0$

It can be shown that,

$$
\begin{align}
\sum(Y_i-\bar{Y})^2 &= \sum(\hat{Y}_i-\bar{Y})^2 + \sum(Y_i-\hat{Y}_i)^2\\
\text{SSTO} &= \text{SSR} + \text{SSE}
\end{align}
$$

Under the null hypothesis, $\frac{\text{SSTO}}{\sigma^2} \sim \chi^2(n-1)$ and $\frac{\text{SSE}}{\sigma^2} \sim \chi^2(n-2)$; by the quadratic form theorem, $\frac{\text{SSR}}{\sigma^2} \sim \chi^2(1)$.

Besides decomposing $\text{SSTO}$ into the sum of $\text{SSR}$ and $\text{SSE}$, we also associate **degrees of freedom (df)** with each of these sums of squares, regardless of whether the null or the alternative hypothesis holds:

- $\text{SSTO}$ has $n-1$ df.
- $\text{SSR}$ has $1$ df.
- $\text{SSE}$ has $n-2$ df. Note that $\frac{\text{SSE}}{\sigma^2}$ always follows a true chi-squared distribution with $n-2$ df, under both the null and the alternative hypothesis; this was proved in the previous theorem.

Just like $\text{MSE} = \frac{\text{SSE}}{df(\text{SSE})} = \frac{\text{SSE}}{n-2}$, we can define other mean squares as follows:

- $\text{MSR} = \frac{\text{SSR}}{1}$
- $\text{MSTO} = \frac{\text{SSTO}}{n-1}$

#### Expected values of $\text{MSR}$

$$
\text{MSR} = \frac{\text{SSR}}{1} = \sum(\hat{Y}_i - \bar{Y})^2 = \sum(\hat{b}_0^{*} + \hat{b}_1(x_i-\bar{x}) - \bar{Y})^2 = \hat{b}_1^2\sum(x_i-\bar{x})^2
$$

By using equation (1),

$$
\begin{align}
\mu[MSR] &= \mu[\hat{b}_1^2]\times\sum(x_i-\bar{x})^2 \\
&= (\sigma^2[\hat{b}_1] + \mu[\hat{b}_1]^2)\Big(\sum(x_i-\bar{x})^2\Big) \\
&= \Big(\frac{\sigma^2}{ \sum(x_i-\bar{x})^2 } + b_1^2\Big)\sum(x_i-\bar{x})^2 \\
&= \sigma^2 + b_1^2\sum(x_i-\bar{x})^2 
\end{align}
$$

Compared to $\mu[\text{MSE}] = \sigma^2$, we have $\mu[\text{MSR}] \ge \mu[\text{MSE}]$, with equality exactly when $b_1 = 0$. If $H_0$ is true, $\frac{\text{MSR}}{\text{MSE}}$ should be close to 1; if $H_\alpha$ is true, it should tend to be greater than 1. This motivates the test statistic for testing $H_0: b_1 = 0$ vs $H_\alpha: b_1 \ne 0$:

$$
\begin{equation}
F^{*} = \frac{\text{MSR}}{\text{MSE}} \sim F(1, n-2) \quad \text{under } H_0
\end{equation}
$$

- If $F^{*} \le F(1-\alpha; 1, n-2)$, conclude $H_0$
- If $F^{*} \gt F(1-\alpha; 1, n-2)$, conclude $H_\alpha$

The $F^{*}$ test is equivalent to the two-sided t test based on the statistic $t^{*} = \frac{\hat{b}_1}{s[\hat{b}_1]}$:

$$
\begin{align}
F^{*} &= \frac{\text{SSR}\div1}{\text{SSE}\div(n-2)}  \\
&= \frac{\hat{b}_1^2\sum(x_i-\bar{x})^2}{\text{MSE}} \\
&= \frac{\hat{b}_1^2}{\frac{\text{MSE}}{\sum(x_i-\bar{x})^2}} \\
&= \bigg( \frac{\hat{b}_1}{s[\hat{b}_1]} \bigg)^2 = (t^{*})^2
\end{align}
$$

- If $|t^{*}| \le t(1-\alpha/2; n-2)$, conclude $H_0$
- If $|t^{*}| \gt t(1-\alpha/2; n-2)$, conclude $H_\alpha$
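
The equivalence of the two tests can be illustrated numerically. The following sketch builds the sums of squares on made-up data and compares $F^{*}$ with $(t^{*})^2$, and the two critical values, at an assumed level $\alpha = 0.05$.

```python
import numpy as np
from scipy import stats

# Made-up illustrative data (an assumption, not from the text).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])
n = len(x)
alpha = 0.05

x_bar, y_bar = x.mean(), y.mean()
sxx = np.sum((x - x_bar) ** 2)
b1_hat = np.sum((x - x_bar) * y) / sxx
b0_hat = y_bar - b1_hat * x_bar
y_hat = b0_hat + b1_hat * x

# Sums of squares and mean squares.
ssto = np.sum((y - y_bar) ** 2)
ssr = np.sum((y_hat - y_bar) ** 2)
sse = np.sum((y - y_hat) ** 2)
msr = ssr / 1
mse = sse / (n - 2)

# Test statistics.
f_star = msr / mse
t_star = b1_hat / np.sqrt(mse / sxx)

# Critical values F(1 - alpha; 1, n - 2) and t(1 - alpha/2; n - 2).
f_crit = stats.f.ppf(1 - alpha, 1, n - 2)
t_crit = stats.t.ppf(1 - alpha / 2, n - 2)

reject_f = f_star > f_crit
reject_t = abs(t_star) > t_crit

print(ssto, ssr + sse)
print(f_star, t_star ** 2)
print(f_crit, t_crit ** 2)
print(reject_f, reject_t)
```

Since $F(1-\alpha; 1, n-2) = t(1-\alpha/2; n-2)^2$, the two decision rules always agree, not just on this data set.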


