# 3.2 Linear Regression Models and Least Squares

We have an input vector $X^T=(X_1,\ldots,X_p)$ and want to predict a real-valued output $Y$. The linear regression model has the form:

$$f(X) = \beta_0 + \sum_{j=1}^p X_j\beta_j$$

Typically we have a set of training data $(x_1, y_1),\ldots,(x_N, y_N)$ from which to estimate the parameters $\beta$. The most popular estimation method is *least squares*, in which we pick the coefficients $\beta=(\beta_0,\beta_1,\ldots,\beta_p)^T$ to minimize the residual sum of squares (3.2):
$$ 
\begin{align}
RSS(\beta)&=\sum_{i=1}^N(y_i-f(x_i))^2\\
&=\sum_{i=1}^N\Big(y_i-\beta_0-\sum_{j=1}^p x_{ij}\beta_j\Big)^2
\end{align}
$$

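How do we minimize (3.2)? Denote by $\mathbf{X}$ the $N\times(p+1)$ matrix whose rows are the input vectors (each with a 1 in the first position), and by $\mathbf{y}$ the $N$-vector of training outputs. Then (3.2) can be written in matrix form (3.3):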
$$RSS(\beta)=(\mathbf{y}-\mathbf{X}\beta)^T(\mathbf{y}-\mathbf{X}\beta)$$

Differentiating with respect to $\beta$ we obtain:
$$
\begin{align}
\frac{\partial{RSS}}{\partial\beta} = -2\mathbf{X}^T(\mathbf{y}-\mathbf{X}\beta)
\end{align}
$$
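
As a sketch of the intermediate algebra (a standard expansion, not spelled out in the text above), multiplying out (3.3) and differentiating term by term gives the gradient and the second derivative:

$$
\begin{align}
RSS(\beta) &= \mathbf{y}^T\mathbf{y} - 2\beta^T\mathbf{X}^T\mathbf{y} + \beta^T\mathbf{X}^T\mathbf{X}\beta\\
\frac{\partial RSS}{\partial\beta} &= -2\mathbf{X}^T\mathbf{y} + 2\mathbf{X}^T\mathbf{X}\beta = -2\mathbf{X}^T(\mathbf{y}-\mathbf{X}\beta)\\
\frac{\partial^2 RSS}{\partial\beta\,\partial\beta^T} &= 2\mathbf{X}^T\mathbf{X}
\end{align}
$$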

Assuming that **X** has full column rank, and hence that the second derivative $2\mathbf{X}^T\mathbf{X}$ is positive definite, we set the first derivative to zero:
$$\mathbf{X}^T(\mathbf{y}-\mathbf{X}\beta)=0$$

and the unique solution is:
$$\hat{\beta}=(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}$$
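
The closed-form solution can be checked numerically. The following is a minimal sketch on synthetic data, assuming NumPy is available; the data, the seed, and the names `beta_true`, `beta_hat`, and `beta_lstsq` are illustrative choices, not part of the original text. It solves the normal equations with `np.linalg.solve` rather than forming the inverse explicitly, and cross-checks the result against `np.linalg.lstsq`.

```python
import numpy as np

# Synthetic data (an assumption for illustration): N observations, p inputs.
rng = np.random.default_rng(0)
N, p = 50, 3
X_raw = rng.normal(size=(N, p))

# Prepend a column of ones so that the first coefficient is the intercept beta_0.
X = np.column_stack([np.ones(N), X_raw])

beta_true = np.array([1.0, 2.0, -1.0, 0.5])  # hypothetical "true" coefficients
y = X @ beta_true + rng.normal(scale=0.1, size=N)

# Normal equations: (X^T X) beta = X^T y.
# Solving the linear system avoids computing (X^T X)^{-1} explicitly.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Cross-check with NumPy's least-squares routine.
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print("normal equations:", beta_hat)
print("lstsq           :", beta_lstsq)
```

In practice a QR- or SVD-based solver such as `np.linalg.lstsq` is usually preferred when the columns of **X** are nearly collinear, since forming $\mathbf{X}^T\mathbf{X}$ squares the condition number.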

The predicted value at an input vector $x_0$ is given by $\hat{f}(x_0)=(1:x_0)^T\hat{\beta}$, and the fitted values at the training inputs are:

$$\hat{y}=\mathbf{X}\hat{\beta}=\mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}$$

The matrix $\mathbf{H}=\mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T$ is sometimes called the "hat" matrix.

*Geometrical representation of least squares:* We denote the column vectors of **X** by $x_0, x_1, \ldots, x_p$, with $x_0 \equiv 1$. These vectors span a subspace of $\mathbb{R}^N$, also referred to as the column space of **X**. We minimize $RSS(\beta)=||\mathbf{y}-\mathbf{X}\beta||^2$ by choosing $\hat{\beta}$ so that the residual vector $\mathbf{y} - \hat{\mathbf{y}}$ is orthogonal to this subspace; this orthogonality is expressed by $\mathbf{X}^T(\mathbf{y}-\mathbf{X}\hat{\beta})=0$. The fitted vector $\hat{\mathbf{y}}$ is therefore the orthogonal projection of $\mathbf{y}$ onto the column space, and the hat matrix **H** is the corresponding projection matrix.
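
The projection view can be verified numerically. The sketch below, which assumes NumPy and uses arbitrary synthetic data, builds **H** and checks the properties an orthogonal projection should have: symmetry, idempotence ($\mathbf{H}\mathbf{H}=\mathbf{H}$), a residual orthogonal to every column of **X**, and a trace equal to the dimension $p+1$ of the column space. All variable names are illustrative.

```python
import numpy as np

# Synthetic design matrix with an intercept column (illustrative assumption).
rng = np.random.default_rng(1)
N, p = 20, 2
X = np.column_stack([np.ones(N), rng.normal(size=(N, p))])
y = rng.normal(size=N)

# Hat matrix H = X (X^T X)^{-1} X^T, computed via a linear solve.
H = X @ np.linalg.solve(X.T @ X, X.T)

y_hat = H @ y          # projection of y onto the column space of X
residual = y - y_hat   # component of y orthogonal to that subspace

symmetric = np.allclose(H, H.T)
idempotent = np.allclose(H @ H, H)
orthogonal = np.allclose(X.T @ residual, 0.0, atol=1e-10)
trace_H = np.trace(H)  # should equal p + 1 when X has full column rank

print(symmetric, idempotent, orthogonal, round(trace_H, 6))
```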

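*Sampling properties of $\hat{\beta}$*: In order to pin down the sampling properties of $\hat{\beta}$, we assume that the observations $y_i$ are uncorrelated and have constant variance $\sigma^2$, and that the $x_i$ are fixed. Writing $\varepsilon=\mathbf{y}-E(\mathbf{y})$, so that $\hat{\beta}-E(\hat{\beta})=(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\varepsilon$ and $E(\varepsilon\varepsilon^T)=\sigma^2\mathbf{I}$, the variance-covariance matrix of $\hat{\beta}$ is given by: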

$$
\begin{align}
Var(\hat{\beta}) &= E\left[(\hat{\beta}-E(\hat{\beta}))(\hat{\beta}-E(\hat{\beta}))^T\right]\\
&= E\left[(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{\varepsilon}\mathbf{\varepsilon}^T\mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}\right]\\
&= \sigma^2(\mathbf{X}^T\mathbf{X})^{-1}
\end{align}
$$
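
One way to build intuition for this formula is a small Monte Carlo experiment: hold **X** fixed, redraw the noise many times, refit, and compare the empirical covariance of the estimates with $\sigma^2(\mathbf{X}^T\mathbf{X})^{-1}$. The sketch below assumes NumPy and Gaussian noise (the derivation itself only needs uncorrelated errors with constant variance); the sizes, seed, and names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
N, p, sigma = 30, 2, 0.5

# Fixed design: X is drawn once and then held constant across replications.
X = np.column_stack([np.ones(N), rng.normal(size=(N, p))])
beta = np.array([1.0, -2.0, 0.5])  # hypothetical true coefficients

# Each row of Y is one simulated response vector y = X beta + eps.
reps = 20000
eps = rng.normal(scale=sigma, size=(reps, N))
Y = X @ beta + eps  # shape (reps, N)

# Solve the normal equations for all replications at once.
betas = np.linalg.solve(X.T @ X, X.T @ Y.T).T  # shape (reps, p + 1)

empirical_cov = np.cov(betas, rowvar=False)
theoretical_cov = sigma**2 * np.linalg.inv(X.T @ X)
max_abs_diff = np.max(np.abs(empirical_cov - theoretical_cov))

print("largest absolute discrepancy:", max_abs_diff)
```

With this many replications the two matrices should agree closely, and the gap should shrink further as `reps` grows.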

One estimates the variance $\sigma^2$ by:
$$
\hat{\sigma}^2 = \frac{1}{N-p-1} \sum_{i=1}^N(y_i-\hat{y}_i)^2
$$

The $N-p-1$ rather than $N$ in the denominator makes $\hat{\sigma}^2$ an unbiased estimate of $\sigma^2$: $E(\hat{\sigma}^2)=\sigma^2$.
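
The effect of the denominator can be seen in simulation. The following sketch, assuming NumPy, Gaussian noise, and a deliberately small $N$ relative to $p$ so that the difference is visible, averages the residual sum of squares over many simulated data sets and compares division by $N-p-1$ with division by $N$. The names and settings are illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)
N, p, sigma = 15, 4, 1.0

# Fixed design with an intercept column; beta is an arbitrary choice.
X = np.column_stack([np.ones(N), rng.normal(size=(N, p))])
beta = rng.normal(size=p + 1)

# Simulate many response vectors from the same design.
reps = 20000
Y = X @ beta + rng.normal(scale=sigma, size=(reps, N))  # shape (reps, N)

# Residuals via the hat matrix; H is symmetric, so Y @ H gives the fitted rows.
H = X @ np.linalg.solve(X.T @ X, X.T)
residuals = Y - Y @ H
rss = np.sum(residuals**2, axis=1)

mean_unbiased = np.mean(rss / (N - p - 1))  # should be close to sigma^2
mean_biased = np.mean(rss / N)              # systematically too small

print("divide by N-p-1:", mean_unbiased)
print("divide by N    :", mean_biased)
```

Dividing by $N$ underestimates $\sigma^2$ by the factor $(N-p-1)/N$ on average, because fitting $p+1$ coefficients removes $p+1$ degrees of freedom from the residuals.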
