# $\S1$. Linear Regression

**Author**: [Gilyoung Cheong](https://www.linkedin.com/in/gycheong/)

Linear regression is a [supervised machine learning](https://en.wikipedia.org/wiki/Supervised_learning) algorithm. Given input data $\boldsymbol{x}_1, \dots, \boldsymbol{x}_m \in \mathbb{R}^n$ (one vector per feature, each holding $n$ observations) and output data $\boldsymbol{y} = (y_1, \dots, y_n) \in \mathbb{R}^n$, our goal is to find $\boldsymbol{\beta} = (\beta_0, \beta_1, \dots, \beta_m) \in \mathbb{R}^{m+1}$ such that $\boldsymbol{\hat{y}}^{\boldsymbol{\beta}} = (\hat{y}^{\boldsymbol{\beta}}_1, \dots, \hat{y}^{\boldsymbol{\beta}}_n) \in \mathbb{R}^n$ defined by
$$\hat{y}^{\boldsymbol{\beta}}_i := \beta_0 + \beta_1 x_{i1} + \cdots + \beta_m x_{im}$$
is the best possible approximation of $\boldsymbol{y}$ in the sense that the Euclidean norm $\|\boldsymbol{y} - \boldsymbol{\hat{y}}^{\boldsymbol{\beta}}\|$ is minimized, where $\boldsymbol{x}_j = (x_{1j}, \dots, x_{nj})$ for $1 \leq j \leq m$. Consider the following $n \times (m+1)$ matrix
$$X := \begin{bmatrix}
1 & x_{11} & x_{12} & \cdots & x_{1m} \\
1 & x_{21} & x_{22} & \cdots & x_{2m} \\
\vdots & \vdots & \vdots & \cdots & \vdots \\
1 & x_{n1} & x_{n2} & \cdots & x_{nm}
\end{bmatrix},$$
and note that $\boldsymbol{\hat{y}}^{\boldsymbol{\beta}} = X\boldsymbol{\beta}$. Hence, minimizing
$$\|\boldsymbol{y} - \boldsymbol{\hat{y}}^{\boldsymbol{\beta}}\| = \|\boldsymbol{y} - X\boldsymbol{\beta}\|$$
is equivalent to requiring that $X\boldsymbol{\beta}$ be the orthogonal projection of $\boldsymbol{y}$ onto the subspace of $\mathbb{R}^{n}$ spanned by the columns of $X$. This holds exactly when the residual $\boldsymbol{y} - X\boldsymbol{\beta}$ is orthogonal to every column of $X$, that is, when $X^T (\boldsymbol{y} - X\boldsymbol{\beta}) = \boldsymbol{0}$. This proves the following:


**Theorem (Linear Regression)**. The vectors $\boldsymbol{\beta} \in \mathbb{R}^{m+1}$ that minimize $\|\boldsymbol{y} - X\boldsymbol{\beta}\|$ are precisely the solutions to
$$X^T X \boldsymbol{\beta} = X^T \boldsymbol{y}.$$

In particular, if $X^T X$ is invertible, then
$$\boldsymbol{\beta} = (X^T X)^{-1} X^T \boldsymbol{y}$$
is the unique choice.
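
The theorem can be checked numerically. The following is a minimal sketch, assuming NumPy is available; the data values and the variable names (`features`, `beta_lstsq`) are made up for illustration. It solves the normal equations directly, confirms that the residual is orthogonal to the columns of $X$, and compares the result with NumPy's own least-squares solver.

```python
import numpy as np

# Hypothetical toy data: n = 6 observations, m = 2 features (values are made up).
features = np.array([
    [1.0, 2.0],
    [2.0, 0.5],
    [3.0, 1.5],
    [4.0, 3.0],
    [5.0, 2.5],
    [6.0, 4.0],
])  # row i holds (x_{i1}, x_{i2})
y = np.array([1.2, 1.9, 3.2, 4.1, 4.8, 6.3])

# Design matrix X: a leading column of ones, then one column per feature.
X = np.column_stack([np.ones(len(y)), features])

# Solve the normal equations X^T X beta = X^T y.
# This assumes X^T X is invertible, which holds for the data above.
beta = np.linalg.solve(X.T @ X, X.T @ y)

# The residual y - X beta should be orthogonal to every column of X,
# up to floating-point rounding.
residual = y - X @ beta
print(np.allclose(X.T @ residual, 0.0))

# Cross-check against NumPy's least-squares solver.
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.allclose(beta, beta_lstsq))
```

Using `np.linalg.solve` rather than forming $(X^TX)^{-1}$ explicitly is a common numerical choice; for ill-conditioned $X$, a solver such as `np.linalg.lstsq` is usually the safer option.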

**Definition**. We say $\boldsymbol{\hat{y}}^{\boldsymbol{\beta}} = X\boldsymbol{\beta}$, for any such minimizing $\boldsymbol{\beta}$, is a **linear regression** of $((\boldsymbol{x}_1, \dots, \boldsymbol{x}_m), \boldsymbol{y})$.
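
When $X^TX$ is not invertible, the normal equations have more than one solution, yet every solution yields the same vector $X\boldsymbol{\beta}$, because the orthogonal projection of $\boldsymbol{y}$ is unique. The sketch below illustrates this under made-up data in which one feature is exactly twice the other; the names `beta_a`, `beta_b`, and `null_vec` are hypothetical, and the explicit `rcond=1e-10` cutoff is an assumption chosen so that the zero singular value is discarded.

```python
import numpy as np

# Hypothetical data in which the second feature is exactly twice the first,
# so the columns of X are linearly dependent and X^T X is singular.
x1 = np.array([0.0, 1.0, 2.0, 3.0])
X = np.column_stack([np.ones(4), x1, 2.0 * x1])
y = np.array([1.0, 2.9, 5.1, 7.0])

# lstsq returns one minimizer (the one of smallest Euclidean norm).
# rcond=1e-10 treats singular values below that relative cutoff as zero.
beta_a, *_ = np.linalg.lstsq(X, y, rcond=1e-10)

# (0, 2, -1) lies in the null space of X, since 2 * x1 - (2 * x1) = 0,
# so adding it to beta_a produces a different minimizer.
null_vec = np.array([0.0, 2.0, -1.0])
beta_b = beta_a + null_vec

# Both vectors satisfy the normal equations and give the same fitted values.
print(np.allclose(X.T @ X @ beta_a, X.T @ y))
print(np.allclose(X.T @ X @ beta_b, X.T @ y))
print(np.allclose(X @ beta_a, X @ beta_b))
```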

### Special Case: Simple Linear Regression

A **simple linear regression** is a linear regression with a single feature for the input data (i.e., $m = 1$). Writing $\boldsymbol{x} = (x_1, \dots, x_n) \in \mathbb{R}^n$ for the input data, our matrix $X$ can be written as
$$X := \begin{bmatrix}
1 & x_{1} \\
1 & x_{2} \\
\vdots & \vdots \\
1 & x_{n}
\end{bmatrix}.$$
Note that
$$X^TX = \begin{bmatrix}
n & x_1 + \cdots + x_n \\
x_1 + \cdots + x_n & x_1^2 + \cdots + x_n^2
\end{bmatrix}$$
so that 
$$X^TX\boldsymbol{\beta} = \begin{bmatrix}
n \beta_0 + (x_1 + \cdots + x_n)\beta_1 \\
(x_1 + \cdots + x_n)\beta_0 + (x_1^2 + \cdots + x_n^2)\beta_1
\end{bmatrix}.$$
Equating this to
$$X^T\boldsymbol{y} = \begin{bmatrix}
y_1 + \cdots + y_n \\
x_1y_1 + \cdots + x_n y_n
\end{bmatrix},$$
we get
* $\beta_0 = \bar{\boldsymbol{y}} - \beta_1 \bar{\boldsymbol{x}}$ and
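* $(x_1 + \cdots + x_n)\beta_0 + (x_1^2 + \cdots + x_n^2)\beta_1 = x_1y_1 + \cdots + x_n y_n$,

where $\bar{\boldsymbol{x}} := (x_1 + \cdots + x_n)/n$ and $\bar{\boldsymbol{y}} := (y_1 + \cdots + y_n)/n$ denote the means of the input and output data. The first condition is the first row of the equation divided by $n$.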

Substituting the first condition into the second (to eliminate $\beta_0$), we get
$$((x_1^2 + \cdots + x_n^2) - (x_1 + \cdots + x_n)\bar{\boldsymbol{x}})\beta_1 = \sum_{i=1}^nx_i(y_i - \bar{\boldsymbol{y}}),$$
which can be rewritten as
$$\beta_1 \sum_{i=1}^n((x_i - \bar{\boldsymbol{x}})^2 + (x_i - \bar{\boldsymbol{x}})\bar{\boldsymbol{x}}) = \sum_{i=1}^nx_i(y_i - \bar{\boldsymbol{y}}).$$
Noting that $\sum_{i=1}^{n}(x_i - \bar{\boldsymbol{x}}) = 0 = \sum_{i=1}^{n}(y_i - \bar{\boldsymbol{y}}),$ the above is equivalent to
$$\beta_1 \sum_{i=1}^n(x_i - \bar{\boldsymbol{x}})^2 = \sum_{i=1}^n(x_i - \bar{\boldsymbol{x}})(y_i - \bar{\boldsymbol{y}}).$$

If we assume that 
* $n \geq 2$ and
* not all of $x_1, \dots, x_n$ are identical,

then $X^TX$ is invertible (as a $2 \times 2$ real matrix). Conversely, these two conditions are exactly what is needed for $X^TX$ to be invertible, and together they are equivalent to saying that
$$\sum_{i=1}^n(x_i - \bar{\boldsymbol{x}})^2 \neq 0.$$
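
One way to check this equivalence is to compute the determinant of $X^TX$ directly, using $x_1 + \cdots + x_n = n\bar{\boldsymbol{x}}$:
$$\det(X^TX) = n\sum_{i=1}^n x_i^2 - \Big(\sum_{i=1}^n x_i\Big)^2 = n\Big(\sum_{i=1}^n x_i^2 - n\bar{\boldsymbol{x}}^2\Big) = n\sum_{i=1}^n(x_i - \bar{\boldsymbol{x}})^2.$$
The right-hand side is nonzero exactly when the $x_i$ are not all equal, which in turn forces $n \geq 2$.
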
This proves the following:

**Theorem (Simple Linear Regression)**. The pairs $(\beta_0, \beta_1) \in \mathbb{R}^2$ that minimize
$$\sum_{i=1}^n(y_i - (\beta_0 + \beta_1 x_i))^2$$
are precisely the ones that satisfy:
* $\beta_0 = \bar{\boldsymbol{y}} - \beta_1 \bar{\boldsymbol{x}}$ and
* $\beta_1 \sum_{i=1}^n(x_i - \bar{\boldsymbol{x}})^2 = \sum_{i=1}^n(x_i - \bar{\boldsymbol{x}})(y_i - \bar{\boldsymbol{y}}).$

In particular, if $n \geq 2$ and not all of $x_1, \dots, x_n$ are identical, then there is a unique such pair $(\beta_0, \beta_1)$, with
$$\beta_1 = \frac{\sum_{i=1}^n(x_i - \bar{\boldsymbol{x}})(y_i - \bar{\boldsymbol{y}})}{\sum_{i=1}^n(x_i - \bar{\boldsymbol{x}})^2}.$$
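
The closed-form formulas translate directly into code. The following is a minimal sketch, assuming NumPy; the function name `simple_linear_regression` and the sample data are hypothetical, and the exact comparison with zero is a simplification that a more careful implementation would replace with a tolerance.

```python
import numpy as np

def simple_linear_regression(x, y):
    """Return (beta_0, beta_1) computed from the closed-form formulas."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_bar, y_bar = x.mean(), y.mean()

    # Denominator: sum of squared deviations of x from its mean.
    sxx = np.sum((x - x_bar) ** 2)
    if sxx == 0.0:
        # All x values coincide (or n = 1), so the minimizer is not unique.
        raise ValueError("need at least two distinct x values")

    beta_1 = np.sum((x - x_bar) * (y - y_bar)) / sxx
    beta_0 = y_bar - beta_1 * x_bar
    return beta_0, beta_1

# Hypothetical data lying exactly on the line y = 1 + 2x.
x = [0.0, 1.0, 2.0, 3.0]
y = [1.0, 3.0, 5.0, 7.0]
beta_0, beta_1 = simple_linear_regression(x, y)
print(beta_0, beta_1)
```

For this data the means are $\bar{\boldsymbol{x}} = 1.5$ and $\bar{\boldsymbol{y}} = 4$, the numerator is $10$ and the denominator is $5$, so the sketch recovers $\beta_1 = 2$ and $\beta_0 = 4 - 2 \cdot 1.5 = 1$.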