## Linear Regression

For example, suppose we have a set of customer data with the following features:

```
|----|----feature---|-------data---------|
| x1 |          age |         23   years |
| x2 |       salary |  1,000,000   NTD   |
| x3 | years in job |          0.5 years |
| x4 |         debt |    200,000   NTD   |
```

features of customer: $ \vec{x} = \big( x_0, x_1, x_2, \cdots, x_d \big) $

$ x_0 $ is a constant term that we add ourselves (conventionally $ x_0 = 1 $).

Assign a weight $ w_i $ to each feature, multiply each feature by its weight, and sum the results to obtain the credit limit y.

$$
y \approx \sum_{i=0}^{d} w_i x_i
$$

Linear regression hypothesis: $ h(x) = w^T x $
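
As a minimal sketch of the hypothesis above, the snippet below evaluates $ h(x) = w^T x $ for the example customer. The weight values are hypothetical and chosen only for illustration; they are not learned from any data, and the choice $ x_0 = 1 $ is the usual convention.

```python
import numpy as np

# Feature vector for the example customer, with the constant x_0 = 1 prepended.
# Order: [x_0, age, salary (NTD), years in job, debt (NTD)]
x = np.array([1.0, 23.0, 1_000_000.0, 0.5, 200_000.0])

# Hypothetical weights, for illustration only (not learned from data).
w = np.array([10_000.0, 500.0, 0.2, 3_000.0, -0.1])


def h(w, x):
    """Linear regression hypothesis h(x) = w^T x."""
    return float(w @ x)


credit_limit = h(w, x)
print(credit_limit)
```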

Linear Regression: find lines (x in 1D) or hyperplanes (x in 2D or higher) with small residuals.

### Error Measure

Squared Error: $ \text{err }( \hat{y}, y) = ( \hat{y} - y)^2 $

$$
\begin{aligned}
E_{in}(w) & = \frac{1}{N} \sum_{n=1}^N \big( h(x_n) - y_n \big)^2 \\
          & = \frac{1}{N} \sum_{n=1}^N \big( w^T x_n - y_n \big)^2
\end{aligned}
$$

$ 
E_{out}(w) = \mathcal{E}_{(x,y) \sim P} \big( w^T x - y \big)^2
$

How to minimize $ E_{in} $ ?

$ 
E_{in}(w) = \frac{1}{N} \sum_{n=1}^N \big( w^T x_n - y_n \big)^2 = \frac{1}{N} \sum_{n=1}^N \big( x_n^T w - y_n \big)^2
$

$
= \frac{1}{N} \begin{Vmatrix} 
\vec{x}_1^T \vec{w} - y_1 \\
\vec{x}_2^T \vec{w} - y_2 \\
\vdots \\
\vec{x}_N^T \vec{w} - y_N
\end{Vmatrix}^2
$

$ = \frac{1}{N} 
\Big\Vert
\begin{bmatrix}
\vec{x}_1^T \\ \vec{x}_2^T \\ \vdots \\ \vec{x}_N^T
\end{bmatrix} \vec{w}
- \begin{bmatrix}
y_1 \\ y_2 \\ \vdots \\ y_N
\end{bmatrix}
\Big\Vert^2 $, where the vectors $ \vec{x}_n^T $ are stacked as rows into the matrix $ X $ and the values $ y_n $ are stacked into the vector $ \vec{y} $.

$ E_{in}(w) = \frac{1}{N} 
\Big\Vert
X \vec{w} - \vec{y}
\Big\Vert^2 $
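
The equivalence between the per-sample sum and the matrix form can be checked numerically. The sketch below uses random synthetic data (an assumption, purely for illustration) and the hypothetical helper names `e_in_sum` and `e_in_matrix`.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 50, 3

# X has a leading column of ones for x_0, so its shape is (N, d+1).
X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, d))])
y = rng.normal(size=N)
w = rng.normal(size=d + 1)


def e_in_sum(X, y, w):
    """E_in as the average of per-sample squared errors."""
    return sum((w @ x_n - y_n) ** 2 for x_n, y_n in zip(X, y)) / len(y)


def e_in_matrix(X, y, w):
    """E_in as (1/N) * ||Xw - y||^2."""
    residual = X @ w - y
    return residual @ residual / len(y)


print(e_in_sum(X, y, w), e_in_matrix(X, y, w))
```

The two printed values should agree up to floating-point rounding.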

$ E_{in}(w) $ is continuous, differentiable, and convex.

The minimum therefore occurs where the gradient $ \nabla E_{in}(w) $ is zero,

that is, the best w satisfies:

$$
\nabla E_{in}(w) \equiv 
\begin{bmatrix}
\frac{\partial E_{in}}{\partial w_0}(w) \\
\frac{\partial E_{in}}{\partial w_1}(w) \\
\vdots \\
\frac{\partial E_{in}}{\partial w_d}(w)
\end{bmatrix}
=
\begin{bmatrix}
0 \\ 0 \\ \vdots \\ 0
\end{bmatrix}
$$

So the goal is to find the optimal linear regression solution $ w_{LIN} $ such that $ \nabla E_{in}(w_{LIN}) = 0 $.

$
E_{in}(w) = \frac{1}{N} \Big\Vert X w - y \Big\Vert^2 = \frac{1}{N} \Big( w^T X^T X w - 2 w^T X^T y + y^T y \Big)
$

### The Gradient $ \nabla E_{in}(w) $

Let:  
$ X^T X = A $ (a matrix)  
$ X^T y = \vec{b} $ (a vector)  
$ y^T y = c $ (a scalar constant)

$ E_{in}(w) = \frac{1}{N} \Big( w^T A w - 2 w^T \vec{b} + c \Big) $

$ \nabla E_{in}(w) = \frac{1}{N} \Big( 2 A w - 2 \vec{b} \Big) = \frac{2}{N} \Big( X^T X w - X^T y \Big) = 0 $
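
The gradient formula above can be sanity-checked against finite differences. This is a sketch on random synthetic data; the function names and the step size `eps` are illustrative assumptions.

```python
import numpy as np


def e_in(X, y, w):
    """E_in(w) = (1/N) * ||Xw - y||^2."""
    r = X @ w - y
    return r @ r / len(y)


def grad_e_in(X, y, w):
    """Analytic gradient (2/N) * (X^T X w - X^T y)."""
    return 2.0 / len(y) * (X.T @ X @ w - X.T @ y)


def numeric_grad(X, y, w, eps=1e-6):
    """Central finite differences, one coordinate of w at a time."""
    g = np.zeros_like(w)
    for i in range(len(w)):
        step = np.zeros_like(w)
        step[i] = eps
        g[i] = (e_in(X, y, w + step) - e_in(X, y, w - step)) / (2 * eps)
    return g


rng = np.random.default_rng(0)
X = np.hstack([np.ones((25, 1)), rng.normal(size=(25, 3))])
y = rng.normal(size=25)
w = rng.normal(size=4)

analytic = grad_e_in(X, y, w)
numeric = numeric_grad(X, y, w)
print(np.max(np.abs(analytic - numeric)))
```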

$ w_{LIN} = \Big( X^T X \Big)^{-1} X^T y $

Pseudo-Inverse $ X^{\dagger} = \Big( X^T X \Big)^{-1} X^T $

- Case: invertible $ X^T X $, unique solution
- Case: singular $ X^T X $, possibly many solutions
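
Both cases can be illustrated with NumPy. In this sketch the data are random and the singular case is forced by duplicating a column (an assumption made only for the demonstration). `np.linalg.pinv` computes the Moore-Penrose pseudo-inverse via SVD, which coincides with $ (X^T X)^{-1} X^T $ when $ X^T X $ is invertible and still returns one valid solution when it is not.

```python
import numpy as np

rng = np.random.default_rng(1)
N, d = 30, 2
X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, d))])
y = rng.normal(size=N)

# Invertible case: solve the normal equations X^T X w = X^T y directly.
w_normal = np.linalg.solve(X.T @ X, X.T @ y)
# Same solution through the SVD-based pseudo-inverse.
w_pinv = np.linalg.pinv(X) @ y

# Singular case: duplicate a column so that X^T X is not invertible.
X_sing = np.hstack([X, X[:, [1]]])
w_sing = np.linalg.pinv(X_sing, rcond=1e-10) @ y

# The gradient still vanishes at w_sing, so it is one of the many minimizers.
grad_sing = 2.0 / N * (X_sing.T @ X_sing @ w_sing - X_sing.T @ y)
print(np.max(np.abs(grad_sing)))
```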

### Linear Regression Algorithm

#### STEP 1: From D, construct input matrix X and output vector y.

X size: $ N \times (d+1) $  
y length: N

$$
X = \begin{bmatrix}
\vec{x}_1^T \\ \vec{x}_2^T \\ \vdots \\ \vec{x}_N^T
\end{bmatrix}
$$

$$
y = \begin{bmatrix}
y_1 \\ y_2 \\ \vdots \\ y_N
\end{bmatrix}
$$

#### STEP 2: Calculate pseudo-inverse matrix of X

$$
X^{\dagger} = \Big( X^T X \Big)^{-1} X^T
$$

#### STEP 3:  return w

$$
w_{LIN} = X^{\dagger} y
$$
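
The three steps can be sketched in a few lines of NumPy. The function name `linear_regression` and the toy data are hypothetical; the pseudo-inverse is computed with `np.linalg.pinv` rather than by explicitly inverting $ X^T X $, which is a numerically safer substitute for the formula in STEP 2.

```python
import numpy as np


def linear_regression(data_x, data_y):
    """Return w_LIN for raw inputs of shape (N, d) and targets of shape (N,)."""
    data_x = np.asarray(data_x, dtype=float)
    y = np.asarray(data_y, dtype=float)
    # STEP 1: prepend x_0 = 1 to every input to build X of shape (N, d+1).
    X = np.hstack([np.ones((data_x.shape[0], 1)), data_x])
    # STEP 2: pseudo-inverse of X (NumPy computes it via SVD).
    X_dagger = np.linalg.pinv(X)
    # STEP 3: w_LIN = X^dagger y.
    return X_dagger @ y


# Toy data generated exactly by y = 1 + 2*x1 - 3*x2 (no noise).
inputs = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]])
targets = 1.0 + 2.0 * inputs[:, 0] - 3.0 * inputs[:, 1]

w_lin = linear_regression(inputs, targets)
print(w_lin)
```

Because the toy targets contain no noise, the recovered weights should match the generating coefficients up to rounding.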

Once the optimal solution $ w_{LIN} $ is obtained, predictions follow as $ \hat{y}_n = w_{LIN}^T x_n $.

Substituting the formula above also gives the matrix form of $ \hat{y} $:

$
\begin{bmatrix}
\hat{y}_1 \\ \hat{y}_2 \\ \vdots \\ \hat{y}_N
\end{bmatrix}
 = X X^{\dagger} y 
$

Here $ X^{\dagger} $ and $ y $ come from the in-sample data,

and $ X $ is the prediction input formed by stacking the $ x_n $ as rows.
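
A short check that the row-by-row predictions $ w_{LIN}^T x_n $ agree with the matrix form $ X X^{\dagger} y $. The data here are random and serve only as an illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
N, d = 20, 3
X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, d))])
y = rng.normal(size=N)

X_dagger = np.linalg.pinv(X)
w_lin = X_dagger @ y

# Predictions computed one sample at a time: w_LIN^T x_n.
y_hat_rowwise = np.array([w_lin @ x_n for x_n in X])
# Predictions computed in one shot: X X^dagger y.
y_hat_matrix = X @ X_dagger @ y

print(np.max(np.abs(y_hat_rowwise - y_hat_matrix)))
```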

### Simpler-than-VC Guarantee

TO SHOW: average of $ E_{in} = \mathcal{E}_{D \sim P^N} \Big \{ E_{in} (w_{LIN} \text{ w.r.t. } D) \Big \} = \text{noise level} \times \Big( 1 - \frac{d+1}{N} \Big) $

d+1: degrees of freedom (the number of weights w)  
N: number of in-sample data points  
$ \hat{y} $: predictions  
call "$ X X^\dagger $" : the "hat matrix H" because it puts ^ on y.

$
E_{in}(w_{LIN}) = \frac{1}{N}
\begin{Vmatrix}
y - \hat{y}
\end{Vmatrix}^2
$

$ = \frac{1}{N} \begin{Vmatrix} y - X X^\dagger y \end{Vmatrix}^2 $

$ = \frac{1}{N} \begin{Vmatrix} \big( I - X X^\dagger \big) y \end{Vmatrix}^2 $

In the $ N $-dimensional space $ \mathbb{R}^N $:  
$ X w $: for any w, this is a linear combination of the columns of X, so it lies in the span of X  
$ \hat{y} = X w_{LIN} $ is the point in the span of X that is closest to y  
The residual vector $ (y - \hat{y}) $ is smallest when $ (y - \hat{y}) \perp $ span of X  
hat matrix H: projects y to $ \hat{y} \in $ span of X  
I - H: transforms y to $ (y - \hat{y}) \perp $ span of X  

claim: trace(I - H) = N - (d + 1)
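
The claim and the projection properties can be checked numerically. The sketch assumes random data for which $ X $ has full column rank, so that $ X^T X $ is invertible; the variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(4)
N, d = 12, 3
X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, d))])
y = rng.normal(size=N)

H = X @ np.linalg.pinv(X)   # hat matrix, shape (N, N)
I = np.eye(N)

# Residual y - y_hat obtained by applying I - H to y.
residual = (I - H) @ y

trace_I_minus_H = np.trace(I - H)
print(trace_I_minus_H, N - (d + 1))

# H is a projection, so applying it twice changes nothing.
print(np.allclose(H @ H, H))
# The residual is orthogonal to every column of X.
print(np.max(np.abs(X.T @ residual)))
```

The trace follows from $ \text{trace}(X (X^T X)^{-1} X^T) = \text{trace}((X^T X)^{-1} X^T X) = \text{trace}(I_{d+1}) = d + 1 $.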

average of $ E_{in} = \text{ noise level } \times \big( 1 - \frac{d+1}{N} \big) $

average of $ E_{out} = \text{ noise level } \times \big( 1 + \frac{d+1}{N} \big) $

both converge to $ \sigma^2 $ (noise level) for $ N \to \infty $

expected generalization error (average $ E_{out} $ minus average $ E_{in} $): $ \text{noise level} \times \frac{2(d+1)}{N} $  
similar to worst-case guarantee from VC.
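
The two averages can be illustrated with a small simulation. This is a sketch under stated assumptions: Gaussian inputs, Gaussian noise with variance `sigma2`, a hypothetical target `w_true`, and, as a simplification, the out-of-sample error is measured on the same inputs with freshly drawn noise, so `avg_e_test` is a stand-in for $ E_{out} $ rather than an exact estimate of it.

```python
import numpy as np

rng = np.random.default_rng(3)
N, d, sigma2, trials = 40, 4, 0.25, 2000
w_true = rng.normal(size=d + 1)   # hypothetical target weights

e_in_vals, e_test_vals = [], []
for _ in range(trials):
    X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, d))])
    signal = X @ w_true
    y = signal + rng.normal(scale=np.sqrt(sigma2), size=N)
    w_lin = np.linalg.pinv(X) @ y
    # Fresh noise on the same inputs (simplified stand-in for E_out).
    y_fresh = signal + rng.normal(scale=np.sqrt(sigma2), size=N)
    e_in_vals.append(np.mean((X @ w_lin - y) ** 2))
    e_test_vals.append(np.mean((X @ w_lin - y_fresh) ** 2))

avg_e_in = np.mean(e_in_vals)
avg_e_test = np.mean(e_test_vals)
predicted_e_in = sigma2 * (1 - (d + 1) / N)
predicted_e_test = sigma2 * (1 + (d + 1) / N)

print(avg_e_in, predicted_e_in)
print(avg_e_test, predicted_e_test)
```

With enough trials, the simulated averages should land close to the predicted values, and their gap should be close to $ \text{noise level} \times \frac{2(d+1)}{N} $.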

![img](./imgs/c09-learningcurves.png)