## Terms

### Explanatory variables
The independent variables (predictors) of a regression model.

### Collinearity
An exact linear relationship between two explanatory variables: $X_{2i} = \lambda_0 + \lambda_1 X_{1i}$ for every observation $i$.

### Multicollinearity
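Two or more explanatory variables are highly linearly related. In the exact (perfect) case there are constants $\lambda_0, \dots, \lambda_k$, not all zero, such that $\lambda_0 + \lambda_1 X_{1i} + \lambda_2 X_{2i} + \dots + \lambda_k X_{ki} = 0$ for every observation $i$.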


### Degrees of freedom
The minimum number of independent coordinates needed to specify the position of a system completely.

See the examples in the "Of random vectors" section of [Degrees of freedom (statistics)](https://en.wikipedia.org/wiki/Degrees_of_freedom_(statistics)).

### Singular matrix
A square matrix that has no inverse; equivalently, its determinant is zero.
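
To connect these terms, here is a minimal NumPy sketch in which one explanatory variable is an exact linear function of another. The data are made up for illustration. Because the columns of the design matrix are then linearly dependent, the matrix $X^T X$ loses rank and is singular.

```python
import numpy as np

# Made-up data: x2 is an exact linear function of x1 (x2 = 1 + 2 * x1)
x1 = np.array([1.0, 3.0, 0.0, 1.0, -1.0])
x2 = 1.0 + 2.0 * x1

# Design matrix with a column of ones for the intercept
X = np.column_stack([np.ones_like(x1), x1, x2])
gram = X.T @ X

# Only 2 of the 3 columns are linearly independent
rank = np.linalg.matrix_rank(gram)
print('rank of X^T X:', rank)

# A singular matrix has a determinant that is zero up to rounding error
print('det of X^T X:', np.linalg.det(gram))
```

The rank is smaller than the number of columns, so this $3 \times 3$ matrix cannot be inverted.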



## Ordinary Least Squares

Compared with the regularized methods below, OLS has lower bias but higher variance.

$Y_i = w_0 + w_1 X_{1i} + \dots + w_k X_{ki}$

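Stack the observations into the design matrix $$X = \begin{bmatrix}
    1       & X_{11} & \ldots & X_{k1} \\
    \vdots  & \vdots &        & \vdots \\
    1       & X_{1N} & \ldots & X_{kN}
    \end{bmatrix}$$
whose first column of ones corresponds to the intercept $w_0$. The solution below depends on the matrix $X^T X$.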
    
$k$: number of explanatory variables

$N$: number of observations ($N \geq k + 1$)

Details (video in Chinese): [Least Squares Linear Regression, from Getting Started to Giving Up :-)](https://www.bilibili.com/video/av13759873)

If $X$ has full column rank, $w$ has an explicit solution in matrix form: $$w^* = (X^T X)^{-1} X^T y$$
The values fitted by OLS are then: $$\hat{y} = X w^*$$
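
The closed form follows from setting the gradient of the squared error to zero. A short derivation in the notation above, writing $L(w)$ for the sum of squared residuals (a symbol introduced here for convenience):

$$\begin{aligned}
L(w) &= \lVert X w - y \rVert^2 = (X w - y)^T (X w - y) \\
\nabla_w L(w) &= 2 X^T (X w - y) = 0 \\
X^T X w &= X^T y \\
w^* &= (X^T X)^{-1} X^T y
\end{aligned}$$

The last step requires $X^T X$ to be invertible, which holds exactly when $X$ has full column rank.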


In [1]:
import numpy as np

def f(x1, x2):
    # True relationship used to generate y, i.e. w = [2, 3, -1]
    return 2 + 3 * x1 - x2

x1 = [1, 3, 0, 1, -1]
x2 = [2, 4, 1, 0, 1]
y = np.array([f(a, b) for a, b in zip(x1, x2)])

# Design matrix: a column of ones (intercept), then x1 and x2
X = np.zeros((5, 3))
X[:, 0] = 1
X[:, 1] = x1
X[:, 2] = x2
print('X', X, sep='\n')

# Closed-form OLS solution: w* = (X^T X)^{-1} X^T y
w_opt = np.linalg.inv(X.T @ X) @ X.T @ y
print('w_opt', w_opt, sep='\n')


X
[[ 1.  1.  2.]
 [ 1.  3.  4.]
 [ 1.  0.  1.]
 [ 1.  1.  0.]
 [ 1. -1.  1.]]
w_opt
[ 2.  3. -1.]
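
Explicitly inverting $X^T X$ mirrors the formula, but it can be numerically fragile when the matrix is nearly singular. As a sketch of an alternative, `np.linalg.lstsq` solves the same least-squares problem without forming the inverse. The data repeat the example above; the variable name `w_lstsq` is chosen here for illustration.

```python
import numpy as np

x1 = np.array([1, 3, 0, 1, -1], dtype=float)
x2 = np.array([2, 4, 1, 0, 1], dtype=float)
y = 2 + 3 * x1 - x2

# Design matrix: intercept column, then x1 and x2
X = np.column_stack([np.ones_like(x1), x1, x2])

# lstsq returns (solution, residuals, rank, singular values)
w_lstsq, residuals, rank, _ = np.linalg.lstsq(X, y, rcond=None)
print('w_lstsq', w_lstsq, sep='\n')
print('rank of X:', rank)
```

Since `y` is generated without noise, the solution should agree with the coefficients 2, 3 and -1 up to rounding error.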


## Ridge Regression
Ridge regression addresses the case where $X^T X$ is singular or nearly singular: add a small **positive** constant $\lambda$ to the diagonal entries of $X^T X$ before inverting it. [Regularization Part 1: Ridge Regression](https://www.youtube.com/watch?v=Q81RR3yKn30)


Compared with OLS:

+ higher bias, lower variance
+ makes $y$ less sensitive to the explanatory variables by shrinking the coefficients (slopes) toward zero
+ works with smaller samples: OLS needs at least as many observations as coefficients ($N \geq k + 1$), while ridge regression has no such constraint, which can improve the generalization ability of the model

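The ridge estimator is $$\beta^{ridge} = (X^T X + \lambda I_p)^{-1} X^T y$$
where $\beta$ plays the role of the coefficient vector $w$ above and $I_p$ is the $p \times p$ identity matrix, with $p = k + 1$ the number of columns of $X$.

For comparison, the OLS estimator is $$\beta^{ols} = (X^T X)^{-1} X^T y$$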

The ridge estimator $\beta^{ridge}$ can be seen as the solution to $$\underset{\beta \in \mathbb{R}^p}{\text{minimize}} \; \lVert X \beta - y \rVert_2^2 + \lambda \lVert \beta \rVert_2^2$$
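
A minimal sketch of the closed-form ridge estimator, assuming made-up data with fewer observations than coefficients, so that $X^T X$ is singular and the OLS formula cannot be applied. The helper name `ridge_fit` and the values of `lam` are illustrative choices. As in the formula above, every coefficient is penalized, including the one for the column of ones.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge estimate: (X^T X + lam * I)^{-1} X^T y."""
    p = X.shape[1]
    # Solve the linear system instead of forming the inverse explicitly
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# Made-up data: N = 2 observations, p = 3 coefficients
X = np.array([[1.0, 1.0, 2.0],
              [1.0, 3.0, 4.0]])
y = np.array([3.0, 7.0])

# X^T X is 3x3 but has rank at most 2, so it is singular
gram_rank = np.linalg.matrix_rank(X.T @ X)
print('rank of X^T X:', gram_rank)

beta_small = ridge_fit(X, y, lam=0.1)
beta_large = ridge_fit(X, y, lam=10.0)
print('lam = 0.1 :', beta_small)
print('lam = 10.0:', beta_large)

# A larger lam shrinks the coefficient vector toward zero
print(np.linalg.norm(beta_large) < np.linalg.norm(beta_small))
```

Adding `lam * np.eye(p)` makes the matrix positive definite for any `lam > 0`, which is why the system can be solved even though $X^T X$ alone is singular.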

### Lasso Regression
Similar to Ridge Regression, except that its penalty is the sum of absolute values of the coefficients, $\lambda \lVert \beta \rVert_1 = \lambda \sum_j \lvert \beta_j \rvert$. [Regularization Part 2: Lasso Regression](https://www.youtube.com/watch?v=NGf0voTMlcs)

+ Ridge Regression does not exclude useless variables, while Lasso Regression does by setting their coefficients to zero


The lasso estimator $\beta^{lasso}$ can be seen as a solution to $$\underset{\beta \in \mathbb{R}^p}{\text{minimize}} \; \lVert X \beta - y \rVert_2^2 + \lambda \lVert \beta \rVert_1$$
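
Unlike ridge, lasso has no closed-form solution in general. One common approach is coordinate descent with soft-thresholding; the sketch below applies it to the objective above. The helper names `soft_threshold` and `lasso_cd`, the data, and the value of `lam` are all made up for illustration, and the code assumes no column of `X` is all zeros.

```python
import numpy as np

def soft_threshold(rho, t):
    """Shrink rho toward zero by t, returning exactly 0 when |rho| <= t."""
    return np.sign(rho) * max(abs(rho) - t, 0.0)

def lasso_cd(X, y, lam, n_sweeps=100):
    """Coordinate descent for ||X b - y||^2 + lam * ||b||_1."""
    n, p = X.shape
    beta = np.zeros(p)
    for _ in range(n_sweeps):
        for j in range(p):
            # Residual with the contribution of coordinate j removed
            r_j = y - X @ beta + X[:, j] * beta[j]
            rho = X[:, j] @ r_j
            z = X[:, j] @ X[:, j]
            # The threshold is lam / 2 because the squared-error term
            # in this objective carries no factor of 1/2
            beta[j] = soft_threshold(rho, lam / 2.0) / z
    return beta

# Made-up data: y depends strongly on column 0 and only weakly on column 1
X = np.array([[1.0,  1.0],
              [1.0, -1.0],
              [1.0,  1.0],
              [1.0, -1.0]])
y = 3.0 * X[:, 0] + 0.1 * X[:, 1]
lam = 2.0

beta_lasso = lasso_cd(X, y, lam)
# Ridge with the same lam, for comparison
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(2), X.T @ y)
print('lasso:', beta_lasso)
print('ridge:', beta_ridge)
```

On this small orthogonal design, lasso sets the weak second coefficient to exactly zero, while ridge only shrinks it to a small nonzero value.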


### Elastic Net Regression
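Combines the Ridge and Lasso penalties, each with its own weight: $$\underset{\beta \in \mathbb{R}^p}{\text{minimize}} \; \lVert X \beta - y \rVert_2^2 + \lambda_1 \lVert \beta \rVert_1 + \lambda_2 \lVert \beta \rVert_2^2$$ [Regularization Part 3: Elastic Net Regression](https://www.youtube.com/watch?v=1dKRdX9bfIo)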
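
A minimal coordinate-descent sketch for elastic net, assuming the objective $\lVert X \beta - y \rVert_2^2 + \lambda_1 \lVert \beta \rVert_1 + \lambda_2 \lVert \beta \rVert_2^2$. This parameterization is an assumption made here; libraries often scale the terms differently. The helper name `elastic_net_cd` and the data are made up for illustration.

```python
import numpy as np

def elastic_net_cd(X, y, lam1, lam2, n_sweeps=100):
    """Coordinate descent for ||X b - y||^2 + lam1 * ||b||_1 + lam2 * ||b||^2."""
    n, p = X.shape
    beta = np.zeros(p)
    for _ in range(n_sweeps):
        for j in range(p):
            # Residual with the contribution of coordinate j removed
            r_j = y - X @ beta + X[:, j] * beta[j]
            rho = X[:, j] @ r_j
            z = X[:, j] @ X[:, j]
            # Lasso part: soft-threshold rho at lam1 / 2
            shrunk = np.sign(rho) * max(abs(rho) - lam1 / 2.0, 0.0)
            # Ridge part: lam2 enlarges the denominator
            beta[j] = shrunk / (z + lam2)
    return beta

# Made-up data: y depends strongly on column 0 and only weakly on column 1
X = np.array([[1.0,  1.0],
              [1.0, -1.0],
              [1.0,  1.0],
              [1.0, -1.0]])
y = 3.0 * X[:, 0] + 0.1 * X[:, 1]

beta_en = elastic_net_cd(X, y, lam1=2.0, lam2=2.0)
beta_lasso_only = elastic_net_cd(X, y, lam1=2.0, lam2=0.0)
beta_ridge_only = elastic_net_cd(X, y, lam1=0.0, lam2=2.0)
print('elastic net:', beta_en)
print('lam2 = 0   :', beta_lasso_only)
print('lam1 = 0   :', beta_ridge_only)
```

Setting `lam2 = 0` reduces the update to the lasso case and `lam1 = 0` to the ridge case, which is one way to see elastic net as a blend of the two.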