## Terms

### Explanatory variables
Also called independent variables: the inputs $X_1, \dots, X_k$ used to predict the response $Y$.
### Collinearity
A linear association between two explanatory variables: $X_{2i} = \lambda_0 + \lambda_1 X_{1i}$

### Multicollinearity
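Two or more explanatory variables are highly linearly related. In the perfect case there are constants $\lambda_0, \dots, \lambda_k$, not all zero, such that $\lambda_0 + \lambda_1 X_{1i} + \lambda_2 X_{2i} + \dots + \lambda_k X_{ki} = 0$ for every observation $i$.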


### Degrees of freedom
The minimum number of independent coordinates that can specify the position of the system completely.

See examples in the "Of random vectors" section from [Degrees of freedom(statistics)](https://en.wikipedia.org/wiki/Degrees_of_freedom_(statistics))
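
As a small illustration of the idea (a sketch with made-up numbers, not taken from the linked article): the deviations of $n$ values from their sample mean always sum to zero, so only $n - 1$ of them can vary freely.

```python
import numpy as np

# Hypothetical sample of n = 3 values
x = np.array([2.0, 4.0, 9.0])
residuals = x - x.mean()              # deviations from the sample mean

# The deviations sum to zero, so the last one is fixed by the others:
# only n - 1 = 2 of them are free to vary.
last_from_others = -residuals[:-1].sum()

print('residuals:', residuals)
print('last residual recovered from the others:', last_from_others)
```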

### Singular matrix
A square matrix that has no inverse; equivalently, its determinant is zero.
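
The terms above connect as follows: when one explanatory variable is an exact linear function of another, the columns of the design matrix are linearly dependent and $X^T X$ becomes singular. A minimal NumPy sketch, using hypothetical data in which `x2` is built from `x1`:

```python
import numpy as np

x1 = np.array([1.0, 3.0, 0.0, 1.0, -1.0])
x2 = 1.0 + 2.0 * x1                   # hypothetical: exact linear function of x1

# Design matrix with an intercept column
X = np.column_stack([np.ones_like(x1), x1, x2])
gram = X.T @ X

rank = np.linalg.matrix_rank(gram)
print('rank of X^T X:', rank)         # below 3 means X^T X is singular
print('determinant:', np.linalg.det(gram))
```

The determinant printed here is zero up to floating-point rounding, which is why a rank check is usually a more reliable test than comparing the determinant with zero.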



## Ordinary Least Squares

Compared with the regularized methods below, OLS has lower bias but higher variance.

$Y_i = w_0 + w_1 X_{1i} + \dots + w_k X_{ki}$

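The solution is built from $X^T X$, where $X$ is the design matrix with one row per observation and a leading column of ones for the intercept $w_0$: $$X = \begin{bmatrix}
    1       & X_{11} & \ldots & X_{k1} \\
    \vdots  & \vdots &        & \vdots \\
    1       & X_{1N} & \ldots & X_{kN}
    \end{bmatrix}$$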
    
$k$: number of explanatory variables

$N$: number of observations. ($N \geq k + 1$)

Details: [Least-squares linear regression, from getting started to giving up :-)](https://www.bilibili.com/video/av13759873)

If $X$ has full column rank, $w$ has an explicit solution in matrix form: $$w^* = (X^T X)^{-1} X^T y$$
The values fitted by OLS are then: $$\hat{y} = X w^*$$


In [3]:
import numpy as np

def f(x1, x2):
    return 2 +  3 * x1 - x2

x1 = [1, 3, 0, 1, -1]
x2 = [2, 4, 1, 0, 1]
y = [f(x11, x22) for x11, x22 in zip(x1, x2)]

X = np.zeros((5, 3))
X[:, 0] = 1
X[:, 1] = x1
X[:, 2] = x2
print('X', X, sep='\n')

# Closed-form OLS solution: w* = (X^T X)^{-1} X^T y
w_opt = np.linalg.inv(X.T @ X) @ X.T @ y
print('w_opt', w_opt, sep='\n')


X
[[ 1.  1.  2.]
 [ 1.  3.  4.]
 [ 1.  0.  1.]
 [ 1.  1.  0.]
 [ 1. -1.  1.]]
w_opt
[ 2.  3. -1.]
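
The explicit inverse above mirrors the formula, but in practice a least-squares solver is often preferred because it avoids forming $(X^T X)^{-1}$. The sketch below rebuilds the same data and uses `np.linalg.lstsq`; it is an alternative way to get the same weights, not the method used in the cell above.

```python
import numpy as np

# Same data as the cell above
x1 = np.array([1, 3, 0, 1, -1], dtype=float)
x2 = np.array([2, 4, 1, 0, 1], dtype=float)
y = 2 + 3 * x1 - x2

X = np.column_stack([np.ones_like(x1), x1, x2])

# lstsq minimizes ||X w - y||^2 directly; rcond=None selects the default cutoff
w_lstsq, residuals, rank, singular_values = np.linalg.lstsq(X, y, rcond=None)
print('w_lstsq', w_lstsq, sep='\n')
print('rank of X:', rank)
```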


## Ridge Regression
Ridge Regression fixes the case where $X^T X$ is singular or nearly singular: add a small **positive** constant $\lambda$ to the diagonal entries of $X^T X$ before taking its inverse. [Regularization Part 1: Ridge Regression](https://www.youtube.com/watch?v=Q81RR3yKn30)


+ higher bias, lower variance than OLS
+ makes $y$ less sensitive to $x$ by shrinking the slope
+ works with smaller samples: OLS needs at least as many observations as there are columns in $X$, while Ridge Regression has no such constraint, which improves the generalization ability of the model
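
The last point can be sketched with hypothetical data that has fewer observations than columns. Here `ridge` is an illustrative helper, not a library function, and like the formula in this section it penalizes the intercept column too.

```python
import numpy as np

# Hypothetical data: 2 observations, 3 columns (intercept + 2 variables)
X = np.array([[1.0, 1.0, 2.0],
              [1.0, 3.0, 4.0]])
y = np.array([3.0, 7.0])

gram = X.T @ X
print('rank of X^T X:', np.linalg.matrix_rank(gram))  # 2 < 3, so OLS has no unique solution

def ridge(X, y, lam):
    """Solve (X^T X + lam * I) beta = X^T y for beta."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# Larger lambda shrinks the coefficients (lower variance, higher bias)
for lam in (0.01, 1.0, 100.0):
    beta = ridge(X, y, lam)
    print('lambda =', lam, 'beta =', beta, 'norm =', np.linalg.norm(beta))
```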

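The ridge estimator is $$\beta^{ridge} = (X^T X + \lambda I_p)^{-1} X^T y$$ where $\beta$ plays the role of $w$ above and $p = k + 1$ is the number of columns of $X$.

For comparison, the OLS estimator is $$\beta^{ols} = (X^T X)^{-1} X^T y$$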

The ridge estimator $\beta^{ridge}$ can be seen as the solution to $$\underset{\beta \in \mathbb{R}^p}{\text{minimize}} \lVert X \beta - y \rVert_2^2 + \lambda \lVert \beta \rVert_2^2$$
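
To see why the two forms agree, one can set the gradient of the penalized objective to zero (a short derivation sketch, assuming the norms are Euclidean):

$$\nabla_\beta \left( \lVert X \beta - y \rVert^2 + \lambda \lVert \beta \rVert^2 \right) = 2 X^T (X \beta - y) + 2 \lambda \beta = 0 \quad \Longrightarrow \quad (X^T X + \lambda I_p) \beta = X^T y$$

For $\lambda > 0$ the matrix $X^T X + \lambda I_p$ is positive definite and therefore invertible, even when $X^T X$ itself is singular.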

### Lasso Regression
Similar to Ridge Regression, except that its penalty is the $\ell_1$ norm $\lambda \lVert \beta \rVert_1 = \lambda \sum_j \lvert \beta_j \rvert$. [Regularization Part 2: Lasso Regression](https://www.youtube.com/watch?v=NGf0voTMlcs)

+ Ridge Regression shrinks coefficients but does not exclude useless variables, while Lasso Regression can exclude them by setting their coefficients exactly to zero


The lasso estimator $\beta^{lasso}$ can be seen as a solution to $$\underset{\beta \in \mathbb{R}^p}{\text{minimize}} \lVert X \beta - y \rVert_2^2 + \lambda \lVert \beta \rVert_1$$
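
Unlike ridge, the lasso has no closed-form solution in general. One common approach is coordinate descent with soft-thresholding; the sketch below is a minimal version under that assumption, with hypothetical data and illustrative helper names (`soft_threshold`, `lasso_coordinate_descent`), and without an intercept.

```python
import numpy as np

def soft_threshold(rho, t):
    """Shrink rho toward zero by t; values in [-t, t] become exactly 0."""
    return np.sign(rho) * max(abs(rho) - t, 0.0)

def lasso_coordinate_descent(X, y, lam, n_sweeps=100):
    """Minimize ||X b - y||^2 + lam * ||b||_1 one coordinate at a time."""
    p = X.shape[1]
    beta = np.zeros(p)
    for _ in range(n_sweeps):
        for j in range(p):
            # Residual with the contribution of column j removed
            r_j = y - X @ beta + X[:, j] * beta[j]
            rho = X[:, j] @ r_j
            z = X[:, j] @ X[:, j]
            # Exact minimizer over beta[j] with the other coordinates fixed
            beta[j] = soft_threshold(rho, lam / 2.0) / z
    return beta

# Hypothetical data: y depends strongly on x1 and only very weakly on x2
x1 = np.array([1.0, -1.0, 1.0, -1.0])
x2 = np.array([1.0, 1.0, -1.0, -1.0])
X = np.column_stack([x1, x2])
y = 3.0 * x1 + 0.1 * x2

beta_lasso = lasso_coordinate_descent(X, y, lam=2.0)
beta_ridge = np.linalg.solve(X.T @ X + 2.0 * np.eye(2), X.T @ y)
print('lasso:', beta_lasso)   # the weak variable's coefficient is exactly 0
print('ridge:', beta_ridge)   # the weak variable's coefficient is small but nonzero
```

With the same $\lambda$, ridge only shrinks the weak coefficient, while the soft-threshold step in the lasso update removes it entirely.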


### Elastic Net Regression
Combines the Ridge ($\ell_2$) and Lasso ($\ell_1$) penalties. [Regularization Part 3: Elastic Net Regression](https://www.youtube.com/watch?v=1dKRdX9bfIo)
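
Written in the same form as the two objectives above, and using two separate weights $\lambda_1$ and $\lambda_2$ (a notation assumed here; other parameterizations exist), the elastic net estimator can be seen as a solution to

$$\underset{\beta \in \mathbb{R}^p}{\text{minimize}} \lVert X \beta - y \rVert_2^2 + \lambda_1 \lVert \beta \rVert_1 + \lambda_2 \lVert \beta \rVert_2^2$$

Setting $\lambda_1 = 0$ recovers Ridge Regression and setting $\lambda_2 = 0$ recovers Lasso Regression.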

## Batch Normalization

PyTorch: [BatchNorm2d](https://pytorch.org/docs/master/nn.html#torch.nn.BatchNorm1d)

Normalize each channel (such as `x[:, k, :, :]` for an NCHW input):
$$\hat{x}^{(k)} = \frac{x^{(k)} - \textbf{E}[x^{(k)}]}{\sqrt{  \text{Var}[x^{(k)}] + \epsilon   }}$$

Scale and shift the normalized value:
$$ y^{(k)} = \gamma ^ {(k)} \hat{x} ^{(k)} + \beta^{(k)} $$

In PyTorch, $\gamma$ and $\beta$ are stored in `bn.weight` and `bn.bias` respectively. If `affine` is `False`, `bn` has no learnable parameters and thus `bn.weight` and `bn.bias` are `None`.

In training mode, the running statistics are updated from each batch as:

```
running_mean = (1 - momentum) * running_mean + momentum * batch_mean
running_var  = (1 - momentum) * running_var  + momentum * batch_var
```

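Note that `torch.var` defaults to the unbiased estimator, which rescales the biased variance of $m$ values: $\text{Var}_{\text{unbiased}}[x] = \frac{m}{m - 1} \text{Var}_{\text{biased}}[x]$. Batch normalization normalizes with the biased variance (hence `unbiased=False` in the cells below), while `running_var` is updated with the unbiased one.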
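
The update rule and the biased/unbiased distinction can be checked by hand. The NumPy sketch below re-implements one training-mode update for the same input values used in the cells of this section; it assumes `momentum = 0.1` and initial running statistics of 0 and 1, which are the PyTorch defaults, and it is a re-implementation rather than PyTorch's own code.

```python
import numpy as np

# Same values and NCHW shape as torch.arange(48).reshape(2, 4, 3, 2)
x = np.arange(48, dtype=np.float64).reshape(2, 4, 3, 2)

momentum = 0.1                  # assumed: PyTorch default
running_mean = np.zeros(4)      # assumed initial value
running_var = np.ones(4)        # assumed initial value

# Per-channel statistics over the N, H and W axes
batch_mean = x.mean(axis=(0, 2, 3))
biased_var = x.var(axis=(0, 2, 3))             # divides by m
unbiased_var = x.var(axis=(0, 2, 3), ddof=1)   # divides by m - 1

m = x.size // x.shape[1]        # number of elements per channel
print('m:', m)
print('ratio unbiased / biased:', unbiased_var / biased_var)  # equals m / (m - 1)

# One training-mode update of the running statistics
running_mean = (1 - momentum) * running_mean + momentum * batch_mean
running_var = (1 - momentum) * running_var + momentum * unbiased_var
print('running_mean:', running_mean)
print('running_var:', running_var)
```

The resulting `running_mean` agrees with the `bn.running_mean` values printed after the single training-mode forward pass later in this section.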

In [8]:
import torch
import numpy as np

x = torch.arange(48).reshape(2, 4, 3, 2).float()
bn = torch.nn.BatchNorm2d(4, affine=False)
print('bn.eps:', bn.eps)
y = bn(x)

y_mine = torch.zeros_like(y)
for i in range(4):
    y_mine[:, i, :, :] = (x[:, i, :, :] - torch.mean(x[:, i, :, :])) / torch.sqrt(torch.var(x[:, i, :, :], unbiased=False) + bn.eps)

print('y:', y, sep='\n')
print('y_mine:', y_mine, sep='\n')

bn.eps: 1e-05
y:
tensor([[[[-1.1963, -1.1138],
          [-1.0313, -0.9488],
          [-0.8663, -0.7838]],

         [[-1.1963, -1.1138],
          [-1.0313, -0.9488],
          [-0.8663, -0.7838]],

         [[-1.1963, -1.1138],
          [-1.0313, -0.9488],
          [-0.8663, -0.7838]],

         [[-1.1963, -1.1138],
          [-1.0313, -0.9488],
          [-0.8663, -0.7838]]],


        [[[ 0.7838,  0.8663],
          [ 0.9488,  1.0313],
          [ 1.1138,  1.1963]],

         [[ 0.7838,  0.8663],
          [ 0.9488,  1.0313],
          [ 1.1138,  1.1963]],

         [[ 0.7838,  0.8663],
          [ 0.9488,  1.0313],
          [ 1.1138,  1.1963]],

         [[ 0.7838,  0.8663],
          [ 0.9488,  1.0313],
          [ 1.1138,  1.1963]]]])
y_mine:
tensor([[[[-1.1963, -1.1138],
          [-1.0313, -0.9488],
          [-0.8663, -0.7838]],

         [[-1.1963, -1.1138],
          [-1.0313, -0.9488],
          [-0.8663, -0.7838]],

         [[-1.1963, -1.1138],
          [-1.0313, -0

In [9]:
bn = torch.nn.BatchNorm2d(4)  # eps=1e-5, affine=True (gamma and beta are enabled)
with torch.no_grad():  # in-place edits of a parameter must bypass autograd
    bn.bias.fill_(5.0)  # set every beta to 5
print('bn.bias:', bn.bias.data)
y = bn(x)


y_mine = torch.zeros_like(y)
for i in range(4):
    y_mine[:, i, :, :] = (x[:, i, :, :] - torch.mean(x[:, i, :, :])) / torch.sqrt(torch.var(x[:, i, :, :], unbiased=False) + bn.eps) * bn.weight[i] + bn.bias[i]

print('y:', y, sep='\n')
print('y_mine:', y_mine, sep='\n')

bn.bias: tensor([5., 5., 5., 5.])
y:
tensor([[[[3.8226, 3.9038],
          [3.9850, 4.0662],
          [4.1474, 4.2286]],

         [[4.6951, 4.7161],
          [4.7371, 4.7582],
          [4.7792, 4.8002]],

         [[4.4152, 4.4555],
          [4.4959, 4.5362],
          [4.5765, 4.6168]],

         [[4.2404, 4.2928],
          [4.3452, 4.3975],
          [4.4499, 4.5023]]],


        [[[5.7714, 5.8526],
          [5.9338, 6.0150],
          [6.0962, 6.1774]],

         [[5.1998, 5.2208],
          [5.2418, 5.2629],
          [5.2839, 5.3049]],

         [[5.3832, 5.4235],
          [5.4638, 5.5041],
          [5.5445, 5.5848]],

         [[5.4977, 5.5501],
          [5.6025, 5.6548],
          [5.7072, 5.7596]]]], grad_fn=<ThnnBatchNormBackward>)
y_mine:
tensor([[[[3.8226, 3.9038],
          [3.9850, 4.0662],
          [4.1474, 4.2286]],

         [[4.6951, 4.7161],
          [4.7371, 4.7582],
          [4.7792, 4.8002]],

         [[4.4152, 4.4555],
          [4.4959, 4.5362],
   

In [11]:
bn = torch.nn.BatchNorm2d(4).eval() # eps=1e-5, affine=True(gamma and beta are enabled)
y = bn(x)

y_mine = torch.zeros_like(y)
for i in range(4):
    y_mine[:, i, :, :] = (x[:, i, :, :] - bn.running_mean[i]) / torch.sqrt(bn.running_var[i] + bn.eps) * bn.weight[i] + bn.bias[i]
    
print('y:', y, sep='\n')
print('y_mine:', y_mine, sep='\n')

y:
tensor([[[[ 0.0000,  0.4406],
          [ 0.8813,  1.3219],
          [ 1.7626,  2.2032]],

         [[ 3.9458,  4.6034],
          [ 5.2610,  5.9186],
          [ 6.5763,  7.2339]],

         [[ 8.3611,  9.0579],
          [ 9.7547, 10.4514],
          [11.1482, 11.8449]],

         [[10.9858, 11.5961],
          [12.2065, 12.8168],
          [13.4271, 14.0374]]],


        [[[10.5755, 11.0161],
          [11.4568, 11.8974],
          [12.3381, 12.7787]],

         [[19.7288, 20.3864],
          [21.0440, 21.7017],
          [22.3593, 23.0169]],

         [[25.0834, 25.7802],
          [26.4769, 27.1737],
          [27.8705, 28.5672]],

         [[25.6336, 26.2439],
          [26.8542, 27.4645],
          [28.0749, 28.6852]]]], grad_fn=<ThnnBatchNormBackward>)
y_mine:
tensor([[[[ 0.0000,  0.4406],
          [ 0.8813,  1.3219],
          [ 1.7626,  2.2032]],

         [[ 3.9458,  4.6034],
          [ 5.2610,  5.9186],
          [ 6.5763,  7.2339]],

         [[ 8.3611,  9.0579],
   

In [12]:
bn = bn.train()
bn(x)
print(bn.running_mean)
bn = bn.eval()

y = bn(x)

y_mine = torch.zeros_like(y)
for i in range(4):
    y_mine[:, i, :, :] = (x[:, i, :, :] - bn.running_mean[i]) / torch.sqrt(bn.running_var[i] + bn.eps) * bn.weight[i] + bn.bias[i]
    
print('y:', y, sep='\n')
print('y_mine:', y_mine, sep='\n')

tensor([1.4500, 2.0500, 2.6500, 3.2500])
y:
tensor([[[[-0.1553, -0.0482],
          [ 0.0589,  0.1660],
          [ 0.2731,  0.3802]],

         [[ 0.6314,  0.7912],
          [ 0.9511,  1.1109],
          [ 1.2707,  1.4306]],

         [[ 1.5835,  1.7528],
          [ 1.9222,  2.0915],
          [ 2.2609,  2.4302]],

         [[ 2.1881,  2.3364],
          [ 2.4848,  2.6331],
          [ 2.7814,  2.9298]]],


        [[[ 2.4152,  2.5223],
          [ 2.6294,  2.7365],
          [ 2.8436,  2.9507]],

         [[ 4.4675,  4.6274],
          [ 4.7872,  4.9471],
          [ 5.1069,  5.2668]],

         [[ 5.6479,  5.8173],
          [ 5.9866,  6.1560],
          [ 6.3253,  6.4947]],

         [[ 5.7483,  5.8967],
          [ 6.0450,  6.1933],
          [ 6.3417,  6.4900]]]], grad_fn=<ThnnBatchNormBackward>)
y_mine:
tensor([[[[-0.1553, -0.0482],
          [ 0.0589,  0.1660],
          [ 0.2731,  0.3802]],

         [[ 0.6314,  0.7912],
          [ 0.9511,  1.1109],
          [ 1.2707,  1.4

In [13]:
y == y_mine

tensor([[[[1, 0],
          [0, 1],
          [1, 1]],

         [[1, 1],
          [0, 1],
          [1, 1]],

         [[0, 0],
          [1, 1],
          [1, 1]],

         [[1, 1],
          [1, 1],
          [0, 1]]],


        [[[1, 1],
          [0, 1],
          [0, 1]],

         [[0, 1],
          [0, 1],
          [0, 1]],

         [[1, 0],
          [1, 1],
          [1, 0]],

         [[1, 0],
          [1, 1],
          [1, 0]]]], dtype=torch.uint8)
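
The zeros in this mask do not mean the manual formula is wrong: the two results are computed with different orders of floating-point operations, so they can differ in the last bits even though they print identically. A tolerance-based comparison is the usual check. The sketch below shows the effect with plain NumPy values (hypothetical numbers, not the tensors above); `torch.allclose` plays the same role for tensors.

```python
import numpy as np

# Two ways of writing the "same" numbers
a = np.array([0.1 + 0.2, 1.0])
b = np.array([0.3, 1.0])

print(a == b)              # element-wise exact comparison: first entry differs
print(np.allclose(a, b))   # comparison within a small tolerance
```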