# Bias & Variance

Assume human error is about 0%.

|                    | Case 1        | Case 2    | Case 3                      | Case 4                    |
|-|-|-|-|-|
| Training set error | 1%            | 15%       | 15%                         | 0.5%                      |
| Dev set error      | 11%           | 16%       | 30%                         | 1%                        |
| Diagnosis          | High variance | High bias | High variance and high bias | Low variance and low bias |
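The reading of the table can be sketched as a small rule: compare training error with human error to judge bias, and dev error with training error to judge variance. The function name `diagnose` and the 5% threshold `gap` are assumptions made for illustration, not values from these notes.

```python
def diagnose(train_error, dev_error, human_error=0.0, gap=0.05):
    """Label a model from its error rates (fractions, e.g. 0.15 for 15%).

    `gap` is an assumed threshold for "large"; tune it for your problem.
    """
    high_bias = (train_error - human_error) > gap      # far from human-level error
    high_variance = (dev_error - train_error) > gap    # large train/dev gap
    if high_bias and high_variance:
        return "high bias and high variance"
    if high_bias:
        return "high bias"
    if high_variance:
        return "high variance"
    return "low bias and low variance"


# The four columns of the table, as (training error, dev error) pairs.
for train, dev in [(0.01, 0.11), (0.15, 0.16), (0.15, 0.30), (0.005, 0.01)]:
    print(f"train={train:.1%} dev={dev:.1%} -> {diagnose(train, dev)}")
```

If human error were higher (say 14%), the second column would no longer count as high bias, which is why the baseline is stated first.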

## Basic Recipe

`High Bias`

- Bigger network
- Train longer
- Other NN architecture

`High Variance`

- More data
- Regularization
- Other NN architecture

# Regularization

## Logistic regression

$$
\min_{w,b} \mathcal{J} \big( w, b \big) \\
\mathcal{J} \big( w, b \big) = \frac{1}{m} \sum_{i=1}^m 
\mathcal{L} \big( \hat{y}^{(i)}, y^{(i)} \big) +
\frac{\lambda}{2m} \Vert w \Vert_2^2 
+ \underbrace{\frac{\lambda}{2m} b^2}_{\text{usually omitted}}
$$

## L2 Regularization

$$
\Vert w \Vert^2_2 = \sum_{j=1}^{n_x} w_j^2 = w^T \ w
$$
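As a minimal sketch, the regularized cost for logistic regression could be computed as below. The function name, the array shapes, and the toy data are assumptions made for illustration; the bias `b` is left out of the penalty, as in the formula.

```python
import numpy as np


def l2_regularized_cost(w, b, X, y, lam):
    """Cross-entropy cost plus (lam / 2m) * ||w||_2^2; b is not penalized.

    Assumed shapes: X is (n_x, m), y is (1, m), w is (n_x, 1).
    """
    m = X.shape[1]
    z = w.T @ X + b
    y_hat = 1.0 / (1.0 + np.exp(-z))                      # sigmoid
    cross_entropy = -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))
    l2_penalty = (lam / (2 * m)) * np.sum(w ** 2)         # same as (w.T @ w).item()
    return cross_entropy + l2_penalty


rng = np.random.default_rng(0)
X = rng.standard_normal((3, 5))                 # toy data: n_x = 3, m = 5
y = (rng.random((1, 5)) > 0.5).astype(float)
w = rng.standard_normal((3, 1))
print(l2_regularized_cost(w, 0.0, X, y, lam=0.0))   # plain cross-entropy
print(l2_regularized_cost(w, 0.0, X, y, lam=1.0))   # larger by (1 / 2m) * ||w||^2
```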

## L1 Regularization

L1 regularization tends to make $ w $ sparse (many elements become exactly zero).

$$
\frac{\lambda}{2m} \sum_{j=1}^{n_x} \big| w_j \big| = 
\frac{\lambda}{2m} \Vert w \Vert_1
$$
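To illustrate the sparsity claim, the sketch below computes the L1 penalty and compares two kinds of shrinkage. The soft-thresholding step is a standard way to handle an L1 penalty but is not part of these notes; it is swapped in here only to show how small weights land exactly on zero, while a multiplicative L2-style shrinkage never reaches zero. All names and values are assumptions.

```python
import numpy as np


def l1_penalty(w, lam, m):
    """(lam / 2m) * ||w||_1, the sum of absolute values of the weights."""
    return (lam / (2 * m)) * np.sum(np.abs(w))


def soft_threshold(w, t):
    """Shrink every weight toward zero by t; weights smaller than t become 0."""
    return np.sign(w) * np.maximum(np.abs(w) - t, 0.0)


w = np.array([0.8, -0.05, 0.02, -1.5])
print(l1_penalty(w, lam=1.0, m=1))
print(soft_threshold(w, 0.1))    # L1-style step: the two small weights become zero
print(w * (1 - 0.1))             # L2-style shrinkage: every weight stays non-zero
```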

## Neural Network

$$
\mathcal{J} \big( w^{[1]}, b^{[1]}, \dots, w^{[L]}, b^{[L]} \big) =
\frac{1}{m} \sum_{i=1}^m \mathcal{L} \big( \hat{y}^{(i)}, y^{(i)} \big) +
\frac{\lambda}{2m} \sum_{l=1}^L \Vert w^{[l]} \Vert_F^2
$$

The matrix norm used here is the "Frobenius norm" (the Euclidean norm applied to all elements of the matrix):

$$
\Vert w^{[l]} \Vert^2_F = \sum_{i=1}^{n^{[l]}} \ \sum_{j=1}^{n^{[l-1]}} 
\Big( w_{ij}^{[l]} \Big)^2 \\
\text{w.shape} = \big( n^{[l]}, n^{[l-1]} \big)
$$
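In NumPy the squared Frobenius norm is just the sum of squared entries. The sketch below shows it next to `np.linalg.norm(W, "fro")` and sums it over layers to form the penalty term; the layer sizes and function names are hypothetical.

```python
import numpy as np


def squared_frobenius(W):
    """Sum of the squares of all entries of W."""
    return np.sum(W ** 2)


def frobenius_penalty(weights, lam, m):
    """(lam / 2m) * sum over layers of the squared Frobenius norm."""
    return (lam / (2 * m)) * sum(squared_frobenius(W) for W in weights)


rng = np.random.default_rng(0)
layer_dims = [4, 3, 2]    # hypothetical n^[0], n^[1], n^[2]
weights = [rng.standard_normal((layer_dims[l], layer_dims[l - 1]))
           for l in range(1, len(layer_dims))]

for W in weights:
    # Both expressions give the same number for each layer.
    print(W.shape, squared_frobenius(W), np.linalg.norm(W, "fro") ** 2)
print(frobenius_penalty(weights, lam=0.7, m=100))
```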



With the regularization term added, the gradient descent update becomes:

$$
dw^{[l]} = \big( \text{ from back-propagation } \big) + \frac{\lambda}{m} w^{[l]} \\
\begin{align}
w^{[l]} & := w^{[l]} - \alpha \ dw^{[l]} \\
        & = w^{[l]} - \alpha 
\big[ \big( \text{ from back-propagation } \big) + \frac{\lambda}{m} w^{[l]} \big] \\
        & = w^{[l]} - \frac{\alpha \lambda}{m} w^{[l]} -
\alpha \big( \text{ from back-propagation } \big)
\end{align}
$$

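Each step multiplies $ w^{[l]} $ by $ \big( 1 - \frac{\alpha \lambda}{m} \big) $, shrinking it slightly, which is why L2 regularization is also called weight decay.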
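A quick numerical check of the derivation: the update written with the explicit regularization gradient and the update written as a decay factor should give the same weights. The function names are hypothetical and `dW_backprop` is a random stand-in for the gradient from back-propagation.

```python
import numpy as np


def step_explicit(W, dW_backprop, alpha, lam, m):
    """Gradient step with the L2 term added to the back-propagated gradient."""
    dW = dW_backprop + (lam / m) * W
    return W - alpha * dW


def step_weight_decay(W, dW_backprop, alpha, lam, m):
    """The same step, rearranged so the L2 term appears as a decay factor."""
    return (1 - alpha * lam / m) * W - alpha * dW_backprop


rng = np.random.default_rng(0)
W = rng.standard_normal((3, 4))
dW_backprop = rng.standard_normal((3, 4))   # stand-in for the real gradient
explicit = step_explicit(W, dW_backprop, alpha=0.1, lam=0.7, m=10)
decayed = step_weight_decay(W, dW_backprop, alpha=0.1, lam=0.7, m=10)
print(np.allclose(explicit, decayed))       # the two forms agree
```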

## Dropout

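With keep_prob = 0.8, a random mask keeps each element of the activation a3 with probability 0.8 and zeroes the rest.

```python
import numpy as np
d3 = np.random.rand(a3.shape[0], a3.shape[1]) < keep_prob  # boolean keep mask for layer 3
a3 = np.multiply(a3, d3)                                   # zero out the dropped units
a3 = a3 / keep_prob                                        # inverted dropout: keep the expected value of a3
```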
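A self-contained sketch of the same idea, wrapped in a function so the training and test-time behavior are both visible. It divides by `keep_prob` (inverted dropout) so the expected activation is unchanged, and applies no dropout at test time. The function name, the `training` flag, and the toy activations are assumptions.

```python
import numpy as np


def dropout_forward(a, keep_prob, rng, training=True):
    """Apply inverted dropout to activations `a`.

    Returns the (possibly masked) activations and the mask, which the
    backward pass would need to reuse. At test time nothing is dropped.
    """
    if not training:
        return a, None
    mask = rng.random(a.shape) < keep_prob      # True with probability keep_prob
    return a * mask / keep_prob, mask


rng = np.random.default_rng(0)
a3 = np.ones((50, 1000))                        # toy activations
a3_dropped, d3 = dropout_forward(a3, keep_prob=0.8, rng=rng)
print(d3.mean())            # fraction of units kept, close to 0.8
print(a3_dropped.mean())    # close to 1.0, the mean before dropout
```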

## Other Regularization methods

- Data augmentation: slightly transform existing examples to create new labeled data, e.g. flip an image horizontally.
- Early stopping
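Both methods can be sketched in a few lines. The image layout `(m, height, width, channels)`, the `patience` rule, and the function names are assumptions; early stopping in particular has many variants, and this is only one simple form.

```python
import numpy as np


def flip_horizontal(images):
    """Mirror a batch of images of shape (m, height, width, channels)."""
    return images[:, :, ::-1, :]


def early_stopping_epoch(dev_errors, patience=2):
    """Return the epoch whose weights should be kept.

    Scans the dev errors in order and stops once there has been no
    improvement for `patience` consecutive epochs.
    """
    best_epoch, best_error = 0, float("inf")
    for epoch, error in enumerate(dev_errors):
        if error < best_error:
            best_epoch, best_error = epoch, error
        elif epoch - best_epoch >= patience:
            break
    return best_epoch


images = np.arange(6, dtype=float).reshape(1, 2, 3, 1)   # one tiny 2x3 "image"
augmented = np.concatenate([images, flip_horizontal(images)], axis=0)
print(augmented.shape)      # twice as many examples, labels are reused unchanged

dev_errors = [0.30, 0.22, 0.18, 0.19, 0.21, 0.10]        # made-up dev error curve
print(early_stopping_epoch(dev_errors))
```

Note that the made-up curve improves again at the last epoch, which this rule never sees: stopping early trades a possible later improvement for less training and less overfitting.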

# Setting up problem

## Normalizing inputs

Subtract mean

$$
\mu = \frac{1}{m} \sum_{i=1}^m x^{(i)} \\
x := x - \mu
$$

Normalize Variance

$$
\sigma^2 = \frac{1}{m} \sum_{i=1}^m \big( x^{(i)} \big)^2 \quad \text{(element-wise, after subtracting } \mu \text{)} \\
x := x / \sigma
$$
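A minimal sketch of both steps, assuming `X` holds one example per column with shape `(n_x, m)`. Each feature is divided by its standard deviation $ \sigma $ so that it ends up with unit variance. The statistics are computed on the training set and then reused unchanged for the dev and test sets. The small `eps`, which guards against constant features, and the function names are assumptions.

```python
import numpy as np


def fit_normalizer(X_train, eps=1e-8):
    """Per-feature mean and standard deviation of the training set."""
    mu = X_train.mean(axis=1, keepdims=True)
    sigma = X_train.std(axis=1, keepdims=True) + eps   # eps avoids division by zero
    return mu, sigma


def normalize(X, mu, sigma):
    """Subtract the mean, then scale each feature to unit variance."""
    return (X - mu) / sigma


rng = np.random.default_rng(0)
X_train = rng.normal(loc=5.0, scale=3.0, size=(2, 1000))   # toy training data
X_dev = rng.normal(loc=5.0, scale=3.0, size=(2, 200))      # toy dev data

mu, sigma = fit_normalizer(X_train)
X_train_norm = normalize(X_train, mu, sigma)
X_dev_norm = normalize(X_dev, mu, sigma)       # same mu and sigma as training
print(X_train_norm.mean(axis=1), X_train_norm.std(axis=1))
```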

## Vanishing / Exploding gradients

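If the initial weights (or the activation values) are slightly too large or too small, the repeated multiplications through a deep network make the values exponentially large or small, which makes training difficult. The initial weights therefore need to be chosen carefully.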

$$
\text{Var} \big( w_i \big) = \frac{1}{n} \\
w^{[l]} = \text{np.random.randn(shape) * np.sqrt} \big( \frac{1}{n^{[l-1]}} \big)
$$
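The sensitivity to the weight scale can be seen in a toy deep linear network where every layer has weights `weight_scale * I` and no activation function. This setup is a hypothetical illustration, not a realistic network: the output grows or shrinks like `weight_scale ** n_layers`.

```python
import numpy as np


def output_scale(weight_scale, n_layers, n=2):
    """Norm of the output of a deep linear network with W = weight_scale * I."""
    a = np.ones((n, 1))
    W = weight_scale * np.eye(n)
    for _ in range(n_layers):
        a = W @ a           # one layer: a^[l] = W a^[l-1]
    return float(np.linalg.norm(a))


for scale in (1.5, 1.0, 0.5):
    print(scale, output_scale(scale, n_layers=50))
```

With 50 layers, a scale of 1.5 explodes and a scale of 0.5 vanishes, while a scale of 1.0 keeps the signal stable; scaling the random weights by the layer's fan-in aims for that stable regime.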

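With ReLU, set the variance to $ \frac{2}{n} $ instead.

With tanh, scale the weights by $ \sqrt{\frac{1}{n^{[l-1]}}} $ (variance $ \frac{1}{n^{[l-1]}} $). This is called `Xavier Initialization`.

Another version, from Yoshua Bengio..., scales the weights by $ \sqrt{\frac{2}{n^{[l-1]} + n^{[l]}}} $.

He initialization (paper: He et al. 2015) scales the weights by $ \sqrt{\frac{2}{n^{[l-1]}}} $, i.e. variance $ \frac{2}{n^{[l-1]}} $.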

```python
# He initialization for layer l: standard normal samples scaled by sqrt(2 / n^[l-1])
w_init = np.random.randn(layers_dim[l], layers_dim[l-1]) * np.sqrt(2 / layers_dim[l-1])
```
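Extended to a whole network, the one-liner might look like the sketch below. The function name, the parameter-dictionary layout (`W1`, `b1`, ...), the zero-initialized biases, and the example layer sizes are assumptions.

```python
import numpy as np


def initialize_he(layers_dim, seed=0):
    """He-initialize weights and zero the biases for every layer.

    layers_dim[l] is the number of units in layer l (layers_dim[0] = n_x).
    """
    rng = np.random.default_rng(seed)
    params = {}
    for l in range(1, len(layers_dim)):
        fan_in = layers_dim[l - 1]
        params[f"W{l}"] = (rng.standard_normal((layers_dim[l], fan_in))
                           * np.sqrt(2 / fan_in))
        params[f"b{l}"] = np.zeros((layers_dim[l], 1))
    return params


params = initialize_he([500, 400, 10])
print(params["W1"].shape, params["W1"].std(), np.sqrt(2 / 500))
```

The empirical standard deviation of `W1` should sit close to $ \sqrt{2 / n^{[0]}} $, which is the scale factor used for that layer.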

## Gradient Check

Numerical approximation

$$
\frac{f(\theta+\epsilon) \ - \ f(\theta-\epsilon)}{2 \epsilon} \approx g\big( \theta \big)
$$

Reshape $ w^{[1]}, b^{[1]}, \dots, w^{[L]}, b^{[L]} $ into one long vector $ \theta $.

Reshape $ dw^{[1]}, db^{[1]}, \dots, dw^{[L]}, db^{[L]} $ into one long vector $ d\theta $.

$$
\begin{align}
\text{for each i: } & \\
                    & d\theta_{\text{approx}}[i] = 
\frac{
\mathcal{J}\big( \theta_1, \theta_2, \dots, \theta_i + \epsilon, \dots \big) -
\mathcal{J}\big( \theta_1, \theta_2, \dots, \theta_i - \epsilon, \dots \big)}
{2 \epsilon} \\
& d\theta[i] = \frac{\partial \mathcal{J}}{\partial{\theta_i}} \\
& d\theta_{\text{approx}}[i] \approx d\theta[i] \\
& \text{ Check: }
\frac{\Vert d\theta_{\text{approx}} - d\theta \Vert_2}
{\Vert d\theta_{\text{approx}} \Vert_2 + \Vert d\theta \Vert_2}
\end{align}
$$

$ \epsilon = 10^{-7} $

$$
\begin{cases}
\lt 10^{-7} : & \text{ good } \\
\approx 10^{-5} : & \text{ maybe okay } \\
\gt 10^{-3} : & \text{ worry } \\
\end{cases}
$$

- Use "grad check" only for debugging, not during training.
- "grad check" doesn't work with dropout; turn dropout off (keep_prob = 1.0) while checking.
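The whole procedure can be sketched for any cost that takes a flat parameter vector. The toy cost, its gradient, and the deliberately wrong `buggy_grad` are assumptions chosen so the example is self-contained; in practice `cost_fn` would run forward propagation and `grad_fn` would run back-propagation on the reshaped parameters.

```python
import numpy as np


def gradient_check(cost_fn, grad_fn, theta, epsilon=1e-7):
    """Relative difference between analytic and two-sided numerical gradients.

    `theta` is a 1-D float array; `cost_fn(theta)` returns a scalar and
    `grad_fn(theta)` returns the analytic gradient with the same shape.
    """
    d_theta = grad_fn(theta)
    d_theta_approx = np.zeros_like(theta)
    for i in range(theta.size):
        theta_plus = theta.copy()
        theta_plus[i] += epsilon            # nudge only component i up
        theta_minus = theta.copy()
        theta_minus[i] -= epsilon           # and down
        d_theta_approx[i] = (cost_fn(theta_plus) - cost_fn(theta_minus)) / (2 * epsilon)
    numerator = np.linalg.norm(d_theta_approx - d_theta)
    denominator = np.linalg.norm(d_theta_approx) + np.linalg.norm(d_theta)
    return numerator / denominator


def cost(theta):            # toy cost: J = 0.5 * sum(theta^2)
    return 0.5 * np.sum(theta ** 2)


def grad(theta):            # correct gradient of the toy cost
    return theta


def buggy_grad(theta):      # deliberately wrong, to show what a failure looks like
    return 2 * theta


theta = np.array([1.0, -2.0, 3.0])
print(gradient_check(cost, grad, theta))         # tiny: the gradients agree
print(gradient_check(cost, buggy_grad, theta))   # large: something is wrong
```

The loop calls the cost function twice per parameter, which is why the check is far too slow to run during training.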