# Perceptron

$\textbf{Assumption}$: training set is linearly separable

$\textbf{Loss function}$:
$$\mathcal{L} (w, b) = -\sum_{x_i \in M} y_i (w^T x_i + b), \quad M = \text{set of misclassified samples}$$

$\textbf{Steps}$:
- model:
$$f(x) = \text{sign}(w^T x + b), \quad \text{sign}(z) = \begin{cases}
      1, & \text{if}\ z > 0\\
      -1, & \text{else}\ 
    \end{cases}$$
- Initialize $w_0, b_0$
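- Stochastic gradient descent: for each misclassified sample $x_i \in M$ (one with $y_i (w^T x_i + b) \le 0$), update $w$ and $b$
  - $\alpha$: learning rate
$$\frac{\partial \mathcal{L}}{\partial w} = - \sum_{x_i \in M} y_i x_i, \frac{\partial \mathcal{L}}{\partial b} = - \sum_{x_i \in M} y_i$$
$$w := w + \alpha y_i x_i, b := b + \alpha y_i$$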
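
The steps above can be sketched in NumPy as follows. This is a minimal illustration, not a reference implementation: the names `train_perceptron` and `predict`, the zero initialization, the `max_epochs` cap, and the toy AND data set are all assumptions made for the example.

```python
import numpy as np

def train_perceptron(X, y, alpha=1.0, max_epochs=100):
    """Per-sample perceptron updates on misclassified points.

    X: (n_samples, n_features) array; y: labels in {-1, +1}.
    """
    n_samples, n_features = X.shape
    w = np.zeros(n_features)  # w_0 (assumed zero initialization)
    b = 0.0                   # b_0
    for _ in range(max_epochs):
        mistakes = 0
        for x_i, y_i in zip(X, y):
            # A sample is misclassified when y_i * (w^T x_i + b) <= 0.
            if y_i * (w @ x_i + b) <= 0:
                w = w + alpha * y_i * x_i
                b = b + alpha * y_i
                mistakes += 1
        if mistakes == 0:  # every sample is on the correct side
            break
    return w, b

def predict(X, w, b):
    # sign(z) as defined above: +1 if z > 0, otherwise -1
    return np.where(X @ w + b > 0, 1, -1)

# Toy linearly separable data: logical AND with labels in {-1, +1}.
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
y = np.array([-1, -1, -1, 1])
w, b = train_perceptron(X, y)
```

Because the data are linearly separable, the loop stops once an epoch passes with no misclassified sample; on non-separable data it would simply run until `max_epochs`.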

# Feedforward Neural Network (FFNN)

$\textbf{Output Layer}$:
- Gaussian (regression): Linear Unit
- Bernoulli (binary classification): Sigmoid
  - gradient vanishes with extreme negative or positive values
- Multinomial (multiclass classification): Softmax
  - gradient vanishes when the differences between input values are extreme
  - softmax: differentiable and continuous
  - argmax (one-hot): not continuous / differentiable
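
The saturation behaviour noted above can be checked numerically. The sketch below is illustrative only; the helper names and the specific input values are assumptions chosen to make the effect visible.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    # Subtracting the max before exponentiating avoids overflow
    # and does not change the result.
    e = np.exp(z - np.max(z))
    return e / e.sum()

# Sigmoid derivative sigma(z) * (1 - sigma(z)) shrinks toward 0 for large |z|.
z = np.array([-10.0, 0.0, 10.0])
sigmoid_grad = sigmoid(z) * (1.0 - sigmoid(z))

# Softmax saturates when one input dominates: the output is almost one-hot
# and its Jacobian diag(p) - p p^T is close to the zero matrix.
p = softmax(np.array([0.0, 0.0, 20.0]))
softmax_jacobian = np.diag(p) - np.outer(p, p)
```

The sigmoid derivative peaks at `z = 0` and is tiny at both extremes, while the softmax Jacobian entries become negligible once one input is far larger than the others.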

$\textbf{Activation Functions}$:
- Sigmoid: $\sigma(x) = \frac{1}{1 + exp(-x)} \in (0, 1)$
- Hyperbolic Tangent: $tanh(x) = \frac{exp(x) - exp(-x)}{exp(x) + exp(-x)} \in (-1, 1)$
- Rectified Linear Unit: $ReLU(x) = max(0, x)$
  - Leaky ReLU: $f(x) = \begin{cases}
      x, & \text{if}\ x > 0\\
      0.01 x, & \text{else}\ 
    \end{cases}$
  - Exponential Linear Unit (ELU): $f(x) = \begin{cases}
      x, & \text{if}\ x > 0\\
      a (exp(x) - 1), & \text{else}\ 
    \end{cases}$
  - Softplus/Smooth ReLU: $f(x) = ln(1 + exp(x))$
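
The activation functions listed above could be written in NumPy roughly as follows. The function names and the default `a=1.0` for ELU are assumptions for illustration; the `0.01` slope follows the Leaky ReLU definition above.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    return np.tanh(x)

def relu(x):
    return np.maximum(0.0, x)

def leaky_relu(x, slope=0.01):
    return np.where(x > 0, x, slope * x)

def elu(x, a=1.0):
    # np.where evaluates both branches, so clamp the exponent's argument
    # to avoid overflow warnings for large positive x.
    return np.where(x > 0, x, a * (np.exp(np.minimum(x, 0.0)) - 1.0))

def softplus(x):
    # log(1 + exp(x)) computed stably as log(exp(0) + exp(x)).
    return np.logaddexp(0.0, x)

x = np.array([-2.0, 0.0, 3.0])
outputs = {f.__name__: f(x) for f in (sigmoid, tanh, relu, leaky_relu, elu, softplus)}
```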


![backpropagation](images/Neural_Networks/backpropagation.png)


$\textbf{Forward propagation}$
1. $a^{[1]} = f_1(z^{[1]}), z^{[1]} = w^{[1]} x + b^{[1]}$
2. $\hat{y} = a^{[2]} = f_2(z^{[2]}), z^{[2]} = w^{[2]} a^{[1]} + b^{[2]}$
3. $loss = L(y, \hat{y})$

$\textbf{Backpropagation}$
1. $w^{[2]} := w^{[2]} - \alpha * \frac{\partial L}{\partial w^{[2]}} = w^{[2]} - \alpha * \left(\frac{\partial L}{\partial f_2} * \frac{\partial f_2}{\partial z^{[2]}}\right) * \frac{\partial z^{[2]}}{\partial w^{[2]}}$
   - $dz^{[2]} = \frac{\partial L}{\partial z^{[2]}} = \frac{\partial L}{\partial f_2} * \frac{\partial f_2}{\partial z^{[2]}}$
2. $w^{[1]} := w^{[1]} - \alpha * \frac{\partial L}{\partial w^{[1]}} = w^{[1]} - \alpha * \left(\frac{\partial L}{\partial f_2} * \frac{\partial f_2}{\partial z^{[2]}} \right) * \left(\frac{\partial z^{[2]}}{\partial f_1} * \frac{\partial f_1}{\partial z^{[1]}} \right) * \frac{\partial z^{[1]}}{\partial w^{[1]}}$
   - $dz^{[1]} = \frac{\partial L}{\partial z^{[1]}} = dz^{[2]} * \frac{\partial z^{[2]}}{\partial f_1} * \frac{\partial f_1}{\partial z^{[1]}}$
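
As a sketch of the forward and backward passes above for a single example, the code below picks concrete choices that the notes leave open: $f_1 = \tanh$, $f_2 = \sigma$, and binary cross-entropy for $L$. These choices, the layer sizes, and all function names are assumptions for illustration. In the code, `dz2` and `dz1` hold $\frac{\partial L}{\partial z^{[2]}}$ and $\frac{\partial L}{\partial z^{[1]}}$.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(params, x):
    W1, b1, W2, b2 = params
    z1 = W1 @ x + b1
    a1 = np.tanh(z1)       # f_1 (assumed: tanh)
    z2 = W2 @ a1 + b2
    y_hat = sigmoid(z2)    # f_2 (assumed: sigmoid)
    return z1, a1, z2, y_hat

def loss_fn(y, y_hat):
    # Binary cross-entropy (assumed choice of L).
    return float(-(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat)).sum())

def backward(params, x, y):
    W1, b1, W2, b2 = params
    z1, a1, z2, y_hat = forward(params, x)
    dz2 = y_hat - y                       # dL/dz2 for sigmoid + cross-entropy
    dW2 = np.outer(dz2, a1)               # dz2 * dz2/dW2
    db2 = dz2
    dz1 = (W2.T @ dz2) * (1.0 - a1 ** 2)  # chain rule; tanh'(z1) = 1 - a1^2
    dW1 = np.outer(dz1, x)
    db1 = dz1
    return [dW1, db1, dW2, db2]

def sgd_step(params, grads, alpha=0.1):
    # w := w - alpha * dL/dw for every parameter
    return [p - alpha * g for p, g in zip(params, grads)]

rng = np.random.default_rng(0)
x = rng.normal(size=3)
y = np.array([1.0])
params = [rng.normal(size=(4, 3)), np.zeros(4), rng.normal(size=(1, 4)), np.zeros(1)]
grads = backward(params, x, y)
params_after = sgd_step(params, grads)
```

A finite-difference check (perturbing one weight and comparing the change in loss with the analytic gradient) is a common way to confirm that the backward pass matches the forward pass.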

$\textbf{Gradient Descent Algorithms}$:
- General Gradient Descent:
  - Hyperparameter
    - $\alpha$: learning rate
    $$w := w - \alpha * \frac{\partial L}{\partial w}$$
- Momentum (exponential weighted average)
  - Hyperparameter
    - $\alpha$: learning rate
    - $\beta$: decay (default 0.9)
    $$V_{dw} = \beta V_{dw} + (1 - \beta) dw, V_{db} = \beta V_{db} + (1 - \beta) db$$
    $$w := w - \alpha V_{dw}, b := b - \alpha V_{db}$$
- AdaGrad (adaptive gradient)
  - accumulate the element-wise squared gradients (the diagonal of the summed outer products of the gradient)
  - Hyperparameter
    - $\alpha$: learning rate
    - $\epsilon$: small positive number
    $$G_{dw} := G_{dw} + dw^2, G_{db} := G_{db} + db^2$$
    $$w := w - \alpha \frac{dw}{\sqrt{G_{dw} + \epsilon}}, b := b - \alpha \frac{db}{\sqrt{G_{db} + \epsilon}}$$
- RMSprop
  - Hyperparameter
    - $\alpha$: learning rate (default 0.001)
    - $\beta$: decay (default 0.9)
    - $\epsilon$: small positive number (default $10^{-8}$)
    $$S_{dw} = \beta S_{dw} + (1 - \beta) dw^2, S_{db} = \beta S_{db} + (1 - \beta) db^2$$
    $$w := w - \alpha \frac{dw}{\sqrt{S_{dw} + \epsilon}}, b := b - \alpha \frac{db}{\sqrt{S_{db} + \epsilon}}$$
- Adam (Adaptive Moment Estimation) optimizer
  - Hyperparameter
    - $\alpha$: learning rate
    - $\beta_1$: decay for $dw$
    - $\beta_2$: decay for $dw^2$
    - $\epsilon$: small positive number (default $10^{-8}$)
    $$V_{dw} = \beta_1 V_{dw} + (1 - \beta_1) dw, V_{db} = \beta_1 V_{db} + (1 - \beta_1) db$$
    $$S_{dw} = \beta_2 S_{dw} + (1 - \beta_2) dw^2, S_{db} = \beta_2 S_{db} + (1 - \beta_2) db^2$$
    $$V_{dw}^{corrected} = \frac{V_{dw}}{1 - \beta_1^t}, V_{db}^{corrected} = \frac{V_{db}}{1 - \beta_1^t}$$
    $$S_{dw}^{corrected} = \frac{S_{dw}}{1 - \beta_2^t}, S_{db}^{corrected} = \frac{S_{db}}{1 - \beta_2^t}$$
    $$w := w - \alpha \frac{V_{dw}^{corrected}}{\sqrt{S_{dw}^{corrected} + \epsilon}}, b := b - \alpha \frac{V_{db}^{corrected}}{\sqrt{S_{db}^{corrected} + \epsilon}}$$
- Learning rate decay:
    $$\alpha = \frac{1}{1 + \beta * \text{epoch num}} \alpha_0$$
    $$\alpha = \frac{k}{\sqrt{\text{epoch num}}} \alpha_0$$
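
The update rules above can be sketched as small NumPy step functions. This is an illustration under stated assumptions: the function names, the `state` dictionary, and the quadratic test problem are hypothetical, and the Adam defaults `beta1=0.9`, `beta2=0.999` are commonly used values that the notes do not state. As in the formulas above, $\epsilon$ is placed inside the square root.

```python
import numpy as np

def momentum_step(w, dw, state, alpha=0.01, beta=0.9):
    state["V"] = beta * state["V"] + (1 - beta) * dw
    return w - alpha * state["V"]

def adagrad_step(w, dw, state, alpha=0.01, eps=1e-8):
    state["G"] = state["G"] + dw ** 2  # running sum of squared gradients
    return w - alpha * dw / np.sqrt(state["G"] + eps)

def rmsprop_step(w, dw, state, alpha=0.001, beta=0.9, eps=1e-8):
    state["S"] = beta * state["S"] + (1 - beta) * dw ** 2
    return w - alpha * dw / np.sqrt(state["S"] + eps)

def adam_step(w, dw, state, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    state["t"] += 1
    state["V"] = beta1 * state["V"] + (1 - beta1) * dw
    state["S"] = beta2 * state["S"] + (1 - beta2) * dw ** 2
    v_hat = state["V"] / (1 - beta1 ** state["t"])  # bias correction
    s_hat = state["S"] / (1 - beta2 ** state["t"])
    return w - alpha * v_hat / np.sqrt(s_hat + eps)

def decayed_lr(alpha0, epoch, beta=0.1):
    # alpha = alpha0 / (1 + beta * epoch); here beta is the decay rate
    return alpha0 / (1 + beta * epoch)

def new_state(w):
    zeros = np.zeros_like(w)
    return {"V": zeros.copy(), "S": zeros.copy(), "G": zeros.copy(), "t": 0}

def minimize(step_fn, steps=500, **kwargs):
    # Toy problem (assumed): f(w) = ||w||^2 / 2, whose gradient is w itself.
    w = np.array([1.0, -2.0])
    state = new_state(w)
    for _ in range(steps):
        w = step_fn(w, w.copy(), state, **kwargs)
    return w

w_momentum = minimize(momentum_step, alpha=0.1)
```

On the first Adam step the bias-corrected ratio is close to the sign of the gradient, so each weight moves by roughly `alpha` regardless of the gradient's scale.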

$\textbf{Batch Size vs Accuracy vs GPU}$

| Batch Size | Convergence | Gradient estimate | Generalization | GPU memory |
| --- | --- | --- | --- | --- |
| Small | Slower | Noisier | Often better | Less |
| Large | Faster | More accurate | Often worse | More |

$\textbf{Training Slow}$
- Large sample size: train on mini-batches and tune the batch size
- Small gradients: increase the learning rate

$\textbf{Batch Normalization (BN)}$
- Steps:
  1. compute mini-batch mean: $\mu_{B} = \frac{1}{m} \sum_{i=1}^m x_i$
  2. compute mini-batch variance: $\sigma_{B}^2 = \frac{1}{m} \sum_{i=1}^m (x_i - \mu_B)^2$
  3. normalize: $\hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_{B}^2 + \epsilon}}$
  4. scale and shift: $y_i = \gamma \hat{x}_i + \beta$

- Advantages:
  - keeps the distribution of each layer's inputs consistent (mean 0, variance 1 before scale and shift)
  - allows each layer to learn more independently of the other layers
  - allows a higher learning rate
  - reduces overfitting: the mini-batch statistics add some noise to each hidden layer
  
- Disadvantages:
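  - needs a large enough batch size for stable mean and variance estimates
  - not well suited to RNNs
    - an RNN's effective depth varies with the sequence length
    - hard to normalize the same layer consistently across time steps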
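
The four BN steps listed above can be sketched for a mini-batch of shape `(m, n_features)`. This covers the training-time forward pass only; the function name and the default `eps=1e-5` are assumptions, and the running statistics typically used at inference time are left out.

```python
import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    """Training-time batch normalization over axis 0 (the batch axis)."""
    mu = x.mean(axis=0)                    # 1. mini-batch mean
    var = x.var(axis=0)                    # 2. mini-batch variance (1/m)
    x_hat = (x - mu) / np.sqrt(var + eps)  # 3. normalize
    return gamma * x_hat + beta            # 4. scale and shift

rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=3.0, size=(64, 4))
out = batch_norm_forward(x, gamma=np.ones(4), beta=np.zeros(4))
```

With `gamma = 1` and `beta = 0`, each feature of `out` has mean close to 0 and variance close to 1; learned values of `gamma` and `beta` let the network undo the normalization where that helps.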

$\textbf{Reduce Overfitting}$:
- Regularization
  - L2
    $$\tilde{\mathcal{L}}(w, X, y) = \mathcal{L}(w, X, y) + \frac{\lambda}{2} w^T w$$
    $$\frac{\partial \tilde{\mathcal{L}}(w, X, y)}{\partial w} = \frac{\partial \mathcal{L}(w, X, y)}{\partial w} + \lambda w$$
  - L1
    $$\tilde{\mathcal{L}}(w, X, y) = \mathcal{L}(w, X, y) + \lambda ||w||_1$$
    $$\frac{\partial \tilde{\mathcal{L}}(w, X, y)}{\partial w} = \frac{\partial \mathcal{L}(w, X, y)}{\partial w} + \lambda sign(w)$$
- Early Stopping
  - with a fixed number of iterations, a small learning rate has an effect similar to early stopping
  - training stops before converging to a local optimum of the training loss
  - can reduce training cost
- Dropout
  - randomly deactivate a fraction of the neurons during training (similar to bagging)
  - dropout is not applied at test time
- Reduce the number of hidden layers
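
A few of the techniques above can be sketched in NumPy. Everything here is illustrative: the function names are hypothetical, the dropout variant shown is inverted dropout (activations are rescaled by `1 / keep_prob` during training so nothing changes at test time), and the early-stopping rule assumes a list of validation losses with a `patience` threshold, neither of which is specified in the notes.

```python
import numpy as np

def l2_penalty_grad(w, lam):
    # Gradient of (lam / 2) * w^T w
    return lam * w

def l1_penalty_grad(w, lam):
    # (Sub)gradient of lam * ||w||_1
    return lam * np.sign(w)

def dropout(a, keep_prob, rng, training=True):
    if not training:
        return a  # no dropout at test time
    mask = rng.random(a.shape) < keep_prob
    return a * mask / keep_prob  # inverted dropout keeps the expected value

def early_stopping_epoch(val_losses, patience=3):
    """Return the epoch index at which training would stop."""
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, wait = loss, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return len(val_losses) - 1

rng = np.random.default_rng(0)
activations = np.ones(1000)
dropped = dropout(activations, keep_prob=0.8, rng=rng)
stop_at = early_stopping_epoch([1.0, 0.8, 0.7, 0.75, 0.72, 0.71, 0.9], patience=3)
```

The penalty gradients are simply added to the gradient of the unregularized loss before the weight update, matching the L2 and L1 formulas above.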