# Training
---

# Loss Functions

# Softmax  

\begin{equation}
    \hat{y_i} = \frac{e^{z_i}}{\sum_{j=1}^N e^{z_j}} 
\end{equation}

* Outputs sum to one
* Represents a probability distribution over discrete, mutually exclusive alternatives

In [None]:
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))  # subtract the max for numerical stability; the result is unchanged
    return e / np.sum(e)

# Softmax Derivative

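\begin{equation}
    \frac{\partial \hat{y_i}}{\partial z_i} = \hat{y_i} ( 1 - \hat{y_i})
\end{equation}

For $j \neq i$, the outputs are coupled through the normalizing sum:

\begin{equation}
    \frac{\partial \hat{y_j}}{\partial z_i} = - \hat{y_j} \hat{y_i}
\end{equation}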
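The derivative can be checked numerically. The sketch below builds the full softmax Jacobian, whose diagonal is $\hat{y_i}(1 - \hat{y_i})$, and compares it with central finite differences. The helper names `softmax_jacobian` and `numerical_jacobian` and the test vector are illustrative, not part of the original notebook.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))  # shift by the max for numerical stability
    return e / np.sum(e)

def softmax_jacobian(z):
    # J[i, j] = d y_hat_i / d z_j = y_hat_i * (delta_ij - y_hat_j)
    y_hat = softmax(z)
    return np.diag(y_hat) - np.outer(y_hat, y_hat)

def numerical_jacobian(f, z, eps=1e-6):
    # Central finite differences, one input dimension at a time
    n = z.size
    J = np.zeros((n, n))
    for j in range(n):
        dz = np.zeros(n)
        dz[j] = eps
        J[:, j] = (f(z + dz) - f(z - dz)) / (2 * eps)
    return J

z = np.array([1.0, 2.0, 0.5])  # arbitrary illustrative logits
J_analytic = softmax_jacobian(z)
J_numeric = numerical_jacobian(softmax, z)
```

If the formula is right, `J_analytic` and `J_numeric` should agree to within finite-difference error.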

# Cross-entropy Cost Function

\begin{equation}
    J = - \sum_j y_j \log \hat{y_j}
\end{equation}

\begin{equation}
    \frac{\partial J}{\partial z_i} = \sum_j \frac{\partial J}{\partial \hat{y_j}} \frac{\partial \hat{y_j}}{\partial z_i} = \hat{y_i} - y_i
\end{equation}

The last step uses $\sum_j y_j = 1$.

In [1]:
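def cost_grad(y_hat, y):
    # Gradient of the cross-entropy cost w.r.t. the logits z (with y_hat = softmax(z)), not the cost itself
    return y_hat - y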
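A quick way to gain confidence in the result $\hat{y_i} - y_i$ is a finite-difference gradient check. This is a minimal sketch: the logits, the one-hot target and the helper `cross_entropy` are made up for illustration.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))  # shift by the max for numerical stability
    return e / np.sum(e)

def cross_entropy(y_hat, y):
    # J = -sum_j y_j * log(y_hat_j)
    return -np.sum(y * np.log(y_hat))

z = np.array([0.2, -1.0, 1.5])  # logits (arbitrary illustrative values)
y = np.array([0.0, 0.0, 1.0])   # one-hot target

analytic = softmax(z) - y       # claimed gradient dJ/dz

# Central finite differences on J(softmax(z)), one logit at a time
eps = 1e-6
numeric = np.zeros_like(z)
for i in range(z.size):
    dz = np.zeros_like(z)
    dz[i] = eps
    j_plus = cross_entropy(softmax(z + dz), y)
    j_minus = cross_entropy(softmax(z - dz), y)
    numeric[i] = (j_plus - j_minus) / (2 * eps)
```

The two vectors `analytic` and `numeric` should match closely.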

# Optimizers

# Mini-batch SGD Loop:
1. Sample a batch of data
2. Forward prop it through the graph (network), get loss
3. Backprop to calculate the gradients
4. Update the parameters using the gradient
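The four steps above can be sketched as a small NumPy loop. The toy linear-regression data, the mean squared error loss, and the values of `learning_rate` and `batch_size` are assumptions chosen only to keep the example self-contained.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: y = 3x - 1 plus a little noise (made up for illustration)
X = rng.normal(size=(256, 1))
y = 3.0 * X[:, 0] - 1.0 + 0.01 * rng.normal(size=256)

w, b = 0.0, 0.0          # parameters of the model pred = w * x + b
learning_rate = 0.1
batch_size = 32

for step in range(500):
    # 1. Sample a batch of data
    idx = rng.choice(len(X), size=batch_size, replace=False)
    xb, yb = X[idx, 0], y[idx]

    # 2. Forward pass: predictions and mean squared error loss
    pred = w * xb + b
    loss = np.mean((pred - yb) ** 2)

    # 3. Backward pass: gradients of the loss w.r.t. w and b
    grad_w = np.mean(2 * (pred - yb) * xb)
    grad_b = np.mean(2 * (pred - yb))

    # 4. Update the parameters using the gradients
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b
```

After the loop, `w` and `b` should be close to the values used to generate the data (3 and -1).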

# SGD
* Because each step uses a minibatch, the gradient is a noisy estimate of the full-data gradient

\begin{equation}
    \theta_{t+1} = \theta_t - \alpha \nabla_{\theta} J(\theta_t)
\end{equation}

\begin{equation}
    J(\theta) = \frac{1}{N} \sum_{i=1}^{N} J_i (x_i, y_i, \theta)
\end{equation}

\begin{equation}
    \nabla_{\theta} J(\theta) = \frac{1}{N} \sum_{i=1}^{N} \nabla_{\theta}  J_i (x_i, y_i, \theta)
\end{equation}

# Momentum
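* Accumulate a running "velocity" of past gradients that keeps pushing in a consistent direction
* Can help roll past shallow local minima and saddle points

The momentum coefficient is typically $\rho = 0.9$.

\begin{equation}
    v_{t+1} = \rho v_t + \nabla_{\theta} J(\theta_t)
\end{equation}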

\begin{equation}
    \theta_{t+1} = \theta_t - \alpha v_{t+1}
\end{equation}
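The two update equations translate almost line for line into code. Below is a minimal sketch on a toy quadratic cost; the cost function, the starting point and $\alpha = 0.01$ are assumptions for illustration only.

```python
import numpy as np

def grad_J(theta):
    # Gradient of the toy cost J(theta) = 0.5 * (10 * theta_0^2 + theta_1^2)
    return np.array([10.0, 1.0]) * theta

theta = np.array([1.0, 1.0])
v = np.zeros_like(theta)   # velocity, starts at zero
rho = 0.9                  # momentum coefficient
alpha = 0.01               # learning rate (assumed value)

for t in range(500):
    v = rho * v + grad_J(theta)   # v_{t+1} = rho * v_t + grad J(theta_t)
    theta = theta - alpha * v     # theta_{t+1} = theta_t - alpha * v_{t+1}
```

On this cost the iterates spiral in toward the minimum at the origin.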

# AdaGrad
* Element-wise scaling of the gradient based on the running sum of squared gradients in each dimension
* Adaptive learning rate

In [None]:
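# Assumes theta, learning_rate and compute_gradients are defined elsewhere
grad_squared = 0
while True:
    dtheta = compute_gradients(theta)
    # Accumulate the sum of squared gradients in each dimension
    grad_squared += dtheta * dtheta

    # Per-dimension step; 1e-7 avoids division by zero
    theta -= learning_rate * dtheta / (np.sqrt(grad_squared) + 1e-7)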

# RMSProp

In [None]:
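# Assumes theta, learning_rate and compute_gradients are defined elsewhere;
# decay_rate (typically around 0.9) is a name introduced here
grad_squared = 0
while True:
    dtheta = compute_gradients(theta)
    # Leaky running average of squared gradients instead of AdaGrad's full sum
    grad_squared = decay_rate * grad_squared + (1 - decay_rate) * dtheta * dtheta

    theta -= learning_rate * dtheta / (np.sqrt(grad_squared) + 1e-7)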
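AdaGrad and RMSProp differ only in how the squared-gradient accumulator is updated: AdaGrad keeps a full sum, while RMSProp keeps an exponentially decaying average. The runnable sketch below makes that difference explicit on a toy quadratic. The toy `compute_gradients`, the helper `run`, the step count and `decay_rate = 0.9` are all assumptions for illustration.

```python
import numpy as np

def compute_gradients(theta):
    # Gradient of the toy cost J(theta) = 0.5 * (10 * theta_0^2 + theta_1^2)
    return np.array([10.0, 1.0]) * theta

def adagrad_accumulate(grad_squared, dtheta):
    # Full sum of squared gradients: the effective step size only shrinks
    return grad_squared + dtheta * dtheta

def rmsprop_accumulate(grad_squared, dtheta, decay_rate=0.9):
    # Decaying average of squared gradients: old gradients are forgotten
    return decay_rate * grad_squared + (1 - decay_rate) * dtheta * dtheta

def run(accumulate, steps=200, learning_rate=0.1):
    theta = np.array([1.0, 1.0])
    grad_squared = np.zeros_like(theta)
    for _ in range(steps):
        dtheta = compute_gradients(theta)
        grad_squared = accumulate(grad_squared, dtheta)
        theta = theta - learning_rate * dtheta / (np.sqrt(grad_squared) + 1e-7)
    return theta

theta_adagrad = run(adagrad_accumulate)
theta_rmsprop = run(rmsprop_accumulate)
```

Both variants move each coordinate toward the minimum at the origin at the same normalized rate, despite the tenfold difference in curvature between the two dimensions.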

# Adam
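Adam can be read as momentum combined with an RMSProp-style squared-gradient average, plus a bias correction for the zero-initialized moments. The following is a minimal sketch, not a reference implementation: the toy `compute_gradients`, the starting point, the step count and `learning_rate = 0.05` are assumptions, while `beta1 = 0.9`, `beta2 = 0.999` and `eps = 1e-8` are the commonly used defaults.

```python
import numpy as np

def compute_gradients(theta):
    # Gradient of the toy cost J(theta) = 0.5 * (10 * theta_0^2 + theta_1^2)
    return np.array([10.0, 1.0]) * theta

theta = np.array([1.0, 1.0])
learning_rate = 0.05                  # assumed value for this toy problem
beta1, beta2, eps = 0.9, 0.999, 1e-8  # commonly used defaults

m = np.zeros_like(theta)  # first moment: running mean of gradients (momentum-like)
v = np.zeros_like(theta)  # second moment: running mean of squared gradients (RMSProp-like)

for t in range(1, 1001):
    dtheta = compute_gradients(theta)
    m = beta1 * m + (1 - beta1) * dtheta
    v = beta2 * v + (1 - beta2) * dtheta * dtheta
    # Bias correction: m and v start at zero and are biased toward it early on
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    theta = theta - learning_rate * m_hat / (np.sqrt(v_hat) + eps)
```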

# Regularization
* $L_1, L_2$ weight penalties
* Dropout
* Batch Normalization
* Data Augmentation

# Dropout

# Batch Norm

* “you want zero-mean unit-variance activations? just make them so.”
* Compute mean and variance of each dimension
* Normalize

\begin{equation}
\hat{x}^{(l)} = \frac{x^{(l)} - E[x^{(l)}]}{\sqrt{Var[x^{(l)}]}}
\end{equation}

[Ioffe and Szegedy, 2015]

* Also acts as a mild regularizer
* Better gradient flow through network
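The normalization step can be sketched in a few lines of NumPy. This covers only the training-time normalization shown in the equation: the learned scale and shift from the paper are omitted, and the small `eps` inside the square root is an addition for numerical stability. The function name `batch_norm_forward` and the random input are illustrative.

```python
import numpy as np

def batch_norm_forward(x, eps=1e-5):
    # x has shape (batch_size, num_features); statistics are computed per feature
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    return (x - mean) / np.sqrt(var + eps)

rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=3.0, size=(64, 4))  # activations far from zero mean, unit variance
x_hat = batch_norm_forward(x)
```

Each column of `x_hat` should have mean close to zero and variance close to one.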

# Data Augmentation

# Transfer Learning
* Take a pretrained network (e.g. one trained to classify cats)
* Use it for a new task (e.g. classifying dogs)

* These are similar tasks (cats and dogs share many visual features)
* Needs much less data to train

| | Similar dataset | Different dataset |
| ----- |:-----:|:-----:|
| Small data | Train a new top layer | Bummer |
| Big data | Finetune a couple of layers | Finetune most layers |

# Tips
* Watch the loss
* Check for overfitting