#### Recap of Transfer Learning
- Take a network trained in one context and reuse it in another with little additional training
- If you have a small dataset, just take Inception v3 and add another linear layer on top of it.
    - Intuition: if the tasks are similar, the features Inception learned on ImageNet should be useful for the new task too.

#### Ensemble Methods
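- If the models' errors are partially independent, they are less likely to all make the same mistake
- Train multiple models and average their results together
- A bit more expensive though
- With $k$ models whose errors are uncorrelated, the variance of the averaged error goes down by a factor of $k$ (it is scaled by $\frac{1}{k}$).
- In general, assuming each error has mean $0$ (i.e. is unbiased), every model has the same error variance $E[\epsilon_i^2]$, and every pair has the same covariance $E[\epsilon_i\epsilon_j]$ for $i \neq j$: $$E\left[\left(\frac{1}{k}\sum_{i=1}^{k} \epsilon_i\right)^2\right] = \frac{1}{k}E[\epsilon_i^2] + \frac{k-1}{k}E[\epsilon_i\epsilon_j]$$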
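
The identity above can be checked numerically. The sketch below assumes every error has the same variance `v` and every pair of errors the same covariance `c`; the function names and the symbols `v` and `c` are illustrative choices, not from the lecture.

```python
import numpy as np

def ensemble_mse_formula(k, v, c):
    """Expected squared error of the averaged prediction: v/k + (k-1)/k * c."""
    return v / k + (k - 1) / k * c

def ensemble_mse_exact(k, v, c):
    """Same quantity computed directly as (1/k^2) * 1^T Sigma 1."""
    # Covariance matrix with v on the diagonal and c everywhere else.
    sigma = np.full((k, k), c, dtype=float)
    np.fill_diagonal(sigma, v)
    ones = np.ones(k)
    return ones @ sigma @ ones / k**2

for k in (1, 5, 25):
    # Uncorrelated errors (c = 0) versus perfectly correlated errors (c = v).
    print(k, ensemble_mse_formula(k, v=1.0, c=0.0), ensemble_mse_formula(k, v=1.0, c=1.0))
```

With `c = 0` the error shrinks as `v / k`; with `c = v` (perfectly correlated models) averaging gives no improvement at all.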

#### Bagging
- Ensemble method, for regularization
- Construct $k$ datasets by sampling with replacement
- train $k$ different models
- Neural nets trained on the same dataset still tend to produce partially independent errors, because of different initializations, hyperparameters, etc.
- More expensive though: training time can be very large, and prediction time also goes up, unless you run the models in parallel
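
A minimal sketch of the bagging procedure, assuming a least-squares linear model stands in for "train a model" (the lecture is about neural nets; the linear fit just keeps the example short). All function names and the synthetic data are hypothetical.

```python
import numpy as np

def bootstrap_indices(n, k, rng):
    """Return k index arrays, each sampling n rows with replacement."""
    return [rng.integers(0, n, size=n) for _ in range(k)]

def fit_linear(X, y):
    """Least-squares fit; a stand-in for training one ensemble member."""
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return w

def bagged_predict(models, X):
    """Average the predictions of all ensemble members."""
    return np.mean([X @ w for w in models], axis=0)

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=200)

# Train one model per bootstrap resample, then average their predictions.
models = [fit_linear(X[idx], y[idx]) for idx in bootstrap_indices(len(X), k=10, rng=rng)]
y_hat = bagged_predict(models, X)
```

Each of the `k` fits is independent of the others, which is why training and prediction parallelize easily.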

#### Other ideas
- Huang et al. used cosine annealing of the learning rate to reach several local minima in one training run, then averaged the predictions of the models saved at those minima

#### Dropout
- Method for generalization/regularization
- Approximation of bagging procedure for exponentially many models
- Procedure: sample 100 binary masks (each entry is drawn as 1 with probability $p$)
- Apply the masks to all the units
- Basically sets a proportion $1-p$ of all the activations to $0$. 
- approximating sparse structure
- "During training, dropout samples from an exponential number of different “thinned” networks" - dropout paper.
- Acts as an approximation of combining exponentially many ensemble learning/model combination, but with highly correlated networks
- At test time, don't apply dropout; instead scale the activations by the keep probability $p$ (i.e. multiply by $p$), so they match their expected value during training
- In expectation, this is equivalent to dividing the activations by $p$ during training and leaving them unscaled at test time.
- Inverted dropout: divide the mask by $p$ while training
- "By doing this scaling, $2^n$ networks with shared weights can be combined into a single neural network to be used at test time" - dropout paper
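
A minimal sketch of inverted dropout on a batch of activations, assuming `p_keep` is the probability of keeping a unit (the $p$ used above). The function name and the all-ones test input are illustrative.

```python
import numpy as np

def dropout_forward(x, p_keep, rng, train=True):
    """Inverted dropout: p_keep is the probability of keeping a unit."""
    if not train:
        return x  # no mask and no rescaling at test time
    mask = rng.random(x.shape) < p_keep  # each entry is 1 with probability p_keep
    return x * mask / p_keep  # rescale so the expected activation is unchanged

rng = np.random.default_rng(0)
x = np.ones((1000, 100))
out = dropout_forward(x, p_keep=0.8, rng=rng)
print(out.mean())          # close to 1.0: the expectation is preserved
print((out == 0).mean())   # close to 1 - p_keep: the fraction of zeroed units
```

Because the rescaling happens during training, the test-time forward pass needs no change, which is the practical appeal of the inverted form.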

#### How can we make SGD even better? 
- Momentum, Adam, RMSProp, adaptive moments, second order methods
- Regular SGD: $$\theta \leftarrow{} \theta - \epsilon * \nabla_\theta J(\theta)$$
- Gradients are stochastic because they are a function of the sampled training data
- Small batch sizes can act like a regularizer, because they introduce random noise into the training process.
- The iterates converge noisily to a minimum
- Stochasticity is good because it can also get you out of bad local minima
- A large learning rate causes the updates to zigzag
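
The plain SGD update can be sketched on a toy least-squares problem. The data, batch size, learning rate, and function names below are all assumptions made for illustration; the targets are noiseless so the minimum is known exactly.

```python
import numpy as np

def sgd_step(theta, grad, lr):
    """theta <- theta - lr * grad"""
    return theta - lr * grad

def minibatch_grad(theta, X, y, idx):
    """Gradient of the mean squared error on the rows selected by idx."""
    Xb, yb = X[idx], y[idx]
    return 2.0 * Xb.T @ (Xb @ theta - yb) / len(idx)

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))
true_theta = np.array([3.0, -1.0])
y = X @ true_theta  # noiseless targets, so true_theta is the exact minimizer

theta = np.zeros(2)
for step in range(500):
    idx = rng.integers(0, len(X), size=16)  # a random minibatch makes the gradient stochastic
    theta = sgd_step(theta, minibatch_grad(theta, X, y, idx), lr=0.05)
```

Each minibatch gives a different gradient estimate, which is the source of the noise discussed above.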

#### Momentum
- Average gradient steps from previous iterations
- Maintain a running mean of the gradients, which is then used to update the parameters
- Set $v = 0, \alpha \in [0,1)$. Momentum update: $$v \leftarrow{} \alpha * v - \epsilon*g$$ $$\theta \leftarrow{} \theta + v$$ Basically it's $\theta \leftarrow{} \theta + (\alpha*v - \epsilon * g)$
- This implements an exponentially weighted sum (a decaying average) of past gradients
- Momentum can push you out of local optima, in particular local minima that are steep and narrow rather than wide: even where the gradient is $0$ the momentum is not zero, so it will still do updates
- Tends to converge to shallower (flatter) optima, i.e. places that have low local curvature
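
A minimal sketch of the momentum update on a toy quadratic objective. The objective $J(\theta) = \frac{1}{2}\theta^T A \theta$, the values of `lr` and `alpha`, and the function names are assumptions chosen for illustration.

```python
import numpy as np

def momentum_step(theta, v, grad, lr, alpha):
    """v <- alpha * v - lr * grad;  theta <- theta + v"""
    v = alpha * v - lr * grad
    theta = theta + v
    return theta, v

# Toy objective J(theta) = 0.5 * theta^T A theta, with very different curvature per axis.
A = np.diag([1.0, 10.0])

def grad_fn(theta):
    return A @ theta

theta = np.array([1.0, 1.0])
v = np.zeros_like(theta)  # velocity starts at zero
for _ in range(200):
    theta, v = momentum_step(theta, v, grad_fn(theta), lr=0.05, alpha=0.9)
```

Note that `v` stays nonzero even at a point where `grad` happens to be zero, so the parameters keep moving.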

#### Nesterov Momentum
- Evaluate the gradient at $\theta + \alpha * v$: $$v \leftarrow{} \alpha * v - \epsilon \nabla_\theta J(\theta + \alpha * v) $$ $$ \theta \leftarrow{} \theta + v$$
- Intuition: compute the gradient with respect to what the parameters would be if you did only the momentum update.
- Interpretation: since 
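
The Nesterov update differs from plain momentum only in where the gradient is evaluated. A minimal sketch, assuming the same kind of toy quadratic objective; `grad_fn`, the matrix `A`, and the hyperparameter values are illustrative.

```python
import numpy as np

def nesterov_step(theta, v, grad_fn, lr, alpha):
    """Evaluate the gradient at the look-ahead point theta + alpha * v."""
    lookahead = theta + alpha * v  # where the momentum step alone would take us
    v = alpha * v - lr * grad_fn(lookahead)
    return theta + v, v

# Toy objective J(theta) = 0.5 * theta^T A theta.
A = np.diag([1.0, 10.0])

def grad_fn(theta):
    return A @ theta

theta = np.array([1.0, 1.0])
v = np.zeros_like(theta)
for _ in range(200):
    theta, v = nesterov_step(theta, v, grad_fn, lr=0.05, alpha=0.9)
```

The step function takes `grad_fn` rather than a precomputed gradient, because the gradient must be evaluated at a point that depends on the current velocity.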


#### Adaptive Learning Rates
- Adaptive Gradient (Adagrad): a form of SGD where the LR is decreased through division by historical gradient norms.
- Let $a = 0$ initially. Then while learning, compute the gradient $g$, and update $a \leftarrow{} a + g\circ g$, and the gradient step is $$\theta \leftarrow{} \theta - \frac{\epsilon}{\sqrt{a} + \sigma} \circ \nabla_{\theta} J$$ where the square root and division are elementwise and $\sigma$ is a small constant that avoids division by zero
- Basically decrease the learning rate of each parameter according to the size of its previous gradients, instead of decaying it on an arbitrary schedule
- Issue: it remembers all of your past gradients, so if you ever had a huge gradient, the learning rate gets scaled down too much and never recovers -> this is why Adagrad tends to slow down later in training
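
A minimal sketch of the Adagrad update, using the symbols from the notes (`a` for the accumulator, `sigma` for the small constant). The default value of `sigma` and the constant-gradient demo are assumptions for illustration.

```python
import numpy as np

def adagrad_step(theta, a, grad, lr, sigma=1e-8):
    """a <- a + g*g;  theta <- theta - lr / (sqrt(a) + sigma) * g  (all elementwise)"""
    a = a + grad * grad
    theta = theta - lr / (np.sqrt(a) + sigma) * grad
    return theta, a

theta = np.array([0.0])
a = np.zeros(1)
step_sizes = []
for _ in range(4):
    # Feed the same gradient every time to isolate the effect of the accumulator.
    new_theta, a = adagrad_step(theta, a, grad=np.array([1.0]), lr=0.1)
    step_sizes.append(float(np.abs(new_theta - theta)[0]))
    theta = new_theta
print(step_sizes)  # each step is smaller than the last, even though the gradient is constant
```

With a constant gradient of 1, step $t$ has size roughly `lr / sqrt(t)`, which shows the monotone slowdown described above.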

#### RMSProp
- Forget historical gradients
- Augment Adagrad by making the accumulator an exponentially weighted moving average, which forgets the initial gradients
- Just scale $a$ with $\beta$ and $g \circ g$ with $1 - \beta$: $a \leftarrow{} \beta * a + (1 - \beta) * g \circ g$. Here $\beta$ being small basically means that older gradients are forgotten faster, so they matter that much less when dividing the LR
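
A minimal sketch of the RMSProp update, assuming the same parameter step as Adagrad with only the accumulator changed. The default values of `beta` and `sigma` and the gradient-spike demo are illustrative assumptions.

```python
import numpy as np

def rmsprop_step(theta, a, grad, lr, beta=0.9, sigma=1e-8):
    """a <- beta * a + (1 - beta) * g*g;  theta <- theta - lr / (sqrt(a) + sigma) * g"""
    a = beta * a + (1 - beta) * grad * grad
    theta = theta - lr / (np.sqrt(a) + sigma) * grad
    return theta, a

theta = np.array([0.0])
a = np.zeros(1)

# One huge gradient, followed by many ordinary ones.
theta, a = rmsprop_step(theta, a, grad=np.array([100.0]), lr=0.01)
a_after_spike = a.copy()
for _ in range(50):
    theta, a = rmsprop_step(theta, a, grad=np.array([1.0]), lr=0.01)
print(a_after_spike, a)  # the accumulator decays back toward the recent squared gradients
```

Under Adagrad the same spike would stay in the accumulator forever; here it is forgotten at a rate set by `beta`.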

#### Adam
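- Adam with no bias correction: keep exponentially weighted moving averages of the gradient (first moment) and of the squared gradient (second moment); step along the first moment, divided by the square root of the second
- Adam with bias correction: both averages start at $0$ and so are biased toward $0$ early on; correct this by dividing each by $1 - \beta^t$ (using its own decay rate $\beta$), a factor that approaches $1$ exponentially with the timestep $t$.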
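
A minimal sketch of one Adam step with and without bias correction. The names `m`, `s`, `beta1`, `beta2` and the default hyperparameter values are assumptions (commonly used defaults), not values given in these notes; `sigma` is the small stabilizing constant, as in the Adagrad section.

```python
import numpy as np

def adam_step(theta, m, s, grad, t, lr=1e-3, beta1=0.9, beta2=0.999,
              sigma=1e-8, bias_correction=True):
    """One Adam update; t is the 1-based timestep."""
    m = beta1 * m + (1 - beta1) * grad         # first moment: average of gradients
    s = beta2 * s + (1 - beta2) * grad * grad  # second moment: average of squared gradients
    if bias_correction:
        # Both averages start at 0, so early estimates are too small; undo that.
        m_hat = m / (1 - beta1**t)
        s_hat = s / (1 - beta2**t)
    else:
        m_hat, s_hat = m, s
    theta = theta - lr * m_hat / (np.sqrt(s_hat) + sigma)
    return theta, m, s

g = np.array([0.5])
zeros = np.zeros(1)
theta_bc, _, _ = adam_step(zeros, zeros, zeros, g, t=1)
theta_raw, _, _ = adam_step(zeros, zeros, zeros, g, t=1, bias_correction=False)
print(theta_bc, theta_raw)  # the two variants take different first steps
```

On the first step the corrected moments equal the gradient and its square, so the step size is about `lr` regardless of the gradient's scale; without correction the first step depends on the ratio `(1 - beta1) / sqrt(1 - beta2)`.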
