# Faster Optimizers
- **IF NONE OF THESE WORK**: Use the TensorFlow Model Optimization Toolkit (TF-MOT), which provides a pruning API for producing sparse models (see Sparse Models)
## Momentum optimization
- Good when the inputs are not normalized
- inspiration: a bowling ball rolling down a gentle slope. It starts out slowly but picks up speed until it reaches terminal velocity
- vs. regular Gradient Descent:
    - regular GD takes small regular steps down the slope, so the algorithm takes more time to reach the bottom
    - regular GD does not care about where the previous gradients were
        - if the local gradient is tiny, it goes very slowly
- Cares about what the previous gradients were
    - at each iteration, subtracts the local gradient (multiplied by the learning rate *eta*) from the *momentum vector* **m**, then updates the weights by adding this momentum vector
        - the gradient is used for acceleration, not for speed
    - Beta hyperparameter, called "momentum", acts like friction and prevents the momentum from becoming too large (must be set between 0 and 1)
        - a typical value is 0.9
        

<img src="images/MomentumEquation.jpeg" width=360/>

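### terminal velocity when the gradient remains constant
- if the gradient remains constant, the terminal velocity (the maximum size of the weight updates) is equal to that gradient multiplied by the learning rate *eta* multiplied by 1/(1-Beta), ignoring the sign
    - e.g. with Beta = 0.9, the terminal velocity is 10 times the gradient times the learning rate, so momentum optimization ends up going 10 times faster than regular GD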
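
The update rule and the terminal velocity claim can be checked numerically. The snippet below is a minimal pure-Python sketch, not the Keras implementation; the function name `momentum_step` and the constant gradient value are assumptions made up for illustration.

```python
def momentum_step(theta, m, grad, lr=0.01, beta=0.9):
    """One momentum update: m <- beta*m - lr*grad, then theta <- theta + m."""
    m = beta * m - lr * grad
    return theta + m, m

# Hypothetical constant gradient, as on a perfectly straight slope.
grad, lr, beta = 2.0, 0.01, 0.9
theta, m = 0.0, 0.0
for _ in range(500):
    theta, m = momentum_step(theta, m, grad, lr, beta)

terminal_velocity = lr * grad / (1 - beta)  # eta * gradient * 1/(1 - Beta)
plain_gd_step = lr * grad                   # step size of regular Gradient Descent

print(abs(m), terminal_velocity, terminal_velocity / plain_gd_step)
```

After enough iterations the size of the momentum vector settles at the terminal velocity, which is 1/(1-Beta) times the step regular GD would take.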

In [2]:
from tensorflow import keras

optimizer = keras.optimizers.SGD(learning_rate=0.01, momentum=0.9)  # `lr` is a deprecated alias of `learning_rate`

## Nesterov Accelerated Gradient (Nesterov momentum optimization)
- measures the gradient of the cost function not at the local position *theta* but slightly ahead in the direction of the momentum, at *theta* + *Beta* **m**
- a variant to momentum optimization
    - almost always faster than vanilla momentum opt.
    
<img src="images/NesterovMomentumEquation.jpeg" width=360/>
- **Why it works**
    - the momentum vector will be pointing in the right direction (toward the optimum) in general
       - more accurate to use the gradient measured a bit farther in the direction of the momentum vector than the gradient at the original position
       - NAG is generally faster than regular momentum optimization    
<img src="images/NAGGraph.jpeg" width=360/>
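
A minimal pure-Python sketch of one NAG step follows. The toy cost J(theta) = theta² and the names `nag_step` and `grad_fn` are assumptions for illustration only. In Keras, the same behavior is requested by passing `nesterov=True` to the SGD optimizer.

```python
def nag_step(theta, m, grad_fn, lr=0.1, beta=0.9):
    """One Nesterov step: the gradient is measured at theta + beta*m, not at theta."""
    lookahead = theta + beta * m
    m = beta * m - lr * grad_fn(lookahead)
    return theta + m, m

def grad_fn(theta):
    # Gradient of the toy cost J(theta) = theta**2 (an assumption for this sketch).
    return 2.0 * theta

theta, m = 5.0, 0.0
for _ in range(200):
    theta, m = nag_step(theta, m, grad_fn)

print(theta)  # very close to the optimum at 0

# Keras equivalent of the optimizer setup:
# keras.optimizers.SGD(learning_rate=0.01, momentum=0.9, nesterov=True)
```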

## AdaGrad 
- points the direction of descent closer to the global optimum
- Preface:
    - elongated bowl problem and Gradient Descent
        - it points towards the steepest slope, not towards the global optimum
- How it works:
    - scales down the gradient vector along the steepest dimensions
- Short summary: 
    - "Adaptive learning rate" - it **decays the learning rate**, but does so faster for steep dimensions than for dimensions with gentler slopes
        - helps point the resulting updates more directly toward the global optimum
    - Requires much less tuning of the learning rate eta
       
- <img src="images/AdaGrad.jpeg" width=360/>
- Downsides
    - can slow down too fast and never converge to the global optimum
    - often stops too early when training neural networks because the learning rate gets scaled down too much before reaching the global optimum
    - **Don't use for DNNs, but it can be efficient for simpler tasks such as Linear Regression**
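
The scaling and the slowdown can both be seen on a toy elongated bowl. This is a minimal sketch under assumptions: the cost J = 0.5 * (x² + 100 y²), the helper names, and the hyperparameter values are all made up for illustration.

```python
import math

def adagrad_step(theta, s, grad, lr=0.1, eps=1e-10):
    """Accumulate squared gradients in s, then scale each dimension by 1/sqrt(s + eps)."""
    s = [si + g * g for si, g in zip(s, grad)]
    theta = [t - lr * g / math.sqrt(si + eps) for t, g, si in zip(theta, grad, s)]
    return theta, s

def grad_fn(theta):
    # Gradient of the toy cost J = 0.5 * (x**2 + 100 * y**2): 100x steeper along y.
    x, y = theta
    return [x, 100.0 * y]

theta, s = [1.0, 1.0], [0.0, 0.0]
first_theta, first_s = adagrad_step(theta, s, grad_fn(theta))

# Keep training and record how far x moves at each step.
theta, s = first_theta, first_s
step_sizes = []
for _ in range(50):
    new_theta, s = adagrad_step(theta, s, grad_fn(theta))
    step_sizes.append(abs(new_theta[0] - theta[0]))
    theta = new_theta

print(first_theta, step_sizes[0], step_sizes[-1])
```

On the first step both coordinates move by the same amount even though the gradient is 100 times larger along y, and the later steps keep shrinking because `s` only ever grows.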

## RMSProp
- Fixes AdaGrad's problem with not converging to the global optimum
- Almost always performs better than AdaGrad
- Was the most used optimization algorithm until Adam optimization
- How does it fix the problem?
    - accumulates only the gradients from the most recent iterations (vs. all the gradients since the beginning of training)
    - It does so by using exponential decay in the first step:
    - <img src="images/RMSProp.jpeg" width=360/>
    - Decay rate Beta (called `rho` in Keras) is typically set to 0.9
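
The difference between the two accumulators shows up clearly with a constant gradient. The snippet below is a minimal sketch: the function names and the gradient value are assumptions for illustration.

```python
def adagrad_accumulate(s, grad):
    """AdaGrad: sum of all squared gradients since the beginning of training."""
    return s + grad ** 2

def rmsprop_accumulate(s, grad, beta=0.9):
    """RMSProp: exponentially decaying average of squared gradients."""
    return beta * s + (1 - beta) * grad ** 2

grad = 3.0  # hypothetical constant gradient
s_adagrad = s_rmsprop = 0.0
for _ in range(100):
    s_adagrad = adagrad_accumulate(s_adagrad, grad)
    s_rmsprop = rmsprop_accumulate(s_rmsprop, grad)

print(s_adagrad, s_rmsprop)
```

AdaGrad's accumulator grows without bound (100 * 9 here), so the scaled learning rate keeps shrinking. RMSProp's accumulator levels off near grad² (about 9), so the step size stops decaying.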

In [4]:
optimizer = keras.optimizers.RMSprop(learning_rate=0.001, rho=0.9)  # rho is the decay rate (Beta in the equation above)

## Adam and Nadam Optimization
- "Adam" - adaptive moment estimation
- requires much less tuning of the learning rate because it is an adaptive learning rate algorithm (like AdaGrad and RMSProp)
- combines the ideas of momentum optimization and RMSProp
    - similarities to momentum optimization - keeps track of an exponentially decaying average of past gradients
    - similarities to RMSProp - keeps track of an exponentially decaying average of past squared gradients
    - <img src="images/AdamOptimization.jpeg" width=360/>
- Variants:
    - AdaMax
    - Nadam
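
A single Adam step can be sketched in pure Python as below. This uses the commonly published formulation (with bias correction and the gradient subtracted in the final update), so its sign convention may differ from the equation image. The name `adam_step` and the input values are assumptions for illustration.

```python
import math

def adam_step(theta, m, s, grad, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-7):
    """One Adam update at iteration t (t starts at 1)."""
    m = beta1 * m + (1 - beta1) * grad       # decaying average of gradients (momentum idea)
    s = beta2 * s + (1 - beta2) * grad ** 2  # decaying average of squared gradients (RMSProp idea)
    m_hat = m / (1 - beta1 ** t)             # bias correction: m and s start at 0
    s_hat = s / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (math.sqrt(s_hat) + eps)
    return theta, m, s

theta, m, s = adam_step(theta=1.0, m=0.0, s=0.0, grad=4.0, t=1)
print(theta, m, s)
```

On the very first step the bias-corrected ratio is close to the sign of the gradient, so the weight moves by roughly the learning rate whatever the gradient's magnitude.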


In [5]:
optimizer = keras.optimizers.Adam(learning_rate=0.001, beta_1=0.9, beta_2=0.999)

### WARNING
- Adaptive optimization methods (RMSProp, Adam, Nadam) can lead to solutions that generalize poorly on some datasets
- In this case **just use Nesterov Accelerated Gradient instead**, because the dataset might simply not suit adaptive gradients

### Extra Info:
- all optimization techniques discussed so far rely only on *first-order partial derivatives (Jacobians)*
- There are optimization algorithms based on *second-order partial derivatives (Hessians)*, which are the partial derivatives of the Jacobians
    - Very hard to apply to DNNs because there are n^2 Hessians per output (n = number of parameters) as opposed to n Jacobians per output
        - they may not fit in memory, and even when they do, computing them slows training down too much

## Sparse Models
- Use if:
    - blazingly fast runtime is needed
    - have low memory
    - prefer to use a sparse model instead (duh!)
- How to get:
    - Easy but not optimal way:
        - train the model as usual
        - get rid of the teeny tiny weights (set them to zero)
    - Better way:
        - apply strong l<sub>1</sub> regularization during training
        - pushes the optimizer to zero out as many weights as it can
- dangers:
    - may degrade performance
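
The "easy but not optimal way" amounts to magnitude pruning. The snippet below is a minimal sketch on a plain list of weights: the helper names, the threshold, and the weight values are hypothetical.

```python
def prune_small_weights(weights, threshold=0.01):
    """Set every weight whose absolute value is below the threshold to zero."""
    return [0.0 if abs(w) < threshold else w for w in weights]

def sparsity(weights):
    """Fraction of weights that are exactly zero."""
    return sum(1 for w in weights if w == 0.0) / len(weights)

weights = [0.8, -0.003, 0.0005, -1.2, 0.009, 0.3]  # hypothetical trained weights
pruned = prune_small_weights(weights)

print(pruned, sparsity(pruned))
```

Pruning after the fact gives no guarantee that the remaining weights still work well together, which is why it may degrade performance.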

<img src="images/OptimizerComparison.jpeg" width=360/>
Legend: \* = bad, \*\* = average, \*\*\* = good

## Learning Rate Scheduling
- Can considerably speed up convergence compared with a constant learning rate
- Very important to find a good learning rate
    - if too high, training may diverge
    - if too low, getting to the global optimum will take a very long time
    - if slightly too high, training makes quick progress at first but ends up dancing around the optimum without ever really settling
    - with a limited computing budget, training might have to be interrupted before it converges, yielding a suboptimal solution
    - <img src="images/LearningCurves.jpeg" width=360/>
- Try not to use constant learning rates
- Use **Learning Schedules**
    - Power scheduling
    - Exponential scheduling
    - Piecewise constant scheduling
    - Performance scheduling
    - 1cycle scheduling

In [7]:
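# power scheduling: decay is the inverse of s (the # of steps it takes to divide the learning rate by one more unit)
# note: `decay` is only accepted by older Keras optimizers; newer versions expect a schedule object
# such as keras.optimizers.schedules.InverseTimeDecay instead
optimizer = keras.optimizers.SGD(learning_rate=0.01, decay=1e-4)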

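The learning rate drops at each step: after ***s*** steps it is down to eta_0 / 2, after ***s*** more steps it is down to eta_0 / 3, and so on. It drops quickly at first, then more and more slowly.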
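
Power scheduling can be written out directly as eta(t) = eta_0 / (1 + t/s)^c. The snippet below is a minimal sketch with c = 1 and s = 1 / decay = 10,000 to match the cell above. The helper name `power_decay` is an assumption.

```python
def power_decay(lr0, s, c=1):
    """Return a schedule computing lr0 / (1 + step / s) ** c."""
    def power_decay_fn(step):
        return lr0 / (1 + step / s) ** c
    return power_decay_fn

power_decay_fn = power_decay(lr0=0.01, s=10_000)  # s = 1 / decay = 1 / 1e-4

for step in (0, 10_000, 20_000, 30_000):
    print(step, power_decay_fn(step))
```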

In [8]:
#exponential scheduling w/ constant eta_0 (0.01) and s (20):
def exponential_decay_fn(epoch):
    return 0.01 * 0.1**(epoch/20)

#exponential decay w/ specifiable eta_0 and s:
def exponential_decay(lr0, s):
    def exponential_decay_fn(epoch):
        return lr0 * 0.1**(epoch/s)
    return exponential_decay_fn

exponential_decay_fn = exponential_decay(lr0=0.01, s=20)

In [10]:
#Learning Rate Scheduler callback to be passed into the fit() method
lr_scheduler_cb = keras.callbacks.LearningRateScheduler(exponential_decay_fn)

In [None]:
# schedule function that relies on the optimizer's learning rate
def exponential_decay_fun(epoch, lr):
    return lr * 0.1**(1/20)
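
The two styles of schedule function produce the same curve, shifted by one epoch. The snippet below is a minimal sketch that imitates a training loop in pure Python, assuming the callback invokes the schedule at the start of every epoch, including the first. The function names are chosen for this sketch only.

```python
def decay_from_epoch(epoch):
    """Epoch-based form: depends only on the epoch number."""
    return 0.01 * 0.1 ** (epoch / 20)

def decay_from_current_lr(epoch, lr):
    """Multiplicative form: depends only on the optimizer's current learning rate."""
    return lr * 0.1 ** (1 / 20)

lr = 0.01  # the optimizer's initial learning rate
for epoch in range(20):
    lr = decay_from_current_lr(epoch, lr)

print(lr, decay_from_epoch(20))
```

Because the multiplicative form already decays the rate in the first epoch, its value during epoch k matches the epoch-based value for epoch k + 1. It also depends on the optimizer's initial learning rate, which has to be set appropriately.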

When a model is saved, the optimizer and its learning rate get saved along with it, so a trained model can continue training where it left off.

However, the epoch does **not** get saved: it is reset to 0 every time the *fit()* method is called. With a schedule function that depends on the epoch, this could lead to a very large learning rate and damage the model's weights.
- solution: manually set the *initial_epoch* argument of the fit() method so the schedule function starts at the right epoch every time
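
The size of the jump is easy to quantify. The snippet below is a minimal sketch, assuming a hypothetical model that has already trained for 40 epochs with the epoch-based exponential schedule. The variable names are made up for illustration.

```python
def exponential_decay_fn(epoch):
    return 0.01 * 0.1 ** (epoch / 20)

epochs_already_trained = 40  # hypothetical

lr_restart_wrong = exponential_decay_fn(0)                       # epoch counter reset to 0
lr_restart_right = exponential_decay_fn(epochs_already_trained)  # resumed at the right epoch

print(lr_restart_wrong, lr_restart_right)

# With Keras, resuming at the right epoch would look like this (model, X_train and
# y_train are hypothetical; `epochs` is the index of the final epoch, not a count):
# model.fit(X_train, y_train, initial_epoch=40, epochs=60, callbacks=[lr_scheduler_cb])
```

Restarting from epoch 0 makes the learning rate 100 times larger than it should be at that point in training.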

In [11]:
# piecewise constant scheduling
def piecewise_constant_fn(epoch):
    if epoch < 5:
        return 0.01
    elif epoch < 15:
        return 0.005
    else:
        return 0.001
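
The hard-coded version above can be generalized to any list of boundaries. The snippet below is a minimal sketch using only the standard library. The helper name `piecewise_constant` is an assumption, and `values` needs one more entry than `boundaries`.

```python
from bisect import bisect_right

def piecewise_constant(boundaries, values):
    """Return a schedule using values[i] for the i-th interval delimited by boundaries."""
    if len(values) != len(boundaries) + 1:
        raise ValueError("values must have exactly one more entry than boundaries")
    def piecewise_constant_fn(epoch):
        # bisect_right counts how many boundaries are <= epoch.
        return values[bisect_right(boundaries, epoch)]
    return piecewise_constant_fn

piecewise_constant_fn = piecewise_constant([5, 15], [0.01, 0.005, 0.001])

for epoch in (0, 4, 5, 14, 15, 30):
    print(epoch, piecewise_constant_fn(epoch))
```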

In [12]:
# performance scheduling
lr_scheduler = keras.callbacks.ReduceLROnPlateau(factor=0.5, patience=5)
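
The idea behind performance scheduling can be sketched without Keras. This is a simplified illustration, not the callback's actual implementation: it ignores options such as `min_delta`, `cooldown`, and `min_lr`, and the loss values are made up.

```python
def reduce_on_plateau(losses, lr=0.01, factor=0.5, patience=5):
    """Multiply lr by factor whenever the loss has not improved for `patience` epochs."""
    best = float("inf")
    wait = 0
    history = []
    for loss in losses:
        if loss < best:
            best, wait = loss, 0
        else:
            wait += 1
            if wait >= patience:
                lr *= factor
                wait = 0
        history.append(lr)
    return history

val_losses = [1.0, 0.9] + [0.9] * 5  # hypothetical: improves once, then plateaus
lrs = reduce_on_plateau(val_losses)
print(lrs)
```

The learning rate stays at 0.01 while the loss improves and is halved once five epochs in a row pass without improvement.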

As an alternative to all of the above methods for learning rate scheduling, use **keras.optimizers.schedules**

In [13]:
# exponential_decay_fn() using optimizers.schedules

x_train = [1, 2, 3] # dummy X_train
s = 20 * len(x_train) // 32 # number of steps in 20 epochs (batch size = 32)
learning_rate = keras.optimizers.schedules.ExponentialDecay(0.01, s, 0.1)
optimizer = keras.optimizers.SGD(learning_rate)

This way is simple and better, because the learning rate and its schedule (plus its state) get saved along with the model