# Chapter 10: Optimizers

In [1]:
# Preface: import the necessary packages
import numpy as np
import matplotlib.pyplot as plt
import math
import nnfs
from nnfs.datasets import spiral_data
from timeit import timeit
from resources.classes import DenseLayer, ReLU, SoftMax, Loss, CategoricalCrossEntropy, SoftMaxCategoricalCrossEntropy, SGD, AdaGrad, RMSProp, Adam

We spent chapter nine learning how to calculate and apply gradients to adjust weights and biases and ultimately reduce loss. What we ended up doing was subtracting a fraction of the gradient from each weight and bias parameter -- and that is called stochastic gradient descent (SGD). Most optimizers that we use are actually just modifications of SGD.

## Section 1: Stochastic Gradient Descent (SGD)

Anyone who's heard about optimizers before has probably heard many names -- seemingly used interchangeably -- including:
- Stochastic Gradient Descent or SGD.
- Vanilla Gradient Descent, Gradient Descent or GD, Batch Gradient Descent or BGD.
- Mini-batch Gradient Descent or MBGD.

However, these are not the same. SGD has historically referred to an optimizer that fits a single sample at a time. Meanwhile, BGD is an optimizer that fits a whole dataset at once. Lastly, MBGD is an optimizer that fits slices of a dataset, which we'd call batches in our context.

As a general rule of thumb, we call slices of data **batches**. However, historically, these same slices have been referred to as **mini-batches** in the context of SGD. With that said, the field has evolved and the two are now used interchangeably to the point where we actually think of the SGD optimizer as one that assumes a batch of data.
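To make the distinction concrete, here is a minimal sketch of how the three variants would slice the same data. The helper `iter_batches` and the toy `data` list are my own illustrative names, not something from the book or from classes.py.

```python
def iter_batches(samples, batch_size):
    """Yield consecutive slices of `samples`, each holding at most `batch_size` items."""
    for start in range(0, len(samples), batch_size):
        yield samples[start:start + batch_size]


data = list(range(10))  # stand-in for 10 training samples

per_sample = list(iter_batches(data, 1))          # historical SGD: one sample per update
mini_batches = list(iter_batches(data, 4))        # MBGD: slices of the dataset
full_batch = list(iter_batches(data, len(data)))  # BGD: the whole dataset in one update

print(len(per_sample), len(mini_batches), len(full_batch))
```

The only thing that differs between the three is how many samples feed each parameter update; the update rule itself stays the same.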

In the case of SGD, we need to choose a learning rate. We then subtract $\text{learning_rate} \cdot \text{parameter_gradients}$ from the current parameter values.
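As a quick worked example of that subtraction, here is a tiny sketch using plain Python lists. The numbers are made up purely for illustration.

```python
learning_rate = 0.5
weights = [0.2, -0.4, 1.0]     # hypothetical current parameter values
gradients = [0.1, -0.2, 0.4]   # hypothetical gradients from a backward pass

# SGD step: new_value = old_value - learning_rate * gradient
updated = [w - learning_rate * g for w, g in zip(weights, gradients)]

print(updated)
```

Note how the parameter with a negative gradient moves up while the ones with positive gradients move down: each one steps against its own slope.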

We'll walk through creating an example SGD class below -- but I've already made one and added it to the classes.py file :).

In [None]:
class ExampleSGD():
    # Initialize the class
    def __init__(self, lr=1.0):
        # Store the learning rate; just 1 if not specified
        self.lr = lr
        
    # Method to update the parameters after a backward pass
    def updateParams(self, layer):
        # Update values according to: -lr * parameter_gradients
        layer.weights += -self.lr * layer.dweights
        layer.biases += -self.lr * layer.dbiases

The above is an early version of our SGD class! Now, to use it, we have to instantiate the object by doing "optimizer = SGD()" and then call "optimizer.updateParams(denseX)" for each dense layer X in our model.

Again, the above code cell is just an example implementation for convenience; we can directly reference the SGD() class from our classes.py.

With this built out, we can begin training our model in repeated iterations called epochs. An epoch simply means a full pass through all the training data. So, let's make our model trainable over epochs by wrapping the forward pass, backward pass, and parameter update in a loop.

In [None]:
# Create some training data using the spiral_data function
X, y = spiral_data(samples=100, classes=3)

# Create dense layer with 2 input features and 64 output features
dense1 = DenseLayer(2, 64)

# Use a relu activation
activation1 = ReLU()

# Create a dense layer for our output with 64 as an input and 3 as an output
dense2 = DenseLayer(64, 3)

# Use a softmax combined with ccel. for our output 
activationLoss = SoftMaxCategoricalCrossEntropy()

# Initialize optimizer
optimizer = SGD()

# Create the loop that trains our model in epochs
for epoch in range(10000):
    # Perform the forward pass, as shown previously
    dense1.forward(X)
    activation1.forward(dense1.output)
    dense2.forward(activation1.output)
    loss = activationLoss.forward(dense2.output, y)
    
    # Calculate the accuracy
    predictions = np.argmax(activationLoss.output, axis=1)
    if len(y.shape) == 2:
        y = np.argmax(y, axis=1)
    accuracy = np.mean(predictions==y)
    
    if not epoch % 100:
        print(f"epoch: {epoch}, accuracy: {accuracy: .3f}, loss: {loss: .3f}")
        
    # Perform the backward pass
    activationLoss.backward(activationLoss.output, y)
    dense2.backward(activationLoss.dinputs)
    activation1.backward(dense2.dinputs)
    dense1.backward(activation1.dinputs)
    
    # Use the optimizer and update the weights and biases
    optimizer.updateParams(dense1)
    optimizer.updateParams(dense2)

Now we're making models that learn and perform better on their dataset -- which we can clearly see from the increasing accuracy and decreasing loss. However, our accuracy seems to get stuck just below the ~.70 mark with about ~.62 loss. That tells us the model may have reached a local minimum, which we'll talk about soon. It also tells us that adding epochs likely won't be very helpful at this point, and we need to work on our optimizer. The first thing we can change is the learning rate, so that's what we'll look at next!

## Section 2: Learning Rates

We want to apply a fraction of the gradients to the parameters in order to descend the loss value. Typically, we don't apply the full gradient value (which would just be the slope of the tangent line) because these values will typically be too large to produce meaningful improvements. Instead, we want to perform small steps, calculating the gradient and updating parameters by a negative fraction of this gradient. Small steps help ensure that we keep following the direction of the steepest descent -- but the steps can also be too small, causing learning to stagnate.

When the learning rate is too small, the updates to the parameters are too small, and the model may get stuck in a local minimum instead of finding the global minimum. So, how do we know if our model is at the global minimum? We can't know for certain; the best sign is a loss as close to 0 as possible -- but we rarely reach 0 in practice, and a loss that low may be a sign of overfitting (a topic which I assume will be talked about in a later chapter).

The learning rate by itself is not enough -- you may still get stuck in local minima no matter what you set it to. So, for that reason, we must introduce momentum. Momentum, in an optimizer, adds to the gradient what we, in the physical world, would call inertia. For example, if we throw a ball uphill, then with enough force or a small enough hill, the ball can roll over the crest of the hill and onto the other side.

These parameters in the optimizer (such as the learning rate or momentum) are referred to as hyperparameters.

If the learning rate is too big the model may start jumping around during SGD, caused by the amount of gradient applied being too large. At the extreme, this may cause a gradient explosion. A gradient explosion is where the parameter updates cause the function's output to rise instead of fall, causing the gradient and loss to increase with each step. 
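To see both failure modes side by side, here is a small sketch that runs plain gradient descent on the toy function $f(x) = x^2$ (my own choice of example, not from the book). The function `descend` and the three learning rates are assumptions picked only to make the behaviour obvious.

```python
def descend(lr, steps=20, start=1.0):
    """Run plain gradient descent on f(x) = x**2, whose gradient is 2*x."""
    x = start
    for _ in range(steps):
        x -= lr * 2 * x  # subtract a fraction of the gradient
    return x


too_small = descend(lr=0.01)   # creeps toward the minimum at 0, still far away
reasonable = descend(lr=0.4)   # settles very close to the minimum
too_large = descend(lr=1.1)    # overshoots further on every step and diverges

print(f"lr=0.01 -> x = {too_small:.4f}")
print(f"lr=0.4  -> x = {reasonable:.2e}")
print(f"lr=1.1  -> x = {too_large:.2f}")
```

With the largest learning rate, each step lands further from the minimum than the last, which is the one-dimensional version of the explosion described above.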

Ultimately, choosing the correct hyperparameters will enable you to speed up the learning process and save yourself money and time. It's typically best to start with the optimizer defaults, perform a few steps, and then observe the training process when tuning different settings. However, it's always useful to have some system that actively tunes your hyperparameters even during training. One way that we do this is by using learning rate decay. 

## Section 3: Learning Rate Decay   

The idea of learning rate decay is to start with a large learning rate (i.e. something like 1.0) and then decrease it during training. We have a few ways of going about this, and one is to decrease the learning rate in response to the loss across epochs. You can either program this by checking performance or manually adjust it once you deem it appropriate. The other way is to implement a decay rate which steadily decays the learning rate per batch or epoch.

We can start by planning decay per step -- otherwise known as 1/t decay or inverse-time decay (it is sometimes loosely called exponential decay, though the schedule is not actually exponential). We update the learning rate each step using the reciprocal of a step-count function: take the step number and the decay ratio and multiply them. The bigger the step, the bigger the result of this multiplication. As we take the reciprocal, our learning rate keeps decreasing by ever smaller amounts. The multiplier then works out to $\frac{1}{1 + \text{decay} \cdot \text{step}}$, with the added 1 in the denominator ensuring that the decay can never raise the lr above its initial value. Let's show how this can work in code:

In [None]:
# Initialize our lr to the default for SGD (1)
lr_init = 1.
# Set our decay rate
lr_decay = 0.1 

# Now to code out the equation we showed above (using a lambda function for versatility)
lr = lambda step: lr_init * (1. /(1 + lr_decay * step))

# Let's see the lr at step 1
stepA = 1
print(f"At step {stepA} the lr is: {lr(stepA)}")

# Let's see how that differs at step 20
stepB = 20
print(f"At step {stepB} the lr is: {lr(stepB)}")

In practice a decay of 0.1 would be pretty aggressive, but this gives good intuition about how it's meant to work. 

Now, let's update our SGD optimizer class. I'll show the changes below as additions to the ExampleSGD class from a few cells ago. As before, the actual changes are made in the classes.py file.

In [None]:
class ExampleSGD():
    # MODIFIED: initialize object
    def __init__(self, lr=1., decay=0.):
        ...
        # Create a way to store the current learning rate, decay, and iteration/step
        self.lr_curr = lr
        self.decay = decay
        self.iteration = 0
        
    # NEW: call to update learning rate before parameter refresh
    def preUpdateParams(self):
        # If there is a nonzero decay, update the lr before updating the parameters 
        if self.decay:
            self.lr_curr = self.lr * (1./(1. + self.decay * self.iteration))
            
    # NEW: Call after a parameter update
    def postUpdateParams(self):
        self.iteration += 1

Like I said, all of the above is available through our full SGD class implemented in the classes.py file.

So, let's try to train our model again, but with a decay rate of 1e-3 (0.001).

In [None]:
# Create some training data using the spiral_data function
X, y = spiral_data(samples=100, classes=3)

# Create dense layer with 2 input features and 64 output features
dense1 = DenseLayer(2, 64)

# Use a relu activation
activation1 = ReLU()

# Create a dense layer for our output with 64 as an input and 3 as an output
dense2 = DenseLayer(64, 3)

# Use a softmax combined with ccel. for our output 
activationLoss = SoftMaxCategoricalCrossEntropy()

# Initialize optimizer
optimizer = SGD(decay=1e-3)

# Create the loop that trains our model in epochs
for epoch in range(10000):
    # Perform the forward pass, as shown previously
    dense1.forward(X)
    activation1.forward(dense1.output)
    dense2.forward(activation1.output)
    loss = activationLoss.forward(dense2.output, y)
    
    # Calculate the accuracy
    predictions = np.argmax(activationLoss.output, axis=1)
    if len(y.shape) == 2:
        y = np.argmax(y, axis=1)
    accuracy = np.mean(predictions==y)
    
    if not epoch % 100:
        print(f"epoch: {epoch}, accuracy: {accuracy: .3f}, loss: {loss: .3f}, lr: {optimizer.lr_curr}")
        
    # Perform the backward pass
    activationLoss.backward(activationLoss.output, y)
    dense2.backward(activationLoss.dinputs)
    activation1.backward(dense2.dinputs)
    dense1.backward(activation1.dinputs)
    
    # Use the optimizer and update the weights and biases
    optimizer.preUpdateParams()
    optimizer.updateParams(dense1)
    optimizer.updateParams(dense2)
    optimizer.postUpdateParams()

Now we're on the right track, with higher accuracy and the lowest loss so far. However, it is likely still possible to produce a better output, as it seems possible the model has gotten stuck in a local minimum.

## Section 4: SGD with Momentum

Momentum may be a solution to our problem -- it creates a rolling average of gradients over some number of updates and uses this average together with the unique gradient at each step. This helps when a model gets stuck in a local minimum and is bouncing around in it. As we remember, the gradient as a vector points in the direction of the steepest loss ascent -- meaning the direction which, if followed, would lead to loss **increasing** the most. For that reason, we always take the negative of the gradient, as that points us in the direction of the steepest loss descent. If you understand linear algebra, I would relate it as follows: imagine you have a gradient in a 2d space as a column vector g, where g=[1,2]. Then, you have a line going 1 to the right and 2 up, and if you follow that line your loss will increase by the highest amount. Now, if we take the negative of that, we have -g=[-1,-2], meaning that we have a new vector going 1 to the left and 2 down -- which is our direction of steepest loss descent. That example really helped me intuitively understand it, so I hope it helped you too.
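Here is a small numerical check of that intuition, as a sketch. I'm assuming a toy linear loss whose gradient is [1, 2] everywhere; the names `loss`, `uphill`, `downhill`, and `sideways` are just for illustration.

```python
import math


def loss(w):
    """A toy linear loss whose gradient is [1, 2] at every point."""
    return 1.0 * w[0] + 2.0 * w[1]


gradient = [1.0, 2.0]
start = [0.0, 0.0]
step = 0.1

# Move along the gradient and along its negative
uphill = [w + step * g for w, g in zip(start, gradient)]
downhill = [w - step * g for w, g in zip(start, gradient)]

# Move the same distance, but purely along the first axis instead
distance = step * math.hypot(*gradient)
sideways = [start[0] + distance, start[1]]

print(loss(uphill), loss(downhill), loss(sideways))
```

Following the gradient raises the loss by more than an equally long step in another direction, and following the negative gradient lowers it by the same amount.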

The reason momentum is so helpful is thanks to leveraging previous movements. Pure SGD simply goes into the opposite direction of the steepest loss ascent, meaning it is possible for it to bounce between "walls" of a local minimum and never be able to get out. Momentum on the other hand uses the previous update's direction to influence the next update's direction, minimizing the odds of such bouncing around.

We use momentum by setting a parameter between 0 and 1, representing the fraction of the previous parameter update to retain, and subtracting our actual gradient, multiplied by the learning rate. The update then contains only a portion of the gradient from preceding steps as our momentum and only a portion of the current gradient. 
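To get a feel for what retaining a fraction of the previous update does, here is a one-parameter sketch. The function `momentum_updates` and the two gradient sequences are hypothetical; the update line follows the rule just described (fraction of the previous update minus learning rate times the current gradient).

```python
def momentum_updates(gradients, lr=0.1, momentum=0.9):
    """Return the sequence of updates produced by SGD with momentum for one parameter."""
    update = 0.0
    history = []
    for grad in gradients:
        # Keep a fraction of the previous update, then subtract the scaled gradient
        update = momentum * update - lr * grad
        history.append(update)
    return history


# Gradient keeps pointing the same way: updates build up speed
consistent = momentum_updates([1.0] * 200)

# Gradient flips sign every step (bouncing between "walls"): updates are damped
alternating = momentum_updates([1.0 if i % 2 == 0 else -1.0 for i in range(200)])

print(f"first consistent update: {consistent[0]:.3f}")
print(f"last consistent update:  {consistent[-1]:.3f}")
print(f"last alternating update: {alternating[-1]:.3f}")
```

With a steady gradient the update size grows from 0.1 toward 1.0, while with a flip-flopping gradient it shrinks to roughly half of what plain SGD would apply.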

When we set the momentum value too high, the model might stop learning at all since the direction of updates may not be able to follow the global gradient descent. 

The code for this is as follows: ```weight_updates = self.momentum * layer.weight_momentums - self.lr_curr * layer.dweights``` where the hyperparameter self.momentum is chosen at the start and the layer.weight_momentums start as all zeros but are changed during training as ```layer.weight_momentums = weight_updates```.

The above tells us that momentum is always updated before the parameters -- so we'll add it to our updateParams() method in the SGD class. I'll show the additions here and also make changes directly in the classes.py. 

In [None]:
class ExampleSGD():
    def __init__(self, lr=1., decay=0., momentum=0.):
        ...
        # Store the momentum in the object 
        self.momentum = momentum 
         
    # MODIFIED: added the use of momentum
    def updateParams(self, layer):
        # Do if we've used momentum
        if self.momentum:
            # If layer does not have a momentum array, create it
            if not hasattr(layer, 'weight_momentums'):
                layer.weight_momentums = np.zeros_like(layer.weights)
                # No momentum array --> no bias array; so create it too.
                layer.bias_momentums = np.zeros_like(layer.biases)
                
            # Build weight updates with momentum
            weight_updates = self.momentum * layer.weight_momentums - self.lr_curr * layer.dweights
            layer.weight_momentums = weight_updates
            
            #Build bias updates with momentum
            bias_updates = self.momentum * layer.bias_momentums - self.lr_curr * layer.dbiases
            layer.bias_momentums = bias_updates
        # SGD without momentum
        else:
            weight_updates = -self.lr_curr * layer.dweights
            bias_updates = -self.lr_curr * layer.dbiases
            
        #With updates now calculated, update both weights and biases
        layer.weights += weight_updates
        layer.biases += bias_updates

I've made the corresponding changes in our full SGD() class, but the above should give you a clearer picture of exactly what changes have been made.

Let's run our model again now, with a decay of 1e-3 and a momentum of 0.5.

In [None]:
# Create some training data using the spiral_data function
X, y = spiral_data(samples=100, classes=3)

# Create dense layer with 2 input features and 64 output features
dense1 = DenseLayer(2, 64)

# Use a relu activation
activation1 = ReLU()

# Create a dense layer for our output with 64 as an input and 3 as an output
dense2 = DenseLayer(64, 3)

# Use a softmax combined with ccel. for our output 
activationLoss = SoftMaxCategoricalCrossEntropy()

# Initialize optimizer
optimizer = SGD(decay=1e-3, momentum=0.5)

# Create the loop that trains our model in epochs
for epoch in range(10000):
    # Perform the forward pass, as shown previously
    dense1.forward(X)
    activation1.forward(dense1.output)
    dense2.forward(activation1.output)
    loss = activationLoss.forward(dense2.output, y)
    
    # Calculate the accuracy
    predictions = np.argmax(activationLoss.output, axis=1)
    if len(y.shape) == 2:
        y = np.argmax(y, axis=1)
    accuracy = np.mean(predictions==y)
    
    if not epoch % 100:
        print(f"epoch: {epoch}, accuracy: {accuracy: .3f}, loss: {loss: .3f}, lr: {optimizer.lr_curr}")
        
    # Perform the backward pass
    activationLoss.backward(activationLoss.output, y)
    dense2.backward(activationLoss.dinputs)
    activation1.backward(dense2.dinputs)
    dense1.backward(activation1.dinputs)
    
    # Use the optimizer and update the weights and biases
    optimizer.preUpdateParams()
    optimizer.updateParams(dense1)
    optimizer.updateParams(dense2)
    optimizer.postUpdateParams()

We're getting there, but let's try setting the momentum to 0.9 and see how the model reacts:

In [None]:
# Create some training data using the spiral_data function
X, y = spiral_data(samples=100, classes=3)

# Create dense layer with 2 input features and 64 output features
dense1 = DenseLayer(2, 64)

# Use a relu activation
activation1 = ReLU()

# Create a dense layer for our output with 64 as an input and 3 as an output
dense2 = DenseLayer(64, 3)

# Use a softmax combined with ccel. for our output 
activationLoss = SoftMaxCategoricalCrossEntropy()

# Initialize optimizer
optimizer = SGD(decay=1e-3, momentum=0.9)

# Create the loop that trains our model in epochs
for epoch in range(10000):
    # Perform the forward pass, as shown previously
    dense1.forward(X)
    activation1.forward(dense1.output)
    dense2.forward(activation1.output)
    loss = activationLoss.forward(dense2.output, y)
    
    # Calculate the accuracy
    predictions = np.argmax(activationLoss.output, axis=1)
    if len(y.shape) == 2:
        y = np.argmax(y, axis=1)
    accuracy = np.mean(predictions==y)
    
    if not epoch % 100:
        print(f"epoch: {epoch}, accuracy: {accuracy: .3f}, loss: {loss: .3f}, lr: {optimizer.lr_curr}")
        
    # Perform the backward pass
    activationLoss.backward(activationLoss.output, y)
    dense2.backward(activationLoss.dinputs)
    activation1.backward(dense2.dinputs)
    dense1.backward(activation1.dinputs)
    
    # Use the optimizer and update the weights and biases
    optimizer.preUpdateParams()
    optimizer.updateParams(dense1)
    optimizer.updateParams(dense2)
    optimizer.postUpdateParams()

That's a massive difference! Loss decreased by 50%, from .422 to .211.

SGD with momentum is typically one of the 2 main choices for an optimizer in practice next to the Adam optimizer. Let's talk about 2 more optimizers now! 

## Section 5: AdaGrad

AdaGrad, short for adaptive gradient, institutes a per-parameter learning rate rather than a globally-shared rate. This has the advantage of normalizing the updates made to features -- preventing certain weights from rising disproportionately to others. AdaGrad keeps a history of previous updates: the bigger the sum of a parameter's past updates (positive or negative), the smaller the updates made to it later in training. This has the benefit of allowing less-frequently updated parameters to keep up with changes, resulting in a more effective use of neurons for training.

This concept can be contained in the following two lines of code: ```cache += parm_gradient ** 2 ``` and ```parm_updates = lr * parm_gradient / (sqrt(cache) + eps)```. In this, the cache holds a history of squared gradients, and parm_updates is the learning rate multiplied by the gradient, divided by the square root of the cache plus an epsilon value. This value, epsilon, is a hyperparameter preventing division by 0. It usually has a small value, such as 1e-7, which we'll default to.
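To see the normalizing effect of those two lines, here is a single-parameter sketch. The function `adagrad_steps` and the constant gradients of 10 and 0.1 are assumptions I picked for illustration; the two lines inside the loop are the ones quoted above.

```python
import math


def adagrad_steps(gradient, steps, lr=1.0, eps=1e-7):
    """Return the AdaGrad update sizes for one parameter that sees a constant gradient."""
    cache = 0.0
    sizes = []
    for _ in range(steps):
        cache += gradient ** 2  # history of squared gradients
        sizes.append(lr * gradient / (math.sqrt(cache) + eps))
    return sizes


large = adagrad_steps(10.0, 4)  # parameter with big gradients
small = adagrad_steps(0.1, 4)   # parameter with tiny gradients

for step, (a, b) in enumerate(zip(large, small), start=1):
    print(f"step {step}: large-gradient update = {a:.4f}, small-gradient update = {b:.4f}")
```

Even though the raw gradients differ by a factor of 100, both parameters receive nearly identical updates, and those updates shrink as the cache grows.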

To implement AdaGrad, we extend our SGD optimizer class, but change the name, add an epsilon property to the object, and remove the momentum. The book copies the whole SGD class over, but we can just extend it and override methods instead. The only thing that changes is the following:

In [None]:
# MODIFIED: Removing momentum from updateParams
class ExampleAdaGrad(ExampleSGD):
    def __init__(self, lr=1., decay=0., epsilon=1e-7):
        ...
        self.epsilon = epsilon
        
    def updateParams(self, layer):
        # If layer does not contain cache arrays, create them
        if not hasattr(layer, 'weight_cache'):
            layer.weight_cache = np.zeros_like(layer.weights)
            layer.bias_cache = np.zeros_like(layer.biases)
            
        # Update cache with gradients squared
        layer.weight_cache += layer.dweights**2
        layer.bias_cache += layer.dbiases**2
        
        # Vanilla SGD parameter update & norm
        layer.weights += -self.lr_curr * layer.dweights / (np.sqrt(layer.weight_cache) + self.epsilon)
        layer.biases += -self.lr_curr * layer.dbiases / (np.sqrt(layer.bias_cache) + self.epsilon)

Those are the only changes we need to make! As usual, the full class is implemented in the classes.py file. The only difference here is the full class is not written out, but instead we create an AdaGrad class which extends the SGD class, letting us use its methods in Adagrad, saving us from having to rewrite the whole class.

Now, let's try out how AdaGrad works with our usual model training setup, but setting AdaGrad's decay to 1e-4.

In [None]:
# Create some training data using the spiral_data function
X, y = spiral_data(samples=100, classes=3)

# Create dense layer with 2 input features and 64 output features
dense1 = DenseLayer(2, 64)

# Use a relu activation
activation1 = ReLU()

# Create a dense layer for our output with 64 as an input and 3 as an output
dense2 = DenseLayer(64, 3)

# Use a softmax combined with ccel. for our output 
activationLoss = SoftMaxCategoricalCrossEntropy()

# Initialize optimizer as Adagrad with a decay
optimizer = AdaGrad(decay=1e-4)

# Create the loop that trains our model in epochs
for epoch in range(10000):
    # Perform the forward pass, as shown previously
    dense1.forward(X)
    activation1.forward(dense1.output)
    dense2.forward(activation1.output)
    loss = activationLoss.forward(dense2.output, y)
    
    # Calculate the accuracy
    predictions = np.argmax(activationLoss.output, axis=1)
    if len(y.shape) == 2:
        y = np.argmax(y, axis=1)
    accuracy = np.mean(predictions==y)
    
    if not epoch % 100:
        print(f"epoch: {epoch}, accuracy: {accuracy: .3f}, loss: {loss: .3f}, lr: {optimizer.lr_curr}")
        
    # Perform the backward pass
    activationLoss.backward(activationLoss.output, y)
    dense2.backward(activationLoss.dinputs)
    activation1.backward(dense2.dinputs)
    dense1.backward(activation1.dinputs)
    
    # Use the optimizer and update the weights and biases
    optimizer.preUpdateParams()
    optimizer.updateParams(dense1)
    optimizer.updateParams(dense2)
    optimizer.postUpdateParams()

AdaGrad also performs quite well in this situation -- although not as well as plain SGD with momentum surprisingly. But yeah, that's AdaGrad for you!

## Section 6: RMSProp

RMSProp is short for Root Mean Square Propagation. It's similar to AdaGrad in the sense that it calculates an adaptive learning rate per parameter, but it calculates it differently. While AdaGrad calculates the cache as ```cache += gradient**2```, RMSProp instead calculates the cache as ```cache = rho * cache + (1-rho) * gradient**2```.

This is somewhat similar to both momentum in the SGD optimizer and the cache in AdaGrad -- it's like a per-parameter momentum, which ensures smoother changes. However, instead of continually adding squared gradients to the cache, it uses a moving average of them. This results in cache contents that move with the data, so learning doesn't stall.

The new hyperparameter we use here is *rho*. Rho is the cache memory decay rate.
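The difference between the two cache rules is easiest to see with numbers. Below is a sketch that feeds the same constant gradient to both; the gradient value of 2.0, the 100 steps, and the variable names are my own assumptions.

```python
rho = 0.9
gradient = 2.0

adagrad_cache = 0.0
rmsprop_cache = 0.0

for _ in range(100):
    # AdaGrad: keeps summing squared gradients, so the cache only ever grows
    adagrad_cache += gradient ** 2
    # RMSProp: moving average of squared gradients, so the cache levels off
    rmsprop_cache = rho * rmsprop_cache + (1 - rho) * gradient ** 2

print(f"AdaGrad cache after 100 steps: {adagrad_cache:.1f}")
print(f"RMSProp cache after 100 steps: {rmsprop_cache:.4f}")
```

Because the RMSProp cache settles near the squared gradient instead of growing forever, the effective learning rate does not keep shrinking the way it does with AdaGrad.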

Here we once again can just extend the SGD class and override methods, as I'll show below: 

In [None]:
class ExampleRMSProp(ExampleSGD):
    # Extend the init method 
    def __init__(self, lr=1., decay=0., epsilon=1e-7, rho=0.9):
        ...
        self.epsilon = epsilon
        self.rho = rho
        
    # MODIFIED: added RMSProp functionality to the updateParam method    
    def updateParams(self, layer):
        ...
        # Update cache with squared current gradients
        layer.weight_cache = self.rho * layer.weight_cache + (1-self.rho) * layer.dweights**2
        layer.bias_cache = self.rho * layer.bias_cache + (1-self.rho) * layer.dbiases**2
        
        # Vanilla SGD parameter update & norm
        layer.weights += -self.lr_curr * layer.dweights / (np.sqrt(layer.weight_cache) + self.epsilon)
        layer.biases += -self.lr_curr * layer.dbiases / (np.sqrt(layer.bias_cache) + self.epsilon)

This, again, is also implemented as the RMSProp class in classes.py. So, now let's run our model with the RMSProp optimizer, using a learning rate of 0.02, a decay of 1e-5, and a rho of 0.999!

In [None]:
# Create some training data using the spiral_data function
X, y = spiral_data(samples=100, classes=3)

# Create dense layer with 2 input features and 64 output features
dense1 = DenseLayer(2, 64)

# Use a relu activation
activation1 = ReLU()

# Create a dense layer for our output with 64 as an input and 3 as an output
dense2 = DenseLayer(64, 3)

# Use a softmax combined with ccel. for our output 
activationLoss = SoftMaxCategoricalCrossEntropy()

# Initialize optimizer as RMSProp with a learning rate, a decay, and a rho
optimizer = RMSProp(lr=0.02, decay=1e-5, rho=0.999)

# Create the loop that trains our model in epochs
for epoch in range(10000):
    # Perform the forward pass, as shown previously
    dense1.forward(X)
    activation1.forward(dense1.output)
    dense2.forward(activation1.output)
    loss = activationLoss.forward(dense2.output, y)
    
    # Calculate the accuracy
    predictions = np.argmax(activationLoss.output, axis=1)
    if len(y.shape) == 2:
        y = np.argmax(y, axis=1)
    accuracy = np.mean(predictions==y)
    
    if not epoch % 100:
        print(f"epoch: {epoch}, accuracy: {accuracy: .3f}, loss: {loss: .3f}, lr: {optimizer.lr_curr}")
        
    # Perform the backward pass
    activationLoss.backward(activationLoss.output, y)
    dense2.backward(activationLoss.dinputs)
    activation1.backward(dense2.dinputs)
    dense1.backward(activation1.dinputs)
    
    # Use the optimizer and update the weights and biases
    optimizer.preUpdateParams()
    optimizer.updateParams(dense1)
    optimizer.updateParams(dense2)
    optimizer.postUpdateParams()

That's pretty good, but still not up to the level of SGD with momentum. So, we have one last optimizer to cover: Adam.

## Section 7: Adam

**Adam** is short for *Adaptive Moment Estimation* (often glossed as adaptive momentum). It is currently the most widely used optimizer and is built atop RMSProp. It has the momentum concept from SGD added back in, meaning that, instead of applying gradients directly, we're going to apply momentums like we did in the SGD optimizer with momentum, then apply a per-weight adaptive learning rate with the cache as done in RMSProp.

Adam additionally adds a bias correction mechanism. Note that this is not the layer's bias. The bias correction mechanism is applied to the cache and momentum, compensating for their initial zeroed values before they warm up over the first steps. To achieve this correction, both momentums and caches are divided by $1 - \beta^{step}$. As the step count rises, $\beta^{step}$ approaches 0, since $x^{n}$ with $0 < x < 1$ tends to 0 as n approaches infinity. The divisor therefore approaches 1.

The same thing applies to the cache and beta2: with the default beta2 of 0.999, the divisor $1 - \beta_2^{step}$ starts at 0.001 and approaches 1 (for the momentum, with a beta1 of 0.9, it starts at 0.1). The momentums are divided by the beta1 term and the caches by the beta2 term. Division by a fraction makes them multiple times bigger, significantly speeding up training in the initial stages before both arrays warm up during the first steps.
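Here is a small sketch of the bias correction for a single parameter on the very first step. The helper `correction`, the gradient value of 2.0, and the `_hat` names are illustrative assumptions; the beta values are the usual defaults.

```python
def correction(beta, step):
    """Divisor used for bias correction: 1 - beta**step."""
    return 1.0 - beta ** step


beta1, beta2 = 0.9, 0.999
grad = 2.0  # hypothetical gradient on the first step

# Momentum and cache both start at zero, so after one update they are tiny
momentum = (1 - beta1) * grad       # about 0.2, far below the real gradient
cache = (1 - beta2) * grad ** 2     # about 0.004, far below the squared gradient

# Dividing by the correction term undoes the zero initialization
momentum_hat = momentum / correction(beta1, 1)
cache_hat = cache / correction(beta2, 1)

print(f"raw momentum {momentum:.3f} -> corrected {momentum_hat:.3f}")
print(f"raw cache {cache:.4f} -> corrected {cache_hat:.3f}")

# The divisor drifts toward 1, so the correction fades out as training goes on
for step in (1, 10, 100, 1000):
    print(step, round(correction(beta1, step), 4), round(correction(beta2, step), 4))
```

On step 1 the corrected values match the gradient and the squared gradient, which is exactly what the averages would hold if they had not started from zero.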

As the Adam code is based on that of RMSProp, we can just extend its class, which I'll show below and implement in the classes.py.

In [None]:
class ExampleAdam(ExampleRMSProp):
    #MODIFIED: add beta1 and beta2 
    def __init__(self, lr=0.001, decay=0., epsilon=1e-7, beta1=0.9, beta2=0.999):
        ...
        self.beta1 = beta1
        self.beta2 = beta2
        
    def updateParams(self, layer):
        # Create cache arrays
        if not hasattr(layer, 'weight_cache'):
            layer.weight_momentums = np.zeros_like(layer.weights)
            layer.weight_cache = np.zeros_like(layer.weights)
            layer.bias_momentums = np.zeros_like(layer.biases)
            layer.bias_cache = np.zeros_like(layer.biases)
            
        # Update momentum
        layer.weight_momentums = self.beta1 * layer.weight_momentums + (1-self.beta1) * layer.dweights
        layer.bias_momentums = self.beta1 * layer.bias_momentums + (1-self.beta1) * layer.dbiases
        
        # Correct momentum (self.iteration is 0 on the first pass, hence the +1)
        weight_momentums_corrected = layer.weight_momentums / (1-self.beta1 ** (self.iteration+1))
        bias_momentums_corrected = layer.bias_momentums / (1-self.beta1 ** (self.iteration+1))
        
        # Update cache with squared current gradients
        layer.weight_cache = self.beta2 * layer.weight_cache + (1-self.beta2) * layer.dweights**2
        layer.bias_cache = self.beta2 * layer.bias_cache + (1-self.beta2) * layer.dbiases**2
        
        # Correct cache
        weight_cache_corrected = layer.weight_cache / (1-self.beta2 ** (self.iteration+1))
        bias_cache_corrected = layer.bias_cache / (1-self.beta2 ** (self.iteration+1))
        
        # Vanilla SGD parameter update + norm with square rooted cache
        layer.weights += -self.lr_curr * weight_momentums_corrected / (np.sqrt(weight_cache_corrected) + self.epsilon)
        layer.biases += -self.lr_curr * bias_momentums_corrected / (np.sqrt(bias_cache_corrected) + self.epsilon)

That is Adam! For what feels like the 20th time, full implementation is in the classes.py.

Now, we can run our model for the final time -- now with the Adam optimizer!

In [3]:
# Create some training data using the spiral_data function
X, y = spiral_data(samples=100, classes=3)

# Create dense layer with 2 input features and 64 output features
dense1 = DenseLayer(2, 64)

# Use a relu activation
activation1 = ReLU()

# Create a dense layer for our output with 64 as an input and 3 as an output
dense2 = DenseLayer(64, 3)

# Use a softmax combined with ccel. for our output 
activationLoss = SoftMaxCategoricalCrossEntropy()

# Initialize optimizer as Adam with a learning rate and a decay
optimizer = Adam(lr=0.05, decay=5e-7)

# Create the loop that trains our model in epochs
for epoch in range(10000):
    # Perform the forward pass, as shown previously
    dense1.forward(X)
    activation1.forward(dense1.output)
    dense2.forward(activation1.output)
    loss = activationLoss.forward(dense2.output, y)
    
    # Calculate the accuracy
    predictions = np.argmax(activationLoss.output, axis=1)
    if len(y.shape) == 2:
        y = np.argmax(y, axis=1)
    accuracy = np.mean(predictions==y)
    
    if not epoch % 100:
        print(f"epoch: {epoch}, accuracy: {accuracy: .3f}, loss: {loss: .3f}, lr: {optimizer.lr_curr}")
        
    # Perform the backward pass
    activationLoss.backward(activationLoss.output, y)
    dense2.backward(activationLoss.dinputs)
    activation1.backward(dense2.dinputs)
    dense1.backward(activation1.dinputs)
    
    # Use the optimizer and update the weights and biases
    optimizer.preUpdateParams()
    optimizer.updateParams(dense1)
    optimizer.updateParams(dense2)
    optimizer.postUpdateParams()

epoch: 0, accuracy:  0.307, loss:  1.099, lr: 0.05
epoch: 100, accuracy:  0.750, loss:  0.630, lr: 0.04999752512250644
epoch: 200, accuracy:  0.800, loss:  0.463, lr: 0.04999502549496326
epoch: 300, accuracy:  0.823, loss:  0.399, lr: 0.049992526117345455
epoch: 400, accuracy:  0.837, loss:  0.349, lr: 0.04999002698961558
epoch: 500, accuracy:  0.873, loss:  0.326, lr: 0.049987528111736124
epoch: 600, accuracy:  0.880, loss:  0.306, lr: 0.049985029483669646
epoch: 700, accuracy:  0.870, loss:  0.280, lr: 0.049982531105378675
epoch: 800, accuracy:  0.873, loss:  0.272, lr: 0.04998003297682575
epoch: 900, accuracy:  0.877, loss:  0.253, lr: 0.049977535097973466
epoch: 1000, accuracy:  0.890, loss:  0.246, lr: 0.049975037468784345
epoch: 1100, accuracy:  0.883, loss:  0.236, lr: 0.049972540089220974
epoch: 1200, accuracy:  0.890, loss:  0.227, lr: 0.04997004295924593
epoch: 1300, accuracy:  0.927, loss:  0.220, lr: 0.04996754607882181
epoch: 1400, accuracy:  0.913, loss:  0.214, lr: 0.049

Here we have the best result that we've achieved thus far -- and it can't get much better! Adam has performed the best on this task, and it is generally recommended to start with Adam and then compare the other optimizers if you are not getting the expected (or hoped-for) results.

While we achieved great results here, with an end accuracy of 95%, they may be a little bit *too* good. The next chapter will talk about the risks this may bring, but for now, life is good!

### Anyways, that's it for this chapter! Thanks for following along with my annotations of *Neural Networks from Scratch* by Kinsley and Kukieła! 