# Chapter 10: Optimizers

In [4]:
# Preface: Import necessary packages:
import numpy as np
import matplotlib.pyplot as plt
import math
import nnfs
from nnfs.datasets import spiral_data
from timeit import timeit
from resources.classes import DenseLayer, ReLU, SoftMax, Loss, CategoricalCrossEntropy, SoftMaxCategoricalCrossEntropy, SGD

We spent chapter nine learning how to calculate and apply the gradients to adjust weights & biases and ultimately reduce loss. What we ended up doing was subtracting a fraction of the gradient from each weight and bias parameter -- and that is called stochastic gradient descent (SGD). Most optimizers that we use are actually just modifications of SGD.

## Section 1: Stochastic Gradient Descent (SGD)

Anyone who's heard about optimizers before has probably heard many names -- seemingly used interchangeably -- including:
- Stochastic Gradient Descent or SGD.
- Vanilla Gradient Descent, Gradient Descent or GD, Batch Gradient Descent or BGD.
- Mini-batch Gradient Descent or MBGD.

However, these are not the same. SGD has historically referred to an optimizer that fits a single sample at a time. Meanwhile, BGD is an optimizer used to fit a whole dataset at once. Lastly, MBGD is an optimizer used to fit slices of a dataset, which we'd call batches in our context.

As a general rule of thumb, we call slices of data **batches**. However, historically, these same slices have been referred to as **mini-batches** in the context of SGD. With that said, the field has evolved and the two are now used interchangeably to the point where we actually think of the SGD optimizer as one that assumes a batch of data.
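To make the distinction concrete, here's a minimal sketch of how the three variants slice the same data. The helper `iterate_batches` and the toy array are made up for illustration; they aren't part of our classes.py.

```python
import numpy as np

# Hypothetical toy dataset: 12 samples with 2 features each
X_toy = np.arange(24).reshape(12, 2)

def iterate_batches(data, batch_size):
    """Yield consecutive slices of `data` with at most `batch_size` rows."""
    for start in range(0, len(data), batch_size):
        yield data[start:start + batch_size]

# "Historical" SGD: one sample per parameter update
sgd_batches = list(iterate_batches(X_toy, batch_size=1))

# Batch gradient descent: the whole dataset per parameter update
bgd_batches = list(iterate_batches(X_toy, batch_size=len(X_toy)))

# Mini-batch gradient descent: slices of the dataset (here, 4 samples each)
mbgd_batches = list(iterate_batches(X_toy, batch_size=4))

print(len(sgd_batches), len(bgd_batches), len(mbgd_batches))  # 12 1 3
```

In every case the update rule itself is the same; the only thing that changes is how many samples feed each update.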

In the case of SGD, we need to choose a learning rate. We then subtract $\text{learning_rate} \cdot \text{parameter_gradients}$ from the current parameter values.
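Written as an update rule, this looks roughly like the following. The symbols are my own shorthand rather than anything from our code: $\theta$ stands for any weight or bias, $\eta$ for the learning rate, and $L$ for the loss.

$$\theta \leftarrow \theta - \eta \cdot \frac{\partial L}{\partial \theta}$$

As a quick worked example with made-up numbers: if $\theta = 0.5$, $\eta = 0.1$, and the gradient is $2.0$, the updated value is $0.5 - 0.1 \cdot 2.0 = 0.3$.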

We'll walk through creating an example SGD class below -- but I've already made one and added it to the classes.py file :).

In [None]:
class ExampleSGD():
    # Initialize the class
    def __init__(self, lr=1.0):
        # Store the learning rate; just 1 if not specified
        self.lr = lr
        
    # Method to update the parameters after a backward pass
    def updateParams(self, layer):
        # Update values according to: -lr * parameter_gradients
        layer.weights += -self.lr * layer.dweights
        layer.bias += -self.lr * layer.dbiases

The above is an early version of our SGD class! To use it, we instantiate the object with `optimizer = SGD()` and then call `optimizer.updateParams(denseX)` for each dense layer `denseX` in our model.

Again, the code cell above is just an example implementation for convenience; we can directly reference the SGD() class from our classes.py.

With this built out, we can begin training our model in repeated iterations called epochs. An epoch simply means a full pass through all the training data. So, let's make our model trainable over many epochs by using a loop.

In [12]:
# Create some training data using the spiral_data function
X, y = spiral_data(samples=100, classes=3)

# Create dense layer with 2 input features and 64 output features
dense1 = DenseLayer(2, 64)

# Use a relu activation
activation1 = ReLU()

# Create a dense layer for our output with 64 as an input and 3 as an output
dense2 = DenseLayer(64, 3)

# Use softmax combined with categorical cross-entropy loss for our output
activationLoss = SoftMaxCategoricalCrossEntropy()

# Initialize optimizer
optimizer = SGD()

# Create the loop that trains our model in epochs
for epoch in range(10000):
    # Perform the forward pass, as shown previously
    dense1.forward(X)
    activation1.forward(dense1.output)
    dense2.forward(activation1.output)
    loss = activationLoss.forward(dense2.output, y)
    
    # Calculate the accuracy
    predictions = np.argmax(activationLoss.output, axis=1)
    if len(y.shape) == 2:
        y = np.argmax(y, axis=1)
    accuracy = np.mean(predictions==y)
    
    if not epoch % 100:
        print(f"epoch: {epoch}, accuracy: {accuracy: .3f}, loss: {loss: .3f}")
        
    # Perform the backward pass
    activationLoss.backward(activationLoss.output, y)
    dense2.backward(activationLoss.dinputs)
    activation1.backward(dense2.dinputs)
    dense1.backward(activation1.dinputs)
    
    # Use the optimizer and update the weights and biases
    optimizer.updateParams(dense1)
    optimizer.updateParams(dense2)

epoch: 0, accuracy:  0.320, loss:  1.099
epoch: 100, accuracy:  0.427, loss:  1.078
epoch: 200, accuracy:  0.427, loss:  1.066
epoch: 300, accuracy:  0.430, loss:  1.064
epoch: 400, accuracy:  0.433, loss:  1.062
epoch: 500, accuracy:  0.430, loss:  1.060
epoch: 600, accuracy:  0.433, loss:  1.056
epoch: 700, accuracy:  0.437, loss:  1.052
epoch: 800, accuracy:  0.437, loss:  1.059
epoch: 900, accuracy:  0.427, loss:  1.060
epoch: 1000, accuracy:  0.403, loss:  1.060
epoch: 1100, accuracy:  0.407, loss:  1.057
epoch: 1200, accuracy:  0.420, loss:  1.050
epoch: 1300, accuracy:  0.410, loss:  1.039
epoch: 1400, accuracy:  0.400, loss:  1.028
epoch: 1500, accuracy:  0.430, loss:  1.008
epoch: 1600, accuracy:  0.423, loss:  1.015
epoch: 1700, accuracy:  0.457, loss:  0.998
epoch: 1800, accuracy:  0.450, loss:  0.982
epoch: 1900, accuracy:  0.413, loss:  0.993
epoch: 2000, accuracy:  0.483, loss:  0.975
epoch: 2100, accuracy:  0.480, loss:  0.949
epoch: 2200, accuracy:  0.517, loss:  0.933


We're now building models that learn and perform better on their dataset -- which we can see from the increasing accuracy and decreasing loss. However, our accuracy seems to get stuck just below the ~.70 mark with about ~.62 loss. That tells us the model may have reached a local minimum, which we'll talk about soon. It also suggests that adding epochs likely won't be very helpful at this point, and that we need to work on our optimizer. The first thing we can change is the learning rate, so that's what we'll look at next!

## Section 2: Learning Rates

We want to apply a fraction of the gradients to the parameters in order to descend the loss value. Typically, we don't apply the full gradient value (which would just be the slope of the tangent line) because it is usually too large to produce meaningful improvements. Instead, we take small steps: calculate the gradient, then update the parameters by a negative fraction of it. These small steps help ensure that we follow the direction of steepest descent -- but steps that are too small can cause learning to stagnate.
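Here's a small sketch of why a fraction of the gradient tends to work better than the whole thing. It uses a made-up one-parameter loss, $x^2$, instead of a real network, so the function names and numbers are purely illustrative.

```python
def loss(x):
    # Hypothetical toy loss with its minimum at x = 0
    return x ** 2

def gradient(x):
    # Derivative of x**2
    return 2 * x

def descend(x, lr, steps):
    """Run plain gradient descent and return every visited x."""
    history = [x]
    for _ in range(steps):
        x = x - lr * gradient(x)
        history.append(x)
    return history

# Applying the full gradient (lr = 1.0) just bounces between 3 and -3
full_steps = descend(3.0, lr=1.0, steps=4)
print(full_steps)  # [3.0, -3.0, 3.0, -3.0, 3.0]

# Applying a tenth of the gradient moves steadily toward the minimum
small_steps = descend(3.0, lr=0.1, steps=4)
print([round(x, 3) for x in small_steps])
```

With the full gradient the loss never improves on this toy problem, while the smaller step shrinks `x` (and therefore the loss) on every iteration.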

When the learning rate is too small, the updates to the parameters are too small, and the model may get stuck in a local minimum instead of finding the global minimum. So, how do we know if our model is at the global minimum? We know when the loss gets as close to 0 as possible -- but in practice we rarely reach 0, and pushing the loss that low may cause overfitting (a topic which I assume will be talked about in a later chapter).
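As a rough illustration of getting stuck, here's a sketch on a made-up one-parameter loss, $x^4 - 3x^2 + x$, which has a shallow valley near $x \approx 1.13$ and a deeper one near $x \approx -1.30$. The function, starting point, and learning rates are all assumptions chosen to make the effect visible.

```python
def loss(x):
    # Hypothetical toy loss: local minimum near x = 1.13,
    # global minimum near x = -1.30
    return x**4 - 3 * x**2 + x

def gradient(x):
    # Derivative of the toy loss above
    return 4 * x**3 - 6 * x + 1

def descend(x, lr, steps=200):
    """Run plain gradient descent and return the final x."""
    for _ in range(steps):
        x -= lr * gradient(x)
    return x

# Same starting point, two different learning rates
x_small = descend(2.0, lr=0.01)
x_large = descend(2.0, lr=0.1)

print(f"lr=0.01 -> x = {x_small:.3f}, loss = {loss(x_small):.3f}")
print(f"lr=0.1  -> x = {x_large:.3f}, loss = {loss(x_large):.3f}")
```

With the small learning rate, the parameter creeps into the nearest valley and stays there. With the larger one, the first step happens to carry it over the hump and it settles in the deeper valley. That outcome is specific to this function and starting point; a larger learning rate is not guaranteed to find the global minimum in general.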

The learning rate by itself is not enough -- you may still get stuck in local minima no matter what you set it to. For that reason, we introduce momentum. Momentum, in an optimizer, adds to the gradient what we, in the physical world, would call inertia. For example, if we throw a ball uphill, then with enough force or a small enough hill, the ball can roll over the crest of the hill and onto the other side.
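The ball analogy can be sketched in a few lines. This uses one common formulation of momentum (keep a running "velocity" that blends the previous update with the new gradient step) on a made-up one-parameter loss, $x^4 - 3x^2 + x$, with a local minimum near $x \approx 1.13$ and a global minimum near $x \approx -1.30$. The function and the value `momentum=0.9` are assumptions for illustration, not our final optimizer.

```python
def gradient(x):
    # Derivative of the hypothetical toy loss x**4 - 3*x**2 + x
    return 4 * x**3 - 6 * x + 1

def descend(x, lr, momentum, steps=500):
    """Gradient descent with a velocity term; momentum=0.0 is plain descent."""
    velocity = 0.0
    for _ in range(steps):
        # Keep a fraction of the previous update, then add the new gradient step
        velocity = momentum * velocity - lr * gradient(x)
        x += velocity
    return x

# Same learning rate and starting point, with and without momentum
x_plain = descend(2.0, lr=0.01, momentum=0.0)
x_momentum = descend(2.0, lr=0.01, momentum=0.9)

print(f"no momentum:   x = {x_plain:.3f}")
print(f"with momentum: x = {x_momentum:.3f}")
```

Without momentum the parameter stops in the shallow valley. With momentum it arrives at that valley still moving, rolls over the crest, and ends up in the deeper one. As with any toy example, this shows the idea rather than a guarantee.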

These parameters in the optimizer (such as the learning rate or momentum) are referred to as hyperparameters.

If the learning rate is too big, the model may start jumping around during SGD because too much of the gradient is applied at each step. At the extreme, this may cause a gradient explosion. A gradient explosion is where the parameter updates cause the function's output to rise instead of fall, causing the gradient and loss to increase with each step.
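To see what that runaway behaviour looks like, here's a minimal sketch on a made-up one-parameter loss, $x^2$, with a deliberately oversized learning rate of 1.1. The numbers are illustrative only.

```python
def gradient(x):
    # Derivative of the hypothetical toy loss x**2
    return 2 * x

x = 1.0
lr = 1.1  # deliberately too large for this loss

losses = []
grads = []
for step in range(5):
    # Record the loss and gradient size before each update
    losses.append(x ** 2)
    grads.append(abs(gradient(x)))
    # Each update overshoots the minimum and lands further away than before
    x -= lr * gradient(x)

for step, (l, g) in enumerate(zip(losses, grads)):
    print(f"step: {step}, loss: {l:.3f}, |gradient|: {g:.3f}")
```

Every step flips the sign of `x` and grows its magnitude, so both the loss and the gradient keep rising instead of falling.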

Ultimately, choosing the correct hyperparameters will enable you to speed up the learning process and save yourself money and time. It's typically best to start with the optimizer defaults, perform a few steps, and then observe the training process when tuning different settings. However, it's always useful to have some system that actively tunes your hyperparameters even during training. One way that we do this is by using learning rate decay. 

## Section 3: Learning Rate Decay   