## Stochastic Gradient Descent

To adjust the weights and biases using the derivatives gathered from back propagtion, Stochastic Gradient Descent will be used. Three hyperparemters will be used to ensure the global minimum will be reached: a learning rate, a learning rate decay and momentum.

In [1]:
# Import relavant modules
import numpy as np

# Import class objects for the neural network and spiral data
from layerdense import *
from costfunctions import *
from spiraldata import spiral_data

In [2]:
# SGD optimiser class
class Optimizer_SGD:
    # Initialise optimiser and set default params
    def __init__(self, learning_rate = 1.0, decay = 0.0, momentum = 0.0):
        self.learning_rate = learning_rate
        self.current_learning_rate = learning_rate
        self.decay = decay 
        self.iterations = 0
        self.momentum = momentum
    
    # Call once before any parameter updates
    def pre_update_params(self):
        # If decay argument, apply decay to current learning rate calculations
        self.current_learning_rate = self.learning_rate * \
                                     (1.0 / (1.0 + self.decay * self.iterations))
    
    # Update parameters
    def update_params(self, layer):
        # If momentum was passed
        if self.momentum:
            # If layer output does not contain momentum arrays
            if not hasattr(layer, 'weight_momentums'):
                layer.weight_momentums = np.zeros_like(layer.weights)
                layer.bias_momentums = np.zeros_like(layer.biases)
                
                weight_updates = \
                                 self.momentum * layer.weight_momentums - \
                                 self.current_learning_rate * layer.dweights
                layer.weight_momentums = weight_updates
                
                # Update bias
                bias_updates = self.momentum * layer.bias_momentums - \
                               self.current_learning_rate * layer.dbiases
                layer.bias_momentums = bias_updates
                
                # If no momentum is used 
            else:
                weight_updates = -self.current_learning_rate * \
                                  layer.dweights
                bias_updates = -self.current_learning_rate * layer.dbiases
                
            # Update weights and biases
            layer.weights += weight_updates
            layer.biases += bias_updates
        
    # Call once after a params update
    def post_update_params(self):
        self.iterations += 1                    

Test the optimising function using the spiral data

In [4]:
# Create spiral data
X, y = spiral_data(100, 3)

# Initiate instances of NN classes. We need more neurons in the hidden layer to make it more accurate.
hidden_layer = LayerDense(2, 50)
ReLU = ActivationRelU()
output_layer = LayerDense(50, 3)
softmax_cost = ActivationSoftmaxCost()

# Create optimiser instance object
sgd = Optimizer_SGD(decay = 1e-3, momentum = 0.9)

# Train in epochs. 40001 iterations.
for epoch in range(100001):
    # Forward propagation
    hidden_layer.forward(X)
    ReLU.forward(hidden_layer.output)
    output_layer.forward(ReLU.output)
    
    # Calculate error
    cost = softmax_cost.forward(output_layer.output, y)
    
    # Calculate accuracy from output of softmax and y
    predictions = np.argmax(softmax_cost.output, axis=1)
    accuracy = np.mean(predictions==y)
    
    # Print statistics per set of epochs
    if not epoch % 500:
        print(f'epoch: {epoch}, ' +
              f'acc: {accuracy:.3f}, ' +
              f'loss: {cost:.3f}, ' +
              f'lr: {sgd.current_learning_rate}')
        
    # Back propagation 
    softmax_cost.backward(softmax_cost.output, y)
    output_layer.backward(softmax_cost.dinputs)
    ReLU.backward(output_layer.dinputs)
    hidden_layer.backward(ReLU.dinputs)
    
    # Update weights and biases
    sgd.pre_update_params()
    sgd.update_params(hidden_layer)
    sgd.update_params(output_layer)
    sgd.post_update_params()

epoch: 0, acc: 0.283, loss: 1.099, lr: 1.0
epoch: 500, acc: 0.437, loss: 1.061, lr: 0.66711140760507
epoch: 1000, acc: 0.433, loss: 1.052, lr: 0.5002501250625312
epoch: 1500, acc: 0.467, loss: 1.018, lr: 0.4001600640256102
epoch: 2000, acc: 0.510, loss: 0.980, lr: 0.33344448149383127
epoch: 2500, acc: 0.480, loss: 0.956, lr: 0.2857959416976279
epoch: 3000, acc: 0.513, loss: 0.933, lr: 0.25006251562890724
epoch: 3500, acc: 0.543, loss: 0.902, lr: 0.22227161591464767
epoch: 4000, acc: 0.547, loss: 0.875, lr: 0.2000400080016003
epoch: 4500, acc: 0.580, loss: 0.847, lr: 0.18185124568103292
epoch: 5000, acc: 0.603, loss: 0.823, lr: 0.16669444907484582
epoch: 5500, acc: 0.630, loss: 0.798, lr: 0.15386982612709646
epoch: 6000, acc: 0.643, loss: 0.773, lr: 0.1428775539362766
epoch: 6500, acc: 0.650, loss: 0.750, lr: 0.13335111348179757
epoch: 7000, acc: 0.667, loss: 0.726, lr: 0.12501562695336915
epoch: 7500, acc: 0.700, loss: 0.702, lr: 0.11766090128250381
epoch: 8000, acc: 0.703, loss: 0.682

epoch: 65000, acc: 0.837, loss: 0.416, lr: 0.015151744723404902
epoch: 65500, acc: 0.837, loss: 0.415, lr: 0.015037820117595755
epoch: 66000, acc: 0.837, loss: 0.415, lr: 0.014925595904416484
epoch: 66500, acc: 0.840, loss: 0.414, lr: 0.014815034296804398
epoch: 67000, acc: 0.840, loss: 0.414, lr: 0.01470609861909734
epoch: 67500, acc: 0.837, loss: 0.413, lr: 0.014598753266471044
epoch: 68000, acc: 0.840, loss: 0.412, lr: 0.01449296366614009
epoch: 68500, acc: 0.840, loss: 0.412, lr: 0.014388696240233673
epoch: 69000, acc: 0.843, loss: 0.411, lr: 0.014285918370262433
epoch: 69500, acc: 0.843, loss: 0.411, lr: 0.01418459836309735
epoch: 70000, acc: 0.847, loss: 0.410, lr: 0.014084705418386176
epoch: 70500, acc: 0.847, loss: 0.410, lr: 0.013986209597337027
epoch: 71000, acc: 0.847, loss: 0.409, lr: 0.013889081792802679
epoch: 71500, acc: 0.847, loss: 0.409, lr: 0.013793293700602768
epoch: 72000, acc: 0.847, loss: 0.408, lr: 0.013698817792024549
epoch: 72500, acc: 0.847, loss: 0.408, lr: 

The neural network achieved 90 % accuracy after 100000 epochs. To achieve a greater accuracy, further hyper parameters will need to be added such L1 and L2 regularisation.