# Chapter 15: Dropout

In [1]:
# Preface: import the necessary packages
import numpy as np
import matplotlib.pyplot as plt
import math
import nnfs
from nnfs.datasets import spiral_data
from timeit import timeit
from resources.classes import DenseLayer, ReLU, SoftMax, Loss, CategoricalCrossEntropy, SoftMaxCategoricalCrossEntropy, SGD, AdaGrad, RMSProp, Adam, DropoutLayer
import random

Yet another option for neural network regularization is adding a dropout layer.

This kind of layer disables some neurons while leaving the others unchanged. The idea is similar to that of the other regularization methods: it is meant to prevent a neural network from becoming too dependent on any particular neuron. Dropout also helps with co-adaptation, which happens when certain neurons depend on the output values of other neurons and, as a result, don't learn the underlying patterns on their own.

Dropout works by randomly disabling neurons at a given rate during every forward pass, forcing the network to make accurate predictions with only a random subset of its neurons. This pushes the model to use a greater number of neurons for the same purpose, which makes it more likely to learn the patterns in the data rather than memorize the data itself.

## Section 1: The Forward Pass

We carry out dropout by turning off certain neurons at random, which we do by simply zeroing their outputs. To do so, we use a filter: an array with the same shape as the layer output, filled with numbers drawn from a Bernoulli distribution. To give an exact definition: a Bernoulli distribution is a binary probability distribution where we get a value of 1 with probability $p$ and a value of 0 with probability $q = 1 - p$. For example, if we take a random value from the distribution, let's call it $r_{i}$, then:
$$
P(r_{i} = 1) = p \\
P(r_{i} = 0) = q = 1 - p = 1 - P(r_{i} = 1)
$$
If the probability of $r_{i}$ being 1 is $p$, then the probability of it being 0 is $q = 1 - p$, which we write as:
$$
r_{i} \sim Bernoulli(p)
$$
In other words, each $r_{i}$ is a value drawn from a Bernoulli distribution, with probability $p$ of being 1. Drawing one such value per output gives us an array in which each entry is 1 with probability $p$ and 0 with probability $q$. We then apply this filter to the output of the layer we want to add dropout to.
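
As a rough illustration of the filter described above, here is a minimal sketch in plain Python. The variable names, the seed, and the example values are assumptions made for this sketch, and `random.random() < p` is just one simple way to produce a Bernoulli draw.

```python
import random

random.seed(0)  # fixed seed so the sketch is repeatable

p = 0.7  # probability of drawing a 1 (i.e. keeping a neuron)
layer_output = [0.27, -1.03, 0.67, 0.99, 0.05]

# One Bernoulli(p) draw per output value: 1 with probability p, otherwise 0
mask = [1 if random.random() < p else 0 for _ in layer_output]

# Applying the filter zeroes every output whose mask entry is 0
filtered = [value * r for value, r in zip(layer_output, mask)]

# Over many draws, the fraction of 1s should approach p
draws = [1 if random.random() < p else 0 for _ in range(100_000)]
fraction_of_ones = sum(draws) / len(draws)

print(mask)
print(filtered)
print(round(fraction_of_ones, 2))
```

Note that any single mask can contain more or fewer zeros than expected; only the long-run fraction settles near $p$.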

In our code, we only have one hyperparameter: the dropout rate. The dropout rate is the fraction of neurons in that layer to disable, which corresponds to $q$ above. For example, a dropout rate of 0.3 means that each neuron has a 30% chance of being disabled during each forward pass, so about 30% of neurons are disabled on average.

Let's demonstrate this in vanilla Python:

In [5]:
dropoutRate = 0.3
# Mock output for testing purposes
exOutput = [0.27, -1.03, 0.67, 0.99, 0.05, -0.37, -2.01, 1.13, -0.07, 0.73]

# Repeat until enough outputs have been zeroed
while True:
    # Randomly choose an index (it may already be zeroed, in which case nothing changes)
    index = random.randint(0, len(exOutput) - 1)
    exOutput[index] = 0

    # Count the total number of 0's
    zeroD = 0
    for value in exOutput:
        if value == 0:
            zeroD += 1

    # Stop once the zeroed fraction reaches the dropout rate
    if zeroD / len(exOutput) >= dropoutRate:
        break

print(exOutput)

[0.27, 0, 0.67, 0.99, 0, -0.37, -2.01, 1.13, 0, 0.73]


That's the basic idea of dropout; pretty simple, right? Yes, but there's a simpler way of doing this!

We can leverage the `np.random.binomial()` method. The binomial distribution differs from the Bernoulli distribution in only one way: it adds a parameter n, the number of experiments run per sample, and each sample is the number of successes out of those n experiments. With n = 1, it reduces to a Bernoulli distribution.

`np.random.binomial()` takes the parameters n, p, and size. Here n is how many experiments to run for each sample, p is the probability of an experiment's result being 1, and size is the number (or shape) of samples to return, each sample being the result of n experiments.
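
To make the three parameters concrete, here is a small sketch. The seed and the specific values of n, p, and size are arbitrary choices for illustration only.

```python
import numpy as np

np.random.seed(0)  # fixed seed so the sketch is repeatable

# n=1: each sample is a single experiment, so every value is 0 or 1 (a Bernoulli draw)
bernoulli_like = np.random.binomial(1, 0.7, size=10)

# n=5: each sample counts the successes out of 5 experiments, so values lie between 0 and 5
counts = np.random.binomial(5, 0.7, size=10)

# size can also be a shape tuple, which is what lets a mask match a layer's output shape
mask_2d = np.random.binomial(1, 0.7, size=(2, 4))

print(bernoulli_like)
print(counts)
print(mask_2d.shape)
```

For dropout we only ever need the n=1 case, since each neuron is either kept or dropped.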

So, let's do it in Numpy now!

In [7]:
# Hard wire our dropout rate
dropoutRate = 0.3
# Mock output for testing purposes
exOutput = np.array([0.27, -1.03, 0.67, 0.99, 0.05, -0.37, -2.01, 1.13, -0.07, 0.73])

exOutput *= np.random.binomial(1, 1-dropoutRate, exOutput.shape)

print(exOutput)

[ 0.27 -1.03  0.67  0.99  0.   -0.37 -2.01  1.13 -0.    0.73]


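Something to note here is that, in the book's presentation, the dropout rate is the fraction of neurons to drop. Frameworks don't all agree on this convention, so check the documentation: PyTorch's `torch.nn.Dropout(p)` also treats `p` as the probability of zeroing an element, while older TensorFlow 1.x code passed a `keep_prob` (the fraction to keep) to `tf.nn.dropout`.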

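So, we're on the right track with dropout, but there's one more thing to add: scaling. Dropout is only applied during training, so during training the expected sum of the inputs to the next layer is scaled down by a factor of (1 - rate) compared to inference, when every neuron is active. To resolve this mismatch, we scale the surviving outputs back up by dividing by the keep probability, (1 - rate).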

We do this as such:
```
exOutput *= np.random.binomial(1, 1-dropoutRate, exOutput.shape) / (1-dropoutRate)
```

This way, the expected magnitude of the outputs is the same during training and inference, and there is no longer an imbalance!
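
One way to convince yourself of this is to average over many random masks. The sketch below is an illustration under assumptions (the trial count, seed, and variable names are mine): it compares the mean sum of the outputs with and without the scaling step.

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded generator so the sketch is repeatable

dropout_rate = 0.3
keep_prob = 1 - dropout_rate
example_output = np.array([0.27, -1.03, 0.67, 0.99, 0.05,
                           -0.37, -2.01, 1.13, -0.07, 0.73])

# Draw many independent masks at once: one row per trial
trials = 20_000
masks = rng.binomial(1, keep_prob, size=(trials, example_output.size))

# Sum of the layer output for each trial, without and with scaling
unscaled_sums = (example_output * masks).sum(axis=1)
scaled_sums = (example_output * masks / keep_prob).sum(axis=1)

print("sum without dropout:       ", example_output.sum())
print("mean sum, unscaled dropout:", unscaled_sums.mean())
print("mean sum, scaled dropout:  ", scaled_sums.mean())
```

The unscaled mean should land near (1 - rate) times the original sum, while the scaled mean should land near the original sum itself.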

## Section 2: The Backward Pass

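This section will very briefly show the derivative of the dropout function. Let's denote dropout as $D_{r}$, where $z_{i}$ is the input, $r_{i}$ is the Bernoulli mask value, and $q$ is the dropout rate. The forward pass is $D_{r_{i}} = \frac{z_{i} r_{i}}{1-q}$, and since the mask does not depend on $z_{i}$, the derivative is: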
$$
\frac{\partial}{\partial z_{i}} D_{r_{i}} = \frac{r_{i}}{1-q} 
$$

That's really it. I've kept that super short because that's really all the detail you need. Now we can implement it in a class. I'll do an example class here and implement the full class in our classes.py file.

In [None]:
class ExampleDropoutLayer:
    # Method to initialize
    def __init__(self, rate):
        # Remember, we invert the rate: store the probability of keeping a neuron
        self.rate = 1 - rate
        
    # Forward pass method
    def forward(self, inputs):
        # Save the inputs
        self.inputs = inputs
        # Create mask and scale it
        self.binaryMask = np.random.binomial(1, self.rate, size=inputs.shape) / self.rate
        # Apply the mask to the outputs
        self.output = inputs * self.binaryMask
        
    # Backward pass method
    def backward(self, dvalues):
        # The gradient
        self.dinputs = dvalues * self.binaryMask
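
A quick way to sanity-check the backward pass is to compare the analytic gradient with a finite-difference estimate while holding the mask fixed. This is a minimal sketch using standalone variables rather than the class itself; the helper `dropout`, the step size `eps`, and the seed are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded generator so the sketch is repeatable

q = 0.3  # dropout rate
z = np.array([0.27, -1.03, 0.67, 0.99, 0.05])

# Draw the mask once and hold it fixed, as happens within a single forward/backward pass
r = rng.binomial(1, 1 - q, size=z.shape)

def dropout(values):
    # Forward pass with the fixed mask, scaled by the keep probability
    return values * r / (1 - q)

# Analytic derivative of each output with respect to its own input
analytic = r / (1 - q)

# Central finite-difference estimate of the same derivative
eps = 1e-6
numeric = np.empty_like(z)
for i in range(z.size):
    step = np.zeros_like(z)
    step[i] = eps
    numeric[i] = (dropout(z + step)[i] - dropout(z - step)[i]) / (2 * eps)

print(analytic)
print(numeric)
```

The two arrays should agree up to floating-point error, with zeros wherever the mask dropped a neuron.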

Now that's all we need for our dropout layer. We just slide it in between one layer's outputs and the next layer's inputs, and it works plug-and-play. Let's use this in our model now!

In [4]:
# Create some training data using the spiral_data function
X, y = spiral_data(samples=1000, classes=3)

# Create dense layer with 2 input features and 512 output features
# NEW: changed the layer size from 64 to 512 to improve accuracy
dense1 = DenseLayer(2, 512, weightl2=5e-4, biasl2=5e-4)

# Use a relu activation
activation1 = ReLU()

# NEW: Create the dropout layer
dropout1 = DropoutLayer(0.1)

# Create a dense layer for our output with 512 inputs and 3 outputs
# NEW: changed the layer size from 64 to 512 to improve accuracy
dense2 = DenseLayer(512, 3)

# Use softmax combined with categorical cross-entropy loss for our output
activationLoss = SoftMaxCategoricalCrossEntropy()

# Initialize optimizer as Adam with a decay
optimizer = Adam(lr=0.05, decay=5e-7)

# Create the loop that trains our model in epochs
for epoch in range(10000):
    # Perform the forward pass, as shown previously
    dense1.forward(X)
    activation1.forward(dense1.output)
    dropout1.forward(activation1.output)
    dense2.forward(dropout1.output)
    # calculate dataLoss, regLoss, and then add for total loss
    dataLoss = activationLoss.forward(dense2.output, y)
    regLoss = activationLoss.loss.regularizationLoss(dense1) + activationLoss.loss.regularizationLoss(dense2)
    loss = dataLoss + regLoss
    
    # Calculate the accuracy
    predictions = np.argmax(activationLoss.output, axis=1)
    if len(y.shape) == 2:
        y = np.argmax(y, axis=1)
    accuracy = np.mean(predictions==y)
    
    if not epoch % 100:
        print(f"epoch: {epoch}, accuracy: {accuracy: .3f}, loss: {loss: .3f}, dLoss: {dataLoss}, rLoss: {regLoss}, lr: {optimizer.lr_curr}")
        
    # Perform the backward pass
    activationLoss.backward(activationLoss.output, y)
    dense2.backward(activationLoss.dinputs)
    dropout1.backward(dense2.dinputs)
    activation1.backward(dropout1.dinputs)
    dense1.backward(activation1.dinputs)
    
    # Use the optimizer and update the weights and biases
    optimizer.preUpdateParams()
    optimizer.updateParams(dense1)
    optimizer.updateParams(dense2)
    optimizer.postUpdateParams()

epoch: 0, accuracy:  0.289, loss:  1.099, dLoss: 1.0987270628564938, rLoss: 5.2330243903968354e-05, lr: 0.05
epoch: 100, accuracy:  0.714, loss:  0.735, dLoss: 0.6753092698186839, rLoss: 0.05958928210316922, lr: 0.04999752512250644
epoch: 200, accuracy:  0.766, loss:  0.642, dLoss: 0.5676832829999394, rLoss: 0.07415277804248334, lr: 0.04999502549496326
epoch: 300, accuracy:  0.778, loss:  0.630, dLoss: 0.5510411084907014, rLoss: 0.07868330977248542, lr: 0.049992526117345455
epoch: 400, accuracy:  0.811, loss:  0.566, dLoss: 0.4882249400259358, rLoss: 0.07795775180196346, lr: 0.04999002698961558
epoch: 500, accuracy:  0.826, loss:  0.568, dLoss: 0.49081368857481905, rLoss: 0.07754818539347334, lr: 0.049987528111736124
epoch: 600, accuracy:  0.837, loss:  0.541, dLoss: 0.46610160442782295, rLoss: 0.07444942020261587, lr: 0.049985029483669646
epoch: 700, accuracy:  0.827, loss:  0.541, dLoss: 0.45862855836980543, rLoss: 0.08273262613898158, lr: 0.049982531105378675
epoch: 800, accuracy:  

That's fine performance, but we should run our validation set as well, and see how the model performs!

In [5]:
# Model validation
X_test, y_test = spiral_data(samples=100, classes=3)

# Forward pass without the dropout layer: dropout is only used during training
dense1.forward(X_test)
activation1.forward(dense1.output)
dense2.forward(activation1.output)
loss = activationLoss.forward(dense2.output, y_test)

predictions = np.argmax(activationLoss.output, axis=1)
if len(y_test.shape) == 2:
    y_test = np.argmax(y_test, axis=1)
accuracy = np.mean(predictions==y_test)

print(f"validation: accuracy: {accuracy: .3f}, loss: {loss: .3f}")

validation: accuracy:  0.870, loss:  0.358


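So! Now we actually have something where the model performs better on a validation set than on the training set. That's interesting, and it's partly expected: the training accuracy was measured with dropout active, while validation uses every neuron. Either way, it suggests the model is generalizing rather than just memorizing, so we're doing a pretty good job!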

### Anyways, that's it for this chapter! Thanks for following along with my annotations of *Neural Networks from Scratch* by Kinsley and Kukieła!