Another option for neural network regularization is adding a **dropout layer**. This type of layer disables some neurons, while the others pass through unchanged. The idea here, as with regularization, is to prevent a neural network from becoming too dependent on any single neuron, or from relying on one neuron entirely in a specific instance (which can be common if a model overfits the training data). Another problem dropout can help with is **co-adaptation**, which happens when neurons depend on the output values of other neurons and do not learn the underlying function on their own. Dropout can also help with **noise** and other perturbations in the training data, as more neurons working together mean that the model can learn more complex functions.

The dropout function works by randomly disabling neurons at a given rate during every forward pass, forcing the network to learn how to make accurate predictions with only a random subset of neurons remaining. Dropout forces the model to use more neurons for the same purpose, resulting in a higher chance of learning the underlying function that describes the data. For example, if we disable one half of the neurons during the current step, and the other half during the next step, we are forcing more neurons to learn the data, as only a part of them "see" the data and get updates in a given pass. These alternating halves of neurons are only an example; in reality, we'll use a hyperparameter to tell the dropout layer what fraction of neurons to disable at random.

Also, since the active neurons keep changing, dropout helps prevent overfitting, as the model can't use specific neurons to memorize certain samples. It's also worth mentioning that the dropout layer does not truly disable neurons, but instead zeros their output. In other words, dropout does not decrease the number of neurons used, nor does it make the training process twice as fast when half the neurons are disabled.
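
To make this concrete, here is a minimal sketch (not the layer implementation built later in this section) that applies a fresh random mask on each of three forward passes. The names `rng`, `layer_output`, and `keep_mask` are illustrative assumptions. The output array keeps its shape on every pass; only the set of zeroed entries changes.

```python
import numpy as np

# Illustrative values only; a seeded generator keeps the sketch reproducible
rng = np.random.default_rng(0)
layer_output = np.array([0.27, -1.03, 0.67, 0.99, 0.05,
                         -0.37, -2.01, 1.13, -0.07, 0.73])
dropout_rate = 0.5

dropped_outputs = []
for step in range(3):
    # True where a neuron's output is kept, False where it is zeroed
    keep_mask = rng.random(layer_output.shape) >= dropout_rate
    dropped = layer_output * keep_mask
    dropped_outputs.append(dropped)
    # The shape never changes: every neuron is still computed
    print(step, dropped.shape, dropped)
```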

## Forward Pass

In the code, we will "turn off" neurons with a filter that is an array with the same shape as the layer output but filled with numbers drawn from a Bernoulli distribution. A **Bernoulli distribution** is a binary (discrete) probability distribution where we can get a value of *1* with a probability of *p* and a value of *0* with a probability of *q*.

$$
P(r_i = 1) = p
$$
$$
P(r_i = 0) = q = 1 - p = 1 - P(r_i = 1)
$$

What this means is that the probability of this value being *1* is *p*. The probability of it being *0* is *q = 1 - p*. Therefore:

$$ r_i \sim \text{Bernoulli}(p) $$

This means that the given *r_i* is a value drawn from the Bernoulli distribution with a probability of *p* for this value to be *1*. If *r_i* is a single value from this distribution, then an array of such draws, shaped to match the layer output, can be used as a mask for these outputs.

The result is an array in which each element is *1* with a probability of *p* and *0* otherwise. We then apply this filter to the output of the layer we want to add dropout to.
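
As a minimal sketch of a single Bernoulli draw, assuming only Python's standard `random` module: `random.random()` returns a float in [0, 1), so comparing it with *p* yields 1 with a probability of *p*. The helper name `bernoulli` is hypothetical and used only for illustration.

```python
import random

def bernoulli(p):
    # random.random() is uniform on [0, 1), so this is 1 with probability p
    return 1 if random.random() < p else 0

random.seed(0)  # seeded only to make the sketch reproducible
p = 0.9

# One draw per layer output gives a mask of 0s and 1s
mask = [bernoulli(p) for _ in range(10)]
print(mask)

# Over many draws, the fraction of 1s approaches p
draws = [bernoulli(p) for _ in range(100_000)]
fraction_of_ones = sum(draws) / len(draws)
print(fraction_of_ones)
```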

In the code, we have one hyperparameter for a dropout layer. This is a value for the percentage of neurons to disable in that layer. For example, if we choose 0.10 for the dropout parameter, 10% of the neurons will be disabled at random during each forward pass.



In [2]:
import random

dropout_rate = 0.5

# Example output containing 10 values
example_output = [0.27, -1.03, 0.67, 0.99, 0.05, -0.37, -2.01, 1.13, -0.07, 0.73]

# Repeat as long as necessary
while True:

    # Randomly choose index and set value to 0
    index = random.randint(0, len(example_output) - 1)
    example_output[index] = 0

    # We might pick an index that is already zeroed.
    # There are different ways of overcoming this problem;
    # for simplicity, we count the values that are exactly 0.
    # It's extremely rare in a real model that values are
    # exactly 0, but this is certainly not the best method
    dropped_out = 0
    for value in example_output:
        if value == 0:
            dropped_out += 1

    # If required number of outputs is zeroed - leave the loop
    if dropped_out / len(example_output) >= dropout_rate:
        break

print(example_output)

[0.27, 0, 0.67, 0, 0.05, 0, -2.01, 0, 0, 0.73]


The loop above works, but NumPy's `np.random.binomial()` gives a simpler way to build the mask. A binomial distribution differs from a Bernoulli distribution in one way: it adds a parameter, *n*, which is the number of concurrent experiments (instead of just one), and it returns the number of successes from these *n* experiments.

`np.random.binomial()` works by taking the already discussed parameter *n* (number of experiments) and *p* (probability of the true value of the experiment) as well as an additional parameter *size*:

<code>

np.random.binomial(n, p, size)

</code>

The function itself can be thought of like a coin toss, where the result will be 0 or 1. The *n* is how many coin tosses we want to make. The *p* is the probability of a toss result being 1. The overall result is the sum of all toss results. The *size* is how many of these "tests" to run, and the return value is an array of overall results.
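
The coin-toss picture can be sketched as follows, assuming NumPy's `np.random.default_rng` generator (chosen here only so the example is reproducible). Each row of `tosses` holds *n* single tosses, and summing a row gives the same kind of value that one binomial draw with parameters *n* and *p* reports. The variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded for reproducibility
n, p, size = 2, 0.5, 10

# Each row holds n individual coin tosses (0 or 1)
tosses = rng.binomial(1, p, size=(size, n))

# Summing each row counts the successes in n tosses,
# which is the quantity a single binomial(n, p) draw reports
successes = tosses.sum(axis=1)
print(successes)

# Drawing the same kind of values directly
direct = rng.binomial(n, p, size=size)
print(direct)
```

The two printed arrays will generally differ element by element, since they are separate random draws, but both contain counts between 0 and *n*.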

In [4]:
import numpy as np

np.random.binomial(2, 0.5, size=10)

array([0, 2, 0, 1, 0, 1, 1, 1, 1, 1], dtype=int32)

We can use this to create a dropout layer. Our goal here is to create a filter where the intended dropout % is represented as 0, with everything else as 1. For example, let's say we have a dropout layer that we'll add after a layer that consists of 5 neurons, and we wish to have a 20% dropout. An example of a dropout filter might look like:

[1, 0, 1, 1, 1]

1/5 of that list is 0. This is an example of the filter we're going to apply to the output of the dense layer. If we multiplied a neural network's layer output by this, we'd be effectively disabling the neuron at the same index as the 0.

In [5]:
dropout_rate = 0.20
np.random.binomial(1, 1-dropout_rate, size=5)

array([0, 1, 1, 1, 1], dtype=int32)

This is based on probabilities, so there will be times when it does not look like the above array. There could be times when no neurons zero out, or when all neurons zero out. On average, these random draws will tend toward the probability we desire. Also, this was an example using a very small layer (5 neurons). On a realistically sized layer, we should find that the fraction of zeroed neurons more consistently matches our intended value.
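
A quick way to see this effect is to draw masks of increasing size and measure the fraction of zeros in each. This is only a sketch: the layer sizes and the seeded `np.random.default_rng` generator are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded for reproducibility
dropout_rate = 0.20

zero_fractions = {}
for layer_size in (5, 50, 5_000, 500_000):
    mask = rng.binomial(1, 1 - dropout_rate, size=layer_size)
    # Fraction of entries that were zeroed in this particular draw
    zero_fractions[layer_size] = 1 - mask.mean()
    print(layer_size, zero_fractions[layer_size])
```

The fraction for the smallest layer can land far from 0.20, while the largest layer should stay very close to it.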

In [37]:
example_output = np.array([0.27, -1.03, 0.67, 0.99, 0.05,
                            -0.37, -2.01, 1.13, -0.07, 0.73])

dropout_rate = 0.30
example_output *= np.random.binomial(1, 1-dropout_rate, example_output.shape)

example_output

array([ 0.27, -1.03,  0.67,  0.  ,  0.05, -0.37, -2.01,  1.13, -0.07,
        0.73])

While dropout helps a neural network generalize and is helpful for training, it's not something we want to utilize when predicting. It's not as simple as only omitting it, because the magnitude of inputs to the next neurons can be dramatically different. If we have a dropout of 50%, for example, this would suggest that, on average, our inputs to the next layer's neurons will be 50% smaller when summed, assuming they are fully connected. What that means is that we used dropout during training, and, in this example, a random 50% of neurons output a value of 0 at each of the steps. Neurons in the next layer multiply inputs by weights, sum them, and receive values of 0 for half of their inputs. If we don't use dropout during prediction, all neurons will output their values, and this state won't match the state seen during training, since the sums will be statistically about twice as big. To handle this, during prediction, we might multiply all of the outputs by the fraction of neurons kept during training (1 minus the dropout rate), but that would add another step to the forward pass, and there is a better way to achieve this. Instead, we want to scale the data back up after a dropout, during the training phase, to mimic the mean of the sum when all of the neurons output their values. With this scaling, *example_output* becomes:

In [38]:
example_output *= np.random.binomial(1, 1-dropout_rate, example_output.shape) / (1 - dropout_rate)
print(example_output)

[ 0.         -1.47142857  0.95714286  0.          0.         -0.52857143
 -0.          1.61428571 -0.1         1.04285714]


Notice that we added a division of the dropout's result by the keep rate, `1 - dropout_rate`. Since this value is a fraction, the division makes the resulting values larger, accounting for the value lost when a fraction of the neuron outputs is zeroed out. This way, we don't have to worry about the prediction and can simply omit the dropout during prediction. In any specific example, we will find that the scaled sum doesn't equal the original sum, because we're randomly dropping neurons. That said, after enough samples, the scaling will average out overall. To demonstrate this:

In [21]:
dropout_rate = 0.2
example_output = np.array([0.27, -1.03, 0.67, 0.99, 0.05,
                        -0.37, -2.01, 1.13, -0.07, 0.73])

print(f"sum initial {sum(example_output)}")

sums = []

for i in range(1000000):
    example_output2 = example_output * np.random.binomial(1, 1-dropout_rate, example_output.shape) / (1 - dropout_rate)
    sums.append(sum(example_output2))

print(f"mean sum: {np.mean(sums)}")

sum initial 0.36000000000000015
mean sum: 0.35974747500000015
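
The agreement between the two sums can also be seen from a short expectation argument. This is a sketch that reuses the notation from the Forward Pass section, with $q$ as the dropout rate, $r_i \sim \text{Bernoulli}(1 - q)$ as the mask element, and $z_i$ as the (fixed) output of neuron $i$:

$$
\mathbb{E}\left[\frac{r_i z_i}{1 - q}\right] = \frac{z_i \, \mathbb{E}[r_i]}{1 - q} = \frac{z_i (1 - q)}{1 - q} = z_i
$$

$$
\mathbb{E}\left[\sum_i \frac{r_i z_i}{1 - q}\right] = \sum_i z_i
$$

For the ten example values, $\sum_i z_i = 0.36$, which is the value the mean of the sampled sums approaches.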


## Backward Pass

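When the value of element *r_i* equals 1, the function is the neuron's output, *z*, divided by *1 - q* to compensate for the dropped values, where *q* is the dropout rate, and its derivative with respect to *z* is *1 / (1 - q)*. When *r_i* equals 0, both the function and its derivative are 0. The backward pass therefore multiplies the incoming gradient by the same scaled mask that was saved during the forward pass.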
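
Written out as a short derivation, and assuming the symbol $d_i$ (introduced here only for illustration) for the dropout output of neuron $i$ with mask element $r_i$ and dropout rate $q$:

$$
d_i = \begin{cases} \dfrac{z_i}{1 - q} & r_i = 1 \\ 0 & r_i = 0 \end{cases}
$$

$$
\frac{\partial d_i}{\partial z_i} = \begin{cases} \dfrac{1}{1 - q} & r_i = 1 \\ 0 & r_i = 0 \end{cases} = \frac{r_i}{1 - q}
$$

By the chain rule, the gradient passed to the previous layer is the incoming gradient multiplied by $r_i / (1 - q)$, which is exactly the scaled mask stored during the forward pass.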


In [39]:
class Layer_Dropout:

    def __init__(self, rate):
        # Store rate, we invert it as for example for dropout
        # of 0.1 we need success rate of 0.9
        self.rate = 1 - rate

    def forward(self, inputs):
        # save input values
        self.inputs = inputs
        # Generate and save scaled mask
        self.binary_mask = np.random.binomial(1,
                                              self.rate,
                                              size=inputs.shape) / \
                            self.rate
        # Apply mask to output values
        self.output = inputs * self.binary_mask

    def backward(self, dvalues):
        # Gradients on values
        self.dinputs = dvalues * self.binary_mask
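
As a sanity check on the backward pass, the sketch below compares the layer's gradient with a finite-difference estimate computed while holding the mask fixed. The class is repeated so the block runs on its own; the input values, the seed, and the step size `eps` are assumptions made for illustration.

```python
import numpy as np

class Layer_Dropout:
    # Same logic as the layer above, repeated so the sketch is self-contained
    def __init__(self, rate):
        self.rate = 1 - rate

    def forward(self, inputs):
        self.inputs = inputs
        self.binary_mask = np.random.binomial(
            1, self.rate, size=inputs.shape) / self.rate
        self.output = inputs * self.binary_mask

    def backward(self, dvalues):
        self.dinputs = dvalues * self.binary_mask

np.random.seed(0)  # seeded only for reproducibility
layer = Layer_Dropout(0.3)
inputs = np.array([[0.27, -1.03, 0.67, 0.99, 0.05]])

layer.forward(inputs)
# A gradient of ones isolates the layer's own derivative
layer.backward(np.ones_like(inputs))

# Finite-difference estimate of d(output)/d(input), holding the mask fixed
eps = 1e-6
numeric = ((inputs + eps) * layer.binary_mask
           - (inputs - eps) * layer.binary_mask) / (2 * eps)

print(layer.dinputs)
print(np.allclose(layer.dinputs, numeric))
```

With an incoming gradient of ones, `dinputs` is simply the scaled mask: 0 for dropped neurons and 1 / (1 - rate) for kept ones.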

## Full Code up to Now

In [40]:
import numpy as np
import nnfs
from nnfs.datasets import spiral_data

nnfs.init()

In [41]:
class Layer_Dense:

    def __init__(self, n_inputs, n_neurons,
                 weight_regularizer_l1=0,
                 weight_regularizer_l2=0,
                 bias_regularizer_l1=0,
                 bias_regularizer_l2=0):
        # Initialize weights and biases
        self.weights = np.random.randn(n_inputs, n_neurons)
        self.biases = np.zeros((1, n_neurons))
        self.weight_regularizer_l1 = weight_regularizer_l1
        self.weight_regularizer_l2 = weight_regularizer_l2
        self.bias_regularizer_l1 = bias_regularizer_l1
        self.bias_regularizer_l2 = bias_regularizer_l2

    def forward(self, inputs):
        self.inputs = inputs
        self.output = np.dot(inputs, self.weights) + self.biases

    def backward(self, dvalues):
        self.dweights = np.dot(self.inputs.T, dvalues)
        self.dbiases = np.sum(dvalues,
                              axis=0,
                              keepdims=True)

        # gradients on regularization
        # L1 on weights
        if self.weight_regularizer_l1 > 0:
            dL1 = np.ones_like(self.weights)
            dL1[self.weights < 0] = -1
            self.dweights += self.weight_regularizer_l1 * dL1

        # L2 on weights
        if self.weight_regularizer_l2 > 0:
            self.dweights += 2 * self.weight_regularizer_l2 * self.weights

        # L1 on biases
        if self.bias_regularizer_l1 > 0:
            dL1 = np.ones_like(self.biases)
            dL1[self.biases < 0] = -1
            self.dbiases += self.bias_regularizer_l1 * dL1

        # L2 on biases
        if self.bias_regularizer_l2 > 0:
            self.dbiases += 2 * self.bias_regularizer_l2 * self.biases

        # Gradients on values
        self.dinputs = np.dot(dvalues, self.weights.T)


In [78]:
class Layer_Dropout:

    def __init__(self, rate):
        # Store rate, we invert
        self.rate = 1 - rate

    def forward(self, inputs):
        self.inputs = inputs
        self.binary_mask = np.random.binomial(1,
                                              self.rate,
                                              size=inputs.shape) /  self.rate
        self.output = inputs * self.binary_mask

    def backward(self, dvalues):
        self.dinputs = dvalues * self.binary_mask

In [79]:
class Activation_ReLU:

    def forward(self, inputs):
        self.inputs = inputs
        self.output = np.maximum(0, inputs)

    def backward(self, dvalues):
        self.dinputs = dvalues.copy()
        self.dinputs[self.inputs <= 0] = 0

In [80]:
class Activation_Softmax:

    def forward(self, inputs):
        self.inputs = inputs

        exp_values = np.exp(inputs - np.max(inputs,
                                            axis=1,
                                            keepdims=True))
        probabilities = exp_values / np.sum(exp_values,
                                            axis=1,
                                            keepdims=True)
        self.output = probabilities

    def backward(self, dvalues):
        self.dinputs = np.empty_like(dvalues)

        for index, (single_output, single_dvalue) in enumerate(zip(self.output, dvalues)):
            # flatten output array
            single_output = single_output.reshape(-1, 1)
            jacobian_matrix = np.diagflat(single_output) - \
                                np.dot(single_output, single_output.T)
            self.dinputs[index] = np.dot(jacobian_matrix,
                                         single_dvalue)

In [81]:
class Optimizer_SGD:

    def __init__(self,
                 learning_rate=1.,
                 decay=0.,
                 momentum=0.):
        self.learning_rate = learning_rate
        self.current_learning_rate = learning_rate
        self.decay = decay
        self.iterations = 0
        self.momentum = momentum

    def pre_update_params(self):
        if self.decay:
            self.current_learning_rate = self.learning_rate * \
                                         (1. / (1. + self.decay * self.iterations))

    def update_params(self, layer):

        if self.momentum:

            if not hasattr(layer, 'weight_momentums'):
                layer.weight_momentums = np.zeros_like(layer.weights)
                layer.bias_momentums = np.zeros_like(layer.biases)


            weight_updates = self.momentum * layer.weight_momentums - \
                            self.current_learning_rate * layer.dweights
            layer.weight_momentums = weight_updates

            bias_updates = self.momentum * layer.bias_momentums - \
                            self.current_learning_rate * layer.dbiases
            layer.bias_momentums = bias_updates

        # Vanilla SGD updates (as before momentum update)
        else:
            weight_updates = -self.current_learning_rate * layer.dweights
            bias_updates = -self.current_learning_rate * layer.dbiases

        # update weights and biases using either
        # vanilla or momentum updates
        layer.weights += weight_updates
        layer.biases += bias_updates

    def post_update_params(self):
        self.iterations += 1


In [82]:
class Optimizer_Adagrad:

    def __init__(self,
                 learning_rate=1.,
                 decay=0.,
                 epsilon=1e-7):
        self.learning_rate = learning_rate
        self.current_learning_rate = learning_rate
        self.decay = decay
        self.iterations = 0
        self.epsilon = epsilon

    def pre_update_params(self):
        if self.decay:
            self.current_learning_rate = self.learning_rate * \
                                         (1. / (1. + self.decay * self.iterations))
    def update_params(self, layer):

        if not hasattr(layer, 'weight_cache'):
            layer.weight_cache = np.zeros_like(layer.weights)
            layer.bias_cache = np.zeros_like(layer.biases)

        # Update caches with squared current gradients
        layer.weight_cache += layer.dweights ** 2
        layer.bias_cache += layer.dbiases ** 2

        # Vanilla SGD parameter update + normalization
        # with square rooted cache
        layer.weights += -self.current_learning_rate * \
                        layer.dweights / \
                         (np.sqrt(layer.weight_cache) + self.epsilon)
        layer.biases += -self.current_learning_rate * \
                        layer.dbiases / \
                        (np.sqrt(layer.bias_cache) + self.epsilon)

    def post_update_params(self):
        self.iterations += 1


In [83]:
class Optimizer_RMSProp:

    def __init__(self,
                 learning_rate=0.001,
                 decay=0.,
                 epsilon=1e-7,
                 rho=0.9):
        self.learning_rate = learning_rate
        self.current_learning_rate = learning_rate
        self.decay = decay
        self.iterations = 0
        self.epsilon = epsilon
        self.rho = rho

    def pre_update_params(self):
        if self.decay:
            self.current_learning_rate = self.learning_rate * \
                                         (1. / (1. + self.decay * self.iterations))

    def update_params(self, layer):

        if not hasattr(layer, 'weight_cache'):
            layer.weight_cache = np.zeros_like(layer.weights)
            layer.bias_cache = np.zeros_like(layer.biases)

        layer.weight_cache = self.rho * layer.weight_cache + \
                             (1 - self.rho) * layer.dweights**2
        layer.bias_cache = self.rho * layer.bias_cache + \
                           (1 - self.rho) * layer.dbiases**2

        layer.weights += -self.current_learning_rate * \
                        layer.dweights / \
                         (np.sqrt(layer.weight_cache) + self.epsilon)
        layer.biases += -self.current_learning_rate * \
                        layer.dbiases / \
                        (np.sqrt(layer.bias_cache) + self.epsilon)

    def post_update_params(self):
        self.iterations += 1


In [84]:
class Optimizer_Adam:

    def __init__(self,
                 learning_rate=0.001,
                 decay=0.,
                 epsilon=1e-7,
                 beta_1=0.9,
                 beta_2=0.999):
        self.learning_rate = learning_rate
        self.current_learning_rate = learning_rate
        self.decay = decay
        self.iterations = 0
        self.epsilon = epsilon
        self.beta_1 = beta_1
        self.beta_2 = beta_2

    def pre_update_params(self):
        if self.decay:
            self.current_learning_rate = self.learning_rate * \
                                         (1. / (1. + self.decay * self.iterations))

    def update_params(self, layer):

        if not hasattr(layer, 'weight_cache'):
            layer.weight_cache = np.zeros_like(layer.weights)
            layer.weight_momentums = np.zeros_like(layer.weights)
            layer.bias_cache = np.zeros_like(layer.biases)
            layer.bias_momentums = np.zeros_like(layer.biases)

        layer.weight_momentums = self.beta_1 * \
                                    layer.weight_momentums + \
                                 (1 - self.beta_1) * layer.dweights
        layer.bias_momentums = self.beta_1 * \
                                layer.bias_momentums + \
                               (1 - self.beta_1) * layer.dbiases

        weight_momentums_corrected = layer.weight_momentums / \
                                     (1 - self.beta_1 ** (self.iterations + 1))
        bias_momentums_corrected = layer.bias_momentums / \
                                   (1 - self.beta_1 ** (self.iterations + 1))

        layer.weight_cache = self.beta_2 * layer.weight_cache + \
                             (1 - self.beta_2) * layer.dweights**2
        layer.bias_cache = self.beta_2 * layer.bias_cache + \
                           (1 - self.beta_2) * layer.dbiases**2

        weight_cache_corrected = layer.weight_cache / \
                                 (1 - self.beta_2 ** (self.iterations + 1))
        bias_cache_corrected = layer.bias_cache / \
                               (1 - self.beta_2 ** (self.iterations + 1))

        layer.weights += -self.current_learning_rate * \
                        weight_momentums_corrected / \
                         (np.sqrt(weight_cache_corrected) + self.epsilon)
        layer.biases += -self.current_learning_rate * \
                        bias_momentums_corrected / \
                        (np.sqrt(bias_cache_corrected) + self.epsilon)

    def post_update_params(self):
        self.iterations += 1

In [85]:
class Loss:

    def regularization_loss(self, layer):

        regularization_loss = 0

        if layer.weight_regularizer_l1 > 0:
            regularization_loss += layer.weight_regularizer_l1 * \
                                    np.sum(np.abs(layer.weights))

        if layer.weight_regularizer_l2 > 0:
            regularization_loss += layer.weight_regularizer_l2 * \
                                    np.sum(layer.weights * layer.weights)

        if layer.bias_regularizer_l1 > 0:
            regularization_loss += layer.bias_regularizer_l1 * \
                                   (np.sum(np.abs(layer.biases)))

        if layer.bias_regularizer_l2 > 0:
            regularization_loss += layer.bias_regularizer_l2 * \
                                    np.sum(layer.biases * layer.biases)

        return regularization_loss

    def calculate(self, output, y):
        sample_losses = self.forward(output, y)

        data_loss = np.mean(sample_losses)

        return data_loss


In [86]:
class Loss_CategoricalCrossentropy(Loss):

    def forward(self, y_pred, y_true):
        samples = len(y_pred)

        y_pred_clipped = np.clip(y_pred, 1e-7, 1 - 1e-7)

        if len(y_true.shape) == 1:
            correct_confidences = y_pred_clipped[
                range(samples),
                y_true
            ]

        elif len(y_true.shape) == 2:
            correct_confidences = np.sum(
                y_pred_clipped * y_true,
                axis=1
            )

        negative_log_likelihoods = -np.log(correct_confidences)
        return negative_log_likelihoods

    def backward(self, dvalues, y_true):

        samples = len(dvalues)

        labels = len(dvalues[0])

        if len(y_true.shape) == 1:
            y_true = np.eye(labels)[y_true]

        self.dinputs = -y_true / dvalues

        self.dinputs = self.dinputs / samples


In [87]:
class Activation_Softmax_Loss_CategoricalCrossentropy:

    def __init__(self):
        self.activation = Activation_Softmax()
        self.loss = Loss_CategoricalCrossentropy()

    def forward(self, inputs, y_true):

        self.activation.forward(inputs)
        self.output = self.activation.output

        return self.loss.calculate(self.output, y_true)

    def backward(self, dvalues, y_true):

        samples = len(dvalues)

        if len(y_true.shape) == 2:
            y_true = np.argmax(y_true, axis=1)

        self.dinputs = dvalues.copy()

        self.dinputs[range(samples), y_true] -= 1

        self.dinputs = self.dinputs / samples

In [107]:
X, y = spiral_data(samples=100, classes=3)

dense1 = Layer_Dense(2, 512,
                     weight_regularizer_l2=1e-4,
                     bias_regularizer_l2=1e-5)
activation1 = Activation_ReLU()

dropout1 = Layer_Dropout(0.1)

dense2 = Layer_Dense(512, 3)
loss_activation = Activation_Softmax_Loss_CategoricalCrossentropy()

optimizer = Optimizer_Adam(learning_rate=0.001,
                           decay=5e-5)

for epoch in range(10001):

    dense1.forward(X)
    activation1.forward(dense1.output)
    dropout1.forward(activation1.output)
    dense2.forward(dropout1.output)
    data_loss = loss_activation.forward(dense2.output, y)

    regularization_loss = \
        loss_activation.loss.regularization_loss(dense1) + \
        loss_activation.loss.regularization_loss(dense2)

    loss = data_loss + regularization_loss

    predictions = np.argmax(loss_activation.output, axis=1)

    if len(y.shape) == 2:
        y = np.argmax(y, axis=1)

    accuracy = np.mean(predictions==y)

    if not epoch % 100:
        print(f'epoch: {epoch}, ' +
              f'acc: {accuracy}, ' +
              f'loss: {loss:.3f}, ' +
              f'data_loss: {data_loss}, ' +
              f'reg_loss: {regularization_loss:.3f}, ' +
              f'lr: {optimizer.current_learning_rate}')

    loss_activation.backward(loss_activation.output, y)
    dense2.backward(loss_activation.dinputs)
    dropout1.backward(dense2.dinputs)
    activation1.backward(dropout1.dinputs)
    dense1.backward(activation1.dinputs)

    optimizer.pre_update_params()
    optimizer.update_params(dense1)
    optimizer.update_params(dense2)
    optimizer.post_update_params()


epoch: 0, acc: 0.35, loss: 5.606, data_loss: 5.497826099395752, reg_loss: 0.109, lr: 0.001
epoch: 100, acc: 0.35, loss: 2.489, data_loss: 2.3822672367095947, reg_loss: 0.107, lr: 0.0009950743818100403
epoch: 200, acc: 0.43666666666666665, loss: 2.126, data_loss: 2.021054983139038, reg_loss: 0.105, lr: 0.000990148027130056
epoch: 300, acc: 0.44, loss: 2.305, data_loss: 2.2016849517822266, reg_loss: 0.103, lr: 0.00098527021035519
epoch: 400, acc: 0.4633333333333333, loss: 2.315, data_loss: 2.213624954223633, reg_loss: 0.101, lr: 0.0009804402176577284
epoch: 500, acc: 0.5033333333333333, loss: 2.177, data_loss: 2.0772554874420166, reg_loss: 0.099, lr: 0.0009756573491389824
epoch: 600, acc: 0.4866666666666667, loss: 1.849, data_loss: 1.751529574394226, reg_loss: 0.097, lr: 0.000970920918491189
epoch: 700, acc: 0.45666666666666667, loss: 1.997, data_loss: 1.9017854928970337, reg_loss: 0.095, lr: 0.0009662302526692111
epoch: 800, acc: 0.5366666666666666, loss: 1.622, data_loss: 1.52914965152

In [108]:
X_test, y_test = spiral_data(samples=100, classes=3)

dense1.forward(X_test)
activation1.forward(dense1.output)
dense2.forward(activation1.output)
loss = loss_activation.forward(dense2.output, y_test)
predictions = np.argmax(loss_activation.output, axis=1)

if len(y_test.shape) == 2:
    y_test = np.argmax(y_test, axis=1)

accuracy = np.mean(predictions==y_test)

print(f'validation, acc: {accuracy:.3f}, loss: {loss:.3f}')



validation, acc: 0.737, loss: 0.597
