# Chapter 14: L1 & L2 Regularization

In [1]:
# Preface: Import necessary packages:
import numpy as np
import matplotlib.pyplot as plt
import math
import nnfs
from nnfs.datasets import spiral_data
from timeit import timeit
from resources.classes import DenseLayer, ReLU, SoftMax, Loss, CategoricalCrossEntropy, SoftMaxCategoricalCrossEntropy, SGD, AdaGrad, RMSProp, Adam

Regularization methods are those that reduce generalization error. The first ones we'll use here are L1 and L2 regularization. Both calculate a penalty that is added to the loss value to penalize the model for large weights and biases. Large weights may indicate that a neuron is attempting to memorize a data element, so it makes sense to curtail them.

## Section 1: The Forward Pass

L1 regularization's penalty is the sum of the absolute values of the weights and biases. This is a linear penalty, as the regularization loss returned by this function is directly proportional to the parameter values. L2 regularization's penalty, on the other hand, is the sum of the squared values of the weights and biases. Because of the square, L2's non-linear approach penalizes larger weights and biases more than smaller ones.

L2 reg. is more commonly used because it doesn't affect small parameter values much while still preventing the model's larger weights from growing disproportionately. In contrast, L1 reg. (due to its linear nature) penalizes small weights more than L2 does, which is why it is used infrequently overall, though it may be combined with L2 reg. Each penalty is scaled by a hyperparameter lambda: the higher the lambda, the larger the penalty applied.
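
To make the linear-versus-squared difference concrete, here is a minimal sketch that compares the two penalties for a few single weights. The lambda value and the weights are hypothetical, chosen only for illustration.

```python
import numpy as np

# Hypothetical values, chosen only for illustration
lam = 0.01
weights = np.array([0.01, 0.1, 1.0, 2.0])

l1_penalty = lam * np.abs(weights)  # grows linearly with |w|
l2_penalty = lam * weights ** 2     # grows with the square of w

for w, p1, p2 in zip(weights, l1_penalty, l2_penalty):
    print(f"w = {w:5.2f}  L1 penalty = {p1:.6f}  L2 penalty = {p2:.6f}")
```

For weights with magnitude below 1, the L1 penalty is the larger of the two; above 1, the L2 penalty takes over.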

The equation of L1 reg. of weights is:
$$
L_{1w} = \lambda \sum_{m} |w_{m}|
$$   
The equation of L1 reg. of biases is:
$$
L_{1b} = \lambda \sum_{n} |b_{n}|
$$

The equation of L2 reg. of weights is:
$$
L_{2w} = \lambda \sum_{m} w^{2}_{m}
$$
The equation of L2 reg. of biases is:
$$
L_{2b} = \lambda \sum_{n} b^{2}_{n}
$$

The overall loss then becomes:
$$
Loss = DataLoss + L_{1w} + L_{1b} + L_{2w} + L_{2b}
$$

We can do this in code with the following:
```
l1w = lambda_l1w * np.sum(np.abs(weights))
l1b = lambda_l1b * np.sum(np.abs(biases))
l2w = lambda_l2w * np.sum(weights ** 2)
l2b = lambda_l2b * np.sum(biases ** 2)
loss = dataLoss + l1w + l1b + l2w + l2b
```

That's pretty straightforward, so we can now implement it in the dense layer class. I'll show the modifications below in an example class, but the full changes go directly into classes.py.

In [None]:
class ExampleDense:
    # MODIFIED: added the weight and bias inputs + storage
    def __init__(self, nInputs, nNeurons, weightl1=0, weightl2=0, biasl1=0, biasl2=0):
        ...
        # Store regularization strength
        self.weightl1 = weightl1
        self.weightl2 = weightl2
        self.biasl1 = biasl1
        self.biasl2 = biasl2

Now, we also need to update our loss class to make sure it accounts for these! We'll add this method to our general Loss class, because all our later loss classes extend it, meaning it'll be accessible there too!

In [None]:
class ExampleLoss:
    # NEW: method for regularization loss
    def regularizationLoss(self, layer):
        # Set it to 0 by default
        regLoss = 0
        
        # L1 reg for weights
        if layer.weightl1 > 0:
            regLoss += layer.weightl1 * np.sum(np.abs(layer.weights))
        
        # L1 reg for biases
        if layer.biasl1 > 0:
            regLoss += layer.biasl1 * np.sum(np.abs(layer.biases))
            
        # L2 reg for weights
        if layer.weightl2 > 0:
            regLoss += layer.weightl2 * np.sum(layer.weights ** 2)
            
        # L2 reg for biases
        if layer.biasl2 > 0:
            regLoss += layer.biasl2 * np.sum(layer.biases ** 2)
            
        return regLoss
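
As a quick sanity check, the same arithmetic can be run on a tiny hand-made layer. This is only a sketch: `regularization_loss` is a standalone copy of the method's logic, and `toy_layer` is a hypothetical stand-in built with `SimpleNamespace`, not the real DenseLayer from classes.py.

```python
import numpy as np
from types import SimpleNamespace

def regularization_loss(layer):
    # Standalone copy of the regularizationLoss arithmetic shown above
    reg_loss = 0.0
    if layer.weightl1 > 0:
        reg_loss += layer.weightl1 * np.sum(np.abs(layer.weights))
    if layer.biasl1 > 0:
        reg_loss += layer.biasl1 * np.sum(np.abs(layer.biases))
    if layer.weightl2 > 0:
        reg_loss += layer.weightl2 * np.sum(layer.weights ** 2)
    if layer.biasl2 > 0:
        reg_loss += layer.biasl2 * np.sum(layer.biases ** 2)
    return reg_loss

# Hypothetical stand-in for a dense layer (values are made up)
toy_layer = SimpleNamespace(
    weights=np.array([[0.5, -1.0], [2.0, 0.0]]),
    biases=np.array([[0.1, -0.2]]),
    weightl1=0.001, weightl2=0.01,
    biasl1=0.0, biasl2=0.01,
)

toy_reg_loss = regularization_loss(toy_layer)
print(toy_reg_loss)
```

By hand: the L1 weight term is 0.001 × 3.5 = 0.0035, the L2 weight term is 0.01 × 5.25 = 0.0525, and the L2 bias term is 0.01 × 0.05 = 0.0005, for a total of 0.0565. The L1 bias term is skipped because its strength is 0.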

Then, we will incorporate this into our model loss calculation as such:
```
# Calculate the loss
dataLoss = loss_function.forward(activation2.output, y)

# Calculate reg. penalty
regLoss = loss_function.regularizationLoss(dense1) + loss_function.regularizationLoss(dense2)

# Total loss calculation
loss = dataLoss + regLoss
```

We must also do the same for the backward pass!

## Section 2: The Backward Pass

The derivative of the L2 reg. function is:
$$
L_{2w} = \lambda \sum_{m} w^{2}_{m} \rightarrow 2\lambda w_{m}
$$

I skipped a bunch of the book's calculus there, but I assume most people won't be needing it.
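
For anyone who would rather verify the result than derive it, here is a minimal sketch that compares the analytic derivative 2λw against a central finite-difference estimate. The lambda value, the weights, and the helper name `l2_penalty` are assumptions made for this example.

```python
import numpy as np

lam = 5e-4                       # hypothetical regularization strength
w = np.array([0.2, 0.8, -0.5])   # hypothetical weights

def l2_penalty(w):
    # L2 penalty: lambda times the sum of squared weights
    return lam * np.sum(w ** 2)

# Analytic derivative with respect to each weight
analytic = 2 * lam * w

# Central finite-difference estimate, one weight at a time
eps = 1e-6
numeric = np.zeros_like(w)
for i in range(w.size):
    step = np.zeros_like(w)
    step[i] = eps
    numeric[i] = (l2_penalty(w + step) - l2_penalty(w - step)) / (2 * eps)

print(analytic)
print(numeric)
```

The two printed arrays should agree to within floating-point error.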

The derivative of the L1 reg. function is a little more complicated, but it is:
$$
L_{1w} = \lambda \sum_{m} |w_{m}| \rightarrow 
\begin{cases}
\lambda & \text{if } w_{m} > 0 \\
-\lambda & \text{if } w_{m} < 0
\end{cases}
$$

We can write this in Python below:

In [None]:
weights = [0.2, 0.8, -0.5]
dl1 = []
for weight in weights:
    if weight >= 0:
        dl1.append(1)
    else:
        dl1.append(-1)
print(dl1)

We can now extend this to work with multiple neurons in a layer.

In [None]:
weights = [[0.2, 0.8, -0.5, 1],
           [0.5, -.91, .26, -0.5],
           [-.26, -.27, .17, .87]]
dl1 = []
for neuron in weights:
    neuron_dl1 = []
    for weight in neuron:
        if weight >= 0:
            neuron_dl1.append(1)
        else:
            neuron_dl1.append(-1)
    dl1.append(neuron_dl1)
print(dl1)

We can simplify this to just a few lines by leveraging the inbuilt NumPy functionality!

In [None]:
weights = np.array([[0.2, 0.8, -0.5, 1],
                    [0.5, -.91, .26, -0.5],
                    [-.26, -.27, .17, .87]])
dl1 = np.ones_like(weights)

dl1[weights < 0] = -1

print(dl1)
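
A couple of related NumPy idioms are worth comparing here. The sketch below checks that `np.where` gives the same result as the mask approach, and shows that `np.sign` is not a drop-in replacement because it returns 0 for a weight of exactly zero. The `zero_case` array is a made-up example.

```python
import numpy as np

weights = np.array([[0.2, 0.8, -0.5, 1.0],
                    [0.5, -0.91, 0.26, -0.5],
                    [-0.26, -0.27, 0.17, 0.87]])

# Mask-based approach from the text
dl1_mask = np.ones_like(weights)
dl1_mask[weights < 0] = -1

# Equivalent one-liner: 1 where the weight is >= 0, otherwise -1
dl1_where = np.where(weights >= 0, 1.0, -1.0)
print(np.array_equal(dl1_mask, dl1_where))

# np.sign treats an exact zero differently: it returns 0 instead of 1
zero_case = np.array([0.0, 0.3, -0.3])
where_result = np.where(zero_case >= 0, 1.0, -1.0)
sign_result = np.sign(zero_case)
print(where_result)
print(sign_result)
```

Which value to use at exactly zero is a convention; the code in this chapter uses 1.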

With that in mind, let's make some more additions to our DenseLayer class! I'll show the changes below in an example class, and the full changes will be made in the classes.py file.

In [None]:
class ExampleDenseLayer:
    # MODIFIED: added l1 and l2 reg functionality
    def backward(self, dvalues):
        ...
        # L1 on weights
        if self.weightl1 > 0:
            dl1 = np.ones_like(self.weights)
            dl1[self.weights < 0] = -1
            self.dweights += self.weightl1 * dl1
        # L1 on biases
        if self.biasl1 > 0:
            dl1 = np.ones_like(self.biases)
            dl1[self.biases < 0] = -1
            self.dbiases += self.biasl1 * dl1
        # L2 on weights
        if self.weightl2 > 0:
            self.dweights += 2 * self.weightl2 * self.weights
        # L2 on biases
        if self.biasl2 > 0:
            self.dbiases += 2 * self.biasl2 * self.biases

        # Gradients on input values; we use weights because it's with respect to the inputs
        self.dinputs = np.dot(dvalues, self.weights.T)
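
To see what the L2 term does to the weights, here is a minimal sketch of a single update step in isolation. It assumes the gradient from the data loss is zero and uses a plain gradient-descent update rather than the Adam optimizer used later. The strength and learning rate are hypothetical.

```python
import numpy as np

weights = np.array([[0.2, 0.8, -0.5, 1.0]])
weightl2 = 0.05       # hypothetical L2 strength
learning_rate = 0.1   # hypothetical learning rate

# Pretend the data loss contributes no gradient (assumption for illustration)
dweights = np.zeros_like(weights)

# Same L2 term that backward() adds to dweights
dweights += 2 * weightl2 * weights

# Plain gradient-descent step
updated = weights - learning_rate * dweights
print(updated)
```

Under these assumptions every weight is scaled by (1 - 2 × 0.1 × 0.05) = 0.99, so the L2 gradient pulls each weight toward zero in proportion to its size.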

Those changes have been reflected in the normal DenseLayer class too! Now, let's run our model using L1 and L2 reg!

In [3]:
# Create some training data using the spiral_data function
X, y = spiral_data(samples=1000, classes=3)

# Create dense layer with 2 input features and 64 output features
dense1 = DenseLayer(2, 64, weightl2=5e-4, biasl2=5e-4)

# Use a relu activation
activation1 = ReLU()

# Create a dense layer for our output with 64 as an input and 3 as an output
dense2 = DenseLayer(64, 3)

# Use a softmax combined with categorical cross-entropy loss for our output
activationLoss = SoftMaxCategoricalCrossEntropy()

# Initialize optimizer as Adam with a decay
optimizer = Adam(lr=0.05, decay=5e-7)

# Create the loop that trains our model in epochs
for epoch in range(10000):
    # Perform the forward pass, as shown previously
    dense1.forward(X)
    activation1.forward(dense1.output)
    dense2.forward(activation1.output)
    # NEW: calculate dataLoss, regLoss, and then add for total loss
    dataLoss = activationLoss.forward(dense2.output, y)
    regLoss = activationLoss.loss.regularizationLoss(dense1) + activationLoss.loss.regularizationLoss(dense2)
    loss = dataLoss + regLoss
    
    # Calculate the accuracy
    predictions = np.argmax(activationLoss.output, axis=1)
    if len(y.shape) == 2:
        y = np.argmax(y, axis=1)
    accuracy = np.mean(predictions==y)
    
    if not epoch % 100:
        print(f"epoch: {epoch}, accuracy: {accuracy: .3f}, loss: {loss: .3f}, dLoss: {dataLoss}, rLoss: {regLoss}, lr: {optimizer.lr_curr}")
        
    # Perform the backward pass
    activationLoss.backward(activationLoss.output, y)
    dense2.backward(activationLoss.dinputs)
    activation1.backward(dense2.dinputs)
    dense1.backward(activation1.dinputs)
    
    # Use the optimizer and update the weights and biases
    optimizer.preUpdateParams()
    optimizer.updateParams(dense1)
    optimizer.updateParams(dense2)
    optimizer.postUpdateParams()

epoch: 0, accuracy:  0.333, loss:  1.099, dLoss: 1.098588681034382, rLoss: 6.878172268839509e-06, lr: 0.05
epoch: 100, accuracy:  0.681, loss:  0.823, dLoss: 0.7799804883537036, rLoss: 0.04303547960378618, lr: 0.04999752512250644
epoch: 200, accuracy:  0.739, loss:  0.698, dLoss: 0.6287478621473813, rLoss: 0.0690696077816451, lr: 0.04999502549496326
epoch: 300, accuracy:  0.772, loss:  0.644, dLoss: 0.5647735026479431, rLoss: 0.07935410056235946, lr: 0.049992526117345455
epoch: 400, accuracy:  0.776, loss:  0.649, dLoss: 0.567071155027531, rLoss: 0.08237097562283491, lr: 0.04999002698961558
epoch: 500, accuracy:  0.800, loss:  0.584, dLoss: 0.5013903111873796, rLoss: 0.08245665009315942, lr: 0.049987528111736124
epoch: 600, accuracy:  0.818, loss:  0.552, dLoss: 0.4699856596353743, rLoss: 0.08212274570662345, lr: 0.049985029483669646
epoch: 700, accuracy:  0.832, loss:  0.535, dLoss: 0.4541652732545524, rLoss: 0.08118981867032718, lr: 0.049982531105378675
epoch: 800, accuracy:  0.844, 

We should also go through a validation run -- which is basically just a forward pass through the model.  

In [4]:
# Model validation
X_test, y_test = spiral_data(samples=100, classes=3)

dense1.forward(X_test)
activation1.forward(dense1.output)
dense2.forward(activation1.output)
loss = activationLoss.forward(dense2.output, y_test)

predictions = np.argmax(activationLoss.output, axis=1)
if len(y_test.shape) == 2:
    y_test = np.argmax(y_test, axis=1)
accuracy = np.mean(predictions==y_test)

print(f"validation: accuracy: {accuracy: .3f}, loss: {loss: .3f}")

validation: accuracy:  0.877, loss:  0.316


That's actually pretty good! Now, what we could do is play with the model or dataset size to see how it impacts the results, but what's next to talk about is dropout!

### Anyways, that's it for this chapter! Thanks for following along with my annotations of *Neural Networks from Scratch* by Kinsley and Kukieła!