***
# <center>***L1 and L2 Regularization***</center>
***

**Regularization methods** are those which **reduce generalization error**. The first forms of regularization that we will address are **L1** and **L2** **regularization**. L1 and L2 regularization are used to calculate a number (called a **penalty**) that is added to the loss value to penalize the model for large weights and biases. Large weights might indicate that a neuron is attempting to memorize a data element; generally, it is believed that it would be better to have many neurons contributing to a model’s output, rather than a select few.

***
## ***Forward Pass***
***

**L1 regularization’s** penalty is the sum of the absolute values of the weights and biases. This is a linear penalty: the regularization loss it returns is directly proportional to the parameter values. **L2 regularization’s** penalty is the sum of the squared weights and biases. This non-linear approach penalizes larger weights and biases more than smaller ones because of the square function used to calculate the result.

**L2 regularization** is commonly used because it does not affect small parameter values substantially, yet it keeps the model from growing weights too large by heavily penalizing relatively big values. L1 regularization, because of its linear nature, penalizes small weights more than L2 regularization does, causing the model to become invariant to small inputs and variant only to the bigger ones. That’s why L1 regularization is rarely used alone; it is usually combined with L2 regularization, if it is used at all.

Regularization functions of this type drive the weights and biases towards 0, which can also help in cases of exploding gradients (model instability, which might cause weights to become very large values). Beyond this, we also want to dictate how much of an impact this regularization penalty carries, which is the role of the coefficient $\lambda$ in the formulas below.
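To make the difference between the two penalties concrete, the short sketch below evaluates both on one small and one large weight. The helper names `l1_penalty` and `l2_penalty`, the value of `lam`, and the sample weights are illustrative assumptions chosen for readability.

```python
def l1_penalty(values, lam):
    """L1 penalty: lambda times the sum of absolute values."""
    return lam * sum(abs(v) for v in values)


def l2_penalty(values, lam):
    """L2 penalty: lambda times the sum of squared values."""
    return lam * sum(v ** 2 for v in values)


lam = 1.0  # hypothetical regularization strength, chosen for readability
small, large = [0.1], [10.0]

# Small weight: L1 charges more than L2 (0.1 versus roughly 0.01).
print("small:", l1_penalty(small, lam), l2_penalty(small, lam))

# Large weight: L2 charges far more than L1 (100.0 versus 10.0).
print("large:", l1_penalty(large, lam), l2_penalty(large, lam))
```

For values below 1 the square is smaller than the absolute value, which is why L2 barely touches small parameters while L1 keeps pushing on them.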

### **L1 Regularization**
 **L1 Weight Regularization:**
$$L_{1w} = \lambda \sum_{m} |w_{m}| $$
*Explanation:*
 - This term is added to the loss function during training.
 - It penalizes the model for having large weights.
 - The absolute value (|w_m|) encourages the weights to become sparse, meaning many weights will become exactly zero.
 - This can help in feature selection, as it effectively removes the influence of irrelevant features from the model.

**L1 Bias Regularization:**
$$L_{1b} = \lambda \sum_{n} |b_{n}|$$
*Explanation:*
 - Similar to L1 weight regularization, this term penalizes the model for having large biases.
 - It encourages the biases to also become sparse, which can further simplify the model.

### **L2 Regularization**

**L2 Weight Regularization:**
$$L_{2w} = \lambda \sum_{m} w_{m}^2$$
*Explanation:*
 - This term is also added to the loss function.
 - It penalizes the model for having large weights, but in a different way than L1.
 - It encourages the weights to be small, but not necessarily zero.
 - L2 regularization is often called weight decay because it tends to shrink the weights during training.

**L2 Bias Regularization:**
$$L_{2b} = \lambda \sum_{n} b_{n}^2$$
*Explanation:*

 - This term is added to the loss function during training.
 - It penalizes the model for having large biases.
 - It encourages the biases to be small, but not necessarily zero.
 - Similar to L2 weight regularization, it tends to shrink the biases during training.

**Overall Loss:**
$$Loss = DataLoss + L_{1w} + L_{1b} + L_{2w} + L_{2b}$$
*Explanation:*

 - This equation represents the total loss that the model tries to minimize during training.
 - `DataLoss`: This is the primary loss function that measures how well the model predicts the target values for the given data. Examples include mean squared error (MSE) or cross-entropy loss.
 - `L_{1w}`, `L_{1b}`, `L_{2w}`, and `L_{2b}`: These are the regularization terms discussed earlier. They are added to the DataLoss to control the complexity of the model and prevent overfitting.

In [1]:

def calculate_loss(weights, biases, data_loss, lambda_l1w, lambda_l1b, lambda_l2w, lambda_l2b):
    """
        Calculates the total loss, including data loss and L1/L2 regularization for weights and biases.
        
        Args:
          weights: A list or array of weights.
          biases: A list or array of biases.
          data_loss: The data loss (e.g., mean squared error, cross-entropy).
          lambda_l1w: Regularization strength for L1 weight penalty.
          lambda_l1b: Regularization strength for L1 bias penalty.
          lambda_l2w: Regularization strength for L2 weight penalty.
          lambda_l2b: Regularization strength for L2 bias penalty.
        
        Returns:
          The total loss.
    """
    
    l1w = lambda_l1w * sum(abs(w) for w in weights)
    l1b = lambda_l1b * sum(abs(b) for b in biases)
    l2w = lambda_l2w * sum(w**2 for w in weights)
    l2b = lambda_l2b * sum(b**2 for b in biases)
    loss = data_loss + l1w + l1b + l2w + l2b
    
    return loss
    

In [3]:

# Assuming you have the following values:
weights = [0.5, -1.2, 0.8]
biases = [0.2, -0.1]
data_loss = 0.3
lambda_l1w = 0.01
lambda_l1b = 0.005
lambda_l2w = 0.001
lambda_l2b = 0.0005

total_loss = calculate_loss(weights, biases, data_loss, lambda_l1w, lambda_l1b, lambda_l2w, lambda_l2b)

print("Total Loss:", total_loss)


Total Loss: 0.328855


The overall loss function is a combination of the data loss and regularization terms. By minimizing this overall loss, the model learns to make accurate predictions while keeping its weights and biases relatively small. This helps to improve the model's generalization ability and prevent it from overfitting to the training data.

***
## ***Backward pass***
***

The derivative of L2 regularization is relatively simple:

$$L_{2w} = \lambda \sum_{m} w_{m}^2$$

$$\frac{\partial L_{2w}}{\partial w_m} = \lambda \frac{\partial}{\partial w_m} \left[ \sum_{m} w_{m}^2 \right] = \lambda \frac{\partial}{\partial w_m} w_{m}^2 = 2 \lambda w_m$$


This might look complicated, but $\lambda$ is a constant, so we can move it outside of the derivative term. We can remove the sum operator since we calculate the partial derivative with respect to the given parameter only: every other term in the sum is constant with respect to $w_m$, so its derivative is 0. So, we only need to calculate the derivative of $w_m^2$, which we know is $2w_m$. From the coding perspective, we will multiply all of the weights by $2\lambda$.
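As a sanity check on this result, the sketch below computes the L2 gradient as `2 * lam * w` and compares it with a finite-difference estimate. The function names (`l2_gradient`, `numeric_gradient`) and the step size `eps` are assumptions made for this illustration.

```python
def l2_penalty(weights, lam):
    """L2 penalty: lam times the sum of squared weights."""
    return lam * sum(w ** 2 for w in weights)


def l2_gradient(weights, lam):
    """Analytic gradient: the derivative of lam * sum(w^2) w.r.t. w_m is 2 * lam * w_m."""
    return [2 * lam * w for w in weights]


def numeric_gradient(f, weights, eps=1e-6):
    """Central finite differences, used only to check the analytic formula."""
    grads = []
    for i in range(len(weights)):
        plus = list(weights)
        minus = list(weights)
        plus[i] += eps
        minus[i] -= eps
        grads.append((f(plus) - f(minus)) / (2 * eps))
    return grads


weights = [0.5, -1.2, 0.8]
lam = 0.001  # same value as lambda_l2w in the forward-pass example

analytic = l2_gradient(weights, lam)
numeric = numeric_gradient(lambda w: l2_penalty(w, lam), weights)

print("analytic:", analytic)
print("numeric: ", numeric)
```

The two lists should agree to within floating-point noise, and each entry has the same sign as its weight, so a gradient descent step moves every weight towards 0.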

L1 regularization’s derivative, on the other hand, requires more explanation. In the case of L1 regularization, we must calculate the derivative of the absolute value piecewise function, which effectively multiplies a value by -1 if it is less than 0; otherwise, it’s multiplied by 1. This is because the absolute value function is linear for positive values, and we know that a linear function’s derivative is:

$$f(x) = x \rightarrow f'(x) = 1$$
For negative values, it negates the sign of the value to make it positive. In other words, it multiplies values by -1: 
$$f(x) = -x \rightarrow f'(x) = -1$$
When we combine that:

$$abs(x) = \begin{cases}
    x & x > 0 \\
    -x & x < 0
\end{cases} \rightarrow abs'(x) = \begin{cases}
    1 & x > 0 \\
    -1 & x < 0
\end{cases}$$

And the complete partial derivative of L1 regularization with respect to given weight:

$$L_{1w} = \lambda \sum_{m} |w_m| \rightarrow \frac{\partial L_{1w}}{\partial w_m} = \lambda \frac{\partial}{\partial w_m} \sum_{m} |w_m| = \lambda \frac{\partial}{\partial w_m} |w_m| = \lambda \begin{cases}
    1 & w_m > 0 \\
    -1 & w_m < 0
\end{cases}$$

As with L2 regularization, **lambda** ($\lambda$) is a constant, and we calculate the partial derivative of this regularization with respect to one specific weight. The derivative of the absolute value equals 1 or -1 depending on the sign of $w_m$, so the full partial derivative is $\lambda$ or $-\lambda$.
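A minimal sketch of this piecewise rule follows. The function name `l1_gradient` is hypothetical, and returning 0 for a weight that is exactly 0 is an assumption of this sketch, since the derivative of the absolute value is undefined at that point.

```python
def l1_gradient(weights, lam):
    """Gradient of lam * sum(|w|): +lam for positive weights, -lam for negative ones."""
    grads = []
    for w in weights:
        if w > 0:
            grads.append(lam)
        elif w < 0:
            grads.append(-lam)
        else:
            # |w| has no derivative at 0; using 0 here is an assumption
            # of this sketch, not something the derivation above fixes.
            grads.append(0.0)
    return grads


weights = [0.5, -1.2, 0.8]
lam = 0.01  # same value as lambda_l1w in the forward-pass example

print(l1_gradient(weights, lam))  # [0.01, -0.01, 0.01]
```

Unlike the L2 gradient, the magnitude here does not depend on the size of the weight, only on its sign.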
     
We are calculating this derivative with respect to weights, and the resulting gradient, which has the same shape as the weights, is what we’ll use to update the weights.
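The sketch below illustrates the shape point on a small 2D weight matrix stored as nested lists. It combines the L1 and L2 regularization gradients and applies one plain gradient descent step. The names `regularization_gradient`, `lam_l1`, `lam_l2`, and `learning_rate` are assumptions for illustration, and in a real training loop this gradient would be added to the data-loss gradient before the update.

```python
def regularization_gradient(weights, lam_l1, lam_l2):
    """Combined L1 + L2 penalty gradient for a 2D weight matrix (list of rows)."""
    def sign(w):
        # 1 for positive, -1 for negative, 0 at exactly 0 (an assumption here).
        return (w > 0) - (w < 0)

    return [[lam_l1 * sign(w) + 2 * lam_l2 * w for w in row] for row in weights]


# Hypothetical layer: 3 inputs feeding 2 neurons.
weights = [[0.5, -1.2],
           [0.8, 0.0],
           [-0.3, 2.0]]

d_reg = regularization_gradient(weights, lam_l1=0.01, lam_l2=0.001)

# The gradient has one entry per weight, so the shapes match.
print(len(weights), len(weights[0]))
print(len(d_reg), len(d_reg[0]))

# One gradient descent step using only the regularization gradient.
learning_rate = 0.1
updated = [
    [w - learning_rate * g for w, g in zip(w_row, g_row)]
    for w_row, g_row in zip(weights, d_reg)
]
print(updated)
```

Every nonzero weight moves a little closer to 0 after the step, which is the shrinking effect the penalties are meant to produce.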