**Regularization in Deep Learning (L1 and L2)**

**1. Introduction to Regularization**

*   **Context:** Regularization is a very important technique in deep learning for improving neural network performance. It builds upon other techniques we've discussed, such as Early Stopping and Normalising Inputs, all aimed at enhancing model quality.
*   **Purpose:** The main goal of regularization is to **reduce overfitting** in neural networks. It helps to manage the complexity of the model, allowing it to generalise better to new, unseen data.
*   **When to Use:** Regularization should be considered a possible solution whenever your model is overfitting.

**2. The Problem: Overfitting in Neural Networks**

*   **Definition:** Overfitting occurs when a machine learning or deep learning model performs exceptionally well on the **training data** but shows poor results on **new, unseen test data**.
*   **Why it Happens:** An overfit model is typically **too complex** for the given data. Due to its complexity, it learns even the minor patterns and noise in the training data "by heart" rather than understanding the underlying concepts. This is akin to a student who memorises a book without understanding it and then performs poorly on new questions.
*   **Visual Example:** An overfit classification model will create a highly complex, "wiggly" decision boundary that tries to perfectly separate every single training data point, even outliers, leading to poor generalisation on new data.
*   **Why ANNs are Prone to Overfitting:** Artificial Neural Networks (ANNs) are inherently complex. They have multiple layers with numerous nodes (neurons), and each node in one layer is typically connected to every node in the next layer. This extensive connectivity and the large number of parameters allow ANNs to capture every tiny pattern, making them very susceptible to overfitting.
    *   **Increasing Complexity with Neurons:** A simple ANN with one neuron might only draw a straight line to classify. As you increase the number of neurons (e.g., 10, 50, 257, 1000), the network gains the capability to draw increasingly complex curves and decision boundaries to perfectly fit the training data, leading to overfitting.

**3. General Solutions to Overfitting**

Apart from regularization, other techniques to combat overfitting include:
*   **Adding More Data:** Providing more diverse data helps the model learn general patterns. Techniques like Data Augmentation (e.g., rotating/cropping images) can artificially increase data.
*   **Reducing Model Complexity:** Simplifying the network architecture (fewer layers or nodes).
*   **Dropout:** Randomly turning off a percentage of neurons during training.
*   **Early Stopping:** Halting training when validation performance starts to degrade.
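
As an illustration of the early-stopping rule in the last bullet, here is a minimal sketch in plain Python. The function name `early_stopping_epoch`, the `patience` parameter, and the list of validation losses are all hypothetical and chosen for this example; real frameworks provide their own callbacks for this.

```python
def early_stopping_epoch(val_losses, patience=3):
    """Return the index of the epoch at which training would stop.

    Training stops once the validation loss has failed to improve on its
    best value for `patience` consecutive epochs. If that never happens,
    the index of the last epoch is returned.
    """
    best_loss = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return len(val_losses) - 1


# Hypothetical validation losses: they improve, then start to degrade.
val_losses = [0.90, 0.70, 0.55, 0.50, 0.52, 0.56, 0.61, 0.70]
stop_epoch = early_stopping_epoch(val_losses, patience=3)
print(stop_epoch)  # 6: three epochs in a row without beating 0.50
```

A larger `patience` tolerates more noise in the validation curve at the cost of training longer past the best epoch.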

**4. What is Regularization? (Core Concept)**

*   **Mechanism:** Regularization works by **adding a "penalty term" to the model's loss function**.
*   **Goal:** This penalty discourages the model from assigning very large weights to features, effectively **pushing the weight values towards zero** or making them very small.
*   **Effect on Complexity:** By reducing the magnitude of weights (or even setting some to zero), regularization simplifies the model, making it less prone to capturing noise and overfitting. This makes the complex model behave more like a simpler one.
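
The mechanism above can be sketched in a few lines of plain Python. This is only an illustration under assumed values: `regularized_loss`, `sum_of_squares`, the two weight lists, and the data loss of `0.30` are hypothetical, and a sum-of-squares penalty is used as one possible choice of penalty term.

```python
def sum_of_squares(weights):
    """One possible penalty: the sum of squared weights."""
    return sum(w * w for w in weights)


def regularized_loss(data_loss, weights, lam, penalty=sum_of_squares):
    """Original loss plus a penalty on the weights, scaled by lam."""
    return data_loss + lam * penalty(weights)


# Two hypothetical weight settings that fit the training data equally well.
small_weights = [0.1, -0.2, 0.05]
large_weights = [3.0, -4.0, 2.0]
data_loss = 0.30
lam = 0.01

loss_small = regularized_loss(data_loss, small_weights, lam)
loss_large = regularized_loss(data_loss, large_weights, lam)
print(loss_small, loss_large)
```

Both settings have the same data loss, but the one with large weights ends up with the higher total loss, so an optimiser minimising the total prefers the small-weight solution.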

**5. Types of Regularization: L1 and L2**

There are two main types of weight-penalty regularization, L1 and L2, with L2 being much more commonly used in deep learning.

**a) L2 Regularization (Weight Decay)**

*   **Penalty Term:** The L2 penalty term is the **sum of the squares of all the weights** in the network, scaled by a hyperparameter `λ` (lambda) and often divided by `2n`, where `n` is the number of training examples (rows). The factor of 2 is a convenience that cancels when the square is differentiated.
    *   **Formula:** `Loss_new = Loss_original + (λ / 2n) * Σ (w_i)^2`.
    *   Here, `Σ (w_i)^2` represents the sum of the squares of all weights (w1^2 + w2^2 + ... + wN^2).
*   **Hyperparameter `λ` (Lambda):**
    *   Controls the **strength of regularization**.
    *   **Higher `λ`:** Stronger regularization, weights are pushed more aggressively towards zero. If `λ` is too high, it can lead to **underfitting**.
    *   **`λ = 0`:** No regularization applied, as the penalty term becomes zero.
*   **Effect on Weights:** L2 regularization forces weights to become **small, close to zero**, but typically **not exactly zero**. This means all features retain some influence, but their impact is dampened, leading to a "simpler" but not "sparse" model.
*   **Mathematical Intuition (Gradient Descent):**
    *   In backpropagation, weights are updated using `w_new = w_old - learning_rate * (∂Loss / ∂w_old)`.
    *   When the L2 penalty is added, the derivative of the new loss function with respect to a weight `w_i` becomes `(∂Loss_original / ∂w_i) + (λ/n) * w_i`.
    *   Substituting this back into the update rule and rearranging, the new weight update effectively looks like: `w_new = w_old * (1 - learning_rate * (λ/n)) - learning_rate * (∂Loss_original / ∂w_old)`.
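    *   Provided `learning_rate * (λ/n)` lies between 0 and 1, which holds for typical small values of both, the factor `(1 - learning_rate * (λ/n))` is a positive number less than 1. In each update the weight `w_old` is therefore **scaled down (decayed)** before the regular gradient update is applied. This continuous "decay" drives the weights towards zero over epochs.
    *   **Weight Decay:** This shrinking of the weights at every step is why L2 regularization is also commonly referred to as **Weight Decay**. The two are exactly equivalent for plain gradient descent, as derived here; with adaptive optimizers such as Adam, an L2 penalty and a direct decay step are no longer identical.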
*   **Bias Terms:** It is important to note that **bias terms are typically NOT regularized**. The penalty is only applied to the weights.
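
The rearrangement in the gradient-descent bullets can be checked numerically. The sketch below uses hypothetical values for the weight, gradient, learning rate, `λ`, and `n`, and the two function names are made up for this example; it only confirms that the "penalty in the gradient" form and the "decay then update" form give the same result for a plain gradient-descent step.

```python
def step_with_penalty_gradient(w, grad, lr, lam, n):
    """Plain gradient step on the regularized loss: gradient plus (lam/n) * w."""
    return w - lr * (grad + (lam / n) * w)


def step_with_decay(w, grad, lr, lam, n):
    """Same step, rearranged: shrink w first, then apply the original gradient."""
    return w * (1 - lr * lam / n) - lr * grad


# Hypothetical values chosen only for illustration.
w_old, grad = 2.0, 0.5
lr, lam, n = 0.1, 0.5, 10

w_a = step_with_penalty_gradient(w_old, grad, lr, lam, n)
w_b = step_with_decay(w_old, grad, lr, lam, n)
print(w_a, w_b)  # both are approximately 1.94
```

Here the decay factor is `1 - 0.1 * 0.5 / 10 = 0.995`, so the weight is shrunk by 0.5% before the usual gradient step is subtracted.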

**b) L1 Regularization**

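*   **Penalty Term:** The L1 penalty term is the **sum of the absolute values of all the weights**, scaled by `λ` and, to match the L2 formula above, by the same `1/2n` normalisation. Some texts use `λ/n` or plain `λ` instead; the constant only rescales `λ`.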
    *   **Formula:** `Loss_new = Loss_original + (λ / 2n) * Σ |w_i|`.
*   **Effect on Weights:** Unlike L2, L1 regularization has a tendency to drive many weights **exactly to zero**.
*   **Sparsity:** This property makes L1 regularization useful for **feature selection**, as it effectively eliminates the influence of certain features by setting their corresponding weights to zero. It creates a **sparse model**.
*   **Commonality:** While useful, L1 regularization is generally **less common** and often provides less favourable results than L2 regularization in deep learning.
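
To make the L1 formula concrete, here is a small sketch that computes the penalty and its (sub)gradient for a hypothetical weight vector. The helper names and the values of `lam` and `n` are assumptions for illustration, and `sign(0)` is taken to be `0`, which is one common convention for the subgradient at zero.

```python
def sign(w):
    """Return +1, -1, or 0 according to the sign of w."""
    return (w > 0) - (w < 0)


def l1_penalty(weights, lam, n):
    """(lam / 2n) * sum of absolute weights, matching the formula above."""
    return (lam / (2 * n)) * sum(abs(w) for w in weights)


def l1_subgradient(weights, lam, n):
    """Per-weight (sub)gradient of the L1 penalty: (lam / 2n) * sign(w)."""
    return [(lam / (2 * n)) * sign(w) for w in weights]


# Hypothetical weights: one large, one tiny, one already zero.
weights = [3.0, -0.01, 0.0]
lam, n = 0.2, 1

penalty = l1_penalty(weights, lam, n)
grads = l1_subgradient(weights, lam, n)
print(penalty)
print(grads)  # [0.1, -0.1, 0.0]
```

The large weight and the tiny weight receive a push of the same magnitude, which is the property that lets L1 drive small weights all the way to zero.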

**6. Comparing L1 and L2 Regularization**

*   **Weight Values:** L2 pushes weights towards zero (but rarely exactly zero); L1 can make many weights exactly zero.
*   **Sparsity:** L1 results in sparse models (fewer active features/connections); L2 creates models with small but non-zero weights.
*   **Deep Learning Preference:** L2 regularization (Weight Decay) is generally preferred in deep learning for its more stable performance and better empirical results.

**7. Practical Implementation in Keras**

Keras makes it straightforward to add L1 and L2 regularization to your neural network layers.

*   **Applying Regularization:** You add `kernel_regularizer` as an argument to your `Dense` (or other) layers when defining your model.
*   **Specifying L1 or L2:**
    *   For **L2 regularization**: Use `tf.keras.regularizers.L2(lambda_value)`.
    *   For **L1 regularization**: Use `tf.keras.regularizers.L1(lambda_value)`.
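    *   `lambda_value` is your chosen regularization strength (e.g., `0.001`, `0.01`). Note that Keras multiplies the sum of squared (or absolute) weights by this value directly, without the `1/2n` factor used in the formulas above, so it is not numerically identical to the `λ` in those formulas.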
*   **Example Code Snippet:**
    ```python
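    from tensorflow.keras import Input, regularizers
    from tensorflow.keras.models import Sequential
    from tensorflow.keras.layers import Dense

    input_dim = 20  # placeholder: set this to the number of input features

    model = Sequential([
        Input(shape=(input_dim,)),
        # L2 penalty on the kernel (weights) of each hidden layer; biases are not regularized
        Dense(128, activation='relu', kernel_regularizer=regularizers.L2(0.001)),
        Dense(128, activation='relu', kernel_regularizer=regularizers.L2(0.001)),
        Dense(1, activation='sigmoid')  # output layer, no regularizer
    ])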
    ```
    In this example, L2 regularization with a `lambda_value` of `0.001` is applied to the kernels (weights) of both hidden layers; the output layer is left unregularized.

**8. Practical Demonstration and Effects**

The source provides a clear demonstration of L2 regularization on a classification problem:

*   **Without Regularization:**
    *   The model produced a **highly "wiggly" and complex decision boundary** that perfectly fit the training data, indicating severe overfitting.
    *   The **loss curve** showed a significant divergence, with training loss continuing to decrease while validation loss either plateaued or increased, confirming overfitting.
*   **With L2 Regularization (e.g., `λ=0.001`):**
    *   The same model, with L2 regularization applied, produced a **much cleaner and smoother decision boundary**. This boundary was less sensitive to individual training points and better at generalising.
    *   The **loss curve** showed that the validation loss now closely followed the training loss, with **no significant gap**, indicating successful reduction of overfitting.
*   **Weight Distribution Analysis:**
    *   **Without regularization:** The weights were spread across a wider range (e.g., from -2 to 1.75), with some large values. The probability density function (PDF) plot showed a broad distribution.
    *   **With L2 regularization:** The weights were **much more concentrated around zero** (e.g., from -0.5 to 0.3), confirming that the regularization effectively pushed them to smaller magnitudes. The PDF plot visually confirmed this tight clustering around zero.

**9. Challenges/Considerations**

*   **Hyperparameter Tuning:** The `λ` (lambda) value is a hyperparameter that often requires careful tuning to find the right balance between bias and variance. A `λ` that is too high leads to underfitting; one that is too low leaves the model overfitting.
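
A tuning loop of this kind can be sketched on a toy problem. The example below is a stand-in, not a neural network: it assumes a one-weight linear model `y ≈ w * x` with loss `(1/2n) * Σ (y - w*x)^2 + (λ/2n) * w^2`, for which the minimiser has the closed form `w = Σ x*y / (Σ x^2 + λ)`. The data, the grid of `λ` values, and the function names are all hypothetical.

```python
def fit_ridge_1d(xs, ys, lam):
    """Closed-form minimiser of the L2-regularized squared loss for y ~ w * x."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)


def mse(xs, ys, w):
    """Mean squared error of the one-weight model on (xs, ys)."""
    return sum((y - w * x) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Hypothetical data: noisy training points, clean validation points (y = 2x).
train_x, train_y = [1.0, 2.0, 3.0, 4.0], [2.6, 4.5, 6.8, 8.9]
val_x, val_y = [5.0, 6.0], [10.0, 12.0]

lambdas = [0.0, 0.1, 1.0, 10.0, 100.0]
results = []
for lam in lambdas:
    w = fit_ridge_1d(train_x, train_y, lam)
    results.append((lam, w, mse(val_x, val_y, w)))

# Pick the lambda with the lowest validation error.
best_lam, best_w, best_val = min(results, key=lambda r: r[2])
for lam, w, val in results:
    print(f"lambda={lam:<6} w={w:.4f} val_mse={val:.4f}")
print("best lambda:", best_lam)
```

On this toy data the fitted weight shrinks steadily as `λ` grows, and the validation error is lowest at an intermediate value: too little regularization keeps the noise-inflated weight, too much shrinks it well below the true slope.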

**10. Conclusion**

*   Regularization is an **essential technique** for managing model complexity and preventing overfitting in deep neural networks.
*   L2 regularization (Weight Decay) is particularly effective and widely used in deep learning, providing significant improvements in generalisation performance.
*   By adding a penalty to the loss function that discourages large weights, regularization helps the network learn more robust and generalisable patterns.

**11. Further Resources**

*   For a deeper, mathematical understanding of regularization in machine learning and deep learning, including detailed explanations of L1 and L2 from scratch, the source recommends a dedicated playlist on the "CampusX" channel.

---

# 🔍 Why L1 Can Make Coefficients Zero but L2 Cannot

---

## 1. L2 Regularization (Ridge)

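**Update rule** (with learning rate `η`, for the unnormalised penalty `λ * Σ w_j²`; the `λ/2n` scaling used earlier changes only the constant in front of `w_j`):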
```
w_j ← w_j - η * ( dL/dw_j + 2λ * w_j )
```

- The penalty term `2λ * w_j` is proportional to the weight.  
- As `w_j → 0`, the penalty term also goes to 0.  
- The “pull” toward zero becomes weaker near zero.  
- **Result:** weights shrink but rarely become exactly zero.
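
The weakening pull can be seen by iterating the update with the loss gradient switched off. This is a sketch under assumptions: `dL/dw_j` is taken to be `0` so that only the penalty acts, and the values of `lr` (standing in for `η`), `lam`, and the starting weight are hypothetical.

```python
def l2_pull(w, lr, lam):
    """Size of the step contributed by the L2 penalty alone: lr * 2 * lam * w."""
    return lr * 2 * lam * w


lr, lam = 0.1, 0.5  # hypothetical values; each step multiplies w by 1 - 2*lr*lam = 0.9
w = 1.0
pulls = []
for _ in range(50):
    pull = l2_pull(w, lr, lam)
    pulls.append(pull)
    w = w - pull  # dL/dw_j assumed to be 0 to isolate the penalty

print(w)                    # small, but still strictly positive
print(pulls[0], pulls[-1])  # the last pull is far smaller than the first
```

Because each step removes a fixed fraction of the current weight, the weight decays geometrically and approaches zero without landing on it.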

---

## 2. L1 Regularization (Lasso)

**Update rule:**
```
w_j ← w_j - η * ( dL/dw_j + λ * sign(w_j) )
```

- The penalty term `λ * sign(w_j)` is a constant push:  
  - `+λ` if `w_j > 0`  
  - `-λ` if `w_j < 0`  
- Unlike L2, it does not weaken as `w_j` gets smaller.  
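- If `|w_j|` is smaller than the step `η * λ`, this constant push carries the weight to zero or past it in a single update. A plain subgradient step may then hop back and forth across zero, so implementations that want exact zeros often clip the weight to `0` when it would cross (a soft-thresholding style update).  
- Once at zero, `sign(0) = 0`, so the penalty no longer moves the weight; it stays there unless the loss gradient `dL/dw_j` is large enough to pull it away.  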
- **Result:** L1 produces sparse solutions (feature selection).
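
The constant push can be simulated in the same way. The sketch below again assumes `dL/dw_j = 0` and hypothetical values for `lr` (standing in for `η`), `lam`, and the starting weight. It compares the plain update from the rule above with a clipped variant, `l1_step_clipped`, which is an assumption added here for illustration: it stops at zero whenever a step would cross it.

```python
def sign(w):
    """Return +1, -1, or 0 according to the sign of w."""
    return (w > 0) - (w < 0)


def l1_step_plain(w, lr, lam):
    """Plain subgradient step on the L1 penalty alone."""
    return w - lr * lam * sign(w)


def l1_step_clipped(w, lr, lam):
    """Same step, but stop at zero instead of overshooting past it."""
    new_w = w - lr * lam * sign(w)
    if new_w * w < 0:  # the step crossed zero
        return 0.0
    return new_w


lr, lam = 0.1, 0.5  # hypothetical values; every step has size lr * lam = 0.05
w_plain = w_clipped = 0.12
for _ in range(10):
    w_plain = l1_step_plain(w_plain, lr, lam)
    w_clipped = l1_step_clipped(w_clipped, lr, lam)

print(w_plain)    # still non-zero: the plain update hops across zero
print(w_clipped)  # 0.0: reached on the third step and kept afterwards
```

The step size never shrinks, so the weight reaches the neighbourhood of zero in a handful of updates; the clipping is what turns "near zero" into "exactly zero" in this sketch.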

---

## 3. Key Intuition

- **L2 (Ridge)** → "Shrinkage": weights get small but not eliminated.  
- **L1 (Lasso)** → "Soft-thresholding": weights can be forced to exactly zero.  

---

## ⚖️ Conclusion

Even though both updates involve subtraction, the form of the penalty makes the difference:

- **L2 penalty → scales with weight, vanishes near zero.**  
- **L1 penalty → constant push, can zero out weights.**
