<a href="https://colab.research.google.com/github/karankulshrestha/ai-notebooks/blob/main/regularization.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Regularization
Regularization refers to techniques that reduce overfitting by adding a penalty or constraint to the model, so that it generalizes better to unseen data, even if the training error increases.

# L2 Regularization (Ridge)
L2 regularization helps prevent overfitting by keeping the model's weights small. Large weights make the model too sensitive to the training data, which causes overfitting. To control this, we add a penalty term (the regularization strength times the sum of the squares of all weights) to the loss function, for example to the Mean Squared Error (MSE).

During backpropagation, this penalty discourages large weights, gradually shrinking them and resulting in a smoother, more generalized model.

It is a common default choice and works well in most cases.

In [6]:
import numpy as np

# Sample data
X = np.array([[1], [2], [3], [4]])
y = np.array([2, 4, 6, 8])

# Initialize the weight
w = 0.5
lambda_reg = 0.1  # Regularization strength

# Predicted values, flattened to shape (4,) so they line up with y
# (subtracting a (4, 1) array from a (4,) array would broadcast to (4, 4))
y_pred = (X * w).ravel()

# Mean Squared Error (MSE)
mse_loss = np.mean((y - y_pred) ** 2)

# Total loss = MSE + L2 penalty
l2_penalty = lambda_reg * (w ** 2)
l2_loss = mse_loss + l2_penalty

print(f"We add a penalty of {l2_penalty} to the loss, giving {l2_loss}")

print("\nThe penalty grows with the square of the weight, so its gradient (2 * lambda * w) pulls larger weights toward zero more strongly.")

We add a penalty of 0.025 to the loss, giving 16.9

The penalty grows with the square of the weight, so its gradient (2 * lambda * w) pulls larger weights toward zero more strongly.
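
To make the shrinking effect during training concrete, here is a minimal sketch of gradient descent on the same toy data, with and without the L2 penalty. The helper name `fit`, the learning rate, the number of steps, and the penalty strength of 1.0 are assumptions chosen for illustration; they are not part of the original notebook.

```python
import numpy as np

X = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 4.0, 6.0, 8.0])

def fit(lambda_reg, lr=0.01, steps=500, w=0.5):
    """Gradient descent on MSE + lambda_reg * w**2 (hypothetical helper)."""
    for _ in range(steps):
        grad_mse = np.mean(2 * (w * X - y) * X)   # d(MSE)/dw
        grad_penalty = 2 * lambda_reg * w         # d(lambda * w^2)/dw
        w -= lr * (grad_mse + grad_penalty)
    return w

w_plain = fit(lambda_reg=0.0)   # no regularization
w_ridge = fit(lambda_reg=1.0)   # with L2 penalty

print("no penalty:", round(w_plain, 4))
print("L2 penalty:", round(w_ridge, 4))
```

For this data the minimizer works out to 30 / (15 + 2 * lambda), so the unregularized weight settles at 2 while the penalized weight settles at a smaller value (30/17, roughly 1.76). The penalized model fits the training data slightly worse in exchange for a smaller weight.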


# L1 Regularization (Lasso)
L1 regularization adds a penalty for large weights using their absolute values (|w|).
It makes the model simpler by forcing some weights to become zero,
which helps prevent overfitting and select only the most important features.

If your normal loss is **Mean Squared Error (MSE):**

$$
\text{Loss} = \text{MSE}
$$

Then with **L1 regularization**, it becomes:

$$
\text{Loss} = \text{MSE} + \lambda \sum |w|
$$

where  
- \( \lambda \) = regularization strength  
- \( w \) = model weights

In [7]:
import numpy as np

# Sample data
X = np.array([[1], [2], [3], [4]])
y = np.array([2, 4, 6, 8])

# Initialize a weight
w = 0.5
lambda_reg = 0.1  # Regularization strength

# Predicted values, flattened to shape (4,) so they line up with y
# (subtracting a (4, 1) array from a (4,) array would broadcast to (4, 4))
y_pred = (X * w).ravel()

# Mean Squared Error (MSE)
mse_loss = np.mean((y - y_pred) ** 2)

# L1 regularization term (sum of absolute weights; a single weight here)
l1_penalty = lambda_reg * np.abs(w)

# Total loss = MSE + L1 penalty
l1_loss = mse_loss + l1_penalty

print("MSE Loss:", mse_loss)
print("L1 Penalty:", l1_penalty)
print("Total Loss with L1 Regularization:", l1_loss)

MSE Loss: 16.875
L1 Penalty: 0.05
Total Loss with L1 Regularization: 16.925
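
The claim that L1 forces some weights to exactly zero can be illustrated with a small sketch. It isolates the effect of the penalty alone (the data term is ignored) and uses a soft-thresholding update for L1, which is one standard way to handle the non-differentiable point at zero. This is a swapped-in technique for illustration, not something the notebook itself uses. The function names, starting weights, and step sizes are all assumptions.

```python
import numpy as np

def l1_step(w, lr, lambda_reg):
    """Soft-thresholding update for the penalty lambda * |w|."""
    return np.sign(w) * np.maximum(np.abs(w) - lr * lambda_reg, 0.0)

def l2_step(w, lr, lambda_reg):
    """Plain gradient update for the penalty lambda * w**2."""
    return w - lr * 2 * lambda_reg * w

w = np.array([2.0, -1.0, 0.3, 0.04])   # hypothetical starting weights
w_l1, w_l2 = w.copy(), w.copy()

for _ in range(10):
    w_l1 = l1_step(w_l1, lr=0.1, lambda_reg=0.5)
    w_l2 = l2_step(w_l2, lr=0.1, lambda_reg=0.5)

print("after L1 steps:", w_l1)
print("after L2 steps:", w_l2)
```

L1 subtracts the same fixed amount from every weight's magnitude on each step, so the two small weights reach exactly zero and stay there. L2 shrinks each weight by a constant factor, so the weights get smaller but never become exactly zero.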


# Dataset Augmentation
Dataset augmentation means taking your existing data and creating new versions of it by flipping, rotating, cropping, adding noise, or otherwise slightly changing it,
so that your model learns more robust patterns and doesn't overfit.

It's mostly used in deep learning, especially for images, audio, and text.

In [8]:
from torchvision import transforms

# Define a set of random augmentations
transform = transforms.Compose([
    transforms.RandomHorizontalFlip(),
    transforms.RandomRotation(20),
    transforms.ColorJitter(brightness=0.2, contrast=0.2),
    transforms.RandomResizedCrop(224)
])

# Apply this to your training dataset (this line also needs `import torchvision`)
# train_dataset = torchvision.datasets.ImageFolder("data/train", transform=transform)
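
For intuition about what such transforms do to the pixel values, here is a minimal NumPy-only sketch of two of the augmentations mentioned above (horizontal flip and added noise). The helper name `augment`, the noise level, and the random toy image are assumptions for illustration; this is not how torchvision implements its transforms.

```python
import numpy as np

def augment(image, rng, noise_std=0.05):
    """Return a randomly flipped, slightly noisy copy of an H x W x C image in [0, 1]."""
    out = image.copy()
    if rng.random() < 0.5:
        out = out[:, ::-1, :]              # horizontal flip: reverse the width axis
    out = out + rng.normal(0.0, noise_std, size=out.shape)
    return np.clip(out, 0.0, 1.0)          # keep pixel values valid

rng = np.random.default_rng(0)
image = rng.random((32, 32, 3))            # stand-in for a real training image

# Each call produces a different variant of the same underlying image
batch = np.stack([augment(image, rng) for _ in range(4)])
print(batch.shape)
```

Because the augmentation is random, the model sees a slightly different version of each image every epoch, which makes it harder to memorize individual training examples.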

# Dropout
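Dropout randomly sets a fraction of a layer's outputs to zero on each training step, so the network cannot rely too heavily on any single neuron. At inference time, all neurons are kept.

Hidden layer outputs: [0.3, 0.5, 0.8, 0.2]

If dropout = 0.5 (50% dropout rate), each output is dropped with probability 0.5.

One possible result after dropout: [0.3, 0, 0.8, 0]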
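
The example above can be sketched in NumPy as follows. Unlike the simplified illustration, this version also scales the surviving values by 1 / (1 - p) ("inverted dropout"), which is what PyTorch's `nn.Dropout` does so that the expected output stays the same between training and inference. The function name `dropout` and the seed are assumptions for illustration.

```python
import numpy as np

def dropout(x, p, rng, training=True):
    """Inverted dropout: zero each value with probability p, scale the rest by 1 / (1 - p)."""
    if not training or p == 0.0:
        return x                            # inference: keep everything unchanged
    mask = rng.random(x.shape) >= p         # True means the unit is kept
    return x * mask / (1.0 - p)

rng = np.random.default_rng(0)
hidden = np.array([0.3, 0.5, 0.8, 0.2])

print("train:", dropout(hidden, p=0.5, rng=rng))
print("eval: ", dropout(hidden, p=0.5, rng=rng, training=False))
```

With p = 0.5, each value in the training output is either 0 or twice its original value, and which ones are dropped changes on every call.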


# Batch Normalization - The Optimization Plus
**What It Does:** It normalizes the values flowing through the network so that they don't get too big or too small after each layer (typically applied to the pre-activation values).

**Primary Purpose:** Stabilize and speed up training. Without normalization, the values inside the network can explode (become very large) or vanish (become very small) after many layers.

**Bonus Effect:** Acts as regularization (sometimes makes dropout unnecessary)

**Modern Reality:** Very popular in deep learning

### 🧮 Batch Normalization Formula
$$
\mathrm{BN}(x) = \gamma \left( \frac{x - \mu}{\sqrt{\sigma^{2} + \epsilon}} \right) + \beta
$$
**Where:**
- $x$ = neuron activation  
- $\mu$ = mean of activations in the batch  
- $\sigma^{2}$ = variance of activations in the batch  
- $\epsilon$ = small number (to avoid division by zero)  
- $\gamma, \beta$ = learned scale and shift parameters
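
The formula can be written out directly in NumPy. This is a minimal sketch of the training-time computation only, assuming a 2D input of shape (batch, features); real layers such as `nn.BatchNorm1d` also track running statistics for use at inference. The function name `batch_norm` and the toy data are assumptions for illustration.

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """Apply the BN formula per feature (column) using batch statistics."""
    mu = x.mean(axis=0)                     # mean of each feature over the batch
    var = x.var(axis=0)                     # variance of each feature over the batch
    x_hat = (x - mu) / np.sqrt(var + eps)   # normalize
    return gamma * x_hat + beta             # learned scale and shift

rng = np.random.default_rng(0)
x = rng.normal(loc=50.0, scale=10.0, size=(8, 3))   # 8 samples, 3 features

out = batch_norm(x, gamma=np.ones(3), beta=np.zeros(3))
print("mean close to 0:", np.allclose(out.mean(axis=0), 0.0, atol=1e-6))
print("std close to 1: ", np.allclose(out.std(axis=0), 1.0, atol=1e-3))
```

With $\gamma = 1$ and $\beta = 0$, each feature comes out with roughly zero mean and unit standard deviation, regardless of the scale of the input.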

In [10]:
# Batch Normalization
import torch
import torch.nn as nn

# Example for a fully connected network
model = nn.Sequential(
    nn.Linear(10, 64),
    nn.BatchNorm1d(64),  # Batch Normalization for 1D input
    nn.ReLU(),
    nn.Linear(64, 1)
)

# Adversarial Training
Adversarial Training means we train a model not just on normal data,
but also on tricky, slightly modified data that tries to fool the model.

In [11]:
import torch

def fgsm_attack(image, epsilon, data_grad):
    # Add small noise in the direction that increases the loss
    sign_data_grad = data_grad.sign()
    perturbed_image = image + epsilon * sign_data_grad
    perturbed_image = torch.clamp(perturbed_image, 0, 1)  # keep pixel values valid
    return perturbed_image
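
To show the same idea end to end without a deep learning framework, here is a minimal NumPy sketch of the FGSM step on a tiny logistic-regression model, where the gradient of the loss with respect to the input can be written by hand. The model weights, the input, the label, and `epsilon` are all hypothetical values chosen for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bce_loss(x, y, w, b):
    """Binary cross-entropy of a logistic model on a single example."""
    p = sigmoid(w @ x + b)
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

# Hypothetical fixed model and one input whose true label is 1
w = np.array([1.5, -2.0, 0.5])
b = 0.1
x = np.array([0.6, 0.2, 0.7])
y = 1.0

# Gradient of the loss with respect to the input: (p - y) * w
data_grad = (sigmoid(w @ x + b) - y) * w

# FGSM step: move each feature by epsilon in the direction that increases the loss
epsilon = 0.1
x_adv = np.clip(x + epsilon * np.sign(data_grad), 0.0, 1.0)

print("clean loss:      ", bce_loss(x, y, w, b))
print("adversarial loss:", bce_loss(x_adv, y, w, b))
```

The perturbed input differs from the original by at most `epsilon` per feature, yet the loss goes up. In adversarial training, such perturbed examples are added to the training batches alongside the clean ones.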