📝 **Author:** Amirhossein Heydari - 📧 **Email:** <amirhosseinheydari78@gmail.com> - 📍 **Origin:** [mr-pylin/pytorch-workshop](https://github.com/mr-pylin/pytorch-workshop)

---


**Table of contents**<a id='toc0_'></a>    
- [Dependencies](#toc1_)    
- [Parameters](#toc2_)    
- [Hyperparameters](#toc3_)    
  - [A List of Hyperparameters](#toc3_1_)    
  - [Train-Validation-Test Ratio](#toc3_2_)    
  - [Data Augmentation](#toc3_3_)    
  - [Batch Size](#toc3_4_)    
  - [Weight Initialization](#toc3_5_)    
  - [Number of Layers & Neurons](#toc3_6_)    
  - [Normalizations](#toc3_7_)    
  - [Activation Functions](#toc3_8_)    
  - [Loss Function](#toc3_9_)    
  - [Optimizer](#toc3_10_)    
  - [Learning Rate](#toc3_11_)    
  - [Momentum](#toc3_12_)    
  - [Number of Epochs](#toc3_13_)    
  - [Learning Rate Decay](#toc3_14_)    
  - [Dropout Rate](#toc3_15_)    
  - [Regularization](#toc3_16_)    
  - [Gradient Clipping](#toc3_17_)    
  - [Early Stopping](#toc3_18_)    

<!-- vscode-jupyter-toc-config
	numbering=false
	anchor=true
	flat=false
	minLevel=1
	maxLevel=6
	/vscode-jupyter-toc-config -->
<!-- THIS CELL WILL BE REPLACED ON TOC UPDATE. DO NOT WRITE YOUR TEXT IN THIS CELL -->

# <a id='toc1_'></a>[Dependencies](#toc0_)


In [None]:
import torch
from torch import nn
from torch.optim import SGD, Adam
from torch.optim.lr_scheduler import StepLR
from torch.utils.data import DataLoader, TensorDataset, random_split
from torchinfo import summary
from torchvision.datasets import FakeData
from torchvision.models import AlexNet
from torchvision.transforms import v2 as transforms

# <a id='toc2_'></a>[Parameters](#toc0_)

- Parameters are the core elements that define the model's behavior and functionality.
- These parameters are **learned** from the **training data** and are crucial for making accurate predictions.
- The primary parameters in neural networks are **weights** and **biases**.
  - **Weights** determine the strength of the connection between neurons.
  - **Biases** allow the model to shift the activation function to better fit the data.
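
As a rough illustration of how weights and biases act together, the sketch below checks that a single `nn.Linear` layer computes `x @ W.T + b`. The layer sizes and the random input are arbitrary choices made for this example.

```python
import torch
from torch import nn

torch.manual_seed(0)

# a single fully connected layer: 10 inputs -> 50 outputs (sizes are arbitrary)
layer = nn.Linear(in_features=10, out_features=50)

# learnable parameters: a weight matrix and a bias vector
print(layer.weight.shape)  # torch.Size([50, 10])
print(layer.bias.shape)  # torch.Size([50])

# the layer output equals the affine map x @ W^T + b
x = torch.randn(4, 10)
manual = x @ layer.weight.T + layer.bias
matches = torch.allclose(layer(x), manual, atol=1e-5)
print(matches)
```

Both tensors are `nn.Parameter` objects, which is why an optimizer can update them during training.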


In [None]:
# initialize AlexNet with random weights and biases
model = AlexNet(num_classes=1000)
model

In [None]:
summary(model=model, input_size=(1, 3, 227, 227), device="cpu")

# <a id='toc3_'></a>[Hyperparameters](#toc0_)

- Hyperparameters in deep learning models are settings that you configure **before training** your model.
- Hyperparameters are **not learned** from the **data** but are crucial for controlling the **training process** and **model architecture**.

## <a id='toc3_1_'></a>[A List of Hyperparameters](#toc0_)

<table style="margin: 0 auto;">
  <tbody>
    <tr>
      <td>Train-Validation-Test Ratio</td>
      <td>Data Augmentation</td>
      <td>Normalizations</td>
      <td>Weight Initialization</td>
      <td>Number of Layers</td>
      <td>Number of Neurons</td>
    </tr>
    <tr>
      <td>Activation Functions</td>
      <td>Loss Function</td>
      <td>Optimizer</td>
      <td>Learning Rate</td>
      <td>Learning Rate Decay</td>
      <td>Momentum</td>
    </tr>
    <tr>
      <td>Batch Size</td>
      <td>Number of Epochs</td>
      <td>Dropout Rate</td>
      <td>Regularization</td>
      <td>Gradient Clipping</td>
      <td>Early Stopping</td>
    </tr>
  </tbody>
</table>


## <a id='toc3_2_'></a>[Train-Validation-Test Ratio](#toc0_)

- The Train-Validation-Test Ratio is the proportion in which the dataset is split into three subsets:
  - **Training Set**: Used to train the model.
  - **Validation Set**: Used to tune hyperparameters and evaluate the model during training.
  - **Test Set**: Used to evaluate the final model performance.

**✍️ Key Points**

- A larger **training** set can help the model learn better, **but** it should not be so large that the **validation** and **test** sets are too small to provide reliable evaluations.
- A properly sized **validation** set helps in **tuning hyperparameters** and **preventing overfitting**.
- A sufficiently large **test** set ensures that the final evaluation of the model is **reliable** and **unbiased**.
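
A minimal sketch of a reproducible 70/15/15 split, assuming a toy dataset of 1000 samples and an arbitrary seed of 42. Converting the ratios to integer lengths by hand and passing a seeded generator are illustrative choices, not the only way to do this.

```python
import torch
from torch.utils.data import TensorDataset, random_split

# a toy dataset of 1000 samples (the size is an arbitrary choice)
dataset = TensorDataset(torch.arange(1000))

# turn the 70/15/15 ratios into integer lengths that sum to len(dataset)
n_total = len(dataset)
n_train = round(0.70 * n_total)
n_val = round(0.15 * n_total)
n_test = n_total - n_train - n_val

# a seeded generator makes the split reproducible across runs
generator = torch.Generator().manual_seed(42)
train_set, val_set, test_set = random_split(dataset, [n_train, n_val, n_test], generator=generator)

# the three subsets are disjoint and together cover the whole dataset
all_indices = set(train_set.indices) | set(val_set.indices) | set(test_set.indices)
print(len(train_set), len(val_set), len(test_set), len(all_indices))
```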


In [100]:
# create an artificial dataset
dataset = FakeData(size=5000, image_size=(3, 32, 32), num_classes=3, transform=None)

# define the train-validation-test split ratios
train_ratio = 0.7
val_ratio = 0.15
test_ratio = 0.15

# split the dataset
train_dataset, val_dataset, test_dataset = random_split(dataset, [train_ratio, val_ratio, test_ratio])

In [None]:
# extract labels
targets = torch.tensor([l for _, l in dataset])

# calculate distribution of each set
train_distribution = dict(zip(*[c.tolist() for c in torch.unique(targets[train_dataset.indices], return_counts=True)]))
val_distribution = dict(zip(*[c.tolist() for c in torch.unique(targets[val_dataset.indices], return_counts=True)]))
test_distribution = dict(zip(*[c.tolist() for c in torch.unique(targets[test_dataset.indices], return_counts=True)]))

# log
print("train_dataset:")
print(f"\t -> len(train_dataset) : {len(train_dataset)}")
print(f"\t -> distribution       : {train_distribution}\n")
print("val_dataset:")
print(f"\t -> len(val_dataset)   : {len(val_dataset)}")
print(f"\t -> distribution       : {val_distribution}\n")
print("test_dataset:")
print(f"\t -> len(test_dataset)  : {len(test_dataset)}")
print(f"\t -> distribution       : {test_distribution}")

## <a id='toc3_3_'></a>[Data Augmentation](#toc0_)

- Data augmentation is a technique used to **artificially increase** the size of a **training** dataset by creating **modified** versions of the data.
- This helps improve the model's ability to **generalize** by providing more **varied** training examples.
- Common data augmentation techniques include **rotations**, **translations**, **flips**, and **color adjustments**.

**✍️ Key Points**

- **Improves Generalization**: By exposing the model to a wider **variety** of data, it can learn more **robust** features and perform better on **unseen data**.
- **Reduces Overfitting**: Augmented data helps prevent the model from **memorizing the training data**, thus reducing **overfitting**.
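
To make the idea concrete, here is a small sketch of what a horizontal flip does to the pixel grid. It uses plain `torch.flip` on a made-up 2x3 "image" instead of a torchvision transform, and the label value is hypothetical.

```python
import torch

# a tiny 1-channel "image" of shape (C, H, W) = (1, 2, 3)
image = torch.arange(6).reshape(1, 2, 3)
label = 1  # hypothetical class label

# flipping along the last (width) axis mirrors every row left-to-right
flipped = torch.flip(image, dims=[-1])

print(image[0].tolist())  # [[0, 1, 2], [3, 4, 5]]
print(flipped[0].tolist())  # [[2, 1, 0], [5, 4, 3]]
```

The flipped sample keeps the same label, so the model sees a new input for a class it already knows.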


In [None]:
# define transformations including data augmentation
transform = transforms.Compose(
    [
        transforms.RandomHorizontalFlip(),  # randomly flips the image horizontally with a 50% chance
        transforms.RandomRotation(degrees=10),  # randomly rotates the image by up to 10 degrees
        transforms.ToImage(),  # convert the PIL image to a tensor image (tv_tensors.Image)
        transforms.ToDtype(dtype=torch.float32, scale=True),  # convert the image to float32 and scale it
        transforms.Normalize(mean=(0.5,), std=(0.5,)),  # normalize the image
    ]
)

# load the dataset with the defined transformations
dataset = FakeData(size=5000, image_size=(3, 32, 32), num_classes=3, transform=transform)

# log
print(dataset)  # the dataset's repr includes the transform pipeline

## <a id='toc3_4_'></a>[Batch Size](#toc0_)

- Batch size is a hyperparameter that defines the **number of training examples** used in one **iteration**.
- It determines how many samples are processed before the model's internal parameters are **updated**.

**✍️ Key Points**

- **Small Batch Size**: Can lead to more **noisy** updates but can help the model **generalize** better. It also requires **less memory**.
- **Large Batch Size**: Can speed up training by making more efficient use of hardware, but its less noisy updates may **generalize** worse, and it requires **more memory**.
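
The link between batch size and the number of updates per epoch is simple arithmetic. The sketch below assumes the last, smaller batch is kept (the `DataLoader` default of `drop_last=False`); the values 1000 and 128 mirror the toy setup used in this section.

```python
import math

num_samples = 1000  # dataset size
batch_size = 128

# number of parameter updates (iterations) in one pass over the data
iterations_per_epoch = math.ceil(num_samples / batch_size)

# the final batch holds whatever is left over
last_batch_size = num_samples - (iterations_per_epoch - 1) * batch_size

print(iterations_per_epoch, last_batch_size)  # 8 104
```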


In [None]:
# create an artificial dataset
data = torch.randn(1000, 10)  # 1000 samples, each with 10 features
labels = torch.randint(0, 2, (1000,))  # binary labels (0 or 1) for each sample

# combine data and labels into a TensorDataset
dataset = TensorDataset(data, labels)

# define the batch size and build the DataLoader
batch_size = 128
dataloader = DataLoader(dataset, batch_size=batch_size, shuffle=True)

for batch_idx, (inputs, targets) in enumerate(dataloader):
    outputs = inputs.sum(dim=1)
    print(outputs.shape)

## <a id='toc3_5_'></a>[Weight Initialization](#toc0_)

- Weight initialization is the process of setting the initial values of the **weights** in a neural network **before** training begins.

**✍️ Key Points**

- **Faster Convergence**: Proper initialization can lead to faster convergence by providing a **good starting point** for the optimization process.
- **Stability**: Helps in stabilizing the training process by **preventing large updates** to the weights.
- **Avoiding Vanishing/Exploding Gradients**: Proper initialization can prevent the gradients from becoming too small (**vanishing**) or too large (**exploding**) during **backpropagation**.

**📈 Common Initialization Techniques for Weights**

1. **Zero Initialization**
    - All weights are initialized to zero.
    - Not recommended for deep networks as it can lead to symmetry problems where all neurons in a layer learn the same features.

1. **Random Initialization**
    - Weights are initialized randomly, typically from a uniform or normal distribution.
    - Provides a diverse set of starting points but can still lead to issues with vanishing or exploding gradients.

1. **Xavier (Glorot) Initialization**
    - Weights are initialized from one of the following (the second argument of $\mathcal{N}$ is the **variance**):
      - $W \sim \mathcal{N}\left(0, {gain}^2\times\frac{2}{n_{\text{in}} + n_{\text{out}}}\right)$
      - $W \sim \mathcal{U}\left(-{gain}\times\sqrt{\frac{6}{n_{\text{in}} + n_{\text{out}}}}, {gain}\times\sqrt{\frac{6}{n_{\text{in}} + n_{\text{out}}}}\right)$
    - Helps in maintaining the variance of activations and gradients throughout the network.

1. **He (Kaiming) Initialization**
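    - Weights are initialized from one of the following (the second argument of $\mathcal{N}$ is the **variance**; ${gain} = \sqrt{2}$ for ReLU):
      - $W \sim \mathcal{N}\left(0, \frac{{gain}^2}{n_{\text{in}}}\right)$, which is $\mathcal{N}\left(0, \frac{2}{n_{\text{in}}}\right)$ for ReLU
      - $W \sim \mathcal{U}\left(-{gain}\times\sqrt{\frac{3}{n_{\text{in}}}}, {gain}\times\sqrt{\frac{3}{n_{\text{in}}}}\right)$, which gives bounds of $\pm\sqrt{\frac{6}{n_{\text{in}}}}$ for ReLU
    - Particularly useful for networks with ReLU activations as it helps in maintaining the variance of activations.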
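
The formulas can be sanity-checked numerically. This sketch assumes an arbitrary layer with 128 inputs and 256 outputs, uses `gain = 1` for Xavier, and uses the ReLU setting for He; it compares the sampled weights against the expected bound and standard deviation.

```python
import math

import torch
from torch import nn

torch.manual_seed(0)

# weight of a layer with n_in = 128 inputs and n_out = 256 outputs (arbitrary sizes)
n_in, n_out = 128, 256
w = torch.empty(n_out, n_in)

# Xavier uniform with gain = 1: every value lies within +/- sqrt(6 / (n_in + n_out))
nn.init.xavier_uniform_(w)
xavier_bound = math.sqrt(6 / (n_in + n_out))
xavier_max = w.abs().max().item()
print(f"xavier bound: {xavier_bound:.4f}, largest |w|: {xavier_max:.4f}")

# He normal for ReLU: the standard deviation is sqrt(2 / n_in)
nn.init.kaiming_normal_(w, mode="fan_in", nonlinearity="relu")
he_std = math.sqrt(2 / n_in)
empirical_std = w.std().item()
print(f"he std: {he_std:.4f}, empirical std: {empirical_std:.4f}")
```

With these sizes both the Xavier bound and the He standard deviation work out to 0.125, which makes the comparison easy to read.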

**📉 Common Initialization Techniques for Biases**

1. **Zero Initialization**
    - All biases are initialized to zero.
    - Generally works well and is commonly used because it does not introduce any initial bias in the learning process.

1. **Constant Initialization**
    - All biases are initialized to a constant value, often a small positive value like 0.01.
    - Can help in ensuring that all neurons in a layer start with a small positive bias, which can be beneficial in some cases.

📝 **Papers**:

- [**Understanding the difficulty of training deep feedforward neural networks**](https://proceedings.mlr.press/v9/glorot10a/glorot10a.pdf) by [*Xavier Glorot*](https://scholar.google.com/citations?user=_WnkXlkAAAAJ&hl=en&oi=sra) and [*Yoshua Bengio*](https://scholar.google.com/citations?user=kukA0LcAAAAJ&hl=en&oi=sra) in 2010.
- [**Delving deep into rectifiers: Surpassing human-level performance on imagenet classification**](https://openaccess.thecvf.com/content_iccv_2015/html/He_Delving_Deep_into_ICCV_2015_paper.html) by [*Kaiming He*](https://scholar.google.com/citations?user=DhtAFkwAAAAJ&hl=en&oi=sra) et al. in 2015.

📝 **Docs**:

- `torch.nn.init.xavier_uniform_`: [pytorch.org/docs/stable/nn.init.html#torch.nn.init.xavier_uniform_](https://pytorch.org/docs/stable/nn.init.html#torch.nn.init.xavier_uniform_)
- `torch.nn.init.xavier_normal_`: [pytorch.org/docs/stable/nn.init.html#torch.nn.init.xavier_normal_](https://pytorch.org/docs/stable/nn.init.html#torch.nn.init.xavier_normal_)
- `torch.nn.init.kaiming_uniform_`: [pytorch.org/docs/stable/nn.init.html#torch.nn.init.kaiming_uniform_](https://pytorch.org/docs/stable/nn.init.html#torch.nn.init.kaiming_uniform_)
- `torch.nn.init.kaiming_normal_`: [pytorch.org/docs/stable/nn.init.html#torch.nn.init.kaiming_normal_](https://pytorch.org/docs/stable/nn.init.html#torch.nn.init.kaiming_normal_)


In [None]:
class SimpleNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(10, 50)
        self.fc2 = nn.Linear(50, 2)
        self._initialize_weights()

    def _initialize_weights(self) -> None:
        # initializing weights based on the xavier formula (normal distribution version)
        nn.init.xavier_normal_(self.fc1.weight)
        nn.init.xavier_normal_(self.fc2.weight)

        # initializing biases with zero
        nn.init.zeros_(self.fc1.bias)
        nn.init.zeros_(self.fc2.bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.fc1(x)
        x = torch.relu(x)
        x = self.fc2(x)
        return x


# other initializations
# nn.init.uniform_
# nn.init.zeros_
# nn.init.normal_
# nn.init.xavier_uniform_
# nn.init.kaiming_normal_
# nn.init.kaiming_uniform_

## <a id='toc3_6_'></a>[Number of Layers & Neurons](#toc0_)

- The number of layers and neurons in a neural network defines its **architecture**.
- **Number of Layers**: Refers to the **depth** of the network.
- **Number of Neurons**: Refers to the **width** of each layer.

**✍️ Key Points**

- **Model Capacity**: Increasing the number of layers and neurons increases the model's capacity to learn **complex patterns**.
- **Overfitting**: Too many layers and neurons can lead to **overfitting**, where the model performs well on training data but poorly on **unseen data**.
- **Computational Cost**: More layers and neurons increase the **computational cost** and **training time**.
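
To see how quickly capacity and cost grow, the sketch below counts the weights and biases of two fully connected networks from their layer widths alone. The helper `count_mlp_parameters` is a hypothetical name introduced for this example; the width lists are illustrative.

```python
def count_mlp_parameters(layer_sizes: list[int]) -> int:
    """Count weights and biases of a fully connected network with the given layer widths."""
    total = 0
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        total += n_in * n_out + n_out  # weight matrix + bias vector
    return total


shallow = count_mlp_parameters([10, 50, 2])
deep = count_mlp_parameters([10, 100, 200, 100, 50, 2])
print(shallow, deep)  # 652 46552
```

Going from one hidden layer to four raises the parameter count by roughly a factor of 70 in this case.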


In [None]:
# define a complex neural network with more layers and neurons
class ComplexNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(10, 100)
        self.fc2 = nn.Linear(100, 200)
        self.fc3 = nn.Linear(200, 100)
        self.fc4 = nn.Linear(100, 50)
        self.fc5 = nn.Linear(50, 2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = torch.relu(self.fc1(x))
        x = torch.relu(self.fc2(x))
        x = torch.relu(self.fc3(x))
        x = torch.relu(self.fc4(x))
        x = self.fc5(x)
        return x

## <a id='toc3_7_'></a>[Normalizations](#toc0_)

- Normalization techniques in neural networks are used to **standardize** the inputs to a layer, improving the **training speed** and **stability**.
- Common normalization techniques include **Batch Normalization**, **Layer Normalization**, **Instance Normalization**, and **Group Normalization**.

**✍️ Key Points**

- **Faster Convergence**: Normalization helps in **faster convergence** by reducing **internal covariate shift**.
- **Improved Performance**: Ensures that the model learns more effectively by providing standardized inputs.
- **Stability**: Helps in stabilizing the training process by preventing large gradient updates.
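
As a sketch of what "standardizing the inputs to a layer" means for Batch Normalization, the code below reproduces the output of `nn.BatchNorm1d` in training mode by hand. The batch shape and the scaling of the random data are arbitrary, and the check relies on the layer's scale and shift still being at their initial values of 1 and 0.

```python
import torch
from torch import nn

torch.manual_seed(0)

# a batch of 32 samples with 4 features, deliberately not standardized
x = torch.randn(32, 4) * 5 + 3
bn = nn.BatchNorm1d(num_features=4)
bn.train()  # in training mode, statistics come from the current batch

# manual version: subtract the batch mean, divide by the (biased) batch std
mean = x.mean(dim=0)
var = x.var(dim=0, unbiased=False)
manual = (x - mean) / torch.sqrt(var + bn.eps)

bn_output = bn(x)
matches = torch.allclose(bn_output, manual, atol=1e-4)
print(matches)
print(bn_output.mean(dim=0))  # close to 0 for every feature
```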


In [None]:
# define a simple neural network with Batch Normalization
class SimpleNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(10, 50)
        self.bn1 = nn.BatchNorm1d(50)
        self.fc2 = nn.Linear(50, 2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.fc1(x)
        x = self.bn1(x)
        x = torch.relu(x)
        x = self.fc2(x)
        return x

## <a id='toc3_8_'></a>[Activation Functions](#toc0_)

- Activation functions introduce **non-linearity** into the neural network, enabling it to learn and represent **complex patterns** in the data.
- They determine the output of a neuron given an input or set of inputs.
- Choosing the right activation function is crucial for the performance of the neural network.
- **More info**: [activation-functions.ipynb](./activation-functions.ipynb)
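
Why the non-linearity matters can be shown in a few lines: without an activation, two stacked linear layers are equivalent to a single linear map. This is a minimal sketch with arbitrary layer sizes and biases disabled to keep the algebra short.

```python
import torch
from torch import nn

torch.manual_seed(0)

# two linear layers without biases and without an activation in between
fc1 = nn.Linear(4, 8, bias=False)
fc2 = nn.Linear(8, 3, bias=False)
x = torch.randn(5, 4)

# without a non-linearity the stack equals one linear map with weight W2 @ W1
stacked = fc2(fc1(x))
collapsed = x @ (fc2.weight @ fc1.weight).T
is_linear = torch.allclose(stacked, collapsed, atol=1e-5)

# inserting ReLU breaks this equivalence
with_relu = fc2(torch.relu(fc1(x)))
differs = not torch.allclose(with_relu, collapsed, atol=1e-5)
print(is_linear, differs)
```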


In [None]:
# define a custom neural network that uses several activation functions
class CustomNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(10, 50)
        self.fc2 = nn.Linear(50, 30)
        self.fc3 = nn.Linear(30, 20)
        self.fc4 = nn.Linear(20, 10)
        self.fc5 = nn.Linear(10, 5)
        self.fc6 = nn.Linear(5, 3)

        # define activation functions
        self.sigmoid = nn.Sigmoid()
        self.tanh = nn.Tanh()
        self.relu = nn.ReLU()
        self.leaky_relu = nn.LeakyReLU(0.01)
        self.softmax = nn.Softmax(dim=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.sigmoid(self.fc1(x))
        x = self.tanh(self.fc2(x))
        x = self.relu(self.fc3(x))
        x = self.leaky_relu(self.fc4(x))
        x = self.relu(self.fc5(x))
        x = self.softmax(self.fc6(x))
        return x

## <a id='toc3_9_'></a>[Loss Function](#toc0_)

- Loss functions (aka **cost/objective** functions) measure how well a neural network's predictions match the actual target values.
- They guide the optimization process by providing a measure of the model's performance.
- The choice of loss function depends on the type of problem being solved (e.g., **regression**, **binary classification**, **multi-class classification**).
- **More info**: [loss-functions.ipynb](./loss-functions.ipynb)


In [None]:
# regression example

# artificial true and predicted values
y_true = torch.tensor([2.5, 0.0, 2.1, 7.8])
y_pred = torch.tensor([3.0, -0.5, 2.0, 8.0])

# define the loss function
criterion = nn.MSELoss()

# compute the loss
loss = criterion(y_pred, y_true)
print(f"MSE: {loss.item()}")

In [None]:
# binary classification example

# artificial true and predicted values
y_true = torch.tensor([1, 0, 1, 0], dtype=torch.float32)
y_pred = torch.tensor([0.9, 0.1, 0.8, 0.4], dtype=torch.float32)

# define the loss function
criterion = nn.BCELoss()

# compute the loss
loss = criterion(y_pred, y_true)
print(f"BCE: {loss.item()}")

In [None]:
# multi-class classification example

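# artificial class indices and predicted logits (raw, unnormalized scores)
# note: nn.CrossEntropyLoss applies log-softmax internally, so it expects logits rather than probabilities
y_true = torch.tensor([2, 0, 1])
y_pred = torch.tensor([[0.1, 0.2, 0.7], [0.8, 0.1, 0.1], [0.2, 0.6, 0.2]])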

# define the loss function
criterion = nn.CrossEntropyLoss()

# compute the loss
loss = criterion(y_pred, y_true)
print(f"CE: {loss.item()}")
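
For reference, the MSE and BCE values from the regression and binary classification cells above can be reproduced by hand from their definitions. This is a plain-Python sketch that reuses the same artificial numbers; the names `b_true` and `b_pred` are introduced here only to avoid overwriting the regression values.

```python
import math

# same values as the regression example
y_true = [2.5, 0.0, 2.1, 7.8]
y_pred = [3.0, -0.5, 2.0, 8.0]
mse = sum((p - t) ** 2 for p, t in zip(y_pred, y_true)) / len(y_true)

# same values as the binary classification example
b_true = [1, 0, 1, 0]
b_pred = [0.9, 0.1, 0.8, 0.4]
bce = -sum(t * math.log(p) + (1 - t) * math.log(1 - p) for p, t in zip(b_pred, b_true)) / len(b_true)

print(f"MSE: {mse:.4f}")  # MSE: 0.1375
print(f"BCE: {bce:.4f}")  # BCE: 0.2362
```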

## <a id='toc3_10_'></a>[Optimizer](#toc0_)

- Optimizers are algorithms used to **update the weights** of a neural network to **minimize** the loss function.
- They determine how the model's parameters are adjusted based on the gradients computed during **backpropagation**.
- Different optimizers impact the **convergence speed** and **final performance** of the model.

**Common Optimizers**:

- Stochastic Gradient Descent (SGD)
   $$ \theta = \theta - \eta \nabla J(\theta) $$

- Adam (Adaptive Moment Estimation)
   $$ m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t $$
   $$ v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2 $$
   $$ \hat{m}_t = \frac{m_t}{1 - \beta_1^t} $$
   $$ \hat{v}_t = \frac{v_t}{1 - \beta_2^t} $$
   $$ \theta = \theta - \eta \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon} $$

- RMSProp (Root Mean Square Propagation)
   $$ E[g^2]_t = \gamma E[g^2]_{t-1} + (1 - \gamma) g_t^2 $$
   $$ \theta = \theta - \frac{\eta}{\sqrt{E[g^2]_t + \epsilon}} g_t $$
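
The update rules above can be traced by hand for a single step. The sketch assumes a toy loss $J(\theta) = \theta^2$, a starting point $\theta = 1$, a learning rate of 0.1, and commonly used default values for $\beta_1$, $\beta_2$, $\gamma$ and $\epsilon$; all running averages start at zero.

```python
import math


def grad(theta: float) -> float:
    """Gradient of the toy loss J(theta) = theta**2."""
    return 2.0 * theta


theta0, lr, eps = 1.0, 0.1, 1e-8
g = grad(theta0)  # 2.0

# SGD: theta <- theta - lr * g
theta_sgd = theta0 - lr * g

# Adam, first step (t = 1) starting from m_0 = v_0 = 0
beta1, beta2, t = 0.9, 0.999, 1
m = (1 - beta1) * g
v = (1 - beta2) * g**2
m_hat = m / (1 - beta1**t)
v_hat = v / (1 - beta2**t)
theta_adam = theta0 - lr * m_hat / (math.sqrt(v_hat) + eps)

# RMSProp, first step starting from E[g^2]_0 = 0
gamma = 0.9
avg_sq = (1 - gamma) * g**2
theta_rmsprop = theta0 - lr / math.sqrt(avg_sq + eps) * g

print(f"{theta_sgd:.4f} {theta_adam:.4f} {theta_rmsprop:.4f}")  # 0.8000 0.9000 0.6838
```

Note how the bias correction makes Adam's first step almost exactly one learning rate in size, regardless of the gradient's magnitude.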


In [None]:
class SimpleNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(10, 50)
        self.fc2 = nn.Linear(50, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = torch.relu(self.fc1(x))
        x = self.fc2(x)
        return x


# create a model instance
model = SimpleNN()

# define the optimizer
optimizer = Adam(model.parameters(), lr=0.001)

# example input and target
input = torch.randn(5, 10)
target = torch.randn(5, 1)

# forward pass
output = model(input)

# compute the loss
criterion = nn.MSELoss()
loss = criterion(output, target)

# backward pass and optimization
optimizer.zero_grad()
loss.backward()
optimizer.step()

## <a id='toc3_11_'></a>[Learning Rate](#toc0_)

- The learning rate is a hyperparameter that controls the step size at each iteration while moving toward a minimum of the loss function.
- It determines how much to change the model's parameters in response to the estimated error each time the model weights are updated.

**✍️ Key Points**

- **High Learning Rate**: Can cause the model to **converge** too quickly to a **suboptimal solution** or even **diverge**.
- **Low Learning Rate**: Can make the training process very **slow** and may get stuck in **local minima**.
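
These effects are easy to reproduce on a one-dimensional toy problem. The sketch below runs plain gradient descent on $f(x) = x^2$ with three learning rates chosen only for illustration; `gradient_descent` is a hypothetical helper, not a PyTorch function.

```python
def gradient_descent(lr: float, steps: int = 50, x0: float = 1.0) -> float:
    """Minimize f(x) = x**2 (gradient 2x) with plain gradient descent and return the final x."""
    x = x0
    for _ in range(steps):
        x = x - lr * 2 * x
    return x


x_low = gradient_descent(lr=0.001)  # tiny steps: still far from the minimum at 0
x_good = gradient_descent(lr=0.1)  # reasonable steps: very close to 0
x_high = gradient_descent(lr=1.1)  # steps overshoot and grow: divergence

print(f"low: {x_low:.4f}, good: {x_good:.6f}, high: {x_high:.1f}")
```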


In [None]:
model = AlexNet()
optimizer_1 = SGD(params=model.parameters(), lr=0.1)
optimizer_2 = Adam(params=model.parameters(), lr=0.001)

# log
print(optimizer_1)
print(optimizer_2)

## <a id='toc3_12_'></a>[Momentum](#toc0_)

- Momentum is a technique used to **accelerate** the convergence of the optimization process by **adding a fraction of the previous update to the current update**.
- It helps in **smoothing** the optimization path and can **prevent** the model from getting stuck in **local minima**.

**✍️ Key Points**

- **Faster Convergence**: Helps in accelerating the convergence by smoothing the optimization path.
- **Stability**: Reduces oscillations and helps in stabilizing the training process.
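
A small numerical sketch of the speed-up, assuming the update form `velocity = momentum * velocity + gradient` followed by `x = x - lr * velocity`, applied to the toy loss $f(x) = x^2$. The learning rate, step count, and the helper name `minimize` are arbitrary choices for this example.

```python
def minimize(momentum: float, lr: float = 0.01, steps: int = 100, x0: float = 1.0) -> float:
    """Run SGD with momentum on f(x) = x**2 and return the final x."""
    x, velocity = x0, 0.0
    for _ in range(steps):
        g = 2 * x  # gradient of x**2
        velocity = momentum * velocity + g  # keep a fraction of the previous update
        x = x - lr * velocity
    return x


x_plain = minimize(momentum=0.0)
x_momentum = minimize(momentum=0.9)
print(f"no momentum: {abs(x_plain):.4f}, momentum 0.9: {abs(x_momentum):.4f}")
```

After the same number of steps, the run with momentum ends much closer to the minimum at 0.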


In [None]:
model = AlexNet()
optimizer = SGD(params=model.parameters(), lr=0.1, momentum=0.5)

# log
print(optimizer)

## <a id='toc3_13_'></a>[Number of Epochs](#toc0_)

- The number of epochs defines how many times the learning algorithm will work through the **entire** training dataset.

**✍️ Key Points**

- More epochs can lead to **better learning**, but too many can cause **overfitting**.
- The right number of epochs helps the model **generalize well** to new data.


In [None]:
class SimpleNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(10, 128)
        self.fc2 = nn.Linear(128, 2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = torch.relu(self.fc1(x))
        x = self.fc2(x)
        return x


# generate artificial data
data = torch.randn(1000, 10)  # 1000 samples, 10 features each
labels = torch.randint(0, 2, (1000,))  # Binary labels (0 or 1)

# create DataLoader
dataset = TensorDataset(data, labels)
trainloader = DataLoader(dataset, batch_size=64, shuffle=True)

# initialize network, loss function, and optimizer
model = SimpleNN()
criterion = nn.CrossEntropyLoss()
optimizer = SGD(model.parameters(), lr=0.01)

epochs = 5

# training loop
for epoch in range(epochs):
    running_loss = 0.0
    for inputs, targets in trainloader:
        optimizer.zero_grad()
        outputs = model(inputs)
        loss = criterion(outputs, targets)
        loss.backward()
        optimizer.step()
        running_loss += loss.item()
    print(f"epoch {epoch+1:0{len(str(epochs))}}/{epochs}, loss: {running_loss/len(trainloader):.5f}")

## <a id='toc3_14_'></a>[Learning Rate Decay](#toc0_)

- Learning rate decay is a technique used to reduce the learning rate over time during training.
- This helps the model converge more precisely by taking smaller steps as it approaches the minimum of the loss function.
- Learning rate decay can be implemented in various ways, such as step decay, exponential decay, and adaptive learning rates.

**✍️ Key Points**

- **Improved Convergence**: Helps the model converge more precisely by reducing the learning rate over time.
- **Stability**: Reduces the risk of overshooting the minimum of the loss function by taking smaller steps as training progresses.
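
Step decay, the simplest of these schedules, can be written as a closed-form expression. The sketch below assumes an initial learning rate of 0.1 that is multiplied by 0.1 every 3 epochs; `step_decay` is a hypothetical helper used only to show the arithmetic.

```python
def step_decay(initial_lr: float, epoch: int, step_size: int, gamma: float) -> float:
    """Learning rate used during `epoch` (0-based) under step decay."""
    return initial_lr * gamma ** (epoch // step_size)


schedule = [step_decay(initial_lr=0.1, epoch=e, step_size=3, gamma=0.1) for e in range(10)]
for epoch, lr in enumerate(schedule):
    print(f"epoch {epoch}: lr = {lr:.4f}")
```

Epochs 0 to 2 use 0.1, epochs 3 to 5 use 0.01, epochs 6 to 8 use 0.001, and epoch 9 uses 0.0001.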


In [None]:
class SimpleNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(10, 50)
        self.fc2 = nn.Linear(50, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = torch.relu(self.fc1(x))
        x = self.fc2(x)
        return x


# create a model instance
model = SimpleNN()
criterion = nn.MSELoss()
optimizer = SGD(model.parameters(), lr=0.1)

# define a learning rate scheduler
scheduler = StepLR(optimizer, step_size=3, gamma=0.1)

# example input and target
input = torch.randn(5, 10)
target = torch.randn(5, 1)

epochs = 10

# mimic the training loop with learning rate decay
for epoch in range(epochs):
    # forward pass
    output = model(input)

    # compute the loss
    loss = criterion(output, target)

    # backward pass and optimization
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

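    # read the learning rate used in this epoch before the scheduler updates it
    current_lr = scheduler.get_last_lr()[0]

    # step the learning rate scheduler
    scheduler.step()

    print(f"epoch {epoch+1:0{len(str(epochs))}}/{epochs}, loss: {loss.item():.5f}, learning rate: {current_lr:.5f}")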

## <a id='toc3_15_'></a>[Dropout Rate](#toc0_)

- Dropout sets a **fraction of input units** (the dropout rate) to **zero** at each forward pass during training, which forces the network to learn more **robust features**.
- It helps in regularizing the model and preventing **overfitting** by ensuring that the network does not rely too heavily on **any individual neuron**.
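
The difference between training and evaluation mode is worth seeing once. This sketch applies `nn.Dropout` with an assumed rate of 0.5 to a vector of ones; the vector length and the seed are arbitrary.

```python
import torch
from torch import nn

torch.manual_seed(0)

x = torch.ones(1000)
dropout = nn.Dropout(p=0.5)

# training mode: each element is zeroed with probability p, survivors are scaled by 1 / (1 - p)
dropout.train()
train_out = dropout(x)
train_values = set(train_out.unique().tolist())
zero_fraction = (train_out == 0).float().mean().item()

# evaluation mode: dropout is a no-op
dropout.eval()
eval_out = dropout(x)

print(train_values, round(zero_fraction, 2))
print(torch.equal(eval_out, x))
```

The rescaling of the surviving units keeps the expected activation the same in both modes, which is why `model.eval()` must be called before validation or inference.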


In [None]:
class SimpleNN(nn.Module):
    def __init__(self, dropout_rate=0.5):
        super().__init__()
        self.fc1 = nn.Linear(10, 128)
        self.dropout = nn.Dropout(dropout_rate)
        self.fc2 = nn.Linear(128, 2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = torch.relu(self.fc1(x))
        x = self.dropout(x)
        x = self.fc2(x)
        return x

## <a id='toc3_16_'></a>[Regularization](#toc0_)

- Regularization is a technique used to prevent **overfitting** by adding **a penalty to the loss function**.
- This penalty **discourages** the model from fitting **too closely** to the training data, which helps improve its **generalization** to new data.

✍️ **Common Regularizations**

- **L1 (Lasso) Regularization**:
  - Adds the sum of the **absolute** values of the **weights** as a penalty term to the **loss function**.
  - $ L1 = \lambda \sum_{i=1}^{n} |w_i| $
- **L2 (Ridge) Regularization**:
  - Adds the sum of the **squared** values of the **weights** as a penalty term to the **loss function**.
  - $ L2 = \lambda \sum_{i=1}^{n} w_i^2 $
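
A worked example of both penalties, using three hypothetical weight values and an assumed $\lambda = 0.01$:

```python
weights = [0.5, -1.0, 2.0]  # hypothetical weight values
lam = 0.01  # regularization strength (lambda)

l1_penalty = lam * sum(abs(w) for w in weights)  # 0.01 * 3.5
l2_penalty = lam * sum(w**2 for w in weights)  # 0.01 * 5.25

print(f"L1: {l1_penalty:.4f}, L2: {l2_penalty:.4f}")  # L1: 0.0350, L2: 0.0525
```

The largest weight contributes 4.0 of the 5.25 in the L2 sum but only 2.0 of the 3.5 in the L1 sum, which shows how L2 penalizes large weights more heavily.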


In [117]:
# training function + l1 regularization
def train_model(epochs: int, l1_lambda: float = 0.01):
    # initialize loss function, and optimizer
    criterion = nn.CrossEntropyLoss()
    optimizer = SGD(model.parameters(), lr=0.01)

    for epoch in range(epochs):
        for inputs, targets in trainloader:
            optimizer.zero_grad()
            outputs = model(inputs)

            # add L1 regularization to the loss
            loss = criterion(outputs, targets)
            l1_norm = sum(p.abs().sum() for p in model.parameters())
            loss = loss + l1_lambda * l1_norm

            loss.backward()
            optimizer.step()

In [None]:
# training function + l2 regularization
def train_model(epochs: int, l2_lambda: float = 0.01):
    # initialize loss function, and optimizer
    criterion = nn.CrossEntropyLoss()
    optimizer = SGD(model.parameters(), lr=0.01, weight_decay=l2_lambda)  # weight_decay adds l2_lambda * w to each gradient, i.e. an L2 penalty

    for epoch in range(epochs):
        for inputs, targets in trainloader:
            optimizer.zero_grad()
            outputs = model(inputs)
            loss = criterion(outputs, targets)
            loss.backward()
            optimizer.step()

## <a id='toc3_17_'></a>[Gradient Clipping](#toc0_)

- Gradient clipping is a technique used to **prevent the exploding gradient** problem in neural networks, especially in **recurrent neural networks (RNNs)**.
- It caps the gradients after backpropagation, either element-wise to a **maximum** value or by rescaling them so that their overall **norm** stays below a threshold, to ensure they don't become **too large**.
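
A minimal sketch of norm-based clipping on a single parameter. The gradient is set by hand to `[3, 4]` (norm 5) purely for illustration, and the threshold of 1.0 is an arbitrary choice.

```python
import torch
from torch import nn

# a parameter whose gradient is set by hand to [3, 4], so its L2 norm is 5
param = nn.Parameter(torch.zeros(2))
param.grad = torch.tensor([3.0, 4.0])

# rescale the gradient so that its norm is at most max_norm;
# the function returns the total norm measured before clipping
norm_before = nn.utils.clip_grad_norm_([param], max_norm=1.0)

print(f"{norm_before.item():.1f}")  # 5.0
print(param.grad)  # approximately tensor([0.6, 0.8])
```

The direction of the gradient is preserved; only its length is reduced.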


In [None]:
# training function + gradient clipping
def train_model(epochs: int, max_norm: float = 1.0):
    criterion = nn.CrossEntropyLoss()
    optimizer = SGD(model.parameters(), lr=0.01)

    for epoch in range(epochs):
        for inputs, targets in trainloader:
            optimizer.zero_grad()
            outputs = model(inputs)

            loss = criterion(outputs, targets)
            loss.backward()

            # rescale all gradients so that their combined L2 norm is at most max_norm
            nn.utils.clip_grad_norm_(model.parameters(), max_norm)

            optimizer.step()

## <a id='toc3_18_'></a>[Early Stopping](#toc0_)

- Early stopping is used to **prevent overfitting** by **stopping** the training process when the model's performance on a **validation set** has not improved for a given number of consecutive epochs (the **patience**).
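
The patience logic can be isolated from the training loop and tested on a list of numbers. In this sketch, `early_stopping_epoch` is a hypothetical helper and the validation losses are made up for illustration.

```python
def early_stopping_epoch(val_losses: list[float], patience: int = 3) -> int:
    """Return the 0-based epoch at which training stops, or the last epoch if it never triggers."""
    best_loss = float("inf")
    patience_counter = 0
    for epoch, val_loss in enumerate(val_losses):
        if val_loss < best_loss:
            best_loss = val_loss  # improvement: remember it and reset the counter
            patience_counter = 0
        else:
            patience_counter += 1  # no improvement this epoch
            if patience_counter >= patience:
                return epoch
    return len(val_losses) - 1


# hypothetical validation losses: the best value so far is reached at epoch 2
val_losses = [0.90, 0.70, 0.60, 0.65, 0.62, 0.61, 0.50]
stop_epoch = early_stopping_epoch(val_losses, patience=3)
print(stop_epoch)  # 5
```

Training stops at epoch 5, so the later improvement to 0.50 is never seen; a larger patience trades extra compute for a lower risk of stopping too soon.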


In [120]:
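def train_model(epochs: int, trainloader: DataLoader, valloader: DataLoader, patience: int = 3):
    # initialize loss function and optimizer (uses the global `model`, like the cells above)
    criterion = nn.CrossEntropyLoss()
    optimizer = SGD(model.parameters(), lr=0.01)

    best_loss = float("inf")
    patience_counter = 0

    for epoch in range(epochs):

        # train loop
        model.train()
        train_loss = 0.0
        for inputs, targets in trainloader:
            optimizer.zero_grad()
            loss = criterion(model(inputs), targets)
            loss.backward()
            optimizer.step()
            train_loss += loss.item()

        # validation loop: average the loss over all validation batches
        model.eval()
        val_loss = 0.0
        with torch.no_grad():
            for inputs, targets in valloader:
                val_loss += criterion(model(inputs), targets).item()
        val_loss /= len(valloader)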

        # check for early stopping
        if val_loss < best_loss:
            best_loss = val_loss
            patience_counter = 0
        else:
            patience_counter += 1
            if patience_counter >= patience:
                print("Early stopping triggered")
                break