# **Deep Learning Course**

## **Loss Functions and Multilayer Perceptrons (MLP)**

---

### **Student Information:**

- **Name:** *Mahdi Tabatabaei*
- **Student Number:** *400101515*

---

### **Assignment Overview**

In this notebook, we will explore various loss functions used in neural networks, with a specific focus on their role in training **Multilayer Perceptrons (MLPs)**. By the end of this notebook, you will have a deeper understanding of:
- Types of loss functions
- How loss functions affect the training process
- The relationship between loss functions and model optimization in MLPs

---

### **Table of Contents**

1. Introduction to Loss Functions
2. Types of Loss Functions
3. Multilayer Perceptrons (MLP)
4. Implementing Loss Functions in MLP
5. Conclusion

---



# 1. Introduction to Loss Functions

In deep learning, **loss functions** play a crucial role in training models by quantifying the difference between the predicted outputs and the actual targets. Selecting the appropriate loss function is essential for the success of your model. In this notebook, we will explore various loss functions available in PyTorch, understand their theoretical backgrounds, and provide you with a scaffolded class to experiment with these loss functions.

Before beginning, let's train a simple MLP model using the **L1Loss** function. We'll return to this model later to experiment with different loss functions. We'll start by importing the necessary libraries and defining the model architecture.

First things first, let's talk about **L1Loss**.

### 1. L1Loss (`torch.nn.L1Loss`)
- **Description:** Also known as Mean Absolute Error (MAE), L1Loss computes the average absolute difference between the predicted values and the target values.
- **Use Case:** Suitable for regression tasks where robustness to outliers is desired.

Here is the mathematical formulation of L1Loss:
\begin{equation}
\text{L1Loss} = \frac{1}{n} \sum_{i=1}^{n} |y_{\text{pred}_i} - y_{\text{true}_i}|
\end{equation}
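
As a quick sanity check of the formula above, the following minimal sketch computes the mean absolute error by hand and compares it with `nn.L1Loss`. The tensor values are hypothetical toy numbers chosen only for illustration; they are not taken from the dataset used later.

```python
import torch
import torch.nn as nn

# Hypothetical toy values, chosen only for illustration
y_pred = torch.tensor([2.5, 0.0, 2.0, 8.0])
y_true = torch.tensor([3.0, -0.5, 2.0, 7.0])

# Manual MAE: mean of the absolute differences
manual_l1 = (y_pred - y_true).abs().mean()

# Built-in equivalent (default reduction="mean")
builtin_l1 = nn.L1Loss()(y_pred, y_true)

print(manual_l1.item(), builtin_l1.item())  # both 0.5
```

The absolute errors are 0.5, 0.5, 0 and 1, so both values come out to 0.5.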

Let's implement a simple MLP model using the L1Loss function.

In [1]:
import torch
import torch.nn as nn
import torch.optim as optim
import pandas as pd
from torch.utils.data import TensorDataset, DataLoader
from sklearn.model_selection import train_test_split
from torch.optim import Adam
# Don't worry about Adam for now; it's just a fancy name for a fancy optimization algorithm

Here, we'll define a class called `SimpleMLP` that inherits from `nn.Module`. This class can have multiple layers, and we'll use the `nn.Sequential` module to define the layers of the model. The model will have the following architecture:

In [2]:
from tqdm import tqdm

class SimpleMLP(nn.Module):
    def __init__(self, input_dim, hidden_dim, output_dim, num_hidden_layers=1, last_layer_activation_fn=nn.ReLU):
        super(SimpleMLP, self).__init__()
        
        # Initialize a list to store the layers
        layers = []
        
        # Add the first layer (input layer to first hidden layer)
        layers.append(nn.Linear(input_dim, hidden_dim))
        layers.append(nn.ReLU())  # Apply ReLU activation after each hidden layer
        
        # Add hidden layers
        for _ in range(num_hidden_layers - 1):
            layers.append(nn.Linear(hidden_dim, hidden_dim))
            layers.append(nn.ReLU())
        
        # Add output layer
        layers.append(nn.Linear(hidden_dim, output_dim))
        
        # Add activation function for the output layer
        if last_layer_activation_fn is not None:
            layers.append(last_layer_activation_fn())
        
        # Use nn.Sequential to define the model architecture
        self.model = nn.Sequential(*layers)

    def forward(self, x):
        # Pass the input through the model
        return self.model(x)


Now, let's define a class called `SimpleMLPTrainer` that handles training and evaluation:

In [3]:
from tqdm import tqdm

class SimpleMLPTrainer:
    def __init__(self, model, criterion, optimizer):
        self.model = model
        self.criterion = criterion
        self.optimizer = optimizer

    def train(self, train_loader, num_epochs):
        # List to store training loss for each epoch
        epoch_losses = []

        # Training loop
        for epoch in range(num_epochs):
            self.model.train()  # Set model to training mode
            running_loss = 0.0

            # Use tqdm for a progress bar
            for inputs, targets in tqdm(train_loader, desc=f"Epoch {epoch + 1}/{num_epochs}"):
                # Zero the parameter gradients
                self.optimizer.zero_grad()

                # Forward pass
                outputs = self.model(inputs)
                loss = self.criterion(outputs, targets)

                # Backward pass and optimize
                loss.backward()
                self.optimizer.step()

                # Accumulate loss
                running_loss += loss.item()

            # Calculate and store average loss for this epoch
            avg_loss = running_loss / len(train_loader)
            epoch_losses.append(avg_loss)

            # Print loss for the epoch
            print(f"Epoch [{epoch + 1}/{num_epochs}], Loss: {avg_loss:.4f}")

        return epoch_losses

    def evaluate(self, val_loader):
        self.model.eval()  # Set model to evaluation mode
        val_loss = 0.0
        correct_predictions = 0
        total_predictions = 0

        with torch.no_grad():  # Disable gradient computation
            for inputs, targets in val_loader:
                # Forward pass
                outputs = self.model(inputs)
                loss = self.criterion(outputs, targets)

                # Accumulate validation loss
                val_loss += loss.item()

                if outputs.shape[1] > 1:  # One output per class: pick the most likely class
                    predicted = torch.argmax(outputs, dim=1)
                    # Convert targets to class indices if they are one-hot encoded
                    targets = torch.argmax(targets, dim=1) if targets.dim() > 1 else targets
                else:  # Single output (binary classification): threshold at 0.5
                    predicted = (outputs >= 0.5).float().squeeze()
                    targets = targets.squeeze()
                    
                correct_predictions += (predicted == targets).sum().item()
                total_predictions += targets.size(0)

        # Average validation loss
        avg_val_loss = val_loss / len(val_loader)
        # Calculate accuracy
        accuracy = correct_predictions / total_predictions

        print(f"Validation Loss: {avg_val_loss:.4f}, Accuracy: {accuracy:.4f}")
        return avg_val_loss, accuracy


Next, let's test our model using the L1Loss function. You'll use <span style="color:red">*Titanic Dataset*</span> to train the model.


In [4]:
# Load dataset
train_url = "https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv"
data = pd.read_csv(train_url)

# Preprocessing (simple example)
data = data[['Pclass', 'Sex', 'Age', 'Fare', 'Survived']].dropna()
data['Sex'] = data['Sex'].map({'male': 0, 'female': 1})

In [5]:
# Convert data to tensors
X = data[['Pclass', 'Sex', 'Age', 'Fare']].values
y = data['Survived'].values

X_tensor = torch.tensor(X, dtype=torch.float32)
y_tensor = torch.tensor(y, dtype=torch.float32)

In [6]:
# Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X_tensor, y_tensor, test_size=0.2, random_state=42)

# Create TensorDatasets and DataLoaders
train_dataset = TensorDataset(X_train, y_train)
test_dataset = TensorDataset(X_test, y_test)
train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=32, shuffle=False)

In [7]:
# Check if GPU is available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
device

device(type='cuda')

In [8]:
# Define model parameters
input_dim = X.shape[1]  # Number of features
hidden_dim = 16  # Number of units in hidden layer
output_dim = 1  # Output layer size for binary classification (Survived or not)
num_hidden_layers = 2

# Instantiate the model
model_L1 = SimpleMLP(input_dim, hidden_dim, output_dim, num_hidden_layers=num_hidden_layers, last_layer_activation_fn=nn.Sigmoid)
# model_L1 = model_L1.to(device)  # Move model to the specified device

# Define the criterion and optimizer
criterion = nn.L1Loss()
optimizer = Adam(model_L1.parameters(), lr=0.001)




<div style="text-align: center;"> <span style="color:red; font-size: 26px; font-weight: bold;">Let's train!</span> </div>

In [9]:
# Instantiate the trainer
trainer = SimpleMLPTrainer(model_L1, criterion, optimizer)

# Train the model
num_epochs = 20
training_losses = trainer.train(train_loader, num_epochs)

Epoch 1/20: 100%|██████████| 18/18 [00:00<00:00, 333.79it/s]


Epoch [1/20], Loss: 0.4563


Epoch 2/20: 100%|██████████| 18/18 [00:00<00:00, 801.23it/s]


Epoch [2/20], Loss: 0.4312


Epoch 3/20: 100%|██████████| 18/18 [00:00<00:00, 462.25it/s]


Epoch [3/20], Loss: 0.4213


Epoch 4/20: 100%|██████████| 18/18 [00:00<00:00, 289.02it/s]


Epoch [4/20], Loss: 0.4139


Epoch 5/20: 100%|██████████| 18/18 [00:00<00:00, 262.34it/s]


Epoch [5/20], Loss: 0.4144


Epoch 6/20: 100%|██████████| 18/18 [00:00<00:00, 517.06it/s]


Epoch [6/20], Loss: 0.4125


Epoch 7/20: 100%|██████████| 18/18 [00:00<00:00, 521.30it/s]


Epoch [7/20], Loss: 0.4114


Epoch 8/20: 100%|██████████| 18/18 [00:00<00:00, 400.10it/s]


Epoch [8/20], Loss: 0.4110


Epoch 9/20: 100%|██████████| 18/18 [00:00<00:00, 433.85it/s]


Epoch [9/20], Loss: 0.4114


Epoch 10/20: 100%|██████████| 18/18 [00:00<00:00, 375.26it/s]


Epoch [10/20], Loss: 0.4102


Epoch 11/20: 100%|██████████| 18/18 [00:00<00:00, 386.63it/s]


Epoch [11/20], Loss: 0.4091


Epoch 12/20: 100%|██████████| 18/18 [00:00<00:00, 435.42it/s]


Epoch [12/20], Loss: 0.4109


Epoch 13/20: 100%|██████████| 18/18 [00:00<00:00, 386.95it/s]


Epoch [13/20], Loss: 0.4106


Epoch 14/20: 100%|██████████| 18/18 [00:00<00:00, 592.25it/s]


Epoch [14/20], Loss: 0.4109


Epoch 15/20: 100%|██████████| 18/18 [00:00<00:00, 453.30it/s]


Epoch [15/20], Loss: 0.4099


Epoch 16/20: 100%|██████████| 18/18 [00:00<00:00, 771.09it/s]


Epoch [16/20], Loss: 0.4092


Epoch 17/20: 100%|██████████| 18/18 [00:00<00:00, 432.00it/s]


Epoch [17/20], Loss: 0.4102


Epoch 18/20: 100%|██████████| 18/18 [00:00<00:00, 662.71it/s]


Epoch [18/20], Loss: 0.4108


Epoch 19/20: 100%|██████████| 18/18 [00:00<00:00, 522.42it/s]


Epoch [19/20], Loss: 0.4105


Epoch 20/20: 100%|██████████| 18/18 [00:00<00:00, 721.37it/s]

Epoch [20/20], Loss: 0.4082





In [10]:
# Evaluate the model
test_loss, accuracy = trainer.evaluate(test_loader)

Validation Loss: 0.3925, Accuracy: 0.6084
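
One detail that may be worth checking in the setup above: with `output_dim = 1` the model returns a tensor of shape `(batch, 1)`, while the targets built from `data['Survived']` have shape `(batch,)`. For elementwise losses such as `nn.L1Loss`, PyTorch warns about the size mismatch and then broadcasts the two tensors to `(batch, batch)`, so every prediction is compared with every target in the batch. The sketch below illustrates the effect on a hypothetical two-sample batch and shows one possible remedy (`unsqueeze(1)` on the targets); it is an illustration under these assumptions, not a statement about what the original run did.

```python
import warnings
import torch
import torch.nn as nn

criterion = nn.L1Loss()

# Hypothetical batch of two perfect predictions
outputs = torch.tensor([[0.0], [1.0]])  # shape (2, 1), like a one-unit output layer
targets = torch.tensor([0.0, 1.0])      # shape (2,), like a 1-D label tensor

with warnings.catch_warnings():
    warnings.simplefilter("ignore")  # silence the size-mismatch UserWarning
    # Broadcast to (2, 2): each output is compared with each target
    loss_broadcast = criterion(outputs, targets)

# Aligning the shapes compares each output with its own target only
loss_aligned = criterion(outputs, targets.unsqueeze(1))

print(loss_broadcast.item())  # 0.5
print(loss_aligned.item())    # 0.0
```

Even though the predictions are perfect, the broadcast version reports a loss of 0.5, whereas the aligned version reports 0.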




---
# 2. Types of Loss Functions

PyTorch offers a variety of built-in loss functions tailored for different types of problems, such as regression, classification, and more. Below, we discuss several commonly used loss functions, their theoretical foundations, and typical use cases.

### 2. MSELoss (`torch.nn.MSELoss`)
- **Description:** Mean Squared Error (MSE) calculates the average of the squares of the differences between predicted and target values.
- **Use Case:** Commonly used in regression problems where larger errors are significantly penalized.

Here is boring math stuff for MSE:
\begin{equation}
\text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_{i} - \hat{y}_{i})^{2}
\end{equation}
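
To make the "larger errors are significantly penalized" point concrete, here is a small sketch that computes MSE by hand, checks it against `nn.MSELoss`, and compares it with the mean absolute error on the same data. The values are hypothetical, and the last prediction is deliberately an outlier.

```python
import torch
import torch.nn as nn

# Hypothetical toy values; the last prediction is an outlier (error of 10)
y_pred = torch.tensor([1.0, 2.0, 3.0, 14.0])
y_true = torch.tensor([1.5, 2.0, 2.5, 4.0])

# Manual MSE: mean of the squared differences
manual_mse = ((y_pred - y_true) ** 2).mean()
builtin_mse = nn.MSELoss()(y_pred, y_true)

# Mean absolute error on the same data, for comparison
mae = nn.L1Loss()(y_pred, y_true)

print(manual_mse.item(), builtin_mse.item())  # both 25.125
print(mae.item())                             # 2.75
```

The single outlier contributes 100 of the 100.5 total squared error, but only 10 of the 11 total absolute error, which is why MSE reacts much more strongly to it.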

<span style="color:red; font-size: 18px; font-weight: bold;">Warning:</span> Don't forget to reinitialize the model before experimenting with different loss functions.

In [11]:
from torch.nn import MSELoss

# Define model parameters
input_dim = X.shape[1]  # Number of input features (Pclass, Sex, Age, Fare)
hidden_dim = 16  # Number of units in hidden layer
output_dim = 1  # Binary classification output (Survived or not)
num_hidden_layers = 2

# Initialize the model and move it to the device
model_MSE = SimpleMLP(input_dim, hidden_dim, output_dim, num_hidden_layers=num_hidden_layers, last_layer_activation_fn=nn.Sigmoid)
# model_MSE = model_MSE.to(device)  # Move model to the specified device

# Initialize MSE Loss as the criterion and Adam as the optimizer
criterion = MSELoss()
optimizer = Adam(model_MSE.parameters(), lr=0.001)

# Instantiate the trainer
trainer = SimpleMLPTrainer(model_MSE, criterion, optimizer)

# Train the model
num_epochs = 20
training_losses = trainer.train(train_loader, num_epochs)

Epoch 1/20: 100%|██████████| 18/18 [00:00<00:00, 538.64it/s]


Epoch [1/20], Loss: 0.2677


Epoch 2/20: 100%|██████████| 18/18 [00:00<00:00, 631.39it/s]


Epoch [2/20], Loss: 0.2499


Epoch 3/20: 100%|██████████| 18/18 [00:00<00:00, 592.76it/s]


Epoch [3/20], Loss: 0.2458


Epoch 4/20: 100%|██████████| 18/18 [00:00<00:00, 682.17it/s]


Epoch [4/20], Loss: 0.2435


Epoch 5/20: 100%|██████████| 18/18 [00:00<00:00, 495.78it/s]


Epoch [5/20], Loss: 0.2442


Epoch 6/20: 100%|██████████| 18/18 [00:00<00:00, 532.35it/s]


Epoch [6/20], Loss: 0.2435


Epoch 7/20: 100%|██████████| 18/18 [00:00<00:00, 646.94it/s]


Epoch [7/20], Loss: 0.2430


Epoch 8/20: 100%|██████████| 18/18 [00:00<00:00, 663.77it/s]


Epoch [8/20], Loss: 0.2431


Epoch 9/20: 100%|██████████| 18/18 [00:00<00:00, 706.52it/s]


Epoch [9/20], Loss: 0.2427


Epoch 10/20: 100%|██████████| 18/18 [00:00<00:00, 588.31it/s]


Epoch [10/20], Loss: 0.2426


Epoch 11/20: 100%|██████████| 18/18 [00:00<00:00, 608.23it/s]


Epoch [11/20], Loss: 0.2427


Epoch 12/20: 100%|██████████| 18/18 [00:00<00:00, 568.54it/s]


Epoch [12/20], Loss: 0.2432


Epoch 13/20: 100%|██████████| 18/18 [00:00<00:00, 664.03it/s]


Epoch [13/20], Loss: 0.2469


Epoch 14/20: 100%|██████████| 18/18 [00:00<00:00, 331.25it/s]


Epoch [14/20], Loss: 0.2427


Epoch 15/20: 100%|██████████| 18/18 [00:00<00:00, 633.29it/s]


Epoch [15/20], Loss: 0.2426


Epoch 16/20: 100%|██████████| 18/18 [00:00<00:00, 467.93it/s]


Epoch [16/20], Loss: 0.2427


Epoch 17/20: 100%|██████████| 18/18 [00:00<00:00, 253.70it/s]


Epoch [17/20], Loss: 0.2441


Epoch 18/20: 100%|██████████| 18/18 [00:00<00:00, 269.11it/s]


Epoch [18/20], Loss: 0.2427


Epoch 19/20: 100%|██████████| 18/18 [00:00<00:00, 224.33it/s]


Epoch [19/20], Loss: 0.2434


Epoch 20/20: 100%|██████████| 18/18 [00:00<00:00, 262.21it/s]

Epoch [20/20], Loss: 0.2419





In [13]:
# Evaluate the model
test_loss, accuracy = trainer.evaluate(test_loader)

Validation Loss: 0.2392, Accuracy: 0.6084


### 3. NLLLoss (`torch.nn.NLLLoss`)
- **Description:** Negative Log-Likelihood Loss measures the likelihood of the target class under the predicted probability distribution.
- **Use Case:** Typically used in multi-class classification tasks, especially when combined with `log_softmax` activation.

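Here is the mathematical formulation of NLLLoss:
\begin{equation}
\text{NLLLoss} = -\frac{1}{n} \sum_{i=1}^{n} \log(y_{i})
\end{equation}
where \( y_i \) is the predicted probability that the model assigns to the true class of sample \( i \).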

I hope you note the logarithm in the formula. It's important! 

Why?

Answer: Since NLLLoss directly uses `log(y_i)`, it expects log-probabilities as input rather than raw logits or probabilities. Using `log_softmax` as the final activation function ensures the output is in the correct format for `NLLLoss`. If you use only `ReLU` (or no `log_softmax`), the outputs won’t be log-probabilities, and `NLLLoss` would be applied incorrectly, leading to:

- Unstable gradients
- Inconsistent or meaningless loss values

In this part, run your training with `ReLU` as the last layer. <span style="color:red; font-weight: bold;">Discuss </span> and explain the difference between the results of the two models. Find a proper solution to the problem.

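If you run training with `ReLU` as the last layer, the model’s outputs are non-negative, unbounded values rather than log-probabilities, which is what `NLLLoss` requires. Because `NLLLoss` simply negates the output at the target index and averages over the batch, the optimizer can drive the loss toward negative infinity by inflating the outputs. This is why the ReLU run below reports increasingly negative losses (about -424 by epoch 20) and a validation accuracy of only 0.3916. The solution is to replace the final activation with `log_softmax`, so that the output matches what `NLLLoss` expects; the loss then stays non-negative and decreases steadily.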
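
The behaviour described above can be reproduced in isolation. The following sketch uses hypothetical logits for three samples and two classes; it shows that `NLLLoss` only negates and averages the entry at the target index, and that feeding it ReLU outputs yields a negative value that shrinks further as the outputs grow.

```python
import torch
import torch.nn.functional as F

# Hypothetical raw outputs of a final linear layer: 3 samples, 2 classes
logits = torch.tensor([[2.0, -1.0], [0.5, 1.5], [-3.0, 4.0]])
targets = torch.tensor([0, 1, 1])

# Correct usage: log-probabilities in, non-negative loss out
log_probs = F.log_softmax(logits, dim=1)
nll_ok = F.nll_loss(log_probs, targets)

# NLLLoss only negates and averages the entry at the target index
manual = -log_probs[torch.arange(3), targets].mean()

# Misuse: ReLU outputs are >= 0, so the "loss" is <= 0 and
# decreases further whenever the outputs are scaled up
relu_out = F.relu(logits)
nll_relu = F.nll_loss(relu_out, targets)
nll_relu_scaled = F.nll_loss(10 * relu_out, targets)

print(nll_ok.item(), manual.item())  # identical, positive
print(nll_relu.item())               # -2.5
print(nll_relu_scaled.item())        # -25.0
```

Scaling the ReLU outputs by 10 scales the "loss" by 10 as well, which is the unbounded descent visible in the ReLU training log.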


In [14]:
X_tensor = torch.tensor(X, dtype=torch.float32)
y_tensor = torch.tensor(y, dtype=torch.long)

# Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X_tensor, y_tensor, test_size=0.2, random_state=42)

# Create TensorDatasets and DataLoaders
train_dataset = TensorDataset(X_train, y_train)
test_dataset = TensorDataset(X_test, y_test)
train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=32, shuffle=False)

In [15]:
# Run with relu activation function
from torch.nn import NLLLoss

# Define model parameters
input_dim = X.shape[1]  # Number of input features (Pclass, Sex, Age, Fare)
hidden_dim = 16  # Number of units in hidden layer
output_dim = 2  # Binary classification output (Survived or not)
num_hidden_layers = 2

# Initialize the model and move it to the device
model_NLL_Relu = SimpleMLP(input_dim, hidden_dim, output_dim, num_hidden_layers=num_hidden_layers, last_layer_activation_fn=nn.ReLU)
# model_NLL_Relu.to(device)

# Initialize NLL Loss as the criterion and Adam as the optimizer
criterion = NLLLoss()
optimizer = Adam(model_NLL_Relu.parameters(), lr=0.001)

# Instantiate the trainer
trainer = SimpleMLPTrainer(model_NLL_Relu, criterion, optimizer)

# Train the model
num_epochs = 20
training_losses = trainer.train(train_loader, num_epochs)

Epoch 1/20: 100%|██████████| 18/18 [00:00<00:00, 350.90it/s]


Epoch [1/20], Loss: -0.3359


Epoch 2/20: 100%|██████████| 18/18 [00:00<00:00, 415.19it/s]


Epoch [2/20], Loss: -1.2595


Epoch 3/20: 100%|██████████| 18/18 [00:00<00:00, 296.03it/s]


Epoch [3/20], Loss: -2.2997


Epoch 4/20: 100%|██████████| 18/18 [00:00<00:00, 275.45it/s]


Epoch [4/20], Loss: -3.7494


Epoch 5/20: 100%|██████████| 18/18 [00:00<00:00, 261.82it/s]


Epoch [5/20], Loss: -5.8825


Epoch 6/20: 100%|██████████| 18/18 [00:00<00:00, 277.79it/s]


Epoch [6/20], Loss: -8.8293


Epoch 7/20: 100%|██████████| 18/18 [00:00<00:00, 334.43it/s]


Epoch [7/20], Loss: -12.9363


Epoch 8/20: 100%|██████████| 18/18 [00:00<00:00, 208.10it/s]


Epoch [8/20], Loss: -18.4914


Epoch 9/20: 100%|██████████| 18/18 [00:00<00:00, 247.43it/s]


Epoch [9/20], Loss: -26.2770


Epoch 10/20: 100%|██████████| 18/18 [00:00<00:00, 444.21it/s]


Epoch [10/20], Loss: -36.2976


Epoch 11/20: 100%|██████████| 18/18 [00:00<00:00, 539.80it/s]


Epoch [11/20], Loss: -50.4723


Epoch 12/20: 100%|██████████| 18/18 [00:00<00:00, 614.15it/s]


Epoch [12/20], Loss: -68.2054


Epoch 13/20: 100%|██████████| 18/18 [00:00<00:00, 509.62it/s]


Epoch [13/20], Loss: -91.3761


Epoch 14/20: 100%|██████████| 18/18 [00:00<00:00, 529.26it/s]


Epoch [14/20], Loss: -117.5143


Epoch 15/20: 100%|██████████| 18/18 [00:00<00:00, 567.69it/s]


Epoch [15/20], Loss: -150.8462


Epoch 16/20: 100%|██████████| 18/18 [00:00<00:00, 782.17it/s]


Epoch [16/20], Loss: -191.5641


Epoch 17/20: 100%|██████████| 18/18 [00:00<00:00, 623.20it/s]


Epoch [17/20], Loss: -236.6268


Epoch 18/20: 100%|██████████| 18/18 [00:00<00:00, 692.75it/s]


Epoch [18/20], Loss: -291.7410


Epoch 19/20: 100%|██████████| 18/18 [00:00<00:00, 766.86it/s]


Epoch [19/20], Loss: -352.0806


Epoch 20/20: 100%|██████████| 18/18 [00:00<00:00, 693.02it/s]

Epoch [20/20], Loss: -424.5749





In [16]:
# Evaluate the model
test_loss, accuracy = trainer.evaluate(test_loader)

Validation Loss: -453.4514, Accuracy: 0.3916


In [17]:
# Run with LogSoftmax activation function
from torch.nn import NLLLoss

# Define model parameters
input_dim = X.shape[1]  # Number of input features (Pclass, Sex, Age, Fare)
hidden_dim = 16  # Number of units in hidden layer
output_dim = 2  # Binary classification output (Survived or not)
num_hidden_layers = 2

# Initialize the model; dim=1 makes LogSoftmax normalize over the class dimension explicitly
model_NLL_LogSoftmax = SimpleMLP(input_dim, hidden_dim, output_dim, num_hidden_layers=num_hidden_layers, last_layer_activation_fn=lambda: nn.LogSoftmax(dim=1))
# model_NLL_LogSoftmax.to(device)

# Initialize NLL Loss as the criterion and Adam as the optimizer
criterion = NLLLoss()
optimizer = Adam(model_NLL_LogSoftmax.parameters(), lr=0.001)

# Instantiate the trainer
trainer = SimpleMLPTrainer(model_NLL_LogSoftmax, criterion, optimizer)

# Train the model
num_epochs = 20
training_losses = trainer.train(train_loader, num_epochs)

Epoch 1/20: 100%|██████████| 18/18 [00:00<00:00, 343.02it/s]


Epoch [1/20], Loss: 1.2253


Epoch 2/20: 100%|██████████| 18/18 [00:00<00:00, 454.89it/s]


Epoch [2/20], Loss: 0.7507


Epoch 3/20: 100%|██████████| 18/18 [00:00<00:00, 335.67it/s]


Epoch [3/20], Loss: 0.6330


Epoch 4/20: 100%|██████████| 18/18 [00:00<00:00, 302.23it/s]


Epoch [4/20], Loss: 0.5970


Epoch 5/20: 100%|██████████| 18/18 [00:00<00:00, 264.93it/s]


Epoch [5/20], Loss: 0.5906


Epoch 6/20: 100%|██████████| 18/18 [00:00<00:00, 392.24it/s]


Epoch [6/20], Loss: 0.5859


Epoch 7/20: 100%|██████████| 18/18 [00:00<00:00, 511.57it/s]


Epoch [7/20], Loss: 0.5865


Epoch 8/20: 100%|██████████| 18/18 [00:00<00:00, 657.79it/s]


Epoch [8/20], Loss: 0.5883


Epoch 9/20: 100%|██████████| 18/18 [00:00<00:00, 723.45it/s]


Epoch [9/20], Loss: 0.5827


Epoch 10/20: 100%|██████████| 18/18 [00:00<00:00, 636.47it/s]


Epoch [10/20], Loss: 0.5836


Epoch 11/20: 100%|██████████| 18/18 [00:00<00:00, 768.22it/s]


Epoch [11/20], Loss: 0.5810


Epoch 12/20: 100%|██████████| 18/18 [00:00<00:00, 379.31it/s]


Epoch [12/20], Loss: 0.5805


Epoch 13/20: 100%|██████████| 18/18 [00:00<00:00, 276.99it/s]


Epoch [13/20], Loss: 0.5827


Epoch 14/20: 100%|██████████| 18/18 [00:00<00:00, 320.17it/s]


Epoch [14/20], Loss: 0.5778


Epoch 15/20: 100%|██████████| 18/18 [00:00<00:00, 245.65it/s]


Epoch [15/20], Loss: 0.5741


Epoch 16/20: 100%|██████████| 18/18 [00:00<00:00, 158.37it/s]


Epoch [16/20], Loss: 0.5742


Epoch 17/20: 100%|██████████| 18/18 [00:00<00:00, 259.85it/s]


Epoch [17/20], Loss: 0.5688


Epoch 18/20: 100%|██████████| 18/18 [00:00<00:00, 236.96it/s]


Epoch [18/20], Loss: 0.5708


Epoch 19/20: 100%|██████████| 18/18 [00:00<00:00, 209.48it/s]


Epoch [19/20], Loss: 0.5710


Epoch 20/20: 100%|██████████| 18/18 [00:00<00:00, 232.77it/s]

Epoch [20/20], Loss: 0.5721





In [18]:
# Evaluate the model
test_loss, accuracy = trainer.evaluate(test_loader)

Validation Loss: 0.6592, Accuracy: 0.6084


Your reason for your choice:

<div>
**Your answer here**
</div>


### 4. CrossEntropyLoss (`torch.nn.CrossEntropyLoss`)
- **Description:** Combines `LogSoftmax` and `NLLLoss` in one single class. It computes the cross-entropy loss between the target and the output logits.
- **Use Case:** Widely used for multi-class classification problems.

The mathematical formulation of CrossEntropyLoss is as follows:
\begin{equation}
  \text{CrossEntropy}(y, \hat{y}) = - \sum_{i=1}^{C} y_i \log\left(\frac{e^{\hat{y}_i}}{\sum_{j=1}^{C} e^{\hat{y}_j}}\right)
\end{equation}
  where:
  - \( C \) is the number of classes,
  - \( y_i \) is the \( i \)-th entry of the one-hot encoded target vector \( y \) (in PyTorch the target can also be given as a scalar class index),
  - \( \hat{y}_i \) represents the logits (unnormalized model outputs) for each class.
  
  In practice, `torch.nn.CrossEntropyLoss` expects raw logits as input and internally applies the softmax function to convert the logits into probabilities, followed by the negative log-likelihood computation.

- **Background:** Cross-entropy measures the difference between the true distribution \( y \) and the predicted distribution obtained by applying softmax to the logits \( \hat{y} \). The function minimizes the negative log-probability assigned to the correct class, effectively penalizing predictions that deviate from the true class, making it a standard choice for classification tasks in deep learning.
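
The claim that `CrossEntropyLoss` combines `LogSoftmax` and `NLLLoss` can be checked numerically. The sketch below uses hypothetical logits and labels and compares the two routes.

```python
import torch
import torch.nn as nn

# Hypothetical raw logits for 3 samples and 2 classes, with class-index targets
logits = torch.tensor([[2.0, -1.0], [0.5, 1.5], [-3.0, 4.0]])
targets = torch.tensor([0, 1, 1])

# Route 1: CrossEntropyLoss applied directly to the raw logits
ce = nn.CrossEntropyLoss()(logits, targets)

# Route 2: explicit LogSoftmax over the class dimension, then NLLLoss
nll = nn.NLLLoss()(nn.LogSoftmax(dim=1)(logits), targets)

print(ce.item(), nll.item())  # the two values agree
```

This is also why a model trained with `CrossEntropyLoss` should output raw logits: adding a softmax or log-softmax layer before it would apply the normalization twice.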

Now, let's train the same `SimpleMLP` with `CrossEntropyLoss`. The final activation is set to `None` because the loss expects raw logits:


In [19]:
from torch.nn import CrossEntropyLoss

# Define model parameters
input_dim = X.shape[1]  # Number of input features (Pclass, Sex, Age, Fare)
hidden_dim = 16  # Number of units in hidden layer
output_dim = 2  # Binary classification output (Survived or not)
num_hidden_layers = 2

# Initialize the model and move it to the device
model_CE = SimpleMLP(input_dim, hidden_dim, output_dim, num_hidden_layers=num_hidden_layers, last_layer_activation_fn=None)
# model_CE.to(device)

# Initialize Cross-Entropy Loss as the criterion and Adam as the optimizer
criterion = CrossEntropyLoss()
optimizer = Adam(model_CE.parameters(), lr=0.001)

# Instantiate the trainer
trainer = SimpleMLPTrainer(model_CE, criterion, optimizer)

# Train the model
num_epochs = 20
training_losses = trainer.train(train_loader, num_epochs)

Epoch 1/20: 100%|██████████| 18/18 [00:00<00:00, 347.68it/s]


Epoch [1/20], Loss: 0.6823


Epoch 2/20: 100%|██████████| 18/18 [00:00<00:00, 489.79it/s]


Epoch [2/20], Loss: 0.6088


Epoch 3/20: 100%|██████████| 18/18 [00:00<00:00, 570.87it/s]


Epoch [3/20], Loss: 0.6018


Epoch 4/20: 100%|██████████| 18/18 [00:00<00:00, 711.91it/s]


Epoch [4/20], Loss: 0.6135


Epoch 5/20: 100%|██████████| 18/18 [00:00<00:00, 692.67it/s]


Epoch [5/20], Loss: 0.5999


Epoch 6/20: 100%|██████████| 18/18 [00:00<00:00, 740.44it/s]


Epoch [6/20], Loss: 0.5928


Epoch 7/20: 100%|██████████| 18/18 [00:00<00:00, 648.75it/s]


Epoch [7/20], Loss: 0.5881


Epoch 8/20: 100%|██████████| 18/18 [00:00<00:00, 673.81it/s]


Epoch [8/20], Loss: 0.5897


Epoch 9/20: 100%|██████████| 18/18 [00:00<00:00, 637.65it/s]


Epoch [9/20], Loss: 0.5861


Epoch 10/20: 100%|██████████| 18/18 [00:00<00:00, 639.45it/s]


Epoch [10/20], Loss: 0.5858


Epoch 11/20: 100%|██████████| 18/18 [00:00<00:00, 554.70it/s]


Epoch [11/20], Loss: 0.5853


Epoch 12/20: 100%|██████████| 18/18 [00:00<00:00, 853.92it/s]


Epoch [12/20], Loss: 0.5844


Epoch 13/20: 100%|██████████| 18/18 [00:00<00:00, 581.40it/s]


Epoch [13/20], Loss: 0.5820


Epoch 14/20: 100%|██████████| 18/18 [00:00<00:00, 282.86it/s]


Epoch [14/20], Loss: 0.5900


Epoch 15/20: 100%|██████████| 18/18 [00:00<00:00, 338.21it/s]


Epoch [15/20], Loss: 0.5901


Epoch 16/20: 100%|██████████| 18/18 [00:00<00:00, 357.09it/s]


Epoch [16/20], Loss: 0.5828


Epoch 17/20: 100%|██████████| 18/18 [00:00<00:00, 351.37it/s]


Epoch [17/20], Loss: 0.5809


Epoch 18/20: 100%|██████████| 18/18 [00:00<00:00, 296.08it/s]


Epoch [18/20], Loss: 0.5794


Epoch 19/20: 100%|██████████| 18/18 [00:00<00:00, 353.80it/s]


Epoch [19/20], Loss: 0.5793


Epoch 20/20: 100%|██████████| 18/18 [00:00<00:00, 359.28it/s]

Epoch [20/20], Loss: 0.5738





In [20]:
# Evaluate the model
test_loss, accuracy = trainer.evaluate(test_loader)

Validation Loss: 0.6466, Accuracy: 0.6154



### 5. KLDivLoss (`torch.nn.KLDivLoss`)
- **Description:** Kullback-Leibler Divergence Loss measures how one probability distribution diverges from a second, reference distribution. Unlike other loss functions that focus on classification, KL divergence specifically compares the relative entropy between two distributions. It quantifies the information loss when using the predicted distribution to approximate the true distribution. 

- **Mathematical Function:**
\begin{equation}
  \text{KL}(P \parallel Q) = \sum_{i=1}^{C} P(i) \left( \log P(i) - \log Q(i) \right)
\end{equation}
  where:
  - \( P \) is the target (true) probability distribution,
  - \( Q \) is the predicted distribution (passed to the loss as log-probabilities, often the output of `log_softmax`),
  - \( C \) is the number of classes.

  KL divergence is always non-negative, and it equals zero if the two distributions are identical. The loss function expects the model's output to be in the form of log-probabilities (using `log_softmax`) and compares this against a target probability distribution, which is typically a normalized distribution (using softmax).

- **Use Case:** KLDivLoss is frequently used in:
  - **Variational Autoencoders (VAEs):** In VAEs, KL divergence is used to measure how much the learned latent space distribution deviates from a prior distribution (often Gaussian).
  - **Knowledge Distillation:** In teacher-student models, KL divergence is used to transfer the "soft" knowledge from a teacher model to a student model by comparing their output probability distributions.
  - **Reinforcement Learning:** It can be used to update policies while minimizing the divergence from a previous policy.

- **Background:** Kullback-Leibler divergence, a core concept in information theory, measures the inefficiency of assuming the predicted distribution \( Q \) when the true distribution is \( P \). It is asymmetric, meaning that \( KL(P \parallel Q) \neq KL(Q \parallel P) \), so the direction of the comparison matters.
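
The formula and the asymmetry can be illustrated with a small sketch. The two distributions are hypothetical, and `manual_kl` is a helper defined here only for illustration. Note that `F.kl_div` takes the log of the predicted distribution as its first argument and the target distribution as its second.

```python
import torch
import torch.nn.functional as F

# Two hypothetical distributions over 2 classes (batch of one)
P = torch.tensor([[0.7, 0.3]])  # target distribution
Q = torch.tensor([[0.5, 0.5]])  # predicted distribution

def manual_kl(p, q):
    """KL(p || q) = sum_i p_i * (log p_i - log q_i)."""
    return (p * (p.log() - q.log())).sum()

kl_pq = manual_kl(P, Q)
kl_qp = manual_kl(Q, P)

# Built-in: input is log Q, target is P; "batchmean" divides the sum by the batch size
builtin = F.kl_div(Q.log(), P, reduction="batchmean")

print(kl_pq.item(), builtin.item())  # equal: KL(P || Q)
print(kl_qp.item())                  # a different value: KL(Q || P)
```

Both directions are non-negative, but they differ, so swapping the roles of the target and the prediction changes the loss.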

Again, in this part, run your training with `ReLU` as the last layer. <span style="color:red; font-weight: bold;">Discuss </span> and explain the difference between the results of the two models. Find a proper solution to the problem.

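Answer: With `ReLU` as the last layer, the outputs are not log-probabilities, so the value computed by `KLDivLoss` is no longer a true KL divergence. In the ReLU run below it becomes negative and keeps decreasing (about -831 by epoch 20), even though KL divergence is non-negative by definition. Replacing `ReLU` with `log_softmax` in the last layer puts the model output in the expected form (log-probabilities), so it can be compared against the target distribution in a meaningful way.
With `log_softmax` in the final layer, the loss stays non-negative and training should be stable, because `KLDivLoss` can now properly measure the divergence between the predicted and target distributions.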
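
A short sketch of this answer, using hypothetical logits and one-hot targets like the ones built with `F.one_hot` in this notebook: with `log_softmax` inputs, `KLDivLoss` on one-hot targets coincides with cross-entropy, whereas with ReLU inputs it turns negative.

```python
import torch
import torch.nn.functional as F

# Hypothetical raw outputs for 3 samples and 2 classes
logits = torch.tensor([[2.0, -1.0], [0.5, 1.5], [-3.0, 4.0]])
labels = torch.tensor([0, 1, 1])
one_hot = F.one_hot(labels, num_classes=2).float()

# Correct usage: log-probabilities as input
kl_logsoftmax = F.kl_div(F.log_softmax(logits, dim=1), one_hot, reduction="batchmean")

# With one-hot targets the entropy term of the target is zero,
# so the KL divergence equals the cross-entropy
ce = F.cross_entropy(logits, labels)

# Misuse: ReLU outputs treated as if they were log-probabilities
kl_relu = F.kl_div(F.relu(logits), one_hot, reduction="batchmean")

print(kl_logsoftmax.item(), ce.item())  # equal, positive
print(kl_relu.item())                   # -2.5
```

The negative value in the last line has no interpretation as a divergence, which matches the meaningless negative losses of the ReLU run.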

In [21]:
import torch.nn.functional as F

X_tensor = torch.tensor(X, dtype=torch.float32)
y_tensor = torch.tensor(y, dtype=torch.long)
y_tensor = F.one_hot(y_tensor, num_classes=2).float()

# Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X_tensor, y_tensor, test_size=0.2, random_state=42)

# Create TensorDatasets and DataLoaders
train_dataset = TensorDataset(X_train, y_train)
test_dataset = TensorDataset(X_test, y_test)
train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=32, shuffle=False)

In [22]:
# Run with relu activation function
from torch.nn import KLDivLoss

# Define model parameters
input_dim = X.shape[1]  # Number of input features (Pclass, Sex, Age, Fare)
hidden_dim = 16  # Number of units in hidden layer
output_dim = 2  # Binary classification output (Survived or not)
num_hidden_layers = 2

# Initialize the model and move it to the device
model_KLDV_Relu = SimpleMLP(input_dim, hidden_dim, output_dim, num_hidden_layers=num_hidden_layers, last_layer_activation_fn=nn.ReLU)
# model_KLDV_Relu.to(device)

criterion = KLDivLoss(reduction="batchmean")
optimizer = Adam(model_KLDV_Relu.parameters(), lr=0.001)

# Instantiate the trainer
trainer = SimpleMLPTrainer(model_KLDV_Relu, criterion, optimizer)

# Train the model
num_epochs = 20
training_losses = trainer.train(train_loader, num_epochs)

Epoch [1/20], Loss: -1.9587
Epoch [2/20], Loss: -3.2984
Epoch [3/20], Loss: -5.3868
Epoch [4/20], Loss: -8.4288
Epoch [5/20], Loss: -13.0496
Epoch [6/20], Loss: -19.0498
Epoch [7/20], Loss: -28.1887
Epoch [8/20], Loss: -39.9642
Epoch [9/20], Loss: -55.6300
Epoch [10/20], Loss: -77.3023
Epoch [11/20], Loss: -105.5364
Epoch [12/20], Loss: -142.9996
Epoch [13/20], Loss: -186.5315
Epoch [14/20], Loss: -241.1785
Epoch [15/20], Loss: -307.8080
Epoch [16/20], Loss: -386.0494
Epoch [17/20], Loss: -481.7671
Epoch [18/20], Loss: -583.8053
Epoch [19/20], Loss: -703.3131
Epoch [20/20], Loss: -831.2457

In [23]:
# Evaluate the model
test_loss, accuracy = trainer.evaluate(test_loader)

Validation Loss: -880.0218, Accuracy: 0.3916
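
The negative, steadily decreasing loss above is what one would expect when `KLDivLoss` receives ReLU outputs: according to the PyTorch documentation, the loss treats its input as log-probabilities and computes `target * (log(target) - input)` pointwise. The sketch below re-implements that formula in plain Python to show the effect. The helper names, the toy logits, and the one-hot targets are illustrative assumptions, not part of the notebook.

```python
import math

def kl_div_batchmean(inputs, targets):
    """Plain-Python mirror of the documented pointwise KLDivLoss formula,
    summed over all elements and divided by the batch size ("batchmean").
    `inputs` are treated as log-probabilities, `targets` as probabilities."""
    total = 0.0
    for inp, tgt in zip(inputs, targets):
        for x, t in zip(inp, tgt):
            if t > 0:  # terms with t == 0 contribute nothing
                total += t * (math.log(t) - x)
    return total / len(inputs)

def log_softmax(logits):
    """Numerically stable log-softmax of one sample."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - lse for z in logits]

def relu(logits):
    """Element-wise ReLU of one sample."""
    return [max(0.0, z) for z in logits]

# Toy batch (hypothetical values): two samples, two classes, one-hot targets
targets = [[1.0, 0.0], [0.0, 1.0]]
logits = [[2.0, -1.0], [0.5, 3.0]]

# Valid log-probabilities: the result is a true KL divergence, hence >= 0
loss_log_softmax = kl_div_batchmean([log_softmax(z) for z in logits], targets)

# ReLU outputs are not log-probabilities: the "loss" is just minus the
# output at the target class, so larger outputs give a more negative value
loss_relu = kl_div_batchmean([relu(z) for z in logits], targets)
loss_relu_scaled = kl_div_batchmean(
    [relu([10 * v for v in z]) for z in logits], targets
)

print(loss_log_softmax, loss_relu, loss_relu_scaled)
```

With ReLU, the optimizer can lower the loss indefinitely just by inflating the outputs, which does not require the predictions to become more accurate.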


In [24]:
# Run with LogSoftmax as the last-layer activation function
from torch.nn import KLDivLoss

# Define model parameters
input_dim = X.shape[1]  # Number of input features (Pclass, Sex, Age, Fare)
hidden_dim = 16  # Number of units in hidden layer
output_dim = 2  # Binary classification output (Survived or not)
num_hidden_layers = 2

# Initialize the model (it stays on the default device)
model_KLDV_LogSoftMax = SimpleMLP(input_dim, hidden_dim, output_dim, num_hidden_layers=num_hidden_layers, last_layer_activation_fn=nn.LogSoftmax)

criterion = KLDivLoss(reduction="batchmean")
optimizer = Adam(model_KLDV_LogSoftMax.parameters(), lr=0.001)

# Instantiate the trainer
trainer = SimpleMLPTrainer(model_KLDV_LogSoftMax, criterion, optimizer)

# Train the model
num_epochs = 20
training_losses = trainer.train(train_loader, num_epochs)

Epoch [1/20], Loss: 1.0039
Epoch [2/20], Loss: 0.6769
Epoch [3/20], Loss: 0.6394
Epoch [4/20], Loss: 0.6332
Epoch [5/20], Loss: 0.6179
Epoch [6/20], Loss: 0.6127
Epoch [7/20], Loss: 0.6155
Epoch [8/20], Loss: 0.6123
Epoch [9/20], Loss: 0.6109
Epoch [10/20], Loss: 0.6135
Epoch [11/20], Loss: 0.6166
Epoch [12/20], Loss: 0.6047
Epoch [13/20], Loss: 0.5991
Epoch [14/20], Loss: 0.6020
Epoch [15/20], Loss: 0.6008
Epoch [16/20], Loss: 0.6059
Epoch [17/20], Loss: 0.6015
Epoch [18/20], Loss: 0.5964
Epoch [19/20], Loss: 0.5929
Epoch [20/20], Loss: 0.5903

In [25]:
# Evaluate the model
test_loss, accuracy = trainer.evaluate(test_loader)

Validation Loss: 0.6584, Accuracy: 0.6154


Your reason for your choice:

<div>
**Your answer here**
</div>

### 6. CosineEmbeddingLoss (`torch.nn.CosineEmbeddingLoss`)
- **Description:** Measures the cosine similarity between two input tensors, `x1` and `x2`, and computes the loss based on a label `y` that indicates whether the tensors should be similar (`y = 1`) or dissimilar (`y = -1`). Cosine similarity focuses on the angle between vectors, disregarding their magnitude.

- **Mathematical Function:** 
\begin{equation}
  \text{CosineEmbeddingLoss}(x_1, x_2, y) = 
  \begin{cases} 
  1 - \cos(x_1, x_2), & \text{if } y = 1 \\
  \max(0, \cos(x_1, x_2) - \text{margin}), & \text{if } y = -1
  \end{cases}
\end{equation}
  where $ \cos(x_1, x_2) $ is the cosine similarity between the two vectors, and `margin` is a threshold: a dissimilar pair (`y = -1`) contributes zero loss once its cosine similarity is at or below `margin`.

- **Use Case:** Commonly used in tasks like face verification, image similarity, and other scenarios where the relative orientation of vectors (angle) is more important than their length, such as in embeddings and metric learning.

- **Background:** Cosine similarity compares the directional alignment of vectors, making it ideal for high-dimensional data where the magnitude may not be as informative. This loss is particularly useful when training models to learn meaningful embeddings that capture semantic similarity.

You'll become more familiar with this loss function later in the course.
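
To make the piecewise definition concrete, here is a minimal plain-Python sketch of the formula for a single pair of vectors. The function names are illustrative, and the sketch assumes non-zero vectors (it omits the small epsilon a library implementation would typically add to avoid division by zero).

```python
import math

def cosine_similarity(x1, x2):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(x1, x2))
    norm1 = math.sqrt(sum(a * a for a in x1))
    norm2 = math.sqrt(sum(b * b for b in x2))
    return dot / (norm1 * norm2)

def cosine_embedding_loss(x1, x2, y, margin=0.0):
    """Loss for one pair: y = 1 means 'similar', y = -1 means 'dissimilar'."""
    cos = cosine_similarity(x1, x2)
    if y == 1:
        return 1.0 - cos              # penalize any angle between similar pairs
    if y == -1:
        return max(0.0, cos - margin)  # penalize only similarity above the margin
    raise ValueError("y must be 1 or -1")

a = [1.0, 0.0]
b = [2.0, 0.0]   # same direction as a, different length
c = [0.0, 1.0]   # orthogonal to a

print(cosine_embedding_loss(a, b, 1))               # same direction, labeled similar
print(cosine_embedding_loss(a, b, -1))              # same direction, labeled dissimilar
print(cosine_embedding_loss(a, c, -1))              # orthogonal, labeled dissimilar
print(cosine_embedding_loss(a, b, -1, margin=0.5))  # margin reduces the penalty
```

Because `a` and `b` differ only in length, the loss treats them as identical, which illustrates that only the angle between the vectors matters.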

---

# Regularization in Machine Learning

## Introduction

Regularization is a fundamental technique in machine learning that helps prevent overfitting by adding a penalty to the loss function. This penalty discourages the model from becoming too complex, ensuring better generalization to unseen data. In this notebook, you will explore the concepts of regularization, understand different types of regularization techniques, and apply them using Python's popular libraries.

## What is Regularization?

Regularization involves adding a regularization term to the loss function used to train machine learning models. This term imposes a constraint on the model's coefficients, effectively reducing their magnitude. By doing so, regularization helps in:

- **Preventing Overfitting:** Ensures the model does not become too tailored to the training data.
- **Improving Generalization:** Enhances the model's performance on new, unseen data.
- **Feature Selection:** L1 regularization in particular can drive some coefficients exactly to zero, effectively selecting the important features.
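
In symbols, the idea can be sketched as follows. The notation is an assumption introduced here for illustration: $\mathcal{L}_{\text{data}}$ is the unregularized loss, $w$ the model's coefficients, $R(w)$ the penalty, and $\lambda \ge 0$ the regularization strength.
\begin{equation}
\mathcal{L}_{\text{reg}}(w) = \mathcal{L}_{\text{data}}(w) + \lambda \, R(w)
\end{equation}
Setting $\lambda = 0$ recovers the original loss, while larger values of $\lambda$ push the coefficients more strongly toward zero.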

## Types of Regularization

There are several types of regularization techniques, each imposing different constraints on the model's parameters:

### 1. L1 Regularization (Lasso)

L1 regularization adds the sum of the absolute values of the coefficients as a penalty term to the loss function. It can lead to sparse models where some feature coefficients are exactly zero.

### 2. L2 Regularization (Ridge)

L2 regularization adds the sum of the squared coefficients as a penalty term to the loss function. It shrinks all coefficients toward zero, penalizing large ones most strongly, but generally does not set them exactly to zero.

### 3. Elastic Net

Elastic Net combines both L1 and L2 regularization penalties. It balances the benefits of both Lasso and Ridge methods, allowing for feature selection and coefficient shrinkage.
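
The three penalties can be written in a few lines of plain Python. This is a minimal sketch: the function names, the `l1_ratio` mixing parameter, and the toy weights are illustrative assumptions, and real libraries often use slightly different scaling conventions (for example, an extra factor of 0.5 on the L2 term).

```python
def l1_penalty(weights):
    """Sum of absolute values (Lasso-style penalty)."""
    return sum(abs(w) for w in weights)

def l2_penalty(weights):
    """Sum of squares (Ridge-style penalty)."""
    return sum(w * w for w in weights)

def elastic_net_penalty(weights, l1_ratio=0.5):
    """Convex mix of the L1 and L2 penalties; l1_ratio=1 is pure L1, 0 is pure L2."""
    return l1_ratio * l1_penalty(weights) + (1.0 - l1_ratio) * l2_penalty(weights)

def regularized_loss(data_loss, weights, lam, penalty):
    """Data loss plus the chosen penalty scaled by the strength lam."""
    return data_loss + lam * penalty(weights)

weights = [0.5, -2.0, 0.0]  # hypothetical model coefficients

print(l1_penalty(weights))           # 2.5
print(l2_penalty(weights))           # 4.25
print(elastic_net_penalty(weights))  # 3.375
print(regularized_loss(1.0, weights, lam=0.1, penalty=l2_penalty))
```

Note how the large coefficient `-2.0` dominates the L2 penalty (4.0 out of 4.25) far more than the L1 penalty (2.0 out of 2.5), which is why L2 regularization acts most strongly on large weights.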

## Homework Time!
Import the Iris dataset from `sklearn.datasets` and train a classifier with ridge-style (L2) regularization for different alpha values. Then create a GIF that shows how the classification boundary changes with alpha.

Import the libraries you need and start coding!

In [26]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from PIL import Image
from io import BytesIO
import imageio
import warnings


# Suppress warnings for cleaner output
warnings.filterwarnings("ignore")

Load the Iris dataset and select Setosa and Versicolor classes

In [27]:
# 1. Load the Iris Dataset
iris = load_iris()
X = iris.data  # Features
y = iris.target  # Target labels

# 2. Select only two classes (Setosa and Versicolor) for binary classification
# and two features (Sepal Length and Petal Length) for 2D visualization
# Class 0 = Setosa, Class 1 = Versicolor
mask = np.where((y == 0) | (y == 1))  # Select only Setosa and Versicolor (label 0 and 1)
X = X[mask][:, [0, 2]]  # Use only Sepal Length and Petal Length
y = y[mask]

# 3. Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# 4. Standardize the features
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# 5. Convert to PyTorch tensors
X_train_tensor = torch.tensor(X_train, dtype=torch.float32)
y_train_tensor = torch.tensor(y_train, dtype=torch.long)  # Target needs to be long for classification
X_test_tensor = torch.tensor(X_test, dtype=torch.float32)
y_test_tensor = torch.tensor(y_test, dtype=torch.long)

# 6. Create TensorDatasets and DataLoaders
train_dataset = TensorDataset(X_train_tensor, y_train_tensor)
test_dataset = TensorDataset(X_test_tensor, y_test_tensor)
train_loader = DataLoader(train_dataset, batch_size=16, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=16, shuffle=False)


Define Function to Plot Decision Boundary

In [28]:
import numpy as np
import matplotlib.pyplot as plt
from PIL import Image
from io import BytesIO
import torch

def plot_decision_boundary(model, X, y, alpha=0.5):
    # Define the grid
    x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx, yy = np.meshgrid(np.linspace(x_min, x_max, 100), np.linspace(y_min, y_max, 100))

    # Predict over the grid
    Z = model.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)

    # Create a figure
    fig, ax = plt.subplots(figsize=(6, 5))

    # Plot the decision boundary
    ax.contourf(xx, yy, Z, alpha=0.3, levels=[-0.1, 0.1, 1.1], colors=['blue', 'red'])

    # Scatter plot of the training data
    scatter = ax.scatter(
        X[:, 0], X[:, 1], c=y, cmap='bwr', edgecolor='k', s=50
    )

    # Title and labels
    ax.set_title(f'MLP Decision Boundary (alpha={alpha})')
    ax.set_xlabel('Sepal Length (standardized)')
    ax.set_ylabel('Petal Length (standardized)')

    # Hide axis ticks for clarity (axis labels are kept)
    ax.set_xticks([])
    ax.set_yticks([])

    # Tight layout
    plt.tight_layout()

    # Save the plot to a BytesIO object
    buf = BytesIO()
    plt.savefig(buf, format='png')
    plt.close(fig)
    buf.seek(0)
    return Image.open(buf)


Train MLP with Varying Alpha Values and Collect Images

In [29]:
from sklearn.neural_network import MLPClassifier

def create_decision_boundary_gif(alpha_values, X_train, y_train, n_neurons):
    # List to store images for each alpha value
    images = []
    
    for idx, alpha in enumerate(alpha_values):
        print(f"Processing alpha={alpha:.4f} ({idx + 1}/{len(alpha_values)})")

        # Create and train the MLP
        mlp = MLPClassifier(hidden_layer_sizes=(n_neurons,), alpha=alpha, max_iter=1000, random_state=42)
        mlp.fit(X_train, y_train)

        # Plot the decision boundary and collect the frame (once per alpha)
        img = plot_decision_boundary(mlp, X_train, y_train, alpha)
        images.append(img)

    # Save the images as a GIF
    gif_filename = 'mlp_classification_boundaries.gif'
    images[0].save(
        gif_filename,
        save_all=True,
        append_images=images[1:],
        duration=500,
        loop=0
    )

    print(f"GIF saved as '{gif_filename}'")

    # Return the GIF filename for display or use
    return gif_filename

## RUN

In [30]:

# Use np.logspace to generate alpha values, with at least 20 values
alpha_values = np.logspace(-2, 2, 20)
# Define the number of neurons in the hidden layer
n_neurons = 10

# Create the decision boundary GIF
gif_dir = create_decision_boundary_gif(alpha_values, X_train, y_train, n_neurons)

Processing alpha=0.0100 (1/20)
Processing alpha=0.0162 (2/20)
Processing alpha=0.0264 (3/20)
Processing alpha=0.0428 (4/20)
Processing alpha=0.0695 (5/20)
Processing alpha=0.1129 (6/20)
Processing alpha=0.1833 (7/20)
Processing alpha=0.2976 (8/20)
Processing alpha=0.4833 (9/20)
Processing alpha=0.7848 (10/20)
Processing alpha=1.2743 (11/20)
Processing alpha=2.0691 (12/20)
Processing alpha=3.3598 (13/20)
Processing alpha=5.4556 (14/20)
Processing alpha=8.8587 (15/20)
Processing alpha=14.3845 (16/20)
Processing alpha=23.3572 (17/20)
Processing alpha=37.9269 (18/20)
Processing alpha=61.5848 (19/20)
Processing alpha=100.0000 (20/20)
GIF saved as 'mlp_classification_boundaries.gif'


Your GIF should look like this:

<div style="text-align: center;">

### **Multilayer Perceptron Classification Boundaries**

![Classification Boundaries](mlp_classification_boundaries_example.gif)

*Figure 1: Demonstration of classification boundaries created by a Multilayer Perceptron (MLP) model.*

</div>

