# Deep learning architectures with PyTorch

This notebook provides an introduction to various deep learning architectures in PyTorch. Each section includes explanations and code examples to help us understand and implement these models.

In [1]:
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset
from sklearn.model_selection import train_test_split
import math

### Feedforward neural networks (FFNN)
Feedforward neural networks are the simplest type of artificial neural network architecture. Information flows in one direction, from input to output, without any cycles or loops. Key components:
- **Layers**: Consist of an input layer, one or more hidden layers, and an output layer.
- **Activation functions**: Introduce non-linearities to the model.

In [2]:
# Generate synthetic data
X_ffnn = torch.rand(1000, 20)
y_ffnn = torch.randint(0, 2, (1000, 1)).float()

# Split the data
X_train_ffnn, X_val_test_ffnn, y_train_ffnn, y_val_test_ffnn = train_test_split(X_ffnn, y_ffnn, test_size=0.4)
X_val_ffnn, X_test_ffnn, y_val_ffnn, y_test_ffnn = train_test_split(X_val_test_ffnn, y_val_test_ffnn, test_size=0.5)

# Define the FFNN model
class FFNN(nn.Module):
    def __init__(self):
        super(FFNN, self).__init__()
        self.fc1 = nn.Linear(in_features=20, out_features=64)
        self.fc2 = nn.Linear(in_features=64, out_features=32)
        self.fc3 = nn.Linear(in_features=32, out_features=1)
        self.relu = nn.ReLU()
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        x = self.relu(self.fc1(x))
        x = self.relu(self.fc2(x))
        x = self.sigmoid(self.fc3(x))
        return x

model_ffnn = FFNN()

# Define loss function and optimizer
criterion = nn.BCELoss()
optimizer = optim.Adam(model_ffnn.parameters(), lr=0.001)

# Create DataLoader for training
train_dataset = TensorDataset(X_train_ffnn, y_train_ffnn)
train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)

# Training loop
num_epochs = 10
for epoch in range(num_epochs):
    model_ffnn.train()  # Set the model to training mode

    # Training loop (forward pass and backward pass)
    for X_batch, y_batch in train_loader:
        optimizer.zero_grad()  # Clear gradients
        
        # Forward pass
        outputs = model_ffnn(X_batch)
        train_loss = criterion(outputs, y_batch)
        
        # Backward pass
        train_loss.backward()
        optimizer.step()

    # Validation loop (only forward pass)
    model_ffnn.eval()  # Set the model to evaluation mode
    with torch.inference_mode():  # No need to calculate gradients
        outputs = model_ffnn(X_val_ffnn)
        val_loss = criterion(outputs, y_val_ffnn)

    # Print losses for this epoch (train loss is that of the last mini-batch only)
    print(f'Epoch {epoch+1}/{num_epochs}, Train Loss: {train_loss:.4f}, Validation Loss: {val_loss:.4f}')


# Test set evaluation (only forward pass)
model_ffnn.eval()  # Set the model to evaluation mode
with torch.inference_mode():
    outputs = model_ffnn(X_test_ffnn)
    test_loss = criterion(outputs, y_test_ffnn)
print(f'Test Loss: {test_loss:.4f}')

Epoch 1/10, Train Loss: 0.6910, Validation Loss: 0.6918
Epoch 2/10, Train Loss: 0.6676, Validation Loss: 0.6925
Epoch 3/10, Train Loss: 0.6752, Validation Loss: 0.6926
Epoch 4/10, Train Loss: 0.6907, Validation Loss: 0.6930
Epoch 5/10, Train Loss: 0.6890, Validation Loss: 0.6937
Epoch 6/10, Train Loss: 0.6923, Validation Loss: 0.6940
Epoch 7/10, Train Loss: 0.6707, Validation Loss: 0.6948
Epoch 8/10, Train Loss: 0.7058, Validation Loss: 0.6952
Epoch 9/10, Train Loss: 0.6782, Validation Loss: 0.6970
Epoch 10/10, Train Loss: 0.6679, Validation Loss: 0.6978
Test Loss: 0.7038
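
The losses above hover around 0.69 because the targets were drawn at random, independently of the inputs, so there is no pattern to learn. As a rough sanity check (a sketch, not part of the original notebook), the binary cross-entropy of a hypothetical "model" that always predicts 0.5 is ln(2) ≈ 0.6931. The names `constant_pred` and `baseline_loss` are illustrative.

```python
import math
import torch
import torch.nn as nn

criterion = nn.BCELoss()

# Random binary targets, generated the same way as in the cell above
y = torch.randint(0, 2, (1000, 1)).float()

# A hypothetical "model" that ignores its input and always predicts 0.5
constant_pred = torch.full_like(y, 0.5)

# BCE is -log(0.5) for every sample, whatever the label is
baseline_loss = criterion(constant_pred, y).item()
print(f'Constant-0.5 baseline: {baseline_loss:.4f}, ln(2) = {math.log(2):.4f}')
```

A validation loss that drifts above this baseline while the train loss falls suggests the model is memorizing noise in the training split.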


#### Understanding the FFNN model syntax

1. **Defining the model class**: In PyTorch, a neural network is defined as a class that inherits from `nn.Module`. This class encapsulates the architecture and behavior of the model, including the definition of layers and the forward pass. The model class is essential because it allows PyTorch to manage the model's parameters, layers, and operations seamlessly.
    ```python
    class FFNN(nn.Module):
        def __init__(self):
            super(FFNN, self).__init__()
            # Define layers and operations here
    ```
    - **`super(FFNN, self).__init__()`**: This line calls the constructor of the parent class (`nn.Module`), initializing the model and allowing PyTorch to keep track of the network's layers and parameters.

2. **Defining layers and operations**:
    - **Layer definitions**: In the `__init__` method, the layers of the model are defined. These are typically instances of PyTorch's built-in layer classes, such as `nn.Linear` for fully connected layers. The layers are defined as attributes of the class (`self.fc1`, `self.fc2`, `self.fc3`, etc.), making them part of the model's state.
        ```python
        self.fc = nn.Linear(in_features, out_features)
        ```
        - **Input and output dimensions**: The parameters of `nn.Linear` define the number of input (`in_features`, the number of neurons in the previous layer) and output (`out_features`, the number of neurons in this layer) features for each layer. For instance, `self.fc1 = nn.Linear(20, 64)` indicates that the first layer takes 20 input features and outputs 64 features.
    - **Activation functions**: In this model, activation functions are also defined in the `__init__` method. A common alternative is to call activations directly in the `forward` method (for example `torch.relu(x)`) without defining them in `__init__`. Here, ReLU (`nn.ReLU()`) and Sigmoid (`nn.Sigmoid()`) are both defined in `__init__` as class attributes.
        ```python
        self.relu = nn.ReLU()
        self.sigmoid = nn.Sigmoid()
        ```
        - **Why define activation functions in `__init__`?**: Defining activation functions in `__init__` makes them reusable components of the model. By doing so, we avoid creating a new instance of the activation function every time it is called in the `forward` method, which can be slightly more efficient. This approach also makes the forward pass cleaner and easier to read. Although ReLU is used twice in the `forward` method, it is defined only once in `__init__`. This is because the same ReLU function can be applied multiple times to different layers. There is no need to define separate ReLU instances for each layer unless there is a specific reason to do so (e.g., different parameter settings for different layers).
    - **Importance of order**: The order of definition in the `__init__` method is not crucial as long as the layers and operations are properly referenced in the `forward` method. However, it's common practice to define the layers first, followed by any activation functions or other operations. This convention improves code readability and organization.

3. **Defining the forward pass**: The `forward` method defines how the input data flows through the network layers and activation functions. This method describes the sequence of operations applied to the input data, transforming it step-by-step through the network.

    ```python
    def forward(self, x):
        x = self.relu(self.fc1(x))  # Apply first layer and ReLU activation
        x = self.relu(self.fc2(x))  # Apply second layer and ReLU activation
        x = self.sigmoid(self.fc3(x))  # Apply final layer and Sigmoid activation
        return x
    ```

    - **Sequential operations**: The operations are applied in sequence, with each layer's output serving as the input to the next layer. ReLU activation is applied after each hidden layer, and Sigmoid is applied at the final output layer.
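
To make the parameter tracking described above concrete, the following sketch rebuilds the same architecture and lists what `nn.Module` has registered. It assumes the layer sizes used in this notebook (20 → 64 → 32 → 1); `param_shapes`, `total_params`, and `probs` are illustrative names.

```python
import torch
import torch.nn as nn

class FFNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(20, 64)
        self.fc2 = nn.Linear(64, 32)
        self.fc3 = nn.Linear(32, 1)
        self.relu = nn.ReLU()        # no learnable parameters
        self.sigmoid = nn.Sigmoid()  # no learnable parameters

    def forward(self, x):
        x = self.relu(self.fc1(x))
        x = self.relu(self.fc2(x))
        return self.sigmoid(self.fc3(x))

model = FFNN()

# Each nn.Linear contributes a weight matrix (out_features x in_features)
# and a bias vector (out_features)
param_shapes = {name: tuple(p.shape) for name, p in model.named_parameters()}
for name, shape in param_shapes.items():
    print(name, shape)

# (20*64 + 64) + (64*32 + 32) + (32*1 + 1) = 3457
total_params = sum(p.numel() for p in model.parameters())
print('Total trainable parameters:', total_params)

# A batch of 5 samples with 20 features maps to 5 probabilities
probs = model(torch.rand(5, 20))
print(probs.shape)
```

Because the activation modules hold no parameters, sharing one `nn.ReLU()` instance across layers does not change the parameter count.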


#### Understanding the training loop structure
- Epoch loop: Runs through the entire training dataset for a specified number of epochs.
- Batch loop: Divides the dataset into mini-batches, and each batch is processed independently.
    - Forward pass: The part of the code where the model makes predictions based on the input data.
    - Backward pass: The part of the code where the gradients are calculated and the model parameters are updated.
- Validation loss: Calculated after each epoch to monitor how well the model is generalizing to unseen data.
    - Here we used `torch.inference_mode()`; `torch.no_grad()` is an alternative. Both disable gradient computation. `torch.no_grad()` only stops operations from being recorded for backpropagation, whereas `torch.inference_mode()` also skips other autograd bookkeeping (such as view and version-counter tracking), which makes it better suited to pure inference. The trade-off is that tensors created in inference mode cannot later be used in computations recorded by autograd.
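
The difference between the two context managers can be observed directly. This is a small sketch using a throwaway `nn.Linear` layer (not one of the notebook's models); `y_no_grad` and `y_inference` are illustrative names.

```python
import torch
import torch.nn as nn

layer = nn.Linear(4, 2)
x = torch.rand(3, 4)

with torch.no_grad():
    y_no_grad = layer(x)

with torch.inference_mode():
    y_inference = layer(x)

# Neither result is attached to the autograd graph
print(y_no_grad.requires_grad, y_inference.requires_grad)

# Only the tensor created under inference_mode is flagged as an inference tensor
print(y_no_grad.is_inference(), y_inference.is_inference())
```

Both calls compute the same values; only the autograd metadata attached to the results differs.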

---

### Recurrent neural networks (RNN)

RNNs are a class of neural networks designed to handle sequential data, such as time series or text. Unlike feedforward neural networks, RNNs have connections that form directed cycles, allowing them to maintain a "memory" of previous inputs. This makes RNNs well-suited for tasks where context or temporal dependencies are important. Key components:
- **Recurrent layers**: Process sequences by maintaining a hidden state that is updated at each time step.

In [3]:
# Generate synthetic sequential data
X_rnn = torch.rand(1000, 10, 20)  # 1000 sequences, each with 10 timesteps, each timestep with 20 features
y_rnn = torch.randint(0, 2, (1000, 1)).float()  # Binary targets

# Split the data
X_train_rnn, X_val_test_rnn, y_train_rnn, y_val_test_rnn = train_test_split(X_rnn, y_rnn, test_size=0.4)
X_val_rnn, X_test_rnn, y_val_rnn, y_test_rnn = train_test_split(X_val_test_rnn, y_val_test_rnn, test_size=0.5)

# Define the RNN model
class RNNModel(nn.Module):
    def __init__(self, input_size, hidden_size, output_size, num_layers):
        super(RNNModel, self).__init__()
        self.rnn = nn.RNN(input_size=input_size, hidden_size=hidden_size, num_layers=num_layers, batch_first=True)
        self.fc = nn.Linear(hidden_size, output_size)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        h0 = torch.zeros(self.rnn.num_layers, x.size(0), self.rnn.hidden_size)  # Initialize hidden state with zeros
        out, _ = self.rnn(x, h0)  # RNN layer returns all outputs and hidden state
        out = self.fc(out[:, -1, :])  # Take the output from the last time step and pass it through the fully connected layer
        out = self.sigmoid(out)  # Apply sigmoid activation
        return out

model_rnn = RNNModel(input_size=20, hidden_size=64, output_size=1, num_layers=1)

# Define loss function and optimizer
criterion = nn.BCELoss()
optimizer = optim.Adam(model_rnn.parameters(), lr=0.001)

# Create DataLoader for training
train_dataset = TensorDataset(X_train_rnn, y_train_rnn)
train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)

# Training loop
num_epochs = 10
for epoch in range(num_epochs):
    model_rnn.train()  # Set the model to training mode

    # Training loop (forward pass and backward pass)
    for X_batch, y_batch in train_loader:
        optimizer.zero_grad()  # Clear gradients
        
        # Forward pass
        outputs = model_rnn(X_batch)
        train_loss = criterion(outputs, y_batch)
        
        # Backward pass
        train_loss.backward()
        optimizer.step()

    # Validation loop (only forward pass)
    model_rnn.eval()  # Set the model to evaluation mode
    with torch.inference_mode():  # No need to calculate gradients
        outputs = model_rnn(X_val_rnn)
        val_loss = criterion(outputs, y_val_rnn)

    # Print losses for this epoch (train loss is that of the last mini-batch only)
    print(f'Epoch {epoch+1}/{num_epochs}, Train Loss: {train_loss:.4f}, Validation Loss: {val_loss:.4f}')

# Test set evaluation (only forward pass)
model_rnn.eval()  # Set the model to evaluation mode
with torch.inference_mode():
    outputs = model_rnn(X_test_rnn)
    test_loss = criterion(outputs, y_test_rnn)
print(f'Test Loss: {test_loss:.4f}')

Epoch 1/10, Train Loss: 0.6951, Validation Loss: 0.6952
Epoch 2/10, Train Loss: 0.7006, Validation Loss: 0.7002
Epoch 3/10, Train Loss: 0.6830, Validation Loss: 0.6969
Epoch 4/10, Train Loss: 0.6833, Validation Loss: 0.6986
Epoch 5/10, Train Loss: 0.6898, Validation Loss: 0.7003
Epoch 6/10, Train Loss: 0.6950, Validation Loss: 0.7008
Epoch 7/10, Train Loss: 0.7237, Validation Loss: 0.7139
Epoch 8/10, Train Loss: 0.5937, Validation Loss: 0.7186
Epoch 9/10, Train Loss: 0.5563, Validation Loss: 0.7473
Epoch 10/10, Train Loss: 0.6387, Validation Loss: 0.7546
Test Loss: 0.7575


#### Understanding the RNN model syntax

1. **Defining the model class**: Similar to the FFNN, the RNN model is defined as a class that inherits from `nn.Module`. This encapsulates the network's architecture and operations, making the model modular and reusable.
    ```python
    class RNNModel(nn.Module):
        def __init__(self, input_size, hidden_size, output_size, num_layers):
            super(RNNModel, self).__init__()
            # Define layers and operations here
    ```
    - **`super(RNNModel, self).__init__()`**: This line calls the parent class constructor, initializing the model so PyTorch can manage the layers and parameters properly.

2. **Defining layers and operations**:
    - **RNN layer**: The core component of an RNN model is the recurrent layer, which is defined using the `nn.RNN` module.
        ```python
        self.rnn = nn.RNN(input_size, hidden_size, num_layers, batch_first=True)
        ```
        - **Input and output dimensions**: The `input_size` parameter defines the number of input features at each time step. The `hidden_size` parameter determines the size of the hidden state that the RNN maintains across time steps (the number of neurons in the hidden layers). `num_layers` specifies how many recurrent layers are stacked on top of each other.
        - **`batch_first=True`**: This argument specifies that the input and output tensors are organized with the batch size as the first dimension, followed by time steps and features (batch_size, sequence_length, input_size).
    - **Fully connected layer**: After processing the sequence with the RNN, the output from the last time step is passed through a fully connected layer to produce the final output.
        ```python
        self.fc = nn.Linear(hidden_size, output_size)
        ```
        - **Final output**: The `output_size` here is 1, indicating a binary output (for binary classification). The output of the RNN layer at the final time step is transformed by the fully connected layer into the final prediction.

    - **Activation function**: A sigmoid activation function (`nn.Sigmoid()`) is applied to the output layer to squash the output between 0 and 1, suitable for binary classification.

3. **Defining the forward pass**:
    - **Initial hidden state**: Before the sequence is processed by the RNN, an initial hidden state `h0` is defined. This hidden state is typically initialized to zeros.
        ```python
        h0 = torch.zeros(self.rnn.num_layers, x.size(0), self.rnn.hidden_size)
        ```
    - **Processing the sequence and RNN output**: The input sequence `x` is processed by the RNN layer, which updates the hidden state at each time step.
        ```python
        out, _ = self.rnn(x, h0)
        ```
        - **Sequence output**: The RNN layer returns two outputs:
             - `out`: This contains the output from each time step of the sequence for each batch.
             - `hidden`: This contains the final hidden state for each sequence in the batch. It is discarded here (assigned to `_`), but it can be useful for tasks like sequence prediction.
        - **Final time step output**: The RNN outputs a tensor for each time step. Here, we are only interested in the output from the last time step (`-1`), which is used as input to the fully connected layer. This is the feature representation of the entire sequence.
        ```python
        out = self.fc(out[:, -1, :])
        ```
    - **Output layer**: The output is passed through the sigmoid activation function to produce the final prediction.
        ```python
        out = self.sigmoid(out)
        return out
        ```
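
The shapes involved in the forward pass above can be checked with a standalone `nn.RNN` layer. This sketch assumes the sizes used in this notebook (20 features, 10 time steps, hidden size 64) and an arbitrary batch of 8. It also shows that, for a unidirectional RNN, the last time step of `out` matches the final hidden state that the model discards.

```python
import torch
import torch.nn as nn

rnn = nn.RNN(input_size=20, hidden_size=64, num_layers=1, batch_first=True)
x = torch.rand(8, 10, 20)  # (batch_size, sequence_length, input_size)

# When h0 is omitted, PyTorch uses zeros for the initial hidden state
out, h_n = rnn(x)

print(out.shape)  # (batch_size, sequence_length, hidden_size) = (8, 10, 64)
print(h_n.shape)  # (num_layers, batch_size, hidden_size) = (1, 8, 64)

# The last time step of `out` is the final hidden state of the top layer
last_step = out[:, -1, :]
same = torch.allclose(last_step, h_n[-1])
print('out[:, -1, :] equals h_n[-1]:', same)
```

Either tensor could therefore feed the fully connected layer in the unidirectional case.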

---

### Long short-term memory (LSTM)

LSTMs are a type of RNN that can learn long-term dependencies, making them effective for sequence prediction problems. Key components:
- **Memory cells**: Allow the network to retain information over longer periods.
- **Gates**: Control the flow of information into and out of the memory cell.

In [4]:
# Generate synthetic sequential data
X_lstm = torch.rand(1000, 10, 20)  # 1000 sequences, each with 10 timesteps, each timestep with 20 features
y_lstm = torch.randint(0, 2, (1000, 1)).float()  # Binary targets

# Split the data
X_train_lstm, X_val_test_lstm, y_train_lstm, y_val_test_lstm = train_test_split(X_lstm, y_lstm, test_size=0.4)
X_val_lstm, X_test_lstm, y_val_lstm, y_test_lstm = train_test_split(X_val_test_lstm, y_val_test_lstm, test_size=0.5)

# Define the LSTM model
class LSTMModel(nn.Module):
    def __init__(self, input_size, hidden_size, output_size, num_layers):
        super(LSTMModel, self).__init__()
        self.lstm = nn.LSTM(input_size=input_size, hidden_size=hidden_size, num_layers=num_layers, batch_first=True)
        self.fc = nn.Linear(hidden_size, output_size)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        h0 = torch.zeros(self.lstm.num_layers, x.size(0), self.lstm.hidden_size)  # Initialize hidden state
        c0 = torch.zeros(self.lstm.num_layers, x.size(0), self.lstm.hidden_size)  # Initialize cell state
        
        out, _ = self.lstm(x, (h0, c0))  # LSTM layer returns all outputs and hidden state, cell state
        out = self.fc(out[:, -1, :])  # Take the output from the last time step and pass it through the fully connected layer
        out = self.sigmoid(out)  # Apply sigmoid activation
        return out

model_lstm = LSTMModel(input_size=20, hidden_size=64, output_size=1, num_layers=1)

# Define loss function and optimizer
criterion = nn.BCELoss()
optimizer = optim.Adam(model_lstm.parameters(), lr=0.001)

# Create DataLoader for training
train_dataset = TensorDataset(X_train_lstm, y_train_lstm)
train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)

# Training loop
num_epochs = 10
for epoch in range(num_epochs):
    model_lstm.train()  # Set the model to training mode

    # Training loop (forward pass and backward pass)
    for X_batch, y_batch in train_loader:
        optimizer.zero_grad()  # Clear gradients
        
        # Forward pass
        outputs = model_lstm(X_batch)
        train_loss = criterion(outputs, y_batch)
        
        # Backward pass
        train_loss.backward()
        optimizer.step()

    # Validation loop (only forward pass)
    model_lstm.eval()  # Set the model to evaluation mode
    with torch.inference_mode():  # No need to calculate gradients
        outputs = model_lstm(X_val_lstm)
        val_loss = criterion(outputs, y_val_lstm)

    # Print losses for this epoch (train loss is that of the last mini-batch only)
    print(f'Epoch {epoch+1}/{num_epochs}, Train Loss: {train_loss:.4f}, Validation Loss: {val_loss:.4f}')

# Test set evaluation (only forward pass)
model_lstm.eval()  # Set the model to evaluation mode
with torch.inference_mode():
    outputs = model_lstm(X_test_lstm)
    test_loss = criterion(outputs, y_test_lstm)
print(f'Test Loss: {test_loss:.4f}')

Epoch 1/10, Train Loss: 0.6864, Validation Loss: 0.6915
Epoch 2/10, Train Loss: 0.6923, Validation Loss: 0.6923
Epoch 3/10, Train Loss: 0.6906, Validation Loss: 0.6929
Epoch 4/10, Train Loss: 0.6936, Validation Loss: 0.6923
Epoch 5/10, Train Loss: 0.6929, Validation Loss: 0.6945
Epoch 6/10, Train Loss: 0.6912, Validation Loss: 0.6922
Epoch 7/10, Train Loss: 0.6952, Validation Loss: 0.6939
Epoch 8/10, Train Loss: 0.6830, Validation Loss: 0.6962
Epoch 9/10, Train Loss: 0.6883, Validation Loss: 0.7014
Epoch 10/10, Train Loss: 0.6580, Validation Loss: 0.7165
Test Loss: 0.7177


#### Understanding the LSTM model syntax

1. **Defining the model class**: The LSTM model, like the FFNN and RNN models, is defined as a class that inherits from `nn.Module`. This encapsulates the network's architecture and behavior, making it modular and reusable.
    ```python
    class LSTMModel(nn.Module):
        def __init__(self, input_size, hidden_size, output_size, num_layers):
            super(LSTMModel, self).__init__()
            # Define layers and operations here
    ```
    - **`super(LSTMModel, self).__init__()`**: This line calls the parent class constructor, initializing the model so that PyTorch can manage the layers and parameters properly.

2. **Defining layers and operations**:
    - **LSTM layer**: The core component of an LSTM model is the LSTM layer, which is defined using the `nn.LSTM` module.
        ```python
        self.lstm = nn.LSTM(input_size=input_size, hidden_size=hidden_size, num_layers=num_layers, batch_first=True)
        ```
        - **Input and output dimensions**: 
            - `input_size`: Defines the number of input features at each time step.
            - `hidden_size`: Determines the size of the hidden state (the number of neurons in the hidden layers).
            - `num_layers`: Specifies how many LSTM layers are stacked on top of each other.
            - `batch_first=True`: Specifies that the input and output tensors are organized with the batch size as the first dimension, followed by time steps and features `(batch_size, sequence_length, input_size)`.
    - **Fully connected layer**: After processing the sequence with the LSTM, the output from the last time step is passed through a fully connected layer to produce the final output.
        ```python
        self.fc = nn.Linear(hidden_size, output_size)
        ```
        - **Final output**: The `output_size` here is 1, indicating a binary output (for binary classification). The output of the LSTM layer at the final time step is transformed by the fully connected layer into the final prediction.
    - **Activation function**: A sigmoid activation function (`nn.Sigmoid()`) is applied to the output layer to squash the output between 0 and 1, suitable for binary classification.

3. **Defining the forward pass**:
    - **Initial hidden and cell states**: Before the sequence is processed by the LSTM, initial hidden state `h0` and cell state `c0` are defined. These states are typically initialized to zeros.
        ```python
        h0 = torch.zeros(self.lstm.num_layers, x.size(0), self.lstm.hidden_size)
        c0 = torch.zeros(self.lstm.num_layers, x.size(0), self.lstm.hidden_size)
        ```
    - **Processing the sequence and LSTM output**: The input sequence `x` is processed by the LSTM layer, which updates the hidden state and cell state at each time step.
        ```python
        out, _ = self.lstm(x, (h0, c0))
        ```
        - **Sequence output**: The LSTM layer returns two outputs:
            - `out`: Contains the output from each time step of the sequence for each batch.
            - `hidden`: A tuple `(h_n, c_n)` containing the final hidden and cell states for each sequence in the batch. These are useful for tasks like sequence prediction, but are discarded here (assigned to `_`).
        - **Final time step output**: The LSTM outputs a tensor for each time step. Here, we are only interested in the output from the last time step (`-1`), which is used as input to the fully connected layer. This captures the feature representation of the entire sequence.
        ```python
        out = self.fc(out[:, -1, :])
        ```
    - **Output layer**: The output is passed through the sigmoid activation function to produce the final prediction.
        ```python
        out = self.sigmoid(out)
        return out
        ```
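
As a quick check of the return values described above, this sketch calls a standalone `nn.LSTM` layer twice: once with explicit zero states and once without any states. The sizes match this notebook; the batch size of 8 is arbitrary, and the variable names are illustrative.

```python
import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=20, hidden_size=64, num_layers=1, batch_first=True)
x = torch.rand(8, 10, 20)  # (batch_size, sequence_length, input_size)

# Explicit zero initial states, as in the model above
h0 = torch.zeros(1, 8, 64)
c0 = torch.zeros(1, 8, 64)
out_explicit, (h_n, c_n) = lstm(x, (h0, c0))

# Omitting the states makes PyTorch fall back to zeros
out_default, _ = lstm(x)

print(out_explicit.shape)      # (8, 10, 64)
print(h_n.shape, c_n.shape)    # both (1, 8, 64)
print(torch.allclose(out_explicit, out_default))
```

Passing `(h0, c0)` explicitly is mainly useful when a non-zero starting state is wanted, for example when carrying state across consecutive chunks of a long sequence.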
        
---

### Gated recurrent units (GRU)
GRUs are similar to LSTMs but with a simpler architecture. They are effective for capturing dependencies in sequential data. Key components:
- **Update and reset gates**: Simplify the control mechanism compared to LSTMs.

In [5]:
# Generate synthetic sequential data
X_gru = torch.rand(1000, 10, 20)  # 1000 sequences, each with 10 timesteps, each timestep with 20 features
y_gru = torch.randint(0, 2, (1000, 1)).float()  # Binary targets

# Split the data
X_train_gru, X_val_test_gru, y_train_gru, y_val_test_gru = train_test_split(X_gru, y_gru, test_size=0.4)
X_val_gru, X_test_gru, y_val_gru, y_test_gru = train_test_split(X_val_test_gru, y_val_test_gru, test_size=0.5)

# Define the GRU model
class GRUModel(nn.Module):
    def __init__(self, input_size, hidden_size, output_size, num_layers):
        super(GRUModel, self).__init__()
        self.gru = nn.GRU(input_size=input_size, hidden_size=hidden_size, num_layers=num_layers, batch_first=True)
        self.fc = nn.Linear(hidden_size, output_size)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        h0 = torch.zeros(self.gru.num_layers, x.size(0), self.gru.hidden_size)  # Initialize hidden state with zeros
        out, _ = self.gru(x, h0)  # GRU layer returns all outputs and hidden state
        out = self.fc(out[:, -1, :])  # Take the output from the last time step and pass it through the fully connected layer
        out = self.sigmoid(out)  # Apply sigmoid activation
        return out

model_gru = GRUModel(input_size=20, hidden_size=64, output_size=1, num_layers=1)

# Define loss function and optimizer
criterion = nn.BCELoss()
optimizer = optim.Adam(model_gru.parameters(), lr=0.001)

# Create DataLoader for training
train_dataset = TensorDataset(X_train_gru, y_train_gru)
train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)

# Training loop
num_epochs = 10
for epoch in range(num_epochs):
    model_gru.train()  # Set the model to training mode

    # Training loop (forward pass and backward pass)
    for X_batch, y_batch in train_loader:
        optimizer.zero_grad()  # Clear gradients
        
        # Forward pass
        outputs = model_gru(X_batch)
        train_loss = criterion(outputs, y_batch)
        
        # Backward pass
        train_loss.backward()
        optimizer.step()

    # Validation loop (only forward pass)
    model_gru.eval()  # Set the model to evaluation mode
    with torch.inference_mode():  # No need to calculate gradients
        outputs = model_gru(X_val_gru)
        val_loss = criterion(outputs, y_val_gru)

    # Print losses for this epoch (train loss is that of the last mini-batch only)
    print(f'Epoch {epoch+1}/{num_epochs}, Train Loss: {train_loss:.4f}, Validation Loss: {val_loss:.4f}')

# Test set evaluation (only forward pass)
model_gru.eval()  # Set the model to evaluation mode
with torch.inference_mode():
    outputs = model_gru(X_test_gru)
    test_loss = criterion(outputs, y_test_gru)
print(f'Test Loss: {test_loss:.4f}')

Epoch 1/10, Train Loss: 0.6869, Validation Loss: 0.6937
Epoch 2/10, Train Loss: 0.6911, Validation Loss: 0.6942
Epoch 3/10, Train Loss: 0.6670, Validation Loss: 0.6944
Epoch 4/10, Train Loss: 0.6926, Validation Loss: 0.6954
Epoch 5/10, Train Loss: 0.6752, Validation Loss: 0.6967
Epoch 6/10, Train Loss: 0.6749, Validation Loss: 0.6967
Epoch 7/10, Train Loss: 0.6996, Validation Loss: 0.6968
Epoch 8/10, Train Loss: 0.6883, Validation Loss: 0.6981
Epoch 9/10, Train Loss: 0.6905, Validation Loss: 0.6986
Epoch 10/10, Train Loss: 0.6813, Validation Loss: 0.7007
Test Loss: 0.6973


#### Understanding the GRU model syntax

1. **Defining the model class**: Like the FFNN, RNN, and LSTM models, the GRU model is defined as a class that inherits from `nn.Module`. This encapsulates the model's architecture and behavior, making it modular and reusable.
    ```python
    class GRUModel(nn.Module):
        def __init__(self, input_size, hidden_size, output_size, num_layers):
            super(GRUModel, self).__init__()
            # Define layers and operations here
    ```
    - **`super(GRUModel, self).__init__()`**: This line calls the parent class constructor, initializing the model so that PyTorch can manage the layers and parameters properly.

2. **Defining layers and operations**:
    - **GRU layer**: The core component of a GRU model is the GRU layer, defined using the `nn.GRU` module.
        ```python
        self.gru = nn.GRU(input_size=input_size, hidden_size=hidden_size, num_layers=num_layers, batch_first=True)
        ```
        - **Input and output dimensions**:
            - `input_size`: Defines the number of input features at each time step.
            - `hidden_size`: Determines the size of the hidden state (the number of neurons in the hidden layers).
            - `num_layers`: Specifies how many GRU layers are stacked on top of each other.
            - `batch_first=True`: Specifies that the input and output tensors are organized with the batch size as the first dimension, followed by time steps and features `(batch_size, sequence_length, input_size)`.
    - **Fully connected layer**: After processing the sequence with the GRU, the output from the last time step is passed through a fully connected layer to produce the final output.
        ```python
        self.fc = nn.Linear(hidden_size, output_size)
        ```
        - **Final output**: The `output_size` here is 1, indicating a binary output (for binary classification). The output of the GRU layer at the final time step is transformed by the fully connected layer into the final prediction.
    - **Activation function**: A sigmoid activation function (`nn.Sigmoid()`) is applied to the output layer to squash the output between 0 and 1, suitable for binary classification.

3. **Defining the forward pass**:
    - **Initial hidden state**: Before the sequence is processed by the GRU, an initial hidden state `h0` is defined. This hidden state is typically initialized to zeros. In GRUs, we do not need to define or manage a separate cell state as we do in LSTMs. The GRU architecture simplifies the LSTM by combining the cell state and hidden state into a single hidden state. 
        ```python
        h0 = torch.zeros(self.gru.num_layers, x.size(0), self.gru.hidden_size)
        ```
    - **Processing the sequence and GRU output**: The input sequence `x` is processed by the GRU layer, which updates the hidden state at each time step.
        ```python
        out, _ = self.gru(x, h0)
        ```
        - **Sequence output**: The GRU layer returns two outputs:
            - `out`: Contains the output from each time step of the sequence for each batch.
            - `hidden`: Contains the final hidden state for each sequence in the batch. While this hidden state can be useful for tasks like sequence prediction, we do not use it here.
        - **Final time step output**: The GRU outputs a tensor for each time step. Here, we are only interested in the output from the last time step (`-1`), which is used as input to the fully connected layer. This captures the feature representation of the entire sequence.
        ```python
        out = self.fc(out[:, -1, :])
        ```
    - **Output layer**: The output is passed through the sigmoid activation function to produce the final prediction.
        ```python
        out = self.sigmoid(out)
        return out
        ```
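
One way to see the "simpler architecture" claim in numbers is to count the learnable parameters of the three recurrent layers at the same size. This is a sketch with a hypothetical helper `count_params`, using the notebook's `input_size=20` and `hidden_size=64`.

```python
import torch.nn as nn

def count_params(module):
    """Total number of learnable parameters in a module."""
    return sum(p.numel() for p in module.parameters())

kwargs = dict(input_size=20, hidden_size=64, num_layers=1, batch_first=True)

rnn_params = count_params(nn.RNN(**kwargs))
gru_params = count_params(nn.GRU(**kwargs))
lstm_params = count_params(nn.LSTM(**kwargs))

# One set of weights in a plain RNN: 64*20 + 64*64 + 64 + 64 = 5504
print('RNN :', rnn_params)
# A GRU has three such sets (reset gate, update gate, candidate state)
print('GRU :', gru_params)
# An LSTM has four (input, forget, and output gates, plus the cell candidate)
print('LSTM:', lstm_params)
```

At equal hidden size, the GRU therefore carries three quarters of the LSTM's recurrent-layer parameters.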
        
---

### Bidirectional RNN (Bi-RNN)
Bidirectional RNNs process the input data in both forward and backward directions, capturing context from both ends of the sequence. Key components:

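- **Bidirectional layer**: Runs two RNN passes over the sequence, one in each direction, and concatenates their outputs at every time step. In PyTorch this is enabled by passing `bidirectional=True` to the recurrent layer.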
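
The sketch below inspects what a bidirectional `nn.RNN` returns, assuming the sizes used throughout this notebook and an arbitrary batch of 8. It shows that the forward direction finishes at the last time step while the backward direction finishes at the first. Concatenating the two final states, as in `summary` here, is one common alternative to indexing the last time step; it is shown only for illustration.

```python
import torch
import torch.nn as nn

hidden_size = 64
birnn = nn.RNN(input_size=20, hidden_size=hidden_size, num_layers=1,
               batch_first=True, bidirectional=True)
x = torch.rand(8, 10, 20)  # (batch_size, sequence_length, input_size)

out, h_n = birnn(x)
print(out.shape)  # (8, 10, 128): forward and backward outputs concatenated
print(h_n.shape)  # (2, 8, 64): one final state per direction

# Forward direction: final state is found at the LAST time step
forward_final = out[:, -1, :hidden_size]
# Backward direction: final state is found at the FIRST time step
backward_final = out[:, 0, hidden_size:]

print(torch.allclose(forward_final, h_n[0]))
print(torch.allclose(backward_final, h_n[1]))

# Summary vector in which both halves have seen the whole sequence
summary = torch.cat([h_n[0], h_n[1]], dim=1)
print(summary.shape)  # (8, 128)
```

By contrast, the backward half of `out[:, -1, :]` has processed only the final time step of the input.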

In [6]:
# Generate synthetic sequential data
X_birnn = torch.rand(1000, 10, 20)  # 1000 sequences, each with 10 timesteps, each timestep with 20 features
y_birnn = torch.randint(0, 2, (1000, 1)).float()  # Binary targets

# Split the data
X_train_birnn, X_val_test_birnn, y_train_birnn, y_val_test_birnn = train_test_split(X_birnn, y_birnn, test_size=0.4)
X_val_birnn, X_test_birnn, y_val_birnn, y_test_birnn = train_test_split(X_val_test_birnn, y_val_test_birnn, test_size=0.5)

# Define the Bidirectional RNN model
class BiRNNModel(nn.Module):
    def __init__(self, input_size, hidden_size, output_size, num_layers):
        super(BiRNNModel, self).__init__()
        self.birnn = nn.RNN(input_size=input_size, hidden_size=hidden_size, num_layers=num_layers, batch_first=True, bidirectional=True)
        self.fc = nn.Linear(hidden_size * 2, output_size)  # Multiply by 2 because of bidirectional outputs
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        h0 = torch.zeros(self.birnn.num_layers * 2, x.size(0), self.birnn.hidden_size)  # Multiply by 2 for bidirectional
        out, _ = self.birnn(x, h0)  # Bidirectional RNN layer returns all outputs and hidden state
        out = self.fc(out[:, -1, :])  # Take the output from the last time step and pass it through the fully connected layer
        out = self.sigmoid(out)  # Apply sigmoid activation
        return out

model_birnn = BiRNNModel(input_size=20, hidden_size=64, output_size=1, num_layers=1)

# Define loss function and optimizer
criterion = nn.BCELoss()
optimizer = optim.Adam(model_birnn.parameters(), lr=0.001)

# Create DataLoader for training
train_dataset = TensorDataset(X_train_birnn, y_train_birnn)
train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)

# Training loop
num_epochs = 10
for epoch in range(num_epochs):
    model_birnn.train()  # Set the model to training mode

    # Training loop (forward pass and backward pass)
    for X_batch, y_batch in train_loader:
        optimizer.zero_grad()  # Clear gradients
        
        # Forward pass
        outputs = model_birnn(X_batch)
        train_loss = criterion(outputs, y_batch)
        
        # Backward pass
        train_loss.backward()
        optimizer.step()

    # Validation loop (only forward pass)
    model_birnn.eval()  # Set the model to evaluation mode
    with torch.inference_mode():  # No need to calculate gradients
        outputs = model_birnn(X_val_birnn)
        val_loss = criterion(outputs, y_val_birnn)

    # Print losses for this epoch
    print(f'Epoch {epoch+1}/{num_epochs}, Train Loss: {train_loss:.4f}, Validation Loss: {val_loss:.4f}')

# Test set evaluation (only forward pass)
model_birnn.eval()  # Set the model to evaluation mode
with torch.inference_mode():
    outputs = model_birnn(X_test_birnn)
    test_loss = criterion(outputs, y_test_birnn)
print(f'Test Loss: {test_loss:.4f}')

Epoch 1/10, Train Loss: 0.6905, Validation Loss: 0.7051
Epoch 2/10, Train Loss: 0.6930, Validation Loss: 0.6953
Epoch 3/10, Train Loss: 0.6908, Validation Loss: 0.7029
Epoch 4/10, Train Loss: 0.6823, Validation Loss: 0.7127
Epoch 5/10, Train Loss: 0.7147, Validation Loss: 0.7415
Epoch 6/10, Train Loss: 0.5681, Validation Loss: 0.7439
Epoch 7/10, Train Loss: 0.6522, Validation Loss: 0.7677
Epoch 8/10, Train Loss: 0.5624, Validation Loss: 0.7980
Epoch 9/10, Train Loss: 0.5081, Validation Loss: 0.7914
Epoch 10/10, Train Loss: 0.5804, Validation Loss: 0.8259
Test Loss: 0.8079


#### Understanding the bidirectional RNN model syntax

1. **Defining the model class**: As with the previous models (FFNN, RNN, LSTM, GRU), the bidirectional RNN model is defined as a class that inherits from `nn.Module`. This encapsulates the model’s architecture and operations.
    ```python
    class BiRNNModel(nn.Module):
        def __init__(self, input_size, hidden_size, output_size, num_layers):
            super(BiRNNModel, self).__init__()
            # Define layers and operations here
    ```
    - **`super(BiRNNModel, self).__init__()`**: Calls the parent class constructor, initializing the model so that PyTorch can manage the layers and parameters properly.

2. **Defining layers and operations**:
    - **Bidirectional RNN layer**: The core of the model is the bidirectional RNN layer, which can be defined using the `nn.RNN` module with the `bidirectional=True` argument.
        ```python
        self.birnn = nn.RNN(input_size=input_size, hidden_size=hidden_size, num_layers=num_layers, batch_first=True, bidirectional=True)
        ```
        - **Input and output dimensions**:
            - `input_size`: The number of input features at each time step.
            - `hidden_size`: The size of the hidden state (the number of neurons in the hidden layers). In a bidirectional RNN, the hidden state from both directions is combined, effectively doubling the output size.
            - `num_layers`: The number of RNN layers stacked on top of each other.
            - `batch_first=True`: Specifies that the input and output tensors are organized with the batch size as the first dimension, followed by time steps and features `(batch_size, sequence_length, input_size)`.
            - `bidirectional=True`: Specifies that the RNN should be bidirectional.
    - **Fully connected layer**: After processing the sequence with the bidirectional RNN, the output from the last time step is passed through a fully connected layer to produce the final output. Since the output from the bidirectional RNN includes information from both directions, the size of the input to the fully connected layer is `hidden_size * 2`.
        ```python
        self.fc = nn.Linear(hidden_size * 2, output_size)
        ```
        - **Final output**: The `output_size` here is 1, indicating a binary output (for binary classification).
    - **Activation function**: A sigmoid activation function (`nn.Sigmoid()`) is applied to the output layer to squash the output between 0 and 1, suitable for binary classification.

3. **Defining the forward pass**:
    - **Initial hidden state**: Before the sequence is processed by the bidirectional RNN, an initial hidden state `h0` is defined. This hidden state is typically initialized to zeros. Its first dimension is `num_layers * 2` because a bidirectional RNN keeps one hidden state per layer for each of the two directions (forward and backward); the last dimension stays `hidden_size`.
        ```python
        h0 = torch.zeros(self.birnn.num_layers * 2, x.size(0), self.birnn.hidden_size)
        ```
    - **Processing the sequence and bidirectional RNN output**: The input sequence `x` is processed by the bidirectional RNN layer, which updates the hidden state at each time step.
        ```python
        out, _ = self.birnn(x, h0)
        ```
        - **Sequence output**: The bidirectional RNN layer returns two outputs:
            - `out`: Contains the output from each time step of the sequence for each batch. The output from both the forward and backward RNNs is concatenated along the feature dimension.
            - `hidden`: Contains the final hidden state for each sequence in the batch.
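        - **Final time step output**: The output from the last time step is selected from the sequence and passed through the fully connected layer to produce the final prediction. Note that at this position the forward half of the features has processed the whole sequence, while the backward half has processed only the final element.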
        ```python
        out = self.fc(out[:, -1, :])
        ```
    - **Output layer**: The output is passed through the sigmoid activation function to produce the final prediction.
        ```python
        out = self.sigmoid(out)
        return out
        ```
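
To make the layout of the bidirectional outputs concrete, here is a small sketch that inspects the shapes and splits `out` into its forward and backward halves. It assumes a single-layer `nn.RNN` and illustrative sizes; the names `forward_out`, `backward_out`, and `summary` are ours. The concatenated `summary` at the end is one possible alternative to `out[:, -1, :]`, not what the model above uses.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Illustrative sizes (assumed)
batch_size, seq_len, input_size, hidden_size = 4, 10, 20, 64

birnn = nn.RNN(input_size=input_size, hidden_size=hidden_size, num_layers=1,
               batch_first=True, bidirectional=True)
x = torch.rand(batch_size, seq_len, input_size)

out, hidden = birnn(x)  # h0 defaults to zeros when it is omitted

print(out.shape)     # (batch_size, seq_len, 2 * hidden_size)
print(hidden.shape)  # (num_layers * 2, batch_size, hidden_size)

# The feature dimension of `out` is [forward features | backward features]
forward_out = out[:, :, :hidden_size]
backward_out = out[:, :, hidden_size:]

# The forward direction finishes at the last time step,
# the backward direction finishes at the first time step.
print(torch.allclose(forward_out[:, -1, :], hidden[0], atol=1e-6))
print(torch.allclose(backward_out[:, 0, :], hidden[1], atol=1e-6))

# One alternative sequence summary: concatenate both final hidden states
summary = torch.cat([hidden[0], hidden[1]], dim=1)
print(summary.shape)  # (batch_size, 2 * hidden_size)
```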

**Note**: For bidirectional LSTM or bidirectional GRU models, we can replace the RNN layer with an LSTM or GRU layer, respectively, by setting `nn.LSTM` or `nn.GRU` in place of `nn.RNN`, and similarly enabling bidirectionality with `bidirectional=True`.
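
As a rough illustration of that swap, the sketch below adapts the model to a bidirectional LSTM. The class name `BiLSTMModel` and the attribute name `bilstm` are hypothetical. The main difference from the RNN version is that an LSTM expects its initial state as a tuple `(h0, c0)`, holding the hidden state and the cell state.

```python
import torch
import torch.nn as nn

class BiLSTMModel(nn.Module):  # hypothetical name, mirrors BiRNNModel above
    def __init__(self, input_size, hidden_size, output_size, num_layers):
        super().__init__()
        self.bilstm = nn.LSTM(input_size=input_size, hidden_size=hidden_size,
                              num_layers=num_layers, batch_first=True, bidirectional=True)
        self.fc = nn.Linear(hidden_size * 2, output_size)  # two directions
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        # One hidden state and one cell state per layer and per direction
        state_shape = (self.bilstm.num_layers * 2, x.size(0), self.bilstm.hidden_size)
        h0 = torch.zeros(state_shape)
        c0 = torch.zeros(state_shape)
        out, _ = self.bilstm(x, (h0, c0))  # second return value is the tuple (h_n, c_n)
        out = self.fc(out[:, -1, :])
        return self.sigmoid(out)

model_bilstm = BiLSTMModel(input_size=20, hidden_size=64, output_size=1, num_layers=1)
probs = model_bilstm(torch.rand(8, 10, 20))
print(probs.shape)  # (batch_size, output_size)
```

A bidirectional GRU follows the RNN version more closely, since `nn.GRU` takes a single `h0` tensor rather than a tuple.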

---

### Nested RNN
A nested RNN, also known as a stacked RNN, is a deep RNN architecture where multiple RNN layers are stacked on top of each other. This architecture is particularly useful for capturing hierarchical patterns in the data, as the nested structure allows the model to learn both fine-grained and coarse-grained temporal dependencies by having each RNN layer process the output from the previous RNN layer. Key components include:
- **Nested layer**: Combines multiple RNN cells in a hierarchical manner, where each RNN cell processes different levels of temporal abstraction.

In [7]:
# Generate synthetic sequential data
X_nested_rnn = torch.rand(1000, 10, 20)  # 1000 sequences, each with 10 timesteps, each timestep with 20 features
y_nested_rnn = torch.randint(0, 2, (1000, 1)).float()  # Binary targets

# Split the data
X_train_nested_rnn, X_val_test_nested_rnn, y_train_nested_rnn, y_val_test_nested_rnn = train_test_split(X_nested_rnn, y_nested_rnn, test_size=0.4)
X_val_nested_rnn, X_test_nested_rnn, y_val_nested_rnn, y_test_nested_rnn = train_test_split(X_val_test_nested_rnn, y_val_test_nested_rnn, test_size=0.5)

# Define the Nested RNN model
class NestedRNNModel(nn.Module):
    def __init__(self, input_size, hidden_size, output_size, num_layers):
        super(NestedRNNModel, self).__init__()
        self.layers = nn.ModuleList([nn.RNN(input_size=input_size if i == 0 else hidden_size, 
                                            hidden_size=hidden_size, 
                                            num_layers=1, 
                                            batch_first=True) 
                                     for i in range(num_layers)])
        self.fc = nn.Linear(hidden_size, output_size)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        for rnn in self.layers:
            h0 = torch.zeros(1, x.size(0), rnn.hidden_size)  # Initialize hidden state with zeros for each layer
            out, _ = rnn(x, h0)  # Process the sequence through the current RNN layer
            x = out  # The output of the current layer is used as input to the next layer
            
        out = self.fc(out[:, -1, :])  # Take the output from the last time step and pass it through the fully connected layer
        out = self.sigmoid(out)  # Apply sigmoid activation
        return out

model_nested_rnn = NestedRNNModel(input_size=20, hidden_size=64, output_size=1, num_layers=3)

# Define loss function and optimizer
criterion = nn.BCELoss()
optimizer = optim.Adam(model_nested_rnn.parameters(), lr=0.001)

# Create DataLoader for training
train_dataset = TensorDataset(X_train_nested_rnn, y_train_nested_rnn)
train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)

# Training loop
num_epochs = 10
for epoch in range(num_epochs):
    model_nested_rnn.train()  # Set the model to training mode

    # Training loop (forward pass and backward pass)
    for X_batch, y_batch in train_loader:
        optimizer.zero_grad()  # Clear gradients
        
        # Forward pass
        outputs = model_nested_rnn(X_batch)
        train_loss = criterion(outputs, y_batch)
        
        # Backward pass
        train_loss.backward()
        optimizer.step()

    # Validation loop (only forward pass)
    model_nested_rnn.eval()  # Set the model to evaluation mode
    with torch.inference_mode():  # No need to calculate gradients
        outputs = model_nested_rnn(X_val_nested_rnn)
        val_loss = criterion(outputs, y_val_nested_rnn)

    # Print losses for this epoch
    print(f'Epoch {epoch+1}/{num_epochs}, Train Loss: {train_loss:.4f}, Validation Loss: {val_loss:.4f}')

# Test set evaluation (only forward pass)
model_nested_rnn.eval()  # Set the model to evaluation mode
with torch.inference_mode():
    outputs = model_nested_rnn(X_test_nested_rnn)
    test_loss = criterion(outputs, y_test_nested_rnn)
print(f'Test Loss: {test_loss:.4f}')

Epoch 1/10, Train Loss: 0.6993, Validation Loss: 0.6949
Epoch 2/10, Train Loss: 0.6940, Validation Loss: 0.6949
Epoch 3/10, Train Loss: 0.6808, Validation Loss: 0.7052
Epoch 4/10, Train Loss: 0.5973, Validation Loss: 0.7273
Epoch 5/10, Train Loss: 0.6159, Validation Loss: 0.7634
Epoch 6/10, Train Loss: 0.6772, Validation Loss: 0.7689
Epoch 7/10, Train Loss: 0.7156, Validation Loss: 0.7995
Epoch 8/10, Train Loss: 0.7415, Validation Loss: 0.8207
Epoch 9/10, Train Loss: 0.5549, Validation Loss: 0.8040
Epoch 10/10, Train Loss: 0.7886, Validation Loss: 0.8679
Test Loss: 0.7556


#### Understanding the nested RNN model syntax

1. **Defining the model class**: The nested RNN model is defined as a class that inherits from `nn.Module`. This allows the model's architecture and operations to be encapsulated in a modular and reusable way.
    ```python
    class NestedRNNModel(nn.Module):
        def __init__(self, input_size, hidden_size, output_size, num_layers):
            super(NestedRNNModel, self).__init__()
            # Define layers and operations here
    ```
    - **`super(NestedRNNModel, self).__init__()`**: This line calls the parent class constructor, initializing the model so that PyTorch can manage the layers and parameters properly.

2. **Defining layers and operations**:
    - **Nested RNN layers**: The core of the model consists of several RNN layers stacked in a nested manner. The layers are stored in a `ModuleList`, which allows for the creation of a flexible and dynamic architecture.
        ```python
        self.layers = nn.ModuleList([nn.RNN(input_size=input_size if i == 0 else hidden_size, 
                                            hidden_size=hidden_size, 
                                            num_layers=1, 
                                            batch_first=True) 
                                     for i in range(num_layers)])
        ```
        - **Submodules list**: `nn.ModuleList` holds submodules (such as layers) in a list. It behaves like a Python list, but its main purpose is to register each element as a submodule of the model. This registration lets PyTorch automatically track the parameters of each submodule, making it easier to save, load, and update them during training. It is also useful when the number of layers or submodules in the model is determined dynamically (e.g., based on input parameters).
            ```python
            self.layers = nn.ModuleList([nn.Linear(input_size if i == 0 else hidden_size, hidden_size) for i in range(num_layers)])
            ```
                
            - `self.layers = nn.ModuleList([...])`: This illustrative snippet creates a `ModuleList` of linear layers (`nn.Linear`). The first layer has `input_size` input features, and the subsequent layers all have `hidden_size` input features. Our model applies the same pattern with `nn.RNN` layers: the first RNN layer takes `input_size` as its input, while subsequent layers take `hidden_size`. This is managed using the conditional `input_size=input_size if i == 0 else hidden_size`.
            - Comparison with `nn.Sequential`: `nn.ModuleList` is similar to `nn.Sequential`, but `nn.Sequential` automatically connects the layers in a sequence, passing the output of one layer as the input to the next. It is best suited for simple, linear models. `nn.ModuleList` does not automatically connect the layers. We have to define how the data flows through the layers manually. This allows for more complex and flexible model architectures, such as models with branching or shared layers.


        - **Input and output dimensions**:
            - `input_size`: The number of input features at each time step. The first RNN layer uses the raw input size, while subsequent layers use the hidden size of the previous layer.
            - `hidden_size`: The size of the hidden state (the number of neurons in the hidden layers).
            - `num_layers`: The number of nested RNN layers in the model.
            - `batch_first=True`: Specifies that the input and output tensors are organized with the batch size as the first dimension, followed by time steps and features `(batch_size, sequence_length, input_size)`.
    - **Fully connected layer**: After processing the sequence through the nested RNN layers, the output from the last time step is passed through a fully connected layer to produce the final output.
        ```python
        self.fc = nn.Linear(hidden_size, output_size)
        ```
        - **Final output**: The `output_size` here is 1, indicating a binary output (for binary classification).
    - **Activation function**: A sigmoid activation function (`nn.Sigmoid()`) is applied to the output layer to squash the output between 0 and 1, suitable for binary classification.

3. **Defining the forward pass**:
    - **Processing the sequence through nested RNN layers**: We can iterate over the layers in a `ModuleList` using a loop. This is particularly useful in the `forward` method when we want to pass the input through each layer sequentially. The input sequence `x` is processed through each RNN layer in the model. The output of one layer is used as the input to the next layer, allowing the model to capture complex hierarchical dependencies.
        ```python
        for rnn in self.layers:
            h0 = torch.zeros(1, x.size(0), rnn.hidden_size)  # Initialize hidden state with zeros for each layer
            out, _ = rnn(x, h0)  # Process the sequence through the current RNN layer
            x = out  # The output of the current layer is used as input to the next layer
        ```
    - **Final time step output**: The output from the last time step of the last RNN layer is passed through the fully connected layer to produce the final prediction.
        ```python
        out = self.fc(out[:, -1, :])
        ```
    - **Output layer**: The output is passed through the sigmoid activation function to produce the final prediction.
        ```python
        out = self.sigmoid(out)
        return out
        ```
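
The registration behaviour of `nn.ModuleList` discussed in step 2 can be seen by comparing it with a plain Python list. This is a minimal sketch; the class names `WithPlainList` and `WithModuleList` are hypothetical and exist only for the comparison. An `nn.Sequential` is included to show the automatic chaining mentioned above.

```python
import torch
import torch.nn as nn

class WithPlainList(nn.Module):  # hypothetical: layers kept in a plain Python list
    def __init__(self):
        super().__init__()
        self.layers = [nn.Linear(20, 64), nn.Linear(64, 64)]

class WithModuleList(nn.Module):  # hypothetical: layers kept in an nn.ModuleList
    def __init__(self):
        super().__init__()
        self.layers = nn.ModuleList([nn.Linear(20, 64), nn.Linear(64, 64)])

# Layers in a plain list are not registered, so the model reports no parameters
n_plain = len(list(WithPlainList().parameters()))
# Layers in a ModuleList are registered: a weight and a bias for each of the two layers
n_module = len(list(WithModuleList().parameters()))
print(n_plain, n_module)

# nn.Sequential chains its layers automatically, so it can be called directly
seq = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 1))
seq_out = seq(torch.rand(3, 20))
print(seq_out.shape)  # (3, 1)
```

An optimizer built from `WithPlainList().parameters()` would have nothing to update, which is the practical reason to prefer `nn.ModuleList` for dynamically built layer lists.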
        
**Note**: For nested LSTM or nested GRU models, we can replace the RNN layer with an LSTM or GRU layer, respectively, by setting `nn.LSTM` or `nn.GRU` in place of `nn.RNN`.
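
Besides swapping the cell type, the hand-stacked layers can be compared with the built-in `num_layers` argument of `nn.RNN`. The sketch below is an illustration under assumed sizes, with a hypothetical helper `count_parameters`. It shows that both variants have the same number of parameters and the same output shape. Their weights are initialized independently, so the output values differ.

```python
import torch
import torch.nn as nn

input_size, hidden_size, num_layers = 20, 64, 3

# Hand-stacked single-layer RNNs, as in NestedRNNModel
stacked = nn.ModuleList([
    nn.RNN(input_size if i == 0 else hidden_size, hidden_size, num_layers=1, batch_first=True)
    for i in range(num_layers)
])

# Built-in stacking through the num_layers argument
builtin = nn.RNN(input_size, hidden_size, num_layers=num_layers, batch_first=True)

def count_parameters(module):  # hypothetical helper
    return sum(p.numel() for p in module.parameters())

x = torch.rand(8, 10, input_size)

stacked_out = x
for rnn in stacked:
    stacked_out, _ = rnn(stacked_out)  # h0 defaults to zeros when omitted

builtin_out, _ = builtin(x)

print(count_parameters(stacked) == count_parameters(builtin))
print(stacked_out.shape, builtin_out.shape)
```

The manual loop is mainly worthwhile when something has to happen between layers, such as a residual connection or a different hidden size per layer; otherwise `num_layers` is the shorter route.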

---

### Convolutional neural networks (CNNs)
CNNs are designed to process visual data, such as images, though they can also be applied to other types of data. CNNs are particularly effective at capturing spatial hierarchies in images by using convolutional layers, pooling layers, and fully connected layers. Key components:
- **Convolutional layers**: Extract features from input data.
- **Pooling layers**: Reduce the spatial dimensions of the data.
- **Flatten layers**: Transform the 2D (or higher) data into a 1D vector.

In [8]:
# Generate synthetic data
X_cnn = torch.rand(1000, 1, 28, 28)  # 1000 images, 1 channel (grayscale), 28x28 pixels
y_cnn = torch.randint(0, 10, (1000,)).long()  # 10 classes

# Split the data
X_train_cnn, X_val_test_cnn, y_train_cnn, y_val_test_cnn = train_test_split(X_cnn, y_cnn, test_size=0.4)
X_val_cnn, X_test_cnn, y_val_cnn, y_test_cnn = train_test_split(X_val_test_cnn, y_val_test_cnn, test_size=0.5)

# Define the CNN model
class CNNModel(nn.Module):
    def __init__(self):
        super(CNNModel, self).__init__()

        # Convolutional Block 1
        self.conv1 = nn.Conv2d(in_channels=1, out_channels=16, kernel_size=3, padding=1)
        self.pool1 = nn.MaxPool2d(kernel_size=2, stride=2)
        
        # Convolutional Block 2
        self.conv2 = nn.Conv2d(in_channels=16, out_channels=32, kernel_size=3, padding=1)
        self.pool2 = nn.AvgPool2d(kernel_size=2, stride=2)
        
        # Convolutional Block 3 (Using Dilated Convolution)
        self.conv3 = nn.Conv2d(in_channels=32, out_channels=64, kernel_size=3, padding=2, dilation=2)
        self.pool3 = nn.AdaptiveMaxPool2d((4, 4))
        
        # Flatten layer
        self.flatten = nn.Flatten()
        
        # Fully connected layers
        self.fc1 = nn.Linear(64 * 4 * 4, 128)
        self.fc2 = nn.Linear(128, 64)
        self.fc3 = nn.Linear(64, 10)
        
        # Activation function
        self.relu = nn.ReLU()
        
    def forward(self, x):
        # Forward pass through convolutional blocks
        x = self.pool1(self.relu(self.conv1(x)))
        x = self.pool2(self.relu(self.conv2(x)))
        x = self.pool3(self.relu(self.conv3(x)))
        
        # Flatten the output
        x = self.flatten(x)
        
        # Forward pass through fully connected layers
        x = self.relu(self.fc1(x))
        x = self.relu(self.fc2(x))
        x = self.fc3(x)  # No activation here since it's for classification (cross-entropy loss will be used)
        
        return x

# Instantiate the model
model_cnn = CNNModel()

# Define loss function and optimizer
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model_cnn.parameters(), lr=0.001)

# Create DataLoader for training
train_dataset = TensorDataset(X_train_cnn, y_train_cnn)
train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)

# Training loop
num_epochs = 10
for epoch in range(num_epochs):
    model_cnn.train()  # Set the model to training mode
    
    # Training loop (forward pass and backward pass)
    for X_batch, y_batch in train_loader:
        optimizer.zero_grad()  # Clear gradients
        
        # Forward pass
        outputs = model_cnn(X_batch)
        train_loss = criterion(outputs, y_batch)
        
        # Backward pass
        train_loss.backward()
        optimizer.step()

    # Validation loop (only forward pass)
    model_cnn.eval()  # Set the model to evaluation mode
    with torch.inference_mode():  # No need to calculate gradients
        outputs = model_cnn(X_val_cnn)
        val_loss = criterion(outputs, y_val_cnn)
    
    # Print losses for this epoch
    print(f'Epoch {epoch+1}/{num_epochs}, Train Loss: {train_loss:.4f}, Validation Loss: {val_loss:.4f}')

# Test set evaluation (only forward pass)
model_cnn.eval()  # Set the model to evaluation mode
with torch.inference_mode():
    outputs = model_cnn(X_test_cnn)
    test_loss = criterion(outputs, y_test_cnn)
print(f'Test Loss: {test_loss:.4f}')

Epoch 1/10, Train Loss: 2.2936, Validation Loss: 2.3006
Epoch 2/10, Train Loss: 2.2799, Validation Loss: 2.3005
Epoch 3/10, Train Loss: 2.3639, Validation Loss: 2.3068
Epoch 4/10, Train Loss: 2.2688, Validation Loss: 2.3003
Epoch 5/10, Train Loss: 2.3399, Validation Loss: 2.3043
Epoch 6/10, Train Loss: 2.2671, Validation Loss: 2.3029
Epoch 7/10, Train Loss: 2.2905, Validation Loss: 2.3036
Epoch 8/10, Train Loss: 2.3438, Validation Loss: 2.3032
Epoch 9/10, Train Loss: 2.3054, Validation Loss: 2.3011
Epoch 10/10, Train Loss: 2.2997, Validation Loss: 2.3036
Test Loss: 2.3052


#### Understanding the CNN model syntax

1. **Defining the model class**: Like the previous models, the CNN model is defined as a class inheriting from `nn.Module`. This allows PyTorch to manage the layers and operations within the model.
    ```python
    class CNNModel(nn.Module):
        def __init__(self):
            super(CNNModel, self).__init__()
    ```
    
2. **Defining convolutional blocks**: A convolutional block in a CNN typically consists of a sequence of two main components: a convolutional layer followed by a pooling layer. These blocks are the building blocks of CNNs and are stacked to progressively extract higher-level features from the input data, controlling the computational load and focusing on the most important features. The general syntax for defining a Convolutional Block in PyTorch is as follows:
    ```python
    self.conv = nn.Conv2d(in_channels, out_channels, kernel_size, stride=1, padding=0, dilation=1)
    self.pool = nn.PoolingType(kernel_size, stride)  # PoolingType is a placeholder, e.g. nn.MaxPool2d or nn.AvgPool2d
    ```
    - **Convolutional layer** (`nn.Conv2d`):
        - **`in_channels`**: The number of channels in the input data (e.g., 1 for grayscale images, 3 for RGB images).
        - **`out_channels`**: The number of filters applied by the convolutional layer, resulting in this many output feature maps.
        - **`kernel_size`**: The size of the convolutional kernel (e.g., `3x3`), which determines the area of the input that the filter will process at a time.
        - **`padding`**: Zero-padding added to the input on both sides to control the output size. Padding helps maintain the same spatial dimensions after convolution.
        - **`stride`**: The step size of the convolutional kernel as it moves across the input. By default, this is set to `1`.
        - **`dilation`**: The spacing between the kernel points, which allows the receptive field of the convolutional layer to increase without increasing the number of parameters.

        In the specific model examples:
        - **Standard convolution**: Used in the first two blocks, where `padding=1` and `kernel_size=3` with the default `stride=1`, so the output spatial dimensions are the same as the input dimensions.
        - **Dilated convolution**: Used in the third block with `dilation=2` and `padding=2`, expanding the receptive field to capture more contextual information without increasing the kernel size.

    - **Pooling layer**: Pooling layers down-sample the feature maps produced by the convolutional layer, reducing their spatial dimensions. This helps in reducing the computational complexity and mitigating overfitting by summarizing the presence of features in patches of the feature map. Different pooling methods used in the model include:
        - **`nn.MaxPool2d`**: Applies max pooling with a `2x2` kernel and `stride=2`. This pooling operation selects the maximum value from each `2x2` patch, effectively reducing the spatial dimensions by half.
        - **`nn.AvgPool2d`**: Instead of taking the maximum value, average pooling computes the average of each `2x2` patch, also with a stride of `2`. This pooling method is useful for smooth down-sampling and can help when the maximum operation is too aggressive.
        - **`nn.AdaptiveMaxPool2d`**: An adaptive pooling operation that outputs a fixed-size feature map, regardless of the input size. For example, in the third block, it reduces the feature map to a size of `4x4`, which is useful for ensuring consistent input size to the subsequent fully connected layers.

3. **Defining flatten layer** (`nn.Flatten`): This layer converts each sample's stack of 2D feature maps, of shape `(channels, height, width)`, into a 1D vector before feeding it into the fully connected layers, while keeping the batch dimension. This is necessary because fully connected layers expect a flat vector as input.
    ```python
    self.flatten = nn.Flatten()
    ```

4. **Defining fully connected layers** (`nn.Linear`): The fully connected layers reduce the dimensionality of the input step by step, eventually outputting `10` values, one for each class. The first layer takes input corresponding to the flattened vector from the convolutional output (`64` channels * `4x4` feature map).
    ```python
    self.fc1 = nn.Linear(64 * 4 * 4, 128)
    self.fc2 = nn.Linear(128, 64)
    self.fc3 = nn.Linear(64, 10)
    ```

5. **Activation function** (`nn.ReLU`): ReLU activation is applied after each layer (except the final output layer). It introduces non-linearity, enabling the network to learn more complex patterns.

6. **Defining the forward pass**:
    - **Convolutional blocks**: Each convolutional block processes the input sequentially, applying a convolution, activation, and pooling step. The convolutional layers extract hierarchical features from the input data, with each successive layer capturing increasingly complex patterns. This is followed by an activation function, such as ReLU, to introduce non-linearity into the model, allowing it to learn a more complex mapping from inputs to outputs.
        ```python
        x = self.pool(self.relu(self.conv(x)))
        ```
    - **Flattening**: After the last convolutional block, the output is flattened into a 1D vector.
        ```python
        x = self.flatten(x)
        ```
    - **Fully connected layers**: The flattened vector is passed through fully connected layers, eventually producing a final output vector with `10` elements, corresponding to the 10 possible classes.
        ```python
        x = self.fc(x)
        ```
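
To see how the spatial dimensions evolve through the three convolutional blocks, the following sketch traces a batch through freshly created layers with the same settings as `CNNModel`. The helper `conv_output_size` is hypothetical; it applies the output-size formula from the PyTorch `nn.Conv2d` documentation to one spatial dimension. The ReLU activations are left out because they do not change shapes.

```python
import torch
import torch.nn as nn

def conv_output_size(size, kernel_size, padding=0, stride=1, dilation=1):
    """Output length of one spatial dimension after a convolution (hypothetical helper)."""
    return (size + 2 * padding - dilation * (kernel_size - 1) - 1) // stride + 1

# The standard and dilated convolutions in the model all preserve the spatial size
print(conv_output_size(28, kernel_size=3, padding=1))              # 28
print(conv_output_size(14, kernel_size=3, padding=1))              # 14
print(conv_output_size(7, kernel_size=3, padding=2, dilation=2))   # 7

x = torch.rand(2, 1, 28, 28)  # 2 grayscale images of 28x28 pixels

# Block 1: convolution keeps 28x28, max pooling halves it to 14x14
x = nn.MaxPool2d(kernel_size=2, stride=2)(nn.Conv2d(1, 16, kernel_size=3, padding=1)(x))
shape_block1 = tuple(x.shape)   # (2, 16, 14, 14)

# Block 2: convolution keeps 14x14, average pooling halves it to 7x7
x = nn.AvgPool2d(kernel_size=2, stride=2)(nn.Conv2d(16, 32, kernel_size=3, padding=1)(x))
shape_block2 = tuple(x.shape)   # (2, 32, 7, 7)

# Block 3: dilated convolution keeps 7x7, adaptive pooling outputs a fixed 4x4
x = nn.AdaptiveMaxPool2d((4, 4))(nn.Conv2d(32, 64, kernel_size=3, padding=2, dilation=2)(x))
shape_block3 = tuple(x.shape)   # (2, 64, 4, 4)

# Flatten: 64 * 4 * 4 = 1024 features per image, matching the input size of fc1
x = nn.Flatten()(x)
shape_flat = tuple(x.shape)     # (2, 1024)

print(shape_block1, shape_block2, shape_block3, shape_flat)
```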

---

### Transformer model
The Transformer model is a type of neural network architecture that excels at processing sequential data. Unlike the recurrent models above, transformers rely solely on attention mechanisms to capture relationships between data points. Key components:
- **Multi-head self-attention**: Allows the model to focus on different parts of the input sequence simultaneously, capturing various aspects of the information.
- **Positional encoding**: Adds information about the position of elements in the sequence, since the model doesn’t inherently understand sequence order.
- **Feed-forward network (FFN)**: Applies a simple two-layer neural network to each position in the sequence independently.
- **Layer normalization**: Normalizes inputs to each sub-layer to stabilize and speed up training.
- **Residual connections**: Allows gradients to flow through the network directly, helping prevent vanishing gradients.
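
Several of these components can be combined by hand into a single encoder-style block. The sketch below is an illustration only, assuming illustrative sizes and a post-norm arrangement (layer normalization applied after each residual addition); it omits dropout and positional encoding, and it is not the internal implementation of `nn.Transformer`. All variable names are ours.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Illustrative sizes (assumed)
batch_size, seq_len, d_model, num_heads = 4, 10, 64, 4

attention = nn.MultiheadAttention(embed_dim=d_model, num_heads=num_heads, batch_first=True)
norm1 = nn.LayerNorm(d_model)
norm2 = nn.LayerNorm(d_model)
# Position-wise feed-forward network: two linear layers applied to each position independently
ffn = nn.Sequential(nn.Linear(d_model, 256), nn.ReLU(), nn.Linear(256, d_model))

x = torch.rand(batch_size, seq_len, d_model)

# Self-attention: queries, keys, and values all come from the same sequence
attn_out, attn_weights = attention(x, x, x)

# Residual connection followed by layer normalization
h = norm1(x + attn_out)

# Feed-forward sub-layer with its own residual connection and normalization
block_out = norm2(h + ffn(h))

print(block_out.shape)     # (batch_size, seq_len, d_model)
print(attn_weights.shape)  # (batch_size, seq_len, seq_len), averaged over heads by default
```

Each row of `attn_weights` sums to 1 and describes how strongly one position attends to every position in the sequence.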

In [9]:
# Generate synthetic sequential data
X_transformer = torch.rand(1000, 10, 20)  # 1000 sequences, each with 10 timesteps, each timestep with 20 features
y_transformer = torch.randint(0, 2, (1000, 1)).float()  # Binary targets

# Split the data
X_train_trans, X_val_test_trans, y_train_trans, y_val_test_trans = train_test_split(X_transformer, y_transformer, test_size=0.4)
X_val_trans, X_test_trans, y_val_trans, y_test_trans = train_test_split(X_val_test_trans, y_val_test_trans, test_size=0.5)


# Positional Encoding Class
class PositionalEncoding(nn.Module):
    def __init__(self, d_model, max_len=5000):
        super(PositionalEncoding, self).__init__()
        pe = torch.zeros(max_len, d_model)
        position = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)
        div_term = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        pe = pe.unsqueeze(0) # Shape becomes (1, max_len, d_model)
        self.register_buffer('pe', pe)

    def forward(self, x):
        x = x + self.pe[:, :x.size(1), :]  # Add the encodings for the first x.size(1) positions
        return x


# Define the Transformer model
class TransformerModel(nn.Module):
    def __init__(self, input_size, num_heads, hidden_size, num_encoder_layers, num_decoder_layers, output_size, dropout=0.1):
        super(TransformerModel, self).__init__()
        self.embedding = nn.Linear(input_size, hidden_size)
        self.pos_encoder = PositionalEncoding(hidden_size)
        self.transformer = nn.Transformer(
            d_model=hidden_size,
            nhead=num_heads,
            num_encoder_layers=num_encoder_layers,
            num_decoder_layers=num_decoder_layers,
            dim_feedforward=2048,
            dropout=dropout,
            batch_first=True
        )
        self.fc = nn.Linear(hidden_size, output_size)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        # Embed the input sequence
        x = self.embedding(x)
        x = self.pos_encoder(x)  # Positional encoding
        
        # Assume that the decoder input is the same as the encoder input for simplicity
        # Typically, for tasks like machine translation, the decoder input would be different.
        tgt = x  # For simplicity in this binary classification task, we use the same sequence as both the source and target
        
        # Apply the Transformer
        out = self.transformer(x, tgt)
        
        # Take the output from the last time step and pass it through the fully connected layer
        out = self.fc(out[:, -1, :])
        out = self.sigmoid(out)
        return out

model_transformer = TransformerModel(
    input_size=20,
    num_heads=4,
    hidden_size=64,
    num_encoder_layers=2,
    num_decoder_layers=2,
    output_size=1
)

# Define loss function and optimizer
criterion = nn.BCELoss()
optimizer = optim.Adam(model_transformer.parameters(), lr=0.001)

# Create DataLoader for training
train_dataset = TensorDataset(X_train_trans, y_train_trans)
train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)

# Training loop
num_epochs = 10
for epoch in range(num_epochs):
    model_transformer.train()  # Set the model to training mode

    # Training loop (forward pass and backward pass)
    for X_batch, y_batch in train_loader:
        optimizer.zero_grad()  # Clear gradients
        
        # Forward pass
        outputs = model_transformer(X_batch)
        train_loss = criterion(outputs, y_batch)
        
        # Backward pass
        train_loss.backward()
        optimizer.step()

    # Validation loop (only forward pass)
    model_transformer.eval()  # Set the model to evaluation mode
    with torch.inference_mode():  # No need to calculate gradients
        outputs = model_transformer(X_val_trans)
        val_loss = criterion(outputs, y_val_trans)

    # Print losses for this epoch
    print(f'Epoch {epoch+1}/{num_epochs}, Train Loss: {train_loss:.4f}, Validation Loss: {val_loss:.4f}')

# Test set evaluation (only forward pass)
model_transformer.eval()  # Set the model to evaluation mode
with torch.inference_mode():
    outputs = model_transformer(X_test_trans)
    test_loss = criterion(outputs, y_test_trans)
print(f'Test Loss: {test_loss:.4f}')

Epoch 1/10, Train Loss: 0.7558, Validation Loss: 0.7129
Epoch 2/10, Train Loss: 0.7491, Validation Loss: 0.7314
Epoch 3/10, Train Loss: 0.6803, Validation Loss: 0.7049
Epoch 4/10, Train Loss: 0.6806, Validation Loss: 0.6782
Epoch 5/10, Train Loss: 0.7007, Validation Loss: 0.6959
Epoch 6/10, Train Loss: 0.6955, Validation Loss: 0.7588
Epoch 7/10, Train Loss: 0.7060, Validation Loss: 0.6886
Epoch 8/10, Train Loss: 0.7007, Validation Loss: 0.6917
Epoch 9/10, Train Loss: 0.7225, Validation Loss: 0.6964
Epoch 10/10, Train Loss: 0.6975, Validation Loss: 0.6772
Test Loss: 0.7162


#### Understanding the Transformer model syntax

1. **Defining the positional encoding class**: The `nn.Transformer` module does not add positional encodings on its own, so we define a positional encoding class that controls how they are generated and applied. Positional encoding allows the Transformer to incorporate the order of the sequence, which it otherwise would not know because self-attention processes all positions in parallel.
    ```python
    class PositionalEncoding(nn.Module):
        def __init__(self, d_model, max_len=5000):
            super(PositionalEncoding, self).__init__()
            ...
    ```
    - **`__init__(self, d_model, max_len=5000)`**: The constructor initializes the positional encoding matrix.
    - **`d_model`**: Dimensionality of the model's hidden states, matching the size of each embedding vector.
    - **`max_len`**: Maximum length of the sequence that the model can handle; this could be tuned based on the dataset.
    - **Creating the positional encoding matrix**: It uses sine and cosine functions of different frequencies to generate a unique encoding for each position in the sequence.
        - `position` is a tensor containing positions from `0` to `max_len - 1`.
        - `div_term` controls the frequencies of the sine and cosine functions, so that each embedding dimension oscillates at a different rate and every position receives a distinct encoding vector.
        - `pe` stores the positional encodings, with even dimension indices filled with sine values and odd dimension indices with cosine values. This encoding scheme helps to differentiate positions in the sequence.
    - **`forward(self, x)`**: In the forward pass, the positional encoding tensor is added to the input embeddings (`x`). The positional encoding is broadcast to match the batch size and sequence length.
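
The construction described above can be sketched as follows. This is a minimal illustration and not necessarily the exact class used earlier in the notebook: it assumes batch-first inputs of shape `(batch_size, sequence_length, d_model)`, an even `d_model`, and the conventional base of 10000 for the frequencies. The class name `SinusoidalPositionalEncoding` is chosen here only for illustration.

```python
import math
import torch
import torch.nn as nn

class SinusoidalPositionalEncoding(nn.Module):
    """Illustrative sinusoidal positional encoding (assumes an even d_model)."""

    def __init__(self, d_model, max_len=5000):
        super().__init__()
        # Positions 0 .. max_len - 1 as a column vector: (max_len, 1)
        position = torch.arange(max_len).unsqueeze(1).float()
        # One inverse frequency per sine/cosine pair: (d_model / 2,)
        div_term = torch.exp(
            torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model)
        )
        pe = torch.zeros(max_len, d_model)
        pe[:, 0::2] = torch.sin(position * div_term)  # even dimensions
        pe[:, 1::2] = torch.cos(position * div_term)  # odd dimensions
        # A buffer moves with the module across devices but is not trained
        self.register_buffer("pe", pe.unsqueeze(0))  # (1, max_len, d_model)

    def forward(self, x):
        # Slice to the input's sequence length; broadcasting covers the batch
        return x + self.pe[:, : x.size(1), :]

enc = SinusoidalPositionalEncoding(d_model=64)
x = torch.zeros(32, 10, 64)  # zeros make the added encoding directly visible
out = enc(x)
print(out.shape)  # torch.Size([32, 10, 64])
```

At position 0 the sine dimensions are 0 and the cosine dimensions are 1, which is a quick sanity check on the even/odd layout.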


2. **Defining the transformer model class**: The Transformer model class integrates the different layers and operations required to process the input sequence and produce an output.
    - **Defining layers and operations**:
        - **Embedding layer**: Converts input features (e.g., tokens in NLP, or features in a time series) into a higher-dimensional space to be processed by the Transformer.
          ```python
          self.embedding = nn.Linear(input_size, hidden_size)
          ```
            - **`nn.Linear(input_size, hidden_size)`**: This linear layer takes the input features of size `input_size` and projects them into a space of size `hidden_size`. This is necessary because the Transformer model operates on vectors of a fixed dimension (i.e., `hidden_size`).
        - **Positional encoding layer**: The positional encoding layer is defined using the `PositionalEncoding` class. It adds positional information to the embedded input sequences.
            ```python
            self.pos_encoder = PositionalEncoding(hidden_size)
            ```
            - **`PositionalEncoding(hidden_size)`**: This initializes the positional encoding with the same dimension as the embedding layer (`hidden_size`). The positional encoding is added to the input embeddings to provide information about the order of elements in the sequence.
        - **Transformer layer**: The transformer layer is the core of the model. It consists of stacked encoder and decoder layers that process the input sequences. Layer normalization is handled inside each layer: by default (`norm_first=False`) it is applied after the residual connection of each attention and feedforward sub-layer, which helps stabilize training.
          ```python
          self.transformer = nn.Transformer(
              d_model=hidden_size,
              nhead=num_heads,
              num_encoder_layers=num_encoder_layers,
              num_decoder_layers=num_decoder_layers,
              dim_feedforward=2048,
              dropout=dropout,
              batch_first=True
          )
          ```
            - **`d_model=hidden_size`**: The dimensionality of the model's hidden states, matching the embedding size (also called `hidden_size`).
            - **`nhead`**: Number of heads in the multi-head attention mechanism, allowing the model to focus on different parts of the input sequence simultaneously.
            - **`num_encoder_layers`**: Number of encoder layers, each consisting of a self-attention mechanism followed by a feedforward neural network and layer normalization.
            - **`num_decoder_layers`**: Number of decoder layers, similar to encoder layers but with an additional attention mechanism that focuses on the encoder's output.
            - **`dim_feedforward`**: Dimensionality of the feedforward network inside the transformer layers. This feedforward network processes the output of the attention mechanism.
            - **`dropout`**: Dropout rate, which helps with regularization by randomly setting a fraction of the activations to zero during training.
            - **`batch_first=True`**: Ensures the input and output tensors are ordered with the batch size as the first dimension in the format `(batch_size, sequence_length, hidden_size)`.

        - **Fully connected layer**: The fully connected layer is applied after the Transformer layers. It maps the final hidden state of the sequence to the desired output size.
            ```python
            self.fc = nn.Linear(hidden_size, output_size)
            ```
            - **`nn.Linear(hidden_size, output_size)`**: This linear layer reduces the dimensionality from `hidden_size` to `output_size`. Here `output_size=1`, a single value for binary classification; in multi-class tasks it would be the number of target classes.
        - **Activation function** (`nn.Sigmoid()`): The activation function applied here is the sigmoid function, which is commonly used for binary classification tasks.

    - **Forward pass**: The forward pass defines how the input data moves through the network and gets transformed at each stage.
        - **Embedding**: The input data `x` is first passed through the embedding layer, which projects it into a higher-dimensional space suitable for processing by the Transformer model.
        - **Positional encoding**: Positional encoding is added to the embedded input sequence to give the model information about the order of the sequence.
          ```python
          x = self.pos_encoder(x)
          ```
        - **Transformer operation**: The input sequence, enriched with positional encodings, is passed through the Transformer. The model attends to different parts of the sequence to produce the final representation.
          ```python
          out = self.transformer(x, tgt)
          ```
        - **Final output**: The output from the last time step is passed through a fully connected layer and a sigmoid activation to produce the final prediction.
          ```python
          out = self.fc(out[:, -1, :])
          out = self.sigmoid(out)
          ```
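
To make the forward pass concrete, the sketch below traces tensor shapes through the same sequence of operations. The batch size, sequence length, and `dropout=0.1` are assumptions made for illustration, and the positional encoding step is left out because it is additive and does not change the shape.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Assumed sizes for illustration only
batch_size, seq_len, input_size, hidden_size = 32, 10, 20, 64

embedding = nn.Linear(input_size, hidden_size)
transformer = nn.Transformer(
    d_model=hidden_size,
    nhead=4,                 # must divide d_model evenly (64 / 4 = 16)
    num_encoder_layers=2,
    num_decoder_layers=2,
    dim_feedforward=2048,
    dropout=0.1,             # assumed value
    batch_first=True,
)
fc = nn.Linear(hidden_size, 1)

x = torch.rand(batch_size, seq_len, input_size)

with torch.no_grad():
    emb = embedding(x)               # (32, 10, 64)
    out = transformer(emb, emb)      # source and target are the same sequence
    last = out[:, -1, :]             # (32, 64): last time step only
    prob = torch.sigmoid(fc(last))   # (32, 1): values in (0, 1)

print(emb.shape, out.shape, last.shape, prob.shape)
```

With `batch_first=True`, the Transformer output has the same `(batch_size, sequence_length, hidden_size)` layout as its target input, which is why `out[:, -1, :]` selects the last time step for every sample in the batch.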

---

### Multimodal neural network (RNN + CNN)
A multimodal neural network can process different types of data simultaneously by combining different neural network architectures, such as RNNs for sequential data (e.g., time series, text) and CNNs for spatial data (e.g., images). These models work independently on their respective input types, and their outputs are then concatenated and processed together to make a final prediction. Key components:
- **RNN branch**: Handles sequential data, such as text or time series.
- **CNN branch**: Handles spatial data, such as images.
- **Concatenation**: Merges the outputs of the RNN and CNN models.
- **Fully connected layers**: Process the combined features from both branches to produce the final output.

In [10]:
# Generate synthetic sequential data for RNN
X_rnn = torch.rand(1000, 10, 8)  # 1000 sequences, 10 timesteps, 8 features
y = torch.randint(0, 3, (1000,))  # 3 classes for classification

# Generate synthetic image data for CNN
X_cnn = torch.rand(1000, 1, 28, 28)  # 1000 images, 1 channel (grayscale), 28x28 pixels

# Split the data
X_train_rnn, X_test_rnn, X_train_cnn, X_test_cnn, y_train, y_test = train_test_split(X_rnn, X_cnn, y, test_size=0.2)

# Define the RNN branch
class RNNBranch(nn.Module):
    def __init__(self, input_size, hidden_size, rnn_layers):
        super(RNNBranch, self).__init__()
        self.rnn = nn.RNN(input_size, hidden_size, rnn_layers, batch_first=True)
        self.fc = nn.Linear(hidden_size, 32)
        self.relu = nn.ReLU()

    def forward(self, x):
        h0 = torch.zeros(self.rnn.num_layers, x.size(0), self.rnn.hidden_size)
        out, _ = self.rnn(x, h0)
        out = self.fc(out[:, -1, :])
        out = self.relu(out)
        return out

# Define the CNN branch
class CNNBranch(nn.Module):
    def __init__(self):
        super(CNNBranch, self).__init__()
        self.conv1 = nn.Conv2d(1, 16, kernel_size=3, padding=1)
        self.pool = nn.MaxPool2d(kernel_size=2, stride=2)
        self.conv2 = nn.Conv2d(16, 32, kernel_size=3, padding=1)
        self.flatten = nn.Flatten()
        self.fc = nn.Linear(32 * 7 * 7, 32)
        self.relu = nn.ReLU()

    def forward(self, x):
        x = self.pool(self.relu(self.conv1(x)))
        x = self.pool(self.relu(self.conv2(x)))
        x = self.flatten(x)
        x = self.fc(x)
        x = self.relu(x)
        return x

# Define the Multimodal Neural Network (RNN + CNN)
class MultimodalNN(nn.Module):
    def __init__(self, rnn_input_size, rnn_hidden_size, rnn_layers):
        super(MultimodalNN, self).__init__()
        self.rnn_branch = RNNBranch(rnn_input_size, rnn_hidden_size, rnn_layers)
        self.cnn_branch = CNNBranch()
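        self.fc = nn.Linear(32 + 32, 3)  # Output size is 3 for classification
        self.softmax = nn.Softmax(dim=1)  # Only for converting logits to probabilities at inference time

    def forward(self, x_rnn, x_cnn):
        rnn_out = self.rnn_branch(x_rnn)
        cnn_out = self.cnn_branch(x_cnn)
        combined = torch.cat((rnn_out, cnn_out), dim=1)
        # Return raw logits: nn.CrossEntropyLoss applies log-softmax internally,
        # so applying softmax here as well would distort the loss
        out = self.fc(combined)
        return out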

# Instantiate the model
model = MultimodalNN(rnn_input_size=8, rnn_hidden_size=64, rnn_layers=1)

# Define loss function and optimizer
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)

# Create DataLoader for training
train_dataset = TensorDataset(X_train_rnn, X_train_cnn, y_train)
train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)

# Training loop
num_epochs = 10
for epoch in range(num_epochs):
    model.train()  # Set the model to training mode

    # Training loop (forward pass and backward pass)
    for X_batch_rnn, X_batch_cnn, y_batch in train_loader:
        optimizer.zero_grad()  # Clear gradients

        # Forward pass
        outputs = model(X_batch_rnn, X_batch_cnn)
        train_loss = criterion(outputs, y_batch)

        # Backward pass
        train_loss.backward()
        optimizer.step()

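    # Validation loop (only forward pass); the test split doubles as the
    # validation set here because no separate validation split was created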
    model.eval()  # Set the model to evaluation mode
    with torch.inference_mode():  # No need to calculate gradients
        val_outputs = model(X_test_rnn, X_test_cnn)
        val_loss = criterion(val_outputs, y_test)

    # Print losses for this epoch
    print(f'Epoch {epoch+1}/{num_epochs}, Train Loss: {train_loss:.4f}, Validation Loss: {val_loss:.4f}')

# Test set evaluation (only forward pass)
model.eval()  # Set the model to evaluation mode
with torch.inference_mode():
    test_outputs = model(X_test_rnn, X_test_cnn)
    test_loss = criterion(test_outputs, y_test)
print(f'Test Loss: {test_loss:.4f}')

Epoch 1/10, Train Loss: 1.0979, Validation Loss: 1.1002
Epoch 2/10, Train Loss: 1.0973, Validation Loss: 1.1009
Epoch 3/10, Train Loss: 1.0962, Validation Loss: 1.1017
Epoch 4/10, Train Loss: 1.0886, Validation Loss: 1.1013
Epoch 5/10, Train Loss: 1.1043, Validation Loss: 1.1026
Epoch 6/10, Train Loss: 1.0452, Validation Loss: 1.1035
Epoch 7/10, Train Loss: 1.0839, Validation Loss: 1.1037
Epoch 8/10, Train Loss: 1.0998, Validation Loss: 1.1041
Epoch 9/10, Train Loss: 1.0940, Validation Loss: 1.1132
Epoch 10/10, Train Loss: 1.0955, Validation Loss: 1.1096
Test Loss: 1.1096


#### Understanding the multimodal neural network syntax

1. **Defining the RNN and CNN branches**: To build a multimodal model in PyTorch, we create a separate neural network branch for each type of data (modality), such as an RNN for text and a CNN for images, and then combine their outputs into a single prediction. Each branch is defined as its own class, encapsulating its specific architecture and operations.
   - **Defining the RNN branch**: Handles sequential data like time series or text. The RNN branch is defined as a class that inherits from `nn.Module`.
        ```python
        class RNNBranch(nn.Module):
            def __init__(self, input_size, hidden_size, output_size, num_layers):
                super(RNNBranch, self).__init__()
                # Define layers and operations here
        ```
        - **`super(RNNBranch, self).__init__()`**: This line calls the parent class constructor, initializing the model so PyTorch can manage the layers and parameters properly.
        - **Defining layers and operations**:
            - **Recurrent layer** (`nn.RNN`): Processes the sequential data, updating the hidden state at each time step.
                ```python
                self.rnn = nn.RNN(input_size, hidden_size, num_layers, batch_first=True)
                ```
            - **Fully connected layer** (`nn.Linear`): After processing the sequence with the RNN, the output from the last time step is passed through a fully connected layer. It reduces the dimensionality of the RNN output.
                ```python
                self.fc = nn.Linear(hidden_size, output_size)
                ```
        - **Defining the forward pass**:
            - **Initial hidden state**: The RNN requires an initial hidden state, typically initialized to zeros.
                ```python
                h0 = torch.zeros(self.rnn.num_layers, x.size(0), self.rnn.hidden_size)
                ```
            - **Processing the sequence**: The input sequence is processed by the RNN, updating the hidden state at each time step.
                ```python
                out, _ = self.rnn(x, h0)
                ```
            - **Final output**: The output from the last time step is passed through the fully connected layer.
                ```python
                out = self.fc(out[:, -1, :])
                ```
   - **Defining the CNN branch**: Handles spatial data like images. The CNN branch is also defined as a class that inherits from `nn.Module`.
        ```python
        class CNNBranch(nn.Module):
            def __init__(self, input_channels, num_classes):
                super(CNNBranch, self).__init__()
                # Define layers and operations here
        ```
        - **Defining layers and operations**:
            - **Convolutional layers** (`nn.Conv2d`): The first convolutional layer extracts features from the input image, and a second convolutional layer further refines them.
                ```python
                self.conv1 = nn.Conv2d(input_channels, 16, kernel_size=3, padding=1)
                self.conv2 = nn.Conv2d(16, 32, kernel_size=3, padding=1)
                ```
            - **Pooling layer** (`nn.MaxPool2d`): Reduces the spatial dimensions of the feature maps.
                ```python
                self.pool = nn.MaxPool2d(kernel_size=2, stride=2)
                ```
            - **Flatten layer** (`nn.Flatten`): Converts the 2D feature maps into a 1D vector.
                ```python
                self.flatten = nn.Flatten()
                ```
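            - **Fully connected layer** (`nn.Linear`): Reduces the dimensionality of the CNN output. Its input size must match the flattened feature size, which is `32 * 7 * 7` for the 28x28 images and layers used above.
                ```python
                self.fc = nn.Linear(32 * 7 * 7, num_classes)
                ```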
        - **Defining the forward pass**:
            - **Convolutional and pooling blocks**: The input is processed through the convolutional and pooling layers, with a ReLU activation after each convolution.
                ```python
                x = self.pool(self.relu(self.conv1(x)))
                x = self.pool(self.relu(self.conv2(x)))
                ```
            - **Flattening**: Converts the output to a 1D vector per sample.
                ```python
                x = self.flatten(x)
                ```
            - **Fully connected layer**: The flattened vector is passed through the fully connected layer to produce the final feature vector.
                ```python
                x = self.relu(self.fc(x))
                ```
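
The shapes produced inside each branch can be checked with a short sketch. It reuses the layer sizes from the model above on a small, assumed batch of 4 samples, and calls the layers directly instead of wrapping them in classes.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

seq = torch.rand(4, 10, 8)        # (batch, timesteps, features)
img = torch.rand(4, 1, 28, 28)    # (batch, channels, height, width)

# RNN branch: when h0 is omitted, nn.RNN starts from a zero hidden state
rnn = nn.RNN(input_size=8, hidden_size=64, num_layers=1, batch_first=True)
out, h_n = rnn(seq)
print(out.shape)   # torch.Size([4, 10, 64]): one hidden state per time step
print(h_n.shape)   # torch.Size([1, 4, 64]): final hidden state per layer
last_step = out[:, -1, :]         # (4, 64)

# CNN branch: padding=1 keeps height and width, each pooling step halves them
conv1 = nn.Conv2d(1, 16, kernel_size=3, padding=1)
conv2 = nn.Conv2d(16, 32, kernel_size=3, padding=1)
pool = nn.MaxPool2d(kernel_size=2, stride=2)

x = pool(torch.relu(conv1(img)))  # (4, 16, 14, 14)
x = pool(torch.relu(conv2(x)))    # (4, 32, 7, 7)
flat = nn.Flatten()(x)            # (4, 1568), since 32 * 7 * 7 = 1568
print(flat.shape)  # torch.Size([4, 1568])
```

For a unidirectional RNN, `out[:, -1, :]` equals the final hidden state of the last layer (`h_n[-1]`), so either can feed the fully connected layer. The flattened size of 1568 is what the `in_features` of the CNN branch's linear layer has to match.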

2. **Creating the multimodal model**: Now that we have two branches (RNN and CNN), the next step is to combine them into a single model. This is done by creating another class that brings the outputs from both branches together and produces a final prediction.
    - **Defining the multimodal class**: The multimodal model is defined as a class that inherits from `nn.Module`.
        ```python
        class MultimodalModel(nn.Module):
            def __init__(self, rnn_params, cnn_params, final_output_size):
                super(MultimodalModel, self).__init__()
                # Define layers and operations here
        ```
        - **Initializing the RNN and CNN branches**: The RNN and CNN classes defined earlier are instantiated in the multimodal class.
            ```python
            self.rnn_branch = RNNBranch(**rnn_params)
            self.cnn_branch = CNNBranch(**cnn_params)
            ```
    - **Defining the forward pass**: The forward pass specifies how the RNN and CNN outputs are processed and combined.
        - **Processing the inputs**: Pass the sequential and spatial inputs through their respective branches.
            ```python
            rnn_output = self.rnn_branch(seq_input)
            cnn_output = self.cnn_branch(img_input)
            ```
        - **Combining the outputs** (`torch.cat()`): The outputs from the RNN and CNN branches are concatenated using `torch.cat()`. 
            ```python
            combined = torch.cat((rnn_output, cnn_output), dim=1)
            ```
        - **Final fully connected layer** (`nn.Linear`): The combined feature vector is then passed through a final fully connected layer, which reduces it to the desired output size (e.g., the number of classes).
        - **Final activation** (`nn.Softmax`): Converts the output logits into probabilities for each class, suitable for multi-class classification tasks. Because `nn.CrossEntropyLoss` already applies log-softmax internally, the loss should be computed on the raw logits, with softmax applied only when class probabilities are needed.
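
The interaction between the final layer, softmax, and the loss can be illustrated with a small sketch. The feature tensors and targets below are made-up values for a batch of 4 samples, and the variable names are chosen only for this example.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# Stand-ins for the 32-dimensional outputs of the two branches
rnn_features = torch.rand(4, 32)
cnn_features = torch.rand(4, 32)

# Concatenate along the feature dimension: (4, 32) + (4, 32) -> (4, 64)
combined = torch.cat((rnn_features, cnn_features), dim=1)

fc = nn.Linear(64, 3)
logits = fc(combined)                    # raw scores, shape (4, 3)
targets = torch.tensor([0, 2, 1, 0])     # made-up class labels

criterion = nn.CrossEntropyLoss()

# CrossEntropyLoss on logits equals log-softmax followed by NLL loss
loss_on_logits = criterion(logits, targets)
manual = F.nll_loss(F.log_softmax(logits, dim=1), targets)

# Probabilities are obtained separately, when they are needed
probs = torch.softmax(logits, dim=1)     # each row sums to 1

# Passing probabilities to the loss applies softmax a second time
loss_on_probs = criterion(probs, targets)

print(combined.shape)                    # torch.Size([4, 64])
print(torch.allclose(loss_on_logits, manual))  # True
```

Because probabilities lie in [0, 1], a loss computed on them is squeezed into a narrow range: for three classes it cannot fall below about 0.55, however confident the model becomes. Computing the loss on logits avoids this ceiling on how well the model can appear to fit.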