<h1 style="color:#BF66F2 ">  Recurrent Neural Networks in PyTorch 1 </h1>
<div style="margin-top: -30px;">
<h4> RNN / GRU / LSTM comparison. Learn long-term dependencies in sequential data. Focus on DataLoaders </h4> 
</div>
<div style="margin-top: -18px;">
<span style="display: inline-block;">
    <h3 style="color: lightblue; display: inline;">Keywords:</h3>
    ``'important words'`` yellow in markdown + tqdm progression bar + PyTorch functional interface +
    margin-top: in markdown
</span>
</div>

In [4]:
import torch
from torch import nn  
from torch import optim 
from torch.utils.data import DataLoader
import torch.nn.functional as F
import torchvision.datasets as datasets  
import torchvision.transforms as transforms  

from tqdm import tqdm 

In [5]:
device = "cuda" if torch.cuda.is_available() else "cpu"

In [6]:
""" Hyperparameters """
input_size = 28
hidden_size = 256
num_layers = 2
num_classes = 10
sequence_length = 28
learning_rate = 0.005
batch_size = 64
num_epochs = 3

<h3 style="color:#BF66F2"> RNN Args </h3>
<div style="margin-top: -17px;">

- input_size: The number of expected features in the input `x`    
- hidden_size: The number of features in the hidden state `h`   
- num_layers: Number of recurrent layers. E.g., setting ``num_layers=2``
        would mean stacking two RNNs together to form a `stacked RNN`,   
        with the second RNN taking in outputs of the first RNN and   
        computing the final results. Default: 1   
- nonlinearity: The non-linearity to use. Can be either ``'tanh'`` or ``'relu'``. Default: ``'tanh'``   
- bias: If ``False``, then the layer does not use bias weights `b_ih` and `b_hh`.   
        Default: ``True``   
- batch_first: If ``True``, then the input and output tensors are provided   
        as `(batch, seq, feature)` instead of `(seq, batch, feature)`.   
        Note that this does not apply to hidden or cell states. Default: ``False``   
- dropout: If non-zero, introduces a `Dropout` layer on the outputs of each    
        RNN layer except the last layer, with dropout probability equal to    
        `dropout`. Default: 0
- bidirectional: If ``True``, becomes a bidirectional RNN. Default: ``False``
</div>
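
To make the argument list above concrete, here is a minimal sketch of the tensor shapes that `nn.RNN` produces when `batch_first=True`. The sizes are illustrative and simply mirror the hyperparameters of this notebook; the batch size of 4 is an arbitrary assumption.

```python
import torch
from torch import nn

# Illustrative sizes (assumption: a tiny batch of 4 sequences).
batch_size, seq_len, input_size, hidden_size, num_layers = 4, 28, 28, 256, 2

rnn = nn.RNN(input_size, hidden_size, num_layers, batch_first=True)

# With batch_first=True the input layout is (batch, seq, feature).
x = torch.randn(batch_size, seq_len, input_size)

# When h0 is omitted, PyTorch initializes it to zeros.
out, h_n = rnn(x)

print(out.shape)  # torch.Size([4, 28, 256]) -> (batch, seq, hidden_size)
print(h_n.shape)  # torch.Size([2, 4, 256])  -> (num_layers, batch, hidden_size)
```

Note that `batch_first` only changes the layout of the input and output tensors; the hidden state keeps the `(num_layers, batch, hidden_size)` layout.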

In [7]:
class myRNN(nn.Module):
    """ Recurrent Neural Network model (many-to-one). """
    def __init__(self, input_size, hidden_size, num_layers, num_classes):
        super(myRNN, self).__init__()
        self.hidden_size = hidden_size
        self.num_layers = num_layers
        self.rnn = nn.RNN(input_size, hidden_size, num_layers, batch_first=True)
        self.fc = nn.Linear(hidden_size * sequence_length, num_classes)

    def forward(self, x):
        """ Perform a forward pass through the neural network.
            Params:
                x: Input tensor [torch.Tensor of shape (batch_size, sequence_length, input_size)]
            Details:
                - #*    Initialize the hidden state as a tensor of zeros with shape
                        (num_layers, batch_size, hidden_size). A plain RNN has no cell state.
                - #**   Run the RNN and use a throwaway `_` to ignore the tensor holding the final hidden state.
                - #***  Reshape the output tensor (while preserving its underlying data) into a 2D tensor
                        of shape (batch_size, sequence_length * hidden_size).
                        out.shape[0] is the batch size; -1 lets PyTorch infer the second dimension
                        from the total number of elements.
                - #**** Apply the fully connected layer (Linear) to the reshaped output, which decodes
                        the hidden states of all time steps (not only the last one).
            Returns:
                Output tensor [torch.Tensor of shape (batch_size, num_classes)]
        """
        # Init
        h0 = torch.zeros(self.num_layers, x.size(0), self.hidden_size).to(device) #*
        ## Forward
        out, _ = self.rnn(x, h0)                                                  #** throwaway
        out = out.reshape(out.shape[0], -1)                                       #***
        # Decode
        out = self.fc(out)                                                        #****

        return out
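
The model above flattens the hidden states of every time step before the linear layer. A common alternative for many-to-one models is to decode only the last time step. The sketch below contrasts the two on a random stand-in for the RNN output; the names `fc_all` and `fc_last` are hypothetical and the last-step variant is an assumption, not something this notebook uses.

```python
import torch
from torch import nn

batch_size, seq_len, hidden_size, num_classes = 4, 28, 256, 10

# Stand-in for the RNN output with batch_first=True: (batch, seq, hidden_size).
out = torch.randn(batch_size, seq_len, hidden_size)

# Variant used in this notebook: concatenate all time steps into one vector.
flat = out.reshape(out.shape[0], -1)                     # (4, 28 * 256)
fc_all = nn.Linear(hidden_size * seq_len, num_classes)   # hypothetical name
logits_all = fc_all(flat)

# Alternative (assumption): keep only the hidden state of the last time step.
last = out[:, -1, :]                                     # (4, 256)
fc_last = nn.Linear(hidden_size, num_classes)            # hypothetical name
logits_last = fc_last(last)

print(flat.shape, last.shape)
print(logits_all.shape, logits_last.shape)
```

Both variants produce logits of shape `(batch_size, num_classes)`. The flattened variant ties the linear layer to a fixed sequence length, while the last-step variant does not.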

<h3 style="color:#BF66F2"> Recap: Gated Recurrent Unit </h3>
<div style="margin-top: -15px;">
A GRU is similar to an LSTM (Long Short-Term Memory) network; both were designed to mitigate the vanishing gradient problem. <br>    
It has a simpler architecture than the LSTM, with only two gates (a reset gate and an update gate) and no separate cell state. <br>

In LSTM networks, the cell state serves as a "memory" that is propagated from one time step to the next through gating mechanisms.<br>
The cell state is updated at each time step using the input at that time step, the previous hidden state, and the previous cell state, and the updated cell state <br> is then passed to the next time step.<br>
The hidden state, on the other hand, is a gated function of the cell state; it is the output of the layer at each time step and is what downstream layers use to make predictions.   
</div>
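
The difference between the two architectures shows up directly in the PyTorch API. As a minimal sketch with illustrative sizes: `nn.GRU` returns only a hidden state, `nn.LSTM` returns a `(hidden, cell)` pair, and a single-layer GRU has three quarters of the parameters of an LSTM of the same size.

```python
import torch
from torch import nn

input_size, hidden_size = 28, 256      # illustrative sizes
x = torch.randn(4, 28, input_size)     # (batch, seq, feature)

gru = nn.GRU(input_size, hidden_size, num_layers=1, batch_first=True)
lstm = nn.LSTM(input_size, hidden_size, num_layers=1, batch_first=True)

# GRU: the second return value is just the final hidden state.
_, h_gru = gru(x)
# LSTM: the second return value is a tuple (final hidden state, final cell state).
_, (h_lstm, c_lstm) = lstm(x)

# PyTorch stacks the weights of all gates row-wise:
# 3 * hidden_size rows for the GRU, 4 * hidden_size rows for the LSTM.
n_gru = sum(p.numel() for p in gru.parameters())
n_lstm = sum(p.numel() for p in lstm.parameters())

print(h_gru.shape, h_lstm.shape, c_lstm.shape)
print(n_gru / n_lstm)  # 0.75
```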

In [8]:
class myRNN_GRU(nn.Module):
    """ GRU-based Recurrent Neural Network (RNN) model (many-to-one).

        Args:
            - input_size: Number of expected features in the input `x` [int]
            - hidden_size: Number of features in the hidden state [int]
            - num_layers: Number of recurrent layers [int]
            - num_classes: Number of output classes [int]

        Attributes:
            - hidden_size (int): The number of features in the hidden state.
            - num_layers (int): Number of recurrent layers.
            - gru (nn.GRU): The GRU (Gated Recurrent Unit) layer.
            - fc (nn.Linear): The fully connected output layer.

        Methods:
            forward(x): Forward pass through the RNN.

        """

    def __init__(self, input_size, hidden_size, num_layers, num_classes):
        super(myRNN_GRU, self).__init__()
        self.hidden_size = hidden_size
        self.num_layers = num_layers
        self.gru = nn.GRU(input_size, hidden_size, num_layers, batch_first=True)
        self.fc = nn.Linear(hidden_size * sequence_length, num_classes)

    def forward(self, x):
        """ Perform a forward pass through the GRU-based RNN.
        Set the initial hidden state (a GRU has no cell state) + forward + decoding with a fully connected layer.
        
        Parameters:
            Input tensor of shape [torch.Tensor of shape (batch_size, sequence_length, input_size)]

        Returns:
            Output logits for each class [torch.Tensor of shape (batch_size, num_classes)]
        """        
        h0 = torch.zeros(self.num_layers, x.size(0), self.hidden_size).to(device)
        ## Forward propagate GRU
        out, _ = self.gru(x, h0)
        out = out.reshape(out.shape[0], -1)
        # Decode 
        out = self.fc(out)
        
        return out

In [9]:
class RNN_LSTM(nn.Module):
    """ Recurrent Neural Network with LSTM architecture (many-to-one).

    Parameters:
        - input_size: Number of expected features in the input `x` [int]
        - hidden_size: Number of features in the hidden state [int]
        - num_layers: Number of recurrent layers [int]
        - num_classes: Number of output classes [int]

    Attributes:
        - hidden_size (int): The number of features in the hidden state
        - num_layers (int): Number of recurrent layers
        - lstm (nn.LSTM): The LSTM (Long Short-Term Memory) layer
        - fc (nn.Linear): The fully connected output layer

    Methods:
        forward(x): Forward pass through the LSTM-based RNN
    """
    def __init__(self, input_size, hidden_size, num_layers, num_classes):
        super(RNN_LSTM, self).__init__()
        self.hidden_size = hidden_size
        self.num_layers = num_layers
        self.lstm = nn.LSTM(input_size, hidden_size, num_layers, batch_first=True)
        self.fc = nn.Linear(hidden_size * sequence_length, num_classes)

    def forward(self, x):
        """ Forward pass through the LSTM-based RNN.

        Args:
            Input tensor [torch.Tensor of shape (batch_size, sequence_length, input_size)]

        Returns:
            Output predictions for each class [torch.Tensor of shape (batch_size, num_classes)]
        """        
        # Set initial hidden and cell states
        h0 = torch.zeros(self.num_layers, x.size(0), self.hidden_size).to(device)
        c0 = torch.zeros(self.num_layers, x.size(0), self.hidden_size).to(device)

        # Forward propagate LSTM
        out, _ = self.lstm(
            x, (h0, c0)
        )  # out: tensor of shape (batch_size, seq_length, hidden_size)
        out = out.reshape(out.shape[0], -1)

        # Decode the flattened hidden states of all time steps
        out = self.fc(out)
        return out

<h3 style="color:#BF66F2"> Recap: DataLoader </h3>
<div style="margin-top: -15px;">
Utility to load and preprocess large datasets for training or inference in a neural network. <br>
A data loader takes a dataset object as input and returns batches of data during training or evaluation.<br>
A "torch.utils.data.DataLoader" accepts several optional arguments such as the batch size, the shuffle option, and the number of worker <br> processes to use for data loading. <br>
<div style="line-height:1.6">
<h4> Benefits: </h4>
</div>
<div style="margin-top: -23px;">

- Memory efficiency: A data loader can load and preprocess data in batches, which allows it to process large datasets that may not fit into memory.     
- Data parallelism: A data loader can be used in conjunction with PyTorch's DataParallel module to split a batch of data across multiple GPUs <br> for parallel processing.    
- Randomization: A data loader can shuffle the order of the data during training to prevent the model from overfitting to the order of the data.
</div>
</div>
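
The batching behavior described above can be sketched on a small synthetic dataset. This is a minimal example under stated assumptions: the random tensors stand in for MNIST images and labels, and the dataset size of 1000 is arbitrary.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Synthetic stand-in for MNIST: 1000 single-channel 28x28 "images" with labels 0..9.
images = torch.randn(1000, 1, 28, 28)
labels = torch.randint(0, 10, (1000,))
dataset = TensorDataset(images, labels)

# shuffle=True draws a new random sample order at the start of every epoch.
loader = DataLoader(dataset, batch_size=64, shuffle=True)

print(len(loader))  # 16 batches = ceil(1000 / 64)

batch_shapes = [tuple(x.shape) for x, _ in loader]
first, last = batch_shapes[0], batch_shapes[-1]
print(first)  # (64, 1, 28, 28)
print(last)   # (40, 1, 28, 28): the last batch holds the remaining 1000 - 15 * 64 samples

# squeeze(1) removes the channel dimension so each image becomes
# a sequence of 28 rows with 28 features each.
x, _ = next(iter(loader))
print(x.squeeze(1).shape)  # torch.Size([64, 28, 28])
```

By the same arithmetic, the 60,000 MNIST training images at a batch size of 64 give 938 batches per epoch, which matches the count shown by the progress bar during training.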

In [10]:
#### Load Data
train_dataset = datasets.MNIST(root="dataset/", train=True, transform=transforms.ToTensor(), download=True)
test_dataset = datasets.MNIST(root="dataset/", train=False, transform=transforms.ToTensor(), download=True)
train_loader = DataLoader(dataset=train_dataset, batch_size=batch_size, shuffle=True)
test_loader = DataLoader(dataset=test_dataset, batch_size=batch_size, shuffle=True)

In [11]:
test_loader

<torch.utils.data.dataloader.DataLoader at 0x7f421886dff0>

In [12]:
""" Initialize network (try out just using simple RNN, or GRU, and then compare with LSTM) """
model = RNN_LSTM(input_size, hidden_size, num_layers, num_classes).to(device)
model

In [13]:
## Loss and optimizer
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=learning_rate)

In [14]:
############## Train
for epoch in range(num_epochs):
    for batch_idx, (data, targets) in enumerate(tqdm(train_loader)):
        ## Get data to cuda if possible
        data = data.to(device=device).squeeze(1)
        targets = targets.to(device=device)
        ## Forward
        scores = model(data)
        loss = criterion(scores, targets)
        ## Backward
        optimizer.zero_grad()
        loss.backward()
        # Update Adam step
        optimizer.step()

100%|██████████| 938/938 [04:21<00:00,  3.59it/s]
100%|██████████| 938/938 [04:53<00:00,  3.19it/s]
100%|██████████| 938/938 [04:33<00:00,  3.43it/s]


In [15]:
def check_accuracy(loader, model):
    """ Compute the accuracy of the given model on the given data loader.

    Parameters:
        - Data loader to use for evaluation [torch.utils.data.DataLoader]
        - Model to evaluate [torch.nn.Module]

    Returns:
        The accuracy of the model on the given data loader [float]
    """
    i, num_correct, num_samples= 0, 0, 0

    # Set the model to evaluation mode
    model.eval()
    print(f"type model is {type(model)}")

    # Disable gradient computation
    with torch.no_grad():
        for x, y in loader:
            i += 1
            # Move the data to the device and remove the channel dimension
            x = x.to(device=device).squeeze(1)
            y = y.to(device=device)

            # Compute the scores and predictions
            scores = model(x)
            _, predictions = scores.max(1)
            if i < 10:
                print(f"scores are => {scores}")

            ## Update the number of correct predictions and total samples
            num_correct += (predictions == y).sum()
            num_samples += predictions.size(0)

    # Set the model back to training mode
    model.train()
    # Compute the accuracy
    acc = num_correct / num_samples
    return acc

### => Check accuracy on training and test sets

In [18]:
def check_accuracy(loader, model):
    """ Calculate the accuracy of the given model on the given data loader (num correct predictions / total num examples).

    Parameters:
        - Data loader to use for evaluation [torch.utils.data.DataLoader]
        - Model to evaluate [torch.nn.Module]    
    
    Details: 
        - #   Set model to eval to disable certain layers such as dropout and batch normalization that are only used during training
        - #   Disable gradient computation
        - #*  Move the data to the device and remove the channel dimension
        - #   Score and predict
        - #** scores.max(1) returns the maximum value along dimension 1 (ignored) and its index (used as the predicted class)
        - #   Update the number of correct predictions and total samples
        - #   Set the model back to training mode

    Returns:
        The accuracy of the model on the given data loader [float]
    """
    i, num_correct, num_samples= 0, 0, 0

    model.eval()
    print(f"type model is {type(model)}")
    
    with torch.no_grad():
        for x, y in loader:
            i+=1
            x = x.to(device=device).squeeze(1)  #*
            y = y.to(device=device)
            ## Predict
            scores = model(x)
            _, predictions = scores.max(1)      #**
            if i < 3:
                print(f"scores are => {scores}")
                print(f"predictions are => {predictions}")
            num_correct += (predictions == y).sum()
            num_samples += predictions.size(0)

    # Toggle model back to train
    model.train()
    
    return num_correct / num_samples
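
The prediction step in `check_accuracy` relies on `Tensor.max(dim)` returning both the maximum values and their indices. A minimal sketch with hand-made scores (the numbers and targets below are made up for illustration):

```python
import torch

# Hypothetical logits for 3 samples and 3 classes.
scores = torch.tensor([[0.1, 2.5, -1.0],
                       [3.0, 0.2, 0.4],
                       [0.0, 0.1, 0.9]])
targets = torch.tensor([1, 0, 1])

# max over dim=1 returns (largest score per row, column index of that score).
values, predictions = scores.max(1)
print(predictions)  # tensor([1, 0, 2])

# Element-wise comparison gives a boolean tensor; sum() counts the matches.
num_correct = (predictions == targets).sum().item()
accuracy = num_correct / targets.size(0)
print(num_correct, accuracy)  # 2 correct out of 3
```

Calling `.item()` converts the zero-dimensional tensor to a plain Python number, so the resulting accuracy is a regular `float`.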

In [19]:
print(f"Accuracy on training set: {check_accuracy(train_loader, model)*100:.2f}")
print(f"Accuracy on test set: {check_accuracy(test_loader, model)*100:.2f}")

type model is <class '__main__.RNN_LSTM'>
scores are => tensor([[ -1.4506,   0.6159,  11.2689,  -0.0529,  -2.3261,  -5.4228,  -4.1745,
           3.4416,  -2.6421,  -8.0659],
        [ 11.6694,  -3.2642,  -0.0641,  -4.5471,   0.5387,  -2.5249,   1.6720,
          -2.8427,   1.5373,  -1.8280],
        [-10.4789,  -2.4007,  -3.8520,  -5.3308,  19.0206,  -5.7118,  -8.8527,
           2.9862,   0.3749,  -0.5495],
        [-10.6375,  -2.9849,  -1.6753,  -0.5275,   0.4147,  -3.9382, -14.8768,
           4.3097,  -2.5668,  12.9842],
        [ 14.2484,  -4.6526,   0.3349,  -2.8458,  -2.8282,  -4.4086,  -1.7836,
          -2.8119,   1.4300,   0.8523],
        [ -5.5365, -10.8673,   3.5054,   6.5811,  -6.0490,  -3.7222,  -7.5258,
          -6.1584,  15.2175,  -1.1498],
        [  1.2060,  -4.1376,  -5.3185,  -1.6799,  -5.9375,  10.3153,   0.5505,
          -2.6321,  -0.3931,  -1.0251],
        [ -8.9853,  -0.8404,  -2.9307,  -3.4539,  13.4351,  -6.3206,  -6.2865,
           1.3123,   0.8742,   2