<div style="line-height:1.2;">

<h1 style="color:#BF66F2; margin-bottom: 0.3em;"> RNN + CNN model in PyTorch 1 </h1>

<h4 style="margin-top: 0.3em; margin-bottom: 1em;"> Image captioning with an Encoder and a Decoder. Focus on the Compose() function. </h4>

<div style="line-height:1.4; margin-bottom: 0.5em;">
    <h3 style="color: lightblue; display: inline; margin-right: 0.5em;">Keywords:</h3> SummaryWriter TensorBoard + Dropout + AdaptiveAvgPool2d + filterwarnings + DataLoader drop_last=True
</div>

<div style="margin-top: 5px;">
<div style="line-height:1.2">
<span style="display: inline-block;">
    <h3 style="color: red; display: inline;">Notes:</h3> 
    The dataset of images and captions was drastically reduced to make the example reproducible and avoid uploading a huge data folder.
</span>
</div>
</div>
</div>

<h1 style="color:#BF66F2 ">  RNN + CNN model in PyTorch 1 </h1>
<div style="margin-top: -30px;">
<h4> Image captioning with an Encoder and a Decoder. Focus on Compose() function </h4> 
</div>

<div style="margin-top: -18px;">
<span style="display: inline-block;">
    <h3 style="color: lightblue; display: inline;">Keywords:</h3>
    SummaryWriter TensorBoard + Dropout + AdaptiveAvgPool2d + filterwarnings + DataLoader drop_last=True
</span>
<br>
<div style="margin-top: -1px;">
<div style="line-height:1.2">
<span style="display: inline-block;">
    <h3 style="color: red; display: inline;">Notes:</h3> 
    The dataset of images and captions was drastically reduced to make the example reproducible and avoid uploading a huge data folder.
</span>
</div>
</div>

In [31]:
import os
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '2'  #to ignore CUDA warnings when GPU is not in use

In [32]:
import spacy  
import warnings
import pandas as pd

import torch
import torch.nn as nn
import torch.optim as optim

import statistics
import torchvision.models as models
import torchvision.transforms as transforms

from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import DataLoader, Dataset

from PIL import Image
from tqdm import tqdm

In [33]:
from torch.utils.tensorboard import SummaryWriter

<h3 style="color:#BF66F2"> Recap: SummaryWriter </h3>
<div style="margin-top: -17px;">
SummaryWriter is a class in the PyTorch library that provides a way to write TensorBoard event files. <br>
It creates a writer object that writes data to a directory specified by the user. <br>
During training, the train() function in your code writes the training loss to the SummaryWriter object using the writer.add_scalar() method. <br>
The global step value is also recorded.
</div>

In [34]:
class EncoderCNN(nn.Module):
    """ Convolutional neural network (CNN) for encoding image features (inherits from nn.Module).

    Parameters:
        - The size of the output embedding [int]
        - Bool to decide if finetune the CNN [bool]
    
    Attributes:
        - Flag to finetune the CNN [bool].
        - Inception V3 network [nn.Module].
        - Rectified linear unit activation function [nn.Module].
        - Dropout laye [nn.Module].
    
    Methods:
        - __init__(self, embed_size, train_CNN=False)
        - forward(images)
    
    Details:
        - Create and inception loading the pre-trained Inception V3 network from the torchvision.models module
        - Replace the final fully connected layer of the Inception V3 network with a linear layer of the specified output size.
        - Create relu.
        - Create dropout layer -> dropout probability of 0.5.
    Notes:
        'aux_logits' param for inception_v3 cannot be set to False!
    """
    def __init__(self, embed_size, train_CNN=False):
        """ Constructor to initialize the class """
        super(EncoderCNN, self).__init__()
        self.train_CNN = train_CNN
        #self.inception = models.inception_v3(pretrained=True)
        self.inception = models.inception_v3(pretrained=True, aux_logits=True)
        #self.inception.fc = nn.Linear(self.inception.fc.in_features, embed_size)  
        self.inception.fc = nn.Identity() 
        self.relu = nn.ReLU()
        self.adaptive_pool = nn.AdaptiveAvgPool2d((1, 1)) 
        self.dropout = nn.Dropout(0.5)

    def forward(self, images):
        """ Computes the forward pass of the CNN on a batch of images.
        
        Parameters:
            - Tensor of images
        
        Details: 
            - Pass the input images through the Inception V3 network to produce a tensor of shape (batch_size, embed_size).
            - Apply the ReLU activation function and dropout regularization to the tensor of encoded image features and return the resulting tensor.
        
        Returns:
            - Tensor of encoded image features.
        """
        #features = self.inception(images)
    
        features = self.inception(images) # Extract features from InceptionOutputs
        #features = features.view(features.size(0), -1)  # Flatten the features tensor
        print(type(features))
        features = self.dropout(self.relu(features))
        return features

In [35]:
class DecoderRNN(nn.Module):
    """ Recurrent neural network (RNN) for decoding image features into captions.

    Parameters:
        - Size of the input word embeddings [int].
        - Size of the hidden state of the LSTM [int].
        - Size of the vocabulary [int].
        - Number of LSTM layers [int].

    Attributes:
        - Embedding layer for mapping words to vectors [nn.Module].
        - LSTM layer for processing the input sequence [nn.Module].
        - Linear layer for mapping from hidden states to output logits [nn.Module].
        - Dropout layer [nn.Module].

    Methods:
        - __init__(self, embed_size, hidden_size, vocab_size, num_layers)
        - forward(features, captions)
    
    Details:
        - Create an embedding layer for mapping words to vectors.
        - Create an LSTM layer for processing the input sequence.
        - Create a linear layer for mapping from hidden states to output logits.
        - Create a dropout layer, with a dropout probability of 0.5.
    """
    def __init__(self, embed_size, hidden_size, vocab_size, num_layers):
        """ Constructor to initialize the class """
        super(DecoderRNN, self).__init__()
        self.embed = nn.Embedding(vocab_size, embed_size)  
        self.lstm = nn.LSTM(embed_size, hidden_size, num_layers)
        self.linear = nn.Linear(hidden_size, vocab_size)  
        self.dropout = nn.Dropout(0.5)  

    def forward(self, features, captions):
        """ Compute the Forward pass of the RNN on a batch of image features and captions.         
        
        Parameters:
            - Tensor of features
            - Tensor of caption sequences
        
        Details: 
            - Map the input caption sequence to a tensor of word embeddings using the embedding layer, and apply dropout regularization.\\
            The embeddings tensor has a shape of (seq_length, embed_size), where seq_length is the length of the caption sequence. 
            - Concatenate the features tensor with the embeddings tensor along the batch dimension.\\
                -The result in a tensor with a shape of:
                    - with dim = 0 (#along the first dimension) is (seq_length+1, embed_size)
                    - with dim = 0 (#along the second dimension) is (batch_size, seq_length+1, embed_size) 
                
            - Pass the concatenated tensor through the LSTM layer to produce a tensor of hidden states.
            - Map the hidden states to a tensor of output logits using the linear layer.
        
        Returns:
            - Tensor of encoded image features, the output logits.
        """
        captions = captions[:, :-1]
        embeddings = self.dropout(self.embed(captions))
        print("Features size in forward pass method:", features.size())
        print("Embeddings size in forward pass method:", embeddings.size())
        #embeddings = torch.cat((features.unsqueeze(0), embeddings), dim=1) #along the second dimension
        # Concatenate along batch dimension
        embeddings = torch.cat((features.unsqueeze(0), embeddings), dim=0)  
        
        ### Transform input tensor to match expected input size of LSTM layer
        embeddings = self.linear(embeddings)  
        hiddens, _ = self.lstm(embeddings)
        outputs = self.linear(hiddens)
        return outputs

In [36]:
class CNNtoRNN(nn.Module):
    """ Neural network that combines an image encoder with a caption decoder.
    
    Parameters:
        - The size of the input word embeddings [int]
        - The size of the hidden state of the LSTM [int]
        - The size of the vocabulary [int]
        - The number of LSTM layers [int]

    Attributes:
        - encoderCNN (EncoderCNN)
        - decoderRNN (DecoderRNN)

    Methods:
        - __init__(self, embed_size, hidden_size, vocab_size, num_layers)
        - forward(self, images, captions)
        - caption_image(self, image, vocabulary, max_length=50)
    """
    def __init__(self, embed_size, hidden_size, vocab_size, num_layers):
        """ Constructor to initialize the class """
        super(CNNtoRNN, self).__init__()
        self.encoderCNN = EncoderCNN(embed_size)
        #self.encoderCNN = nn.Sequential(*list(inception.children())[:-2], nn.AdaptiveAvgPool2d((1,1)), nn.Identity())        
        self.decoderRNN = DecoderRNN(embed_size, hidden_size, vocab_size, num_layers)


    def forward(self, images, captions):
        """ Compute the forward pass of the CNNtoRNN network on a batch of images and captions.
        
        Parameters:
            - images (torch.Tensor): A tensor of shape (batch_size, channels, height, width) representing a batch of images
            - captions (torch.Tensor): A tensor of shape (batch_size, max_seq_length) representing a batch of caption sequences
        
        Details:
            - Create Encoder for extracting image features
            - Create caption Decoder for generating captions from image features
        
        Returns:
            - Tensor of shape (batch_size, max_seq_length, vocab_size) representing the output logits for each word in the caption sequence.
        """
        features = self.encoderCNN(images)
        outputs = self.decoderRNN(features, captions)
        return outputs

    def caption_image(self, image, vocabulary, max_length=50):
        """Generate a caption for a given image using the trained CNNtoRNN model.
        
        Parameters:
            - image (torch.Tensor): A tensor of shape (channels, height, width) representing an input image
            - vocabulary (torchtext.vocab.Vocab): A vocabulary object that maps between words and word indices
            - max_length (int): The maximum length of the generated caption
        
        Details: 
            - Disable gradient computation during inference to reduce memory usage and speed up computation
            - Initialize an empty list to store the generated word indices
            - Extract image features from the input image using the image encoder, and add an extra dimension to the tensor for the batch size
            - Initialize the hidden state tensor to None
            - Iterate over the maximum length of the generated caption
                - Pass the input tensor and the hidden state tensor through the LSTM layer of the caption decoder to produce a new hidden state tensor
                - Map the hidden state tensor to a tensor of output logits using the linear layer of the caption decoder
                Remove the first dimension from the hiddens tensor (1, batch_size, hidden_size), with squeeze(1)
                    to pass the tensor to the linear layer of the caption decoder, which expects a tensor of shape (batch_size, vocab_size).
                - Select the word index with the highest logit value as the predicted word
                - Append the predicted word index to the list of generated word indices
                - Map the predicted word index to a tensor of word embeddings using the embedding layer of the caption decoder,
                    and add an extra dimension to the tensor for the batch size
                - If the predicted word is the end-of-sentence token, break out of the loop
        
        Returns:
            - List of strings representing the generated caption.
        """
        result_caption = []

        with torch.no_grad():
            features = self.encoderCNN(image)
            states = None
            x = features.unsqueeze(0)

            for _ in range(max_length):
                hiddens, states = self.decoderRNN.lstm(x, states)
                output = self.decoderRNN.linear(hiddens.squeeze(0))
                predicted = output.argmax(1)
                result_caption.append(predicted.item())
                x = self.decoderRNN.embed(predicted).unsqueeze(0)

                if vocabulary.itos[predicted.item()] == "<EOS>":
                    break

        return [vocabulary.itos[idx] for idx in result_caption]

In [37]:
def print_examples(model, device, dataset):
    """ Print some example image captions generated by the CNNtoRNN model.
    
    Parameters:
        - Trained CNNtoRNN model to use for generating captions
        - Device to use for running the model ("cpu" or "cuda").
        - Dataset used to train the model.

    Details:    
        - Create the composition of image transformation functions that will be applied to images.\\
            The Compose function is used to create a sequence of transformations that will be applied to an input image in order.\\
            `transforms.Compose` is a function from the `torchvision.transforms` module that creates a\\
            composition of multiple image transformations. 
            - 3 transformation sequence => 3 transformations: 
                - resizing the image to a fixed size of (299,299 pixels), 
                - converting the image to a PyTorch tensor, 
                - normalizing the pixel values of the image to have zero mean and unit variance.\\
                    (typical preprocessing step to help the model to converge more quickly and improve its overall performance).\\
            => The resulting tensor has shape (3, 299, 299). --> (color channels (RGB), height, width) 

        - Add an extra dimension (of size 1) at the beginning of the tensor, with the 'unsqueeze(0)',\\
            without having to modify the model's input format. \\
            In fact PyTorch expect, expect the input to be in the form of batches of data, to make the model more stable\\
            and use parallelism and vectorization to process multiple data points at once.\\ 
            This allows using the same training pipeline and data loading code that you would use for larger datasets,\\
            without having to make modifications to handle the case of a single data point.

        - Print the caption generated by a model for an image, using the "caption_image" method,\\
            which takes an image (`test_img5`) and a vocabulary (`dataset.vocab`) as input.\\
            The `join` method is used to concatenate the words in the caption into a single string.
    """
    transform = transforms.Compose(
        [
            transforms.Resize((299, 299)),
            transforms.ToTensor(),
            transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5)),
        ]
    )

    model.eval()
    test_img1 = transform(Image.open("./data/images_torch_07/flickr8k/test_examples/dog.jpg").convert("RGB")).unsqueeze(0)
    test_img2 = transform(Image.open("./data/images_torch_07/flickr8k/test_examples/child.jpg").convert("RGB")).unsqueeze(0)
    test_img3 = transform(Image.open("./data/images_torch_07/flickr8k/test_examples/bus.png").convert("RGB")).unsqueeze(0)
    test_img4 = transform(Image.open("./data/images_torch_07/flickr8k/test_examples/boat.png").convert("RGB")).unsqueeze(0)
    test_img5 = transform(Image.open("./data/images_torch_07/flickr8k/test_examples/horse.png").convert("RGB")).unsqueeze(0)
    
    print("Example 1 CORRECT: Dog on a beach by the ocean")
    print("Example 1 OUTPUT: " + " ".join(model.caption_image(test_img1.to(device), dataset.vocab)))
    print()
    print("Example 2 CORRECT: Child holding red frisbee outdoors")
    print("Example 2 OUTPUT: "+ " ".join(model.caption_image(test_img2.to(device), dataset.vocab)))
    print()
    print("Example 3 CORRECT: Bus driving by parked cars")
    print("Example 3 OUTPUT: "+ " ".join(model.caption_image(test_img3.to(device), dataset.vocab)))
    print()
    print("Example 4 CORRECT: A small boat in the ocean")
    print("Example 4 OUTPUT: "+ " ".join(model.caption_image(test_img4.to(device), dataset.vocab)))
    print()
    print("Example 5 CORRECT: A cowboy riding a horse in the desert")
    print("Example 5 OUTPUT: "+ " ".join(model.caption_image(test_img5.to(device), dataset.vocab)))
    
    # Set train mode
    model.train()


def save_checkpoint(state, filename="my_checkpoint.pth.tar"):
    """ Save the current state of the CNNtoRNN model and optimizer to a checkpoint file. """
    print("=> Saving checkpoint")
    torch.save(state, filename)


def load_checkpoint(checkpoint, model, optimizer):
    """ Load the state of the trained CNNtoRNN model and optimizer from a checkpoint dict. \
    Set the current training step to the step saved in the dict.

    Parameters:
        - Dictionary containing the saved state of the model, optimizer, and training step 
        - CNNtoRNN model to load the state into.
        - Optimizer to load the state into.
    
    Returns:
        - Training step saved in the checkpoint file [int].
    """
    print("=> Loading checkpoint")
    model.load_state_dict(checkpoint["state_dict"])
    optimizer.load_state_dict(checkpoint["optimizer"])
    step = checkpoint["step"]
    return step

<h3 style="color:#BF66F2"> Recap: Convert text -> numerical values </h3>
<div style="margin-top: -17px;">
It is necessary to:

1. Vocabulary mapping each word to a index
2. Setup a Pytorch dataset to load the data
3. Setup padding of every batch (all examples should be of same seq_len and setup dataloader)

In [38]:
%%script echo Skipping since already downloaded
#### Download with
!python -m spacy download en

Skipping since already downloaded


In [39]:
#spacy_eng = spacy.load("en")                   #load a model using a shortcut it is an obsolete way!
spacy_eng = spacy.load("en_core_web_sm")

In [40]:
class Vocabulary:
    """ Class for building and using a vocabulary for NPL.
    
    Attributes:
        - itos dict to maps integer indices to string tokens;
        - stoi dict to maps string tokens to integer indices;
        - The minimum frequency required for a word to be included in the vocabulary [int].

    Methods:
        - __len__(): Returns the number of tokens in the vocabulary.
        - tokenizer_eng(text): Tokenizes an English text string using Spacy. 
        - build_vocabulary(sentence_list): Builds the vocabulary from a list of sentences.
        - numericalize(text): Converts a text string to a list of integer indices corresponding to the tokens in the vocabulary.
    """

    def __init__(self, freq_threshold):
        """ Constructor. """
        self.itos = {0: "<PAD>", 1: "<SOS>", 2: "<EOS>", 3: "<UNK>"}
        self.stoi = {"<PAD>": 0, "<SOS>": 1, "<EOS>": 2, "<UNK>": 3}
        self.freq_threshold = freq_threshold

    def __len__(self):
        """ Return the number of tokens in the vocabulary. """
        return len(self.itos)

    @staticmethod
    def tokenizer_eng(text):
        """ Tokenize an English text string using Spacy.

        Parameters:
            The text to be tokenized [str].

        Returns:
            List of string tokens.
        """
        return [tok.text.lower() for tok in spacy_eng.tokenizer(text)]

    def build_vocabulary(self, sentence_list):
        """ Set up the vocabulary from a list of sentences.
        
        Parameters:
            List of sentences to build the vocabulary from.

        Details: 
            - Iterate over each sentence in the list
            - Tokenize the sentence into words using Spacy

        """
        frequencies = {}
        idx = 4        
        for sentence in sentence_list:
            for word in self.tokenizer_eng(sentence):
                # If the word is not in the frequency dictionary, add it and set the count to 1
                if word not in frequencies:
                    frequencies[word] = 1
                # If the word is already in the frequency dictionary, increment its count
                else:
                    frequencies[word] += 1
                # If the word count reaches the frequency threshold, add the word to the vocabulary
                if frequencies[word] == self.freq_threshold:
                    self.stoi[word] = idx
                    self.itos[idx] = word
                    idx += 1

    def numericalize(self, text):
        """ Convert a text string to a list of integer indices corresponding to the tokens in the vocabulary.

        Details: 
            The text string is tokenized into a list of string tokens using the tokenizer_eng method.\\
            For each token in the tokenized text list, checks if it is in the vocabulary's stoi dictionary (the token is a known word in the vocabulary).\\
            If it is, the corresponding integer index is added to a list. \\
            If it is not, the <UNK> token's integer index (index 3 in the vocabulary's stoi dictionary) is added to the list instead.
            
        Parameters:
            Text to be numericalized [str]

        Returns:
            List of integer indices corresponding to the tokens in the vocabulary
        """
        tokenized_text = self.tokenizer_eng(text)

        return [self.stoi[token] if token in self.stoi else self.stoi["<UNK>"]
            for token in tokenized_text]


In [41]:
class FlickrDataset(Dataset):
    """ PyTorch Dataset class for loading image-caption pairs from a CSV file.

    Args:
        - Root directory containing the image files [str]
        - Path to the CSV file containing the image-caption pairs [str]
        - Transformation to apply to the images (callable, optional)
        - Minimum frequency required for a word to be included in the vocabulary [int]

    Attributes:
        - Root directory containing the image files [str]
        - Pandas DataFrame containing the image-caption pairs
        - Pandas Series containing the image filenames
        - Pandas Series containing the captions
        - Vocabulary object used to numericalize the captions

    Methods:
        - __len__(): Returns the number of image-caption pairs in the dataset
        - __getitem__(index): Loads an image-caption pair from the dataset
    """

    def __init__(self, root_dir, captions_file, transform=None, freq_threshold=5):
        self.root_dir = root_dir
        ## Get dataset of images
        self.df = pd.read_csv(captions_file)
        self.transform = transform
        ## Get img, caption columns
        self.imgs = self.df["image"]
        self.captions = self.df["caption"]

        ## Initialize vocabulary and build vocab
        self.vocab = Vocabulary(freq_threshold)
        self.vocab.build_vocabulary(self.captions.tolist())

    def __len__(self):
        """ Return the number of image-caption pairs in thedataset. """
        return len(self.df)

    def __getitem__(self, index):
        """ Load an image-caption pair from the dataset.

        Parameters:
            The index of the image-caption pair to load [int].

        Returns:
            The image and its corresponding numericalized caption.
        """
        caption = self.captions[index]
        img_id = self.imgs[index]
        img = Image.open(os.path.join(self.root_dir, img_id)).convert("RGB")

        if self.transform is not None:
            img = self.transform(img)

        numericalized_caption = [self.vocab.stoi["<SOS>"]]
        numericalized_caption += self.vocab.numericalize(caption)
        numericalized_caption.append(self.vocab.stoi["<EOS>"])

        return img, torch.tensor(numericalized_caption)

In [42]:
class MyCollate:
    """ Collate the image and caption data returned by the FlickrDataset class into batches\\
    that can be fed into a neural network for training. 
    
    Args:
        batch (list): A list of tuples containing image and caption data.

    Details: 
        - Creates a list imgs containing the image data from the input batch, extracting the image data\\
            For each item it takes the first element and unsqueezes it along the first dimension,\\
            to create a tensor of shape (1, C, H, W), where C, H, and W are the number of channels, height, and width\\
            imgs = contains len(batch) tensors, each of shape (1, C, H, W).
        - Concatenate the image data into a single tensor, along the first dimension.\\
            imgs = Tensor has shape (batch_size, C, H, W)
        - Extract the caption data from the input batch for each item
        - Pad the caption data sequence so that all captions have the same length (the length of the longest caption in the batch)\\
            tensors targets so that all captions have the same length.\\
            The pad_sequence() function is used to perform the padding.\\
            By default, pad_sequence() pads sequences to have the same length along the second dimension (batch_first=False).\\
            targets shape = (max_seq_len, batch_size)

    Returns:
        Collated image data and caption data.
    """
    def __init__(self, pad_idx):
        """ Initializes a MyCollate object with a padding index.
        Args:
            pad_idx (int): The index of the padding token in the vocabulary.
        """
        self.pad_idx = pad_idx

    def __call__(self, batch):
        imgs = [item[0].unsqueeze(0) for item in batch]
        imgs = torch.cat(imgs, dim=0)
        targets = [item[1] for item in batch]
        targets = pad_sequence(targets, batch_first=False, padding_value=self.pad_idx)

        return imgs, targets


def get_loader(root_folder, annotation_file, transform, batch_size=32, num_workers=8, shuffle=True, pin_memory=True):
    """ Returns a DataLoader for loading image-caption pairs, to be able to load data in parallel using multiple worker threads.

    Parameters:
        - Path to the root folder containing the image data [str].
        - Path to the file containing the image captions [str].
        - Function used to transform the images.
        - Batch size for loading data [int].
        - Number of worker threads to use for loading data. [int].
        - Option to shuffle the data during loading [bool].
        - Option to pin the data in memory during loading [bool].

    Details: 
        - Create a FlickrDataset object
        - Get the index of the padding <PAD> token in the vocabulary of the FlickrDataset object\\
        The vocab attribute of the FlickrDataset object contains a torchtext.vocab.Vocab object that maps words to integer indices.\\ 
        The stoi attribute of the Vocab object is a dictionary that maps words to their corresponding integer indices.\\
        - Create a DataLoader object to load the image-caption pairs in batches. 

    Notes: 
        - #***
            - Adding drop_last = True is fundamental when dealing with InceptionOutputs!\\
            To avoid the infamous TypeError, raised when activations functions receive during the training also an "InceptionOutputs" object after a batch of Tensors.

    Returns:
        DataLoader object and the FlickrDataset object.
    """
    dataset = FlickrDataset(root_folder, annotation_file, transform=transform)
    pad_idx = dataset.vocab.stoi["<PAD>"]
    loader = DataLoader(
        dataset=dataset,
        batch_size=batch_size,
        num_workers=num_workers,
        shuffle=shuffle,
        pin_memory=pin_memory,
        drop_last=True,     #***
        collate_fn=MyCollate(pad_idx=pad_idx),)

    return loader, dataset

In [43]:
def train():
    """ Encapsulate training process into a method, to train a CNNtoRNN model on the Flickr8k dataset.
    
    Details:
        - Create a sequence of transformations:
            4 transformations == 4 sequences: 
            - resizing to fixed size of (356, 356) pixels
            - cropping the images to a random size of 299x299 pixels.
            - converting the image to a PyTorch tensor
            - normalizing pixel values to have zero mean and unit variance. 
                Using (0.5, 0.5, 0.5) for both mean and variance => output[channel] = (input[channel] - mean[channel]) / std[channel]
                This means that the pixel values in each channel will be shifted by -0.5 and then scaled by 1/0.5=2,\\
                which will result in pixel values with zero mean and unit variance. 

        - Get custom get_loader 
        - Call the nn.Module model.train() to set the mode of the model object to "train".\\
        - For all epochs:
            - Save checkpoint if requested.
            - Initializes a for loop that iterates over the batches of data in the train_loader object.\\
                - "enumerate" returns an iterator that generates tuples containing the batch index and the batch data;\\
                - "tqdm" is used to display a progress bar during training;\\
                - total number of batches in the train_loader;\\
                - leave=False tells the progress bar to remain visible after training ends. 
            - Generate output => predicted captions. 
                caption[:-1] slice to exclude the last word in each caption, since this word is used as the ground truth\\ 
                for the model's prediction.\\
                In fact, during training, the model is typically trained to predict the next word in the sequence\\ 
                given all the previous words.\\
                Thus, by removing the last word from the ground truth caption, the model is forced to predict the last word itself\\ 
                and is trained to generate the entire caption from scratch.\\
                This ensure that the model is not simply memorizing the ground truth captions,\\
                but is instead learning to generate captions that accurately describe the contents of the input image.

            - Calculate the loss between the predicted captions and the ground truth captions using CrossEntropyLoss.
                Reshape to flatten the predicted and ground truth captions into 2D arrays with shape:\\
                (batch_size * seq_len, vocab_size) and (batch_size * seq_len,), respectively.
            - Record the training loss to the SummaryWriter object\\
                "add_scalar" saves a scalar value (the training loss at each training iteration) to the event file. 
            - Perform backpropagation and gradient descent steps of the optimization process, resetting the gradients\\ 
                of all model parameters to zero. 
            - Update the model parameters using the computed gradients and the optimizer's update rule, doing a step.
    """
    transform = transforms.Compose(
        [
            transforms.Resize((356, 356)),
            transforms.RandomCrop((299, 299)),
            transforms.ToTensor(),
            transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5)),
        ]
    )

    train_loader, dataset = get_loader(
        root_folder="./data/images_torch_07/flickr8k/some_images",
        annotation_file="./data/images_torch_07/flickr8k/some_captions.txt",
        transform=transform,
        num_workers=0,
    )

    torch.backends.cudnn.benchmark = True
    # Check if run of GPU
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    load_model, save_model, train_CNN  = False, False, False

    ###### Hyperparameters
    embed_size = 2048
    hidden_size = 2048
    vocab_size = len(dataset.vocab)
    num_layers = 1
    learning_rate = 3e-4
    num_epochs = 2

    # Write on Tensorboard 
    writer = SummaryWriter("runs/flickr")
    step = 0
    
    ### Initialize model, loss, optimizer
    model = CNNtoRNN(embed_size, hidden_size, vocab_size, num_layers).to(device)
    criterion = nn.CrossEntropyLoss(ignore_index=dataset.vocab.stoi["<PAD>"])
    optimizer = optim.Adam(model.parameters(), lr=learning_rate)

    ##### Finetune the CNN
    for name, param in model.encoderCNN.inception.named_parameters():
        if "fc.weight" in name or "fc.bias" in name:
            param.requires_grad = True
        else:
            param.requires_grad = train_CNN

    if load_model:
        step = load_checkpoint(torch.load("./checkpoints/my_checkpoint.pth.tar"), model, optimizer)

    # Set train mode
    model.train()
    
    for epoch in range(num_epochs):
        print_examples(model, device, dataset)
        if save_model:
            checkpoint = {
                "state_dict": model.state_dict(),
                "optimizer": optimizer.state_dict(),
                "step": step,
            }
            save_checkpoint(checkpoint)

        for idx, (imgs, captions) in tqdm(enumerate(train_loader), total=len(train_loader), leave=False):
            imgs = imgs.to(device)
            captions = captions.to(device)

            # Pass complete captions tensor, excluding the last element
            outputs = model(imgs, captions[:, :-1])  
            # Calculate Loss adjust the target captions accordingly            
            loss = criterion(outputs.reshape(-1, outputs.shape[2]), captions[:, 1:].reshape(-1))  

            writer.add_scalar("Training loss", loss.item(), global_step=step)
            step += 1

            optimizer.zero_grad()
            #loss.backward(loss)
            loss.backward()
            optimizer.step()

<h2 style="color:#BF66F2"> Main </h2>

In [44]:
transform = transforms.Compose([transforms.Resize((224, 224)), transforms.ToTensor(),])

# Create Loader avoiding the UserWarning: This DataLoader will create 8 worker processes in total. 
with warnings.catch_warnings():
    warnings.filterwarnings("ignore")
    loader, dataset = get_loader("./data/images_torch_07/flickr8k/some_images/", "./data/images_torch_07/flickr8k/some_captions.txt", transform=transform)

In [45]:
with warnings.catch_warnings():
    warnings.filterwarnings("ignore")
    train()    

Example 1 CORRECT: Dog on a beach by the ocean
<class 'torch.Tensor'>
Example 1 OUTPUT: <UNK> <SOS> <SOS> <SOS> <SOS> <SOS> <SOS> <SOS> <SOS> <SOS> <SOS> <SOS> <SOS> <SOS> <SOS> <SOS> <SOS> <SOS> <SOS> <SOS> <SOS> <SOS> <SOS> <SOS> <SOS> <SOS> <SOS> <SOS> <SOS> <SOS> <SOS> <SOS> <SOS> <SOS> <SOS> <SOS> <SOS> <SOS> <SOS> <SOS> <SOS> <SOS> <SOS> <SOS> <SOS> <SOS> <SOS> <SOS> <SOS> <SOS>

Example 2 CORRECT: Child holding red frisbee outdoors
<class 'torch.Tensor'>
Example 2 OUTPUT: <EOS>

Example 3 CORRECT: Bus driving by parked cars
<class 'torch.Tensor'>
Example 3 OUTPUT: people a <EOS>

Example 4 CORRECT: A small boat in the ocean
<class 'torch.Tensor'>
Example 4 OUTPUT: fire man a people a man girl girl . a . water . <SOS> <SOS> <SOS> <SOS> <SOS> <SOS> <SOS> <SOS> <SOS> <SOS> <SOS> <SOS> <SOS> <SOS> <SOS> <SOS> <SOS> <SOS> <SOS> <SOS> <SOS> <SOS> <SOS> <SOS> <SOS> <SOS> <SOS> <SOS> <SOS> <SOS> <SOS> <SOS> <SOS> <SOS> <SOS> <SOS> <SOS>

Example 5 CORRECT: A cowboy riding a horse in the

                  

Example 1 CORRECT: Dog on a beach by the ocean




<class 'torch.Tensor'>
Example 1 OUTPUT: <UNK> <SOS> <SOS> <SOS> <SOS> <SOS> <SOS> <SOS> <SOS> <SOS> <SOS> <SOS> <SOS> <SOS> <SOS> <SOS> <SOS> <SOS> <SOS> <SOS> <SOS> <SOS> <SOS> <SOS> <SOS> <SOS> <SOS> <SOS> <SOS> <SOS> <SOS> <SOS> <SOS> <SOS> <SOS> <SOS> <SOS> <SOS> <SOS> <SOS> <SOS> <SOS> <SOS> <SOS> <SOS> <SOS> <SOS> <SOS> <SOS> <SOS>

Example 2 CORRECT: Child holding red frisbee outdoors
<class 'torch.Tensor'>
Example 2 OUTPUT: <EOS>

Example 3 CORRECT: Bus driving by parked cars
<class 'torch.Tensor'>
Example 3 OUTPUT: people a <EOS>

Example 4 CORRECT: A small boat in the ocean
<class 'torch.Tensor'>
Example 4 OUTPUT: fire man a people a man girl girl . a . water . <SOS> <SOS> <SOS> <SOS> <SOS> <SOS> <SOS> <SOS> <SOS> <SOS> <SOS> <SOS> <SOS> <SOS> <SOS> <SOS> <SOS> <SOS> <SOS> <SOS> <SOS> <SOS> <SOS> <SOS> <SOS> <SOS> <SOS> <SOS> <SOS> <SOS> <SOS> <SOS> <SOS> <SOS> <SOS> <SOS> <SOS>

Example 5 CORRECT: A cowboy riding a horse in the desert
<class 'torch.Tensor'>
Example 5 OUTPUT

                  

In [46]:
%load_ext tensorboard

In [47]:
%%script echo Skipping Tensorboard call 
# Activate Tensorboard on localhost 6006
!python -m tensorboard.main --logdir=logs/

Skipping Tensorboard call
