# Project: Image Captioning

---

In this notebook, you will train your CNN-RNN model.  

You are welcome and encouraged to try out many different architectures and hyperparameters when searching for a good model.

This does have the potential to make the project quite messy!  Before submitting your project, make sure that you clean up:
- the code you write in this notebook.  The notebook should describe how to train a single CNN-RNN architecture, corresponding to your final choice of hyperparameters.  You should structure the notebook so that the reviewer can replicate your results by running the code in this notebook.  
- the output of the code cell in **Step 2**.  The output should show the output obtained when training the model from scratch.

This notebook **will be graded**.  

Feel free to use the links below to navigate the notebook:
- [Step 1](#step1): Training Setup
- [Step 2](#step2): Train your Model
- [Step 3](#step3): (Optional) Validate your Model

<a id='step1'></a>
## Step 1: Training Setup

In this step of the notebook, you will customize the training of your CNN-RNN model by specifying hyperparameters and setting other options that are important to the training procedure.  The values you set now will be used when training your model in **Step 2** below.

You should only amend blocks of code that are preceded by a `TODO` statement.  **Any code blocks that are not preceded by a `TODO` statement should not be modified**.

### Task #1

Begin by setting the following variables:
- `batch_size` - the batch size of each training batch.  It is the number of image-caption pairs used to amend the model weights in each training step. 
- `vocab_threshold` - the minimum word count threshold.  Note that a larger threshold will result in a smaller vocabulary, whereas a smaller threshold will include rarer words and result in a larger vocabulary.  
- `vocab_from_file` - a Boolean that decides whether to load the vocabulary from file. 
- `embed_size` - the dimensionality of the image and word embeddings.  
- `hidden_size` - the number of features in the hidden state of the RNN decoder.  
- `num_epochs` - the number of epochs to train the model.  We recommend that you set `num_epochs=3`, but feel free to increase or decrease this number as you wish.  [This paper](https://arxiv.org/pdf/1502.03044.pdf) trained a captioning model on a single state-of-the-art GPU for 3 days, but you'll soon see that you can get reasonable results in a matter of a few hours!  (_But of course, if you want your model to compete with current research, you will have to train for much longer._)
- `save_every` - determines how often to save the model weights.  We recommend that you set `save_every=1`, to save the model weights after each epoch.  This way, after the `i`th epoch, the encoder and decoder weights will be saved in the `models/` folder as `encoder-i.pkl` and `decoder-i.pkl`, respectively.
- `print_every` - determines how often to print the batch loss to the Jupyter notebook while training.  Note that you **will not** observe a monotonic decrease in the loss function while training - this is perfectly fine and completely expected!  You are encouraged to keep this at its default value of `100` to avoid clogging the notebook, but feel free to change it.
- `log_file` - the name of the text file containing - for every step - how the loss and perplexity evolved during training.
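
To make the effect of `vocab_threshold` concrete, the sketch below counts words in a toy caption list and keeps only the words that occur at least `threshold` times. The captions and the helper `build_vocab` are hypothetical and exist only for illustration; the project's real vocabulary code may differ in detail (for example, in how it handles special tokens).

```python
from collections import Counter

def build_vocab(captions, threshold):
    """Return the sorted list of words that occur at least `threshold` times."""
    counts = Counter(word for caption in captions for word in caption.lower().split())
    return sorted(word for word, n in counts.items() if n >= threshold)

# Toy captions, made up for illustration only.
captions = [
    "a dog runs on the beach",
    "a dog plays with a ball",
    "a cat sleeps on the sofa",
]

small_vocab = build_vocab(captions, threshold=2)  # higher threshold, fewer words
large_vocab = build_vocab(captions, threshold=1)  # lower threshold, rarer words kept

print(len(small_vocab), small_vocab)
print(len(large_vocab), large_vocab)
```

Raising the threshold drops every word that appears only once, which is the trade-off described for `vocab_threshold` above.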

If you're not sure where to begin to set some of the values above, you can peruse [this paper](https://arxiv.org/pdf/1502.03044.pdf) and [this paper](https://arxiv.org/pdf/1411.4555.pdf) for useful guidance!  **To avoid spending too long on this notebook**, you are encouraged to consult these suggested research papers to obtain a strong initial guess for which hyperparameters are likely to work best.  Then, train a single model, and proceed to the next notebook (**3_Inference.ipynb**).  If you are unhappy with your performance, you can return to this notebook to tweak the hyperparameters (and/or the architecture in **model.py**) and re-train your model.

### Question 1

**Question:** Describe your CNN-RNN architecture in detail.  With this architecture in mind, how did you select the values of the variables in Task 1?  If you consulted a research paper detailing a successful implementation of an image captioning model, please provide the reference.



**Answer:** 

I first tried a baseline decoder with a very basic architecture: an embedding layer, an LSTM, and a linear layer that maps the LSTM output to scores over the vocabulary. You can find the implementation in model.py (see `BaseLineDecoderCNN`). The `EncoderCNN` was predefined. However, I read that it is better not to discard nearly all of the spatial information in the ResNet50 feature maps, so I removed the final average-pooling and fully connected layers. I then started training with a simple combination of this encoder (which retains more spatial information) and an LSTM-based decoder. Later I read that this is no longer state of the art, so I researched further and came across Transformer architectures with attention. Since Transformers are not RNN-based, I settled on a compromise:

I decided to implement something more advanced than a standard LSTM-based decoder and found a paper that suggests using an attention mechanism (Bahdanau attention) with GRU units. A GRU is still a recurrent unit, so I think this should be acceptable. You (the reviewer) can find the implementation in model.py: there is an attention module and a `DecoderGRU` implementation.
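
As a rough illustration of the attention idea, here is a dependency-free sketch of additive (Bahdanau-style) attention for a single decoding step. It is not the code from model.py, which uses PyTorch; the function name `additive_attention`, the tiny matrices, and the toy inputs are assumptions chosen to keep the example readable.

```python
import math

def additive_attention(features, hidden, W_f, W_h, v):
    """Bahdanau-style (additive) attention over a list of feature vectors.

    features: list of L vectors (size D), e.g. one per image region
    hidden:   decoder hidden state (size H)
    W_f: A x D matrix, W_h: A x H matrix, v: vector of size A
    Returns (weights, context): L attention weights and a size-D context vector.
    """
    def matvec(M, x):
        return [sum(m * xi for m, xi in zip(row, x)) for row in M]

    proj_h = matvec(W_h, hidden)
    scores = []
    for f in features:
        proj_f = matvec(W_f, f)
        # score_i = v . tanh(W_f f_i + W_h h)
        scores.append(sum(vi * math.tanh(a + b) for vi, a, b in zip(v, proj_f, proj_h)))

    # Softmax over regions (shifted by the max for numerical stability).
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]

    # Context vector: attention-weighted sum of the region features.
    dim = len(features[0])
    context = [sum(w * f[d] for w, f in zip(weights, features)) for d in range(dim)]
    return weights, context

# Tiny made-up example: 3 regions, feature size 2, hidden size 2, attention size 2.
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
hidden = [0.5, -0.5]
W_f = [[1.0, 0.0], [0.0, 1.0]]
W_h = [[1.0, 0.0], [0.0, 1.0]]
v = [1.0, 1.0]

weights, context = additive_attention(features, hidden, W_f, W_h, v)
print(weights, context)
```

At each step the decoder receives the context vector, so it can focus on different image regions for different words.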

How I chose the hyperparameters:

batch_size = 256          # my laptop GPU has 8 GB of memory and this size fit my hardware (I always monitor my hardware when I train a net)
vocab_threshold = 5        # kept at the default value
embed_size = 256           # see 1)
hidden_size = 512          # see 2) 
num_epochs = 3             # one epoch took about 2 h, so I left my laptop running overnight for 3 epochs

1) Many pre-trained models (e.g., ResNet) output feature maps with high channel counts (often 512, 1024, or 2048). Reducing these to an embed_size of 256 is a good compromise for creating manageable embeddings that still retain significant information. An embed size of 256 helps reduce the dimensions while still maintaining a rich representation, making it easier to process the embeddings in the attention mechanism and the RNN.

2) With a hidden_size of 512, the GRU has enough memory to maintain a strong representation of the sentence’s context as it is generated.
    For sequential tasks like captioning, larger hidden sizes can help the RNN retain important information over longer sequences, which is essential in generating coherent captions.
    
I started with those values and because one epoch took more than 2 hours, I didn't have much room for empirical experiments. 

My source: https://arxiv.org/pdf/2203.01594

### (Optional) Task #2

Note that we have provided a recommended image transform `transform_train` for pre-processing the training images, but you are welcome (and encouraged!) to modify it as you wish.  When modifying this transform, keep in mind that:
- the images in the dataset have varying heights and widths, and 
- if using a pre-trained model, you must perform the corresponding appropriate normalization.

### Question 2

**Question:** How did you select the transform in `transform_train`?  If you left the transform at its provided value, why do you think that it is a good choice for your CNN architecture?

**Answer:** 

I left the transforms as provided. The COCO dataset already contains good-quality photos, so the standard transformations (resize, random crop, horizontal flip) seem adequate to me, and the normalization is the one intended for the pre-trained model. Also, because of the time one epoch took to train, I did not have much room for experiments. 

### Task #3

Next, you will specify a Python list containing the learnable parameters of the model.  For instance, if you decide to make all weights in the decoder trainable, but only want to train the weights in the embedding layer of the encoder, then you should set `params` to something like:
```
params = list(decoder.parameters()) + list(encoder.embed.parameters()) 
```

### Question 3

**Question:** How did you select the trainable parameters of your architecture?  Why do you think this is a good choice?

**Answer:**

Because I wanted to leverage the pretrained ResNet50 weights, I only enabled gradients for the newly added layer in the encoder: 

# Map the ResNet feature map to the desired embedding size
self.embed = nn.Conv2d(2048, embed_size, kernel_size=1)  # 2048 channels in ResNet50's final conv layer

So for the encoder, I only enabled gradients for the newly added `embed` layer, which replaces the last two layers of ResNet50 and produces an embedding of the features.
As mentioned before, I removed the last two layers (average pooling and fully connected) to keep more spatial information for the decoder.

All parameters of the decoder are trainable, since the decoder has no pretrained weights and has to learn everything during training.

### Task #4

Finally, you will select an [optimizer](http://pytorch.org/docs/master/optim.html#torch.optim.Optimizer).

### Question 4

**Question:** How did you select the optimizer used to train your model?

**Answer:** 

I searched online, and Adam is once again a good default choice (as usual, because of its built-in momentum). As mentioned above, I did not have much room for experiments because of the time an epoch took to train (about 2 h per epoch).
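
To show what "built-in momentum" means, here is a minimal sketch of one Adam update for a single scalar parameter. The default values mirror the commonly documented Adam defaults and the `lr=1e-3` used in this notebook; the function `adam_step` and the gradient values are hypothetical, and real training should of course use `torch.optim.Adam`.

```python
import math

def adam_step(param, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a single scalar parameter; returns (param, m, v)."""
    m = beta1 * m + (1 - beta1) * grad         # running mean of gradients (momentum)
    v = beta2 * v + (1 - beta2) * grad * grad  # running mean of squared gradients
    m_hat = m / (1 - beta1 ** t)               # bias correction for the zero initialization
    v_hat = v / (1 - beta2 ** t)
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v

# Made-up gradients for three consecutive steps.
param, m, v = 1.0, 0.0, 0.0
for t, grad in enumerate([0.5, 0.4, -0.1], start=1):
    param, m, v = adam_step(param, grad, m, v, t)
    print(t, round(param, 6))
```

Because the step is divided by the running gradient magnitude, the first update has a size of roughly `lr` regardless of how large the gradient is.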

In [34]:
import torch
import torch.nn as nn
from torchvision import transforms
import sys
#sys.path.append('/opt/cocoapi/PythonAPI')
from pycocotools.coco import COCO
from data_loader import get_loader
from model import EncoderCNN, BaseLineDecoderCNN, DecoderGRU
import math
import torch.optim as optim

# watch for any changes in model.py, if it changes, re-load it automatically
%load_ext autoreload
%autoreload 2

torch.cuda.empty_cache()

device = 'cuda' if torch.cuda.is_available() else 'cpu'

## Select appropriate values for the Python variables below.
batch_size = 256          # batch size
vocab_threshold = 5        # minimum word count threshold
vocab_from_file = True    # if True, load existing vocab file
embed_size = 256           # dimensionality of image and word embeddings
hidden_size = 512          # number of features in hidden state of the RNN decoder
num_epochs = 2             # number of training epochs
save_every = 1             # determines frequency of saving model weights
print_every = 100          # determines window for printing average loss
log_file = 'training_log.txt'       # name of file with saved training loss and perplexity

# (Optional) Amend the image transform below.
transform_train = transforms.Compose([ 
    transforms.Resize(256),                          # smaller edge of image resized to 256
    transforms.RandomCrop(224),                      # get 224x224 crop from random location
    transforms.RandomHorizontalFlip(),               # horizontally flip image with probability=0.5
    transforms.ToTensor(),                           # convert the PIL Image to a tensor
    transforms.Normalize((0.485, 0.456, 0.406),      # normalize image for pre-trained model
                         (0.229, 0.224, 0.225))])

# Build data loader.
data_loader = get_loader(transform=transform_train,
                         mode='train',
                         batch_size=batch_size,
                         vocab_threshold=vocab_threshold,
                         vocab_from_file=vocab_from_file,
                         num_workers=10)

# The size of the vocabulary.
vocab_size = len(data_loader.dataset.vocab)

# Initialize the encoder and decoder. 
encoder = EncoderCNN(embed_size)
decoder = DecoderGRU(embed_size, hidden_size, vocab_size)

# Move models to GPU if CUDA is available. 
encoder.to(device)
decoder.to(device)

# 'models/encoder-1.pkl', 'models/decoder-1.pkl'
encoder.load_state_dict(torch.load('models/encoder-3.pkl', map_location=device))
decoder.load_state_dict(torch.load('models/decoder-3.pkl', map_location=device))

# Define the loss function. 
criterion = nn.CrossEntropyLoss().to(device)

# Specify the learnable parameters of the model.
params = []
for name, param in encoder.named_parameters():
    if "embed" in name:
        param.requires_grad = True
        params.append(param)

# Add all parameters of the Decoder
params += list(decoder.parameters())

# Define the optimizer.
learning_rate = 1e-3  # You can start with this learning rate and adjust based on results
optimizer = optim.Adam(params, lr=learning_rate)

# Set the total number of training steps per epoch.
total_step = math.ceil(len(data_loader.dataset.caption_lengths) / data_loader.batch_sampler.batch_size)

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload
Vocabulary successfully loaded from vocab.pkl file!
loading annotations into memory...
Done (t=1.56s)
creating index...
index created!
Obtaining caption lengths...


100%|██████████| 591753/591753 [00:55<00:00, 10619.58it/s]
  encoder.load_state_dict(torch.load('models/encoder-3.pkl', map_location=device))
  decoder.load_state_dict(torch.load('models/decoder-3.pkl', map_location=device))


<a id='step2'></a>
## Step 2: Train your Model

Once you have executed the code cell in **Step 1**, the training procedure below should run without issue.  

It is completely fine to leave the code cell below as-is without modifications to train your model.  However, if you would like to modify the code used to train the model below, you must ensure that your changes are easily parsed by your reviewer.  In other words, make sure to provide appropriate comments to describe how your code works!  

You may find it useful to load saved weights to resume training.  In that case, note the names of the files containing the encoder and decoder weights that you'd like to load (`encoder_file` and `decoder_file`).  Then you can load the weights by using the lines below:

```python
import os

# Load pre-trained weights before resuming training.
encoder.load_state_dict(torch.load(os.path.join('./models', encoder_file)))
decoder.load_state_dict(torch.load(os.path.join('./models', decoder_file)))
```

While trying out parameters, make sure to take extensive notes and record the settings that you used in your various training runs.  In particular, you don't want to encounter a situation where you've trained a model for several hours but can't remember what settings you used :).

### A Note on Tuning Hyperparameters

To figure out how well your model is doing, you can look at how the training loss and perplexity evolve during training - and for the purposes of this project, you are encouraged to amend the hyperparameters based on this information.  
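
Since the per-step loss is noisy, a smoothed view of the log can be easier to read. The sketch below parses lines in the format written by the training loop and computes a moving average of the loss. The helpers `parse_log` and `moving_average` and the window size are assumptions for illustration, not part of the provided project code.

```python
import re

LINE_RE = re.compile(
    r"Epoch \[(\d+)/\d+\], Step \[(\d+)/\d+\], Loss: ([\d.]+), Perplexity: ([\d.]+)"
)

def parse_log(lines):
    """Extract (epoch, step, loss) tuples from training log lines."""
    records = []
    for line in lines:
        match = LINE_RE.search(line)
        if match:
            records.append((int(match.group(1)), int(match.group(2)), float(match.group(3))))
    return records

def moving_average(values, window):
    """Average of the last `window` values at each position."""
    averaged = []
    for i in range(len(values)):
        chunk = values[max(0, i - window + 1): i + 1]
        averaged.append(sum(chunk) / len(chunk))
    return averaged

# Sample lines in the log format; in practice, read them from the
# log file (e.g. `open(log_file)`) once training has written it.
sample = [
    "Epoch [1/2], Step [100/2312], Loss: 2.3420, Perplexity: 10.4017",
    "Epoch [1/2], Step [200/2312], Loss: 2.3748, Perplexity: 10.7488",
    "Epoch [1/2], Step [400/2312], Loss: 3.3749, Perplexity: 29.2206",
    "Epoch [1/2], Step [500/2312], Loss: 2.3255, Perplexity: 10.2319",
]
records = parse_log(sample)
losses = [loss for _, _, loss in records]
smoothed = moving_average(losses, window=2)
print(smoothed)
```

Perplexity does not need to be parsed separately, because the training loop computes it as the exponential of the loss.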

However, this will not tell you if your model is overfitting to the training data, and, unfortunately, overfitting is a problem that is commonly encountered when training image captioning models.  

For this project, you need not worry about overfitting. **This project does not have strict requirements regarding the performance of your model**, and you just need to demonstrate that your model has learned **_something_** when you generate captions on the test data.  For now, we strongly encourage you to train your model for the suggested 3 epochs without worrying about performance; then, you should immediately transition to the next notebook in the sequence (**3_Inference.ipynb**) to see how your model performs on the test data.  If your model needs to be changed, you can come back to this notebook, amend hyperparameters (if necessary), and re-train the model.

That said, if you would like to go above and beyond in this project, you can read about some approaches to minimizing overfitting in section 4.3.1 of [this paper](http://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=7505636).  In the next (optional) step of this notebook, we provide some guidance for assessing the performance on the validation dataset.

In [10]:
# watch for any changes in model.py, if it changes, re-load it automatically
%load_ext autoreload
%autoreload 2

import torch.utils.data as data
import numpy as np
import os
import requests
import time

# Open the training log file.
f = open(log_file, 'w')

old_time = time.time()

for epoch in range(1, num_epochs+1):
    
    for i_step in range(1, total_step+1):
        
        if time.time() - old_time > 60:
            old_time = time.time()
        
        # Randomly sample a caption length, and sample indices with that length.
        indices = data_loader.dataset.get_train_indices()
        
        # Create and assign a batch sampler to retrieve a batch with the sampled indices.
        new_sampler = data.sampler.SubsetRandomSampler(indices=indices)
        data_loader.batch_sampler.sampler = new_sampler
        
        # Obtain the batch.
        images, captions = next(iter(data_loader))

        # Move batch of images and captions to GPU if CUDA is available.
        images = images.to(device)
        captions = captions.to(device)
        
        # Zero the gradients.
        decoder.zero_grad()
        encoder.zero_grad()
        
        # Pass the inputs through the CNN-RNN model.
        features = encoder(images)
        outputs = decoder(features, captions)

        # Calculate the batch loss.        
        # Use reshape instead of view: it also works when the tensor is not contiguous.
        loss = criterion(outputs.reshape(-1, vocab_size), captions[:, 1:].reshape(-1))

        # Backward pass.
        loss.backward()
        
        # Update the parameters in the optimizer.
        optimizer.step()
            
        # Get training statistics.
        stats = 'Epoch [%d/%d], Step [%d/%d], Loss: %.4f, Perplexity: %5.4f' % (epoch, num_epochs, i_step, total_step, loss.item(), np.exp(loss.item()))
        
        # Print training statistics (on same line).
        print('\r' + stats, end="")
        sys.stdout.flush()
        
        # Print training statistics to file.
        f.write(stats + '\n')
        f.flush()
        
        # Print training statistics (on different line).
        if i_step % print_every == 0:
            print('\r' + stats)
            
    # Save the weights.
    if epoch % save_every == 0:
        torch.save(decoder.state_dict(), os.path.join('./models', 'decoder-%d.pkl' % epoch))
        torch.save(encoder.state_dict(), os.path.join('./models', 'encoder-%d.pkl' % epoch))

# Close the training log file.
f.close()

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload
Epoch [1/2], Step [100/2312], Loss: 2.3420, Perplexity: 10.4017
Epoch [1/2], Step [200/2312], Loss: 2.3748, Perplexity: 10.7488
Epoch [1/2], Step [300/2312], Loss: 2.7663, Perplexity: 15.89945
Epoch [1/2], Step [400/2312], Loss: 3.3749, Perplexity: 29.2206
Epoch [1/2], Step [500/2312], Loss: 2.3255, Perplexity: 10.2319
Epoch [1/2], Step [600/2312], Loss: 2.3154, Perplexity: 10.1291
Epoch [1/2], Step [700/2312], Loss: 2.2872, Perplexity: 9.84718
Epoch [1/2], Step [800/2312], Loss: 2.2652, Perplexity: 9.63280
Epoch [1/2], Step [900/2312], Loss: 2.2661, Perplexity: 9.64184
Epoch [1/2], Step [1000/2312], Loss: 2.2079, Perplexity: 9.0966
Epoch [1/2], Step [1100/2312], Loss: 2.3775, Perplexity: 10.7783
Epoch [1/2], Step [1200/2312], Loss: 2.2434, Perplexity: 9.42545
Epoch [1/2], Step [1300/2312], Loss: 2.1875, Perplexity: 8.91286
Epoch [1/2], Step [1400/2312], Loss: 2.1131, Perplexity: 8.27402
Epoch [1/2]

<a id='step3'></a>
## Step 3: (Optional) Validate your Model

To assess potential overfitting, one approach is to assess performance on a validation set.  If you decide to do this **optional** task, you are required to first complete all of the steps in the next notebook in the sequence (**3_Inference.ipynb**); as part of that notebook, you will write and test code (specifically, the `sample` method in the `DecoderRNN` class) that uses your RNN decoder to generate captions.  That code will prove incredibly useful here. 

If you decide to validate your model, please do not edit the data loader in **data_loader.py**.  Instead, create a new file named **data_loader_val.py** containing the code for obtaining the data loader for the validation data.  You can access:
- the validation images at filepath `'/opt/cocoapi/images/train2014/'`, and
- the validation image caption annotation file at filepath `'/opt/cocoapi/annotations/captions_val2014.json'`.

The suggested approach to validating your model involves creating a json file such as [this one](https://github.com/cocodataset/cocoapi/blob/master/results/captions_val2014_fakecap_results.json) containing your model's predicted captions for the validation images.  Then, you can write your own script or use one that you [find online](https://github.com/tylin/coco-caption) to calculate the BLEU score of your model.  You can read more about the BLEU score, along with other evaluation metrics (such as METEOR and CIDEr) in section 4.1 of [this paper](https://arxiv.org/pdf/1411.4555.pdf).  For more information about how to use the annotation file, check out the [website](http://cocodataset.org/#download) for the COCO dataset.
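
As a rough sketch of how such a results file could be produced, the snippet below converts a dictionary of predictions into a list of `image_id`/`caption` records and serializes it as JSON. The helper `to_coco_results`, the image ids, the captions, and the output file name are all made up for illustration; real image ids come from the annotation file.

```python
import json

def to_coco_results(predictions):
    """Convert {image_id: caption} into a list of {"image_id", "caption"} records."""
    return [
        {"image_id": int(image_id), "caption": caption}
        for image_id, caption in sorted(predictions.items())
    ]

# Hypothetical predictions with made-up image ids.
predictions = {
    1: "a man standing in a kitchen",
    2: "a table with a bowl of fruit",
}

results = to_coco_results(predictions)
payload = json.dumps(results, indent=2)
print(payload)

# To save the file (the name is an assumption):
# with open("captions_val_results.json", "w") as fh:
#     fh.write(payload)
```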

In [9]:
# Debugging for the poor. My debugger isn't working in my PyCharm. 
def inspect_data_types(data, depth=2, current_depth=0):
    """
    Recursively inspects data types within a nested structure up to a specified depth.
    
    Args:
    - data: The data structure to inspect (e.g., dict, list, tuple).
    - depth: The maximum depth to inspect.
    - current_depth: The current depth in recursion (default is 0).

    Returns:
    - A nested structure representing data types.
    """
    if current_depth >= depth:
        return type(data)

    if isinstance(data, dict):
        return {key: inspect_data_types(value, depth, current_depth + 1) for key, value in data.items()}

    elif isinstance(data, list):
        return [inspect_data_types(item, depth, current_depth + 1) for item in data]

    elif isinstance(data, tuple):
        return tuple(inspect_data_types(item, depth, current_depth + 1) for item in data)

    else:
        return type(data)

### Why Only Five Captions Were Delivered by `DataLoader`

I ran into an issue with the `DataLoader`: the batch loader delivered only 5 captions. This was caused by how PyTorch's `DataLoader` handles batching, combined with the structure of the captions in the dataset.

### Original Issue with `DataLoader` and Batching

1. **DataLoader's Expectation of Consistent Data Shapes**:
   - By default, `DataLoader` expects each item in a batch to have a **consistent tensor shape**. This means that all elements (e.g., images or labels) should have the same structure, allowing them to be stacked into a single tensor for efficient processing.
   - In the original setup, each image had a list of captions, which caused variability in the data structure. Some images might have 5 captions, others fewer, creating a **list of lists** (variable-length lists) that cannot be directly stacked into a tensor.

2. **Automatic Collation Issue**:
   - Without a custom collate function, `DataLoader` tries to collate the data automatically.
   - When the default collation encounters the per-image lists of captions, it regroups them position by position: all first captions together, all second captions together, and so on.
   - As a result, the batch contains 5 caption groups (one per caption position, each spanning the whole batch) instead of one list of 5 captions per image. If the images in a batch had different numbers of captions, this regrouping would fail outright.
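
One way to see where the number 5 comes from is to mimic the regrouping in plain Python. The sketch below assumes that default collation handles equal-length nested lists by grouping the i-th element of every sample together, which is my understanding of PyTorch's behavior. Both helper functions and the toy batch are hypothetical.

```python
def default_like_collate(caption_lists):
    """Mimic how default collation treats equal-length nested lists: it transposes them."""
    return [tuple(group) for group in zip(*caption_lists)]

def keep_per_image(caption_lists):
    """What a custom collate function does instead: keep one caption list per image."""
    return [list(captions) for captions in caption_lists]

# A made-up batch of 3 images with 5 captions each.
batch = [[f"image {i} caption {j}" for j in range(5)] for i in range(3)]

transposed = default_like_collate(batch)
grouped = keep_per_image(batch)

print(len(transposed), len(transposed[0]))  # 5 groups, each holding 3 captions
print(len(grouped), len(grouped[0]))        # 3 images, each with 5 captions
```

Iterating over the transposed result yields exactly 5 items, no matter how many images are in the batch.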

### Solution with `collate_fn`

The custom `collate_fn` resolves this by:
1. **Separating Images and Captions Explicitly**:
   - The custom function `collate_fn` separates the images and captions at the batch level. It stacks images into a tensor and keeps captions as a list of lists, where each sublist corresponds to the captions for a single image.
   
2. **Preserving Caption Structure**:
   - By preserving each list of captions as a whole, the `collate_fn` avoids the automatic collation issue. `DataLoader` no longer tries to interpret the captions in the wrong way.
   - The result is a batch where the images are neatly stacked into a tensor, and the captions are correctly grouped by image, with each image’s captions kept in a separate list.

### Summary

The custom `collate_fn` prevented `DataLoader` from trying to automatically collate the lists of captions, which it had regrouped incorrectly. Instead, the function preserved the intended structure by handling each image and its associated captions as a distinct pair, resulting in the correct number of captions per image in each batch.


In [21]:
from PIL import Image
from torchvision import transforms
from pycocotools.coco import COCO
from torch.utils.data import DataLoader, Dataset
import os

# Define the transformations
transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
])

# Load the COCO validation dataset
val_coco = COCO(annotation_file='datasets/COCO/annotations/captions_val2017.json')
val_images_dir = 'datasets/COCO/images/val2017/'

# we have to intervene in the batching process to handle the caption list of lists problem
def collate_fn(batch):
    # Separate images and captions
    images, captions = zip(*batch)

    # Stack images into a tensor
    images = torch.stack(images, dim=0)

    return images, list(captions)  # List of captions for each image

# Define a Dataset for the validation images and captions
class COCOValidationDataset(Dataset):
    def __init__(self, coco, images_dir, transform=None):
        self.coco = coco
        self.images_dir = images_dir
        self.transform = transform
        self.image_ids = list(self.coco.imgs.keys())

    def __len__(self):
        return len(self.image_ids)

    def __getitem__(self, idx):
        image_id = self.image_ids[idx]
        image_info = self.coco.loadImgs(image_id)[0]
        image_path = os.path.join(self.images_dir, image_info['file_name'])
        image = Image.open(image_path).convert("RGB")

        if self.transform:
            image = self.transform(image)

        # Retrieve captions for the image
        captions_ids = self.coco.getAnnIds(imgIds=image_id)
        captions = [self.coco.anns[c]['caption'] for c in captions_ids]

        return image, captions


# Create a DataLoader for the validation dataset
val_dataset = COCOValidationDataset(val_coco, val_images_dir, transform=transform)
val_loader = DataLoader(val_dataset, batch_size=32, shuffle=False, collate_fn=collate_fn)


loading annotations into memory...
Done (t=0.02s)
creating index...
index created!


In [18]:
from torch.utils.data import DataLoader

# Test the DataLoader to ensure it works as expected
for images, captions in val_loader:
    print("Batch of images:", images.shape)
    print("Batch of captions:", captions)

    # Check each image and its associated captions
    for i, (image, caption_list) in enumerate(zip(images, captions)):
        print(f"Image {i + 1} has {len(caption_list)} captions.")
        for caption in caption_list:
            print(f" - {caption}")

    # Break after first batch for demonstration purposes
    break



Batch of images: torch.Size([7, 3, 224, 224])
Batch of captions: [['A man is in a kitchen making pizzas.', 'Man in apron standing on front of oven with pans and bakeware', 'A baker is working in the kitchen rolling dough.', 'A person standing by a stove in a kitchen.', 'A table with pies being made and a person standing near a wall with pots and pans hanging on the wall.'], ['The dining table near the kitchen has a bowl of fruit on it.', 'A small kitchen has various appliances and a table.', 'The kitchen is clean and ready for us to see.', 'A kitchen and dining area decorated in white.', 'A kitchen that has a bowl of fruit on the table.'], ['a person with a shopping cart on a city street ', 'City dwellers walk by as a homeless man begs for cash.', 'People walking past a homeless man begging on a city street', 'a homeless man holding a cup and standing next to a shopping cart on a street', 'People are walking on the street by a homeless person.'], ['A person on a skateboard and bike at 

In [32]:
# watch for any changes in model.py, if it changes, re-load it automatically
%load_ext autoreload
%autoreload 2

from predict_model import ImageCaptioningPredictor

# (Optional) Validate your model.
model = ImageCaptioningPredictor(encoder, decoder, data_loader.dataset.vocab, device=device)

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [33]:
# watch for any changes in model.py, if it changes, re-load it automatically
%load_ext autoreload
%autoreload 2

from nltk.translate.bleu_score import sentence_bleu
import torch

# Initialize lists to store generated captions and ground truth captions
generated_captions = []
ground_truth_captions = []

exit_after_batch = 1
batch = 0

# Iterate over the validation DataLoader
for images, captions in val_loader:    
    # Generate caption for each image in the batch
    for i in range(images.size(0)):  # Iterate through each image in the batch
        with torch.no_grad():
            generated_caption = model.generate_caption(images[i].unsqueeze(0))  # Pass a single image with batch dimension

        # Append the generated caption and ground truth captions
        generated_captions.append(generated_caption)

        # Store the ground-truth captions in lowercase
        ground_truth_captions.append([caption.lower() for caption in captions[i]])

        # Optionally, print or log the result for the first few samples
        if len(generated_captions) <= 5:  # Print first 5 samples
            print(f"Generated Caption: {generated_caption}")
            print(f"Ground Truth Captions: {captions[i]}\n")
        
    batch += 1
    
    if batch >= exit_after_batch:
        break

# After generating all captions, calculate BLEU scores
bleu_scores = []
for gen_caption, gt_captions in zip(generated_captions, ground_truth_captions):
    # Compute BLEU score against all ground truth captions for the image
    bleu_score = sentence_bleu([c.split() for c in gt_captions], gen_caption.split())
    bleu_scores.append(bleu_score)

# Print average BLEU score over the validation set
average_bleu_score = sum(bleu_scores) / len(bleu_scores)
print(f"Average BLEU Score on Validation Set: {average_bleu_score:.4f}")


The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload
features shape: torch.Size([1, 49, 256])

OK, only 10%, but I trained for just 3 epochs, and one epoch took roughly 2 h. BLEU is also not the best metric here, because it only checks for n-gram overlap.
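
To illustrate what "n-gram overlap" means, here is a small sketch of the clipped n-gram precision that BLEU is built on. It shows only the precision part; full BLEU additionally combines several n-gram orders and applies a brevity penalty. The candidate caption is made up, and the references are lowercased, punctuation-free versions of two ground-truth captions printed earlier.

```python
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def clipped_precision(candidate, references, n):
    """Modified n-gram precision: candidate counts are clipped by the max count in any reference."""
    cand_counts = Counter(ngrams(candidate, n))
    if not cand_counts:
        return 0.0
    max_ref = Counter()
    for ref in references:
        for gram, count in Counter(ngrams(ref, n)).items():
            max_ref[gram] = max(max_ref[gram], count)
    clipped = sum(min(count, max_ref[gram]) for gram, count in cand_counts.items())
    return clipped / sum(cand_counts.values())

candidate = "a man is cooking in a kitchen".split()
references = [
    "a man is in a kitchen making pizzas".split(),
    "a person standing by a stove in a kitchen".split(),
]
p1 = clipped_precision(candidate, references, 1)
p2 = clipped_precision(candidate, references, 2)
print(p1, p2)
```

The word "cooking" counts as a miss even though it is close in meaning to "making pizzas", which is the weakness of pure n-gram matching.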

### Better Alternatives to BLEU for Caption Validation

For image captioning, several metrics provide a more comprehensive evaluation than BLEU, as they consider aspects beyond simple n-gram overlap. Here are some alternatives that are often considered better suited to the task:

---

### 1. **METEOR (Metric for Evaluation of Translation with Explicit ORdering)**

- **Description**: METEOR improves upon BLEU by considering **synonyms, stemming,** and **paraphrases**. It evaluates word matches not only for exact n-gram matches but also for related words, which allows it to better handle the diversity of language.
- **Advantages**: 
  - **Semantic Sensitivity**: Recognizes variations in phrasing and synonyms.
  - **Higher Correlation with Human Judgments**: Studies have shown METEOR correlates better with human evaluations than BLEU.
- **Consideration**: Slower to compute than BLEU, as it requires stemming, synonym checking, and paraphrase evaluation.

---

### 2. **CIDEr (Consensus-based Image Description Evaluation)**

- **Description**: CIDEr was specifically developed for image captioning tasks. It measures the similarity between generated captions and reference captions using **term frequency-inverse document frequency (TF-IDF)** weighting, which emphasizes relevant and unique content.
- **Advantages**:
  - **Focus on Salient Information**: By weighting terms based on TF-IDF, CIDEr emphasizes uncommon or unique words that are more likely to describe important aspects of an image.
  - **Strong Suitability for Image Captioning**: CIDEr tends to align well with human judgment on captioning tasks because it captures essential descriptive content.
- **Consideration**: Typically better suited than BLEU for image captioning, and now one of the most widely accepted metrics in this domain.

---

### 3. **SPICE (Semantic Propositional Image Caption Evaluation)**

- **Description**: SPICE evaluates the **semantic content** of captions by parsing both the generated and reference captions into a semantic graph, where it compares objects, attributes, and relationships.
- **Advantages**:
  - **Focus on Semantic Content**: SPICE directly evaluates whether the generated caption includes the key entities and relationships present in the reference captions.
  - **Strong Correlation with Human Judgments**: Particularly strong in assessing high-level content and relations, SPICE correlates well with human evaluations on complex captions.
- **Consideration**: Computationally intensive and doesn’t penalize incorrect word order or grammar, so it’s best used alongside other metrics.

---

### 4. **ROUGE (Recall-Oriented Understudy for Gisting Evaluation)**

- **Description**: ROUGE is widely used for summarization but is also applicable to image captioning. It measures recall and precision based on **overlapping n-grams** and **longest common subsequence**.
- **Advantages**:
  - **Recall-Based**: ROUGE scores are often useful in ensuring that key phrases and words are present in the generated caption, making it sensitive to missing important details.
  - **Good Complementary Metric**: ROUGE can be used alongside CIDEr or SPICE for a more rounded evaluation.
- **Consideration**: ROUGE is less suited for image captioning than CIDEr or SPICE but can still be a useful additional metric.
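
The longest-common-subsequence part of ROUGE can be sketched in a few lines. The snippet below computes precision, recall, and a plain F1 score from the LCS length; published ROUGE-L variants may weight recall differently, so treat this as an illustration and not a reference implementation. The function names and the example captions are assumptions.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for token_a in a:
        curr = [0]
        for j, token_b in enumerate(b, start=1):
            if token_a == token_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def rouge_l(candidate, reference):
    """Return (precision, recall, F1) based on the longest common subsequence."""
    lcs = lcs_length(candidate, reference)
    if lcs == 0:
        return 0.0, 0.0, 0.0
    precision = lcs / len(candidate)
    recall = lcs / len(reference)
    return precision, recall, 2 * precision * recall / (precision + recall)

candidate = "a man cooking in a kitchen".split()
reference = "a man is in a kitchen making pizzas".split()
print(rouge_l(candidate, reference))
```

Unlike n-gram matching, the subsequence does not have to be contiguous, so inserted or missing words in the middle of a caption are penalized less harshly.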

---

### Recommended Combination for Image Captioning

In practice, it’s often best to use **multiple metrics** together to get a balanced view of caption quality. For image captioning tasks, a combination like **CIDEr, SPICE,** and **METEOR** generally provides a robust evaluation:

1. **CIDEr** to measure the relevance and uniqueness of content in the captions.
2. **SPICE** to evaluate semantic correctness (e.g., objects, attributes, relationships).
3. **METEOR** to capture synonyms, stemming, and phrasing variations for a broader evaluation of language quality.

This combination gives a comprehensive view of both content accuracy and linguistic quality, aligning well with human judgment.
