# PART-A: Image Captioning with CNN Encoder and RNN Decoder

This notebook implements an image captioning model using a Convolutional Neural Network (CNN) as an encoder and a Long Short-Term Memory (LSTM)-based Recurrent Neural Network (RNN) as a decoder. The model aims to generate descriptive captions for images, bridging computer vision and natural language processing.

This notebook is divided into six sections:
- Importing and Setup
- Preprocessing
- Model Creation
- Model Training
- Evaluation and Results
- Final Output

## Importing and Setup
In this section, we import necessary libraries and set up the environment for building the image captioning model. This includes defining functions for preprocessing captions, loading the dataset, and creating a vocabulary.

- **Imports**: Libraries such as os, torch, torch.utils.data, transforms, pandas, Image, numpy, and json are imported.

- **Vocabulary Builder**: The build_vocab function splits captions on spaces, counts word frequencies, and keeps only words that occur at least threshold times. Index 0 is reserved for the unknown-word token, which stands in for every other word.

- **Caption Tokenization**: The tokenize_captions function maps each word to its vocabulary index, truncates the sequence to max_length, and pads shorter sequences with 0 so that all sequences have the same length.

- **Dataset Loading**: Train, test, and validation datasets are loaded from CSV files containing image filenames and captions.

- **Vocabulary Setup**: Captions from the train and validation datasets are combined, the vocabulary is built from them with a threshold of 3, and the captions are tokenized with a maximum length of 40.

- **Print Dataset Information**: Display the sizes of the train, test, and validation datasets, along with the vocabulary size.

In [65]:
import os
import torch 
from torch.utils.data import DataLoader, Dataset
from torchvision import transforms
import pandas as pd
from PIL import Image

In [66]:
import numpy as np
import json
from collections import Counter

def build_vocab(captions, threshold=5):
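    # Count word frequencies and track the longest caption (in words)
    word_counts = Counter()
    max_caption_len = 0
    for caption in captions:
        caption_len = 0
        # Note: the markers are joined without spaces, so unless the caption starts or
        # ends with a space they fuse with the first and last words rather than
        # becoming separate '<start>' and '<end>' tokens
        caption = '<start>' + caption + '<end>'
        for word in caption.split(' '):
            word_counts[word] += 1
            caption_len += 1
        max_caption_len = max(max_caption_len, caption_len)

    # Create vocabulary from words that occur at least `threshold` times
    words = [word for word, count in word_counts.items() if count >= threshold]
    word2idx = {word: idx + 1 for idx, word in enumerate(words)}  # Start indexing from 1
    word2idx['<unk>'] = 0  # Add special token for unknown words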
    idx2word = {idx: word for word, idx in word2idx.items()}
    
    return word2idx, idx2word, max_caption_len


def tokenize_captions(captions, word2idx, max_length):
    caption_tokens = []
    for caption in captions:
        tokens = [word2idx.get(word, word2idx['<unk>']) for word in caption.split(' ')]
        # Truncate to max_length, then pad with 0 (which is also the '<unk>' index)
        tokens = tokens[:max_length] + [0] * (max_length - len(tokens))
        caption_tokens.append(tokens)
    return np.array(caption_tokens)
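
As an optional variant (not what the cells above do), a vocabulary can reserve separate indices for padding and for the start and end markers. Padding is then distinguishable from unknown words, and the end marker becomes a token the decoder can learn to emit. A minimal sketch follows; the names `build_vocab_with_specials` and `encode_caption` and the `SPECIALS` list are illustrative assumptions, not part of this notebook:

```python
from collections import Counter

# Hypothetical special tokens; index 0 is reserved for padding only
SPECIALS = ['<pad>', '<start>', '<end>', '<unk>']

def build_vocab_with_specials(captions, threshold=1):
    """Map words seen at least `threshold` times to indices after the special tokens."""
    counts = Counter(word for caption in captions for word in caption.split())
    words = [w for w, c in counts.items() if c >= threshold]
    word2idx = {tok: i for i, tok in enumerate(SPECIALS)}
    for w in words:
        word2idx.setdefault(w, len(word2idx))
    idx2word = {i: w for w, i in word2idx.items()}
    return word2idx, idx2word

def encode_caption(caption, word2idx, max_length):
    """Wrap a caption in <start>/<end>, map words to ids, then truncate and pad."""
    words = ['<start>'] + caption.split() + ['<end>']
    ids = [word2idx.get(w, word2idx['<unk>']) for w in words][:max_length]
    return ids + [word2idx['<pad>']] * (max_length - len(ids))

toy_captions = ['a dog runs', 'a cat sleeps']
w2i, i2w = build_vocab_with_specials(toy_captions)
encoded = encode_caption('a dog flies', w2i, max_length=8)
print(encoded)
```

In this toy run the unseen word "flies" maps to the unknown index while the trailing positions carry the separate padding index. Note that truncation can cut off the end marker when a caption is longer than `max_length - 2` words.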

In [67]:
#  Read train.csv
train_df = pd.read_csv('/kaggle/input/image-caption/custom_captions_dataset/train.csv')

# Read test.csv
test_df = pd.read_csv('/kaggle/input/image-caption/custom_captions_dataset/test.csv')

# Read val.csv
val_df = pd.read_csv('/kaggle/input/image-caption/custom_captions_dataset/val.csv')

# Strip surrounding whitespace and truncate each caption to its first 40 space-separated words
def preprocess_caption(caption):
    return ' '.join(caption.strip().split(' ')[:40])  # Truncate to first 40 words


# Access image names and captions
train_image_names = train_df['filename'].tolist()
train_captions = train_df['caption'].tolist()
for i in range(len(train_captions)):
    train_captions[i]=preprocess_caption(train_captions[i])

test_image_names = test_df['filename'].tolist()
test_captions = test_df['caption'].tolist()
for i in range(len(test_captions)):
    test_captions[i]=preprocess_caption(test_captions[i])

val_image_names = val_df['filename'].tolist()
val_captions = val_df['caption'].tolist()
for i in range(len(val_captions)):
    val_captions[i]=preprocess_caption(val_captions[i])

# Combine train and validation captions for building the vocabulary
captions = train_captions + val_captions


print(len(train_captions), len(val_captions), len(captions))
print(f"Train Size: {train_df.shape}")
print(f"Test Size: {test_df.shape}")
print(f"Val Size: {val_df.shape}")

5715 946 6661
Train Size: (5715, 3)
Test Size: (928, 3)
Val Size: (946, 3)


In [68]:
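# Build vocabulary, keeping words that appear at least 3 times
word2idx, idx2word, max_length = build_vocab(captions, 3)

# Override the computed maximum length: captions were truncated to 40 words above
max_length = 40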

# Tokenize captions
caption_tokens = tokenize_captions(captions, word2idx, max_length)

print("Vocabulary size:", len(word2idx))
# print("Word to index mapping:", word2idx)
# print("Index to word mapping:", idx2word)
print("Tokenized captions:")
print(caption_tokens)

Vocabulary size: 4222
Tokenized captions:
[[262   2   3 ...  20   2   4]
 [ 38  19  32 ...   6   7   2]
 [262  53  48 ...   5  64 312]
 ...
 [ 38   5   7 ...   5 206  79]
 [  0  23  79 ...   9   7 104]
 [709   5   7 ... 134   5 195]]


## Preprocessing
This section defines a custom dataset class and applies transformations to preprocess images and captions for training the image captioning model.

**Custom Dataset Class**: The CustomDataset class is defined to handle the dataset, taking a pandas dataframe, image directory path, vocabulary mapping word2idx, maximum caption length max_length, and optional transform for image preprocessing.

**Dataset Length**: The __len__ method returns the length of the dataset.

**Data Loading**: The __getitem__ method loads the image and caption at the given index. It tokenizes the caption, truncates or pads it to the maximum length, and returns the image, the original caption, the tokenized caption, and the caption length in words.

In [69]:
class CustomDataset(Dataset):
    def __init__(self, dataframe, image_dir, word2idx, max_length, transform=None):
        self.dataframe = dataframe
        self.image_dir = image_dir
        self.transform = transform
        self.word2idx = word2idx
        self.max_length = max_length

    def __len__(self):
        return len(self.dataframe)

    def __getitem__(self, idx):
        img_name = self.dataframe.iloc[idx, 1]  
        img_path = os.path.join(self.image_dir, img_name)
        image = Image.open(img_path).convert('RGB')
        if self.transform:
            image = self.transform(image)

        caption = self.dataframe.iloc[idx, 2] 
        caption_length = len(caption.split())
        
        # Tokenize caption
        tokens = [self.word2idx.get(word, self.word2idx['<unk>']) for word in caption.split(' ')]
        tokens = tokens[:self.max_length] + [0] * (self.max_length - len(tokens))  # Pad sequences
        tokens = torch.tensor(tokens)
        return image,caption, tokens, caption_length

**Image Preprocessing**: Each image is resized so that its shorter side is 224 pixels, center-cropped to 224x224, converted to a tensor, and normalized with the ImageNet channel means [0.485, 0.456, 0.406] and standard deviations [0.229, 0.224, 0.225].

In [70]:
# resize the image
transform = transforms.Compose([
    transforms.Resize(224),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
    
])
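
To make the geometry of the first two transforms concrete, the sketch below computes the intermediate size and crop window in plain Python. The helper `resize_then_center_crop` and the 640x427 input size are hypothetical, and torchvision's own rounding of the crop offset may differ by a pixel:

```python
def resize_then_center_crop(width, height, size=224):
    """Return the resized (width, height) and the crop box (left, top, right, bottom)."""
    # Scale so the shorter side equals `size`, preserving the aspect ratio
    if width <= height:
        new_w, new_h = size, int(size * height / width)
    else:
        new_w, new_h = int(size * width / height), size
    # Take a size x size window from the middle of the resized image
    left = (new_w - size) // 2
    top = (new_h - size) // 2
    return (new_w, new_h), (left, top, left + size, top + size)

resized, box = resize_then_center_crop(640, 427)
print(resized, box)
```

For a landscape image, the crop discards equal strips on the left and right, so objects near the image borders may not reach the encoder.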

**Dataset Creation**: Datasets for training, testing, and validation are created using the CustomDataset class with the specified transformations and directories.

**Data Loaders**: Data loaders are created for each dataset with a batch size of 64. The training loader is created with shuffle=False, so batches are visited in the same order in every epoch.

**Dataset Information**: Information about the sizes of the train, test, and validation datasets is printed to provide an overview of the dataset.

In [71]:
train_image_dir = '/kaggle/input/image-caption/custom_captions_dataset/train'  
test_image_dir = '/kaggle/input/image-caption/custom_captions_dataset/test'
val_image_dir = '/kaggle/input/image-caption/custom_captions_dataset/val'

# Create datasets
train_dataset = CustomDataset(train_df, train_image_dir, word2idx, max_length, transform=transform)
test_dataset = CustomDataset(test_df, test_image_dir, word2idx, max_length, transform=transform)
val_dataset = CustomDataset(val_df, val_image_dir, word2idx, max_length, transform=transform)

# Define batch size
batch_size = 64

# Create data loaders
train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=False)
test_loader = DataLoader(test_dataset, batch_size=batch_size)
val_loader = DataLoader(val_dataset, batch_size=batch_size)

# Optionally, print some information about the datasets
print(f"Train Dataset Size: {len(train_dataset)}")
print(f"Test Dataset Size: {len(test_dataset)}")
print(f"Validation Dataset Size: {len(val_dataset)}")

Train Dataset Size: 5715
Test Dataset Size: 928
Validation Dataset Size: 946
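
As a quick check on these sizes, the number of batches per loader follows from the batch size. The arithmetic below is a small sketch using the sizes printed above; it also accounts for the 90 steps per epoch that appear in the training log:

```python
import math

batch_size = 64
dataset_sizes = {'train': 5715, 'test': 928, 'val': 946}

# DataLoader keeps the last partial batch by default (drop_last=False)
batches = {name: math.ceil(n / batch_size) for name, n in dataset_sizes.items()}
last_batch = {name: n - (batches[name] - 1) * batch_size for name, n in dataset_sizes.items()}

for name in dataset_sizes:
    print(f"{name}: {batches[name]} batches, last batch has {last_batch[name]} samples")
```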


## Model Creation

In this section, we define the architecture of the image captioning model, consisting of a CNN encoder and an LSTM-based RNN decoder. The encoder processes input images to extract meaningful features, which are then used by the decoder to generate captions.

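- **CNN Encoder**: Uses a pre-trained ResNet-50 with its final fully connected layer removed to extract image features. The backbone runs under torch.no_grad(), so its weights stay frozen. The extracted features are passed through a linear layer followed by batch normalization to produce the image embedding.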

- **RNN Decoder**: Uses an embedding layer to convert tokenized captions into embedded representations. The LSTM-based decoder then generates captions based on the embedded representations, producing a sequence of words.

In [72]:
import torch
import torch.nn as nn
import torchvision.models as models
from torch.nn.utils.rnn import pack_padded_sequence

# CNN Encoder
class EncoderCNN(nn.Module):
    def __init__(self, embed_size):
        super(EncoderCNN, self).__init__()
        resnet = models.resnet50(pretrained=True)  # Pre-trained ResNet-50
        modules = list(resnet.children())[:-1]  # Remove the last fully connected layer
        self.resnet = nn.Sequential(*modules)
        self.embed = nn.Linear(resnet.fc.in_features, embed_size)
        self.bn = nn.BatchNorm1d(embed_size, momentum=0.01)

    def forward(self, images):
        with torch.no_grad():
            features = self.resnet(images)
        features = features.reshape(features.size(0), -1)
        features = self.bn(self.embed(features))
        return features

# RNN Decoder
class DecoderRNN(nn.Module):
    def __init__(self, embed_size, hidden_size, vocab_size, num_layers, max_seq_length=20):
        super(DecoderRNN, self).__init__()
        self.embed = nn.Embedding(vocab_size, embed_size)
        self.lstm = nn.LSTM(embed_size, hidden_size, num_layers, batch_first=True)
        self.linear = nn.Linear(hidden_size, vocab_size)
        self.max_seg_length = max_seq_length

    def forward(self, features, captions, lengths):
        embeddings = self.embed(captions)
        embeddings = torch.cat((features.unsqueeze(1), embeddings), 1)
        packed = pack_padded_sequence(embeddings, lengths, batch_first=True, enforce_sorted=False) 
        hiddens, _ = self.lstm(packed)
        outputs = self.linear(hiddens[0])
        return outputs

    def sample(self, features, states=None):
        sampled_ids = []
        inputs = features.unsqueeze(1)
        for i in range(self.max_seg_length):
            hiddens, states = self.lstm(inputs, states)
            outputs = self.linear(hiddens.squeeze(1))
            _, predicted = outputs.max(1)
            sampled_ids.append(predicted)
            inputs = self.embed(predicted).unsqueeze(1)
        sampled_ids = torch.stack(sampled_ids, 1)
        return sampled_ids
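
The way `forward` lines up decoder inputs with training targets can be illustrated with plain lists. In the sketch below, `teacher_forcing_pairs` and the `'<image>'` placeholder are hypothetical names. It assumes the length passed to `pack_padded_sequence` equals the caption length, as in the training loop of this notebook:

```python
def teacher_forcing_pairs(caption_tokens, image_slot='<image>'):
    """Pair each decoder input with the token it is trained to predict."""
    # The image feature is prepended, then the sequence is cut back to the caption length,
    # which drops the last caption token from the inputs
    inputs = [image_slot] + caption_tokens[:-1]
    targets = caption_tokens
    return list(zip(inputs, targets))

pairs = teacher_forcing_pairs(['a', 'dog', 'runs', '.'])
for step, (inp, tgt) in enumerate(pairs):
    print(f"step {step}: input={inp!r} -> target={tgt!r}")
```

At step 0 the LSTM sees only the image embedding and is trained to predict the first word; every later step sees the previous ground-truth word. The `sample` method follows the same pattern at inference time, feeding back its own prediction instead.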

In [74]:
# Initialize the encoder and decoder
embed_size = 256
hidden_size = 512
vocab_size = len(word2idx)
num_layers = 1
encoder = EncoderCNN(embed_size)
decoder = DecoderRNN(embed_size, hidden_size, vocab_size, num_layers, max_length)

# Move models to GPU if available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
encoder.to(device)
decoder.to(device)

DecoderRNN(
  (embed): Embedding(4222, 256)
  (lstm): LSTM(256, 512, batch_first=True)
  (linear): Linear(in_features=512, out_features=4222, bias=True)
)

In [75]:
# Define loss function and optimizer
criterion = nn.CrossEntropyLoss()
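# Optimize the decoder and the encoder's linear embedding layer
# (the encoder's batch-norm parameters are not included in this list)
params = list(decoder.parameters()) + list(encoder.embed.parameters())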
optimizer = torch.optim.Adam(params, lr=0.001)
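
The criterion above averages over every position it is given. If padding had its own index, `nn.CrossEntropyLoss` accepts an `ignore_index` argument that leaves those positions out of the average. The sketch below illustrates the difference; `PAD_IDX` and the toy tensors are assumptions for illustration, whereas this notebook pads with index 0, which it shares with the unknown-word token:

```python
import math
import torch
import torch.nn as nn

PAD_IDX = 0  # hypothetical dedicated padding index

# Four positions over a 5-word vocabulary: uniform predictions at the real positions,
# confident (and correct) predictions at the two padded positions
logits = torch.zeros(4, 5)
logits[2:, PAD_IDX] = 20.0
targets = torch.tensor([2, 3, PAD_IDX, PAD_IDX])

plain_loss = nn.CrossEntropyLoss()(logits, targets)
masked_loss = nn.CrossEntropyLoss(ignore_index=PAD_IDX)(logits, targets)

expected = math.log(5)  # loss of a uniform prediction over 5 classes
print(f"plain: {plain_loss.item():.4f}, masked: {masked_loss.item():.4f}, log(5): {expected:.4f}")
```

The unmasked loss looks better only because the easy padded positions dilute the average, which is worth keeping in mind when reading loss values from a run that includes padding.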

## Model Training

This section details the training process of the image captioning model, including the training loop, loss computation, and optimization steps.

- **Training Loop**: The model is trained for a specified number of epochs. For each epoch, the total loss is calculated as the sum of losses over all batches.

- **Forward Pass**: The encoder processes input images to extract features, which are then fed into the decoder along with tokenized captions to generate outputs.

- **Loss Computation**: The CrossEntropyLoss is used to compute the loss between the predicted outputs and the ground truth captions.

- **Backward Pass and Optimization**: The optimizer is used to update the model parameters based on the computed gradients.

- **Model Evaluation**: The trained model can be evaluated using the test dataset to generate captions for unseen images.

- **Model Saving**: After training, the encoder and decoder weights are saved to encoder.pth and decoder.pth for later use.



In [76]:
# Training loop
num_epochs = 20
for epoch in range(num_epochs):
    total_loss = 0
    for i, (images, cap, captions, leng) in enumerate(train_loader):
        # Treat every caption as max_length tokens long, so padded positions also count towards the loss
        lengths = [max_length]*len(leng)
        images = images.to(device)
        captions = captions.to(device)
        targets = pack_padded_sequence(captions, lengths, batch_first=True, enforce_sorted=False)[0]
        
        # Forward pass
        features = encoder(images)
        outputs = decoder(features, captions, lengths)
        
        # Compute loss
        loss = criterion(outputs, targets)
        
        # Backward pass and optimization
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        
        total_loss += loss.item()
        
        # Print progress
        if (i+1) % 50 == 0:
            print(f'Epoch [{epoch+1}/{num_epochs}], Step [{i+1}/{len(train_loader)}], Loss: {loss.item():.4f}')
    
    # Print epoch-wise loss
    print(f'Epoch [{epoch+1}/{num_epochs}], Average Loss: {total_loss/len(train_loader):.4f}')

# Optionally, save the trained model
torch.save(encoder.state_dict(), 'encoder.pth')
torch.save(decoder.state_dict(), 'decoder.pth')

Epoch [1/20], Step [50/90], Loss: 5.1017
Epoch [1/20], Average Loss: 5.1573
Epoch [2/20], Step [50/90], Loss: 4.1889
Epoch [2/20], Average Loss: 4.0098
Epoch [3/20], Step [50/90], Loss: 3.7819
Epoch [3/20], Average Loss: 3.6200
Epoch [4/20], Step [50/90], Loss: 3.5381
Epoch [4/20], Average Loss: 3.3794
Epoch [5/20], Step [50/90], Loss: 3.3499
Epoch [5/20], Average Loss: 3.2043
Epoch [6/20], Step [50/90], Loss: 3.1943
Epoch [6/20], Average Loss: 3.0600
Epoch [7/20], Step [50/90], Loss: 3.0582
Epoch [7/20], Average Loss: 2.9302
Epoch [8/20], Step [50/90], Loss: 2.9293
Epoch [8/20], Average Loss: 2.8106
Epoch [9/20], Step [50/90], Loss: 2.8105
Epoch [9/20], Average Loss: 2.7008
Epoch [10/20], Step [50/90], Loss: 2.6946
Epoch [10/20], Average Loss: 2.5963
Epoch [11/20], Step [50/90], Loss: 2.5914
Epoch [11/20], Average Loss: 2.4955
Epoch [12/20], Step [50/90], Loss: 2.4987
Epoch [12/20], Average Loss: 2.4046
Epoch [13/20], Step [50/90], Loss: 2.4179
Epoch [13/20], Average Loss: 2.3218
Epoc
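
One way to read these numbers is as perplexity. Because CrossEntropyLoss uses the natural logarithm, the exponential of the mean loss gives a per-token perplexity. The sketch below uses two averages copied from the log above. Since the loss here also covers padded positions, the values are only a rough indication:

```python
import math

# Average losses copied from the training log above (epochs 1 and 13)
avg_loss = {1: 5.1573, 13: 2.3218}

# exp(mean cross-entropy in nats) = per-token perplexity
perplexity = {epoch: math.exp(loss) for epoch, loss in avg_loss.items()}

for epoch, ppl in perplexity.items():
    print(f"epoch {epoch}: loss {avg_loss[epoch]:.4f} -> perplexity {ppl:.1f}")
```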

In [78]:
encoder = EncoderCNN(embed_size)
decoder = DecoderRNN(embed_size, hidden_size, vocab_size, num_layers, max_length)
encoder.load_state_dict(torch.load('encoder.pth'))
decoder.load_state_dict(torch.load('decoder.pth'))

# Set the model to evaluation mode
encoder.eval()
decoder.eval()

# Move models to GPU if available
encoder.to(device)
decoder.to(device)
predicted_captions = []

# Iterate through test dataset
for images, ground_truths, _, _ in test_loader:
    images = images.to(device)
    
    # Extract features and generate captions without tracking gradients
    with torch.no_grad():
        features = encoder(images)
        sampled_ids = decoder.sample(features)
    
    # Decode the predicted captions
    sampled_ids = sampled_ids.cpu().numpy()
    for sample_id in sampled_ids:
        predicted_caption = []
#         predicted_caption = ['<start>']
        for word_id in sample_id:
            if word_id==0:
                continue
            word = idx2word[word_id]
            if word == '<end>': 
                break
            predicted_caption.append(word)
#         predicted_caption.append('<end>') 
        predicted_captions.append(' '.join(predicted_caption))
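
The id-to-text step in the loop above can be factored into a small helper. The sketch below is a standalone illustration with a toy vocabulary; `ids_to_caption` and `toy_idx2word` are hypothetical names, and the toy vocabulary contains a separate end marker so that the early stop is visible:

```python
def ids_to_caption(ids, idx2word, end_token='<end>', skip_ids=(0,)):
    """Convert a sequence of token ids to a string, stopping at the end token."""
    words = []
    for i in ids:
        if i in skip_ids:          # drop padding / unknown ids
            continue
        word = idx2word[i]
        if word == end_token:      # stop once the end marker is reached
            break
        words.append(word)
    return ' '.join(words)

toy_idx2word = {0: '<unk>', 1: 'a', 2: 'dog', 3: 'runs', 4: '<end>'}
caption = ids_to_caption([1, 2, 0, 3, 4, 1, 1], toy_idx2word)
print(caption)
```

If the vocabulary never contains the end marker as a token of its own, the `break` is never reached and every caption runs to the maximum generation length.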
   

## Evaluation and Results

This section evaluates the trained image captioning model by printing generated captions for a sample of test images and computing BLEU, METEOR, ROUGE-L, CIDEr, and SPICE scores over the test dataset.

- **Sample Image and Caption Pairs**: For the first five images in the test dataset, the ground-truth caption and the caption generated by the trained model are printed together. The code that displays the images themselves is commented out.

- **Captioning Test Dataset**: Using the trained model, captions are generated for the entire test dataset. These generated captions are then used to calculate evaluation metrics, providing insights into the quality of the model's captions compared to ground truth captions.

- **Evaluation Metrics**: BLEU-1 to BLEU-4, METEOR, ROUGE-L, CIDEr, and SPICE scores are computed from the generated and ground-truth captions of the test dataset using the pycocoevalcap package.
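
To give a feel for what one of these metrics measures, the sketch below computes a simplified ROUGE-L for a single candidate and a single reference, based on their longest common subsequence (LCS). The function names, the whitespace tokenization, and the `beta` value are assumptions made for illustration; the library's implementation handles multiple references and its own tokenization:

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists (dynamic programming)."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def rouge_l(candidate, reference, beta=1.2):
    """LCS-based F-measure for one candidate and one reference (beta is an assumed weight)."""
    cand, ref = candidate.split(), reference.split()
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(cand), lcs / len(ref)
    return (1 + beta ** 2) * precision * recall / (recall + beta ** 2 * precision)

score = rouge_l('a person is skiing', 'a person is standing on snow')
print(f"{score:.3f}")
```

Here the shared subsequence is "a person is", giving a precision of 3/4 and a recall of 3/6. Because LCS only requires words to appear in the same order, not adjacently, ROUGE-L rewards captions that preserve sentence structure even when individual words differ.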

In [79]:
import matplotlib.pyplot as plt
import numpy as np

# ImageNet statistics used by the Normalize transform, needed to undo it for display
mean = np.array([0.485, 0.456, 0.406])
std = np.array([0.229, 0.224, 0.225])

# Print the ground-truth and predicted captions for the first five test images
for i in range(5):
    image, truth, _, _ = test_dataset[i]
    image = image.permute(1, 2, 0).numpy()  # Move channel dimension to the last position
    image = np.clip(image * std + mean, 0, 1)  # Undo normalization
    image = (image * 255).astype(np.uint8)  # Scale pixel values to 0-255

    # Display the image
#     plt.imshow(image)
#     plt.axis('off')
#     plt.show()
    print(f"Ground Truth {i+1}: {truth}")
    print(len(predicted_captions[i]))  # Length of the predicted caption in characters
    print(f'Predicted Caption {i+1}: {predicted_captions[i]}\n\n')

Ground Truth 1: A large building with bars on the windows in front of it. There is people walking in front of the building. There is a street in front of the building with many cars on it. 
187
Predicted Caption 1: The image is of a city street. There is a street light in the street with a large The building is painted yellow and black. There is a car parked on the side of the street. There are many


Ground Truth 2: A person is skiing through the snow. There is loose snow all around them from him jumping. The person is wearing a yellow snow suit. The person is holding two ski poles in their hands. 
194
Predicted Caption 2: A person is standing on a beach on a sunny day. The man is wearing a black jacket and blue jean pants. The man is wearing a black jacket and black pants. The person is wearing a black jacket and


Ground Truth 3: There is a bed in a room against a wall. There is a brown blanket on top of the bed. There is a small brown book shelf next to the bed. There is a picture 

In [80]:
# creating json files to store the captions in the format needed for using evaluation metrics

import json

imags_json=[]
for idx,path in enumerate(test_image_names):
    imags_json.append({
        "license": 1,
        "url": os.path.join(test_image_dir, path),
        "file_name": path,
        "id": idx,
        # Placeholder values carried over from a COCO example; the captions are matched by id only
        "width": 640,
        "date_captured": "2013-11-25 14:48:33",
        "height": 427
    }
    )

# Ground-truth captions in COCO annotation format
actual_captions_json = []
for idx, caption in enumerate(test_captions):
    actual_captions_json.append({
        "image_id": idx,  # Assuming image IDs start from 0 and are sequential
        "id": idx,  # Assuming IDs are unique and sequential
        "caption": test_captions[idx]
    })

generated_captions_json = []
for idx, caption in enumerate(predicted_captions):
    generated_captions_json.append({
        "image_id": idx,  # Assuming image IDs start from 0 and are sequential
        "caption": predicted_captions[idx]
    })


# Create a new JSON object
data = {
    "info": {
        "description": "This is stable 1.0 version of the 2014 MS COCO dataset.",
        "url": "http://mscoco.org",
        "version": "1.0",
        "year": 2014,
        "contributor": "Microsoft COCO group",
        "date_created": "2015-01-27 09:11:52.357475"
    },
    "images": imags_json,
    "type": "captions",
    "licenses": [
        {
            "url": "http://creativecommons.org/licenses/by-nc-sa/2.0/",
            "id": 1,
            "name": "Attribution-NonCommercial-ShareAlike License"
        },
        {
            "url": "http://creativecommons.org/licenses/by-nc/2.0/",
            "id": 2,
            "name": "Attribution-NonCommercial License"
        },
        {
            "url": "http://creativecommons.org/licenses/by-nc-nd/2.0/",
            "id": 3,
            "name": "Attribution-NonCommercial-NoDerivs License"
        },
        {
            "url": "http://creativecommons.org/licenses/by/2.0/",
            "id": 4,
            "name": "Attribution License"
        },
        {
            "url": "http://creativecommons.org/licenses/by-sa/2.0/",
            "id": 5,
            "name": "Attribution-ShareAlike License"
        },
        {
            "url": "http://creativecommons.org/licenses/by-nd/2.0/",
            "id": 6,
            "name": "Attribution-NoDerivs License"
        },
        {
            "url": "http://flickr.com/commons/usage/",
            "id": 7,
            "name": "No known copyright restrictions"
        },
        {
            "url": "http://www.usa.gov/copyright.shtml",
            "id": 8,
            "name": "United States Government Work"
        }
    ],
    "annotations": actual_captions_json  # Use actual_captions_json or generated_captions_json here
}

data2 = generated_captions_json

# Save the JSON to a file
with open('captions.json', 'w') as f:
    json.dump(data, f, indent=4)
    
# Save the JSON to a file
with open('gen_captions.json', 'w') as ff:
    json.dump(data2, ff, indent=4)

In [81]:
!pip install pycocoevalcap



In [82]:
!pip install pycocotools



In [83]:
from pycocoevalcap.tokenizer.ptbtokenizer import PTBTokenizer
from pycocoevalcap.bleu.bleu import Bleu
from pycocoevalcap.meteor.meteor import Meteor
from pycocoevalcap.rouge.rouge import Rouge
from pycocoevalcap.cider.cider import Cider
from pycocoevalcap.spice.spice import Spice

class COCOEvalCap:
    def __init__(self, coco, cocoRes):
        self.evalImgs = []
        self.eval = {}
        self.imgToEval = {}
        self.coco = coco
        self.cocoRes = cocoRes
        self.params = {'image_id': coco.getImgIds()}

    def evaluate(self):
        imgIds = self.params['image_id']
        # imgIds = self.coco.getImgIds()
        gts = {}
        res = {}
        for imgId in imgIds:
            gts[imgId] = self.coco.imgToAnns[imgId]
            res[imgId] = self.cocoRes.imgToAnns[imgId]

        # =================================================
        # Tokenize captions
        # =================================================
        print('tokenization...')
        tokenizer = PTBTokenizer()
        gts  = tokenizer.tokenize(gts)
        res = tokenizer.tokenize(res)

        # =================================================
        # Set up scorers
        # =================================================
        print('setting up scorers...')
        scorers = [
            (Bleu(4), ["Bleu_1", "Bleu_2", "Bleu_3", "Bleu_4"]),
            (Meteor(),"METEOR"),
            (Rouge(), "ROUGE_L"),
            (Cider(), "CIDEr"),
            (Spice(), "SPICE")
        ]

        # =================================================
        # Compute scores
        # =================================================
        for scorer, method in scorers:
            print ('computing %s score...'%(scorer.method()))
            score, scores = scorer.compute_score(gts, res)
            if type(method) == list:
                for sc, scs, m in zip(score, scores, method):
                    self.setEval(sc, m)
                    self.setImgToEvalImgs(scs, gts.keys(), m)
                    print("%s: %0.3f"%(m, sc))
            else:
                self.setEval(score, method)
                self.setImgToEvalImgs(scores, gts.keys(), method)
                print("%s: %0.3f"%(method, score))
        self.setEvalImgs()

    def setEval(self, score, method):
        self.eval[method] = score

    def setImgToEvalImgs(self, scores, imgIds, method):
        for imgId, score in zip(imgIds, scores):
            if not imgId in self.imgToEval:
                self.imgToEval[imgId] = {}
                self.imgToEval[imgId]["image_id"] = imgId
            self.imgToEval[imgId][method] = score

    def setEvalImgs(self):
        self.evalImgs = [eval for imgId, eval in self.imgToEval.items()]

In [84]:
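from pycocotools.coco import COCO
# Note: this import rebinds COCOEvalCap to the library's implementation,
# replacing the class of the same name defined in the previous cell
from pycocoevalcap.eval import COCOEvalCap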

annotation_file = '/kaggle/working/captions.json'
results_file = '/kaggle/working/gen_captions.json'

# create coco object and coco_result object
coco = COCO(annotation_file)
coco_result = coco.loadRes(results_file)

# create coco_eval object by taking coco and coco_result
coco_eval = COCOEvalCap(coco, coco_result)

# restrict evaluation to the images that have generated captions
# (here that is every test image, since a caption was generated for each one)
coco_eval.params['image_id'] = coco_result.getImgIds()

# evaluate results
# SPICE will take a few minutes the first time, but speeds up due to caching
coco_eval.evaluate()

# print output evaluation scores
for metric, score in coco_eval.eval.items():
    print(f'{metric}: {score:.3f}')

loading annotations into memory...
Done (t=0.01s)
creating index...
index created!
Loading and preparing results...
DONE (t=0.00s)
creating index...
index created!
tokenization...


PTBTokenizer tokenized 39388 tokens at 157304.24 tokens per second.
PTBTokenizer tokenized 40781 tokens at 156502.91 tokens per second.


setting up scorers...
computing Bleu score...
{'testlen': 36493, 'reflen': 34975, 'guess': [36493, 35565, 34638, 33711], 'correct': [12861, 3148, 996, 316]}
ratio: 1.0434024303073326
Bleu_1: 0.352
Bleu_2: 0.177
Bleu_3: 0.096
Bleu_4: 0.054
computing METEOR score...
METEOR: 0.128
computing Rouge score...
ROUGE_L: 0.270
computing CIDEr score...
CIDEr: 0.313
computing SPICE score...


Parsing reference captions
Initiating Stanford parsing pipeline
[main] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator tokenize
[main] INFO edu.stanford.nlp.pipeline.TokenizerAnnotator - TokenizerAnnotator: No tokenizer type provided. Defaulting to PTBTokenizer.
[main] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator ssplit
[main] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator parse
[main] INFO edu.stanford.nlp.parser.common.ParserGrammar - Loading parser from serialized file edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz ... 
done [0.8 sec].
[main] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator lemma
[main] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator ner
Loading classifier from edu/stanford/nlp/models/ner/english.all.3class.distsim.crf.ser.gz ... done [1.4 sec].
Loading classifier from edu/stanford/nlp/models/ner/english.muc.7class.distsim.crf.ser.gz ... done [0.7 sec].
Loading classif

SPICE evaluation took: 7.136 min
SPICE: 0.086
Bleu_1: 0.352
Bleu_2: 0.177
Bleu_3: 0.096
Bleu_4: 0.054
METEOR: 0.128
ROUGE_L: 0.270
CIDEr: 0.313
SPICE: 0.086
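
The BLEU values can be reproduced by hand from the counts printed in the BLEU log line above. The sketch below applies the standard definition: the geometric mean of the n-gram precisions multiplied by a brevity penalty. The library's implementation may differ in small details such as smoothing:

```python
import math

# Counts copied from the BLEU log line above
guess = [36493, 35565, 34638, 33711]    # n-grams in the generated captions, n = 1..4
correct = [12861, 3148, 996, 316]       # clipped n-gram matches, n = 1..4
testlen, reflen = 36493, 34975

# Brevity penalty: 1 when the generated text is at least as long as the references
bp = 1.0 if testlen >= reflen else math.exp(1 - reflen / testlen)

bleu = []
log_sum = 0.0
for n, (c, g) in enumerate(zip(correct, guess), start=1):
    log_sum += math.log(c / g)               # accumulate log precisions up to order n
    bleu.append(bp * math.exp(log_sum / n))  # geometric mean of the first n precisions

print([round(b, 3) for b in bleu])
```

The steep drop from BLEU-1 to BLEU-4 reflects how quickly the match rate falls with n-gram order: about 35% of generated unigrams match a reference, but under 1% of 4-grams do.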


## Final Output
This section prints the caption generated by the trained model for every image in the test dataset. Each test image's filename is printed with its generated caption, giving a complete view of the model's output on unseen images.

- **Display Captions**: The generated caption for each test image is printed next to the image's filename.

- **Qualitative Evaluation**: Reading the captions makes the model's weaknesses visible. In the output below, several captions repeat the same sentence (for example test_4.jpg) and end mid-sentence at the 40-token generation limit.

In [85]:
for i in range(len(predicted_captions)):
    print("Caption for Image ",test_image_names[i],": ",predicted_captions[i],"\n")

Caption for Image  test_1.jpg :  The image is of a city street. There is a street light in the street with a large The building is painted yellow and black. There is a car parked on the side of the street. There are many 

Caption for Image  test_2.jpg :  A person is standing on a beach on a sunny day. The man is wearing a black jacket and blue jean pants. The man is wearing a black jacket and black pants. The person is wearing a black jacket and 

Caption for Image  test_3.jpg :  This picture is taken inside of a living room of a home. A small white t.v. is sitting on a desk with a white painted wall behind it. There is a white and black keyboard sitting on the desk in 

Caption for Image  test_4.jpg :  The train is pulling into the station. The train is yellow and black. The train is pulling into the station. The train is yellow and black. The train is pulling into the station. The train is yellow and black. The 

Caption for Image  test_5.jpg :  A bus is parked on the street. The bu