# **VR Mini Project - Image Captioning**

## IMT2020039 - Anshul Jindal  
## IMT2020535 - Shreeya Venneti
## IMT2020094 - Riddhi Chatterjee
## IMT2020523 - Kedar Deshpande

# IMPORTING LIBRARIES

Importing all the relevant libraries

In [None]:
import cv2
import numpy as np
import matplotlib.pyplot as plt
import pickle
import string
import random
from tqdm import tqdm

In [None]:
# Upgrade the previously installed nltk library
!pip install -U nltk

# Pin nltk to version 3.5, which provides meteor_score
!pip install nltk==3.5

In [None]:
import tensorflow
from tensorflow import keras
from keras.preprocessing import sequence
from tensorflow.keras.models import Model
from tensorflow.keras import Input, layers
from tensorflow.keras import optimizers
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.layers import add
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.applications.resnet50 import ResNet50
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.callbacks import EarlyStopping
from tensorflow.keras.layers import LSTM, Embedding, Dense, Activation, Flatten, Reshape, Dropout
from tensorflow.keras.layers import Attention
from nltk.translate.bleu_score import sentence_bleu
from nltk.translate.meteor_score import meteor_score
import tensorflow as tf
from keras.layers import concatenate

# Reading the Names of Images and Corresponding Captions

## Opening and Reading the text files

In [None]:
with open("/kaggle/input/flickr/Flickr8k/Flickr8k_text/Flickr_8k.trainImages.txt", "r") as f:
    train_img_names = f.read().split("\n")
    train_img_names = train_img_names[:-1]

with open("/kaggle/input/flickr/Flickr8k/Flickr8k_text/Flickr_8k.valImages.txt", "r") as f:
    val_img_names = f.read().split("\n")
    val_img_names = val_img_names[:-1]

with open("/kaggle/input/flickr/Flickr8k/Flickr8k_text/Flickr_8k.testImages.txt", "r") as f:
    test_img_names = f.read().split("\n")
    test_img_names = test_img_names[:-1]

In [None]:
train_img_names[0]

In [None]:
with open("/kaggle/input/flickr/Flickr8k/Flickr8k_text/Flickr8k.token.txt", "r") as f:
    captions_list = f.read().split("\n")
    captions_list = captions_list[:-1]

In [None]:
captions_list[0:6]

As the output shows, each image has 5 captions associated with it, and the captions are independent of one another.

# PREPROCESSING

## 1. Arranging the Captions

Every image has 5 captions associated with it, as seen above. Let us store all of these captions in a **DICTIONARY** where the **Key is the Image Name** and the **Value is the list of 5 Captions** for that image. In the same pass, we separate the image names from their captions in the captions list above. This makes further processing easier.

In [None]:
# Initialising the Dictionary
captions_dict = {}

for i in captions_list:
    
    # Split each line on the "\t" character
    # The first part of the split is the image name
    img_name = i.split("\t")[0] 
    img_name = img_name[:-2]    # Remove the trailing "#<num>" suffix from the image name
    
    # The second part of the split is the caption
    img_caption = i.split("\t")[1]
    
    if img_name in captions_dict.keys():
        captions_dict[img_name].append(img_caption)
    else:
        captions_dict[img_name] = [img_caption]  

In [None]:
# Print the captions for the last image
captions_dict[img_name]
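
As a self-contained illustration of the parsing step above, the sketch below groups a few made-up lines in the `image.jpg#N<TAB>caption` layout by image name. The sample lines and the helper name `group_captions` are assumptions for demonstration only; they are not taken from the dataset.

```python
from collections import defaultdict

def group_captions(lines):
    """Group 'name#idx<TAB>caption' lines into {name: [captions]}."""
    grouped = defaultdict(list)
    for line in lines:
        if not line.strip():
            continue  # skip blank lines such as the one left by a trailing newline
        tagged_name, caption = line.split("\t", 1)
        # rsplit drops the '#<index>' suffix even if the index has several digits
        img_name = tagged_name.rsplit("#", 1)[0]
        grouped[img_name].append(caption)
    return dict(grouped)

# Made-up sample lines (not from Flickr8k)
sample_lines = [
    "dog.jpg#0\tA dog runs on the grass .",
    "dog.jpg#1\tA brown dog is running .",
    "cat.jpg#0\tA cat sits on a sofa .",
    "",
]
sample_dict = group_captions(sample_lines)
print(sample_dict["dog.jpg"])
```

Skipping blank lines inside the helper plays the same role as dropping the last list element after `split("\n")`.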

## 2. Splitting into Train, Test and Val

In [None]:
# Also we split this whole dictionary into Train, Test and Val dictionaries
train_dict = {}
test_dict = {}
val_dict = {}

for i in train_img_names:
    train_dict[i] = captions_dict[i].copy()
for i in val_img_names:
    val_dict[i] = captions_dict[i].copy()
for i in test_img_names:
    test_dict[i] = captions_dict[i].copy()

## 3. Adding Sequence Tokens 

We add a **startcap** token at the beginning of every TRAIN caption and an **endcap** token at its end, to mark where the caption starts and stops. To do this, we iterate over the dictionary.

In [None]:
for img_name in captions_dict:
    for i in range(len(captions_dict[img_name])):
        if img_name in train_dict.keys():
            captions_dict[img_name][i] = "startcap " + captions_dict[img_name][i] + " endcap"
            train_dict[img_name][i] = "startcap " + train_dict[img_name][i] + " endcap"

## 4. Calculating Maximum length of Caption

Out of all the captions, we find the one with the largest number of words. This maximum length is what the other captions will be padded to.

In [None]:
caption_max_length = 0

for img_name in captions_dict:
    for i in captions_dict[img_name]:
        if caption_max_length < len(i.split()):
            caption_max_length = len(i.split())

In [None]:
caption_max_length

## 5. Calculate Vocabulary Size

We will form a vocabulary of the words present in our captions. So let's find the **list of all the unique words** that occur in the captions.

In [None]:
vocab_list = []

for img_name in captions_dict:
    for i in captions_dict[img_name]:
        for w in i.split():
            if w not in vocab_list:
                vocab_list.append(w)
            else:
                continue
                
vocab_size = len(vocab_list)

In [None]:
vocab_size
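
Checking `w not in vocab_list` scans the whole list for every word. A possible alternative, sketched below on toy captions, keeps a dictionary from word to index so that each lookup takes constant time. Reserving index 0 for a padding token is an assumption of this sketch and differs from the notebook, where index 0 belongs to a real word. The names `build_word_index` and `<pad>` are hypothetical.

```python
def build_word_index(captions, pad_token="<pad>"):
    """Map each unique word to an integer id; id 0 is reserved for padding."""
    word_to_index = {pad_token: 0}
    for caption in captions:
        for word in caption.split():
            # setdefault assigns the next free id only the first time a word is seen
            word_to_index.setdefault(word, len(word_to_index))
    return word_to_index

toy_captions = ["startcap a dog runs endcap", "startcap a cat sits endcap"]
word_to_index = build_word_index(toy_captions)
# Inverse mapping, used to turn predicted ids back into words
index_to_word = {i: w for w, i in word_to_index.items()}
print(len(word_to_index))  # 8
```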

# GENERATING WORD EMBEDDINGS

We cannot use one-hot encodings here because they consume a large amount of memory: every word would need a vector as long as the whole vocabulary. So what should we do?

Instead, we give our model 10278 vectors, one per word, each with a much smaller dimension such as 100 or 200. These vectors are called word embeddings, and the word embedding matrix contains all 10278 vectors as its rows. Such embeddings can be initialised randomly and learned by the model itself.

We will use pre-trained embeddings called GloVe embeddings. The GloVe vectors used here have dimension 200.

## Loading GloVe Embeddings

In [None]:
glove_embeddings = open('/kaggle/input/glove6b200d/glove.6B.200d.txt', encoding="utf-8")

Each line in the GloVe embedding text file consists of two things: the word itself, followed by its 200-dimensional embedding. The whole line is stored as a single string. We read and split this string and create a dictionary of all the words present in the GloVe embeddings.

In [None]:
glove_emb_dict = {} 

for embs in glove_embeddings:
    temp = embs.split()
    word = temp[0]
    emb = np.asarray(temp[1:], dtype='float32')
    
    #Adding this into our dictionary
    glove_emb_dict[word] = emb
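
The line format described above can be tried out without the real file. The sketch below parses two made-up lines from an in-memory buffer. The helper name `parse_glove` and the three-dimensional toy vectors are assumptions for illustration.

```python
import io

def parse_glove(handle):
    """Read 'word v1 v2 ...' lines into {word: [float, ...]}."""
    vectors = {}
    for line in handle:
        parts = line.rstrip().split(" ")
        if len(parts) < 2:
            continue  # ignore empty or malformed lines
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors

# Three-dimensional toy vectors standing in for the 200-dimensional file
toy_file = io.StringIO("dog 0.1 0.2 0.3\ncat 0.4 0.5 0.6\n")
toy_vectors = parse_glove(toy_file)
print(toy_vectors["cat"])
```

With a real file, passing a handle opened inside a `with open(path, encoding="utf-8") as handle:` block would also make sure the file is closed after reading.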

## Generating Embedding Matrix

Now we will create our embedding matrix. We iterate through our vocabulary list and initialise an embedding for each word in the vocabulary. Words that are present in both the vocabulary and the GloVe embeddings are initialised with their GloVe vectors; all other words keep a random initialisation.

In [None]:
emb_dim = 200  # Same as the GloVe dimension
embedding_matrix = np.random.uniform(0, 1, (vocab_size, emb_dim))

# Iterate through the vocabulary list.
# The position of a word in the list is its row index in the embedding matrix.
for index_of_word, word in enumerate(vocab_list):
    if word in glove_emb_dict:
        embedding_matrix[index_of_word] = glove_emb_dict[word]
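
The 6B GloVe release is uncased (all lowercase), while the captions here keep their original capitalisation, so a word such as `A` would keep a random row. The sketch below shows one possible way to add a lowercase fallback and to report which words had no pre-trained vector. It uses plain Python lists and toy two-dimensional vectors, and `build_embedding_rows` is a hypothetical helper, not part of the notebook.

```python
import random

def build_embedding_rows(vocab, pretrained, dim, seed=0):
    """Return (rows, missing): one vector per vocab word, plus the words without a pretrained vector."""
    rng = random.Random(seed)
    rows, missing = [], []
    for word in vocab:
        vector = pretrained.get(word)
        if vector is None:
            # Fall back to the lowercase form before giving up
            vector = pretrained.get(word.lower())
        if vector is None:
            # No pretrained vector at all: use a random one in [0, 1)
            vector = [rng.uniform(0, 1) for _ in range(dim)]
            missing.append(word)
        rows.append(list(vector))
    return rows, missing

toy_vocab = ["A", "dog", "zorp"]
toy_pretrained = {"a": [1.0, 0.0], "dog": [0.0, 1.0]}
rows, missing = build_embedding_rows(toy_vocab, toy_pretrained, dim=2)
print(missing)  # ['zorp']
```

Printing the list of missing words is a quick check on how much of the vocabulary actually benefits from the pre-trained vectors.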

# MODEL DECLARATIONS

## 1. CNN - Model ResNet 50

In [None]:
# Loading the pretrained ResNet50 model.
# https://www.tensorflow.org/api_docs/python/tf/keras/applications/resnet50/ResNet50
resnet50 = ResNet50(include_top=False, weights='imagenet', input_shape=(224,224,3), pooling='avg')

# VGG16 truncated at its second-to-last layer.
# It is not used in the rest of this notebook.
from tensorflow.keras.applications import VGG16 

vgg16 = VGG16()
vgg16 = Model(inputs = vgg16.inputs, outputs = vgg16.layers[-2].output)

In [None]:
# train_features = {}

# for tr_img in tqdm(train_img_names):
#     img = cv2.imread("/kaggle/input/flickr/Flickr8k/Flicker8k_Images"  + "/" + tr_img)
#     img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
#     img = cv2.resize(img,(224,224))
#     img = np.expand_dims(img, axis=0)
    
#     # Enabling GPU
#     with tf.device('gpu'):
#         feature_tr_img = resnet50.predict(img, verbose=0).reshape(2048)
        
#     train_features[tr_img] = feature_tr_img

In [None]:
# import os
# from os.path import exists

# if(not exists("/kaggle/working/train-image-features")):
#     os.system("mkdir /kaggle/working/train-image-features")
# with open("/kaggle/working/train-image-features/Train_Images_Features.pkl", 'wb') as f:
#   pickle.dump(train_features, f)

## 2. LSTM and rest of the Model

We have now extracted image features using the CNN. Next, we use that feature vector together with the caption decoded so far to keep predicting the next word in the caption. Hence we need a model that takes the image features and an input word sequence, processes them, combines the resulting representations, and produces a probability distribution over the vocabulary for the next word in the sequence.

In [None]:
# # First input is the CNN feature vectors of the images
# # One step processing more of the CNN feature vector to reduce it's size.
# from keras.layers import concatenate
# input_1 = Input(shape=(2048,))
# final_image_feature = Dropout(0.5)(input_1)
# final_image_feature = Dense(256, activation='relu')(final_image_feature)

# # Second input is the caption
# input_2 = Input(shape=(caption_max_length,))
# lang_feature = Embedding(vocab_size, emb_dim, weights=[embedding_matrix], trainable=False, mask_zero=True)(input_2)
# lang_feature = Dropout(0.5)(lang_feature)
# lang_feature = LSTM(256)(lang_feature)

# decoder = concatenate([final_image_feature, lang_feature])
# decoder = Dense(256, activation='relu')(decoder)
# outputs = Dense(vocab_size, activation='softmax')(decoder)

# model = Model(inputs=[input_1, input_2], outputs=outputs)

# model.compile(loss='categorical_crossentropy', optimizer='adam')


# feature input -> first path 
in1 = Input(shape = (1, 2048))

feat_l1 = Dropout(0.5)(in1)
feat_l2 = Dense(emb_dim, activation = 'relu')(feat_l1)

# sequence input -> second path
in2 = Input(shape=(caption_max_length,))
emb = Embedding(vocab_size, emb_dim, weights=[embedding_matrix], trainable=False, mask_zero=False)(in2)
emb = Dense(emb_dim, activation = 'relu')(emb)

comb_l1 = add([feat_l2, emb])

seq_l1 = Dropout(0.1)(comb_l1)
seq_l2 = LSTM(emb_dim, return_sequences = True)(seq_l1)

seq_l3 = Dropout(0.1)(seq_l2)
seq_l4 = LSTM(emb_dim, return_sequences = True)(seq_l3)

seq_l5 = Dropout(0.1)(seq_l4)
seq_l6 = LSTM(emb_dim)(seq_l5)


comb_l2 = add([Reshape((emb_dim, ))(feat_l2), seq_l6])
comb_l3 = Dense(emb_dim, activation = 'relu')(comb_l2)

# output
output = Dense(vocab_size, activation = 'softmax')(comb_l3)

# compile model
model = Model(inputs = [in1, in2], outputs = output)
model.compile(loss = 'categorical_crossentropy', 
             optimizer = Adam(amsgrad = True, learning_rate = 0.0005))

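# stop training once the loss stops improving
# No validation data is passed to fit(), so monitor the training loss instead of the default 'val_loss'
cp = EarlyStopping(monitor = 'loss', patience = 3, restore_best_weights= True)

model.summary()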

# BUILDING DATALOADER

We will call our function **Data Generator** to generate the training samples for batch_size captions at a time. But before that, let us read the training image features that we stored in a pickle file.

In [None]:
with open("/kaggle/input/d/riddhich/train-features/Train_Images_Features.pkl", 'rb') as f:
    train_features = pickle.load(f)
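
The save-then-load pattern for the feature dictionary can be sketched as a small round trip. The helper names, the toy feature values, and the temporary path below are assumptions for illustration. The real features are 2048-dimensional vectors keyed by image name.

```python
import os
import pickle
import tempfile

def save_features(features, path):
    # The context manager closes the file even if pickling fails
    with open(path, "wb") as f:
        pickle.dump(features, f)

def load_features(path):
    with open(path, "rb") as f:
        return pickle.load(f)

toy_features = {"dog.jpg": [0.0, 1.5, 2.0]}
with tempfile.TemporaryDirectory() as tmp_dir:
    toy_path = os.path.join(tmp_dir, "features.pkl")
    save_features(toy_features, toy_path)
    restored = load_features(toy_path)
print(restored == toy_features)  # True
```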

In [None]:
for key in train_features.keys():
    print(len(train_features[key]))
    break

In [None]:
def data_generator(features_dict, captions_dict, batch_size):
    
    # The inputs are the image features and a partial caption.
    # The output is the next word in the caption.
    # Declare lists to store them all for a batch
    img_features = []
    input_cap = []
    output_cap = []
    
    # To count the number of captions processed
    itr = 0
    
    while True:
        
        for img_name, cap_list in captions_dict.items():
            
            # Get the relevant image features
            img_feat = features_dict[img_name].reshape(1, 2048)

            for caption in cap_list:
                
                # Encode the caption
                caption_seq = [vocab_list.index(word) for word in caption.split(" ") if word in vocab_list]

                for i in range(1, len(caption_seq)):
                    
                    in_caption = caption_seq[:i]
                    out_caption = caption_seq[i]
                    
                    # Pad the input sequences
                    in_caption = pad_sequences([in_caption], maxlen=caption_max_length)[0]
                    
                    # Convert the output value to one hot 
                    out_caption = to_categorical([out_caption], num_classes=vocab_size)[0]     
                    
                    img_features.append(img_feat)
                    input_cap.append(in_caption)
                    output_cap.append(out_caption)
                    
                itr += 1
                    
                if itr == batch_size:
                    yield ([np.array(img_features), np.array(input_cap)], np.array(output_cap))
                        
                    itr = 0
                    img_features = []
                    input_cap = []
                    output_cap = []
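
To make the inner loop of the generator concrete, the sketch below expands one encoded caption into its (partial caption, next word) pairs using plain lists. The helper `expand_caption` and the toy token ids are hypothetical. The left padding and left truncation are meant to mirror the defaults of `pad_sequences`.

```python
def expand_caption(token_ids, max_len, pad_id=0):
    """Turn one encoded caption into (left-padded prefix, next token) pairs."""
    pairs = []
    for i in range(1, len(token_ids)):
        prefix = token_ids[:i][-max_len:]  # keep at most the last max_len tokens
        padded = [pad_id] * (max_len - len(prefix)) + prefix
        pairs.append((padded, token_ids[i]))
    return pairs

toy_ids = [1, 7, 4, 2]  # e.g. startcap, two words, endcap
for padded, target in expand_caption(toy_ids, max_len=5):
    print(padded, "->", target)
```

A caption with n tokens therefore yields n - 1 training samples. Note that `pad_id=0` is also the id of the word at position 0 of `vocab_list` in this notebook, so padding and that word are indistinguishable to the model.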
            

In [None]:
# A functionality to check the output for the created batch

gen = data_generator(train_features, train_dict, 32)

# Generate a batch by calling next() on the generator
[img_in, caption_in], caption_trg = next(gen)

print("Image features:", img_in.shape)
print("Caption input:", caption_in.shape)
print("Caption target:", caption_trg.shape)

# TRAINING CNN-LSTM MODEL

In [None]:
number_of_epochs = 50
batch_size = 160
steps = (len(train_dict)*5)//batch_size

gen = data_generator(train_features, train_dict, batch_size)

with tf.device('gpu'):
    model.fit(gen, epochs=number_of_epochs, steps_per_epoch=steps, verbose=1, callbacks=[cp])
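
The `steps` value follows from counting captions, since the generator yields once for every `batch_size` captions. As a worked example, assuming the standard Flickr8k training split of 6000 images (an assumption, since the size of `train_dict` is not printed here):

```python
def steps_per_epoch(num_images, captions_per_image, captions_per_batch):
    """Number of generator yields needed to cover the captions once (remainder dropped)."""
    return (num_images * captions_per_image) // captions_per_batch

steps_example = steps_per_epoch(6000, 5, 160)
print(steps_example)  # 187
```

Under that assumption, 187 steps cover 29920 of the 30000 captions. Because the generator is not reset between epochs, the remaining 80 captions are simply consumed at the start of the next epoch.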

In [None]:
with open('Trained_Model.pkl', 'wb') as f:
    pickle.dump(model, f)

model.summary()

# TESTING THE MODEL

We tested the model on various random test images. We defined a **get_caption function** that takes the image features and calls our model repeatedly to predict the whole caption for that image. Some of the outputs that we get are shown below.

In [None]:
# model = pickle.load(open("/kaggle/input/generateddata/Trained_Model.pkl", 'rb'))
# print(model.summary())

In [None]:
def get_caption(image):
    image = image.reshape(1, 1, 2048)
    
    in_caption = 'startcap'
    pred_word = ""
    length_predicted = 0
    
    while(pred_word != 'endcap' and length_predicted <= caption_max_length) :
        caption_seq = [vocab_list.index(word) for word in in_caption.split() if word in vocab_list]
        caption_seq = pad_sequences([caption_seq], maxlen=caption_max_length)
        
        # Calling our model
        with tf.device('gpu'):
            pred_softmax = model.predict([image,caption_seq], verbose=0)
        pred_index = np.argmax(pred_softmax)
        pred_word = vocab_list[pred_index]
        
        in_caption += ' ' + pred_word
        length_predicted += 1
        
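    caption = in_caption.split()
    # Remove 'startcap', and remove 'endcap' only if the model actually produced it
    # (the loop can also stop because the length limit was reached)
    caption = caption[1:]
    if caption and caption[-1] == 'endcap':
        caption = caption[:-1]
    caption = ' '.join(caption)
    return caption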
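
The greedy decoding loop can be exercised without a trained network. In the sketch below, `toy_next_word` is a hypothetical stand-in for the model that simply replays a fixed sentence, and `greedy_decode` is an illustrative helper rather than the notebook's function.

```python
def greedy_decode(next_word_fn, start="startcap", end="endcap", max_len=10):
    """Repeatedly append the predicted next word until `end` appears or `max_len` words are produced."""
    words = [start]
    while len(words) - 1 < max_len:
        word = next_word_fn(words)
        if word == end:
            break  # the end marker is never added to the output
        words.append(word)
    return " ".join(words[1:])

# Hypothetical stand-in for the trained model: replays a fixed sentence
script = ["a", "dog", "runs", "endcap"]

def toy_next_word(words):
    return script[min(len(words) - 1, len(script) - 1)]

print(greedy_decode(toy_next_word))             # a dog runs
print(greedy_decode(toy_next_word, max_len=2))  # a dog
```

The second call shows the case where the length limit stops decoding before the end marker appears, in which case no word should be stripped from the end.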

In [None]:
# Pick a test image that you want to caption.
test_image_name = test_img_names[1]

# Reading the image and processing it
img = cv2.imread("/kaggle/input/flickr/Flickr8k/Flicker8k_Images"  + "/" + test_image_name)   
img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
img = cv2.resize(img,(224,224))
img = np.expand_dims(img, axis=0)

# Calling the Model
pred = resnet50.predict(img, verbose=0).reshape(1, 2048)

# Displaying the image
x = plt.imread("/kaggle/input/flickr/Flickr8k/Flicker8k_Images"  + "/" + test_image_name)
plt.imshow(x)
plt.show()

# printing the caption
caption = get_caption(pred)
print(caption)

# EVALUATION USING BLEU SCORE

In [None]:
scores_list = []
test_images_list = test_img_names.copy()
#test_images_list = random.sample(test_images_list, 100)

for img_name in tqdm(test_images_list):   
    predictions_list = []
    img = cv2.imread("/kaggle/input/flickr/Flickr8k/Flicker8k_Images"  + "/" + img_name)
    img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
    img = cv2.resize(img,(224,224))
    img = np.expand_dims(img, axis=0)

    with tf.device('gpu'):
        pred = resnet50.predict(img, verbose = 0).reshape(1,2048)

    pred = get_caption(pred)
    predictions_list.append(pred)

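    # sentence_bleu expects tokenised input: a list of reference token lists
    # and a hypothesis token list. Plain strings would be compared character by character.
    reference = [ref.split() for ref in test_dict[img_name]]

    pred_words = pred.split()
    score = sentence_bleu(reference, pred_words)
    scores_list.append(score)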
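
`sentence_bleu` compares sequences item by item, so whether it receives token lists or raw strings changes what is being matched. The sketch below is not BLEU itself. It computes only a clipped unigram precision, on made-up sentences, to show how much the score shifts between word-level and character-level input. The helper name is hypothetical.

```python
from collections import Counter

def clipped_unigram_precision(references, hypothesis):
    """Share of hypothesis items that also occur in some reference, with clipped counts."""
    hyp_counts = Counter(hypothesis)
    max_ref_counts = Counter()
    for ref in references:
        for item, count in Counter(ref).items():
            max_ref_counts[item] = max(max_ref_counts[item], count)
    matched = sum(min(count, max_ref_counts[item]) for item, count in hyp_counts.items())
    return matched / max(len(hypothesis), 1)

refs = ["a dog runs", "the dog is running"]
hyp = "a cat runs"

# Word level: sentences are split into tokens first
word_level = clipped_unigram_precision([r.split() for r in refs], hyp.split())
# Character level: iterating over a string yields single characters
char_level = clipped_unigram_precision(refs, hyp)
print(word_level, char_level)
```

The character-level figure comes out higher because common letters and spaces match easily, which is why scores computed on raw strings are not comparable to word-level scores.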

In [None]:
def mode(arr):
    vals,counts = np.unique(arr, return_counts=True)
    most_common = np.argmax(counts)
    return vals[most_common]

print("Mean BLEU: " + str(np.mean(scores_list)))
print("Max BLEU: " + str(np.max(scores_list)))
print("Mode BLEU: " + str(mode(scores_list)))
print("Median BLEU: " + str(np.median(scores_list)))

# EVALUATION USING METEOR SCORE

In [None]:
scores_list = []
test_images_list = test_img_names.copy()
#test_images_list = random.sample(test_images_list, 100)

for img_name in tqdm(test_images_list):   
    predictions_list = []
    img = cv2.imread("/kaggle/input/flickr/Flickr8k/Flicker8k_Images"  + "/" + img_name)
    img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
    img = cv2.resize(img,(224,224))
    img = np.expand_dims(img, axis=0)

    with tf.device('gpu'):
        pred = resnet50.predict(img, verbose = 0).reshape(1,2048)

    pred = get_caption(pred)
    predictions_list.append(pred)

    reference = test_dict[img_name].copy()

    pred_words = pred.split()
    score = meteor_score(reference, pred)
    scores_list.append(score)

In [None]:
def mode(arr):
    vals,counts = np.unique(arr, return_counts=True)
    most_common = np.argmax(counts)
    return vals[most_common]

print("Mean METEOR: " + str(np.mean(scores_list)))
print("Max METEOR: " + str(np.max(scores_list)))
print("Mode METEOR: " + str(mode(scores_list)))
print("Median METEOR: " + str(np.median(scores_list)))