# Image Caption Generator

This notebook is a to-do guide about generating captions for images and it's comparision to the latest models for generating these captions. 

# Problem Description

Image caption generation models are models that analyze images and automatically generate relevant captions. 

They combine techniques from computer vision and natural language processing to “understand” an image's visual content and express it in natural language. This task is complex because it requires not only recognizing objects in an image but also understanding their context, relationships, and the ability to translate this understanding into a coherent sentence.

## Intuition

- Images can be compressed to vectors of a multitude of features. These can be generated using a CNN (Convolutional Neural Network).

- Our goal is to generate a suitable `caption` for the image given, which is a sequence of texts. We can generate a sequence using an RNN (Recurrent Neural Network) like LSTM(Long-Short Term Memory) or GRU (Gated Recurrent Unit)

- We push the Image vector(feature vector) as our initial state for RNN and try to generate text at each time-step of the RNN using the feature vector.

- While training, we will already have our images and captions at the ready. Get our feature vector of the image and push the feature vector against a untrained/ pre-trained RNN and compare it with our actual caption output. Train it with back-prop to get better at accuracy. 

# Strategy

- Use the pretrained `Inception_V3` model to generate the feature vector of the image.

- Pass it through an RNN to generate an output embedding and compare it to the actual output in the embedding form, use an error function with these two and backprop to get a fix of this hybrid model, to generate accurate captions. 

- We are going to implement both `LSTM` and `GRU` architectures as our caption generation models.

- We are using the `MSCOCO` Dataset for our task of image caption generation, with an 80-20 train-test split.

## Models

### LSTM

### GRU

### InceptionV3

## Model Scoring Approach

### BLEU Score

BLEU (Bilingual Evaluation Understudy) is a widely used metric for evaluating the quality of machine-translated text.

BLEU is an algorithm that measures the correspondence between a machine translation and professional human translations of the same text.

> Reference: https://en.wikipedia.org/wiki/BLEU

BLEU compares n-grams (sequences of n consecutive words) of the machine-translated text with n-grams of reference human translations. The algorithm:

- Calculates n-gram precision scores (typically for 1 to 4-grams)
- Applies a brevity penalty to penalize short translations
- Combines these scores using a weighted geometric mean

- BLEU scores range from 0 to 1, where 1 indicates a perfect match with the reference translation.
- In practice, scores between 0.6-0.7 are considered very good.
- Scores closer to 1 are rare and may indicate overfitting

> Reference: https://towardsdatascience.com/foundations-of-nlp-explained-bleu-score-and-wer-metrics-1a5ba06d812b

In [3]:
from nltk.translate.bleu_score import sentence_bleu

reference = [["this", "is", "a", "test"]]
candidate = ["this", "is", "test"]

score = sentence_bleu(reference, candidate)
print(score)

8.987727354491445e-155


The hypothesis contains 0 counts of 3-gram overlaps.
Therefore the BLEU score evaluates to 0, independently of
how many N-gram overlaps of lower order it contains.
Consider using lower n-gram order or use SmoothingFunction()
The hypothesis contains 0 counts of 4-gram overlaps.
Therefore the BLEU score evaluates to 0, independently of
how many N-gram overlaps of lower order it contains.
Consider using lower n-gram order or use SmoothingFunction()


# Code

## Installs and Environment Setup

In [4]:
%pip install numpy tensorflow
%pip install keras # For latest versions of tensorflow, it is advised to use keras externally 
%pip install keras_nlp

Note: you may need to restart the kernel to use updated packages.
Note: you may need to restart the kernel to use updated packages.
Note: you may need to restart the kernel to use updated packages.


## Imports

In [5]:
import importlib, importlib_metadata

In [6]:
def print_module_version(module_name):
    try:
        version = importlib.metadata.version(module_name)
        print(f"{module_name} Version: ",version)
    except importlib.metadata.PackageNotFoundError:
         print(f"{module_name} is not installed or version information is not available")

In [259]:
import numpy as np
print_module_version("numpy")
import pandas as pd
print_module_version("pandas")
import tensorflow as tf
print_module_version("tensorflow")
from keras.applications import InceptionV3
from keras.applications.inception_v3 import preprocess_input
from keras.models import Model, Sequential
from keras.layers import Input, Embedding, LSTM, GRU, Dense, Dropout, Add
from keras_nlp.tokenizers import Tokenizer, WordPieceTokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.utils import to_categorical 
print_module_version("keras")
# Importing Pycoctools for potential dataset handling from the coco["train2017"] API -- Python Version
import pycocotools
print_module_version("pycocotools")
from sklearn.model_selection import train_test_split
print_module_version("sklearn")
from nltk.translate.bleu_score import sentence_bleu
print_module_version("nltk")
from scipy.spatial.distance import cosine
print_module_version("scipy")
import re
print_module_version("re")
import pickle
print_module_version("pickle")
import os
print_module_version("os")
import glob
print_module_version("glob")
from PIL import Image
print_module_version("PIL")
from tqdm import tqdm
print_module_version("tqdm")

numpy Version:  1.26.4
pandas Version:  2.0.3
tensorflow Version:  2.17.0
keras Version:  3.6.0
pycocotools Version:  2.0
sklearn is not installed or version information is not available
nltk Version:  3.8.1
scipy Version:  1.11.1
re is not installed or version information is not available
pickle is not installed or version information is not available
os is not installed or version information is not available
glob is not installed or version information is not available
PIL is not installed or version information is not available
tqdm Version:  4.65.0


In [8]:
%matplotlib inline
from pycocotools.coco import COCO
import skimage.io as io
print_module_version("skimage.io")
import matplotlib.pyplot as plt
print_module_version("matplotlib")
import pylab
print_module_version("pylab")
pylab.rcParams['figure.figsize'] = (8.0, 10.0)

skimage.io is not installed or version information is not available
matplotlib Version:  3.7.2
pylab is not installed or version information is not available


## Implementation

### Installing MSCOCO Dataset and Understanding the COCO API

- We have installed it with Github Import from [CocoAPI](https://github.com/cocodataset/cocoapi)
- Used make tool to install from MakeFile of the `cocoapi/PythonAPI` folder in the repository, with the command below. 

$$ make -f MakeFile $$  

But this has only provided us with the validation datasets. What we actually want are all the datasets -- train, val, test. Foe which we used the `pycocotools` module/ API for installing the COCO dataset.

> In the Common Objects in Context (COCO) dataset, an annotation is a list of objects in an image, along with detailed information about each object. This information includes the object's class label, bounding box coordinates, and segmentation mask. 

> Annotations are stored in a JSON file, along with other information about the images and dataset.

#### Instance Viewing

In [9]:
dataDir='./dataset'
dataTypes=['train2017','val2017']

In [10]:
def generate_coco_ds_files(datadir,datatypes):
    coco = dict()
    for dataType in dataTypes:
        annFile='{}/annotations/instances_{}.json'.format(dataDir,dataType)
        coco[dataType]=COCO(annFile)
    return coco

In [11]:
coco = generate_coco_ds_files(datadir=dataDir,datatypes=dataTypes)

loading annotations into memory...
Done (t=14.55s)
creating index...
index created!
loading annotations into memory...
Done (t=0.93s)
creating index...
index created!


In [12]:
print(coco['train2017'].info())
print("")
print("")
print(coco['val2017'].info())


description: COCO 2017 Dataset
url: http://cocodataset.org
version: 1.0
year: 2017
contributor: COCO Consortium
date_created: 2017/09/01
None


description: COCO 2017 Dataset
url: http://cocodataset.org
version: 1.0
year: 2017
contributor: COCO Consortium
date_created: 2017/09/01
None


In [13]:
coco['train2017'].getAnnIds()

[156,
 509,
 603,
 918,
 1072,
 1727,
 1728,
 1767,
 1769,
 1774,
 2144,
 2251,
 2255,
 2259,
 2280,
 2281,
 2496,
 2498,
 2500,
 2526,
 2532,
 2534,
 2544,
 3003,
 3165,
 3212,
 3323,
 3375,
 3488,
 3692,
 3817,
 4047,
 4061,
 4079,
 4254,
 4430,
 4474,
 4654,
 4703,
 4893,
 4932,
 4940,
 5037,
 5181,
 5544,
 5560,
 5585,
 5623,
 5637,
 5652,
 5812,
 6099,
 6104,
 6172,
 6174,
 6259,
 6656,
 7023,
 7064,
 7158,
 7178,
 7228,
 7287,
 7305,
 7319,
 7486,
 7514,
 7588,
 7670,
 7793,
 7880,
 7910,
 8000,
 8084,
 8442,
 8715,
 8721,
 8793,
 8807,
 8915,
 8989,
 9022,
 9031,
 9040,
 9065,
 9120,
 9155,
 9280,
 9417,
 9553,
 9626,
 9657,
 9819,
 10032,
 10071,
 10187,
 10385,
 10442,
 10449,
 10527,
 10600,
 10750,
 11070,
 11097,
 11207,
 11272,
 11290,
 11381,
 11491,
 11554,
 11571,
 11592,
 11640,
 11650,
 11660,
 11682,
 11706,
 11840,
 11970,
 12030,
 12047,
 12282,
 12543,
 12731,
 12770,
 12781,
 12790,
 12799,
 12882,
 12994,
 13200,
 13226,
 13283,
 13328,
 13399,
 13678,
 13836,
 

In [14]:
# display COCO categories and supercategories
cats = coco["train2017"].loadCats(coco["train2017"].getCatIds())
nms=[cat['name'] for cat in cats]
print('COCO categories: \n{}\n'.format(' '.join(nms)))

nms = set([cat['supercategory'] for cat in cats])
print('COCO supercategories: \n{}'.format(' '.join(nms)))

COCO categories: 
person bicycle car motorcycle airplane bus train truck boat traffic light fire hydrant stop sign parking meter bench bird cat dog horse sheep cow elephant bear zebra giraffe backpack umbrella handbag tie suitcase frisbee skis snowboard sports ball kite baseball bat baseball glove skateboard surfboard tennis racket bottle wine glass cup fork knife spoon bowl banana apple sandwich orange broccoli carrot hot dog pizza donut cake chair couch potted plant bed dining table toilet tv laptop mouse remote keyboard cell phone microwave oven toaster sink refrigerator book clock vase scissors teddy bear hair drier toothbrush

COCO supercategories: 
electronic food appliance indoor sports animal kitchen accessory vehicle person outdoor furniture


In [15]:
# get all images containing given categories, select one at random
catIds = coco["train2017"].getCatIds(catNms=['person','dog','skateboard']);
print(len(catIds))
if(len(catIds)<=5):
    print(catIds)
imgIds = coco["train2017"].getImgIds(catIds=catIds)
print(len(imgIds))
if(len(imgIds)<=5):
    print(imgIds)
# Get a Random Image from the above categories
img = coco["train2017"].loadImgs(imgIds[np.random.randint(0,len(imgIds))])[0]

3
[1, 18, 41]
65


In [16]:
I = io.imread(img['coco_url'])
plt.axis('off')
plt.imshow(I)
plt.show()

  plt.show()


In [17]:
# load and display instance annotations
plt.imshow(I); plt.axis('off')
annIds = coco["train2017"].getAnnIds(imgIds=img['id'], catIds=catIds, iscrowd=None)
anns = coco["train2017"].loadAnns(annIds)
coco["train2017"].showAnns(anns)

#### Caption Viewing

The code below demostrates loading the captions of the dataset based on image id of the COCO annotations 

In [18]:
dataDir='./dataset'
dataTypes=['train2017','val2017']

In [19]:
def generate_coco_ds_caption_files(datadir,datatypes):
    coco = dict()
    for dataType in dataTypes:
        annFile='{}/annotations/captions_{}.json'.format(dataDir,dataType)
        coco[dataType]=COCO(annFile)
    return coco

In [20]:
coco_caps = generate_coco_ds_caption_files(datadir=dataDir,datatypes=dataTypes)

loading annotations into memory...


Done (t=0.55s)
creating index...
index created!
loading annotations into memory...
Done (t=0.02s)
creating index...
index created!


In [21]:
# load and display caption annotations
annIds = coco_caps["train2017"].getAnnIds(imgIds=img['id'])
print(annIds)
anns = coco_caps["train2017"].loadAnns(annIds)
coco_caps["train2017"].showAnns(anns)
plt.imshow(I); plt.axis('off'); plt.show()

[65845, 67786, 70180, 72172, 75262]
A dog sleeps in a carrier  inside a mans coat.
A man has a white dog stuffed inside his sweatshirt.
Skater with skateboard in hand with puppy in sweatshirt.
A person holding a skateboard with a dog tucked in their jacket.
A person with a dog in his jacket holding a skateboard.


  plt.imshow(I); plt.axis('off'); plt.show()


#### Review

Now we have a solid understanding of what to do in order to load the COCO dataset. 

### Model Installation

In [22]:
inception_model_pretrained = InceptionV3(weights='imagenet',classifier_activation=None)
inception_model_pretrained.summary()

### Input Setup

Before getting to the next part of the model build, we will demonstrate the build up of both the image and the caption set at once. 

In [23]:
coco # Annotations Object of Train 2017, Validation 2017 Datasets

{'train2017': <pycocotools.coco.COCO at 0x2b4638690>,
 'val2017': <pycocotools.coco.COCO at 0x2fd9efa10>}

In [24]:
coco_caps # Captions Object of Train 2017, Validation 2017 Datasets

{'train2017': <pycocotools.coco.COCO at 0x3054dad50>,
 'val2017': <pycocotools.coco.COCO at 0x3afb518d0>}

In [25]:
img_ids={}
img_ids["train2017"] = coco["train2017"].getImgIds()
img_ids["val2017"] = coco["val2017"].getImgIds()
len(img_ids["train2017"]),len(img_ids["val2017"])

(118287, 5000)

We have approximately 118,000 training images and just 5000 for validation. 

We are now buliding a simple method to demostrate image and caption side-by-side

In [170]:
def generate_images_and_captions(setType,img_ids=[],printSingleImage=False):
    try:
        if(printSingleImage):
            try:
                print("Printing Random Image")
                image_ids=coco[setType].getImgIds()
                unique_img_id = image_ids[np.random.randint(0,len(image_ids))]
                unique_img = coco[setType].loadImgs(unique_img_id)
                unique_ann_id = coco_caps[setType].getAnnIds(imgIds=[unique_img_id])
                unique_ann = coco_caps[setType].loadAnns(unique_ann_id)
                print(unique_img[0])
                unique_img_object = io.imread(unique_img[0]['coco_url'])
                plt.imshow(unique_img_object)
                plt.axis('off')
                plt.show()
                print("COCO URL: ",unique_img[0]['coco_url'])
                print("----Annotations----")
                coco_caps[setType].showAnns(unique_ann)
                
            except:
                print("There has been an unexpected error.")
                return None
        if(img_ids==[]):
            collection={}
            image_ids=coco[setType].getImgIds()
            images = coco[setType].loadImgs(image_ids)
            annIds = coco_caps[setType].getAnnIds()
            anns = coco_caps[setType].loadAnns(annIds)
            collection["images"] = images
            collection["anns"] = anns
            return collection
        else:
            collection={}
            image_ids=coco[setType].getImgIds(img_ids=img_ids)
            images = coco[setType].loadImgs(image_ids)
            annIds = coco_caps[setType].getAnnIds(img_ids=img_ids)
            anns = coco_caps[setType].loadAnns(annIds)
            collection["images"] = images
            collection["anns"] = anns
            return collection
        
    except:
        print("There has been an unexpected error.")
        return None

Now generating input/output set with a random image+annotation print

In [171]:
data={}
data["train2017"] = generate_images_and_captions("train2017",printSingleImage=True)
data["val2017"] = generate_images_and_captions("val2017",printSingleImage=False)

Printing Random Image
{'license': 1, 'file_name': '000000140068.jpg', 'coco_url': 'http://images.cocodataset.org/train2017/000000140068.jpg', 'height': 428, 'width': 640, 'date_captured': '2013-11-19 21:30:32', 'flickr_url': 'http://farm2.staticflickr.com/1143/537506987_1954a029be_z.jpg', 'id': 140068}


  plt.show()


COCO URL:  http://images.cocodataset.org/train2017/000000140068.jpg
----Annotations----
A group of men on a field playing baseball.
Scene at a baseball game with players in the field and people in the stands
An outfielder walking across the field at a major baseball game.
A baseball player walks on the baseball field.
The man is walking on the field to play a game of baseball. 


### Generating Feature Vectors

After setting up data input/output for our main task, we focus on our next step, `Feature Generation` with the pre-trained `InceptionV3` model, with the last layer removed to augment the model for our task of caption generation.

In [28]:
inception_model_pretrained = Model(inception_model_pretrained.input, inception_model_pretrained.layers[-2].output)
inception_model_pretrained.summary()

In [315]:
# Mini-model method to break long-variable loops

def break_loop(index,count=5):
    if(count==None):
        return False
    if index<=count:
        return True
    else:
        return False

In [352]:
def extract_features(img, model):
    
    img = np.resize(img,(299, 299, 3))
    if img.shape == (299, 299, 3):
        img = np.expand_dims(img, axis=0)
        img = preprocess_input(img)
        feature = model.predict(img, verbose=0)
        return feature.reshape(-1)

In [226]:
features = {
    "train2017":{},
    "val2017":{}
}

index=0

for img_file in tqdm(data["train2017"]["images"]):
    img_obj = io.imread(img_file['coco_url'])
    img_id  = img_file["id"]
    features["train2017"][img_id] = extract_features(img_obj, inception_model_pretrained)
    index+=1
    if break_loop(index,count=10):
        break

index=0

for img_file in tqdm(data["val2017"]["images"]):
    img_obj = io.imread(img_file['coco_url'])
    img_id  = img_file["id"]
    features["val2017"][img_id] = extract_features(img_obj, inception_model_pretrained)
    index+=1
    if(index==1):
        print(features["val2017"][img_id])
    if break_loop(index,count=10):
        break


  0%|          | 9/118287 [00:03<14:34:21,  2.25it/s]
  0%|          | 2/5000 [00:00<22:09,  3.76it/s]

[0.00312117 0.34612817 1.0620424  ... 1.2099916  0.00280857 0.68042004]


  0%|          | 9/5000 [00:03<34:02,  2.44it/s]


### Caption Preprocessing

Since we have paired up images and captions, we can preprocess the captions to embedding vectors to make the comparision with the generated vector output.

In [56]:
def preprocess_caption(caption):
    caption = caption.lower()
    caption = re.sub(r'[^a-z\s]', '', caption)
    caption = ' '.join(caption.split())
    caption = "startseq " + caption + " endseq"
    return caption


After building the method, we preprocess our train dataset captions and create a vocab list from them.

In [57]:
# Prepare tokenizer
all_captions={}
all_captions["train2017"] = [preprocess_caption(item['caption']) for item in data["train2017"]["anns"]]
print(len(all_captions["train2017"]))
set_captions_list = list(set((" ".join(all_captions["train2017"])).split(" ")))
vocab = ["UNK"]+ ["PAD"]+ list(set_captions_list)
vocab

591753


['UNK',
 'PAD',
 'italianate',
 'dismount',
 'her',
 'ruched',
 'stockpot',
 'reporter',
 'satirical',
 'clerical',
 'composited',
 'howe',
 'marlins',
 'striding',
 'spy',
 'window',
 'plaintively',
 'ripening',
 'restraint',
 'fisher',
 'ii',
 'ds',
 'orchestrating',
 'pies',
 'longhaired',
 'hos',
 'butterflyshaped',
 'skatebaord',
 'manipulating',
 'potluck',
 'jersey',
 'drydocked',
 'afganistan',
 'random',
 'vendetta',
 'fiield',
 'ferry',
 'ceilings',
 'biscotti',
 'somehwat',
 'grafiti',
 'fettuccini',
 'project',
 'choice',
 'faint',
 'burlap',
 'skatboards',
 'distraught',
 'normally',
 'farmers',
 'autographs',
 'cecil',
 'lumberjacks',
 'allee',
 'snowencrusted',
 'repainted',
 'hide',
 'rhododendron',
 'tahiti',
 'fruitthemed',
 'manual',
 'workroom',
 'jeans',
 'backed',
 'graphic',
 'horsemounted',
 'canoeist',
 'plantlife',
 'counch',
 'merchants',
 'rescue',
 'transdev',
 'davenport',
 'racked',
 'jjet',
 'seedpods',
 'slathered',
 'gook',
 'familys',
 'adultsized',
 

### Tokenization

Now we build a `WordPieceTokenizer` from the generated vocabulary list.

In [58]:
wp_tokenizer = WordPieceTokenizer(
    vocabulary=vocab,
    lowercase=True,
    oov_token='UNK',
    dtype="int32",
)

In [59]:
vocab_size = wp_tokenizer.vocabulary_size()
max_length = max(len(c.split()) for c in all_captions["train2017"])
print(vocab_size)
print(max_length)

28563
51


In [60]:
wp_tokenizer.get_vocabulary()

['UNK',
 'PAD',
 'italianate',
 'dismount',
 'her',
 'ruched',
 'stockpot',
 'reporter',
 'satirical',
 'clerical',
 'composited',
 'howe',
 'marlins',
 'striding',
 'spy',
 'window',
 'plaintively',
 'ripening',
 'restraint',
 'fisher',
 'ii',
 'ds',
 'orchestrating',
 'pies',
 'longhaired',
 'hos',
 'butterflyshaped',
 'skatebaord',
 'manipulating',
 'potluck',
 'jersey',
 'drydocked',
 'afganistan',
 'random',
 'vendetta',
 'fiield',
 'ferry',
 'ceilings',
 'biscotti',
 'somehwat',
 'grafiti',
 'fettuccini',
 'project',
 'choice',
 'faint',
 'burlap',
 'skatboards',
 'distraught',
 'normally',
 'farmers',
 'autographs',
 'cecil',
 'lumberjacks',
 'allee',
 'snowencrusted',
 'repainted',
 'hide',
 'rhododendron',
 'tahiti',
 'fruitthemed',
 'manual',
 'workroom',
 'jeans',
 'backed',
 'graphic',
 'horsemounted',
 'canoeist',
 'plantlife',
 'counch',
 'merchants',
 'rescue',
 'transdev',
 'davenport',
 'racked',
 'jjet',
 'seedpods',
 'slathered',
 'gook',
 'familys',
 'adultsized',
 

In [61]:
wp_tokenizer.tokenize(all_captions["train2017"][0])

<tf.Tensor: shape=(12,), dtype=int32, numpy=
array([27061,  7769, 10936, 17377, 15538,  7769, 22553,  1921, 17151,
       11527, 26343,  2327], dtype=int32)>

In [62]:
print(wp_tokenizer.token_to_id("startseq"))
print(wp_tokenizer.token_to_id("endseq"))
print(wp_tokenizer.token_to_id("UNK"))
print(wp_tokenizer.token_to_id("PAD"))


27061
2327
0
1


Saving this WP Tokenizer.

In [63]:
with open('wp_tokenizer.pkl', 'wb') as f:
    pickle.dump(wp_tokenizer, f)

### Sequence Padding

Now after saving the WP Tokenizer, we build a method to pad our sequences of words.

First of all, we load back the Word Piece Tokenizer

In [64]:
with open('wp_tokenizer.pkl', 'rb') as file:
    wp_tokenizer = pickle.load(file)
wp_tokenizer # build=True comes when it is used inside a model.

<WordPieceTokenizer name=word_piece_tokenizer_1, built=False>

### Model Implementation

Now we define our models with this method below. These two models will be our LSTM and GRU models.

In [357]:
vocab_size = vocab_size
max_length = max_length
feature_dim = features["train2017"][391895].shape # 391895 is valid image id

In [358]:
model_shape_structure = {
    "feature_input":(2048,),
    "embedding_dim":256,
    "vocab_size":vocab_size,
    "max_length":(max_length,),
    "return_sequences":False
}

In [359]:
def define_model(model_shape_structure,rnn_type="LSTM",model_name="lstm_model",return_sequences=False):
    model = Sequential()
    
    feature_input = Input(shape=model_shape_structure["feature_input"])
    feature_input_drp = Dropout(0.5)(feature_input)
    feature_sec = Dense(model_shape_structure["embedding_dim"], activation='relu')(feature_input_drp)
    
    caption_inputs = Input(shape=(model_shape_structure["max_length"]))
    se1 = Embedding(model_shape_structure["vocab_size"],model_shape_structure["embedding_dim"], mask_zero=True)(caption_inputs)
    se2 = Dropout(0.5)(se1)
    
    if rnn_type == "LSTM":
        seq_model = LSTM(units=model_shape_structure["embedding_dim"], return_sequences=return_sequences)(se2)
    else:
        seq_model = GRU(units=model_shape_structure["embedding_dim"], return_sequences=return_sequences)(se2)
    
    decoder1 = Add()([feature_sec, seq_model])
    decoder2 = Dense(model_shape_structure["embedding_dim"], activation='relu')(decoder1)
    outputs = Dense(model_shape_structure["vocab_size"], activation='softmax')(decoder2)

    print("Output Final Shape: ",outputs.shape)    

    model = Model(inputs=[feature_input, caption_inputs], outputs=outputs, name=model_name)
    model.compile(loss='categorical_crossentropy', optimizer='adam')
    return model
    
    

In [360]:
lstm_model = define_model(model_shape_structure=model_shape_structure,rnn_type="LSTM",return_sequences=model_shape_structure["return_sequences"])
gru_model = define_model(model_shape_structure=model_shape_structure,rnn_type="GRU",model_name="gru_model",return_sequences=model_shape_structure["return_sequences"])

Output Final Shape:  (None, 28563)
Output Final Shape:  (None, 28563)


Here below are the summaries of our models

In [361]:
lstm_model.summary()

In [362]:
gru_model.summary()

In [363]:
def generate_padded_seq(seq, wp_tokenizer):
    seq=wp_tokenizer.tokenize(seq)
    padded_seq = pad_sequences(
        [seq],
        maxlen=max_length,
        padding='post',
        value=1
    )
    return padded_seq

def generate_inputs_by_id(setType,img_id):
    annotations_ids = coco_caps[setType].getAnnIds(imgIds=[img_id])
    annotations_obj = coco_caps[setType].loadAnns(annotations_ids)
    annotations = [preprocess_caption(item['caption']) for item in annotations_obj]
    random_annotation = annotations[np.random.randint(0,len(annotations))]
    feature = features["train2017"][img_id]
    return feature , random_annotation

def generate_sequences(setType):
    X1, X2, y = [], [], []
    index = 0
    for imageObj in (data[setType]["images"]):
        img_id = imageObj['id']
        x1, x2 = generate_inputs_by_id("train2017",img_id=img_id)
        print(x2)
        seq = generate_padded_seq(x2,wp_tokenizer=wp_tokenizer)
        print("seq: ",seq)
        for i in range(1, len(seq[0])):
            in_seq, out_seq = seq[0][:i], seq[0][i]
            out_seq = to_categorical(out_seq, num_classes=vocab_size)
            X1.append(x1) # Extending vectors give you extension along wrong axis
            padded_in_seq = pad_sequences(
                [in_seq],
                maxlen=max_length,
                padding='post',
                value=1
            )
            X2.extend(padded_in_seq)
            y.append(out_seq) # Extending vectors give you extension along wrong axis
        
        index+=1

        if(break_loop(index,count=1)):
            print(x1.shape,x2)
        if break_loop(index):
            break

    return np.array(X1), np.array(X2), np.array(y)

In [364]:
X1train,X2train,ytrain = generate_sequences("train2017")

startseq a man in a red shirt and a red hat is on a motorcycle on a hill side endseq
seq:  [[27061  7769 18339 12500  7769  9962  4861 24948  7769  9962 19937  3165
   8350  7769 24450  8350  7769 26669  6425  2327     1     1     1     1
      1     1     1     1     1     1     1     1     1     1     1     1
      1     1     1     1     1     1     1     1     1     1     1     1
      1     1     1]]
(2048,) startseq a man in a red shirt and a red hat is on a motorcycle on a hill side endseq


In [365]:
X1train = np.array(X1train)
X2train = np.array(X2train)
ytrain = np.array(ytrain )

print(X1train.shape)
print(X2train.shape)
print(ytrain.shape)

(50, 2048)
(50, 51)
(50, 28563)


In [366]:
lstm_model.fit([X1train, X2train], ytrain, epochs=2, batch_size=32)
gru_model.fit([X1train, X2train], ytrain, epochs=2, batch_size=32)

Epoch 1/2




[1m2/2[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m3s[0m 172ms/step - loss: 10.1150
Epoch 2/2
[1m2/2[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 174ms/step - loss: 8.7894
Epoch 1/2




[1m2/2[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m3s[0m 162ms/step - loss: 10.2170
Epoch 2/2
[1m2/2[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 215ms/step - loss: 9.2207


<keras.src.callbacks.history.History at 0x3bded9e50>

In [380]:
def generate_caption(model,tokenizer, photo, pretrained_img_model, max_length):
    
    in_text = 'startseq'
    gen_seq = ''
    gen_seq+=in_text

    for i in range(max_length):

        padded_seq = generate_padded_seq(gen_seq,wp_tokenizer)
        print(padded_seq)
        feature_vec = extract_features(photo,pretrained_img_model)
        feature_vec = np.resize(feature_vec, (1,2048))
        yhat = model.predict([feature_vec, padded_seq], verbose=0)
        # Convert to original class label
        original_token = np.argmax(yhat,axis=1)
        resp = tokenizer.detokenize(original_token)
        if resp is None:
            break
        
        gen_seq = gen_seq + " " + resp

        if resp == 'endseq':
            break

    return gen_seq


In [None]:
img_obj = io.imread(data["val2017"]["images"][0]['coco_url'])
caption_generated = generate_caption(lstm_model,wp_tokenizer,img_obj,inception_model_pretrained,51)
print(caption_generated)

In [None]:
# Calculate BLEU and semantic distance
def evaluate_model(model, tokenizer, photos, captions, max_length):
    actual, predicted = [], []
    for key, desc_list in captions.items():
        y_pred = generate_caption(model, tokenizer, photos[key], max_length)
        references = [d.split() for d in desc_list]
        y_pred = y_pred.split()
        bleu = sentence_bleu(references, y_pred)
        actual.append(references)
        predicted.append(y_pred)
        print(f'BLEU: {bleu:.3f}')


# Observations

- Need to actually train it. 
- Will train it in parallel.

# Conclusion

- This works
- Need better Models for NLP tasks

# Scope

- Future potential for compatibility with Large Language Models. Document based generation with Formated texts and Images embedded in a formatted fashion.
- Potential Model Improvements and Dataset Improvements.
- Scope for Image Generation with larger, diverse datasets. 