An image caption generator combines two tasks: understanding an image and producing a natural-language description of it.

Sounds interesting? Let's begin!

# Approach

We will tackle this problem with an encoder-decoder model. The encoder produces two encodings, one of the image and one of the partial text caption, and feeds both to the decoder.
Our model uses a CNN as the 'image model' and an RNN/LSTM as the 'language model' that encodes text sequences of varying length. The vectors from the two encodings are then merged and processed by a Dense layer to make the final prediction.

To encode our image features we will make use of transfer learning. There are a lot of models that we can use like VGG-16, InceptionV3, ResNet, etc.

To encode our text sequence we will map every word to a 200-dimensional vector using a pre-trained GloVe model. This mapping is done in a separate layer after the input layer, called the embedding layer.
To generate the caption we will use two popular methods, Greedy Search and Beam Search, which pick the words that best describe the image.
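
To make the difference between the two decoding strategies concrete, here is a minimal sketch on a toy next-word table. The `NEXT_WORD_PROBS` table and both helper functions are hypothetical stand-ins for the trained model and are not part of this notebook's pipeline (the `generate_desc()` function defined later follows the greedy strategy).

```python
import math

# Hypothetical toy "model": maps the last word to next-word probabilities.
NEXT_WORD_PROBS = {
    "startseq": {"a": 0.6, "dog": 0.4},
    "a": {"dog": 0.55, "man": 0.45},
    "dog": {"runs": 0.9, "endseq": 0.1},
    "man": {"endseq": 1.0},
    "runs": {"endseq": 1.0},
}

def greedy_search(max_len=5):
    """Always append the single most probable next word."""
    words = ["startseq"]
    for _ in range(max_len):
        probs = NEXT_WORD_PROBS[words[-1]]
        words.append(max(probs, key=probs.get))
        if words[-1] == "endseq":
            break
    return words

def beam_search(beam_width=2, max_len=5):
    """Keep the beam_width best partial captions by total log-probability."""
    beams = [(0.0, ["startseq"])]
    for _ in range(max_len):
        candidates = []
        for score, words in beams:
            if words[-1] == "endseq":
                # Finished captions are carried over unchanged
                candidates.append((score, words))
                continue
            for word, p in NEXT_WORD_PROBS[words[-1]].items():
                candidates.append((score + math.log(p), words + [word]))
        beams = sorted(candidates, key=lambda c: c[0], reverse=True)[:beam_width]
        if all(words[-1] == "endseq" for _, words in beams):
            break
    return beams[0][1]

print(" ".join(greedy_search()))
print(" ".join(beam_search()))
```

On this toy table the greedy caption commits to "a" first and ends with a total probability of about 0.30, while beam search keeps the "dog" branch alive and finds a caption with probability 0.36.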

# Load Libraries

In [7]:
import string
import numpy as np
import PIL.Image
import glob

from os import listdir
from pickle import dump, load

from numpy import array
from numpy import argmax
from tqdm.notebook import tqdm
from tensorflow.keras.applications.vgg16 import VGG16
from tensorflow.keras.utils import load_img, img_to_array
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.models import Model, load_model
from tensorflow.keras.layers import Input, Dense, LSTM, Embedding, Dropout, add
from tensorflow.keras.callbacks import ModelCheckpoint
from tensorflow.keras.preprocessing.sequence import pad_sequences

from nltk.translate.bleu_score import corpus_bleu

from tensorflow.keras.applications.inception_v3 import InceptionV3
from tensorflow.keras.applications.resnet50 import ResNet50, preprocess_input
from tensorflow.keras.applications.imagenet_utils import decode_predictions
from tensorflow.keras.preprocessing import image, sequence

import matplotlib.pyplot as plt

In [8]:
from google.colab import drive
drive.mount('/content/drive/', force_remount = True)

Mounted at /content/drive/


# Prepare Photo Data

In the Flickr8k dataset, each image is associated with five different captions that describe the entities and events depicted in it. By associating each image with multiple, independently produced sentences, the dataset captures some of the linguistic variety that can be used to describe the same image.

Flickr8k is a good starting dataset as it is small in size and can be trained easily on low-end laptops/desktops using a CPU.
Our dataset structure is as follows:-

- Flickr8k/
    - Flicker8k_Dataset/: contains the roughly 8,000 images
    - Flickr8k_text/
        - Flickr8k.token.txt: contains each image id along with its 5 captions
        - Flickr_8k.trainImages.txt: contains the training image ids
        - Flickr_8k.testImages.txt: contains the test image ids

In [9]:
# Defining the directory we are using
directory = '/content/drive/My Drive/Kaggle/ImageCaptioningGenerator/input/flickr8k_dataset/Flicker8k_Dataset'

# Which image feature extractor?

In [10]:
model = VGG16()

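# Dropping the final classification layer, since we only need the features:
# layers[-2] is the last fully connected layer before the softmax output
# (calling model.layers.pop() does not change the model, so it is omitted)
model = Model(inputs=model.inputs, outputs=model.layers[-2].output)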

#Total params: 134,260,544

Downloading data from https://storage.googleapis.com/tensorflow/keras-applications/vgg16/vgg16_weights_tf_dim_ordering_tf_kernels.h5


In [11]:
model = InceptionV3(weights='imagenet')
model = Model(model.input, model.layers[-2].output)

#Total params: 21,802,784

Downloading data from https://storage.googleapis.com/tensorflow/keras-applications/inception_v3/inception_v3_weights_tf_dim_ordering_tf_kernels.h5


In [12]:
model = ResNet50(include_top=False,weights='imagenet',input_shape=(224,224,3),pooling='avg')
#Total params: 23,587,712

Downloading data from https://storage.googleapis.com/tensorflow/keras-applications/resnet/resnet50_weights_tf_dim_ordering_tf_kernels_notop.h5


* We will use the ResNet50 model: it performs well and has far fewer parameters than VGG16 (about 23.6 million versus 134 million).
* Since we are using ResNet50, we need to pre-process our input before feeding it into the model. Hence we define a preprocessing function that resizes the images to 224 x 224, and we pass the result to the preprocess_input() function of Keras.

In [13]:
def preprocessing(img_path) :
    # Loading an image and resizing it to 224 x 224 (target_size takes height and width only)
    im = image.load_img(img_path, target_size=(224, 224))
    # Converting the image pixels into a numpy array
    im = image.img_to_array(im)
    # Reshaping data for the model
    im = im.reshape((1, im.shape[0], im.shape[1], im.shape[2]))
    return im

In [14]:
# Extract features from each photo in the directory
def extract_features(directory):

    # Extracting features from each photo and storing it in a dictionary 
    features = dict()

    for name in tqdm(listdir(directory)):

        # Defining the path of the image 
        filename = directory + '/' + name
        
        # Loading an image and converting it into size 224 * 224
        image = preprocessing(filename)

        # Preprocessing the images for the model
        # The preprocess_input function adapts the image to the format the model requires.
        image = preprocess_input(image)

        # Getting features of an image
        feature = model.predict(image, verbose=0)
        
        # Getting the image name
        image_id = name.split('.')[0]

        # Storing the feature corresponding to the image in the dictionary
        features[image_id] = feature
        
    return features

In [15]:
#features = extract_features(directory)

In [16]:
# store the features
#dump(features, open('features.pkl', 'wb'))

# Preparing Text Data

In [17]:
# Loading the file containing all the descriptions into memory

def load_doc(filename):
    #Opening the file as read only
    file = open(filename, 'r')

    #Reading all text and storing it.
    text = file.read()

    #Closing the file
    file.close()

    return text

In [18]:
def photo_to_description_mapping(descriptions):
    
    # Dictionary to store the mapping of photo identifiers to descriptions
    description_mapping = dict()
    
    # Iterating through each line of the descriptions
    for line in descriptions.split('\n'):
        
        # Splitting the lines by white space
        words = line.split()
        
        # Skipping the lines with length less than 2
        if len(line)<2:
            continue
        
        # The first word is the image_id and the rest are the part of the description of that image
        image_id, image_description = words[0], words[1:]
        
        # Retaining only the name of the image and removing the extension from it
        image_id = image_id.split('.')[0]
        
        # image_description is a list of words, so we join it back into a single string
        image_description = ' '.join(image_description)
        
        # There are multiple descriptions per image, 
        # hence, corresponding to every image identifier in the dictionary, there is a list of description
        # if the list does not exist then we need to create it
        
        if image_id not in description_mapping:
            description_mapping[image_id] = list()
        
        # Now storing the descriptions in the mapping
        description_mapping[image_id].append(image_description)
    
    return description_mapping

In [19]:
def clean_descriptions(description_mapping):
    
    # Preparing a translation table for removing all the punctuation
    table = str.maketrans('','', string.punctuation)
    # Traversing through the mapping we created
    for key, descriptions in description_mapping.items():
        for i in range(len(descriptions)):
            description = descriptions[i]
            description = description.split()
            
            # Converting all the words to lower case
            description = [word.lower() for word in description]
            
            # Removing the punctuation using the translation table we made
            description = [word.translate(table) for word in description]
            
            # Removing the words with length =1
            description = [word for word in description if len(word)>1]
            
            # Removing all words with number in them
            description = [word for word in description if word.isalpha()]

            # Converting the description back to string and overwriting in the descriptions list
            descriptions[i] = ' '.join(description)

Ideally, we want a vocabulary that is both expressive and as small as possible. A smaller vocabulary will result in a smaller model that will train faster.

For reference, we can transform the clean descriptions into a set and print its size to get an idea of the size of our dataset vocabulary.

In [20]:
# Converting the loaded descriptions into a vocabulary of words

def to_vocabulary(descriptions):
    
    # Build a set of all the unique words across the descriptions
    all_desc = set()
    
    for key in descriptions.keys():
        for d in descriptions[key]:
            all_desc.update(d.split())
    
    return all_desc

In [21]:
# save descriptions to file, one per line
def save_descriptions(descriptions, filename):
    lines = list()
    for key, desc_list in descriptions.items():
        for desc in desc_list:
            lines.append(key + ' ' + desc)
    data = '\n'.join(lines)
    file = open(filename, 'w')
    file.write(data)
    file.close()

In [22]:
filename = "/content/drive/My Drive/Kaggle/ImageCaptioningGenerator/input/Flickr8k_text/Flickr8k.token.txt"

# Loading descriptions
doc = load_doc(filename)

# Parsing descriptions
descriptions = photo_to_description_mapping(doc)
print('Loaded: %d ' % len(descriptions))

# Cleaning the descriptions
clean_descriptions(descriptions)

# Summarizing the vocabulary
vocabulary = to_vocabulary(descriptions)
print('Vocabulary Size: %d' % len(vocabulary))

# Saving to the file
save_descriptions(descriptions, 'descriptions.txt')

Loaded: 8092 
Vocabulary Size: 8763


The vocabulary above contains all the unique words present across roughly 8,000 x 5 (i.e. about 40,000) image captions in the data set: 8,763 unique words in total.

In [23]:
# Function for loading a file into memory and returning text from it
def load_file(filename):
    file = open(filename, 'r')
    text = file.read()
    file.close()
    return text

# Function for loading a pre-defined list of photo identifiers
def load_photo_identifiers(filename):
    
    # Loading the file containing the list of photo identifier
    file = load_file(filename)
    
    # Creating a list for storing the identifiers
    photos = list()
    
    # Traversing the file one line at a time
    for line in file.split('\n'):
        if len(line) < 1:
            continue
        
        # Image name contains the extension as well but we need just the name
        identifier = line.split('.')[0]

        # Adding it to the list of photos
        photos.append(identifier)
        
    # Returning the set of photos created
    return set(photos)


# loading the cleaned descriptions that we created earlier
# we will only be loading the descriptions of the images that we will use for training
# hence we need to pass the set of train photos that the above function will be returning

def load_clean_descriptions(filename, photos):
    
    #loading the cleaned description file
    file = load_file(filename)
    
    #creating a dictionary of descriptions for storing the photo to description mapping of train images
    descriptions = dict()
    
    #traversing the file line by line
    for line in file.split('\n'):
        # splitting the line at white spaces
        words = line.split()

        # skipping empty lines, which would otherwise raise an IndexError below
        if len(words) < 1:
            continue

        # the first word will be the image name and the rest will be the description of that particular image
        image_id, image_description = words[0], words[1:]
        
        # we want to load only those description which corresponds to the set of photos we provided as argument
        if image_id in photos:
            #creating list of description if needed
            if image_id not in descriptions:
                descriptions[image_id] = list()
            
            #the model we will develop will generate a caption given a photo, 
            #and the caption will be generated one word at a time. 
            #The sequence of previously generated words will be provided as input. 
            #Therefore, we will need a ‘first word’ to kick-off the generation process 
            #and a ‘last word‘ to signal the end of the caption.
            #we will use 'startseq' and 'endseq' for this purpose
            #also we have to convert image description back to string
            
            desc = 'startseq ' + ' '.join(image_description) + ' endseq'
            descriptions[image_id].append(desc)
            
    return descriptions

# function to load the photo features created using the model
def load_photo_features(filename, photos):
    
    #this will load the entire features
    all_features = load(open(filename, 'rb'))
    #we are interested in loading the features of the required photos only
    features = {k: all_features[k] for k in photos}
    
    return features

In [24]:
filename = '/content/drive/My Drive/Kaggle/ImageCaptioningGenerator/input/Flickr8k_text/Flickr_8k.trainImages.txt'

train = load_photo_identifiers(filename)
print('Dataset: ',len(train))

train_descriptions = load_clean_descriptions('descriptions.txt', train)
print('Descriptions: train=', len(train_descriptions))

train_features = load_photo_features('/content/drive/My Drive/Kaggle/ImageCaptioningGenerator/features.pkl', train)
print('Photos: train=', len(train_features))

Dataset:  6000
Descriptions: train= 6000
Photos: train= 6000


To make our model more robust we will reduce our vocabulary to only those words which occur at least 10 times in the training captions.

In [25]:
word_count_threshold = 10
word_counts = {}
nsents = 0
for key, sents in train_descriptions.items():
    for sent in sents:
        nsents += 1
        for w in sent.split(' '):
            word_counts[w] = word_counts.get(w, 0) + 1
vocab = [w for w in word_counts if word_counts[w] >= word_count_threshold]

print('Vocabulary = %d' % (len(vocab)))

Vocabulary = 1651


In [26]:
# Dumping the train image identifiers in a pickle file for further use
dump(train, open('train.pkl', 'wb'))

# Dumping the train descriptions in a pickle file for further use
dump(train_descriptions, open('train_descriptions.pkl', 'wb'))

# Dumping the train photo features in a pickle file for further use
dump(train_features, open('train_features.pkl', 'wb'))

The description text needs to be encoded as numbers before it can be presented to the model as an input or compared to the model's predictions.

The first step in encoding the data is to create a consistent mapping from words to unique integer values. Keras provides the Tokenizer class that can learn this mapping from the loaded description data.
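
As a rough illustration of what such a word-to-integer mapping looks like, here is a pure-Python sketch. The helpers `build_word_index` and `texts_to_ids` are hypothetical and only approximate the idea; the real Tokenizer class also handles punctuation filtering and other options. Index 0 is left unused so that it can serve as the padding value.

```python
from collections import Counter

def build_word_index(texts):
    """Assign integer ids starting at 1, most frequent words first."""
    counts = Counter(word for text in texts for word in text.lower().split())
    # sorted() is stable, so ties keep their order of first appearance
    ordered = sorted(counts, key=lambda w: -counts[w])
    return {word: i for i, word in enumerate(ordered, start=1)}

def texts_to_ids(text, word_index):
    """Encode a text as ids, silently dropping words that are not in the index."""
    return [word_index[w] for w in text.lower().split() if w in word_index]

word_index = build_word_index(["startseq dog runs endseq",
                               "startseq dog sits endseq"])
print(word_index)
print(texts_to_ids("startseq cat runs endseq", word_index))
```

Note that the unknown word "cat" simply disappears from the encoded sequence, which is also what happens to words below the frequency threshold in this notebook.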

The create_tokenizer() function below fits a Tokenizer on the reduced vocabulary built above. The to_lines() function, defined a little further down, converts the dictionary of descriptions into a list of strings.

In [27]:
# Given the descriptions, fit a tokenizer

# TOKENIZER CLASS:
# This class allows to vectorize a text corpus, 
# by turning each text into either a sequence of integers 
# (each integer being the index of a token in a dictionary) 
# or, into a vector where the coefficient for each token could be binary, based on word count, based on tf-idf...

def create_tokenizer():
    tokenizer = Tokenizer()
    tokenizer.fit_on_texts(vocab)
    return tokenizer

In [28]:
tokenizer = create_tokenizer()

vocab_size = len(tokenizer.word_index) + 1
print('Vocabulary Size: ', vocab_size)

Vocabulary Size:  1652


We also need to find out what the max length of a caption can be since we cannot have captions of arbitrary length.

In [29]:
# convert a dictionary of clean descriptions to a list of descriptions
def to_lines(descriptions):
    all_desc = list()
    for key in descriptions.keys():
        [all_desc.append(d) for d in descriptions[key]]
    return all_desc

#calculated the length of description with most words
def max_lengthTEMP(descriptions):
    lines = to_lines(descriptions)
    return max(len(d.split()) for d in lines)

In [30]:
max_lengthTEMP(descriptions)

32

# Defining the Model

Each description will be split into words. The model will be provided one word and the photo and generate the next word. Then the first two words of the description will be provided to the model as input with the image to generate the next word. This is how the model will be trained.
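
The way one caption is unrolled into several training examples can be sketched as follows. The helper `caption_to_pairs` and the example caption are hypothetical; in the notebook the same split is done on integer-encoded sequences inside create_sequences().

```python
def caption_to_pairs(caption):
    """Split one caption into (input words, next word) training pairs."""
    words = caption.split()
    return [(words[:i], words[i]) for i in range(1, len(words))]

pairs = caption_to_pairs("startseq dog runs on grass endseq")
for in_words, next_word in pairs:
    # Each pair is presented to the model together with the same photo features
    print(" ".join(in_words), "->", next_word)
```

A caption of six tokens therefore yields five training examples, all sharing the same image.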

The model is in three parts:

1. Photo Feature Extractor: This is a ResNet50 model pre-trained on the ImageNet dataset. We have pre-processed the photos with this model (without its classification layer) and will use the extracted features as input.

2. Sequence Processor: This is a word embedding layer for handling the text input, followed by a Long Short-Term Memory (LSTM) recurrent neural network layer.

3. Decoder: Both the feature extractor and sequence processor output a fixed-length vector. These are merged together and processed by a Dense layer to make a final prediction.

The Photo Feature Extractor model expects input photo features to be a vector of 2,048 elements. These are processed by a Dense layer to produce a 256-element representation of the photo.

The Sequence Processor model expects input sequences with a pre-defined length (34 words) which are fed into an Embedding layer that uses a mask to ignore padded values. This is followed by an LSTM layer with 256 memory units.
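
The padding step can be pictured with a small pure-Python sketch. The helper `pad_left` is hypothetical; it mirrors the default behaviour of Keras's pad_sequences, which pads (and truncates) at the front of the sequence. The pad value 0 is the index that the Embedding layer masks when mask_zero=True.

```python
def pad_left(seq, max_length, pad_value=0):
    """Left-pad (or left-truncate) a list of word ids to exactly max_length."""
    seq = list(seq)[-max_length:]  # keep only the last max_length ids
    return [pad_value] * (max_length - len(seq)) + seq

print(pad_left([5, 12, 7], 6))    # short sequence: zeros are added in front
print(pad_left([1, 2, 3, 4], 2))  # long sequence: the oldest ids are dropped
```

Padding at the front keeps the most recent words next to the end of the sequence, which is where the LSTM produces its output.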

Both the input models produce a 256 element vector. Further, both input models use regularization in the form of 50% dropout. This is to reduce overfitting the training dataset, as this model configuration learns very fast.

The Decoder model merges the vectors from both input models using an addition operation. This is then fed to a Dense 256 neuron layer and then to a final output Dense layer that makes a softmax prediction over the entire output vocabulary for the next word in the sequence.


We have to train our model on a lot of images; each image is represented by a feature vector of length 2,048, and each caption is also represented as numbers. Holding all of the resulting training sequences in memory at once is not practical, so we will use a generator that yields batches.

The generator will yield the input and output sequence.

In [31]:
def create_sequences(tokenizer, max_length, desc_list, photo):
    X1, X2, y = list(), list(), list()
    # walk through each description for the image
    for desc in desc_list:
        # encode the sequence
        seq = tokenizer.texts_to_sequences([desc])[0]
        # split one sequence into multiple X,y pairs
        for i in range(1, len(seq)):
            # split into input and output pair
            in_seq, out_seq = seq[:i], seq[i]
            # pad input sequence
            in_seq = pad_sequences([in_seq], maxlen=max_length)[0]
            # encode output sequence
            out_seq = to_categorical([out_seq], num_classes=vocab_size)[0]
            # store
            X1.append(photo)
            X2.append(in_seq)
            y.append(out_seq)
    return array(X1), array(X2), array(y)

In [32]:
def data_generator(descriptions, photos, tokenizer, max_length):
    while 1:
        for key, description_list in descriptions.items():
            #retrieve photo features
            photo = photos[key][0]
            input_image, input_sequence, output_word = create_sequences(tokenizer, max_length, description_list, photo)
            # Keras expects each generated batch as an (inputs, targets) tuple
            yield ([input_image, input_sequence], output_word)

In [33]:
# define the captioning model
def define_model(vocab_size, max_length):
    
    # feature extractor model
    inputs1 = Input(shape=(2048,))
    fe1 = Dropout(0.5)(inputs1)
    fe2 = Dense(256, activation='relu')(fe1)

    # sequence model
    inputs2 = Input(shape=(max_length,))
    se1 = Embedding(vocab_size, 256, mask_zero=True)(inputs2)
    se2 = Dropout(0.5)(se1)
    se3 = LSTM(256)(se2)

    # decoder model
    decoder1 = add([fe2, se3])
    decoder2 = Dense(256, activation='relu')(decoder1)
    outputs = Dense(vocab_size, activation='softmax')(decoder2)

    # tie it together [image, seq] [word]
    model = Model(inputs=[inputs1, inputs2], outputs=outputs)
    model.compile(loss='categorical_crossentropy', optimizer='adam')
    
    # summarize model (summary() prints by itself and returns None)
    model.summary()
    
    return model

![Getting Started](./image.gif)

In [34]:
filename = '/content/drive/My Drive/Kaggle/ImageCaptioningGenerator/input/Flickr8k_text/Flickr_8k.trainImages.txt'
train = load(open('train.pkl', 'rb'))
print('Dataset: ', len(train))
train_descriptions = load(open('train_descriptions.pkl', 'rb'))
print('Descriptions: train=', len(train_descriptions))
train_features = load(open('train_features.pkl', 'rb'))
print('Photos: train=', len(train_features))
tokenizer = create_tokenizer()
vocab_size = len(tokenizer.word_index) + 1
print('Vocabulary Size:', vocab_size)
max_length = max_lengthTEMP(train_descriptions)
print('Description Length: ', max_length)

Dataset:  6000
Descriptions: train= 6000
Photos: train= 6000
Vocabulary Size: 1652
Description Length:  34


In [None]:
# The deprecated `period` argument is dropped; save_freq='epoch' already checks after every epoch
checkpoint = ModelCheckpoint(filepath='/content/drive/My Drive/Kaggle/ImageCaptioningGenerator/model_{epoch:02d}.h5', 
                                    save_freq='epoch',
                                    monitor='loss',
                                    mode='min',
                                    save_best_only=True)

model = define_model(vocab_size, max_length)



Model: "model_2"
__________________________________________________________________________________________________
 Layer (type)                   Output Shape         Param #     Connected to                     
 input_5 (InputLayer)           [(None, 34)]         0           []                               
                                                                                                  
 input_4 (InputLayer)           [(None, 2048)]       0           []                               
                                                                                                  
 embedding (Embedding)          (None, 34, 256)      422912      ['input_5[0][0]']                
                                                                                                  
 dropout (Dropout)              (None, 2048)         0           ['input_4[0][0]']                
                                                                                            

In [None]:
epochs = 10
steps = len(train_descriptions)
generator = data_generator(train_descriptions, train_features, tokenizer, max_length)
history = model.fit(generator, epochs=epochs, steps_per_epoch=steps, verbose=1, callbacks=[checkpoint])
model.save('model.h5')

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


In [None]:
def plot_train (history):
    # list all data in history
    print(history.history.keys())

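    # summarize history for loss
    fig, ax = plt.subplots()
    ax.plot(history.history['loss'], label='train')
    # val_loss exists only if validation data was passed to fit()
    if 'val_loss' in history.history:
        ax.plot(history.history['val_loss'], label='validation')
    ax.set_title('model loss')
    ax.set_ylabel('loss')
    ax.set_xlabel('epoch')
    ax.legend(loc='upper left')
    plt.show()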
plot_train(history)

In [None]:
model = load_model('/content/drive/My Drive/Kaggle/ImageCaptioningGenerator/model_10.h5', compile=False)
model.compile(loss='categorical_crossentropy', optimizer='adam')

# Model Evaluation

In [35]:
#this function maps an integer to a word
def word_for_id(integer, tokenizer):
    for word, index in tokenizer.word_index.items():
        if index == integer:
            return word
    return None

#The function below generates a textual description given a trained model, 
#and a given prepared photo as input. It calls the function word_for_id() 
#in order to map an integer prediction back to a word.
def generate_desc(model, tokenizer, photo, max_length):
    #start the generation process
    in_text = 'startseq'
    #iterating up to max_length, since a description cannot be longer than that
    for i in range(max_length):
        #integer-encoding the input sequence
        sequence = tokenizer.texts_to_sequences([in_text])[0]
        #padding the input
        sequence = pad_sequences([sequence], maxlen=max_length)
        #predicting the next word
        #the predict function returns a probability for every word in the vocabulary
        prob = model.predict([photo,sequence], verbose=0)
        #taking the index of the most probable word
        prob = argmax(prob)
        #calling the word_for_id function in order to map integer to word
        word = word_for_id(prob, tokenizer)
        #breaking if word cannot be mapped
        if word is None:
            break
        #appending as input
        in_text += ' ' + word

        #break if end is predicted
        if word == 'endseq':
            break
    return in_text

BLEU, or the Bilingual Evaluation Understudy, is a score for comparing a candidate translation of text to one or more reference translations. The approach works by counting matching n-grams in the candidate translation to n-grams in the reference text. It is common to report the cumulative BLEU-1 to BLEU-4 scores when describing the skill of a text generation system.
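
The n-gram counting at the heart of BLEU can be sketched in a few lines of pure Python. The helpers `ngrams` and `modified_precision` are hypothetical names, and the reference caption is made up for illustration. This shows only the clipped n-gram precision; the full BLEU score additionally combines the precisions for several values of n and applies a brevity penalty, which corpus_bleu from NLTK takes care of.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(references, candidate, n):
    """Fraction of candidate n-grams found in a reference, with clipped counts."""
    cand_counts = Counter(ngrams(candidate, n))
    if not cand_counts:
        return 0.0
    # For each n-gram, the highest count seen in any single reference
    max_ref_counts = Counter()
    for ref in references:
        for gram, count in Counter(ngrams(ref, n)).items():
            max_ref_counts[gram] = max(max_ref_counts[gram], count)
    clipped = sum(min(count, max_ref_counts[gram])
                  for gram, count in cand_counts.items())
    return clipped / sum(cand_counts.values())

references = [["a", "dog", "runs", "through", "the", "grass"]]  # made-up reference
candidate = ["dog", "dog", "running", "through", "the", "the"]
print(modified_precision(references, candidate, 1))
print(modified_precision(references, candidate, 2))
```

Clipping is what stops a caption from scoring well by repeating a correct word: the second "dog" and the second "the" earn no credit here.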

In [None]:
#the below function evaluates the skill of the model
def evaluate_model(model, descriptions, photos, tokenizer, max_length):
    actual, predicted = list(), list()
    for key, desc_list in descriptions.items():
        prediction = generate_desc(model, tokenizer, photos[key], max_length)
        actual_desc = [d.split() for d in desc_list]
        actual.append(actual_desc)
        predicted.append(prediction.split())

    print('BLEU-1: ', corpus_bleu(actual, predicted, weights=(1.0, 0, 0, 0)))
    print('BLEU-2: ', corpus_bleu(actual, predicted, weights=(0.5, 0.5, 0, 0)))
    print('BLEU-3: ', corpus_bleu(actual, predicted, weights=(1/3, 1/3, 1/3, 0)))
    print('BLEU-4: ', corpus_bleu(actual, predicted, weights=(0.25, 0.25, 0.25, 0.25)))
    
def max_length(descriptions):
    lines = to_lines(descriptions)
    return max(len(d.split()) for d in lines)

In [None]:
train_descriptions = load_clean_descriptions('descriptions.txt', train)
print('Descriptions: train=', len(train_descriptions))
tokenizer = create_tokenizer()
max_length = max_lengthTEMP(train_descriptions)
print('Description Length: ', max_length)
filename = '/content/drive/My Drive/Kaggle/ImageCaptioningGenerator/input/Flickr8k_text/Flickr_8k.testImages.txt'
test = load_photo_identifiers(filename)
print('Dataset: ', len(test))
test_descriptions = load_clean_descriptions('descriptions.txt', test)
print('Descriptions: test=', len(test_descriptions))
test_features = load_photo_features('/content/drive/My Drive/Kaggle/ImageCaptioningGenerator/features.pkl', test)
print('Photos: test=', len(test_features))
filename = '/content/drive/My Drive/Kaggle/ImageCaptioningGenerator/model_10.h5'
model = load_model(filename)
evaluate_model(model, test_descriptions, test_features, tokenizer, max_length)

# Let's generate a text

In [67]:
def extract_features(filename):
    model = ResNet50(include_top=False,weights='imagenet',input_shape=(224,224,3),pooling='avg')
    image = load_img(filename, target_size=(224, 224))
    image = img_to_array(image)
    image = image.reshape((1, image.shape[0], image.shape[1], image.shape[2]))
    image = preprocess_input(image)
    feature = model.predict(image, verbose=0)
    return feature

In [70]:
model = load_model('/content/drive/My Drive/Kaggle/ImageCaptioningGenerator/model_glove_12.h5')
path = '/content/drive/My Drive/Kaggle/ImageCaptioningGenerator/input/Flicker8k_Dataset/96973080_783e375945.jpg'
photo = extract_features(path)
max_length = 34
description = generate_desc(model, tokenizer, photo, max_length)
print(description)



startseq dog dog running through the the endseq


In [None]:
import googletrans  
from gtts import gTTS
import IPython.display as ipd

In [None]:
from googletrans import Translator
translator = Translator()
translated = translator.translate(description, dest='ar')
print(translated.text)

In [None]:
# Speaking the Arabic translation, so the language code must be 'ar'
tts = gTTS(translated.text, lang='ar')
tts.save('test.mp3')

In [None]:
audio_path="test.mp3"

ipd.Audio(audio_path, autoplay=True)

# Using GloVe as a pre-trained embedding model

The basic premise behind GloVe is that we can derive semantic relationships between words from the co-occurrence matrix. For our model, we will map every word of a caption (up to max_length words) to a 200-dimensional vector using GloVe.

The advantage of using GloVe over Word2Vec is that GloVe does not rely only on the local context of words; it also incorporates global word co-occurrence statistics to obtain word vectors.
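
To give a feel for what a co-occurrence matrix is, here is a small sketch that counts how often two words appear within a fixed window of each other. The function `cooccurrence_counts`, the window size, and the toy corpus are all assumptions made for illustration. This only builds the counts; training GloVe vectors from such counts is a separate step that is not shown, since we use the pre-trained vectors.

```python
from collections import Counter

def cooccurrence_counts(sentences, window=2):
    """Count ordered word pairs that appear within `window` positions of each other."""
    counts = Counter()
    for sentence in sentences:
        words = sentence.split()
        for i, word in enumerate(words):
            lo = max(0, i - window)
            hi = min(len(words), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[(word, words[j])] += 1
    return counts

toy_corpus = ["dog runs on grass", "dog plays on grass", "man runs on road"]
counts = cooccurrence_counts(toy_corpus)
print(counts[("dog", "on")])
print(counts[("dog", "road")])
```

Words that share many neighbours, such as "dog" and "man" in this toy corpus, end up with similar rows in the matrix, which is the signal that embedding methods of this kind exploit.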

In [None]:
ixtoword = {}
wordtoix = {}
ix = 1
for w in vocab:
    wordtoix[w] = ix
    ixtoword[ix] = w
    ix += 1

vocab_size = len(ixtoword) + 1

In [36]:
glove_path = '/content/drive/My Drive/Kaggle/ImageCaptioningGenerator/input/glove6b'
embeddings_index = {} 
# Opening in text mode so that the words are str and match the tokenizer's keys
# (in 'rb' mode they would be bytes and no lookup would ever succeed)
with open(glove_path + '/glove.6B.200d.txt', encoding='utf-8') as f:
  for line in f:
      values = line.split()
      word = values[0]
      coefs = np.asarray(values[1:], dtype='float32')
      embeddings_index[word] = coefs

In [40]:
embedding_dim = 200
embedding_matrix = np.zeros((vocab_size, embedding_dim))
for word, i in tokenizer.word_index.items():
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector

In [60]:
inputs1 = Input(shape=(2048,))
fe1 = Dropout(0.5)(inputs1)
fe2 = Dense(256, activation='relu')(fe1)

inputs2 = Input(shape=(max_length,))
se1 = Embedding(vocab_size, embedding_dim, mask_zero=True)(inputs2)
se2 = Dropout(0.5)(se1)
se3 = LSTM(256)(se2)

decoder1 = add([fe2, se3])
decoder2 = Dense(256, activation='relu')(decoder1)
outputs = Dense(vocab_size, activation='softmax')(decoder2)

model = Model(inputs=[inputs1, inputs2], outputs=outputs)
model.summary()

Model: "model_4"
__________________________________________________________________________________________________
 Layer (type)                   Output Shape         Param #     Connected to                     
 input_9 (InputLayer)           [(None, 34)]         0           []                               
                                                                                                  
 input_8 (InputLayer)           [(None, 2048)]       0           []                               
                                                                                                  
 embedding_2 (Embedding)        (None, 34, 200)      330400      ['input_9[0][0]']                
                                                                                                  
 dropout_4 (Dropout)            (None, 2048)         0           ['input_8[0][0]']                
                                                                                            

Before training, we load the pre-trained GloVe vectors into the embedding layer (model.layers[2], listed as embedding_2 in the summary above) and freeze it, so that these weights are not updated during training.

In [61]:
model.layers[2].set_weights([embedding_matrix])
model.layers[2].trainable = False

In [62]:
model.compile(loss='categorical_crossentropy', optimizer='adam')

In [63]:
def data_generator(descriptions, photos, tokenizer, max_length, num_photos_per_batch):
    X1, X2, y = list(), list(), list()
    n=0
        # loop forever over the images
    while 1:
        for key, desc_list in descriptions.items():
            n+=1
            # retrieve the photo feature
            photo = photos[key][0]
            for desc in desc_list:
                # encode the sequence
                seq = tokenizer.texts_to_sequences([desc])[0]
                # split one sequence into multiple X, y pairs
                for i in range(1, len(seq)):
                    # split into input and output pair
                    in_seq, out_seq = seq[:i], seq[i]
                    # pad input sequence
                    in_seq = pad_sequences([in_seq], maxlen=max_length)[0]
                    # encode output sequence
                    out_seq = to_categorical([out_seq], num_classes=vocab_size)[0]
                    # store
                    X1.append(photo)
                    X2.append(in_seq)
                    y.append(out_seq)

            if n==num_photos_per_batch:
                yield ([array(X1), array(X2)], array(y))
                X1, X2, y = list(), list(), list()
                n=0
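
The inner loop of the generator turns one caption into several training samples: every prefix of the caption becomes an input, and the word that follows it becomes the target. A minimal sketch of that split, using a hypothetical helper caption_to_pairs and made-up word ids, and assuming the default left ('pre') padding and truncation of pad_sequences:

```python
def caption_to_pairs(seq, max_length):
    """Split one encoded caption into (left-padded prefix, next word id) pairs."""
    pairs = []
    for i in range(1, len(seq)):
        # Keep at most max_length ids, mirroring the default 'pre' truncation.
        prefix, target = seq[:i][-max_length:], seq[i]
        # Left-pad with zeros, mirroring the default 'pre' padding.
        padded = [0] * (max_length - len(prefix)) + prefix
        pairs.append((padded, target))
    return pairs

# Hypothetical ids: 1 = startseq, 5 = dog, 9 = runs, 2 = endseq
toy_pairs = caption_to_pairs([1, 5, 9, 2], max_length=5)
for padded, target in toy_pairs:
    print(padded, "->", target)
```

A caption of n tokens therefore yields n - 1 samples, which is why even a batch of a few photos with several captions each produces many rows.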

In [None]:
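epochs = 30
batch_size = 3  # number of photos per batch
steps = len(train_descriptions) // batch_size
# Save a checkpoint at the end of an epoch whenever the training loss improves.
# The deprecated `period` argument is dropped; `save_freq` controls how often checkpoints are written.
checkpoint = ModelCheckpoint(
    filepath='/content/drive/My Drive/Kaggle/ImageCaptioningGenerator/model_glove1_{epoch:02d}.h5',
    save_freq='epoch',
    monitor='loss',
    mode='min',
    save_best_only=True)

generator = data_generator(train_descriptions, train_features, tokenizer, max_length, batch_size)
# `callbacks` expects a list of callback objects
model.fit(generator, epochs=epochs, steps_per_epoch=steps, verbose=1, callbacks=[checkpoint])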



Epoch 1/30
Epoch 2/30
Epoch 3/30
Epoch 4/30
 448/2000 [=====>........................] - ETA: 10:03 - loss: 4.0983

In [66]:
model = load_model('/content/drive/My Drive/Kaggle/ImageCaptioningGenerator/model_glove_12.h5', compile=False)

# Greedy and Beam Search

The model outputs a vector of length vocab_size: a probability distribution over all the words in the vocabulary. At each step we greedily pick the word with the highest probability as the next word. This method is called Greedy Search.

In [None]:
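def greedySearch(photo):
    # photo: image feature array of shape (1, 2048)
    in_text = 'startseq'
    for i in range(max_length):
        # encode the caption so far and pad it to the model's input length
        sequence = tokenizer.texts_to_sequences([in_text])[0]
        sequence = pad_sequences([sequence], maxlen=max_length)
        yhat = model.predict([photo, sequence], verbose=0)
        # index of the most probable next word
        yhat = int(np.argmax(yhat))
        # tokenizer.index_word maps an integer id back to its word
        word = tokenizer.index_word.get(yhat)
        # stop on the end token (or on an id with no word, such as padding)
        if word is None or word == 'endseq':
            break
        in_text += ' ' + word

    # drop the leading 'startseq' token
    final = in_text.split()[1:]
    return ' '.join(final)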
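
The greedy loop can be followed by hand on a toy example. In this sketch a hypothetical lookup table, TOY_NEXT_WORD, stands in for model.predict: it gives a made-up probability for each candidate next word given only the last word of the caption. The real model conditions on the image and the whole prefix, so this is only an illustration of the argmax-and-append step.

```python
# Hypothetical stand-in for model.predict (invented probabilities).
TOY_NEXT_WORD = {
    "startseq": {"a": 0.6, "dog": 0.3, "endseq": 0.1},
    "a": {"dog": 0.7, "a": 0.2, "endseq": 0.1},
    "dog": {"endseq": 0.8, "a": 0.1, "dog": 0.1},
}

def toy_greedy(max_length=10):
    words = ["startseq"]
    for _ in range(max_length):
        probs = TOY_NEXT_WORD[words[-1]]
        best = max(probs, key=probs.get)  # argmax over the distribution
        if best == "endseq":
            break
        words.append(best)
    # Drop the leading 'startseq' token.
    return " ".join(words[1:])

print(toy_greedy())
```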

Beam Search keeps the top k partial captions at every step instead of a single one. Each of them is fed back into the model and extended with its k most probable next words, and the resulting candidates are sorted by their score so that only the k best survive. When decoding stops, we take the caption with the highest score and read it until we encounter ‘endseq’ or reach the maximum caption length.

In [None]:
def beam_search_predictions(image, beam_index = 3):
    # tokenizer.word_index maps a word to its integer id
    start = [tokenizer.word_index["startseq"]]
    # each entry is [list of word ids, score]
    start_word = [[start, 0.0]]
    while len(start_word[0][0]) < max_length:
        temp = []
        for s in start_word:
            # pad exactly as in training (pad_sequences pads on the left by default)
            par_caps = pad_sequences([s[0]], maxlen=max_length)
            preds = model.predict([image, par_caps], verbose=0)
            # Getting the top <beam_index>(n) predictions and creating a
            # new list so as to put them via the model again
            word_preds = np.argsort(preds[0])[-beam_index:]
            for w in word_preds:
                next_cap, prob = s[0][:], s[1]
                next_cap.append(int(w))
                # add log-probabilities so the score is the log of the caption's probability
                prob += np.log(preds[0][w] + 1e-12)
                temp.append([next_cap, prob])

        # Sorting according to the scores (ascending)
        start_word = sorted(temp, reverse=False, key=lambda l: l[1])
        # Getting the top words
        start_word = start_word[-beam_index:]

    # the best caption is the last entry after the ascending sort
    start_word = start_word[-1][0]
    # tokenizer.index_word maps an integer id back to its word
    intermediate_caption = [tokenizer.index_word.get(i, '') for i in start_word]
    final_caption = []

    for i in intermediate_caption:
        if i != 'endseq':
            final_caption.append(i)
        else:
            break

    # drop the leading 'startseq' token
    final_caption = ' '.join(final_caption[1:])
    return final_caption
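
A toy example shows why keeping several candidates can help. As a sketch under stated assumptions, the hypothetical table TOY_NEXT_WORD below replaces model.predict with invented probabilities that depend only on the last word. The score of a caption is the sum of the log-probabilities of its words, which is equivalent to multiplying the probabilities, and finished captions are carried forward unchanged.

```python
import math

# Hypothetical stand-in for model.predict (invented probabilities).
TOY_NEXT_WORD = {
    "startseq": {"a": 0.5, "the": 0.4, "endseq": 0.1},
    "a": {"dog": 0.4, "cat": 0.3, "endseq": 0.3},
    "the": {"dog": 0.9, "endseq": 0.1},
    "dog": {"endseq": 1.0},
    "cat": {"endseq": 1.0},
}

def toy_beam_search(beam_width=2, max_length=5):
    beams = [(["startseq"], 0.0)]  # (words so far, summed log-probability)
    for _ in range(max_length):
        candidates = []
        for words, score in beams:
            if words[-1] == "endseq":
                candidates.append((words, score))  # finished captions carry over
                continue
            for word, p in TOY_NEXT_WORD[words[-1]].items():
                candidates.append((words + [word], score + math.log(p)))
        # Keep only the beam_width highest-scoring captions.
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
        if all(words[-1] == "endseq" for words, _ in beams):
            break
    best_words = beams[0][0]
    return " ".join(w for w in best_words if w not in ("startseq", "endseq"))

print("width 1:", toy_beam_search(beam_width=1))
print("width 2:", toy_beam_search(beam_width=2))
```

With a width of 1 the search is greedy and commits to "a" because it is the most probable first word. With a width of 2 it also keeps "the", whose continuation "dog" is so likely that the full caption ends up with a higher overall probability (0.4 × 0.9 = 0.36 versus 0.5 × 0.4 = 0.20).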

# Evaluation

In [None]:
print("Greedy Search:",greedySearch(image))
print("Beam Search, K = 3:",beam_search_predictions(image, beam_index = 3))
print("Beam Search, K = 5:",beam_search_predictions(image, beam_index = 5))
print("Beam Search, K = 7:",beam_search_predictions(image, beam_index = 7))
print("Beam Search, K = 10:",beam_search_predictions(image, beam_index = 10))
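
Printed captions can only be compared by eye. The notebook imports corpus_bleu from nltk for a numeric score; as a simplified, dependency-free stand-in for the idea behind BLEU-1, the sketch below computes clipped unigram precision for one caption. The helper unigram_precision and the example sentences are hypothetical, and this is not the full BLEU metric (no brevity penalty, no higher-order n-grams).

```python
from collections import Counter

def unigram_precision(candidate, references):
    """Clipped unigram precision of one caption against its reference captions."""
    cand_counts = Counter(candidate.split())
    if not cand_counts:
        return 0.0
    # For each word, the most times it appears in any single reference.
    max_ref_counts = Counter()
    for ref in references:
        for word, count in Counter(ref.split()).items():
            max_ref_counts[word] = max(max_ref_counts[word], count)
    # A candidate word only counts as often as some reference contains it.
    clipped = sum(min(count, max_ref_counts[word]) for word, count in cand_counts.items())
    return clipped / sum(cand_counts.values())

toy_refs = ["a dog runs on the grass", "a brown dog is running"]
print(unigram_precision("a dog runs fast", toy_refs))
```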

Things you can implement to improve your model:

- Use a larger dataset, such as MS COCO or Stock3M, which is 26 times larger than MS COCO.
- Implement an attention-based model. Attention mechanisms are becoming increasingly popular in deep learning because they let the model dynamically focus on different parts of the input image while the output sequence is being produced.

https://www.analyticsvidhya.com/blog/2020/11/create-your-own-image-caption-generator-using-keras/