## Columbia University
### ECBM E4040 Neural Networks and Deep Learning. Fall 2020.

# Task 2: Image Captioning

In [None]:
import sys
import os
if os.getcwd() not in sys.path:
    sys.path.append(os.getcwd())

# Load Data

For this task, we will use the **Flickr8k** dataset. 

<span style="color:red">TODO:</span>

Download the dataset in zip format from [here](https://drive.google.com/file/d/1RPYOmaFutLJxrcXr4cfLVQMJ9tKh_ER6/view?usp=sharing) into the assignment's root directory. Do NOT extract the zip file yourself: a cell below contains the code that extracts the data.

<span style="color:red">NOTE:</span>

You will most likely run this task on a GCP VM instance through the Jupyter interface. In that case, you can download the zip file to your local machine and then upload it to the VM instance through the same Jupyter interface.

**Flickr8k.token.txt** - The raw captions of the Flickr8k dataset. The first column is the caption ID, which has the form "image address # caption number".

**Flickr_8k.trainImages.txt** - The training images used in our experiments.

**Flickr_8k.testImages.txt** - The test images used in our experiments.
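
As a minimal sketch of how one line of the caption file could be split into its parts: this assumes each line has the form `image_name#caption_number<TAB>caption`. The tab separator, the example line, and the helper name `parse_token_line` are assumptions for illustration, not the course's `load_descriptions` implementation.

```python
def parse_token_line(line):
    """Split one caption-file line into (image_name, caption_number, caption)."""
    # Assumed layout: "<image_name>#<caption_number>\t<caption text>"
    caption_id, caption = line.rstrip("\n").split("\t", 1)
    image_name, caption_number = caption_id.rsplit("#", 1)
    return image_name, int(caption_number), caption


# Made-up example line, not taken from the dataset
example_line = "example_image.jpg#0\tA dog runs across a field .\n"
image_name, caption_number, caption = parse_token_line(example_line)
print(image_name, caption_number, caption)
```

Splitting on the last `#` keeps the sketch robust if an image name itself happens to contain that character.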

In [None]:
! pip install zipfile37
! pip install pillow

In [None]:
from utils.imgCap import load_images_list
from zipfile import ZipFile

# Extract Data
if not os.path.exists('./Flickr_Data.zip'):
    raise Exception('Dataset not found. Please read instructions above this cell and download dataset.')

if not os.path.exists('./Flickr_Data'):
    print("Extracting data ...")
    # Use a context manager so the archive is closed after extraction
    with ZipFile('./Flickr_Data.zip', 'r') as zip_file:
        zip_file.extractall('./')

# File containing the captions of each image
descriptions_file = './Flickr_Data/Flickr8k_text/Flickr8k.token.txt'

# Files with the names of the train and test images
train_image_list_path = './Flickr_Data/Flickr8k_text/Flickr_8k.trainImages.txt'
test_image_list_path = './Flickr_Data/Flickr8k_text/Flickr_8k.testImages.txt'

train_image_list = load_images_list(train_image_list_path)
test_image_list = load_images_list(test_image_list_path)

print('Total train images:',len(train_image_list))
print('Total test images:', len(test_image_list))

# Image Encoding
Extract features from the images using InceptionV3 with ImageNet weights and store them as numpy arrays. The **second-to-last layer** has an output of dimension (2048,), which serves as the feature representation of an image. You may use a different network to produce the image encodings, provided that you take care of the input and output dimensions when you change the network.

<span style="color:red">TODO:</span>

1. Complete the function images_preprocess_generator in **./utils/imgCap.py**. 
2. Then **create a model according to the instructions given in the cell below**.

<span style="color:red">NOTE:</span> 

This process takes time, so we save the encodings to disk and reuse them on later runs.
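
The save-and-reuse idea can be sketched independently of the encoder. The helper `load_or_compute` and the stand-in `fake_encoder` below are hypothetical names used only for illustration; the random array takes the place of the real output of `encoder.predict`.

```python
import os
import tempfile
import numpy as np


def load_or_compute(path, compute_fn):
    """Return the array stored at `path`, computing and saving it on first use."""
    if os.path.exists(path):
        return np.load(path)
    array = compute_fn()
    np.save(path, array)
    return array


calls = []

def fake_encoder():
    # Stand-in for the slow encoding step; returns random "encodings"
    calls.append(1)
    return np.random.rand(4, 2048).astype(np.float32)


with tempfile.TemporaryDirectory() as tmp_dir:
    cache_path = os.path.join(tmp_dir, "toy_encodings.npy")
    first = load_or_compute(cache_path, fake_encoder)   # computes and saves
    second = load_or_compute(cache_path, fake_encoder)  # loads from disk

print(first.shape, len(calls))
```

The second call never invokes the stand-in encoder, which is the behaviour the notebook relies on to avoid re-encoding all images.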

In [None]:
from tensorflow.keras.applications import InceptionV3
from tensorflow.keras import Model
from utils.imgCap import images_preprocess_generator
import numpy as np

##################################################################################################
#TODO: Load InceptionV3 with "imagenet" weights. Create a model with input and output as follows.
# Input : input of first layer of InceptionV3
# Output: output of second layer from the end of InceptionV3 network.
# Hint: Take a look at tf.keras.Model and Model.input, Model.layers, Model.layers[i].output
# https://www.tensorflow.org/api_docs/python/tf/keras/Model
# https://www.tensorflow.org/api_docs/python/tf/keras/applications/InceptionV3
###################################################################################################
# model = 
# encoder = 
###################################################################################################
# END TODO
###################################################################################################


images_path = './Flickr_Data/Flickr8k_Dataset'

encodings_path = './encoded_images'
if not os.path.exists(encodings_path):
    os.mkdir(encodings_path)

# Encode train images (reuse the saved encodings if they exist)
try:
    train_encodings = np.load(os.path.join(encodings_path, 'train_encodings.npy'))
    print('Train image encodings found and loaded')
except FileNotFoundError:
    print('Train image encodings not found. Starting the encoding process...')
    processed_train_images_generator = images_preprocess_generator(train_image_list,images_path)
    train_encodings = encoder.predict(processed_train_images_generator)
    np.save(os.path.join(encodings_path, 'train_encodings.npy'),train_encodings)
    print('Train image encodings saved at '+encodings_path)

# Encode test images (reuse the saved encodings if they exist)
try:
    test_encodings = np.load(os.path.join(encodings_path, 'test_encodings.npy'))
    print('Test image encodings found and loaded')
except FileNotFoundError:
    print('Test image encodings not found. Starting the encoding process...')
    processed_test_images_generator = images_preprocess_generator(test_image_list,images_path)
    test_encodings = encoder.predict(processed_test_images_generator)
    np.save(os.path.join(encodings_path, 'test_encodings.npy'),test_encodings)
    print('Test image encodings saved at '+encodings_path)


# Prepare text
Load the descriptions from Flickr8k.token.txt and create a vocabulary from the captions.
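
To make the idea of a vocabulary and word indexing concrete, here is a small sketch on two made-up captions. The function `build_word_index` is a hypothetical helper, not the course's `generate_vocabulary` or `word_indexing`; starting the ids at 1 is an assumption chosen so that id 0 stays free.

```python
def build_word_index(captions):
    """Map each distinct word to an integer id starting at 1 (0 is left free)."""
    vocab = sorted({word for caption in captions for word in caption.split()})
    word_to_id = {word: idx for idx, word in enumerate(vocab, start=1)}
    id_to_word = {idx: word for word, idx in word_to_id.items()}
    return vocab, word_to_id, id_to_word


# Made-up captions, already wrapped in the start and end tokens
toy_captions = [
    "seqstart a dog runs seqend",
    "seqstart a cat sleeps seqend",
]
toy_vocab, toy_word_to_id, toy_id_to_word = build_word_index(toy_captions)
print(len(toy_vocab), toy_word_to_id["seqstart"])
```

Keeping both dictionaries makes it cheap to go from words to ids when building training data and from ids back to words when decoding.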

In [None]:
from utils.imgCap import load_descriptions, clean_descriptions, generate_vocabulary, word_indexing
descriptions_file = './Flickr_Data/Flickr8k_text/Flickr8k.token.txt'
descriptions_dict, MAX_TEXT_LENGTH = clean_descriptions(load_descriptions(descriptions_file))
vocab = generate_vocabulary(descriptions_dict)
word_to_id, id_to_word = word_indexing(vocab)

# Training Data Generator

The network that we will design in the next cells will have two inputs (one corresponds to an image and the other corresponds to a part of the caption), and it will have one output.

Each (image,caption) pair is converted into multiple (image_encoding, input_caption, output) triples. For example, consider an image 1.png with the caption 'seqstart this is an example seqend', whose image encoding is the numpy array enc1.

The following table shows all the data points that correspond to this single (image,caption) pair:

| image_encoding | input                                        |  output  |
|----------------|----------------------------------------------|----------|
|       enc1     |[['seqstart']]                                | 'this'   |  
|       enc1     |[['seqstart', 'this']]                        | 'is'     |
|       enc1     |[['seqstart', 'this', 'is']]                  | 'an'     |
|       enc1     |[['seqstart', 'this', 'is', 'an']]            | 'example'|
|       enc1     |[['seqstart', 'this', 'is', 'an', 'example']] | 'seqend' |

Thus each (image,caption) pair yields len(caption)-1 training points, where len(caption) is the number of words in the caption. The train_generator in the cell below is written in this fashion.
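
The expansion in the table can be reproduced with a few lines of plain Python. The helper name `expand_caption` is hypothetical and works on words rather than ids, purely to mirror the table.

```python
def expand_caption(caption):
    """Return the (input_prefix, next_word) pairs for one caption string."""
    words = caption.split()
    return [(words[:i + 1], words[i + 1]) for i in range(len(words) - 1)]


pairs = expand_caption("seqstart this is an example seqend")
for prefix, target in pairs:
    print(prefix, "->", target)
```

A six-word caption gives five pairs, matching the len(caption)-1 count stated above.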

In [None]:
def train_generator(max_text_len,vocab_size,word_to_id,train_encodings,batch_size=128):
    input_text_array = []
    input_image_array = []
    output_array = []
    while True:
        for (index,image_name) in enumerate(train_image_list):
            for caption in descriptions_dict[image_name]:
                caption_split = caption.split()
                for i in range(len(caption_split)-1):
                    input_temp = np.zeros(max_text_len)
                    input_temp[:i+1] = [word_to_id[caption_split[k]] for k in range(i+1)]
                    input_text_array.append(input_temp)

                    input_image_array.append(train_encodings[index])

                    output_temp = np.zeros(vocab_size)
                    output_temp[word_to_id[caption_split[i+1]]] = 1
                    output_array.append(output_temp)

                    if len(input_text_array) == batch_size:
                        yield ([np.array(input_image_array),np.array(input_text_array)],np.array(output_array))
                        input_text_array = []
                        input_image_array = []
                        output_array = []

# Model Definition
Now that we have image encodings and the corresponding text, our model should take two inputs and generate one output (as discussed above).

- Input 1 - (batch_size,2048) - corresponds to image encodings
- Input 2 - (batch_size,max_caption_length,) - corresponds to an array representing words of caption in the form of indices (word_to_id dictionary)

- Output - (batch_size,Vocab_size) - gives the probability of each word in the vocabulary being the next word, given the image encodings and the caption sequence so far.
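
The three shapes can be checked on a toy batch. Everything prefixed with `toy_` below is a made-up stand-in (a six-word vocabulary and a random vector instead of a real image encoding); only the padding and one-hot logic follow the train_generator above.

```python
import numpy as np

toy_max_len = 6      # stands in for MAX_TEXT_LENGTH
toy_vocab_size = 7   # stands in for VOCAB_SIZE (ids 1..6 plus the reserved id 0)
toy_ids = {"seqstart": 1, "this": 2, "is": 3, "an": 4, "example": 5, "seqend": 6}

words = "seqstart this is an example seqend".split()
image_encoding = np.random.rand(2048)  # random stand-in for one image encoding

image_batch, text_batch, target_batch = [], [], []
for i in range(len(words) - 1):
    # Caption prefix as ids, zero-padded to the maximum length
    padded = np.zeros(toy_max_len)
    padded[:i + 1] = [toy_ids[w] for w in words[:i + 1]]
    # One-hot vector marking the next word
    one_hot = np.zeros(toy_vocab_size)
    one_hot[toy_ids[words[i + 1]]] = 1
    image_batch.append(image_encoding)
    text_batch.append(padded)
    target_batch.append(one_hot)

image_batch = np.array(image_batch)
text_batch = np.array(text_batch)
target_batch = np.array(target_batch)
print(image_batch.shape, text_batch.shape, target_batch.shape)
```

Here the batch size is 5 because the single toy caption produces five training points.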

As we want our network to remember the context, using the LSTM cell is a good idea. Below is one such implementation of an Image Caption Generator.

<span style="color:red">TODO:</span>

Modify the network to reach a training accuracy of 35%. You may also design a completely different architecture if you wish.

Tips:

- Try adding more LSTM layers and dense layers. Looking at the official documentation of imported layers is recommended.

In [None]:
# Model Definition 
VOCAB_SIZE = len(vocab)+1 # Think about why +1 is added. Answer: index 0 corresponds to an unknown or absent word, and the model has to accommodate it.
EMBEDDING_DIM = 128

from tensorflow.keras.layers import Dense, Dropout, Input, LSTM, Embedding, Add, Bidirectional, Concatenate, RepeatVector
from tensorflow.keras import Model

# Image Features Path
image_input = Input(shape=(2048,))
image_features = Dense(EMBEDDING_DIM, activation='relu')(image_input)
image_features = RepeatVector(MAX_TEXT_LENGTH)(image_features)

# Text Features Path
text_input = Input(shape=(MAX_TEXT_LENGTH,))
text_features = Embedding(VOCAB_SIZE, EMBEDDING_DIM)(text_input)
##################################################################################################
#TODO: Improve text_features by passing through LSTM layer(s)
###################################################################################################

###################################################################################################
# END TODO
###################################################################################################


# Combined Features Path
combined_features = Concatenate()([image_features,text_features])
##################################################################################################
#TODO: Improve combined_features by passing through LSTM layer(s)
###################################################################################################

###################################################################################################
# END TODO
###################################################################################################


outputs = Dense(VOCAB_SIZE, activation='softmax')(combined_features)


captionGeneratorNet = Model(inputs=[image_input,text_input],outputs=outputs)
captionGeneratorNet.compile(loss='categorical_crossentropy', optimizer='RMSprop', metrics=['accuracy'])


In [None]:
captionGeneratorNet.summary()

In [None]:
batch_size = 256
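# Pass batch_size explicitly so the generator matches the steps computed below (its default is 128)
generator = train_generator(MAX_TEXT_LENGTH,VOCAB_SIZE,word_to_id,train_encodings,batch_size=batch_size)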
steps = len(train_image_list) * MAX_TEXT_LENGTH // batch_size
captionGeneratorNet.fit(generator,verbose=1,epochs=30,steps_per_epoch=steps)

In [None]:
captionGeneratorNet.save_weights("model_weights.h5")

# Image Decoder

We now have a trained model that can generate a caption for an image. Generation proceeds sequentially. For example, say enc_trail is a numpy array corresponding to an image named trail.png. We first feed the model the following inputs:

Input 1 - enc_trail. Shape - (1,2048)

Input 2 - [word_to_id['seqstart'], 0, 0, ...]. Shape - (1,max_caption_length).

This generates an output array of dimension (Vocab_size), where each entry represents the probability that the corresponding vocabulary word comes next. We take the argmax to get the id of the most probable next word and convert it back to a word. Say that the word is 'example'. We then update Input 2 to [word_to_id['seqstart'], word_to_id['example'], 0, ...]. This process continues until we reach max_caption_length or produce 'seqend'.

The code below, written for you, implements this procedure.
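
The loop can be illustrated without a trained network. In the sketch below, `toy_predict` is a hypothetical stand-in for the model that always proposes the next word in a fixed list, and `greedy_decode` is an illustrative helper, not the notebook's `image_decoder`.

```python
import numpy as np

toy_words = ["<pad>", "seqstart", "this", "is", "an", "example", "seqend"]
toy_word_ids = {w: i for i, w in enumerate(toy_words)}
toy_max_len = 8


def toy_predict(id_sequence):
    """Stand-in for the trained model: puts all probability on the next id in toy_words."""
    last_id = int(id_sequence[np.count_nonzero(id_sequence) - 1])
    probs = np.zeros(len(toy_words))
    probs[min(last_id + 1, len(toy_words) - 1)] = 1.0
    return probs


def greedy_decode(predict_fn, start_id, end_id, max_len):
    """Append the most probable next id until end_id or max_len is reached."""
    ids = [start_id]
    while len(ids) < max_len and ids[-1] != end_id:
        padded = np.zeros(max_len)
        padded[:len(ids)] = ids
        ids.append(int(np.argmax(predict_fn(padded))))
    return ids


decoded_ids = greedy_decode(
    toy_predict, toy_word_ids["seqstart"], toy_word_ids["seqend"], toy_max_len
)
decoded_words = [toy_words[i] for i in decoded_ids]
print(decoded_words)
```

Taking the argmax at every step is a greedy choice: it keeps only the single most probable word and never revisits earlier decisions.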

In [None]:
def image_decoder(enc_image): 
    id_sentence = [word_to_id['seqstart']]
    while True:
        temp_input = np.zeros(MAX_TEXT_LENGTH)
        temp_input[:len(id_sentence)] = id_sentence
        next_word_id = np.argmax(captionGeneratorNet.predict([enc_image.reshape(1,2048),temp_input.reshape(1,MAX_TEXT_LENGTH)]))
        id_sentence.append(next_word_id)
        if len(id_sentence) == MAX_TEXT_LENGTH or next_word_id == word_to_id['seqend']:
            out_seq = [id_to_word[ele] for ele in id_sentence]
            return out_seq

# Visualize the Process of Caption Generation

In [None]:
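from PIL import Image  # import the submodule explicitly; `import PIL` alone does not guarantee PIL.Image is loaded
from matplotlib import pyplot as plt

def show_image(images_path,image_name):
    # Open the image, resize it to 299x299 and scale pixel values to [0, 1] for display
    image = Image.open(os.path.join(images_path, image_name))
    plt.imshow(np.asarray(image.resize((299,299))) / 255.0)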

In [None]:
ID = 0
show_image(images_path,test_image_list[ID])
image_decoder(test_encodings[ID])