# Assignment 3.3

# Image Caption Retrieval Model

### 1. Data preprocessing

We will use Microsoft COCO (Common Objects in Context) data set to train our "Image Caption Retrieval Model". This data set consists of pretrained 10-crop VGG19 features (Neural codes) and its corresponding text caption. 


In [1]:
from __future__ import print_function

import os
import sys
import numpy as np
import pandas as pd
from collections import OrderedDict

DATA_PATH = 'data'
EMBEDDING_PATH = 'embeddings'
MODEL_PATH = 'models'

You will need to create above directories and locate data set provided in directory 'data'

#### Reading pairs of image (VGG19 features) - caption data

In [2]:
# DO NOT CHANGE BELOW CODE

import collections

np_train_data = np.load(os.path.join(DATA_PATH,'train_data.npy'))
np_val_data = np.load(os.path.join(DATA_PATH,'val_data.npy'))

train_data = collections.OrderedDict()
for i in range(len(np_train_data.item())):
    cap =  np_train_data.item()['caps']
    img =  np_train_data.item()['ims']
    train_data['caps'] = cap
    train_data['ims'] = img
    
val_data = collections.OrderedDict()
for i in range(len(np_val_data.item())):
    cap =  np_val_data.item()['caps']
    img =  np_val_data.item()['ims']
    val_data['caps'] = cap
    val_data['ims'] = img

In [3]:
# example of caption
print(train_data['caps'][0])
print(train_data['ims'].shape)
print(len(train_data['caps']))

b'a woman wearing a net on her head cutting a cake'
(10000, 4096)
50000


In [4]:
# example of pre-computed VGG19 features
print(val_data['ims'][0])
print(val_data['ims'].shape)
print(len(val_data['caps']))


[0.00109166 0.         0.         ... 0.         0.         0.        ]
(5000, 4096)
25000


#### Reading caption and information about its corresponding raw images from Microsoft COCO website

In [5]:
# DO NOT CHANGE BELOW CODE
# use them for your own additional preprocessing step
# to map precomputed features and location of raw images 

import json

with open(os.path.join(DATA_PATH,'instances_val2014.json')) as json_file:
    coco_instances_val = json.load(json_file)
    
with open(os.path.join(DATA_PATH,'captions_val2014.json')) as json_file:
    coco_caption_val = json.load(json_file)

#### Additional preprocessing

In [6]:
# create your own function to map pairs of precomputed features and filepath of raw images
# this will be used later for visualization part
# simple approach: based on matched text caption (see json file)

def match_features_and_filepath(index, features_data, coco_annotations, coco_images):
    low_index = index*5
    top_index = (index+1)*5
    features_captions = features_data['caps'][low_index: top_index]
    features_vector = features_data['ims'][index]
    img_id = None
    selected_caption = None
    
    for caption in features_captions:
        if img_id != None:
            break
        str_caption = caption.decode('utf-8')
        for annotation in coco_annotations:
            coco_caption = annotation['caption']
            if coco_caption.endswith('.'):
                coco_caption = coco_caption[:-1]
            if str_caption.lower() == coco_caption.lower():
                selected_caption = str_caption
                img_id = annotation['image_id']
                #print(annotation)
                break
    
    img_url = None
    for img_data in coco_images:
        if img_data['id'] == img_id:
            img_url = img_data['coco_url']
            break
    
    return (img_url, selected_caption, features_vector)
    


In [7]:
# debug implementation of previous function
for i in range(10):
    image_info = match_features_and_filepath(i, train_data, coco_caption_val['annotations'], coco_caption_val['images'])
    print(image_info)

('http://images.cocodataset.org/val2014/COCO_val2014_000000522418.jpg', 'a woman cutting a large white sheet cake', array([0.00495117, 0.        , 0.        , ..., 0.        , 0.00079085,
       0.01148911], dtype=float32))
('http://images.cocodataset.org/val2014/COCO_val2014_000000318219.jpg', 'a young boy standing in front of a computer keyboard', array([0.00460475, 0.02057743, 0.01465815, ..., 0.        , 0.        ,
       0.02613593], dtype=float32))
('http://images.cocodataset.org/val2014/COCO_val2014_000000554625.jpg', 'a boy wearing headphones using one computer in a long row of computers', array([0.00487754, 0.0193617 , 0.00561136, ..., 0.        , 0.        ,
       0.01967784], dtype=float32))
('http://images.cocodataset.org/val2014/COCO_val2014_000000397133.jpg', 'a man is in a kitchen making pizzas', array([0.03168815, 0.02294972, 0.00275241, ..., 0.        , 0.        ,
       0.00495452], dtype=float32))
('http://images.cocodataset.org/val2014/COCO_val2014_000000574769.j

#### Build vocabulary index 

In [8]:
# DO NOT CHANGE BELOW CODE

def build_dictionary(text):

    wordcount = OrderedDict()
    for cc in text:
        words = cc.split()
        for w in words:
            if w not in wordcount:
                wordcount[w] = 0
            wordcount[w] += 1
    words = list(wordcount.keys())
    freqs = list(wordcount.values())
    sorted_idx = np.argsort(freqs)[::-1]
    

    worddict = OrderedDict()
    worddict['<pad>'] = 0
    worddict['<unk>'] = 1
    for idx, sidx in enumerate(sorted_idx):
        worddict[words[sidx]] = idx+2  # 0: <pad>, 1: <unk>
    

    return worddict

# use the resulting vocabulary index as your look up dictionary
# to transform raw text into integer sequences

all_captions = []
all_captions = train_data['caps'] + val_data['caps']

# decode bytes to string format
caps = []
for w in all_captions:
    caps.append(w.decode())
    
words_indices = build_dictionary(caps)
print ('Dictionary size: ' + str(len(words_indices)))
indices_words = dict((v,k) for (k,v) in words_indices.items())

Dictionary size: 11473


### 2. Image - Caption Retrieval Model

In [27]:
from keras.layers import Input, Dense, Embedding, LSTM, Lambda
from keras.models import Model, Sequential
from keras import backend as K
#from keras.layers import Input, Conv2D, Lambda, Dense, Flatten, MaxPooling2D, Dropout, BatchNormalization
#from keras.models import Model, Sequential


### Image model

In [10]:
# YOUR CODE HERE 
image_input = Input(shape=(4096, ))

image_model = Sequential()
image_model.add(Dense(1024, input_shape=(4096, ), name="image_neural_codes", activation="sigmoid"))
image_model.summary()

image_encoding = image_model(image_input)

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
image_neural_codes (Dense)   (None, 1024)              4195328   
Total params: 4,195,328
Trainable params: 4,195,328
Non-trainable params: 0
_________________________________________________________________


### Caption model

In [11]:
# function to load pretrained embedding
def load_embedding(vocab, dimension, filename):
    print('loading embeddings from "%s"' % filename, file=sys.stderr)
    embedding = np.zeros((max(vocab.values()) + 1, dimension), dtype=np.float32)
    seen = set()
    with open(filename) as fp:
        for line in fp:
            tokens = line.strip().split(' ')
            if len(tokens) == dimension + 1:
                word = tokens[0]
                if word in vocab:
                    embedding[vocab[word]] = [float(x) for x in tokens[1:]]
                    seen.add(word)
                    if len(seen) == len(vocab):
                        break
    return embedding

In [13]:
weights = load_embedding(words_indices, 100, os.path.join(DATA_PATH,'glove.6B.100d.txt'))

loading embeddings from "data/glove.6B.100d.txt"


In [20]:
weights.shape

(11473, 100)

In [25]:
# YOUR CODE HERE
caption_max_length = 50
embedding_size = 100
rnn_output_units = 1024
# For embedding layer, initialize with pretrained word embedding (GloVe)
caption_input = Input(shape=(caption_max_length, ), name='input_layer', dtype='int32')
noisy_input = Input(shape=(caption_max_length, ),  name='noisy_input_layer', dtype='int32')

caption_model = Sequential()
caption_model.add(Embedding(len(words_indices), embedding_size, weights=[weights], input_length=caption_max_length, \
                            trainable=False, name='word_embedding'))
caption_model.add(LSTM(rnn_output_units, name='caption_neural_codes'))

#model = Sequential()
#model.add(Lambda(binarize, output_shape=binarize_outshape,name='char_embedding', \
#                 input_shape=(max_sequence_length,), dtype='int32'))
#model.add(LSTM(rnn_dim, name='lstm_layer'))
#model.add(Dense(1 , name='prediction_layer', activation='sigmoid'))




## the input lenght of the embedding layer corresponds to the lenght of the encoded text sequences
## which have a maximum of 50 words
#embedding_layer = Embedding(len(words_indices), embedding_size, weights=[weights], input_length=caption_max_length, \
#                            trainable=False, name='word_embedding')
#embedded_captions = embedding_layer(caption_input)
#lstm_layer = LSTM(rnn_output_units, name='caption_neural_codes')(embedded_captions)
print(caption_model.summary())
caption_encoding = caption_model(caption_input)
noisy_encoding = caption_model(noisy_input)
#print(caption_encoding.summary())
#print(noisy_encoding.summary())


_________________________________________________________________
Layer (type)                 Output Shape              Param #   
word_embedding (Embedding)   (None, 50, 100)           1147300   
_________________________________________________________________
caption_neural_codes (LSTM)  (None, 1024)              4608000   
Total params: 5,755,300
Trainable params: 4,608,000
Non-trainable params: 1,147,300
_________________________________________________________________
None


### Join model

In [29]:
# YOUR CODE HERE
dot_product = lambda x: K.dot(x[0], x[1])
positive_examples = Lambda(dot_product)([image_encoding, caption_encoding])

## think about how to implement this
negative_examples = Lambda(dot_product)([image_encoding, noisy_encoding])
# layer for computing dot product between tensors

### Main model for training stage

In [30]:
# YOUR CODE HERE

# define your model input and output

print ("loading the training model")
training_model = Model(inputs=[image_input, caption_input, noisy_input], outputs=[positive_examples, negative_examples])

loading the training model


### Retrieval model

In [32]:
# YOUR CODE HERE

# define your model input and output

print ("loading sub-models for retrieving Neural codes")
caption_model = Model(inputs=caption_input, outputs=caption_encoding)
image_model = Model(inputs=image_input, outputs=image_encoding)

loading sub-models for retrieving Neural codes


### Loss function

We define our loss function as a loss for maximizing the margin between a positive and
negative example.  If we call $p_i$ the score of the positive pair of the $i$-th example, and $n_i$ the score of the negative pair of that example, the loss is:

\begin{equation*}
loss = \sum_i{max(0, 1 -p_i + n_i)}
\end{equation*}

In [None]:
from keras import backend as K


def max_margin_loss(y_true, y_pred):
    
    # YOUR CODE HERE
    loss_ = K.sum(K.max(0, 1 - y_true + y_pred))
    
    return loss_
   

#### Accuracy metric for max-margin loss
How many times did the positive pair effectively get a higher value than the negative pair?

In [None]:
# YOUR CODE HERE
def accuracy(y_true, y_pred):
    
    # YOUR CODE HERE
    #accuracy_ = K.sum(y_true > y_pred)
    
    return accuracy_


### Compile model

In [None]:
# DO NOT CHANGE BELOW CODE
print ("compiling the training model")
training_model.compile(optimizer='adam', loss=max_margin_loss, metrics=[accuracy])

### 3. Data preparation for training the model

* adjust the length of captions into fixed maximum length (50 words)
* sampling caption for each image, while shuffling the image data
* encode captions into integer format based on look-up vocabulary index

In [34]:
print(train_data['caps'][0:10])
print(train_data['ims'][:2])

[b'a woman wearing a net on her head cutting a cake', b'a woman cutting a large white sheet cake', b'a woman wearing a hair net cutting a large sheet cake', b'there is a woman that is cutting a white cake', b'a woman marking a cake with the back of a chefs knife', b'a young boy standing in front of a computer keyboard', b'a little boy wearing headphones and looking at a computer monitor', b'he is listening intently to the computer at school', b'a young boy stares up at the computer monitor', b'a young kid with head phones on using a computer']
[[0.00495117 0.         0.         ... 0.         0.00079085 0.01148911]
 [0.00460475 0.02057743 0.01465815 ... 0.         0.         0.02613593]]


In [36]:
# sampling one caption per image
# return image_ids, caption_ids
import random

def sampling_img_cap(data):

    # YOUR CODE HERE
    # first sample for each image a caption
    pairs_ids = []
    
    for image_id in range(len(data['ims'])):
        caption_delta = random.randint(0, 4)
        caption_id = image_id*5 + caption_delta
        pairs_ids.append((image_id, caption_id))

    # then shuffle the image data with the corresponding captions
    random.shuffle(pairs_ids)
    image_ids, caption_ids = zip(*pairs_ids)
    
    return image_ids, caption_ids


In [40]:
# transform raw text caption into integer sequences of fixed maximum length
from keras.preprocessing import sequence

def prepare_caption(caption_ids, caption_data):
    
    # YOUR CODE HERE
    selected_captions = []
    for i in caption_ids:
        caption_text = caption_data[i]
        caption_tokens = [words_indices[w] if w in words_indices.keys() else words_indices['<unk>'] for w in caption_text]
        selected_captions.append(caption_tokens)
        
    selected_captions_arr = np.array(selected_captions)        
    caption_seqs = sequence.pad_sequences(selected_captions_arr, maxlen=caption_max_length, value=0)
      
    return caption_seqs

In [38]:
# DO NOT CHANGE BELOW CODE

train_caps = []
for cap in train_data['caps']:
    train_caps.append(cap.decode())

val_caps = []
for cap in val_data['caps']:
    val_caps.append(cap.decode())

In [41]:
# DO NOT CHANGE BELOW CODE

train_image_ids, train_caption_ids = sampling_img_cap(train_data)
val_image_ids, val_caption_ids = sampling_img_cap(val_data)

x_caption = prepare_caption(train_caption_ids, train_caps)
x_image = train_data['ims'][np.array(train_image_ids)]

x_val_caption = prepare_caption(val_caption_ids, val_caps)
x_val_image = val_data['ims'][np.array(val_image_ids)]

### 4. Create noise set for negative examples of image-fake caption and dummy output

Notice that we do not have real output with labels for training the model. Keras architecture expects labels, so we need to create dummy output -- which is numpy array of zeros. This dummy labels or output is never used since we compute loss function based on margin between positive examples (image-real caption) and negative examples (image-fake caption).

In [79]:
print(x_caption[0].shape)
print(len(words_indices))
print(x_caption[0])
ids = np.random.choice(x_caption[0].shape[0], 10, replace=False)
print(ids)


(50,)
11473
[ 9910  4163  2661   449   823  9557  7237 10223  5150  9872  5243  5814
  6303  1867  3758  8973 11403  4398  7625  1163  4974  5907  4557 10345
   653  3904 10982  6893  1181 10307  7509   550  5093   693  2086 10384
  2423  2227  6107     1  9148  3129  2780  1124  8095  3644  3524  3020
  2517  3737]
[18  6 45 29  9 39 10 13 33 15]


In [56]:
x_caption

array([[    0,  1263,     0, ...,  3756,   927,  3995],
       [ 6453,  5035,  2086, ...,     2,  1010,  6218],
       [    1,  9456,  4107, ...,  1315,  2086,  6453],
       ...,
       [    0,  6448,     0, ...,  8894,  5632,  6000],
       [    0,     0,     0, ...,  2086,  4477,  3995],
       [ 6364, 10788,  6169, ...,  2785,  2086,  3032]], dtype=int32)

In [74]:
# YOUR CODE HERE
def generate_noisy_caption(caption):
    ids = np.random.choice(caption.shape[0], 20, replace=False)
    values = np.random.choice(len(words_indices), 20)
    #print(values)
    #print(caption)
    new_caption = caption
    #print(new_caption)
    
    #for i in range
    for i, val in zip(ids, values):
        new_caption[i] = val
    
    return new_caption
    
#print(generate_noisy_caption(x_caption[0]))
    
def generate_noisy_captions(images, captions):
    noisy_captions = np.empty(captions.shape, dtype=np.int32)
    
    i = 0
    for img, caption in zip(images, captions):
        noisy_caption = generate_noisy_caption(caption)
        noisy_captions[i, :] = noisy_caption
        i += 1
    
    return noisy_captions

#print(x_caption[:2])
#print(generate_noisy_captions(x_image[:2], x_caption[:2]))

train_noise = generate_noisy_captions(x_image, x_caption)
val_noise = generate_noisy_captions(x_val_image, x_val_caption)

y_train_labels = np.zeros((len(x_image), rnn_output_units))
y_val_labels = np.zeros((len(x_val_image), rnn_output_units))

In [75]:
train_noise

array([[ 9910,  4163,  2661, ...,  3020,  2517,  3737],
       [ 2893,  2415, 11354, ...,  1296,  5308,  6218],
       [ 8021, 10883,  2029, ...,  8482,  2086,  6562],
       ...,
       [ 2057,  9723,   195, ...,  2591,  1065,  5345],
       [10562,     0,   647, ...,  2086,  5746,  2691],
       [ 4105, 10788,  1416, ...,  8189,  2250,  4586]], dtype=int32)

In [66]:
x_caption

array([[ 2260,  1542,  2174, ...,  3020,  3778,  3425],
       [ 6453,  5035,    97, ...,     2,  4021,  6218],
       [ 8021,  9456,  4107, ...,  1315,  2086,  5386],
       ...,
       [    0,  6046,  3281, ...,  2591,  4759,  5345],
       [    0,     0,     0, ...,  2086,  5746,  3995],
       [ 6364, 10788,  1499, ...,  8189,  2250,   409]], dtype=int32)

In [77]:
print(y_train_labels.shape)
print(y_val_labels.shape)

(10000, 1024)
(5000, 1024)


### 5. Training model

In [None]:
# YOUR CODE HERE

X_train = 
Y_train = 
X_valid = 
Y_valid = 

In [None]:
# YOUR CODE HERE

# fit the model on training and validation set

#### Storing models and weight parameters

In [None]:
# DO NOT CHANGE BELOW CODE

# Save model
training_model.save(os.path.join(MODEL_PATH,'image_caption_model.h5'))
# Save weight parameters
training_model.save_weights(os.path.join(MODEL_PATH, 'weights_image_caption.hdf5'))

# Save model for encoding caption and image
caption_model.save(os.path.join(MODEL_PATH,'caption_model.h5'))
image_model.save(os.path.join(MODEL_PATH,'image_model.h5'))

### 6. Feature extraction (Neural codes)

In [None]:
# YOUR CODE HERE

# Use caption_model and image_model to produce "Neural codes" 
# for both image and caption from validation set

### 7. Caption Retrieval

#### Display original image as query and its ground truth caption

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline
from keras.preprocessing import image

In [None]:
# YOUR CODE HERE

# choose one image_id from validation set
# use this id to get filepath of image
img_id = 
filepath_image = 

# display original caption
original_caption = 
print(original_caption)

# DO NOT CHANGE BELOW CODE
img = image.load_img(os.path.join(IMAGE_DATA,filepath_image), target_size=(224,224))
plt.imshow(img)
plt.axis("off")
plt.show()

In [None]:
# function to retrieve caption, given an image query

def get_caption(image_filename, n=10):   
    
    # YOUR CODE HERE


In [None]:
# DO NOT CHANGE BELOW CODE
get_caption(filepath_image)

Briefly discuss the result. Why or how it works, and why do you think it does not work at some point.

#### Answer:

=== write your answer here ===

### 8. Image Retrieval

In [None]:
# given text query, display retrieved image, similarity score, and its original caption 

def search_image(text_caption, n=10):
    
    # YOUR CODE HERE
    

Consider to use the following settings for image retrieval task.

* use real caption that is available in validation set as a query.
* use part of caption as query. For instance, instead of use the whole text sentence of the
caption, you may consider to use key phrase or combination of words that is included in
corresponding caption.

In [None]:
# Example of text query 
# text = 'two giraffes standing near trees'

# YOUR QUERY-1
text1 = 

# DO NOT CHANGE BELOW CODE
search_image(text1)

In [None]:
# YOUR QUERY-2
text2 = 

# DO NOT CHANGE BELOW CODE
search_image(text2)

Briefly discuss the result. Why or how it works, and why do you think it does not work at some point.

#### Answer:

=== write your answer here ===