# Assignment 3.3

# Image Caption Retrieval Model

### 1. Data preprocessing

We will use the Microsoft COCO (Common Objects in Context) data set to train our "Image Caption Retrieval Model". The provided data consists of precomputed 10-crop VGG19 features (neural codes) for each image and the corresponding text captions.


In [1]:
from __future__ import print_function

import os
import sys
import numpy as np
import pandas as pd
from collections import OrderedDict

DATA_PATH = 'data'
EMBEDDING_PATH = 'embeddings'
MODEL_PATH = 'models'

You will need to create the directories above and place the provided data set in the 'data' directory.

#### Reading pairs of image (VGG19 features) - caption data

In [2]:
# DO NOT CHANGE BELOW CODE

import collections

np_train_data = np.load(os.path.join(DATA_PATH,'train_data.npy'))
np_val_data = np.load(os.path.join(DATA_PATH,'val_data.npy'))

train_data = collections.OrderedDict()
for i in range(len(np_train_data.item())):
    cap =  np_train_data.item()['caps']
    img =  np_train_data.item()['ims']
    train_data['caps'] = cap
    train_data['ims'] = img
    
val_data = collections.OrderedDict()
for i in range(len(np_val_data.item())):
    cap =  np_val_data.item()['caps']
    img =  np_val_data.item()['ims']
    val_data['caps'] = cap
    val_data['ims'] = img

In [3]:
# example of caption
train_data['caps'][0]

b'a woman wearing a net on her head cutting a cake'

In [4]:
# example of pre-computed VGG19 features
val_data['ims'][0]

array([0.00109166, 0.        , 0.        , ..., 0.        , 0.        ,
       0.        ], dtype=float32)

#### Reading captions and information about their corresponding raw images from the Microsoft COCO website

In [5]:
# DO NOT CHANGE BELOW CODE
# use them for your own additional preprocessing step
# to map precomputed features and location of raw images 

import json

with open(os.path.join(DATA_PATH,'instances_val2014.json')) as json_file:
    coco_instances_val = json.load(json_file)
    
with open(os.path.join(DATA_PATH,'captions_val2014.json')) as json_file:
    coco_caption_val = json.load(json_file)

#### Additional preprocessing

In [56]:
# create your own function to map pairs of precomputed features and filepath of raw images
# this will be used later for visualization part
# simple approach: based on matched text caption (see json file)

# YOUR CODE HERE 
coco_caption_val['annotations']
print(coco_caption_val.keys())
coco_caption_val.get('images')[0]['coco_url']

coco_instances_val.get('images')

dict_keys(['info', 'images', 'licenses', 'annotations'])


[{'coco_url': 'http://images.cocodataset.org/val2014/COCO_val2014_000000391895.jpg',
  'date_captured': '2013-11-14 11:18:45',
  'file_name': 'COCO_val2014_000000391895.jpg',
  'flickr_url': 'http://farm9.staticflickr.com/8186/8119368305_4e622c8349_z.jpg',
  'height': 360,
  'id': 391895,
  'license': 3,
  'width': 640},
 {'coco_url': 'http://images.cocodataset.org/val2014/COCO_val2014_000000522418.jpg',
  'date_captured': '2013-11-14 11:38:44',
  'file_name': 'COCO_val2014_000000522418.jpg',
  'flickr_url': 'http://farm1.staticflickr.com/1/127244861_ab0c0381e7_z.jpg',
  'height': 480,
  'id': 522418,
  'license': 4,
  'width': 640},
 {'coco_url': 'http://images.cocodataset.org/val2014/COCO_val2014_000000184613.jpg',
  'date_captured': '2013-11-14 12:36:29',
  'file_name': 'COCO_val2014_000000184613.jpg',
  'flickr_url': 'http://farm3.staticflickr.com/2169/2118578392_1193aa04a0_z.jpg',
  'height': 336,
  'id': 184613,
  'license': 3,
  'width': 500},
 {'coco_url': 'http://images.cocoda

In [22]:
coco_caption_val.get('annotations')[0].keys()

dict_keys(['image_id', 'id', 'caption'])
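One possible way to build the caption-to-file mapping described above is sketched here. It relies only on the keys visible in the outputs (`id` and `file_name` for images, `image_id` and `caption` for annotations). The helper name `build_caption_to_file`, the toy dictionary, and the caption normalization (lowercasing, stripping whitespace and a trailing period) are assumptions made for illustration; the normalization may need adjusting until the captions in `val_data['caps']` actually match.

```python
def build_caption_to_file(coco_caption):
    """Map normalized caption text to the file name of its raw image."""
    # image id -> file name, taken from the 'images' entries
    id_to_file = {im['id']: im['file_name'] for im in coco_caption['images']}
    cap_to_file = {}
    for ann in coco_caption['annotations']:
        # assumed normalization; adjust it if captions fail to match
        key = ann['caption'].strip().rstrip('.').strip().lower()
        # keep the first image seen when the same caption text occurs twice
        cap_to_file.setdefault(key, id_to_file[ann['image_id']])
    return cap_to_file


# toy stand-in with the same layout as captions_val2014.json (made-up caption)
toy_coco = {
    'images': [{'id': 42, 'file_name': 'COCO_val2014_000000000042.jpg'}],
    'annotations': [{'image_id': 42, 'id': 1, 'caption': 'A dog on a couch. '}],
}
caption_to_file = build_caption_to_file(toy_coco)
```

With the real data, a decoded caption from `val_data['caps']` could then be looked up in the returned dictionary to find the raw image file for visualization.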

In [25]:
ims = coco_caption_val['images']
captions = coco_caption_val['annotations']
ims.sort(key=lambda x: x['id'])
captions.sort(key=lambda x: x['image_id'])
# for caption in 

In [30]:
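caps_by_image = {}  # image_id -> list of caption strings
for c in captions:
    caps_by_image.setdefault(c['image_id'], []).append(c['caption'])
ims  # display the image entries, now sorted by id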

[{'coco_url': 'http://images.cocodataset.org/val2014/COCO_val2014_000000000042.jpg',
  'date_captured': '2013-11-18 09:22:23',
  'file_name': 'COCO_val2014_000000000042.jpg',
  'flickr_url': 'http://farm7.staticflickr.com/6024/6016274664_ea4ecac20c_z.jpg',
  'height': 478,
  'id': 42,
  'license': 2,
  'width': 640},
 {'coco_url': 'http://images.cocodataset.org/val2014/COCO_val2014_000000000073.jpg',
  'date_captured': '2013-11-15 12:34:05',
  'file_name': 'COCO_val2014_000000000073.jpg',
  'flickr_url': 'http://farm6.staticflickr.com/5023/5881310882_d0342ec5df_z.jpg',
  'height': 640,
  'id': 73,
  'license': 4,
  'width': 565},
 {'coco_url': 'http://images.cocodataset.org/val2014/COCO_val2014_000000000074.jpg',
  'date_captured': '2013-11-15 03:08:44',
  'file_name': 'COCO_val2014_000000000074.jpg',
  'flickr_url': 'http://farm5.staticflickr.com/4087/5078192399_aaefdb5074_z.jpg',
  'height': 426,
  'id': 74,
  'license': 2,
  'width': 640},
 {'coco_url': 'http://images.cocodataset.or

In [33]:
# captions is a list of dicts: index each dict by key rather than calling the list
[c['image_id'] for c in captions[:5]]

In [62]:
coco_caption_val.get('images')

[{'coco_url': 'http://images.cocodataset.org/val2014/COCO_val2014_000000391895.jpg',
  'date_captured': '2013-11-14 11:18:45',
  'file_name': 'COCO_val2014_000000391895.jpg',
  'flickr_url': 'http://farm9.staticflickr.com/8186/8119368305_4e622c8349_z.jpg',
  'height': 360,
  'id': 391895,
  'license': 3,
  'width': 640},
 {'coco_url': 'http://images.cocodataset.org/val2014/COCO_val2014_000000522418.jpg',
  'date_captured': '2013-11-14 11:38:44',
  'file_name': 'COCO_val2014_000000522418.jpg',
  'flickr_url': 'http://farm1.staticflickr.com/1/127244861_ab0c0381e7_z.jpg',
  'height': 480,
  'id': 522418,
  'license': 4,
  'width': 640},
 {'coco_url': 'http://images.cocodataset.org/val2014/COCO_val2014_000000184613.jpg',
  'date_captured': '2013-11-14 12:36:29',
  'file_name': 'COCO_val2014_000000184613.jpg',
  'flickr_url': 'http://farm3.staticflickr.com/2169/2118578392_1193aa04a0_z.jpg',
  'height': 336,
  'id': 184613,
  'license': 3,
  'width': 500},
 {'coco_url': 'http://images.cocoda

#### Build vocabulary index 

In [63]:
# DO NOT CHANGE BELOW CODE

def build_dictionary(text):

    wordcount = OrderedDict()
    for cc in text:
        words = cc.split()
        for w in words:
            if w not in wordcount:
                wordcount[w] = 0
            wordcount[w] += 1
    words = list(wordcount.keys())
    freqs = list(wordcount.values())
    sorted_idx = np.argsort(freqs)[::-1]
    

    worddict = OrderedDict()
    worddict['<pad>'] = 0
    worddict['<unk>'] = 1
    for idx, sidx in enumerate(sorted_idx):
        worddict[words[sidx]] = idx+2  # 0: <pad>, 1: <unk>
    

    return worddict

# use the resulting vocabulary index as your look up dictionary
# to transform raw text into integer sequences

all_captions = []
all_captions = train_data['caps'] + val_data['caps']

# decode bytes to string format
caps = []
for w in all_captions:
    caps.append(w.decode())
    
words_indices = build_dictionary(caps)
print ('Dictionary size: ' + str(len(words_indices)))
indices_words = dict((v,k) for (k,v) in words_indices.items())

Dictionary size: 11473


In [137]:
#OWN CODE
from keras.preprocessing.sequence import pad_sequences

encoded_caps = []
unk_index = words_indices['<unk>']
for cap in train_data['caps']:
    # decode bytes once, then map each word to its index (unknown words -> <unk>)
    tokens = cap.decode().split()
    encoded_caps.append([words_indices.get(w, unk_index) for w in tokens])
# pad (or truncate) captions to a self-defined max length
max_len = 15
padded_caps = pad_sequences(encoded_caps, maxlen=max_len, padding='post')

### 2. Image - Caption Retrieval Model

In [72]:
from keras.layers import Input, Conv2D, Lambda, Dense, Flatten, MaxPooling2D, Dropout, BatchNormalization
from keras.models import Model, Sequential
from keras.regularizers import l2
from keras import backend as K
from keras.losses import binary_crossentropy
import numpy as np
import os
import pickle
import matplotlib.pyplot as plt
from sklearn.utils import shuffle

Using TensorFlow backend.


### Image model

In [84]:
train_data['ims'][0].size

4096

In [85]:
# YOUR CODE HERE 
ims_input_shape = (4096,)

# construct architecture
input_layer = Input(shape=ims_input_shape, name='input_layer')
output_layer = Dense(1024, name='output_layer', activation='relu')(input_layer)

# define and load model
ims_model = Model(inputs=input_layer, outputs=output_layer)
ims_model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_layer (InputLayer)     (None, 4096)              0         
_________________________________________________________________
output_layer (Dense)         (None, 1024)              4195328   
Total params: 4,195,328
Trainable params: 4,195,328
Non-trainable params: 0
_________________________________________________________________


### Caption model

In [88]:
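from gensim.models import KeyedVectors
# Note: raw GloVe .txt files lack the word2vec header line. With gensim >= 4.0, pass
# no_header=True; with older versions, convert the file first (gensim.scripts.glove2word2vec).
glove = KeyedVectors.load_word2vec_format(os.path.join(EMBEDDING_PATH,'glove.6B.300d.txt'), binary=False)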

In [0]:
# YOUR CODE HERE

# For embedding layer, initialize with pretrained word embedding (GloVe)
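caption_input = Input(shape=(None,), name='caption_input')  # variable-length integer sequences
# TODO: embedding_layer = Embedding(...)  # weights initialized from GloVe
# TODO: recurrent_layer = ...             # encodes the embedded sequence into a fixed-size vector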
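To initialize an embedding layer from GloVe, the pretrained vectors first have to be arranged so that row `i` holds the vector of the word with index `i`. A minimal NumPy sketch of that step follows. The function name `build_embedding_matrix` and the toy index and vectors are hypothetical stand-ins for `words_indices` and `glove`; any mapping that supports `word in vectors` and `vectors[word]` should work the same way.

```python
import numpy as np


def build_embedding_matrix(word_index, vectors, dim):
    """Return a (vocab_size, dim) matrix whose row i holds the vector of word i."""
    matrix = np.zeros((len(word_index), dim), dtype='float32')
    for word, idx in word_index.items():
        if word in vectors:  # words without a pretrained vector stay all-zero
            matrix[idx] = vectors[word]
    return matrix


# toy stand-ins for words_indices and the GloVe vectors (hypothetical values)
toy_index = {'<pad>': 0, '<unk>': 1, 'a': 2, 'cake': 3}
toy_vectors = {'a': np.array([0.1, 0.2, 0.3]), 'cake': np.array([0.4, 0.5, 0.6])}
embedding_matrix = build_embedding_matrix(toy_index, toy_vectors, dim=3)
```

In Keras 2, a matrix like this is commonly handed to the `Embedding` layer through `weights=[embedding_matrix]`; whether to freeze it with `trainable=False` is a design choice left to the assignment.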


### Join model

In [0]:
# YOUR CODE HERE

# layer for computing dot product between tensors
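The score of an image-caption pair is the dot product of the two encoded vectors. The arithmetic can be illustrated in NumPy as below; `pair_scores` and the toy arrays are illustrative names and values, not part of the assignment.

```python
import numpy as np


def pair_scores(image_codes, caption_codes):
    """Row-wise dot product: one similarity score per (image, caption) pair."""
    return np.sum(image_codes * caption_codes, axis=1)


image_codes = np.array([[1.0, 0.0], [0.5, 0.5]])
caption_codes = np.array([[1.0, 2.0], [2.0, 2.0]])
scores = pair_scores(image_codes, caption_codes)
```

Inside the Keras graph the same operation is available as a layer, for example `keras.layers.Dot(axes=1)` applied to the two encoder outputs, provided both encoders produce vectors of the same size.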

### Main model for training stage

In [0]:
# YOUR CODE HERE

# define your model input and output

print ("loading the training model")
training_model = Model(inputs=, outputs=)

### Retrieval model

In [0]:
# YOUR CODE HERE

# define your model input and output

print ("loading sub-models for retrieving Neural codes")
caption_model = Model(inputs=, outputs=)
image_model = Model(inputs=, outputs=)

### Loss function

We define the loss as a max-margin loss, which pushes the score of a positive pair above the score of a negative pair by a margin of at least 1.
If we call $p_i$ the score of the positive pair of the $i$-th example, and $n_i$ the score of the negative pair of that example, the loss is:

\begin{equation*}
\text{loss} = \sum_i \max(0,\, 1 - p_i + n_i)
\end{equation*}

In [0]:
from keras import backend as K


def max_margin_loss(y_true, y_pred):
    
    # YOUR CODE HERE
    loss_ =
    
    return loss_
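The formula can be checked on a few numbers with NumPy before writing the Keras version. This is only a sketch of the arithmetic: `max_margin_loss_np` is a hypothetical helper that takes the positive and negative scores as two separate arrays, whereas the Keras loss has to extract them from `y_pred` in whatever layout the training model produces.

```python
import numpy as np


def max_margin_loss_np(pos_scores, neg_scores, margin=1.0):
    """Sum over examples of max(0, margin - p_i + n_i)."""
    return np.sum(np.maximum(0.0, margin - pos_scores + neg_scores))


pos = np.array([2.0, 0.5, 1.0])  # scores of the positive pairs
neg = np.array([0.5, 0.4, 1.5])  # scores of the negative pairs
# per-example terms: max(0, -0.5) = 0, max(0, 0.9) = 0.9, max(0, 1.5) = 1.5
loss_value = max_margin_loss_np(pos, neg)
```

The first example contributes nothing because its positive score already exceeds the negative score by more than the margin.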
   

#### Accuracy metric for max-margin loss
How many times did the positive pair effectively get a higher value than the negative pair?

In [0]:
# YOUR CODE HERE
def accuracy(y_true, y_pred):
    
    # YOUR CODE HERE
    accuracy_ =
    
    return accuracy_
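As a sanity check, the metric can be computed in NumPy as the fraction of examples whose positive score is strictly higher than the negative score. `margin_accuracy_np` is an illustrative name, and the separate score arrays are an assumption; the Keras version must read both scores from `y_pred`.

```python
import numpy as np


def margin_accuracy_np(pos_scores, neg_scores):
    """Fraction of examples where the positive pair outscores the negative pair."""
    return np.mean(pos_scores > neg_scores)


pos = np.array([2.0, 0.5, 1.0])
neg = np.array([0.5, 0.4, 1.5])
acc = margin_accuracy_np(pos, neg)  # two of the three pairs are ranked correctly
```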


### Compile model

In [0]:
# DO NOT CHANGE BELOW CODE
print ("compiling the training model")
training_model.compile(optimizer='adam', loss=max_margin_loss, metrics=[accuracy])

### 3. Data preparation for training the model

* pad or truncate captions to a fixed maximum length (50 words)
* sample one caption for each image while shuffling the image data
* encode captions as integer sequences using the look-up vocabulary index

In [0]:
# sampling one caption per image
# return image_ids, caption_ids

def sampling_img_cap(data):
    
    # YOUR CODE HERE
    
    return image_ids, caption_ids
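A possible sampling scheme is sketched below. It rests on an assumption that is not stated in this notebook and should be verified against the data: that every image has a fixed number of captions (`caps_per_image`, here 5) and that the captions of image `i` sit at positions `i * caps_per_image` to `i * caps_per_image + caps_per_image - 1` in the caption list. The function name and the `seed` parameter are also illustrative.

```python
import numpy as np


def sample_image_caption_ids(n_images, caps_per_image=5, seed=None):
    """Shuffle the image order and pick one caption index per image."""
    rng = np.random.RandomState(seed)
    image_ids = rng.permutation(n_images)                    # shuffled image order
    offsets = rng.randint(0, caps_per_image, size=n_images)  # which caption of each image
    # ASSUMPTION: captions of image i are stored contiguously, caps_per_image per image
    caption_ids = image_ids * caps_per_image + offsets
    return image_ids, caption_ids


image_ids, caption_ids = sample_image_caption_ids(n_images=4, seed=0)
```

Under that layout, `sampling_img_cap(data)` could call such a helper with `n_images=len(data['ims'])`.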


In [0]:
# transform raw text caption into integer sequences of fixed maximum length

def prepare_caption(caption_ids, caption_data):
    
    # YOUR CODE HERE
    
    caption_seqs = 
    
      
    return caption_seqs
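The encoding step can be sketched without Keras utilities as follows. `encode_captions` and the toy vocabulary are hypothetical; the sketch assumes index 0 is `<pad>` and index 1 is `<unk>`, as in `build_dictionary`, and pads at the end of each sequence.

```python
import numpy as np


def encode_captions(captions, word_index, max_len=50):
    """Encode captions as integer rows, truncated or zero-padded to max_len."""
    seqs = np.zeros((len(captions), max_len), dtype='int32')  # 0 is <pad>
    unk = word_index['<unk>']
    for row, text in enumerate(captions):
        # look up each word, fall back to <unk>, and cut off overly long captions
        ids = [word_index.get(w, unk) for w in text.split()][:max_len]
        seqs[row, :len(ids)] = ids
    return seqs


toy_index = {'<pad>': 0, '<unk>': 1, 'a': 2, 'cake': 3}
encoded = encode_captions(['a cake', 'a big cake'], toy_index, max_len=4)
```

In `prepare_caption`, the captions to encode would be those selected by `caption_ids` from `caption_data`.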

In [0]:
# DO NOT CHANGE BELOW CODE

train_caps = []
for cap in train_data['caps']:
    train_caps.append(cap.decode())

val_caps = []
for cap in val_data['caps']:
    val_caps.append(cap.decode())

In [0]:
# DO NOT CHANGE BELOW CODE

train_image_ids, train_caption_ids = sampling_img_cap(train_data)
val_image_ids, val_caption_ids = sampling_img_cap(val_data)

x_caption = prepare_caption(train_caption_ids, train_caps)
x_image = train_data['ims'][np.array(train_image_ids)]

x_val_caption = prepare_caption(val_caption_ids, val_caps)
x_val_image = val_data['ims'][np.array(val_image_ids)]

### 4. Create noise set for negative examples of image-fake caption and dummy output

Note that we have no real labels for training the model. Keras expects labels, so we create a dummy output: a NumPy array of zeros. These dummy labels are never used, because the loss is computed from the margin between positive examples (image with real caption) and negative examples (image with fake caption).

In [0]:
# YOUR CODE HERE

train_noise = 
val_noise = 

y_train_labels = 
y_val_labels = 
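One simple way to obtain fake captions is to reuse the real caption sequences in a random order, so that each image is paired with a caption that (usually) belongs to another image. The sketch below illustrates this on a toy array; `make_noise_captions` is a hypothetical helper, and other negative-sampling strategies are equally valid.

```python
import numpy as np


def make_noise_captions(caption_seqs, seed=None):
    """Fake captions: the real caption rows in a random order."""
    rng = np.random.RandomState(seed)
    return caption_seqs[rng.permutation(len(caption_seqs))]


caption_seqs = np.arange(12).reshape(4, 3)   # toy stand-in for encoded captions
noise = make_noise_captions(caption_seqs, seed=0)
dummy_labels = np.zeros(len(caption_seqs))   # placeholder targets, ignored by the loss
```

A random permutation can leave a row in its original position, in which case the "fake" caption is actually the real one; for a large data set this is rare, but it is a known limitation of this shortcut.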

### 5. Training model

In [0]:
# YOUR CODE HERE

X_train = 
Y_train = 
X_valid = 
Y_valid = 

In [0]:
# YOUR CODE HERE

# fit the model on training and validation set

#### Storing models and weight parameters

In [0]:
# DO NOT CHANGE BELOW CODE

# Save model
training_model.save(os.path.join(MODEL_PATH,'image_caption_model.h5'))
# Save weight parameters
training_model.save_weights(os.path.join(MODEL_PATH, 'weights_image_caption.hdf5'))

# Save model for encoding caption and image
caption_model.save(os.path.join(MODEL_PATH,'caption_model.h5'))
image_model.save(os.path.join(MODEL_PATH,'image_model.h5'))

### 6. Feature extraction (Neural codes)

In [0]:
# YOUR CODE HERE

# Use caption_model and image_model to produce "Neural codes" 
# for both image and caption from validation set

### 7. Caption Retrieval

#### Display original image as query and its ground truth caption

In [0]:
import matplotlib.pyplot as plt
%matplotlib inline
from keras.preprocessing import image

In [0]:
# YOUR CODE HERE

# choose one image_id from validation set
# use this id to get filepath of image
img_id = 
filepath_image = 

# display original caption
original_caption = 
print(original_caption)

# DO NOT CHANGE BELOW CODE
img = image.load_img(os.path.join(IMAGE_DATA,filepath_image), target_size=(224,224))
plt.imshow(img)
plt.axis("off")
plt.show()

In [0]:
# function to retrieve caption, given an image query

def get_caption(image_filename, n=10):   
    
    # YOUR CODE HERE
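The core of retrieval is ranking all candidate neural codes by their similarity to the query code. The sketch below uses cosine similarity, which is one common choice and not necessarily the one intended here; the plain dot product used during training is another option. `top_n_by_cosine` and the toy vectors are illustrative.

```python
import numpy as np


def top_n_by_cosine(query_code, candidate_codes, n=10):
    """Return indices and scores of the n candidates most similar to the query."""
    eps = 1e-12  # guards against division by zero for all-zero vectors
    q = query_code / (np.linalg.norm(query_code) + eps)
    c = candidate_codes / (np.linalg.norm(candidate_codes, axis=1, keepdims=True) + eps)
    sims = c.dot(q)                 # cosine similarity of every candidate to the query
    order = np.argsort(-sims)[:n]   # highest similarity first
    return order, sims[order]


query = np.array([1.0, 0.0])
candidates = np.array([[0.0, 1.0], [2.0, 0.0], [1.0, 1.0]])
best_ids, best_scores = top_n_by_cosine(query, candidates, n=2)
```

For caption retrieval the query would be the encoded image and the candidates the encoded validation captions; the returned indices then select the captions to print.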


In [0]:
# DO NOT CHANGE BELOW CODE
get_caption(filepath_image)

Briefly discuss the result: why or how does it work, and why do you think it fails in some cases?

#### Answer:

=== write your answer here ===

### 8. Image Retrieval

In [0]:
# given text query, display retrieved image, similarity score, and its original caption 

def search_image(text_caption, n=10):
    
    # YOUR CODE HERE
    

Consider using the following settings for the image retrieval task.

* Use a real caption from the validation set as the query.
* Use part of a caption as the query. For instance, instead of the whole caption sentence, use a key phrase or a combination of words that appears in the corresponding caption.

In [0]:
# Example of text query 
# text = 'two giraffes standing near trees'

# YOUR QUERY-1
text1 = 

# DO NOT CHANGE BELOW CODE
search_image(text1)

In [0]:
# YOUR QUERY-2
text2 = 

# DO NOT CHANGE BELOW CODE
search_image(text2)

Briefly discuss the result: why or how does it work, and why do you think it fails in some cases?

#### Answer:

=== write your answer here ===