# Assignment 3.3

# Image Caption Retrieval Model

### 1. Data preprocessing

We will use the Microsoft COCO (Common Objects in Context) data set to train our "Image Caption Retrieval Model". The data set provided here consists of precomputed 10-crop VGG19 features (neural codes) for each image, together with the corresponding text captions.


In [3]:
from __future__ import print_function

import os
import sys
import numpy as np
import pandas as pd
from collections import OrderedDict

DATA_PATH = 'data'
EMBEDDING_PATH = 'embeddings'
MODEL_PATH = 'models'

You will need to create the directories above and place the provided data set in the 'data' directory.
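The directory layout could be created with a few lines of Python. This is a minimal sketch: `ensure_dirs` is a hypothetical helper name, and the data files still have to be copied into `data` by hand.

```python
import os

def ensure_dirs(base_dir, names=('data', 'embeddings', 'models')):
    """Create each named sub-directory under base_dir if it is missing."""
    created = []
    for name in names:
        path = os.path.join(base_dir, name)
        os.makedirs(path, exist_ok=True)  # no error if the directory already exists
        created.append(path)
    return created
```

Calling `ensure_dirs('.')` from the notebook's working directory would create the three folders next to the notebook.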

#### Reading pairs of image (VGG19 features) - caption data

In [4]:
# DO NOT CHANGE BELOW CODE

import collections

np_train_data = np.load(os.path.join(DATA_PATH,'train_data.npy'))
np_val_data = np.load(os.path.join(DATA_PATH,'val_data.npy'))

train_data = collections.OrderedDict()
for i in range(len(np_train_data.item())):
    cap =  np_train_data.item()['caps']
    img =  np_train_data.item()['ims']
    train_data['caps'] = cap
    train_data['ims'] = img
    
val_data = collections.OrderedDict()
for i in range(len(np_val_data.item())):
    cap =  np_val_data.item()['caps']
    img =  np_val_data.item()['ims']
    val_data['caps'] = cap
    val_data['ims'] = img

In [76]:
np_train_data.item().keys()

dict_keys(['ims', 'caps'])

In [74]:
np.asarray(np_train_data.item()['ims']).shape


(5000, 4096)

In [91]:
np.asarray(np_val_data.item()['caps']).shape

(25000,)

In [81]:
# example of caption
np_train_data.item()['caps'][0]

b'a woman wearing a net on her head cutting a cake'

In [142]:
# the same caption read through train_data (the precomputed VGG19 features are in train_data['ims'])
train_data['caps'][0]

b'a woman wearing a net on her head cutting a cake'

#### Reading caption and information about its corresponding raw images from Microsoft COCO website

In [5]:
# DO NOT CHANGE BELOW CODE
# use them for your own additional preprocessing step
# to map precomputed features and location of raw images 

import json

with open(os.path.join(DATA_PATH,'instances_val2014.json')) as json_file:
    coco_instances_val = json.load(json_file)
    
with open(os.path.join(DATA_PATH,'captions_val2014.json')) as json_file:
    coco_caption_val = json.load(json_file)

In [None]:
#todo: mapping

#### Additional preprocessing

In [41]:
coco_instances_val.keys()

dict_keys(['info', 'images', 'licenses', 'annotations', 'categories'])

In [47]:
coco_instances_val['images']

[{'coco_url': 'http://images.cocodataset.org/val2014/COCO_val2014_000000391895.jpg',
  'date_captured': '2013-11-14 11:18:45',
  'file_name': 'COCO_val2014_000000391895.jpg',
  'flickr_url': 'http://farm9.staticflickr.com/8186/8119368305_4e622c8349_z.jpg',
  'height': 360,
  'id': 391895,
  'license': 3,
  'width': 640},
 {'coco_url': 'http://images.cocodataset.org/val2014/COCO_val2014_000000522418.jpg',
  'date_captured': '2013-11-14 11:38:44',
  'file_name': 'COCO_val2014_000000522418.jpg',
  'flickr_url': 'http://farm1.staticflickr.com/1/127244861_ab0c0381e7_z.jpg',
  'height': 480,
  'id': 522418,
  'license': 4,
  'width': 640},
 {'coco_url': 'http://images.cocodataset.org/val2014/COCO_val2014_000000184613.jpg',
  'date_captured': '2013-11-14 12:36:29',
  'file_name': 'COCO_val2014_000000184613.jpg',
  'flickr_url': 'http://farm3.staticflickr.com/2169/2118578392_1193aa04a0_z.jpg',
  'height': 336,
  'id': 184613,
  'license': 3,
  'width': 500},
 {'coco_url': 'http://images.cocoda

In [59]:
# create your own function to map pairs of precomputed features and filepath of raw images
# this will be used later for visualization part
# simple approach: based on matched text caption (see json file)

# YOUR CODE HERE 
coco_instances_val['images'][0]

{'coco_url': 'http://images.cocodataset.org/val2014/COCO_val2014_000000391895.jpg',
 'date_captured': '2013-11-14 11:18:45',
 'file_name': 'COCO_val2014_000000391895.jpg',
 'flickr_url': 'http://farm9.staticflickr.com/8186/8119368305_4e622c8349_z.jpg',
 'height': 360,
 'id': 391895,
 'license': 3,
 'width': 640}

In [56]:
coco_caption_val.keys()

dict_keys(['info', 'images', 'licenses', 'annotations'])

In [64]:
coco_instances_val['annotations'][0]

{'area': 2765.1486500000005,
 'bbox': [199.84, 200.46, 77.71, 70.88],
 'category_id': 58,
 'id': 156,
 'image_id': 558840,
 'iscrowd': 0,
 'segmentation': [[239.97, 260.24, 222.04, 270.49, 199.84, 253.41, 213.5, 227.79,
   259.62, 200.46, 274.13, 202.17, 277.55, 210.71, 249.37, 253.41,
   237.41, 264.51, 242.54, 261.95, 228.87, 271.34]]}

In [86]:
coco_caption_val['annotations'][0]

{'caption': 'A bicycle replica with a clock as the front wheel.',
 'id': 37,
 'image_id': 203564}

In [89]:
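# test caption: here the first training caption, as an example
tes_capt = train_data['caps'][0].decode()
for i in coco_caption_val['annotations']:
    # the JSON captions may differ in case, whitespace or a trailing period
    if(i['caption'].lower().strip().rstrip('.')==tes_capt.lower()):
        print(i)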
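The caption-based mapping suggested above could be built once as a dictionary instead of scanning all annotations for every query. This is a sketch under assumptions: `normalize_caption` and `build_caption_to_file` are hypothetical helpers, the `images` entries of the caption file are assumed to carry the same `id` and `file_name` fields shown for the instances file, and the normalization (lower-case, no trailing period) is a guess at how the precomputed captions were cleaned.

```python
def normalize_caption(text):
    """Lower-case, drop a trailing period and collapse whitespace."""
    return ' '.join(text.lower().strip().rstrip('.').split())

def build_caption_to_file(caption_json):
    """Map each normalized caption to the file name of its image."""
    id_to_file = {img['id']: img['file_name'] for img in caption_json['images']}
    caption_to_file = {}
    for ann in caption_json['annotations']:
        file_name = id_to_file.get(ann['image_id'])
        if file_name is not None:
            caption_to_file[normalize_caption(ann['caption'])] = file_name
    return caption_to_file

# tiny stand-in with the same keys as captions_val2014.json (the file name is illustrative)
toy = {
    'images': [{'id': 203564, 'file_name': 'COCO_val2014_000000203564.jpg'}],
    'annotations': [{'caption': 'A bicycle replica with a clock as the front wheel.',
                     'id': 37, 'image_id': 203564}],
}
lookup = build_caption_to_file(toy)
print(lookup['a bicycle replica with a clock as the front wheel'])
```

With the real JSON, `build_caption_to_file(coco_caption_val)` would give a lookup that can be queried with a decoded caption from `train_data['caps']` or `val_data['caps']`. Captions that occur for more than one image keep only the last file name in this sketch.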


#### Build vocabulary index 

In [6]:
# DO NOT CHANGE BELOW CODE

def build_dictionary(text):

    wordcount = OrderedDict()
    for cc in text:
        words = cc.split()
        for w in words:
            if w not in wordcount:
                wordcount[w] = 0
            wordcount[w] += 1
    words = list(wordcount.keys())
    freqs = list(wordcount.values())
    sorted_idx = np.argsort(freqs)[::-1]
    

    worddict = OrderedDict()
    worddict['<pad>'] = 0
    worddict['<unk>'] = 1
    for idx, sidx in enumerate(sorted_idx):
        worddict[words[sidx]] = idx+2  # 0: <pad>, 1: <unk>
    

    return worddict

# use the resulting vocabulary index as your look up dictionary
# to transform raw text into integer sequences

all_captions = []
all_captions = train_data['caps'] + val_data['caps']

# decode bytes to string format
caps = []
for w in all_captions:
    caps.append(w.decode())
    
words_indices = build_dictionary(caps)
print ('Dictionary size: ' + str(len(words_indices)))
indices_words = dict((v,k) for (k,v) in words_indices.items())

##add custom
#words_indices = dict((k,v) for (k,v) in words_indices.items())

Dictionary size: 11473
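As a small illustration of how such a vocabulary index is used as a look-up table, the sketch below encodes a caption into integer ids and decodes it again. `encode_caption`, `decode_caption` and the toy vocabulary are hypothetical; with the real data, `words_indices` and `indices_words` would take their place.

```python
def encode_caption(text, word_to_index):
    """Turn a caption into a list of integer ids, using <unk> for unseen words."""
    unk = word_to_index['<unk>']
    return [word_to_index.get(word, unk) for word in text.split()]

def decode_caption(ids, index_to_word):
    """Inverse of encode_caption; padding ids (0) are skipped."""
    return ' '.join(index_to_word[i] for i in ids if i != 0)

# toy vocabulary in the same layout as words_indices (0: <pad>, 1: <unk>)
toy_vocab = {'<pad>': 0, '<unk>': 1, 'a': 2, 'cake': 3, 'woman': 4}
toy_inverse = {v: k for k, v in toy_vocab.items()}

ids = encode_caption('a woman cutting a cake', toy_vocab)
print(ids)                               # [2, 4, 1, 2, 3]
print(decode_caption(ids, toy_inverse))  # a woman <unk> a cake
```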


In [111]:
#traindata_ims= 10k of 4096
#traindata_caps= 50k of ??? 40-50?
#valdata_ims= 5k of 4096
#valdata_caps= 25k of ???

len(words_indices)

11473

In [2]:
from keras.layers import Dense, Embedding,Input,LSTM,GRU,Lambda,add,dot,subtract
from keras.models import Model
import keras.backend as K

Using TensorFlow backend.


In [42]:
voc_size = len(indices_words)#11k ish
cap_size = 50


#image network
img_input = Input(shape=(4096,),name='IMG_input')
condense_img = Dense(1024,name='Dense_IMG')(img_input)

#caption input
caption_input = Input(shape=(cap_size,),name='CAP_input')
noise_input = Input(shape=(cap_size,),name='Noise_input')

#shared layers
#embedding_layer = Embedding(voc_size,128,input_length=cap_size,name='Embedding_layer')
vocab_dim = 300 # dimensionality of your word vectors
n_symbols = voc_size + 1 # adding 1 to account for 0th index (for masking)
embedding_weights = np.zeros((n_symbols, vocab_dim))
for word,index in words_indices.items():
    try:
        embedding_weights[index, :] = glove[word]
    except KeyError:
        embedding_weights[index, :] = np.zeros(vocab_dim)
# define inputs here
embedding_layer = Embedding(output_dim=vocab_dim, input_dim=n_symbols, trainable=False)
embedding_layer.build((None,)) # if you don't do this, the next step won't work
embedding_layer.set_weights([embedding_weights])


recurrent_layer = LSTM(1024,name='recurrent_layer')

#inputs into shared layers
embed_caption = embedding_layer(caption_input)
embed_noise = embedding_layer(noise_input)

recurrent_noise = recurrent_layer(embed_noise)
recurrent_caption = recurrent_layer(embed_caption)

#noise and real score
cap_image = dot([condense_img,recurrent_caption],1,name='DotProd_positive_score')
noise_image = dot([condense_img,recurrent_noise],1,name='DotProd_negative_score')

#combined score: 1 - p_i + n_i, the hinge argument that identity_loss clips at 0
score_layer = Lambda(lambda s: 1.0 - s[0] + s[1],name='Score_layer')([cap_image,noise_image])

model = Model(inputs=[img_input,caption_input,noise_input],outputs=score_layer)
model.summary()


__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
CAP_input (InputLayer)          (None, 50)           0                                            
__________________________________________________________________________________________________
Noise_input (InputLayer)        (None, 50)           0                                            
__________________________________________________________________________________________________
IMG_input (InputLayer)          (None, 4096)         0                                            
__________________________________________________________________________________________________
embedding_2 (Embedding)         (None, 50, 300)      3442200     CAP_input[0][0]                  
                                                                 Noise_input[0][0]                
__________



In [43]:
model.compile(optimizer='adam',loss=identity_loss)

### 2. Image - Caption Retrieval Model

### Image model

In [113]:
# YOUR CODE HERE 
from keras.layers import Dense, Embedding,Dot,Input,LSTM,GRU,Add

#image network
img_input = Input(shape=(4096,),name='IMG_input')
condense_img = Dense(1024,name='Dense_IMG')(img_input)

### Caption model

In [8]:
import gensim
from gensim.models import KeyedVectors
path = ".."

#convert GloVe into word2vec format
#gensim.scripts.glove2word2vec.get_glove_info(path)
#gensim.scripts.glove2word2vec.glove2word2vec(path, "glove_converted.txt")

glove = KeyedVectors.load_word2vec_format("../glove_converted.txt", binary=False)





In [130]:
# YOUR CODE HERE

voc_size = len(indices_words)#11k ish
cap_size = 50


# For embedding layer, initialize with pretrained word embedding (GloVe)
#caption input
caption_input = Input(shape=(cap_size,),name='CAP_input')
noise_input = Input(shape=(cap_size,),name='Noise_input')

#shared layers
#embedding_layer = Embedding(voc_size,128,input_length=cap_size,name='Embedding_layer')

############# INIT EMBEDDING LAYER WITH GLOVE
# assemble the embedding_weights in one numpy array
vocab_dim = 300 # dimensionality of your word vectors
n_symbols = voc_size + 1 # adding 1 to account for 0th index (for masking)
embedding_weights = np.zeros((n_symbols, vocab_dim))
for word,index in words_indices.items():
    try:
        embedding_weights[index, :] = glove[word]
    except KeyError:
        embedding_weights[index, :] = np.zeros(vocab_dim)
# define inputs here
embedding_layer = Embedding(output_dim=vocab_dim, input_dim=n_symbols, trainable=True)
embedding_layer.build((None,)) # if you don't do this, the next step won't work
embedding_layer.set_weights([embedding_weights])
############### END INIT




recurrent_layer = LSTM(1024,name='recurrent_layer')

#inputs into shared layers
embed_caption = embedding_layer(caption_input)
embed_noise = embedding_layer(noise_input)

recurrent_noise = recurrent_layer(embed_noise)
recurrent_caption = recurrent_layer(embed_caption)

In [128]:
indices_words.items()



### Join model

In [131]:
# YOUR CODE HERE

# layer for computing dot product between tensors


#noise and real score
cap_image = dot([condense_img,recurrent_caption],1,name='DotProd_positive_score')
noise_image = dot([condense_img,recurrent_noise],1,name='DotProd_negative_score')



### Main model for training stage

In [134]:
# YOUR CODE HERE

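# define your model input and output
# combined score: 1 - p_i + n_i (p_i: positive score, n_i: negative score)
from keras.layers import Lambda
score_layer = Lambda(lambda s: 1.0 - s[0] + s[1],name='Score_layer')([cap_image,noise_image])

training_model=Model(inputs=[img_input,caption_input,noise_input],outputs=score_layer)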



print ("loading the training model")
training_model.summary()


loading the training model
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
CAP_input (InputLayer)          (None, 300)          0                                            
__________________________________________________________________________________________________
Noise_input (InputLayer)        (None, 300)          0                                            
__________________________________________________________________________________________________
IMG_input (InputLayer)          (None, 4096)         0                                            
__________________________________________________________________________________________________
embedding_2 (Embedding)         (None, 300, 300)     3442200     CAP_input[0][0]                  
                                                                 Noise_input[0][0]



### Retrieval model

In [None]:
# YOUR CODE HERE

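# define your model input and output
# each sub-model reuses layers of the training model, so it shares the trained weights

print ("loading sub-models for retrieving Neural codes")
caption_model = Model(inputs=caption_input, outputs=recurrent_caption)
image_model = Model(inputs=img_input, outputs=condense_img)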

### Loss function

We define our loss function so that it maximizes the margin between a positive and a negative example. If $p_i$ is the score of the positive pair of the $i$-th example and $n_i$ is the score of its negative pair, the loss is:

\begin{equation*}
\mathrm{loss} = \sum_i \max(0, 1 - p_i + n_i)
\end{equation*}
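To make the formula concrete, here is a worked example in plain Python. It is only an arithmetic sketch with made-up scores; `max_margin_loss_values` is a hypothetical helper, not the Keras loss used for training.

```python
def max_margin_loss_values(pos_scores, neg_scores, margin=1.0):
    """Per-example hinge terms max(0, margin - p_i + n_i)."""
    return [max(0.0, margin - p + n) for p, n in zip(pos_scores, neg_scores)]

pos = [2.5, 0.25, 1.0]   # p_i: image scored against its real caption
neg = [0.5, 0.75, 0.5]   # n_i: image scored against a fake caption
terms = max_margin_loss_values(pos, neg)
print(terms)       # [0.0, 1.5, 0.5]
print(sum(terms))  # 2.0
```

The first example contributes nothing because the positive score already exceeds the negative one by more than the margin of 1. The second is penalized most because the fake caption scores higher than the real one.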

In [9]:
from keras import backend as K

def identity_loss(y_true, y_pred):
    score = K.maximum(0.0,y_pred)
    loss = K.sum(score)
    #print(loss)
    return loss

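def max_margin_loss(y_true, y_pred):
    
    # YOUR CODE HERE
    # assumption: the model output y_pred already holds 1 - p_i + n_i,
    # so the loss only has to clip at zero and sum; y_true is a dummy
    loss_ = K.sum(K.maximum(0.0, y_pred))
    
    return loss_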

def custom_loss(score):
    score = 1 - score
    loss = max(0,score)
    return loss
   

#### Accuracy metric for max-margin loss
How many times did the positive pair effectively get a higher value than the negative pair?
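In plain Python the metric amounts to counting wins of the positive pair. The sketch below uses made-up scores, and `margin_accuracy` is a hypothetical name.

```python
def margin_accuracy(pos_scores, neg_scores):
    """Fraction of examples whose positive score beats the negative score."""
    pairs = list(zip(pos_scores, neg_scores))
    wins = sum(1 for p, n in pairs if p > n)
    return wins / len(pairs)

print(margin_accuracy([2.5, 0.25, 1.0, 0.5], [0.5, 0.75, 0.5, 0.1]))  # 0.75
```

If the network outputs only the combined term $1 - p_i + n_i$ rather than the two scores separately, the same test becomes whether that value is below 1.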

In [None]:
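# YOUR CODE HERE
def accuracy(y_true, y_pred):
    
    # assumption: y_pred holds 1 - p_i + n_i, so p_i > n_i exactly when y_pred < 1
    accuracy_ = K.mean(K.cast(K.less(y_pred, 1.0), K.floatx()))
    
    return accuracy_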


### Compile model

In [None]:
# DO NOT CHANGE BELOW CODE
print ("compiling the training model")
training_model.compile(optimizer='adam', loss=max_margin_loss, metrics=[accuracy])

### 3. Data preparation for training the model

* pad or truncate captions to a fixed maximum length (50 words)
* sample one caption for each image, while shuffling the image data
* encode captions as integer sequences using the look-up vocabulary index

In [112]:
# sampling one caption per image
# return image_ids, caption_ids


def sampling_img_cap(data):
    # data has the keys 'ims' and 'caps'; the code assumes that
    # captions 5*i .. 5*i+4 belong to image i
    datalen=len(data['ims'])
    image_ids = np.arange(datalen)
    np.random.shuffle(image_ids)
    # pick one of the five captions of each (shuffled) image at random
    caption_ids=[image_ids[x]*5+np.random.randint(0, 5) for x in range(datalen)]
    
    return image_ids, caption_ids

#train_image_ids, train_caption_ids = sampling_img_cap(train_data)

In [155]:
# transform raw text caption into integer sequences of fixed maximum length


def make_50(arr):
    # pad with 0 (<pad>) or truncate to exactly 50 ids
    return_arr=np.zeros(50,dtype=int)
    limit_id=min(50,len(arr))
    for i in range(limit_id):
        return_arr[i]=arr[i]
    return return_arr
    
def prepare_caption(caption_ids, caption_data):
    
    # YOUR CODE HERE
    unk=words_indices['<unk>']
    cap_transformed=[caption_data[i] for i in caption_ids]
    # words missing from the vocabulary map to <unk> instead of raising KeyError
    caption_seqs = [[words_indices.get(word,unk) for word in sentence.split() ] for sentence in cap_transformed ]
    caption_seqs=np.asarray([make_50(i) for i in caption_seqs])
    
    return caption_seqs

#x_caption = prepare_caption(train_caption_ids, train_caps)

In [204]:
x_caption

[[2, 2665, 121, 43, 1386, 7, 317, 4, 3537],
 [2, 29, 198, 201, 7, 2, 4060, 6, 37],
 [13, 910, 580, 4, 2, 21, 457, 5, 22],
 [2, 48, 388, 9, 46, 41, 89, 3, 5, 370],
 [2, 121, 43, 7, 2, 106, 3, 5, 188],
 [869, 790, 7, 1669, 1293, 8, 380, 3, 2, 58, 8, 20, 42],
 [2, 38, 258, 4, 157, 8, 2, 3447, 114, 62],
 [79, 1305, 4, 2, 872, 8, 2, 311, 4, 1103, 547, 327],
 [2, 878, 36, 395, 95, 4, 2, 2305, 211],
 [2, 10, 12, 15, 13, 443, 23, 1560, 41, 113],
 [2, 42, 4, 584, 7, 5075, 1233, 8, 209],
 [14, 249, 114, 62, 6, 2, 34, 7, 122],
 [2, 81, 672, 78, 2, 177, 143, 2, 29, 120],
 [5, 44, 9, 100, 3, 5, 819, 6, 5, 366],
 [2, 10, 6, 41, 51, 206, 2, 93, 4, 856],
 [2, 387, 1137, 317, 4, 1163, 212, 8, 72, 1029],
 [2, 1617, 441, 6, 5, 138, 7, 2, 440, 152, 6, 170],
 [2, 29, 644, 388, 135, 231, 49, 856],
 [2, 47, 90, 3, 62, 19, 11, 2, 145, 166, 56, 2, 26, 265, 2, 1031, 8, 2, 398, 141, 7, 47, 5331],
 [14, 2863, 17, 3, 2, 42, 7, 161, 936, 8, 327],
 [13, 991, 359, 36, 395, 

In [73]:
train_caps

['a woman wearing a net on her head cutting a cake',
 'a woman cutting a large white sheet cake',
 'a woman wearing a hair net cutting a large sheet cake',
 'there is a woman that is cutting a white cake',
 'a woman marking a cake with the back of a chefs knife',
 'a young boy standing in front of a computer keyboard',
 'a little boy wearing headphones and looking at a computer monitor',
 'he is listening intently to the computer at school',
 'a young boy stares up at the computer monitor',
 'a young kid with head phones on using a computer',
 'a boy wearing headphones using one computer in a long row of computers',
 'a little boy with earphones on listening to something',
 'a group of people sitting at desk using computers',
 'children sitting at computer stations on a long table',
 'a small child wearing headphones plays on the computer',
 'a man is in a kitchen making pizzas',
 'man in apron standing on front of oven with pans and bakeware',
 'a baker is working in the kitchen rolli

In [145]:
# DO NOT CHANGE BELOW CODE

train_caps = []
for cap in train_data['caps']:
    train_caps.append(cap.decode())

val_caps = []
for cap in val_data['caps']:
    val_caps.append(cap.decode())

In [156]:
# DO NOT CHANGE BELOW CODE

train_image_ids, train_caption_ids = sampling_img_cap(train_data)
val_image_ids, val_caption_ids = sampling_img_cap(val_data)

x_caption = prepare_caption(train_caption_ids, train_caps)
x_image = train_data['ims'][np.array(train_image_ids)]

x_val_caption = prepare_caption(val_caption_ids, val_caps)
x_val_image = val_data['ims'][np.array(val_image_ids)]

In [157]:
print(x_caption[0])

[  2 226   4 414  24  52 200 176   4  27   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0]


In [189]:
print(train_image_ids[0])

4504


In [190]:
print(train_data['ims'][4504])

[0.00242774 0.00609638 0.         ... 0.00065424 0.00311061 0.00114428]


### 4. Create noise set for negative examples of image-fake caption and dummy output

Note that we have no real labels for training the model. Keras nevertheless expects a target array, so we create a dummy output: a NumPy array of zeros. These dummy labels are never used, because the loss is computed from the margin between positive examples (image and real caption) and negative examples (image and fake caption).
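One possible way to build the negative set is to pair each image with a real caption that belongs to a different image. This is a different technique from drawing random token ids, shown here only as a sketch on stand-in data; `make_negative_captions` is a hypothetical helper and it needs at least two rows.

```python
import numpy as np

def make_negative_captions(captions, rng):
    """Pair every row with the caption of a different row by a random cyclic shift."""
    n = len(captions)
    shift = rng.randint(1, n)  # 1 <= shift < n, so no row keeps its own caption
    return np.roll(captions, shift, axis=0)

rng = np.random.RandomState(0)
toy_captions = np.arange(12).reshape(4, 3)    # 4 stand-in captions of length 3
toy_noise = make_negative_captions(toy_captions, rng)
dummy_labels = np.zeros(len(toy_captions))    # ignored by the loss, required by fit()
```

Applied to the real arrays, the call would take the padded caption matrix (for example `x_caption`) and return a matrix of the same shape to feed into the noise input.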

In [177]:
# YOUR CODE HERE

# fake captions: 50 random token ids per example
# (ids are drawn from 0..49, i.e. only <pad>, <unk> and the most frequent words)
train_noise = np.asarray([np.random.randint(0,50,size=50) for y in x_caption])
val_noise = np.asarray([np.random.randint(0,50,size=50)for y in x_val_caption])

# dummy labels, one per example; never used by the loss
y_train_labels = np.zeros(len(x_image))
y_val_labels = np.zeros(len(x_val_image))

In [178]:
print(y_train_labels.shape)

(10000,)


### 5. Training model

In [179]:
# YOUR CODE HERE

X_train = [x_image,x_caption,train_noise]
Y_train = y_train_labels
X_valid = [x_val_image,x_val_caption,val_noise]
Y_valid = y_val_labels

In [180]:
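batch_size = 100  # matches a.shape=(100, 4096) in the log below
epochs = 20       # matches "Epoch 1/20" in the log below
model.fit(X_train,Y_train, validation_data=(X_valid, Y_valid), batch_size=batch_size, epochs=epochs)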

Train on 10000 samples, validate on 5000 samples
Epoch 1/20


InternalError: Blas GEMM launch failed : a.shape=(100, 4096), b.shape=(4096, 1024), m=100, n=1024, k=4096
	 [[Node: Dense_IMG_3/MatMul = MatMul[T=DT_FLOAT, transpose_a=false, transpose_b=false, _device="/job:localhost/replica:0/task:0/device:GPU:0"](_arg_IMG_input_3_0_1/_97, Dense_IMG_3/kernel/read)]]
	 [[Node: loss_1/mul/_115 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_2896_loss_1/mul", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]

Caused by op 'Dense_IMG_3/MatMul', defined at:
  File "<ipython-input-42-f65aa8b846ef>", line 7, in <module>
    condense_img = Dense(1024,name='Dense_IMG')(img_input)
  File "c:\users\illia\appdata\local\conda\conda\envs\tensorflow-gpu\lib\site-packages\keras\engine\topology.py", line 619, in __call__
    output = self.call(inputs, **kwargs)
  File "c:\users\illia\appdata\local\conda\conda\envs\tensorflow-gpu\lib\site-packages\keras\layers\core.py", line 855, in call
    output = K.dot(inputs, self.kernel)
  File "c:\users\illia\appdata\local\conda\conda\envs\tensorflow-gpu\lib\site-packages\keras\backend\tensorflow_backend.py", line 1075, in dot
    out = tf.matmul(x, y)
  File "c:\users\illia\appdata\local\conda\conda\envs\tensorflow-gpu\lib\site-packages\tensorflow\python\ops\math_ops.py", line 2108, in matmul
    a, b, transpose_a=transpose_a, transpose_b=transpose_b, name=name)
  File "c:\users\illia\appdata\local\conda\conda\envs\tensorflow-gpu\lib\site-packages\tensorflow\python\ops\gen_math_ops.py", line 4492, in mat_mul
    name=name)
  File "c:\users\illia\appdata\local\conda\conda\envs\tensorflow-gpu\lib\site-packages\tensorflow\python\framework\op_def_library.py", line 787, in _apply_op_helper
    op_def=op_def)
  File "c:\users\illia\appdata\local\conda\conda\envs\tensorflow-gpu\lib\site-packages\tensorflow\python\framework\ops.py", line 3290, in create_op
    op_def=op_def)
  File "c:\users\illia\appdata\local\conda\conda\envs\tensorflow-gpu\lib\site-packages\tensorflow\python\framework\ops.py", line 1654, in __init__
    self._traceback = self._graph._extract_stack()  # pylint: disable=protected-access



#### Storing models and weight parameters

In [None]:
# DO NOT CHANGE BELOW CODE

# Save model
training_model.save(os.path.join(MODEL_PATH,'image_caption_model.h5'))
# Save weight parameters
training_model.save_weights(os.path.join(MODEL_PATH, 'weights_image_caption.hdf5'))

# Save model for encoding caption and image
caption_model.save(os.path.join(MODEL_PATH,'caption_model.h5'))
image_model.save(os.path.join(MODEL_PATH,'image_model.h5'))

### 6. Feature extraction (Neural codes)

In [None]:
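# YOUR CODE HERE

# Use caption_model and image_model to produce "Neural codes" 
# for both image and caption from validation set
# (the two variable names below are our own choice)
val_caption_codes = caption_model.predict(x_val_caption)
val_image_codes = image_model.predict(x_val_image)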

### 7. Caption Retrieval

#### Display original image as query and its ground truth caption

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline
from keras.preprocessing import image

In [None]:
# YOUR CODE HERE

# choose one image_id from validation set
# use this id to get filepath of image
img_id = 
filepath_image = 

# display original caption
original_caption = 
print(original_caption)

# DO NOT CHANGE BELOW CODE
img = image.load_img(os.path.join(IMAGE_DATA,filepath_image), target_size=(224,224))
plt.imshow(img)
plt.axis("off")
plt.show()

In [None]:
# function to retrieve caption, given an image query

def get_caption(image_filename, n=10):   
    
    # YOUR CODE HERE


In [None]:
# DO NOT CHANGE BELOW CODE
get_caption(filepath_image)
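The ranking step inside a function like `get_caption` could look as follows. This is a sketch on stand-in vectors: `top_n_by_dot` is a hypothetical helper, and it ranks by dot product because that is the score the training model computes between image and caption codes. In practice the codes would come from `image_model` and `caption_model`.

```python
import numpy as np

def top_n_by_dot(query_code, candidate_codes, n=10):
    """Indices and scores of the n candidates with the largest dot product."""
    scores = candidate_codes @ query_code
    order = np.argsort(-scores)[:n]   # descending by score
    return order, scores[order]

# stand-in codes: 4 captions and one image query in a 3-d joint space
caption_codes = np.array([[1.0, 0.0, 0.0],
                          [0.0, 1.0, 0.0],
                          [0.5, 0.5, 0.0],
                          [0.0, 0.0, 1.0]])
image_code = np.array([0.9, 0.2, 0.0])
idx, scores = top_n_by_dot(image_code, caption_codes, n=2)
print(idx)  # [0 2]
```

The returned indices would then be used to look up the caption strings, for example in `val_caps`.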

Briefly discuss the result: why and how does it work, and why do you think it fails in some cases?

#### Answer:

=== write your answer here ===

### 8. Image Retrieval

In [None]:
# given text query, display retrieved image, similarity score, and its original caption 

def search_image(text_caption, n=10):
    
    # YOUR CODE HERE
    

Consider using the following settings for the image retrieval task.

* Use a real caption from the validation set as the query.
* Use part of a caption as the query. For instance, instead of the whole sentence, use a key phrase or a combination of words that occurs in the corresponding caption.
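For the image-retrieval direction, the core of a function like `search_image` might be sketched as below. It assumes the query text has already been turned into a caption code, and that the five captions of image `i` are stored at positions `5*i` to `5*i+4`, as the sampling code in this notebook implies. `search_images_by_code` and the toy arrays are hypothetical.

```python
import numpy as np

def search_images_by_code(caption_code, image_codes, captions, n=3, caps_per_image=5):
    """Rank images against a caption code.

    Returns (image index, score, first stored caption of that image) triples.
    """
    scores = image_codes @ caption_code
    best = np.argsort(-scores)[:n]    # descending by score
    return [(int(i), float(scores[i]), captions[int(i) * caps_per_image]) for i in best]

# stand-in data: 2 images in a 2-d joint space, 5 captions each
toy_image_codes = np.array([[1.0, 0.0],
                            [0.0, 1.0]])
toy_captions = ['img%d cap%d' % (i, k) for i in range(2) for k in range(5)]
query_code = np.array([0.1, 0.8])

results = search_images_by_code(query_code, toy_image_codes, toy_captions, n=2)
print(results[0][0], results[0][2])  # 1 img1 cap0
```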

In [None]:
# Example of text query 
# text = 'two giraffes standing near trees'

# YOUR QUERY-1
text1 = 

# DO NOT CHANGE BELOW CODE
search_image(text1)

In [None]:
# YOUR QUERY-2
text2 = 

# DO NOT CHANGE BELOW CODE
search_image(text2)

Briefly discuss the result: why and how does it work, and why do you think it fails in some cases?

#### Answer:

=== write your answer here ===