# Tutorial on Simple Visual Question Answering 

### Understand the challenge

![](https://github.com/mateuszmalinowski/visual_turing_test-tutorial/raw/master/fig/challenges.jpg)

### Understand the data

We can read parts of the training file to see how questions and answers are prepared in the corpus:

In [2]:
! head -15 data/daquar/qa.894.raw.train.format_triple 

what is on the right side of the black telephone and on the left side of the red chair ?
desk
image3
what is in front of the white door on the left side of the desk ?
telephone
image3
what is on the desk ?
book, scissor, papers, tape_dispenser
image3
what is the largest brown objects ?
carton
image3
what color is the chair in front of the white wall ?
red
image3
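
The listing above repeats a three-line pattern: a question, a comma-separated list of answers, and the name of the image the question refers to. A minimal sketch of reading that layout is shown here, using a small in-memory sample instead of the real file; the helper name `read_triples` is an assumption made for this illustration and is not part of the tutorial code.

```python
from itertools import islice

# A small in-memory sample in the same three-line layout as the corpus.
sample = """what is on the desk ?
book, scissor, papers, tape_dispenser
image3
what color is the chair in front of the white wall ?
red
image3
"""

def read_triples(lines):
    """Yield (question, answers, image_name) from an iterable of lines."""
    it = (line.strip() for line in lines)
    while True:
        triple = list(islice(it, 3))
        if len(triple) < 3:
            break  # end of input, or an incomplete trailing record
        question, answer_line, image_name = triple
        # answers are comma separated; drop the space after each comma
        answers = [a.strip() for a in answer_line.split(',')]
        yield question, answers, image_name

triples = list(read_triples(sample.splitlines()))
print(triples[0])
```

Note that one question can carry several answers, which is why the answers are kept as a list.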


You can display the image in a Jupyter notebook with the following Markdown:

`![](data/daquar/images/image3.png)`

![](data/daquar/images/image3.png)

## Pre-trained visual features

Precomputed image features are available, and we could use them directly:

In [3]:
! find data/daquar/visual_features/*/*.npy

data/daquar/visual_features/fb_resnet/blobs.l2_res5c-152.npy
data/daquar/visual_features/fb_resnet/blobs.res5c-152.npy
data/daquar/visual_features/googlenet/blobs.loss3-classifier.npy
data/daquar/visual_features/googlenet/blobs.pool5-7x7_s1.npy
data/daquar/visual_features/googlenet/blobs.prob.npy


These features were learned by end-to-end neural networks trained for the ImageNet object recognition task, where the network learns invariant representations of images.

![](https://raw.githubusercontent.com/mateuszmalinowski/visual_turing_test-tutorial/master/fig/features_extractor.jpg)


To produce visual features for a given image, we take the output of the last layer of a convolutional neural network (CNN) pre-trained for object recognition on the ImageNet dataset.

In [4]:
import os
os.environ["CUDA_DEVICE_ORDER"]= "PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"]= "0"

import numpy as np
from keras.applications.resnet50 import ResNet50
from keras.applications.imagenet_utils import preprocess_input
from keras.preprocessing import image as kimage


Using TensorFlow backend.


In [5]:
# If you are running this for the first time on this machine, Keras will download the pre-trained weights.
pretrained_cnn_model = ResNet50(weights='imagenet', include_top=False)

In [6]:
pretrained_cnn_model.summary()

____________________________________________________________________________________________________
Layer (type)                     Output Shape          Param #     Connected to                     
input_1 (InputLayer)             (None, None, None, 3) 0                                            
____________________________________________________________________________________________________
conv1 (Conv2D)                   (None, None, None, 64 9472        input_1[0][0]                    
____________________________________________________________________________________________________
bn_conv1 (BatchNormalization)    (None, None, None, 64 256         conv1[0][0]                      
____________________________________________________________________________________________________
activation_1 (Activation)        (None, None, None, 64 0           bn_conv1[0][0]                   
___________________________________________________________________________________________

For example, the image `image3.png` can be processed with this model:

In [7]:
# first we read the image from file and resize it to the ImageNet input size
image3 = kimage.load_img("data/daquar/images/image3.png", target_size=[224,224])

# then we convert it to a numpy array:
image3_array = kimage.img_to_array(image3)

# the neural network is designed to process a batch of images as input,
# so we wrap the image in an array of images:
images_array = np.array([
    image3_array,
])

# At this point the image is an RGB array
# with values in the range 0-255 for each color channel.
# We have to make sure that this matches the input format of the pre-trained network
# (in this case, the following call converts RGB to BGR).
images_ready = preprocess_input(images_array)
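
To make the preprocessing step less opaque, here is a rough NumPy sketch of what Caffe-style ImageNet preprocessing does: reverse the channel order from RGB to BGR and subtract a per-channel mean. This assumes the default `caffe` mode of `preprocess_input`; the function name `caffe_style_preprocess` is hypothetical, and the sketch is not a replacement for the Keras call.

```python
import numpy as np

# Per-channel ImageNet means in BGR order, as used by Caffe-style preprocessing.
BGR_MEANS = np.array([103.939, 116.779, 123.68], dtype=np.float32)

def caffe_style_preprocess(batch_rgb):
    """batch_rgb: float array of shape (n, height, width, 3), RGB, values 0-255."""
    batch_bgr = batch_rgb[..., ::-1]   # reverse the channel axis: RGB -> BGR
    return batch_bgr - BGR_MEANS       # broadcast the means over the last axis

# a 2x2 pure red image, as a batch of one
demo = np.zeros((1, 2, 2, 3), dtype=np.float32)
demo[..., 0] = 255.0
out = caffe_style_preprocess(demo)
print(out[0, 0, 0])
```

The subtraction returns a new array, so the input batch is left unchanged in this sketch.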

Now we can pass the pre-processed image to the CNN model to produce visual features:

In [8]:
vfeatures = pretrained_cnn_model.predict(images_ready)

image3_features = vfeatures[0].flatten()

print('     dtype:', image3_features.dtype)
print(' dimension:', image3_features.shape[0])
print('the vector:', image3_features)


     dtype: float32
 dimension: 2048
the vector: [ 0.          0.08101185  0.0076139  ...,  0.16474207  0.18622221
  1.01108205]


In [9]:
# we can make this into a function (from file path to feature vectors)
def img2vec(image_path):
    x = kimage.load_img(image_path, target_size=[224,224])
    x_array = kimage.img_to_array(x)
    xs_array = np.array([x_array,])

    return pretrained_cnn_model.predict(preprocess_input(xs_array)).flatten()



## A Vision and Language Network

### Understand this neural network model

In this tutorial, we want to build a neural network that takes a question and its context picture as inputs and outputs a single-word answer:

![](https://github.com/mateuszmalinowski/visual_turing_test-tutorial/raw/master/fig/LSTM_vision_model.jpg)

### Preprocess and prepare the data

The first goal is to prepare the text for processing. We also apply techniques for handling out-of-vocabulary (OOV) words, e.g. replacing rare words with `<unk>` and numbers with `<num>`.

***(1) Build the vocabulary, (2) remove punctuation, and (3) mask numbers in the text if needed.***
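
The three steps can be sketched on a toy corpus as follows. This is only an illustration under assumed names (`tokenize`, `min_count`, and the three sentences are made up here); the tutorial's own preprocessing differs in details.

```python
import string
from collections import Counter

def tokenize(sentence):
    """Split on whitespace, strip trailing punctuation, mask numeric tokens."""
    tokens = []
    for raw in sentence.split():
        word = raw.rstrip(string.punctuation)
        if not word:
            continue              # the token was pure punctuation, e.g. "?"
        try:
            float(word)           # note: float() also accepts "nan" and "inf"
            word = '<num>'
        except ValueError:
            pass
        tokens.append(word)
    return tokens

corpus = ["how many chairs are there ?", "there are 4 chairs", "how many lamps ?"]

# (1) build the vocabulary with frequencies
counts = Counter(w for s in corpus for w in tokenize(s))

# keep words seen at least min_count times, plus the special number token
min_count = 2
kept = {w for w, c in counts.items() if c >= min_count} | {'<num>'}

# replace everything else with <unk>
masked = [[w if w in kept else '<unk>' for w in tokenize(s)] for s in corpus]
print(masked)
```

Words that occur only once, such as `lamps` in this toy corpus, end up as `<unk>`.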

In [10]:
# You can skip this part and use the prepared files in next step

In [11]:
import string
from collections import Counter, defaultdict 

def preprocess_triple_file(filepath):
    # useful metadata:
    # vocabulary, longest_sentence_length
    
    # vocabulary for each file
    metadata = {
        'questions': {'vocab': Counter(), 'max_len': 0},
        'answers': {'vocab': Counter(), 'max_len': 0},
        'contexts': {'vocab': Counter(), 'max_len': 0},
    }
    
    # split the file into three parallel text files:
    files = {
        'questions': open(filepath+'.questions', 'w'),
        'answers'  : open(filepath+'.answers', 'w'),
        'contexts' : open(filepath+'.contexts', 'w'),
    }
    
    def process(s, mask_numbers=True, metadata=None):
        def is_number(s):
            try:
                float(s)
                return True
            except ValueError:
                return False
        
        word_sequence = [w.strip().rstrip(string.punctuation) for w in s.split()]
        
        if mask_numbers:
            word_sequence = ['<num>' if mask_numbers and is_number(w) else w for w in word_sequence]

        metadata['vocab'].update(word_sequence)
        metadata['max_len'] = max(len(word_sequence), metadata['max_len'])

        return ' '.join(word_sequence)
    
    
    # stateful reading of the file, each line changes the state as follows:
    # question => answer => context => question
    state = 'questions'
    for line in open(filepath, encoding='utf-8'):
        files[state].write(process(line, metadata=metadata[state])+'\n')
        
        if state == 'questions':
            state = 'answers'
        elif state == 'answers':
            state = 'contexts'
        elif state == 'contexts':
            state = 'questions'
        
    for state in files:
        files[state].close()

    np.save(filepath+'.metadata.npy', metadata)
    return None

# process these files and produce new files:
# training qa data "data/daquar/qa.894.raw.train.format_triple"
# testing qa data "data/daquar/qa.894.raw.test.format_triple"

preprocess_triple_file("data/daquar/qa.894.raw.train.format_triple")
preprocess_triple_file("data/daquar/qa.894.raw.test.format_triple")

In [12]:
! ls data/daquar/qa.894.raw.train.format_triple.*
! ls data/daquar/qa.894.raw.test.format_triple.*

data/daquar/qa.894.raw.train.format_triple.answers
data/daquar/qa.894.raw.train.format_triple.contexts
data/daquar/qa.894.raw.train.format_triple.metadata.npy
data/daquar/qa.894.raw.train.format_triple.questions
data/daquar/qa.894.raw.test.format_triple.answers
data/daquar/qa.894.raw.test.format_triple.contexts
data/daquar/qa.894.raw.test.format_triple.metadata.npy
data/daquar/qa.894.raw.test.format_triple.questions


In [13]:
# Now we can read from the preprocessed files, but there is still some preprocessing left to do

In [14]:
# the metadata dict was pickled by np.save, so recent NumPy versions need allow_pickle=True to load it
metadata = np.load('data/daquar/qa.894.raw.train.format_triple.metadata.npy', allow_pickle=True).item()
question_len = metadata['questions']['max_len']
vocab = metadata['questions']['vocab']
# drop the frequencies and the rare words (those seen only once):
vocab = ['<pad>', '<unk>', '?']+[w for w,f in vocab.items() if f > 1]
word2index = defaultdict(lambda: 1, zip(vocab, range(len(vocab)))) # unknown words map to <unk> (index 1)

answers = metadata['answers']['vocab']
answers = ['<unk>']+[w for w,f in answers.items() if f > 1] # don't keep the rare answers.
answer2index = defaultdict(lambda: 0, zip(answers, range(len(answers)))) # unknown answers map to <unk> (index 0)
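
The `defaultdict` trick used for the lookup tables can be tried on a toy vocabulary. The sketch below uses a made-up word list (`demo_vocab`) and also shows one caveat: indexing a `defaultdict` with a missing key stores that key, whereas `dict.get` with a default does not.

```python
from collections import defaultdict

demo_vocab = ['<pad>', '<unk>', '?', 'what', 'is', 'on', 'the', 'desk']
unk_index = demo_vocab.index('<unk>')

# every missing word falls back to the index of <unk>
demo_word2index = defaultdict(lambda: unk_index, zip(demo_vocab, range(len(demo_vocab))))

encoded = [demo_word2index[w] for w in 'what is on the sofa'.split()]
print(encoded)

# Caveat: the lookup of 'sofa' above inserted it into the defaultdict.
size_after_lookup = len(demo_word2index)

# A plain dict with .get() returns the fallback without inserting anything.
plain = dict(zip(demo_vocab, range(len(demo_vocab))))
encoded_plain = [plain.get(w, unk_index) for w in 'what is on the sofa'.split()]
```

Both variants give the same indices; they differ only in whether the table grows when unknown words are looked up.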


To handle large files in Python we would normally write generators. This tutorial does not use a very large dataset, so we can put all the training data into lists:

In [16]:
X_questions = [
    [word2index['<pad>']]*(question_len-len(line.split()))+[word2index[w.strip()] for w in line.split()]+[word2index['?']]
    for line in open('data/daquar/qa.894.raw.train.format_triple.questions')
]

Y_answers = [
    [answer2index[w.strip()] for w in line.split()]
    for line in open('data/daquar/qa.894.raw.train.format_triple.answers')
]
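
For a corpus too large to hold in memory, the generator approach mentioned earlier could look roughly like this. It is a sketch only: `batches` is a hypothetical helper, and an in-memory `io.StringIO` object stands in for a real file handle.

```python
import io
from itertools import islice

def batches(line_iter, batch_size):
    """Lazily group an iterable of lines into lists of at most batch_size items."""
    it = iter(line_iter)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return            # the input is exhausted
        yield batch

# io.StringIO behaves like an open text file for iteration purposes
fake_file = io.StringIO("q1\nq2\nq3\nq4\nq5\n")
stripped = (line.strip() for line in fake_file)

result = list(batches(stripped, 2))
print(result)
```

Only one batch is materialised at a time, so memory use does not grow with the size of the file.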

In [15]:
# this code takes time:
X_images = [
    img2vec("data/daquar/images/{0}.png".format(image_name.strip()))
    for image_name in open('data/daquar/qa.894.raw.train.format_triple.contexts')
]
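
Many questions refer to the same image, so the same features get recomputed many times in the cell above. One possible way to avoid that is to cache results by image name, sketched here with `functools.lru_cache`. The function `cached_features` is a placeholder that returns a dummy tuple instead of calling the CNN, and the image names are made up.

```python
from functools import lru_cache

calls = []  # records every time the expensive function really runs

@lru_cache(maxsize=None)
def cached_features(image_name):
    calls.append(image_name)
    # placeholder for an expensive call such as img2vec(path)
    return (float(len(image_name)),)

contexts = ['image3', 'image3', 'image3', 'image4', 'image4']
features = [cached_features(name) for name in contexts]

print(len(features), 'lookups,', len(calls), 'real computations')
```

The cache hands back the same object on every hit, so if the cached values were NumPy arrays they would be shared and should not be modified in place.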

In [28]:
# repeat each question once per answer, so that every row has exactly one answer:
training_data = list(zip(*[(q,a,i) for q, a_s, i in zip(X_questions, Y_answers, X_images) for a in a_s]))


In [33]:
len(training_data), len(training_data[0]), len(training_data[1]), len(training_data[2])

(3, 7768, 7768, 7768)
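
The flattening step is easier to follow on a toy example. In this sketch (all values are made up), the first question has two answers and therefore appears twice in the result.

```python
questions = [[5, 6, 7], [8, 9]]       # two encoded questions
answer_lists = [[1, 2], [3]]          # the first question has two answers
images = ['image3', 'image4']

# one row per (question, single answer, image)
rows = [
    (q, a, img)
    for q, a_list, img in zip(questions, answer_lists, images)
    for a in a_list
]

# transpose the rows back into three parallel lists
flat_questions, flat_answers, flat_images = (list(col) for col in zip(*rows))
print(flat_answers)
```

After the transpose, the three lists have the same length, which matches the shape reported by the cell above.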

In [None]:
# this preprocessing tool is not the most efficient way to process large files
# but it is designed to be as understandable as possible for this tutorial.

# initialise the vocabulary list
vocab = [
    '<pad>', # the dummy word which will be used to make standard size sentences
    '<eoq>', # annotate the end of question
    '<unk>', # replace the unknown words 
    '<num>', # replace the numbers with this token
]
]

# initialise the list of possible answers to questions.
answers = ['<unk>','<num>']


# training qa data "data/daquar/qa.894.raw.train.format_triple"
# testing qa data "data/daquar/qa.894.raw.test.format_triple"
def preprocess_data(filepath, vocab, answers, test_mode=False):
    # in test mode the vocabulary and the answer list are left unchanged,
    # so unseen words and answers are mapped to '<unk>'

    # initialise the final input and output supervised data
    X_q = [] # questions 
    X_c = [] # contexts
    Y_list = [] # answers

    # stateful reading of the file, each line changes the state as follows:
    # question => answer => context => question
    state = 'question'
    for line in open(filepath, encoding='utf-8'):
        if state == 'question':
            state = 'answer'
            q = line.strip().split()

            # update vocab (word by word, so a word repeated in one question is added only once)
            if not test_mode:
                for w in q:
                    if w not in vocab:
                        vocab.append(w)

            # keep the word indices as prepared input:
            X_q.append([vocab.index(w) if w in vocab else vocab.index('<unk>') for w in q])
        elif state == 'answer':
            state = 'context'
            # answers are comma separated; strip the space that follows each comma
            a = [w.strip() for w in line.strip().split(',')]

            # update the answer list
            if not test_mode:
                for w in a:
                    if w not in answers:
                        answers.append(w)

            # keep the answer indices as prepared output:
            Y_list.append([answers.index(w) if w in answers else answers.index('<unk>') for w in a])
        elif state == 'context':
            state = 'question'
            image_name = line.strip()
            # take the visual features
            image_path = "data/daquar/images/{0}.png".format(image_name)
            X_c.append(img2vec(image_path))

    max_len = max(len(q) for q in X_q)
    
    # mark the end of each question, then pad with a dummy word to make all questions the same size:
    X_q = np.array([q + [vocab.index('<eoq>')] + [vocab.index('<pad>')]*(max_len-len(q)) for q in X_q])
    
    # convert multiple answers to several single answers: 
    q_indices, Y = tuple(zip(*[
        (i, y)
        for i, ys in enumerate(Y_list)
        for y in ys
    ]))
    
    q_indices = list(q_indices)
    X_q = X_q[q_indices]
    X_c = np.array(X_c)[q_indices]
    
    # prepared as (inputs, outputs)
    return ([X_q, X_c], np.array(list(Y)))
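
The tutorial uses two padding layouts: the earlier list-based cell pads on the left and appends the end marker last, while `preprocess_data` appends the end marker and then pads on the right. A small sketch of both, with hypothetical index values for `<pad>` and `<eoq>`:

```python
PAD, EOQ = 0, 1  # hypothetical indices for '<pad>' and '<eoq>'

def pad_right(sequences, pad=PAD, eoq=EOQ):
    """question, end marker, then padding (the layout used in preprocess_data)."""
    max_len = max(len(s) for s in sequences)
    return [s + [eoq] + [pad] * (max_len - len(s)) for s in sequences]

def pad_left(sequences, pad=PAD, eoq=EOQ):
    """padding first, then question and end marker."""
    max_len = max(len(s) for s in sequences)
    return [[pad] * (max_len - len(s)) + s + [eoq] for s in sequences]

encoded = [[4, 5, 6], [7]]
right = pad_right(encoded)
left = pad_left(encoded)
print(right)
print(left)
```

Either way every row ends up with `max_len + 1` entries. With left padding the last time step an LSTM sees is always the end marker rather than padding.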

In [None]:
X, Y = preprocess_data("data/daquar/qa.894.raw.train.format_triple", vocab, answers)

In [None]:
question_len = X[0].shape[1]
visual_vec_len = X[1].shape[1]

In [None]:
print('the number of all possible answers:', len(answers))
print('the size of vocabulary:', len(vocab))
print('the total number of training samples:', len(Y))

In [None]:
X_test, Y_test = preprocess_data("data/daquar/qa.894.raw.test.format_triple", vocab[:], answers[:], test_mode=True)

### A Keras model

In [None]:
from keras.models import Sequential, Model
from keras.layers import Dense, LSTM, Embedding, Concatenate, Dropout
from keras.layers import Input
import keras.backend as K

In [None]:
input_question = Input([question_len,])
input_context = Input([visual_vec_len,])

# learn embeddings (size 50, an arbitrary choice)
q_embs = Embedding(len(vocab), 50)(input_question)

# encode the question
q_encoded = LSTM(50)(q_embs)

mlp_1 = Dense(visual_vec_len, activation='tanh')(q_encoded)
print(mlp_1)

q_composed = Concatenate()([input_context, mlp_1])
print(q_composed)

mlp_2 = Dropout(0.2)(Dense(visual_vec_len, activation='relu')(q_composed))
print(mlp_2)

final_a = Dense(len(answers), activation='softmax')(mlp_2)

model = Model([input_question, input_context], final_a)
model.summary()
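
As a sanity check on what `model.summary()` reports, the number of trainable parameters per layer can be worked out by hand. The sketch below assumes a standard LSTM with four gates and bias terms; the vocabulary size and number of answers are hypothetical values chosen only for illustration, since the real ones depend on the data.

```python
def dense_params(n_in, n_out):
    # weight matrix plus one bias per output unit
    return n_in * n_out + n_out

def lstm_params(n_in, units):
    # four gates, each with input weights, recurrent weights and a bias
    return 4 * ((n_in + units) * units + units)

vocab_size = 900      # hypothetical
n_answers = 500       # hypothetical
emb_dim = 50
lstm_units = 50
visual_len = 2048     # length of the flattened visual feature vector

counts = {
    'embedding': vocab_size * emb_dim,
    'lstm': lstm_params(emb_dim, lstm_units),
    'mlp_1': dense_params(lstm_units, visual_len),
    # mlp_2 sees the visual vector concatenated with mlp_1's output
    'mlp_2': dense_params(2 * visual_len, visual_len),
    'final_a': dense_params(visual_len, n_answers),
}
total = sum(counts.values())

for name, n in counts.items():
    print('{0:>10}: {1:,}'.format(name, n))
print('{0:>10}: {1:,}'.format('total', total))
```

Under these assumptions the dense layer after the concatenation dominates the parameter count, which is worth keeping in mind given the small training set.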

## Train the model

In [None]:
# choose the optimizer (how the parameters are updated) and the loss:
model.compile('adam', 'sparse_categorical_crossentropy')
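
The sparse variant of categorical cross-entropy takes the target as an integer class index instead of a one-hot vector, which is why `Y` can stay a plain array of answer indices. For a single example, the loss is the negative log of the probability assigned to the correct class. A plain-Python sketch with made-up scores:

```python
import math

def softmax(logits):
    m = max(logits)                        # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sparse_categorical_crossentropy(label, probs):
    """label is an integer class index, probs a list of class probabilities."""
    return -math.log(probs[label])

probs = softmax([2.0, 1.0, 0.1])           # made-up scores for three answers
loss_correct = sparse_categorical_crossentropy(0, probs)  # target is the top-scored class
loss_wrong = sparse_categorical_crossentropy(2, probs)    # target is the lowest-scored class

uniform_loss = sparse_categorical_crossentropy(1, [0.25] * 4)
print(loss_correct, loss_wrong, uniform_loss)
```

The loss is small when the model puts high probability on the target answer, and a uniform prediction over `k` answers gives a loss of `log(k)`.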

In [None]:
model.fit(X, Y, epochs=20, batch_size=10)

In [None]:
model.fit(X, Y, epochs=20, batch_size=32)

In [None]:

model.fit(X, Y, epochs=20, batch_size=32)

In [None]:
from gensim.models import KeyedVectors