# Visual Turing Test - Tutorial

[Scalable Learning and Perception Group](http://scalable.mpi-inf.mpg.de), authored by [Dr. Mario Fritz](http://scalable.mpi-inf.mpg.de/publications/) and [Mateusz Malinowski](https://www.mpi-inf.mpg.de/departments/computer-vision-and-multimodal-computing/people/mateusz-malinowski/)

This tutorial is based on our ICCV'15 paper ["Ask Your Neurons: A Neural-based Approach to Answering Questions about Images"](https://www.d2.mpi-inf.mpg.de/sites/default/files/iccv15-neural_qa.pdf), and, more broadly, our project on [Visual Turing Test](https://www.mpi-inf.mpg.de/departments/computer-vision-and-multimodal-computing/research/vision-and-language/visual-turing-challenge/).

Since the visual features are large, you should download them separately and put them into the data/daquar/visual_features or data/vqa/visual_features directory.
 * daquar (it is recommended to download all of them)
  * [residual net [23 MB]](http://datasets.d2.mpi-inf.mpg.de/mateusz16visual_features/daquar/fb_resnet.zip)
   -- place under: data/daquar/visual_features/fb_resnet/blobs.*.npy
  * [googlenet [17 MB]](http://datasets.d2.mpi-inf.mpg.de/mateusz16visual_features/daquar/googlenet.zip)
   -- place under: data/daquar/visual_features/googlenet/blobs.*.npy
  * [NYU-Depth images [430 MB]](http://datasets.d2.mpi-inf.mpg.de/mateusz16language_vision_tutorial/nyu_depth-images.zip)
   -- place under: data/daquar/images/*.png
 * vqa
  * [residual_net train [1.2GB]](http://datasets.d2.mpi-inf.mpg.de/mateusz16visual_features/vqa/train2014/fb_resnet.zip) 
   -- recommended, place under: data/vqa/visual_features/train2014/fb_resnet/blobs.*.npy
  * [residual_net val [573MB]](http://datasets.d2.mpi-inf.mpg.de/mateusz16visual_features/vqa/val2014/fb_resnet.zip) 
   -- recommended, place under: data/vqa/visual_features/val2014/fb_resnet/blobs.*.npy
  * [vgg_net train [2.2GB]](http://datasets.d2.mpi-inf.mpg.de/mateusz16visual_features/vqa/train2014/vgg_net.zip)
   -- a few different visual features, place under: data/vqa/visual_features/train2014/vgg_net/blobs.*.npy
  * [vgg_net val [1.3GB]](http://datasets.d2.mpi-inf.mpg.de/mateusz16visual_features/vqa/val2014/vgg_net.zip)
   -- a few different visual features, place under: data/vqa/visual_features/val2014/vgg_net/blobs.*.npy
 * vqa - question answer pairs
  * [Questions [134MB]](http://datasets.d2.mpi-inf.mpg.de/mateusz16language_vision_tutorial/Questions.zip) -- place under: data/vqa/Questions/
  * [Annotations [56MB]](http://datasets.d2.mpi-inf.mpg.de/mateusz16language_vision_tutorial/Annotations.zip) -- place under: data/vqa/Annotations
 * nltk data [recommended]
  * [nltk_data [20MB]](http://datasets.d2.mpi-inf.mpg.de/mateusz16language_vision_tutorial/nltk_data.zip)
   -- place under: data/nltk_data
 * tutorial
  * this version of tutorial can be found [here](http://datasets.d2.mpi-inf.mpg.de/mateusz16language_vision_tutorial/visual_turing_test-tutorial.zip)
  * github version: 

If you use this tutorial or its parts or Kraino in your project, please consider citing at least our 'Ask Your Neurons' paper (you can find the bibtex below).

Bibtex:

```
@inproceedings{malinowski2015ask,
  title={Ask your neurons: A neural-based approach to answering questions about images},
  author={Malinowski, Mateusz and Rohrbach, Marcus and Fritz, Mario},
  booktitle={Proceedings of the IEEE International Conference on Computer Vision},
  pages={1--9},
  year={2015}
}
```

Before starting the tutorial, make sure that you have the following folder hierarchy:
 * `boring_function.py`
 * `neural_solver.py`
 * `visual_turing_test.ipynb`
 * `data/*`
 * `kraino/*`
 * `local/*`
 * `fig/*`

# Introduction

### How to use this notebook?

Before focusing on the actual task, let's briefly see what we can do in the Jupyter Notebook. We will also introduce notation that we use in this tutorial.

Shortcuts:
 * Shift + Enter - runs the cell and moves to the next cell
 * Ctrl + Enter - runs the cell (and stays in the same cell)
 * Esc + x - deletes the cell (be careful)
 * Esc + b - creates a cell below the current cell


The following represents an exercise that doesn't need programming. Its role is to practice some newly introduced concepts.
```
Exercise
```

The following is a python script. 
```python
print("Hello world")

# Comment
# Now we write a loop printing numbers from 0 to 9
for k in xrange(10):
    # remember that the python's syntax is driven by indentation
    print(k)
```

Some exercises need some programming, or at least executing the code.
However, in this tutorial, we try to keep the programming part rather minimal, and focus on the Visual Turing Test.
The following cell is a small programming exercise. You can edit it by double clicking the cell, and execute it by running the cell (Shift + Enter). We use #TODO: to give some hints or more detailed explanation.

In [None]:
def print_n_numbers(n):
    #TODO: write a loop that prints numbers from 0 to n (excluding n)
    for i in xrange(n):
        print(i)

# now we execute the function
print_n_numbers(5)

The function below prints each element of the list on a new line. We will use this function later, so please run the following cell (Shift+Enter).

In [None]:
def print_list(ll):
    # Prints the list
    print('\n'.join(ll))
    
print_list(['Visual Turing Test', 'Summer School', 'Dr. Mario Fritz', 'Mateusz Malinowski'])

The notebook can also interface with the command line. Try the following line (again Shift+Enter).

In [None]:
! ls

Now let's execute the Python script 'boring_function.py' with an argument. It prints the available GPU, the argument, and the versions of Theano and Keras (we will talk about both frameworks later in the tutorial). Since boring_function imports Theano, its execution may take a while.

In [None]:
! python boring_function.py 'hello world'

To view the last lines of the script, run the following:

In [None]:
! tail boring_function.py

The command below checks the available GPU machines.

In [1]:
! nvidia-smi

Mon Apr 25 21:57:45 2016       
+------------------------------------------------------+                       
| NVIDIA-SMI 352.30     Driver Version: 352.30         |                       
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|   0  Tesla K40m          Off  | 0000:04:00.0     Off |                    0 |
| N/A   71C    P0   142W / 235W |   2852MiB / 11519MiB |     99%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla K40m          Off  | 0000:42:00.0     Off |                    0 |
| N/A   42C    P8    20W / 235W |     22MiB / 11519MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                            

### Challenge

During this tutorial, we will look at a very recent research thread that links language and vision: the Visual Turing Test, in which machines answer natural language questions about images.

![challenges](fig/challenges.jpg)

### Roadmap

1. [Datasets](#Datasets)
1. [Textual Features](#Textual-Features)
1. [Language Only](#Language-Only)
1. [Evaluation Measures](#Evaluation-Measures)
1. [New Predictions](#New-Predictions)
1. [Visual Features](#Visual-Features)
1. [Vision+Language](#Vision+Language)
1. [New Predictions with Vision+Language](#New-Predictions-with-Vision+Language)
1. [VQA](#VQA)
1. [Kraino](#Kraino)
1. [Further Experiments](#Further-Experiments)
1. [New Research Opportunities](#New-Research-Opportunities)
1. [External Links](#External-Links)
1. [Logs](#Logs)

# Datasets

In this section, we will get familiar with both datasets, accuracy measures, and features.

### DAQUAR

Let's first look into the folder *data/daquar* to make the problem a bit more tangible for us. Here, we have training and test data in *qa.894.raw.train.format_triple* and *qa.894.raw.test.format_triple*.

```
Execute the cell below to see what the input data looks like.
Please use Shift+Enter on the cell below.
Make sure you understand the format.
```


In [2]:
! head -15 data/daquar/qa.894.raw.train.format_triple

what is on the right side of the black telephone and on the left side of the red chair ?
desk
image3
what is in front of the white door on the left side of the desk ?
telephone
image3
what is on the desk ?
book, scissor, papers, tape_dispenser
image3
what is the largest brown objects ?
carton
image3
what color is the chair in front of the white wall ?
red
image3
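
The listing above suggests that every example spans three consecutive lines: the question, the answer (possibly several comma-separated words), and the image name. As a minimal sketch, assuming exactly this layout, the lines could be grouped into triplets as follows. `parse_triples` is a hypothetical helper written only for illustration; it is not Kraino's loader.

```python
def parse_triples(lines):
    """Group a flat list of lines into (question, answer, image) triples."""
    # Hypothetical helper for illustration; Kraino's own loader may differ.
    stripped = [line.strip() for line in lines if line.strip()]
    if len(stripped) % 3 != 0:
        raise ValueError('expected a multiple of three non-empty lines')
    return [tuple(stripped[i:i + 3]) for i in range(0, len(stripped), 3)]


sample = [
    'what is on the desk ?',
    'book, scissor, papers, tape_dispenser',
    'image3',
    'what color is the chair in front of the white wall ?',
    'red',
    'image3',
]
triples = parse_triples(sample)
for question, answer, image in triples:
    print(question + ' -> ' + answer + ' (' + image + ')')
```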



Let's have a look at the figure in [Introduction->Challenge](#Challenge). The figure lists images with associated question-answer pairs. It also comments on the challenges associated with every question-answer-image triplet. We see that, to answer the questions properly, the answerer needs to understand the scene visually and understand the question, but arguably also has to resort to common sense knowledge, or even know the preferences of the person asking the question ('What is behind the table?': what does 'behind' mean?).

```
Can you see anything particularly interesting in the first column of the figure in [Introduction->Challenge]?
Think about a spatial relationship between an observer, object of interest, and the world. 
```

In [3]:
#TODO: Execute the following procedure (Shift+Enter)
from kraino.utils import data_provider

dp = data_provider.select['daquar-triples']
dp

{'perception': <function kraino.utils.data_provider.<lambda>>,
 'save_predictions': <function kraino.utils.data_provider.daquar_save_results>,
 'text': <function kraino.utils.data_provider.daquar_qa_triples>}

The code above returns a dictionary with three representations of the DAQUAR dataset. For now, we will look only at the 'text' representation. dp['text'] is a function that maps a dataset split to the dataset's textual representation. This will become clearer after executing the following instruction.

In [4]:
# check the keys of the representation of DAQUAR train
train_text_representation = dp['text'](train_or_test='train')
train_text_representation.keys()

['answer_words_delimiter',
 'end_of_answer',
 'img_name',
 'y',
 'x',
 'end_of_question',
 'img_ind',
 'question_id']

This representation specifies how questions end ('?'), how answers end ('.'), and how answer words are delimited (DAQUAR sometimes has a set of answer words as an answer; for instance, 'knife, fork' may be a valid answer). Most importantly, it contains the questions (key 'x'), the answers (key 'y'), and the names of the corresponding images (key 'img_name').

In [5]:
# let's check some entries of the text's representation
n_elements = 10
print('== Questions:')
print_list(train_text_representation['x'][:n_elements])
print('')
print('== Answers:')
print_list(train_text_representation['y'][:n_elements])
print('')
print('== Image Names:')
print_list(train_text_representation['img_name'][:n_elements])

== Questions:


NameError: name 'print_list' is not defined

__Summary__

DAQUAR consists of (question, answer, image) triplets. The question-answer pairs for the different splits are accessible from
```python
data_provider.select['daquar-triples']['text']
```

# Textual Features

Ok. We have text. Unfortunately, neural networks expect numerical input, so we cannot work with the raw text directly. We need to transform the raw input into a numerical value or a vector of values. One particularly successful representation is the one-hot vector: a binary vector with exactly one non-zero entry. This entry points to the corresponding word in the vocabulary. See the illustration below.

![one hot](fig/one_hot.jpg)

```
* What does the vector obtained by summing up the one-hot representations of the words in 'what is behind the table' (see the illustration above) look like?
* What if we sum up 'What table is behind the table'? Can you interpret the resulting vector?
* Can you guess why it's nice to work with one-hot vector representation of the text?
```

As we see from the illustrative example above, we first need to build a suitable vocabulary from our raw textual training data, and then transform the text into a one-hot representation.

In [6]:
from toolz import frequencies
train_raw_x = train_text_representation['x']
# we start from building the frequencies table
wordcount_x = frequencies(' '.join(train_raw_x).split(' '))
# print the most and least frequent words
n_show = 5
print(sorted(wordcount_x.items(), key=lambda x: x[1], reverse=True)[:n_show])
print(sorted(wordcount_x.items(), key=lambda x: x[1])[:n_show])

[('the', 9847), ('?', 6795), ('what', 5847), ('is', 5368), ('on', 2909)]
[('all', 1), ('surrounded', 1), ('four', 1), ('displaying', 1), ('children', 1)]


In [7]:
# Kraino is a framework that helps in fast prototyping Visual Turing Test models
from kraino.utils.input_output_space import build_vocabulary

# This function takes wordcounts and returns word2index - mapping from words into indices, 
# and index2word - mapping from indices to words.
word2index_x, index2word_x = build_vocabulary(
    this_wordcount=wordcount_x,
    truncate_to_most_frequent=0)
word2index_x

{'3': 506,
 u'<eoa>': 2,
 u'<eoq>': 3,
 u'<pad>': 0,
 u'<unk>': 1,
 '?': 52,
 'a': 205,
 'above': 80,
 'ac': 817,
 'across': 512,
 'against': 533,
 'air': 588,
 'airconditionerg': 790,
 'alarm': 424,
 'all': 4,
 'along': 92,
 'amidst': 783,
 'and': 382,
 'any': 390,
 'apart': 688,
 'apples': 513,
 'appliance': 510,
 'appliances': 112,
 'are': 488,
 'arm': 494,
 'armchair': 546,
 'armchairs': 320,
 'around': 765,
 'at': 828,
 'attached': 807,
 'audio': 647,
 'available': 514,
 'away': 501,
 'baby': 137,
 'back': 708,
 'backpack': 541,
 'bag': 471,
 'bags': 607,
 'ball': 623,
 'bananas': 651,
 'bars': 477,
 'base': 303,
 'basin': 566,
 'basins': 545,
 'basket': 215,
 'baskets': 110,
 'bath': 257,
 'bathroom': 558,
 'bathtub': 592,
 'bean': 463,
 'bear': 462,
 'bed': 374,
 'bedding': 720,
 'beds': 873,
 'bedside': 51,
 'been': 563,
 'before': 234,
 'behind': 660,
 'beige': 757,
 'below': 811,
 'belt': 556,
 'bench': 667,
 'beneath': 269,
 'benhind': 95,
 'between': 509,
 'bicycle': 335,
 

In addition, we are using a few special symbols that don't occur in the training dataset.
Most important are $<pad>$ and $<unk>$. We will use the former to pad sequences in order to have the same 
number of temporal elements; we will use the latter for words (at test time) that don't exist in training dataset.
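
To make the role of the special symbols concrete, here is a small sketch of how a vocabulary with these tokens could be built and used. The helpers `toy_build_vocabulary` and `toy_encode_question` are hypothetical stand-ins written for this example; Kraino's `build_vocabulary` may order words differently. The indices of the special tokens and the trailing `<eoq>` follow what the printed vocabulary and encoded questions in this tutorial show.

```python
from collections import Counter

# Special tokens with the indices shown in the vocabulary printed above.
SPECIAL_TOKENS = ['<pad>', '<unk>', '<eoa>', '<eoq>']


def toy_build_vocabulary(wordcount):
    """Hypothetical stand-in for build_vocabulary; the real ordering may differ."""
    words = SPECIAL_TOKENS + sorted(wordcount)
    word2index = {word: index for index, word in enumerate(words)}
    index2word = {index: word for word, index in word2index.items()}
    return word2index, index2word


def toy_encode_question(question, word2index):
    """Map words to indices, use <unk> for unseen words, and append <eoq>."""
    unk = word2index['<unk>']
    indices = [word2index.get(word, unk) for word in question.split(' ')]
    return indices + [word2index['<eoq>']]


train_questions = ['what is on the desk ?', 'what is behind the table ?']
toy_wordcount = Counter(' '.join(train_questions).split(' '))
toy_word2index, toy_index2word = toy_build_vocabulary(toy_wordcount)

# 'under' never occurs in the toy training questions, so it becomes <unk>.
encoded = toy_encode_question('what is under the desk ?', toy_word2index)
print(encoded)
print([toy_index2word[i] for i in encoded])
```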

Armed with the vocabulary, we can build a one-hot representation of the training data. However, building it explicitly is not necessary and may even be wasteful. Our one-hot representation of the input text doesn't explicitly build long vectors; instead, it operates on indices. The example above would be encoded as [0,1,4,2,7,3]. 
```
Can you prove the equivalence in the claim?
```
__claim__:

Let $x$ be a binary vector with exactly one value $1$ at position $index$, that is $x[index]=1$. Then $$W[:,index] = Wx$$ where $W[:,b]$ denotes a vector built from a column $b$ of $W$.
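
The claim can also be checked numerically before proving it. The sketch below uses plain Python lists and an arbitrary toy matrix (the values are assumptions chosen for illustration) to confirm that multiplying by a one-hot vector selects a column. It is a sanity check, not a proof.

```python
def one_hot(index, size):
    """Binary vector with a single 1 at position `index`."""
    return [1 if i == index else 0 for i in range(size)]


def matvec(W, x):
    """Plain matrix-vector product W x for a list-of-rows matrix."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) for row in W]


def column(W, index):
    """The column W[:, index] as a list."""
    return [row[index] for row in W]


# Toy 2 x 4 matrix: embedding dimension 2, vocabulary size 4 (arbitrary values).
W = [[0.1, 0.2, 0.3, 0.4],
     [1.0, 2.0, 3.0, 4.0]]

for index in range(4):
    assert matvec(W, one_hot(index, 4)) == column(W, index)

print(matvec(W, one_hot(2, 4)))
```

This is why working with indices suffices: a lookup of column `index` replaces a full matrix-vector product.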




In [8]:
from kraino.utils.input_output_space import encode_questions_index
one_hot_x = encode_questions_index(train_raw_x, word2index_x)
print(train_raw_x[:3])
print(one_hot_x[:3])

['what is on the right side of the black telephone and on the left side of the red chair ?', 'what is in front of the white door on the left side of the desk ?', 'what is on the desk ?']
[[71, 597, 744, 646, 704, 272, 160, 646, 124, 134, 382, 744, 646, 649, 272, 160, 646, 298, 15, 52, 3], [71, 597, 602, 255, 160, 646, 352, 130, 744, 646, 649, 272, 160, 646, 655, 52, 3], [71, 597, 744, 646, 655, 52, 3]]


As we can see, the sequences have different lengths. We will pad the sequences to have the same length $MAXLEN$.
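
To make the padding convention concrete, here is a pure-Python sketch of left-padding with the $<pad>$ index 0. By default, Keras's `pad_sequences` pads and truncates at the front of the sequence; the hypothetical helper `pad_left` below mimics that behaviour for illustration only.

```python
def pad_left(sequences, maxlen, value=0):
    """Left-pad (or left-truncate) every sequence to exactly `maxlen` items."""
    # Assumes maxlen >= 1.
    padded = []
    for seq in sequences:
        seq = list(seq)[-maxlen:]  # keep only the last `maxlen` items
        padded.append([value] * (maxlen - len(seq)) + seq)
    return padded


toy_sequences = [[71, 597, 744, 646, 655, 52, 3], [71, 597, 3]]
toy_padded = pad_left(toy_sequences, maxlen=5)
for row in toy_padded:
    print(row)
```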

In [9]:
# We use another framework that is useful to build deep learning models - Keras
from keras.preprocessing import sequence
MAXLEN=30
train_x = sequence.pad_sequences(one_hot_x, maxlen=MAXLEN)
train_x[:3]

array([[  0,   0,   0,   0,   0,   0,   0,   0,   0,  71, 597, 744, 646,
        704, 272, 160, 646, 124, 134, 382, 744, 646, 649, 272, 160, 646,
        298,  15,  52,   3],
       [  0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
         71, 597, 602, 255, 160, 646, 352, 130, 744, 646, 649, 272, 160,
        646, 655,  52,   3],
       [  0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
          0,   0,   0,   0,   0,   0,   0,   0,   0,   0,  71, 597, 744,
        646, 655,  52,   3]], dtype=int32)

And do the same with the answers.

In [10]:
# for simplicity, we consider only first answer words; that is, if answer is 'knife,fork' we encode only 'knife'
MAX_ANSWER_TIME_STEPS=1

from kraino.utils.input_output_space import encode_answers_one_hot
train_raw_y = train_text_representation['y']
wordcount_y = frequencies(' '.join(train_raw_y).split(' '))
word2index_y, index2word_y = build_vocabulary(this_wordcount=wordcount_y)
train_y, _ = encode_answers_one_hot(
    train_raw_y, 
    word2index_y, 
    answer_words_delimiter=train_text_representation['answer_words_delimiter'],
    is_only_first_answer_word=True,
    max_answer_time_steps=MAX_ANSWER_TIME_STEPS)
print(train_x.shape)
print(train_y.shape)

(6795, 30)
(6795, 686)
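
The shape of the answer matrix reads as one row per training question and one column per word of the answer vocabulary. The sketch below shows the idea on toy data. The helper names, the ',' delimiter, and the fallback to $<unk>$ are assumptions made for illustration; `encode_answers_one_hot` may behave differently.

```python
def first_answer_word(answer, delimiter=','):
    """Return the first word of a possibly multi-word answer."""
    return answer.split(delimiter)[0].strip()


def one_hot_answers(answers, word2index, delimiter=','):
    """One row per answer, with a single 1 in the column of its first word."""
    rows = []
    for answer in answers:
        row = [0] * len(word2index)
        word = first_answer_word(answer, delimiter)
        # Assumption: answer words missing from the vocabulary map to <unk>.
        row[word2index.get(word, word2index['<unk>'])] = 1
        rows.append(row)
    return rows


toy_word2index_y = {'<pad>': 0, '<unk>': 1, 'desk': 2, 'book': 3, 'red': 4}
toy_answers = ['desk', 'book, scissor, papers', 'sofa']
toy_train_y = one_hot_answers(toy_answers, toy_word2index_y)
for row in toy_train_y:
    print(row)
```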


Finally, we can also encode the test questions. We need them later to see how well our models generalise to new (question, answer, image) triplets. Remember, however, that we should use the vocabulary we generated from the training samples.

```
Why should we use the training vocabulary to encode test questions?
```

In [11]:
test_text_representation = dp['text'](train_or_test='test')
test_raw_x = test_text_representation['x']
test_one_hot_x = encode_questions_index(test_raw_x, word2index_x)
test_x = sequence.pad_sequences(test_one_hot_x, maxlen=MAXLEN)
print_list(test_raw_x[:3])
test_x[:3]

NameError: name 'print_list' is not defined

With the encoded question-answer pairs we finish this section. But before delving into building and training new models, let's have a look at a summary to see the bigger picture.

__Summary__

We started from the raw questions of the training set and used them to build a vocabulary. Next, we encoded the questions into sequences of one-hot vectors (stored as indices) based on the vocabulary. Finally, we used the same vocabulary to encode the questions from the test set; if a word is absent from the vocabulary, we use the extra token $<unk>$ to encode this fact (we encode the $<unk>$ token, not the word).

# Language Only

### Training - overall picture

Ok. We have the textual features already built. Let's create some models that we will use for training and later for answering questions about images.

As you may already know, we train models by updating their weights. Let $x$ and $y$ be a training sample (input, output), and let $\ell(x,y;w)$ be an objective function. The formula for the weight update is:
$$w := w - \alpha \nabla \ell(x,y; w)$$
where $\nabla$ denotes the gradient wrt. the weights $w$, and $\alpha$ is the learning rate, a hyper-parameter that must be set in advance. The rule shown above is called the SGD update, but other variants are also possible. In fact, we will use a variant called [ADAM](http://arxiv.org/pdf/1412.6980v8.pdf).
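
As a minimal sketch of this update rule, the snippet below repeatedly applies $w := w - \alpha \nabla \ell$ to a toy quadratic objective. The objective, the starting point, and the learning rate are assumptions picked for illustration; they are unrelated to the models trained later.

```python
def sgd_step(w, grad, learning_rate):
    """One update w := w - alpha * grad for a list of weights."""
    return [w_i - learning_rate * g_i for w_i, g_i in zip(w, grad)]


def toy_loss_gradient(w):
    """Gradient of the toy objective sum((w_i - 3)^2), minimised at w_i = 3."""
    return [2.0 * (w_i - 3.0) for w_i in w]


w = [0.0, 10.0]
for step in range(100):
    w = sgd_step(w, toy_loss_gradient(w), learning_rate=0.1)
print(w)
```

With this learning rate every step shrinks the distance to the minimum by a constant factor, so the weights approach 3. A learning rate that is too large would make the iterates diverge instead.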

We cast the question answering problem into a classification framework, so that we classify the input $x$ into a class that represents an answer word. Therefore, we use the objective of logistic regression, which is popular in classification. Since the update rule above minimises the objective, we write it as a negative log-likelihood:
$$\ell(x,y;w):=-\sum_{y' \in \mathcal{C}} \mathbb{1}\{y'=y\}\log p(y'\;|\;x,w)$$
where $\mathcal{C}$ is the set of all classes, and for $p(y\;|\;x,w)$ we will use softmax: $e^{w^y\phi(x)} / \sum_{z}e^{w^z\phi(x)}$. Here $\phi(x)$ denotes the output of a model (more precisely, it's often a neural network's response to the input, just before softmax).
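
For a single training sample, the indicator keeps only the term of the correct class, so the quantity minimised during training is the negative log-probability that softmax assigns to that class. The sketch below computes it in plain Python; the scores stand for hypothetical values of $w^z\phi(x)$ and are not produced by any model in this tutorial.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of class scores."""
    shift = max(scores)
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def negative_log_likelihood(scores, true_class):
    """Cross-entropy loss for one sample: -log p(true_class | x, w)."""
    return -math.log(softmax(scores)[true_class])


# Hypothetical scores w^z . phi(x) for three answer classes.
scores = [2.0, 1.0, 0.1]
probabilities = softmax(scores)
print(probabilities)
print(negative_log_likelihood(scores, true_class=0))
print(negative_log_likelihood(scores, true_class=2))
```

The loss is small when the correct class has the highest score and grows as its probability drops.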

The training can be formalised (and automated) so that you only need to execute a procedure that looks something like this:
```python
training(gradient_of_the_model, optimizer='Adam')
```

__Summary__

Given a model and an optimization procedure (SGD, Adam, etc.), all we need is to compute the gradient of the objective $\nabla \ell(x,y;w)$ wrt. the model's parameters $w$, and then plug it into the optimisation procedure.

### Theano

Since computing the gradients $\nabla \ell(x,y;w)$ by hand quickly becomes tedious, especially for more complex models, we look for tools that can automate it as well. Imagine that you build some model $M$ and get its gradient $\nabla M$ by just executing the tool
```python
nabla_M = compute_gradient_symbolically(M,x,y)
```
This would definitely speed up prototyping.

[Theano](http://deeplearning.net/software/theano/) is such a tool, specifically tailored to work with deep learning models. For a broader understanding of Theano, you can check [this nice tutorial](http://deeplearning.net/tutorial/). 

The programming example in the cell below defines ReLU, a popular activation function, and shows its derivative using Theano.
However, with this example, we only scratch the surface.

Assume ReLU is defined as follows $ReLU(x) = \max(x,0)$.
```
What is the gradient of ReLU? Consider two cases. 
```
Btw. ReLU is not differentiable at $x=0$, so technically we are computing its [subgradient](https://en.wikipedia.org/wiki/Subderivative) there - it is still fine for Theano.

In [None]:
import theano
import theano.tensor as T

# Theano is using symbolic calculations, so we need to first create symbolic variables
theano_x = T.scalar()
# we define a relationship between a symbolic input and a symbolic output
theano_y = T.maximum(0,theano_x)
# now it's time for a symbolic gradient wrt. the symbolic variable x
theano_nabla_y = T.grad(theano_y, theano_x)

# we can see that both variables are symbolic, they don't have any numerical values
print(theano_x)
print(theano_y)
print(theano_nabla_y)

# theano.function compiles the symbolic representation of the network
theano_f_x = theano.function([theano_x], theano_y)
print(theano_f_x(3))
print(theano_f_x(-3))
# and now for gradients

nabla_f_x = theano.function([theano_x], theano_nabla_y)
print(nabla_f_x(3))
print(nabla_f_x(-3))
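
The two cases of the ReLU gradient can also be cross-checked without Theano. The sketch below compares a hand-written subgradient with a central finite difference. Returning 0 at $x=0$ is a choice made here for illustration (any value in $[0,1]$ is a valid subgradient); it is not a statement about what Theano returns at that point.

```python
def relu(x):
    """ReLU(x) = max(x, 0)."""
    return max(x, 0.0)


def relu_subgradient(x):
    """1 for x > 0 and 0 for x < 0; at x = 0 we pick 0 (an arbitrary valid choice)."""
    return 1.0 if x > 0 else 0.0


def finite_difference(f, x, eps=1e-6):
    """Central-difference approximation of f'(x)."""
    return (f(x + eps) - f(x - eps)) / (2.0 * eps)


for x in [3.0, -3.0]:
    print(x, relu_subgradient(x), finite_difference(relu, x))
```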

__Summary__

To compute gradient symbolically, we can use [Theano](http://deeplearning.net/software/theano/).

### Keras

[Keras](http://keras.io) builds on top of Theano and significantly simplifies creating and training new models, effectively speeding up prototyping even further. Keras also abstracts away some technical burden, such as the creation of symbolic variables. Metaphorically, while Theano can be seen as a deep learning equivalent of assembler, Keras is more like Java :)

In [None]:
# we sample from noisy x^2 function
from numpy import asarray
from numpy import random
def myfun(x):
    return x*x

NUM_SAMPLES = 10000
HIGH_VALUE=10
keras_x = asarray(random.randint(low=0, high=HIGH_VALUE, size=NUM_SAMPLES))
keras_noise = random.normal(loc=0.0, scale=0.1, size=NUM_SAMPLES)
keras_noise = asarray([max(x,0) for x in keras_noise])
keras_y = asarray([myfun(x) + n for x,n in zip(keras_x, keras_noise)])
# print('X:')
# print(keras_x)
# print('Noise')
# print(keras_noise)
# print('Noisy X^2:')
# print(keras_y)

keras_x = keras_x.reshape(keras_x.shape[0],1)
keras_y = keras_y.reshape(keras_y.shape[0],1)

# import keras packages
from keras.models import Sequential
from keras.layers import Dense, Activation, Dropout

# build a regression network
KERAS_NUM_HIDDEN = 150
KERAS_NUM_HIDDEN_SECOND = 150
KERAS_NUM_HIDDEN_THIRD = 150
KERAS_DROPOUT_FRACTION = 0.5
m = Sequential()
m.add(Dense(KERAS_NUM_HIDDEN, input_dim=1))
m.add(Activation('relu'))
m.add(Dropout(KERAS_DROPOUT_FRACTION))
#TODO: add one more layer
# m.add(Dense(KERAS_NUM_HIDDEN_SECOND))
# m.add(Activation('relu'))
# m.add(Dropout(KERAS_DROPOUT_FRACTION))
#TODO: add one more layer
# m.add(Dense(KERAS_NUM_HIDDEN_THIRD))
# m.add(Activation('relu'))
# m.add(Dropout(KERAS_DROPOUT_FRACTION))
m.add(Dense(1))

# compile and fit
m.compile(loss='mse', optimizer='adam')
m.fit(keras_x, keras_y, nb_epoch=100, batch_size=250)

keras_x_predict = asarray([1,3,6,12,HIGH_VALUE+10])
keras_x_predict = keras_x_predict.reshape(keras_x_predict.shape[0],1)
keras_predictions = m.predict(keras_x_predict)
print("{0:>10}{1:>10}{2:>10}".format('X', 'Y', 'GT'))
for x,y in zip(keras_x_predict, keras_predictions):
    print("{0:>10}{1:>10.2f}{2:>10}".format(x[0], y[0], myfun(x[0])))

You can play with the example above.

 * What happens if you add one more layer? Or two more layers?
 * What happens if you change hidden size?
 * What happens if you use more/less samples?
 
Key features of Keras
 * Automatic shape inference

### Models

For the purpose of the Visual Turing Test and this tutorial, we have compiled a light framework that builds on top of Keras and simplifies building and training question answering machines. Following the tradition of using fancy Greek names, we call it Kraino. Note that you have already seen some parts of Kraino, such as the data provider. 

In the following, we will go through BOW and LSTM approaches to answering questions about images but, surprisingly, without the images. It turns out that a substantial fraction of questions can be answered without access to an image, by resorting to common sense (or to statistics of the dataset). For instance: What can be placed at the table? How many eyes does this human have? Answers like 'chair' and '2' are quite likely to be good answers.

Please make sure that all the cells from [Datasets](#Datasets) and [Textual Features](#Textual-Features) have been executed.

### BOW

The figure below illustrates the BOW (Bag Of Words) approach. As we have already seen in [Textual Features](#Textual-Features), we first encode the input sentence into one-hot vector representations. Such a (very) sparse representation is then embedded into a denser space by a matrix $W_e$. Next, the denser representations are summed up and classified via 'Softmax'. Notice that, if $W_e$ were an identity matrix, we would obtain a histogram of word occurrences. 


```
What is your biggest complaint about such a BOW representation? What happens if instead of 
'What is behind the table' we had 'is What the behind table'? How does the BOW representation change? 
```

![bow](fig/BOW_model.jpg)
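
The embed-and-sum step can be sketched with plain Python lists. The word indices and the values of $W_e$ below are toy assumptions. Note that the Kraino model in the next cell uses a masked average (`time_distributed_masked_ave`) rather than a plain sum, so this is an illustration of the idea and not of that implementation.

```python
def embed_sum(indices, W_e):
    """Sum the embedding columns W_e[:, i] over all word indices."""
    return [sum(row[i] for i in indices) for row in W_e]


def identity(size):
    """Identity matrix as a list of rows."""
    return [[1 if r == c else 0 for c in range(size)] for r in range(size)]


# Toy vocabulary (hypothetical indices): what=0, is=1, behind=2, the=3, table=4
question = [0, 1, 2, 3, 4]      # 'what is behind the table'
shuffled = [1, 0, 3, 2, 4]      # 'is what the behind table'
repeated = [0, 4, 1, 2, 3, 4]   # 'what table is behind the table'

# Arbitrary 2 x 5 embedding matrix (embedding dimension 2, vocabulary size 5).
W_e = [[0.5, -1.0, 2.0, 0.0, 1.5],
       [1.0, 0.25, -0.5, 3.0, -2.0]]

print(embed_sum(question, W_e))
print(embed_sum(shuffled, W_e))
print(embed_sum(repeated, identity(5)))
```

The first two results coincide because a sum ignores word order, and with an identity matrix the result is the word-count histogram of the question.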

In [None]:
#== Model definition

# First we define a model using keras/kraino
from keras.layers.core import Activation
from keras.layers.core import Dense
from keras.layers.core import Dropout
from keras.layers.core import TimeDistributedMerge
from keras.layers.embeddings import Embedding

from kraino.core.model_zoo import AbstractSequentialModel
from kraino.core.model_zoo import AbstractSingleAnswer
from kraino.core.model_zoo import AbstractSequentialMultiplewordAnswer
from kraino.core.model_zoo import Config
from kraino.core.keras_extensions import DropMask
from kraino.core.keras_extensions import LambdaWithMask
from kraino.core.keras_extensions import time_distributed_masked_ave

# This model inherits from AbstractSingleAnswer, and so it produces single answer words
# To use multiple answer words, you need to inherit from AbstractSequentialMultiplewordAnswer
class BlindBOW(AbstractSequentialModel, AbstractSingleAnswer):
    """
    BOW Language only model that produces single word answers.
    """
    def create(self):
        self.add(Embedding(
                self._config.input_dim, 
                self._config.textual_embedding_dim, 
                mask_zero=True))
        self.add(LambdaWithMask(time_distributed_masked_ave, output_shape=[self.output_shape[2]]))
        self.add(DropMask())
        self.add(Dropout(0.5))
        self.add(Dense(self._config.output_dim))
        self.add(Activation('softmax'))
        


In [None]:
model_config = Config(
    textual_embedding_dim=500,
    input_dim=len(word2index_x.keys()),
    output_dim=len(word2index_y.keys()))
model = BlindBOW(model_config)
model.create()

model.compile(
    loss='categorical_crossentropy', 
    optimizer='adam')
text_bow_model = model

In [None]:
#== Model training
text_bow_model.fit(
    train_x, 
    train_y,
    batch_size=512,
    nb_epoch=40,
    validation_split=0.1,
    show_accuracy=True)

### Recurrent Neural Network

Although BOW is working pretty well, there is still something very disturbing about this approach.
Consider the following question:

In [None]:
train_raw_x[0]

If we swap 'chair' with 'telephone' in the question, we would get a different meaning, wouldn't we? Recurrent Neural Networks (RNNs) have been developed to mitigate this issue by directly processing the time series. As the figure below illustrates, the (temporally) first word's embedding is given to an RNN unit. The RNN unit then 'processes' this embedding and passes its output to the second RNN unit. This unit takes both the output of the first RNN unit and the second word's embedding as inputs, and outputs some algebraic combination of both. And so on. The last recurrent unit builds the representation of the whole sequence. Its output is then given to Softmax for classification. One of the challenges that such approaches have to deal with is keeping long-term dependencies. Roughly speaking, as new inputs arrive, it gets easier to 'forget' information from the beginning. [LSTM](http://deeplearning.cs.cmu.edu/pdfs/Hochreiter97_lstm.pdf) and [GRU](http://www.aclweb.org/anthology/W14-4012) are two particularly successful Recurrent Neural Networks that can preserve such longer dependencies to [some degree](http://karpathy.github.io/2015/05/21/rnn-effectiveness/).

__Note__: If the code below is not compiling, please restart the notebook, and run only [Datasets](#Datasets) and [Textual Features](#Textual-Features). In particular, don't run BOW.

![LSTM](fig/LSTM_model.jpg)
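
To see why a recurrent model is sensitive to word order, consider the simplest recurrence $h_t = \tanh(W_{in} x_t + W_{rec} h_{t-1})$. The sketch below is a plain (vanilla) RNN, not an LSTM or GRU, and all weights and embeddings are arbitrary toy values assumed for illustration.

```python
import math


def rnn_last_state(inputs, W_in, W_rec):
    """Run h_t = tanh(W_in x_t + W_rec h_{t-1}) and return the final state."""
    hidden = [0.0] * len(W_rec)
    for x in inputs:
        hidden = [
            math.tanh(
                sum(w * x_j for w, x_j in zip(W_in[k], x))
                + sum(w * h_j for w, h_j in zip(W_rec[k], hidden)))
            for k in range(len(W_rec))
        ]
    return hidden


# Arbitrary toy weights: hidden size 2, input (embedding) size 2.
W_in = [[1.0, -0.5], [0.3, 0.8]]
W_rec = [[0.5, 0.1], [-0.2, 0.4]]

chair = [1.0, 0.0]       # toy word embeddings
telephone = [0.0, 1.0]

state_a = rnn_last_state([chair, telephone], W_in, W_rec)
state_b = rnn_last_state([telephone, chair], W_in, W_rec)
print(state_a)
print(state_b)
```

Both sequences contain the same two words, so a sum of embeddings would be identical, yet the final hidden states differ.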

In [None]:
#== Model definition

# First we define a model using keras/kraino
from keras.layers.core import Activation
from keras.layers.core import Dense
from keras.layers.core import Dropout
from keras.layers.embeddings import Embedding
from keras.layers.recurrent import GRU
from keras.layers.recurrent import LSTM

from kraino.core.model_zoo import AbstractSequentialModel
from kraino.core.model_zoo import AbstractSingleAnswer
from kraino.core.model_zoo import AbstractSequentialMultiplewordAnswer
from kraino.core.model_zoo import Config
from kraino.core.keras_extensions import DropMask
from kraino.core.keras_extensions import LambdaWithMask
from kraino.core.keras_extensions import time_distributed_masked_ave

# This model inherits from AbstractSingleAnswer, and so it produces single answer words
# To use multiple answer words, you need to inherit from AbstractSequentialMultiplewordAnswer
class BlindRNN(AbstractSequentialModel, AbstractSingleAnswer):
    """
    RNN Language only model that produces single word answers.
    """
    def create(self):
        self.add(Embedding(
                self._config.input_dim, 
                self._config.textual_embedding_dim, 
                mask_zero=True))
        #TODO: Replace averaging with RNN (you can choose between LSTM and GRU)
#         self.add(LambdaWithMask(time_distributed_masked_ave, output_shape=[self.output_shape[2]]))
        self.add(GRU(self._config.hidden_state_dim, 
                      return_sequences=False))
        self.add(Dropout(0.5))
        self.add(Dense(self._config.output_dim))
        self.add(Activation('softmax'))
        

In [None]:
model_config = Config(
    textual_embedding_dim=500,
    hidden_state_dim=500,
    input_dim=len(word2index_x.keys()),
    output_dim=len(word2index_y.keys()))
model = BlindRNN(model_config)
model.create()
model.compile(
    loss='categorical_crossentropy', 
    optimizer='adam')
text_rnn_model = model

In [None]:
#== Model training
text_rnn_model.fit(
    train_x, 
    train_y,
    batch_size=512,
    nb_epoch=40,
    validation_split=0.1,
    show_accuracy=True)

At the end of this tutorial, you are free to experiment with the two examples above.
* You can change the size of the embedding.
* You can change the size of the RNN's hidden state.
* You can change the number of training epochs.
* You can experiment with different batch sizes.
* You can modify the models (more RNN layers, deeper classifiers). Consult the [Keras](http://keras.io) documentation if needed.

__Summary__

RNN models, as opposed to BOW, take the order of the words in the question into account. Moreover, a substantial number of questions can apparently be answered without any access to the image. A likely explanation is that the models learn dataset-specific biases, some of which can be interpreted as common sense knowledge.

# Evaluation Measures

First of all, please run the cell below to set up a link to the NLTK data.

In [12]:
%env NLTK_DATA=/home/ubuntu/data/visual_turing_test/nltk_data

env: NLTK_DATA=/home/ubuntu/data/visual_turing_test/nltk_data


### Ambiguities

To be able to monitor progress on any task, we need to find ways to evaluate the task. Otherwise, we wouldn't know how to compare two architectures, or even worse, we wouldn't even know what our goal is. Moreover, we should also aim at automatic evaluation measures, otherwise reproducibility is questionable, and the costs are high (speed and money; just imagine that you want to evaluate 100 different architectures of yours).

On the other hand, it's difficult to automatically evaluate holistic tasks such as question answering about images because of, in one word, ambiguities. There are ambiguities in naming objects, sometimes due to synonyms and sometimes due to fuzziness. For instance, is 'chair' == 'armchair', 'chair' != 'armchair', or something in between? Such semantic boundaries become even fuzzier as the number of categories grows. We could easily find a mutually exclusive set of 10 different categories, but what if there are 1000 categories, or 10000? Arguably, we can no longer think in terms of equivalence classes, but rather in terms of similarities. That is, 'chair' is semantically more similar to 'armchair' than to 'horse'. This simple example shows the main drawback of Accuracy, a traditional binary evaluation measure that scores 1 if the names are the same and 0 otherwise, so that Acc('chair', 'armchair') == Acc('chair', 'horse'). We use WUPS to handle such ambiguities.

We call these word-level ambiguities, but there are other ambiguities that are arguably more difficult to handle. For instance, the same question can be phrased in many different ways. The language of spatial relations is also ambiguous (you may be surprised that what you think is on the left may be on the right for others). Language also tends to be rather vague: we sometimes skip details and resort to common sense. Some of these ambiguities are rooted in culture. We handle a couple of such question-level ambiguities with the Consensus Measure.

At the same time, it is arguably easier to evaluate architectures on DAQUAR than on Image Captioning tasks. The former restricts the output space to $N$ categories, while still requiring holistic (visual and linguistic) comprehension.

### Wu-Palmer Similarity

Given an ontology, the Wu-Palmer Similarity between two words (or broader concepts) is a soft measure defined as
$$WuP(a,b) := \frac{2 \cdot depth(lca(a,b))}{depth(a) + depth(b)}$$
where $lca(a,b)$ is the least common ancestor of $a$ and $b$, and $depth(a)$ is the depth of $a$ in the ontology.
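
To make the definition concrete, here is a minimal sketch on a made-up toy taxonomy. The `PARENT` table, its node names, and the helper functions are illustrative assumptions, not the ontology from the figure. The sketch uses the common form of the measure, $2 \cdot depth(lca(a,b)) / (depth(a) + depth(b))$, and assumes that the root has depth 1. It also prints Accuracy, which cannot tell a near miss from a complete miss.

```python
# Toy taxonomy given as child -> parent links (hypothetical, for illustration only).
PARENT = {
    'furniture': 'entity',
    'seat': 'furniture',
    'chair': 'seat',
    'armchair': 'chair',
    'wardrobe': 'furniture',
    'animal': 'entity',
    'horse': 'animal',
}

def path_to_root(node):
    """Return [node, parent, ..., root]."""
    path = [node]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path

def depth(node):
    # Assumption: the root has depth 1.
    return len(path_to_root(node))

def lca(a, b):
    """Least common ancestor: the deepest node shared by both root paths."""
    ancestors_of_a = set(path_to_root(a))
    for node in path_to_root(b):  # walks upwards from b, so the first hit is the deepest
        if node in ancestors_of_a:
            return node

def wup(a, b):
    return 2.0 * depth(lca(a, b)) / (depth(a) + depth(b))

def acc(a, b):
    return 1.0 if a == b else 0.0

for other in ['armchair', 'wardrobe', 'horse']:
    print(other, acc('chair', other), round(wup('chair', other), 3))
```

In this toy taxonomy Accuracy is 0 for all three pairs, while the Wu-Palmer Similarity decreases from 'armchair' to 'wardrobe' to 'horse'.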

![small taxonomy](fig/small_taxonomy.jpg)

```
What is WuP(Dog, Horse) and WuP(Dog, Dalmatian) according to the ontology above? 
Can you also calculate Acc(Dog, Horse) and Acc(Dog, Dalmatian)?
What are your conclusions?
```

### WUPS

Wu-Palmer Similarity depends on an ontology. One popular, large ontology is [WordNet](https://wordnet.princeton.edu). Although Wu-Palmer Similarity may work well on shallow ontologies, we are rather interested in ontologies with hundreds or even thousands of categories. In an indoor scenario, it turns out that many indoor things share similar levels in the taxonomy, and hence the Wu-Palmer Similarities between them are high and differ only slightly from each other.

The code below exemplifies the issue.

In [13]:
from nltk.corpus import wordnet as wn
armchair_synset = wn.synset('armchair.n.01')
chair_synset = wn.synset('chair.n.01')
wardrobe_synset = wn.synset('wardrobe.n.01')

print(armchair_synset.wup_similarity(armchair_synset))
print(armchair_synset.wup_similarity(chair_synset))
print(armchair_synset.wup_similarity(wardrobe_synset))
wn.synset('chair.n.01').wup_similarity(wn.synset('person.n.01'))

1.0
0.952380952381
0.8


0.47058823529411764

From the code we see that 'armchair' and 'wardrobe' are surprisingly close to each other. This is because, in large ontologies like [WordNet](https://wordnet.princeton.edu), all the indoor things are essentially 'indoor things'.


This issue has motivated us to define a thresholded Wu-Palmer Similarity score, defined as follows
$$
\begin{array}{rl}
WuP(a,b) & \text{if}\; WuP(a,b) \ge \tau \\
0.1 \cdot WuP(a,b) & \text{otherwise}
\end{array}
$$
where $\tau$ is a hand-chosen threshold. Empirically, we found that $\tau=0.9$ works fine on DAQUAR.
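
The thresholding can be sketched in a few lines. The function name `thresholded_wup` is ours, chosen for this example; the scores are the plain Wu-Palmer Similarities printed by the cell above.

```python
def thresholded_wup(wup_score, tau=0.9):
    """Keep scores at or above the threshold, down-weight the rest by a factor of 0.1."""
    if wup_score >= tau:
        return wup_score
    return 0.1 * wup_score

# Plain Wu-Palmer scores of 'armchair' against each word, as printed earlier.
for name, score in [('armchair', 1.0), ('chair', 0.952380952381), ('wardrobe', 0.8)]:
    print(name, score, thresholded_wup(score))
```

With $\tau=0.9$, 'chair' keeps its high score, whereas 'wardrobe' drops from 0.8 to roughly 0.08.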

Moreover, since DAQUAR answers are sets of answer words, so that 'knife,fork' == 'fork,knife', we have extended the above measure to work with sets. We call it the Wu-Palmer Set score, or WUPS for short.

A detailed exposition of WUPS is beyond the scope of this tutorial, but the curious reader is encouraged to read the 'Performance Measure' paragraph in [our paper](http://arxiv.org/pdf/1410.0210v4.pdf). Note that the measure in [the paper](http://arxiv.org/pdf/1410.0210v4.pdf) is defined more broadly, and essentially abstracts away from any particular similarity such as Wu-Palmer Similarity. WUPS at 0.9 is WUPS with threshold $\tau=0.9$.

Although WUPS is conceptually as described here, technically it is slightly different, as it also needs to deal with synsets. Thus it's recommended to download the script from [here](http://datasets.d2.mpi-inf.mpg.de/mateusz14visual-turing/calculate_wups.py), or to re-implement it with caution.
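
As a rough illustration of the set extension, the sketch below follows the form given in the paper: for each question, take the minimum of two products, one over the predicted words and one over the ground truth words, each using the best match on the other side. The function names and the comma-separated answer format are assumptions made for this example. Unlike the official script, the sketch takes an arbitrary word similarity and does not deal with synsets, so treat it as a conceptual aid only.

```python
def set_score(predicted, truth, similarity):
    """Score one predicted answer set against one ground truth set."""
    def one_direction(source, target):
        product = 1.0
        for word in source:
            # best match of this word among the words on the other side
            product *= max(similarity(word, other) for other in target)
        return product
    return min(one_direction(predicted, truth), one_direction(truth, predicted))

def wups(predictions, ground_truths, similarity):
    """Average set score over all questions; answers are comma-separated strings."""
    total = 0.0
    for pred, gt in zip(predictions, ground_truths):
        total += set_score(set(pred.split(',')), set(gt.split(',')), similarity)
    return total / len(predictions)

def exact_match(a, b):
    # With this similarity the measure reduces to plain set-based accuracy.
    return 1.0 if a == b else 0.0

print(wups(['fork,knife', 'chair'], ['knife,fork', 'table'], exact_match))  # 0.5
```

Plugging in a thresholded Wu-Palmer Similarity instead of `exact_match` gives the soft variant discussed above.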

### Consensus

In this tutorial, we won't cover the consensus measure.
The curious reader is encouraged to read the 'Human Consensus' in the [Ask Your Neurons paper](https://www.d2.mpi-inf.mpg.de/sites/default/files/iccv15-neural_qa.pdf).

### A few caveats

A few caveats with WUPS, especially useful if you want to apply the measure to your own dataset.

__Lack of coverage__
Since WUPS is based on an ontology, it does not always recognise words. For instance, 'garbage bin' is missing, but 'garbage can' is perfectly fine. You can check this yourself, either with the source code provided above, or by using [this online script](http://wordnetweb.princeton.edu/perl/webwn).

__Synsets__
If you execute 
```python
wn.synsets('chair')
```
you will notice a list with many elements. Each element is a [synset](https://en.wikipedia.org/wiki/Synonym_ring), that is, a set of semantically equivalent words, and different elements correspond to different senses of 'chair'. You can check their definitions, for instance
```python
wn.synset('chair.n.03').definition()
```
indicates a person. Indeed, the following has quite high value
```python
wn.synset('chair.n.03').wup_similarity(wn.synset('person.n.01'))
```
but this one has the preferred, low value
```python
wn.synset('chair.n.01').wup_similarity(wn.synset('person.n.01'))
```
How do we deal with this? In DAQUAR we take an optimistic perspective and always consider the highest similarity score. This works with WUPS at 0.9 and a restricted indoor domain with a vocabulary taken only from the training set. This may not hold in other domains, though.
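
The optimistic strategy can be sketched as a maximum over all pairs of senses. The helper `optimistic_similarity` and the `TOY_SCORES` table are assumptions made for this example so that it runs without WordNet. Only the `chair.n.01` value is taken from the output shown earlier; the `chair.n.03` value is a made-up placeholder.

```python
def optimistic_similarity(senses_a, senses_b, similarity):
    """Highest similarity over all pairs of senses; missing scores count as 0."""
    best = 0.0
    for sense_a in senses_a:
        for sense_b in senses_b:
            score = similarity(sense_a, sense_b)
            if score is not None and score > best:
                best = score
    return best

# Stand-in similarity table (hypothetical, so the example runs without WordNet).
TOY_SCORES = {
    ('chair.n.01', 'person.n.01'): 0.47058823529411764,  # from the output shown earlier
    ('chair.n.03', 'person.n.01'): 0.9,                  # placeholder, not a real WordNet score
}

def toy_similarity(a, b):
    return TOY_SCORES.get((a, b))

print(optimistic_similarity(['chair.n.01', 'chair.n.03'], ['person.n.01'], toy_similarity))  # 0.9
```

With NLTK, the same helper could be called with `wn.synsets('chair')`, `wn.synsets('person')`, and `lambda a, b: a.wup_similarity(b)`. The example also shows why the optimistic choice can be risky outside a restricted domain: the 'person' sense of 'chair' wins.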

__Ontology__
Since WUPS is based on an ontology, specifically on WordNet, it may give different scores on different ontologies, or even on different versions of the same ontology.

__Threshold__
A good threshold $\tau$ is dataset dependent. In our case $\tau = 0.9$ seems to work well, while $\tau = 0.0$ is too forgiving and is reported mainly for 'historical' reasons. However, following our papers, you should still consider reporting plain set-based accuracy scores (so that Acc('knife,fork','fork,knife')==1; it can be computed with WUPS -1 using [our script](http://datasets.d2.mpi-inf.mpg.de/mateusz14visual-turing/calculate_wups.py)), as this metric is widely recognised.

### Summary

WUPS is an evaluation measure that works with sets and word-level ambiguities. Arguably, WUPS at 0.9 is the most practical measure.

# New Predictions

### Predictions - BOW

With more and more iterations we can increase training accuracy; however, our goal is to see how well the models generalise. For that, we use a previously unseen test set.

In [14]:
test_text_representation = dp['text'](train_or_test='test')
test_raw_x = test_text_representation['x']
test_one_hot_x = encode_questions_index(test_raw_x, word2index_x)
test_x = sequence.pad_sequences(test_one_hot_x, maxlen=MAXLEN)

Given encoded test questions, we follow the maximum likelihood principle to pick the answers.

In [15]:
from numpy import argmax
# predict the probabilities for every word
predictions_scores = text_bow_model.predict([test_x])
print(predictions_scores.shape)
# follow the maximum likelihood principle, and get the best indices to vocabulary
predictions_best = argmax(predictions_scores, axis=-1)
print(predictions_best.shape)
# decode the predicted indices into word answers
predictions_answers = [index2word_y[x] for x in predictions_best]
print(len(predictions_answers))


We can now evaluate the answers using [WUPS scores](#Evaluation-Measures). For this tutorial, we care only about Accuracy, and WUPS at 0.9.

In [None]:
from kraino.utils import print_metrics
test_raw_y = test_text_representation['y']
_ = print_metrics.select['wups'](
        gt_list=test_raw_y,
        pred_list=predictions_answers,
        verbose=1,
        extra_vars=None)

Let's also see the predictions.


In [None]:
from numpy import random
test_image_name_list = test_text_representation['img_name']
indices_to_see = random.randint(low=0, high=len(test_image_name_list), size=5)
for index_now in indices_to_see:
    print(test_raw_x[index_now], predictions_answers[index_now])

```
Do you agree with the answers given above? What are your guesses?
Of course, neither you nor the model has seen any images so far.
```

But, what if you actually see the images? 
```
Execute the code below.
Do your answers change after seeing the images?
```

In [None]:
from matplotlib.pyplot import axis
from matplotlib.pyplot import figure
from matplotlib.pyplot import imshow

import numpy as np
from PIL import Image

%matplotlib inline
for index_now in indices_to_see:
    image_name_now = test_image_name_list[index_now]
    pil_im = Image.open('data/daquar/images/{0}.png'.format(image_name_now), 'r')
    fig = figure()
    fig.text(.2,.05,test_raw_x[index_now], fontsize=14)
    axis('off')
    imshow(np.asarray(pil_im))

Finally, let's also see the ground truths.

In [None]:
print('question, prediction, ground truth answer')
for index_now in indices_to_see:
    print(test_raw_x[index_now], predictions_answers[index_now], test_raw_y[index_now])

In the code above, we have sampled questions at random, so different executions may give different answers.

### Predictions - RNN

Curious how predictions with blind RNN went? 

This time, we will use Kraino to make the prediction code shorter.

In [None]:
from kraino.core.model_zoo import word_generator
# we first need to add word_generator to _config (we could have done this before, in the Config constructor)
# we use maximum likelihood as a word generator
text_rnn_model._config.word_generator = word_generator['max_likelihood']
predictions_answers = text_rnn_model.decode_predictions(
    X=test_x,
    temperature=None,
    index2word=index2word_y,
    verbose=0)

In [None]:
_ = print_metrics.select['wups'](
        gt_list=test_raw_y,
        pred_list=predictions_answers,
        verbose=1,
        extra_vars=None)

```
Visualise question, predicted answers, ground truth answers as before.
Check also images.
```

# Visual Features

We won't go very far using only textual features. Hence, it's now time to consider their visual counterpart.

As shown in the figure below, a quite common procedure works as follows:
* Use a CNN already pre-trained on some large-scale classification task; most often it is [ImageNet](http://image-net.org) with $1000$ classes for recognition.
* 'Chop off' the CNN after some layer. We will use the responses of that layer as visual features.

In this tutorial, we can use features extracted from the second-to-last, $4096$-dimensional layer of [VGG NET-19](http://arxiv.org/pdf/1409.1556.pdf) (`vgg_net`, `fc7`); note that the cell below is configured for the $2048$-dimensional ResNet features (`fb_resnet`) instead. We have already extracted the features in advance using [Caffe](http://caffe.berkeleyvision.org) - another excellent framework for deep learning, particularly good for CNNs.

![features extractor](fig/features_extractor.jpg)

Please run the cell below in order to get visual features aligned with the textual features.

In [16]:
# this contains a list of the image names of our interest; 
# it also makes sure that visual and textual features are aligned correspondingly
train_image_names = train_text_representation['img_name']
# the name for visual features that we use
# CNN_NAME='vgg_net'
# CNN_NAME='googlenet'
CNN_NAME='fb_resnet'
# the layer in CNN that is used to extract features
# PERCEPTION_LAYER='fc7'
# PERCEPTION_LAYER='pool5-7x7_s1'
# PERCEPTION_LAYER='res5c-152'
PERCEPTION_LAYER='l2_res5c-152' # l2 prefix since these are l2-normalized visual features

train_visual_features = dp['perception'](
    train_or_test='train',
    names_list=train_image_names,
    parts_extractor=None,
    max_parts=None,
    perception=CNN_NAME,
    layer=PERCEPTION_LAYER,
    second_layer=None
    )
train_visual_features.shape

Shuffling memories ...
Skipped images 0 of them:


(6795, 2048)
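
The layer name above mentions l2-normalized features. Below is a minimal sketch of what l2 normalization does to a single feature vector, assuming the usual meaning (division by the Euclidean norm). The helper name `l2_normalize` is ours, and the provided features may have been computed differently.

```python
import math

def l2_normalize(vector, eps=1e-12):
    """Scale a feature vector to unit Euclidean length."""
    norm = math.sqrt(sum(value * value for value in vector))
    # eps guards against division by zero for an all-zero vector
    return [value / max(norm, eps) for value in vector]

feature = [3.0, 4.0]
unit = l2_normalize(feature)
print(unit)
```

After normalization all feature vectors have the same length, so only their direction carries information.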

# Vision+Language

Since we are talking about answering questions about images, we likely need images too :)

Take a look at the figure below one more time. How far can you go by blind guesses? 

![challenges](fig/challenges.jpg)

Let's create an input as a pair of textual and visual features.

In [17]:
train_input = [train_x, train_visual_features]

### BOW + Vision

As with the Language Only model, we start from a simpler BOW model that we combine with [visual features](#Visual-Features). Here, we will explore two ways of combining both modalities (the circle with 'C' in the figure below): concatenation and piece-wise multiplication. We will use CNN features extracted from the image, but for the sake of simplicity we won't backprop to fine-tune the visual representation (the dotted line in the figure below symbolizes the barrier that blocks backprop). Although in our [Ask Your Neurons](http://arxiv.org/abs/1505.01121) paper fine-tuning the last layer was actually important, the benefits of end-to-end training on DAQUAR or the larger [VQA](http://visualqa.org) dataset remain an open question.

![BOW_vision](fig/BOW_vision_model.jpg)
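
Before defining the model, here is a small sketch of what the fusion step does to two feature vectors. It uses plain Python lists instead of Keras layers, and the function `merge` with its mode names is an assumption made for this example; it only mirrors the idea behind the merge modes used below.

```python
def merge(text_vec, visual_vec, mode):
    """Combine two feature vectors given as plain Python lists."""
    if mode == 'concat':
        return text_vec + visual_vec
    if len(text_vec) != len(visual_vec):
        raise ValueError('element-wise modes need vectors of equal length')
    if mode == 'mul':
        return [t * v for t, v in zip(text_vec, visual_vec)]
    if mode == 'sum':
        return [t + v for t, v in zip(text_vec, visual_vec)]
    raise ValueError('unknown mode: ' + mode)

text_features = [0.5, -1.0, 2.0]
visual_features = [2.0, 3.0, 0.5]
print(merge(text_features, visual_features, 'concat'))  # [0.5, -1.0, 2.0, 2.0, 3.0, 0.5]
print(merge(text_features, visual_features, 'mul'))     # [1.0, -3.0, 1.0]
```

Concatenation keeps both representations side by side and lets the classifier weigh them, while multiplication makes every output dimension depend on both modalities at once.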

In [None]:
#== Model definition

# First we define a model using keras/kraino
from keras.models import Sequential
from keras.layers.core import Activation
from keras.layers.core import Dense
from keras.layers.core import Dropout
from keras.layers.core import Layer
from keras.layers.core import Merge
from keras.layers.core import TimeDistributedMerge
from keras.layers.embeddings import Embedding

from kraino.core.model_zoo import AbstractSequentialModel
from kraino.core.model_zoo import AbstractSingleAnswer
from kraino.core.model_zoo import AbstractSequentialMultiplewordAnswer
from kraino.core.model_zoo import Config
from kraino.core.keras_extensions import DropMask
from kraino.core.keras_extensions import LambdaWithMask
from kraino.core.keras_extensions import time_distributed_masked_ave

# This model inherits from AbstractSingleAnswer, and so it produces single answer words
# To use multiple answer words, you need to inherit from AbstractSequentialMultiplewordAnswer
class VisionLanguageBOW(AbstractSequentialModel, AbstractSingleAnswer):
    """
    BOW Vision+Language model that produces single word answers.
    """
    def create(self):
        language_model = Sequential()
        language_model.add(Embedding(
                self._config.input_dim, 
                self._config.textual_embedding_dim, 
                mask_zero=True))
        language_model.add(LambdaWithMask(
                time_distributed_masked_ave, 
                output_shape=[language_model.output_shape[2]]))
        language_model.add(DropMask())
        visual_model = Sequential()
        if self._config.visual_embedding_dim > 0:
            visual_model.add(Dense(
                    self._config.visual_embedding_dim,
                    input_shape=(self._config.visual_dim,)))
        else:
            visual_model.add(Layer(input_shape=(self._config.visual_dim,)))
        self.add(Merge([language_model, visual_model], mode=self._config.multimodal_merge_mode))
        self.add(Dropout(0.5))
        self.add(Dense(self._config.output_dim))
        self.add(Activation('softmax'))
        

In [None]:
# dimensionality of embeddings
EMBEDDING_DIM = 500
# kind of multimodal fusion (ave, concat, mul, sum)
MULTIMODAL_MERGE_MODE = 'concat'

model_config = Config(
    textual_embedding_dim=EMBEDDING_DIM,
    visual_embedding_dim=0,
    multimodal_merge_mode=MULTIMODAL_MERGE_MODE,
    input_dim=len(word2index_x.keys()),
    output_dim=len(word2index_y.keys()),
    visual_dim=train_visual_features.shape[1])
model = VisionLanguageBOW(model_config)
model.create()
model.compile(
    loss='categorical_crossentropy', 
    optimizer='adam')

Now, we can train the model.

In [None]:
#== Model training
model.fit(
    train_input, 
    train_y,
    batch_size=512,
    nb_epoch=40,
    validation_split=0.1,
    show_accuracy=True)

Interestingly, if we use piece-wise multiplication to merge both modalities, we get better results.


In [None]:
# dimensionality of embeddings
EMBEDDING_DIM = 500
# kind of multimodal fusion (ave, concat, mul, sum)
MULTIMODAL_MERGE_MODE = 'mul'

model_config = Config(
    textual_embedding_dim=EMBEDDING_DIM,
    visual_embedding_dim=EMBEDDING_DIM,
    multimodal_merge_mode=MULTIMODAL_MERGE_MODE,
    input_dim=len(word2index_x.keys()),
    output_dim=len(word2index_y.keys()),
    visual_dim=train_visual_features.shape[1])
model = VisionLanguageBOW(model_config)
model.create()
model.compile(
    loss='categorical_crossentropy', 
    optimizer='adam')
text_image_bow_model = model

```
If we merge language and visual features with 'mul', do we need to set both embeddings to have the same number  of dimensions (textual_embedding_dim == visual_embedding_dim)?
```

In [None]:
#== Model training
text_image_bow_model.fit(
    train_input, 
    train_y,
    batch_size=512,
    nb_epoch=40,
    validation_split=0.1,
    show_accuracy=True)

### RNN + Vision

Now, we will repeat the BOW experiments but with RNN.

![LSTM_vision](fig/LSTM_vision_model.jpg)

In [21]:
#== Model definition

# First we define a model using keras/kraino
from keras.models import Sequential
from keras.layers.core import Activation
from keras.layers.core import Dense
from keras.layers.core import Dropout
from keras.layers.core import Layer
from keras.layers.core import Merge
from keras.layers.core import TimeDistributedMerge
from keras.layers.embeddings import Embedding
from keras.layers.recurrent import GRU
from keras.layers.recurrent import LSTM

from kraino.core.model_zoo import AbstractSequentialModel
from kraino.core.model_zoo import AbstractSingleAnswer
from kraino.core.model_zoo import AbstractSequentialMultiplewordAnswer
from kraino.core.model_zoo import Config
from kraino.core.keras_extensions import DropMask
from kraino.core.keras_extensions import LambdaWithMask
from kraino.core.keras_extensions import time_distributed_masked_ave

# This model inherits from AbstractSingleAnswer, and so it produces single answer words
# To use multiple answer words, you need to inherit from AbstractSequentialMultiplewordAnswer
class VisionLanguageLSTM(AbstractSequentialModel, AbstractSingleAnswer):
    """
    RNN Vision+Language model that produces single word answers.
    """
    def create(self):
        language_model = Sequential()
        language_model.add(Embedding(
                self._config.input_dim, 
                self._config.textual_embedding_dim, 
                mask_zero=True))
        #TODO: Replace averaging with RNN (you can choose between LSTM and GRU)
#         language_model.add(LambdaWithMask(time_distributed_masked_ave, output_shape=[self.output_shape[2]]))
        language_model.add(LSTM(self._config.hidden_state_dim, 
                      return_sequences=False))

        visual_model = Sequential()
        if self._config.visual_embedding_dim > 0:
            visual_model.add(Dense(
                    self._config.visual_embedding_dim,
                    input_shape=(self._config.visual_dim,)))
        else:
            visual_model.add(Layer(input_shape=(self._config.visual_dim,)))
        self.add(Merge([language_model, visual_model], mode=self._config.multimodal_merge_mode))
        self.add(Dropout(0.5))
        self.add(Dense(self._config.output_dim))
        self.add(Activation('softmax'))
        
        
# dimensionality of embeddings
EMBEDDING_DIM = 500
# kind of multimodal fusion (ave, concat, mul, sum)
MULTIMODAL_MERGE_MODE = 'sum'

model_config = Config(
    textual_embedding_dim=EMBEDDING_DIM,
    visual_embedding_dim=EMBEDDING_DIM,
    hidden_state_dim=EMBEDDING_DIM,
    multimodal_merge_mode=MULTIMODAL_MERGE_MODE,
    input_dim=len(word2index_x.keys()),
    output_dim=len(word2index_y.keys()),
    visual_dim=train_visual_features.shape[1])
model = VisionLanguageLSTM(model_config)
model.create()
model.compile(
    loss='categorical_crossentropy', 
    optimizer='adam')
text_image_rnn_model = model

### Batch Size

So, again, let's train the model (if the following cell crashes, please move to the next cell).

In [None]:
#== Model training
text_image_rnn_model.fit(
    train_input, 
    train_y,
    batch_size=5500,
    nb_epoch=40,
    validation_split=0.1,
    show_accuracy=True)

Oops, apparently we have run out of GPU memory. Note how large our batches are!
Let's make them much smaller (argument batch_size).

In [None]:
#== Model training
text_image_rnn_model.fit(
    train_input, 
    train_y,
    batch_size=1,
    nb_epoch=1,
    validation_split=0.1,
    show_accuracy=True)

OK. Please stop it! Batch size 1 is not good either: training is very slow.
Let's use the standard batch size of 512. Please re-run the cell with the model definition, and then run the cell below.

In [22]:
#== Model training
text_image_rnn_model.fit(
    train_input, 
    train_y,
    batch_size=512,
    nb_epoch=40,
    validation_split=0.1,
    show_accuracy=True)

Train on 6115 samples, validate on 680 samples

```
Can you explain both issues regarding the batch size? Why is training impossible in the first case, and very tedious in the second case?

When do you get the best performance, with multiplication, concatenation, or summation?
```
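
As a hint for the first question, a little arithmetic helps. The sketch below counts how many parameter updates one epoch takes for each batch size, and how many visual feature values a single batch holds. The sample count and feature dimensionality come from the shape of `train_visual_features` printed earlier; the function name and the truncating split are assumptions made for this example.

```python
import math

def updates_per_epoch(num_train_samples, batch_size):
    """Number of parameter updates needed to go through the training data once."""
    return int(math.ceil(num_train_samples / float(batch_size)))

num_samples = 6795                          # rows of train_visual_features
num_train = int(num_samples * (1 - 0.1))    # samples left after validation_split=0.1
feature_dim = 2048                          # columns of train_visual_features

for batch_size in [5500, 512, 1]:
    print(batch_size, updates_per_epoch(num_train, batch_size), batch_size * feature_dim)
```

A very large batch needs only a couple of updates per epoch but has to fit in memory all at once, together with the activations of every layer. A batch of one sample fits easily but needs thousands of small, poorly parallelised updates per epoch.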

__Summary__

As before, using an RNN makes the sequence processing order-aware. This time, however, we combine two modalities so that the whole model 'sees' the image. Finally, how the two modalities are combined matters: we have found that piece-wise multiplication outperforms traditional concatenation.

# New Predictions with Vision+Language

### Predictions (Features)

In [None]:
test_image_names = test_text_representation['img_name']
test_visual_features = dp['perception'](
    train_or_test='test',
    names_list=test_image_names,
    parts_extractor=None,
    max_parts=None,
    perception=CNN_NAME,
    layer=PERCEPTION_LAYER,
    second_layer=None
    )
test_visual_features.shape

In [None]:
test_input = [test_x, test_visual_features]

### Predictions (BOW with Vision)

Let's evaluate the Vision+Language architectures as well.

In [None]:
from kraino.core.model_zoo import word_generator
# we first need to add word_generator to _config (we could have done this before, in the Config constructor)
# we use maximum likelihood as a word generator
text_image_bow_model._config.word_generator = word_generator['max_likelihood']
predictions_answers = text_image_bow_model.decode_predictions(
    X=test_input,
    temperature=None,
    index2word=index2word_y,
    verbose=0)

In [None]:
_ = print_metrics.select['wups'](
        gt_list=test_raw_y,
        pred_list=predictions_answers,
        verbose=1,
        extra_vars=None)

### Predictions (RNN with Vision)

In [None]:
from kraino.core.model_zoo import word_generator
# we first need to add word_generator to _config (we could have done this before, in the Config constructor)
# we use maximum likelihood as a word generator
text_image_rnn_model._config.word_generator = word_generator['max_likelihood']
predictions_answers = text_image_rnn_model.decode_predictions(
    X=test_input,
    temperature=None,
    index2word=index2word_y,
    verbose=0)

In [None]:
_ = print_metrics.select['wups'](
        gt_list=test_raw_y,
        pred_list=predictions_answers,
        verbose=1,
        extra_vars=None)

# VQA

The models that we have built so far can be transferred to other datasets.
Let's consider the recently introduced large-scale [VQA](http://visualqa.org) (visual question answering) dataset built on top of [COCO](http://mscoco.org). In this section, we will train and evaluate VQA models. Since we are using all the pieces introduced before, we will go quickly into coding. For the sake of simplicity, we will use only BOW architectures, but you are free to experiment with RNNs. Please also pay attention to the comments.

Since VQA hides the test data for the purpose of the challenge, we will use the publicly available validation set to evaluate the architectures.

### VQA Language Features

In [None]:
#TODO: Execute the following procedure (Shift+Enter)
from kraino.utils import data_provider

vqa_dp = data_provider.select['vqa-real_images-open_ended']
# VQA has a few answers associated with one question. 
# We take the most frequently occurring answers (single_frequent).
# The formal argument 'keep_top_qa_pairs' allows us to filter out rare answers together with the associated questions.
# Here we pass 1000; pass 0 to keep all question-answer pairs and see how the results differ
vqa_train_text_representation = vqa_dp['text'](
    train_or_test='train',
    answer_mode='single_frequent',
    keep_top_qa_pairs=1000)
vqa_val_text_representation = vqa_dp['text'](
    train_or_test='val',
    answer_mode='single_frequent')

In [None]:
from toolz import frequencies
vqa_train_raw_x = vqa_train_text_representation['x']
vqa_train_raw_y = vqa_train_text_representation['y']
vqa_val_raw_x = vqa_val_text_representation['x']
vqa_val_raw_y = vqa_val_text_representation['y']
# we start from building the frequencies table
vqa_wordcount_x = frequencies(' '.join(vqa_train_raw_x).split(' '))
# we can keep all answer words in the answer as a class
# therefore we use an artificial split symbol '{' to not split the answer into words
# you can see the difference if you replace '{' with ' ' and print vqa_wordcount_y
vqa_wordcount_y = frequencies('{'.join(vqa_train_raw_y).split('{'))
vqa_wordcount_y
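
The effect of the artificial split symbol is easy to see on a tiny example. The sketch uses `collections.Counter` as a stand-in for `toolz.frequencies`, and the answer list is made up for illustration; it also assumes that '{' never occurs inside an answer.

```python
from collections import Counter

answers = ['dining table', 'yes', 'dining table', 'tennis racket']

# Joining and splitting on '{' keeps each answer intact.
whole_answers = Counter('{'.join(answers).split('{'))
# Joining and splitting on ' ' breaks multi-word answers into single words.
single_words = Counter(' '.join(answers).split(' '))

print(whole_answers['dining table'], single_words['dining'], single_words['table'])  # 2 2 2
```

With '{' the answer 'dining table' stays one class, whereas splitting on spaces turns it into two unrelated classes, 'dining' and 'table'.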

### Language-Only

In [None]:
from keras.preprocessing import sequence
from kraino.utils.input_output_space import build_vocabulary
from kraino.utils.input_output_space import encode_questions_index
from kraino.utils.input_output_space import encode_answers_one_hot
MAXLEN=30
vqa_word2index_x, vqa_index2word_x = build_vocabulary(this_wordcount = vqa_wordcount_x)
vqa_word2index_y, vqa_index2word_y = build_vocabulary(this_wordcount = vqa_wordcount_y)
vqa_train_x = sequence.pad_sequences(encode_questions_index(vqa_train_raw_x, vqa_word2index_x), maxlen=MAXLEN)
vqa_val_x = sequence.pad_sequences(encode_questions_index(vqa_val_raw_x, vqa_word2index_x), maxlen=MAXLEN)
vqa_train_y, _ = encode_answers_one_hot(
    vqa_train_raw_y, 
    vqa_word2index_y, 
    answer_words_delimiter=vqa_train_text_representation['answer_words_delimiter'],
    is_only_first_answer_word=True,
    max_answer_time_steps=1)
vqa_val_y, _ = encode_answers_one_hot(
    vqa_val_raw_y, 
    vqa_word2index_y, 
    answer_words_delimiter=vqa_train_text_representation['answer_words_delimiter'],
    is_only_first_answer_word=True,
    max_answer_time_steps=1)
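
To give an intuition for the encoding step, the sketch below builds a vocabulary from word counts, turns a question into a list of indices, and pads it to a fixed length. It is not Kraino's implementation: the function names are hypothetical, unknown words are simply dropped, and index 0 is reserved for padding to match `mask_zero=True` in the embedding layer. Padding and truncating at the front mirrors the default behaviour of Keras `pad_sequences`.

```python
def build_toy_vocabulary(word_counts):
    """Map each word to an index, reserving 0 for padding."""
    # most frequent words first, ties broken alphabetically
    words = sorted(word_counts, key=lambda w: (-word_counts[w], w))
    word2index = dict((word, i + 1) for i, word in enumerate(words))
    index2word = dict((i, word) for word, i in word2index.items())
    return word2index, index2word

def encode_and_pad(question, word2index, maxlen):
    """Turn a question into a fixed-length list of indices, padded with 0 on the left."""
    indices = [word2index[w] for w in question.split(' ') if w in word2index]
    indices = indices[-maxlen:]  # keep the last maxlen words if the question is too long
    return [0] * (maxlen - len(indices)) + indices

counts = {'what': 3, 'is': 2, 'on': 1, 'the': 2, 'table': 1}
word2index, index2word = build_toy_vocabulary(counts)
print(encode_and_pad('what is on the table', word2index, maxlen=8))  # [0, 0, 0, 1, 2, 4, 3, 5]
```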


In [None]:
from kraino.core.model_zoo import Config
from kraino.core.model_zoo import word_generator
# We are re-using the BlindBOW model
# Please make sure you have run the cell with the class definition
# VQA is larger, so we can increase the dimensionality of the embedding
vqa_model_config = Config(
    textual_embedding_dim=1000,
    input_dim=len(vqa_word2index_x.keys()),
    output_dim=len(vqa_word2index_y.keys()),
    word_generator = word_generator['max_likelihood'])
vqa_text_bow_model = BlindBOW(vqa_model_config)
vqa_text_bow_model.create()
vqa_text_bow_model.compile(
    loss='categorical_crossentropy', 
    optimizer='adam')

In [None]:
vqa_text_bow_model.fit(
    vqa_train_x, 
    vqa_train_y,
    batch_size=512,
    nb_epoch=10,
    validation_split=0.1,
    show_accuracy=True)

In [None]:
# word_generator was already set in the Config constructor above,
# with maximum likelihood as the word generator
vqa_predictions_answers = vqa_text_bow_model.decode_predictions(
    X=vqa_val_x,
    temperature=None,
    index2word=vqa_index2word_y,
    verbose=0)
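
As a rough picture of what maximum likelihood decoding means here, the sketch below maps each row of class probabilities to its most probable answer word. It is a simplified stand-in, not the body of `decode_predictions`; the function name `decode_max_likelihood` and the toy probabilities are assumptions for this example.

```python
def decode_max_likelihood(probabilities, index2word):
    """Pick the most probable answer word for every example."""
    decoded = []
    for row in probabilities:
        # index of the largest probability in this row
        best_index = max(range(len(row)), key=lambda i: row[i])
        decoded.append(index2word[best_index])
    return decoded


# toy data (hypothetical)
toy_index2word = {0: 'yes', 1: 'no', 2: 'two'}
toy_probabilities = [[0.7, 0.2, 0.1],
                     [0.1, 0.3, 0.6]]
toy_decoded = decode_max_likelihood(toy_probabilities, toy_index2word)
```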

In [None]:
# Evaluating on VQA is unfortunately less transparent:
# we need an extra VQA object.
vqa_vars = {
    'question_id':vqa_val_text_representation['question_id'],
    'vqa_object':vqa_val_text_representation['vqa_object'],
    'resfun': 
        lambda x: \
            vqa_val_text_representation['vqa_object'].loadRes(x, vqa_val_text_representation['questions_path'])
}

In [None]:
from kraino.utils import print_metrics


_ = print_metrics.select['vqa'](
        gt_list=vqa_val_raw_y,
        pred_list=vqa_predictions_answers,
        verbose=1,
        extra_vars=vqa_vars)
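
For intuition about the numbers reported by the VQA metric, the accuracy of a single predicted answer is commonly summarised as min(number of annotators who gave that answer / 3, 1). The sketch below implements only this simplified form; the official evaluation tools additionally normalise the answer strings, so treat this as an approximation. The function name `vqa_accuracy` and the toy annotations are assumptions for this example.

```python
def vqa_accuracy(prediction, human_answers):
    """Simplified VQA accuracy: full credit if at least 3 annotators agree."""
    matches = sum(1 for answer in human_answers if answer == prediction)
    return min(matches / 3.0, 1.0)


# toy annotations from ten (hypothetical) annotators
toy_human_answers = ['yes'] * 2 + ['no'] * 8
score_minority = vqa_accuracy('yes', toy_human_answers)
score_majority = vqa_accuracy('no', toy_human_answers)
```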

### VQA Language+Vision

In [None]:
# the name for visual features that we use
VQA_CNN_NAME='vgg_net'
# VQA_CNN_NAME='googlenet'
# the layer in CNN that is used to extract features
VQA_PERCEPTION_LAYER='fc7'
# VQA_PERCEPTION_LAYER='pool5-7x7_s1'

vqa_train_visual_features = vqa_dp['perception'](
    train_or_test='train',
    names_list=vqa_train_text_representation['img_name'],
    parts_extractor=None,
    max_parts=None,
    perception=VQA_CNN_NAME,
    layer=VQA_PERCEPTION_LAYER,
    second_layer=None
    )
vqa_train_visual_features.shape

In [None]:
vqa_val_visual_features = vqa_dp['perception'](
    train_or_test='val',
    names_list=vqa_val_text_representation['img_name'],
    parts_extractor=None,
    max_parts=None,
    perception=VQA_CNN_NAME,
    layer=VQA_PERCEPTION_LAYER,
    second_layer=None
    )
vqa_val_visual_features.shape

In [None]:
from kraino.core.model_zoo import Config
from kraino.core.model_zoo import word_generator

# dimensionality of embeddings
VQA_EMBEDDING_DIM = 1000
# kind of multimodal fusion (ave, concat, mul, sum)
VQA_MULTIMODAL_MERGE_MODE = 'mul'

vqa_model_config = Config(
    textual_embedding_dim=VQA_EMBEDDING_DIM,
    visual_embedding_dim=VQA_EMBEDDING_DIM,
    multimodal_merge_mode=VQA_MULTIMODAL_MERGE_MODE,
    input_dim=len(vqa_word2index_x.keys()),
    output_dim=len(vqa_word2index_y.keys()),
    visual_dim=vqa_train_visual_features.shape[1],
    word_generator=word_generator['max_likelihood'])
vqa_text_image_bow_model = VisionLanguageBOW(vqa_model_config)
vqa_text_image_bow_model.create()
vqa_text_image_bow_model.compile(
    loss='categorical_crossentropy', 
    optimizer='adam')
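
The four multimodal merge modes listed above (ave, concat, mul, sum) can be pictured with a few lines of NumPy. This is a sketch of the arithmetic only, not the Kraino or Keras merge layer; the function name `fuse` and the toy vectors are assumptions for this example.

```python
import numpy as np


def fuse(text_embedding, visual_embedding, mode):
    """Combine a textual and a visual embedding into one multimodal vector."""
    if mode == 'concat':
        return np.concatenate([text_embedding, visual_embedding], axis=-1)
    if mode == 'sum':
        return text_embedding + visual_embedding
    if mode == 'mul':
        return text_embedding * visual_embedding
    if mode == 'ave':
        return (text_embedding + visual_embedding) / 2.0
    raise ValueError('unknown merge mode: %s' % mode)


# toy embeddings with batch size 1 and 3 dimensions (hypothetical)
toy_text = np.array([[1.0, 2.0, 3.0]])
toy_visual = np.array([[4.0, 5.0, 6.0]])
fused = {mode: fuse(toy_text, toy_visual, mode)
         for mode in ('ave', 'concat', 'mul', 'sum')}
```

All modes except concat are element-wise, so both embeddings must have the same dimensionality. This is why the configuration above sets textual_embedding_dim and visual_embedding_dim to the same value.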

In [None]:
vqa_train_input = [vqa_train_x, vqa_train_visual_features]
vqa_val_input = [vqa_val_x, vqa_val_visual_features]

In [None]:
#== Model training
vqa_text_image_bow_model.fit(
    vqa_train_input, 
    vqa_train_y,
    batch_size=512,
    nb_epoch=10,
    validation_split=0.1,
    show_accuracy=True)

In [None]:
# word_generator was already passed to the Config constructor above
# we use maximum likelihood as the word generator
vqa_predictions_answers = vqa_text_image_bow_model.decode_predictions(
    X=vqa_val_input,
    temperature=None,
    index2word=vqa_index2word_y,
    verbose=0)

In [None]:
# Evaluating on VQA is unfortunately less transparent:
# we need an extra VQA object.
vqa_vars = {
    'question_id':vqa_val_text_representation['question_id'],
    'vqa_object':vqa_val_text_representation['vqa_object'],
    'resfun': 
        lambda x: \
            vqa_val_text_representation['vqa_object'].loadRes(x, vqa_val_text_representation['questions_path'])
}

In [None]:
from kraino.utils import print_metrics


_ = print_metrics.select['vqa'](
        gt_list=vqa_val_raw_y,
        pred_list=vqa_predictions_answers,
        verbose=1,
        extra_vars=vqa_vars)

# Kraino

For fast experimentation on the Visual Turing Test, we have prepared Kraino, which builds on top of Keras.
In this short section, you will see how to use it from the command line, example by example.

### Kraino on DAQUAR

We use a blind model with a temporal fusion of the question (equivalent of BOW).
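
To see why temporal fusion is equivalent to a bag of words, consider the following NumPy sketch: the word embeddings of a question are collapsed over the time axis, so the word order no longer matters. The function name `temporal_fusion`, the random embedding matrix, and the toy questions are assumptions for this example; the modes 'ave' and 'sum' correspond to the values seen for --temporal_fusion in this tutorial.

```python
import numpy as np


def temporal_fusion(word_embeddings, mode='ave'):
    """Collapse a (time_steps, dim) matrix into a single (dim,) vector."""
    if mode == 'ave':
        return word_embeddings.mean(axis=0)
    if mode == 'sum':
        return word_embeddings.sum(axis=0)
    raise ValueError('unknown temporal fusion: %s' % mode)


rng = np.random.RandomState(0)
toy_embedding_matrix = rng.randn(6, 4)   # 6 words in the vocabulary, 4 dimensions

question_a = [2, 3, 5]                   # word indices of a toy question
question_b = [5, 2, 3]                   # the same words in a different order

fused_a = temporal_fusion(toy_embedding_matrix[question_a])
fused_b = temporal_fusion(toy_embedding_matrix[question_b])
# fused_a and fused_b are identical up to rounding: the order is lost
```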

One era consists of max_epoch epochs. At the end of each era we gather statistics such as the model performance, or we dump the model weights. Since calculating WUPS scores is slow, we run several epochs before we output such information. In the example below we use 20 epochs per era (--max_epoch=20) and 2 eras (--max_era=2), so in total we perform 40 epochs.

In the example below we monitor WUPS scores on the test set (--metric=wups, --verbosity=monitor_test_metric), but we also use 10% of the training data for validation (--validation_split=0.1).
__Please remember to select the model based on the validation set, NOT the test set!__
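
The era/epoch schedule and the validation-based model selection can be sketched as a plain Python loop. This is not Kraino's training loop; `run_eras`, `train_one_epoch`, `evaluate_on_validation`, and the toy scores are hypothetical stand-ins used only to show the control flow.

```python
def run_eras(train_one_epoch, evaluate_on_validation, max_epoch, max_era):
    """Train for max_era * max_epoch epochs and evaluate once per era."""
    best_era, best_score = None, float('-inf')
    history = []
    for era in range(max_era):
        for _ in range(max_epoch):
            train_one_epoch()
        # the (slow) metric is computed only at the end of each era
        score = evaluate_on_validation()
        history.append(score)
        if score > best_score:
            best_era, best_score = era, score
    return best_era, best_score, history


# toy stand-ins for training and evaluation (hypothetical)
epochs_done = []
toy_validation_scores = iter([0.30, 0.35, 0.33])

best_era, best_score, history = run_eras(
    train_one_epoch=lambda: epochs_done.append(1),
    evaluate_on_validation=lambda: next(toy_validation_scores),
    max_epoch=5,
    max_era=3)
```

In a real run one would also dump the weights at the end of every era and later reload those of `best_era`.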

We use the one_hot vector representation. As an alternative, we could use --word_representation=dense with a pre-trained embedding such as Word2Vec or GloVe; in that case you need to download the pre-trained embeddings first.
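
The relation between the two word representations can be illustrated with NumPy: multiplying a one-hot vector by an embedding matrix selects one row of that matrix, which is exactly what a dense lookup does. The random matrix below is only a stand-in for real Word2Vec or GloVe vectors, and all names are assumptions for this example; how Kraino loads the pre-trained vectors is not shown here.

```python
import numpy as np

vocabulary_size, embedding_dim = 5, 3
rng = np.random.RandomState(0)
# stand-in for a pre-trained embedding table (one row per word)
toy_pretrained = rng.randn(vocabulary_size, embedding_dim)

word_index = 2
one_hot = np.zeros(vocabulary_size)
one_hot[word_index] = 1.0

dense_by_lookup = toy_pretrained[word_index]      # dense representation
dense_by_matmul = one_hot.dot(toy_pretrained)     # one-hot followed by a linear layer
```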

The code below may be slow due to WUPS calculations.
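
For readers who want to see what is being computed, below is a minimal sketch of the WUPS score for a single question, following the definition in the Multiworld paper as we understand it: for every predicted word take the best similarity to a ground-truth word, multiply these values, do the same in the other direction, and keep the smaller product; similarities below the threshold are scaled down by 0.1. The real measure uses Wu-Palmer similarity over WordNet, which is the slow part. Here `toy_similarity` and its table are hypothetical stand-ins so that the example is self-contained.

```python
def thresholded(similarity_value, threshold):
    """Down-weight similarities that fall below the threshold."""
    if similarity_value >= threshold:
        return similarity_value
    return 0.1 * similarity_value


def wups_single(answer_words, truth_words, similarity, threshold=0.9):
    """WUPS for one question; the dataset score is the mean over questions."""
    def directed(source, target):
        product = 1.0
        for s in source:
            product *= max(thresholded(similarity(s, t), threshold)
                           for t in target)
        return product
    return min(directed(answer_words, truth_words),
               directed(truth_words, answer_words))


# hypothetical similarity table replacing Wu-Palmer similarity on WordNet
toy_table = {('cup', 'mug'): 0.95, ('cup', 'chair'): 0.4}


def toy_similarity(a, b):
    if a == b:
        return 1.0
    return toy_table.get((a, b), toy_table.get((b, a), 0.0))


exact = wups_single(['cup'], ['cup'], toy_similarity)
close = wups_single(['cup'], ['mug'], toy_similarity, threshold=0.9)
far = wups_single(['cup'], ['chair'], toy_similarity, threshold=0.9)
far_no_threshold = wups_single(['cup'], ['chair'], toy_similarity, threshold=0.0)
```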

In [25]:
! python neural_solver.py --dataset=daquar-triples --model=sequential-blind-temporal_fusion-single_answer --validation_split=0.1 --metric=wups --max_epoch=20 --max_era=2 --verbosity=monitor_test_metric --word_representation=one_hot 

Using Theano backend.
Couldn't import dot_parser, loading of dot files will not be possible.
Using gpu device 1: Tesla K40m (CNMeM is disabled, CuDNN 4007)
{'LANGUAGE_CNN_VIEWS': 3, 'TEST_SUBSET_SIZE': -1, 'MERGE_MODE': 'ave', 'VERBOSITY': 'monitor_test_metric', 'MLP_HIDDEN_SIZE': 1000, 'METRIC': 'wups', 'PERCEPTION': 'googlenet', 'TRUNCATE_INPUT_SPACE': 0, 'MULTIMODAL_MERGE_MODE': 'concat', 'VISUALIZATION_FIG_LOSS_TITLE': 'Loss', 'RESULTS_FILENAME': 'results', 'HIDDEN_STATE_SIZE': 1000, 'PARTS_EXTRACTOR': 'whole', 'MODEL': 'sequential-blind-temporal_fusion-single_answer', 'IS_EARLY_STOPPING': False, 'LR_PATIENCE': 5, 'OPTIMIZER': 'adam', 'MAX_ERA': 2, 'VISUALIZATION_URL': 'default', 'MAX_EPOCH': 20, 'BATCH_SIZE': 755, 'DATASET': 'daquar-triples', 'MAX_MEMORY_TIME_STEPS': 35, 'LR': -1, 'PERCEPTION_SECOND_LAYER': '', 'TRAINING_SUBSET_SIZE': -1, 'IS_WHOLE_ANSWER_AS_ANSWER_WORD': False, 'VISUAL_HIDDEN_STATE_SIZE': 1000, 'TEXT_ENCODER': 'lstm', 'FUSION_LAYER_INDEX': 0, 'VISUALIZATION_FIG_M

Maybe we should use a smaller embedding layer with --textual_embedding_size=500 (500 dimensions).

In [26]:
! python neural_solver.py --textual_embedding_size=500 --dataset=daquar-triples --model=sequential-blind-temporal_fusion-single_answer --validation_split=0.1 --metric=wups --max_epoch=20 --max_era=2 --verbosity=monitor_test_metric --word_representation=one_hot 

Using Theano backend.
Couldn't import dot_parser, loading of dot files will not be possible.
Using gpu device 1: Tesla K40m (CNMeM is disabled, CuDNN 4007)
{'LANGUAGE_CNN_VIEWS': 3, 'TEST_SUBSET_SIZE': -1, 'MERGE_MODE': 'ave', 'VERBOSITY': 'monitor_test_metric', 'MLP_HIDDEN_SIZE': 1000, 'METRIC': 'wups', 'PERCEPTION': 'googlenet', 'TRUNCATE_INPUT_SPACE': 0, 'MULTIMODAL_MERGE_MODE': 'concat', 'VISUALIZATION_FIG_LOSS_TITLE': 'Loss', 'RESULTS_FILENAME': 'results', 'HIDDEN_STATE_SIZE': 1000, 'PARTS_EXTRACTOR': 'whole', 'MODEL': 'sequential-blind-temporal_fusion-single_answer', 'IS_EARLY_STOPPING': False, 'LR_PATIENCE': 5, 'OPTIMIZER': 'adam', 'MAX_ERA': 2, 'VISUALIZATION_URL': 'default', 'MAX_EPOCH': 20, 'BATCH_SIZE': 755, 'DATASET': 'daquar-triples', 'MAX_MEMORY_TIME_STEPS': 35, 'LR': -1, 'PERCEPTION_SECOND_LAYER': '', 'TRAINING_SUBSET_SIZE': -1, 'IS_WHOLE_ANSWER_AS_ANSWER_WORD': False, 'VISUAL_HIDDEN_STATE_SIZE': 1000, 'TEXT_ENCODER': 'lstm', 'FUSION_LAYER_INDEX': 0, 'VISUALIZATION_FIG_M

Now we replace the temporal fusion by the recurrent fusion (LSTM) with --model=sequential-blind-__recurrent_fusion__-single_answer.

In [27]:
! python neural_solver.py --dataset=daquar-triples --model=sequential-blind-recurrent_fusion-single_answer --validation_split=0.1 --metric=wups --max_epoch=20 --max_era=1 --verbosity=monitor_test_metric --word_representation=one_hot 

Using Theano backend.
Couldn't import dot_parser, loading of dot files will not be possible.
Using gpu device 1: Tesla K40m (CNMeM is disabled, CuDNN 4007)
{'LANGUAGE_CNN_VIEWS': 3, 'TEST_SUBSET_SIZE': -1, 'MERGE_MODE': 'ave', 'VERBOSITY': 'monitor_test_metric', 'MLP_HIDDEN_SIZE': 1000, 'METRIC': 'wups', 'PERCEPTION': 'googlenet', 'TRUNCATE_INPUT_SPACE': 0, 'MULTIMODAL_MERGE_MODE': 'concat', 'VISUALIZATION_FIG_LOSS_TITLE': 'Loss', 'RESULTS_FILENAME': 'results', 'HIDDEN_STATE_SIZE': 1000, 'PARTS_EXTRACTOR': 'whole', 'MODEL': 'sequential-blind-recurrent_fusion-single_answer', 'IS_EARLY_STOPPING': False, 'LR_PATIENCE': 5, 'OPTIMIZER': 'adam', 'MAX_ERA': 1, 'VISUALIZATION_URL': 'default', 'MAX_EPOCH': 20, 'BATCH_SIZE': 755, 'DATASET': 'daquar-triples', 'MAX_MEMORY_TIME_STEPS': 35, 'LR': -1, 'PERCEPTION_SECOND_LAYER': '', 'TRAINING_SUBSET_SIZE': -1, 'IS_WHOLE_ANSWER_AS_ANSWER_WORD': False, 'VISUAL_HIDDEN_STATE_SIZE': 1000, 'TEXT_ENCODER': 'lstm', 'FUSION_LAYER_INDEX': 0, 'VISUALIZATION_FIG_

We can easily replace LSTM by GRU as a question encoder with --text_encoder=gru.

In [28]:
! python neural_solver.py --text_encoder=gru --dataset=daquar-triples --model=sequential-blind-recurrent_fusion-single_answer --validation_split=0.1 --metric=wups --max_epoch=20 --max_era=1 --verbosity=monitor_test_metric --word_representation=one_hot  

Using Theano backend.
Couldn't import dot_parser, loading of dot files will not be possible.
Using gpu device 1: Tesla K40m (CNMeM is disabled, CuDNN 4007)
{'LANGUAGE_CNN_VIEWS': 3, 'TEST_SUBSET_SIZE': -1, 'MERGE_MODE': 'ave', 'VERBOSITY': 'monitor_test_metric', 'MLP_HIDDEN_SIZE': 1000, 'METRIC': 'wups', 'PERCEPTION': 'googlenet', 'TRUNCATE_INPUT_SPACE': 0, 'MULTIMODAL_MERGE_MODE': 'concat', 'VISUALIZATION_FIG_LOSS_TITLE': 'Loss', 'RESULTS_FILENAME': 'results', 'HIDDEN_STATE_SIZE': 1000, 'PARTS_EXTRACTOR': 'whole', 'MODEL': 'sequential-blind-recurrent_fusion-single_answer', 'IS_EARLY_STOPPING': False, 'LR_PATIENCE': 5, 'OPTIMIZER': 'adam', 'MAX_ERA': 1, 'VISUALIZATION_URL': 'default', 'MAX_EPOCH': 20, 'BATCH_SIZE': 755, 'DATASET': 'daquar-triples', 'MAX_MEMORY_TIME_STEPS': 35, 'LR': -1, 'PERCEPTION_SECOND_LAYER': '', 'TRAINING_SUBSET_SIZE': -1, 'IS_WHOLE_ANSWER_AS_ANSWER_WORD': False, 'VISUAL_HIDDEN_STATE_SIZE': 1000, 'TEXT_ENCODER': 'gru', 'FUSION_LAYER_INDEX': 0, 'VISUALIZATION_FIG_M

We can also use a one-dimensional CNN to represent questions with --model=sequential-blind-cnn_fusion-single_answer_with_temporal_fusion.
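
As a rough picture of this question encoder, the NumPy sketch below slides filters over consecutive word embeddings (a one-dimensional convolution along the time axis) and then sums the resulting feature maps over time. It is an illustration of the idea only, not the Kraino model: the filter width, the number of filters, the ReLU non-linearity, and all names are assumptions for this example.

```python
import numpy as np


def conv1d_valid(word_embeddings, filters):
    """word_embeddings: (time, dim); filters: (num_filters, width, dim)."""
    time_steps = word_embeddings.shape[0]
    num_filters, width, _ = filters.shape
    outputs = np.zeros((time_steps - width + 1, num_filters))
    for t in range(time_steps - width + 1):
        window = word_embeddings[t:t + width]            # (width, dim)
        # inner product of every filter with the current window
        outputs[t] = np.tensordot(filters, window, axes=([1, 2], [0, 1]))
    return np.maximum(outputs, 0.0)                      # ReLU (assumed)


rng = np.random.RandomState(0)
toy_question = rng.randn(7, 4)       # 7 words, 4-dimensional embeddings
toy_filters = rng.randn(6, 3, 4)     # 6 filters spanning 3 consecutive words

feature_maps = conv1d_valid(toy_question, toy_filters)
question_vector = feature_maps.sum(axis=0)               # temporal fusion by summation
```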

In [29]:
! python neural_solver.py --text_encoder=gru --dataset=daquar-triples --model=sequential-blind-cnn_fusion-single_answer_with_temporal_fusion --temporal_fusion=sum --validation_split=0.1 --metric=wups --max_epoch=20 --max_era=1 --verbosity=monitor_test_metric --word_representation=one_hot  

Using Theano backend.
Couldn't import dot_parser, loading of dot files will not be possible.
Using gpu device 1: Tesla K40m (CNMeM is disabled, CuDNN 4007)
{'LANGUAGE_CNN_VIEWS': 3, 'TEST_SUBSET_SIZE': -1, 'MERGE_MODE': 'sum', 'VERBOSITY': 'monitor_test_metric', 'MLP_HIDDEN_SIZE': 1000, 'METRIC': 'wups', 'PERCEPTION': 'googlenet', 'TRUNCATE_INPUT_SPACE': 0, 'MULTIMODAL_MERGE_MODE': 'concat', 'VISUALIZATION_FIG_LOSS_TITLE': 'Loss', 'RESULTS_FILENAME': 'results', 'HIDDEN_STATE_SIZE': 1000, 'PARTS_EXTRACTOR': 'whole', 'MODEL': 'sequential-blind-cnn_fusion-single_answer_with_temporal_fusion', 'IS_EARLY_STOPPING': False, 'LR_PATIENCE': 5, 'OPTIMIZER': 'adam', 'MAX_ERA': 1, 'VISUALIZATION_URL': 'default', 'MAX_EPOCH': 20, 'BATCH_SIZE': 755, 'DATASET': 'daquar-triples', 'MAX_MEMORY_TIME_STEPS': 35, 'LR': -1, 'PERCEPTION_SECOND_LAYER': '', 'TRAINING_SUBSET_SIZE': -1, 'IS_WHOLE_ANSWER_AS_ANSWER_WORD': False, 'VISUAL_HIDDEN_STATE_SIZE': 1000, 'TEXT_ENCODER': 'gru', 'FUSION_LAYER_INDEX': 0, 'VISU

Or we can combine a GRU with a visual CNN, using the following parameters.

--model=sequential-multimodal-recurrent_fusion-at_last_timestep_multimodal_fusion-single_answer 

--multimodal_fusion=mul

In [30]:
! python neural_solver.py --text_encoder=gru --dataset=daquar-triples --model=sequential-multimodal-recurrent_fusion-at_last_timestep_multimodal_fusion-single_answer --temporal_fusion=sum --validation_split=0.1 --metric=wups --max_epoch=20 --max_era=1 --verbosity=monitor_test_metric --word_representation=one_hot --multimodal_fusion=mul 

Using Theano backend.
Couldn't import dot_parser, loading of dot files will not be possible.
Using gpu device 1: Tesla K40m (CNMeM is disabled, CuDNN 4007)
{'LANGUAGE_CNN_VIEWS': 3, 'TEST_SUBSET_SIZE': -1, 'MERGE_MODE': 'sum', 'VERBOSITY': 'monitor_test_metric', 'MLP_HIDDEN_SIZE': 1000, 'METRIC': 'wups', 'PERCEPTION': 'googlenet', 'TRUNCATE_INPUT_SPACE': 0, 'MULTIMODAL_MERGE_MODE': 'mul', 'VISUALIZATION_FIG_LOSS_TITLE': 'Loss', 'RESULTS_FILENAME': 'results', 'HIDDEN_STATE_SIZE': 1000, 'PARTS_EXTRACTOR': 'whole', 'MODEL': 'sequential-multimodal-recurrent_fusion-at_last_timestep_multimodal_fusion-single_answer', 'IS_EARLY_STOPPING': False, 'LR_PATIENCE': 5, 'OPTIMIZER': 'adam', 'MAX_ERA': 1, 'VISUALIZATION_URL': 'default', 'MAX_EPOCH': 20, 'BATCH_SIZE': 755, 'DATASET': 'daquar-triples', 'MAX_MEMORY_TIME_STEPS': 35, 'LR': -1, 'PERCEPTION_SECOND_LAYER': '', 'TRAINING_SUBSET_SIZE': -1, 'IS_WHOLE_ANSWER_AS_ANSWER_WORD': False, 'VISUAL_HIDDEN_STATE_SIZE': 1000, 'TEXT_ENCODER': 'gru', 'FUSION_

Or use the above with ResNet (the default is GoogLeNet) and element-wise summation.
We use the parameters: --multimodal_fusion=sum --perception=fb_resnet --perception_layer=l2_res5c-152.

In [33]:
! python neural_solver.py --text_encoder=gru --dataset=daquar-triples --model=sequential-multimodal-recurrent_fusion-at_last_timestep_multimodal_fusion-single_answer --temporal_fusion=sum --validation_split=0.1 --metric=wups --max_epoch=20 --max_era=1 --verbosity=monitor_test_metric --word_representation=one_hot --multimodal_fusion=sum --perception=fb_resnet --perception_layer=l2_res5c-152 

Using Theano backend.
Couldn't import dot_parser, loading of dot files will not be possible.
Using gpu device 1: Tesla K40m (CNMeM is disabled, CuDNN 4007)
{'LANGUAGE_CNN_VIEWS': 3, 'TEST_SUBSET_SIZE': -1, 'MERGE_MODE': 'sum', 'VERBOSITY': 'monitor_test_metric', 'MLP_HIDDEN_SIZE': 1000, 'METRIC': 'wups', 'PERCEPTION': 'fb_resnet', 'TRUNCATE_INPUT_SPACE': 0, 'MULTIMODAL_MERGE_MODE': 'sum', 'VISUALIZATION_FIG_LOSS_TITLE': 'Loss', 'RESULTS_FILENAME': 'results', 'HIDDEN_STATE_SIZE': 1000, 'PARTS_EXTRACTOR': 'whole', 'MODEL': 'sequential-multimodal-recurrent_fusion-at_last_timestep_multimodal_fusion-single_answer', 'IS_EARLY_STOPPING': False, 'LR_PATIENCE': 5, 'OPTIMIZER': 'adam', 'MAX_ERA': 1, 'VISUALIZATION_URL': 'default', 'MAX_EPOCH': 20, 'BATCH_SIZE': 755, 'DATASET': 'daquar-triples', 'MAX_MEMORY_TIME_STEPS': 35, 'LR': -1, 'PERCEPTION_SECOND_LAYER': '', 'TRAINING_SUBSET_SIZE': -1, 'IS_WHOLE_ANSWER_AS_ANSWER_WORD': False, 'VISUAL_HIDDEN_STATE_SIZE': 1000, 'TEXT_ENCODER': 'gru', 'FUSION_

But there are more possibilities.
Unfortunately the documentation is not ready yet, but you are welcome to experiment with different settings.
Note that some combinations of parameters don't work well together.

To see available models, check kraino/core/model_zoo.py.

To see available command-line parameters together with default values, check kraino/utils/parsers.py.
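
When experimenting with many settings, it can be convenient to generate the command lines from Python instead of typing them by hand. The helper below is a hypothetical convenience, not part of Kraino; it only assembles strings and uses flags that already appear in this tutorial.

```python
def build_command(script='neural_solver.py', **options):
    """Assemble a command line; options set to True become bare flags."""
    parts = ['python', script]
    for name in sorted(options):
        value = options[name]
        if value is True:
            parts.append('--%s' % name)
        else:
            parts.append('--%s=%s' % (name, value))
    return ' '.join(parts)


# one command per question encoder
commands = [
    build_command(
        dataset='daquar-triples',
        model='sequential-blind-recurrent_fusion-single_answer',
        text_encoder=encoder,
        validation_split=0.1,
        metric='wups',
        max_epoch=20,
        max_era=1,
        verbosity='monitor_test_metric',
        word_representation='one_hot')
    for encoder in ('lstm', 'gru')
]
```

Each string in `commands` can then be pasted after `!` in a notebook cell or passed to a shell.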

### Kraino on VQA

We can, however, also work with VQA.

We need to switch the dataset (--dataset=vqa-real_images-open_ended) and the metric (--metric=vqa), and add the answer mode (--vqa_answer_mode=single_frequent).
Moreover, we keep only the question-answer pairs whose answers are among the 2000 most frequent ones (--number_most_frequent_qa_pairs=2000), and we pass --use_whole_answer_as_answer_word because we want to treat an answer such as 'yellow cab' as a single answer rather than split it into words.
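
The truncation to the most frequent answers can be sketched with `collections.Counter`. This is a simplified illustration, not Kraino's implementation; the function name `keep_most_frequent` and the toy data are assumptions for this example. Note that the whole answer string, such as 'yellow cab', is counted as one class.

```python
from collections import Counter


def keep_most_frequent(questions, answers, number_most_frequent):
    """Keep only the pairs whose answer is among the most frequent answers."""
    counts = Counter(answers)
    kept_answers = set(
        answer for answer, _ in counts.most_common(number_most_frequent))
    kept_pairs = [(q, a) for q, a in zip(questions, answers)
                  if a in kept_answers]
    return [q for q, _ in kept_pairs], [a for _, a in kept_pairs]


# toy question-answer pairs (hypothetical)
toy_questions = ['q1', 'q2', 'q3', 'q4', 'q5', 'q6']
toy_answers = ['yes', 'yellow cab', 'yes', 'no', 'yellow cab', '2']

kept_questions, kept_answers = keep_most_frequent(
    toy_questions, toy_answers, number_most_frequent=2)
```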

If the following code is too slow, try using BOW (--model=sequential-blind-temporal_fusion-single_answer).

In [34]:
! python neural_solver.py --dataset=vqa-real_images-open_ended --model=sequential-blind-recurrent_fusion-single_answer --vqa_answer_mode=single_frequent --metric=vqa --max_epoch=10 --max_era=1 --verbosity=monitor_val_metric --word_representation=one_hot --number_most_frequent_qa_pairs=2000 --use_whole_answer_as_answer_word 

Using Theano backend.
Couldn't import dot_parser, loading of dot files will not be possible.
Using gpu device 1: Tesla K40m (CNMeM is disabled, CuDNN 4007)
{'LANGUAGE_CNN_VIEWS': 3, 'TEST_SUBSET_SIZE': -1, 'MERGE_MODE': 'ave', 'VERBOSITY': 'monitor_val_metric', 'MLP_HIDDEN_SIZE': 1000, 'METRIC': 'vqa', 'PERCEPTION': 'googlenet', 'TRUNCATE_INPUT_SPACE': 0, 'MULTIMODAL_MERGE_MODE': 'concat', 'VISUALIZATION_FIG_LOSS_TITLE': 'Loss', 'RESULTS_FILENAME': 'results', 'HIDDEN_STATE_SIZE': 1000, 'PARTS_EXTRACTOR': 'whole', 'MODEL': 'sequential-blind-recurrent_fusion-single_answer', 'IS_EARLY_STOPPING': False, 'LR_PATIENCE': 5, 'OPTIMIZER': 'adam', 'MAX_ERA': 1, 'VISUALIZATION_URL': 'default', 'MAX_EPOCH': 10, 'BATCH_SIZE': 755, 'DATASET': 'vqa-real_images-open_ended', 'MAX_MEMORY_TIME_STEPS': 35, 'LR': -1, 'PERCEPTION_SECOND_LAYER': '', 'TRAINING_SUBSET_SIZE': -1, 'IS_WHOLE_ANSWER_AS_ANSWER_WORD': True, 'VISUAL_HIDDEN_STATE_SIZE': 1000, 'TEXT_ENCODER': 'lstm', 'FUSION_LAYER_INDEX': 0, 'VISUALIZA

Or we can use a Vision + Language model. If there are memory problems, try smaller batches (--batch_size=...), a smaller model (--hidden_state_size or --textual_embedding_size), or a BOW model (--model=sequential-multimodal-temporal_fusion-single_answer).

In [None]:
! python neural_solver.py --dataset=vqa-real_images-open_ended --model=sequential-multimodal-recurrent_fusion-at_last_timestep_multimodal_fusion-single_answer --vqa_answer_mode=single_frequent --metric=vqa --max_epoch=10 --max_era=1 --verbosity=monitor_val_metric --word_representation=one_hot --number_most_frequent_qa_pairs=2000 --use_whole_answer_as_answer_word --perception=fb_resnet --perception_layer=l2_res5c-152 --multimodal_fusion=mul 

Using Theano backend.
Couldn't import dot_parser, loading of dot files will not be possible.
Using gpu device 1: Tesla K40m (CNMeM is disabled, CuDNN 4007)
{'LANGUAGE_CNN_VIEWS': 3, 'TEST_SUBSET_SIZE': -1, 'MERGE_MODE': 'ave', 'VERBOSITY': 'monitor_val_metric', 'MLP_HIDDEN_SIZE': 1000, 'METRIC': 'vqa', 'PERCEPTION': 'fb_resnet', 'TRUNCATE_INPUT_SPACE': 0, 'MULTIMODAL_MERGE_MODE': 'mul', 'VISUALIZATION_FIG_LOSS_TITLE': 'Loss', 'RESULTS_FILENAME': 'results', 'HIDDEN_STATE_SIZE': 1000, 'PARTS_EXTRACTOR': 'whole', 'MODEL': 'sequential-multimodal-recurrent_fusion-at_last_timestep_multimodal_fusion-single_answer', 'IS_EARLY_STOPPING': False, 'LR_PATIENCE': 5, 'OPTIMIZER': 'adam', 'MAX_ERA': 1, 'VISUALIZATION_URL': 'default', 'MAX_EPOCH': 10, 'BATCH_SIZE': 755, 'DATASET': 'vqa-real_images-open_ended', 'MAX_MEMORY_TIME_STEPS': 35, 'LR': -1, 'PERCEPTION_SECOND_LAYER': '', 'TRAINING_SUBSET_SIZE': -1, 'IS_WHOLE_ANSWER_AS_ANSWER_WORD': True, 'VISUAL_HIDDEN_STATE_SIZE': 1000, 'TEXT_ENCODER': 'lstm'

# Further Experiments

* Data and Task Understanding.
 * Try to find out for yourself how difficult or easy it is to answer questions without looking at the images (do this on DAQUAR or VQA). If you are better than our models, that's great. If you are worse, that's fine: our models, even the blind ones, were trained to answer questions on this particular dataset.
 * Take a look at more images and questions. Can you recognise other challenges that machines may face? Can you classify the new challenges?
 * Experiment with evaluation measures. For instance, try Wu-Palmer Similarities with other categories, look into the [WUPS source code](http://datasets.d2.mpi-inf.mpg.de/mateusz14visual-turing/calculate_wups.py), or check [Consensus](http://datasets.d2.mpi-inf.mpg.de/mateusz15neural_qa/compute_consensus.py). You are also encouraged to check 'Performance Measure' in the [Multiworld paper](http://arxiv.org/pdf/1410.0210v4.pdf), 'Quantifying the Performance of Holistic Architectures' in [Towards a Visual Turing Challenge](http://arxiv.org/pdf/1410.8027v3.pdf), and 'Human Consensus' in the [Ask Your Neurons paper](https://www.d2.mpi-inf.mpg.de/sites/default/files/iccv15-neural_qa.pdf).
 * You can experiment with the New Predictions sections. In particular, [the last predictions section](#New-Predictions-with-Vision+Language) was quite short. For instance, you can visualise the predictions, similarly to what you did [here](#New-Predictions). You could also dump a file with predictions and draw new conclusions by inspecting it.
 * Look into [VQA](http://visualqa.org). [Check images, and question answer pairs](http://visualqa.org/browser/). What are the differences between VQA and DAQUAR?
* Experiment with the provided code.
 * More RNN layers, deeper classifiers.
 * Different RNN models (GRU, LSTM, your own?).
 * Using the best model found on a validation set. For this you may want to pass the [checkpoint callback](http://keras.io/callbacks/#modelcheckpoint) to the fit function, as [this example suggests](http://keras.io/callbacks/#example-model-checkpoints).
 * Recognise and change hyperparameters such as dimensionality of embeddings or the number of hidden units.
 * Different ways to fuse two modalities (concat, ave, ...). If you use many RNN layers, you can fuse the modalities at different levels.
 * Investigate different visual features
   * vgg_net: fc6, fc7, fc8
   * googlenet: loss3-classifier, pool5-7x7_s1
 * If needed, please consult the [official documentation](http://keras.io).
* Experiments with Keras.
* Experiments with [Kraino](#Kraino). 

# New Research Opportunities

* __Global Representation__ So far we have been using so-called global representations of the images. Such representations may destroy too much information, so we should consider fine-grained alternatives. Maybe we should use detections, or attention models. The latter have recently become quite successful in answering questions about images. However, there is still hope for global representations if they are trained end-to-end for the task. Recall that our global representation is extracted from a CNN trained on a different dataset (ImageNet) and for a different task (object classification).
* __3D Scene Representation__ Most current approaches, and all neural approaches, are trained on 2D images. However, it seems that some spatial relations such as 'behind' may need a 3D representation of the scene. Luckily, DAQUAR is built on the [NYU-Depth dataset](http://cs.nyu.edu/~silberman/datasets/nyu_depth_v2.html), which provides both modes (2D images and 3D depth). Whether such extra information helps remains an open question.
* __Recurrent Neural Networks__ There is a disturbingly small gap between BOW and RNN models. As we have seen before, some questions clearly require word order, but such questions also tend to be longer, semantically more difficult, and require a better visual understanding of the world. To handle them we may need other RNN architectures, better ways of fusing the two modalities, or a better __Global Representation__.
* __Logical Reasoning__ There are a few questions that require somewhat more sophisticated logical reasoning, such as negation. Can Recurrent Neural Networks learn such logical operators? What about the compositionality of language?
* __Language + Vision__ The gap between Language Only and Vision + Language models is too small. But clearly, we need pictures to answer questions about images. So what is missing here? Is it due to the __Global Representation__, the __3D Scene Representation__, or is something missing in how we fuse the two modalities?
* __Learning from Few Examples__ In the Visual Turing Test, many questions are quite unique. How, then, can the models generalise to new questions? What if a question is completely new, but its parts have already been observed (compositionality)? Can models guess the meaning of a new word from its context?
* __Ambiguities__ How should we deal with ambiguities? They are inherent in the task, so they cannot simply be ignored, and should be incorporated into question answering methods as well as evaluation metrics.
* __Evaluation Measures__ Although we have WUPS and Consensus, both are far from perfect. Consensus has a higher annotation cost for ambiguous tasks, and it is unclear how to formally define a good consensus measure. WUPS is ontology-dependent, but can we build one complete ontology that covers all cases?

# External Links

 * Visual Turing Test - project webpage: [https://www.d2.mpi-inf.mpg.de/visual-turing-challenge](https://www.d2.mpi-inf.mpg.de/visual-turing-challenge)
 * Ask Your Neurons - the main inspiration for this tutorial. [https://www.d2.mpi-inf.mpg.de/sites/default/files/iccv15-neural_qa.pdf](https://www.d2.mpi-inf.mpg.de/sites/default/files/iccv15-neural_qa.pdf)
 * Multiworld Approach - our first paper on Visual Turing Test and DAQUAR. It also describes a symbolic approach to handle the challenge. [http://arxiv.org/pdf/1410.0210v4.pdf](http://arxiv.org/pdf/1410.0210v4.pdf)
 * Towards a Visual Turing Challenge - we hope that it's a nice and accessible introduction to the task. [http://arxiv.org/abs/1410.8027](http://arxiv.org/abs/1410.8027). Its more compact version: [http://arxiv.org/abs/1501.03302](http://arxiv.org/abs/1501.03302)
 * My [talk from the ICCV'15 conference](http://videolectures.net/iccv2015_malinowski_your_neurons/) together with the [slides](http://datasets.d2.mpi-inf.mpg.de/mateusz15neural_qa/ask_your_neurons-slides-low_res.pdf)
 * VQA - a complementary, large-scale dataset for Visual Question Answering. Project webpage: [http://visualqa.org](http://visualqa.org)
 * Image Question Answering - another large-scale dataset for Image Question Answering. Project webpage [http://www.cs.toronto.edu/~mren/imageqa/](http://www.cs.toronto.edu/~mren/imageqa/)
 * Learning to Answer Questions about Images - a model similar to Ask Your Neurons. [http://www.hangli-hl.com/uploads/3/4/4/6/34465961/ma_et_al_aaai_2016.pdf](http://www.hangli-hl.com/uploads/3/4/4/6/34465961/ma_et_al_aaai_2016.pdf)
 * A nice introduction to VQA with working source code, also in Keras. [https://avisingh599.github.io/deeplearning/visual-qa/](https://avisingh599.github.io/deeplearning/visual-qa/)
 * In the tutorial we use the [Facebook residual net features](http://torch.ch/blog/2016/02/04/resnets.html)
 * DAQUAR uses NYU-Depth images, for more information please take a look [here](http://cs.nyu.edu/~silberman/datasets/nyu_depth_v2.html)

# Logs

1. 25.04.2016: Added residual net features, more on Kraino, made it publicly available
1. 20.03.2016: 2nd Summer School on Integrating Vision & Language: Deep Learning