# Dynamic Memory Networks for Visual and Textual Question Answering

* Psygrammer / TensorFollow: Part 1 - Tensor Code Review [1]
* 김무성

# Contents
* Abstract
* 1 Introduction
* 2 Dynamic Memory Networks
* 3 Improved Dynamic Memory Networks: DMN+
    - 3.1. Input Module for Text QA
    - 3.2. Input Module for VQA
    - 3.3. The Episodic Memory Module
* 4 Related Work
* 5 Datasets
    - 5.1. bAbI-10k
    - 5.2. DAQUAR-ALL visual dataset
    - 5.3. Visual Question Answering
* 6 Experiments
    - 6.1. Model Analysis
    - 6.2. Comparison to state of the art using bAbI-10k
    - 6.3. Comparison to state of the art using VQA
* 7 Conclusion

# Abstract

* Neural network architectures with <font color="red">memory</font> and <font color="red">attention</font> mechanisms exhibit certain reasoning capabilities required for <font color="red">question answering</font>.
* Our new <font color="red">DMN+</font> model improves the state of the art on both the <font color="red">Visual Question Answering dataset</font> and the <font color="red">bAbI-10k text question-answering dataset</font> <font color="blue">without supporting fact supervision</font>.

#### See also
* [3] <font color="red">The future of Deep Learning for NLP: Dynamic Memory Networks </font> (in CS224d: Deep Learning for Natural Language Processing) - http://cs224d.stanford.edu/lectures/CS224d-Lecture17.pdf
* [5] Implementing Dynamic memory networks - https://yerevann.github.io/2016/02/05/implementing-dynamic-memory-networks/
* [6] Playground for bAbI tasks - https://yerevann.github.io/2016/02/23/playground-for-babi-tasks/
* [8] <font color="red">Dynamic Memory Networks by YerevaNN Web Demo</font> - (web demo for [6]) - http://yerevann.com/dmn-ui/#/

# 1 Introduction

* We analyze the DMN components, specifically 
    - the input module and 
    - memory module, to improve question answering. 
* We propose a new input module which 
    - uses a two level encoder with 
        - a sentence reader and 
        - input fusion layer 
            - to allow for information flow 
                - between sentences. 
* For the memory, we propose 
    - a modification to gated recurrent units (GRU) (Chung et al., 2014). 
    - The new GRU formulation 
        - incorporates attention gates that 
            - are computed using global knowledge over the facts. 
* Unlike before, the new DMN+ model 
    - does not require that supporting facts 
        - (i.e. the facts that are relevant for answering a particular question) 
        - are labeled during training. 
    - The model learns to select the important facts from a larger set.
* In addition, we introduce a new input module to represent images.
    - We show that the changes in the memory module that improved textual question answering also improve visual question answering. Both tasks are illustrated in Fig. 1.

<img src="figures/cap1.png" width=600 />

# 2 Dynamic Memory Networks
* Input Module
* Question Module
* Episodic Memory Module
* Answer Module

<img src="figures/overview.png" width=600 />

#### Input Module

##### See also
* [10] GRUs and LSTMs -- for machine translation (in CS224d: Deep Learning for Natural Language Processing) - http://cs224d.stanford.edu/lectures/CS224d-Lecture9.pdf
* [11] Recurrent Neural Networks (RNN), Long Short Term Memory (LSTM) / RNN language models / Image captioning (in CS231n: Convolutional Neural Networks for Visual Recognition) - http://cs231n.stanford.edu/slides/winter1516_lecture10.pdf
* [12] Understanding LSTM Networks (in Korean) - http://roboticist.tistory.com/m/post/571

<img src="figures/input.png" width=600 />

This module processes the input data about which a question is being asked into a set of vectors termed <font color="red">facts, represented as</font> $F = [f_1, \ldots, f_N]$, where $N$ is the total number of facts.

<img src="figures/cap3.png" width=600 />

<img src="figures/cap2.png" width=600 />

<img src="figures/cap8.png" width=600 />

#### Question Module

This module computes a vector representation $q$ of the question, where $q \in \mathbb{R}^{n_H}$ is the final hidden state of a GRU over the words in the question.
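
A minimal NumPy sketch of such a question encoder, assuming a standard GRU cell with randomly initialised weights. The helper names (`init_gru`, `gru_step`, `encode_question`) and the toy sizes are illustrative assumptions, not taken from the paper or from the code in [4] or [7].

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def init_gru(n_in, n_h, rng):
    """Random GRU parameters (illustrative initialisation, not the paper's)."""
    def mat(rows, cols):
        return rng.normal(scale=0.1, size=(rows, cols))
    return {
        "Wu": mat(n_h, n_in), "Uu": mat(n_h, n_h), "bu": np.zeros(n_h),
        "Wr": mat(n_h, n_in), "Ur": mat(n_h, n_h), "br": np.zeros(n_h),
        "Wh": mat(n_h, n_in), "Uh": mat(n_h, n_h), "bh": np.zeros(n_h),
    }

def gru_step(p, x, h_prev):
    """One GRU step: update gate u, reset gate r, candidate state h_tilde."""
    u = sigmoid(p["Wu"] @ x + p["Uu"] @ h_prev + p["bu"])
    r = sigmoid(p["Wr"] @ x + p["Ur"] @ h_prev + p["br"])
    h_tilde = np.tanh(p["Wh"] @ x + r * (p["Uh"] @ h_prev) + p["bh"])
    return u * h_tilde + (1.0 - u) * h_prev

def encode_question(p, word_vectors):
    """Run the GRU over the question words and return the final hidden state q."""
    h = np.zeros(p["bu"].shape[0])
    for x in word_vectors:
        h = gru_step(p, x, h)
    return h

rng = np.random.default_rng(0)
n_in, n_h = 8, 5                        # toy embedding size and hidden size n_H
params = init_gru(n_in, n_h, rng)
question = rng.normal(size=(4, n_in))   # stand-in embeddings for a 4-word question
q = encode_question(params, question)   # vector of length n_H
```

In a real model the word vectors would come from a learned embedding table rather than from random noise.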

<img src="figures/query.png" width=600 />

#### Episodic Memory Module

<img src="figures/episodic.png" width=600 />

<img src="figures/episodic_repeat.png" width=600 />

Episodic memory aims to retrieve the information required to answer the question $q$ from the input facts. 
* To improve our understanding of both the question and input, especially if questions require transitive reasoning, the episodic memory module may pass over the input multiple times, updating the episode memory after each pass.
* We refer to 
    - the episode memory on the $t$-th pass over the inputs as $m_t$, 
        - where $m_t \in \mathbb{R}^{n_H}$; 
    - the initial memory vector is set to the question vector: $m_0 = q$.

<img src="figures/cap6.png" width=600 />

##### See also
* [13] Deep Learning for Computer Vision: Attention Models (UPC 2016) - http://www.slideshare.net/xavigiro/deep-learning-for-computer-vision-attention-models-upc-2016

The <font color="blue">episodic memory module</font> consists of two separate components:
* <font color="red">The attention mechanism</font> 
    - is responsible for producing a contextual vector $c_t$, 
        - where $c_t \in \mathbb{R}^{n_H}$ is a summary of relevant input for pass $t$, 
            - with relevance inferred by the question $q$
            - and previous episode memory $m_{t-1}$. 
* <font color="red">The memory update mechanism</font> 
    - is responsible for generating the episode memory $m_t$ 
        - based upon the contextual vector $c_t$ and 
        - previous episode memory $m_{t-1}$. 
    - By the final pass $T$ , 
        - the episodic memory $m_T$ 
            - should contain all the information 
                - required to answer the question $q$.
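
The control flow of the two components can be sketched as a simple loop. This is only a skeleton: `dot_product_attention` and `averaging_update` are placeholder stand-ins chosen to keep the example short, not the attention or update equations of the paper, and all names and sizes are assumptions.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def episodic_memory(facts, q, attend, update, num_passes):
    """Multi-pass loop: start from m_0 = q, then alternate attention and update."""
    m = q.copy()                       # m_0 = q
    for _ in range(num_passes):        # passes t = 1 .. T
        c = attend(facts, q, m)        # contextual vector from facts, q, previous memory
        m = update(m, c)               # new episode memory from c and previous memory
    return m                           # final memory, handed to the answer module

# Placeholder components (assumptions for illustration, NOT the paper's equations).
def dot_product_attention(facts, q, m):
    scores = facts @ (q + m)           # crude relevance of each fact to question and memory
    return softmax(scores) @ facts     # weighted sum of the facts

def averaging_update(m_prev, c):
    return 0.5 * (m_prev + c)

rng = np.random.default_rng(0)
facts = rng.normal(size=(6, 4))        # N = 6 facts, n_H = 4
q = rng.normal(size=4)
m_T = episodic_memory(facts, q, dot_product_attention, averaging_update, num_passes=3)
```

Passing the attention and update steps as functions keeps the loop independent of which concrete mechanism is plugged in.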

#### Answer Module

The answer module receives both $q$ and $m_T$ to generate the model’s predicted answer. 
* For simple answers,
    - such as a single word, a linear layer with softmax activation may be used. 
* For tasks requiring a sequence output, 
    - an RNN may be used to decode $a = [q; m_T]$, 
        - the concatenation of vectors $q$ and $m_T$, to an ordered set of tokens.
* The cross entropy error on the answers is used for training and backpropagated through the entire network.
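
For the single-word case, a rough sketch of the answer layer and its loss might look as follows. Feeding the concatenation of the question vector and the final memory into the linear layer is an assumption made here for illustration; the function names, vocabulary size, and target index are likewise hypothetical.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def answer_probs(W, b, q, m_final):
    """Linear layer + softmax over the vocabulary, applied to [q; m_final] (assumed input)."""
    a = np.concatenate([q, m_final])
    return softmax(W @ a + b)

def cross_entropy(probs, target_index):
    """Negative log-probability assigned to the correct answer word."""
    return -np.log(probs[target_index])

rng = np.random.default_rng(0)
n_h, vocab_size = 4, 10
q = rng.normal(size=n_h)
m_final = rng.normal(size=n_h)                       # memory after the last pass
W = rng.normal(scale=0.1, size=(vocab_size, 2 * n_h))
b = np.zeros(vocab_size)

probs = answer_probs(W, b, q, m_final)
loss = cross_entropy(probs, target_index=3)          # hypothetical gold answer id
predicted = int(np.argmax(probs))
```

During training the gradient of `loss` would be propagated back through the memory, question, and input modules; that step is omitted here.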

<img src="figures/answer.png" width=600 />

# 3 Improved Dynamic Memory Networks: DMN+
* 3.1. Input Module for Text QA
* 3.2. Input Module for VQA
* 3.3. The Episodic Memory Module

The final DMN+ model obtains the highest accuracy on the bAbI-10k dataset without supporting facts and the VQA dataset (Antol et al., 2015).

## 3.1. Input Module for Text QA
* Input Fusion Layer

In the DMN specified in Kumar et al. (2015),
* a single GRU 
    - is used to process all the words in the story, extracting sentence representations by storing the hidden states produced at the end-of-sentence markers.
    - The GRU also provides a temporal component by allowing a sentence to know the content of the sentences that came before it.
* Whilst this input module 
    - worked well for bAbI-1k 
        - with supporting facts, as reported in Kumar et al. (2015), 
    - it did not perform well on bAbI-10k 
        - without supporting facts (Sec. 6.1).
* We speculate that there are two main reasons for this performance disparity, both exacerbated by the removal of supporting facts. 
    * First, 
        - the GRU only allows sentences to have context from sentences before them, but not after them. 
            - This prevents information propagation from future sentences. 
    * Second, 
        - the supporting sentences may be too far away from each other on a word level to allow for these distant sentences to interact through the word level GRU.

#### Input Fusion Layer

For the DMN+, we propose replacing this single GRU with two different components. 
* The first component is 
    - a <font color="red">sentence reader</font>, 
        - responsible only for encoding the words into a sentence embedding. 
* The second component is 
    - the <font color="red">input fusion layer</font>, 
        - allowing for interactions between sentences.
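
In the paper [1] the input fusion layer is a bi-directional GRU over the sentence embeddings, with the forward and backward hidden states summed to give each fact. The sketch below assumes the sentence embeddings are already computed; the helper names, weight initialisation, and sizes are illustrative assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def init_gru(n_in, n_h, rng):
    """Random (W, U, b) triple per gate; illustrative initialisation only."""
    return {gate: (rng.normal(scale=0.1, size=(n_h, n_in)),
                   rng.normal(scale=0.1, size=(n_h, n_h)),
                   np.zeros(n_h))
            for gate in ("u", "r", "h")}

def gru_step(p, x, h_prev):
    Wu, Uu, bu = p["u"]
    Wr, Ur, br = p["r"]
    Wh, Uh, bh = p["h"]
    u = sigmoid(Wu @ x + Uu @ h_prev + bu)
    r = sigmoid(Wr @ x + Ur @ h_prev + br)
    h_tilde = np.tanh(Wh @ x + r * (Uh @ h_prev) + bh)
    return u * h_tilde + (1.0 - u) * h_prev

def run_gru(p, inputs, n_h):
    """Return the hidden state after every input, in the order given."""
    h, states = np.zeros(n_h), []
    for x in inputs:
        h = gru_step(p, x, h)
        states.append(h)
    return np.stack(states)

def input_fusion(p_fwd, p_bwd, sentence_embeddings, n_h):
    """Sum of forward and backward GRU states, so each fact sees both directions."""
    forward = run_gru(p_fwd, sentence_embeddings, n_h)
    backward = run_gru(p_bwd, sentence_embeddings[::-1], n_h)[::-1]  # re-align to input order
    return forward + backward

rng = np.random.default_rng(0)
n_sent, n_emb, n_h = 5, 8, 6
p_fwd, p_bwd = init_gru(n_emb, n_h, rng), init_gru(n_emb, n_h, rng)
sentences = rng.normal(size=(n_sent, n_emb))       # stand-in sentence embeddings
facts = input_fusion(p_fwd, p_bwd, sentences, n_h)
```

Because of the backward pass, changing the last sentence also changes the first fact, which a single forward GRU would not allow.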

Each sentence encoding 
* $f_i$ is the output of an encoding scheme 
    - taking the word tokens $[w^{i}_1, \ldots, w^{i}_{M_i}]$, 
        - where $M_i$ is the length of the sentence.

<img src="figures/cap3.png" width=600 />

##### positional encoding scheme

###### See also
* [14] End-To-End Memory Networks - http://arxiv.org/abs/1503.08895

The sentence reader could be based on any variety of encoding schemes. We selected positional encoding described in Sukhbaatar et al. (2015) to allow for a comparison to their work.
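
As a rough sketch, the positional encoding of Sukhbaatar et al. (2015) weights the $d$-th component of the $j$-th word embedding by $l_{jd} = (1 - j/M) - (d/D)(1 - 2j/M)$ and sums over the words of the sentence, with $M$ the sentence length and $D$ the embedding size. The function names and toy sizes below are assumptions.

```python
import numpy as np

def positional_weights(num_words, dim):
    """l[j, d] = (1 - j/M) - (d/D) * (1 - 2j/M), with 1-based j and d."""
    j = np.arange(1, num_words + 1)[:, None]    # word positions, shape (M, 1)
    d = np.arange(1, dim + 1)[None, :]          # embedding indices, shape (1, D)
    return (1 - j / num_words) - (d / dim) * (1 - 2 * j / num_words)

def encode_sentence(word_vectors):
    """Sentence embedding: element-wise weighted word vectors, summed over words."""
    num_words, dim = word_vectors.shape
    return (positional_weights(num_words, dim) * word_vectors).sum(axis=0)

rng = np.random.default_rng(0)
words = rng.normal(size=(3, 4))     # a 3-word sentence with 4-dimensional embeddings
f_i = encode_sentence(words)        # one vector of length 4
```

Unlike a plain bag-of-words sum, the weights depend on the word position, so reordering the words generally changes the sentence embedding.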

<img src="figures/cap4.png" width=600 />

## 3.2. Input Module for VQA
* Local region feature extraction
* Visual feature embedding
* Input fusion layer

<img src="figures/cap5.png" width=600 />

#### Local region feature extraction

#### Visual feature embedding

#### Input fusion layer

## 3.3. The Episodic Memory Module
* Attention Mechanism
* Soft attention
* Attention based GRU
* Episode Memory Updates

<img src="figures/cap6.png" width=600 />

<img src="figures/cap7.png" width=600 />

#### Attention Mechanism

#### Soft attention
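
A minimal sketch of the attention gates and soft attention, following the formulation in [1]: each fact is compared with the question and the previous memory through element-wise products and absolute differences, a two-layer network scores the result, and a softmax over the facts gives the gates. The contextual vector is then the gate-weighted sum of the facts. Variable names, the hidden size of the scoring network, and the random weights are assumptions.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def attention_gates(facts, q, m_prev, W1, b1, W2, b2):
    """One scalar gate per fact, normalised over all facts."""
    z = np.concatenate([facts * q, facts * m_prev,
                        np.abs(facts - q), np.abs(facts - m_prev)], axis=1)  # (N, 4 * n_H)
    scores = np.tanh(z @ W1.T + b1) @ W2 + b2     # two-layer scoring network, shape (N,)
    return softmax(scores)

def soft_attention(facts, gates):
    """Contextual vector as the gate-weighted sum of the facts."""
    return gates @ facts

rng = np.random.default_rng(0)
n_facts, n_h, n_att = 6, 4, 5
facts = rng.normal(size=(n_facts, n_h))
q = rng.normal(size=n_h)
m_prev = q.copy()                                  # first pass: memory starts at q
W1 = rng.normal(scale=0.1, size=(n_att, 4 * n_h))
b1 = np.zeros(n_att)
W2 = rng.normal(scale=0.1, size=n_att)
b2 = 0.0

gates = attention_gates(facts, q, m_prev, W1, b1, W2, b2)
c = soft_attention(facts, gates)
```

The weighted sum ignores the order of the facts, which is the limitation the attention based GRU is meant to address.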

<img src="figures/cap9.png" />

<img src="figures/cap10.png"  />

#### Attention based GRU
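
In the attention based GRU of [1], the update gate of a standard GRU is replaced by the scalar attention gate of the current fact, and the final hidden state serves as the contextual vector. A sketch under that reading, with illustrative parameter names and hand-picked gate values:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def attention_gru(facts, gates, Wr, Ur, br, Wh, Uh, bh):
    """GRU over the facts where the attention gate g_i replaces the update gate."""
    h = np.zeros(Ur.shape[0])
    for f, g in zip(facts, gates):
        r = sigmoid(Wr @ f + Ur @ h + br)               # reset gate, as in a normal GRU
        h_tilde = np.tanh(Wh @ f + r * (Uh @ h) + bh)   # candidate state
        h = g * h_tilde + (1.0 - g) * h                 # scalar g decides how much to write
    return h                                            # final state = contextual vector

rng = np.random.default_rng(0)
n_facts, n_h = 5, 4
facts = rng.normal(size=(n_facts, n_h))
Wr, Ur, Wh, Uh = (rng.normal(scale=0.1, size=(n_h, n_h)) for _ in range(4))
br, bh = np.zeros(n_h), np.zeros(n_h)
gates = np.array([0.1, 0.5, 0.1, 0.2, 0.1])   # example attention weights summing to 1
c = attention_gru(facts, gates, Wr, Ur, br, Wh, Uh, bh)
```

A fact with a gate near zero leaves the hidden state almost untouched, so irrelevant facts are skipped while the order of the relevant ones is preserved.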

<img src="figures/cap8.png" width=600 />

<img src="figures/cap11.png" width=600 />

<img src="figures/cap17.png" width=600 />

#### Episode Memory Updates 
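
The DMN+ memory update in [1] applies a ReLU layer to the concatenation of the previous memory, the contextual vector, and the question, with a separate weight matrix for each pass. The sketch below assumes a shared bias and uses random vectors as stand-ins for the attention output; names and sizes are illustrative.

```python
import numpy as np

def update_memory(m_prev, c, q, W, b):
    """New memory = ReLU(W [m_prev; c; q] + b)."""
    return np.maximum(0.0, W @ np.concatenate([m_prev, c, q]) + b)

rng = np.random.default_rng(0)
n_h, num_passes = 4, 3
q = rng.normal(size=n_h)
# One weight matrix per pass (untied weights), each of shape (n_H, 3 * n_H).
weights = [rng.normal(scale=0.1, size=(n_h, 3 * n_h)) for _ in range(num_passes)]
b = np.zeros(n_h)

m = q.copy()                              # memory starts at the question vector
for W_t in weights:
    c_t = rng.normal(size=n_h)            # stand-in for the attention output of this pass
    m = update_memory(m, c_t, q, W_t, b)
```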

<img src="figures/cap12.png" width=600 />

<img src="figures/cap13.png" width=600 />

# 4 Related Work
* Neural Memory Models
* Neural Attention Mechanisms 
* Question Answering in NLP
* Visual Question Answering (VQA) 

#### Neural Memory Models

#### Neural Attention Mechanisms 

#### Question Answering in NLP

#### Visual Question Answering (VQA) 

# 5 Datasets
* 5.1. bAbI-10k
* 5.2. DAQUAR-ALL visual dataset
* 5.3. Visual Question Answering

## 5.1. bAbI-10k

## 5.2. DAQUAR-ALL visual dataset

## 5.3. Visual Question Answering

# 6 Experiments
* 6.1. Model Analysis
* 6.2. Comparison to state of the art using bAbI-10k
* 6.3. Comparison to state of the art using VQA

## 6.1. Model Analysis

<img src="figures/cap14.png" width=600 />

## 6.2. Comparison to state of the art using bAbI-10k
* Text QA Results

#### Text QA Results

<img src="figures/cap15.png" width=600 />

## 6.3. Comparison to state of the art using VQA
* Training Details 
* Results and Analysis

#### Training Details 

#### Results and Analysis

<img src="figures/cap16.png" width=600 />

# 7 Conclusion

# References
* [1] Dynamic Memory Networks for Visual and Textual Question Answering - https://arxiv.org/abs/1603.01417
* [2] Ask Me Anything: Dynamic Memory Networks for Natural Language Processing - http://arxiv.org/abs/1506.07285
* [3] The future of Deep Learning for NLP: Dynamic Memory Networks (in CS224d: Deep Learning for Natural Language Processing) - http://cs224d.stanford.edu/lectures/CS224d-Lecture17.pdf
* [4] code(therne's) - https://github.com/therne/dmn-tensorflow
* [5] Implementing Dynamic memory networks - https://yerevann.github.io/2016/02/05/implementing-dynamic-memory-networks/
* [6] Playground for bAbI tasks - https://yerevann.github.io/2016/02/23/playground-for-babi-tasks/
* [7] code(YerevaNN's) - https://github.com/YerevaNN/Dynamic-memory-networks-in-Theano
* [8] Dynamic Memory Networks by YerevaNN Web Demo - (web demo for [6]) - http://yerevann.com/dmn-ui/#/
* [9] Towards AI-Complete Question Answering: A Set of Prerequisite Toy Tasks - http://arxiv.org/pdf/1502.05698v10.pdf
* [10] GRUs and LSTMs -- for machine translation (in CS224d: Deep Learning for Natural Language Processing) - http://cs224d.stanford.edu/lectures/CS224d-Lecture9.pdf
* [11] Recurrent Neural Networks (RNN), Long Short Term Memory (LSTM) / RNN language models / Image captioning (in CS231n: Convolutional Neural Networks for Visual Recognition) - http://cs231n.stanford.edu/slides/winter1516_lecture10.pdf
* [12] Understanding LSTM Networks (in Korean) - http://roboticist.tistory.com/m/post/571
* [13] Deep Learning for Computer Vision: Attention Models (UPC 2016) - http://www.slideshare.net/xavigiro/deep-learning-for-computer-vision-attention-models-upc-2016
* [14] End-To-End Memory Networks - http://arxiv.org/abs/1503.08895