# Getting the dataset
A script is provided to download all the datasets required for running the examples below.
It will download and unzip the WikiQA Corpus and the Quora Duplicate Questions dataset.

In [None]:
!python data/get_data.py

We also need word embeddings for training.
For this, we will use the GloVe embeddings.
Luckily, [gensim-data](https://github.com/RaRe-Technologies/gensim-data) provides an easy interface for them.

In [None]:
import gensim.downloader as api
kv_model = api.load("glove-wiki-gigaword-50")

## Importing dependencies for running the Similarity Learning task

In [2]:
import os
import csv
import re
from drmm_tks import DRMM_TKS

Using TensorFlow backend.


## Data Format

We have to provide data in a format that the model understands.
The model represents a sentence as a list of words.
We need to provide three lists:
 1. Queries List
 2. Candidate Document List
 3. Correct Label List

List 1 is a list of lists of words.
List 2 is a list of lists of lists of words: one group of candidate documents per query.
List 3 is a list of lists of ints: one label per candidate document.

Example:
```
queries = ["When was Abraham Lincoln born ?".split(),
           "When was the first World War ?".split()]
docs = [
    ["Abraham Lincoln was the president of the United States of America".split(),
     "He was born in 1809".split()],
    ["The first world war was bad".split(),
     "It was fought in 1914".split(),
     "There were over a million deaths".split()]
]
labels = [
    [0, 1],
    [0, 1, 0]
]
```
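
Because the three lists must line up (one document group and one label group per query, and one label per document), a quick sanity check can catch malformed input early. The following is a minimal sketch; the helper name `check_dataset_shape` is hypothetical and not part of DRMM_TKS.

```python
def check_dataset_shape(queries, docs, labels):
    """Raise ValueError if the three lists are not aligned with each other."""
    if not (len(queries) == len(docs) == len(labels)):
        raise ValueError("queries, docs and labels must have the same length")
    for i, (doc_group, label_group) in enumerate(zip(docs, labels)):
        # Every candidate document needs exactly one relevance label
        if len(doc_group) != len(label_group):
            raise ValueError(
                "query %d has %d docs but %d labels" % (i, len(doc_group), len(label_group))
            )
    return True


queries = ["When was Abraham Lincoln born ?".split(),
           "When was the first World War ?".split()]
docs = [
    ["Abraham Lincoln was the president of the United States of America".split(),
     "He was born in 1809".split()],
    ["The first world war was bad".split(),
     "It was fought in 1914".split(),
     "There were over a million deaths".split()],
]
labels = [[0, 1], [0, 1, 0]]

shape_ok = check_dataset_shape(queries, docs, labels)
```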

## About the dataset : WikiQA

The WikiQA corpus is a set of question-answer pairs in which every query has several candidate documents, of which none, one or more may be relevant.
Relevance is purely binary: 1 means relevant, 0 means not relevant.

Sample data:

QuestionID | Question | DocumentID | DocumentTitle | SentenceID | Sentence | Label
-- | -- | -- | -- | -- | -- | --
Q1 | how are glacier caves formed? | D1 | Glacier cave | D1-0 | A partly submerged glacier cave on Perito Moreno Glacier . | 0
Q1 | how are glacier caves formed? | D1 | Glacier cave | D1-1 | The ice facade is approximately 60 m high | 0
Q1 | how are glacier caves formed? | D1 | Glacier cave | D1-2 | Ice formations in the Titlis glacier cave | 0
Q1 | how are glacier caves formed? | D1 | Glacier cave | D1-3 | A glacier cave is a cave formed within the ice of a glacier . | 1
Q1 | how are glacier caves formed? | D1 | Glacier cave | D1-4 | Glacier caves are often called ice caves , but this term is properly used to describe bedrock caves that contain year-round ice. | 0
Q2 | How are the directions of the velocity and force vectors related in a circular motion | D2 | Circular motion | D2-0 | In physics , circular motion is a movement of an object along the circumference of a circle or rotation along a circular path. | 0
Q2 | How are the directions of the velocity and force vectors related in a circular motion | D2 | Circular motion | D2-1 | It can be uniform, with constant angular rate of rotation (and constant speed), or non-uniform with a changing rate of rotation. | 0
Q2 | How are the directions of the velocity and force vectors related in a circular motion | D2 | Circular motion | D2-2 | The rotation around a fixed axis of a three-dimensional body involves circular motion of its parts. | 0
Q2 | How are the directions of the velocity and force vectors related in a circular motion | D2 | Circular motion | D2-3 | The equations of motion describe the movement of the center of mass of a body. | 0
Q2 | How are the directions of the velocity and force vectors related in a circular motion | D2 | Circular motion | D2-4 | Examples of circular motion include: an artificial satellite orbiting the Earth at constant height, a stone which is tied to a rope and is being swung in circles, a car turning through a curve in a race track , an electron moving perpendicular to a uniform magnetic field , and a gear turning inside a mechanism. | 0
Q2 | How are the directions of the velocity and force vectors related in a circular motion | D2 | Circular motion | D2-5 | Since the object's velocity vector is constantly changing direction, the moving object is undergoing acceleration by a centripetal force in the direction of the center of rotation. | 0
Q2 | How are the directions of the velocity and force vectors related in a circular motion | D2 | Circular motion | D2-6 | Without this acceleration, the object would move in a straight line, according to Newton's laws of motion . | 0

## Data Preprocessing
We need to convert the above text into the `queries, docs, labels` form. For this, we will create an iterable object using the class below, which allows the data to be streamed into the model as the need arises.

In [3]:
class MyWikiIterable:
    """Yields the next data point in the data set
    A data point is: (query, doc_group, label_group)
    where
    query : list of str words
    doc_group : list of docs
                where a doc is a list of str words
    label_group : list of int
        The relevance between adjacent queries and docs
    """

    def __init__(self, iter_type, fpath):
        """
        Parameters
        ----------
        iter_type : {'query', 'doc', 'label'}
            the type of iterable to be yielded
        fpath : str
            path to the dataset
        """

        # To map the `iter_type` to an index
        self.type_translator = {'query': 0, 'doc': 1, 'label': 2}
        self.iter_type = iter_type

        with open(fpath, encoding='utf8') as tsv_file:
            tsv_reader = csv.reader(tsv_file, delimiter='\t')
            self.data_rows = [row for row in tsv_reader]

    def preprocess_sent(self, sent):
        """Utility function to lower, strip and tokenize each sentence

        Replace this function if you want to handle preprocessing differently"""
        return re.sub("[^a-zA-Z0-9]", " ", sent.strip().lower()).split()

    def __iter__(self):
        # Defining some constants for .tsv reading
        # They represent the columns of the respective values
        QUESTION_ID_INDEX = 0
        QUESTION_INDEX = 1
        ANSWER_INDEX = 5
        LABEL_INDEX = 6


        # The group of documents and labels that belong to one question
        document_group = []
        label_group = []

        # Number of relevant documents in the current question's group
        n_relevant_docs = 0
        # Number of filtered questions (questions which have zero relevant docs)
        n_filtered_docs = 0

        # The data
        queries = []
        docs = []
        labels = []

        
        # The code below goes through the data line by line
        # It compares the current question id with the next question id
        # to detect where one question's group of documents ends
        for i, line in enumerate(self.data_rows[1:], start=1):
            if i < len(self.data_rows) - 1:  # check if out of bounds might occur
                if self.data_rows[i][QUESTION_ID_INDEX] == self.data_rows[i + 1][QUESTION_ID_INDEX]:
                    document_group.append(self.preprocess_sent(self.data_rows[i][ANSWER_INDEX]))
                    label_group.append(int(self.data_rows[i][LABEL_INDEX]))
                    n_relevant_docs += int(self.data_rows[i][LABEL_INDEX])
                else:
                    document_group.append(self.preprocess_sent(self.data_rows[i][ANSWER_INDEX]))
                    label_group.append(int(self.data_rows[i][LABEL_INDEX]))

                    n_relevant_docs += int(self.data_rows[i][LABEL_INDEX])

                    if n_relevant_docs > 0:
                        docs.append(document_group)
                        labels.append(label_group)
                        queries.append(self.preprocess_sent(self.data_rows[i][QUESTION_INDEX]))

                        yield [queries[-1], document_group, label_group][self.type_translator[self.iter_type]]
                    else:
                        n_filtered_docs += 1

                    n_relevant_docs = 0
                    document_group = []
                    label_group = []

            else:
                # If we are on the last line
                document_group.append(self.preprocess_sent(self.data_rows[i][ANSWER_INDEX]))
                label_group.append(int(self.data_rows[i][LABEL_INDEX]))
                n_relevant_docs += int(self.data_rows[i][LABEL_INDEX])

                if n_relevant_docs > 0:
                    docs.append(document_group)
                    labels.append(label_group)
                    queries.append(self.preprocess_sent(self.data_rows[i][QUESTION_INDEX]))
                    yield [queries[-1], document_group, label_group][self.type_translator[self.iter_type]]
                else:
                    n_filtered_docs += 1
                    n_relevant_docs = 0
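
To make the grouping logic easier to follow, here is a more compact sketch of the same idea on a few in-memory rows. It assumes, as the class above does, that all rows of one question are consecutive. The four-column row layout and the function name `group_rows` are simplifications made up for this illustration; they are not the real .tsv layout.

```python
import re
from itertools import groupby


def preprocess_sent(sent):
    # Same normalisation as MyWikiIterable.preprocess_sent
    return re.sub("[^a-zA-Z0-9]", " ", sent.strip().lower()).split()


# Hypothetical trimmed-down rows: (QuestionID, Question, Sentence, Label)
rows = [
    ("Q1", "how are glacier caves formed?",
     "The ice facade is approximately 60 m high", "0"),
    ("Q1", "how are glacier caves formed?",
     "A glacier cave is a cave formed within the ice of a glacier .", "1"),
    ("Q2", "How are the directions of the velocity and force vectors related in a circular motion",
     "The equations of motion describe the movement of the center of mass of a body.", "0"),
]


def group_rows(rows):
    """Yield (query, doc_group, label_group) for each question with a relevant doc."""
    # groupby only merges consecutive rows that share the same QuestionID
    for _, group in groupby(rows, key=lambda row: row[0]):
        group = list(group)
        label_group = [int(row[3]) for row in group]
        if sum(label_group) == 0:
            continue  # skip questions that have no relevant document
        query = preprocess_sent(group[0][1])
        doc_group = [preprocess_sent(row[2]) for row in group]
        yield query, doc_group, label_group


grouped = list(group_rows(rows))
```

In this toy input, Q2 has no relevant sentence, so only Q1 is kept.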


Now, we will use the class to create the training iterables.

In [4]:
q_iterable = MyWikiIterable('query', os.path.join('data', 'WikiQACorpus', 'WikiQA-train.tsv'))
d_iterable = MyWikiIterable('doc', os.path.join('data', 'WikiQACorpus', 'WikiQA-train.tsv'))
l_iterable = MyWikiIterable('label', os.path.join('data', 'WikiQACorpus', 'WikiQA-train.tsv'))

Additionally, we will create validation iterables so that we can see the progress the model makes as it trains.
Note that the path passed to the class is that of the test set. (Ideally, it should be the dev set.)

In [5]:
q_val_iterable = MyWikiIterable('query', os.path.join('data', 'WikiQACorpus', 'WikiQA-test.tsv'))
d_val_iterable = MyWikiIterable('doc', os.path.join('data', 'WikiQACorpus', 'WikiQA-test.tsv'))
l_val_iterable = MyWikiIterable('label', os.path.join('data', 'WikiQACorpus', 'WikiQA-test.tsv'))

# Using word embeddings
The model needs pre-trained word embeddings such as GloVe.

We have 2 options for this:
1. Use the [KeyedVectors](https://radimrehurek.com/gensim/models/keyedvectors.html)
2. Specify the path to the .txt file

Either one serves as the `word_embedding` parameter of the model.

Earlier, we loaded the KeyedVectors into `kv_model` using gensim-data, so we could pass it directly.

For now, however, let's just give the path. DRMM_TKS will internally create the KeyedVectors.

In [6]:
word_embedding_path = os.path.join('data', 'glove.6B.50d.txt')
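
For orientation, a GloVe .txt file holds one word per line, followed by the components of its vector separated by spaces. The sketch below parses that layout from a small in-memory stand-in. The two 3-dimensional vectors are made up for illustration, and `read_glove_text` is a hypothetical helper, not how DRMM_TKS loads the file.

```python
import io

# Two made-up 3-dimensional vectors standing in for lines of a GloVe .txt file
fake_glove_file = io.StringIO("the 0.1 0.2 0.3\ncave -0.5 0.0 0.25\n")


def read_glove_text(handle):
    """Map each word to its vector (a list of floats)."""
    embeddings = {}
    for line in handle:
        parts = line.rstrip().split(" ")
        # First field is the word, the remaining fields are vector components
        embeddings[parts[0]] = [float(value) for value in parts[1:]]
    return embeddings


embeddings = read_glove_text(fake_glove_file)
```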

# Training the Model
We would like to monitor the progress of training.
However, we can't rely on the metrics provided by Keras, as they don't necessarily apply to Information Retrieval problems.

We can additionally provide a validation dataset, which is evaluated after every epoch.

Now that we have the preprocessed data, training the model takes just one line:

In [7]:
# Train the model
drmm_tks_model = DRMM_TKS(q_iterable, d_iterable, l_iterable, word_embedding=word_embedding_path,
                                epochs=2, validation_data=[q_val_iterable, d_val_iterable, l_val_iterable])

2018-06-28 19:55:12,258 : INFO : Starting Vocab Build
2018-06-28 19:55:13,097 : INFO : Vocab Build Complete
2018-06-28 19:55:13,098 : INFO : Vocab Size is 19894
2018-06-28 19:55:13,098 : INFO : Building embedding index using pretrained word embeddings
2018-06-28 19:55:17,398 : INFO : converting 400000 vectors from data/glove.6B.50d.txt to /tmp/tmp_word2vec.txt
2018-06-28 19:55:18,190 : INFO : loading projection weights from /tmp/tmp_word2vec.txt
2018-06-28 19:55:41,951 : INFO : loaded (400000, 50) matrix from /tmp/tmp_word2vec.txt
2018-06-28 19:55:41,952 : INFO : The embeddings_index built from the given file has 400000 words of 50 dimensions
2018-06-28 19:55:41,953 : INFO : Building the Embedding Matrix for the model's Embedding Layer
2018-06-28 19:55:42,050 : INFO : There are 744 words out of 19894 (3.74%) not in the embeddings. Setting them to zero
2018-06-28 19:55:42,051 : INFO : Adding additional words from the embedding file to embedding matrix
2018-06-28 19:55:43,337 : INFO : No

Instructions for updating:
dim is deprecated, use axis instead


__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
query (InputLayer)              (None, 200)          0                                            
__________________________________________________________________________________________________
doc (InputLayer)                (None, 200)          0                                            
__________________________________________________________________________________________________
embedding_1 (Embedding)         (None, 200, 50)      20037300    query[0][0]                      
                                                                 doc[0][0]                        
__________________________________________________________________________________________________
dot_1 (Dot)                     (None, 200, 200)     0           embedding_1[0][0]                
          

2018-06-28 19:55:54,741 : INFO : Found 21 unknown words. Set them to unknown word index : 400745
2018-06-28 19:55:54,812 : INFO : Found 263 unknown words. Set them to unknown word index : 400745


Epoch 1/2


2018-06-28 19:56:19,135 : INFO : MAP: 0.47
2018-06-28 19:56:19,150 : INFO : nDCG@1 : 0.32
2018-06-28 19:56:19,172 : INFO : nDCG@3 : 0.47
2018-06-28 19:56:19,185 : INFO : nDCG@5 : 0.53
2018-06-28 19:56:19,202 : INFO : nDCG@10 : 0.59
2018-06-28 19:56:19,224 : INFO : nDCG@20 : 0.62


Epoch 2/2


2018-06-28 19:56:40,081 : INFO : MAP: 0.48
2018-06-28 19:56:40,095 : INFO : nDCG@1 : 0.33
2018-06-28 19:56:40,110 : INFO : nDCG@3 : 0.48
2018-06-28 19:56:40,124 : INFO : nDCG@5 : 0.54
2018-06-28 19:56:40,138 : INFO : nDCG@10 : 0.60
2018-06-28 19:56:40,152 : INFO : nDCG@20 : 0.62
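
The MAP and nDCG@k figures in the log above are ranking metrics. The notebook does not show how DRMM_TKS computes them, so the following is only a sketch of the textbook definitions for a single query with binary labels. The function names `average_precision` and `ndcg_at_k` are made up for illustration. MAP is the mean of the average precision over all queries.

```python
import math


def average_precision(labels_sorted):
    """labels_sorted: binary labels ordered by descending model score."""
    hits = 0
    precision_sum = 0.0
    for rank, label in enumerate(labels_sorted, start=1):
        if label:
            hits += 1
            precision_sum += hits / rank  # precision at this relevant rank
    return precision_sum / hits if hits else 0.0


def ndcg_at_k(labels_sorted, k):
    """Discounted cumulative gain of the top k, normalised by the ideal ordering."""
    def dcg(labels):
        return sum(label / math.log2(rank + 1)
                   for rank, label in enumerate(labels[:k], start=1))

    ideal = dcg(sorted(labels_sorted, reverse=True))
    return dcg(labels_sorted) / ideal if ideal else 0.0


# Labels of four candidate documents, already sorted by predicted score
ranked_labels = [0, 1, 0, 1]
ap = average_precision(ranked_labels)
ndcg3 = ndcg_at_k(ranked_labels, 3)
```

Both metrics reward placing relevant documents near the top of the ranking, which plain classification accuracy does not capture.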


## Testing the model on new data

Testing can be done on completely unseen data using `model.predict(queries, docs)`, where `queries` is a list of lists of words and `docs` is a list of lists of lists of words.

In [8]:
# Example:
queries = ["how are glacier caves formed ?".split()]
docs = ["A partly submerged glacier cave on Perito Moreno Glacier".split(),
        "A glacier cave is a cave formed within the ice of a glacier".split()]

In [9]:
drmm_tks_model.predict(queries, docs)

2018-06-28 19:56:41,449 : INFO : Found 0 unknown words. Set them to unknown word index : 400745
2018-06-28 19:56:41,451 : INFO : Found 5 unknown words. Set them to unknown word index : 400745


array([[0.50430447],
       [0.57990086]], dtype=float32)

As can be seen above, the correct answer has the higher similarity score.
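
To turn the scores into a ranking, the documents can be sorted by score in descending order. This is a minimal sketch that hard-codes the two scores printed above as plain Python lists instead of calling the model, so it runs without a trained `drmm_tks_model`.

```python
docs = ["A partly submerged glacier cave on Perito Moreno Glacier".split(),
        "A glacier cave is a cave formed within the ice of a glacier".split()]

# Scores copied from the predict() output above, one row per document
scores = [[0.50430447], [0.57990086]]

# Pair each document with its score and sort with the highest score first
ranked = sorted(zip(docs, scores), key=lambda pair: pair[1][0], reverse=True)
best_doc = ranked[0][0]
print(" ".join(best_doc))
```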