# Getting the dataset
A script has been provided to download all the datasets required for running the below examples.
It will dowload and unzip the WikiQA Corpus and the Quora Duplicate Questions dataset.

In [None]:
!python experimental_data/get_data.py

## Installing dependencies for running the Similarity Learning task

In [1]:
import os
import csv
import re
from gensim.models.experimental import DRMM_TKS
from gensim.utils import simple_preprocess

Using TensorFlow backend.
2018-07-06 00:34:19,104 : INFO : 'pattern' package not found; tag filters are not available for English


## Data Format

We have to provide data in a format which is understood by the model.
The model understands sentences as a list of words. 
Further, we need to give a :
 1. Queries List
 2. Candidate Document List
 3. Correct Label List

1 is a list of list of words
2 and 3 is actually a list of list of list of words/ints

Example:
```
queries = ["When was Abraham Lincoln born ?".split(), 
            "When was the first World War ?".split()]
docs = [
		 ["Abraham Lincoln was the president of the United States of America".split(),
		 "He was born in 1809".split()],
		 ["The first world war was bad".split(),
		 "It was fought in 1914".split(),
		 "There were over a million deaths".split()]
       ]
labels = [[0,
           1],
		  [0,
           1,
           0]
          ]
```

## About the dataset : WikiQA

The WikiQA corpus is a set of question-answer pairs in which for every query there are several candidate documents of which none, one or more documents might be relevant.
Relevance is purely binary, i.e., 1: relavant, 0: not relevant

Sample data:

QuestionID | Question | DocumentID | DocumentTitle | SentenceID | Sentence | Label
-- | -- | -- | -- | -- | -- | --
Q1 | how are glacier caves formed? | D1 | Glacier cave | D1-0 | A partly submerged glacier cave on Perito Moreno Glacier . | 0
Q1 | how are glacier caves formed? | D1 | Glacier cave | D1-1 | The ice facade is approximately 60 m high | 0
Q1 | how are glacier caves formed? | D1 | Glacier cave | D1-2 | Ice formations in the Titlis glacier cave | 0
Q1 | how are glacier caves formed? | D1 | Glacier cave | D1-3 | A glacier cave is a cave formed within the ice of a glacier . | 1
Q1 | how are glacier caves formed? | D1 | Glacier cave | D1-4 | Glacier caves are often called ice caves , but this term is properly used to describe bedrock caves that contain year-round ice. | 0
Q2 | How are the directions of the velocity and force vectors related in a circular motion | D2 | Circular motion | D2-0 | In physics , circular motion is a movement of an object along the circumference of a circle or rotation along a circular path. | 0
Q2 | How are the directions of the velocity and force vectors related in a circular motion | D2 | Circular motion | D2-1 | It can be uniform, with constant angular rate of rotation (and constant speed), or non-uniform with a changing rate of rotation. | 0
Q2 | How are the directions of the velocity and force vectors related in a circular motion | D2 | Circular motion | D2-2 | The rotation around a fixed axis of a three-dimensional body involves circular motion of its parts. | 0
Q2 | How are the directions of the velocity and force vectors related in a circular motion | D2 | Circular motion | D2-3 | The equations of motion describe the movement of the center of mass of a body. | 0
Q2 | How are the directions of the velocity and force vectors related in a circular motion | D2 | Circular motion | D2-4 | Examples of circular motion include: an artificial satellite orbiting the Earth at constant height, a stone which is tied to a rope and is being swung in circles, a car turning through a curve in a race track , an electron moving perpendicular to a uniform magnetic field , and a gear turning inside a mechanism. | 0
Q2 | How are the directions of the velocity and force vectors related in a circular motion | D2 | Circular motion | D2-5 | Since the object's velocity vector is constantly changing direction, the moving object is undergoing acceleration by a centripetal force in the direction of the center of rotation. | 0
Q2 | How are the directions of the velocity and force vectors related in a circular motion | D2 | Circular motion | D2-6 | Without this acceleration, the object would move in a straight line, according to Newton's laws of motion . | 0

## Data Preprocessing
We need to take the above text and make it into `queries, docs, labels` form. For this, we will create an iterable object with the below class which will allow the data to be streamed into the model as the need arises.

In [2]:
class MyWikiIterable:
    """"Yields the next data point in the data set based on the `iter_type`
    
    Based on `iter_type` the object can yield the following:
        'query' : list of str words
        'doc' : list of docs
                    where a doc is a list of str words
        'label' : list of int
                  The relevance between adjacent queries and docs
    """

    def __init__(self, iter_type, fpath):
        """
        Parameters
        ----------
        iter_type : {'query', 'doc', 'label'}
            the type of iterable to be yielded
        fpath : str
            path to the dataset
        """

        # To map the `iter_type` to an index
        self.type_translator = {'query': 0, 'doc': 1, 'label': 2}
        self.iter_type = iter_type

        with open(fpath, encoding='utf8') as tsv_file:
            tsv_reader = csv.reader(tsv_file, delimiter='\t', quoting=csv.QUOTE_NONE)
            self.data_rows = []
            self.data_rows = [row for row in tsv_reader]

    def preprocess_sent(self, sent):
        """Utility function to lower, strip and tokenize each sentence
        Replace this function if you want to handle preprocessing differently"""

        return simple_preprocess(sent)

    def __iter__(self):
        # Defining some consants for .tsv reading
        # They represent the columns of the respective values
        QUESTION_ID_INDEX = 0
        QUESTION_INDEX = 1
        ANSWER_INDEX = 5
        LABEL_INDEX = 6


        # The group of documents and labels that belong to one question
        document_group = []
        label_group = []

        # Number of relevant documents per query
        n_relevant_docs = 0
        # Number of filtered docs (query-doc pairs which have zero relevant docs)
        n_filtered_docs = 0

        # The data
        queries = []
        docs = []
        labels = []

        # The code below goes through the data line by line
        # It checks the current document id with the next document id
        for i, line in enumerate(self.data_rows[1:], start=1):
            if i < len(self.data_rows) - 1:  # check if out of bounds might occur
                if self.data_rows[i][QUESTION_ID_INDEX] == self.data_rows[i + 1][QUESTION_ID_INDEX]:
                    document_group.append(self.preprocess_sent(self.data_rows[i][ANSWER_INDEX]))
                    label_group.append(int(self.data_rows[i][LABEL_INDEX]))
                    n_relevant_docs += int(self.data_rows[i][LABEL_INDEX])
                else:
                    document_group.append(self.preprocess_sent(self.data_rows[i][ANSWER_INDEX]))
                    label_group.append(int(self.data_rows[i][LABEL_INDEX]))

                    n_relevant_docs += int(self.data_rows[i][LABEL_INDEX])

                    if n_relevant_docs > 0:
                        docs.append(document_group)
                        labels.append(label_group)
                        queries.append(self.preprocess_sent(self.data_rows[i][QUESTION_INDEX]))

                        yield [queries[-1], document_group, label_group][self.type_translator[self.iter_type]]
                    else:
                        n_filtered_docs += 1

                    n_relevant_docs = 0
                    document_group = []
                    label_group = []

            else:
                # If we are on the last line
                document_group.append(self.preprocess_sent(self.data_rows[i][ANSWER_INDEX]))
                label_group.append(int(self.data_rows[i][LABEL_INDEX]))
                n_relevant_docs += int(self.data_rows[i][LABEL_INDEX])

                if n_relevant_docs > 0:
                    docs.append(document_group)
                    labels.append(label_group)
                    queries.append(self.preprocess_sent(self.data_rows[i][QUESTION_INDEX]))
                    yield [queries[-1], document_group, label_group][self.type_translator[self.iter_type]]
                else:
                    n_filtered_docs += 1
                    n_relevant_docs = 0


Now, will use the class to create objects of the training iterable

In [3]:
q_iterable = MyWikiIterable('query', os.path.join( 'experimental_data', 'WikiQACorpus', 'WikiQA-train.tsv'))
d_iterable = MyWikiIterable('doc', os.path.join('experimental_data', 'WikiQACorpus', 'WikiQA-train.tsv'))
l_iterable = MyWikiIterable('label', os.path.join('experimental_data', 'WikiQACorpus', 'WikiQA-train.tsv'))

We will also initialize some validation iterables
Note: the path has `dev` in it

In [4]:
q_val_iterable = MyWikiIterable('query', os.path.join( 'experimental_data', 'WikiQACorpus', 'WikiQA-dev.tsv'))
d_val_iterable = MyWikiIterable('doc', os.path.join('experimental_data', 'WikiQACorpus', 'WikiQA-dev.tsv'))
l_val_iterable = MyWikiIterable('label', os.path.join('experimental_data', 'WikiQACorpus', 'WikiQA-dev.tsv'))

# Using word embeddings
We also need to get the word embeddings for the training. For this, we will use the Glove Embeddings.
Luckily, [gensim-data](https://github.com/RaRe-Technologies/gensim-data) provides an easy interface for it.

We will use the [KeyedVectors](https://radimrehurek.com/gensim/models/keyedvectors.html) object that we for from gensim-data api and pass it as the `word_embedding` parameter in the model.

In [5]:
import gensim.downloader as api
kv_model = api.load("glove-wiki-gigaword-300")

2018-07-06 00:34:23,010 : INFO : loading projection weights from /home/aneeshj/gensim-data/glove-wiki-gigaword-300/glove-wiki-gigaword-300.gz
2018-07-06 00:36:07,145 : INFO : loaded (400000, 300) matrix from /home/aneeshj/gensim-data/glove-wiki-gigaword-300/glove-wiki-gigaword-300.gz


# Training the Model
Now that we have the preprocessed extracted data and word embeddings, training the model just takes one line:

In [6]:
# Train the model
drmm_tks_model = DRMM_TKS(
                    queries=q_iterable, docs=d_iterable, labels=l_iterable, word_embedding=kv_model, epochs=3,
                    validation_data=[q_val_iterable, d_val_iterable, l_val_iterable], topk=20
                )

2018-07-06 00:36:07,151 : INFO : Starting Vocab Build
2018-07-06 00:36:08,602 : INFO : Vocab Build Complete
2018-07-06 00:36:08,603 : INFO : Vocab Size is 18814
2018-07-06 00:36:08,605 : INFO : Building embedding index using KeyedVector pretrained word embeddings
2018-07-06 00:36:08,605 : INFO : The embeddings_index built from the given file has 400000 words of 300 dimensions
2018-07-06 00:36:08,606 : INFO : Building the Embedding Matrix for the model's Embedding Layer
2018-07-06 00:36:08,836 : INFO : There are 642 words out of 18814 (3.41%) not in the embeddings. Setting them to random
2018-07-06 00:36:08,836 : INFO : Adding additional words from the embedding file to embedding matrix
2018-07-06 00:36:10,775 : INFO : Normalizing the word embeddings
2018-07-06 00:36:59,403 : INFO : Embedding Matrix build complete. It now has shape (400644, 300)
2018-07-06 00:37:06,320 : INFO : Pad word has been set to index 400642
2018-07-06 00:37:06,815 : INFO : Unknown word has been set to index 4006

Epoch 1/3


2018-07-06 00:41:17,729 : INFO : MAP: 0.55
2018-07-06 00:41:17,735 : INFO : nDCG@1 : 0.38
2018-07-06 00:41:17,740 : INFO : nDCG@3 : 0.54
2018-07-06 00:41:17,746 : INFO : nDCG@5 : 0.60
2018-07-06 00:41:17,751 : INFO : nDCG@10 : 0.66
2018-07-06 00:41:17,756 : INFO : nDCG@20 : 0.67


Epoch 2/3


2018-07-06 00:42:46,586 : INFO : MAP: 0.61
2018-07-06 00:42:46,592 : INFO : nDCG@1 : 0.46
2018-07-06 00:42:46,597 : INFO : nDCG@3 : 0.61
2018-07-06 00:42:46,604 : INFO : nDCG@5 : 0.67
2018-07-06 00:42:46,616 : INFO : nDCG@10 : 0.71
2018-07-06 00:42:46,621 : INFO : nDCG@20 : 0.72


Epoch 3/3


2018-07-06 00:44:15,788 : INFO : MAP: 0.62
2018-07-06 00:44:15,793 : INFO : nDCG@1 : 0.46
2018-07-06 00:44:15,800 : INFO : nDCG@3 : 0.60
2018-07-06 00:44:15,809 : INFO : nDCG@5 : 0.67
2018-07-06 00:44:15,815 : INFO : nDCG@10 : 0.71
2018-07-06 00:44:15,821 : INFO : nDCG@20 : 0.72


## Testing the model on new data

The testing of the data can be done on completely unseen data using `model.predict(queries, docs)` where
queries: list of list of words
docs: list of list of list of words

In [8]:
queries = [simple_preprocess("how are glacier caves formed"),
           simple_preprocess("What is AWS")]

docs = [[simple_preprocess("A partly submerged glacier cave on Perito Moreno Glacier"),
        simple_preprocess("A glacier cave is a cave formed within the ice of a glacier")],
       [simple_preprocess("AWS stands for Amazon Web Services"),
        simple_preprocess("AWS was established in 2001"),
        simple_preprocess("It is a cloud service")]]

The predict function returns the similarity between a query-document pair in a list format

For example
```
queries = [q1, q2]
docs = [[d1_1, d1_2],
        [d2_1, d2_2, d2_3]]

model.predict(queries, docs)

Output
------
q1-d1_1 similarity
q1-d1_2 similarity
q2-d2_1 similarity
q2-d2_2 similarity
q2-d2_3 similarity
```

In [9]:
drmm_tks_model.predict(queries, docs)

2018-07-06 00:46:33,249 : INFO : Found 0 unknown words. Set them to unknown word index : 400643
2018-07-06 00:46:33,283 : INFO : Found 0 unknown words. Set them to unknown word index : 400643
2018-07-06 00:46:33,778 : INFO : Predictions in the format query, doc, similarity
2018-07-06 00:46:33,800 : INFO : ['how', 'are', 'glacier', 'caves', 'formed']	['partly', 'submerged', 'glacier', 'cave', 'on', 'perito', 'moreno', 'glacier']	0.75623834
2018-07-06 00:46:33,801 : INFO : ['how', 'are', 'glacier', 'caves', 'formed']	['glacier', 'cave', 'is', 'cave', 'formed', 'within', 'the', 'ice', 'of', 'glacier']	0.88229656
2018-07-06 00:46:33,802 : INFO : ['what', 'is', 'aws']	['aws', 'stands', 'for', 'amazon', 'web', 'services']	0.5922452
2018-07-06 00:46:33,802 : INFO : ['what', 'is', 'aws']	['aws', 'was', 'established', 'in']	0.581025
2018-07-06 00:46:33,803 : INFO : ['what', 'is', 'aws']	['it', 'is', 'cloud', 'service']	0.65737


array([[0.75623834],
       [0.88229656],
       [0.5922452 ],
       [0.581025  ],
       [0.65737   ]], dtype=float32)

As can be seen from the logs and results above, within each query-document group, the correct answer has the highest score

For example,
In the first group
```
['how', 'are', 'glacier', 'caves', 'formed'] ['partly', 'submerged', 'glacier', 'cave', 'on', 'perito', 'moreno', 'glacier']	0.7
['how', 'are', 'glacier', 'caves', 'formed'] ['glacier', 'cave', 'is', 'cave', 'formed', 'within', 'the', 'ice', 'of', 'glacier']	0.8
```

The correct answer, "glacier cave is cave ..." has the higher score as compared to the first answer
The same can be seen for the second part

### Testing on a test set
We can pass a whole dataset and get evaluations based on that. Let's try with the test set of WikiQA Corpus

In [10]:
q_test_iterable = MyWikiIterable('query', os.path.join( 'experimental_data', 'WikiQACorpus', 'WikiQA-test.tsv'))
d_test_iterable = MyWikiIterable('doc', os.path.join('experimental_data', 'WikiQACorpus', 'WikiQA-test.tsv'))
l_test_iterable = MyWikiIterable('label', os.path.join('experimental_data', 'WikiQACorpus', 'WikiQA-test.tsv'))

In [11]:
drmm_tks_model.evaluate(q_test_iterable, d_test_iterable, l_test_iterable)

2018-07-06 00:48:00,129 : INFO : Found 21 unknown words. Set them to unknown word index : 400643
2018-07-06 00:48:00,202 : INFO : Found 253 unknown words. Set them to unknown word index : 400643
2018-07-06 00:48:09,461 : INFO : MAP: 0.60
2018-07-06 00:48:09,523 : INFO : nDCG@1 : 0.47
2018-07-06 00:48:09,541 : INFO : nDCG@3 : 0.60
2018-07-06 00:48:09,567 : INFO : nDCG@5 : 0.66
2018-07-06 00:48:09,591 : INFO : nDCG@10 : 0.70
2018-07-06 00:48:09,607 : INFO : nDCG@20 : 0.71


## Comparing DRMM TKS with other models

It would be good to get an idea of how our model works against some unsupervised models like word2vec and FastText.
For this, we will, given a query-document pair, we will get a vector for the query and document. We can get the similarity between them using the cosine similarity between their vectors.

### For word2vec


In [12]:
def cosine_similarity(vec1, vec2):
    return np.dot(vec1, vec2)/(np.linalg.norm(vec1)* np.linalg.norm(vec2))

In [13]:
import numpy as np
from gensim.models.experimental import mapk, mean_ndcg

def eval_model(queries, docs, labels, model):
    long_doc_list = []
    long_label_list = []
    long_query_list = []
    doc_lens = []

    def sent2vec(sentence):
        vec = np.zeros((model.vector_size))
        for word in sentence:
            if word in model:
                vec += model[word]
        return vec/len(sentence)
    
    for query, doc, label in zip(queries, docs, labels):
        i = 0
        for d, l in zip(doc, label):
            if len(d) == 0 or len(query) == 0:
                print("skipping query-doc pair due to no words in vocab")
                continue
            long_query_list.append(sent2vec(query))
            long_doc_list.append(sent2vec(d))
            long_label_list.append(l)
            i += 1
        doc_lens.append(len(doc))

    doc_lens = np.array(doc_lens)

    predictions = []
    for q, d in zip(long_query_list, long_doc_list):
        predictions.append(cosine_similarity(q, d))

    Y_pred = []
    Y_true = []
    offset = 0

    for doc_size in doc_lens:
        Y_pred.append(predictions[offset: offset + doc_size])
        Y_true.append(long_label_list[offset: offset + doc_size])
        offset += doc_size
        
    print("MAP: %.2f"% mapk(Y_true, Y_pred))
    for k in [1, 3, 5, 10, 20]:
        print("nDCG@%d : %.2f " % (k, mean_ndcg(Y_true, Y_pred, k=k)))


In [14]:
eval_model(q_test_iterable, d_test_iterable, l_test_iterable, kv_model)

skipping query-doc pair due to no words in vocab
skipping query-doc pair due to no words in vocab
MAP: 0.58
nDCG@1 : 0.43 
nDCG@3 : 0.60 
nDCG@5 : 0.66 
nDCG@10 : 0.70 
nDCG@20 : 0.71 


Let's compare that with our model

In [15]:
drmm_tks_model.evaluate(q_test_iterable, d_test_iterable, l_test_iterable)

2018-07-06 00:49:11,315 : INFO : Found 21 unknown words. Set them to unknown word index : 400643
2018-07-06 00:49:11,379 : INFO : Found 253 unknown words. Set them to unknown word index : 400643
2018-07-06 00:49:21,218 : INFO : MAP: 0.60
2018-07-06 00:49:21,229 : INFO : nDCG@1 : 0.47
2018-07-06 00:49:21,246 : INFO : nDCG@3 : 0.60
2018-07-06 00:49:21,263 : INFO : nDCG@5 : 0.66
2018-07-06 00:49:21,274 : INFO : nDCG@10 : 0.70
2018-07-06 00:49:21,286 : INFO : nDCG@20 : 0.71


While the accuracy isn't any better, it is worse, this is still a Work In Progress and we hope to improve it further soon.

## Saving and loading the model
The trained model can be saved and loaded from memory for future use.

In [7]:
drmm_tks_model.save('drmm_tks_model')

2018-07-06 00:44:16,527 : INFO : saving DRMM_TKS object under drmm_tks_model, separately None
2018-07-06 00:44:16,529 : INFO : storing np array 'vectors' to drmm_tks_model.word_embedding.vectors.npy
2018-07-06 00:45:09,654 : INFO : storing np array 'embedding_matrix' to drmm_tks_model.embedding_matrix.npy
2018-07-06 00:45:18,682 : INFO : not storing attribute model
2018-07-06 00:45:18,684 : INFO : not storing attribute _get_pair_list
2018-07-06 00:45:18,685 : INFO : not storing attribute _get_full_batch_iter
2018-07-06 00:45:18,687 : INFO : not storing attribute queries
2018-07-06 00:45:18,688 : INFO : not storing attribute docs
2018-07-06 00:45:18,690 : INFO : not storing attribute labels
2018-07-06 00:45:18,691 : INFO : not storing attribute pair_list
2018-07-06 00:45:36,062 : INFO : saved drmm_tks_model


In [None]:
del drmm_tks_model
drmm_tks_model = DRMM_TKS.load('drmm_tks_model')