## First, we import the dependencies for the Similarity Learning task

In [1]:
import os
import csv
import re
from drmm_tks import DRMM_TKS_Model
from pprint import pprint

Using TensorFlow backend.


## Data Format

The model expects its data in a specific format.
Each sentence is represented as a list of words.
We need to provide three lists:
 1. Queries List
 2. Candidate Document List
 3. Correct Label List

1 is a list of lists of words.
2 is a list of lists of lists of words and 3 is a list of lists of ints: for each query there is one group of candidate documents and one label per document in that group.

Example:
```
queries = ["When was Abraham Lincoln born ?".split(),
           "When was the first World War ?".split()]
docs = [
    ["Abraham Lincoln was the president of the United States of America".split(),
     "He was born in 1809".split()],
    ["The first world war was bad".split(),
     "It was fought in 1914".split(),
     "There were over a million deaths".split()]
]
labels = [[0, 1],
          [0, 1, 0]]
```
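
Because the three lists must line up query by query, a quick sanity check before training can catch shape mistakes early. The sketch below is only an illustration: `check_alignment` and the `example_*` names are hypothetical helpers, not part of the model's API.

```python
def check_alignment(queries, docs, labels):
    """Raise ValueError if queries, docs and labels are not aligned.

    Hypothetical helper: one document group and one label group per query,
    and one label per candidate document.
    """
    if not (len(queries) == len(docs) == len(labels)):
        raise ValueError("queries, docs and labels must have the same length")
    for i, (doc_group, label_group) in enumerate(zip(docs, labels)):
        if len(doc_group) != len(label_group):
            raise ValueError(
                "query %d has %d docs but %d labels"
                % (i, len(doc_group), len(label_group))
            )
    return True


example_queries = ["When was Abraham Lincoln born ?".split(),
                   "When was the first World War ?".split()]
example_docs = [
    ["Abraham Lincoln was the president of the United States of America".split(),
     "He was born in 1809".split()],
    ["The first world war was bad".split(),
     "It was fought in 1914".split(),
     "There were over a million deaths".split()],
]
example_labels = [[0, 1], [0, 1, 0]]

print(check_alignment(example_queries, example_docs, example_labels))
```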

## About the dataset : WikiQA

The WikiQA corpus is a set of question-answer pairs in which every query has several candidate documents, of which none, one, or more may be relevant.
Relevance is purely binary: 1 means relevant, 0 means not relevant.

Sample data:
```
QuestionID	Question	DocumentID	DocumentTitle	SentenceID	Sentence	Label
Q1	how are glacier caves formed?	D1	Glacier cave	D1-0	A partly submerged glacier cave on Perito Moreno Glacier .	0
Q1	how are glacier caves formed?	D1	Glacier cave	D1-1	The ice facade is approximately 60 m high	0
Q1	how are glacier caves formed?	D1	Glacier cave	D1-2	Ice formations in the Titlis glacier cave	0
Q1	how are glacier caves formed?	D1	Glacier cave	D1-3	A glacier cave is a cave formed within the ice of a glacier .	1
Q1	how are glacier caves formed?	D1	Glacier cave	D1-4	Glacier caves are often called ice caves , but this term is properly used to describe bedrock caves that contain year-round ice.	0
```

## Data Preprocessing
We need to convert the text above into the `queries, docs, labels` form.
The code below does that.


In [2]:
# Set this to wherever you keep your WikiQACorpus folder
wikiqa_data_path = os.path.join('data', 'WikiQACorpus', 'WikiQA-train.tsv')


def preprocess_sent(sent):
    """Utility function to lower, strip and tokenize each sentence
    
    Replace this function if you want to handle preprocessing differently"""
    return re.sub("[^a-zA-Z0-9]", " ", sent.strip().lower()).split()

# Defining some constants for .tsv reading
QUESTION_ID_INDEX = 0
QUESTION_INDEX = 1
ANSWER_INDEX = 5
LABEL_INDEX = 6

with open(wikiqa_data_path, encoding='utf8') as tsv_file:
    tsv_reader = csv.reader(tsv_file, delimiter='\t')
    data_rows = []
    for row in tsv_reader:
        data_rows.append(row)


        
document_group = []
label_group = []

n_relevant_docs = 0
n_filtered_docs = 0

queries = []
docs = []
labels = []

for i, line in enumerate(data_rows[1:], start=1):
    if i < len(data_rows) - 1:  # check if out of bounds might occur
        if data_rows[i][QUESTION_ID_INDEX] == data_rows[i + 1][QUESTION_ID_INDEX]:
            document_group.append(preprocess_sent(data_rows[i][ANSWER_INDEX]))
            label_group.append(int(data_rows[i][LABEL_INDEX]))
            n_relevant_docs += int(data_rows[i][LABEL_INDEX])
        else:
            document_group.append(preprocess_sent(data_rows[i][ANSWER_INDEX]))
            label_group.append(int(data_rows[i][LABEL_INDEX]))

            n_relevant_docs += int(data_rows[i][LABEL_INDEX])

            if n_relevant_docs > 0:
                docs.append(document_group)
                labels.append(label_group)
                queries.append(preprocess_sent(data_rows[i][QUESTION_INDEX]))
            else:
                n_filtered_docs += 1

            n_relevant_docs = 0
            document_group = []
            label_group = []

    else:
        # If we are on the last line
        document_group.append(preprocess_sent(data_rows[i][ANSWER_INDEX]))
        label_group.append(int(data_rows[i][LABEL_INDEX]))
        n_relevant_docs += int(data_rows[i][LABEL_INDEX])

        if n_relevant_docs > 0:
            docs.append(document_group)
            labels.append(label_group)
            queries.append(preprocess_sent(data_rows[i][QUESTION_INDEX]))
        else:
            n_filtered_docs += 1
            n_relevant_docs = 0
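
The loop above groups consecutive rows that share a question ID and drops questions with no relevant answer. As a sketch of the same idea, the grouping can also be written with `itertools.groupby`. The function name `group_rows` and the made-up `Q2` row are assumptions added for illustration; `preprocess_sent` is repeated so the block runs on its own.

```python
import re
from itertools import groupby


def preprocess_sent(sent):
    """Lowercase, strip and tokenize a sentence (same as in the cell above)."""
    return re.sub("[^a-zA-Z0-9]", " ", sent.strip().lower()).split()


def group_rows(rows, qid_idx=0, q_idx=1, ans_idx=5, label_idx=6):
    """Group consecutive rows by question ID.

    Questions without any relevant answer are skipped and counted.
    `rows` must not include the header row.
    """
    queries, docs, labels = [], [], []
    n_filtered = 0
    for _, group in groupby(rows, key=lambda row: row[qid_idx]):
        group = list(group)
        label_group = [int(row[label_idx]) for row in group]
        if sum(label_group) == 0:
            n_filtered += 1
            continue
        queries.append(preprocess_sent(group[0][q_idx]))
        docs.append([preprocess_sent(row[ans_idx]) for row in group])
        labels.append(label_group)
    return queries, docs, labels, n_filtered


# Two rows from the sample data, plus one invented row (Q2) with no relevant answer
sample_rows = [
    ["Q1", "how are glacier caves formed?", "D1", "Glacier cave", "D1-0",
     "A partly submerged glacier cave on Perito Moreno Glacier .", "0"],
    ["Q1", "how are glacier caves formed?", "D1", "Glacier cave", "D1-3",
     "A glacier cave is a cave formed within the ice of a glacier .", "1"],
    ["Q2", "an invented question", "D2", "Invented title", "D2-0",
     "An invented sentence .", "0"],
]

sample_queries, sample_docs, sample_labels, sample_filtered = group_rows(sample_rows)
print(sample_queries, sample_labels, sample_filtered)
```

With the data loaded above, the call would be `group_rows(data_rows[1:])`, skipping the header row as the original loop does.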

## Let's have a look at the data

In [3]:
queries[300]

['what', 'type', 'of', 'batteries', 'are', '357', 'lr44']

In [4]:
print(docs[300])

[['lr44', 'is', 'the', 'iec', 'designation', 'for', 'an', 'alkaline', '1', '5', 'volt', 'button', 'cell', 'commonly', 'used', 'in', 'small', 'led', 'flashlights', 'digital', 'thermometers', 'calculators', 'calipers', 'watches', 'clocks', 'toys', 'and', 'laser', 'pointers'], ['lr44', 'alkaline', 'cell'], ['the', 'battery', 'nomenclature', 'is', 'defined', 'by', 'the', 'international', 'electrotechnical', 'commission', 'iec', 'in', 'its', '60086', '3', 'standard', 'primary', 'batteries', 'part', '3', 'watch', 'batteries'], ['the', 'letter', 'l', 'indicates', 'the', 'electrochemical', 'system', 'used', 'a', 'zinc', 'negative', 'electrode', 'manganese', 'dioxide', 'depolarizer', 'and', 'positive', 'electrode', 'and', 'an', 'alkaline', 'electrolyte'], ['r44', 'indicates', 'a', 'round', 'cell', '11', '4', '0', '2', 'mm', 'diameter', 'and', '5', '2', '0', '2', 'mm', 'height', 'as', 'defined', 'by', 'the', 'iec', 'standard', '60086'], ['manufacturers', 'have', 'their', 'own', 'part', 'numbers'

In [5]:
print(labels[300])

[1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]


## Making a train-validation split
At this point, it is a good idea to make a train-validation split so we can see how the model performs as it trains. The `test_*` variables below hold the validation portion.

In [6]:
train_queries, test_queries = queries[:int(len(queries)*0.8)], queries[int(len(queries)*0.8): ]
train_docs, test_docs = docs[:int(len(docs)*0.8)], docs[int(len(docs)*0.8):]
train_labels, test_labels = labels[:int(len(labels)*0.8)], labels[int(len(labels)*0.8):]

In [7]:
print(len(train_queries), len(test_queries))
print(len(train_docs), len(test_docs))
print(len(train_labels), len(test_labels))

698 175
698 175
698 175
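
The three slicing lines all cut at the same index, so the split can be factored into one small function. This is a minimal sketch; `split_by_fraction` is a hypothetical name, and like the cell above it keeps the original order without shuffling.

```python
def split_by_fraction(items, train_fraction=0.8):
    """Split a list into (train, validation) parts at a fixed fraction."""
    cut = int(len(items) * train_fraction)
    return items[:cut], items[cut:]


# Stand-in list with as many entries as the queries above (698 + 175 = 873)
stand_in = list(range(873))
train_part, validation_part = split_by_fraction(stand_in)
print(len(train_part), len(validation_part))
```

Applying the same function to `queries`, `docs` and `labels` keeps the three lists aligned, since each is cut at the same position.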


## Training the Model
If we want to train the model with pretrained word embeddings such as GloVe, we have to specify the path to the embedding file.

In [8]:
word_embedding_path = os.path.join('data', 'glove.6B.50d.txt')
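
For orientation, a GloVe text file stores one word per line, followed by its vector components separated by spaces. The sketch below reads that layout into a dictionary. It only illustrates the file format, using a tiny in-memory sample, and is not how the model itself loads the embeddings; `load_glove_lines` is a hypothetical name.

```python
import io


def load_glove_lines(lines):
    """Parse GloVe-style lines ("word v1 v2 ...") into {word: [floats]}."""
    embeddings = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        if len(parts) < 2:
            continue  # skip empty or malformed lines
        embeddings[parts[0]] = [float(value) for value in parts[1:]]
    return embeddings


# Tiny invented sample with 3 dimensions; glove.6B.50d.txt has 50 per word
sample_file = io.StringIO("the 0.1 0.2 0.3\ncat 0.4 0.5 0.6\n")
sample_embeddings = load_glove_lines(sample_file)
print(len(sample_embeddings), len(sample_embeddings["cat"]))
```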

We would like to monitor the model's progress during training.
However, we can't rely on the metrics provided by Keras, as they don't necessarily apply to Information Retrieval problems.

We can additionally provide a validation dataset which will be tested after every epoch.
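
To make the ranking metrics reported during training more concrete, here is a minimal sketch of average precision and nDCG@k for binary labels, following the standard textbook definitions. The model's own implementation may differ in details, and all function names here are hypothetical.

```python
import math


def rank_labels(scores, labels):
    """Return the labels reordered by descending score."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [labels[i] for i in order]


def average_precision(ranked_labels):
    """Average precision for one query with binary labels in ranked order."""
    hits = 0
    precisions = []
    for rank, label in enumerate(ranked_labels, start=1):
        if label == 1:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0


def dcg_at_k(ranked_labels, k):
    """Discounted cumulative gain over the top k positions."""
    return sum(label / math.log2(rank + 1)
               for rank, label in enumerate(ranked_labels[:k], start=1))


def ndcg_at_k(ranked_labels, k):
    """DCG@k divided by the DCG@k of the ideal ordering."""
    ideal = dcg_at_k(sorted(ranked_labels, reverse=True), k)
    return dcg_at_k(ranked_labels, k) / ideal if ideal > 0 else 0.0


def mean_average_precision(all_scores, all_labels):
    """Mean of the per-query average precision values."""
    values = [average_precision(rank_labels(s, l))
              for s, l in zip(all_scores, all_labels)]
    return sum(values) / len(values) if values else 0.0


# Invented scores for one query whose only relevant document is ranked last
ranked = rank_labels([0.9, 0.2, 0.5], [0, 1, 0])
print(ranked, average_precision(ranked), ndcg_at_k(ranked, 3))
```

Unlike accuracy, both metrics depend only on where the relevant documents land in the ranking for each query, which is why they suit this task better.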

Now that we have the extracted and preprocessed data, training the model takes a single call:

In [9]:
# Train the model
drmm_tks_model = DRMM_TKS_Model(train_queries, train_docs, train_labels, word_embedding_path=word_embedding_path,
                                epochs=3, validation_data=[test_queries, test_docs, test_labels])

2018-06-21 19:09:33,376 : INFO : Starting Vocab Build
2018-06-21 19:09:33,457 : INFO : Vocab Build Complete
2018-06-21 19:09:33,458 : INFO : Vocab Size is 17787
2018-06-21 19:09:33,459 : INFO : Building embedding index using pretrained word embeddings
2018-06-21 19:09:41,432 : INFO : The embeddings_index built from the given file has 400000 words of 50 dimensions
2018-06-21 19:09:41,433 : INFO : Embedding Matrix for Embedding Layer has shape (17788, 50) 
2018-06-21 19:09:41,472 : INFO : There are 590 words not in the embeddings. Setting them to zero
2018-06-21 19:09:41,473 : INFO : Adding additional dimensions from the embedding file to embedding matrix
2018-06-21 19:09:42,084 : INFO : Normalizing the word embeddings
2018-06-21 19:09:42,373 : INFO : Embedding Matrix now has shape (400593, 50)
2018-06-21 19:09:42,374 : INFO : Pad word has been set to index 400590
2018-06-21 19:09:42,375 : INFO : Embedding index build complete


____________________________________________________________________________________________________
Layer (type)                     Output Shape          Param #     Connected to                     
query (InputLayer)               (None, 200)           0                                            
____________________________________________________________________________________________________
doc (InputLayer)                 (None, 200)           0                                            
____________________________________________________________________________________________________
embedding_1 (Embedding)          (None, 200, 50)       20029650    query[0][0]                      
                                                                   doc[0][0]                        
____________________________________________________________________________________________________
mm_q_embed_DOT_d_embed (Dot)     (None, 200, 200)      0           embedding_1[0][0]       

2018-06-21 19:10:21,022 : INFO : Using 175 out of 175 data points which is 0.00%. 0 were skipped
2018-06-21 19:10:21,038 : INFO : Using 175 out of 175 data points which is 0.00%. 0 were skipped
2018-06-21 19:10:21,064 : INFO : Using 175 out of 175 data points which is 0.00%. 0 were skipped
2018-06-21 19:10:21,085 : INFO : Using 175 out of 175 data points which is 0.00%. 0 were skipped
2018-06-21 19:10:21,110 : INFO : Using 175 out of 175 data points which is 0.00%. 0 were skipped
2018-06-21 19:10:21,137 : INFO : Using 175 out of 175 data points which is 0.00%. 0 were skipped


MAP:  0.476730718902
nDCG@ 1 :  0.268571428571
nDCG@ 3 :  0.475654849525
nDCG@ 5 :  0.538095362202
nDCG@ 10 :  0.603041857308
nDCG@ 20 :  0.614393862232
Epoch 2/3

2018-06-21 19:10:55,931 : INFO : Using 175 out of 175 data points which is 0.00%. 0 were skipped
2018-06-21 19:10:55,940 : INFO : Using 175 out of 175 data points which is 0.00%. 0 were skipped
2018-06-21 19:10:55,958 : INFO : Using 175 out of 175 data points which is 0.00%. 0 were skipped
2018-06-21 19:10:55,970 : INFO : Using 175 out of 175 data points which is 0.00%. 0 were skipped
2018-06-21 19:10:55,981 : INFO : Using 175 out of 175 data points which is 0.00%. 0 were skipped
2018-06-21 19:10:55,992 : INFO : Using 175 out of 175 data points which is 0.00%. 0 were skipped


MAP:  0.51234904502
nDCG@ 1 :  0.337142857143
nDCG@ 3 :  0.505490060509
nDCG@ 5 :  0.575814710287
nDCG@ 10 :  0.629110718279
nDCG@ 20 :  0.642106424557
Epoch 3/3

2018-06-21 19:11:30,690 : INFO : Using 175 out of 175 data points which is 0.00%. 0 were skipped
2018-06-21 19:11:30,701 : INFO : Using 175 out of 175 data points which is 0.00%. 0 were skipped
2018-06-21 19:11:30,711 : INFO : Using 175 out of 175 data points which is 0.00%. 0 were skipped
2018-06-21 19:11:30,722 : INFO : Using 175 out of 175 data points which is 0.00%. 0 were skipped
2018-06-21 19:11:30,733 : INFO : Using 175 out of 175 data points which is 0.00%. 0 were skipped
2018-06-21 19:11:30,746 : INFO : Using 175 out of 175 data points which is 0.00%. 0 were skipped


MAP:  0.533664713478
nDCG@ 1 :  0.371428571429
nDCG@ 3 :  0.525354523284
nDCG@ 5 :  0.593719007026
nDCG@ 10 :  0.647851673016
nDCG@ 20 :  0.658232724156


## Testing the model on new data

The model can be run on completely unseen data using `model.predict(queries, docs)`, where
queries: list of list of words
docs: list of list of list of words

In [10]:
# Example:
queries = ["how are glacier caves formed ?".split()]
docs = ["A partly submerged glacier cave on Perito Moreno Glacier".split(),
        "A glacier cave is a cave formed within the ice of a glacier".split()]

In [11]:
drmm_tks_model.predict(queries, docs)

[[ 0.44529271]
 [ 0.57637918]]


As can be seen above, the correct answer has the higher similarity score.
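
To turn the scores into a ranking, the candidates can be sorted by score in descending order. The sketch below copies the two scores from the output above into a plain list, so it runs without the trained model; the variable names are assumptions for illustration.

```python
candidate_docs = [
    "A partly submerged glacier cave on Perito Moreno Glacier",
    "A glacier cave is a cave formed within the ice of a glacier",
]

# Scores copied from the predict output above, one row per candidate
predicted_scores = [[0.44529271], [0.57637918]]

# Flatten the (n, 1) rows and pair each score with its candidate
flat_scores = [row[0] for row in predicted_scores]
ranked_candidates = sorted(zip(flat_scores, candidate_docs),
                           key=lambda pair: pair[0], reverse=True)

for score, doc in ranked_candidates:
    print("%.4f  %s" % (score, doc))
```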