# Question Answering with DeepMatcher

Note: you can run **[this notebook live in Google Colab](https://colab.research.google.com/github/anhaidgroup/deepmatcher/blob/master/examples/question_answering.ipynb)**.

DeepMatcher can easily be used for text matching tasks such as Question Answering and Text Entailment. In this tutorial we will see how to use DeepMatcher for Answer Selection, a major sub-task of Question Answering. Specifically, we will look at [WikiQA](https://aclweb.org/anthology/D15-1237), a benchmark dataset for Answer Selection. There are three main steps in this tutorial:

1. Get data and transform it into DeepMatcher input format
2. Set up and train DeepMatcher model
3. Evaluate model using QA eval metrics

Before we begin, if you are running this notebook in Colab, you will first need to install the necessary packages by running the code below:

In [None]:
try:
    import deepmatcher
except ImportError:
    # Install only when the package is missing
    !pip install -qqq deepmatcher

## Step 1: Get data and transform it into DeepMatcher input format

First let's import relevant packages and download the dataset:

In [1]:
import deepmatcher as dm
import pandas as pd
import os

!wget -qnc https://download.microsoft.com/download/E/5/F/E5FCFCEE-7005-4814-853D-DAA7C66507E0/WikiQACorpus.zip
!unzip -qn WikiQACorpus.zip

Let's see what this dataset looks like:

In [2]:
raw_train = pd.read_csv(os.path.join('WikiQACorpus', 'WikiQA-train.txt'), sep='\t', header=None)
raw_train.head()

| | 0 | 1 | 2 |
|---|---|---|---|
| 0 | how are glacier caves formed ? | A partly submerged glacier cave on Perito More... | 0 |
| 1 | how are glacier caves formed ? | The ice facade is approximately 60 m high | 0 |
| 2 | how are glacier caves formed ? | Ice formations in the Titlis glacier cave | 0 |
| 3 | how are glacier caves formed ? | A glacier cave is a cave formed within the ice... | 1 |
| 4 | how are glacier caves formed ? | Glacier caves are often called ice caves , but... | 0 |


Clearly, this is not the format `deepmatcher` expects for its input data: the file has no column names, no ID column, and it's not a CSV file. Let's fix that:

In [3]:
raw_train.columns = ['left_value', 'right_value', 'label']
raw_train.index.name = 'id'
raw_train.head()

| id | left_value | right_value | label |
|---|---|---|---|
| 0 | how are glacier caves formed ? | A partly submerged glacier cave on Perito More... | 0 |
| 1 | how are glacier caves formed ? | The ice facade is approximately 60 m high | 0 |
| 2 | how are glacier caves formed ? | Ice formations in the Titlis glacier cave | 0 |
| 3 | how are glacier caves formed ? | A glacier cave is a cave formed within the ice... | 1 |
| 4 | how are glacier caves formed ? | Glacier caves are often called ice caves , but... | 0 |


Looks good. Now let's save this to disk and transform the validation and test data in the same way:

In [4]:
raw_train.to_csv(os.path.join('WikiQACorpus', 'dm_train.csv'))

raw_files = ['WikiQA-dev.txt', 'WikiQA-test.txt']
csv_files = ['dm_valid.csv', 'dm_test.csv']
for raw_file, csv_file in zip(raw_files, csv_files):
    raw_data = pd.read_csv(os.path.join('WikiQACorpus', raw_file), sep='\t', header=None)
    raw_data.columns = ['left_value', 'right_value', 'label']
    raw_data.index.name = 'id'
    raw_data.to_csv(os.path.join('WikiQACorpus', csv_file))
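
Before handing the converted files to `deepmatcher`, it can help to sanity-check their headers. The sketch below is a minimal, standard-library check of the layout used in this tutorial: an `id` column, a `label` column, and attribute columns that come in `left_`/`right_` pairs. The helper `check_header` is hypothetical (it is not part of `deepmatcher`), and the column conventions it checks are assumed from the files built above.

```python
import csv
import io


def check_header(columns, id_col='id', label_col='label',
                 left_prefix='left_', right_prefix='right_'):
    """Return a list of problems found in a CSV header (empty if none).

    Hypothetical helper: the expected layout is assumed from this tutorial.
    """
    problems = []
    if id_col not in columns:
        problems.append('missing id column: ' + id_col)
    if label_col not in columns:
        problems.append('missing label column: ' + label_col)

    # Attribute names with the prefix stripped, e.g. 'left_value' -> 'value'
    left_attrs = {c[len(left_prefix):] for c in columns if c.startswith(left_prefix)}
    right_attrs = {c[len(right_prefix):] for c in columns if c.startswith(right_prefix)}
    if not left_attrs and not right_attrs:
        problems.append('no left_/right_ attribute columns found')
    elif left_attrs != right_attrs:
        unpaired = sorted(left_attrs ^ right_attrs)
        problems.append('unpaired attributes: ' + ', '.join(unpaired))
    return problems


# A small in-memory sample with the same header as dm_train.csv
sample_csv = (
    'id,left_value,right_value,label\n'
    '3,how are glacier caves formed ?,A glacier cave is a cave formed within the ice...,1\n'
)
header = next(csv.reader(io.StringIO(sample_csv)))
converted_problems = check_header(header)

# The raw file had no usable header, which the check reports
raw_problems = check_header(['0', '1', '2'])

print(converted_problems)
print(raw_problems)
```

To check a file on disk instead, open it and pass the first row from `csv.reader` to `check_header`.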

## Step 2: Set up and train DeepMatcher model

Now we are ready to load and process the data for `deepmatcher`:

In [5]:
train, validation, test = dm.data.process(
    path='WikiQACorpus',
    train='dm_train.csv',
    validation='dm_valid.csv',
    test='dm_test.csv')


Reading and processing data from "WikiQACorpus/dm_train.csv"
0% [##############################] 100% | ETA: 00:00:00
Reading and processing data from "WikiQACorpus/dm_valid.csv"
0% [############################# ] 100% | ETA: 00:00:00
Reading and processing data from "WikiQACorpus/dm_test.csv"

Building vocabulary
0% [####################] 100% | ETA: 00:00:00
Total time elapsed: 00:00:00

Computing principal components
0% [####################] 100% | ETA: 00:00:00
Total time elapsed: 00:00:07


Next, we create a `deepmatcher` model and train it. Since this is a demo, we do not perform hyperparameter tuning: we use the default settings for everything except the `pos_neg_ratio` parameter. This must be set because there are very few "positive matches" (candidates that correctly answer the question) in this dataset. In a real application, you should tune the other model hyperparameters as well to get the best performance.
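
To get a feel for how skewed the labels are, one rough approach is to count the negatives per positive example. The sketch below does this on a toy label list; the helper `negatives_per_positive` is hypothetical and not part of `deepmatcher`. Whether the resulting number is a good value for `pos_neg_ratio` is an assumption to validate on the validation set, not something this tutorial establishes.

```python
from collections import Counter


def negatives_per_positive(labels):
    """Number of negative (0) labels for each positive (1) label."""
    counts = Counter(labels)
    if counts[1] == 0:
        raise ValueError('no positive examples in labels')
    return counts[0] / counts[1]


# Toy labels: 8 negatives and 2 positives
toy_labels = [0, 0, 0, 1, 0, 0, 0, 0, 1, 0]
ratio = negatives_per_positive(toy_labels)
print(ratio)  # 8 / 2 = 4.0
```

On the real data, the same helper could be called with `raw_train['label'].tolist()`.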

In [6]:
model = dm.MatchingModel()
model.run_train(
    train,
    validation,
    epochs=10,
    best_save_path='hybrid_model.pth',
    pos_neg_ratio=7)

* Number of trainable parameters: 2798703
===>  TRAIN Epoch 1


0% [██████████████████████████████] 100% | ETA: 00:00:00
Total time elapsed: 00:04:45


Finished Epoch 1 || Run Time:  280.9 | Load Time:    5.7 || F1:  17.99 | Prec:  13.30 | Rec:  27.79 || Ex/s:  71.05

===>  EVAL Epoch 1


0% [█████████████████] 100% | ETA: 00:00:00
Total time elapsed: 00:00:14


Finished Epoch 1 || Run Time:   14.1 | Load Time:    0.8 || F1:  37.42 | Prec:  34.12 | Rec:  41.43 || Ex/s: 183.75

* Best F1: tensor(37.4193)
Saving best model...
Done.
---------------------

===>  TRAIN Epoch 2


0% [██████████████████████████████] 100% | ETA: 00:00:00
Total time elapsed: 00:04:56


Finished Epoch 2 || Run Time:  291.6 | Load Time:    5.7 || F1:  28.32 | Prec:  18.86 | Rec:  56.83 || Ex/s:  68.47

===>  EVAL Epoch 2


0% [█████████████████] 100% | ETA: 00:00:00
Total time elapsed: 00:00:13


Finished Epoch 2 || Run Time:   12.9 | Load Time:    0.7 || F1:  32.89 | Prec:  26.16 | Rec:  44.29 || Ex/s: 200.58

---------------------

===>  TRAIN Epoch 3


0% [██████████████████████████████] 100% | ETA: 00:00:00
Total time elapsed: 00:04:43


Finished Epoch 3 || Run Time:  278.9 | Load Time:    5.4 || F1:  36.27 | Prec:  24.85 | Rec:  67.12 || Ex/s:  71.62

===>  EVAL Epoch 3


0% [█████████████████] 100% | ETA: 00:00:00
Total time elapsed: 00:00:12


Finished Epoch 3 || Run Time:   12.2 | Load Time:    0.7 || F1:  27.66 | Prec:  22.03 | Rec:  37.14 || Ex/s: 212.62

---------------------

===>  TRAIN Epoch 4


0% [██████████████████████████████] 100% | ETA: 00:00:00
Total time elapsed: 00:04:39


Finished Epoch 4 || Run Time:  275.1 | Load Time:    5.7 || F1:  47.29 | Prec:  33.83 | Rec:  78.56 || Ex/s:  72.52

===>  EVAL Epoch 4


0% [█████████████████] 100% | ETA: 00:00:00
Total time elapsed: 00:00:11


Finished Epoch 4 || Run Time:   11.0 | Load Time:    0.6 || F1:  27.94 | Prec:  21.27 | Rec:  40.71 || Ex/s: 234.33

---------------------

===>  TRAIN Epoch 5


0% [██████████████████████████████] 100% | ETA: 00:00:00
Total time elapsed: 00:04:40


Finished Epoch 5 || Run Time:  276.5 | Load Time:    5.7 || F1:  60.03 | Prec:  46.85 | Rec:  83.56 || Ex/s:  72.16

===>  EVAL Epoch 5


0% [█████████████████] 100% | ETA: 00:00:00
Total time elapsed: 00:00:12


Finished Epoch 5 || Run Time:   11.9 | Load Time:    0.7 || F1:  25.47 | Prec:  22.53 | Rec:  29.29 || Ex/s: 216.15

---------------------

===>  TRAIN Epoch 6


0% [██████████████████████████████] 100% | ETA: 00:00:00
Total time elapsed: 00:04:38


Finished Epoch 6 || Run Time:  274.5 | Load Time:    5.5 || F1:  72.57 | Prec:  62.58 | Rec:  86.35 || Ex/s:  72.71

===>  EVAL Epoch 6


0% [█████████████████] 100% | ETA: 00:00:00
Total time elapsed: 00:00:12


Finished Epoch 6 || Run Time:   11.7 | Load Time:    0.7 || F1:  26.18 | Prec:  21.46 | Rec:  33.57 || Ex/s: 219.82

---------------------

===>  TRAIN Epoch 7


0% [██████████████████████████████] 100% | ETA: 00:00:00
Total time elapsed: 00:04:42


Finished Epoch 7 || Run Time:  278.7 | Load Time:    5.5 || F1:  80.47 | Prec:  73.34 | Rec:  89.13 || Ex/s:  71.64

===>  EVAL Epoch 7


0% [█████████████████] 100% | ETA: 00:00:00
Total time elapsed: 00:00:12


Finished Epoch 7 || Run Time:   11.9 | Load Time:    0.7 || F1:  30.23 | Prec:  27.49 | Rec:  33.57 || Ex/s: 217.26

---------------------

===>  TRAIN Epoch 8


0% [██████████████████████████████] 100% | ETA: 00:00:00
Total time elapsed: 00:04:40


Finished Epoch 8 || Run Time:  276.3 | Load Time:    5.6 || F1:  86.29 | Prec:  81.99 | Rec:  91.06 || Ex/s:  72.22

===>  EVAL Epoch 8


0% [█████████████████] 100% | ETA: 00:00:00
Total time elapsed: 00:00:12


Finished Epoch 8 || Run Time:   11.9 | Load Time:    0.7 || F1:  29.82 | Prec:  30.37 | Rec:  29.29 || Ex/s: 217.92

---------------------

===>  TRAIN Epoch 9


0% [██████████████████████████████] 100% | ETA: 00:00:00
Total time elapsed: 00:04:43


Finished Epoch 9 || Run Time:  278.9 | Load Time:    5.6 || F1:  91.06 | Prec:  90.11 | Rec:  92.02 || Ex/s:  71.58

===>  EVAL Epoch 9


0% [█████████████████] 100% | ETA: 00:00:00
Total time elapsed: 00:00:12


Finished Epoch 9 || Run Time:   11.9 | Load Time:    0.7 || F1:  26.95 | Prec:  26.76 | Rec:  27.14 || Ex/s: 216.89

---------------------

===>  TRAIN Epoch 10


0% [██████████████████████████████] 100% | ETA: 00:00:00
Total time elapsed: 00:04:44


Finished Epoch 10 || Run Time:  280.5 | Load Time:    5.6 || F1:  93.96 | Prec:  95.17 | Rec:  92.79 || Ex/s:  71.16

===>  EVAL Epoch 10


0% [█████████████████] 100% | ETA: 00:00:00
Total time elapsed: 00:00:12


Finished Epoch 10 || Run Time:   11.9 | Load Time:    0.7 || F1:  23.97 | Prec:  23.03 | Rec:  25.00 || Ex/s: 217.11

---------------------

Loading best model...
Training done.


tensor(37.4193)

Now that we have a trained model, we obtain the predictions for the test data. Note that `deepmatcher` computes F1, precision, and recall by default, but these may not be the best evaluation metrics for your end task. For instance, in Question Answering, the more relevant metrics are MAP and MRR, which we will compute in the next step.

In [7]:
predictions = model.run_prediction(test, output_attributes=True)

===>  PREDICT Epoch 1


0% [██████████████████████████████] 100% | ETA: 00:00:00
Total time elapsed: 00:00:27


Finished Epoch 1 || Run Time:   26.5 | Load Time:    1.5 || F1:  28.90 | Prec:  24.88 | Rec:  34.47 || Ex/s: 220.45



## Step 3: Evaluate model using QA eval metrics

Finally, we compute the Mean Average Precision (MAP) and Mean Reciprocal Rank (MRR) using the model's predictions on the test set. Following the approach of the [paper that introduced this dataset](https://aclweb.org/anthology/D15-1237), questions in the test set without answers are ignored when computing these metrics.
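
For reference, these metrics are commonly defined as follows. The notation is introduced here for illustration and does not come from the tutorial: $Q$ is the set of test questions with at least one correct answer, $\mathrm{rank}_q$ is the rank of the highest-scoring correct answer for question $q$, $R_q$ is the set of correct answers for $q$, and $\mathrm{rel}_q(k)$ is 1 if the candidate at rank $k$ is correct and 0 otherwise.

```latex
\mathrm{MRR} = \frac{1}{|Q|} \sum_{q \in Q} \frac{1}{\mathrm{rank}_q}
\qquad
\mathrm{AP}(q) = \frac{1}{|R_q|} \sum_{k \,:\, \mathrm{rel}_q(k) = 1} \frac{\sum_{j \le k} \mathrm{rel}_q(j)}{k}
\qquad
\mathrm{MAP} = \frac{1}{|Q|} \sum_{q \in Q} \mathrm{AP}(q)
```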

In [8]:
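MAP, MRR = 0, 0

# Group candidate answers by question
grouped = predictions.groupby('left_value')
num_questions = 0
for question, answers in grouped:
    # Rank candidates from highest to lowest match score
    sorted_answers = answers.sort_values('match_score', ascending=False)

    # p: correct answers seen so far; ap: running sum of precision values
    p, ap = 0, 0
    top_answer_found = False
    for idx, answer in enumerate(sorted_answers.itertuples()):
        if answer.label == 1:
            if not top_answer_found:
                # Reciprocal rank of the first correct answer
                MRR += 1 / (idx + 1)
                top_answer_found = True
            p += 1
            ap += p / (idx + 1)  # precision at this rank

    # Questions without any correct answer are not counted
    if p > 0:
        ap /= p
        num_questions += 1
    MAP += ap

MAP /= num_questions
MRR /= num_questions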

print('MAP:', MAP)
print('MRR:', MRR)

MAP: 0.6570252723620554
MRR: 0.6691690413731083
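
To convince yourself that the computation behaves as intended, it helps to run it on a toy example small enough to check by hand. The sketch below is a standalone, pure-Python variant of the same logic; the function name `map_mrr` and the `(question, score, label)` row layout are choices made for this illustration and are not part of `deepmatcher`.

```python
from collections import defaultdict


def map_mrr(rows):
    """Compute (MAP, MRR) from (question, score, label) rows.

    Questions with no correct answer (label 1) are ignored.
    """
    by_question = defaultdict(list)
    for question, score, label in rows:
        by_question[question].append((score, label))

    ap_sum, rr_sum, num_questions = 0.0, 0.0, 0
    for candidates in by_question.values():
        # Rank candidates from highest to lowest score
        ranked = sorted(candidates, key=lambda c: c[0], reverse=True)
        hits, ap, first_hit_rank = 0, 0.0, None
        for rank, (_, label) in enumerate(ranked, start=1):
            if label == 1:
                hits += 1
                ap += hits / rank  # precision at this rank
                if first_hit_rank is None:
                    first_hit_rank = rank
        if hits == 0:
            continue  # question has no correct answer
        ap_sum += ap / hits
        rr_sum += 1 / first_hit_rank
        num_questions += 1
    return ap_sum / num_questions, rr_sum / num_questions


toy_rows = [
    # q1: correct answers end up at ranks 2 and 3 -> AP = (1/2 + 2/3) / 2 = 7/12
    ('q1', 0.9, 0), ('q1', 0.8, 1), ('q1', 0.1, 1),
    # q2: the correct answer ends up at rank 2 -> AP = 1/2
    ('q2', 0.3, 1), ('q2', 0.7, 0),
    # q3: no correct answer, so it is ignored
    ('q3', 0.6, 0), ('q3', 0.4, 0),
]
toy_map, toy_mrr = map_mrr(toy_rows)
print('MAP:', toy_map)  # (7/12 + 1/2) / 2 = 13/24
print('MRR:', toy_mrr)  # (1/2 + 1/2) / 2 = 0.5
```

The same function should accept the test predictions via `predictions[['left_value', 'match_score', 'label']].itertuples(index=False)`, although candidates with tied scores may be ordered differently than in the pandas version above.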
