## Dense IR ##

In this notebook, we show how to train a model, index data, and run search with Dense Passage Retrieval (DPR), a neural IR technique described in Karpukhin et al., ["Dense Passage Retrieval for Open-Domain Question Answering"](https://arxiv.org/pdf/2004.04906.pdf).

In order to run the cells (almost) instantaneously, we use a trivially small training set and search collection.


## Dependencies
If not already done, make sure to install PrimeQA with notebooks extras before getting started.

## Configuration

First, we need to include the required modules.


In [1]:
import argparse
import os
import sys

import tempfile
from unittest.mock import patch

from primeqa.ir.dense.dpr_top.dpr.biencoder_trainer import BiEncoderTrainer
from primeqa.ir.dense.dpr_top.dpr.index_simple_corpus import DPRIndexer
from primeqa.ir.dense.dpr_top.dpr.searcher import DPRSearcher

No CUDA runtime is found, using CUDA_HOME='/opt/share/cuda-10.1/x86_64'
{"time":"2022-10-26 15:42:31,560", "name": "faiss.loader", "level": "INFO", "message": "Loading faiss."}
{"time":"2022-10-26 15:42:32,053", "name": "faiss.loader", "level": "INFO", "message": "Successfully loaded faiss."}


## Training
We will train a DPR model using a TSV file containing [query, positive document, negative document] triples. The data is a small subset of the XOR-TyDi dataset, described [here](https://nlp.cs.washington.edu/xorqa/).

The path in `test_files_location` below points to the location of the files used by the notebook. By default, it points to the files used by CI testing.

In [2]:
test_files_location = '../../../tests/resources/ir_dense'
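# mkdtemp() creates a temporary directory that is not removed automatically,
# so the later cells can keep writing to it
working_dir = tempfile.mkdtemp()
output_dir = os.path.join(working_dir, 'output_dir')
os.makedirs(output_dir, exist_ok=True)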
print(output_dir)
text_triples_fn = os.path.join(test_files_location, "xorqa.train_ir_negs_5_poss_1_001pct_at_0pct_en.tsv")

/tmp/tmpxyvdwant/output_dir


Here is an example of a training file record:

In [3]:
import pandas as pd
from IPython.display import display, HTML
data = pd.read_csv(text_triples_fn, sep='\t', nrows=1, header=None)
display(HTML(data.to_html()))

| | 0 | 1 | 2 |
|---|---|---|---|
| 0 | Who maintained the throne for the longest time in China? | Kangxi Emperor The Kangxi Emperor's reign of 61 years makes him the longest-reigning emperor in Chinese history (although his grandson, the Qianlong Emperor, had the longest period of "de facto" power) and one of the longest-reigning rulers in the world. However, since he ascended the throne at the age of seven, actual power was held for six years by four regents and his grandmother, the Grand Empress Dowager Xiaozhuang. | Chiddy Bang new songs from the duo and in November 2009 debuted the group's first free mixtape entitled "The Swelly Express". On 28 April 2011 during the first-ever MTV O Music Awards, Anamege broke the Guinness World Record for Longest Freestyle Rap and Longest Marathon Rapping Record by freestyling for 9 hours, 18 minutes, and 22 seconds, stealing the throne from rapper M-Eighty, who originally broke the record in 2009 by rapping for 9 hours, 15 minutes and 15 seconds. Anamege had also beat Canadian rapper D.O. for Longest Marathon Rapping session, the previous record being for 8 hours and 45 minutes. |
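
A triples file of this shape can be produced with the standard `csv` module. The sketch below is only an illustration: the helper `write_triples`, the file name `toy_triples.tsv`, and the example texts are hypothetical. It assumes one tab-separated [query, positive, negative] row per line with no header, as in the record above. The quoting rules the trainer expects are not confirmed here.

```python
import csv
import os
import tempfile


def write_triples(path, triples):
    """Write (query, positive, negative) tuples as header-less TSV rows."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f, delimiter="\t")
        for query, positive, negative in triples:
            writer.writerow([query, positive, negative])


# Hypothetical toy data: each passage is written as "<title> <text>".
toy_triples = [
    (
        "Who wrote Hamlet?",
        "Hamlet Hamlet is a tragedy written by William Shakespeare.",
        "Macbeth (2015 film) Macbeth is a 2015 film directed by Justin Kurzel.",
    ),
]

toy_triples_fn = os.path.join(tempfile.mkdtemp(), "toy_triples.tsv")
write_triples(toy_triples_fn, toy_triples)

# Read the file back to check that each row has exactly three columns.
with open(toy_triples_fn, newline="", encoding="utf-8") as f:
    toy_rows = list(csv.reader(f, delimiter="\t"))
print(len(toy_rows), len(toy_rows[0]))
```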


Here are the parameters of the training, corresponding to the command line arguments used with the top-level script (`run_ir.py`):

In [4]:
model_training_args = [
    "prog",
    "--train_dir", text_triples_fn,
    "--output_dir", output_dir,
    "--full_train_batch_size", "1",
    "--num_train_epochs", "1",
    "--training_data_type", "text_triples"]

Next we run the training:

In [5]:
with patch.object(sys, 'argv', model_training_args):
    trainer = BiEncoderTrainer()
    trainer.train()

{"time":"2022-10-26 15:42:37,396", "name": "primeqa.ir.dense.dpr_top.torch_util.hypers_base", "level": "INFO", "message": "world_rank 0 cuda_is_available False cuda_device_cnt 0 on cccxl009, CUDA_VISIBLE_DEVICES = NOT SET"}


10/26/2022 15:42:37 hypers_base.py:157 - On cccxl009, Process rank: -1, device: cpu, n_gpu: 0, distributed training: False, 16-bits training: False
10/26/2022 15:42:37 hypers_base.py:166 - hypers:
{
  "local_rank": -1,
  "global_rank": 0,
  "world_size": 1,
  "model_type": "",
  "model_name_or_path": "",
  "resume_from": "",
  "config_name": "",
  "tokenizer_name": "",
  "cache_dir": "",
  "do_lower_case": false,
  "gradient_accumulation_steps": 1,
  "learning_rate": 2e-5,
  "weight_decay": 0.0,
  "adam_epsilon": 1e-8,
  "max_grad_norm": 2.0,
  "warmup_instances": 0,
  "warmup_fraction": 0.0,
  "num_train_epochs": 1,
  "no_cuda": false,
  "n_gpu": 0,
  "seed": 42,
  "fp16": false,
  "fp16_opt_level": "O1",
  "full_train_batch_size": 1,
  "per_gpu_eval_batch_size": 8,
  "output_dir": "\/tmp\/tmpxyvdwant\/output_dir",
  "log_on_all_nodes": false,
  "server_ip": "",
  "server_port": "",
  "qry_encoder_name_or_path": "facebook\/dpr-question_encoder-multiset-base",
  "ctx_encoder_name_or_pa

10/26/2022 15:44:07 reporting.py:153 - loss = 0.016342264413827936
10/26/2022 15:44:07 reporting.py:153 - accuracy = 1.0
10/26/2022 15:44:07 biencoder_trainer.py:182 - saving to /tmp/tmpxyvdwant/output_dir
10/26/2022 15:44:14 biencoder_trainer.py:184 - Took 1.4 minutes
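
For orientation, Karpukhin et al. train the bi-encoder by minimizing the negative log-likelihood of the positive passage, with similarity measured as the dot product of the query and passage vectors. The sketch below computes that quantity in plain Python on made-up two-dimensional vectors. It illustrates the formula from the paper only; whether the `loss` value logged above is computed in exactly this way is an assumption, and `dpr_nll` is a hypothetical helper name.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def dpr_nll(query_vec, positive_vec, negative_vecs):
    """Negative log-likelihood of the positive passage under a softmax over
    dot-product scores of the positive and all negative passages."""
    scores = [dot(query_vec, positive_vec)]
    scores += [dot(query_vec, neg) for neg in negative_vecs]
    # Log-sum-exp with the maximum subtracted for numerical stability.
    top = max(scores)
    log_sum = top + math.log(sum(math.exp(s - top) for s in scores))
    return log_sum - scores[0]


# Made-up vectors: the positive passage is aligned with the query,
# the negative passage is orthogonal to it.
toy_query = [1.0, 0.0]
toy_positive = [2.0, 0.0]
toy_negatives = [[0.0, 1.0]]

toy_loss = dpr_nll(toy_query, toy_positive, toy_negatives)
print(toy_loss)
```

The loss shrinks toward zero as the positive passage's score grows relative to the negatives' scores.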


## Indexing
Next, we will index a collection of documents, using the model representations from the previous step.
The collection is a TSV file containing each document's ID, text, and title.

In [6]:
collection_fn = os.path.join(test_files_location, "xorqa.train_ir_001pct_at_0_pct_collection_fornum.tsv")

Here is an example document:

In [7]:
data = pd.read_csv(collection_fn, sep='\t', header=0, nrows=1)
display(HTML(data.to_html()))

| | id | text | title |
|---|---|---|---|
| 0 | 1 | The Kangxi Emperor's reign of 61 years makes him the longest-reigning emperor in Chinese history (although his grandson, the Qianlong Emperor, had the longest period of "de facto" power) and one of the longest-reigning rulers in the world. However, since he ascended the throne at the age of seven, actual power was held for six years by four regents and his grandmother, the Grand Empress Dowager Xiaozhuang. | Kangxi Emperor |
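
A collection file with this layout can be written with `csv.DictWriter`. This is a minimal sketch, assuming a header row with the columns `id`, `text`, and `title` as in the record above; the helper `write_collection`, the file name `toy_collection.tsv`, and the document are hypothetical.

```python
import csv
import os
import tempfile

COLLECTION_COLUMNS = ["id", "text", "title"]  # column order taken from the record above


def write_collection(path, documents):
    """Write documents (dicts with id, text, title) as a TSV file with a header."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=COLLECTION_COLUMNS, delimiter="\t")
        writer.writeheader()
        writer.writerows(documents)


# Hypothetical one-document collection.
toy_documents = [
    {"id": 1, "text": "Hamlet is a tragedy written by William Shakespeare.", "title": "Hamlet"},
]

toy_collection_fn = os.path.join(tempfile.mkdtemp(), "toy_collection.tsv")
write_collection(toy_collection_fn, toy_documents)

# Read the file back; csv returns every value as a string.
with open(toy_collection_fn, newline="", encoding="utf-8") as f:
    toy_collection = list(csv.DictReader(f, delimiter="\t"))
print(toy_collection[0]["title"])
```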


Here are the indexer arguments:

In [8]:
indexing_args = [
    "prog",
    "--dpr_ctx_encoder_path", os.path.join(output_dir, "ctx_encoder"),
    "--embed", "1of1",
    "--sharded_index",
    "--batch_size", "1",
    "--corpus", collection_fn,
    "--output_dir", output_dir]

Next we run the indexing:

In [9]:
with patch.object(sys, 'argv', indexing_args):
    indexer = DPRIndexer()
    indexer.index()

10/26/2022 15:44:30 index_simple_corpus.py:107 - wrote passages_1_of_1.json.gz.records in 12 seconds
10/26/2022 15:44:30 faiss_index.py:70 - building index, reading data from /tmp/tmpxyvdwant/output_dir/passages_1_of_1.json.gz.records, writing to /tmp/tmpxyvdwant/output_dir/index_1_of_1.faiss
10/26/2022 15:44:30 faiss_index.py:138 - processed 0 passages
10/26/2022 15:44:30 faiss_index.py:131 - calling index.add with 6 vectors
10/26/2022 15:44:30 faiss_index.py:150 - processed 6 passages
10/26/2022 15:44:30 faiss_index.py:151 - finished building index, writing index file to /tmp/tmpxyvdwant/output_dir/index_1_of_1.faiss
10/26/2022 15:44:30 faiss_index.py:154 - took 0 seconds


The resulting index files are in `output_dir`.
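
To see which files were written, the directory can be listed and filtered by suffix. The sketch below is an illustration on a throwaway directory: the helper `find_index_files` is hypothetical, and the suffixes `.faiss` and `.records` are taken from the file names in the log output above.

```python
import os
import tempfile


def find_index_files(directory, suffixes=(".faiss", ".records")):
    """Return the sorted file names in `directory` that end with one of `suffixes`."""
    return sorted(name for name in os.listdir(directory) if name.endswith(suffixes))


# Demonstration directory with empty placeholder files.
demo_dir = tempfile.mkdtemp()
for name in ("index_1_of_1.faiss", "passages_1_of_1.json.gz.records", "notes.txt"):
    open(os.path.join(demo_dir, name), "w").close()

found = find_index_files(demo_dir)
print(found)
```

In the notebook, calling `find_index_files(output_dir)` would list the files produced by the indexing cell.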

## Search
The easiest way to test the trained model is to use the searcher in a "query list" mode, where the searcher's search function is called with a list of queries as an argument.
First, we initialize the searcher, pointing to the model we have trained, and the document index:

In [10]:
search_args = [
    "prog",
    "--model_name_or_path", os.path.join(output_dir, "qry_encoder"),
    "--index_location", output_dir,
    "--output_dir", output_dir]  

with patch.object(sys, 'argv', search_args):
    searcher = DPRSearcher()

10/26/2022 15:44:32 searcher.py:66 - Using default tokenizer in facebook/dpr-question_encoder-multiset-base.  If that is not what you want, specify the tokenizer in '--qry_tokenizer_path' argument.
10/26/2022 15:44:32 searcher.py:82 - Using sharded faiss, reading shards from /tmp/tmpxyvdwant/output_dir
10/26/2022 15:44:32 searcher.py:86 - Reading passages_1_of_1.json.gz.records
10/26/2022 15:44:32 searcher.py:91 - Using sharded faiss with 1 shards.
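
The trainer, indexer, and searcher cells all pass their arguments by temporarily replacing `sys.argv`; the leading `"prog"` element stands in for the program name that normally occupies `sys.argv[0]`. The pattern can be sketched with a small argparse-based class. `ToyTrainer` and its two flags are hypothetical stand-ins and do not reproduce PrimeQA's actual argument handling.

```python
import argparse
import sys
from unittest.mock import patch


class ToyTrainer:
    """Hypothetical class that, like a command-line script, parses sys.argv itself."""

    def __init__(self):
        parser = argparse.ArgumentParser()
        parser.add_argument("--output_dir", required=True)
        parser.add_argument("--num_train_epochs", type=int, default=1)
        # With no explicit argument list, parse_args() reads sys.argv[1:].
        self.args = parser.parse_args()


toy_args = ["prog", "--output_dir", "/tmp/toy", "--num_train_epochs", "2"]
original_argv = list(sys.argv)

# sys.argv is replaced only inside the with block.
with patch.object(sys, "argv", toy_args):
    toy_trainer = ToyTrainer()

argv_restored = sys.argv == original_argv
print(toy_trainer.args.output_dir, toy_trainer.args.num_train_epochs)
```

Because the patch is undone when the block exits, each cell can supply its own argument list without affecting the others.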


Next, we run search for a query entered here as a one-element list:

In [11]:
query_batch = ['Who maintained the throne for the longest time in China?']
retrieved_doc_ids, passages = searcher.search(query_batch=query_batch, top_k=1, mode='query_list')

Here are the retrieved results:

In [12]:
import json
print(json.dumps(passages, indent=4))

[
    {
        "titles": [
            "Kangxi Emperor"
        ],
        "texts": [
            "The Kangxi Emperor's reign of 61 years makes him the longest-reigning emperor in Chinese history (although his grandson, the Qianlong Emperor, had the longest period of \"de facto\" power) and one of the longest-reigning rulers in the world. However, since he ascended the throne at the age of seven, actual power was held for six years by four regents and his grandmother, the Grand Empress Dowager Xiaozhuang."
        ],
        "scores": [
            91.072509765625
        ]
    }
]
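
The result holds one entry per query, each with parallel `titles`, `texts`, and `scores` lists. A small helper can flatten this into one row per hit. This is a sketch on sample data shaped like the output above (with the passage text shortened); `summarize_results` is a hypothetical helper, and it assumes the lists are ordered from best to worst hit.

```python
def summarize_results(queries, results):
    """Return (query, rank, title, score) tuples, one per retrieved passage."""
    rows = []
    for query, result in zip(queries, results):
        hits = zip(result["titles"], result["scores"])
        # Ranks start at 1 and follow the order of the returned lists.
        for rank, (title, score) in enumerate(hits, start=1):
            rows.append((query, rank, title, score))
    return rows


# Sample data shaped like the search output; the text is shortened here.
sample_queries = ["Who maintained the throne for the longest time in China?"]
sample_results = [
    {
        "titles": ["Kangxi Emperor"],
        "texts": ["The Kangxi Emperor's reign of 61 years makes him the longest-reigning emperor in Chinese history"],
        "scores": [91.072509765625],
    }
]

summary = summarize_results(sample_queries, sample_results)
for query, rank, title, score in summary:
    print(f"{rank}. {title} ({score:.2f}) <- {query}")
```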


Next, we use the trained model and the index to search the collection, reading queries from a TSV query file and saving the search results in a JSON file.

In [13]:
queries_fn = os.path.join(test_files_location, "xorqa.train_ir_001pct_at_0_pct_queries_fornum_en.tsv")

Here are the search arguments:

In [14]:
search_args = [
    "prog",
    "--queries", queries_fn,
    "--model_name_or_path", os.path.join(output_dir, "qry_encoder"),
    "--bsize", "1",
    "--index_location", output_dir,
    "--output_dir", output_dir,
    "--top_k", "1"]  

Next, we run the search:

In [15]:
with patch.object(sys, 'argv', search_args):
    searcher = DPRSearcher()
    searcher.search()

10/26/2022 15:44:35 searcher.py:66 - Using default tokenizer in facebook/dpr-question_encoder-multiset-base.  If that is not what you want, specify the tokenizer in '--qry_tokenizer_path' argument.
10/26/2022 15:44:36 searcher.py:82 - Using sharded faiss, reading shards from /tmp/tmpxyvdwant/output_dir
10/26/2022 15:44:36 searcher.py:86 - Reading passages_1_of_1.json.gz.records
10/26/2022 15:44:36 searcher.py:91 - Using sharded faiss with 1 shards.
10/26/2022 15:44:37 searcher.py:247 - Finished instance 1, 0.2711329205524996 per second.


The search output is written in TSV format to the `ranked_passages.tsv` file in `output_dir`. Each row contains a query ID, a document ID, a rank, and a score.
Here are the actual values:

In [16]:
with open(os.path.join(output_dir, "ranked_passages.tsv"), 'r') as f:
    for line in f:
        print(line.rstrip())

-7239279093922981232	1	0	91.072509765625
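
For further processing, the rows can be parsed into typed records. The sketch below reads the row shown above from an in-memory buffer. The column names in `RANKING_COLUMNS` and the helper `parse_rankings` are assumptions that follow the column order given in the description; the file itself has no header.

```python
import csv
import io

# Assumed column names, following the order: query ID, document ID, rank, score.
RANKING_COLUMNS = ["query_id", "doc_id", "rank", "score"]


def parse_rankings(handle):
    """Parse header-less ranking rows from a file-like object into dicts."""
    rows = []
    for record in csv.reader(handle, delimiter="\t"):
        if not record:
            continue  # skip blank lines
        row = dict(zip(RANKING_COLUMNS, record))
        row["rank"] = int(row["rank"])
        row["score"] = float(row["score"])
        rows.append(row)
    return rows


# The single row printed above, held in memory instead of a file.
sample_tsv = io.StringIO("-7239279093922981232\t1\t0\t91.072509765625\n")
rankings = parse_rankings(sample_tsv)
print(rankings[0]["doc_id"], rankings[0]["score"])
```

With the real output, the same function can be applied to the handle returned by `open(os.path.join(output_dir, "ranked_passages.tsv"), newline="")`.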
