# Tutorial: Build a search index using DPR #

In this tutorial, we will learn how to build a neural search index over a document collection. The approach shown here is dense retrieval, as in Dense Passage Retrieval (DPR), described in Karpukhin et al., "Dense Passage Retrieval for Open-Domain Question Answering" [here](https://arxiv.org/pdf/2004.04906.pdf). The code uses PrimeQA's ColBERT indexer and retriever.

To keep this tutorial easy to follow, we show the steps using a very small document collection. The same technique scales to millions of documents: we have tested it on up to 21 million Wikipedia passages.


## Preparing a Colab Environment to run this tutorial ##

Make sure to enable the GPU runtime: in the Colab menu, choose Runtime > Change runtime type and select GPU as the hardware accelerator.
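
To confirm that the runtime actually exposes a GPU, a quick check like the following can help. It is a minimal sketch that assumes an NVIDIA GPU and looks for the nvidia-smi tool on the path; the helper name gpu_visible is hypothetical and not part of PrimeQA.

```python
import shutil
import subprocess


def gpu_visible():
    """Return True if nvidia-smi is installed and lists at least one GPU."""
    exe = shutil.which("nvidia-smi")
    if exe is None:
        # The tool is not on the path, so no NVIDIA driver is visible.
        return False
    try:
        # "nvidia-smi -L" prints one line per detected GPU.
        result = subprocess.run(
            [exe, "-L"], capture_output=True, text=True, timeout=30
        )
    except (OSError, subprocess.SubprocessError):
        return False
    return result.returncode == 0 and "GPU" in result.stdout


print("GPU visible:", gpu_visible())
```

If this prints False in Colab, the runtime type is most likely still set to CPU.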

## Installing PrimeQA

First, we install the required packages.


In [1]:
%%bash

pip install --upgrade pip
pip install primeqa



## Pre-processing your document collection

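The indexer expects the pre-processed collection as a tab-separated (.tsv) file; its path is passed to the indexer in a later step.

TODO: add steps here to ingest the sample Wikipedia documents.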
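
As a rough illustration of this pre-processing step, the sketch below writes a small collection to a tab-separated file. The column layout (id, text, title, with a header row) is an assumption and should be checked against the PrimeQA documentation; the sample passages, the write_collection helper, and the output file name are all hypothetical.

```python
import tempfile
from pathlib import Path

# Hypothetical sample passages; replace with your own collection.
documents = [
    {"title": "Example title A", "text": "Text of the first passage."},
    {"title": "Example title B", "text": "Text of the\tsecond passage,\nspread over two lines."},
]


def write_collection(docs, path):
    """Write passages as tab-separated rows: id, text, title (assumed layout)."""
    with open(path, "w", encoding="utf-8") as f:
        f.write("id\ttext\ttitle\n")
        for i, doc in enumerate(docs, start=1):
            # Tabs and newlines inside a field would break the row layout,
            # so collapse every run of whitespace into a single space.
            text = " ".join(doc["text"].split())
            title = " ".join(doc["title"].split())
            f.write(f"{i}\t{text}\t{title}\n")
    return path


# Write to a temporary directory so the example leaves no files behind
# in the working directory.
out_path = Path(tempfile.mkdtemp()) / "sample-collection.tsv"
write_collection(documents, out_path)
print(out_path.read_text(encoding="utf-8"))
```

The resulting file path is what would be passed to the indexer when indexing the documents.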

In [2]:
from primeqa.components.indexer.dense import ColBERTIndexer 

No CUDA runtime is found, using CUDA_HOME='/opt/share/cuda-11.1/x86_64'
{"time":"2023-06-01 07:33:32,811", "name": "faiss.loader", "level": "INFO", "message": "Loading faiss."}
{"time":"2023-06-01 07:33:32,831", "name": "faiss.loader", "level": "INFO", "message": "Successfully loaded faiss."}


## Initializing the Indexer

We initialize a ColBERT indexer, which indexes the embeddings created for each document (passage) in the collection. It takes a passage embedding model (currently passed through the checkpoint argument), which creates the embedding vectors, and a vector_db specification, which determines where the embedding vectors are stored for later search.

In [3]:
# Replace the checkpoint path with the location of your own ColBERT model.
# TODO: the checkpoint argument is to be renamed to passage_embedding_model.
indexer = ColBERTIndexer(
    checkpoint="/dccstor/colbert-ir/bsiyer/PQLL/experiments/xor_squad_04182023/2023-04/22/17.23.31/checkpoints/colbert.dnn.batch_17524.model",
    vector_db="FAISS",
)

In [5]:
# Change this path to the location of the pre-processed file from the pre-processing step above.
indexer.index_documents("/dccstor/irl-tableqa/jaydeep/sample-document-store2.tsv")



[Jun 01, 07:34:26] #> Creating directory index_root/index_name 


#> Starting...
No CUDA runtime is found, using CUDA_HOME='/opt/share/cuda-11.1/x86_64'
{"time":"2023-06-01 07:34:29,172", "name": "faiss.loader", "level": "INFO", "message": "Loading faiss."}
{"time":"2023-06-01 07:34:29,266", "name": "faiss.loader", "level": "INFO", "message": "Successfully loaded faiss."}
{
    "ncells": null,
    "centroid_score_threshold": null,
    "ndocs": null,
    "index_path": "index_root\/index_name",
    "index_location": null,
    "nbits": 1,
    "kmeans_niters": 4,
    "num_partitions_max": 10000000,
    "similarity": "cosine",
    "bsize": 64,
    "accumsteps": 1,
    "lr": 3e-6,
    "maxsteps": 500000,
    "save_every": null,
    "resume": false,
    "resume_optimizer": false,
    "warmup": null,
    "warmup_bert": null,
    "relu": false,
    "nway": 2,
    "use_ib_negatives": false,
    "reranker": false,
    "distillation_alpha": 1.0,
    "ignore_scores": false,
    "shuffle_every_epo



[Jun 01, 07:34:49] [0] 		 # of sampled PIDs = 11 	 sampled_pids[:3] = [6, 0, 4]
[Jun 01, 07:34:49] [0] 		 #> Encoding 11 passages..
[Jun 01, 07:34:49] #> checkpoint, docFromText, Input: title | text, 		 64
[Jun 01, 07:34:49] #> Roberta DocTokenizer.tensorize(batch_text[0], batch_background[0], bsize) ==
[Jun 01, 07:34:49] #> Input: $ title | text, 		 64
[Jun 01, 07:34:49] #> Output IDs: torch.Size([159]), tensor([    0, 50262,  1270,  1721,  2788,     2,     1,     1,     1,     1,
            1,     1,     1,     1,     1,     1,     1,     1,     1,     1,
            1,     1,     1,     1,     1,     1,     1,     1,     1,     1,
            1,     1,     1,     1,     1,     1,     1,     1,     1,     1,
            1,     1,     1,     1,     1,     1,     1,     1,     1,     1,
            1,     1,     1,     1,     1,     1,     1,     1,     1,     1,
            1,     1,     1,     1,     1,     1,     1,     1,     1,     1,
            1,     1,     1,     1,     1,   

0it [00:00, ?it/s]

  Iteration 3 (0.12 s, search 0.10 s): objective=80.5059 imbalance=3.782 nsplit=0        
[0.018, 0.015, 0.016, 0.016, 0.017, 0.017, 0.017, 0.014, 0.015, 0.019, 0.015, 0.016, 0.015, 0.019, 0.018, 0.017, 0.017, 0.018, 0.014, 0.015, 0.015, 0.018, 0.015, 0.022, 0.013, 0.015, 0.013, 0.016, 0.016, 0.015, 0.022, 0.014, 0.018, 0.017, 0.02, 0.017, 0.019, 0.014, 0.02, 0.016, 0.018, 0.017, 0.018, 0.022, 0.016, 0.014, 0.016, 0.015, 0.015, 0.018, 0.016, 0.015, 0.017, 0.015, 0.02, 0.015, 0.017, 0.018, 0.017, 0.014, 0.017, 0.016, 0.014, 0.019, 0.016, 0.016, 0.017, 0.02, 0.015, 0.015, 0.018, 0.015, 0.014, 0.017, 0.017, 0.018, 0.018, 0.018, 0.017, 0.015, 0.012, 0.02, 0.016, 0.019, 0.014, 0.019, 0.019, 0.018, 0.014, 0.019, 0.021, 0.02, 0.016, 0.016, 0.016, 0.015, 0.019, 0.019, 0.016, 0.016, 0.017, 0.015, 0.018, 0.014, 0.015, 0.018, 0.019, 0.013, 0.018, 0.016, 0.017, 0.016, 0.017, 0.017, 0.021, 0.013, 0.016, 0.02, 0.018, 0.017, 0.012, 0.019, 0.015, 0.02, 0.019, 0.018, 0.018, 0.017]
[Jun 01, 07:34:55] #>

1it [00:04,  4.96s/it]
100%|██████████| 256/256 [00:00<00:00, 88989.05it/s]


#> Joined...


## Initializing the Retriever

We initialize a ColBERT retriever to search the indexed document collection. Note: because documents are retrieved based on questions, the questions must be embedded too, so the retriever also needs an embedding model checkpoint.

In [None]:
from primeqa.components.retriever.dense import ColBERTRetriever

# retriever = ColBERTRetriever(indexer=indexer,
#                       query_embedding_model = "/dccstor/colbert-ir/bsiyer/PQLL/experiments/xor_squad_04182023/2023-04/22/17.23.31/checkpoints/colbert.dnn.batch_17524.model",
#                       use_gpu=True, embed_title=True)

# The checkpoint is used to embed the questions; replace the path with your own model.
retriever = ColBERTRetriever(
    indexer=indexer,
    checkpoint="/dccstor/colbert-ir/bsiyer/PQLL/experiments/xor_squad_04182023/2023-04/22/17.23.31/checkpoints/colbert.dnn.batch_17524.model",
)
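
To make the roles of the indexer and the retriever concrete, here is a toy sketch of dense retrieval in plain Python. It is not the ColBERT or DPR implementation: the embed function is a hypothetical stand-in (a hashed bag of words) for a neural encoder, and build_index and search are illustrative names only. What it shows is that passages and questions must be embedded into the same vector space before they can be compared.

```python
import math
import zlib
from collections import Counter

DIM = 256  # size of the toy embedding vectors


def embed(text):
    """Toy stand-in for a neural encoder: hashed bag of words, L2-normalized."""
    vec = [0.0] * DIM
    for token, count in Counter(text.lower().split()).items():
        # crc32 gives a hash that is stable across runs, unlike hash().
        vec[zlib.crc32(token.encode("utf-8")) % DIM] += count
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def build_index(passages):
    """Indexing step: embed every passage once and keep the vectors."""
    return [(pid, embed(text)) for pid, text in passages.items()]


def search(index, query, top_k=1):
    """Retrieval step: embed the query and rank passages by dot product."""
    q = embed(query)
    scored = [(sum(a * b for a, b in zip(q, vec)), pid) for pid, vec in index]
    scored.sort(key=lambda item: item[0], reverse=True)
    return [pid for _, pid in scored[:top_k]]


# Hypothetical two-passage collection.
toy_index = build_index({
    "p1": "the cat sat on the mat",
    "p2": "stock markets fell sharply today",
})
print(search(toy_index, "the cat sat on the mat", top_k=1))
```

A real system replaces embed with a trained model and the linear scan in search with a vector index such as FAISS, which is what makes the approach scale to large collections.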

## Start asking Questions

We're now ready to query the index we created.

In [None]:
question = ['Who maintained the throne for the longest time in China?']
retrieved_doc_ids, passages = retriever.search(query=question, top_k=1, mode='query_list')

Here are the retrieved results:

In [None]:
import json
print(json.dumps(passages, indent = 4))