# Tutorial: Build a search index using DPR #

In this tutorial, we will learn how to build a Neural Search index over your document collection. The algorithm displayed here is called Dense Passage Retrieval (DPR) as described in Karpukhin et al., "Dense Passage Retrieval for Open-Domain Question Answering" [here](https://arxiv.org/pdf/2004.04906.pdf).

For the purposes of making this tutorial easy to understand we show the steps using a very small document collection. Note that this technique can be used to scale to millions of documents. We have tested upto 21 million Wikipedia passages!!!


## Preparing a Colab Environment to run this tutorial ##

Make sure to "Enable GPU Runtime" -> make a URL with a page with screenshots on how to do this.

## Installing PrimeQA

First, we need to include the required modules.


In [None]:
%%bash

pip install --upgrade pip
pip install primeqa

## Pre-process your document collection here to be ready to be stored in your Neural Search Index.

TODO- add some steps after this to ingest from the sample wikipedia docs.

In [1]:
from primeqa.components.indexer.dense import ColBERTIndexer 

No CUDA runtime is found, using CUDA_HOME='/opt/share/cuda-11.1/x86_64'
{"time":"2023-06-04 19:53:51,723", "name": "faiss.loader", "level": "INFO", "message": "Loading faiss."}
{"time":"2023-06-04 19:53:51,741", "name": "faiss.loader", "level": "INFO", "message": "Successfully loaded faiss."}


## Initializing the Indexer

We initialize a ColBERT indexer which will be used for indexing the embeddings created for each document (passage) in the collection. It takes a passage_embedding_model to create the embedding vectors and a vector_db specification where it stores the embedding vectors to search later. 

In [2]:
#indexer= ColBERTIndexer (passage_embedding_model = "/dccstor/colbert-ir/bsiyer/PQLL/experiments/xor_squad_04182023/2023-04/22/17.23.31/checkpoints/colbert.dnn.batch_17524.model")
#ToDO checkpoint to be renamed to passage_embedding_model
indexer= ColBERTIndexer (checkpoint = "/dccstor/colbert-ir/bsiyer/PQLL/experiments/xor_squad_04182023/2023-04/22/17.23.31/checkpoints/colbert.dnn.batch_17524.model", vector_db='FAISS')

In [4]:
#change it to pre-processed file location as given in 1st step
indexer.index_documents("/dccstor/irl-tableqa/jaydeep/sample-document-store2.tsv")

Indexer: collection path set to: /dccstor/irl-tableqa/jaydeep/sample-document-store2.tsv


[Jun 04, 19:54:35] #> Creating directory index_root/index_name 


#> Starting...
No CUDA runtime is found, using CUDA_HOME='/opt/share/cuda-11.1/x86_64'
{"time":"2023-06-04 19:54:37,875", "name": "faiss.loader", "level": "INFO", "message": "Loading faiss."}
{"time":"2023-06-04 19:54:37,942", "name": "faiss.loader", "level": "INFO", "message": "Successfully loaded faiss."}
{
    "ncells": null,
    "centroid_score_threshold": null,
    "ndocs": null,
    "index_path": "index_root\/index_name",
    "index_location": null,
    "nbits": 1,
    "kmeans_niters": 4,
    "num_partitions_max": 10000000,
    "similarity": "cosine",
    "bsize": 64,
    "accumsteps": 1,
    "lr": 3e-6,
    "maxsteps": 500000,
    "save_every": null,
    "resume": false,
    "resume_optimizer": false,
    "warmup": null,
    "warmup_bert": null,
    "relu": false,
    "nway": 2,
    "use_ib_negatives": false,
    "reranker":



[Jun 04, 19:54:54] [0] 		 # of sampled PIDs = 11 	 sampled_pids[:3] = [6, 0, 4]
[Jun 04, 19:54:54] [0] 		 #> Encoding 11 passages..
[Jun 04, 19:54:54] #> checkpoint, docFromText, Input: title | text, 		 64
[Jun 04, 19:54:54] #> Roberta DocTokenizer.tensorize(batch_text[0], batch_background[0], bsize) ==
[Jun 04, 19:54:54] #> Input: $ title | text, 		 64
[Jun 04, 19:54:54] #> Output IDs: torch.Size([159]), tensor([    0, 50262,  1270,  1721,  2788,     2,     1,     1,     1,     1,
            1,     1,     1,     1,     1,     1,     1,     1,     1,     1,
            1,     1,     1,     1,     1,     1,     1,     1,     1,     1,
            1,     1,     1,     1,     1,     1,     1,     1,     1,     1,
            1,     1,     1,     1,     1,     1,     1,     1,     1,     1,
            1,     1,     1,     1,     1,     1,     1,     1,     1,     1,
            1,     1,     1,     1,     1,     1,     1,     1,     1,     1,
            1,     1,     1,     1,     1,   

0it [00:00, ?it/s]

[Jun 04, 19:55:04] [0] 		 #> Saving chunk 0: 	 11 passages and 1,615 embeddings. From #0 onward.
[Jun 04, 19:55:04] offset: 0
[Jun 04, 19:55:04] chunk codes size(0): 1615
[Jun 04, 19:55:04] codes size(0): 1615
[Jun 04, 19:55:04] codes size(): torch.Size([1615])
[Jun 04, 19:55:04] >>>>partition.size(0): 256
[Jun 04, 19:55:04] >>>>num_partition: 256
[Jun 04, 19:55:04] #> Optimizing IVF to store map from centroids to list of pids..
[Jun 04, 19:55:04] #> Building the emb2pid mapping..
[Jun 04, 19:55:04] len(emb2pid) = 1615
[Jun 04, 19:55:04] #> Saved optimized IVF to index_root/index_name/ivf.pid.pt
[Jun 04, 19:55:04] [0] 		 #> Saving the indexing metadata to index_root/index_name/metadata.json ..


1it [00:04,  4.84s/it]
100%|██████████| 256/256 [00:00<00:00, 92452.37it/s]


#> Joined...


## Initializing the Retriever

We initialize a ColBERT retriever to search documents from the indexed document corpus.  Note: since we will retrieve the documents based on questions so we need to embed the questions too.

In [5]:
from primeqa.components.retriever.dense import ColBERTRetriever
# retriever = ColBERTRetriever(ColBERTIndexerindexer=indexer,
#                       query_embedding_model = "/dccstor/colbert-ir/bsiyer/PQLL/experiments/xor_squad_04182023/2023-04/22/17.23.31/checkpoints/colbert.dnn.batch_17524.model",
#                       use_gpu=True, embed_title=True)
retriever = ColBERTRetriever(indexer=indexer,
                      checkpoint = "/dccstor/colbert-ir/bsiyer/PQLL/experiments/xor_squad_04182023/2023-04/22/17.23.31/checkpoints/colbert.dnn.batch_17524.model"
                       )


self.indexer.index_name index_name
self.index_root index_root
self.index_name index_name
Indexer: get collection returned as : /dccstor/irl-tableqa/jaydeep/sample-document-store2.tsv
self.collection /dccstor/irl-tableqa/jaydeep/sample-document-store2.tsv
self._config ColBERTConfig(ncells=None, centroid_score_threshold=None, ndocs=None, index_path='index_root/index_name', index_location=None, nbits=1, kmeans_niters=20, num_partitions_max=10000000, similarity='cosine', bsize=32, accumsteps=1, lr=3e-06, maxsteps=500000, save_every=None, resume=False, resume_optimizer=False, warmup=None, warmup_bert=None, relu=False, nway=2, use_ib_negatives=False, reranker=False, distillation_alpha=1.0, ignore_scores=False, shuffle_every_epoch=False, save_steps=2000, save_epochs=-1, epochs=10, input_arguments={}, model_type='bert-base-uncased', init_from_lm=None, local_models_repository=None, ranks_fn=None, output_dir=None, topK=100, student_teacher_temperature=1.0, student_teacher_top_loss_weight=0.5, te



[Jun 04, 19:55:46] Loading filter_pids_cpp extension (set COLBERT_LOAD_TORCH_EXTENSION_VERBOSE=True for more info)...
[Jun 04, 19:55:47] Loading decompress_residuals_cpp extension (set COLBERT_LOAD_TORCH_EXTENSION_VERBOSE=True for more info)...


11it [00:00, 20551.16it/s]


## Start asking Questions

We're now ready to query the index we created.

In [7]:
question = ['What are some famous inventions by Einstein', "When did Aple introduce iPhone 7"]
retrieved_doc_ids, passages = retriever.predict(input_texts = question, mode = 'query_list',return_passages=True)


100%|██████████| 2/2 [00:00<00:00, 44.88it/s]


2
[[2, 1, 5], [9, 5, 5]]
2
[['"""Albert Einstein"""\n "Albert Einstein Albert Einstein (; ; 14 March 1879 – 18 April 1955) was a German-born theoretical physicist who developed the theory of relativity, one of the two pillars of modern physics (alongside quantum mechanics). His work is also known for its influence on the philosophy of science. He is best known to the general public for his mass–energy equivalence formula , which has been dubbed ""the world\'s most famous equation"". He received the 1921 Nobel Prize in Physics ""for his services to theoretical physics, and especially for his discovery of the law of the photoelectric effect"", a pivotal step"', '"""Albert Einstein"""\n "to Einstein in 1922. Footnotes Citations Albert Einstein Albert Einstein (; ; 14 March 1879 – 18 April 1955) was a German-born theoretical physicist who developed the theory of relativity, one of the two pillars of modern physics (alongside quantum mechanics). His work is also known for its influence on t

100%|██████████| 2/2 [00:00<00:00, 42.17it/s]

2
[[(2, 24.019685745239258), (1, 23.459884643554688), (5, 17.92898941040039)], [(9, 14.543024063110352), (5, 4.002993583679199), (5, 4.002993583679199)]]





Here are the retrived results:

In [None]:
import json
print(json.dumps(passages, indent = 4))