# Tutorial: Build a search index using DPR #

In this tutorial, we will learn how to build a neural search index over a document collection. The technique shown here is dense passage retrieval, as described in Karpukhin et al., [Dense Passage Retrieval for Open-Domain Question Answering](https://arxiv.org/pdf/2004.04906.pdf). The cells below use PrimeQA's ColBERT document store and retriever to build and query the index.

To keep this tutorial easy to follow, we show the steps on a very small document collection. The same technique scales to millions of documents: we have tested it on up to 21 million Wikipedia passages.


## Preparing a Colab Environment to run this tutorial ##

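Make sure to enable the GPU runtime before running the cells below (in Colab: Runtime > Change runtime type, then select a GPU hardware accelerator). TODO: link to a page with screenshots showing how to do this.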
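
A quick way to confirm that a GPU is attached is to look for the `nvidia-smi` tool. The helper below is a minimal sketch that uses only the standard library; the function name `gpu_visible` is ours, and it assumes `nvidia-smi` is on the `PATH` whenever a GPU runtime is active.

```python
import shutil
import subprocess


def gpu_visible() -> bool:
    """Return True if `nvidia-smi -L` runs successfully and lists a GPU."""
    exe = shutil.which("nvidia-smi")
    if exe is None:
        # The tool is not installed, so we assume no GPU runtime is attached.
        return False
    try:
        result = subprocess.run(
            [exe, "-L"], capture_output=True, text=True, timeout=30
        )
    except (OSError, subprocess.SubprocessError):
        return False
    # `nvidia-smi -L` prints one line per device, each starting with "GPU".
    return result.returncode == 0 and "GPU" in result.stdout


print("GPU runtime detected:", gpu_visible())
```

If this prints `False`, change the runtime type before continuing.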

## Installing PrimeQA

First, we install the required packages.


In [None]:
%%bash

pip install --upgrade pip
pip install primeqa
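
After the install cell finishes, it can be useful to confirm that the package is visible to the Python kernel. The sketch below uses the standard library's `importlib.metadata`; the helper name `installed_version` is ours and is not part of PrimeQA.

```python
from importlib import metadata
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


print("primeqa version:", installed_version("primeqa"))
```

A result of `None` suggests the install did not reach the environment the notebook is running in.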

## Pre-process your document collection

Here we get the document collection ready to be stored in the neural search index.

TODO: add steps after this to ingest the sample Wikipedia docs.

In [None]:
# Download the sample documents and save them as a .tsv file.
import pandas as pd

# Turn the Google Drive sharing link into a direct-download link.
url = 'https://drive.google.com/file/d/1LULJRPgN_hfuI2kG-wH4FUwXCCdDh9zh/view?usp=sharing'
url = 'https://drive.google.com/uc?id=' + url.split('/')[-2]
df = pd.read_csv(url)
df.to_csv('input.tsv', sep='\t')
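
The URL rewriting in the cell above can be wrapped in a small function if you want to reuse it for other files. This is a sketch with a name of our choosing (`drive_download_url`); like the original line, it assumes the sharing link has the shape `.../file/d/<file id>/view?...`, so the file id is the second-to-last path segment.

```python
def drive_download_url(share_url: str) -> str:
    """Convert a Google Drive sharing link into a direct-download link.

    Assumes the link looks like https://drive.google.com/file/d/<id>/view?...
    """
    file_id = share_url.split("/")[-2]
    return "https://drive.google.com/uc?id=" + file_id


share_url = "https://drive.google.com/file/d/1LULJRPgN_hfuI2kG-wH4FUwXCCdDh9zh/view?usp=sharing"
print(drive_download_url(share_url))
# prints https://drive.google.com/uc?id=1LULJRPgN_hfuI2kG-wH4FUwXCCdDh9zh
```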

In [None]:
# Use DocumentCollection class to convert your input.tsv to the specific format needed by PrimeQA indexer/retriever.
from primeqa.ir.util.corpus_reader import DocumentCollection
doc_class = DocumentCollection("input.tsv")
doc_class.write_corpus_tsv("output.tsv")
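
Before indexing, it can help to look at the first few rows of the converted file. The helper below is a generic sketch (the name `preview_tsv` is ours) that makes no assumption about which columns PrimeQA writes; it only assumes the file is tab-separated and that its first line is a header.

```python
import csv
import os
from typing import List, Tuple


def preview_tsv(path: str, n: int = 3) -> Tuple[List[str], List[List[str]]]:
    """Return the first line (assumed to be a header) and up to n following rows."""
    with open(path, newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle, delimiter="\t")
        header = next(reader, [])
        rows = [row for _, row in zip(range(n), reader)]
    return header, rows


# Only preview the file if the conversion step has already produced it.
if os.path.exists("output.tsv"):
    header, rows = preview_tsv("output.tsv")
    print("columns:", header)
    for row in rows:
        print(row)
```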

## Initializing the Indexer

We initialize a ColBERT document store, which indexes the embeddings created for each document (passage) in the collection. It takes `ctx_encoder_model_checkpoint`, the model used to create the embedding vectors, and `vector_db`, which specifies where the embedding vectors are stored so they can be searched later.

In [2]:
from primeqa.components.DocumentStore.dense import ColBERTDocumentStore

# Replace this path with the location of your own ColBERT checkpoint.
colbert = ColBERTDocumentStore(
    ctx_encoder_model_checkpoint="/dccstor/colbert-ir/bsiyer/PQLL/experiments/xor_squad_04182023/2023-04/22/17.23.31/checkpoints/colbert.dnn.batch_17524.model",
    vector_db='FAISS',
)

In [3]:
colbert.index("output.tsv")



[Jun 10, 18:17:43] #> Creating directory index_root/index_name 


#> Starting...
No CUDA runtime is found, using CUDA_HOME='/opt/share/cuda-11.1/x86_64'
{"time":"2023-06-10 18:17:49,684", "name": "faiss.loader", "level": "INFO", "message": "Loading faiss."}
{"time":"2023-06-10 18:17:49,758", "name": "faiss.loader", "level": "INFO", "message": "Successfully loaded faiss."}
{
    "ncells": null,
    "centroid_score_threshold": null,
    "ndocs": null,
    "index_path": "index_root\/index_name",
    "index_location": null,
    "nbits": 1,
    "kmeans_niters": 4,
    "num_partitions_max": 10000000,
    "similarity": "cosine",
    "bsize": 64,
    "accumsteps": 1,
    "lr": 3e-6,
    "maxsteps": 500000,
    "save_every": null,
    "resume": false,
    "resume_optimizer": false,
    "warmup": null,
    "warmup_bert": null,
    "relu": false,
    "nway": 2,
    "use_ib_negatives": false,
    "reranker": false,
    "distillation_alpha": 1.0,
    "ignore_scores": false,
    "shuffle_every_epo

Downloading: 100%|██████████| 482/482 [00:00<00:00, 231kB/s]
Downloading: 100%|██████████| 1.43G/1.43G [00:32<00:00, 43.7MB/s]
Downloading: 100%|██████████| 899k/899k [00:00<00:00, 16.5MB/s]
Downloading: 100%|██████████| 456k/456k [00:00<00:00, 9.08MB/s]
Downloading: 100%|██████████| 1.36M/1.36M [00:00<00:00, 20.5MB/s]


[Jun 10, 18:18:39] Loading segmented_maxsim_cpp extension (set COLBERT_LOAD_TORCH_EXTENSION_VERBOSE=True for more info)...
[Jun 10, 18:19:12] factory model type: roberta


Downloading: 100%|██████████| 481/481 [00:00<00:00, 564kB/s]
Downloading: 100%|██████████| 899k/899k [00:00<00:00, 15.5MB/s]
Downloading: 100%|██████████| 456k/456k [00:00<00:00, 10.3MB/s]
Downloading: 100%|██████████| 1.36M/1.36M [00:00<00:00, 17.9MB/s]


[Jun 10, 18:19:15] factory model type: roberta




[Jun 10, 18:19:20] [0] 		 # of sampled PIDs = 76 	 sampled_pids[:3] = [53, 1, 38]
[Jun 10, 18:19:20] [0] 		 #> Encoding 76 passages..
[Jun 10, 18:19:20] #> checkpoint, docFromText, Input: title | text, 		 64
[Jun 10, 18:19:20] #> Roberta DocTokenizer.tensorize(batch_text[0], batch_background[0], bsize) ==
[Jun 10, 18:19:20] #> Input: $ title | text, 		 64
[Jun 10, 18:19:20] #> Output IDs: torch.Size([177]), tensor([    0, 50262,  1270,  1721,  2788,     2,     1,     1,     1,     1,
            1,     1,     1,     1,     1,     1,     1,     1,     1,     1,
            1,     1,     1,     1,     1,     1,     1,     1,     1,     1,
            1,     1,     1,     1,     1,     1,     1,     1,     1,     1,
            1,     1,     1,     1,     1,     1,     1,     1,     1,     1,
            1,     1,     1,     1,     1,     1,     1,     1,     1,     1,
            1,     1,     1,     1,     1,     1,     1,     1,     1,     1,
            1,     1,     1,     1,     1, 



  Iteration 3 (0.27 s, search 0.25 s): objective=1605.67 imbalance=1.780 nsplit=0       
[0.024, 0.025, 0.023, 0.025, 0.026, 0.022, 0.026, 0.022, 0.024, 0.023, 0.025, 0.024, 0.025, 0.024, 0.023, 0.023, 0.025, 0.024, 0.026, 0.025, 0.026, 0.024, 0.027, 0.026, 0.025, 0.023, 0.022, 0.025, 0.026, 0.022, 0.027, 0.025, 0.025, 0.024, 0.024, 0.023, 0.026, 0.023, 0.024, 0.026, 0.026, 0.024, 0.024, 0.025, 0.025, 0.024, 0.022, 0.025, 0.025, 0.024, 0.024, 0.024, 0.024, 0.024, 0.025, 0.021, 0.025, 0.024, 0.025, 0.021, 0.025, 0.023, 0.023, 0.023, 0.024, 0.025, 0.024, 0.025, 0.021, 0.026, 0.025, 0.026, 0.025, 0.024, 0.026, 0.026, 0.023, 0.023, 0.023, 0.023, 0.024, 0.025, 0.023, 0.025, 0.024, 0.028, 0.026, 0.027, 0.024, 0.027, 0.024, 0.025, 0.026, 0.025, 0.023, 0.026, 0.025, 0.026, 0.025, 0.025, 0.027, 0.025, 0.024, 0.026, 0.026, 0.023, 0.024, 0.024, 0.025, 0.025, 0.025, 0.025, 0.026, 0.024, 0.028, 0.023, 0.025, 0.025, 0.026, 0.024, 0.025, 0.027, 0.025, 0.028, 0.028, 0.025, 0.027, 0.025]
[Jun 10, 18:19

0it [00:00, ?it/s]

[Jun 10, 18:20:12] [0] 		 #> Saving chunk 0: 	 76 passages and 12,344 embeddings. From #0 onward.
[Jun 10, 18:20:12] offset: 0
[Jun 10, 18:20:12] chunk codes size(0): 12344
[Jun 10, 18:20:12] codes size(0): 12344
[Jun 10, 18:20:12] codes size(): torch.Size([12344])
[Jun 10, 18:20:12] >>>>partition.size(0): 512
[Jun 10, 18:20:12] >>>>num_partition: 512
[Jun 10, 18:20:12] #> Optimizing IVF to store map from centroids to list of pids..
[Jun 10, 18:20:12] #> Building the emb2pid mapping..
[Jun 10, 18:20:12] len(emb2pid) = 12344
[Jun 10, 18:20:12] #> Saved optimized IVF to index_root/index_name/ivf.pid.pt
[Jun 10, 18:20:12] [0] 		 #> Saving the indexing metadata to index_root/index_name/metadata.json ..


1it [00:25, 25.39s/it]
100%|██████████| 512/512 [00:00<00:00, 84155.64it/s]


#> Joined...
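
The log above mentions building an `emb2pid` mapping and an IVF that maps centroids to lists of passage ids (pids). The toy example below sketches what those two structures could look like. It is an illustration in plain Python, not the PrimeQA or ColBERT implementation; the function names and the centroid assignments in `codes` are made up for the example.

```python
from collections import defaultdict
from typing import Dict, List


def build_emb2pid(doc_lengths: List[int]) -> List[int]:
    """Map each embedding position to the passage id (pid) it came from."""
    emb2pid: List[int] = []
    for pid, length in enumerate(doc_lengths):
        emb2pid.extend([pid] * length)
    return emb2pid


def centroid_to_pids(codes: List[int], emb2pid: List[int]) -> Dict[int, List[int]]:
    """Group passage ids by the centroid assigned to each of their embeddings."""
    buckets = defaultdict(set)
    for centroid, pid in zip(codes, emb2pid):
        buckets[centroid].add(pid)
    return {centroid: sorted(pids) for centroid, pids in buckets.items()}


# Toy data: three passages with 2, 3 and 1 token embeddings.
doc_lengths = [2, 3, 1]
codes = [0, 1, 1, 1, 2, 0]  # hypothetical centroid id for each embedding
emb2pid = build_emb2pid(doc_lengths)
ivf = centroid_to_pids(codes, emb2pid)
print(emb2pid)  # [0, 0, 1, 1, 1, 2]
print(ivf)      # {0: [0, 2], 1: [0, 1], 2: [1]}
```

At search time, a structure like `ivf` lets the retriever look only at passages that share a centroid with the query embeddings instead of scanning the whole collection.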


## Initializing the Retriever

We initialize a ColBERT retriever to search the indexed document collection. Note: because documents are retrieved based on questions, the questions must be embedded too, so the retriever takes a query encoder checkpoint.

In [4]:
from primeqa.components.retriever.dense import ColBERTRetriever

retriever = ColBERTRetriever(
    document_store=colbert,
    query_encoder_model_checkpoint="/dccstor/colbert-ir/bsiyer/PQLL/experiments/xor_squad_04182023/2023-04/22/17.23.31/checkpoints/colbert.dnn.batch_17524.model",
)



Indexer: get collection returned as : output.tsv
self._config ColBERTConfig(ncells=None, centroid_score_threshold=None, ndocs=None, index_path='index_root/index_name', index_location=None, nbits=1, kmeans_niters=20, num_partitions_max=10000000, similarity='cosine', bsize=32, accumsteps=1, lr=3e-06, maxsteps=500000, save_every=None, resume=False, resume_optimizer=False, warmup=None, warmup_bert=None, relu=False, nway=2, use_ib_negatives=False, reranker=False, distillation_alpha=1.0, ignore_scores=False, shuffle_every_epoch=False, save_steps=2000, save_epochs=-1, epochs=10, input_arguments={}, model_type='roberta', init_from_lm=None, local_models_repository=None, ranks_fn=None, output_dir=None, topK=100, student_teacher_temperature=1.0, student_teacher_top_loss_weight=0.5, teacher_model_type='bert', teacher_doc_maxlen=180, distill_query_passage_separately=False, query_only=False, loss_function=None, query_weight=0.5, rng_seed=12345, query_maxlen=32, attend_to_mask_tokens=False, interacti



[Jun 10, 18:21:13] Loading filter_pids_cpp extension (set COLBERT_LOAD_TORCH_EXTENSION_VERBOSE=True for more info)...
[Jun 10, 18:21:40] Loading decompress_residuals_cpp extension (set COLBERT_LOAD_TORCH_EXTENSION_VERBOSE=True for more info)...


76it [00:00, 39741.57it/s]
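
The indexing log refers to a `segmented_maxsim_cpp` extension, and the config shows `similarity='cosine'`. ColBERT scores a passage against a question with late interaction (MaxSim): each query token embedding is matched to its most similar passage token embedding, and those maxima are summed. The sketch below shows that scoring rule in plain Python. It is illustrative only, not the code PrimeQA runs, and it assumes the vectors are already unit-normalized so that a dot product equals cosine similarity.

```python
from typing import Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def maxsim_score(query_vecs: Sequence[Vector], doc_vecs: Sequence[Vector]) -> float:
    """Sum, over query tokens, of the best similarity to any passage token."""
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)


# Toy 2-dimensional token embeddings (all unit length).
query = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]]
doc_b = [[0.6, 0.8]]

scores = {"doc_a": maxsim_score(query, doc_a), "doc_b": maxsim_score(query, doc_b)}
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked)  # ['doc_a', 'doc_b']
```

Here `doc_a` ranks first because each query token finds an exact match among its token embeddings.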


## Start asking Questions

We're now ready to query the index we created.

In [5]:
question = ['What are some famous inventions by Einstein', 'When did Apple introduce iPhone 7']
retrieved_doc_ids, passages = retriever.predict(input_texts=question, mode='query_list', return_passages=True)


[Jun 10, 18:22:14] #> Roberta QueryTokenizer.tensorize(batch_text[0], batch_background[0], bsize) ==
[Jun 10, 18:22:14] #> Input: $ What are some famous inventions by Einstein, 		 True, 		 None
[Jun 10, 18:22:14] #> Output IDs: torch.Size([32]), tensor([    0, 50261,   653,    32,   103,  3395, 39232,    30, 27648,     2,
            1,     1,     1,     1,     1,     1,     1,     1,     1,     1,
            1,     1,     1,     1,     1,     1,     1,     1,     1,     1,
            1,     1])
[Jun 10, 18:22:14] #> Output Mask: torch.Size([32]), tensor([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0])
[Jun 10, 18:22:14] #>>>> colbert query ==
[Jun 10, 18:22:14] #>>>>> input_ids: torch.Size([32]), tensor([    0, 50261,   653,    32,   103,  3395, 39232,    30, 27648,     2,
            1,     1,     1,     1,     1,     1,     1,     1,     1,     1,
            1,     1,     1,     1,     1,     1,     1,     1,     1,     1,




[Jun 10, 18:22:16] #>>>> before linear query ==
[Jun 10, 18:22:16] #>>>>> Q: torch.Size([32, 1024]), tensor([[-1.0246,  1.1656, -0.0771,  ...,  0.1875, -0.0456,  1.3663],
        [-0.1284,  1.6476, -0.0994,  ..., -0.4682,  0.0463,  0.7251],
        [-0.1178,  0.7274, -0.2226,  ..., -0.7760,  1.3722,  0.8347],
        ...,
        [-0.3671,  1.4453,  0.1219,  ..., -0.6817,  0.2066,  0.7360],
        [-0.3671,  1.4453,  0.1219,  ..., -0.6817,  0.2066,  0.7360],
        [-0.3671,  1.4453,  0.1219,  ..., -0.6817,  0.2066,  0.7360]])
[Jun 10, 18:22:16] #>>>>> self.linear query : Parameter containing:
tensor([[-3.0738e-02,  2.1602e-03,  3.6676e-02,  ..., -2.8078e-03,
          2.0939e-02,  1.6086e-02],
        [ 1.5858e-02, -1.4224e-02, -1.4469e-05,  ...,  2.9249e-02,
         -8.7473e-04, -2.0210e-02],
        [ 1.5264e-02,  2.7762e-03, -6.8552e-03,  ...,  8.7342e-03,
          6.7920e-03,  3.1651e-03],
        ...,
        [-2.3147e-03,  3.5463e-02, -3.9315e-03,  ..., -6.7647e-03,
        

100%|██████████| 2/2 [00:00<00:00, 41.20it/s]
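
With `return_passages=True`, the call above returns the passages as one list per question. A small helper like the following can pair each question with its top passages. This is a sketch: the name `top_passages` is ours, the demo strings are placeholders rather than real retriever output, and it assumes `passages[i]` belongs to the i-th question with the best match first.

```python
from typing import Dict, List


def top_passages(questions: List[str], passages: List[List[str]], k: int = 1) -> Dict[str, List[str]]:
    """Pair each question with its first k passages."""
    if len(questions) != len(passages):
        raise ValueError("expected one list of passages per question")
    return {q: p[:k] for q, p in zip(questions, passages)}


# Placeholder data standing in for the retriever output.
demo_questions = ["first question", "second question"]
demo_passages = [["passage 1a", "passage 1b"], ["passage 2a"]]

for q, hits in top_passages(demo_questions, demo_passages, k=1).items():
    print(q, "->", hits)
```

In the notebook, you would call `top_passages(question, passages, k=3)` with the variables from the cell above.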


Here are the retrieved results:

In [6]:
import json
print(json.dumps(passages, indent = 4))

[
    [
        "\"\"\"Albert Einstein\"\"\"\n \"Albert Einstein Albert Einstein (; ; 14 March 1879 \u2013 18 April 1955) was a German-born theoretical physicist who developed the theory of relativity, one of the two pillars of modern physics (alongside quantum mechanics). His work is also known for its influence on the philosophy of science. He is best known to the general public for his mass\u2013energy equivalence formula , which has been dubbed \"\"the world's most famous equation\"\". He received the 1921 Nobel Prize in Physics \"\"for his services to theoretical physics, and especially for his discovery of the law of the photoelectric effect\"\", a pivotal step\"",
        "\"\"\"Albert Einstein\"\"\"\n \"model for depictions of mad scientists and absent-minded professors; his expressive face and distinctive hairstyle have been widely copied and exaggerated. \"\"Time\"\" magazine's Frederic Golden wrote that Einstein was \"\"a cartoonist's dream come true\"\". Many popular quotat