# Tutorial: Build a search index using DPR #

In this tutorial, we will learn how to build a neural search index over your document collection. The algorithm used here is Dense Passage Retrieval (DPR), described in Karpukhin et al., ["Dense Passage Retrieval for Open-Domain Question Answering"](https://arxiv.org/pdf/2004.04906.pdf).
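
To make the idea concrete: DPR encodes questions and passages into vectors and ranks passages by the inner product between the question vector and each passage vector. The sketch below illustrates only that ranking step, with made-up 3-dimensional vectors standing in for real encoder outputs. The function name `rank_passages` and all numbers are illustrative assumptions, not part of PrimeQA.

```python
from typing import List, Tuple


def rank_passages(question_vec: List[float],
                  passage_vecs: List[List[float]],
                  top_k: int = 2) -> List[Tuple[int, float]]:
    """Return (passage_index, score) pairs sorted by inner product, best first."""
    scores = [
        (idx, sum(q * p for q, p in zip(question_vec, vec)))
        for idx, vec in enumerate(passage_vecs)
    ]
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:top_k]


# Toy vectors standing in for encoder outputs (hypothetical values).
question = [1.0, 0.0, 1.0]
passages = [
    [0.75, 0.25, 0.5],  # points in a similar direction to the question
    [0.0, 1.0, 0.0],    # orthogonal to the question
    [0.5, 0.5, 0.5],
]

for idx, score in rank_passages(question, passages):
    print(f"passage {idx}: score {score}")
```

In a real index the passage vectors are computed once and stored, so only the question has to be encoded at query time.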

To keep this tutorial easy to follow, we show the steps on a very small document collection. The same technique scales to millions of documents: we have tested it on up to 21 million Wikipedia passages.


## Preparing a Colab Environment to run this tutorial ##

Make sure to enable a GPU runtime before running this tutorial (in Colab: Runtime > Change runtime type, then select a GPU as the hardware accelerator).
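
One way to confirm that a GPU is visible to the runtime is to check whether the `nvidia-smi` tool is present and runs successfully. This is a minimal sketch using only the standard library. The helper name `gpu_available` is an assumption for illustration, and the check only detects NVIDIA GPUs.

```python
import shutil
import subprocess


def gpu_available() -> bool:
    """Return True if `nvidia-smi` is on the PATH and exits successfully."""
    if shutil.which("nvidia-smi") is None:
        return False
    result = subprocess.run(["nvidia-smi"], capture_output=True, text=True)
    return result.returncode == 0


print("GPU detected" if gpu_available() else "No GPU detected; enable a GPU runtime")
```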

## Installing PrimeQA

First, we need to install the required packages.


In [None]:
%%bash

pip install --upgrade pip
pip install primeqa
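
After the install finishes, it can help to confirm that the package is visible to the Python interpreter. The sketch below uses the standard-library `importlib.metadata` module. The helper name `installed_version` is a hypothetical name chosen for this example.

```python
from importlib import metadata
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


print("primeqa version:", installed_version("primeqa"))
```

If this prints `None`, the install most likely went into a different environment than the one running the notebook.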

## Pre-processing your document collection

Prepare your document collection so that it is ready to be stored in your neural search index.

In [118]:
import pandas as pd
url='https://drive.google.com/file/d/1LULJRPgN_hfuI2kG-wH4FUwXCCdDh9zh/view?usp=sharing'
url='https://drive.google.com/uc?id=' + url.split('/')[-2]
df = pd.read_csv(url)
df = df.reset_index().rename(columns={'index': 'id'})
print(df[:3])


   id              title                                               text
0   0  "Albert Einstein"  to Einstein in 1922. Footnotes Citations Alber...
1   1  "Albert Einstein"  Albert Einstein Albert Einstein (; ; 14 March ...
2   2  "Albert Einstein"  observations were published in the internation...
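
The cell above rewrites a Google Drive sharing link into a direct-download link by extracting the file ID from the URL path. The same step can be written as a small helper, sketched here. The function name `drive_share_to_download_url` is an assumption for illustration, and it only handles links of the form `.../file/d/<id>/view...`.

```python
def drive_share_to_download_url(share_url: str) -> str:
    """Convert a Google Drive 'file/d/<id>/view' share link to a 'uc?id=<id>' link."""
    parts = share_url.split('/')
    # In a link of the form .../file/d/<id>/view?usp=sharing, the ID follows 'd'.
    file_id = parts[parts.index('d') + 1]
    return 'https://drive.google.com/uc?id=' + file_id


share = 'https://drive.google.com/file/d/1LULJRPgN_hfuI2kG-wH4FUwXCCdDh9zh/view?usp=sharing'
print(drive_share_to_download_url(share))
```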


In [119]:
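# Data should be formatted as: id\ttext\ttitle

# Swap the contents of the text and title columns; the rename in the next cell
# restores the correct labels, leaving the column order as id, TEXT, TITLE.
df[['text', 'title']] = df[['title', 'text']]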


In [120]:
df.rename(columns={'text': 'TITLE', 'title': 'TEXT'}, inplace=True)
df.to_csv('collection.tsv', sep='\t')
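
Before indexing, it is worth looking at the first few rows of the written file to confirm the column order. The sketch below reads a tab-separated file with the standard-library `csv` module. The helper name `peek_tsv` and the demo file are assumptions for illustration. Note that `to_csv` also writes the DataFrame index as an unnamed first column by default, so expect an empty first header field in `collection.tsv`.

```python
import csv
import os
import tempfile
from typing import List


def peek_tsv(path: str, n_rows: int = 3) -> List[List[str]]:
    """Return the header plus the first `n_rows` data rows of a tab-separated file."""
    rows = []
    with open(path, newline='', encoding='utf-8') as f:
        reader = csv.reader(f, delimiter='\t')
        for i, row in enumerate(reader):
            if i > n_rows:
                break
            rows.append(row)
    return rows


# Demo on a tiny hypothetical file; in the notebook, pass 'collection.tsv' instead.
demo_path = os.path.join(tempfile.mkdtemp(), 'demo.tsv')
with open(demo_path, 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f, delimiter='\t')
    writer.writerow(['id', 'TEXT', 'TITLE'])
    writer.writerow(['0', 'Some passage text.', 'Some title'])

for row in peek_tsv(demo_path):
    print(row)
```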

## Initializing the Retriever

We initialize a DPR model to embed the documents in our collection. Note: since we will later ask questions over this collection, the questions need to be embedded too.

In [121]:
from primeqa.components.indexer.dense import DPRIndexer

dpr = DPRIndexer(ctx_encoder_name_or_path="PrimeQA/XOR-TyDi_monolingual_DPR_ctx_encoder")

Some weights of the model checkpoint at facebook/dpr-ctx_encoder-multiset-base were not used when initializing DPRContextEncoder: ['ctx_encoder.bert_model.pooler.dense.weight', 'ctx_encoder.bert_model.pooler.dense.bias']
- This IS expected if you are initializing DPRContextEncoder from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DPRContextEncoder from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization. 
The tokenizer class you load from this checkpoint is 'DPRQuestionEncoderTokenizer'. 
The class this function is called from is 'DPRContextEncoderTokenize

## Embedding documents into a DPR based Vector DB

In this step, we embed all the pre-processed documents and add them to the index.

In [122]:
# index documents
dpr.index(collection='collection.tsv')

{"time":"2023-06-01 18:38:28,217", "name": "primeqa.ir.dense.dpr_top.dpr.index_simple_corpus", "level": "INFO", "message": "wrote passages_1_of_1.json.gz.records in 2 seconds"}
{"time":"2023-06-01 18:38:28,218", "name": "primeqa.ir.dense.dpr_top.dpr.faiss_index", "level": "INFO", "message": "building index, reading data from dpr_index/passages_1_of_1.json.gz.records, writing to dpr_index/index_1_of_1.faiss"}
{"time":"2023-06-01 18:38:28,254", "name": "primeqa.ir.dense.dpr_top.dpr.faiss_index", "level": "INFO", "message": "processed 0 passages"}
{"time":"2023-06-01 18:38:28,256", "name": "primeqa.ir.dense.dpr_top.dpr.faiss_index", "level": "INFO", "message": "calling index.add with 76 vectors"}
{"time":"2023-06-01 18:38:28,258", "name": "primeqa.ir.dense.dpr_top.dpr.faiss_index", "level": "INFO", "message": "processed 76 passages"}
{"time":"2023-06-01 18:38:28,258", "name": "primeqa.ir.dense.dpr_top.dpr.faiss_index", "level": "INFO", "message": "finished building index, writing index fi
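
The log above mentions files written under a `dpr_index` directory. To see what was produced, the directory contents can be listed with file sizes, as in this sketch. The helper name `list_index_files` is hypothetical, and the directory name is taken from the log output rather than from any documented default.

```python
import os
from typing import List, Tuple


def list_index_files(index_dir: str) -> List[Tuple[str, int]]:
    """Return (file name, size in bytes) for each regular file in `index_dir`, sorted by name."""
    entries = []
    for name in sorted(os.listdir(index_dir)):
        path = os.path.join(index_dir, name)
        if os.path.isfile(path):
            entries.append((name, os.path.getsize(path)))
    return entries


# 'dpr_index' is the directory named in the log; fall back to the current directory
# so the sketch still runs when no index has been built yet.
index_dir = 'dpr_index' if os.path.isdir('dpr_index') else '.'
for name, size in list_index_files(index_dir):
    print(f'{size:>12,d}  {name}')
```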

## Start asking Questions

We're now ready to query the index we created.

In [102]:
from primeqa.components.retriever.dense import DPRRetriever

retriever = DPRRetriever(checkpoint="PrimeQA/XOR-TyDi_monolingual_DPR_qry_encoder", index_location=dpr.output_dir)

question = ['Who maintained the throne for the longest time in China?']
retrieved_doc_ids, passages = retriever.predict(input_texts=question, return_passages=True, max_num_documents=10)


{"time":"2023-06-01 18:32:02,332", "name": "primeqa.ir.dense.dpr_top.dpr.searcher", "level": "INFO", "message": "Using sharded faiss, reading shards from dpr_index"}
{"time":"2023-06-01 18:32:02,333", "name": "primeqa.ir.dense.dpr_top.dpr.searcher", "level": "INFO", "message": "Reading passages_1_of_1.json.gz.records"}
{"time":"2023-06-01 18:32:02,340", "name": "primeqa.ir.dense.dpr_top.dpr.searcher", "level": "INFO", "message": "Using sharded faiss with 1 shards."}
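
The raw JSON dump in the Retrieval Results section below can be hard to scan. Assuming `passages` holds one dict per question with parallel `"titles"` and `"texts"` lists (the shape visible in that printed output), a small helper can pair each title with a short snippet of its text. The name `top_hits` and the stand-in data are assumptions for illustration.

```python
from typing import Dict, List, Tuple


def top_hits(passages: List[Dict[str, List[str]]],
             question_idx: int = 0,
             max_chars: int = 80) -> List[Tuple[int, str, str]]:
    """Return (rank, title, text snippet) for one question's retrieved passages."""
    entry = passages[question_idx]
    hits = []
    for rank, (title, text) in enumerate(zip(entry['titles'], entry['texts']), start=1):
        hits.append((rank, title, text[:max_chars]))
    return hits


# Hypothetical stand-in for the retriever output, shaped like the printed results.
example = [{
    'titles': ['Ashoka', '"The Ashes"'],
    'texts': ['Wheel of Dharma). The wheel has 24 spokes ...', 'placeholder passage text'],
}]

for rank, title, snippet in top_hits(example):
    print(f'{rank}. {title}: {snippet}')
```

In the notebook, calling `top_hits(passages)` on the real retriever output would give the same kind of ranked list for the first question.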


## Retrieval Results

Here are the retrieved results:

In [103]:
import json

print(json.dumps(passages, indent = 4))

[
    {
        "titles": [
            "Ashoka",
            "Ashoka",
            "Ashoka",
            "\"Alexander Graham Bell\"",
            "\"America the Beautiful\"",
            "\"Ainu people\"",
            "\"Akira Kurosawa\"",
            "\"Aquarius (constellation)\"",
            "\"The Ashes\"",
            "\"Amplitude modulation\""
        ],
        "texts": [
            "Wheel of Dharma). The wheel has 24 spokes which represent the 12 Laws of Dependent Origination and the 12 Laws of Dependent Termination. The Ashoka Chakra has been widely inscribed on many relics of the Mauryan Emperor, most prominent among which is the Lion Capital of Sarnath and The Ashoka Pillar. The most visible use of the Ashoka Chakra today is at the centre of the National flag of the Republic of India (adopted on 22 July 1947), where it is rendered in a Navy-blue color on a White background, by replacing the symbol of Charkha (Spinning wheel) of the",
            "pre-independence versions 