# Tutorial: Build a search index using DPR #

In this tutorial, we will learn how to build a neural search index over a document collection. The algorithm used here is Dense Passage Retrieval (DPR), described in Karpukhin et al., ["Dense Passage Retrieval for Open-Domain Question Answering"](https://arxiv.org/pdf/2004.04906.pdf).
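
As a brief sketch of the scoring rule from the paper: DPR uses two encoders, one for questions ($E_Q$) and one for passages ($E_P$), and ranks a passage $p$ for a question $q$ by the inner product of their embeddings. The symbols follow the paper's notation and are not PrimeQA identifiers.

$$
\mathrm{sim}(q, p) = E_Q(q)^{\top} E_P(p)
$$

Passages with the highest scores are returned first, which is why both the documents and the questions have to be embedded.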

To keep this tutorial easy to follow, we show the steps on a very small document collection. The same technique scales to millions of documents: we have tested it on up to 21 million Wikipedia passages.


## Preparing a Colab Environment to run this tutorial ##

Make sure to enable a GPU runtime: in Colab, open Runtime > Change runtime type and select GPU as the hardware accelerator.
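
To confirm that the runtime actually exposes a GPU, a quick check along these lines may help. This is a minimal sketch that assumes an NVIDIA GPU and relies only on the `nvidia-smi` command-line tool; `gpu_visible` is a hypothetical helper name, not part of PrimeQA.

```python
import shutil
import subprocess


def gpu_visible() -> bool:
    """Return True if `nvidia-smi` is installed and lists at least one GPU."""
    exe = shutil.which("nvidia-smi")
    if exe is None:
        return False  # no NVIDIA driver tools on PATH
    try:
        result = subprocess.run(
            [exe, "-L"], capture_output=True, text=True, timeout=30
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    # `nvidia-smi -L` prints one "GPU <n>: ..." line per device.
    return result.returncode == 0 and "GPU" in result.stdout


print("GPU visible:", gpu_visible())
```

If this prints `False`, the runtime type has probably not been switched to GPU yet.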

## Installing PrimeQA

First, we need to install the required packages.


In [None]:
%%bash

pip install --upgrade pip
pip install primeqa

## Pre-processing your document collection

Pre-process your document collection here so that it is ready to be stored in your neural search index.

In [None]:
# Download the example collection and save it as a tab-separated file.
import pandas as pd

# Turn the Google Drive sharing link into a direct-download link.
url = 'https://drive.google.com/file/d/1LULJRPgN_hfuI2kG-wH4FUwXCCdDh9zh/view?usp=sharing'
url = 'https://drive.google.com/uc?id=' + url.split('/')[-2]
df = pd.read_csv(url)
df.to_csv('input.tsv', sep='\t')

In [None]:
# Use the DocumentCollection class to convert input.tsv into the format expected by the PrimeQA indexer and retriever.
from primeqa.ir.util.corpus_reader import DocumentCollection
doc_class = DocumentCollection("input.tsv")
doc_class.write_corpus_tsv("output.tsv")
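
Before indexing, it can be worth looking at the first few rows of the converted file. The sketch below uses only the standard library; `peek_tsv` is a hypothetical helper, and the demo runs on a small stand-in file whose column names are illustrative and not necessarily what PrimeQA writes.

```python
import csv
import itertools
import os
import tempfile


def peek_tsv(path, n=3):
    """Return the first n rows of a tab-separated file as lists of strings."""
    with open(path, newline="", encoding="utf-8") as f:
        reader = csv.reader(f, delimiter="\t")
        return list(itertools.islice(reader, n))


# Demo on a stand-in file; the column names here are illustrative only.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.tsv")
    with open(demo_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f, delimiter="\t")
        writer.writerow(["id", "text", "title"])
        writer.writerow(["1", "A short passage.", "Example"])
    demo_rows = peek_tsv(demo_path)

print(demo_rows)
```

In the notebook, calling `peek_tsv("output.tsv")` would show the rows written by the previous cell.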

## Initializing the Indexer

We initialize a DPR model to embed the documents in our collection. Note that the questions we ask later over this collection must be embedded too; that is handled by the query encoder loaded with the retriever.

In [None]:
from primeqa.components.indexer.dense import DPRIndexer

dpr = DPRIndexer(doc_encoder_model_name_or_path="PrimeQA/XOR-TyDi_monolingual_DPR_ctx_encoder", vector_db="FAISS")

## Embedding documents into a DPR-based Vector DB

In this step, we embed all the pre-processed documents and store the embeddings in the index.

In [None]:
dpr.index(collection='output.tsv')
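
Conceptually, indexing stores one embedding per passage so that a query vector can later be compared against all of them. The sketch below is a toy, pure-Python stand-in for that idea; it is not how `DPRIndexer` or FAISS is implemented, and `ToyIndex` and the two-dimensional vectors are made up for illustration.

```python
import heapq


class ToyIndex:
    """Brute-force inner-product index over small in-memory vectors."""

    def __init__(self):
        self.ids = []
        self.vectors = []

    def add(self, doc_id, vector):
        """Store one document id together with its embedding."""
        self.ids.append(doc_id)
        self.vectors.append(list(vector))

    def search(self, query, k=2):
        """Return the k (doc_id, score) pairs with the highest inner product."""
        scored = (
            (doc_id, sum(q * v for q, v in zip(query, vec)))
            for doc_id, vec in zip(self.ids, self.vectors)
        )
        return heapq.nlargest(k, scored, key=lambda pair: pair[1])


index = ToyIndex()
index.add("doc-a", [1.0, 0.0])
index.add("doc-b", [0.5, 0.5])
index.add("doc-c", [0.0, 1.0])

# Score every stored vector against the query and keep the best two.
top = index.search([1.0, 0.25], k=2)
print(top)
```

A real vector database avoids scanning every vector at this scale, but the ranking criterion is the same inner product.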

## Start asking Questions

We're now ready to query the index we created.

In [None]:
from primeqa.components.retriever.dense import DPRRetriever

retriever = DPRRetriever(query_encoder_model_name_or_path="PrimeQA/XOR-TyDi_monolingual_DPR_qry_encoder", indexer=dpr)

question = ['Who maintained the throne for the longest time in China?']
retrieved_doc_ids, passages = retriever.predict(input_texts=question, return_passages=True, max_num_documents=10)


## Retrieval Results

Here are the retrieved results:

In [None]:
import json

print(json.dumps(passages, indent=4))
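
For a more compact view than the raw JSON, the ids can be printed as a ranked list per question. This sketch assumes, without confirmation from the PrimeQA documentation, that `retrieved_doc_ids` holds one list of ids per input question; `format_ranking` is a hypothetical helper and the demo values are stand-ins, not real retriever output.

```python
def format_ranking(questions, ids_per_question):
    """Render one ranked list of document ids under each question."""
    lines = []
    for question, doc_ids in zip(questions, ids_per_question):
        lines.append(f"Q: {question}")
        for rank, doc_id in enumerate(doc_ids, start=1):
            lines.append(f"  {rank}. {doc_id}")
    return "\n".join(lines)


# Stand-in values; these are not real retriever output.
demo = format_ranking(["example question"], [["doc-7", "doc-2"]])
print(demo)
```

If the assumption about the return shape holds, `format_ranking(question, retrieved_doc_ids)` would work on the values from the query cell.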