# Tutorial: Build a search index using DPR #

In this tutorial, we will learn how to build a neural search index over a document collection. The approach shown here is Dense Passage Retrieval (DPR), described in Karpukhin et al., ["Dense Passage Retrieval for Open-Domain Question Answering"](https://arxiv.org/pdf/2004.04906.pdf).

To keep this tutorial easy to follow, we show the steps on a very small document collection. The same technique scales to millions of documents: we have tested it on up to 21 million Wikipedia passages.


## Preparing a Colab Environment to run this tutorial ##

Make sure to enable a GPU runtime: in the Colab menu, select Runtime -> Change runtime type and choose a GPU hardware accelerator.

TODO- link to a page with screenshots showing how to do this.
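
To confirm that the runtime actually exposes a GPU, a quick check like the one below can help. This is a minimal sketch that assumes an NVIDIA GPU and relies on the `nvidia-smi` tool being on the PATH; the helper name `gpu_visible` is hypothetical and not part of PrimeQA.

```python
import shutil
import subprocess


def gpu_visible() -> bool:
    """Return True if `nvidia-smi` is on PATH and lists at least one GPU."""
    exe = shutil.which("nvidia-smi")
    if exe is None:
        # No NVIDIA driver tools found, so most likely no GPU runtime.
        return False
    try:
        # `nvidia-smi -L` prints one line per GPU it can see.
        result = subprocess.run(
            [exe, "-L"], capture_output=True, text=True, timeout=30
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0 and "GPU" in result.stdout


print("GPU visible:", gpu_visible())
```

If this prints `False`, revisit the runtime settings before continuing.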

## Installing PrimeQA

First, we need to install the required packages.


In [None]:
%%bash

pip install --upgrade pip
pip install primeqa

## Pre-process your document collection for the neural search index

TODO- add some steps after this to ingest from the sample wikipedia docs.

In [None]:
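# Download the sample documents and save them as a .tsv file.
import pandas as pd

# Google Drive sharing link for the sample file.
url = 'https://drive.google.com/file/d/1LULJRPgN_hfuI2kG-wH4FUwXCCdDh9zh/view?usp=sharing'
# Rewrite it as a direct-download link using the file ID (the second-to-last path segment).
url = 'https://drive.google.com/uc?id=' + url.split('/')[-2]
df = pd.read_csv(url)
df.to_csv('input.tsv', sep='\t')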
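
The URL rewriting in the cell above depends on the position of the file ID in the sharing link. A slightly more defensive version might look like the sketch below; the function name `drive_direct_url` is hypothetical, and it assumes the sharing link has the `/file/d/<id>/view` layout used above.

```python
from urllib.parse import urlparse


def drive_direct_url(share_url: str) -> str:
    """Convert a '/file/d/<id>/view' sharing link into a 'uc?id=<id>' link."""
    parts = urlparse(share_url).path.strip("/").split("/")
    # Expected path layout: file/d/<id>/view
    if len(parts) < 3 or parts[:2] != ["file", "d"]:
        raise ValueError(f"not a Drive file sharing link: {share_url}")
    return "https://drive.google.com/uc?id=" + parts[2]


share = "https://drive.google.com/file/d/1LULJRPgN_hfuI2kG-wH4FUwXCCdDh9zh/view?usp=sharing"
print(drive_direct_url(share))
```

Unlike the plain `split('/')[-2]`, this raises an error when the link has an unexpected shape instead of silently producing a wrong URL.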

In [None]:
# Use DocumentCollection class to convert your input.tsv to the specific format needed by PrimeQA indexer/retriever.
from primeqa.ir.util.corpus_reader import DocumentCollection
doc_class = DocumentCollection("input.tsv")
doc_class.write_corpus_tsv("output.tsv")
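
Before indexing, it can be worth looking at the first few rows of the converted file to check that the conversion did what you expect. The sketch below uses only the standard library; `preview_tsv` is a hypothetical helper, and it makes no assumption about which columns PrimeQA writes.

```python
import csv
import os
from itertools import islice


def preview_tsv(path: str, n: int = 3) -> list[list[str]]:
    """Return the first n rows of a tab-separated file as lists of strings."""
    with open(path, newline="", encoding="utf-8") as f:
        # QUOTE_NONE keeps quote characters inside passages untouched.
        reader = csv.reader(f, delimiter="\t", quoting=csv.QUOTE_NONE)
        return list(islice(reader, n))


# Only preview the file if the conversion step has already produced it.
if os.path.exists("output.tsv"):
    for row in preview_tsv("output.tsv"):
        print(row)
```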

## Initializing the Indexer

We initialize a ColBERT indexer, which indexes the embeddings created for each document (passage) in the collection. It takes a doc_encoder_model_checkpoint, the model used to create the embedding vectors, and a vector_db specification that says where the embedding vectors are stored for later search.

In [None]:
from primeqa.components.indexer.dense import ColBERTIndexer

# Replace this checkpoint path with the location of your own ColBERT model.
indexer = ColBERTIndexer(
    doc_encoder_model_checkpoint="/dccstor/colbert-ir/bsiyer/PQLL/experiments/xor_squad_04182023/2023-04/22/17.23.31/checkpoints/colbert.dnn.batch_17524.model",
    vector_db='FAISS',
)

In [None]:
indexer.index("output.tsv")
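
Conceptually, an index of this kind maps each passage ID to an embedding and answers a query by returning the passages whose embeddings score highest against the query embedding. The toy sketch below illustrates that idea with a brute-force inner-product search over hand-written vectors. It is not how PrimeQA or FAISS is implemented: the class `ToyVectorIndex` is hypothetical, and the vectors stand in for the output of a real encoder.

```python
import heapq


class ToyVectorIndex:
    """Brute-force inner-product index over small in-memory vectors."""

    def __init__(self) -> None:
        self._vectors: dict[str, list[float]] = {}

    def add(self, doc_id: str, vector: list[float]) -> None:
        """Store the embedding for one passage."""
        self._vectors[doc_id] = vector

    def search(self, query: list[float], k: int = 2) -> list[tuple[str, float]]:
        """Return the k (doc_id, score) pairs with the highest inner product."""
        scored = (
            (doc_id, sum(q * d for q, d in zip(query, vec)))
            for doc_id, vec in self._vectors.items()
        )
        return heapq.nlargest(k, scored, key=lambda pair: pair[1])


# Made-up 3-dimensional "embeddings" for three passages.
index = ToyVectorIndex()
index.add("p1", [0.9, 0.1, 0.0])
index.add("p2", [0.1, 0.8, 0.1])
index.add("p3", [0.0, 0.2, 0.9])

# A made-up query embedding that points in the same direction as p1.
print(index.search([1.0, 0.0, 0.0], k=2))
```

A vector database such as FAISS serves the same purpose but uses specialized data structures so that search stays fast over millions of passages.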

## Initializing the Retriever

We initialize a ColBERT retriever to search the indexed document corpus. Note: since documents are retrieved based on questions, the questions need to be embedded too.

In [None]:
from primeqa.components.retriever.dense import ColBERTRetriever
retriever = ColBERTRetriever(
    indexer=indexer,
    query_encoder_model_checkpoint="/dccstor/colbert-ir/bsiyer/PQLL/experiments/xor_squad_04182023/2023-04/22/17.23.31/checkpoints/colbert.dnn.batch_17524.model",
)
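
ColBERT represents a question and a passage as sets of token-level vectors and scores them with "late interaction": each query token is matched to its most similar passage token, and those maxima are summed (often called MaxSim). The sketch below shows that scoring rule on made-up 2-dimensional vectors. It is an illustration of the scoring idea only, not PrimeQA's code, and the function names are hypothetical.

```python
def dot(u: list[float], v: list[float]) -> float:
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def maxsim_score(query_tokens: list[list[float]],
                 passage_tokens: list[list[float]]) -> float:
    """Sum, over query tokens, of the best match among the passage tokens."""
    return sum(
        max(dot(q, p) for p in passage_tokens)
        for q in query_tokens
    )


# Made-up token embeddings: two query tokens, two tokens per passage.
query = [[1.0, 0.0], [0.0, 1.0]]
passage_a = [[1.0, 0.0], [0.5, 0.5]]
passage_b = [[0.0, 1.0], [0.0, 0.5]]

print("passage_a:", maxsim_score(query, passage_a))
print("passage_b:", maxsim_score(query, passage_b))
```

Here `passage_a` scores higher because it has a good match for both query tokens, while `passage_b` matches only the second one.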



## Start asking Questions

We're now ready to query the index we created.

In [None]:
question = ['What are some famous inventions by Einstein', 'When did Apple introduce iPhone 7']
retrieved_doc_ids, passages = retriever.predict(input_texts=question, mode='query_list', return_passages=True)


Here are the retrieved results:

In [None]:
import json
print(json.dumps(passages, indent = 4))
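
For a more readable view, the results can be printed next to the question they belong to. The sketch below assumes the results form a list with one list of entries per question, in the same order as the input questions; that shape is an assumption, so check it against the JSON output above before reusing this. The helper `format_results` is hypothetical, and the demo data is a stand-in, not real retriever output.

```python
def format_results(questions, results_per_question, max_chars=60):
    """Build a ranked, per-question text listing of the retrieved entries."""
    lines = []
    for question, results in zip(questions, results_per_question):
        lines.append(f"Q: {question}")
        for rank, item in enumerate(results, start=1):
            text = str(item)
            # Truncate long passages so each result fits on one line.
            if len(text) > max_chars:
                text = text[:max_chars] + "..."
            lines.append(f"  {rank}. {text}")
    return "\n".join(lines)


# Hypothetical stand-in data, not real retriever output.
demo_questions = ["q one", "q two"]
demo_results = [["first passage", "second passage"], ["third passage"]]
print(format_results(demo_questions, demo_results))
```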