# Tutorial: Build a search index using DPR #

In this tutorial, we will learn how to build a neural search index over a document collection. The algorithm used here is Dense Passage Retrieval (DPR), described in Karpukhin et al., ["Dense Passage Retrieval for Open-Domain Question Answering"](https://arxiv.org/pdf/2004.04906.pdf).

To keep this tutorial easy to follow, we show the steps on a very small document collection. The same technique scales to millions of documents: we have tested it on up to 21 million Wikipedia passages.


## Preparing a Colab Environment to run this tutorial ##

Make sure to enable a GPU runtime. In Colab, this setting is under Runtime > Change runtime type > Hardware accelerator.

TODO: link to a page with screenshots showing how to do this.
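
Once the runtime is up, it can help to confirm that a GPU is actually visible before installing anything. The snippet below is a minimal sketch, not part of PrimeQA: `gpu_visible` is a hypothetical helper that first tries PyTorch (only if it happens to be installed) and otherwise falls back to the `nvidia-smi` command-line tool.

```python
import shutil
import subprocess


def gpu_visible() -> bool:
    """Return True if an NVIDIA GPU appears to be visible to this runtime."""
    try:
        import torch  # optional: used only if it is already installed

        return bool(torch.cuda.is_available())
    except ImportError:
        pass
    # Fall back to the NVIDIA command-line tool if it is on the PATH.
    if shutil.which("nvidia-smi") is None:
        return False
    result = subprocess.run(["nvidia-smi", "-L"], capture_output=True, text=True)
    return result.returncode == 0 and "GPU" in result.stdout


print("GPU visible:", gpu_visible())
```

If this prints `False` in Colab, the runtime type is most likely still set to CPU.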

## Installing PrimeQA

First, we need to install the required packages.


In [None]:
%%bash

pip install --upgrade pip
pip install primeqa

## Pre-processing your document collection

Pre-process your document collection here so that it is ready to be stored in your neural search index.

TODO: add steps after this to ingest the sample Wikipedia docs.
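
Until those steps are filled in, here is a rough sketch of what pre-processing for DPR typically involves: splitting each document into short passages that keep the document title. The DPR paper uses 100-word passages. The helper name `split_into_passages` and the `title`/`text` dictionary keys are assumptions made for illustration; they are not a confirmed PrimeQA input format.

```python
from typing import Dict, List


def split_into_passages(title: str, text: str, max_words: int = 100) -> List[Dict[str, str]]:
    """Split one document into consecutive passages of at most `max_words` words."""
    words = text.split()
    passages = []
    for start in range(0, len(words), max_words):
        chunk = " ".join(words[start:start + max_words])
        # Hypothetical record layout: each passage carries its document title.
        passages.append({"title": title, "text": chunk})
    return passages


# A one-document toy collection; a real collection would be loaded from disk.
raw_docs = [
    {
        "title": "Kangxi Emperor",
        "text": "The Kangxi Emperor's reign of 61 years makes him the "
                "longest-reigning emperor in Chinese history.",
    },
]

documents = []
for doc in raw_docs:
    # max_words=8 only so that this tiny example yields more than one passage.
    documents.extend(split_into_passages(doc["title"], doc["text"], max_words=8))

for passage in documents:
    print(passage)
```

The resulting `documents` list plays the role of the pre-processed collection used in the embedding step.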

## Initializing the Retriever

We initialize a DPR model to embed the documents in our collection. Because we will later ask questions over this collection, the retriever also needs a query encoder to embed the questions.

In [None]:
from primeqa.retrieve import DPR
from primeqa.embed import DocumentStore

# TODO: remove any unnecessary imports and keep this cell as simple as shown here

document_store = DocumentStore(vector_db='FAISS')

retriever = DPR(document_store=document_store,
                query_embedding_model="PrimeQA/XOR-TyDi_monolingual_DPR_qry_encoder",  # TODO: change to NQ
                passage_embedding_model="PrimeQA/XOR-TyDi_monolingual_DPR_ctx_encoder",
                use_gpu=True, embed_title=True)
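
For context, this is why two separate encoders are configured. In the DPR paper, a question $q$ and a passage $p$ are embedded independently, and relevance is the inner product of the two vectors. Writing $E_Q$ for the query encoder and $E_P$ for the passage encoder (the paper's notation, not PrimeQA identifiers):

$$
\mathrm{sim}(q, p) = E_Q(q)^{\top} E_P(p)
$$

At query time, the passages with the highest $\mathrm{sim}(q, p)$ are returned.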


## Embedding documents into a DPR-based vector DB

In this step, we embed all of the pre-processed documents.

In [None]:
# Add documents to the document store
document_store.add_documents(documents)

# Add document embeddings to index
document_store.update_embeddings(retriever=retriever)
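
To give an intuition for what the vector index does once the embeddings are stored, here is a toy, brute-force sketch of inner-product search in plain Python. `ToyInnerProductIndex` is a hypothetical stand-in written for this tutorial; it is not the FAISS or PrimeQA API, and the three-dimensional vectors are made up rather than produced by a real encoder.

```python
import heapq
from typing import List, Sequence, Tuple


class ToyInnerProductIndex:
    """Brute-force stand-in for a flat inner-product vector index."""

    def __init__(self) -> None:
        self._ids: List[str] = []
        self._vectors: List[Sequence[float]] = []

    def add(self, doc_id: str, vector: Sequence[float]) -> None:
        """Store one passage embedding under its identifier."""
        self._ids.append(doc_id)
        self._vectors.append(vector)

    def search(self, query: Sequence[float], top_k: int = 1) -> List[Tuple[str, float]]:
        """Return the top_k (doc_id, score) pairs, highest inner product first."""
        scored = (
            (doc_id, sum(q * v for q, v in zip(query, vec)))
            for doc_id, vec in zip(self._ids, self._vectors)
        )
        return heapq.nlargest(top_k, scored, key=lambda pair: pair[1])


index = ToyInnerProductIndex()
index.add("kangxi", [0.9, 0.1, 0.3])      # made-up passage embeddings
index.add("qianlong", [0.7, 0.2, 0.1])
index.add("unrelated", [-0.5, 0.8, 0.0])

query_vector = [1.0, 0.0, 0.5]            # made-up question embedding
hits = index.search(query_vector, top_k=2)
print(hits)
```

A real vector DB performs the same ranking, but over much larger collections and with far more efficient data structures than this linear scan.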

## Start asking questions

We're now ready to query the index we created.

In [11]:
# NOTE: `searcher` is not defined in the cells above.
question = ['Who maintained the throne for the longest time in China?']
retrieved_doc_ids, passages = searcher.search(query=question, top_k=1, mode='query_list')

Here are the retrieved results:

In [12]:
import json
print(json.dumps(passages, indent = 4))

[
    {
        "titles": [
            "Kangxi Emperor"
        ],
        "texts": [
            "The Kangxi Emperor's reign of 61 years makes him the longest-reigning emperor in Chinese history (although his grandson, the Qianlong Emperor, had the longest period of \"de facto\" power) and one of the longest-reigning rulers in the world. However, since he ascended the throne at the age of seven, actual power was held for six years by four regents and his grandmother, the Grand Empress Dowager Xiaozhuang."
        ],
        "scores": [
            84.17091369628906
        ]
    }
]
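
The output above suggests one entry per question, each holding parallel `titles`, `texts`, and `scores` lists. Assuming that layout holds in general (it is inferred from this single example), a small helper can flatten the results into rows that are easier to read or post-process. `flatten_results` is a hypothetical name, and the passage text is shortened here.

```python
from typing import Dict, List, Tuple

# Copied from the output above, with the passage text shortened for this sketch.
passages = [
    {
        "titles": ["Kangxi Emperor"],
        "texts": [
            "The Kangxi Emperor's reign of 61 years makes him the "
            "longest-reigning emperor in Chinese history ..."
        ],
        "scores": [84.17091369628906],
    }
]


def flatten_results(results: List[Dict[str, list]]) -> List[Tuple[str, float, str]]:
    """Turn per-question parallel lists into (title, score, text) rows."""
    rows = []
    for per_question in results:
        for title, text, score in zip(
            per_question["titles"], per_question["texts"], per_question["scores"]
        ):
            rows.append((title, score, text))
    return rows


for title, score, text in flatten_results(passages):
    print(f"{score:.2f}\t{title}\t{text[:50]}")
```

With `top_k` greater than 1, each of the three lists would presumably hold one item per retrieved passage, and the helper would emit one row for each.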
