# Tutorial: Retrieval of documents from a corpus using Neural Information Retrieval (IR)

In this tutorial you'll learn how to use a popular Neural IR system called DPR [Karpukhin2020].


## Step 0: Install the required packages

In [12]:

! pip install primeqa

## Step 1: Init -- initialize your model.

PrimeQA searches your corpus through a class called `SearchableCorpus`. For DPR, you need to point it to a question encoder model and a context encoder model, both available via the HuggingFace model hub.

In [6]:
from primeqa.components.retriever.searchable_corpus import SearchableCorpus
collection = SearchableCorpus(model_name="PrimeQA/XOR-TyDi_monolingual_DPR_ctx_encoder", 
                              query_encoder_model_name_or_path="PrimeQA/XOR-TyDi_monolingual_DPR_qry_encoder", 
                              batch_size=64, top_k=10)



{"time":"2023-06-14 12:29:17,621", "name": "faiss.loader", "level": "INFO", "message": "Loading faiss."}
{"time":"2023-06-14 12:29:17,646", "name": "faiss.loader", "level": "INFO", "message": "Successfully loaded faiss."}
No CUDA runtime is found, using CUDA_HOME='/usr/local/cuda'


## Step 2: Add -- add your documents into the searchable corpus.

In this step you create a TSV file with the following format:

`id \t text \t title_of_document`

Note: DPR is based on an encoder language model whose maximum sequence length is typically 512 sub-word tokens, so please make sure your documents are split into passages of about 220 words each.
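
The passage splitting and TSV layout described above can be sketched as follows. This is a minimal illustration, not part of PrimeQA: the helper names `split_into_passages` and `write_collection_tsv` are hypothetical, the toy document is made up, and splitting on whitespace at a fixed word count is only one simple way to stay near the ~220-word guideline.

```python
import io


def split_into_passages(text, max_words=220):
    """Split text into consecutive chunks of at most max_words words."""
    words = text.split()  # also collapses tabs and newlines inside the text
    return [" ".join(words[i:i + max_words])
            for i in range(0, len(words), max_words)]


def write_collection_tsv(documents, out_file, max_words=220):
    """Write (title, text) pairs as id/text/title rows; return the passage count."""
    out_file.write("id\ttext\ttitle\n")  # header row, as the tutorial expects
    passage_id = 0
    for title, text in documents:
        clean_title = " ".join(title.split())  # keep tabs out of the title column
        for passage in split_into_passages(text, max_words):
            passage_id += 1
            out_file.write(f"{passage_id}\t{passage}\t{clean_title}\n")
    return passage_id


# Hypothetical toy document of 500 words, written to an in-memory buffer.
docs = [("History of Idaho", "word " * 500)]
buffer = io.StringIO()
n = write_collection_tsv(docs, buffer)
print(n)  # 500 words -> passages of 220, 220 and 60 words -> 3
```

To produce a real file, pass an open file handle (for example `open("sample.tsv", "w", encoding="utf-8")`) instead of the in-memory buffer.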

In [9]:
# Update this path to point to your collection TSV file (columns: id\ttext\ttitle, including this header row)
path_to_collection_tsv_file="../path_to_tsv/sample.tsv"
collection.add_documents(path_to_collection_tsv_file)

9it [00:00, 8021.41it/s]


9
{"time":"2023-06-14 12:40:46,771", "name": "primeqa.ir.dense.dpr_top.dpr.index_simple_corpus", "level": "INFO", "message": "wrote passages_1_of_1.json.gz.records in 0 seconds"}
{"time":"2023-06-14 12:40:46,772", "name": "primeqa.ir.dense.dpr_top.dpr.faiss_index", "level": "INFO", "message": "building index, reading data from dpr_index_dir/passages_1_of_1.json.gz.records, writing to dpr_index_dir/index_1_of_1.faiss"}
{"time":"2023-06-14 12:40:46,829", "name": "primeqa.ir.dense.dpr_top.dpr.faiss_index", "level": "INFO", "message": "processed 0 passages"}
{"time":"2023-06-14 12:40:46,830", "name": "primeqa.ir.dense.dpr_top.dpr.faiss_index", "level": "INFO", "message": "calling index.add with 9 vectors"}
{"time":"2023-06-14 12:40:46,832", "name": "primeqa.ir.dense.dpr_top.dpr.faiss_index", "level": "INFO", "message": "processed 9 passages"}
{"time":"2023-06-14 12:40:46,832", "name": "primeqa.ir.dense.dpr_top.dpr.faiss_index", "level": "INFO", "message": "finished building index, writing 

Some weights of the model checkpoint at PrimeQA/XOR-TyDi_monolingual_DPR_ctx_encoder were not used when initializing DPRQuestionEncoder: ['ctx_encoder.bert_model.encoder.layer.10.attention.output.LayerNorm.weight', 'ctx_encoder.bert_model.encoder.layer.8.attention.output.dense.bias', 'ctx_encoder.bert_model.encoder.layer.8.attention.self.query.weight', 'ctx_encoder.bert_model.encoder.layer.11.attention.output.LayerNorm.weight', 'ctx_encoder.bert_model.encoder.layer.5.attention.self.query.weight', 'ctx_encoder.bert_model.encoder.layer.10.intermediate.dense.bias', 'ctx_encoder.bert_model.encoder.layer.7.attention.self.value.weight', 'ctx_encoder.bert_model.encoder.layer.5.output.LayerNorm.bias', 'ctx_encoder.bert_model.encoder.layer.5.attention.self.value.weight', 'ctx_encoder.bert_model.encoder.layer.4.attention.self.query.weight', 'ctx_encoder.bert_model.encoder.layer.0.attention.output.dense.weight', 'ctx_encoder.bert_model.encoder.layer.3.output.LayerNorm.bias', 'ctx_encoder.bert_mod

{"time":"2023-06-14 12:40:49,841", "name": "primeqa.ir.dense.dpr_top.dpr.searcher", "level": "INFO", "message": "Using sharded faiss, reading shards from dpr_index_dir"}
{"time":"2023-06-14 12:40:49,843", "name": "primeqa.ir.dense.dpr_top.dpr.searcher", "level": "INFO", "message": "Reading passages_1_of_1.json.gz.records"}
{"time":"2023-06-14 12:40:49,844", "name": "primeqa.ir.dense.dpr_top.dpr.searcher", "level": "INFO", "message": "Using sharded faiss with 1 shards."}


## Step 3: Search -- start asking questions.

Your queries can be a list. You can also retrieve the scores of retrieved documents.
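
Judging from the output printed in this step, `passages` holds one dictionary per query, each with parallel `titles` and `texts` lists. A minimal sketch for pairing them up is shown here on hand-made sample data rather than a live `collection.search` call. The helper `top_hits` is hypothetical, and it assumes results come back in query order with the best hit first.

```python
def top_hits(queries, passages, n=1):
    """Pair each query with its first n (title, text) hits."""
    results = {}
    for query, hits in zip(queries, passages):
        pairs = list(zip(hits["titles"], hits["texts"]))
        results[query] = pairs[:n]
    return results


# Hand-made sample mirroring the structure of the printed output (placeholder texts).
sample_queries = ["When was Idaho split in two?"]
sample_passages = [
    {
        "titles": ["History of Idaho", "History of Idaho"],
        "texts": ["first placeholder passage", "second placeholder passage"],
    }
]

for query, hits in top_hits(sample_queries, sample_passages).items():
    for title, text in hits:
        print(f"{query} -> {title}: {text[:40]}")
```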

In [10]:
import json

queries = ['When was Idaho split in two?', 'Who was Danny Nozel']
retrieved_doc_ids, passages = collection.search(queries)
#res, scores = collection.search2(queries)
print(json.dumps(passages, indent=4))

[
    {
        "titles": [
            "History of Idaho",
            "History of Idaho",
            "History of Idaho",
            "History of Idaho",
            "History of Idaho",
            "History of Idaho",
            "History of Idaho",
            "History of Idaho",
            "History of Idaho",
            "History of Idaho"
        ],
        "texts": [
            ", early American fur companies in this region had difficulty maintaining the long - distance supply lines from the Missouri River system into the Intermountain West . However , Americans William H. Ashley and Jedediah Smith expanded the Saint Louis fur trade into Idaho in 1824 . The 1832 trapper 's rendezvous at Pierre 's Hole , held at the foot of the Three Tetons in modern Teton County , was followed by an intense battle between the Gros Ventre and a large party of American trappers aided by their Nez Perce and Flathead allies . The prospect of missionary work among the Native Americans also attracted