# Tutorial: Retrieval of documents from a corpus using Neural Information Retrieval (IR)

In this tutorial you'll learn how to use a popular Neural IR system called DPR [Karpukhin2020].


## Step 0: Install the required packages

In [None]:
! pip install primeqa

## Step 1: Init -- initialize your model.

To search through your corpus, PrimeQA provides a class called `SearchableCorpus`.

For DPR, you need to point to a question encoder model and a context encoder model, both available on the Hugging Face model hub.

In [3]:
# from primeqa.components.retriever.searchable_corpus import SearchableCorpus
from primeqa.util import SearchableCorpus
collection = SearchableCorpus(context_encoder_name_or_path="PrimeQA/XOR-TyDi_monolingual_DPR_ctx_encoder",
                              query_encoder_name_or_path="PrimeQA/XOR-TyDi_monolingual_DPR_qry_encoder",
                              batch_size=64, top_k=10)
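
For intuition about what the two encoders and `top_k` are for: DPR embeds questions and passages into the same vector space and ranks passages by the inner product between the question vector and each passage vector [Karpukhin2020]. The sketch below illustrates that ranking step with small hand-written vectors standing in for encoder outputs. The helper names and the toy vectors are hypothetical, and this is not PrimeQA's implementation, which builds a FAISS index as the logs in Step 2 show.

```python
import heapq

def inner_product(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def top_k_passages(query_vec, passage_vecs, k):
    """Return the k (passage_id, score) pairs with the highest scores, best first."""
    scored = ((pid, inner_product(query_vec, vec)) for pid, vec in passage_vecs.items())
    return heapq.nlargest(k, scored, key=lambda pair: pair[1])

# Hypothetical 3-dimensional embeddings; real encoder outputs are much higher-dimensional.
passage_vecs = {
    "p1": [0.9, 0.1, 0.0],
    "p2": [0.2, 0.8, 0.1],
    "p3": [0.0, 0.3, 0.9],
}
query_vec = [1.0, 0.2, 0.0]

for pid, score in top_k_passages(query_vec, passage_vecs, k=2):
    print(pid, round(score, 3))
```

With `k=2` only the two best-scoring passages are returned, which mirrors the role `top_k=10` plays in the cell above.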

## Step 2: Add -- add your documents into the searchable corpus.

In this step you create a TSV file with the following tab-separated columns:
`id \t text \t title_of_document`
Note: DPR is based on an encoder language model whose maximum sequence length is typically 512 sub-word tokens, so please make sure your documents are split into passages of roughly 220 words each.
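
If your documents are longer than that, one possible way to prepare the file is sketched below. It splits each document on whitespace into chunks of at most 220 words and writes one row per chunk. The function names, the sequential integer ids, and the `(title, text)` input format are assumptions made for this illustration, not part of PrimeQA.

```python
def split_into_passages(text, max_words=220):
    """Split text into consecutive chunks of at most max_words whitespace-separated words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]

def write_collection_tsv(documents, path, max_words=220):
    """Write (title, text) pairs to a TSV file with the header id, text, title.

    Returns the number of passages written. Ids are assigned sequentially
    starting at 1 (an assumption for this sketch).
    """
    passage_id = 0
    with open(path, "w", encoding="utf-8") as f:
        f.write("id\ttext\ttitle\n")
        for title, text in documents:
            # Collapse tabs and newlines in the title so the row stays three columns wide.
            clean_title = " ".join(title.split())
            for passage in split_into_passages(text, max_words):
                passage_id += 1
                f.write(f"{passage_id}\t{passage}\t{clean_title}\n")
    return passage_id
```

Splitting on whitespace also removes stray tabs and line breaks from the passage text. A word-based split ignores sentence boundaries, so you may prefer a sentence-aware splitter for your own data.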

In [6]:
# Update this path to point to your collection TSV file, which needs the header row id\ttext\ttitle
path_to_collection_tsv_file="../../../../ibm-generative-ai-cookbooks/notebooks/docs_and_qs/psgs.tsv"
collection.add_documents(path_to_collection_tsv_file)

{"time":"2023-06-19 16:37:31,346", "name": "primeqa.ir.dense.dpr_top.dpr.index_simple_corpus", "level": "INFO", "message": "wrote passages_1_of_1.json.gz.records in 17 seconds"}
{"time":"2023-06-19 16:37:31,347", "name": "primeqa.ir.dense.dpr_top.dpr.faiss_index", "level": "INFO", "message": "building index, reading data from dpr_index_dir/passages_1_of_1.json.gz.records, writing to dpr_index_dir/index_1_of_1.faiss"}
{"time":"2023-06-19 16:37:31,350", "name": "primeqa.ir.dense.dpr_top.dpr.faiss_index", "level": "INFO", "message": "processed 0 passages"}
{"time":"2023-06-19 16:37:31,685", "name": "primeqa.ir.dense.dpr_top.dpr.faiss_index", "level": "INFO", "message": "calling index.add with 10601 vectors"}
{"time":"2023-06-19 16:37:33,624", "name": "primeqa.ir.dense.dpr_top.dpr.faiss_index", "level": "INFO", "message": "processed 10601 passages"}
{"time":"2023-06-19 16:37:33,625", "name": "primeqa.ir.dense.dpr_top.dpr.faiss_index", "level": "INFO", "message": "finished building index, w

{"time":"2023-06-19 16:37:46,980", "name": "primeqa.ir.dense.dpr_top.dpr.searcher", "level": "INFO", "message": "Using sharded faiss, reading shards from dpr_index_dir"}
{"time":"2023-06-19 16:37:46,982", "name": "primeqa.ir.dense.dpr_top.dpr.searcher", "level": "INFO", "message": "Reading passages_1_of_1.json.gz.records"}
{"time":"2023-06-19 16:37:46,993", "name": "primeqa.ir.dense.dpr_top.dpr.searcher", "level": "INFO", "message": "Using sharded faiss with 1 shards."}


## Step 3: Search -- start asking questions.

Your queries can be a list. You can also retrieve the scores of retrieved documents.
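
The printed output in this step suggests that `passages` holds one dictionary per query, each with parallel `titles` and `texts` lists. Assuming that structure, and assuming the results come back in the same order as the queries, a small helper like the one below can pair each query with its first few hits. The helper name and the shortened sample data are hypothetical.

```python
def top_hits(passages, queries, n=3):
    """Map each query to its first n (title, text) pairs."""
    hits = {}
    for query, result in zip(queries, passages):
        hits[query] = list(zip(result["titles"], result["texts"]))[:n]
    return hits

# Hypothetical, shortened stand-in for the structure returned by the search call.
sample_queries = ["When was Idaho split in two?"]
sample_passages = [
    {
        "titles": ["History of Idaho", "Alaska"],
        "texts": ["first passage text", "second passage text"],
    }
]

for query, results in top_hits(sample_passages, sample_queries, n=1).items():
    print(query)
    for title, text in results:
        print(f"  {title}: {text}")
```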

In [7]:
import json

queries = ['When was Idaho split in two?', 'Who was Danny Nozel']
retrieved_doc_ids, passages = collection.search(queries)
#res, scores = collection.search2(queries)
print(json.dumps(passages, indent=4))

[
    {
        "titles": [
            "History of Idaho",
            "History of Idaho",
            "Territorial evolution of the United States",
            "History of Idaho",
            "History of Idaho",
            "List of United States cities by population",
            "List of capitals in the United States",
            "Alaska",
            "Treaty of Guadalupe Hidalgo",
            "Contiguous United States"
        ],
        "texts": [
            "When President Benjamin Harrison signed the law admitting Idaho as a U.S. state on July 3 , 1890 , the population was 88,548 . George L. Shoup became the state 's first governor , but resigned after only a few weeks in office to take a seat in the United States Senate .",
            "Despite their best efforts , early American fur companies in this region had difficulty maintaining the long - distance supply lines from the Missouri River system into the Intermountain West . However , Americans William H. Ashley and Jededi