# Dense IR using _Pipelines_ interface

In this notebook, we show how to index data and run a search using the _Pipelines_ API.
So that the notebook runs (almost) instantaneously, we use trivially small training data and a trivially small collection to search.


## Initial setup

We start by defining variables specifying locations of data we will use:

In [1]:
import os
import tempfile

test_files_location = '../../../tests/resources/ir_dense'
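# mkdtemp() keeps the directory until we delete it ourselves; a TemporaryDirectory()
# context manager would remove it as soon as the `with` block exits.
working_dir = tempfile.mkdtemp()
output_dir = os.path.join(working_dir, 'output_dir')

index_root = os.path.join(output_dir, 'index_root')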
index_name = 'index_name'

## Indexing

To run indexing, we need an existing model (checkpoint). In this example, we use the Dr.Decr model, which must be present in the test examples directory (`test_files_location` above). To download it, run the following on a command line from that directory:
```
 wget https://huggingface.co/PrimeQA/DrDecr_XOR-TyDi_whitebox/resolve/main/DrDecr.dnn
```
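If you prefer to stay inside the notebook, the same download can be sketched with the Python standard library. This is only an illustration: `fetch_checkpoint` is a hypothetical helper, not part of PrimeQA, and the actual download call is left commented out because it needs network access and the checkpoint may be large. The helper skips the download when the file is already present.

```python
import os
import urllib.request

# URL taken from the wget command above.
CHECKPOINT_URL = "https://huggingface.co/PrimeQA/DrDecr_XOR-TyDi_whitebox/resolve/main/DrDecr.dnn"


def fetch_checkpoint(url, target_dir):
    """Download `url` into `target_dir` unless the file is already there.

    Hypothetical helper for illustration; returns the local file path.
    """
    os.makedirs(target_dir, exist_ok=True)
    target_fn = os.path.join(target_dir, os.path.basename(url))
    if not os.path.exists(target_fn):
        urllib.request.urlretrieve(url, target_fn)
    return target_fn


# Example call (commented out because it performs a network download):
# checkpoint_fn = fetch_checkpoint(CHECKPOINT_URL, '../../../tests/resources/ir_dense')
```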
Next, we index a collection of documents using the model checkpoint from the previous step. The collection is a TSV file containing each document's ID, text, and title.

In [2]:
collection_fn = os.path.join(test_files_location, "xorqa.train_ir_001pct_at_0_pct_collection_fornum.tsv")

Here is an example document:

In [3]:
import pandas as pd
from IPython.display import display, HTML
data = pd.read_csv(collection_fn, sep='\t', header=0, nrows=1)
display(HTML(data.to_html()))

|   | id | text | title |
|---|----|------|-------|
| 0 | 1 | The Kangxi Emperor's reign of 61 years makes him the longest-reigning emperor in Chinese history (although his grandson, the Qianlong Emperor, had the longest period of "de facto" power) and one of the longest-reigning rulers in the world. However, since he ascended the throne at the age of seven, actual power was held for six years by four regents and his grandmother, the Grand Empress Dowager Xiaozhuang. | Kangxi Emperor |
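
To try the notebook on your own documents, you would need a file in the same layout. The sketch below writes a toy collection with `id`, `text`, and `title` columns, mirroring the sample row above, and reads it back. The documents and the names `toy_dir` and `toy_collection_fn` are made up for illustration, and we assume (the notebook does not state it explicitly) that the indexer expects the same header and column order as the sample file.

```python
import csv
import os
import tempfile

# Made-up documents, for illustration only.
docs = [
    {"id": 1, "text": "Paris is the capital of France.", "title": "Paris"},
    {"id": 2, "text": "The Danube flows through ten countries.", "title": "Danube"},
]

toy_dir = tempfile.mkdtemp()
toy_collection_fn = os.path.join(toy_dir, "toy_collection.tsv")

# Write a header line followed by one tab-separated line per document.
with open(toy_collection_fn, "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["id", "text", "title"], delimiter="\t")
    writer.writeheader()
    writer.writerows(docs)

# Read the file back to check the layout.
with open(toy_collection_fn, newline="", encoding="utf-8") as f:
    rows = list(csv.DictReader(f, delimiter="\t"))

print(rows[0]["title"])
```

Note that `csv.DictReader` returns every field as a string, so the IDs come back as `"1"` and `"2"`.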


Next we instantiate the indexer and index the collection:

In [4]:
from primeqa.components.indexer.dense import ColBERTIndexer
os.makedirs(index_root, exist_ok = True)
checkpoint_fn = os.path.join(test_files_location, "DrDecr.dnn")

indexer = ColBERTIndexer(checkpoint = checkpoint_fn, index_root = index_root, index_name = index_name, num_partitions_max = 2)
indexer.load()
indexer.index(collection = collection_fn, overwrite=True)

No CUDA runtime is found, using CUDA_HOME='/opt/share/cuda-10.1/x86_64'


[Oct 13, 15:27:25] #> Note: Output directory /u/franzm/git8/PrimeQA_services/primeqa/notebooks/ir/dense/experiments/default/indexes/index_name already exists


[Oct 13, 15:27:25] #> Will delete 10 files already at /u/franzm/git8/PrimeQA_services/primeqa/notebooks/ir/dense/experiments/default/indexes/index_name in 20 seconds...
#> Starting...
No CUDA runtime is found, using CUDA_HOME='/opt/share/cuda-10.1/x86_64'
{
    "ncells": null,
    "centroid_score_threshold": null,
    "ndocs": null,
    "index_path": null,
    "nbits": 1,
    "kmeans_niters": 4,
    "num_partitions_max": 2,
    "similarity": "cosine",
    "bsize": 64,
    "accumsteps": 1,
    "lr": 3e-6,
    "maxsteps": 500000,
    "save_every": null,
    "resume": false,
    "resume_optimizer": false,
    "warmup": null,
    "warmup_bert": null,
    "relu": false,
    "nway": 2,
    "use_ib_negatives": false,
    "reranker": false,
    "distillation_alpha



[Oct 13, 15:28:34] [0] 		 # of sampled PIDs = 7 	 sampled_pids[:3] = [3, 5, 0]
[Oct 13, 15:28:34] [0] 		 #> Encoding 7 passages..
[Oct 13, 15:28:34] #> checkpoint, docFromText, Input: title | text, 		 64
[Oct 13, 15:28:34] #> XLMR DocTokenizer.tensorize(batch_text[0], batch_background[0], bsize) ==
[Oct 13, 15:28:34] #> Input: $ title | text, 		 64
[Oct 13, 15:28:34] #> Output IDs: torch.Size([180]), tensor([    0,  9749, 44759,     6, 58745,  7986,     2,     1,     1,     1,
            1,     1,     1,     1,     1,     1,     1,     1,     1,     1,
            1,     1,     1,     1,     1,     1,     1,     1,     1,     1,
            1,     1,     1,     1,     1,     1,     1,     1,     1,     1,
            1,     1,     1,     1,     1,     1,     1,     1,     1,     1,
            1,     1,     1,     1,     1,     1,     1,     1,     1,     1,
            1,     1,     1,     1,     1,     1,     1,     1,     1,     1,
            1,     1,     1,     1,     1,     1, 

1it [00:00,  1.01it/s]
100%|██████████| 2/2 [00:00<00:00, 13252.15it/s]


[Oct 13, 15:28:36] [0] 		 #> Saving chunk 0: 	 7 passages and 1,220 embeddings. From #0 onward.
[Oct 13, 15:28:36] offset: 0
[Oct 13, 15:28:36] chunk codes size(0): 1220
[Oct 13, 15:28:36] codes size(0): 1220
[Oct 13, 15:28:36] codes size(): torch.Size([1220])
[Oct 13, 15:28:36] >>>>partition.size(0): 2
[Oct 13, 15:28:36] >>>>num_partition: 2
[Oct 13, 15:28:36] #> Optimizing IVF to store map from centroids to list of pids..
[Oct 13, 15:28:36] #> Building the emb2pid mapping..
[Oct 13, 15:28:36] len(emb2pid) = 1220
[Oct 13, 15:28:36] #> Saved optimized IVF to /u/franzm/git8/PrimeQA_services/primeqa/notebooks/ir/dense/experiments/default/indexes/index_name/ivf.pid.pt
[Oct 13, 15:28:36] [0] 		 #> Saving the indexing metadata to /u/franzm/git8/PrimeQA_services/primeqa/notebooks/ir/dense/experiments/default/indexes/index_name/metadata.json ..

#> Joined...


## Search
Next, we use the trained model and the index to search the collection, using queries in the form of a list of strings:

In [5]:
from primeqa.components.retriever.dense import ColBERTRetriever

retriever = ColBERTRetriever(index_root = index_root, index_name = index_name, ndocs = 5, max_num_documents = 2)
retriever.load()
results = retriever.retrieve(input_texts = ['Who is Michael Wigge'])

[Oct 13, 15:28:37] #> base_config.py from_path /u/franzm/git8/PrimeQA_services/primeqa/notebooks/ir/dense/experiments/default/indexes/index_name/metadata.json
[Oct 13, 15:28:37] #> base_config.py from_path args loaded! 
[Oct 13, 15:28:37] #> base_config.py from_path args replaced ! 
[Oct 13, 15:28:40] #> Loading collection...
0M 
[Oct 13, 15:28:40] #>>>>> at ColBERT name (model type) : ../../../tests/resources/ir_dense/DrDecr.dnn
[Oct 13, 15:28:40] #>>>>> at BaseColBERT name (model type) : ../../../tests/resources/ir_dense/DrDecr.dnn
[Oct 13, 15:28:46] factory model type: xlm-roberta-base
[Oct 13, 15:28:59] Loading segmented_maxsim_cpp extension (set COLBERT_LOAD_TORCH_EXTENSION_VERBOSE=True for more info)...
[Oct 13, 15:29:03] get query model type: xlm-roberta-base
[Oct 13, 15:29:04] get doc model type: xlm-roberta-base
[Oct 13, 15:29:05] #> Loading codec...
[Oct 13, 15:29:05] #> base_config.py from_path /u/franzm/git8/PrimeQA_services/primeqa/notebooks/ir/dense/experiments/default/in



[Oct 13, 15:29:06] Loading filter_pids_cpp extension (set COLBERT_LOAD_TORCH_EXTENSION_VERBOSE=True for more info)...
[Oct 13, 15:29:06] Loading decompress_residuals_cpp extension (set COLBERT_LOAD_TORCH_EXTENSION_VERBOSE=True for more info)...
[Oct 13, 15:29:07] #> XMLR QueryTokenizer.tensorize(batch_text[0], batch_background[0], bsize) ==
[Oct 13, 15:29:07] #> Input: $ Who is Michael Wigge, 		 True, 		 None
[Oct 13, 15:29:07] #> Output IDs: torch.Size([32]), tensor([    0,  9748, 40469,    83, 11617,  5140, 23359,     2,     1,     1,
            1,     1,     1,     1,     1,     1,     1,     1,     1,     1,
            1,     1,     1,     1,     1,     1,     1,     1,     1,     1,
            1,     1])
[Oct 13, 15:29:07] #> Output Mask: torch.Size([32]), tensor([1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0])
[Oct 13, 15:29:07] #>>>> colbert query ==
[Oct 13, 15:29:07] #>>>>> input_ids: torch.Size([32]), tensor([    0, 



[Oct 13, 15:29:07] #>>>> before linear query ==
[Oct 13, 15:29:07] #>>>>> Q: torch.Size([32, 768]), tensor([[-0.0133, -0.0071, -0.0338,  ..., -0.1157,  0.2084,  0.0775],
        [-0.4863, -0.1339, -0.5917,  ..., -0.3788,  0.8931,  0.5833],
        [-0.3398, -0.3887, -0.4318,  ..., -0.1584,  0.9153,  0.1573],
        ...,
        [-0.3348, -0.1802, -0.1922,  ..., -0.2447,  0.4070,  0.6802],
        [-0.3348, -0.1802, -0.1922,  ..., -0.2447,  0.4070,  0.6802],
        [-0.3348, -0.1802, -0.1922,  ..., -0.2447,  0.4070,  0.6802]])
[Oct 13, 15:29:07] #>>>>> self.linear query : Parameter containing:
tensor([[-0.0167, -0.0032, -0.0119,  ..., -0.0170,  0.0129, -0.0077],
        [-0.0066,  0.0045, -0.0204,  ..., -0.0280,  0.0008, -0.0067],
        [-0.0126, -0.0006,  0.0135,  ..., -0.0071, -0.0104, -0.0233],
        ...,
        [-0.0122,  0.0324,  0.0043,  ...,  0.0306,  0.0014,  0.0217],
        [ 0.0048,  0.0010,  0.0126,  ..., -0.0258,  0.0022,  0.0084],
        [-0.0069,  0.0300, -0.0226,

100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 85.83it/s]


Here is the top search result for our query, containing document_id and score:

In [6]:
results[0][0]

(6, 12.305487632751465)
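
Judging from the indexing expression `results[0][0]`, the result appears to be a list with one entry per query, each entry being a list of `(document_id, score)` tuples. The sketch below walks such a structure. `example_results` is a hypothetical stand-in so that the snippet runs without the retriever; its first tuple is copied from the output above and the second one is made up.

```python
# Hypothetical stand-in with the same shape as `results`:
# one inner list per query, each holding (document_id, score) tuples.
example_results = [[(6, 12.305487632751465), (3, 9.5)]]

lines = []
for query_idx, hits in enumerate(example_results):
    for rank, (doc_id, score) in enumerate(hits, start=1):
        lines.append(f"query {query_idx}, rank {rank}: document {doc_id}, score {score:.3f}")
print("\n".join(lines))

# Unpack the top hit of the first query.
top_doc_id, top_score = example_results[0][0]
```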

Here is the top retrieved document:

In [7]:
top_doc_id = str(results[0][0][0])
with open(collection_fn, 'r') as f:
    for line in f:
        # The document ID is the first tab-separated field of each line
        if line.split('\t')[0] == top_doc_id:
            print(line)

6	Michael Wigge Michael Wigge is a travel writer and entertainment personality in Europe and in the United States. His work is characterized by a mixture of journalism and entertainment. His specialties are cultural issues which he examines in a very entertaining way. In 2002, Wigge drew attention to himself in Germany for the first time on TV broadcaster VIVA plus presenting comedy clips on the daily show “London Calling”. In this context he sets a record for the longest donkey ride in music television history and visits the Queen of England, dressed as King Henry VIII, on her 50th throne	Michael Wigge
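
When several hits need to be resolved, scanning the file once per hit becomes wasteful. One possible alternative is to load the collection into a dictionary keyed by document ID. The sketch below does this for an in-memory sample in the same `id`/`text`/`title` layout. `sample_tsv`, `docs_by_id`, and `hits` are illustrative names, the first sample row is abbreviated from the output above, and the second row is a placeholder.

```python
import csv
import io

# In-memory sample in the collection's layout (text abbreviated, second row made up).
sample_tsv = (
    "id\ttext\ttitle\n"
    "6\tMichael Wigge is a travel writer ...\tMichael Wigge\n"
    "7\tPlaceholder text\tPlaceholder title\n"
)

# QUOTE_NONE keeps quote characters inside the text as ordinary characters.
reader = csv.DictReader(io.StringIO(sample_tsv), delimiter="\t", quoting=csv.QUOTE_NONE)
docs_by_id = {row["id"]: row for row in reader}

# Hits in the (document_id, score) form shown earlier.
hits = [(6, 12.305487632751465)]
for doc_id, score in hits:
    doc = docs_by_id[str(doc_id)]  # IDs read from the file are strings
    print(f"{doc['title']} (score {score:.2f}): {doc['text']}")
```

For a real collection, `io.StringIO(sample_tsv)` would be replaced by the opened collection file.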

