# Dense IR using the _Pipelines_ interface

In this notebook, we show how to index data and run a search using the _Pipelines_ API.
In order to run (almost) instantaneously, we use trivially small training data and a trivially small collection to search.


## Initial setup

We start by defining variables specifying locations of data we will use:

In [1]:
import os
import tempfile

model_name_or_path = "PrimeQA/DrDecr_XOR-TyDi_whitebox-2"
test_files_location = '../../../tests/resources/ir_dense'
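# mkdtemp() creates a directory that persists after this cell runs;
# a `with tempfile.TemporaryDirectory()` block would delete it on exit.
working_dir = tempfile.mkdtemp()
output_dir = os.path.join(working_dir, 'output_dir')

index_root = os.path.join(output_dir, 'index_root')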
index_name = 'index_name'

## Indexing

To run indexing, we need an existing model (checkpoint). For this tutorial, we will use the [Dr.Decr](https://huggingface.co/PrimeQA/DrDecr_XOR-TyDi_whitebox-2) model from Hugging Face.
Next, we will index a collection of documents using the representations produced by that model. The collection is a TSV file containing each document's ID, text, and title.

In [2]:
collection_fn = os.path.join(test_files_location, "xorqa.train_ir_001pct_at_0_pct_collection_fornum.tsv")

Here is an example document:

In [3]:
import pandas as pd
from IPython.display import display, HTML
data = pd.read_csv(collection_fn, sep='\t', header=0, nrows=1)
display(HTML(data.to_html()))

|   | id | text | title |
|---|---|---|---|
| 0 | 1 | The Kangxi Emperor's reign of 61 years makes him the longest-reigning emperor in Chinese history (although his grandson, the Qianlong Emperor, had the longest period of "de facto" power) and one of the longest-reigning rulers in the world. However, since he ascended the throne at the age of seven, actual power was held for six years by four regents and his grandmother, the Grand Empress Dowager Xiaozhuang. | Kangxi Emperor |
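
To try the same steps on our own data, we can write a collection file in the same layout. The following is a minimal sketch, assuming the header row and column order (`id`, `text`, `title`) seen in the example document; the file name `my_collection.tsv` and the two rows are hypothetical, and whether the indexer accepts other layouts is not covered here.

```python
import csv
import os
import tempfile

# Hypothetical documents; only the column names come from the example above.
rows = [
    {"id": 1, "text": "First passage text.", "title": "First title"},
    {"id": 2, "text": "Second passage text.", "title": "Second title"},
]

# Write a tab-separated file with a header row.
my_collection_fn = os.path.join(tempfile.mkdtemp(), "my_collection.tsv")
with open(my_collection_fn, "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(
        f, fieldnames=["id", "text", "title"], delimiter="\t", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)

# Read the header back to confirm the layout.
with open(my_collection_fn, encoding="utf-8") as f:
    header = f.readline().rstrip("\n").split("\t")
print(header)
```

The resulting path could then be used in place of `collection_fn` in the cells that follow.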


Next, we instantiate the indexer and index the collection:

In [4]:
from primeqa.components.indexer.dense import ColBERTIndexer
os.makedirs(index_root, exist_ok = True)
#checkpoint_fn = os.path.join(test_files_location, "DrDecr.dnn")

indexer = ColBERTIndexer(checkpoint = model_name_or_path, index_root = index_root, index_name = index_name, num_partitions_max = 2)
indexer.load()
indexer.index(collection = collection_fn, overwrite=True)

No CUDA runtime is found, using CUDA_HOME='/opt/share/cuda-11.8'
{"time":"2023-11-03 04:28:54,917", "name": "faiss.loader", "level": "INFO", "message": "Loading faiss."}
{"time":"2023-11-03 04:28:54,980", "name": "faiss.loader", "level": "INFO", "message": "Successfully loaded faiss."}
[Nov 03, 04:28:55] #> base_config.py load_from_checkpoint PrimeQA/DrDecr_XOR-TyDi_whitebox-2
[Nov 03, 04:28:55] #> base_config.py load_from_checkpoint PrimeQA/DrDecr_XOR-TyDi_whitebox-2/artifact.metadata


[Nov 03, 04:28:55] #> Creating directory /tmp/tmpvtm_ip29/output_dir/index_root/index_name 


#> Starting...
No CUDA runtime is found, using CUDA_HOME='/opt/share/cuda-11.8'
{"time":"2023-11-03 04:29:22,528", "name": "faiss.loader", "level": "INFO", "message": "Loading faiss."}
{"time":"2023-11-03 04:29:22,557", "name": "faiss.loader", "level": "INFO", "message": "Successfully loaded faiss."}
{
    "ncells": null,
    "centroid_score_threshold": null,
    "ndocs": null,
    "index_path": "\/tmp\/tmpvtm



[Nov 03, 04:29:35] [0] 		 # of sampled PIDs = 7 	 sampled_pids[:3] = [3, 5, 0]
[Nov 03, 04:29:35] [0] 		 #> Encoding 7 passages..
[Nov 03, 04:29:35] #> checkpoint, docFromText, Input: title | text, 		 64
[Nov 03, 04:29:35] #> XLMR DocTokenizer.tensorize(batch_text[0], batch_background[0], bsize) ==
[Nov 03, 04:29:35] #> Input: $ title | text, 		 64
[Nov 03, 04:29:35] #> Output IDs: torch.Size([180]), tensor([    0,  9749, 44759,     6, 58745,  7986,     2,     1,     1,     1,
            1,     1,     1,     1,     1,     1,     1,     1,     1,     1,
            1,     1,     1,     1,     1,     1,     1,     1,     1,     1,
            1,     1,     1,     1,     1,     1,     1,     1,     1,     1,
            1,     1,     1,     1,     1,     1,     1,     1,     1,     1,
            1,     1,     1,     1,     1,     1,     1,     1,     1,     1,
            1,     1,     1,     1,     1,     1,     1,     1,     1,     1,
            1,     1,     1,     1,     1,     1, 

0it [00:00, ?it/s]

[Nov 03, 04:29:38] [0] 		 #> Saving chunk 0: 	 7 passages and 1,222 embeddings. From #0 onward.
[Nov 03, 04:29:38] offset: 0
[Nov 03, 04:29:38] chunk codes size(0): 1222
[Nov 03, 04:29:38] codes size(0): 1222
[Nov 03, 04:29:38] codes size(): torch.Size([1222])
[Nov 03, 04:29:38] >>>>partition.size(0): 2
[Nov 03, 04:29:38] >>>>num_partition: 2
[Nov 03, 04:29:38] #> Optimizing IVF to store map from centroids to list of pids..
[Nov 03, 04:29:38] #> Building the emb2pid mapping..
[Nov 03, 04:29:38] len(emb2pid) = 1222
[Nov 03, 04:29:38] #> Saved optimized IVF to /tmp/tmpvtm_ip29/output_dir/index_root/index_name/ivf.pid.pt
[Nov 03, 04:29:38] [0] 		 #> Saving the indexing metadata to /tmp/tmpvtm_ip29/output_dir/index_root/index_name/metadata.json ..


1it [00:01,  1.10s/it]1it [00:01,  1.11s/it]
  0%|          | 0/2 [00:00<?, ?it/s]100%|██████████| 2/2 [00:00<00:00, 15679.64it/s]



#> Joined...


## Search

Next, we use the trained model and the index to search the collection, using queries in the form of a list of strings:

In [5]:
from primeqa.components.retriever.dense import ColBERTRetriever

retriever = ColBERTRetriever(checkpoint = model_name_or_path, index_root = index_root, index_name = index_name, ndocs = 5, max_num_documents = 2)
retriever.load()
results = retriever.predict(input_texts = ['Who is Michael Wigge'])

[Nov 03, 04:30:28] #> base_config.py from_path /tmp/tmpvtm_ip29/output_dir/index_root/index_name/metadata.json
[Nov 03, 04:30:28] #> base_config.py from_path args loaded! 
[Nov 03, 04:30:28] #> base_config.py from_path args replaced ! 
[Nov 03, 04:30:28] #> base_config.py load_from_checkpoint PrimeQA/DrDecr_XOR-TyDi_whitebox-2
[Nov 03, 04:30:28] #> base_config.py load_from_checkpoint PrimeQA/DrDecr_XOR-TyDi_whitebox-2/artifact.metadata
[Nov 03, 04:30:28] #>>>>> at ColBERT name (model name) : PrimeQA/DrDecr_XOR-TyDi_whitebox-2
[Nov 03, 04:30:28] #>>>>> at BaseColBERT name (model name) : PrimeQA/DrDecr_XOR-TyDi_whitebox-2
[Nov 03, 04:30:28] #> base_config.py load_from_checkpoint PrimeQA/DrDecr_XOR-TyDi_whitebox-2
[Nov 03, 04:30:28] #> base_config.py load_from_checkpoint PrimeQA/DrDecr_XOR-TyDi_whitebox-2/artifact.metadata
[Nov 03, 04:30:28] factory model type: xlm-roberta
[Nov 03, 04:30:38] Loading segmented_maxsim_cpp extension (set COLBERT_LOAD_TORCH_EXTENSION_VERBOSE=True for more inf



[Nov 03, 04:30:40] Loading filter_pids_cpp extension (set COLBERT_LOAD_TORCH_EXTENSION_VERBOSE=True for more info)...
[Nov 03, 04:30:41] Loading decompress_residuals_cpp extension (set COLBERT_LOAD_TORCH_EXTENSION_VERBOSE=True for more info)...
[Nov 03, 04:30:41] #> XMLR QueryTokenizer.tensorize(batch_text[0], batch_background[0], bsize) ==
[Nov 03, 04:30:41] #> Input: $ Who is Michael Wigge, 		 True, 		 None
[Nov 03, 04:30:41] #> Output IDs: torch.Size([32]), tensor([    0,  9748, 40469,    83, 11617,  5140, 23359,     2,     1,     1,
            1,     1,     1,     1,     1,     1,     1,     1,     1,     1,
            1,     1,     1,     1,     1,     1,     1,     1,     1,     1,
            1,     1])
[Nov 03, 04:30:41] #> Output Mask: torch.Size([32]), tensor([1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0])
[Nov 03, 04:30:41] #>>>> colbert query ==
[Nov 03, 04:30:41] #>>>>> input_ids: torch.Size([32]), tensor([    0, 



[Nov 03, 04:30:42] #>>>> before linear query ==
[Nov 03, 04:30:42] #>>>>> Q: torch.Size([32, 768]), tensor([[-0.0056,  0.1896, -0.2805,  ..., -0.5975, -0.3414,  0.4975],
        [-0.5635,  0.2752, -0.2151,  ..., -0.9036,  0.3531,  0.5631],
        [-0.3071,  0.2370, -0.1805,  ..., -0.5553,  0.3696,  0.3634],
        ...,
        [-0.1919,  0.1687, -0.2729,  ..., -0.7984, -0.0091,  0.5765],
        [-0.1919,  0.1687, -0.2729,  ..., -0.7984, -0.0091,  0.5765],
        [-0.1919,  0.1687, -0.2729,  ..., -0.7984, -0.0091,  0.5765]])
[Nov 03, 04:30:42] #>>>>> self.linear query : Parameter containing:
tensor([[-0.0286,  0.0017, -0.0202,  ..., -0.0262,  0.0210,  0.0006],
        [-0.0102,  0.0121, -0.0111,  ..., -0.0362, -0.0165, -0.0012],
        [-0.0047, -0.0172, -0.0054,  ..., -0.0069, -0.0194, -0.0193],
        ...,
        [-0.0286,  0.0231,  0.0004,  ...,  0.0373, -0.0045,  0.0125],
        [ 0.0051,  0.0023,  0.0212,  ..., -0.0254,  0.0034,  0.0206],
        [-0.0068,  0.0256, -0.0263,

100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 84.29it/s]
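
The log above mentions a `segmented_maxsim_cpp` extension. For intuition only: ColBERT-style retrievers score a document by "late interaction", where each query token embedding is matched with its most similar document token embedding and the per-token maxima are summed. The sketch below illustrates that idea in plain Python with toy two-dimensional vectors; the function name and the data are hypothetical, and this is not the library's implementation, which works on compressed, batched embeddings.

```python
def maxsim_score(query_vecs, doc_vecs):
    """Sum, over query tokens, of the best dot product with any document token."""
    total = 0.0
    for q in query_vecs:
        # Best-matching document token for this query token.
        best = max(sum(qi * di for qi, di in zip(q, d)) for d in doc_vecs)
        total += best
    return total


# Toy embeddings: one vector per token (made up for illustration).
query = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]  # has an exact match for each query token
doc_b = [[0.5, 0.5]]                          # only a partial match

score_a = maxsim_score(query, doc_a)
score_b = maxsim_score(query, doc_b)
print(score_a, score_b)
```

Under this scoring, `doc_a` ranks above `doc_b` because every query token finds a closer match in it.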


Here is the top search result for our query, a `(document_id, score)` tuple:

In [6]:
results[0][0]

(6, 18.614036560058594)
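
Since `results[0][0]` is the first hit for the first query, `results` appears to hold one list of `(document_id, score)` tuples per query. The following sketch, under that assumption, walks such a structure and produces one ranked line per hit. It uses a stand-in `example_results` so it runs on its own; only the first tuple is taken from the output above, the second is made up, and `format_hits` is a hypothetical helper.

```python
# Stand-in for `results`: one list of (document_id, score) tuples per query.
# The second tuple is invented for illustration.
example_results = [[(6, 18.614036560058594), (3, 12.5)]]
example_queries = ["Who is Michael Wigge"]


def format_hits(queries, results):
    """Return tab-separated lines: query, rank, document id, rounded score."""
    lines = []
    for query, hits in zip(queries, results):
        for rank, (doc_id, score) in enumerate(hits, start=1):
            lines.append(f"{query}\t{rank}\t{doc_id}\t{score:.2f}")
    return lines


hit_lines = format_hits(example_queries, example_results)
for line in hit_lines:
    print(line)
```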

Here is the top retrieved document:

In [7]:
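# Scan the TSV collection for the row whose first column (id) matches the top result.
top_doc_id = str(results[0][0][0])
with open(collection_fn, 'r') as f:
    for line in f:
        if line.split('\t')[0] == top_doc_id:
            print(line)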

6	Michael Wigge Michael Wigge is a travel writer and entertainment personality in Europe and in the United States. His work is characterized by a mixture of journalism and entertainment. His specialties are cultural issues which he examines in a very entertaining way. In 2002, Wigge drew attention to himself in Germany for the first time on TV broadcaster VIVA plus presenting comedy clips on the daily show “London Calling”. In this context he sets a record for the longest donkey ride in music television history and visits the Queen of England, dressed as King Henry VIII, on her 50th throne	Michael Wigge
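
Scanning the file line by line is fine for a tiny collection. When looking up many results, it may be more convenient to load the collection once into a dictionary keyed by document ID. The following is a minimal sketch, assuming the `id`, `text`, `title` header seen earlier; the in-memory rows and the `load_collection` helper are hypothetical.

```python
import csv
import io

# Tiny in-memory stand-in for the collection file (hypothetical rows).
tsv_text = (
    "id\ttext\ttitle\n"
    "1\tFirst passage text.\tFirst title\n"
    "6\tSixth passage text.\tSixth title\n"
)


def load_collection(handle):
    """Map each document id (as a string) to its row dictionary."""
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    return {row["id"]: row for row in reader}


docs_by_id = load_collection(io.StringIO(tsv_text))

# IDs read from the file are strings, so convert the retrieved id before the lookup.
top_doc = docs_by_id[str(6)]
print(top_doc["title"])
```

With the real file, `open(collection_fn, newline='')` would take the place of the `io.StringIO` object.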

