## Dense IR
In this notebook, we show how to train a model, index data, and run search using Neural IR.

In order to run (almost) instantaneously, we use a trivially small training set and a trivially small collection to search.

## Configuration
First, we need to import the required modules.

In [1]:
import os
import tempfile
 
from oneqa.ir.dense.colbert_top.colbert.utils.utils import create_directory, print_message
from oneqa.ir.dense.colbert_top.colbert.infra import Run, RunConfig
from oneqa.ir.dense.colbert_top.colbert.infra.config import ColBERTConfig
from oneqa.ir.dense.colbert_top.colbert.training.training import train
from oneqa.ir.dense.colbert_top.colbert.indexing.collection_indexer import encode
from oneqa.ir.dense.colbert_top.colbert.searcher import Searcher

## Training
We will train a ColBERT model using a TSV file containing [query, positive document, negative document] triples.
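
The triples layout described above can be illustrated with a minimal sketch. It assumes one triple per line with three tab-separated fields; the sample rows and the `read_triples` helper are hypothetical and are not part of the oneqa package.

```python
import csv
import io

# Hypothetical in-memory triples file: query <TAB> positive passage <TAB> negative passage.
triples_tsv = (
    "who wrote hamlet?\tHamlet is a tragedy by William Shakespeare.\tThe Nile is a river in Africa.\n"
    "capital of france?\tParis is the capital of France.\tMount Everest is in the Himalayas.\n"
)


def read_triples(handle):
    """Yield (query, positive, negative) tuples from a tab-separated stream."""
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) != 3:
            raise ValueError(f"expected 3 columns, got {len(row)}")
        yield tuple(row)


triples = list(read_triples(io.StringIO(triples_tsv)))
for query, positive, negative in triples:
    print(f"{query!r}: positive={positive[:30]!r} negative={negative[:30]!r}")
```

Validating the column count up front makes a malformed row fail early rather than during training.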

The path in `test_files_location` below points to the files used by the notebook; by default, it points to the files used by CI testing.

In [2]:
test_files_location = '../../../tests/resources/ir_dense'
model_type = 'xlm-roberta-base'
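# mkdtemp() leaves the directory in place for the later cells;
# a TemporaryDirectory() context manager would delete it on exit.
working_dir = tempfile.mkdtemp()
output_dir = os.path.join(working_dir, 'output_dir')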
text_triples_fn = os.path.join(test_files_location, "xorqa.train_ir_negs_5_poss_1_001pct_at_0pct.tsv")

In [3]:
import pandas as pd
from IPython.display import display, HTML
data = pd.read_csv(text_triples_fn, sep='\t', nrows=1, header=None)
display(HTML(data.to_html()))

| | 0 | 1 | 2 |
|---|---|---|---|
| 0 | 중국에서 가장 오랜기간 왕위를 유지한 인물은 누구인가? | Kangxi Emperor The Kangxi Emperor's reign of 61 years makes him the longest-reigning emperor in Chinese history (although his grandson, the Qianlong Emperor, had the longest period of "de facto" power) and one of the longest-reigning rulers in the world. However, since he ascended the throne at the age of seven, actual power was held for six years by four regents and his grandmother, the Grand Empress Dowager Xiaozhuang. | Chiddy Bang new songs from the duo and in November 2009 debuted the group's first free mixtape entitled "The Swelly Express". On 28 April 2011 during the first-ever MTV O Music Awards, Anamege broke the Guinness World Record for Longest Freestyle Rap and Longest Marathon Rapping Record by freestyling for 9 hours, 18 minutes, and 22 seconds, stealing the throne from rapper M-Eighty, who originally broke the record in 2009 by rapping for 9 hours, 15 minutes and 15 seconds. Anamege had also beat Canadian rapper D.O. for Longest Marathon Rapping session, the previous record being for 8 hours and 45 minutes. |


In [4]:
args_dict = {
                'root': output_dir,
                'experiment': 'test_training',
                'triples': text_triples_fn,
                'similarity': 'l2',
                'model_type': model_type,
                'maxsteps': 3,
                'bsize': 1,
                'accumsteps': 1,
                'amp': True,
                'epochs': 1,
                'rank': 0,
                'nranks': 1
            }
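
The `'similarity': 'l2'` setting above selects how query and document token embeddings are compared. As a rough illustration, the sketch below follows the late-interaction scoring described in the ColBERT paper, using negative squared Euclidean distance as the token-level similarity. This is an assumption about what the setting means here; the function names are hypothetical and the oneqa implementation may differ in detail.

```python
def neg_squared_l2(u, v):
    """Token-level similarity: negative squared Euclidean distance."""
    return -sum((a - b) ** 2 for a, b in zip(u, v))


def late_interaction_score(query_vecs, doc_vecs, sim=neg_squared_l2):
    """Sum, over query tokens, of the best similarity to any document token."""
    return sum(max(sim(q, d) for d in doc_vecs) for q in query_vecs)


# Toy 2-dimensional token embeddings (hypothetical values).
query = [[1.0, 0.0], [0.0, 1.0]]
doc_close = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
doc_far = [[-1.0, 0.0], [0.0, -1.0]]

score_close = late_interaction_score(query, doc_close)
score_far = late_interaction_score(query, doc_far)
print(score_close, score_far)
```

A document whose tokens sit closer to the query tokens receives the higher score, which is the property the training triples are meant to teach.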

Next, we train the model and save its location in the `latest_model_fn` variable.

In [5]:
with Run().context(RunConfig(root=args_dict['root'], experiment=args_dict['experiment'], nranks=args_dict['nranks'], amp=args_dict['amp'])):
    colBERTConfig = ColBERTConfig(**args_dict)
    latest_model_fn = train(colBERTConfig, text_triples_fn, None, None)

{
    "nprobe": 2,
    "ncandidates": 8192,
    "index_path": null,
    "nbits": 1,
    "kmeans_niters": 20,
    "num_partitions_max": 10000000,
    "similarity": "l2",
    "bsize": 1,
    "accumsteps": 1,
    "lr": 3e-6,
    "maxsteps": 3,
    "save_every": null,
    "resume": false,
    "resume_optimizer": false,
    "warmup": null,
    "warmup_bert": null,
    "relu": false,
    "nway": 2,
    "use_ib_negatives": false,
    "reranker": false,
    "distillation_alpha": 1.0,
    "ignore_scores": false,
    "shuffle_every_epoch": false,
    "save_steps": 2000,
    "save_epochs": -1,
    "epochs": 1,
    "input_arguments": {},
    "model_type": "xlm-roberta-base",
    "init_from_lm": null,
    "local_models_repository": null,
    "ranks_fn": null,
    "topK": 100,
    "student_teacher_temperature": 1.0,
    "student_teacher_top_loss_weight": 0.5,
    "teacher_model_type": "xlm-roberta-base",
    "teacher_doc_maxlen": 180,
    "distill_query_passage_separately": false,
    "query_only": 

Some weights of the model checkpoint at xlm-roberta-base were not used when initializing HF_ColBERT_XLMR: ['lm_head.dense.weight', 'lm_head.dense.bias', 'lm_head.layer_norm.bias', 'lm_head.layer_norm.weight', 'lm_head.bias', 'lm_head.decoder.weight']
- This IS expected if you are initializing HF_ColBERT_XLMR from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing HF_ColBERT_XLMR from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of HF_ColBERT_XLMR were not initialized from the model checkpoint at xlm-roberta-base and are newly initialized: ['bert.encoder.layer.2.attention.self.key.bias', 'bert.encoder.layer.4.attention.output.dense.weight', 'bert.encoder.layer.7.attention.output.LayerNorm.weight

[May 05, 11:46:01] maxsteps: 3
[May 05, 11:46:01] 1 epochs of 5 examples
[May 05, 11:46:01] batch size: 1
[May 05, 11:46:01] maxsteps set to 3
[May 05, 11:46:01] start batch idx: 0
[May 05, 11:46:01] #> XMLR QueryTokenizer.tensorize(batch_text[0], batch_background[0], bsize) ==
[May 05, 11:46:01] #> Input: $ 중국에서 가장 오랜기간 왕위를 유지한 인물은 누구인가?, 		 True, 		 None
[May 05, 11:46:01] #> Output IDs: torch.Size([32]), tensor([     0,   9748,  24120,   1180,  13968, 211059,  83639,  76826,  78363,
         57104,    993, 161732,    697, 116932, 114150,     32,      2,      1,
             1,      1,      1,      1,      1,      1,      1,      1,      1,
             1,      1,      1,      1,      1])
[May 05, 11:46:01] #> Output Mask: torch.Size([32]), tensor([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0])
[May 05, 11:46:01] #> XLMR DocTokenizer.tensorize(batch_text[0], batch_background[0], bsize) ==
[May 05, 11:46:01] #> Input: $ "Kangxi



[May 05, 11:46:01] #>>>> before linear query ==
[May 05, 11:46:01] #>>>>> Q: torch.Size([32, 768]), tensor([[ 0.1269,  0.0868,  0.0588,  ..., -0.0978,  0.0423,  0.0024],
        [ 0.0237,  0.0695,  0.0303,  ..., -0.0671, -0.0818, -0.0653],
        [ 0.0626,  0.0780,  0.0324,  ...,  0.0213, -0.0027,  0.1598],
        ...,
        [ 0.1314,  0.0969,  0.0671,  ..., -0.1059,  0.0561, -0.0338],
        [ 0.0398,  0.0168,  0.0476,  ..., -0.0210, -0.0167, -0.0480],
        [ 0.0871,  0.0832,  0.0208,  ..., -0.0546,  0.0003,  0.0225]],
       grad_fn=<SelectBackward0>)
[May 05, 11:46:01] #>>>>> self.linear query : Parameter containing:
tensor([[-1.2613e-02, -3.6824e-03, -1.0773e-02,  ..., -1.3896e-02,
          1.2244e-02, -8.2326e-03],
        [-3.4988e-03,  3.9623e-03, -2.4905e-02,  ..., -2.6214e-02,
          4.0837e-03, -8.0812e-03],
        [-1.1158e-02, -2.1660e-05,  1.5019e-02,  ..., -1.1310e-02,
         -9.3691e-03, -2.2202e-02],
        ...,
        [-1.0295e-02,  3.2577e-02,  3.3341

## Indexing
Next, we will index a collection of documents, using representations from the model trained in the previous step.
The collection is a TSV file containing each document's ID, text, and title.

In [6]:
collection_fn = os.path.join(test_files_location, "xorqa.train_ir_001pct_at_0_pct_collection_fornum.tsv")

Here is an example document:

In [7]:
data = pd.read_csv(collection_fn, sep='\t', header=0, nrows=1)
#data = pd.read_csv(collection_fn, sep='\t', header=None, skiprows=3, nrows=1)
display(HTML(data.to_html()))

| | id | text | title |
|---|---|---|---|
| 0 | 1 | The Kangxi Emperor's reign of 61 years makes him the longest-reigning emperor in Chinese history (although his grandson, the Qianlong Emperor, had the longest period of "de facto" power) and one of the longest-reigning rulers in the world. However, since he ascended the throne at the age of seven, actual power was held for six years by four regents and his grandmother, the Grand Empress Dowager Xiaozhuang. | Kangxi Emperor |


Here are the indexer arguments:

In [8]:
args_dict = {
                'root': os.path.join(output_dir,'test_indexing'),
                'experiment': 'test_indexing',
                'checkpoint': latest_model_fn,
                'collection': collection_fn,
                'index_root': os.path.join(output_dir, 'test_indexing', 'indexes'),
                'index_name': 'index_name',
                'doc_maxlen': 180,
                'num_partitions_max': 2,
                'kmeans_niters': 1,
                'nway': 1,
                'rank': 0,
                'nranks': 1,
                'amp': True
            }
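
The names `num_partitions_max` and `kmeans_niters` suggest that the indexer groups token embeddings into partitions with k-means. The sketch below is a plain-Python k-means on toy 2-dimensional vectors, meant only to show what those two knobs control conceptually. It is an assumption about the approach, not the oneqa indexer; the helper names and the naive initialization are hypothetical.

```python
def squared_distance(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))


def assign(embeddings, centroids):
    """Return, for each embedding, the index of its nearest centroid."""
    return [
        min(range(len(centroids)), key=lambda c: squared_distance(e, centroids[c]))
        for e in embeddings
    ]


def update(embeddings, codes, centroids):
    """Move each centroid to the mean of the embeddings assigned to it."""
    new_centroids = []
    for c, old in enumerate(centroids):
        members = [e for e, code in zip(embeddings, codes) if code == c]
        if not members:
            new_centroids.append(old)  # an empty partition keeps its centroid
            continue
        new_centroids.append(
            [sum(m[i] for m in members) / len(members) for i in range(len(old))]
        )
    return new_centroids


def kmeans(embeddings, num_partitions, niters):
    centroids = [list(e) for e in embeddings[:num_partitions]]  # naive init: first points
    for _ in range(niters):
        codes = assign(embeddings, centroids)
        centroids = update(embeddings, codes, centroids)
    return centroids, assign(embeddings, centroids)


embeddings = [[0.0, 0.1], [1.0, 0.9], [0.1, 0.0], [0.9, 1.0]]
centroids, codes = kmeans(embeddings, num_partitions=2, niters=1)
print(codes)
```

With only a handful of embeddings, as in this notebook, a small partition count and a single iteration keep the run fast.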

Here we run the indexer:

In [9]:
with Run().context(RunConfig(root=args_dict['root'], experiment=args_dict['experiment'], nranks=args_dict['nranks'], amp=args_dict['amp'])):
    colBERTConfig = ColBERTConfig(**args_dict)
    create_directory(colBERTConfig.index_path_)
    encode(colBERTConfig, collection_fn, None, None)




[May 05, 11:46:18] #> Creating directory /tmp/tmpol5v7ufl/output_dir/test_indexing/indexes/index_name 


{
    "nprobe": 2,
    "ncandidates": 8192,
    "index_path": null,
    "nbits": 1,
    "kmeans_niters": 1,
    "num_partitions_max": 2,
    "similarity": "cosine",
    "bsize": 32,
    "accumsteps": 1,
    "lr": 3e-6,
    "maxsteps": 500000,
    "save_every": null,
    "resume": false,
    "resume_optimizer": false,
    "warmup": null,
    "warmup_bert": null,
    "relu": false,
    "nway": 1,
    "use_ib_negatives": false,
    "reranker": false,
    "distillation_alpha": 1.0,
    "ignore_scores": false,
    "shuffle_every_epoch": false,
    "save_steps": 2000,
    "save_epochs": -1,
    "epochs": 10,
    "input_arguments": {},
    "model_type": "bert-base-uncased",
    "init_from_lm": null,
    "local_models_repository": null,
    "ranks_fn": null,
    "topK": 100,
    "student_teacher_temperature": 1.0,
    "student_teacher_top_loss_weight": 0.5,
    "teacher_model_type": "xlm-



[May 05, 11:46:31] #>>>> before linear doc ==
[May 05, 11:46:31] #>>>>> D: torch.Size([180, 768]), tensor([[ 7.1369e-02,  9.6666e-02,  5.4361e-02,  ..., -1.4448e-01,
          5.4727e-02,  1.2758e-04],
        [ 5.5677e-03,  2.6758e-02,  3.2364e-02,  ..., -1.3423e-01,
          1.2431e-02,  1.1201e-01],
        [-4.3298e-02,  9.1787e-02, -3.2852e-02,  ..., -2.9251e-01,
         -2.6410e-02,  8.9396e-02],
        ...,
        [-5.3112e-02,  5.4709e-02,  4.0843e-03,  ..., -1.7059e-01,
         -4.2763e-02,  2.6375e-03],
        [-5.3112e-02,  5.4709e-02,  4.0843e-03,  ..., -1.7059e-01,
         -4.2763e-02,  2.6375e-03],
        [-5.3112e-02,  5.4709e-02,  4.0843e-03,  ..., -1.7059e-01,
         -4.2763e-02,  2.6375e-03]])
[May 05, 11:46:31] #>>>>> self.linear doc : Parameter containing:
tensor([[-1.2620e-02, -3.6884e-03, -1.0780e-02,  ..., -1.3896e-02,
          1.2237e-02, -8.2247e-03],
        [-3.5069e-03,  3.9557e-03, -2.4896e-02,  ..., -2.6217e-02,
          4.0822e-03, -8.0775e-03

0it [00:00, ?it/s]

[May 05, 11:46:31] [0] 		 #> Encoding 7 passages..
[May 05, 11:46:32] [0] 		 #> Saving chunk 0: 	 7 passages and 1,220 embeddings. From #0 onward.


1it [00:00,  1.03it/s]

[May 05, 11:46:32] offset: 0
[May 05, 11:46:32] chunk codes size(0): 1220
[May 05, 11:46:32] codes size(0): 1220
[May 05, 11:46:32] codes size(): torch.Size([1220])
[May 05, 11:46:32] >>>>partition.size(0): 2
[May 05, 11:46:32] >>>>num_partition: 2
[May 05, 11:46:32] [0] 		 #> Saving the indexing metadata to /tmp/tmpol5v7ufl/output_dir/test_indexing/indexes/index_name/metadata.json ..





The resulting index files are written to `output_dir/test_indexing/indexes/index_name/`, with the indexing metadata in `metadata.json`.

## Search
Next, we use the trained model and the index to search the collection, using queries from a TSV query file.

In [10]:
queries_fn = os.path.join(test_files_location, "xorqa.train_ir_001pct_at_0_pct_queries_fornum.tsv")

Here are the search arguments:

In [11]:
args_dict = {
                'root': output_dir,
                'experiment': 'test_indexing' ,
                'checkpoint': latest_model_fn,
                'model_type': model_type,
                'collection': collection_fn,
                'index_root': output_dir,
                'index_name': 'index_name',
                'queries': queries_fn,
                #'ranks_fn': ranks_fn,
                'bsize': 1,
                'topK': 1,
                'nprobe': 1,
                'nway': 1,
                'rank': 0,
                'nranks': 1,
                'amp': True,
            }

Here we initialize and run the searcher:

In [12]:
with Run().context(RunConfig(root=args_dict['root'], experiment=args_dict['experiment'], nranks=args_dict['nranks'], amp=args_dict['amp'])):
    colBERTConfig = ColBERTConfig(**args_dict)
    searcher = Searcher(args_dict['index_name'], checkpoint=args_dict['checkpoint'], collection=args_dict['collection'], config=colBERTConfig)
    rankings = searcher.search_all(args_dict['queries'], args_dict['topK'])

[May 05, 11:46:33] #> base_config.py from_path /tmp/tmpol5v7ufl/output_dir/test_indexing/indexes/index_name/metadata.json
[May 05, 11:46:33] #> base_config.py from_path args loaded! 
[May 05, 11:46:33] #> base_config.py from_path args replaced ! 
[May 05, 11:46:33] #> base_config.py load_from_checkpoint /tmp/tmpol5v7ufl/output_dir/test_training/none/2022-05/05/11.45.43/checkpoints/colbert
[May 05, 11:46:33] #> base_config.py load_from_checkpoint /tmp/tmpol5v7ufl/output_dir/test_training/none/2022-05/05/11.45.43/checkpoints/colbert/artifact.metadata
[May 05, 11:46:33] #> base_config.py from_path /tmp/tmpol5v7ufl/output_dir/test_training/none/2022-05/05/11.45.43/checkpoints/colbert/artifact.metadata
[May 05, 11:46:33] #> base_config.py from_path args loaded! 
[May 05, 11:46:33] #> Loading collection...
0M 
[May 05, 11:46:33] #>>>>> at ColBERT name (model type) : /tmp/tmpol5v7ufl/output_dir/test_training/none/2022-05/05/11.45.43/checkpoints/colbert
[May 05, 11:46:33] #>>>>> at BaseColBERT

100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00,  7.05it/s]


Here is the search result for our query, as a (query_id, document_id, rank, score) tuple:

In [13]:
rankings.flat_ranking[0]

(-7239279093922981232, 1, 1, 29.515625)
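
Positional tuples like the one above are easy to misread, so it can help to give the fields names. The sketch below wraps the printed tuple in a `NamedTuple`; the `RankedResult` class is hypothetical and is not provided by the searcher.

```python
from typing import NamedTuple


class RankedResult(NamedTuple):
    """Hypothetical named view of one (query_id, document_id, rank, score) entry."""
    query_id: int
    document_id: int
    rank: int
    score: float


# The tuple printed in the notebook output above.
raw = (-7239279093922981232, 1, 1, 29.515625)
result = RankedResult(*raw)
print(f"query {result.query_id}: document {result.document_id} "
      f"at rank {result.rank} (score {result.score:.2f})")
```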

Here is the text of the query:

In [14]:
with open(queries_fn, 'r') as f:
    for line in f.readlines():
        if str(rankings.flat_ranking[0][0]) == line.split()[0]:
            print(line)

-7239279093922981232	중국에서 가장 오랜기간 왕위를 유지한 인물은 누구인가?



English translation: `Who maintained the throne for the longest time in China?`

Here is the top retrieved document:

In [15]:
with open(collection_fn, 'r') as f:
    for line in f.readlines():
        if str(rankings.flat_ranking[0][1]) == line.split()[0]:
            print(line)

1	"The Kangxi Emperor's reign of 61 years makes him the longest-reigning emperor in Chinese history (although his grandson, the Qianlong Emperor, had the longest period of ""de facto"" power) and one of the longest-reigning rulers in the world. However, since he ascended the throne at the age of seven, actual power was held for six years by four regents and his grandmother, the Grand Empress Dowager Xiaozhuang."	Kangxi Emperor
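
The two lookups above scan a file once per result. For more than a few results, one alternative is to build an ID-to-row dictionary once and reuse it. The sketch below assumes tab-separated files with the ID in the first column; the sample data and the `load_first_column_index` helper are hypothetical.

```python
import csv
import io


def load_first_column_index(handle):
    """Map the first tab-separated field of each row to the remaining fields."""
    index = {}
    for row in csv.reader(handle, delimiter="\t"):
        if row:
            index[row[0]] = row[1:]
    return index


# Hypothetical stand-ins for the queries and collection files.
queries = io.StringIO("q1\twho ruled longest?\nq2\twhere is paris?\n")
collection = io.StringIO(
    "id\ttext\ttitle\n1\tA long reign.\tKangxi Emperor\n2\tA city.\tParis\n"
)

query_text = load_first_column_index(queries)
documents = load_first_column_index(collection)

# Same (query_id, document_id, rank, score) layout as the search results.
flat_ranking = [("q1", 1, 1, 29.5)]
for query_id, document_id, rank, score in flat_ranking:
    print(query_text[str(query_id)][0], "->", documents[str(document_id)])
```

Matching on the exact first tab-separated field also avoids accidental matches when an ID is a prefix of another.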

