## DR.DECR (Dense Retrieval with Distillation-Enhanced Cross-Lingual Representation) Student/Teacher Training
In this notebook, we show how to train a model using Knowledge Distillation (Student/Teacher) to improve performance on cross-lingual retrieval, as described in `Learning Cross-Lingual IR from an English Retriever`, https://arxiv.org/abs/2112.08185.

After training the model, we index the data and run search.  


In order to run (almost) instantaneously, we use trivially small training data and a trivially small collection to search.

## Dependencies
If not already done, please make sure to install PrimeQA with notebooks extras before getting started.

In [1]:
# If you want CUDA 11 uncomment and run this (for CUDA 10 or CPU you can ignore this line).
#! pip install 'torch~=1.11.0' --extra-index-url https://download.pytorch.org/whl/cu113

# Uncomment to install PrimeQA from source (pypi package pending).
# The path should be the project root (e.g. '.' below).
#! pip install .[notebooks]

## Configuration
First, we need to include the required modules.

In [2]:
import os
import tempfile
 
from primeqa.ir.dense.colbert_top.colbert.utils.utils import create_directory, print_message
from primeqa.ir.dense.colbert_top.colbert.infra import Run, RunConfig
from primeqa.ir.dense.colbert_top.colbert.infra.config import ColBERTConfig
from primeqa.ir.dense.colbert_top.colbert.training.training import train
from primeqa.ir.dense.colbert_top.colbert.indexing.collection_indexer import encode
from primeqa.ir.dense.colbert_top.colbert.searcher import Searcher

No CUDA runtime is found, using CUDA_HOME='/opt/share/cuda-10.1/x86_64'
{"time":"2022-10-26 08:44:32,566", "name": "faiss.loader", "level": "INFO", "message": "Loading faiss."}
{"time":"2022-10-26 08:44:33,466", "name": "faiss.loader", "level": "INFO", "message": "Successfully loaded faiss."}


## Training
We will train a ColBERT model using a TSV file containing [query, positive document, negative document] triples. We use the XOR-TyDi dataset, as described here: https://nlp.cs.washington.edu/xorqa/
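
As a minimal sketch of this file layout, the snippet below writes and reads back a tab-separated triples file with the standard `csv` module. The sample rows and the helper names `write_triples` and `read_triples` are made up for illustration; they are not part of PrimeQA or the XOR-TyDi data.

```python
import csv
import io

# Hypothetical rows: (query, positive passage, negative passage).
rows = [
    ("who wrote hamlet?", "Hamlet is a tragedy by William Shakespeare.", "The Nile is a river in Africa."),
    ("capital of france?", "Paris is the capital of France.", "Mount Everest is in the Himalayas."),
]


def write_triples(handle, triples):
    """Write one tab-separated triple per line."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerows(triples)


def read_triples(handle):
    """Return a list of (query, positive, negative) tuples, checking the column count."""
    triples = []
    for line_no, row in enumerate(csv.reader(handle, delimiter="\t"), start=1):
        if len(row) != 3:
            raise ValueError(f"line {line_no}: expected 3 columns, got {len(row)}")
        triples.append(tuple(row))
    return triples


# Round-trip through an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
write_triples(buffer, rows)
buffer.seek(0)
triples = read_triples(buffer)
```

Checking the column count early makes a malformed line fail at load time rather than somewhere inside training.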

The path in `test_files_location` below points to the location of the files used by the notebook. By default, it points to the files used by CI testing.

In [3]:
test_files_location = '../../../tests/resources/ir_dense'
model_type = 'xlm-roberta-base'
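# Use mkdtemp() so the directory persists for the rest of the notebook;
# a `with tempfile.TemporaryDirectory()` block would delete it on exit.
working_dir = tempfile.mkdtemp()
output_dir = os.path.join(working_dir, 'output_dir')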
text_triples_fn = os.path.join(test_files_location, "xorqa.train_ir_negs_5_poss_1_001pct_at_0pct.tsv")

Here is an example of a training file record:

In [4]:
import pandas as pd
from IPython.display import display, HTML
data = pd.read_csv(text_triples_fn, sep='\t', nrows=1, header=None)
display(HTML(data.to_html()))

Unnamed: 0,0,1,2
0,중국에서 가장 오랜기간 왕위를 유지한 인물은 누구인가?,"Kangxi Emperor The Kangxi Emperor's reign of 61 years makes him the longest-reigning emperor in Chinese history (although his grandson, the Qianlong Emperor, had the longest period of ""de facto"" power) and one of the longest-reigning rulers in the world. However, since he ascended the throne at the age of seven, actual power was held for six years by four regents and his grandmother, the Grand Empress Dowager Xiaozhuang.","Chiddy Bang new songs from the duo and in November 2009 debuted the group's first free mixtape entitled ""The Swelly Express"". On 28 April 2011 during the first-ever MTV O Music Awards, Anamege broke the Guinness World Record for Longest Freestyle Rap and Longest Marathon Rapping Record by freestyling for 9 hours, 18 minutes, and 22 seconds, stealing the throne from rapper M-Eighty, who originally broke the record in 2009 by rapping for 9 hours, 15 minutes and 15 seconds. Anamege had also beat Canadian rapper D.O. for Longest Marathon Rapping session, the previous record being for 8 hours and 45 minutes."
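
To make the role of a [query, positive, negative] triple concrete, here is a small sketch of ColBERT-style late-interaction scoring (MaxSim) followed by a two-way softmax cross-entropy loss. The toy vectors and the function names `maxsim_score` and `pairwise_loss` are illustrative assumptions; this is not PrimeQA's training code, which may normalize, mask, and batch differently.

```python
import math


def maxsim_score(query_vecs, doc_vecs):
    """For each query token, take the best dot product over document tokens, then sum."""
    return sum(
        max(sum(q * d for q, d in zip(q_vec, d_vec)) for d_vec in doc_vecs)
        for q_vec in query_vecs
    )


def pairwise_loss(pos_score, neg_score):
    """Cross-entropy of a 2-way softmax whose target is the positive passage."""
    m = max(pos_score, neg_score)  # subtract the max for numerical stability
    log_z = m + math.log(math.exp(pos_score - m) + math.exp(neg_score - m))
    return log_z - pos_score


# Toy 2-dimensional token embeddings (made up for illustration).
query = [[1.0, 0.0], [0.0, 1.0]]
positive = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
negative = [[-1.0, 0.0], [0.0, -1.0]]

pos_score = maxsim_score(query, positive)
neg_score = maxsim_score(query, negative)
loss = pairwise_loss(pos_score, neg_score)
```

The loss shrinks as the positive passage outscores the negative one, which is the signal a triple provides during training.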


In [5]:
args_dict = {
                'root': output_dir,
                'experiment': 'test_training_student',
                'triples': text_triples_fn,
                'model_type': model_type,
                'maxsteps': 3,
                'bsize': 1,
                'accumsteps': 1,
                'amp': True,
                'epochs': 1,
                'rank': 0,
                'nranks': 1
            }

Next, we train the student starting-point model and save its location in the `student_model_fn` variable.

In [6]:
with Run().context(RunConfig(root=args_dict['root'], experiment=args_dict['experiment'], nranks=args_dict['nranks'], amp=args_dict['amp'])):
    colBERTConfig = ColBERTConfig(**args_dict)   
    student_model_fn = train(colBERTConfig, text_triples_fn, None, None)

{
    "ncells": null,
    "centroid_score_threshold": null,
    "ndocs": null,
    "index_path": null,
    "index_location": null,
    "nbits": 1,
    "kmeans_niters": 20,
    "num_partitions_max": 10000000,
    "similarity": "cosine",
    "bsize": 1,
    "accumsteps": 1,
    "lr": 3e-6,
    "maxsteps": 3,
    "save_every": null,
    "resume": false,
    "resume_optimizer": false,
    "warmup": null,
    "warmup_bert": null,
    "relu": false,
    "nway": 2,
    "use_ib_negatives": false,
    "reranker": false,
    "distillation_alpha": 1.0,
    "ignore_scores": false,
    "shuffle_every_epoch": false,
    "save_steps": 2000,
    "save_epochs": -1,
    "epochs": 1,
    "input_arguments": {},
    "model_type": "xlm-roberta-base",
    "init_from_lm": null,
    "local_models_repository": null,
    "ranks_fn": null,
    "output_dir": null,
    "topK": 100,
    "student_teacher_temperature": 1.0,
    "student_teacher_top_loss_weight": 0.5,
    "teacher_model_type": "xlm-roberta-base",
    "

Some weights of the model checkpoint at xlm-roberta-base were not used when initializing HF_ColBERT_XLMR: ['lm_head.bias', 'lm_head.decoder.weight', 'lm_head.layer_norm.weight', 'lm_head.dense.bias', 'lm_head.dense.weight', 'lm_head.layer_norm.bias']
- This IS expected if you are initializing HF_ColBERT_XLMR from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing HF_ColBERT_XLMR from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of HF_ColBERT_XLMR were not initialized from the model checkpoint at xlm-roberta-base and are newly initialized: ['bert.encoder.layer.2.output.dense.weight', 'bert.encoder.layer.7.attention.output.LayerNorm.bias', 'bert.encoder.layer.0.output.LayerNorm.bias', 'bert.encod

[Oct 26, 08:45:17] Loading segmented_maxsim_cpp extension (set COLBERT_LOAD_TORCH_EXTENSION_VERBOSE=True for more info)...




[Oct 26, 08:45:20] maxsteps: 3
[Oct 26, 08:45:20] 1 epochs of 5 examples
[Oct 26, 08:45:20] batch size: 1
[Oct 26, 08:45:20] maxsteps set to 3
[Oct 26, 08:45:20] start batch idx: 0
[Oct 26, 08:45:21] #> XMLR QueryTokenizer.tensorize(batch_text[0], batch_background[0], bsize) ==
[Oct 26, 08:45:21] #> Input: $ 중국에서 가장 오랜기간 왕위를 유지한 인물은 누구인가?, 		 True, 		 None
[Oct 26, 08:45:21] #> Output IDs: torch.Size([32]), tensor([     0,   9748,  24120,   1180,  13968, 211059,  83639,  76826,  78363,
         57104,    993, 161732,    697, 116932, 114150,     32,      2,      1,
             1,      1,      1,      1,      1,      1,      1,      1,      1,
             1,      1,      1,      1,      1])
[Oct 26, 08:45:21] #> Output Mask: torch.Size([32]), tensor([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0])
[Oct 26, 08:45:21] #> XLMR DocTokenizer.tensorize(batch_text[0], batch_background[0], bsize) ==
[Oct 26, 08:45:21] #> Input: $ "Kangxi



[Oct 26, 08:45:25] #>>>> before linear query ==
[Oct 26, 08:45:25] #>>>>> Q: torch.Size([32, 768]), tensor([[ 0.1269,  0.0868,  0.0588,  ..., -0.0978,  0.0423,  0.0024],
        [ 0.0237,  0.0695,  0.0303,  ..., -0.0671, -0.0818, -0.0653],
        [ 0.0626,  0.0780,  0.0324,  ...,  0.0213, -0.0027,  0.1598],
        ...,
        [ 0.1314,  0.0969,  0.0671,  ..., -0.1059,  0.0561, -0.0338],
        [ 0.0398,  0.0168,  0.0476,  ..., -0.0210, -0.0167, -0.0480],
        [ 0.0871,  0.0832,  0.0208,  ..., -0.0546,  0.0003,  0.0225]],
       grad_fn=<SelectBackward0>)
[Oct 26, 08:45:25] #>>>>> self.linear query : Parameter containing:
tensor([[-1.2613e-02, -3.6824e-03, -1.0773e-02,  ..., -1.3896e-02,
          1.2244e-02, -8.2326e-03],
        [-3.4988e-03,  3.9623e-03, -2.4905e-02,  ..., -2.6214e-02,
          4.0837e-03, -8.0812e-03],
        [-1.1158e-02, -2.1660e-05,  1.5019e-02,  ..., -1.1310e-02,
         -9.3691e-03, -2.2202e-02],
        ...,
        [-1.0295e-02,  3.2577e-02,  3.3341

Next, we train the teacher model and save its location in the `teacher_model_fn` variable.

When training this model we use data with the same passages as for the student model, but with English translations of the queries.
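
A quick sanity check for such a pair of files is that they line up row by row: same passages, different queries. The sketch below assumes both files have already been loaded into lists of (query, positive, negative) tuples; `check_parallel` is a hypothetical helper and the passage strings are shortened placeholders.

```python
def check_parallel(student_triples, teacher_triples):
    """Return True if both lists have the same length and identical passages per row."""
    if len(student_triples) != len(teacher_triples):
        return False
    # Compare only the passage columns (positive, negative); queries may differ.
    return all(s[1:] == t[1:] for s, t in zip(student_triples, teacher_triples))


# Placeholder rows: the passages are abbreviated stand-ins, not the full dataset text.
student = [("중국에서 가장 오랜기간 왕위를 유지한 인물은 누구인가?", "Kangxi Emperor (abbreviated)", "Chiddy Bang (abbreviated)")]
teacher = [("Who maintained the throne for the longest time in China?", "Kangxi Emperor (abbreviated)", "Chiddy Bang (abbreviated)")]
shuffled = [("Who maintained the throne for the longest time in China?", "Chiddy Bang (abbreviated)", "Kangxi Emperor (abbreviated)")]

aligned = check_parallel(student, teacher)
misaligned = check_parallel(student, shuffled)
```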

In [7]:
text_triples_en_fn = os.path.join(test_files_location, "xorqa.train_ir_negs_5_poss_1_001pct_at_0pct_en.tsv")
args_dict = {
                'root': output_dir,
                'experiment': 'test_training_teacher',
                'triples': text_triples_en_fn,
                'model_type': model_type,
                'maxsteps': 3,
                'bsize': 1,
                'accumsteps': 1,
                'amp': True,
                'epochs': 1,
                'rank': 0,
                'nranks': 1
            }

In [8]:
with Run().context(RunConfig(root=args_dict['root'], experiment=args_dict['experiment'], nranks=args_dict['nranks'], amp=args_dict['amp'])):
    colBERTConfig = ColBERTConfig(**args_dict)   
    teacher_model_fn = train(colBERTConfig, text_triples_en_fn, None, None)

{
    "ncells": null,
    "centroid_score_threshold": null,
    "ndocs": null,
    "index_path": null,
    "index_location": null,
    "nbits": 1,
    "kmeans_niters": 20,
    "num_partitions_max": 10000000,
    "similarity": "cosine",
    "bsize": 1,
    "accumsteps": 1,
    "lr": 3e-6,
    "maxsteps": 3,
    "save_every": null,
    "resume": false,
    "resume_optimizer": false,
    "warmup": null,
    "warmup_bert": null,
    "relu": false,
    "nway": 2,
    "use_ib_negatives": false,
    "reranker": false,
    "distillation_alpha": 1.0,
    "ignore_scores": false,
    "shuffle_every_epoch": false,
    "save_steps": 2000,
    "save_epochs": -1,
    "epochs": 1,
    "input_arguments": {},
    "model_type": "xlm-roberta-base",
    "init_from_lm": null,
    "local_models_repository": null,
    "ranks_fn": null,
    "output_dir": null,
    "topK": 100,
    "student_teacher_temperature": 1.0,
    "student_teacher_top_loss_weight": 0.5,
    "teacher_model_type": "xlm-roberta-base",
    "

Some weights of the model checkpoint at xlm-roberta-base were not used when initializing HF_ColBERT_XLMR: ['lm_head.bias', 'lm_head.decoder.weight', 'lm_head.layer_norm.weight', 'lm_head.dense.bias', 'lm_head.dense.weight', 'lm_head.layer_norm.bias']
- This IS expected if you are initializing HF_ColBERT_XLMR from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing HF_ColBERT_XLMR from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of HF_ColBERT_XLMR were not initialized from the model checkpoint at xlm-roberta-base and are newly initialized: ['bert.encoder.layer.2.output.dense.weight', 'bert.encoder.layer.7.attention.output.LayerNorm.bias', 'bert.encoder.layer.0.output.LayerNorm.bias', 'bert.encod

[Oct 26, 08:48:26] maxsteps: 3
[Oct 26, 08:48:26] 1 epochs of 5 examples
[Oct 26, 08:48:26] batch size: 1
[Oct 26, 08:48:26] maxsteps set to 3
[Oct 26, 08:48:26] start batch idx: 0
[Oct 26, 08:48:26] #> XMLR QueryTokenizer.tensorize(batch_text[0], batch_background[0], bsize) ==
[Oct 26, 08:48:26] #> Input: $ Who maintained the throne for the longest time in China?, 		 True, 		 None
[Oct 26, 08:48:26] #> Output IDs: torch.Size([32]), tensor([    0,  9748, 40469, 76104,   297,    70,     6, 42294,    86,   100,
           70,  4989,   525,  1733,    23,  9098,    32,     2,     1,     1,
            1,     1,     1,     1,     1,     1,     1,     1,     1,     1,
            1,     1])
[Oct 26, 08:48:26] #> Output Mask: torch.Size([32]), tensor([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0])
[Oct 26, 08:48:26] #> XLMR DocTokenizer.tensorize(batch_text[0], batch_background[0], bsize) ==
[Oct 26, 08:48:26] #> Input: $ "Kangxi Emper



[Oct 26, 08:48:31] #>>>> before linear query ==
[Oct 26, 08:48:31] #>>>>> Q: torch.Size([32, 768]), tensor([[ 0.1266,  0.1302,  0.0628,  ..., -0.1106,  0.0519, -0.0265],
        [-0.0445,  0.0772,  0.0160,  ..., -0.0904, -0.0560, -0.0523],
        [-0.0128,  0.0967, -0.0064,  ..., -0.0320, -0.0303,  0.1300],
        ...,
        [ 0.1200,  0.1188, -0.0589,  ..., -0.2018, -0.0487, -0.0523],
        [ 0.0734,  0.0938, -0.0106,  ..., -0.1310, -0.0079,  0.0202],
        [ 0.1056,  0.1080, -0.0078,  ..., -0.0156, -0.0259,  0.0222]],
       grad_fn=<SelectBackward0>)
[Oct 26, 08:48:31] #>>>>> self.linear query : Parameter containing:
tensor([[-1.2613e-02, -3.6824e-03, -1.0773e-02,  ..., -1.3896e-02,
          1.2244e-02, -8.2326e-03],
        [-3.4988e-03,  3.9623e-03, -2.4905e-02,  ..., -2.6214e-02,
          4.0837e-03, -8.0812e-03],
        [-1.1158e-02, -2.1660e-05,  1.5019e-02,  ..., -1.1310e-02,
         -9.3691e-03, -2.2202e-02],
        ...,
        [-1.0295e-02,  3.2577e-02,  3.3341

In the following two Knowledge Distillation (KD) steps, we will train the student model using
1. parallel data containing a) English and non-English passages for the student model, and b) English passages for the teacher model,
2. parallel data containing a) non-English queries and English passages for the student model, and b) English queries and English passages for the teacher model.

First, we run Knowledge Distillation where the student learns the teacher's token representations using the parallel English and non-English passage data from the following two files:

a) English and non-English passages

b) English only passages:

In [9]:
parallel_non_en_fn = os.path.join(test_files_location, "7lan_notrim_triple_2ep.other.clean.h5")
parallel_en_fn = os.path.join(test_files_location, "7lan_notrim_triple_2ep.en.clean.h5")

Here is an example item from the mixed-language file:

In [10]:
data = pd.read_csv(parallel_non_en_fn, sep='\t', nrows=1, header=None)
display(HTML(data.to_html()))

Unnamed: 0,0,1,2
0,সে একটা জরুরি পুনর্বাসনের কাজ করছিল,She was right in the middle of an important piece of restoration .,She was right in the middle of an important piece of restoration .
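
As a rough illustration of the first KD stage, where the student learns the teacher's token representations, the sketch below computes a mean squared error between student and teacher token vectors. It assumes the two sequences are already aligned token by token and that the objective is a plain MSE; the actual alignment procedure and loss in DR.DECR and PrimeQA may differ, and `representation_loss` is a hypothetical name.

```python
def representation_loss(student_vecs, teacher_vecs):
    """Mean squared error between aligned student and teacher token vectors."""
    if len(student_vecs) != len(teacher_vecs):
        raise ValueError("token sequences must be aligned to the same length")
    total, count = 0.0, 0
    for s_vec, t_vec in zip(student_vecs, teacher_vecs):
        for s, t in zip(s_vec, t_vec):
            total += (s - t) ** 2
            count += 1
    return total / count


# Toy 2-dimensional token embeddings (made up for illustration).
teacher = [[0.0, 1.0], [1.0, 0.0]]
student_far = [[1.0, 0.0], [0.0, 1.0]]
student_near = [[0.0, 0.5], [1.0, 0.0]]

far_loss = representation_loss(student_far, teacher)
near_loss = representation_loss(student_near, teacher)
```

A student whose vectors sit closer to the teacher's gets a smaller loss, which is what pulls non-English representations toward the English ones.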


Here are the parameters and command for the first Knowledge Distillation (KD) stage:

In [11]:
parallel_non_en_fn = os.path.join(test_files_location, "7lan_notrim_triple_2ep.other.clean.h5")
parallel_en_fn = os.path.join(test_files_location, "7lan_notrim_triple_2ep.en.clean.h5")

args_dict = {
    'root': output_dir,
    'experiment': 'test_training',
    'model_type': model_type,
    'checkpoint': student_model_fn + '-LAST.dnn',
    'distill_query_passage_separately': True,
    'teacher_model_type': model_type,
    'teacher_checkpoint': teacher_model_fn + '-LAST.dnn',
    'triples': parallel_non_en_fn,
    'teacher_triples': parallel_en_fn,
    'maxsteps': 3,
    'bsize': 1,
    'accumsteps': 1,
    'amp': True,
    'epochs': 1,
    'rank': 0,
    'nranks': 1
}

with Run().context(RunConfig(root=args_dict['root'], experiment=args_dict['experiment'], nranks=args_dict['nranks'], amp=args_dict['amp'])):
    colBERTConfig = ColBERTConfig(**args_dict)
    stage_1_student_model_fn = train(colBERTConfig, parallel_non_en_fn, None, None)


{
    "ncells": null,
    "centroid_score_threshold": null,
    "ndocs": null,
    "index_path": null,
    "index_location": null,
    "nbits": 1,
    "kmeans_niters": 20,
    "num_partitions_max": 10000000,
    "similarity": "cosine",
    "bsize": 1,
    "accumsteps": 1,
    "lr": 3e-6,
    "maxsteps": 3,
    "save_every": null,
    "resume": false,
    "resume_optimizer": false,
    "warmup": null,
    "warmup_bert": null,
    "relu": false,
    "nway": 2,
    "use_ib_negatives": false,
    "reranker": false,
    "distillation_alpha": 1.0,
    "ignore_scores": false,
    "shuffle_every_epoch": false,
    "save_steps": 2000,
    "save_epochs": -1,
    "epochs": 1,
    "input_arguments": {},
    "model_type": "xlm-roberta-base",
    "init_from_lm": null,
    "local_models_repository": null,
    "ranks_fn": null,
    "output_dir": null,
    "topK": 100,
    "student_teacher_temperature": 1.0,
    "student_teacher_top_loss_weight": 0.5,
    "teacher_model_type": "xlm-roberta-base",
    "

Some weights of the model checkpoint at xlm-roberta-base were not used when initializing HF_ColBERT_XLMR: ['lm_head.bias', 'lm_head.decoder.weight', 'lm_head.layer_norm.weight', 'lm_head.dense.bias', 'lm_head.dense.weight', 'lm_head.layer_norm.bias']
- This IS expected if you are initializing HF_ColBERT_XLMR from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing HF_ColBERT_XLMR from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of HF_ColBERT_XLMR were not initialized from the model checkpoint at xlm-roberta-base and are newly initialized: ['bert.encoder.layer.2.output.dense.weight', 'bert.encoder.layer.7.attention.output.LayerNorm.bias', 'bert.encoder.layer.0.output.LayerNorm.bias', 'bert.encod

[Oct 26, 08:52:07] #> Starting from checkpoint /tmp/tmptrqmqrf1/output_dir/test_training_student/2022-10/26/08.43.55/checkpoints/colbert-LAST.dnn
[Oct 26, 08:52:15] #>>>>> at ColBERT name (model type) : xlm-roberta-base
[Oct 26, 08:52:15] #>>>>> at BaseColBERT name (model type) : xlm-roberta-base
[Oct 26, 08:52:15] #> base_config.py load_from_checkpoint xlm-roberta-base
[Oct 26, 08:52:15] #> base_config.py load_from_checkpoint xlm-roberta-base/artifact.metadata
[Oct 26, 08:52:15] factory model type: xlm-roberta-base


Some weights of the model checkpoint at xlm-roberta-base were not used when initializing HF_ColBERT_XLMR: ['lm_head.bias', 'lm_head.decoder.weight', 'lm_head.layer_norm.weight', 'lm_head.dense.bias', 'lm_head.dense.weight', 'lm_head.layer_norm.bias']
- This IS expected if you are initializing HF_ColBERT_XLMR from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing HF_ColBERT_XLMR from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of HF_ColBERT_XLMR were not initialized from the model checkpoint at xlm-roberta-base and are newly initialized: ['bert.encoder.layer.2.output.dense.weight', 'bert.encoder.layer.7.attention.output.LayerNorm.bias', 'bert.encoder.layer.0.output.LayerNorm.bias', 'bert.encod

[Oct 26, 08:52:35] #> Loading teacher checkpoint /tmp/tmptrqmqrf1/output_dir/test_training_teacher/2022-10/26/08.43.55/checkpoints/colbert-LAST.dnn
[Oct 26, 08:52:40] distill_query_passage_separately functionality is not supported (yet)




[Oct 26, 08:52:41] maxsteps: 3
[Oct 26, 08:52:41] 1 epochs of 5 examples
[Oct 26, 08:52:41] batch size: 1
[Oct 26, 08:52:41] maxsteps set to 3
[Oct 26, 08:52:41] start batch idx: 0
[Oct 26, 08:52:41] #> XMLR QueryTokenizer.tensorize(batch_text[0], batch_background[0], bsize) ==
[Oct 26, 08:52:41] #> Input: $ সে একটা জরুরি পুনর্বাসনের কাজ করছিল, 		 True, 		 None
[Oct 26, 08:52:41] #> Output IDs: torch.Size([32]), tensor([     0,   9748,  22540,  71896, 237528, 231247,  71994,  31938,  30234,
        168033,   2763,      2,      1,      1,      1,      1,      1,      1,
             1,      1,      1,      1,      1,      1,      1,      1,      1,
             1,      1,      1,      1,      1])
[Oct 26, 08:52:41] #> Output Mask: torch.Size([32]), tensor([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0])
[Oct 26, 08:52:41] #> XLMR DocTokenizer.tensorize(batch_text[0], batch_background[0], bsize) ==
[Oct 26, 08:52:41] #> Input: $ Sh



[Oct 26, 08:52:45] #>>>> before linear query ==
[Oct 26, 08:52:45] #>>>>> Q: torch.Size([32, 768]), tensor([[-1.5845e-03,  7.9839e-02,  5.0856e-02,  ..., -1.0550e-01,
          9.7534e-02,  1.7295e-02],
        [-8.6329e-03,  2.0453e-02,  4.7944e-05,  ...,  4.7402e-02,
          8.4028e-02,  4.5498e-02],
        [-9.8698e-03,  8.0432e-02, -2.1066e-02,  ..., -1.1526e-01,
          1.0836e-02,  6.8963e-02],
        ...,
        [ 1.9038e-02,  8.9034e-02,  4.3189e-02,  ..., -2.7036e-02,
          5.8573e-02, -2.5946e-02],
        [ 6.7096e-02,  9.8762e-02,  1.1187e-02,  ..., -4.2818e-02,
          5.3878e-02,  4.3672e-02],
        [ 7.1772e-02,  9.2112e-02,  8.6087e-03,  ..., -1.7460e-01,
          6.0347e-02,  4.0607e-03]], grad_fn=<SelectBackward0>)
[Oct 26, 08:52:45] #>>>>> self.linear query : Parameter containing:
tensor([[-1.2620e-02, -3.6884e-03, -1.0780e-02,  ..., -1.3896e-02,
          1.2237e-02, -8.2247e-03],
        [-3.5070e-03,  3.9557e-03, -2.4896e-02,  ..., -2.6217e-02,
   

#>>>    29.24 29.53 		|		 -0.2900000000000027
[Oct 26, 08:53:39] 0 0.021523209288716316
#>>>    26.44 28.06 		|		 -1.6199999999999974
[Oct 26, 08:54:39] 1 0.021525634886696933
#>>>    29.57 28.82 		|		 0.75
[Oct 26, 08:55:30] 2 0.021523664447976276
#> Saving a checkpoint to /tmp/tmptrqmqrf1/output_dir/test_training/2022-10/26/08.43.55/checkpoints/colbert.dnn.batch_3.model ..
#> Saving a checkpoint to /tmp/tmptrqmqrf1/output_dir/test_training/2022-10/26/08.43.55/checkpoints/colbert-batch_3 ..
[Oct 26, 08:55:43] name:/tmp/tmptrqmqrf1/output_dir/test_training/2022-10/26/08.43.55/checkpoints/colbert-LAST.dnn
[Oct 26, 08:55:43] Make a sym link of /tmp/tmptrqmqrf1/output_dir/test_training/2022-10/26/08.43.55/checkpoints/colbert.dnn.batch_3.model to /tmp/tmptrqmqrf1/output_dir/test_training/2022-10/26/08.43.55/checkpoints/colbert-LAST.dnn
[Oct 26, 08:55:43] #> Done with all triples!
#> Saving a checkpoint to /tmp/tmptrqmqrf1/output_dir/test_training/2022-10/26/08.43.55/checkpoints/colbert ..


Next, we run Knowledge Distillation where the student learns the teacher's relevance prediction scores using parallel data containing:

a) non-English queries and English passages for the student model

b) English queries and English passages for the teacher model.
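
One common way to make a student match a teacher's relevance scores is to minimize the KL divergence between the two softmax distributions over candidate passages. The sketch below shows that idea under stated assumptions: the score values are made up, and the way the temperature is applied here is an assumption rather than a description of how PrimeQA uses its `student_teacher_temperature` setting.

```python
import math


def softmax(scores, temperature=1.0):
    """Numerically stable softmax over temperature-scaled scores."""
    scaled = [s / temperature for s in scores]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    z = sum(exps)
    return [e / z for e in exps]


def kl_divergence(teacher_scores, student_scores, temperature=1.0):
    """KL(teacher || student) between softmax distributions over passages."""
    p = softmax(teacher_scores, temperature)
    q = softmax(student_scores, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


# Hypothetical scores for [positive passage, negative passage].
teacher_scores = [29.5, 28.0]
student_agrees = [29.5, 28.0]
student_disagrees = [28.0, 29.5]

kl_same = kl_divergence(teacher_scores, student_agrees)
kl_diff = kl_divergence(teacher_scores, student_disagrees)
```

The divergence is zero when the student ranks passages with the same confidence as the teacher and grows as the two distributions drift apart.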

In [20]:
args_dict = {
    'root': output_dir,
    'experiment': 'test_training',
    'model_type': model_type,
    'checkpoint': stage_1_student_model_fn + '-LAST.dnn',
    'distill_query_passage_separately': False,
    'teacher_model_type': model_type,
    'teacher_checkpoint': teacher_model_fn + '-LAST.dnn',
    'triples': text_triples_fn,  # student sees the non-English queries
    'teacher_triples': text_triples_en_fn,
    'maxsteps': 3,
    'bsize': 1,
    'accumsteps': 1,
    'amp': True,
    'epochs': 1,
    'rank': 0,
    'nranks': 1
}

with Run().context(RunConfig(root=args_dict['root'], experiment=args_dict['experiment'], nranks=args_dict['nranks'], amp=args_dict['amp'])):
    colBERTConfig = ColBERTConfig(**args_dict)
    stage_2_student_model_fn = train(colBERTConfig, text_triples_fn, None, None)


{
    "ncells": null,
    "centroid_score_threshold": null,
    "ndocs": null,
    "index_path": null,
    "index_location": null,
    "nbits": 1,
    "kmeans_niters": 20,
    "num_partitions_max": 10000000,
    "similarity": "cosine",
    "bsize": 1,
    "accumsteps": 1,
    "lr": 3e-6,
    "maxsteps": 3,
    "save_every": null,
    "resume": false,
    "resume_optimizer": false,
    "warmup": null,
    "warmup_bert": null,
    "relu": false,
    "nway": 2,
    "use_ib_negatives": false,
    "reranker": false,
    "distillation_alpha": 1.0,
    "ignore_scores": false,
    "shuffle_every_epoch": false,
    "save_steps": 2000,
    "save_epochs": -1,
    "epochs": 1,
    "input_arguments": {},
    "model_type": "xlm-roberta-base",
    "init_from_lm": null,
    "local_models_repository": null,
    "ranks_fn": null,
    "output_dir": null,
    "topK": 100,
    "student_teacher_temperature": 1.0,
    "student_teacher_top_loss_weight": 0.5,
    "teacher_model_type": "xlm-roberta-base",
    "

Some weights of the model checkpoint at xlm-roberta-base were not used when initializing HF_ColBERT_XLMR: ['lm_head.bias', 'lm_head.decoder.weight', 'lm_head.layer_norm.weight', 'lm_head.dense.bias', 'lm_head.dense.weight', 'lm_head.layer_norm.bias']
- This IS expected if you are initializing HF_ColBERT_XLMR from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing HF_ColBERT_XLMR from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of HF_ColBERT_XLMR were not initialized from the model checkpoint at xlm-roberta-base and are newly initialized: ['bert.encoder.layer.2.output.dense.weight', 'bert.encoder.layer.7.attention.output.LayerNorm.bias', 'bert.encoder.layer.0.output.LayerNorm.bias', 'bert.encod

[Oct 26, 09:02:33] #> Starting from checkpoint /tmp/tmptrqmqrf1/output_dir/test_training/2022-10/26/08.43.55/checkpoints/colbert-LAST.dnn
[Oct 26, 09:02:39] #>>>>> at ColBERT name (model type) : xlm-roberta-base
[Oct 26, 09:02:39] #>>>>> at BaseColBERT name (model type) : xlm-roberta-base
[Oct 26, 09:02:39] #> base_config.py load_from_checkpoint xlm-roberta-base
[Oct 26, 09:02:39] #> base_config.py load_from_checkpoint xlm-roberta-base/artifact.metadata
[Oct 26, 09:02:39] factory model type: xlm-roberta-base


Some weights of the model checkpoint at xlm-roberta-base were not used when initializing HF_ColBERT_XLMR: ['lm_head.bias', 'lm_head.decoder.weight', 'lm_head.layer_norm.weight', 'lm_head.dense.bias', 'lm_head.dense.weight', 'lm_head.layer_norm.bias']
- This IS expected if you are initializing HF_ColBERT_XLMR from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing HF_ColBERT_XLMR from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of HF_ColBERT_XLMR were not initialized from the model checkpoint at xlm-roberta-base and are newly initialized: ['bert.encoder.layer.2.output.dense.weight', 'bert.encoder.layer.7.attention.output.LayerNorm.bias', 'bert.encoder.layer.0.output.LayerNorm.bias', 'bert.encod

[Oct 26, 09:02:59] #> Loading teacher checkpoint /tmp/tmptrqmqrf1/output_dir/test_training_teacher/2022-10/26/08.43.55/checkpoints/colbert-LAST.dnn




[Oct 26, 09:03:05] maxsteps: 3
[Oct 26, 09:03:05] 1 epochs of 5 examples
[Oct 26, 09:03:05] batch size: 1
[Oct 26, 09:03:05] maxsteps set to 3
[Oct 26, 09:03:05] start batch idx: 0
[Oct 26, 09:03:05] #> XMLR QueryTokenizer.tensorize(batch_text[0], batch_background[0], bsize) ==
[Oct 26, 09:03:05] #> Input: $ 중국에서 가장 오랜기간 왕위를 유지한 인물은 누구인가?, 		 True, 		 None
[Oct 26, 09:03:05] #> Output IDs: torch.Size([32]), tensor([     0,   9748,  24120,   1180,  13968, 211059,  83639,  76826,  78363,
         57104,    993, 161732,    697, 116932, 114150,     32,      2,      1,
             1,      1,      1,      1,      1,      1,      1,      1,      1,
             1,      1,      1,      1,      1])
[Oct 26, 09:03:05] #> Output Mask: torch.Size([32]), tensor([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0])
[Oct 26, 09:03:05] #> XLMR DocTokenizer.tensorize(batch_text[0], batch_background[0], bsize) ==
[Oct 26, 09:03:05] #> Input: $ "Kangxi

[Oct 26, 09:03:10] #>>>> colbert query ==
[Oct 26, 09:03:10] #>>>>> Q: torch.Size([32, 128]), tensor([[-0.2179, -0.0212, -0.3695,  ..., -0.1597, -0.1289, -0.1689],
        [-0.0519,  0.2481, -0.4561,  ..., -0.2290, -0.2685, -0.0693],
        [-0.0309,  0.2590, -0.4800,  ..., -0.3211, -0.2496, -0.0094],
        ...,
        [-0.1647,  0.1821, -0.4240,  ..., -0.1822, -0.3135, -0.0908],
        [-0.2180, -0.0300, -0.3580,  ..., -0.1671, -0.1416, -0.1824],
        [-0.1909, -0.0392, -0.3386,  ..., -0.1474, -0.0767, -0.1944]],
       grad_fn=<SelectBackward0>)
[Oct 26, 09:03:10] #>>>> colbert doc ==
[Oct 26, 09:03:10] #>>>>> input_ids: torch.Size([155]), tensor([     0,   9749,     44,  50245, 122809,  31678,     56,    748,    581,
         30267,   5134,  31678,     56,    748,     25,      7,   1690,  38529,
           111,  11716,   5369,  30482,   4049,     70,   4989,    525,      9,
           107,    872,    592,      6,  88940,    748,     23,  76438,  32692,
            15,    289

[Oct 26, 09:03:32] #>>>>> self.linear doc : Parameter containing:
tensor([[-1.2613e-02, -3.6825e-03, -1.0767e-02,  ..., -1.3904e-02,
          1.2236e-02, -8.2245e-03],
        [-3.5072e-03,  3.9647e-03, -2.4897e-02,  ..., -2.6205e-02,
          4.0816e-03, -8.0879e-03],
        [-1.1153e-02, -2.5158e-05,  1.5023e-02,  ..., -1.1317e-02,
         -9.3637e-03, -2.2197e-02],
        ...,
        [-1.0294e-02,  3.2583e-02,  3.3276e-03,  ...,  3.1877e-02,
          3.6833e-03,  2.3003e-02],
        [ 1.5650e-03,  2.2403e-03,  1.3361e-02,  ..., -2.8309e-02,
          3.9306e-03,  1.1987e-02],
        [-5.5885e-03,  3.1930e-02, -2.3424e-02,  ..., -1.1074e-04,
          1.1096e-02, -9.7742e-03]], requires_grad=True)
[Oct 26, 09:03:32] #>>>> colbert doc ==
[Oct 26, 09:03:32] #>>>>> D: torch.Size([155, 128]), tensor([[-0.2345, -0.0354, -0.3565,  ..., -0.1199, -0.1221, -0.1999],
        [-0.0886,  0.2280, -0.5029,  ..., -0.1394, -0.2653, -0.0359],
        [-0.0194,  0.2471, -0.5022,  ..., -0.2341

## Indexing
Next, we will index a collection of documents, using the model representations from the previous step.
The collection is a TSV file containing each document's ID, text, and title.
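
For readers preparing their own collection, here is a minimal sketch of loading such a file with the standard `csv` module. It assumes a header row naming the columns `id`, `text`, and `title`; the sample rows and the `read_collection` helper are illustrative and not part of PrimeQA.

```python
import csv
import io

# Hypothetical two-document collection with a header row.
sample = (
    "id\ttext\ttitle\n"
    "1\tA short passage about an emperor.\tKangxi Emperor\n"
    "2\tA short passage about a hip hop duo.\tChiddy Bang\n"
)


def read_collection(handle):
    """Map document ID to (title, text), checking that the expected columns exist."""
    # QUOTE_NONE keeps quote characters in passages as literal text.
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    missing = {"id", "text", "title"} - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"missing columns: {sorted(missing)}")
    return {int(row["id"]): (row["title"], row["text"]) for row in reader}


docs = read_collection(io.StringIO(sample))
```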

In [13]:
collection_fn = os.path.join(test_files_location, "xorqa.train_ir_001pct_at_0_pct_collection_fornum.tsv")

Here is an example document:

In [14]:
data = pd.read_csv(collection_fn, sep='\t', header=0, nrows=1)
#data = pd.read_csv(collection_fn, sep='\t', header=None, skiprows=3, nrows=1)
display(HTML(data.to_html()))

Unnamed: 0,id,text,title
0,1,"The Kangxi Emperor's reign of 61 years makes him the longest-reigning emperor in Chinese history (although his grandson, the Qianlong Emperor, had the longest period of ""de facto"" power) and one of the longest-reigning rulers in the world. However, since he ascended the throne at the age of seven, actual power was held for six years by four regents and his grandmother, the Grand Empress Dowager Xiaozhuang.",Kangxi Emperor


Here are the indexer arguments:

In [15]:
args_dict = {
                'root': os.path.join(output_dir,'test_indexing'),
                'experiment': 'test_indexing',
                'checkpoint': stage_2_student_model_fn,
                'collection': collection_fn,
                'index_root': os.path.join(output_dir, 'test_indexing', 'indexes'),
                'index_name': 'index_name',
                'doc_maxlen': 180,
                'num_partitions_max': 2,
                'kmeans_niters': 1,
                'nway': 1,
                'rank': 0,
                'nranks': 1,
                'amp': True
            }

Here we run the indexer:

In [16]:
with Run().context(RunConfig(root=args_dict['root'], experiment=args_dict['experiment'], nranks=args_dict['nranks'], amp=args_dict['amp'])):
    colBERTConfig = ColBERTConfig(**args_dict)
    create_directory(colBERTConfig.index_path_)
    encode(colBERTConfig, collection_fn, None, None)




[Oct 26, 09:01:04] #> Creating directory /tmp/tmptrqmqrf1/output_dir/test_indexing/indexes/index_name 


{
    "ncells": null,
    "centroid_score_threshold": null,
    "ndocs": null,
    "index_path": null,
    "index_location": null,
    "nbits": 1,
    "kmeans_niters": 1,
    "num_partitions_max": 2,
    "similarity": "cosine",
    "bsize": 32,
    "accumsteps": 1,
    "lr": 3e-6,
    "maxsteps": 500000,
    "save_every": null,
    "resume": false,
    "resume_optimizer": false,
    "warmup": null,
    "warmup_bert": null,
    "relu": false,
    "nway": 1,
    "use_ib_negatives": false,
    "reranker": false,
    "distillation_alpha": 1.0,
    "ignore_scores": false,
    "shuffle_every_epoch": false,
    "save_steps": 2000,
    "save_epochs": -1,
    "epochs": 10,
    "input_arguments": {},
    "model_type": "bert-base-uncased",
    "init_from_lm": null,
    "local_models_repository": null,
    "ranks_fn": null,
    "output_dir": null,
    "topK": 100,
    "student_teacher_tempera



[Oct 26, 09:01:24] #> XLMR DocTokenizer.tensorize(batch_text[0], batch_background[0], bsize) ==
[Oct 26, 09:01:24] #> Input: $ title | text, 		 32
[Oct 26, 09:01:24] #> Output IDs: torch.Size([180]), tensor([    0,  9749, 44759,     6, 58745,  7986,     2,     1,     1,     1,
            1,     1,     1,     1,     1,     1,     1,     1,     1,     1,
            1,     1,     1,     1,     1,     1,     1,     1,     1,     1,
            1,     1,     1,     1,     1,     1,     1,     1,     1,     1,
            1,     1,     1,     1,     1,     1,     1,     1,     1,     1,
            1,     1,     1,     1,     1,     1,     1,     1,     1,     1,
            1,     1,     1,     1,     1,     1,     1,     1,     1,     1,
            1,     1,     1,     1,     1,     1,     1,     1,     1,     1,
            1,     1,     1,     1,     1,     1,     1,     1,     1,     1,
            1,     1,     1,     1,     1,     1,     1,     1,     1,     1,
            1,     1



[Oct 26, 09:01:35] #>>>> before linear doc ==
[Oct 26, 09:01:35] #>>>>> D: torch.Size([180, 768]), tensor([[ 0.0680,  0.0969,  0.0546,  ..., -0.1383,  0.0549,  0.0015],
        [ 0.0007,  0.0250,  0.0328,  ..., -0.1287,  0.0133,  0.1089],
        [-0.0445,  0.0889, -0.0314,  ..., -0.2847, -0.0218,  0.0868],
        ...,
        [-0.0585,  0.0579,  0.0066,  ..., -0.1692, -0.0392, -0.0020],
        [-0.0585,  0.0579,  0.0066,  ..., -0.1692, -0.0392, -0.0020],
        [-0.0585,  0.0579,  0.0066,  ..., -0.1692, -0.0392, -0.0020]])
[Oct 26, 09:01:35] #>>>>> self.linear doc : Parameter containing:
tensor([[-1.2608e-02, -3.6779e-03, -1.0781e-02,  ..., -1.3897e-02,
          1.2241e-02, -8.2146e-03],
        [-3.4997e-03,  3.9648e-03, -2.4895e-02,  ..., -2.6225e-02,
          4.0922e-03, -8.0729e-03],
        [-1.1159e-02, -1.7624e-05,  1.5027e-02,  ..., -1.1298e-02,
         -9.3556e-03, -2.2198e-02],
        ...,
        [-1.0298e-02,  3.2585e-02,  3.3553e-03,  ...,  3.1883e-02,
          3.

0it [00:00, ?it/s]

[Oct 26, 09:01:42] [0] 		 #> Encoding 7 passages..
[Oct 26, 09:01:53] [0] 		 #> Saving chunk 0: 	 7 passages and 1,220 embeddings. From #0 onward.


1it [00:11, 11.54s/it]

[Oct 26, 09:01:54] offset: 0
[Oct 26, 09:01:54] chunk codes size(0): 1220
[Oct 26, 09:01:54] codes size(0): 1220
[Oct 26, 09:01:54] codes size(): torch.Size([1220])
[Oct 26, 09:01:54] >>>>partition.size(0): 2
[Oct 26, 09:01:54] >>>>num_partition: 2
[Oct 26, 09:01:54] #> Optimizing IVF to store map from centroids to list of pids..
[Oct 26, 09:01:54] #> Building the emb2pid mapping..
[Oct 26, 09:01:54] len(emb2pid) = 1220



100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 5493.52it/s]


[Oct 26, 09:01:54] #> Saved optimized IVF to /tmp/tmptrqmqrf1/output_dir/test_indexing/indexes/index_name/ivf.pid.pt
[Oct 26, 09:01:54] [0] 		 #> Saving the indexing metadata to /tmp/tmptrqmqrf1/output_dir/test_indexing/indexes/index_name/metadata.json ..


The resulting index files are written to `output_dir/test_indexing/indexes/index_name/`; as the log above shows, they include `ivf.pid.pt` and `metadata.json`.
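To inspect what the indexer wrote, a sketch along these lines lists an index directory and loads its `metadata.json`. The helper `summarize_index` is hypothetical and not part of PrimeQA, and the demo runs against a stand-in directory with placeholder metadata, since the real file contents depend on the run.

```python
import json
import os
import tempfile


def summarize_index(index_dir):
    """Return (sorted file names, parsed metadata.json or None) for an index directory."""
    files = sorted(os.listdir(index_dir))
    metadata = None
    metadata_fn = os.path.join(index_dir, "metadata.json")
    if os.path.exists(metadata_fn):
        with open(metadata_fn, "r", encoding="utf-8") as f:
            metadata = json.load(f)
    return files, metadata


# Stand-in directory with placeholder content so the sketch is runnable on its own.
with tempfile.TemporaryDirectory() as demo_dir:
    with open(os.path.join(demo_dir, "metadata.json"), "w", encoding="utf-8") as f:
        json.dump({"example_key": 1}, f)
    files, metadata = summarize_index(demo_dir)
    print(files, metadata)
```

In the notebook, the argument would be `os.path.join(output_dir, 'test_indexing', 'indexes', 'index_name')`.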

## Search
Next, we use the trained model and the index to search the collection, using queries from a TSV query file.

In [17]:
queries_fn = os.path.join(test_files_location, "xorqa.train_ir_001pct_at_0_pct_queries_fornum.tsv")

Here are the search arguments:

In [21]:
args_dict = {
                'root': output_dir,
                'experiment': 'test_indexing',
                'checkpoint': stage_2_student_model_fn,
                'model_type': model_type,
                'index_location': os.path.join(output_dir, 'test_indexing', 'indexes', 'index_name'),
                'queries': queries_fn,
                'bsize': 1,
                'topK': 1,
                'nway': 1,
                'rank': 0,
                'nranks': 1,
                'amp': True,
            }
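The `index_location` above has to point at the directory the indexer created, which, per the indexing log, is `index_root` joined with `index_name`. One way to avoid retyping the path is to derive it from the indexing arguments. This is only a sketch: `output_dir` is a placeholder string here, and `index_args` is a hypothetical name for a saved copy of the indexer's dictionary.

```python
import os

# Placeholder; in the notebook, output_dir is defined in an earlier cell.
output_dir = "output_dir"

# Hypothetical copy of the two relevant indexer arguments.
index_args = {
    "index_root": os.path.join(output_dir, "test_indexing", "indexes"),
    "index_name": "index_name",
}

# The searcher reads from the same directory the indexer wrote to.
index_location = os.path.join(index_args["index_root"], index_args["index_name"])
print(index_location)
```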

Here we initialize and run the searcher:

In [23]:
with Run().context(RunConfig(root=args_dict['root'], experiment=args_dict['experiment'], nranks=args_dict['nranks'], amp=args_dict['amp'])):
    colBERTConfig = ColBERTConfig(**args_dict)
    searcher = Searcher(args_dict['index_location'], checkpoint=args_dict['checkpoint'], config=colBERTConfig)
    rankings = searcher.search_all(args_dict['queries'], args_dict['topK'])

[Oct 26, 09:09:49] #> base_config.py from_path /tmp/tmptrqmqrf1/output_dir/test_indexing/indexes/index_name/metadata.json
[Oct 26, 09:09:49] #> base_config.py from_path args loaded! 
[Oct 26, 09:09:49] #> base_config.py from_path args replaced ! 
[Oct 26, 09:09:49] #> base_config.py load_from_checkpoint /tmp/tmptrqmqrf1/output_dir/test_training/2022-10/26/08.43.55/checkpoints/colbert
[Oct 26, 09:09:49] #> base_config.py load_from_checkpoint /tmp/tmptrqmqrf1/output_dir/test_training/2022-10/26/08.43.55/checkpoints/colbert/artifact.metadata
[Oct 26, 09:09:49] #> base_config.py from_path /tmp/tmptrqmqrf1/output_dir/test_training/2022-10/26/08.43.55/checkpoints/colbert/artifact.metadata
[Oct 26, 09:09:49] #> base_config.py from_path args loaded! 
[Oct 26, 09:09:49] #>>>>> at ColBERT name (model type) : /tmp/tmptrqmqrf1/output_dir/test_training/2022-10/26/08.43.55/checkpoints/colbert
[Oct 26, 09:09:49] #>>>>> at BaseColBERT name (model type) : /tmp/tmptrqmqrf1/output_dir/test_training/2022-



[Oct 26, 09:10:09] Loading segmented_lookup_cpp extension (set COLBERT_LOAD_TORCH_EXTENSION_VERBOSE=True for more info)...
[Oct 26, 09:10:25] Loading filter_pids_cpp extension (set COLBERT_LOAD_TORCH_EXTENSION_VERBOSE=True for more info)...
[Oct 26, 09:10:38] Loading decompress_residuals_cpp extension (set COLBERT_LOAD_TORCH_EXTENSION_VERBOSE=True for more info)...
[Oct 26, 09:10:43] #> Loading the queries from ../../../tests/resources/ir_dense/xorqa.train_ir_001pct_at_0_pct_queries_fornum.tsv ...
[Oct 26, 09:10:43] #> Got 1 queries. All QIDs are unique.

[Oct 26, 09:10:43] #> XMLR QueryTokenizer.tensorize(batch_text[0], batch_background[0], bsize) ==
[Oct 26, 09:10:43] #> Input: $ 중국에서 가장 오랜기간 왕위를 유지한 인물은 누구인가?, 		 True, 		 None
[Oct 26, 09:10:43] #> Output IDs: torch.Size([32]), tensor([     0,   9748,  24120,   1180,  13968, 211059,  83639,  76826,  78363,
         57104,    993, 161732,    697, 116932, 114150,     32,      2,      1,
             1,      1,      1,      1,      1, 



[Oct 26, 09:10:47] #>>>> before linear query ==
[Oct 26, 09:10:47] #>>>>> Q: torch.Size([32, 768]), tensor([[ 0.1762,  0.1015,  0.0653,  ..., -0.1249,  0.0495, -0.0164],
        [ 0.0720,  0.0497,  0.0575,  ..., -0.0727, -0.1199, -0.0622],
        [ 0.1116,  0.0483,  0.0107,  ...,  0.0009, -0.0022,  0.2009],
        ...,
        [ 0.1664,  0.0902, -0.0267,  ..., -0.2714, -0.0662,  0.0430],
        [ 0.1664,  0.0902, -0.0267,  ..., -0.2714, -0.0662,  0.0430],
        [ 0.1664,  0.0902, -0.0267,  ..., -0.2714, -0.0662,  0.0430]])
[Oct 26, 09:10:47] #>>>>> self.linear query : Parameter containing:
tensor([[-1.2609e-02, -3.6701e-03, -1.0775e-02,  ..., -1.3890e-02,
          1.2242e-02, -8.2134e-03],
        [-3.4918e-03,  3.9656e-03, -2.4894e-02,  ..., -2.6227e-02,
          4.0927e-03, -8.0784e-03],
        [-1.1166e-02, -2.3727e-05,  1.5029e-02,  ..., -1.1290e-02,
         -9.3548e-03, -2.2198e-02],
        ...,
        [-1.0306e-02,  3.2584e-02,  3.3568e-03,  ...,  3.1891e-02,
         

100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00,  1.14it/s]


Here is the search result for our query, containing [query_id, document_id, rank, score]:

In [24]:
rankings.flat_ranking[0]

(-7239279093922981232, 2, 1, 30.695838928222656)
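Since each entry is a plain tuple, unpacking it into named fields can make later cells easier to read. The `RankedHit` type below is a hypothetical convenience, not part of PrimeQA; its field order follows the [query_id, document_id, rank, score] layout described above, and the literal tuple is copied from the output.

```python
from typing import NamedTuple


class RankedHit(NamedTuple):
    """One search result, in the order [query_id, document_id, rank, score]."""
    query_id: int
    document_id: int
    rank: int
    score: float


# Tuple copied from the cell output above.
hit = RankedHit(*(-7239279093922981232, 2, 1, 30.695838928222656))
print(f"query {hit.query_id} -> document {hit.document_id} "
      f"(rank {hit.rank}, score {hit.score:.2f})")
```

In the notebook, `RankedHit(*rankings.flat_ranking[0])` would build the same object from the live result.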

Here is the query record, containing the ID and text of the query:

In [25]:
# Print the query line whose first tab-separated field matches the ranked query ID.
with open(queries_fn, 'r', encoding='utf-8') as f:
    for line in f:
        if str(rankings.flat_ranking[0][0]) == line.split('\t')[0]:
            print(line)

-7239279093922981232	중국에서 가장 오랜기간 왕위를 유지한 인물은 누구인가?



English translation: `Who maintained the throne for the longest time in China?`

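Here is the top retrieved document. Because the model was trained on a trivial amount of data, the top hit is not necessarily the best answer; here, the Kangxi Emperor passage shown in the Indexing section would have been a better match: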

In [26]:
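# Print the collection line whose first tab-separated field matches the retrieved document ID.
with open(collection_fn, 'r', encoding='utf-8') as f:
    for line in f:
        if str(rankings.flat_ranking[0][1]) == line.split('\t')[0]:
            print(line)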

2	Yao. The Bamboo Annals says that when Emperor Zhuanxu died, a descendent of Shennong named ShuQe raised a disturbance, but was destroyed by the prince of Sin, who was Ku (GaoXin), a descendant of HuangDi, who then ascended to the throne. In the 45th year, Ku designated the prince of Tang (唐) (his son Yao) as his successor, however upon his death in the 63rd year, his elder son Zhi then took the throne instead, ruling nine years before being deposed and replaced by Yao. Emperor Zhi Di Zhì () or simply Zhì, was a mythological emperor of ancient China.	Emperor Zhi
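Scanning a file once per lookup, as in the two cells above, is fine for a single result. With more queries or a larger `topK`, loading the first column into a dictionary once may be more convenient. A minimal sketch, where `load_id_to_line` is a hypothetical helper and the temporary TSV stands in for the real query or collection file:

```python
import os
import tempfile


def load_id_to_line(tsv_fn):
    """Map the first tab-separated field of each non-empty line to the full line."""
    id_to_line = {}
    with open(tsv_fn, "r", encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if line:
                id_to_line[line.split("\t", 1)[0]] = line
    return id_to_line


# Stand-in file with made-up rows so the sketch runs on its own.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_fn = os.path.join(tmp_dir, "demo.tsv")
    with open(demo_fn, "w", encoding="utf-8") as f:
        f.write("id\ttext\ttitle\n1\tfirst passage\tFirst\n2\tsecond passage\tSecond\n")
    lookup = load_id_to_line(demo_fn)
    print(lookup[str(2)])
```

Note that the header row ends up in the dictionary under the key `id`; it is harmless for ID lookups but can be skipped if desired.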

