# PLAID-X for CLIR

In this notebook, we give a quick demonstration of using PLAID-X (an extension of [ColBERT-X](https://arxiv.org/abs/2201.08471)) to run a CLIR experiment on a subset of the NeuCLIR Chinese collection.

The overall run time of this notebook is roughly 20 to 25 minutes on a Colab T4 GPU, most of it spent on indexing. Remember to select a Colab runtime with a GPU (`Runtime > Change runtime type`).

## Get Started

The following cell checks whether this notebook has GPU access.

In [1]:
!nvidia-smi

Tue Jul  4 06:17:34 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.85.12    Driver Version: 525.85.12    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   69C    P8    10W /  70W |      0MiB / 15360MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces
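
The same check can be done from Python, which is handy if you want the notebook to stop early when no GPU is attached. This is a minimal sketch: `gpu_available` is a hypothetical helper, and the `nvidia-smi` fallback is only a heuristic, since finding the tool on the path does not guarantee a usable GPU.

```python
import shutil

def gpu_available() -> bool:
    """Best-effort GPU check (hypothetical helper, not part of PLAID-X)."""
    try:
        # Preferred: ask PyTorch directly if it is already installed.
        import torch
        return bool(torch.cuda.is_available())
    except ImportError:
        # Fallback heuristic: is the NVIDIA management tool on the path?
        return shutil.which("nvidia-smi") is not None

print("GPU available:", gpu_available())
```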

And let's install the packages!
The following command installs `ir_measures`, Hugging Face `datasets`, Google Translate (for presentation), and [PLAID-X](https://github.com/hltcoe/ColBERT-X/tree/plaid-x) from GitHub.

In [2]:
!pip install -q -U --progress-bar on ir_measures datasets googletrans==3.1.0a0 git+https://github.com/hltcoe/ColBERT-X.git@plaid-x


After installation, let's download the dataset. The [NeuCLIR 1 Collection](https://huggingface.co/datasets/neuclir/neuclir1) is publicly available on Hugging Face Datasets. Topics and qrels are available on the TREC website, and we download them directly from there.

However, indexing the entire NeuCLIR Chinese collection would take too long. For demonstration, we use only the first 40k documents in this tutorial.

In [3]:
# Download topics and qrels from NIST
!wget -q --show-progress https://trec.nist.gov/data/neuclir/topics.0720.utf8.jsonl
!wget -q --show-progress https://trec.nist.gov/data/neuclir/2022-qrels.zho

import json
import pandas as pd
from tqdm.auto import tqdm

import ir_measures as irms
from datasets import load_dataset

# Only loading the first 40k docs from HF Datasets
ds = load_dataset('neuclir/neuclir1', split='zho', streaming=True) # total 3179209
doc_subset = [ o for i, o in zip(tqdm(range(40_000), desc='Loading first 40k docs from NeuCLIR Chinese Collection'), ds) ]
subset_doc_ids = set([ d['id'] for d in doc_subset ])

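use_topic = '66' # use topic 66 as demo -- expecting to have 9 relevant docs

# Keep only the judgments for the demo topic whose documents are in the 40k subset
qrels = pd.DataFrame([ l for l in irms.read_trec_qrels('2022-qrels.zho') if l.query_id == use_topic and l.doc_id in subset_doc_ids ])
# Read the topics inside a context manager so the file handle is closed afterwards
with open("topics.0720.utf8.jsonl") as f:
    topics = [ t for t in map(json.loads, f) if t['topic_id'] == use_topic ]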



Downloading builder script:   0%|          | 0.00/1.31k [00:00<?, ?B/s]

Downloading readme:   0%|          | 0.00/1.48k [00:00<?, ?B/s]

Loading first 40k docs from NeuCLIR Chinese Collection:   0%|          | 0/40000 [00:00<?, ?it/s]
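
The `zip(range(40_000), ds)` idiom above stops as soon as the range is exhausted, so only the first 40k records are ever pulled from the stream. `itertools.islice` expresses the same idea more explicitly. The sketch below uses a stand-in generator (`fake_stream`, hypothetical) in place of the real dataset so that it runs offline.

```python
from itertools import islice

def fake_stream():
    """Stand-in for a streaming dataset: yields records lazily, one at a time."""
    i = 0
    while True:  # pretend the collection is far too large to load fully
        yield {"id": f"doc-{i}", "text": f"text of document {i}"}
        i += 1

def take_first(stream, n):
    """Materialize only the first n records of a lazy stream."""
    return list(islice(stream, n))

subset = take_first(fake_stream(), 5)
subset_ids = {d["id"] for d in subset}  # set for fast membership tests
print(len(subset), "records loaded")
```

Keeping the IDs in a set, as the cell above does with `subset_doc_ids`, makes the later qrels filter a constant-time lookup per judgment.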

## Indexing

In this tutorial, we use a multilingual ColBERT-X model (`eugene-yang/plaidx-xlmr-large-mlir-neuclir`) trained on Chinese, Persian, and Russian for the NeuCLIR track. If you are interested in the details, we [published a paper at ECIR 2023](https://arxiv.org/abs/2209.01335) on this model, although the version described there was trained for the CLEF languages.

In [4]:
from colbert.infra import ColBERTConfig

from colbert.data import Collection
from colbert import Indexer, Searcher

Since the system RAM on the Colab VM is quite limited, let's precompile the C++ extensions now to keep peak memory usage under the limit. This process takes around 2 minutes.

In [5]:
from colbert.indexing.codecs.residual import ResidualCodec
ResidualCodec.try_load_torch_extensions(True)

[Jul 04, 06:19:09] Loading decompress_residuals_cpp extension (set COLBERT_LOAD_TORCH_EXTENSION_VERBOSE=True for more info)...
[Jul 04, 06:20:42] Loading packbits_cpp extension (set COLBERT_LOAD_TORCH_EXTENSION_VERBOSE=True for more info)...


We then create the collection object and the indexer. To avoid running out of memory, we cap the batch size at 64. If you are running on your own machine, you may be able to increase the batch size to speed up indexing.

In [6]:
collection = Collection.cast([ l['text'] for l in doc_subset ])
indexer = Indexer(checkpoint='eugene-yang/plaidx-xlmr-large-mlir-neuclir', config=ColBERTConfig(bsize=64))

Downloading (…)in/artifact.metadata:   0%|          | 0.00/1.80k [00:00<?, ?B/s]

Indexing is broken into two parts. The preparation step first calculates the cluster centroids. The actual indexing step then indexes the collection according to those centroids.
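
To give some intuition for what "indexing according to the centroids" means, here is a deliberately simplified sketch of centroid-plus-residual compression with one bit per dimension (the run in this notebook uses `"nbits": 1`). It is not the PLAID-X implementation: the function names are hypothetical, and the fixed `step` used for reconstruction is an assumption standing in for the bucket weights that the real indexer learns from data.

```python
def nearest_centroid(vec, centroids):
    """Index of the centroid with the smallest squared distance to vec."""
    def sq_dist(c):
        return sum((v - x) ** 2 for v, x in zip(vec, c))
    return min(range(len(centroids)), key=lambda i: sq_dist(centroids[i]))

def compress(vec, centroids):
    """Store a centroid id plus one sign bit per dimension of the residual."""
    code = nearest_centroid(vec, centroids)
    bits = [1 if v - c >= 0 else 0 for v, c in zip(vec, centroids[code])]
    return code, bits

def decompress(code, bits, centroids, step):
    """Approximate the vector as centroid +/- step in each dimension."""
    return [c + (step if b else -step) for c, b in zip(centroids[code], bits)]

centroids = [[0.0, 0.0], [1.0, 1.0]]   # toy centroids; the real index has thousands
embedding = [0.9, 1.2]                 # toy token embedding

code, bits = compress(embedding, centroids)
approx = decompress(code, bits, centroids, step=0.1)
print(code, bits, approx)
```

The point of the two-step design is that each token embedding is stored as a small integer plus a few bits instead of a full float vector, which is what keeps the index compact.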

In [7]:
# This command will run for ~10 mins
indexer.prepare(name='neuclir.zho.40k', collection=collection, overwrite=True)



[Jul 04, 06:22:01] #> Creating directory /content/experiments/default/indexes/neuclir.zho.40k 


{
    "query_token_id": "[unused0]",
    "doc_token_id": "[unused1]",
    "query_token": "[Q]",
    "doc_token": "[D]",
    "ncells": null,
    "centroid_score_threshold": null,
    "ndocs": null,
    "index_path": null,
    "nbits": 1,
    "kmeans_niters": 4,
    "resume": false,
    "similarity": "l2",
    "bsize": 64,
    "accumsteps": 1,
    "lr": 5e-6,
    "maxsteps": 400000,
    "save_every": null,
    "warmup": null,
    "warmup_bert": null,
    "relu": false,
    "nway": 2,
    "use_ib_negatives": false,
    "reranker": false,
    "distillation_alpha": 1.0,
    "ignore_scores": false,
    "model_name": "xlm-roberta-large",
    "force_resize_embeddings": true,
    "query_maxlen": 32,
    "attend_to_mask_tokens": false,
    "interaction": "colbert",
    "dim": 128,
    "doc_maxlen": 180,
    "mask_punctuation": true,
    "checkpoint": "eugene-yang\/plaidx-xlmr-large-mlir-neuclir",
 

Downloading (…)lve/main/config.json:   0%|          | 0.00/616 [00:00<?, ?B/s]

Downloading (…)lve/main/config.json:   0%|          | 0.00/698 [00:00<?, ?B/s]

Downloading pytorch_model.bin:   0%|          | 0.00/2.24G [00:00<?, ?B/s]

Downloading (…)okenizer_config.json:   0%|          | 0.00/452 [00:00<?, ?B/s]

Downloading tokenizer.json:   0%|          | 0.00/17.1M [00:00<?, ?B/s]

Downloading (…)cial_tokens_map.json:   0%|          | 0.00/280 [00:00<?, ?B/s]

[Jul 04, 06:24:54] [0] 		 # of sampled PIDs = 35055 	 sampled_pids[:3] = [13311, 7611, 13726]
[Jul 04, 06:24:54] [0] 		 #> Encoding 35055 passages..
[Jul 04, 06:30:31] [0] 		 avg_doclen_est = 171.46205139160156 	 len(local_sample) = 35,055
[Jul 04, 06:30:42] [0] 		 Creaing 32,768 partitions.
[Jul 04, 06:30:42] [0] 		 *Estimated* 6,858,482 embeddings.
[Jul 04, 06:30:42] [0] 		 #> Saving the indexing plan to /content/experiments/default/indexes/neuclir.zho.40k/plan.json ..

'/content/experiments/default/indexes/neuclir.zho.40k'
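
The sample size, embedding estimate, and partition count in the log above can be reproduced with a little arithmetic. The sketch below assumes the sizing heuristics of the upstream ColBERT indexer as we understand them: a typical document length of 120 tokens when choosing how many passages to sample, and a power-of-two partition count near `16 * sqrt(#embeddings)`. Treat these formulas as assumptions rather than documented API; the function names are hypothetical.

```python
import math

def num_sampled_pids(num_passages, typical_doclen=120):
    """How many passages to sample for k-means (assumed heuristic)."""
    sampled = 16 * math.sqrt(typical_doclen * num_passages)
    return min(1 + int(sampled), num_passages)

def num_partitions(num_embeddings_est):
    """Largest power of two not exceeding 16 * sqrt(#embeddings) (assumed heuristic)."""
    return int(2 ** math.floor(math.log2(16 * math.sqrt(num_embeddings_est))))

num_passages = 40_000
avg_doclen_est = 171.46205139160156   # copied from the log above

num_embeddings_est = num_passages * avg_doclen_est

print(num_sampled_pids(num_passages))      # 35055
print(int(num_embeddings_est))             # 6858482
print(num_partitions(num_embeddings_est))  # 32768
```

All three values match the log lines for the sampled PIDs, the estimated embeddings, and the number of partitions.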

In [8]:
# This command takes ~7 mins.
indexer.index(name='neuclir.zho.40k', collection=collection)

{
    "query_token_id": "[unused0]",
    "doc_token_id": "[unused1]",
    "query_token": "[Q]",
    "doc_token": "[D]",
    "ncells": null,
    "centroid_score_threshold": null,
    "ndocs": null,
    "index_path": null,
    "nbits": 1,
    "kmeans_niters": 4,
    "resume": true,
    "similarity": "l2",
    "bsize": 64,
    "accumsteps": 1,
    "lr": 5e-6,
    "maxsteps": 400000,
    "save_every": null,
    "warmup": null,
    "warmup_bert": null,
    "relu": false,
    "nway": 2,
    "use_ib_negatives": false,
    "reranker": false,
    "distillation_alpha": 1.0,
    "ignore_scores": false,
    "model_name": "xlm-roberta-large",
    "force_resize_embeddings": true,
    "query_maxlen": 32,
    "attend_to_mask_tokens": false,
    "interaction": "colbert",
    "dim": 128,
    "doc_maxlen": 180,
    "mask_punctuation": true,
    "checkpoint": "eugene-yang\/plaidx-xlmr-large-mlir-neuclir",
    "triples": "\/expscratch\/eyang\/workspace\/clir-pretrain\/multilingual\/mixed_msmarco\/hc4_combi

0it [00:00, ?it/s]

[Jul 04, 06:33:01] [0] 		 #> Encoding 25000 passages..
[Jul 04, 06:37:04] [0] 		 #> Saving chunk 0: 	 25,000 passages and 4,290,147 embeddings. From #0 onward.


1it [04:06, 246.92s/it]

[Jul 04, 06:37:08] [0] 		 #> Encoding 15000 passages..
[Jul 04, 06:39:34] [0] 		 #> Saving chunk 1: 	 15,000 passages and 2,570,372 embeddings. From #25,000 onward.


2it [06:35, 197.63s/it]

[Jul 04, 06:39:36] [0] 		 #> Checking all files were saved...
[Jul 04, 06:39:36] [0] 		 Found all files!
[Jul 04, 06:39:36] [0] 		 #> Building IVF...
[Jul 04, 06:39:36] [0] 		 #> Loading codes...



100%|██████████| 2/2 [00:00<00:00, 111.14it/s]

[Jul 04, 06:39:36] [0] 		 Sorting codes...





[Jul 04, 06:39:37] [0] 		 Getting unique codes...
[Jul 04, 06:39:37] #> Optimizing IVF to store map from centroids to list of pids..
[Jul 04, 06:39:37] #> Building the emb2pid mapping..
[Jul 04, 06:39:37] len(emb2pid) = 6860519


100%|██████████| 32768/32768 [00:00<00:00, 39902.16it/s]

[Jul 04, 06:39:38] #> Saved optimized IVF to /content/experiments/default/indexes/neuclir.zho.40k/ivf.pid.pt
[Jul 04, 06:39:38] [0] 		 #> Saving the indexing metadata to /content/experiments/default/indexes/neuclir.zho.40k/metadata.json ..





'/content/experiments/default/indexes/neuclir.zho.40k'

And we are done! The default index location is `experiments/default/indexes/neuclir.zho.40k/`, but you can change it by passing `index_root` to the `ColBERTConfig` object.

In [9]:
!ls ./experiments/default/indexes/neuclir.zho.40k/

0.codes.pt	 1.metadata.json  centroids.pt	  metadata.json
0.metadata.json  1.residuals.pt   doclens.0.json  plan.json
0.residuals.pt	 avg_residual.pt  doclens.1.json
1.codes.pt	 buckets.pt	  ivf.pid.pt
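
The default location follows a simple pattern: a root directory, an experiment name, `indexes`, and the index name. The sketch below mirrors that pattern in plain Python so that the effect of overriding `index_root` is easy to see. `resolve_index_path` is a hypothetical helper written for this illustration, not a function from the library, and the default values for `root` and `experiment` are inferred from the paths printed above.

```python
import posixpath

def resolve_index_path(name, root="experiments", experiment="default", index_root=None):
    """Return where an index named `name` would live (illustrative only)."""
    if index_root is not None:
        # An explicit index_root replaces the root/experiment/indexes prefix.
        base = index_root
    else:
        base = posixpath.join(root, experiment, "indexes")
    return posixpath.join(base, name)

# Default layout, as seen in this notebook
print(resolve_index_path("neuclir.zho.40k", root="/content/experiments"))

# With a custom index_root (the path here is made up for the example)
print(resolve_index_path("neuclir.zho.40k", index_root="/data/my_indexes"))
```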


## Searching

Finally, we search our index with a query. In this tutorial, we use topic `66` as an example.

In [10]:
searcher = Searcher(index='neuclir.zho.40k', collection=collection)

[Jul 04, 06:40:02] #> Loading codec...
[Jul 04, 06:40:02] #> Loading IVF...
[Jul 04, 06:40:02] #> Loading doclens...


100%|██████████| 2/2 [00:00<00:00, 239.96it/s]

[Jul 04, 06:40:02] #> Loading codes and residuals...



100%|██████████| 2/2 [00:00<00:00,  3.55it/s]
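
For intuition about what a search computes: ColBERT-style models score a document by late interaction, where each query token embedding is matched to its most similar document token embedding and the per-token maxima are summed. The sketch below shows that generic formulation on toy 2-dimensional vectors. It is only an illustration; the actual PLAID-X search path adds centroid-based candidate pruning and residual decompression, and `maxsim_score` is a hypothetical name.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def maxsim_score(query_vecs, doc_vecs):
    """Sum over query tokens of the best-matching document token similarity."""
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)

# Toy unit-length token embeddings
query = [[1.0, 0.0], [0.0, 1.0]]
docs = {
    "doc_a": [[1.0, 0.0], [0.6, 0.8]],   # covers both query tokens reasonably well
    "doc_b": [[0.0, 1.0], [0.0, 1.0]],   # only matches the second query token
}

scores = {doc_id: maxsim_score(query, vecs) for doc_id, vecs in docs.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
print(scores, ranking)
```

Because every query token contributes its own best match, a document is rewarded for covering all parts of the query rather than for one strong match.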


In [11]:
raw_scores = searcher.search_all({ t['topic_id']: t['topics'][0]['topic_title'] for t in topics }, k=2500)


#> QueryTokenizer.tensorize(batch_text[0], batch_background[0], bsize) ==
#> Input: . COVID-19 vaccination rate in China, 		 True, 		 None
#> Output IDs: torch.Size([32]), tensor([     0, 250002,      5,   8244,  74116,   8363,  51294,   2320,  34515,
            23,   9098,      2, 250001, 250001, 250001, 250001, 250001, 250001,
        250001, 250001, 250001, 250001, 250001, 250001, 250001, 250001, 250001,
        250001, 250001, 250001, 250001, 250001])
#> Output Mask: torch.Size([32]), tensor([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0])



100%|██████████| 1/1 [00:00<00:00, 11.20it/s]
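
The tokenizer output above shows the query being padded to a fixed length of 32 (`"query_maxlen": 32`): the 12 real token IDs are followed by 20 copies of ID `250001`, and the mask marks only the real tokens. The sketch below reproduces that shape. We assume `250001` is the tokenizer's mask token, and `pad_query` is a hypothetical helper, not the library's tokenizer.

```python
def pad_query(token_ids, maxlen=32, pad_id=250001):
    """Pad (or naively truncate) a query to maxlen and build its attention mask."""
    ids = list(token_ids[:maxlen])
    n_pad = maxlen - len(ids)
    attention_mask = [1] * len(ids) + [0] * n_pad
    return ids + [pad_id] * n_pad, attention_mask

# Real token IDs copied from the tokenizer output above
query_ids = [0, 250002, 5, 8244, 74116, 8363, 51294, 2320, 34515, 23, 9098, 2]

padded, mask = pad_query(query_ids)
print(len(padded), sum(mask), padded.count(250001))  # 32 12 20
```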


We assemble the search results into the format that `ir_measures` expects and evaluate them with nDCG@20 and R@100.

In [12]:
run = {
    qid: {
        doc_subset[didx]['id']: score
        for didx, _, score in ranking
    }
    for qid, ranking in raw_scores.items()
}
irms.calc_aggregate([irms.nDCG@20, irms.R@100], qrels, run)

{R@100: 0.7777777777777778, nDCG@20: 0.7290602700185145}
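
To make the two numbers concrete, here is a small pure-Python sketch of recall@k and nDCG@k using a common formulation (gain equal to the relevance grade, discount `log2(rank + 1)`). The details inside `ir_measures` may differ, so treat this as an illustration; the function names and the toy data are made up. Note that the R@100 above is consistent with 7 of the 9 relevant documents appearing in the top 100.

```python
import math

def recall_at_k(qrels, ranking, k):
    """Fraction of relevant documents (grade > 0) found in the top k."""
    relevant = {doc for doc, grade in qrels.items() if grade > 0}
    if not relevant:
        return 0.0
    hits = sum(1 for doc in ranking[:k] if doc in relevant)
    return hits / len(relevant)

def ndcg_at_k(qrels, ranking, k):
    """DCG of the ranking divided by the DCG of an ideal ordering."""
    dcg = sum(qrels.get(doc, 0) / math.log2(i + 2) for i, doc in enumerate(ranking[:k]))
    ideal = sorted(qrels.values(), reverse=True)[:k]
    idcg = sum(grade / math.log2(i + 2) for i, grade in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0

# Toy graded judgments and two rankings
toy_qrels = {"a": 3, "b": 2, "c": 0, "d": 1}
perfect = ["a", "b", "d", "c"]
swapped = ["c", "a", "b", "d"]
print(ndcg_at_k(toy_qrels, perfect, 20), ndcg_at_k(toy_qrels, swapped, 20))

# Nine relevant documents, seven of them retrieved
nine_rel = {f"rel{i}": 1 for i in range(9)}
retrieved = [f"rel{i}" for i in range(7)] + ["other1", "other2"]
print(recall_at_k(nine_rel, retrieved, 100))
```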

Let's pull out the top 5 documents and see how good they are!

In [13]:
# Each ranking entry is (document index, rank, score); look up the demo topic by variable
top5 = [
    {**doc_subset[didx], 'score': score, 'rank': rank}
    for didx, rank, score in raw_scores.data[use_topic][:5]
]

top5

[{'id': '857abc3e-5f0b-4c59-83e4-aadee93e63f2',
  'cc_file': 'crawl-data/CC-NEWS/2021/07/CC-NEWS-20210714102729-00668.warc.gz',
  'time': '2021-07-14T18:21:30+00:00',
  'title': '沒打疫苗步步難行 未接種者在中國多地出行受限',
  'text': '中國的COVID-19疫苗接種已逾14億劑次，隨著疫苗施打的普及，多地政府相繼對未接種者作出限制，過去一週，江西、浙江、安徽都發布告知說，沒接種疫苗將影響出行。\n\n據中國國家衛生健康委員會官網，截至13日，中國大陸2019冠狀病毒疾病（COVID-19）疫苗接種累計14億201萬9000劑次。\n\n隨著疫苗施打的普及，各地也對未接種民眾作出限制。\n\n綜合澎湃新聞等陸媒報導，浙江省寧波市寧海縣衛生和計畫生育局官方微信公眾號公布，疫情防控辦公室11日發布通知：25日起，原則上不允許未接種疫苗者進入醫療機構住院部、養老院、學校（幼兒園、托兒所、校外培訓機構）、圖書館、博物館、監所等重點場所。\n\n浙江麗水市青田縣8日也公布了類似通知：21日起，不允許未接種者進入醫療機構住院部、養老院、托兒所、學校（幼兒園、校外培訓機構）、圖書館、博物館、監所等重點場所。\n\n江西省撫州市崇仁縣的最新通知則說，將在商場、景區、車站、影院等公共場所實行掃「贛通碼」查看疫苗接種紀錄，居民若未接種疫苗，將對生活和出行帶來不便。\n\n另據江西贛州定南縣的通告，26日起，不允許未接種疫苗者進入超市、醫院、學校、車站等；贛州的安遠縣也發通告指，26日起不允許未接種疫苗者進入超市、醫院、學校、車站等重點公共場所。\n\n安徽省黃山市休寧縣衛健委昨天也通知，8月1日起將在全縣範圍內，對出入超市、市場、銀行、賓館酒店、電影院、醫院、藥店、理髮店、政務大廳等各類公共場所人員和乘坐公車的市民展開疫苗接種查驗（安康碼）。',
  'url': 'https://udn.com/news/story/121707/5601597',
  'score': 27.09375,
  'rank': 1},
 {'id': 'a75b38b3-7982-4167-8a84-8564b

Well, you might not be able to read Chinese (which is exactly why we need CLIR!). But we can leverage Google Translate!

In [14]:
from googletrans import Translator

# Reuse one Translator instead of creating a new one for every call
translator = Translator()

def translate(x):
    return translator.translate(x, src='zh-tw', dest='en').text

[
    {**d, 'title': translate(d['title']), 'text': translate(d['text'])}
    for d in top5
]

[{'id': '857abc3e-5f0b-4c59-83e4-aadee93e63f2',
  'cc_file': 'crawl-data/CC-NEWS/2021/07/CC-NEWS-20210714102729-00668.warc.gz',
  'time': '2021-07-14T18:21:30+00:00',
  'title': 'It is difficult to walk without vaccination, and the travel of unvaccinated people is restricted in many places in China',
  'text': 'China has received more than 1.4 billion doses of COVID-19 vaccinations. With the popularity of vaccinations, governments in many places have successively imposed restrictions on those who have not been vaccinated. In the past week, Jiangxi, Zhejiang, and Anhui have all issued notices saying that those who have not been vaccinated affect travel.\n\nAccording to the official website of the National Health Commission of China, as of the 13th, a total of 1,420,190,000 doses of vaccinations against the coronavirus disease 2019 (COVID-19) have been administered in mainland China.\n\nWith the popularization of vaccination, various places have also imposed restrictions on unvaccinated 

And there you go! That's how to run a CLIR experiment with PLAID-X!