<a href="https://colab.research.google.com/github/linool/msc-2024-li/blob/main/clir_tutorial_bm25.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# BM25 model for CLIR

This notebook walks through a cross-language information retrieval (CLIR) example: we use a BM25 model with query translation to generate a ranked list over the NeuCLIR Chinese collection.



## Setup
These steps replicate the setup from the official Anserini [notebook](https://github.com/castorini/anserini-notebooks/blob/master/anserini_robust04_demo.ipynb).

First, install Maven (Java 11 comes pre-installed):


In [None]:
%%capture
!apt-get install maven -qq

Clone and build Anserini:

In [None]:
%%capture
!git clone --recurse-submodules https://github.com/castorini/anserini.git
%cd anserini
!cd tools/eval && tar xvfz trec_eval.9.0.4.tar.gz && cd trec_eval.9.0.4 && make && cd ../../..
!mvn clean package appassembler:assemble -DskipTests -Dmaven.javadoc.skip=true

If all goes well, you should see `anserini-X.Y.Z-SNAPSHOT-fatjar.jar` in `target/`:

In [None]:
!ls target

anserini-0.21.1-SNAPSHOT-fatjar.jar   classes		      maven-status
anserini-0.21.1-SNAPSHOT.jar	      generated-sources       test-classes
anserini-0.21.1-SNAPSHOT-sources.jar  generated-test-sources
appassembler			      maven-archiver


Let's install the packages!
The following command installs `ir_measures`, Hugging Face `datasets`, Google Translate (for presentation), and Hugging Face Transformers.

In [None]:
!pip install -q -U --progress-bar on ir_measures transformers datasets googletrans==3.1.0a0


After installation, let's download the dataset. The NeuCLIR 1 collection is publicly available on Hugging Face Datasets. Topics and qrels are available on the TREC website, so we download them directly from there.

Indexing the entire NeuCLIR Chinese collection (about 3.18 million documents) would take too long. For this demonstration, we use only the first 40k documents.
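The subsetting step relies on taking only the first few records from a lazily streamed source. As a minimal sketch of that idea, the snippet below uses `itertools.islice` on a generator. The `fake_stream` generator and the `take_first` helper are hypothetical stand-ins for illustration; they are not part of the `datasets` API.

```python
from itertools import islice


def fake_stream():
    """Hypothetical stand-in for a streamed dataset: yields records lazily, without end."""
    i = 0
    while True:
        yield {"id": f"doc-{i}", "title": f"title {i}", "text": f"text {i}"}
        i += 1


def take_first(iterable, n):
    """Return the first n items of an iterable as a list, consuming no more than n."""
    return list(islice(iterable, n))


subset = take_first(fake_stream(), 5)
subset_ids = {d["id"] for d in subset}  # a set gives fast membership tests for qrels filtering
print(len(subset), sorted(subset_ids))
```

Because the source is consumed lazily, the rest of the collection is never downloaded or held in memory.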

In [None]:
# Download topics and qrels from NIST
!wget -q --show-progress https://trec.nist.gov/data/neuclir/topics.0720.utf8.jsonl
!wget -q --show-progress https://trec.nist.gov/data/neuclir/2022-qrels.zho

import json
import pandas as pd
from tqdm.auto import tqdm

import ir_measures as irms
from datasets import load_dataset

# Only loading the first 40k docs from HF Datasets
ds = load_dataset('neuclir/neuclir1', split='zho', streaming=True) # total 3179209
doc_subset = [ o for i, o in zip(tqdm(range(40_000), desc='Loading first 40k docs from NeuCLIR Chinese Collection'), ds) ]
subset_doc_ids = set([ d['id'] for d in doc_subset ])

use_topic = '66' # use topic 66 as demo -- expecting to have 9 relevant docs

qrels = pd.DataFrame([ l for l in irms.read_trec_qrels('2022-qrels.zho') if l.query_id == use_topic and l.doc_id in subset_doc_ids ])
topics = [ t for t in map(json.loads, open("topics.0720.utf8.jsonl")) if t['topic_id'] == use_topic ]



Loading first 40k docs from NeuCLIR Chinese Collection:   0%|          | 0/40000 [00:00<?, ?it/s]

Here we create helper functions so we can obtain the query and document text more conveniently.




In [None]:
topic_id_idx = { t['topic_id']: i for i, t in enumerate(topics) }
def get_query_by_topic_id(topic_id, query_type='title', lang="eng"):
    for topic in topics[ topic_id_idx[topic_id] ]['topics']:
        if topic["lang"] == lang:
            return topic[f'topic_{query_type}']

doc_id_to_idx = { d['id']: i for i, d in enumerate(doc_subset) }
def get_doc_text_by_doc_id(doc_id):
    doc = doc_subset[ doc_id_to_idx[doc_id] ]
    return doc['title'] + ' ' + doc['text']
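To make the lookup logic concrete, here is a small sketch of the same pattern on a toy topic record. The field names (`topic_id`, `topics`, `lang`, `topic_title`) mirror those used by the helper above, while the topic id, the query text, and the names `toy_topics` and `get_toy_query` are made up for illustration.

```python
# Toy topic record: same field names as the helper above, invented content.
toy_topics = [
    {
        "topic_id": "demo-1",
        "topics": [
            {"lang": "eng", "topic_title": "solar power subsidies"},
            {"lang": "zho", "topic_title": "太阳能补贴"},
        ],
    }
]

# Map each topic id to its position in the list for constant-time lookup.
toy_topic_idx = {t["topic_id"]: i for i, t in enumerate(toy_topics)}


def get_toy_query(topic_id, query_type="title", lang="eng"):
    """Return the first query of the requested type and language, or None if absent."""
    for topic in toy_topics[toy_topic_idx[topic_id]]["topics"]:
        if topic["lang"] == lang:
            return topic[f"topic_{query_type}"]
    return None


print(get_toy_query("demo-1"))
print(get_toy_query("demo-1", lang="zho"))
```

Note that the lookup returns the first entry whose language matches, and returns `None` when no entry matches, so it is worth checking the result before writing it to a topic file.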

## Indexing

We first index the NeuCLIR Chinese document subset using Anserini.


In [None]:
!mkdir -p collection

Create the JSONL file for Anserini, with one JSON object per line holding the `id` and `contents` fields:

In [None]:
import json
with open("collection/zho_neuclir_subset.jsonl", "w") as f:
  for doc_id in tqdm(doc_id_to_idx, total = len(doc_id_to_idx)):
    content = get_doc_text_by_doc_id(doc_id)
    text = json.dumps({"id": doc_id, "contents": content})
    f.write(text+"\n")

  0%|          | 0/40000 [00:00<?, ?it/s]
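As a quick sanity check on the file format, the sketch below writes two toy documents as JSONL and reads them back. The documents and the temporary path are invented for illustration. It also passes `ensure_ascii=False`, an optional choice that keeps the Chinese text human-readable in the file; the default escaped form used above is equally valid JSON.

```python
import json
import os
import tempfile

# Toy documents with the same two fields written above: "id" and "contents".
docs = [
    {"id": "toy-1", "contents": "标题一 正文一"},
    {"id": "toy-2", "contents": "标题二 正文二"},
]

tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "toy.jsonl")

# One JSON object per line; an explicit encoding avoids platform-dependent defaults.
with open(path, "w", encoding="utf-8") as f:
    for doc in docs:
        f.write(json.dumps(doc, ensure_ascii=False) + "\n")

# Read the file back and parse each line independently.
with open(path, encoding="utf-8") as f:
    loaded = [json.loads(line) for line in f]

print(loaded == docs)
```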

Start indexing the Chinese documents. When indexing finishes, the log should report 40,000 documents indexed.

In [None]:
!sh target/appassembler/bin/IndexCollection \
  -collection JsonCollection \
  -generator DefaultLuceneDocumentGenerator \
  -threads 9 \
  -input collection \
  -index indexes/zho_neuclir_subset_bm25 \
  -storePositions \
  -storeDocvectors \
  -storeRaw \
  -language zh

2023-07-09 19:38:22,011 INFO  [main] index.IndexCollection (IndexCollection.java:380) - Setting log level to INFO
2023-07-09 19:38:22,017 INFO  [main] index.IndexCollection (IndexCollection.java:383) - Starting indexer...
2023-07-09 19:38:22,019 INFO  [main] index.IndexCollection (IndexCollection.java:385) - DocumentCollection path: collection
2023-07-09 19:38:22,020 INFO  [main] index.IndexCollection (IndexCollection.java:386) - CollectionClass: JsonCollection
2023-07-09 19:38:22,022 INFO  [main] index.IndexCollection (IndexCollection.java:387) - Generator: DefaultLuceneDocumentGenerator
2023-07-09 19:38:22,024 INFO  [main] index.IndexCollection (IndexCollection.java:388) - Threads: 9
2023-07-09 19:38:22,026 INFO  [main] index.IndexCollection (IndexCollection.java:389) - Language: zh
2023-07-09 19:38:22,026 INFO  [main] index.IndexCollection (IndexCollection.java:390) - Stemmer: porter
2023-07-09 19:38:22,027 INFO  [main] index.IndexCollection (IndexCollection.java:391) - Keep stopwor

## Retrieval

With the Chinese documents indexed, we generate a ranked list for a translated Chinese query using the BM25 model.

In [None]:
!mkdir -p runs

Get the translated Chinese query for the chosen topic (`use_topic`, here 66). See this [cell](https://colab.research.google.com/drive/1u_8ESzz_f26toFy45m17UQRZXGVqMt0B#scrollTo=PI64O_uLCK_o&line=19&uniqifier=1) for more details.

In [None]:
topic_text = get_query_by_topic_id(use_topic, lang="zho")

Write the topic to a text file in TSV format, with one tab-separated topic id and query per line:

In [None]:
with open("zho_topics.txt", "w") as f:
  f.write(f"{use_topic}\t{topic_text}\n")
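A stray tab or newline inside the query would break this one-line-per-topic layout. The sketch below shows one way to guard against that; `format_topic_line` and `parse_topic_line` are hypothetical helpers written for this illustration, not part of Anserini.

```python
def format_topic_line(topic_id, query):
    """Format one topic as '<id><TAB><query>' plus newline, collapsing whitespace in the query."""
    clean_query = " ".join(query.split())  # tabs and newlines become single spaces
    return f"{topic_id}\t{clean_query}\n"


def parse_topic_line(line):
    """Split a topic line back into (topic_id, query) at the first tab."""
    topic_id, query = line.rstrip("\n").split("\t", 1)
    return topic_id, query


line = format_topic_line("66", "first part\tsecond part\nthird part")
print(repr(line))
print(parse_topic_line(line))
```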

Perform retrieval using Anserini's BM25 with default hyperparameters (the log below reports `k1=0.9` and `b=0.4`):

In [None]:
!sh target/appassembler/bin/SearchCollection \
  -index indexes/zho_neuclir_subset_bm25 \
  -topics zho_topics.txt \
  -topicreader TsvInt \
  -output runs/zho_neuclir_subset_bm25.title.txt \
  -bm25 \
  -language zh

2023-07-09 19:43:46,881 INFO  [main] search.SearchCollection (SearchCollection.java:951) - Index: indexes/zho_neuclir_subset_bm25
2023-07-09 19:43:47,187 INFO  [main] search.SearchCollection (SearchCollection.java:955) - Fields: []
2023-07-09 19:43:47,189 INFO  [main] search.SearchCollection (SearchCollection.java:686) - Using language-specific analyzer
2023-07-09 19:43:47,190 INFO  [main] search.SearchCollection (SearchCollection.java:687) - Language: zh
2023-07-09 19:43:47,224 INFO  [main] search.SearchCollection (SearchCollection.java:1227) - runtag: Anserini
2023-07-09 19:43:47,911 INFO  [pool-2-thread-1] search.SearchCollection$SearcherThread (SearchCollection.java:883) - ranker: bm25(k1=0.9,b=0.4), reranker: default: 1 queries processed in 00:00:00 = ~1.59 q/s
2023-07-09 19:43:47,948 INFO  [main] search.SearchCollection (SearchCollection.java:1439) - Total run time: 00:00:01
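The log reports the ranker as `bm25(k1=0.9,b=0.4)`. To show what those two parameters control, here is a minimal sketch of a textbook BM25 scorer on toy documents. It is not Anserini's or Lucene's implementation: the character-bigram tokenisation, the exact IDF form, and the names `char_bigrams` and `bm25_scores` are assumptions made for illustration.

```python
import math
from collections import Counter


def char_bigrams(text):
    """Split text into overlapping character bigrams (assumed tokenisation for this sketch)."""
    chars = [c for c in text if not c.isspace()]
    if len(chars) < 2:
        return chars
    return ["".join(chars[i:i + 2]) for i in range(len(chars) - 1)]


def bm25_scores(query, docs, k1=0.9, b=0.4):
    """Score every document in `docs` (id -> text) against `query` with textbook BM25."""
    tokenised = {doc_id: char_bigrams(text) for doc_id, text in docs.items()}
    n_docs = len(tokenised)
    avg_len = sum(len(toks) for toks in tokenised.values()) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(term for toks in tokenised.values() for term in set(toks))

    scores = {}
    for doc_id, toks in tokenised.items():
        tf = Counter(toks)
        score = 0.0
        for term in set(char_bigrams(query)):  # each distinct query term counted once
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # b controls length normalisation; k1 controls term-frequency saturation.
            norm = k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + norm)
        scores[doc_id] = score
    return scores


toy_docs = {
    "d1": "太阳能发电补贴政策",
    "d2": "风力发电项目",
    "d3": "足球比赛结果",
}
toy_scores = bm25_scores("太阳能补贴", toy_docs)
ranking = sorted(toy_scores.items(), key=lambda kv: kv[1], reverse=True)
print(ranking)
```

Only `d1` shares bigrams with the query, so it is the only document with a non-zero score in this toy setup.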


Scored against the filtered qrels, this BM25 result is only modest: an nDCG@20 of 0.1483 and an AP of 0.0684.

In [None]:
to_rerank = pd.DataFrame([ l for l in irms.read_trec_run("runs/zho_neuclir_subset_bm25.title.txt")])

irms.calc_aggregate([irms.nDCG@20, irms.AP], qrels, to_rerank)

{nDCG@20: 0.1482972305701491, AP: 0.06837054789182448}
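To give some intuition for the two numbers, the following sketch computes nDCG@k and average precision by hand on a toy ranking. It assumes linear gain with a log2 rank discount for nDCG, which may differ in detail from what `ir_measures` computes, and the function names and toy judgments are invented for illustration.

```python
import math


def dcg_at_k(gains, k):
    """Discounted cumulative gain over the first k gains (linear gain, log2 discount)."""
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains[:k], start=1))


def ndcg_at_k(ranked_doc_ids, relevance, k):
    """nDCG@k: DCG of the ranking divided by the DCG of an ideal ordering of the judged docs."""
    gains = [relevance.get(d, 0) for d in ranked_doc_ids]
    ideal = sorted(relevance.values(), reverse=True)
    ideal_dcg = dcg_at_k(ideal, k)
    return dcg_at_k(gains, k) / ideal_dcg if ideal_dcg > 0 else 0.0


def average_precision(ranked_doc_ids, relevance):
    """AP: mean of precision values at the rank of each relevant document."""
    relevant = {d for d, r in relevance.items() if r > 0}
    if not relevant:
        return 0.0
    hits, total = 0, 0.0
    for rank, d in enumerate(ranked_doc_ids, start=1):
        if d in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)  # relevant docs never retrieved contribute zero


toy_relevance = {"a": 1, "b": 1}       # judged relevant documents
toy_ranking = ["x", "a", "y", "b"]     # system output, best first
print(ndcg_at_k(toy_ranking, toy_relevance, k=20))
print(average_precision(toy_ranking, toy_relevance))
```

Both measures reward placing relevant documents near the top, so low values like the ones above indicate that few of the relevant documents for the topic appear early in the ranked list.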

# Exercise
Perform retrieval using a different topic id.

To compute a score, see this [cell](https://colab.research.google.com/drive/1u_8ESzz_f26toFy45m17UQRZXGVqMt0B#scrollTo=PI64O_uLCK_o&line=19&uniqifier=1) for how to filter the qrels so they include only the chosen topic id.

Try it out yourself here:

In [None]:
# Your solution
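use_topic = ...  # TODO: pick a different topic id, as a string
qrels = ...  # TODO: filter the qrels to that topic and to the document subset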

# And there you go!

You've learned how to run a simple BM25 retrieval model using query translation!