# Experiment on NTCIR-18 Transfer DCLR Subtask (Train Set)

This notebook shows how to run TF-IDF, BM25, and BM25 with RM3 query expansion on the train set of the NTCIR-18 Transfer DCLR Subtask using [PyTerrier](https://pyterrier.readthedocs.io/en/latest/) (v0.10.1).

## Previous Step

- `preprocess-transfer2-train.ipynb`

## Requirements

- Java

## Path

In [1]:
import os
INDEX = os.getcwd() + '/../indexes/ntcir18-transfer/train'
RUN = os.getcwd() + '/../runs/ntcir18-transfer/train'

## Datasets

In [None]:
import sys
!{sys.executable} -m pip install -q ir_datasets pandas

In [3]:

# Make the local `ntcir_transfer` module in ../datasets importable
sys.path.append(os.path.join(os.getcwd(), '../datasets'))

In [4]:
import ir_datasets
import ntcir_transfer
dataset = ir_datasets.load('ntcir-transfer/2/train')

### Queries and Qrels

- Note that not every Japanese query has relevance judgments (qrels) for the English collection.

In [None]:
import pandas as pd
pd.set_option('display.max_rows', None)
pd.set_option('display.max_colwidth', None)
pd.DataFrame(dataset.qrels_iter()).groupby('query_id')['relevance'].value_counts().unstack(fill_value=0)
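
To see concretely which topics lack judgments, the query IDs can be compared with the IDs that occur in the qrels. The following is a minimal sketch on made-up IDs; `queries_without_qrels` is a hypothetical helper, not part of `ir_datasets`.

```python
def queries_without_qrels(query_ids, qrel_query_ids):
    """Return, in original order, the query IDs that never occur in the qrels."""
    judged = set(qrel_query_ids)
    return [qid for qid in query_ids if qid not in judged]


# Toy data for illustration only (not taken from the real collection)
toy_query_ids = ['0001', '0002', '0003', '0004']
toy_qrel_query_ids = ['0001', '0001', '0003']

missing = queries_without_qrels(toy_query_ids, toy_qrel_query_ids)
print(missing)
```

With the real data, the two arguments could presumably be built as `[q.query_id for q in dataset.queries_iter()]` and `[r.query_id for r in dataset.qrels_iter()]`, assuming both record types expose a `query_id` field as the cells in this notebook suggest.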


## Query Translation

- This example uses the free tier of the [DeepL API](https://www.deepl.com/en/pro-api).
- You need to sign up and obtain an API key.

In [6]:
DEEPL_API_KEY = 'Your API Key'
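
A key written directly into a notebook is easy to leak when the notebook is shared. One alternative, sketched below, is to read it from an environment variable. The helper `load_api_key` and the variable names are assumptions for illustration; the DeepL client does not require them.

```python
import os


def load_api_key(env_var='DEEPL_API_KEY'):
    """Read the API key from an environment variable (the name is an assumption)."""
    key = os.environ.get(env_var, '').strip()
    if not key:
        raise RuntimeError(f'Set the {env_var} environment variable first')
    return key


# Demonstration with a dummy value under a throwaway variable name;
# a real key would be set outside the notebook.
os.environ['EXAMPLE_DEEPL_KEY'] = 'dummy-key-for-illustration'
print(load_api_key('EXAMPLE_DEEPL_KEY'))
```

In this notebook the result could then be assigned with `DEEPL_API_KEY = load_api_key()`.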

In [None]:
import sys
!{sys.executable} -m pip install -q deepl

In [8]:
import deepl
translator = deepl.Translator(DEEPL_API_KEY)

In [9]:
import time
import re
import pandas as pd

queries_en = []

for query_id, text_ja in dataset.queries_iter():
    text_en = translator.translate_text(text_ja, source_lang="JA", target_lang="EN-US")
    # Replace every non-alphabetic character (digits and punctuation included) with a space,
    # because Terrier's query parser does not handle such characters well
    text_en_clean = re.sub(r'\s+', ' ', re.sub(r'[^a-zA-Z]', ' ', text_en.text)).strip()
    queries_en.append({'qid': query_id, 'query': text_en_clean})
    print(f'{query_id}, ', end='', file=sys.stderr)

    time.sleep(1)

print('Done!', file=sys.stderr)

queries_en_df = pd.DataFrame(queries_en)

0001, 0002, 0003, 0004, 0005, 0006, 0007, 0008, 0009, 0010, 0011, 0012, 0013, 0014, 0015, 0016, 0017, 0018, 0019, 0020, 0021, 0022, 0023, 0024, 0025, 0026, 0027, 0028, 0029, 0030, 0031, 0032, 0033, 0034, 0035, 0036, 0037, 0038, 0039, 0040, 0041, 0042, 0043, 0044, 0045, 0046, 0047, 0048, 0049, 0050, 0051, 0052, 0053, 0054, 0055, 0056, 0057, 0058, 0059, 0060, 0061, 0062, 0063, 0064, 0065, 0066, 0067, 0068, 0069, 0070, 0071, 0072, 0073, 0074, 0075, 0076, 0077, 0078, 0079, 0080, 0081, 0082, 0083, Done!
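
The cleaning step inside the loop can be tried in isolation without calling the API. The sketch below wraps the same two regular expressions in a hypothetical helper, `clean_query`, and applies it to a made-up English string. Digits are stripped along with punctuation, so a term such as "COVID-19" is reduced to "COVID".

```python
import re


def clean_query(text):
    """Replace every non-letter with a space, then collapse whitespace runs."""
    letters_only = re.sub(r'[^a-zA-Z]', ' ', text)
    return re.sub(r'\s+', ' ', letters_only).strip()


# Made-up input for illustration (not an actual DeepL translation)
print(clean_query("What's the COVID-19 infection rate?"))
```

Because the character class covers only ASCII letters, accented letters are dropped as well, which may matter for translated proper nouns.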


### Translation Check (Optional)

In [None]:
# Join the Japanese originals with their English translations on the query ID
queries_en_df_tmp = queries_en_df.copy()
queries_en_df_tmp.set_index('qid', inplace=True)
queries_ja_df = pd.DataFrame(dataset.queries_iter())
queries_ja_df.set_index('query_id', inplace=True)
combined_df = queries_ja_df.join(queries_en_df_tmp, how='inner')
combined_df

## Experiment

### PyTerrier

In [11]:
# Set JAVA_HOME to the JDK installation directory on your machine
JAVA_HOME = 'FIT YOUR ENVIRONMENT'
os.environ['JAVA_HOME'] = JAVA_HOME
os.getenv('JAVA_HOME')

'C:\\jdk-22.0.1'

In [None]:
import sys
!{sys.executable} -m pip install -q python-terrier numpy==1.26.4 ipywidgets

In [13]:
import pandas as pd
import pyterrier as pt
# The terrier-prf package provides the RM3 implementation used in the retrieval step
if not pt.started():
    pt.init(boot_packages=["com.github.terrierteam:terrier-prf:-SNAPSHOT"], tqdm='notebook')

PyTerrier 0.10.1 has loaded Terrier 5.9 (built by craigm on 2024-05-02 17:40) and terrier-helper 0.0.8



In [14]:
dataset_pt = pt.get_dataset('irds:ntcir-transfer/2/train')

### Indexing

In [15]:
os.makedirs(INDEX, exist_ok=True)

In [16]:
indexer = pt.IterDictIndexer(INDEX)

In [17]:
%%time
indexref = indexer.index(dataset_pt.get_corpus_iter())

ntcir-transfer/2/train documents:   0%|          | 0/187080 [00:00<?, ?it/s]

CPU times: total: 10 s
Wall time: 34.2 s


In [18]:
os.listdir(INDEX)

['data.direct.bf',
 'data.document.fsarrayfile',
 'data.inverted.bf',
 'data.lexicon.fsomapfile',
 'data.lexicon.fsomaphash',
 'data.lexicon.fsomapid',
 'data.meta-0.fsomapfile',
 'data.meta.idx',
 'data.meta.zdata',
 'data.properties']

### Retrieval

In [19]:
# Load existing index files
indexref = pt.IndexFactory.of(INDEX)

In [20]:
os.makedirs(RUN, exist_ok=True)

In [21]:
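# Single-stage rankers over the same index
tfidf = pt.BatchRetrieve(indexref, wmodel="TF_IDF")
bm25 = pt.BatchRetrieve(indexref, wmodel="BM25")
# BM25 first pass >> RM3 query expansion from the top-ranked documents >> BM25 on the expanded query
bm25_rm3 = bm25 >> pt.rewrite.RM3(indexref) >> bm25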

In [22]:
%%time
from pyterrier.measures import *
pt.Experiment(
    [tfidf, bm25, bm25_rm3],
    queries_en_df,
    dataset_pt.get_qrels(),
    eval_metrics=[MRR, nDCG@10, nDCG],
    names=["MyRun-TFIDF", "MyRun-BM25", "MyRun-BM25_RM3"],
    save_dir=RUN,
    save_mode="overwrite",
    filter_by_qrels=True,  # Not all topics are available in qrels
    # perquery=True  # Uncomment to report per-query scores
)

CPU times: total: 1.53 s
Wall time: 5.94 s


| | name | RR | nDCG@10 | nDCG |
|---|---|---|---|---|
| 0 | MyRun-TFIDF | 0.575961 | 0.361124 | 0.469943 |
| 1 | MyRun-BM25 | 0.574356 | 0.359483 | 0.470693 |
| 2 | MyRun-BM25_RM3 | 0.555899 | 0.373065 | 0.492993 |
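
As a reading aid for the RR column, here is a minimal sketch of how mean reciprocal rank is computed: for each query, take 1/rank of the first relevant document in the ranking (0 if none is retrieved), then average over the queries. The function names, the toy rankings, and the binary notion of relevance are assumptions for illustration; the scores in the table come from `pt.Experiment`, not from this code.

```python
def reciprocal_rank(ranked_doc_ids, relevant_doc_ids):
    """1/rank of the first relevant document, or 0.0 if none is retrieved."""
    for rank, doc_id in enumerate(ranked_doc_ids, start=1):
        if doc_id in relevant_doc_ids:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(run, qrels):
    """Average the reciprocal rank over the queries that have judgments."""
    scores = [reciprocal_rank(run.get(qid, []), rel) for qid, rel in qrels.items()]
    return sum(scores) / len(scores) if scores else 0.0


# Toy run and qrels (hypothetical IDs): ranked lists and sets of relevant documents
toy_run = {'q1': ['d3', 'd1', 'd7'], 'q2': ['d5', 'd2'], 'q3': ['d9']}
toy_qrels = {'q1': {'d1'}, 'q2': {'d5'}, 'q3': {'d4'}}

print(mean_reciprocal_rank(toy_run, toy_qrels))
```

Averaging only over queries that appear in the qrels is in the same spirit as `filter_by_qrels=True` in the experiment cell, which skips topics without judgments.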
