# Experiment on NTCIR-18 Transfer DCLR Subtask (Valid Set)

This notebook shows how to run TF-IDF, BM25, and BM25 with RM3 query expansion on the validation set of the NTCIR-18 Transfer DCLR Subtask using [PyTerrier](https://pyterrier.readthedocs.io/en/latest/) (v0.10.1).

## Previous Step

- `preprocess-transfer2-valid.ipynb`

## Requirements

- Java

## Path

In [1]:
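import os
# Index and run directories, resolved relative to the current working directory
INDEX = os.path.abspath(os.path.join(os.getcwd(), '..', 'indexes', 'ntcir18-transfer', 'valid'))
RUN = os.path.abspath(os.path.join(os.getcwd(), '..', 'runs', 'ntcir18-transfer', 'valid'))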

## Datasets

In [None]:
import sys
!{sys.executable} -m pip install -q ir_datasets pandas

In [4]:
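# Add ../datasets (relative to the working directory) to the import path
# so that the local ntcir_transfer module imported below can be found
sys.path.append(os.path.join(os.getcwd(), '..', 'datasets'))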

In [5]:
import ir_datasets
import ntcir_transfer
dataset = ir_datasets.load('ntcir-transfer/2/valid')

### Queries and Qrels

In [None]:
import pandas as pd
pd.set_option('display.max_rows', None)
pd.set_option('display.max_colwidth', None)
pd.DataFrame(dataset.qrels_iter()).groupby('query_id')['relevance'].value_counts().unstack(fill_value=0)

## Query Translation

- This example uses the free tier of the [DeepL API](https://www.deepl.com/en/pro-api) to translate the Japanese queries into English.
- You need to sign up and obtain an API key.
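
Hard-coding the key in a notebook makes it easy to leak when the notebook is shared. A minimal sketch of one alternative, assuming the key has been exported in an environment variable named `DEEPL_API_KEY` (both the variable name and the helper `load_api_key` are illustrative and not part of the `deepl` package):

```python
import os

def load_api_key(env_var="DEEPL_API_KEY"):
    """Return the API key stored in the given environment variable.

    Hypothetical helper for this notebook; not part of the deepl package.
    """
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable first")
    return key
```

With this in place, the assignment in the next cell could read `DEEPL_API_KEY = load_api_key()` instead of a literal string.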

In [8]:
DEEPL_API_KEY = 'Your API Key'

In [9]:
import sys
!{sys.executable} -m pip install -q deepl


In [10]:
import deepl
translator = deepl.Translator(DEEPL_API_KEY)

In [11]:
import time
import re
import pandas as pd

queries_en = []

for query_id, text_ja in dataset.queries_iter():
    text_en = translator.translate_text(text_ja, source_lang="JA", target_lang="EN-US")
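    # Replace every character other than ASCII letters (digits and punctuation
    # included) with a space, then collapse whitespace, because Terrier's
    # query parser does not handle such characters well
    text_en_clean = re.sub(r'\s+', ' ', re.sub(r'[^a-zA-Z]', ' ', text_en.text)).strip()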
    queries_en.append({'qid': query_id, 'query': text_en_clean})
    print(f'{query_id}, ', end='', file=sys.stderr)

    time.sleep(1)

print('Done!', file=sys.stderr)

queries_en_df = pd.DataFrame(queries_en)

0101, 0102, 0103, 0104, 0105, 0106, 0107, 0108, 0109, 0110, 0111, 0112, 0113, 0114, 0115, 0116, 0117, 0118, 0119, 0120, 0121, 0122, 0123, 0124, 0125, 0126, 0127, 0128, 0129, 0130, 0131, 0132, 0133, 0134, 0135, 0136, 0137, 0138, 0139, 0140, 0141, 0142, 0143, 0144, 0145, 0146, 0147, 0148, 0149, Done!
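
The cleaning step in the loop above is lossy: digits, punctuation, and non-ASCII letters all become spaces. The sketch below isolates the same two regular expressions in a hypothetical helper, `clean_query`, so their effect can be inspected without calling DeepL. The sample string is made up for illustration.

```python
import re

def clean_query(text):
    """Apply the same cleaning as the translation loop (hypothetical helper)."""
    letters_only = re.sub(r'[^a-zA-Z]', ' ', text)    # every non-letter becomes a space
    return re.sub(r'\s+', ' ', letters_only).strip()  # collapse runs of whitespace

print(clean_query("COVID-19 vaccines: side-effects (2021)?"))
# COVID vaccines side effects
```

Note that "19" and "2021" disappear entirely, so queries that depend on numbers lose those terms before retrieval.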


### Translation check (Optional)

In [None]:
# Join the original Japanese queries with their English translations on the query ID
queries_en_df_tmp = queries_en_df.set_index('qid')
queries_ja_df = pd.DataFrame(dataset.queries_iter()).set_index('query_id')
combined_df = queries_ja_df.join(queries_en_df_tmp, how='inner')
combined_df

## Experiment

### PyTerrier

In [13]:
# Set JAVA_HOME to the root directory of your Java installation
JAVA_HOME = 'FIT YOUR ENVIRONMENT'  # placeholder: replace with your own path
os.environ['JAVA_HOME'] = JAVA_HOME
os.getenv('JAVA_HOME')

'C:\\jdk-22.0.1'
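
A wrong `JAVA_HOME` usually only surfaces later, when PyTerrier tries to start the JVM. As a rough sanity check, the sketch below tests whether the directory contains a `java` executable under `bin`. The helper `looks_like_java_home` is hypothetical, and this is not how PyTerrier itself locates Java.

```python
import os

def looks_like_java_home(path):
    """Return True if `path` contains bin/java or bin/java.exe.

    Hypothetical helper; a heuristic check only.
    """
    if not path:
        return False
    candidates = [os.path.join(path, "bin", name) for name in ("java", "java.exe")]
    return any(os.path.isfile(c) for c in candidates)

java_home = os.environ.get("JAVA_HOME", "")
if not looks_like_java_home(java_home):
    print(f"JAVA_HOME={java_home!r} does not look like a Java installation")
```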

In [None]:
import sys
!{sys.executable} -m pip install -q python-terrier numpy==1.26.4 ipywidgets

In [15]:
import pyterrier as pt
if not pt.started():
    # terrier-prf is loaded as a boot package because pt.rewrite.RM3 (used below) depends on it
    pt.init(boot_packages=["com.github.terrierteam:terrier-prf:-SNAPSHOT"], tqdm='notebook')

PyTerrier 0.10.1 has loaded Terrier 5.9 (built by craigm on 2024-05-02 17:40) and terrier-helper 0.0.8



In [16]:
dataset_pt = pt.get_dataset('irds:ntcir-transfer/2/valid')

### Indexing

In [17]:
os.makedirs(INDEX, exist_ok=True)

In [18]:
indexer = pt.IterDictIndexer(INDEX)

In [19]:
%%time
indexref = indexer.index(dataset_pt.get_corpus_iter())

ntcir-transfer/2/valid documents:   0%|          | 0/322058 [00:00<?, ?it/s]

10:13:15.761 [main] WARN org.terrier.structures.indexing.Indexer - Indexed 2 empty documents
CPU times: total: 23.4 s
Wall time: 1min 6s


In [20]:
os.listdir(INDEX)

['data.direct.bf',
 'data.document.fsarrayfile',
 'data.inverted.bf',
 'data.lexicon.fsomapfile',
 'data.lexicon.fsomaphash',
 'data.lexicon.fsomapid',
 'data.meta-0.fsomapfile',
 'data.meta.idx',
 'data.meta.zdata',
 'data.properties']

### Retrieval

In [21]:
# Load existing index files
indexref = pt.IndexFactory.of(INDEX)

In [22]:
os.makedirs(RUN, exist_ok=True)

In [23]:
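# Two single-stage retrievers that differ only in the weighting model
tfidf = pt.BatchRetrieve(indexref, wmodel="TF_IDF")
bm25 = pt.BatchRetrieve(indexref, wmodel="BM25")
# BM25 first pass -> RM3 query expansion from the retrieved documents -> BM25 with the expanded query
bm25_rm3 = bm25 >> pt.rewrite.RM3(indexref) >> bm25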
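
The `bm25_rm3` pipeline retrieves once, rewrites the query from the top-ranked documents, and retrieves again. To give an idea of what the rewriting step does, here is a deliberately simplified, pure-Python sketch of RM3-style expansion. It is not the terrier-prf implementation: the function name, the parameters `fb_terms` and `lam`, their defaults, and the unsmoothed term probabilities are all assumptions made for illustration.

```python
from collections import Counter

def rm3_expand(query_terms, feedback_docs, fb_terms=3, lam=0.6):
    """Toy RM3: interpolate the query model with a relevance model.

    feedback_docs is a list of (tokens, score) pairs from the first-pass
    ranking; scores are assumed to be positive. Illustrative only.
    """
    # Original query model: maximum-likelihood estimate over the query terms
    q_counts = Counter(query_terms)
    q_model = {t: c / len(query_terms) for t, c in q_counts.items()}

    # Relevance model: sum over documents of P(term | doc) * normalized doc score
    total_score = sum(score for _, score in feedback_docs)
    rel_model = Counter()
    for tokens, score in feedback_docs:
        doc_weight = score / total_score
        for term, count in Counter(tokens).items():
            rel_model[term] += doc_weight * count / len(tokens)

    # Keep the strongest feedback terms and renormalize their weights
    top = dict(rel_model.most_common(fb_terms))
    norm = sum(top.values())
    top = {t: w / norm for t, w in top.items()}

    # RM3 interpolation: lam * original query model + (1 - lam) * feedback model
    terms = set(q_model) | set(top)
    return {t: lam * q_model.get(t, 0.0) + (1 - lam) * top.get(t, 0.0) for t in terms}

docs = [(["solar", "panel", "energy", "cost"], 2.0),
        (["solar", "energy", "storage"], 1.0)]
expanded = rm3_expand(["solar", "energy"], docs)
print(sorted(expanded.items(), key=lambda kv: -kv[1]))
```

The expanded query keeps the original terms with the highest weights and adds a feedback term with a smaller weight; the second `bm25` stage then scores documents against this weighted query.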

In [24]:
%%time
from pyterrier.measures import *
pt.Experiment(
    [tfidf, bm25, bm25_rm3],
    queries_en_df,
    dataset_pt.get_qrels(),
    eval_metrics=[MRR, nDCG@10, nDCG],
    names=["MyRun-TFIDF", "MyRun-BM25", "MyRun-BM25_RM3"],
    save_dir=RUN,
    save_mode="overwrite",
    filter_by_qrels=True,  # Not all topics are available in qrels
    # perquery=True
)

CPU times: total: 1.11 s
Wall time: 4.55 s


| | name | RR | nDCG@10 | nDCG |
|---|---|---|---|---|
| 0 | MyRun-TFIDF | 0.548938 | 0.266006 | 0.154939 |
| 1 | MyRun-BM25 | 0.545641 | 0.264869 | 0.155248 |
| 2 | MyRun-BM25_RM3 | 0.510178 | 0.275785 | 0.168072 |
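
In the results above, BM25 with RM3 has the lowest RR but the highest nDCG@10 and nDCG, so it helps to recall what each metric rewards. The sketch below computes both for a single ranked list, assuming linear gain and a log2 rank discount; evaluation libraries may differ in details such as the gain function or tie handling, and the function names here are illustrative. The table reports the mean of such per-query values.

```python
import math

def reciprocal_rank(rels):
    """1 / rank of the first relevant document, or 0.0 if there is none."""
    for rank, rel in enumerate(rels, start=1):
        if rel > 0:
            return 1.0 / rank
    return 0.0

def dcg(rels, k=None):
    """Discounted cumulative gain over the top k relevance grades."""
    rels = rels[:k] if k is not None else rels
    return sum(rel / math.log2(rank + 1) for rank, rel in enumerate(rels, start=1))

def ndcg(rels, all_judged_rels, k=None):
    """DCG of the ranking divided by the DCG of an ideal ordering of the qrels."""
    ideal = dcg(sorted(all_judged_rels, reverse=True), k)
    return dcg(rels, k) / ideal if ideal > 0 else 0.0

ranking = [0, 2, 1, 0]   # relevance grades of the retrieved documents, in rank order
judged = [2, 1, 1]       # all positive grades in the qrels for this query
print(reciprocal_rank(ranking))     # 0.5: the first relevant document is at rank 2
print(ndcg(ranking, judged, k=10))
```

RR depends only on the position of the first relevant document, while nDCG accumulates graded gain over the whole list, so a method can lose on one and gain on the other.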
