# Experiment on NTCIR-17 Transfer Task Eval Dataset

This notebook shows how to run BM25 on the eval dataset of the NTCIR-17 Transfer Task using [PyTerrier](https://pyterrier.readthedocs.io/en/latest/) (v0.9.2).

## Previous Step

- `preprocess-transfer1-eval-ipynb`

## Requirements

- Java v11

## Path

In [1]:
import os
os.environ['INDEX'] = '../indexes/ntcir17-transfer/eval'
os.environ['RUN'] = '../runs/ntcir17-transfer/eval'

## Datasets

In [2]:
import sys
!{sys.executable} -m pip install -q ir_datasets

In [3]:
# Make the local ../datasets directory (relative to the notebook's working directory) importable
sys.path.append(os.path.join(os.getcwd(), '../datasets'))

In [4]:
import ir_datasets
import ntcir_transfer
dataset = ir_datasets.load('ntcir-transfer/1/eval')

## Tokenization

- In this example, we use [SudachiPy](https://github.com/WorksApplications/SudachiPy) (v0.5.4) + sudachidict_core dictionary + SplitMode.A
- Other tokenizers can also be used

In [5]:
import sys
!{sys.executable} -m pip install -q sudachipy sudachidict_core

In [6]:
import re
import json
from sudachipy import tokenizer
from sudachipy import dictionary
tokenizer_obj = dictionary.Dictionary().create()
mode = tokenizer.Tokenizer.SplitMode.A

In [7]:
def tokenize_text(text):
    atok = ' '.join([m.surface() for m in tokenizer_obj.tokenize(text, mode)])
    return atok

In [8]:
tokenize_text('すもももももももものうち')

'すもも も もも も もも の うち'
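As noted above, other tokenizers can be used as long as they return a whitespace-separated string, which is what the indexing and query steps consume. The following is a minimal sketch of one possible alternative, overlapping character bigrams, which needs no dictionary. The function name `bigram_tokenize_text` is hypothetical and not part of this notebook; it is not the tokenizer used in the experiment.

```python
def bigram_tokenize_text(text):
    """Return overlapping character bigrams joined by spaces.

    Hypothetical drop-in alternative to tokenize_text; whitespace in the
    input is ignored so bigrams never contain a space.
    """
    chars = [c for c in text if not c.isspace()]
    if len(chars) < 2:
        # Too short to form a bigram: return the text as a single token
        return ''.join(chars)
    return ' '.join(a + b for a, b in zip(chars, chars[1:]))


print(bigram_tokenize_text('情報検索'))
```

If such a tokenizer is swapped in, the same function should be applied to both documents and queries so that index terms and query terms match.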

## Experiment

### PyTerrier

In [9]:
# Change JAVA_HOME to fit your environment
JAVA_HOME = '/usr/lib/jvm/java-11-openjdk-amd64'
os.environ['JAVA_HOME'] = JAVA_HOME
os.getenv('JAVA_HOME')

'/usr/lib/jvm/java-11-openjdk-amd64'

In [10]:
import sys
!{sys.executable} -m pip install -q python-terrier

In [11]:
import pandas as pd
import pyterrier as pt
if not pt.started():
    pt.init(tqdm='notebook')

PyTerrier 0.9.2 has loaded Terrier 5.7 (built by craigm on 2022-11-10 18:30) and terrier-helper 0.0.7

No etc/terrier.properties, using terrier.default.properties for bootstrap configuration.


In [12]:
dataset_pt = pt.get_dataset('irds:ntcir-transfer/1/eval')

### Indexing

In [13]:
# !rm -rf $INDEX
!mkdir -p $INDEX

In [14]:
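indexer = pt.IterDictIndexer(os.getenv('INDEX'))
# Documents are already tokenized by Sudachi; use a tokeniser that handles non-ASCII text
indexer.setProperty("tokeniser", "UTFTokeniser")
# Empty term pipeline: disable Terrier's default (English) stopword removal and stemming
indexer.setProperty("termpipelines", "")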

In [15]:
def train_doc_generate():
    for doc in dataset.docs_iter():
        yield { 'docno': doc.doc_id, 'text': tokenize_text(doc.text) }

In [16]:
%%time
indexref = indexer.index(train_doc_generate())

09:50:47.286 [ForkJoinPool-1-worker-3] WARN org.terrier.structures.indexing.Indexer - Indexed 1 empty documents
CPU times: user 20min 5s, sys: 17.7 s, total: 20min 23s
Wall time: 15min 19s


In [17]:
!ls $INDEX

data.direct.bf		   data.lexicon.fsomaphash  data.meta.zdata
data.document.fsarrayfile  data.lexicon.fsomapid    data.properties
data.inverted.bf	   data.meta-0.fsomapfile
data.lexicon.fsomapfile    data.meta.idx


### Topics

In [18]:
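def tokenize_topics():
    # ASCII and Japanese punctuation to strip from the tokenized queries
    code = re.compile('[!"#$%&\'\\\\()*+,-./:;<=>?@[\\]^_`{|}~「」〔〕“”〈〉『』【】＆＊・（）＄＃＠。、？！｀＋￥％]')
    queries = dataset_pt.get_topics(tokenise_query=False)
    # Tokenize each query with Sudachi, then remove punctuation (column-wise, independent of the index)
    queries['query'] = queries['query'].apply(lambda q: code.sub('', tokenize_text(q)))
    return queries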
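The effect of the punctuation stripping in `tokenize_topics` can be checked in isolation with the standard library only. The sketch below uses a shortened character class for readability; the name `strip_punctuation` and the sample queries are made up for illustration. Unlike the notebook's function, it also collapses the double spaces left behind where a punctuation token was removed, which is an optional extra step.

```python
import re

# Shortened version of the character class used in tokenize_topics (illustrative only)
PUNCT = re.compile(r'[!"#$%&\'()*+,\-./:;<=>?@\[\\\]^_`{|}~「」『』【】（）・。、？！]')


def strip_punctuation(tokenized_query):
    """Remove punctuation from a whitespace-tokenized query string."""
    cleaned = PUNCT.sub('', tokenized_query)
    # Collapse runs of spaces left where a punctuation token used to be
    return ' '.join(cleaned.split())


print(strip_punctuation('BM25 と は 何 か ？'))  # → BM25 と は 何 か
```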

In [19]:
tokenize_topics()

### Retrieval

- Evaluation here uses dummy qrels (every label is 0), so the reported performance value (e.g., nDCG) is expected to be 0.0.
- The run files generated under `$RUN` can be used for submission.

In [20]:
# Load existing index files
indexref = pt.IndexFactory.of(os.getenv('INDEX'))

In [21]:
!mkdir -p $RUN

In [22]:
bm25 = pt.BatchRetrieve(indexref, wmodel="BM25")
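For intuition about what the BM25 weighting model computes over the whitespace-tokenized text, here is a small standard-library sketch of a textbook Okapi BM25 scorer. It is not Terrier's implementation; Terrier's exact formula and default parameters may differ, and the names `bm25_scores`, `k1`, and `b` as well as the toy documents are assumptions made for this illustration.

```python
import math
from collections import Counter


def bm25_scores(query_tokens, docs, k1=1.2, b=0.75):
    """Score every document against the query with textbook Okapi BM25.

    docs maps docno -> list of tokens and must be non-empty.
    Returns a dict docno -> score.
    """
    n_docs = len(docs)
    avg_len = sum(len(tokens) for tokens in docs.values()) / n_docs
    # Document frequency: number of documents containing each term
    df = Counter()
    for tokens in docs.values():
        df.update(set(tokens))
    scores = {}
    for docno, tokens in docs.items():
        tf = Counter(tokens)
        score = 0.0
        for term in query_tokens:
            if tf[term] == 0:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term frequency saturation with document length normalization
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores[docno] = score
    return scores


toy_docs = {
    'd1': 'すもも も もも も もも の うち'.split(),
    'd2': 'りんご と みかん'.split(),
}
print(bm25_scores(['もも'], toy_docs))
```

Documents that do not contain any query term receive a score of 0.0, and repeated occurrences of a term raise the score with diminishing returns.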

In [23]:
# Dummy qrels: one placeholder judgment (label 0) per topic so that pt.Experiment can run
import pandas as pd
dummy_qrels = pd.DataFrame(dataset_pt.get_topics(), columns=['qid'])
dummy_qrels['docno'] = 'docno'
dummy_qrels['label'] = 0
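To see why an evaluation against such dummy qrels yields 0.0, consider a simplified per-query nDCG with linear gain, sketched below with the standard library. This is an illustrative approximation, not the implementation PyTerrier calls; the function name `ndcg` and the sample rankings are hypothetical. When no judged document has a positive label, the ideal DCG is zero and the score is defined as 0.0.

```python
import math


def ndcg(ranked_docnos, qrels):
    """Simplified nDCG for one query.

    ranked_docnos: list of docnos in rank order.
    qrels: dict docno -> graded relevance label.
    """
    dcg = sum(qrels.get(docno, 0) / math.log2(rank + 2)
              for rank, docno in enumerate(ranked_docnos))
    ideal_labels = sorted(qrels.values(), reverse=True)
    idcg = sum(label / math.log2(rank + 2)
               for rank, label in enumerate(ideal_labels))
    # No relevant documents: the metric is defined as 0.0
    return dcg / idcg if idcg > 0 else 0.0


# Dummy qrels as in the cell above: a placeholder docno with label 0
print(ndcg(['kaken-j-0911436000', 'kaken-j-0921440800'], {'docno': 0}))  # → 0.0
```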

In [24]:
%%time
from pyterrier.measures import nDCG
pt.Experiment(
    [bm25],
    tokenize_topics(),
    dummy_qrels,
    eval_metrics=[nDCG],
    names = ["MyRun-BM25"],
    save_dir = os.getenv('RUN'),
    save_mode = "overwrite"
)

CPU times: user 9.83 s, sys: 160 ms, total: 9.99 s
Wall time: 6.15 s


|   | name | nDCG |
|---|------|------|
| 0 | MyRun-BM25 | 0.0 |


In [25]:
!gunzip -c $RUN/MyRun-BM25.res.gz | head

0101 Q0 kaken-j-0911436000 0 21.864852856182523 pyterrier
0101 Q0 kaken-j-0921440800 1 21.733548675216895 pyterrier
0101 Q0 kaken-j-0960142800 2 21.699355803624673 pyterrier
0101 Q0 kaken-j-0975101400 3 21.6598678406465 pyterrier
0101 Q0 kaken-j-0934033100 4 21.65174296819228 pyterrier
0101 Q0 kaken-j-0912100600 5 21.59438980838825 pyterrier
0101 Q0 kaken-j-0882391600 6 21.511995277698023 pyterrier
0101 Q0 kaken-j-0883102100 7 21.460978250332282 pyterrier
0101 Q0 kaken-j-0937129200 8 21.450990717308382 pyterrier
0101 Q0 kaken-j-0941469900 9 21.42146237447012 pyterrier
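Each line of the run file shown above has six whitespace-separated fields: query ID, the literal `Q0`, document ID, rank, score, and run tag. The following sketch parses one such line with the standard library; `RunEntry` and `parse_run_line` are hypothetical helper names, not part of PyTerrier.

```python
from typing import NamedTuple


class RunEntry(NamedTuple):
    qid: str
    docno: str
    rank: int
    score: float
    tag: str


def parse_run_line(line):
    """Parse one line of a TREC-style run file into a RunEntry."""
    qid, _q0, docno, rank, score, tag = line.split()
    return RunEntry(qid, docno, int(rank), float(score), tag)


entry = parse_run_line('0101 Q0 kaken-j-0911436000 0 21.864852856182523 pyterrier')
print(entry.qid, entry.docno, entry.rank)  # → 0101 kaken-j-0911436000 0
```

To check a whole run before submission, the same function could be applied to every line read with `gzip.open(path, 'rt')`, for example to confirm that each topic has results and that ranks are consecutive.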
