# Experiment on NTCIR-17 Transfer Task Train Dataset

This notebook shows how to run BM25 retrieval on the train dataset of the NTCIR-17 Transfer Task using [PyTerrier](https://pyterrier.readthedocs.io/en/latest/) (v0.9.2).

## Previous Step

- `preprocess-transfer1-train.ipynb`

## Requirements

- Java v11

## Path

In [1]:
import os
os.environ['INDEX'] = '../indexes/ntcir17-transfer/train'
os.environ['RUN'] = '../runs/ntcir17-transfer/train'

## Datasets

In [2]:
import sys
!{sys.executable} -m pip install -q ir_datasets

In [3]:
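# __file__ is undefined in a notebook, so the string '__file__' is resolved against the working directory to locate ../datasets
sys.path.append(os.path.join(os.path.dirname(os.path.abspath('__file__')), '../datasets'))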

In [4]:
import ir_datasets
import ntcir_transfer  # local module from ../datasets; imported so that the 'ntcir-transfer/...' IDs can be loaded
dataset = ir_datasets.load('ntcir-transfer/1/train')

## Tokenization

- This example uses [SudachiPy](https://github.com/WorksApplications/SudachiPy) (v0.5.4) with the `sudachidict_core` dictionary and `SplitMode.A` (the shortest split unit).
- Other tokenizers can be used instead.

In [5]:
import sys
!{sys.executable} -m pip install -q sudachipy sudachidict_core

In [6]:
import re
import json
from sudachipy import tokenizer
from sudachipy import dictionary
tokenizer_obj = dictionary.Dictionary().create()
mode = tokenizer.Tokenizer.SplitMode.A

In [7]:
def tokenize_text(text):
    # Join the surface forms of the morphemes with spaces so the text can later be split on whitespace
    return ' '.join(m.surface() for m in tokenizer_obj.tokenize(text, mode))

In [8]:
tokenize_text('すもももももももものうち')

'すもも も もも も もも の うち'
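
Any function that maps a string to space-separated tokens can stand in for `tokenize_text`. As an illustration only, the following sketch uses overlapping character bigrams, a common dictionary-free baseline for Japanese text. The name `char_bigram_tokenize` is hypothetical and not part of this notebook.

```python
def char_bigram_tokenize(text):
    """Return the overlapping character bigrams of `text`, joined by spaces."""
    # Remove whitespace so that it never appears inside a bigram
    chars = ''.join(text.split())
    if len(chars) < 2:
        return chars
    return ' '.join(chars[i:i + 2] for i in range(len(chars) - 1))


print(char_bigram_tokenize('すもももももも'))
```

If such a function were used, it would have to be applied to both documents and queries so that index terms and query terms match.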

## Experiment

### PyTerrier

In [9]:
# Change JAVA_HOME to fit your environment
JAVA_HOME = '/usr/lib/jvm/java-11-openjdk-amd64'
os.environ['JAVA_HOME'] = JAVA_HOME
os.getenv('JAVA_HOME')

'/usr/lib/jvm/java-11-openjdk-amd64'

In [10]:
import sys
!{sys.executable} -m pip install -q python-terrier

In [11]:
import pandas as pd
import pyterrier as pt
if not pt.started():
  pt.init(tqdm='notebook')

PyTerrier 0.9.2 has loaded Terrier 5.7 (built by craigm on 2022-11-10 18:30) and terrier-helper 0.0.7

No etc/terrier.properties, using terrier.default.properties for bootstrap configuration.


In [12]:
dataset_pt = pt.get_dataset('irds:ntcir-transfer/1/train')

### Indexing

In [13]:
# !rm -rf $INDEX
!mkdir -p $INDEX

In [14]:
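indexer = pt.IterDictIndexer(os.getenv('INDEX'))
# The text is already tokenized, so use the UTF tokeniser and disable the default term pipeline (stopword removal and stemming)
indexer.setProperty("tokeniser", "UTFTokeniser")
indexer.setProperty("termpipelines", "")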

In [15]:
def train_doc_generate():
    for doc in dataset.docs_iter():
        yield { 'docno': doc.doc_id, 'text': tokenize_text(doc.text) }

In [16]:
%%time
indexref = indexer.index(train_doc_generate())

09:28:23.185 [ForkJoinPool-1-worker-3] WARN org.terrier.structures.indexing.Indexer - Indexed 1 empty documents
CPU times: user 6min 31s, sys: 4.94 s, total: 6min 36s
Wall time: 4min 46s
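
The warning above reports one document that produced no indexable terms. A minimal sketch for locating documents whose text is empty or whitespace-only before indexing is shown below. It assumes each document exposes `doc_id` and `text` as in `train_doc_generate`, and it uses a whitespace-based stand-in tokenizer and made-up sample documents rather than Sudachi and the real collection. Terrier may also count a document as empty for other reasons, so this is only a first check.

```python
from collections import namedtuple

# Hypothetical stand-in for the ir_datasets document type (doc_id, text)
Doc = namedtuple('Doc', ['doc_id', 'text'])


def find_empty_docs(docs, tokenize=lambda text: ' '.join(text.split())):
    """Return the IDs of documents whose tokenized text is empty."""
    return [doc.doc_id for doc in docs if not tokenize(doc.text).strip()]


# Made-up sample documents for illustration
sample_docs = [
    Doc('doc-1', '情報 検索'),
    Doc('doc-2', ''),
    Doc('doc-3', '   '),
]
print(find_empty_docs(sample_docs))
```

With the real collection, `find_empty_docs(dataset.docs_iter(), tokenize_text)` would apply the same check using the Sudachi tokenizer.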


In [17]:
!ls $INDEX

data.direct.bf		   data.lexicon.fsomaphash  data.meta.zdata
data.document.fsarrayfile  data.lexicon.fsomapid    data.properties
data.inverted.bf	   data.meta-0.fsomapfile
data.lexicon.fsomapfile    data.meta.idx


### Topics

In [18]:
def tokenize_topics():
    import re
    # ASCII and Japanese punctuation to strip from the tokenized queries
    code = re.compile(r'[!"#$%&\'\\()*+,\-./:;<=>?@\[\]^_`{|}~「」〔〕“”〈〉『』【】＆＊・（）＄＃＠。、？！｀＋￥％]')
    queries = dataset_pt.get_topics(tokenise_query=False)
    # Assign by column name so the result does not depend on the index values or column order
    queries['query'] = queries['query'].map(lambda q: code.sub('', tokenize_text(q)))
    return queries
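
The regular expression in `tokenize_topics` deletes punctuation from the tokenized query, presumably because characters such as `:` or `(` can be interpreted as operators by Terrier's query parser. The sketch below shows the effect on a made-up query, using a shortened character class and a hypothetical helper `strip_punctuation`. Unlike the notebook code, it also collapses the double spaces left behind.

```python
import re

# Illustrative subset of the character class used in tokenize_topics (assumption)
PUNCT = re.compile(r'[!"#$%&\'()*+,\-./:;<=>?@\[\]^_`{|}~「」（）。、？！・]')


def strip_punctuation(tokenized_query):
    """Remove punctuation characters, then collapse runs of whitespace."""
    without_punct = PUNCT.sub('', tokenized_query)
    return ' '.join(without_punct.split())


print(strip_punctuation('「 情報 検索 」 と は 何 か ？'))
```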

In [None]:
tokenize_topics()

### Retrieval

In [20]:
# Load existing index files
indexref = pt.IndexFactory.of(os.getenv('INDEX'))

In [21]:
!mkdir -p $RUN

In [22]:
bm25 = pt.BatchRetrieve(indexref, wmodel="BM25")
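
To make the ranking function concrete, here is a minimal sketch of textbook Okapi BM25 over whitespace-tokenized text, matching the way documents and queries are pre-tokenized in this notebook. It is not Terrier's implementation: the IDF variant, the parameter values `k1=1.2` and `b=0.75`, and the sample documents are assumptions chosen for illustration, so its scores will not match the run produced here.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each whitespace-tokenized document in `docs` against `query`."""
    if not docs:
        return []
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(d) for d in tokenized) / n_docs
    # Document frequency: number of documents containing each term
    df = Counter(term for d in tokenized for term in set(d))
    scores = []
    for d in tokenized:
        tf = Counter(d)
        score = 0.0
        for term in query.split():
            if tf[term] == 0:
                continue
            # Non-negative IDF variant: log(1 + (N - df + 0.5) / (df + 0.5))
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term-frequency saturation with document-length normalization
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


# Made-up, pre-tokenized sample documents
docs = ['すもも も もも も もも の うち', 'もも の 花', '情報 検索 の 評価']
print(bm25_scores('もも', docs))
```

Documents that contain none of the query terms score zero, which is why only matching documents appear in a ranked list.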

In [23]:
%%time
from pyterrier.measures import *
pt.Experiment(
    [bm25],
    tokenize_topics(),
    dataset_pt.get_qrels(),
    eval_metrics=[nDCG],
    names = ["MyRun-BM25"],
    save_dir = os.getenv('RUN'),
    save_mode = "overwrite"
)

CPU times: user 10.8 s, sys: 217 ms, total: 11 s
Wall time: 7.15 s


|   | name | nDCG |
|---|------|------|
| 0 | MyRun-BM25 | 0.526288 |
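
For readers unfamiliar with the metric reported above, the following is a minimal sketch of nDCG for a single query, assuming linear gains and a log2 rank discount. The exact gain and discount conventions of the evaluation backend used by PyTerrier may differ, and the relevance labels are invented for illustration.

```python
import math


def dcg(gains):
    """Discounted cumulative gain for gains listed in rank order (rank 1 first)."""
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains, start=1))


def ndcg(ranked_gains, all_gains):
    """nDCG of a ranking, normalized by the ideal ordering of all judged gains."""
    ideal = dcg(sorted(all_gains, reverse=True))
    return dcg(ranked_gains) / ideal if ideal > 0 else 0.0


# Hypothetical graded relevance of the retrieved documents, in rank order
retrieved = [2, 0, 1]
# Hypothetical relevance of every judged document for the query
judged = [2, 1, 1, 0]
print(round(ndcg(retrieved, judged), 4))
```

The value reported by `pt.Experiment` is an average of the per-topic values.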


In [24]:
!gunzip -c $RUN/MyRun-BM25.res.gz | head

0001 Q0 gakkai-0000064659 0 13.583940439240665 pyterrier
0001 Q0 gakkai-0000225773 1 13.527180838463924 pyterrier
0001 Q0 gakkai-0000328806 2 13.432803791458504 pyterrier
0001 Q0 gakkai-0000198139 3 13.41909496235353 pyterrier
0001 Q0 gakkai-0000124728 4 13.402377678402779 pyterrier
0001 Q0 gakkai-0000168454 5 13.397874287752243 pyterrier
0001 Q0 gakkai-0000297977 6 13.395025854222729 pyterrier
0001 Q0 gakkai-0000245010 7 13.392895780536069 pyterrier
0001 Q0 gakkai-0000045041 8 13.392088659104303 pyterrier
0001 Q0 gakkai-0000094695 9 13.391086331487342 pyterrier
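
Each line of the run file follows the TREC run format: query ID, the literal `Q0`, document ID, rank, score, and run tag. Note that ranks in this output start at 0. Below is a small sketch for reading such lines into records; the `RunEntry` type and `parse_run_line` helper are hypothetical names, and the two sample lines are copied from the output above.

```python
from typing import NamedTuple


class RunEntry(NamedTuple):
    qid: str
    docno: str
    rank: int
    score: float
    tag: str


def parse_run_line(line):
    """Parse one whitespace-separated TREC run line into a RunEntry."""
    qid, _q0, docno, rank, score, tag = line.split()
    return RunEntry(qid, docno, int(rank), float(score), tag)


lines = [
    '0001 Q0 gakkai-0000064659 0 13.583940439240665 pyterrier',
    '0001 Q0 gakkai-0000225773 1 13.527180838463924 pyterrier',
]
for entry in map(parse_run_line, lines):
    print(entry.qid, entry.docno, entry.rank, entry.score)
```

To process the whole compressed file, the same helper could be applied to each line read with `gzip.open(path, 'rt')` from the standard library.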
