# Preprocessing of NTCIR-17 Transfer Task Train Dataset

## About the dataset

The NTCIR-17 Transfer Task uses the following test collection as its training dataset.

### Overview of NTCIR-1 AdHoc Test Collection

- Reference
> Kando, et al. (1999). [Overview of IR Tasks at the First NTCIR Workshop](http://research.nii.ac.jp/ntcir/workshop/OnlineProceedings/IR-overview.pdf). In: Proceedings of the First NTCIR Workshop on Research in Japanese Text Retrieval and Term Recognition, August 30 - September 1, 1999, pp.11-44.
- How to obtain the data: [Research Purpose Use of NTCIR Test Collections or Data Archive/ User Agreement](http://research.nii.ac.jp/ntcir/permission/perm-en.html#ntcir-1)
> The IR Test collection includes (1) Document data (Author abstracts of the Academic Conference Paper Database (1988-1997) = author abstracts of the paper presented at the academic conference hosted by either of 65 academic societies in Japan. about 330,000 documents; more than half are English-Japanese paired,) (2) 83 Search topics (Japanese,) and (3) Relevance Judgements. The collection can be used for retrieval experiments of Japanese text retrieval and CLIR of search Either of English documents or Japanese-English documents by Japanese topics. The Term Extraction Test collection includes tagged corpus using the 2000 Japanese documents selected from the above IR test collection. The whole test collection is available for research purpose use from NII.

## Data path
- Get a copy of the test collection by following the instructions above.
- We assume that the downloaded file has been uncompressed to the following path.

In [1]:
import os
os.environ['DATA'] = '../testcollections/ntcir/NTCIR-1'

In [2]:
!ls $DATA

ADHOC.TGZ		 MANUAL-E.PDF		      README-J.PDF
AGREEM-E.PDF		 MANUAL-J.PDF		      README-J.TXT
AGREEM-J.PDF		 MLIR.TGZ		      TAGREE-E.PDF
CLIR.TGZ		 README-E-REVISED-130709.pdf  TAGREE-J.PDF
CORRECTION-E-130709.pdf  README-E.TXT		      TMREC.TGZ
CORRECTION-J-130705.pdf  README-J-REVISED-130705.pdf  TOPICS.TGZ


## Preprocessing

### Corpus files

In [3]:
!tar xvfz $DATA/MLIR.TGZ -C $DATA/

mlir/
mlir/ntc1-j1
mlir/rel1_ntc1-j1_0001-0030
mlir/rel2_ntc1-j1_0001-0030
mlir/rel1_ntc1-j1_0031-0083
mlir/rel2_ntc1-j1_0031-0083


In [4]:
!iconv -f EUC-JP -t UTF-8 -c $DATA/mlir/ntc1-j1 > $DATA/mlir/ntc1-j1.utf8
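
Where `iconv` is not available, the same conversion could be sketched in Python. This is a minimal sketch rather than part of the original pipeline: `convert_euc_jp_to_utf8` is a hypothetical helper, and `errors='ignore'` is assumed to approximate `iconv -c` by silently dropping bytes that cannot be decoded.

```python
def convert_euc_jp_to_utf8(in_file, out_file=None):
    """Re-encode an EUC-JP text file as UTF-8, skipping undecodable bytes."""
    # Hypothetical default: mirror the '.utf8' suffix used in this notebook
    out_file = out_file or in_file + '.utf8'
    # newline='' keeps the original line endings untouched
    with open(in_file, 'r', encoding='euc_jp', errors='ignore', newline='') as src, \
         open(out_file, 'w', encoding='utf-8', newline='') as dst:
        for line in src:  # stream line by line to keep memory use low
            dst.write(line)
    return out_file
```

Under these assumptions, `convert_euc_jp_to_utf8(os.getenv('DATA') + '/mlir/ntc1-j1')` would write `ntc1-j1.utf8` next to the source file.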

In [5]:
# Number of documents
!grep "^<ACCN" $DATA/mlir/ntc1-j1.utf8 | wc -l

332918


In [6]:
import json
import re
def docs_jsonl(in_file):
    out_file = in_file + '.jsonl'
    with open(in_file, 'r', encoding='utf-8') as f:
        s = f.read()
        # Drop the paragraph markers inside abstracts
        s = re.sub('<ABST.P>|</ABST.P>', '', s)

        accn = re.findall('<ACCN.*?>(.*)</ACCN>', s)
        titl = re.findall('<TITL.*?>(.*)</TITL>', s)
        abst = re.findall('<ABST.*?>(.*)</ABST>', s)

    with open(out_file, 'w', encoding='utf-8') as f:
        for i in range(len(accn)):
            # text = title + abstract; json.dumps takes care of all escaping
            record = {'doc_id': accn[i], 'text': f'{titl[i]} {abst[i]}'}
            f.write(json.dumps(record, ensure_ascii=False) + '\n')

In [7]:
docs_jsonl(os.getenv('DATA') + '/mlir/ntc1-j1.utf8')

In [8]:
!wc -l $DATA/mlir/ntc1-j1.utf8.jsonl

332918 ../testcollections/ntcir/NTCIR-1/mlir/ntc1-j1.utf8.jsonl
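
A matching line count does not by itself show that every line is valid JSON, so a quick parse check can be useful before the file is loaded elsewhere. The following is a minimal sketch; `validate_jsonl` is a hypothetical helper, not part of the original notebook.

```python
import json

def validate_jsonl(path, required_keys):
    """Return the number of records; raise ValueError on the first bad line."""
    count = 0
    with open(path, 'r', encoding='utf-8') as f:
        for line_no, line in enumerate(f, start=1):
            try:
                record = json.loads(line)
            except json.JSONDecodeError as e:
                raise ValueError(f'{path}:{line_no}: invalid JSON ({e})') from e
            if not isinstance(record, dict):
                raise ValueError(f'{path}:{line_no}: expected a JSON object')
            missing = set(required_keys) - record.keys()
            if missing:
                raise ValueError(f'{path}:{line_no}: missing keys {sorted(missing)}')
            count += 1
    return count
```

For the corpus file it could be called as `validate_jsonl(os.getenv('DATA') + '/mlir/ntc1-j1.utf8.jsonl', ['doc_id', 'text'])`.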


### Topic files

In [9]:
!tar xvfz $DATA/TOPICS.TGZ -C $DATA/

topics/
topics/topic0001-0030
topics/topic0031-0083


In [10]:
!iconv -f EUC-JP -t UTF-8 -c $DATA/topics/topic0001-0030 > $DATA/topics/topic0001-0030.utf8
!iconv -f EUC-JP -t UTF-8 -c $DATA/topics/topic0031-0083 > $DATA/topics/topic0031-0083.utf8

In [11]:
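import json
import re
def topics_jsonl(in_file):
    out_file = in_file + '.jsonl'
    with open(in_file, 'r', encoding='utf-8') as f:
        s = f.read()
        # Raw strings avoid the invalid '\/' escape; '/' needs no escaping in a regex
        qid = re.findall(r'<TOPIC q=([^>]+)>', s)
        title = re.findall(r'<TITLE>\n(.*)\n</TITLE>', s)
        desc = re.findall(r'<DESCRIPTION>\n(.*)\n</DESCRIPTION>', s)
    with open(out_file, 'w', encoding='utf-8') as f:
        for i in range(len(qid)):
            # json.dumps escapes quotes and backslashes in the topic text
            record = {'query_id': qid[i], 'text': title[i], 'description': desc[i]}
            f.write(json.dumps(record, ensure_ascii=False) + '\n')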

In [12]:
topics_jsonl(os.getenv('DATA') + '/topics/topic0001-0030.utf8')
topics_jsonl(os.getenv('DATA') + '/topics/topic0031-0083.utf8')
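
To illustrate what the regular expressions in `topics_jsonl` extract, here is a small worked example. The topic text is invented for illustration and only mimics the layout the patterns expect (tag, newline, content, newline, closing tag). Real topic files may contain further fields.

```python
import json
import re

# Hypothetical topic record; the content is made up for this sketch.
sample = (
    '<TOPIC q=0001>\n'
    '<TITLE>\n'
    'example title\n'
    '</TITLE>\n'
    '<DESCRIPTION>\n'
    'example description\n'
    '</DESCRIPTION>\n'
    '</TOPIC>\n'
)

# The same three patterns as in topics_jsonl, written as raw strings
qid = re.findall(r'<TOPIC q=([^>]+)>', sample)
title = re.findall(r'<TITLE>\n(.*)\n</TITLE>', sample)
desc = re.findall(r'<DESCRIPTION>\n(.*)\n</DESCRIPTION>', sample)

# One JSONL line per topic
record = {'query_id': qid[0], 'text': title[0], 'description': desc[0]}
line = json.dumps(record, ensure_ascii=False)
print(line)
```

Because `.` does not match newlines by default, each pattern captures exactly one line of content between the opening and closing tags.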

In [13]:
!cat $DATA/topics/topic0001-0030.utf8.jsonl $DATA/topics/topic0031-0083.utf8.jsonl > $DATA/topics/topic0001-0083.utf8.jsonl

In [14]:
!ls $DATA/topics

topic0001-0030		   topic0001-0083.utf8.jsonl  topic0031-0083.utf8.jsonl
topic0001-0030.utf8	   topic0031-0083
topic0001-0030.utf8.jsonl  topic0031-0083.utf8


### Qrel files
- This test collection provides graded relevance judgments (A: Relevant, B: Partially Relevant, C: Not Relevant).
- We convert the grades to numeric relevance labels as follows.
    - A: 2
    - B: 1
    - C: 0
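
The mapping above can be sketched as a lookup table applied to one judgment line at a time. This is a minimal sketch under stated assumptions: `to_trec_qrel` is a hypothetical helper, and the input layout (topic ID, grade, document ID, separated by tabs) is inferred from the conversion function used in this notebook. The output uses the four-column TREC qrels layout.

```python
GRADE_TO_SCORE = {'A': 2, 'B': 1, 'C': 0}

def to_trec_qrel(line):
    """Convert 'topic<TAB>grade<TAB>doc_id' to a TREC-style qrels line.

    Returns None for malformed lines or unknown grades.
    """
    flds = line.rstrip('\n').split('\t')
    if len(flds) < 3 or flds[1] not in GRADE_TO_SCORE:
        return None
    # Columns: query_id, iteration (conventionally Q0), doc_id, relevance
    return f'{flds[0]}\tQ0\t{flds[2]}\t{GRADE_TO_SCORE[flds[1]]}'

# Made-up judgment line for illustration
print(to_trec_qrel('0001\tA\tgakkai-0000011144'))
```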

In [15]:
!iconv -f EUC-JP -t UTF-8 -c $DATA/mlir/rel2_ntc1-j1_0001-0030 > $DATA/mlir/rel2_ntc1-j1_0001-0030.utf8
!iconv -f EUC-JP -t UTF-8 -c $DATA/mlir/rel2_ntc1-j1_0031-0083 > $DATA/mlir/rel2_ntc1-j1_0031-0083.utf8

In [16]:
def qrel_graded_tsv(in_file):
    # Map NTCIR-1 relevance grades to numeric labels
    grade_to_score = {'A': 2, 'B': 1, 'C': 0}
    out_file = in_file + '.tsv'
    with open(in_file, 'r', encoding='utf-8') as f, open(out_file, 'w', encoding='utf-8') as f2:
        for line in f:
            flds = line.rstrip().split('\t')
            # Skip malformed lines and unknown grades
            if len(flds) < 3 or flds[1] not in grade_to_score:
                continue
            # TREC qrels layout: query_id, iteration (Q0), doc_id, relevance
            f2.write(f'{flds[0]}\tQ0\t{flds[2]}\t{grade_to_score[flds[1]]}\n')

In [17]:
qrel_graded_tsv(os.getenv('DATA') + '/mlir/rel2_ntc1-j1_0001-0030.utf8')
qrel_graded_tsv(os.getenv('DATA') + '/mlir/rel2_ntc1-j1_0031-0083.utf8')

In [18]:
!cat $DATA/mlir/rel2_ntc1-j1_0001-0030.utf8.tsv $DATA/mlir/rel2_ntc1-j1_0031-0083.utf8.tsv > $DATA/mlir/rel2_ntc1-j1_0001-0083.utf8.tsv

In [19]:
!ls $DATA/mlir/

ntc1-j1			rel2_ntc1-j1_0001-0030.utf8
ntc1-j1.utf8		rel2_ntc1-j1_0001-0030.utf8.tsv
ntc1-j1.utf8.jsonl	rel2_ntc1-j1_0001-0083.utf8.tsv
rel1_ntc1-j1_0001-0030	rel2_ntc1-j1_0031-0083
rel1_ntc1-j1_0031-0083	rel2_ntc1-j1_0031-0083.utf8
rel2_ntc1-j1_0001-0030	rel2_ntc1-j1_0031-0083.utf8.tsv


## Top 1000 data

- Available to NTCIR-17 Transfer Task participants only (used for the Reranking subtask).
- Download `top1000.train.tsv` into the `../testcollections/ntcir/NTCIR-1/mlir` folder.
- Note that not all topics have 1000 docs.
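
To see which topics fall short of 1000 candidates, the per-topic counts can be tallied. The following is a minimal sketch: `docs_per_topic` and `topics_below` are hypothetical helpers, and each row is assumed to unpack as `(query_id, doc_id, score)`, the field order shown by the `scoreddocs_iter()` output in this notebook.

```python
from collections import Counter

def docs_per_topic(rows):
    """Count candidate documents per query_id."""
    return Counter(query_id for query_id, _doc_id, _score in rows)

def topics_below(rows, k=1000):
    """Return {query_id: count} for topics with fewer than k candidates."""
    return {q: n for q, n in docs_per_topic(rows).items() if n < k}

# Toy rows for illustration only
rows = [
    ('0001', 'doc-a', 1.5),
    ('0001', 'doc-b', 1.2),
    ('0002', 'doc-c', 0.7),
]
print(topics_below(rows, k=2))
```

Once the dataset is registered as described in the next section, `topics_below(dataset.scoreddocs_iter())` should give the same report for the real data, provided the records unpack in that order.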

In [20]:
!ls $DATA/mlir

ntc1-j1			     rel2_ntc1-j1_0001-0030.utf8.tsv
ntc1-j1.utf8		     rel2_ntc1-j1_0001-0083.utf8.tsv
ntc1-j1.utf8.jsonl	     rel2_ntc1-j1_0031-0083
rel1_ntc1-j1_0001-0030	     rel2_ntc1-j1_0031-0083.utf8
rel1_ntc1-j1_0031-0083	     rel2_ntc1-j1_0031-0083.utf8.tsv
rel2_ntc1-j1_0001-0030	     top1000.train.tsv
rel2_ntc1-j1_0001-0030.utf8


## Register with the ir_datasets module locally

- Dataset name: `ntcir-transfer`
- Subset: `1/train`

### Location of dataset files

- `../datasets/ntcir-transfer.yaml`
- `../datasets/ntcir_transfer.py`

In [21]:
# Remove old cache (if any)
# !rm -rf ~/.ir_datasets/ntcir-transfer/1/train

In [22]:
import sys
!{sys.executable} -m pip install -q ir_datasets pandas

In [23]:
# Make ../datasets importable, resolved relative to the notebook's working directory
sys.path.append(os.path.abspath('../datasets'))

In [24]:
import ir_datasets
import ntcir_transfer
dataset = ir_datasets.load('ntcir-transfer/1/train')

In [25]:
dataset.docs_cls().__annotations__

OrderedDict([('doc_id', str), ('text', str)])

In [None]:
docstore = dataset.docs_store()
docstore.get('gakkai-0000011144').text # the one in the overview paper

In [27]:
dataset.queries_cls().__annotations__

OrderedDict([('query_id', str), ('text', str)])

In [None]:
import pandas as pd
pd.DataFrame(dataset.queries_iter())

In [29]:
dataset.qrels_defs()

{2: 'relevant', 1: 'partially relevant', 0: 'not relevant'}

In [None]:
pd.DataFrame(dataset.qrels_iter())

In [31]:
pd.DataFrame(dataset.scoreddocs_iter())

| | query_id | doc_id | score |
|---|---|---|---|
| 0 | 0001 | gakkai-0000064659 | 13.563926 |
| 1 | 0001 | gakkai-0000225773 | 13.524426 |
| 2 | 0001 | gakkai-0000198139 | 13.403230 |
| 3 | 0001 | gakkai-0000245010 | 13.403230 |
| 4 | 0001 | gakkai-0000328806 | 13.402888 |
| ... | ... | ... | ... |
| 76510 | 0083 | gakkai-0000272261 | -6.225316 |
| 76511 | 0083 | gakkai-0000242113 | -6.228070 |
| 76512 | 0083 | gakkai-0000075436 | -6.229898 |
| 76513 | 0083 | gakkai-0000151829 | -6.229998 |
