# Preprocessing of NTCIR-17 Transfer Task Eval Dataset

## About the dataset

The NTCIR-17 Transfer Task uses the following test collection as its evaluation dataset.

### Overview of NTCIR-2 AdHoc Test Collection

- Reference
> Kando, et al. (2001). [Overview of Japanese and English Information Retrieval Tasks (JEIR) at the Second NTCIR Workshop](http://research.nii.ac.jp/ntcir/workshop/OnlineProceedings2/ovview-kando2.pdf). In: Proceedings of the Second NTCIR Workshop on Research in Chinese & Japanese Text Retrieval and Text Summarization, May 2000- March 2001.
- How to obtain the data: [Research Purpose Use of NTCIR Test Collections or Data Archive/ User Agreement](http://research.nii.ac.jp/ntcir/permission/perm-en.html#ntcir-2)
> The collection includes (1) Document data (Author abstracts of the Academic Conference Paper Database (1997-1999) and Grant Reports (1988-1997) = about 400,000 Japanese and 130,000 English documents,) (2) 49 Search topics (Japanese and English,) and (3) Relevance Judgements. The whole test collection is available for research purpose use from NII. For experiments, the document data must be used with those of the NTCIR-1. Relevance judgments were done of the merged database of NTCIR-1 and NTCIR-2. To merge document collections, the document IDs in the NTCIR-1 must be converted using the script included in the NTCIR-2 CD-ROM. At the Second NTCIR Workshop, segmented data, in which the whole document data were segmented into terms (short units as well as longer units) using the standard software for segmentation in the year of 2000.

## Previous Step

- `preprocess-transfer1-train.ipynb`

## Data path
- Obtain a copy of the test collection by following the instructions above.
- We assume that the downloaded files have been uncompressed to the following paths.

In [1]:
import os
os.environ['DATA1'] = '../testcollections/ntcir/NTCIR-1'
os.environ['DATA2'] = '../testcollections/ntcir/NTCIR-2'

In [2]:
!ls $DATA2

agreem2-e.pdf		 j-docs.tgz		      readme-j.pdf
agreem2-j.pdf		 manual-e.pdf		      readme-j.txt
correction-e-130709.pdf  manual-j.pdf		      rels.tgz
correction-j-130705.pdf  readme-e-revised-130709.pdf  scripts.tgz
e-docs.tgz		 readme-e.txt		      topics
j-docs			 readme-j-revised-130709.pdf  topics.tgz


## Preprocessing of corpus files

- NTCIR-2 uses both its own new corpus files and those of NTCIR-1.

In [3]:
!tar xvfz $DATA2/j-docs.tgz -C $DATA2/

j-docs/
j-docs/ntc2-j1g
j-docs/ntc2-j1k


In [4]:
!iconv -f EUC-JP -t UTF-8 -c $DATA2/j-docs/ntc2-j1g > $DATA2/j-docs/ntc2-j1g.utf8
!iconv -f EUC-JP -t UTF-8 -c $DATA2/j-docs/ntc2-j1k > $DATA2/j-docs/ntc2-j1k.utf8
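
Where `iconv` is not available, the same conversion can be sketched in pure Python. The helper name `euc_jp_to_utf8` is hypothetical, and `errors='ignore'` is assumed to be a close enough stand-in for the `-c` flag, which drops characters that cannot be converted.

```python
def euc_jp_to_utf8(in_file, out_file):
    """Decode an EUC-JP file and write it back as UTF-8 (hypothetical helper)."""
    # errors='ignore' silently skips bytes that are not valid EUC-JP,
    # which approximates `iconv -c`.
    # newline='' keeps the original line endings untouched.
    with open(in_file, 'r', encoding='euc_jp', errors='ignore', newline='') as src, \
         open(out_file, 'w', encoding='utf-8', newline='') as dst:
        for line in src:
            dst.write(line)
```

For example, `euc_jp_to_utf8('ntc2-j1g', 'ntc2-j1g.utf8')` would mirror the first `iconv` call above, given the same input file.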

In [5]:
# Number of documents
!grep "^<ACCN" $DATA2/j-docs/ntc2-j1*.utf8 | wc -l

403240


### ntc2-j1g

In [6]:
import sys
import re
import json

def docs_g_jsonl(in_file):
    """Convert ntc2-j1g (title in <TITL>) to JSONL with doc_id and text."""
    out_file = in_file + '.jsonl'
    # Explicit encodings avoid depending on the platform default.
    with open(in_file, 'r', encoding='utf-8') as f, \
         open(out_file, 'w', encoding='utf-8') as f1:
        record = ''
        count = 0
        for line in f:
            line = line.rstrip()
            if line == '</REC>':
                # End of a record: extract the ID, title, and abstract.
                accn = re.findall(r'<ACCN>(.+?)<', record)[0]
                titl = re.findall(r'<TITL .+?>(.+?)<', record)[0]
                abst = re.findall(r'<ABST .+?>(.+?)</ABST>', record)[0]
                # Strip the paragraph tags inside the abstract.
                abst = re.sub(r'</?ABST\.P>', '', abst)
                items = {
                    'doc_id': accn,
                    'text': titl + ' ' + abst
                }
                j = json.dumps(items, ensure_ascii=False)
                f1.write(f'{j}\n')
                record = ''
                count += 1
                if count % 10000 == 0:
                    print(f'{count}, ', end='', file=sys.stderr)
            else:
                # Lines of one record are concatenated without a separator.
                record += line
        print(f'{count}, Done!', file=sys.stderr)

In [7]:
docs_g_jsonl(os.getenv('DATA2') + '/j-docs/ntc2-j1g.utf8')

10000, 20000, 30000, 40000, 50000, 60000, 70000, 80000, 90000, 100000, 110000, 116177, Done!
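
To make the extraction step concrete, here is a minimal sketch that applies the same regular expressions to a single record. The record is invented: the tag names follow the regexes used in this notebook, but the `TYPE="kanji"` attribute and all field values are assumptions, and `parse_record` is a hypothetical helper, not part of the notebook.

```python
import re

# Invented record in the concatenated form built by the loop above
# (lines joined without separators, closing </REC> not included).
SAMPLE_RECORD = (
    '<REC>'
    '<ACCN>gakkai-j-0000000001</ACCN>'
    '<TITL TYPE="kanji">Sample title</TITL>'
    '<ABST TYPE="kanji"><ABST.P>First paragraph.</ABST.P>'
    '<ABST.P>Second paragraph.</ABST.P></ABST>'
)

def parse_record(record, title_tag='TITL'):
    """Extract doc_id and text (title + abstract) from one record string."""
    accn = re.search(r'<ACCN>(.+?)<', record).group(1)
    titl = re.search(rf'<{title_tag} .+?>(.+?)<', record).group(1)
    abst = re.search(r'<ABST .+?>(.+?)</ABST>', record).group(1)
    # Paragraph tags are removed, so paragraphs are joined without a separator.
    abst = re.sub(r'</?ABST\.P>', '', abst)
    return {'doc_id': accn, 'text': titl + ' ' + abst}

print(parse_record(SAMPLE_RECORD))
```

Passing `title_tag='PJNM'` would cover records whose title is stored in `<PJNM>`, so one parameterized function could, in principle, serve both corpus files.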


### ntc2-j1k

In [8]:
import sys
import re
import json

def docs_k_jsonl(in_file):
    """Convert ntc2-j1k (title in <PJNM>) to JSONL with doc_id and text."""
    out_file = in_file + '.jsonl'
    # Explicit encodings avoid depending on the platform default.
    with open(in_file, 'r', encoding='utf-8') as f, \
         open(out_file, 'w', encoding='utf-8') as f1:
        record = ''
        count = 0
        for line in f:
            line = line.rstrip()
            if line == '</REC>':
                # End of a record: extract the ID, title, and abstract.
                accn = re.findall(r'<ACCN>(.+?)<', record)[0]
                # Only difference from docs_g_jsonl: the title tag is <PJNM>.
                titl = re.findall(r'<PJNM .+?>(.+?)<', record)[0]
                abst = re.findall(r'<ABST .+?>(.+?)</ABST>', record)[0]
                # Strip the paragraph tags inside the abstract.
                abst = re.sub(r'</?ABST\.P>', '', abst)
                items = {
                    'doc_id': accn,
                    'text': titl + ' ' + abst
                }
                j = json.dumps(items, ensure_ascii=False)
                f1.write(f'{j}\n')
                record = ''
                count += 1
                if count % 10000 == 0:
                    print(f'{count}, ', end='', file=sys.stderr)
            else:
                # Lines of one record are concatenated without a separator.
                record += line
        print(f'{count}, Done!', file=sys.stderr)

In [9]:
docs_k_jsonl(os.getenv('DATA2') + '/j-docs/ntc2-j1k.utf8')

10000, 20000, 30000, 40000, 50000, 60000, 70000, 80000, 90000, 100000, 110000, 120000, 130000, 140000, 150000, 160000, 170000, 180000, 190000, 200000, 210000, 220000, 230000, 240000, 250000, 260000, 270000, 280000, 287063, Done!


In [10]:
!wc -l $DATA2/j-docs/ntc2-j1*.utf8.jsonl

   116177 ../testcollections/ntcir/NTCIR-2/j-docs/ntc2-j1g.utf8.jsonl
   287063 ../testcollections/ntcir/NTCIR-2/j-docs/ntc2-j1k.utf8.jsonl
   403240 total


### NTCIR-1 corpus file

In [11]:
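import sys
import json

def convert_ntcir1_to_ntcir2(in_file, out_file):
    """Rewrite NTCIR-1 doc IDs ('gakkai-...') into the NTCIR-2 form ('gakkai-j-...')."""
    with open(in_file, 'r', encoding='utf-8') as f, \
         open(out_file, 'w', encoding='utf-8') as f2:
        i = -1  # stays -1 if the input file is empty
        for i, line in enumerate(f):
            j = json.loads(line)
            j['doc_id'] = j['doc_id'].replace('gakkai-', 'gakkai-j-')
            jline = json.dumps(j, ensure_ascii=False)
            f2.write(f'{jline}\n')
            if i % 10000 == 0:
                print(f'{i}, ', end='', file=sys.stderr)
        # i is the zero-based index of the last line, so i + 1 documents were written.
        print(f'{i}, Done!', file=sys.stderr)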

In [12]:
convert_ntcir1_to_ntcir2(
    os.getenv('DATA1') + '/mlir/ntc1-j1.utf8.jsonl',
    os.getenv('DATA2') + '/j-docs/ntc1-j1.utf8.mod.jsonl'
)

0, 10000, 20000, 30000, 40000, 50000, 60000, 70000, 80000, 90000, 100000, 110000, 120000, 130000, 140000, 150000, 160000, 170000, 180000, 190000, 200000, 210000, 220000, 230000, 240000, 250000, 260000, 270000, 280000, 290000, 300000, 310000, 320000, 330000, 332917, Done!
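
One thing to watch with a plain `str.replace` is that running the conversion a second time on already converted IDs would produce `gakkai-j-j-...`. The following is a minimal sketch of an idempotent variant; `to_ntcir2_doc_id` is a hypothetical helper and assumes that every NTCIR-1 ID to be converted starts with `gakkai-`.

```python
def to_ntcir2_doc_id(doc_id):
    """Map 'gakkai-<n>' to 'gakkai-j-<n>'; leave every other ID unchanged."""
    old, new = 'gakkai-', 'gakkai-j-'
    # Already converted, or not an NTCIR-1 gakkai ID: return as is.
    if doc_id.startswith(new) or not doc_id.startswith(old):
        return doc_id
    return new + doc_id[len(old):]

print(to_ntcir2_doc_id('gakkai-0000185751'))    # gakkai-j-0000185751
print(to_ntcir2_doc_id('gakkai-j-0000185751'))  # gakkai-j-0000185751
```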


In [13]:
!cat $DATA2/j-docs/ntc1-j1.utf8.mod.jsonl $DATA2/j-docs/ntc2-j1g.utf8.jsonl $DATA2/j-docs/ntc2-j1k.utf8.jsonl > $DATA2/j-docs/ntc12-j1gk.mod.jsonl

In [14]:
!ls $DATA2/j-docs

ntc1-j1.utf8.mod.jsonl	ntc2-j1g       ntc2-j1g.utf8.jsonl  ntc2-j1k.utf8
ntc12-j1gk.mod.jsonl	ntc2-j1g.utf8  ntc2-j1k		    ntc2-j1k.utf8.jsonl


## Preprocessing of topic files

**WARNING FOR NTCIR-17 TRANSFER TASK PARTICIPANTS**

- You are NOT allowed to access the eval dataset topics until you freeze the development of your systems.
- Use the train dataset topics for all your development.
- See Getting Started at https://github.com/orgs/ntcirtransfer/discussions/ for more details.

In [15]:
!tar xvfz $DATA2/topics.tgz -C $DATA2/

topics/
topics/topic-e0101-0149
topics/topic-j0101-0149


In [16]:
!iconv -f EUC-JP -t UTF-8 -c $DATA2/topics/topic-j0101-0149 > $DATA2/topics/topic-j0101-0149.utf8

In [17]:
!ls $DATA2/topics

topic-e0101-0149  topic-j0101-0149  topic-j0101-0149.utf8


In [18]:
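import re
import json

def topics_jsonl(in_file):
    """Write one JSON object per topic, using the topic title as the query text."""
    out_file = in_file + '.jsonl'
    with open(in_file, 'r', encoding='utf-8') as f:
        s = f.read()
    # Raw strings avoid invalid-escape warnings; '/' needs no escaping.
    qid = re.findall(r'<TOPIC q=([^>]+)>', s)
    title = re.findall(r'<TITLE>\n(.*)\n</TITLE>', s)
    with open(out_file, 'w', encoding='utf-8') as f:
        for q, t in zip(qid, title):
            # json.dumps escapes quotes and backslashes in the title.
            f.write(json.dumps({'query_id': q, 'text': t}, ensure_ascii=False) + '\n')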

In [19]:
topics_jsonl(os.getenv('DATA2') + '/topics/topic-j0101-0149.utf8')

In [20]:
!ls $DATA2/topics

topic-e0101-0149  topic-j0101-0149.utf8
topic-j0101-0149  topic-j0101-0149.utf8.jsonl
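
As an illustration of what the topic conversion extracts, the sketch below runs the same two patterns on an invented topic snippet. The layout (`<TOPIC q=...>`, with the title on its own line between `<TITLE>` and `</TITLE>`) is inferred from the regular expressions in this notebook, and the topic texts and the helper name `topics_to_jsonl_lines` are made up. Serializing with `json.dumps` keeps the output valid even when a title contains a double quote.

```python
import json
import re

# Invented topics; only the tag layout mirrors what the regexes expect.
SAMPLE_TOPICS = (
    '<TOPIC q=0101>\n<TITLE>\nFirst "sample" title\n</TITLE>\n'
    '<DESCRIPTION>\nInvented description.\n</DESCRIPTION>\n</TOPIC>\n'
    '<TOPIC q=0102>\n<TITLE>\nSecond sample title\n</TITLE>\n</TOPIC>\n'
)

def topics_to_jsonl_lines(s):
    """Return one JSON string per topic with query_id and title text."""
    qids = re.findall(r'<TOPIC q=([^>]+)>', s)
    titles = re.findall(r'<TITLE>\n(.*)\n</TITLE>', s)
    return [json.dumps({'query_id': q, 'text': t}, ensure_ascii=False)
            for q, t in zip(qids, titles)]

for line in topics_to_jsonl_lines(SAMPLE_TOPICS):
    print(line)
```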


## Top 1000 data

- For NTCIR-17 Transfer Task participants only (Reranking subtask).
- Download `top1000.eval.tsv` into the `../testcollections/ntcir/NTCIR-2/j-docs` folder.
- Note that not every topic has 1000 documents.

In [21]:
!ls $DATA2/j-docs

ntc1-j1.utf8.mod.jsonl	ntc2-j1g.utf8	     ntc2-j1k.utf8
ntc12-j1gk.mod.jsonl	ntc2-j1g.utf8.jsonl  ntc2-j1k.utf8.jsonl
ntc2-j1g		ntc2-j1k	     top1000.eval.tsv


## Register the dataset with the ir_datasets module locally

- Dataset name: `ntcir-transfer`
- Subset: `1/eval`
- No qrels

### Location of dataset files

- `../datasets/ntcir-transfer.yaml`
- `../datasets/ntcir_transfer.py`

In [22]:
# Remove old cache (if any)
# !rm -rf ~/.ir_datasets/ntcir-transfer/1/eval

In [23]:
import sys
!{sys.executable} -m pip install -q ir_datasets pandas

In [24]:
import os
import sys
# Notebooks do not define __file__, so resolve ../datasets against the current working directory.
sys.path.append(os.path.abspath('../datasets'))

In [25]:
import ir_datasets
import ntcir_transfer
dataset = ir_datasets.load('ntcir-transfer/1/eval')

In [26]:
dataset.docs_cls().__annotations__

OrderedDict([('doc_id', str), ('text', str)])

In [None]:
docstore = dataset.docs_store()
docstore.get('kaken-j-0924516300').text # the example document shown in the overview paper

In [28]:
dataset.queries_cls().__annotations__

OrderedDict([('query_id', str), ('text', str)])

In [29]:
import pandas as pd
pd.DataFrame(dataset.scoreddocs_iter())

| | query_id | doc_id | score |
|---|---|---|---|
| 0 | 0101 | kaken-j-0975101400 | 17.722502 |
| 1 | 0101 | kaken-j-0960142800 | 17.664987 |
| 2 | 0101 | kaken-j-0911436000 | 17.568185 |
| 3 | 0101 | kaken-j-0970425300 | 17.503291 |
| 4 | 0101 | kaken-j-0934033100 | 17.445929 |
| ... | ... | ... | ... |
| 43635 | 0149 | kaken-j-0972466500 | -17.015109 |
| 43636 | 0149 | kaken-j-0960134100 | -17.017582 |
| 43637 | 0149 | gakkai-j-0000185751 | -17.018490 |
| 43638 | 0149 | kaken-j-0904518400 | -17.031096 |
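
Since not every topic comes with 1000 candidate documents, a per-topic count can help before running a reranker. The sketch below assumes that each item yielded by `dataset.scoreddocs_iter()` exposes a `query_id` attribute, as the columns above suggest; `docs_per_query` and `short_queries` are hypothetical helper names.

```python
from collections import Counter

def docs_per_query(scoreddocs):
    """Count candidate documents per query_id."""
    return Counter(sd.query_id for sd in scoreddocs)

def short_queries(scoreddocs, k=1000):
    """Return {query_id: count} for queries with fewer than k candidates."""
    counts = docs_per_query(scoreddocs)
    return {qid: n for qid, n in counts.items() if n < k}
```

With the dataset loaded as above, `short_queries(dataset.scoreddocs_iter())` would list the topics that have fewer than 1000 candidates.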
