# Preprocessing of NTCIR-18 Transfer 2 DCLR Subtask (Train Set)

## About the subtask

The NTCIR-18 Transfer DCLR Subtask aims to retrieve English documents for Japanese topics.


## About the dataset

The NTCIR-18 Transfer DCLR Subtask uses the following test collection as its training set.

### Overview of NTCIR-1 Cross-Lingual IR Test Collection

- Reference
> Kando, et al. (1999). [Overview of IR Tasks at the First NTCIR Workshop](http://research.nii.ac.jp/ntcir/workshop/OnlineProceedings/IR-overview.pdf). In: Proceedings of the First NTCIR Workshop on Research in Japanese Text Retrieval and Term Recognition, August 30 - September 1, 1999, pp.11-44.
- How to obtain the data: [Research Purpose Use of NTCIR Test Collections or Data Archive/ User Agreement](http://research.nii.ac.jp/ntcir/permission/perm-en.html#ntcir-1)
> The IR Test collection includes (1) Document data (Author abstracts of the Academic Conference Paper Database (1988-1997) = author abstracts of the paper presented at the academic conference hosted by either of 65 academic societies in Japan. about 330,000 documents; more than half are English-Japanese paired,) (2) 83 Search topics (Japanese,) and (3) Relevance Judgements. The collection can be used for retrieval experiments of Japanese text retrieval and CLIR of search Either of English documents or Japanese-English documents by Japanese topics. The Term Extraction Test collection includes tagged corpus using the 2000 Japanese documents selected from the above IR test collection. The whole test collection is available for research purpose use from NII.

## Data path
- Get a copy of the test collection by following the instructions above.
- We assume that the downloaded files have been uncompressed to the following path.

In [1]:
import os
NTCIR1 = os.getcwd() + '/../testcollections/ntcir/NTCIR-1'

In [2]:
os.listdir(NTCIR1)

['ADHOC.TGZ',
 'AGREEM-E.PDF',
 'AGREEM-J.PDF',
 'clir',
 'CLIR.TGZ',
 'CORRECTION-E-130709.pdf',
 'CORRECTION-J-130705.pdf',
 'MANUAL-E.PDF',
 'MANUAL-J.PDF',
 'mlir',
 'MLIR.TGZ',
 'README-E-REVISED-130709.pdf',
 'README-E.TXT',
 'README-J-REVISED-130705.pdf',
 'README-J.PDF',
 'README-J.TXT',
 'TAGREE-E.PDF',
 'TAGREE-J.PDF',
 'TMREC.TGZ',
 'topics',
 'TOPICS.TGZ']

## Preprocessing of NTCIR-1 Dataset

### Corpus files

In [3]:
import tarfile
tarfile.open(NTCIR1 + '/CLIR.TGZ').extractall(path=NTCIR1)

In [4]:
os.listdir(NTCIR1 + '/clir')

['ntc1-e1',
 'rel1_ntc1-e1_0001-0030',
 'rel1_ntc1-e1_0031-0083',
 'rel2_ntc1-e1_0001-0030',
 'rel2_ntc1-e1_0031-0083']

In [5]:
import codecs

def convert_encoding(input_file_path, output_file_path, input_encoding, output_encoding, error_handling='ignore'):
    """Re-encode a text file. With the default error_handling='ignore',
    bytes that are invalid in input_encoding are silently dropped."""
    # Open the input file with the specified encoding
    with codecs.open(input_file_path, 'r', encoding=input_encoding, errors=error_handling) as file:
        contents = file.read()
    
    # Open the output file with the desired encoding
    with codecs.open(output_file_path, 'w', encoding=output_encoding) as file:
        file.write(contents)

In [6]:
convert_encoding(NTCIR1 + '/clir/ntc1-e1', NTCIR1 + '/clir/ntc1-e1.utf8', 'ascii', 'utf-8')

In [7]:
os.listdir(NTCIR1 + '/clir')

['ntc1-e1',
 'ntc1-e1.utf8',
 'rel1_ntc1-e1_0001-0030',
 'rel1_ntc1-e1_0031-0083',
 'rel2_ntc1-e1_0001-0030',
 'rel2_ntc1-e1_0031-0083']
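
Because `convert_encoding` defaults to `errors='ignore'`, any byte that is not valid in the input encoding is dropped without warning. The following is a minimal sketch, not part of the original notebook, for counting the non-ASCII bytes in a file before converting it. The helper name `count_non_ascii_bytes` and the throwaway demo file are assumptions made for illustration.

```python
import os
import tempfile

def count_non_ascii_bytes(path):
    """Return the number of bytes in `path` outside the 7-bit ASCII range.

    These are exactly the bytes that decoding with encoding='ascii' and
    errors='ignore' would drop.
    """
    with open(path, "rb") as f:
        raw = f.read()
    return sum(1 for byte in raw if byte > 0x7F)

# Demo on a throwaway file containing one Latin-1 byte (0xE9)
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"caf\xe9 latte\n")
demo_path = tmp.name

dropped = count_non_ascii_bytes(demo_path)
os.remove(demo_path)
print(dropped)  # prints 1
```

A count of zero suggests the conversion is lossless; a non-zero count may be worth inspecting before relying on the converted file.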

In [8]:
import re
def docs_jsonl(in_file):
    out_file = in_file + '.jsonl'
    with open(in_file, 'r', encoding='utf-8') as f:
        s = f.read()
        # Drop the paragraph tags nested inside <ABSE>
        s = re.sub('<ABSE.P>|</ABSE.P>', '', s)
        # Escape backslashes first, then double quotes, so the text is safe inside a JSON string
        s = s.replace('\\', '\\\\')
        s = s.replace('"', '\\"')

        # Document ID, English title, and English abstract (each on a single line)
        accn = re.findall('<ACCN.*?>(.*)</ACCN>', s)
        titl = re.findall('<TITE.*?>(.*)</TITE>', s)
        abst = re.findall('<ABSE.*?>(.*)</ABSE>', s)

    with open(out_file, 'w', encoding='utf-8') as f:
        for i in range(len(accn)):
            # text = title + abstract
            f.write(f'{{ "doc_id": "{accn[i]}", "text": "{titl[i]}. {abst[i]}" }}\n')

In [9]:
docs_jsonl(NTCIR1 + '/clir/ntc1-e1.utf8')

In [10]:
os.listdir(NTCIR1 + '/clir')

['ntc1-e1',
 'ntc1-e1.utf8',
 'ntc1-e1.utf8.jsonl',
 'rel1_ntc1-e1_0001-0030',
 'rel1_ntc1-e1_0031-0083',
 'rel2_ntc1-e1_0001-0030',
 'rel2_ntc1-e1_0031-0083']
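
The manual escaping in `docs_jsonl` covers backslashes and double quotes. As a possible alternative, the standard `json` module can serialize each record and handle all escaping, including control characters such as tabs. The sketch below is an assumption-laden illustration rather than the notebook's method: `to_jsonl_line` is a hypothetical helper, and the document ID and strings are made-up sample values.

```python
import json

def to_jsonl_line(doc_id, title, abstract):
    """Serialize one document as a single JSON line (text = title + abstract)."""
    record = {"doc_id": doc_id, "text": f"{title}. {abstract}"}
    # ensure_ascii=False keeps non-ASCII characters readable in the output
    return json.dumps(record, ensure_ascii=False)

# Made-up sample values containing characters that need escaping
line = to_jsonl_line("doc-1", 'A "quoted" title', "Back\\slash and\ttab")
roundtrip = json.loads(line)
print(roundtrip["doc_id"])  # prints doc-1
```

Parsing the line back with `json.loads` returns the original strings unchanged, which is a quick way to confirm that the escaping is correct.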

### Topic files

In [11]:
tarfile.open(NTCIR1 + '/TOPICS.TGZ').extractall(path=NTCIR1)

In [12]:
os.listdir(NTCIR1 + '/topics')

['topic0001-0030', 'topic0031-0083']

In [13]:
convert_encoding(NTCIR1 + '/topics/topic0001-0030', NTCIR1 + '/topics/topic0001-0030.utf8', 'euc_jp', 'utf-8')
convert_encoding(NTCIR1 + '/topics/topic0031-0083', NTCIR1 + '/topics/topic0031-0083.utf8', 'euc_jp', 'utf-8')

In [14]:
os.listdir(NTCIR1 + '/topics')

['topic0001-0030',
 'topic0001-0030.utf8',
 'topic0031-0083',
 'topic0031-0083.utf8']

In [15]:
import json
import re
def topics_jsonl(in_file):
    out_file = in_file + '.jsonl'
    with open(in_file, 'r', encoding='utf-8') as f:
        s = f.read()
        # Topic ID, title, and description (each body on its own line)
        qid = re.findall('<TOPIC q=([^>]+)>', s)
        title = re.findall('<TITLE>\n(.*)\n</TITLE>', s)
        desc = re.findall('<DESCRIPTION>\n(.*)\n</DESCRIPTION>', s)
    with open(out_file, 'w', encoding='utf-8') as f:
        for i in range(len(qid)):
            # json.dumps escapes quotes and backslashes that may occur in the topic text
            record = {'query_id': qid[i], 'text': title[i], 'description': desc[i]}
            f.write(json.dumps(record, ensure_ascii=False) + '\n')

In [16]:
topics_jsonl(NTCIR1 + '/topics/topic0001-0030.utf8')
topics_jsonl(NTCIR1 + '/topics/topic0031-0083.utf8')

In [17]:
def concatenate_files(file_path1, file_path2, output_file_path):
    with open(file_path1, 'r', encoding='utf-8') as file1:
        data1 = file1.read()
        
    with open(file_path2, 'r', encoding='utf-8') as file2:
        data2 = file2.read()
    
    with open(output_file_path, 'w', encoding='utf-8') as outfile:
        outfile.write(data1)
        outfile.write(data2)

In [18]:
concatenate_files(NTCIR1 + '/topics/topic0001-0030.utf8.jsonl', NTCIR1 + '/topics/topic0031-0083.utf8.jsonl', NTCIR1 + '/topics/topic0001-0083.utf8.jsonl')

In [19]:
os.listdir(NTCIR1 + '/topics')

['topic0001-0030',
 'topic0001-0030.utf8',
 'topic0001-0030.utf8.jsonl',
 'topic0001-0083.utf8.jsonl',
 'topic0031-0083',
 'topic0031-0083.utf8',
 'topic0031-0083.utf8.jsonl']
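
Downstream loaders expect exactly one valid JSON object per line, so a quick check of the generated files can catch problems such as an unescaped quote. The sketch below is illustrative only: `check_jsonl` is a hypothetical helper, and the two sample lines are invented rather than taken from the NTCIR-1 topics.

```python
import io
import json

def check_jsonl(lines, required_keys):
    """Return (number of valid records, line numbers of invalid records)."""
    ok, bad = 0, []
    for lineno, line in enumerate(lines, start=1):
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            bad.append(lineno)
            continue
        # A valid record is a JSON object that has every required key
        if isinstance(record, dict) and required_keys.issubset(record):
            ok += 1
        else:
            bad.append(lineno)
    return ok, bad

# Invented sample: the second line contains an unescaped double quote
sample = io.StringIO(
    '{ "query_id": "0001", "text": "t1", "description": "d1" }\n'
    '{ "query_id": "0002", "text": "an "unescaped" quote", "description": "d2" }\n'
)
result = check_jsonl(sample, {"query_id", "text", "description"})
print(result)  # prints (1, [2])
```

For a real file, an open file handle can be passed in place of the `io.StringIO` sample.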

### Qrel files
- This test collection provides graded relevance judgments (A: Relevant, B: Partially Relevant, C: Not Relevant).
- We convert them to numeric scores as follows.
    - A: 2
    - B: 1
    - C: 0
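
The mapping above can be sketched as a small lookup table applied to one line at a time. This is a minimal illustration, assuming each input line holds a topic ID, a grade, and a document ID separated by tabs, and that the output follows the TREC-style layout (topic, `Q0`, document ID, score). The names `GRADE_TO_SCORE` and `to_trec_qrel` are hypothetical.

```python
# Letter grades mapped to numeric relevance scores
GRADE_TO_SCORE = {"A": 2, "B": 1, "C": 0}

def to_trec_qrel(line):
    """Convert 'topic<TAB>grade<TAB>doc_id' into 'topic<TAB>Q0<TAB>doc_id<TAB>score'.

    Returns None for lines with too few fields or an unknown grade.
    """
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 3 or fields[1] not in GRADE_TO_SCORE:
        return None
    topic, grade, doc_id = fields[0], fields[1], fields[2]
    return f"{topic}\tQ0\t{doc_id}\t{GRADE_TO_SCORE[grade]}"

converted = to_trec_qrel("0001\tA\tgakkai-0000011144")
print(converted.split("\t"))  # prints ['0001', 'Q0', 'gakkai-0000011144', '2']
```

Keeping the grades in one dictionary means a change to the scoring scheme touches a single line.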

In [20]:
convert_encoding(NTCIR1 + '/clir/rel2_ntc1-e1_0001-0030', NTCIR1 + '/clir/rel2_ntc1-e1_0001-0030.utf8', 'ascii', 'utf-8')
convert_encoding(NTCIR1 + '/clir/rel2_ntc1-e1_0031-0083', NTCIR1 + '/clir/rel2_ntc1-e1_0031-0083.utf8', 'ascii', 'utf-8')

In [21]:
os.listdir(NTCIR1 + '/clir')

['ntc1-e1',
 'ntc1-e1.utf8',
 'ntc1-e1.utf8.jsonl',
 'rel1_ntc1-e1_0001-0030',
 'rel1_ntc1-e1_0031-0083',
 'rel2_ntc1-e1_0001-0030',
 'rel2_ntc1-e1_0001-0030.utf8',
 'rel2_ntc1-e1_0031-0083',
 'rel2_ntc1-e1_0031-0083.utf8']

In [22]:
def qrel_graded_tsv(in_file):
    # Map relevance grades to numeric scores
    grade_to_score = {'A': 2, 'B': 1, 'C': 0}
    out_file = in_file + '.tsv'
    with open(in_file, 'r', encoding='utf-8') as f, open(out_file, 'w', encoding='utf-8') as f2:
        for line in f:
            line = line.rstrip()
            flds = line.split('\t')
            # Write TREC-style qrels (topic, Q0, doc_id, score); skip any other grade
            if flds[1] in grade_to_score:
                f2.write(f'{flds[0]}\tQ0\t{flds[2]}\t{grade_to_score[flds[1]]}\n')

In [23]:
qrel_graded_tsv(NTCIR1 + '/clir/rel2_ntc1-e1_0001-0030.utf8')
qrel_graded_tsv(NTCIR1 + '/clir/rel2_ntc1-e1_0031-0083.utf8')

In [24]:
concatenate_files(NTCIR1 + '/clir/rel2_ntc1-e1_0001-0030.utf8.tsv', NTCIR1 + '/clir/rel2_ntc1-e1_0031-0083.utf8.tsv', NTCIR1 + '/clir/rel2_ntc1-e1_0001-0083.utf8.tsv')

In [25]:
os.listdir(NTCIR1 + '/clir')

['ntc1-e1',
 'ntc1-e1.utf8',
 'ntc1-e1.utf8.jsonl',
 'rel1_ntc1-e1_0001-0030',
 'rel1_ntc1-e1_0031-0083',
 'rel2_ntc1-e1_0001-0030',
 'rel2_ntc1-e1_0001-0030.utf8',
 'rel2_ntc1-e1_0001-0030.utf8.tsv',
 'rel2_ntc1-e1_0001-0083.utf8.tsv',
 'rel2_ntc1-e1_0031-0083',
 'rel2_ntc1-e1_0031-0083.utf8',
 'rel2_ntc1-e1_0031-0083.utf8.tsv']
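
Before registering the qrels, it may help to tally how many judgments fall into each score as a sanity check on the conversion. The following is a sketch under the assumption that each line has four whitespace-separated columns ending in the numeric score. `grade_histogram` is a hypothetical helper and the sample lines use made-up document IDs.

```python
import io
from collections import Counter

def grade_histogram(lines):
    """Count qrel lines per numeric score (the fourth column)."""
    counts = Counter()
    for line in lines:
        fields = line.split()
        if len(fields) == 4:  # ignore blank or malformed lines
            counts[int(fields[3])] += 1
    return counts

# Made-up sample in the converted layout: topic, Q0, doc_id, score
sample = io.StringIO(
    "0001\tQ0\tdoc-a\t2\n"
    "0001\tQ0\tdoc-b\t0\n"
    "0002\tQ0\tdoc-c\t1\n"
    "0002\tQ0\tdoc-d\t2\n"
)
histogram = grade_histogram(sample)
print(sorted(histogram.items()))  # prints [(0, 1), (1, 1), (2, 2)]
```

The total across all scores should match the number of A, B, and C lines in the two source files.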

### Register the dataset with the ir_datasets module locally

- Dataset name: `ntcir-transfer`
- Subset: `2/train`

#### Location of dataset files

- `../datasets/ntcir-transfer.yaml`
- `../datasets/ntcir_transfer.py`

In [27]:
import sys
!{sys.executable} -m pip install -q ir_datasets pandas

In [28]:
# Notebooks do not define __file__, so resolve ../datasets relative to the current working directory
sys.path.append(os.path.join(os.getcwd(), '../datasets'))

In [29]:
import ir_datasets
import ntcir_transfer
dataset = ir_datasets.load('ntcir-transfer/2/train')

In [30]:
dataset.docs_cls().__annotations__

{'doc_id': str, 'text': str}

In [None]:
docstore = dataset.docs_store()
docstore.get('gakkai-0000011144').text # the example document shown in the overview paper

In [32]:
dataset.queries_cls().__annotations__

{'query_id': str, 'text': str}

In [None]:
import pandas as pd
pd.DataFrame(dataset.queries_iter())

In [34]:
dataset.qrels_defs()

{2: 'relevant', 1: 'partially relevant', 0: 'not relevant'}

In [None]:
pd.DataFrame(dataset.qrels_iter())