# Preprocessing of NTCIR-18 Transfer 2 DCLR Subtask (Valid Set)

## About the subtask

The NTCIR-18 Transfer DCLR Subtask aims to retrieve English documents using Japanese topics.


## About the dataset

The NTCIR-18 Transfer DCLR Subtask uses the following test collections as the validation set.

### Overview of NTCIR-2 Cross-Lingual IR Test Collection
- Reference
> Kando et al. (2001). [Overview of Japanese and English Information Retrieval Tasks (JEIR) at the Second NTCIR Workshop](http://research.nii.ac.jp/ntcir/workshop/OnlineProceedings2/ovview-kando2.pdf). In: Proceedings of the Second NTCIR Workshop on Research in Chinese & Japanese Text Retrieval and Text Summarization, May 2000 - March 2001.
- How to obtain the data: [Research Purpose Use of NTCIR Test Collections or Data Archive/ User Agreement](http://research.nii.ac.jp/ntcir/permission/perm-en.html#ntcir-2)
> The collection includes (1) Document data (Author abstracts of the Academic Conference Paper Database (1997-1999) and Grant Reports (1988-1997) = about 400,000 Japanese and 130,000 English documents), (2) 49 Search topics (Japanese and English), and (3) Relevance Judgements. The whole test collection is available for research purpose use from NII. For experiments, the document data must be used with those of the NTCIR-1. Relevance judgments were done on the merged database of NTCIR-1 and NTCIR-2. To merge document collections, the document IDs in the NTCIR-1 must be converted using the script included in the NTCIR-2 CD-ROM. At the Second NTCIR Workshop, segmented data, in which the whole document data were segmented into terms (short units as well as longer units) using the standard software for segmentation in the year of 2000.

## Previous Step

- `preprocess-transfer2-train.ipynb`

## Data path
- Get a copy of the test collections by following the instructions above.
- We assume that the downloaded files have been uncompressed to the following paths.

In [1]:
import os
NTCIR1 = os.getcwd() + '/../testcollections/ntcir/NTCIR-1'
NTCIR2 = os.getcwd() + '/../testcollections/ntcir/NTCIR-2'

In [2]:
import os
os.listdir(NTCIR1)

['ADHOC.TGZ',
 'AGREEM-E.PDF',
 'AGREEM-J.PDF',
 'clir',
 'CLIR.TGZ',
 'CORRECTION-E-130709.pdf',
 'CORRECTION-J-130705.pdf',
 'MANUAL-E.PDF',
 'MANUAL-J.PDF',
 'mlir',
 'MLIR.TGZ',
 'README-E-REVISED-130709.pdf',
 'README-E.TXT',
 'README-J-REVISED-130705.pdf',
 'README-J.PDF',
 'README-J.TXT',
 'TAGREE-E.PDF',
 'TAGREE-J.PDF',
 'TMREC.TGZ',
 'topics',
 'TOPICS.TGZ']

In [3]:
os.listdir(NTCIR2)

['agreem2-e.pdf',
 'agreem2-j.pdf',
 'correction-e-130709.pdf',
 'correction-j-130705.pdf',
 'e-docs',
 'e-docs.tgz',
 'j-docs',
 'j-docs.tgz',
 'manual-e.pdf',
 'manual-j.pdf',
 'readme-e-revised-130709.pdf',
 'readme-e.txt',
 'readme-j-revised-130709.pdf',
 'readme-j.pdf',
 'readme-j.txt',
 'rels',
 'rels.tgz',
 'scripts.tgz',
 'topics',
 'topics.tgz']

## Preprocessing of NTCIR-2 Dataset

### Corpus files

In [4]:
import tarfile
tarfile.open(NTCIR2 + '/e-docs.tgz').extractall(path=NTCIR2)

In [5]:
os.listdir(NTCIR2 + '/e-docs')

['ntc2-e1g', 'ntc2-e1k']

In [6]:
def convert_encoding(input_file_path, output_file_path, input_encoding, output_encoding, error_handling='ignore'):
    # Read the input file with the specified encoding.
    # Note: with error_handling='ignore', undecodable bytes are dropped silently.
    # newline='' keeps line endings exactly as they appear in the source file.
    with open(input_file_path, 'r', encoding=input_encoding, errors=error_handling, newline='') as file:
        contents = file.read()

    # Write the contents back out in the desired encoding
    with open(output_file_path, 'w', encoding=output_encoding, newline='') as file:
        file.write(contents)

In [7]:
convert_encoding(NTCIR2 + '/e-docs/ntc2-e1g', NTCIR2 + '/e-docs/ntc2-e1g.utf8', 'ascii', 'utf-8')
convert_encoding(NTCIR2 + '/e-docs/ntc2-e1k', NTCIR2 + '/e-docs/ntc2-e1k.utf8', 'ascii', 'utf-8')

In [8]:
os.listdir(NTCIR2 + '/e-docs')

['ntc2-e1g', 'ntc2-e1g.utf8', 'ntc2-e1k', 'ntc2-e1k.utf8']
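
Because `convert_encoding` defaults to `error_handling='ignore'`, any byte that is not valid ASCII disappears from the converted file without a warning. The following is a minimal sketch of a check that could be run beforehand; `count_non_ascii_bytes` is a hypothetical helper that is not part of this notebook, and the sample file is created only for illustration.

```python
import os
import tempfile

def count_non_ascii_bytes(path, chunk_size=1 << 20):
    """Count bytes >= 0x80, i.e. bytes that an 'ascii' decode with errors='ignore' drops."""
    total = 0
    with open(path, 'rb') as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            total += sum(1 for b in chunk if b >= 0x80)
    return total

# Illustration on a throwaway file (not part of the test collection).
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b'<ACCN>sample</ACCN>\n\xa4\xa2\n')
    sample_path = tmp.name

n_dropped = count_non_ascii_bytes(sample_path)
os.remove(sample_path)
print(n_dropped)
```

A result of zero on `ntc2-e1g` and `ntc2-e1k` would indicate that the ASCII conversion loses nothing.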

#### ntc2-e1g

In [9]:
import sys
import re
import json
def docs_g_jsonl(in_file):
    out_file = in_file + '.jsonl'
    with open(in_file, 'r', encoding='utf-8') as f, open(out_file, 'w', encoding='utf-8') as f1:
        record = ''
        items = {}
        count = 0
        for line in f:
            line = line.rstrip()
            if line == '</REC>':
                accn = re.findall(r'<ACCN>(.+?)<', record)[0]
                titl = re.findall(r'<TITE .+?>(.+?)<', record)[0]
                abst = re.findall(r'<ABSE .+?>(.+?)</ABSE>', record)[0]
                abst = re.sub(r'<ABSE.P>', '', abst)
                abst = re.sub(r'</ABSE.P>', '', abst)
                contents = titl + ' ' + abst
                items = {
                    'doc_id': accn,
                    'text': contents
                }
                j = json.dumps(items, ensure_ascii=False)
                f1.write(f'{j}\n')
                record = ''
                items = {}
                count += 1
                if count % 10000 == 0:
                    print(f'{count}, ', end='', file=sys.stderr)
            else:
                record += line
        print(f'{count}, Done!', file=sys.stderr)

In [10]:
docs_g_jsonl(NTCIR2 + '/e-docs/ntc2-e1g.utf8')

10000, 20000, 30000, 40000, 50000, 60000, 70000, 77433, Done!
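
To make the extraction step easier to follow, here is a small sketch that applies the same regular expressions to a single record. The record below is hypothetical (the IDs, the `TYPE` attribute, and the text are made up for illustration); only the tag names are taken from the patterns used in `docs_g_jsonl`.

```python
import re

# Hypothetical record, with its lines already joined as docs_g_jsonl does.
record = (
    '<REC><ACCN>gakkai-e-0000000001</ACCN>'
    '<TITE TYPE="alpha">Sample title</TITE>'
    '<ABSE TYPE="alpha"><ABSE.P>First paragraph.</ABSE.P>'
    '<ABSE.P>Second paragraph.</ABSE.P></ABSE>'
)

accn = re.findall(r'<ACCN>(.+?)<', record)[0]
titl = re.findall(r'<TITE .+?>(.+?)<', record)[0]
abst = re.findall(r'<ABSE .+?>(.+?)</ABSE>', record)[0]
# Strip both the opening and closing paragraph tags in one pass.
abst = re.sub(r'</?ABSE.P>', '', abst)

doc = {'doc_id': accn, 'text': titl + ' ' + abst}
print(doc)
```

In this sample the two abstract paragraphs end up joined without a space, because the paragraph tags are replaced with an empty string.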


#### ntc2-e1k

In [11]:
def docs_k_jsonl(in_file):
    out_file = in_file + '.jsonl'
    with open(in_file, 'r', encoding='utf-8') as f, open(out_file, 'w', encoding='utf-8') as f1:
        record = ''
        items = {}
        count = 0
        for line in f:
            line = line.rstrip()
            if line == '</REC>':
                accn = re.findall(r'<ACCN>(.+?)<', record)[0]
                titl = re.findall(r'<PJNE .+?>(.+?)<', record)[0] # Difference
                abst = re.findall(r'<ABSE .+?>(.+?)</ABSE>', record)[0]
                abst = re.sub(r'<ABSE.P>', '', abst)
                abst = re.sub(r'</ABSE.P>', '', abst)
                contents = titl + ' ' + abst
                items = {
                    'doc_id': accn,
                    'text': contents
                }
                j = json.dumps(items, ensure_ascii=False)
                f1.write(f'{j}\n')
                record = ''
                items = {}
                count += 1
                if count % 10000 == 0:
                    print(f'{count}, ', end='', file=sys.stderr)
            else:
                record += line
        print(f'{count}, Done!', file=sys.stderr)

In [12]:
docs_k_jsonl(NTCIR2 + '/e-docs/ntc2-e1k.utf8')

10000, 20000, 30000, 40000, 50000, 57545, Done!


In [13]:
os.listdir(NTCIR2 + '/e-docs')

['ntc2-e1g',
 'ntc2-e1g.utf8',
 'ntc2-e1g.utf8.jsonl',
 'ntc2-e1k',
 'ntc2-e1k.utf8',
 'ntc2-e1k.utf8.jsonl']
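
Before merging, it may be worth confirming that each JSONL file parses cleanly and contains no repeated `doc_id`. This is a sketch only; `summarize_jsonl` is a hypothetical helper and the in-memory sample stands in for the real files.

```python
import io
import json

def summarize_jsonl(lines):
    """Return (record count, number of repeated doc_ids) for an iterable of JSONL lines."""
    seen = set()
    duplicates = 0
    count = 0
    for line in lines:
        doc = json.loads(line)
        count += 1
        if doc['doc_id'] in seen:
            duplicates += 1
        seen.add(doc['doc_id'])
    return count, duplicates

# In-memory sample standing in for a real JSONL file.
sample = io.StringIO(
    '{"doc_id": "doc-1", "text": "first"}\n'
    '{"doc_id": "doc-2", "text": "second"}\n'
    '{"doc_id": "doc-1", "text": "repeat"}\n'
)
summary = summarize_jsonl(sample)
print(summary)
```

Passing an open file handle for `ntc2-e1g.utf8.jsonl` or `ntc2-e1k.utf8.jsonl` instead of the sample should give record counts that match the 77433 and 57545 reported above.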

#### NTCIR-1 Corpus file

In [14]:
import sys
import json
def convert_ntcir1_to_ntcir2(in_file, out_file):
    with open(in_file, 'r', encoding='utf-8') as f, open(out_file, 'w', encoding='utf-8') as f2:
        for i, line in enumerate(f):
            j = json.loads(line)
            docid = j['doc_id'].replace('gakkai-', 'gakkai-e-')
            j['doc_id'] = docid
            jline = json.dumps(j, ensure_ascii=False)
            f2.write(f'{jline}\n')
            if i % 10000 == 0:
                print(f'{i}, ', end='', file=sys.stderr)
        print(f'{i}, Done!', file=sys.stderr)

In [15]:
convert_ntcir1_to_ntcir2(
    NTCIR1 + '/clir/ntc1-e1.utf8.jsonl',
    NTCIR2 + '/e-docs/ntc1-e1.utf8.mod.jsonl'
)

0, 10000, 20000, 30000, 40000, 50000, 60000, 70000, 80000, 90000, 100000, 110000, 120000, 130000, 140000, 150000, 160000, 170000, 180000, 187079, Done!
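
The ID rewrite above relies on `str.replace`, which would turn an already converted ID into `gakkai-e-e-...` if the function were run twice on the same data. A slightly more defensive variant could look like the sketch below; `to_ntcir2_id` is a hypothetical helper and the sample IDs are made up.

```python
def to_ntcir2_id(doc_id):
    """Rewrite an NTCIR-1 style ID ('gakkai-...') to the NTCIR-2 style ('gakkai-e-...')."""
    if doc_id.startswith('gakkai-') and not doc_id.startswith('gakkai-e-'):
        return 'gakkai-e-' + doc_id[len('gakkai-'):]
    # Already converted, or not a gakkai ID: leave unchanged.
    return doc_id

converted = [
    to_ntcir2_id(d)
    for d in ['gakkai-0000000001', 'gakkai-e-0000000002', 'kaken-e-0000000003']
]
print(converted)
```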


In [16]:
def concatenate_files(file_path1, file_path2, output_file_path):
    with open(file_path1, 'r', encoding='utf-8') as file1:
        data1 = file1.read()
        
    with open(file_path2, 'r', encoding='utf-8') as file2:
        data2 = file2.read()
    
    with open(output_file_path, 'w', encoding='utf-8') as outfile:
        outfile.write(data1)
        outfile.write(data2)

In [17]:
concatenate_files(NTCIR2 + '/e-docs/ntc2-e1g.utf8.jsonl', NTCIR2 + '/e-docs/ntc2-e1k.utf8.jsonl', NTCIR2 + '/e-docs/ntc2-e1.utf8.jsonl')

In [18]:
concatenate_files(NTCIR2 + '/e-docs/ntc1-e1.utf8.mod.jsonl', NTCIR2 + '/e-docs/ntc2-e1.utf8.jsonl', NTCIR2 + '/e-docs/ntc12-e1gk.utf8.mod.jsonl')

In [19]:
os.listdir(NTCIR2 + '/e-docs')

['ntc1-e1.utf8.mod.jsonl',
 'ntc12-e1gk.utf8.mod.jsonl',
 'ntc2-e1.utf8.jsonl',
 'ntc2-e1g',
 'ntc2-e1g.utf8',
 'ntc2-e1g.utf8.jsonl',
 'ntc2-e1k',
 'ntc2-e1k.utf8',
 'ntc2-e1k.utf8.jsonl']
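
`concatenate_files` reads both inputs fully into memory before writing. For larger corpora, a streaming copy may be preferable. The sketch below uses `shutil.copyfileobj`; `concatenate_files_streaming` is a hypothetical alternative and the temporary files exist only for the demonstration.

```python
import os
import shutil
import tempfile

def concatenate_files_streaming(input_paths, output_path):
    """Append each input file to output_path in order, copying in chunks."""
    with open(output_path, 'wb') as outfile:
        for path in input_paths:
            with open(path, 'rb') as infile:
                shutil.copyfileobj(infile, outfile)

# Demonstration with two tiny throwaway files.
with tempfile.TemporaryDirectory() as tmp_dir:
    part_a = os.path.join(tmp_dir, 'a.jsonl')
    part_b = os.path.join(tmp_dir, 'b.jsonl')
    merged_path = os.path.join(tmp_dir, 'merged.jsonl')
    with open(part_a, 'w', encoding='utf-8', newline='') as f:
        f.write('{"doc_id": "a"}\n')
    with open(part_b, 'w', encoding='utf-8', newline='') as f:
        f.write('{"doc_id": "b"}\n')
    concatenate_files_streaming([part_a, part_b], merged_path)
    with open(merged_path, 'r', encoding='utf-8', newline='') as f:
        merged = f.read()

print(merged)
```

Copying in binary mode is safe here because the inputs are already UTF-8 and each ends with a newline.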

### Topic files

In [20]:
tarfile.open(NTCIR2 + '/topics.tgz').extractall(path=NTCIR2)

In [21]:
os.listdir(NTCIR2 + '/topics')

['topic-e0101-0149', 'topic-j0101-0149']

In [22]:
convert_encoding(NTCIR2 + '/topics/topic-j0101-0149', NTCIR2 + '/topics/topic-j0101-0149.utf8', 'euc_jp', 'utf-8')

In [23]:
os.listdir(NTCIR2 + '/topics')

['topic-e0101-0149', 'topic-j0101-0149', 'topic-j0101-0149.utf8']

In [24]:
import json
import re
def topics_jsonl(in_file):
    out_file = in_file + '.jsonl'
    with open(in_file, 'r', encoding='utf-8') as f:
        s = f.read()
        qid = re.findall(r'<TOPIC q=([^>]+)>', s)
        title = re.findall(r'<TITLE>\n(.*)\n</TITLE>', s)
        desc = re.findall(r'<DESCRIPTION>\n(.*)\n</DESCRIPTION>', s)
    with open(out_file, 'w', encoding='utf-8') as f:
        for i in range(len(qid)):
            # json.dumps escapes quotes and backslashes that would break hand-built JSON
            record = {'query_id': qid[i], 'text': title[i], 'description': desc[i]}
            f.write(json.dumps(record, ensure_ascii=False) + '\n')

In [25]:
topics_jsonl(NTCIR2 + '/topics/topic-j0101-0149.utf8')

In [26]:
os.listdir(NTCIR2 + '/topics')

['topic-e0101-0149',
 'topic-j0101-0149',
 'topic-j0101-0149.utf8',
 'topic-j0101-0149.utf8.jsonl']
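
Topic text can contain characters such as double quotes, which must be escaped in JSON. The sketch below runs the same three patterns on a hypothetical topic (the content is made up; the layout is inferred from the patterns in `topics_jsonl`) and serializes the result with `json.dumps`, then parses it back to confirm that the line is valid JSON.

```python
import json
import re

# Hypothetical topic; the tag layout is inferred from the patterns above.
sample = (
    '<TOPIC q=0101>\n'
    '<TITLE>\n'
    'A "quoted" title\n'
    '</TITLE>\n'
    '<DESCRIPTION>\n'
    'A one-line description\n'
    '</DESCRIPTION>\n'
    '</TOPIC>\n'
)

qids = re.findall(r'<TOPIC q=([^>]+)>', sample)
titles = re.findall(r'<TITLE>\n(.*)\n</TITLE>', sample)
descs = re.findall(r'<DESCRIPTION>\n(.*)\n</DESCRIPTION>', sample)

records = [
    json.dumps({'query_id': q, 'text': t, 'description': d}, ensure_ascii=False)
    for q, t, d in zip(qids, titles, descs)
]
# Round-trip the first line to confirm it is well-formed JSON.
parsed = json.loads(records[0])
print(records[0])
```

Note that `.` does not match a newline, so these patterns only capture titles and descriptions that fit on a single line.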

### Qrel files
- This test collection provides graded relevance judgments (A: Relevant, B: Partially Relevant, C: Not Relevant).
- We convert the grades to numeric scores as follows.
    - A: 2
    - B: 1
    - C: 0
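
The mapping above can be sketched as a lookup table applied to one judgment line. The sample line is hypothetical; the column order (topic ID, grade, document ID) is taken from how the conversion code in this section indexes the fields, and the output follows the TREC-style four-column qrels layout. `to_trec_qrel` is a hypothetical helper name.

```python
GRADE_TO_SCORE = {'A': 2, 'B': 1, 'C': 0}

def to_trec_qrel(line):
    """Convert 'topic<TAB>grade<TAB>doc_id' to 'topic<TAB>Q0<TAB>doc_id<TAB>score'.

    Returns None for malformed lines or grades outside the mapping.
    """
    flds = line.rstrip().split('\t')
    if len(flds) < 3:
        return None
    score = GRADE_TO_SCORE.get(flds[1])
    if score is None:
        return None
    return f'{flds[0]}\tQ0\t{flds[2]}\t{score}'

converted_line = to_trec_qrel('0101\tA\tgakkai-e-0000000001\n')
print(converted_line)
```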

In [27]:
tarfile.open(NTCIR2 + '/rels.tgz').extractall(path=NTCIR2)

In [28]:
os.listdir(NTCIR2 + '/rels')

['rel1_ntc2-e2_0101-0149',
 'rel1_ntc2-e2_0101-0149.nc',
 'rel1_ntc2-j2_0101-0149',
 'rel1_ntc2-j2_0101-0149.nc',
 'rel1_ntc2-je2_0101-0149',
 'rel1_ntc2-je2_0101-0149.nc',
 'rel2_ntc2-e2_0101-0149',
 'rel2_ntc2-e2_0101-0149.nc',
 'rel2_ntc2-j2_0101-0149',
 'rel2_ntc2-j2_0101-0149.nc',
 'rel2_ntc2-je2_0101-0149',
 'rel2_ntc2-je2_0101-0149.nc']

In [29]:
convert_encoding(NTCIR2 + '/rels/rel2_ntc2-je2_0101-0149.nc', NTCIR2 + '/rels/rel2_ntc2-je2_0101-0149.nc.utf8', 'ascii', 'utf-8')

In [30]:
os.listdir(NTCIR2 + '/rels')

['rel1_ntc2-e2_0101-0149',
 'rel1_ntc2-e2_0101-0149.nc',
 'rel1_ntc2-j2_0101-0149',
 'rel1_ntc2-j2_0101-0149.nc',
 'rel1_ntc2-je2_0101-0149',
 'rel1_ntc2-je2_0101-0149.nc',
 'rel2_ntc2-e2_0101-0149',
 'rel2_ntc2-e2_0101-0149.nc',
 'rel2_ntc2-j2_0101-0149',
 'rel2_ntc2-j2_0101-0149.nc',
 'rel2_ntc2-je2_0101-0149',
 'rel2_ntc2-je2_0101-0149.nc',
 'rel2_ntc2-je2_0101-0149.nc.utf8']

In [31]:
def qrel_graded_tsv(in_file):
    # Map letter grades to numeric relevance scores; lines with other grades are skipped
    grades = {'A': 2, 'B': 1, 'C': 0}
    out_file = in_file + '.tsv'
    with open(in_file, 'r', encoding='utf-8') as f, open(out_file, 'w', encoding='utf-8') as f2:
        for line in f:
            # Input columns: topic ID, grade, document ID
            flds = line.rstrip().split('\t')
            if flds[1] in grades:
                f2.write(f'{flds[0]}\tQ0\t{flds[2]}\t{grades[flds[1]]}\n')

In [32]:
qrel_graded_tsv(NTCIR2 + '/rels/rel2_ntc2-je2_0101-0149.nc.utf8')

In [33]:
os.listdir(NTCIR2 + '/rels')

['rel1_ntc2-e2_0101-0149',
 'rel1_ntc2-e2_0101-0149.nc',
 'rel1_ntc2-j2_0101-0149',
 'rel1_ntc2-j2_0101-0149.nc',
 'rel1_ntc2-je2_0101-0149',
 'rel1_ntc2-je2_0101-0149.nc',
 'rel2_ntc2-e2_0101-0149',
 'rel2_ntc2-e2_0101-0149.nc',
 'rel2_ntc2-j2_0101-0149',
 'rel2_ntc2-j2_0101-0149.nc',
 'rel2_ntc2-je2_0101-0149',
 'rel2_ntc2-je2_0101-0149.nc',
 'rel2_ntc2-je2_0101-0149.nc.utf8',
 'rel2_ntc2-je2_0101-0149.nc.utf8.tsv']
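
Since the judgments were made on the merged NTCIR-1 and NTCIR-2 collection, every document ID in the qrels should also appear in the merged corpus file. The following sketch shows one way to check this; `missing_qrel_docs` is a hypothetical helper and the in-memory samples are made up.

```python
import io
import json

def missing_qrel_docs(corpus_lines, qrel_lines):
    """Return sorted doc_ids that appear in the qrels but not in the corpus."""
    doc_ids = {json.loads(line)['doc_id'] for line in corpus_lines}
    missing = set()
    for line in qrel_lines:
        flds = line.rstrip('\n').split('\t')
        # Columns of the converted TSV: topic ID, Q0, document ID, score
        if flds[2] not in doc_ids:
            missing.add(flds[2])
    return sorted(missing)

corpus = io.StringIO(
    '{"doc_id": "gakkai-e-0000000001", "text": "a"}\n'
    '{"doc_id": "kaken-e-0000000002", "text": "b"}\n'
)
qrels = io.StringIO(
    '0101\tQ0\tgakkai-e-0000000001\t2\n'
    '0101\tQ0\tgakkai-0000000003\t1\n'
)
missing = missing_qrel_docs(corpus, qrels)
print(missing)
```

An ID in the old `gakkai-` form showing up in the result would suggest that the NTCIR-1 ID conversion was skipped for some documents.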

### Register with the ir_datasets module locally

- Dataset name: `ntcir-transfer`
- subset: `2/valid`

#### Location of dataset files

- `../datasets/ntcir-transfer.yaml`
- `../datasets/ntcir_transfer.py`

In [35]:
import sys
!{sys.executable} -m pip install -q ir_datasets pandas

In [36]:
# Notebooks do not define __file__, so resolve the datasets directory from the working directory
sys.path.append(os.path.join(os.getcwd(), '../datasets'))

In [37]:
import ir_datasets
import ntcir_transfer
dataset = ir_datasets.load('ntcir-transfer/2/valid')

In [38]:
dataset.docs_cls().__annotations__

{'doc_id': str, 'text': str}

In [None]:
docstore = dataset.docs_store()
docstore.get('kaken-e-2469487463').text # the one in the overview paper

In [40]:
dataset.queries_cls().__annotations__

{'query_id': str, 'text': str}

In [None]:
import pandas as pd
pd.DataFrame(dataset.queries_iter())

In [42]:
dataset.qrels_defs()

{2: 'relevant', 1: 'partially relevant', 0: 'not relevant'}

In [None]:
pd.DataFrame(dataset.qrels_iter())