# Experiment with Multilingual LMs on the Transfer-2 DCLR Subtask (Train Set)

- Input: Japanese queries (some include English words)
- Output: English documents

## Previous Step

- `preprocess-transfer2-train.ipynb`

## Requirements

- Java v21
- Maven 3.6.3
- Python 3.10+

In [1]:
import os
# Change JAVA_HOME and MAVEN_HOME to fit your environment
JAVA_HOME = '/path/to/your/java'
os.environ['JAVA_HOME'] = JAVA_HOME
MAVEN_HOME = '/path/to/your/maven'
os.environ['MAVEN_HOME'] = MAVEN_HOME
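
A quick sanity check can catch a wrong `JAVA_HOME` or `MAVEN_HOME` before the longer build steps. The sketch below assumes the usual layout in which each home directory contains `bin/java` or `bin/mvn`; `missing_tools` is a hypothetical helper, not part of this notebook.

```python
import os
from pathlib import Path

# Assumed layout: each *_HOME contains bin/<tool>. Adjust if yours differs.
EXPECTED = {"JAVA_HOME": "java", "MAVEN_HOME": "mvn"}

def missing_tools(env=os.environ, expected=EXPECTED):
    """Return the env var names whose bin/<tool> file cannot be found."""
    missing = []
    for var, tool in expected.items():
        home = env.get(var)
        if not home or not (Path(home) / "bin" / tool).is_file():
            missing.append(var)
    return missing

# An empty list means both tools were found where expected.
print(missing_tools())
```

If the list is not empty, fix the corresponding path in the cell above before continuing.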

## Reference Spec

- Azure VM: `Standard-NC4as-T4-v3`
    - 4 cores, 28 GB RAM, 176 GB disk
    - NVIDIA Tesla T4 (16 GB)
    - $0.73/hr (when running)


## GPU Check

In [2]:
!nvidia-smi

Thu Aug  1 11:33:54 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.171.04             Driver Version: 535.171.04   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  Tesla T4                       On  | 00000001:00:00.0 Off |                  Off |
| N/A   30C    P8               9W /  70W |      2MiB / 16384MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                    

In [3]:
import torch
print(torch.cuda.is_available())
print(torch.cuda.device_count())
for i in range(torch.cuda.device_count()):
    print(torch.cuda.get_device_name(i))

True
1
Tesla T4


## Path

In [2]:
import os
# Set your ROOT folder
os.environ['ROOT'] = '/path/to/your/transfer2/DCLR'
os.environ['DATASET'] = os.getenv('ROOT') + '/datasets'
os.environ['INDEX'] = os.getenv('ROOT') + '/indexes/ntcir18-transfer/train/pyserini'
os.environ['RUN'] = os.getenv('ROOT') + '/runs/ntcir18-transfer/train'
os.environ['VENDOR'] = os.getenv('ROOT') + '/vendors'
os.environ['TC'] = os.getenv('ROOT') + '/testcollections/ntcir/NTCIR-1'
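
The same layout can be derived from a single root, which makes it easier to see what has to exist on disk. This is only a sketch: `derive_paths` is a hypothetical helper, and it uses POSIX-style paths because the notebook runs on Linux.

```python
from pathlib import PurePosixPath

def derive_paths(root):
    """Build the directory layout used in this notebook from one root."""
    root = PurePosixPath(root)
    return {
        "ROOT": root,
        "DATASET": root / "datasets",
        "INDEX": root / "indexes" / "ntcir18-transfer" / "train" / "pyserini",
        "RUN": root / "runs" / "ntcir18-transfer" / "train",
        "VENDOR": root / "vendors",
        "TC": root / "testcollections" / "ntcir" / "NTCIR-1",
    }

paths = derive_paths("/path/to/your/transfer2/DCLR")
for name, path in paths.items():
    print(f"{name}={path}")
```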

## Install software

### Lucene

In [11]:
!mkdir -p $VENDOR

In [12]:
!wget -O $VENDOR/lucene-9.11.1.tgz "https://dlcdn.apache.org/lucene/java/9.11.1/lucene-9.11.1.tgz"

--2024-07-31 13:00:06--  https://dlcdn.apache.org/lucene/java/9.11.1/lucene-9.11.1.tgz
Resolving dlcdn.apache.org (dlcdn.apache.org)... 151.101.2.132, 2a04:4e42::644
Connecting to dlcdn.apache.org (dlcdn.apache.org)|151.101.2.132|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 79610139 (76M) [application/x-gzip]
Saving to: ‘/home/azureuser/cloudfiles/code/Users/joho.hideo.gb/transfer2/DCLR/vendors/lucene-9.11.1.tgz’


2024-07-31 13:00:15 (291 MB/s) - ‘/home/azureuser/cloudfiles/code/Users/joho.hideo.gb/transfer2/DCLR/vendors/lucene-9.11.1.tgz’ saved [79610139/79610139]



In [13]:
!tar xfz $VENDOR/lucene-9.11.1.tgz -C $VENDOR

In [6]:
os.environ['LUCENE_HOME'] = os.getenv('VENDOR') + '/lucene-9.11.1'
os.environ['CLASSPATH'] = os.getenv('LUCENE_HOME') + '/modules/*'

### Anserini

In [28]:
!git clone --recurse-submodules https://github.com/castorini/anserini.git $VENDOR/anserini

Cloning into '/home/azureuser/cloudfiles/code/Users/joho.hideo.gb/transfer2/DCLR/vendors/anserini'...
remote: Enumerating objects: 41275, done.
remote: Counting objects: 100% (7806/7806), done.
remote: Compressing objects: 100% (614/614), done.
remote: Total 41275 (delta 7613), reused 7225 (delta 7188), pack-reused 33469
Receiving objects: 100% (41275/41275), 91.30 MiB | 18.81 MiB/s, done.
Resolving deltas: 100% (30682/30682), done.
Updating files: 100% (3139/3139), done.
Submodule 'tools' (https://github.com/castorini/anserini-tools.git) registered for path 'tools'
Cloning into '/mnt/batch/tasks/shared/LS_root/mounts/clusters/standardnc4ast4v3/code/Users/joho.hideo.gb/transfer2/DCLR/vendors/anserini/tools'...
remote: Enumerating objects: 991, done.        
remote: Counting objects: 100% (748/748), done.        
remote: Compressing objects: 100% (630/630), done.        
remote: Total 991 (delta 141), reused 717 (delta 117), pack-reused 243        
Receiving objects: 100% (9

In [31]:
# -T 2C tells mvn to use two threads per CPU core. Change it based on your environment.
!cd $VENDOR/anserini && mvn clean package -q -DskipTests -Dmaven.javadoc.skip=true -T 2C

In [36]:
# The trailing /* lets Java pick up the JARs that Maven builds under target/
os.environ['ANSERINI_CLASSPATH'] = os.getenv('VENDOR') + '/anserini/target/*'
os.environ['CLASSPATH'] = os.getenv('ANSERINI_CLASSPATH') + ':' + os.getenv('CLASSPATH')
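
Re-running the cell above prepends the Anserini entry to `CLASSPATH` again each time. One way to keep the value stable is to join entries while dropping duplicates. The sketch below is an illustration only; `join_classpath` is a hypothetical helper.

```python
import os

def join_classpath(*entries, sep=os.pathsep):
    """Join non-empty classpath entries, dropping duplicates but keeping order."""
    seen = []
    for entry in entries:
        for part in entry.split(sep):
            if part and part not in seen:
                seen.append(part)
    return sep.join(seen)

# Example with placeholder paths; the second argument already contains the first.
print(join_classpath("/vendors/anserini/target/*",
                     "/vendors/anserini/target/*:/vendors/lucene-9.11.1/modules/*",
                     sep=":"))
```

It could be used as `os.environ['CLASSPATH'] = join_classpath(os.getenv('ANSERINI_CLASSPATH'), os.getenv('CLASSPATH', ''))`.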

### Pyserini

In [9]:
import sys
!{sys.executable} -m pip install -q pyserini

### PyTerrier

In [10]:
import sys
!{sys.executable} -m pip install -q python-terrier

## Dataset

In [11]:
import sys
!{sys.executable} -m pip install -q ir_datasets

In [8]:
import sys
sys.path.append(os.getenv('DATASET'))

In [9]:
import ir_datasets
import ntcir_transfer
dataset = ir_datasets.load('ntcir-transfer/2/train')

## Experiments

### BM25

- Japanese queries are run against the English collection (without translation)

#### Indexing

In [12]:
!ls $TC/clir/

ntc1-e1			rel2_ntc1-e1_0001-0030.utf8
ntc1-e1.utf8		rel2_ntc1-e1_0001-0030.utf8.tsv
ntc1-e1.utf8.jsonl	rel2_ntc1-e1_0001-0083.utf8.tsv
pyserini		rel2_ntc1-e1_0031-0083
rel1_ntc1-e1_0001-0030	rel2_ntc1-e1_0031-0083.utf8
rel1_ntc1-e1_0031-0083	rel2_ntc1-e1_0031-0083.utf8.tsv
rel2_ntc1-e1_0001-0030


In [29]:
!mkdir -p $TC/clir/pyserini/jsonl && \
 cat $TC/clir/ntc1-e1.utf8.jsonl | \
 sed 's/"doc_id":/"id":/' | \
 sed 's/"text":/"contents":/' > $TC/clir/pyserini/jsonl/ntc1-e1.utf8.jsonl
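
The `sed` commands above rewrite the first textual match on each line. Parsing each line as JSON renames only the top-level keys, which is a little safer if the key order ever changes. A minimal sketch, assuming every line is one JSON object with `doc_id` and `text` keys; `rename_keys` and the sample record are made up for illustration.

```python
import json

# Pyserini's JsonCollection expects "id" and "contents".
KEY_MAP = {"doc_id": "id", "text": "contents"}

def rename_keys(line, key_map=KEY_MAP):
    """Rename top-level keys of one JSONL record; other keys pass through unchanged."""
    record = json.loads(line)
    renamed = {key_map.get(key, key): value for key, value in record.items()}
    # ensure_ascii=False keeps non-ASCII characters readable in the output file.
    return json.dumps(renamed, ensure_ascii=False)

sample = '{"doc_id": "d1", "text": "An example abstract."}'  # made-up record
print(rename_keys(sample))
```

Applied to a file, this would be one `rename_keys(line)` call per input line, written to `$TC/clir/pyserini/jsonl/ntc1-e1.utf8.jsonl`.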

In [None]:
!head -1 $TC/clir/pyserini/jsonl/ntc1-e1.utf8.jsonl

In [31]:
!java -cp $CLASSPATH io.anserini.index.IndexCollection \
  -collection JsonCollection \
  -input $TC/clir/pyserini  \
  -language en \
  -index $INDEX/sparse \
  -generator DefaultLuceneDocumentGenerator \
  -threads 2 \
  -storePositions -storeDocvectors -storeRaw

2024-07-31 13:16:10,056 INFO  [main] index.AbstractIndexer (AbstractIndexer.java:204) - Setting log level to INFO
2024-07-31 13:16:10,059 INFO  [main] index.AbstractIndexer (AbstractIndexer.java:208) - AbstractIndexer settings:
2024-07-31 13:16:10,059 INFO  [main] index.AbstractIndexer (AbstractIndexer.java:209) -  + DocumentCollection path: /home/azureuser/cloudfiles/code/Users/joho.hideo.gb/transfer2/DCLR/testcollections/ntcir/NTCIR-1/clir/pyserini
2024-07-31 13:16:10,059 INFO  [main] index.AbstractIndexer (AbstractIndexer.java:210) -  + CollectionClass: JsonCollection
2024-07-31 13:16:10,060 INFO  [main] index.AbstractIndexer (AbstractIndexer.java:211) -  + Index path: /home/azureuser/cloudfiles/code/Users/joho.hideo.gb/transfer2/DCLR/indexes/ntcir18-transfer/train/pyserini/sparse
2024-07-31 13:16:10,060 INFO  [main] index.AbstractIndexer (AbstractIndexer.java:212) -  + Threads: 2
2024-07-31 13:16:10,060 INFO  [main] index.AbstractIndexer (AbstractIndexer.java:213) -  + Optimize (me

#### Retrieval

In [None]:
!cat $TC/topics/topic0001-0083.utf8.jsonl | \
 sed -r 's/.*: "(.*)", "text": "(.*)", "description": "(.*)".*/\1\t\2/g' > $TC/topics/topic0001-0083-ja.utf8.tsv
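
The regular expression above depends on the exact key order and quoting of the topic file. A JSON-based conversion is an alternative. The sketch below assumes the id key is called `query_id` (the `sed` pattern does not show the actual key name, so treat this as a guess and adjust it); `topic_to_tsv` and the sample line are hypothetical.

```python
import json

def topic_to_tsv(line, id_key="query_id", text_key="text"):
    """Turn one JSONL topic into an 'id<TAB>query' line."""
    topic = json.loads(line)
    # Tabs or newlines inside the query would break the TSV, so collapse whitespace.
    query = " ".join(str(topic[text_key]).split())
    return f"{topic[id_key]}\t{query}"

# Made-up topic for illustration only.
sample = '{"query_id": "0001", "text": "example  query", "description": "..."}'
print(topic_to_tsv(sample))
```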

In [3]:
!ls $TC/topics

topic0001-0030		    topic0001-0083.utf8.jsonl
topic0001-0030.utf8	    topic0031-0083
topic0001-0030.utf8.jsonl   topic0031-0083.utf8
topic0001-0083-en.utf8.tsv  topic0031-0083.utf8.jsonl
topic0001-0083-ja.utf8.tsv


In [None]:
!head $TC/topics/topic0001-0083-ja.utf8.tsv

In [38]:
!!mkdir -p $RUN && \
  java -cp $CLASSPATH io.anserini.search.SearchCollection \
  -index $INDEX/sparse \
  -topicReader TsvString \
  -topics $TC/topics/topic0001-0083-ja.utf8.tsv \
  -output $RUN/MyRun-BM25-pyserini.res \
  -language ja \
  -bm25

['/bin/bash: /anaconda/envs/myenv/lib/libtinfo.so.6: no version information available (required by /bin/bash)',
 '2024-08-01 17:12:02,165 INFO  [main] search.SearchCollection (SearchCollection.java:1009) - Index: /home/azureuser/cloudfiles/code/Users/joho.hideo.gb/transfer2/DCLR/indexes/ntcir18-transfer/train/pyserini/sparse',
 'Aug 01, 2024 5:12:02 PM org.apache.lucene.store.MemorySegmentIndexInputProvider <init>',
 'INFO: Using MemorySegmentIndexInput with Java 21; to disable start with -Dorg.apache.lucene.store.MMapDirectory.enableMemorySegments=false',
 '2024-08-01 17:12:04,453 INFO  [main] search.SearchCollection (SearchCollection.java:1012) - Threads: 4',
 '2024-08-01 17:12:04,454 INFO  [main] search.SearchCollection (SearchCollection.java:1013) - Fields: []',
 '2024-08-01 17:12:04,454 INFO  [main] search.SearchCollection (SearchCollection.java:1027) - MaxPassage: false',
 '2024-08-01 17:12:04,454 INFO  [main] search.SearchCollection (SearchCollection.java:1032) - Hits: 1000',
 '

### mDPR

- https://castorini.github.io/pyserini/2cr/miracl.html
- https://huggingface.co/castorini/mdpr-tied-pft-msmarco

#### Encoding

In [23]:
%%time
!{sys.executable} -m pyserini.encode \
  input   --corpus $TC/clir/pyserini/jsonl/ntc1-e1.utf8.jsonl \
          --fields text \
          --delimiter "\n" \
          --shard-id 0 \
          --shard-num 1 \
  output  --embeddings $TC/clir/pyserini/embedding-mdpr \
          --to-faiss \
  encoder --encoder castorini/mdpr-tied-pft-msmarco \
          --encoder-class auto \
          --fields text \
          --batch-size 32 \
          --max-length 512 \
          --device cuda:0 \
          --fp16

/bin/bash: /anaconda/envs/myenv/lib/libtinfo.so.6: no version information available (required by /bin/bash)
187080it [00:01, 116045.98it/s]
100%|███████████████████████████████████████| 5847/5847 [51:22<00:00,  1.90it/s]
CPU times: user 21 s, sys: 5.1 s, total: 26.1 s
Wall time: 51min 33s
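
The numbers in this log are consistent with each other, which is a useful check before moving on. The arithmetic below uses only values from the logs in this notebook (187080 documents, batch size 32, 51:22 on the progress bar, 768-dimensional vectors reported by the Faiss indexer). The 4 bytes per value is an assumption about float32 storage.

```python
import math

num_docs = 187080            # documents counted by the encoder
batch_size = 32              # --batch-size
wall_seconds = 51 * 60 + 22  # 51:22 from the progress bar
dim = 768                    # vector width reported by the Faiss indexing log

num_batches = math.ceil(num_docs / batch_size)
batches_per_second = num_batches / wall_seconds
docs_per_second = num_docs / wall_seconds
float32_bytes = num_docs * dim * 4  # assumes 4-byte floats

print(num_batches, round(batches_per_second, 2), round(docs_per_second, 1))
print(round(float32_bytes / 2**20), "MiB of raw vectors")
```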


#### Indexing

In [26]:
%%time
!{sys.executable} -m pyserini.index.faiss \
  --input $TC/clir/pyserini/embedding-mdpr \
  --output $INDEX/dense/mdpr/ \
  --hnsw \
  --pq

/bin/bash: /anaconda/envs/myenv/lib/libtinfo.so.6: no version information available (required by /bin/bash)
Vector Shape: (187080, 768)
hnsw_add_vertices: adding 187080 elements on top of 0 (preset_levels=0)
  max_level = 2
Adding 2 elements at level 2
Adding 696 elements at level 1
Adding 186382 elements at level 0
Done in 343522.081 ms
Number of indexed vectors: 187080
CPU times: user 3.84 s, sys: 530 ms, total: 4.37 s
Wall time: 6min 11s


#### Retrieval

In [42]:
%%time
!{sys.executable} -m pyserini.search.faiss \
  --index $INDEX/dense/mdpr \
  --topics $TC/topics/topic0001-0083-ja.utf8.tsv \
  --output $RUN/MyRun-mdpr-pyserini.res \
  --encoder-class auto \
  --encoder castorini/mdpr-tied-pft-msmarco \
  --tokenizer castorini/mdpr-tied-pft-msmarco \
  --device cuda:0 \
  --batch-size 32

/bin/bash: /anaconda/envs/myenv/lib/libtinfo.so.6: no version information available (required by /bin/bash)
Running /home/azureuser/cloudfiles/code/Users/joho.hideo.gb/transfer2/DCLR/testcollections/ntcir/NTCIR-1/topics/topic0001-0083-ja.utf8.tsv topics, saving to /home/azureuser/cloudfiles/code/Users/joho.hideo.gb/transfer2/DCLR/runs/ntcir18-transfer/train/MyRun-mdpr-pyserini.res...
  0%|                                                    | 0/83 [00:00<?, ?it/s]Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.
100%|███████████████████████████████████████████| 83/83 [00:01<00:00, 58.47it/s]
CPU times: user 108 ms, sys: 39.3 ms, total: 147 ms
Wall time: 13.6 s
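
To inspect a run before evaluation, the `.res` file can be read directly. The sketch below assumes the six-column TREC run format (`qid Q0 docid rank score tag`); `read_trec_run` is a hypothetical helper and the sample lines are made up.

```python
from collections import defaultdict

def read_trec_run(lines):
    """Parse 'qid Q0 docid rank score tag' lines into {qid: [(docid, score), ...]}."""
    run = defaultdict(list)
    for line in lines:
        parts = line.split()
        if len(parts) != 6:
            continue  # skip blank or malformed lines
        qid, _q0, docid, _rank, score, _tag = parts
        run[qid].append((docid, float(score)))
    # Order each query's hits by descending score.
    for hits in run.values():
        hits.sort(key=lambda hit: hit[1], reverse=True)
    return dict(run)

sample = [
    "1 Q0 docA 1 12.5 demo",
    "1 Q0 docB 2 11.0 demo",
    "2 Q0 docC 1 9.75 demo",
]
print(read_trec_run(sample))
```

With a real file, pass an open file handle, for example `read_trec_run(open(path, encoding='utf-8'))`.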


### mDPR (En)

- https://castorini.github.io/pyserini/2cr/miracl.html
- https://huggingface.co/castorini/mdpr-tied-pft-msmarco-ft-miracl-en

#### Encoding

In [24]:
%%time
!{sys.executable} -m pyserini.encode \
  input   --corpus $TC/clir/pyserini/jsonl/ntc1-e1.utf8.jsonl \
          --fields text \
          --delimiter "\n" \
          --shard-id 0 \
          --shard-num 1 \
  output  --embeddings $TC/clir/pyserini/embedding-mdpr-en \
          --to-faiss \
  encoder --encoder castorini/mdpr-tied-pft-msmarco-ft-miracl-en \
          --encoder-class auto \
          --fields text \
          --batch-size 32 \
          --max-length 512 \
          --device cuda:0 \
          --fp16

/bin/bash: /anaconda/envs/myenv/lib/libtinfo.so.6: no version information available (required by /bin/bash)
187080it [00:01, 106912.78it/s]
100%|███████████████████████████████████████| 5847/5847 [51:40<00:00,  1.89it/s]
CPU times: user 21.2 s, sys: 5.38 s, total: 26.5 s
Wall time: 51min 51s


#### Indexing

In [27]:
%%time
!{sys.executable} -m pyserini.index.faiss \
  --input $TC/clir/pyserini/embedding-mdpr-en \
  --output $INDEX/dense/mdpr-en/ \
  --hnsw \
  --pq

/bin/bash: /anaconda/envs/myenv/lib/libtinfo.so.6: no version information available (required by /bin/bash)
Vector Shape: (187080, 768)
hnsw_add_vertices: adding 187080 elements on top of 0 (preset_levels=0)
  max_level = 2
Adding 2 elements at level 2
Adding 696 elements at level 1
Adding 186382 elements at level 0
Done in 307654.108 ms
Number of indexed vectors: 187080
CPU times: user 3.39 s, sys: 556 ms, total: 3.95 s
Wall time: 5min 35s


#### Retrieval

In [43]:
%%time
!{sys.executable} -m pyserini.search.faiss \
  --index $INDEX/dense/mdpr-en \
  --topics $TC/topics/topic0001-0083-ja.utf8.tsv \
  --output $RUN/MyRun-mdpr-en-pyserini.res \
  --encoder-class auto \
  --encoder castorini/mdpr-tied-pft-msmarco-ft-miracl-en \
  --tokenizer castorini/mdpr-tied-pft-msmarco-ft-miracl-en \
  --device cuda:0 \
  --batch-size 32

/bin/bash: /anaconda/envs/myenv/lib/libtinfo.so.6: no version information available (required by /bin/bash)
Running /home/azureuser/cloudfiles/code/Users/joho.hideo.gb/transfer2/DCLR/testcollections/ntcir/NTCIR-1/topics/topic0001-0083-ja.utf8.tsv topics, saving to /home/azureuser/cloudfiles/code/Users/joho.hideo.gb/transfer2/DCLR/runs/ntcir18-transfer/train/MyRun-mdpr-en-pyserini.res...
100%|███████████████████████████████████████████| 83/83 [00:01<00:00, 58.84it/s]
CPU times: user 102 ms, sys: 40.5 ms, total: 142 ms
Wall time: 13.2 s


### mDPR (ja)

- https://castorini.github.io/pyserini/2cr/miracl.html
- https://huggingface.co/castorini/mdpr-tied-pft-msmarco-ft-miracl-ja

#### Encoding

In [25]:
%%time
!{sys.executable} -m pyserini.encode \
  input   --corpus $TC/clir/pyserini/jsonl/ntc1-e1.utf8.jsonl \
          --fields text \
          --delimiter "\n" \
          --shard-id 0 \
          --shard-num 1 \
  output  --embeddings $TC/clir/pyserini/embedding-mdpr-ja \
          --to-faiss \
  encoder --encoder castorini/mdpr-tied-pft-msmarco-ft-miracl-ja \
          --encoder-class auto \
          --fields text \
          --batch-size 32 \
          --max-length 512 \
          --device cuda:0 \
          --fp16

/bin/bash: /anaconda/envs/myenv/lib/libtinfo.so.6: no version information available (required by /bin/bash)
187080it [00:01, 113487.55it/s]
100%|███████████████████████████████████████| 5847/5847 [51:41<00:00,  1.89it/s]
CPU times: user 20.6 s, sys: 5.6 s, total: 26.2 s
Wall time: 51min 52s


#### Indexing

In [28]:
%%time
!{sys.executable} -m pyserini.index.faiss \
  --input $TC/clir/pyserini/embedding-mdpr-ja \
  --output $INDEX/dense/mdpr-ja/ \
  --hnsw \
  --pq

/bin/bash: /anaconda/envs/myenv/lib/libtinfo.so.6: no version information available (required by /bin/bash)
Vector Shape: (187080, 768)
hnsw_add_vertices: adding 187080 elements on top of 0 (preset_levels=0)
  max_level = 2
Adding 2 elements at level 2
Adding 696 elements at level 1
Adding 186382 elements at level 0
Done in 343710.555 ms
Number of indexed vectors: 187080
CPU times: user 3.75 s, sys: 601 ms, total: 4.35 s
Wall time: 6min 10s


#### Retrieval

In [44]:
%%time
!{sys.executable} -m pyserini.search.faiss \
  --index $INDEX/dense/mdpr-ja \
  --topics $TC/topics/topic0001-0083-ja.utf8.tsv \
  --output $RUN/MyRun-mdpr-ja-pyserini.res \
  --encoder-class auto \
  --encoder castorini/mdpr-tied-pft-msmarco-ft-miracl-ja \
  --tokenizer castorini/mdpr-tied-pft-msmarco-ft-miracl-ja \
  --device cuda:0 \
  --batch-size 32

/bin/bash: /anaconda/envs/myenv/lib/libtinfo.so.6: no version information available (required by /bin/bash)
Running /home/azureuser/cloudfiles/code/Users/joho.hideo.gb/transfer2/DCLR/testcollections/ntcir/NTCIR-1/topics/topic0001-0083-ja.utf8.tsv topics, saving to /home/azureuser/cloudfiles/code/Users/joho.hideo.gb/transfer2/DCLR/runs/ntcir18-transfer/train/MyRun-mdpr-ja-pyserini.res...
100%|███████████████████████████████████████████| 83/83 [00:01<00:00, 53.90it/s]
CPU times: user 106 ms, sys: 39.7 ms, total: 145 ms
Wall time: 13.1 s


## Evaluation

In [45]:
import pandas as pd
import pyterrier as pt
if not pt.started():
  pt.init(tqdm='notebook')

In [46]:
BM25_DF = pt.io.read_results(os.getenv('RUN') + "/MyRun-BM25-pyserini.res")
MDPR_DF = pt.io.read_results(os.getenv('RUN') + "/MyRun-mdpr-pyserini.res")
MDPR_EN_DF = pt.io.read_results(os.getenv('RUN') + "/MyRun-mdpr-en-pyserini.res")
MDPR_JA_DF = pt.io.read_results(os.getenv('RUN') + "/MyRun-mdpr-ja-pyserini.res")
BM25_DF['qid'] = BM25_DF['qid'].str.zfill(4)
MDPR_DF['qid'] = MDPR_DF['qid'].str.zfill(4)
MDPR_EN_DF['qid'] = MDPR_EN_DF['qid'].str.zfill(4)
MDPR_JA_DF['qid'] = MDPR_JA_DF['qid'].str.zfill(4)
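
The `zfill(4)` calls pad the query ids so that they line up with the ids used by the topics and qrels. Presumably the run files carry unpadded ids such as `1` while the dataset uses `0001`; that is an assumption, not something the notebook states. A small illustration with made-up ids:

```python
def normalize_qid(qid, width=4):
    """Left-pad a query id with zeros, e.g. '1' -> '0001'."""
    return str(qid).zfill(width)

run_qids = ["1", "30", "83"]            # made-up ids as they might appear in a run file
qrels_qids = {"0001", "0030", "0083"}   # made-up ids as they might appear in the qrels

unmatched_before = [q for q in run_qids if q not in qrels_qids]
unmatched_after = [q for q in map(normalize_qid, run_qids) if q not in qrels_qids]
print(unmatched_before, unmatched_after)
```

If the ids do not match, the evaluation silently scores those queries as having no relevant results, so the padding step matters.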

In [47]:
dataset_pt = pt.get_dataset('irds:ntcir-transfer/2/train')

In [48]:
from pyterrier.measures import *
pt.Experiment(
    [BM25_DF,MDPR_DF,MDPR_EN_DF,MDPR_JA_DF],
    topics=dataset_pt.get_topics(),
    qrels=dataset_pt.get_qrels(),
    names=["BM25", "mDPR", "mDPR (En)", "mDPR (Ja)"],
    eval_metrics=[MRR, nDCG@10, nDCG],
    filter_by_qrels=True
)

|   | name | RR | nDCG@10 | nDCG |
|---|------|----|---------|------|
| 0 | BM25 | 0.054581 | 0.039214 | 0.053989 |
| 1 | mDPR | 0.206787 | 0.082869 | 0.171059 |
| 2 | mDPR (En) | 0.162232 | 0.088099 | 0.177437 |
| 3 | mDPR (Ja) | 0.135592 | 0.04818 | 0.140727 |
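
For readers unfamiliar with the metrics, the sketch below shows one common formulation of reciprocal rank and nDCG@k for a single query (linear gain, log2 discount). It is an illustration, not the code PyTerrier runs internally, and the function names are hypothetical. The reported scores are means of such per-query values.

```python
import math

def reciprocal_rank(ranked_ids, relevant):
    """1 / rank of the first relevant document, or 0.0 if none is retrieved."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0

def ndcg_at_k(ranked_ids, gains, k):
    """gains maps doc id -> graded relevance; missing ids count as 0."""
    dcg = sum(gains.get(doc_id, 0) / math.log2(i + 1)
              for i, doc_id in enumerate(ranked_ids[:k], start=1))
    ideal = sorted(gains.values(), reverse=True)[:k]
    idcg = sum(gain / math.log2(i + 1) for i, gain in enumerate(ideal, start=1))
    return dcg / idcg if idcg > 0 else 0.0

# Made-up ranking: the only relevant document appears at rank 2.
print(reciprocal_rank(["x", "a", "y"], {"a"}))
print(ndcg_at_k(["x", "a", "y"], {"a": 1}, k=10))
```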


## Where can we go from here?

- Try other multilingual models
- Fine-tuning
- etc.