# Experiment 2 on NTCIR-17 Transfer First Retrieval Task with Train Dataset

This notebook shows how to run sparse retrieval (BM25) and dense retrieval (ANN) on the train dataset of the NTCIR-17 Transfer Task.

## Previous Step

- `preprocess-transfer1-train.ipynb`

## Requirements

- Java v11
- Maven 3.3+

## Path

In [1]:
import os
os.environ['INDEX'] = '../indexes/ntcir17-transfer/train/pyserini'
os.environ['RUN'] = '../runs/ntcir17-transfer/train'
os.environ['VENDOR'] = '../vendors'
os.environ['TC'] = '../testcollections/ntcir/NTCIR-1'

## Install Lucene

In [7]:
!mkdir -p $VENDOR

In [8]:
!wget -O $VENDOR/lucene-9.7.0.tgz "https://dlcdn.apache.org/lucene/java/9.7.0/lucene-9.7.0.tgz"

--2023-06-27 11:51:52--  https://dlcdn.apache.org/lucene/java/9.7.0/lucene-9.7.0.tgz
Resolving dlcdn.apache.org (dlcdn.apache.org)... 151.101.2.132, 2a04:4e42::644
Connecting to dlcdn.apache.org (dlcdn.apache.org)|151.101.2.132|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 69162218 (66M) [application/x-gzip]
Saving to: ‘../vendors/lucene-9.7.0.tgz’


2023-06-27 11:51:58 (11.3 MB/s) - ‘../vendors/lucene-9.7.0.tgz’ saved [69162218/69162218]



In [9]:
!tar xvfz $VENDOR/lucene-9.7.0.tgz -C $VENDOR

lucene-9.7.0/
...
lucene-9.7.0/docs/misc/org/apache/lucene/misc/search/similarity/


In [None]:
#!rm -f $VENDOR/lucene-9.7.0.tgz

In [2]:
os.environ['LUCENE_HOME'] = os.getenv('VENDOR') + '/lucene-9.7.0'
os.environ['CLASSPATH'] = os.getenv('LUCENE_HOME') + '/modules/lucene-core-9.7.0.jar'
os.environ['CLASSPATH'] = os.getenv('CLASSPATH') + ':' + os.getenv('LUCENE_HOME') + '/modules/lucene-core-9.7.0.jar'
os.environ['CLASSPATH'] = os.getenv('CLASSPATH') + ':' + os.getenv('LUCENE_HOME') + '/modules/lucene-queryparser-9.7.0.jar'
os.environ['CLASSPATH'] = os.getenv('CLASSPATH') + ':' + os.getenv('LUCENE_HOME') + '/modules/lucene-analysis-common-9.7.0.jar'
os.environ['CLASSPATH'] = os.getenv('CLASSPATH') + ':' + os.getenv('LUCENE_HOME') + '/modules/lucene-demo-9.7.0.jar'

In [62]:
!echo $CLASSPATH

../vendors/lucene-9.7.0/modules/lucene-core-9.7.0.jar:../vendors/lucene-9.7.0/modules/lucene-core-9.7.0.jar:../vendors/lucene-9.7.0/modules/lucene-queryparser-9.7.0.jar:../vendors/lucene-9.7.0/modules/lucene-analysis-common-9.7.0.jar:../vendors/lucene-9.7.0/modules/lucene-demo-9.7.0.jar
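The classpath above is built by repeated string concatenation, and `lucene-core` ends up listed twice (harmless to the JVM, but easy to overlook). A minimal sketch of an alternative, assuming the same `modules/` layout and jar naming as the Lucene 9.7.0 tarball; the helper name `build_classpath` is hypothetical:

```python
import os

def build_classpath(lucene_home, version, modules):
    """Join Lucene module jars into one classpath string, skipping repeats."""
    unique_modules = []
    for module in modules:
        if module not in unique_modules:  # keep the first occurrence, preserve order
            unique_modules.append(module)
    jars = [f"{lucene_home}/modules/lucene-{m}-{version}.jar" for m in unique_modules]
    # os.pathsep is ':' on Linux/macOS and ';' on Windows
    return os.pathsep.join(jars)

classpath = build_classpath(
    "../vendors/lucene-9.7.0",
    "9.7.0",
    ["core", "core", "queryparser", "analysis-common", "demo"],
)
```

The result could then be assigned to `os.environ['CLASSPATH']` in a single statement.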


## Install Anserini

In [14]:
!git clone --recurse-submodules https://github.com/castorini/anserini.git $VENDOR/anserini

Cloning into '../vendors/anserini'...
remote: Enumerating objects: 28261, done.
remote: Counting objects: 100% (2579/2579), done.
remote: Compressing objects: 100% (494/494), done.
remote: Total 28261 (delta 2195), reused 2371 (delta 2032), pack-reused 25682
Receiving objects: 100% (28261/28261), 84.69 MiB | 6.59 MiB/s, done.
Resolving deltas: 100% (18844/18844), done.
Submodule 'tools' (https://github.com/castorini/anserini-tools.git) registered for path 'tools'
Cloning into '/home/jovyan/transfer1/vendors/anserini/tools'...
remote: Enumerating objects: 806, done.        
remote: Counting objects: 100% (563/563), done.        
remote: Compressing objects: 100% (482/482), done.        
remote: Total 806 (delta 106), reused 527 (delta 80), pack-reused 243        
Receiving objects: 100% (806/806), 139.32 MiB | 11.09 MiB/s, done.
Resolving deltas: 100% (190/190), done.
Submodule path 'tools': checked out '80691da60909ef0a123a0643e8af6552d741281e'


In [54]:
!cd $VENDOR/anserini && mvn clean package appassembler:assemble -q -DskipTests -Dmaven.javadoc.skip=true



In [3]:
os.environ['ANSERINI_CLASSPATH'] = os.getenv('VENDOR') + '/anserini/target/'

In [87]:
!echo $ANSERINI_CLASSPATH

../vendors/anserini/target/


## Install pyserini and other packages

In [122]:
import sys
!{sys.executable} -m pip install -U -q pyserini torch transformers fugashi ipadic

In [18]:
!conda install -y -c pytorch faiss-gpu

Collecting package metadata (current_repodata.json): done
Solving environment: done

# All requested packages already installed.



## Install PyTerrier (for Evaluation)

In [5]:
# Change JAVA_HOME to fit your environment
JAVA_HOME = '/usr/lib/jvm/java-11-openjdk-amd64'
os.environ['JAVA_HOME'] = JAVA_HOME
os.getenv('JAVA_HOME')

'/usr/lib/jvm/java-11-openjdk-amd64'

In [6]:
import sys
!{sys.executable} -m pip install -U -q python-terrier

In [7]:
import pandas as pd
import pyterrier as pt
if not pt.started():
  pt.init(tqdm='notebook')

PyTerrier 0.9.2 has loaded Terrier 5.7 (built by craigm on 2022-11-10 18:30) and terrier-helper 0.0.7

No etc/terrier.properties, using terrier.default.properties for bootstrap configuration.


## Datasets

In [8]:
import sys
!{sys.executable} -m pip install -U -q ir_datasets

In [9]:
# Make the local ../datasets directory (which holds ntcir_transfer.py) importable.
sys.path.append(os.path.abspath('../datasets'))

In [10]:
import ir_datasets
import ntcir_transfer
dataset = ir_datasets.load('ntcir-transfer/1/train')

## Sparse Retrieval (BM25)

### Indexing

In [22]:
!ls $TC/mlir/

ntc1-j1			     rel2_ntc1-j1_0001-0030.utf8.tsv
ntc1-j1.utf8		     rel2_ntc1-j1_0001-0083.utf8.tsv
ntc1-j1.utf8.jsonl	     rel2_ntc1-j1_0031-0083
rel1_ntc1-j1_0001-0030	     rel2_ntc1-j1_0031-0083.utf8
rel1_ntc1-j1_0031-0083	     rel2_ntc1-j1_0031-0083.utf8.tsv
rel2_ntc1-j1_0001-0030	     top1000.train.tsv
rel2_ntc1-j1_0001-0030.utf8


In [30]:
!mkdir -p $TC/mlir/pyserini/jsonl && \
 cat $TC/mlir/ntc1-j1.utf8.jsonl | \
 sed 's/"doc_id":/"id":/' | \
 sed 's/"text":/"contents":/' > $TC/mlir/pyserini/jsonl/ntc1-j1.utf8.jsonl
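The `sed` calls above rename the keys by text substitution, so they depend on how the JSON happens to be serialized (for example, no space before the colon). A hedged sketch of the same renaming done on parsed JSON; only the `doc_id` to `id` and `text` to `contents` mapping comes from the cell above, while the function names and the sample record are illustrative:

```python
import json

# Key renaming taken from the sed commands: doc_id -> id, text -> contents.
KEY_MAP = {"doc_id": "id", "text": "contents"}

def to_pyserini_line(line):
    """Rename the top-level keys of one JSON line; other keys pass through unchanged."""
    record = json.loads(line)
    renamed = {KEY_MAP.get(key, key): value for key, value in record.items()}
    # ensure_ascii=False keeps Japanese text readable instead of \uXXXX escapes
    return json.dumps(renamed, ensure_ascii=False)

def convert_file(src_path, dst_path):
    """Convert a whole JSONL file line by line, skipping blank lines."""
    with open(src_path, encoding="utf-8") as src, \
         open(dst_path, "w", encoding="utf-8") as dst:
        for line in src:
            if line.strip():
                dst.write(to_pyserini_line(line) + "\n")

# Made-up record, for illustration only.
sample = '{"doc_id": "sample-0001", "text": "サンプル本文"}'
converted = to_pyserini_line(sample)
```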

In [None]:
!head -1 $TC/mlir/pyserini/jsonl/ntc1-j1.utf8.jsonl

In [32]:
!python -m pyserini.index.lucene \
  --collection JsonCollection \
  --input $TC/mlir/pyserini  \
  --language ja \
  --index $INDEX/sparse \
  --generator DefaultLuceneDocumentGenerator \
  --threads 2 \
  --storePositions --storeDocvectors --storeRaw

2023-06-27 12:33:26,063 INFO  [main] index.IndexCollection (IndexCollection.java:380) - Setting log level to INFO
2023-06-27 12:33:26,065 INFO  [main] index.IndexCollection (IndexCollection.java:383) - Starting indexer...
2023-06-27 12:33:26,065 INFO  [main] index.IndexCollection (IndexCollection.java:385) - DocumentCollection path: ../testcollections/ntcir/NTCIR-1/mlir/tmp
2023-06-27 12:33:26,066 INFO  [main] index.IndexCollection (IndexCollection.java:386) - CollectionClass: JsonCollection
2023-06-27 12:33:26,066 INFO  [main] index.IndexCollection (IndexCollection.java:387) - Generator: DefaultLuceneDocumentGenerator
2023-06-27 12:33:26,067 INFO  [main] index.IndexCollection (IndexCollection.java:388) - Threads: 2
2023-06-27 12:33:26,067 INFO  [main] index.IndexCollection (IndexCollection.java:389) - Language: ja
2023-06-27 12:33:26,067 INFO  [main] index.IndexCollection (IndexCollection.java:390) - Stemmer: porter
2023-06-27 12:33:26,068 INFO  [main] index.IndexCollection (IndexColl

### Retrieval

In [34]:
!ls $TC/topics

topic0001-0030		   topic0001-0083.utf8.jsonl  topic0031-0083.utf8.jsonl
topic0001-0030.utf8	   topic0031-0083
topic0001-0030.utf8.jsonl  topic0031-0083.utf8


In [25]:
!cat $TC/topics/topic0001-0083.utf8.jsonl | \
 sed -r 's/.*: "(.*)", "text": "(.*)", "description": "(.*)".*/\1\t\2/g' > $TC/topics/topic0001-0083.utf8.tsv
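The regular expression above assumes that each topic line holds an ID followed by the `text` and `description` keys, in that order. A sketch of the same conversion with the `json` module; the ID key name `query_id` is an assumption (the cell above never names it), and `topic_to_tsv` and the sample topic are illustrative:

```python
import json

def topic_to_tsv(line, id_key="query_id", fields=("text",)):
    """Turn one topic JSON line into 'id<TAB>query', joining the chosen fields with a space."""
    topic = json.loads(line)
    query = " ".join(topic[field] for field in fields)
    # Tabs or newlines inside the query would break the TSV layout.
    query = query.replace("\t", " ").replace("\n", " ")
    return f"{topic[id_key]}\t{query}"

# Made-up topic, for illustration only.
sample = '{"query_id": "0001", "text": "サンプルタイトル", "description": "サンプル説明"}'
title_only = topic_to_tsv(sample)
title_and_description = topic_to_tsv(sample, fields=("text", "description"))
```

Passing `fields=("text", "description")` corresponds to a query that concatenates the title and the description.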

In [None]:
!head $TC/topics/topic0001-0083.utf8.tsv

In [26]:
!python -m pyserini.search.lucene \
  --index $INDEX/sparse \
  --topics $TC/topics/topic0001-0083.utf8.tsv \
  --output $RUN/MyRun-BM25-pyserini.res \
  --language ja \
  --bm25

Running ../testcollections/ntcir/NTCIR-1/topics/topic0001-0083.utf8.tsv topics, saving to ../runs/ntcir17-transfer/train/MyRun-BM25-pyserini.res...
100%|███████████████████████████████████████████| 83/83 [00:12<00:00,  6.78it/s]


In [118]:
!head $RUN/MyRun-BM25-pyserini.res

1 Q0 gakkai-0000297977 1 4.415100 Anserini
1 Q0 gakkai-0000064659 2 4.394100 Anserini
1 Q0 gakkai-0000328806 3 4.382800 Anserini
1 Q0 gakkai-0000245010 4 4.381500 Anserini
1 Q0 gakkai-0000094695 5 4.371500 Anserini
1 Q0 gakkai-0000193955 6 4.371499 Anserini
1 Q0 gakkai-0000198139 7 4.369100 Anserini
1 Q0 gakkai-0000225773 8 4.359300 Anserini
1 Q0 gakkai-0000133457 9 4.352500 Anserini
1 Q0 gakkai-0000099991 10 4.338900 Anserini
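Each line of the run file above follows the six-column TREC run layout: query ID, the literal `Q0`, document ID, rank, score, and a run tag. A small sketch of a parser for that layout; `RunEntry` and `parse_run_line` are illustrative names, not part of Pyserini or PyTerrier:

```python
from typing import NamedTuple

class RunEntry(NamedTuple):
    qid: str
    docno: str
    rank: int
    score: float
    tag: str

def parse_run_line(line):
    """Split one whitespace-separated TREC run line into typed fields."""
    qid, _q0, docno, rank, score, tag = line.split()
    return RunEntry(qid, docno, int(rank), float(score), tag)

entry = parse_run_line("1 Q0 gakkai-0000297977 1 4.415100 Anserini")
```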


### Evaluation

In [25]:
baselineDF = pt.io.read_results(os.getenv('RUN') + "/MyRun-BM25-pyserini.res")
baselineDF['qid'] = baselineDF['qid'].str.zfill(4)
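The `zfill(4)` call is needed because the run file writes query IDs without leading zeros (`1`), presumably so that they match four-digit IDs such as `0001` in the dataset's topics and qrels. A tiny sketch of the padding in plain Python; `pad_qid` is an illustrative helper:

```python
def pad_qid(qid, width=4):
    """Left-pad a query ID with zeros to a fixed width."""
    return str(qid).zfill(width)

run_qids = ["1", "12", "83"]
padded = [pad_qid(qid) for qid in run_qids]
```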

In [26]:
baselineDF

Unnamed: 0,qid,docno,rank,score,name
0,0001,gakkai-0000297977,1,4.415100,Anserini
1,0001,gakkai-0000064659,2,4.394100,Anserini
2,0001,gakkai-0000328806,3,4.382800,Anserini
3,0001,gakkai-0000245010,4,4.381500,Anserini
4,0001,gakkai-0000094695,5,4.371500,Anserini
...,...,...,...,...,...
77787,0083,gakkai-0000073368,996,4.316998,Anserini
77788,0083,gakkai-0000116980,997,4.316997,Anserini
77789,0083,gakkai-0000241915,998,4.316996,Anserini
77790,0083,gakkai-0000308691,999,4.316995,Anserini


In [108]:
dataset_pt = pt.get_dataset('irds:ntcir-transfer/1/train')

In [124]:
from pyterrier.measures import *
pt.Experiment(
    [baselineDF],
    topics=dataset_pt.get_topics(),
    qrels=dataset_pt.get_qrels(),
    names=["baselineDF"],
    eval_metrics=[nDCG]
)

Unnamed: 0,name,nDCG
0,baselineDF,0.532006
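For readers unfamiliar with the metric, nDCG divides the discounted cumulative gain of a ranking by that of an ideal ordering of the judged documents. A simplified sketch, assuming linear gains and a log2(rank + 1) discount; it is meant as an illustration and may differ in details from the implementation PyTerrier delegates to. All names and the toy judgments are made up:

```python
import math

def dcg(gains):
    """Discounted cumulative gain: each gain is divided by log2(rank + 1), ranks start at 1."""
    return sum(gain / math.log2(rank + 1) for rank, gain in enumerate(gains, start=1))

def ndcg(ranked_docnos, qrels):
    """nDCG of one ranking against a {docno: relevance} dictionary."""
    gains = [qrels.get(docno, 0) for docno in ranked_docnos]  # unjudged documents count as 0
    ideal_gains = sorted(qrels.values(), reverse=True)
    ideal_dcg = dcg(ideal_gains)
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0

toy_qrels = {"d1": 2, "d2": 1, "d3": 0}
perfect = ndcg(["d1", "d2", "d3"], toy_qrels)  # ideal order
swapped = ndcg(["d2", "d1", "d3"], toy_qrels)  # top two documents exchanged
```

Swapping the two relevant documents lowers the score because the more relevant one is discounted more heavily at rank 2.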


## Dense Retrieval (BERT + ANN)

### Embedding corpus

In [None]:
import torch
# Release cached GPU memory held by this notebook process before encoding.
torch.cuda.empty_cache()

In [39]:
%%time
!python -m pyserini.encode \
  input   --corpus $TC/mlir/pyserini/jsonl/ntc1-j1.utf8.jsonl \
          --fields text \
          --delimiter "\n" \
          --shard-id 0 \
          --shard-num 1 \
  output  --embeddings $TC/mlir/pyserini/embedding \
          --to-faiss \
  encoder --encoder cl-tohoku/bert-base-japanese \
          --encoder-class auto \
          --fields text \
          --batch-size 32 \
          --max-length 512 \
          --device cuda:1 \
          --fp16

2023-06-28 06:18:53.377809: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-06-28 06:18:53.657265: E tensorflow/stream_executor/cuda/cuda_blas.cc:2981] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2023-06-28 06:18:54.366504: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory
2023-06-28 06:18:54.366613: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or 

In [40]:
!ls -lh $TC/mlir/pyserini/embedding

total 982M
-rw-rw-r-- 1 jovyan users 5.8M Jun 28 06:50 docid
-rw-rw-r-- 1 jovyan users 976M Jun 28 06:50 index


### Indexing

In [50]:
%%time
!python -m pyserini.index.faiss \
  --input $TC/mlir/pyserini/embedding \
  --output $INDEX/dense \
  --hnsw \
  --pq

(332918, 768)
hnsw_add_vertices: adding 332918 elements on top of 0 (preset_levels=0)
  max_level = 2
Adding 7 elements at level 2
Adding 1274 elements at level 1
Adding 331637 elements at level 0
Done in 664098.158 ms
332918
CPU times: user 14.1 s, sys: 1.75 s, total: 15.9 s
Wall time: 11min 44s
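The shape `(332918, 768)` printed above also accounts for the size of the flat embedding file written by `pyserini.encode`. A quick check, assuming each component is stored as a 4-byte float32 (an assumption; the log does not state the storage type):

```python
NUM_VECTORS = 332_918    # from the shape printed by the indexing step
DIMENSION = 768          # embedding size of the BERT-base encoder
BYTES_PER_FLOAT32 = 4    # assumption: vectors are stored as float32

size_bytes = NUM_VECTORS * DIMENSION * BYTES_PER_FLOAT32
size_mib = size_bytes / 2**20
print(f"{size_bytes} bytes = {size_mib:.1f} MiB")
# prints "1022724096 bytes = 975.3 MiB"
```

This is consistent with the 976M reported for the `index` file in the embedding directory, allowing for rounding and a small file header.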


In [51]:
!ls -lh $INDEX/dense

total 723M
-rw-rw-r-- 1 jovyan users 5.8M Jun 28 07:02 docid
-rw-rw-r-- 1 jovyan users 718M Jun 28 07:14 index


### Retrieval (title only)

In [78]:
!cat $TC/topics/topic0001-0083.utf8.jsonl | \
 sed -r 's/.*: "(.*)", "text": "(.*)", "description": "(.*)".*/\1\t\2/g' > $TC/topics/topic0001-0083.utf8.tsv

In [None]:
!head $TC/topics/topic0001-0083.utf8.tsv

In [87]:
%%time
!python -m pyserini.search.faiss \
  --index $INDEX/dense \
  --topics $TC/topics/topic0001-0083.utf8.tsv \
  --output $RUN/MyRun-ANN-pyserini.res \
  --encoder-class auto \
  --encoder cl-tohoku/bert-base-japanese \
  --tokenizer cl-tohoku/bert-base-japanese \
  --device cuda:1 \
  --batch-size 32

2023-06-28 07:42:50.495187: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-06-28 07:42:50.667351: E tensorflow/stream_executor/cuda/cuda_blas.cc:2981] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2023-06-28 07:42:51.236893: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory
2023-06-28 07:42:51.237070: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or 

### Evaluation 1

In [90]:
denseDF = pt.io.read_results(os.getenv('RUN') + "/MyRun-ANN-pyserini.res")
denseDF['qid'] = denseDF['qid'].str.zfill(4)

In [91]:
denseDF

Unnamed: 0,qid,docno,rank,score,name
0,0001,gakkai-0000117570,1,-101.680191,Faiss
1,0001,gakkai-0000267380,2,-111.181946,Faiss
2,0001,gakkai-0000285625,3,-116.075027,Faiss
3,0001,gakkai-0000288940,4,-117.044373,Faiss
4,0001,gakkai-0000127851,5,-117.356834,Faiss
...,...,...,...,...,...
82995,0083,gakkai-0000254192,996,-130.657593,Faiss
82996,0083,gakkai-0000308041,997,-130.659897,Faiss
82997,0083,gakkai-0000004119,998,-130.663055,Faiss
82998,0083,gakkai-0000107671,999,-130.673615,Faiss


In [76]:
dataset_pt = pt.get_dataset('irds:ntcir-transfer/1/train')

In [95]:
from pyterrier.measures import *
pt.Experiment(
    [baselineDF, denseDF],
    topics=dataset_pt.get_topics(),
    qrels=dataset_pt.get_qrels(),
    names=["BM25", "ANN(title)"],
    eval_metrics=[nDCG]
)

Unnamed: 0,name,nDCG
0,BM25,0.532006
1,ANN(title),0.008415


### Retrieval (title + description)

**NOTE:** Only the title field is provided in the Dense First Stage Retrieval subtask of the NTCIR-17 Transfer Task. The results in this section are for reference only.

In [80]:
!cat $TC/topics/topic0001-0083.utf8.jsonl | \
 sed -r 's/.*: "(.*)", "text": "(.*)", "description": "(.*)".*/\1\t\2 \3/g' > $TC/topics/topic-td-0001-0083.utf8.tsv

In [None]:
!head $TC/topics/topic-td-0001-0083.utf8.tsv

In [83]:
%%time
!python -m pyserini.search.faiss \
  --index $INDEX/dense \
  --topics $TC/topics/topic-td-0001-0083.utf8.tsv \
  --output $RUN/MyRun-ANN-td-pyserini.res \
  --encoder-class auto \
  --encoder cl-tohoku/bert-base-japanese \
  --tokenizer cl-tohoku/bert-base-japanese \
  --device cuda:1 \
  --batch-size 32

2023-06-28 07:40:54.223300: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-06-28 07:40:54.405614: E tensorflow/stream_executor/cuda/cuda_blas.cc:2981] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2023-06-28 07:40:54.992625: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory
2023-06-28 07:40:54.992761: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or 

### Evaluation 2

In [85]:
denseDF_td = pt.io.read_results(os.getenv('RUN') + "/MyRun-ANN-td-pyserini.res")
denseDF_td['qid'] = denseDF_td['qid'].str.zfill(4)

In [94]:
from pyterrier.measures import *
pt.Experiment(
    [baselineDF, denseDF, denseDF_td],
    topics=dataset_pt.get_topics(),
    qrels=dataset_pt.get_qrels(),
    names=["BM25", "ANN(title)", "ANN(title+desc)"],
    eval_metrics=[nDCG]
)

Unnamed: 0,name,nDCG
0,BM25,0.532006
1,ANN(title),0.008415
2,ANN(title+desc),0.01244


## Dense Retrieval (SentenceBERT + ANN)

### Embedding corpus

In [106]:
import torch
# Release cached GPU memory held by this notebook process before encoding.
torch.cuda.empty_cache()

In [107]:
%%time
torch.cuda.empty_cache() 
!python -m pyserini.encode \
  input   --corpus $TC/mlir/pyserini/jsonl/ntc1-j1.utf8.jsonl \
          --fields text \
          --delimiter "\n" \
          --shard-id 0 \
          --shard-num 1 \
  output  --embeddings $TC/mlir/pyserini/embedding2 \
          --to-faiss \
  encoder --encoder sonoisa/sentence-bert-base-ja-mean-tokens-v2 \
          --encoder-class sentence-transformers \
          --fields text \
          --batch-size 16 \
          --max-length 512 \
          --device cuda:0 \
          --fp16

2023-06-28 08:36:37.730483: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-06-28 08:36:37.904449: E tensorflow/stream_executor/cuda/cuda_blas.cc:2981] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2023-06-28 08:36:38.550662: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory
2023-06-28 08:36:38.550844: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or 

In [108]:
!ls -lh $TC/mlir/pyserini/embedding2

total 982M
-rw-rw-r-- 1 jovyan users 5.8M Jun 28 09:14 docid
-rw-rw-r-- 1 jovyan users 976M Jun 28 09:14 index


### Indexing

In [109]:
%%time
!python -m pyserini.index.faiss \
  --input $TC/mlir/pyserini/embedding2 \
  --output $INDEX/dense2 \
  --hnsw \
  --pq

(332918, 768)
hnsw_add_vertices: adding 332918 elements on top of 0 (preset_levels=0)
  max_level = 2
Adding 7 elements at level 2
Adding 1274 elements at level 1
Adding 331637 elements at level 0
Done in 581597.836 ms
332918
CPU times: user 12.3 s, sys: 1.42 s, total: 13.8 s
Wall time: 10min 36s


In [110]:
!ls -lh $INDEX/dense2

total 723M
-rw-rw-r-- 1 jovyan users 5.8M Jun 28 09:15 docid
-rw-rw-r-- 1 jovyan users 718M Jun 28 09:26 index


### Retrieval (title only)

In [111]:
import torch
# Release cached GPU memory held by this notebook process before searching.
torch.cuda.empty_cache()

In [113]:
%%time
!python -m pyserini.search.faiss \
  --index $INDEX/dense2 \
  --topics $TC/topics/topic0001-0083.utf8.tsv \
  --output $RUN/MyRun-SBERT-pyserini.res \
  --encoder-class sentence \
  --encoder sonoisa/sentence-bert-base-ja-mean-tokens-v2 \
  --tokenizer sonoisa/sentence-bert-base-ja-mean-tokens-v2 \
  --device cuda:0 \
  --batch-size 16

2023-06-28 09:28:23.764811: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-06-28 09:28:23.962203: E tensorflow/stream_executor/cuda/cuda_blas.cc:2981] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2023-06-28 09:28:24.597994: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory
2023-06-28 09:28:24.598166: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or 

### Evaluation 3

In [114]:
dense2DF = pt.io.read_results(os.getenv('RUN') + "/MyRun-SBERT-pyserini.res")
dense2DF['qid'] = dense2DF['qid'].str.zfill(4)

In [115]:
dense2DF

Unnamed: 0,qid,docno,rank,score,name
0,0001,gakkai-0000241510,1,-0.625328,Faiss
1,0001,gakkai-0000183593,2,-0.637275,Faiss
2,0001,gakkai-0000144389,3,-0.645404,Faiss
3,0001,gakkai-0000298172,4,-0.645452,Faiss
4,0001,gakkai-0000230011,5,-0.655468,Faiss
...,...,...,...,...,...
82995,0083,gakkai-0000186726,996,-1.047936,Faiss
82996,0083,gakkai-0000172806,997,-1.048079,Faiss
82997,0083,gakkai-0000267755,998,-1.048119,Faiss
82998,0083,gakkai-0000338233,999,-1.048172,Faiss


In [116]:
from pyterrier.measures import *
pt.Experiment(
    [baselineDF, denseDF, denseDF_td, dense2DF],
    topics=dataset_pt.get_topics(),
    qrels=dataset_pt.get_qrels(),
    names=["BM25", "BERT(title)", "BERT(title+desc)", "SBERT(title)"],
    eval_metrics=[nDCG]
)

Unnamed: 0,name,nDCG
0,BM25,0.532006
1,BERT(title),0.008415
2,BERT(title+desc),0.01244
3,SBERT(title),0.292386


### Retrieval (title + description)

**NOTE:** Only the title field is provided in the Dense First Stage Retrieval subtask of the NTCIR-17 Transfer Task. The results in this section are for reference only.

In [117]:
%%time
!python -m pyserini.search.faiss \
  --index $INDEX/dense2 \
  --topics $TC/topics/topic-td-0001-0083.utf8.tsv \
  --output $RUN/MyRun-SBERT-td-pyserini.res \
  --encoder-class sentence \
  --encoder sonoisa/sentence-bert-base-ja-mean-tokens-v2 \
  --tokenizer sonoisa/sentence-bert-base-ja-mean-tokens-v2 \
  --device cuda:0 \
  --batch-size 16

2023-06-28 09:32:38.189945: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-06-28 09:32:38.357934: E tensorflow/stream_executor/cuda/cuda_blas.cc:2981] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2023-06-28 09:32:38.926317: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory
2023-06-28 09:32:38.926495: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or 

### Evaluation 4

In [118]:
dense2DF_td = pt.io.read_results(os.getenv('RUN') + "/MyRun-SBERT-td-pyserini.res")
dense2DF_td['qid'] = dense2DF_td['qid'].str.zfill(4)

In [119]:
dense2DF_td

Unnamed: 0,qid,docno,rank,score,name
0,0001,gakkai-0000241512,1,-0.468825,Faiss
1,0001,gakkai-0000067968,2,-0.506423,Faiss
2,0001,gakkai-0000067257,3,-0.516469,Faiss
3,0001,gakkai-0000062729,4,-0.518317,Faiss
4,0001,gakkai-0000063518,5,-0.533578,Faiss
...,...,...,...,...,...
82995,0083,gakkai-0000314831,996,-0.948320,Faiss
82996,0083,gakkai-0000092062,997,-0.948396,Faiss
82997,0083,gakkai-0000188904,998,-0.948484,Faiss
82998,0083,gakkai-0000287401,999,-0.948498,Faiss


In [121]:
from pyterrier.measures import *
pt.Experiment(
    [baselineDF, denseDF, denseDF_td, dense2DF, dense2DF_td],
    topics=dataset_pt.get_topics(),
    qrels=dataset_pt.get_qrels(),
    names=["BM25(title)", "BERT(title)", "BERT(title+desc)", "SBERT(title)", "SBERT(title+desc)"],
    eval_metrics=[nDCG]
)

Unnamed: 0,name,nDCG
0,BM25(title),0.532006
1,BERT(title),0.008415
2,BERT(title+desc),0.01244
3,SBERT(title),0.292386
4,SBERT(title+desc),0.40527


---
## Where can we go from here?

- Try different transformer models for embedding and retrieval.
- Fine-tune BERT or Sentence-BERT models using the train set or other resources.
- Expand each title with a generative model and retrieve with the expanded queries, since title+desc performs better than title only for both dense models.
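
To put the last point in numbers, the relative gain from adding the description can be computed from the nDCG values in the final table. A short sketch; the dictionary copies those reported scores, and `relative_gain` is an illustrative helper:

```python
# nDCG values copied from the final evaluation table.
NDCG = {
    "BERT(title)": 0.008415,
    "BERT(title+desc)": 0.01244,
    "SBERT(title)": 0.292386,
    "SBERT(title+desc)": 0.40527,
}

def relative_gain(baseline, improved):
    """Relative improvement of `improved` over `baseline`, in percent."""
    return (improved - baseline) / baseline * 100.0

for model in ("BERT", "SBERT"):
    gain = relative_gain(NDCG[f"{model}(title)"], NDCG[f"{model}(title+desc)"])
    print(f"{model}: {gain:+.1f}% nDCG with title+desc")
```

Both dense models benefit from the longer query, yet neither reaches the BM25 baseline (0.532006) on this train set.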