
# DistilBERT-KD with Balanced Topic Aware Sampling (TAS-B): Passage Ranking on MS MARCO

Sebastian Hofstätter, Sheng-Chieh Lin, Jheng-Hong Yang, Jimmy Lin, Allan Hanbury. Efficiently Teaching an Effective Dense Retriever with Balanced Topic Aware Sampling. SIGIR 2021.


## Install dependencies

In [1]:
from google.colab import auth
auth.authenticate_user()

In [2]:
%%capture
!pip install pyserini

import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-11-openjdk-amd64"

In [3]:
!pip install faiss-cpu

Collecting faiss-cpu
  Downloading faiss_cpu-1.7.2-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (8.6 MB)
     |████████████████████████████████| 8.6 MB 14.6 MB/s
Installing collected packages: faiss-cpu
Successfully installed faiss-cpu-1.7.2


## Dense retrieval with DistilBERT-KD TAS-B: MS MARCO Passage Ranking

In [9]:
! python -m pyserini.dsearch --topics msmarco-passage-dev-subset \
                             --index msmarco-passage-distilbert-dot-tas_b-b256-bf \
                             --encoded-queries distilbert_tas_b-msmarco-passage-dev-subset \
                             --batch-size 36 \
                             --threads 12 \
                             --output run.msmarco-passage.distilbert-dot-tas_b-b256.bf.tsv \
                             --output-format msmarco

Using pre-defined topic order for msmarco-passage-dev-subset
Attempting to initialize pre-encoded queries distilbert_tas_b-msmarco-passage-dev-subset.
Downloading index at https://github.com/castorini/pyserini-data/raw/main/encoded-queries/query-embedding-distilbert_dot_tas_b_b256-msmarco-passage-dev-subset-20210527-63276f.tar.gz...
query-embedding-distilbert_dot_tas_b_b256-msmarco-passage-dev-subset-20210527-63276f.tar.gz: 19.1MB [00:01, 16.2MB/s]                
Extracting /root/.cache/pyserini/queries/query-embedding-distilbert_dot_tas_b_b256-msmarco-passage-dev-subset-20210527-63276f.tar.gz into /root/.cache/pyserini/queries/query-embedding-distilbert_dot_tas_b_b256-msmarco-passage-dev-subset-20210527-63276f.17a3f81de7ba497728050b83733b1c46...
Initializing distilbert_tas_b-msmarco-passage-dev-subset...
Attempting to initialize pre-built index msmarco-passage-distilbert-dot-tas_b-b256-bf.
Downloading index at https://rgw.cs.uwaterloo.ca/JIMMYLIN-bucket0/pyserini-indexes/dindex-msmar
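
The index name above ends in `dot` and `bf`, which suggests dot-product scoring over a brute-force (flat) index. The sketch below illustrates that idea on toy data in pure Python. It is not Pyserini's implementation: the function names, passage ids, and vectors are all hypothetical, and the real index stores one dense vector per MS MARCO passage.

```python
import heapq

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def brute_force_search(query_vec, passage_vecs, k=2):
    """Score the query against every passage and return the top-k (pid, score) pairs."""
    scored = ((dot(query_vec, vec), pid) for pid, vec in passage_vecs.items())
    # nlargest sorts by score first, so the best-scoring passage comes first.
    return [(pid, score) for score, pid in heapq.nlargest(k, scored)]

# Toy 3-dimensional "embeddings" (hypothetical values, for illustration only).
passage_vecs = {
    "p0": [1.0, 0.0, 0.0],
    "p1": [0.0, 1.0, 0.0],
    "p2": [0.5, 0.5, 0.0],
    "p3": [0.0, 0.0, 2.0],
}
query_vec = [1.0, 0.25, 0.0]

hits = brute_force_search(query_vec, passage_vecs, k=2)
for rank, (pid, score) in enumerate(hits, start=1):
    print(rank, pid, score)
```

A brute-force index compares the query with every passage, so the ranking is exact with respect to the scoring function, at the cost of a full scan per query.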

In [10]:
!gsutil cp run.msmarco-passage.distilbert-dot-tas_b-b256.bf.tsv gs://luanps/information_retrieval/pyserini/distilbert-kd-tasb/

Copying file://run.msmarco-passage.distilbert-dot-tas_b-b256.bf.tsv [Content-Type=text/tab-separated-values]...
\ [1 files][127.0 MiB/127.0 MiB]                                                
Operation completed over 1 objects/127.0 MiB.                                    


### Evaluation

In [11]:
# MRR@10 evaluation; the result is appended to a text file
!python -m pyserini.eval.msmarco_passage_eval msmarco-passage-dev-subset run.msmarco-passage.distilbert-dot-tas_b-b256.bf.tsv >> msmarco_passage_distilbert-dot-tas_b-b256_bf_mrr_eval.txt
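
To make the metric concrete, here is a minimal sketch of MRR@10 on toy data. It assumes the run is a mapping from query id to a ranked list of passage ids and the relevance judgments are a mapping from query id to a set of relevant passage ids. The function name and the data are hypothetical, and this is not the code of the evaluation script called above.

```python
def mrr_at_k(run, qrels, k=10):
    """Mean reciprocal rank of the first relevant passage within the top k.

    run:   dict mapping query id -> ranked list of passage ids (best first)
    qrels: dict mapping query id -> set of relevant passage ids
    Queries with no relevant passage in the top k contribute 0.
    """
    total = 0.0
    for qid, relevant in qrels.items():
        for rank, pid in enumerate(run.get(qid, [])[:k], start=1):
            if pid in relevant:
                total += 1.0 / rank
                break  # only the first relevant hit counts
    return total / len(qrels)

# Toy judgments and run (hypothetical ids).
qrels = {"q1": {"p3"}, "q2": {"p9"}, "q3": {"p5"}}
run = {
    "q1": ["p1", "p3", "p2"],  # first relevant hit at rank 2 -> 1/2
    "q2": ["p9", "p4"],        # first relevant hit at rank 1 -> 1
    "q3": ["p7", "p8"],        # no relevant hit             -> 0
}

score = mrr_at_k(run, qrels, k=10)
print(f"MRR@10 = {score:.4f}")
```

Averaging over every judged query, including those with no relevant hit, is a design choice in this sketch; it penalizes queries for which the retriever returns nothing useful in the top 10.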

In [12]:
# TREC evaluation: convert the run to TREC format, then compute recall@1000 and MAP
!python -m pyserini.eval.convert_msmarco_run_to_trec_run --input run.msmarco-passage.distilbert-dot-tas_b-b256.bf.tsv \
                                                         --output run.msmarco-passage.distilbert-dot-tas_b-b256.bf.trec

!python -m pyserini.eval.trec_eval -c -mrecall.1000 \
                                      -mmap msmarco-passage-dev-subset \
                                      run.msmarco-passage.distilbert-dot-tas_b-b256.bf.trec >> msmarco_passage_distilbert-dot-tas_b-b256.bf.tsv_eval.txt

Done!
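
The conversion step above can be pictured with the following sketch. It assumes each line of the MS MARCO run holds a tab-separated query id, passage id, and rank, and that a TREC run line has six whitespace-separated fields: query id, the literal `Q0`, document id, rank, score, and a run tag. The function name, the tag, and the use of `1/rank` as a placeholder score are assumptions made for illustration, not a description of the Pyserini converter.

```python
def msmarco_to_trec(lines, tag="sketch"):
    """Turn MS MARCO run lines (qid<TAB>pid<TAB>rank) into TREC run lines."""
    out = []
    for line in lines:
        qid, pid, rank = line.rstrip("\n").split("\t")
        # The MS MARCO format carries no score, so derive a placeholder that
        # decreases with rank (an assumption of this sketch).
        score = 1.0 / int(rank)
        out.append(f"{qid} Q0 {pid} {rank} {score:.6f} {tag}")
    return out

# Toy run lines (hypothetical ids).
msmarco_lines = ["q1\tp7\t1\n", "q1\tp2\t2\n", "q2\tp5\t1\n"]
trec_lines = msmarco_to_trec(msmarco_lines)
for line in trec_lines:
    print(line)
```

Because the placeholder score decreases as the rank increases, a tool that orders documents by score would reproduce the original ranking.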


In [13]:
!gsutil cp msmarco_passage_distilbert-dot-tas_b-b256_bf_mrr_eval.txt msmarco_passage_distilbert-dot-tas_b-b256.bf.tsv_eval.txt gs://luanps/information_retrieval/pyserini/distilbert-kd-tasb/

Copying file://msmarco_passage_distilbert-dot-tas_b-b256_bf_mrr_eval.txt [Content-Type=text/plain]...
Copying file://msmarco_passage_distilbert-dot-tas_b-b256.bf.tsv_eval.txt [Content-Type=text/plain]...
/ [2 files][  3.0 KiB/  3.0 KiB]                                                
Operation completed over 2 objects/3.0 KiB.                                      
