<a href="https://colab.research.google.com/github/luanps/pyserini/blob/master/Run_pyserini_ance.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Install dependencies

In [1]:
from google.colab import auth
auth.authenticate_user()

In [2]:
%%capture
!pip install pyserini

import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-11-openjdk-amd64"

In [3]:
!pip install faiss-cpu

Collecting faiss-cpu
  Downloading faiss_cpu-1.7.2-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (8.6 MB)
     |████████████████████████████████| 8.6 MB 15.7 MB/s
Installing collected packages: faiss-cpu
Successfully installed faiss-cpu-1.7.2


## Dense retrieval with ANCE - MSMARCO Passage Ranking

In [4]:
!python -m pyserini.dsearch --topics msmarco-passage-dev-subset \
                             --index msmarco-passage-ance-bf \
                             --encoded-queries ance-msmarco-passage-dev-subset \
                             --batch-size 36 \
                             --threads 12 \
                             --output run.msmarco-passage.ance.bf.tsv \
                             --output-format msmarco

Using pre-defined topic order for msmarco-passage-dev-subset
Attempting to initialize pre-encoded queries ance-msmarco-passage-dev-subset.
Downloading index at https://github.com/castorini/pyserini-data/raw/main/encoded-queries/query-embedding-ance-msmarco-passage-dev-subset-20210419-9323ec.tar.gz...
query-embedding-ance-msmarco-passage-dev-subset-20210419-9323ec.tar.gz: 19.0MB [00:01, 14.7MB/s]                
Extracting /root/.cache/pyserini/queries/query-embedding-ance-msmarco-passage-dev-subset-20210419-9323ec.tar.gz into /root/.cache/pyserini/queries/query-embedding-ance-msmarco-passage-dev-subset-20210419-9323ec.adad81bb1495eff2f0463e809ecc01b8...
Initializing ance-msmarco-passage-dev-subset...
Attempting to initialize pre-built index msmarco-passage-ance-bf.
Downloading index at https://rgw.cs.uwaterloo.ca/JIMMYLIN-bucket0/pyserini-indexes/dindex-msmarco-passage-ance-bf-20210224-060cef.tar.gz...
dindex-msmarco-passage-ance-bf-20210224-060cef.tar.gz: 23.4GB [16:09, 25.9MB/s]     
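
The `-bf` suffix in the index name refers to a brute-force (flat) index: every passage vector is scored against the query and the highest-scoring ones are kept. A minimal pure-Python sketch of that idea follows, assuming inner-product scoring. The toy vectors, ids, and function names are hypothetical, and the real search runs through Faiss over full-size embeddings, not this loop.

```python
import heapq
from typing import Dict, List, Sequence, Tuple


def inner_product(a: Sequence[float], b: Sequence[float]) -> float:
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def brute_force_search(
    query: Sequence[float],
    passages: Dict[str, Sequence[float]],
    k: int = 10,
) -> List[Tuple[str, float]]:
    """Score every passage against the query and return the top-k by inner product."""
    scored = ((inner_product(query, vec), pid) for pid, vec in passages.items())
    top = heapq.nlargest(k, scored)
    return [(pid, score) for score, pid in top]


# Toy 3-dimensional embeddings (hypothetical; real encoder vectors are much larger).
toy_passages = {
    "p1": [0.1, 0.9, 0.0],
    "p2": [0.8, 0.1, 0.1],
    "p3": [0.4, 0.4, 0.2],
}
toy_query = [1.0, 0.0, 0.0]

for pid, score in brute_force_search(toy_query, toy_passages, k=2):
    print(pid, score)
```

Because every vector is compared, the cost grows linearly with the collection size, which is why the pre-built index is large and the search is batched and multi-threaded in the command above.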

In [5]:
!gsutil cp run.* gs://luanps/information_retrieval/pyserini/ance/

Copying file://run.msmarco-passage.ance.bf.tsv [Content-Type=text/tab-separated-values]...
/
Operation completed over 1 objects/127.0 MiB.                                    


### Evaluation

In [6]:
#MRR Eval
!python -m pyserini.eval.msmarco_passage_eval msmarco-passage-dev-subset run.msmarco-passage.ance.bf.tsv >> msmarco_passage_mrr_eval.txt

msmarco_passage_eval.py: 8.00kB [00:00, 34.8kB/s]
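
For intuition about what the MRR evaluation measures, here is a small sketch of MRR@10 over a run in the MS MARCO output format (`qid<TAB>docid<TAB>rank`). The function names and toy data are hypothetical, and the official evaluation script may handle edge cases differently, so treat this as an illustration and not a replacement for it.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple


def load_msmarco_run(lines: Iterable[str]) -> Dict[str, List[str]]:
    """Parse 'qid<TAB>docid<TAB>rank' lines into per-query rankings ordered by rank."""
    ranked: Dict[str, List[Tuple[int, str]]] = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line:
            continue
        qid, docid, rank = line.split("\t")
        ranked[qid].append((int(rank), docid))
    return {qid: [docid for _, docid in sorted(pairs)] for qid, pairs in ranked.items()}


def mrr_at_k(run: Dict[str, List[str]], qrels: Dict[str, Set[str]], k: int = 10) -> float:
    """Average, over judged queries, of 1/rank of the first relevant hit in the top k."""
    if not qrels:
        return 0.0
    total = 0.0
    for qid, relevant in qrels.items():
        for position, docid in enumerate(run.get(qid, [])[:k], start=1):
            if docid in relevant:
                total += 1.0 / position
                break  # only the first relevant hit counts
    return total / len(qrels)


# Toy run and judgments (hypothetical ids).
toy_run = load_msmarco_run(["q1\td3\t1", "q1\td1\t2", "q2\td9\t1"])
toy_qrels = {"q1": {"d1"}, "q2": {"d7"}}
print(mrr_at_k(toy_run, toy_qrels))
```

In the toy data, `q1` finds its relevant passage at rank 2 and `q2` finds none, so the two reciprocal ranks being averaged are 1/2 and 0.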


In [7]:
#TREC Eval
!python -m pyserini.eval.convert_msmarco_run_to_trec_run --input run.msmarco-passage.ance.bf.tsv \
                                                         --output run.msmarco-passage.ance.bf.trec

!python -m pyserini.eval.trec_eval -c -mrecall.1000 \
                                      -mmap msmarco-passage-dev-subset \
                                      run.msmarco-passage.ance.bf.trec >> msmarco_passage_trec_eval.txt

Done!
jtreceval-0.0.5-jar-with-dependencies.jar: 1.79MB [00:00, 3.12MB/s]                
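
The conversion step exists because `trec_eval` expects six-column run lines (`qid Q0 docid rank score tag`), while the MS MARCO format carries only `qid`, `docid`, and `rank`. The sketch below shows one way such a conversion could look. The `1/rank` placeholder score and the `sketch` run tag are assumptions made for illustration; the Pyserini converter may pick different values.

```python
from typing import Iterable, Iterator


def msmarco_to_trec(lines: Iterable[str], tag: str = "sketch") -> Iterator[str]:
    """Turn 'qid<TAB>docid<TAB>rank' lines into 'qid Q0 docid rank score tag' lines."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        qid, docid, rank = line.split("\t")
        # The input has no scores, so use 1/rank as a placeholder that
        # decreases with rank and therefore preserves the ordering.
        score = 1.0 / int(rank)
        yield f"{qid} Q0 {docid} {rank} {score:.6f} {tag}"


# Toy input (hypothetical ids).
for trec_line in msmarco_to_trec(["q1\td3\t1", "q1\td1\t2"]):
    print(trec_line)
```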


In [8]:
!gsutil cp msmarco_passage_trec_eval.txt msmarco_passage_mrr_eval.txt gs://luanps/information_retrieval/pyserini/ance/

Copying file://trec_eval.txt [Content-Type=text/plain]...
Copying file://mrr_eval.txt [Content-Type=text/plain]...
/ [2 files][   1023 B/   1023 B]                                                
Operation completed over 2 objects/1023.0 B.                                     


## Dense retrieval with ANCE - MSMARCO Document Ranking

In [None]:
!python -m pyserini.dsearch --topics msmarco-doc-dev \
                             --index msmarco-doc-ance-maxp-bf \
                             --encoded-queries ance_maxp-msmarco-doc-dev \
                             --output run.msmarco-doc.passage.ance-maxp.txt \
                             --hits 1000 \
                             --max-passage \
                             --max-passage-hits 100 \
                             --output-format msmarco \
                             --batch-size 36 \
                             --threads 12

Using pre-defined topic order for msmarco-doc-dev
Attempting to initialize pre-encoded queries ance_maxp-msmarco-doc-dev.
Downloading index at https://github.com/castorini/pyserini-data/raw/main/encoded-queries/query-embedding-ance_maxp-msmarco-doc-dev-20210419-9323ec.tar.gz...
query-embedding-ance_maxp-msmarco-doc-dev-20210419-9323ec.tar.gz: 14.2MB [00:00, 16.4MB/s]                
Extracting /root/.cache/pyserini/queries/query-embedding-ance_maxp-msmarco-doc-dev-20210419-9323ec.tar.gz into /root/.cache/pyserini/queries/query-embedding-ance_maxp-msmarco-doc-dev-20210419-9323ec.3d41ae797cb97e42649c4f4fa7b97d56...
Initializing ance_maxp-msmarco-doc-dev...
Attempting to initialize pre-built index msmarco-doc-ance-maxp-bf.
Downloading index at https://rgw.cs.uwaterloo.ca/JIMMYLIN-bucket0/pyserini-indexes/dindex-msmarco-doc-ance_maxp-bf-20210304-b2a1b0.tar.gz...
dindex-msmarco-doc-ance_maxp-bf-20210304-b2a1b0.tar.gz:  10% 5.36G/54.3G [03:57<29:30, 29.7MB/s]
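
The `--max-passage` options above ask for document rankings built from passage-level hits: each document is scored by its best passage (MaxP). A minimal sketch of that aggregation follows, assuming passage ids of the form `docid#segment` and that the final list is cut to a fixed number of documents. The delimiter, ids, scores, and function name are all hypothetical here, not taken from the Pyserini source.

```python
from typing import Dict, Iterable, List, Tuple


def maxp_aggregate(
    passage_hits: Iterable[Tuple[str, float]],
    k: int = 100,
    delimiter: str = "#",
) -> List[Tuple[str, float]]:
    """Collapse passage hits into document hits, keeping each document's best score."""
    best: Dict[str, float] = {}
    for passage_id, score in passage_hits:
        docid = passage_id.split(delimiter)[0]  # assumed 'docid#segment' id layout
        if docid not in best or score > best[docid]:
            best[docid] = score
    ranked = sorted(best.items(), key=lambda item: item[1], reverse=True)
    return ranked[:k]


# Toy passage-level hits (hypothetical ids and scores).
toy_hits = [("D1#0", 0.70), ("D2#3", 0.90), ("D1#2", 0.95), ("D3#1", 0.20)]
for docid, score in maxp_aggregate(toy_hits, k=2):
    print(docid, score)
```

Retrieving more passages than the number of documents finally kept (1000 versus 100 in the command above) leaves room for several passages of the same document to collapse into one entry.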

In [None]:
!gsutil cp run.* gs://luanps/information_retrieval/pyserini/ance/

### Evaluation

In [None]:
#MRR Eval
!python -m pyserini.eval.msmarco_doc_eval --judgments msmarco-doc-dev --run run.msmarco-doc.passage.ance-maxp.txt >> msmarco_document_mrr_eval.txt

In [None]:
#TREC Eval
! python -m pyserini.eval.convert_msmarco_run_to_trec_run --input run.msmarco-doc.passage.ance-maxp.txt \
                                                          --output run.msmarco-doc.passage.ance-maxp.trec

! python -m pyserini.eval.trec_eval -c -mrecall.100 -mmap msmarco-doc-dev run.msmarco-doc.passage.ance-maxp.trec >> msmarco_document_trec_eval.txt