<a href="https://colab.research.google.com/github/luanps/pyserini/blob/master/Run_pyserini_tct_colbert_v2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


# TCT-ColBERT V2 Passage Ranking on MS MARCO

Sheng-Chieh Lin, Jheng-Hong Yang, and Jimmy Lin. In-Batch Negatives for Knowledge Distillation with Tightly-Coupled Teachers for Dense Retrieval. Proceedings of the 6th Workshop on Representation Learning for NLP (RepL4NLP-2021), pages 163-173, August 2021.


Summary of results (figures from the paper are in parentheses):

| Condition | MRR@10 (paper) | MAP | Recall@1000 |
|:----------|-------:|----:|------------:|
| TCT_ColBERT-V2 (brute-force index) |  0.3440 (0.344) | 0.3509 | 0.9670 |
| TCT_ColBERT-V2-HN (brute-force index) |  0.3543 (0.354) | 0.3608 | 0.9708 |
| TCT_ColBERT-V2-HN+ (brute-force index) | 0.3585 (0.359) | 0.3645 | 0.9695 |
| TCT_ColBERT-V2-HN+ (brute-force index) + BoW BM25 | 0.3682 (0.369)  | 0.3737 | 0.9707 |
| TCT_ColBERT-V2-HN+ (brute-force index) + BM25 w/ doc2query-T5 | 0.3731 (0.375) | 0.3789 | 0.9759 |

## Install dependencies

In [6]:
from google.colab import auth
auth.authenticate_user()

In [7]:
%%capture
!pip install pyserini

import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-11-openjdk-amd64"

In [8]:
!pip install faiss-cpu

Collecting faiss-cpu
  Downloading faiss_cpu-1.7.2-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (8.6 MB)
     |████████████████████████████████| 8.6 MB 9.0 MB/s 
Installing collected packages: faiss-cpu
Successfully installed faiss-cpu-1.7.2


## DENSE RETRIEVAL

### Dense retrieval with TCT-ColBERT V2, brute-force index


In [4]:
!python -m pyserini.dsearch --topics msmarco-passage-dev-subset \
                             --index msmarco-passage-tct_colbert-v2-bf \
                             --encoded-queries tct_colbert-v2-msmarco-passage-dev-subset \
                             --batch-size 36 \
                             --threads 12 \
                             --output run.msmarco-passage.tct_colbert-v2.bf.tsv \
                             --output-format msmarco

Using pre-defined topic order for msmarco-passage-dev-subset
Attempting to initialize pre-encoded queries tct_colbert-v2-msmarco-passage-dev-subset.
Downloading index at https://github.com/castorini/pyserini-data/raw/main/encoded-queries/query-embedding-tct_colbert-v2-msmarco-passage-dev-subset-20210608-5f341b.tar.gz...
query-embedding-tct_colbert-v2-msmarco-passage-dev-subset-20210608-5f341b.tar.gz: 19.1MB [00:03, 6.20MB/s]                
Extracting /root/.cache/pyserini/queries/query-embedding-tct_colbert-v2-msmarco-passage-dev-subset-20210608-5f341b.tar.gz into /root/.cache/pyserini/queries/query-embedding-tct_colbert-v2-msmarco-passage-dev-subset-20210608-5f341b.ee8d76e596aef02c5027a2ffd0ff66f8...
Initializing tct_colbert-v2-msmarco-passage-dev-subset...
Attempting to initialize pre-built index msmarco-passage-tct_colbert-v2-bf.
Downloading index at https://rgw.cs.uwaterloo.ca/JIMMYLIN-bucket0/pyserini-indexes/dindex-msmarco-passage-tct_colbert-v2-bf-20210608-5f341b.tar.gz...
dind

In [5]:
!gsutil cp run.msmarco-passage.tct_colbert-v2.bf.tsv gs://luanps/information_retrieval/pyserini/tct_colbert_v2_bf/

Copying file://run.msmarco-passage.tct_colbert-v2.bf.tsv [Content-Type=text/tab-separated-values]...
Operation completed over 1 objects/127.0 MiB.                                    


#### Evaluation

In [6]:
#MRR Eval
!python -m pyserini.eval.msmarco_passage_eval msmarco-passage-dev-subset run.msmarco-passage.tct_colbert-v2.bf.tsv >> bf_mrr_eval.txt


msmarco_passage_eval.py: 8.00kB [00:00, 37.6kB/s]
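
For readers who want to see what the MRR@10 figure measures, here is a minimal sketch of the metric on toy data. It is not the code of `pyserini.eval.msmarco_passage_eval`; the function name `mrr_at_k` and the query and passage ids are hypothetical, and the sketch assumes the score is averaged over every query that has at least one relevance judgment.

```python
def mrr_at_k(qrels, run, k=10):
    """Mean reciprocal rank of the first relevant passage within the top k.

    qrels: dict mapping query id -> set of relevant passage ids
    run:   dict mapping query id -> list of passage ids, best first
    """
    total = 0.0
    for qid, relevant in qrels.items():
        ranked = run.get(qid, [])[:k]
        for rank, docid in enumerate(ranked, start=1):
            if docid in relevant:
                total += 1.0 / rank  # only the first relevant hit counts
                break
    # Assumption: queries with no relevant hit in the top k contribute 0.
    return total / len(qrels)


# Toy data with hypothetical ids.
qrels = {"q1": {"d3"}, "q2": {"d7"}, "q3": {"d9"}}
run = {"q1": ["d3", "d1"], "q2": ["d2", "d7"], "q3": ["d4", "d5"]}

print(mrr_at_k(qrels, run))  # (1/1 + 1/2 + 0) / 3 = 0.5
```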


In [None]:
#TREC Eval
!python -m pyserini.eval.convert_msmarco_run_to_trec_run --input run.msmarco-passage.tct_colbert-v2.bf.tsv \
                                                         --output run.msmarco-passage.tct_colbert-v2.bf.trec

!python -m pyserini.eval.trec_eval -c -mrecall.1000 \
                                      -mmap msmarco-passage-dev-subset \
                                      run.msmarco-passage.tct_colbert-v2.bf.trec >> bf_trec_eval.txt
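
The conversion step above turns the three-column MS MARCO run (`qid`, `docid`, `rank`, tab-separated) into the six-column TREC run format that `trec_eval` expects. The sketch below illustrates the idea only. The function name `msmarco_to_trec`, the run tag, and the sample ids are hypothetical, and using the reciprocal rank as a placeholder score is an assumption, since the MS MARCO format carries no score.

```python
def msmarco_to_trec(lines, tag="sketch"):
    """Convert 'qid<TAB>docid<TAB>rank' lines to TREC run lines."""
    out = []
    for line in lines:
        qid, docid, rank = line.rstrip("\n").split("\t")
        # Assumption: 1/rank is used as a stand-in score so that
        # sorting by score reproduces the original ranking.
        score = 1.0 / int(rank)
        out.append(f"{qid} Q0 {docid} {rank} {score:.6f} {tag}")
    return out


# Hypothetical sample lines in MS MARCO run format.
sample = ["1048585\t7187158\t1\n", "1048585\t7187157\t2\n"]
for row in msmarco_to_trec(sample):
    print(row)
# 1048585 Q0 7187158 1 1.000000 sketch
# 1048585 Q0 7187157 2 0.500000 sketch
```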

In [8]:
!gsutil cp bf_trec_eval.txt bf_mrr_eval.txt gs://luanps/information_retrieval/pyserini/tct_colbert_v2_bf/

Copying file://bf_trec_eval.txt [Content-Type=text/plain]...
Copying file://bf_mrr_eval.txt [Content-Type=text/plain]...
/ [2 files][  1.0 KiB/  1.0 KiB]                                                
Operation completed over 2 objects/1.0 KiB.                                      


### Dense retrieval with TCT-ColBERT V2-HN


In [None]:
!python -m pyserini.dsearch --topics msmarco-passage-dev-subset \
                             --index msmarco-passage-tct_colbert-v2-hn-bf \
                             --encoded-queries tct_colbert-v2-hn-msmarco-passage-dev-subset \
                             --batch-size 36 \
                             --threads 12 \
                             --output run.msmarco-passage.tct_colbert-v2-hn.bf.tsv \
                             --output-format msmarco

Using pre-defined topic order for msmarco-passage-dev-subset
Attempting to initialize pre-encoded queries tct_colbert-v2-hn-msmarco-passage-dev-subset.
Downloading index at https://github.com/castorini/pyserini-data/raw/main/encoded-queries/query-embedding-tct_colbert-v2-hn-msmarco-passage-dev-subset-20210608-5f341b.tar.gz...
query-embedding-tct_colbert-v2-hn-msmarco-passage-dev-subset-20210608-5f341b.tar.gz: 19.1MB [00:02, 9.55MB/s]                
Extracting /root/.cache/pyserini/queries/query-embedding-tct_colbert-v2-hn-msmarco-passage-dev-subset-20210608-5f341b.tar.gz into /root/.cache/pyserini/queries/query-embedding-tct_colbert-v2-hn-msmarco-passage-dev-subset-20210608-5f341b.f7e39cf2cd3ee53f7f8f2e0a1821431c...
Initializing tct_colbert-v2-hn-msmarco-passage-dev-subset...
Attempting to initialize pre-built index msmarco-passage-tct_colbert-v2-hn-bf.
Downloading index at https://rgw.cs.uwaterloo.ca/JIMMYLIN-bucket0/pyserini-indexes/dindex-msmarco-passage-tct_colbert-v2-hn-bf-202106

In [15]:
!gsutil cp run.msmarco-passage.tct_colbert-v2-hn.bf.tsv gs://luanps/information_retrieval/pyserini/tct_colbert_v2_hn/

Copying file://run.msmarco-passage.tct_colbert-v2-hn.bf.tsv [Content-Type=text/tab-separated-values]...
\ [1 files][127.0 MiB/127.0 MiB]                                                
Operation completed over 1 objects/127.0 MiB.                                    


#### Evaluation

In [16]:
#MRR Eval
!python -m pyserini.eval.msmarco_passage_eval msmarco-passage-dev-subset run.msmarco-passage.tct_colbert-v2-hn.bf.tsv >> hn_mrr_eval.txt


In [17]:
#TREC Eval
!python -m pyserini.eval.convert_msmarco_run_to_trec_run --input run.msmarco-passage.tct_colbert-v2-hn.bf.tsv \
                                                         --output run.msmarco-passage.tct_colbert-v2-hn.bf.trec

!python -m pyserini.eval.trec_eval -c -mrecall.1000 \
                                      -mmap msmarco-passage-dev-subset \
                                      run.msmarco-passage.tct_colbert-v2-hn.bf.trec >> hn_trec_eval.txt

Done!
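
The `trec_eval` call reports `map` and `recall.1000`, and the `-c` flag averages over all queries in the relevance judgments rather than only those present in the run. As a rough illustration of the per-query quantities behind those numbers, here is a small sketch on toy data. The function names and ids are hypothetical, and this is not the `trec_eval` implementation.

```python
def average_precision(relevant, ranked):
    """Average of precision values at the rank of each relevant hit."""
    if not relevant:
        return 0.0
    hits, precision_sum = 0, 0.0
    for rank, docid in enumerate(ranked, start=1):
        if docid in relevant:
            hits += 1
            precision_sum += hits / rank
    # Relevant passages that were never retrieved contribute 0.
    return precision_sum / len(relevant)


def recall_at_k(relevant, ranked, k=1000):
    """Fraction of the relevant passages found in the top k."""
    if not relevant:
        return 0.0
    return len(relevant & set(ranked[:k])) / len(relevant)


# Toy data with hypothetical ids.
relevant = {"d1", "d4"}
ranked = ["d1", "d2", "d3", "d4"]

print(average_precision(relevant, ranked))  # (1/1 + 2/4) / 2 = 0.75
print(recall_at_k(relevant, ranked, k=3))   # 1 of 2 relevant in top 3 = 0.5
```

MAP is then the mean of `average_precision` over queries, and Recall@1000 is the mean of `recall_at_k` with `k=1000`.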


In [19]:
!gsutil cp hn_trec_eval.txt hn_mrr_eval.txt gs://luanps/information_retrieval/pyserini/tct_colbert_v2_hn/

Copying file://hn_trec_eval.txt [Content-Type=text/plain]...
Copying file://hn_mrr_eval.txt [Content-Type=text/plain]...
/ [2 files][  1.2 KiB/  1.2 KiB]                                                
Operation completed over 2 objects/1.2 KiB.                                      


## HYBRID DENSE-SPARSE RETRIEVAL with TCT_ColBERT-V2-HN+

### Hybrid retrieval with dense-sparse representations (without document expansion)

- dense retrieval with TCT-ColBERT V2-HN+, brute-force index.
- sparse retrieval with BM25 on the `msmarco-passage` (i.e., default bag-of-words) index.
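
To make the role of `--alpha` in the command below more concrete, here is a minimal sketch of dense-sparse score fusion. It assumes a linear combination in which `alpha` weights the sparse (BM25) score and is added to the dense score, and that a passage missing from one list is backfilled with that list's minimum score. Both choices are assumptions for illustration, not a statement of how `pyserini.hsearch` is implemented, and the function name `fuse` and all ids and scores are hypothetical.

```python
def fuse(dense, sparse, alpha, k=1000):
    """Combine two {docid: score} dicts into one ranked list.

    Assumed rule: fused = dense_score + alpha * sparse_score.
    """
    # Assumption: a passage absent from one list gets that list's lowest score.
    min_dense = min(dense.values()) if dense else 0.0
    min_sparse = min(sparse.values()) if sparse else 0.0
    fused = {}
    for docid in set(dense) | set(sparse):
        d = dense.get(docid, min_dense)
        s = sparse.get(docid, min_sparse)
        fused[docid] = d + alpha * s
    # Highest fused score first, truncated to k results.
    return sorted(fused.items(), key=lambda kv: kv[1], reverse=True)[:k]


# Hypothetical scores: d2 trails d1 on the dense side but has a strong BM25 score.
dense = {"d1": 70.0, "d2": 69.5}
sparse = {"d2": 20.0, "d3": 10.0}

ranking = fuse(dense, sparse, alpha=0.06)
print([docid for docid, _ in ranking])  # ['d2', 'd1', 'd3']
```

With a small `alpha`, the dense score dominates and the sparse score acts mostly as a tie-breaker between passages whose dense scores are close.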


In [None]:
!python -m pyserini.hsearch dense  --index msmarco-passage-tct_colbert-v2-hnp-bf \
                                    --encoded-queries tct_colbert-v2-hnp-msmarco-passage-dev-subset \
                             sparse --index msmarco-passage \
                             fusion --alpha 0.06 \
                             run    --topics msmarco-passage-dev-subset \
                                    --output run.msmarco-passage.tct_colbert-v2-hnp.bf.bm25.tsv \
                                    --batch-size 36 --threads 12 \
                                    --output-format msmarco

Using pre-defined topic order for msmarco-passage-dev-subset
Attempting to initialize pre-encoded queries tct_colbert-v2-hnp-msmarco-passage-dev-subset.
Downloading index at https://github.com/castorini/pyserini-data/raw/main/encoded-queries/query-embedding-tct_colbert-v2-hnp-msmarco-passage-dev-subset-20210608-5f341b.tar.gz...
query-embedding-tct_colbert-v2-hnp-msmarco-passage-dev-subset-20210608-5f341b.tar.gz: 19.2MB [00:01, 16.6MB/s]                
Extracting /root/.cache/pyserini/queries/query-embedding-tct_colbert-v2-hnp-msmarco-passage-dev-subset-20210608-5f341b.tar.gz into /root/.cache/pyserini/queries/query-embedding-tct_colbert-v2-hnp-msmarco-passage-dev-subset-20210608-5f341b.bed8036475774d12915c8af2a44612f4...
Initializing tct_colbert-v2-hnp-msmarco-passage-dev-subset...
Attempting to initialize pre-built index msmarco-passage-tct_colbert-v2-hnp-bf.
Downloading index at https://rgw.cs.uwaterloo.ca/JIMMYLIN-bucket0/pyserini-indexes/dindex-msmarco-passage-tct_colbert-v2-hnp-b

In [None]:
!gsutil cp run.msmarco-passage.tct_colbert-v2-hnp.bf.bm25.tsv gs://luanps/information_retrieval/pyserini/tct_colbert_v2_hnp/

#### Evaluation

In [None]:
#MRR Eval
!python -m pyserini.eval.msmarco_passage_eval msmarco-passage-dev-subset run.msmarco-passage.tct_colbert-v2-hnp.bf.bm25.tsv >> hnp_bm25_mrr_eval.txt


In [None]:
#TREC Eval
!python -m pyserini.eval.convert_msmarco_run_to_trec_run --input run.msmarco-passage.tct_colbert-v2-hnp.bf.bm25.tsv \
                                                         --output run.msmarco-passage.tct_colbert-v2-hnp.bf.bm25.trec

!python -m pyserini.eval.trec_eval -c -mrecall.1000 \
                                      -mmap msmarco-passage-dev-subset \
                                      run.msmarco-passage.tct_colbert-v2-hnp.bf.bm25.trec >> hnp_bm25_trec_eval.txt

In [None]:
!gsutil cp hnp_bm25_trec_eval.txt hnp_bm25_mrr_eval.txt gs://luanps/information_retrieval/pyserini/tct_colbert_v2_hnp/

### Hybrid retrieval with dense-sparse representations (with document expansion)

- dense retrieval with TCT-ColBERT V2-HN+, brute-force index.
- sparse retrieval with BM25 on the doc2query-T5 expanded index (`msmarco-passage-expanded`).


In [None]:
!python -m pyserini.hsearch dense  --index msmarco-passage-tct_colbert-v2-hnp-bf \
                                    --encoded-queries tct_colbert-v2-hnp-msmarco-passage-dev-subset \
                             sparse --index msmarco-passage-expanded \
                             fusion --alpha 0.1 \
                             run    --topics msmarco-passage-dev-subset \
                                    --output run.msmarco-passage.tct_colbert-v2-hnp.bf.doc2queryT5.tsv \
                                    --batch-size 36 --threads 12 \
                                    --output-format msmarco

In [None]:
!gsutil cp run.msmarco-passage.tct_colbert-v2-hnp.bf.doc2queryT5.tsv gs://luanps/information_retrieval/pyserini/tct_colbert_v2_hnp_bf_doc2queryT5/

#### Evaluation

In [None]:
#MRR Eval
!python -m pyserini.eval.msmarco_passage_eval msmarco-passage-dev-subset run.msmarco-passage.tct_colbert-v2-hnp.bf.doc2queryT5.tsv >> hnp_bf_doc2queryT5_mrr_eval.txt


In [None]:
#TREC Eval
!python -m pyserini.eval.convert_msmarco_run_to_trec_run --input run.msmarco-passage.tct_colbert-v2-hnp.bf.doc2queryT5.tsv \
                                                         --output run.msmarco-passage.tct_colbert-v2-hnp.bf.doc2queryT5.trec

!python -m pyserini.eval.trec_eval -c -mrecall.1000 \
                                      -mmap msmarco-passage-dev-subset \
                                      run.msmarco-passage.tct_colbert-v2-hnp.bf.doc2queryT5.trec >> hnp_bf_doc2queryT5_trec_eval.txt

In [None]:
!gsutil cp hnp_bf_doc2queryT5_trec_eval.txt hnp_bf_doc2queryT5_mrr_eval.txt gs://luanps/information_retrieval/pyserini/tct_colbert_v2_hnp_bf_doc2queryT5/