<a href="https://colab.research.google.com/github/luanps/pyserini/blob/master/Run_pyserini_tct_colbert.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


# TCT-ColBERT Passage Ranking on MS MARCO

Sheng-Chieh Lin, Jheng-Hong Yang, and Jimmy Lin. Distilling Dense Representations for Ranking using Tightly-Coupled Teachers. arXiv:2010.11386, October 2020.

Summary of results:

| Condition | MRR@10 | MAP | Recall@1000 |
|:----------|-------:|----:|------------:|
| TCT-ColBERT (brute-force index) | 0.3350 | 0.3416 | 0.9640 |
| TCT-ColBERT (HNSW index) | 0.3345 | 0.3410 | 0.9618 |
| TCT-ColBERT (brute-force index) + BoW BM25 | 0.3529 | 0.3594 | 0.9698 |
| TCT-ColBERT (brute-force index) + BM25 w/ doc2query-T5 | 0.3647 | 0.3711 | 0.9751 |

## Install dependencies

In [1]:
from google.colab import auth
auth.authenticate_user()

In [2]:
%%capture
!pip install pyserini

import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-11-openjdk-amd64"

In [3]:
!pip install faiss-cpu

Collecting faiss-cpu
  Downloading faiss_cpu-1.7.2-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (8.6 MB)
Installing collected packages: faiss-cpu
Successfully installed faiss-cpu-1.7.2


## Dense retrieval

### Dense retrieval with TCT-ColBERT, brute-force index


In [None]:
!python -m pyserini.dsearch --topics msmarco-passage-dev-subset \
                             --index msmarco-passage-tct_colbert-bf \
                             --encoded-queries tct_colbert-msmarco-passage-dev-subset \
                             --batch-size 36 \
                             --threads 12 \
                             --output run.msmarco-passage.tct_colbert.bf.tsv \
                             --output-format msmarco

Using pre-defined topic order for msmarco-passage-dev-subset
Attempting to initialize pre-encoded queries tct_colbert-msmarco-passage-dev-subset.
/root/.cache/pyserini/queries/query-embedding-tct_colbert-msmarco-passage-dev-subset-20210419-9323ec.b2fe6494241639153f26cc61acf3b39d already exists, skipping download.
Initializing tct_colbert-msmarco-passage-dev-subset...
Attempting to initialize pre-built index msmarco-passage-tct_colbert-bf.
/root/.cache/pyserini/indexes/dindex-msmarco-passage-tct_colbert-bf-20210112-be7119.7312e0e7acec2a686e994902ca064fc5 already exists, skipping download.
Initializing msmarco-passage-tct_colbert-bf...
tcmalloc: large alloc 27162083328 bytes == 0x55deaf9b2000 @  0x7f1e52919887 0x7f1cc9410ce1 0x7f1cc94196b3 0x7f1cc92b7f90 0x55dea4ed9258 0x55dea500d18e 0x55dea50069ee 0x55dea4f99bda 0x55dea500bd00 0x55dea4f99afa 0x55dea5007c0d 0x55dea50069ee 0x55dea4f9a271 0x55dea4fdb159 0x55dea4fd80a4 0x55dea4f98c52 0x55dea500bc25 0x55dea4f99afa 0x55dea500bd00 0x55dea50069
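
A brute-force (flat) index scores every passage vector against the query vector and keeps the top k, so it is exact but has to hold all vectors in memory. The sketch below illustrates the idea on toy vectors with plain Python. It is not Pyserini's implementation, `brute_force_search` is a hypothetical helper, and inner-product scoring is assumed. The last lines are a back-of-the-envelope memory estimate, assuming 8,841,823 passages stored as 768-dimensional float32 vectors; rounded up to 4 KiB pages, that estimate matches the 27,162,083,328-byte allocation reported by tcmalloc above.

```python
import heapq


def brute_force_search(query, passages, k=2):
    """Exact top-k search: score every passage by inner product with the query."""
    scored = (
        (sum(q * p for q, p in zip(query, vec)), docid)
        for docid, vec in passages.items()
    )
    top = heapq.nlargest(k, scored)
    return [(docid, score) for score, docid in top]


# Toy 3-dimensional vectors (illustrative values only).
passages = {
    "p1": [0.1, 0.9, 0.0],
    "p2": [0.8, 0.1, 0.1],
    "p3": [0.4, 0.4, 0.2],
}
query = [1.0, 0.0, 0.0]
hits = brute_force_search(query, passages, k=2)
print(hits)

# Rough memory estimate under the assumptions stated above.
num_passages, dim, bytes_per_float = 8_841_823, 768, 4
estimated_bytes = num_passages * dim * bytes_per_float
page_size = 4096
rounded_bytes = -(-estimated_bytes // page_size) * page_size  # round up to whole pages
print(estimated_bytes, rounded_bytes)
```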

In [None]:
!gsutil cp run.msmarco-passage.tct_colbert.bf.tsv gs://luanps/information_retrieval/pyserini/tct_colbert_bf/

Copying file://run.msmarco-passage.tct_colbert.bf.tsv [Content-Type=text/tab-separated-values]...
|
Operation completed over 1 objects/127.0 MiB.                                    


#### Evaluation

In [None]:
#MRR Eval
!python -m pyserini.eval.msmarco_passage_eval msmarco-passage-dev-subset run.msmarco-passage.tct_colbert.bf.tsv >> bf_mrr_eval.txt
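
MRR@10 averages, over all judged queries, the reciprocal rank of the first relevant passage within the top 10, counting 0 when none appears. A minimal sketch on made-up data follows; `mrr_at_k` and the toy `qrels` and `run` dictionaries are illustrative and are not part of Pyserini or of the official evaluation script.

```python
def mrr_at_k(qrels, run, k=10):
    """Mean reciprocal rank of the first relevant hit within the top k."""
    total = 0.0
    for qid, relevant in qrels.items():
        for rank, docid in enumerate(run.get(qid, [])[:k], start=1):
            if docid in relevant:
                total += 1.0 / rank
                break  # only the first relevant hit counts
    return total / len(qrels)


# Toy judgments and rankings (made up for illustration).
qrels = {"q1": {"d3"}, "q2": {"d7"}, "q3": {"d9"}}
run = {
    "q1": ["d3", "d1"],  # relevant at rank 1 -> 1.0
    "q2": ["d2", "d7"],  # relevant at rank 2 -> 0.5
    "q3": ["d4", "d5"],  # no relevant hit    -> 0.0
}
score = mrr_at_k(qrels, run)
print(round(score, 4))
```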


In [None]:
#TREC Eval
!python -m pyserini.eval.convert_msmarco_run_to_trec_run --input run.msmarco-passage.tct_colbert.bf.tsv \
                                                         --output run.msmarco-passage.tct_colbert.bf.trec

!python -m pyserini.eval.trec_eval -c -mrecall.1000 \
                                      -mmap msmarco-passage-dev-subset \
                                      run.msmarco-passage.tct_colbert.bf.trec >> bf_trec_eval.txt

Done!
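
The conversion step above is needed because the MS MARCO run file has three tab-separated columns (query id, passage id, rank), whereas `trec_eval` expects six columns (query id, `Q0`, document id, rank, score, run tag). The sketch below shows the shape of that transformation on made-up ids. The `msmarco_to_trec` helper, the `1/rank` score, and the `sketch` tag are assumptions for illustration, not a copy of Pyserini's converter.

```python
def msmarco_to_trec(lines, tag="sketch"):
    """Turn 'qid<TAB>docid<TAB>rank' lines into six-column TREC run lines."""
    converted = []
    for line in lines:
        qid, docid, rank = line.strip().split("\t")
        # Assumption: a synthetic score of 1/rank, which preserves the ordering.
        score = 1.0 / int(rank)
        converted.append(f"{qid} Q0 {docid} {rank} {score} {tag}")
    return converted


# Made-up query and passage ids.
trec_lines = msmarco_to_trec(["q1\td5\t1", "q1\td9\t2"])
for line in trec_lines:
    print(line)
```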


In [None]:
!gsutil cp bf_trec_eval.txt bf_mrr_eval.txt gs://luanps/information_retrieval/pyserini/tct_colbert_bf/

Copying file://trec_eval.txt [Content-Type=text/plain]...
Copying file://mrr_eval.txt [Content-Type=text/plain]...
/ [2 files][  1.8 KiB/  1.8 KiB]                                                
Operation completed over 2 objects/1.8 KiB.                                      


### Dense retrieval with TCT-ColBERT, HNSW index


In [5]:
!python -m pyserini.dsearch --topics msmarco-passage-dev-subset \
                             --index msmarco-passage-tct_colbert-hnsw \
                             --output run.msmarco-passage.tct_colbert.hnsw.tsv \
                             --output-format msmarco 

Using pre-defined topic order for msmarco-passage-dev-subset
Attempting to initialize pre-encoded queries tct_colbert-msmarco-passage-dev-subset.
Downloading index at https://github.com/castorini/pyserini-data/raw/main/encoded-queries/query-embedding-tct_colbert-msmarco-passage-dev-subset-20210419-9323ec.tar.gz...
query-embedding-tct_colbert-msmarco-passage-dev-subset-20210419-9323ec.tar.gz: 19.2MB [00:01, 10.2MB/s]                  
Extracting /root/.cache/pyserini/queries/query-embedding-tct_colbert-msmarco-passage-dev-subset-20210419-9323ec.tar.gz into /root/.cache/pyserini/queries/query-embedding-tct_colbert-msmarco-passage-dev-subset-20210419-9323ec.b2fe6494241639153f26cc61acf3b39d...
Initializing tct_colbert-msmarco-passage-dev-subset...
Attempting to initialize pre-built index msmarco-passage-tct_colbert-hnsw.
Downloading index at https://rgw.cs.uwaterloo.ca/JIMMYLIN-bucket0/pyserini-indexes/dindex-msmarco-passage-tct_colbert-hnsw-20210112-be7119.tar.gz...
dindex-msmarco-passage

In [6]:
!gsutil cp run.msmarco-passage.tct_colbert.hnsw.tsv gs://luanps/information_retrieval/pyserini/tct_colbert_hnsw/

Copying file://run.msmarco-passage.tct_colbert.hnsw.tsv [Content-Type=text/tab-separated-values]...
|
Operation completed over 1 objects/127.0 MiB.                                    
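
HNSW is an approximate nearest-neighbour index, so its ranked lists can differ slightly from the exact brute-force ones, which is consistent with the small drop in every metric in the summary table. One way to quantify the difference, sketched below on made-up data, is the mean top-k overlap between the two runs. The helper names are hypothetical, and the parser assumes the three-column `qid<TAB>docid<TAB>rank` layout of the MS MARCO run files.

```python
import io
from collections import defaultdict


def read_msmarco_run(lines):
    """Parse 'qid<TAB>docid<TAB>rank' lines into {qid: [docid, ...]} sorted by rank."""
    ranked = defaultdict(list)
    for line in lines:
        if not line.strip():
            continue
        qid, docid, rank = line.rstrip("\n").split("\t")
        ranked[qid].append((int(rank), docid))
    return {qid: [d for _, d in sorted(pairs)] for qid, pairs in ranked.items()}


def mean_overlap_at_k(exact, approx, k=10):
    """Average fraction of the exact top-k that the approximate run also returns."""
    fractions = []
    for qid, docs in exact.items():
        top_exact = set(docs[:k])
        top_approx = set(approx.get(qid, [])[:k])
        if top_exact:
            fractions.append(len(top_exact & top_approx) / len(top_exact))
    return sum(fractions) / len(fractions) if fractions else 0.0


# Toy runs standing in for the brute-force and HNSW TSV files.
bf = read_msmarco_run(io.StringIO("1\ta\t1\n1\tb\t2\n2\tc\t1\n2\td\t2\n"))
hnsw = read_msmarco_run(io.StringIO("1\ta\t1\n1\tx\t2\n2\tc\t1\n2\td\t2\n"))
overlap = mean_overlap_at_k(bf, hnsw, k=2)
print(overlap)
```

To compare real runs, pass open file handles for the two `.tsv` files instead of the `io.StringIO` objects.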


#### Evaluation

In [7]:
#MRR Eval
!python -m pyserini.eval.msmarco_passage_eval msmarco-passage-dev-subset run.msmarco-passage.tct_colbert.hnsw.tsv >> hnsw_mrr_eval.txt


msmarco_passage_eval.py: 8.00kB [00:00, 40.5kB/s]


In [8]:
#TREC Eval
!python -m pyserini.eval.convert_msmarco_run_to_trec_run --input run.msmarco-passage.tct_colbert.hnsw.tsv \
                                                         --output run.msmarco-passage.tct_colbert.hnsw.trec

!python -m pyserini.eval.trec_eval -c -mrecall.1000 \
                                      -mmap msmarco-passage-dev-subset \
                                      run.msmarco-passage.tct_colbert.hnsw.trec >> hnsw_trec_eval.txt

Done!
jtreceval-0.0.5-jar-with-dependencies.jar: 1.79MB [00:00, 4.39MB/s]                
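
The two `trec_eval` measures requested above are mean average precision (`-mmap`) and recall at 1000 (`-mrecall.1000`); the `-c` flag averages over every query in the relevance judgments, including queries with no results in the run. The sketch below computes the per-query quantities on made-up data. `average_precision` and `recall_at_k` are illustrative helpers, not `trec_eval` code.

```python
def average_precision(relevant, ranking):
    """Mean of precision values taken at the rank of each relevant document."""
    hits, precision_sum = 0, 0.0
    for rank, docid in enumerate(ranking, start=1):
        if docid in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant) if relevant else 0.0


def recall_at_k(relevant, ranking, k=1000):
    """Fraction of relevant documents that appear in the top k."""
    found = len(relevant & set(ranking[:k]))
    return found / len(relevant) if relevant else 0.0


# Toy judgments and ranking (made up for illustration).
relevant = {"d1", "d4"}
ranking = ["d1", "d2", "d3", "d4", "d5"]
ap = average_precision(relevant, ranking)   # (1/1 + 2/4) / 2
rec = recall_at_k(relevant, ranking, k=3)   # only d1 is in the top 3
print(ap, rec)
```

MAP is then the mean of `average_precision` over all queries.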


In [9]:
!gsutil cp hnsw_trec_eval.txt hnsw_mrr_eval.txt gs://luanps/information_retrieval/pyserini/tct_colbert_hnsw/

Copying file://hnsw_trec_eval.txt [Content-Type=text/plain]...
Copying file://hnsw_mrr_eval.txt [Content-Type=text/plain]...
/ [2 files][  1.0 KiB/  1.0 KiB]                                                
Operation completed over 2 objects/1.0 KiB.                                      


## Hybrid dense-sparse retrieval

### Hybrid retrieval with dense-sparse representations (without document expansion)

This run fuses two rankings:

- Dense retrieval with TCT-ColBERT, brute-force index.
- Sparse retrieval with BM25 over the `msmarco-passage` (i.e., default bag-of-words) index.
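
The `--alpha` option in the fusion step weights one ranking against the other. A minimal sketch of such a fusion follows, assuming a linear interpolation `dense_score + alpha * sparse_score` in which a passage missing from one list receives that list's minimum score. Both the formula and the backfilling rule are assumptions about how the scores are combined and should be checked against the `pyserini.hsearch` source; `fuse` and the toy scores are hypothetical.

```python
def fuse(dense, sparse, alpha):
    """Combine two {docid: score} maps as dense + alpha * sparse."""
    # Assumption: a document missing from one list gets that list's minimum score.
    min_dense = min(dense.values()) if dense else 0.0
    min_sparse = min(sparse.values()) if sparse else 0.0
    fused = {}
    for docid in set(dense) | set(sparse):
        dense_score = dense.get(docid, min_dense)
        sparse_score = sparse.get(docid, min_sparse)
        fused[docid] = dense_score + alpha * sparse_score
    return sorted(fused.items(), key=lambda item: item[1], reverse=True)


# Toy scores (made up): d1 wins on dense evidence alone,
# but d2 overtakes it once sparse evidence is added.
dense = {"d1": 70.0, "d2": 68.0}
sparse = {"d2": 30.0, "d3": 10.0}
ranking = fuse(dense, sparse, alpha=0.12)
print(ranking)
```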


In [10]:
!python -m pyserini.hsearch dense  --index msmarco-passage-tct_colbert-bf \
                                    --encoded-queries tct_colbert-msmarco-passage-dev-subset \
                             sparse --index msmarco-passage \
                             fusion --alpha 0.12 \
                             run    --topics msmarco-passage-dev-subset \
                                    --output run.msmarco-passage.tct_colbert.bf.bm25.tsv \
                                    --batch-size 36 --threads 12 \
                                    --output-format msmarco

Using pre-defined topic order for msmarco-passage-dev-subset
Attempting to initialize pre-encoded queries tct_colbert-msmarco-passage-dev-subset.
/root/.cache/pyserini/queries/query-embedding-tct_colbert-msmarco-passage-dev-subset-20210419-9323ec.b2fe6494241639153f26cc61acf3b39d already exists, skipping download.
Initializing tct_colbert-msmarco-passage-dev-subset...
Attempting to initialize pre-built index msmarco-passage-tct_colbert-bf.
Downloading index at https://rgw.cs.uwaterloo.ca/JIMMYLIN-bucket0/pyserini-indexes/dindex-msmarco-passage-tct_colbert-bf-20210112-be7119.tar.gz...
dindex-msmarco-passage-tct_colbert-bf-20210112-be7119.tar.gz: 23.5GB [16:16, 25.8MB/s]                
Extracting /root/.cache/pyserini/indexes/dindex-msmarco-passage-tct_colbert-bf-20210112-be7119.tar.gz into /root/.cache/pyserini/indexes/dindex-msmarco-passage-tct_colbert-bf-20210112-be7119.7312e0e7acec2a686e994902ca064fc5...
Initializing msmarco-passage-tct_colbert-bf...
tcmalloc: large alloc 27162083328

In [11]:
!gsutil cp run.msmarco-passage.tct_colbert.bf.bm25.tsv gs://luanps/information_retrieval/pyserini/tct_colbert_bf_bm25/

Copying file://run.msmarco-passage.tct_colbert.bf.bm25.tsv [Content-Type=text/tab-separated-values]...
\ [1 files][127.0 MiB/127.0 MiB]                                                
Operation completed over 1 objects/127.0 MiB.                                    


#### Evaluation

In [12]:
#MRR Eval
!python -m pyserini.eval.msmarco_passage_eval msmarco-passage-dev-subset run.msmarco-passage.tct_colbert.bf.bm25.tsv >> bf_bm25_mrr_eval.txt


In [13]:
#TREC Eval
!python -m pyserini.eval.convert_msmarco_run_to_trec_run --input run.msmarco-passage.tct_colbert.bf.bm25.tsv \
                                                         --output run.msmarco-passage.tct_colbert.bf.bm25.trec

!python -m pyserini.eval.trec_eval -c -mrecall.1000 \
                                      -mmap msmarco-passage-dev-subset \
                                      run.msmarco-passage.tct_colbert.bf.bm25.trec >> bf_bm25_trec_eval.txt

Done!


In [14]:
!gsutil cp bf_bm25_trec_eval.txt bf_bm25_mrr_eval.txt gs://luanps/information_retrieval/pyserini/tct_colbert_bf_bm25/

Copying file://bf_bm25_trec_eval.txt [Content-Type=text/plain]...
Copying file://bf_bm25_mrr_eval.txt [Content-Type=text/plain]...
/ [2 files][  1.2 KiB/  1.2 KiB]                                                
Operation completed over 2 objects/1.2 KiB.                                      


### Hybrid retrieval with dense-sparse representations (with document expansion)

This run fuses two rankings:

- Dense retrieval with TCT-ColBERT, brute-force index.
- Sparse retrieval with BM25 over the doc2query-T5 expanded index (`msmarco-passage-expanded`).
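
Document expansion with doc2query-T5 appends queries predicted by a T5 model to each passage before indexing, so that BM25 can match query terms the original passage lacks. The toy sketch below shows only the concatenation step and its effect on term overlap. The passage, the "predicted" queries, and both helpers are made up for illustration, and no model is run.

```python
def expand_passage(passage, predicted_queries):
    """Append predicted queries to the passage text before indexing."""
    return " ".join([passage] + list(predicted_queries))


def term_overlap(query, text):
    """Count distinct query terms that also occur in the text (whitespace tokens)."""
    return len(set(query.lower().split()) & set(text.lower().split()))


# Made-up passage and hand-written stand-ins for model-predicted queries.
passage = "The heart pumps blood through the body."
predicted = ["what does the heart do", "how does blood circulate"]
expanded = expand_passage(passage, predicted)

query = "how does blood circulate"
before = term_overlap(query, passage)
after = term_overlap(query, expanded)
print(before, after)
```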


In [15]:
!python -m pyserini.hsearch dense  --index msmarco-passage-tct_colbert-bf \
                                    --encoded-queries tct_colbert-msmarco-passage-dev-subset \
                             sparse --index msmarco-passage-expanded \
                             fusion --alpha 0.22 \
                             run    --topics msmarco-passage-dev-subset \
                                    --output run.msmarco-passage.tct_colbert.bf.doc2queryT5.tsv \
                                    --batch-size 36 --threads 12 \
                                    --output-format msmarco

Using pre-defined topic order for msmarco-passage-dev-subset
Attempting to initialize pre-encoded queries tct_colbert-msmarco-passage-dev-subset.
/root/.cache/pyserini/queries/query-embedding-tct_colbert-msmarco-passage-dev-subset-20210419-9323ec.b2fe6494241639153f26cc61acf3b39d already exists, skipping download.
Initializing tct_colbert-msmarco-passage-dev-subset...
Attempting to initialize pre-built index msmarco-passage-tct_colbert-bf.
/root/.cache/pyserini/indexes/dindex-msmarco-passage-tct_colbert-bf-20210112-be7119.7312e0e7acec2a686e994902ca064fc5 already exists, skipping download.
Initializing msmarco-passage-tct_colbert-bf...
tcmalloc: large alloc 27162083328 bytes == 0x560f2e8ce000 @  0x7f8ed3b43887 0x7f8e4a5f9ce1 0x7f8e4a6026b3 0x7f8e4a4a0f90 0x560f22ebc258 0x560f22ff018e 0x560f22fe99ee 0x560f22f7cbda 0x560f22feed00 0x560f22f7cafa 0x560f22feac0d 0x560f22fe99ee 0x560f22f7d271 0x560f22fbe159 0x560f22fbb0a4 0x560f22f7bc52 0x560f22feec25 0x560f22f7cafa 0x560f22feed00 0x560f22fe99

In [16]:
!gsutil cp run.msmarco-passage.tct_colbert.bf.doc2queryT5.tsv gs://luanps/information_retrieval/pyserini/tct_colbert_bf_doc2queryT5/

Copying file://run.msmarco-passage.tct_colbert.bf.doc2queryT5.tsv [Content-Type=text/tab-separated-values]...
-
Operation completed over 1 objects/127.0 MiB.                                    


#### Evaluation

In [17]:
#MRR Eval
!python -m pyserini.eval.msmarco_passage_eval msmarco-passage-dev-subset run.msmarco-passage.tct_colbert.bf.doc2queryT5.tsv >> bf_doc2queryT5_mrr_eval.txt


In [18]:
#TREC Eval
!python -m pyserini.eval.convert_msmarco_run_to_trec_run --input run.msmarco-passage.tct_colbert.bf.doc2queryT5.tsv \
                                                         --output run.msmarco-passage.tct_colbert.bf.doc2queryT5.trec

!python -m pyserini.eval.trec_eval -c -mrecall.1000 \
                                      -mmap msmarco-passage-dev-subset \
                                      run.msmarco-passage.tct_colbert.bf.doc2queryT5.trec >> bf_doc2queryT5_trec_eval.txt

Done!


In [20]:
!gsutil cp bf_doc2queryT5_trec_eval.txt bf_doc2queryT5_mrr_eval.txt gs://luanps/information_retrieval/pyserini/tct_colbert_bf_doc2queryT5/

Copying file://bf_doc2queryT5_trec_eval.txt [Content-Type=text/plain]...
Copying file://bf_doc2queryT5_mrr_eval.txt [Content-Type=text/plain]...
/ [2 files][  1.2 KiB/  1.2 KiB]                                                
Operation completed over 2 objects/1.2 KiB.                                      
