# ColBERTv2: Indexing & Search Notebook

We start by importing the relevant classes. As we'll see below, `Indexer` and `Searcher` are the key actors here. 

In [1]:
import os
import sys
sys.path.insert(0, '../')

from colbert.infra import Run, RunConfig, ColBERTConfig
from colbert.data import Queries, Collection
from colbert import Indexer, Searcher

The workflow here assumes an IR dataset: a set of queries and a corresponding collection of passages.

The classes `Queries` and `Collection` provide a convenient interface for working with such datasets.
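
To make the expected input concrete, here is a minimal sketch of reading such a file, assuming each line has the form `id<TAB>text` (which appears to be the layout of the `.tsv` files loaded later in this notebook). The helper `load_tsv` and the sample rows are hypothetical and only for illustration; the real `Queries` and `Collection` classes do more than this.

```python
import io

def load_tsv(handle):
    """Parse lines of the form `id<TAB>text` into a dict keyed by integer id."""
    rows = {}
    for line in handle:
        line = line.rstrip("\r\n")
        if not line:
            continue  # skip blank lines
        item_id, text = line.split("\t", 1)
        rows[int(item_id)] = text
    return rows

# Made-up sample rows standing in for a real questions/collection file.
sample = io.StringIO("0\thow often should I water basil?\n1\tcan I freeze fresh herbs?\n")
toy_queries = load_tsv(sample)
print(len(toy_queries), toy_queries[1])
```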

We will use the *dev set* of the **LoTTE benchmark** we recently introduced in the ColBERTv2 paper. The dev and test sets contain several domain-specific corpora, and we'll use the smallest dev set corpus, namely `lifestyle:dev`.

In [None]:
!mkdir -p downloads/

# ColBERTv2 checkpoint trained on MS MARCO Passage Ranking (388MB compressed)
!wget https://downloads.cs.stanford.edu/nlp/data/colbert/colbertv2/colbertv2.0.tar.gz -P downloads/
!tar -xvzf downloads/colbertv2.0.tar.gz -C downloads/

# The LoTTE dev and test sets (3.4GB compressed)
!wget https://downloads.cs.stanford.edu/nlp/data/colbert/colbertv2/lotte.tar.gz -P downloads/
!tar -xvzf downloads/lotte.tar.gz -C downloads/

In [2]:
dataroot = '/home/ubuntu/ColBERT/downloads/lotte_passages'  # adjust to wherever the LoTTE data lives on your machine
dataset = 'lifestyle'
datasplit = 'dev'

queries = os.path.join(dataroot, dataset, datasplit, 'questions.search.tsv')
collection = os.path.join(dataroot, dataset, datasplit, 'collection.tsv')

queries = Queries(path=queries)
collection = Collection(path=collection)

f'Loaded {len(queries)} queries and {len(collection):,} passages'

[Jun 30, 23:01:03] #> Loading the queries from /home/ubuntu/ColBERT/downloads/lotte_passages/lifestyle/dev/questions.search.tsv ...
[Jun 30, 23:01:03] #> Got 417 queries. All QIDs are unique.

[Jun 30, 23:01:03] #> Loading collection...
0M 


'Loaded 417 queries and 268,893 passages'

This loaded 417 queries and 269k passages. Let's inspect one query and one passage.

In [3]:
print(queries[24])
print()
print(collection[89852])
print()

are blossom end rot tomatoes edible?

I don't know what J-Rock is definitively as it could mean a number of things, however if you mean a 'metal' style rock guitar sound that I believe to be popular over there, then it can be quite simple: Strip all of your pedals and effects, and turn all dials to zero; Turn the volume up as far as it will go; Keep the Bass and Treble dials high; Turn down any 'Middle' dials until the sound becomes scooped - bear in the mind that middle frequencies are important for clarity, so find a decent compromise; Bring up the gain until you get the sound you require; Finally, bring the volume down to a more manageable level. This will get you a lot of the way, but without any clearer description of what you want to sound like, then I can't offer any more help.



## Indexing

For efficient search, we can pre-compute the ColBERT representations of all passages and index them.

Below, the `Indexer` takes a model checkpoint and writes a (compressed) index to disk. We then prepare a `Searcher` for retrieval from this index.

(With four Titan V GPUs, indexing should take about 13 minutes. The output is fairly long/ugly at the moment!)
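
Pre-computing passage representations works because ColBERT scores a passage by late interaction: each query token embedding is matched to its most similar passage token embedding, and these maxima are summed. The sketch below illustrates that scoring rule on toy 2-dimensional vectors in plain Python, assuming cosine similarity (the `similarity` value shown in the config printed during indexing). The function names and vectors are illustrative only; this is not the library's implementation, which operates on compressed embeddings.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)

def maxsim_score(query_embs, doc_embs):
    """Sum, over query tokens, of the best cosine match among passage tokens."""
    q = [l2_normalize(v) for v in query_embs]
    d = [l2_normalize(v) for v in doc_embs]
    total = 0.0
    for qv in q:
        total += max(sum(a * b for a, b in zip(qv, dv)) for dv in d)
    return total

# Toy embeddings (invented): two query tokens, two candidate passages.
toy_query = [[1.0, 0.0], [0.0, 1.0]]
passage_a = [[1.0, 0.0], [0.0, 2.0]]   # has a close match for each query token
passage_b = [[1.0, 1.0]]               # a single token halfway between both

print(maxsim_score(toy_query, passage_a))
print(maxsim_score(toy_query, passage_b))
```

Because the passage side of this computation does not depend on the query, the passage embeddings can be computed once and stored, which is what the indexing step does.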

In [4]:
nbits = 2   # encode each dimension with 2 bits
doc_maxlen = 300   # truncate passages at 300 tokens

checkpoint = 'colbert-ir/colbertv2.0'   # Hugging Face model ID; the local checkpoint extracted under downloads/ should also work
index_name = f'{dataset}.{datasplit}.{nbits}bits'

In [5]:
with Run().context(RunConfig(nranks=1, experiment='notebook')):  # nranks specifies the number of GPUs to use.
    config = ColBERTConfig(doc_maxlen=doc_maxlen, nbits=nbits)

    indexer = Indexer(checkpoint=checkpoint, config=config)
    indexer.index(name=index_name, collection=collection, overwrite=True)



[Jun 30, 23:01:10] #> Note: Output directory /home/ubuntu/ColBERT/docs/experiments/notebook/indexes/lifestyle.dev.2bits already exists


[Jun 30, 23:01:10] #> Will delete 1 files already at /home/ubuntu/ColBERT/docs/experiments/notebook/indexes/lifestyle.dev.2bits in 20 seconds...
#> Starting...
nranks = 1 	 num_gpus = 1 	 device=0
{
    "query_token_id": "[unused0]",
    "doc_token_id": "[unused1]",
    "query_token": "[Q]",
    "doc_token": "[D]",
    "ncells": null,
    "centroid_score_threshold": null,
    "ndocs": null,
    "index_path": null,
    "nbits": 2,
    "kmeans_niters": 4,
    "resume": false,
    "similarity": "cosine",
    "bsize": 64,
    "accumsteps": 1,
    "lr": 3e-6,
    "maxsteps": 500000,
    "save_every": null,
    "warmup": null,
    "warmup_bert": null,
    "relu": false,
    "nway": 2,
    "use_ib_negatives": false,
    "reranker": false,
    "distillation_alpha": 1.0,
    "ignore_scores": false,
    "model_name": null,
    "query_maxlen": 32,
    "attend_

0it [00:00, ?it/s]

[Jun 30, 23:05:35] [0] 		 #> Encoding 25000 passages..
[Jun 30, 23:05:59] [0] 		 #> Saving chunk 0: 	 25,000 passages and 3,779,083 embeddings. From #0 onward.


1it [00:24, 24.80s/it]

[Jun 30, 23:06:00] [0] 		 #> Encoding 25000 passages..
[Jun 30, 23:06:24] [0] 		 #> Saving chunk 1: 	 25,000 passages and 4,073,198 embeddings. From #25,000 onward.


2it [00:49, 24.90s/it]

[Jun 30, 23:06:25] [0] 		 #> Encoding 25000 passages..
[Jun 30, 23:06:50] [0] 		 #> Saving chunk 2: 	 25,000 passages and 4,442,623 embeddings. From #50,000 onward.


3it [01:15, 25.41s/it]

[Jun 30, 23:06:51] [0] 		 #> Encoding 25000 passages..
[Jun 30, 23:07:15] [0] 		 #> Saving chunk 3: 	 25,000 passages and 4,047,185 embeddings. From #75,000 onward.


4it [01:40, 25.32s/it]

[Jun 30, 23:07:16] [0] 		 #> Encoding 25000 passages..
[Jun 30, 23:07:40] [0] 		 #> Saving chunk 4: 	 25,000 passages and 3,953,755 embeddings. From #100,000 onward.


5it [02:05, 25.12s/it]

[Jun 30, 23:07:41] [0] 		 #> Encoding 25000 passages..
[Jun 30, 23:08:04] [0] 		 #> Saving chunk 5: 	 25,000 passages and 3,347,195 embeddings. From #125,000 onward.


6it [02:30, 24.91s/it]

[Jun 30, 23:08:05] [0] 		 #> Encoding 25000 passages..
[Jun 30, 23:08:29] [0] 		 #> Saving chunk 6: 	 25,000 passages and 3,441,185 embeddings. From #150,000 onward.


7it [02:54, 24.67s/it]

[Jun 30, 23:08:29] [0] 		 #> Encoding 25000 passages..
[Jun 30, 23:08:53] [0] 		 #> Saving chunk 7: 	 25,000 passages and 3,597,393 embeddings. From #175,000 onward.


8it [03:18, 24.53s/it]

[Jun 30, 23:08:54] [0] 		 #> Encoding 25000 passages..
[Jun 30, 23:09:17] [0] 		 #> Saving chunk 8: 	 25,000 passages and 3,698,831 embeddings. From #200,000 onward.


9it [03:42, 24.47s/it]

[Jun 30, 23:09:18] [0] 		 #> Encoding 25000 passages..
[Jun 30, 23:09:42] [0] 		 #> Saving chunk 9: 	 25,000 passages and 3,739,499 embeddings. From #225,000 onward.


10it [04:07, 24.44s/it]

[Jun 30, 23:09:42] [0] 		 #> Encoding 18893 passages..
[Jun 30, 23:10:00] [0] 		 #> Saving chunk 10: 	 18,893 passages and 2,636,639 embeddings. From #250,000 onward.


11it [04:25, 24.16s/it]


[Jun 30, 23:10:01] [0] 		 #> Checking all files were saved...
[Jun 30, 23:10:01] [0] 		 Found all files!
[Jun 30, 23:10:01] [0] 		 #> Building IVF...
[Jun 30, 23:10:01] [0] 		 #> Loading codes...
[Jun 30, 23:10:01] [0] 		 Sorting codes...


100%|██████████| 11/11 [00:00<00:00, 243.62it/s]


[Jun 30, 23:10:06] [0] 		 Getting unique codes...
[Jun 30, 23:10:06] #> Optimizing IVF to store map from centroids to list of pids..
[Jun 30, 23:10:06] #> Building the emb2pid mapping..
[Jun 30, 23:10:07] len(emb2pid) = 40756586


100%|██████████| 65536/65536 [00:03<00:00, 21399.16it/s]


[Jun 30, 23:10:10] #> Saved optimized IVF to /home/ubuntu/ColBERT/docs/experiments/notebook/indexes/lifestyle.dev.2bits/ivf.pid.pt
[Jun 30, 23:10:10] [0] 		 #> Saving the indexing metadata to /home/ubuntu/ColBERT/docs/experiments/notebook/indexes/lifestyle.dev.2bits/metadata.json ..
#> Joined...


In [6]:
indexer.get_index() # You can get the absolute path of the index, if needed.

'/home/ubuntu/ColBERT/docs/experiments/notebook/indexes/lifestyle.dev.2bits'
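
The log above gives enough numbers for a rough sanity check of the index. The sketch below computes the average number of embeddings per passage and a back-of-the-envelope estimate of the compressed size. It assumes 128-dimensional embeddings (the config dump is truncated before it shows the dimension, so this value is an assumption) and one 4-byte centroid ID per embedding plus `nbits` bits per dimension for the residual. The on-disk index also stores centroids, the IVF, and metadata, so treat the result as an approximation.

```python
num_passages = 268_893        # size of the collection loaded earlier
num_embeddings = 40_756_586   # len(emb2pid) reported in the indexing log
nbits = 2                     # bits per dimension, as configured for this index
dim = 128                     # ASSUMPTION: embedding dimension (not visible in the truncated config)
centroid_id_bytes = 4         # ASSUMPTION: one 32-bit centroid ID per embedding

avg_embeddings = num_embeddings / num_passages
bytes_per_embedding = centroid_id_bytes + dim * nbits // 8
approx_index_bytes = num_embeddings * bytes_per_embedding

print(f"{avg_embeddings:.1f} embeddings per passage on average")
print(f"{bytes_per_embedding} bytes per embedding")
print(f"~{approx_index_bytes / 1e9:.2f} GB for codes and residuals")
```

Under these assumptions, passages average about 152 embeddings each, and the codes and residuals take roughly 1.5 GB.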

## Search

Having built the index, we can now prepare a `Searcher` and search for individual query strings.

We can use the `queries` set we loaded earlier, or you can supply your own questions. Feel free to get creative! But keep in mind that this set of ~269k lifestyle passages can only answer a small, focused set of questions!

In [7]:
# To create the searcher using its relative name (i.e., not a full path), set
# experiment=value_used_for_indexing in the RunConfig.
with Run().context(RunConfig(experiment='notebook')):
    searcher = Searcher(index=index_name)


# If you want to customize the search latency/quality tradeoff, you can also supply a
# config=ColBERTConfig(ncells=.., centroid_score_threshold=.., ndocs=..) argument.
# The default settings with k <= 10 (1, 0.5, 256) give the fastest search,
# but you can get a more extensive search by setting larger values of k or by
# manually specifying more conservative ColBERTConfig settings (e.g. (4, 0.4, 4096)).

[Jun 30, 23:11:46] #> Loading collection...
0M 
[Jun 30, 23:11:51] #> Loading codec...
[Jun 30, 23:11:51] Loading decompress_residuals_cpp extension (set COLBERT_LOAD_TORCH_EXTENSION_VERBOSE=True for more info)...
[Jun 30, 23:11:52] Loading packbits_cpp extension (set COLBERT_LOAD_TORCH_EXTENSION_VERBOSE=True for more info)...
[Jun 30, 23:11:52] #> Loading IVF...
[Jun 30, 23:11:52] #> Loading doclens...


100%|██████████| 11/11 [00:00<00:00, 888.59it/s]

[Jun 30, 23:11:52] #> Loading codes and residuals...



100%|██████████| 11/11 [00:01<00:00,  8.44it/s]


In [22]:
import time

query = queries[37]   # take a query from the loaded set...

# ...or supply your own (this overrides the line above)
query = "How frequent are injuries in the Tour De France?"
print(f"#> {query}")

# Find the top-3 passages for this query and time the call
now = time.time()
results = searcher.search(query, k=3)
after = time.time()
print(after - now)   # elapsed seconds

# Print out the top-k retrieved passages
for passage_id, passage_rank, passage_score in zip(*results):
    print(f"\t [{passage_rank}] \t\t {passage_score:.1f} \t\t {searcher.collection[passage_id]}")

#> How frequent are injuries in the Tour De France?
0.040285587310791016
	 [1] 		 17.8 		 On average, every Dutch person makes a trip by bicycle 5.6 times per week. This works out as an average across the whole population of 2.5 km cycled every day. That's the highest figure for any population in the world. If we assume that people cycle every day of their lives to the age of 80, and that they cycle that 2.5 km every day of their life, they will ride a bike for a total of 73000 km during their lifetime. Divide it into 6.5 million and you find a figure that a typical Dutch cyclist can expect a "head/brain injury" once every 90 lifetimes Note that it doesn't say how serious the injuries have to be in order to be included. However, it does give total numbers of head/brain injuries per year as 550 + 1600 = 2150 which is more than ten times the total deaths of cyclists per year from all types of injuries. For the sake of making the maths easy, let's lazily (and very inaccurately) assume tha

## Batch Search

In many applications, you have a large batch of queries and you need to maximize the overall throughput. For that, you can use the `searcher.search_all(queries, k)` method, which returns a `Ranking` object that organizes the results across all queries.

(Batching provides many opportunities for higher-throughput search, though we have not implemented most of those optimizations for compressed indexes yet.)

In [24]:
now = time.time()
rankings = searcher.search_all(queries, k=5).todict()
after = time.time()
t = after-now
print(t/len(queries))

100%|██████████| 417/417 [00:02<00:00, 152.75it/s]


0.007222590686605988
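
The per-query time printed above can also be read as throughput. A small helper along the following lines reports both, using `time.perf_counter` and a hypothetical stand-in for the search call; `measure` and `fake_search_all` are illustrative names, not part of ColBERT.

```python
import time

def measure(batch_fn, items):
    """Run batch_fn(items) once; return (result, seconds_per_item, items_per_second)."""
    start = time.perf_counter()
    result = batch_fn(items)
    elapsed = time.perf_counter() - start
    per_item = elapsed / len(items)
    throughput = len(items) / elapsed if elapsed > 0 else float("inf")
    return result, per_item, throughput

def fake_search_all(query_ids):
    # Hypothetical stand-in for searcher.search_all(queries, k=5).todict()
    return {qid: [] for qid in query_ids}

result, per_query, qps = measure(fake_search_all, list(range(417)))
print(f"{per_query * 1e3:.3f} ms/query, {qps:.0f} queries/s")
```

For the run shown here, about 0.0072 seconds per query corresponds to roughly 138 queries per second.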


In [25]:
rankings[30]  # For query 30, a list of (passage_id, rank, score) for the top-k passages

[(24367, 1, 16.015625),
 (35359, 2, 15.8125),
 (131623, 3, 15.75),
 (3789, 4, 15.7109375),
 (25089, 5, 15.703125)]
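
Since each entry is a list of `(passage_id, rank, score)` tuples, downstream metrics are easy to compute. As a sketch, the function below computes Success@k, the fraction of queries with at least one relevant passage in the top k. The gold passage IDs and the second query's ranking are invented for illustration; real relevance judgments would come from the benchmark's annotation files.

```python
def success_at_k(rankings, gold_pids, k=5):
    """Fraction of queries whose top-k results contain at least one gold passage."""
    if not gold_pids:
        return 0.0
    hits = 0
    for qid, relevant in gold_pids.items():
        top_k = [pid for pid, rank, _score in rankings.get(qid, []) if rank <= k]
        if any(pid in relevant for pid in top_k):
            hits += 1
    return hits / len(gold_pids)

toy_rankings = {
    # Copied from the output shown for query 30.
    30: [(24367, 1, 16.015625), (35359, 2, 15.8125), (131623, 3, 15.75),
         (3789, 4, 15.7109375), (25089, 5, 15.703125)],
    # Invented ranking for a second query.
    31: [(10, 1, 12.0), (11, 2, 11.5)],
}
toy_gold = {30: {131623}, 31: {99}}   # invented relevance labels

print(success_at_k(toy_rankings, toy_gold, k=5))
print(success_at_k(toy_rankings, toy_gold, k=2))
```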