# ColBERTv2: Indexing & Search Notebook

We start by importing the relevant classes. As we'll see below, `Indexer` and `Searcher` are the key actors here. 

In [2]:
import os
import sys
# sys.path.insert(0, '../')
os.chdir('ColBERT')

from colbert.infra import Run, RunConfig, ColBERTConfig
from colbert.data import Queries, Collection
from colbert import Indexer, Searcher

The workflow here assumes an IR dataset: a set of queries and a corresponding collection of passages.

The classes `Queries` and `Collection` provide a convenient interface for working with such datasets.
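
To make the data format concrete, here is a minimal stand-in for what these classes read. It assumes each file is tab-separated with an integer ID in the first column and the text in the second; `load_tsv` and the sample rows are hypothetical, not part of the `colbert` package.

```python
import csv
import io

def load_tsv(handle):
    """Read `id<TAB>text` rows into a dict keyed by integer id."""
    rows = {}
    for fields in csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE):
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        rows[int(fields[0])] = fields[1]
    return rows

# Tiny in-memory stand-in for a questions file (made-up rows).
sample = io.StringIO("0\thow do i prune tomato plants?\n1\tcan cats eat rice?\n")
toy_queries = load_tsv(sample)
print(len(toy_queries), toy_queries[1])
```

In the notebook itself we rely on `Queries` and `Collection`, which also handle logging and indexing by position.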

We will use the *dev set* of the **LoTTE benchmark** we recently introduced in the ColBERTv2 paper. The dev and test sets contain several domain-specific corpora, and we'll use the smallest dev set corpus, namely `lifestyle:dev`.

In [2]:
!mkdir -p downloads/

# ColBERTv2 checkpoint trained on MS MARCO Passage Ranking (388MB compressed)
!wget https://downloads.cs.stanford.edu/nlp/data/colbert/colbertv2/colbertv2.0.tar.gz -P downloads/
!tar -xvzf downloads/colbertv2.0.tar.gz -C downloads/

# The LoTTE dev and test sets (3.4GB compressed)
!wget https://downloads.cs.stanford.edu/nlp/data/colbert/colbertv2/lotte.tar.gz -P downloads/
!tar -xvzf downloads/lotte.tar.gz -C downloads/

--2022-05-24 21:31:43--  https://downloads.cs.stanford.edu/nlp/data/colbert/colbertv2/colbertv2.0.tar.gz
Resolving downloads.cs.stanford.edu (downloads.cs.stanford.edu)... 171.64.64.22
Connecting to downloads.cs.stanford.edu (downloads.cs.stanford.edu)|171.64.64.22|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 405924985 (387M) [application/octet-stream]
Saving to: ‘downloads/colbertv2.0.tar.gz’


2022-05-24 21:32:57 (5.23 MB/s) - ‘downloads/colbertv2.0.tar.gz’ saved [405924985/405924985]

colbertv2.0/
colbertv2.0/artifact.metadata
colbertv2.0/vocab.txt
colbertv2.0/tokenizer.json
colbertv2.0/special_tokens_map.json
colbertv2.0/tokenizer_config.json
colbertv2.0/config.json
colbertv2.0/pytorch_model.bin
--2022-05-24 21:33:02--  https://downloads.cs.stanford.edu/nlp/data/colbert/colbertv2/lotte.tar.gz
Resolving downloads.cs.stanford.edu (downloads.cs.stanford.edu)... 171.64.64.22
Connecting to downloads.cs.stanford.edu (downloads.cs.stanford.edu)|171.64.64.22|:4

In [52]:
dataroot = 'downloads/lotte'
dataset = 'lifestyle'
datasplit = 'dev'

queries = os.path.join(dataroot, dataset, datasplit, 'questions.search.tsv')
collection = os.path.join(dataroot, dataset, datasplit, 'collection.tsv')

queries = Queries(path=queries)
collection = Collection(path=collection)

f'Loaded {len(queries)} queries and {len(collection):,} passages'

[May 24, 23:04:50] #> Loading the queries from downloads/lotte/lifestyle/dev/questions.search.tsv ...
[May 24, 23:04:50] #> Got 417 queries. All QIDs are unique.

[May 24, 23:04:50] #> Loading collection...
0M 


'Loaded 417 queries and 268,893 passages'

This loaded 417 queries and 269k passages. Let's inspect one query and one passage.

In [5]:
# collection= collection[:10000]

In [53]:
print(queries[24])
print()
print(collection[9999])
print()

are blossom end rot tomatoes edible?

I'd say you shouldn't. She doesn't even let you touch her. If she has problems in the new place, she can't ask help from you. She might be too old to compete with the other cats from scratch. Now, she should have a territory of her own, and it is easier to defend than claim new one. She might have other people giving out food. The best you can do is to contact the neighbours to offer food at least from time to time, at a point close to your current feeding point. Even if they do it one-two times a week, it would help the cat a lot. Also, try to contact the new residents of your place, you never know when you meet a cat lover. The odds are stacked against moving the cat, but it doesn't matter much. Bad things happen to feral cats all the time and it might happen to her whether you move her or leave her. I hope this helps.



## Indexing

For efficient search, we can pre-compute the ColBERT representation of each passage and index them.

Below, the `Indexer` takes a model checkpoint and writes a (compressed) index to disk. We then prepare a `Searcher` for retrieval from this index.

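(With four Titan V GPUs, indexing should take about 13 minutes. The run logged below uses a single GPU, `nranks=1`, and its encoding progress bar reports 37:32 for 11 chunks. The output is fairly long/ugly at the moment!)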
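
To give a feel for what "compressed" means here, the sketch below stores an embedding as the ID of its nearest centroid plus a small per-dimension code for the residual. This is a simplified illustration, not the library's implementation: the centroids, the evenly spaced `cutoffs`, and the `bucket_values` are made-up assumptions, and all function names are hypothetical.

```python
from bisect import bisect_right

def nearest_centroid(vec, centroids):
    """Index of the centroid with the highest dot product with `vec`."""
    return max(range(len(centroids)),
               key=lambda i: sum(a * b for a, b in zip(vec, centroids[i])))

def compress(vec, centroids, cutoffs):
    """Return (centroid_id, per-dimension bucket codes) for one embedding."""
    cid = nearest_centroid(vec, centroids)
    residual = [v - c for v, c in zip(vec, centroids[cid])]
    codes = [bisect_right(cutoffs, r) for r in residual]
    return cid, codes

def decompress(cid, codes, centroids, bucket_values):
    """Approximate the embedding as centroid + representative residual."""
    return [c + bucket_values[k] for c, k in zip(centroids[cid], codes)]

nbits = 2
cutoffs = [-0.1, 0.0, 0.1]                   # 2**nbits - 1 boundaries (illustrative)
bucket_values = [-0.15, -0.05, 0.05, 0.15]   # one representative per bucket
centroids = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]

vec = [0.93, 0.12, -0.2, 0.04]
cid, codes = compress(vec, centroids, cutoffs)
approx = decompress(cid, codes, centroids, bucket_values)
print(cid, codes, approx)
```

With `nbits = 2`, every code fits in two bits, which is the setting used for the index below.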

In [54]:
nbits = 2   # encode each dimension with 2 bits
doc_maxlen = 300   # truncate passages at 300 tokens

checkpoint = 'downloads/colbertv2.0'
index_name = f'{dataset}.{datasplit}.{nbits}bits'

In [55]:
index_name

'lifestyle.dev.2bits'

In [56]:
collection

<colbert.data.collection.Collection at 0x7f9d10145280>

In [57]:
with Run().context(RunConfig(nranks=1, experiment='notebook')):  # nranks specifies the number of GPUs to use.
    config = ColBERTConfig(doc_maxlen=doc_maxlen, nbits=nbits)

    indexer = Indexer(checkpoint=checkpoint, config=config)
    indexer.index(name=index_name, collection=collection, overwrite=True)



[May 24, 23:06:08] #> Note: Output directory /home/zhanj289/projects/cs224u_nlu_project/ColBERT/docs/experiments/notebook/indexes/lifestyle.dev.2bits already exists


[May 24, 23:06:08] #> Will delete 10 files already at /home/zhanj289/projects/cs224u_nlu_project/ColBERT/docs/experiments/notebook/indexes/lifestyle.dev.2bits in 20 seconds...
#> Starting...
nranks = 1 	 num_gpus = 1 	 device=0
{
    "nprobe": 2,
    "ncandidates": 8192,
    "index_path": null,
    "nbits": 2,
    "kmeans_niters": 20,
    "similarity": "cosine",
    "bsize": 64,
    "accumsteps": 1,
    "lr": 1e-5,
    "maxsteps": 400000,
    "save_every": null,
    "resume": false,
    "warmup": 20000,
    "warmup_bert": null,
    "relu": false,
    "nway": 64,
    "use_ib_negatives": true,
    "reranker": false,
    "distillation_alpha": 1.0,
    "ignore_scores": false,
    "query_maxlen": 32,
    "attend_to_mask_tokens": false,
    "interaction": "colbert",
    "dim": 128,
    "doc_maxlen": 300,
    "mask_punctuation

0it [00:00, ?it/s]

[May 24, 23:54:31] [0] 		 #> Saving chunk 0: 	 25,000 passages and 3,779,083 embeddings. From #0 onward.


1it [03:20, 200.02s/it]

[May 24, 23:54:41] [0] 		 #> Encoding 25000 passages..
[May 24, 23:57:51] [0] 		 #> Saving chunk 1: 	 25,000 passages and 4,073,198 embeddings. From #25,000 onward.


2it [06:37, 198.26s/it]

[May 24, 23:57:58] [0] 		 #> Encoding 25000 passages..
[May 25, 00:01:19] [0] 		 #> Saving chunk 2: 	 25,000 passages and 4,442,623 embeddings. From #50,000 onward.


3it [10:06, 203.56s/it]

[May 25, 00:01:28] [0] 		 #> Encoding 25000 passages..
[May 25, 00:05:04] [0] 		 #> Saving chunk 3: 	 25,000 passages and 4,047,185 embeddings. From #75,000 onward.


4it [13:51, 211.66s/it]

[May 25, 00:05:12] [0] 		 #> Encoding 25000 passages..
[May 25, 00:08:48] [0] 		 #> Saving chunk 4: 	 25,000 passages and 3,953,755 embeddings. From #100,000 onward.


5it [17:34, 216.08s/it]

[May 25, 00:08:56] [0] 		 #> Encoding 25000 passages..
[May 25, 00:12:30] [0] 		 #> Saving chunk 5: 	 25,000 passages and 3,347,195 embeddings. From #125,000 onward.


6it [21:14, 217.41s/it]

[May 25, 00:12:36] [0] 		 #> Encoding 25000 passages..
[May 25, 00:15:43] [0] 		 #> Saving chunk 6: 	 25,000 passages and 3,441,185 embeddings. From #150,000 onward.


7it [24:27, 209.44s/it]

[May 25, 00:15:49] [0] 		 #> Encoding 25000 passages..
[May 25, 00:19:16] [0] 		 #> Saving chunk 7: 	 25,000 passages and 3,597,393 embeddings. From #175,000 onward.


8it [28:02, 210.96s/it]

[May 25, 00:19:23] [0] 		 #> Encoding 25000 passages..
[May 25, 00:22:42] [0] 		 #> Saving chunk 8: 	 25,000 passages and 3,698,831 embeddings. From #200,000 onward.


9it [31:27, 209.08s/it]

[May 25, 00:22:48] [0] 		 #> Encoding 25000 passages..
[May 25, 00:26:00] [0] 		 #> Saving chunk 9: 	 25,000 passages and 3,739,499 embeddings. From #225,000 onward.


10it [34:46, 205.99s/it]

[May 25, 00:26:07] [0] 		 #> Encoding 18893 passages..
[May 25, 00:28:48] [0] 		 #> Saving chunk 10: 	 18,893 passages and 2,636,639 embeddings. From #250,000 onward.


11it [37:32, 204.78s/it]


[May 25, 00:29:00] [0] 		 #> Saving the indexing metadata to /home/zhanj289/projects/cs224u_nlu_project/ColBERT/docs/experiments/notebook/indexes/lifestyle.dev.2bits/metadata.json ..
#> Joined...


In [58]:
indexer.get_index() # You can get the absolute path of the index, if needed.

'/home/zhanj289/projects/cs224u_nlu_project/ColBERT/docs/experiments/notebook/indexes/lifestyle.dev.2bits'
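
As a quick sanity check on the log above, we can total the per-chunk counts and estimate the space taken by the quantized residual codes alone. The figures are copied from the "Saving chunk" lines; `dim = 128` and `nbits = 2` come from the printed config. The estimate deliberately ignores centroid IDs, the centroids themselves, and other index metadata, so the on-disk size will be larger.

```python
# Embedding counts per chunk, copied from the indexing log.
chunk_embeddings = [
    3_779_083, 4_073_198, 4_442_623, 4_047_185, 3_953_755, 3_347_195,
    3_441_185, 3_597_393, 3_698_831, 3_739_499, 2_636_639,
]
chunk_passages = [25_000] * 10 + [18_893]

total_embeddings = sum(chunk_embeddings)
total_passages = sum(chunk_passages)

dim, nbits = 128, 2
residual_bytes = total_embeddings * dim * nbits // 8  # 32 bytes per embedding

print(f"{total_passages:,} passages, {total_embeddings:,} embeddings")
print(f"{total_embeddings / total_passages:.1f} embeddings per passage on average")
print(f"{residual_bytes / 1e9:.2f} GB of residual codes")
```

The passage total matches the 268,893 passages loaded earlier.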

## Search

Having built the index, we can now prepare a `Searcher` and search for individual query strings.

We can use the `queries` set we loaded earlier, or you can supply your own questions. Feel free to get creative! But keep in mind that this set of ~269k lifestyle passages can only answer a small, focused set of questions!
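
For orientation, the scores returned by search come from ColBERT's late-interaction (MaxSim) scoring: each query token embedding is matched to its most similar passage token embedding, and those maxima are summed. The sketch below shows the idea on toy 2-dimensional vectors with cosine similarity. It is an uncompressed illustration with hypothetical names, not the code path the `Searcher` runs.

```python
import math

def normalize(vec):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def maxsim_score(query_embs, doc_embs):
    """Sum, over query tokens, of the best cosine match among passage tokens."""
    q = [normalize(v) for v in query_embs]
    d = [normalize(v) for v in doc_embs]
    return sum(
        max(sum(a * b for a, b in zip(qv, dv)) for dv in d)
        for qv in q
    )

query_embs = [[1.0, 0.0], [0.0, 1.0]]              # two query tokens
doc_embs = [[1.0, 0.0], [1.0, 1.0], [-1.0, 0.0]]   # three passage tokens
score = maxsim_score(query_embs, doc_embs)
print(round(score, 4))
```

Because each term is a cosine similarity, the score in this sketch can never exceed the number of query token embeddings.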

In [59]:
# To create the searcher using its relative name (i.e., not a full path), set
# experiment=value_used_for_indexing in the RunConfig.
with Run().context(RunConfig(experiment='notebook')):
    searcher = Searcher(index=index_name)


# If you want to customize the search latency--quality tradeoff, you can also supply a
# config=ColBERTConfig(nprobe=.., ncandidates=..) argument. The default (2, 8192) works well,
# but you can trade away some latency to gain more extensive search with (4, 16384).
# Conversely, you can get faster search with (1, 4096).

[May 25, 00:29:06] #> Loading collection...
0M 
[May 25, 00:29:12] #> Building the emb2pid mapping..
[May 25, 00:29:12] len(self.emb2pid) = 40756586
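
The three `(nprobe, ncandidates)` settings mentioned in the comments above can be kept in a small lookup so they are easy to switch between. The preset names and the helper function are hypothetical conveniences; only the numeric pairs come from the notebook.

```python
# (nprobe, ncandidates) pairs quoted in the searcher cell's comments.
SEARCH_PRESETS = {
    "fast": (1, 4096),
    "default": (2, 8192),
    "thorough": (4, 16384),
}

def search_config_kwargs(preset="default"):
    """Return keyword arguments for a search config from a named preset."""
    nprobe, ncandidates = SEARCH_PRESETS[preset]
    return {"nprobe": nprobe, "ncandidates": ncandidates}

print(search_config_kwargs("thorough"))
```

Following the comment in the cell, these would be passed as `config=ColBERTConfig(**search_config_kwargs("thorough"))` when constructing the `Searcher`.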


In [62]:
results

([3736, 7629, 656, 2724, 314],
 [1, 2, 3, 4, 5],
 [15.3125, 14.84375, 14.6875, 14.6484375, 14.5703125])
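
As the output shows, `results` is a tuple of three parallel lists: passage IDs, ranks, and scores. If named fields are more convenient, one possible way to regroup them is sketched below, using the values printed above; `Hit` and `to_hits` are hypothetical helpers.

```python
from typing import NamedTuple

class Hit(NamedTuple):
    pid: int
    rank: int
    score: float

def to_hits(results):
    """Turn (pids, ranks, scores) parallel lists into a list of Hit records."""
    pids, ranks, scores = results
    return [Hit(*row) for row in zip(pids, ranks, scores)]

# Values copied from the cell output above.
example_results = (
    [3736, 7629, 656, 2724, 314],
    [1, 2, 3, 4, 5],
    [15.3125, 14.84375, 14.6875, 14.6484375, 14.5703125],
)

hits = to_hits(example_results)
margin = hits[0].score - hits[1].score
print(hits[0], margin)
```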

In [61]:
len(searcher.collection)

268893

In [63]:
searcher.collection[5555]

'First of all, dogs are dogs. The all have the same basic needs. And every dog can be trained. And even dogs of the same breed can be really different in behavior and character. But, you are right, not every breed is equally suitable for every task or life circumstances and the purebreeds are known for unique qualities. I want to bring the dog to a small (60-70 SqM) apartment. My main concern is the behavior. Can they be trained like a purebred? (Labs, Goldens, Border Collies, Aussies,...) These breeds you mentioned are really different. Border Collies and Aussies were bred for herding tasks. That means, they are highly intelligent, very sensible, need very much exercise, and are really self-confident. If you let them, they are working, running or playing until they die. They don\'t give up. They recognize every motion far away, because they were bred to see, if any sheep is running away. Labs and Goldens are bred for waterfowling and hunting. They are intelligent as well, need exercis

In [64]:
query = queries[37]   # or supply your own query

print(f"#> {query}")

# Find the top-5 passages for this query
results = searcher.search(query, k=5)

# Print out the top-k retrieved passages
for passage_id, passage_rank, passage_score in zip(*results):
    print(f"\t [{passage_rank}] \t\t {passage_score:.1f} \t\t {searcher.collection[passage_id]}")

#> what are white spots on raspberries?

#> QueryTokenizer.tensorize(batch_text[0], batch_background[0], bsize) ==
#> Input: . what are white spots on raspberries?, 		 True, 		 None
#> Output IDs: torch.Size([32]), tensor([  101,     1,  2054,  2024,  2317,  7516,  2006, 20710,  2361, 20968,
         1029,   102,   103,   103,   103,   103,   103,   103,   103,   103,
          103,   103,   103,   103,   103,   103,   103,   103,   103,   103,
          103,   103])
#> Output Mask: torch.Size([32]), tensor([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0])

	 [1] 		 26.0 		 You've got a heat problem, this is UV damage or excessive heat during the ripening phase and referred to as White Drupelet syndrome (white spot). It's quite common on Raspberries during the final crops of the year as summer heat increases. Also occurs in blackberries. Last year, we had issues in the US Pacific Northwest with it, only a handful at end of season 
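
The tokenizer debug output above shows how the query is laid out: a `[CLS]` token, a query marker, the question's word pieces, `[SEP]`, and then `[MASK]` padding up to `query_maxlen = 32`, with the attention mask set to 0 on the padding. The sketch below rebuilds those two tensors as plain lists. It assumes the standard BERT uncased vocabulary IDs (101, 102, 103) and ID 1 for the query marker, as seen in the output; `tensorize_query` is a hypothetical stand-in, and its truncation rule is a simplification.

```python
CLS, Q_MARKER, SEP, MASK = 101, 1, 102, 103

def tensorize_query(word_ids, query_maxlen=32):
    """Lay out a query as [CLS] [Q] words [SEP], padded with [MASK] tokens."""
    word_ids = list(word_ids)[:query_maxlen - 3]  # leave room for 3 special tokens
    ids = [CLS, Q_MARKER] + word_ids + [SEP]
    n_real = len(ids)
    n_pad = query_maxlen - n_real
    ids = ids + [MASK] * n_pad
    mask = [1] * n_real + [0] * n_pad             # padding is not attended to
    return ids, mask

# Word-piece IDs for the query, copied from the debug output.
word_ids = [2054, 2024, 2317, 7516, 2006, 20710, 2361, 20968, 1029]
ids, mask = tensorize_query(word_ids)
print(ids)
print(mask)
```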

## Batch Search

In many applications, you have a large batch of queries and you need to maximize the overall throughput. For that, you can use the `searcher.search_all(queries, k)` method, which returns a `Ranking` object that organizes the results across all queries.

(Batching provides many opportunities for higher-throughput search, though we have not implemented most of those optimizations for compressed indexes yet.)

In [None]:
rankings = searcher.search_all(queries, k=5).todict()

In [None]:
rankings[30]  # For query 30, a list of (passage_id, rank, score) for the top-k passages
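
Once rankings are in this dictionary form, a retrieval metric is straightforward to compute. The sketch below implements Success@k, the fraction of queries with at least one gold passage in the top k, on toy data shaped like `rankings`. The gold passage IDs here are made up; real ones would come from the benchmark's answer annotations, which this notebook does not load.

```python
def success_at_k(rankings, gold_pids, k=5):
    """Fraction of queries with at least one gold passage ranked in the top k."""
    hits = 0
    for qid, gold in gold_pids.items():
        top = [pid for pid, rank, score in rankings.get(qid, []) if rank <= k]
        hits += any(pid in gold for pid in top)
    return hits / len(gold_pids)

# Toy data in the same shape as `rankings`: qid -> [(passage_id, rank, score), ...]
toy_rankings = {
    0: [(11, 1, 20.5), (42, 2, 19.0)],
    1: [(7, 1, 18.2), (8, 2, 17.9)],
}
toy_gold = {0: {42}, 1: {99}}  # hypothetical gold passage IDs per query

print(success_at_k(toy_rankings, toy_gold, k=5))  # 0.5
```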