# ColBERTv2: Indexing & Search Notebook

If you're working in Google Colab, we recommend selecting "GPU" as your hardware accelerator in the runtime settings.

First, we'll clone the ColBERT repository and install its dependencies. Then we'll import the relevant classes; `Indexer` and `Searcher` are the key actors here.

In [None]:
!git -C ColBERT/ pull || git clone https://github.com/stanford-futuredata/ColBERT.git
import sys; sys.path.insert(0, 'ColBERT/')

In [None]:
try:  # When on Google Colab, let's install all dependencies with pip.
    import google.colab
    !pip install -U pip
    !pip install -e ColBERT/['faiss-gpu','torch']
except Exception:
    import sys; sys.path.insert(0, 'ColBERT/')
    try:
        from colbert import Indexer, Searcher
    except Exception:
        print("If you're running outside Colab, please make sure you install ColBERT in conda following the instructions in our README. You can also install (as above) with pip but it may install slower or less stable faiss or torch dependencies. Conda is recommended.")
        assert False

In [None]:
import colbert

In [None]:
from colbert import Indexer, Searcher
from colbert.infra import Run, RunConfig, ColBERTConfig
from colbert.data import Queries, Collection

The original ColBERTv2 demo uses the dev set of the **LoTTE benchmark** introduced in the ColBERTv2 paper. In this notebook we instead load a small custom collection from a local TSV file, `collections.tsv`, under `/content/data`.

The collection is small (247 passages in the run below), so we index all of it rather than a subset.

In [None]:
import os

dataroot = '/content/data'

collection = os.path.join(dataroot, 'collections.tsv')
collection = Collection(path=collection)

f'Loaded {len(collection)} passages'

[Dec 03, 11:12:24] #> Loading collection...
0M 


'Loaded 247 passages'
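`Collection` reads a tab-separated file. As a rough sketch, assuming the usual ColBERT layout of one `pid<TAB>passage` row per line with pids counting up from 0, and assuming the `Chapter@ Topic@ Paragraph` passage layout that shows up in this dataset, a compatible file could be produced as follows. The `write_collection_tsv` helper, the sample rows, and the temporary file location are all hypothetical.

```python
import os
import tempfile

def write_collection_tsv(rows, path):
    """Write (chapter, topic, paragraph) rows as `pid<TAB>passage` lines."""
    with open(path, "w", encoding="utf-8") as f:
        for pid, (chapter, topic, paragraph) in enumerate(rows):
            passage = f"{chapter}@ {topic}@ {paragraph}"
            # Collapse tabs and newlines, which would break the one-row-per-line layout.
            passage = " ".join(passage.split())
            f.write(f"{pid}\t{passage}\n")

# Made-up sample rows, for illustration only.
rows = [
    ("ID Card Policy", "Lost or Stolen ID Card", "Report a lost card to HR."),
    ("Leave Policy", "Sick Leave", "Inform your manager\nbefore the shift starts."),
]
tsv_path = os.path.join(tempfile.mkdtemp(), "collections.tsv")
write_collection_tsv(rows, tsv_path)

with open(tsv_path, encoding="utf-8") as f:
    lines = f.read().splitlines()
print(lines)
```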

In [None]:
print(collection[19])
print()

 Corporate Social Responsibility@ What is Corporate Social Responsibility?@ With our Policy Template, we help you create awareness among your employees regarding the company’s steps to return the good deeds to society. The company’s existence is a part of the bigger system formed with the harmony of the people, values, and nature. It is both a responsibility and a deed of conscience that encourages the companies to take some extra steps to engage in donating or volunteering activities and partner with the non-proﬁt organizations to complete ventures. Download the template now.



## Indexing

For an efficient search, we can pre-compute the ColBERT representation of each passage and index them.

Below, the `Indexer` takes a model checkpoint and writes a (compressed) index to disk. We then prepare a `Searcher` for retrieval from this index.
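For intuition about what gets indexed: ColBERT represents each passage as one embedding per token and scores a query against it by late interaction, that is, for every query token it takes the best-matching passage token and sums those maxima. The sketch below is a toy pure-Python version with made-up 2-dimensional vectors. It is not the library's implementation, and `maxsim_score` is a hypothetical name.

```python
def maxsim_score(query_vecs, doc_vecs):
    """Sum, over query tokens, of the maximum dot product with any document token."""
    total = 0.0
    for q in query_vecs:
        total += max(sum(qi * di for qi, di in zip(q, d)) for d in doc_vecs)
    return total

# Toy token embeddings (illustrative values only).
toy_query = [[1.0, 0.0], [0.0, 1.0]]
toy_doc_a = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]  # has an exact match for each query token
toy_doc_b = [[0.7, 0.7]]                          # only a partial match for each query token

print(maxsim_score(toy_query, toy_doc_a))
print(maxsim_score(toy_query, toy_doc_b))
```

In this toy case the first document scores higher because each query token finds a closer match among its token embeddings.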

In [None]:
nbits = 2   # encode each dimension with 2 bits
doc_maxlen = 500 # truncate passages at 500 tokens
max_id = 10000

index_name = f'ncert.{nbits}bits'
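To get a feel for what `nbits` controls, here is a back-of-the-envelope estimate of the compressed size of one token embedding. It assumes 128-dimensional embeddings and a 4-byte centroid ID per embedding, as described in the ColBERTv2 paper. The helper name is hypothetical, and the real on-disk index carries additional metadata.

```python
def bytes_per_embedding(bits_per_dim, dim=128, centroid_id_bytes=4):
    """Approximate storage for one compressed token embedding."""
    residual_bytes = dim * bits_per_dim // 8  # bits_per_dim bits per dimension, packed into bytes
    return centroid_id_bytes + residual_bytes

for bits in (1, 2, 4):
    print(f"{bits} bit(s) per dimension: about {bytes_per_embedding(bits)} bytes per embedding")
```

Under these assumptions, going from 2 bits to 1 bit roughly halves the residual storage, at some cost in retrieval quality.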

Now run the `Indexer` on the collection. Assuming the use of only one GPU, this cell should take a few minutes to finish running.

In [None]:
!rm -rf /content/experiments  # clear any index left over from a previous run; -f avoids an error if the directory is missing

In [None]:
checkpoint = 'colbert-ir/colbertv2.0'

with Run().context(RunConfig(nranks=1, experiment='copilot')):  # nranks specifies the number of GPUs to use
    config = ColBERTConfig(doc_maxlen=doc_maxlen, nbits=nbits, kmeans_niters=4) # kmeans_niters specifies the number of iterations of k-means clustering; 4 is a good and fast default.
                                                                                # Consider larger numbers for small datasets.

    indexer = Indexer(checkpoint=checkpoint, config=config)
    indexer.index(name=index_name, collection=collection, overwrite=True)



[Dec 03, 11:12:33] #> Creating directory /content/experiments/copilot/indexes/ncert.2bits 


#> Starting...
#> Joined...


In [None]:
indexer.get_index() # You can get the absolute path of the index, if needed.

'/content/experiments/copilot/indexes/ncert.2bits'
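Judging from the path printed above, the index location follows the pattern `<root>/<experiment>/indexes/<index_name>`. A small helper makes it easy to rebuild that path elsewhere, for example when loading the index from a separate process. It is hypothetical and based only on the observed pattern, not on ColBERT's internals.

```python
import os

def expected_index_path(root, experiment, index_name):
    """Rebuild the index directory from the pattern observed in this notebook."""
    return os.path.join(root, experiment, "indexes", index_name)

index_path = expected_index_path("/content/experiments", "copilot", "ncert.2bits")
print(index_path)
```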

In [None]:
# !zip -r /content/colbert_ncert.zip /content/experiments

## Search

With the index built, we can prepare a `Searcher` and search for individual query strings.

Feel free to supply your own questions, but keep in mind that this small collection can only answer a focused set of questions.

In [None]:
# To create the searcher using its relative name (i.e., not a full path), set
# experiment=value_used_for_indexing in the RunConfig.
with Run().context(RunConfig(experiment='copilot')):
    searcher = Searcher(index=index_name, collection=collection)


# If you want to customize the search latency--quality tradeoff, you can also supply a
# config=ColBERTConfig(ncells=.., centroid_score_threshold=.., ndocs=..) argument.
# The default settings with k <= 10 (1, 0.5, 256) give the fastest search,
# but you can get a more extensive search by setting larger values of k or
# manually specifying more conservative ColBERTConfig settings (e.g. (4, 0.4, 4096)).

[Dec 03, 11:16:06] #> Loading codec...
[Dec 03, 11:16:06] Loading decompress_residuals_cpp extension (set COLBERT_LOAD_TORCH_EXTENSION_VERBOSE=True for more info)...
[Dec 03, 11:16:06] Loading packbits_cpp extension (set COLBERT_LOAD_TORCH_EXTENSION_VERBOSE=True for more info)...
[Dec 03, 11:16:06] #> Loading IVF...
[Dec 03, 11:16:06] #> Loading doclens...


100%|██████████| 1/1 [00:00<00:00, 1283.05it/s]

[Dec 03, 11:16:06] #> Loading codes and residuals...



100%|██████████| 1/1 [00:00<00:00, 924.26it/s]


In [None]:
query = "What is the charge for replacement of ID cards?" # try with an in-range query or supply your own
print(f"#> {query}")

# Find the top-5 passages for this query
results = searcher.search(query, k=5)
retrived_passages = []
sources = []
# Print out the top-k retrieved passages
for passage_id, passage_rank, passage_score in zip(*results):
    print(f"\n Rank : [{passage_rank}] \n\n Score : {passage_score:.1f} \n\n PID: {passage_id} \n\n Passages:\n\n {searcher.collection[passage_id]}")
    retrived_passages.append(searcher.collection[passage_id].split('@')[-1])
    sources.append(searcher.collection[passage_id].split('@')[0])

#> What is the charge for replacement of ID cards?

#> QueryTokenizer.tensorize(batch_text[0], batch_background[0], bsize) ==
#> Input: . What is the charge for replacement of ID cards?, 		 True, 		 None
#> Output IDs: torch.Size([32]), tensor([ 101,    1, 2054, 2003, 1996, 3715, 2005, 6110, 1997, 8909, 5329, 1029,
         102,  103,  103,  103,  103,  103,  103,  103,  103,  103,  103,  103,
         103,  103,  103,  103,  103,  103,  103,  103])
#> Output Mask: torch.Size([32]), tensor([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0])


 Rank : [1] 

 Score : 21.0 

 PID: 147 

 Passages:

  ID Card Policy@ Lost or Stolen ID Card@ Suppose the employee discovers that his/her ID card is stolen or lost; in that case, they MUST report the same to HR Business partner immediately because this can be a threat to the organization’s security. If unreported, the employee shall be held responsible for all the activitie
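The loop above relies on each passage having the form `Chapter@ Topic@ Paragraph`. Note that `split('@')[-1]` would drop part of the paragraph if the body itself contained an `@` (an email address, say). A slightly more defensive sketch, assuming that three-field layout, splits at most twice. `parse_passage` is a hypothetical helper and the example string is made up.

```python
def parse_passage(passage):
    """Split 'Chapter@ Topic@ Paragraph' into its three stripped fields."""
    parts = [part.strip() for part in passage.split("@", 2)]
    if len(parts) != 3:
        raise ValueError(f"expected 3 '@'-separated fields, got {len(parts)}")
    chapter, topic, text = parts
    return {"chapter": chapter, "topic": topic, "text": text}

example = " ID Card Policy@ Lost or Stolen ID Card@ Mail hr@example.com to report a lost card."
parsed = parse_passage(example)
print(parsed)
```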

In [None]:
retrived_passages

[' Suppose the employee discovers that his/her ID card is stolen or lost; in that case, they MUST report the same to HR Business partner immediately because this can be a threat to the organization’s security. If unreported, the employee shall be held responsible for all the activities undertaken using their ID card. On receiving such reports, the organization must deactivate the ID card and order a replacement ID card for the employee. The employee will be given the ﬁrst replacement card for free. Any further replacement for misplacing or losing the ID card will cause the employee to be charged rupees\xa01500\xa0for each substitution. If any employee ﬁnds a lost ID card, they MUST return it to\xa0HR Business partner.',
 ' This policy format carries the provisions regarding the necessity of wearing an ID card. Apart from that, the policy explains the company’s actions against the employee if he breaches this contract. In

In [None]:
len(retrived_passages)

5

In [None]:
sources

['BIOLOGICAL CLASSIFICATION',
 'BIOLOGICAL CLASSIFICATION',
 'BIOLOGICAL CLASSIFICATION',
 'BIOLOGICAL CLASSIFICATION',
 'BIOLOGICAL CLASSIFICATION']

In [None]:
context = ' '.join(retrived_passages)
context

'Lichens : Lichens are symbiotic associations i.e. mutually useful associations, between algae and fungi. The algal component is known as phycobiont and fungal component as mycobiont, which are autotrophic and heterotrophic, respectively. Algae prepare food for fungi and fungi provide shelter and absorb mineral nutrients and water for its partner. So close is their association that if one saw a lichen in nature one would never imagine that they had two different organisms within them. Lichens are very good pollution indicators – they do not grow in polluted areas. In the five kingdom classification of Whittaker there is no mention of lichens and some acellular organisms like viruses, viroids and prions. These are briefly introduced here. Virus means venom or poisonous fluid. Dmitri Ivanowsky (1892) recognised certain microbes as causal organism of the mosaic disease of tobacco (Figure 2.6a). These were found to be smaller than bacteria because they passed through bacteria-proof filters

## ColBERT API

In [None]:
!pip install flask-ngrok

Collecting flask-ngrok
  Downloading flask_ngrok-0.0.25-py3-none-any.whl (3.1 kB)
Installing collected packages: flask-ngrok
Successfully installed flask-ngrok-0.0.25

In [None]:
import pandas as pd

df = pd.read_csv("/content/data/office-collections.csv")
df = df.drop(['Paragraph'], axis=1)
df = df.drop(['pid'], axis=1)
df_dict = df.to_dict(orient='records')
df_dict[0]

{'Chapter': ' Compensation and Benefits Policy',
 'Page_number': 1,
 'Topic': 'What is the Compensation and Beneﬁts Policy?'}
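Because the `pid` column is dropped, a later lookup such as `df_dict[pid]` only works if the CSV rows are stored in pid order. A sketch that keys the metadata by `pid` explicitly avoids that assumption. It uses the standard `csv` module and an inline sample so that it runs without the real file. The sample rows and the `load_sources` name are illustrative, while the column names are the ones that appear in this notebook.

```python
import csv
import io

# Made-up rows, deliberately out of pid order.
SAMPLE_CSV = """pid,Chapter,Page_number,Topic,Paragraph
1,ID Card Policy,4,Lost or Stolen ID Card,Report a lost card to HR.
0,Compensation and Benefits Policy,1,What is the policy?,Overview text.
"""

def load_sources(csv_file):
    """Map each pid to its metadata, independent of row order."""
    sources = {}
    for row in csv.DictReader(csv_file):
        pid = int(row.pop("pid"))
        row.pop("Paragraph", None)  # the passage text already lives in the collection
        row["Page_number"] = int(row["Page_number"])
        sources[pid] = row
    return sources

sources_by_pid = load_sources(io.StringIO(SAMPLE_CSV))
print(sources_by_pid[0])
```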

In [17]:
from flask_ngrok import run_with_ngrok
from flask import Flask, render_template, request
from functools import lru_cache
import math
import os
from dotenv import load_dotenv

from colbert.infra import Run, RunConfig, ColBERTConfig
from colbert import Searcher
import pandas as pd

load_dotenv()

# INDEX_NAME = os.getenv("INDEX_NAME")
# INDEX_ROOT = os.getenv("INDEX_ROOT")
# /content/experiments/copilot/indexes/ncert.2bits
INDEX_ROOT="/content/experiments/copilot/indexes"
INDEX_NAME="ncert.2bits"
PORT="8893"

app = Flask(__name__)
run_with_ngrok(app)
searcher = Searcher(index=f"{INDEX_ROOT}/{INDEX_NAME}")
counter = {"api" : 0}

df = pd.read_csv("/content/data/office-collections.csv")
df = df.drop(['Paragraph'], axis=1)
df = df.drop(['pid'], axis=1)
df_dict = df.to_dict(orient='records')

@lru_cache(maxsize=1000000)
def api_search_query(query, k):
    print(f"Query={query}")
    if k is None: k = 10
    k = min(int(k), 100)
    # Retrieve the top 100 passages, then keep only the first k.
    pids, ranks, scores = searcher.search(query, k=100)
    pids, ranks, scores = pids[:k], ranks[:k], scores[:k]
    # Softmax over the kept scores, so the probabilities sum to 1 across the top k.
    probs = [math.exp(score) for score in scores]
    total = sum(probs)
    probs = [prob / total for prob in probs]
    topk = []
    for pid, rank, score, prob in zip(pids, ranks, scores, probs):
        text = searcher.collection[pid]
        source = df_dict[pid]  # relies on CSV row order matching pid
        d = {'text': text, 'pid': pid, 'rank': rank, 'score': score, 'prob': prob, 'source': source}
        topk.append(d)
    # Sort by descending score, breaking ties by pid.
    topk = list(sorted(topk, key=lambda p: (-1 * p['score'], p['pid'])))
    return {"query" : query, "topk": topk}

@app.route("/api/search", methods=["GET"])
def api_search():
    if request.method == "GET":
        counter["api"] += 1
        print("API request count:", counter["api"])
        return api_search_query(request.args.get("query"), request.args.get("k"))
    else:
        return ('', 405)

if __name__ == "__main__":
    app.run()

[Dec 03, 16:06:03] #> Loading collection...
0M 
[Dec 03, 16:06:06] #> Loading codec...
[Dec 03, 16:06:06] #> Loading IVF...
[Dec 03, 16:06:06] #> Loading doclens...


100%|██████████| 1/1 [00:00<00:00, 1431.01it/s]

[Dec 03, 16:06:06] #> Loading codes and residuals...



100%|██████████| 1/1 [00:00<00:00, 567.87it/s]

 * Serving Flask app '__main__'
 * Debug mode: off



 * Running on http://127.0.0.1:5000
INFO:werkzeug:Press CTRL+C to quit


 * Running on http://c358-34-126-157-109.ngrok.io
 * Traffic stats available on http://127.0.0.1:4040


INFO:werkzeug:127.0.0.1 - - [03/Dec/2023 16:06:56] "GET /api/search?query=What%20is%20the%20charge%20for%20replacement%20of%20ID%20cards??&k=3 HTTP/1.1" 200 -


API request count: 1
Query=What is the charge for replacement of ID cards??

#> QueryTokenizer.tensorize(batch_text[0], batch_background[0], bsize) ==
#> Input: . What is the charge for replacement of ID cards??, 		 True, 		 None
#> Output IDs: torch.Size([32]), tensor([ 101,    1, 2054, 2003, 1996, 3715, 2005, 6110, 1997, 8909, 5329, 1029,
        1029,  102,  103,  103,  103,  103,  103,  103,  103,  103,  103,  103,
         103,  103,  103,  103,  103,  103,  103,  103])
#> Output Mask: torch.Size([32]), tensor([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0])



INFO:werkzeug:127.0.0.1 - - [03/Dec/2023 16:08:47] "GET /api/search?query=What%20is%20the%20charge%20for%20replacement%20of%20ID%20cards??&k=3 HTTP/1.1" 200 -


API request count: 2


INFO:werkzeug:127.0.0.1 - - [03/Dec/2023 16:09:35] "GET /api/search?query=what%20is%20the%20deadline%20for%20returning%20the%20assets%20after%20work%20from%20home?%0A?&k=3 HTTP/1.1" 200 -


API request count: 3
Query=what is the deadline for returning the assets after work from home?
?


INFO:werkzeug:127.0.0.1 - - [03/Dec/2023 16:13:52] "GET /api/search?query=What%20is%20the%20charge%20for%20replacement%20of%20ID%20cards??&k=3 HTTP/1.1" 200 -


API request count: 4


INFO:werkzeug:127.0.0.1 - - [03/Dec/2023 16:16:13] "GET /api/search?query=What%20is%20the%20charge%20for%20replacement%20of%20ID%20cards??&k=3 HTTP/1.1" 200 -


API request count: 5


INFO:werkzeug:127.0.0.1 - - [03/Dec/2023 16:16:57] "GET /api/search?query=what%20is%20the%20deadline%20for%20returning%20the%20assets%20after%20work%20from%20home??&k=3 HTTP/1.1" 200 -


API request count: 6
Query=what is the deadline for returning the assets after work from home??


INFO:werkzeug:127.0.0.1 - - [03/Dec/2023 16:20:03] "GET /api/search?query=What%20is%20the%20charge%20for%20replacement%20of%20ID%20cards??&k=3 HTTP/1.1" 200 -


API request count: 7


INFO:werkzeug:127.0.0.1 - - [03/Dec/2023 16:20:44] "GET /api/search?query=what%20is%20the%20deadline%20for%20returning%20the%20assets%20after%20work%20from%20home??&k=3 HTTP/1.1" 200 -


API request count: 8
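The `prob` field returned by the API is a softmax over the top-k scores. Calling `math.exp` directly on raw scores can overflow if the scores grow large, so a common variant subtracts the maximum score first, which leaves the resulting probabilities unchanged. A minimal sketch follows; `softmax` is a hypothetical helper, not part of ColBERT, and the input scores are made up.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    if not scores:
        return []
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]  # largest exponent is exp(0) == 1
    total = sum(exps)
    return [e / total for e in exps]

example_probs = softmax([21.0, 19.5, 19.0])
print([round(p, 3) for p in example_probs])
```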


In [None]:
# Example request URLs (replace the host with the ngrok URL printed when the app starts):
# http://6736-34-124-217-36.ngrok.io/
# http://8ea4-34-124-217-36.ngrok.io/api/search?query=What is the charge for replacement of ID cards??&k=5
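The request logs above show stray characters in the query string (a doubled `?`, and `%0A` for a newline). One way to avoid this is to let the client build the URL. The sketch below uses only the standard library. `build_search_url` is a hypothetical helper and `BASE_URL` is a placeholder, since the ngrok address differs between runs in this notebook.

```python
from urllib.parse import urlencode

BASE_URL = "http://example.invalid"  # placeholder: substitute the ngrok URL printed by Flask

def build_search_url(base_url, query, k=5):
    """Return a percent-encoded /api/search URL for the given query and k."""
    params = urlencode({"query": query.strip(), "k": k})  # strip() drops trailing newlines
    return f"{base_url.rstrip('/')}/api/search?{params}"

url = build_search_url(BASE_URL, "What is the charge for replacement of ID cards?\n", k=5)
print(url)
```

Once the server is running, the resulting URL can be fetched with any HTTP client, for example `urllib.request.urlopen(url)`.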