# ColBERTv2: Indexing & Search Notebook

If you're working in Google Colab, we recommend selecting "GPU" as your hardware accelerator in the runtime settings.
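
Before running the heavier cells, it can help to confirm that a GPU is actually visible. The helper below is a minimal sketch, not part of ColBERT: `gpu_available` is a hypothetical name, and it assumes that PyTorch, once installed, is the library that will drive the GPU.

```python
def gpu_available() -> bool:
    """Return True if PyTorch is importable and sees at least one CUDA device."""
    try:
        # Imported lazily so the check also runs before torch is installed.
        import torch
    except (ImportError, OSError):
        # OSError covers a torch install whose native libraries fail to load.
        return False
    return torch.cuda.is_available()


print("GPU available:", gpu_available())
```

If this prints `False` in Colab, the runtime type is probably still set to CPU.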

First, we'll download ColBERT and install its dependencies. Then we'll import the relevant classes; note that `Indexer` and `Searcher` are the key actors here.

In [None]:
!git -C ColBERT/ pull || git clone https://github.com/stanford-futuredata/ColBERT.git
import sys; sys.path.insert(0, 'ColBERT/')

fatal: cannot change to 'ColBERT/': No such file or directory
Cloning into 'ColBERT'...
remote: Enumerating objects: 2835, done.
remote: Counting objects: 100% (1346/1346), done.
remote: Compressing objects: 100% (437/437), done.
remote: Total 2835 (delta 1041), reused 991 (delta 905), pack-reused 1489 (from 1)
Receiving objects: 100% (2835/2835), 2.06 MiB | 16.50 MiB/s, done.
Resolving deltas: 100% (1795/1795), done.


In [None]:
try: # When on google Colab, let's install all dependencies with pip.
    import google.colab
    !pip install -U pip
    !pip install -e ColBERT/['faiss-gpu','torch']
except Exception:
    import sys; sys.path.insert(0, 'ColBERT/')
    try:
        from colbert import Indexer, Searcher
    except Exception:
        print("If you're running outside Colab, please make sure you install ColBERT in conda following the instructions in our README. You can also install (as above) with pip but it may install slower or less stable faiss or torch dependencies. Conda is recommended.")
        raise  # re-raise the import error so the notebook stops here

Collecting pip
  Downloading pip-24.3.1-py3-none-any.whl.metadata (3.7 kB)
Downloading pip-24.3.1-py3-none-any.whl (1.8 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.8/1.8 MB 17.9 MB/s eta 0:00:00
Installing collected packages: pip
  Attempting uninstall: pip
    Found existing installation: pip 24.1.2
    Uninstalling pip-24.1.2:
      Successfully uninstalled pip-24.1.2
Successfully installed pip-24.3.1
Obtaining file:///content/ColBERT
  Preparing metadata (setup.py) ... done
Collecting bitarray (from colbert-ai==0.2.20)
  Downloading bitarray-3.0.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (32 kB)
Collecting datasets (from colbert-ai==0.2.20)
  Downloading datasets-3.1.0-py3-none-any.whl.metadata (20 kB)
Collecting git-python (from colbert-ai==0.2.20)
  Downloading git_python-1.0.3-py2.py3-none-any.whl.metadata (331 bytes)
Collecting python-dotenv (from colbert-ai==0.2.20)
  Downloading python_d

In [None]:
!pip3 install -U torch

Collecting torch
  Downloading torch-2.5.1-cp310-cp310-manylinux1_x86_64.whl.metadata (28 kB)
Collecting nvidia-cuda-nvrtc-cu12==12.4.127 (from torch)
  Downloading nvidia_cuda_nvrtc_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-runtime-cu12==12.4.127 (from torch)
  Downloading nvidia_cuda_runtime_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-cupti-cu12==12.4.127 (from torch)
  Downloading nvidia_cuda_cupti_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.6 kB)
Collecting nvidia-cudnn-cu12==9.1.0.70 (from torch)
  Downloading nvidia_cudnn_cu12-9.1.0.70-py3-none-manylinux2014_x86_64.whl.metadata (1.6 kB)
Collecting nvidia-cublas-cu12==12.4.5.8 (from torch)
  Downloading nvidia_cublas_cu12-12.4.5.8-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cufft-cu12==11.2.1.3 (from torch)
  Downloading nvidia_cufft_cu12-11.2.1.3-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)


In [None]:
import colbert
from colbert import Indexer, Searcher
from colbert.infra import Run, RunConfig, ColBERTConfig
from colbert.data import Queries, Collection

The upstream ColBERT demo notebook uses the dev set of the **LoTTE benchmark** introduced in the ColBERTv2 paper, downloaded from HuggingFace datasets. The cell below instead builds the collection from local plain-text files: every `.txt` file in `./txt_papers` becomes one entry in `collection`, and `file_names` maps each entry's position back to its file name.

The directory holds 34 papers, so the `Indexer` runs on the whole collection and no query filtering is needed.

In [None]:
import os
collection = []
file_names = {}
i = 0

dir_path = './txt_papers'

for file in sorted(os.listdir(dir_path)):
    if file.endswith('.txt'):
        print(f"{file = }")
        with open(dir_path + '/' + file, 'r') as f:
            collection.append(f.read())
            file_names[i] = file
            i += 1

file = '1308.0850.txt'
file = '1406.1078.txt'
file = '1409.0473.txt'
file = '1412.3555.txt'
file = '1412.6980.txt'
file = '1508.04025.txt'
file = '1508.07909.txt'
file = '1511.06114.txt'
file = '1511.08228.txt'
file = '1512.00567.txt'
file = '1512.03385.txt'
file = '1601.06733.txt'
file = '1602.02410.txt'
file = '1606.04199.txt'
file = '1607.06450.txt'
file = '1608.05859.txt'
file = '1609.08144.txt'
file = '1610.02357.txt'
file = '1610.10099v2.txt'
file = '1701.06538.txt'
file = '1702.00887.txt'
file = '1703.03130.txt'
file = '1703.03906.txt'
file = '1703.10722.txt'
file = '1705.03122v2.txt'
file = '1705.04304.txt'
file = '1706.03762.txt'
file = 'D09-1082.txt'
file = 'D16-1244.txt'
file = 'J93-2004.txt'
file = 'N16-1118.txt'
file = 'P06-1054.txt'
file = 'P13-1045.txt'
file = 'srivastava14a.txt'


In [None]:
print(len(collection))

34
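
Each file is added to `collection` as a single entry, while the indexing configuration later in this notebook truncates passages at 300 tokens, so only the beginning of each paper would be indexed. One possible workaround, sketched here under assumptions, is to split every paper into shorter overlapping passages and keep a passage-to-file map. The function names, the 200-word window, and the 20-word overlap are all illustrative choices, and whitespace-separated words are only a rough stand-in for model tokens.

```python
def split_into_passages(text, words_per_passage=200, overlap=20):
    """Split text into overlapping windows of whitespace-separated words."""
    if overlap >= words_per_passage:
        raise ValueError("overlap must be smaller than words_per_passage")
    words = text.split()
    step = words_per_passage - overlap
    passages = []
    for start in range(0, len(words), step):
        passages.append(" ".join(words[start:start + words_per_passage]))
        if start + words_per_passage >= len(words):
            break  # this window already reached the end of the text
    return passages


def build_passage_collection(documents):
    """documents maps file name -> full text.

    Returns (passages, passage_to_file), where passage_to_file maps the
    position of each passage back to the file it came from.
    """
    passages, passage_to_file = [], {}
    for name, text in documents.items():
        for passage in split_into_passages(text):
            passage_to_file[len(passages)] = name
            passages.append(passage)
    return passages, passage_to_file


# Toy input standing in for the real papers.
demo_docs = {"a.txt": "word " * 450, "b.txt": "short text"}
demo_passages, demo_map = build_passage_collection(demo_docs)
print(len(demo_passages), demo_map)
```

With this layout, `passage_to_file` would play the role that `file_names` plays in the rest of the notebook, and several passages of the same paper can appear in one result list.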


## Indexing

For an efficient search, we can pre-compute the ColBERT representation of each passage and index them.

Below, the `Indexer` takes a model checkpoint and writes a (compressed) index to disk. We then prepare a `Searcher` for retrieval from this index.

In [None]:
nbits = 2   # encode each dimension with 2 bits
doc_maxlen = 300 # truncate passages at 300 tokens
max_id = 10000

index_name = f'crud.colbert.{nbits}bits'
checkpoint = 'colbert-ir/colbertv2.0'
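
To get a feel for what `nbits` means for index size, the arithmetic can be worked out directly. This is a back-of-the-envelope sketch, not ColBERT's own accounting: it assumes 128-dimensional token embeddings and a 4-byte centroid id stored next to each quantized residual, and `bytes_per_embedding` is a hypothetical helper.

```python
def bytes_per_embedding(dim=128, nbits=2, centroid_id_bytes=4):
    """Approximate storage for one compressed token embedding.

    Each of the `dim` residual dimensions takes `nbits` bits, and one
    centroid id is stored alongside the residual (assumed 4 bytes).
    """
    residual_bits = dim * nbits
    return centroid_id_bytes + residual_bits // 8


# Uncompressed baseline: 128 dimensions at 16-bit precision.
uncompressed_bytes = 128 * 2

for bits in (1, 2):
    print(f"nbits={bits}: {bytes_per_embedding(nbits=bits)} bytes "
          f"(vs. {uncompressed_bytes} uncompressed)")
```

Under these assumptions, `nbits = 2` costs 36 bytes per token embedding and `nbits = 1` costs 20, compared with 256 bytes uncompressed.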

The upstream demo restricts indexing to the first 10,000 passages, which is why `max_id` is set above. It is not used in the cells shown here: the collection contains only 34 documents, so the `Indexer` runs on all of them.

Now run the `Indexer` on the collection. The upstream estimate of about six minutes on one GPU refers to 10,000 passages; with 34 documents this cell should finish sooner.

In [None]:
with Run().context(RunConfig(nranks=1, experiment='notebook')):  # nranks specifies the number of GPUs to use
    config = ColBERTConfig(doc_maxlen=doc_maxlen, nbits=nbits, kmeans_niters=4) # kmeans_niters specifies the number of iterations of k-means clustering; 4 is a good and fast default.
                                                                                # Consider larger numbers for small datasets.

    indexer = Indexer(checkpoint=checkpoint, config=config)
    indexer.index(name=index_name, collection=collection, overwrite=True)

In [None]:
indexer.get_index() # You can get the absolute path of the index, if needed.

'/content/experiments/notebook/indexes/crud.colbert.2bits'

## Search

Having built the index, we now prepare a `Searcher` and search for individual query strings.

The query can be any string. In this notebook it is the full text of one of the indexed papers, so the search returns the papers most similar to it. Keep in mind that a collection of 34 papers can only answer a small, focused set of questions!
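
For intuition about how results are ranked: ColBERT scores a passage with late interaction, summing, over the query's token embeddings, the maximum similarity to any token embedding of the passage. The toy below sketches that scoring rule in plain Python with hand-written 2-dimensional vectors. It is only an illustration: the real `Searcher` works on the compressed index and uses candidate generation, and the names `dot`, `maxsim_score`, and `rank` are hypothetical.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def maxsim_score(query_vecs, doc_vecs):
    """Sum over query tokens of the best dot product with any document token."""
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)


def rank(query_vecs, docs):
    """Score every document and return (name, score) pairs, best first."""
    scores = {name: maxsim_score(query_vecs, vecs) for name, vecs in docs.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Toy token embeddings: two query tokens and two small documents.
toy_query = [[1.0, 0.0], [0.0, 1.0]]
toy_docs = {
    "a": [[1.0, 0.0], [0.0, 1.0]],  # matches both query tokens
    "b": [[1.0, 0.0]],              # matches only the first query token
}
print(rank(toy_query, toy_docs))
```

Document `a` wins because each query token finds a close match in it, which is the behavior the sum-of-maxima rule is designed to reward.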

In [None]:
with Run().context(RunConfig(experiment='notebook')):
    searcher = Searcher(index=index_name, collection=collection)

[Nov 16, 13:52:35] #> Loading codec...
[Nov 16, 13:52:35] Loading decompress_residuals_cpp extension (set COLBERT_LOAD_TORCH_EXTENSION_VERBOSE=True for more info)...
[Nov 16, 13:52:35] Loading packbits_cpp extension (set COLBERT_LOAD_TORCH_EXTENSION_VERBOSE=True for more info)...
[Nov 16, 13:52:35] #> Loading IVF...
[Nov 16, 13:52:35] #> Loading doclens...
100%|██████████| 1/1 [00:00<00:00, 1543.73it/s]
[Nov 16, 13:52:35] #> Loading codes and residuals...
100%|██████████| 1/1 [00:00<00:00, 356.57it/s]


In [None]:
from dotenv import load_dotenv
import os

load_dotenv()

paper_name = os.getenv("PAPER_NAME")
# paper_name = "1706.03762.txt"

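# Look up the position of the query paper in the collection.
paper_indx = None
for indx, name in file_names.items():
    if paper_name == name:
        paper_indx = indx
        break

if paper_indx is None:
    raise ValueError(f"PAPER_NAME={paper_name!r} is not in the collection")

# The full text of the paper serves as the query.
query = collection[paper_indx]

print(paper_indx)

results = searcher.search(query, k=10)

# Collect the file names of the top-10 passages for this query
l1 = []
for passage_id, passage_rank, passage_score in zip(*results):
    l1.append(file_names[passage_id])

print(l1)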

26
['1706.03762.txt', 'D16-1244.txt', '1701.06538.txt', '1602.02410.txt', '1703.10722.txt', '1705.04304.txt', 'N16-1118.txt', '1512.00567.txt', 'srivastava14a.txt', '1508.04025.txt']
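
Because the query is itself one of the indexed papers, it comes back as the top hit. The sketch below shows one way to turn the three parallel lists returned by the search into records and to drop the query paper. `results_to_records` is a hypothetical helper, and the scores in the example are made-up placeholders, not output from this run.

```python
def results_to_records(results, id_to_name, exclude=None):
    """Convert (ids, ranks, scores) into a list of dicts, skipping `exclude`."""
    records = []
    for passage_id, rank, score in zip(*results):
        name = id_to_name[passage_id]
        if name == exclude:
            continue  # skip the paper that was used as the query
        records.append({"file": name, "rank": rank, "score": score})
    return records


# Illustrative input in the same shape as the search results above.
# The ids match the sorted file listing; the scores are invented.
example_results = ([26, 28, 19], [1, 2, 3], [30.0, 21.5, 20.25])
example_names = {26: "1706.03762.txt", 28: "D16-1244.txt", 19: "1701.06538.txt"}

neighbours = results_to_records(example_results, example_names,
                                exclude="1706.03762.txt")
print([record["file"] for record in neighbours])
```

Keeping the rank and score next to each file name also makes it easier to apply a score threshold later, should ten neighbours turn out to be too many.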


In [None]:
import os
import shutil

def copy_files_from_list(source_path, dest_path, file_list):
    """Copy each file named in file_list from source_path to dest_path."""
    # Create the destination directory if it does not exist yet.
    os.makedirs(dest_path, exist_ok=True)

    for filename in file_list:
        source_file = os.path.join(source_path, filename)
        dest_file = os.path.join(dest_path, filename)

        try:
            shutil.copy2(source_file, dest_file)  # copy2 preserves metadata
        except Exception as e:
            # Report the failure and continue with the remaining files.
            print(f"Error copying {filename}: {str(e)}")

source_folder = "./txt_papers"
destination_folder = "./final_input"
copy_files_from_list(source_folder, destination_folder, l1)
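
Since the copy step only prints errors and carries on, a quick check of the destination folder can confirm that every expected file arrived. The helper below is a minimal sketch; `missing_files` is a hypothetical name, and the demonstration runs in a temporary directory rather than in `./final_input`.

```python
import os
import tempfile


def missing_files(dest_path, file_list):
    """Return the names from file_list that are not regular files in dest_path."""
    return [name for name in file_list
            if not os.path.isfile(os.path.join(dest_path, name))]


# Demonstration in a throwaway directory containing a single file.
with tempfile.TemporaryDirectory() as tmp:
    open(os.path.join(tmp, "present.txt"), "w").close()
    demo_missing = missing_files(tmp, ["present.txt", "absent.txt"])

print(demo_missing)
```

In the notebook, the equivalent call would be `missing_files(destination_folder, l1)`, which should return an empty list when all copies succeeded.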