# Using Our ColBERTer Checkpoints

This notebook gives you a minimal usage example: it downloads our ColBERTer checkpoint, encodes passages and queries, and scores their relevance.


---


Let's get started by installing the awesome *transformers* library from HuggingFace:


The next step is to download our checkpoint and initialize the tokenizer and models:


In [None]:
from colberter import ColBERTer
from bow2_tokenizer import BOW2Tokenizer
from transformers import AutoTokenizer
import torch

#
# init the model & tokenizer (using the distilbert tokenizer)
#
tokenizer = BOW2Tokenizer(AutoTokenizer.from_pretrained("distilbert-base-uncased"))

# two checkpoints are available, a 32 dim model and a 1 dim model; select one of them
model = ColBERTer.from_pretrained("sebastian-hofstaetter/colberter-128-32-msmarco")
#model = ColBERTer.from_pretrained("sebastian-hofstaetter/colberter-128-1-msmarco")

# Pairwise Scoring (For Training & Re-Ranking)

Now we are ready to use the model to score two sample passage and query pairs (this is the re-ranking mode, where we already have a candidate list):

In [15]:
# our relevant example
passage1_input = tokenizer.tokenize("We are very happy to show you the 🤗 Transformers library for pre-trained language models. We are helping the community work together towards the goal of advancing NLP 🔥.")
# a non-relevant example
passage2_input = tokenizer.tokenize("Hmm I don't like this new movie about transformers that i got from my local library. Those transformers are robots?")

# the user query -> which should give us a better score for the first passage
query_input = tokenizer.tokenize("what is the transformers library")

#print("Passage 1 Tokenized:",passage1_input)
#print("Passage 2 Tokenized:",passage2_input)
#print("Query Tokenized:",query_input)

# note: model.forward scores a (query, passage) pair in a single call;
# it can be split into forward_representation and forward_aggregation
# set use_fp16=False when running on a CPU
score_for_p1 = model.forward(query_input,passage1_input,use_fp16=False)[0].squeeze(0)
score_for_p2 = model.forward(query_input,passage2_input,use_fp16=False)[0].squeeze(0)

print("---")
print("Score passage 1 <-> query: ",float(score_for_p1))
print("Score passage 2 <-> query: ",float(score_for_p2))

---
Score passage 1 <-> query:  39.04731750488281
Score passage 2 <-> query:  32.5419807434082
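
In re-ranking mode, per-pair scores like these are typically sorted to order the candidate list. A minimal sketch, assuming the scores have already been converted to Python floats; the `rank_candidates` helper and the passage ids are hypothetical and not part of the checkpoint code:

```python
from typing import Dict, List, Tuple


def rank_candidates(scores: Dict[str, float]) -> List[Tuple[str, float]]:
    """Sort candidate passages by descending relevance score."""
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# scores copied from the output above; the ids are made up for illustration
candidate_scores = {
    "passage2": 32.5419807434082,
    "passage1": 39.04731750488281,
}

ranking = rank_candidates(candidate_scores)
for rank, (passage_id, score) in enumerate(ranking, start=1):
    print(rank, passage_id, round(score, 2))
```

The same helper works for any number of candidates, since it only depends on the id-to-score mapping.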


# Separate Encoding & Scoring (For Pre-Computation & Indexing)

For indexing or pre-computation mode, you need to call `forward_representation` and `forward_aggregation` independently:

In [16]:

# we re-use the tokenized input from the previous cell, and independently encode all 3 sequences
p1_encoded_cls, p1_encoded_bow, p1_encoded_bow_mask = model.forward_representation(passage1_input,sequence_type="doc_encode")
p2_encoded_cls, p2_encoded_bow, p2_encoded_bow_mask = model.forward_representation(passage2_input,sequence_type="doc_encode")

#
# let's assume that at this point p1_encoded_cls and p2_encoded_cls get indexed in an ANN index,
#   and the bow encodings get saved by id -> workflow #2 (in the paper)
#

# now we get a query in from the user and encode it
q_encoded_cls, q_encoded_bow, q_encoded_bow_mask = model.forward_representation(query_input,sequence_type="query_encode")

# in practice, these dot products are computed by the ANN index
cls_score_p1 = p1_encoded_cls @ q_encoded_cls.T
cls_score_p2 = p2_encoded_cls @ q_encoded_cls.T

# now we assume that the two passages have been returned by the ANN index
exact_scoring_mask_p1=None
exact_scoring_mask_p2=None

if model.compress_to_exact_mini_mode:
    print("Using exact matching")
    exact_scoring_mask_p1 = query_input["unique_words"].unsqueeze(-1) == passage1_input["unique_words"].unsqueeze(1)
    exact_scoring_mask_p2 = query_input["unique_words"].unsqueeze(-1) == passage2_input["unique_words"].unsqueeze(1)

score_for_p1 = model.forward_aggregation(cls_score_p1,
                                         q_encoded_bow,q_encoded_bow_mask,
                                         p1_encoded_bow,p1_encoded_bow_mask,
                                         exact_scoring_mask=exact_scoring_mask_p1)

score_for_p2 = model.forward_aggregation(cls_score_p2,
                                         q_encoded_bow,q_encoded_bow_mask,
                                         p2_encoded_bow,p2_encoded_bow_mask,
                                         exact_scoring_mask=exact_scoring_mask_p2)

print("---")
print("Score passage 1 <-> query: ",float(score_for_p1),"(CLS score:",float(cls_score_p1*torch.sigmoid(model.score_merger)),")")
print("Score passage 2 <-> query: ",float(score_for_p2),"(CLS score:",float(cls_score_p2*torch.sigmoid(model.score_merger)),")")

---
Score passage 1 <-> query:  39.04731750488281 (CLS score: 16.83791732788086 )
Score passage 2 <-> query:  32.5419807434082 (CLS score: 15.022980690002441 )
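
The exact-match mask built in the cell above compares every query word id with every passage word id through broadcasting, which yields one boolean per (query word, passage word) pair. A plain-Python sketch of the same comparison for a single, unbatched sequence; the `exact_match_mask` helper and the word ids are invented for illustration, and the real `unique_words` tensors carry an additional batch dimension:

```python
from typing import List


def exact_match_mask(query_ids: List[int], passage_ids: List[int]) -> List[List[bool]]:
    """Return mask[i][j] == True where query word i equals passage word j."""
    return [[q == p for p in passage_ids] for q in query_ids]


# hypothetical word ids, not taken from the real tokenizer
query_ids = [7, 3, 9]
passage_ids = [3, 5, 9, 1]

mask = exact_match_mask(query_ids, passage_ids)
for row in mask:
    print(row)
```

The result has one row per query word and one column per passage word, matching the shape that `unsqueeze(-1)` on the query and `unsqueeze(1)` on the passage produce per batch entry.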


As the output shows, the model gives the first passage a higher score than the second. Running this comparison over all passages in our collection or candidate set and sorting by score would produce the ranked list.
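
The cell above computes the CLS dot products directly and notes that an ANN index would take over this step. A brute-force stand-in for that candidate retrieval can be sketched in plain Python, assuming small dense vectors stored as lists; `top_k_by_dot_product`, the vectors, and the ids are hypothetical, and a real deployment would use an ANN library instead of this exhaustive scan:

```python
from typing import Dict, List, Tuple


def dot(a: List[float], b: List[float]) -> float:
    """Dot product of two equally sized vectors."""
    return sum(x * y for x, y in zip(a, b))


def top_k_by_dot_product(
    query_vec: List[float],
    passage_vecs: Dict[str, List[float]],
    k: int,
) -> List[Tuple[str, float]]:
    """Score every stored passage vector against the query and keep the k best."""
    scored = [(pid, dot(query_vec, vec)) for pid, vec in passage_vecs.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:k]


# toy 3-dim vectors standing in for the pre-computed CLS encodings
passage_vecs = {
    "p1": [0.9, 0.1, 0.0],
    "p2": [0.2, 0.8, 0.1],
    "p3": [0.0, 0.1, 0.9],
}
query_vec = [1.0, 0.2, 0.0]

candidates = top_k_by_dot_product(query_vec, passage_vecs, k=2)
print(candidates)
```

The returned candidates would then be passed, together with their stored BOW encodings, to `forward_aggregation` for the final score.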

- If you want to look at more complex usages and training code we have a library for that: https://github.com/sebastian-hofstaetter/matchmaker 👏

- If you use our model checkpoint please cite our work as:

    ```
    @article{Hofstaetter2022_colberter,
     author = {Sebastian Hofst{\"a}tter and Omar Khattab and Sophia Althammer and Mete Sertkan and Allan Hanbury},
     title = {Introducing Neural Bag of Whole-Words with ColBERTer: Contextualized Late Interactions using Enhanced Reduction},
     publisher = {arXiv},
     url = {https://arxiv.org/abs/2203.13088},
     doi = {10.48550/ARXIV.2203.13088},
     year = {2022},
    }
    ```

Thank you 😊 If you have any questions, feel free to reach out to Sebastian via mail.
