<picture>
  <source media="(prefers-color-scheme: dark)" srcset="https://vespa.ai/assets/vespa-ai-logo-heather.svg">
  <source media="(prefers-color-scheme: light)" srcset="https://vespa.ai/assets/vespa-ai-logo-rock.svg">
  <img alt="#Vespa" width="200" src="https://vespa.ai/assets/vespa-ai-logo-heather.svg" style="margin-bottom: 25px;">
</picture>


# BGE-M3 - The Mother of all embedding models

On January 30th, BAAI released BGE-M3, a new member of the BGE model series.

> M3 stands for Multi-linguality (100+ languages), Multi-granularities (input length up to 8192), Multi-Functionality (unification of dense, lexical, multi-vec (colbert) retrieval).

This notebook demonstrates how to use [BGE_M3](https://github.com/FlagOpen/FlagEmbedding/tree/master/FlagEmbedding/BGE_M3) embeddings and 
store all three representations in Vespa, the only scalable serving engine that can handle all M3 representations.

This code is inspired by the README from the model hub [BAAI/bge-m3](https://huggingface.co/BAAI/bge-m3).


Let's get started! First, install dependencies: 

In [None]:
!pip3 install -U pyvespa FlagEmbedding 

### Explore the multiple representations of M3
When encoding text, we can ask for the representations we want:

- Sparse (SPLADE) vectors 
- Dense (DPR) regular text embeddings 
- Multi-Dense (ColBERT) - contextualized multi-token vectors 

Let us dive into it. To use this model on CPU, we set `use_fp16` to `False`; for GPU inference, `use_fp16=True` is recommended.

In [1]:
from FlagEmbedding import BGEM3FlagModel

model = BGEM3FlagModel('BAAI/bge-m3', use_fp16=False)



Fetching 19 files:   0%|          | 0/19 [00:00<?, ?it/s]

loading existing colbert_linear and sparse_linear---------


## A demo passage 

Let us encode a simple passage and ask for all three representations.

In [2]:

passage = ["BGE M3 is an embedding model supporting dense retrieval, lexical matching and multi-vector interaction."]

In [3]:
passage_embeddings = model.encode(passage, return_dense=True, return_sparse=True, return_colbert_vecs=True)

encoding:   0%|          | 0/1 [00:00<?, ?it/s]You're using a XLMRobertaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
encoding: 100%|██████████| 1/1 [00:01<00:00,  1.25s/it]


In [22]:
passage_embeddings['colbert_vecs'][0].shape

(30, 1024)

In [25]:
passage_embeddings['dense_vecs'][0].shape

(1024,)

In [29]:
passage_embeddings['lexical_weights']

[defaultdict(int,
             {'335': 0.14094123,
              '11679': 0.25865352,
              '276': 0.17205116,
              '363': 0.2689343,
              '83': 0.12733835,
              '142': 0.073539406,
              '55720': 0.21414939,
              '59725': 0.16704923,
              '3299': 0.25499487,
              '8060': 0.19095251,
              '214': 0.0827628,
              '168': 0.18121913,
              '184': 0.1212738,
              '456': 0.057080604,
              '97351': 0.15733702,
              '1405': 0.06340542,
              '75675': 0.15143114,
              '21533': 0.10568179,
              '14858': 0.15083122,
              '136': 0.015894324,
              '6024': 0.08422447,
              '272': 0.14546244,
              '18770': 0.14017706,
              '182809': 0.15259874})]

In [24]:
passage_embeddings.keys()

dict_keys(['dense_vecs', 'lexical_weights', 'colbert_vecs'])

## Defining the Vespa application
[PyVespa](https://pyvespa.readthedocs.io/en/latest/) helps us build the [Vespa application package](https://docs.vespa.ai/en/application-packages.html). 
A Vespa application package consists of configuration files, schemas, models, and code (plugins).   

First, we define a [Vespa schema](https://docs.vespa.ai/en/schemas.html) with the fields we want to store and their type. We
use Vespa [tensors](https://docs.vespa.ai/en/tensor-user-guide.html) to represent the 3 different M3 representations. 

- We use a mapped tensor denoted by `t{}` to represent the sparse lexical representation 
- We use an indexed tensor denoted by `x[1024]` to represent the dense single vector representation of 1024 dimensions
- For the colbert_rep (multi vector), we use a mixed tensor that combines a mapped and an indexed dimension. 

To reduce the resource footprint, we use the `bfloat16` tensor cell type, which saves 50% storage compared to `float`.
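To make the storage claim concrete, here is a rough back-of-the-envelope sketch in plain Python. It counts only the tensor cell bytes for the 30 x 1024 ColBERT output shown earlier and ignores label and index overhead. The `to_bfloat16` helper is hypothetical and simply truncates a float32 to its upper 16 bits; the conversion Vespa applies may round differently.

```python
import struct


def to_bfloat16(value: float) -> float:
    """Hypothetical helper: keep only the upper 16 bits of the float32 encoding."""
    bits = struct.unpack(">I", struct.pack(">f", value))[0]
    return struct.unpack(">f", struct.pack(">I", bits & 0xFFFF0000))[0]


# Cell storage for one passage with 30 token vectors of 1024 dimensions.
tokens, dims = 30, 1024
float_bytes = tokens * dims * 4      # float: 4 bytes per cell
bfloat16_bytes = tokens * dims * 2   # bfloat16: 2 bytes per cell
print(float_bytes, bfloat16_bytes)   # 122880 61440

# The price is precision: only about 8 bits of mantissa survive.
original = 0.14094123
truncated = to_bfloat16(original)
print(original, truncated)
```

The halved footprint comes with a small loss of precision per cell, which is worth keeping in mind when comparing scores computed in Vespa with scores computed from the full-precision model output.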

In [4]:
from vespa.package import Schema, Document, Field, FieldSet
m_schema = Schema(
            name="m3",
            document=Document(
                fields=[
                    Field(name="id", type="string", indexing=["summary"]),
                    Field(name="text", type="string", indexing=["summary", "index"]),
                    Field(name="lexical_rep", type="tensor<bfloat16>(t{})", indexing=["summary", "attribute"]),
                    Field(name="dense_rep", type="tensor<bfloat16>(x[1024])", indexing=["summary", "attribute"], attribute=["distance-metric: angular"]),
                    Field(name="colbert_rep", type="tensor<bfloat16>(t{}, x[1024])", indexing=["summary", "attribute"])
                ],
            ),
            fieldsets=[
                FieldSet(name = "default", fields = ["text"])
            ]
)

The above defines our `m3` schema with the original text and the three different representations.

In [5]:
from vespa.package import ApplicationPackage

vespa_app_name = "m3"
vespa_application_package = ApplicationPackage(
        name=vespa_app_name,
        schema=[m_schema]
) 

In the last step, we configure [ranking](https://docs.vespa.ai/en/ranking.html) by adding a `rank-profile` to the schema. 


We define three functions, one scoring function for each of the representations:

- dense (dense cosine similarity)
- sparse (sparse dot product)
- max_sim (The colbert max sim operation)

Then, we combine these three scoring functions using a linear combination with weights, as suggested
by the authors [here](https://github.com/FlagOpen/FlagEmbedding/tree/master/FlagEmbedding/BGE_M3#compute-score-for-text-pairs). 


In [6]:
from vespa.package import RankProfile, Function,  FirstPhaseRanking


semantic = RankProfile(
    name="m3hybrid", 
    inputs=[
        ("query(q_dense)", "tensor<bfloat16>(x[1024])"), 
        ("query(q_lexical)", "tensor<bfloat16>(t{})"), 
        ("query(q_colbert)", "tensor<bfloat16>(qt{}, x[1024])"),
        ("query(q_len_colbert)", "float"),
    ],
    functions=[
        Function(
            name="dense",
            expression="cosine_similarity(query(q_dense), attribute(dense_rep),x)"
        ),
        Function(
            name="lexical",
            expression="sum(query(q_lexical) * attribute(lexical_rep))"
        ),
        Function(
            name="max_sim",
            expression="sum(reduce(sum(query(q_colbert) * attribute(colbert_rep) , x),max, t),qt)/query(q_len_colbert)"
        )
    ],
    first_phase=FirstPhaseRanking(
        expression="0.4*dense + 0.2*lexical +  0.4*max_sim",
        rank_score_drop_limit=0.0
    ),
    match_features=["dense", "lexical", "max_sim"]
)
m_schema.add_rank_profile(semantic)

The `m3hybrid` rank-profile above defines the query input tensor types and three functions, written as Vespa
[tensor compute functions](https://docs.vespa.ai/en/reference/ranking-expressions.html#tensor-functions), that calculate
the M3 similarities for the dense, lexical, and ColBERT (`max_sim`) representations. 

The profile defines a single ranking phase that combines the three features linearly, using the suggested weights.
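To make the three ranking expressions easier to follow, the sketch below mirrors them in plain Python on tiny toy vectors. The function names and toy data are illustrative assumptions, not part of Vespa or FlagEmbedding, and real M3 vectors have 1024 dimensions.

```python
import math


def dense_score(q, d):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(q, d))
    norm_q = math.sqrt(sum(a * a for a in q))
    norm_d = math.sqrt(sum(b * b for b in d))
    return dot / (norm_q * norm_d)


def lexical_score(q_weights, d_weights):
    """Sparse dot product over the token ids shared by query and document."""
    return sum(w * d_weights[t] for t, w in q_weights.items() if t in d_weights)


def max_sim_score(q_vecs, d_vecs):
    """For each query token, take the best dot product against all document
    tokens, sum these maxima, and divide by the number of query tokens."""
    total = 0.0
    for q in q_vecs:
        total += max(sum(a * b for a, b in zip(q, d)) for d in d_vecs)
    return total / len(q_vecs)


def hybrid_score(dense, lexical, max_sim):
    """Linear combination with the weights used in the first-phase expression."""
    return 0.4 * dense + 0.2 * lexical + 0.4 * max_sim


# Toy data (assumed for illustration only).
q_dense, d_dense = [1.0, 0.0], [0.6, 0.8]
q_lex, d_lex = {"a": 0.5, "b": 0.2}, {"a": 0.4, "c": 0.9}
q_col = [[1.0, 0.0], [0.0, 1.0]]
d_col = [[0.6, 0.8], [0.8, 0.6], [0.0, 1.0]]

score = hybrid_score(
    dense_score(q_dense, d_dense),    # approximately 0.6
    lexical_score(q_lex, d_lex),      # approximately 0.2, only token "a" is shared
    max_sim_score(q_col, d_col),      # approximately 0.9, the mean of 0.8 and 1.0
)
print(score)  # approximately 0.64
```

In the Vespa expression, the inner `sum(..., x)` plays the role of the dot product, `reduce(..., max, t)` picks the best document token for each query token, and the outer sum over `qt` divided by `query(q_len_colbert)` is the average over query tokens.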

Using [match-features](https://docs.vespa.ai/en/reference/schema-reference.html#match-features), Vespa
returns selected features along with the hit in the SERP (result page).

In [7]:
from vespa.deployment import VespaDocker

vespa_docker = VespaDocker()
app = vespa_docker.deploy(application_package=vespa_application_package, debug=True)

Waiting for configuration server, 0/300 seconds...
Waiting for configuration server, 5/300 seconds...
Using plain http against endpoint http://localhost:8080/ApplicationStatus
Waiting for application status, 0/300 seconds...
Using plain http against endpoint http://localhost:8080/ApplicationStatus
Waiting for application status, 5/300 seconds...
Using plain http against endpoint http://localhost:8080/ApplicationStatus
Waiting for application status, 10/300 seconds...
Using plain http against endpoint http://localhost:8080/ApplicationStatus
Application is up!
Finished deployment.


The cell above deploys the app to a local Vespa container running in Docker. 

As the log shows, the deployment polls for up to 300 seconds until the endpoint is up. 

## Feed the M3 representations

We convert the three different representations to the Vespa feed format.

In [8]:
vespa_fields = {
    "text": passage[0],
    "lexical_rep": {key: float(value) for key, value in passage_embeddings['lexical_weights'][0].items()},
    "dense_rep":passage_embeddings['dense_vecs'][0].tolist(),
    "colbert_rep":  {index: passage_embeddings['colbert_vecs'][0][index].tolist() for index in range(passage_embeddings['colbert_vecs'][0].shape[0])}
}
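To see what the resulting structure looks like, here is a small sketch with toy stand-ins for the model outputs (three dimensions instead of 1024). It assumes the fields are serialized as JSON before being sent, which is why the integer keys used for `colbert_rep` end up as string labels of the mapped dimension.

```python
import json

# Toy stand-ins for the model outputs (assumed values, for illustration only).
lexical_weights = {"335": 0.14, "11679": 0.26}
dense_vec = [0.1, 0.2, 0.3]
colbert_vecs = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]

toy_fields = {
    # Mapped tensor: token id -> weight.
    "lexical_rep": {token_id: float(w) for token_id, w in lexical_weights.items()},
    # Indexed tensor: a plain list of cell values.
    "dense_rep": dense_vec,
    # Mixed tensor: token position -> vector for that token.
    "colbert_rep": {i: vec for i, vec in enumerate(colbert_vecs)},
}

payload = json.dumps(toy_fields)
decoded = json.loads(payload)
print(list(decoded["colbert_rep"].keys()))  # ['0', '1']
```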

In [9]:
app.feed_data_point(schema='m3', data_id=0, fields=vespa_fields)

<vespa.io.VespaResponse at 0x706034a586a0>

In [16]:
app.url

'http://localhost'

### Querying data

Now, we can also query our data. 

Read more about querying Vespa in:

- [Vespa Query API](https://docs.vespa.ai/en/query-api.html)
- [Vespa Query API reference](https://docs.vespa.ai/en/reference/query-api-reference.html)
- [Vespa Query Language API (YQL)](https://docs.vespa.ai/en/query-language.html)



In [10]:
query  = ["What is BGE M3?"]
query_embeddings = model.encode(query, return_dense=True, return_sparse=True, return_colbert_vecs=True)


encoding:   0%|          | 0/1 [00:00<?, ?it/s]You're using a XLMRobertaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
encoding: 100%|██████████| 1/1 [00:00<00:00,  1.65it/s]


The M3 colbert scoring function needs the query length to normalize the score to the range 0 to 1. This helps when combining
the score with the other scoring functions. 

In [11]:
query_length = query_embeddings['colbert_vecs'][0].shape[0]

In [12]:
query_fields = {
    "input.query(q_lexical)": {key: float(value) for key, value in query_embeddings['lexical_weights'][0].items()},
    "input.query(q_dense)": query_embeddings['dense_vecs'][0].tolist(),
    "input.query(q_colbert)":  str({index: query_embeddings['colbert_vecs'][0][index].tolist() for index in range(query_embeddings['colbert_vecs'][0].shape[0])}),
    "input.query(q_len_colbert)": query_length
}

In [13]:
from vespa.io import VespaQueryResponse
import json

response:VespaQueryResponse = app.query(
    yql="select id, text from m3 where ({targetHits:10}nearestNeighbor(dense_rep,q_dense))",
    ranking="m3hybrid",
    body={
        **query_fields
    }
)
assert response.is_successful()
print(json.dumps(response.hits[0], indent=2))

{
  "id": "index:m3_content/0/cfcd208456509d9b37146efc",
  "relevance": 0.5993382421532011,
  "source": "m3_content",
  "fields": {
    "matchfeatures": {
      "dense": 0.6259023168183205,
      "lexical": 0.1941967010498047,
      "max_sim": 0.7753449380397797
    },
    "text": "BGE M3 is an embedding model supporting dense retrieval, lexical matching and multi-vector interaction."
  }
}
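As a quick sanity check, the `relevance` in the response above can be reproduced from the returned match features and the weights of the first-phase expression. This is a plain-Python sketch that only reuses the numbers printed in the hit.

```python
# Match features copied from the hit above.
match_features = {
    "dense": 0.6259023168183205,
    "lexical": 0.1941967010498047,
    "max_sim": 0.7753449380397797,
}

# Weights from the first-phase expression: 0.4*dense + 0.2*lexical + 0.4*max_sim.
weights = {"dense": 0.4, "lexical": 0.2, "max_sim": 0.4}

relevance = sum(weights[name] * value for name, value in match_features.items())
print(relevance)  # agrees with the reported relevance, 0.5993382421532011, up to float rounding
```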


Notice the `matchfeatures` field, which returns the configured match-features from the rank-profile. We can 
use these to compare the torch model scoring with the computations specified in Vespa. 

Comparing the Vespa-computed scores with the scores from the model's torch code shows that they line up closely. The small differences are likely due to the reduced precision of the `bfloat16` cell type used in Vespa. 

In [38]:
model.compute_lexical_matching_score(passage_embeddings['lexical_weights'][0], query_embeddings['lexical_weights'][0])

0.1955444384366274

In [158]:
query_embeddings['dense_vecs'][0] @ passage_embeddings['dense_vecs'][0].T

0.6259037

In [39]:
model.colbert_score(query_embeddings['colbert_vecs'][0],passage_embeddings['colbert_vecs'][0])

tensor(0.7797)
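To quantify how closely the two sets of scores agree, the sketch below compares the numbers printed above. The values are copied from the outputs in this notebook; attributing the gap to `bfloat16` storage is a plausible assumption rather than something measured here.

```python
# Scores returned by Vespa as match features.
vespa_scores = {
    "dense": 0.6259023168183205,
    "lexical": 0.1941967010498047,
    "max_sim": 0.7753449380397797,
}

# Scores computed directly from the model outputs.
model_scores = {
    "dense": 0.6259037,
    "lexical": 0.1955444384366274,
    "max_sim": 0.7797,
}

rel_diffs = {}
for name, vespa_value in vespa_scores.items():
    abs_diff = abs(vespa_value - model_scores[name])
    rel_diffs[name] = abs_diff / model_scores[name]
    print(f"{name}: abs diff {abs_diff:.6f}, rel diff {rel_diffs[name]:.2%}")
```

All three relative differences stay below one percent, with the dense score agreeing most closely.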

### That is it! 

That is how easy it is to represent the brand new M3 FlagEmbedding representations in Vespa! Read more in the 
[M3 technical report](https://github.com/FlagOpen/FlagEmbedding/blob/master/FlagEmbedding/BGE_M3/BGE_M3.pdf). 

We can go ahead and delete the Vespa instance we deployed by:
