<picture>
  <source media="(prefers-color-scheme: dark)" srcset="https://assets.vespa.ai/logos/Vespa-logo-green-RGB.svg">
  <source media="(prefers-color-scheme: light)" srcset="https://assets.vespa.ai/logos/Vespa-logo-dark-RGB.svg">
  <img alt="#Vespa" width="200" src="https://assets.vespa.ai/logos/Vespa-logo-dark-RGB.svg" style="margin-bottom: 25px;">
</picture>

# Creating a code search application with Vespa and ModernBERT

In this notebook, we will demonstrate how to build a code search application using [Vespa](https://vespa.ai/) and the recently released [Alibaba-NLP/gte-modernbert-base](https://huggingface.co/Alibaba-NLP/gte-modernbert-base).

In December 2024, [Answer.AI](https://www.answer.ai) and [LightOn](https://www.lighton.ai/) [announced](https://huggingface.co/blog/modernbert) ModernBERT, a new and modernized version of [BERT](https://huggingface.co/papers/1810.04805).

BERT was released in 2018 and, in 2025, is still _the_ most downloaded text model on the [Huggingface Hub](https://huggingface.co/models?sort=downloads) (and 3rd overall). Much has been learned since then, and ModernBERT's goal was to incorporate the lessons of the past 7 years into a new base model.


![ModernBERT](https://www.answer.ai/posts/2024-12-19-modernbert/modernbert_pareto_curve.png)

And as you can see in the image above, they succeeded. ModernBERT is a Pareto improvement (better in both speed and performance) over BERT and its descendants.

Like BERT, ModernBERT is a base model trained on the masked language modeling (MLM) objective, that is, predicting a masked token. This means that it is a general-purpose model that can (and should) be fine-tuned on a wide range of tasks, like text classification, named entity recognition, and retrieval.

And only a month after its release, we are already starting to see many fine-tuned models based on ModernBERT.

Some notable examples are:
- [jxm/cde-small-v2](https://huggingface.co/jxm/cde-small-v2)
- [Alibaba-NLP/gte-modernbert-base](https://huggingface.co/Alibaba-NLP/gte-modernbert-base)

To view all the models that are fine-tuned on ModernBERT, you can check out its model tree at [Huggingface Hub](https://huggingface.co/models?other=base_model:finetune:answerdotai/ModernBERT-base).

## Code retrieval is where ModernBERT shines

As explained in this insightful [blog post](https://jina.ai/news/what-should-we-learn-from-modernbert/) by Jina.ai, ModernBERT performs particularly well on code retrieval, largely because it is trained on a lot of code and uses the [OLMo-tokenizer](https://huggingface.co/docs/transformers/en/model_doc/olmo), which was also trained on code.

![code tokens](https://jina-ai-gmbh.ghost.io/content/images/2025/01/code_tokens-cheat-2.svg)

As we can see, the ModernBERT tokenizer does a better job of tokenizing code.

For an excellent analysis of how the ModernBERT tokenizer handles code, see [ModernBERT-code-tokenizer-insights](https://github.com/kevlariii/ModernBERT-code-tokenizer-insights).

The ModernBERT authors shared their interest for programming-related tasks in their [blog post](https://huggingface.co/blog/modernbert):

> ... out of all the existing open source encoders, ModernBERT is in a class of its own on programming-related tasks. We’re particularly interested in what downstream uses this will lead to, in terms of improving programming assistants.

## Code retrieval benchmark

[CoIR](https://arxiv.org/pdf/2407.02883) is a comprehensive benchmark for code information retrieval. 

![CoIR](../_static/coir.png)


<div class="alert alert-info">
    Refer to <a href="https://pyvespa.readthedocs.io/en/latest/troubleshooting.html">troubleshooting</a>
    for any problem when running this guide.
</div>

In [36]:
#!pip3 install pyvespa vespacli datasets

## Retrieving the ONNX model for use with Vespa

Vespa supports both [embedding](https://docs.vespa.ai/en/embedding.html#huggingface-embedder) and [ranking](https://docs.vespa.ai/en/onnx.html) with ONNX models.

Previously, we would have to convert each model to ONNX format for use with Vespa.

Lucky for us, [Tom Aarsen](https://huggingface.co/tomaarsen) and [transformers.js](https://github.com/huggingface/transformers.js)-maintainer [@xenova](https://huggingface.co/Xenova) have done an amazing job in making sure new huggingface models are converted to ONNX format for us.


## Find the model you are interested in using with Vespa on Huggingface

https://huggingface.co/models?pipeline_tag=sentence-similarity&library=onnx&sort=trending is a good starting point.

When you have found a model you are interested in using, click the "Files"-tab and locate the download button of the ONNX model-file, right-click and copy the download link.

![Download ONNX model](../_static/download_hf.png)

## Making sure the configuration is correct

The [Vespa docs](https://docs.vespa.ai/en/reference/embedding-reference.html#huggingface-embedder) specify the following:

## Huggingface Embedder

An embedder using any Huggingface tokenizer, including multilingual tokenizers, to produce tokens which are then input to a supplied transformer model in ONNX model format.

The Huggingface embedder is configured in services.xml, within the `container` tag:

```
<container id="default" version="1.0">
    <component id="hf-embedder" type="hugging-face-embedder">
        <transformer-model path="my-models/model.onnx"/>
        <tokenizer-model path="my-models/tokenizer.json"/>
        <prepend>
          <query>query:</query>
          <document>passage:</document>
        </prepend>
    </component>
    ...
</container>
```

### Huggingface embedder reference config

In addition to embedder ONNX parameters:

Name | Occurrence | Description | Type | Default  
---|---|---|---|---  
transformer-model | One | Use to point to the transformer ONNX model file | model-type | N/A  
tokenizer-model | One | Use to point to the `tokenizer.json` Huggingface tokenizer configuration file | model-type | N/A  
max-tokens | One | The maximum number of tokens accepted by the transformer model | numeric | 512  
transformer-input-ids | One | The name or identifier for the transformer input IDs | string | input_ids  
transformer-attention-mask | One | The name or identifier for the transformer attention mask | string | attention_mask  
transformer-token-type-ids | One | The name or identifier for the transformer token type IDs. If the model does not use `token_type_ids` use `<transformer-token-type-ids/>` | string | token_type_ids  
transformer-output | One | The name or identifier for the transformer output | string | last_hidden_state  
pooling-strategy | One | How the output vectors of the ONNX model are pooled to obtain a single vector representation. Valid values are `mean` and `cls` | string | mean  
normalize | One | A boolean indicating whether to normalize the output embedding vector to unit length (length 1). Useful for `prenormalized-angular` distance-metric | boolean | false  
prepend | Optional | Instructions that are prepended to the text input before tokenization and inference. Useful for models that have been trained with specific prompt instructions. | Element `<query>` and/or `<document>`  | Optional query/doc prepend instruction.


In [1]:
# model_url = "https://huggingface.co/Alibaba-NLP/gte-modernbert-base/resolve/main/onnx/model_quantized.onnx?download=true"

# tokenizer_url = "https://huggingface.co/Alibaba-NLP/gte-modernbert-base/resolve/main/tokenizer.json?download=true"
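
The two URLs above follow the same pattern, so they can be built from the repository id and the file path. The sketch below assumes that pattern holds for other repositories as well; `hf_resolve_url` is a hypothetical helper written for this notebook, not part of any library, and it leaves out the optional `?download=true` suffix.

```python
from urllib.parse import quote


def hf_resolve_url(repo_id: str, filename: str, revision: str = "main") -> str:
    """Build a direct-download URL for a file in a Hugging Face model repo.

    Hypothetical helper: it only mirrors the URL pattern shown above.
    """
    return (
        f"https://huggingface.co/{repo_id}/resolve/"
        f"{quote(revision, safe='')}/{quote(filename)}"
    )


repo_id = "Alibaba-NLP/gte-modernbert-base"
model_url = hf_resolve_url(repo_id, "onnx/model_quantized.onnx")
tokenizer_url = hf_resolve_url(repo_id, "tokenizer.json")
print(model_url)
print(tokenizer_url)
```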

We will also run inference on a sample query to verify that we get the same results in Vespa as in the original model.

In [28]:
from datasets import load_dataset

dataset_id = "CoIR-Retrieval/cosqa"
ds = load_dataset(dataset_id, "corpus")

In [29]:
docs = ds["corpus"]

In [30]:
len(docs)

20604

In [31]:
sample_text, sample_id = docs[0]["text"], docs[0]["_id"]

In [32]:
from IPython.display import display, Markdown


def display_snippet(snippet):
    display(Markdown(f"**Snippet:**\n\n```python\n{snippet}\n```"))


display_snippet(sample_text)

**Snippet:**

```python
def writeBoolean(self, n):
        """
        Writes a Boolean to the stream.
        """
        t = TYPE_BOOL_TRUE

        if n is False:
            t = TYPE_BOOL_FALSE

        self.stream.write(t)
```

In [33]:
model_name = "gte-modernbert-base"
model_dims = 768

## Create an application package

The [application package](https://pyvespa.readthedocs.io/en/latest/reference-api.html#vespa.package.ApplicationPackage)
has all the Vespa configuration files -
create one from scratch:

In [21]:
from vespa.package import ServicesConfiguration
from vespa.configuration.services import (
    services,
    container,
    search,
    document_api,
    document_processing,
    content,
    redundancy,
    documents,
    document,
    node,
    nodes,
    gpu,
    component,
    components,
    transformer_model,
    max_tokens,
    pooling_strategy,
    resources,
)

application_name = "codesearcheval"
schema_name = "snippets"


services_config = ServicesConfiguration(
    application_name=application_name,
    services_config=services(
        container(
            search(),
            document_api(),
            document_processing(),
            components(
                component(
                    transformer_model(model_id="alibaba-gte-modernbert"),
                    max_tokens("8192"),
                    pooling_strategy("cls"),
                    type="hugging-face-embedder",
                    id=model_name,
                ),
            ),
            nodes(
                resources(
                    gpu(count="1", memory="16Gb"), vcpu="4", memory="16Gb", disk="125Gb"
                ),
                count="2",
            ),
            id=f"{application_name}_container",
            version="1.0",
        ),
        content(
            redundancy("1"),
            nodes(node(distribution_key="0", hostalias="node1")),
            documents(document(type=schema_name, mode="index")),
            id=f"{application_name}_content",
            version="1.0",
        ),
    ),
)
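
The embedder above is configured with `pooling_strategy("cls")`. To make the difference between the `cls` and `mean` strategies from the reference table concrete, here is a minimal pure-Python sketch. It illustrates the concept only and is not Vespa's implementation; the function names and the toy vectors are made up for this example.

```python
import math
from typing import List, Sequence


def pool(
    token_vectors: Sequence[Sequence[float]],
    attention_mask: Sequence[int],
    strategy: str = "mean",
) -> List[float]:
    """Collapse per-token vectors into one vector (illustration only)."""
    if strategy == "cls":
        # Use the vector of the first token, conventionally [CLS].
        return list(token_vectors[0])
    if strategy == "mean":
        # Average only the tokens that the attention mask marks as real.
        kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep]
        if not kept:
            raise ValueError("attention_mask selects no tokens")
        return [sum(column) / len(kept) for column in zip(*kept)]
    raise ValueError(f"unknown pooling strategy: {strategy!r}")


def l2_normalize(vector: Sequence[float]) -> List[float]:
    """Scale a vector to unit length, as the `normalize` option describes."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm > 0 else list(vector)


# Made-up 2-dimensional token vectors; the last token is padding.
tokens = [[3.0, 4.0], [1.0, 0.0], [9.0, 9.0]]
mask = [1, 1, 0]
cls_vec = pool(tokens, mask, "cls")    # first token only
mean_vec = pool(tokens, mask, "mean")  # average of the two unmasked tokens
unit_vec = l2_normalize(cls_vec)       # same direction, length 1
print(cls_vec, mean_vec, unit_vec)
```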

In [22]:
from vespa.package import (
    ApplicationPackage,
    Field,
    Schema,
    Document,
    HNSW,
    RankProfile,
    FieldSet,
    GlobalPhaseRanking,
    Function,
)


package = ApplicationPackage(
    name=application_name,
    schema=[
        Schema(
            name=schema_name,
            document=Document(
                fields=[
                    Field(name="id", type="string", indexing=["summary"]),
                    Field(
                        name="text",
                        type="string",
                        indexing=["index", "summary"],
                        index="enable-bm25",
                    ),
                    Field(
                        name="embedding",
                        type=f"tensor<float>(x[{model_dims}])",
                        indexing=[
                            "input text",
                            f"embed {model_name}",
                            "summary",
                            "index",
                            "attribute",
                        ],
                        ann=HNSW(distance_metric="angular"),
                        is_document_field=False,
                    ),
                ]
            ),
            fieldsets=[
                FieldSet(
                    name="default",
                    fields=[
                        "text",
                    ],
                )
            ],
            rank_profiles=[
                RankProfile(
                    name="bm25",
                    inputs=[("query(q)", f"tensor<float>(x[{model_dims}])")],
                    functions=[Function(name="bm25sum", expression="bm25(text)")],
                    first_phase="bm25sum",
                ),
                RankProfile(
                    name="semantic",
                    inputs=[("query(q)", f"tensor<float>(x[{model_dims}])")],
                    first_phase="closeness(field, embedding)",
                ),
                RankProfile(
                    name="fusion",
                    inherits="bm25",
                    inputs=[("query(q)", f"tensor<float>(x[{model_dims}])")],
                    first_phase="closeness(field, embedding)",
                    global_phase=GlobalPhaseRanking(
                        expression="reciprocal_rank_fusion(bm25sum, closeness(field, embedding))",
                        rerank_count=10,
                    ),
                ),
            ],
        )
    ],
    services_config=services_config,
)

Note that the application name cannot contain `-` or `_`.
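
The `fusion` rank profile combines the BM25 and semantic rankings with reciprocal rank fusion (RRF). As a rough sketch of the idea, assuming the common formulation where each ranking contributes `1 / (k + rank)` with `k = 60`, a document's fused score can be computed as below. The function is written for illustration and is not Vespa's code; the document ids other than `8939` are made up.

```python
from collections import defaultdict
from typing import Dict, List


def reciprocal_rank_fusion(rankings: List[List[str]], k: int = 60) -> Dict[str, float]:
    """Sum 1 / (k + rank) over every ranking a document appears in."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        # Ranks start at 1 for the best document in each list.
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return dict(scores)


# Hypothetical rankings: one ordered by BM25, one by closeness.
bm25_ranking = ["8939", "17", "42"]
semantic_ranking = ["8939", "42", "5"]

fused = reciprocal_rank_fusion([bm25_ranking, semantic_ranking])
best = max(fused, key=fused.get)
print(best, fused[best])
```

Under this assumption, a document ranked first by both signals scores `1/61 + 1/61 = 2/61`, roughly `0.0328`.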

## Deploy the Vespa application 

Deploy `package` to [Vespa Cloud](https://cloud.vespa.ai/),
without leaving the notebook, by creating an instance of
[VespaCloud](https://pyvespa.readthedocs.io/en/latest/reference-api.html#vespa.deployment.VespaCloud). The container cluster configured above requests a GPU node, which Vespa Cloud provisions for us.

If this step fails, please check
that `tenant_name` matches your tenant in the Vespa Cloud Console, and that you are authenticated, either with an API key or by logging in interactively.

In [23]:
package.to_files(application_name)

In [24]:
from vespa.deployment import VespaCloud
import os

os.environ["TOKENIZERS_PARALLELISM"] = "false"

# Replace with your tenant name from the Vespa Cloud Console
tenant_name = "vespa-team"

vespa_cloud = VespaCloud(
    tenant=tenant_name,
    application=application_name,
    key_content=os.getenv(
        "VESPA_TEAM_API_KEY", None
    ),  # Key is only used for CI/CD testing of this notebook. Can be removed if logging in interactively
    application_package=package,
)

Setting application...
Running: vespa config set application vespa-team.codesearcheval
Setting target cloud...
Running: vespa config set target cloud

Api-key found for control plane access. Using api-key.


In [25]:
app = vespa_cloud.deploy()

Deployment started in run 3 of dev-aws-us-east-1c for vespa-team.codesearcheval. This may take a few minutes the first time.
INFO    [08:12:29]  Deploying platform version 8.474.10 and application dev build 3 for dev-aws-us-east-1c of default ...
INFO    [08:12:29]  Using CA signed certificate version 1
INFO    [08:12:30]  Requested 2 nodes for container cluster 'codesearcheval_container' 8.474.10, downscaling to 1 nodes in dev
INFO    [08:12:34]  Requested 2 nodes for container cluster 'codesearcheval_container' 8.474.10, downscaling to 1 nodes in dev
INFO    [08:12:35]  Session 336015 for tenant 'vespa-team' prepared, but activation failed: 1/2 application hosts and 2/2 admin hosts for vespa-team.codesearcheval have completed provisioning and bootstrapping, still waiting for h112918.dev.us-east-1c.aws.vespa-cloud.net
INFO    [08:12:36]  Deploying platform version 8.474.10 and application dev build 3 for dev-aws-us-east-1c of default ...
INFO    [08:12:36]  1/2 application hosts and 2

`app` now holds a reference to a [Vespa](https://pyvespa.readthedocs.io/en/latest/reference-api.html#vespa.application.Vespa) instance.

## Feeding documents to Vespa

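Each feed operation is a dict with a document `id` and a `fields` dict. Here the document id is the position of the snippet in the corpus, while the CoIR corpus id (for example `d10`) is stored in the `id` field: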

In [34]:
from datasets import load_dataset

vespa_feed = [
    {"id": str(i), "fields": {"text": doc["text"], "id": doc["_id"]}}
    for i, doc in enumerate(docs)
]

Now we can feed to Vespa using `feed_async_iterable`, which accepts any `Iterable` and an optional callback function where we can
check the outcome of each operation. The application is configured to use [embedding](https://docs.vespa.ai/en/embedding.html)
functionality, which produces a vector embedding from the `text` field. This step is computationally expensive. Read more
about embedding inference in Vespa in the blog post [Accelerating Transformer-based Embedding Retrieval with Vespa](https://blog.vespa.ai/accelerating-transformer-based-embedding-retrieval-with-vespa/).

In [35]:
import nest_asyncio

nest_asyncio.apply()

In [36]:
from vespa.io import VespaResponse, VespaQueryResponse


def callback(response: VespaResponse, id: str):
    if not response.is_successful():
        print(f"Error when feeding document {id}: {response.get_json()}")


app.feed_async_iterable(vespa_feed, schema=schema_name, callback=callback)
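
The callback above only prints failures. When feeding many documents, it can be handy to collect them for a retry or a summary instead. The sketch below is one possible variant; it assumes only that the response object offers `is_successful()` and `get_json()`, as used above. `make_collecting_callback` and `FakeResponse` are hypothetical names, and `FakeResponse` is a stand-in so the example runs without a Vespa instance.

```python
from typing import Any, Callable, Dict, List, Tuple


def make_collecting_callback() -> Tuple[Callable[[Any, str], None], List[Dict[str, Any]]]:
    """Return a feed callback plus the list it appends failed operations to."""
    failures: List[Dict[str, Any]] = []

    def callback(response: Any, id: str) -> None:
        # Same check as the printing callback, but keep the details.
        if not response.is_successful():
            failures.append({"id": id, "error": response.get_json()})

    return callback, failures


class FakeResponse:
    """Stand-in for a feed response, for demonstration only."""

    def __init__(self, ok: bool, payload: Dict[str, Any]):
        self._ok = ok
        self._payload = payload

    def is_successful(self) -> bool:
        return self._ok

    def get_json(self) -> Dict[str, Any]:
        return self._payload


collecting_callback, failed = make_collecting_callback()
collecting_callback(FakeResponse(True, {}), "0")
collecting_callback(FakeResponse(False, {"message": "simulated failure"}), "1")
print(f"{len(failed)} failed operation(s): {failed}")
```

With a real deployment, `collecting_callback` would be passed as the `callback` argument in place of `callback`.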

## Querying Vespa


In [37]:
with app.syncio(connections=1) as session:
    query = "python compare timespan to number"
    response: VespaQueryResponse = session.query(
        yql="select * from sources * where userQuery() or ({targetHits:1000}nearestNeighbor(embedding,q))",
        query=query,
        ranking="fusion",
        body={"input.query(q)": f"embed({query})"},
    )
    assert response.is_successful()

In [45]:
response.hits

[{'id': 'id:snippets:snippets::8939',
  'relevance': 0.03278688524590164,
  'source': 'codesearcheval_content',
  'fields': {'sddocname': 'snippets',
   'documentid': 'id:snippets:snippets::8939',
   'embedding': {'type': 'tensor<float>(x[768])',
    'values': [-0.3958292603492737,
     -0.6250116229057312,
     0.6165985465049744,
     0.24815307557582855,
     0.5679518580436707,
     1.0657062530517578,
     -0.18048037588596344,
     0.6662303805351257,
     0.14211931824684143,
     0.8739501237869263,
     -1.0574798583984375,
     -2.5070528984069824,
     0.39734095335006714,
     -0.011718234978616238,
     1.1765754222869873,
     -0.1413334757089615,
     -0.4382267892360687,
     0.28211572766304016,
     1.7953213453292847,
     0.6295420527458191,
     -0.01530868373811245,
     -0.6990135312080383,
     -0.8234652876853943,
     1.2411839962005615,
     -1.5382877588272095,
     2.166571617126465,
     0.6404843926429749,
     -1.4861263036727905,
     2.1903836727142334

In [44]:
display_snippet(response.hits[0]["fields"]["text"])

**Snippet:**

```python
def timespan(start_time):
    """Return time in milliseconds from start_time"""

    timespan = datetime.datetime.now() - start_time
    timespan_ms = timespan.total_seconds() * 1000
    return timespan_ms
```

## Evaluating on CosQA

In [46]:
query_ds = load_dataset(dataset_id, "queries", split="queries")

In [47]:
query_ds.column_names

['_id', 'partition', 'text', 'title', 'language', 'meta_information']

In [48]:
len(query_ds)

20604

In [49]:
qrel_ds = "CoIR-Retrieval/cosqa-qrels"

In [50]:
qrels = load_dataset(dataset_id, "default", split="train")

In [51]:
ids_to_query = dict(zip(query_ds["_id"], query_ds["text"]))

In [52]:
for idx, (qid, q) in enumerate(ids_to_query.items()):
    print(f"qid: {qid}, query: {q}")
    if idx == 5:
        break

qid: q1, query: python code to write bool value 1
qid: q2, query: "python how to manipulate clipboard"
qid: q3, query: python colored output to html
qid: q4, query: python "create directory" using "relative path"
qid: q5, query: python column of an array
qid: q6, query: python calling a property returns "property object"


In [53]:
relevant_docs = [qrel for qrel in qrels if qrel["score"] == 1]

In [54]:
relevant_docs[0]

{'query-id': 'q10', 'corpus-id': 'd10', 'score': 1}

In [55]:
qrels = {qrel["query-id"]: qrel["corpus-id"] for qrel in relevant_docs}
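
The dict comprehension above keeps one corpus id per query, so if a query had several relevant documents, later rows would overwrite earlier ones. A sketch of an alternative that groups all relevant ids per query into a set is shown below. Whether CosQA contains such queries, and whether the evaluator used later accepts sets as values, are assumptions to verify; `build_qrels` is a hypothetical helper and the `d99` row is made up.

```python
from collections import defaultdict
from typing import Dict, Iterable, Mapping, Set


def build_qrels(rows: Iterable[Mapping], min_score: int = 1) -> Dict[str, Set[str]]:
    """Group relevant corpus ids by query id, keeping every match."""
    grouped: Dict[str, Set[str]] = defaultdict(set)
    for row in rows:
        if row["score"] >= min_score:
            grouped[row["query-id"]].add(row["corpus-id"])
    return dict(grouped)


# Rows in the same shape as the qrels dataset; 'd99' is invented for the demo.
sample_rows = [
    {"query-id": "q10", "corpus-id": "d10", "score": 1},
    {"query-id": "q10", "corpus-id": "d99", "score": 1},
    {"query-id": "q11", "corpus-id": "d11", "score": 0},
]
grouped_qrels = build_qrels(sample_rows)
print(grouped_qrels)
```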

In [56]:
corpus = dict(zip(ds["corpus"]["_id"], ds["corpus"]["text"]))

In [57]:
for idx, doc in enumerate(relevant_docs):
    print(f"qid: {doc['query-id']}, doc_id: {doc['corpus-id']}")
    print(f"query: {ids_to_query[doc['query-id']]}")
    print(f"doc: {corpus[doc['corpus-id']]}")
    if idx == 5:
        break

qid: q10, doc_id: d10
query: 1d array in char datatype in python
doc: def _convert_to_array(array_like, dtype):
        """
        Convert Matrix attributes which are array-like or buffer to array.
        """
        if isinstance(array_like, bytes):
            return np.frombuffer(array_like, dtype=dtype)
        return np.asarray(array_like, dtype=dtype)
qid: q19, doc_id: d19
query: python condition non none
doc: def _not(condition=None, **kwargs):
    """
    Return the opposite of input condition.

    :param condition: condition to process.

    :result: not condition.
    :rtype: bool
    """

    result = True

    if condition is not None:
        result = not run(condition, **kwargs)

    return result
qid: q26, doc_id: d26
query: accessing a column from a matrix in python
doc: def get_column(self, X, column):
        """Return a column of the given matrix.

        Args:
            X: `numpy.ndarray` or `pandas.DataFrame`.
            column: `int` or `str`.

        Retu

In [58]:
import vespa.querybuilder as qb


def semantic_query_fn(query_text: str, top_k: int) -> dict:
    return {
        "yql": str(
            qb.select("*")
            .from_(schema_name)
            .where(
                qb.nearestNeighbor(
                    field="embedding",
                    query_vector="q",
                    annotations={"targetHits": 1000},
                )
            )
        ),
        "query": query_text,
        "ranking": "fusion",
        "input.query(q)": f"embed({query_text})",
        "hits": top_k,
        "presentation.timing": True,  # return timing information in the response
    }
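
The query function above retrieves candidates with `nearestNeighbor` only. As far as we understand, `bm25(text)` only scores documents that match query terms, so the BM25 side of the `fusion` profile has little to contribute in that setup. A hybrid variant, sketched here with a plain YQL string modeled on the earlier query cell, adds `userQuery()` so that text matches are retrieved as well. `hybrid_query_fn` and `SCHEMA_NAME` are names introduced for this sketch.

```python
SCHEMA_NAME = "snippets"  # same value as schema_name earlier in the notebook


def hybrid_query_fn(query_text: str, top_k: int) -> dict:
    """Build a query body that retrieves by text match OR nearest neighbor."""
    yql = (
        "select * from " + SCHEMA_NAME + " where userQuery() or "
        "({targetHits:1000}nearestNeighbor(embedding,q))"
    )
    return {
        "yql": yql,
        "query": query_text,  # consumed by userQuery()
        "ranking": "fusion",
        "input.query(q)": f"embed({query_text})",
        "hits": top_k,
    }


example_body = hybrid_query_fn("python column of an array", 10)
print(example_body["yql"])
```

The function has the same signature as `semantic_query_fn`, so it could be swapped in wherever that function is used.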

In [59]:
%load_ext autoreload
%autoreload 2

In [63]:
# Take first 10 elements of ids_to_query for testing
ids_to_query_sample = dict(list(ids_to_query.items())[:10])

In [68]:
ids_to_query_sample

{'q1': 'python code to write bool value 1',
 'q2': '"python how to manipulate clipboard"',
 'q3': 'python colored output to html',
 'q4': 'python "create directory" using "relative path"',
 'q5': 'python column of an array',
 'q6': 'python calling a property returns "property object"',
 'q7': 'python combine wav file into one as separate channels',
 'q8': '+how to use range with a dictionary python',
 'q9': 'python compare timespan to number',
 'q10': '1d array in char datatype in python'}

In [75]:
qrels["q10"]

'd10'

In [77]:
corpus["d10"]

'def _convert_to_array(array_like, dtype):\n        """\n        Convert Matrix attributes which are array-like or buffer to array.\n        """\n        if isinstance(array_like, bytes):\n            return np.frombuffer(array_like, dtype=dtype)\n        return np.asarray(array_like, dtype=dtype)'

In [73]:
from vespa.io import VespaQueryResponse

response: VespaQueryResponse = app.query(
    body=semantic_query_fn("1d array in char datatype in python", 10)
)

In [74]:
response.hits

[{'id': 'id:snippets:snippets::5227',
  'relevance': 0.03278688524590164,
  'source': 'codesearcheval_content',
  'fields': {'sddocname': 'snippets',
   'documentid': 'id:snippets:snippets::5227',
   'embedding': {'type': 'tensor<float>(x[768])',
    'values': [0.43813827633857727,
     1.4908432960510254,
     -0.46516716480255127,
     0.4627063274383545,
     -0.7300481200218201,
     1.0950405597686768,
     0.0066570695489645,
     0.39412927627563477,
     -0.2707683742046356,
     1.6436455249786377,
     0.04062717407941818,
     -0.14972053468227386,
     -0.6485252380371094,
     0.42486441135406494,
     -1.3044203519821167,
     0.1290709376335144,
     1.780397653579712,
     0.35676419734954834,
     1.4024505615234375,
     1.303039789199829,
     0.019531793892383575,
     -0.2051764279603958,
     -0.09151089191436768,
     0.5581967830657959,
     0.4842919707298279,
     0.1260094791650772,
     0.7315834164619446,
     -0.38965755701065063,
     -1.0426690578460693,

In [66]:
from vespa.evaluation import VespaEvaluator
from datetime import datetime

dt_str = datetime.now().strftime("%Y%m%d%H%M%S")

evaluator = VespaEvaluator(
    queries=ids_to_query_sample,
    relevant_docs=qrels,
    vespa_query_fn=semantic_query_fn,
    app=app,
    name=f"{dataset_id.replace('/', '_')}_{model_name}_{dt_str}",
    write_csv=True,  # optionally write metrics to CSV
)

results = evaluator.run()
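
To help interpret retrieval metrics, here is a small sketch of three common ones for a single query with binary relevance: accuracy@k, reciprocal rank@k, and nDCG@k. These are textbook definitions written for illustration; we do not claim they match the evaluator's exact metric set or formulas. The ids `d3` and `d7` are made up.

```python
import math
from typing import Sequence, Set


def accuracy_at_k(ranked_ids: Sequence[str], relevant: Set[str], k: int) -> float:
    """1.0 if any relevant document appears in the top k, else 0.0."""
    return 1.0 if any(doc_id in relevant for doc_id in ranked_ids[:k]) else 0.0


def reciprocal_rank_at_k(ranked_ids: Sequence[str], relevant: Set[str], k: int) -> float:
    """1 / rank of the first relevant document in the top k, else 0.0."""
    for rank, doc_id in enumerate(ranked_ids[:k], start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(ranked_ids: Sequence[str], relevant: Set[str], k: int) -> float:
    """Binary-relevance nDCG: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc_id in enumerate(ranked_ids[:k], start=1)
        if doc_id in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0


ranked = ["d3", "d10", "d7"]  # hypothetical result list for query q10
relevant_ids = {"d10"}
acc = accuracy_at_k(ranked, relevant_ids, k=3)
rr = reciprocal_rank_at_k(ranked, relevant_ids, k=3)
ndcg = ndcg_at_k(ranked, relevant_ids, k=3)
print(acc, rr, ndcg)
```

Averaging the per-query values over all queries gives the aggregate metrics (for example, the mean of the reciprocal ranks is MRR@k).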

## Cleanup

In [24]:
vespa_cloud.delete()

Deactivated vespa-team.codesearch in dev.aws-us-east-1c
Deleted instance vespa-team.codesearch.default


## Next steps

This is just an intro to the capabilities of Vespa and pyvespa.
Browse the site to learn more about schemas, feeding and queries.
You can find more complex applications in
[examples](https://pyvespa.readthedocs.io/en/latest/examples.html).