<picture>
  <source media="(prefers-color-scheme: dark)" srcset="https://assets.vespa.ai/logos/Vespa-logo-green-RGB.svg">
  <source media="(prefers-color-scheme: light)" srcset="https://assets.vespa.ai/logos/Vespa-logo-dark-RGB.svg">
  <img alt="#Vespa" width="200" src="https://assets.vespa.ai/logos/Vespa-logo-dark-RGB.svg" style="margin-bottom: 25px;">
</picture>

# Adding and verifying ONNX-models to Vespa

TODO

<div class="alert alert-info">
    Refer to <a href="https://pyvespa.readthedocs.io/en/latest/troubleshooting.html">troubleshooting</a>
    if you run into problems when running this guide.
</div>

[Install pyvespa](https://pyvespa.readthedocs.io/), start the Docker daemon, and validate that at least 6 GB of memory is available:

In [None]:
!pip3 install pyvespa onnxruntime tokenizers datasets pandas nest_asyncio
!docker info | grep "Total Memory"

## Find the model you are interested in using with Vespa on Huggingface

https://huggingface.co/models?pipeline_tag=sentence-similarity&library=onnx&sort=trending is a good starting point.

## Download the ONNX-model you are interested in using with Vespa

When you have found a model you are interested in using, locate the download button of the ONNX model-file, right-click and copy the link. (You need to remo

In [2]:
application_name = "onnx"

In [None]:
import requests
from pathlib import Path

model_url = "https://huggingface.co/nomic-ai/modernbert-embed-base/resolve/main/onnx/model_uint8.onnx?download=true"

tokenizer_url = "https://huggingface.co/nomic-ai/modernbert-embed-base/resolve/main/tokenizer.json?download=true"


def download_file(url: str, local_model_dir: Path = Path("models")) -> Path:
    # Extract the filename (exclude query parameters)
    filename = url.split("/")[-1].split("?")[0]
    local_model_path = local_model_dir / filename
    local_model_dir.mkdir(parents=True, exist_ok=True)
    if not local_model_path.exists():
        # Only hit the network when the file is not already present locally
        r = requests.get(url)
        r.raise_for_status()  # fail loudly instead of saving an error page
        with open(local_model_path, "wb") as f:
            f.write(r.content)
        print(f"Downloaded file to {local_model_path}")
    else:
        print(f"File already exists at {local_model_path}")
    return local_model_path


model_path = download_file(model_url)
tokenizer_path = download_file(tokenizer_url)

In [2]:
model_path, tokenizer_path

(PosixPath('models/model_uint8.onnx'), PosixPath('models/tokenizer.json'))
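The filename handling in `download_file` relies on string splitting. As a minimal alternative sketch, the standard library's `urllib.parse` can separate the path from the query string. The helper name `filename_from_url` is hypothetical and not part of this notebook.

```python
from pathlib import PurePosixPath
from urllib.parse import urlsplit


def filename_from_url(url: str) -> str:
    """Return the last path segment of a URL, ignoring any query string."""
    # urlsplit separates the path from "?download=true" and similar parameters
    return PurePosixPath(urlsplit(url).path).name


url = (
    "https://huggingface.co/nomic-ai/modernbert-embed-base"
    "/resolve/main/onnx/model_uint8.onnx?download=true"
)
print(filename_from_url(url))
```

This also copes with URL fragments, which the simple `split("?")` approach would leave in the filename.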

In [65]:
import onnxruntime as ort
from typing import Dict, List, Tuple
import numpy as np


def inspect_onnx_model(model_path: str) -> Tuple[List[Dict], List[Dict]]:
    """
    Inspect the inputs and outputs of an ONNX model.

    Args:
        model_path (str): Path to the .onnx file

    Returns:
        Tuple[List[Dict], List[Dict]]: Lists containing input and output metadata
    """
    try:
        # Create inference session
        session = ort.InferenceSession(model_path, providers=["CPUExecutionProvider"])

        # Get input details
        inputs = []
        for input_meta in session.get_inputs():
            input_info = {
                "name": input_meta.name,
                "shape": input_meta.shape,
                "type": input_meta.type,
            }
            inputs.append(input_info)

        # Get output details
        outputs = []
        for output_meta in session.get_outputs():
            output_info = {
                "name": output_meta.name,
                "shape": output_meta.shape,
                "type": output_meta.type,
            }
            outputs.append(output_info)

        return inputs, outputs

    except Exception as e:
        print(f"Error loading model: {str(e)}")
        return [], []


inputs, outputs = inspect_onnx_model(model_path)

print("\nModel Inputs:")
for idx, inp in enumerate(inputs):
    print(f"\nInput {idx}:")
    print(f"  Name: {inp['name']}")
    print(f"  Shape: {inp['shape']}")
    print(f"  Type: {inp['type']}")

print("\nModel Outputs:")
for idx, out in enumerate(outputs):
    print(f"\nOutput {idx}:")
    print(f"  Name: {out['name']}")
    print(f"  Shape: {out['shape']}")
    print(f"  Type: {out['type']}")


Model Inputs:

Input 0:
  Name: input_ids
  Shape: ['batch_size', 'sequence_length']
  Type: tensor(int64)

Input 1:
  Name: attention_mask
  Shape: ['batch_size', 'sequence_length']
  Type: tensor(int64)

Model Outputs:

Output 0:
  Name: token_embeddings
  Shape: ['batch_size', 'sequence_length', 768]
  Type: tensor(float)

Output 1:
  Name: sentence_embedding
  Shape: ['batch_size', 768]
  Type: tensor(float)


We will also run inference on a sample text to verify that Vespa produces the same results as the original model.

In [None]:
from datasets import load_dataset

ds = load_dataset("zeta-alpha-ai/NanoClimateFEVER", "corpus", split="train")

In [93]:
sample_text, sample_id = ds[0]["text"], ds[0]["_id"]
sample_text, sample_id

("The 1993 Storm of the Century ( also known as the 93 Super Storm or the Great Blizzard of 1993 ) was a large cyclonic storm that formed over the Gulf of Mexico on March 12 , 1993 . The storm eventually dissipated in the North Atlantic Ocean on March 15 , 1993 . It was unique for its intensity , massive size , and wide-reaching effects . At its height , the storm stretched from Canada to the Gulf of Mexico . The cyclone moved through the Gulf of Mexico and then through the eastern United States before moving onto Canada .   Heavy snow was first reported in highland areas as far south as Alabama and northern Georgia , with Union County , Georgia reporting up to 35 inches of snow in the north Georgia mountains . Birmingham , Alabama , reported a rare 13 in of snow .  The Florida Panhandle reported up to 4 in , with hurricane-force wind gusts and record low barometric pressures . Between Louisiana and Cuba , the hurricane-force winds produced high storm surges across Northwestern Florida

In [57]:
from tokenizers import Tokenizer


tokenizer = Tokenizer.from_file(str(tokenizer_path))

In [64]:
encoded = tokenizer.encode(sample_text)
input_ids = encoded.ids
attention_mask = encoded.attention_mask  # 1 for real tokens, 0 for padding
print(len(input_ids))
assert len(input_ids) == len(attention_mask)

287


In [68]:
# Perform inference with the ONNX model. Provide token-ids and attention-mask as input


def onnx_inference(
    model_path: str, input_data: dict, session: ort.InferenceSession
) -> np.ndarray:
    """
    Perform inference with an ONNX model.

    Args:
        model_path (str): Path to the .onnx file
        input_data (dict): Input data for the model
        session (ort.InferenceSession): ONNX Inference Session

    Returns:
        np.ndarray: Model output
    """
    try:
        # Perform inference
        output = session.run(None, input_data)
        return output[0]

    except Exception as e:
        print(f"Error running inference: {str(e)}")
        return None


input_dict = {
    "input_ids": np.array([input_ids], dtype=np.int64),
    "attention_mask": np.array([attention_mask], dtype=np.int64),
}
output_embedding = onnx_inference(
    model_path, input_dict, ort.InferenceSession(model_path)
)
output_embedding.shape

(1, 287, 768)

In [76]:
output_emb_pooled = np.mean(output_embedding, axis=1).flatten()
output_emb_pooled.shape

(768,)
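The plain `np.mean` above is equivalent to attention-masked mean pooling only when there is no padding, as is the case for a single sequence. The following is a minimal sketch of the masked variant, together with a cosine-similarity helper that could be used to compare the pooled vector with an embedding produced elsewhere. The function names and the toy data are assumptions for illustration. Whether Vespa's embedder yields an identical vector depends on its configuration (for example the `prepend` strings), so treat such a comparison as a sanity check.

```python
import numpy as np


def masked_mean_pool(token_embeddings: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    """Average token embeddings over positions where attention_mask == 1.

    token_embeddings has shape (batch, seq_len, dims),
    attention_mask has shape (batch, seq_len).
    """
    mask = attention_mask[..., np.newaxis].astype(token_embeddings.dtype)
    summed = (token_embeddings * mask).sum(axis=1)
    counts = np.clip(mask.sum(axis=1), 1e-9, None)  # avoid division by zero
    return summed / counts


def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two 1-d vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


# Toy data: 1 sequence, 3 tokens (the last one is padding), 2 dimensions
emb = np.array([[[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]])
mask = np.array([[1, 1, 0]])
pooled = masked_mean_pool(emb, mask)[0]
print(pooled)
```

The padded position does not contribute to the average, which is why the masked version is the safer choice once several texts are batched together.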

In [43]:
model_name = "modernbert-embed-base"
model_dims = 768
matryoshka_dims = 256

## Create an application package

The [application package](https://pyvespa.readthedocs.io/en/latest/reference-api.html#vespa.package.ApplicationPackage)
contains all the Vespa configuration files.
Create one from scratch:

In [78]:
from vespa.package import (
    ApplicationPackage,
    Field,
    Schema,
    Document,
    HNSW,
    RankProfile,
    Component,
    Parameter,
    FieldSet,
    GlobalPhaseRanking,
    Function,
)

application_name = "onnx"

package = ApplicationPackage(
    name=application_name,
    schema=[
        Schema(
            name="doc",
            document=Document(
                fields=[
                    Field(name="id", type="string", indexing=["summary"]),
                    Field(
                        name="text",
                        type="string",
                        indexing=["index", "summary"],
                        index="enable-bm25",
                    ),
                    Field(
                        name="embedding",
                        type=f"tensor<float>(x[{model_dims}])",
                        indexing=[
                            "input text",
                            f"embed {model_name}",
                            "summary",
                            "index",
                            "attribute",
                        ],
                        ann=HNSW(distance_metric="angular"),
                        is_document_field=False,
                    ),
                ]
            ),
            fieldsets=[
                FieldSet(
                    name="default",
                    fields=[
                        "text",
                    ],
                )
            ],
            rank_profiles=[
                RankProfile(
                    name="bm25",
                    inputs=[("query(q)", f"tensor<float>(x[{model_dims}])")],
                    functions=[Function(name="bm25sum", expression="bm25(text)")],
                    first_phase="bm25sum",
                ),
                RankProfile(
                    name="semantic",
                    inputs=[("query(q)", f"tensor<float>(x[{model_dims}])")],
                    first_phase="closeness(field, embedding)",
                ),
                RankProfile(
                    name="fusion",
                    inherits="bm25",
                    inputs=[("query(q)", f"tensor<float>(x[{model_dims}])")],
                    first_phase="closeness(field, embedding)",
                    global_phase=GlobalPhaseRanking(
                        expression="reciprocal_rank_fusion(bm25sum, closeness(field, embedding))",
                        rerank_count=1000,
                    ),
                ),
            ],
        )
    ],
    components=[
        Component(
            id=f"{model_name}",
            type="hugging-face-embedder",
            parameters=[
                Parameter(
                    "transformer-model",
                    {"url": model_url.split("?")[0]},
                ),
                Parameter(
                    "tokenizer-model",
                    {"url": tokenizer_url.split("?")[0]},
                ),
                Parameter(
                    "transformer-token-type-ids",
                    {},
                ),
                Parameter(
                    "max-tokens",
                    {},
                    "8192",
                ),
                Parameter("transformer-output", {}, children="token_embeddings"),
                Parameter(
                    "prepend",
                    {},
                    children=[
                        Parameter("query", {}, "search_query:"),
                        Parameter("document", {}, "search_document:"),
                    ],
                ),
            ],
        )
    ],
)

Note that the application name cannot contain `-` or `_`.

## Deploy the Vespa application 

Deploy `package` on the local machine using Docker,
without leaving the notebook, by creating an instance of
[VespaDocker](https://pyvespa.readthedocs.io/en/latest/reference-api.html#vespa.deployment.VespaDocker). `VespaDocker` connects
to the local Docker daemon socket and starts the [Vespa docker image](https://hub.docker.com/r/vespaengine/vespa/). 

If this step fails, please check
that the Docker daemon is running, and that the Docker daemon socket can be used by clients (Configurable under advanced settings in Docker Desktop).

In [41]:
package.to_files(application_name)

In [102]:
from vespa.deployment import VespaDocker

vespa_docker = VespaDocker()
app = vespa_docker.deploy(package)

Waiting for configuration server, 0/60 seconds...
Waiting for configuration server, 5/60 seconds...
Waiting for application to come up, 0/300 seconds.
Waiting for application to come up, 5/300 seconds.
Application is up!
Finished deployment.


`app` now holds a reference to a [Vespa](https://pyvespa.readthedocs.io/en/latest/reference-api.html#vespa.application.Vespa) instance.

## Feeding documents to Vespa

In this example we reuse the `ds` corpus that was loaded above with the [HF Datasets](https://huggingface.co/docs/datasets/index) library
(`zeta-alpha-ai/NanoClimateFEVER`) and index it in our newly deployed Vespa instance.

The list comprehension below converts the dataset fields into the feed format that `pyvespa` expects,
which is a dict with the keys `id` and `fields`:

` { "id": "vespa-document-id", "fields": {"vespa_field": "vespa-field-value"}} `
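For larger corpora it can be preferable to produce the feed lazily instead of building a full list in memory. The following is a minimal sketch of a generator that yields operations in the format above. The helper name `to_vespa_feed` and the sample rows are hypothetical; the row keys `_id` and `text` match the dataset used in this guide.

```python
from typing import Dict, Iterable, Iterator


def to_vespa_feed(rows: Iterable[Dict]) -> Iterator[Dict]:
    """Yield one feed operation per row in the {"id", "fields"} shape."""
    for i, row in enumerate(rows):
        yield {
            "id": str(i),  # Vespa document id
            "fields": {"id": row["_id"], "text": row["text"]},
        }


# Hypothetical rows with the same keys as the corpus used in this guide
rows = [{"_id": "doc-a", "text": "first"}, {"_id": "doc-b", "text": "second"}]
feed = list(to_vespa_feed(rows))
print(feed[0])
```

Because `feed_iterable` accepts any `Iterable`, a generator like this can be passed to it directly.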

In [103]:
from datasets import load_dataset

vespa_feed = [
    {
        "id": str(i),
        "fields": {"text": x["text"], "id": x["_id"]},
    }
    for i, x in enumerate(ds)
]

Now we can feed to Vespa using `feed_iterable`, which accepts any `Iterable` and an optional callback function where we can
check the outcome of each operation. The application is configured to use [embedding](https://docs.vespa.ai/en/embedding.html)
functionality, which produces a vector embedding from the `text` input field. This step is computationally expensive. Read more
about embedding inference in Vespa in the blog post [Accelerating Transformer-based Embedding Retrieval with Vespa](https://blog.vespa.ai/accelerating-transformer-based-embedding-retrieval-with-vespa/).
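When many operations fail, printing each error makes the output hard to read. As a sketch, a callable object can collect failures for later inspection instead. `FailureCollector` and `FakeResponse` are hypothetical names. The collector only assumes that the response exposes `is_successful()` and `get_json()`, the two methods used by the callback in this guide.

```python
from typing import List, Tuple


class FailureCollector:
    """Callback object that records failed feed operations instead of printing them."""

    def __init__(self) -> None:
        self.failures: List[Tuple[str, dict]] = []

    def __call__(self, response, id: str) -> None:
        # `response` is expected to offer is_successful() and get_json()
        if not response.is_successful():
            self.failures.append((id, response.get_json()))


class FakeResponse:
    """Stand-in for a feed response, used only to exercise the collector."""

    def __init__(self, ok: bool) -> None:
        self._ok = ok

    def is_successful(self) -> bool:
        return self._ok

    def get_json(self) -> dict:
        return {"ok": self._ok}


collector = FailureCollector()
collector(FakeResponse(True), "0")
collector(FakeResponse(False), "1")
print(collector.failures)
```

An instance could then be passed as the `callback` argument of `feed_iterable`, and `collector.failures` inspected or retried once feeding has finished.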

In [None]:
from vespa.io import VespaResponse, VespaQueryResponse
import nest_asyncio

nest_asyncio.apply()


def callback(response: VespaResponse, id: str):
    if not response.is_successful():
        print(f"Error when feeding document {id}: {response.get_json()}")


app.feed_iterable(vespa_feed, schema="doc", callback=callback)

Error when feeding document 0: {'Exception': "('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))", 'id': '0', 'message': 'Exception during feed_data_point'}
Error when feeding document 2: {'Exception': "('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))", 'id': '2', 'message': 'Exception during feed_data_point'}
Error when feeding document 1: {'Exception': "('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))", 'id': '1', 'message': 'Exception during feed_data_point'}
Error when feeding document 3: {'Exception': "('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))", 'id': '3', 'message': 'Exception during feed_data_point'}
Error when feeding document 4: {'Exception': "('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))", 'id': '4', 'message': 'Exception during feed_data_point'}
Error

Exception in thread Thread-335 (_consumer):
Traceback (most recent call last):
  File "/Users/thomas/.local/share/uv/python/cpython-3.10.14-macos-aarch64-none/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
    self.run()
  File "/Users/thomas/Repos/pyvespa/.venv/lib/python3.10/site-packages/ipykernel/ipkernel.py", line 766, in run_closure
    _threading_Thread_run(self)
  File "/Users/thomas/.local/share/uv/python/cpython-3.10.14-macos-aarch64-none/lib/python3.10/threading.py", line 953, in run
    self._target(*self._args, **self._kwargs)
  File "/Users/thomas/Repos/pyvespa/vespa/application.py", line 460, in _consumer
    future: Future = executor.submit(_submit, doc, sync_session)
  File "/Users/thomas/.local/share/uv/python/cpython-3.10.14-macos-aarch64-none/lib/python3.10/concurrent/futures/thread.py", line 167, in submit
    raise RuntimeError('cannot schedule new futures after shutdown')
RuntimeError: cannot schedule new futures after shutdown


Error when feeding document 196: {'Exception': "('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))", 'id': '196', 'message': 'Exception during feed_data_point'}
Error when feeding document 198: {'Exception': "('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))", 'id': '198', 'message': 'Exception during feed_data_point'}
Error when feeding document 200: {'Exception': "('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))", 'id': '200', 'message': 'Exception during feed_data_point'}
Error when feeding document 202: {'Exception': "('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))", 'id': '202', 'message': 'Exception during feed_data_point'}


## Querying Vespa

Using the [Vespa Query language](https://docs.vespa.ai/en/query-language.html) we can query the indexed data. 

- Using a context manager `with app.syncio() as session` to handle connection pooling ([best practices](https://cloud.vespa.ai/en/http-best-practices))
- The query method accepts any valid Vespa [query api parameter](https://docs.vespa.ai/en/reference/query-api-reference.html) in `**kwargs`
- Vespa API parameter names that contain `.` must be sent as `dict` parameters in the `body` method argument
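The last point can be illustrated with a small sketch: a parameter name such as `input.query(q)` is not a valid Python keyword argument, so it has to be a key in a dict. The helper `build_query_body` is a hypothetical name and simply assembles such a dict from values used in this guide.

```python
from typing import Dict


def build_query_body(yql: str, query: str, ranking: str, tensor_name: str = "q") -> Dict[str, str]:
    """Assemble a request body where dotted parameter names are plain dict keys."""
    return {
        "yql": yql,
        "query": query,
        "ranking": ranking,
        # The key contains "." and "(", so it cannot be a Python keyword argument
        f"input.query({tensor_name})": f"embed({query})",
    }


body = build_query_body(
    yql="select * from sources * where userQuery() limit 5",
    query="How Fruits and Vegetables Can Treat Asthma?",
    ranking="semantic",
)
print(sorted(body))
```

A dict like this can be passed as the `body` argument of `session.query`, in the same way the cells in this guide pass `body={"input.query(q)": ...}`.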

The following searches for `How Fruits and Vegetables Can Treat Asthma?` using different retrieval and [ranking](https://docs.vespa.ai/en/ranking.html) strategies.

In [22]:
import pandas as pd


def display_hits_as_df(response: VespaQueryResponse, fields) -> pd.DataFrame:
    records = []
    for hit in response.hits:
        record = {}
        for field in fields:
            record[field] = hit["fields"][field]
        records.append(record)
    return pd.DataFrame(records)

### Plain Keyword search 
The following uses plain keyword search functionality with [bm25](https://docs.vespa.ai/en/reference/bm25.html) ranking. The `bm25` rank-profile was configured in the
application package to use the bm25 score of the query terms against the `text` field.
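To give an intuition for what a BM25 score measures, here is a minimal pure-Python sketch of the textbook Okapi BM25 formula. It is not Vespa's implementation. The function name, the toy corpus, and the parameter values `k1=1.2` and `b=0.75` (commonly used defaults) are assumptions for illustration.

```python
import math
from collections import Counter
from typing import List


def bm25_score(
    query_terms: List[str],
    doc_terms: List[str],
    corpus: List[List[str]],
    k1: float = 1.2,
    b: float = 0.75,
) -> float:
    """Textbook BM25 score of one tokenized document for a tokenized query."""
    n_docs = len(corpus)
    avg_len = sum(len(d) for d in corpus) / n_docs
    tf = Counter(doc_terms)
    score = 0.0
    for term in query_terms:
        df = sum(1 for d in corpus if term in d)  # documents containing the term
        idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
        freq = tf[term]
        # Length normalization: long documents need more occurrences to score well
        norm = freq + k1 * (1 - b + b * len(doc_terms) / avg_len)
        score += idf * freq * (k1 + 1) / norm
    return score


corpus = [
    ["fruit", "helps", "asthma"],
    ["storm", "over", "mexico"],
    ["vegetables", "and", "fruit"],
]
print(bm25_score(["asthma", "fruit"], corpus[0], corpus))
```

Rare terms get a higher inverse document frequency, so a match on `asthma` (one document) contributes more than a match on `fruit` (two documents).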


In [1]:
with app.syncio(connections=1) as session:
    query = "How Fruits and Vegetables Can Treat Asthma?"
    response: VespaQueryResponse = session.query(
        yql="select * from sources * where userQuery() limit 5",
        query=query,
        ranking="bm25",
    )
    assert response.is_successful()
    print(display_hits_as_df(response, ["id", "text"]))

### Plain Semantic Search 
The following uses dense vector representations of the query and the document and matching is performed and accelerated by Vespa's support for
[approximate nearest neighbor search](https://docs.vespa.ai/en/approximate-nn-hnsw.html). 
The vector embedding representation of the text is obtained using Vespa's [embedder functionality](https://docs.vespa.ai/en/embedding.html#embedding-a-query-text).
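The `semantic` rank profile scores hits with `closeness(field, embedding)`, and the field uses the `angular` distance metric. As a rough sketch, and assuming that closeness is computed as `1 / (1 + distance)` with the distance being the angle between the two vectors, the score can be reproduced in plain Python. The function name is hypothetical, and the Vespa documentation remains the authority on the exact definition.

```python
import math
from typing import Sequence


def angular_closeness(a: Sequence[float], b: Sequence[float]) -> float:
    """Closeness under the assumption closeness = 1 / (1 + angle(a, b))."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    # Clamp to [-1, 1] to guard against floating point drift before acos
    cosine = max(-1.0, min(1.0, dot / norm))
    distance = math.acos(cosine)  # angle in radians
    return 1.0 / (1.0 + distance)


print(angular_closeness([1.0, 0.0], [1.0, 0.0]))
print(angular_closeness([1.0, 0.0], [0.0, 1.0]))
```

Under this assumption, identical directions give the maximum score and the score decreases as the angle between query and document vectors grows.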


In [23]:
with app.syncio(connections=1) as session:
    query = "How Fruits and Vegetables Can Treat Asthma?"
    response: VespaQueryResponse = session.query(
        yql="select * from sources * where ({targetHits:1000}nearestNeighbor(embedding,q)) limit 5",
        query=query,
        ranking="semantic",
        body={"input.query(q)": f"embed({query})"},
    )
    assert response.is_successful()
    print(display_hits_as_df(response, ["id", "text"]))

        id                                               text
0  MED-719  In addition to causing embarrassment and uneas...
1  MED-724  In addition to causing embarrassment and uneas...
2  MED-691  Nausea and vomiting are physiological processe...
3  MED-398  Summary Grapefruit is a popular, tasty and nut...
4  MED-712  Hibiscus sabdariffa Linne is a traditional Chi...


### Hybrid Search

This is one approach to combining the two retrieval strategies, using Vespa's support for
[cross-hit feature normalization and reciprocal rank fusion](https://docs.vespa.ai/en/phased-ranking.html#cross-hit-normalization-including-reciprocal-rank-fusion). This
functionality is exposed in the context of `global` re-ranking, after the distributed query retrieval execution, which might span thousands of nodes.
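The idea behind reciprocal rank fusion can be sketched in a few lines of Python: each document receives `1 / (k + rank)` from every ranking it appears in, and the contributions are summed. This is an illustrative sketch rather than Vespa's implementation. The function name, the toy rankings, and the constant `k=60` (a commonly used value) are assumptions.

```python
from typing import Dict, List


def reciprocal_rank_fusion(rankings: List[List[str]], k: int = 60) -> Dict[str, float]:
    """Sum 1 / (k + rank) over all rankings, with ranks starting at 1."""
    scores: Dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return scores


# Hypothetical orderings from a keyword ranking and a semantic ranking
bm25_order = ["a", "b", "c"]
semantic_order = ["b", "c", "a"]
fused = reciprocal_rank_fusion([bm25_order, semantic_order])
best = sorted(fused, key=fused.get, reverse=True)
print(best)
```

Only the rank positions matter, not the raw scores, which is why the method can combine BM25 and closeness scores even though they live on different scales.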

#### Hybrid search with the OR query operator

This combines the two methods using logical disjunction (OR). Note that the first-phase expression in our `fusion` rank profile uses only the semantic score,
because semantic search usually provides better recall than sparse keyword search alone.



In [1]:
with app.syncio(connections=1) as session:
    query = "How Fruits and Vegetables Can Treat Asthma?"
    response: VespaQueryResponse = session.query(
        yql="select * from sources * where userQuery() or ({targetHits:1000}nearestNeighbor(embedding,q)) limit 5",
        query=query,
        ranking="fusion",
        body={"input.query(q)": f"embed({query})"},
    )
    assert response.is_successful()
    print(display_hits_as_df(response, ["id", "text"]))

#### Hybrid search with the RANK query operator

This combines the two methods using the [rank](https://docs.vespa.ai/en/reference/query-language-reference.html#rank) query operator. In this case
we express that we want to retrieve the top 1000 documents using vector search, and then have sparse features like BM25 calculated as well (second operand
of the rank operator). Finally, the hits are re-ranked using reciprocal rank fusion.


In [1]:
with app.syncio(connections=1) as session:
    query = "How Fruits and Vegetables Can Treat Asthma?"
    response: VespaQueryResponse = session.query(
        yql="select * from sources * where rank({targetHits:1000}nearestNeighbor(embedding,q), userQuery()) limit 5",
        query=query,
        ranking="fusion",
        body={"input.query(q)": f"embed({query})"},
    )
    assert response.is_successful()
    print(display_hits_as_df(response, ["id", "text"]))

#### Hybrid search with filters

In this example we add another query term to the YQL, restricting the nearest neighbor search to only consider documents that have `vegetable` in the `text` field.

In [1]:
with app.syncio(connections=1) as session:
    query = "How Fruits and Vegetables Can Treat Asthma?"
    response: VespaQueryResponse = session.query(
        yql='select * from sources * where text contains "vegetable" and rank({targetHits:1000}nearestNeighbor(embedding,q), userQuery()) limit 5',
        query=query,
        ranking="fusion",
        body={"input.query(q)": f"embed({query})"},
    )
    assert response.is_successful()
    print(display_hits_as_df(response, ["id", "text"]))

## Cleanup

In [1]:
vespa_docker.container.stop()
vespa_docker.container.remove()

## Next steps

This is just an introduction to the capabilities of Vespa and pyvespa.
Browse the site to learn more about schemas, feeding and queries.
Find more complex applications in
[examples](https://pyvespa.readthedocs.io/en/latest/examples.html).