# HuggingFace-to-Onnx

Example of exporting a text embedding model and tokenizer from HuggingFace to ONNX


In [2]:
import onnx
import onnxruntime as ort
import numpy as np
from sentence_transformers import SentenceTransformer, export_optimized_onnx_model

## Attempt 1: Open model and export to ONNX


In [None]:
# embedding_model = SentenceTransformer('all-MiniLM-L6-v2')   # , backend='onnx', model_kwargs={'file_name': 'model.onnx'})
embedding_model = SentenceTransformer(
    "all-MiniLM-L6-v2", backend="onnx", model_kwargs={"file_name": "model.onnx"}
)

[0;93m2025-05-01 08:52:23.154699 [W:onnxruntime:, helper.cc:83 IsInputSupported] CoreML does not support input dim > 16384. Input:embeddings.word_embeddings.weight, shape: {30522,384}[m
[0;93m2025-05-01 08:52:23.155121 [W:onnxruntime:, coreml_execution_provider.cc:112 GetCapability] CoreMLExecutionProvider::GetCapability, number of partitions supported by CoreML: 55 number of nodes in the graph: 418 number of nodes supported by CoreML: 278[m
[0;93m2025-05-01 08:52:23.641367 [W:onnxruntime:, session_state.cc:1263 VerifyEachNodeIsAssignedToAnEp] Some nodes were not assigned to the preferred execution providers which may or may not have an negative impact on performance. e.g. ORT explicitly assigns shape related ops to CPU to improve perf.[m
[0;93m2025-05-01 08:52:23.641376 [W:onnxruntime:, session_state.cc:1265 VerifyEachNodeIsAssignedToAnEp] Rerunning with verbose output on a non-minimal build will show node assignments.[m


In [None]:
export_optimized_onnx_model(
    embedding_model,
    "O1",
    "./",
)



## Examine saved model in ONNX


In [8]:
onnx_model = onnx.load("./onnx/model_O1.onnx")
print(onnx.checker.check_model(onnx_model))

None


In [None]:
inputs = [x.name for x in onnx_model.graph.input]
outputs = [x.name for x in onnx_model.graph.output]
print("inputs", inputs)
print("outputs", outputs)

inputs ['input_ids', 'attention_mask', 'token_type_ids']
outputs ['last_hidden_state']


### Gotcha

The output being `last_hidden_state` suggests the ONNX model from `sentence-transformers` doesn't include pooling and normalization modules. See these:

- https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2/tree/main/onnx
- https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2/blob/main/modules.json


In [None]:
for module_name, module in embedding_model.named_children():
    print(module, "\n")

Transformer({'max_seq_length': 256, 'do_lower_case': False}) with Transformer model: ORTModelForFeatureExtraction  

Pooling({'word_embedding_dimension': 384, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True}) 

Normalize() 



This particular `sentence-transformers` model has three modules: the encoder-type transformer, a mean pooling layer on the output, and a final normalization to the text embedding.

## Attempt 2: Open model and export to ONNX
Instead of implementing the code to convert and join all three modules into ONNX, we can use existing libraries.

One option is the Optimum library from HuggingFace: `optimum-cli export onnx --model sentence-transformers/all-MiniLM-L6-v2 model.onnx`.

We will use a library from `.txtai` (you may know them from the Python package `outlines` for constrained LLM sampling).

In [None]:
from txtai.pipeline import HFOnnx

path = "sentence-transformers/all-MiniLM-L6-v2"
onnx_model = HFOnnx()
model = onnx_model(path, "pooling", "model.onnx", True)

# embedding_model = SentenceTransformer('all-MiniLM-L6-v2')



In [7]:
onnx_model = onnx.load("./model.onnx")
print(onnx.checker.check_model(onnx_model))

None


In [None]:
inputs = [x.name for x in onnx_model.graph.input]
outputs = [x.name for x in onnx_model.graph.output]
print("inputs", inputs)
print("outputs", outputs)

inputs ['input_ids', 'attention_mask', 'token_type_ids']
outputs ['embeddings']


Now the output node is labelled `embeddings`, indicating it has applied all three modules.

## Tokenizer
Let's export the tokenizer using ONNXRuntimeExtensions.

In [1]:
from onnxruntime_extensions import gen_processing_models

In [4]:
onnx_tokenizer_path = "tokenizer.onnx"

tokenizer = embedding_model.tokenizer
tok_encode, tok_decode = gen_processing_models(tokenizer, pre_kwargs={})

In [None]:
tok_encode

In [None]:
# Save the tokenizer ONNX model
tokenizer_path = "tokenizer.onnx"
with open(tokenizer_path, "wb") as f:
    f.write(tok_encode.SerializeToString())

I haven't been able to get model and tokenizer in same IR version in order to combine graphs.

In [None]:
onnx_model.ir_version, tok_encode.ir_version

(7, 8)

## Testing Use


In [14]:
import onnxruntime as ort
from onnxruntime_extensions import get_library_path
import numpy as np

In [None]:
embedding_model = SentenceTransformer("all-MiniLM-L6-v2")

Inference is performed via the `ONNX Runtime`.

In [None]:
so = ort.SessionOptions()
so.register_custom_ops_library(get_library_path())

ort_sess = ort.InferenceSession("tokenizer.onnx", so)

Let's tokenize a string of text.

In [None]:
test_str = "The quick brown fox jumps over the lazy dog."
outputs = ort_sess.run(None, {"text": [test_str]})

In [21]:
outputs

[array([  101,  1996,  4248,  2829,  4419, 14523,  2058,  1996, 13971,
         3899,  1012,   102], dtype=int64),
 array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], dtype=int64),
 array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], dtype=int64),
 array([[ 0,  0],
        [ 0,  3],
        [ 4,  9],
        [10, 15],
        [16, 19],
        [20, 25],
        [26, 30],
        [31, 34],
        [35, 39],
        [40, 43],
        [44, 45],
        [ 0,  0]], dtype=int64)]

In [None]:
test_toks = outputs[0]
token_type_ids = outputs[1]
attention_mask = outputs[2]

In [57]:
type(embedding_model.tokenizer)

transformers.models.bert.tokenization_bert_fast.BertTokenizerFast

Encoding the tokens with our ONNX model and decoding with the HuggingFace model gives back the input (plus additional special tokens).

In [None]:
embedding_model.tokenizer.decode(test_toks)

'[CLS] the quick brown fox jumps over the lazy dog. [SEP]'

Now let's perform inference with the embedding model.

In [None]:
ort_embed = ort.InferenceSession("model.onnx")

In [None]:
ort_embed.run(
    None,
    {
        "input_ids": [test_toks],
        "attention_mask": [token_type_ids],
        "token_type_ids": [attention_mask],
    },
)[0]

array([[ 1.46325007e-01,  3.28532130e-01,  2.66175002e-01,
         5.18237472e-01,  2.02143028e-01, -1.79584488e-01,
         1.52321756e-01, -3.98070544e-01, -3.71623226e-02,
        -5.72629236e-02,  1.29877284e-01,  1.32518455e-01,
        -1.79732755e-01, -1.65457316e-02, -6.52377785e-04,
        -1.31084666e-01, -2.06912145e-01, -1.84920907e-01,
         3.07158738e-01, -2.62583256e-01, -2.98586309e-01,
        -3.02038938e-01,  1.37112692e-01,  1.63912699e-01,
        -4.19944048e-01, -1.17152214e-01, -3.97956759e-01,
        -3.00221205e-01,  4.11090940e-01, -5.13568342e-01,
        -8.72481465e-02,  1.85722232e-01, -2.18171403e-01,
        -2.64097247e-02, -1.63030490e-01, -3.72051567e-01,
         3.35423738e-01,  4.56916727e-02,  1.79710209e-01,
         1.38806954e-01,  1.49378225e-01, -8.92430544e-02,
        -2.62890935e-01,  1.50573924e-01, -5.16011655e-01,
         2.48664081e-01, -4.35304970e-01, -2.22753058e-03,
         1.70447230e-02,  7.45331869e-03, -1.46014124e-0

The embeddings between the ONNX model and the saved HuggingFace weights will differ as they're from different trainings.

In the corresponding Java app, you can compare the embedding below to the one produced via Java.

In [37]:
embedding_model.encode([test_str])

array([[ 4.39401269e-02,  5.89273572e-02,  4.81781922e-02,
         7.75616020e-02,  2.67397407e-02, -3.76246534e-02,
        -2.59578507e-03, -5.99470101e-02, -2.48484872e-03,
         2.20740736e-02,  4.80036773e-02,  5.57535887e-02,
        -3.89535986e-02, -2.66309399e-02,  7.69358641e-03,
        -2.62365304e-02, -3.64078879e-02, -3.78273763e-02,
         7.40729570e-02, -4.95132506e-02, -5.85304871e-02,
        -6.36074990e-02,  3.24228741e-02,  2.20151860e-02,
        -7.10863322e-02, -3.31508964e-02, -6.93992078e-02,
        -5.00420891e-02,  7.46240765e-02, -1.11135170e-01,
        -1.23101575e-02,  3.77289020e-02, -2.80298274e-02,
         1.45433918e-02, -3.15793417e-02, -8.05702582e-02,
         5.83476461e-02,  2.58636032e-03,  3.92938629e-02,
         2.57627461e-02,  4.98468950e-02, -1.74043898e-03,
        -4.55198474e-02,  2.92620845e-02, -1.02021821e-01,
         5.22407517e-02, -7.91030079e-02, -1.02924807e-02,
         9.20308568e-03,  1.30610717e-02, -4.04580906e-0

In [None]:
str_with_special = "[CLS] the quick brown fox jumps over the lazy dog. [SEP]"
embedding_model.encode([str_with_special]).shape

(1, 384)

The similarity scores between sentences, however, should be similar across different trainings of the same model.

In [None]:
test_strs = [
    "The quick brown fox jumps over the lazy dog.",
    "A fast wolf leaps over the sedentary hound.",
    "The cat sat on the mat.",
]

embs = embedding_model.encode(test_strs)
embedding_model.similarity(embs, embs)

tensor([[1.0000, 0.6290, 0.2128],
        [0.6290, 1.0000, 0.1674],
        [0.2128, 0.1674, 1.0000]])

In [None]:
outputs_tmp = ort_sess.run(None, {"text": [test_strs[0]]})
test_toks0, token_type_ids0, attention_mask0, _ = outputs_tmp

outputs_tmp = ort_sess.run(None, {"text": [test_strs[1]]})
test_toks1, token_type_ids1, attention_mask1, _ = outputs_tmp

outputs_tmp = ort_sess.run(None, {"text": [test_strs[2]]})
test_toks2, token_type_ids2, attention_mask2, _ = outputs_tmp

In [None]:
embs_onnx = np.concatenate(
    [
        ort_embed.run(
            None,
            {
                "input_ids": [test_toks0],
                "attention_mask": [attention_mask0],
                "token_type_ids": [0 * np.ones_like(test_toks0)],
            },
        )[0],
        ort_embed.run(
            None,
            {
                "input_ids": [test_toks1],
                "attention_mask": [attention_mask1],
                "token_type_ids": [0 * np.ones_like(test_toks1)],
            },
        )[0],
        ort_embed.run(
            None,
            {
                "input_ids": [test_toks2],
                "attention_mask": [attention_mask2],
                "token_type_ids": [0 * np.ones_like(test_toks2)],
            },
        )[0],
    ]
)

The embeddings from the ONNX version of the model exhibit similar similiarities between sentences as for the HuggingFace weights.

In [98]:
embedding_model.similarity(embs_onnx, embs_onnx)

tensor([[1.0000, 0.6378, 0.2185],
        [0.6378, 1.0000, 0.1812],
        [0.2185, 0.1812, 1.0000]])

In [None]:
# encoder = onnx.load("./model.onnx")
# tokenizer = onnx.load("./tokenizer.onnx")
# print(onnx.checker.check_model(encoder), onnx.checker.check_model(tokenizer))

None None


So we can debug the Java version, we print the tokenized strings here. It will be useful for comparing the ONNX embeddings between Python and Java.

In [99]:
test_toks0

array([  101,  1996,  4248,  2829,  4419, 14523,  2058,  1996, 13971,
        3899,  1012,   102], dtype=int64)

In [100]:
test_toks1

array([  101,  1037,  3435,  4702, 29195,  2058,  1996,  7367, 16454,
        5649, 19598,  1012,   102], dtype=int64)

In [101]:
test_toks2

array([  101,  1996,  4937,  2938,  2006,  1996, 13523,  1012,   102],
      dtype=int64)