# Exporting Hugging Face Models Using Optimum and Running Them in DeepSparse

This guide shows how to run Hugging Face models in Neural Magic's DeepSparse inference runtime. Models are exported from PyTorch to ONNX with the `Optimum` library and then loaded into DeepSparse, which is built for efficient inference on CPUs. Because the models stay in the widely adopted ONNX format, the same export workflow applies across many tasks.

This notebook uses several popular models from the Hugging Face Hub for text classification, NER, question answering, zero-shot classification, image classification, and image segmentation.

The flow for this guide includes:

1. Exporting models to ONNX using `optimum-cli`.
2. Running inference on the exported ONNX models with DeepSparse.
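Every export in this guide uses the same command shape. As a minimal sketch, the helper below (a hypothetical `build_export_command`, not part of Optimum) assembles the `optimum-cli export onnx` argument list used in the cells that follow. The `--model` and `--sequence_length` flags are the only ones taken from this guide.

```python
import shlex
from typing import List, Optional


def build_export_command(
    model_id: str, output_dir: str, sequence_length: Optional[int] = None
) -> List[str]:
    """Assemble the optimum-cli call used in this guide (hypothetical helper)."""
    cmd = ["optimum-cli", "export", "onnx", "--model", model_id, output_dir]
    if sequence_length is not None:
        # The text models in this guide pass --sequence_length 128.
        cmd += ["--sequence_length", str(sequence_length)]
    return cmd


text_cmd = build_export_command("SamLowe/roberta-base-go_emotions", "tc_model", 128)
image_cmd = build_export_command("nateraw/vit-age-classifier", "ic_model")

# Only print the commands here; to execute one, pass the list to
# subprocess.run(cmd, check=True).
print(shlex.join(text_cmd))
print(shlex.join(image_cmd))
```

Building the command as a list rather than a single string avoids shell-quoting problems if a model ID or output path ever contains spaces.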

## Install DeepSparse and Optimum

In [1]:
!pip install deepsparse-nightly optimum[exporters]

Collecting deepsparse-nightly
  Downloading deepsparse_nightly-1.6.0.20230825-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (44.7 MB)
Collecting optimum[exporters]
  Downloading optimum-1.12.0-py3-none-any.whl (380 kB)
Collecting sparsezoo-nightly~=1.6.0 (from deepsparse-nightly)
  Downloading sparsezoo_nightly-1.6.0.20230825-py3-none-any.whl (139 kB)
Collecting onnx<1.15.0,>=1.5.0 (from deepsparse-nightly)
  Downloading onnx-1.14.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (14.6 MB)

## Text Classification | Sentiment Analysis

Let's export the `SamLowe/roberta-base-go_emotions` model for sentiment analysis to an output folder called `tc_model`:

In [2]:
!optimum-cli export onnx --model SamLowe/roberta-base-go_emotions tc_model --sequence_length 128

Framework not specified. Using pt to export to ONNX.
Downloading (…)lve/main/config.json: 100% 1.92k/1.92k [00:00<00:00, 3.22MB/s]
Downloading pytorch_model.bin: 100% 499M/499M [00:04<00:00, 101MB/s]
Automatic task detection to text-classification (possible synonyms are: sequence-classification, zero-shot-classification).
Downloading (…)okenizer_config.json: 100% 380/380 [00:00<00:00, 2.23MB/s]
Downloading (…)olve/main/vocab.json: 100% 798k/798k [00:00<00:00, 57.6MB/s]
Downloading (…)olve/main/merges.txt: 100% 456k/456k [00:00<00:00, 43.4MB/s]
Downloading (…)/main/tokenizer.json: 100% 2.11M/2.11M [00:00<00:00, 95.4MB/s]
Downloading (…)cial_tokens_map.json: 100% 280/280 [00:00<00:00, 1.40MB/s]
Using framework PyTorch: 2.0.1+cu118
Overriding 1 configuration item(s)
	- use_cache -> False
verbose: False, log level: Level.ERROR

Post-processing the exported models...
Validating models in subprocesses...
Validating ONNX model tc_model/model.onnx...
	-[✓] ONNX model output names match referen

Load the model and run inference with DeepSparse:

In [3]:
from deepsparse import Pipeline

text_input = "Snorlax loves my Tesla!"

pipe = Pipeline.create(task="sentiment-analysis", model_path="./tc_model")
inference = pipe(text_input)
print(inference)
print(pipe.timer_manager)

2023-08-29 13:43:07 __main__     INFO     Overwriting in-place the input shapes of the transformer model at ./tc_model/model.onnx


labels=['love'] scores=[0.8388857841491699]
TimerManager({'engine_forward': 0.37699663099999725, 'total_inference': 0.3789412719999916, 'pre_process': 0.0014164960000186966, 'post_process': 0.00048448900000153117})
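The `TimerManager` line reports per-stage wall-clock times, which appear to be in seconds. As a rough sketch, the snippet below copies the values printed above into a plain dict (it does not call DeepSparse) and converts them to milliseconds and to shares of `total_inference`.

```python
# Timings copied from the output above (assumed to be seconds).
timings = {
    "engine_forward": 0.37699663099999725,
    "total_inference": 0.3789412719999916,
    "pre_process": 0.0014164960000186966,
    "post_process": 0.00048448900000153117,
}

total = timings["total_inference"]
report = {}
for stage, seconds in timings.items():
    report[stage] = {"ms": seconds * 1000.0, "share": seconds / total}

# Print the stages from slowest to fastest.
for stage, row in sorted(report.items(), key=lambda kv: -kv[1]["ms"]):
    print(f"{stage:>16}: {row['ms']:8.2f} ms ({row['share']:6.1%})")
```

In this run nearly all of the time is spent in `engine_forward`; pre- and post-processing are negligible by comparison.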


## NER

Let's export the `Jean-Baptiste/camembert-ner` French NER model to an output folder called `ner_model`:

In [4]:
!optimum-cli export onnx --model Jean-Baptiste/camembert-ner ner_model --sequence_length 128

Framework not specified. Using pt to export to ONNX.
Downloading (…)lve/main/config.json: 100% 892/892 [00:00<00:00, 2.09MB/s]
Downloading model.safetensors: 100% 440M/440M [00:07<00:00, 59.4MB/s]
Automatic task detection to token-classification.
Downloading (…)okenizer_config.json: 100% 269/269 [00:00<00:00, 1.01MB/s]
Downloading (…)tencepiece.bpe.model: 100% 811k/811k [00:00<00:00, 180MB/s]
Downloading (…)cial_tokens_map.json: 100% 210/210 [00:00<00:00, 1.25MB/s]
Using framework PyTorch: 2.0.1+cu118
Overriding 1 configuration item(s)
	- use_cache -> False
verbose: False, log level: Level.ERROR

Post-processing the exported models...
Validating models in subprocesses...
Validating ONNX model ner_model/model.onnx...
	-[✓] ONNX model output names match reference model (logits)
	- Validating ONNX Model output "logits":
		-[✓] (2, 128, 5) matches (2, 128, 5)
		-[✓] all values close (atol: 0.0001)
The ONNX export succeeded and the exported model was saved at: ner_model


Load the model and run inference with DeepSparse:

In [5]:
from deepsparse import Pipeline

text_input = "george washington est allé à washington!"

pipe = Pipeline.create(task="token-classification", model_path="./ner_model")
inference = pipe(text_input)

print(inference)
print(pipe.timer_manager)

2023-08-29 13:46:59 __main__     INFO     Overwriting in-place the input shapes of the transformer model at ./ner_model/model.onnx


predictions=[[TokenClassificationResult(entity='I-PER', score=0.9719225168228149, word='▁ge', start=0, end=2, index=1, is_grouped=False), TokenClassificationResult(entity='I-PER', score=0.9716293811798096, word='orge', start=2, end=6, index=2, is_grouped=False), TokenClassificationResult(entity='I-PER', score=0.995206892490387, word='▁was', start=6, end=10, index=3, is_grouped=False), TokenClassificationResult(entity='I-PER', score=0.9953275322914124, word='h', start=10, end=11, index=4, is_grouped=False), TokenClassificationResult(entity='I-PER', score=0.9947953224182129, word='ington', start=11, end=17, index=5, is_grouped=False), TokenClassificationResult(entity='I-LOC', score=0.9657747745513916, word='▁was', start=28, end=32, index=9, is_grouped=False), TokenClassificationResult(entity='I-LOC', score=0.9659914970397949, word='h', start=32, end=33, index=10, is_grouped=False), TokenClassificationResult(entity='I-LOC', score=0.961447536945343, word='ington', start=33, end=39, index=1
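The predictions above are per subword token (`is_grouped=False`), so one name is split across several results. As a hedged sketch, the code below merges neighbouring tokens that share a label and have touching character offsets. `Token` is a stand-in dataclass holding only fields visible in the output, `group_tokens` is a hypothetical helper, and averaging the scores is an assumption made for illustration rather than DeepSparse's behaviour.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Token:
    """Stand-in for the fields printed above; not the DeepSparse class."""
    entity: str
    score: float
    start: int
    end: int


def group_tokens(text: str, tokens: List[Token]) -> List[dict]:
    """Merge tokens that share a label and have touching character offsets."""
    groups: List[dict] = []
    for tok in tokens:
        last = groups[-1] if groups else None
        if last and last["entity"] == tok.entity and last["end"] == tok.start:
            last["end"] = tok.end
            last["scores"].append(tok.score)
        else:
            groups.append(
                {"entity": tok.entity, "start": tok.start, "end": tok.end, "scores": [tok.score]}
            )
    for g in groups:
        # Offsets can include the leading space of a SentencePiece token.
        g["word"] = text[g["start"]:g["end"]].strip()
        g["score"] = sum(g["scores"]) / len(g["scores"])  # mean score: an assumption
        del g["scores"]
    return groups


text = "george washington est allé à washington!"
tokens = [  # entity, score, start, end copied from the output above
    Token("I-PER", 0.9719225168228149, 0, 2),
    Token("I-PER", 0.9716293811798096, 2, 6),
    Token("I-PER", 0.995206892490387, 6, 10),
    Token("I-PER", 0.9953275322914124, 10, 11),
    Token("I-PER", 0.9947953224182129, 11, 17),
    Token("I-LOC", 0.9657747745513916, 28, 32),
    Token("I-LOC", 0.9659914970397949, 32, 33),
    Token("I-LOC", 0.961447536945343, 33, 39),
]

entities = group_tokens(text, tokens)
for e in entities:
    print(e["entity"], repr(e["word"]), round(e["score"], 4))
```

Merging on character offsets rather than on token strings sidesteps the `▁` word-boundary markers and recovers the surface text directly from the input.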

## Question Answering

Let's export the `deepset/electra-base-squad2` model for Question Answering to an output folder called `qa_model`:

In [7]:
!optimum-cli export onnx --model deepset/electra-base-squad2 qa_model --sequence_length 128

Framework not specified. Using pt to export to ONNX.
Downloading (…)lve/main/config.json: 100% 635/635 [00:00<00:00, 1.02MB/s]
Downloading model.safetensors: 100% 436M/436M [00:06<00:00, 65.3MB/s]
Automatic task detection to question-answering.
Downloading (…)okenizer_config.json: 100% 200/200 [00:00<00:00, 987kB/s]
Downloading (…)solve/main/vocab.txt: 100% 232k/232k [00:00<00:00, 43.1MB/s]
Downloading (…)cial_tokens_map.json: 100% 112/112 [00:00<00:00, 672kB/s]
Using framework PyTorch: 2.0.1+cu118
Overriding 1 configuration item(s)
	- use_cache -> False
verbose: False, log level: Level.ERROR

Post-processing the exported models...
Validating models in subprocesses...
Validating ONNX model qa_model/model.onnx...
	-[✓] ONNX model output names match reference model (start_logits, end_logits)
	- Validating ONNX Model output "start_logits":
		-[✓] (2, 128) matches (2, 128)
		-[✓] all values close (atol: 0.0001)
	- Validating ONNX Model output "end_logits":
		-[✓] (2, 128) matches (2, 128)


Load the model and run inference with DeepSparse:

In [8]:
from deepsparse import Pipeline

pipe = Pipeline.create(task="question-answering", model_path="./qa_model")
question = "What's my name?"
context = "My name is Snorlax"

inference = pipe(question=question, context=context)

print(inference)
print(pipe.timer_manager)

2023-08-29 13:49:27 __main__     INFO     Overwriting in-place the input shapes of the transformer model at ./qa_model/model.onnx


score=2.424950361251831 answer='Snorlax' start=11 end=18
TimerManager({'engine_forward': 0.3993458770000302, 'total_inference': 0.4020000669999604, 'pre_process': 0.001061354000000847, 'post_process': 0.0015504159999863987})
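The `start` and `end` fields in the result appear to be character offsets into the context, with an exclusive end. A quick sanity check, using only the values printed above and no DeepSparse call, is to slice the context with them:

```python
context = "My name is Snorlax"

# Values copied from the printed result above.
answer, start, end = "Snorlax", 11, 18

# Slice the context with the reported offsets (end treated as exclusive).
span = context[start:end]
print(repr(span))
assert span == answer, "start/end should index the answer inside the context"
```

The same slice can be used to highlight the answer in the original passage or to show a few characters of surrounding context.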


## Zero-Shot Text Classification

Let's export the DistilBERT MNLI Base model (`typeform/distilbert-base-uncased-mnli`) to an output folder called `zs_model`:

In [9]:
!optimum-cli export onnx --model typeform/distilbert-base-uncased-mnli zs_model --sequence_length 128

Framework not specified. Using pt to export to ONNX.
Downloading (…)lve/main/config.json: 100% 776/776 [00:00<00:00, 1.80MB/s]
The `xla_device` argument has been deprecated in v4.4.0 of Transformers. It is ignored and you can safely remove it from your `config.json` file.
Downloading model.safetensors: 100% 268M/268M [00:01<00:00, 217MB/s]
Automatic task detection to text-classification (possible synonyms are: sequence-classification, zero-shot-classification).
Downloading (…)okenizer_config.json: 100% 258/258 [00:00<00:00, 1.33MB/s]
Downloading (…)solve/main/vocab.txt: 100% 232k/232k [00:00<00:00, 28.1MB/s]
Downloading (…)cial_tokens_map.json: 100% 112/112 [00:00<00:00, 614kB/s]
The `xla_device` ar

Load the model and run inference with DeepSparse:

In [3]:
from deepsparse import Pipeline

pipe = Pipeline.create(
    task="zero_shot_text_classification",
    model_scheme="mnli",
    model_config={"hypothesis_template": "This text is related to {}"},
    model_path="./zs_model"
)

sequence = "I like pepperoni pizza."
labels = ["food", "movies", "sports"]

inference = pipe(sequences=sequence, labels=labels)

print(inference)
print(pipe.timer_manager)

RuntimeError: ignored

AttributeError: ignored
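With an MNLI-style scheme, zero-shot classification is commonly framed as entailment: each candidate label is inserted into the hypothesis template and paired with the input sequence as the premise. The sketch below shows only that expansion step in plain Python. It illustrates the general technique, not DeepSparse's internal code, and `build_nli_pairs` is a hypothetical helper.

```python
from typing import List, Tuple


def build_nli_pairs(
    sequence: str, labels: List[str], template: str
) -> List[Tuple[str, str]]:
    """Return (premise, hypothesis) pairs, one per candidate label."""
    return [(sequence, template.format(label)) for label in labels]


pairs = build_nli_pairs(
    "I like pepperoni pizza.",
    ["food", "movies", "sports"],
    "This text is related to {}",
)
for premise, hypothesis in pairs:
    print(f"{premise!r} -> {hypothesis!r}")
```

An NLI model then scores each pair, and the label whose hypothesis is most strongly entailed is typically taken as the prediction.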

## Image Classification

Let's export the `nateraw/vit-age-classifier` model to an output folder called `ic_model`. This model classifies a person's age based on their picture:

In [15]:
!optimum-cli export onnx --model nateraw/vit-age-classifier ic_model

Framework not specified. Using pt to export to ONNX.
Downloading (…)lve/main/config.json: 100% 850/850 [00:00<00:00, 1.61MB/s]
Downloading pytorch_model.bin: 100% 343M/343M [00:05<00:00, 60.6MB/s]
Automatic task detection to image-classification.
Downloading (…)rocessor_config.json: 100% 197/197 [00:00<00:00, 808kB/s]
Using framework PyTorch: 2.0.1+cu118
  if num_channels != self.num_channels:
  if height != self.image_size[0] or width != self.image_size[1]:
verbose: False, log level: Level.ERROR

Post-processing the exported models...
Validating models in subprocesses...
Validating ONNX model ic_model/model.onnx...
	-[✓] ONNX model output names match reference model (logits)
	- Validating ONNX Model output "logits":
		-[✓] (2, 9) matches (2, 9)
		-[✓] all values close (atol: 1e-05)
The ONNX export succeeded and the exported model was saved at: ic_model


Load the model and run inference with DeepSparse:

In [19]:
from deepsparse import Pipeline

pipe = Pipeline.create(
    task="image_classification",
    model_path="./ic_model",
    input_shapes=[1, 3, 224, 224],
)

inference = pipe(images="./face.jpg")

print(inference.labels)
print(pipe.timer_manager)

2023-08-29 14:01:37 deepsparse.utils.onnx INFO     Overwriting in-place the input shapes of the model at ic_model/model.onnx


[3]
TimerManager({'engine_forward': 1.037140602999898, 'total_inference': 1.0436674009999933, 'pre_process': 0.004802337999990414, 'post_process': 0.0016723179999189597})
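The pipeline prints a class index (`[3]`) rather than a class name. Hugging Face model configs usually carry an `id2label` mapping, and the sketch below looks an index up in such a mapping. It assumes a `config.json` containing `id2label` sits in the export folder, which this guide does not show. The mapping in the demo is a made-up placeholder, not the real labels of `nateraw/vit-age-classifier`, and both helpers are hypothetical.

```python
import json
from pathlib import Path
from typing import Dict, Union


def load_id2label(model_dir: str) -> Dict[str, str]:
    """Read id2label from <model_dir>/config.json; empty dict if unavailable."""
    config_path = Path(model_dir) / "config.json"
    if not config_path.is_file():
        return {}
    config = json.loads(config_path.read_text(encoding="utf-8"))
    return {str(k): str(v) for k, v in config.get("id2label", {}).items()}


def label_name(id2label: Dict[str, str], label: Union[int, str]) -> str:
    """Map a predicted index to its name, falling back to the raw index."""
    return id2label.get(str(label), str(label))


# Placeholder mapping for illustration only (hypothetical class names).
demo_id2label = {"0": "class_a", "3": "class_d"}
print(label_name(demo_id2label, 3))

# Against the exported folder, assuming config.json was saved there.
print(label_name(load_id2label("./ic_model"), 3))
```

Keys are normalised to strings because JSON object keys are always strings, while the pipeline may return the index as an integer.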


## Image Segmentation

Let's install the DeepSparse `yolov8` extras and then export the DEtection TRansformer (DETR) model to an output folder called `is_model`:

In [1]:
!pip install deepsparse[yolov8]



In [2]:
!optimum-cli export onnx --model facebook/detr-resnet-50-panoptic is_model

RuntimeError: module compiled against API version 0xf but this version of numpy is 0xe
ImportError: numpy.core._multiarray_umath failed to import
ImportError: numpy.core.umath failed to import
Traceback (most recent call last):
  File "/usr/local/bin/optimum-cli", line 5, in <module>
    from optimum.commands.optimum_cli import main
  File "/usr/local/lib/python3.10/dist-packages/optimum/commands/__init__.py", line 17, in <module>
    f
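This NumPy error typically means a compiled extension was built against a newer NumPy C API than the NumPy that is installed, which can happen when a later `pip install` changes package versions. As a hedged first diagnostic, the sketch below lists the installed versions of the packages involved using only the standard library. The package list is an assumption about what is relevant here, and `installed_version` is a hypothetical helper.

```python
from importlib import metadata
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Distributions assumed to be relevant to the failure above.
for name in ["numpy", "onnx", "optimum", "deepsparse-nightly", "deepsparse"]:
    print(f"{name:>20}: {installed_version(name) or 'not installed'}")
```

Comparing this listing before and after the extra `pip install` shows whether NumPy or one of its dependents changed version between the working and failing cells.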

Load the model and run inference with DeepSparse:

In [None]:
from deepsparse import Pipeline

pipe = Pipeline.create(
    task="yolov8",
    model_path="./is_model",
    input_shapes=[1, 3, 224, 224],
    image_size=(224, 224),
)

inference = pipe(images="./thailand.jpeg")

print(inference)
print(pipe.timer_manager)