# Exporting Hugging Face Models Using Optimum and Running Them in DeepSparse

This guide shows how to run Hugging Face models with Neural Magic's DeepSparse inference runtime. DeepSparse loads models in the ONNX format, so PyTorch models from the Hugging Face Hub are first exported to ONNX with the `Optimum` library and then passed to DeepSparse for accelerated inference.

This notebook uses several popular models from the Hugging Face Hub for text classification, NER, question answering, zero-shot classification, and image classification.

The flow for this guide includes:

1. Exporting models to ONNX using `optimum-cli`.
2. Running inference on the exported ONNX models with DeepSparse.
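
The export step can also be scripted. The sketch below only builds the `optimum-cli` argument lists for the text models exported in this notebook; it does not run them. The helper name `build_export_command` and the `EXPORTS` mapping are hypothetical names introduced for illustration, and the flags mirror the commands used in the cells of this notebook.

```python
import shlex

# Model IDs and output folders as used in this notebook.
EXPORTS = {
    "SamLowe/roberta-base-go_emotions": "tc_model",
    "Jean-Baptiste/camembert-ner": "ner_model",
    "deepset/electra-base-squad2": "qa_model",
    "typeform/distilbert-base-uncased-mnli": "zs_model",
}


def build_export_command(model_id, output_dir, sequence_length=128):
    """Return the optimum-cli argument list for one ONNX export (hypothetical helper)."""
    return [
        "optimum-cli", "export", "onnx",
        "--model", model_id,
        output_dir,
        "--sequence_length", str(sequence_length),
    ]


commands = [build_export_command(model_id, out_dir) for model_id, out_dir in EXPORTS.items()]
for cmd in commands:
    # Print the shell form; pass `cmd` to subprocess.run(cmd, check=True) to execute it.
    print(shlex.join(cmd))
```

Keeping the commands as argument lists avoids shell quoting issues if they are later executed with `subprocess.run`.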

## Install DeepSparse and Optimum

In [11]:
!pip install deepsparse-nightly optimum[exporters]


## Text Classification | Sentiment Analysis

Let's export the `SamLowe/roberta-base-go_emotions` model for sentiment analysis to an output folder called `tc_model`:

In [2]:
!optimum-cli export onnx --model SamLowe/roberta-base-go_emotions tc_model --sequence_length 128

Framework not specified. Using pt to export to ONNX.
Automatic task detection to text-classification (possible synonyms are: sequence-classification, zero-shot-classification).
Using the export variant default. Available variants are:
	- default: The default ONNX variant.
Using framework PyTorch: 2.0.1+cu117
Overriding 1 configuration item(s)
	- use_cache -> False
verbose: False, log level: Level.ERROR

Post-processing the exported models...
Weight deduplication check in the ONNX export requires accelerate. Please install accelerate to run it.
Validating models in subprocesses...
Validating ONNX model tc_model/model.onnx...
	-[✓] ONNX model output names match reference model (logits)
	- Validating ONNX Model output "logits":
		-[✓] (2, 28) matches (2, 28)
		-[✓] all values close (atol: 0.0001)
The ONNX export succeeded and the exported model was saved at: tc_model


Load model and run inference with DeepSparse:

In [3]:
from deepsparse import Pipeline

text_input = "Snorlax loves my Tesla!"

pipe = Pipeline.create(task="sentiment-analysis", model_path="./tc_model")
inference = pipe(text_input)
print(inference)
print(pipe.timer_manager)

2023-09-13 09:56:38 __main__     INFO     Overwriting in-place the input shapes of the transformer model at ./tc_model/model.onnx
DeepSparse, Copyright 2021-present / Neuralmagic, Inc. version: 1.6.0.20230906 COMMUNITY | (f5e597bf) (release) (optimized) (system=avx2_vnni, binary=avx2)


labels=['love'] scores=[0.8388857841491699]
TimerManager({'engine_forward': 0.13723947099992984, 'pre_process': 0.00742578099993807, 'post_process': 0.0009767349999947328, 'total_inference': 0.14570146899995962})
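
To see where the time goes, the printed timings can be turned into per-stage shares. This is a minimal sketch that works on a plain dictionary copied from the output above; it does not assume any particular attribute on `pipe.timer_manager`, and `stage_shares` is a hypothetical helper name.

```python
# Timings in seconds, copied from the TimerManager output above.
timings = {
    "engine_forward": 0.13723947099992984,
    "pre_process": 0.00742578099993807,
    "post_process": 0.0009767349999947328,
    "total_inference": 0.14570146899995962,
}


def stage_shares(timings, total_key="total_inference"):
    """Return each stage's fraction of the total inference time."""
    total = timings[total_key]
    return {name: value / total for name, value in timings.items() if name != total_key}


shares = stage_shares(timings)
# Print stages from slowest to fastest, in milliseconds and as a share of the total.
for stage, share in sorted(shares.items(), key=lambda item: -item[1]):
    print(f"{stage:>15}: {timings[stage] * 1000:7.2f} ms ({share:.1%})")
```

For this run, the engine forward pass accounts for most of the total, with tokenization and post-processing making up the remainder.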


## NER

Let's export the `Jean-Baptiste/camembert-ner` French NER model to an output folder called `ner_model`:

In [4]:
!optimum-cli export onnx --model Jean-Baptiste/camembert-ner ner_model --sequence_length 128

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
Framework not specified. Using pt to export to ONNX.
Automatic task detection to token-classification.
Using the export variant default. Available variants are:
	- default: The default ONNX variant.
Using framework PyTorch: 2.0.1+cu117
Overriding 1 configuration item(s)
	- use_cache -> False
verbose: False, log level: Level.ERROR

Post-processing the exported models...
Weight deduplication check in the ONNX export requires accelerate. Please install accelerate to run it.
Validating models in subprocesses...
Validating ONNX model ner_model/model.onnx...
	-[✓] ONNX model output names match reference model (logits)
	- Validating ONNX Model output "logits":
		-[✓] (2, 128, 5) matches (2, 128, 5)
		-[x] values no

Load model and run inference with DeepSparse:

In [5]:
from deepsparse import Pipeline

text_input = "george washington est allé à washington!"

pipe = Pipeline.create(task="token-classification", model_path="./ner_model")
inference = pipe(text_input)

print(inference)
print(pipe.timer_manager)

2023-09-13 09:57:11 __main__     INFO     Overwriting in-place the input shapes of the transformer model at ./ner_model/model.onnx


predictions=[[TokenClassificationResult(entity='I-PER', score=0.9719225168228149, word='▁ge', start=0, end=2, index=1, is_grouped=False), TokenClassificationResult(entity='I-PER', score=0.9716293811798096, word='orge', start=2, end=6, index=2, is_grouped=False), TokenClassificationResult(entity='I-PER', score=0.995206892490387, word='▁was', start=6, end=10, index=3, is_grouped=False), TokenClassificationResult(entity='I-PER', score=0.9953275322914124, word='h', start=10, end=11, index=4, is_grouped=False), TokenClassificationResult(entity='I-PER', score=0.9947953224182129, word='ington', start=11, end=17, index=5, is_grouped=False), TokenClassificationResult(entity='I-LOC', score=0.9657747745513916, word='▁was', start=28, end=32, index=9, is_grouped=False), TokenClassificationResult(entity='I-LOC', score=0.9659914970397949, word='h', start=32, end=33, index=10, is_grouped=False), TokenClassificationResult(entity='I-LOC', score=0.961447536945343, word='ington', start=33, end=39, index=1
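
The results above are reported per sub-word token (`is_grouped=False`). One way to recover whole entities is to merge neighbouring tokens that share a label and whose character offsets touch. The sketch below uses a stand-in `Token` dataclass with the `entity`, `start`, and `end` values printed above; `Token` and `group_entities` are hypothetical names, not part of DeepSparse.

```python
from dataclasses import dataclass


@dataclass
class Token:
    """Stand-in for the entity/start/end fields of one printed result."""
    entity: str
    start: int
    end: int


def group_entities(text, tokens):
    """Merge adjacent tokens that share a label and have touching offsets."""
    spans = []  # each item is [entity, start, end]
    for tok in tokens:
        if spans and spans[-1][0] == tok.entity and spans[-1][2] == tok.start:
            spans[-1][2] = tok.end  # extend the current span
        else:
            spans.append([tok.entity, tok.start, tok.end])
    # Offsets may include the leading space of a word, so strip the slice.
    return [(entity, text[start:end].strip()) for entity, start, end in spans]


text = "george washington est allé à washington!"
tokens = [
    Token("I-PER", 0, 2), Token("I-PER", 2, 6), Token("I-PER", 6, 10),
    Token("I-PER", 10, 11), Token("I-PER", 11, 17),
    Token("I-LOC", 28, 32), Token("I-LOC", 32, 33), Token("I-LOC", 33, 39),
]
print(group_entities(text, tokens))
# → [('I-PER', 'george washington'), ('I-LOC', 'washington')]
```

Slicing the original text by offsets, rather than concatenating the `word` fields, avoids having to handle the `▁` word-boundary marker.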

## Question Answering

Let's export the `deepset/electra-base-squad2` model for Question Answering to an output folder called `qa_model`:

In [6]:
!optimum-cli export onnx --model deepset/electra-base-squad2 qa_model --sequence_length 128

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
Framework not specified. Using pt to export to ONNX.
Automatic task detection to question-answering.
Using the export variant default. Available variants are:
	- default: The default ONNX variant.
Using framework PyTorch: 2.0.1+cu117
Overriding 1 configuration item(s)
	- use_cache -> False
verbose: False, log level: Level.ERROR

Post-processing the exported models...
Weight deduplication check in the ONNX export requires accelerate. Please install accelerate to run it.
Validating models in subprocesses...
Validating ONNX model qa_model/model.onnx...
	-[✓] ONNX model output names match reference model (end_logits, start_logits)
	- Validating ONNX Model output "start_logits":
		-[✓] (2, 128) matches (2, 128)
	

Load model and run inference with DeepSparse:

In [7]:
from deepsparse import Pipeline

question = "What's my name?"
context = "My name is Snorlax"

pipe = Pipeline.create(task="question-answering", model_path="./qa_model")
inference = pipe(question=question, context=context)

print(inference)
print(pipe.timer_manager)

2023-09-13 09:57:41 __main__     INFO     Overwriting in-place the input shapes of the transformer model at ./qa_model/model.onnx


score=2.424950361251831 answer='Snorlax' start=11 end=18
TimerManager({'engine_forward': 0.09078278999982103, 'pre_process': 0.0017440609999539447, 'post_process': 0.002852336999922045, 'total_inference': 0.09541724099995008})
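
The `start` and `end` fields are character offsets into the context, so the answer can be recovered by slicing. A minimal check, using a plain dictionary with the values printed above in place of the result object:

```python
context = "My name is Snorlax"

# Fields copied from the printed result above.
result = {"score": 2.424950361251831, "answer": "Snorlax", "start": 11, "end": 18}

# Slice the context with the reported offsets (end is exclusive).
span = context[result["start"]:result["end"]]
print(span)
# → Snorlax
```

This is handy when the answer needs to be highlighted in the original text.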


## Zero-Shot Text Classification

Let's export the `typeform/distilbert-base-uncased-mnli` model (DistilBERT base fine-tuned on MNLI) to an output folder called `zs_model`:

In [8]:
!optimum-cli export onnx --model typeform/distilbert-base-uncased-mnli zs_model --sequence_length 128

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
Framework not specified. Using pt to export to ONNX.
The `xla_device` argument has been deprecated in v4.4.0 of Transformers. It is ignored and you can safely remove it from your `config.json` file.
The `xla_device` argument has been deprecated in v4.4.0 of Transformers. It is ignored and you can safely remove it from your `config.json` file.
Automatic task detection to text-classification (possible synonyms are: sequence-classification, zero-shot-classification).
The `xla_device` argument has been deprecated in v4.4.0 of Transformers. It is ignored and you can safely remove it from your `config.json` file.
The `xla_device` argument has been deprecated in v4.4.0 of Transformers. It is ignored and you can saf

Load model and run inference with DeepSparse:

In [9]:
from deepsparse import Pipeline

pipe = Pipeline.create(
    task="zero_shot_text_classification",
    model_scheme="mnli",
    model_config={"hypothesis_template": "This text is related to {}"},
    model_path="./zs_model"
)

sequence = "I like pepperoni pizza."
labels = ["food", "movies", "sports"]

inference = pipe(sequences=sequence, labels=labels)

print(inference)
print(pipe.timer_manager)

2023-09-13 09:58:08 __main__     INFO     Overwriting in-place the input shapes of the transformer model at ./zs_model/model.onnx


sequences='I like pepperoni pizza.' labels=['food', 'sports', 'movies'] scores=[0.9594793319702148, 0.0325121246278286, 0.008008550852537155]
TimerManager({'engine_forward': 0.13312921599981564, 'pre_process': 0.0024127279998538143, 'post_process': 0.0008873450001374295, 'total_inference': 0.13646740300009697})
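
With an MNLI-style model, zero-shot classification is commonly framed as entailment: each candidate label is substituted into the hypothesis template and paired with the input sequence. The sketch below illustrates that pairing and reads the top label from the printed output. It is an assumption about the general approach, not DeepSparse's internal code, and `build_nli_pairs` is a hypothetical helper.

```python
def build_nli_pairs(sequence, labels, hypothesis_template):
    """Pair the input (premise) with one filled-in hypothesis per label."""
    return [(sequence, hypothesis_template.format(label)) for label in labels]


sequence = "I like pepperoni pizza."
labels = ["food", "movies", "sports"]
template = "This text is related to {}"

pairs = build_nli_pairs(sequence, labels, template)
for premise, hypothesis in pairs:
    print(f"{premise!r} -> {hypothesis!r}")

# Labels and scores copied from the printed result above.
out_labels = ["food", "sports", "movies"]
out_scores = [0.9594793319702148, 0.0325121246278286, 0.008008550852537155]
best_score, best_label = max(zip(out_scores, out_labels))
print(best_label)
# → food
```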


## Image Classification

First install the DeepSparse image classification extras, then export the `nateraw/vit-age-classifier` model, which classifies a person's age from their picture, to an output folder called `ic_model`:

In [14]:
!pip install deepsparse[image-classification]



In [11]:
!optimum-cli export onnx --model nateraw/vit-age-classifier ic_model

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
Framework not specified. Using pt to export to ONNX.
Automatic task detection to image-classification.
Using the export variant default. Available variants are:
	- default: The default ONNX variant.
Using framework PyTorch: 2.0.1+cu117
  if num_channels != self.num_channels:
  if height != self.image_size[0] or width != self.image_size[1]:
verbose: False, log level: Level.ERROR

Post-processing the exported models...
Weight deduplication check in the ONNX export requires accelerate. Please install accelerate to run it.
Validating models in subprocesses...
Validating ONNX model ic_model/model.onnx...
	-[✓] ONNX model output names match reference model (logits)
	- Validating ONNX Model output "logits":
		-[✓] 

Load model and run inference with DeepSparse:

In [46]:
from deepsparse import Pipeline

image_path = "./face.jpg"

pipe = Pipeline.create(
    task="image_classification",
    model_path="./ic_model",
    input_shapes=[1, 3, 224, 224],
)

inference = pipe(images=image_path)
print(inference)
print(pipe.timer_manager)

2023-09-18 09:28:35 deepsparse.utils.onnx INFO     Overwriting in-place the input shapes of the model at ic_model/model.onnx


labels=[3] scores=[7.151614665985107]
TimerManager({'post_process': 0.0008744990000195685, 'total_inference': 0.14688509499956126, 'engine_forward': 0.13076873600039107, 'pre_process': 0.015200159999949392})
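
The pipeline returns a class index (`3` here) rather than a class name. Hugging Face model configs usually carry an `id2label` mapping, so one option is to read it from the `config.json` in the export folder. The sketch below assumes that `./ic_model/config.json` exists and contains `id2label`. The demo writes a hypothetical config with placeholder class names to a temporary folder so it runs on its own; the real names come from the exported model.

```python
import json
import tempfile
from pathlib import Path


def load_id2label(model_dir):
    """Read the id2label mapping from a Hugging Face config.json, if present."""
    config = json.loads((Path(model_dir) / "config.json").read_text())
    # JSON object keys are strings, so convert them back to integers.
    return {int(key): name for key, name in config.get("id2label", {}).items()}


# Demo with a hypothetical config; replace `tmp` with "./ic_model" for the real mapping.
with tempfile.TemporaryDirectory() as tmp:
    demo_config = {"id2label": {"0": "class_a", "1": "class_b", "2": "class_c", "3": "class_d"}}
    (Path(tmp) / "config.json").write_text(json.dumps(demo_config))
    id2label = load_id2label(tmp)

predicted = 3  # label index from the printed result above
print(id2label[predicted])
```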
