# Export LLMs directly from HuggingFace into DeepSparse


## Export to ONNX using Optimum

**Preparing the ONNX Configuration and Exporting the Model**

In this step, you set up the model you want to convert to ONNX format. The script uses the `facebook/opt-125m` model, but you can substitute [any other OPT model on HuggingFace](https://huggingface.co/models?other=opt&sort=trending&search=facebook%2Fopt) by changing the `model_id` variable to the new model name.

The script then prepares the ONNX configuration and exports the model to ONNX format. The arguments passed to the `OPTOnnxConfig` class specify the task type (`"text-generation"`) and disable the use of past key values (the "past" tensors of transformer models) in the exported graph's inputs and outputs.

If you're using a different model or task type, you'll need to adjust these configuration parameters accordingly. The exported model is saved in the `opt-125m_onnx` directory. You can change the output directory through the `output` argument of `main_export()` if needed.
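
When you switch models, the model name and the output directory are easy to let drift apart. The following is a minimal sketch, not part of the original notebook, that derives the directory name from `model_id`. The helper names `export_dir_for` and `export_opt` are hypothetical; the arguments passed to `OPTOnnxConfig` and `main_export()` simply mirror the export cell in this notebook and may differ in other Optimum versions.

```python
def export_dir_for(model_id: str) -> str:
    """Derive an output directory such as 'opt-125m_onnx' from a Hub model id."""
    # Keep only the part after the organization prefix, e.g. "facebook/".
    name = model_id.rstrip("/").split("/")[-1]
    return f"{name}_onnx"


def export_opt(model_id: str) -> str:
    """Export an OPT checkpoint without past key values, mirroring this notebook."""
    # Imported lazily so export_dir_for() works before optimum is installed.
    from optimum.exporters.onnx import main_export
    from optimum.exporters.onnx.model_configs import OPTOnnxConfig
    from transformers import AutoConfig

    output = export_dir_for(model_id)
    onnx_config = OPTOnnxConfig(
        AutoConfig.from_pretrained(model_id),
        task="text-generation",
        use_past=False,
        use_past_in_inputs=False,
        use_present_in_outputs=False,
    )
    main_export(
        model_id,
        output=output,
        task="text-generation",
        custom_onnx_configs={"decoder_model": onnx_config},
        no_post_process=True,
    )
    return output


print(export_dir_for("facebook/opt-125m"))  # opt-125m_onnx
```

Calling `export_opt("facebook/opt-125m")` would then run the same export as the cell that follows, writing to the derived directory.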

In [None]:
!pip install optimum[exporters] -qqq

In [None]:
from optimum.exporters.onnx import main_export
from optimum.exporters.onnx.model_configs import OPTOnnxConfig
from transformers import AutoConfig


model_id = "facebook/opt-125m"
config = AutoConfig.from_pretrained(model_id)

onnx_config = OPTOnnxConfig(
    config,
    task="text-generation",
    use_past=False,
    use_past_in_inputs=False,
    use_present_in_outputs=False,
)

custom_onnx_configs = {
    "decoder_model": onnx_config,
}

main_export(
    model_id,
    output="opt-125m_onnx",
    task="text-generation",
    custom_onnx_configs=custom_onnx_configs,
    no_post_process=True,
)

Framework not specified. Using pt to export to ONNX.


The task `text-generation` was manually specified, and past key values will not be reused in the decoding. if needed, please pass `--task text-generation-with-past` to export using the past key values.


use_past = False is different than use_present_in_outputs = True, the value of use_present_in_outputs value will be used for the outputs.
Using framework PyTorch: 2.0.1+cu118
Overriding 1 configuration item(s)
	- use_cache -> False
  elif attention_mask.shape[1] != mask_seq_length:
  if input_shape[-1] > 1:
  if attn_weights.size() != (bsz * self.num_heads, tgt_len, src_len):
  if attention_mask.size() != (bsz, 1, tgt_len, src_len):
  attn_weights, torch.tensor(torch.finfo(attn_weights.dtype).min, device=attn_weights.device)
  if attn_output.size() != (bsz * self.num_heads, tgt_len, self.head_dim):


verbose: False, log level: Level.ERROR



Validating ONNX model opt-125m_onnx/decoder_model.onnx...
	-[✓] ONNX model output names match reference model (logits)
	- Validating ONNX Model output "logits":
		-[✓] (2, 16, 50272) matches (2, 16, 50272)
		-[x] values not close enough, max diff: 0.00014591217041015625 (atol: 1e-05)
- logits: max diff = 0.00014591217041015625.
 The exported model was saved at: opt-125m_onnx
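
The validation log above flags the `logits` output because the reported maximum difference (about 1.46e-4) exceeds the absolute tolerance of 1e-05, yet the export still completes. As a rough illustration of what such a check means, here is a simplified absolute-difference comparison. The function names are hypothetical, and Optimum's actual validation may differ in its details.

```python
def max_abs_diff(reference, exported):
    """Largest element-wise absolute difference between two flat sequences."""
    return max(abs(r - e) for r, e in zip(reference, exported))


def within_tolerance(reference, exported, atol=1e-5):
    """True when every element differs by at most `atol`."""
    return max_abs_diff(reference, exported) <= atol


# Values chosen to reproduce the difference reported in the log above.
reference = [0.0]
exported = [0.00014591217041015625]

print(within_tolerance(reference, exported, atol=1e-5))  # False
print(within_tolerance(reference, exported, atol=1e-3))  # True
```

Whether a difference of this size matters depends on your use case; comparing generated text from the original and exported models is a practical sanity check.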


In [None]:
!ls -al opt-125m_onnx

total 643672
drwxr-xr-x 2 root root      4096 Jul 19 22:51 .
drwxr-xr-x 1 root root      4096 Jul 19 22:51 ..
-rw-r--r-- 1 root root       719 Jul 19 22:51 config.json
-rw-r--r-- 1 root root 655723134 Jul 19 22:51 decoder_model.onnx
-rw-r--r-- 1 root root       132 Jul 19 22:51 generation_config.json
-rw-r--r-- 1 root root    456318 Jul 19 22:51 merges.txt
-rw-r--r-- 1 root root       548 Jul 19 22:51 special_tokens_map.json
-rw-r--r-- 1 root root       870 Jul 19 22:51 tokenizer_config.json
-rw-r--r-- 1 root root   2108630 Jul 19 22:51 tokenizer.json
-rw-r--r-- 1 root root    798293 Jul 19 22:51 vocab.json
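
The same check can be done programmatically before moving on. This is a small sketch using only the standard library; the helper name `missing_export_files` is hypothetical, and the expected file names are taken from the listing above.

```python
from pathlib import Path

# File names taken from the directory listing above.
EXPECTED_FILES = ("config.json", "decoder_model.onnx")


def missing_export_files(export_dir, expected=EXPECTED_FILES):
    """Return the expected file names that are absent from `export_dir`."""
    root = Path(export_dir)
    return [name for name in expected if not (root / name).is_file()]


# An empty list means both files are present.
print(missing_export_files("opt-125m_onnx"))
```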


## Add KV Caching to the ONNX Model for Fast Token Generation

This step injects Key-Value (KV) caching into the ONNX model to speed up token generation. If you're using a different model, make sure it supports KV caching. The `input_file` and `output_file` variables must point to the paths of your input and output ONNX model files.
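
To see why caching helps, consider a toy single-head attention loop in plain Python. It is only a conceptual sketch: the `project` function is a made-up stand-in for learned key and value projections, and none of this reflects how SparseML actually rewrites the ONNX graph. Without a cache, the keys and values of every earlier token are recomputed at each step; with a cache, only the new token is projected, and the results are identical.

```python
import math


def attend(query, keys, values):
    """Single-head scaled dot-product attention for one query vector."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    peak = max(scores)  # subtract the max for a numerically stable softmax
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    return [
        sum(w / total * value[i] for w, value in zip(weights, values))
        for i in range(len(values[0]))
    ]


def project(token, factor):
    """Hypothetical stand-in for a learned linear projection."""
    return [factor * x for x in token]


def decode_without_cache(tokens):
    outputs = []
    for t in range(len(tokens)):
        prefix = tokens[: t + 1]
        keys = [project(tok, 0.5) for tok in prefix]  # recomputed every step
        values = [project(tok, 2.0) for tok in prefix]
        outputs.append(attend(tokens[t], keys, values))
    return outputs


def decode_with_cache(tokens):
    key_cache, value_cache, outputs = [], [], []
    for token in tokens:
        key_cache.append(project(token, 0.5))  # only the new token is projected
        value_cache.append(project(token, 2.0))
        outputs.append(attend(token, key_cache, value_cache))
    return outputs


tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
assert decode_with_cache(tokens) == decode_without_cache(tokens)
```

The uncached loop performs a number of projections that grows quadratically with sequence length, while the cached loop performs one key and one value projection per token.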

In [None]:
!pip install git+https://github.com/neuralmagic/sparseml.git deepsparse-nightly -qqq

In [None]:
import os

import onnx
from sparseml.exporters.kv_cache_injector import KeyValueCacheInjector

input_file = "opt-125m_onnx/decoder_model.onnx"
output_file = "opt-125m_onnx/model.onnx"

# Load the graph without reading external weight data
model = onnx.load(input_file, load_external_data=False)
# The injector reads config.json from the model directory
model = KeyValueCacheInjector(model_path=os.path.dirname(input_file)).apply(model)
onnx.save(model, output_file)
print(f"Modified model saved to: {output_file}")

2023-07-19 22:53:07 sparseml.exporters.transforms.kv_cache.configs INFO     Loaded config file opt-125m_onnx/config.json for model: opt
INFO:sparseml.exporters.transforms.kv_cache.configs:Loaded config file opt-125m_onnx/config.json for model: opt
2023-07-19 22:53:07 sparseml.exporters.transforms.kv_cache.configs INFO     Properly configured arguments for KV Cache Transformation
INFO:sparseml.exporters.transforms.kv_cache.configs:Properly configured arguments for KV Cache Transformation
2023-07-19 22:53:09 sparseml.exporters.transforms.onnx_transform INFO     [CacheKeysAndValues] Transformed 24 matches
INFO:sparseml.exporters.transforms.onnx_transform:[CacheKeysAndValues] Transformed 24 matches
2023-07-19 22:53:12 sparseml.exporters.transforms.onnx_transform INFO     [PositionsAdjustmentOPT] Transformed 5 matches
INFO:sparseml.exporters.transforms.onnx_transform:[PositionsAdjustmentOPT] Transformed 5 matches


Modified model saved to: opt-125m_onnx/model.onnx


## Text Generation with DeepSparse

Finally, this step uses [DeepSparse](https://github.com/neuralmagic/deepsparse) to generate text with the ONNX model. You can adjust the `max_generated_tokens` parameter to control the length of the generated text. The `sequences` argument passed to `opt_pipeline()` is the input prompt for the text-generation task, which you can customize as needed.

Remember that the `model_path` in `Pipeline.create()` should point to the directory containing your enhanced ONNX model (`model.onnx`). The script currently prints the generated text to the console, but you could modify it to write the output to a file or pass it to a downstream pipeline for more complex generative tasks.
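
As one way to write the output to a file, the sketch below appends each generated sequence to a JSON Lines file. The helper name `save_generations` and the file name `generations.jsonl` are hypothetical. It assumes the pipeline result exposes the generated strings as a list in a `sequences` attribute, as the printed output at the end of this notebook suggests.

```python
import json


def save_generations(prompt, sequences, path):
    """Append one JSON record per generated sequence to a JSON Lines file."""
    with open(path, "a", encoding="utf-8") as handle:
        for text in sequences:
            record = {"prompt": prompt, "generated": text}
            handle.write(json.dumps(record) + "\n")


# Assumed usage once the pipeline cell has run (attribute name is an assumption):
# save_generations(
#     "Who is the president of the United States?",
#     inference.sequences,
#     "generations.jsonl",
# )
```

Appending rather than overwriting keeps the results of repeated runs in one file, with one record per line.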

In [None]:
from deepsparse import Pipeline

opt_pipeline = Pipeline.create(
    task="text-generation",
    model_path="opt-125m_onnx",
    max_generated_tokens=32,
    prompt_processing_sequence_length=1,
)
inference = opt_pipeline(sequences="Who is the president of the United States?")
print(inference)

2023-07-19 22:55:05 deepsparse.transformers.engines.nl_decoder_engine INFO     Overwriting in-place the input shapes of the transformer model at opt-125m_onnx/model.onnx
INFO:deepsparse.transformers.engines.nl_decoder_engine:Overwriting in-place the input shapes of the transformer model at opt-125m_onnx/model.onnx
2023-07-19 22:56:02 deepsparse.transformers.engines.nl_decoder_engine INFO     Overwriting in-place the input shapes of the transformer model at opt-125m_onnx/model.onnx
INFO:deepsparse.transformers.engines.nl_decoder_engine:Overwriting in-place the input shapes of the transformer model at opt-125m_onnx/model.onnx


sequences=['\n\nPresident Donald Trump is the president of the United States. He is the first president to be elected to the presidency.\n\nTrump is the first president'] logits=None session_id=None
