
Llama-2-7b is failing with bfloat16 export with onnx #1720

Open

anilmartha opened this issue Feb 26, 2024 · 1 comment
Labels
bug (Something isn't working), onnx (Related to the ONNX export)

Comments


anilmartha commented Feb 26, 2024

System Info

Optimum version: installed from source (python -m pip install git+https://github.com/huggingface/optimum.git)
OS: Windows 11 Pro
Python: 3.11.7

Who can help?

@MiCh

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction (minimal, reproducible, runnable)

We are seeing the error below when running the following command:
optimum-cli export onnx --model meta-llama/Llama-2-7b-chat-hf Llama-2-7b-chat-onnx/ --dtype bf16

(py311) D:\Users\anilm\hf>optimum-cli export onnx --model  meta-llama/Llama-2-7b-chat-hf Llama-2-7b-chat-onnx/ --dtype bf16
C:\ProgramData\anaconda3\envs\py311\Lib\site-packages\transformers\utils\hub.py:124: FutureWarning: Using `TRANSFORMERS_CACHE` is deprecated and will be removed in v5 of Transformers. Use `HF_HOME` instead.
  warnings.warn(
Framework not specified. Using pt to export the model.
config.json: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 614/614 [00:00<?, ?B/s]
config.json: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 614/614 [00:00<00:00, 612kB/s]
model.safetensors.index.json: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 26.8k/26.8k [00:00<00:00, 2.37MB/s]
model-00001-of-00002.safetensors: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████| 9.98G/9.98G [02:10<00:00, 76.5MB/s]
model-00002-of-00002.safetensors: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3.50G/3.50G [00:41<00:00, 84.3MB/s]
Downloading shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [02:52<00:00, 86.35s/it]
Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:10<00:00,  5.04s/it]
generation_config.json: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 188/188 [00:00<?, ?B/s]
Automatic task detection to text-generation-with-past (possible synonyms are: causal-lm-with-past).
tokenizer_config.json: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1.62k/1.62k [00:00<?, ?B/s]
tokenizer.model: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 500k/500k [00:00<?, ?B/s]
tokenizer.json: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1.84M/1.84M [00:00<00:00, 8.42MB/s]
special_tokens_map.json: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 414/414 [00:00<?, ?B/s]
Using the export variant default. Available variants are:
    - default: The default ONNX variant.
Exporting the model LlamaForCausalLM in bfloat16 float dtype. After the export, ONNX Runtime InferenceSession with CPU/CUDA execution provider likely does not implement all operators for the bfloat16 data type, and the loading is likely to fail.
Using framework PyTorch: 2.2.1+cpu
Overriding 1 configuration item(s)
        - use_cache -> True
C:\ProgramData\anaconda3\envs\py311\Lib\site-packages\transformers\models\llama\modeling_llama.py:1057: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
  if seq_length > self.causal_mask.shape[-1]:
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "C:\ProgramData\anaconda3\envs\py311\Scripts\optimum-cli.exe\__main__.py", line 7, in <module>
  File "C:\ProgramData\anaconda3\envs\py311\Lib\site-packages\optimum\commands\optimum_cli.py", line 163, in main
    service.run()
  File "C:\ProgramData\anaconda3\envs\py311\Lib\site-packages\optimum\commands\export\onnx.py", line 261, in run
    main_export(
  File "C:\ProgramData\anaconda3\envs\py311\Lib\site-packages\optimum\exporters\onnx\__main__.py", line 351, in main_export
    onnx_export_from_model(
  File "C:\ProgramData\anaconda3\envs\py311\Lib\site-packages\optimum\exporters\onnx\convert.py", line 1152, in onnx_export_from_model
    _, onnx_outputs = export_models(
                      ^^^^^^^^^^^^^^
  File "C:\ProgramData\anaconda3\envs\py311\Lib\site-packages\optimum\exporters\onnx\convert.py", line 763, in export_models
    export(
  File "C:\ProgramData\anaconda3\envs\py311\Lib\site-packages\optimum\exporters\onnx\convert.py", line 868, in export
    export_output = export_pytorch(
                    ^^^^^^^^^^^^^^^
  File "C:\ProgramData\anaconda3\envs\py311\Lib\site-packages\optimum\exporters\onnx\convert.py", line 577, in export_pytorch
    onnx_export(
  File "C:\ProgramData\anaconda3\envs\py311\Lib\site-packages\torch\onnx\utils.py", line 516, in export
    _export(
  File "C:\ProgramData\anaconda3\envs\py311\Lib\site-packages\torch\onnx\utils.py", line 1613, in _export
    graph, params_dict, torch_out = _model_to_graph(
                                    ^^^^^^^^^^^^^^^^
  File "C:\ProgramData\anaconda3\envs\py311\Lib\site-packages\torch\onnx\utils.py", line 1139, in _model_to_graph
    graph = _optimize_graph(
            ^^^^^^^^^^^^^^^^
  File "C:\ProgramData\anaconda3\envs\py311\Lib\site-packages\torch\onnx\utils.py", line 677, in _optimize_graph
    graph = _C._jit_pass_onnx(graph, operator_export_type)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\ProgramData\anaconda3\envs\py311\Lib\site-packages\torch\onnx\utils.py", line 1957, in _run_symbolic_function
    return symbolic_fn(graph_context, *inputs, **attrs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\ProgramData\anaconda3\envs\py311\Lib\site-packages\torch\onnx\symbolic_helper.py", line 306, in wrapper
    return fn(g, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^
  File "C:\ProgramData\anaconda3\envs\py311\Lib\site-packages\torch\onnx\symbolic_opset14.py", line 197, in scaled_dot_product_attention
    raise ValueError(
ValueError: Unsupported type for attn_mask: 15
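
For what it's worth, the failure does not seem specific to Llama. The untested sketch below exports a bare scaled_dot_product_attention call with a bfloat16 attn_mask and, on torch 2.2.1, should hit the same ValueError (type 15 appears to correspond to torch.bfloat16 in PyTorch's scalar-type enum); shapes and the wrapper module are only illustrative.

import torch
import torch.nn.functional as F


class SdpaWithMask(torch.nn.Module):
    # Minimal wrapper so that tracing goes through F.scaled_dot_product_attention
    # with an explicit (bfloat16) attention mask, as in the Llama export.
    def forward(self, q, k, v, mask):
        return F.scaled_dot_product_attention(q, k, v, attn_mask=mask)


q = torch.randn(1, 8, 16, 64, dtype=torch.bfloat16)
k = torch.randn(1, 8, 16, 64, dtype=torch.bfloat16)
v = torch.randn(1, 8, 16, 64, dtype=torch.bfloat16)
mask = torch.zeros(1, 8, 16, 16, dtype=torch.bfloat16)

# Opset 14 is where the SDPA symbolic lives; the dtype check on attn_mask
# is what raises "Unsupported type for attn_mask" on torch 2.2.1.
torch.onnx.export(SdpaWithMask(), (q, k, v, mask), "sdpa_bf16.onnx", opset_version=14)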

Expected behavior

The latest Optimum supports the bfloat16 data type, so we expect the bfloat16 export to succeed.

anilmartha added the bug label on Feb 26, 2024
fxmarty (Collaborator) commented Feb 26, 2024

@anilmartha Thank you for the report, this is unexpected. I did not add full CI coverage for bf16, but I should probably add it for the most used models.

It appears that PyTorch 2.2.1 does not support scaled_dot_product_attention export in BF16, see https://github.com/pytorch/pytorch/blob/v2.2.1/torch/onnx/symbolic_opset14.py#L191-L199

This is fixed on main: https://github.com/pytorch/pytorch/blob/f4cf25bb24be735b2502ae13f290017992c2fac8/torch/onnx/symbolic_opset14.py#L194 & pytorch/pytorch#117878

So in PyTorch 2.3, this will be possible.

We could add an option in Optimum to force the export to use the manual (eager) attention implementation instead of torch.nn.functional.scaled_dot_product_attention. Would that help you?
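
A rough, untested sketch of what that workaround could look like at the Python level today, by forcing eager attention at load time and then calling the programmatic exporter (the output and task argument names are assumptions based on the CLI's behavior and may differ):

import torch
from transformers import AutoModelForCausalLM
from optimum.exporters.onnx.convert import onnx_export_from_model

# Load in bfloat16 with the eager attention implementation so tracing never
# reaches torch.nn.functional.scaled_dot_product_attention.
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-chat-hf",
    torch_dtype=torch.bfloat16,
    attn_implementation="eager",
)

# Hand the already-loaded model to the exporter instead of the CLI.
onnx_export_from_model(
    model,
    output="Llama-2-7b-chat-onnx",
    task="text-generation-with-past",
)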

As an alternative, you could downgrade to torch==2.1.0 (for which SDPA is not picked in Transformers).

fxmarty added the onnx label on Feb 29, 2024