
Llama-2-7b is failing with bfloat16 export with onnx #1720

Open

anilmartha opened this issue Feb 26, 2024 · 1 comment
Labels
bug (Something isn't working), onnx (Related to the ONNX export)

Comments


anilmartha commented Feb 26, 2024

System Info

Optimum version: installed from source (python -m pip install git+https://github.com/huggingface/optimum.git)
OS: Windows 11 Pro
Python: 3.11.7

Who can help?

@MiCh

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction (minimal, reproducible, runnable)

We are seeing the error below when running the following command:
optimum-cli export onnx --model meta-llama/Llama-2-7b-chat-hf Llama-2-7b-chat-onnx/ --dtype bf16

(py311) D:\Users\anilm\hf>optimum-cli export onnx --model  meta-llama/Llama-2-7b-chat-hf Llama-2-7b-chat-onnx/ --dtype bf16
C:\ProgramData\anaconda3\envs\py311\Lib\site-packages\transformers\utils\hub.py:124: FutureWarning: Using `TRANSFORMERS_CACHE` is deprecated and will be removed in v5 of Transformers. Use `HF_HOME` instead.
  warnings.warn(
Framework not specified. Using pt to export the model.
config.json: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 614/614 [00:00<?, ?B/s]
config.json: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 614/614 [00:00<00:00, 612kB/s]
model.safetensors.index.json: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 26.8k/26.8k [00:00<00:00, 2.37MB/s]
model-00001-of-00002.safetensors: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████| 9.98G/9.98G [02:10<00:00, 76.5MB/s]
model-00002-of-00002.safetensors: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3.50G/3.50G [00:41<00:00, 84.3MB/s]
Downloading shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [02:52<00:00, 86.35s/it]
Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:10<00:00,  5.04s/it]
generation_config.json: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 188/188 [00:00<?, ?B/s]
Automatic task detection to text-generation-with-past (possible synonyms are: causal-lm-with-past).
tokenizer_config.json: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1.62k/1.62k [00:00<?, ?B/s]
tokenizer.model: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 500k/500k [00:00<?, ?B/s]
tokenizer.json: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1.84M/1.84M [00:00<00:00, 8.42MB/s]
special_tokens_map.json: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 414/414 [00:00<?, ?B/s]
Using the export variant default. Available variants are:
    - default: The default ONNX variant.
Exporting the model LlamaForCausalLM in bfloat16 float dtype. After the export, ONNX Runtime InferenceSession with CPU/CUDA execution provider likely does not implement all operators for the bfloat16 data type, and the loading is likely to fail.
Using framework PyTorch: 2.2.1+cpu
Overriding 1 configuration item(s)
        - use_cache -> True
C:\ProgramData\anaconda3\envs\py311\Lib\site-packages\transformers\models\llama\modeling_llama.py:1057: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
  if seq_length > self.causal_mask.shape[-1]:
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "C:\ProgramData\anaconda3\envs\py311\Scripts\optimum-cli.exe\__main__.py", line 7, in <module>
  File "C:\ProgramData\anaconda3\envs\py311\Lib\site-packages\optimum\commands\optimum_cli.py", line 163, in main
    service.run()
  File "C:\ProgramData\anaconda3\envs\py311\Lib\site-packages\optimum\commands\export\onnx.py", line 261, in run
    main_export(
  File "C:\ProgramData\anaconda3\envs\py311\Lib\site-packages\optimum\exporters\onnx\__main__.py", line 351, in main_export
    onnx_export_from_model(
  File "C:\ProgramData\anaconda3\envs\py311\Lib\site-packages\optimum\exporters\onnx\convert.py", line 1152, in onnx_export_from_model
    _, onnx_outputs = export_models(
                      ^^^^^^^^^^^^^^
  File "C:\ProgramData\anaconda3\envs\py311\Lib\site-packages\optimum\exporters\onnx\convert.py", line 763, in export_models
    export(
  File "C:\ProgramData\anaconda3\envs\py311\Lib\site-packages\optimum\exporters\onnx\convert.py", line 868, in export
    export_output = export_pytorch(
                    ^^^^^^^^^^^^^^^
  File "C:\ProgramData\anaconda3\envs\py311\Lib\site-packages\optimum\exporters\onnx\convert.py", line 577, in export_pytorch
    onnx_export(
  File "C:\ProgramData\anaconda3\envs\py311\Lib\site-packages\torch\onnx\utils.py", line 516, in export
    _export(
  File "C:\ProgramData\anaconda3\envs\py311\Lib\site-packages\torch\onnx\utils.py", line 1613, in _export
    graph, params_dict, torch_out = _model_to_graph(
                                    ^^^^^^^^^^^^^^^^
  File "C:\ProgramData\anaconda3\envs\py311\Lib\site-packages\torch\onnx\utils.py", line 1139, in _model_to_graph
    graph = _optimize_graph(
            ^^^^^^^^^^^^^^^^
  File "C:\ProgramData\anaconda3\envs\py311\Lib\site-packages\torch\onnx\utils.py", line 677, in _optimize_graph
    graph = _C._jit_pass_onnx(graph, operator_export_type)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\ProgramData\anaconda3\envs\py311\Lib\site-packages\torch\onnx\utils.py", line 1957, in _run_symbolic_function
    return symbolic_fn(graph_context, *inputs, **attrs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\ProgramData\anaconda3\envs\py311\Lib\site-packages\torch\onnx\symbolic_helper.py", line 306, in wrapper
    return fn(g, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^
  File "C:\ProgramData\anaconda3\envs\py311\Lib\site-packages\torch\onnx\symbolic_opset14.py", line 197, in scaled_dot_product_attention
    raise ValueError(
ValueError: Unsupported type for attn_mask: 15
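
For what it's worth, the failure does not seem specific to Llama. The untested sketch below exports a bare scaled_dot_product_attention call with a bfloat16 attn_mask and, on torch 2.2.1, should hit the same ValueError (type 15 appears to correspond to torch.bfloat16 in PyTorch's scalar-type enum); shapes and the wrapper module are only illustrative.

import torch
import torch.nn.functional as F


class SdpaWithMask(torch.nn.Module):
    # Minimal wrapper so that tracing goes through F.scaled_dot_product_attention
    # with an explicit (bfloat16) attention mask, as in the Llama export.
    def forward(self, q, k, v, mask):
        return F.scaled_dot_product_attention(q, k, v, attn_mask=mask)


q = torch.randn(1, 8, 16, 64, dtype=torch.bfloat16)
k = torch.randn(1, 8, 16, 64, dtype=torch.bfloat16)
v = torch.randn(1, 8, 16, 64, dtype=torch.bfloat16)
mask = torch.zeros(1, 8, 16, 16, dtype=torch.bfloat16)

# Opset 14 is where the SDPA symbolic lives; the dtype check on attn_mask
# is what raises "Unsupported type for attn_mask" on torch 2.2.1.
torch.onnx.export(SdpaWithMask(), (q, k, v, mask), "sdpa_bf16.onnx", opset_version=14)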

Expected behavior

The latest Optimum supports the bfloat16 data type, so we expect the bfloat16 export to succeed.

anilmartha added the bug label on Feb 26, 2024
fxmarty (Collaborator) commented Feb 26, 2024

@anilmartha Thank you for the report, this is unexpected. I did not add full CI coverage for bf16, but I should probably add it for the most used models.

It appears that PyTorch 2.2.1 does not support scaled_dot_product_attention export in BF16, see https://github.com/pytorch/pytorch/blob/v2.2.1/torch/onnx/symbolic_opset14.py#L191-L199

This is fixed on main: https://github.com/pytorch/pytorch/blob/f4cf25bb24be735b2502ae13f290017992c2fac8/torch/onnx/symbolic_opset14.py#L194 & pytorch/pytorch#117878

So in PyTorch 2.3, this will be possible.

We could add an option in Optimum to force the export to use the manual (eager) attention implementation instead of torch.nn.functional.scaled_dot_product_attention. Would that help you?
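
A rough, untested sketch of what that workaround could look like at the Python level today, by forcing eager attention at load time and then calling the programmatic exporter (the output and task argument names are assumptions based on the CLI's behavior and may differ):

import torch
from transformers import AutoModelForCausalLM
from optimum.exporters.onnx.convert import onnx_export_from_model

# Load in bfloat16 with the eager attention implementation so tracing never
# reaches torch.nn.functional.scaled_dot_product_attention.
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-chat-hf",
    torch_dtype=torch.bfloat16,
    attn_implementation="eager",
)

# Hand the already-loaded model to the exporter instead of the CLI.
onnx_export_from_model(
    model,
    output="Llama-2-7b-chat-onnx",
    task="text-generation-with-past",
)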

As an alternative, you could downgrade to torch==2.1.0 (for which SDPA is not picked in Transformers).

fxmarty added the onnx label on Feb 29, 2024