Graph Output Error when loading optimized model after running optimizer.optimize #1868

Open · Ashton-Sidhu opened this issue May 22, 2024 · 0 comments
Labels: bug (Something isn't working)

Ashton-Sidhu commented May 22, 2024

System Info

onnx==1.16.0
onnxruntime-gpu==1.18.0
sentence-transformers==2.2.2
transformers==4.40.2
optimum==1.19.2
Python: 3.10.12

Who can help?

@JingyaHuang @echarlaix Hi!

I'm running into an error when trying to load an optimized facebook/nllb-200-distilled-600M model after optimizing it with the Optimum optimizer. In the snippet below (see Reproduction), the following steps run successfully:

  • exporting the model to ONNX
  • running optimizer.optimize to save the optimized model to a local directory

The error happens when loading the optimized model back from disk:

Fail: [ONNXRuntimeError] : 1 : FAIL : Load model from /tmp/optimized_model_2/decoder_model_optimized.onnx failed:/onnxruntime_src/onnxruntime/core/graph/graph.cc:1415 void onnxruntime::Graph::InitializeStateFromModelFileGraphProto() This is an invalid model. Graph output (present.0.decoder.key) does not exist in the graph.
File <command-1757234155281712>, line 1
----> 1 ort_model_optimized = ORTModelForSeq2SeqLM.from_pretrained("/tmp/optimized_model")
File /local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.10/site-packages/onnxruntime/capi/onnxruntime_inference_collection.py:472, in InferenceSession._create_inference_session(self, providers, provider_options, disabled_optimizers)
    469 self._register_ep_custom_ops(session_options, providers, provider_options, available_providers)
    471 if self._model_path:
--> 472     sess = C.InferenceSession(session_options, self._model_path, True, self._read_config_from_model)
    473 else:
    474     sess = C.InferenceSession(session_options, self._model_bytes, False, self._read_config_from_model)
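Not part of the original report, but a quick way to confirm the symptom: inspecting the optimized decoder's graph outputs with the onnx package should show whether present.0.decoder.key was actually dropped by the optimizer. A minimal sketch, assuming the onnx package and the save_dir/file name implied by the error and the repro below:

# Hypothetical diagnostic (not from the original report): list the graph outputs
# of the optimized decoder to check whether present.0.decoder.key survived.
import onnx

# load_external_data=False keeps the load cheap; only graph metadata is needed.
model = onnx.load(
    "/tmp/optimized_model/decoder_model_optimized.onnx",  # path assumed from the repro
    load_external_data=False,
)
print([output.name for output in model.graph.output])

# Structural validation; passing a path lets the checker resolve external data.
onnx.checker.check_model("/tmp/optimized_model/decoder_model_optimized.onnx")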

Logs from converting the model to ONNX:

Framework not specified. Using pt to export the model.
/databricks/python/lib/python3.10/site-packages/torch/_utils.py:776: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly.  To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
  return self.fget.__get__(instance, owner)()
Using the export variant default. Available variants are:
    - default: The default ONNX variant.
Some non-default generation parameters are set in the model config. These should go into a GenerationConfig file (https://huggingface.co/docs/transformers/generation_strategies#save-a-custom-decoding-strategy-with-your-model) instead. This warning will be raised to an exception in v4.41.
Non-default generation parameters: {'max_length': 200}

***** Exporting submodel 1/3: M2M100Encoder *****
Using framework PyTorch: 2.0.1+cu118
Overriding 1 configuration item(s)
	- use_cache -> False
/local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.10/site-packages/transformers/models/m2m_100/modeling_m2m_100.py:145: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
  if max_pos > self.weights.size(0):
/local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.10/site-packages/transformers/models/m2m_100/modeling_m2m_100.py:270: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
  if attn_weights.size() != (bsz * self.num_heads, tgt_len, src_len):
/local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.10/site-packages/transformers/models/m2m_100/modeling_m2m_100.py:277: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
  if attention_mask.size() != (bsz, 1, tgt_len, src_len):
/local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.10/site-packages/transformers/models/m2m_100/modeling_m2m_100.py:309: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
  if attn_output.size() != (bsz * self.num_heads, tgt_len, self.head_dim):
============= Diagnostic Run torch.onnx.export version 2.0.1+cu118 =============
verbose: False, log level: Level.ERROR
======================= 0 NONE 0 NOTE 0 WARNING 0 ERROR ========================


***** Exporting submodel 2/3: M2M100ForConditionalGeneration *****
Using framework PyTorch: 2.0.1+cu118
Overriding 1 configuration item(s)
	- use_cache -> True
/local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.10/site-packages/transformers/modeling_attn_mask_utils.py:86: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
  if input_shape[-1] > 1 or self.sliding_window is not None:
/local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.10/site-packages/transformers/modeling_attn_mask_utils.py:162: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
  if past_key_values_length > 0:
Saving external data to one file...
============= Diagnostic Run torch.onnx.export version 2.0.1+cu118 =============
verbose: False, log level: Level.ERROR
======================= 0 NONE 0 NOTE 0 WARNING 0 ERROR ========================


***** Exporting submodel 3/3: M2M100ForConditionalGeneration *****
Using framework PyTorch: 2.0.1+cu118
Overriding 1 configuration item(s)
	- use_cache -> True
/local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.10/site-packages/transformers/models/m2m_100/modeling_m2m_100.py:232: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
  and past_key_value[0].shape[2] == key_value_states.shape[1]
============= Diagnostic Run torch.onnx.export version 2.0.1+cu118 =============
verbose: False, log level: Level.ERROR
======================= 0 NONE 0 NOTE 0 WARNING 0 ERROR ========================

Saving external data to one file...
Some non-default generation parameters are set in the model config. These should go into a GenerationConfig file (https://huggingface.co/docs/transformers/generation_strategies#save-a-custom-decoding-strategy-with-your-model) instead. This warning will be raised to an exception in v4.41.
Non-default generation parameters: {'max_length': 200}

Logs from optimizing the model with optimizer.optimize():

Some non-default generation parameters are set in the model config. These should go into a GenerationConfig file (https://huggingface.co/docs/transformers/generation_strategies#save-a-custom-decoding-strategy-with-your-model) instead. This warning will be raised to an exception in v4.41.
Non-default generation parameters: {'max_length': 200}
Optimizing model...
2024-05-22 07:33:02.123202540 [W:onnxruntime:, session_state.cc:1166 VerifyEachNodeIsAssignedToAnEp] Some nodes were not assigned to the preferred execution providers which may or may not have an negative impact on performance. e.g. ORT explicitly assigns shape related ops to CPU to improve perf.
2024-05-22 07:33:02.123236069 [W:onnxruntime:, session_state.cc:1168 VerifyEachNodeIsAssignedToAnEp] Rerunning with verbose output on a non-minimal build will show node assignments.
2024-05-22 07:33:29.045201559 [W:onnxruntime:, session_state.cc:1166 VerifyEachNodeIsAssignedToAnEp] Some nodes were not assigned to the preferred execution providers which may or may not have an negative impact on performance. e.g. ORT explicitly assigns shape related ops to CPU to improve perf.
2024-05-22 07:33:29.045242843 [W:onnxruntime:, session_state.cc:1168 VerifyEachNodeIsAssignedToAnEp] Rerunning with verbose output on a non-minimal build will show node assignments.
failed in shape inference <class 'Exception'>
2024-05-22 07:35:13.180007714 [W:onnxruntime:, transformer_memcpy.cc:74 ApplyImpl] 1 Memcpy nodes are added to the graph torch_jit for CUDAExecutionProvider. It might have negative impact on performance (including unable to run CUDA graph). Set session_options.log_severity_level=1 to see the detail logs before this message.
2024-05-22 07:35:13.190649579 [W:onnxruntime:, session_state.cc:1166 VerifyEachNodeIsAssignedToAnEp] Some nodes were not assigned to the preferred execution providers which may or may not have an negative impact on performance. e.g. ORT explicitly assigns shape related ops to CPU to improve perf.
2024-05-22 07:35:13.190662901 [W:onnxruntime:, session_state.cc:1168 VerifyEachNodeIsAssignedToAnEp] Rerunning with verbose output on a non-minimal build will show node assignments.
Some non-default generation parameters are set in the model config. These should go into a GenerationConfig file (https://huggingface.co/docs/transformers/generation_strategies#save-a-custom-decoding-strategy-with-your-model) instead. This warning will be raised to an exception in v4.41.
Non-default generation parameters: {'max_length': 200}
Configuration saved in /tmp/optimized_model/ort_config.json
Optimized model saved at: /tmp/optimized_model (external data format: True; saved all tensor to one file: True)

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction (minimal, reproducible, runnable)

from transformers import AutoTokenizer
from optimum.onnxruntime import (
    ORTModelForSeq2SeqLM,
    ORTOptimizer,
    OptimizationConfig,
)

tokenizer = AutoTokenizer.from_pretrained("facebook/nllb-200-distilled-600M")
ort_model = ORTModelForSeq2SeqLM.from_pretrained(
    "facebook/nllb-200-distilled-600M",
    export=True,
)

optimizer = ORTOptimizer.from_pretrained(ort_model)

optimization_config = OptimizationConfig(
    optimization_level=2,
    enable_transformers_specific_optimizations=True,
    optimize_for_gpu=True,
)

optimizer.optimize(save_dir="/tmp/optimized_model", optimization_config=optimization_config)

# This line will error out
ort_model_optimized = ORTModelForSeq2SeqLM.from_pretrained("/tmp/optimized_model")

Expected behavior

The expected behavior is to be able to load the optimized model back from disk without errors.
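
Not part of the original report, but a possible isolation sketch under the same setup: the plain (unoptimized) export should round-trip through disk, and re-optimizing at a lower level may narrow down which pass drops the present.* outputs. The paths here are hypothetical:

from optimum.onnxruntime import ORTModelForSeq2SeqLM, ORTOptimizer, OptimizationConfig

# The plain export should save and load back without errors; if it does, the
# failure is specific to the optimized graph.
ort_model = ORTModelForSeq2SeqLM.from_pretrained(
    "facebook/nllb-200-distilled-600M",
    export=True,
)
ort_model.save_pretrained("/tmp/onnx_model")  # hypothetical path
ORTModelForSeq2SeqLM.from_pretrained("/tmp/onnx_model")

# Re-optimize with basic (level 1) fusions only; if this model loads, the
# higher-level passes are the likely culprit for the missing graph output.
optimizer = ORTOptimizer.from_pretrained(ort_model)
optimizer.optimize(
    save_dir="/tmp/optimized_model_l1",  # hypothetical path
    optimization_config=OptimizationConfig(optimization_level=1, optimize_for_gpu=True),
)
ORTModelForSeq2SeqLM.from_pretrained("/tmp/optimized_model_l1")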
