
[Bug]: Running Olive with ROCMExecutionProvider. #667

Open
lshqqytiger opened this issue Oct 29, 2023 · 8 comments

Labels
bug Something isn't working

@lshqqytiger
What happened?

I was able to get onnxruntime-training 1.16.1+rocm56 from onnxruntime.ai, and it includes ROCMExecutionProvider. But I found out that Olive expects ROCmExecutionProvider. I added ROCMExecutionProvider to AcceleratorLookup.EXECUTION_PROVIDERS, but I got the error below when optimizing unet. What is the difference between ROCmExecutionProvider and ROCMExecutionProvider? Is ROCMExecutionProvider not supported?

Running workflow on accelerator specs: gpu-rocm
Running pass convert:OnnxConversion
Running pass optimize:OrtTransformersOptimization
2023-10-29 21:15:13,526 onnx_model [INFO] - Skip removing useless cast nodes since shape inference failed.
2023-10-29 21:15:13,852 fusion_base [INFO] - Fused LayerNormalization: 48
2023-10-29 21:15:14,823 fusion_base [INFO] - Fused Gelu: 16
2023-10-29 21:15:15,533 onnx_model_unet [INFO] - Removed 54 Div nodes
2023-10-29 21:15:18,759 fusion_base [INFO] - Fused GroupNorm: 61
2023-10-29 21:15:21,125 onnx_model [INFO] - Removed 64 nodes
2023-10-29 21:15:25,312 onnx_model_unet [INFO] - opset version: 14
2023-10-29 21:15:27,991 onnx_model [WARNING] - Failed to run symbolic shape inference. Please file an issue in https://github.com/microsoft/onnxruntime.
2023-10-29 21:15:51,634 onnx_model [INFO] - Skip removing useless cast nodes since shape inference failed.
2023-10-29 21:15:55.437960083 [E:onnxruntime:, sequential_executor.cc:514 ExecuteKernel] Non-zero status code returned while running GroupNorm node. Name:'GroupNorm_0' Status Message: only the channels_last layout is supported
Failed to run Olive on gpu-rocm: [ONNXRuntimeError] : 2 : INVALID_ARGUMENT : Non-zero status code returned while running GroupNorm node. Name:'GroupNorm_0' Status Message: only the channels_last layout is supported
Traceback (most recent call last):
  File "/home/user/anaconda3/envs/olive/lib/python3.10/site-packages/olive/engine/engine.py", line 421, in run_accelerator
    return self.run_search(...)
  File "/home/user/anaconda3/envs/olive/lib/python3.10/site-packages/olive/engine/engine.py", line 585, in run_search
    should_prune, signal, model_ids = self._run_passes(...)
  File "/home/user/anaconda3/envs/olive/lib/python3.10/site-packages/olive/engine/engine.py", line 903, in _run_passes
    signal = self._evaluate_model(..., evaluator_config, accelerator_spec)
  File "/home/user/anaconda3/envs/olive/lib/python3.10/site-packages/olive/engine/engine.py", line 1090, in _evaluate_model
    signal = self.target.evaluate_model(..., accelerator_spec)
  File "/home/user/anaconda3/envs/olive/lib/python3.10/site-packages/olive/systems/local.py", line 47, in evaluate_model
    return evaluator.evaluate(model, ..., execution_providers=execution_providers)
  File "/home/user/anaconda3/envs/olive/lib/python3.10/site-packages/olive/evaluator/olive_evaluator.py", line 173, in evaluate
    metrics_res[metric.name] = self._evaluate_latency(model, data_root, metric, ..., execution_providers)
  File "/home/user/anaconda3/envs/olive/lib/python3.10/site-packages/olive/evaluator/olive_evaluator.py", line 635, in _evaluate_latency
    return self._evaluate_onnx_latency(..., device, execution_providers)
  File "/home/user/anaconda3/envs/olive/lib/python3.10/site-packages/olive/evaluator/olive_evaluator.py", line 410, in _evaluate_onnx_latency
    session.run(input_feed=input_..., ...)
  File "/home/user/anaconda3/envs/olive/lib/python3.10/site-packages/onnxruntime/capi/onnxruntime_inference_collection.py", line 220, in run
    return self._sess.run(output_names, ...)
InvalidArgument: [ONNXRuntimeError] : 2 : INVALID_ARGUMENT : Non-zero status code returned while running GroupNorm node. Name:'GroupNorm_0' Status Message: only the channels_last layout is supported
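
For reference, the exact provider name an onnxruntime build exposes can be listed with the public API; a ROCm build reports "ROCMExecutionProvider" with a capital M:

import onnxruntime as ort

# Lists the execution providers compiled into this onnxruntime build.
# These strings are what sessions (and Olive accelerator specs) must
# match exactly.
print(ort.get_available_providers())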

Version?

torch==2.2.0.dev20231024+rocm5.6
torchvision==0.17.0.dev20231024+rocm5.6
olive-ai==0.3.3
onnxruntime==1.16.1
onnxruntime-training==1.16.1+rocm56

lshqqytiger added the bug (Something isn't working) label Oct 29, 2023
@jambayk
Contributor

jambayk commented Oct 31, 2023

Hi,

Thanks for bringing this up! Olive's "ROCmExecutionProvider" is a typo for "ROCMExecutionProvider", which is the name onnxruntime actually uses.

With regard to the GroupNorm error: the options for the unet example were set for the DML EP, which supports channels_last = False. But the CUDA and ROCm EPs only support the channels_last layout: https://github.com/microsoft/onnxruntime/blob/main/onnxruntime/contrib_ops/rocm/diffusion/group_norm.cc#L82.

Can you try the example again after setting "group_norm_channels_last": true in the config json? https://github.com/microsoft/Olive/blob/main/examples/directml/stable_diffusion/config_unet.json#L81 (a sketch of that edit follows)
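
A minimal sketch of that edit (assuming, per the linked line, that "group_norm_channels_last" lives under the optimize pass's "optimization_options"; adjust the key path if the config differs):

import json

# Enable the channels_last layout for GroupNorm in the example config.
with open("config_unet.json") as f:
    config = json.load(f)

config["passes"]["optimize"]["config"]["optimization_options"]["group_norm_channels_last"] = True

with open("config_unet.json", "w") as f:
    json.dump(config, f, indent=4)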

We haven't tested the example with the ROCm EP, so there might be other incompatibilities.

jambayk mentioned this issue Oct 31, 2023
jambayk added a commit that referenced this issue Oct 31, 2023
## Describe your changes
As described in #667, the name
for rocm ep has a typo. onnxruntime uses `ROCMExecutionProvider` as the
name.

## Checklist before requesting a review
- [ ] Add unit tests for this change.
- [ ] Make sure all tests can pass.
- [ ] Update documents if necessary.
- [ ] Format your code by running `pre-commit run --all-files`
- [ ] Is this a user-facing change? If yes, give a description of this
change to be included in the release notes.

## (Optional) Issue link
@lshqqytiger
Author

Thank you for your kind reply. The official name is ROCm, so I think onnxruntime's spelling is the typo, but I understand for now. I now get the following error.

Failed to run Olive on gpu-rocm: [ONNXRuntimeError] : 6 : RUNTIME_EXCEPTION : Non-zero status code returned while running GroupNorm node. Name:'GroupNorm_0' Status Message: /onnxruntime_src/include/onnxruntime/core/framework/tensor.h:208 const T* onnxruntime::Tensor::Data() const [with T = float] utils::IsPrimitiveDataType<T>(dtype_) was false. Tensor type mismatch. T!=N11onnxruntime17PrimitiveDataTypeINS_9MLFloat16EEE

(traceback identical to the one above, ending at the same session.run call)
RuntimeException: [ONNXRuntimeError] : 6 : RUNTIME_EXCEPTION : Non-zero status code returned while running GroupNorm node. Name:'GroupNorm_0' Status Message: /onnxruntime_src/include/onnxruntime/core/framework/tensor.h:208 const T* onnxruntime::Tensor::Data() const [with T = float] utils::IsPrimitiveDataType<T>(dtype_) was false. Tensor type mismatch. T!=N11onnxruntime17PrimitiveDataTypeINS_9MLFloat16EEE

Also, I have been getting the warning below ever since I first tried optimization with ROCMExecutionProvider. It appears not only for UNet but also for other models, though it does not stop the optimization.

2023-10-31 20:58:37,169 onnx_model [WARNING] - Failed to run symbolic shape inference. Please file an issue in https://github.com/microsoft/onnxruntime.

@lshqqytiger
Author

lshqqytiger commented Oct 31, 2023

I found that it is caused by float16. I set float16 to false, and then I got this error when loading the ORT model after optimization.

2023-10-31 21:18:48,678 sd [ERROR] - [ONNXRuntimeError] : 9 : NOT_IMPLEMENTED : Could not find an implementation for MemcpyToHost(1) node with name 'Memcpy_token_1'

Where the error occurred:

submodels = ("text_encoder", "unet", "vae_encoder", "vae_decoder",)

for submodel in submodels:
    kwargs[submodel] = diffusers.OnnxRuntimeModel.from_pretrained(
        os.path.dirname(optimized_model_paths[submodel]),
    )

lshqqytiger changed the title from "[Bug]: Failed to run Olive on gpu-rocm: [ONNXRuntimeError] : 2 : INVALID_ARGUMENT : Non-zero status code returned while running GroupNorm node. Name:'GroupNorm_0' Status Message: only the channels_last layout is supported" to "[Bug]: Running Olive with ROCMExecutionProvider." Oct 31, 2023
@jambayk
Contributor

jambayk commented Nov 1, 2023

This looks like some other transformer optimization options in the example that are not compatible with the ROCm EP. Because the example was only tested with the DML EP, I am not aware of which ones.
Could you try the workflow with "optimization_options" removed so that it uses the default fusion options?
Without fp16=True, you can also safely remove "force_fp32_ops" and "keep_io_types" (a sketch of these edits follows).
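
A minimal sketch of those removals, using the same assumed key paths as the earlier config sketch:

import json

# Drop the DML-specific options so the default fusion options are used.
with open("config_unet.json") as f:
    config = json.load(f)

opt_config = config["passes"]["optimize"]["config"]
for key in ("optimization_options", "force_fp32_ops", "keep_io_types"):
    opt_config.pop(key, None)

opt_config["float16"] = False  # matches running without fp16, as described above

with open("config_unet.json", "w") as f:
    json.dump(config, f, indent=4)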

@lshqqytiger
Author

I did. It took longer than before, and I got the same error.

2023-11-01 19:07:31,629 sd [ERROR] - [ONNXRuntimeError] : 9 : NOT_IMPLEMENTED : Could not find an implementation for MemcpyToHost(1) node with name 'Memcpy_token_1'

@lshqqytiger
Author

lshqqytiger commented Nov 1, 2023

I found microsoft/onnxruntime#17837 and added provider="ROCMExecutionProvider" to OnnxRuntimeModel.from_pretrained as an argument (see the sketch below). Then I could load the optimized model successfully, but the generation process is very slow and I got weird output.
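
For reference, the loading call with an explicit provider looks roughly like this (model_dir stands in for os.path.dirname(optimized_model_paths[submodel]) from the earlier snippet):

import diffusers

# Pin the session to the ROCm EP instead of the default CPU EP, which is
# what raised the unimplemented MemcpyToHost node at load time.
model = diffusers.OnnxRuntimeModel.from_pretrained(
    model_dir,
    provider="ROCMExecutionProvider",
)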
I restored "optimization_options" and the optimization finished without any critical issues, but I got a model-size warning after optimizing unet:

Model is too large to save as a single file but 'save_as_external_data' is False. Saved tensors as external data regardless.

The optimized model was larger than the unoptimized one, the generation speed was slower than usual, and the results were corrupted.

@louwangzhiyuY

(quoting lshqqytiger's comment above about slow generation, the model-size warning, and corrupted results)

Did you solve the issue? I am seeing a similar issue, even though provider=DMLProvider in my environment.

@lshqqytiger
Author

lshqqytiger commented Jul 6, 2024

I'm getting this error nowadays.

onnxruntime.capi.onnxruntime_pybind11_state.RuntimeException: [ONNXRuntimeError] : 6 : RUNTIME_EXCEPTION : Non-zero status code returned while running MultiHeadAttention node. Name:'MultiHeadAttention_0' Status Message: /home/user/onnxruntime/onnxruntime/contrib_ops/rocm/bert/multihead_attention.cu:82 virtual Status onnxruntime::contrib::rocm::MultiHeadAttention<onnxruntime::MLFloat16>::ComputeInternal(OpKernelContext *) const [T = onnxruntime::MLFloat16] GetTuningContext()->IsTunableOpEnabled() was false. MultiHeadAttention of ROCm EP is only supported if tunable op is used and tuning is enabled.

This error occurs when I try to optimize unet.
I built onnxruntime-training from source (microsoft/onnxruntime@83e0c6b).
If I insert these lines to make sure that tunable ops are used and tuning is enabled,

# olive/common/ort_inference.py
# idx is the index of the ROCm EP entry in the provider_options list
# that is passed to the InferenceSession.
provider_options[idx]["tunable_op_enable"] = True
provider_options[idx]["tunable_op_tuning_enable"] = True

I get another error.

onnxruntime.capi.onnxruntime_pybind11_state.RuntimeException: [ONNXRuntimeError] : 6 : RUNTIME_EXCEPTION : Non-zero status code returned while running MultiHeadAttention node. Name:'MultiHeadAttention_0' Status Message: /home/user/onnxruntime/onnxruntime/core/framework/tunable.h:288 int onnxruntime::TunableOp<onnxruntime::contrib::rocm::GemmSoftmaxGemmPermuteParams<__half>, onnxruntime::rocm::tunable::Timer>::FindFastestImpl(const ParamsT *, const std::vector<Op<ParamsT>> &) [ParamsT = onnxruntime::contrib::rocm::GemmSoftmaxGemmPermuteParams<__half>, TimerT = onnxruntime::rocm::tunable::Timer] id >= 0 was false. Could not find viable op
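
For anyone reproducing this outside Olive, the same two flags can be passed as ROCm EP provider options when creating a plain InferenceSession (a minimal sketch; the model path is a placeholder):

import onnxruntime as ort

# Enable tunable ops and online tuning for the ROCm EP. "1" rather than
# True because ORT passes provider option values through as strings.
rocm_options = {
    "tunable_op_enable": "1",
    "tunable_op_tuning_enable": "1",
}
session = ort.InferenceSession(
    "unet/model.onnx",  # placeholder path to the optimized model
    providers=[("ROCMExecutionProvider", rocm_options)],
)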

Environment

Windows 11 23H2
Adrenaline 24.6.1
Ubuntu 22.04 (WSL2)
ROCm 6.1.3
RX 7900 XTX (gfx1100)

torch==2.5.0.dev20240706+rocm6.1
torchvision==0.20.0.dev20240706+rocm6.1
olive-ai==0.6.2
onnxruntime-training==1.19.0+cpu (built from source, microsoft/onnxruntime@83e0c6b)
