
[Bug]: Running Olive with ROCMExecutionProvider. #667

Open
lshqqytiger opened this issue Oct 29, 2023 · 8 comments

Labels
bug Something isn't working

@lshqqytiger
What happened?

I was able to get onnxruntime-training 1.16.1+rocm56 from onnxruntime.ai, and it includes ROCMExecutionProvider. But I found out that Olive expects ROCmExecutionProvider. I added ROCMExecutionProvider to AcceleratorLookup.EXECUTION_PROVIDERS, but I got the error below when optimizing unet. What is the difference between ROCmExecutionProvider and ROCMExecutionProvider? Is ROCMExecutionProvider not supported?

Running workflow on accelerator specs: gpu-rocm
Running pass convert:OnnxConversion
Running pass optimize:OrtTransformersOptimization
2023-10-29 21:15:13,526 onnx_model [INFO] - Skip removing useless cast nodes since shape inference failed.
2023-10-29 21:15:13,852 fusion_base [INFO] - Fused LayerNormalization: 48
2023-10-29 21:15:14,823 fusion_base [INFO] - Fused Gelu: 16
2023-10-29 21:15:15,533 onnx_model_unet [INFO] - Removed 54 Div nodes
2023-10-29 21:15:18,759 fusion_base [INFO] - Fused GroupNorm: 61
2023-10-29 21:15:21,125 onnx_model [INFO] - Removed 64 nodes
2023-10-29 21:15:25,312 onnx_model_unet [INFO] - opset version: 14
2023-10-29 21:15:27,991 onnx_model [WARNING] - Failed to run symbolic shape inference. Please file an issue in https://github.com/microsoft/onnxruntime.
2023-10-29 21:15:51,634 onnx_model [INFO] - Skip removing useless cast nodes since shape inference failed.
2023-10-29 21:15:55.437960083 [E:onnxruntime:, sequential_executor.cc:514 ExecuteKernel] Non-zero status code returned while running GroupNorm node. Name:'GroupNorm_0' Status Message: only the channels_last layout is supported
Failed to run Olive on gpu-rocm: [ONNXRuntimeError] : 2 : INVALID_ARGUMENT : Non-zero status code returned while running GroupNorm node. Name:'GroupNorm_0' Status Message: only the channels_last layout is supported
Traceback (most recent call last):
  File "/home/user/anaconda3/envs/olive/lib/python3.10/site-packages/olive/engine/engine.py", line 421, in run_accelerator
    return self.run_search(...)
  File "/home/user/anaconda3/envs/olive/lib/python3.10/site-packages/olive/engine/engine.py", line 585, in run_search
    should_prune, signal, model_ids = self._run_passes(...)
  File "/home/user/anaconda3/envs/olive/lib/python3.10/site-packages/olive/engine/engine.py", line 903, in _run_passes
    signal = self._evaluate_model(..., evaluator_config, accelerator_spec)
  File "/home/user/anaconda3/envs/olive/lib/python3.10/site-packages/olive/engine/engine.py", line 1090, in _evaluate_model
    signal = self.target.evaluate_model(..., accelerator_spec)
  File "/home/user/anaconda3/envs/olive/lib/python3.10/site-packages/olive/systems/local.py", line 47, in evaluate_model
    return evaluator.evaluate(model, ..., execution_providers=execution_providers)
  File "/home/user/anaconda3/envs/olive/lib/python3.10/site-packages/olive/evaluator/olive_evaluator.py", line 173, in evaluate
    metrics_res[metric.name] = self._evaluate_latency(model, data_root, metric, ..., execution_providers)
  File "/home/user/anaconda3/envs/olive/lib/python3.10/site-packages/olive/evaluator/olive_evaluator.py", line 635, in _evaluate_latency
    return self._evaluate_onnx_latency(..., device, execution_providers)
  File "/home/user/anaconda3/envs/olive/lib/python3.10/site-packages/olive/evaluator/olive_evaluator.py", line 410, in _evaluate_onnx_latency
    session.run(input_feed=input_..., ...)
  File "/home/user/anaconda3/envs/olive/lib/python3.10/site-packages/onnxruntime/capi/onnxruntime_inference_collection.py", line 220, in run
    return self._sess.run(output_names, ...)
InvalidArgument: [ONNXRuntimeError] : 2 : INVALID_ARGUMENT : Non-zero status code returned while running GroupNorm node. Name:'GroupNorm_0' Status Message: only the channels_last layout is supported
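
For reference, the exact provider name an onnxruntime build exposes can be listed with the public API; a ROCm build reports "ROCMExecutionProvider" with a capital M:

import onnxruntime as ort

# Lists the execution providers compiled into this onnxruntime build.
# These strings are what sessions (and Olive accelerator specs) must
# match exactly.
print(ort.get_available_providers())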

Version?

torch==2.2.0.dev20231024+rocm5.6
torchvision==0.17.0.dev20231024+rocm5.6
olive-ai==0.3.3
onnxruntime==1.16.1
onnxruntime-training==1.16.1+rocm56

lshqqytiger added the bug (Something isn't working) label Oct 29, 2023
@jambayk
Contributor

jambayk commented Oct 31, 2023

Hi,

Thanks for bringing this up! Olive's "ROCmExecutionProvider" is a typo for "ROCMExecutionProvider", which is the name onnxruntime actually uses.

With regard to the GroupNorm error: the options for the unet example were set for the DML EP, which supports channels_last = False. But the CUDA and ROCm EPs only support the channels_last layout: https://github.com/microsoft/onnxruntime/blob/main/onnxruntime/contrib_ops/rocm/diffusion/group_norm.cc#L82.

Can you try the example again after setting "group_norm_channels_last": true in the config json? https://github.com/microsoft/Olive/blob/main/examples/directml/stable_diffusion/config_unet.json#L81 (a sketch of that edit follows)
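
A minimal sketch of that edit (assuming, per the linked line, that "group_norm_channels_last" lives under the optimize pass's "optimization_options"; adjust the key path if the config differs):

import json

# Enable the channels_last layout for GroupNorm in the example config.
with open("config_unet.json") as f:
    config = json.load(f)

config["passes"]["optimize"]["config"]["optimization_options"]["group_norm_channels_last"] = True

with open("config_unet.json", "w") as f:
    json.dump(config, f, indent=4)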

We haven't tested the example with the ROCm EP, so there might be other incompatibilities.

jambayk mentioned this issue Oct 31, 2023
jambayk added a commit that referenced this issue Oct 31, 2023
## Describe your changes
As described in #667, the name
for rocm ep has a typo. onnxruntime uses `ROCMExecutionProvider` as the
name.

## Checklist before requesting a review
- [ ] Add unit tests for this change.
- [ ] Make sure all tests can pass.
- [ ] Update documents if necessary.
- [ ] Format your code by running `pre-commit run --all-files`
- [ ] Is this a user-facing change? If yes, give a description of this
change to be included in the release notes.

## (Optional) Issue link
@lshqqytiger
Author

Thank you for your kind reply. The official name is ROCm, so I think onnxruntime's spelling is the typo, but I understand for now. I now get the following error.

Failed to run Olive on gpu-rocm: [ONNXRuntimeError] : 6 : RUNTIME_EXCEPTION : Non-zero status code returned while running GroupNorm node. Name:'GroupNorm_0' Status Message: /onnxruntime_src/include/onnxruntime/core/framework/tensor.h:208 const T* onnxruntime::Tensor::Data() const [with T = float] utils::IsPrimitiveDataType<T>(dtype_) was false. Tensor type mismatch. T!=N11onnxruntime17PrimitiveDataTypeINS_9MLFloat16EEE

(traceback identical to the one above, ending at the same session.run call)
RuntimeException: [ONNXRuntimeError] : 6 : RUNTIME_EXCEPTION : Non-zero status code returned while running GroupNorm node. Name:'GroupNorm_0' Status Message: /onnxruntime_src/include/onnxruntime/core/framework/tensor.h:208 const T* onnxruntime::Tensor::Data() const [with T = float] utils::IsPrimitiveDataType<T>(dtype_) was false. Tensor type mismatch. T!=N11onnxruntime17PrimitiveDataTypeINS_9MLFloat16EEE

Also, I have been getting the warning below ever since I first tried optimization with ROCMExecutionProvider. It appears not only for UNet but also for other models, though it does not stop the optimization.

2023-10-31 20:58:37,169 onnx_model [WARNING] - Failed to run symbolic shape inference. Please file an issue in https://github.com/microsoft/onnxruntime.

@lshqqytiger
Author

lshqqytiger commented Oct 31, 2023

I found that it is caused by float16. I set float16 to false, and then I got this error when loading the ORT model after optimization.

2023-10-31 21:18:48,678 sd [ERROR] - [ONNXRuntimeError] : 9 : NOT_IMPLEMENTED : Could not find an implementation for MemcpyToHost(1) node with name 'Memcpy_token_1'

Where the error occurred:

submodels = ("text_encoder", "unet", "vae_encoder", "vae_decoder",)

for submodel in submodels:
    kwargs[submodel] = diffusers.OnnxRuntimeModel.from_pretrained(
        os.path.dirname(optimized_model_paths[submodel]),
    )

lshqqytiger changed the title from "[Bug]: Failed to run Olive on gpu-rocm: [ONNXRuntimeError] : 2 : INVALID_ARGUMENT : Non-zero status code returned while running GroupNorm node. Name:'GroupNorm_0' Status Message: only the channels_last layout is supported" to "[Bug]: Running Olive with ROCMExecutionProvider." Oct 31, 2023
@jambayk
Contributor

jambayk commented Nov 1, 2023

This looks like some other transformer optimization options in the example that are not compatible with the ROCm EP. Because the example was only tested with the DML EP, I am not aware of which ones.
Could you try the workflow with "optimization_options" removed so that it uses the default fusion options?
Without fp16=True, you can also safely remove "force_fp32_ops" and "keep_io_types" (a sketch of these edits follows).
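
A minimal sketch of those removals, using the same assumed key paths as the earlier config sketch:

import json

# Drop the DML-specific options so the default fusion options are used.
with open("config_unet.json") as f:
    config = json.load(f)

opt_config = config["passes"]["optimize"]["config"]
for key in ("optimization_options", "force_fp32_ops", "keep_io_types"):
    opt_config.pop(key, None)

opt_config["float16"] = False  # matches running without fp16, as described above

with open("config_unet.json", "w") as f:
    json.dump(config, f, indent=4)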

@lshqqytiger
Author

I did. It took longer than before, and I got the same error.

2023-11-01 19:07:31,629 sd [ERROR] - [ONNXRuntimeError] : 9 : NOT_IMPLEMENTED : Could not find an implementation for MemcpyToHost(1) node with name 'Memcpy_token_1'

@lshqqytiger
Author

lshqqytiger commented Nov 1, 2023

I found microsoft/onnxruntime#17837 and added provider="ROCMExecutionProvider" to OnnxRuntimeModel.from_pretrained as an argument (see the sketch below). Then I could load the optimized model successfully, but the generation process is very slow and I got weird output.
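
For reference, the loading call with an explicit provider looks roughly like this (model_dir stands in for os.path.dirname(optimized_model_paths[submodel]) from the earlier snippet):

import diffusers

# Pin the session to the ROCm EP instead of the default CPU EP, which is
# what raised the unimplemented MemcpyToHost node at load time.
model = diffusers.OnnxRuntimeModel.from_pretrained(
    model_dir,
    provider="ROCMExecutionProvider",
)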
I restored "optimization_options" and the optimization finished without any critical issues, but I got a model-size warning after optimizing unet:

Model is too large to save as a single file but 'save_as_external_data' is False. Saved tensors as external data regardless.

The optimized model was larger than the unoptimized one, the generation speed was slower than usual, and the results were corrupted.

@louwangzhiyuY

(quoting lshqqytiger's comment above about slow generation, the model-size warning, and corrupted results)

Did you solve the issue? I am seeing a similar issue, even though provider=DMLProvider in my environment.

@lshqqytiger
Author

lshqqytiger commented Jul 6, 2024

I'm getting this error nowadays.

onnxruntime.capi.onnxruntime_pybind11_state.RuntimeException: [ONNXRuntimeError] : 6 : RUNTIME_EXCEPTION : Non-zero status code returned while running MultiHeadAttention node. Name:'MultiHeadAttention_0' Status Message: /home/user/onnxruntime/onnxruntime/contrib_ops/rocm/bert/multihead_attention.cu:82 virtual Status onnxruntime::contrib::rocm::MultiHeadAttention<onnxruntime::MLFloat16>::ComputeInternal(OpKernelContext *) const [T = onnxruntime::MLFloat16] GetTuningContext()->IsTunableOpEnabled() was false. MultiHeadAttention of ROCm EP is only supported if tunable op is used and tuning is enabled.

This error occurs when I try to optimize unet.
I built onnxruntime-training from source (microsoft/onnxruntime@83e0c6b).
If I insert these lines to make sure that tunable ops are used and tuning is enabled,

# olive/common/ort_inference.py
# idx is the index of the ROCm EP entry in the provider_options list
# that is passed to the InferenceSession.
provider_options[idx]["tunable_op_enable"] = True
provider_options[idx]["tunable_op_tuning_enable"] = True

I get another error.

onnxruntime.capi.onnxruntime_pybind11_state.RuntimeException: [ONNXRuntimeError] : 6 : RUNTIME_EXCEPTION : Non-zero status code returned while running MultiHeadAttention node. Name:'MultiHeadAttention_0' Status Message: /home/user/onnxruntime/onnxruntime/core/framework/tunable.h:288 int onnxruntime::TunableOp<onnxruntime::contrib::rocm::GemmSoftmaxGemmPermuteParams<__half>, onnxruntime::rocm::tunable::Timer>::FindFastestImpl(const ParamsT *, const std::vector<Op<ParamsT>> &) [ParamsT = onnxruntime::contrib::rocm::GemmSoftmaxGemmPermuteParams<__half>, TimerT = onnxruntime::rocm::tunable::Timer] id >= 0 was false. Could not find viable op
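
For anyone reproducing this outside Olive, the same two flags can be passed as ROCm EP provider options when creating a plain InferenceSession (a minimal sketch; the model path is a placeholder):

import onnxruntime as ort

# Enable tunable ops and online tuning for the ROCm EP. "1" rather than
# True because ORT passes provider option values through as strings.
rocm_options = {
    "tunable_op_enable": "1",
    "tunable_op_tuning_enable": "1",
}
session = ort.InferenceSession(
    "unet/model.onnx",  # placeholder path to the optimized model
    providers=[("ROCMExecutionProvider", rocm_options)],
)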

Environment

Windows 11 23H2
Adrenaline 24.6.1
Ubuntu 22.04 (WSL2)
ROCm 6.1.3
RX 7900 XTX (gfx1100)

torch==2.5.0.dev20240706+rocm6.1
torchvision==0.20.0.dev20240706+rocm6.1
olive-ai==0.6.2
onnxruntime-training==1.19.0+cpu (built from source, microsoft/onnxruntime@83e0c6b)
