
[ROCm] Global (average) Pooling unusable. #15482

Closed
cloudhan opened this issue Apr 12, 2023 · 8 comments · Fixed by #15481
Labels
ep:ROCm questions/issues related to ROCm execution provider

Comments

@cloudhan (Member)
Describe the issue

  1. Crash on some shapes
  2. Incorrect results on some shapes

To reproduce

To reproduce a crash

Run the following single-node model:

import numpy as np
import onnx
import onnxruntime as ort

batch=1
channel=64
dim1 = 410
dim2 = 400

ort.set_default_logger_severity(0)
ort.set_default_logger_verbosity(1000)

x = onnx.helper.make_tensor_value_info("x", onnx.TensorProto.FLOAT16, [batch, channel, dim1, dim2])
y = onnx.helper.make_tensor_value_info("y", onnx.TensorProto.FLOAT16, [batch, channel, 1, 1])

node = onnx.helper.make_node("GlobalAveragePool", inputs=["x"], outputs=["y"])
graph = onnx.helper.make_graph([node], "GP", [x], [y])
model = onnx.helper.make_model(graph)

sess = ort.InferenceSession(
    model.SerializeToString(), providers=[("ROCMExecutionProvider", {"miopen_conv_use_max_workspace": False})]
)

x = np.random.randn(batch, channel, dim1, dim2).astype(np.float16)
sess.run(input_feed = {"x": x}, output_names = ["y"])

It produces the following error:

MIOpen(HIP): Error [Do] 'amd_comgr_do_action(kind, handle, in.GetHandle(), out.GetHandle())' AMD_COMGR_ACTION_CODEGEN_BC_TO_RELOCATABLE: ERROR (1)
MIOpen(HIP): Error [BuildOcl] comgr status = ERROR (1)
MIOpen(HIP): Warning [BuildOcl] error: stack frame size (328004) exceeds limit (131056) in function 'mloPoolingG'
1 error generated.

MIOpen Error: /long_pathname_so_that_rpms_can_package_the_debug_info/data/driver/MLOpen/src/hipoc/hipoc_program.cpp:304: Code object build failed. Source: MIOpenPooling.cl
2023-04-12 08:27:32.942631957 [E:onnxruntime:Default, rocm_call.cc:119 RocmCall] MIOPEN failure 7: miopenStatusUnknownError ; GPU=0 ; hostname=linmif39a00000F ; file=/home/guangyunhan/onnxruntime/onnxruntime/core/providers/rocm/nn/pool.cc ; line=226 ; expr=PoolingForwardHelper(GetMiopenHandle(context), pooling_desc, &alpha, x_tensor, x_data, &beta, y_tensor, y_data); 
2023-04-12 08:27:32.942678735 [E:onnxruntime:, sequential_executor.cc:516 ExecuteKernel] Non-zero status code returned while running GlobalAveragePool node. Name:'' Status Message: MIOPEN failure 7: miopenStatusUnknownError ; GPU=0 ; hostname=linmif39a00000F ; file=/home/guangyunhan/onnxruntime/onnxruntime/core/providers/rocm/nn/pool.cc ; line=226 ; expr=PoolingForwardHelper(GetMiopenHandle(context), pooling_desc, &alpha, x_tensor, x_data, &beta, y_tensor, y_data);

The maximum shape that runs successfully is

batch=1
channel=64
dim1 = 255
dim2 = 255

#15481 fixes the problem by switching global pooling to use a reduction instead.

This problem impacts the usability of the ROCm EP for our internal users.
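As a rough sketch of the reduction approach mentioned above (an assumption about the idea, not the actual code in #15481): GlobalAveragePool over an NCHW tensor is equivalent to a mean-reduction over the spatial axes, which avoids MIOpen's pooling kernel entirely.

```python
import numpy as np

def global_average_pool(x: np.ndarray) -> np.ndarray:
    """GlobalAveragePool for NCHW input, expressed as a mean-reduction
    over the spatial axes (H, W), keeping them as size-1 dims."""
    return x.mean(axis=(2, 3), keepdims=True)

# The shape that crashes the MIOpen pooling kernel is no problem for a reduction.
x = np.random.randn(1, 64, 410, 400).astype(np.float32)
y = global_average_pool(x)
assert y.shape == (1, 64, 1, 1)
```

In ONNX terms this corresponds to ReduceMean over axes [2, 3] with keepdims=1, so the per-window stack-frame limit of MIOpen's pooling kernel never comes into play.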

To reproduce an incorrect result

import numpy as np
import onnx
import onnxruntime as ort

batch=1
channel=3
dim1 = 255
dim2 = 255

ort.set_default_logger_severity(0)
ort.set_default_logger_verbosity(1000)

x = onnx.helper.make_tensor_value_info("x", onnx.TensorProto.FLOAT16, [batch, channel, dim1, dim2])
y = onnx.helper.make_tensor_value_info("y", onnx.TensorProto.FLOAT16, [batch, channel, 1, 1])

node = onnx.helper.make_node("GlobalAveragePool", inputs=["x"], outputs=["y"])
graph = onnx.helper.make_graph([node], "GP", [x], [y])
model = onnx.helper.make_model(graph)

x = np.random.uniform(low=0.0, high=1.10, size=(batch, channel, dim1, dim2)).astype(np.float16)

sess = ort.InferenceSession(model.SerializeToString(), providers=["CPUExecutionProvider"])
ref = sess.run(input_feed = {"x": x}, output_names = ["y"])[0]

sess = ort.InferenceSession(
    model.SerializeToString(), providers=[("ROCMExecutionProvider", {"miopen_conv_use_max_workspace": False})]
)
y = sess.run(input_feed = {"x": x}, output_names = ["y"])[0]

print(ref.shape)
print(y.shape)
print(ref)
print(y)
print("max relative error:", np.abs((ref-y)/ref).max())
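A plausible (unconfirmed) explanation for the wrong values: averaging 255 × 255 = 65025 half-precision elements with an fp16 accumulator loses almost all later contributions once the accumulator's ULP exceeds the element magnitude. The following is a self-contained model of that effect, not MIOpen's actual kernel:

```python
import numpy as np

# Hypothetical illustration: sequentially accumulate 255*255 fp16 values of 0.5
# in an fp16 accumulator, rounding after every addition.
n = 255 * 255
x = np.full(n, 0.5, dtype=np.float16)

acc = np.float16(0.0)
for v in x:
    acc = np.float16(acc + v)  # rounding to fp16 at every step

naive_mean = float(acc) / n                       # fp16 accumulation
exact_mean = float(x.astype(np.float32).mean())   # fp32 accumulation

# Once acc reaches 1024, the fp16 ULP is 1.0, so adding 0.5 rounds away to
# nothing and the accumulator stalls far below the true sum.
print(naive_mean, exact_mean)
```

If the kernel accumulates in fp32 (or uses a tree reduction), the error stays small, which is consistent with the large "max relative error" the script above reports only on the ROCm EP.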

Urgency

Must be fixed

Platform

Linux

OS Version

Not applicable

ONNX Runtime Installation

Built from Source

ONNX Runtime Version or Commit ID

d49a8de

ONNX Runtime API

Python

Architecture

X64

Execution Provider

Other / Unknown

Execution Provider Library Version

ROCm 5.4.2

@github-actions github-actions bot added the ep:ROCm questions/issues related to ROCm execution provider label Apr 12, 2023
@JehandadKhan

Minimal MIOpenDriver command to reproduce the issue

Crash Issue:

MIOpen(HIP): Command [Pooling_logging_cmd] ./bin/MIOpenDriver poolfp16 -M 0 --input 1x64x410x400,10496000x164000x400x1 -y 410 -x 400 -p 0 -q 0 -v 1 -u 1 -m avg -F 1 -t 1

Incorrect Issue:

MIOpen(HIP): Command [Pooling_logging_cmd] ./bin/MIOpenDriver poolfp16 -M 0 --input 1x3x255x255,195075x65025x255x1 -y 255 -x 255 -p 0 -q 0 -v 1 -u 1 -m avg -F 1 -t 1

@atamazov

atamazov commented Apr 24, 2023

@cloudhan are you interested in getting a proper fix in MIOpen?

@cloudhan
Member Author

Why not?

@atamazov

@cloudhan OK, I'll keep you informed.

@atamazov

The correctness issue in MIOpen is fixed in ROCm/MIOpen#2118.

@atamazov

atamazov commented Sep 4, 2023

@cloudhan

FYI this is fixed for Backward pooling (except workspace index mask mode for Max pooling) in ROCm/MIOpen#2372.
