Describe the bug
ReduceMean allocates 4x the input size in memory.
Example: reducing a (4, 128, 1024, 1024) float32 tensor along axes 0, 2, 3 should only require memory for 128 output floats.
Instead it tries to allocate 8 GB (4x the 2 GB input) and fails with the following error:

```
[E:onnxruntime:, sequential_executor.cc:346 Execute] Non-zero status code returned while running ReduceMean node. Name:'op' Status Message: /onnxruntime_src/onnxruntime/core/framework/bfc_arena.cc:331 void* onnxruntime::BFCArena::AllocateRawInternal(size_t, bool) Failed to allocate memory for requested buffer of size 8589934592
```
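For reference, the back-of-the-envelope arithmetic behind those numbers:

```python
# Input tensor: (4, 128, 1024, 1024) float32
input_bytes = 4 * 128 * 1024 * 1024 * 4
print(input_bytes)      # 2147483648 -> 2 GiB
print(4 * input_bytes)  # 8589934592 -> exactly the failed allocation above
# Expected working memory: one float per element of the (1, 128, 1, 1) output
print(128 * 4)          # 512 bytes
```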
Urgency
Urgent. This blocks testing several models at the required batch size.
System information
- OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Linux Ubuntu 16.04
- ONNX Runtime installed from (source or binary): binary
- ONNX Runtime version: 1.10
- Python version: 3.7
- Visual Studio version (if applicable): -
- GCC/Compiler version (if compiling from source): -
- CUDA/cuDNN version: 11.1
- GPU model and memory: GTX 1080Ti, 11GB
To Reproduce
Run the following code:

```python
import numpy as np
import onnx
import onnxruntime as ort

FEATURE_MAP_SHAPE = (128, 1024, 1024)


def create_onnx_model():
    # Size is (batch_size / 2) GB
    input_proto = onnx.helper.make_tensor_value_info('x', onnx.TensorProto.FLOAT, [None, *FEATURE_MAP_SHAPE])
    output_proto = onnx.helper.make_tensor_value_info('y', onnx.TensorProto.FLOAT, [1, FEATURE_MAP_SHAPE[0], 1, 1])
    node_def = onnx.helper.make_node(
        'ReduceMean',
        inputs=[input_proto.name],
        outputs=[output_proto.name],
        name='op',
        axes=(0, 2, 3),
    )
    graph_def = onnx.helper.make_graph(
        nodes=[node_def],
        name='test-model',
        inputs=[input_proto],
        outputs=[output_proto],
    )
    model_def = onnx.helper.make_model(graph_def, producer_name='onnx-example')
    model_def.ir_version = 4
    model_def.opset_import[0].version = 11
    onnx.checker.check_model(model_def, full_check=True)
    onnx.save_model(model_def, 'test.onnx')
    return ort.InferenceSession('test.onnx', providers=['CUDAExecutionProvider'])


model = create_onnx_model()
batch_size = 4
x = np.zeros((batch_size, *FEATURE_MAP_SHAPE), np.float32)
model.run(None, {'x': x})[0].shape
```

See the attached Jupyter notebook for a complete runnable example: reducemean_demo.zip
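A possible workaround to experiment with (untested, and assuming the over-allocation is specific to the single multi-axis ReduceMean node) is to chain three single-axis ReduceMean nodes instead. Since each node keeps its reduced dimension and each axis is averaged over a fixed group size, the composition of the three means equals the mean over axes (0, 2, 3). A minimal sketch, as a drop-in replacement for the graph built in `create_onnx_model()` above:

```python
import onnx

FEATURE_MAP_SHAPE = (128, 1024, 1024)

input_proto = onnx.helper.make_tensor_value_info('x', onnx.TensorProto.FLOAT, [None, *FEATURE_MAP_SHAPE])
output_proto = onnx.helper.make_tensor_value_info('y', onnx.TensorProto.FLOAT, [1, FEATURE_MAP_SHAPE[0], 1, 1])

# Reduce one axis at a time instead of axes=(0, 2, 3) in a single node.
nodes = []
prev = input_proto.name
for i, axis in enumerate((0, 2, 3)):
    out = output_proto.name if axis == 3 else f'reduced_{i}'  # hypothetical intermediate names
    nodes.append(onnx.helper.make_node(
        'ReduceMean',
        inputs=[prev],
        outputs=[out],
        name=f'op_{i}',
        axes=[axis],
        keepdims=1,  # keep dims so the next reduction still sees a 4-D tensor
    ))
    prev = out

graph_def = onnx.helper.make_graph(
    nodes=nodes,
    name='test-model-chained',
    inputs=[input_proto],
    outputs=[output_proto],
)
```

Whether this actually avoids the 4x allocation depends on where the extra buffers come from, so treat it as a diagnostic experiment rather than a fix.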
Expected behavior
In the attached example a batch_size close to 20 should be possible, as it is in PyTorch (see the same notebook).
I.e., ReduceMean should require only about as much working memory as its output.
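For comparison, a minimal sketch of the PyTorch equivalent from the notebook (reconstructed here since only the zip is attached; exact peak numbers will vary by GPU and allocator state):

```python
import torch

batch_size = 20
x = torch.zeros((batch_size, 128, 1024, 1024), dtype=torch.float32, device='cuda')
y = x.mean(dim=(0, 2, 3), keepdim=True)  # shape (1, 128, 1, 1)
print(y.shape)
# The input alone is 10 GiB at batch_size=20; the reduction adds almost nothing.
print(torch.cuda.max_memory_allocated() / 2**30, 'GiB peak')
```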
Screenshots
None
Additional context
A runnable example: reducemean_demo.zip