
ReduceMean consumes an unreasonable amount of VRAM #10099

@ponbaton

Description

Describe the bug

ReduceMean allocates 4x the memory of its input.

Example: reducing a (4, 128, 1024, 1024) float32 tensor along axes 0, 2, 3 should only require memory for the 128 floats of its output.

Instead it tries to allocate 8 GB (4x the 2 GB input) and fails with the following error:

[E:onnxruntime:, sequential_executor.cc:346 Execute] Non-zero status code returned while running ReduceMean node. Name:'op' Status Message: /onnxruntime_src/onnxruntime/core/framework/bfc_arena.cc:331 void* onnxruntime::BFCArena::AllocateRawInternal(size_t, bool) Failed to allocate memory for requested buffer of size 8589934592
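
For reference, the arithmetic behind these numbers:

input_bytes = 4 * 128 * 1024 * 1024 * 4  # 4 bytes per float32 -> 2147483648 (2 GB input)
print(4 * input_bytes)                   # 8589934592 -- the failed allocation, 4x the input
print(128 * 4)                           # 512 -- bytes actually needed for the (1, 128, 1, 1) output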

Urgency

Urgent. This prevents testing several models at the required batch size.

System information

  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Linux Ubuntu 16.04
  • ONNX Runtime installed from (source or binary): binary
  • ONNX Runtime version: 1.10
  • Python version: 3.7
  • Visual Studio version (if applicable): -
  • GCC/Compiler version (if compiling from source): -
  • CUDA/cuDNN version: 11.1
  • GPU model and memory: GTX 1080Ti, 11GB

To Reproduce

Run the following code

import numpy as np
import onnx
import onnxruntime as ort


FEATURE_MAP_SHAPE = (128, 1024, 1024)


def create_onnx_model():
    # The input is (batch_size / 2) GB: batch_size * 128 * 1024 * 1024 float32 values.
    input_proto = onnx.helper.make_tensor_value_info('x', onnx.TensorProto.FLOAT, [None, *FEATURE_MAP_SHAPE])
    # keepdims defaults to 1, so the output shape is (1, 128, 1, 1).
    output_proto = onnx.helper.make_tensor_value_info('y', onnx.TensorProto.FLOAT, [1, FEATURE_MAP_SHAPE[0], 1, 1])

    # In opset 11, axes is an attribute of ReduceMean.
    node_def = onnx.helper.make_node(
        'ReduceMean',
        inputs=[input_proto.name],
        outputs=[output_proto.name],
        name='op',
        axes=(0, 2, 3),
    )

    graph_def = onnx.helper.make_graph(
        nodes=[node_def],
        name='test-model',
        inputs=[input_proto],
        outputs=[output_proto],
    )

    model_def = onnx.helper.make_model(graph_def, producer_name='onnx-example')
    model_def.ir_version = 4
    model_def.opset_import[0].version = 11

    onnx.checker.check_model(model_def, full_check=True)

    onnx.save_model(model_def, 'test.onnx')

    return ort.InferenceSession('test.onnx', providers=['CUDAExecutionProvider'])


model = create_onnx_model()
batch_size = 4
x = np.zeros((batch_size, *FEATURE_MAP_SHAPE), np.float32)  # 2 GB input
print(model.run(None, {'x': x})[0].shape)  # fails with the 8 GB allocation error above

See the attached Jupyter notebook for a complete runnable example: reducemean_demo.zip

Expected behavior

In the attached example a batch_size close to 20 should be possible (20 × 0.5 GB ≈ 10 GB of input, which fits in the GTX 1080Ti's 11 GB), as it is in PyTorch (see the same notebook).

That is, ReduceMean should require only about as much additional memory as its output to operate.
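
For comparison, a minimal PyTorch sketch of the same reduction (an assumption of roughly what the attached notebook does; the exact code is in the attachment):

import torch

FEATURE_MAP_SHAPE = (128, 1024, 1024)
batch_size = 20  # ~10 GB of input, fits on an 11 GB GTX 1080Ti

x = torch.zeros((batch_size, *FEATURE_MAP_SHAPE), dtype=torch.float32, device='cuda')
# Same reduction as ReduceMean with axes=(0, 2, 3) and keepdims=1;
# this succeeds at batch sizes where the ONNX Runtime model above fails.
y = x.mean(dim=(0, 2, 3), keepdim=True)
print(y.shape)  # torch.Size([1, 128, 1, 1])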

Screenshots
None

Additional context
A runnable example: reducemean_demo.zip

Metadata

    Labels

    core runtime (issues related to core runtime), stale (issues that have not been addressed in a while; categorized by a bot)
