
ReduceMean consumes an unreasonable amount of VRAM #10099

@ponbaton

Description

Describe the bug

ReduceMean allocates 4x the memory of its input.

Example: reducing a (4, 128, 1024, 1024) float32 tensor along axes 0, 2, 3 should only require memory for the 128 floats of its output.

Instead it tries to allocate 8 GB (4x the 2 GB input) and fails with the following error:

[E:onnxruntime:, sequential_executor.cc:346 Execute] Non-zero status code returned while running ReduceMean node. Name:'op' Status Message: /onnxruntime_src/onnxruntime/core/framework/bfc_arena.cc:331 void* onnxruntime::BFCArena::AllocateRawInternal(size_t, bool) Failed to allocate memory for requested buffer of size 8589934592
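
For reference, the arithmetic behind these numbers:

input_bytes = 4 * 128 * 1024 * 1024 * 4  # 4 bytes per float32 -> 2147483648 (2 GB input)
print(4 * input_bytes)                   # 8589934592 -- the failed allocation, 4x the input
print(128 * 4)                           # 512 -- bytes actually needed for the (1, 128, 1, 1) output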

Urgency

Urgent. This prevents testing several models at the required batch size.

System information

  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Linux Ubuntu 16.04
  • ONNX Runtime installed from (source or binary): binary
  • ONNX Runtime version: 1.10
  • Python version: 3.7
  • Visual Studio version (if applicable): -
  • GCC/Compiler version (if compiling from source): -
  • CUDA/cuDNN version: 11.1
  • GPU model and memory: GTX 1080Ti, 11GB

To Reproduce

Run the following code

import numpy as np
import onnx
import onnxruntime as ort


FEATURE_MAP_SHAPE = (128, 1024, 1024)


def create_onnx_model():
    # The input is (batch_size / 2) GB: batch_size * 128 * 1024 * 1024 float32 values.
    input_proto = onnx.helper.make_tensor_value_info('x', onnx.TensorProto.FLOAT, [None, *FEATURE_MAP_SHAPE])
    # keepdims defaults to 1, so the output shape is (1, 128, 1, 1).
    output_proto = onnx.helper.make_tensor_value_info('y', onnx.TensorProto.FLOAT, [1, FEATURE_MAP_SHAPE[0], 1, 1])

    # In opset 11, axes is an attribute of ReduceMean.
    node_def = onnx.helper.make_node(
        'ReduceMean',
        inputs=[input_proto.name],
        outputs=[output_proto.name],
        name='op',
        axes=(0, 2, 3),
    )

    graph_def = onnx.helper.make_graph(
        nodes=[node_def],
        name='test-model',
        inputs=[input_proto],
        outputs=[output_proto],
    )

    model_def = onnx.helper.make_model(graph_def, producer_name='onnx-example')
    model_def.ir_version = 4
    model_def.opset_import[0].version = 11

    onnx.checker.check_model(model_def, full_check=True)

    onnx.save_model(model_def, 'test.onnx')

    return ort.InferenceSession('test.onnx', providers=['CUDAExecutionProvider'])


model = create_onnx_model()
batch_size = 4
x = np.zeros((batch_size, *FEATURE_MAP_SHAPE), np.float32)  # 2 GB input
print(model.run(None, {'x': x})[0].shape)  # fails with the 8 GB allocation error above

See the attached Jupyter notebook for a complete runnable example: reducemean_demo.zip

Expected behavior

In the attached example a batch_size close to 20 should be possible (20 × 0.5 GB ≈ 10 GB of input, which fits in the GTX 1080Ti's 11 GB), as it is in PyTorch (see the same notebook).

That is, ReduceMean should require only about as much additional memory as its output to operate.
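
For comparison, a minimal PyTorch sketch of the same reduction (an assumption of roughly what the attached notebook does; the exact code is in the attachment):

import torch

FEATURE_MAP_SHAPE = (128, 1024, 1024)
batch_size = 20  # ~10 GB of input, fits on an 11 GB GTX 1080Ti

x = torch.zeros((batch_size, *FEATURE_MAP_SHAPE), dtype=torch.float32, device='cuda')
# Same reduction as ReduceMean with axes=(0, 2, 3) and keepdims=1;
# this succeeds at batch sizes where the ONNX Runtime model above fails.
y = x.mean(dim=(0, 2, 3), keepdim=True)
print(y.shape)  # torch.Size([1, 128, 1, 1])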

Screenshots
None

Additional context
A runnable example: reducemean_demo.zip

Metadata

    Labels

    core runtime (issues related to core runtime), stale (issues that have not been addressed in a while; categorized by a bot)
