
Why does enable_cpu_mem_arena have such a large effect on memory usage during inference? #11627

Open
joshuacwnewton opened this issue May 25, 2022 · 7 comments
Labels: documentation

joshuacwnewton commented May 25, 2022

Describe the bug

I'm performing inference using the Python API and a small ONNX model (~2MB) that was converted from a Keras .h5 model.

When running ort_sess.run() using default settings, memory usage skyrockets from ~200MB to ~6GB:

Line #    Mem usage    Increment  Occurrences   Line Contents
=============================================================
   172    206.0 MiB    206.0 MiB           1   @profile
   173                                         def onnx_prediction(model_abs_path, input):
   174    206.0 MiB      0.0 MiB           1       sess_options = ort.SessionOptions()
   175    206.1 MiB      0.0 MiB           1       sess_options.enable_profiling = True
   176    212.6 MiB      6.6 MiB           1       ort_sess = ort.InferenceSession(model_abs_path, sess_options=sess_options)
   177   5792.0 MiB   5579.3 MiB           1       preds = ort_sess.run(output_names=["predictions"], input_feed={"input_1": input})[0]
   178   5792.0 MiB      0.0 MiB           1       return preds

Searching in past GitHub issues, I found mention of enable_cpu_mem_arena. Setting this to False completely addresses the issue:

Line #    Mem usage    Increment  Occurrences   Line Contents
=============================================================
   172    206.0 MiB    206.0 MiB           1   @profile
   173                                         def onnx_prediction(model_abs_path, input):
   174    206.1 MiB      0.0 MiB           1       sess_options = ort.SessionOptions()
   175    206.1 MiB      0.0 MiB           1       sess_options.enable_profiling = True
   176    206.1 MiB      0.0 MiB           1       sess_options.enable_cpu_mem_arena = False
   177    212.4 MiB      6.4 MiB           1       ort_sess = ort.InferenceSession(model_abs_path, sess_options=sess_options)
   178    217.8 MiB      5.3 MiB           1       preds = ort_sess.run(output_names=["predictions"], input_feed={"input_1": input})[0]
   179    217.8 MiB      0.0 MiB           1       return preds

The docs on enable_cpu_mem_arena mention:

Enables the memory arena on CPU. Arena may pre-allocate memory for future usage. Set this option to false if you don’t want it. Default is True.

But I have some questions to better understand what's actually going on here:

  • Why was the CPU memory arena pre-allocating so much memory in the first place?
  • Are there any risks or downsides to setting enable_cpu_mem_arena = False?

Urgency

None.

System information

  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Windows 10
  • ONNX Runtime installed from (source or binary): Binary
  • ONNX Runtime version: 1.7.0
  • Python version: 3.7
  • Visual Studio version (if applicable): N/A
  • GCC/Compiler version (if compiling from source): N/A
  • CUDA/cuDNN version: N/A (CPU only)
  • GPU model and memory: N/A (CPU only)
  • CPU model: Intel(R) Core(TM) i7-7700HQ CPU @ 2.80GHz, 2801 Mhz, 4 Core(s), 8 Logical Processor(s)
  • RAM: 16.0 GB

To Reproduce

import onnxruntime as ort
from memory_profiler import profile

@profile
def onnx_prediction(model_path, input):
    ort_sess = ort.InferenceSession(model_path)
    preds = ort_sess.run(output_names=["predictions"], input_feed={"input_1": input})[0]
    return preds

Here is a .zip containing both an .onnx model file and a .npy array you can load to use for input: enable_cpu_memory_area_example.zip
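
For completeness, a hypothetical driver for the snippet above (the file names here are assumptions; use whatever the archive actually contains):

import numpy as np

# Assumed file names; adjust to match the contents of the zip.
input = np.load("input.npy")
preds = onnx_prediction("model.onnx", input)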

Expected behavior

Not pre-allocating 6GB of memory for a 2MB model.

tianleiwu commented May 25, 2022

This is expected. If you disable the arena, heap memory allocation will take time, so inference latency will increase. One drawback of the default arena extend strategy is that it might allocate more memory than needed, which can be wasteful.

If you want to save memory without impacting latency, I recommend setting execution provider options like the following Python code:

import onnxruntime as ort

sess_options = ort.SessionOptions()
cuda_provider_options = {
    "gpu_mem_limit": "17179869184",  # 16 GB
    "arena_extend_strategy": "kSameAsRequested",
}
cpu_provider_options = {
    "arena_extend_strategy": "kSameAsRequested",
}
execution_providers = [
    ("CUDAExecutionProvider", cuda_provider_options),
    ("CPUExecutionProvider", cpu_provider_options),
]
ort_sess = ort.InferenceSession(onnx_path, sess_options, providers=execution_providers)

Then, after the session is created, run inference once with the input (a warm-up query) that needs the most memory. That will allocate just enough memory for your needs, and also ensure that future inference runs do not allocate heap memory.
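
A minimal sketch of that warm-up step, continuing from the session above; the shape here is hypothetical and should be replaced with the largest input you expect to serve:

import numpy as np

# Hypothetical worst-case input shape; substitute your real maximum.
max_shape = (1, 512, 512, 1)

# With arena_extend_strategy=kSameAsRequested, this run grows the arena to
# exactly what the worst-case input needs, so later runs reuse that memory
# instead of triggering new heap allocations.
warmup_input = np.zeros(max_shape, dtype=np.float32)
ort_sess.run(output_names=["predictions"], input_feed={"input_1": warmup_input})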

joshuacwnewton commented May 25, 2022

Thank you so much for confirming that this is expected behavior, for explaining the trade-offs involved, and for providing a more detailed configuration to address this issue. ❤️

Would it be worth adding a section in the documentation that covers the CPU memory arena? I first read through the Tune performance page before making this issue, but there is currently no mention of enable_cpu_mem_arena.

@tianleiwu added the documentation label on May 25, 2022
tianleiwu commented

@joshuacwnewton, the suggestion sounds good. It would be worth having a section about arena settings.

tianleiwu commented

Here is some info related to the arena:
https://onnxruntime.ai/docs/get-started/with-c.html
See the sections Share allocator(s) between sessions, Memory arena shrinkage, and Allocate memory for initializer(s) from non-arena memory (for advanced users).

Example code:

TEST(CApiTest, ConfigureCudaArenaAndDemonstrateMemoryArenaShrinkage) {
  const auto& api = Ort::GetApi();
  Ort::SessionOptions session_options;
  const char* keys[] = {"max_mem", "arena_extend_strategy", "initial_chunk_size_bytes",
                        "max_dead_bytes_per_chunk", "initial_growth_chunk_size_bytes"};
  const size_t values[] = {0 /*let ort pick default max memory*/, 0, 1024, 0, 256};
  OrtArenaCfg* arena_cfg = nullptr;
  ASSERT_TRUE(api.CreateArenaCfgV2(keys, values, 5, &arena_cfg) == nullptr);
  std::unique_ptr<OrtArenaCfg, decltype(api.ReleaseArenaCfg)> rel_arena_cfg(arena_cfg, api.ReleaseArenaCfg);

  OrtCUDAProviderOptions cuda_provider_options = CreateDefaultOrtCudaProviderOptionsWithCustomStream(nullptr);
  cuda_provider_options.default_memory_arena_cfg = arena_cfg;
  session_options.AppendExecutionProvider_CUDA(cuda_provider_options);

  Ort::Session session(*ort_env, MODEL_URI, session_options);

  // Use a run option like this while invoking Run() to trigger a memory arena shrinkage post Run().
  // This will shrink memory allocations left unused at the end of Run() and cap the arena growth.
  // It does come with associated costs (cudaFree() is not free), but the benefit is that the
  // memory held by the arena (memory pool) is kept in check.
  Ort::RunOptions run_option;
  run_option.AddConfigEntry(kOrtRunOptionsConfigEnableMemoryArenaShrinkage, "gpu:0");

  // To also trigger a CPU memory arena shrinkage along with the GPU arena shrinkage, use the
  // following (the memory arena for the CPU should not have been disabled):
  // run_option.AddConfigEntry(kOrtRunOptionsConfigEnableMemoryArenaShrinkage, "cpu:0;gpu:0");
}

This is for the C API, and some settings might not be available in the Python API.
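
That said, arena shrinkage does appear reachable from Python through run options. A minimal sketch, assuming RunOptions.add_run_config_entry is available and that the key string "memory.enable_memory_arena_shrinkage" (the value behind kOrtRunOptionsConfigEnableMemoryArenaShrinkage in the C headers) is accepted; the model path and input shape are placeholders:

import numpy as np
import onnxruntime as ort

# Placeholder model path and input shape, for illustration only.
sess = ort.InferenceSession("model.onnx")
input_array = np.zeros((1, 512, 512, 1), dtype=np.float32)

# Assumption: this key string mirrors kOrtRunOptionsConfigEnableMemoryArenaShrinkage.
run_options = ort.RunOptions()
run_options.add_run_config_entry("memory.enable_memory_arena_shrinkage", "cpu:0")

# Unused arena chunks are released after this Run() completes.
preds = sess.run(["predictions"], {"input_1": input_array}, run_options)[0]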

snnn commented Jun 2, 2022

@souptc, FYI.

@tianleiwu, is the arena enabled by default in our published Python CPU package for the default CPU EP?

natke commented Jun 7, 2022

@pkreg101 You could add this information to #11508

tianleiwu commented

@snnn, the arena is enabled by default for CPU. See https://onnxruntime.ai/docs/api/python/api_summary.html#sessionoptions, which mentions that the option enable_cpu_mem_arena defaults to True.
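
For Python users landing here, a quick sanity check of the default (and the opt-out) looks like:

import onnxruntime as ort

sess_options = ort.SessionOptions()
print(sess_options.enable_cpu_mem_arena)  # True by default, per the docs above

# Opt out, trading some per-run allocation latency for a smaller footprint:
sess_options.enable_cpu_mem_arena = False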

@natke added this to To do in ONNX Runtime Samples and Documentation via automation on Dec 14, 2022