
Why does enable_cpu_mem_arena have such a large effect on memory usage during inference? #11627

Open
joshuacwnewton opened this issue May 25, 2022 · 7 comments
Labels: documentation

joshuacwnewton commented May 25, 2022

Describe the bug

I'm performing inference using the Python API and a small ONNX model (~2MB) that was converted from a Keras .h5 model.

When running ort_sess.run() using default settings, memory usage skyrockets from ~200MB to ~6GB:

Line #    Mem usage    Increment  Occurrences   Line Contents
=============================================================
   172    206.0 MiB    206.0 MiB           1   @profile
   173                                         def onnx_prediction(model_abs_path, input):
   174    206.0 MiB      0.0 MiB           1       sess_options = ort.SessionOptions()
   175    206.1 MiB      0.0 MiB           1       sess_options.enable_profiling = True
   176    212.6 MiB      6.6 MiB           1       ort_sess = ort.InferenceSession(model_abs_path, sess_options=sess_options)
   177   5792.0 MiB   5579.3 MiB           1       preds = ort_sess.run(output_names=["predictions"], input_feed={"input_1": input})[0]
   178   5792.0 MiB      0.0 MiB           1       return preds

Searching in past GitHub issues, I found mention of enable_cpu_mem_arena. Setting this to False completely addresses the issue:

Line #    Mem usage    Increment  Occurrences   Line Contents
=============================================================
   172    206.0 MiB    206.0 MiB           1   @profile
   173                                         def onnx_prediction(model_abs_path, input):
   174    206.1 MiB      0.0 MiB           1       sess_options = ort.SessionOptions()
   175    206.1 MiB      0.0 MiB           1       sess_options.enable_profiling = True
   176    206.1 MiB      0.0 MiB           1       sess_options.enable_cpu_mem_arena = False
   177    212.4 MiB      6.4 MiB           1       ort_sess = ort.InferenceSession(model_abs_path, sess_options=sess_options)
   178    217.8 MiB      5.3 MiB           1       preds = ort_sess.run(output_names=["predictions"], input_feed={"input_1": input})[0]
   179    217.8 MiB      0.0 MiB           1       return preds

The docs on enable_cpu_mem_arena mention:

Enables the memory arena on CPU. Arena may pre-allocate memory for future usage. Set this option to false if you don’t want it. Default is True.

But I have some questions to better understand what's actually going on here:

  • Why was the CPU memory arena pre-allocating so much memory in the first place?
  • Are there any risks or downsides to setting enable_cpu_mem_arena = False?

Urgency

None.

System information

  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Windows 10
  • ONNX Runtime installed from (source or binary): Binary
  • ONNX Runtime version: 1.7.0
  • Python version: 3.7
  • Visual Studio version (if applicable): N/A
  • GCC/Compiler version (if compiling from source): N/A
  • CUDA/cuDNN version: N/A (CPU only)
  • GPU model and memory: N/A (CPU only)
  • CPU model: Intel(R) Core(TM) i7-7700HQ CPU @ 2.80GHz, 2801 Mhz, 4 Core(s), 8 Logical Processor(s)
  • RAM: 16.0 GB

To Reproduce

import onnxruntime as ort
from memory_profiler import profile

@profile
def onnx_prediction(model_path, input):
    ort_sess = ort.InferenceSession(model_path)
    preds = ort_sess.run(output_names=["predictions"], input_feed={"input_1": input})[0]
    return preds

Here is a .zip containing both an .onnx model file and a .npy array you can load to use for input: enable_cpu_memory_area_example.zip
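
For completeness, a hypothetical driver for the snippet above (the file names here are assumptions; use whatever the archive actually contains):

import numpy as np

# Assumed file names; adjust to match the contents of the zip.
input = np.load("input.npy")
preds = onnx_prediction("model.onnx", input)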

Expected behavior

Not pre-allocating 6GB of memory for a 2MB model.

tianleiwu commented May 25, 2022

This is expected. If you disable the arena, heap memory allocation will take time, so inference latency will increase. One drawback of the default arena extend strategy is that it might allocate more memory than needed, which can be wasteful.

If you want to save memory without impacting latency, I recommend setting execution provider options like the following Python code:

import onnxruntime as ort

sess_options = ort.SessionOptions()
cuda_provider_options = {
    "gpu_mem_limit": "17179869184",  # 16 GB
    "arena_extend_strategy": "kSameAsRequested",
}
cpu_provider_options = {
    "arena_extend_strategy": "kSameAsRequested",
}
execution_providers = [
    ("CUDAExecutionProvider", cuda_provider_options),
    ("CPUExecutionProvider", cpu_provider_options),
]
ort_sess = ort.InferenceSession(onnx_path, sess_options, providers=execution_providers)

Then, after the session is created, run inference once with the input (a warm-up query) that needs the most memory. That will allocate just enough memory for your needs, and also ensure that future inference runs do not allocate heap memory.
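
A minimal sketch of that warm-up step, continuing from the session above; the shape here is hypothetical and should be replaced with the largest input you expect to serve:

import numpy as np

# Hypothetical worst-case input shape; substitute your real maximum.
max_shape = (1, 512, 512, 1)

# With arena_extend_strategy=kSameAsRequested, this run grows the arena to
# exactly what the worst-case input needs, so later runs reuse that memory
# instead of triggering new heap allocations.
warmup_input = np.zeros(max_shape, dtype=np.float32)
ort_sess.run(output_names=["predictions"], input_feed={"input_1": warmup_input})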

joshuacwnewton commented May 25, 2022

Thank you so much for confirming that this is expected behavior, for explaining the trade-offs involved, and for providing a more detailed configuration to address this issue. ❤️

Would it be worth adding a section in the documentation that covers the CPU memory arena? I first read through the Tune performance page before making this issue, but there is currently no mention of enable_cpu_mem_arena.

@tianleiwu added the documentation label on May 25, 2022
tianleiwu commented

@joshuacwnewton, the suggestion sounds good. It would be worth having a section about arena settings.

tianleiwu commented

Here is some info related to the arena:
https://onnxruntime.ai/docs/get-started/with-c.html
See the sections Share allocator(s) between sessions, Memory arena shrinkage, and Allocate memory for initializer(s) from non-arena memory (for advanced users).

Example code:

TEST(CApiTest, ConfigureCudaArenaAndDemonstrateMemoryArenaShrinkage) {
  const auto& api = Ort::GetApi();
  Ort::SessionOptions session_options;
  const char* keys[] = {"max_mem", "arena_extend_strategy", "initial_chunk_size_bytes",
                        "max_dead_bytes_per_chunk", "initial_growth_chunk_size_bytes"};
  const size_t values[] = {0 /*let ort pick default max memory*/, 0, 1024, 0, 256};
  OrtArenaCfg* arena_cfg = nullptr;
  ASSERT_TRUE(api.CreateArenaCfgV2(keys, values, 5, &arena_cfg) == nullptr);
  std::unique_ptr<OrtArenaCfg, decltype(api.ReleaseArenaCfg)> rel_arena_cfg(arena_cfg, api.ReleaseArenaCfg);

  OrtCUDAProviderOptions cuda_provider_options = CreateDefaultOrtCudaProviderOptionsWithCustomStream(nullptr);
  cuda_provider_options.default_memory_arena_cfg = arena_cfg;
  session_options.AppendExecutionProvider_CUDA(cuda_provider_options);

  Ort::Session session(*ort_env, MODEL_URI, session_options);

  // Use a run option like this while invoking Run() to trigger a memory arena shrinkage post Run().
  // This will shrink memory allocations left unused at the end of Run() and cap the arena growth.
  // It does come with associated costs (cudaFree() is not free), but the benefit is that the
  // memory held by the arena (memory pool) is kept in check.
  Ort::RunOptions run_option;
  run_option.AddConfigEntry(kOrtRunOptionsConfigEnableMemoryArenaShrinkage, "gpu:0");

  // To also trigger a CPU memory arena shrinkage along with the GPU arena shrinkage, use the
  // following (the memory arena for the CPU should not have been disabled):
  // run_option.AddConfigEntry(kOrtRunOptionsConfigEnableMemoryArenaShrinkage, "cpu:0;gpu:0");
}

This is for the C API, and some settings might not be available in the Python API.
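
That said, arena shrinkage does appear reachable from Python through run options. A minimal sketch, assuming RunOptions.add_run_config_entry is available and that the key string "memory.enable_memory_arena_shrinkage" (the value behind kOrtRunOptionsConfigEnableMemoryArenaShrinkage in the C headers) is accepted; the model path and input shape are placeholders:

import numpy as np
import onnxruntime as ort

# Placeholder model path and input shape, for illustration only.
sess = ort.InferenceSession("model.onnx")
input_array = np.zeros((1, 512, 512, 1), dtype=np.float32)

# Assumption: this key string mirrors kOrtRunOptionsConfigEnableMemoryArenaShrinkage.
run_options = ort.RunOptions()
run_options.add_run_config_entry("memory.enable_memory_arena_shrinkage", "cpu:0")

# Unused arena chunks are released after this Run() completes.
preds = sess.run(["predictions"], {"input_1": input_array}, run_options)[0]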

snnn commented Jun 2, 2022

@souptc, FYI.

@tianleiwu, is the arena enabled by default in our published Python CPU package for the default CPU EP?

natke commented Jun 7, 2022

@pkreg101 You could add this information to #11508

tianleiwu commented

@snnn, the arena is enabled by default for CPU. See https://onnxruntime.ai/docs/api/python/api_summary.html#sessionoptions, which mentions that the option enable_cpu_mem_arena defaults to True.
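
For Python users landing here, a quick sanity check of the default (and the opt-out) looks like:

import onnxruntime as ort

sess_options = ort.SessionOptions()
print(sess_options.enable_cpu_mem_arena)  # True by default, per the docs above

# Opt out, trading some per-run allocation latency for a smaller footprint:
sess_options.enable_cpu_mem_arena = False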

@natke added this to To do in ONNX Runtime Samples and Documentation via automation on Dec 14, 2022