Consider lowering MALLOC_ARENA_MAX to prevent native memory OOM #8993

Closed · highker opened this issue Sep 17, 2017 · 3 comments

highker (Contributor) commented Sep 17, 2017

Yes, we leak native memory

When compressing/decompressing gzipped tables with rcfile writers, we use the JDK's native zlib inflaters and deflaters, which allocate native system memory. There is an ongoing effort (#8531, #8879, #8455, #8529) to ensure the gzip input and output streams are properly closed to prevent native memory leaks. However, even with all these fixes, we are still leaking memory. The following figure shows the native memory usage with 4 concurrent queries of the shape `insert into A select * from B`. The cluster OOMed several times.
[screenshot: native memory usage over time with 4 concurrent insert queries]
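For context, the kind of leak those PRs address looks roughly like the sketch below (illustrative only, not the actual writer code): a GZIPOutputStream owns a native zlib Deflater, and unless the stream is closed (or Deflater.end() is called), the native buffers stay allocated until finalization.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.zip.GZIPOutputStream;

public class GzipCloseExample
{
    // The GZIPOutputStream owns a native zlib Deflater; closing the stream
    // (here via try-with-resources) calls Deflater.end(), which releases the
    // native memory right away instead of waiting for finalization.
    public static byte[] compress(byte[] input)
            throws IOException
    {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (GZIPOutputStream gzip = new GZIPOutputStream(sink)) {
            gzip.write(input);
        }
        return sink.toByteArray();
    }
}
```

Even with every stream closed like this, the native memory usage above keeps growing, which is what the rest of this issue is about.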

But there are no leaked objects!

To understand which objects are not being freed, we used jemalloc. However, the jemalloc profiling results show zero memory leaks. What's interesting is that the machines with jemalloc turned on didn't show any sign of a memory leak. The following figure compares a node with jemalloc to a node with the default allocator (glibc) in the same cluster, running the same queries as above.
[screenshot: native memory usage, jemalloc node vs. default glibc node]

Why do memory allocators make a difference?

glibc is the default native memory allocator for Java. Memory allocated through glibc may NOT be returned to the OS once it is freed, as a performance optimization. The downside of this is memory fragmentation. The fragmentation can grow unboundedly and eventually triggers an OOM. This blog describes the details. jemalloc, on the other hand, is designed to minimize memory fragmentation, which avoids the problem in the first place.

Tuning glibc

MALLOC_ARENA_MAX is an environment variable that controls how many memory pools (arenas) glibc can create. By default, it is 8 × the number of CPU cores. With MALLOC_ARENA_MAX set to 2, the OOM issue is completely gone. The following figure shows the native memory usage for different MALLOC_ARENA_MAX values vs. jemalloc. Note that the drop is not an OOM; I just killed the query. When MALLOC_ARENA_MAX is 2 or 4, the memory savings are even better than with jemalloc. But of course, this is a trade-off between memory and performance.
[screenshot: native memory usage for different MALLOC_ARENA_MAX values vs. jemalloc]
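Since glibc reads MALLOC_ARENA_MAX only at process startup, it has to be in the environment of the JVM process itself (e.g. exported by the launcher script). A hypothetical sketch of a wrapper passing it to a worker JVM (the class name and the value 2 are illustrative):

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class WorkerLauncher
{
    // Hypothetical wrapper: MALLOC_ARENA_MAX must be set in the child process
    // environment before the JVM starts; changing it from inside a running JVM
    // has no effect on glibc.
    public static Process launchWorker(String... jvmArgs)
            throws IOException
    {
        List<String> command = new ArrayList<>();
        command.add("java");
        command.addAll(Arrays.asList(jvmArgs));

        ProcessBuilder builder = new ProcessBuilder(command);
        builder.environment().put("MALLOC_ARENA_MAX", "2"); // one of the values tested above
        return builder.inheritIO().start();
    }
}
```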

What can we do to prevent this?

  • Use a memory pool like what Hadoop does
  • Switch to jemalloc
  • Tune down MALLOC_ARENA_MAX

The first option may not work well, given that a memory pool can hold onto a codec for a long time without releasing it. This can lead to memory waste; that is also the reason we switched from the Hadoop gzip library to the JDK one (#8481). Switching to jemalloc could be an option but may bring some uncertainty to the existing system. So maybe just tune down MALLOC_ARENA_MAX?
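For reference, the Hadoop-style pooling in the first bullet would look roughly like this hypothetical sketch, which also shows why idle pooled codecs keep native memory pinned until end() is called:

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.zip.Deflater;

// Hypothetical Hadoop-style codec pool, shown only to illustrate the trade-off:
// pooled deflaters are reset() and reused, so their native zlib buffers stay
// allocated for as long as they sit idle in the pool.
public class DeflaterPool
{
    private final Deque<Deflater> pool = new ArrayDeque<>();

    public synchronized Deflater borrow()
    {
        Deflater deflater = pool.poll();
        return (deflater != null) ? deflater : new Deflater();
    }

    public synchronized void release(Deflater deflater)
    {
        deflater.reset();
        pool.push(deflater); // native memory remains allocated while pooled
    }
}
```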

Pick a number for MALLOC_ARENA_MAX

The goal is to find a proper value for MALLOC_ARENA_MAX. Of course, this can vary across different types of machines/clusters. The test environment is a cluster with 95 nodes, where each node has 200GB of heap memory and 50GB of native memory.

1. To what extent we may OOM
Setting: a script repeatedly runs 4 concurrent queries, all reading the same table with 256 billion rows and inserting into another 4 tables. This benchmark runs for hours to determine whether there is a trend toward (or an actual) OOM.

MALLOC_ARENA_MAX=4:	not OOM
MALLOC_ARENA_MAX=8:	not OOM
MALLOC_ARENA_MAX=16:	OOM

Admittedly, this benchmark may not be representative, since the outcome depends heavily on which queries we run and how we split heap/non-heap memory.

2. CPU performance
Setting: a single query reads a table with 111 billion rows/26 columns and writes into another table. The task concurrency and number of writers are both set to 64 to simulate the production environment and put pressure on memory.

Original Hadoop writer:				43.60 CPU days
default MALLOC_ARENA_MAX with rcfile writer:	38.41 CPU days
MALLOC_ARENA_MAX=8 with rcfile writer:		38.57 CPU days
MALLOC_ARENA_MAX=4 with rcfile writer:		38.61 CPU days
MALLOC_ARENA_MAX=2 with rcfile writer:		38.69 CPU days

The rcfile writer is designed to run faster than the Hadoop one. Among the different values of MALLOC_ARENA_MAX, there is only a subtle difference. I bet most of the CPU is spent compressing/decompressing/writing/reading data rather than allocating/deallocating memory.

Conclusion

When memory is leaking, it may not be a problem in our code. It could just be improper tuning.

martint (Contributor) commented Sep 17, 2017

Can we reproduce this with a standalone test of a gzip input stream? It might be worth a post on the JDK dev list.

highker (Contributor, Author) commented Sep 18, 2017

More reading

@martint, I tried various ways to stress gzip streams (roughly of the shape sketched at the end of this comment); unfortunately, none of them could reproduce the problem on my own server. From some other posts I found, the leak may happen when we have multiple long-running threads, each asking for some memory allocation. If MALLOC_ARENA_MAX is a large number, new allocations won't share arenas with existing ones. Bounding MALLOC_ARENA_MAX forces glibc to let threads share existing malloc arenas instead of creating new ones. (How arenas work is described in the first post I pasted below.) So maybe directly reading from HDFS can help reproduce it? Here are some posts describing the problem in more detail:

More benchmarks

Also, I did some more benchmarking to compare jemalloc and glibc. I installed jemalloc on all machines in the 95-node cluster. With the same setup as the benchmark described in the 'CPU performance' section above (i.e., a single insertion query with 64 writers), jemalloc seems to outperform glibc (both benchmarks were run at least twice; the error margin is about 0.03 CPU days):

glibc:		38.41 CPU days
jemalloc:	37.63 CPU days
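
For reference, the kind of stress test mentioned above looks roughly like this hypothetical sketch (not the exact code): many long-lived threads, each repeatedly creating and closing gzip streams. On its own, this was not enough to reproduce the fragmentation.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.ThreadLocalRandom;
import java.util.zip.GZIPOutputStream;

// Hypothetical stress test: many long-lived threads, each allocating and freeing
// native zlib state in a loop, so different threads tend to land in different
// glibc arenas.
public class GzipStressTest
{
    public static void main(String[] args)
    {
        int threads = Runtime.getRuntime().availableProcessors() * 4;
        ExecutorService executor = Executors.newFixedThreadPool(threads);
        for (int i = 0; i < threads; i++) {
            executor.submit(() -> {
                byte[] data = new byte[1 << 20]; // 1MB of random, poorly compressible input
                ThreadLocalRandom.current().nextBytes(data);
                while (!Thread.currentThread().isInterrupted()) {
                    try (GZIPOutputStream gzip = new GZIPOutputStream(new ByteArrayOutputStream())) {
                        gzip.write(data); // allocates and frees a native deflater on every iteration
                    }
                    catch (IOException e) {
                        throw new RuntimeException(e);
                    }
                }
            });
        }
    }
}
```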

highker (Contributor, Author) commented Sep 19, 2017

Solution

Per an offline discussion, we are going to align the output data to a fixed size to reduce fragmentation. The following figure shows the native memory usage when the size is aligned to 4K. Look how stable the memory usage is!
[screenshot: native memory usage with output sizes aligned to 4K]

Which alignment size to use (1K, 2K, 4K?) will be decided by benchmarks in an upcoming PR.
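
For clarity, the alignment itself is just rounding each output buffer size up to a multiple of whichever block size wins that benchmark; a minimal sketch, assuming the 4K value from the figure above:

```java
public final class SizeAlignment
{
    // 4K, the alignment shown in the figure above; 1K or 2K would work the same way
    private static final int ALIGNMENT = 4 * 1024;

    private SizeAlignment() {}

    // Rounds a requested size up to the next multiple of ALIGNMENT so that freed
    // chunks come in a small set of uniform sizes glibc can reuse, reducing
    // fragmentation. For example, alignedSize(4097) returns 8192.
    public static int alignedSize(int size)
    {
        return ((size + ALIGNMENT - 1) / ALIGNMENT) * ALIGNMENT;
    }
}
```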

carterkozak added a commit to carterkozak/sls-packaging that referenced this issue Jun 13, 2018
uschi2000 pushed a commit to palantir/sls-packaging that referenced this issue Jun 13, 2018