Consider lowering MALLOC_ARENA_MAX to prevent native memory OOM #8993
Can we reproduce this with a standalone test of GZIP input streams? It might be worth a post on the JDK dev list.
More reading
@martint, I tried various ways to stress gzip streams; unfortunately, none of them could reproduce the problem on my own server. From some other posts I found, the leak may happen when we have multiple long-running threads, each asking for some native memory allocation. If MALLOC_ARENA_MAX is a large number, new allocations won't share arenas with the existing ones; bounding MALLOC_ARENA_MAX forces glibc to reuse existing malloc arenas (shared with other threads) instead of creating new ones. (How arenas work is described in the first post I pasted below.) So maybe reading directly from HDFS can help reproduce it? Here are some posts describing the problem in more detail:
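A minimal sketch of what such a stress test might look like, assuming many long-lived threads each keeping java.util.zip busy; the class name, thread count, payload size, and loop structure below are arbitrary placeholders, not the exact harness that was run:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.TimeUnit;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class GzipNativeMemoryStress
{
    public static void main(String[] args)
            throws InterruptedException
    {
        // Many long-lived threads, each repeatedly compressing and decompressing,
        // so glibc keeps handing out per-thread malloc arenas for zlib's native buffers.
        int threads = 64;  // placeholder; try >= the number of cores
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (int i = 0; i < threads; i++) {
            pool.submit(() -> {
                byte[] payload = new byte[1 << 20];  // 1 MB of random bytes per round
                byte[] buffer = new byte[8192];
                while (!Thread.currentThread().isInterrupted()) {
                    ThreadLocalRandom.current().nextBytes(payload);
                    try {
                        ByteArrayOutputStream compressed = new ByteArrayOutputStream();
                        try (GZIPOutputStream out = new GZIPOutputStream(compressed)) {
                            out.write(payload);
                        }
                        try (GZIPInputStream in = new GZIPInputStream(new ByteArrayInputStream(compressed.toByteArray()))) {
                            while (in.read(buffer) != -1) {
                                // discard the output; we only care about the inflater's native allocations
                            }
                        }
                    }
                    catch (Exception e) {
                        throw new RuntimeException(e);
                    }
                }
            });
        }
        // Watch the process RSS (e.g. with top or pmap) while this runs.
        pool.awaitTermination(365, TimeUnit.DAYS);
    }
}
```

While this runs, the signal to watch is the gap between the process RSS and the Java heap: with glibc and many arenas the gap may keep growing even though the heap stays flat.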
More benchmarks
Also, I did some more benchmarking to compare jemalloc and glibc. I installed jemalloc on all machines in the 95-node cluster. With the same setting as the benchmark described in the 'CPU performance' section above (i.e., a single insertion query with 64 writers), jemalloc seems to out-perform glibc (both benchmarks were run at least twice; the error margin is about 0.03 CPU days).
Given some versions of glibc, this can provide a significant reduction in virtual memory usage. See https://jkutner.github.io/2017/04/28/oh-the-places-your-java-memory-goes.html, prestodb/presto#8993, and https://issues.apache.org/jira/browse/HADOOP-7154.
Yes, we leak native memory
When compressing/decompressing gzipped tables with RCFile writers, we use Java's native zlib inflaters and deflaters, which allocate native (off-heap) memory. There is an ongoing effort (#8531, #8879, #8455, #8529) to ensure the gzip input and output streams are properly closed to prevent native memory leaks. However, even with all these fixes, we are still leaking memory. The following figure shows the native memory usage with 4 concurrent queries of the shape insert into A select * from B. The cluster OOMed several times.
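For reference, the fixes referenced above boil down to releasing the native zlib state deterministically instead of relying on finalization. A minimal sketch of that pattern (illustrative only, not code from the Presto code base; the class and method names are made up):

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.zip.Deflater;
import java.util.zip.DeflaterOutputStream;
import java.util.zip.GZIPInputStream;

public final class GzipStreams
{
    private GzipStreams() {}

    // Closing the stream in try-with-resources (or a finally block) releases the
    // native zlib state; waiting for finalization lets native memory pile up.
    public static void copyDecompressed(InputStream compressed, OutputStream target)
            throws IOException
    {
        byte[] buffer = new byte[8192];
        try (GZIPInputStream in = new GZIPInputStream(compressed)) {
            int read;
            while ((read = in.read(buffer)) != -1) {
                target.write(buffer, 0, read);
            }
        }
    }

    // When constructing a Deflater ourselves, call end() explicitly to free its native buffers.
    public static void compress(byte[] input, OutputStream target)
            throws IOException
    {
        Deflater deflater = new Deflater(Deflater.DEFAULT_COMPRESSION, true);
        try (DeflaterOutputStream out = new DeflaterOutputStream(target, deflater)) {
            out.write(input);
        }
        finally {
            deflater.end();
        }
    }
}
```

The key detail is that the GZIP streams own an internal Inflater/Deflater whose native buffers are only released on close(), or on end() when the caller supplies the Deflater itself.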
But there is no leaking object!
To understand which objects are not freed, we used jemalloc. However, the jemalloc profiling results show no memory leak at all. What's more interesting is that the machines with jemalloc turned on showed no sign of a memory leak in the first place. The following figure compares a node running jemalloc with a node running the default allocator (glibc) in the same cluster, with the same queries run as above.
Why do memory allocators make a difference?
glibc provides the default native memory allocator used by the JVM on Linux. Memory freed by the application may NOT be returned to the OS right away, as a performance optimization. The downside of this is memory fragmentation: the fragmentation can grow unboundedly and eventually trigger an OOM. This blog post describes the details. jemalloc, on the other hand, is designed to minimize memory fragmentation, which avoids this problem in the first place.
Tuning glibc
MALLOC_ARENA_MAX is an environment variable that controls how many memory pools (arenas) glibc can create. By default, the limit is 8 × the number of CPU cores. With MALLOC_ARENA_MAX set to 2, the OOM issue is completely gone. The following figure shows the native memory usage with different MALLOC_ARENA_MAX values vs. jemalloc. Note that the drop is not an OOM; I just killed the query. When MALLOC_ARENA_MAX is 2 or 4, the memory savings are even better than with jemalloc. But of course, this is a trade-off between memory and performance.
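As a rough back-of-the-envelope illustration (assuming, say, a 32-core worker, since the exact hardware is not stated here): the default cap of 8 × cores allows up to 256 arenas, and on 64-bit Linux each glibc arena reserves virtual address space in 64 MB heaps, so arenas alone can map on the order of 256 × 64 MB ≈ 16 GB of virtual memory even when resident usage is far lower. Capping MALLOC_ARENA_MAX at 2 shrinks that baseline reservation to roughly 2 × 64 MB, at the cost of more lock contention when many threads allocate at once.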
What can we do to prevent this?
The first point may not work well, given that the memory pool can hold on to a codec for a long time without releasing it, which can lead to wasted memory. That is also the reason we switched from the Hadoop gzip library to the JDK one (#8481). Switching to jemalloc could be an option, but it may bring some uncertainty to the existing system. So maybe we should just tune down MALLOC_ARENA_MAX?
Pick a number for MALLOC_ARENA_MAX
The goal is to find a proper value for MALLOC_ARENA_MAX. Of course, this can vary across different types of machines/clusters. The test environment is a 95-node cluster where each node has 200GB of heap memory and 50GB of native memory.
1. To what extent we may OOM
Setting: a script repetitively runs 4 concurrent queries, all reading the same table with 256 billion rows and inserting into another 4 tables. The benchmark runs for hours to determine whether there is a trend toward (or an actual) OOM.
Admittedly, this benchmark may not be representative, since the outcome really depends on what queries we run and how we split memory between heap and non-heap.
2. CPU performance
Setting: a single query reads a table with 111 billion rows and 26 columns and writes into another table. The task concurrency and the number of writers are both set to 64 to simulate the production environment and put pressure on memory.
The RCFile writer is designed to run faster than the Hadoop one. Among the different values of MALLOC_ARENA_MAX, there is only a subtle difference. I bet most of the CPU is spent compressing/decompressing/writing/reading data rather than allocating/deallocating memory.
Conclusion
When memory is leaking, it may not be a problem in our code. It could just be improper tuning.