Consider lowering MALLOC_ARENA_MAX to prevent native memory OOM #8993

Closed · highker opened this issue Sep 17, 2017 · 3 comments

highker (Contributor) commented Sep 17, 2017

Yes, we leak native memory

When compressing/decompressing gzipped tables with rcfile writers, we use the JDK's native zlib inflaters and deflaters, which allocate native system memory. There is an ongoing effort (#8531, #8879, #8455, #8529) to ensure the gzip input and output streams are properly closed to prevent native memory leaks. However, even with all these fixes, we are still leaking memory. The following figure shows the native memory usage with 4 concurrent queries of the shape `insert into A select * from B`. The cluster OOMed several times.
[screenshot: native memory usage over time with 4 concurrent insert queries]
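For context, the kind of leak those PRs address looks roughly like the sketch below (illustrative only, not the actual writer code): a GZIPOutputStream owns a native zlib Deflater, and unless the stream is closed (or Deflater.end() is called), the native buffers stay allocated until finalization.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.zip.GZIPOutputStream;

public class GzipCloseExample
{
    // The GZIPOutputStream owns a native zlib Deflater; closing the stream
    // (here via try-with-resources) calls Deflater.end(), which releases the
    // native memory right away instead of waiting for finalization.
    public static byte[] compress(byte[] input)
            throws IOException
    {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (GZIPOutputStream gzip = new GZIPOutputStream(sink)) {
            gzip.write(input);
        }
        return sink.toByteArray();
    }
}
```

Even with every stream closed like this, the native memory usage above keeps growing, which is what the rest of this issue is about.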

But there are no leaked objects!

To understand which objects are not being freed, we used jemalloc. However, the jemalloc profiling results show zero memory leaks. What's interesting is that the machines with jemalloc turned on didn't show any sign of a memory leak. The following figure compares a node with jemalloc to a node with the default allocator (glibc) in the same cluster, running the same queries as above.
[screenshot: native memory usage, jemalloc node vs. default glibc node]

Why do memory allocators make a difference?

glibc is the default native memory allocator for Java. Memory allocated through glibc may NOT be returned to the OS once it is freed, as a performance optimization. The downside of this is memory fragmentation. The fragmentation can grow unboundedly and eventually triggers an OOM. This blog describes the details. jemalloc, on the other hand, is designed to minimize memory fragmentation, which avoids the problem in the first place.

Tuning glibc

MALLOC_ARENA_MAX is an environment variable that controls how many memory pools (arenas) glibc can create. By default, it is 8 × the number of CPU cores. With MALLOC_ARENA_MAX set to 2, the OOM issue is completely gone. The following figure shows the native memory usage for different MALLOC_ARENA_MAX values vs. jemalloc. Note that the drop is not an OOM; I just killed the query. When MALLOC_ARENA_MAX is 2 or 4, the memory savings are even better than with jemalloc. But of course, this is a trade-off between memory and performance.
[screenshot: native memory usage for different MALLOC_ARENA_MAX values vs. jemalloc]
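Since glibc reads MALLOC_ARENA_MAX only at process startup, it has to be in the environment of the JVM process itself (e.g. exported by the launcher script). A hypothetical sketch of a wrapper passing it to a worker JVM (the class name and the value 2 are illustrative):

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class WorkerLauncher
{
    // Hypothetical wrapper: MALLOC_ARENA_MAX must be set in the child process
    // environment before the JVM starts; changing it from inside a running JVM
    // has no effect on glibc.
    public static Process launchWorker(String... jvmArgs)
            throws IOException
    {
        List<String> command = new ArrayList<>();
        command.add("java");
        command.addAll(Arrays.asList(jvmArgs));

        ProcessBuilder builder = new ProcessBuilder(command);
        builder.environment().put("MALLOC_ARENA_MAX", "2"); // one of the values tested above
        return builder.inheritIO().start();
    }
}
```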

What can we do to prevent this?

  • Use a memory pool like what Hadoop does
  • Switch to jemalloc
  • Tune down MALLOC_ARENA_MAX

The first option may not work well, given that a memory pool can hold onto a codec for a long time without releasing it. This can lead to memory waste; that is also the reason we switched from the Hadoop gzip library to the JDK one (#8481). Switching to jemalloc could be an option but may bring some uncertainty to the existing system. So maybe just tune down MALLOC_ARENA_MAX?
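For reference, the Hadoop-style pooling in the first bullet would look roughly like this hypothetical sketch, which also shows why idle pooled codecs keep native memory pinned until end() is called:

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.zip.Deflater;

// Hypothetical Hadoop-style codec pool, shown only to illustrate the trade-off:
// pooled deflaters are reset() and reused, so their native zlib buffers stay
// allocated for as long as they sit idle in the pool.
public class DeflaterPool
{
    private final Deque<Deflater> pool = new ArrayDeque<>();

    public synchronized Deflater borrow()
    {
        Deflater deflater = pool.poll();
        return (deflater != null) ? deflater : new Deflater();
    }

    public synchronized void release(Deflater deflater)
    {
        deflater.reset();
        pool.push(deflater); // native memory remains allocated while pooled
    }
}
```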

Pick a number for MALLOC_ARENA_MAX

The goal is to find a proper value for MALLOC_ARENA_MAX. Of course, this can vary across different types of machines/clusters. The test environment is a cluster with 95 nodes, where each node has 200GB of heap memory and 50GB of native memory.

1. To what extent we may OOM
Setting: a script repeatedly runs 4 concurrent queries, all reading the same table with 256 billion rows and inserting into another 4 tables. This benchmark runs for hours to determine whether there is a trend toward (or an actual) OOM.

MALLOC_ARENA_MAX=4:	not OOM
MALLOC_ARENA_MAX=8:	not OOM
MALLOC_ARENA_MAX=16:	OOM

Admittedly, this benchmark may not be representative, since the outcome depends heavily on which queries we run and how we split heap/non-heap memory.

2. CPU performance
Setting: a single query reads a table with 111 billion rows/26 columns and writes into another table. The task concurrency and number of writers are both set to 64 to simulate the production environment and put pressure on memory.

Original Hadoop writer:				43.60 CPU days
default MALLOC_ARENA_MAX with rcfile writer:	38.41 CPU days
MALLOC_ARENA_MAX=8 with rcfile writer:		38.57 CPU days
MALLOC_ARENA_MAX=4 with rcfile writer:		38.61 CPU days
MALLOC_ARENA_MAX=2 with rcfile writer:		38.69 CPU days

The rcfile writer is designed to run faster than the Hadoop one. Among the different values of MALLOC_ARENA_MAX, there is only a subtle difference. I bet most of the CPU is spent compressing/decompressing/writing/reading data rather than allocating/deallocating memory.

Conclusion

When memory is leaking, it may not be a problem in our code. It could just be improper tuning.

martint (Contributor) commented Sep 17, 2017

Can we reproduce this with a standalone test of a gzip input stream? It might be worth a post on the JDK dev list.

highker (Contributor, Author) commented Sep 18, 2017

More reading

@martint, I tried various ways to stress gzip streams (roughly of the shape sketched at the end of this comment); unfortunately, none of them could reproduce the problem on my own server. From some other posts I found, the leak may happen when we have multiple long-running threads, each asking for some memory allocation. If MALLOC_ARENA_MAX is a large number, new allocations won't share arenas with existing ones. Bounding MALLOC_ARENA_MAX forces glibc to let threads share existing malloc arenas instead of creating new ones. (How arenas work is described in the first post I pasted below.) So maybe directly reading from HDFS can help reproduce it? Here are some posts describing the problem in more detail:

More benchmarks

Also, I did some more benchmarking to compare jemalloc and glibc. I installed jemalloc on all machines in the 95-node cluster. With the same setup as the benchmark described in the 'CPU performance' section above (i.e., a single insertion query with 64 writers), jemalloc seems to outperform glibc (both benchmarks were run at least twice; the error margin is about 0.03 CPU days):

glibc:		38.41 CPU days
jemalloc:	37.63 CPU days
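
For reference, the kind of stress test mentioned above looks roughly like this hypothetical sketch (not the exact code): many long-lived threads, each repeatedly creating and closing gzip streams. On its own, this was not enough to reproduce the fragmentation.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.ThreadLocalRandom;
import java.util.zip.GZIPOutputStream;

// Hypothetical stress test: many long-lived threads, each allocating and freeing
// native zlib state in a loop, so different threads tend to land in different
// glibc arenas.
public class GzipStressTest
{
    public static void main(String[] args)
    {
        int threads = Runtime.getRuntime().availableProcessors() * 4;
        ExecutorService executor = Executors.newFixedThreadPool(threads);
        for (int i = 0; i < threads; i++) {
            executor.submit(() -> {
                byte[] data = new byte[1 << 20]; // 1MB of random, poorly compressible input
                ThreadLocalRandom.current().nextBytes(data);
                while (!Thread.currentThread().isInterrupted()) {
                    try (GZIPOutputStream gzip = new GZIPOutputStream(new ByteArrayOutputStream())) {
                        gzip.write(data); // allocates and frees a native deflater on every iteration
                    }
                    catch (IOException e) {
                        throw new RuntimeException(e);
                    }
                }
            });
        }
    }
}
```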

highker (Contributor, Author) commented Sep 19, 2017

Solution

Per an offline discussion, we are going to align the output data to a fixed size to reduce fragmentation. The following figure shows the native memory usage when the size is aligned to 4K. Look how stable the memory usage is!
[screenshot: native memory usage with output sizes aligned to 4K]

Which alignment size to use (1K, 2K, 4K?) will be decided by benchmarks in an upcoming PR.
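
For clarity, the alignment itself is just rounding each output buffer size up to a multiple of whichever block size wins that benchmark; a minimal sketch, assuming the 4K value from the figure above:

```java
public final class SizeAlignment
{
    // 4K, the alignment shown in the figure above; 1K or 2K would work the same way
    private static final int ALIGNMENT = 4 * 1024;

    private SizeAlignment() {}

    // Rounds a requested size up to the next multiple of ALIGNMENT so that freed
    // chunks come in a small set of uniform sizes glibc can reuse, reducing
    // fragmentation. For example, alignedSize(4097) returns 8192.
    public static int alignedSize(int size)
    {
        return ((size + ALIGNMENT - 1) / ALIGNMENT) * ALIGNMENT;
    }
}
```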

carterkozak added a commit to carterkozak/sls-packaging that referenced this issue Jun 13, 2018
uschi2000 pushed a commit to palantir/sls-packaging that referenced this issue Jun 13, 2018