PooledByteBufAllocator is a performance bottleneck #2264

Closed
normanmaurer opened this issue Feb 26, 2014 · 12 comments
@normanmaurer
Member

During massive load tests I noticed that a lot of time is spent on the synchronization in PoolArena. We need to try to minimize this, as it is a pretty big performance bottleneck:

[Screenshot: profiler output, 2014-02-26 13:11:11]

Maybe we can do something smart with lock-free algorithms, or have a dedicated thread which frees the buffers.
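
To make the bottleneck concrete, here is a minimal, self-contained sketch (illustrative names only, not Netty's code) of what happens when every allocating thread has to enter the same synchronized arena: the single monitor serializes all threads, which is exactly the kind of hot lock a profiler flags.

```java
import java.util.concurrent.TimeUnit;

// Minimal sketch -- illustrative names, not Netty's code. Every allocating
// thread funnels through the same monitor, so N threads are serialized on
// one lock, which is what shows up as a hotspot under load.
final class ContendedArena {

    // Stand-in for the real pool bookkeeping guarded by the arena lock.
    synchronized byte[] allocate(int capacity) {
        return new byte[capacity];
    }

    public static void main(String[] args) throws InterruptedException {
        ContendedArena arena = new ContendedArena();
        Runnable worker = () -> {
            for (int i = 0; i < 1_000_000; i++) {
                arena.allocate(256);
            }
        };
        Thread[] threads = new Thread[Runtime.getRuntime().availableProcessors()];
        long start = System.nanoTime();
        for (int i = 0; i < threads.length; i++) {
            threads[i] = new Thread(worker);
            threads[i].start();
        }
        for (Thread t : threads) {
            t.join();
        }
        System.out.printf("%d threads took %d ms%n",
                threads.length, TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start));
    }
}
```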

@normanmaurer normanmaurer added this to the 4.0.18.Final milestone Feb 26, 2014
@normanmaurer
Member Author

Looking into it ...

@trustin
Member

trustin commented Feb 26, 2014

PooledByteBufAllocator does not currently implement the thread-local cache, and that seems to be causing the contention. Take a look at PoolArena.allocate(PoolThreadCache cache, PooledByteBuf<T> buf, final int reqCapacity).
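
For readers following along, here is a hedged sketch of the idea behind that method (the signature above is the real one; everything below is illustrative, not Netty's implementation): serve the request from a per-thread cache first, and take the shared arena lock only on a cache miss.

```java
import java.util.ArrayDeque;

// Illustrative sketch only -- not Netty's implementation. The idea behind
// PoolArena.allocate(PoolThreadCache cache, PooledByteBuf<T> buf, int reqCapacity)
// is to satisfy the request from a per-thread cache first and to take the
// shared arena lock only on a cache miss.
final class SketchArena {

    static final class ThreadCache {
        final ArrayDeque<byte[]> freed = new ArrayDeque<>();
    }

    byte[] allocate(ThreadCache cache, int reqCapacity) {
        byte[] cached = cache.freed.peek();
        if (cached != null && cached.length >= reqCapacity) {
            return cache.freed.poll();     // fast path: no lock taken
        }
        synchronized (this) {              // contended slow path
            return new byte[reqCapacity];  // stand-in for pool bookkeeping
        }
    }

    void free(ThreadCache cache, byte[] buf) {
        cache.freed.offer(buf);            // recycle into the owning thread's cache
    }
}
```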

@normanmaurer
Member Author

Just ran the benchmark again using -Dio.netty.allocator.numDirectArenas=cores

And got this:
[Screenshot: benchmark results, 2014-02-26 20:27:11]
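
For anyone reproducing this, a short sketch of the two ways to pin the arena count in Netty 4.x: the system property used in the benchmark above, or the allocator's public constructor. The buffer usage at the end is purely illustrative.

```java
import io.netty.buffer.ByteBuf;
import io.netty.buffer.PooledByteBufAllocator;

// Sketch of two ways to control the arena count in Netty 4.x: the
// -Dio.netty.allocator.numDirectArenas system property used in the
// benchmark above (it must be set before PooledByteBufAllocator is
// loaded), or the public constructor. pageSize 8192 and maxOrder 11
// are the 4.x defaults (8 KiB pages, 16 MiB chunks).
public final class ArenaConfigExample {
    public static void main(String[] args) {
        int nArenas = Runtime.getRuntime().availableProcessors() * 2;

        PooledByteBufAllocator alloc =
                new PooledByteBufAllocator(true /* preferDirect */, nArenas, nArenas, 8192, 11);

        ByteBuf buf = alloc.directBuffer(256);
        try {
            buf.writeInt(42);
        } finally {
            buf.release(); // pooled buffers must be released back to the arena
        }
    }
}
```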

@normanmaurer
Member Author

@trustin could you be more specific about what you mean by "PooledByteBufAllocator does not currently implement the thread-local cache, and that seems to be causing the contention"?

@normanmaurer
Member Author

@trustin btw, do you remember why we use cpu-cores as the default number of arenas and not 2 * cores?


@trustin
Member

trustin commented Feb 26, 2014

Maybe it's a good idea to have as many arenas as the default number of I/O threads, given the contention you observed.

By default, the number of arenas is determined by the amount of available memory, so you may end up with fewer arenas on systems with less memory.

A more fundamental fix is to keep a thread-local cache that stores recently released buffers for a while. The jemalloc paper describes this in detail. Unfortunately, our allocator does not implement the thread-local cache yet.
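
A hedged sketch of that jemalloc-style thread-local cache ("tcache") idea, with made-up names and bounds: recently released buffers park in a bounded per-thread queue for a while, so most allocate/free pairs never touch the shared arena at all.

```java
import java.util.ArrayDeque;

// Hedged sketch of the jemalloc-style thread-local cache ("tcache") --
// names and the bound are illustrative, not Netty's. Recently released
// buffers park in a bounded per-thread queue for a while, so most
// allocate/free pairs never touch the shared arena at all.
final class TcacheSketch {
    private static final int MAX_CACHED = 512; // illustrative bound

    private final ThreadLocal<ArrayDeque<byte[]>> tcache =
            ThreadLocal.withInitial(ArrayDeque::new);

    byte[] allocate(int reqCapacity) {
        ArrayDeque<byte[]> cache = tcache.get();
        byte[] buf = cache.peek();
        if (buf != null && buf.length >= reqCapacity) {
            return cache.poll();               // served locally, no synchronization
        }
        return allocateFromArena(reqCapacity); // slow path: shared arena
    }

    void free(byte[] buf) {
        ArrayDeque<byte[]> cache = tcache.get();
        if (cache.size() < MAX_CACHED) {
            cache.offer(buf);                  // keep it around "for a while"
        } else {
            releaseToArena(buf);               // cache full: hand back to the arena
        }
    }

    private synchronized byte[] allocateFromArena(int reqCapacity) {
        return new byte[reqCapacity];          // stand-in for the arena's free lists
    }

    private synchronized void releaseToArena(byte[] buf) {
        // a real allocator would return the memory to the arena here
    }
}
```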

@normanmaurer
Member Author

@trustin yeah, I think we should use Runtime.getRuntime().availableProcessors() * 2 for the number of arenas, not Runtime.getRuntime().availableProcessors(). OK with having me change it?

Besides this, I will look into the thread-local cache. Currently reading the jemalloc paper ;)

@trustin
Member

trustin commented Feb 27, 2014

Re: why it's numCores - it's purely CPU bound, but I had to consider the number of I/O threads, too. The default number of I/O threads can be overridden by a system property, so we need to respect that. The problem is that netty-buffer does not depend on netty-transport (and it should not, of course), so we are going to have some code duplication in netty-buffer, or we need to move all system-property-related code to netty-common for cleanliness.

@normanmaurer
Member Author

Yeah, but I think we should use the same default as in transport, which is 2 * cores.


@trustin
Member

trustin commented Feb 28, 2014

The default can be overridden by a system property, and we have to respect it.

@normanmaurer
Member Author

Related to #808

@normanmaurer normanmaurer self-assigned this Feb 28, 2014
normanmaurer pushed a commit that referenced this issue Mar 4, 2014
… This fixes [#2264] and [#808].

This implementation uses roughly the same techniques as outlined in the jemalloc paper and the jemalloc blog post https://www.facebook.com/notes/facebook-engineering/scalable-memory-allocation-using-jemalloc/480222803919.

At the moment we only cache for "known" Threads and not for others, to keep the overhead minimal when we need to free up unused buffers in the cache and to free cached buffers once the Thread completes. Here we use multi-level caches for tiny, small and normal allocations. Huge allocations are not cached at all, to keep the memory usage at a sane level. All the different cache configurations can be adjusted via system properties, or via the constructor directly where it makes sense.
normanmaurer pushed a commit that referenced this issue Mar 12, 2014
… This fixes [#2264] and [#808].

Motivation:
Remove the synchronization bottleneck in PoolArena and so speed things up.

Modifications:

This implementation uses roughly the same techniques as outlined in the jemalloc paper and the jemalloc blog post https://www.facebook.com/notes/facebook-engineering/scalable-memory-allocation-using-jemalloc/480222803919.

At the moment we only cache for "known" Threads (those that power EventExecutors) and not for others, to keep the overhead minimal when we need to free up unused buffers in the cache and to free cached buffers once the Thread completes. Here we use multi-level caches for tiny, small and normal allocations. Huge allocations are not cached at all, to keep the memory usage at a sane level. All the different cache configurations can be adjusted via system properties, or via the constructor directly where it makes sense.

Result:
Less contention, as most allocations can be served by the cache itself.
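
To illustrate the multi-level layout the commit message describes, here is a hedged sketch (not the actual PoolThreadCache code; the size-class thresholds are made up): one instance would live per "known" Thread, with separate caches for tiny, small and normal allocations, while huge requests bypass caching entirely.

```java
import java.util.ArrayDeque;
import java.util.Queue;

// Illustrative sketch of the multi-level layout the commit message
// describes -- not the actual PoolThreadCache code, and the size-class
// thresholds are made up. One instance would live per "known" Thread;
// huge allocations bypass caching entirely to keep memory usage sane.
final class MultiLevelCacheSketch {
    private static final int TINY   = 512;              // illustrative thresholds
    private static final int SMALL  = 8 * 1024;
    private static final int NORMAL = 16 * 1024 * 1024;

    private final Queue<byte[]> tinyCache   = new ArrayDeque<>();
    private final Queue<byte[]> smallCache  = new ArrayDeque<>();
    private final Queue<byte[]> normalCache = new ArrayDeque<>();

    byte[] allocate(int reqCapacity) {
        Queue<byte[]> cache = cacheFor(reqCapacity);
        if (cache != null) {
            byte[] buf = cache.peek();
            if (buf != null && buf.length >= reqCapacity) {
                return cache.poll();      // cache hit for this size class
            }
        }
        return new byte[reqCapacity];     // arena / huge path, never cached
    }

    void free(byte[] buf) {
        Queue<byte[]> cache = cacheFor(buf.length);
        if (cache != null) {
            cache.offer(buf);             // huge buffers fall through uncached
        }
    }

    // One cache per size class; null means "huge", which is never cached.
    private Queue<byte[]> cacheFor(int capacity) {
        if (capacity <= TINY) {
            return tinyCache;
        }
        if (capacity <= SMALL) {
            return smallCache;
        }
        if (capacity <= NORMAL) {
            return normalCache;
        }
        return null;
    }
}
```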
normanmaurer pushed a commit that referenced this issue Mar 14, 2014
normanmaurer pushed a commit that referenced this issue Mar 20, 2014
normanmaurer pushed a commit that referenced this issue Mar 20, 2014
normanmaurer pushed a commit that referenced this issue Mar 20, 2014
normanmaurer pushed a commit that referenced this issue Mar 20, 2014
@normanmaurer
Member Author

Fixed via #2284

pulllock pushed a commit to pulllock/netty that referenced this issue Oct 19, 2023