PooledByteBufAllocator is a performance bottleneck #2264

Closed
normanmaurer opened this issue Feb 26, 2014 · 12 comments
@normanmaurer
Member

During massive load tests I noticed that a lot of time is spent on the synchronization in PoolArena. We need to try to minimize this, as it is a pretty big performance bottleneck:

[Screenshot: profiler output, 2014-02-26 13:11:11]

Maybe we can do something smart with lock-free algorithms, or have a dedicated thread which frees the buffers.
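
To make the bottleneck concrete, here is a minimal, self-contained sketch (illustrative names only, not Netty's code) of what happens when every allocating thread has to enter the same synchronized arena: the single monitor serializes all threads, which is exactly the kind of hot lock a profiler flags.

```java
import java.util.concurrent.TimeUnit;

// Minimal sketch -- illustrative names, not Netty's code. Every allocating
// thread funnels through the same monitor, so N threads are serialized on
// one lock, which is what shows up as a hotspot under load.
final class ContendedArena {

    // Stand-in for the real pool bookkeeping guarded by the arena lock.
    synchronized byte[] allocate(int capacity) {
        return new byte[capacity];
    }

    public static void main(String[] args) throws InterruptedException {
        ContendedArena arena = new ContendedArena();
        Runnable worker = () -> {
            for (int i = 0; i < 1_000_000; i++) {
                arena.allocate(256);
            }
        };
        Thread[] threads = new Thread[Runtime.getRuntime().availableProcessors()];
        long start = System.nanoTime();
        for (int i = 0; i < threads.length; i++) {
            threads[i] = new Thread(worker);
            threads[i].start();
        }
        for (Thread t : threads) {
            t.join();
        }
        System.out.printf("%d threads took %d ms%n",
                threads.length, TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start));
    }
}
```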

@normanmaurer normanmaurer added this to the 4.0.18.Final milestone Feb 26, 2014
@normanmaurer
Member Author

Looking into it ...

@trustin
Member

trustin commented Feb 26, 2014

PooledByteBufAllocator does not currently implement the thread-local cache, and that seems to be causing the contention. Take a look at PoolArena.allocate(PoolThreadCache cache, PooledByteBuf<T> buf, final int reqCapacity).
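
For readers following along, here is a hedged sketch of the idea behind that method (the signature above is the real one; everything below is illustrative, not Netty's implementation): serve the request from a per-thread cache first, and take the shared arena lock only on a cache miss.

```java
import java.util.ArrayDeque;

// Illustrative sketch only -- not Netty's implementation. The idea behind
// PoolArena.allocate(PoolThreadCache cache, PooledByteBuf<T> buf, int reqCapacity)
// is to satisfy the request from a per-thread cache first and to take the
// shared arena lock only on a cache miss.
final class SketchArena {

    static final class ThreadCache {
        final ArrayDeque<byte[]> freed = new ArrayDeque<>();
    }

    byte[] allocate(ThreadCache cache, int reqCapacity) {
        byte[] cached = cache.freed.peek();
        if (cached != null && cached.length >= reqCapacity) {
            return cache.freed.poll();     // fast path: no lock taken
        }
        synchronized (this) {              // contended slow path
            return new byte[reqCapacity];  // stand-in for pool bookkeeping
        }
    }

    void free(ThreadCache cache, byte[] buf) {
        cache.freed.offer(buf);            // recycle into the owning thread's cache
    }
}
```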

@normanmaurer
Member Author

Just ran the benchmark again using -Dio.netty.allocator.numDirectArenas=cores

And got this:
[Screenshot: benchmark results, 2014-02-26 20:27:11]
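
For anyone reproducing this, a short sketch of the two ways to pin the arena count in Netty 4.x: the system property used in the benchmark above, or the allocator's public constructor. The buffer usage at the end is purely illustrative.

```java
import io.netty.buffer.ByteBuf;
import io.netty.buffer.PooledByteBufAllocator;

// Sketch of two ways to control the arena count in Netty 4.x: the
// -Dio.netty.allocator.numDirectArenas system property used in the
// benchmark above (it must be set before PooledByteBufAllocator is
// loaded), or the public constructor. pageSize 8192 and maxOrder 11
// are the 4.x defaults (8 KiB pages, 16 MiB chunks).
public final class ArenaConfigExample {
    public static void main(String[] args) {
        int nArenas = Runtime.getRuntime().availableProcessors() * 2;

        PooledByteBufAllocator alloc =
                new PooledByteBufAllocator(true /* preferDirect */, nArenas, nArenas, 8192, 11);

        ByteBuf buf = alloc.directBuffer(256);
        try {
            buf.writeInt(42);
        } finally {
            buf.release(); // pooled buffers must be released back to the arena
        }
    }
}
```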

@normanmaurer
Member Author

@trustin could you be more specific about what you mean by "PooledByteBufAllocator does not currently implement the thread-local cache, and that seems to be causing the contention"?

@normanmaurer
Member Author

@trustin btw, do you remember why we use cpu-cores as the default number of arenas and not 2 * cores?


@trustin
Member

trustin commented Feb 26, 2014

Maybe it's a good idea to have as many arenas as the default number of I/O threads, given the contention you observed.

By default, the number of arenas is determined by the amount of available memory, so you may end up with fewer arenas on systems with less memory.

A more fundamental fix is to keep a thread-local cache that stores recently released buffers for a while. The jemalloc paper describes this in detail. Unfortunately, our allocator does not implement the thread-local cache yet.
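
A hedged sketch of that jemalloc-style thread-local cache ("tcache") idea, with made-up names and bounds: recently released buffers park in a bounded per-thread queue for a while, so most allocate/free pairs never touch the shared arena at all.

```java
import java.util.ArrayDeque;

// Hedged sketch of the jemalloc-style thread-local cache ("tcache") --
// names and the bound are illustrative, not Netty's. Recently released
// buffers park in a bounded per-thread queue for a while, so most
// allocate/free pairs never touch the shared arena at all.
final class TcacheSketch {
    private static final int MAX_CACHED = 512; // illustrative bound

    private final ThreadLocal<ArrayDeque<byte[]>> tcache =
            ThreadLocal.withInitial(ArrayDeque::new);

    byte[] allocate(int reqCapacity) {
        ArrayDeque<byte[]> cache = tcache.get();
        byte[] buf = cache.peek();
        if (buf != null && buf.length >= reqCapacity) {
            return cache.poll();               // served locally, no synchronization
        }
        return allocateFromArena(reqCapacity); // slow path: shared arena
    }

    void free(byte[] buf) {
        ArrayDeque<byte[]> cache = tcache.get();
        if (cache.size() < MAX_CACHED) {
            cache.offer(buf);                  // keep it around "for a while"
        } else {
            releaseToArena(buf);               // cache full: hand back to the arena
        }
    }

    private synchronized byte[] allocateFromArena(int reqCapacity) {
        return new byte[reqCapacity];          // stand-in for the arena's free lists
    }

    private synchronized void releaseToArena(byte[] buf) {
        // a real allocator would return the memory to the arena here
    }
}
```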

@normanmaurer
Member Author

@trustin yeah, I think we should use Runtime.getRuntime().availableProcessors() * 2 for the number of arenas, not Runtime.getRuntime().availableProcessors(). OK with having me change it?

Besides this, I will look into the thread-local cache. Currently reading the jemalloc paper ;)

@trustin
Member

trustin commented Feb 27, 2014

Re: why it's numCores - it's purely CPU bound, but I had to consider the number of I/O threads, too. The default number of I/O threads can be overridden by a system property, so we need to respect that. The problem is that netty-buffer does not depend on netty-transport (and it should not, of course), so we are going to have some code duplication in netty-buffer, or we need to move all system-property-related code to netty-common for cleanliness.

@normanmaurer
Member Author

Yeah, but I think we should use the same default as in transport, which is 2 * cores.


@trustin
Member

trustin commented Feb 28, 2014

The default can be overridden by a system property, and we have to respect it.

@normanmaurer
Member Author

Related to #808

@normanmaurer normanmaurer self-assigned this Feb 28, 2014
normanmaurer pushed a commit that referenced this issue Mar 4, 2014
… This fixes [#2264] and [#808].

This implementation uses roughly the same techniques as outlined in the jemalloc paper and the jemalloc blog post https://www.facebook.com/notes/facebook-engineering/scalable-memory-allocation-using-jemalloc/480222803919.

At the moment we only cache for "known" Threads and not for others, to keep the overhead minimal when we need to free up unused buffers in the cache and to free cached buffers once the Thread completes. Here we use multi-level caches for tiny, small and normal allocations. Huge allocations are not cached at all, to keep the memory usage at a sane level. All the different cache configurations can be adjusted via system properties, or via the constructor directly where it makes sense.
normanmaurer pushed a commit that referenced this issue Mar 12, 2014
… This fixes [#2264] and [#808].

Motivation:
Remove the synchronization bottleneck in PoolArena and so speed things up.

Modifications:

This implementation uses roughly the same techniques as outlined in the jemalloc paper and the jemalloc blog post https://www.facebook.com/notes/facebook-engineering/scalable-memory-allocation-using-jemalloc/480222803919.

At the moment we only cache for "known" Threads (those that power EventExecutors) and not for others, to keep the overhead minimal when we need to free up unused buffers in the cache and to free cached buffers once the Thread completes. Here we use multi-level caches for tiny, small and normal allocations. Huge allocations are not cached at all, to keep the memory usage at a sane level. All the different cache configurations can be adjusted via system properties, or via the constructor directly where it makes sense.

Result:
Less contention, as most allocations can be served by the cache itself.
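
To illustrate the multi-level layout the commit message describes, here is a hedged sketch (not the actual PoolThreadCache code; the size-class thresholds are made up): one instance would live per "known" Thread, with separate caches for tiny, small and normal allocations, while huge requests bypass caching entirely.

```java
import java.util.ArrayDeque;
import java.util.Queue;

// Illustrative sketch of the multi-level layout the commit message
// describes -- not the actual PoolThreadCache code, and the size-class
// thresholds are made up. One instance would live per "known" Thread;
// huge allocations bypass caching entirely to keep memory usage sane.
final class MultiLevelCacheSketch {
    private static final int TINY   = 512;              // illustrative thresholds
    private static final int SMALL  = 8 * 1024;
    private static final int NORMAL = 16 * 1024 * 1024;

    private final Queue<byte[]> tinyCache   = new ArrayDeque<>();
    private final Queue<byte[]> smallCache  = new ArrayDeque<>();
    private final Queue<byte[]> normalCache = new ArrayDeque<>();

    byte[] allocate(int reqCapacity) {
        Queue<byte[]> cache = cacheFor(reqCapacity);
        if (cache != null) {
            byte[] buf = cache.peek();
            if (buf != null && buf.length >= reqCapacity) {
                return cache.poll();      // cache hit for this size class
            }
        }
        return new byte[reqCapacity];     // arena / huge path, never cached
    }

    void free(byte[] buf) {
        Queue<byte[]> cache = cacheFor(buf.length);
        if (cache != null) {
            cache.offer(buf);             // huge buffers fall through uncached
        }
    }

    // One cache per size class; null means "huge", which is never cached.
    private Queue<byte[]> cacheFor(int capacity) {
        if (capacity <= TINY) {
            return tinyCache;
        }
        if (capacity <= SMALL) {
            return smallCache;
        }
        if (capacity <= NORMAL) {
            return normalCache;
        }
        return null;
    }
}
```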
normanmaurer pushed a commit that referenced this issue Mar 14, 2014
normanmaurer pushed a commit that referenced this issue Mar 20, 2014
normanmaurer pushed a commit that referenced this issue Mar 20, 2014
normanmaurer pushed a commit that referenced this issue Mar 20, 2014
normanmaurer pushed a commit that referenced this issue Mar 20, 2014
@normanmaurer
Member Author

Fixed via #2284

pulllock pushed a commit to pulllock/netty that referenced this issue Oct 19, 2023