Provide a way to cache the internal nioBuffer of the PooledByteBuffer… #8603

Merged
normanmaurer merged 5 commits into 4.1 from cache_buffer on Dec 4, 2018

Conversation

normanmaurer (Member) commented Nov 27, 2018

… to reduce GC. (#8603)

Motivation:

Often a temporary ByteBuffer is used which can be cached to reduce the GC pressure.

Modifications:

Cache the ByteBuffer in the PoolThreadCache as well.

Result:

Less GC.
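
A rough sketch of the mechanism being discussed (simplified, with invented names, not the actual Netty classes): a pooled buffer lazily creates a temporary java.nio.ByteBuffer for JDK socket calls, and the idea of the PR is to keep that duplicate around with the cached allocation instead of letting it become garbage on every allocate/release cycle.

import java.nio.ByteBuffer;

// Simplified illustration only; class and field names are invented for this sketch.
final class PooledBufSketch {
    private final ByteBuffer memory; // slice of the pooled chunk memory
    private ByteBuffer tmpNioBuf;    // duplicate handed to JDK socket read/write calls

    PooledBufSketch(ByteBuffer memory, ByteBuffer cachedNioBuf) {
        this.memory = memory;
        this.tmpNioBuf = cachedNioBuf; // may be non-null when recovered from the cache
    }

    ByteBuffer internalNioBuffer() {
        ByteBuffer buf = tmpNioBuf;
        if (buf == null) {
            // Without caching, this duplicate is re-created for every pooled buffer
            // and becomes garbage when the buffer is released.
            buf = tmpNioBuf = memory.duplicate();
        }
        return buf;
    }

    ByteBuffer recoverNioBufferOnRelease() {
        ByteBuffer buf = tmpNioBuf;
        tmpNioBuf = null;
        return buf; // handed back to the cache instead of being dropped
    }
}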

njhill (Contributor) commented Nov 28, 2018

@normanmaurer apologies, I didn't review carefully enough the first time and made some assumptions which I don't think are valid after looking closer.

I don't think this is thread-safe in its current form - ArrayDeque isn't, and the additions/removals could come from arbitrary threads without synchronization, e.g. deallocate() could run in any user thread releasing the buffer. It doesn't look like the init0 method is synchronized in general either.

I think it may work to add an nio buffer field to PoolThreadCache.Entry instead, but I'm not sure if there are other implications of doing that.

normanmaurer (Member, Author) commented Nov 28, 2018

No worries... looking again today.

njhill (Contributor) commented Nov 28, 2018

I do think stashing it in PoolThreadCache.Entry instead of having a separate per-chunk cache would work; it looks like it would be easy to plumb through the alloc and dealloc paths. WDYT?
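
A rough sketch of what such an entry might look like (hypothetical names; the real PoolThreadCache.Entry differs): the per-thread cache entry already records which chunk/handle it holds, so the temporary ByteBuffer can simply ride along and be handed back on the next cache hit.

import java.nio.ByteBuffer;

// Hypothetical sketch of the proposal, not the actual Netty code.
final class CacheEntrySketch {
    Object chunk;          // stands in for the PoolChunk the allocation belongs to
    long handle;           // encodes where inside the chunk the cached allocation lives
    ByteBuffer nioBuffer;  // new field: cached duplicate, reused when the entry is reused
}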

normanmaurer (Member, Author) commented Nov 28, 2018

@njhill yep, agreed, that should work. I will take care of it.

normanmaurer force-pushed the cache_buffer branch from 6579ada to 812f87b on Nov 28, 2018

normanmaurer (Member, Author) commented Nov 28, 2018

@njhill done. PTAL again

normanmaurer requested a review from trustin on Nov 28, 2018

normanmaurer (Member, Author) commented Nov 28, 2018

Also @franz1981

normanmaurer added this to the 4.1.32.Final milestone on Nov 28, 2018

normanmaurer force-pushed the cache_buffer branch 2 times, most recently from d133607 to 7102bb4 on Nov 28, 2018

normanmaurer self-assigned this on Nov 28, 2018

njhill (Contributor) left a comment

@normanmaurer LGTM! Curious whether the allocations still look good in YourKit, and does it resolve the test issues?

Also, I wonder whether there might be some unintended CPU cache side effects from adding the new field. This isn't based on any insight, just a random thought, given that the thread chunk cache looks like a core part of the library.

@@ -387,6 +387,7 @@ void reallocate(PooledByteBuf<T> buf, int newCapacity, boolean freeOldMemory) {
}

PoolChunk<T> oldChunk = buf.chunk;
ByteBuffer olNioBuffer = buf.tmpNioBuf;

njhill (Contributor), Nov 28, 2018:

olNioBuffer -> oldNioBuffer?

normanmaurer force-pushed the cache_buffer branch from 7102bb4 to 49f4d46 on Nov 28, 2018

franz1981 (Contributor) commented Nov 28, 2018

@njhill Not sure if this is what you mean, but pre-PR:

io.netty.buffer.PoolThreadCache$MemoryRegionCache$Entry object internals:
 OFFSET  SIZE                            TYPE DESCRIPTION                               VALUE
      0    12                                 (object header)                           N/A
     12     4   io.netty.util.Recycler.Handle Entry.recyclerHandle                      N/A
     16     8                            long Entry.handle                              N/A
     24     4       io.netty.buffer.PoolChunk Entry.chunk                               N/A
     28     4                                 (loss due to the next object alignment)
Instance size: 32 bytes
Space losses: 0 bytes internal + 4 bytes external = 4 bytes total

With this PR:

io.netty.buffer.PoolThreadCache$MemoryRegionCache$Entry object internals:
 OFFSET  SIZE                            TYPE DESCRIPTION                               VALUE
      0    12                                 (object header)                           N/A
     12     4   io.netty.util.Recycler.Handle Entry.recyclerHandle                      N/A
     16     8                            long Entry.handle                              N/A
     24     4       io.netty.buffer.PoolChunk Entry.chunk                               N/A
     28     4             java.nio.ByteBuffer Entry.nioBuffer                           N/A
Instance size: 32 bytes
Space losses: 0 bytes internal + 0 bytes external = 0 bytes total

We're lucky that at least it won't make any difference to the memory footprint (with compressed oops).
My concern instead is about the effect on GCs that use card marking (G1 does SATB AFAIK): a benchmark that deliberately ages the NIO ByteBuffer into the old gen would probably help to check the effect on the allocation rate while the nioBuffer reference is copied from the Entry to the pooled ByteBuf.
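
For reference, layouts like the ones above can be reproduced with JOL; a minimal sketch, assuming the org.openjdk.jol:jol-core and netty-buffer dependencies are on the classpath:

import org.openjdk.jol.info.ClassLayout;

// Prints the field layout of the (package-private) Entry class, similar to the dumps above.
public class EntryLayoutDump {
    public static void main(String[] args) throws ClassNotFoundException {
        Class<?> entry =
                Class.forName("io.netty.buffer.PoolThreadCache$MemoryRegionCache$Entry");
        System.out.println(ClassLayout.parseClass(entry).toPrintable());
    }
}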

normanmaurer (Member, Author) commented Nov 28, 2018

@njhill the downside of caching the ByteBuffer in the PoolThreadCache is that by default we only cache buffers <= 32k there, which means the ByteBuffer will not be cached at all for bigger buffers. That is a bit unfortunate, as when pushed hard enough we may use bigger buffers to read from the socket.
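
A hedged illustration of the threshold being referred to; the property name io.netty.allocator.maxCachedBufferCapacity and its 32 KiB default are my assumption of the relevant knob, not something stated in this thread:

// Illustration only: buffers above the PoolThreadCache capacity limit bypass the
// per-thread cache, so their temporary nioBuffer cannot be cached there either.
// Assumed knob: the io.netty.allocator.maxCachedBufferCapacity system property
// (believed to default to 32 KiB); it must be set before Netty's allocator class
// is initialized.
public class CacheThresholdExample {
    public static void main(String[] args) {
        System.setProperty("io.netty.allocator.maxCachedBufferCapacity",
                String.valueOf(64 * 1024)); // e.g. raise the limit to 64 KiB
        // ... create the PooledByteBufAllocator / bootstrap Netty after this point ...
    }
}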

njhill (Contributor) commented Nov 28, 2018

@normanmaurer I had a go at adding more plumbing onto this change to handle those cases; let me know what you think: c29fa22 (minor cleanup would still be needed).

normanmaurer (Member, Author) commented Nov 28, 2018

@njhill yep, I think your commit looks mostly fine. Do you mind if I bring it over into this PR with some cleanup?

njhill (Contributor) commented Nov 28, 2018

@normanmaurer I don't mind at all, go for it

njhill (Contributor) commented Nov 28, 2018

The deque bits I copied from your original changes anyhow :)

trustin (Member) commented Nov 29, 2018

LGTM. Will review again once updated with @njhill's patch.

normanmaurer added some commits Nov 27, 2018

Provide a way to cache the internal nioBuffer of the PooledByteBuffer…
… to reduce GC. (#8593)

Motivation:

Often a temporary ByteBuffer is used which can be cached to reduce the GC pressure.

Modifications:

Cache the ByteBuffer in the PoolThreadCache as well.

Result:

Less GC.

normanmaurer force-pushed the cache_buffer branch from 49f4d46 to eb2c6ae on Nov 29, 2018

normanmaurer (Member, Author) commented Nov 29, 2018

@njhill @trustin @franz1981 PTAL again... I incorporated @njhill's changes, with the exception that I did not cache the ByteBuffer for PoolSubpage allocations. That's mainly because the way it was done was not safe, as we may allocate multiple times per sub-page. We could fix this by adding another Deque to the head of the PoolSubpage linked list and caching there, but I think we can just ignore it, as these allocations should be small enough to be cached in the PoolThreadCache and so have the ByteBuffer cached there anyway.

normanmaurer (Member, Author) commented Nov 29, 2018

[Screenshot] New allocation profile with the PR.

normanmaurer removed this from the 4.1.32.Final milestone on Nov 29, 2018

normanmaurer added this to the 4.1.33.Final milestone on Nov 29, 2018

njhill (Contributor) commented Nov 29, 2018

Thanks @normanmaurer, looks great. I agree this seems good enough not to warrant the additional complexity/overhead of subpage-scoped caches, although without closer analysis/profiling I can't judge that for sure.

How about this small improvement/simplification to the latest version though:

  • revert the changes to PoolChunk.allocateSubpage and PoolChunk.allocateRun (so they take only int and return long again)
  • move the initBuf(...) call into the common parent PoolChunk.allocate method, so that it polls the cache for both cases (as is currently done only in allocateRun)

It's "free" and may result in a few more cache hits, WDYT?
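
A rough sketch of the suggested shape (simplified, with placeholder types and bodies; the cachedNioBuffers field name follows the PR discussion, but this is not the actual Netty code):

import java.nio.ByteBuffer;
import java.util.ArrayDeque;
import java.util.Deque;

// Sketch of PoolChunk.allocate after the suggested refactor: allocateRun/allocateSubpage
// keep their simple int -> long signatures, and the ByteBuffer cache is polled once in
// the common path so both cases benefit.
final class PoolChunkSketch {
    private final Deque<ByteBuffer> cachedNioBuffers = new ArrayDeque<ByteBuffer>();
    private final int pageSize = 8192;
    private final int subpageOverflowMask = ~(pageSize - 1);

    boolean allocate(Object buf, int reqCapacity, int normCapacity) {
        final long handle;
        if ((normCapacity & subpageOverflowMask) != 0) { // >= pageSize
            handle = allocateRun(normCapacity);
        } else {
            handle = allocateSubpage(normCapacity);
        }
        if (handle < 0) {
            return false;
        }
        // Poll the per-chunk ByteBuffer cache here, for runs and subpages alike,
        // instead of only inside allocateRun.
        ByteBuffer nioBuffer = cachedNioBuffers.pollLast();
        initBuf(buf, nioBuffer, handle, reqCapacity);
        return true;
    }

    private long allocateRun(int normCapacity)     { return 1; } // placeholder
    private long allocateSubpage(int normCapacity) { return 1; } // placeholder
    private void initBuf(Object buf, ByteBuffer nioBuffer, long handle, int reqCapacity) { }
}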

normanmaurer (Member, Author) commented Nov 29, 2018

@njhill good feedback... Just pushed a new commit which should address it. PTAL again

njhill (Contributor) approved these changes Nov 29, 2018 and left a comment

Thanks @normanmaurer, LGTM... just one more minor comment.

}

if (handle > 0) {

njhill (Contributor), Nov 29, 2018:

This was effectively >= 0 before; not sure if it makes a difference. It might also read slightly simpler if inverted, i.e. with initBuf outside the if block.

normanmaurer (Member, Author), Nov 29, 2018:

fixed

franz1981 (Contributor) left a comment

Today I'm at a conference, guys, hence I'll only be able to take a closer look tomorrow :P. I would just use 1023 for DEFAULT_MAX_CACHED_BYTEBUFFERS_PER_CHUNK to avoid the Deque's internal array growing to 2048 when it hits 1024. But it is really a minor thing...
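
For anyone curious about the 1023 vs. 1024 remark: on JDK 8, ArrayDeque sizes its backing array to a power of two strictly greater than the requested capacity. A small sketch replicating that rounding logic (JDK 8 behavior; later JDKs size ArrayDeque differently):

// Replicates the capacity rounding of JDK 8's java.util.ArrayDeque#allocateElements(int)
// to show why a requested capacity of 1024 ends up with a 2048-element array while
// 1023 stays at 1024.
public class DequeSizing {
    static int backingArrayLength(int numElements) {
        int cap = 8; // MIN_INITIAL_CAPACITY
        if (numElements >= cap) {
            cap = numElements;
            cap |= cap >>> 1;
            cap |= cap >>> 2;
            cap |= cap >>> 4;
            cap |= cap >>> 8;
            cap |= cap >>> 16;
            cap++;
        }
        return cap;
    }

    public static void main(String[] args) {
        System.out.println(backingArrayLength(1024)); // 2048
        System.out.println(backingArrayLength(1023)); // 1024
    }
}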

njhill (Contributor) approved these changes Nov 29, 2018 and left a comment

Still looks good, but I just noticed one more small cleanup needed (sorry!)

@@ -365,33 +385,41 @@ void free(long handle) {
PoolSubpage<T> head = arena.findSubpagePoolHead(subpage.elemSize);
synchronized (head) {
if (subpage.free(head, bitmapIdx & 0x3FFFFFFF)) {
// this may be overwriting with same thing, in theory this should never
// overwrite non-null with null.

njhill (Contributor), Nov 29, 2018:

This comment is N/A now and should be removed.

normanmaurer (Member, Author) commented Nov 30, 2018

@njhill @franz1981 thanks... fixed :)

normanmaurer (Member, Author) commented Dec 3, 2018

@franz1981 any more comments?

franz1981 (Contributor) commented Dec 3, 2018

@normanmaurer Looks fine to me 👍

normanmaurer merged commit 2680357 into 4.1 on Dec 4, 2018

3 of 4 checks passed:

pull request validation (centos6-java9): Build finished.
pull request validation (centos6-java11): Build finished.
pull request validation (centos6-java12): Build finished.
pull request validation (centos6-java8): Build finished.

normanmaurer deleted the cache_buffer branch on Dec 4, 2018

normanmaurer added a commit that referenced this pull request Dec 4, 2018

Provide a way to cache the internal nioBuffer of the PooledByteBuffer… (#8603)


Motivation:

Often a temporary ByteBuffer is used which can be cached to reduce the GC pressure.

Modifications:

Cache the ByteBuffer in the PoolThreadCache as well.

Result:

Less GC.

johnou (Contributor) commented Dec 19, 2018

@normanmaurer I believe this changeset substantially increased our application's memory footprint, which resulted in low memory on some boxes and the kernel OOM killer axing Java processes 💥 I will see about deploying a canary build when the holiday period is over, but wanted to give you a heads-up in the meantime.

normanmaurer (Member, Author) commented Dec 19, 2018

@johnou I would be surprised if this were the case, as it only caches the "container" around the memory.

Are you running the snapshot?

johnou (Contributor) commented Dec 19, 2018

@normanmaurer oh crap, "removed this from the 4.1.32.Final milestone", so it must be something else... In the meantime I'm downgrading to 4.1.30.Final / 2.0.17.Final (should hold for Christmas 🤞). I will create a new issue once I get more information.

[Screenshot attached]

normanmaurer (Member, Author) commented Dec 19, 2018

@johnou let me know once you have more info (like a heap dump etc.)

johnou (Contributor) commented Dec 19, 2018

@normanmaurer if it's the PooledByteBufAllocator that has simply cached too much, that won't show up in a heap dump because it's off-heap, right? I just added a new MBean to export ByteBufAllocatorMetric values.
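
A small sketch of the kind of numbers such an export can surface (not johnou's actual MBean; assumes the allocator metric API available in Netty 4.1.x):

import io.netty.buffer.PooledByteBufAllocator;
import io.netty.buffer.PooledByteBufAllocatorMetric;

// Prints the pooled allocator's own bookkeeping, including off-heap usage that a
// heap dump cannot show directly. Requires netty-buffer on the classpath.
public class AllocatorMetricsDump {
    public static void main(String[] args) {
        PooledByteBufAllocatorMetric metric = PooledByteBufAllocator.DEFAULT.metric();
        System.out.println("used direct memory:  " + metric.usedDirectMemory());
        System.out.println("used heap memory:    " + metric.usedHeapMemory());
        System.out.println("direct arenas:       " + metric.numDirectArenas());
        System.out.println("thread-local caches: " + metric.numThreadLocalCaches());
    }
}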

normanmaurer (Member, Author) commented Dec 19, 2018

@johnou while this is true, it may still give some hints, as you will see the "container" in the heap dump.
