
AbstractReferenceCountedByteBuf can save using a volatile set on constructor #12288

Merged · 2 commits · Apr 14, 2022

Conversation

franz1981
Contributor

Motivation:

AbstractReferenceCountedByteBuf can be made faster by ensuring proper safe initialization instead of a volatile set on construction

Modification:

Use Unsafe and a plain store instead of a volatile set

Result:

Faster AbstractReferenceCountedByteBuf allocation
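The pattern can be sketched as follows. This is a simplified illustration, not Netty's actual code: `VarHandle.releaseFence()` (JDK 9+) stands in for the `Unsafe.storeFence()` call the PR uses, and the even-valued refCnt encoding is assumed from Netty's `ReferenceCountUpdater`:

```java
import java.lang.invoke.VarHandle;

class RefCountedBuf {
    // Plain (non-volatile) field; later updates go through atomic/ordered access.
    private int refCnt;

    RefCountedBuf() {
        refCnt = 2;               // plain store: no barrier emitted here
        VarHandle.releaseFence(); // store-store fence safely publishes the
                                  // initial value; compiles to nothing on x86
    }

    // Netty encodes a "real" count of 1 as the raw value 2 (even values are live).
    int refCnt() {
        return refCnt >>> 1;
    }
}
```

The point is that a release (store-store) fence is sufficient for safe publication out of a constructor, so the full barrier implied by a volatile store is avoidable on the allocation hot path.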

@franz1981 (Contributor, Author) commented Apr 6, 2022

I still have to run benchmarks (that's why it's a draft) and need to add a few checks to be sure the fence method is present on Unsafe.
This should benefit any buffer allocation.

@franz1981

it's similar to #12286

@franz1981 franz1981 added improvement discussion benchmark Affect the JMH benchmarks labels Apr 6, 2022
@franz1981 commented Apr 6, 2022

Results with delay = 0 on my machine are

4.1:
Benchmark                                                          (delay)  Mode  Cnt   Score   Error  Units
AbstractReferenceCountedByteBufBenchmark.createUseAndRelease             0  avgt    8  54.772 ± 0.445  ns/op

this PR:

Benchmark                                                          (delay)  Mode  Cnt   Score   Error  Units
AbstractReferenceCountedByteBufBenchmark.createUseAndRelease             0  avgt    8  47.956 ± 0.515  ns/op

There's an improvement, but not a huge one.

@franz1981 commented Apr 6, 2022

Relevant assembly parts are...
4.1:

  0.03%  │  0x00007f704d2aab23: movl   $0x2,0x20(%rbp)
  0.03%  │  0x00007f704d2aab2a: lock addl $0x0,(%rsp)     ;*putfield refCnt
         │                                                ; - io.netty.buffer.AbstractReferenceCountedByteBuf::<init>@12 (line 46)
         │                                                ; - io.netty.buffer.UnpooledHeapByteBuf::<init>@2 (line 51)
         │                                                ; - io.netty.buffer.UnpooledUnsafeHeapByteBuf::<init>@4 (line 34)
         │                                                ; - io.netty.buffer.UnpooledByteBufAllocator$InstrumentedUnpooledUnsafeHeapByteBuf::<init>@4 (line 139)
         │                                                ; - io.netty.buffer.UnpooledByteBufAllocator::newHeapBuffer@13 (line 82)
         │                                                ; - io.netty.buffer.AbstractByteBufAllocator::heapBuffer@21 (line 169)
         │                                                ; - io.netty.buffer.AbstractByteBufAllocator::heapBuffer@4 (line 160)
         │                                                ; - io.netty.buffer.Unpooled::buffer@4 (line 119)
         │                                                ; - io.netty.buffer.AbstractReferenceCountedByteBufBenchmark::createUseAndRelease@1 (line 66)
 15.16%  │  0x00007f704d2aab2f: movl   $0xe3932e95,0x24(%rbp)  ;   {oop(a 'io/netty/buffer/UnpooledByteBufAllocator')}

and, this PR:

  0.05%  │  0x00007f96552a7eeb: movl   $0x2,0x20(%rbp)    ;*invokevirtual storeFence
         │                                                ; - io.netty.util.internal.PlatformDependent0::storeFence@3 (line 583)
         │                                                ; - io.netty.util.internal.PlatformDependent::storeFence@0 (line 512)
         │                                                ; - io.netty.buffer.AbstractReferenceCountedByteBuf::<init>@37 (line 55)
         │                                                ; - io.netty.buffer.UnpooledHeapByteBuf::<init>@2 (line 51)
         │                                                ; - io.netty.buffer.UnpooledUnsafeHeapByteBuf::<init>@4 (line 34)
         │                                                ; - io.netty.buffer.UnpooledByteBufAllocator$InstrumentedUnpooledUnsafeHeapByteBuf::<init>@4 (line 139)
         │                                                ; - io.netty.buffer.UnpooledByteBufAllocator::newHeapBuffer@13 (line 82)
         │                                                ; - io.netty.buffer.AbstractByteBufAllocator::heapBuffer@21 (line 169)
         │                                                ; - io.netty.buffer.AbstractByteBufAllocator::heapBuffer@4 (line 160)
         │                                                ; - io.netty.buffer.Unpooled::buffer@4 (line 119)
         │                                                ; - io.netty.buffer.AbstractReferenceCountedByteBufBenchmark::createUseAndRelease@1 (line 66)
         │  0x00007f96552a7ef2: movl   $0xe393248b,0x24(%rbp)  ;   {oop(a 'io/netty/buffer/UnpooledByteBufAllocator')}

Hence, with this PR, storeFence translates into a no-op (if we're lucky with inlining all the way down to the Unsafe::storeFence intrinsic), so there's no associated cost. On 4.1, instead, there is a full barrier after setting the refCnt field to its initial value (0x2), in the form of a lock addl whose cost shows up as the 15.16% sample; it appears attributed to a later instruction due to skid.

@franz1981 commented Apr 6, 2022

IMO it's still worth considering, wdyt @normanmaurer?
It looks similar to the optimization Nitsan made years ago, using a plain load on refCnt instead of atomic ops.

I'd be curious to see what numbers @chrisvest gets with leak detection disabled on a non-x86 CPU: I don't have one, but I'm curious.


protected AbstractReferenceCountedByteBuf(int maxCapacity) {
super(maxCapacity);
if (!PlatformDependent.hasUnsafe()) {
Member:
consider adding some comments here on what we are doing and why

@franz1981 (Author) replied Apr 6, 2022:

I'm thinking of moving everything into a better-named method too, but I'm curious whether anyone can run some experiments against the existing allocation benchmarks; I have a gut feeling (hopefully not wrong) that it has a larger benefit than expected, given that it happens on the hot path.

@franz1981 commented Apr 7, 2022

@normanmaurer I've implemented a specific PlatformDependent method to do a "safe construction put int", but I've been unlucky: it hits the inlining depth limit of JDK 8 and is not inlined, see

         │  0x00007f6edd2a90a0: mov    $0x2,%ecx
  0.02%  │  0x00007f6edd2a90a5: xchg   %ax,%ax
         │  0x00007f6edd2a90a7: callq  0x00007f6edd046260  ; OopMap{rbp=Oop [0]=Oop [8]=Oop [16]=Oop [24]=Oop off=204}
         │                                                ;*invokestatic safeConstructPutInt
         │                                                ; - io.netty.util.internal.PlatformDependent::safeConstructPutInt@3 (line 508)
         │                                                ; - io.netty.util.internal.ReferenceCountUpdater::setInitialValue@34 (line 67)
         │                                                ; - io.netty.buffer.AbstractReferenceCountedByteBuf::<init>@9 (line 50)
         │                                                ; - io.netty.buffer.UnpooledHeapByteBuf::<init>@2 (line 51)
         │                                                ; - io.netty.buffer.UnpooledUnsafeHeapByteBuf::<init>@4 (line 34)
         │                                                ; - io.netty.buffer.UnpooledByteBufAllocator$InstrumentedUnpooledUnsafeHeapByteBuf::<init>@4 (line 139)
         │                                                ; - io.netty.buffer.UnpooledByteBufAllocator::newHeapBuffer@13 (line 83)
         │                                                ; - io.netty.buffer.AbstractByteBufAllocator::heapBuffer@21 (line 169)
         │                                                ; - io.netty.buffer.AbstractByteBufAllocator::heapBuffer@4 (line 160)
         │                                                ; - io.netty.buffer.Unpooled::buffer@4 (line 119)
         │                                                ; - io.netty.buffer.AbstractReferenceCountedByteBufBenchmark::createUseAndRelease@1 (line 66)
         │                                                ;   {static_call}
         │  0x00007f6edd2a90ac: movl   $0xe393267d,0x24(%rbp)  ;   {oop(a 'io/netty/buffer/UnpooledByteBufAllocator')}

This is one of the drawbacks of using Unsafe wrapped in PlatformDependent methods: they won't benefit from being intrinsified (that's the reason why in JCTools we use Unsafe directly) :(
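A minimal sketch of what such a wrapper might look like (names and structure are hypothetical; Netty's actual PlatformDependent0 code differs, and the reflective Unsafe lookup here is only one common way to obtain it):

```java
import java.lang.reflect.Field;
import sun.misc.Unsafe;

final class SafeConstruct {
    static final Unsafe UNSAFE;
    static {
        try {
            // Common reflective lookup of the Unsafe singleton.
            Field f = Unsafe.class.getDeclaredField("theUnsafe");
            f.setAccessible(true);
            UNSAFE = (Unsafe) f.get(null);
        } catch (ReflectiveOperationException e) {
            throw new ExceptionInInitializerError(e);
        }
    }

    // Plain store followed by a store-store fence: on x86 the fence emits no
    // instruction, whereas a volatile store emits a full barrier (lock addl).
    static void safeConstructPutInt(Object o, long fieldOffset, int value) {
        UNSAFE.putInt(o, fieldOffset, value);
        UNSAFE.storeFence();
    }

    static final class Holder { int refCnt; }

    public static void main(String[] args) throws Exception {
        long off = UNSAFE.objectFieldOffset(Holder.class.getDeclaredField("refCnt"));
        Holder h = new Holder();
        safeConstructPutInt(h, off, 2);
        System.out.println(h.refCnt);
    }
}
```

The drawback discussed above is exactly this shape: the extra wrapper frame (`PlatformDependent.safeConstructPutInt` -> `PlatformDependent0` -> `Unsafe`) consumes inlining depth, so on JDK 8's default budget the call may never reach the `Unsafe::storeFence` intrinsic.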

And indeed the performance is now not as brilliant as it should be, i.e.

Benchmark                                                          (delay)  Mode  Cnt   Score   Error  Units
AbstractReferenceCountedByteBufBenchmark.createUseAndRelease             0  avgt   10  50.324 ± 1.069  ns/op

Still better than the existing one, but it now pays the callq cost. If I run the same benchmark with -XX:MaxInlineLevel=15:

Benchmark                                                          (delay)  Mode  Cnt   Score   Error  Units
AbstractReferenceCountedByteBufBenchmark.createUseAndRelease             0  avgt   10  43.067 ± 0.868  ns/op

That seems a good improvement (as expected).

The assembly has changed too:

  0.02%  │  0x00007f3bb52a58f5: movq   $0x2,0x20(%rax)    ;*new  ; - io.netty.buffer.UnpooledByteBufAllocator::newHeapBuffer@6 (line 83)
         │                                                ; - io.netty.buffer.AbstractByteBufAllocator::heapBuffer@21 (line 169)
         │                                                ; - io.netty.buffer.AbstractByteBufAllocator::heapBuffer@4 (line 160)
         │                                                ; - io.netty.buffer.Unpooled::buffer@4 (line 119)
         │                                                ; - io.netty.buffer.AbstractReferenceCountedByteBufBenchmark::createUseAndRelease@1 (line 66)
         │  0x00007f3bb52a58fd: mov    %rax,0x10(%rsp)    ;*invokevirtual storeFence
         │                                                ; - io.netty.util.internal.PlatformDependent0::safeConstructPutInt@18 (line 616)
         │                                                ; - io.netty.util.internal.PlatformDependent::safeConstructPutInt@3 (line 508)
         │                                                ; - io.netty.util.internal.ReferenceCountUpdater::setInitialValue@34 (line 67)
         │                                                ; - io.netty.buffer.AbstractReferenceCountedByteBuf::<init>@9 (line 50)
         │                                                ; - io.netty.buffer.UnpooledHeapByteBuf::<init>@2 (line 51)
         │                                                ; - io.netty.buffer.UnpooledUnsafeHeapByteBuf::<init>@4 (line 34)
         │                                                ; - io.netty.buffer.UnpooledByteBufAllocator$InstrumentedUnpooledUnsafeHeapByteBuf::<init>@4 (line 139)
         │                                                ; - io.netty.buffer.UnpooledByteBufAllocator::newHeapBuffer@13 (line 83)
         │                                                ; - io.netty.buffer.AbstractByteBufAllocator::heapBuffer@21 (line 169)
         │                                                ; - io.netty.buffer.AbstractByteBufAllocator::heapBuffer@4 (line 160)
         │                                                ; - io.netty.buffer.Unpooled::buffer@4 (line 119)
         │                                                ; - io.netty.buffer.AbstractReferenceCountedByteBufBenchmark::createUseAndRelease@1 (line 66)
  0.89%  │  0x00007f3bb52a5902: movl   $0xe3880a40,0x24(%rax)  ;   {oop(a 'io/netty/buffer/UnpooledByteBufAllocator')}

The expected movl has been replaced by a movq, no idea why, given that the initial value is an int field (movl), not a quad word (movq); there's also a wrong attribution to storeFence, which should be a no-op, but overall the improvement looks great.

In short: although a wild (and unconfigured) JMH benchmark shows just a small improvement (because of the unlucky inlining depth hit by the new PlatformDependent method), the improvement in the best case is larger.

@franz1981 commented Apr 7, 2022

Re the inlining thing:

[screenshot: inlining trace] It shows that everything is getting inlined (as expected) with max inlining level 15, while it's not with the JDK 8 default max inlining level (9): [screenshot: inlining trace]

Hence, in order to get comparable numbers, I suggest running the benchmark with a higher MaxInlineLevel, i.e. 15.
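A hedged example of how such a run might be launched; the jar path and the use of the perfasm profiler are assumptions based on this thread, not verified against the netty microbench module:

```shell
# Hypothetical JMH invocation: raise the JDK 8 inline budget so the
# PlatformDependent wrapper can be inlined down to the storeFence intrinsic.
java -XX:MaxInlineLevel=15 -jar microbench/target/microbenchmarks.jar \
  'AbstractReferenceCountedByteBufBenchmark.createUseAndRelease' \
  -p delay=0 -prof perfasm
```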

@franz1981
PTAL @normanmaurer

@franz1981 franz1981 marked this pull request as ready for review April 7, 2022 05:27
@normanmaurer

@chrisvest @njhill PTAL

@chrisvest chrisvest merged commit f9e765e into netty:4.1 Apr 14, 2022
@chrisvest

Thanks!

raidyue pushed a commit to raidyue/netty that referenced this pull request Jul 8, 2022
franz1981 added a commit to franz1981/netty that referenced this pull request Aug 22, 2022