
AbstractReferenceCountedByteBuf can save using a volatile set on constructor #12288

Merged · 2 commits · Apr 14, 2022

Conversation

franz1981
Contributor

Motivation:

AbstractReferenceCountedByteBuf can be made faster by ensuring proper safe initialization instead of a volatile set on construction

Modification:

Use Unsafe and a plain store instead of a volatile set

Result:

Faster AbstractReferenceCountedByteBuf allocation
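The pattern can be sketched as follows. This is a simplified illustration, not Netty's actual code: `VarHandle.releaseFence()` (JDK 9+) stands in for the `Unsafe.storeFence()` call the PR uses, and the even-valued refCnt encoding is assumed from Netty's `ReferenceCountUpdater`:

```java
import java.lang.invoke.VarHandle;

class RefCountedBuf {
    // Plain (non-volatile) field; later updates go through atomic/ordered access.
    private int refCnt;

    RefCountedBuf() {
        refCnt = 2;               // plain store: no barrier emitted here
        VarHandle.releaseFence(); // store-store fence safely publishes the
                                  // initial value; compiles to nothing on x86
    }

    // Netty encodes a "real" count of 1 as the raw value 2 (even values are live).
    int refCnt() {
        return refCnt >>> 1;
    }
}
```

The point is that a release (store-store) fence is sufficient for safe publication out of a constructor, so the full barrier implied by a volatile store is avoidable on the allocation hot path.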

@franz1981 (Contributor, Author) commented Apr 6, 2022

I still have to run benchmarks (that's why it's a draft) and need to add a few checks to be sure the fence method is present on Unsafe.
This should benefit any buffer allocation.

@franz1981

it's similar to #12286

@franz1981 franz1981 added improvement discussion benchmark Affect the JMH benchmarks labels Apr 6, 2022
@franz1981 commented Apr 6, 2022

Results with delay = 0 on my machine are

4.1:
Benchmark                                                          (delay)  Mode  Cnt   Score   Error  Units
AbstractReferenceCountedByteBufBenchmark.createUseAndRelease             0  avgt    8  54.772 ± 0.445  ns/op

this PR:

Benchmark                                                          (delay)  Mode  Cnt   Score   Error  Units
AbstractReferenceCountedByteBufBenchmark.createUseAndRelease             0  avgt    8  47.956 ± 0.515  ns/op

There's an improvement, but not a huge one.

@franz1981 commented Apr 6, 2022

Relevant assembly parts are...
4.1:

  0.03%  │  0x00007f704d2aab23: movl   $0x2,0x20(%rbp)
  0.03%  │  0x00007f704d2aab2a: lock addl $0x0,(%rsp)     ;*putfield refCnt
         │                                                ; - io.netty.buffer.AbstractReferenceCountedByteBuf::<init>@12 (line 46)
         │                                                ; - io.netty.buffer.UnpooledHeapByteBuf::<init>@2 (line 51)
         │                                                ; - io.netty.buffer.UnpooledUnsafeHeapByteBuf::<init>@4 (line 34)
         │                                                ; - io.netty.buffer.UnpooledByteBufAllocator$InstrumentedUnpooledUnsafeHeapByteBuf::<init>@4 (line 139)
         │                                                ; - io.netty.buffer.UnpooledByteBufAllocator::newHeapBuffer@13 (line 82)
         │                                                ; - io.netty.buffer.AbstractByteBufAllocator::heapBuffer@21 (line 169)
         │                                                ; - io.netty.buffer.AbstractByteBufAllocator::heapBuffer@4 (line 160)
         │                                                ; - io.netty.buffer.Unpooled::buffer@4 (line 119)
         │                                                ; - io.netty.buffer.AbstractReferenceCountedByteBufBenchmark::createUseAndRelease@1 (line 66)
 15.16%  │  0x00007f704d2aab2f: movl   $0xe3932e95,0x24(%rbp)  ;   {oop(a 'io/netty/buffer/UnpooledByteBufAllocator')}

and, this PR:

  0.05%  │  0x00007f96552a7eeb: movl   $0x2,0x20(%rbp)    ;*invokevirtual storeFence
         │                                                ; - io.netty.util.internal.PlatformDependent0::storeFence@3 (line 583)
         │                                                ; - io.netty.util.internal.PlatformDependent::storeFence@0 (line 512)
         │                                                ; - io.netty.buffer.AbstractReferenceCountedByteBuf::<init>@37 (line 55)
         │                                                ; - io.netty.buffer.UnpooledHeapByteBuf::<init>@2 (line 51)
         │                                                ; - io.netty.buffer.UnpooledUnsafeHeapByteBuf::<init>@4 (line 34)
         │                                                ; - io.netty.buffer.UnpooledByteBufAllocator$InstrumentedUnpooledUnsafeHeapByteBuf::<init>@4 (line 139)
         │                                                ; - io.netty.buffer.UnpooledByteBufAllocator::newHeapBuffer@13 (line 82)
         │                                                ; - io.netty.buffer.AbstractByteBufAllocator::heapBuffer@21 (line 169)
         │                                                ; - io.netty.buffer.AbstractByteBufAllocator::heapBuffer@4 (line 160)
         │                                                ; - io.netty.buffer.Unpooled::buffer@4 (line 119)
         │                                                ; - io.netty.buffer.AbstractReferenceCountedByteBufBenchmark::createUseAndRelease@1 (line 66)
         │  0x00007f96552a7ef2: movl   $0xe393248b,0x24(%rbp)  ;   {oop(a 'io/netty/buffer/UnpooledByteBufAllocator')}

Hence, with this PR, storeFence translates into a no-op (if we're lucky with inlining all the way down to the Unsafe::storeFence intrinsic), so there's no associated cost. On 4.1, instead, there is a full barrier after setting the refCnt field to its initial value (0x2), in the form of a lock addl whose cost shows up as the 15.16% sample; it appears attributed to a later instruction due to skid.

@franz1981 commented Apr 6, 2022

IMO it's still worth considering, wdyt @normanmaurer?
It looks similar to the optimization Nitsan made years ago, using a plain load on refCnt instead of atomic ops.

I'd be curious to see what numbers @chrisvest gets with leak detection disabled on a non-x86 CPU: I don't have one, but I'm curious.


protected AbstractReferenceCountedByteBuf(int maxCapacity) {
super(maxCapacity);
if (!PlatformDependent.hasUnsafe()) {
Member:
consider adding some comments here on what we are doing and why

@franz1981 (Author) replied Apr 6, 2022:

I'm thinking of moving everything into a better-named method too, but I'm curious whether anyone can run some experiments against the existing allocation benchmarks; I have a gut feeling (hopefully not wrong) that it has a larger benefit than expected, given that it happens on the hot path.

@franz1981 commented Apr 7, 2022

@normanmaurer I've implemented a specific PlatformDependent method to do a "safe construction put int", but I've been unlucky: it hits the inlining depth limit of JDK 8 and is not inlined, see

         │  0x00007f6edd2a90a0: mov    $0x2,%ecx
  0.02%  │  0x00007f6edd2a90a5: xchg   %ax,%ax
         │  0x00007f6edd2a90a7: callq  0x00007f6edd046260  ; OopMap{rbp=Oop [0]=Oop [8]=Oop [16]=Oop [24]=Oop off=204}
         │                                                ;*invokestatic safeConstructPutInt
         │                                                ; - io.netty.util.internal.PlatformDependent::safeConstructPutInt@3 (line 508)
         │                                                ; - io.netty.util.internal.ReferenceCountUpdater::setInitialValue@34 (line 67)
         │                                                ; - io.netty.buffer.AbstractReferenceCountedByteBuf::<init>@9 (line 50)
         │                                                ; - io.netty.buffer.UnpooledHeapByteBuf::<init>@2 (line 51)
         │                                                ; - io.netty.buffer.UnpooledUnsafeHeapByteBuf::<init>@4 (line 34)
         │                                                ; - io.netty.buffer.UnpooledByteBufAllocator$InstrumentedUnpooledUnsafeHeapByteBuf::<init>@4 (line 139)
         │                                                ; - io.netty.buffer.UnpooledByteBufAllocator::newHeapBuffer@13 (line 83)
         │                                                ; - io.netty.buffer.AbstractByteBufAllocator::heapBuffer@21 (line 169)
         │                                                ; - io.netty.buffer.AbstractByteBufAllocator::heapBuffer@4 (line 160)
         │                                                ; - io.netty.buffer.Unpooled::buffer@4 (line 119)
         │                                                ; - io.netty.buffer.AbstractReferenceCountedByteBufBenchmark::createUseAndRelease@1 (line 66)
         │                                                ;   {static_call}
         │  0x00007f6edd2a90ac: movl   $0xe393267d,0x24(%rbp)  ;   {oop(a 'io/netty/buffer/UnpooledByteBufAllocator')}

This is one of the drawbacks of using Unsafe wrapped in PlatformDependent methods: they won't benefit from being intrinsified (that's the reason why in JCTools we use Unsafe directly) :(
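A minimal sketch of what such a wrapper might look like (names and structure are hypothetical; Netty's actual PlatformDependent0 code differs, and the reflective Unsafe lookup here is only one common way to obtain it):

```java
import java.lang.reflect.Field;
import sun.misc.Unsafe;

final class SafeConstruct {
    static final Unsafe UNSAFE;
    static {
        try {
            // Common reflective lookup of the Unsafe singleton.
            Field f = Unsafe.class.getDeclaredField("theUnsafe");
            f.setAccessible(true);
            UNSAFE = (Unsafe) f.get(null);
        } catch (ReflectiveOperationException e) {
            throw new ExceptionInInitializerError(e);
        }
    }

    // Plain store followed by a store-store fence: on x86 the fence emits no
    // instruction, whereas a volatile store emits a full barrier (lock addl).
    static void safeConstructPutInt(Object o, long fieldOffset, int value) {
        UNSAFE.putInt(o, fieldOffset, value);
        UNSAFE.storeFence();
    }

    static final class Holder { int refCnt; }

    public static void main(String[] args) throws Exception {
        long off = UNSAFE.objectFieldOffset(Holder.class.getDeclaredField("refCnt"));
        Holder h = new Holder();
        safeConstructPutInt(h, off, 2);
        System.out.println(h.refCnt);
    }
}
```

The drawback discussed above is exactly this shape: the extra wrapper frame (`PlatformDependent.safeConstructPutInt` -> `PlatformDependent0` -> `Unsafe`) consumes inlining depth, so on JDK 8's default budget the call may never reach the `Unsafe::storeFence` intrinsic.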

And indeed the performance is now not as brilliant as it should be, i.e.

Benchmark                                                          (delay)  Mode  Cnt   Score   Error  Units
AbstractReferenceCountedByteBufBenchmark.createUseAndRelease             0  avgt   10  50.324 ± 1.069  ns/op

Still better than the existing one, but it now pays the callq cost. If I run the same benchmark with -XX:MaxInlineLevel=15:

Benchmark                                                          (delay)  Mode  Cnt   Score   Error  Units
AbstractReferenceCountedByteBufBenchmark.createUseAndRelease             0  avgt   10  43.067 ± 0.868  ns/op

That seems a good improvement (as expected).

The assembly has changed too:

  0.02%  │  0x00007f3bb52a58f5: movq   $0x2,0x20(%rax)    ;*new  ; - io.netty.buffer.UnpooledByteBufAllocator::newHeapBuffer@6 (line 83)
         │                                                ; - io.netty.buffer.AbstractByteBufAllocator::heapBuffer@21 (line 169)
         │                                                ; - io.netty.buffer.AbstractByteBufAllocator::heapBuffer@4 (line 160)
         │                                                ; - io.netty.buffer.Unpooled::buffer@4 (line 119)
         │                                                ; - io.netty.buffer.AbstractReferenceCountedByteBufBenchmark::createUseAndRelease@1 (line 66)
         │  0x00007f3bb52a58fd: mov    %rax,0x10(%rsp)    ;*invokevirtual storeFence
         │                                                ; - io.netty.util.internal.PlatformDependent0::safeConstructPutInt@18 (line 616)
         │                                                ; - io.netty.util.internal.PlatformDependent::safeConstructPutInt@3 (line 508)
         │                                                ; - io.netty.util.internal.ReferenceCountUpdater::setInitialValue@34 (line 67)
         │                                                ; - io.netty.buffer.AbstractReferenceCountedByteBuf::<init>@9 (line 50)
         │                                                ; - io.netty.buffer.UnpooledHeapByteBuf::<init>@2 (line 51)
         │                                                ; - io.netty.buffer.UnpooledUnsafeHeapByteBuf::<init>@4 (line 34)
         │                                                ; - io.netty.buffer.UnpooledByteBufAllocator$InstrumentedUnpooledUnsafeHeapByteBuf::<init>@4 (line 139)
         │                                                ; - io.netty.buffer.UnpooledByteBufAllocator::newHeapBuffer@13 (line 83)
         │                                                ; - io.netty.buffer.AbstractByteBufAllocator::heapBuffer@21 (line 169)
         │                                                ; - io.netty.buffer.AbstractByteBufAllocator::heapBuffer@4 (line 160)
         │                                                ; - io.netty.buffer.Unpooled::buffer@4 (line 119)
         │                                                ; - io.netty.buffer.AbstractReferenceCountedByteBufBenchmark::createUseAndRelease@1 (line 66)
  0.89%  │  0x00007f3bb52a5902: movl   $0xe3880a40,0x24(%rax)  ;   {oop(a 'io/netty/buffer/UnpooledByteBufAllocator')}

The expected movl has been replaced by a movq, no idea why, given that the initial value is an int field (movl), not a quad word (movq); there's also a wrong attribution to storeFence, which should be a no-op, but overall the improvement looks great.

In short: although a wild (and unconfigured) JMH benchmark shows just a small improvement (because of the unlucky inlining depth hit by the new PlatformDependent method), the improvement in the best case is larger.

@franz1981 commented Apr 7, 2022

Re the inlining thing:

[screenshot: inlining trace] It shows that everything is getting inlined (as expected) with max inlining level 15, while it's not with the JDK 8 default max inlining level (9): [screenshot: inlining trace]

Hence, in order to get comparable numbers, I suggest running the benchmark with a higher MaxInlineLevel, i.e. 15.
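A hedged example of how such a run might be launched; the jar path and the use of the perfasm profiler are assumptions based on this thread, not verified against the netty microbench module:

```shell
# Hypothetical JMH invocation: raise the JDK 8 inline budget so the
# PlatformDependent wrapper can be inlined down to the storeFence intrinsic.
java -XX:MaxInlineLevel=15 -jar microbench/target/microbenchmarks.jar \
  'AbstractReferenceCountedByteBufBenchmark.createUseAndRelease' \
  -p delay=0 -prof perfasm
```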

@franz1981
PTAL @normanmaurer

@franz1981 franz1981 marked this pull request as ready for review April 7, 2022 05:27
@normanmaurer

@chrisvest @njhill PTAL

@chrisvest chrisvest merged commit f9e765e into netty:4.1 Apr 14, 2022
@chrisvest

Thanks!

raidyue pushed a commit to raidyue/netty that referenced this pull request Jul 8, 2022
franz1981 added a commit to franz1981/netty that referenced this pull request Aug 22, 2022