Make AdaptiveByteBuf.setBytes faster by chrisvest · Pull Request #15736 · netty/netty

chrisvest · 2025-10-03T21:09:47Z

Motivation:
The setBytes method was getting a sliced nioBuffer of its source, which typically causes allocation. On Java 16 onwards, we can instead copy using an absolutely offsetted put method, and forego allocating a duplicate ByteBuffer instance that is otherwise needed for isolating the position field.

Modification:

Make use of the absolutely offsetted put method in setBytes, when it's available.
Use the underlying ByteBuffer of the shared chunk where possible to avoid multiple bounds checks.
Add a benchmark targeting the setBytes method that takes a ByteBuf source.
Change a few benchmarks to use the default allocator when pooling is enabled.

Result:
Faster setBytes in certain cases.
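
As a rough illustration of the difference described in the motivation, the sketch below contrasts the relative copy (which needs duplicates to isolate position and limit) with the absolute bulk `ByteBuffer.put(int, ByteBuffer, int, int)` that the JDK added in Java 16. The class and method names are hypothetical and are not Netty code; `PlatformDependent.absolutePut` may be implemented differently.

```java
import java.nio.ByteBuffer;

public final class AbsolutePutSketch {
    private AbsolutePutSketch() { }

    // Relative copy: position/limit have to be isolated first, so a duplicate
    // ByteBuffer is allocated for the source (and here for the destination too).
    public static void copyWithDuplicate(ByteBuffer dst, int dstIndex,
                                         ByteBuffer src, int srcIndex, int length) {
        ByteBuffer srcView = src.duplicate();
        srcView.clear().position(srcIndex).limit(srcIndex + length);
        ByteBuffer dstView = dst.duplicate();
        dstView.clear().position(dstIndex);
        dstView.put(srcView);
    }

    // Absolute copy (Java 16+): neither buffer's position or limit is modified,
    // so no duplicate is needed.
    public static void copyAbsolute(ByteBuffer dst, int dstIndex,
                                    ByteBuffer src, int srcIndex, int length) {
        dst.put(dstIndex, src, srcIndex, length);
    }

    public static void main(String[] args) {
        ByteBuffer src = ByteBuffer.allocateDirect(16);
        for (int i = 0; i < 16; i++) {
            src.put(i, (byte) i);
        }
        ByteBuffer dst = ByteBuffer.allocateDirect(16);
        copyAbsolute(dst, 4, src, 8, 4);
        System.out.println(dst.get(4) + ".." + dst.get(7) + " position=" + dst.position());
    }
}
```

Both methods copy the same bytes; only the first one has to allocate per call.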


franz1981 · 2025-10-04T05:23:53Z

buffer/src/main/java/io/netty/buffer/AdaptivePoolingAllocator.java

-            tmp.put(src);
+            int length = src.remaining();
+            checkIndex(index, length);
+            ByteBuffer tmp = internalNioBuffer();


Why not use the rootParent NIO buffer(s) (from each adaptive buffer, if available)?
It is shared among size-classed chunks and would save them from allocating NIO buffers, which will be dead once they are released (AdaptiveByteBufs are just empty shells afaik).

I tried this but it was a lot slower, by like 50%. Not sure what was going on, but in theory you're right.

Maybe it was doubling the bound check? Debugging it or checking ASM could help.

If you have the version with the root parent, link it here and I will take a look tomorrow.

Here's the version I had that used the rootParent:

    @Override
    public ByteBuf setBytes(int index, ByteBuf src, int srcIndex, int length) {
        checkIndex(index, length);
        if (src instanceof AdaptiveByteBuf && PlatformDependent.javaVersion() >= 16) {
            AdaptiveByteBuf srcBuf = (AdaptiveByteBuf) src;
            AbstractByteBuf dstRoot = rootParent();
            ByteBuffer dstBuffer = dstRoot.internalNioBuffer(0, dstRoot.maxFastWritableBytes());
            AbstractByteBuf srcRoot = srcBuf.rootParent();
            ByteBuffer srcBuffer = srcRoot.internalNioBuffer(0, srcRoot.maxFastWritableBytes());
            PlatformDependent.absolutePut(dstBuffer, idx(index), srcBuffer, srcBuf.idx(srcIndex), length);
        } else {
            ByteBuffer tmp = (ByteBuffer) internalNioBuffer();
            tmp.clear().position(index);
            tmp.put(src.nioBuffer(srcIndex, length));
        }
        return this;
    }

I'm getting

Bug: unsafe access to shared internal chunk buffer

so I'm adding this too

    @Override
    public ByteBuffer internalNioBuffer(int index, int length) {
        if (!allowSectionedInternalNioBufferAccess) {
            // we can only return the internalNioBuffer if the whole buffer is requested
            if (length != capacity && index != 0) {
                throw new UnsupportedOperationException("Bug: unsafe access to shared internal chunk buffer");
            }
            return internalNioBuffer();
        }
        checkIndex(index, length);
        return internalNioBuffer().clear().position(index).limit(index + length);
    }

It's still not very secure .-.
But I've moved the checkIndex later, since I didn't want to perform both the accessibility and bounds checks again here.
In short: I would love for us to have a simple internalNioBuffer() method that performs neither bounds nor accessibility checks, and that is never meant to provide an internal buffer whose position/limit can be modified.

With the changes at #15736 (comment) to enable using the internalNioBuffer and

    @Override
    public ByteBuf setBytes(int index, ByteBuf src, int srcIndex, int length) {
        checkIndex(index, length);
        if (src instanceof AdaptiveByteBuf && PlatformDependent.javaVersion() >= 16) {
            AdaptiveByteBuf srcBuf = (AdaptiveByteBuf) src;
            // bound check src as well
            srcBuf.checkIndex(srcIndex, length);
            AbstractByteBuf dstRoot = rootParent;
            AbstractByteBuf srcRoot = srcBuf.rootParent;
            // TODO why we have to pay again for accessibility and bound checks?
            ByteBuffer dstBuffer = dstRoot.internalNioBuffer(0, dstRoot.maxFastWritableBytes());
            ByteBuffer srcBuffer = srcRoot.internalNioBuffer(0, srcRoot.maxFastWritableBytes());
            PlatformDependent.absolutePut(dstBuffer, idx(index), srcBuffer, srcBuf.idx(srcIndex), length);
        } else {
            ByteBuffer tmp = (ByteBuffer) internalNioBuffer();
            tmp.clear().position(index);
            tmp.put(src.nioBuffer(srcIndex, length));
        }
        return this;
    }

I'm getting the best ever performance and no allocations, so I would go for it.
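
To make the root-parent idea easier to follow, here is a minimal sketch using only the JDK: two views carved out of one shared chunk buffer copy bytes between each other by translating their own indices into chunk-relative ones and calling the absolute bulk put on the shared buffer. `View`, its fields and its `idx` helper are hypothetical stand-ins for illustration, not the AdaptiveByteBuf implementation, and the sketch assumes Java 16 or later.

```java
import java.nio.ByteBuffer;

public final class SharedChunkCopySketch {
    private SharedChunkCopySketch() { }

    /** Hypothetical stand-in for a buffer that is a window into a shared chunk. */
    public static final class View {
        private final ByteBuffer root; // the chunk's buffer, shared by every view
        private final int offset;      // where this view starts inside the chunk
        private final int capacity;    // size of this view

        public View(ByteBuffer root, int offset, int capacity) {
            if (offset < 0 || capacity < 0 || offset > root.capacity() - capacity) {
                throw new IndexOutOfBoundsException("view outside chunk");
            }
            this.root = root;
            this.offset = offset;
            this.capacity = capacity;
        }

        // Translate a view-relative index into a chunk-relative one.
        private int idx(int index) {
            return offset + index;
        }

        private void checkIndex(int index, int length) {
            if (index < 0 || length < 0 || index > capacity - length) {
                throw new IndexOutOfBoundsException("index=" + index + ", length=" + length);
            }
        }

        // Bounds are checked once per side, against the views; the shared
        // buffers' position and limit are never touched.
        public View setBytes(int index, View src, int srcIndex, int length) {
            checkIndex(index, length);
            src.checkIndex(srcIndex, length);
            root.put(idx(index), src.root, src.idx(srcIndex), length);
            return this;
        }

        public View setByte(int index, int value) {
            checkIndex(index, 1);
            root.put(idx(index), (byte) value);
            return this;
        }

        public byte getByte(int index) {
            checkIndex(index, 1);
            return root.get(idx(index));
        }
    }

    public static void main(String[] args) {
        ByteBuffer chunk = ByteBuffer.allocateDirect(64);
        View a = new View(chunk, 0, 16);
        View b = new View(chunk, 32, 16);
        for (int i = 0; i < 16; i++) {
            a.setByte(i, i);
        }
        b.setBytes(2, a, 4, 8);
        System.out.println("b[2]=" + b.getByte(2) + " chunk position=" + chunk.position());
    }
}
```

Because the copy goes through the chunk-level buffers, the views themselves never need a ByteBuffer of their own for this operation.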

franz1981 · 2025-10-06T02:47:18Z

Still running the HTTP/2 benchmarks, and the numbers now look better.

The slice creation is still dominant:

but the actual copy has become cheaper - especially with --jvmArgsAppend="-Dio.netty.noUnsafe=true":

While benchmarking/profiling this I've found something weird,
i.e. disabling Unsafe just causes the VarHandle ref counter methods not to be inlined, because io/netty/util/internal/ReferenceCountUpdater.release is not being inlined.

Reading https://wiki.openjdk.org/display/HotSpot/Server+Compiler+Inlining+Messages it means

already compiled into a big method: there is already compiled code
for the method that is called from the call site and the code that was
generated for is larger than InlineSmallCode

and

already compiled into a medium method: there is already compiled code
for the method that is called from the call site and the code that was
generated for is larger than InlineSmallCode / 4

Adding https://github.com/openjdk/jdk/blob/ba7bf43c76c94bea85dbbd865794184b7ee0cc86/src/hotspot/share/opto/bytecodeInfo.cpp#L276 for confirmation.

On JDK 24 InlineSmallCode is 2500, so to make sure the method is inlined its compiled code has to be smaller than InlineSmallCode / 4 = 625 bytes.

Adding -XX:+UnlockDiagnosticVMOptions -XX:+PrintInlining shows:

FYI disabling VarHandle still causes a similar issue with AtomicIntegerFieldUpdater$AtomicIntegerFieldUpdaterImpl - same old same old.
This is happening because the assembly size of io/netty/util/internal/ReferenceCountUpdater.release is just too big (if not using Unsafe), and the same applies to the atomic integer updater variant.

with noUnsafe=true:

============================= C2-compiled nmethod ==============================
----------------------------------- Assembly -----------------------------------

Compiled method (c2) 592 2313       4       io.netty.util.internal.ReferenceCountUpdater::release (50 bytes)
 total in heap  [0x00007fa1102d1d88,0x00007fa1102d2d30] = 4008
 relocation     [0x00007fa1102d1e60,0x00007fa1102d1f80] = 288
 main code      [0x00007fa1102d1f80,0x00007fa1102d2bc0] = 3136
 stub code      [0x00007fa1102d2bc0,0x00007fa1102d2c10] = 80
 oops           [0x00007fa1102d2c10,0x00007fa1102d2c60] = 80
 metadata       [0x00007fa1102d2c60,0x00007fa1102d2d30] = 208
 immutable data [0x00007fa0280d6870,0x00007fa0280d78c0] = 4176
 dependencies   [0x00007fa0280d6870,0x00007fa0280d68a8] = 56
 nul chk table  [0x00007fa0280d68a8,0x00007fa0280d68e8] = 64
 handler table  [0x00007fa0280d68e8,0x00007fa0280d69a8] = 192
 scopes pcs     [0x00007fa0280d69a8,0x00007fa0280d7228] = 2176
 scopes data    [0x00007fa0280d7228,0x00007fa0280d78c0] = 1688

whilst without it (i.e. using Unsafe):

============================= C2-compiled nmethod ==============================
----------------------------------- Assembly -----------------------------------

Compiled method (c2) 460 1912       4       io.netty.util.internal.ReferenceCountUpdater::release (50 bytes)
 total in heap  [0x00007f7928267888,0x00007f7928267ca0] = 1048
 relocation     [0x00007f7928267960,0x00007f79282679a8] = 72
 main code      [0x00007f79282679c0,0x00007f7928267bd0] = 528
 stub code      [0x00007f7928267bd0,0x00007f7928267be8] = 24
 oops           [0x00007f7928267be8,0x00007f7928267bf0] = 8
 metadata       [0x00007f7928267bf0,0x00007f7928267ca0] = 176
 immutable data [0x00007f785c006070,0x00007f785c0063d0] = 864
 dependencies   [0x00007f785c006070,0x00007f785c0060a0] = 48
 scopes pcs     [0x00007f785c0060a0,0x00007f785c006260] = 448
 scopes data    [0x00007f785c006260,0x00007f785c0063d0] = 368
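
The two dumps can be related to the thresholds quoted from the HotSpot wiki with a little arithmetic. The sketch below assumes that the size HotSpot compares is roughly the "main code" figure of the nmethod (an assumption, the exact measure is defined in bytecodeInfo.cpp) and uses the JDK 24 InlineSmallCode value of 2500 mentioned in this comment; the class and method names are hypothetical.

```java
public final class InlineThresholdSketch {
    // JDK 24 value quoted in the thread; other JDKs or platforms may differ.
    public static final int INLINE_SMALL_CODE = 2500;

    private InlineThresholdSketch() { }

    // Mirrors the wording of the two inlining messages:
    //   "big"    -> larger than InlineSmallCode
    //   "medium" -> larger than InlineSmallCode / 4
    public static String classify(int compiledCodeBytes) {
        if (compiledCodeBytes > INLINE_SMALL_CODE) {
            return "big";
        }
        if (compiledCodeBytes > INLINE_SMALL_CODE / 4) {
            return "medium";
        }
        return "small";
    }

    public static void main(String[] args) {
        // "main code" sizes taken from the two nmethod dumps in this comment.
        System.out.println("noUnsafe=true: " + classify(3136));
        System.out.println("with Unsafe:   " + classify(528));
    }
}
```

Under that assumption, the 3136-byte version exceeds InlineSmallCode, while the 528-byte version stays below the 625-byte limit.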

The reason it pops up here is that release is also getting compiled for the rawCnt != 2 path, making it fatter. That happens because adaptive chunks get retained/released, and because the data buffer in the benchmark keeps getting retained/released (without being deallocated), which is what #13783 is causing instead (in the real world it won't be a problem, since we don't expect data to be reused over and over!).
The way chunks use the reference count is addressed by #15571.

It looks like the reason ReferenceCountUpdater::release is that big is the bimorphic inlining of varHandle(), see

  0x00007fd0142d3656:   mov    %rsi,%r11                    ;*invokevirtual getRawRefCnt {reexecute=0 rethrow=0 return_oop=0}
                                                            ; - io.netty.util.internal.ReferenceCountUpdater::release@2 (line 130)
  0x00007fd0142d3659:   mov    0x8(%r11),%r9d
  0x00007fd0142d365d:   cmp    $0x10aed20,%r9d              ;   {metadata('io/netty/buffer/AbstractReferenceCountedByteBuf$3')}
  0x00007fd0142d3664:   je     0x00007fd0142d367f
  0x00007fd0142d3666:   cmp    $0x10b5090,%r9d              ;   {metadata('io/netty/buffer/AdaptivePoolingAllocator$Chunk$3')}
  0x00007fd0142d366d:   jne    0x00007fd0142d3d34
  0x00007fd0142d3673:   movabs $0x45a840548,%r8             ;   {oop(a 'java/lang/invoke/VarHandleInts$FieldInstanceReadWrite'{0x000000045a840548})}
  0x00007fd0142d367d:   jmp    0x00007fd0142d3689
  0x00007fd0142d367f:   movabs $0x45a840010,%r8             ;*invokevirtual varHandle {reexecute=0 rethrow=0 return_oop=0}
                                                            ; - io.netty.util.internal.VarHandleReferenceCountUpdater::getRawRefCnt@1 (line 41)
                                                            ; - io.netty.util.internal.ReferenceCountUpdater::release@2 (line 130)
                                                            ;   {oop(a 'java/lang/invoke/VarHandleInts$FieldInstanceReadWrite'{0x000000045a840010})}
  0x00007fd0142d3689:   movzbl 0xc(%r8),%r10d
  0x00007fd0142d368e:   test   %r10d,%r10d

where io/netty/buffer/AdaptivePoolingAllocator$Chunk$3 and io/netty/buffer/AbstractReferenceCountedByteBuf$3 are the two concrete types observed while executing

netty/common/src/main/java/io/netty/util/internal/VarHandleReferenceCountUpdater.java

Lines 40 to 42 in 63ffb7e

    protected final int getRawRefCnt(T refCnt) {
        return (int) varHandle().get(refCnt);
    }
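
The shape that puts two receiver types at that varHandle() call can be sketched with plain JDK classes. Everything below is a hypothetical stand-in (`Buf`, `Chunk`, `Updater`) and not the Netty source: one abstract updater whose shared method calls an abstract `varHandle()`, and two anonymous subclasses, one per reference-counted type. The initial raw count of 2 only echoes the `rawCnt == 2` convention mentioned in this thread.

```java
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;

public final class BimorphicUpdaterSketch {
    private BimorphicUpdaterSketch() { }

    // Hypothetical stand-ins for the two reference-counted types.
    public static final class Buf {
        volatile int refCnt = 2;
    }

    public static final class Chunk {
        volatile int refCnt = 2;
    }

    public abstract static class Updater<T> {
        protected abstract VarHandle varHandle();

        // Shared code: once both subclasses below have been used, the
        // varHandle() call here has seen two concrete receiver types.
        public final int getRawRefCnt(T instance) {
            return (int) varHandle().get(instance);
        }
    }

    private static final VarHandle BUF_VH = find(Buf.class);
    private static final VarHandle CHUNK_VH = find(Chunk.class);

    public static final Updater<Buf> BUF_UPDATER = new Updater<Buf>() {
        @Override
        protected VarHandle varHandle() {
            return BUF_VH;
        }
    };

    public static final Updater<Chunk> CHUNK_UPDATER = new Updater<Chunk>() {
        @Override
        protected VarHandle varHandle() {
            return CHUNK_VH;
        }
    };

    private static VarHandle find(Class<?> owner) {
        try {
            return MethodHandles.lookup().findVarHandle(owner, "refCnt", int.class);
        } catch (ReflectiveOperationException e) {
            throw new ExceptionInInitializerError(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(BUF_UPDATER.getRawRefCnt(new Buf()));
        System.out.println(CHUNK_UPDATER.getRawRefCnt(new Chunk()));
        System.out.println(BUF_UPDATER.getClass() != CHUNK_UPDATER.getClass());
    }
}
```

Whether the JIT actually emits a two-way type check for such code depends on the profile it collects, so this only shows the structure, not the compiled result.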

@yawkat suggested franz1981#1 a long time ago to fix this problem
wdyt @chrisvest ?

This way, since there are no overridden varHandles, the problem should go away:

there will be a single type implementing varHandle
no need for bimorphic inlining
compiled assembly "should" (need to check) become smaller
the problem should disappear

Another "dirty" solution to this is to give io/netty/buffer/AdaptivePoolingAllocator$Chunk its own overridden method(s), e.g. release: this moves the type check for io/netty/buffer/AdaptivePoolingAllocator$Chunk earlier when calling release, making the io/netty/buffer/AbstractReferenceCountedByteBuf::release one smaller
i.e.

                    // on Chunk
                    updater = new VarHandleReferenceCountUpdater<Chunk>() {
                        @Override
                        protected VarHandle varHandle() {
                            return (VarHandle) REFCNT_FIELD_VH;
                        }

                        @Override
                        public boolean release(final Chunk instance) {
                            int rawCnt = getRawRefCnt(instance);
                            return rawCnt == 2 ? tryFinalRelease0(instance, 2) || retryRelease0(instance, 1)
                                    : nonFinalRelease0(instance, 1, rawCnt, toLiveRealRefCnt(rawCnt, 1));
                        }
                    };

chrisvest · 2025-10-06T17:03:16Z

I'm up for improving the way reference counting works; one implementation used across (if possible), and simpler algorithm with no count/raw count distinction. But let's do those in separate PRs.

franz1981 · 2025-10-07T02:27:08Z

one implementation used across (if possible), and simpler algorithm with no count/raw count distinction. But let's do those in separate PRs

oki, I leave this in good hands then ;)
I'll focus on the adaptive pool and on analyzing what's wrong with the root parent version of this PR instead

franz1981

I would improve the way we expose internalNioBuffer to avoid paying twice for bound and accessibility checks, but ATM that's the most performant version in my tests.

normanmaurer · 2025-10-27T09:09:36Z

@chrisvest @franz1981 what is the status of this one?

franz1981 · 2025-10-27T09:23:00Z

Let's say that the ref cnt PR from Chris will change what we would observe here.
I could run a few tests rebasing this over that work and see how it looks, if needed.

franz1981 · 2025-11-07T10:14:36Z

@chrisvest I think after #15764 this can make progress and use something like what I made at #15736 (comment)

chrisvest · 2025-11-11T22:26:22Z

@franz1981 I tried the attached patch, but I'm finding that it's actually slower in the ByteBufCopy2Benchmark on Java 25:

This PR currently:

Benchmark                       (directByteBuf)  (size)   Mode  Cnt          Score         Error  Units
ByteBufCopy2Benchmark.setBytes             true       7  thrpt   20  187270276.915 ± 1932878.488  ops/s
ByteBufCopy2Benchmark.setBytes             true      36  thrpt   20  175995791.002 ±  640168.287  ops/s
ByteBufCopy2Benchmark.setBytes             true     128  thrpt   20  143235411.817 ± 3168283.176  ops/s
ByteBufCopy2Benchmark.setBytes             true     512  thrpt   20   88840177.260 ±  949277.638  ops/s

The patch:

Benchmark                       (directByteBuf)  (size)   Mode  Cnt          Score         Error  Units
ByteBufCopy2Benchmark.setBytes             true       7  thrpt   20  155487223.856 ±  695797.841  ops/s
ByteBufCopy2Benchmark.setBytes             true      36  thrpt   20  142725234.580 ± 1837994.419  ops/s
ByteBufCopy2Benchmark.setBytes             true     128  thrpt   20  118052653.832 ±  270108.505  ops/s
ByteBufCopy2Benchmark.setBytes             true     512  thrpt   20   70395815.138 ±  169314.180  ops/s

A patch attempting the approach in #15736 (comment)

Subject: [PATCH] Use root parent ByteBuffers directly
---
Index: buffer/src/main/java/io/netty/buffer/UnsafeByteBufUtil.java
IDEA additional info:
Subsystem: com.intellij.openapi.diff.impl.patch.CharsetEP
<+>UTF-8
===================================================================
diff --git a/buffer/src/main/java/io/netty/buffer/UnsafeByteBufUtil.java b/buffer/src/main/java/io/netty/buffer/UnsafeByteBufUtil.java
--- a/buffer/src/main/java/io/netty/buffer/UnsafeByteBufUtil.java	(revision 293385859e01ba7501e6f120a0984dd084dfab67)
+++ b/buffer/src/main/java/io/netty/buffer/UnsafeByteBufUtil.java	(date 1762899017654)
@@ -689,14 +689,13 @@
     }
 
     static UnpooledUnsafeDirectByteBuf newUnsafeDirectByteBuf(
-            ByteBufAllocator alloc, int initialCapacity, int maxCapacity,
-            boolean allowSectionedInternalNioBufferAccess) {
+            ByteBufAllocator alloc, int initialCapacity, int maxCapacity) {
         if (PlatformDependent.useDirectBufferNoCleaner()) {
             return new UnpooledUnsafeNoCleanerDirectByteBuf(
-                    alloc, initialCapacity, maxCapacity, allowSectionedInternalNioBufferAccess);
+                    alloc, initialCapacity, maxCapacity);
         }
         return new UnpooledUnsafeDirectByteBuf(
-                alloc, initialCapacity, maxCapacity, allowSectionedInternalNioBufferAccess);
+                alloc, initialCapacity, maxCapacity);
     }
 
     private UnsafeByteBufUtil() { }
Index: buffer/src/main/java/io/netty/buffer/AdaptivePoolingAllocator.java
IDEA additional info:
Subsystem: com.intellij.openapi.diff.impl.patch.CharsetEP
<+>UTF-8
===================================================================
diff --git a/buffer/src/main/java/io/netty/buffer/AdaptivePoolingAllocator.java b/buffer/src/main/java/io/netty/buffer/AdaptivePoolingAllocator.java
--- a/buffer/src/main/java/io/netty/buffer/AdaptivePoolingAllocator.java	(revision 293385859e01ba7501e6f120a0984dd084dfab67)
+++ b/buffer/src/main/java/io/netty/buffer/AdaptivePoolingAllocator.java	(date 1762899017652)
@@ -1687,11 +1687,15 @@
         @Override
         public ByteBuf setBytes(int index, ByteBuf src, int srcIndex, int length) {
             checkIndex(index, length);
-            ByteBuffer tmp = internalNioBuffer();
             if (src instanceof AdaptiveByteBuf && PlatformDependent.javaVersion() >= 16) {
                 AdaptiveByteBuf srcBuf = (AdaptiveByteBuf) src;
-                PlatformDependent.absolutePut(tmp, index, srcBuf.internalNioBuffer(), srcIndex, length);
+//                ByteBuffer tmp = internalNioBuffer();
+//                PlatformDependent.absolutePut(tmp, index, srcBuf.internalNioBuffer(), srcIndex, length);
+                ByteBuffer dstBuffer = rootParent()._internalNioBuffer();
+                ByteBuffer srcBuffer = srcBuf.rootParent()._internalNioBuffer();
+                PlatformDependent.absolutePut(dstBuffer, idx(index), srcBuffer, srcBuf.idx(srcIndex), length);
             } else {
+                ByteBuffer tmp = internalNioBuffer();
                 tmp.position(index);
                 tmp.put(src.nioBuffer(srcIndex, length));
             }
Index: buffer/src/main/java/io/netty/buffer/UnpooledHeapByteBuf.java
IDEA additional info:
Subsystem: com.intellij.openapi.diff.impl.patch.CharsetEP
<+>UTF-8
===================================================================
diff --git a/buffer/src/main/java/io/netty/buffer/UnpooledHeapByteBuf.java b/buffer/src/main/java/io/netty/buffer/UnpooledHeapByteBuf.java
--- a/buffer/src/main/java/io/netty/buffer/UnpooledHeapByteBuf.java	(revision 293385859e01ba7501e6f120a0984dd084dfab67)
+++ b/buffer/src/main/java/io/netty/buffer/UnpooledHeapByteBuf.java	(date 1762899017653)
@@ -213,7 +213,7 @@
         ensureAccessible();
         ByteBuffer tmpBuf;
         if (internal) {
-            tmpBuf = internalNioBuffer();
+            tmpBuf = _internalNioBuffer();
         } else {
             tmpBuf = ByteBuffer.wrap(array);
         }
@@ -222,7 +222,7 @@
 
     private int getBytes(int index, FileChannel out, long position, int length, boolean internal) throws IOException {
         ensureAccessible();
-        ByteBuffer tmpBuf = internal ? internalNioBuffer() : ByteBuffer.wrap(array);
+        ByteBuffer tmpBuf = internal ? _internalNioBuffer() : ByteBuffer.wrap(array);
         return out.write((ByteBuffer) tmpBuf.clear().position(index).limit(index + length), position);
     }
 
@@ -279,7 +279,7 @@
     public int setBytes(int index, ScatteringByteChannel in, int length) throws IOException {
         ensureAccessible();
         try {
-            return in.read((ByteBuffer) internalNioBuffer().clear().position(index).limit(index + length));
+            return in.read((ByteBuffer) _internalNioBuffer().clear().position(index).limit(index + length));
         } catch (ClosedChannelException ignored) {
             return -1;
         }
@@ -289,7 +289,7 @@
     public int setBytes(int index, FileChannel in, long position, int length) throws IOException {
         ensureAccessible();
         try {
-            return in.read((ByteBuffer) internalNioBuffer().clear().position(index).limit(index + length), position);
+            return in.read((ByteBuffer) _internalNioBuffer().clear().position(index).limit(index + length), position);
         } catch (ClosedChannelException ignored) {
             return -1;
         }
@@ -314,7 +314,7 @@
     @Override
     public ByteBuffer internalNioBuffer(int index, int length) {
         checkIndex(index, length);
-        return (ByteBuffer) internalNioBuffer().clear().position(index).limit(index + length);
+        return (ByteBuffer) _internalNioBuffer().clear().position(index).limit(index + length);
     }
 
     @Override
@@ -535,7 +535,8 @@
         return alloc().heapBuffer(length, maxCapacity()).writeBytes(array, index, length);
     }
 
-    private ByteBuffer internalNioBuffer() {
+    @Override
+    ByteBuffer _internalNioBuffer() {
         ByteBuffer tmpNioBuf = this.tmpNioBuf;
         if (tmpNioBuf == null) {
             this.tmpNioBuf = tmpNioBuf = ByteBuffer.wrap(array);
Index: buffer/src/main/java/io/netty/buffer/PooledByteBufAllocator.java
IDEA additional info:
Subsystem: com.intellij.openapi.diff.impl.patch.CharsetEP
<+>UTF-8
===================================================================
diff --git a/buffer/src/main/java/io/netty/buffer/PooledByteBufAllocator.java b/buffer/src/main/java/io/netty/buffer/PooledByteBufAllocator.java
--- a/buffer/src/main/java/io/netty/buffer/PooledByteBufAllocator.java	(revision 293385859e01ba7501e6f120a0984dd084dfab67)
+++ b/buffer/src/main/java/io/netty/buffer/PooledByteBufAllocator.java	(date 1762899017652)
@@ -403,8 +403,8 @@
             buf = directArena.allocate(cache, initialCapacity, maxCapacity);
         } else {
             buf = PlatformDependent.hasUnsafe() ?
-                    UnsafeByteBufUtil.newUnsafeDirectByteBuf(this, initialCapacity, maxCapacity, true) :
-                    new UnpooledDirectByteBuf(this, initialCapacity, maxCapacity, true);
+                    UnsafeByteBufUtil.newUnsafeDirectByteBuf(this, initialCapacity, maxCapacity) :
+                    new UnpooledDirectByteBuf(this, initialCapacity, maxCapacity);
             onAllocateBuffer(buf, false, false);
         }
         return toLeakAwareBuffer(buf);
Index: buffer/src/test/java/io/netty/buffer/LittleEndianUnsafeNoCleanerDirectByteBufTest.java
IDEA additional info:
Subsystem: com.intellij.openapi.diff.impl.patch.CharsetEP
<+>UTF-8
===================================================================
diff --git a/buffer/src/test/java/io/netty/buffer/LittleEndianUnsafeNoCleanerDirectByteBufTest.java b/buffer/src/test/java/io/netty/buffer/LittleEndianUnsafeNoCleanerDirectByteBufTest.java
--- a/buffer/src/test/java/io/netty/buffer/LittleEndianUnsafeNoCleanerDirectByteBufTest.java	(revision 293385859e01ba7501e6f120a0984dd084dfab67)
+++ b/buffer/src/test/java/io/netty/buffer/LittleEndianUnsafeNoCleanerDirectByteBufTest.java	(date 1762899017655)
@@ -31,6 +31,6 @@
 
     @Override
     protected ByteBuf newBuffer(int length, int maxCapacity) {
-        return new UnpooledUnsafeNoCleanerDirectByteBuf(UnpooledByteBufAllocator.DEFAULT, length, maxCapacity, true);
+        return new UnpooledUnsafeNoCleanerDirectByteBuf(UnpooledByteBufAllocator.DEFAULT, length, maxCapacity);
     }
 }
Index: buffer/src/main/java/io/netty/buffer/AdaptiveByteBufAllocator.java
IDEA additional info:
Subsystem: com.intellij.openapi.diff.impl.patch.CharsetEP
<+>UTF-8
===================================================================
diff --git a/buffer/src/main/java/io/netty/buffer/AdaptiveByteBufAllocator.java b/buffer/src/main/java/io/netty/buffer/AdaptiveByteBufAllocator.java
--- a/buffer/src/main/java/io/netty/buffer/AdaptiveByteBufAllocator.java	(revision 293385859e01ba7501e6f120a0984dd084dfab67)
+++ b/buffer/src/main/java/io/netty/buffer/AdaptiveByteBufAllocator.java	(date 1762899017651)
@@ -112,8 +112,8 @@
         @Override
         public AbstractByteBuf allocate(int initialCapacity, int maxCapacity) {
             return PlatformDependent.hasUnsafe() ?
-                    UnsafeByteBufUtil.newUnsafeDirectByteBuf(allocator, initialCapacity, maxCapacity, false) :
-                    new UnpooledDirectByteBuf(allocator, initialCapacity, maxCapacity, false);
+                    UnsafeByteBufUtil.newUnsafeDirectByteBuf(allocator, initialCapacity, maxCapacity) :
+                    new UnpooledDirectByteBuf(allocator, initialCapacity, maxCapacity);
         }
     }
 }
Index: buffer/src/test/java/io/netty/buffer/BigEndianUnsafeNoCleanerDirectByteBufTest.java
IDEA additional info:
Subsystem: com.intellij.openapi.diff.impl.patch.CharsetEP
<+>UTF-8
===================================================================
diff --git a/buffer/src/test/java/io/netty/buffer/BigEndianUnsafeNoCleanerDirectByteBufTest.java b/buffer/src/test/java/io/netty/buffer/BigEndianUnsafeNoCleanerDirectByteBufTest.java
--- a/buffer/src/test/java/io/netty/buffer/BigEndianUnsafeNoCleanerDirectByteBufTest.java	(revision 293385859e01ba7501e6f120a0984dd084dfab67)
+++ b/buffer/src/test/java/io/netty/buffer/BigEndianUnsafeNoCleanerDirectByteBufTest.java	(date 1762899017655)
@@ -31,6 +31,6 @@
 
     @Override
     protected ByteBuf newBuffer(int length, int maxCapacity) {
-        return new UnpooledUnsafeNoCleanerDirectByteBuf(UnpooledByteBufAllocator.DEFAULT, length, maxCapacity, true);
+        return new UnpooledUnsafeNoCleanerDirectByteBuf(UnpooledByteBufAllocator.DEFAULT, length, maxCapacity);
     }
 }
Index: buffer/src/main/java/io/netty/buffer/AbstractByteBuf.java
IDEA additional info:
Subsystem: com.intellij.openapi.diff.impl.patch.CharsetEP
<+>UTF-8
===================================================================
diff --git a/buffer/src/main/java/io/netty/buffer/AbstractByteBuf.java b/buffer/src/main/java/io/netty/buffer/AbstractByteBuf.java
--- a/buffer/src/main/java/io/netty/buffer/AbstractByteBuf.java	(revision 293385859e01ba7501e6f120a0984dd084dfab67)
+++ b/buffer/src/main/java/io/netty/buffer/AbstractByteBuf.java	(date 1762899017651)
@@ -1496,4 +1496,8 @@
     long _memoryAddress() {
         return isAccessible() && hasMemoryAddress() ? memoryAddress() : 0L;
     }
+
+    ByteBuffer _internalNioBuffer() {
+        return internalNioBuffer(0, maxFastWritableBytes());
+    }
 }
Index: buffer/src/main/java/io/netty/buffer/UnpooledDirectByteBuf.java
IDEA additional info:
Subsystem: com.intellij.openapi.diff.impl.patch.CharsetEP
<+>UTF-8
===================================================================
diff --git a/buffer/src/main/java/io/netty/buffer/UnpooledDirectByteBuf.java b/buffer/src/main/java/io/netty/buffer/UnpooledDirectByteBuf.java
--- a/buffer/src/main/java/io/netty/buffer/UnpooledDirectByteBuf.java	(revision 293385859e01ba7501e6f120a0984dd084dfab67)
+++ b/buffer/src/main/java/io/netty/buffer/UnpooledDirectByteBuf.java	(date 1762899017653)
@@ -45,7 +45,6 @@
     private ByteBuffer tmpNioBuf;
     private int capacity;
     private boolean doNotFree;
-    private final boolean allowSectionedInternalNioBufferAccess;
 
     /**
      * Creates a new direct buffer.
@@ -54,20 +53,6 @@
      * @param maxCapacity     the maximum capacity of the underlying direct buffer
      */
     public UnpooledDirectByteBuf(ByteBufAllocator alloc, int initialCapacity, int maxCapacity) {
-        this(alloc, initialCapacity, maxCapacity, true);
-    }
-
-    /**
-     * Creates a new direct buffer.
-     *
-     * @param initialCapacity the initial capacity of the underlying direct buffer
-     * @param maxCapacity     the maximum capacity of the underlying direct buffer
-     * @param allowSectionedInternalNioBufferAccess
-     * {@code true} if {@link #internalNioBuffer(int, int)} is allowed to be called,
-     * or {@code false} if it should throw an exception.
-     */
-    UnpooledDirectByteBuf(ByteBufAllocator alloc, int initialCapacity, int maxCapacity,
-                          boolean allowSectionedInternalNioBufferAccess) {
         super(maxCapacity);
         ObjectUtil.checkNotNull(alloc, "alloc");
         checkPositiveOrZero(initialCapacity, "initialCapacity");
@@ -79,7 +64,6 @@
 
         this.alloc = alloc;
         setByteBuffer(allocateDirectBuffer(initialCapacity), false);
-        this.allowSectionedInternalNioBufferAccess = allowSectionedInternalNioBufferAccess;
     }
 
     /**
@@ -113,7 +97,6 @@
         doNotFree = !doFree;
         setByteBuffer((slice ? initialBuffer.slice() : initialBuffer).order(ByteOrder.BIG_ENDIAN), false);
         writerIndex(initialCapacity);
-        allowSectionedInternalNioBufferAccess = true;
     }
 
     /**
@@ -617,7 +600,7 @@
         if (length == 0) {
             return;
         }
-        ByteBufUtil.readBytes(alloc(), internal ? internalNioBuffer() : buffer.duplicate(), index, length, out);
+        ByteBufUtil.readBytes(alloc(), internal ? _internalNioBuffer() : buffer.duplicate(), index, length, out);
     }
 
     @Override
@@ -750,13 +733,11 @@
     @Override
     public ByteBuffer internalNioBuffer(int index, int length) {
         checkIndex(index, length);
-        if (!allowSectionedInternalNioBufferAccess) {
-            throw new UnsupportedOperationException("Bug: unsafe access to shared internal chunk buffer");
-        }
-        return (ByteBuffer) internalNioBuffer().clear().position(index).limit(index + length);
+        return (ByteBuffer) _internalNioBuffer().clear().position(index).limit(index + length);
     }
 
-    private ByteBuffer internalNioBuffer() {
+    @Override
+    ByteBuffer _internalNioBuffer() {
         ByteBuffer tmpNioBuf = this.tmpNioBuf;
         if (tmpNioBuf == null) {
             this.tmpNioBuf = tmpNioBuf = buffer.duplicate();
Index: buffer/src/main/java/io/netty/buffer/UnpooledByteBufAllocator.java
IDEA additional info:
Subsystem: com.intellij.openapi.diff.impl.patch.CharsetEP
<+>UTF-8
===================================================================
diff --git a/buffer/src/main/java/io/netty/buffer/UnpooledByteBufAllocator.java b/buffer/src/main/java/io/netty/buffer/UnpooledByteBufAllocator.java
--- a/buffer/src/main/java/io/netty/buffer/UnpooledByteBufAllocator.java	(revision 293385859e01ba7501e6f120a0984dd084dfab67)
+++ b/buffer/src/main/java/io/netty/buffer/UnpooledByteBufAllocator.java	(date 1762899017652)
@@ -179,7 +179,7 @@
             extends UnpooledUnsafeNoCleanerDirectByteBuf {
         InstrumentedUnpooledUnsafeNoCleanerDirectByteBuf(
                 UnpooledByteBufAllocator alloc, int initialCapacity, int maxCapacity) {
-            super(alloc, initialCapacity, maxCapacity, true);
+            super(alloc, initialCapacity, maxCapacity);
         }
 
         @Override
Index: buffer/src/main/java/io/netty/buffer/UnpooledUnsafeDirectByteBuf.java
IDEA additional info:
Subsystem: com.intellij.openapi.diff.impl.patch.CharsetEP
<+>UTF-8
===================================================================
diff --git a/buffer/src/main/java/io/netty/buffer/UnpooledUnsafeDirectByteBuf.java b/buffer/src/main/java/io/netty/buffer/UnpooledUnsafeDirectByteBuf.java
--- a/buffer/src/main/java/io/netty/buffer/UnpooledUnsafeDirectByteBuf.java	(revision 293385859e01ba7501e6f120a0984dd084dfab67)
+++ b/buffer/src/main/java/io/netty/buffer/UnpooledUnsafeDirectByteBuf.java	(date 1762899017654)
@@ -43,11 +43,6 @@
         super(alloc, initialCapacity, maxCapacity);
     }
 
-    UnpooledUnsafeDirectByteBuf(ByteBufAllocator alloc, int initialCapacity, int maxCapacity,
-                                boolean allowSectionedInternalNioBufferAccess) {
-        super(alloc, initialCapacity, maxCapacity, allowSectionedInternalNioBufferAccess);
-    }
-
     /**
      * Creates a new direct buffer by wrapping the specified initial buffer.
      *
Index: buffer/src/main/java/io/netty/buffer/UnpooledUnsafeNoCleanerDirectByteBuf.java
IDEA additional info:
Subsystem: com.intellij.openapi.diff.impl.patch.CharsetEP
<+>UTF-8
===================================================================
diff --git a/buffer/src/main/java/io/netty/buffer/UnpooledUnsafeNoCleanerDirectByteBuf.java b/buffer/src/main/java/io/netty/buffer/UnpooledUnsafeNoCleanerDirectByteBuf.java
--- a/buffer/src/main/java/io/netty/buffer/UnpooledUnsafeNoCleanerDirectByteBuf.java	(revision 293385859e01ba7501e6f120a0984dd084dfab67)
+++ b/buffer/src/main/java/io/netty/buffer/UnpooledUnsafeNoCleanerDirectByteBuf.java	(date 1762899017654)
@@ -21,9 +21,8 @@
 import java.nio.ByteBuffer;
 
 class UnpooledUnsafeNoCleanerDirectByteBuf extends UnpooledUnsafeDirectByteBuf {
-    UnpooledUnsafeNoCleanerDirectByteBuf(ByteBufAllocator alloc, int initialCapacity, int maxCapacity,
-                                         boolean allowSectionedInternalNioBufferAccess) {
-        super(alloc, initialCapacity, maxCapacity, allowSectionedInternalNioBufferAccess);
+    UnpooledUnsafeNoCleanerDirectByteBuf(ByteBufAllocator alloc, int initialCapacity, int maxCapacity) {
+        super(alloc, initialCapacity, maxCapacity);
     }
 
     @Override

franz1981 · 2025-11-12T03:57:33Z

Thanks, I will try with JDK 25 as well, to make sure 🙏
Sadly, I have no Apple machine to check why you got such numbers.

But if you want, run the benchmark with -prof gc to compare the two approaches. I would suggest using the patched one because it should allocate less.

franz1981 · 2025-11-13T06:53:07Z

@chrisvest I think this is not the "right" benchmark, because it reuses the internal buffer, which under certain circumstances can avoid being materialized at all...

That said, I see a regression myself, likely due to bounds checks.

This is not happening in the current version, and it could be (to be verified) a quirk of the benchmark,
i.e. JMH hoisting a few invariants considered "safe" out of the loop?

I have to inspect the asm again to know for sure,
but what I do know is that the HTTP 2 benchmark shows the exact opposite... can you give it a shot?

(see #13783 (comment) as a reminder: this PR is key to seeing the right improvement in that HTTP 2 PR... a linked list of issues!)

franz1981 · 2025-11-13T09:21:23Z

@chrisvest re the performance difference: at first look, the problem is visible by reading the code as well, i.e.

- using AdaptiveByteBuf::internalNioBuffer: it forces allocating the buffer, but once that is done there is no need to compute any real offset, since each internal NIO ByteBuffer already has the right address
- using rootParent's _internalNioBuffer: it requires computing the index twice, for both src and dst

So, in which cases is the second the better option?
I think only if the AdaptiveByteBuf::internalNioBuffer hasn't been allocated yet; indeed, the HTTP 2 benchmark was creating a new duplicate/slice of the buffer to work with, causing it to allocate a fresh internalNioBuffer and harming performance.
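To make the trade-off above concrete, here is a minimal sketch using plain `java.nio.ByteBuffer` only. It is not Netty's code: the class, method, and parameter names (`OffsetCopySketch`, `dstBase`, `srcBase`, and so on) are hypothetical, and the "chunks" are ordinary heap buffers standing in for the shared chunk memory. It assumes Java 16 or later for the absolute bulk `put`.

```java
import java.nio.ByteBuffer;

// Hypothetical illustration; none of these names come from Netty.
final class OffsetCopySketch {
    // Option 1: each buffer holds its own slice of the chunk, so indices
    // are already relative to the slice and need no extra arithmetic.
    // The cost is that the slice object has to be allocated first.
    static void copyViaSlices(ByteBuffer dstSlice, int dstIndex,
                              ByteBuffer srcSlice, int srcIndex, int length) {
        dstSlice.put(dstIndex, srcSlice, srcIndex, length); // Java 16+
    }

    // Option 2: only the chunk-wide buffers are used, so the base offset of
    // each buffer inside its chunk is added for both src and dst.
    static void copyViaRoots(ByteBuffer dstRoot, int dstBase, int dstIndex,
                             ByteBuffer srcRoot, int srcBase, int srcIndex, int length) {
        dstRoot.put(dstBase + dstIndex, srcRoot, srcBase + srcIndex, length); // Java 16+
    }

    public static void main(String[] args) {
        ByteBuffer srcRoot = ByteBuffer.allocate(64);
        ByteBuffer dstRoot = ByteBuffer.allocate(64);
        for (int i = 0; i < 64; i++) {
            srcRoot.put(i, (byte) i);
        }
        // Source buffer lives at [16, 32) of its chunk, destination at [8, 24).
        copyViaRoots(dstRoot, 8, 2, srcRoot, 16, 4, 3);
        ByteBuffer viaSlices = ByteBuffer.allocate(64);
        copyViaSlices(viaSlices.slice(8, 16), 2, srcRoot.slice(16, 16), 4, 3);
        // Both routes should have written the same bytes to the same place.
        System.out.println(dstRoot.equals(viaSlices));
    }
}
```

Both methods move the same bytes; the difference is where the cost sits. The slice route pays once for allocating the slices, while the root route pays the two additions on every call but never needs the per-buffer NIO object.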

chrisvest · 2025-12-10T00:21:21Z

@normanmaurer @franz1981 Please take a look.

franz1981

LGTM!

normanmaurer · 2025-12-16T07:54:53Z

@chrisvest @franz1981 do we want to also port this to 4.1?

Motivation:
The setBytes method was getting a sliced nioBuffer of its source, which typically causes allocation. On Java 16 onwards, we can instead copy using an absolutely offsetted `put` method, and forgo allocating a duplicate ByteBuffer instance that is otherwise needed for isolating the position field.

Modification:
- Make use of the absolutely offsetted put method in setBytes, when it's available.
- Use the underlying ByteBuffer of the shared chunk where possible to avoid multiple bounds checks.
- Add a benchmark targeting the setBytes method that takes a ByteBuf source.
- Change a few benchmarks to use the default allocator when pooling is enabled.

Result:
Faster setBytes in certain cases.

Co-authored-by: Chris Vest <christianvest_hansen@apple.com>
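The motivation above contrasts two ways of copying into a `ByteBuffer` at a given index. The following is a minimal sketch of that contrast with plain `java.nio` only; it is not the PR's actual code, and `AbsolutePutSketch` and its method names are hypothetical. The second method assumes Java 16 or later, where `ByteBuffer.put(int, ByteBuffer, int, int)` exists.

```java
import java.nio.ByteBuffer;

// Hypothetical illustration; not taken from AdaptivePoolingAllocator.
final class AbsolutePutSketch {
    // Older style: the relative put moves the position of both buffers,
    // so duplicates are allocated to keep the originals untouched.
    static void copyWithDuplicate(ByteBuffer dst, int index, ByteBuffer src) {
        ByteBuffer tmp = dst.duplicate(); // extra ByteBuffer instance
        tmp.clear();
        tmp.position(index);
        tmp.put(src.duplicate());         // second duplicate protects src's position
    }

    // Java 16+: the absolute bulk put takes explicit offsets and leaves
    // the position of both buffers unchanged, so no duplicate is needed.
    static void copyWithAbsolutePut(ByteBuffer dst, int index, ByteBuffer src) {
        dst.put(index, src, src.position(), src.remaining());
    }

    public static void main(String[] args) {
        ByteBuffer src = ByteBuffer.wrap(new byte[] {1, 2, 3, 4, 5});
        src.position(1); // only bytes 2..5 are "remaining"
        ByteBuffer a = ByteBuffer.allocate(8);
        ByteBuffer b = ByteBuffer.allocate(8);
        copyWithDuplicate(a, 2, src);
        copyWithAbsolutePut(b, 2, src);
        // Both copies should hold the same content.
        System.out.println(a.equals(b));
    }
}
```

Whether the duplicate allocations matter in practice depends on the JIT's escape analysis and the call site, which is why the thread relies on benchmarks and `-prof gc` rather than on reasoning alone.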

chrisvest · 2025-12-16T17:43:06Z

@normanmaurer I don't think we need to back port this to 4.1.

Motivation:
The adaptive allocator performs costly atomic operations in the thread-local path, which reduces its performance.

Modification:
Reduce the number of atomic operations in the thread-local allocation's fast path.

Result:
Fixes #15571

These are the different variations I want to test:
- [x] Uses unguarded `Recycler`s
- [x] Implements "compressed" local free list (LIFO)
- [x] Use an MPSC queue for the reuse chunk queue in the thread-local case **NO VISIBLE IMPROVEMENTS**
- [x] Guards `nextInLine`'s `getAndSet` with a null check via volatile `get` first, since size classed chunks rarely end up in `nextInLine` (i.e. it is mostly `null`) **NO VISIBLE IMPROVEMENTS**
- [x] Implements a var handle based `MpscIntQueue` (done at 1c4e1e4) **NO VISIBLE IMPROVEMENTS**
- [x] Remove the live/raw ref cnt as mentioned at #15736 (comment)
- [ ] Remove the ref count for size classed chunks (see 8953bbe and 8cb1bf0)
- [ ] Use the "pinned" Recycler instead of the `FastThreadLocal`-based one
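The "guard `getAndSet` with a volatile `get`" variation in the list above can be sketched as follows. This is only an illustration of the general pattern with `AtomicReference`; the class and method names are hypothetical and it is not the allocator's actual `nextInLine` handling.

```java
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical illustration of guarding an atomic swap with a plain volatile read.
final class GuardedGetAndSetSketch {
    // Returns the value held in the slot and clears it, or null if the slot is empty.
    static <T> T takeIfPresent(AtomicReference<T> slot) {
        // Volatile read first: when the slot is usually empty, this skips
        // the more expensive atomic read-modify-write entirely.
        if (slot.get() == null) {
            return null;
        }
        // Another thread may have emptied the slot in between, so the
        // result can still be null and callers must handle that.
        return slot.getAndSet(null);
    }

    public static void main(String[] args) {
        AtomicReference<String> slot = new AtomicReference<>();
        System.out.println(takeIfPresent(slot)); // slot is empty here
        slot.set("chunk");
        System.out.println(takeIfPresent(slot)); // slot is cleared by this call
    }
}
```

The guard only pays off when the slot is empty most of the time; the list above records that this variation showed no visible improvement in that PR's measurements.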

chrisvest requested review from franz1981 and normanmaurer October 3, 2025 21:09

chrisvest mentioned this pull request Oct 3, 2025

Adaptive's buffers setBytes performance is held back by NIO Buffers allocations #15723

Closed

diegolovison reviewed Oct 3, 2025

buffer/src/main/java/io/netty/buffer/AdaptivePoolingAllocator.java

chrisvest force-pushed the 4.2-setbytes branch from d94c605 to 3c5a614 Compare October 3, 2025 22:38

franz1981 requested changes Oct 4, 2025

franz1981 requested changes Oct 7, 2025

This was referenced Oct 10, 2025

Improve adaptive allocator thread local performance #15741

Merged

Simplify reference counting #15764

Merged

normanmaurer added this to the 4.2.8.Final milestone Oct 27, 2025

Merge branch '4.2' into 4.2-setbytes

ff41f56

chrisvest added 2 commits November 11, 2025 10:10

Merge branch '4.2' into 4.2-setbytes

07420b9

Fix assertion in test

2933858

chrisvest added 3 commits November 18, 2025 12:08

Merge branch '4.2' into 4.2-setbytes

9b7eceb

Use absolute put in AdaptiveByteBuf.setBytes where possible

5cad0dc

Merge branch '4.2' into 4.2-setbytes

dc348b5

chrisvest requested a review from franz1981 December 10, 2025 00:21

franz1981 approved these changes Dec 15, 2025

normanmaurer modified the milestones: 4.2.8.Final, 4.2.9.Final, 4.2.10.Final Dec 15, 2025

normanmaurer merged commit 5fe1eea into netty:4.2 Dec 16, 2025
18 of 19 checks passed

chrisvest deleted the 4.2-setbytes branch December 16, 2025 17:40

chrisvest mentioned this pull request Jan 6, 2026

Improve adaptive allocator thread local performance (#15741) #16107

Merged

8 tasks

4 participants
