
Make AdaptiveByteBuf.setBytes faster #15736

Merged
normanmaurer merged 7 commits into netty:4.2 from chrisvest:4.2-setbytes
Dec 16, 2025

Conversation

@chrisvest
Member

@chrisvest chrisvest commented Oct 3, 2025

Motivation:
The setBytes method was obtaining a sliced nioBuffer of its source, which typically causes an allocation. From Java 16 onwards, we can instead copy using the absolute-offset bulk put method, and forgo allocating a duplicate ByteBuffer instance that is otherwise needed to isolate the position field.

Modification:

  • Make use of the absolute-offset put method in setBytes, when it's available.
  • Use the underlying ByteBuffer of the shared chunk where possible to avoid multiple bounds checks.
  • Add a benchmark targeting the setBytes method that takes a ByteBuf source.
  • Change a few benchmarks to use the default allocator when pooling is enabled.

Result:
Faster setBytes in certain cases.
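
As a rough illustration of the difference described above, the following standalone sketch copies bytes between two plain `ByteBuffer`s both ways. The class and method names (`AbsolutePutSketch`, `copyWithDuplicate`, `copyAbsolute`) are made up for this example and are not Netty code; the only API relied on is `ByteBuffer.put(int, ByteBuffer, int, int)`, which is available from Java 16.

```java
import java.nio.ByteBuffer;

public class AbsolutePutSketch {
    // Older style: duplicate both buffers so that moving position/limit
    // does not disturb the state seen by other users of the originals.
    public static void copyWithDuplicate(ByteBuffer dst, int dstIndex,
                                         ByteBuffer src, int srcIndex, int length) {
        ByteBuffer srcView = src.duplicate(); // allocates a new ByteBuffer object
        srcView.clear();
        srcView.position(srcIndex);
        srcView.limit(srcIndex + length);
        ByteBuffer dstView = dst.duplicate(); // allocates another one
        dstView.clear();
        dstView.position(dstIndex);
        dstView.put(srcView);                 // relative bulk put
    }

    // Java 16+: absolute bulk put, which reads and writes at explicit offsets
    // and leaves the position and limit of both buffers untouched.
    public static void copyAbsolute(ByteBuffer dst, int dstIndex,
                                    ByteBuffer src, int srcIndex, int length) {
        dst.put(dstIndex, src, srcIndex, length);
    }

    public static void main(String[] args) {
        ByteBuffer src = ByteBuffer.allocate(16);
        for (int i = 0; i < 16; i++) {
            src.put(i, (byte) i);
        }
        ByteBuffer a = ByteBuffer.allocate(16);
        ByteBuffer b = ByteBuffer.allocate(16);
        copyWithDuplicate(a, 8, src, 4, 4);
        copyAbsolute(b, 8, src, 4, 4);
        System.out.println("same contents: " + a.equals(b));
        System.out.println("source position untouched: " + (src.position() == 0));
    }
}
```

Both methods produce the same bytes in the destination; the second one simply avoids creating the temporary views.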

tmp.put(src);
int length = src.remaining();
checkIndex(index, length);
ByteBuffer tmp = internalNioBuffer();
Contributor

@franz1981 franz1981 Oct 4, 2025

Why not use the rootParent NIO buffer(s) (from each adaptive buffer, if available)?
It is shared among size-classed chunks and would save allocating per-buffer NIO buffers, which become garbage as soon as the buffers are released (AdaptiveByteBufs are just empty shells, afaik).

Member Author

I tried this but it was a lot slower, by like 50%. Not sure what was going on, but in theory you're right.

Contributor

@franz1981 franz1981 Oct 4, 2025

Maybe it was doubling the bounds check? Debugging it or checking the generated assembly could help.

Contributor

If you have the version with the root parent, link it here and I will take a look tomorrow.

Member Author

Here's the version I had that used the rootParent:

        @Override
        public ByteBuf setBytes(int index, ByteBuf src, int srcIndex, int length) {
            checkIndex(index, length);
            if (src instanceof AdaptiveByteBuf && PlatformDependent.javaVersion() >= 16) {
                AdaptiveByteBuf srcBuf = (AdaptiveByteBuf) src;
                AbstractByteBuf dstRoot = rootParent();
                ByteBuffer dstBuffer = dstRoot.internalNioBuffer(0, dstRoot.maxFastWritableBytes());
                AbstractByteBuf srcRoot = srcBuf.rootParent();
                ByteBuffer srcBuffer = srcRoot.internalNioBuffer(0, srcRoot.maxFastWritableBytes());
                PlatformDependent.absolutePut(dstBuffer, idx(index), srcBuffer, srcBuf.idx(srcIndex), length);
            } else {
                ByteBuffer tmp = (ByteBuffer) internalNioBuffer();
                tmp.clear().position(index);
                tmp.put(src.nioBuffer(srcIndex, length));
            }
            return this;
        }

Contributor

@franz1981 franz1981 Oct 7, 2025

I'm getting

Bug: unsafe access to shared internal chunk buffer

so I'm adding this too

    @Override
    public ByteBuffer internalNioBuffer(int index, int length) {
        if (!allowSectionedInternalNioBufferAccess) {
            // we can only return the internalNioBuffer if the whole buffer is requested
            if (length != capacity && index != 0) {
                throw new UnsupportedOperationException("Bug: unsafe access to shared internal chunk buffer");
            }
            return internalNioBuffer();
        }
        checkIndex(index, length);
        return internalNioBuffer().clear().position(index).limit(index + length);
    }

It's still not very safe .-.
But I've moved the checkIndex later, since I didn't want to perform both the accessibility and bounds checks again here.
In short: I would love for us to have a simple internalNioBuffer() method which performs neither bounds nor accessibility checks, and is never meant to provide a modifiable (in terms of position/limit) internal buffer.

Contributor

@franz1981 franz1981 Oct 7, 2025

With the changes at #15736 (comment) to enable using the internalNioBuffer and

        @Override
        public ByteBuf setBytes(int index, ByteBuf src, int srcIndex, int length) {
            checkIndex(index, length);
            if (src instanceof AdaptiveByteBuf && PlatformDependent.javaVersion() >= 16) {
                AdaptiveByteBuf srcBuf = (AdaptiveByteBuf) src;
                // bound check src as well
                srcBuf.checkIndex(srcIndex, length);
                AbstractByteBuf dstRoot = rootParent;
                AbstractByteBuf srcRoot = srcBuf.rootParent;
                // TODO why we have to pay again for accessibility and bound checks?
                ByteBuffer dstBuffer = dstRoot.internalNioBuffer(0, dstRoot.maxFastWritableBytes());
                ByteBuffer srcBuffer = srcRoot.internalNioBuffer(0, srcRoot.maxFastWritableBytes());
                PlatformDependent.absolutePut(dstBuffer, idx(index), srcBuffer, srcBuf.idx(srcIndex), length);
            } else {
                ByteBuffer tmp = (ByteBuffer) internalNioBuffer();
                tmp.clear().position(index);
                tmp.put(src.nioBuffer(srcIndex, length));
            }
            return this;
        }

I'm getting the best performance so far and no allocations, so I would go for it.
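
For readers unfamiliar with the index translation used in the snippet above, here is a minimal self-contained sketch of the same idea outside Netty. `SharedChunkCopySketch` and its `View` class are hypothetical stand-ins for a buffer that is a window into a larger shared chunk; only the `idx(...)` name mirrors the snippet, and the rest is assumed for illustration. It needs Java 16+ for the absolute bulk `put`.

```java
import java.nio.ByteBuffer;

public class SharedChunkCopySketch {
    /** Hypothetical stand-in for a pooled buffer: a window into a larger shared chunk. */
    public static final class View {
        private final ByteBuffer chunk; // shared; its position and limit are never changed here
        private final int offset;       // where this view starts inside the chunk
        private final int length;       // capacity of this view

        public View(ByteBuffer chunk, int offset, int length) {
            this.chunk = chunk;
            this.offset = offset;
            this.length = length;
        }

        // Translate a view-relative index into a chunk-relative one.
        private int idx(int index) {
            return offset + index;
        }

        private void checkIndex(int index, int len) {
            if (index < 0 || len < 0 || index > length - len) {
                throw new IndexOutOfBoundsException("index: " + index + ", length: " + len);
            }
        }

        public View setBytes(int index, View src, int srcIndex, int len) {
            checkIndex(index, len);        // bounds of the destination view
            src.checkIndex(srcIndex, len); // bounds of the source view
            // Absolute bulk put on the shared chunks: no duplicate()/slice(), no position changes.
            chunk.put(idx(index), src.chunk, src.idx(srcIndex), len);
            return this;
        }

        public View setByte(int index, int value) {
            checkIndex(index, 1);
            chunk.put(idx(index), (byte) value);
            return this;
        }

        public byte getByte(int index) {
            checkIndex(index, 1);
            return chunk.get(idx(index));
        }
    }

    public static void main(String[] args) {
        ByteBuffer chunkA = ByteBuffer.allocateDirect(64);
        ByteBuffer chunkB = ByteBuffer.allocateDirect(64);
        View src = new View(chunkA, 16, 8);
        View dst = new View(chunkB, 32, 8);
        for (int i = 0; i < 8; i++) {
            src.setByte(i, i + 1);
        }
        dst.setBytes(2, src, 0, 4);
        System.out.println(dst.getByte(2) + " " + dst.getByte(5));
    }
}
```

The view-level checks run once per side, and the chunk-level `put` performs its own range check, which is the "paying twice" concern raised in the TODO above.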

image

@franz1981
Contributor

franz1981 commented Oct 6, 2025

Still running the HTTP/2 benchmarks, and the numbers now look better.

The slice creation is still dominant:
image

but the actual copy has become cheaper - especially with --jvmArgsAppend="-Dio.netty.noUnsafe=true":
image

While benchmarking/profiling this I've found something weird:
disabling Unsafe causes the VarHandle ref-counter methods not to be inlined, because io/netty/util/internal/ReferenceCountUpdater.release is not being inlined.

image

Reading https://wiki.openjdk.org/display/HotSpot/Server+Compiler+Inlining+Messages it means

already compiled into a big method: there is already compiled code
for the method that is called from the call site and the code that was
generated for is larger than InlineSmallCode

and

already compiled into a medium method: there is already compiled code
for the method that is called from the call site and the code that was
generated for is larger than InlineSmallCode / 4

Adding https://github.com/openjdk/jdk/blob/ba7bf43c76c94bea85dbbd865794184b7ee0cc86/src/hotspot/share/opto/bytecodeInfo.cpp#L276 for confirmation.

On JDK 24, InlineSmallCode is 2500, so for the method to be reliably inlined its compiled code has to be smaller than InlineSmallCode / 4 = 625 bytes.

Adding -XX:+UnlockDiagnosticVMOptions -XX:+PrintInlining shows:

image

FYI, disabling VarHandle still causes a similar issue with AtomicIntegerFieldUpdater$AtomicIntegerFieldUpdaterImpl - same old same old.
This is happening because the assembly size of io/netty/util/internal/ReferenceCountUpdater.release is just too big (if not using Unsafe), and the same applies to the atomic integer updater variant.

with noUnsafe=true:

============================= C2-compiled nmethod ==============================
----------------------------------- Assembly -----------------------------------

Compiled method (c2) 592 2313       4       io.netty.util.internal.ReferenceCountUpdater::release (50 bytes)
 total in heap  [0x00007fa1102d1d88,0x00007fa1102d2d30] = 4008
 relocation     [0x00007fa1102d1e60,0x00007fa1102d1f80] = 288
 main code      [0x00007fa1102d1f80,0x00007fa1102d2bc0] = 3136
 stub code      [0x00007fa1102d2bc0,0x00007fa1102d2c10] = 80
 oops           [0x00007fa1102d2c10,0x00007fa1102d2c60] = 80
 metadata       [0x00007fa1102d2c60,0x00007fa1102d2d30] = 208
 immutable data [0x00007fa0280d6870,0x00007fa0280d78c0] = 4176
 dependencies   [0x00007fa0280d6870,0x00007fa0280d68a8] = 56
 nul chk table  [0x00007fa0280d68a8,0x00007fa0280d68e8] = 64
 handler table  [0x00007fa0280d68e8,0x00007fa0280d69a8] = 192
 scopes pcs     [0x00007fa0280d69a8,0x00007fa0280d7228] = 2176
 scopes data    [0x00007fa0280d7228,0x00007fa0280d78c0] = 1688

whilst without it (i.e. using Unsafe):

============================= C2-compiled nmethod ==============================
----------------------------------- Assembly -----------------------------------

Compiled method (c2) 460 1912       4       io.netty.util.internal.ReferenceCountUpdater::release (50 bytes)
 total in heap  [0x00007f7928267888,0x00007f7928267ca0] = 1048
 relocation     [0x00007f7928267960,0x00007f79282679a8] = 72
 main code      [0x00007f79282679c0,0x00007f7928267bd0] = 528
 stub code      [0x00007f7928267bd0,0x00007f7928267be8] = 24
 oops           [0x00007f7928267be8,0x00007f7928267bf0] = 8
 metadata       [0x00007f7928267bf0,0x00007f7928267ca0] = 176
 immutable data [0x00007f785c006070,0x00007f785c0063d0] = 864
 dependencies   [0x00007f785c006070,0x00007f785c0060a0] = 48
 scopes pcs     [0x00007f785c0060a0,0x00007f785c006260] = 448
 scopes data    [0x00007f785c006260,0x00007f785c0063d0] = 368

The reason it pops up here is that release is also being compiled for the rawCnt != 2 case, which makes it fatter. That happens because adaptive chunks get retained/released, and because the data buffer in the benchmark keeps getting retained/released (without being deallocated), which is what #13783 is causing instead (in the real world it won't be a problem, since we don't expect data to be reused over and over!).
The way chunks use the reference count is addressed by #15571.
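
To make the threshold arithmetic concrete, here is a small sketch that classifies the two "main code" sizes from the dumps above against the quoted InlineSmallCode value. `InlineThresholdSketch.classify` is a hypothetical helper that only mirrors the two wiki messages quoted earlier; it assumes the "main code" figure approximates the size HotSpot compares, and it ignores the other conditions (such as call frequency) that the real inliner also takes into account.

```java
public class InlineThresholdSketch {
    /**
     * Classifies an already-compiled callee by compiled code size,
     * following the "big method" and "medium method" messages quoted above.
     */
    public static String classify(int compiledCodeBytes, int inlineSmallCode) {
        if (compiledCodeBytes > inlineSmallCode) {
            return "big";
        }
        if (compiledCodeBytes > inlineSmallCode / 4) {
            return "medium";
        }
        return "small";
    }

    public static void main(String[] args) {
        int inlineSmallCode = 2500; // value quoted above for JDK 24
        System.out.println("noUnsafe=true, 3136 bytes: " + classify(3136, inlineSmallCode));
        System.out.println("Unsafe,         528 bytes: " + classify(528, inlineSmallCode));
    }
}
```

With these numbers, 3136 bytes exceeds 2500 and lands in the "big" bucket, while 528 bytes stays below 625.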

It looks like the reason ReferenceCountUpdater::release is that big is the bimorphic inlining of varHandle(), see

  0x00007fd0142d3656:   mov    %rsi,%r11                    ;*invokevirtual getRawRefCnt {reexecute=0 rethrow=0 return_oop=0}
                                                            ; - io.netty.util.internal.ReferenceCountUpdater::release@2 (line 130)
  0x00007fd0142d3659:   mov    0x8(%r11),%r9d
  0x00007fd0142d365d:   cmp    $0x10aed20,%r9d              ;   {metadata('io/netty/buffer/AbstractReferenceCountedByteBuf$3')}
  0x00007fd0142d3664:   je     0x00007fd0142d367f
  0x00007fd0142d3666:   cmp    $0x10b5090,%r9d              ;   {metadata('io/netty/buffer/AdaptivePoolingAllocator$Chunk$3')}
  0x00007fd0142d366d:   jne    0x00007fd0142d3d34
  0x00007fd0142d3673:   movabs $0x45a840548,%r8             ;   {oop(a 'java/lang/invoke/VarHandleInts$FieldInstanceReadWrite'{0x000000045a840548})}
  0x00007fd0142d367d:   jmp    0x00007fd0142d3689
  0x00007fd0142d367f:   movabs $0x45a840010,%r8             ;*invokevirtual varHandle {reexecute=0 rethrow=0 return_oop=0}
                                                            ; - io.netty.util.internal.VarHandleReferenceCountUpdater::getRawRefCnt@1 (line 41)
                                                            ; - io.netty.util.internal.ReferenceCountUpdater::release@2 (line 130)
                                                            ;   {oop(a 'java/lang/invoke/VarHandleInts$FieldInstanceReadWrite'{0x000000045a840010})}
  0x00007fd0142d3689:   movzbl 0xc(%r8),%r10d
  0x00007fd0142d368e:   test   %r10d,%r10d

where io/netty/buffer/AdaptivePoolingAllocator$Chunk$3 and io/netty/buffer/AbstractReferenceCountedByteBuf$3 are the two concrete types observed while executing

    protected final int getRawRefCnt(T refCnt) {
        return (int) varHandle().get(refCnt);
    }

@yawkat suggested franz1981#1 a long time ago to fix this problem.
wdyt @chrisvest?

This way, since there are no overridden varHandle() implementations, the problem should go away:

  • there will be a single type implementing varHandle
  • no need of bimorphic inlining
  • compiled assembly "should" (needs checking) become smaller
  • the problem should disappear
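
One possible shape of "a single type implementing varHandle" is sketched below. This is only an illustration of the general idea, not the content of the linked change: `SingleUpdaterSketch`, `Updater`, `BufLike` and `ChunkLike` are hypothetical names, and the sketch says nothing about whether the JIT then treats the handle held in a field as a constant.

```java
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;

public class SingleUpdaterSketch {
    // Two unrelated holder types, standing in for "a buffer" and "a chunk".
    public static final class BufLike {
        volatile int refCnt = 2;
    }

    public static final class ChunkLike {
        volatile int refCnt = 4;
    }

    /**
     * One concrete updater class: the VarHandle is passed in as data instead of
     * being supplied by an overridden varHandle() method in per-owner subclasses,
     * so callers only ever see a single receiver type.
     */
    public static final class Updater<T> {
        private final VarHandle handle;

        public Updater(Class<T> owner) {
            try {
                handle = MethodHandles.lookup().findVarHandle(owner, "refCnt", int.class);
            } catch (ReflectiveOperationException e) {
                throw new IllegalStateException(e);
            }
        }

        public int getRawRefCnt(T instance) {
            return (int) handle.get(instance);
        }
    }

    public static void main(String[] args) {
        Updater<BufLike> bufUpdater = new Updater<>(BufLike.class);
        Updater<ChunkLike> chunkUpdater = new Updater<>(ChunkLike.class);
        // Both updaters are instances of the same class.
        System.out.println(bufUpdater.getClass() == chunkUpdater.getClass());
        System.out.println(bufUpdater.getRawRefCnt(new BufLike()));
        System.out.println(chunkUpdater.getRawRefCnt(new ChunkLike()));
    }
}
```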

Another "dirty" solution is to give io/netty/buffer/AdaptivePoolingAllocator$Chunk its own overridden method(s), e.g. release: this moves the type check for io/netty/buffer/AdaptivePoolingAllocator$Chunk earlier in the call to release, making the io/netty/buffer/AbstractReferenceCountedByteBuf::release one smaller,
i.e.

                    // on Chunk
                    updater = new VarHandleReferenceCountUpdater<Chunk>() {
                        @Override
                        protected VarHandle varHandle() {
                            return (VarHandle) REFCNT_FIELD_VH;
                        }

                        @Override
                        public boolean release(final Chunk instance) {
                            int rawCnt = getRawRefCnt(instance);
                            return rawCnt == 2 ? tryFinalRelease0(instance, 2) || retryRelease0(instance, 1)
                                    : nonFinalRelease0(instance, 1, rawCnt, toLiveRealRefCnt(rawCnt, 1));
                        }
                    };

@chrisvest
Member Author

I'm up for improving the way reference counting works; one implementation used across (if possible), and simpler algorithm with no count/raw count distinction. But let's do those in separate PRs.

@franz1981
Contributor

one implementation used across (if possible), and simpler algorithm with no count/raw count distinction. But let's do those in separate PRs

OK, I'll leave this in good hands then ;)
I'll focus on the adaptive pool and on analyzing what's wrong with the root parent version of this PR instead.

Contributor

@franz1981 franz1981 left a comment

I would improve the way we expose internalNioBuffer to avoid paying twice for bounds and accessibility checks, but at the moment that's the most performant version in my tests.

@normanmaurer
Member

@chrisvest @franz1981 what is the status of this one ?

@franz1981
Contributor

franz1981 commented Oct 27, 2025

Let's say that the ref cnt PR from Chris will change what we would observe here.
I could run a few tests after rebasing this on top of that work and see how it looks, if needed.

@franz1981
Contributor

franz1981 commented Nov 7, 2025

@chrisvest I think after #15764 this can make progress and use something like what I made at #15736 (comment)

@chrisvest
Member Author

@franz1981 I tried the attached patch, but I'm finding that it's actually slower in the ByteBufCopy2Benchmark on Java 25:

This PR currently:

Benchmark                       (directByteBuf)  (size)   Mode  Cnt          Score         Error  Units
ByteBufCopy2Benchmark.setBytes             true       7  thrpt   20  187270276.915 ± 1932878.488  ops/s
ByteBufCopy2Benchmark.setBytes             true      36  thrpt   20  175995791.002 ±  640168.287  ops/s
ByteBufCopy2Benchmark.setBytes             true     128  thrpt   20  143235411.817 ± 3168283.176  ops/s
ByteBufCopy2Benchmark.setBytes             true     512  thrpt   20   88840177.260 ±  949277.638  ops/s

The patch:

Benchmark                       (directByteBuf)  (size)   Mode  Cnt          Score         Error  Units
ByteBufCopy2Benchmark.setBytes             true       7  thrpt   20  155487223.856 ±  695797.841  ops/s
ByteBufCopy2Benchmark.setBytes             true      36  thrpt   20  142725234.580 ± 1837994.419  ops/s
ByteBufCopy2Benchmark.setBytes             true     128  thrpt   20  118052653.832 ±  270108.505  ops/s
ByteBufCopy2Benchmark.setBytes             true     512  thrpt   20   70395815.138 ±  169314.180  ops/s
A patch attempting the approach in #15736 (comment)
Subject: [PATCH] Use root parent ByteBuffers directly
---
Index: buffer/src/main/java/io/netty/buffer/UnsafeByteBufUtil.java
IDEA additional info:
Subsystem: com.intellij.openapi.diff.impl.patch.CharsetEP
<+>UTF-8
===================================================================
diff --git a/buffer/src/main/java/io/netty/buffer/UnsafeByteBufUtil.java b/buffer/src/main/java/io/netty/buffer/UnsafeByteBufUtil.java
--- a/buffer/src/main/java/io/netty/buffer/UnsafeByteBufUtil.java	(revision 293385859e01ba7501e6f120a0984dd084dfab67)
+++ b/buffer/src/main/java/io/netty/buffer/UnsafeByteBufUtil.java	(date 1762899017654)
@@ -689,14 +689,13 @@
     }
 
     static UnpooledUnsafeDirectByteBuf newUnsafeDirectByteBuf(
-            ByteBufAllocator alloc, int initialCapacity, int maxCapacity,
-            boolean allowSectionedInternalNioBufferAccess) {
+            ByteBufAllocator alloc, int initialCapacity, int maxCapacity) {
         if (PlatformDependent.useDirectBufferNoCleaner()) {
             return new UnpooledUnsafeNoCleanerDirectByteBuf(
-                    alloc, initialCapacity, maxCapacity, allowSectionedInternalNioBufferAccess);
+                    alloc, initialCapacity, maxCapacity);
         }
         return new UnpooledUnsafeDirectByteBuf(
-                alloc, initialCapacity, maxCapacity, allowSectionedInternalNioBufferAccess);
+                alloc, initialCapacity, maxCapacity);
     }
 
     private UnsafeByteBufUtil() { }
Index: buffer/src/main/java/io/netty/buffer/AdaptivePoolingAllocator.java
IDEA additional info:
Subsystem: com.intellij.openapi.diff.impl.patch.CharsetEP
<+>UTF-8
===================================================================
diff --git a/buffer/src/main/java/io/netty/buffer/AdaptivePoolingAllocator.java b/buffer/src/main/java/io/netty/buffer/AdaptivePoolingAllocator.java
--- a/buffer/src/main/java/io/netty/buffer/AdaptivePoolingAllocator.java	(revision 293385859e01ba7501e6f120a0984dd084dfab67)
+++ b/buffer/src/main/java/io/netty/buffer/AdaptivePoolingAllocator.java	(date 1762899017652)
@@ -1687,11 +1687,15 @@
         @Override
         public ByteBuf setBytes(int index, ByteBuf src, int srcIndex, int length) {
             checkIndex(index, length);
-            ByteBuffer tmp = internalNioBuffer();
             if (src instanceof AdaptiveByteBuf && PlatformDependent.javaVersion() >= 16) {
                 AdaptiveByteBuf srcBuf = (AdaptiveByteBuf) src;
-                PlatformDependent.absolutePut(tmp, index, srcBuf.internalNioBuffer(), srcIndex, length);
+//                ByteBuffer tmp = internalNioBuffer();
+//                PlatformDependent.absolutePut(tmp, index, srcBuf.internalNioBuffer(), srcIndex, length);
+                ByteBuffer dstBuffer = rootParent()._internalNioBuffer();
+                ByteBuffer srcBuffer = srcBuf.rootParent()._internalNioBuffer();
+                PlatformDependent.absolutePut(dstBuffer, idx(index), srcBuffer, srcBuf.idx(srcIndex), length);
             } else {
+                ByteBuffer tmp = internalNioBuffer();
                 tmp.position(index);
                 tmp.put(src.nioBuffer(srcIndex, length));
             }
Index: buffer/src/main/java/io/netty/buffer/UnpooledHeapByteBuf.java
IDEA additional info:
Subsystem: com.intellij.openapi.diff.impl.patch.CharsetEP
<+>UTF-8
===================================================================
diff --git a/buffer/src/main/java/io/netty/buffer/UnpooledHeapByteBuf.java b/buffer/src/main/java/io/netty/buffer/UnpooledHeapByteBuf.java
--- a/buffer/src/main/java/io/netty/buffer/UnpooledHeapByteBuf.java	(revision 293385859e01ba7501e6f120a0984dd084dfab67)
+++ b/buffer/src/main/java/io/netty/buffer/UnpooledHeapByteBuf.java	(date 1762899017653)
@@ -213,7 +213,7 @@
         ensureAccessible();
         ByteBuffer tmpBuf;
         if (internal) {
-            tmpBuf = internalNioBuffer();
+            tmpBuf = _internalNioBuffer();
         } else {
             tmpBuf = ByteBuffer.wrap(array);
         }
@@ -222,7 +222,7 @@
 
     private int getBytes(int index, FileChannel out, long position, int length, boolean internal) throws IOException {
         ensureAccessible();
-        ByteBuffer tmpBuf = internal ? internalNioBuffer() : ByteBuffer.wrap(array);
+        ByteBuffer tmpBuf = internal ? _internalNioBuffer() : ByteBuffer.wrap(array);
         return out.write((ByteBuffer) tmpBuf.clear().position(index).limit(index + length), position);
     }
 
@@ -279,7 +279,7 @@
     public int setBytes(int index, ScatteringByteChannel in, int length) throws IOException {
         ensureAccessible();
         try {
-            return in.read((ByteBuffer) internalNioBuffer().clear().position(index).limit(index + length));
+            return in.read((ByteBuffer) _internalNioBuffer().clear().position(index).limit(index + length));
         } catch (ClosedChannelException ignored) {
             return -1;
         }
@@ -289,7 +289,7 @@
     public int setBytes(int index, FileChannel in, long position, int length) throws IOException {
         ensureAccessible();
         try {
-            return in.read((ByteBuffer) internalNioBuffer().clear().position(index).limit(index + length), position);
+            return in.read((ByteBuffer) _internalNioBuffer().clear().position(index).limit(index + length), position);
         } catch (ClosedChannelException ignored) {
             return -1;
         }
@@ -314,7 +314,7 @@
     @Override
     public ByteBuffer internalNioBuffer(int index, int length) {
         checkIndex(index, length);
-        return (ByteBuffer) internalNioBuffer().clear().position(index).limit(index + length);
+        return (ByteBuffer) _internalNioBuffer().clear().position(index).limit(index + length);
     }
 
     @Override
@@ -535,7 +535,8 @@
         return alloc().heapBuffer(length, maxCapacity()).writeBytes(array, index, length);
     }
 
-    private ByteBuffer internalNioBuffer() {
+    @Override
+    ByteBuffer _internalNioBuffer() {
         ByteBuffer tmpNioBuf = this.tmpNioBuf;
         if (tmpNioBuf == null) {
             this.tmpNioBuf = tmpNioBuf = ByteBuffer.wrap(array);
Index: buffer/src/main/java/io/netty/buffer/PooledByteBufAllocator.java
IDEA additional info:
Subsystem: com.intellij.openapi.diff.impl.patch.CharsetEP
<+>UTF-8
===================================================================
diff --git a/buffer/src/main/java/io/netty/buffer/PooledByteBufAllocator.java b/buffer/src/main/java/io/netty/buffer/PooledByteBufAllocator.java
--- a/buffer/src/main/java/io/netty/buffer/PooledByteBufAllocator.java	(revision 293385859e01ba7501e6f120a0984dd084dfab67)
+++ b/buffer/src/main/java/io/netty/buffer/PooledByteBufAllocator.java	(date 1762899017652)
@@ -403,8 +403,8 @@
             buf = directArena.allocate(cache, initialCapacity, maxCapacity);
         } else {
             buf = PlatformDependent.hasUnsafe() ?
-                    UnsafeByteBufUtil.newUnsafeDirectByteBuf(this, initialCapacity, maxCapacity, true) :
-                    new UnpooledDirectByteBuf(this, initialCapacity, maxCapacity, true);
+                    UnsafeByteBufUtil.newUnsafeDirectByteBuf(this, initialCapacity, maxCapacity) :
+                    new UnpooledDirectByteBuf(this, initialCapacity, maxCapacity);
             onAllocateBuffer(buf, false, false);
         }
         return toLeakAwareBuffer(buf);
Index: buffer/src/test/java/io/netty/buffer/LittleEndianUnsafeNoCleanerDirectByteBufTest.java
IDEA additional info:
Subsystem: com.intellij.openapi.diff.impl.patch.CharsetEP
<+>UTF-8
===================================================================
diff --git a/buffer/src/test/java/io/netty/buffer/LittleEndianUnsafeNoCleanerDirectByteBufTest.java b/buffer/src/test/java/io/netty/buffer/LittleEndianUnsafeNoCleanerDirectByteBufTest.java
--- a/buffer/src/test/java/io/netty/buffer/LittleEndianUnsafeNoCleanerDirectByteBufTest.java	(revision 293385859e01ba7501e6f120a0984dd084dfab67)
+++ b/buffer/src/test/java/io/netty/buffer/LittleEndianUnsafeNoCleanerDirectByteBufTest.java	(date 1762899017655)
@@ -31,6 +31,6 @@
 
     @Override
     protected ByteBuf newBuffer(int length, int maxCapacity) {
-        return new UnpooledUnsafeNoCleanerDirectByteBuf(UnpooledByteBufAllocator.DEFAULT, length, maxCapacity, true);
+        return new UnpooledUnsafeNoCleanerDirectByteBuf(UnpooledByteBufAllocator.DEFAULT, length, maxCapacity);
     }
 }
Index: buffer/src/main/java/io/netty/buffer/AdaptiveByteBufAllocator.java
IDEA additional info:
Subsystem: com.intellij.openapi.diff.impl.patch.CharsetEP
<+>UTF-8
===================================================================
diff --git a/buffer/src/main/java/io/netty/buffer/AdaptiveByteBufAllocator.java b/buffer/src/main/java/io/netty/buffer/AdaptiveByteBufAllocator.java
--- a/buffer/src/main/java/io/netty/buffer/AdaptiveByteBufAllocator.java	(revision 293385859e01ba7501e6f120a0984dd084dfab67)
+++ b/buffer/src/main/java/io/netty/buffer/AdaptiveByteBufAllocator.java	(date 1762899017651)
@@ -112,8 +112,8 @@
         @Override
         public AbstractByteBuf allocate(int initialCapacity, int maxCapacity) {
             return PlatformDependent.hasUnsafe() ?
-                    UnsafeByteBufUtil.newUnsafeDirectByteBuf(allocator, initialCapacity, maxCapacity, false) :
-                    new UnpooledDirectByteBuf(allocator, initialCapacity, maxCapacity, false);
+                    UnsafeByteBufUtil.newUnsafeDirectByteBuf(allocator, initialCapacity, maxCapacity) :
+                    new UnpooledDirectByteBuf(allocator, initialCapacity, maxCapacity);
         }
     }
 }
Index: buffer/src/test/java/io/netty/buffer/BigEndianUnsafeNoCleanerDirectByteBufTest.java
IDEA additional info:
Subsystem: com.intellij.openapi.diff.impl.patch.CharsetEP
<+>UTF-8
===================================================================
diff --git a/buffer/src/test/java/io/netty/buffer/BigEndianUnsafeNoCleanerDirectByteBufTest.java b/buffer/src/test/java/io/netty/buffer/BigEndianUnsafeNoCleanerDirectByteBufTest.java
--- a/buffer/src/test/java/io/netty/buffer/BigEndianUnsafeNoCleanerDirectByteBufTest.java	(revision 293385859e01ba7501e6f120a0984dd084dfab67)
+++ b/buffer/src/test/java/io/netty/buffer/BigEndianUnsafeNoCleanerDirectByteBufTest.java	(date 1762899017655)
@@ -31,6 +31,6 @@
 
     @Override
     protected ByteBuf newBuffer(int length, int maxCapacity) {
-        return new UnpooledUnsafeNoCleanerDirectByteBuf(UnpooledByteBufAllocator.DEFAULT, length, maxCapacity, true);
+        return new UnpooledUnsafeNoCleanerDirectByteBuf(UnpooledByteBufAllocator.DEFAULT, length, maxCapacity);
     }
 }
Index: buffer/src/main/java/io/netty/buffer/AbstractByteBuf.java
IDEA additional info:
Subsystem: com.intellij.openapi.diff.impl.patch.CharsetEP
<+>UTF-8
===================================================================
diff --git a/buffer/src/main/java/io/netty/buffer/AbstractByteBuf.java b/buffer/src/main/java/io/netty/buffer/AbstractByteBuf.java
--- a/buffer/src/main/java/io/netty/buffer/AbstractByteBuf.java	(revision 293385859e01ba7501e6f120a0984dd084dfab67)
+++ b/buffer/src/main/java/io/netty/buffer/AbstractByteBuf.java	(date 1762899017651)
@@ -1496,4 +1496,8 @@
     long _memoryAddress() {
         return isAccessible() && hasMemoryAddress() ? memoryAddress() : 0L;
     }
+
+    ByteBuffer _internalNioBuffer() {
+        return internalNioBuffer(0, maxFastWritableBytes());
+    }
 }
Index: buffer/src/main/java/io/netty/buffer/UnpooledDirectByteBuf.java
IDEA additional info:
Subsystem: com.intellij.openapi.diff.impl.patch.CharsetEP
<+>UTF-8
===================================================================
diff --git a/buffer/src/main/java/io/netty/buffer/UnpooledDirectByteBuf.java b/buffer/src/main/java/io/netty/buffer/UnpooledDirectByteBuf.java
--- a/buffer/src/main/java/io/netty/buffer/UnpooledDirectByteBuf.java	(revision 293385859e01ba7501e6f120a0984dd084dfab67)
+++ b/buffer/src/main/java/io/netty/buffer/UnpooledDirectByteBuf.java	(date 1762899017653)
@@ -45,7 +45,6 @@
     private ByteBuffer tmpNioBuf;
     private int capacity;
     private boolean doNotFree;
-    private final boolean allowSectionedInternalNioBufferAccess;
 
     /**
      * Creates a new direct buffer.
@@ -54,20 +53,6 @@
      * @param maxCapacity     the maximum capacity of the underlying direct buffer
      */
     public UnpooledDirectByteBuf(ByteBufAllocator alloc, int initialCapacity, int maxCapacity) {
-        this(alloc, initialCapacity, maxCapacity, true);
-    }
-
-    /**
-     * Creates a new direct buffer.
-     *
-     * @param initialCapacity the initial capacity of the underlying direct buffer
-     * @param maxCapacity     the maximum capacity of the underlying direct buffer
-     * @param allowSectionedInternalNioBufferAccess
-     * {@code true} if {@link #internalNioBuffer(int, int)} is allowed to be called,
-     * or {@code false} if it should throw an exception.
-     */
-    UnpooledDirectByteBuf(ByteBufAllocator alloc, int initialCapacity, int maxCapacity,
-                          boolean allowSectionedInternalNioBufferAccess) {
         super(maxCapacity);
         ObjectUtil.checkNotNull(alloc, "alloc");
         checkPositiveOrZero(initialCapacity, "initialCapacity");
@@ -79,7 +64,6 @@
 
         this.alloc = alloc;
         setByteBuffer(allocateDirectBuffer(initialCapacity), false);
-        this.allowSectionedInternalNioBufferAccess = allowSectionedInternalNioBufferAccess;
     }
 
     /**
@@ -113,7 +97,6 @@
         doNotFree = !doFree;
         setByteBuffer((slice ? initialBuffer.slice() : initialBuffer).order(ByteOrder.BIG_ENDIAN), false);
         writerIndex(initialCapacity);
-        allowSectionedInternalNioBufferAccess = true;
     }
 
     /**
@@ -617,7 +600,7 @@
         if (length == 0) {
             return;
         }
-        ByteBufUtil.readBytes(alloc(), internal ? internalNioBuffer() : buffer.duplicate(), index, length, out);
+        ByteBufUtil.readBytes(alloc(), internal ? _internalNioBuffer() : buffer.duplicate(), index, length, out);
     }
 
     @Override
@@ -750,13 +733,11 @@
     @Override
     public ByteBuffer internalNioBuffer(int index, int length) {
         checkIndex(index, length);
-        if (!allowSectionedInternalNioBufferAccess) {
-            throw new UnsupportedOperationException("Bug: unsafe access to shared internal chunk buffer");
-        }
-        return (ByteBuffer) internalNioBuffer().clear().position(index).limit(index + length);
+        return (ByteBuffer) _internalNioBuffer().clear().position(index).limit(index + length);
     }
 
-    private ByteBuffer internalNioBuffer() {
+    @Override
+    ByteBuffer _internalNioBuffer() {
         ByteBuffer tmpNioBuf = this.tmpNioBuf;
         if (tmpNioBuf == null) {
             this.tmpNioBuf = tmpNioBuf = buffer.duplicate();
Index: buffer/src/main/java/io/netty/buffer/UnpooledByteBufAllocator.java
IDEA additional info:
Subsystem: com.intellij.openapi.diff.impl.patch.CharsetEP
<+>UTF-8
===================================================================
diff --git a/buffer/src/main/java/io/netty/buffer/UnpooledByteBufAllocator.java b/buffer/src/main/java/io/netty/buffer/UnpooledByteBufAllocator.java
--- a/buffer/src/main/java/io/netty/buffer/UnpooledByteBufAllocator.java	(revision 293385859e01ba7501e6f120a0984dd084dfab67)
+++ b/buffer/src/main/java/io/netty/buffer/UnpooledByteBufAllocator.java	(date 1762899017652)
@@ -179,7 +179,7 @@
             extends UnpooledUnsafeNoCleanerDirectByteBuf {
         InstrumentedUnpooledUnsafeNoCleanerDirectByteBuf(
                 UnpooledByteBufAllocator alloc, int initialCapacity, int maxCapacity) {
-            super(alloc, initialCapacity, maxCapacity, true);
+            super(alloc, initialCapacity, maxCapacity);
         }
 
         @Override
Index: buffer/src/main/java/io/netty/buffer/UnpooledUnsafeDirectByteBuf.java
IDEA additional info:
Subsystem: com.intellij.openapi.diff.impl.patch.CharsetEP
<+>UTF-8
===================================================================
diff --git a/buffer/src/main/java/io/netty/buffer/UnpooledUnsafeDirectByteBuf.java b/buffer/src/main/java/io/netty/buffer/UnpooledUnsafeDirectByteBuf.java
--- a/buffer/src/main/java/io/netty/buffer/UnpooledUnsafeDirectByteBuf.java	(revision 293385859e01ba7501e6f120a0984dd084dfab67)
+++ b/buffer/src/main/java/io/netty/buffer/UnpooledUnsafeDirectByteBuf.java	(date 1762899017654)
@@ -43,11 +43,6 @@
         super(alloc, initialCapacity, maxCapacity);
     }
 
-    UnpooledUnsafeDirectByteBuf(ByteBufAllocator alloc, int initialCapacity, int maxCapacity,
-                                boolean allowSectionedInternalNioBufferAccess) {
-        super(alloc, initialCapacity, maxCapacity, allowSectionedInternalNioBufferAccess);
-    }
-
     /**
      * Creates a new direct buffer by wrapping the specified initial buffer.
      *
Index: buffer/src/main/java/io/netty/buffer/UnpooledUnsafeNoCleanerDirectByteBuf.java
IDEA additional info:
Subsystem: com.intellij.openapi.diff.impl.patch.CharsetEP
<+>UTF-8
===================================================================
diff --git a/buffer/src/main/java/io/netty/buffer/UnpooledUnsafeNoCleanerDirectByteBuf.java b/buffer/src/main/java/io/netty/buffer/UnpooledUnsafeNoCleanerDirectByteBuf.java
--- a/buffer/src/main/java/io/netty/buffer/UnpooledUnsafeNoCleanerDirectByteBuf.java	(revision 293385859e01ba7501e6f120a0984dd084dfab67)
+++ b/buffer/src/main/java/io/netty/buffer/UnpooledUnsafeNoCleanerDirectByteBuf.java	(date 1762899017654)
@@ -21,9 +21,8 @@
 import java.nio.ByteBuffer;
 
 class UnpooledUnsafeNoCleanerDirectByteBuf extends UnpooledUnsafeDirectByteBuf {
-    UnpooledUnsafeNoCleanerDirectByteBuf(ByteBufAllocator alloc, int initialCapacity, int maxCapacity,
-                                         boolean allowSectionedInternalNioBufferAccess) {
-        super(alloc, initialCapacity, maxCapacity, allowSectionedInternalNioBufferAccess);
+    UnpooledUnsafeNoCleanerDirectByteBuf(ByteBufAllocator alloc, int initialCapacity, int maxCapacity) {
+        super(alloc, initialCapacity, maxCapacity);
     }
 
     @Override

@franz1981
Contributor

Thanks, I will try with JDK 25 as well, to make sure 🙏
Sadly I have no Apple machine to validate why you got such numbers.

But if you want, run the benchmark with -prof gc to compare the two approaches. I would suggest using the patched one, because it should allocate less.

@franz1981
Contributor

franz1981 commented Nov 13, 2025

@chrisvest I don't think this is the "right" benchmark, because it would reuse the internal buffer, which under certain circumstances saves it from having to be materialized...

That said, I see a regression myself, likely due to bounds checks.

[image]

This does not happen in the current version, and it could be (to be verified) a quirk of the benchmark,
i.e. JMH hoisting a few invariants considered "safe" out of the loop?

I have to inspect the asm again to know for sure,
but what I do know is that the HTTP/2 benchmark shows the exact opposite... can you give it a shot?

(see #13783 (comment) as a reminder: this PR is key to seeing the right improvement in that HTTP/2 PR... a linked list of issues!)

@franz1981
Contributor

franz1981 commented Nov 13, 2025

@chrisvest re the performance difference: at first look, the problem is visible just by reading the code,
i.e.

  1. using AdaptiveByteBuf::internalNioBuffer: it forces the buffer to be allocated, but once that is done there is no need to compute any real offset, since each internal NIO ByteBuffer already has the right address
  2. using rootParent's _internalNioBuffer: it requires computing the index twice, once for src and once for dst

So, in which cases is the second the better option?
I think only if the AdaptiveByteBuf::internalNioBuffer hasn't been allocated yet. Indeed, the HTTP/2 benchmark was creating a new duplicate/slice of the buffer to work with, causing it to allocate a fresh internalNioBuffer and harming performance.
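
To make the trade-off concrete, here is a minimal sketch with plain `java.nio.ByteBuffer`s standing in for the shared chunks. It is not Netty's implementation: `SetBytesSketch`, `chunkIndex`, `copyViaDuplicates` and `copyAbsolute` are hypothetical names, and the absolute bulk `put(int, ByteBuffer, int, int)` used in the second method assumes Java 16 or later.

```java
import java.nio.ByteBuffer;

final class SetBytesSketch {
    private SetBytesSketch() { }

    // Hypothetical helper: translate a buffer-relative index into an index
    // within the shared chunk, given where the buffer's region starts.
    static int chunkIndex(int regionStart, int index) {
        return regionStart + index;
    }

    // Option 1 style: work on duplicates so the shared position/limit fields
    // are never touched. Each duplicate() allocates a new ByteBuffer object
    // (the underlying memory is shared, not copied).
    static void copyViaDuplicates(ByteBuffer dstChunk, int dstIndex,
                                  ByteBuffer srcChunk, int srcIndex, int length) {
        ByteBuffer srcView = srcChunk.duplicate();
        srcView.clear().position(srcIndex).limit(srcIndex + length);
        ByteBuffer dstView = dstChunk.duplicate();
        dstView.clear().position(dstIndex);
        dstView.put(srcView); // relative put: advances only the duplicates
    }

    // Option 2 style (Java 16+): absolute bulk put. No duplicate is needed
    // because neither buffer's position is read or modified, but the caller
    // has to supply chunk-absolute indices for both src and dst.
    static void copyAbsolute(ByteBuffer dstChunk, int dstIndex,
                             ByteBuffer srcChunk, int srcIndex, int length) {
        dstChunk.put(dstIndex, srcChunk, srcIndex, length);
    }

    public static void main(String[] args) {
        ByteBuffer src = ByteBuffer.allocateDirect(64);
        for (int i = 0; i < src.capacity(); i++) {
            src.put(i, (byte) i);
        }
        ByteBuffer viaDuplicates = ByteBuffer.allocateDirect(64);
        ByteBuffer viaAbsolute = ByteBuffer.allocateDirect(64);

        // Both regions are assumed to start at offset 16 of their chunk.
        int srcAt = chunkIndex(16, 2);
        int dstAt = chunkIndex(16, 0);
        copyViaDuplicates(viaDuplicates, dstAt, src, srcAt, 8);
        copyAbsolute(viaAbsolute, dstAt, src, srcAt, 8);

        System.out.println("same contents: " + viaDuplicates.equals(viaAbsolute));
    }
}
```

Whether the saved allocation outweighs the extra index arithmetic and bounds checks is exactly what the benchmarks discussed in this thread are meant to settle; the sketch only shows that the two copies are equivalent.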

@chrisvest chrisvest requested a review from franz1981 December 10, 2025 00:21
@chrisvest
Member Author

@normanmaurer @franz1981 Please take a look.

Contributor

@franz1981 franz1981 left a comment


LGTM!

@normanmaurer normanmaurer merged commit 5fe1eea into netty:4.2 Dec 16, 2025
18 of 19 checks passed
normanmaurer pushed a commit that referenced this pull request Dec 16, 2025
Motivation:
The setBytes method was getting a sliced nioBuffer of its source, which
typically causes allocation. On Java 16 onwards, we can instead copy
using an absolutely offsetted `put` method, and forego allocating a
duplicate ByteBuffer instance that is otherwise needed for isolating the
position field.

Modification:
- Make use of the absolutely offsetted put method in setBytes, when it's
available.
- Use the underlying ByteBuffer of the shared chunk where possible to
avoid multiple bounds checks.
- Add a benchmark targeting the setBytes method that takes a ByteBuf
source.
- Change a few benchmarks to use the default allocator when pooling is
enabled.

Result:
Faster setBytes in certain cases.
@normanmaurer
Member

@chrisvest @franz1981 do we also want to port this to 4.1?

normanmaurer added a commit that referenced this pull request Dec 16, 2025
Co-authored-by: Chris Vest <christianvest_hansen@apple.com>
@chrisvest chrisvest deleted the 4.2-setbytes branch December 16, 2025 17:40
@chrisvest
Member Author

@normanmaurer I don't think we need to backport this to 4.1.

chrisvest pushed a commit that referenced this pull request Jan 6, 2026
Motivation:

The adaptive allocator performs costly atomic operations in the thread-local
path, which reduces its performance

Modification:

Reduce the number of atomic operations in the thread-local allocation's
fast path

Result:

Fixes #15571


These are the different variations I want to test:

- [x] Uses unguarded `Recycler`s
- [x] Implements "compressed" local free list (LIFO)
- [x] Use an MPSC queue for the reuse chunk queue in the thread-local case **NO VISIBLE IMPROVEMENTS**
- [x] Guards `nextInLine`'s `getAndSet` with a null check via a volatile `get` first, since size-classed chunks rarely end up in `nextInLine` (i.e. it is mostly `null`) **NO VISIBLE IMPROVEMENTS**
- [x] Implements a var-handle-based `MpscIntQueue` (done at 1c4e1e4) **NO VISIBLE IMPROVEMENTS**
- [x] Remove the live/raw ref cnt as mentioned at #15736 (comment)
- [ ] Remove the ref count for size-classed chunks (see 8953bbe and 8cb1bf0)
- [ ] Use the "pinned" Recycler instead of the `FastThreadLocal`-based one
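
As a rough illustration of the guarded `getAndSet` variation in the list above, here is a minimal sketch built on a plain `AtomicReference`. It is not the allocator's code: `NextInLineSketch`, `offer` and `poll` are hypothetical names chosen for the example. Note that the list marks this variation as bringing no visible improvement.

```java
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical single-slot holder, loosely modelled on the "nextInLine" idea.
final class NextInLineSketch<T> {
    private final AtomicReference<T> nextInLine = new AtomicReference<>();

    // Parks an item in the slot if it is currently empty.
    boolean offer(T item) {
        return nextInLine.compareAndSet(null, item);
    }

    // Takes the parked item, or returns null if there is none.
    T poll() {
        // Volatile read first: when the slot is usually empty, this skips
        // the more expensive atomic read-modify-write below.
        if (nextInLine.get() == null) {
            return null;
        }
        // The slot looked non-empty, but another thread may have taken the
        // item in between, so the result can still be null.
        return nextInLine.getAndSet(null);
    }

    public static void main(String[] args) {
        NextInLineSketch<String> slot = new NextInLineSketch<>();
        System.out.println(slot.poll());
        slot.offer("chunk");
        System.out.println(slot.poll());
    }
}
```

The guard only pays off when the slot is empty most of the time, which is the assumption stated in the list item.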
chrisvest pushed a commit to chrisvest/netty that referenced this pull request Jan 6, 2026
(cherry picked from commit accd981)
chrisvest added a commit that referenced this pull request Jan 7, 2026
(cherry picked from commit accd981)

Co-authored-by: Francesco Nigro <nigro.fra@gmail.com>