Simplify reference counting #15764

chrisvest · 2025-10-16T18:16:09Z

Motivation:
Our current reference counting algorithm is quite complicated, trying to guard for word-tearing of 32-bit integers, which isn't possible in Java.

The complicated code compiles to larger amounts of native code, which in turn can prevent inlining as it comes up against code size heuristics in the JIT.

Modification:
Simplify the code by removing special case checks for certain constants. Reorganize the code to use fewer branches.

Result:
Simpler code that's more JIT friendly, without loss of performance.

chrisvest · 2025-10-16T18:19:03Z

@franz1981 Please try benchmarking these changes. On my M1, I get performance in AbstractReferenceCountedByteBufBenchmark that is on par with the current code.

If I change retain to use a CAS loop, its performance drops dramatically as soon as there's contention. I kept the even/odd reference counting scheme because of this, because it allows us to use the getAndAdd intrinsics, which scale much better.
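To make the trade-off concrete, here is a minimal sketch of the two retain styles being compared. It is not the code from this PR: the class names are hypothetical, it uses `AtomicInteger` for brevity, and overflow checks are omitted. It assumes the even/odd encoding stores `2 * count` while the object is live and treats any odd raw value as released.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Plain count with a CAS loop: every retain may retry under contention.
final class CasRefCount {
    private final AtomicInteger count = new AtomicInteger(1);

    void retain() {
        for (;;) {
            int cur = count.get();
            if (cur == 0) {
                throw new IllegalStateException("already released");
            }
            if (count.compareAndSet(cur, cur + 1)) {
                return;
            }
        }
    }

    int refCnt() {
        return count.get();
    }
}

// Even/odd encoding: raw == 2 * count while live, any odd raw value means released.
final class ParityRefCount {
    private final AtomicInteger raw = new AtomicInteger(2);

    // A single getAndAdd, no retry loop. Adding 2 to an odd value keeps it odd,
    // so a late retain cannot resurrect a released object.
    void retain() {
        int old = raw.getAndAdd(2);
        if ((old & 1) != 0) {
            throw new IllegalStateException("already released");
        }
    }

    // Release still needs a CAS so that exactly one caller performs the final release.
    boolean release() {
        for (;;) {
            int cur = raw.get();
            if ((cur & 1) != 0) {
                throw new IllegalStateException("already released");
            }
            int next = cur == 2 ? 1 : cur - 2;
            if (raw.compareAndSet(cur, next)) {
                return cur == 2;
            }
        }
    }

    int refCnt() {
        int r = raw.get();
        return (r & 1) != 0 ? 0 : r >>> 1;
    }
}
```

The parity bit is what lets `retain` skip the "is it still live?" check before writing: the check happens after the unconditional add, and a failed retain leaves the state odd.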

franz1981 · 2025-10-16T18:21:54Z

But to be fair, does it matter?
I mean, I would optimize for the normal uncontended use cases if I have to pick a poison. This will improve the uncontended cases because the comparisons don't need to strip the parity bit and become (if predictable) super cheap.
If we want to go really fast (under contention), a double sequence scheme is the way to go, because the two sequences keep on increasing and can use increment-and-get separately.

franz1981 · 2025-10-16T18:25:10Z

You can try applying this on top of my adaptive pull and see how it performs, since I didn't get rid of the chunk ref count for size classed because I was waiting for this ❤️

chrisvest · 2025-10-16T18:25:18Z

It wasn't faster without the parity bit on my machine.

franz1981 · 2025-10-16T18:27:13Z

I believe it - I will try on my x86 as well to check how it behaves. I also have to check whether it solves the problem in the issue related to set bytes

chrisvest · 2025-10-16T18:30:30Z

Also, I suspect the contended case is relevant for the chunks.

franz1981 · 2025-10-16T18:49:31Z

Uh, you are right. It is indeed (for shared magazines).
Although I have another solution there to get rid of it

chrisvest · 2025-10-16T19:16:20Z

Huh, interesting. I'm actually getting a pretty reliable JIT compiler crash on the ReferenceCountUpdater::release0 method!

chrisvest · 2025-10-16T21:01:55Z

Looks like I can make the compiler crash go away by collapsing the implementations down as well, so we only have AbstractReferenceCounted and no longer need the impls in Chunk and AbstractReferenceCountedByteBuf. And a few other small tweaks.

chrisvest · 2025-10-16T22:16:07Z

To work-around the JIT bug, I've had to increase the scope of this PR to also collapse the three different ReferenceCounted implementations into one; AbstractReferenceCounted. This means that Chunk and AbstractReferenceCountedByteBuf no longer integrate with ReferenceCountUpdater directly, but instead all pull their implementations from AbstractReferenceCounted.

franz1981 · 2025-10-17T01:55:04Z

buffer/src/main/java/io/netty/buffer/AdaptivePoolingAllocator.java

-            }
-        }
-
+    private static class Chunk extends AbstractReferenceCounted implements ChunkInfo {


Consider that with 8953bbe I plan to make Chunk's reference count exist only for the BumpChunk (bad name I know :"()

Yeah. And this is also breaking the graal build as-is… need to think about what to do here.

Check franz1981#1

franz1981 · 2025-10-17T02:27:19Z

The idea of collapsing the hierarchy is very similar to what @yawkat has done on franz1981#1 and will likely avoid the problem of bimorphic inlining + fat compiled methods observed in #15736 (comment).
But the only way to know is to try this change on #15736 as well.

common/src/main/java/io/netty/util/internal/ReferenceCountUpdater.java

franz1981 · 2025-10-17T03:15:38Z

I have cherry picked both your changes on top of #15741 getting

Benchmark                                               (allocatorType)  (pollutionIterations)  Mode  Cnt   Score   Error  Units
ByteBufAllocatorAllocPatternBenchmark.directAllocation         ADAPTIVE                      0  avgt  100  29.725 ± 0.261  ns/op
ByteBufAllocatorAllocPatternBenchmark.directAllocation         ADAPTIVE                 200000  avgt  100  32.376 ± 0.453  ns/op

whilst my last commit was delivering

Benchmark                                               (allocatorType)  (pollutionIterations)  Mode  Cnt   Score   Error  Units
ByteBufAllocatorAllocPatternBenchmark.directAllocation         ADAPTIVE                      0  avgt  100  27.484 ± 0.251  ns/op
ByteBufAllocatorAllocPatternBenchmark.directAllocation         ADAPTIVE                 200000  avgt  100  32.874 ± 0.585  ns/op

which shows a slightly regressed performance in the non-"polluted" case

With the change at #15764 (comment) the performance is (nearly) restored:

Benchmark                                               (allocatorType)  (pollutionIterations)  Mode  Cnt   Score   Error  Units
ByteBufAllocatorAllocPatternBenchmark.directAllocation         ADAPTIVE                      0  avgt  100  28.091 ± 0.431  ns/op
ByteBufAllocatorAllocPatternBenchmark.directAllocation         ADAPTIVE                 200000  avgt  100  32.044 ± 0.457  ns/op

Adding #15764 (comment) too (on top of the previous one) completely restores the performance:

Benchmark                                               (allocatorType)  (pollutionIterations)  Mode  Cnt   Score   Error  Units
ByteBufAllocatorAllocPatternBenchmark.directAllocation         ADAPTIVE                      0  avgt  100  27.636 ± 0.142  ns/op
ByteBufAllocatorAllocPatternBenchmark.directAllocation         ADAPTIVE                 200000  avgt  100  31.993 ± 0.531  ns/op

I suspect we can do better on retain0 as well, since we can optimize for the not-taken branch there too (i.e. no thrown exception) - I will post a suggestion for that one too.
And that should help the chunk and retained-slice use cases (e.g. for HTTP/2)
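For readers unfamiliar with the idiom, "optimizing for the not-taken branch" usually means keeping the hot method small and pushing exception construction into a separate method that is rarely called. The sketch below shows that general pattern only; it is not the suggestion posted on this PR, and the class, the constructor taking a raw value, and the method names are all hypothetical.

```java
import java.util.concurrent.atomic.AtomicInteger;

final class ColdPathRetain {
    private final AtomicInteger raw;

    // Hypothetical constructor taking the raw (parity-encoded) value, only so the
    // failure path can be exercised; 2 means one live reference, odd means released.
    ColdPathRetain(int initialRaw) {
        raw = new AtomicInteger(initialRaw);
    }

    // Hot path: one atomic add and one branch that is expected to be not taken.
    void retain() {
        int old = raw.getAndAdd(2);
        if ((old & 1) != 0) {
            throwAlreadyReleased(old);
        }
    }

    // Cold path: the message formatting and exception allocation live here, so
    // their bytecode does not count toward the size of retain().
    private static void throwAlreadyReleased(int oldRaw) {
        throw new IllegalStateException("retain on released object, raw refCnt=" + oldRaw);
    }
}
```

Whether this helps in practice depends on the JIT and the platform, which is consistent with the different results reported for M1 and x86 in this thread.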

common/src/main/java/io/netty/util/internal/ReferenceCountUpdater.java

chrisvest · 2025-10-17T15:55:38Z

@franz1981 I applied the optimizations you suggested. They don't seem to make any noticeable difference on M1, but if they help x86 we'll of course take that win.

common/src/main/java/io/netty/util/internal/ReferenceCountUpdater.java

franz1981 · 2025-10-28T13:26:45Z

@normanmaurer how does this look?

https://github.com/franz1981/netty/commits/4.2-new-refcount/

it implements a separate RefCnt class
it allocates it where it is needed
it has 3 separate concrete sub-classes which are NEVER used together in the same VM

The downside is the memory footprint (+16 bytes of RefCnt) and the indirection, which is not great, but I can run a few benchmarks to check this.
To be fair, I have a strong feeling that repeated memory accesses will hoist the refCnt load (with no need for guards, because there MUST be only a single concrete type loaded at runtime) and the refCnt.value offset too, because repeated buffer accesses won't involve volatile loads due to how RefCnt::isLiveNonVolatile is implemented.

Another option is what @yawkat has done in a PR on my Netty repo - which is not so distant from what's proposed by this PR.

There's still some code duplication (which TBH could be removed, with some effort), but I preferred to leave it because it makes the life of both native image and JIT much easier :)

normanmaurer · 2025-10-28T13:45:28Z

@franz1981 if the performance is improved by your implementation I would prefer this one as it seems a lot cleaner etc.

normanmaurer · 2025-10-28T13:52:23Z

@franz1981 do we even need all the other changes with what you did now ?

franz1981 · 2025-10-28T14:04:33Z

@normanmaurer

I've created https://github.com/franz1981/netty/tree/4.2-refcnt-http2-setbytes-after-new-refcnt to measure the proposed changes on #15764 (comment) against #15764 (comment)

Benchmark                                       (padding)  (payloadSize)  (pooled)  Mode  Cnt   Score   Error  Units
Http2FrameWriterDataBenchmark.newWriter                 0            128      true  avgt    5  58.290 ± 0.247  ns/op

I haven't measured the unsafe performance and how much we lose and other benchmarks.

franz1981 · 2025-10-28T16:00:14Z

Another option is https://gist.github.com/franz1981/e81b474c415b58e2601407fa254dfc00 which would let child classes that want to embed the int field extend it and implement ReferenceCounted.
It will create some clashing issue due to names of the methods, but is solvable...

chrisvest · 2025-10-28T16:08:08Z

@rwestrel The first commit, fafa429, might be of interest to the JDK team because it managed to find a compiler crash in C2. I can repro it reliably on JDK 23, 24, 25, by just running the tests in the netty-buffer module.
Here's the JDK 25 crash log hs_err_pid4992.log

chrisvest · 2025-10-28T23:27:23Z

Looking at franz1981#1 the ReferenceCountHolder apparently does not need to be initialized at runtime (a Graal 21 build passes with it), and that gets us around a lot of the changes. It's not obvious to me how it can get away with that. I'll have to look into that more.

franz1981 · 2025-10-29T06:08:54Z

@chrisvest @normanmaurer let me try to summarize my thoughts on this

Problem Summary

ReferenceCountUpdater was designed for flexibility — any class can provide its own int field and use the shared updater.

AbstractReferenceCounted defines its own refCnt field and updater.
ByteBuf implements ReferenceCounted but does not extend AbstractReferenceCounted.
AbstractByteBuf has no ref count field, but its subclass AbstractReferenceCountedByteBuf defines one (with its own updater).
Some subclasses (e.g. AbstractUnpooledSlicedByteBuf) are ReferenceCounted but don’t need a ref count field.

This flexibility prevents ReferenceCountUpdater from being optimized via VarHandle.

Observed issue in the adaptive allocator:

AdaptiveByteBuf → extends AbstractReferenceCountedByteBuf (own updater).
Chunk → implements ReferenceCounted (own updater).

Two updater subclasses hurt ref count performance.
Making Chunk extend AbstractReferenceCounted doesn’t help — each uses a different updater/field.
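A rough illustration of the shape being described, under the assumption that the shared logic reaches the field through a virtual accessor. The types below (`SharedUpdaterSketch`, `BufLike`, `ChunkLike`) are hypothetical stand-ins, not Netty classes, and the counting is deliberately simplistic.

```java
import java.util.concurrent.atomic.AtomicIntegerFieldUpdater;

// Hypothetical shared updater: each subclass says which int field to operate on.
abstract class SharedUpdaterSketch<T> {
    protected abstract AtomicIntegerFieldUpdater<T> updater();

    // Shared logic. The updater() call is virtual, so every concrete subclass
    // that is loaded adds another receiver type at this call site.
    final int retain(T instance) {
        return updater().incrementAndGet(instance);
    }
}

final class BufLike {
    volatile int refCnt = 1;
    private static final AtomicIntegerFieldUpdater<BufLike> FIELD =
            AtomicIntegerFieldUpdater.newUpdater(BufLike.class, "refCnt");
    static final SharedUpdaterSketch<BufLike> UPDATER = new SharedUpdaterSketch<BufLike>() {
        @Override
        protected AtomicIntegerFieldUpdater<BufLike> updater() {
            return FIELD;
        }
    };
}

final class ChunkLike {
    volatile int refCnt = 1;
    private static final AtomicIntegerFieldUpdater<ChunkLike> FIELD =
            AtomicIntegerFieldUpdater.newUpdater(ChunkLike.class, "refCnt");
    static final SharedUpdaterSketch<ChunkLike> UPDATER = new SharedUpdaterSketch<ChunkLike>() {
        @Override
        protected AtomicIntegerFieldUpdater<ChunkLike> updater() {
            return FIELD;
        }
    };
}
```

With both `BufLike.UPDATER` and `ChunkLike.UPDATER` in use, `retain` sees two implementations of `updater()`, which is the kind of situation the thread associates with bimorphic inlining and larger compiled methods.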

Option 1 — Make `ByteBuf` extend `AbstractReferenceCounted`

Both adaptive chunks and buffers would share the same field/updater.

Pros:

no changes in memory footprint for the common case i.e. AbstractReferenceCountedByteBuf children

Cons:

Some ByteBuf types would carry an unused refCnt field.
Users could still introduce their own updaters, harming performance.

Option 2 — Replace updaters with a less flexible `RefCnt` helper

Introduce a simple RefCnt holder (just an int field + static utility methods, no virtuals).

Two variants:

Option (2.A): Keep the current hierarchy — make AbstractReferenceCountedByteBuf hold a RefCnt field.
Option (2.B): Let ByteBuf extend RefCnt directly (similar to Option 1)

Pros:

Extending RefCnt won't affect its performance

Cons:

Option (2.A): adds one level of indirection and small memory overhead in a common use case
Option (2.B): adds an int refCnt field even when unused

TL;DR:

Ref count flexibility currently hurts performance.
Unifying the updater or introducing a more rigid RefCnt holder fixes the issue

An example of the Option 2.A is at franz1981@ed468f2

And by doing what @chrisvest has done at f410cc2 (but with RefCnt) it is possible to implement Option 2.B with ease.

NOTE: Option 2 is not the same as franz1981#1 because we are removing the flexibility for users to act on user-defined int ref cnt fields!
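As a reading aid, Option (2.A) could look roughly like the sketch below: a small holder object with an int field and static utility methods, referenced through a final field of the counted object. This is an assumption-laden sketch and not the `io.netty.util.internal.RefCnt` from the linked commits; `RefCntSketch`, `PooledThing`, and all method signatures here are hypothetical. It keeps the even/odd encoding, uses a `VarHandle`, and omits overflow checks.

```java
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;

// Hypothetical holder: one int field plus static helpers, no virtual methods.
final class RefCntSketch {
    private static final VarHandle VALUE;

    static {
        try {
            VALUE = MethodHandles.lookup().findVarHandle(RefCntSketch.class, "value", int.class);
        } catch (ReflectiveOperationException e) {
            throw new ExceptionInInitializerError(e);
        }
    }

    // Raw value: 2 * count while live, odd once released. Accessed only via VALUE.
    @SuppressWarnings("unused")
    private int value;

    RefCntSketch() {
        VALUE.setRelease(this, 2);
    }

    static int refCnt(RefCntSketch r) {
        int raw = (int) VALUE.getAcquire(r);
        return (raw & 1) != 0 ? 0 : raw >>> 1;
    }

    // Plain (non-volatile) read, meant for cheap accessibility checks.
    static boolean isLiveNonVolatile(RefCntSketch r) {
        int raw = (int) VALUE.get(r);
        return (raw & 1) == 0;
    }

    static void retain(RefCntSketch r) {
        int old = (int) VALUE.getAndAdd(r, 2);
        if ((old & 1) != 0) {
            throw new IllegalStateException("already released");
        }
    }

    static boolean release(RefCntSketch r) {
        for (;;) {
            int cur = (int) VALUE.getAcquire(r);
            if ((cur & 1) != 0) {
                throw new IllegalStateException("already released");
            }
            int next = cur == 2 ? 1 : cur - 2;
            if (VALUE.compareAndSet(r, cur, next)) {
                return cur == 2;
            }
        }
    }

    static void resetRefCnt(RefCntSketch r) {
        VALUE.setRelease(r, 2);
    }
}

// Option (2.A) shape: the counted object holds the counter instead of embedding an int.
final class PooledThing {
    private final RefCntSketch refCnt = new RefCntSketch();

    PooledThing retain() {
        RefCntSketch.retain(refCnt);
        return this;
    }

    boolean release() {
        return RefCntSketch.release(refCnt);
    }

    int refCnt() {
        return RefCntSketch.refCnt(refCnt);
    }

    boolean isAccessible() {
        return RefCntSketch.isLiveNonVolatile(refCnt);
    }
}
```

The extra object and the pointer hop from `PooledThing` to `RefCntSketch` correspond to the indirection and memory overhead listed as the cons of Option (2.A); Option (2.B) would instead have the counted type extend the holder.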

normanmaurer · 2025-10-29T07:05:42Z

I am in favor of Option (2.A).

rwestrel · 2025-10-30T10:08:22Z

@rwestrel The first commit, fafa429, might be of interest to the JDK team because it managed to find a compiler crash in C2. I can repro it reliably on JDK 23, 24, 25, by just running the tests in the netty-buffer module. Here's the JDK 25 crash log hs_err_pid4992.log

FYI, I filed an openjdk bug: https://bugs.openjdk.org/browse/JDK-8370939

normanmaurer · 2025-10-30T10:14:22Z

@rwestrel thanks for linking it here.

chrisvest · 2025-10-30T18:38:40Z

Ok, I'll try implementing option 2.A, following the example in franz1981@ed468f2

chrisvest · 2025-10-30T22:58:55Z

@franz1981 @normanmaurer Please take a look.

franz1981

In terms of performance it will have some impact on the most common case (as explained in the description I already gave), but in a follow-up PR it likely enables us to re-enable VarHandle for native image as well.

I will run a few benchmarks, likely early next week; if needed, we still have Option (2.B), which would remove any chance of regression - with @normanmaurer's blessing

common/src/main/java/io/netty/util/internal/RefCnt.java

franz1981 · 2025-11-02T14:59:01Z

buffer/src/main/java/io/netty/buffer/AbstractReferenceCountedByteBuf.java

        super(maxCapacity);
-        updater.setInitialValue(this);
+        // this is setting the ref cnt to the initial value
+        refCnt = new RefCnt();


I have the feeling (i.e. to be verified with a simple test) that, despite being the right way to write this code, having a super constructor which could observe refCnt as null prevents the JIT from using https://shipilev.net/jvm/anatomy-quarks/25-implicit-null-checks/

I have verified indeed that resetRefCnt is emitting (for the adaptive benchmark), this

0x00007f70746c1d7b: mov 0x20(%r8),%r10d ;*getfield refCnt {reexecute=0 rethrow=0 return_oop=0}
    ; - io.netty.buffer.AbstractReferenceCountedByteBuf::resetRefCnt@1 (line 57)
0.02% 0x00007f70746c1d7f: test %r10d,%r10d ; NO implicit NULL check here :"(
0.21% 0x00007f70746c1d82: je 0x00007f70746c2fcc
0x00007f70746c1d88: movl $0x2,0xc(%r12,%r10,8) ;*invokevirtual putIntRelease {reexecute=0 rethrow=0 return_oop=0}
    ; - java.lang.invoke.VarHandleInts$FieldInstanceReadWrite::setRelease@24 (line 170)
    ; - java.lang.invoke.VarHandleGuards::guard_LI_V@45 (line 122)
    ; - io.netty.util.internal.RefCnt$VarHandleRefCnt::resetRefCnt@5 (line 378)
    ; - io.netty.util.internal.RefCnt::resetRefCnt@36 (line 236)
    ; - io.netty.buffer.AbstractReferenceCountedByteBuf::resetRefCnt@4 (line 57)

which shows that moving to a separate refCnt field adds a few costs:

reading a field (with an embedded int refCnt that won't happen)

1 test + branch for the null check (predictable, but still...)

decoding the compressed oop on the fly while setting the value to 2 (see https://discord.com/channels/@me/1418816637674459157/1434550833768304710)

This is not terrible, but that's why Option 2.B is more appealing in terms of cost

@normanmaurer too
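As a language-level illustration of "a super which could observe the refCnt as null" (and saying nothing about what C2 actually emits), the sketch below shows a superclass constructor running before the subclass has assigned its final field. All names are hypothetical, and a plain `Object` stands in for the RefCnt.

```java
abstract class BaseSketch {
    final boolean sawNullDuringSuper;

    BaseSketch() {
        // Runs before any subclass field initializer or constructor body.
        sawNullDuringSuper = !isCounterAssigned();
    }

    abstract boolean isCounterAssigned();
}

final class DerivedSketch extends BaseSketch {
    private final Object counter; // stand-in for the RefCnt field

    DerivedSketch() {
        super();
        counter = new Object(); // assigned only after super() has returned
    }

    @Override
    boolean isCounterAssigned() {
        return counter != null;
    }
}
```

Constructing a `DerivedSketch` leaves `sawNullDuringSuper` set to `true`, even though the field is final and non-null afterwards. Under the initialization order in JLS 12.5, an inline field initializer also runs after the superclass constructor returns, so whether moving the allocation changes the generated code is something to confirm by measurement.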

franz1981 · 2025-11-02T15:01:04Z

buffer/src/main/java/io/netty/buffer/AbstractReferenceCountedByteBuf.java

-    // Value might not equal "real" reference count, all access should be via the updater
-    @SuppressWarnings({"unused", "FieldMayBeFinal"})
-    private volatile int refCnt;
+    private final RefCnt refCnt;


Suggested change

-    private final RefCnt refCnt;
+    private final RefCnt refCnt = new RefCnt();

franz1981 · 2025-11-02T15:02:49Z

buffer/src/main/java/io/netty/buffer/AdaptivePoolingAllocator.java

            magazine = null;
            allocator = null;
            chunkReleasePredicate = null;
+            refCnt = null;


Read https://github.com/netty/netty/pull/15764/files#r2484854800: we can still allocate the RefCnt here regardless, in the constructor.
This "should" save the JIT from emitting a null check

franz1981 · 2025-11-02T15:03:46Z

buffer/src/main/java/io/netty/buffer/AdaptivePoolingAllocator.java

        protected Magazine magazine;
        private final AdaptivePoolingAllocator allocator;
        private final ChunkReleasePredicate chunkReleasePredicate;
+        private final RefCnt refCnt;


Suggested change

-        private final RefCnt refCnt;
+        // we always populate the refCnt field to save Hotspot from emitting `null` checks
+        private final RefCnt refCnt = new RefCnt();

chrisvest requested a review from franz1981 October 16, 2025 18:16

chrisvest requested a review from normanmaurer October 16, 2025 18:20

chrisvest force-pushed the 4.2-refcount branch from 4f58b3e to c6afb03 Compare October 16, 2025 21:33

Collapse the ReferenceCounted implementations to all rely on AbstractReferenceCounted

f410cc2

And also work-around what appears to be a C2 JIT compiler bug in Java 17 through 25+.

chrisvest force-pushed the 4.2-refcount branch from c6afb03 to f410cc2 Compare October 16, 2025 22:55

franz1981 reviewed Oct 17, 2025

franz1981 requested changes Oct 17, 2025

franz1981 reviewed Oct 17, 2025

franz1981 requested changes Oct 17, 2025

Apply some micro-opts that help x86 a bit

45633cd

chrisvest added 2 commits October 17, 2025 16:31

Merge branch '4.2' into 4.2-refcount

7952927

Merge branch '4.2' into 4.2-refcount

d72d545

franz1981 reviewed Oct 21, 2025

Fix the native image build

b077572

chrisvest force-pushed the 4.2-refcount branch from 78ccf85 to b077572 Compare October 23, 2025 19:05

franz1981 force-pushed the 4.2-refcount branch 2 times, most recently from 0e655be to 4e306ad Compare October 28, 2025 13:37

normanmaurer closed this Oct 28, 2025

normanmaurer reopened this Oct 28, 2025

chrisvest added 3 commits October 30, 2025 11:33

Revert "Fix graal 21 build"

ba3f5f5

This reverts commit a49f4d3.

Revert "Fix the native image build"

51c6c69

This reverts commit b077572

Revert "Collapse the ReferenceCounted implementations to all rely on AbstractReferenceCounted"

a4db143

This reverts commit f410cc2

franz1981 and others added 4 commits October 30, 2025 12:04

new RefCnt class

3b4f83e

Properly integrate the RefCnt into the adaptive allocator and fix native-image

06dcfef

Fix revapi warning

9b726f9

Deprecate ReferenceCountUpdater

4c09131

chrisvest force-pushed the 4.2-refcount branch from e5c3b63 to 4c09131 Compare October 30, 2025 22:49

chrisvest requested a review from franz1981 October 30, 2025 22:51

franz1981 requested changes Oct 31, 2025

Address review comments

e0ba069

franz1981 reviewed Nov 2, 2025

franz1981 requested changes Nov 2, 2025
