@chrisvest
Member

Motivation:
Our current reference counting algorithm is quite complicated because it tries to guard against word tearing of 32-bit integers, which isn't possible in Java.

The complicated code compiles to larger amounts of native code, which in turn can prevent inlining as it comes up against code size heuristics in the JIT.

Modification:
Simplify the code by removing special case checks for certain constants. Reorganize the code to use fewer branches.

Result:
Simpler code that's more JIT friendly, without loss of performance.

@chrisvest chrisvest requested a review from franz1981 October 16, 2025 18:16
@chrisvest
Member Author

@franz1981 Please try benchmarking these changes. On my M1, I get performance in AbstractReferenceCountedByteBufBenchmark that is on par with the current code.

If I change retain to use a CAS loop, its performance drops dramatically as soon as there's contention. I kept the even/odd reference counting scheme because of this, because it allows us to use the getAndAdd intrinsics, which scale much better.
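To make the even/odd scheme mentioned above concrete, here is a minimal sketch, not Netty's actual ReferenceCountUpdater. It assumes the raw field stores twice the real count and that any odd value means "released"; the class name `EvenOddRefCount` and the field `raw` are hypothetical. Under those assumptions `retain` can be a single unconditional `getAndAdd`, because adding 2 to an odd value leaves it odd, so a late retain cannot bring a released object back to life.

```java
import java.util.concurrent.atomic.AtomicIntegerFieldUpdater;

// Hypothetical sketch of an even/odd reference count; not Netty's implementation.
final class EvenOddRefCount {
    private static final AtomicIntegerFieldUpdater<EvenOddRefCount> UPDATER =
            AtomicIntegerFieldUpdater.newUpdater(EvenOddRefCount.class, "raw");

    // Raw encoding (assumption): even value 2*n means n live references, odd means released.
    private volatile int raw = 2;

    int refCnt() {
        int r = raw;
        return (r & 1) != 0 ? 0 : r >>> 1;
    }

    EvenOddRefCount retain() {
        // Unconditional add, no CAS loop: this is what lets the JIT use a getAndAdd intrinsic.
        int old = UPDATER.getAndAdd(this, 2);
        if ((old & 1) != 0) {
            // Already released. The value stays odd after adding 2, so no rollback is needed.
            throw new IllegalStateException("refCnt: 0");
        }
        if (old <= 0 || old > Integer.MAX_VALUE - 2) {
            // Overflow: undo our increment before reporting the error.
            UPDATER.getAndAdd(this, -2);
            throw new IllegalStateException("refCnt overflow");
        }
        return this;
    }

    boolean release() {
        for (;;) {
            int r = raw;
            if ((r & 1) != 0) {
                throw new IllegalStateException("refCnt: 0");
            }
            // Dropping the last reference flips the value to odd, marking the object released.
            int next = r == 2 ? 1 : r - 2;
            if (UPDATER.compareAndSet(this, r, next)) {
                return r == 2;
            }
        }
    }

    public static void main(String[] args) {
        EvenOddRefCount rc = new EvenOddRefCount();
        rc.retain();
        System.out.println(rc.refCnt());  // two live references
        rc.release();
        System.out.println(rc.release()); // last release returns true
    }
}
```

A plain counter would need a compare-and-set loop in `retain` to avoid moving a released object from 0 back to 1, which is the variant reported above as much slower under contention.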

@franz1981
Contributor

franz1981 commented Oct 16, 2025

But to be fair, does it matter?
I mean, I would optimize for the normal uncontended use cases if I have to pick a poison. This will improve the uncontended cases because the comparisons don't need to strip the parity bit and become (if predictable) super cheap.
If we want to go really fast under contention, a double-sequence scheme is the way, because the two sequences keep increasing and can each use increment-and-get separately.

@franz1981
Contributor

You can try applying this on top of my adaptive pull request and see how it performs; I didn't get rid of the chunk ref count for size-classed chunks because I was waiting for this ❤️

@chrisvest
Member Author

It wasn't faster without the parity bit on my machine.

@franz1981
Contributor

I believe it. I will try on my x86 machine as well to check how it behaves. I also have to check whether it solves the problem in the issue related to set bytes.

@chrisvest
Member Author

Also, I suspect the contended case is relevant for the chunks.

@franz1981
Contributor

franz1981 commented Oct 16, 2025

Uh, you are right. It is indeed (for shared magazines).
Although I have another solution there to get rid of it

@chrisvest
Member Author

Huh, interesting. I'm actually getting a pretty reliable JIT compiler crash on the ReferenceCountUpdater::release0 method!

@chrisvest
Member Author

Looks like I can make the compiler crash go away by collapsing the implementations down as well, so we only have AbstractReferenceCounted and no longer need the impls in Chunk and AbstractReferenceCountedByteBuf. And a few other small tweaks.

@chrisvest
Member Author

To work around the JIT bug, I've had to increase the scope of this PR to also collapse the three different ReferenceCounted implementations into one: AbstractReferenceCounted. This means that Chunk and AbstractReferenceCountedByteBuf no longer integrate with ReferenceCountUpdater directly, but instead all pull their implementations from AbstractReferenceCounted.

…ReferenceCounted

And also work around what appears to be a C2 JIT compiler bug in Java 17 through 25+.
}
}

private static class Chunk extends AbstractReferenceCounted implements ChunkInfo {
Contributor

@franz1981 franz1981 Oct 17, 2025

Consider that with 8953bbe I plan to make Chunk's reference count exist only for the BumpChunk (bad name, I know :"()

Member Author

Yeah. And this is also breaking the graal build as-is… need to think about what to do here.

Contributor

Check franz1981#1

@franz1981
Contributor

The idea of collapsing the hierarchy is very similar to what @yawkat has done on franz1981#1 and will likely avoid the problem of bimorphic inlining + fat compiled methods observed in #15736 (comment).
But the only way to know is to try this change on #15736 as well.

@franz1981
Contributor

franz1981 commented Oct 17, 2025

I have cherry-picked both your changes on top of #15741, getting

Benchmark                                               (allocatorType)  (pollutionIterations)  Mode  Cnt   Score   Error  Units
ByteBufAllocatorAllocPatternBenchmark.directAllocation         ADAPTIVE                      0  avgt  100  29.725 ± 0.261  ns/op
ByteBufAllocatorAllocPatternBenchmark.directAllocation         ADAPTIVE                 200000  avgt  100  32.376 ± 0.453  ns/op

whilst my last commit was delivering

Benchmark                                               (allocatorType)  (pollutionIterations)  Mode  Cnt   Score   Error  Units
ByteBufAllocatorAllocPatternBenchmark.directAllocation         ADAPTIVE                      0  avgt  100  27.484 ± 0.251  ns/op
ByteBufAllocatorAllocPatternBenchmark.directAllocation         ADAPTIVE                 200000  avgt  100  32.874 ± 0.585  ns/op

which shows a slightly regressed performance in the non-"polluted" case

With the change at #15764 (comment) the performance is (nearly) restored:

Benchmark                                               (allocatorType)  (pollutionIterations)  Mode  Cnt   Score   Error  Units
ByteBufAllocatorAllocPatternBenchmark.directAllocation         ADAPTIVE                      0  avgt  100  28.091 ± 0.431  ns/op
ByteBufAllocatorAllocPatternBenchmark.directAllocation         ADAPTIVE                 200000  avgt  100  32.044 ± 0.457  ns/op

Adding #15764 (comment) too (on top of the previous one) completely restores the performance:

Benchmark                                               (allocatorType)  (pollutionIterations)  Mode  Cnt   Score   Error  Units
ByteBufAllocatorAllocPatternBenchmark.directAllocation         ADAPTIVE                      0  avgt  100  27.636 ± 0.142  ns/op
ByteBufAllocatorAllocPatternBenchmark.directAllocation         ADAPTIVE                 200000  avgt  100  31.993 ± 0.531  ns/op

I suspect we can do better on retain0 as well, since we can optimize for the not-taken branch there too (i.e. no thrown exception). I will post a suggestion for that one too.
That should help the chunk and retained-slice use cases (e.g. for HTTP/2).

@chrisvest
Member Author

@franz1981 I applied the optimizations you suggested. They don't seem to make any noticeable difference on M1, but if they help x86 we'll of course take that win.

@franz1981
Contributor

franz1981 commented Oct 28, 2025

@normanmaurer how does this look?

https://github.com/franz1981/netty/commits/4.2-new-refcount/

  1. it implements a separate RefCnt class
  2. it allocates it where it is needed.
  3. it has 3 separate concrete sub-classes which are NEVER used together in the same VM

The downside is the memory footprint (+16 bytes for the RefCnt) and the indirection, which is not great, but I can run a few benchmarks to check this.
To be fair, I have a strong feeling that with repeated memory accesses the refCnt load will be hoisted (with no need for guards, because there MUST be only a single concrete type loaded at runtime) and, further, the refCnt.value offset too, because repeated buffer accesses won't involve volatile loads due to how RefCnt::isLiveNonVolatile is implemented.

Another option is what @yawkat has done in a PR on my Netty repo - which is not so distant from what's proposed by this PR.

There's still some code duplication (which TBH could be removed with some effort), but I preferred to leave it because it makes life much easier for both native image and the JIT :)

@franz1981 franz1981 force-pushed the 4.2-refcount branch 2 times, most recently from 0e655be to 4e306ad Compare October 28, 2025 13:37
@normanmaurer
Member

@franz1981 if the performance is improved by your implementation I would prefer this one as it seems a lot cleaner etc.

@normanmaurer
Member

@franz1981 do we even need all the other changes with what you did now ?

@franz1981
Contributor

franz1981 commented Oct 28, 2025

@normanmaurer

I've created https://github.com/franz1981/netty/tree/4.2-refcnt-http2-setbytes-after-new-refcnt to measure the proposed changes on #15764 (comment) against #15764 (comment)

Benchmark                                       (padding)  (payloadSize)  (pooled)  Mode  Cnt   Score   Error  Units
Http2FrameWriterDataBenchmark.newWriter                 0            128      true  avgt    5  58.290 ± 0.247  ns/op

I haven't measured the unsafe performance, how much we lose, or other benchmarks.

@franz1981
Contributor

franz1981 commented Oct 28, 2025

Another option is https://gist.github.com/franz1981/e81b474c415b58e2601407fa254dfc00, which would let any child class that wants to embed the int field extend it and implement ReferenceCounted.
It will create some clashes due to the method names, but that is solvable...

@chrisvest
Member Author

chrisvest commented Oct 28, 2025

@rwestrel The first commit, fafa429, might be of interest to the JDK team because it managed to find a compiler crash in C2. I can repro it reliably on JDK 23, 24, 25, by just running the tests in the netty-buffer module.
Here's the JDK 25 crash log hs_err_pid4992.log

@chrisvest
Member Author

Looking at franz1981#1 the ReferenceCountHolder apparently does not need to be initialized at runtime (a Graal 21 build passes with it), and that gets us around a lot of the changes. It's not obvious to me how it can get away with that. I'll have to look into that more.

@franz1981
Contributor

franz1981 commented Oct 29, 2025

@chrisvest @normanmaurer let me try to summarize my thoughts on this

Problem Summary

ReferenceCountUpdater was designed for flexibility — any class can provide its own int field and use the shared updater.

  • AbstractReferenceCounted defines its own refCnt field and updater.
  • ByteBuf implements ReferenceCounted but does not extend AbstractReferenceCounted.
  • AbstractByteBuf has no ref count field, but its subclass AbstractReferenceCountedByteBuf defines one (with its own updater).
  • Some subclasses (e.g. AbstractUnpooledSlicedByteBuf) are ReferenceCounted but don’t need a ref count field.

This flexibility prevents ReferenceCountUpdater from being optimized via VarHandle.

Observed issue in the adaptive allocator:

  • AdaptiveByteBuf → extends AbstractReferenceCountedByteBuf (own updater).
  • Chunk → implements ReferenceCounted (own updater).

Two updater subclasses hurt ref count performance.
Making Chunk extend AbstractReferenceCounted doesn’t help — each uses a different updater/field.

Option 1 — Make ByteBuf extend AbstractReferenceCounted

Both adaptive chunks and buffers would share the same field/updater.

Pros:

  • no changes in memory footprint for the common case i.e. AbstractReferenceCountedByteBuf children

Cons:

  • Some ByteBuf types would carry an unused refCnt field.
  • Users could still introduce their own updaters, harming performance.

Option 2 — Replace updaters with a less flexible RefCnt helper

Introduce a simple RefCnt holder (just an int field + static utility methods, no virtuals).

Two variants:

  • Option (2.A): Keep the current hierarchy — make AbstractReferenceCountedByteBuf hold a RefCnt field.
  • Option (2.B): Let ByteBuf extend RefCnt directly (similar to Option 1)

Pros:

  • Extending RefCnt won't affect its performance

Cons:

  • Option (2.A): adds one level of indirection and small memory overhead in a common use case
  • Option (2.B): adds an int refCnt field even when unused

TL;DR:

Ref count flexibility currently hurts performance.
Unifying the updater or introducing a more rigid RefCnt holder fixes the issue.

An example of Option 2.A is at franz1981@ed468f2

By doing what @chrisvest has done at f410cc2 (but with RefCnt), it is possible to implement Option 2.B with ease.

NOTE: Option 2 is not the same as franz1981#1 because we are removing the flexibility for users to act on user-defined int ref cnt fields!
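The difference between the two variants of Option 2 can be sketched roughly as follows. This is an illustration only, not the code in the linked commits: `RefCntSketch`, `HoldingBuf` and `ExtendingBuf` are hypothetical names, the counter uses plain counting rather than Netty's real algorithm, and the helper signatures are assumptions. The point is that the holder owns the only int field and exposes static helpers, so there is a single access path for the JIT to optimize.

```java
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;

// Hypothetical holder: one int field plus static helpers, no virtual methods.
// Plain counting is used for brevity; this is not Netty's algorithm.
class RefCntSketch {
    private static final VarHandle VALUE;

    static {
        try {
            VALUE = MethodHandles.lookup()
                    .findVarHandle(RefCntSketch.class, "value", int.class);
        } catch (ReflectiveOperationException e) {
            throw new ExceptionInInitializerError(e);
        }
    }

    private volatile int value = 1;

    static int refCnt(RefCntSketch ref) {
        return (int) VALUE.getAcquire(ref);
    }

    static void retain(RefCntSketch ref) {
        int old = (int) VALUE.getAndAdd(ref, 1);
        if (old <= 0) {
            int ignored = (int) VALUE.getAndAdd(ref, -1); // roll back the increment
            throw new IllegalStateException("refCnt: 0");
        }
    }

    static boolean release(RefCntSketch ref) {
        int old = (int) VALUE.getAndAdd(ref, -1);
        if (old <= 0) {
            int ignored = (int) VALUE.getAndAdd(ref, 1); // roll back the decrement
            throw new IllegalStateException("refCnt: 0");
        }
        return old == 1;
    }
}

// Option 2.A (sketch): keep the hierarchy and hold the counter in a final field.
// Costs one extra object and one indirection per buffer.
final class HoldingBuf {
    private final RefCntSketch refCnt = new RefCntSketch();

    int refCnt() { return RefCntSketch.refCnt(refCnt); }
    void retain() { RefCntSketch.retain(refCnt); }
    boolean release() { return RefCntSketch.release(refCnt); }
}

// Option 2.B (sketch): extend the holder, so the int field lives in the buffer itself.
// No indirection, but every subclass carries the field even if it never uses it.
final class ExtendingBuf extends RefCntSketch {
    int refCnt() { return RefCntSketch.refCnt(this); }
    void retain() { RefCntSketch.retain(this); }
    boolean release() { return RefCntSketch.release(this); }
}

final class RefCntOptionsDemo {
    public static void main(String[] args) {
        HoldingBuf a = new HoldingBuf();
        a.retain();
        System.out.println(a.refCnt());
        ExtendingBuf b = new ExtendingBuf();
        System.out.println(b.release());
    }
}
```

In both variants user code can no longer plug in its own field and updater, which is the flexibility the note above says Option 2 removes.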

@normanmaurer
Member

normanmaurer commented Oct 29, 2025

I am in favor of Option (2.A).

@rwestrel

@rwestrel The first commit, fafa429, might be of interest to the JDK team because it managed to find a compiler crash in C2. I can repro it reliably on JDK 23, 24, 25, by just running the tests in the netty-buffer module. Here's the JDK 25 crash log hs_err_pid4992.log

FYI, I filed an openjdk bug: https://bugs.openjdk.org/browse/JDK-8370939

@normanmaurer
Member

@rwestrel thanks for linking it here.

@chrisvest
Member Author

Ok, I'll try implementing option 2.A, following the example in franz1981@ed468f2

@chrisvest
Member Author

@franz1981 @normanmaurer Please take a look.

Contributor

@franz1981 franz1981 left a comment

In terms of performance it will have some impact on the most common case (as explained in the description I already gave), but it likely enables us, in a follow-up PR, to re-enable VarHandle for native image as well.

I will run a few benchmarks, likely early next week; in any case, we still have Option 2.B, which would remove any chance of regression, with @normanmaurer's blessing.

super(maxCapacity);
updater.setInitialValue(this);
// this is setting the ref cnt to the initial value
refCnt = new RefCnt();
Contributor

@franz1981 franz1981 Nov 2, 2025

I have the feeling (to be verified with a simple test) that, despite being the right way to write this code, having a super constructor that could observe refCnt as null prevents the JIT from using implicit null checks: https://shipilev.net/jvm/anatomy-quarks/25-implicit-null-checks/

I have indeed verified that resetRefCnt emits this (for the adaptive benchmark):

            0x00007f70746c1d7b:   mov    0x20(%r8),%r10d              ;*getfield refCnt {reexecute=0 rethrow=0 return_oop=0}
                                                                      ; - io.netty.buffer.AbstractReferenceCountedByteBuf::resetRefCnt@1 (line 57)
   0.02%    0x00007f70746c1d7f:   test   %r10d,%r10d                  ; NO implicit NULL check here :"(
   0.21%    0x00007f70746c1d82:   je     0x00007f70746c2fcc
            0x00007f70746c1d88:   movl   $0x2,0xc(%r12,%r10,8)        ;*invokevirtual putIntRelease {reexecute=0 rethrow=0 return_oop=0}
                                                                      ; - java.lang.invoke.VarHandleInts$FieldInstanceReadWrite::setRelease@24 (line 170)
                                                                      ; - java.lang.invoke.VarHandleGuards::guard_LI_V@45 (line 122)
                                                                      ; - io.netty.util.internal.RefCnt$VarHandleRefCnt::resetRefCnt@5 (line 378)
                                                                      ; - io.netty.util.internal.RefCnt::resetRefCnt@36 (line 236)
                                                                      ; - io.netty.buffer.AbstractReferenceCountedByteBuf::resetRefCnt@4 (line 57)

which shows that moving to a separate refCnt field adds a few costs.

This is not terrible, but that's why Option 2.B is more appealing in terms of cost.

@normanmaurer too
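As background for the remark about a super constructor observing refCnt as null, the sketch below shows the Java language behavior involved. It says nothing about what the JIT does with it. The class names `BaseBuf` and `CountedBuf` and the hook `onConstruct` are hypothetical; the only assumption is a superclass constructor that calls an overridable method before the subclass constructor body has assigned the field.

```java
// Hypothetical base class whose constructor calls into the subclass.
abstract class BaseBuf {
    BaseBuf() {
        // Runs before the subclass constructor body has assigned any fields.
        onConstruct();
    }

    abstract void onConstruct();
}

final class CountedBuf extends BaseBuf {
    private final Object refCnt;
    // No initializer on purpose, so the value written during super() is kept.
    boolean sawNullDuringSuper;

    CountedBuf() {
        super();
        refCnt = new Object(); // assigned only after super() has returned
    }

    @Override
    void onConstruct() {
        // Even though refCnt is final, it still holds its default value (null) here.
        sawNullDuringSuper = refCnt == null;
    }

    boolean hasRefCnt() {
        return refCnt != null;
    }

    public static void main(String[] args) {
        CountedBuf buf = new CountedBuf();
        System.out.println(buf.sawNullDuringSuper); // prints "true"
        System.out.println(buf.hasRefCnt());        // prints "true"
    }
}
```

So a final field assigned in the constructor is never null for callers after construction, yet code reachable from the superclass constructor can still see it as null.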

// Value might not equal "real" reference count, all access should be via the updater
@SuppressWarnings({"unused", "FieldMayBeFinal"})
private volatile int refCnt;
private final RefCnt refCnt;
Contributor

Suggested change
private final RefCnt refCnt;
private final RefCnt refCnt = new RefCnt();

magazine = null;
allocator = null;
chunkReleasePredicate = null;
refCnt = null;
Contributor

Read https://github.com/netty/netty/pull/15764/files#r2484854800: we can still allocate the RefCnt here regardless, in the constructor.
This "should" save the JIT from emitting a null check.

protected Magazine magazine;
private final AdaptivePoolingAllocator allocator;
private final ChunkReleasePredicate chunkReleasePredicate;
private final RefCnt refCnt;
Contributor

Suggested change
private final RefCnt refCnt;
// we always populate the refCnt field to save Hotspot to emit `null` checks
private final RefCnt refCnt = new RefCnt();
