-
-
Notifications
You must be signed in to change notification settings - Fork 16.2k
Simplify reference counting #15764
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: 4.2
Are you sure you want to change the base?
Simplify reference counting #15764
Conversation
Motivation: Our current reference counting algorithm is quite complicated, trying to guard for word-tearing of 32-bit integers, which isn't possible in Java. The complicated code compiles to larger amounts of native code, which in turn can prevent inlining as it comes up against code size heuristics in the JIT. Modification: Simplify the code by removing special case checks for certain constants. Reorganize the code to use fewer branches Result: Simpler code that's more JIT friendly, without loss of performance.
|
@franz1981 Please try benchmarking these changes. On my M1, I get performance in If I change |
|
But to be fair, it matters? |
|
You can try applying this on top of my adaptive pull and see how to perform since I didn't get rid of the chunk ref count for size classed because I was waiting this ❤️ |
|
It wasn't faster without the parity bit on my machine. |
|
I believe it - I will try on my x86 as well to check how it behaves. I have to check to if it solves the problem in the issue related set bytes too |
|
Also, I suspect the contended case is relevant for the chunks. |
|
Uh, you are right. It is indeed (for shared magazines). |
|
Huh, interesting. I'm actually getting pretty reliable JIT compiler crash on the |
|
Looks like I can make the compiler crash go away by collapsing the implementations down as well, so we only have AbstractReferenceCounted and no longer need the impls in Chunk and AbstrsctReferenceCountedByteBuf. And a few other small tweaks. |
4f58b3e to
c6afb03
Compare
|
To work-around the JIT bug, I've had to increase the scope of this PR to also collapse the three different |
…ReferenceCounted And also work-around what appears to be a C2 JIT compiler bug in Java 17 through 25+.
c6afb03 to
f410cc2
Compare
| } | ||
| } | ||
|
|
||
| private static class Chunk extends AbstractReferenceCounted implements ChunkInfo { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Consider that w 8953bbe I plan to make Chunk's reference count to exist only for the BumpChunk (bad name I know :"()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah. And this is also breaking the graal build as-is… need to think about what to do here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Check franz1981#1
|
The idea of collapsing the hierarchy is very similar to what @yawkat has done on franz1981#1 and will likely avoid the problem of bimoprhic inlining + fat compiled methods observed in #15736 (comment). |
|
I have cherry picked both your changes on top of #15741 getting whilst my last commit was delivering which shows a slightly regressed performance in the non-"polluted" case With the change at #15764 (comment) the performance is (nearly) restored: Adding #15764 (comment) too (to the previous one) completely restore the performance: I have the suspect we can get better on |
|
@franz1981 I applied the optimizations you suggested. They don't seem to make any noticeable difference on M1, but if they help x86 we'll of course take that win. |
78ccf85 to
b077572
Compare
|
@normanmaurer how this look like? https://github.com/franz1981/netty/commits/4.2-new-refcount/
The downside is the mem footprint (+16 bytes of Another option is what @yawkat has done in a PR on my Netty repo - which is not so distant from what's proposed by this PR. There's still some code duplication (which TBH could be removed, with some effort), but I preferred to leave it because makes the life of both native image and JIT much easier :) |
0e655be to
4e306ad
Compare
|
@franz1981 if the performance is improved by your implementation I would prefer this one as it seems a lot cleaner etc. |
|
@franz1981 do we even need all the other changes with what you did now ? |
|
I've created https://github.com/franz1981/netty/tree/4.2-refcnt-http2-setbytes-after-new-refcnt to measure the proposed changes on #15764 (comment) against #15764 (comment) I haven't measured the |
|
Another option is https://gist.github.com/franz1981/e81b474c415b58e2601407fa254dfc00 which would enable its child which want to embedd the |
|
@rwestrel The first commit, fafa429, might be of interest to the JDK team because it managed to find a compiler crash in C2. I can repro it reliably on JDK 23, 24, 25, by just running the tests in the |
|
Looking at franz1981#1 the |
|
@chrisvest @normanmaurer let me try to summarize my thoughts on this Problem Summary
This flexibility prevents Observed issue in the adaptive allocator:
Two updaters subclasses hurt ref count performance. Option 1 — Make
|
|
I am in favor of Option (2.A). |
FYI, I filed an openjdk bug: https://bugs.openjdk.org/browse/JDK-8370939 |
|
@rwestrel thanks for linking it here. |
|
Ok, I'll try implementing option 2.A, following the example in franz1981@ed468f2 |
e5c3b63 to
4c09131
Compare
|
@franz1981 @normanmaurer Please take a look. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In term of performance it will have some impact for the most common case (as explained in the description I already gave), but in a follow up PR it enables us to likely re-enable VarHandle for native image as well.
I will run few benchmarks, likely early next week; in case, we still have the option B which would remove any chance for regression - with @normanmaurer blessing
| super(maxCapacity); | ||
| updater.setInitialValue(this); | ||
| // this is setting the ref cnt to the initial value | ||
| refCnt = new RefCnt(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have the feeling (i.e. to verify w a simple test) that despite been the right way for this code, having a super which could observe the refCnt as null is causing the JIT not be able to use https://shipilev.net/jvm/anatomy-quarks/25-implicit-null-checks/
I have verified indeed that resetRefCnt is emitting (for the adaptive benchmark), this
0x00007f70746c1d7b: mov 0x20(%r8),%r10d ;*getfield refCnt {reexecute=0 rethrow=0 return_oop=0}
; - io.netty.buffer.AbstractReferenceCountedByteBuf::resetRefCnt@1 (line 57)
0.02% 0x00007f70746c1d7f: test %r10d,%r10d ; NO implicit NULL check here :"(
0.21% 0x00007f70746c1d82: je 0x00007f70746c2fcc
0x00007f70746c1d88: movl $0x2,0xc(%r12,%r10,8) ;*invokevirtual putIntRelease {reexecute=0 rethrow=0 return_oop=0}
; - java.lang.invoke.VarHandleInts$FieldInstanceReadWrite::setRelease@24 (line 170)
; - java.lang.invoke.VarHandleGuards::guard_LI_V@45 (line 122)
; - io.netty.util.internal.RefCnt$VarHandleRefCnt::resetRefCnt@5 (line 378)
; - io.netty.util.internal.RefCnt::resetRefCnt@36 (line 236)
; - io.netty.buffer.AbstractReferenceCountedByteBuf::resetRefCnt@4 (line 57)which shows that moving to a separate refCnt field add few costs:
- reading a field (with an embedded
int refCntthat won't happen too) - 1 test + branch for the null check (predictable, but still...)
- decoding the compressed oop on the fly while setting the value to
2(see https://discord.com/channels/@me/1418816637674459157/1434550833768304710)
This is not terrible, but that's why the Option 2.B is more appealing in term of cost
@normanmaurer too
| // Value might not equal "real" reference count, all access should be via the updater | ||
| @SuppressWarnings({"unused", "FieldMayBeFinal"}) | ||
| private volatile int refCnt; | ||
| private final RefCnt refCnt; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| private final RefCnt refCnt; | |
| private final RefCnt refCnt = new RefCnt(); |
| magazine = null; | ||
| allocator = null; | ||
| chunkReleasePredicate = null; | ||
| refCnt = null; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Read https://github.com/netty/netty/pull/15764/files#r2484854800: we can still allocate the RefCnt here regardless, in the constructor.
This "should" save the JIT to emit a null check
| protected Magazine magazine; | ||
| private final AdaptivePoolingAllocator allocator; | ||
| private final ChunkReleasePredicate chunkReleasePredicate; | ||
| private final RefCnt refCnt; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| private final RefCnt refCnt; | |
| // we always populate the refCnt field to save Hotspot to emit `null` checks | |
| private final RefCnt refCnt = new RefCnt(); |
Motivation:
Our current reference counting algorithm is quite complicated, trying to guard for word-tearing of 32-bit integers, which isn't possible in Java.
The complicated code compiles to larger amounts of native code, which in turn can prevent inlining as it comes up against code size heuristics in the JIT.
Modification:
Simplify the code by removing special case checks for certain constants. Reorganize the code to use fewer branches
Result:
Simpler code that's more JIT friendly, without loss of performance.