8338379: Accesses to class init state should be properly synchronized #21110
Conversation
---
👋 Welcome back shade! A progress list of the required criteria for merging this PR into the target branch will be added to the body of your pull request.
---
@shipilev This change now passes all automated pre-integration checks. ℹ️ This project also has non-automated pre-integration requirements; please see the file CONTRIBUTING.md for details. After integration, the commit message for the final commit will be generated from the PR metadata; you can use pull request commands such as /summary, /contributor and /issue to adjust it as needed. At the time this comment was updated, 4 new commits had been pushed to the target branch; please see the up-to-date comparison between the source branch of this pull request and the target branch. ➡️ To integrate this PR with the above commit message, type /integrate in a new comment.
---
This seems far more extensive than what was discussed. Code that takes the lock-free path to check is_initialized is what I thought we agreed needed the acquire/release, not every read of the state variable. This code will be executed a lot, and in 99.99% of cases the memory barriers are not needed.
This just extends the architectural parts of the patch we agreed on with @coleenp for the fix. Which parts do you think are excessive? The acquires in …
---
The problem is we have completely different code paths that look at the different states of a class (loaded, linked, initialized, in-error), and those actions use different locks. This issue was, I thought, only about the lock-free fast-paths checking the "is initialized" state, not anything else. These extra barriers could be completely redundant for the "is loaded", "is linked" or "is in error" checks.
---
I like this patch.
Right. I chose this code shape to make sure we cover all paths that poll the init state. I would not discount the possibility that something somewhere depends on pre-fully-initialized states to publish the intermediate class state. Looking around, I see some interesting uses in … It feels much safer to be extra paranoid here.
---
Well, I don't like "paranoid" code when it comes to concurrency, for the reason I already gave. I think part of the problem here is that so many different locks are involved in the different stages of class loading, linking and initialization that it can be unclear, when you've zoomed in, exactly which lock should be part of the code path you're dealing with (e.g. the loader constraint table code is protected by the SD lock, so the checking of the is_loaded state is not lock-free).
But this code is functionally correct, so the only potential harm here (other than complicating code understanding) is to performance, which we will just have to keep an eye on.
FYI I'm away for the next couple of days.
---
I was looking through the code: we set the "loaded" state under the Compile_lock (because of dependencies in add_to_hierarchy), and we set "linked", "being_initialized", "fully_initialized" and "initialization_error" under the init_lock object (which I want to change again), with a notify for the latter two. Using a load_acquire to examine the state (and a release_store to write it) seems like the right thing to do, because there isn't just one lock, so we should assume reading this state is lock-free. It looks like the C2 code optimizes away the clinit_barrier when possible, so we can watch for any performance difference, but I'd still rather have safety.
---
I am running performance tests with it, and expect no difference, given that JITed code normally knows classes are initialized at JIT compilation time. The impact on interpreter paths is likely not visible either. If you can run your set of benchmarks, please do as well.
Ok.
---
My benchmarking showed only the normal jitter and no regressions. |
---
Our performance tests show no effect either. So I guess we are fine. I would like platform maintainers to look at the relevant parts: @RealFYang, @TheRealMDoerr, @RealLucy, @offamitkumar?
---
Hi, thanks for the ping. The RISC-V part of the change looks fine. No obvious change witnessed in SPECjbb numbers.
---
I can't see any regression on s390x either. @RealLucy, maybe a quick look?
---
Which benchmarks did you use? Is there any microbenchmark for class initialization? Is this one interesting? https://github.com/clojure/test.benchmark
Our usual corpus of industry-standard benchmarks, like DaCapo, SPECjbb, etc. I don't think we have a microbenchmark that targets class loading specifically.
---
I've run some of these benchmarks on PPC64le and couldn't spot a regression, but the results are not very stable, and I guess they are not very sensitive to class initialization.
(Note that membar_storestore is cumulative and includes loadstore ordering on PPC64.)
Well, that's the thing: if the compiler does not know the class is initialized, it emits the runtime check for class initialization. Here, in jdk/src/hotspot/share/c1/c1_LIRGenerator.cpp (lines 670 to 671 at 65200a9). In the generated code, we come to jdk/src/hotspot/cpu/ppc/c1_LIRAssembler_ppc.cpp (lines 2275 to 2277 at f554c3f).
If we were the only thread, it would not have been a problem: on first entry we would have called the stub, initialized the class and completed the allocation there. Next time around, we would have passed the fast-path check. But the caveat we are handling in this PR is that some other thread might have completed the class initialization, and we need to make sure this thread sees the class state consistently. It is not only about the state of _init_state itself, but about everything the class initializer stored before publishing it. To achieve that, the initializing thread would do a release-store for _init_state, and pollers would do acquire-loads. Makes sense?
---
Thanks for the explanation. This makes sense. Nevertheless, the aforementioned membar_storestore follows the allocation immediately, so it might already provide the acquire ordering we need on PPC64.
OK, I see what you are getting at. But isn't that barrier still too late?
---
The point is that compiled …
---
All right, granted. We can make an argument that a release store to … Given that we expect no perf problems on this seemingly rare path, I prefer not to go into exploiting those specifics, unless you feel strongly otherwise :)
---
I've done a bit of research, and it seems like the C2 clinit barrier is only used very rarely, in a corner case, while the C1 parts are not so infrequently used. Peak performance doesn't seem to be affected, so I don't see any reason for optimizing C2 either. The shared code LGTM. The more frequently used parts are in platform-specific code, so it might make sense to optimize the PPC64 parts. Also note that the "isync trick" is a faster acquire barrier than "lwsync". What do you think about this?
diff --git a/src/hotspot/cpu/ppc/c1_LIRAssembler_ppc.cpp b/src/hotspot/cpu/ppc/c1_LIRAssembler_ppc.cpp
index 61f654c9cfa..684c06614a9 100644
--- a/src/hotspot/cpu/ppc/c1_LIRAssembler_ppc.cpp
+++ b/src/hotspot/cpu/ppc/c1_LIRAssembler_ppc.cpp
@@ -2274,7 +2274,7 @@ void LIR_Assembler::emit_alloc_obj(LIR_OpAllocObj* op) {
}
__ lbz(op->tmp1()->as_register(),
in_bytes(InstanceKlass::init_state_offset()), op->klass()->as_register());
- __ lwsync(); // acquire
+ // acquire barrier included in membar_storestore() which follows the allocation immediately.
__ cmpwi(CCR0, op->tmp1()->as_register(), InstanceKlass::fully_initialized);
__ bc_far_optimized(Assembler::bcondCRbiIs0, __ bi0(CCR0, Assembler::equal), *op->stub()->entry());
}
diff --git a/src/hotspot/cpu/ppc/macroAssembler_ppc.cpp b/src/hotspot/cpu/ppc/macroAssembler_ppc.cpp
index e73e617b8ca..bf2b2540e35 100644
--- a/src/hotspot/cpu/ppc/macroAssembler_ppc.cpp
+++ b/src/hotspot/cpu/ppc/macroAssembler_ppc.cpp
@@ -2410,7 +2410,7 @@ void MacroAssembler::verify_secondary_supers_table(Register r_sub_klass,
void MacroAssembler::clinit_barrier(Register klass, Register thread, Label* L_fast_path, Label* L_slow_path) {
assert(L_fast_path != nullptr || L_slow_path != nullptr, "at least one is required");
- Label L_fallthrough;
+ Label L_check_thread, L_fallthrough;
if (L_fast_path == nullptr) {
L_fast_path = &L_fallthrough;
} else if (L_slow_path == nullptr) {
@@ -2419,11 +2419,14 @@ void MacroAssembler::clinit_barrier(Register klass, Register thread, Label* L_fa
// Fast path check: class is fully initialized
lbz(R0, in_bytes(InstanceKlass::init_state_offset()), klass);
- lwsync(); // acquire
+ // acquire by cmp-branch-isync if fully_initialized
cmpwi(CCR0, R0, InstanceKlass::fully_initialized);
- beq(CCR0, *L_fast_path);
+ bne(CCR0, L_check_thread);
+ isync();
+ b(*L_fast_path);
// Fast path check: current thread is initializer thread
+ bind(L_check_thread);
ld(R0, in_bytes(InstanceKlass::init_thread_offset()), klass);
cmpd(CCR0, thread, R0);
if (L_slow_path == &L_fallthrough) {
---
I don't mind, and what you say as maintainer of PPC64 code goes :) I merged the patch in this PR, thanks. |
---
Thanks all for reviews. If there are no other comments, I'll integrate soon. |
---
/integrate |
---
Going to push as commit 6600161.
Your commit was automatically rebased without conflicts.
---
See the bug for the discussion. We have not seen clear evidence that this is the problem in the field, nor were we able to come up with a reproducer. We found this gap by inspecting the code while chasing a production bug.
In short, `InstanceKlass::_init_state` is used as the "witness" for initialized class state. When class initialization completes, it needs to publish the class state by writing `_init_state = _fully_initialized` with release semantics. Various accessors that poll `IK::_init_state`, looking for class initialization to complete, need to read the field with acquire semantics. This is where the change fans out, touching VM, interpreter and compiler paths that e.g. implement clinit barriers. In some cases in assembler code, we can rely on the hardware memory model to do what we need (i.e. acquire barriers/fences are nops).
I made my best guess at what the ARM32, S390X, PPC64 and RISC-V code should look like, based on what related code does for volatile loads. It would be good if port maintainers could sanity-check those.
Additional testing: all