
Conversation

@shipilev
Member

@shipilev shipilev commented Sep 20, 2024

See the bug for the discussion. We have not seen clear evidence that this is the problem in the field, nor were we able to come up with a reproducer. We found this gap by inspecting the code while chasing a production bug.

In short, InstanceKlass::_init_state is used as the "witness" for initialized class state. When class initialization completes, it needs to publish the class state by writing _init_state = _fully_initialized with release semantics.

Various accessors that poll IK::_init_state, looking for class initialization to complete, need to read the field with acquire semantics. This is where the change fans out, touching VM, interpreter and compiler paths that e.g. implement clinit barriers. In some cases in assembler code, we can rely on the hardware memory model to do what we need (i.e. acquire barriers/fences are no-ops).

I made my best guess at what the ARM32, S390X, PPC64 and RISC-V code should look like, based on what related code does for volatile loads. It would be good if port maintainers could sanity-check those.

Additional testing:

  • Linux x86_64 server fastdebug, all
  • Linux AArch64 server fastdebug, all
  • GHA to test platform buildability + adhoc platform cross-compilation

Progress

  • Change must be properly reviewed (1 review required, with at least 1 Reviewer)
  • Change must not contain extraneous whitespace
  • Commit message must refer to an issue

Issue

  • JDK-8338379: Accesses to class init state should be properly synchronized (Bug - P4)

Reviewers

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/21110/head:pull/21110
$ git checkout pull/21110

Update a local copy of the PR:
$ git checkout pull/21110
$ git pull https://git.openjdk.org/jdk.git pull/21110/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 21110

View PR using the GUI difftool:
$ git pr show -t 21110

Using diff file

Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/21110.diff

Webrev

Link to Webrev Comment

@bridgekeeper

bridgekeeper bot commented Sep 20, 2024

👋 Welcome back shade! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

@openjdk

openjdk bot commented Sep 20, 2024

@shipilev This change now passes all automated pre-integration checks.

ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details.

After integration, the commit message for the final commit will be:

8338379: Accesses to class init state should be properly synchronized

Reviewed-by: mdoerr, dholmes, coleenp, fyang, amitkumar

You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed.

At the time when this comment was updated there had been 4 new commits pushed to the master branch:

  • bfdeb33: 8340332: Open source mixed AWT tests - Set3
  • a6b3188: 8337632: AES-GCM Algorithm optimization for x86_64
  • 5586f83: 8341064: Define anchor point and index term for "wrapper classes"
  • 4168faf: 8341100: Add index entries for terms used in java.lang.Class

Please see this link for an up-to-date comparison between the source branch of this pull request and the master branch.
As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid this automatic rebasing, please check the documentation for the /integrate command for further details.

➡️ To integrate this PR with the above commit message to the master branch, type /integrate in a new comment.

@openjdk

openjdk bot commented Sep 20, 2024

@shipilev The following labels will be automatically applied to this pull request:

  • graal
  • hotspot

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing lists. If you would like to change these labels, use the /label pull request command.

@openjdk openjdk bot added graal graal-dev@openjdk.org hotspot hotspot-dev@openjdk.org labels Sep 20, 2024
@shipilev shipilev marked this pull request as ready for review September 21, 2024 06:02
@openjdk openjdk bot added the rfr Pull request is ready for review label Sep 21, 2024
@mlbridge

mlbridge bot commented Sep 21, 2024

Webrevs

Member

@dholmes-ora dholmes-ora left a comment


Seems far more extensive than what was discussed. Code that takes the lock-free path to check is_initialized is what I thought we agreed needed the acquire/release, not every read of the state variable. This code will be executed a lot, and in 99.99% of cases the memory barriers are not needed.

@shipilev
Member Author

Seems far more extensive than what was discussed. Code that takes the lock-free path to check is_initialized is what I thought we agreed needed the acquire/release, not every read of the state variable. This code will be executed a lot, and in 99.99% of cases the memory barriers are not needed.

This just extends the architectural parts of the patch we agreed on with @coleenp for the fix. Which parts do you think are excessive? The acquires in instanceKlass.hpp? It would be hard to track which of those are used without a lock, I think.

@dholmes-ora
Member

The problem is that we have completely different code paths that look at the different states of a class (loaded, linked, initialized, in-error), and those actions use different locks. This issue was, I thought, only about the lock-free fast paths checking the "is initialized" state, not anything else. These extra barriers could be completely redundant for "is loaded", "is linked" or "is in error" checks.

Contributor

@coleenp coleenp left a comment


I like this patch.

@openjdk openjdk bot added the ready Pull request is ready to be integrated label Sep 23, 2024
@shipilev
Member Author

The problem is that we have completely different code paths that look at the different states of a class (loaded, linked, initialized, in-error), and those actions use different locks. This issue was, I thought, only about the lock-free fast paths checking the "is initialized" state, not anything else. These extra barriers could be completely redundant for "is loaded", "is linked" or "is in error" checks.

Right. I chose this code shape to make sure we cover all paths that poll init_state, for extra safety. We could, in principle, protect only the is_initialized() path with the acquire. But I think we would then start to depend on downstream code not doing "smart" things that bypass that check, for example polling is_loaded() { _init_state >= loaded } to implicitly (and too optimistically) check for init_state >= fully_initialized, or even doing init_state() > being_initialized somewhere.

I would not discount the possibility that something somewhere would depend on pre-fully-initialized states to publish the intermediate class state. Looking around, I see some interesting uses in InstanceKlass::methods_do, ClassLoaderData::methods_do, ClassLoaderData::loaded_classes_do, LoaderConstraintTable::find_constrained_klass, ...

It feels much safer to be extra paranoid here.

Member

@dholmes-ora dholmes-ora left a comment


Well, I don't like "paranoid" code when it comes to concurrency, for the reason I already gave. I think part of the problem here is that so many different locks are involved in the different stages of class loading, linking and initialization that it can be unclear, when you've zoomed in, exactly which lock should be part of the code path you're dealing with (e.g. the loader constraint table code is protected by the SD lock, so the checking of the is_loaded state is not lock-free).

But this code is functionally correct, so the only potential harm here (other than complicating code understanding) is to performance, which we will just have to keep an eye on.

FYI I'm away for the next couple of days.

@coleenp
Contributor

coleenp commented Sep 25, 2024

I was looking through, and we set the "loaded" state under the Compile_lock (because of dependencies in add_to_hierarchy); we set "linked", "being_initialized", "fully_initialized" and "initialization_error" under the init_lock object (which I want to change again), with a notify for the latter two. Using a load_acquire to examine the state (and a release_store to write it) seems like the right thing to do, because there isn't just one lock, so we should assume reading this state is lock-free.

It looks like the C2 code optimizes away the clinit_barrier when possible, so we can watch for any performance difference, but I'd still rather have safety.

@shipilev
Member Author

I am running performance tests with it, and expect no difference, given that JITed code normally knows that classes are initialized at JIT compilation time. The impact on interpreter paths is likely not visible either.

If you can run your set of benchmarks, please do as well.

@coleenp
Contributor

coleenp commented Sep 25, 2024

If you can run your set of benchmarks, please do as well.

Ok.

@coleenp
Contributor

coleenp commented Sep 26, 2024

My benchmarking showed only the normal jitter and no regressions.

@shipilev
Member Author

Our performance tests show no effect either. So I guess we are fine.

I would like platform maintainers to look at the relevant parts: @RealFYang, @TheRealMDoerr, @RealLucy, @offamitkumar?

Member

@RealFYang RealFYang left a comment


Hi, thanks for the ping. The RISC-V part of the change looks fine. No obvious change witnessed in SPECjbb numbers.

Member

@offamitkumar offamitkumar left a comment


I can't see any regression on s390x either. @RealLucy maybe a quick look?

@TheRealMDoerr
Contributor

Which benchmarks did you use? Is there any microbenchmark for class initialization? Is this one interesting? https://github.com/clojure/test.benchmark

@shipilev
Member Author

Which benchmarks did you use? Is there any microbenchmark for class initialization? Is this one interesting? https://github.com/clojure/test.benchmark

Our usual corpus of industry-standard benchmarks, like DaCapo, SPECjbb, etc. I don't think we have a microbenchmark that targets class loading specifically.

@TheRealMDoerr
Contributor

TheRealMDoerr commented Sep 27, 2024

I've run some of these benchmarks on PPC64le and couldn't spot a regression, but the results are not very stable, and I guess they are not very sensitive to class initialization.
I really wonder about the acquire barrier in LIR_Assembler::emit_alloc_obj. The interesting fields of the class are already read by LIRGenerator::new_instance at compile time. How can an acquire barrier after the execution help?
At least, it doesn't help the allocation itself. If the purpose is to make sure that anyone who sees the new object can also see the class initialization, that is already ensured by

__ membar_storestore();

(Note that membar_storestore is cumulative and includes loadstore ordering on PPC64.)

@shipilev
Member Author

shipilev commented Sep 27, 2024

I really wonder about the acquire barrier in LIR_Assembler::emit_alloc_obj. The interesting fields of the class are already read by LIRGenerator::new_instance at compile time. How can an acquire barrier after the execution help? At least, it doesn't help the allocation itself.

Well, that's the thing: if the compiler does not know the class is initialized, it emits the runtime check for class initialization. Here, in LIRGenerator::new_instance, we enter with init_check = true (!klass->is_initialized()):

__ allocate_object(dst, scratch1, scratch2, scratch3, scratch4,
                   oopDesc::header_size(), instance_size, klass_reg,
                   !klass->is_initialized(), slow_path);

In the generated code, we come to the init_check block, check at runtime whether the class is fully initialized, and proceed to the rest of the allocation path if so:

__ lbz(op->tmp1()->as_register(),
       in_bytes(InstanceKlass::init_state_offset()), op->klass()->as_register());
__ cmpwi(CCR0, op->tmp1()->as_register(), InstanceKlass::fully_initialized);

If we were the only thread, it would not have been a problem: on first entry we would have called the stub, initialized the class, and completed the allocation there. The next time around we would have passed the init check (init_state == fully_initialized) and proceeded without calling the stub. I suspect that is the overwhelmingly frequent case for most workloads.

But the caveat we are handling in this PR is that some other thread might have completed the class initialization, and we need to make sure this thread sees the class state consistently. It is not only about the state of the InstanceKlass itself, it is also about the state written as part of the class init sequence, for example things in <clinit>. If the constructor of the allocated object needs those fields, we need to make sure proper publication happens.

To achieve that, the initializing thread does a release-store of init_state = fully_initialized. On the reader side, we need a matching acquire-load in the runtime check. Since the runtime check does not run often -- most of the time compilers know the class is definitely initialized -- the change does not affect performance much, if at all.

Makes sense?

@TheRealMDoerr
Contributor

TheRealMDoerr commented Sep 27, 2024

Thanks for the explanation. This makes sense. Nevertheless, the aforementioned membar_storestore() follows the allocation immediately, and it includes an acquire barrier for the current thread, too. So the extra acquire is redundant, at least for the C1 code and probably in more places. This is not so obvious, so we may be able to live with what you have as long as performance is OK. Otherwise, we could still do a follow-up.

@shipilev
Member Author

shipilev commented Sep 27, 2024

Nevertheless, the aforementioned membar_storestore() follows the allocation immediately, and it includes an acquire barrier for the current thread, too.

OK, I see what you are getting at. But isn't that barrier still too late? See:

Thread 1 (in "new A()"):
  <runtime check fires> 
  <clinit runs>
  IK::init_state = fully_initialized
  [stalls]

Thread 2: (also in "new A()"):
  <runtime check passes: IK::init_state == fully_initialized>
  <allocates>
  membar_storestore(); // <---- nothing to cumulate with yet
  <constructor runs>  // not seeing <clinit> result fully, no barriers!

Thread 1:
  [resumes]
  <allocates>
  membar_storestore(); // <---- too late!

@TheRealMDoerr
Contributor

TheRealMDoerr commented Sep 27, 2024

The point is that the compiled <allocates> is only a TLAB pointer bump. It doesn't read anything from the class. We only need an acquire barrier anywhere between <runtime check passes: IK::init_state == fully_initialized> and <constructor runs>. The latter will always see the <clinit> result fully, because membar_storestore() acts as an acquire barrier (on PPC64 specifically).
Btw. IK::init_state = fully_initialized should be a releasing store.
Also note that membar_storestore() is mapped to lwsync, so the C1-compiled nmethod will have one before and one after the allocation with your change.

@shipilev
Member Author

shipilev commented Sep 27, 2024

All right, granted. We can make the argument that a release-store to IK::init_state can be matched with a cumulative barrier like the storestore at the end of C1-compiled allocation code. That said, it looks quite fragile, since: a) it depends on the cumulative properties of low-level hardware primitives, and b) it likely only holds for C1, as C2 normally coalesces the header-protecting storestore with the final-field storestore.

Given that we expect no perf problems on this seemingly rare path, I prefer not to exploit those specifics, unless you feel strongly otherwise :)

@TheRealMDoerr
Contributor

I've done a bit of research, and it seems the C2 clinit barrier is only used in a rare corner case, while the C1 parts are used not so infrequently. Peak performance doesn't seem to be affected, so I don't see any reason to optimize C2, either. The shared code LGTM. The more frequently used parts are in platform-specific code, so it might make sense to optimize the PPC64 parts. Also note that the "isync trick" is a faster acquire barrier than lwsync. What do you think about this?

diff --git a/src/hotspot/cpu/ppc/c1_LIRAssembler_ppc.cpp b/src/hotspot/cpu/ppc/c1_LIRAssembler_ppc.cpp
index 61f654c9cfa..684c06614a9 100644
--- a/src/hotspot/cpu/ppc/c1_LIRAssembler_ppc.cpp
+++ b/src/hotspot/cpu/ppc/c1_LIRAssembler_ppc.cpp
@@ -2274,7 +2274,7 @@ void LIR_Assembler::emit_alloc_obj(LIR_OpAllocObj* op) {
     }
     __ lbz(op->tmp1()->as_register(),
            in_bytes(InstanceKlass::init_state_offset()), op->klass()->as_register());
-    __ lwsync(); // acquire
+    // acquire barrier included in membar_storestore() which follows the allocation immediately.
     __ cmpwi(CCR0, op->tmp1()->as_register(), InstanceKlass::fully_initialized);
     __ bc_far_optimized(Assembler::bcondCRbiIs0, __ bi0(CCR0, Assembler::equal), *op->stub()->entry());
   }
diff --git a/src/hotspot/cpu/ppc/macroAssembler_ppc.cpp b/src/hotspot/cpu/ppc/macroAssembler_ppc.cpp
index e73e617b8ca..bf2b2540e35 100644
--- a/src/hotspot/cpu/ppc/macroAssembler_ppc.cpp
+++ b/src/hotspot/cpu/ppc/macroAssembler_ppc.cpp
@@ -2410,7 +2410,7 @@ void MacroAssembler::verify_secondary_supers_table(Register r_sub_klass,
 void MacroAssembler::clinit_barrier(Register klass, Register thread, Label* L_fast_path, Label* L_slow_path) {
   assert(L_fast_path != nullptr || L_slow_path != nullptr, "at least one is required");
 
-  Label L_fallthrough;
+  Label L_check_thread, L_fallthrough;
   if (L_fast_path == nullptr) {
     L_fast_path = &L_fallthrough;
   } else if (L_slow_path == nullptr) {
@@ -2419,11 +2419,14 @@ void MacroAssembler::clinit_barrier(Register klass, Register thread, Label* L_fa
 
   // Fast path check: class is fully initialized
   lbz(R0, in_bytes(InstanceKlass::init_state_offset()), klass);
-  lwsync(); // acquire
+  // acquire by cmp-branch-isync if fully_initialized
   cmpwi(CCR0, R0, InstanceKlass::fully_initialized);
-  beq(CCR0, *L_fast_path);
+  bne(CCR0, L_check_thread);
+  isync();
+  b(*L_fast_path);
 
   // Fast path check: current thread is initializer thread
+  bind(L_check_thread);
   ld(R0, in_bytes(InstanceKlass::init_thread_offset()), klass);
   cmpd(CCR0, thread, R0);
   if (L_slow_path == &L_fallthrough) {

@shipilev
Member Author

The more frequently used parts are in platform-specific code, so it might make sense to optimize the PPC64 parts. Also note that the "isync trick" is a faster acquire barrier than lwsync. What do you think about this?

I don't mind, and what you, as the maintainer of the PPC64 code, say goes :) I merged the patch into this PR, thanks.

@openjdk openjdk bot removed the ready Pull request is ready to be integrated label Sep 30, 2024
@openjdk openjdk bot added the ready Pull request is ready to be integrated label Sep 30, 2024
@shipilev
Member Author

shipilev commented Oct 2, 2024

Thanks all for reviews. If there are no other comments, I'll integrate soon.

@shipilev
Member Author

shipilev commented Oct 7, 2024

/integrate

@openjdk

openjdk bot commented Oct 7, 2024

Going to push as commit 6600161.
Since your change was applied there have been 102 commits pushed to the master branch:

  • 20f36c6: 8339329: ConstantPoolBuilder#constantValueEntry method doc typo and clarifications
  • 50426b3: 8337713: RISC-V: fix typos in macroAssembler_riscv.cpp
  • 260d465: 8340572: ConcurrentModificationException when sorting ArrayList sublists
  • 9a25f82: 8339386: Assertion on AIX - original PC must be in the main code section of the compiled method
  • df763cd: 8341558: [AIX] build broken after 8341413
  • 1c3e56c: 8341512: Optimize StackMapGenerator::processInvokeInstructions
  • f8db3a8: 8341510: Optimize StackMapGenerator::processFieldInstructions
  • b42fbf4: 8339699: Optimize DataOutputStream writeUTF
  • 5592894: 8340417: Open source some MenuBar tests - Set1
  • bade041: 8341554: Shenandoah: Missing heap lock when updating usage for soft ref policy
  • ... and 92 more: https://git.openjdk.org/jdk/compare/f1bf469b4ee07b48b629a126111e307d3cab7fd7...master

Your commit was automatically rebased without conflicts.

@openjdk openjdk bot added the integrated Pull request has been integrated label Oct 7, 2024
@openjdk openjdk bot closed this Oct 7, 2024
@openjdk openjdk bot removed ready Pull request is ready to be integrated rfr Pull request is ready for review labels Oct 7, 2024
@openjdk

openjdk bot commented Oct 7, 2024

@shipilev Pushed as commit 6600161.

💡 You may see a message that your pull request was closed with unmerged commits. This can be safely ignored.

@shipilev shipilev deleted the JDK-8338379-class-init-checks branch January 8, 2025 12:36