
Conversation

@matias9927
Contributor

@matias9927 matias9927 commented Mar 25, 2024

A misplaced memory barrier causes a very intermittent crash on some aarch64 systems. This patch adds an appropriate LoadLoad barrier after a constant pool cache field entry is loaded. Verified with tier 1-5 tests.


Progress

  • Change must be properly reviewed (1 review required, with at least 1 Reviewer)
  • Change must not contain extraneous whitespace
  • Commit message must refer to an issue

Issue

  • JDK-8327647: Occasional SIGSEGV in markWord::displaced_mark_helper() for SPECjvm2008 sunflow (Bug - P2)

Reviewers

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/18477/head:pull/18477
$ git checkout pull/18477

Update a local copy of the PR:
$ git checkout pull/18477
$ git pull https://git.openjdk.org/jdk.git pull/18477/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 18477

View PR using the GUI difftool:
$ git pr show -t 18477

Using diff file

Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/18477.diff

Webrev

Link to Webrev Comment

@bridgekeeper

bridgekeeper bot commented Mar 25, 2024

👋 Welcome back matsaave! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

@openjdk

openjdk bot commented Mar 25, 2024

@matias9927 This change now passes all automated pre-integration checks.

ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details.

After integration, the commit message for the final commit will be:

8327647: Occasional SIGSEGV in markWord::displaced_mark_helper() for SPECjvm2008 sunflow

Reviewed-by: coleenp, fyang, dlong

You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed.

At the time when this comment was updated there had been 157 new commits pushed to the master branch:

As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid this automatic rebasing, please check the documentation for the /integrate command for further details.

➡️ To integrate this PR with the above commit message to the master branch, type /integrate in a new comment.

@openjdk

openjdk bot commented Mar 25, 2024

@matias9927 The following label will be automatically applied to this pull request:

  • hotspot-compiler

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label pull request command.

@openjdk openjdk bot added the hotspot-compiler hotspot-compiler-dev@openjdk.org label Mar 25, 2024

__ push(r0);
// R1: field offset, R2: TOS, R3: flags
load_resolved_field_entry(r2, r2, r0, r1, r3);
Contributor

It is useless to use r0 here, so can we change it to noreg and eliminate the use of push(r0)/pop(r0)?

Member

@RealFYang RealFYang Mar 26, 2024

I also noticed this today while looking at the code and I was testing the following change:
unnecessary-tos-load-v2.diff.txt

@matias9927 : Can you add this extra change while you are at it? I think we should fix this for both performance and correctness reasons. The 8-byte push/pop would violate the AArch64 and RISC-V ABIs, which specify that the alignment of SP should always be 16 bytes [1][2].

[1] https://community.arm.com/arm-community-blogs/b/architectures-and-processors-blog/posts/using-the-stack-in-aarch32-and-aarch64
[2] https://github.com/riscv-non-isa/riscv-elf-psabi-doc/blob/master/riscv-cc.adoc

Contributor Author

Noted! I haven't been able to replicate the crash yet, but I think it will be worth moving forward with this patch as a matter of correctness.

Member

@RealFYang RealFYang Apr 11, 2024

Note that noreg is currently not properly handled / considered in load_resolved_field_entry. And it doesn't look nice to me to only check tos_state against noreg in this function. That's why I replaced the call to load_resolved_field_entry with separate loads in my proposed fix.

Contributor Author

I didn't notice your message until just now, but you're right. I would prefer to use load_resolved_field_entry when possible but there are places where fields are loaded individually so I think it's fine.

Member

@RealFYang RealFYang Apr 12, 2024

All right. Another place in TemplateTable::fast_accessfield which might be worth turning into a load_resolved_field_entry call with noreg for tos_state for consistency [1][2]:

[1] https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/aarch64/templateTable_aarch64.cpp#L3169-L3170
[2] https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/riscv/templateTable_riscv.cpp#L3139-L3140

Contributor

I think this change should be limited to fixing the membar issue and the push/pop issue, leaving further refactoring to a new change if more is wanted, but maybe this could call load_resolved_field_entry with noreg for tos_state too.

@matias9927 matias9927 marked this pull request as ready for review April 11, 2024 21:16
@openjdk openjdk bot added the rfr Pull request is ready for review label Apr 11, 2024
@mlbridge

mlbridge bot commented Apr 11, 2024

Webrevs


// Must prevent reordering of the following cp cache loads with bytecode load
__ membar(MacroAssembler::LoadLoad);

Contributor

I'm wondering if this can be in load_field_entry at the end so we don't miss any callers. It might be a bit redundant with the ldar in resolve_cache_and_index_for_field(), but that's only for the first time the field is resolved, and in the interpreter it should not be an issue for performance.

-  __ load_unsigned_byte(tos_state, Address(cache, in_bytes(ResolvedFieldEntry::type_offset())));
+  if (tos_state != noreg) {
+    __ load_unsigned_byte(tos_state, Address(cache, in_bytes(ResolvedFieldEntry::type_offset())));
+  }
Contributor

This handling of tos_state seems fine to me. Add a comment that the caller might not want to set type_offset as tos.

@coleenp
Contributor

coleenp commented Apr 11, 2024

/label add hotspot-runtime

@openjdk openjdk bot added the hotspot-runtime hotspot-runtime-dev@openjdk.org label Apr 11, 2024
@openjdk

openjdk bot commented Apr 11, 2024

@coleenp
The hotspot-runtime label was successfully added.

Contributor

@coleenp coleenp left a comment

I think this looks really good.

@openjdk openjdk bot added the ready Pull request is ready to be integrated label Apr 17, 2024
Member

@RealFYang RealFYang left a comment

The riscv part is still lacking one change:

diff --git a/src/hotspot/cpu/riscv/templateTable_riscv.cpp b/src/hotspot/cpu/riscv/templateTable_riscv.cpp
index 58f57f32b2f..e27b35e793f 100644
--- a/src/hotspot/cpu/riscv/templateTable_riscv.cpp
+++ b/src/hotspot/cpu/riscv/templateTable_riscv.cpp
@@ -2272,7 +2272,9 @@ void TemplateTable::load_resolved_field_entry(Register obj,
   __ load_unsigned_byte(flags, Address(cache, in_bytes(ResolvedFieldEntry::flags_offset())));

   // TOS state
-  __ load_unsigned_byte(tos_state, Address(cache, in_bytes(ResolvedFieldEntry::type_offset())));
+  if (tos_state != noreg) {
+    __ load_unsigned_byte(tos_state, Address(cache, in_bytes(ResolvedFieldEntry::type_offset())));
+  }

   // Klass overwrite register
   if (is_static) {


// Must prevent reordering of the following cp cache loads with bytecode load
__ membar(MacroAssembler::LoadLoad);
// X11: field offset, X12: TOS, X13: flags
Member

Suggestion: // X11: field offset, X12: field holder, X13: flags

Member

@RealFYang RealFYang left a comment

Updated change LGTM. I didn't see any issue when running SPECjvm2008 sunflow overnight on my 64-core linux-aarch64 server.

ldr(cache, Address(rcpool, ConstantPoolCache::field_entries_offset()));
add(cache, cache, Array<ResolvedFieldEntry>::base_offset_in_bytes());
lea(cache, Address(cache, index));
// Must prevent reordering of the following cp cache loads with bytecode load
Contributor

This is rather unclear. Where is the bytecode load to which this comment refers?

Contributor Author

This comment was above all the other uses of membar and was just moved here for convenience. The "bytecode load" being referred to here is InterpreterMacroAssembler::dispatch_next. The bytecode loading could be scheduled before the cache entry is resolved.

Contributor

The comment at the top says "This patch adds an appropriate LoadLoad barrier after a constant pool cache field entry is loaded." But this seems to be before the constant pool cache field entry is loaded.

Contributor

I think what this wants to say is that when you've loaded the pointer to the ResolvedFieldEntry in 'cache', previous memory operations have been synched before the fields in the resolved entry are read. I'm looking for a different word than synched.

__ mov(c_rarg3, esp); // points to jvalue on the stack
// access constant pool cache entry
__ load_field_entry(c_rarg2, r0);

Contributor

Suggested change

@dean-long
Member

If I understand correctly, the order of writes must be:

  1. ResolvedFieldEntry fields, except _get_code and _put_code
  2. _get_code, _put_code
  3. patch_bytecode(fast_bytecode)

so the order of reads must be reversed. That's why there are load-acquires when reading _get_code and _put_code.
After [3] is done, after dispatching to fast_bytecode, we need to do a LoadLoad between the already read fast bytecode [3] and the "cache" fields [1]. The LoadLoad is not for the load of the next bytecode that will be done in dispatch_next().

@theRealAph
Contributor

> If I understand correctly, the order of writes must be:
>
> 1. ResolvedFieldEntry fields, except _get_code and _put_code

So, release fence here?

> 2. _get_code, _put_code

and another here

> 3. patch_bytecode(fast_bytecode)

> so the order of reads must be reversed. That's why there are load-acquires when reading _get_code and _put_code. After [3] is done, after dispatching to fast_bytecode, we need to do a LoadLoad between the already read fast bytecode [3] and the "cache" fields [1]. The LoadLoad is not for the load of the next bytecode that will be done in dispatch_next().

So, I guess the loadload fence being inserted here is the one we need between [2] and [3].

@coleenp
Contributor

coleenp commented Apr 22, 2024

There are two threads though:

t1:
1. write resolved fields
2. release_store get_code/put_code
3. patch bytecode to fastpath (should this be a release_store?)

but t2:
1. reads patched_bytecode
2. goes fastpath
3. loads pointer to cpCache entry for Resolved fields assuming they've been written in order
4. gets a resolved field

So the patch puts the LoadLoad membar between 2 and 3 because t2 is the thread that's loading the information. Would a LoadLoad barrier executed by t1 help?

@dean-long
Member

> There are two threads though:
> t1:
> 1. write resolved fields
> 2. release_store get_code/put_code
> 3. patch bytecode to fastpath (should this be a release_store?)

Yes, I think you need a release_store for 3, because 2 happened in the same thread. If 2 and 3 happened in different threads, then maybe it could be avoided, but that seems risky.

> but t2:
> 1. reads patched_bytecode
> 2. goes fastpath
> 3. loads pointer to cpCache entry for Resolved fields assuming they've been written in order
> 4. gets a resolved field
>
> So the patch puts the LoadLoad membar between 2 and 3 because t2 is the thread that's loading the information. Would a LoadLoad barrier executed by t1 help?

I don't see how t1 doing a LoadLoad helps. The LoadLoad needs to go anywhere between 1 and 4.

@dean-long
Member

> So, I guess the loadload fence being inserted here is the one we need between [2] and [3].

The way I would say it is we need a LoadLoad between [3] and [2] or between [3] and [1]. The code assumes that if it is a fast bytecode, then it can read [1] without checking [2] again.

@coleenp
Contributor

coleenp commented Apr 23, 2024

Dean, can you change your numbers to say which thread? In the code someone has written that you need a LoadLoad between t2's [t2-3] and [t2-4] because t2 assumes that it's a fast path, so writes to the ResolvedFieldEntry [t1-2] are complete. The missing LoadLoad there was the cause of the crash anyway.

@theRealAph
Contributor

My confusion is because @dean-long said

> If I understand correctly, the order of writes must be:
>
> 1. ResolvedFieldEntry fields, except _get_code and _put_code
> 2. _get_code, _put_code
> 3. patch_bytecode(fast_bytecode)

therefore, if that ordering must be maintained, we'll need two store fences. And on the reading side, we'll need two load fences. If that total order is more than is necessary, OK.

@dean-long
Member

I don't think t2-3 does any loads that need to be ordered, so the LoadLoad could be done earlier, anywhere between t2-1 and t2-4.

@dean-long
Member

> And on the reading side, we'll need two load fences. If that total order is more than is necessary, OK.

On the read side, I don't think we read _get_code or _put_code for the fast bytecode path, so that's why there is only one barrier needed.

@coleenp
Contributor

coleenp commented Apr 23, 2024

Yes, the fast path assumes that put_code and get_code are already set since we're in the fast path. For this patch, t2 (the reader) has the LoadLoad membar between t2-3 and t2-4. I assume it doesn't matter where, but there were a few places where we loaded the pointer to ResolvedFieldEntry that had the LoadLoad membar. The bug was because one was missing. That's why I thought it should be moved to inside of load_field_entry, so that all readers would have the membar. There were also some fast-path jvmti cases where the membar was missing.

Above, my step t2-3 is loading a pointer to the ResolvedFieldEntry, ie: ResolvedFieldEntry* cache = ResolvedFieldEntry->at(index)

t2-4 is loading cache->flags, tos, and offset

@matias9927
Contributor Author

If I understand correctly, it seems like we agree on where the membar belongs, is this right @dean-long? The current placement of the LoadLoad barrier inside load_field_entry seems to be sufficient, and to reiterate information from the description, tier 1-5 test results look clean.

@dean-long
Member

@matias9927, yes that's right. But I agree with earlier reviewer comments that the comment above the LoadLoad could be improved. To me it's helpful to think of this as a kind of self-modifying code. We had a slow bytecode that used the operands in a certain way. Then we changed the opcode to a fast bytecode that uses the operands to access newly initialized data outside the bytecode stream. The LoadLoad makes sure we don't read any of the newly initialized data before we read that the bytecode has been changed to the fast version.

@matias9927
Contributor Author

@theRealAph @dean-long how about replacing the comment with this:

// Prevents stale data from being read by another thread after the bytecode is patched to the fast bytecode

@dean-long
Member

OK, except for the "another thread" part. The reads are done in the current thread, so that's the thread the barrier is for.

// Prevents stale data from being read after the bytecode is patched to the fast bytecode

add(cache, cache, Array<ResolvedFieldEntry>::base_offset_in_bytes());
lea(cache, Address(cache, index));
// Prevents stale data from being read after the bytecode is patched to the fast bytecode
membar(MacroAssembler::LoadLoad);
Contributor

@sunny868 sunny868 Apr 28, 2024

if we put LoadLoad in here, maybe it is redundant for TemplateTable::patch_bytecode(..)

__ load_field_entry(temp_reg, bc_reg);

because there is already a load-acquire at L199. Can we change
load_field_entry(Register cache, Register index, int bcp_offset = 1)
to
load_field_entry(Register cache, Register index, int bcp_offset = 1, bool needLoadLoad = true)?

then change L192 like this

@@ -189,7 +189,7 @@ void TemplateTable::patch_bytecode(Bytecodes::Code bc, Register bc_reg,
       // additional, required work.
       assert(byte_no == f1_byte || byte_no == f2_byte, "byte_no out of range");
       assert(load_bc_into_bc_reg, "we use bc_reg as temp");
-      __ load_field_entry(temp_reg, bc_reg);
+      __ load_field_entry(temp_reg, bc_reg, 1 /*bcp_offset*/, false /*needLoadLoad*/);
       if (byte_no == f1_byte) {
         __ lea(temp_reg, Address(temp_reg, in_bytes(ResolvedFieldEntry::get_code_offset())));
       } else {
         __ lea(temp_reg, Address(temp_reg, in_bytes(ResolvedFieldEntry::put_code_offset())));
       }                                                                         
       // Load-acquire the bytecode to match store-release in ResolvedFieldEntry::fill_in()
       __ ldarb(temp_reg, temp_reg);

Member

I believe patch_bytecode only needs a LoadStore between reading the cache bytecode and patching at bcp[0], so I would agree the LoadLoad in load_field_entry is not needed here. But the reason is not because we have already load-acquire. The reason is because there are not two loads that need ordering. There is only a load and a store. We can't replace the load-acquire with a LoadLoad, for example.

Contributor

The LoadLoad in load_field_entry is needed for the second thread that (t2-1) loads the patched bytecode, and then (t2-2) loads the pointer to the cpCache entry for field, then (t2-3) loads the values from the cpCache entry pointer. The LoadLoad is between t2-2 and t2-3.

The ldarb in patch_bytecode does make the LoadLoad in load_field_entry redundant, as pointed out by @sunny868, but it seems harmless in terms of performance and correctness, i.e. not worth a special parameter.

The reason that it's in load_field_entry is that there are other callers who do not load-acquire the get_code and put_code fields. So it's safer and less error-prone to have it in load_field_entry, where it can't be forgotten.

@matias9927
Contributor Author

Thank you for the reviews and detailed discussion @dean-long @coleenp @sunny868 @RealFYang!

I think it is best to move forward with the current design. If we do find that the unneeded LoadLoad negatively impacts performance, I will follow up with an appropriate fix.

/integrate

@openjdk

openjdk bot commented Apr 30, 2024

Going to push as commit 9ce21d1.
Since your change was applied there have been 167 commits pushed to the master branch:

Your commit was automatically rebased without conflicts.

@openjdk openjdk bot added the integrated Pull request has been integrated label Apr 30, 2024
@openjdk openjdk bot closed this Apr 30, 2024
@openjdk openjdk bot removed ready Pull request is ready to be integrated rfr Pull request is ready for review labels Apr 30, 2024
@openjdk

openjdk bot commented Apr 30, 2024

@matias9927 Pushed as commit 9ce21d1.

💡 You may see a message that your pull request was closed with unmerged commits. This can be safely ignored.


Labels

hotspot-compiler hotspot-compiler-dev@openjdk.org hotspot-runtime hotspot-runtime-dev@openjdk.org integrated Pull request has been integrated

Development

Successfully merging this pull request may close these issues.

6 participants