-
Notifications
You must be signed in to change notification settings - Fork 5.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
8340313: Crash due to invalid oop in nmethod after C1 patching #21389
Conversation
👋 Welcome back thartmann! A progress list of the required criteria for merging this PR into |
@TobiHartmann This change now passes all automated pre-integration checks. ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details. After integration, the commit message for the final commit will be:
You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed. At the time when this comment was updated there had been 67 new commits pushed to the
As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid this automatic rebasing, please check the documentation for the /integrate command for further details. ➡️ To integrate this PR with the above commit message to the |
@TobiHartmann The following label will be automatically applied to this pull request:
When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label pull request command. |
Webrevs
|
// Produce a copy of the load mirror / appendix instruction. | ||
// 8-byte align the address of the oop immediate to guarantee atomicity | ||
// when patching since the GC might walk nmethod oops concurrently. | ||
__ align(8, __ offset() + NativeMovConstReg::data_offset_rex); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In 32-bit VM oops are 4 bytes so 8 bytes is overkill but I am fine with unified alignment.
Should we align mov_metadata() too or it is guarantee aligned already?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we need to guarantee atomicity for metadata because it's not observed concurrently as far as I know, right?
Wouldn't it be better to get rid of the concurrency? We could grab CodeCache_lock and Patching_lock in the same block, so we serialize the patching and register_nmethod. |
Thanks for looking at this, Vladimir and Dean!
Yes, that would be an alternative solution. I went with the alignment because I thought it has the least impact. I'll ping the GC team, maybe they want to have a say in this. |
@dean-long I discussed this with @tschatzl and, on his request, improved the PR description a bit. He would also prefer the alignment solution because it does not increase the scope of the lock (and we already rely on word-aligned word-sized memory accesses being atomic in many other places). What do you think? |
Sorry, I'm still not convinced this is safe. I took another look at C1 patching, and not only are we trying to scan oops while they are being patched, we are also patching the reloc information at the same time (see the call to change_reloc_info_for_address()). So that means there is a window where the instruction is patched but the reloc information is stale. |
Could be delayed at least for the STW collectors, but we want to avoid doing any work during gc as much as possible. This may be more tricky with concurrent gcs. After some talk with @TobiHartmann we think that it is best and safer to extend the lock scope. |
Thanks for the discussions! I updated the PR to extend the scope of the |
I'm fine with the fix. I can't help though but to reflect on the ever diminishing role of the Patching_lock. It used to be used quite a lot, but has had its lunch eaten by the CompiledMethod_lock, CompiledIC_lock and CodeCache_lock over time. Today, the Patching_lock is used in exactly one place: in this exact C1 patching that we are looking at now. And now we found that holding that lock wasn't enough because we need the CodeCache_lock as well. We could instead extend the CodeCache_lock critical section a bit, and then there is no need for the Patching_lock at all. Is it time for this lock to retire? It's had a good run. Thoughts? |
+1 to retiring the |
Yes, please retire Patching_lock. It doesn't matter to me if it's part of this PR or a follow-up. |
Sounds good. I pushed a new change that completely removes the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems good to me.
Thanks Thomas! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good.
Thanks for the reviews, Vladimir and Dean! |
/integrate |
Going to push as commit 58d39c3.
Your commit was automatically rebased without conflicts. |
@TobiHartmann Pushed as commit 58d39c3. 💡 You may see a message that your pull request was closed with unmerged commits. This can be safely ignored. |
C1 patching (explained in detail here) works by rewriting the patch body, usually a move, and then copying it into place over top of the jmp instruction, being careful to flush caches and doing it in an MP-safe way.
Now the problem is that there can be multiple patch sides in one nmethod (for example, one for each field access in
TestConcurrentPatching::test
) and multiple threads executing that nmethod can trigger patching concurrently. Although the patch body is not executed, oneThread A
can update the oop immediate of themov
in the patch body:jdk/src/hotspot/share/c1/c1_Runtime1.cpp
Line 1185 in 212e329
While another
Thread B
is done with patching another patch side and already walks the nmethod oops viaUniverse::heap()->register_nmethod(nm)
to notify the GC.Thread B
might then encounter a half-written oop fromThread A
if the immediate crosses a page or cache-line boundary:jdk/src/hotspot/share/c1/c1_Runtime1.cpp
Line 1282 in 212e329
In short:
Thread A
: Patches location 1 in an nmethod and executesregister_nmethod()
, walking all immediate oops.Thread B
: Patches location 2 in the same nmethod and just wrote half of the oop immediate of anmov
in the patch body because the store is not atomic.Thread A
: Crashes when walking the immediate oops of the nmethod and encountering the oop just partially written byThread B
concurrently.Updating the oop immediate is not atomic on x86_64 because the address of the immediate is not 8-byte aligned.
I propose to simply align it in
PatchingStub::emit_code
to guarantee atomicity.The new regression test triggers the issue reliably for the
load_mirror_patching_id
case but unfortunately, I was not able to trigger theload_appendix_patching_id
case which should be affected as well:jdk/src/hotspot/share/c1/c1_Runtime1.cpp
Line 1197 in 212e329
I still added a corresponding test case
testIndy
and a new assert that checks proper alignment triggers in both scenarios. I had to remove them again because they are x64 specific and I don't think it's worth the effort of adding a platform independent way of alignment checking.AArch64 is not affected because we always deopt instead of patching but other platforms might be affected as well. Should this fix be accepted, I'll ping the maintainers of the respective platforms.
Thanks,
Tobias
Progress
Issue
Reviewers
Reviewing
Using
git
Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/21389/head:pull/21389
$ git checkout pull/21389
Update a local copy of the PR:
$ git checkout pull/21389
$ git pull https://git.openjdk.org/jdk.git pull/21389/head
Using Skara CLI tools
Checkout this PR locally:
$ git pr checkout 21389
View PR using the GUI difftool:
$ git pr show -t 21389
Using diff file
Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/21389.diff
Webrev
Link to Webrev Comment