Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

8340313: Crash due to invalid oop in nmethod after C1 patching #21389

Closed
wants to merge 7 commits into from

Conversation

TobiHartmann
Copy link
Member

@TobiHartmann TobiHartmann commented Oct 7, 2024

C1 patching (explained in detail here) works by rewriting the patch body, usually a move, and then copying it into place over top of the jmp instruction, being careful to flush caches and doing it in an MP-safe way.

Now the problem is that there can be multiple patch sides in one nmethod (for example, one for each field access in TestConcurrentPatching::test) and multiple threads executing that nmethod can trigger patching concurrently. Although the patch body is not executed, one Thread A can update the oop immediate of the mov in the patch body:

n_copy->set_data(cast_from_oop<intx>(mirror()));

While another Thread B is done with patching another patch side and already walks the nmethod oops via Universe::heap()->register_nmethod(nm) to notify the GC. Thread B might then encounter a half-written oop from Thread A if the immediate crosses a page or cache-line boundary:

Universe::heap()->register_nmethod(nm);

In short:

  • Thread A: Patches location 1 in an nmethod and executes register_nmethod(), walking all immediate oops.
  • Thread B: Patches location 2 in the same nmethod and just wrote half of the oop immediate of an mov in the patch body because the store is not atomic.
  • Thread A: Crashes when walking the immediate oops of the nmethod and encountering the oop just partially written by Thread B concurrently.

Updating the oop immediate is not atomic on x86_64 because the address of the immediate is not 8-byte aligned.
I propose to simply align it in PatchingStub::emit_code to guarantee atomicity.

The new regression test triggers the issue reliably for the load_mirror_patching_id case but unfortunately, I was not able to trigger the load_appendix_patching_id case which should be affected as well:

n_copy->set_data(cast_from_oop<intx>(appendix()));

I still added a corresponding test case testIndy and a new assert that checks proper alignment triggers in both scenarios. I had to remove them again because they are x64 specific and I don't think it's worth the effort of adding a platform independent way of alignment checking.

AArch64 is not affected because we always deopt instead of patching but other platforms might be affected as well. Should this fix be accepted, I'll ping the maintainers of the respective platforms.

Thanks,
Tobias


Progress

  • Change must be properly reviewed (1 review required, with at least 1 Reviewer)
  • Change must not contain extraneous whitespace
  • Commit message must refer to an issue

Issue

  • JDK-8340313: Crash due to invalid oop in nmethod after C1 patching (Bug - P2)

Reviewers

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/21389/head:pull/21389
$ git checkout pull/21389

Update a local copy of the PR:
$ git checkout pull/21389
$ git pull https://git.openjdk.org/jdk.git pull/21389/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 21389

View PR using the GUI difftool:
$ git pr show -t 21389

Using diff file

Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/21389.diff

Webrev

Link to Webrev Comment

@bridgekeeper
Copy link

bridgekeeper bot commented Oct 7, 2024

👋 Welcome back thartmann! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

@openjdk
Copy link

openjdk bot commented Oct 7, 2024

@TobiHartmann This change now passes all automated pre-integration checks.

ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details.

After integration, the commit message for the final commit will be:

8340313: Crash due to invalid oop in nmethod after C1 patching

Reviewed-by: tschatzl, kvn, dlong

You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed.

At the time when this comment was updated there had been 67 new commits pushed to the master branch:

As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid this automatic rebasing, please check the documentation for the /integrate command for further details.

➡️ To integrate this PR with the above commit message to the master branch, type /integrate in a new comment.

@openjdk
Copy link

openjdk bot commented Oct 7, 2024

@TobiHartmann The following label will be automatically applied to this pull request:

  • hotspot-compiler

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label pull request command.

@openjdk openjdk bot added the hotspot-compiler hotspot-compiler-dev@openjdk.org label Oct 7, 2024
@TobiHartmann TobiHartmann marked this pull request as ready for review October 7, 2024 14:29
@openjdk openjdk bot added the rfr Pull request is ready for review label Oct 7, 2024
@mlbridge
Copy link

mlbridge bot commented Oct 7, 2024

Webrevs

// Produce a copy of the load mirror / appendix instruction.
// 8-byte align the address of the oop immediate to guarantee atomicity
// when patching since the GC might walk nmethod oops concurrently.
__ align(8, __ offset() + NativeMovConstReg::data_offset_rex);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In 32-bit VM oops are 4 bytes so 8 bytes is overkill but I am fine with unified alignment.
Should we align mov_metadata() too or it is guarantee aligned already?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we need to guarantee atomicity for metadata because it's not observed concurrently as far as I know, right?

@dean-long
Copy link
Member

Wouldn't it be better to get rid of the concurrency? We could grab CodeCache_lock and Patching_lock in the same block, so we serialize the patching and register_nmethod.

@TobiHartmann
Copy link
Member Author

Thanks for looking at this, Vladimir and Dean!

Wouldn't it be better to get rid of the concurrency? We could grab CodeCache_lock and Patching_lock in the same block, so we serialize the patching and register_nmethod.

Yes, that would be an alternative solution. I went with the alignment because I thought it has the least impact. I'll ping the GC team, maybe they want to have a say in this.

@TobiHartmann
Copy link
Member Author

@dean-long I discussed this with @tschatzl and, on his request, improved the PR description a bit. He would also prefer the alignment solution because it does not increase the scope of the lock (and we already rely on word-aligned word-sized memory accesses being atomic in many other places).

What do you think?

@dean-long
Copy link
Member

Sorry, I'm still not convinced this is safe. I took another look at C1 patching, and not only are we trying to scan oops while they are being patched, we are also patching the reloc information at the same time (see the call to change_reloc_info_for_address()). So that means there is a window where the instruction is patched but the reloc information is stale.
If the scope of the lock is the only issue, then we could try to address that with a finer-grained lock or even a per-nmethod lock.
@tschatzl , when we call register_nmethod(), do we really need to scan the oops immediately, or could that be delayed until the next safepoint?

@tschatzl
Copy link
Contributor

@tschatzl , when we call register_nmethod(), do we really need to scan the oops immediately, or could that be delayed until the next safepoint?

Could be delayed at least for the STW collectors, but we want to avoid doing any work during gc as much as possible. This may be more tricky with concurrent gcs.

After some talk with @TobiHartmann we think that it is best and safer to extend the lock scope.

@TobiHartmann
Copy link
Member Author

Thanks for the discussions! I updated the PR to extend the scope of the Patching_lock. I also had to decrease the iterations in the test due to timeouts with debug on slow machines.

@fisk
Copy link
Contributor

fisk commented Oct 15, 2024

I'm fine with the fix. I can't help though but to reflect on the ever diminishing role of the Patching_lock. It used to be used quite a lot, but has had its lunch eaten by the CompiledMethod_lock, CompiledIC_lock and CodeCache_lock over time. Today, the Patching_lock is used in exactly one place: in this exact C1 patching that we are looking at now. And now we found that holding that lock wasn't enough because we need the CodeCache_lock as well. We could instead extend the CodeCache_lock critical section a bit, and then there is no need for the Patching_lock at all. Is it time for this lock to retire? It's had a good run. Thoughts?

@TobiHartmann
Copy link
Member Author

We could instead extend the CodeCache_lock critical section a bit, and then there is no need for the Patching_lock at all. Is it time for this lock to retire?

+1 to retiring the Patching_lock and using the CodeCache_lock instead. Let's see what others think.

@dean-long
Copy link
Member

Yes, please retire Patching_lock. It doesn't matter to me if it's part of this PR or a follow-up.

@TobiHartmann
Copy link
Member Author

Sounds good. I pushed a new change that completely removes the Patching_lock and uses the CodeCache_lock instead.

Copy link
Contributor

@tschatzl tschatzl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems good to me.

@openjdk openjdk bot added the ready Pull request is ready to be integrated label Oct 16, 2024
@TobiHartmann
Copy link
Member Author

Thanks Thomas!

Copy link
Contributor

@vnkozlov vnkozlov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good.

@TobiHartmann
Copy link
Member Author

Thanks for the reviews, Vladimir and Dean!

@TobiHartmann
Copy link
Member Author

/integrate

@openjdk
Copy link

openjdk bot commented Oct 17, 2024

Going to push as commit 58d39c3.
Since your change was applied there have been 68 commits pushed to the master branch:

Your commit was automatically rebased without conflicts.

@openjdk openjdk bot added the integrated Pull request has been integrated label Oct 17, 2024
@openjdk openjdk bot closed this Oct 17, 2024
@openjdk openjdk bot removed ready Pull request is ready to be integrated rfr Pull request is ready for review labels Oct 17, 2024
@openjdk
Copy link

openjdk bot commented Oct 17, 2024

@TobiHartmann Pushed as commit 58d39c3.

💡 You may see a message that your pull request was closed with unmerged commits. This can be safely ignored.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
hotspot-compiler hotspot-compiler-dev@openjdk.org integrated Pull request has been integrated
Development

Successfully merging this pull request may close these issues.

5 participants