@JornVernee
Member

@JornVernee JornVernee commented Oct 28, 2024

There is a subtle race in UpcallLinker::on_exit between copying the old frame anchor back into place and the GC. Since this copy is not atomic, it may briefly appear as if the thread has no last Java frame while it is still in the _thread_in_native state, which leads the GC to skip processing of any active Java frames.
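To make the window concrete, here is a rough, single-threaded model of a multi-step anchor copy. This is not HotSpot code: `ToyAnchor`, `copy_anchor` and the `observe` callback (standing in for a concurrent GC thread) are hypothetical names, and the "clear sp first, publish sp last" ordering is an assumption of the sketch.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>

// Hypothetical stand-in for a frame anchor: three words describing the last Java frame.
struct ToyAnchor {
  std::uintptr_t sp;
  std::uintptr_t fp;
  std::uintptr_t pc;
  // In this model, a zero sp means "no last Java frame".
  bool has_last_frame() const { return sp != 0; }
};

// Multi-step (non-atomic) copy. `observe` is called at the midpoint and
// stands in for another thread looking at `dst` while the copy is in flight.
inline void copy_anchor(ToyAnchor& dst, const ToyAnchor& src,
                        const std::function<void(const ToyAnchor&)>& observe) {
  dst.sp = 0;       // step 1: invalidate, so a half-written anchor is never walked
  dst.fp = src.fp;  // step 2: copy the remaining fields
  dst.pc = src.pc;
  observe(dst);     // a concurrent observer could see this intermediate state
  dst.sp = src.sp;  // step 3: publish the new anchor
}

// Returns whether an observer in the middle of the copy sees a last Java frame.
inline bool observed_frame_mid_copy() {
  ToyAnchor old_anchor{0x1000, 0x1010, 0x4000};
  ToyAnchor current{0x2000, 0x2010, 0x5000};
  bool seen = true;
  copy_anchor(current, old_anchor,
              [&](const ToyAnchor& a) { seen = a.has_last_frame(); });
  return seen;
}
```

In this model the observer sees no last frame mid-copy, even though the thread has Java frames before and after the copy.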

This code was originally adapted from JavaCallWrapper::~JavaCallWrapper - the JNI mechanism for upcalls - but in that case the frame anchor copy happens in the _thread_in_vm state, which means the GC will wait for the thread to get to a safepoint.

The solution proposed here is to do the frame anchor copy in the java thread state, before transitioning back to the native state. The java thread state, like the vm thread state, is also 'safe' i.e. the GC will wait for the thread to get to a safepoint, so we can safely do our non-atomic copy of the frame anchor.
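The reordering can be sketched as follows. This is a toy model under stated assumptions, not the actual patch: `ToyState`, `ToyThread` and both helper functions are hypothetical, and the rule "only a thread in native can be inspected without waiting" is a simplification of what the GC does.

```cpp
#include <cassert>

// Hypothetical thread states, loosely mirroring the ones named above.
enum class ToyState { in_native, in_Java, in_vm };

// Model assumption: the collector inspects a thread's stack without waiting
// only when that thread is in native; otherwise it waits for a safepoint.
inline bool gc_can_observe_concurrently(ToyState s) {
  return s == ToyState::in_native;
}

struct ToyThread {
  ToyState state = ToyState::in_Java;
  bool copy_was_racy = false;
  // Records whether the (non-atomic) anchor restore ran in an observable state.
  void restore_anchor() { copy_was_racy = gc_can_observe_concurrently(state); }
};

// Old order: transition to native first, then restore the anchor.
inline bool racy_exit_order() {
  ToyThread t;
  t.state = ToyState::in_native;
  t.restore_anchor();
  return t.copy_was_racy;
}

// Proposed order: restore the anchor while still in Java, then transition.
inline bool fixed_exit_order() {
  ToyThread t;
  t.restore_anchor();
  t.state = ToyState::in_native;
  return t.copy_was_racy;
}
```

Only the first ordering performs the copy in a state where the collector could be looking at the anchor.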

Additionally, this PR resolves a similar issue in on_entry, by moving the clearing of the pending exception (in case native code uses a JNI API and doesn't handle the exception afterwards). We now also skip checking for async exceptions when transitioning from native to java, so we don't immediately clear them. Any async exceptions will be picked up at the next safepoint instead.
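A small model of why the async check is skipped, assuming the pending exception is cleared after the native-to-java transition. Everything here (`ToyUpcallThread`, its fields, `async_survives_entry`) is hypothetical and only illustrates the ordering argument; it is not the on_entry code.

```cpp
#include <cassert>

struct ToyUpcallThread {
  bool async_queued = false;       // async exception requested by another thread
  bool pending_exception = false;  // exception currently installed on this thread

  // Native -> Java transition; optionally installs a queued async exception.
  void transition_to_java(bool check_asyncs) {
    if (check_asyncs && async_queued) {
      async_queued = false;
      pending_exception = true;
    }
  }

  // Drops whatever exception is currently installed.
  void clear_pending_exception() { pending_exception = false; }

  // A later safepoint poll installs any async exception that is still queued.
  void safepoint_poll() {
    if (async_queued) {
      async_queued = false;
      pending_exception = true;
    }
  }
};

// Returns true if the queued async exception is still delivered after entry.
inline bool async_survives_entry(bool check_asyncs) {
  ToyUpcallThread t;
  t.async_queued = true;
  t.pending_exception = true;  // stale exception left behind by native code
  t.transition_to_java(check_asyncs);
  t.clear_pending_exception();  // assumed to run after the transition
  t.safepoint_poll();
  return t.pending_exception;
}
```

With the check enabled, the async exception is installed during the transition and then wiped by the clear. With the check skipped, it stays queued and is delivered at the next poll.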

Special thanks to @stefank and @fisk for finding the root cause, and @jaikiran for testing and debugging.

Testing: tier 1-4, 20k runs of the failing test on linux-aarch64.


Progress

  • Change must be properly reviewed (1 review required, with at least 1 Reviewer)
  • Change must not contain extraneous whitespace
  • Commit message must refer to an issue

Issues

  • JDK-8331735: UpcallLinker::on_exit races with GC when copying frame anchor (Bug - P3)
  • JDK-8343144: UpcallLinker::on_entry racingly clears pending exception with GC safepoints (Bug - P4)
  • JDK-8286875: ProgrammableUpcallHandler::on_entry/on_exit access thread fields from native (Bug - P4)

Reviewers

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/21742/head:pull/21742
$ git checkout pull/21742

Update a local copy of the PR:
$ git checkout pull/21742
$ git pull https://git.openjdk.org/jdk.git pull/21742/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 21742

View PR using the GUI difftool:
$ git pr show -t 21742

Using diff file

Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/21742.diff

Using Webrev

Link to Webrev Comment

@bridgekeeper

bridgekeeper bot commented Oct 28, 2024

👋 Welcome back jvernee! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

@openjdk

openjdk bot commented Oct 28, 2024

@JornVernee This change now passes all automated pre-integration checks.

ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details.

After integration, the commit message for the final commit will be:

8331735: UpcallLinker::on_exit races with GC when copying frame anchor
8343144: UpcallLinker::on_entry racingly clears pending exception with GC safepoints
8286875: ProgrammableUpcallHandler::on_entry/on_exit access thread fields from native

Reviewed-by: dholmes, eosterlund, aboldtch

You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed.

At the time when this comment was updated there had been 265 new commits pushed to the master branch:

  • b3986bd: 8344118: C2 SuperWord: add VectorThroughputForIterationCount benchmark
  • 96388be: 8345004: [BACKOUT] GTK & Nimbus LAF: Tabbed pane's background color is not expected one when change the opaque checkbox.
  • 4ae6ce6: 8344300: Implement JEP 499: Structured Concurrency (Fourth Preview)
  • 57ee3ba: 8344912: Sharpen the return type of various internal methods in jdk.internal.foreign
  • 1f6144e: 8345050: Fix -Wzero-as-null-pointer warning in MemPointer ctor
  • 08c1f44: 8341028: Do not use lambdas or method refs for verifyConstantPool
  • 28c8729: 8343004: Adjust JAXP limits
  • 8c2b4f6: 8345057: ML_KEM NamedParameterSpec constants removed by ML-DSA integration
  • 8389e24: 8345058: Javac issues different error messages for the modifiers of the requires directive
  • 8da6435: 8343693: [JVMCI] Override ModifiersProvider.isConcrete in ResolvedJavaType to be isArray() || !isAbstract()
  • ... and 255 more: https://git.openjdk.org/jdk/compare/bfee766f035fb1b122cd3f3703b9e2a2d85abfe6...master

As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid this automatic rebasing, please check the documentation for the /integrate command for further details.

➡️ To integrate this PR with the above commit message to the master branch, type /integrate in a new comment.

@openjdk

openjdk bot commented Oct 28, 2024

@JornVernee The following label will be automatically applied to this pull request:

  • hotspot

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label pull request command.

@openjdk openjdk bot added the hotspot hotspot-dev@openjdk.org label Oct 28, 2024
@JornVernee
Member Author

/label remove hotspot
/label add hotspot-gc
/label add core-libs

@JornVernee
Member Author

/solves 8343144
/solves 8286875

@openjdk openjdk bot removed the hotspot hotspot-dev@openjdk.org label Nov 15, 2024
@openjdk

openjdk bot commented Nov 15, 2024

@JornVernee
The hotspot label was successfully removed.

@openjdk openjdk bot added the hotspot-gc hotspot-gc-dev@openjdk.org label Nov 15, 2024
@openjdk

openjdk bot commented Nov 15, 2024

@JornVernee
The hotspot-gc label was successfully added.

@openjdk openjdk bot added the core-libs core-libs-dev@openjdk.org label Nov 15, 2024
@openjdk

openjdk bot commented Nov 15, 2024

@JornVernee
The core-libs label was successfully added.

@openjdk

openjdk bot commented Nov 15, 2024

@JornVernee
Adding additional issue to solves list: 8343144: UpcallLinker::on_entry racingly clears pending exception with GC safepoints.

@openjdk

openjdk bot commented Nov 15, 2024

@JornVernee
Adding additional issue to solves list: 8286875: ProgrammableUpcallHandler::on_entry/on_exit access thread fields from native.

@JornVernee JornVernee marked this pull request as ready for review November 18, 2024 12:41
@JornVernee JornVernee changed the title from "8331735: java.lang.Thread.scopedValueBindings contains garbage when crashing in java/awt/font/TextLayout/FontLayoutStressTest during GC" to "8331735: UpcallLinker::on_exit races with GC when copying frame anchor" Nov 18, 2024
@openjdk openjdk bot added the rfr Pull request is ready for review label Nov 18, 2024
@mlbridge

mlbridge bot commented Nov 18, 2024

Webrevs

@jaikiran
Member

Happy to see this addressed and, as Jorn noted, thanks to Stefan and Erik for finding the root cause of this issue, which was hard to reproduce and debug.

Member

@dholmes-ora dholmes-ora left a comment

This seems quite reasonable. Ensuring the correct state for things like updating the frame_anchor is critical, so I wonder if we can assert we are in a safepoint-safe state when doing so?

I had to think long about the async exception deferral ... probably okay.

Thanks

@openjdk openjdk bot added the ready Pull request is ready to be integrated label Nov 19, 2024
@JornVernee
Member Author

I wonder if we can assert we are in a safepoint-safe state when doing so?

I think we can do this. I've prototyped this here: pr/21742...JornVernee:jdk:SafeFrameAnchor+assert

This catches the issue fixed by this patch, and it passes at least tier 1. We'd need something similar in assembly where we touch the frame anchor, i.e. MacroAssembler::set_last_Java_frame and MacroAssembler::reset_last_Java_frame.

Thinking some more about this: there might be other instances of JavaFrameAnchor that are fine to touch when the thread is in the native state. It's just the frame anchor inside a JavaThread that can not be touched if that thread is in a certain state. It might be possible to encapsulate the JavaFrameAnchor instance inside the thread, and then guard any accesses to it. But, that seems like a much more invasive change, so I'll hold off on that and focus this PR on fixing the issue.
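Roughly, the encapsulation idea could look like the sketch below. This is not the linked prototype: `GuardedThread`, `GuardState` and the accessor names are hypothetical, and treating both native and blocked as unsafe states is an assumption of the sketch.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical thread states for the sketch.
enum class GuardState { in_native, blocked, in_Java, in_vm };

// The anchor field is private, so every write goes through a guarded accessor.
class GuardedThread {
 public:
  explicit GuardedThread(GuardState s) : _state(s), _last_sp(0) {}

  void set_state(GuardState s) { _state = s; }

  // Model assumption: the anchor may only be touched in states where the GC
  // waits for the thread to reach a safepoint before looking at it.
  bool anchor_access_is_safe() const {
    return _state != GuardState::in_native && _state != GuardState::blocked;
  }

  void set_last_sp(std::uintptr_t sp) {
    assert(anchor_access_is_safe() &&
           "frame anchor touched in a safepoint-unsafe state");
    _last_sp = sp;
  }

  std::uintptr_t last_sp() const { return _last_sp; }

 private:
  GuardState _state;
  std::uintptr_t _last_sp;
};
```

A standalone anchor object that does not live inside a thread would not go through this accessor, which matches the point that only the thread's own anchor needs the guard.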

Contributor

@fisk fisk left a comment

The fix looks good to me.

Member

@xmas92 xmas92 left a comment

lgtm.

Would be nice if we could assert that we are not in native or blocked when touching the oops as well, similar to modifications of the frame anchor. But I agree that it should be done separately.

@JornVernee
Member Author

/integrate

@openjdk

openjdk bot commented Nov 27, 2024

Going to push as commit 461ffaf.
Since your change was applied there have been 267 commits pushed to the master branch:

  • eb0d1ce: 8344355: Register corruption in MacroAssembler::lookup_secondary_supers_table_var: x86-64 only
  • 82137db: 8345047: RISC-V: Remove explicit use of AvoidUnalignedAccesses in interpreter
  • b3986bd: 8344118: C2 SuperWord: add VectorThroughputForIterationCount benchmark
  • 96388be: 8345004: [BACKOUT] GTK & Nimbus LAF: Tabbed pane's background color is not expected one when change the opaque checkbox.
  • 4ae6ce6: 8344300: Implement JEP 499: Structured Concurrency (Fourth Preview)
  • 57ee3ba: 8344912: Sharpen the return type of various internal methods in jdk.internal.foreign
  • 1f6144e: 8345050: Fix -Wzero-as-null-pointer warning in MemPointer ctor
  • 08c1f44: 8341028: Do not use lambdas or method refs for verifyConstantPool
  • 28c8729: 8343004: Adjust JAXP limits
  • 8c2b4f6: 8345057: ML_KEM NamedParameterSpec constants removed by ML-DSA integration
  • ... and 257 more: https://git.openjdk.org/jdk/compare/bfee766f035fb1b122cd3f3703b9e2a2d85abfe6...master

Your commit was automatically rebased without conflicts.

@openjdk openjdk bot added the integrated Pull request has been integrated label Nov 27, 2024
@openjdk openjdk bot closed this Nov 27, 2024
@openjdk openjdk bot removed ready Pull request is ready to be integrated rfr Pull request is ready for review labels Nov 27, 2024
@openjdk

openjdk bot commented Nov 27, 2024

@JornVernee Pushed as commit 461ffaf.

💡 You may see a message that your pull request was closed with unmerged commits. This can be safely ignored.

Labels

core-libs core-libs-dev@openjdk.org hotspot-gc hotspot-gc-dev@openjdk.org integrated Pull request has been integrated

5 participants