8327990: [macosx-aarch64] Various tests fail with -XX:+AssertWXAtThreadSync #18238

Closed

Conversation

reinrich
Member

@reinrich reinrich commented Mar 12, 2024

Updated (2024-03-20):

This PR adds switching to WXWrite mode before entering the vm where it is missing.

With the changes the following jtreg tests succeed with AssertWXAtThreadSync enabled.

  • hotspot tier 1-4
  • jdk tier 1-4
  • langtools
  • jaxp

Progress

  • Change must be properly reviewed (1 review required, with at least 1 Reviewer)
  • Change must not contain extraneous whitespace
  • Commit message must refer to an issue

Issue

  • JDK-8327990: [macosx-aarch64] Various tests fail with -XX:+AssertWXAtThreadSync (Bug - P4)

Reviewers

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/18238/head:pull/18238
$ git checkout pull/18238

Update a local copy of the PR:
$ git checkout pull/18238
$ git pull https://git.openjdk.org/jdk.git pull/18238/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 18238

View PR using the GUI difftool:
$ git pr show -t 18238

Using diff file

Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/18238.diff

Webrev

Link to Webrev Comment

@bridgekeeper

bridgekeeper bot commented Mar 12, 2024

👋 Welcome back rrich! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

@openjdk

openjdk bot commented Mar 12, 2024

@reinrich The following labels will be automatically applied to this pull request:

  • hotspot-jfr
  • serviceability

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing lists. If you would like to change these labels, use the /label pull request command.

@openjdk openjdk bot added serviceability serviceability-dev@openjdk.org hotspot-jfr hotspot-jfr-dev@openjdk.org labels Mar 12, 2024
@reinrich
Member Author

/label add hotspot

@openjdk openjdk bot added the hotspot hotspot-dev@openjdk.org label Mar 13, 2024
@openjdk

openjdk bot commented Mar 13, 2024

@reinrich
The hotspot label was successfully added.

@reinrich reinrich marked this pull request as ready for review March 13, 2024 09:43
@openjdk openjdk bot added the rfr Pull request is ready for review label Mar 13, 2024
@mlbridge

mlbridge bot commented Mar 13, 2024

Webrevs

Member

@dholmes-ora dholmes-ora left a comment

As I wrote in JBS, shouldn't this be handled by ThreadInVMfromNative?

@reinrich
Member Author

As I wrote in JBS, shouldn't this be handled by ThreadInVMfromNative?

(I wanted to publish the PR before answering your comment)

This would be reasonable in my opinion.
I've hoisted setting WXWrite mode in JfrJvmtiAgent::retransform_classes() because multiple instances of ThreadInVMfromNative are reached. This is likely not even necessary. Still, exceptions could be made if there are usages of ThreadInVMfromNative where it is needed.

While I agree I'd prefer to do it as a separate enhancement.

Comment on lines +159 to +160
// WXWrite is needed before entering the vm below and in callee methods.
MACOS_AARCH64_ONLY(ThreadWXEnable __wx(WXWrite, THREAD));
Member

I understand you placed this here to cover the transition inside create_classes_array and the immediate one at line 170, but doesn't this risk having the wrong WX state for code in between?

Member Author

I asked myself the same question (after making the change).
Being in WXWrite mode would be wrong if the thread executed dynamically generated code. Not much happens outside the scope of the ThreadInVMfromNative instances. I see JNI calls (GetObjectArrayElement, ExceptionOccurred) and a JVMTI call (RetransformClasses), but these are safe because the callees enter the VM right away. We even avoid switching to WXWrite and back there.
So I thought it would be OK to coarsen the WXMode switching.
But maybe it's still better to avoid any risk, especially since there's likely no performance effect.
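
To make the trade-off concrete, here is a minimal, self-contained sketch of a scope guard in the style of ThreadWXEnable. It is not the HotSpot implementation: `WXMode`, `current_wx_mode`, `ScopedWXMode` and the two helper functions are hypothetical names, the per-thread mode is only simulated, and the assumption that the thread starts in Exec mode is made for illustration. The sketch compares one coarse guard around the whole function with one guard per VM entry.

```cpp
#include <cassert>

// Hypothetical stand-in for the per-thread W^X mode; not HotSpot's real types.
enum class WXMode { Write, Exec };

// Simulated per-thread state (assumed to start in Exec). The real switch is a
// per-thread permission change on macOS/AArch64, which this sketch omits.
thread_local WXMode current_wx_mode = WXMode::Exec;

// Scope guard: switch to the wanted mode on entry, restore the old one on exit.
class ScopedWXMode {
 public:
  explicit ScopedWXMode(WXMode wanted) : _previous(current_wx_mode) {
    current_wx_mode = wanted;
  }
  ~ScopedWXMode() { current_wx_mode = _previous; }
  ScopedWXMode(const ScopedWXMode&) = delete;
  ScopedWXMode& operator=(const ScopedWXMode&) = delete;

 private:
  WXMode _previous;
};

// Coarse variant: one guard around the whole function, so the code between
// the two VM entries also runs in Write mode.
inline WXMode mode_between_vm_entries_coarse() {
  ScopedWXMode wx(WXMode::Write);
  // ... first VM entry would happen here ...
  WXMode in_between = current_wx_mode;  // still Write
  // ... second VM entry would happen here ...
  return in_between;
}

// Fine-grained variant: one guard per VM entry.
inline WXMode mode_between_vm_entries_fine() {
  { ScopedWXMode wx(WXMode::Write); /* first VM entry */ }
  WXMode in_between = current_wx_mode;  // restored to Exec
  { ScopedWXMode wx(WXMode::Write); /* second VM entry */ }
  return in_between;
}
```

In this model the coarse variant leaves the in-between code in Write mode, which only matters if that code executes dynamically generated code; the fine-grained variant restores Exec between the entries at the cost of extra switches.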

Member

Or could the ThreadInVMfromNative tvmfn(THREAD); in check_exception_and_log be moved out to JfrJvmtiAgent::retransform_classes, so that only one ThreadInVMfromNative is used in JfrJvmtiAgent::retransform_classes?

Member Author

I think this would require hoisting the ThreadInVMfromNative out of the loop that contains the check_exception_and_log call, but then the thread would be in _thread_in_vm when making the GetObjectArrayElement JNI call, which would be wrong.
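
The objection to hoisting can be illustrated with a small simulation. This is only a sketch under stated assumptions: `ThreadState`, `thread_state`, `ScopedInVM` and the helper functions are hypothetical stand-ins, and the JNI call is modeled as a function that merely reports whether it was entered from the native state.

```cpp
#include <cassert>

// Hypothetical model of the two thread states involved; not HotSpot's enum.
enum class ThreadState { InNative, InVM };

thread_local ThreadState thread_state = ThreadState::InNative;

// Scope guard in the style of ThreadInVMfromNative (simulated).
class ScopedInVM {
 public:
  ScopedInVM() { thread_state = ThreadState::InVM; }
  ~ScopedInVM() { thread_state = ThreadState::InNative; }
  ScopedInVM(const ScopedInVM&) = delete;
  ScopedInVM& operator=(const ScopedInVM&) = delete;
};

// Stand-in for a JNI call such as GetObjectArrayElement: it reports whether
// the caller was in the native state, as a JNI entry expects.
inline bool jni_call_entered_from_native() {
  return thread_state == ThreadState::InNative;
}

// Transition per iteration: the JNI call runs outside the guard.
inline int violations_with_transition_in_loop(int iterations) {
  int violations = 0;
  for (int i = 0; i < iterations; ++i) {
    if (!jni_call_entered_from_native()) ++violations;
    ScopedInVM tvmfn;  // stands in for the guard in check_exception_and_log
  }
  return violations;
}

// Transition hoisted out of the loop: every JNI call now runs in the VM state.
inline int violations_with_transition_hoisted(int iterations) {
  ScopedInVM tvmfn;
  int violations = 0;
  for (int i = 0; i < iterations; ++i) {
    if (!jni_call_entered_from_native()) ++violations;
  }
  return violations;
}
```

With the guard inside the loop no call is entered from the wrong state; with the guard hoisted, every iteration is.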

@reinrich
Member Author

@MBaesken found 2 more locations in jvmti that need switching to WXWrite

JvmtiExport::get_jvmti_interface
GetCarrierThread

Both use ThreadInVMfromNative.

@openjdk

openjdk bot commented Mar 13, 2024

@reinrich This change now passes all automated pre-integration checks.

ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details.

After integration, the commit message for the final commit will be:

8327990: [macosx-aarch64] Various tests fail with -XX:+AssertWXAtThreadSync

Reviewed-by: dholmes, stuefe, mdoerr, tholenstein, aph

You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed.

At the time when this comment was updated there had been 5 new commits pushed to the master branch:

  • d3f3011: 7036144: GZIPInputStream readTrailer uses faulty available() test for end-of-stream
  • e5e7cd2: 8328386: Convert java/awt/FileDialog/FileNameOverrideTest test to main
  • 1b68c73: 8328377: Convert java/awt/Cursor/MultiResolutionCursorTest test to main
  • e0373e0: 8328378: Convert java/awt/FileDialog/FileDialogForDirectories test to main
  • 03c25b1: 8328367: Convert java/awt/Component/UpdatingBootTime test to main

Please see this link for an up-to-date comparison between the source branch of this pull request and the master branch.
As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid this automatic rebasing, please check the documentation for the /integrate command for further details.

➡️ To integrate this PR with the above commit message to the master branch, type /integrate in a new comment.

@MBaesken
Member

MBaesken commented Mar 14, 2024

@MBaesken found 2 more locations in jvmti that need switching to WXWrite

JvmtiExport::get_jvmti_interface GetCarrierThread

Both use ThreadInVMfromNative.

Should we address those 2 more findings in this PR ? Or open a separate JBS issue ?

btw those were the jtreg tests triggering the 2 additional findings / asserts

runtime/Thread/AsyncExceptionOnMonitorEnter.java
runtime/Thread/AsyncExceptionTest.java
serviceability/jvmti/RedefineClasses/RedefineSharedClassJFR.java
runtime/handshake/HandshakeDirectTest.java
runtime/handshake/SuspendBlocked.java
runtime/jni/terminatedThread/TestTerminatedThread.java
runtime/lockStack/TestStackWalk.java
serviceability/jvmti/vthread/GetThreadState/GetThreadStateTest.java#default
serviceability/jvmti/vthread/GetThreadState/GetThreadStateTest.java#no-vmcontinuations
serviceability/jvmti/vthread/GetThreadStateMountedTest/GetThreadStateMountedTest.java
serviceability/jvmti/vthread/RawMonitorTest/RawMonitorTest.java
serviceability/jvmti/vthread/SuspendWithInterruptLock/SuspendWithInterruptLock.java#default
serviceability/jvmti/vthread/SuspendWithInterruptLock/SuspendWithInterruptLock.java#xint
serviceability/jvmti/vthread/ThreadStateTest/ThreadStateTest.java
serviceability/jvmti/vthread/WaitNotifySuspendedVThreadTest/WaitNotifySuspendedVThreadTest.java

@MBaesken
Member

I noticed a few more asserts (assert(_wx_state == expected) failed: wrong state) in the jfr area (jdk tier3 jfr tests).
E.g.

V  [libjvm.dylib+0x8a5d94]  JavaThread::check_for_valid_safepoint_state()+0x0
V  [libjvm.dylib+0x3e95b4]  ThreadStateTransition::transition_from_native(JavaThread*, JavaThreadState, bool)+0x174
V  [libjvm.dylib+0x3e93d0]  ThreadInVMfromNative::ThreadInVMfromNative(JavaThread*)+0x70
V  [libjvm.dylib+0x91a578]  JfrRecorderService::emit_leakprofiler_events(long long, bool, bool)+0xcc

and

V  [libjvm.dylib+0x8a5d94]  JavaThread::check_for_valid_safepoint_state()+0x0
V  [libjvm.dylib+0x3e95b4]  ThreadStateTransition::transition_from_native(JavaThread*, JavaThreadState, bool)+0x174
V  [libjvm.dylib+0x3e93d0]  ThreadInVMfromNative::ThreadInVMfromNative(JavaThread*)+0x70
V  [libjvm.dylib+0x8d7f74]  JfrJavaEventWriter::flush(_jobject*, int, int, JavaThread*)+0xf8
j  jdk.jfr.internal.JVM.flush(Ljdk/jfr/internal/event/EventWriter;II)V+0 jdk.jfr@23-internal
j  jdk.jfr.internal.event.EventWriter.flush(II)V+3 jdk.jfr@23-internal

@reinrich
Member Author

@MBaesken found 2 more locations in jvmti that need switching to WXWrite

JvmtiExport::get_jvmti_interface
GetCarrierThread

Both use ThreadInVMfromNative.

Should we address those 2 more findings in this PR ? Or open a separate JBS issue ?

I'm leaning towards fixing all locations in this PR even though this will prevent clean backports. Would that be ok?

@MBaesken
Member

@MBaesken found 2 more locations in jvmti that need switching to WXWrite

JvmtiExport::get_jvmti_interface
GetCarrierThread

Both use ThreadInVMfromNative.

Should we address those 2 more findings in this PR ? Or open a separate JBS issue ?

I'm leaning towards fixing all locations in this PR even though this will prevent clean backports. Would that be ok?

I think this is ok.

@MBaesken
Member

MBaesken commented Mar 15, 2024

JfrRecorderService::emit_leakprofiler_events (src/hotspot/share/jfr/recorder/service/jfrRecorderService.cpp ) and JfrJavaEventWriter::flush (src/hotspot/share/jfr/writers/jfrJavaEventWriter.cpp) might need adjustment too (see the other findings I posted yesterday).

@tobiasholenstein
Member

As I wrote in JBS, shouldn't this be handled by ThreadInVMfromNative?

I agree. This is something I am investigating at the moment. Ideally, AssertWXAtThreadSync would also be true by default.

@reinrich
Member Author

As I wrote in JBS, shouldn't this be handled by ThreadInVMfromNative?

I agree. This is something I am investigating at the moment. Ideally, AssertWXAtThreadSync would also be true by default.

I've added a bunch more locations we've seen when testing with AssertWXAtThreadSync.

@tobiasholenstein do you think this PR is actually not needed because you are going to push a better way of handling the WXMode in the near future?
How should we handle the issues in released versions (JDK 21, 17, ...)? Will it be possible to backport your work?

@reinrich reinrich changed the title 8327990: [macosx-aarch64] JFR enters VM without WXWrite 8327990: [macosx-aarch64] Various tests fail with -XX:+AssertWXAtThreadSync Mar 18, 2024
Member

@dholmes-ora dholmes-ora left a comment

I still feel that this should be fixed inside the ThreadInVMfromNative transition - the number of callsites that need it just reinforces that for me. Granted we then need to look at where we would now have redundant calls.

That said we have had a lot of people looking at the overall WX state management logic in the past week or so due to https://bugs.openjdk.org/browse/JDK-8327860. The workaround there requires us to be in EXEC mode and there are a number of folk who are questioning why "we" chose WRITE mode as the default with a switch to EXEC, instead of EXEC as the default with a switch to WRITE. But whichever way that goes I think the VM state transitions are the places to enforce that choice.
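
The idea of enforcing the WX mode inside the VM state transition could look roughly like the sketch below. It is an illustration only: `FakeThread`, `enter_vm`, `PlainInVMFromNative` and `WXAwareInVMFromNative` are hypothetical names, the check in `enter_vm` merely counts what an AssertWXAtThreadSync-style assertion would flag, and nothing here reflects how HotSpot actually implements ThreadInVMfromNative.

```cpp
#include <cassert>

enum class WXMode { Write, Exec };
enum class ThreadState { InNative, InVM };

// Simulated per-thread bookkeeping (hypothetical; not HotSpot's Thread class).
struct FakeThread {
  WXMode wx = WXMode::Exec;
  ThreadState state = ThreadState::InNative;
  int failed_wx_checks = 0;  // what an AssertWXAtThreadSync-style check would flag
};

// Shared entry path: a debug VM would assert here instead of counting.
inline void enter_vm(FakeThread& t) {
  if (t.wx != WXMode::Write) ++t.failed_wx_checks;
  t.state = ThreadState::InVM;
}

// Transition that leaves WX handling to each call site.
class PlainInVMFromNative {
 public:
  explicit PlainInVMFromNative(FakeThread& t) : _t(t) { enter_vm(_t); }
  ~PlainInVMFromNative() { _t.state = ThreadState::InNative; }

 private:
  FakeThread& _t;
};

// Transition that switches to Write itself and restores the old mode on exit.
class WXAwareInVMFromNative {
 public:
  explicit WXAwareInVMFromNative(FakeThread& t) : _t(t), _old_wx(t.wx) {
    _t.wx = WXMode::Write;
    enter_vm(_t);
  }
  ~WXAwareInVMFromNative() {
    _t.state = ThreadState::InNative;
    _t.wx = _old_wx;
  }

 private:
  FakeThread& _t;
  WXMode _old_wx;
};

// A call site that forgot its own WX guard trips the check...
inline int failed_checks_plain() {
  FakeThread t;
  { PlainInVMFromNative tiv(t); }
  return t.failed_wx_checks;
}

// ...while the WX-aware transition passes without any call-site guard.
inline int failed_checks_wx_aware() {
  FakeThread t;
  { WXAwareInVMFromNative tiv(t); }
  return t.failed_wx_checks;
}
```

A call site that already holds its own guard would then switch redundantly, which is the follow-up cleanup mentioned above.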

@theRealAph
Contributor

That said we have had a lot of people looking at the overall WX state management logic in the past week or so due to https://bugs.openjdk.org/browse/JDK-8327860. The workaround there requires us to be in EXEC mode

That's very odd. The example there doesn't even involve MAP_JIT memory, so what does it have to do with WX?

and there are a number of folk who are questioning why "we" chose WRITE mode as the default with a switch to EXEC, instead of EXEC as the default with a switch to WRITE.
But whichever way that goes I think the VM state transitions are the places to enforce that choice.

Hmm. Changing WX at VM state transitions is a form of temporal coupling, a classic design smell that has caused problems for decades. It's causing problems for us now. Instead, could we tag code that needs one or the other, keep track of the current WX state in thread-local memory, and flip WX only when we know we need to? That'd (by definition) reduce the number of transitions to the minimum if we were thorough.

@dholmes-ora
Member

That's very odd. The example there doesn't even involve MAP_JIT memory, so what does it have to do with WX?

@theRealAph that is the mystery we hope will be resolved once we know the nature of the underlying OS bug. Somehow switching to exec mode fixes/works-around the issue. I can imagine a missing conditional to check if the region is MAP_JIT.

Changing WX at VM state transitions is a form of temporal coupling, a classic design smell that has caused problems for decades.

The original introducers of WXEnable made the decision that the VM should be in WRITE mode unless it needs EXEC. That is the state we are presently trying to achieve with this change. If that original design choice is wrong then ...

Instead, could we tag code that needs one or the other, keep track of the current WX state in thread-local memory, and flip WX only when we know we need to?

And I've asked about this every time a missing WXEnable has had to be added. We seem to be generically able to describe what kind of code needs which mode, but we seem to struggle to pin it down. Though that is what https://bugs.openjdk.org/browse/JDK-8307817 is looking at doing.

That'd (by definition) reduce the number of transitions to the minimum if we were through.

Not necessarily. It may well remove some transitions from paths that don't need it, but if you move the state change too low down the call chain you could end up transitioning much more often in code that does need it e.g. if a transitioning method is called in a loop. We need to optimise the actual call paths as well as identify specific methods.

But all this discussion suggests to me that this PR is not really worth pursuing at this time - IIUC no actual failures are observed other than those pertaining to AssertWXAtThreadSync and that flag will be gone if we do decide to be more fine-grained about WX management.

@MBaesken
Member

IIUC no actual failures are observed other than those pertaining to AssertWXAtThreadSync

We saw sporadic crashes in our jtreg (maybe also JCK?) runs; only later did we enable AssertWXAtThreadSync for more checking.

@tstuefe
Member

tstuefe commented Mar 19, 2024

Instead, could we tag code that needs one or the other, keep track of the current WX state in thread-local memory, and flip WX only when we know we need to?

The first part we already do.

I wonder whether we could, at least as a workaround in case we missed a spot, do WX switching as a reaction to a SIGBUS related to a WX violation in the code cache: switch the state around, return from the signal handler, and retry the operation.

(Edit: tested it, does not seem to work. I guess when the SIGBUS is triggered in the kernel thread WX state had already been processed somehow).

That's very odd. The example there doesn't even involve MAP_JIT memory, so what does it have to do with WX?

@theRealAph that is the mystery we hope will be resolved once we know the nature of the underlying OS bug. Somehow switching to exec mode fixes/works-around the issue. I can imagine a missing conditional to check if the region is MAP_JIT.

Changing WX at VM state transitions is a form of temporal coupling, a classic design smell that has caused problems for decades.

The original introducers of WXEnable made the decision that the VM should be in WRITE mode unless it needs EXEC. That is the state we are presently trying to achieve with this change. If that original design choice is wrong then ...

Instead, could we tag code that needs one or the other, keep track of the current WX state in thread-local memory, and flip WX only when we know we need to?

And I've asked about this every time a missing WXEnable has had to be added. We seem to be generically able to describe what kind of code needs which mode, but we seem to struggle to pin it down. Though that is what https://bugs.openjdk.org/browse/JDK-8307817 is looking at doing.

That'd (by definition) reduce the number of transitions to the minimum if we were through.

Not necessarily. It may well remove some transitions from paths that don't need it, but if you move the state change too low down the call chain you could end up transitioning much more often in code that does need it e.g. if a transitioning method is called in a loop.

Not if you do the switching lazily. The first iteration would switch to the needed state; subsequent iterations would not do anything since the state already matches. Unless you interleave writes and execs, but then you would need the state changes anyway.
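
The difference between scoped and lazy switching in a loop can be counted with a small model. This is a sketch under assumptions, not HotSpot code: `WXTracker` and the three helper functions are hypothetical, the tracker is a plain struct where a VM would keep the record in thread-local memory, and each mode change is counted as one flip.

```cpp
#include <cassert>

enum class WXMode { Write, Exec };

// Tracks the current mode and counts real mode flips. In a VM this record
// would live in thread-local memory; here it is a plain struct for clarity.
struct WXTracker {
  WXMode mode = WXMode::Exec;
  int flips = 0;

  // Lazy switch: flip only if the current mode differs from the wanted one.
  void require(WXMode wanted) {
    if (mode != wanted) {
      mode = wanted;
      ++flips;
    }
  }
};

// Scoped switching low in the call chain: each iteration switches to Write
// and restores the previous mode on the way out.
inline int flips_scoped_in_loop(int iterations) {
  WXTracker t;
  for (int i = 0; i < iterations; ++i) {
    WXMode previous = t.mode;
    t.require(WXMode::Write);  // enter the writing callee
    t.require(previous);       // leave it again, restoring the old mode
  }
  return t.flips;
}

// Lazy switching: the first iteration flips, later ones find the mode set.
inline int flips_lazy_in_loop(int iterations) {
  WXTracker t;
  for (int i = 0; i < iterations; ++i) {
    t.require(WXMode::Write);
  }
  return t.flips;
}

// Interleaved writes and execs: the flips are needed either way.
inline int flips_lazy_interleaved(int iterations) {
  WXTracker t;
  for (int i = 0; i < iterations; ++i) {
    t.require(WXMode::Write);
    t.require(WXMode::Exec);
  }
  return t.flips;
}
```

For ten iterations the scoped variant flips twenty times and the lazy variant once; when writes and execs interleave, the lazy variant also needs twenty flips, matching the point that those state changes are unavoidable.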

@reinrich
Member Author

But all this discussion suggests to me that this PR is not really worth pursuing at this time - IIUC no actual failures are observed other than those pertaining to AssertWXAtThreadSync and that flag will be gone if we do decide to be more fine-grained about WX management.

I see it differently. This PR is just a simple attempt to get clean test runs with AssertWXAtThreadSync (after fixing an actual crash https://bugs.openjdk.org/browse/JDK-8327036). While the violating locations in this PR might be unlikely to produce actual crashes I think it is worthwhile to have clean testing with AssertWXAtThreadSync because this will help prevent regressions that are more likely.

Beyond the trivial fixes of this PR I'm very much in favor of further enhancements as the aforementioned https://bugs.openjdk.org/browse/JDK-8307817.
My recommendation would be to remove as much non-constant data from the code cache as possible.

@tobiasholenstein
Member

I see it differently. This PR is just a simple attempt to get clean test runs with AssertWXAtThreadSync (after fixing an actual crash https://bugs.openjdk.org/browse/JDK-8327036). While the violating locations in this PR might be unlikely to produce actual crashes I think it is worthwhile to have clean testing with AssertWXAtThreadSync because this will help prevent regressions that are more likely.

I agree. Fixing the current state with this PR makes sense to me. Changing the logic of W^X will take more time and discussion. So from my point of view this PR is ready and should be integrated. If no one disagrees, I will approve.

Member

@tstuefe tstuefe left a comment

I think this patch makes sense, and does not compete with a long-term solution.

@openjdk openjdk bot added the ready Pull request is ready to be integrated label Mar 19, 2024
Contributor

@TheRealMDoerr TheRealMDoerr left a comment

+1

@mlbridge

mlbridge bot commented Mar 19, 2024

Mailing list message from dean.long at oracle.com on hotspot-dev:

On 3/19/24 8:20 AM, Thomas Stuefe wrote:

I wonder whether we could - at least as workaround for if we missed a
spot - do wx switching as a reaction to a SIGBUS related to WX violation
in code cache. Switch state around, return from signal handler and
retry operation.

(Edit: tested it, does not seem to work. I guess when the SIGBUS is
triggered in the kernel thread WX state had already been processed
somehow).

That makes sense if the WX state is part of the signal context saved and
restored by the signal handler.

dl

@theRealAph
Contributor

Not necessarily. It may well remove some transitions from paths that don't need it, but if you move the state change too low down the call chain you could end up transitioning much more often in code that does need it e.g. if a transitioning method is called in a loop.

Not if you do the switching lazily. The first iteration would switch to the needed state; subsequent iterations would not do anything since the state already matches. Unless you interleave writes and execs, but then you would need the state changes anyway.

Exactly. You do the transition when it's needed.

Member

@dholmes-ora dholmes-ora left a comment

Not ideal but it fixes a real problem.

Contributor

@TheRealMDoerr TheRealMDoerr left a comment

Good catch!

@reinrich
Member Author

Tests with AssertWXAtThreadSync are clean now. Thanks!
/integrate

@openjdk

openjdk bot commented Mar 21, 2024

Going to push as commit e41bc42.
Since your change was applied there have been 25 commits pushed to the master branch:

  • 4308017: 8328631: Convert java/awt/InputMethods/InputMethodsTest/InputMethodsTest.java applet test to manual
  • 700d2b9: 8328401: Convert java/awt/Frame/InitialMaximizedTest/InitialMaximizedTest.html applet test to automated
  • bb3e84b: 8328592: hprof tests fail with -XX:-CompactStrings
  • ac2f8e5: 8327994: Update code gen in CallGeneratorHelper
  • c434b79: 8327169: serviceability/dcmd/vm/SystemMapTest.java and SystemDumpMapTest.java may fail after JDK-8326586
  • 7006790: 8328628: JDK-8328157 incorrectly sets -MT on all compilers in jdk.jpackage
  • 68170ae: 8328238: Convert few closed manual applet tests to main
  • 9f5ad43: 8320675: PrinterJob/SecurityDialogTest.java hangs
  • 684678f: 8328633: s390x: Improve vectorization of Match.sqrt() on floats
  • 93d1700: 8328589: unify os::breakpoint among posix platforms
  • ... and 15 more: https://git.openjdk.org/jdk/compare/eebcc2181fe27f6aa10559233c7c58882a146f56...master

Your commit was automatically rebased without conflicts.

@openjdk openjdk bot added the integrated Pull request has been integrated label Mar 21, 2024
@openjdk openjdk bot closed this Mar 21, 2024
@openjdk openjdk bot removed ready Pull request is ready to be integrated rfr Pull request is ready for review labels Mar 21, 2024
@openjdk

openjdk bot commented Mar 21, 2024

@reinrich Pushed as commit e41bc42.

💡 You may see a message that your pull request was closed with unmerged commits. This can be safely ignored.

@reinrich reinrich deleted the 8327990_jfr_enters_vm_without_wxwrite branch March 27, 2024 15:52
Labels
hotspot hotspot-dev@openjdk.org hotspot-jfr hotspot-jfr-dev@openjdk.org integrated Pull request has been integrated serviceability serviceability-dev@openjdk.org
7 participants