Conversation

@sspitsyn
Contributor

@sspitsyn sspitsyn commented Dec 7, 2023

This fix is for JDK 23, but the intention is to backport it to JDK 22 in the RDP-1 time frame.
It fixes a deadlock between VirtualThread critical sections that hold the interruptLock (in the methods unpark(), interrupt(), getAndClearInterrupt(), threadState(), and toString()), JvmtiVTMSTransitionDisabler, and the JVMTI Suspend/Resume mechanisms.
The deadlock scenario is well described by Patricio in a bug report comment.
In simple terms, a virtual thread should not be suspended during 'interruptLock' critical sections.

The fix is to record that a virtual thread is in a critical section (JavaThread's _in_critical_section bit) by notifying the VM/JVMTI about the beginning and end of the critical section.
This bit is used in HandshakeState::get_op_for_self() to filter out any HandshakeOperation if the target JavaThread is in a critical section.
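
The filtering idea can be illustrated with a small single-threaded model. This is only a sketch: `HandshakeModel` and its methods are hypothetical names, the queue of strings stands in for pending handshake operations, and nothing here is HotSpot code.

```java
import java.util.ArrayDeque;
import java.util.Queue;

// Hypothetical model of a target thread's handshake state; not HotSpot code.
final class HandshakeModel {
    private final Queue<String> pending = new ArrayDeque<>();
    private boolean inCriticalSection;   // models the _in_critical_section bit

    void enterCriticalSection() { inCriticalSection = true; }
    void exitCriticalSection()  { inCriticalSection = false; }

    // A requester (for example a suspend request) queues an operation for the target.
    void request(String op) { pending.add(op); }

    // Models the filtering described for HandshakeState::get_op_for_self():
    // while the target is in a critical section, no operation is handed out.
    String getOpForSelf() {
        return inCriticalSection ? null : pending.poll();
    }

    public static void main(String[] args) {
        HandshakeModel target = new HandshakeModel();
        target.enterCriticalSection();
        target.request("suspend");
        System.out.println(target.getOpForSelf()); // null: deferred while in the critical section
        target.exitCriticalSection();
        System.out.println(target.getOpForSelf()); // suspend
    }
}
```

In this model the request is not lost; it stays queued and is picked up once the critical section ends.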

Some of the new notifications made with the notifyJvmtiSync() method are on a performance-critical path, which is why this method has been intrinsified.

A new test was developed by Patricio:
test/hotspot/jtreg/serviceability/jvmti/vthread/SuspendWithInterruptLock
Without the fix, the test reliably reproduces the deadlock 100% of the time.
With the fix, the test never fails.

Testing:

  • tested with newly added test: test/hotspot/jtreg/serviceability/jvmti/vthread/SuspendWithInterruptLock
  • tested with mach5 tiers 1-6

Progress

  • Change must be properly reviewed (1 review required, with at least 1 Reviewer)
  • Change must not contain extraneous whitespace
  • Commit message must refer to an issue

Issue

  • JDK-8311218: fatal error: stuck in JvmtiVTMSTransitionDisabler::VTMS_transition_disable (Bug - P3)

Reviewers

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/17011/head:pull/17011
$ git checkout pull/17011

Update a local copy of the PR:
$ git checkout pull/17011
$ git pull https://git.openjdk.org/jdk.git pull/17011/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 17011

View PR using the GUI difftool:
$ git pr show -t 17011

Using diff file

Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/17011.diff

Webrev

Link to Webrev Comment

@bridgekeeper

bridgekeeper bot commented Dec 7, 2023

👋 Welcome back sspitsyn! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

@openjdk

openjdk bot commented Dec 7, 2023

@sspitsyn The following labels will be automatically applied to this pull request:

  • build
  • core-libs
  • graal
  • hotspot
  • serviceability

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing lists. If you would like to change these labels, use the /label pull request command.

@openjdk openjdk bot added graal graal-dev@openjdk.org serviceability serviceability-dev@openjdk.org hotspot hotspot-dev@openjdk.org build build-dev@openjdk.org core-libs core-libs-dev@openjdk.org labels Dec 7, 2023
@openjdk openjdk bot added the rfr Pull request is ready for review label Dec 7, 2023
@mlbridge

mlbridge bot commented Dec 7, 2023

Webrevs

@openjdk

openjdk bot commented Dec 7, 2023

@sspitsyn this pull request can not be integrated into master due to one or more merge conflicts. To resolve these merge conflicts and update this pull request you can run the following commands in the local repository for your personal fork:

git checkout b13
git fetch https://git.openjdk.org/jdk.git master
git merge FETCH_HEAD
# resolve conflicts and follow the instructions given by git merge
git commit -m "Merge master"
git push

@openjdk openjdk bot added the merge-conflict Pull request has merge conflict with target branch label Dec 7, 2023
@openjdk openjdk bot removed the merge-conflict Pull request has merge conflict with target branch label Dec 8, 2023
@AlanBateman
Contributor

AlanBateman commented Dec 8, 2023

I chatted briefly with @sspitsyn about this. A couple of points:

  • It shouldn't be necessary to touch mount/unmount as the thread identity is the carrier, not the virtual thread, when executing the "critical code".
  • toggle_is_in_critical_section needs to detect reentrancy, it is otherwise too easy to refactor the Java code, e.g. call threadState while holding the interrupt lock.
  • All the use-sides will need to use try-finally to more reliably revert the critical section flag when rewinding.
  • The naming is very problematic, we'll need to replace with methods that are clearly named enter and exit critical section. Ongoing work in this area to support monitors has to introduce some temporary pinning so there will be enter/exitCriticalSection methods, that's a better place for the JVMTI hooks.
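
A minimal sketch of what these points could look like together, assuming a per-thread flag: the method names enterCriticalSection/exitCriticalSection follow the comment above, while the class `CriticalSectionGuard`, the ThreadLocal, and the exceptions are hypothetical and only illustrate reentrancy detection plus try-finally at the use site.

```java
// Hypothetical helper; illustrative only, not the JDK implementation.
final class CriticalSectionGuard {
    // Per-thread flag, standing in for the bit kept on the current thread.
    private static final ThreadLocal<Boolean> IN_SECTION =
            ThreadLocal.withInitial(() -> false);

    static void enterCriticalSection() {
        if (IN_SECTION.get()) {
            throw new IllegalStateException("reentrant critical section");
        }
        IN_SECTION.set(true);
    }

    static void exitCriticalSection() {
        if (!IN_SECTION.get()) {
            throw new IllegalStateException("exit without matching enter");
        }
        IN_SECTION.set(false);
    }

    static boolean inCriticalSection() { return IN_SECTION.get(); }

    // Use-site pattern: the flag is reverted even if the body throws.
    static void runInCriticalSection(Runnable body) {
        enterCriticalSection();
        try {
            body.run();
        } finally {
            exitCriticalSection();
        }
    }

    public static void main(String[] args) {
        runInCriticalSection(() -> System.out.println("inside: " + inCriticalSection()));
        System.out.println("after: " + inCriticalSection());
        try {
            // A refactoring that nests critical sections is caught immediately.
            runInCriticalSection(() -> runInCriticalSection(() -> { }));
        } catch (IllegalStateException e) {
            System.out.println("detected: " + e.getMessage());
        }
        System.out.println("after failure: " + inCriticalSection());
    }
}
```

A plain toggle would silently clear the flag on a nested call; the explicit check turns that refactoring mistake into an immediate failure.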

@magicus
Member

magicus commented Dec 8, 2023

/label -build

@openjdk openjdk bot removed the build build-dev@openjdk.org label Dec 8, 2023
@openjdk

openjdk bot commented Dec 8, 2023

@magicus
The build label was successfully removed.

while (iterations-- > 0) {
    Thread.yield();
}
done = true;
Member

I think it would be better to use done to stop all threads, and to set it to true in the main thread after some time. That way you could be sure that the yielder had not completed before the suspender started. But it is just a proposal.
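
The proposed structure could be sketched as follows. This is a hypothetical skeleton, not the actual SuspendWithInterruptLock test: the class name `DoneFlagSketch`, the `runFor` helper, and the stand-in "suspender" (which only polls the yielder's state) are assumptions made for illustration.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical skeleton, not the actual SuspendWithInterruptLock test.
final class DoneFlagSketch {
    // Shared stop flag, set only by the main thread.
    private static volatile boolean done;

    // Runs both workers for roughly `millis` ms and returns their iteration counts.
    static long[] runFor(long millis) {
        done = false;
        AtomicLong yields = new AtomicLong();
        AtomicLong polls = new AtomicLong();

        Thread yielder = new Thread(() -> {
            do {
                Thread.yield();
                yields.incrementAndGet();
            } while (!done);
        });
        // Stand-in for the suspender: here it only polls the yielder's state.
        Thread suspender = new Thread(() -> {
            do {
                yielder.getState();
                polls.incrementAndGet();
            } while (!done);
        });

        yielder.start();
        suspender.start();
        try {
            Thread.sleep(millis);
            done = true;              // the main thread decides when everyone stops
            yielder.join();
            suspender.join();
        } catch (InterruptedException e) {
            done = true;              // let the workers exit before failing
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return new long[] { yields.get(), polls.get() };
    }

    public static void main(String[] args) {
        long[] counts = runFor(200);
        System.out.println("yields=" + counts[0] + " polls=" + counts[1]);
    }
}
```

Because neither worker owns the flag, the yielder cannot finish its loop before the other thread has started, which is the property the comment asks for.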

Contributor Author

Thank you. Will consider this.


// Notification from VirtualThread about entering/exiting sync critical section.
// Needed to avoid deadlocks with JVMTI suspend mechanism.
JVM_ENTRY(void, JVM_VirtualThreadCriticalLock(JNIEnv* env, jobject vthread, jboolean enter))
Member

The jobject vthread is not used. Can't the method be made static to reduce the number of arguments?
This is performance-critical code, and I don't know if it is optimized by C2.

Contributor Author

Good question.
In general, I'd like to keep this unified with the other notifyJvmti methods.
Let me double check how it fits together.
Also, I'm not sure how it is going to impact the intrinsification.

@sspitsyn
Contributor Author

@AlanBateman Thank you for reviewing and for the comments.

It shouldn't be necessary to touch mount/unmount as the thread identity is the carrier, not the virtual thread, when executing the "critical code".

A carrier thread can also be suspended when executing the "critical code".
Why do you think it can't be a problem? Do you think the deadlocking scenario described in the bug report is not possible?

toggle_is_in_critical_section needs to detect reentrancy, it is otherwise too easy to refactor the Java code, e.g. call threadState while holding the interrupt lock.

Is your concern a recursive interruptLock enter? I was also wondering whether this scenario is possible; if so, a counter can be used instead of a boolean.

All the use-sides will need to use try-finally to more reliably revert the critical section flag when rewinding.

Right, thanks. It is already done.

The naming is very problematic, we'll need to replace with methods that are clearly named enter and exit critical section. Ongoing work in this area to support monitors has to introduce some temporary pinning so there will be enter/exitCriticalSection methods, that's a better place for the JVMTI hooks.

Okay. What about Leonid's suggestion to name it notifyJvmtiDisableSuspend()?

@AlanBateman
Contributor

A carrier thread can also be suspended when executing the "critical code". Why do you think it can't be a problem? Do you think the deadlocking scenario described in the bug report is not possible?

It's a different scenario. When mounting, the coordination of the interrupt status is done before the thread identity is changed. Similarly, when unmounting, the coordination is done after reverting the thread identity to the carrier. So if there is an agent randomly suspending threads, then it shouldn't be an issue here.

toggle_is_in_critical_section needs to detect reentrancy, it is otherwise too easy to refactor the Java code, e.g. call threadState while holding the interrupt lock.

Is your concern a recursive interruptLock enter? I was also wondering whether this scenario is possible; if so, a counter can be used instead of a boolean.

Minimally an assert. A counter might be needed later.
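
For comparison, the counter variant mentioned here might look like the sketch below. The class `SuspendDisableDepth` and its method names are hypothetical; the depth is kept per thread, and suspension counts as disabled while the depth is greater than zero.

```java
// Hypothetical counter-based variant; illustrative only.
final class SuspendDisableDepth {
    // One counter per thread, so nested sections on the same thread are allowed.
    private static final ThreadLocal<int[]> DEPTH =
            ThreadLocal.withInitial(() -> new int[1]);

    static void disableSuspend() {
        DEPTH.get()[0]++;
    }

    static void enableSuspend() {
        int[] depth = DEPTH.get();
        if (depth[0] == 0) {
            throw new IllegalStateException("unbalanced enableSuspend");
        }
        depth[0]--;
    }

    static boolean isSuspendDisabled() {
        return DEPTH.get()[0] > 0;
    }

    public static void main(String[] args) {
        disableSuspend();
        disableSuspend();                          // nested section
        enableSuspend();
        System.out.println(isSuspendDisabled());   // true: the outer section is still open
        enableSuspend();
        System.out.println(isSuspendDisabled());   // false
    }
}
```

The trade-off is that a counter tolerates nesting instead of flagging it, so it hides the refactoring mistakes that an assert would surface.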

Okay. What about Leonid's suggestion to name it notifyJvmtiDisableSuspend()?

We have changes in the works that require pinning during some critical sections so I think I prefer to use that terminology. We can move the notification to JVMTI to the enter/leave methods.

@AlanBateman
Contributor

Okay. What about Leonid's suggestion to name it notifyJvmtiDisableSuspend()?

Okay with me. We'll need to move the notifyJvmtiDisableSuspend(true) to before the try in all cases, I've pointed out the cases that we missed.
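
The ordering being asked for can be sketched as follows. The stub `notifyJvmtiDisableSuspend(boolean)` below only records calls and is a stand-in for the notification method discussed in this review; the class `DisableSuspendOrdering` and the helper `withInterruptLock` are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of the call ordering; not the VirtualThread code.
final class DisableSuspendOrdering {
    static final List<String> events = new ArrayList<>();

    // Recording stub standing in for the real notification.
    static void notifyJvmtiDisableSuspend(boolean disable) {
        events.add(disable ? "disable" : "enable");
    }

    static void withInterruptLock(Object interruptLock, Runnable body) {
        // Before the try: the finally block runs only if this call completed,
        // so every "enable" is paired with an earlier "disable".
        notifyJvmtiDisableSuspend(true);
        try {
            synchronized (interruptLock) {
                body.run();
            }
        } finally {
            notifyJvmtiDisableSuspend(false);
        }
    }

    public static void main(String[] args) {
        Object interruptLock = new Object();
        try {
            withInterruptLock(interruptLock, () -> {
                throw new IllegalStateException("boom");
            });
        } catch (IllegalStateException e) {
            // The notifications stay balanced even though the body failed.
        }
        System.out.println(events); // [disable, enable]
    }
}
```

Had the first call been placed inside the try, a failure in it would still trigger the finally block and emit an unmatched "enable".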

@sspitsyn
Contributor Author

sspitsyn commented Dec 14, 2023

Okay with me. We'll need to move the notifyJvmtiDisableSuspend(true) to before the try in all cases, I've pointed out the cases that we missed.

Thank you, Alan. Fixed now.
I believe all your suggestions have been addressed.

@AlanBateman
Contributor

Thank you, Alan. Fixed now. I believe all your suggestions have been addressed.

Thanks, it looks much better now.

Contributor

@AlanBateman AlanBateman left a comment

I think okay, I don't have any other comments.

@openjdk

openjdk bot commented Dec 15, 2023

@sspitsyn This change now passes all automated pre-integration checks.

ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details.

After integration, the commit message for the final commit will be:

8311218: fatal error: stuck in JvmtiVTMSTransitionDisabler::VTMS_transition_disable

Reviewed-by: lmesnik, alanb

You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed.

At the time when this comment was updated there had been 1 new commit pushed to the master branch:

  • 1fde8b8: 8321933: TestCDSVMCrash.java spawns two processes

Please see this link for an up-to-date comparison between the source branch of this pull request and the master branch.
As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid this automatic rebasing, please check the documentation for the /integrate command for further details.

➡️ To integrate this PR with the above commit message to the master branch, type /integrate in a new comment.

@openjdk openjdk bot added the ready Pull request is ready to be integrated label Dec 15, 2023
@sspitsyn
Contributor Author

sspitsyn commented Dec 18, 2023

Alan and Leonid, thank you for the review!
I will push after the final mach5 testing is completed.

@sspitsyn
Contributor Author

/integrate

@openjdk

openjdk bot commented Dec 19, 2023

Going to push as commit 0f8e4e0.
Since your change was applied there have been 14 commits pushed to the master branch:

  • 6313223: 8315856: RISC-V: Use Zacas extension for cmpxchg
  • 3bc5679: 8322309: Fix an inconsistancy in spacing style in spec.gmk.template
  • be49dab: 8321619: Generational ZGC: ZColorStoreGoodOopClosure is only valid for young objects
  • ac968c3: 8319451: PhaseIdealLoop::conditional_move is too conservative
  • 0ad6c9e: 8322255: Generational ZGC: ZPageSizeMedium should be set before MaxTenuringThreshold
  • fff2e58: 8322195: RISC-V: Minor improvement of MD5 instrinsic
  • 7b4d62c: 8322300: Remove redundant arg in PSAdaptiveSizePolicy::adjust_promo_for_pause_time
  • 76637c5: 8321648: Integral gather optimized mask computation.
  • 59073fa: 8322154: RISC-V: JDK-8315743 missed change in MacroAssembler::load_reserved
  • 808a039: 8321815: Shenandoah: gc state should be synchronized to java threads only once per safepoint
  • ... and 4 more: https://git.openjdk.org/jdk/compare/66aeb89469c20f1f1840773e59d3b45393418344...master

Your commit was automatically rebased without conflicts.

@openjdk openjdk bot added the integrated Pull request has been integrated label Dec 19, 2023
@openjdk openjdk bot closed this Dec 19, 2023
@openjdk openjdk bot removed ready Pull request is ready to be integrated rfr Pull request is ready for review labels Dec 19, 2023
@openjdk

openjdk bot commented Dec 19, 2023

@sspitsyn Pushed as commit 0f8e4e0.

💡 You may see a message that your pull request was closed with unmerged commits. This can be safely ignored.

Comment on lines +4022 to +4024
#else
fatal("Should only be called with JVMTI enabled");
#endif
Member

You can't do this! The Java code knows nothing about JVM TI being enabled/disabled and will call this function unconditionally.

Contributor

You can't do this! The Java code knows nothing about JVM TI being enabled/disabled and will call this function unconditionally.

Indeed. I wonder if anyone is testing minimal builds to catch issues like this.

Contributor Author

Good catch, David!
Filed a cleanup bug: https://bugs.openjdk.org/browse/JDK-8322538

Contributor

Are these new compiler intrinsics required or an optional performance optimization? This PR creates issues for us when updating the JDK build for Graal. CC @davleopo

Contributor

Are these new compiler intrinsics required or an optional performance optimization?

Performance. If the intrinsic isn't there then some methods executed on virtual threads, or on a virtual thread as the target for some op, will have to call into the VM. The main concern was Thread.interrupted() as it gets called very frequently in locking and concurrency code.

void toggle_is_in_tmp_VTMS_transition() { _is_in_tmp_VTMS_transition = !_is_in_tmp_VTMS_transition; };

bool is_disable_suspend() const { return _is_disable_suspend; }
void toggle_is_disable_suspend() { _is_disable_suspend = !_is_disable_suspend; };
Contributor

@AlanBateman AlanBateman Dec 20, 2023

Looking at this again, I don't think a bit that is toggled on and off will work. Consider the case where several threads attempt to poll the state of a virtual thread with Thread::getState at the same time. This can't work without an atomic counter and further coordination. So I think further work is required on this issue.

Update: ignore this. I misread it: it updates the current thread's suspend value, not the target thread's suspend value.

Contributor Author

Update: ignore this. I misread it: it updates the current thread's suspend value, not the target thread's suspend value.

Thanks, Alan. I also got confused by this and even filed a follow-up bug. :)
Yes, the initial design was that _is_disable_suspend is set, modified, and accessed on the current thread only.
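
A small sketch of why a per-current-thread flag needs no atomic counter even when several threads poll the same target: each poller toggles only its own flag. The class `PerThreadFlagSketch`, the ThreadLocal, and the helper names are hypothetical stand-ins, not the VM's data structures.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical model: the flag lives on the polling (current) thread, not on the target.
final class PerThreadFlagSketch {
    private static final ThreadLocal<boolean[]> DISABLE_SUSPEND =
            ThreadLocal.withInitial(() -> new boolean[1]);

    static void toggleDisableSuspend() {
        boolean[] flag = DISABLE_SUSPEND.get();
        flag[0] = !flag[0];
    }

    static boolean isDisableSuspend() {
        return DISABLE_SUSPEND.get()[0];
    }

    // Stand-in for a poll such as Thread::getState on a shared target.
    static Thread.State pollState(Thread target) {
        toggleDisableSuspend();          // set on the caller
        try {
            return target.getState();
        } finally {
            toggleDisableSuspend();      // cleared on the same caller
        }
    }

    // Starts `pollers` threads that all poll one target; returns how many
    // of them ended with their own flag still set.
    static int countLeakedFlags(int pollers, int iterations) {
        Thread target = new Thread(() -> { });   // never started, so its state is NEW
        AtomicInteger leaked = new AtomicInteger();
        Thread[] threads = new Thread[pollers];
        for (int i = 0; i < pollers; i++) {
            threads[i] = new Thread(() -> {
                for (int j = 0; j < iterations; j++) {
                    pollState(target);
                }
                if (isDisableSuspend()) {
                    leaked.incrementAndGet();
                }
            });
            threads[i].start();
        }
        for (Thread t : threads) {
            try {
                t.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return leaked.get();
    }

    public static void main(String[] args) {
        System.out.println("leaked flags: " + countLeakedFlags(4, 10_000)); // leaked flags: 0
    }
}
```

If the flag were stored on the shared target instead, concurrent pollers would race on the toggle, which is the situation the earlier comment was worried about.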

@sspitsyn
Contributor Author

/backport jdk22

@openjdk

openjdk bot commented Dec 20, 2023

@sspitsyn the backport was successfully created on the branch backport-sspitsyn-0f8e4e0a in my personal fork of openjdk/jdk22. To create a pull request with this backport targeting openjdk/jdk22:master, just click the following link:

➡️ Create pull request

The title of the pull request is automatically filled in correctly and below you find a suggestion for the pull request body:

Hi all,

This pull request contains a backport of commit 0f8e4e0a from the openjdk/jdk repository.

The commit being backported was authored by Serguei Spitsyn on 19 Dec 2023 and was reviewed by Leonid Mesnik and Alan Bateman.

Thanks!

If you need to update the source branch of the pull then run the following commands in a local clone of your personal fork of openjdk/jdk22:

$ git fetch https://github.com/openjdk-bots/jdk22.git backport-sspitsyn-0f8e4e0a:backport-sspitsyn-0f8e4e0a
$ git checkout backport-sspitsyn-0f8e4e0a
# make changes
$ git add paths/to/changed/files
$ git commit --message 'Describe additional changes made'
$ git push https://github.com/openjdk-bots/jdk22.git backport-sspitsyn-0f8e4e0a

@sspitsyn sspitsyn deleted the b13 branch January 23, 2024 01:24

Labels

core-libs core-libs-dev@openjdk.org graal graal-dev@openjdk.org hotspot hotspot-dev@openjdk.org integrated Pull request has been integrated serviceability serviceability-dev@openjdk.org

6 participants