Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

8311981: Test gc/stringdedup/TestStringDeduplicationAgeThreshold.java#ZGenerational timed out #15240

Closed

Conversation

dholmes-ora
Copy link
Member

@dholmes-ora dholmes-ora commented Aug 11, 2023

Please see the JBS issue for full details on the underlying deadlock issue (credit to @stefank for discovering it) and the proposed solution (credit @pchilano and @xmas92 ). Quite simply we make HandshakeState::has_operation() non-blocking by using a try_lock and conservatively return true to indicate an operation may be pending. By not blocking we avoid the deadlock scenario. All usages of the changed code have been examined to see that they are safe with this change (they all basically just take a safe slow path to see if there really is an operation).

Testing:

  • tiers 1-4, 7
  • the failing string dedup test was run under our tier7 conditions, 10 times on linux-x64-debug and windows-x64-debug

Given the nature of the deadlock this testing is not sufficient to claims success as we probably only saw 1 failure in many hundreds of runs. So if anyone has suggestions for additional testing please speak up. Otherwise we are relying on "correctness by design" - we've removed a blocking condition that leads to the 3-way deadlock, and examined the code paths affected.

Thanks.


Progress

  • Change must be properly reviewed (1 review required, with at least 1 Reviewer)
  • Change must not contain extraneous whitespace
  • Commit message must refer to an issue

Issue

  • JDK-8311981: Test gc/stringdedup/TestStringDeduplicationAgeThreshold.java#ZGenerational timed out (Bug - P2)

Reviewers

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/15240/head:pull/15240
$ git checkout pull/15240

Update a local copy of the PR:
$ git checkout pull/15240
$ git pull https://git.openjdk.org/jdk.git pull/15240/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 15240

View PR using the GUI difftool:
$ git pr show -t 15240

Using diff file

Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/15240.diff

Webrev

Link to Webrev Comment

@bridgekeeper
Copy link

bridgekeeper bot commented Aug 11, 2023

👋 Welcome back dholmes! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

@openjdk openjdk bot added the rfr Pull request is ready for review label Aug 11, 2023
@openjdk
Copy link

openjdk bot commented Aug 11, 2023

@dholmes-ora The following label will be automatically applied to this pull request:

  • hotspot-runtime

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label pull request command.

@openjdk openjdk bot added the hotspot-runtime hotspot-runtime-dev@openjdk.org label Aug 11, 2023
@mlbridge
Copy link

mlbridge bot commented Aug 11, 2023

Webrevs

Copy link
Member

@stefank stefank left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this sounds like a good solution, but I haven't dug deep enough fully appreciate all the paths taken. Please make sure to get another (R)eviewer for this change.

@openjdk
Copy link

openjdk bot commented Aug 11, 2023

@dholmes-ora This change now passes all automated pre-integration checks.

ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details.

After integration, the commit message for the final commit will be:

8311981: Test gc/stringdedup/TestStringDeduplicationAgeThreshold.java#ZGenerational timed out

Reviewed-by: stefank, pchilanomate, dcubed, rehn

You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed.

At the time when this comment was updated there had been 13 new commits pushed to the master branch:

  • 4164693: 8313372: [JVMCI] Export vmIntrinsics::is_intrinsic_available results to JVMCI compilers.
  • 049b55f: 8314019: Add gc logging to jdk/jfr/event/gc/detailed/TestZAllocationStallEvent.java
  • a39ed10: 8314116: C2: assert(false) failed: malformed control flow after JDK-8305636
  • 1de5bf1: 8314106: C2: assert(is_valid()) failed: must be valid after JDK-8305636
  • 5c91622: 8314117: RISC-V: Incorrect VMReg encoding in RISCV64Frame.java
  • 6bbcef5: 8313948: Remove unnecessary static fields defaultUpper/defaultLower in sun.net.PortConfig
  • b88c273: 8313743: Make fields final in sun.nio.ch
  • ec0cc63: 8313904: [macos] All signing tests which verifies unsigned app images are failing
  • 7332502: 8314139: TEST_BUG: runtime/os/THPsInThreadStackPreventionTest.java could fail on machine with large number of cores
  • 8f1c134: 8313798: [aarch64] sun/tools/jhsdb/HeapDumpTestWithActiveProcess.java sometimes times out on aarch64
  • ... and 3 more: https://git.openjdk.org/jdk/compare/43462a36ab02b67d426c04d345868bd420b30c25...master

As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid this automatic rebasing, please check the documentation for the /integrate command for further details.

➡️ To integrate this PR with the above commit message to the master branch, type /integrate in a new comment.

@openjdk openjdk bot added the ready Pull request is ready to be integrated label Aug 11, 2023
Copy link
Contributor

@pchilano pchilano left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good.

Thanks,
Patricio

// "external" mutex. If the try_lock fails then we assume that there is an operation
// and force the caller to check more carefully in a safer context. If we can't get
// the lock it means another thread is trying to handshake with us, so it can't
// happen during thread termination and destruction.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why the particular mention about thread termination and destruction?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit typo: s/musn't/mustn't/

which is preferred according to urbandictionary.com.

I also don't understand the last sentence. More accurately, I understand the first
phrase, but I don't understand the need for the second phrase.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Earlier when I mistakenly thought the no-arg version was this version with default args, I was concerned about its use in the HandshakeState destructor, but reasoned that it was safe as there couldn't be a real pending op at that time to cause contention on the lock - hence the comment in the main code. I left the comment just to add some information on when during a thread's lifetime we could hit this problem - to make it easier to reason about. I can remove it if it is causing confusion.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. The only thing which might be confusing is what exactly "termination" phase includes. Because when we are in Threads::remove() with a status of _thread_exiting we can still fail to acquire the lock here while trying to grab the Threads_lock.

@dcubed-ojdk
Copy link
Member

Copy link
Member

@dcubed-ojdk dcubed-ojdk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thumbs up. The call site analysis is complicated. The only one that
bothers my brain are the suspend cases, but I've convinced myself
that these are okay even if we have errant duplicate calls.

// "external" mutex. If the try_lock fails then we assume that there is an operation
// and force the caller to check more carefully in a safer context. If we can't get
// the lock it means another thread is trying to handshake with us, so it can't
// happen during thread termination and destruction.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit typo: s/musn't/mustn't/

which is preferred according to urbandictionary.com.

I also don't understand the last sentence. More accurately, I understand the first
phrase, but I don't understand the need for the second phrase.

@dcubed-ojdk
Copy link
Member

@robehn - do you have time to take a peak at this one since it is Handshake related?

@dholmes-ora
Copy link
Member Author

Thanks for the reviews @stefank , @pchilano and @dcubed-ojdk .

@dcubed-ojdk I've fixed the typo. See response to @pchilano about the comment.

Thanks.

@robehn
Copy link
Contributor

robehn commented Aug 14, 2023

Hello, a question, if I understand the description correctly:
Even if "Generation GC Thread" gets the VM op lock, it will still wait for VM thread to execute it's VM op.
But if the VM thread is blocked in the handshake waiting for that VM op, we are still stuck?

What I'm I not understanding?

@xmas92
Copy link
Member

xmas92 commented Aug 14, 2023

Hello, a question, if I understand the description correctly: Even if "Generation GC Thread" gets the VM op lock, it will still wait for VM thread to execute it's VM op. But if the VM thread is blocked in the handshake waiting for that VM op, we are still stuck?

What I'm I not understanding?

In Monitor::wait() the InFlightMutexRelease is used by ThreadBlockInVMPreprocess to unlock the lock before calling SafepointMechanism::process_if_requested.

@robehn
Copy link
Contributor

robehn commented Aug 14, 2023

Hello, a question, if I understand the description correctly: Even if "Generation GC Thread" gets the VM op lock, it will still wait for VM thread to execute it's VM op. But if the VM thread is blocked in the handshake waiting for that VM op, we are still stuck?
What I'm I not understanding?

In Monitor::wait() the InFlightMutexRelease is used by ThreadBlockInVMPreprocess to unlock the lock before calling SafepointMechanism::process_if_requested.

Sorry if I didn't make it clear, as I understand the description:
The VM thread will not complete the handshake until "Generation GC Thread" have completed the VM op.
Even if the lock is unlocked VM thread will wait on ZCond, and "Generation GC Thread" will wait for VM thread to execute it's VM op. We may never wait for a safepoint in a handshake. Which is the reason why we can't allocate in handshake.

If "Generation GC Thread" install the new VM op to be execute, it will still wait until it have been execute.
I don't see how ZCond would be released?

The "Gen GC Thread" would still be in this method "VMThread::wait_until_executed".

Is my question clearer.

@xmas92
Copy link
Member

xmas92 commented Aug 14, 2023

Sorry if I didn't make it clear, as I understand the description: The VM thread will not complete the handshake until "Generation GC Thread" have completed the VM op. Even if the lock is unlocked VM thread will wait on ZCond, and "Generation GC Thread" will wait for VM thread to execute it's VM op. We may never wait for a safepoint in a handshake. Which is the reason why we can't allocate in handshake.

If "Generation GC Thread" install the new VM op to be execute, it will still wait until it have been execute. I don't see how ZCond would be released?

The "Gen GC Thread" would still be in this method "VMThread::wait_until_executed".

Is my question clearer.

The reason the handshake stalls is because the VM_ZRelocateStartX VM_Operation::do_operation has been executed. The gc thread should not depend on any new VM_Operation being executed.

However maybe there is still an issue here with process_if_requested. As this may require the handshake lock? and this still deadlocks the Monitor::wait call.

The deadlock arises from that the VM_Operation::do_operation has been executed, but the gc thread has not woken from wait_until_executed (finished it's Monitor::wait call) . And a handshake which triggers a relocation stall comes in in-between.

@robehn
Copy link
Contributor

robehn commented Aug 14, 2023

Thanks! It sounds like we want the VM op requester to leave VMThread::wait_until_executed/VMOperation_lock before starting the next VM op. As you say process_if_requested may also trigger this AFAICT.

@xmas92
Copy link
Member

xmas92 commented Aug 14, 2023

Member

Looking at it again. Because this is a gc thread which is not a JavaThread it will not do handshakes (process_if_requested when calling Monitor::wait) so the GC thread will progress as long as no JavaThread stalls (locks on handshake locks) with the VMOperation_lock held. This change should fix this. As it is released before process_if_requested. So the deadlock should be avoided. Unless there are other paths which leads to handshake locks being taken under the VMOperation_lock

@dholmes-ora
Copy link
Member Author

Thanks @xmas92 for explaining what happens after the current issue gets resolved.

Thanks for the reviews @stefank , @pchilano, @dcubed-ojdk and @robehn .

/integrate

@openjdk
Copy link

openjdk bot commented Aug 14, 2023

Going to push as commit f142470.
Since your change was applied there have been 25 commits pushed to the master branch:

  • 595fdd3: 8314059: Remove PKCS7.verify()
  • 49b2984: 8313854: Some tests in serviceability area fail on localized Windows platform
  • c132176: 8114830: (fs) Files.copy fails due to interference from something else changing the file system
  • e56d3bc: 8313657: com.sun.jndi.ldap.Connection.cleanup does not close connections on SocketTimeoutErrors
  • 4b2703a: 8313678: SymbolTable can leak Symbols during cleanup
  • f41c267: 8314045: ArithmeticException in GaloisCounterMode
  • 911d1db: 8314078: HotSpotConstantPool.lookupField() asserts due to field changes in ConstantPool.cpp
  • 6574dd7: 8314025: Remove JUnit-based test in java/lang/invoke from problem list
  • 207bd00: 8313756: [BACKOUT] 8308682: Enhance AES performance
  • 823f5b9: 8308850: Change JVM options with small ranges that get -Wconversion warnings to 32 bits
  • ... and 15 more: https://git.openjdk.org/jdk/compare/43462a36ab02b67d426c04d345868bd420b30c25...master

Your commit was automatically rebased without conflicts.

@openjdk openjdk bot added the integrated Pull request has been integrated label Aug 14, 2023
@openjdk openjdk bot closed this Aug 14, 2023
@openjdk openjdk bot removed ready Pull request is ready to be integrated rfr Pull request is ready for review labels Aug 14, 2023
@openjdk
Copy link

openjdk bot commented Aug 14, 2023

@dholmes-ora Pushed as commit f142470.

💡 You may see a message that your pull request was closed with unmerged commits. This can be safely ignored.

@dholmes-ora dholmes-ora deleted the 8311981-handshake-deadlock branch August 15, 2023 00:54
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
hotspot-runtime hotspot-runtime-dev@openjdk.org integrated Pull request has been integrated
6 participants