8253794: TestAbortVMOnSafepointTimeout never timeouts #465
👋 Welcome back rehn! A progress list of the required criteria for merging this PR into |
/label hotspot-runtime |
@robehn |
LGTM
test/hotspot/jtreg/runtime/Safepoint/TestAbortVMOnSafepointTimeout.java
    if (Platform.isWindows()) {
        output.shouldMatch("Safepoint sync time longer than");
    } else {
        output.shouldMatch("SIGILL");
        if (Platform.isLinux()) {
            output.shouldMatch("(sent by kill)");
        }
        output.shouldMatch("TestAbortVMOnSafepointTimeout.test_loop");
    }
    output.shouldNotHaveExitValue(0);
Looks like the test doesn't require that this message gets printed:
System.out.println("This message would occur after some time.");
And it is set up to detect that the SafepointTimeout happened,
which is what we want the test to verify at its core.
The line "This message would occur after some time." should never be printed if the VM is working.
If the VM fails for some reason and the timeout does not happen, the line
"Timed out while spinning to reach a safepoint." is never printed and the OutputAnalyzer fails the test.
If we did time out and it was printed, we know that we didn't print the other message, since the only thread that can time out is the one printing that message.
The second part verifies that the SIGILL was delivered.
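As a rough illustration of the verification logic described in this comment, here is a standalone sketch. It does not use the jtreg OutputAnalyzer class; `AbortOutputCheck` and `looksLikeTimeoutAbort` are hypothetical names, and only a subset of the checks from the diff is modelled with plain regex searches.

```java
import java.util.regex.Pattern;

// Standalone model of the output checks; not the jtreg OutputAnalyzer API.
final class AbortOutputCheck {

    // True if the regex is found anywhere in the captured output.
    static boolean found(String regex, String output) {
        return Pattern.compile(regex, Pattern.MULTILINE).matcher(output).find();
    }

    // Hypothetical helper: true when the child VM's output and exit value
    // look like a safepoint timeout followed by a VM abort.
    static boolean looksLikeTimeoutAbort(String output, int exitValue,
                                         boolean isWindows, boolean isLinux) {
        // First part: the timeout itself must have been reported.
        if (!found("Timed out while spinning to reach a safepoint\\.", output)) {
            return false;
        }
        // Second part: evidence that the abort was delivered.
        if (isWindows) {
            if (!found("Safepoint sync time longer than", output)) {
                return false;
            }
        } else {
            if (!found("SIGILL", output)) {
                return false;
            }
            // Parentheses are escaped here so that they match literally.
            if (isLinux && !found("\\(sent by kill\\)", output)) {
                return false;
            }
        }
        // An aborted VM must not exit with 0.
        return exitValue != 0;
    }
}
```

The point of the ordering is that the timeout line is checked first: if it is missing, nothing else in the output matters and the test should fail.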
Okay, but then this message is misleading when you're reading the code:
System.out.println("This message would occur after some time.");
It should be printing something like:
System.out.println("This message only prints if something is broken.");
Update: Yes, I realize that this is an existing problem, but it still reads wrong.
I removed the comment in the last update, since it can't be printed.
Hi Robbin,

So.... The old test used an "uncounted loop" (based on internal JIT knowledge) to create looping code with no safepoint polls so that it remains safepoint-unsafe (and Patricio had to tweak the test conditions to avoid unexpected safepoints). The new code has a WhiteBox entry that uses an internal naked_sleep, which keeps the thread _thread_in_VM IIUC. That state is not safepoint-safe, but it is also potentially different from being _thread_in_Java. But let's just accept that the net effect is the same: the thread will prevent a safepoint from being reached until the sleep time has elapsed. If that time is > (GuaranteedSafepointInterval + SafepointTimeoutDelay) then we should see a safepoint timeout and the VM abort.

Okay ... so how does that solve the problem the test currently experiences with handshakes? If we are at a handshake, the handshake can't proceed until the sleep time expires, but then when we transition back to Java the thread will see the handshake and so the handshake will proceed. As long as the WB function returns false we will repeat the process; eventually, when the expected safepoint is requested, we should again trigger the safepoint timeout and abort. But like Dan, I'm unclear how the WB function can ever return true, as the safepoint state can't change whilst the thread is in the naked sleep. ??

Aside: rather than using "args.length > 0" to discriminate between the original and subsequent executions of the test class, it can be clearer (IMO) to add a static nested class which has the main method that performs the actual test, and you invoke that via ProcessTools.

That all said, for the record, we really should have a handshake timeout mechanism the same as we have the safepoint timeout mechanism.

Thanks,
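The aside about a static nested class could look roughly like the sketch below. All names here (`TimeoutTestLauncher`, `Child`, `childCommand`) are hypothetical, and the sketch only builds the child command line with plain JDK classes; a real jtreg test would pass the equivalent arguments to ProcessTools instead.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the "static nested class" layout; all names are hypothetical.
final class TimeoutTestLauncher {

    // The child VM runs this main, so no args.length check is needed to
    // tell the driver process and the child process apart.
    static final class Child {
        public static void main(String[] args) {
            int waitTimeMs = Integer.parseInt(args[0]);
            System.out.println("child would block for " + waitTimeMs + " ms");
        }
    }

    // Builds the command line that starts the nested class in a new VM.
    static List<String> childCommand(String javaExe, String classPath, int waitTimeMs) {
        List<String> cmd = new ArrayList<>();
        cmd.add(javaExe);
        cmd.add("-cp");
        cmd.add(classPath);
        // Binary name of the nested class; it ends in TimeoutTestLauncher$Child.
        cmd.add(Child.class.getName());
        cmd.add(Integer.toString(waitTimeMs));
        return cmd;
    }
}
```

With this layout the driver's main only launches and verifies, while the nested main only performs the blocking part, so each entry point has a single job.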
Hi David,
It can't return true if the VM is working. So yes, the safepoint tracker may be overkill.
I didn't change any of that.
We do have a timeout mechanism, HandshakeTimeout, but it is off by default. Thanks, Robbin
Thanks! There is an update, please consider.
Thanks for reimplementing it to resolve problems with handshake all operations.
Still looks good.
Looks good to me. Thanks.
@robehn This change now passes all automated pre-integration checks. ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for more details. After integration, the commit message for the final commit will be:
You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed. At the time when this comment was updated there had been 10 new commits pushed to the
Please see this link for an up-to-date comparison between the source branch of this pull request and the ➡️ To integrate this PR with the above commit message to the |
src/hotspot/share/prims/whitebox.cpp
@@ -74,6 +74,7 @@
    #include "runtime/javaCalls.hpp"
    #include "runtime/jniHandles.inline.hpp"
    #include "runtime/os.hpp"
    #include "runtime/safepoint.hpp"
I don't think you need this include change anymore.
Fixed
    public static void main(String[] args) throws Exception {
        Integer waitTime = Integer.parseInt(args[0]);
        WhiteBox wb = WhiteBox.getWhiteBox();
        // While no safepoint timeout.
Perhaps: // Loop here to cause a safepoint timeout.
Fixed
Robbin replied:
What's the conclusion here? Are there going to be changes to the |
Pushing the small update in a minute.
In https://bugs.openjdk.java.net/browse/JDK-8198730 I have been looking into setting these (safepoint and handshake timeout) to a default of 1 second. There were some impediments, which now seem to have been resolved.
Thumbs up.
Thanks for the review, @pchilano, @dcubed-ojdk, @TheRealMDoerr. The update was trivial, so I'm integrating in a bit.
/integrate
@robehn Since your change was applied there have been 10 commits pushed to the
Your commit was automatically rebased without conflicts. Pushed as commit c9d0407. 💡 You may see a message that your pull request was closed with unmerged commits. This can be safely ignored. |
The issue is that this test doesn't account for Handshake All operations.
Depending on if and when such an operation is scheduled, it can lock up the VM thread,
and the safepoint that should time out never happens.
See the issue for more information.
So I changed the test to "try to time out" the safepoint; if there was no safepoint (blocked by a handshake all), we retry.
We sleep unsafe for much longer than the interval at which SafepointALot generates operations, which 'guarantees' that we will time out if there is no handshake all. (An extreme case of kernel scheduling causing a very long context switch could also prevent the timeout.)
Passes t1, t3, and repeated runs of the test.
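To make the retry idea concrete, here is a small simulation of it. It does not call WhiteBox or any VM internals: `RetryUntilTimeout`, `Attempt`, and both methods are hypothetical, and the "long enough" condition is the one raised in review (sleep time greater than GuaranteedSafepointInterval plus SafepointTimeoutDelay).

```java
import java.util.List;

// Simulation only; no WhiteBox or VM calls. All names are hypothetical.
final class RetryUntilTimeout {

    // What happened while the thread was sleeping safepoint-unsafe.
    enum Attempt { BLOCKED_BY_HANDSHAKE, SAFEPOINT_REQUESTED }

    // The sleep can only force a timeout if it outlasts the interval at
    // which safepoints are requested plus the timeout delay.
    static boolean longEnough(long sleepMs, long intervalMs, long timeoutDelayMs) {
        return sleepMs > intervalMs + timeoutDelayMs;
    }

    // Returns the 1-based attempt on which the simulated VM would abort
    // with a safepoint timeout, or -1 if that never happens.
    static int abortingAttempt(List<Attempt> attempts, long sleepMs,
                               long intervalMs, long timeoutDelayMs) {
        if (!longEnough(sleepMs, intervalMs, timeoutDelayMs)) {
            return -1; // the sleep is too short to trigger a timeout
        }
        for (int i = 0; i < attempts.size(); i++) {
            // A handshake-only attempt produces no safepoint, so we retry.
            if (attempts.get(i) == Attempt.SAFEPOINT_REQUESTED) {
                return i + 1;
            }
        }
        return -1;
    }
}
```

For example, two attempts blocked by a handshake all followed by one with a real safepoint request abort on the third attempt, provided the sleep is long enough.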
Download
$ git fetch https://git.openjdk.java.net/jdk pull/465/head:pull/465
$ git checkout pull/465