Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

8253794: TestAbortVMOnSafepointTimeout never timeouts #465

Closed
wants to merge 4 commits into from

Conversation

@robehn
Copy link
Contributor

@robehn robehn commented Oct 1, 2020

The issue is that this test doesn't consider Handshake All operation.
Depending if/when such operation is scheduled it can lockup the VM thread.
And the safepoint that should timeout never happens.
See issue for more information.

So I changed the test to "try timeout" the safepoint, but if there was no safepoint (blocked by a handshake all), we retry.
We sleep unsafe much longer than the interval SafepointALot generates operations, which 'guarantees' we will timeout if there is no handshake all. (some extreme case of kernel scheduling causing a very long context switch could also make us not timeout)

Passes t1, t3, and repeat runs of the test.


Progress

  • Change must not contain extraneous whitespace
  • Commit message must refer to an issue
  • Change must be properly reviewed

Issue

  • JDK-8253794: TestAbortVMOnSafepointTimeout never timeouts

Reviewers

Download

$ git fetch https://git.openjdk.java.net/jdk pull/465/head:pull/465
$ git checkout pull/465

@bridgekeeper
Copy link

@bridgekeeper bridgekeeper bot commented Oct 1, 2020

👋 Welcome back rehn! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

@openjdk
Copy link

@openjdk openjdk bot commented Oct 1, 2020

@robehn The following label will be automatically applied to this pull request:

  • hotspot

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label pull request command.

@openjdk openjdk bot added the hotspot label Oct 1, 2020
@robehn
Copy link
Contributor Author

@robehn robehn commented Oct 1, 2020

/label hotspot-runtime

@openjdk
Copy link

@openjdk openjdk bot commented Oct 1, 2020

@robehn
The hotspot-runtime label was successfully added.

@robehn robehn marked this pull request as ready for review Oct 1, 2020
@openjdk openjdk bot added the rfr label Oct 1, 2020
@mlbridge
Copy link

@mlbridge mlbridge bot commented Oct 1, 2020

Webrevs

Copy link
Contributor

@pchilano pchilano left a comment

LGTM

src/hotspot/share/prims/whitebox.cpp Outdated Show resolved Hide resolved
src/hotspot/share/prims/whitebox.cpp Outdated Show resolved Hide resolved
if (Platform.isWindows()) {
output.shouldMatch("Safepoint sync time longer than");
} else {
output.shouldMatch("SIGILL");
if (Platform.isLinux()) {
output.shouldMatch("(sent by kill)");
}
output.shouldMatch("TestAbortVMOnSafepointTimeout.test_loop");
}
output.shouldNotHaveExitValue(0);
Copy link
Member

@dcubed-ojdk dcubed-ojdk Oct 1, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like the test doesn't require that this mesg get printed:
System.out.println("This message would occur after some time.");

And it is set up to detect that the SafepointTimeout happened
which is what we want the test to verify at the core.

Copy link
Contributor Author

@robehn robehn Oct 2, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The line "This message would occur after some time." should never be print if VM is working.
If the VM fails for some reason and the timeout is not performed, line:
"Timed out while spinning to reach a safepoint." is never printed and the OutputAnalyzer fails the test.
If we did timeout and it was printed we know that we didn't print the other message, since the only thread that can timeout is the one printing that message.

The second part verifies that the SIGILL was delivered.

Copy link
Member

@dcubed-ojdk dcubed-ojdk Oct 5, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, but then this message when you're reading the code is misleading:
System.out.println("This message would occur after some time.");
It should be printing something like:
System.out.println("This message only prints if something is broken.");

Update: Yes, I realize that this is an existing problem, but it's still reads wrong.

Copy link
Contributor Author

@robehn robehn Oct 5, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I removed comment in last update, since it can't be printed.

@dholmes-ora
Copy link
Member

@dholmes-ora dholmes-ora commented Oct 2, 2020

Hi Robbin,

So.... The old test used an "uncounted loop" (based on internal JIT knowledge) to create looping code with no safepoint polls so that it remains safepoint-unsafe (and Patricio had to tweak the test conditions to avoid unexpected safepoints). The new code has a WhiteBox entry that uses an internal naked_sleep which keeps the thread _thread_in_VM IIUC, which is not safepoint-safe, but also potentially different to being _thread_in_Java. But lets just accept the net effect is the same - the thread will prevent a safepoint from being reached until the sleep time has elapsed. If that time is > (GuaranteedSafepointInterval + SafepointTimeoutDelay) then we should see a safepoint timeout and the VM abort. Okay ... so how does that solve the problem the test currently experiences with handshakes ... if we are at a handshake the handshake can't proceed until the sleep time expires, but then when we transition back to Java the thread will see the handshake and so the handshake will proceed. As long as the WB function returns false we will repeat the process, eventually when the expected safepoint is requested we should again trigger the safepoint timeout and abort.

But like Dan I'm unclear how the WB function can ever return true as the safepoint state can't change whilst the thread is in the naked sleep. ??

Aside: rather than using "args.length > 0" to discriminate between the original and subsequent executions of the test class, it can be clearer (IMO) to add a static nested class which has the main method that performs the actual test, and you invoke that via ProcessTools.

That all said, for the record, we really should have a handshake timeout mechanism the same as we have the safepoint timeout mechanism.

Thanks,
David

@robehn
Copy link
Contributor Author

@robehn robehn commented Oct 2, 2020

Hi Robbin,

Hi David,

So.... The old test used an "uncounted loop" (based on internal JIT knowledge) to create looping code with no safepoint polls so that it remains safepoint-unsafe (and Patricio had to tweak the test conditions to avoid unexpected safepoints). The new code has a WhiteBox entry that uses an internal naked_sleep which keeps the thread _thread_in_VM IIUC, which is not safepoint-safe, but also potentially different to being _thread_in_Java. But lets just accept the net effect is the same - the thread will prevent a safepoint from being reached until the sleep time has elapsed. If that time is > (GuaranteedSafepointInterval + SafepointTimeoutDelay) then we should see a safepoint timeout and the VM abort. Okay ... so how does that solve the problem the test currently experiences with handshakes ... if we are at a handshake the handshake can't proceed until the sleep time expires, but then when we transition back to Java the thread will see the handshake and so the handshake will proceed. As long as the WB function returns false we will repeat the process, eventually when the expected safepoint is requested we should again trigger the safepoint timeout and abort.

But like Dan I'm unclear how the WB function can ever return true as the safepoint state can't change whilst the thread is in the naked sleep. ??

It can't return true if the VM is working. So yes the safepoint tracker maybe overkill.

Aside: rather than using "args.length > 0" to discriminate between the original and subsequent executions of the test class, it can be clearer (IMO) to add a static nested class which has the main method that performs the actual test, and you invoke that via ProcessTools.

I didn't change any of that.

That all said, for the record, we really should have a handshake timeout mechanism the same as we have the safepoint timeout mechanism.

We have a timeout mechanism but default off HandshakeTimeout.
But it doesn't fire SIGILL to troubled thread as safepoint does.

Thanks, Robbin

Thanks,
David

@robehn
Copy link
Contributor Author

@robehn robehn commented Oct 2, 2020

LGTM

Thanks! There is an update, please consider.

Copy link
Contributor

@TheRealMDoerr TheRealMDoerr left a comment

Thanks for reimplementing it to resolve problems with handshake all operations.

@pchilano
Copy link
Contributor

@pchilano pchilano commented Oct 5, 2020

LGTM

Thanks! There is an update, please consider.

Still looks good.

Copy link
Contributor

@TheRealMDoerr TheRealMDoerr left a comment

Looks good to me. Thanks.

@openjdk
Copy link

@openjdk openjdk bot commented Oct 5, 2020

@robehn This change now passes all automated pre-integration checks.

ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for more details.

After integration, the commit message for the final commit will be:

8253794: TestAbortVMOnSafepointTimeout never timeouts

Reviewed-by: pchilanomate, dcubed, mdoerr

You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed.

At the time when this comment was updated there had been 10 new commits pushed to the master branch:

  • f2f77f7: 8253761: Wrong URI syntax printed by jar --describe-module
  • b29e108: 8253944: Certain method references to VarHandle methods should fail
  • 88d75c9: 8156071: List.of: reduce array copying during creation
  • ea27a54: 8224509: Incorrect alignment in CDS related allocation code on 32-bit platforms
  • 4d29116: 8253433: Remove -XX:+Debugging product option
  • 81dae70: 8253948: Memory leak in ImageFileReader
  • 65cab55: 8253971: ZGC: Flush mark stacks after processing concurrent roots
  • 19219a9: 8253960: Memory leak in Java_java_lang_ClassLoader_defineClass0()
  • 5d4a135: 8253842: [JVMCI] Allow implicit exception to dispatch to other address in jvmci compilers.
  • 289ae79: 8252998: ModuleWrapper.gmk doesn't consult include path

Please see this link for an up-to-date comparison between the source branch of this pull request and the master branch.
As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid this automatic rebasing, please check the documentation for the /integrate command for further details.

➡️ To integrate this PR with the above commit message to the master branch, type /integrate in a new comment.

@openjdk openjdk bot added the ready label Oct 5, 2020
@@ -74,6 +74,7 @@
#include "runtime/javaCalls.hpp"
#include "runtime/jniHandles.inline.hpp"
#include "runtime/os.hpp"
#include "runtime/safepoint.hpp"
Copy link
Member

@dcubed-ojdk dcubed-ojdk Oct 5, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think you need this include change anymore.

Copy link
Contributor Author

@robehn robehn Oct 5, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

if (Platform.isWindows()) {
output.shouldMatch("Safepoint sync time longer than");
} else {
output.shouldMatch("SIGILL");
if (Platform.isLinux()) {
output.shouldMatch("(sent by kill)");
}
output.shouldMatch("TestAbortVMOnSafepointTimeout.test_loop");
}
output.shouldNotHaveExitValue(0);
Copy link
Member

@dcubed-ojdk dcubed-ojdk Oct 5, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, but then this message when you're reading the code is misleading:
System.out.println("This message would occur after some time.");
It should be printing something like:
System.out.println("This message only prints if something is broken.");

Update: Yes, I realize that this is an existing problem, but it's still reads wrong.

public static void main(String[] args) throws Exception {
Integer waitTime = Integer.parseInt(args[0]);
WhiteBox wb = WhiteBox.getWhiteBox();
// While no safepoint timeout.
Copy link
Member

@dcubed-ojdk dcubed-ojdk Oct 5, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps: // Loop here to cause a safepoint timeout.

Copy link
Contributor Author

@robehn robehn Oct 5, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

@dcubed-ojdk
Copy link
Member

@dcubed-ojdk dcubed-ojdk commented Oct 5, 2020

Robbin replied:

David H. wrote:

That all said, for the record, we really should have a handshake timeout mechanism the same as we have the safepoint timeout mechanism.

We have a timeout mechanism but default off HandshakeTimeout.
But it doesn't fire SIGILL to troubled thread as safepoint does.

What's the conclusion here? Are there going to be changes to the
test to use the HandshakeTimeout option? Should the test have
failed in a different way than it did?

Copy link
Contributor Author

@robehn robehn left a comment

Pushing the small update in a minute.

@@ -74,6 +74,7 @@
#include "runtime/javaCalls.hpp"
#include "runtime/jniHandles.inline.hpp"
#include "runtime/os.hpp"
#include "runtime/safepoint.hpp"
Copy link
Contributor Author

@robehn robehn Oct 5, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

if (Platform.isWindows()) {
output.shouldMatch("Safepoint sync time longer than");
} else {
output.shouldMatch("SIGILL");
if (Platform.isLinux()) {
output.shouldMatch("(sent by kill)");
}
output.shouldMatch("TestAbortVMOnSafepointTimeout.test_loop");
}
output.shouldNotHaveExitValue(0);
Copy link
Contributor Author

@robehn robehn Oct 5, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I removed comment in last update, since it can't be printed.

public static void main(String[] args) throws Exception {
Integer waitTime = Integer.parseInt(args[0]);
WhiteBox wb = WhiteBox.getWhiteBox();
// While no safepoint timeout.
Copy link
Contributor Author

@robehn robehn Oct 5, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

@robehn
Copy link
Contributor Author

@robehn robehn commented Oct 5, 2020

Robbin replied:

David H. wrote:

That all said, for the record, we really should have a handshake timeout mechanism the same as we have the safepoint timeout mechanism.

We have a timeout mechanism but default off HandshakeTimeout.
But it doesn't fire SIGILL to troubled thread as safepoint does.

What's the conclusion here? Are there going to be changes to the
test to use the HandshakeTimeout option? Should the test have
failed in a different way than it did?

In https://bugs.openjdk.java.net/browse/JDK-8198730 I'm have been looking into setting these (safepoint and handshake timeout ) to default 1 second. There were some impediments which now seems to have been resolved.
If any of these operations takes longer you want to know it because A: you have a bug, or B: you have performance problems.
So I don't think the default value is what any user wants.

Copy link
Member

@dcubed-ojdk dcubed-ojdk left a comment

Thumbs up.

@robehn
Copy link
Contributor Author

@robehn robehn commented Oct 5, 2020

Thanks for review @pchilano, @dcubed-ojdk, @TheRealMDoerr.

Update was trivial so integrating in a bit.

@robehn
Copy link
Contributor Author

@robehn robehn commented Oct 5, 2020

/integrate

@openjdk openjdk bot closed this Oct 5, 2020
@openjdk openjdk bot added integrated and removed ready rfr labels Oct 5, 2020
@openjdk
Copy link

@openjdk openjdk bot commented Oct 5, 2020

@robehn Since your change was applied there have been 10 commits pushed to the master branch:

  • f2f77f7: 8253761: Wrong URI syntax printed by jar --describe-module
  • b29e108: 8253944: Certain method references to VarHandle methods should fail
  • 88d75c9: 8156071: List.of: reduce array copying during creation
  • ea27a54: 8224509: Incorrect alignment in CDS related allocation code on 32-bit platforms
  • 4d29116: 8253433: Remove -XX:+Debugging product option
  • 81dae70: 8253948: Memory leak in ImageFileReader
  • 65cab55: 8253971: ZGC: Flush mark stacks after processing concurrent roots
  • 19219a9: 8253960: Memory leak in Java_java_lang_ClassLoader_defineClass0()
  • 5d4a135: 8253842: [JVMCI] Allow implicit exception to dispatch to other address in jvmci compilers.
  • 289ae79: 8252998: ModuleWrapper.gmk doesn't consult include path

Your commit was automatically rebased without conflicts.

Pushed as commit c9d0407.

💡 You may see a message that your pull request was closed with unmerged commits. This can be safely ignored.

@robehn robehn deleted the 8253794 branch Oct 5, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
5 participants