8253794: TestAbortVMOnSafepointTimeout never timeouts #465

robehn · 2020-10-01T14:35:45Z

The issue is that this test doesn't consider Handshake All operation.
Depending if/when such operation is scheduled it can lockup the VM thread.
And the safepoint that should timeout never happens.
See issue for more information.

So I changed the test to "try timeout" the safepoint, but if there was no safepoint (blocked by a handshake all), we retry.
We sleep unsafe much longer than the interval SafepointALot generates operations, which 'guarantees' we will timeout if there is no handshake all. (some extreme case of kernel scheduling causing a very long context switch could also make us not timeout)

Passes t1, t3, and repeat runs of the test.

Progress

Change must not contain extraneous whitespace
Commit message must refer to an issue
Change must be properly reviewed

Issue

JDK-8253794: TestAbortVMOnSafepointTimeout never timeouts

Reviewers

Patricio Chilano Mateo (@pchilano - Committer) ⚠️ Review applies to 13b4033
Daniel D. Daugherty (@dcubed-ojdk - Reviewer)
Martin Doerr (@TheRealMDoerr - Reviewer) ⚠️ Review applies to 90fb310

Download

$ git fetch https://git.openjdk.java.net/jdk pull/465/head:pull/465
$ git checkout pull/465

bridgekeeper · 2020-10-01T14:37:56Z

👋 Welcome back rehn! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

openjdk · 2020-10-01T14:38:58Z

@robehn The following label will be automatically applied to this pull request:

hotspot

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label pull request command.

robehn · 2020-10-01T15:32:47Z

/label hotspot-runtime

openjdk · 2020-10-01T15:33:13Z

@robehn
The hotspot-runtime label was successfully added.

mlbridge · 2020-10-01T17:10:12Z

Webrevs

pchilano

LGTM

test/hotspot/jtreg/runtime/Safepoint/TestAbortVMOnSafepointTimeout.java

src/hotspot/share/prims/whitebox.cpp

test/hotspot/jtreg/runtime/Safepoint/TestAbortVMOnSafepointTimeout.java

dcubed-ojdk · 2020-10-01T20:06:59Z

test/hotspot/jtreg/runtime/Safepoint/TestAbortVMOnSafepointTimeout.java

        if (Platform.isWindows()) {
            output.shouldMatch("Safepoint sync time longer than");
        } else {
            output.shouldMatch("SIGILL");
            if (Platform.isLinux()) {
                output.shouldMatch("(sent by kill)");
            }
-            output.shouldMatch("TestAbortVMOnSafepointTimeout.test_loop");
        }
        output.shouldNotHaveExitValue(0);


Looks like the test doesn't require that this mesg get printed:
System.out.println("This message would occur after some time.");

And it is set up to detect that the SafepointTimeout happened
which is what we want the test to verify at the core.

The line "This message would occur after some time." should never be print if VM is working.
If the VM fails for some reason and the timeout is not performed, line:
"Timed out while spinning to reach a safepoint." is never printed and the OutputAnalyzer fails the test.
If we did timeout and it was printed we know that we didn't print the other message, since the only thread that can timeout is the one printing that message.

The second part verifies that the SIGILL was delivered.

Okay, but then this message when you're reading the code is misleading:
System.out.println("This message would occur after some time.");
It should be printing something like:
System.out.println("This message only prints if something is broken.");

Update: Yes, I realize that this is an existing problem, but it's still reads wrong.

I removed comment in last update, since it can't be printed.

dholmes-ora · 2020-10-02T04:24:44Z

Hi Robbin,

So.... The old test used an "uncounted loop" (based on internal JIT knowledge) to create looping code with no safepoint polls so that it remains safepoint-unsafe (and Patricio had to tweak the test conditions to avoid unexpected safepoints). The new code has a WhiteBox entry that uses an internal naked_sleep which keeps the thread _thread_in_VM IIUC, which is not safepoint-safe, but also potentially different to being _thread_in_Java. But lets just accept the net effect is the same - the thread will prevent a safepoint from being reached until the sleep time has elapsed. If that time is > (GuaranteedSafepointInterval + SafepointTimeoutDelay) then we should see a safepoint timeout and the VM abort. Okay ... so how does that solve the problem the test currently experiences with handshakes ... if we are at a handshake the handshake can't proceed until the sleep time expires, but then when we transition back to Java the thread will see the handshake and so the handshake will proceed. As long as the WB function returns false we will repeat the process, eventually when the expected safepoint is requested we should again trigger the safepoint timeout and abort.

But like Dan I'm unclear how the WB function can ever return true as the safepoint state can't change whilst the thread is in the naked sleep. ??

Aside: rather than using "args.length > 0" to discriminate between the original and subsequent executions of the test class, it can be clearer (IMO) to add a static nested class which has the main method that performs the actual test, and you invoke that via ProcessTools.

That all said, for the record, we really should have a handshake timeout mechanism the same as we have the safepoint timeout mechanism.

Thanks,
David

robehn · 2020-10-02T07:15:55Z

Hi Robbin,

Hi David,

So.... The old test used an "uncounted loop" (based on internal JIT knowledge) to create looping code with no safepoint polls so that it remains safepoint-unsafe (and Patricio had to tweak the test conditions to avoid unexpected safepoints). The new code has a WhiteBox entry that uses an internal naked_sleep which keeps the thread _thread_in_VM IIUC, which is not safepoint-safe, but also potentially different to being _thread_in_Java. But lets just accept the net effect is the same - the thread will prevent a safepoint from being reached until the sleep time has elapsed. If that time is > (GuaranteedSafepointInterval + SafepointTimeoutDelay) then we should see a safepoint timeout and the VM abort. Okay ... so how does that solve the problem the test currently experiences with handshakes ... if we are at a handshake the handshake can't proceed until the sleep time expires, but then when we transition back to Java the thread will see the handshake and so the handshake will proceed. As long as the WB function returns false we will repeat the process, eventually when the expected safepoint is requested we should again trigger the safepoint timeout and abort.

But like Dan I'm unclear how the WB function can ever return true as the safepoint state can't change whilst the thread is in the naked sleep. ??

It can't return true if the VM is working. So yes the safepoint tracker maybe overkill.

Aside: rather than using "args.length > 0" to discriminate between the original and subsequent executions of the test class, it can be clearer (IMO) to add a static nested class which has the main method that performs the actual test, and you invoke that via ProcessTools.

I didn't change any of that.

That all said, for the record, we really should have a handshake timeout mechanism the same as we have the safepoint timeout mechanism.

We have a timeout mechanism but default off HandshakeTimeout.
But it doesn't fire SIGILL to troubled thread as safepoint does.

Thanks, Robbin

Thanks,
David

robehn · 2020-10-02T12:06:31Z

LGTM

Thanks! There is an update, please consider.

TheRealMDoerr

Thanks for reimplementing it to resolve problems with handshake all operations.

test/hotspot/jtreg/runtime/Safepoint/TestAbortVMOnSafepointTimeout.java

pchilano · 2020-10-05T14:00:52Z

LGTM

Thanks! There is an update, please consider.

Still looks good.

TheRealMDoerr

Looks good to me. Thanks.

test/hotspot/jtreg/runtime/Safepoint/TestAbortVMOnSafepointTimeout.java

openjdk · 2020-10-05T14:45:15Z

@robehn This change now passes all automated pre-integration checks.

ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for more details.

After integration, the commit message for the final commit will be:

8253794: TestAbortVMOnSafepointTimeout never timeouts

Reviewed-by: pchilanomate, dcubed, mdoerr

You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed.

At the time when this comment was updated there had been 10 new commits pushed to the master branch:

f2f77f7: 8253761: Wrong URI syntax printed by jar --describe-module
b29e108: 8253944: Certain method references to VarHandle methods should fail
88d75c9: 8156071: List.of: reduce array copying during creation
ea27a54: 8224509: Incorrect alignment in CDS related allocation code on 32-bit platforms
4d29116: 8253433: Remove -XX:+Debugging product option
81dae70: 8253948: Memory leak in ImageFileReader
65cab55: 8253971: ZGC: Flush mark stacks after processing concurrent roots
19219a9: 8253960: Memory leak in Java_java_lang_ClassLoader_defineClass0()
5d4a135: 8253842: [JVMCI] Allow implicit exception to dispatch to other address in jvmci compilers.
289ae79: 8252998: ModuleWrapper.gmk doesn't consult include path

Please see this link for an up-to-date comparison between the source branch of this pull request and the master branch.
As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid this automatic rebasing, please check the documentation for the /integrate command for further details.

➡️ To integrate this PR with the above commit message to the master branch, type /integrate in a new comment.

dcubed-ojdk · 2020-10-05T15:14:16Z

src/hotspot/share/prims/whitebox.cpp

@@ -74,6 +74,7 @@
 #include "runtime/javaCalls.hpp"
 #include "runtime/jniHandles.inline.hpp"
 #include "runtime/os.hpp"
+#include "runtime/safepoint.hpp"


I don't think you need this include change anymore.

test/hotspot/jtreg/runtime/Safepoint/TestAbortVMOnSafepointTimeout.java

dcubed-ojdk · 2020-10-05T15:31:01Z

test/hotspot/jtreg/runtime/Safepoint/TestAbortVMOnSafepointTimeout.java

        if (Platform.isWindows()) {
            output.shouldMatch("Safepoint sync time longer than");
        } else {
            output.shouldMatch("SIGILL");
            if (Platform.isLinux()) {
                output.shouldMatch("(sent by kill)");
            }
-            output.shouldMatch("TestAbortVMOnSafepointTimeout.test_loop");
        }
        output.shouldNotHaveExitValue(0);


Okay, but then this message when you're reading the code is misleading:
System.out.println("This message would occur after some time.");
It should be printing something like:
System.out.println("This message only prints if something is broken.");

Update: Yes, I realize that this is an existing problem, but it's still reads wrong.

dcubed-ojdk · 2020-10-05T15:31:58Z

test/hotspot/jtreg/runtime/Safepoint/TestAbortVMOnSafepointTimeout.java

+        public static void main(String[] args) throws Exception {
+            Integer waitTime = Integer.parseInt(args[0]);
+            WhiteBox wb = WhiteBox.getWhiteBox();
+            // While no safepoint timeout.


Perhaps: // Loop here to cause a safepoint timeout.

dcubed-ojdk · 2020-10-05T15:44:55Z

Robbin replied:

David H. wrote:

That all said, for the record, we really should have a handshake timeout mechanism the same as we have the safepoint timeout mechanism.

We have a timeout mechanism but default off HandshakeTimeout.
But it doesn't fire SIGILL to troubled thread as safepoint does.

What's the conclusion here? Are there going to be changes to the
test to use the HandshakeTimeout option? Should the test have
failed in a different way than it did?

robehn

Pushing the small update in a minute.

robehn · 2020-10-05T18:15:57Z

src/hotspot/share/prims/whitebox.cpp

@@ -74,6 +74,7 @@
 #include "runtime/javaCalls.hpp"
 #include "runtime/jniHandles.inline.hpp"
 #include "runtime/os.hpp"
+#include "runtime/safepoint.hpp"


robehn · 2020-10-05T18:18:43Z

test/hotspot/jtreg/runtime/Safepoint/TestAbortVMOnSafepointTimeout.java

        if (Platform.isWindows()) {
            output.shouldMatch("Safepoint sync time longer than");
        } else {
            output.shouldMatch("SIGILL");
            if (Platform.isLinux()) {
                output.shouldMatch("(sent by kill)");
            }
-            output.shouldMatch("TestAbortVMOnSafepointTimeout.test_loop");
        }
        output.shouldNotHaveExitValue(0);


I removed comment in last update, since it can't be printed.

robehn · 2020-10-05T18:19:27Z

test/hotspot/jtreg/runtime/Safepoint/TestAbortVMOnSafepointTimeout.java

+        public static void main(String[] args) throws Exception {
+            Integer waitTime = Integer.parseInt(args[0]);
+            WhiteBox wb = WhiteBox.getWhiteBox();
+            // While no safepoint timeout.


robehn · 2020-10-05T18:27:47Z

Robbin replied:

David H. wrote:

That all said, for the record, we really should have a handshake timeout mechanism the same as we have the safepoint timeout mechanism.

We have a timeout mechanism but default off HandshakeTimeout.
But it doesn't fire SIGILL to troubled thread as safepoint does.

What's the conclusion here? Are there going to be changes to the
test to use the HandshakeTimeout option? Should the test have
failed in a different way than it did?

In https://bugs.openjdk.java.net/browse/JDK-8198730 I'm have been looking into setting these (safepoint and handshake timeout ) to default 1 second. There were some impediments which now seems to have been resolved.
If any of these operations takes longer you want to know it because A: you have a bug, or B: you have performance problems.
So I don't think the default value is what any user wants.

dcubed-ojdk

Thumbs up.

robehn · 2020-10-05T18:48:21Z

Thanks for review @pchilano, @dcubed-ojdk, @TheRealMDoerr.

Update was trivial so integrating in a bit.

robehn · 2020-10-05T19:15:31Z

/integrate

openjdk · 2020-10-05T19:16:29Z

@robehn Since your change was applied there have been 10 commits pushed to the master branch:

f2f77f7: 8253761: Wrong URI syntax printed by jar --describe-module
b29e108: 8253944: Certain method references to VarHandle methods should fail
88d75c9: 8156071: List.of: reduce array copying during creation
ea27a54: 8224509: Incorrect alignment in CDS related allocation code on 32-bit platforms
4d29116: 8253433: Remove -XX:+Debugging product option
81dae70: 8253948: Memory leak in ImageFileReader
65cab55: 8253971: ZGC: Flush mark stacks after processing concurrent roots
19219a9: 8253960: Memory leak in Java_java_lang_ClassLoader_defineClass0()
5d4a135: 8253842: [JVMCI] Allow implicit exception to dispatch to other address in jvmci compilers.
289ae79: 8252998: ModuleWrapper.gmk doesn't consult include path

Your commit was automatically rebased without conflicts.

Pushed as commit c9d0407.

💡 You may see a message that your pull request was closed with unmerged commits. This can be safely ignored.

Fixed test

13b4033

openjdk bot added the hotspot hotspot-dev@openjdk.org label Oct 1, 2020

openjdk bot added the hotspot-runtime hotspot-runtime-dev@openjdk.org label Oct 1, 2020

robehn marked this pull request as ready for review October 1, 2020 17:06

openjdk bot added the rfr Pull request is ready for review label Oct 1, 2020

pchilano approved these changes Oct 1, 2020

View reviewed changes

dcubed-ojdk suggested changes Oct 1, 2020

View reviewed changes

Update with input from reviews

90fb310

Merge branch 'master' into 8253794

2561908

TheRealMDoerr reviewed Oct 5, 2020

View reviewed changes

test/hotspot/jtreg/runtime/Safepoint/TestAbortVMOnSafepointTimeout.java Show resolved Hide resolved

TheRealMDoerr approved these changes Oct 5, 2020

View reviewed changes

test/hotspot/jtreg/runtime/Safepoint/TestAbortVMOnSafepointTimeout.java Show resolved Hide resolved

openjdk bot added the ready Pull request is ready to be integrated label Oct 5, 2020

dcubed-ojdk suggested changes Oct 5, 2020

View reviewed changes

robehn commented Oct 5, 2020

View reviewed changes

Fixed include and comment

b63b6a0

dcubed-ojdk approved these changes Oct 5, 2020

View reviewed changes

openjdk bot closed this Oct 5, 2020

openjdk bot added integrated Pull request has been integrated and removed ready Pull request is ready to be integrated rfr Pull request is ready for review labels Oct 5, 2020

robehn deleted the 8253794 branch October 5, 2020 19:18

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

8253794: TestAbortVMOnSafepointTimeout never timeouts #465

8253794: TestAbortVMOnSafepointTimeout never timeouts #465

robehn commented Oct 1, 2020 •

edited by openjdk bot

bridgekeeper bot commented Oct 1, 2020

openjdk bot commented Oct 1, 2020

robehn commented Oct 1, 2020

openjdk bot commented Oct 1, 2020

mlbridge bot commented Oct 1, 2020 •

edited

pchilano left a comment

dcubed-ojdk Oct 1, 2020

robehn Oct 2, 2020

dcubed-ojdk Oct 5, 2020 •

edited

robehn Oct 5, 2020

dholmes-ora commented Oct 2, 2020

robehn commented Oct 2, 2020

robehn commented Oct 2, 2020

TheRealMDoerr left a comment

pchilano commented Oct 5, 2020

TheRealMDoerr left a comment

openjdk bot commented Oct 5, 2020 •

edited

dcubed-ojdk Oct 5, 2020

robehn Oct 5, 2020

dcubed-ojdk Oct 5, 2020 •

edited

dcubed-ojdk Oct 5, 2020

robehn Oct 5, 2020

dcubed-ojdk commented Oct 5, 2020

robehn left a comment

robehn Oct 5, 2020

robehn Oct 5, 2020

robehn Oct 5, 2020

robehn commented Oct 5, 2020 •

edited

dcubed-ojdk left a comment

robehn commented Oct 5, 2020

robehn commented Oct 5, 2020

openjdk bot commented Oct 5, 2020

8253794: TestAbortVMOnSafepointTimeout never timeouts #465

8253794: TestAbortVMOnSafepointTimeout never timeouts #465

Conversation

robehn commented Oct 1, 2020 • edited by openjdk bot

Progress

Issue

Reviewers

Download

bridgekeeper bot commented Oct 1, 2020

openjdk bot commented Oct 1, 2020

robehn commented Oct 1, 2020

openjdk bot commented Oct 1, 2020

mlbridge bot commented Oct 1, 2020 • edited

Webrevs

pchilano left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

dcubed-ojdk Oct 5, 2020 • edited

Choose a reason for hiding this comment

Choose a reason for hiding this comment

dholmes-ora commented Oct 2, 2020

robehn commented Oct 2, 2020

robehn commented Oct 2, 2020

TheRealMDoerr left a comment

Choose a reason for hiding this comment

pchilano commented Oct 5, 2020

TheRealMDoerr left a comment

Choose a reason for hiding this comment

openjdk bot commented Oct 5, 2020 • edited

Choose a reason for hiding this comment

Choose a reason for hiding this comment

dcubed-ojdk Oct 5, 2020 • edited

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

dcubed-ojdk commented Oct 5, 2020

robehn left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

robehn commented Oct 5, 2020 • edited

dcubed-ojdk left a comment

Choose a reason for hiding this comment

robehn commented Oct 5, 2020

robehn commented Oct 5, 2020

openjdk bot commented Oct 5, 2020

robehn commented Oct 1, 2020 •

edited by openjdk bot

mlbridge bot commented Oct 1, 2020 •

edited

dcubed-ojdk Oct 5, 2020 •

edited

openjdk bot commented Oct 5, 2020 •

edited

dcubed-ojdk Oct 5, 2020 •

edited

robehn commented Oct 5, 2020 •

edited