Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

8265261: java/nio/file/Files/InterruptCopy.java fails with java.lang.RuntimeException: Copy was not interrupted #5154

Closed
wants to merge 5 commits into from

Conversation

bplb
Copy link
Member

@bplb bplb commented Aug 18, 2021

This proposal suggests to change the timing of testing whether a file copy is terminated by an interrupt.


Progress

  • Change must not contain extraneous whitespace
  • Commit message must refer to an issue
  • Change must be properly reviewed

Issue

  • JDK-8265261: java/nio/file/Files/InterruptCopy.java fails with java.lang.RuntimeException: Copy was not interrupted

Reviewers

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.java.net/jdk pull/5154/head:pull/5154
$ git checkout pull/5154

Update a local copy of the PR:
$ git checkout pull/5154
$ git pull https://git.openjdk.java.net/jdk pull/5154/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 5154

View PR using the GUI difftool:
$ git pr show -t 5154

Using diff file

Download this PR as a diff file:
https://git.openjdk.java.net/jdk/pull/5154.diff

@bridgekeeper
Copy link

bridgekeeper bot commented Aug 18, 2021

👋 Welcome back bpb! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

@openjdk openjdk bot added the rfr Pull request is ready for review label Aug 18, 2021
@openjdk
Copy link

openjdk bot commented Aug 18, 2021

@bplb The following label will be automatically applied to this pull request:

  • nio

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label pull request command.

@openjdk openjdk bot added the nio nio-dev@openjdk.org label Aug 18, 2021
@mlbridge
Copy link

mlbridge bot commented Aug 18, 2021

Webrevs

Copy link
Member

@dfuch dfuch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@openjdk
Copy link

openjdk bot commented Aug 18, 2021

@bplb This change now passes all automated pre-integration checks.

ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details.

After integration, the commit message for the final commit will be:

8265261: java/nio/file/Files/InterruptCopy.java fails with java.lang.RuntimeException: Copy was not interrupted

Reviewed-by: dfuchs

You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed.

At the time when this comment was updated there had been 49 new commits pushed to the master branch:

  • 7f80683: 8272783: Epsilon: Refactor tests to improve performance
  • 22ef4f0: 5015261: NPE may be thrown if JDesktopIcon is set to null on a JInternalFrame
  • 9bc0232: 8269223: -Xcheck:jni WARNINGs working with fonts on Linux
  • 2ff4c01: 8271600: C2: CheckCastPP which should closely follow Allocate is sunk of a loop
  • ad92033: 8272736: [JVMCI] Add API for reading and writing JVMCI thread locals
  • 709b591: 8272553: several hotspot runtime/CommandLine tests don't check exit code
  • 1884072: 8265253: javac -Xdoclint:all gives "no comment" warning for code that can't be commented
  • 594e516: 8272778: Consolidate is_instance and is_instance_inlined in java_lang_String
  • d542745: 8267894: Skip work for empty regions in G1 Full GC
  • 741f58c: 8272417: ZGC: fastdebug build crashes when printing ClassLoaderData
  • ... and 39 more: https://git.openjdk.java.net/jdk/compare/a68b5b9c1dd9dd69fa3ef29d8f1186707ec09311...master

As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid this automatic rebasing, please check the documentation for the /integrate command for further details.

➡️ To integrate this PR with the above commit message to the master branch, type /integrate in a new comment.

@openjdk openjdk bot added the ready Pull request is ready to be integrated label Aug 18, 2021
@msheppar
Copy link

msheppar commented Aug 18, 2021

is a countdown of 2 required?
or is it sufficient that the Interrupter thread should wait for the main to tell it to proceed?
Does the fact that the test runs agentvm influence the scheduling of the test's threads or is
is that irrelevant on linux? That is to say would running as othervm allow for more consistent
and deterministic interruption

@bplb
Copy link
Member Author

bplb commented Aug 18, 2021

It did indeed look as if the copy starts before the interrupter thread but that is not guaranteed. I do not know about the difference between agentvm and othervm. Another alternative would be to repeatedly interrupt at a fixed rate with zero delay but I don't know whether the original intent was to have only one interrupt.

@msheppar
Copy link

ah yes! with the count of 2 you wish to avoid either thread racing ahead, and so are attempting to arrange that the interrupter is at least Ready to Run when the copy thread initiates its copy, as such creating a reasonable probability that the interrupter will invoke the interrupt while the copy is in progress. As such, the ping pong between the two threads is a sort of ad hoc barrier, which seems reasonable and will most likely work the majority of the time.

But there is, I think, still some probability of missing the interrupt, if the interrupter thread "got stuck" after its little sleep, and was less than fairly scheduled? This is where agentvm or othervm may have influence, depending on the scheduling algorithm of the OS. Are threads, on Windows, scheduled independent of their owning process or relative to their owning process. othervm mode, afaik, launches a new process to executed the test, not sure if that would influence the scheduling of its threads.

Would some additional diagnostic output be useful in the Interrupter thread to indicate that it starting to run and it is invoking interrupt or has invoked interrupt?

In the second phase of the test the cancellation of scheduled task, is a latch also required in this instance, to ensure that when the cancel is invoked that the copy is in progress.

@bplb
Copy link
Member Author

bplb commented Aug 18, 2021

@msheppar Right, the latch is intended to do exactly what you described. There is definitely still some probability of missing the interrupt however, but this results in a failure only if the copy took more than DURATION_MAX_IN_MS (5 seconds). I don't know how this limit was determined.

I think you are correct about the problem of the interrupter getting stuck. I don't know which vm mode would best minimize this problem.

I had some print statements in the test when I was playing around with different approaches, but I was afraid that they might affect the timing. I'll take another look at it however as well as the idea of a latch added to the cancellation section.

@bplb
Copy link
Member Author

bplb commented Aug 18, 2021

Strangely enough it looks like no exception is thrown if cancellation fails.

Copy link
Member

@dfuch dfuch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated logic LGTM.

result.get();
System.out.println("Copy cancelled.");
throw new RuntimeException("Copy was not cancelled");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should DURATION_MAX_IN_MS be taken into account here too to decide whether to throw or not?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know. It seems like cancellation ought to be a bit more predictable than interruption. I am not sure how DURATION_MAX_IN_MS was determined. If there were no "hiccups" in the threads starting, then with the correct timing I am not sure that such a threshold is even needed. The execution time for the copy that I have observed is between 150ms (macOS) and 400ms (Windows) which is a lot less than 5s, so there would have to be some slowdown for this threshold even to be reached.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not wishing to labour the discussion on the change, which will in the main deal with the issue, but is it not the case with the current set of Intermittent failures that there is this slowdown, i.e. the copy is taking longer than the threshold of 5 seconds hence the RuntimeException with Copy was not Interrupted?

As such,

if (duration > DURATION_MAX_IN_MS)
throw new RuntimeException("Copy was not interrupted");

is triggering a failure.
You would expect if a typical copy duration is 400msecs that if the duration is taking a lot longer that the interrupt has a greater probability to happen.

the rationale for this logic is not clear - so if the duration is less than the copy threshold then test proceeds as if an interrupt has occurred?

the success and failure semantics for the test are not absolute. As such, maybe there's an opportunity to be more exact with test semantics e.g. retrying to interrupt the copy if it didn't happen.

as an aside it would appear that invoking Thread.interrupted(); in the main thread seems redundant ?

@bplb
Copy link
Member Author

bplb commented Aug 19, 2021

It looks like the duration threshold DURATION_MAX_IN_MS was added in the change which resolved JDK-6993267. A comment in that issue is "We should improve the test to work on very fast machines." So the copy must have been too fast for it to be interrupted and without the duration threshold there were spurious failures.

Thread.interrupted() in the main thread indeed seems unnecessary.

I think a further look at the semantics, as suggested, is in order.

1 similar comment
@bplb
Copy link
Member Author

bplb commented Aug 19, 2021

It looks like the duration threshold DURATION_MAX_IN_MS was added in the change which resolved JDK-6993267. A comment in that issue is "We should improve the test to work on very fast machines." So the copy must have been too fast for it to be interrupted and without the duration threshold there were spurious failures.

Thread.interrupted() in the main thread indeed seems unnecessary.

I think a further look at the semantics, as suggested, is in order.

@bplb
Copy link
Member Author

bplb commented Aug 20, 2021

This latest version might be overthinking it, but I doubled the size of the test file created, and changed the interrupt and cancellation sections to make multiple attempts to interrupt the copy. I ran 80 iterations on each of the pertinent platforms with no failures.

final Thread me = Thread.currentThread();
Future<?> wakeup = pool.schedule(new Runnable() {
Future<?> wakeup = pool.scheduleAtFixedRate(new Runnable() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it really a good idea? The interrupted state should stick to the thread until it is cleared. It should not really matter if you interrupt the thread too early - but if you interrupt the thread again after the File copy was interrupted, won't it skew the logic in the test? Especially WRT double checking that the interrupt status was/wasn't cleared? Also shouldn't you cancel this wakeup at some point before testing cancel below?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was not sure about the fixed rate firing here so you might be right. The wakeup is cancelled by wakeup.get() unless I misunderstand.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correction: I think an explicit wakeup.cancel() is needed.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Apparently setting the interrupted state prior to the copy does not cause the copy to be interrupted.

@bplb
Copy link
Member Author

bplb commented Aug 21, 2021

I have another version in progress and it would be worth waiting to review that one once it is published. The new version succeeded in 539 of 540 runs on four platforms (135 runs per platform). The failed case printed this information:

Source file created.
Copying file at 1629496891924ms...
Interrupting at 1629496891934ms...
Done copying at 1629496892261ms...
----------System.err:(13/761)----------
java.lang.RuntimeException: Copy was not interrupted
	at InterruptCopy.doTest(InterruptCopy.java:110)
	at InterruptCopy.main(InterruptCopy.java:60)

The interrupt call was made 10ms after the copy started, but the copy nonetheless ran another 327ms somehow without catching the interrupt. At present this is baffling.

@bplb
Copy link
Member Author

bplb commented Aug 23, 2021

I tested in my work in progress by delaying the sleep until 10ms after the interrupt was issued and the copy was not interrupted. I expect that this is the reason for the 1 failure out of 540 executions mentioned above.

@bplb
Copy link
Member Author

bplb commented Aug 24, 2021

Version 04 revives the use of CountDownLatch but with a count of 1 instead of 2. It is used for both the interrupt and the cancel cases to make the interrupting thread wait until just before copy() before the interrupting thread starts its sleep delay. For the interrupt case, a minimum duration for copying leading to a test failure is reinstated. This tries to prevent a false failure if the copy was much faster than expected.

Copy link
Member

@dfuch dfuch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the test campaign shows better results with this latter version then I have no objection: it looks better to me than the previous repeated interrupt logic.

@bplb
Copy link
Member Author

bplb commented Aug 24, 2021

/integrate

@openjdk
Copy link

openjdk bot commented Aug 24, 2021

Going to push as commit aaedac6.
Since your change was applied there have been 58 commits pushed to the master branch:

Your commit was automatically rebased without conflicts.

@openjdk openjdk bot closed this Aug 24, 2021
@openjdk openjdk bot added integrated Pull request has been integrated and removed ready Pull request is ready to be integrated rfr Pull request is ready for review labels Aug 24, 2021
@openjdk
Copy link

openjdk bot commented Aug 24, 2021

@bplb Pushed as commit aaedac6.

💡 You may see a message that your pull request was closed with unmerged commits. This can be safely ignored.

@bplb bplb deleted the Files-InterruptCopy-8265261 branch August 24, 2021 17:11
@mlbridge
Copy link

mlbridge bot commented Sep 1, 2021

Mailing list message from Brian Burkhalter on nio-dev:

On Aug 24, 2021, at 4:04 AM, Daniel Fuchs <dfuchs at openjdk.java.net<mailto:dfuchs at openjdk.java.net>> wrote:

On Tue, 24 Aug 2021 00:30:48 GMT, Brian Burkhalter <bpb at openjdk.org<mailto:bpb at openjdk.org>> wrote:

This proposal suggests to change the timing of testing whether a file copy is terminated by an interrupt.

Brian Burkhalter has updated the pull request incrementally with one additional commit since the last revision:

8265261: Reinstate duration check and latches but with countdown 1

If the test campaign shows better results with this latter version then I have no objection: it looks better to me than the previous repeated interrupt logic.

This does show better results. I am not however convinced that it is better than the previously posted fixed rate interrupt version. I verified that if the interrupt is issued before the copy begins then it is not recognized by `copy()`. Given that, repeating might give a more predictable result. I would agree though to go with this version and revisit it once more should failures be seen.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://mail.openjdk.java.net/pipermail/nio-dev/attachments/20210824/1715a4e8/attachment.htm>

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
integrated Pull request has been integrated nio nio-dev@openjdk.org
3 participants