Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FIXED JENKINS-15331] Workaround Windows unpredictable file locking in Util.deleteContentsRecursive #615

Closed
wants to merge 1 commit into from

Conversation

2 participants
@pjdarton
Copy link
Contributor

commented Nov 12, 2012

I've written (what I hope to be) a fix for https://issues.jenkins-ci.org/browse/JENKINS-15331. Unlike my previous post on this subject, this contains improved functionality (uses same tactics as Ant to delete files on Windows), and the file diffs are sensible.
This fix can either be pulled from here, or grabbed from the patch I've uploaded to JIRA.

Features:

Added three new system properties that control behavior (defaults should be acceptable to all, but configurable if necessary):

  • "Util.deletionRetries" (an integer, defaults to 3),
  • "Util.deletionRetryWait" (an integer, defaults to 100ms),
  • "Util.performGCOnFailedDelete" (boolean, defaults to true on Windows, false everywhere else).

Delete operations that affect directories now try to delete the entire contents of the directory, continuing on to subfolders etc even after encountering files that wouldn't die, before eventually throwing an exception about what wouldn't die. i.e. if a folder has a file "a", "b" and "c", and you can't delete "b", then "a" and "c" would get deleted (and you'll still get the exception about "b").

Delete operations now have multiple attempts at deleting things (not just "twice"), so if not everything could be deleted first time around, maybe they'll get deleted 2nd/3rd etc time around. An exception is only thrown if all retry attempts are exhausted and there are still files/directories that won't delete.

On Windows, it defaults to calling System.gc() before retrying the delete. This approach is used by Apache Ant to workaround much the same problem, namely that deleting files on Windows isn't reliable (and the Ant devs seem to think that this is the right thing to do).

Disclaimers:

Whilst I've added unit-tests that prove that the deletion behavior is as I intended, I have yet to prove to that this actually fixes the issue. I am reasonably sure it makes it no worse ("it works for me") but, due to the unpredictable nature of the fault, it's difficult to prove a fix has fixed it, one can merely wait and see if it re-occurs.

Peter Darton
JENKINS-15331
Retry deletes (a configurable number of times) after a (configurable)
delay, possibly after calling the garbage collector (configurable).
@evernat

This comment has been minimized.

Copy link
Contributor

commented Oct 28, 2013

Jenkins on Windows is not Ant, because of different memory used and because of the duration to execute a Full Stop-the-world GC.
I would prefer not to enable automatic System.gc() on a failed delete by default : it could be a system property to enable if needed, but not enabled by default.

By the way, the method "possiblyCallGC" doesn't seem to be called anywhere.

@pjdarton

This comment has been minimized.

Copy link
Contributor Author

commented Oct 28, 2013

This might not be Ant, but I think it's better to have a performance stutter due to a GC than to have a build failure - the GC should only get called after a deletion failure, so it's only in the "we're about to fail anyway" path that the GC would get called. I've been running with this functionality for longer than this patch has been here, and I've found that it fixed the build failures (completely) without causing any noticable performance issues.

As for "possiblyCallGC", you're spot on - oops. As you might guess, this patch isn't /identical/ to what I'm running at work (I'm using the LTS branch at work, but this patch was re-done against the head at the time, and I guess that call got missed out when I re-coded it). If I get the chance (probably next year) and update the patch to be based on the /current/ head and re-issue the pull request, I'll resolve it then.

@evernat

This comment has been minimized.

Copy link
Contributor

commented Oct 28, 2013

I still think that a failing build is not a reason to potentially halt the server, and all builds, in successive GC.
And, there were already code fixes with streams correctly closed and there are still other pendings.

So I would vote against calling GC when a delete fails (in the default behavior), sorry.

@pjdarton pjdarton closed this May 1, 2014

@pjdarton

This comment has been minimized.

Copy link
Contributor Author

commented May 1, 2014

Replaced by #1209

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.