[FIXED JENKINS-15331] Workaround Windows unpredictable file locking in Util.deleteContentsRecursive #615
I've written (what I hope to be) a fix for https://issues.jenkins-ci.org/browse/JENKINS-15331. Unlike my previous post on this subject, this contains improved functionality (uses same tactics as Ant to delete files on Windows), and the file diffs are sensible.
Added three new system properties that control behavior (defaults should be acceptable to all, but configurable if necessary):
Delete operations that affect directories now try to delete the entire contents of the directory, continuing on to subfolders etc. even after encountering files that wouldn't die, before eventually throwing an exception about what wouldn't die. e.g. if a folder has files "a", "b" and "c", and you can't delete "b", then "a" and "c" would get deleted (and you'll still get the exception about "b").
Delete operations now have multiple attempts at deleting things (not just "twice"), so if not everything could be deleted first time around, maybe they'll get deleted 2nd/3rd etc time around. An exception is only thrown if all retry attempts are exhausted and there are still files/directories that won't delete.
On Windows, it defaults to calling System.gc() before retrying the delete. This approach is used by Apache Ant to work around much the same problem, namely that deleting files on Windows isn't reliable (and the Ant devs seem to think that this is the right thing to do).
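To make the three behaviours above concrete, here is a minimal sketch. It is not the code in this pull request: the class name `RetryingDelete`, the constants, and their values are hypothetical stand-ins for the configurable system properties, and symlinks are not handled specially.

```java
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public class RetryingDelete {
    // Hypothetical defaults; the patch's real property names and values are not shown here.
    static final int MAX_ATTEMPTS = 3;
    static final long WAIT_MILLIS = 100;
    // Assumption: only request a GC between attempts when running on Windows.
    static final boolean GC_BEFORE_RETRY =
            System.getProperty("os.name", "").toLowerCase().startsWith("windows");

    /** One best-effort pass: delete everything possible, recording what would not die. */
    static void tryDelete(File f, List<File> failures) {
        File[] children = f.listFiles(); // null when f is not a directory
        if (children != null) {
            for (File child : children) {
                tryDelete(child, failures); // keep going even after a failure
            }
        }
        if (f.exists() && !f.delete()) {
            failures.add(f);
        }
    }

    /** Retries the best-effort pass; throws only once every attempt is exhausted. */
    static void deleteRecursive(File root) throws IOException, InterruptedException {
        List<File> failures = new ArrayList<>();
        for (int attempt = 1; attempt <= MAX_ATTEMPTS; attempt++) {
            failures.clear();
            tryDelete(root, failures);
            if (failures.isEmpty()) {
                return; // everything is gone
            }
            if (attempt < MAX_ATTEMPTS) {
                if (GC_BEFORE_RETRY) {
                    System.gc(); // only reached after a deletion has already failed
                }
                Thread.sleep(WAIT_MILLIS);
            }
        }
        throw new IOException(
                "Unable to delete after " + MAX_ATTEMPTS + " attempts: " + failures);
    }
}
```

Calling `RetryingDelete.deleteRecursive(new File("workspace"))` would remove whatever it can on each pass, and the final exception lists only the paths that survived every attempt.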
Whilst I've added unit-tests that prove that the deletion behavior is as I intended, I have yet to prove that this actually fixes the issue. I am reasonably sure it makes it no worse ("it works for me") but, due to the unpredictable nature of the fault, it's difficult to prove a fix has fixed it; one can merely wait and see if it re-occurs.
Jenkins on Windows is not Ant: its memory usage is different, and so is the time it takes to execute a full stop-the-world GC.
By the way, the method "possiblyCallGC" doesn't seem to be called anywhere.
This might not be Ant, but I think it's better to have a performance stutter due to a GC than to have a build failure - the GC should only get called after a deletion failure, so it's only in the "we're about to fail anyway" path that the GC would get called. I've been running with this functionality for longer than this patch has been here, and I've found that it fixed the build failures (completely) without causing any noticeable performance issues.
As for "possiblyCallGC", you're spot on - oops. As you might guess, this patch isn't /identical/ to what I'm running at work (I'm using the LTS branch at work, but this patch was re-done against the head at the time, and I guess that call got missed out when I re-coded it). If I get the chance (probably next year) and update the patch to be based on the /current/ head and re-issue the pull request, I'll resolve it then.
I still think that a failing build is not a reason to potentially halt the server, and all builds, with successive GCs.
So I would vote against calling GC when a delete fails (in the default behavior), sorry.