Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FIXED JENKINS-15331] Retry failed file delete operations #1800

Closed
wants to merge 77 commits into from

Conversation

@pjdarton
Copy link
Contributor

commented Aug 17, 2015

This changes Jenkins' strategy when doing recursive deletes so that it no longer stops iterating through a directory when it finds a file it can't delete - it'll try to delete what it can before throwing the exception.
It also changes the default number of attempts that Jenkins makes when doing a recursive delete from 2 attempts to 3 attempts.
This combination results in a dramatic improvement in deletion reliability on Windows-based job executors.

This behavior is configurable (on a per-JVM basis), albeit crudely - like the existing hudson.Util.noSymlink & hudson.Util.symlinkEscapeHatch flags, these can be configured using -D options when starting the Jenkins or slave services.

-Dhudson.Util.deletionMax=3
Sets the number of times Jenkins will try to delete a file before giving up.
Defaults to 3 attempts.

-Dhudson.Util.deletionRetryWait=100
Sets the time (in milliseconds) for which Jenkins will pause between delete attempts.
Defaults to 100ms between each attempt.
If set to a negative number, it will delay for a linearly increasing amount between attempts.

-Dhudson.Util.performGCOnFailedDelete=false
If set to true, this tells Jenkins to run the Garbage Collector between attempts (just in case the process holding a file open is an un-garbage-collected Java filehandle within Jenkins itself).
It defaults to false.

(Note: This was reworked to patch the Jenkins master branch as of 15:30 GMT Aug 17 2015, and superceeds pull request 1209)

olivergondza and others added some commits May 5, 2015

JENKINS-24433 Improving tests
(cherry picked from commit cb17460)
Correct tests
(cherry picked from commit 9afb092)
Corrent tab in assertTrue
(cherry picked from commit e5e53be)
Corrent tab in assertTrue again
(cherry picked from commit dab5650)
Tab major clean up
(cherry picked from commit 6528ff1)
[FIXED JENKINS-28115] Division by zero in Executor.getProgress()
```
WARNING: Caught exception evaluating: executor.progress in /ajaxExecutors. Reason: java.lang.reflect.InvocationTargetException
java.lang.reflect.InvocationTargetException
...
	at java.lang.Thread.run(Thread.java:744)
Caused by: java.lang.ArithmeticException: / by zero
	at hudson.model.Executor.getProgress(Executor.java:649)
	... 154 more
```
(cherry picked from commit e685b60)
[FIXED JENKINS-28062] A Launcher.afterDisconnect() method that propag…
…ates an exception is a sign of a bad Launcher not a problem closing the channel.

(cherry picked from commit d760216)
[FIXED JENKINS-27708][FIXED JENKINS-27871] Ensure that identification…
… of blocked tasks is using the live state.

- The creation of a snapshot itself should be relatively cheap given the expected rate of
  job execution. You probably would need 100's of jobs starting execution every iteration
  of maintain() before this could even start to become an issue and likely the calculation
  of isBuildBlocked(p) will become a bottleneck before updateSnapshot() will. Additionally
  since the snapshot itself only ever has at most one reference originating outside of the stack
  it should remain in the eden space and thus be cheap to GC.

- JENKINS-27708 comments 225819 and 225906 provide more complex but logically equivalent fixes of
  this issue. I am favouring this approach as it is simpler and provides less scope for error as any
  new helper methods can just rely on the snapshot being up to date whereas with the other
  two candidates if a new helper method is introduced there is the potential to miss adding support
  for the live view. The comment 225819 has the risk of introducing extra lock contention while
  the comment 225906 version forces every access to the helper methods to pass a second memory
  barrier

(cherry picked from commit 5880ed8)
[JENKINS-26781] lookup descriptor by ID, then by class if explicitly set
to match some specific use cases.

(cherry picked from commit 8881703)
deprecated find(string) and introduce explicit methods
to lookup descriptor by id or by Describable class name
(cherry picked from commit 6a8ce6c)
make $class and kind mutually exclusive
Use kind when descriptor.id != descriptor.clazz.name and $clazz otherwise.
(cherry picked from commit 901194a)
[JENKINS-26781] Reproduced problem in test.
Besides failing in master, it also fails in 4e0af43 (merge of #1443), but passes in the 1.598-SNAPSHOT parent 239caf8.
(cherry picked from commit 3304190)
Now it passes.
(cherry picked from commit 8f1883a)
Added @SInCE.
(cherry picked from commit c4c7259)
Improving Javadoc.
(cherry picked from commit 33923db)
Noting deprecations more widely.
(cherry picked from commit 37167d5)
Doc improvements
(cherry picked from commit 3c4b1e6)
Reverting whitespace change.
(cherry picked from commit f4ddac1)
[FIXED JENKINS-28011] Amending #1563, ${descriptor} is not necessaril…
…y ${attrs.descriptor}!

(cherry picked from commit 343382b)
newInstancesFromHeteroList should recover gracefully from the case th…
…at both kind and $class are missing.

(cherry picked from commit 17a8835)

pjdarton added some commits Aug 13, 2015

Renamed test methods to be OS-specific not test-case specific.
Added enthusiastic clean-up code (to match the other tests in this class).
FIX JENKINS-15331 by changing Util.deleteContentsRecursive, Util.dele…
…teFile and Util.deleteRecursive so that they can retry failed deletions.

The number of deletion attempts and the time it waits between deletes are configurable via system properties (like hudson.Util.noSymlink etc).

Also added unit-tests for new functionality.

@pjdarton pjdarton changed the title [FIX JENKINS-15331] Retry failed file delete operations [FIXED JENKINS-15331] Retry failed file delete operations Aug 23, 2015

@oleg-nenashev

This comment has been minimized.

Copy link
Member

commented Oct 17, 2015

I think the change is reasonable even if it does not guarantee the deletion success. 👍 for merge

@jtnord

This comment has been minimized.

Copy link
Member

commented Oct 19, 2015

This combination results in a dramatic improvement in deletion reliability on Windows-based job executors.

Are there issues for this, otherwise we are just masking issues rather than fixing them?

@pjdarton

This comment has been minimized.

Copy link
Contributor Author

commented Oct 19, 2015

@jtnord https://issues.jenkins-ci.org/browse/JENKINS-15331 details the solution, but there are many other issues showing the way the problem manifests (most typically during the "delete workspace" phase of a build), several of which are either "duplicates of 15331" or "related to 15331".
Basically, you can't reliably delete (or move, or open for writing) files/directories on Windows - the filesystem is inherently non-deterministic because of (arguably poor) design decisions made by Microsoft, which manifests as a lack of reliability. Code running on Windows needs to cope with that in order to avoid passing that "unreliable" behavior back to the user.

So, yes, technically we are "just masking issues rather than fixing them", but the "proper" fix would be either "don't use Windows!" (impractical), or have Microsoft implement a fix (unlikely - they don't acknowledge that this is a problem and they have been ignoring this for over a decade now); this is a pragmatic solution to the problem.

@jtnord

This comment has been minimized.

Copy link
Member

commented Oct 19, 2015

@pjdarton What I was asking is is there issues where this occurs so that we can look at fixing the root issues rather than masking the issue (and potentially having falky builds).

If a full GC solves the issue then a plugin or core has not closed the file after using it - so this should be an easy fix if we can narrow this down to the plugins in use.
External tools are a little more complicated - some behave well, and some do not. We could use some hooks to find out what that tool is (given we know the file path) and file upstream issues (if that tool is designed to live outside the build). The other suspect area would be workspace browsing or SCMs which require the workspace for polling.

@pjdarton

This comment has been minimized.

Copy link
Contributor Author

commented Oct 19, 2015

The root issue really is the Windows OS. It's not a coding fault within Jenkins. It's not caused by Jenkins code holding files open, it's caused by other processes on the Windows slave interfering with the files which "belong" to the Jenkins slave, e.g. by using the Windows "filesystem activity hook" (as used by anti-virus, Windows Search etc) to detect the addition (or usage) of files on the filesystem, and then opening the files that Jenkins is about to remove (e.g. in order to inspect them), which then sometimes causes the deletion to fail (because another process briefly has the file open).
This is a "well known" problem on Windows.
e.g. if you use Windows Explorer to delete a deeply-nested directory hierarchy, it'll often leave you with some empty folders that you have to delete again. Similarly, the Windows CMD window "RD /S /Q myNestedDirecoty" command will sometimes state "directory is not empty" and require you to repeat the command.
i.e. this is not a problem within Jenkins, or its code, it's a problem on Windows.

The solution, on Windows, is to issue the deletion request multiple times. Usually, the rogue process (e.g. Windows Search Service) that locked the file open won't keep it open very long, so an immediate retry will succeed.
If the underlying problem is that Jenkins itself has the file open (e.g. the RTC plugin has a bug whereby it can attempt to reinstall the RTC jars while the RTC plugin is using them) then no amount of retrying will fix that, at which point the build will fail with an error message (albeit taking slightly longer to do so).
Similarly, if an external process has a file (locked) open then the build will still fail (and report why). What this code does is handle the situation where files are briefly locked open while the delete is ongoing (which is typical behavior of the Windows Search Service, and most Anti-Virus software, but this malware-like-behavior is not limited to just those two).
i.e. The code enhancement I've done does not mask real persistent failures, only transient ones caused by the (poor) Windows filesystem behavior.

Re: GC
I would agree that, if a full GC "solved" an issue, then that would be just masking the issue, but this is why the GC behavior is turned off by default. This functionality is useful when encountering buggy plugins (typically written by people who are 100% unix so they never encounter file deletion problems), and allows you to temporarily work around such bugs until a proper fix can be implemented.
FYI Apache Ant uses the same GC technique in its deletion code. I was persuaded (see #615) that it should be disabled by default in Jenkins, and the javadoc for the GC configuration flag came from that discussion.
i.e. I'm not recommending that the GC functionality is enabled by default, and I wouldn't describe it as a "cure", but it can be an effective "workaround", allowing people to continue to work while a bugfix is done.

In summary:

  • The deletion retry mechanism is a workaround for a "behavioral characteristic" (aka a bug) in Windows, and (in my experience) is essential (without it, one sometimes gets red builds which aren't the developer's fault).
  • The GC capability is less essential but is sometimes very useful when running code that has yet to be fixed to also run on Windows. It's a very useful capability to have available "in case of emergency", and using this can really help with tracking down the cause of locked files (as, if this fixes it, you know that your code is relying on the GC to close files that it should have closed itself).
@jtnord

This comment has been minimized.

Copy link
Member

commented Oct 19, 2015

@pjdarton thanks for the detailed explanation

@oleg-nenashev

This comment has been minimized.

Copy link
Member

commented Nov 17, 2015

I think we can integrate it into the master branch and see how it goes. 👍 from me.
@jtnord WDYT?

We could convert it to the experimental feature as a plan B

@pjdarton

This comment has been minimized.

Copy link
Contributor Author

commented Nov 18, 2015

I would very much like this integrated into the master branch.

Re:"experimental feature", it can be configured that way if the defaults
don't suit, just as with other settings in the Util class, however I
believe that the chosen defaults should work well as-is. I can vouch that
those settings work fine (for me and my colleagues) on both Windows and
Linux.
FYI The text for the wiki page
https://wiki.jenkins-ci.org/display/JENKINS/Features+controlled+by+system+properties
could be copy/pasted straight from the Javadoc in the code - I wrote those
comments with the aim of eliminating the need for people to ask me what
those options did ;)

On 17 November 2015 at 22:52, Oleg Nenashev notifications@github.com
wrote:

I think we can integrate it into the master branch and see how it goes. [image:
👍] from me.
@jtnord https://github.com/jtnord WDYT?

We could convert it to the experimental feature as a plan B


Reply to this email directly or view it on GitHub
#1800 (comment).

* recommended, and should not be used on the main Jenkins server
* unless you can tolerate the performance impact.
*/
public static boolean GC_AFTER_FAILED_DELETE = Boolean.getBoolean(Util.class.getName() + ".performGCOnFailedDelete");

This comment has been minimized.

Copy link
@jtnord

jtnord Nov 18, 2015

Member

private please - or package if you need it for tests.

* Whatever the reason, this allows us to do multiple attempts before we
* give up, thus improving build reliability.
*/
public static int DELETION_MAX = Math.max(1, Integer.getInteger(Util.class.getName() + ".deletionMax", 3).intValue());

This comment has been minimized.

Copy link
@jtnord

jtnord Nov 18, 2015

Member

private please or package if you need it for tests.

* If negative, we will wait an (linearly) increasing multiple of this value
* between attempts.
*/
public static int WAIT_BETWEEN_DELETION_RETRIES = Integer.getInteger(Util.class.getName() + ".deletionRetryWait", 100).intValue();

This comment has been minimized.

Copy link
@jtnord

jtnord Nov 18, 2015

Member

private please or package if you need it for tests.

@jtnord

This comment has been minimized.

Copy link
Member

commented Nov 18, 2015

I'm ok on it going in if we get rid of the newly exposed public state.
I see no reason why this needs to be public. You can change private state from the script console - and I do not want more public fields causing issues in the future.

Changed Util.GC_AFTER_FAILED_DELETE, DELETION_MAX and WAIT_BETWEEN_DE…
…LETION_RETRIES from "public" to "package" level access, as per review comments.
@daniel-beck

This comment has been minimized.

Copy link
Member

commented Dec 7, 2015

Should probably also be @Restricted(NoExternalUse.class) to be safer.

@pjdarton

This comment has been minimized.

Copy link
Contributor Author

commented Dec 7, 2015

@daniel-beck I'm unfamiliar with that particular annotation. Can you
elaborate?
e.g. Is this a mark-up to be applied to the (now package level access)
fields to indicate that they are not to be used by anything other than the
unit-test class?

On 7 December 2015 at 14:02, Daniel Beck notifications@github.com wrote:

Should probably also be @restricted(NoExternalUse.class) to be safer.


Reply to this email directly or view it on GitHub
#1800 (comment).

@daniel-beck

This comment has been minimized.

Copy link
Member

commented Jan 12, 2016

@pjdarton Exactly. Sorry I missed the notification.

Something appears to have gone really wrong with this PR. There are tons of unrelated commits. Could you please fix this?

@pjdarton pjdarton closed this Feb 12, 2016

@pjdarton pjdarton deleted the pjdarton:fix_jenkins_15331 branch Feb 12, 2016

@pjdarton pjdarton referenced this pull request Feb 12, 2016

Merged

FIXED JENKINS-15331 #2026

@pjdarton

This comment has been minimized.

Copy link
Contributor Author

commented Feb 12, 2016

@daniel-beck My skills with the git command-line are somewhat limited (as demonstrated by the presence of "tons of unrelated commits"...), so I've fixed the mess the only way I know - I wiped and re-created the branch. That automatically closed this pull request, so I've created a new one - #2026

I hope that this new (and clean!) pull request will meet all your requirements and we can (finally) get this functionality merged into the Jenkins code.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.