New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[FIXED JENKINS-35160] - Job deletion: Wait up to 15 seconds for interrupted builds to complete #2789
Conversation
… complete - Also now aware of concurrent builds
This pull request originates from a CloudBees employee. At CloudBees, we require that all pull requests be reviewed by other CloudBees employees before we seek to have the change accepted. If you want to learn more about our process please see this explanation. |
@@ -262,20 +268,6 @@ public void onCopied(Item src, Item item) { | |||
} | |||
} | |||
|
|||
@Override | |||
protected void performDelete() throws IOException, InterruptedException { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
need to not hold the lock while interrupting or the interrupted threads may be unable to complete
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🐛 for the concurrency flaws. Ideally the job should be marked for deletion (e.g. boolean flag), and after setting this flag the job should prevent any attempts to schedule it to the queue. And maybe the flag should be released if the deletion fails (or not?).
Another 🐛 for using isAlive()
for the executor
// if a build is in progress. Cancel it. | ||
if (this instanceof Queue.Task) { | ||
// clear any items in the queue so they do not get picket up | ||
Queue.getInstance().cancel((Queue.Task) this); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It won't work for the Promoted Builds and other such... exotic implementations
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Better consider that architecturally deprecated.
for (Executor e : c.getOneOffExecutors()) { | ||
WorkUnit workUnit = e.getCurrentWorkUnit(); | ||
if (workUnit != null && (workUnit.work == this || workUnit.work.getOwnerTask() == this)) { | ||
building.put(e.getCurrentExecutable(), e); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Lacks logging
WorkUnit workUnit = e.getCurrentWorkUnit(); | ||
if (workUnit != null && (workUnit.work == this || workUnit.work.getOwnerTask() == this)) { | ||
building.put(e.getCurrentExecutable(), e); | ||
e.interrupt(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Lacks logging
Thread.sleep(50L); | ||
} | ||
if (!building.isEmpty()) { | ||
throw new AbortException(Messages.Job_FailureToStopBuilds(building.size(), getFullDisplayName())); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should it also go to the system log?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I do not see why. The user has gotten a failure message (I hope—test it?); the job has not been deleted; there is nothing for the admin to do.
// comparison with executor.getCurrentExecutable() == executable currently should always be true | ||
// as we no longer recycle Executors, but safer to future-proof in case we ever revisit recycling | ||
|
||
if (!entry.getValue().isAlive() || entry.getKey() != entry.getValue().getCurrentExecutable()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It should be isActive()
. E.g. if the executor is starting up/shutting down, we should not ignore the task check 🐛
} | ||
} | ||
} | ||
if (!building.isEmpty()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🐛 this method uses cached list of builds. If the new build gets started due to whatever reason, during the deletion cycle, these builds won't be noticed.
} | ||
Thread.sleep(50L); | ||
} | ||
if (!building.isEmpty()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Risk of false negatives since this is a value from the past
WorkUnit workUnit = e.getCurrentWorkUnit(); | ||
if (workUnit != null && (workUnit.work == this || workUnit.work.getOwnerTask() == this)) { | ||
building.put(e, e.getCurrentExecutable()); | ||
e.interrupt(Result.ABORTED); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Better to factor into a helper method. Better yet, introduce Computer.getAllExecutors()
, since that is a common problem. (See Executor.of
etc.)
for (Executor e : c.getOneOffExecutors()) { | ||
WorkUnit workUnit = e.getCurrentWorkUnit(); | ||
if (workUnit != null && (workUnit.work == this || workUnit.work.getOwnerTask() == this)) { | ||
building.put(e, e.getCurrentExecutable()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe easier to just check for
Queue.Executable exec = e.getCurrentExecutable();
if (exec instanceof Run && ((Run) exec).getParent() == this) {
// comparison with executor.getCurrentExecutable() == executable currently should always be true | ||
// as we no longer recycle Executors, but safer to future-proof in case we ever revisit recycling | ||
|
||
if (!entry.getKey().isAlive() || entry.getValue() != entry.getKey().getCurrentExecutable()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And then we could use simply Run.isLogUpdated()
.
// if a build is in progress. Cancel it. | ||
if (this instanceof Queue.Task) { | ||
// clear any items in the queue so they do not get picket up | ||
Queue.getInstance().cancel((Queue.Task) this); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Better consider that architecturally deprecated.
Left over references before I settled on a better name
@oleg-nenashev if the executor is interrupted the thread will die: 111a8d3#diff-4da8c72b731d029f61f01ed41568d209R248 so the only way to know that the executor has completed is to observe Other issues addressed |
@stephenc maybe it also addresses https://issues.jenkins-ci.org/browse/JENKINS-32783 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have a minor 🐜 regarding the missing Interruption cause, but the most of the code looks good to me. IMHO some testing like ATH and PCT would not hurt in such case due to the fix complexity and potential risk of severe regressions.
Also dismissing the review from @jglick since the code has changed significantly
} | ||
} | ||
for (Executor executor : computer.getOneOffExecutors()) { | ||
for (Executor executor : computer.getAllExecutors()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It causes some performance degradation (getOneOffExecutors() always happens + collection merge), so maybe we want to keep the original implementation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you are arguing for more complex code on the grounds of performance in the absence of a proven performance degradation. That smells like premature optimization
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, it is not "more complex" code, it is just the "original code". And I think the impact may be really visible if we talk about modern Jenkins Pipeline-powered instances with dozens/hundreds of OneOff executors.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if you have lots of one-off executors then the majority of cases will be checking the one-off executors, so we save two ArrayList and two Iterator allocations, plus escape analysis can reveal that the one list is only ever used for iteration and the method can be inlined as it is small... the JVM is smarter than you think. (maybe not that smart, but there is hope!)
} | ||
} | ||
for (Executor e : c.getOneOffExecutors()) { | ||
for (Executor e : c.getAllExecutors()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same as above
boolean ownsRegistration = ItemDeletion.register(this); | ||
if (!ownsRegistration && ItemDeletion.isRegistered(this)) { | ||
// we are not the owning thread and somebody else is concurrently deleting this exact item | ||
throw new Failure(Messages.AbstractItem_BeingDeleted(getPronoun())); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe it worth putting this warning into UI as well (e.g. in the summary page near the "disabled" status)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For all of 15 seconds!!!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
okay
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Still may be a case if the timeout gets configurable
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Then whoever adds that feature can add the UI indicator. Solved
} | ||
} | ||
} | ||
// interrupt any builds in progress (and this should be a recursive test so that folders do not pay |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe they should, but not sure. It makes FolderDelition a separate long background task. Which is probably fine, but out of the scope of this PR imho
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We are recursively interrupting all jobs in one go here, any deletion will be at most 15 seconds
while (item != null) { | ||
if (item == this) { | ||
buildsInProgress.put(e, e.getCurrentExecutable()); | ||
e.interrupt(Result.ABORTED); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🐛 No CauseOfInterruption
, which would be really useful in this code. The default behavior will be an emty cause in such case: https://github.com/jenkinsci/jenkins/blob/master/core/src/main/java/hudson/model/Executor.java#L190
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- The job is going to be deleted
- As the this is the user who will be triggering the interrupt the
UserInterruption
cause will be injected
iterator.remove(); | ||
} | ||
// I don't know why, but we have to keep interrupting | ||
entry.getKey().interrupt(Result.ABORTED); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same, missing cause
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The code has significantly changed
@oleg-nenashev I argue that your bug was incorrect, the interruption is a user triggered action and will have a cause reflecting that |
@oleg-nenashev @jglick ping also @jenkinsci/code-reviewers |
I do not block the PR anymore though I still prefer the explicit cause
@stephenc removed the bug according to explanation |
@reviewbybees any feedback? |
@oleg-nenashev per policy, 7 days so @reviewbybees done |
* | ||
* @since TODO | ||
*/ | ||
@Extension |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we want this to be @Restricted
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No because plugins may need access to perform additional checks and prevent work on items that are registered
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will be likely non-backportable then
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It could be backported, but restricted in the backport
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should be @Restricted
unless there is a tested & documented use case for accessing it from a plugin (which I doubt—a real API would be designed differently).
List<Executor> result = new ArrayList<>(executors.size() + oneOffExecutors.size()); | ||
result.addAll(executors); | ||
result.addAll(oneOffExecutors); | ||
return result; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
unmodifiableList?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do you want to add another layer of indirection, plus we have given you a copy so you can do what you want with it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because that's how I interpret the "read-only" part of the Javadoc.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yea, read-only
should be removed from the javadoc, snapshot view
should be enough, since it actually isn't read-only :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would prefer to just use unmodifiableList
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with the fix since it definitely does not make the behavior worse than it was before. I am concerned about adding another non-configurable magic number for timeout, but I accept delaying it.
I will be likely against backporting the fix into .2 due to the potential impact of the changes. Also non-Restricted API
On hold for now since we are waiting for the 2.54 release. Will merge afterwards |
long start = System.nanoTime(); | ||
p.delete(); | ||
long end = System.nanoTime(); | ||
assertThat(end - start, Matchers.lessThan(TimeUnit.SECONDS.toNanos(1))); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Might result in flaky builds on slower build agents.
🐝 |
@reviewbybees done |
@jenkinsci/code-reviewers we would appreciate some additional feedback. It may potentially break some use-cases with deletion of projects with pending builds, though these use-cases were not working reliably anyway. If you are fine, my plan is to merge it towards the next weekly on Friday. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did not follow all of the code here but overall looks right.
List<Executor> result = new ArrayList<>(executors.size() + oneOffExecutors.size()); | ||
result.addAll(executors); | ||
result.addAll(oneOffExecutors); | ||
return result; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would prefer to just use unmodifiableList
.
*/ | ||
@CheckForNull | ||
public static hudson.model.Item getItemOf(@Nonnull SubTask t) { | ||
// TODO move to default method on SubTask once code level is Java 8 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can do that now.
* | ||
* @since TODO | ||
*/ | ||
@Extension |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should be @Restricted
unless there is a tested & documented use case for accessing it from a plugin (which I doubt—a real API would be designed differently).
try { | ||
return !_contains(item); | ||
} finally { | ||
lock.readLock().unlock(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pity that Lock
is not AutoCloseable
.
|
Makes sense to Restrict in a follow-up PR. |
} | ||
} | ||
} | ||
synchronized (this) { // could just make performDelete synchronized but overriders might not honor that |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unusable API; see jenkinsci/cloudbees-folder-plugin#112.
See JENKINS-35160
@reviewbybees