[FIXED JENKINS-35160] - Job deletion: Wait up to 15 seconds for interrupted builds to complete #2789

Merged
merged 6 commits into jenkinsci:master from stephenc:jenkins-35160 Apr 14, 2017

@stephenc
Member

commented Mar 9, 2017

  • Also now aware of concurrent builds

See JENKINS-35160

@reviewbybees

[FIXED JENKINS-35160] Wait up to 15 seconds for interrupted builds to complete

- Also now aware of concurrent builds
@reviewbybees


commented Mar 9, 2017

This pull request originates from a CloudBees employee. At CloudBees, we require that all pull requests be reviewed by other CloudBees employees before we seek to have the change accepted. If you want to learn more about our process please see this explanation.

@@ -262,20 +268,6 @@ public void onCopied(Item src, Item item) {
}
}

@Override
protected void performDelete() throws IOException, InterruptedException {

@stephenc

stephenc Mar 9, 2017

Author Member

need to not hold the lock while interrupting or the interrupted threads may be unable to complete
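
The concern above can be reproduced outside Jenkins with plain JDK threads. This is a minimal sketch, not the PR's code: the class `InterruptOutsideLock` and the method `stopWorker` are hypothetical names, and the worker stands in for a build whose cleanup needs the same monitor that the deleting thread uses.

```java
class InterruptOutsideLock {
    private final Object lock = new Object();

    /**
     * Starts a worker that needs {@code lock} for its cleanup, interrupts it,
     * and reports whether it finished within the timeout.
     */
    boolean stopWorker(boolean holdLockWhileWaiting, long timeoutMillis) {
        Thread worker = new Thread(() -> {
            try {
                Thread.sleep(60_000L); // stands in for a long-running build
            } catch (InterruptedException e) {
                synchronized (lock) {
                    // cleanup that needs the same lock as the deleting thread
                }
            }
        });
        worker.start();
        try {
            if (holdLockWhileWaiting) {
                synchronized (lock) {
                    worker.interrupt();
                    worker.join(timeoutMillis); // worker is blocked on the lock we hold
                    return !worker.isAlive();
                }
            }
            synchronized (lock) {
                // bookkeeping only; the lock is released before interrupting
            }
            worker.interrupt();
            worker.join(timeoutMillis);
            return !worker.isAlive();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }
}
```

With `holdLockWhileWaiting` set to `true` the worker cannot complete until the caller leaves the synchronized block, so the wait always times out; with `false` the worker finishes almost immediately.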

@oleg-nenashev
Member

left a comment

🐛 for the concurrency flaws. Ideally the job should be marked for deletion (e.g. boolean flag), and after setting this flag the job should prevent any attempts to schedule it to the queue. And maybe the flag should be released if the deletion fails (or not?).

Another 🐛 for using isAlive() for the executor
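
The "marked for deletion" idea from the first paragraph of this review can be sketched with JDK classes only. The class `DeletionRegistry` and its methods are hypothetical illustrations, not the API that the PR eventually added; the sketch only assumes that a scheduler would consult `maySchedule` before queuing an item.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.locks.ReadWriteLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

class DeletionRegistry<T> {
    private final ReadWriteLock lock = new ReentrantReadWriteLock();
    private final Set<T> pending = new HashSet<>();

    /** Marks the item; returns true only for the caller that added the mark. */
    boolean register(T item) {
        lock.writeLock().lock();
        try {
            return pending.add(item);
        } finally {
            lock.writeLock().unlock();
        }
    }

    /** Clears the mark, e.g. when the deletion failed or has completed. */
    void deregister(T item) {
        lock.writeLock().lock();
        try {
            pending.remove(item);
        } finally {
            lock.writeLock().unlock();
        }
    }

    boolean isRegistered(T item) {
        lock.readLock().lock();
        try {
            return pending.contains(item);
        } finally {
            lock.readLock().unlock();
        }
    }

    /** A scheduler would refuse to queue anything that is being deleted. */
    boolean maySchedule(T item) {
        return !isRegistered(item);
    }
}
```

Releasing the mark in a `finally` block around the deletion is one way to answer the "should the flag be released if the deletion fails" question, at the cost of allowing new builds to be scheduled again after a failed attempt.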

// if a build is in progress. Cancel it.
if (this instanceof Queue.Task) {
// clear any items in the queue so they do not get picked up
Queue.getInstance().cancel((Queue.Task) this);

@oleg-nenashev

oleg-nenashev Mar 9, 2017

Member

It won't work for the Promoted Builds and other such... exotic implementations

@jglick

jglick Mar 9, 2017

Member

Better consider that architecturally deprecated.

for (Executor e : c.getOneOffExecutors()) {
WorkUnit workUnit = e.getCurrentWorkUnit();
if (workUnit != null && (workUnit.work == this || workUnit.work.getOwnerTask() == this)) {
building.put(e.getCurrentExecutable(), e);

@oleg-nenashev

oleg-nenashev Mar 9, 2017

Member

Lacks logging

WorkUnit workUnit = e.getCurrentWorkUnit();
if (workUnit != null && (workUnit.work == this || workUnit.work.getOwnerTask() == this)) {
building.put(e.getCurrentExecutable(), e);
e.interrupt();

@oleg-nenashev

oleg-nenashev Mar 9, 2017

Member

Lacks logging

Thread.sleep(50L);
}
if (!building.isEmpty()) {
throw new AbortException(Messages.Job_FailureToStopBuilds(building.size(), getFullDisplayName()));

@oleg-nenashev

oleg-nenashev Mar 9, 2017

Member

Should it also go to the system log?

@jglick

jglick Mar 9, 2017

Member

I do not see why. The user has gotten a failure message (I hope—test it?); the job has not been deleted; there is nothing for the admin to do.

// comparison with executor.getCurrentExecutable() == executable currently should always be true
// as we no longer recycle Executors, but safer to future-proof in case we ever revisit recycling

if (!entry.getValue().isAlive() || entry.getKey() != entry.getValue().getCurrentExecutable()) {

@oleg-nenashev

oleg-nenashev Mar 9, 2017

Member

It should be isActive(). E.g. if the executor is starting up/shutting down, we should not ignore the task check 🐛

}
}
}
if (!building.isEmpty()) {

@oleg-nenashev

oleg-nenashev Mar 9, 2017

Member

🐛 This method uses a cached list of builds. If a new build starts for whatever reason during the deletion cycle, it won't be noticed.

}
Thread.sleep(50L);
}
if (!building.isEmpty()) {

@oleg-nenashev

oleg-nenashev Mar 9, 2017

Member

Risk of false negatives since this is a value from the past

WorkUnit workUnit = e.getCurrentWorkUnit();
if (workUnit != null && (workUnit.work == this || workUnit.work.getOwnerTask() == this)) {
building.put(e, e.getCurrentExecutable());
e.interrupt(Result.ABORTED);

@jglick

jglick Mar 9, 2017

Member

Better to factor into a helper method. Better yet, introduce Computer.getAllExecutors(), since that is a common problem. (See Executor.of etc.)

for (Executor e : c.getOneOffExecutors()) {
WorkUnit workUnit = e.getCurrentWorkUnit();
if (workUnit != null && (workUnit.work == this || workUnit.work.getOwnerTask() == this)) {
building.put(e, e.getCurrentExecutable());

@jglick

jglick Mar 9, 2017

Member

Maybe easier to just check for

Queue.Executable exec = e.getCurrentExecutable();
if (exec instanceof Run && ((Run) exec).getParent() == this) {

// comparison with executor.getCurrentExecutable() == executable currently should always be true
// as we no longer recycle Executors, but safer to future-proof in case we ever revisit recycling

if (!entry.getKey().isAlive() || entry.getValue() != entry.getKey().getCurrentExecutable()) {

@jglick

jglick Mar 9, 2017

Member

And then we could use simply Run.isLogUpdated().

@stephenc

Member Author

commented Mar 10, 2017

@oleg-nenashev if the executor is interrupted, the thread will die (111a8d3#diff-4da8c72b731d029f61f01ed41568d209R248), so the only way to know that the executor has completed is to observe isAlive().

Other issues addressed
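
The wait-and-poll approach described here (interrupt, then watch `isAlive()` until a deadline) can be sketched with plain threads. The 15 second limit and the 50 ms sleep come from the PR title and the diff fragments in this conversation; the class `AwaitTermination`, its methods, and the use of raw `Thread` objects instead of Jenkins executors are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.TimeUnit;

class AwaitTermination {
    /** Interrupts the threads repeatedly until they die or the deadline passes; returns the survivors. */
    static List<Thread> interruptAndAwait(List<Thread> threads, long timeout, TimeUnit unit) {
        List<Thread> remaining = new ArrayList<>(threads);
        long deadline = System.nanoTime() + unit.toNanos(timeout);
        while (true) {
            remaining.removeIf(t -> !t.isAlive()); // a dead thread is the completion signal
            if (remaining.isEmpty() || System.nanoTime() - deadline >= 0) {
                return remaining;
            }
            for (Thread t : remaining) {
                t.interrupt(); // repeated on every cycle
            }
            try {
                Thread.sleep(50L);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return remaining;
            }
        }
    }

    /** Demo: one sleeping worker that exits when interrupted; returns the number of survivors. */
    static int demo() {
        Thread worker = new Thread(() -> {
            try {
                Thread.sleep(60_000L);
            } catch (InterruptedException e) {
                // exit quietly
            }
        });
        worker.start();
        List<Thread> threads = new ArrayList<>();
        threads.add(worker);
        return interruptAndAwait(threads, 15, TimeUnit.SECONDS).size();
    }
}
```

A non-empty result corresponds to the case where the PR throws an `AbortException` because some builds could not be stopped in time.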

@oleg-nenashev

Member

commented Mar 14, 2017

@oleg-nenashev oleg-nenashev self-assigned this Mar 14, 2017

@oleg-nenashev
Member

left a comment

I have a minor 🐜 regarding the missing interruption cause, but most of the code looks good to me. IMHO some testing such as ATH and PCT would not hurt in this case, given the complexity of the fix and the potential risk of severe regressions.

Also dismissing the review from @jglick since the code has changed significantly

}
}
- for (Executor executor : computer.getOneOffExecutors()) {
+ for (Executor executor : computer.getAllExecutors()) {

@oleg-nenashev

oleg-nenashev Mar 15, 2017

Member

It causes some performance degradation (getOneOffExecutors() always happens + collection merge), so maybe we want to keep the original implementation

@stephenc

stephenc Mar 15, 2017

Author Member

I think you are arguing for more complex code on the grounds of performance in the absence of a proven performance degradation. That smells like premature optimization

@oleg-nenashev

oleg-nenashev Mar 15, 2017

Member

Well, it is not "more complex" code, it is just the "original code". And I think the impact may be really visible if we talk about modern Jenkins Pipeline-powered instances with dozens/hundreds of OneOff executors.

@stephenc

stephenc Mar 15, 2017

Author Member

if you have lots of one-off executors then the majority of cases will be checking the one-off executors, so we save two ArrayList and two Iterator allocations, plus escape analysis can reveal that the one list is only ever used for iteration and the method can be inlined as it is small... the JVM is smarter than you think. (maybe not that smart, but there is hope!)

}
}
- for (Executor e : c.getOneOffExecutors()) {
+ for (Executor e : c.getAllExecutors()) {

@oleg-nenashev

oleg-nenashev Mar 15, 2017

Member

same as above

boolean ownsRegistration = ItemDeletion.register(this);
if (!ownsRegistration && ItemDeletion.isRegistered(this)) {
// we are not the owning thread and somebody else is concurrently deleting this exact item
throw new Failure(Messages.AbstractItem_BeingDeleted(getPronoun()));

@oleg-nenashev

oleg-nenashev Mar 15, 2017

Member

Maybe it is worth putting this warning into the UI as well (e.g. on the summary page near the "disabled" status)

@stephenc

stephenc Mar 15, 2017

Author Member

For all of 15 seconds!!!

@oleg-nenashev

oleg-nenashev Mar 15, 2017

Member

It may still be relevant if the timeout becomes configurable

@stephenc

stephenc Mar 15, 2017

Author Member

Then whoever adds that feature can add the UI indicator. Solved

}
}
}
// interrupt any builds in progress (and this should be a recursive test so that folders do not pay

@oleg-nenashev

oleg-nenashev Mar 15, 2017

Member

Maybe they should, but I am not sure. It makes folder deletion a separate long background task, which is probably fine but out of the scope of this PR IMHO

@stephenc

stephenc Mar 15, 2017

Author Member

We are recursively interrupting all jobs in one go here, any deletion will be at most 15 seconds
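
The recursive test mentioned in this thread, where deleting a folder also interrupts builds of everything nested beneath it, comes down to walking up the parent chain of each running build's item. The following is a minimal sketch under that reading of the visible diff fragment; `TreeItem` and `isSameOrAncestorOf` are hypothetical stand-ins for the Jenkins item hierarchy.

```java
class TreeItem {
    final String name;
    final TreeItem parent; // null for a top-level item

    TreeItem(String name, TreeItem parent) {
        this.name = name;
        this.parent = parent;
    }

    /** True if {@code other} is this item or is nested anywhere beneath it. */
    boolean isSameOrAncestorOf(TreeItem other) {
        TreeItem item = other;
        while (item != null) {
            if (item == this) {
                return true; // a build of this item would be interrupted
            }
            item = item.parent; // climb one level towards the root
        }
        return false;
    }
}
```

Because every affected build is found in a single pass, all of them can be interrupted together and then awaited under one shared deadline.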

while (item != null) {
if (item == this) {
buildsInProgress.put(e, e.getCurrentExecutable());
e.interrupt(Result.ABORTED);

@oleg-nenashev

oleg-nenashev Mar 15, 2017

Member

🐛 No CauseOfInterruption, which would be really useful in this code. The default behavior will be an empty cause in such a case: https://github.com/jenkinsci/jenkins/blob/master/core/src/main/java/hudson/model/Executor.java#L190

@stephenc

stephenc Mar 15, 2017

Author Member
  1. The job is going to be deleted
  2. As this is the user who will be triggering the interrupt, the UserInterruption cause will be injected

iterator.remove();
}
// I don't know why, but we have to keep interrupting
entry.getKey().interrupt(Result.ABORTED);

@oleg-nenashev

oleg-nenashev Mar 15, 2017

Member

Same, missing cause

The code has significantly changed

@stephenc

Member Author

commented Mar 15, 2017

@oleg-nenashev I argue that your bug was incorrect: the interruption is a user-triggered action and will have a cause reflecting that

@stephenc

Member Author

commented Mar 22, 2017

I do not block the PR anymore though I still prefer the explicit cause

@oleg-nenashev

Member

commented Mar 23, 2017

@stephenc removed the bug according to explanation

@oleg-nenashev

Member

commented Mar 29, 2017

@reviewbybees any feedback?

@stephenc

Member Author

commented Mar 29, 2017

@oleg-nenashev per policy, 7 days so @reviewbybees done

*
* @since TODO
*/
@Extension

@daniel-beck

daniel-beck Mar 30, 2017

Member

Do we want this to be @Restricted?

@stephenc

stephenc Mar 30, 2017

Author Member

No because plugins may need access to perform additional checks and prevent work on items that are registered

@oleg-nenashev

oleg-nenashev Apr 1, 2017

Member

Will likely be non-backportable then

@rsandell

rsandell Apr 10, 2017

Member

It could be backported, but restricted in the backport

@jglick

jglick Apr 13, 2017

Member

Should be @Restricted unless there is a tested & documented use case for accessing it from a plugin (which I doubt—a real API would be designed differently).

List<Executor> result = new ArrayList<>(executors.size() + oneOffExecutors.size());
result.addAll(executors);
result.addAll(oneOffExecutors);
return result;

@daniel-beck

daniel-beck Mar 30, 2017

Member

unmodifiableList?

@stephenc

stephenc Mar 30, 2017

Author Member

Why do you want to add another layer of indirection? Besides, we have given you a copy, so you can do what you want with it

@daniel-beck

daniel-beck Mar 30, 2017

Member

Because that's how I interpret the "read-only" part of the Javadoc.

@rsandell

rsandell Apr 10, 2017

Member

Yea, read-only should be removed from the javadoc, snapshot view should be enough, since it actually isn't read-only :)

@jglick

jglick Apr 13, 2017

Member

Would prefer to just use unmodifiableList.
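
The two positions in this thread differ only in what the caller may do with the returned list. A minimal sketch of both options, using JDK collections and a hypothetical helper class `SnapshotLists` rather than the actual `Computer` code:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

class SnapshotLists {
    /** Snapshot copy: the caller may modify it freely without affecting the sources. */
    static <T> List<T> mutableSnapshot(List<? extends T> first, List<? extends T> second) {
        List<T> result = new ArrayList<>(first.size() + second.size());
        result.addAll(first);
        result.addAll(second);
        return result;
    }

    /** Read-only snapshot: the same copy wrapped so that mutation attempts fail. */
    static <T> List<T> readOnlySnapshot(List<? extends T> first, List<? extends T> second) {
        List<T> copy = mutableSnapshot(first, second);
        return Collections.unmodifiableList(copy);
    }
}
```

In both variants the sources are untouched; the wrapper only adds an `UnsupportedOperationException` on writes, which is what a Javadoc promise of "read-only" would lead a caller to expect.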

@oleg-nenashev
Member

left a comment

I agree with the fix since it definitely does not make the behavior worse than it was before. I am concerned about adding another non-configurable magic number for timeout, but I accept delaying it.

I will likely be against backporting the fix into .2 due to the potential impact of the changes and the non-Restricted API

@oleg-nenashev

Member

commented Apr 10, 2017

On hold for now since we are waiting for the 2.54 release. Will merge afterwards

long start = System.nanoTime();
p.delete();
long end = System.nanoTime();
assertThat(end - start, Matchers.lessThan(TimeUnit.SECONDS.toNanos(1)));

@rsandell

rsandell Apr 10, 2017

Member

Might result in flaky builds on slower build agents.

@rsandell

Member

commented Apr 10, 2017

🐝

@oleg-nenashev

Member

commented Apr 10, 2017

@reviewbybees done
No merge right now pls

@oleg-nenashev oleg-nenashev changed the title [FIXED JENKINS-35160] Wait up to 15 seconds for interrupted builds to complete [FIXED JENKINS-35160] - Job deletion: Wait up to 15 seconds for interrupted builds to complete Apr 11, 2017

@oleg-nenashev

Member

commented Apr 11, 2017

@jenkinsci/code-reviewers we would appreciate some additional feedback. It may potentially break some use-cases with deletion of projects with pending builds, though these use-cases were not working reliably anyway. If you are fine, my plan is to merge it towards the next weekly on Friday.

@jglick
jglick approved these changes Apr 13, 2017
Member

left a comment

Did not follow all of the code here but overall looks right.

*/
@CheckForNull
public static hudson.model.Item getItemOf(@Nonnull SubTask t) {
// TODO move to default method on SubTask once code level is Java 8

@jglick

jglick Apr 13, 2017

Member

Can do that now.
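
For readers unfamiliar with the TODO being discussed: since Java 8 an interface can carry a `default` method, so a static helper that takes the interface as its argument can move onto the interface itself. The sketch below shows only the language feature; `SubTaskLike`, `NamedItem`, and the lookup logic are hypothetical and are not the real `getItemOf` implementation, which is not shown in this conversation.

```java
interface NamedItem {
    String getFullName();
}

interface SubTaskLike {
    /** The only method an implementation has to provide. */
    Object getOwnerTask();

    /** Default method: inherited by every implementation, overridable where needed. */
    default NamedItem getOwnerItem() {
        Object owner = getOwnerTask();
        return owner instanceof NamedItem ? (NamedItem) owner : null;
    }
}
```

Callers then write `task.getOwnerItem()` instead of passing the task to a static utility, and existing implementations keep compiling because they inherit the default body.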

try {
return !_contains(item);
} finally {
lock.readLock().unlock();

@jglick

jglick Apr 13, 2017

Member

Pity that Lock is not AutoCloseable.
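
A small adapter is one way to get try-with-resources semantics for a `java.util.concurrent.locks.Lock`. This is a sketch of that idea only; `HeldLock` is a hypothetical name and nothing like it is part of the PR.

```java
import java.util.concurrent.locks.Lock;

final class HeldLock implements AutoCloseable {
    private final Lock lock;

    private HeldLock(Lock lock) {
        this.lock = lock;
    }

    /** Acquires the lock and returns a handle whose close() releases it. */
    static HeldLock acquire(Lock lock) {
        lock.lock();
        return new HeldLock(lock);
    }

    @Override
    public void close() {
        lock.unlock();
    }
}
```

With such an adapter the `try`/`finally` fragment above could be written as `try (HeldLock ignored = HeldLock.acquire(lock.readLock())) { return !_contains(item); }`, at the cost of one small allocation per acquisition.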

@oleg-nenashev oleg-nenashev merged commit 52a1a10 into jenkinsci:master Apr 14, 2017

1 of 2 checks passed

continuous-integration/jenkins/pr-head This commit has test failures
Details
Jenkins This pull request looks good
Details
@daniel-beck

Member

commented Apr 22, 2017

Should be @Restricted unless there is a tested & documented use case for accessing it from a plugin (which I doubt—a real API would be designed differently).

☹️

@oleg-nenashev

Member

commented Apr 22, 2017

@jglick @daniel-beck

Should be @Restricted unless there is a tested & documented use case for accessing it from a plugin (which I doubt—a real API would be designed differently).

Makes sense to Restrict in a follow-up PR.

@stephenc stephenc deleted the stephenc:jenkins-35160 branch Jun 7, 2017

}
}
}
synchronized (this) { // could just make performDelete synchronized but overriders might not honor that

@jglick
jglick added a commit to jglick/cloudbees-folder-plugin that referenced this pull request Oct 10, 2017