Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FIXED JENKINS-35160] - Job deletion: Wait up to 15 seconds for interrupted builds to complete #2789

Merged
merged 6 commits into from Apr 14, 2017

Conversation

stephenc
Copy link
Member

@stephenc stephenc commented Mar 9, 2017

  • Also now aware of concurrent builds

See JENKINS-35160

@reviewbybees

… complete

- Also now aware of concurrent builds
@ghost
Copy link

ghost commented Mar 9, 2017

This pull request originates from a CloudBees employee. At CloudBees, we require that all pull requests be reviewed by other CloudBees employees before we seek to have the change accepted. If you want to learn more about our process please see this explanation.

@@ -262,20 +268,6 @@ public void onCopied(Item src, Item item) {
}
}

@Override
protected void performDelete() throws IOException, InterruptedException {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

need to not hold the lock while interrupting or the interrupted threads may be unable to complete

Copy link
Member

@oleg-nenashev oleg-nenashev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🐛 for the concurrency flaws. Ideally the job should be marked for deletion (e.g. boolean flag), and after setting this flag the job should prevent any attempts to schedule it to the queue. And maybe the flag should be released if the deletion fails (or not?).

Another 🐛 for using isAlive() for the executor

// if a build is in progress. Cancel it.
if (this instanceof Queue.Task) {
// clear any items in the queue so they do not get picket up
Queue.getInstance().cancel((Queue.Task) this);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It won't work for the Promoted Builds and other such... exotic implementations

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Better consider that architecturally deprecated.

for (Executor e : c.getOneOffExecutors()) {
WorkUnit workUnit = e.getCurrentWorkUnit();
if (workUnit != null && (workUnit.work == this || workUnit.work.getOwnerTask() == this)) {
building.put(e.getCurrentExecutable(), e);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lacks logging

WorkUnit workUnit = e.getCurrentWorkUnit();
if (workUnit != null && (workUnit.work == this || workUnit.work.getOwnerTask() == this)) {
building.put(e.getCurrentExecutable(), e);
e.interrupt();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lacks logging

Thread.sleep(50L);
}
if (!building.isEmpty()) {
throw new AbortException(Messages.Job_FailureToStopBuilds(building.size(), getFullDisplayName()));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should it also go to the system log?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do not see why. The user has gotten a failure message (I hope—test it?); the job has not been deleted; there is nothing for the admin to do.

// comparison with executor.getCurrentExecutable() == executable currently should always be true
// as we no longer recycle Executors, but safer to future-proof in case we ever revisit recycling

if (!entry.getValue().isAlive() || entry.getKey() != entry.getValue().getCurrentExecutable()) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It should be isActive(). E.g. if the executor is starting up/shutting down, we should not ignore the task check 🐛

}
}
}
if (!building.isEmpty()) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🐛 this method uses cached list of builds. If the new build gets started due to whatever reason, during the deletion cycle, these builds won't be noticed.

}
Thread.sleep(50L);
}
if (!building.isEmpty()) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Risk of false negatives since this is a value from the past

jglick
jglick previously approved these changes Mar 9, 2017
WorkUnit workUnit = e.getCurrentWorkUnit();
if (workUnit != null && (workUnit.work == this || workUnit.work.getOwnerTask() == this)) {
building.put(e, e.getCurrentExecutable());
e.interrupt(Result.ABORTED);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Better to factor into a helper method. Better yet, introduce Computer.getAllExecutors(), since that is a common problem. (See Executor.of etc.)

for (Executor e : c.getOneOffExecutors()) {
WorkUnit workUnit = e.getCurrentWorkUnit();
if (workUnit != null && (workUnit.work == this || workUnit.work.getOwnerTask() == this)) {
building.put(e, e.getCurrentExecutable());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe easier to just check for

Queue.Executable exec = e.getCurrentExecutable();
if (exec instanceof Run && ((Run) exec).getParent() == this) {

// comparison with executor.getCurrentExecutable() == executable currently should always be true
// as we no longer recycle Executors, but safer to future-proof in case we ever revisit recycling

if (!entry.getKey().isAlive() || entry.getValue() != entry.getKey().getCurrentExecutable()) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And then we could use simply Run.isLogUpdated().

// if a build is in progress. Cancel it.
if (this instanceof Queue.Task) {
// clear any items in the queue so they do not get picket up
Queue.getInstance().cancel((Queue.Task) this);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Better consider that architecturally deprecated.

@stephenc
Copy link
Member Author

@oleg-nenashev if the executor is interrupted the thread will die: 111a8d3#diff-4da8c72b731d029f61f01ed41568d209R248 so the only way to know that the executor has completed is to observe isAlive()

Other issues addressed

@oleg-nenashev
Copy link
Member

@stephenc maybe it also addresses https://issues.jenkins-ci.org/browse/JENKINS-32783

@oleg-nenashev oleg-nenashev self-assigned this Mar 14, 2017
Copy link
Member

@oleg-nenashev oleg-nenashev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have a minor 🐜 regarding the missing Interruption cause, but the most of the code looks good to me. IMHO some testing like ATH and PCT would not hurt in such case due to the fix complexity and potential risk of severe regressions.

Also dismissing the review from @jglick since the code has changed significantly

}
}
for (Executor executor : computer.getOneOffExecutors()) {
for (Executor executor : computer.getAllExecutors()) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It causes some performance degradation (getOneOffExecutors() always happens + collection merge), so maybe we want to keep the original implementation

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you are arguing for more complex code on the grounds of performance in the absence of a proven performance degradation. That smells like premature optimization

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, it is not "more complex" code, it is just the "original code". And I think the impact may be really visible if we talk about modern Jenkins Pipeline-powered instances with dozens/hundreds of OneOff executors.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if you have lots of one-off executors then the majority of cases will be checking the one-off executors, so we save two ArrayList and two Iterator allocations, plus escape analysis can reveal that the one list is only ever used for iteration and the method can be inlined as it is small... the JVM is smarter than you think. (maybe not that smart, but there is hope!)

}
}
for (Executor e : c.getOneOffExecutors()) {
for (Executor e : c.getAllExecutors()) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same as above

boolean ownsRegistration = ItemDeletion.register(this);
if (!ownsRegistration && ItemDeletion.isRegistered(this)) {
// we are not the owning thread and somebody else is concurrently deleting this exact item
throw new Failure(Messages.AbstractItem_BeingDeleted(getPronoun()));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe it worth putting this warning into UI as well (e.g. in the summary page near the "disabled" status)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For all of 15 seconds!!!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

okay

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Still may be a case if the timeout gets configurable

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Then whoever adds that feature can add the UI indicator. Solved

}
}
}
// interrupt any builds in progress (and this should be a recursive test so that folders do not pay
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe they should, but not sure. It makes FolderDelition a separate long background task. Which is probably fine, but out of the scope of this PR imho

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We are recursively interrupting all jobs in one go here, any deletion will be at most 15 seconds

while (item != null) {
if (item == this) {
buildsInProgress.put(e, e.getCurrentExecutable());
e.interrupt(Result.ABORTED);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🐛 No CauseOfInterruption, which would be really useful in this code. The default behavior will be an emty cause in such case: https://github.com/jenkinsci/jenkins/blob/master/core/src/main/java/hudson/model/Executor.java#L190

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. The job is going to be deleted
  2. As the this is the user who will be triggering the interrupt the UserInterruption cause will be injected

iterator.remove();
}
// I don't know why, but we have to keep interrupting
entry.getKey().interrupt(Result.ABORTED);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same, missing cause

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@oleg-nenashev oleg-nenashev dismissed jglick’s stale review March 15, 2017 11:08

The code has significantly changed

@stephenc
Copy link
Member Author

@oleg-nenashev I argue that your bug was incorrect, the interruption is a user triggered action and will have a cause reflecting that

@stephenc
Copy link
Member Author

@oleg-nenashev @jglick ping also @jenkinsci/code-reviewers

@oleg-nenashev oleg-nenashev dismissed their stale review March 23, 2017 08:16

I do not block the PR anymore though I still prefer the explicit cause

@oleg-nenashev
Copy link
Member

@stephenc removed the bug according to explanation

@oleg-nenashev oleg-nenashev added the needs-more-reviews Complex change, which would benefit from more eyes label Mar 29, 2017
@oleg-nenashev
Copy link
Member

@reviewbybees any feedback?

@stephenc
Copy link
Member Author

@oleg-nenashev per policy, 7 days so @reviewbybees done

*
* @since TODO
*/
@Extension
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we want this to be @Restricted?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No because plugins may need access to perform additional checks and prevent work on items that are registered

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will be likely non-backportable then

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It could be backported, but restricted in the backport

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should be @Restricted unless there is a tested & documented use case for accessing it from a plugin (which I doubt—a real API would be designed differently).

List<Executor> result = new ArrayList<>(executors.size() + oneOffExecutors.size());
result.addAll(executors);
result.addAll(oneOffExecutors);
return result;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

unmodifiableList?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do you want to add another layer of indirection, plus we have given you a copy so you can do what you want with it

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because that's how I interpret the "read-only" part of the Javadoc.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea, read-only should be removed from the javadoc, snapshot view should be enough, since it actually isn't read-only :)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would prefer to just use unmodifiableList.

Copy link
Member

@oleg-nenashev oleg-nenashev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with the fix since it definitely does not make the behavior worse than it was before. I am concerned about adding another non-configurable magic number for timeout, but I accept delaying it.

I will be likely against backporting the fix into .2 due to the potential impact of the changes. Also non-Restricted API

@oleg-nenashev oleg-nenashev added the on-hold This pull request depends on another event/release, and it cannot be merged right now label Apr 10, 2017
@oleg-nenashev
Copy link
Member

On hold for now since we are waiting for the 2.54 release. Will merge afterwards

long start = System.nanoTime();
p.delete();
long end = System.nanoTime();
assertThat(end - start, Matchers.lessThan(TimeUnit.SECONDS.toNanos(1)));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might result in flaky builds on slower build agents.

@rsandell
Copy link
Member

🐝

@oleg-nenashev
Copy link
Member

@reviewbybees done
No merge right now pls

@oleg-nenashev oleg-nenashev changed the title [FIXED JENKINS-35160] Wait up to 15 seconds for interrupted builds to complete [FIXED JENKINS-35160] - Job deletion: Wait up to 15 seconds for interrupted builds to complete Apr 11, 2017
@oleg-nenashev oleg-nenashev added ready-for-merge The PR is ready to go, and it will be merged soon if there is no negative feedback and removed on-hold This pull request depends on another event/release, and it cannot be merged right now labels Apr 11, 2017
@oleg-nenashev
Copy link
Member

@jenkinsci/code-reviewers we would appreciate some additional feedback. It may potentially break some use-cases with deletion of projects with pending builds, though these use-cases were not working reliably anyway. If you are fine, my plan is to merge it towards the next weekly on Friday.

Copy link
Member

@jglick jglick left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did not follow all of the code here but overall looks right.

List<Executor> result = new ArrayList<>(executors.size() + oneOffExecutors.size());
result.addAll(executors);
result.addAll(oneOffExecutors);
return result;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would prefer to just use unmodifiableList.

*/
@CheckForNull
public static hudson.model.Item getItemOf(@Nonnull SubTask t) {
// TODO move to default method on SubTask once code level is Java 8
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can do that now.

*
* @since TODO
*/
@Extension
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should be @Restricted unless there is a tested & documented use case for accessing it from a plugin (which I doubt—a real API would be designed differently).

try {
return !_contains(item);
} finally {
lock.readLock().unlock();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pity that Lock is not AutoCloseable.

@daniel-beck
Copy link
Member

Should be @Restricted unless there is a tested & documented use case for accessing it from a plugin (which I doubt—a real API would be designed differently).

☹️

@oleg-nenashev
Copy link
Member

@jglick @daniel-beck

Should be @restricted unless there is a tested & documented use case for accessing it from a plugin (which I doubt—a real API would be designed differently).

Makes sense to Restrict in a follow-up PR.

@stephenc stephenc deleted the jenkins-35160 branch June 7, 2017 08:53
}
}
}
synchronized (this) { // could just make performDelete synchronized but overriders might not honor that
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
needs-more-reviews Complex change, which would benefit from more eyes ready-for-merge The PR is ready to go, and it will be merged soon if there is no negative feedback
Projects
None yet
5 participants