[FIXED JENKINS-35160] - Job deletion: Wait up to 15 seconds for interrupted builds to complete #2789

stephenc · 2017-03-09T17:21:26Z

Also now aware of concurrent builds

@reviewbybees

… complete - Also now aware of concurrent builds

ghost · 2017-03-09T17:24:09Z

This pull request originates from a CloudBees employee. At CloudBees, we require that all pull requests be reviewed by other CloudBees employees before we seek to have the change accepted. If you want to learn more about our process please see this explanation.

stephenc · 2017-03-09T17:45:43Z

core/src/main/java/hudson/model/Job.java

@@ -262,20 +268,6 @@ public void onCopied(Item src, Item item) {
        }
    }

-    @Override
-    protected void performDelete() throws IOException, InterruptedException {


need to not hold the lock while interrupting or the interrupted threads may be unable to complete
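
A minimal sketch of this point, using plain Java threads rather than Jenkins classes. `LockScopeSketch`, `startBuild`, and `stopBuilds` are hypothetical names, not part of this PR. It assumes the interrupted thread needs the same lock for its cleanup, so the deleting side only snapshots under the lock and does the interrupting and waiting after releasing it.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for a job whose build threads need the job's lock to finish. */
class LockScopeSketch {
    private final Object lock = new Object();
    private final List<Thread> builds = new ArrayList<>();

    /** Starts a simulated build that must take the lock during its cleanup. */
    Thread startBuild() {
        Thread t = new Thread(() -> {
            try {
                Thread.sleep(60_000L); // simulated long-running build
            } catch (InterruptedException e) {
                // interrupted: fall through to cleanup
            }
            synchronized (lock) { // would block if the deleting thread still held the lock
                builds.remove(Thread.currentThread());
            }
        });
        synchronized (lock) {
            builds.add(t);
        }
        t.start();
        return t;
    }

    /** Snapshots under the lock, then interrupts and joins with the lock released. */
    boolean stopBuilds(long timeoutMillis) {
        List<Thread> snapshot;
        synchronized (lock) {
            snapshot = new ArrayList<>(builds);
        }
        for (Thread t : snapshot) {
            t.interrupt();
        }
        long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
        try {
            for (Thread t : snapshot) {
                long remainingMillis = (deadline - System.nanoTime()) / 1_000_000L;
                if (remainingMillis > 0) {
                    t.join(remainingMillis);
                }
                if (t.isAlive()) {
                    return false; // timed out waiting for this build
                }
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the caller's interrupt status
            return false;
        }
        return true;
    }
}
```

If `stopBuilds` wrapped the interrupt and join in `synchronized (lock)`, the cleanup in each build thread could not acquire the lock and the join would only return on timeout.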

oleg-nenashev

🐛 for the concurrency flaws. Ideally the job should be marked for deletion (e.g. with a boolean flag), and once the flag is set the job should reject any attempt to schedule it in the queue. Maybe the flag should also be released if the deletion fails (or not?).

Another 🐛 for using isAlive() for the executor

oleg-nenashev · 2017-03-09T17:32:33Z

core/src/main/java/hudson/model/Job.java

+        // if a build is in progress. Cancel it.
+        if (this instanceof Queue.Task) {
+            // clear any items in the queue so they do not get picked up
+            Queue.getInstance().cancel((Queue.Task) this);


It won't work for the Promoted Builds and other such... exotic implementations

Better consider that architecturally deprecated.

oleg-nenashev · 2017-03-09T17:33:21Z

core/src/main/java/hudson/model/Job.java

+                for (Executor e : c.getOneOffExecutors()) {
+                    WorkUnit workUnit = e.getCurrentWorkUnit();
+                    if (workUnit != null && (workUnit.work == this || workUnit.work.getOwnerTask() == this)) {
+                        building.put(e.getCurrentExecutable(), e);


Lacks logging

oleg-nenashev · 2017-03-09T17:33:35Z

core/src/main/java/hudson/model/Job.java

+                    WorkUnit workUnit = e.getCurrentWorkUnit();
+                    if (workUnit != null && (workUnit.work == this || workUnit.work.getOwnerTask() == this)) {
+                        building.put(e.getCurrentExecutable(), e);
+                        e.interrupt();


Lacks logging

oleg-nenashev · 2017-03-09T17:36:19Z

core/src/main/java/hudson/model/Job.java

+                    Thread.sleep(50L);
+                }
+                if (!building.isEmpty()) {
+                    throw new AbortException(Messages.Job_FailureToStopBuilds(building.size(), getFullDisplayName()));


Should it also go to the system log?

I do not see why. The user has gotten a failure message (I hope—test it?); the job has not been deleted; there is nothing for the admin to do.

oleg-nenashev · 2017-03-09T17:39:17Z

core/src/main/java/hudson/model/Job.java

+                        // comparison with executor.getCurrentExecutable() == executable currently should always be true
+                        // as we no longer recycle Executors, but safer to future-proof in case we ever revisit recycling
+
+                        if (!entry.getValue().isAlive() || entry.getKey() != entry.getValue().getCurrentExecutable()) {


It should be isActive(). E.g. if the executor is starting up/shutting down, we should not ignore the task check 🐛

oleg-nenashev · 2017-03-09T17:42:49Z

core/src/main/java/hudson/model/Job.java

+                    }
+                }
+            }
+            if (!building.isEmpty()) {


🐛 This method uses a cached list of builds. If a new build starts for whatever reason during the deletion cycle, it won't be noticed.

oleg-nenashev · 2017-03-09T17:43:48Z

core/src/main/java/hudson/model/Job.java

+                    }
+                    Thread.sleep(50L);
+                }
+                if (!building.isEmpty()) {


Risk of false negatives since this is a value from the past

jglick · 2017-03-09T17:55:20Z

core/src/main/java/hudson/model/Job.java

+                    WorkUnit workUnit = e.getCurrentWorkUnit();
+                    if (workUnit != null && (workUnit.work == this || workUnit.work.getOwnerTask() == this)) {
+                        building.put(e, e.getCurrentExecutable());
+                        e.interrupt(Result.ABORTED);


Better to factor into a helper method. Better yet, introduce Computer.getAllExecutors(), since that is a common problem. (See Executor.of etc.)

jglick · 2017-03-09T17:57:37Z

core/src/main/java/hudson/model/Job.java

+                for (Executor e : c.getOneOffExecutors()) {
+                    WorkUnit workUnit = e.getCurrentWorkUnit();
+                    if (workUnit != null && (workUnit.work == this || workUnit.work.getOwnerTask() == this)) {
+                        building.put(e, e.getCurrentExecutable());


Maybe easier to just check for

Queue.Executable exec = e.getCurrentExecutable(); if (exec instanceof Run && ((Run) exec).getParent() == this) {

jglick · 2017-03-09T17:59:10Z

core/src/main/java/hudson/model/Job.java

+                        // comparison with executor.getCurrentExecutable() == executable currently should always be true
+                        // as we no longer recycle Executors, but safer to future-proof in case we ever revisit recycling
+
+                        if (!entry.getKey().isAlive() || entry.getValue() != entry.getKey().getCurrentExecutable()) {


And then we could use simply Run.isLogUpdated().

jglick · 2017-03-09T18:01:34Z

core/src/main/java/hudson/model/Job.java

+        // if a build is in progress. Cancel it.
+        if (this instanceof Queue.Task) {
+            // clear any items in the queue so they do not get picked up
+            Queue.getInstance().cancel((Queue.Task) this);


Better consider that architecturally deprecated.


stephenc · 2017-03-10T15:15:15Z

@oleg-nenashev if the executor is interrupted the thread will die: 111a8d3#diff-4da8c72b731d029f61f01ed41568d209R248 so the only way to know that the executor has completed is to observe isAlive()

Other issues addressed
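
The `isAlive()` argument can be illustrated with plain threads. This is only a sketch under the assumption stated above, that the interrupted thread exits, so termination is the one observable signal. `InterruptAndWaitSketch` and `interruptAndWait` are hypothetical names, and the poll interval and timeout are parameters here rather than the values used in the PR.

```java
import java.util.LinkedHashSet;
import java.util.Set;

class InterruptAndWaitSketch {
    /**
     * Interrupts the given threads, then polls isAlive() until all of them have
     * terminated or the timeout elapses. Returns how many are still running.
     */
    static int interruptAndWait(Set<Thread> threads, long timeoutMillis, long pollMillis) {
        Set<Thread> pending = new LinkedHashSet<>(threads);
        for (Thread t : pending) {
            t.interrupt();
        }
        long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
        while (!pending.isEmpty() && System.nanoTime() - deadline < 0) {
            // a thread that has run to completion reports isAlive() == false
            pending.removeIf(t -> !t.isAlive());
            if (pending.isEmpty()) {
                break;
            }
            try {
                Thread.sleep(pollMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the caller's interrupt status
                break;
            }
        }
        pending.removeIf(t -> !t.isAlive()); // final check after the loop ends
        return pending.size();
    }
}
```

A non-zero return value corresponds to the case where the deletion has to be refused because some builds did not stop in time.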

oleg-nenashev · 2017-03-14T00:56:19Z

@stephenc maybe it also addresses https://issues.jenkins-ci.org/browse/JENKINS-32783

oleg-nenashev

I have a minor 🐜 regarding the missing interruption cause, but most of the code looks good to me. IMHO some testing such as ATH and PCT would not hurt in this case, given the complexity of the fix and the potential risk of severe regressions.

Also dismissing the review from @jglick since the code has changed significantly

oleg-nenashev · 2017-03-15T10:51:37Z

core/src/main/java/hudson/model/Executor.java

-                }
-            }
-            for (Executor executor : computer.getOneOffExecutors()) {
+            for (Executor executor : computer.getAllExecutors()) {


It causes some performance degradation (getOneOffExecutors() always happens + collection merge), so maybe we want to keep the original implementation

I think you are arguing for more complex code on the grounds of performance in the absence of a proven performance degradation. That smells like premature optimization.

Well, it is not "more complex" code, it is just the "original code". And I think the impact may be really visible if we talk about modern Jenkins Pipeline-powered instances with dozens/hundreds of OneOff executors.

if you have lots of one-off executors then the majority of cases will be checking the one-off executors, so we save two ArrayList and two Iterator allocations, plus escape analysis can reveal that the one list is only ever used for iteration and the method can be inlined as it is small... the JVM is smarter than you think. (maybe not that smart, but there is hope!)

oleg-nenashev · 2017-03-15T10:52:09Z

core/src/main/java/hudson/model/RestartListener.java

-                        }
-                    }
-                    for (Executor e : c.getOneOffExecutors()) {
+                    for (Executor e : c.getAllExecutors()) {


same as above

oleg-nenashev · 2017-03-15T11:00:34Z

core/src/main/java/hudson/model/AbstractItem.java

+        boolean ownsRegistration = ItemDeletion.register(this);
+        if (!ownsRegistration && ItemDeletion.isRegistered(this)) {
+            // we are not the owning thread and somebody else is concurrently deleting this exact item
+            throw new Failure(Messages.AbstractItem_BeingDeleted(getPronoun()));


Maybe it is worth putting this warning into the UI as well (e.g. on the summary page near the "disabled" status)

For all of 15 seconds!!!

It may still be a concern if the timeout becomes configurable

Then whoever adds that feature can add the UI indicator. Solved
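
For readers following the `register` / `isRegistered` check in the hunk above, here is a rough sketch of the general pattern. It is not the actual `ItemDeletion` implementation. `DeletionRegistrySketch` is a hypothetical class, and it assumes `register` returns `true` only for the caller that added the entry, which then becomes responsible for deregistering it.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.locks.ReadWriteLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

/** Hypothetical registry of items that are currently being deleted. */
class DeletionRegistrySketch<T> {
    private final ReadWriteLock lock = new ReentrantReadWriteLock();
    private final Set<T> registrations = new HashSet<>();

    /** Returns true only if this call added the item; that caller must deregister it later. */
    boolean register(T item) {
        lock.writeLock().lock();
        try {
            return registrations.add(item);
        } finally {
            lock.writeLock().unlock();
        }
    }

    /** Returns true while some caller holds a registration for the item. */
    boolean isRegistered(T item) {
        lock.readLock().lock();
        try {
            return registrations.contains(item);
        } finally {
            lock.readLock().unlock();
        }
    }

    /** Removes the registration, typically from a finally block around the deletion. */
    void deregister(T item) {
        lock.writeLock().lock();
        try {
            registrations.remove(item);
        } finally {
            lock.writeLock().unlock();
        }
    }
}
```

A caller that gets `false` from `register` while `isRegistered` is still `true` knows another deletion of the same item is in flight and can fail fast, which is the situation the `Failure` in the hunk reports.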

oleg-nenashev · 2017-03-15T11:01:32Z

core/src/main/java/hudson/model/AbstractItem.java

+                        }
+                    }
+                }
+                // interrupt any builds in progress (and this should be a recursive test so that folders do not pay


Maybe they should, but I am not sure. It would make folder deletion a separate long-running background task, which is probably fine but out of scope for this PR, IMHO.

We are recursively interrupting all jobs in one go here, any deletion will be at most 15 seconds
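
The "recursive test" here amounts to asking whether a running build's item is the item being deleted or sits anywhere beneath it. A minimal sketch of that ancestor walk, with a hypothetical `ItemNode` type standing in for the real item hierarchy:

```java
/** Hypothetical tree node standing in for an item with an optional parent. */
class ItemNode {
    final String name;
    final ItemNode parent; // null for the root

    ItemNode(String name, ItemNode parent) {
        this.name = name;
        this.parent = parent;
    }

    /** Returns true if candidate is this node or is nested anywhere beneath it. */
    boolean contains(ItemNode candidate) {
        for (ItemNode n = candidate; n != null; n = n.parent) {
            if (n == this) {
                return true;
            }
        }
        return false;
    }
}
```

Deleting a folder can then interrupt every build whose item satisfies `folder.contains(item)` in one pass, instead of each nested job waiting out its own timeout.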

oleg-nenashev · 2017-03-15T11:04:12Z

core/src/main/java/hudson/model/AbstractItem.java

+                                while (item != null) {
+                                    if (item == this) {
+                                        buildsInProgress.put(e, e.getCurrentExecutable());
+                                        e.interrupt(Result.ABORTED);


🐛 No CauseOfInterruption, which would be really useful in this code. The default behavior will be an empty cause in that case: https://github.com/jenkinsci/jenkins/blob/master/core/src/main/java/hudson/model/Executor.java#L190

The job is going to be deleted

As this is the user who will be triggering the interrupt, the UserInterruption cause will be injected

oleg-nenashev · 2017-03-15T11:04:32Z

core/src/main/java/hudson/model/AbstractItem.java

+                                iterator.remove();
+                            }
+                            // I don't know why, but we have to keep interrupting
+                            entry.getKey().interrupt(Result.ABORTED);


Same, missing cause

Nope: https://github.com/jenkinsci/jenkins/blob/master/core/src/main/java/hudson/model/Executor.java#L183-L196

The code has significantly changed

stephenc · 2017-03-15T16:38:22Z

@oleg-nenashev I argue that your bug was incorrect: the interruption is a user-triggered action and will have a cause reflecting that

stephenc · 2017-03-22T21:34:41Z

@oleg-nenashev @jglick ping also @jenkinsci/code-reviewers

I do not block the PR anymore though I still prefer the explicit cause

oleg-nenashev · 2017-03-23T08:17:38Z

@stephenc removed the bug according to explanation

oleg-nenashev · 2017-03-29T09:48:44Z

@reviewbybees any feedback?

stephenc · 2017-03-29T13:06:33Z

@oleg-nenashev per policy, 7 days so @reviewbybees done

daniel-beck · 2017-03-30T10:50:01Z

core/src/main/java/jenkins/model/queue/ItemDeletion.java

+ *
+ * @since TODO
+ */
+@Extension


Do we want this to be @Restricted?

No because plugins may need access to perform additional checks and prevent work on items that are registered

Will be likely non-backportable then

It could be backported, but restricted in the backport

Should be @Restricted unless there is a tested & documented use case for accessing it from a plugin (which I doubt—a real API would be designed differently).

daniel-beck · 2017-03-30T10:53:18Z

core/src/main/java/hudson/model/Computer.java

+        List<Executor> result = new ArrayList<>(executors.size() + oneOffExecutors.size());
+        result.addAll(executors);
+        result.addAll(oneOffExecutors);
+        return result;


unmodifiableList?

Why do you want to add another layer of indirection, plus we have given you a copy so you can do what you want with it

Because that's how I interpret the "read-only" part of the Javadoc.

Yea, read-only should be removed from the javadoc, snapshot view should be enough, since it actually isn't read-only :)

Would prefer to just use unmodifiableList.
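
To make the two positions concrete, here is a small sketch with strings standing in for executors. `ExecutorListSketch` and its method names are hypothetical. Both variants return a snapshot that later changes to the underlying lists do not affect, and only the second enforces the "read-only" wording from the Javadoc.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical holder of two executor lists, using strings in place of executors. */
class ExecutorListSketch {
    private final List<String> executors = new ArrayList<>();
    private final List<String> oneOffExecutors = new ArrayList<>();

    void addExecutor(String e) {
        executors.add(e);
    }

    void addOneOff(String e) {
        oneOffExecutors.add(e);
    }

    /** Snapshot as a private copy: the caller may modify it freely. */
    List<String> snapshotCopy() {
        List<String> result = new ArrayList<>(executors.size() + oneOffExecutors.size());
        result.addAll(executors);
        result.addAll(oneOffExecutors);
        return result;
    }

    /** Same snapshot, wrapped so that attempts to modify it fail fast. */
    List<String> snapshotReadOnly() {
        return Collections.unmodifiableList(snapshotCopy());
    }
}
```

The wrapper costs one extra object per call. Whether that is worth it depends on whether the read-only contract is meant to be enforced or merely documented.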

oleg-nenashev

I agree with the fix since it definitely does not make the behavior worse than it was before. I am concerned about adding another non-configurable magic number for timeout, but I accept delaying it.

I will likely be against backporting the fix into .2 due to the potential impact of the changes and the non-Restricted API.

oleg-nenashev · 2017-04-10T11:17:29Z

On hold for now since we are waiting for the 2.54 release. Will merge afterwards

rsandell · 2017-04-10T11:18:19Z

test/src/test/java/hudson/model/JobTest.java

+        long start = System.nanoTime();
+        p.delete();
+        long end = System.nanoTime();
+        assertThat(end - start, Matchers.lessThan(TimeUnit.SECONDS.toNanos(1)));


Might result in flaky builds on slower build agents.

rsandell · 2017-04-10T14:03:40Z

🐝

oleg-nenashev · 2017-04-10T14:05:16Z

@reviewbybees done
No merge right now pls

oleg-nenashev · 2017-04-11T10:42:13Z

@jenkinsci/code-reviewers we would appreciate some additional feedback. It may potentially break some use-cases with deletion of projects with pending builds, though these use-cases were not working reliably anyway. If you are fine, my plan is to merge it towards the next weekly on Friday.

jglick

Did not follow all of the code here but overall looks right.

jglick · 2017-04-13T22:30:53Z

core/src/main/java/hudson/model/Computer.java

+        List<Executor> result = new ArrayList<>(executors.size() + oneOffExecutors.size());
+        result.addAll(executors);
+        result.addAll(oneOffExecutors);
+        return result;


Would prefer to just use unmodifiableList.

jglick · 2017-04-13T22:32:20Z

core/src/main/java/hudson/model/queue/Tasks.java

+     */
+    @CheckForNull
+    public static hudson.model.Item getItemOf(@Nonnull SubTask t) {
+        // TODO move to default method on SubTask once code level is Java 8


Can do that now.
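
For context, moving a static helper onto the interface as a Java 8 default method could look roughly like the sketch below. `OwnedTask`, `SimpleTask`, and `findOwningItem` are hypothetical and do not reproduce the real `SubTask` or `Tasks.getItemOf` logic. The sketch assumes a task at the top of an ownership chain reports itself as its own owner.

```java
/** Hypothetical task interface with an owner chain. */
interface OwnedTask {
    /** Returns the owning task, or this task itself at the top of the chain. */
    OwnedTask getOwner();

    /** Whether this task is also an item; false unless an implementation overrides it. */
    default boolean isItem() {
        return false;
    }

    /** Former static helper expressed as a default method: first item found walking up the chain, or null. */
    default OwnedTask findOwningItem() {
        OwnedTask current = this;
        while (!current.isItem()) {
            OwnedTask owner = current.getOwner();
            if (owner == null || owner == current) {
                return null; // reached the top without finding an item
            }
            current = owner;
        }
        return current;
    }
}

/** Hypothetical implementation used only to exercise the default method. */
class SimpleTask implements OwnedTask {
    private final OwnedTask owner; // null means "own owner"
    private final boolean item;

    SimpleTask(OwnedTask owner, boolean item) {
        this.owner = owner;
        this.item = item;
    }

    @Override
    public OwnedTask getOwner() {
        return owner == null ? this : owner;
    }

    @Override
    public boolean isItem() {
        return item;
    }
}
```

Callers then write `task.findOwningItem()` instead of going through a separate utility class, and implementations can override the default where they know the answer directly.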

jglick · 2017-04-13T22:33:15Z

core/src/main/java/jenkins/model/queue/ItemDeletion.java

+ *
+ * @since TODO
+ */
+@Extension


Should be @Restricted unless there is a tested & documented use case for accessing it from a plugin (which I doubt—a real API would be designed differently).

jglick · 2017-04-13T22:34:30Z

core/src/main/java/jenkins/model/queue/ItemDeletion.java

+            try {
+                return !_contains(item);
+            } finally {
+                lock.readLock().unlock();


Pity that Lock is not AutoCloseable.
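
A thin wrapper can get close to that. The sketch below is hypothetical (`CloseableLock` is not a JDK or Jenkins class) and adapts any `java.util.concurrent.locks.Lock` so that it can be released by try-with-resources.

```java
import java.util.concurrent.locks.Lock;

/** Hypothetical adapter that releases a Lock when closed. */
final class CloseableLock implements AutoCloseable {
    private final Lock lock;

    private CloseableLock(Lock lock) {
        this.lock = lock;
    }

    /** Acquires the lock and returns a handle whose close() releases it. */
    static CloseableLock acquire(Lock lock) {
        lock.lock();
        return new CloseableLock(lock);
    }

    @Override
    public void close() {
        lock.unlock();
    }
}
```

With this, the try/finally in the hunk could be written as `try (CloseableLock held = CloseableLock.acquire(lock.readLock())) { return !_contains(item); }`, at the cost of one small allocation per acquisition.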

daniel-beck · 2017-04-22T12:29:25Z

Should be @Restricted unless there is a tested & documented use case for accessing it from a plugin (which I doubt—a real API would be designed differently).

☹️

oleg-nenashev · 2017-04-22T13:14:13Z

@jglick @daniel-beck

Should be @Restricted unless there is a tested & documented use case for accessing it from a plugin (which I doubt—a real API would be designed differently).

Makes sense to Restrict in a follow-up PR.

jglick · 2017-10-10T19:00:23Z

core/src/main/java/hudson/model/AbstractItem.java

+                    }
+                }
+            }
+            synchronized (this) { // could just make performDelete synchronized but overriders might not honor that


Unusable API; see jenkinsci/cloudbees-folder-plugin#112.

Cf. jenkinsci#95 & jenkinsci/jenkins#2789.

[FIXED JENKINS-35160] Wait up to 15 seconds for interrupted builds to…

0f2b88b

… complete - Also now aware of concurrent builds

[JENKINS-35160] Tests are good, they catch bugs

047e849

stephenc commented Mar 9, 2017


oleg-nenashev requested changes Mar 9, 2017


jglick previously approved these changes Mar 9, 2017


stephenc added 4 commits March 10, 2017 14:23

[JENKINS-35160] We should do the interrupt for any Item not just Jobs

ee5d2f3

[JENKINS-35160] s/DeleteBlocker/ItemDeletion/g

222b09e

Left over references before I settled on a better name

[JENKINS-35160] Switch to Failure for better HTML rendering

8813555

[JENKINS-35160] Align the i18n key with owning class

9747405

stephenc mentioned this pull request Mar 10, 2017

[JENKINS-35112] Re-implement based on JENKINS-35160 impl in core jenkinsci/cloudbees-folder-plugin#95

Merged

oleg-nenashev self-assigned this Mar 14, 2017

oleg-nenashev previously requested changes Mar 15, 2017


oleg-nenashev added the needs-more-reviews label (Complex change, which would benefit from more eyes) Mar 29, 2017

daniel-beck reviewed Mar 30, 2017


oleg-nenashev approved these changes Apr 1, 2017


oleg-nenashev added the on-hold label (This pull request depends on another event/release, and it cannot be merged right now) Apr 10, 2017

rsandell reviewed Apr 10, 2017


oleg-nenashev changed the title ~~[FIXED JENKINS-35160] Wait up to 15 seconds for interrupted builds to complete~~ [FIXED JENKINS-35160] - Job deletion: Wait up to 15 seconds for interrupted builds to complete Apr 11, 2017

oleg-nenashev added the ready-for-merge label (The PR is ready to go, and it will be merged soon if there is no negative feedback) and removed the on-hold label (This pull request depends on another event/release, and it cannot be merged right now) Apr 11, 2017

jglick approved these changes Apr 13, 2017


oleg-nenashev merged commit 52a1a10 into jenkinsci:master Apr 14, 2017

oleg-nenashev mentioned this pull request Apr 22, 2017

[JENKINS-43653] - Ensure AbstractItem#delete() NPE safety when checking executors #2854

Merged


stephenc deleted the jenkins-35160 branch June 7, 2017 08:53

jglick reviewed Oct 10, 2017


jglick added a commit to jglick/cloudbees-folder-plugin that referenced this pull request Oct 10, 2017

Planned cleanup of AbstractFolder.delete & ItemDeletion failed.

cdf7370

Cf. jenkinsci#95 & jenkinsci/jenkins#2789.

jglick mentioned this pull request Oct 17, 2023

Use Tasks.getItemOf jenkinsci/cloudbees-folder-plugin#353

Merged

jglick mentioned this pull request Oct 25, 2023

Pull up AbstractFolder.delete logic into AbstractItem #8645

Merged
