Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[JENKINS-41127] Prevent flyweight tasks for jobs with concurrent execution disabled from running concurrently #3562

Merged
merged 2 commits into from Aug 4, 2018

Conversation

5 participants
@dwnusbaum
Copy link
Member

dwnusbaum commented Jul 24, 2018

See JENKINS-41127.

Previously, if a flyweight task was in Queue.pendings at the start of a call to Queue#maintain, then it would be removed from pendings as part of the lostPending logic because tasks running on a OneOffExecutor (flyweight tasks) were not tracked, and after being removed, it is not added back to the queue. (which appears incorrect as previously noted, though I'm not sure exactly how it should be fixed (as simple as just adding the return value to the queue as in the other parts of Queue#maintain?)):

makeBuildable(p); // TODO whatever this is for, the return value is being ignored, so this does nothing at all

Because of this, if there were two builds in the queue for a pipeline with concurrent execution disabled, one still in blocked, and the other in pendings at the start of Queue#maintain, then the pending build would be removed from the queue (but would continue executing), and the previously blocked build would no longer be blocked and would start executing concurrently.

I was only able to reproduce the bug with Pipeline, but the issue should in theory affect other job types that use flyweight executors such as matrix projects.

I am not sure if there is a good way to create a regression test for the change, but am looking into it.

Proposed changelog entries

  • Bug: Some types of builds, like pipelines, would sometimes run concurrently even when that was disabled.

    (Suggestions to improve the changelog entry are very welcome)

Submitter checklist

  • JIRA issue is well described
  • Changelog entry appropriate for the audience affected by the change (users or developer, depending on the change). Examples
    * Use the Internal: prefix if the change has no user-visible impact (API, test frameworks, etc.)
  • Appropriate autotests or explanation to why this change has no tests
  • For dependency updates: links to external changelogs and, if possible, full diffs

Desired reviewers

@reviewbybees @jennbriden

@dwnusbaum dwnusbaum requested review from svanoort and abayer Jul 24, 2018

@daniel-beck

This comment has been minimized.

Copy link
Member

daniel-beck commented Jul 24, 2018

Reworded the changelog entry. WDYT?

@dwnusbaum

This comment has been minimized.

Copy link
Member Author

dwnusbaum commented Jul 25, 2018

@daniel-beck Thanks, I think your version is much better!

@svanoort
Copy link
Member

svanoort left a comment

Seems reasonable to me, good catch!

@dwnusbaum

This comment has been minimized.

Copy link
Member Author

dwnusbaum commented Jul 25, 2018

I added a basic regression test for the issue, and verified locally that it fails without my fix and passes with it.

@svanoort

This comment has been minimized.

Copy link
Member

svanoort commented Jul 26, 2018

@daniel-beck @oleg-nenashev WDYT? Merge-worthy?

@dwnusbaum

This comment has been minimized.

Copy link
Member Author

dwnusbaum commented Jul 27, 2018

I am planning on merging this today towards the next weekly if there are no objections.

@daniel-beck

This comment has been minimized.

Copy link
Member

daniel-beck commented Jul 27, 2018

@dwnusbaum This might interact badly with JENKINS-45571.

@dwnusbaum

This comment has been minimized.

Copy link
Member Author

dwnusbaum commented Jul 27, 2018

Daniel's point is that if JENKINS-45571 is caused by OneOffExecutors not being cleaned up correctly for some reason, then this change may cause new builds of flyweight tasks to be blocked indefinitely waiting for those stuck executors, when previously the stuck executors would just have been ignored.

Without knowing more about the cause of JENKINS-45571, I don't know how best to prevent it from happening. We could merge this anyway, and if users start hitting JENKINS-45571 then we know what the proximate cause is, but we'd still need to identify the root cause and fix it. We could try a preemptive workaround here like checking Executor#isLikelyStuck for OneOffExecutor s before including them in the lostPending logic, but that feels seriously hacky. Any thoughts?

@daniel-beck

This comment has been minimized.

Copy link
Member

daniel-beck commented Jul 27, 2018

The problem is that we have no idea about the scope of JENKINS-45571. It might be extremely rare, or affect nearly every serious Pipeline user. @dwnusbaum If the weeklies get regression reports that can be traced back to this change, are you prepared to look into them?

@daniel-beck daniel-beck requested a review from svanoort Jul 27, 2018

@svanoort

This comment has been minimized.

Copy link
Member

svanoort commented Jul 27, 2018

@daniel-beck @dwnusbaum I am quite confident from a deep investigation that https://issues.jenkins-ci.org/browse/JENKINS-45571 is caused by the Pipeline failing to mark its AsynchronousExecution is done (AsynchronousExecution#completed), which means the OneOfExecutor is never removed.

Edit: Now, fixing that is much harder than identifying what caused the the scenario (it's quite a long chain to figure out what prevents that callback from being issued, since it's at the end of a long chain of very complex, threaded logic).

But I really and honestly cannot say what the interaction with Queuing will be here unfortunately -- have not dived into that end of things lately. Knowing the above, does it help/hurt with this particular PR?

@daniel-beck

This comment has been minimized.

Copy link
Member

daniel-beck commented Jul 27, 2018

screen shot

Responsiveness trophy! 🏆

@svanoort

This comment has been minimized.

Copy link
Member

svanoort commented Jul 27, 2018

Thanks, but I think you just got lucky here @daniel-beck since I was looking into this one already (please see the edit above though).

@dwnusbaum

This comment has been minimized.

Copy link
Member Author

dwnusbaum commented Jul 27, 2018

The problem is that we have no idea about the scope of JENKINS-45571. It might be extremely rare, or affect nearly every serious Pipeline user.

Yes, given that the only way users have noticed it so far is by seeing lots of likelyStuck: true executors in the API, which I expect very few users to ever look at, it seems like it might be affecting quite a few users who just don't realize it.

It seems for now like we should wait on jenkinsci/workflow-cps-plugin#234, because I don't think we have a good way to prevent this issue from core's side of things.

@dwnusbaum dwnusbaum added the on-hold label Jul 27, 2018

@dwnusbaum

This comment has been minimized.

Copy link
Member Author

dwnusbaum commented Jul 27, 2018

But I really and honestly cannot say what the interaction with Queuing will be here unfortunately -- have not dived into that end of things lately. Knowing the above, does it help/hurt with this particular PR?

It means that we have a lot more confidence that this PR would cause some builds to be blocked forever for anyone hitting the bugs you are looking into, so I think we should definitely hold off for now.

@svanoort

This comment has been minimized.

Copy link
Member

svanoort commented Jul 27, 2018

@dwnusbaum If that's the case, then we have a general design problem with OneOffExecutor use -- they are expected to stick around for some time because they track to the internal Pipeline execution and complete when it does. The fix won't change that behavior -- only ensure that the executor is killed when the Pipeline completes rather than managing to dangle around due to the bug.

@dwnusbaum

This comment has been minimized.

Copy link
Member Author

dwnusbaum commented Jul 27, 2018

Ok, looking at this some more, it might not be a problem. The only difference in behavior this PR would cause with JENKINS-45571 is that if the stuck executor's work unit is in Queue.pendings, then other builds of the same pipeline would be blocked due to this change (if concurrent builds are disabled). The reason for this is that because the executor for the item is now looped over here, it gets removed from lostPendings correctly and so is no longer removed from Queue.pendings here.

Executor#run calls Queue#onStartExecuting, which removes the executor's task from Queue.pendings long before it calls Queue#execute, so the executor's work unit should be removed from Queue.pendings before it even starts executing. Whether or not Executor#completedAsynchronous is called to clean up the executor shouldn't matter.

Now, if the task is still running on the Executor while the executor is stuck, then the Queue will still be blocked, but it will be blocked because of WorkflowJob#getCauseOfBlockage, and this would happen before or after my patch.

Summary: I don't think the executors stuck waiting for Executor#completedAsynchronous will ever have a work item that is still in Queue.pendings, so I don't think this PR will interact badly with JENKINS-45571.

@svanoort

This comment has been minimized.

Copy link
Member

svanoort commented Jul 27, 2018

@dwnusbaum If you can deploy a snapshot of this fix, and tell me what state the OneOffExecutor & the FlyWeight task should be in in the queue, I can test against my JENKINS-45571 reproduction case.

@dwnusbaum

This comment has been minimized.

Copy link
Member Author

dwnusbaum commented Jul 27, 2018

@svanoort Run your reproduction case, and while aOneOffExecutor is stuck, check that the task on that executor is not in Queue#getPendingItems. You should be able to just add assertNull(Queue.getItem(Executor.getCurrentWorkUnit().context.task)) to your test, since the task shouldn't be inside of any of the Queue's lists at that point, but if that fails, then you can separate into the following to figure out exactly where it is:

BuildableItem item = Executor.getCurrentWorkUnit().context.item); 
assertThat(Queue.getPendingItems(), not(contains(item)));
assertThat(Queue.getBuildableItems(), not(contains(item)));
assertThat(Queue.getUnblockedItems(), not(contains(item)));

I uploaded the snapshots as jenkins-core:2.135-20180727.210759-1 and jenkins-war:2.135-20180727.210819-1.

@svanoort

This comment has been minimized.

Copy link
Member

svanoort commented Jul 27, 2018

Looks like I'm going to have to do some extra finagling to make this work, using https://github.com/jenkinsci/jenkins-test-harness/pull/100/files after all -- since the WarExploder isn't happy about this.

Got it most of the way set up but is going to be a Monday Thing for the remainder.

@daniel-beck

This comment has been minimized.

Copy link
Member

daniel-beck commented Aug 2, 2018

@svanoort Next Monday perhaps? 😃

@atikhono

This comment has been minimized.

Copy link

atikhono commented Aug 3, 2018

@svanoort @dwnusbaum So looking forward to have it fixed!

@atikhono

This comment has been minimized.

Copy link

atikhono commented Aug 3, 2018

@svanoort In my case, the Throttle Concurrent Builds plugin prevents new builds from running because there are OneOffExecutors that are never removed from the master node. A workaround is to run a cleanup pipeline that removes such flyweights when there're no builds of my pipeline running. The problem is there's always a build running (well, almost)... (As another workaround, I can increase the concurrent builds limit for the throttle category, but that's not an automated solution an is troubling me and my fellow Jenkins users). Can you please tell if it's possible to tell apart stuck flyweights of the same pipeline from the ones that run normally?

@dwnusbaum

This comment has been minimized.

Copy link
Member Author

dwnusbaum commented Aug 3, 2018

the Throttle Concurrent Builds plugin prevents new builds from running because there are OneOffExecutors that are never removed from the master node.

@atikhono I don't think that issue (likely the same cause as JENKINS-45571) would be fixed by this change. This change is just about keeping pipelines from running concurrently (JENKINS-41127), but before we merge we want to make sure that it won't make things worse for users like you who are hitting JENKINS-45571, because we don't have a fix for that issue yet.

@svanoort

This comment has been minimized.

Copy link
Member

svanoort commented Aug 3, 2018

@dwnusbaum @daniel-beck No matter what I do, I cannot get the custom WAR incrementals to play nicely with the IntelliJ debugger for some reason. I've gone as far as using a custom test harness built off of jenkinsci/jenkins-test-harness#99

It seems a lot like https://issues.jenkins-ci.org/browse/JENKINS-45245

java.lang.AssertionError: /Users/svanoort/.m2/repository/org/jenkins-ci/main/jenkins-core/2.135-SNAPSHOT/jenkins-core-2.135-20180727.210759-1.jar is not in the expected location, and jenkins-war-*.war was not in /Applications/IntelliJ IDEA CE 2.app/Contents/lib/idea_rt.jar:/Applications/IntelliJ IDEA CE 2.app/Contents/plugins/junit/lib/junit-rt.jar: YADDA YADDA

at org.jvnet.hudson.test.WarExploder.explode(WarExploder.java:119)
at org.jvnet.hudson.test.WarExploder.(WarExploder.java:68)
at org.jvnet.hudson.test.JenkinsRule.createWebServer(JenkinsRule.java:668)
at org.jvnet.hudson.test.JenkinsRule.newHudson(JenkinsRule.java:612)
at org.jvnet.hudson.test.JenkinsRule.before(JenkinsRule.java:390)
at org.jvnet.hudson.test.JenkinsRule$1.evaluate(JenkinsRule.java:543)
at org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:298)
at org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:292)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.lang.Thread.run(Thread.java:745)

You guys have any ideas, or is this one where we either wait for ideas from @jglick or hope for the best from a review alone?

@dwnusbaum

This comment has been minimized.

Copy link
Member Author

dwnusbaum commented Aug 3, 2018

I was able to sync up with Sam and run his reproduction for JENKINS-45571 on my machine. Thankfully, when the executor is stuck, its work item is not in the Queue at all, so the change to the lostPendings logic here should not change the behavior, although as noted in my earlier comment, the existing behavior could still be problematic (although in my reproduction WorkflowRun#isBuilding returns false, so that specific case shouldn't be a problem):

If the task is still running on the Executor while the executor is stuck (WorkflowRun#isBuilding returns true), then new tasks may still be blocked, but they will be blocked because of WorkflowJob#getCauseOfBlockage, and this would happen before or after my patch.

I think this is ready to be merged. Any objections?

@svanoort

This comment has been minimized.

Copy link
Member

svanoort commented Aug 3, 2018

@dwnusbaum I'm fine with it, but am hoping @daniel-beck feels the same

@daniel-beck

This comment has been minimized.

Copy link
Member

daniel-beck commented Aug 3, 2018

This investigation appears to adequately address my concerns, thanks for clarifying that! Otherwise I haven't had the time to look into this further, so ±0

@dwnusbaum

This comment has been minimized.

Copy link
Member Author

dwnusbaum commented Aug 4, 2018

@daniel-beck Ok, I will go ahead and merge towards this weekly, and if there are any regressions I will make sure to look into them.

@dwnusbaum dwnusbaum added ready-for-merge and removed on-hold labels Aug 4, 2018

@dwnusbaum

This comment has been minimized.

Copy link
Member Author

dwnusbaum commented Aug 4, 2018

CI failure appears to just be due to ci.jenkins.io running out of space: java.io.IOException: No space left on device, but the tests are otherwise passing.

@dwnusbaum dwnusbaum merged commit 6df66e9 into jenkinsci:master Aug 4, 2018

1 check failed

continuous-integration/jenkins/pr-merge This commit cannot be built
Details

@dwnusbaum dwnusbaum deleted the dwnusbaum:JENKINS-41127 branch Aug 4, 2018

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.