New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[JENKINS-34256] Resume Pipeline execution if quiet mode is canceled #340
Conversation
src/test/java/org/jenkinsci/plugins/workflow/cps/CpsFlowExecutionTest.java
Outdated
Show resolved
Hide resolved
public long getRecurrencePeriod() { | ||
return Main.isUnitTest | ||
? TimeUnit.SECONDS.toMillis(1) | ||
: TimeUnit.MINUTES.toMillis(1); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure about timing here. One minute seems like a decent balance between responsiveness and avoiding unnecessary work, but I am happy to tweak it based on review feedback. Ideally, we'd have a listener so this could be event-driven instead (should I go ahead and file a core PR for that?).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure if we need to bother with a core patch given that this is kind of a corner case and polling should work well enough.
Note I had a Jenkins user specifically ask this same question. |
I just updated the description, but it looks like the reasoning for the current behavior is discussed in JENKINS-32015. I am going to try to log a message to the build log when execution is suspended because of quiet mode so users understand why their build appears to be hanging. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not just allow the Pipelines to continue executing so that suspension/resumption works as normal when Jenkins is restarted?
It might work, depending on durability modes of metadata and the exact timing of events, but you would be pushing your luck.
If you are just inside some sh
step that is running along for an hour or so, Jenkins can restart. The idea is to not be writing critical metadata from the master JVM at the moment it exits.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK but could be optimized to add no overhead when the situation does not occur.
SemaphoreStep.waitForStart("wait/1", b); | ||
r.jenkins.doQuietDown(true, 0); | ||
SemaphoreStep.success("wait/1", null); | ||
((CpsFlowExecution) b.getExecution()).waitForSuspension(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would avoid using waitForSuspension
. Really it should be deprecated—too internal. Would suffice to just sleep for a second here, I think.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In #68 you switched from using Thread.sleep(1000)
to CpsFlowExecution.waitForSuspension
in CpsFlowExecutionTest.quietDown
to try to fix some flaky tests, so I'm a bit hesitant to do the reverse. I'm happy to add a disclaimer to CpsFlowExecution.waitForSuspension
that the method should not be used unless what you are testing involves some implementation detail of CpsThreadGroup.scheduleRun
, if you think that would be helpful.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
#68 was a different case: the test would fail if the sleep were too short. Here, at worst, the test could pass if the code were buggy.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, e026365 switches to using waitForMessage
instead of a waitForSuspension
. I think that is ok because the message is printed from inside of scheduleRun
, but it might be flaky in some cases. I need to think about it some more.
// TODO: Core could use a QuietDownListener class with `onQuietDown` and `onCancelQuietDown` methods. | ||
@Restricted(NoExternalUse.class) | ||
@Extension | ||
public static class QuietDownListener extends PeriodicWork { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You should not need a global extension to do this. Rather, when you saveProgramIfPossible(true)
inside scheduleRun()
, use Timer
to set up a task which checks (say) every second to see if this program can resume. (If you use an anonymous inner Runnable
class, you can Timer.get().submit(this)
to reschedule if Jenkins is still quieting down.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, I'll look into it. IIUC, you mean that if a build is executing something like a long-running sh
step when quiet mode is enabled, there is no need to call scheduleRun
on it if quiet mode is canceled unless the sh
step completed while quiet mode was enabled, is that right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, that is not what I meant at all. Let me try to rephrase. You are fixing what I consider to be a corner case that many people will never hit. Yet the implementation registers a global extension that does something in the JVM every minute, even if no Pipeline builds are running, and needs to iterate all running Pipeline builds. Instead you can just create a timer task in a particular build that you know was suspended due to quiet mode because we just ran
workflow-cps-plugin/src/main/java/org/jenkinsci/plugins/workflow/cps/CpsThreadGroup.java
Lines 274 to 278 in c4ecc1d
if (paused.get() || j == null || (execution != null && j.isQuietingDown())) { | |
// by doing the pause check inside, we make sure that scheduleRun() returns a | |
// future that waits for any previously scheduled tasks to be completed. | |
saveProgramIfPossible(true); | |
f.set(null); |
FlowExecutionList
etc. because we have references to all the relevant objects right here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My confusion is that it doesn't seem like there is a meaningful difference between a global extension that only loops over all Pipeline builds when quiet mode is canceled and running a timer task per suspended Pipeline build while quiet mode is enabled, unless there is some reason to expect that the set of suspended Pipeline builds will normally be smaller than the set of all Pipeline builds, so I was trying to figure out if that was the case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
there is some reason to expect that the set of suspended Pipeline builds will normally be smaller than the set of all Pipeline builds
Consider the normal case that Jenkins has not been set to quiet down in the current session at all. The current patch would be waking up every minute. I was merely proposing to activate the code in this patch only if and when a particular Pipeline build was actually suspended due to Jenkins quieting down.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
…and scheduling a task which was actually specific to that build, independent of all others, which ought to make it more obvious from reading the code that it is doing the right thing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was merely proposing to activate the code in this patch only if and when a particular Pipeline build was actually suspended due to Jenkins quieting down.
Sure, but then instead of waking up Jenkins once a minute (and usually doing more or less nothing), you are waking up Jenkins even more frequently than once a minute for every Pipeline build when quiet mode is enabled (and usually doing more or less nothing). Depending on how often you enable quiet mode and how many builds are running concurrently, I could see either approach being preferable, so I just wanted to check if I was missing something important that might tip the balance.
Anyway, I switched to your approach in e026365, so let me know what you think.
src/main/java/org/jenkinsci/plugins/workflow/cps/CpsFlowExecution.java
Outdated
Show resolved
Hide resolved
} | ||
|
||
@Issue("JENKINS-34256") | ||
@Test public void pauseThenQuietDownThenUnpauseThenCancelQuietDown() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure about all of these tests; I think they might be flaky. I added them because I was trying to make sure there weren't any bugs around combinations of pause and quiet mode. Some of the cases are a bit weird, especially when you unpause or cancel quiet mode when quiet mode is enabled or when the build is paused, respectively. I think I could tweak CpsFlowExecution.pause()
to avoid printing consecutive "Pausing" and "Resuming" lines, but I'm not sure if that's worth the complexity.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure if that's worth the complexity
Likely not.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Tests seem good when I run them. I see the need for all the combinations, but yeah it does kind of suck to have to do all these variations. There's probably a parameterized way to simplify the code, but I don't think it's worth the effort. Maybe put it another test class?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, using a parameterized test would probably make the intention clearer, but I think the abstraction might make the individual tests more difficult to understand, so I'd prefer to leave them as-is.
@@ -271,7 +282,28 @@ public void unexport(BodyReference ref) { | |||
@SuppressFBWarnings(value="RV_RETURN_VALUE_IGNORED_BAD_PRACTICE", justification="runner.submit() result") | |||
public Void call() throws Exception { | |||
Jenkins j = Jenkins.getInstanceOrNull(); | |||
if (j != null && !j.isQuietingDown() && execution != null && pausedByQuietMode.compareAndSet(true, false)) { | |||
try { | |||
execution.getOwner().getListener().getLogger().println("Resuming (Prepare for shutdown was canceled)"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added some log messages around quiet mode being enabled/disabled so users understand why their build appears to be hanging. I need to check what the user-facing text says for the "Prepare for shutdown" link in the management page to make sure this is consistent.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, the link to enable quiet mode says "Prepare for shutdown", and the link to cancel quiet mode says "Cancel shutdown", so I kept "Pausing (Preparing for shutdown)", but changed "Resuming (Prepare for shutdown was canceled)" to "Resuming (Shutdown was canceled)" in 3cbff11.
src/main/java/org/jenkinsci/plugins/workflow/cps/CpsThreadGroup.java
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Longstanding issue, thanks for taking it on!
} | ||
|
||
@Issue("JENKINS-34256") | ||
@Test public void pauseThenQuietDownThenUnpauseThenCancelQuietDown() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure if that's worth the complexity
Likely not.
@jglick Independent of durability mode, as long as you restart Jenkins via something "safe" like workflow-cps-plugin/src/main/java/org/jenkinsci/plugins/workflow/cps/CpsFlowExecution.java Line 1572 in c4ecc1d
|
Probably. But also consider |
Good point. Based on the comments in JENKINS-34256, it seems like some users would like a variant of "Prepare for shutdown" that also waits for Pipelines to complete (ignoring the fact that they can be suspended and then resumed after the restart 🤷♂) and others would like a variant that only suspends Pipelines once all non-Pipelines have completed. Both groups of users seem a bit confused that "Prepare for shutdown" does not actually shut down or restart Jenkins. |
As a longtime Jenkins user, my two cents: I am not in the camp confused by "Prepare for Shutdown"; I've always understood it and used it as a way to block any new builds from starting and allow any running builds to complete -- quiescing the system, essentially. My expectation for Pipeline builds in this scenario would be either that they too simply be allowed to complete (as Freestyle builds do) or -- probably better -- that they reach a safe point at which they could pause (being resumable whether the shutdown were canceled or the service actually got restarted), then pause, then resume as predictably and reliably as possible. |
Perhaps not as much of a threat as I thought—I had forgotten about the
Reasonable enough. Would involve waiting for all asynchronous executors to exit.
I believe this could be accomplished by amending the current quiet down check in this plugin to check for heavyweight executors also being idle. Again I am not sure this is safe.
That would be
This is the current behavior, or the intention behind it at any rate. |
Ok, I'll merge this, and then I'll comment on the ticket to explain the expected behavior before and after my change, and to mention that if users would prefer different behavior, they should open a new ticket and explain their use case and the desired behavior. |
Could this be achieved somehow? I see occasional problem related to the interaction with the ThinBackup plugin with If a pipeline job is running and the ThinBackup plugin puts Jenkins to quiet mode, the job will at some point log The whole Jenkins will remain stuck in that state since ThinBackup is waiting for all jobs to complete and thus keeping Jenkins in the quiet mode, but the blocking job will never complete because it's waiting for Jenkins to resume. The only way out of this is to abort the blocking job manually. |
@hostalp Please add general comments like the to the Jira ticket instead of the PR. Looking through the comments on JENKINS-34256, your issue sounds like it might be addressed by JENKINS-60434. If you agree, I would just add any additional information that seems relevant in a comment on that issue and then vote for it and add start watching it. If you think a different approach is needed for your use case, please open a new Jira ticket with details and link it to JENKINS-34256. |
No problem. I didn't find the JENKINS-60434 (yes. it's the right one) initially and didn't want to open a new issue for something that was meant as a simple question whose answer would decide further steps, if any. 😃 |
See JENKINS-34256.
I am not sure if this is the only problem described in that ticket, since there is some discussion there about specific steps, but as far as I can tell the problem was that once the
CpsThreadGroup
suspended itself because it detected that quiet mode was enabled, nothing will callCpsThreadGroup.scheduleRun
other than pausing and unpausing the Pipeline or restarting Jenkins, so the build is just permanently suspended.This PR fixes the issue by adding a listener that tracks
Jenkins.isQuietingDown
to detect when quiet down mode is cancelled so that it can callCpsThreadGroup.scheduleRun
which will resume the build.I'm not sure if there is a better way to do this, or if the current behavior is even the behavior we want. Why not just allow the Pipelines to continue executing so that suspension/resumption works as normal when Jenkins is restarted? EDIT: Jesse discusses the reasoning behind the current behavior in JENKINS-32015.
Opening a draft PR for now while I read through the ticket to see if I am misunderstanding the issue.
EDIT: I read through all the comments, including the ones on JENKINS-38316 and found this comment from Jesse, which is essentially what I am doing here for the fix. I think this fix is the root cause of all the issues discussed in the ticket, along with confusion from users on what "Prepare for shutdown" does for Pipeline jobs.