Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[JENKINS-34256] Resume Pipeline execution if quiet mode is canceled #340

Merged
merged 6 commits into from Dec 9, 2019

Conversation

dwnusbaum
Copy link
Member

@dwnusbaum dwnusbaum commented Dec 2, 2019

See JENKINS-34256.

I am not sure if this is the only problem described in that ticket, since there is some discussion there about specific steps, but as far as I can tell the problem was that once the CpsThreadGroup suspended itself because it detected that quiet mode was enabled, nothing will call CpsThreadGroup.scheduleRun other than pausing and unpausing the Pipeline or restarting Jenkins, so the build is just permanently suspended.

This PR fixes the issue by adding a listener that tracks Jenkins.isQuietingDown to detect when quiet down mode is cancelled so that it can call CpsThreadGroup.scheduleRun which will resume the build.

I'm not sure if there is a better way to do this, or if the current behavior is even the behavior we want. Why not just allow the Pipelines to continue executing so that suspension/resumption works as normal when Jenkins is restarted? EDIT: Jesse discusses the reasoning behind the current behavior in JENKINS-32015.

Opening a draft PR for now while I read through the ticket to see if I am misunderstanding the issue.

EDIT: I read through all the comments, including the ones on JENKINS-38316 and found this comment from Jesse, which is essentially what I am doing here for the fix. I think this fix is the root cause of all the issues discussed in the ticket, along with confusion from users on what "Prepare for shutdown" does for Pipeline jobs.

public long getRecurrencePeriod() {
return Main.isUnitTest
? TimeUnit.SECONDS.toMillis(1)
: TimeUnit.MINUTES.toMillis(1);
Copy link
Member Author

@dwnusbaum dwnusbaum Dec 3, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure about timing here. One minute seems like a decent balance between responsiveness and avoiding unnecessary work, but I am happy to tweak it based on review feedback. Ideally, we'd have a listener so this could be event-driven instead (should I go ahead and file a core PR for that?).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure if we need to bother with a core patch given that this is kind of a corner case and polling should work well enough.

@dwnusbaum dwnusbaum marked this pull request as ready for review December 3, 2019 16:26
@dwnusbaum dwnusbaum requested a review from jglick December 3, 2019 16:28
@bitwiseman
Copy link
Contributor

I'm not sure if there is a better way to do this, or if the current behavior is even the behavior we want. Why not just allow the Pipelines to continue executing so that suspension/resumption works as normal when Jenkins is restarted?

Note I had a Jenkins user specifically ask this same question.

@dwnusbaum
Copy link
Member Author

I'm not sure if there is a better way to do this, or if the current behavior is even the behavior we want. Why not just allow the Pipelines to continue executing so that suspension/resumption works as normal when Jenkins is restarted?

Note I had a Jenkins user specifically ask this same question.

I just updated the description, but it looks like the reasoning for the current behavior is discussed in JENKINS-32015. I am going to try to log a message to the build log when execution is suspended because of quiet mode so users understand why their build appears to be hanging.

Copy link
Member

@jglick jglick left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not just allow the Pipelines to continue executing so that suspension/resumption works as normal when Jenkins is restarted?

It might work, depending on durability modes of metadata and the exact timing of events, but you would be pushing your luck.

If you are just inside some sh step that is running along for an hour or so, Jenkins can restart. The idea is to not be writing critical metadata from the master JVM at the moment it exits.

Copy link
Member

@jglick jglick left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK but could be optimized to add no overhead when the situation does not occur.

SemaphoreStep.waitForStart("wait/1", b);
r.jenkins.doQuietDown(true, 0);
SemaphoreStep.success("wait/1", null);
((CpsFlowExecution) b.getExecution()).waitForSuspension();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would avoid using waitForSuspension. Really it should be deprecated—too internal. Would suffice to just sleep for a second here, I think.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In #68 you switched from using Thread.sleep(1000) to CpsFlowExecution.waitForSuspension in CpsFlowExecutionTest.quietDown to try to fix some flaky tests, so I'm a bit hesitant to do the reverse. I'm happy to add a disclaimer to CpsFlowExecution.waitForSuspension that the method should not be used unless what you are testing involves some implementation detail of CpsThreadGroup.scheduleRun, if you think that would be helpful.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

#68 was a different case: the test would fail if the sleep were too short. Here, at worst, the test could pass if the code were buggy.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, e026365 switches to using waitForMessage instead of a waitForSuspension. I think that is ok because the message is printed from inside of scheduleRun, but it might be flaky in some cases. I need to think about it some more.

// TODO: Core could use a QuietDownListener class with `onQuietDown` and `onCancelQuietDown` methods.
@Restricted(NoExternalUse.class)
@Extension
public static class QuietDownListener extends PeriodicWork {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You should not need a global extension to do this. Rather, when you saveProgramIfPossible(true) inside scheduleRun(), use Timer to set up a task which checks (say) every second to see if this program can resume. (If you use an anonymous inner Runnable class, you can Timer.get().submit(this) to reschedule if Jenkins is still quieting down.)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I'll look into it. IIUC, you mean that if a build is executing something like a long-running sh step when quiet mode is enabled, there is no need to call scheduleRun on it if quiet mode is canceled unless the sh step completed while quiet mode was enabled, is that right?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, that is not what I meant at all. Let me try to rephrase. You are fixing what I consider to be a corner case that many people will never hit. Yet the implementation registers a global extension that does something in the JVM every minute, even if no Pipeline builds are running, and needs to iterate all running Pipeline builds. Instead you can just create a timer task in a particular build that you know was suspended due to quiet mode because we just ran

if (paused.get() || j == null || (execution != null && j.isQuietingDown())) {
// by doing the pause check inside, we make sure that scheduleRun() returns a
// future that waits for any previously scheduled tasks to be completed.
saveProgramIfPossible(true);
f.set(null);
which will incur that overhead if only and if a build actually needs it, and there is no need to look up FlowExecutionList etc. because we have references to all the relevant objects right here.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My confusion is that it doesn't seem like there is a meaningful difference between a global extension that only loops over all Pipeline builds when quiet mode is canceled and running a timer task per suspended Pipeline build while quiet mode is enabled, unless there is some reason to expect that the set of suspended Pipeline builds will normally be smaller than the set of all Pipeline builds, so I was trying to figure out if that was the case.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there is some reason to expect that the set of suspended Pipeline builds will normally be smaller than the set of all Pipeline builds

Consider the normal case that Jenkins has not been set to quiet down in the current session at all. The current patch would be waking up every minute. I was merely proposing to activate the code in this patch only if and when a particular Pipeline build was actually suspended due to Jenkins quieting down.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

…and scheduling a task which was actually specific to that build, independent of all others, which ought to make it more obvious from reading the code that it is doing the right thing.

Copy link
Member Author

@dwnusbaum dwnusbaum Dec 3, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was merely proposing to activate the code in this patch only if and when a particular Pipeline build was actually suspended due to Jenkins quieting down.

Sure, but then instead of waking up Jenkins once a minute (and usually doing more or less nothing), you are waking up Jenkins even more frequently than once a minute for every Pipeline build when quiet mode is enabled (and usually doing more or less nothing). Depending on how often you enable quiet mode and how many builds are running concurrently, I could see either approach being preferable, so I just wanted to check if I was missing something important that might tip the balance.

Anyway, I switched to your approach in e026365, so let me know what you think.

}

@Issue("JENKINS-34256")
@Test public void pauseThenQuietDownThenUnpauseThenCancelQuietDown() {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure about all of these tests; I think they might be flaky. I added them because I was trying to make sure there weren't any bugs around combinations of pause and quiet mode. Some of the cases are a bit weird, especially when you unpause or cancel quiet mode when quiet mode is enabled or when the build is paused, respectively. I think I could tweak CpsFlowExecution.pause() to avoid printing consecutive "Pausing" and "Resuming" lines, but I'm not sure if that's worth the complexity.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure if that's worth the complexity

Likely not.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tests seem good when I run them. I see the need for all the combinations, but yeah it does kind of suck to have to do all these variations. There's probably a parameterized way to simplify the code, but I don't think it's worth the effort. Maybe put it another test class?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, using a parameterized test would probably make the intention clearer, but I think the abstraction might make the individual tests more difficult to understand, so I'd prefer to leave them as-is.

@@ -271,7 +282,28 @@ public void unexport(BodyReference ref) {
@SuppressFBWarnings(value="RV_RETURN_VALUE_IGNORED_BAD_PRACTICE", justification="runner.submit() result")
public Void call() throws Exception {
Jenkins j = Jenkins.getInstanceOrNull();
if (j != null && !j.isQuietingDown() && execution != null && pausedByQuietMode.compareAndSet(true, false)) {
try {
execution.getOwner().getListener().getLogger().println("Resuming (Prepare for shutdown was canceled)");
Copy link
Member Author

@dwnusbaum dwnusbaum Dec 3, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added some log messages around quiet mode being enabled/disabled so users understand why their build appears to be hanging. I need to check what the user-facing text says for the "Prepare for shutdown" link in the management page to make sure this is consistent.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, the link to enable quiet mode says "Prepare for shutdown", and the link to cancel quiet mode says "Cancel shutdown", so I kept "Pausing (Preparing for shutdown)", but changed "Resuming (Prepare for shutdown was canceled)" to "Resuming (Shutdown was canceled)" in 3cbff11.

Copy link
Member

@jglick jglick left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Longstanding issue, thanks for taking it on!

}

@Issue("JENKINS-34256")
@Test public void pauseThenQuietDownThenUnpauseThenCancelQuietDown() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure if that's worth the complexity

Likely not.

@dwnusbaum
Copy link
Member Author

dwnusbaum commented Dec 4, 2019

Why not just allow the Pipelines to continue executing so that suspension/resumption works as normal when Jenkins is restarted?

It might work, depending on durability modes of metadata and the exact timing of events, but you would be pushing your luck.

If you are just inside some sh step that is running along for an hour or so, Jenkins can restart. The idea is to not be writing critical metadata from the master JVM at the moment it exits.

@jglick Independent of durability mode, as long as you restart Jenkins via something "safe" like /restart or /safeRestart rather than kill -9,

should save all metadata for running Pipelines before the restart, right? Or is that terminator not robust and known to not always work as desired?

@jglick
Copy link
Member

jglick commented Dec 4, 2019

should save all metadata for running Pipelines before the restart, right?

Probably. But also consider SynchronousNonBlockingStepExecutions, which are not currently resumable. You would not want, say, a junit step to begin running a moment before the system shuts down.

@dwnusbaum
Copy link
Member Author

You would not want, say, a junit step to begin running a moment before the system shuts down.

Good point. Based on the comments in JENKINS-34256, it seems like some users would like a variant of "Prepare for shutdown" that also waits for Pipelines to complete (ignoring the fact that they can be suspended and then resumed after the restart 🤷‍♂) and others would like a variant that only suspends Pipelines once all non-Pipelines have completed. Both groups of users seem a bit confused that "Prepare for shutdown" does not actually shut down or restart Jenkins.

@medianick
Copy link

As a longtime Jenkins user, my two cents: I am not in the camp confused by "Prepare for Shutdown"; I've always understood it and used it as a way to block any new builds from starting and allow any running builds to complete -- quiescing the system, essentially. My expectation for Pipeline builds in this scenario would be either that they too simply be allowed to complete (as Freestyle builds do) or -- probably better -- that they reach a safe point at which they could pause (being resumable whether the shutdown were canceled or the service actually got restarted), then pause, then resume as predictably and reliably as possible.

@jglick
Copy link
Member

jglick commented Dec 6, 2019

You would not want, say, a junit step to begin running a moment before the system shuts down.

Perhaps not as much of a threat as I thought—I had forgotten about the SynchronousNonBlockingStepExecution.blocksRestart override. Still, there could be race conditions here. The idea behind suspending programs as soon as the first Groovy safepoint is encountered after quiet mode is entered is that otherwise Jenkins might do one last restart-blocked check, get a go-ahead because a sh step is running, and then initiate shutdown while a CPS VM thread is switching to new tasks.

it seems like some users would like a variant of "Prepare for shutdown" that also waits for Pipelines to complete

Reasonable enough. Would involve waiting for all asynchronous executors to exit.

only suspends Pipelines once all non-Pipelines have completed

I believe this could be accomplished by amending the current quiet down check in this plugin to check for heavyweight executors also being idle. Again I am not sure this is safe.

"Prepare for shutdown" does not actually shut down or restart Jenkins

That would be /safeExit or /safeRestart, respectively. Not sure if we currently offer this as a GUI option.

that they reach a safe point at which they could pause

This is the current behavior, or the intention behind it at any rate.

@dwnusbaum
Copy link
Member Author

Ok, I'll merge this, and then I'll comment on the ticket to explain the expected behavior before and after my change, and to mention that if users would prefer different behavior, they should open a new ticket and explain their use case and the desired behavior.

@dwnusbaum dwnusbaum merged commit 53a57bf into jenkinsci:master Dec 9, 2019
@dwnusbaum dwnusbaum deleted the JENKINS-34256 branch December 9, 2019 22:24
@hostalp
Copy link

hostalp commented Feb 13, 2020

Why not just allow the Pipelines to continue executing so that suspension/resumption works as normal when Jenkins is restarted?

Could this be achieved somehow?

I see occasional problem related to the interaction with the ThinBackup plugin with Wait until Jenkins/Hudson is idle to perform a backup and Force Jenkins to quiet mode after specified minutes options set.

If a pipeline job is running and the ThinBackup plugin puts Jenkins to quiet mode, the job will at some point log Pausing (Preparing for shutdown), its execution will really stop (at least I can see nothing running at that point - no slave processes, no master threads related to the job), but in Jenkins it will be still seen as active on the slave, just with the (Analyze) note below the running job number.

The whole Jenkins will remain stuck in that state since ThinBackup is waiting for all jobs to complete and thus keeping Jenkins in the quiet mode, but the blocking job will never complete because it's waiting for Jenkins to resume. The only way out of this is to abort the blocking job manually.

@dwnusbaum
Copy link
Member Author

dwnusbaum commented Feb 13, 2020

@hostalp Please add general comments like the to the Jira ticket instead of the PR. Looking through the comments on JENKINS-34256, your issue sounds like it might be addressed by JENKINS-60434. If you agree, I would just add any additional information that seems relevant in a comment on that issue and then vote for it and add start watching it. If you think a different approach is needed for your use case, please open a new Jira ticket with details and link it to JENKINS-34256.

@hostalp
Copy link

hostalp commented Feb 13, 2020

No problem. I didn't find the JENKINS-60434 (yes. it's the right one) initially and didn't want to open a new issue for something that was meant as a simple question whose answer would decide further steps, if any. 😃

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
6 participants