Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Better fix of timeout activity tick after restart #234

Merged
merged 2 commits into from Aug 30, 2022

Conversation

jglick
Copy link
Member

@jglick jglick commented Aug 29, 2022

#226 correctly diagnosed a problem brought to light by a flaky test, but the fix was not correct (only applied to messages printed within the controller as opposed to from an agent) and apparently introduced a severe performance regression.

I think the root issue was that restarting the controller ought to count as “activity” for purposes of delaying the timeout (better to err on the side of caution) but it did not.

@Pldi23
Copy link
Contributor

Pldi23 commented Aug 29, 2022

Thanks @jglick for quick reaction on this regression, just in case I've setup 300 executions run in test selector https://gauntlet-2.cloudbees.com/rosie/job/playground/job/flakebusters/job/selectors/job/pct-test-selector/168/

@jglick
Copy link
Member Author

jglick commented Aug 29, 2022

Seems it occasionally fails in

   0.766 [restarted #1] Resuming build at Mon Aug 29 20:10:01 UTC 2022 after Jenkins restart
   0.767 [restarted #1] Waiting to resume part of restarted #1: Waiting for next available executor
   1.826 [restarted #1] Timeout set to expire after 2 sec without activity
   2.083 [restarted #1] Ready to run at Mon Aug 29 20:10:02 UTC 2022
   2.393 [restarted #1] [Pipeline] echo
   2.401 [id=63]	FINE	o.j.p.w.s.TimeoutStepExecution$Tick#schedule: scheduling tick for 0 ms
   2.401 [id=91]	FINE	o.j.p.w.s.TimeoutStepExecution$Tick#schedule: scheduling tick for 1.5 sec
   2.447 [restarted #1] NotHereYet
   2.598 [restarted #1] [Pipeline] sleep
   2.647 [id=63]	FINE	o.j.p.w.s.TimeoutStepExecution$Tick#schedule: scheduling tick for 0 ms
   2.647 [id=93]	FINE	o.j.p.w.s.TimeoutStepExecution$Tick#schedule: scheduling tick for 1.5 sec
   2.648 [restarted #1] Sleeping for 10 sec
   3.840 [id=85]	FINE	o.j.p.w.s.TimeoutStepExecution#resetTimer: resetting timer on 8de14d60-6560-4abb-9426-1128deb999e5
   3.903 [id=83]	FINE	o.j.p.w.s.TimeoutStepExecution$Tick#schedule: scheduling tick for 7.5 sec
   3.957 [restarted #1] Cancelling nested steps due to timeout

Unclear if this is a bug in production code or just in the test. In any case, I am inclined to ship this amendment anyway, at least if @chwehrli @tarioch can confirm it addresses the performance regression.

@Pldi23
Copy link
Contributor

Pldi23 commented Aug 30, 2022

Thanks @jglick for quick reaction on this regression, just in case I've setup 300 executions run in test selector https://gauntlet-2.cloudbees.com/rosie/job/playground/job/flakebusters/job/selectors/job/pct-test-selector/168/

TimeoutStepTest#activityRestart still flaky, failed 3 times in 300 executions.

@chwehrli
Copy link

This also looks good from our end with regards to performance issues we noticed with 226 - Jenkins has been running fine over several hours at regular load - thanks!

@Pldi23
Copy link
Contributor

Pldi23 commented Aug 30, 2022

Seems it occasionally fails in

   0.766 [restarted #1] Resuming build at Mon Aug 29 20:10:01 UTC 2022 after Jenkins restart
   0.767 [restarted #1] Waiting to resume part of restarted #1: Waiting for next available executor
   1.826 [restarted #1] Timeout set to expire after 2 sec without activity
   2.083 [restarted #1] Ready to run at Mon Aug 29 20:10:02 UTC 2022
   2.393 [restarted #1] [Pipeline] echo
   2.401 [id=63]	FINE	o.j.p.w.s.TimeoutStepExecution$Tick#schedule: scheduling tick for 0 ms
   2.401 [id=91]	FINE	o.j.p.w.s.TimeoutStepExecution$Tick#schedule: scheduling tick for 1.5 sec
   2.447 [restarted #1] NotHereYet
   2.598 [restarted #1] [Pipeline] sleep
   2.647 [id=63]	FINE	o.j.p.w.s.TimeoutStepExecution$Tick#schedule: scheduling tick for 0 ms
   2.647 [id=93]	FINE	o.j.p.w.s.TimeoutStepExecution$Tick#schedule: scheduling tick for 1.5 sec
   2.648 [restarted #1] Sleeping for 10 sec
   3.840 [id=85]	FINE	o.j.p.w.s.TimeoutStepExecution#resetTimer: resetting timer on 8de14d60-6560-4abb-9426-1128deb999e5
   3.903 [id=83]	FINE	o.j.p.w.s.TimeoutStepExecution$Tick#schedule: scheduling tick for 7.5 sec
   3.957 [restarted #1] Cancelling nested steps due to timeout

Unclear if this is a bug in production code or just in the test. In any case, I am inclined to ship this amendment anyway, at least if @chwehrli @tarioch can confirm it addresses the performance regression.

Probably test changes are not needed? If I run ‘old’-versioned test against your fix the test does not flake.
https://gauntlet-2.cloudbees.com/rosie/job/playground/job/flakebusters/job/selectors/job/pct-test-selector/169/

Copy link
Member Author

@jglick jglick left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just noting the reason for test changes.

@@ -71,6 +73,8 @@ public class TimeoutStepTest {

@Rule public GitSampleRepoRule git = new GitSampleRepoRule();

@Rule public LoggerRule logging = new LoggerRule().record(TimeoutStepExecution.class, Level.FINE);
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just to help debug.

@@ -212,13 +216,14 @@ public void activityRestart() throws Throwable {
+ " echo 'NotHereYet';\n"
+ " sleep 10;\n"
+ " echo 'JustHere!';\n"
+ " sleep 30;\n"
+ " sleep 20;\n"
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A pre-commit version the main patch contained a mistake—the build did not time out until 1½× the stated period; in this case sleep would run for ~23s before being terminated. I was trying to strengthen the test here to verify that the 15s activity timeout really applies.

+ " echo 'ShouldNot!';\n"
+ " }\n"
+ "}\n", true));
WorkflowRun b = p.scheduleBuild2(0).getStartCondition().get();
SemaphoreStep.waitForStart("restarted/1", b);
});
Thread.sleep(10_000); // restarting should count as activity
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Making the test reliably reproduce the originally reported problem.

@jglick
Copy link
Member Author

jglick commented Aug 30, 2022

If I run ‘old’-versioned test against your fix the test does not flake.

Like what I did in 512d920? OK, if that is what it takes to address the regression and prevent test flakes, then we can go with that. I just want to ship this promptly.

@jglick jglick enabled auto-merge August 30, 2022 12:51
@jglick jglick merged commit d57e3ca into jenkinsci:master Aug 30, 2022
@jglick jglick deleted the TimeoutStepExecution branch August 30, 2022 14:03
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
3 participants