
Call ResetTimer synchronously from eol and skip the Tick when channel is null #226

Merged
merged 1 commit into jenkinsci:master on Aug 11, 2022

Conversation

@Pldi23 (Contributor) commented Aug 10, 2022

CloudBees CI reported flakiness in TimeoutStepExecutionTest#activityRestart.

Logs:
activityRestart_stacktrace.txt
activityRestart_stderr.txt

The issue can be reproduced locally (the test crashes quite often): after a restart, when the timeout is set to expire in less than 7.5 sec, the test fails.

1.281 [restarted #1] Timeout set to expire after 3.3 sec without activity

From what I see, the flake is caused by a race condition between the Tick (which gets scheduled at timeout / 2, in our particular case 7.5 sec) and the Killer's delay (which after the restart was 3.3 sec in our particular case). When delay < timeout / 2, no reset happens and TimeoutStepExecution cancels the rest of the pipeline after the delay.
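For illustration, here is a minimal, self-contained sketch of that timing problem (the class and task names are stand-ins, not the plugin's actual implementation): a reset scheduled at timeout / 2 can never fire before a kill whose delay is shorter.

import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class TickVsKillerRace {
    public static void main(String[] args) throws InterruptedException {
        ScheduledExecutorService scheduler = Executors.newScheduledThreadPool(2);
        long timeoutMs = 15_000;   // full timeout: 15 sec
        long remainingMs = 3_300;  // remaining delay after the restart: 3.3 sec

        // Tick: would reset the timer, but only runs at timeout / 2 = 7.5 sec.
        scheduler.schedule(() -> System.out.println("Tick: reset timer"),
                timeoutMs / 2, TimeUnit.MILLISECONDS);

        // Killer: interrupts the body once the remaining delay elapses.
        scheduler.schedule(() -> System.out.println("Killer: interrupt the body"),
                remainingMs, TimeUnit.MILLISECONDS);

        Thread.sleep(timeoutMs);
        scheduler.shutdown(); // the Killer always printed first; the Tick never got a chance to reset
    }
}

With the numbers above (3.3 sec vs. 7.5 sec, mirroring the log excerpt), the Killer always wins even though the build is still producing output.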

There is a TODO: if channel == null, we can safely ResetTimer.call synchronously from eol and skip the Tick. Following this recommendation, we can reset the timer synchronously instead of scheduling the reset at timeout / 2, which by itself fixes the flakiness (a sketch of the idea follows).
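A minimal sketch of that approach, under simplifying assumptions: ResettingStream, resetTimerSynchronously and scheduleTick are hypothetical names, and the real logic lives in TimeoutStepExecution's console log filter rather than a standalone class.

import hudson.console.LineTransformationOutputStream;
import hudson.remoting.VirtualChannel;
import java.io.IOException;
import java.io.OutputStream;

class ResettingStream extends LineTransformationOutputStream {
    private final OutputStream out;
    private final VirtualChannel channel; // null when the log is produced on the controller

    ResettingStream(OutputStream out, VirtualChannel channel) {
        this.out = out;
        this.channel = channel;
    }

    @Override protected void eol(byte[] b, int len) throws IOException {
        out.write(b, 0, len); // pass the line through unchanged
        if (channel == null) {
            resetTimerSynchronously(); // safe: no remoting round trip involved
        } else {
            scheduleTick(); // remote case keeps the periodic Tick at timeout / 2
        }
    }

    @Override public void flush() throws IOException { out.flush(); }
    @Override public void close() throws IOException { super.close(); out.close(); }

    private void resetTimerSynchronously() { /* hypothetical: look up the execution and reset its timeout */ }
    private void scheduleTick() { /* hypothetical: schedule a Tick that resets at timeout / 2 */ }
}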

@Pldi23 changed the title from "call ResetTimer synchronously from eol and skip the Tick when channel is null" to "Call ResetTimer synchronously from eol and skip the Tick when channel is null" on Aug 10, 2022
@Pldi23 (Contributor, Author) commented Aug 10, 2022

@jglick Could you please review? (I tried to apply a fix with minimal impact, so as not to break anything.)

@jglick added the bug label on Aug 11, 2022
@jglick (Member) left a comment

OK I guess? I have no memory of writing that comment. Does it seem to work?

@@ -336,11 +336,13 @@ private Object readResolve() {
        @Override
        public OutputStream decorateLogger(@SuppressWarnings("rawtypes") Run build, final OutputStream logger)
                throws IOException, InterruptedException {
            // TODO if channel == null, we can safely ResetTimer.call synchronously from eol and skip the Tick
@jglick (Member) commented on this line:

was from #74 FTR

@Pldi23 (Contributor, Author) commented Aug 11, 2022

> OK I guess? I have no memory of writing that comment. Does it seem to work?

The flakiness is gone with this change, and all other tests pass. Just in case, I also manually tested the case described in https://issues.jenkins.io/browse/JENKINS-54078?focusedCommentId=351432&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-351432. So yes, it seems to work.

@jglick jglick merged commit 43d80fe into jenkinsci:master Aug 11, 2022
@chwehrli commented:

After upgrading to v991 (which includes this PR), we are now facing CPU pressure and eventually the Jenkins pod is terminated.
I believe this might be caused by some of our jobs producing massive logs (>50k lines) within minutes, hence now triggering lots of ResetTimer calls:

org.jenkinsci.plugins.workflow.steps.TimeoutStepExecution$ResetTimer.call(TimeoutStepExecution.java:308)
org.jenkinsci.plugins.workflow.steps.TimeoutStepExecution$ConsoleLogFilterImpl2$1.eol(TimeoutStepExecution.java:344)
hudson.console.LineTransformationOutputStream.eol(LineTransformationOutputStream.java:61)

JavaMelody__jenkins-0_8_29_22.pdf

threadDump.txt

Any suggestions?
Please let me know if you want me to report a bug for this issue.

Many thanks!
Christian

@tarioch commented Aug 29, 2022

Maybe worth adding: the jobs run with a timeout of 2 h, so if I understand correctly, the reset call previously happened only every 1 h (timeout / 2) and now happens on each log line.

@jglick (Member) commented Aug 29, 2022

> The issue can be reproduced locally (the test crashes quite often): after a restart, when the timeout is set to expire in less than 7.5 sec, the test fails.

via

diff --git src/test/java/org/jenkinsci/plugins/workflow/steps/TimeoutStepTest.java src/test/java/org/jenkinsci/plugins/workflow/steps/TimeoutStepTest.java
index f18f476..faea2f4 100644
--- src/test/java/org/jenkinsci/plugins/workflow/steps/TimeoutStepTest.java
+++ src/test/java/org/jenkinsci/plugins/workflow/steps/TimeoutStepTest.java
@@ -219,6 +219,7 @@ public class TimeoutStepTest {
                 WorkflowRun b = p.scheduleBuild2(0).getStartCondition().get();
                 SemaphoreStep.waitForStart("restarted/1", b);
         });
+        Thread.sleep(10_000);
         sessions.then(j -> {
                 WorkflowJob p = j.jenkins.getItemByFullName("restarted", WorkflowJob.class);
                 WorkflowRun b = p.getBuildByNumber(1);

@jglick (Member) commented Aug 29, 2022

@chwehrli @tarioch thanks for the report and sorry for any inconvenience. I believe #234 should fix the performance issue as well as address the original problem motivating this PR. If you want to test prior to release, that would be great; just download a workflow-basic-steps.hpi from the Incrementals link of #234, assuming tests pass.
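For a rough idea of how such a fix could look (a sketch only, under the assumption that per-line synchronous resets are the bottleneck; not necessarily what #234 actually implements), the reset could be rate-limited so a chatty build does not pay the cost on every line. ThrottledReset, QUIET_MS and maybeReset are hypothetical names:

import java.util.concurrent.atomic.AtomicLong;

class ThrottledReset {
    private static final long QUIET_MS = 5_000; // illustrative threshold between real resets
    private final AtomicLong lastResetMs = new AtomicLong();

    /** Called from eol; performs the actual reset at most once per QUIET_MS. */
    void maybeReset(Runnable reset) {
        long now = System.currentTimeMillis();
        long last = lastResetMs.get();
        if (now - last >= QUIET_MS && lastResetMs.compareAndSet(last, now)) {
            reset.run();
        }
    }
}

The compare-and-set keeps the per-line check cheap and lock-free, so even a build emitting tens of thousands of lines per minute performs at most one real reset per quiet window.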

@chwehrli commented:

Thank you very much for that ultra-fast response @jglick !
I've just installed workflow-basic-steps-992.v85b_6c33ca_6e0.hpi - will report later how it behaves...

@jglick (Member) commented Aug 30, 2022

Thanks. Please use #234 for further comments.
