
Call ResetTimer synchronously from eol and skip the Tick when channel is null #226

Merged
merged 1 commit into jenkinsci:master on Aug 11, 2022

Conversation

@Pldi23 (Contributor) commented Aug 10, 2022

CloudBees CI reported flakiness in TimeoutStepExecutionTest#activityRestart.

Logs:
activityRestart_stacktrace.txt
activityRestart_stderr.txt

The issue can be reproduced locally (the test crashes quite often): after a restart, when the timeout is set to expire in less than 7.5 sec, the test fails.

1.281 [restarted #1] Timeout set to expire after 3.3 sec without activity

From what I see, the flake is caused by a race condition between the Tick (which gets scheduled at timeout / 2, in our particular case 7.5 sec) and the Killer's delay (which after the restart was 3.3 sec in our particular case). When delay < timeout / 2, no reset happens and TimeoutStepExecution cancels the rest of the pipeline after the delay.
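For illustration, here is a minimal, self-contained sketch of that timing problem (the class and task names are stand-ins, not the plugin's actual implementation): a reset scheduled at timeout / 2 can never fire before a kill whose delay is shorter.

import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class TickVsKillerRace {
    public static void main(String[] args) throws InterruptedException {
        ScheduledExecutorService scheduler = Executors.newScheduledThreadPool(2);
        long timeoutMs = 15_000;   // full timeout: 15 sec
        long remainingMs = 3_300;  // remaining delay after the restart: 3.3 sec

        // Tick: would reset the timer, but only runs at timeout / 2 = 7.5 sec.
        scheduler.schedule(() -> System.out.println("Tick: reset timer"),
                timeoutMs / 2, TimeUnit.MILLISECONDS);

        // Killer: interrupts the body once the remaining delay elapses.
        scheduler.schedule(() -> System.out.println("Killer: interrupt the body"),
                remainingMs, TimeUnit.MILLISECONDS);

        Thread.sleep(timeoutMs);
        scheduler.shutdown(); // the Killer always printed first; the Tick never got a chance to reset
    }
}

With the numbers above (3.3 sec vs. 7.5 sec, mirroring the log excerpt), the Killer always wins even though the build is still producing output.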

There is a TODO: if channel == null, we can safely ResetTimer.call synchronously from eol and skip the Tick. Following this recommendation, we can reset the timer synchronously instead of scheduling the reset at timeout / 2, which by itself fixes the flakiness (a sketch of the idea follows).
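A minimal sketch of that approach, under simplifying assumptions: ResettingStream, resetTimerSynchronously and scheduleTick are hypothetical names, and the real logic lives in TimeoutStepExecution's console log filter rather than a standalone class.

import hudson.console.LineTransformationOutputStream;
import hudson.remoting.VirtualChannel;
import java.io.IOException;
import java.io.OutputStream;

class ResettingStream extends LineTransformationOutputStream {
    private final OutputStream out;
    private final VirtualChannel channel; // null when the log is produced on the controller

    ResettingStream(OutputStream out, VirtualChannel channel) {
        this.out = out;
        this.channel = channel;
    }

    @Override protected void eol(byte[] b, int len) throws IOException {
        out.write(b, 0, len); // pass the line through unchanged
        if (channel == null) {
            resetTimerSynchronously(); // safe: no remoting round trip involved
        } else {
            scheduleTick(); // remote case keeps the periodic Tick at timeout / 2
        }
    }

    @Override public void flush() throws IOException { out.flush(); }
    @Override public void close() throws IOException { super.close(); out.close(); }

    private void resetTimerSynchronously() { /* hypothetical: look up the execution and reset its timeout */ }
    private void scheduleTick() { /* hypothetical: schedule a Tick that resets at timeout / 2 */ }
}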

@Pldi23 changed the title from "call ResetTimer synchronously from eol and skip the Tick when channel is null" to "Call ResetTimer synchronously from eol and skip the Tick when channel is null" on Aug 10, 2022
@Pldi23 (Contributor, Author) commented Aug 10, 2022

@jglick Could you please review? (I tried to apply a fix with minimal impact, so as not to break anything.)

@jglick added the bug label on Aug 11, 2022
@jglick (Member) left a comment

OK I guess? I have no memory of writing that comment. Does it seem to work?

@@ -336,11 +336,13 @@ private Object readResolve() {
        @Override
        public OutputStream decorateLogger(@SuppressWarnings("rawtypes") Run build, final OutputStream logger)
                throws IOException, InterruptedException {
            // TODO if channel == null, we can safely ResetTimer.call synchronously from eol and skip the Tick
@jglick (Member) commented on this line:

was from #74 FTR

@Pldi23 (Contributor, Author) commented Aug 11, 2022

> OK I guess? I have no memory of writing that comment. Does it seem to work?

The flakiness is gone with this change, and all other tests pass. Just in case, I also manually tested the case described in https://issues.jenkins.io/browse/JENKINS-54078?focusedCommentId=351432&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-351432. So yes, it seems to work.

@jglick jglick merged commit 43d80fe into jenkinsci:master Aug 11, 2022
@chwehrli commented:

After upgrading to v991 (which includes this PR), we are now facing CPU pressure and eventually the Jenkins pod is terminated.
I believe this might be caused by some of our jobs producing massive logs (>50k lines) within minutes, hence now triggering lots of ResetTimer calls:

org.jenkinsci.plugins.workflow.steps.TimeoutStepExecution$ResetTimer.call(TimeoutStepExecution.java:308)
org.jenkinsci.plugins.workflow.steps.TimeoutStepExecution$ConsoleLogFilterImpl2$1.eol(TimeoutStepExecution.java:344)
hudson.console.LineTransformationOutputStream.eol(LineTransformationOutputStream.java:61)

JavaMelody__jenkins-0_8_29_22.pdf

threadDump.txt

Any suggestions?
Please let me know if you want me to report a bug for this issue.

Many thanks!
Christian

@tarioch commented Aug 29, 2022

Maybe worth adding: the jobs run with a timeout of 2 h, so if I understand correctly, the reset call previously happened only every 1 h (timeout / 2) and now happens on each log line.

@jglick (Member) commented Aug 29, 2022

> The issue can be reproduced locally (the test crashes quite often): after a restart, when the timeout is set to expire in less than 7.5 sec, the test fails.

via

diff --git src/test/java/org/jenkinsci/plugins/workflow/steps/TimeoutStepTest.java src/test/java/org/jenkinsci/plugins/workflow/steps/TimeoutStepTest.java
index f18f476..faea2f4 100644
--- src/test/java/org/jenkinsci/plugins/workflow/steps/TimeoutStepTest.java
+++ src/test/java/org/jenkinsci/plugins/workflow/steps/TimeoutStepTest.java
@@ -219,6 +219,7 @@ public class TimeoutStepTest {
                 WorkflowRun b = p.scheduleBuild2(0).getStartCondition().get();
                 SemaphoreStep.waitForStart("restarted/1", b);
         });
+        Thread.sleep(10_000);
         sessions.then(j -> {
                 WorkflowJob p = j.jenkins.getItemByFullName("restarted", WorkflowJob.class);
                 WorkflowRun b = p.getBuildByNumber(1);

@jglick (Member) commented Aug 29, 2022

@chwehrli @tarioch thanks for the report and sorry for any inconvenience. I believe #234 should fix the performance issue as well as address the original problem motivating this PR. If you want to test prior to release, that would be great; just download a workflow-basic-steps.hpi from the Incrementals link of #234, assuming tests pass.
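For a rough idea of how such a fix could look (a sketch only, under the assumption that per-line synchronous resets are the bottleneck; not necessarily what #234 actually implements), the reset could be rate-limited so a chatty build does not pay the cost on every line. ThrottledReset, QUIET_MS and maybeReset are hypothetical names:

import java.util.concurrent.atomic.AtomicLong;

class ThrottledReset {
    private static final long QUIET_MS = 5_000; // illustrative threshold between real resets
    private final AtomicLong lastResetMs = new AtomicLong();

    /** Called from eol; performs the actual reset at most once per QUIET_MS. */
    void maybeReset(Runnable reset) {
        long now = System.currentTimeMillis();
        long last = lastResetMs.get();
        if (now - last >= QUIET_MS && lastResetMs.compareAndSet(last, now)) {
            reset.run();
        }
    }
}

The compare-and-set keeps the per-line check cheap and lock-free, so even a build emitting tens of thousands of lines per minute performs at most one real reset per quiet window.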

@chwehrli commented:

Thank you very much for that ultra-fast response @jglick !
I've just installed workflow-basic-steps-992.v85b_6c33ca_6e0.hpi - will report later how it behaves...

@jglick (Member) commented Aug 30, 2022

Thanks. Please use #234 for further comments.
