[JENKINS-59793] Avoid hanging jobs with faulty SubTasks #4346

Merged
merged 6 commits into from Nov 13, 2019

Conversation

@MRamonLeon
Contributor

MRamonLeon commented Nov 7, 2019

See JENKINS-59793.

I have created a test to reproduce the state reached by several private Jenkins instances that have thousands of Metrics Plugin threads (QueueSubTaskMetrics) running. The cause is that if a job has faulty SubTasks (ones that throw an unhandled exception), the future object is not set in the WorkUnitContext#synchronizeEnd method. The same may happen in WorkUnitContext#synchronizeStart, but I have not reproduced that case.

Any code that waits for the start or the end of the build will then wait forever. The Metrics Plugin is affected by this here:

Called from: https://github.com/jenkinsci/metrics-plugin/blob/24bf92ebd59095d8f77b2552696b3f210a024bdc/src/main/java/jenkins/metrics/impl/JenkinsMetricProviderImpl.java#L944

I will let the CI fail and then I will push the fix.
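
The failure mode described above can be illustrated outside Jenkins. The following is a minimal sketch, not the patch in this PR: it uses `java.util.concurrent.CompletableFuture` as a stand-in for the future that `WorkUnitContext` exposes, and the class and method names (`FaultySubTaskSketch`, `runUnsafe`, `runSafe`) are hypothetical. It assumes the hang comes down to a completion call being skipped when the subtask throws.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical illustration only; none of these names exist in Jenkins core.
class FaultySubTaskSketch {

    // Unsafe: if subTask throws, complete() is never reached and waiters block.
    static CompletableFuture<String> runUnsafe(Runnable subTask) {
        CompletableFuture<String> end = new CompletableFuture<>();
        Thread worker = new Thread(() -> {
            subTask.run();
            end.complete("finished");
        });
        // Keep the demo output quiet; the exception is deliberately lost here.
        worker.setUncaughtExceptionHandler((thread, problem) -> { });
        worker.start();
        return end;
    }

    // Safer: every exit path completes the future, so waiters always wake up.
    static CompletableFuture<String> runSafe(Runnable subTask) {
        CompletableFuture<String> end = new CompletableFuture<>();
        Thread worker = new Thread(() -> {
            try {
                subTask.run();
                end.complete("finished");
            } catch (Throwable problem) {
                end.completeExceptionally(problem);
            }
        });
        worker.start();
        return end;
    }

    public static void main(String[] args) throws Exception {
        Runnable faulty = () -> {
            throw new ArrayIndexOutOfBoundsException("My unexpected exception");
        };
        try {
            runUnsafe(faulty).get(1, TimeUnit.SECONDS);
        } catch (TimeoutException e) {
            System.out.println("unsafe: the waiter gave up after the timeout");
        }
        try {
            runSafe(faulty).get(1, TimeUnit.SECONDS);
        } catch (ExecutionException e) {
            System.out.println("safe: the waiter was released with " + e.getCause());
        }
    }
}
```

A caller that uses an unbounded `get()` on the unsafe variant would never return, which matches the pile-up of waiting threads reported for the Metrics Plugin.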

Proposed changelog entries

  • Entry 1: Improve resilience to faulty subtask contributors, which could keep a build running forever.

Submitter checklist

  • JIRA issue is well described
  • Changelog entry appropriate for the audience affected by the change (users or developer, depending on the change). Examples
    * Use the Internal: prefix if the change has no user-visible impact (API, test frameworks, etc.)
  • Appropriate autotests or explanation to why this change has no tests
  • [N/A] For dependency updates: links to external changelogs and, if possible, full diffs

Desired reviewers

@daniel-beck @varyvol @batmat @alecharp @rsandell @fcojfernandez @stephenc

@MRamonLeon

Contributor Author

MRamonLeon commented Nov 7, 2019

The failure in the new test without the fix in the code: https://ci.jenkins.io/blue/organizations/jenkins/Core%2Fjenkins/detail/PR-4346/1/tests

Screenshot from 2019-11-07 20-50-24

Stacktrace
java.util.concurrent.TimeoutException
	at hudson.remoting.AsyncFutureImpl.get(AsyncFutureImpl.java:97)
	at hudson.model.queue.BuildKeepsRunningWhenFaultySubTasksTest.buildDoesntFinishWhenSubTaskFails(BuildKeepsRunningWhenFaultySubTasksTest.java:46)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.base/java.lang.reflect.Method.invoke(Method.java:566)
	at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
	at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
	at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
	at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
	at org.jvnet.hudson.test.JenkinsRule$1.evaluate(JenkinsRule.java:600)
	at org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:298)
	at org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:292)
	at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
	at java.base/java.lang.Thread.run(Thread.java:834)
...
  10.292 [id=32]	SEVERE	hudson.model.Executor#finish1: Executor threw an exception
java.lang.ArrayIndexOutOfBoundsException: My unexpected exception
	at hudson.model.queue.BuildKeepsRunningWhenFaultySubTasksTest$FailingSubTaskContributor$1$1.run(BuildKeepsRunningWhenFaultySubTasksTest.java:65)
	at hudson.model.ResourceController.execute(ResourceController.java:97)
	at hudson.model.Executor.run(Executor.java:427)
Contributor

res0nance left a comment

LGTM 👍

Contributor

varyvol left a comment

For all reviewers: note that this issue has also been happening on http://ci.jenkins.io

@MRamonLeon my doubt here is: why has the issue started appearing now? The metrics plugin does not seem to have had important changes related to this.


// We get stalled waiting the finalization of the job
FreeStyleBuild build = future.get(5, TimeUnit.SECONDS);
assertTrue("The message is printed in the log", logs.getRecords().stream().anyMatch(r -> r.getThrown().getMessage().equals(ERROR_MESSAGE)));

@varyvol

varyvol Nov 8, 2019

Contributor

Not sure this is really necessary (reaching this point would be enough to assert the fix?) and you could save all the logger rules.

@MRamonLeon

MRamonLeon Nov 8, 2019

Author Contributor

Yes, it's not needed if the code gets past the get. It's a leftover from the many tests I've been running. I can remove it.

// When using SubTaskContributor (FailingSubTaskContributor) the build never ends
@Test
@Issue("JENKINS-59793")
public void buildDoesntFinishWhenSubTaskFails() throws Exception {

@varyvol

varyvol Nov 8, 2019

Contributor
Suggested change
- public void buildDoesntFinishWhenSubTaskFails() throws Exception {
+ public void buildFinishesWhenSubTaskFails() throws Exception {

I think the name is appropriate before the fix, but after the fix it should say what it's actually asserting.

@MRamonLeon

MRamonLeon Nov 8, 2019

Author Contributor

Pushed with the other change

assertThat("Build should be actually scheduled by Jenkins", future, notNullValue());

// We get stalled waiting the finalization of the job
FreeStyleBuild build = future.get(5, TimeUnit.SECONDS);

@varyvol

varyvol Nov 8, 2019

Contributor

Should you also check the build result is a failure?

@MRamonLeon

MRamonLeon Nov 8, 2019

Author Contributor

No need to assert that, the purpose of the test is to guarantee we pass this point, so let's keep the minimum needed as agreed above.
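
The point agreed in this thread, that getting past the bounded `get` is itself the assertion, can be sketched generically. This is an illustration under assumptions, not the test in this PR: `BoundedWaitSketch` and `finishesWithin` are hypothetical names, and a plain `java.util.concurrent.Future` stands in for the build future that Jenkins returns when a job is scheduled.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper; it only checks that the wait ends, not how it ends.
class BoundedWaitSketch {

    static boolean finishesWithin(Future<?> future, long timeout, TimeUnit unit)
            throws InterruptedException {
        try {
            future.get(timeout, unit);
            return true;   // finished normally
        } catch (ExecutionException e) {
            return true;   // finished with a failure, but the waiter was released
        } catch (TimeoutException e) {
            return false;  // still running: the hang this PR is about
        }
    }
}
```

Whether the outcome is a success or a failure is deliberately ignored here, which mirrors the decision above to keep the test to the minimum needed.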

@varyvol
varyvol approved these changes Nov 8, 2019
@timja
timja approved these changes Nov 10, 2019
Member

stephenc left a comment

Some suggestions for even more robustness and perhaps covering another edge case.

Your change makes things better (hence approving), so if you would rather merge your change first to get an improvement out and then examine my proposal in a subsequent PR, that's totally fine.

Co-authored-by: Stephen Connolly <stephen.alan.connolly@gmail.com>
@MRamonLeon

Contributor Author

MRamonLeon commented Nov 11, 2019

Added @stephenc's suggestions

Member

stephenc left a comment

LGTM

@stephenc

Unclear what you gain by adding an if clause and a local variable. The !future.start.isDone() ensures we only set once and never override
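
The set-once guard mentioned in this comment can be sketched as follows. This is an assumption-laden illustration rather than the code under review: `SetOnceSketch` and `failIfPending` are hypothetical names, and `CompletableFuture` stands in for the future type used by `WorkUnitContext`. `CompletableFuture` already ignores later completions, so the `isDone()` check here only illustrates the pattern.

```java
import java.util.concurrent.CompletableFuture;

// Hypothetical helper showing an isDone() guard in front of a completion call.
class SetOnceSketch {

    // Records the failure only if nothing completed the future earlier.
    // Returns true when this call was the one that completed it.
    static boolean failIfPending(CompletableFuture<String> future, Throwable problem) {
        if (!future.isDone()) {
            // completeExceptionally is atomic, so a racing completion still wins
            // cleanly and this call then returns false.
            return future.completeExceptionally(problem);
        }
        return false;
    }
}
```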

@stephenc

This would seem better handled in the catch block as it was originally

@MRamonLeon

Contributor Author

MRamonLeon commented Nov 12, 2019

I had to roll back to the original fix because the test was failing. Jenkins was shutting down after an unexpected exception in the test. I couldn't figure out why it was happening, so it's better to let this PR get merged and improve it in a follow-up PR if desired.

@oleg-nenashev

Member

oleg-nenashev commented Nov 12, 2019

I plan to merge it tomorrow if there is no negative feedback.

@batmat

Member

batmat commented Nov 13, 2019

OK, moving on given no negative feedback since yesterday.

Thank you everyone!

@batmat batmat merged commit 5f81077 into jenkinsci:master Nov 13, 2019
1 check passed
continuous-integration/jenkins/pr-merge This commit looks good
@MRamonLeon MRamonLeon deleted the MRamonLeon:JENKINS-59793 branch Nov 18, 2019
olivergondza added a commit that referenced this pull request Nov 26, 2019
[JENKINS-59793] Avoid hanging jobs with faulty SubTasks

(cherry picked from commit 5f81077)