[JENKINS-59793] Avoid hanging jobs with faulty SubTasks #4346

Merged 6 commits into jenkinsci:master on Nov 13, 2019

@MRamonLeon MRamonLeon commented Nov 7, 2019

See JENKINS-59793.

I have created a test that reproduces the state reached by several private Jenkins instances, which have thousands of Metrics Plugin threads (QueueSubTaskMetrics) running. The cause is that if a job has faulty SubTasks (ones that throw an unhandled exception), the future object is not set in the WorkUnitContext#synchronizeEnd method. The same may happen in WorkUnitContext#synchronizeStart, but I have not reproduced that case.

Any code that waits for the start or the end of the build will wait forever. The Metrics Plugin is affected by this here:

Called from: https://github.com/jenkinsci/metrics-plugin/blob/24bf92ebd59095d8f77b2552696b3f210a024bdc/src/main/java/jenkins/metrics/impl/JenkinsMetricProviderImpl.java#L944

I will let the CI fail and then I will push the fix.
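
The hang described above can be illustrated with plain JDK types. This is only a minimal sketch under stated assumptions: FaultySubTaskSketch, runSubTask, awaitOutcome and the reportFailure flag are hypothetical names made up for illustration, and CompletableFuture stands in for the future used by WorkUnitContext. It is not the Jenkins code and not necessarily how the fix in this PR is written.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Illustrative only: plain JDK types, not the Jenkins WorkUnitContext code.
final class FaultySubTaskSketch {

    /** Runs a sub task on a new thread and returns a future for the end of the "build". */
    static CompletableFuture<String> runSubTask(Runnable subTask, boolean reportFailure) {
        CompletableFuture<String> end = new CompletableFuture<>();
        Thread worker = new Thread(() -> {
            try {
                subTask.run();
                end.complete("finished");
            } catch (RuntimeException e) {
                if (reportFailure) {
                    // Resilient variant: waiters are released with the failure.
                    end.completeExceptionally(e);
                }
                // Otherwise the future is never set and waiters block forever.
            }
        });
        worker.setDaemon(true);
        worker.start();
        return end;
    }

    /** Waits up to one second and describes what the waiter observed. */
    static String awaitOutcome(boolean reportFailure) {
        Runnable faulty = () -> {
            throw new ArrayIndexOutOfBoundsException("My unexpected exception");
        };
        try {
            return runSubTask(faulty, reportFailure).get(1, TimeUnit.SECONDS);
        } catch (TimeoutException e) {
            return "timed out";
        } catch (ExecutionException e) {
            return "failed: " + e.getCause().getMessage();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return "interrupted";
        }
    }

    public static void main(String[] args) {
        System.out.println(awaitOutcome(false)); // future never set
        System.out.println(awaitOutcome(true));  // future set with the failure
    }
}
```

With reportFailure set to false the waiter only gets out because of the one-second timeout; a caller using an unbounded get() would stay blocked, which is the situation reported for the Metrics Plugin threads.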

Proposed changelog entries

  • Entry 1: Improve resilience to faulty subtask contributors, which could keep a build running forever.

Submitter checklist

  • JIRA issue is well described
  • Changelog entry appropriate for the audience affected by the change (users or developer, depending on the change). Examples
    * Use the Internal: prefix if the change has no user-visible impact (API, test frameworks, etc.)
  • Appropriate autotests or explanation to why this change has no tests
  • [N/A] For dependency updates: links to external changelogs and, if possible, full diffs

Desired reviewers

@daniel-beck @varyvol @batmat @alecharp @rsandell @fcojfernandez @stephenc

@MRamonLeon MRamonLeon added the bug For changelog: Minor bug. Will be listed after features label Nov 7, 2019
@MRamonLeon
Contributor Author

MRamonLeon commented Nov 7, 2019

The failure in the new test without the fix in the code: https://ci.jenkins.io/blue/organizations/jenkins/Core%2Fjenkins/detail/PR-4346/1/tests

Screenshot from 2019-11-07 20-50-24

Stacktrace
java.util.concurrent.TimeoutException
	at hudson.remoting.AsyncFutureImpl.get(AsyncFutureImpl.java:97)
	at hudson.model.queue.BuildKeepsRunningWhenFaultySubTasksTest.buildDoesntFinishWhenSubTaskFails(BuildKeepsRunningWhenFaultySubTasksTest.java:46)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.base/java.lang.reflect.Method.invoke(Method.java:566)
	at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
	at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
	at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
	at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
	at org.jvnet.hudson.test.JenkinsRule$1.evaluate(JenkinsRule.java:600)
	at org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:298)
	at org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:292)
	at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
	at java.base/java.lang.Thread.run(Thread.java:834)
...
  10.292 [id=32]	SEVERE	hudson.model.Executor#finish1: Executor threw an exception
java.lang.ArrayIndexOutOfBoundsException: My unexpected exception
	at hudson.model.queue.BuildKeepsRunningWhenFaultySubTasksTest$FailingSubTaskContributor$1$1.run(BuildKeepsRunningWhenFaultySubTasksTest.java:65)
	at hudson.model.ResourceController.execute(ResourceController.java:97)
	at hudson.model.Executor.run(Executor.java:427)

Contributor

@res0nance res0nance left a comment

LGTM 👍

@varyvol varyvol left a comment

For all reviewers: note that this issue has also been happening on http://ci.jenkins.io

@MRamonLeon my question here is: why has the issue started appearing now? The Metrics plugin does not seem to have had any important changes related to this.


// We get stalled waiting the finalization of the job
FreeStyleBuild build = future.get(5, TimeUnit.SECONDS);
assertTrue("The message is printed in the log", logs.getRecords().stream().anyMatch(r -> r.getThrown().getMessage().equals(ERROR_MESSAGE)));

I'm not sure this assertion is really necessary (wouldn't reaching this point be enough to assert the fix?), and dropping it would let you remove all the logger rules.

Contributor Author

Yes, it is not needed if the code gets past the get. It's a leftover from the many tests I've been running. I can remove it.
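
As a side note on the log assertion discussed in this thread: written as r.getThrown().getMessage().equals(ERROR_MESSAGE), it would throw a NullPointerException for any record that carries no throwable. Below is a minimal null-safe sketch using only java.util.logging. LogRecordCheckSketch and anyThrownWithMessage are hypothetical names, and the list of records stands in for what logs.getRecords() returns, which is assumed here to be a List of LogRecord.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Objects;
import java.util.logging.Level;
import java.util.logging.LogRecord;

// Illustrative only: a null-safe variant of the log check, using JDK types.
final class LogRecordCheckSketch {

    /** True if any record carries a throwable whose message equals the expected text. */
    static boolean anyThrownWithMessage(List<LogRecord> records, String expected) {
        return records.stream()
                .map(LogRecord::getThrown)      // may be null for ordinary records
                .filter(Objects::nonNull)       // skip records without a throwable
                .anyMatch(t -> expected.equals(t.getMessage())); // message may be null too
    }

    public static void main(String[] args) {
        LogRecord plain = new LogRecord(Level.INFO, "no throwable attached");
        LogRecord severe = new LogRecord(Level.SEVERE, "Executor threw an exception");
        severe.setThrown(new ArrayIndexOutOfBoundsException("My unexpected exception"));
        System.out.println(
                anyThrownWithMessage(Arrays.asList(plain, severe), "My unexpected exception"));
    }
}
```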

// When using SubTaskContributor (FailingSubTaskContributor) the build never ends
@Test
@Issue("JENKINS-59793")
public void buildDoesntFinishWhenSubTaskFails() throws Exception {

Suggested change
- public void buildDoesntFinishWhenSubTaskFails() throws Exception {
+ public void buildFinishesWhenSubTaskFails() throws Exception {

I think the name is appropriate before the fix, but after the fix it should say what the test is actually asserting.

Contributor Author

Pushed with the other change

assertThat("Build should be actually scheduled by Jenkins", future, notNullValue());

// We get stalled waiting the finalization of the job
FreeStyleBuild build = future.get(5, TimeUnit.SECONDS);

Should you also check the build result is a failure?

Contributor Author

No need to assert that, the purpose of the test is to guarantee we pass this point, so let's keep the minimum needed as agreed above.

@MRamonLeon MRamonLeon added the needs-more-reviews Complex change, which would benefit from more eyes label Nov 8, 2019
Member

@stephenc stephenc left a comment

Some suggestions for even more robustness and perhaps covering another edge case.

Your change makes things better (hence approving) so if you would rather merge your change first to get an improvement out and then examine my proposal in a subsequent PR that's totally fine

@oleg-nenashev oleg-nenashev removed the needs-more-reviews Complex change, which would benefit from more eyes label Nov 11, 2019
Co-authored-by: Stephen Connolly <stephen.alan.connolly@gmail.com>
@MRamonLeon
Contributor Author

MRamonLeon commented Nov 11, 2019

Added @stephenc's suggestions

Member

@stephenc stephenc left a comment

LGTM

@MRamonLeon
Contributor Author

I had to roll back to the original fix because the test was failing: Jenkins was shutting down after an unexpected exception in the test. I couldn't figure out why that was happening, so it is better to let this PR get merged and improve it in a follow-up PR if desired.

@MRamonLeon MRamonLeon added the ready-for-merge The PR is ready to go, and it will be merged soon if there is no negative feedback label Nov 12, 2019
@oleg-nenashev
Member

I plan to merge it tomorrow if there is no negative feedback.

@batmat
Member

batmat commented Nov 13, 2019

OK, moving on given no negative feedback since yesterday.

Thank you everyone!

@batmat batmat merged commit 5f81077 into jenkinsci:master Nov 13, 2019
@MRamonLeon MRamonLeon deleted the JENKINS-59793 branch November 18, 2019 10:30
olivergondza pushed a commit that referenced this pull request Nov 26, 2019
[JENKINS-59793] Avoid hanging jobs with faulty SubTasks

(cherry picked from commit 5f81077)