Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[JENKINS-57304] - Add call to oneOffExecutors.remove in Computer.removeExecutor #4329

Merged
merged 1 commit into from Dec 3, 2019
Merged

Conversation

lgoenner
Copy link
Contributor

@lgoenner lgoenner commented Oct 28, 2019

This PR attempts to stop "zombie executors" from appearing in the status list.

There are several issues describing this problem, for instance JENKINS-55484, JENKINS-55811, JENKINS-57304. I am not certain that this PR fixes all of these issues.

It is possible to reliably create these zombie executors:

  • Install Jenkins with SSH Slave + Pipeline plugins
  • Configure an SSH Slave and set amount of master executors to 0
  • Create a pipeline job
  • Turn off (temporarily mark offline) the master node & trigger the pipeline job
  • Re-enable the master node
  • One zombie executor for each time the build now button was pressed has been created

The issue is the one-off executor not being removed in

owner.removeExecutor(this);
The easiest fix would be to also remove items from oneOffExecutors.

Proposed changelog entries

  • Computer.removeExecutor now also removes one-off executors

@oleg-nenashev oleg-nenashev requested a review from a team November 11, 2019 11:20
Copy link
Member

@oleg-nenashev oleg-nenashev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it also make sense to check the executor type before calling the method? Just for performance reasons, no strong opinion

@oleg-nenashev
Copy link
Member

AFAICT https://issues.jenkins-ci.org/browse/JENKINS-57304 is going to be fixed by this change, at least partially.

@oleg-nenashev oleg-nenashev changed the title Add call to oneOffExecutors.remove in Computer.removeExecutor [JENKINS-57304] - Add call to oneOffExecutors.remove in Computer.removeExecutor Nov 11, 2019
@oleg-nenashev oleg-nenashev added the bug For changelog: Minor bug. Will be listed after features label Nov 12, 2019
@oleg-nenashev oleg-nenashev requested a review from a team November 12, 2019 14:20
@oleg-nenashev
Copy link
Member

I plan to merge it tomorrow if no negative feedback

@oleg-nenashev oleg-nenashev added the ready-for-merge The PR is ready to go, and it will be merged soon if there is no negative feedback label Nov 13, 2019
Copy link
Member

@jglick jglick left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This may solve symptoms, but the fix does not look quite right. oneOffExecutors is adjusted in startFlyWeightTask and remove(OneOffExecutor), the latter of which is called from Executor.finish2 whenever this instanceof OneOffExecutor. Perhaps finish2 is not being called reliably—there are other places in interrupt and run which bypass this—which could be breaking other things as well. I suspect

if (!owner.isOnline()) {
resetWorkUnit("went off-line before the task's worker thread started");
owner.removeExecutor(this);
queue.scheduleMaintenance();
return;
}
for example.

The steps to reproduce are valuable and should make it possible to write a test reproducing the problem and verifying the fix.

(There is probably a more basic question here as to why it is even possible for the master node to be marked offline to begin with. I think some node monitors treat things like a lack of disk space on the master as a reason to mark it “offline”, which seems like an abuse of the API since controlling, say, a Pipeline build is just one of dozens of things the master may be doing which may require disk I/O. Really it should be up to some lower service wrapper layer to decide when Jenkins is out of hardware resources and should simply be shut down.)

@jglick jglick removed the ready-for-merge The PR is ready to go, and it will be merged soon if there is no negative feedback label Nov 13, 2019
@fcojfernandez
Copy link
Contributor

@jglick Executor.run is not executing finish2 when the node is offline (the code you linked). Do you mean the Executor.run method should something like

 if (!owner.isOnline()) { 
     resetWorkUnit("went off-line before the task's worker thread started"); 
     finish2();
     return; 
 } 

?
And similar with

if (!started) {
// not yet started, so simply dispose this
owner.removeExecutor(this);
return;
}
?

@MRamonLeon
Copy link
Contributor

I agree that a test to reproduce the problem would be needed

@jglick
Copy link
Member

jglick commented Nov 18, 2019

Something like that I guess—it should be safer and clearer to ensure that finish2 is called under all circumstances. May suffice to rearrange try-finally blocks a bit?

@jglick
Copy link
Member

jglick commented Nov 18, 2019

To be clear, I am not trying to block this PR, just raising some questions which might be used to improve it a bit.

@fcojfernandez
Copy link
Contributor

@lgoenner please, have a look at @jglick 's comments. They makes sense. Please, do not hesitate if you have any doubt.

@MRamonLeon
Copy link
Contributor

MRamonLeon commented Nov 21, 2019

I already strengthened the queue and I was looking into finally2 and its called methods a lot. Maybe it would be good to check if the problem is fixed after the changes in #4346. If not, maybe you can get inspiration from the test written there.

@MRamonLeon
Copy link
Contributor

MRamonLeon commented Dec 3, 2019

@fcojfernandez has confirmed that the issue is solved with this PR. I already strengthened the finally2 method in the mentioned PR (#4346) and @jglick is not blocking the PR actually. The issue is becoming critical to some CloudBees customers and the test could be hard to do, the PR already has 3 approvals (+ 1 mine), could you @oleg-nenashev, please, go forward and merge it? Thank you.

@MRamonLeon MRamonLeon added the ready-for-merge The PR is ready to go, and it will be merged soon if there is no negative feedback label Dec 3, 2019
@batmat
Copy link
Member

batmat commented Dec 3, 2019

Given existing approvals, and Ramon's summary, I think we can move forward. This change seems to improve the situation. Given Jesse's comments are also said to be more "to improve it a bit.", I think we can as needed always file followups to strengthen things even more, if deemed necesary.

I plan to merge this some time later today if nobody objects.

Thank you everyone for the reviews!

@oleg-nenashev
Copy link
Member

I agree with merging

@MRamonLeon
Copy link
Contributor

@batmat batmat merged commit cb424f3 into jenkinsci:master Dec 3, 2019
@batmat
Copy link
Member

batmat commented Dec 3, 2019

Thank you everyone! Especially @lgoenner ❤️

olivergondza pushed a commit that referenced this pull request Jan 13, 2020
[JENKINS-57304] - Add call to oneOffExecutors.remove in Computer.removeExecutor

(cherry picked from commit cb424f3)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug For changelog: Minor bug. Will be listed after features ready-for-merge The PR is ready to go, and it will be merged soon if there is no negative feedback
Projects
None yet
9 participants