SlaveComputer.getRetentionStrategy mistake led to launches of unused agent JVMs #3701

Merged
merged 1 commit into from Oct 25, 2018

Member

jglick commented Oct 24, 2018

While playing with a scenario in which mock-slave was configured with a Cloud producing agents using CloudRetentionStrategy, I found a strange problem. After running and then terminating a highly parallel build, the now-idle agents would be disconnected after a 1min delay, as expected. Yet the vagaries of NodeProvisioner meant that more agents were initially connected than actually needed; and after all agents were apparently disconnected from the master, according to the executor widget etc., there would still be lots of remoting.jar processes running. The clue to the problem was that the master log was printing a message from SlaveComputer.tryReconnect, a method which is only called by RetentionStrategy.Always and never by CloudRetentionStrategy! It seems that after the cleanup there were a bunch of stranded SlaveComputers with no associated Slaves (the two can get temporarily out of sync). When getRetentionStrategy was called on one of these, getNode was null, so it fell back to INSTANCE (effectively Always), which then tried to reconnect the computer even though it was supposed to be dying. After the relaunch, nothing was left that would find and terminate the rogue computer connections.

After restarting Jenkins with this fix applied, the problem went away. The number of agent JVMs actually running reliably matched the number of online agents according to the Jenkins web UI.
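
The failure mode described above can be illustrated with a small self-contained model. This is only a sketch under stated assumptions, not the Jenkins source: `Strategy`, `Node`, `Computer`, and the `ALWAYS`, `CLOUD`, and `NOOP` constants are hypothetical, simplified stand-ins for `RetentionStrategy`, `Slave`, `SlaveComputer`, and their strategies, and the do-nothing fallback is an assumed shape for the fix, which this description does not spell out.

```java
// Simplified model of the bug; every type and name here is a hypothetical stand-in,
// not the real Jenkins API.
public class RetentionFallbackSketch {

    /** Stand-in for a retention strategy: invoked periodically for each computer. */
    interface Strategy {
        void check(Computer c);
    }

    /** Models Always: relaunch the agent whenever it is found offline. */
    static final Strategy ALWAYS = c -> {
        if (!c.online) {
            c.launch();
        }
    };

    /** Models a cloud strategy: disconnect an idle agent (the idle timeout is omitted). */
    static final Strategy CLOUD = c -> c.disconnect();

    /** Assumed safe fallback: do nothing for a computer that has no node. */
    static final Strategy NOOP = c -> { };

    /** Stand-in for the configured node, which owns the real strategy. */
    static final class Node {
        final Strategy strategy;

        Node(Strategy strategy) {
            this.strategy = strategy;
        }
    }

    /** Stand-in for the computer, which may briefly outlive its node. */
    static final class Computer {
        Node node;               // becomes null once the node has been removed
        final Strategy fallback; // strategy used when node == null
        boolean online = true;
        int launches = 0;

        Computer(Node node, Strategy fallback) {
            this.node = node;
            this.fallback = fallback;
        }

        Strategy getRetentionStrategy() {
            return node == null ? fallback : node.strategy;
        }

        void disconnect() {
            online = false;
        }

        void launch() {
            online = true;
            launches++;
        }
    }

    /** Counts the agent launches triggered after the node is removed. */
    static int launchesAfterNodeRemoved(Strategy fallback) {
        Computer c = new Computer(new Node(CLOUD), fallback);
        c.getRetentionStrategy().check(c); // cloud strategy disconnects the idle agent
        c.node = null;                     // cleanup removes the node; the computer is stranded
        c.getRetentionStrategy().check(c); // the next periodic check uses the fallback
        return c.launches;
    }

    public static void main(String[] args) {
        System.out.println("fallback ALWAYS: " + launchesAfterNodeRemoved(ALWAYS) + " relaunch(es)");
        System.out.println("fallback NOOP:   " + launchesAfterNodeRemoved(NOOP) + " relaunch(es)");
    }
}
```

In this model, the `ALWAYS` fallback relaunches the stranded computer once, while the `NOOP` fallback leaves it disconnected. The model makes only the narrow point that whatever strategy is returned for a computer without a node must not be one that reconnects.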

Proposed changelog entries

  • Under some conditions when using elastic agents (clouds), agent JVMs could be incorrectly relaunched and never terminated.
SlaveComputer.getRetentionStrategy was using Always when there was no…
… node, leading to launches of unused agent JVMs.
@jglick

Member

jglick commented Oct 24, 2018

The buggy code is old: 33573ca for JENKINS-3696. I am not sure how this went unnoticed until now. Perhaps it is only an issue with certain launchers.

@jglick jglick merged commit bc39ba8 into jenkinsci:master Oct 25, 2018

2 checks passed

continuous-integration/jenkins/incrementals Deployed to Incrementals.
continuous-integration/jenkins/pr-merge This commit looks good

@jglick jglick deleted the jglick:SlaveComputer.getRetentionStrategy branch Oct 25, 2018
