Handle launcher failures by running the disconnect logic #121

Status: Open · wants to merge 1 commit into master
Conversation

nhunsperger

Launching an agent on the VM can fail, leaving the VM online but unusable.
Ensure the disconnect logic runs in this case so that the VM can be recycled.

When using RECONNECT_AND_REVERT, we need to revert regardless of power state.
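
Roughly, the shape of the change is to run the plugin's existing disconnect handling whenever the delegate launcher fails (or returns without actually connecting). The sketch below is illustrative only; the class, field, and helper names are not the plugin's real ones:

```java
import java.io.IOException;

import hudson.model.TaskListener;
import hudson.slaves.ComputerLauncher;
import hudson.slaves.SlaveComputer;

/**
 * Illustrative sketch only: if the delegate launcher fails, fall through to the
 * same disconnect/cleanup path used on a normal disconnect so the VM can be
 * reverted or recycled instead of being left online but unusable.
 */
public class FailureAwareLauncher extends ComputerLauncher {
    private final ComputerLauncher delegate; // e.g. the SSH or JNLP launcher

    public FailureAwareLauncher(ComputerLauncher delegate) {
        this.delegate = delegate;
    }

    @Override
    public void launch(SlaveComputer computer, TaskListener listener)
            throws IOException, InterruptedException {
        try {
            delegate.launch(computer, listener);
            if (!computer.isOnline()) {
                // The delegate returned without connecting: run the cleanup anyway.
                runDisconnectLogic(computer, listener);
            }
        } catch (IOException | InterruptedException | RuntimeException e) {
            // Launch failed outright: run the cleanup so the VM gets recycled.
            runDisconnectLogic(computer, listener);
            throw e;
        }
    }

    private void runDisconnectLogic(SlaveComputer computer, TaskListener listener) {
        // Hypothetical placeholder for the plugin's existing disconnect handling,
        // e.g. reverting to the snapshot regardless of power state when
        // RECONNECT_AND_REVERT is configured.
        listener.getLogger().println("Launch failed; running disconnect/cleanup for "
                + computer.getName());
    }
}
```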

@pjdarton (Member) commented Dec 1, 2020

Any changes to vSphereCloudLauncher.java are kinda fraught with peril; it's fairly fragile code, so we need to be uber-careful when touching it.
...and there isn't any unit-test coverage (as the Jenkins-CI servers don't have a vSphere cloud to test against), so it's hard to be sure that a PR doesn't break things (the automated tests will likely say "passed" regardless).

FYI, one thing I see in my own environment is that Jenkins sometimes complains that a vSphere VM failed to come online ... and then the VM comes online anyway (because the Windows VM was just being much slower than expected to start up).
I suspect that this PR would mean that, in that case, I'd see the VM killed off just before it came online, which would be a backward step for me.
i.e. what helps one person can break the world for another (e.g. #124 was needed to fix a bug caused by #117)

So, what tests have you done with this PR?
Things to consider are the various connection methods (SSH, JNLP, etc.), dynamic (cloud) provisioning (the VM is created & destroyed automatically), static provisioning (the VM is set up manually and then controlled automatically), and so on. There are a lot of possible ways of using this plugin, and it's important not to break any of them for anyone...

@nhunsperger (Author)

This change has been running in an environment I support for about 5 months. Typical volume is 1.5K-2K agents/day. The first 3.5 months were with static provisioning using the RECONNECT_AND_REVERT idle action. About 6 weeks ago I transitioned everything to dynamic provisioning with the run-once retention strategy (using my other 2 PRs merged with this one). We exclusively use SSH, so I had to rely on code analysis for JNLP.

Based on the description of your environment, it sounds like you are using JNLP. I presume the complaint you mention shows up as "EXCEPTION while starting VM" followed by "ERROR: Unexpected error in launching an agent. This is probably a bug in Jenkins" stack trace. If my assumptions are correct, it seems that you have not configured a long enough launchDelay for JNLP to reliably connect. Per the documentation for "Delay in seconds", this is "the maximum amount of time that the master will wait for the agent to come online before timing out."
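
For illustration, the behaviour that documentation describes is essentially a wait-until-online loop along these lines (purely a sketch, not the plugin's actual code; the names and poll interval are made up):

```java
import java.io.IOException;

import hudson.model.Computer;

/** Purely illustrative sketch of the "Delay in seconds" semantics quoted above. */
public class LaunchDelayExample {

    /** Waits up to launchDelaySeconds for the agent to come online, then gives up. */
    static void waitForOnline(Computer computer, int launchDelaySeconds)
            throws IOException, InterruptedException {
        long deadline = System.currentTimeMillis() + launchDelaySeconds * 1000L;
        while (!computer.isOnline()) {
            if (System.currentTimeMillis() >= deadline) {
                // This is the point where the master gives up, even though a slow
                // agent may still connect later on its own.
                throw new IOException("Agent did not come online within "
                        + launchDelaySeconds + "s");
            }
            Thread.sleep(5_000); // poll interval is illustrative
        }
    }
}
```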

I think the question therefore is, what should the vSphere plugin behavior be when the agent was not successfully connected during the allowed launch time? I feel the current behavior is not great, especially with static provisioning. In the case of SSH, it logs "Slave online", and since the agent isn't actually online, nothing else ever happens. This leaves the static node out of service until manual intervention. In the case of JNLP, it throws an exception all the way up the stack. JNLP may connect later, but the plugin already gave up and stopped tracking it against the max online count. Or JNLP may never connect, leaving us in the same bad situation as SSH. In both cases, it is effectively not handling the failure. I think a better behavior is to allow the plugin to recover the VM by using whatever idle action was configured.

With dynamic provisioning, the idle timeout checker covers up the launch failure by disconnecting the computer and then terminating the node. This PR is calling that disconnect process a bit earlier. Yes, it may get called an extra time, but the idle action is NOTHING, so it effectively becomes a NOOP to the vSphereCloudLauncher, and it is already getting called multiple times.

As an aside, the idle timeout checker is currently disconnecting nodes that are actively connecting. The timeout starts counting prior to launch, whereas I think most users would consider "idle" to mean online and not running any jobs. This is particularly problematic as the run-once retention strategy defaults to 2 minutes and doesn't take into account delayed launch caused by launchDelay. For reference, the EC2 plugin only times out online nodes. I bring this up because the need for the idle timeout to backstop launch failures could be removed if the launch mechanism handled this itself.
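
For illustration, an idle check that only starts the clock once the node is online could look roughly like this (class and field names are hypothetical, not taken from this plugin or the EC2 plugin):

```java
import java.util.concurrent.TimeUnit;
import java.util.logging.Logger;

import hudson.model.Computer;
import hudson.slaves.RetentionStrategy;

/**
 * Hypothetical sketch: only count idle time once the agent is actually online,
 * so a node that is still launching is never torn down by the "idle" check.
 */
public class OnlineOnlyIdleRetention extends RetentionStrategy<Computer> {
    private static final Logger LOGGER =
            Logger.getLogger(OnlineOnlyIdleRetention.class.getName());

    private final long idleTimeoutMinutes;

    public OnlineOnlyIdleRetention(long idleTimeoutMinutes) {
        this.idleTimeoutMinutes = idleTimeoutMinutes;
    }

    @Override
    public long check(Computer c) {
        if (!c.isOnline()) {
            // Still launching (or already down): don't count this toward the idle timeout.
            return 1;
        }
        long idleMillis = System.currentTimeMillis() - c.getIdleStartMilliseconds();
        if (c.isIdle() && idleMillis > TimeUnit.MINUTES.toMillis(idleTimeoutMinutes)) {
            // Online and genuinely idle past the timeout: this is where the existing
            // disconnect/terminate handling would run.
            LOGGER.info("Idle timeout reached for " + c.getName());
        }
        return 1; // minutes until the next check
    }
}
```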

One potential issue with this PR is that non-launcher exceptions will also trigger the disconnect logic. In some cases this may be desirable (such as the startVm call timing out after 60 seconds due to transient DRS placement slowness). In other cases, this could potentially create a loop on bad config (I'm thinking of a missing VM/snapshot). I can retool the PR to only disconnect after launcher failures, but I no longer have static provisioning load to validate with.
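
If I do retool it, the narrower shape would be roughly the sketch below, where only a failure from the delegate launcher triggers the cleanup and errors from the VM start-up step propagate untouched (all names are illustrative, not the plugin's real ones):

```java
import java.io.IOException;

import hudson.model.TaskListener;
import hudson.slaves.ComputerLauncher;
import hudson.slaves.SlaveComputer;

/**
 * Illustrative sketch of scoping the cleanup to launcher failures only: errors
 * from starting the VM (e.g. a missing VM/snapshot) propagate without cleanup,
 * so a bad config can't trigger a revert loop.
 */
public class LauncherOnlyCleanup {

    public void launchWithScopedCleanup(ComputerLauncher delegate,
                                        SlaveComputer computer,
                                        TaskListener listener)
            throws IOException, InterruptedException {
        startVm(listener); // configuration/infrastructure errors escape here, no cleanup

        try {
            delegate.launch(computer, listener);
        } catch (IOException | InterruptedException | RuntimeException e) {
            // Only a launcher failure runs the cleanup so the VM can be recycled.
            runDisconnectLogic(computer, listener);
            throw e;
        }
    }

    private void startVm(TaskListener listener) throws IOException {
        // Placeholder for powering on the VM / reverting to a snapshot.
        listener.getLogger().println("Starting VM...");
    }

    private void runDisconnectLogic(SlaveComputer computer, TaskListener listener) {
        // Placeholder for the plugin's existing disconnect handling.
        listener.getLogger().println("Launcher failed; recycling " + computer.getName());
    }
}
```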
