Handle launcher failures by running the disconnect logic #121

Status: Open · wants to merge 1 commit into master
Conversation

nhunsperger

Launching an agent on the VM can fail, leaving the VM online but unusable.
Ensure the disconnect logic runs in this case so that the VM can be recycled.

When using RECONNECT_AND_REVERT, we need to revert regardless of power state.
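
Roughly, the shape of the change is to run the plugin's existing disconnect handling whenever the delegate launcher fails (or returns without actually connecting). The sketch below is illustrative only; the class, field, and helper names are not the plugin's real ones:

```java
import java.io.IOException;

import hudson.model.TaskListener;
import hudson.slaves.ComputerLauncher;
import hudson.slaves.SlaveComputer;

/**
 * Illustrative sketch only: if the delegate launcher fails, fall through to the
 * same disconnect/cleanup path used on a normal disconnect so the VM can be
 * reverted or recycled instead of being left online but unusable.
 */
public class FailureAwareLauncher extends ComputerLauncher {
    private final ComputerLauncher delegate; // e.g. the SSH or JNLP launcher

    public FailureAwareLauncher(ComputerLauncher delegate) {
        this.delegate = delegate;
    }

    @Override
    public void launch(SlaveComputer computer, TaskListener listener)
            throws IOException, InterruptedException {
        try {
            delegate.launch(computer, listener);
            if (!computer.isOnline()) {
                // The delegate returned without connecting: run the cleanup anyway.
                runDisconnectLogic(computer, listener);
            }
        } catch (IOException | InterruptedException | RuntimeException e) {
            // Launch failed outright: run the cleanup so the VM gets recycled.
            runDisconnectLogic(computer, listener);
            throw e;
        }
    }

    private void runDisconnectLogic(SlaveComputer computer, TaskListener listener) {
        // Hypothetical placeholder for the plugin's existing disconnect handling,
        // e.g. reverting to the snapshot regardless of power state when
        // RECONNECT_AND_REVERT is configured.
        listener.getLogger().println("Launch failed; running disconnect/cleanup for "
                + computer.getName());
    }
}
```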

@pjdarton (Member) commented Dec 1, 2020

Any changes to vSphereCloudLauncher.java are kinda fraught with peril; it's fairly fragile code, so we need to be uber-careful when touching it.
...and there isn't any unit-test coverage (as the Jenkins-CI servers don't have a vSphere cloud to test against), so it's hard to be sure that a PR doesn't break things (the automated tests will likely say "passed" regardless).

FYI, one thing I see in my own environment is that Jenkins sometimes complains that a vSphere VM failed to come online ... and then the VM comes online anyway (because the Windows VM was just being much slower than expected to start up).
I suspect that this PR would mean that, in that case, I'd see the VM killed off just before it came online, which would be a backward step for me.
i.e. what helps one person can break the world for another (e.g. #124 was needed to fix a bug caused by #117)

So, what tests have you done with this PR?
Things to consider are the various connection methods (SSH, JNLP, etc.), dynamic (cloud) provisioning (the VM is created & destroyed automatically), static provisioning (the VM is set up manually and then controlled automatically), and so on. There are a lot of possible ways of using this plugin, and it's important not to break any of them for anyone...

@nhunsperger (Author)

This change has been running in an environment I support for about 5 months. Typical volume is 1.5K-2K agents/day. The first 3.5 months were with static provisioning using the RECONNECT_AND_REVERT idle action. About 6 weeks ago I transitioned everything to dynamic provisioning with the run-once retention strategy (using my other 2 PRs merged with this one). We exclusively use SSH, so I had to rely on code analysis for JNLP.

Based on the description of your environment, it sounds like you are using JNLP. I presume the complaint you mention shows up as "EXCEPTION while starting VM" followed by "ERROR: Unexpected error in launching an agent. This is probably a bug in Jenkins" stack trace. If my assumptions are correct, it seems that you have not configured a long enough launchDelay for JNLP to reliably connect. Per the documentation for "Delay in seconds", this is "the maximum amount of time that the master will wait for the agent to come online before timing out."
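
For illustration, the behaviour that documentation describes is essentially a wait-until-online loop along these lines (purely a sketch, not the plugin's actual code; the names and poll interval are made up):

```java
import java.io.IOException;

import hudson.model.Computer;

/** Purely illustrative sketch of the "Delay in seconds" semantics quoted above. */
public class LaunchDelayExample {

    /** Waits up to launchDelaySeconds for the agent to come online, then gives up. */
    static void waitForOnline(Computer computer, int launchDelaySeconds)
            throws IOException, InterruptedException {
        long deadline = System.currentTimeMillis() + launchDelaySeconds * 1000L;
        while (!computer.isOnline()) {
            if (System.currentTimeMillis() >= deadline) {
                // This is the point where the master gives up, even though a slow
                // agent may still connect later on its own.
                throw new IOException("Agent did not come online within "
                        + launchDelaySeconds + "s");
            }
            Thread.sleep(5_000); // poll interval is illustrative
        }
    }
}
```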

I think the question therefore is, what should the vSphere plugin behavior be when the agent was not successfully connected during the allowed launch time? I feel the current behavior is not great, especially with static provisioning. In the case of SSH, it logs "Slave online", and since the agent isn't actually online, nothing else ever happens. This leaves the static node out of service until manual intervention. In the case of JNLP, it throws an exception all the way up the stack. JNLP may connect later, but the plugin already gave up and stopped tracking it against the max online count. Or JNLP may never connect, leaving us in the same bad situation as SSH. In both cases, it is effectively not handling the failure. I think a better behavior is to allow the plugin to recover the VM by using whatever idle action was configured.

With dynamic provisioning, the idle timeout checker covers up the launch failure by disconnecting the computer and then terminating the node. This PR is calling that disconnect process a bit earlier. Yes, it may get called an extra time, but the idle action is NOTHING, so it effectively becomes a NOOP to the vSphereCloudLauncher, and it is already getting called multiple times.

As an aside, the idle timeout checker is currently disconnecting nodes that are actively connecting. The timeout starts counting prior to launch, whereas I think most users would consider "idle" to mean online and not running any jobs. This is particularly problematic as the run-once retention strategy defaults to 2 minutes and doesn't take into account delayed launch caused by launchDelay. For reference, the EC2 plugin only times out online nodes. I bring this up because the need for the idle timeout to backstop launch failures could be removed if the launch mechanism handled this itself.
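
For illustration, an idle check that only starts the clock once the node is online could look roughly like this (class and field names are hypothetical, not taken from this plugin or the EC2 plugin):

```java
import java.util.concurrent.TimeUnit;
import java.util.logging.Logger;

import hudson.model.Computer;
import hudson.slaves.RetentionStrategy;

/**
 * Hypothetical sketch: only count idle time once the agent is actually online,
 * so a node that is still launching is never torn down by the "idle" check.
 */
public class OnlineOnlyIdleRetention extends RetentionStrategy<Computer> {
    private static final Logger LOGGER =
            Logger.getLogger(OnlineOnlyIdleRetention.class.getName());

    private final long idleTimeoutMinutes;

    public OnlineOnlyIdleRetention(long idleTimeoutMinutes) {
        this.idleTimeoutMinutes = idleTimeoutMinutes;
    }

    @Override
    public long check(Computer c) {
        if (!c.isOnline()) {
            // Still launching (or already down): don't count this toward the idle timeout.
            return 1;
        }
        long idleMillis = System.currentTimeMillis() - c.getIdleStartMilliseconds();
        if (c.isIdle() && idleMillis > TimeUnit.MINUTES.toMillis(idleTimeoutMinutes)) {
            // Online and genuinely idle past the timeout: this is where the existing
            // disconnect/terminate handling would run.
            LOGGER.info("Idle timeout reached for " + c.getName());
        }
        return 1; // minutes until the next check
    }
}
```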

One potential issue with this PR is that non-launcher exceptions will also trigger the disconnect logic. In some cases this may be desirable (such as the startVm call timing out after 60 seconds due to transient DRS placement slowness). In other cases, this could potentially create a loop on bad config (I'm thinking of a missing VM/snapshot). I can retool the PR to only disconnect after launcher failures, but I no longer have static provisioning load to validate with.
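
If I do retool it, the narrower shape would be roughly the sketch below, where only a failure from the delegate launcher triggers the cleanup and errors from the VM start-up step propagate untouched (all names are illustrative, not the plugin's real ones):

```java
import java.io.IOException;

import hudson.model.TaskListener;
import hudson.slaves.ComputerLauncher;
import hudson.slaves.SlaveComputer;

/**
 * Illustrative sketch of scoping the cleanup to launcher failures only: errors
 * from starting the VM (e.g. a missing VM/snapshot) propagate without cleanup,
 * so a bad config can't trigger a revert loop.
 */
public class LauncherOnlyCleanup {

    public void launchWithScopedCleanup(ComputerLauncher delegate,
                                        SlaveComputer computer,
                                        TaskListener listener)
            throws IOException, InterruptedException {
        startVm(listener); // configuration/infrastructure errors escape here, no cleanup

        try {
            delegate.launch(computer, listener);
        } catch (IOException | InterruptedException | RuntimeException e) {
            // Only a launcher failure runs the cleanup so the VM can be recycled.
            runDisconnectLogic(computer, listener);
            throw e;
        }
    }

    private void startVm(TaskListener listener) throws IOException {
        // Placeholder for powering on the VM / reverting to a snapshot.
        listener.getLogger().println("Starting VM...");
    }

    private void runDisconnectLogic(SlaveComputer computer, TaskListener listener) {
        // Placeholder for the plugin's existing disconnect handling.
        listener.getLogger().println("Launcher failed; recycling " + computer.getName());
    }
}
```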
