Cluster refuses to start a rolling-update when the cluster does not pass validation #9379

ari-becker · 2020-06-17T12:16:29Z

1. What kops version are you running? The command kops version will display
this information.

Version 1.17.0 (git-a17511e6dd)

2. What Kubernetes version are you running? kubectl version will print the
version if a cluster is running or provide the Kubernetes version specified as
a kops flag.

Client Version: version.Info{Major:"1", Minor:"18", GitVersion:"v1.18.1", GitCommit:"7879fc12a63337efff607952a323df90cdc7a335", GitTreeState:"archive", BuildDate:"1970-01-01T00:00:01Z", GoVersion:"go1.14.2", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"17", GitVersion:"v1.17.6", GitCommit:"d32e40e20d167e103faf894261614c5b45c44198", GitTreeState:"clean", BuildDate:"2020-05-20T13:08:34Z", GoVersion:"go1.13.9", Compiler:"gc", Platform:"linux/amd64"}

3. What cloud provider are you using?

AWS

4. What commands did you run? What is the simplest way to reproduce this issue?

kops rolling-update cluster --instance-group my-instance-group --yes

5. What happened after the commands executed?

W0617 14:56:00.493511   22627 aws_cloud.go:673] ignoring instance as it is terminating: i-deadbeefdeadbeef0 in autoscaling group: my-instance-group.mycluster.example.com

cluster "mycluster.example.com" did not pass validation: machine "i-deadbeefdeadbeef0" has not yet joined cluster

6. What did you expect to happen?

I0617 14:37:57.477202 15279 instancegroups.go:268] Cluster did not pass validation, will try again in "30s" until duration "15m0s" expires: machine "i-deadbeefdeadbeef0" has not yet joined cluster.

7. Please provide your cluster manifest. Execute
kops get --name my.example.com -o yaml to display your cluster manifest.
You may want to remove your cluster name and other sensitive information.

N/A

8. Please run the commands with most verbose logging by adding the -v 10 flag.
Paste the logs into this report, or in a gist and provide the gist link here.

N/A

9. Anything else we need to know?

We run a large cluster with cluster-autoscaling, and a number of our instance groups use mixedInstancesPolicy / spot instances. Additionally, when we need to roll the entire cluster (e.g. for Kubernetes updates), it's common for us to have multiple engineers rolling individual instance groups and monitoring the roll-out. It's therefore pretty common for the cluster to be in a state where some node, somewhere, is joining or leaving the cluster. When this happens, the rolling-update command refuses to execute and exits.

While refusing to terminate any nodes while the cluster is unstable is desirable behavior, the immediate exit is not. The rolling-update command is actually inconsistent here: when an instance group is already in the middle of a rolling update, the command prints that it will try again in "30s" until duration "15m0s" expires. It does so because cluster instability is a natural consequence of terminating instances, and the retry loop provides a smoother experience than exiting after every instance is rolled.
This bug report (not really a bug, more a sore point in the UX that is presumably working as designed) asks that rolling-update behave the same way when it starts rolling an instance group as it does in the middle of rolling one: poll validation until the cluster is ready or the timeout expires.
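
The behavior requested above amounts to a bounded polling loop. Here is a minimal Python sketch of that idea, not kops's actual implementation: wait_for_validation and all of its parameters are hypothetical names, the validate callable stands in for whatever check the tool performs, and the 30-second interval and 15-minute timeout defaults simply mirror the values in the log line quoted earlier.

```python
import time
from typing import Callable, Optional


def wait_for_validation(
    validate: Callable[[], Optional[str]],
    interval: float = 30.0,
    timeout: float = 900.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll `validate` until it succeeds or `timeout` seconds have passed.

    `validate` (hypothetical) returns None when the cluster is valid,
    or a failure message describing why it is not.
    Returns True if validation passed in time, False otherwise.
    """
    deadline = clock() + timeout
    while True:
        failure = validate()
        if failure is None:
            return True
        if clock() >= deadline:
            # Give up without touching any nodes.
            print(f"Cluster did not pass validation within {timeout:.0f}s: {failure}")
            return False
        print(
            f"Cluster did not pass validation, "
            f"will try again in {interval:.0f}s: {failure}"
        )
        sleep(interval)


if __name__ == "__main__":
    # Simulated validator: fails once because a node is still joining,
    # then reports a healthy cluster.
    results = iter(['machine "i-deadbeefdeadbeef0" has not yet joined cluster', None])
    ok = wait_for_validation(lambda: next(results), interval=0.01, timeout=1.0)
    print("validated:", ok)
```

The point of the sketch is that the same loop can guard the start of a roll as well as each step in the middle of one; only a timeout, never a single failed validation, ends the command.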

johngmyers · 2020-06-17T14:53:01Z

Per #9165, kops 1.18 will retry the initial validation before updating an instancegroup.

ari-becker · 2020-06-17T15:19:10Z

Oh awesome! Closing as duplicate then.

ari-becker changed the title ~~Cluster refuses to rolling-update when the cluster does not pass validation~~ Cluster refuses to start a rolling-update when the cluster does not pass validation Jun 17, 2020

ari-becker closed this as completed Jun 17, 2020
