Cluster refuses to start a rolling-update when the cluster does not pass validation #9379

Closed
ari-becker opened this issue Jun 17, 2020 · 2 comments

Comments

@ari-becker
Contributor

1. What kops version are you running? The command kops version will display
this information.

Version 1.17.0 (git-a17511e6dd)

2. What Kubernetes version are you running? kubectl version will print the
version if a cluster is running or provide the Kubernetes version specified as
a kops flag.

Client Version: version.Info{Major:"1", Minor:"18", GitVersion:"v1.18.1", GitCommit:"7879fc12a63337efff607952a323df90cdc7a335", GitTreeState:"archive", BuildDate:"1970-01-01T00:00:01Z", GoVersion:"go1.14.2", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"17", GitVersion:"v1.17.6", GitCommit:"d32e40e20d167e103faf894261614c5b45c44198", GitTreeState:"clean", BuildDate:"2020-05-20T13:08:34Z", GoVersion:"go1.13.9", Compiler:"gc", Platform:"linux/amd64"}

3. What cloud provider are you using?

AWS

4. What commands did you run? What is the simplest way to reproduce this issue?

kops rolling-update cluster --instance-group my-instance-group --yes

5. What happened after the commands executed?

W0617 14:56:00.493511   22627 aws_cloud.go:673] ignoring instance as it is terminating: i-deadbeefdeadbeef0 in autoscaling group: my-instance-group.mycluster.example.com

cluster "mycluster.example.com" did not pass validation: machine "i-deadbeefdeadbeef0" has not yet joined cluster

6. What did you expect to happen?

I0617 14:37:57.477202 15279 instancegroups.go:268] Cluster did not pass validation, will try again in "30s" until duration "15m0s" expires: machine "i-deadbeefdeadbeef0" has not yet joined cluster.

7. Please provide your cluster manifest. Execute
kops get --name my.example.com -o yaml to display your cluster manifest.
You may want to remove your cluster name and other sensitive information.

N/A

8. Please run the commands with most verbose logging by adding the -v 10 flag.
Paste the logs into this report, or in a gist and provide the gist link here.

N/A

9. Anything else we need to know?

We run a large cluster with cluster-autoscaling and mixedInstancesPolicy / spot instances on a number of our instance groups. Additionally, when we need to roll the entire cluster (e.g. for Kubernetes updates), it's common for us to have multiple engineers rolling individual instance groups and monitoring the roll-out. It's therefore pretty common for the cluster to be in a state where some node, somewhere, is joining or leaving the cluster. When this happens, the rolling-update command refuses to execute and exits.

While refusing to terminate any nodes while the cluster is unstable is desirable behavior, the immediate exit is not. The rolling-update command is actually inconsistent here: if a given instance group is already in the middle of a rolling update, the command prints that it will try again in "30s" until duration "15m0s" expires. It does so because cluster instability is a natural consequence of terminating instances, and the retry loop provides a smoother experience than exiting after every instance is rolled.
This bug report (not really a bug, more a sore point in the UX that is presumably working as designed) asks that rolling-update behave the same way when it's starting to roll an instance group as it does when it's in the middle of rolling one: poll validation until the cluster is ready or the timeout expires.
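
To make the requested behavior concrete, here is a minimal sketch of a poll-until-valid loop. It is not kops code: `wait_for_validation` and the `validate` callback are hypothetical names, and the 30s interval and 15m timeout defaults are simply copied from the log line quoted above.

```python
import time
from typing import Callable, Optional


def wait_for_validation(
    validate: Callable[[], Optional[str]],
    interval: float = 30.0,
    timeout: float = 900.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll `validate` until it passes or `timeout` seconds have elapsed.

    `validate` is a hypothetical callback that returns None when the cluster
    is valid, or a short failure message otherwise. `sleep` and `clock` are
    injectable so the loop can be tested without waiting.
    """
    deadline = clock() + timeout
    while True:
        failure = validate()
        if failure is None:
            return True  # cluster is valid, safe to start rolling
        if clock() >= deadline:
            return False  # gave up; the caller should exit without rolling
        print(
            f"Cluster did not pass validation, "
            f"will try again in {interval}s: {failure}"
        )
        sleep(interval)
```

A caller would run this once before touching the first instance of an instance group and terminate nodes only if it returns True, which is the same gate the command already applies between instances.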

@ari-becker changed the title from "Cluster refuses to rolling-update when the cluster does not pass validation" to "Cluster refuses to start a rolling-update when the cluster does not pass validation" on Jun 17, 2020
@johngmyers
Member

Per #9165, kops 1.18 will retry the initial validation before updating an instancegroup.

@ari-becker
Contributor Author

Oh awesome! Closing as duplicate then.
