
Rolling update panic #11774

Closed
yurrriq opened this issue Jun 15, 2021 · 16 comments · Fixed by #12698
Labels
kind/bug Categorizes issue or PR as related to a bug.

Comments

@yurrriq
Contributor

yurrriq commented Jun 15, 2021

/kind bug

1. What kops version are you running? The command kops version will display
this information.

Version 1.21.0-beta.3 (git-03fc6a2601809f143499d16aaab12cd7c22d9eed)

2. What Kubernetes version are you running? kubectl version will print the
version if a cluster is running or provide the Kubernetes version specified as
a kops flag.

Client Version: version.Info{Major:"1", Minor:"19", GitVersion:"v1.19.9", GitCommit:"9dd794e454ac32d97cde41ae10be801ae98f75df", GitTreeState:"clean", BuildDate:"2021-03-18T01:09:28Z", GoVersion:"go1.15.8", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"21", GitVersion:"v1.21.1", GitCommit:"5e58841cce77d4bc13713ad2b91fa0d961e69192", GitTreeState:"clean", BuildDate:"2021-05-12T14:12:29Z", GoVersion:"go1.16.4", Compiler:"gc", Platform:"linux/arm64"}

3. What cloud provider are you using?
AWS

4. What commands did you run? What is the simplest way to reproduce this issue?
kops rolling-update cluster --name <my cluster> --yes

5. What happened after the commands executed?
It works for a while, then panics.

6. What did you expect to happen?
No panic.

7. Please provide your cluster manifest. Execute
kops get --name my.example.com -o yaml to display your cluster manifest.
You may want to remove your cluster name and other sensitive information.

---
apiVersion: kops.k8s.io/v1alpha2
kind: Cluster
metadata:
  name: panic.example.com
spec:
  fileAssets:
  - content: |
      podNodeSelectorPluginConfig:
        clusterDefaultNodeSelector: "kubernetes.io/role=node"
    name: admission-controller-config.yaml
    roles:
    - Master
  - content: |
      apiVersion: audit.k8s.io/v1
      kind: Policy
      rules:
      - level: Metadata
    name: audit.yaml
    roles:
    - Master
  kubeAPIServer:
    admissionControlConfigFile: /srv/kubernetes/assets/admission-controller-config.yaml
    auditPolicyFile: /srv/kubernetes/assets/audit.yaml
  kubernetesVersion: 1.21.1
---
apiVersion: kops.k8s.io/v1alpha2
kind: InstanceGroup
metadata:
  labels:
    kops.k8s.io/cluster: panic.example.com
  name: apiserver-eu-west-1a
spec:
  image: 099720109477/ubuntu/images/hvm-ssd/ubuntu-focal-20.04-arm64-server-20210415
  machineType: t4g.large
  maxSize: 1
  minSize: 1
  role: APIServer
  rollingUpdate:
    maxSurge: 25%
  subnets:
  - eu-west-1a
---
apiVersion: kops.k8s.io/v1alpha2
kind: InstanceGroup
metadata:
  labels:
    kops.k8s.io/cluster: panic.example.com
  name: apiserver-eu-west-1b
spec:
  image: 099720109477/ubuntu/images/hvm-ssd/ubuntu-focal-20.04-arm64-server-20210415
  machineType: t4g.large
  maxSize: 1
  minSize: 1
  role: APIServer
  rollingUpdate:
    maxSurge: 25%
  subnets:
  - eu-west-1b
---
apiVersion: kops.k8s.io/v1alpha2
kind: InstanceGroup
metadata:
  labels:
    kops.k8s.io/cluster: panic.example.com
  name: apiserver-eu-west-1c
spec:
  image: 099720109477/ubuntu/images/hvm-ssd/ubuntu-focal-20.04-arm64-server-20210415
  machineType: t4g.large
  maxSize: 1
  minSize: 1
  role: APIServer
  rollingUpdate:
    maxSurge: 25%
  subnets:
  - eu-west-1c
(For context, the diff below adds the APIServer role to the fileAssets above; see item 9 regarding the misconfiguration.)

@@ -72,6 +72,7 @@ spec:
         clusterDefaultNodeSelector: "kubernetes.io/role=node"
     name: admission-controller-config.yaml
     roles:
+    - APIServer
     - Master
   - content: |
       apiVersion: audit.k8s.io/v1
@@ -80,6 +81,7 @@ spec:
       - level: Metadata
     name: audit.yaml
     roles:
+    - APIServer
     - Master
   hooks:
   - execContainer:

8. Please run the commands with most verbose logging by adding the -v 10 flag.
Paste the logs into this report, or in a gist and provide the gist link here.

Not -v 10, but this is what I have:

panic: runtime error: index out of range [1] with length 1
goroutine 1 [running]:
k8s.io/kops/pkg/instancegroups.(*RollingUpdateCluster).rollingUpdateInstanceGroup(0xc00028a840, 0xc000db4cb0, 0x37e11d600, 0x14, 0xc0010ad870)
	pkg/instancegroups/instancegroups.go:154 +0x1454
k8s.io/kops/pkg/instancegroups.(*RollingUpdateCluster).RollingUpdate(0xc00028a840, 0xc000b9d380, 0xc00064f2d0, 0xc00064f2d0, 0xc000622390)
	pkg/instancegroups/rollingupdate.go:173 +0x96f
main.RunRollingUpdateCluster(0x549fe90, 0xc000060100, 0xc00000c330, 0x544ad80, 0xc000182008, 0xc000212500, 0x5d5560, 0xc000be9d68)
	cmd/kops/rollingupdatecluster.go:444 +0x10b3
main.NewCmdRollingUpdateCluster.func1(0xc000a90f00, 0xc0009e3d70, 0x0, 0x3)
	cmd/kops/rollingupdatecluster.go:219 +0x105
k8s.io/kops/vendor/github.com/spf13/cobra.(*Command).execute(0xc000a90f00, 0xc0009e3d40, 0x3, 0x3, 0xc000a90f00, 0xc0009e3d40)
	vendor/github.com/spf13/cobra/command.go:856 +0x2c2
k8s.io/kops/vendor/github.com/spf13/cobra.(*Command).ExecuteC(0x785a8c0, 0x78aa770, 0x0, 0x0)
	vendor/github.com/spf13/cobra/command.go:960 +0x375
k8s.io/kops/vendor/github.com/spf13/cobra.(*Command).Execute(...)
	vendor/github.com/spf13/cobra/command.go:897
main.Execute()
	cmd/kops/root.go:97 +0x8f
main.main()
	cmd/kops/main.go:24 +0x25

9. Anything else do we need to know?

I had three APIServer instance groups, but I had misconfigured the asset manifests, so those three nodes never joined the cluster. I terminated one manually, and a bit later got the panic. The same thing happened later when I terminated the other two.

As the stack trace mentions, the problem seems to lie in the maths around here: https://github.com/kubernetes/kops/blob/v1.21.0-beta.3/pkg/instancegroups/instancegroups.go#L153-L154
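
For illustration, a minimal standalone sketch of that arithmetic with values chosen to match the panic above (the string slice, the hard-coded skippedNodes, and the instance ID are stand-ins; the real loop operates on cloud instances and increments skippedNodes as it skips them):

package main

import "fmt"

func main() {
	// Stand-in for the instances still pending update; a single element,
	// as in the panic ("index out of range [1] with length 1").
	update := []string{"i-0abc"}

	// Illustrative values: the surge loop starts numSurge at 1, and
	// skippedNodes can catch up with it once instances get skipped.
	numSurge := 1
	skippedNodes := 1

	// The index expression from instancegroups.go:154.
	idx := len(update) - numSurge + skippedNodes
	fmt.Printf("index %d, length %d\n", idx, len(update)) // index 1, length 1

	if idx < 0 || idx >= len(update) {
		// update[idx] here is exactly the reported panic.
		fmt.Println("out of range: this is where rolling-update panics")
		return
	}
	fmt.Println("would detach", update[idx])
}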

@k8s-ci-robot added the kind/bug label on Jun 15, 2021
@yurrriq
Contributor Author

yurrriq commented Jun 15, 2021

/assign @johngmyers

@johngmyers
Member

The issue is that incrementing skippedNodes can cause the code to try to detach more nodes than it has.

So this was introduced by #10740 and is technically a 1.21 regression.
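
Put differently: the slice lookup at instancegroups.go:154 goes out of range whenever skippedNodes ends up greater than or equal to numSurge, because the index len(update)-numSurge+skippedNodes is then at least len(update).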

@yurrriq
Contributor Author

yurrriq commented Jun 16, 2021

Aha, thanks for the explanation. So is this a duplicate issue, or should I leave it open?

@johngmyers
Member

Leave it open

@yurrriq
Contributor Author

yurrriq commented Sep 1, 2021

Have you had a chance to look into this, @johngmyers?

@yurrriq
Contributor Author

yurrriq commented Sep 30, 2021

/bump we keep hitting this with kOps 1.21.1 (but our pipelines retry twice, so it's more of a minor inconvenience than anything)

@azman0101
Contributor

azman0101 commented Oct 27, 2021

Hitting this too with kOps 1.21.1.

Do you have rollingUpdate.maxSurge configured in your IG, @yurrriq?

@yurrriq
Contributor Author

yurrriq commented Oct 27, 2021

@azman0101, yeah, we're typically using 25% or 60%.

@azman0101
Contributor

I never saw this problem without rollingUpdate before.

@olemarkus
Member

I am trying to reproduce this, but I can't. I've tried provoking detachment race conditions, and triggering autoscaling both via cluster autoscaler (CAS) and manually in the console while rotating, but nothing seems to trigger this issue.

Can you provide a bit more info:
What is the ASG's desired capacity set to?
What was maxSurge set to?
Were there any CAS scaling events at the time? If so, what were the new and old desired values?

@azman0101
Contributor

I was able to trigger the bug with CA disabled and maxSurge set to 6.

I was also able to trigger the bug when using a warm pool and without UpdateStrategy.

@olemarkus
Member

What was your desired capacity set to?

@rifelpet
Member

rifelpet commented Nov 3, 2021

I'm wondering if this would be sufficient:

diff --git a/pkg/instancegroups/instancegroups.go b/pkg/instancegroups/instancegroups.go
index bf07a12bf4..e47364f48b 100644
--- a/pkg/instancegroups/instancegroups.go
+++ b/pkg/instancegroups/instancegroups.go
@@ -151,6 +151,10 @@ func (c *RollingUpdateCluster) rollingUpdateInstanceGroup(group *cloudinstances.
        if maxSurge > 0 && !c.CloudOnly {
                skippedNodes := 0
                for numSurge := 1; numSurge <= maxSurge; numSurge++ {
+                       if skippedNodes > numSurge-1 {
+                               break
+                       }
+
                        u := update[len(update)-numSurge+skippedNodes]
                        if u.Status != cloudinstances.CloudInstanceStatusDetached {
                                if err := c.detachInstance(u); err != nil {
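
(If that reading is right, the added guard trips exactly when skippedNodes >= numSurge, i.e. just before the index len(update)-numSurge+skippedNodes would reach len(update), so the loop breaks instead of panicking.)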

@azman0101
Contributor

What was your desired capacity set to?

Sorry @olemarkus, I can't tell you anymore.

@yurrriq
Contributor Author

yurrriq commented Nov 10, 2021

I'm wondering if this would be sufficient:

This was my line of thinking too.

@yurrriq
Contributor Author

yurrriq commented Jun 9, 2022

Still hitting this on kOps 1.23.2
