
Rolling update panic #11774

Closed
yurrriq opened this issue Jun 15, 2021 · 16 comments · Fixed by #12698
Labels
kind/bug Categorizes issue or PR as related to a bug.

Comments

@yurrriq
Contributor

yurrriq commented Jun 15, 2021

/kind bug

1. What kops version are you running? The command kops version will display
this information.

Version 1.21.0-beta.3 (git-03fc6a2601809f143499d16aaab12cd7c22d9eed)

2. What Kubernetes version are you running? kubectl version will print the
version if a cluster is running or provide the Kubernetes version specified as
a kops flag.

Client Version: version.Info{Major:"1", Minor:"19", GitVersion:"v1.19.9", GitCommit:"9dd794e454ac32d97cde41ae10be801ae98f75df", GitTreeState:"clean", BuildDate:"2021-03-18T01:09:28Z", GoVersion:"go1.15.8", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"21", GitVersion:"v1.21.1", GitCommit:"5e58841cce77d4bc13713ad2b91fa0d961e69192", GitTreeState:"clean", BuildDate:"2021-05-12T14:12:29Z", GoVersion:"go1.16.4", Compiler:"gc", Platform:"linux/arm64"}

3. What cloud provider are you using?
AWS

4. What commands did you run? What is the simplest way to reproduce this issue?
kops rolling-update cluster --name <my cluster> --yes

5. What happened after the commands executed?
It works for a while, then panics.

6. What did you expect to happen?
No panic.

7. Please provide your cluster manifest. Execute
kops get --name my.example.com -o yaml to display your cluster manifest.
You may want to remove your cluster name and other sensitive information.

---
apiVersion: kops.k8s.io/v1alpha2
kind: Cluster
metadata:
  name: panic.example.com
spec:
  fileAssets:
  - content: |
      podNodeSelectorPluginConfig:
        clusterDefaultNodeSelector: "kubernetes.io/role=node"
    name: admission-controller-config.yaml
    roles:
    - Master
  - content: |
      apiVersion: audit.k8s.io/v1
      kind: Policy
      rules:
      - level: Metadata
    name: audit.yaml
    roles:
    - Master
  kubeAPIServer:
    admissionControlConfigFile: /srv/kubernetes/assets/admission-controller-config.yaml
    auditPolicyFile: /srv/kubernetes/assets/audit.yaml
  kubernetesVersion: 1.21.1
---
apiVersion: kops.k8s.io/v1alpha2
kind: InstanceGroup
metadata:
  labels:
    kops.k8s.io/cluster: panic.example.com
  name: apiserver-eu-west-1a
spec:
  image: 099720109477/ubuntu/images/hvm-ssd/ubuntu-focal-20.04-arm64-server-20210415
  machineType: t4g.large
  maxSize: 1
  minSize: 1
  role: APIServer
  rollingUpdate:
    maxSurge: 25%
  subnets:
  - eu-west-1a
---
apiVersion: kops.k8s.io/v1alpha2
kind: InstanceGroup
metadata:
  labels:
    kops.k8s.io/cluster: panic.example.com
  name: apiserver-eu-west-1b
spec:
  image: 099720109477/ubuntu/images/hvm-ssd/ubuntu-focal-20.04-arm64-server-20210415
  machineType: t4g.large
  maxSize: 1
  minSize: 1
  role: APIServer
  rollingUpdate:
    maxSurge: 25%
  subnets:
  - eu-west-1b
---
apiVersion: kops.k8s.io/v1alpha2
kind: InstanceGroup
metadata:
  labels:
    kops.k8s.io/cluster: panic.example.com
  name: apiserver-eu-west-1c
spec:
  image: 099720109477/ubuntu/images/hvm-ssd/ubuntu-focal-20.04-arm64-server-20210415
  machineType: t4g.large
  maxSize: 1
  minSize: 1
  role: APIServer
  rollingUpdate:
    maxSurge: 25%
  subnets:
  - eu-west-1c
(For context, the diff below adds the APIServer role to the fileAssets above; see item 9 regarding the misconfiguration.)

@@ -72,6 +72,7 @@ spec:
         clusterDefaultNodeSelector: "kubernetes.io/role=node"
     name: admission-controller-config.yaml
     roles:
+    - APIServer
     - Master
   - content: |
       apiVersion: audit.k8s.io/v1
@@ -80,6 +81,7 @@ spec:
       - level: Metadata
     name: audit.yaml
     roles:
+    - APIServer
     - Master
   hooks:
   - execContainer:

8. Please run the commands with most verbose logging by adding the -v 10 flag.
Paste the logs into this report, or in a gist and provide the gist link here.

Not -v 10, but this is what I have:

panic: runtime error: index out of range [1] with length 1
goroutine 1 [running]:
k8s.io/kops/pkg/instancegroups.(*RollingUpdateCluster).rollingUpdateInstanceGroup(0xc00028a840, 0xc000db4cb0, 0x37e11d600, 0x14, 0xc0010ad870)
	pkg/instancegroups/instancegroups.go:154 +0x1454
k8s.io/kops/pkg/instancegroups.(*RollingUpdateCluster).RollingUpdate(0xc00028a840, 0xc000b9d380, 0xc00064f2d0, 0xc00064f2d0, 0xc000622390)
	pkg/instancegroups/rollingupdate.go:173 +0x96f
main.RunRollingUpdateCluster(0x549fe90, 0xc000060100, 0xc00000c330, 0x544ad80, 0xc000182008, 0xc000212500, 0x5d5560, 0xc000be9d68)
	cmd/kops/rollingupdatecluster.go:444 +0x10b3
main.NewCmdRollingUpdateCluster.func1(0xc000a90f00, 0xc0009e3d70, 0x0, 0x3)
	cmd/kops/rollingupdatecluster.go:219 +0x105
k8s.io/kops/vendor/github.com/spf13/cobra.(*Command).execute(0xc000a90f00, 0xc0009e3d40, 0x3, 0x3, 0xc000a90f00, 0xc0009e3d40)
	vendor/github.com/spf13/cobra/command.go:856 +0x2c2
k8s.io/kops/vendor/github.com/spf13/cobra.(*Command).ExecuteC(0x785a8c0, 0x78aa770, 0x0, 0x0)
	vendor/github.com/spf13/cobra/command.go:960 +0x375
k8s.io/kops/vendor/github.com/spf13/cobra.(*Command).Execute(...)
	vendor/github.com/spf13/cobra/command.go:897
main.Execute()
	cmd/kops/root.go:97 +0x8f
main.main()
	cmd/kops/main.go:24 +0x25

9. Anything else do we need to know?

I had three APIServer instance groups, but I had misconfigured the asset manifests, so those three nodes never joined the cluster. I terminated one manually, and a bit later got the panic. The same thing happened later when I terminated the other two.

As the stack trace mentions, the problem seems to lie in the maths around here: https://github.com/kubernetes/kops/blob/v1.21.0-beta.3/pkg/instancegroups/instancegroups.go#L153-L154
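
For illustration, a minimal standalone sketch of that arithmetic with values chosen to match the panic above (the string slice, the hard-coded skippedNodes, and the instance ID are stand-ins; the real loop operates on cloud instances and increments skippedNodes as it skips them):

package main

import "fmt"

func main() {
	// Stand-in for the instances still pending update; a single element,
	// as in the panic ("index out of range [1] with length 1").
	update := []string{"i-0abc"}

	// Illustrative values: the surge loop starts numSurge at 1, and
	// skippedNodes can catch up with it once instances get skipped.
	numSurge := 1
	skippedNodes := 1

	// The index expression from instancegroups.go:154.
	idx := len(update) - numSurge + skippedNodes
	fmt.Printf("index %d, length %d\n", idx, len(update)) // index 1, length 1

	if idx < 0 || idx >= len(update) {
		// update[idx] here is exactly the reported panic.
		fmt.Println("out of range: this is where rolling-update panics")
		return
	}
	fmt.Println("would detach", update[idx])
}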

@k8s-ci-robot added the kind/bug label on Jun 15, 2021
@yurrriq
Contributor Author

yurrriq commented Jun 15, 2021

/assign @johngmyers

@johngmyers
Member

The issue is that incrementing skippedNodes can cause the code to try to detach more nodes than it has.

So this was introduced by #10740 and is technically a 1.21 regression.
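
Put differently: the slice lookup at instancegroups.go:154 goes out of range whenever skippedNodes ends up greater than or equal to numSurge, because the index len(update)-numSurge+skippedNodes is then at least len(update).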

@yurrriq
Contributor Author

yurrriq commented Jun 16, 2021

Aha, thanks for the explanation. So is this a duplicate issue, or should I leave it open?

@johngmyers
Member

Leave it open

@yurrriq
Contributor Author

yurrriq commented Sep 1, 2021

Have you had a chance to look into this, @johngmyers?

@yurrriq
Contributor Author

yurrriq commented Sep 30, 2021

/bump we keep hitting this with kOps 1.21.1 (but our pipelines retry twice, so it's more of a minor inconvenience than anything)

@azman0101
Contributor

azman0101 commented Oct 27, 2021

Hitting this too with kOps 1.21.1.

Do you have rollingUpdate.maxSurge configured in your IG, @yurrriq?

@yurrriq
Contributor Author

yurrriq commented Oct 27, 2021

@azman0101, yeah, we're typically using 25% or 60%.

@azman0101
Contributor

I never saw this problem without rollingUpdate before.

@olemarkus
Member

I am trying to reproduce this, but I can't. I've tried provoking detachment race conditions, and triggering autoscaling both via cluster autoscaler (CAS) and manually in the console while rotating, but nothing seems to trigger this issue.

Can you provide a bit more info:
What is the ASG's desired capacity set to?
What was maxSurge set to?
Were there any CAS scaling events at the time? If so, what were the new and old desired values?

@azman0101
Contributor

I was able to trigger the bug with CA disabled and maxSurge set to 6.

I was also able to trigger the bug when using a warm pool and without UpdateStrategy.

@olemarkus
Member

What was your desired capacity set to?

@rifelpet
Member

rifelpet commented Nov 3, 2021

I'm wondering if this would be sufficient:

diff --git a/pkg/instancegroups/instancegroups.go b/pkg/instancegroups/instancegroups.go
index bf07a12bf4..e47364f48b 100644
--- a/pkg/instancegroups/instancegroups.go
+++ b/pkg/instancegroups/instancegroups.go
@@ -151,6 +151,10 @@ func (c *RollingUpdateCluster) rollingUpdateInstanceGroup(group *cloudinstances.
        if maxSurge > 0 && !c.CloudOnly {
                skippedNodes := 0
                for numSurge := 1; numSurge <= maxSurge; numSurge++ {
+                       if skippedNodes > numSurge-1 {
+                               break
+                       }
+
                        u := update[len(update)-numSurge+skippedNodes]
                        if u.Status != cloudinstances.CloudInstanceStatusDetached {
                                if err := c.detachInstance(u); err != nil {
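
(If that reading is right, the added guard trips exactly when skippedNodes >= numSurge, i.e. just before the index len(update)-numSurge+skippedNodes would reach len(update), so the loop breaks instead of panicking.)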

@azman0101
Contributor

What was your desired capacity set to?

Sorry @olemarkus, I can't tell you anymore.

@yurrriq
Contributor Author

yurrriq commented Nov 10, 2021

I'm wondering if this would be sufficient:

This was my line of thinking too.

@yurrriq
Contributor Author

yurrriq commented Jun 9, 2022

Still hitting this on kOps 1.23.2
