Rolling update panic with warm pools #13758

Closed
yurrriq opened this issue Jun 9, 2022 · 6 comments · Fixed by #14251
Labels
kind/bug Categorizes issue or PR as related to a bug.

Comments

yurrriq commented Jun 9, 2022

/kind bug

1. What kops version are you running? The command kops version will display
this information.

Version 1.23.2 (git-1.23.2)

2. What Kubernetes version are you running? kubectl version will print the
version if a cluster is running or provide the Kubernetes version specified as
a kops flag.

kubernetesVersion: 1.22.10

3. What cloud provider are you using?

cloudProvider: aws

4. What commands did you run? What is the simplest way to reproduce this issue?

kops rolling-update cluster --yes. We were also upgrading from kOps 1.22.4 to 1.23.2, in case that's relevant.

5. What happened after the commands executed?

Panic.

6. What did you expect to happen?

A smooth rolling update.

7. Please provide your cluster manifest. Execute
kops get --name my.example.com -o yaml to display your cluster manifest.
You may want to remove your cluster name and other sensitive information.

---
apiVersion: kops.k8s.io/v1alpha2
kind: Cluster
# ...
spec:
  # ...
  cloudProvider: aws
  kubernetesVersion: 1.22.10 # was 1.22.7, in case that's relevant
---
apiVersion: kops.k8s.io/v1alpha2
kind: InstanceGroup
# ...
spec:
  # ...
  maxSize: 30
  minSize: 1
  role: Node
  rollingUpdate:
    drainAndTerminate: true
    maxSurge: 60%
  warmPool:
    enableLifecycleHook: true
    maxSize: 4
    minSize: 3

8. Please run the commands with the most verbose logging by adding the -v 10 flag.
Paste the logs into this report, or into a gist and provide the gist link here.

I0609 14:19:12.802210     485 instancegroups.go:309] Tainting 2 nodes in "nodes-eu-central-1a" instancegroup.
I0609 14:19:12.837133     485 instancegroups.go:131] deleting warm pool instance "AAAA"
I0609 14:19:13.004328     485 instancegroups.go:131] deleting warm pool instance "BBBB"
I0609 14:19:13.161806     485 instancegroups.go:131] deleting warm pool instance "CCCC"
I0609 14:19:13.309859     485 instancegroups.go:565] Detaching instance "DDDD", ...
I0609 14:19:13.564841     485 instancegroups.go:170] waiting for 15s after detaching instance
I0609 14:19:28.565202     485 instancegroups.go:467] Validating the cluster.
I0609 14:19:30.949595     485 instancegroups.go:503] Cluster validated; revalidating in 10s to make sure it does not flap.
I0609 14:19:44.024648     485 instancegroups.go:500] Cluster validated.
I0609 14:19:44.024687     485 instancegroups.go:565] Detaching instance "EEEE", ...
panic: runtime error: index out of range [-1]
goroutine 1 [running]:
k8s.io/kops/pkg/instancegroups.(*RollingUpdateCluster).rollingUpdateInstanceGroup(0xc0005ba900, 0xc0007de000, 0x37e11d600)
	k8s.io/kops/pkg/instancegroups/instancegroups.go:154 +0x1031
k8s.io/kops/pkg/instancegroups.(*RollingUpdateCluster).RollingUpdate(0xc0005ba900, 0xc0007d5590, 0xc0008aeb40)
	k8s.io/kops/pkg/instancegroups/rollingupdate.go:196 +0xa45
main.RunRollingUpdateCluster({0x7439c80, 0xc00005c1c8}, 0xc0008bfd98, {0x7343860, 0xc000010018}, 0xc00071a510)
	k8s.io/kops/cmd/kops/rolling-update_cluster.go:434 +0xf45
main.NewCmdRollingUpdateCluster.func1(0xc000765900, {0xc000088800, 0x2, 0x2})
	k8s.io/kops/cmd/kops/rolling-update_cluster.go:176 +0x37
k8s.io/kops/vendor/github.com/spf13/cobra.(*Command).execute(0xc000765900, {0xc0000887e0, 0x2, 0x2})
	k8s.io/kops/vendor/github.com/spf13/cobra/command.go:856 +0x60e
k8s.io/kops/vendor/github.com/spf13/cobra.(*Command).ExecuteC(0x9ece900)
	k8s.io/kops/vendor/github.com/spf13/cobra/command.go:974 +0x3bc
k8s.io/kops/vendor/github.com/spf13/cobra.(*Command).Execute(...)
	k8s.io/kops/vendor/github.com/spf13/cobra/command.go:902
main.Execute()
	k8s.io/kops/cmd/kops/root.go:95 +0x5c
main.main()
	k8s.io/kops/cmd/kops/main.go:20 +0x17

9. Anything else we need to know?

Similar to #11774

k8s-ci-robot added the kind/bug label on Jun 9, 2022

yurrriq commented Aug 10, 2022

We're still seeing this quite frequently.

olemarkus (Member) commented:

Please reproduce this with kops 1.24.1. It should give more information about why the detach fails, which may be what's causing this.

yurrriq commented Aug 10, 2022

Noted. We're currently still using 1.23.2.

yurrriq commented Sep 8, 2022

Still seeing this with kOps 1.24.1. Here are some logs:

I0908 09:43:08.745335    6639 instancegroups.go:311] Tainting 6 nodes in "nodes-eu-central-1c" instancegroup.
I0908 09:43:08.848436    6639 instancegroups.go:132] deleting warm pool instance "AAAAA"
I0908 09:43:09.003241    6639 instancegroups.go:132] deleting warm pool instance "BBBBB"
I0908 09:43:09.141794    6639 instancegroups.go:132] deleting warm pool instance "CCCCC"
I0908 09:43:09.293505    6639 instancegroups.go:132] deleting warm pool instance "DDDDD"
I0908 09:43:09.463532    6639 instancegroups.go:132] deleting warm pool instance "EEEEE"
I0908 09:43:09.632396    6639 instancegroups.go:132] deleting warm pool instance "FFFFF"
I0908 09:43:09.791094    6639 instancegroups.go:132] deleting warm pool instance "GGGGG"
I0908 09:43:09.962070    6639 instancegroups.go:132] deleting warm pool instance "HHHHH"
I0908 09:43:10.099380    6639 instancegroups.go:132] deleting warm pool instance "IIIII"
I0908 09:43:10.241848    6639 instancegroups.go:132] deleting warm pool instance "JJJJJ"
I0908 09:43:10.385283    6639 instancegroups.go:568] Detaching instance "KKKKK", node "ip-KKKKK.eu-central-1.compute.internal", in group "nodes-eu-central-1c.my-cluster.example".
I0908 09:43:10.739226    6639 instancegroups.go:172] waiting for 15s after detaching instance
I0908 09:43:25.739428    6639 instancegroups.go:470] Validating the cluster.
I0908 09:43:28.186852    6639 instancegroups.go:506] Cluster validated; revalidating in 10s to make sure it does not flap.
I0908 09:43:40.244820    6639 instancegroups.go:503] Cluster validated.
I0908 09:43:40.244869    6639 instancegroups.go:568] Detaching instance "LLLLL", node "ip-LLLLL.eu-central-1.compute.internal", in group "nodes-eu-central-1c.my-cluster.example".
I0908 09:43:40.678965    6639 instancegroups.go:568] Detaching instance "MMMMM", node "ip-MMMMM.eu-central-1.compute.internal", in group "nodes-eu-central-1c.my-cluster.example".
I0908 09:43:41.120884    6639 instancegroups.go:568] Detaching instance "NNNNN", node "ip-NNNNN.eu-central-1.compute.internal", in group "nodes-eu-central-1c.my-cluster.example".
I0908 09:43:41.547119    6639 instancegroups.go:568] Detaching instance "OOOOO", node "ip-OOOOO.eu-central-1.compute.internal", in group "nodes-eu-central-1c.my-cluster.example".
I0908 09:43:41.957413    6639 instancegroups.go:568] Detaching instance "PPPPP", node "ip-PPPPP.eu-central-1.compute.internal", in group "nodes-eu-central-1c.my-cluster.example".
panic: runtime error: index out of range [-1]

goroutine 1 [running]:
k8s.io/kops/pkg/instancegroups.(*RollingUpdateCluster).rollingUpdateInstanceGroup(0xc000452240, 0xc0002c01c0, 0x37e11d600)
	k8s.io/kops/pkg/instancegroups/instancegroups.go:155 +0x1410
k8s.io/kops/pkg/instancegroups.(*RollingUpdateCluster).RollingUpdate(0xc000452240, 0xc00087cbd0, 0xc000a53560?)
	k8s.io/kops/pkg/instancegroups/rollingupdate.go:196 +0xa69
main.RunRollingUpdateCluster({0x7187d58?, 0xc00005c1e8}, 0x0?, {0x716c340, 0xc000010018}, 0xc0008085a0)
	k8s.io/kops/cmd/kops/rolling-update_cluster.go:434 +0xf45
main.NewCmdRollingUpdateCluster.func1(0xc000177900?, {0xc000707380?, 0x2?, 0x2?})
	k8s.io/kops/cmd/kops/rolling-update_cluster.go:176 +0x37
github.com/spf13/cobra.(*Command).execute(0xc000177900, {0xc000707360, 0x2, 0x2})
	github.com/spf13/cobra@v1.5.0/command.go:872 +0x694
github.com/spf13/cobra.(*Command).ExecuteC(0x94984a0)
	github.com/spf13/cobra@v1.5.0/command.go:990 +0x3b4
github.com/spf13/cobra.(*Command).Execute(...)
	github.com/spf13/cobra@v1.5.0/command.go:918
main.Execute()
	k8s.io/kops/cmd/kops/root.go:95 +0x5c
main.main()
	k8s.io/kops/cmd/kops/main.go:20 +0x17

olemarkus (Member) commented:

Do you happen to have an ASG with warm pools enabled that is scaled to 0? In that case, I think

u := update[len(update)-numSurge-skippedNodes]

would end up trying to access update[0 - 1 - 0], i.e. update[-1].
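
A minimal standalone Go sketch (not the actual kops code; the slice contents and counts are hypothetical) of how that expression goes negative once the warm-pool deletions leave the update list empty while the surge budget is still at least 1:

package main

import "fmt"

func main() {
	// Hypothetical values: after the warm-pool instances are deleted, the list
	// of instances still needing an update can be empty while the surge count
	// derived from maxSurge is still 1.
	update := []string{} // instances left to roll
	numSurge := 1        // surge budget computed from maxSurge
	skippedNodes := 0

	idx := len(update) - numSurge - skippedNodes // 0 - 1 - 0 = -1
	fmt.Println("index:", idx)

	// Same class of failure as the reported crash:
	// panic: runtime error: index out of range [-1]
	u := update[idx]
	fmt.Println(u)
}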

olemarkus (Member) commented:

Yep. This will cause a panic. Reproduced!
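
For anyone else trying to reproduce this, here is a minimal InstanceGroup manifest sketch of the scenario described above (a group scaled to 0 with a warm pool still configured); the group name and sizes are illustrative, not taken from the original report:

apiVersion: kops.k8s.io/v1alpha2
kind: InstanceGroup
metadata:
  name: nodes-example # hypothetical group name
spec:
  role: Node
  minSize: 0 # group scaled down to zero in-service instances
  maxSize: 0
  rollingUpdate:
    maxSurge: 60%
  warmPool:
    enableLifecycleHook: true
    minSize: 1
    maxSize: 2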
