Rolling update panic with warm pools #13758

Closed
yurrriq opened this issue Jun 9, 2022 · 6 comments · Fixed by #14251
Labels
kind/bug Categorizes issue or PR as related to a bug.

Comments

yurrriq commented Jun 9, 2022

/kind bug

1. What kops version are you running? The command kops version will display
this information.

Version 1.23.2 (git-1.23.2)

2. What Kubernetes version are you running? kubectl version will print the
version if a cluster is running or provide the Kubernetes version specified as
a kops flag.

kubernetesVersion: 1.22.10

3. What cloud provider are you using?

cloudProvider: aws

4. What commands did you run? What is the simplest way to reproduce this issue?

kops rolling-update cluster --yes. We were also upgrading from kOps 1.22.4 to 1.23.2, in case that's relevant.

5. What happened after the commands executed?

Panic.

6. What did you expect to happen?

A smooth rolling update.

7. Please provide your cluster manifest. Execute
kops get --name my.example.com -o yaml to display your cluster manifest.
You may want to remove your cluster name and other sensitive information.

---
apiVersion: kops.k8s.io/v1alpha2
kind: Cluster
# ...
spec:
  # ...
  cloudProvider: aws
  kubernetesVersion: 1.22.10 # was 1.22.7, in case that's relevant
---
apiVersion: kops.k8s.io/v1alpha2
kind: InstanceGroup
# ...
spec:
  # ...
  maxSize: 30
  minSize: 1
  role: Node
  rollingUpdate:
    drainAndTerminate: true
    maxSurge: 60%
  warmPool:
    enableLifecycleHook: true
    maxSize: 4
    minSize: 3

8. Please run the commands with the most verbose logging by adding the -v 10 flag.
Paste the logs into this report, or into a gist and provide the gist link here.

I0609 14:19:12.802210     485 instancegroups.go:309] Tainting 2 nodes in "nodes-eu-central-1a" instancegroup.
I0609 14:19:12.837133     485 instancegroups.go:131] deleting warm pool instance "AAAA"
I0609 14:19:13.004328     485 instancegroups.go:131] deleting warm pool instance "BBBB"
I0609 14:19:13.161806     485 instancegroups.go:131] deleting warm pool instance "CCCC"
I0609 14:19:13.309859     485 instancegroups.go:565] Detaching instance "DDDD", ...
I0609 14:19:13.564841     485 instancegroups.go:170] waiting for 15s after detaching instance
I0609 14:19:28.565202     485 instancegroups.go:467] Validating the cluster.
I0609 14:19:30.949595     485 instancegroups.go:503] Cluster validated; revalidating in 10s to make sure it does not flap.
I0609 14:19:44.024648     485 instancegroups.go:500] Cluster validated.
I0609 14:19:44.024687     485 instancegroups.go:565] Detaching instance "EEEE", ...
panic: runtime error: index out of range [-1]
goroutine 1 [running]:
k8s.io/kops/pkg/instancegroups.(*RollingUpdateCluster).rollingUpdateInstanceGroup(0xc0005ba900, 0xc0007de000, 0x37e11d600)
	k8s.io/kops/pkg/instancegroups/instancegroups.go:154 +0x1031
k8s.io/kops/pkg/instancegroups.(*RollingUpdateCluster).RollingUpdate(0xc0005ba900, 0xc0007d5590, 0xc0008aeb40)
	k8s.io/kops/pkg/instancegroups/rollingupdate.go:196 +0xa45
main.RunRollingUpdateCluster({0x7439c80, 0xc00005c1c8}, 0xc0008bfd98, {0x7343860, 0xc000010018}, 0xc00071a510)
	k8s.io/kops/cmd/kops/rolling-update_cluster.go:434 +0xf45
main.NewCmdRollingUpdateCluster.func1(0xc000765900, {0xc000088800, 0x2, 0x2})
	k8s.io/kops/cmd/kops/rolling-update_cluster.go:176 +0x37
k8s.io/kops/vendor/github.com/spf13/cobra.(*Command).execute(0xc000765900, {0xc0000887e0, 0x2, 0x2})
	k8s.io/kops/vendor/github.com/spf13/cobra/command.go:856 +0x60e
k8s.io/kops/vendor/github.com/spf13/cobra.(*Command).ExecuteC(0x9ece900)
	k8s.io/kops/vendor/github.com/spf13/cobra/command.go:974 +0x3bc
k8s.io/kops/vendor/github.com/spf13/cobra.(*Command).Execute(...)
	k8s.io/kops/vendor/github.com/spf13/cobra/command.go:902
main.Execute()
	k8s.io/kops/cmd/kops/root.go:95 +0x5c
main.main()
	k8s.io/kops/cmd/kops/main.go:20 +0x17

9. Anything else we need to know?

Similar to #11774

k8s-ci-robot added the kind/bug label on Jun 9, 2022

yurrriq commented Aug 10, 2022

We're still seeing this quite frequently.

olemarkus (Member) commented:

Please reproduce this with kops 1.24.1. It should give more information about why the detach fails, which may be what's causing this.

yurrriq commented Aug 10, 2022

Noted. We're currently still using 1.23.2.

yurrriq commented Sep 8, 2022

Still seeing this with kOps 1.24.1. Here are some logs:

I0908 09:43:08.745335    6639 instancegroups.go:311] Tainting 6 nodes in "nodes-eu-central-1c" instancegroup.
I0908 09:43:08.848436    6639 instancegroups.go:132] deleting warm pool instance "AAAAA"
I0908 09:43:09.003241    6639 instancegroups.go:132] deleting warm pool instance "BBBBB"
I0908 09:43:09.141794    6639 instancegroups.go:132] deleting warm pool instance "CCCCC"
I0908 09:43:09.293505    6639 instancegroups.go:132] deleting warm pool instance "DDDDD"
I0908 09:43:09.463532    6639 instancegroups.go:132] deleting warm pool instance "EEEEE"
I0908 09:43:09.632396    6639 instancegroups.go:132] deleting warm pool instance "FFFFF"
I0908 09:43:09.791094    6639 instancegroups.go:132] deleting warm pool instance "GGGGG"
I0908 09:43:09.962070    6639 instancegroups.go:132] deleting warm pool instance "HHHHH"
I0908 09:43:10.099380    6639 instancegroups.go:132] deleting warm pool instance "IIIII"
I0908 09:43:10.241848    6639 instancegroups.go:132] deleting warm pool instance "JJJJJ"
I0908 09:43:10.385283    6639 instancegroups.go:568] Detaching instance "KKKKK", node "ip-KKKKK.eu-central-1.compute.internal", in group "nodes-eu-central-1c.my-cluster.example".
I0908 09:43:10.739226    6639 instancegroups.go:172] waiting for 15s after detaching instance
I0908 09:43:25.739428    6639 instancegroups.go:470] Validating the cluster.
I0908 09:43:28.186852    6639 instancegroups.go:506] Cluster validated; revalidating in 10s to make sure it does not flap.
I0908 09:43:40.244820    6639 instancegroups.go:503] Cluster validated.
I0908 09:43:40.244869    6639 instancegroups.go:568] Detaching instance "LLLLL", node "ip-LLLLL.eu-central-1.compute.internal", in group "nodes-eu-central-1c.my-cluster.example".
I0908 09:43:40.678965    6639 instancegroups.go:568] Detaching instance "MMMMM", node "ip-MMMMM.eu-central-1.compute.internal", in group "nodes-eu-central-1c.my-cluster.example".
I0908 09:43:41.120884    6639 instancegroups.go:568] Detaching instance "NNNNN", node "ip-NNNNN.eu-central-1.compute.internal", in group "nodes-eu-central-1c.my-cluster.example".
I0908 09:43:41.547119    6639 instancegroups.go:568] Detaching instance "OOOOO", node "ip-OOOOO.eu-central-1.compute.internal", in group "nodes-eu-central-1c.my-cluster.example".
I0908 09:43:41.957413    6639 instancegroups.go:568] Detaching instance "PPPPP", node "ip-PPPPP.eu-central-1.compute.internal", in group "nodes-eu-central-1c.my-cluster.example".
panic: runtime error: index out of range [-1]

goroutine 1 [running]:
k8s.io/kops/pkg/instancegroups.(*RollingUpdateCluster).rollingUpdateInstanceGroup(0xc000452240, 0xc0002c01c0, 0x37e11d600)
	k8s.io/kops/pkg/instancegroups/instancegroups.go:155 +0x1410
k8s.io/kops/pkg/instancegroups.(*RollingUpdateCluster).RollingUpdate(0xc000452240, 0xc00087cbd0, 0xc000a53560?)
	k8s.io/kops/pkg/instancegroups/rollingupdate.go:196 +0xa69
main.RunRollingUpdateCluster({0x7187d58?, 0xc00005c1e8}, 0x0?, {0x716c340, 0xc000010018}, 0xc0008085a0)
	k8s.io/kops/cmd/kops/rolling-update_cluster.go:434 +0xf45
main.NewCmdRollingUpdateCluster.func1(0xc000177900?, {0xc000707380?, 0x2?, 0x2?})
	k8s.io/kops/cmd/kops/rolling-update_cluster.go:176 +0x37
github.com/spf13/cobra.(*Command).execute(0xc000177900, {0xc000707360, 0x2, 0x2})
	github.com/spf13/cobra@v1.5.0/command.go:872 +0x694
github.com/spf13/cobra.(*Command).ExecuteC(0x94984a0)
	github.com/spf13/cobra@v1.5.0/command.go:990 +0x3b4
github.com/spf13/cobra.(*Command).Execute(...)
	github.com/spf13/cobra@v1.5.0/command.go:918
main.Execute()
	k8s.io/kops/cmd/kops/root.go:95 +0x5c
main.main()
	k8s.io/kops/cmd/kops/main.go:20 +0x17

olemarkus (Member) commented:

Do you happen to have an ASG with warm pools enabled that is scaled to 0? In that case, I think

u := update[len(update)-numSurge-skippedNodes]

would end up trying to access update[0 - 1 - 0], i.e. update[-1].
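
A minimal standalone Go sketch (not the actual kops code; the slice contents and counts are hypothetical) of how that expression goes negative once the warm-pool deletions leave the update list empty while the surge budget is still at least 1:

package main

import "fmt"

func main() {
	// Hypothetical values: after the warm-pool instances are deleted, the list
	// of instances still needing an update can be empty while the surge count
	// derived from maxSurge is still 1.
	update := []string{} // instances left to roll
	numSurge := 1        // surge budget computed from maxSurge
	skippedNodes := 0

	idx := len(update) - numSurge - skippedNodes // 0 - 1 - 0 = -1
	fmt.Println("index:", idx)

	// Same class of failure as the reported crash:
	// panic: runtime error: index out of range [-1]
	u := update[idx]
	fmt.Println(u)
}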

olemarkus (Member) commented:

Yep. This will cause a panic. Reproduced!
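
For anyone else trying to reproduce this, here is a minimal InstanceGroup manifest sketch of the scenario described above (a group scaled to 0 with a warm pool still configured); the group name and sizes are illustrative, not taken from the original report:

apiVersion: kops.k8s.io/v1alpha2
kind: InstanceGroup
metadata:
  name: nodes-example # hypothetical group name
spec:
  role: Node
  minSize: 0 # group scaled down to zero in-service instances
  maxSize: 0
  rollingUpdate:
    maxSurge: 60%
  warmPool:
    enableLifecycleHook: true
    minSize: 1
    maxSize: 2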
