Unable to recover from failed cluster deletion #8424

Closed
johngmyers opened this issue Jan 27, 2020 · 1 comment · Fixed by #9052
Labels
lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale.

Comments

@johngmyers
Member

1. What kops version are you running? The command kops version will display
this information.

Version 1.15.0 (git-ecffaff)
(this is a private fork)

2. What Kubernetes version are you running? kubectl version will print the
version if a cluster is running or provide the Kubernetes version specified as
a kops flag.

1.15.9

3. What cloud provider are you using?

AWS

4. What commands did you run? What is the simplest way to reproduce this issue?

kops delete cluster --name REDACTED --state REDACTED --yes

5. What happened after the commands executed?

The first time, it failed with:

subnet:subnet-08f4c2ba4b71c9492	still has dependencies, will retry
subnet:subnet-015189c75e105f5cc	still has dependencies, will retry
subnet:subnet-0655ae5887a06b155	still has dependencies, will retry
Not all resources deleted; waiting before reattempting deletion
	route-table:rtb-0c47d709931c7f1bd
	route-table:rtb-07f3c8a3d94b134c8
	subnet:subnet-08f4c2ba4b71c9492
	route-table:rtb-09b6a1f0eeb6ae5fe
	subnet:subnet-015189c75e105f5cc
	subnet:subnet-0655ae5887a06b155
subnet:subnet-0655ae5887a06b155	still has dependencies, will retry
subnet:subnet-015189c75e105f5cc	still has dependencies, will retry
subnet:subnet-08f4c2ba4b71c9492	still has dependencies, will retry
Not all resources deleted; waiting before reattempting deletion
	subnet:subnet-015189c75e105f5cc
	route-table:rtb-09b6a1f0eeb6ae5fe
	subnet:subnet-0655ae5887a06b155
	route-table:rtb-0c47d709931c7f1bd
	route-table:rtb-07f3c8a3d94b134c8
	subnet:subnet-08f4c2ba4b71c9492

not making progress deleting resources; giving up

The subnets couldn't be deleted due to an apparent transient AWS issue: one of the deleted ELBs was still holding IPs from those subnets. After a weekend, the subnets could be deleted.

After deleting the subnets, reattempting a kops delete cluster ... --yes resulted in:

error from DescribeNatGateways: NatGatewayNotFound: NAT gateway nat-0a8aa3ba56bb61389 was not found
	status code: 400, request id: 9be9785e-09e2-4e71-9466-2eeb29d5790a

6. What did you expect to happen?

The second kops delete cluster should have continued deleting the cluster's resources and then removed the cluster from the state store.

7. Please provide your cluster manifest. Execute
kops get --name my.example.com -o yaml to display your cluster manifest.
You may want to remove your cluster name and other sensitive information.

8. Please run the commands with most verbose logging by adding the -v 10 flag.
Paste the logs into this report, or in a gist and provide the gist link here.

9. Anything else we need to know?

I believe aws.FindNatGateways() needs to tolerate this particular error from c.EC2().DescribeNatGateways(), treating it as having returned zero NAT gateways.

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label on Apr 26, 2020