Cluster stuck in infinite delete cycle #383

Closed
yissachar opened this issue Aug 29, 2016 · 10 comments

@yissachar (Contributor)

I set up a cluster with kops:

kops create cluster --cloud=aws --name=<my-domain> --node-count=1 --networking=external --node-size=t2.medium --master-size=t2.medium --zones=us-east-1b --state=s3://<my-state>

Now I am trying to delete the cluster:

kops delete cluster --name=<my-domain> --state=<my-state> --yes

But it's stuck in an infinite cycle:

subnet:subnet-<id>  still has dependencies, will retry
security-group:sg-<id>  still has dependencies, will retry
internet-gateway:igw-<id>   still has dependencies, will retry
Not all resources deleted; waiting before reattempting deletion
    dhcp-options:dopt-<id>
    vpc:vpc-<id>
    subnet:subnet-<id>
    security-group:sg-<id>
    internet-gateway:igw-<id>
    route-table:rtb-<id>

Eventually it bails with:

F0829 18:10:19.535204   37043 delete_cluster.go:33] Not making progress deleting resources; giving up
@yissachar (Contributor, Author)

Trying again the next day worked, and the cluster was deleted. I didn't change anything manually during this period, so I'm not sure why it went into an infinite cycle yesterday but worked today.

@justinsb (Member) commented Sep 1, 2016

Sometimes EC2 resources stick around and block deletion of other resources. That's why delete uses the looping retry model, in addition to building a (partial) DAG.

The most common culprit is an ELB, which often holds resources that aren't visible. Did you maybe have an ELB in the cluster you were deleting? Even so, it normally deletes fairly quickly.

I guess we could boost the timeout and/or see if there's a way to force the deletion of an ELB (e.g. in the past, when I've been impatient, I've manually deleted some IP address allocations). But it would be good to get confirmation that it was an ELB...
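
If you still have the VPC id from the delete output, something like this would show whether an ELB or leftover network interfaces were the blocker (AWS CLI, just a sketch against the same account/region; vpc-<id> is the VPC kops reported):

    # Classic ELBs still attached to the cluster VPC
    aws elb describe-load-balancers \
        --query 'LoadBalancerDescriptions[?VPCId==`vpc-<id>`].[LoadBalancerName]' \
        --output text

    # Network interfaces still allocated in the VPC (ELBs and instances leave these behind)
    aws ec2 describe-network-interfaces \
        --filters Name=vpc-id,Values=vpc-<id> \
        --query 'NetworkInterfaces[].[NetworkInterfaceId,Description,Status]' \
        --output text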

@yissachar (Contributor, Author)

I'm not sure if I had an ELB or not.

I've been doing a lot of creating/deleting clusters with kops over the past couple of days, and I've only had this happen once. Most of the time it deletes with no issues. Sometimes it loops for a bit but deletes the cluster before the timeout.

In this particular case, I tried to delete the cluster several more times in succession over the span of 30-40 minutes, but each time it timed out. It was only the next day that I was able to delete the cluster (with no manual intervention), so boosting the timeout doesn't seem like it would be very useful.

If this happens to me again I'll try to record my cluster state so we can narrow this down.

@yissachar (Contributor, Author)

I've had this happen again:

Not all resources deleted; waiting before reattempting deletion
    dhcp-options:dopt-<id>
    vpc:vpc-<id>

Waited 15 hours and tried again, but it still couldn't delete. Finally I went into the AWS console and manually deleted the route table, at which point kops delete was able to finish deleting the cluster.

I noticed that the delete logs never mentioned the route table, so presumably kops somehow missed it; usually there is a line mentioning the route table.
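
Roughly the CLI equivalent of what I did in the console, in case it helps reproduce (a sketch; vpc-<id> and rtb-<id> stand for the ids from the delete output):

    # List route tables in the cluster VPC along with their tags
    aws ec2 describe-route-tables \
        --filters Name=vpc-id,Values=vpc-<id> \
        --query 'RouteTables[].[RouteTableId,Tags]'

    # Delete the leftover route table (subnet associations have to be removed first)
    aws ec2 delete-route-table --route-table-id rtb-<id>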

@yissachar (Contributor, Author)

To add a bit more info:

The cluster was created with:

kops create cluster --cloud=aws --name=foo.bar.com \
--node-count=1 --node-size=t2.medium \
--master-size=t2.medium  --zones=us-east-1e --state=s3://<my-s3>

Then I edited the cluster to set encryptedVolume: true and kmsKeyId: <my-key-id> on the etcd volumes.

Then I ran kops update cluster --yes and, shortly afterward, kops delete cluster --yes.
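
The relevant part of the cluster spec after the edit looked roughly like this (a sketch; the member and instance group names depend on the zone, and the exact layout may vary between kops versions):

    etcdClusters:
    - name: main
      etcdMembers:
      - name: us-east-1e
        instanceGroup: master-us-east-1e
        encryptedVolume: true
        kmsKeyId: <my-key-id>
    - name: events
      etcdMembers:
      - name: us-east-1e
        instanceGroup: master-us-east-1e
        encryptedVolume: true
        kmsKeyId: <my-key-id>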

@chrislovecnm (Contributor)

What is the status on this?

@yissachar (Contributor, Author)

I haven't had this happen since my last post. I think this is safe to close for now; if it crops up again we can reopen.

@justinsb (Member)

We now have some additional logic to pick up an untagged route table when it is safe to do so. I think this should no longer happen; please reopen if it does.

@engmsaleh

I still get the same issue when I try to delete the cluster.
I also have a question: is there a way to do the create/delete with Terraform directly and have it update the kubeconfig? When I install using Terraform, the kubeconfig is not updated with the new values for the cluster.

justinsb added a commit to justinsb/kops that referenced this issue Dec 9, 2020
Highlights:

* Fix arm64 images, which were built with an incorrect base image.
* Initial (experimental) Azure support

Full change list:

* Update Kops dependency for Azure Blob Storage support [kubernetes#372](kopeio/etcd-manager#372)
* Exclude gazelle from tools/deb-tools [kubernetes#373](kopeio/etcd-manager#373)
* Regenerate bazel in tools/deb-tools [kubernetes#374](kopeio/etcd-manager#374)
* Release notes for 3.0.20201202 [kubernetes#375](kopeio/etcd-manager#375)
* Remove travis CI [kubernetes#377](kopeio/etcd-manager#377)
* Fix vendor generation for tools/deb-tools subproject [kubernetes#376](kopeio/etcd-manager#376)
* Add script to verify image hashes [kubernetes#380](kopeio/etcd-manager#380)
* Fix some incorrect base image hashes for arm64 [kubernetes#379](kopeio/etcd-manager#379)
* Support Azure [kubernetes#378](kopeio/etcd-manager#378)
* Add more descriptions to wait loops [kubernetes#383](kopeio/etcd-manager#383)
* Rename fields in the azure client struct [kubernetes#382](kopeio/etcd-manager#382)
* Fix small typo in code comment [kubernetes#381](kopeio/etcd-manager#381)
hakman pushed a commit to hakman/kops that referenced this issue Dec 9, 2020
@shqear93 commented Jul 25, 2024

It's happening here as well

UPDATE:
In my case it is a different issue; here is the status message from kopf in the annotations, describing deletion protection:

kopf.zalando.org/prevent_delete: '{"started":"2024-07-25T12:10:44.485361+00:00","delayed":"2024-07-25T13:55:45.346459+00:00","purpose":"delete","retries":7,"success":false,"failure":false,"message":"Deletion is not allowed - as the deletion time has not exceeded 7 days - (Currently exceeded 0 days, 1 hours, 30 minutes, 1 seconds days) - Retrying in 900 seconds"}'
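
To see whether a stuck resource is blocked by this, you can inspect the annotation directly (a sketch; substitute your resource kind and name):

    kubectl get <kind> <name> -o yaml | grep 'kopf.zalando.org/prevent_delete'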
