Cluster stuck in infinite delete cycle #383

Closed
yissachar opened this issue Aug 29, 2016 · 10 comments

@yissachar (Contributor)

I set up a cluster with kops:

kops create cluster --cloud=aws --name=<my-domain> --node-count=1 --networking=external --node-size=t2.medium --master-size=t2.medium --zones=us-east-1b --state=s3://<my-state>

Now I am trying to delete the cluster:

kops delete cluster --name=<my-domain> --state=<my-state> --yes

But it's stuck in an infinite cycle:

subnet:subnet-<id>  still has dependencies, will retry
security-group:sg-<id>  still has dependencies, will retry
internet-gateway:igw-<id>   still has dependencies, will retry
Not all resources deleted; waiting before reattempting deletion
    dhcp-options:dopt-<id>
    vpc:vpc-<id>
    subnet:subnet-<id>
    security-group:sg-<id>
    internet-gateway:igw-<id>
    route-table:rtb-<id>

Eventually it bails with:

F0829 18:10:19.535204   37043 delete_cluster.go:33] Not making progress deleting resources; giving up
@yissachar (Contributor, Author)

Trying again the next day worked, and the cluster was deleted. I didn't change anything manually during this period, so I'm not sure why it went into an infinite cycle yesterday but worked today.

@justinsb (Member) commented Sep 1, 2016

Sometimes EC2 resources stick around and block deletion of other resources. That's why delete uses the looping retry model, in addition to building a (partial) DAG.

The most common culprit is an ELB, which often holds resources that aren't visible. Did you maybe have an ELB in the cluster you were deleting? Even so, it normally deletes fairly quickly.

I guess we could boost the timeout and/or see if there's a way to force the deletion of an ELB (e.g. in the past, when I've been impatient, I've manually deleted some IP address allocations). But it would be good to get confirmation that it was an ELB...
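
If you still have the VPC id from the delete output, something like this would show whether an ELB or leftover network interfaces were the blocker (AWS CLI, just a sketch against the same account/region; vpc-<id> is the VPC kops reported):

    # Classic ELBs still attached to the cluster VPC
    aws elb describe-load-balancers \
        --query 'LoadBalancerDescriptions[?VPCId==`vpc-<id>`].[LoadBalancerName]' \
        --output text

    # Network interfaces still allocated in the VPC (ELBs and instances leave these behind)
    aws ec2 describe-network-interfaces \
        --filters Name=vpc-id,Values=vpc-<id> \
        --query 'NetworkInterfaces[].[NetworkInterfaceId,Description,Status]' \
        --output text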

@yissachar (Contributor, Author)

I'm not sure if I had an ELB or not.

I've been doing a lot of creating/deleting clusters with kops over the past couple of days, and I've only had this happen once. Most of the time it deletes with no issues. Sometimes it loops for a bit but deletes the cluster before the timeout.

In this particular case, I tried to delete the cluster several more times in succession over the span of 30-40 minutes, but each time it timed out. It was only the next day that I was able to delete the cluster (with no manual intervention), so boosting the timeout doesn't seem like it would be very useful.

If this happens to me again I'll try to record my cluster state so we can narrow this down.

@yissachar (Contributor, Author)

I've had this happen again:

Not all resources deleted; waiting before reattempting deletion
    dhcp-options:dopt-<id>
    vpc:vpc-<id>

Waited 15 hours and tried again, but it still couldn't delete. Finally I went into the AWS console and manually deleted the route table, at which point kops delete was able to finish deleting the cluster.

I noticed that the delete logs never mentioned the route table, so presumably kops somehow missed it; usually there is a line mentioning the route table.
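
Roughly the CLI equivalent of what I did in the console, in case it helps reproduce (a sketch; vpc-<id> and rtb-<id> stand for the ids from the delete output):

    # List route tables in the cluster VPC along with their tags
    aws ec2 describe-route-tables \
        --filters Name=vpc-id,Values=vpc-<id> \
        --query 'RouteTables[].[RouteTableId,Tags]'

    # Delete the leftover route table (subnet associations have to be removed first)
    aws ec2 delete-route-table --route-table-id rtb-<id>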

@yissachar (Contributor, Author)

To add a bit more info:

The cluster was created with:

kops create cluster --cloud=aws --name=foo.bar.com \
--node-count=1 --node-size=t2.medium \
--master-size=t2.medium  --zones=us-east-1e --state=s3://<my-s3>

Then I edited the cluster to set encryptedVolume: true and kmsKeyId: <my-key-id> on the etcd volumes.

Then I ran kops update cluster --yes and, shortly afterward, kops delete cluster --yes.
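
The relevant part of the cluster spec after the edit looked roughly like this (a sketch; the member and instance group names depend on the zone, and the exact layout may vary between kops versions):

    etcdClusters:
    - name: main
      etcdMembers:
      - name: us-east-1e
        instanceGroup: master-us-east-1e
        encryptedVolume: true
        kmsKeyId: <my-key-id>
    - name: events
      etcdMembers:
      - name: us-east-1e
        instanceGroup: master-us-east-1e
        encryptedVolume: true
        kmsKeyId: <my-key-id>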

@chrislovecnm (Contributor)

What is the status on this?

@yissachar (Contributor, Author)

I haven't had this happen since my last post. I think this is safe to close for now; if it crops up again we can reopen.

@justinsb (Member)

We now have some additional logic to pick up an untagged route table when it is safe to do so. I think this should no longer happen; please reopen if it does.

@engmsaleh

I still get the same issue when I try to delete the cluster.
I also have a question: is there a way to do the create/delete with Terraform directly and have it update the kubeconfig? When I install using Terraform, the kubeconfig is not updated with the new values for the cluster.

justinsb added a commit to justinsb/kops that referenced this issue Dec 9, 2020
Highlights:

* Fix arm64 images, which were built with an incorrect base image.
* Initial (experimental) Azure support

Full change list:

* Update Kops dependency for Azure Blob Storage support [kubernetes#372](kopeio/etcd-manager#372)
* Exclude gazelle from tools/deb-tools [kubernetes#373](kopeio/etcd-manager#373)
* Regenerate bazel in tools/deb-tools [kubernetes#374](kopeio/etcd-manager#374)
* Release notes for 3.0.20201202 [kubernetes#375](kopeio/etcd-manager#375)
* Remove travis CI [kubernetes#377](kopeio/etcd-manager#377)
* Fix vendor generation for tools/deb-tools subproject [kubernetes#376](kopeio/etcd-manager#376)
* Add script to verify image hashes [kubernetes#380](kopeio/etcd-manager#380)
* Fix some incorrect base image hashes for arm64 [kubernetes#379](kopeio/etcd-manager#379)
* Support Azure [kubernetes#378](kopeio/etcd-manager#378)
* Add more descriptions to wait loops [kubernetes#383](kopeio/etcd-manager#383)
* Rename fields in the azure client struct [kubernetes#382](kopeio/etcd-manager#382)
* Fix small typo in code comment [kubernetes#381](kopeio/etcd-manager#381)
hakman pushed a commit to hakman/kops that referenced this issue Dec 9, 2020
@shqear93 commented Jul 25, 2024

It's happening here as well

UPDATE:
In my case it is a different issue; here is the status message from kopf in the annotations, describing deletion protection:

kopf.zalando.org/prevent_delete: '{"started":"2024-07-25T12:10:44.485361+00:00","delayed":"2024-07-25T13:55:45.346459+00:00","purpose":"delete","retries":7,"success":false,"failure":false,"message":"Deletion is not allowed - as the deletion time has not exceeded 7 days - (Currently exceeded 0 days, 1 hours, 30 minutes, 1 seconds days) - Retrying in 900 seconds"}'
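
To see whether a stuck resource is blocked by this, you can inspect the annotation directly (a sketch; substitute your resource kind and name):

    kubectl get <kind> <name> -o yaml | grep 'kopf.zalando.org/prevent_delete'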
