Dealing with PVs and PVCs on deleted nodes #201

Open
rbtcollins opened this issue Jun 19, 2020 · 15 comments

Labels
help wanted · kind/feature · lifecycle/frozen

Comments

@rbtcollins (Contributor)

When a node is deleted, whether due to failure or to an autoscale event removing it, the local static volumes on it are permanently destroyed. The metadata for those PVs, and for any PVCs referencing them, is not.

This is frustrating, because we have to take care not to reuse node names; yet in AKS, node names are deterministic and commonly reused, as you can see from these sample names:

aks-fdbl32sv2-26122852-vmss000000   Ready    agent   27h   v1.15.11
aks-fdbl32sv2-26122852-vmss000001   Ready    agent   27h   v1.15.11
aks-fdbl32sv2-26122852-vmss000002   Ready    agent   27h   v1.15.11
aks-fdbl32sv2-26122852-vmss000003   Ready    agent   27h   v1.15.11

This leads to rogue volumes and claims resurrecting themselves when a node pool is scaled up again, whether automatically or not.

Describe the solution you'd like in detail

I would like to tackle this from two angles:

To prevent this zombie-resurrection threat, I would like to extend the topology key with some sort of UUID for the node, for instance the machine ID (7b59f36627eb48f0af387b76c602cf9b from one of those examples). However, this may require changes outside of this SIG and thus be future work, since we cannot benefit from changes we cannot deploy in AKS.
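
For illustration, a minimal sketch of the idea in shell, assuming kubectl access; the machineID field is a real part of Node status, but the label key topology.example.com/machine-id is a hypothetical name for the extended topology key:

# Read the node's machine ID from its status (a real Node field) and
# expose it as a label that PV nodeAffinity could match on, so that a
# reused node *name* alone cannot resurrect an old volume.
NODE=aks-fdbl32sv2-26122852-vmss000000
MACHINE_ID=$(kubectl get node "$NODE" -o jsonpath='{.status.nodeInfo.machineID}')
kubectl label node "$NODE" "topology.example.com/machine-id=${MACHINE_ID}"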

To prevent the leak in the first place, I would like a controller that watches nodes and, when a node is removed from the cluster, removes all the claims and PVs that were on that node. This would be a separate controller, so that it can be opt-in rather than run implicitly in the existing daemonsets.
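
A rough sketch of what that opt-in cleanup could look like, written here as a shell loop rather than a real controller; the node-affinity jsonpath matches the PVs this provisioner creates, but everything else, including restricting the pass to local PVs, is an assumption:

# Hypothetical cleanup pass: delete any local PV whose node-affinity
# hostname no longer matches a live node. A real controller would watch
# Node deletions rather than polling, and would delete bound PVCs too.
LIVE_NODES=$(kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\n"}{end}')

kubectl get pv -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.nodeAffinity.required.nodeSelectorTerms[0].matchExpressions[0].values[0]}{"\n"}{end}' \
  | while IFS=$'\t' read -r PV NODE; do
      [ -n "$NODE" ] || continue          # skip PVs without node affinity
      grep -qx "$NODE" <<<"$LIVE_NODES" || kubectl delete pv "$PV"
    done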

Describe alternatives you've considered

I have not considered any alternatives.

Additional context

@rbtcollins (Contributor, Author)

Basically, I never want to write this thing again:

# Delete every local PV whose node affinity points at the deleted node pool.
kubectl get pv \
  | awk '/^local/ { print $1 }' \
  | xargs kubectl get pv -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.nodeAffinity.required.nodeSelectorTerms[0].matchExpressions[0].values[0]}{"\n"}{end}' \
  | awk '/aks-fdbl16sv2/ { print $1 }' \
  | xargs kubectl delete pv

@cofyc added the kind/feature label Jun 19, 2020
@cofyc (Member)

cofyc commented Jun 19, 2020

Yes, this is a general issue.

To prevent the leak in the first place, I would like a controller that watches nodes and, when a node is removed from the cluster, removes all the claims and PVs that were on that node. This would be a separate controller, so that it can be opt-in rather than run implicitly in the existing daemonsets.

I agree a cloud controller is required. I've written up a proposal for this: https://docs.google.com/document/d/1SA9epEwA3jPwibRV0ccQwJ2UfZXoeUYKyNxNegt0vn4

@rbtcollins (Contributor, Author)

Thank you for the link, I've added a thought to the document there. Is a prototype under way?

@msau42 (Contributor)

msau42 commented Sep 15, 2020

As an intermediate step, would it help if we put an OwnerRef on the PV pointing to the Node object? PVC protection will prevent the PV from actually being deleted, but if a user comes in and deletes the PVC, then the PV object will get cleaned up instead of taking up cycles from the PV controller (which goes through all PVs in the cluster every few seconds).
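
For illustration, manually attaching such an OwnerRef might look like the following sketch; local-pv-example is a hypothetical PV name, and in practice the provisioner would set this at PV creation time:

# Make the Node own the PV object, so the garbage collector reaps the PV
# once the Node object is deleted (PVC protection still applies first).
NODE=aks-fdbl32sv2-26122852-vmss000000
NODE_UID=$(kubectl get node "$NODE" -o jsonpath='{.metadata.uid}')
kubectl patch pv local-pv-example --type merge -p "{
  \"metadata\": {\"ownerReferences\": [{
    \"apiVersion\": \"v1\", \"kind\": \"Node\",
    \"name\": \"$NODE\", \"uid\": \"$NODE_UID\"
  }]}
}"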

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label Dec 14, 2020
@rbtcollins (Contributor, Author)

rbtcollins commented Dec 14, 2020 via email

@k8s-ci-robot removed the lifecycle/stale label Dec 14, 2020
@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label Mar 14, 2021
@cofyc (Member)

cofyc commented Mar 14, 2021

/remove-lifecycle stale
/lifecycle frozen

@k8s-ci-robot added the lifecycle/frozen label and removed the lifecycle/stale label Mar 14, 2021
@cdenneen

/help
/kind feature

@k8s-ci-robot (Contributor)

@cdenneen:
This request has been marked as needing help from a contributor.

Please ensure the request meets the requirements listed here.

If this request no longer meets these requirements, the label can be removed
by commenting with the /remove-help command.

In response to this:

/help
/kind feature

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot added the help wanted label Aug 27, 2021
@pierreozoux

pierreozoux commented Sep 13, 2022

Possibly a better set of scripts, depending on your needs:

# Collect the names of local PVs pinned to the dead node.
kubectl get pv -o custom-columns=NAME:metadata.name,PATH:spec.local.path,PVC:spec.claimRef.name,NODE:spec.nodeAffinity.required.nodeSelectorTerms[0].matchExpressions[0].values[0] \
  | grep "${NODE_NAME}" | awk '{ print $1 }' > /tmp/delete_pv

# Collect "namespace name" pairs for the claims bound to those PVs,
# skipping PVs with no claimRef (<none>).
kubectl get pv -o custom-columns=NAME:metadata.name,PATH:spec.local.path,NAMESPACE:spec.claimRef.namespace,PVC:spec.claimRef.name,NODE:spec.nodeAffinity.required.nodeSelectorTerms[0].matchExpressions[0].values[0] \
  | grep "${NODE_NAME}" | grep -v none | awk '{ print $3" "$4 }' > /tmp/delete_pvc

# Delete the claims, then the volumes.
while read -r NAMESPACE PVC; do
  kubectl delete pvc -n "$NAMESPACE" "$PVC"
done < /tmp/delete_pvc

for PV in $(cat /tmp/delete_pv); do kubectl delete pv "$PV"; done

@GrigorievNick

Since k8s 1.23 there is a feature for graceful node shutdown, and since 1.24, non-graceful node shutdown.
So maybe we can use it to delete PVCs from nodes?
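
For reference, the non-graceful path is driven by the out-of-service taint; a sketch, assuming a 1.24+ cluster with that feature enabled and a node an operator has already confirmed to be dead:

# Mark a confirmed-dead node out of service so its pods and volume
# attachments are forcibly cleaned up. Note this does not by itself
# delete local PV/PVC metadata, which is the gap this issue is about.
kubectl taint nodes aks-fdbl32sv2-26122852-vmss000003 \
  node.kubernetes.io/out-of-service=nodeshutdown:NoExecute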

@loveRhythm1990 (Contributor)

loveRhythm1990 commented Jan 27, 2024

Since k8s 1.23 there is a feature for graceful node shutdown, and since 1.24, non-graceful node shutdown. So maybe we can use it to delete PVCs from nodes?

It seems this issue has been fixed by #385, but I think using the graceful node shutdown feature is also a good idea.

@niranjandarshann

Since k8s 1.23 there is a feature for graceful node shutdown, and since 1.24, non-graceful node shutdown. So maybe we can use it to delete PVCs from nodes?

Yes, but it is better to go with graceful node shutdown. This feature allows nodes in our Kubernetes cluster to shut down in an orderly manner: instead of stopping abruptly, the node notifies Kubernetes, giving it time to safely relocate workloads to other nodes. It protects against data loss as well.

A non-graceful shutdown, by contrast, is abrupt and can lead to data loss, application downtime, and disrupted services.
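
For context, graceful node shutdown is configured on the kubelet; a minimal sketch, assuming a systemd-based node where the kubelet reads /var/lib/kubelet/config.yaml (both fields are real KubeletConfiguration fields, but the path and restart method vary by distro, and appending is a simplification; merge into your existing config instead):

# Give the kubelet a grace window on shutdown so ordinary pods, then
# critical pods, can terminate cleanly before the node powers off.
cat <<'EOF' >> /var/lib/kubelet/config.yaml
shutdownGracePeriod: 30s
shutdownGracePeriodCriticalPods: 10s
EOF
systemctl restart kubelet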

@niranjandarshann

The issue is addressed and resolved in #385.
