
Conversation

@mlavacca
Contributor

What this PR does / why we need it:
When a machine is deleted, the machine controller now waits for the volumeAttachments deletion before deleting the node.

Which issue(s) this PR fixes (optional, in fixes #<issue number> format, will close the issue(s) when PR gets merged):
Fixes #1189

Special notes for your reviewer:

Optional Release Note:

NONE

@kubermatic-bot added release-note-none, dco-signoff: yes, sig/cluster-management, and size/M labels Feb 16, 2022
@moadqassem
Member

@mlavacca This seems quite strange and I have two questions:
1- It seems that the issue is related to vSphere, not the other cloud providers, so why are you doing this at a global (machine) level?
2- If the node gets deleted before the volume, what error do you actually get? This is quite strange and sounds like an issue in the CSI driver itself, not the machine-controller. Usually, when a node doesn't exist, the CCM/CSI (or whichever controller) should delete its resources. IIRC, it was possible to delete the volumeAttachment objects, after which the PVs are released and the PVCs get deleted, via a ticker service.

In general I don't think it is the machine-controller's job to remove volumes; that's why we use the CSI driver :-). If this is indeed an issue in the driver, I would strongly recommend leaving a finalizer on the machine object: during the cleanup, check the node name under the machine status, remove the volumeAttachments, and then remove the finalizer from the machine.

@mlavacca
Contributor Author

mlavacca commented Feb 17, 2022

@moadqassem

2- If the node got deleted before the volume what is the error that you get actually? Because this is quite strange and an issue in the CSI driver itself not machine controller.

Yes, there is an issue in the vSphere CSI driver, and the aim of this PR is to mitigate it. If the node is deleted before the CSI driver has time to delete the volumeAttachments, they hang around forever, and new pods that try to mount the associated volumes cannot do so because of the stale volumeAttachments.

In general I don't think it is the machine-controller's job to remove volumes; that's why we use the CSI driver :-). If this is indeed an issue in the driver, I would strongly recommend leaving a finalizer on the machine object: during the cleanup, check the node name under the machine status, remove the volumeAttachments, and then remove the finalizer from the machine.

I'm not removing any volumeAttachment; this code introduces a wait to make sure that all the volumeAttachments are correctly deleted by the CSI driver. For this reason, I do not see why it would be a problem to have this behavior for all the cloud providers.
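To make the intent concrete, the wait can be pictured roughly like the following sketch. This is not the PR's literal code: it assumes a controller-runtime client, and the polling intervals are arbitrary.

package example

import (
	"context"
	"time"

	storagev1 "k8s.io/api/storage/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	ctrlruntimeclient "sigs.k8s.io/controller-runtime/pkg/client"
)

// waitForVolumeAttachmentsDeletion polls until no VolumeAttachment in the
// cluster references nodeName anymore (i.e. the CSI driver has finished
// detaching), or until the timeout expires.
func waitForVolumeAttachmentsDeletion(ctx context.Context, c ctrlruntimeclient.Client, nodeName string) error {
	return wait.PollImmediate(10*time.Second, 5*time.Minute, func() (bool, error) {
		vaList := &storagev1.VolumeAttachmentList{}
		if err := c.List(ctx, vaList); err != nil {
			return false, err
		}
		for _, va := range vaList.Items {
			if va.Spec.NodeName == nodeName {
				// A volume is still attached: the node must not be deleted yet.
				return false, nil
			}
		}
		return true, nil
	})
}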

BTW, something didn't work as expected during the e2e tests; I'm going to investigate and debug them.

@moadqassem
Member

moadqassem commented Feb 17, 2022

I'm not removing any volumeAttachment

That's the thing. I mean, we can just simply remove those as part of the cleanup process. In other words, when cleanup is called, check the volumeAttachments and remove the ones that reference the machine's node. This way you will not block the node draining/termination, and you keep the mitigation local, where it should be, until the VMware folks resolve the issue:

kubernetes-sigs/vsphere-csi-driver#359

P.S.: it was created almost 2 years ago, but it became active again 3 weeks ago, so 🤞
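For illustration, the suggested cleanup would amount to something like this hypothetical helper (same assumptions and imports as the sketch above, plus apierrors for k8s.io/apimachinery/pkg/api/errors; the node name would come from the machine's status):

// cleanupVolumeAttachments deletes every VolumeAttachment referencing the
// machine's node instead of waiting for the CSI driver to collect them.
func cleanupVolumeAttachments(ctx context.Context, c ctrlruntimeclient.Client, nodeName string) error {
	vaList := &storagev1.VolumeAttachmentList{}
	if err := c.List(ctx, vaList); err != nil {
		return err
	}
	for i := range vaList.Items {
		if vaList.Items[i].Spec.NodeName != nodeName {
			continue
		}
		if err := c.Delete(ctx, &vaList.Items[i]); err != nil && !apierrors.IsNotFound(err) {
			return err
		}
	}
	return nil
}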

@mlavacca
Contributor Author

mlavacca commented Feb 17, 2022

I'm not removing any volumeAttachment

That's the thing. I mean, we can just simply remove those as part of the cleanup process. In other words, when cleanup is called, check the volumeAttachments and remove the ones that reference the machine's node. This way you will not block the node draining/termination, and you keep the mitigation local, where it should be, until the VMware folks resolve the issue:

kubernetes-sigs/vsphere-csi-driver#359

P.S.: it was created almost 2 years ago, but it became active again 3 weeks ago, so 🤞

The problem is that the volumeAttachments are managed by the CSI driver; if you delete them, they are automatically recreated. Furthermore, they have a finalizer (external-attacher/csi-vsphere-vmware-com) that is added/removed by the CSI driver, so I don't see a correct way to manually clean them up.
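As an illustration of that last point (same assumptions as the earlier sketches, plus fmt and types for k8s.io/apimachinery/pkg/types; vaName is a placeholder), a deleted volumeAttachment just lingers with a deletionTimestamp set until the driver removes the finalizer:

// checkVolumeAttachment shows why a plain Delete is not enough: while the
// external-attacher finalizer is present, the object is only marked for
// deletion and remains in the cluster.
func checkVolumeAttachment(ctx context.Context, c ctrlruntimeclient.Client, vaName string) error {
	// VolumeAttachments are cluster-scoped, so only a name is needed.
	va := &storagev1.VolumeAttachment{}
	if err := c.Get(ctx, types.NamespacedName{Name: vaName}, va); err != nil {
		return err
	}
	fmt.Printf("deletionTimestamp=%v finalizers=%v\n", va.DeletionTimestamp, va.Finalizers)
	return nil
}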

func (r *Reconciler) deleteNodeForMachine(ctx context.Context, machine *clusterv1alpha1.Machine) (*reconcile.Result, error) {
	// List all the volumeAttachments in the cluster; we must be sure that all
	// of them will be deleted before deleting the node.
	volumeAttachments := &storagev1.VolumeAttachmentList{}
Member

Can we just abstract this behaviour into a method and only run it if the cloud provider is vSphere?

}

-	return nil, r.deleteNodeForMachine(ctx, machine)
+	return r.deleteNodeForMachine(ctx, machine)
Member

Just call this method here when the cloud provider is vSphere and the volumeAttachments are gone.

@kubermatic-bot added size/L and removed size/M labels Feb 23, 2022
@mlavacca
Contributor Author

/retest

@mlavacca
Contributor Author

/test pull-machine-controller-e2e-gce

@mlavacca requested a review from moadqassem February 23, 2022 14:04
@mlavacca
Contributor Author

@moadqassem I implemented your suggested solution, PTAL

@mlavacca
Contributor Author

/retest

Signed-off-by: Mattia Lavacca <lavacca.mattia@gmail.com>
Signed-off-by: Mattia Lavacca <lavacca.mattia@gmail.com>
Signed-off-by: Mattia Lavacca <lavacca.mattia@gmail.com>
Signed-off-by: Mattia Lavacca <lavacca.mattia@gmail.com>
Signed-off-by: Mattia Lavacca <lavacca.mattia@gmail.com>
@mlavacca force-pushed the wait-volumeattachments-deletion branch from e73baa6 to 6393218 February 28, 2022 20:51
@mlavacca
Contributor Author

mlavacca commented Mar 1, 2022

Code has been improved to handle both the node rollout and the node deletion on a single-node cluster. @moadqassem PTAL

ErrorQueueLen = 1000
)

type NodeVolumeAttachmentsCleanup struct {
Member

I thought that we were not going to need this and would just change how the node is drained. For example, if there is a volume still attached, then don't remove the CSI driver pod, and once no volumeAttachment is left, remove the pod.

Contributor Author

Yes, and this is an implementation of that behavior. If there are volumeAttachments (see the sketch after this list):

  1. Cordon the old node.
  2. Delete all pods using volumes attached to the old node.
  3. Wait for the CSI driver to collect the volumeAttachments.
  4. Drain the old node.
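Concretely, that sequence looks roughly like the sketch below (same assumptions as the earlier snippets; corev1 is k8s.io/api/core/v1, and podsUsingAttachedVolumes and drainNode are hypothetical helpers, not functions from this PR):

// evictNodeWithVolumeAttachments sketches the four steps above.
func evictNodeWithVolumeAttachments(ctx context.Context, c ctrlruntimeclient.Client, node *corev1.Node) error {
	// 1. Cordon the old node so no new pods are scheduled onto it.
	original := node.DeepCopy()
	node.Spec.Unschedulable = true
	if err := c.Patch(ctx, node, ctrlruntimeclient.MergeFrom(original)); err != nil {
		return err
	}
	// 2. Delete the pods using volumes attached to the old node, so the CSI
	// driver can detach those volumes (podsUsingAttachedVolumes is hypothetical).
	pods, err := podsUsingAttachedVolumes(ctx, c, node.Name)
	if err != nil {
		return err
	}
	for i := range pods {
		if err := c.Delete(ctx, &pods[i]); err != nil && !apierrors.IsNotFound(err) {
			return err
		}
	}
	// 3. Wait for the CSI driver to collect the volumeAttachments (helper
	// from the first sketch).
	if err := waitForVolumeAttachmentsDeletion(ctx, c, node.Name); err != nil {
		return err
	}
	// 4. Only now drain (and eventually delete) the old node (drainNode is hypothetical).
	return drainNode(ctx, c, node)
}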

Member

Sure, but can't we just fix this in the NodeEviction.Run method? Why don't we handle the case over there, instead of having it here? The code is ultimately quite similar; the only difference is the deletion criteria.

Signed-off-by: Mattia Lavacca <lavacca.mattia@gmail.com>
@kubermatic-bot added size/XL and removed size/L labels Mar 9, 2022
mlavacca added 2 commits March 9, 2022 11:17
Signed-off-by: Mattia Lavacca <lavacca.mattia@gmail.com>
Signed-off-by: Mattia Lavacca <lavacca.mattia@gmail.com>
@moadqassem
Member

/retest

@moadqassem
Member

/approve
/lgtm

@kubermatic-bot added the lgtm label Mar 10, 2022
@kubermatic-bot
Contributor

LGTM label has been added.

Git tree hash: 67d0192b6ca5112a0f5afb02d39ab85add9c1667

@kubermatic-bot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: mlavacca, moadqassem

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@kubermatic-bot added the approved label Mar 10, 2022
@kubermatic-bot merged commit e35da15 into kubermatic:master Mar 10, 2022
@moadqassem
Member

/cherry-pick release/v1.36

@kubermatic-bot
Contributor

@moadqassem: #1190 failed to apply on top of branch "release/v1.36":

Applying: Waiting for volumeAttachments deletion
Applying: volumeAttachments check only for vSphere
Applying: ClusterRole updated
Applying: yaml linter fixed
Applying: VolumeAttachments correctly handled
Using index info to reconstruct a base tree...
M	cmd/machine-controller/main.go
M	pkg/controller/machine/machine_controller.go
Falling back to patching base and 3-way merge...
Auto-merging pkg/controller/machine/machine_controller.go
Auto-merging cmd/machine-controller/main.go
CONFLICT (content): Merge conflict in cmd/machine-controller/main.go
error: Failed to merge in the changes.
hint: Use 'git am --show-current-patch=diff' to see the failed patch
Patch failed at 0005 VolumeAttachments correctly handled
When you have resolved this problem, run "git am --continue".
If you prefer to skip this patch, run "git am --skip" instead.
To restore the original branch and stop patching, run "git am --abort".

In response to this:

/cherry-pick release/v1.36

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

moadqassem pushed a commit to moadqassem/machine-controller that referenced this pull request Mar 11, 2022
* Waiting for volumeAttachments deletion

Signed-off-by: Mattia Lavacca <lavacca.mattia@gmail.com>

* volumeAttachments check only for vSphere

Signed-off-by: Mattia Lavacca <lavacca.mattia@gmail.com>

* ClusterRole updated

Signed-off-by: Mattia Lavacca <lavacca.mattia@gmail.com>

* yaml linter fixed

Signed-off-by: Mattia Lavacca <lavacca.mattia@gmail.com>

* VolumeAttachments correctly handled

Signed-off-by: Mattia Lavacca <lavacca.mattia@gmail.com>

* Code factorized

Signed-off-by: Mattia Lavacca <lavacca.mattia@gmail.com>

* renaming

Signed-off-by: Mattia Lavacca <lavacca.mattia@gmail.com>

* fix yamllint

Signed-off-by: Mattia Lavacca <lavacca.mattia@gmail.com>

* Logic applied only to vSphere

Signed-off-by: Mattia Lavacca <lavacca.mattia@gmail.com>
(cherry picked from commit e35da15)
moadqassem added a commit that referenced this pull request Mar 11, 2022
* Waiting for volumeAttachments deletion

* volumeAttachments check only for vSphere

* ClusterRole updated

Signed-off-by: Mattia Lavacca <lavacca.mattia@gmail.com>
(cherry picked from commit e35da15)
moadqassem pushed a commit to kubermatic-bot/machine-controller that referenced this pull request Mar 14, 2022
Signed-off-by: Mattia Lavacca <lavacca.mattia@gmail.com>
(cherry picked from commit e35da15)
moadqassem pushed a commit that referenced this pull request Mar 15, 2022
* Fix wrong CPU config

Signed-off-by: Helene Durand <helene@kubermatic.com>

* fix vSphere tests

Signed-off-by: Moath Qasim <moad.qassem@gmail.com>

* Waiting for volumeAttachments deletion (#1190)

Signed-off-by: Mattia Lavacca <lavacca.mattia@gmail.com>
(cherry picked from commit e35da15)
moadqassem pushed a commit to moadqassem/machine-controller that referenced this pull request Mar 15, 2022
* Waiting for volumeAttachments deletion

Signed-off-by: Mattia Lavacca <lavacca.mattia@gmail.com>

* volumeAttachments check only for vSphere

Signed-off-by: Mattia Lavacca <lavacca.mattia@gmail.com>

* ClusterRole updated

Signed-off-by: Mattia Lavacca <lavacca.mattia@gmail.com>

* yaml linter fixed

Signed-off-by: Mattia Lavacca <lavacca.mattia@gmail.com>

* VolumeAttachments correctly handled

Signed-off-by: Mattia Lavacca <lavacca.mattia@gmail.com>

* Code factorized

Signed-off-by: Mattia Lavacca <lavacca.mattia@gmail.com>

* renaming

Signed-off-by: Mattia Lavacca <lavacca.mattia@gmail.com>

* fix yamllint

Signed-off-by: Mattia Lavacca <lavacca.mattia@gmail.com>

* Logic applied only to vSphere

Signed-off-by: Mattia Lavacca <lavacca.mattia@gmail.com>
(cherry picked from commit e35da15)
@moadqassem
Member

/cherry-pick release/v1.42

@kubermatic-bot
Contributor

@moadqassem: new pull request created: #1212

In response to this:

/cherry-pick release/v1.42

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

LittleFox94 pushed a commit to anexia-it/machine-controller that referenced this pull request Mar 16, 2022
* Waiting for volumeAttachments deletion

Signed-off-by: Mattia Lavacca <lavacca.mattia@gmail.com>

* volumeAttachments check only for vSphere

Signed-off-by: Mattia Lavacca <lavacca.mattia@gmail.com>

* ClusterRole updated

Signed-off-by: Mattia Lavacca <lavacca.mattia@gmail.com>

* yaml linter fixed

Signed-off-by: Mattia Lavacca <lavacca.mattia@gmail.com>

* VolumeAttachments correctly handled

Signed-off-by: Mattia Lavacca <lavacca.mattia@gmail.com>

* Code factorized

Signed-off-by: Mattia Lavacca <lavacca.mattia@gmail.com>

* renaming

Signed-off-by: Mattia Lavacca <lavacca.mattia@gmail.com>

* fix yamllint

Signed-off-by: Mattia Lavacca <lavacca.mattia@gmail.com>

* Logic applied only to vSphere

Signed-off-by: Mattia Lavacca <lavacca.mattia@gmail.com>
@kron4eg
Member

kron4eg commented Apr 20, 2022

/cherrypick release/v1.43

@kubermatic-bot
Contributor

@kron4eg: new pull request created: #1256

In response to this:

/cherrypick release/v1.43

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@kron4eg
Member

kron4eg commented Apr 20, 2022

/cherrypick release/v1.37

@kubermatic-bot
Contributor

@kron4eg: #1190 failed to apply on top of branch "release/v1.37":

Applying: Waiting for volumeAttachments deletion
Applying: volumeAttachments check only for vSphere
Applying: ClusterRole updated
Applying: yaml linter fixed
Applying: VolumeAttachments correctly handled
Using index info to reconstruct a base tree...
M	cmd/machine-controller/main.go
M	pkg/controller/machine/machine_controller.go
Falling back to patching base and 3-way merge...
Auto-merging pkg/controller/machine/machine_controller.go
Auto-merging cmd/machine-controller/main.go
CONFLICT (content): Merge conflict in cmd/machine-controller/main.go
error: Failed to merge in the changes.
hint: Use 'git am --show-current-patch=diff' to see the failed patch
Patch failed at 0005 VolumeAttachments correctly handled
When you have resolved this problem, run "git am --continue".
If you prefer to skip this patch, run "git am --skip" instead.
To restore the original branch and stop patching, run "git am --abort".

In response to this:

/cherrypick release/v1.37

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

kron4eg pushed a commit to kron4eg/machine-controller that referenced this pull request Apr 20, 2022
* Waiting for volumeAttachments deletion

Signed-off-by: Mattia Lavacca <lavacca.mattia@gmail.com>

* volumeAttachments check only for vSphere

Signed-off-by: Mattia Lavacca <lavacca.mattia@gmail.com>

* ClusterRole updated

Signed-off-by: Mattia Lavacca <lavacca.mattia@gmail.com>

* yaml linter fixed

Signed-off-by: Mattia Lavacca <lavacca.mattia@gmail.com>

* VolumeAttachments correctly handled

Signed-off-by: Mattia Lavacca <lavacca.mattia@gmail.com>

* Code factorized

Signed-off-by: Mattia Lavacca <lavacca.mattia@gmail.com>

* renaming

Signed-off-by: Mattia Lavacca <lavacca.mattia@gmail.com>

* fix yamllint

Signed-off-by: Mattia Lavacca <lavacca.mattia@gmail.com>

* Logic applied only to vSphere

Signed-off-by: Mattia Lavacca <lavacca.mattia@gmail.com>
kubermatic-bot pushed a commit that referenced this pull request Apr 21, 2022
* Waiting for volumeAttachments deletion (#1190)

* Waiting for volumeAttachments deletion

Signed-off-by: Mattia Lavacca <lavacca.mattia@gmail.com>

* volumeAttachments check only for vSphere

Signed-off-by: Mattia Lavacca <lavacca.mattia@gmail.com>

* ClusterRole updated

Signed-off-by: Mattia Lavacca <lavacca.mattia@gmail.com>

* yaml linter fixed

Signed-off-by: Mattia Lavacca <lavacca.mattia@gmail.com>

* VolumeAttachments correctly handled

Signed-off-by: Mattia Lavacca <lavacca.mattia@gmail.com>

* Code factorized

Signed-off-by: Mattia Lavacca <lavacca.mattia@gmail.com>

* renaming

Signed-off-by: Mattia Lavacca <lavacca.mattia@gmail.com>

* fix yamllint

Signed-off-by: Mattia Lavacca <lavacca.mattia@gmail.com>

* Logic applied only to vSphere

Signed-off-by: Mattia Lavacca <lavacca.mattia@gmail.com>

* disable vSphere tests (#1172)

Signed-off-by: Moath Qasim <moad.qassem@gmail.com>

* enable vSphere tests (#1180)

* enable vSphere tests

Signed-off-by: Moath Qasim <moad.qassem@gmail.com>

# Conflicts:
#	go.sum

* refactor vSphere datastore cluster

Signed-off-by: Moath Qasim <moad.qassem@gmail.com>

* refactor vSphere tests

Signed-off-by: Moath Qasim <moad.qassem@gmail.com>

* enable vsphere test

Signed-off-by: Moath Qasim <moad.qassem@gmail.com>

* debug vsphere datastore test

Signed-off-by: Moath Qasim <moad.qassem@gmail.com>

* debug vsphere datastore test

Signed-off-by: Moath Qasim <moad.qassem@gmail.com>

Co-authored-by: Mattia Lavacca <lavacca.mattia@gmail.com>
Co-authored-by: Moath Qasim <moad.qassem@gmail.com>