Waiting for volumeAttachments deletion #1190
Conversation
@mlavacca This seems quite strange and I have two questions. In general, I don't think it is the job of the machine controller to remove volumes; that's why we use a CSI driver :-) . If this is indeed an issue in the driver, I would strongly recommend leaving a finalizer on the machine object: during cleanup, check the node name under the machine status, remove the volumeAttachments, and then remove the finalizer from the machine.
Yes, there is an issue in the CSI driver for vSphere, and the aim of this PR is to mitigate it. If the node is deleted before the CSI driver has time to delete the volumeAttachments, they are never cleaned up.
I'm not removing any, BTW. Something didn't work as expected during the e2e tests; I'm going to investigate and debug them.
That's the thing. I mean, we can just simply remove those as part of the cleanup process. In other words, when cleanup is called, check the volumeAttachments and remove them based on the node referenced by the machine. This way you will not block the node draining/termination, and you keep the mitigation local, where it should be, until the VMware folks resolve the issue: kubernetes-sigs/vsphere-csi-driver#359. P.S.: it was created almost 2 years ago, but was active again 3 weeks ago, so 🤞
The problem is that the volumeAttachments are managed by the CSI driver; if you delete them, they are automatically recreated. Furthermore, they have a finalizer.
```go
func (r *Reconciler) deleteNodeForMachine(ctx context.Context, machine *clusterv1alpha1.Machine) (*reconcile.Result, error) {
	// List all the volumeAttachments in the cluster; we must be sure that all
	// of them will be deleted before deleting the node
	volumeAttachments := &storagev1.VolumeAttachmentList{}
```
Can we just abstract this behaviour into a method and only run it if the cloud provider is vSphere?
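The provider gate being suggested could look like this minimal sketch. The `cloudProvider` type, the constant name, and the helper are hypothetical illustrations, not the machine-controller's actual identifiers:

```go
package main

import "fmt"

// cloudProvider is a hypothetical stand-in for the machine-controller's
// provider identifier type.
type cloudProvider string

const providerVSphere cloudProvider = "vsphere"

// shouldWaitForVolumeAttachments gates the volumeAttachment cleanup so that
// it only runs for vSphere, where the CSI driver issue
// (kubernetes-sigs/vsphere-csi-driver#359) makes the mitigation necessary.
func shouldWaitForVolumeAttachments(p cloudProvider) bool {
	return p == providerVSphere
}

func main() {
	fmt.Println(shouldWaitForVolumeAttachments(providerVSphere)) // true
	fmt.Println(shouldWaitForVolumeAttachments("aws"))           // false
}
```

Keeping the check behind a single predicate makes it easy to delete the whole mitigation once the upstream driver bug is fixed.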
```diff
 }

-	return nil, r.deleteNodeForMachine(ctx, machine)
+	return r.deleteNodeForMachine(ctx, machine)
```
Just call this method here when the cloud provider is vSphere and the volumeAttachments are gone.
/retest
/test pull-machine-controller-e2e-gce
@moadqassem I implemented your suggested solution, PTAL |
/retest |
Signed-off-by: Mattia Lavacca <lavacca.mattia@gmail.com>
Force-pushed from e73baa6 to 6393218.
Code has been improved to handle both the node rollout and the node deletion on a single-node cluster. @moadqassem PTAL
```go
	ErrorQueueLen = 1000
)

type NodeVolumeAttachmentsCleanup struct {
```
I thought that we were not going to need this and that we would just change how the node is drained. For example, if there is still a volume attached, don't remove the CSI driver pod; once no volumeAttachment is there, remove the pod.
Yes, and this is an implementation of that behavior. If there are volumeAttachments:
- Cordon the old node
- Delete all pods using volumes attached to the old node
- Wait for the CSI driver to collect the volumeAttachments
- Drain the old node
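The "wait for the CSI driver to collect the volumeAttachments" step boils down to checking that no volumeAttachment still references the old node. Here is a minimal, self-contained sketch of that check; the `volumeAttachment` struct is a trimmed hypothetical stand-in for the real `storagev1.VolumeAttachment` object, which the actual controller lists through the Kubernetes API:

```go
package main

import "fmt"

// volumeAttachment mimics the fields of storagev1.VolumeAttachment that
// matter here (hypothetical type for illustration only).
type volumeAttachment struct {
	Name     string
	NodeName string // spec.nodeName: the node the volume is attached to
}

// attachmentsForNode returns the volumeAttachments still bound to the given
// node. The cleanup loop keeps the node cordoned and only proceeds to drain
// and delete it once this list is empty.
func attachmentsForNode(all []volumeAttachment, node string) []volumeAttachment {
	var out []volumeAttachment
	for _, va := range all {
		if va.NodeName == node {
			out = append(out, va)
		}
	}
	return out
}

func main() {
	attachments := []volumeAttachment{
		{Name: "csi-aaa", NodeName: "old-node"},
		{Name: "csi-bbb", NodeName: "other-node"},
	}
	remaining := attachmentsForNode(attachments, "old-node")
	fmt.Printf("attachments still on old-node: %d\n", len(remaining))
}
```

Because the CSI driver recreates deleted volumeAttachments and protects them with a finalizer, waiting for this list to drain is safer than deleting the objects directly.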
Sure, but can't we just fix this in the NodeEviction.Run method? Why don't we handle the case over there instead of having it here? Eventually the code is quite similar; the only difference is the deletion criteria.
Signed-off-by: Mattia Lavacca <lavacca.mattia@gmail.com>
/retest
moadqassem left a comment:
/approve
/lgtm
LGTM label has been added. Git tree hash: 67d0192b6ca5112a0f5afb02d39ab85add9c1667
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: mlavacca, moadqassem. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files.
Approvers can indicate their approval by writing `/approve` in a comment.
/cherry-pick release/v1.36
@moadqassem: #1190 failed to apply on top of branch "release/v1.36".
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
* Waiting for volumeAttachments deletion
* volumeAttachments check only for vSphere
* ClusterRole updated
* yaml linter fixed
* VolumeAttachments correctly handled
* Code factorized
* renaming
* fix yamllint
* Logic applied only to vSphere

Signed-off-by: Mattia Lavacca <lavacca.mattia@gmail.com> (cherry picked from commit e35da15)
/cherry-pick release/v1.42
@moadqassem: new pull request created: #1212
/cherrypick release/v1.43
@kron4eg: new pull request created: #1256
/cherrypick release/v1.37
@kron4eg: #1190 failed to apply on top of branch "release/v1.37".
* Waiting for volumeAttachments deletion (#1190)
* disable vSphere tests (#1172)
* enable vSphere tests (#1180)
* refactor vSphere datastore cluster
* refactor vSphere tests
* enable vsphere test
* debug vsphere datastore test

Co-authored-by: Mattia Lavacca <lavacca.mattia@gmail.com>
Co-authored-by: Moath Qasim <moad.qassem@gmail.com>
What this PR does / why we need it:
When a machine is deleted, the machine controller now waits for the volumeAttachments deletion before deleting the node.
Which issue(s) this PR fixes (optional, in `fixes #<issue number>` format, will close the issue(s) when PR gets merged):
Fixes #1189
Special notes for your reviewer:
Optional Release Note: