New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
check vm existence even if machineRef is not set #643
Conversation
Signed-off-by: Yassine TIJANI <ytijani@vmware.com>
Overall PR looks good. I was able to test/reproduce the bug as specified in #622 and ran into an interesting case. Deleted the VM directly like so:
Check the
Looks pretty normal. Then I delete the machine object:
But now
Which means the machine ref was unset -- which also means DestroyVM returns a VM with state It seems like this patch fixes the issue because we are able to re-check VM existence but I'm still curious why the first reconcile which successfully unsets MachineRef doesn't also delete the machine's finalizers, maybe a conflict in the Patch request? 🤔 |
/approve |
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: andrewsykim, yastij The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
I also see that taskRef never get unset even though hasInFlightTask eventually unsets it. Wondering if that's related |
What do you mean it isn’t unset? The function you referenced does unset it, and it’s part of the destroy call. |
I encourage you to review the reconcile diagram I posted. The model is based on a reconcile loop. It doesn’t expect the final state to be achieved on the first run through. This is how it’s worked since @sidharthsurana and I first worked out the loop during the first refactor. |
As you said, unsetting the MachineRef happens and then the next time through the loop the finalizer is removed. |
Was referring to taskRef on that one -- the code I linked unsets it but my resources had it set. See my comment #643 (comment), but still note that machineRef is still not set
|
Yup, just pointing out the odd behavior that we are returning nil error in |
Isn’t that because the VM isn’t marked as not found until the call is entered and the moref and task are both unset? https://github.com/yastij/cluster-api-provider-vsphere/blob/4e2cead625a8fa0e146a04d7ce8a03583d96d65c/pkg/cloud/vsphere/services/govmomi/service.go#L129 There’s no reason to think a non nil error means the op is successful or complete. You have the (book, error) pattern a lot. In this case the error means the state couldn't be ascertained. It’s the VM object that dictates when finalizers are removed. |
Plus I think it’s probably best not to infer what the reconcile loop should do based on an internal service. It’s probably best to look at the reconcile loop on the controller.
|
FWIW, I intend to get rid of the requeues entirely and implement a watch on the vCenter task manager using the external resource trigger. That way reconciles are triggered when task events occur for known VMs. |
But we also mark the VM as not found when we initially unset the machineRef (which we know happens) https://github.com/yastij/cluster-api-provider-vsphere/blob/4e2cead625a8fa0e146a04d7ce8a03583d96d65c/pkg/cloud/vsphere/services/govmomi/service.go#L146-L147
Hmm.. maybe I'm missing context -- in this case it seems like machine deletion is dependant on the error by DestroyVM since that determines if we delete the finalizer of |
This sounds great :) |
I think we just disagree what the purpose of an error is. There’s no reason to think the absence of an error means an action was completed successfully. It just means the action was performed successfully. Whether the result is complete or what you want, we don’t know yet. Remember, interactions with vSphere aren’t synchronous. Thus we’re just acting and reacting each time through the reconcile loop based on the current state as we’re able to determine it. |
The error from destroyVM is not the deciding factor whether we delete the finalizer. The state of the VM object determines that. If there is a nil error and VM.state == notfound, we remove finalizer. |
Right, and my point of confusion is that we do set |
Regarding “but we also..” That’s the artifact of the refactor away from a single function. I encourage you to go look at the old CRUD model with “lookupVM” at top of each call. It was more obvious what was occurring. I should probably put that back. |
Aw crap, is this the new code from Yassine? You’re saying that the link you provided still isn’t causing the finalizer to be removed upon return? |
I agree. Upon this return the finalizer should be removed. |
@akutz @andrewsykim - maybe I'm missing something but the finalizer is removed anyway in the machine_controller no ?:
|
That's exactly it @yastij -- before this PR we are already removing the machine ref, which means we also set the vm state to not found but the finalizer is not removed until a 2nd (or possibly 3rd) pass at DestroyVM. See my test case in #643 (comment). This PR seems to fix the problem because we allow more attempts to findVMByInstanceUUID but it sounds like there's a Patch error or something else happening that is the root cause. |
Signed-off-by: Vince Prignano <vincepri@vmware.com>
Signed-off-by: Yassine TIJANI ytijani@vmware.com
What this PR does / why we need it: This fix checks vm existence even if the machineRef is not set, this covers the case where a vm delete happens fast enough through vcenter
Which issue(s) this PR fixes : Fixes #622
Special notes for your reviewer:
/assign @andrewsykim @akutz
Please confirm that if this PR changes any image versions, then that's the sole change this PR makes.
Release note: