check vm existence even if machineRef is not set #643

Merged
merged 1 commit into kubernetes-sigs:master on Oct 25, 2019

Conversation

@yastij
Member

@yastij yastij commented Oct 24, 2019

Signed-off-by: Yassine TIJANI ytijani@vmware.com

What this PR does / why we need it: This fix checks VM existence even if the machineRef is not set. This covers the case where a VM is deleted quickly enough directly through vCenter.
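In short, the change means falling back to a UUID-based lookup when the moref is missing. A minimal sketch of the idea (helper names such as getVMByMoRef, and the context fields, are assumptions based on the discussion below, not the exact upstream code):

// Sketch of the lookup fallback this PR is about; helper names and context
// fields are assumptions, not the exact upstream code.
func (s *Service) getVM(ctx *context.MachineContext) (*infrav1.VirtualMachine, error) {
	// Default to "not found"; the caller uses this state to decide whether
	// the VSphereMachine's finalizer can be removed.
	notFound := &infrav1.VirtualMachine{State: infrav1.VirtualMachineStateNotFound}

	if ref := ctx.VSphereMachine.Spec.MachineRef; ref != "" {
		// Fast path: resolve the VM directly from its managed object reference.
		if vm, err := s.getVMByMoRef(ctx, ref); err == nil {
			return vm, nil
		}
	}

	// Fallback: the moref may already have been unset by an earlier reconcile,
	// or the VM may have been deleted out-of-band through vCenter, so search
	// by instance UUID before concluding the VM is gone.
	vm, err := s.findVMByInstanceUUID(ctx)
	if err != nil {
		return nil, err
	}
	if vm == nil {
		return notFound, nil
	}
	return vm, nil
}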

Which issue(s) this PR fixes: Fixes #622

Special notes for your reviewer:

/assign @andrewsykim @akutz

Please confirm that if this PR changes any image versions, then that's the sole change this PR makes.

Release note:

NONE

Signed-off-by: Yassine TIJANI <ytijani@vmware.com>
@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Oct 24, 2019
@andrewsykim
Member

andrewsykim commented Oct 25, 2019

Overall PR looks good.

I was able to test/reproduce the bug as specified in #622 and ran into an interesting case.

Deleted the VM directly like so:

$ govc vm.destroy target-cluster01-md-0-7d99bd4955-snc8h

Check the vspheremachine resource:

$ kubectl get vspheremachine target-cluster01-md-0-hsjmv -o yaml
apiVersion: infrastructure.cluster.x-k8s.io/v1alpha2
kind: VSphereMachine
metadata:
  creationTimestamp: "2019-10-25T02:36:54Z"
  finalizers:
  - vspheremachine.infrastructure.cluster.x-k8s.io
  generateName: target-cluster01-md-0-
  generation: 3
  name: target-cluster01-md-0-hsjmv
  namespace: default
  ownerReferences:
  - apiVersion: cluster.x-k8s.io/v1alpha2
    kind: Machine
    name: target-cluster01-md-0-7d99bd4955-snc8h
    uid: 59aa2223-19e7-457c-b47e-6ff14b082cc3
  resourceVersion: "355425"
  selfLink: /apis/infrastructure.cluster.x-k8s.io/v1alpha2/namespaces/default/vspheremachines/target-cluster01-md-0-hsjmv
  uid: bd3bb10a-7202-4a2e-bc43-c84f5a78fb1f
spec:
  datacenter: SDDC-Datacenter
  diskGiB: 50
  machineRef: vm-36065
  memoryMiB: 2048
  network:
    devices:
    - dhcp4: true
      dhcp6: false
      networkName: sddc-cgw-network-3
  numCPUs: 2
  providerID: vsphere://4230467f-656c-ebb4-06ef-95852af30a7e
  template: ubuntu-1804-kube-v1.16.2
status:
  addresses:
  - address: 192.168.3.192
    type: InternalIP
  networkStatus:
  - connected: true
    ipAddrs:
    - 192.168.3.192
    macAddr: 00:50:56:b0:7f:60
  ready: true
  taskRef: task-266925

Looks pretty normal.

Then I delete the machine object:

$ kubectl delete machine target-cluster01-md-0-7d99bd4955-snc8h
<hangs, presumably because of the finalizer>

But now the vspheremachine has no machineRef:

$ kubectl get vspheremachine target-cluster01-md-0-hsjmv -o yaml
apiVersion: infrastructure.cluster.x-k8s.io/v1alpha2
kind: VSphereMachine
metadata:
  creationTimestamp: "2019-10-25T02:36:54Z"
  deletionGracePeriodSeconds: 0
  deletionTimestamp: "2019-10-25T02:42:36Z"
  finalizers:
  - vspheremachine.infrastructure.cluster.x-k8s.io
  generateName: target-cluster01-md-0-
  generation: 5
  name: target-cluster01-md-0-hsjmv
  namespace: default
  ownerReferences:
  - apiVersion: cluster.x-k8s.io/v1alpha2
    kind: Machine
    name: target-cluster01-md-0-7d99bd4955-snc8h
    uid: 59aa2223-19e7-457c-b47e-6ff14b082cc3
  resourceVersion: "356239"
  selfLink: /apis/infrastructure.cluster.x-k8s.io/v1alpha2/namespaces/default/vspheremachines/target-cluster01-md-0-hsjmv
  uid: bd3bb10a-7202-4a2e-bc43-c84f5a78fb1f
spec:
  datacenter: SDDC-Datacenter
  diskGiB: 50
  memoryMiB: 2048
  network:
    devices:
    - dhcp4: true
      dhcp6: false
      networkName: sddc-cgw-network-3
  numCPUs: 2
  providerID: vsphere://4230467f-656c-ebb4-06ef-95852af30a7e
  template: ubuntu-1804-kube-v1.16.2
status:
  addresses:
  - address: 192.168.3.192
    type: InternalIP
  networkStatus:
  - connected: true
    ipAddrs:
    - 192.168.3.192
    macAddr: 00:50:56:b0:7f:60
  ready: true
  taskRef: task-266925

Which means the machine ref was unset -- which also means DestroyVM returns a VM with state VirtualMachineStateNotFound and a nil error, which should unblock the machine deletion, but it doesn't. As far as I can tell, this is the only place we unset the machine ref in the delete code path.

It seems like this patch fixes the issue because we are able to re-check VM existence, but I'm still curious why the first reconcile, which successfully unsets MachineRef, doesn't also remove the machine's finalizers. Maybe a conflict in the Patch request? 🤔
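For reference, a rough sketch of the delete-path behavior being described here (the names, types, and requeue interval are assumptions, not the exact controller code):

// Rough sketch of the delete path under discussion; names, types, and the
// requeue interval are assumptions, not the exact controller code.
func reconcileDelete(ctx *context.MachineContext, vmService VMService) (reconcile.Result, error) {
	vm, err := vmService.DestroyVM(ctx)
	if err != nil {
		return reconcile.Result{}, errors.Wrap(err, "failed to destroy VM")
	}
	if vm.State != infrav1.VirtualMachineStateNotFound {
		// The VM still exists (or its state could not be confirmed yet):
		// requeue and try again on the next reconcile.
		return reconcile.Result{RequeueAfter: 10 * time.Second}, nil
	}
	// The VM is gone: remove the finalizer so the VSphereMachine (and, via the
	// owner reference, the Machine) can finish deleting.
	// removeString is an assumed helper that drops one entry from a string slice.
	ctx.VSphereMachine.Finalizers = removeString(
		ctx.VSphereMachine.Finalizers, infrav1.MachineFinalizer)
	return reconcile.Result{}, nil
}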

@andrewsykim
Member

/approve
/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Oct 25, 2019
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: andrewsykim, yastij

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Oct 25, 2019
@k8s-ci-robot k8s-ci-robot merged commit 98a597b into kubernetes-sigs:master Oct 25, 2019
@andrewsykim
Member

I also see that taskRef never gets unset, even though hasInFlightTask should eventually unset it. Wondering if that's related.
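For context, a sketch of the kind of in-flight-task check being referenced (the function name mirrors the discussion, but the body and helpers are assumptions):

// Sketch of the in-flight-task handling being referenced; the body and
// helpers are assumptions, not the provider's actual code.
func reconcileInFlightTask(ctx *context.MachineContext) (stillRunning bool, err error) {
	if ctx.VSphereMachine.Status.TaskRef == "" {
		return false, nil // nothing in flight
	}
	// getTask is an assumed helper that fetches the vCenter task by its moref.
	task, err := getTask(ctx, ctx.VSphereMachine.Status.TaskRef)
	if err != nil {
		return false, err
	}
	switch task.Info.State {
	case types.TaskInfoStateSuccess, types.TaskInfoStateError:
		// The task has finished, so the reference should be cleared here;
		// the observation above is that Status.TaskRef stayed populated.
		ctx.VSphereMachine.Status.TaskRef = ""
		return false, nil
	default:
		return true, nil // still queued or running: wait for the next reconcile
	}
}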

@akutz
Contributor

akutz commented Oct 25, 2019

What do you mean it isn’t unset? The function you referenced does unset it, and it’s part of the destroy call.

@akutz
Contributor

akutz commented Oct 25, 2019

I encourage you to review the reconcile diagram I posted. The model is based on a reconcile loop. It doesn’t expect the final state to be achieved on the first run through. This is how it’s worked since @sidharthsurana and I first worked out the loop during the first refactor.

@akutz
Contributor

akutz commented Oct 25, 2019

As you said, unsetting the MachineRef happens and then the next time through the loop the finalizer is removed.

@andrewsykim
Member

andrewsykim commented Oct 25, 2019

What do you mean it isn’t unset? The function you referenced does unset it, and it’s part of the destroy call.

I was referring to taskRef on that one -- the code I linked unsets it, but my resources still had it set. See my comment #643 (comment); note also that machineRef is still not set:

$ kubectl get vspheremachine target-cluster01-md-0-hsjmv -o yaml
apiVersion: infrastructure.cluster.x-k8s.io/v1alpha2
kind: VSphereMachine
metadata:
  creationTimestamp: "2019-10-25T02:36:54Z"
  deletionGracePeriodSeconds: 0
  deletionTimestamp: "2019-10-25T02:42:36Z"
  finalizers:
  - vspheremachine.infrastructure.cluster.x-k8s.io
  generateName: target-cluster01-md-0-
  generation: 5
  name: target-cluster01-md-0-hsjmv
  namespace: default
  ownerReferences:
  - apiVersion: cluster.x-k8s.io/v1alpha2
    kind: Machine
    name: target-cluster01-md-0-7d99bd4955-snc8h
    uid: 59aa2223-19e7-457c-b47e-6ff14b082cc3
  resourceVersion: "356239"
  selfLink: /apis/infrastructure.cluster.x-k8s.io/v1alpha2/namespaces/default/vspheremachines/target-cluster01-md-0-hsjmv
  uid: bd3bb10a-7202-4a2e-bc43-c84f5a78fb1f
spec:
  datacenter: SDDC-Datacenter
  diskGiB: 50
  memoryMiB: 2048
  network:
    devices:
    - dhcp4: true
      dhcp6: false
      networkName: sddc-cgw-network-3
  numCPUs: 2
  providerID: vsphere://4230467f-656c-ebb4-06ef-95852af30a7e
  template: ubuntu-1804-kube-v1.16.2
status:
  addresses:
  - address: 192.168.3.192
    type: InternalIP
  networkStatus:
  - connected: true
    ipAddrs:
    - 192.168.3.192
    macAddr: 00:50:56:b0:7f:60
  ready: true
  taskRef: task-266925  # TASK REF IS STILL HERE

@andrewsykim
Member

andrewsykim commented Oct 25, 2019

As you said, unsetting the MachineRef happens and then the next time through the loop the finalizer is removed.

Yup, just pointing out the odd behavior: we return a nil error from DestroyVM once, but that isn't removing the machine finalizers. That's fine given the reconcile model, but it still seems wrong?

@akutz
Contributor

akutz commented Oct 25, 2019

Isn’t that because the VM isn’t marked as not found until the call is entered and the moref and task are both unset? https://github.com/yastij/cluster-api-provider-vsphere/blob/4e2cead625a8fa0e146a04d7ce8a03583d96d65c/pkg/cloud/vsphere/services/govmomi/service.go#L129

There’s no reason to think a nil error means the op is successful or complete. You have the (bool, error) pattern a lot. In this case the error means the state couldn't be ascertained. It’s the VM object that dictates when finalizers are removed.

@akutz
Contributor

akutz commented Oct 25, 2019

Plus I think it’s probably best not to infer what the reconcile loop should do based on an internal service. It’s probably best to look at the reconcile loop on the controller.

// Requeue the operation until the VM is "notfound".

@akutz
Contributor

akutz commented Oct 25, 2019

FWIW, I intend to get rid of the requeues entirely and implement a watch on the vCenter task manager using the external resource trigger. That way reconciles are triggered when task events occur for known VMs.

@andrewsykim
Member

andrewsykim commented Oct 25, 2019

Isn’t that because the VM isn’t marked as not found until the call is entered and the moref and task are both unset?

But we also mark the VM as not found when we initially unset the machineRef (which we know happens) https://github.com/yastij/cluster-api-provider-vsphere/blob/4e2cead625a8fa0e146a04d7ce8a03583d96d65c/pkg/cloud/vsphere/services/govmomi/service.go#L146-L147

There’s no reason to think a nil error means the op is successful or complete. You have the (bool, error) pattern a lot. In this case the error means the state couldn't be ascertained. It’s the VM object that dictates when finalizers are removed.

Hmm.. maybe I'm missing context -- in this case it seems like machine deletion is dependent on the error returned by DestroyVM, since that determines whether we delete the finalizer of the VSphereMachine -- which subsequently blocks deletion of the Machine because of the owner ref.

@andrewsykim
Member

FWIW, I intend to get rid of the requeues entirely and implement a watch on the vCenter task manager using the external resource trigger. That way reconciles are triggered when task events occur for known VMs.

This sounds great :)

@akutz
Contributor

akutz commented Oct 25, 2019

I think we just disagree on what the purpose of an error is. There’s no reason to think the absence of an error means an action was completed successfully. It just means the action was performed successfully. Whether the result is complete or is what you want, we don’t know yet. Remember, interactions with vSphere aren’t synchronous. Thus we’re just acting and reacting each time through the reconcile loop based on the current state as we’re able to determine it.

@akutz
Contributor

akutz commented Oct 25, 2019

The error from destroyVM is not the deciding factor in whether we delete the finalizer. The state of the VM object determines that. If there is a nil error and VM.state == notfound, we remove the finalizer.

@andrewsykim
Member

andrewsykim commented Oct 25, 2019

The error from destroyVM is not the deciding factor in whether we delete the finalizer. The state of the VM object determines that. If there is a nil error and VM.state == notfound, we remove the finalizer.

Right, and my point of confusion is that we do set vm.State = infrav1.VirtualMachineStateNotFound in the only place we unset MachineRef (which we know happens) AND we return a nil error -- so why isn't the machine deleted at that point?

@akutz
Contributor

akutz commented Oct 25, 2019

Regarding “but we also..”

That’s an artifact of the refactor away from a single function. I encourage you to go look at the old CRUD model with “lookupVM” at the top of each call. It was more obvious what was occurring. I should probably put that back.

@akutz
Contributor

akutz commented Oct 25, 2019

Aw crap, is this the new code from Yassine? You’re saying that the link you provided still isn’t causing the finalizer to be removed upon return?

@yastij
Member Author

yastij commented Oct 25, 2019

@akutz @andrewsykim - maybe I'm missing something, but the finalizer is removed anyway in the machine_controller, no?:

@andrewsykim
Member

andrewsykim commented Oct 25, 2019

maybe I'm missing something but the finalizer is removed anyway in the machine_controller

then in the machine_controller we should go through this https://github.com/yastij/cluster-api-provider-vsphere/blob/4e2cead625a8fa0e146a04d7ce8a03583d96d65c/controllers/vspheremachine_controller.go#L180 since the vm state is not found

That's exactly it @yastij -- before this PR we are already removing the machine ref, which means we also set the vm state to not found, but the finalizer is not removed until a 2nd (or possibly 3rd) pass at DestroyVM. See my test case in #643 (comment). This PR seems to fix the problem because we allow more attempts at findVMByInstanceUUID, but it sounds like there's a Patch error or something else happening that is the root cause.
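For anyone following along, a UUID-based lookup along these lines can be done with govmomi's SearchIndex. This is an illustrative sketch only, not the provider's actual findVMByInstanceUUID implementation:

// Sketch of a findVMByInstanceUUID-style lookup using govmomi's SearchIndex.
import (
	"context"

	"github.com/vmware/govmomi/object"
	"github.com/vmware/govmomi/vim25"
	"github.com/vmware/govmomi/vim25/types"
)

func findVMByInstanceUUID(ctx context.Context, c *vim25.Client, dc *object.Datacenter, instanceUUID string) (*object.VirtualMachine, error) {
	si := object.NewSearchIndex(c)
	// vmSearch=true restricts the search to VMs; the last argument selects
	// the vSphere instance UUID rather than the BIOS UUID.
	ref, err := si.FindByUuid(ctx, dc, instanceUUID, true, types.NewBool(true))
	if err != nil {
		return nil, err
	}
	if ref == nil {
		// No VM with this UUID: the caller can report VirtualMachineStateNotFound.
		return nil, nil
	}
	return object.NewVirtualMachine(c, ref.Reference()), nil
}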

jayunit100 pushed a commit to jayunit100/cluster-api-provider-vsphere that referenced this pull request Feb 26, 2020
Signed-off-by: Vince Prignano <vincepri@vmware.com>
Successfully merging this pull request may close these issues.

If you delete a VM and then delete the cluster, the cluster won't delete