check vm existence even if machineRef is not set #643

Merged
merged 1 commit into kubernetes-sigs:master on Oct 25, 2019

Conversation

@yastij
Member

@yastij yastij commented Oct 24, 2019

Signed-off-by: Yassine TIJANI ytijani@vmware.com

What this PR does / why we need it: This fix checks VM existence even if the machineRef is not set. This covers the case where a VM is deleted quickly enough directly through vCenter.
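In short, the change means falling back to a UUID-based lookup when the moref is missing. A minimal sketch of the idea (helper names such as getVMByMoRef, and the context fields, are assumptions based on the discussion below, not the exact upstream code):

// Sketch of the lookup fallback this PR is about; helper names and context
// fields are assumptions, not the exact upstream code.
func (s *Service) getVM(ctx *context.MachineContext) (*infrav1.VirtualMachine, error) {
	// Default to "not found"; the caller uses this state to decide whether
	// the VSphereMachine's finalizer can be removed.
	notFound := &infrav1.VirtualMachine{State: infrav1.VirtualMachineStateNotFound}

	if ref := ctx.VSphereMachine.Spec.MachineRef; ref != "" {
		// Fast path: resolve the VM directly from its managed object reference.
		if vm, err := s.getVMByMoRef(ctx, ref); err == nil {
			return vm, nil
		}
	}

	// Fallback: the moref may already have been unset by an earlier reconcile,
	// or the VM may have been deleted out-of-band through vCenter, so search
	// by instance UUID before concluding the VM is gone.
	vm, err := s.findVMByInstanceUUID(ctx)
	if err != nil {
		return nil, err
	}
	if vm == nil {
		return notFound, nil
	}
	return vm, nil
}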

Which issue(s) this PR fixes: Fixes #622

Special notes for your reviewer:

/assign @andrewsykim @akutz

Please confirm that if this PR changes any image versions, then that's the sole change this PR makes.

Release note:

NONE

Signed-off-by: Yassine TIJANI <ytijani@vmware.com>
@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Oct 24, 2019
@andrewsykim
Member

andrewsykim commented Oct 25, 2019

Overall PR looks good.

I was able to test/reproduce the bug as specified in #622 and ran into an interesting case.

Deleted the VM directly like so:

$ govc vm.destroy target-cluster01-md-0-7d99bd4955-snc8h

Check the vspheremachine resource:

$ kubectl get vspheremachine target-cluster01-md-0-hsjmv -o yaml
apiVersion: infrastructure.cluster.x-k8s.io/v1alpha2
kind: VSphereMachine
metadata:
  creationTimestamp: "2019-10-25T02:36:54Z"
  finalizers:
  - vspheremachine.infrastructure.cluster.x-k8s.io
  generateName: target-cluster01-md-0-
  generation: 3
  name: target-cluster01-md-0-hsjmv
  namespace: default
  ownerReferences:
  - apiVersion: cluster.x-k8s.io/v1alpha2
    kind: Machine
    name: target-cluster01-md-0-7d99bd4955-snc8h
    uid: 59aa2223-19e7-457c-b47e-6ff14b082cc3
  resourceVersion: "355425"
  selfLink: /apis/infrastructure.cluster.x-k8s.io/v1alpha2/namespaces/default/vspheremachines/target-cluster01-md-0-hsjmv
  uid: bd3bb10a-7202-4a2e-bc43-c84f5a78fb1f
spec:
  datacenter: SDDC-Datacenter
  diskGiB: 50
  machineRef: vm-36065
  memoryMiB: 2048
  network:
    devices:
    - dhcp4: true
      dhcp6: false
      networkName: sddc-cgw-network-3
  numCPUs: 2
  providerID: vsphere://4230467f-656c-ebb4-06ef-95852af30a7e
  template: ubuntu-1804-kube-v1.16.2
status:
  addresses:
  - address: 192.168.3.192
    type: InternalIP
  networkStatus:
  - connected: true
    ipAddrs:
    - 192.168.3.192
    macAddr: 00:50:56:b0:7f:60
  ready: true
  taskRef: task-266925

Looks pretty normal.

Then I delete the machine object:

$ kubectl delete machine target-cluster01-md-0-7d99bd4955-snc8h
<hangs, presumably because of the finalizer>

But now the vspheremachine has no machineRef:

$ kubectl get vspheremachine target-cluster01-md-0-hsjmv -o yaml
apiVersion: infrastructure.cluster.x-k8s.io/v1alpha2
kind: VSphereMachine
metadata:
  creationTimestamp: "2019-10-25T02:36:54Z"
  deletionGracePeriodSeconds: 0
  deletionTimestamp: "2019-10-25T02:42:36Z"
  finalizers:
  - vspheremachine.infrastructure.cluster.x-k8s.io
  generateName: target-cluster01-md-0-
  generation: 5
  name: target-cluster01-md-0-hsjmv
  namespace: default
  ownerReferences:
  - apiVersion: cluster.x-k8s.io/v1alpha2
    kind: Machine
    name: target-cluster01-md-0-7d99bd4955-snc8h
    uid: 59aa2223-19e7-457c-b47e-6ff14b082cc3
  resourceVersion: "356239"
  selfLink: /apis/infrastructure.cluster.x-k8s.io/v1alpha2/namespaces/default/vspheremachines/target-cluster01-md-0-hsjmv
  uid: bd3bb10a-7202-4a2e-bc43-c84f5a78fb1f
spec:
  datacenter: SDDC-Datacenter
  diskGiB: 50
  memoryMiB: 2048
  network:
    devices:
    - dhcp4: true
      dhcp6: false
      networkName: sddc-cgw-network-3
  numCPUs: 2
  providerID: vsphere://4230467f-656c-ebb4-06ef-95852af30a7e
  template: ubuntu-1804-kube-v1.16.2
status:
  addresses:
  - address: 192.168.3.192
    type: InternalIP
  networkStatus:
  - connected: true
    ipAddrs:
    - 192.168.3.192
    macAddr: 00:50:56:b0:7f:60
  ready: true
  taskRef: task-266925

Which means the machine ref was unset -- which also means DestroyVM returns a VM with state VirtualMachineStateNotFound and a nil error, which should unblock the machine deletion, but it doesn't. As far as I can tell, this is the only place we unset the machine ref in the delete code path.

It seems like this patch fixes the issue because we are able to re-check VM existence, but I'm still curious why the first reconcile, which successfully unsets MachineRef, doesn't also remove the machine's finalizers. Maybe a conflict in the Patch request? 🤔
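For reference, a rough sketch of the delete-path behavior being described here (the names, types, and requeue interval are assumptions, not the exact controller code):

// Rough sketch of the delete path under discussion; names, types, and the
// requeue interval are assumptions, not the exact controller code.
func reconcileDelete(ctx *context.MachineContext, vmService VMService) (reconcile.Result, error) {
	vm, err := vmService.DestroyVM(ctx)
	if err != nil {
		return reconcile.Result{}, errors.Wrap(err, "failed to destroy VM")
	}
	if vm.State != infrav1.VirtualMachineStateNotFound {
		// The VM still exists (or its state could not be confirmed yet):
		// requeue and try again on the next reconcile.
		return reconcile.Result{RequeueAfter: 10 * time.Second}, nil
	}
	// The VM is gone: remove the finalizer so the VSphereMachine (and, via the
	// owner reference, the Machine) can finish deleting.
	// removeString is an assumed helper that drops one entry from a string slice.
	ctx.VSphereMachine.Finalizers = removeString(
		ctx.VSphereMachine.Finalizers, infrav1.MachineFinalizer)
	return reconcile.Result{}, nil
}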

@andrewsykim
Member

/approve
/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Oct 25, 2019
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: andrewsykim, yastij

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Oct 25, 2019
@k8s-ci-robot k8s-ci-robot merged commit 98a597b into kubernetes-sigs:master Oct 25, 2019
@andrewsykim
Member

I also see that taskRef never gets unset, even though hasInFlightTask should eventually unset it. Wondering if that's related.
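For context, a sketch of the kind of in-flight-task check being referenced (the function name mirrors the discussion, but the body and helpers are assumptions):

// Sketch of the in-flight-task handling being referenced; the body and
// helpers are assumptions, not the provider's actual code.
func reconcileInFlightTask(ctx *context.MachineContext) (stillRunning bool, err error) {
	if ctx.VSphereMachine.Status.TaskRef == "" {
		return false, nil // nothing in flight
	}
	// getTask is an assumed helper that fetches the vCenter task by its moref.
	task, err := getTask(ctx, ctx.VSphereMachine.Status.TaskRef)
	if err != nil {
		return false, err
	}
	switch task.Info.State {
	case types.TaskInfoStateSuccess, types.TaskInfoStateError:
		// The task has finished, so the reference should be cleared here;
		// the observation above is that Status.TaskRef stayed populated.
		ctx.VSphereMachine.Status.TaskRef = ""
		return false, nil
	default:
		return true, nil // still queued or running: wait for the next reconcile
	}
}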

@akutz
Contributor

akutz commented Oct 25, 2019

What do you mean it isn’t unset? The function you referenced does unset it, and it’s part of the destroy call.

@akutz
Contributor

akutz commented Oct 25, 2019

I encourage you to review the reconcile diagram I posted. The model is based on a reconcile loop. It doesn’t expect the final state to be achieved on the first run through. This is how it’s worked since @sidharthsurana and I first worked out the loop during the first refactor.

@akutz
Contributor

akutz commented Oct 25, 2019

As you said, unsetting the MachineRef happens and then the next time through the loop the finalizer is removed.

@andrewsykim
Member

andrewsykim commented Oct 25, 2019

What do you mean it isn’t unset? The function you referenced does unset it, and it’s part of the destroy call.

I was referring to taskRef on that one -- the code I linked unsets it, but my resources still had it set. See my comment #643 (comment); note also that machineRef is still not set:

$ kubectl get vspheremachine target-cluster01-md-0-hsjmv -o yaml
apiVersion: infrastructure.cluster.x-k8s.io/v1alpha2
kind: VSphereMachine
metadata:
  creationTimestamp: "2019-10-25T02:36:54Z"
  deletionGracePeriodSeconds: 0
  deletionTimestamp: "2019-10-25T02:42:36Z"
  finalizers:
  - vspheremachine.infrastructure.cluster.x-k8s.io
  generateName: target-cluster01-md-0-
  generation: 5
  name: target-cluster01-md-0-hsjmv
  namespace: default
  ownerReferences:
  - apiVersion: cluster.x-k8s.io/v1alpha2
    kind: Machine
    name: target-cluster01-md-0-7d99bd4955-snc8h
    uid: 59aa2223-19e7-457c-b47e-6ff14b082cc3
  resourceVersion: "356239"
  selfLink: /apis/infrastructure.cluster.x-k8s.io/v1alpha2/namespaces/default/vspheremachines/target-cluster01-md-0-hsjmv
  uid: bd3bb10a-7202-4a2e-bc43-c84f5a78fb1f
spec:
  datacenter: SDDC-Datacenter
  diskGiB: 50
  memoryMiB: 2048
  network:
    devices:
    - dhcp4: true
      dhcp6: false
      networkName: sddc-cgw-network-3
  numCPUs: 2
  providerID: vsphere://4230467f-656c-ebb4-06ef-95852af30a7e
  template: ubuntu-1804-kube-v1.16.2
status:
  addresses:
  - address: 192.168.3.192
    type: InternalIP
  networkStatus:
  - connected: true
    ipAddrs:
    - 192.168.3.192
    macAddr: 00:50:56:b0:7f:60
  ready: true
  taskRef: task-266925  # TASK REF IS STILL HERE

@andrewsykim
Member

andrewsykim commented Oct 25, 2019

As you said, unsetting the MachineRef happens and then the next time through the loop the finalizer is removed.

Yup, just pointing out the odd behavior: we return a nil error from DestroyVM once, but that isn't removing the machine finalizers. That's fine given the reconcile model, but it still seems wrong?

@akutz
Contributor

akutz commented Oct 25, 2019

Isn’t that because the VM isn’t marked as not found until the call is entered and the moref and task are both unset? https://github.com/yastij/cluster-api-provider-vsphere/blob/4e2cead625a8fa0e146a04d7ce8a03583d96d65c/pkg/cloud/vsphere/services/govmomi/service.go#L129

There’s no reason to think a nil error means the op is successful or complete. You have the (bool, error) pattern a lot. In this case the error means the state couldn't be ascertained. It’s the VM object that dictates when finalizers are removed.

@akutz
Contributor

akutz commented Oct 25, 2019

Plus I think it’s probably best not to infer what the reconcile loop should do based on an internal service. It’s probably best to look at the reconcile loop on the controller.

// Requeue the operation until the VM is "notfound".

@akutz
Contributor

akutz commented Oct 25, 2019

FWIW, I intend to get rid of the requeues entirely and implement a watch on the vCenter task manager using the external resource trigger. That way reconciles are triggered when task events occur for known VMs.

@andrewsykim
Member

andrewsykim commented Oct 25, 2019

Isn’t that because the VM isn’t marked as not found until the call is entered and the moref and task are both unset?

But we also mark the VM as not found when we initially unset the machineRef (which we know happens) https://github.com/yastij/cluster-api-provider-vsphere/blob/4e2cead625a8fa0e146a04d7ce8a03583d96d65c/pkg/cloud/vsphere/services/govmomi/service.go#L146-L147

There’s no reason to think a nil error means the op is successful or complete. You have the (bool, error) pattern a lot. In this case the error means the state couldn't be ascertained. It’s the VM object that dictates when finalizers are removed.

Hmm.. maybe I'm missing context -- in this case it seems like machine deletion is dependent on the error returned by DestroyVM, since that determines whether we delete the finalizer of the VSphereMachine -- which subsequently blocks deletion of the Machine because of the owner ref.

@andrewsykim
Member

FWIW, I intend to get rid of the requeues entirely and implement a watch on the vCenter task manager using the external resource trigger. That way reconciles are triggered when task events occur for known VMs.

This sounds great :)

@akutz
Contributor

akutz commented Oct 25, 2019

I think we just disagree on what the purpose of an error is. There’s no reason to think the absence of an error means an action was completed successfully. It just means the action was performed successfully. Whether the result is complete or is what you want, we don’t know yet. Remember, interactions with vSphere aren’t synchronous. Thus we’re just acting and reacting each time through the reconcile loop based on the current state as we’re able to determine it.

@akutz
Contributor

akutz commented Oct 25, 2019

The error from destroyVM is not the deciding factor in whether we delete the finalizer. The state of the VM object determines that. If there is a nil error and VM.state == notfound, we remove the finalizer.

@andrewsykim
Member

andrewsykim commented Oct 25, 2019

The error from destroyVM is not the deciding factor in whether we delete the finalizer. The state of the VM object determines that. If there is a nil error and VM.state == notfound, we remove the finalizer.

Right, and my point of confusion is that we do set vm.State = infrav1.VirtualMachineStateNotFound in the only place we unset MachineRef (which we know happens) AND we return a nil error -- so why isn't the machine deleted at that point?

@akutz
Contributor

akutz commented Oct 25, 2019

Regarding “but we also..”

That’s an artifact of the refactor away from a single function. I encourage you to go look at the old CRUD model with “lookupVM” at the top of each call. It was more obvious what was occurring. I should probably put that back.

@akutz
Contributor

akutz commented Oct 25, 2019

Aw crap, is this the new code from Yassine? You’re saying that the link you provided still isn’t causing the finalizer to be removed upon return?

@yastij
Member Author

yastij commented Oct 25, 2019

@akutz @andrewsykim - maybe I'm missing something, but the finalizer is removed anyway in the machine_controller, no?:

@andrewsykim
Member

andrewsykim commented Oct 25, 2019

maybe I'm missing something but the finalizer is removed anyway in the machine_controller

then in the machine_controller we should go through this https://github.com/yastij/cluster-api-provider-vsphere/blob/4e2cead625a8fa0e146a04d7ce8a03583d96d65c/controllers/vspheremachine_controller.go#L180 since the vm state is not found

That's exactly it @yastij -- before this PR we are already removing the machine ref, which means we also set the vm state to not found, but the finalizer is not removed until a 2nd (or possibly 3rd) pass at DestroyVM. See my test case in #643 (comment). This PR seems to fix the problem because we allow more attempts at findVMByInstanceUUID, but it sounds like there's a Patch error or something else happening that is the root cause.
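For anyone following along, a UUID-based lookup along these lines can be done with govmomi's SearchIndex. This is an illustrative sketch only, not the provider's actual findVMByInstanceUUID implementation:

// Sketch of a findVMByInstanceUUID-style lookup using govmomi's SearchIndex.
import (
	"context"

	"github.com/vmware/govmomi/object"
	"github.com/vmware/govmomi/vim25"
	"github.com/vmware/govmomi/vim25/types"
)

func findVMByInstanceUUID(ctx context.Context, c *vim25.Client, dc *object.Datacenter, instanceUUID string) (*object.VirtualMachine, error) {
	si := object.NewSearchIndex(c)
	// vmSearch=true restricts the search to VMs; the last argument selects
	// the vSphere instance UUID rather than the BIOS UUID.
	ref, err := si.FindByUuid(ctx, dc, instanceUUID, true, types.NewBool(true))
	if err != nil {
		return nil, err
	}
	if ref == nil {
		// No VM with this UUID: the caller can report VirtualMachineStateNotFound.
		return nil, nil
	}
	return object.NewVirtualMachine(c, ref.Reference()), nil
}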

jayunit100 pushed a commit to jayunit100/cluster-api-provider-vsphere that referenced this pull request Feb 26, 2020
Signed-off-by: Vince Prignano <vincepri@vmware.com>
Successfully merging this pull request may close these issues.

If you delete a VM and then delete the cluster, the cluster won't delete