
Conversation

michaelgugino
Contributor

After patching the machine object following the initial
create, the subsequent reconcile may have a stale machine
object. Additionally, any transient k8s API errors might
result in the patch operation failing and there is no
way to recover that operation. Either of these
scenarios will result in a double creation event
and a leaked instance.

This commit ensures we persist the taskID from the
vSphere API in these situations and preserve
it to ensure we do not double create.
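The guard described in this commit message can be sketched as a small in-memory map keyed by machine name: when the cache holds a task ID that the machine's provider status does not carry, the reconcile is working on a stale object and should requeue rather than create again. A minimal sketch (names are illustrative, not the PR's exact code):

```go
package main

import "fmt"

// actuator keeps a local cache of vSphere task IDs so that a later
// reconcile with a stale machine object (e.g. the status patch failed
// or the informer cache lags) can be detected before a second create.
type actuator struct {
	taskIDCache map[string]string // machine name -> task ID
}

// shouldRequeue reports whether the reconcile is stale: we recorded a
// task ID locally, but the machine's provider status does not match it.
func (a *actuator) shouldRequeue(machineName, statusTaskRef string) bool {
	cached, ok := a.taskIDCache[machineName]
	return ok && cached != statusTaskRef
}

func main() {
	a := &actuator{taskIDCache: map[string]string{"machine-a": "task-123"}}

	fmt.Println(a.shouldRequeue("machine-a", ""))         // true: stale status, requeue
	fmt.Println(a.shouldRequeue("machine-a", "task-123")) // false: status is current, proceed
	fmt.Println(a.shouldRequeue("machine-b", ""))         // false: never created, proceed
}
```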

@michaelgugino michaelgugino changed the title vSphere: preserve taskID in local cache Bug 1880110: vSphere: preserve taskID in local cache Sep 28, 2020
@openshift-ci-robot openshift-ci-robot added the bugzilla/severity-urgent Referenced Bugzilla bug's severity is urgent for the branch this PR is targeting. label Sep 28, 2020
@openshift-ci-robot
Contributor

@michaelgugino: This pull request references Bugzilla bug 1880110, which is valid. The bug has been moved to the POST state. The bug has been updated to refer to the pull request using the external bug tracker.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target release (4.6.0) matches configured target release for branch (4.6.0)
  • bug is in the state ASSIGNED, which is one of the valid states (NEW, ASSIGNED, ON_DEV, POST)

In response to this:

Bug 1880110: vSphere: preserve taskID in local cache

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci-robot openshift-ci-robot added the bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. label Sep 28, 2020
Comment on lines 70 to 75
if val, ok := a.taskIDCache[machine.Name]; ok {
if val != r.providerStatus.TaskRef {
klog.Errorf("%s: machine object missing expected provider task ID, requeue", machine.GetName())
return &machinecontroller.RequeueAfterError{RequeueAfter: requeueAfterSeconds * time.Second}
}
}
Contributor


If the patch fails to update the machine (taskRef will be empty), will this not result in an infinite loop? Constant re-queueing and never getting anywhere?

What happens if the task ref is different to that in the cache? Is that a possibility? Could we not patch the machine with the TaskRef from this cache?

Contributor Author


No, we don't actually need to know the taskID for the subsequent Exists() calls to work on this provider. In most cases, the actuator never goes down the Create() path twice; in the cases where we do double create, the instance clone happens quickly enough to be discoverable by the machine's name.
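For context, Exists() on this provider looks the instance up by the machine's name, which is what makes the create path idempotent in the common case. A minimal sketch of that guard against a fake provider (all names are illustrative, not the actual actuator code):

```go
package main

import "fmt"

// fakeProvider stands in for the vSphere API: instances are discoverable
// by the machine's name once the clone is visible.
type fakeProvider struct {
	instances map[string]bool
}

func (p *fakeProvider) Exists(name string) bool { return p.instances[name] }

// Create refuses to clone a second instance when one with the same
// name is already discoverable.
func (p *fakeProvider) Create(name string) error {
	if p.Exists(name) {
		return fmt.Errorf("instance %q already exists", name)
	}
	p.instances[name] = true
	return nil
}

func main() {
	p := &fakeProvider{instances: map[string]bool{}}
	fmt.Println(p.Create("machine-a") == nil) // true: first create succeeds
	fmt.Println(p.Create("machine-a") != nil) // true: second create is refused
}
```

The taskID cache covers the window where the clone is not yet discoverable by name, which is exactly when Exists() alone cannot prevent the double create.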

Contributor


Ack yep, just double checked the code and can see that is the case

It does make me wonder how this double create can happen when we are setting the VM name and UUID to be predictable from the Machine. Surely as soon as a VM is created, we can find that VM by instance UUID?

Comment on lines 87 to 93
if err := newReconciler(scope).create(); err != nil {
// save the taskRef in our cache in case of any error with patch.
if scope.providerStatus.TaskRef != "" {
a.taskIDCache[machine.Name] = scope.providerStatus.TaskRef
}
if err := scope.PatchMachine(); err != nil {
return err
}
fmtErr := fmt.Errorf(reconcilerFailFmt, machine.GetName(), createEventAction, err)
return a.handleMachineError(machine, fmtErr, createEventAction)
}
// save the taskRef in our cache in case of any error with patch.
if scope.providerStatus.TaskRef != "" {
a.taskIDCache[machine.Name] = scope.providerStatus.TaskRef
}
Contributor


Would it be better to break this up a bit and not duplicate the caching of the task ref?

Suggested change
if err := newReconciler(scope).create(); err != nil {
// save the taskRef in our cache in case of any error with patch.
if scope.providerStatus.TaskRef != "" {
a.taskIDCache[machine.Name] = scope.providerStatus.TaskRef
}
if err := scope.PatchMachine(); err != nil {
return err
}
fmtErr := fmt.Errorf(reconcilerFailFmt, machine.GetName(), createEventAction, err)
return a.handleMachineError(machine, fmtErr, createEventAction)
}
// save the taskRef in our cache in case of any error with patch.
if scope.providerStatus.TaskRef != "" {
a.taskIDCache[machine.Name] = scope.providerStatus.TaskRef
}
err := newReconciler(scope).create()
// save the taskRef in our cache before checking the error in case of any errors later.
if scope.providerStatus.TaskRef != "" {
a.taskIDCache[machine.Name] = scope.providerStatus.TaskRef
}
if err != nil {
if err := scope.PatchMachine(); err != nil {
return err
}
fmtErr := fmt.Errorf(reconcilerFailFmt, machine.GetName(), createEventAction, err)
return a.handleMachineError(machine, fmtErr, createEventAction)
}

@michaelgugino michaelgugino force-pushed the taskid-cache branch 2 times, most recently from a92ae84 to 93246d9 Compare September 29, 2020 13:54
After patching the machine object following the initial
create, the subsequent reconcile may have a stale machine
object. Additionally, any transient k8s API errors might
result in the patch operation failing and there is no
way to recover that operation. Either of these
scenarios will result in a double creation event
and a leaked instance.

This commit ensures we persist the taskID from the
vSphere API in these situations and preserve
it to ensure we do not double create.

This commit also removes the RequeueAfterError for
successful creation events as it is ineffectual.
Contributor

@JoelSpeed JoelSpeed left a comment


/approve

fmtErr := fmt.Errorf(reconcilerFailFmt, machine.GetName(), createEventAction, err)
return a.handleMachineError(machine, fmtErr, createEventAction)
retErr = a.handleMachineError(machine, fmtErr, createEventAction)
Contributor


Nit: Maybe add a comment here to explain why we aren't returning here immediately? WDYT?

@openshift-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: JoelSpeed

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci-robot openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Sep 29, 2020
@michaelgugino
Contributor Author

/test e2e-vsphere-serial


// Ensure we're not reconciling a stale machine by checking our task-id.
// This is a workaround for a cache race condition.
if val, ok := a.TaskIDCache[machine.Name]; ok {


Did you try running go test -race on this? We could have a race here and probably need a lock or sync.Map
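A mutex-guarded map (or sync.Map) would address the concern raised here; a minimal sketch under the assumption that concurrent reconciles were possible (illustrative names, not the PR's code):

```go
package main

import (
	"fmt"
	"sync"
)

// safeTaskIDCache wraps the plain map with a mutex so concurrent
// reconciles (if they ever happened) could not trigger Go's
// "concurrent map read and map write" panic.
type safeTaskIDCache struct {
	mu    sync.Mutex
	cache map[string]string
}

func (c *safeTaskIDCache) Set(machine, taskID string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.cache[machine] = taskID
}

func (c *safeTaskIDCache) Get(machine string) (string, bool) {
	c.mu.Lock()
	defer c.mu.Unlock()
	v, ok := c.cache[machine]
	return v, ok
}

func (c *safeTaskIDCache) Delete(machine string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	delete(c.cache, machine)
}

func main() {
	c := &safeTaskIDCache{cache: map[string]string{}}

	// Hammer the cache from many goroutines; without the mutex,
	// `go test -race` (or the runtime itself) would flag this.
	var wg sync.WaitGroup
	for i := 0; i < 100; i++ {
		wg.Add(1)
		go func(n int) {
			defer wg.Done()
			c.Set(fmt.Sprintf("machine-%d", n), "task")
		}(i)
	}
	wg.Wait()

	_, ok := c.Get("machine-0")
	fmt.Println(ok) // true
}
```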

Contributor

@elmiko elmiko left a comment


this makes sense to me, i have a small question to help my understanding.
/lgtm

@@ -108,6 +121,9 @@ func (a *Actuator) Exists(ctx context.Context, machine *machinev1.Machine) (bool

func (a *Actuator) Update(ctx context.Context, machine *machinev1.Machine) error {
klog.Infof("%s: actuator updating machine", machine.GetName())
// Cleanup TaskIDCache so we don't continually grow
delete(a.TaskIDCache, machine.Name)
Contributor

@elmiko elmiko Sep 30, 2020


i just want to understand this better, is the idea here that once we are updating we know we have passed the initial need for the creation task id?

Contributor Author


Once we've hit update() or delete(), we know the task ID was previously recorded successfully and we won't go down the create() path again, so we remove the entry from the local cache so memory doesn't grow unbounded.
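The lifecycle described here can be illustrated with a toy example: Create() records the task ID, and reaching Update() or Delete() means the status patch already succeeded, so the entry can be dropped to keep the cache bounded (illustrative, not the actual actuator):

```go
package main

import "fmt"

func main() {
	taskIDCache := map[string]string{}

	// Create() path: record the task ID so a later reconcile with a
	// stale machine object can detect the mismatch and requeue.
	taskIDCache["machine-a"] = "task-123"

	// Update()/Delete() path: the machine got past creation, so the
	// guard is no longer needed; drop the entry to bound memory use.
	delete(taskIDCache, "machine-a")

	_, ok := taskIDCache["machine-a"]
	fmt.Println(ok) // false: the cache does not grow with finished machines
}
```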

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Sep 30, 2020
@openshift-bot
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

5 similar comments

@Danil-Grigorev

/hold
This implementation is not thread safe. You cannot concurrently read and write to a global map variable. This result into panic, the pod will fail, which could cause the attached BZ to reoccur again and a delay during installation process. Please reconsider this: #711 (comment)

@openshift-ci-robot openshift-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Sep 30, 2020
@JoelSpeed
Contributor

@Danil-Grigorev I don't believe this is an issue in this case. We only ever run a single thread for our controllers. There's no goroutines involved here at all so there shouldn't be any race conditions in the running code

/retest

@Danil-Grigorev

@Danil-Grigorev I don't believe this is an issue in this case. We only ever run a single thread for our controllers. There's no goroutines involved here at all so there shouldn't be any race conditions in the running code

/retest

I see now that the Actuator reconcile method runs only one thread at a time, and we execute Create, Update or Delete in there. SGTM

/hold cancel
/retest
/lgtm

@openshift-ci-robot openshift-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Oct 1, 2020
@openshift-bot
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

6 similar comments

@openshift-ci-robot
Contributor

openshift-ci-robot commented Oct 1, 2020

@michaelgugino: The following tests failed, say /retest to rerun all failed tests:

Test name Commit Details Rerun command
ci/prow/e2e-gcp 5e76830 link /test e2e-gcp
ci/prow/e2e-azure 5e76830 link /test e2e-azure

Full PR test history. Your PR dashboard.


@openshift-merge-robot openshift-merge-robot merged commit f4110d9 into openshift:master Oct 1, 2020
@openshift-ci-robot
Contributor

@michaelgugino: All pull requests linked via external trackers have merged:

Bugzilla bug 1880110 has been moved to the MODIFIED state.

In response to this:

Bug 1880110: vSphere: preserve taskID in local cache

