
Conversation

michaelgugino
Contributor

After patching the machine object following the initial
create, the subsequent reconcile may have a stale machine
object. Additionally, any transient k8s API errors might
result in the patch operation failing and there is no
way to recover that operation. Either of these
scenarios will result in a double creation event
and a leaked instance.

This commit ensures we persist the taskID from the
vSphere API in these situations and preserve
it to ensure we do not double create.
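The guard described in this commit message can be sketched as a small in-memory map keyed by machine name: when the cache holds a task ID that the machine's provider status does not carry, the reconcile is working on a stale object and should requeue rather than create again. A minimal sketch (names are illustrative, not the PR's exact code):

```go
package main

import "fmt"

// actuator keeps a local cache of vSphere task IDs so that a later
// reconcile with a stale machine object (e.g. the status patch failed
// or the informer cache lags) can be detected before a second create.
type actuator struct {
	taskIDCache map[string]string // machine name -> task ID
}

// shouldRequeue reports whether the reconcile is stale: we recorded a
// task ID locally, but the machine's provider status does not match it.
func (a *actuator) shouldRequeue(machineName, statusTaskRef string) bool {
	cached, ok := a.taskIDCache[machineName]
	return ok && cached != statusTaskRef
}

func main() {
	a := &actuator{taskIDCache: map[string]string{"machine-a": "task-123"}}

	fmt.Println(a.shouldRequeue("machine-a", ""))         // true: stale status, requeue
	fmt.Println(a.shouldRequeue("machine-a", "task-123")) // false: status is current, proceed
	fmt.Println(a.shouldRequeue("machine-b", ""))         // false: never created, proceed
}
```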

@michaelgugino michaelgugino changed the title vSphere: preserve taskID in local cache Bug 1880110: vSphere: preserve taskID in local cache Sep 28, 2020
@openshift-ci-robot openshift-ci-robot added the bugzilla/severity-urgent Referenced Bugzilla bug's severity is urgent for the branch this PR is targeting. label Sep 28, 2020
@openshift-ci-robot
Contributor

@michaelgugino: This pull request references Bugzilla bug 1880110, which is valid. The bug has been moved to the POST state. The bug has been updated to refer to the pull request using the external bug tracker.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target release (4.6.0) matches configured target release for branch (4.6.0)
  • bug is in the state ASSIGNED, which is one of the valid states (NEW, ASSIGNED, ON_DEV, POST)

In response to this:

Bug 1880110: vSphere: preserve taskID in local cache

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci-robot openshift-ci-robot added the bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. label Sep 28, 2020
Comment on lines 70 to 75
if val, ok := a.taskIDCache[machine.Name]; ok {
if val != r.providerStatus.TaskRef {
klog.Errorf("%s: machine object missing expected provider task ID, requeue", machine.GetName())
return &machinecontroller.RequeueAfterError{RequeueAfter: requeueAfterSeconds * time.Second}
}
}
Contributor


If the patch fails to update the machine (taskRef will be empty), will this not result in an infinite loop? Constant re-queueing and never getting anywhere?

What happens if the task ref is different to that in the cache? Is that a possibility? Could we not patch the machine with the TaskRef from this cache?

Contributor Author


No, we don't actually need to know the taskID for the subsequent Exists() calls to work on this provider. In most cases, the actuator never goes down the Create() path twice; in the cases where we do double create, the instance clone happens quickly enough to be discoverable by the machine's name.
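For context, Exists() on this provider looks the instance up by the machine's name, which is what makes the create path idempotent in the common case. A minimal sketch of that guard against a fake provider (all names are illustrative, not the actual actuator code):

```go
package main

import "fmt"

// fakeProvider stands in for the vSphere API: instances are discoverable
// by the machine's name once the clone is visible.
type fakeProvider struct {
	instances map[string]bool
}

func (p *fakeProvider) Exists(name string) bool { return p.instances[name] }

// Create refuses to clone a second instance when one with the same
// name is already discoverable.
func (p *fakeProvider) Create(name string) error {
	if p.Exists(name) {
		return fmt.Errorf("instance %q already exists", name)
	}
	p.instances[name] = true
	return nil
}

func main() {
	p := &fakeProvider{instances: map[string]bool{}}
	fmt.Println(p.Create("machine-a") == nil) // true: first create succeeds
	fmt.Println(p.Create("machine-a") != nil) // true: second create is refused
}
```

The taskID cache covers the window where the clone is not yet discoverable by name, which is exactly when Exists() alone cannot prevent the double create.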

Contributor


Ack yep, just double checked the code and can see that is the case

It does make me wonder how this double create can happen when we are setting the VM name and UUID to be predictable from the Machine. Surely as soon as a VM is created, we can find that VM by instance UUID?

Comment on lines 87 to 93
if err := newReconciler(scope).create(); err != nil {
// save the taskRef in our cache in case of any error with patch.
if scope.providerStatus.TaskRef != "" {
a.taskIDCache[machine.Name] = scope.providerStatus.TaskRef
}
if err := scope.PatchMachine(); err != nil {
return err
}
fmtErr := fmt.Errorf(reconcilerFailFmt, machine.GetName(), createEventAction, err)
return a.handleMachineError(machine, fmtErr, createEventAction)
}
// save the taskRef in our cache in case of any error with patch.
if scope.providerStatus.TaskRef != "" {
a.taskIDCache[machine.Name] = scope.providerStatus.TaskRef
}
Contributor


Would it be better to break this up a bit and not duplicate the caching of the task ref?

Suggested change
if err := newReconciler(scope).create(); err != nil {
// save the taskRef in our cache in case of any error with patch.
if scope.providerStatus.TaskRef != "" {
a.taskIDCache[machine.Name] = scope.providerStatus.TaskRef
}
if err := scope.PatchMachine(); err != nil {
return err
}
fmtErr := fmt.Errorf(reconcilerFailFmt, machine.GetName(), createEventAction, err)
return a.handleMachineError(machine, fmtErr, createEventAction)
}
// save the taskRef in our cache in case of any error with patch.
if scope.providerStatus.TaskRef != "" {
a.taskIDCache[machine.Name] = scope.providerStatus.TaskRef
}
err := newReconciler(scope).create()
// save the taskRef in our cache before checking the error in case of any errors later.
if scope.providerStatus.TaskRef != "" {
a.taskIDCache[machine.Name] = scope.providerStatus.TaskRef
}
if err != nil {
if err := scope.PatchMachine(); err != nil {
return err
}
fmtErr := fmt.Errorf(reconcilerFailFmt, machine.GetName(), createEventAction, err)
return a.handleMachineError(machine, fmtErr, createEventAction)
}

@michaelgugino michaelgugino force-pushed the taskid-cache branch 2 times, most recently from a92ae84 to 93246d9 Compare September 29, 2020 13:54
After patching the machine object following the initial
create, the subsequent reconcile may have a stale machine
object. Additionally, any transient k8s API errors might
result in the patch operation failing and there is no
way to recover that operation. Either of these
scenarios will result in a double creation event
and a leaked instance.

This commit ensures we persist the taskID from the
vSphere API in these situations and preserve
it to ensure we do not double create.

This commit also removes the RequeueAfterError for
successful creation events as it is ineffectual.
Contributor

@JoelSpeed JoelSpeed left a comment


/approve

fmtErr := fmt.Errorf(reconcilerFailFmt, machine.GetName(), createEventAction, err)
return a.handleMachineError(machine, fmtErr, createEventAction)
retErr = a.handleMachineError(machine, fmtErr, createEventAction)
Contributor


Nit: Maybe add a comment here to explain why we aren't returning here immediately? WDYT?

@openshift-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: JoelSpeed

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci-robot openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Sep 29, 2020
@michaelgugino
Contributor Author

/test e2e-vsphere-serial


// Ensure we're not reconciling a stale machine by checking our task-id.
// This is a workaround for a cache race condition.
if val, ok := a.TaskIDCache[machine.Name]; ok {


Did you try running go test -race on this? We could have a race here and probably need a lock or sync.Map
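A mutex-guarded map (or sync.Map) would address the concern raised here; a minimal sketch under the assumption that concurrent reconciles were possible (illustrative names, not the PR's code):

```go
package main

import (
	"fmt"
	"sync"
)

// safeTaskIDCache wraps the plain map with a mutex so concurrent
// reconciles (if they ever happened) could not trigger Go's
// "concurrent map read and map write" panic.
type safeTaskIDCache struct {
	mu    sync.Mutex
	cache map[string]string
}

func (c *safeTaskIDCache) Set(machine, taskID string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.cache[machine] = taskID
}

func (c *safeTaskIDCache) Get(machine string) (string, bool) {
	c.mu.Lock()
	defer c.mu.Unlock()
	v, ok := c.cache[machine]
	return v, ok
}

func (c *safeTaskIDCache) Delete(machine string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	delete(c.cache, machine)
}

func main() {
	c := &safeTaskIDCache{cache: map[string]string{}}

	// Hammer the cache from many goroutines; without the mutex,
	// `go test -race` (or the runtime itself) would flag this.
	var wg sync.WaitGroup
	for i := 0; i < 100; i++ {
		wg.Add(1)
		go func(n int) {
			defer wg.Done()
			c.Set(fmt.Sprintf("machine-%d", n), "task")
		}(i)
	}
	wg.Wait()

	_, ok := c.Get("machine-0")
	fmt.Println(ok) // true
}
```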

Contributor

@elmiko elmiko left a comment


this makes sense to me, i have a small question to help my understanding.
/lgtm

@@ -108,6 +121,9 @@ func (a *Actuator) Exists(ctx context.Context, machine *machinev1.Machine) (bool

func (a *Actuator) Update(ctx context.Context, machine *machinev1.Machine) error {
klog.Infof("%s: actuator updating machine", machine.GetName())
// Cleanup TaskIDCache so we don't continually grow
delete(a.TaskIDCache, machine.Name)
Contributor

@elmiko elmiko Sep 30, 2020


i just want to understand this better, is the idea here that once we are updating we know we have passed the initial need for the creation task id?

Contributor Author


Once we've hit update() or delete(), we know the task ID was previously recorded successfully and we won't go down the create() path again, so we remove the entry from the local cache so memory doesn't grow unbounded.
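The lifecycle described here can be illustrated with a toy example: Create() records the task ID, and reaching Update() or Delete() means the status patch already succeeded, so the entry can be dropped to keep the cache bounded (illustrative, not the actual actuator):

```go
package main

import "fmt"

func main() {
	taskIDCache := map[string]string{}

	// Create() path: record the task ID so a later reconcile with a
	// stale machine object can detect the mismatch and requeue.
	taskIDCache["machine-a"] = "task-123"

	// Update()/Delete() path: the machine got past creation, so the
	// guard is no longer needed; drop the entry to bound memory use.
	delete(taskIDCache, "machine-a")

	_, ok := taskIDCache["machine-a"]
	fmt.Println(ok) // false: the cache does not grow with finished machines
}
```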

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Sep 30, 2020
@openshift-bot
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

5 similar comments

@Danil-Grigorev

/hold
This implementation is not thread safe. You cannot concurrently read and write to a global map variable. This result into panic, the pod will fail, which could cause the attached BZ to reoccur again and a delay during installation process. Please reconsider this: #711 (comment)

@openshift-ci-robot openshift-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Sep 30, 2020
@JoelSpeed
Contributor

@Danil-Grigorev I don't believe this is an issue in this case. We only ever run a single thread for our controllers. There's no goroutines involved here at all so there shouldn't be any race conditions in the running code

/retest

@Danil-Grigorev

@Danil-Grigorev I don't believe this is an issue in this case. We only ever run a single thread for our controllers. There's no goroutines involved here at all so there shouldn't be any race conditions in the running code

/retest

I see now that the Actuator reconcile method runs only one thread at a time, and we execute Create, Update or Delete in there. SGTM

/hold cancel
/retest
/lgtm

@openshift-ci-robot openshift-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Oct 1, 2020
@openshift-bot
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

6 similar comments

@openshift-ci-robot
Contributor

openshift-ci-robot commented Oct 1, 2020

@michaelgugino: The following tests failed, say /retest to rerun all failed tests:

Test name Commit Details Rerun command
ci/prow/e2e-gcp 5e76830 link /test e2e-gcp
ci/prow/e2e-azure 5e76830 link /test e2e-azure

Full PR test history. Your PR dashboard.


@openshift-merge-robot openshift-merge-robot merged commit f4110d9 into openshift:master Oct 1, 2020
@openshift-ci-robot
Contributor

@michaelgugino: All pull requests linked via external trackers have merged:

Bugzilla bug 1880110 has been moved to the MODIFIED state.

In response to this:

Bug 1880110: vSphere: preserve taskID in local cache

