Bug 1880110: vSphere: preserve taskID in local cache #711
Conversation
@michaelgugino: This pull request references Bugzilla bug 1880110, which is valid. The bug has been moved to the POST state. The bug has been updated to refer to the pull request using the external bug tracker. 3 validation(s) were run on this bug
In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
pkg/controller/vsphere/actuator.go
Outdated
```go
if val, ok := a.taskIDCache[machine.Name]; ok {
	if val != r.providerStatus.TaskRef {
		klog.Errorf("%s: machine object missing expected provider task ID, requeue", machine.GetName())
		return &machinecontroller.RequeueAfterError{RequeueAfter: requeueAfterSeconds * time.Second}
	}
}
```
If the patch fails to update the machine (taskRef will be empty), will this not result in an infinite loop? Constant re-queueing and never getting anywhere?
What happens if the task ref is different to that in the cache? Is that a possibility? Could we not patch the machine with the TaskRef from this cache?
No, we don't actually need to know the taskID for the subsequent Exists() calls to work on this provider. In most cases, the actuator never goes down the Create() path twice; in the cases where we do double create, the instance cloning happens quickly enough for the VM to be discoverable by the machine's name.
Ack yep, just double checked the code and can see that is the case
It does make me wonder how this double create can happen when we are setting the VM name and UUID to be predictable from the Machine. Surely as soon as a VM is created, we can find that VM by instance UUID?
pkg/controller/vsphere/actuator.go
Outdated
```go
if err := newReconciler(scope).create(); err != nil {
	// save the taskRef in our cache in case of any error with patch.
	if scope.providerStatus.TaskRef != "" {
		a.taskIDCache[machine.Name] = scope.providerStatus.TaskRef
	}
	if err := scope.PatchMachine(); err != nil {
		return err
	}
	fmtErr := fmt.Errorf(reconcilerFailFmt, machine.GetName(), createEventAction, err)
	return a.handleMachineError(machine, fmtErr, createEventAction)
}
// save the taskRef in our cache in case of any error with patch.
if scope.providerStatus.TaskRef != "" {
	a.taskIDCache[machine.Name] = scope.providerStatus.TaskRef
}
```
Would it be better to break this up a bit and not duplicate the caching of the task ref?
Suggested change:

```go
err := newReconciler(scope).create()
// save the taskRef in our cache before checking the error in case of any errors later.
if scope.providerStatus.TaskRef != "" {
	a.taskIDCache[machine.Name] = scope.providerStatus.TaskRef
}
if err != nil {
	if err := scope.PatchMachine(); err != nil {
		return err
	}
	fmtErr := fmt.Errorf(reconcilerFailFmt, machine.GetName(), createEventAction, err)
	return a.handleMachineError(machine, fmtErr, createEventAction)
}
```
Force-pushed a92ae84 to 93246d9 (Compare)
After patching the machine object following the initial create, the subsequent reconcile may have a stale machine object. Additionally, any transient k8s API errors might cause the patch operation to fail, with no way to recover that operation. Either of these scenarios results in a double creation event and a leaked instance. This commit ensures we persist the taskID from the vSphere API in these situations and preserve it so we do not double create. This commit also removes the RequeueAfterError for successful creation events, as it is ineffectual.
Force-pushed 93246d9 to 5e76830 (Compare)
/approve
```diff
 fmtErr := fmt.Errorf(reconcilerFailFmt, machine.GetName(), createEventAction, err)
-return a.handleMachineError(machine, fmtErr, createEventAction)
+retErr = a.handleMachineError(machine, fmtErr, createEventAction)
```
Nit: Maybe add a comment here to explain why we aren't returning here immediately? WDYT?
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: JoelSpeed. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
/test e2e-vsphere-serial
```go
// Ensure we're not reconciling a stale machine by checking our task-id.
// This is a workaround for a cache race condition.
if val, ok := a.TaskIDCache[machine.Name]; ok {
```
Did you try running go test -race on this? We could have a race here and probably need a lock or sync.Map.
this makes sense to me, i have a small question to help my understanding.
/lgtm
```diff
@@ -108,6 +121,9 @@ func (a *Actuator) Exists(ctx context.Context, machine *machinev1.Machine) (bool

 func (a *Actuator) Update(ctx context.Context, machine *machinev1.Machine) error {
 	klog.Infof("%s: actuator updating machine", machine.GetName())
+	// Cleanup TaskIDCache so we don't continually grow
+	delete(a.TaskIDCache, machine.Name)
```
I just want to understand this better: is the idea here that once we are updating, we know we have passed the initial need for the creation task ID?
Once we've hit update() or delete(), we know the task ID was previously successfully recorded and we won't go down the create() pathway again, so we remove the entry from the local cache so the cache doesn't grow unbounded.
/retest Please review the full test history for this PR and help us cut down flakes.
/hold
@Danil-Grigorev I don't believe this is an issue in this case. We only ever run a single thread for our controllers; there are no goroutines involved here at all, so there shouldn't be any race conditions in the running code. /retest
I see now that the /hold cancel
The vSphere run demonstrates the exact condition is caught and mitigated in this run: https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_machine-api-operator/711/pull-ci-openshift-machine-api-operator-master-e2e-vsphere-serial/1310983960695672832
E0929 17:48:18.393190 1 actuator.go:83] ci-op-l2mlynd3-0c8b9-mtkfg-worker-b5jbc: machine object missing expected provider task ID, requeue
We can see the worker machine ci-op-l2mlynd3-0c8b9-mtkfg-worker-b5jbc has the taskRef field populated here:
@michaelgugino: The following tests failed, say
Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.
@michaelgugino: All pull requests linked via external trackers have merged:
Bugzilla bug 1880110 has been moved to the MODIFIED state. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.