VM CrashLoop Detection and Exponential Backoff #5905
Conversation
Hi @davidvossel, this looks like a very useful addition!
I've included several minor comments inline.
And another general comment:
Currently, the crash loop detection only considers VMIs that never reached the Running phase. However, "Running" VMIs too might end up in some crash loop scenario, in which the guest is launched successfully but quickly fails (boot error, misconfigured liveness probe, ...). Would it make sense to extend it to detect such scenarios too?
@@ -348,6 +348,7 @@ func main() {
	qemuAgentUserInterval := pflag.Duration("qemu-agent-user-interval", 10, "Interval in seconds between consecutive qemu agent calls for user command")
	qemuAgentVersionInterval := pflag.Duration("qemu-agent-version-interval", 300, "Interval in seconds between consecutive qemu agent calls for version command")
	qemuAgentFSFreezeStatusInterval := pflag.Duration("qemu-fsfreeze-status-interval", 5, "Interval in seconds between consecutive qemu agent calls for fsfreeze status command")
	simulateCrash := pflag.Bool("simulate-crash", false, "Causes virt-launcher to immediately crash. This is used by functional tests to simulate crash loop scenarios.")
I'm wondering whether there's some existing VM/VMI configuration we could use that will cause virt-launcher to crash (i.e., without needing an explicit virt-launcher flag/VMI annotation). Maybe very low resource requests that won't allow the qemu process to proceed?
On the other hand, I'd expect that if there's such a known configuration, then we should reject it during validation and/or fix virt-launcher.
yeah, i'm not aware of a reliable method that will immediately fail.
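For reference, a minimal sketch of how such a flag could be wired into virt-launcher's main(); the panic-based failure mode and message are assumptions for illustration, not the merged code:

```go
package main

import (
	"fmt"

	"github.com/spf13/pflag"
)

func main() {
	// sketch only: crash immediately when --simulate-crash is set, so
	// functional tests can drive a VMI into a crash loop on demand
	simulateCrash := pflag.Bool("simulate-crash", false,
		"Causes virt-launcher to immediately crash. Used by functional tests.")
	pflag.Parse()

	if *simulateCrash {
		// the exact failure mode here is an assumption; any non-zero exit
		// before qemu launches is enough to trigger the VMI failure path
		panic(fmt.Errorf("simulated virt-launcher crash"))
	}

	// ... normal virt-launcher startup would continue here ...
}
```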
	// add randomized seconds to offset multiple failing VMs from one another
	delaySeconds += rand.Intn(randomRange)

	if delaySeconds > maxDelay {
Should we add the randomized offset only after capping at maxDelay? Otherwise we might end up with multiple failing VMs all stuck at maxDelay. Alternatively, if we want to be strict about not exceeding maxDelay, we can subtract a randomized offset.
Otherwise we might end up with multiple failing VMs all stuck at maxDelay

we're adding jitter to the backoff, which keeps multiple VMs stuck in crash loops from being retried at the same time. By the time max delay is hit, all VMs stuck in a crash loop will already be offset from one another due to the randomization leading up to the max delay
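To make the ordering concrete, here is a hedged sketch of a jitter-before-cap backoff in the shape being discussed; the exponential base of 5 seconds and the jitter range are assumptions for this sketch, not necessarily the merged constants:

```go
package main

import (
	"fmt"
	"math/rand"
)

// cap on the exponential backoff; naming follows the rename agreed later
// in this thread (value is in seconds)
const defaultMaxCrashLoopBackoffDelaySeconds = 300

// calculateStartBackoffTime returns, in seconds, how long to wait before
// retrying a VM start. Jitter is applied before the cap, so VMs still
// climbing toward maxDelay stay offset from one another; once the cap is
// hit they all settle at exactly maxDelay, as discussed above.
func calculateStartBackoffTime(failCount int, maxDelay int) int {
	if failCount <= 0 {
		failCount = 1
	}
	if failCount > 30 {
		failCount = 30 // avoid shift overflow in this sketch
	}
	// exponential growth: the delay doubles with each consecutive failure
	delaySeconds := 5 << (failCount - 1)
	// add randomized seconds to offset multiple failing VMs from one another
	randomRange := (delaySeconds / 2) + 1
	delaySeconds += rand.Intn(randomRange)
	if delaySeconds > maxDelay {
		delaySeconds = maxDelay
	}
	return delaySeconds
}

func main() {
	for fails := 1; fails <= 8; fails++ {
		fmt.Printf("fail #%d -> retry in %ds\n", fails,
			calculateStartBackoffTime(fails, defaultMaxCrashLoopBackoffDelaySeconds))
	}
}
```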
pkg/virt-controller/watch/vm.go
Outdated
for _, ts := range vmi.Status.PhaseTransitionTimestamps {
	if ts.Phase == virtv1.Running {
		return true
	}
}
nit: can extract this for loop into a wasVMIRunning(vmi) bool or wasVMIInRunningPhase(vmi) bool and use it both here and in vmiFailedEarly().
+1 fixed
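The extracted helper would look roughly like this — a sketch built directly from the loop shown above; the final name in the merged code may differ:

```go
// wasVMIInRunningPhase reports whether the VMI ever transitioned into the
// Running phase, based on its recorded phase transition timestamps.
// (assumes the virtv1 package alias used elsewhere in this PR)
func wasVMIInRunningPhase(vmi *virtv1.VirtualMachineInstance) bool {
	for _, ts := range vmi.Status.PhaseTransitionTimestamps {
		if ts.Phase == virtv1.Running {
			return true
		}
	}
	return false
}
```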
pkg/virt-controller/watch/vm.go
Outdated
	return false
}

func hasStartFailureBackoffExpired(vm *virtv1.VirtualMachine) int64 {
nit: I'm a bit confused by this function name. Should it be called getStartFailureBackoffTimeLeft?
fixed
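After the rename, the helper reads roughly like this — a sketch; the exact clock handling in the merged code may differ, though the StartFailure fields match those visible in the diffs in this PR:

```go
// startFailureBackoffTimeLeft returns how many seconds remain before the VM
// may be restarted, or 0 when no backoff is in effect.
func startFailureBackoffTimeLeft(vm *virtv1.VirtualMachine) int64 {
	if vm.Status.StartFailure == nil || vm.Status.StartFailure.RetryAfterTimestamp == nil {
		return 0
	}
	now := time.Now().UTC().Unix()
	retryAfter := vm.Status.StartFailure.RetryAfterTimestamp.Time.UTC().Unix()
	if retryAfter > now {
		return retryAfter - now
	}
	return 0
}
```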
pkg/virt-controller/watch/vm.go
Outdated
vm.Status.StartFailure.LastFailedVMIUID = vmi.UID
vm.Status.StartFailure.RetryAfterTimestamp = &retryAfter
These two lines are redundant.
yup, good catch, fixed
pkg/virt-controller/watch/vm.go
Outdated
@@ -58,6 +61,8 @@ const (
	failureDeletingVmiErrFormat = "Failure attempting to delete VMI: %v"
)

const defaultMaxCrashLoopBackoffDelay = 300
Would it be better to have this const defined as 300 * time.Second?
I like working with ints here. I think the issue is I didn't specify that this variable is in seconds.
I renamed it to defaultMaxCrashLoopBackoffDelaySeconds.
@@ -748,6 +775,125 @@ func (c *VMController) startVMI(vm *virtv1.VirtualMachine) error {
	return nil
}

// Returns in seconds how long to wait before trying to start the VM again.
func calculateStartBackoffTime(failCount int, maxDelay int) int {
Here too, would it be better to return a time.Duration?
Since all the calculations deal with seconds represented as ints, I'd rather keep it that way.
@@ -461,6 +478,11 @@ func (c *VMIController) updateStatus(vmi *virtv1.VirtualMachineInstance, pod *k8
	vmiCopy.Status.LauncherContainerImageVersion = ""
}

if !c.hasOwnerVM(vmi) && len(vmiCopy.Finalizers) > 0 {
Good catch here!
There's a corner case in which an ownerless VMI is created and crashes, after which a VM is created and adopts the VMI. In this case, the adoption logic should add the VirtualMachineControllerFinalizer too.
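To illustrate the suggested fix, a sketch of the adoption path; the helper name is hypothetical, while virtv1.VirtualMachineControllerFinalizer appears in the diffs below:

```go
// sketch: when the VM controller adopts a pre-existing (ownerless) VMI,
// ensure the controller finalizer is present, just as on the create path
func ensureVMIFinalizer(vmi *virtv1.VirtualMachineInstance) {
	for _, f := range vmi.Finalizers {
		if f == virtv1.VirtualMachineControllerFinalizer {
			return // already present
		}
	}
	vmi.Finalizers = append(vmi.Finalizers, virtv1.VirtualMachineControllerFinalizer)
}
```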
@@ -1179,6 +1184,8 @@ const (
	// VirtualMachineStatusTerminating indicates that the virtual machine is in the process of deletion,
	// as well as its associated resources (VirtualMachineInstance, DataVolumes, …).
	VirtualMachineStatusTerminating VirtualMachinePrintableStatus = "Terminating"
	// VirtualMachineStatusCrashLoop indicates that the virtual machine is currently in a crash loop waiting to be retried
	VirtualMachineStatusCrashLoop VirtualMachinePrintableStatus = "CrashLoop"
For consistency with pods, should this be "CrashLoopBackOff"?
yup, fixed
it's possible we can extend this backoff behavior to failing …

I see your point. I think the main benefit of the CrashLoop detection for Running VMs is the fact that this now gets reported to the user, who would otherwise only see a Stopped --> Starting --> Running cycle.

Can we also re-evaluate the need for this timeout when the pod hits the running phase? I am not sure we should still keep it. Maybe we should just keep running and wait.
Thanks David.
@@ -1282,6 +1283,10 @@ var _ = Describe("VirtualMachineInstance watcher", func() {

	if vmExists {
		vmSource.Add(vm)
		// the controller isn't using informer callbacks for the VM informer
This makes me think whether it can happen in a real cluster (i.e., the finalizer is removed because the VMI controller incorrectly determines that there's no owner VM). Would it be safer if hasOwnerVM() would attempt to read the VM from the API server, instead of from the local cache?
the VM controller and VMI controller share the same VM informer. There has to be a VM in the informer cache for the VM controller to create the VMI.
I do need to add a vmInformer.HasSynced() at the VMI controller startup though, to prevent the VMI controller from starting until the VM informer is up to date.
I see your point, so basically this can only happen in the test, since the VM and VMI are fed to the informers "in parallel".
Anyway, looking at the code, you've already added a vmInformer.HasSynced() to the VMI controller's Run() method, so it's good to go.
I do need to add a vmInformer.HasSynced() at the VMI controller startup though, to prevent the VMI controller from starting until the VM informer is up to date.

ah, I already did that, never mind
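For context, a sketch of what that guard looks like in a controller's Run(); the field and type names here are assumptions based on this thread, not the merged code (cache is k8s.io/client-go/tools/cache, wait is k8s.io/apimachinery/pkg/util/wait):

```go
// sketch: block VMI processing until the shared VM informer cache has
// synced, so hasOwnerVM() never consults a cache that hasn't yet seen
// the owning VM
func (c *VMIController) Run(threadiness int, stopCh <-chan struct{}) {
	defer c.Queue.ShutDown()

	// wait for the shared caches, including the VM informer discussed above
	cache.WaitForCacheSync(stopCh, c.vmiInformer.HasSynced, c.vmInformer.HasSynced)

	for i := 0; i < threadiness; i++ {
		go wait.Until(c.runWorker, time.Second, stopCh)
	}
	<-stopCh
}
```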
/lgtm
/retest
/lgtm
Will need an approver too.
/cc @rmohr
@davidvossel can you explain a little bit in the PR description why a finalizer is now needed?
@@ -578,6 +588,13 @@ func (c *VMController) startStop(vm *virtv1.VirtualMachine, vmi *virtv1.VirtualM
	return nil
}

timeLeft := startFailureBackoffTimeLeft(vm)
if timeLeft > 0 {
	log.Log.Object(vm).Infof("Delaying start of VM %s with 'runStrategy: %s' due to start failure backoff. Waiting %d more seconds before starting.", startingVmMsg, runStrategy, timeLeft)
This sounds like something which I would want to see as a warning event.
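Something like the following would surface it — a sketch; the helper name and reason string are assumptions (record is k8s.io/client-go/tools/record, k8sv1 is k8s.io/api/core/v1):

```go
// sketch: surface the backoff as a warning event in addition to the log
// line, so it shows up in `kubectl describe vm`
func warnStartBackoff(recorder record.EventRecorder, vm *virtv1.VirtualMachine, timeLeft int64) {
	recorder.Eventf(vm, k8sv1.EventTypeWarning, "CrashLoopBackOff",
		"Delaying VM start due to start failure backoff; %d seconds remaining", timeLeft)
}
```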
@@ -724,6 +748,9 @@ func (c *VMController) startVMI(vm *virtv1.VirtualMachine) error {

	// start it
	vmi := c.setupVMIFromVM(vm)
	// add a finalizer to ensure the VM controller has a chance to see
	// the VMI before it is deleted
	vmi.Finalizers = append(vmi.Finalizers, virtv1.VirtualMachineControllerFinalizer)
I think you mentioned that you want this to avoid issues if the VMI disappears. For me that means that this can still happen (since nothing blocks one from removing finalizers).
It is ok for me to have the finalizer if we have to clean up things. Is that the case? Otherwise I would prefer not to add a finalizer.
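For what it's worth, a sketch of the cleanup side of that lifecycle: once the VM controller has observed the failed VMI and updated its backoff state, it can release the finalizer so deletion completes (the helper name is hypothetical; the finalizer constant appears in the diff above):

```go
// sketch: release the finalizer after the VM controller has recorded the
// VMI's final state into vm.Status.StartFailure; until then, the finalizer
// keeps the failed VMI visible to the controller
func removeVMIFinalizer(vmi *virtv1.VirtualMachineInstance) {
	kept := vmi.Finalizers[:0]
	for _, f := range vmi.Finalizers {
		if f != virtv1.VirtualMachineControllerFinalizer {
			kept = append(kept, f)
		}
	}
	vmi.Finalizers = kept
}
```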
/approve
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: rmohr.
table.DescribeTable("has start failure backoff expired", func(vm *v1.VirtualMachine, expected int64) {
	seconds := startFailureBackoffTimeLeft(vm)

	if expected > 0 {
		// since the tests all run in parallel, it's difficult to
		// do precise timing. We set the `retryAfter` time but the test
		// execution may happen seconds later. We use big numbers and
		// account for some jitter to make sure the calculation falls within
		// the ballpark of what we expect.
		parallelTestJitter := expected / 10
		if (expected - seconds) > parallelTestJitter {
			Expect(seconds).To(Equal(expected))
		}
	}
},
	table.Entry("no vm start failures",
		&v1.VirtualMachine{},
		int64(0)),
	table.Entry("vm failure waiting 300 seconds",
		&v1.VirtualMachine{
			Status: v1.VirtualMachineStatus{
				StartFailure: &v1.VirtualMachineStartFailure{
					RetryAfterTimestamp: &metav1.Time{
						Time: time.Now().Add(300 * time.Second),
					},
				},
			},
		},
		int64(300)),
	table.Entry("vm failure 300 seconds past retry time",
		&v1.VirtualMachine{
			Status: v1.VirtualMachineStatus{
				StartFailure: &v1.VirtualMachineStartFailure{
					RetryAfterTimestamp: &metav1.Time{
						Time: time.Now().Add(-300 * time.Second),
					},
				},
			},
		},
		int64(0)),
)
})
I realize this was merged quite some time ago, but I'm resurrecting it following a unit test failure I saw on the CI (this).
Looking closely at this test, I reckon that:
- In table entries 1 and 3, expected is always 0, so nothing is actually being asserted (if expected > 0 { ... }).
- In entry 2, if test execution happens >30 seconds after the evaluation of the entry, it's bound to fail, and otherwise it silently passes without any assertion being made.
Actually, I'm not entirely sure what the purpose of this specific test is. Also, given that the logic of calculateStartBackoffTime() is quite simple and is anyway covered by the larger test cases in Context("crashloop backoff tests"), do you think we could just remove it?
@davidvossel WDYT?
CrashLoop detection
A VM crash loop is defined as a VM with runStrategy = Always|RerunOnFailure whose VMIs continually fail to reach vmi.Status.Phase == Running, meaning the VM's VMI repeatedly gets scheduled and fails before ever successfully launching the qemu process.

During such an event, the VM crash loop detection will exponentially back off re-launching new VMIs to replace the failed ones. Once a VM's VMI reaches the Running phase (or the VM is manually stopped), the crash loop tracking is reset.
Implementation
Crash loop detection is tracked by a new field on the VM status called StartFailure, which tracks the number of consecutive start failures as well as when the next start can be retried after a failure.

When a crash loop occurs, users receive feedback that their VM is in a crash loop via VM.Status.PrintableStatus being set to "CrashLoop".
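The new status field looks roughly like this — a sketch assembled from the fields visible in the diffs above; ConsecutiveFailCount is an assumed name, while RetryAfterTimestamp and LastFailedVMIUID appear verbatim in the review (metav1 is k8s.io/apimachinery/pkg/apis/meta/v1, types is k8s.io/apimachinery/pkg/types):

```go
// VirtualMachineStartFailure tracks consecutive VMI start failures for
// crash loop backoff on the VM status.
type VirtualMachineStartFailure struct {
	ConsecutiveFailCount int          `json:"consecutiveFailCount,omitempty"`
	RetryAfterTimestamp  *metav1.Time `json:"retryAfterTimestamp,omitempty"`
	LastFailedVMIUID     types.UID    `json:"lastFailedVMIUID,omitempty"`
}
```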
Changes in existing behavior
Users can now call virtctl stop my-vm for VMs with runStrategy Always|RerunOnFailure even when an active VMI is not present. This allows someone to "stop" a VM which is in a crash loop. Without this change in logic, a user wouldn't be able to use virtctl stop during a crash loop, because the virt-api subresource endpoint always expects an active VMI in order to stop.
Testing
Unit test coverage exists for all new and altered functionality
New functional tests exist to invoke a crash loop, verify that exponential backoff occurs, and verify that crash loops recover as expected once a VM's VMI eventually hits the Running phase.
related to: https://bugzilla.redhat.com/show_bug.cgi?id=1973852
Release note: