VM CrashLoop Detection and Exponential Backoff #5905

Merged
17 commits merged into kubevirt:main on Jul 27, 2021

Conversation

davidvossel (Member)

CrashLoop detection

A VM crash loop is defined as a VM with runStrategy = Always|RerunOnFailure whose VMIs continually fail to reach vmi.Status.Phase == Running, meaning the VM's VMI is repeatedly scheduled and fails before the qemu process ever launches successfully.

During such an event, crash loop detection begins exponentially backing off the launch of new VMIs to replace the failed ones. Once a VM's VMI reaches the "Running" phase (or the VM is manually stopped), the crash loop tracking is reset.

Implementation

Crash loop detection is tracked by a new field on the VM status called StartFailure, which records the number of consecutive start failures as well as when the next start can be retried after a failure.

When a crash loop occurs, users receive feedback that their VM is in a crash loop via VM.Status.PrintableStatus being set to "CrashLoop".
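
For reference, the new status field looks roughly like the following Go sketch. Field names are approximated from the fragments visible in this review (LastFailedVMIUID, RetryAfterTimestamp); ConsecutiveFailCount is an assumed name for the failure counter, not confirmed by the diff.

// Sketch of the new VM status field; names approximated from this review.
// (imports: metav1 "k8s.io/apimachinery/pkg/apis/meta/v1", "k8s.io/apimachinery/pkg/types")
type VirtualMachineStartFailure struct {
	// Number of consecutive VMI start failures observed for this VM
	// (assumed field name).
	ConsecutiveFailCount int `json:"consecutiveFailCount,omitempty"`
	// UID of the most recent VMI that failed to start, used to avoid
	// counting the same failure twice.
	LastFailedVMIUID types.UID `json:"lastFailedVMIUID,omitempty"`
	// Earliest time at which the controller will try to start a new VMI.
	RetryAfterTimestamp *metav1.Time `json:"retryAfterTimestamp,omitempty"`
}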

Changes in existing behavior

Users can now call virtctl stop my-vm for VMs with runStrategy Always|RerunOnFailure even when no active VMI is present. This allows someone to "stop" a VM that is in a crash loop. Without this change in logic, a user couldn't use virtctl stop during a crash loop because the virt-api subresource endpoint previously required an active VMI in order to stop.

Testing

Unit test coverage exists for all new and altered functionality

New functional tests exist to invoke a crash loop, verify exponential backoff occurs, and verify crash loops recover as expected once a VM's VMI eventually reaches the Running phase.

related to: https://bugzilla.redhat.com/show_bug.cgi?id=1973852

Release note:

VM CrashLoop detection and Exponential Backoff

@kubevirt-bot kubevirt-bot added release-note Denotes a PR that will be considered when it comes time to generate release notes. dco-signoff: yes Indicates the PR's author has DCO signed all their commits. size/XL kind/api-change Categorizes issue or PR as related to adding, removing, or otherwise changing an API labels Jun 22, 2021
@zcahana (Contributor) left a comment

Hi @davidvossel, this looks like a very useful addition!
I've included several minor comments inline.

And another general comment:
Currently, the crash loop detection only considers VMIs that never reached the Running phase. However, "Running" VMIs too might end up in some crash loop scenario, in which the guest is launched successfully but quickly fails (boot error, misconfigured liveness probe, ...). Would it make sense to extend it to detect such scenarios too?

@@ -348,6 +348,7 @@ func main() {
qemuAgentUserInterval := pflag.Duration("qemu-agent-user-interval", 10, "Interval in seconds between consecutive qemu agent calls for user command")
qemuAgentVersionInterval := pflag.Duration("qemu-agent-version-interval", 300, "Interval in seconds between consecutive qemu agent calls for version command")
qemuAgentFSFreezeStatusInterval := pflag.Duration("qemu-fsfreeze-status-interval", 5, "Interval in seconds between consecutive qemu agent calls for fsfreeze status command")
simulateCrash := pflag.Bool("simulate-crash", false, "Causes virt-launcher to immediately crash. This is used by functional tests to simulate crash loop scenarios.")
Contributor:

I'm wondering whether there's some existing VM/VMI configuration we could use that will cause virt-launcher to crash (i.e., without needing an explicit virt-launcher flag/VMI annotation). Maybe very low resource requests that won't allow the qemu process to proceed?

On the other hand, I'd expect that if there's such a known configuration, then we should reject it during validation and/or fix virt-launcher.

Member Author:

Yeah, I'm not aware of a reliable method that will immediately fail.
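
For context, a minimal sketch of how the test hook might be honored inside virt-launcher's main(), assuming the flag is simply checked right after flag parsing (the merged wiring may differ):

simulateCrash := pflag.Bool("simulate-crash", false,
	"Causes virt-launcher to immediately crash. This is used by functional tests to simulate crash loop scenarios.")
pflag.Parse()

// Hypothetical handling: exit before any qemu setup so the VMI never
// reaches the Running phase and the VM controller records a start failure.
if *simulateCrash {
	panic("simulated virt-launcher crash")
}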

// add randomized seconds to offset multiple failing VMs from one another
delaySeconds += rand.Intn(randomRange)

if delaySeconds > maxDelay {
Contributor:

Should we add the randomized offset only after capping at maxDelay? Otherwise we might end up with multiple failing VMs all stuck at maxDelay. Alternatively, if we want to be strict over maxDelay, we can subtract a randomized offset.

Member Author:

Otherwise we might end up with multiple failing VMs all stuck at maxDelay

We're adding jitter to the backoff so that multiple VMs stuck in crash loops don't get retried at the same time. By the time the max delay is hit, all VMs stuck in a crash loop will already be offset from one another due to the randomization leading up to the max delay.
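
A minimal sketch of the backoff calculation being discussed, with the jitter applied before the cap as described above (the base delay and jitter range are assumed values, not taken from the merged code):

// calculateStartBackoffTime sketch: doubles the delay per consecutive
// failure, adds jitter so crash-looping VMs don't retry in lockstep,
// then caps the result at maxDelay seconds. (imports: "math/rand")
func calculateStartBackoffTime(failCount int, maxDelay int) int {
	const baseDelaySeconds = 15 // assumed base delay
	const randomRange = 10      // assumed jitter range in seconds

	delaySeconds := baseDelaySeconds
	for i := 1; i < failCount; i++ {
		delaySeconds *= 2
	}

	// add randomized seconds to offset multiple failing VMs from one another
	delaySeconds += rand.Intn(randomRange)

	if delaySeconds > maxDelay {
		delaySeconds = maxDelay
	}
	return delaySeconds
}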

Comment on lines 828 to 832
for _, ts := range vmi.Status.PhaseTransitionTimestamps {
if ts.Phase == virtv1.Running {
return true
}
}
Contributor:

nit: can extract this for loop into a wasVMIRunning(vmi) bool or wasVMIInRunningPhase(vmi) bool and use it both here and in vmiFailedEarly().

Member Author:

+1 fixed
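
For reference, the extracted helper would look roughly like this (a sketch built from the loop quoted above):

// wasVMIInRunningPhase reports whether the VMI ever transitioned through
// the Running phase, based on its recorded phase transition timestamps.
func wasVMIInRunningPhase(vmi *virtv1.VirtualMachineInstance) bool {
	for _, ts := range vmi.Status.PhaseTransitionTimestamps {
		if ts.Phase == virtv1.Running {
			return true
		}
	}
	return false
}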

return false
}

func hasStartFailureBackoffExpired(vm *virtv1.VirtualMachine) int64 {
Contributor:

nit: I'm a bit confused by this function name. Should it be called getStartFailureBackoffTimeLeft?

Member Author:

fixed
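
A sketch of the renamed helper, assuming it simply compares RetryAfterTimestamp against the current time (the merged implementation may differ):

// startFailureBackoffTimeLeft returns how many seconds remain before a
// crash-looping VM may be restarted, or 0 if no backoff is in effect.
// (imports: "time")
func startFailureBackoffTimeLeft(vm *virtv1.VirtualMachine) int64 {
	if vm.Status.StartFailure == nil || vm.Status.StartFailure.RetryAfterTimestamp == nil {
		return 0
	}
	now := time.Now().Unix()
	retryAfter := vm.Status.StartFailure.RetryAfterTimestamp.Time.Unix()
	if retryAfter > now {
		return retryAfter - now
	}
	return 0
}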

Comment on lines 892 to 893
vm.Status.StartFailure.LastFailedVMIUID = vmi.UID
vm.Status.StartFailure.RetryAfterTimestamp = &retryAfter
Contributor:

These two lines are redundant.

Member Author:

yup, good catch, fixed
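
For context, a sketch of how a start failure might be recorded on the VM status (a hypothetical helper; ConsecutiveFailCount is an assumed field name, and the merged code may structure this differently):

// vmStartFailureRecord sketch: bumps the consecutive failure count for a
// newly failed VMI and computes when the next start attempt is allowed.
func vmStartFailureRecord(vm *virtv1.VirtualMachine, vmi *virtv1.VirtualMachineInstance, maxDelaySeconds int) {
	count := 1
	if vm.Status.StartFailure != nil {
		// don't count the same failed VMI twice
		if vm.Status.StartFailure.LastFailedVMIUID == vmi.UID {
			return
		}
		count = vm.Status.StartFailure.ConsecutiveFailCount + 1
	}

	delaySeconds := calculateStartBackoffTime(count, maxDelaySeconds)
	retryAfter := metav1.NewTime(time.Now().Add(time.Duration(delaySeconds) * time.Second))

	vm.Status.StartFailure = &virtv1.VirtualMachineStartFailure{
		ConsecutiveFailCount: count,
		LastFailedVMIUID:     vmi.UID,
		RetryAfterTimestamp:  &retryAfter,
	}
}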

@@ -58,6 +61,8 @@ const (
failureDeletingVmiErrFormat = "Failure attempting to delete VMI: %v"
)

const defaultMaxCrashLoopBackoffDelay = 300
Contributor:

Would it be better to have this const defined as 300 * time.Second?

Member Author:

I like working with ints here; I think the issue is that I didn't specify that this value is in seconds.

I renamed this to defaultMaxCrashLoopBackoffDelaySeconds

@@ -748,6 +775,125 @@ func (c *VMController) startVMI(vm *virtv1.VirtualMachine) error {
return nil
}

// Returns in seconds how long to wait before trying to start the VM again.
func calculateStartBackoffTime(failCount int, maxDelay int) int {
Contributor:

Here too, would it be better to return a time.Duration?

Member Author:

Since all the calculations deal with seconds represented as ints, I'd rather keep it that way.

@@ -461,6 +478,11 @@ func (c *VMIController) updateStatus(vmi *virtv1.VirtualMachineInstance, pod *k8
vmiCopy.Status.LauncherContainerImageVersion = ""
}

if !c.hasOwnerVM(vmi) && len(vmiCopy.Finalizers) > 0 {
Contributor:

Good catch here!
There's a corner case in which an ownerless VMI is created and crashes, after which a VM is created and adopts the VMI. In this case, the adoption logic should add the VirtualMachineControllerFinalizer too.

@@ -1179,6 +1184,8 @@ const (
// VirtualMachineStatusTerminating indicates that the virtual machine is in the process of deletion,
// as well as its associated resources (VirtualMachineInstance, DataVolumes, …).
VirtualMachineStatusTerminating VirtualMachinePrintableStatus = "Terminating"
// VirtualMachineStatusCrashLoop indicates that the virtual machine is currently in a crash loop waiting to be retried
VirtualMachineStatusCrashLoop VirtualMachinePrintableStatus = "CrashLoop"
Contributor:

For consistency with pods, should this be "CrashLoopBackOff"?

Member Author:

yup, fixed

@kubevirt-bot kubevirt-bot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jun 24, 2021
@davidvossel (Member Author)

Currently, the crash loop detection only considers VMIs that never reached the Running phase. However, "Running" VMIs too might end up in some crash loop scenario, in which the guest is launched successfully but quickly fails (boot error, misconfigured liveness probe, ...). Would it make sense to extend it to detect such scenarios too?

It's possible we can extend this backoff behavior to failing liveness probes. I feel better about the liveness probe situation as it is now because it at least involves some timeout period before the VMI gets rescheduled. That wasn't the case for VMIs that hit some critical error preventing the qemu process from ever launching successfully (i.e., never reaching Phase: Running).

@zcahana (Contributor) commented Jun 24, 2021

I feel better about the liveness probe situation as it is now because it at least involves some timeout period before the VMI will get rescheduled

I see your point. I think the main benefit of the CrashLoop detection for Running VMs is that it now gets reported to the user, who would otherwise only see a Stopped --> Starting --> Running cycle.

@rmohr (Member) commented Jun 29, 2021

Can we also re-evaluate the need for this timeout when the pod hits the running phase? I am not sure we should still keep it. Maybe we should just keep running and wait.


@kubevirt-bot kubevirt-bot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jul 8, 2021
@davidvossel (Member Author)

@zcahana Thanks for the review. I think I've addressed all your feedback

@rmohr I added jitter to the virt-launcher qemu timeout. The timeout is now a randomized range between 4 and 6 minutes.
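
A sketch of what a jittered qemu start timeout could look like (a hypothetical helper; the exact range and wiring inside virt-launcher may differ):

// randomizedQemuStartTimeout returns a timeout between 4 and 6 minutes so
// that crash-looping virt-launcher pods don't all give up at the same time.
// (imports: "math/rand", "time")
func randomizedQemuStartTimeout() time.Duration {
	jitter := time.Duration(rand.Intn(121)) * time.Second // 0-120 seconds
	return 4*time.Minute + jitter
}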


@kubevirt-bot kubevirt-bot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jul 10, 2021
@zcahana (Contributor) commented Jul 11, 2021

@zcahana Thanks for the review. I think I've addressed all your feedback

Thanks David.
/lgtm

@kubevirt-bot kubevirt-bot added the lgtm Indicates that a PR is ready to be merged. label Jul 11, 2021
@kubevirt-bot kubevirt-bot removed lgtm Indicates that a PR is ready to be merged. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. labels Jul 12, 2021
@@ -1282,6 +1283,10 @@ var _ = Describe("VirtualMachineInstance watcher", func() {

if vmExists {
vmSource.Add(vm)
// the controller isn't using informer callbacks for the VM informer
Contributor:

This makes me wonder whether this can happen in a real cluster (i.e., the finalizer is removed because the VMI controller incorrectly determines that there's no owner VM). Would it be safer if hasOwnerVM() read the VM from the API server instead of from the local cache?

Member Author:

the VM controller and VMI controller share the same VM informer. There has to be a VM in the informer cache for the VM controller to create the VMI.

I do need to add a vmInformer.HasSynced() check at VMI controller startup though, to prevent the VMI controller from starting until the VM informer is up to date.

Contributor:

I see your point, so basically this can only happen in the test, since the VM and VMI are fed to the informers "in parallel".
Anyway, looking at the code, you've already added a vmInformer.HasSynced() to the VMI controller's Run() method, so it's good to go.

Member Author:

I do need to add a vmInformer.HasSynced() check at VMI controller startup though, to prevent the VMI controller from starting until the VM informer is up to date.

Ah, I already did that, never mind.
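
For reference, the sync guard follows the standard client-go pattern; a minimal sketch of the VMI controller's Run() waiting on the shared VM informer (method and field names are approximations):

// Run sketch: block until the VMI and VM informer caches are synced before
// starting workers, so hasOwnerVM() never consults a stale or empty cache.
// (imports: "k8s.io/client-go/tools/cache", "k8s.io/apimachinery/pkg/util/wait", "time")
func (c *VMIController) Run(threadiness int, stopCh <-chan struct{}) {
	defer c.Queue.ShutDown()

	if !cache.WaitForCacheSync(stopCh, c.vmiInformer.HasSynced, c.vmInformer.HasSynced) {
		return
	}

	for i := 0; i < threadiness; i++ {
		go wait.Until(c.runWorker, time.Second, stopCh)
	}
	<-stopCh
}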

@zcahana (Contributor) commented Jul 13, 2021

/lgtm

@kubevirt-bot kubevirt-bot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jul 22, 2021
@davidvossel (Member Author)

/retest

@kubevirt kubevirt deleted a comment from kubevirt-bot Jul 23, 2021
@zcahana (Contributor) commented Jul 23, 2021

/lgtm

Will need an approver too.

@kubevirt-bot kubevirt-bot added the lgtm Indicates that a PR is ready to be merged. label Jul 23, 2021
@davidvossel (Member Author)

Will need an approver too.

/cc @rmohr

@kubevirt-bot kubevirt-bot requested a review from rmohr July 26, 2021 12:42
@rmohr (Member) left a comment

@davidvossel can you explain a little bit in the PR description why a finalizer is now needed?

@@ -578,6 +588,13 @@ func (c *VMController) startStop(vm *virtv1.VirtualMachine, vmi *virtv1.VirtualM
return nil
}

timeLeft := startFailureBackoffTimeLeft(vm)
if timeLeft > 0 {
log.Log.Object(vm).Infof("Delaying start of VM %s with 'runStrategy: %s' due to start failure backoff. Waiting %d more seconds before starting.", startingVmMsg, runStrategy, timeLeft)
Member:

This sounds like something which I would want to see as a warning event.
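
For example, the backoff delay could also be surfaced as a warning event alongside the log line (a sketch using the standard EventRecorder API; the "CrashLoopBackOff" reason string and the recorder field are assumptions, not taken from the merged code):

// Hypothetical warning event for the backoff delay.
// (imports: k8sv1 "k8s.io/api/core/v1")
c.recorder.Eventf(vm, k8sv1.EventTypeWarning, "CrashLoopBackOff",
	"Delaying start of VM due to start failure backoff; %d seconds remaining", timeLeft)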

@@ -724,6 +748,9 @@ func (c *VMController) startVMI(vm *virtv1.VirtualMachine) error {

// start it
vmi := c.setupVMIFromVM(vm)
// add a finalizer to ensure the VM controller has a chance to see
// the VMI before it is deleted
vmi.Finalizers = append(vmi.Finalizers, virtv1.VirtualMachineControllerFinalizer)
Member:

I think you mentioned that you want this to avoid issues if the VMI disappears. For me that means that this can still happen (since nothing blocks one from removing finalizers).

It is ok for me to have the finalizer if we have to clean up things. Is that the case? Otherwise I would prefer not to add a finalizer.

@rmohr (Member) commented Jul 27, 2021

/approve

@kubevirt-bot (Contributor)

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: rmohr

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@kubevirt-bot kubevirt-bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jul 27, 2021
@kubevirt-bot kubevirt-bot merged commit c5e0d39 into kubevirt:main Jul 27, 2021
Comment on lines +845 to +888
table.DescribeTable("has start failure backoff expired", func(vm *v1.VirtualMachine, expected int64) {
seconds := startFailureBackoffTimeLeft(vm)

if expected > 0 {
// since the tests all run in parallel, it's difficult to
// do precise timing. We set the `retryAfter` time but the test
// execution may happen seconds later. We use big numbers and
// account for some jitter to make sure the calculation falls within
// the ballpark of what we expect.
parallelTestJitter := expected / 10
if (expected - seconds) > parallelTestJitter {
Expect(seconds).To(Equal(expected))
}
}

},

table.Entry("no vm start failures",
&v1.VirtualMachine{},
int64(0)),
table.Entry("vm failure waiting 300 seconds",
&v1.VirtualMachine{
Status: v1.VirtualMachineStatus{
StartFailure: &v1.VirtualMachineStartFailure{
RetryAfterTimestamp: &metav1.Time{
Time: time.Now().Add(300 * time.Second),
},
},
},
},
int64(300)),
table.Entry("vm failure 300 seconds past retry time",
&v1.VirtualMachine{
Status: v1.VirtualMachineStatus{
StartFailure: &v1.VirtualMachineStartFailure{
RetryAfterTimestamp: &metav1.Time{
Time: time.Now().Add(-300 * time.Second),
},
},
},
},
int64(0)),
)
})
Contributor:

I realize this has been merged for quite some time, but I'm resurrecting this following a unit test failure I saw on the CI (this).
Looking closely at this test, I reckon that:

  • In table entries 1 and 3, expected is always 0, so nothing is actually being asserted (if expected > 0 { ... }).
  • In entry 2, if test execution happens >30 seconds after the evaluation of the entry, it's bound to fail, and otherwise silently pass without any assertion being made.

Actually, I'm not entirely sure what the purpose of this specific test is. Also, given that the logic of calculateStartBackoffTime() is quite simple and is anyway covered by the larger test cases in the Context("crashloop backoff tests"), do you think we could just remove it?

@davidvossel WDYT?

Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. dco-signoff: yes Indicates the PR's author has DCO signed all their commits. kind/api-change Categorizes issue or PR as related to adding, removing, or otherwise changing an API lgtm Indicates that a PR is ready to be merged. release-note Denotes a PR that will be considered when it comes time to generate release notes. size/XXL