VM CrashLoop Detection and Exponential Backoff #5905

Merged
17 commits merged into kubevirt:main on Jul 27, 2021

Conversation

davidvossel (Member)

CrashLoop detection

A VM crash loop is defined as a VM with runStrategy = Always|RerunOnFailure whose VMIs continually fail to reach vmi.Status.Phase == Running, meaning the VM's VMI is repeatedly scheduled and fails before the qemu process ever launches successfully.

During such an event, crash loop detection begins exponentially backing off the launch of new VMIs to replace the failed ones. Once a VM's VMI reaches the "Running" phase (or the VM is manually stopped), the crash loop tracking is reset.

Implementation

Crash loop detection is tracked by a new field on the VM status called StartFailure, which records the number of consecutive start failures as well as when the next start can be retried after a failure.

When a crash loop occurs, users receive feedback that their VM is in a crash loop via VM.Status.PrintableStatus being set to "CrashLoop".
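
For reference, the new status field looks roughly like the following Go sketch. Field names are approximated from the fragments visible in this review (LastFailedVMIUID, RetryAfterTimestamp); ConsecutiveFailCount is an assumed name for the failure counter, not confirmed by the diff.

// Sketch of the new VM status field; names approximated from this review.
// (imports: metav1 "k8s.io/apimachinery/pkg/apis/meta/v1", "k8s.io/apimachinery/pkg/types")
type VirtualMachineStartFailure struct {
	// Number of consecutive VMI start failures observed for this VM
	// (assumed field name).
	ConsecutiveFailCount int `json:"consecutiveFailCount,omitempty"`
	// UID of the most recent VMI that failed to start, used to avoid
	// counting the same failure twice.
	LastFailedVMIUID types.UID `json:"lastFailedVMIUID,omitempty"`
	// Earliest time at which the controller will try to start a new VMI.
	RetryAfterTimestamp *metav1.Time `json:"retryAfterTimestamp,omitempty"`
}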

Changes in existing behavior

Users can now call virtctl stop my-vm for VMs with runStrategy Always|RerunOnFailure even when no active VMI is present. This allows someone to "stop" a VM that is in a crash loop. Without this change in logic, a user couldn't use virtctl stop during a crash loop because the virt-api subresource endpoint previously required an active VMI in order to stop.

Testing

Unit test coverage exists for all new and altered functionality

New functional tests exist to invoke a crash loop, verify exponential backoff occurs, and verify crash loops recover as expected once a VM's VMI eventually reaches the Running phase.

related to: https://bugzilla.redhat.com/show_bug.cgi?id=1973852

Release note:

VM CrashLoop detection and Exponential Backoff

@kubevirt-bot kubevirt-bot added release-note Denotes a PR that will be considered when it comes time to generate release notes. dco-signoff: yes Indicates the PR's author has DCO signed all their commits. size/XL kind/api-change Categorizes issue or PR as related to adding, removing, or otherwise changing an API labels Jun 22, 2021
@zcahana (Contributor) left a comment

Hi @davidvossel, this looks like a very useful addition!
I've included several minor comments inline.

And another general comment:
Currently, the crash loop detection only considers VMIs that never reached the Running phase. However, "Running" VMIs too might end up in some crash loop scenario, in which the guest is launched successfully but quickly fails (boot error, misconfigured liveness probe, ...). Would it make sense to extend it to detect such scenarios too?

@@ -348,6 +348,7 @@ func main() {
qemuAgentUserInterval := pflag.Duration("qemu-agent-user-interval", 10, "Interval in seconds between consecutive qemu agent calls for user command")
qemuAgentVersionInterval := pflag.Duration("qemu-agent-version-interval", 300, "Interval in seconds between consecutive qemu agent calls for version command")
qemuAgentFSFreezeStatusInterval := pflag.Duration("qemu-fsfreeze-status-interval", 5, "Interval in seconds between consecutive qemu agent calls for fsfreeze status command")
simulateCrash := pflag.Bool("simulate-crash", false, "Causes virt-launcher to immediately crash. This is used by functional tests to simulate crash loop scenarios.")
Contributor:

I'm wondering whether there's some existing VM/VMI configuration we could use that will cause virt-launcher to crash (i.e., without needing an explicit virt-launcher flag/VMI annotation). Maybe very low resource requests that won't allow the qemu process to proceed?

On the other hand, I'd expect that if there's such a known configuration, then we should reject it during validation and/or fix virt-launcher.

Member Author:

Yeah, I'm not aware of a reliable method that will immediately fail.
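
For context, a minimal sketch of how the test hook might be honored inside virt-launcher's main(), assuming the flag is simply checked right after flag parsing (the merged wiring may differ):

simulateCrash := pflag.Bool("simulate-crash", false,
	"Causes virt-launcher to immediately crash. This is used by functional tests to simulate crash loop scenarios.")
pflag.Parse()

// Hypothetical handling: exit before any qemu setup so the VMI never
// reaches the Running phase and the VM controller records a start failure.
if *simulateCrash {
	panic("simulated virt-launcher crash")
}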

// add randomized seconds to offset multiple failing VMs from one another
delaySeconds += rand.Intn(randomRange)

if delaySeconds > maxDelay {
Contributor:

Should we add the randomized offset only after capping at maxDelay? Otherwise we might end up with multiple failing VMs all stuck at maxDelay. Alternatively, if we want to be strict over maxDelay, we can subtract a randomized offset.

Member Author:

Otherwise we might end up with multiple failing VMs all stuck at maxDelay

We're adding jitter to the backoff so that multiple VMs stuck in crash loops don't get retried at the same time. By the time the max delay is hit, all VMs stuck in a crash loop will already be offset from one another due to the randomization leading up to the max delay.
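
A minimal sketch of the backoff calculation being discussed, with the jitter applied before the cap as described above (the base delay and jitter range are assumed values, not taken from the merged code):

// calculateStartBackoffTime sketch: doubles the delay per consecutive
// failure, adds jitter so crash-looping VMs don't retry in lockstep,
// then caps the result at maxDelay seconds. (imports: "math/rand")
func calculateStartBackoffTime(failCount int, maxDelay int) int {
	const baseDelaySeconds = 15 // assumed base delay
	const randomRange = 10      // assumed jitter range in seconds

	delaySeconds := baseDelaySeconds
	for i := 1; i < failCount; i++ {
		delaySeconds *= 2
	}

	// add randomized seconds to offset multiple failing VMs from one another
	delaySeconds += rand.Intn(randomRange)

	if delaySeconds > maxDelay {
		delaySeconds = maxDelay
	}
	return delaySeconds
}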

Comment on lines 828 to 832
for _, ts := range vmi.Status.PhaseTransitionTimestamps {
if ts.Phase == virtv1.Running {
return true
}
}
Contributor:

nit: can extract this for loop into a wasVMIRunning(vmi) bool or wasVMIInRunningPhase(vmi) bool and use it both here and in vmiFailedEarly().

Member Author:

+1 fixed
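
For reference, the extracted helper would look roughly like this (a sketch built from the loop quoted above):

// wasVMIInRunningPhase reports whether the VMI ever transitioned through
// the Running phase, based on its recorded phase transition timestamps.
func wasVMIInRunningPhase(vmi *virtv1.VirtualMachineInstance) bool {
	for _, ts := range vmi.Status.PhaseTransitionTimestamps {
		if ts.Phase == virtv1.Running {
			return true
		}
	}
	return false
}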

return false
}

func hasStartFailureBackoffExpired(vm *virtv1.VirtualMachine) int64 {
Contributor:

nit: I'm a bit confused by this function name. Should it be called getStartFailureBackoffTimeLeft?

Member Author:

fixed
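
A sketch of the renamed helper, assuming it simply compares RetryAfterTimestamp against the current time (the merged implementation may differ):

// startFailureBackoffTimeLeft returns how many seconds remain before a
// crash-looping VM may be restarted, or 0 if no backoff is in effect.
// (imports: "time")
func startFailureBackoffTimeLeft(vm *virtv1.VirtualMachine) int64 {
	if vm.Status.StartFailure == nil || vm.Status.StartFailure.RetryAfterTimestamp == nil {
		return 0
	}
	now := time.Now().Unix()
	retryAfter := vm.Status.StartFailure.RetryAfterTimestamp.Time.Unix()
	if retryAfter > now {
		return retryAfter - now
	}
	return 0
}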

Comment on lines 892 to 893
vm.Status.StartFailure.LastFailedVMIUID = vmi.UID
vm.Status.StartFailure.RetryAfterTimestamp = &retryAfter
Contributor:

These two lines are redundant.

Member Author:

yup, good catch, fixed
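
For context, a sketch of how a start failure might be recorded on the VM status (a hypothetical helper; ConsecutiveFailCount is an assumed field name, and the merged code may structure this differently):

// vmStartFailureRecord sketch: bumps the consecutive failure count for a
// newly failed VMI and computes when the next start attempt is allowed.
func vmStartFailureRecord(vm *virtv1.VirtualMachine, vmi *virtv1.VirtualMachineInstance, maxDelaySeconds int) {
	count := 1
	if vm.Status.StartFailure != nil {
		// don't count the same failed VMI twice
		if vm.Status.StartFailure.LastFailedVMIUID == vmi.UID {
			return
		}
		count = vm.Status.StartFailure.ConsecutiveFailCount + 1
	}

	delaySeconds := calculateStartBackoffTime(count, maxDelaySeconds)
	retryAfter := metav1.NewTime(time.Now().Add(time.Duration(delaySeconds) * time.Second))

	vm.Status.StartFailure = &virtv1.VirtualMachineStartFailure{
		ConsecutiveFailCount: count,
		LastFailedVMIUID:     vmi.UID,
		RetryAfterTimestamp:  &retryAfter,
	}
}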

@@ -58,6 +61,8 @@ const (
failureDeletingVmiErrFormat = "Failure attempting to delete VMI: %v"
)

const defaultMaxCrashLoopBackoffDelay = 300
Contributor:

Would it be better to have this const defined as 300 * time.Second?

Member Author:

I like working with ints here; I think the issue is that I didn't specify that this value is in seconds.

I renamed this to defaultMaxCrashLoopBackoffDelaySeconds

@@ -748,6 +775,125 @@ func (c *VMController) startVMI(vm *virtv1.VirtualMachine) error {
return nil
}

// Returns in seconds how long to wait before trying to start the VM again.
func calculateStartBackoffTime(failCount int, maxDelay int) int {
Contributor:

Here too, would it be better to return a time.Duration?

Member Author:

Since all the calculations deal with seconds represented as ints, I'd rather keep it that way.

@@ -461,6 +478,11 @@ func (c *VMIController) updateStatus(vmi *virtv1.VirtualMachineInstance, pod *k8
vmiCopy.Status.LauncherContainerImageVersion = ""
}

if !c.hasOwnerVM(vmi) && len(vmiCopy.Finalizers) > 0 {
Contributor:

Good catch here!
There's a corner case in which an ownerless VMI is created and crashes, after which a VM is created and adopts the VMI. In this case, the adoption logic should add the VirtualMachineControllerFinalizer too.

@@ -1179,6 +1184,8 @@ const (
// VirtualMachineStatusTerminating indicates that the virtual machine is in the process of deletion,
// as well as its associated resources (VirtualMachineInstance, DataVolumes, …).
VirtualMachineStatusTerminating VirtualMachinePrintableStatus = "Terminating"
// VirtualMachineStatusCrashLoop indicates that the virtual machine is currently in a crash loop waiting to be retried
VirtualMachineStatusCrashLoop VirtualMachinePrintableStatus = "CrashLoop"
Contributor:

For consistency with pods, should this be "CrashLoopBackOff"?

Member Author:

yup, fixed

@kubevirt-bot kubevirt-bot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jun 24, 2021
@davidvossel (Member Author)

Currently, the crash loop detection only considers VMIs that never reached the Running phase. However, "Running" VMIs too might end up in some crash loop scenario, in which the guest is launched successfully but quickly fails (boot error, misconfigured liveness probe, ...). Would it make sense to extend it to detect such scenarios too?

It's possible we can extend this backoff behavior to failing liveness probes. I feel better about the liveness probe situation as it is now because it at least involves some timeout period before the VMI gets rescheduled. That wasn't the case for VMIs that hit some critical error preventing the qemu process from ever launching successfully (i.e., never reaching Phase: Running).

@zcahana (Contributor) commented Jun 24, 2021

I feel better about the liveness probe situation as it is now because it at least involves some timeout period before the VMI will get rescheduled

I see your point. I think the main benefit of the CrashLoop detection for Running VMs is that it now gets reported to the user, who would otherwise only see a Stopped --> Starting --> Running cycle.

@rmohr (Member) commented Jun 29, 2021

Can we also re-evaluate the need for this timeout when the pod hits the running phase? I am not sure we should still keep it. Maybe we should just keep running and wait.


@kubevirt-bot kubevirt-bot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jul 8, 2021
@davidvossel (Member Author)

@zcahana Thanks for the review. I think I've addressed all your feedback

@rmohr I added jitter to the virt-launcher qemu timeout. The timeout is now a randomized range between 4 and 6 minutes.
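
A sketch of what a jittered qemu start timeout could look like (a hypothetical helper; the exact range and wiring inside virt-launcher may differ):

// randomizedQemuStartTimeout returns a timeout between 4 and 6 minutes so
// that crash-looping virt-launcher pods don't all give up at the same time.
// (imports: "math/rand", "time")
func randomizedQemuStartTimeout() time.Duration {
	jitter := time.Duration(rand.Intn(121)) * time.Second // 0-120 seconds
	return 4*time.Minute + jitter
}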


@kubevirt-bot kubevirt-bot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jul 10, 2021
@zcahana (Contributor) commented Jul 11, 2021

@zcahana Thanks for the review. I think I've addressed all your feedback

Thanks David.
/lgtm

@kubevirt-bot kubevirt-bot added the lgtm Indicates that a PR is ready to be merged. label Jul 11, 2021
@kubevirt-bot kubevirt-bot removed lgtm Indicates that a PR is ready to be merged. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. labels Jul 12, 2021
@@ -1282,6 +1283,10 @@ var _ = Describe("VirtualMachineInstance watcher", func() {

if vmExists {
vmSource.Add(vm)
// the controller isn't using informer callbacks for the VM informer
Contributor:

This makes me wonder whether this can happen in a real cluster (i.e., the finalizer is removed because the VMI controller incorrectly determines that there's no owner VM). Would it be safer if hasOwnerVM() read the VM from the API server instead of from the local cache?

Member Author:

the VM controller and VMI controller share the same VM informer. There has to be a VM in the informer cache for the VM controller to create the VMI.

I do need to add a vmInformer.HasSynced() check at VMI controller startup though, to prevent the VMI controller from starting until the VM informer is up to date.

Contributor:

I see your point, so basically this can only happen in the test, since the VM and VMI are fed to the informers "in parallel".
Anyway, looking at the code, you've already added a vmInformer.HasSynced() to the VMI controller's Run() method, so it's good to go.

Member Author:

I do need to add a vmInformer.HasSynced() check at VMI controller startup though, to prevent the VMI controller from starting until the VM informer is up to date.

Ah, I already did that, never mind.
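
For reference, the sync guard follows the standard client-go pattern; a minimal sketch of the VMI controller's Run() waiting on the shared VM informer (method and field names are approximations):

// Run sketch: block until the VMI and VM informer caches are synced before
// starting workers, so hasOwnerVM() never consults a stale or empty cache.
// (imports: "k8s.io/client-go/tools/cache", "k8s.io/apimachinery/pkg/util/wait", "time")
func (c *VMIController) Run(threadiness int, stopCh <-chan struct{}) {
	defer c.Queue.ShutDown()

	if !cache.WaitForCacheSync(stopCh, c.vmiInformer.HasSynced, c.vmInformer.HasSynced) {
		return
	}

	for i := 0; i < threadiness; i++ {
		go wait.Until(c.runWorker, time.Second, stopCh)
	}
	<-stopCh
}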

@zcahana (Contributor) commented Jul 13, 2021

/lgtm

@kubevirt-bot kubevirt-bot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jul 22, 2021
@davidvossel (Member Author)

/retest

@kubevirt kubevirt deleted a comment from kubevirt-bot Jul 23, 2021
@zcahana (Contributor) commented Jul 23, 2021

/lgtm

Will need an approver too.

@kubevirt-bot kubevirt-bot added the lgtm Indicates that a PR is ready to be merged. label Jul 23, 2021
@davidvossel (Member Author)

Will need an approver too.

/cc @rmohr

@kubevirt-bot kubevirt-bot requested a review from rmohr July 26, 2021 12:42
@rmohr (Member) left a comment

@davidvossel can you explain a little bit in the PR description why a finalizer is now needed?

@@ -578,6 +588,13 @@ func (c *VMController) startStop(vm *virtv1.VirtualMachine, vmi *virtv1.VirtualM
return nil
}

timeLeft := startFailureBackoffTimeLeft(vm)
if timeLeft > 0 {
log.Log.Object(vm).Infof("Delaying start of VM %s with 'runStrategy: %s' due to start failure backoff. Waiting %d more seconds before starting.", startingVmMsg, runStrategy, timeLeft)
Member:

This sounds like something which I would want to see as a warning event.
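
For example, the backoff delay could also be surfaced as a warning event alongside the log line (a sketch using the standard EventRecorder API; the "CrashLoopBackOff" reason string and the recorder field are assumptions, not taken from the merged code):

// Hypothetical warning event for the backoff delay.
// (imports: k8sv1 "k8s.io/api/core/v1")
c.recorder.Eventf(vm, k8sv1.EventTypeWarning, "CrashLoopBackOff",
	"Delaying start of VM due to start failure backoff; %d seconds remaining", timeLeft)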

@@ -724,6 +748,9 @@ func (c *VMController) startVMI(vm *virtv1.VirtualMachine) error {

// start it
vmi := c.setupVMIFromVM(vm)
// add a finalizer to ensure the VM controller has a chance to see
// the VMI before it is deleted
vmi.Finalizers = append(vmi.Finalizers, virtv1.VirtualMachineControllerFinalizer)
Member:

I think you mentioned that you want this to avoid issues if the VMI disappears. For me that means that this can still happen (since nothing blocks one from removing finalizers).

It is ok for me to have the finalizer if we have to clean up things. Is that the case? Otherwise I would prefer not to add a finalizer.

@rmohr (Member) commented Jul 27, 2021

/approve

@kubevirt-bot (Contributor)

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: rmohr

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@kubevirt-bot kubevirt-bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jul 27, 2021
@kubevirt-bot kubevirt-bot merged commit c5e0d39 into kubevirt:main Jul 27, 2021
Comment on lines +845 to +888
table.DescribeTable("has start failure backoff expired", func(vm *v1.VirtualMachine, expected int64) {
seconds := startFailureBackoffTimeLeft(vm)

if expected > 0 {
// since the tests all run in parallel, it's difficult to
// do precise timing. We set the `retryAfter` time but the test
// execution may happen seconds later. We use big numbers and
// account for some jitter to make sure the calculation falls within
// the ballpark of what we expect.
parallelTestJitter := expected / 10
if (expected - seconds) > parallelTestJitter {
Expect(seconds).To(Equal(expected))
}
}

},

table.Entry("no vm start failures",
&v1.VirtualMachine{},
int64(0)),
table.Entry("vm failure waiting 300 seconds",
&v1.VirtualMachine{
Status: v1.VirtualMachineStatus{
StartFailure: &v1.VirtualMachineStartFailure{
RetryAfterTimestamp: &metav1.Time{
Time: time.Now().Add(300 * time.Second),
},
},
},
},
int64(300)),
table.Entry("vm failure 300 seconds past retry time",
&v1.VirtualMachine{
Status: v1.VirtualMachineStatus{
StartFailure: &v1.VirtualMachineStartFailure{
RetryAfterTimestamp: &metav1.Time{
Time: time.Now().Add(-300 * time.Second),
},
},
},
},
int64(0)),
)
})
Contributor:

I realize this has been merged for quite some time, but I'm resurrecting this following a unit test failure I saw on the CI (this).
Looking closely at this test, I reckon that:

  • In table entries 1 and 3, expected is always 0, so nothing is actually being asserted (if expected > 0 { ... }).
  • In entry 2, if test execution happens >30 seconds after the evaluation of the entry, it's bound to fail, and otherwise silently pass without any assertion being made.

Actually, I'm not entirely sure what the purpose of this specific test is. Also, given that the logic of calculateStartBackoffTime() is quite simple and is anyway covered by the larger test cases in the Context("crashloop backoff tests"), do you think we could just remove it?

@davidvossel WDYT?

Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. dco-signoff: yes Indicates the PR's author has DCO signed all their commits. kind/api-change Categorizes issue or PR as related to adding, removing, or otherwise changing an API lgtm Indicates that a PR is ready to be merged. release-note Denotes a PR that will be considered when it comes time to generate release notes. size/XXL