
Leave virt-launcher container up for failed VMI for improved debuggability #6040

Merged

Conversation

yuhaohaoyu
Contributor

@yuhaohaoyu yuhaohaoyu commented Jul 14, 2021

Signed-off-by: Hao Yu <yuh@us.ibm.com>

What this PR does / why we need it:

The goal of this PR is to keep the VMI environment up for inspection or debugging when a VMI fails.

Currently, the virt-launcher container exits when the corresponding VMI reaches the Failed phase. The specific goal of this PR is to give the user a means to have a VMI defined in a VM retain its environment on failure. To have a failed VMI leave its virt-launcher container alive, the user needs to:

  1. Specify the VM's RunStrategy as Manual
  2. Define the annotation kubevirt.io/leave-launcher-pod-after-qemu-exit in the VM's VMI template and assign it the value "true".

The user can then create the VM and start the VMI manually; when the VMI fails, its environment (the corresponding virt-launcher container) will stay alive.
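
For illustration, here is a minimal sketch of such a VM built with the KubeVirt v1 Go types, the way the e2e tests construct VMs; the package name, import path, and object name are assumptions for the example, not code from this PR:

package example

import (
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    v1 "kubevirt.io/client-go/api/v1"
)

// debugVM returns a VM matching the two requirements above: RunStrategy
// Manual and the annotation set to "true" on the VMI template.
func debugVM() *v1.VirtualMachine {
    runStrategy := v1.RunStrategyManual
    return &v1.VirtualMachine{
        ObjectMeta: metav1.ObjectMeta{Name: "debug-vm"},
        Spec: v1.VirtualMachineSpec{
            RunStrategy: &runStrategy,
            Template: &v1.VirtualMachineInstanceTemplateSpec{
                ObjectMeta: metav1.ObjectMeta{
                    Annotations: map[string]string{
                        "kubevirt.io/leave-launcher-pod-after-qemu-exit": "true",
                    },
                },
            },
        },
    }
}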

In addition, this PR does not change the behavior of a VMI whose VM's RunStrategy is not defined as Manual.

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #

Special notes for your reviewer:

Release note:

Improved debuggability by keeping the environment of a failed VMI alive.

@kubevirt-bot kubevirt-bot added release-note Denotes a PR that will be considered when it comes time to generate release notes. dco-signoff: yes Indicates the PR's author has DCO signed all their commits. size/L labels Jul 14, 2021
@kubevirt-bot kubevirt-bot added the kind/api-change Categorizes issue or PR as related to adding, removing, or otherwise changing an API label Jul 14, 2021
@yuhaohaoyu
Contributor Author

/assign @ashleyschuett

Member

@davidvossel davidvossel left a comment

For the <-keepAfterQemuChan, it seems like we could get away with placing that at only one point in the code, right after ForkAndMonitor() is called.

That will keep the pod's PID 1 alive indefinitely.
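
As a toy illustration of the mechanism (not code from this PR): blocking PID 1 on a channel keeps the container alive until the pod is deleted. The flag handling here is a stand-in for virt-launcher's real option parsing:

package main

import (
    "fmt"
    "os"
    "os/signal"
    "syscall"
)

func main() {
    // Stand-in for the real command-line option; the name is only illustrative.
    keepAfterFailure := len(os.Args) > 1 && os.Args[1] == "--keep-after-failure"

    // ... roughly the point where the monitored qemu process has exited ...
    fmt.Println("child exited")

    if keepAfterFailure {
        // Block PID 1 until the pod is torn down (SIGTERM), keeping the
        // container, and therefore the failed VMI's environment, around.
        term := make(chan os.Signal, 1)
        signal.Notify(term, syscall.SIGTERM, syscall.SIGINT)
        <-term
    }
}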

Comment on lines 495 to 506
<-keepAfterQemuChan

// Now that the pid has exited, we wait for the final delete notification to be
// sent back to virt-handler. This delete notification contains the reason the
// domain exited.
waitForFinalNotify(events, domainManager, vmi)
Member

Why is it important to have the <-keepAfterQemuChan wait before the final notify is sent?

Contributor Author

With your insight on placing the blocking right after the invocation of ForkAndMonitor, my reason for placing the blocking before waitForFinalNotify was weak.

I was trying to have the 'blocking site' (conceptually resembling something like a debugger's breakpoint) close to the point of failure.

Comment on lines 576 to 578

<-keepAfterQemuChan

Member

ForkAndMonitor can exit before this is ever reached, which will result in the pod exiting even when keepAfterQemuChan is set.

Maybe the <-keepAfterQemuChan line should be moved right after ForkAndMonitor returns.

Contributor Author

Your suggestion is quite clean; it certainly covers the case when ForkAndMonitor is employed.

So is forking the only way to call cmd/virt-launcher from outside the program?

Contributor

@ashleyschuett ashleyschuett left a comment

Then user can create the VM, start the VMI manually, then the VMI environment (the corresponding virt-launcher container) will stay alive. Given that the VMI is in the Failed phase, it will not respond to virtctl stop but will respond to virtctl restart.

This is no longer the case: virtctl stop <vmiName> now works on Failed VMIs. https://github.com/kubevirt/kubevirt/pull/5953/files#diff-028d148aa715883f3fd17ae07f8a03225251bdd5eb1893566dfb68085298a434R612

@@ -348,6 +348,8 @@ func main() {
qemuAgentUserInterval := pflag.Duration("qemu-agent-user-interval", 10, "Interval in seconds between consecutive qemu agent calls for user command")
qemuAgentVersionInterval := pflag.Duration("qemu-agent-version-interval", 300, "Interval in seconds between consecutive qemu agent calls for version command")
qemuAgentFSFreezeStatusInterval := pflag.Duration("qemu-fsfreeze-status-interval", 5, "Interval in seconds between consecutive qemu agent calls for fsfreeze status command")
keepAfterQemu := pflag.Bool("keep-after-qemu", false, "virt-launcher will be kept alive after qemu exits for debugging if set to true")
Contributor

Does the user know or care that the pod exists when qemu dies? Does it make sense to say virt-launcher will be kept alive after failure for debugging if set to true? Or is that too vague?

Contributor Author

Nice. Will take your suggestion.

Contributor Author

I have adjusted the name of the command-line option, the annotation key, and the corresponding internal variable to be in line with 'keep after failure'.
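
For reference, a sketch of what the renamed option could look like, guessing names from the comments here rather than quoting the PR's exact code:

package main

import (
    "fmt"

    "github.com/spf13/pflag"
)

func main() {
    // Renamed from keep-after-qemu; the description follows the reviewer's wording.
    keepAfterFailure := pflag.Bool("keep-after-failure", false,
        "virt-launcher will be kept alive after failure for debugging purposes if set to true")
    pflag.Parse()
    fmt.Println("keep-after-failure =", *keepAfterFailure)
}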

if !*noFork {
exitCode, err := ForkAndMonitor(*containerDiskDir)
exitCode, err := ForkAndMonitor(*containerDiskDir, keepAfterQemuChan)
Contributor

Instead of passing in the chan, should we hang after we log log.Log.Reason(err).Error("monitoring virt-launcher failed") only in the case of an error? If we are shutting down successfully, do we still want to wait?

Contributor Author

Per David's idea, this change is not needed anymore.

@yuhaohaoyu yuhaohaoyu force-pushed the leave-virt-launcher-after-qemu branch from a1dddc1 to 514c21d Compare July 16, 2021 17:02
tests/vm_test.go Outdated

By("Check Virt Launcher status")
// Virt launcher should be in the Running phase if the Annotation v1.KeepLauncherAfterFailureAnnotation is set to true
time.Sleep(10 * time.Second)
Contributor

Please try to avoid arbitrary sleeps.
Maybe here we can use Consistently() if we expect virt-launcher to stay in the running phase, and Eventually() if we expect it to fail.
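
A sketch of the Consistently() variant with Gomega; getVirtLauncherPodPhase is a hypothetical helper standing in for the pod lookup the test already performs:

package tests_sketch

import (
    "time"

    . "github.com/onsi/gomega"
    k8sv1 "k8s.io/api/core/v1"
)

// getVirtLauncherPodPhase would fetch the virt-launcher pod's phase for the
// VMI under test (hypothetical helper).
var getVirtLauncherPodPhase func() k8sv1.PodPhase

func expectLauncherToStayRunning() {
    // Replaces time.Sleep(10 * time.Second): assert the phase holds for the
    // whole window instead of sampling it once after a fixed delay.
    Consistently(getVirtLauncherPodPhase, 30*time.Second, 1*time.Second).
        Should(Equal(k8sv1.PodRunning))
}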

Contributor Author

Thanks for the catch, this one is removed.

tests/vm_test.go Outdated
}, 160*time.Second, 1*time.Second).Should(BeTrue())

By("Check Virt Launcher log message")
time.Sleep(20 * time.Second)
Contributor

What is this waiting for? Can we actively wait for it?
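
If the log check were kept, an Eventually() poll could replace the sleep; getVirtLauncherLog and the expected substring below are placeholders, not the PR's actual helper or message:

package tests_sketch

import (
    "strings"
    "time"

    . "github.com/onsi/gomega"
)

// getVirtLauncherLog would return the current virt-launcher container log for
// the VMI under test (hypothetical helper).
var getVirtLauncherLog func() string

func waitForKeepAliveLogMessage() {
    // Replaces time.Sleep(20 * time.Second): poll until the message shows up.
    Eventually(func() bool {
        return strings.Contains(getVirtLauncherLog(), "keep-after-failure")
    }, 60*time.Second, 1*time.Second).Should(BeTrue())
}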

Contributor Author

This block for checking the virt-launcher log info message is not actually required. I have removed the block.

@yuhaohaoyu yuhaohaoyu force-pushed the leave-virt-launcher-after-qemu branch from 514c21d to 7cadb8a Compare July 16, 2021 19:09
@yuhaohaoyu
Contributor Author

/retest

@iholder101
Contributor

Thanks @yuhaohaoyu! Looks valuable!

My only feedback here is that we probably want to warn the user (and/or add this condition to the flag's description) if the VM's RunStrategy is not Manual and the --keep-after-failure flag is true.

And while we're at it, we can also log the fact that we intentionally hang and don't kill virt-launcher (e.g., hanging virt-launcher container since --keep-after-failure is set to true).

@yuhaohaoyu yuhaohaoyu force-pushed the leave-virt-launcher-after-qemu branch from 7cadb8a to 842533b Compare July 19, 2021 17:58
@yuhaohaoyu
Contributor Author

Thanks @yuhaohaoyu! Looks valuable!

My only feedback here is that we probably want to warn the user (and/or add this condition to the flag's description) if the VM's RunStrategy is not Manual and the --keep-after-failure flag is true.

And while we're at it, we can also log the fact that we intentionally hang and don't kill virt-launcher (e.g., hanging virt-launcher container since --keep-after-failure is set to true).

When the RunStrategy is not Manual, the effect of --keep-after-failure will not be visible to the VM owner. Specifically, if RunStrategy is either Always or RerunOnFailure, the virt-controller will kill the pod as long as the VM is in the Failed phase, regardless of the --keep-after-failure option.

I have just added a comment where the block-wait happens. Thanks.

@yuhaohaoyu
Contributor Author

/retest

@iholder101
Contributor

Thanks @yuhaohaoyu! Looks valuable!
My only feedback here is that we probably want to warn the user (and/or add this condition to the flag's description) if the VM's RunStrategy is not Manual and the --keep-after-failure flag is true.
And while we're at it, we can also log the fact that we intentionally hang and don't kill virt-launcher (e.g., hanging virt-launcher container since --keep-after-failure is set to true).

When the RunStrategy is not Manual, the effect of --keep-after-failure will not be visible to the VM owner. Specifically, if RunStrategy is either Always or RerunOnFailure, the virt-controller will kill the pod as long as the VM is in the Failed phase, regardless of the --keep-after-failure option.

I have just added a comment where the block-wait happens. Thanks.

Sure, I'm just thinking of a user who expects the pod to hang and, by mistake, sets leave-launcher-pod-after-qemu-exit to true in the VM's annotations and RunStrategy to Always. Since this is a configuration that doesn't make sense, we can warn the user, so that if they try to figure out what's wrong they'll see something like ignoring "keep-after-failure" since RunStrategy is not Manual.
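
A sketch of the kind of warning being suggested; the placement, helper name, and message are assumptions, not code from the PR:

package sketch

import (
    v1 "kubevirt.io/client-go/api/v1"
    "kubevirt.io/client-go/log"
)

// warnIfKeepAfterFailureIgnored logs a warning when the annotation is set but
// the RunStrategy makes it a no-op, so the user can see why nothing is kept.
func warnIfKeepAfterFailureIgnored(vm *v1.VirtualMachine, keepAfterFailure bool) {
    if keepAfterFailure && vm.Spec.RunStrategy != nil && *vm.Spec.RunStrategy != v1.RunStrategyManual {
        log.Log.Object(vm).Warning("ignoring keep-after-failure since RunStrategy is not Manual")
    }
}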

@@ -2001,3 +2005,16 @@ func filterVMIAnnotationsForPod(vmiAnnotations map[string]string) map[string]str
}
return annotationsList
}

func checkForKeepLauncherAfterFailure(vmi *v1.VirtualMachineInstance) bool {
Contributor

nit: would suggest toKeepLauncherAfterFailure
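
For context, a guess at the helper's shape; the body is an assumption based on the annotation referenced in the tests, not the PR's actual implementation:

package sketch

import (
    "strings"

    v1 "kubevirt.io/client-go/api/v1"
)

// toKeepLauncherAfterFailure (name per the nit above) reads the VMI-template
// annotation and treats any value other than "true" as false.
func toKeepLauncherAfterFailure(vmi *v1.VirtualMachineInstance) bool {
    val, ok := vmi.Annotations[v1.KeepLauncherAfterFailureAnnotation]
    return ok && strings.EqualFold(val, "true")
}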

@iholder101
Contributor

My feedback is just small nits :) up to you.
/lgtm

@kubevirt-bot kubevirt-bot added lgtm Indicates that a PR is ready to be merged. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. labels Jul 20, 2021
Leave virt-launcher container up for failed VMI for improved debuggability

Signed-off-by: Hao Yu <yuh@us.ibm.com>
@yuhaohaoyu yuhaohaoyu force-pushed the leave-virt-launcher-after-qemu branch from 842533b to 82e2cf5 Compare July 20, 2021 12:52
@kubevirt-bot kubevirt-bot removed lgtm Indicates that a PR is ready to be merged. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. labels Jul 20, 2021
@iholder101
Contributor

iholder101 commented Jul 20, 2021

Thanks @yuhaohaoyu!
/lgtm

@kubevirt-bot kubevirt-bot added the lgtm Indicates that a PR is ready to be merged. label Jul 20, 2021
@yuhaohaoyu
Contributor Author

/retest

4 similar comments
@yuhaohaoyu
Contributor Author

/retest

@yuhaohaoyu
Contributor Author

/retest

@yuhaohaoyu
Contributor Author

/retest

@yuhaohaoyu
Contributor Author

/retest

@jean-edouard
Contributor

/approve
/retest

@kubevirt-bot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jean-edouard

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@kubevirt-bot kubevirt-bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jul 21, 2021
@yuhaohaoyu
Contributor Author

/retest
