
Leave virt-launcher container up for failed VMI for improved debuggability #6040

Merged

Conversation

yuhaohaoyu
Contributor

@yuhaohaoyu yuhaohaoyu commented Jul 14, 2021

Signed-off-by: Hao Yu <yuh@us.ibm.com>

What this PR does / why we need it:

The goal of this PR is to keep the VMI environment up for inspection or debugging when a VMI fails.

Currently, the virt-launcher container exits when the corresponding VMI reaches the Failed phase. The specific goal of this PR is to give the user a means to have a VMI defined in a VM retain its environment on failure. To have a failed VMI leave its virt-launcher container alive, the user needs to:

  1. Specify the VM's RunStrategy as Manual
  2. Define the annotation kubevirt.io/leave-launcher-pod-after-qemu-exit in the VM's VMI template and assign it the value "true".

The user can then create the VM and start the VMI manually; when the VMI fails, its environment (the corresponding virt-launcher container) will stay alive.
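
For illustration, here is a minimal sketch of such a VM built with the KubeVirt v1 Go types, the way the e2e tests construct VMs; the package name, import path, and object name are assumptions for the example, not code from this PR:

package example

import (
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    v1 "kubevirt.io/client-go/api/v1"
)

// debugVM returns a VM matching the two requirements above: RunStrategy
// Manual and the annotation set to "true" on the VMI template.
func debugVM() *v1.VirtualMachine {
    runStrategy := v1.RunStrategyManual
    return &v1.VirtualMachine{
        ObjectMeta: metav1.ObjectMeta{Name: "debug-vm"},
        Spec: v1.VirtualMachineSpec{
            RunStrategy: &runStrategy,
            Template: &v1.VirtualMachineInstanceTemplateSpec{
                ObjectMeta: metav1.ObjectMeta{
                    Annotations: map[string]string{
                        "kubevirt.io/leave-launcher-pod-after-qemu-exit": "true",
                    },
                },
            },
        },
    }
}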

In addition, this PR does not change the behavior of a VMI whose VM's RunStrategy is not defined as Manual.

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #

Special notes for your reviewer:

Release note:

Improved debuggability by keeping the environment of a failed VMI alive.

@kubevirt-bot kubevirt-bot added release-note Denotes a PR that will be considered when it comes time to generate release notes. dco-signoff: yes Indicates the PR's author has DCO signed all their commits. size/L labels Jul 14, 2021
@kubevirt-bot kubevirt-bot added the kind/api-change Categorizes issue or PR as related to adding, removing, or otherwise changing an API label Jul 14, 2021
@yuhaohaoyu
Contributor Author

/assign @ashleyschuett

Member

@davidvossel davidvossel left a comment

For the <-keepAfterQemuChan, it seems like we could get away with placing that at only one point in the code, right after ForkAndMonitor() is called.

That will keep the pod's PID 1 alive indefinitely.
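
As a toy illustration of the mechanism (not code from this PR): blocking PID 1 on a channel keeps the container alive until the pod is deleted. The flag handling here is a stand-in for virt-launcher's real option parsing:

package main

import (
    "fmt"
    "os"
    "os/signal"
    "syscall"
)

func main() {
    // Stand-in for the real command-line option; the name is only illustrative.
    keepAfterFailure := len(os.Args) > 1 && os.Args[1] == "--keep-after-failure"

    // ... roughly the point where the monitored qemu process has exited ...
    fmt.Println("child exited")

    if keepAfterFailure {
        // Block PID 1 until the pod is torn down (SIGTERM), keeping the
        // container, and therefore the failed VMI's environment, around.
        term := make(chan os.Signal, 1)
        signal.Notify(term, syscall.SIGTERM, syscall.SIGINT)
        <-term
    }
}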

Comment on lines 495 to 506
<-keepAfterQemuChan

// Now that the pid has exited, we wait for the final delete notification to be
// sent back to virt-handler. This delete notification contains the reason the
// domain exited.
waitForFinalNotify(events, domainManager, vmi)
Member

Why is it important to have the <-keepAfterQemuChan wait before the final notify is sent?

Contributor Author

With your insight on placing the blocking right after the invocation of ForkAndMonitor, my reason for placing the blocking before waitForFinalNotify was weak.

I was trying to have the 'blocking site' (conceptually resembling something like a debugger's breakpoint) close to the point of failure.

Comment on lines 576 to 578

<-keepAfterQemuChan

Member

ForkAndMonitor can exit before this is ever reached, which will result in the pod exiting even when keepAfterQemuChan is set.

Maybe the <-keepAfterQemuChan line should be moved right after ForkAndMonitor returns.

Contributor Author

Your suggestion is quite clean; it certainly covers the case when ForkAndMonitor is employed.

So is forking the only way to call cmd/virt-launcher from outside the program?

Contributor

@ashleyschuett ashleyschuett left a comment

Then user can create the VM, start the VMI manually, then the VMI environment (the corresponding virt-launcher container) will stay alive. Given that the VMI is in the Failed phase, it will not respond to virtctl stop but will respond to virtctl restart.

This is no longer the case: virtctl stop <vmiName> now works on Failed VMIs. https://github.com/kubevirt/kubevirt/pull/5953/files#diff-028d148aa715883f3fd17ae07f8a03225251bdd5eb1893566dfb68085298a434R612

@@ -348,6 +348,8 @@ func main() {
qemuAgentUserInterval := pflag.Duration("qemu-agent-user-interval", 10, "Interval in seconds between consecutive qemu agent calls for user command")
qemuAgentVersionInterval := pflag.Duration("qemu-agent-version-interval", 300, "Interval in seconds between consecutive qemu agent calls for version command")
qemuAgentFSFreezeStatusInterval := pflag.Duration("qemu-fsfreeze-status-interval", 5, "Interval in seconds between consecutive qemu agent calls for fsfreeze status command")
keepAfterQemu := pflag.Bool("keep-after-qemu", false, "virt-launcher will be kept alive after qemu exits for debugging if set to true")
Contributor

Does the user know or care that the pod exists when qemu dies? Does it make sense to say virt-launcher will be kept alive after failure for debugging if set to true? Or is that too vague?

Contributor Author

Nice. Will take your suggestion.

Contributor Author

I have adjusted the name of the command-line option, the annotation key, and the corresponding internal variable to be in line with 'keep after failure'.
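
For reference, a sketch of what the renamed option could look like, guessing names from the comments here rather than quoting the PR's exact code:

package main

import (
    "fmt"

    "github.com/spf13/pflag"
)

func main() {
    // Renamed from keep-after-qemu; the description follows the reviewer's wording.
    keepAfterFailure := pflag.Bool("keep-after-failure", false,
        "virt-launcher will be kept alive after failure for debugging purposes if set to true")
    pflag.Parse()
    fmt.Println("keep-after-failure =", *keepAfterFailure)
}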

if !*noFork {
exitCode, err := ForkAndMonitor(*containerDiskDir)
exitCode, err := ForkAndMonitor(*containerDiskDir, keepAfterQemuChan)
Contributor

Instead of passing in the chan, should we hang after we log log.Log.Reason(err).Error("monitoring virt-launcher failed") only in the case of an error? If we are shutting down successfully, do we still want to wait?

Contributor Author

Per David's idea, this change is not needed anymore.

@yuhaohaoyu yuhaohaoyu force-pushed the leave-virt-launcher-after-qemu branch from a1dddc1 to 514c21d Compare July 16, 2021 17:02
tests/vm_test.go Outdated

By("Check Virt Launcher status")
// Virt launcher should be in the Running phase if the Annotation v1.KeepLauncherAfterFailureAnnotation is set to true
time.Sleep(10 * time.Second)
Contributor

Please try to avoid arbitrary sleeps.
Maybe here we can use Consistently() if we expect virt-launcher to stay in the running phase, and Eventually() if we expect it to fail.
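
A sketch of the Consistently() variant with Gomega; getVirtLauncherPodPhase is a hypothetical helper standing in for the pod lookup the test already performs:

package tests_sketch

import (
    "time"

    . "github.com/onsi/gomega"
    k8sv1 "k8s.io/api/core/v1"
)

// getVirtLauncherPodPhase would fetch the virt-launcher pod's phase for the
// VMI under test (hypothetical helper).
var getVirtLauncherPodPhase func() k8sv1.PodPhase

func expectLauncherToStayRunning() {
    // Replaces time.Sleep(10 * time.Second): assert the phase holds for the
    // whole window instead of sampling it once after a fixed delay.
    Consistently(getVirtLauncherPodPhase, 30*time.Second, 1*time.Second).
        Should(Equal(k8sv1.PodRunning))
}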

Contributor Author

Thanks for the catch, this one is removed.

tests/vm_test.go Outdated
}, 160*time.Second, 1*time.Second).Should(BeTrue())

By("Check Virt Launcher log message")
time.Sleep(20 * time.Second)
Contributor

What is this waiting for? Can we actively wait for it?
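
If the log check were kept, an Eventually() poll could replace the sleep; getVirtLauncherLog and the expected substring below are placeholders, not the PR's actual helper or message:

package tests_sketch

import (
    "strings"
    "time"

    . "github.com/onsi/gomega"
)

// getVirtLauncherLog would return the current virt-launcher container log for
// the VMI under test (hypothetical helper).
var getVirtLauncherLog func() string

func waitForKeepAliveLogMessage() {
    // Replaces time.Sleep(20 * time.Second): poll until the message shows up.
    Eventually(func() bool {
        return strings.Contains(getVirtLauncherLog(), "keep-after-failure")
    }, 60*time.Second, 1*time.Second).Should(BeTrue())
}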

Contributor Author

This block for checking the virt-launcher log info message is not actually required. I have removed the block.

@yuhaohaoyu yuhaohaoyu force-pushed the leave-virt-launcher-after-qemu branch from 514c21d to 7cadb8a Compare July 16, 2021 19:09
@yuhaohaoyu
Contributor Author

/retest

@iholder101
Contributor

Thanks @yuhaohaoyu! Looks valuable!

My only feedback here is that we probably want to warn the user (and/or add this condition to the flag's description) if the VM's RunStrategy is not Manual and the --keep-after-failure flag is true.

And while we're at it, we can also log the fact that we intentionally hang and don't kill virt-launcher (e.g., hanging virt-launcher container since --keep-after-failure is set to true).

@yuhaohaoyu yuhaohaoyu force-pushed the leave-virt-launcher-after-qemu branch from 7cadb8a to 842533b Compare July 19, 2021 17:58
@yuhaohaoyu
Contributor Author

Thanks @yuhaohaoyu! Looks valuable!

My only feedback here is that we probably want to warn the user (and/or add this condition to the flag's description) if the VM's RunStrategy is not Manual and the --keep-after-failure flag is true.

And while we're at it, we can also log the fact that we intentionally hang and don't kill virt-launcher (e.g., hanging virt-launcher container since --keep-after-failure is set to true).

When the RunStrategy is not Manual, the effect of --keep-after-failure will not be visible to the VM owner. Specifically, if RunStrategy is either Always or RerunOnFailure, the virt-controller will kill the pod as long as the VM is in the Failed phase, regardless of the --keep-after-failure option.

I have just added a comment where the block-wait happens. Thanks.

@yuhaohaoyu
Contributor Author

/retest

@iholder101
Contributor

Thanks @yuhaohaoyu! Looks valuable!
My only feedback here is that we probably want to warn the user (and/or add this condition to the flag's description) if the VM's RunStrategy is not Manual and the --keep-after-failure flag is true.
And while we're at it, we can also log the fact that we intentionally hang and don't kill virt-launcher (e.g., hanging virt-launcher container since --keep-after-failure is set to true).

When the RunStrategy is not Manual, the effect of --keep-after-failure will not be visible to the VM owner. Specifically, if RunStrategy is either Always or RerunOnFailure, the virt-controller will kill the pod as long as the VM is in the Failed phase, regardless of the --keep-after-failure option.

I have just added a comment where the block-wait happens. Thanks.

Sure, I'm just thinking of a user who expects the pod to hang and, by mistake, sets leave-launcher-pod-after-qemu-exit to true in the VM's annotations and RunStrategy to Always. Since this is a configuration that doesn't make sense, we can warn the user, so that if they try to figure out what's wrong they'll see something like ignoring "keep-after-failure" since RunStrategy is not Manual.
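
A sketch of the kind of warning being suggested; the placement, helper name, and message are assumptions, not code from the PR:

package sketch

import (
    v1 "kubevirt.io/client-go/api/v1"
    "kubevirt.io/client-go/log"
)

// warnIfKeepAfterFailureIgnored logs a warning when the annotation is set but
// the RunStrategy makes it a no-op, so the user can see why nothing is kept.
func warnIfKeepAfterFailureIgnored(vm *v1.VirtualMachine, keepAfterFailure bool) {
    if keepAfterFailure && vm.Spec.RunStrategy != nil && *vm.Spec.RunStrategy != v1.RunStrategyManual {
        log.Log.Object(vm).Warning("ignoring keep-after-failure since RunStrategy is not Manual")
    }
}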

@@ -2001,3 +2005,16 @@ func filterVMIAnnotationsForPod(vmiAnnotations map[string]string) map[string]str
}
return annotationsList
}

func checkForKeepLauncherAfterFailure(vmi *v1.VirtualMachineInstance) bool {
Contributor

nit: would suggest toKeepLauncherAfterFailure
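
For context, a guess at the helper's shape; the body is an assumption based on the annotation referenced in the tests, not the PR's actual implementation:

package sketch

import (
    "strings"

    v1 "kubevirt.io/client-go/api/v1"
)

// toKeepLauncherAfterFailure (name per the nit above) reads the VMI-template
// annotation and treats any value other than "true" as false.
func toKeepLauncherAfterFailure(vmi *v1.VirtualMachineInstance) bool {
    val, ok := vmi.Annotations[v1.KeepLauncherAfterFailureAnnotation]
    return ok && strings.EqualFold(val, "true")
}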

@iholder101
Contributor

My feedback is just small nits :) up to you.
/lgtm

@kubevirt-bot kubevirt-bot added lgtm Indicates that a PR is ready to be merged. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. labels Jul 20, 2021
Leave virt-launcher container up for failed VMI for improved debuggability

Signed-off-by: Hao Yu <yuh@us.ibm.com>
@yuhaohaoyu yuhaohaoyu force-pushed the leave-virt-launcher-after-qemu branch from 842533b to 82e2cf5 Compare July 20, 2021 12:52
@kubevirt-bot kubevirt-bot removed lgtm Indicates that a PR is ready to be merged. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. labels Jul 20, 2021
@iholder101
Contributor

iholder101 commented Jul 20, 2021

Thanks @yuhaohaoyu!
/lgtm

@kubevirt-bot kubevirt-bot added the lgtm Indicates that a PR is ready to be merged. label Jul 20, 2021
@yuhaohaoyu
Contributor Author

/retest

4 similar comments
@yuhaohaoyu
Contributor Author

/retest

@yuhaohaoyu
Contributor Author

/retest

@yuhaohaoyu
Contributor Author

/retest

@yuhaohaoyu
Contributor Author

/retest

@jean-edouard
Contributor

/approve
/retest

@kubevirt-bot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jean-edouard

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@kubevirt-bot kubevirt-bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jul 21, 2021
@yuhaohaoyu
Contributor Author

/retest
