Leave virt-launcher container up for failed VMI for improved debuggab… #6040
Conversation
/assign @ashleyschuett
For the `<-keepAfterQemuChan`: it seems like we could get away with placing that in only one point in the code, right after ForkAndMonitor() is called. That will keep the pod's pid 1 alive indefinitely.
cmd/virt-launcher/virt-launcher.go (Outdated)

	<-keepAfterQemuChan

	// Now that the pid has exited, we wait for the final delete notification to be
	// sent back to virt-handler. This delete notification contains the reason the
	// domain exited.
	waitForFinalNotify(events, domainManager, vmi)
Why is it important to have the `<-keepAfterQemuChan` wait before the final notify is sent?
With your insight about placing the blocking right after the invocation of ForkAndMonitor, my reason for placing the blocking before waitForFinalNotify was weak. I was trying to have the "blocking site" (conceptually resembling a debugger's breakpoint) close to the point of failure.
cmd/virt-launcher/virt-launcher.go (Outdated)

	<-keepAfterQemuChan
ForkAndMonitor can exit before this is ever reached, which will result in the pod exiting even when keepAfterQemuChan is set. Maybe the `<-keepAfterQemuChan` line should be moved to right after ForkAndMonitor returns.
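A minimal, self-contained sketch of that placement, blocking immediately after ForkAndMonitor returns so every exit path is covered. Here forkAndMonitor is a hypothetical stub standing in for virt-launcher's ForkAndMonitor, and the signal wait stands in for the PR's keepAfterQemuChan:

```go
package main

import (
	"fmt"
	"os"
	"os/signal"
	"syscall"
)

// forkAndMonitor is a stub for virt-launcher's ForkAndMonitor;
// here it pretends the child process failed.
func forkAndMonitor() (int, error) {
	return 1, fmt.Errorf("qemu exited unexpectedly")
}

func main() {
	keepAfterFailure := true // would come from the debug flag / annotation

	exitCode, err := forkAndMonitor()
	if err != nil {
		fmt.Fprintln(os.Stderr, "monitoring virt-launcher failed:", err)
	}

	// Blocking right here covers every exit path, so the pod's pid 1
	// stays alive even when the child failed early.
	if keepAfterFailure {
		fmt.Println("keeping the container alive for debugging; send SIGTERM to exit")
		term := make(chan os.Signal, 1)
		signal.Notify(term, syscall.SIGTERM, syscall.SIGINT)
		<-term // blocks indefinitely, until the pod is torn down
	}
	os.Exit(exitCode)
}
```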
Your suggestion is quite clean, and it certainly covers the case where ForkAndMonitor is employed. Is forking the only way to invoke cmd/virt-launcher from outside the program?
> Then the user can create the VM, start the VMI manually, and the VMI environment (the corresponding virt-launcher container) will stay alive. Given that the VMI is in the Failed phase, it will not respond to virtctl stop but will respond to virtctl restart.

This is no longer the case. `virtctl stop <vmiName>` now works on Failed VMIs. https://github.com/kubevirt/kubevirt/pull/5953/files#diff-028d148aa715883f3fd17ae07f8a03225251bdd5eb1893566dfb68085298a434R612
cmd/virt-launcher/virt-launcher.go (Outdated)

@@ -348,6 +348,8 @@ func main() {
	qemuAgentUserInterval := pflag.Duration("qemu-agent-user-interval", 10, "Interval in seconds between consecutive qemu agent calls for user command")
	qemuAgentVersionInterval := pflag.Duration("qemu-agent-version-interval", 300, "Interval in seconds between consecutive qemu agent calls for version command")
	qemuAgentFSFreezeStatusInterval := pflag.Duration("qemu-fsfreeze-status-interval", 5, "Interval in seconds between consecutive qemu agent calls for fsfreeze status command")
+	keepAfterQemu := pflag.Bool("keep-after-qemu", false, "virt-launcher will be kept alive after qemu exits for debugging if set to true")
Does the user know or care that the pod exists when qemu dies? Does it make sense to say `virt-launcher will be kept alive after failure for debugging if set to true`? Or is that too vague?
Nice. Will take your suggestion.
I have adjusted the name of the command-line option, the annotation key, and the corresponding internal variable to be in line with "keep after failure".
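For illustration, a sketch of what the renamed option could look like; the exact flag name keep-after-failure is an assumption, and the description string is the one suggested above:

```go
package main

import (
	"fmt"

	"github.com/spf13/pflag"
)

func main() {
	// Hypothetical renamed flag, aligned with "keep after failure".
	keepAfterFailure := pflag.Bool("keep-after-failure", false,
		"virt-launcher will be kept alive after failure for debugging if set to true")
	pflag.Parse()
	fmt.Println("keep-after-failure:", *keepAfterFailure)
}
```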
cmd/virt-launcher/virt-launcher.go
Outdated
if !*noFork { | ||
exitCode, err := ForkAndMonitor(*containerDiskDir) | ||
exitCode, err := ForkAndMonitor(*containerDiskDir, keepAfterQemuChan) |
Instead of passing in the chan, should we hang after we log `log.Log.Reason(err).Error("monitoring virt-launcher failed")`, only in the case of error? If we are shutting down successfully, do we still want to wait?
Per David's idea, this change is not needed anymore.
Force-pushed from a1dddc1 to 514c21d
tests/vm_test.go (Outdated)

	By("Check Virt Launcher status")
	// Virt launcher should be in the Running phase if the Annotation v1.KeepLauncherAfterFailureAnnotation is set to true
	time.Sleep(10 * time.Second)
Please try to avoid arbitrary sleeps. Maybe here we can use Consistently() if we expect virt-launcher to stay in the running phase, and Eventually() if we expect it to fail.
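A sketch of the Consistently() variant; the helper name, client wiring, and 30-second window are illustrative rather than the test's actual code:

```go
package tests

import (
	"context"
	"time"

	. "github.com/onsi/gomega"
	k8sv1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// expectLauncherStaysRunning polls the launcher pod's phase over a fixed
// window instead of sleeping; client, namespace, and podName would come
// from the surrounding test.
func expectLauncherStaysRunning(client kubernetes.Interface, namespace, podName string) {
	Consistently(func() k8sv1.PodPhase {
		pod, err := client.CoreV1().Pods(namespace).Get(context.Background(), podName, metav1.GetOptions{})
		Expect(err).ToNot(HaveOccurred())
		return pod.Status.Phase
	}, 30*time.Second, 1*time.Second).Should(Equal(k8sv1.PodRunning))
}
```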
Thanks for the catch, this one is removed.
tests/vm_test.go (Outdated)

	}, 160*time.Second, 1*time.Second).Should(BeTrue())

	By("Check Virt Launcher log message")
	time.Sleep(20 * time.Second)
What is this waiting for? Can we actively wait for it?
This block for checking the virt-launcher log info message is not actually required. I have removed the block.
Force-pushed from 514c21d to 7cadb8a
/retest
Thanks @yuhaohaoyu! Looks valuable! My only feedback here is that we probably want to warn the user (and/or add this condition to the flag's description) if the VM's RunStrategy is not Manual. And while we're on the subject: we can also log the fact that we intentionally hang and don't kill virt-launcher.
Force-pushed from 7cadb8a to 842533b
I have just added a comment where the block-wait happens. Thanks.
/retest
Sure, I'm just thinking of a user that expects the pod to hang and, by mistake, sets a RunStrategy other than Manual.
@@ -2001,3 +2005,16 @@ func filterVMIAnnotationsForPod(vmiAnnotations map[string]string) map[string]string {
	}
	return annotationsList
}

+func checkForKeepLauncherAfterFailure(vmi *v1.VirtualMachineInstance) bool {
nit: would suggest `toKeepLauncherAfterFailure`
My feedback is just small nits :) up to you.
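A sketch of what such a helper could look like, using the v1.KeepLauncherAfterFailureAnnotation key referenced in the test above; the case-insensitive comparison against "true" is an assumption about how the value is parsed:

```go
package services

import (
	"strings"

	v1 "kubevirt.io/client-go/api/v1"
)

// toKeepLauncherAfterFailure reports whether the VMI asked, via the debug
// annotation, to keep its virt-launcher pod alive after a failure.
func toKeepLauncherAfterFailure(vmi *v1.VirtualMachineInstance) bool {
	val, ok := vmi.Annotations[v1.KeepLauncherAfterFailureAnnotation]
	return ok && strings.EqualFold(val, "true")
}
```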
…ility Signed-off-by: Hao Yu <yuh@us.ibm.com>
Force-pushed from 842533b to 82e2cf5
Thanks @yuhaohaoyu!
/retest

4 similar comments

/retest

/retest

/retest

/retest
/approve
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jean-edouard

The full list of commands accepted by this bot can be found here. The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment.
/retest
…ility

Signed-off-by: Hao Yu yuh@us.ibm.com

What this PR does / why we need it:
The goal of this PR is to keep the VMI environment up for inspection or debugging when a VMI fails. Currently the virt-launcher container will exit when the corresponding VMI reaches the Failed phase. The specific goal of the PR is to give the user a means to have a VMI defined in a VM retain its environment on failure. To have a failed VMI leave its virt-launcher container alive, the user needs to set the VM's RunStrategy to Manual and set the annotation kubevirt.io/leave-launcher-pod-after-qemu-exit to "true" in the VM's VMI template. The user can then create the VM and start the VMI manually; on failure, the VMI environment (the corresponding virt-launcher container) will stay alive.

In addition, this PR does not change the behavior of a VMI whose VM's RunStrategy is not defined as Manual.
Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #
Special notes for your reviewer:
Release note: