[shimV2] Fix guest/VM exit on their own #2743
Merged
AdityaMittal1306 approved these changes on May 18, 2026
jterry75 reviewed on May 18, 2026
- Once a utility VM exits or its in-guest agent dies, container and network cleanup has nothing left to act on, yet the old code kept returning errors and looping on resources that are already gone.
- Export the closed-bridge sentinel and wrap it around the underlying transport failure when the bridge is killed, and add a shared helper that classifies the host-side conditions meaning the compute system is no longer available for modification. Container resource release (combined layers, every category of guest mount and device, and the explicit state delete) and both legs of network endpoint and namespace removal now treat these two signals as a successful release and move on. The VM exit watcher also collapses the bridge on termination so in-flight waits unblock instead of hanging until context cancellation, and task deletion is allowed once the VM has reached the terminated state so containerd can drain stale tasks.
- Along the way, migrate the shim's task service registration to the v3 task API, and stop logging the noisy JSON parse error emitted when GCS writes a non-JSON trailer (such as a runtime panic dump) on shutdown, since the raw payload is already surfaced by the terminated-output path.

Signed-off-by: Harsh Rawat <harshrawat@microsoft.com>
jterry75 approved these changes on May 18, 2026
Summary
- Once a utility VM exits or its in-guest agent dies, container and
  network cleanup has nothing left to act on, yet the old code kept
  returning errors and looping on resources that are already gone.
- Export the closed-bridge sentinel and wrap it around the underlying
  transport failure when the bridge is killed, and add a shared helper
  that classifies the host-side conditions meaning the compute system
  is no longer available for modification. Container resource release
  (combined layers, every category of guest mount and device, and the
  explicit state delete) and both legs of network endpoint and
  namespace removal now treat these two signals as a successful
  release and move on. The VM exit watcher also collapses the bridge
  on termination so in-flight waits unblock instead of hanging until
  context cancellation, and task deletion is allowed once the VM has
  reached the terminated state so containerd can drain stale tasks.
- Along the way, migrate the shim's task service registration to the
  v3 task API, and stop logging the noisy JSON parse error emitted
  when GCS writes a non-JSON trailer (such as a runtime panic dump)
  on shutdown, since the raw payload is already surfaced by the
  terminated-output path.
Testing
Tested with LinuxBootFiles that introduce a UVM panic after 45 seconds. Ran multiple pods in the same UVM; when the UVM exited on its own, the state was marked as NOT_READY in the containerd output.