[shimV2] Fix guest/VM exit on their own #2743

Merged: rawahars merged 1 commit into main from harshrawat/v2_vm_exit on May 18, 2026

rawahars (Contributor) commented May 18, 2026

Summary

  • Once a utility VM exits or its in-guest agent dies, container and
    network cleanup has nothing left to act on, yet the old code kept
    returning errors and looping on resources that are already gone.

  • Export the closed-bridge sentinel and wrap it around the underlying
    transport failure when the bridge is killed, and add a shared helper
    that classifies the host-side conditions meaning the compute system
    is no longer available for modification. Container resource release
    (combined layers, every category of guest mount and device, and the
    explicit state delete) and both legs of network endpoint and
    namespace removal now treat these two signals as a successful
    release and move on. The VM exit watcher also collapses the bridge
    on termination so in-flight waits unblock instead of hanging until
    context cancellation, and task deletion is allowed once the VM has
    reached the terminated state so containerd can drain stale tasks.

  • Along the way, migrate the shim's task service registration to the
    v3 task API, and stop logging the noisy JSON parse error emitted
    when GCS writes a non-JSON trailer (such as a runtime panic dump)
    on shutdown, since the raw payload is already surfaced by the
    terminated-output path.
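
The tolerant-release idea in the second bullet can be sketched in Go roughly as follows. This is a minimal illustration, not the code in this PR: `ErrBridgeClosed`, `errComputeSystemGone`, `killBridge`, `isGone`, and `release` are hypothetical names chosen for the example, and the real helper classifies actual host-side errors rather than a single placeholder.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrBridgeClosed is a hypothetical exported sentinel for a closed guest bridge.
var ErrBridgeClosed = errors.New("bridge closed")

// errComputeSystemGone is a placeholder for the host-side conditions meaning
// the compute system can no longer be modified (assumption for this sketch).
var errComputeSystemGone = errors.New("compute system no longer available")

// Example errors used only by the demonstration below.
var (
	errPipeBroken   = errors.New("pipe broken")
	errAccessDenied = errors.New("access denied")
)

// killBridge wraps the sentinel around the underlying transport failure,
// so callers can match it with errors.Is while the cause stays in the message.
func killBridge(cause error) error {
	return fmt.Errorf("%w: %v", ErrBridgeClosed, cause)
}

// isGone reports whether err indicates the guest or the VM has disappeared.
func isGone(err error) bool {
	return errors.Is(err, ErrBridgeClosed) || errors.Is(err, errComputeSystemGone)
}

// release runs one cleanup step and treats "already gone" as a successful
// release; any other error is still returned to the caller.
func release(step func() error) error {
	if err := step(); err != nil && !isGone(err) {
		return err
	}
	return nil
}

func main() {
	steps := []struct {
		name string
		run  func() error
	}{
		{"guest mount", func() error { return killBridge(errPipeBroken) }},
		{"network endpoint", func() error { return errComputeSystemGone }},
		{"state delete", func() error { return errAccessDenied }},
	}
	for _, s := range steps {
		fmt.Printf("%s: release error = %v\n", s.name, release(s.run))
	}
}
```

Matching with `errors.Is` rather than comparing error strings keeps the check working when the sentinel is wrapped with extra context, which is why the sketch wraps with `%w`.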

Testing

  • Built custom LinuxBootFiles that trigger a UVM panic after 45 seconds. Ran multiple pods in the same UVM; when the UVM exited on its own, the state was marked NOT_READY in the containerd output.

@rawahars rawahars requested a review from a team as a code owner May 18, 2026 06:31
@rawahars rawahars force-pushed the harshrawat/v2_vm_exit branch from 19c3466 to 063c302 Compare May 18, 2026 09:11
@rawahars rawahars changed the title [shimV2] Tolerate guest/VM disappearance during teardown [shimV2] Fix guest/VM exit on their own May 18, 2026
Comment thread cmd/containerd-shim-lcow-v2/service/service.go
Comment thread cmd/containerd-shim-lcow-v2/service/service_task_internal.go Outdated
Signed-off-by: Harsh Rawat <harshrawat@microsoft.com>
@rawahars rawahars force-pushed the harshrawat/v2_vm_exit branch from 063c302 to 9fc7d39 Compare May 18, 2026 17:44
@rawahars rawahars merged commit 14f91aa into main May 18, 2026
85 of 89 checks passed
@rawahars rawahars deleted the harshrawat/v2_vm_exit branch May 18, 2026 20:09
3 participants