Fix QEMU startup cleanup for failed launches #197

Merged
sjmiller609 merged 1 commit into main from codex/qemu-startup-cleanup
Apr 14, 2026

sjmiller609 (Collaborator) commented Apr 14, 2026

Summary

  • wait on QEMU subprocesses during startup so failed launches are reaped instead of left as zombies
  • wire startup cleanup to kill, reap, and remove qemu.sock on any failed attempt or retry
  • fail fast when QEMU exits before QMP becomes reachable, with focused tests for the new cleanup path

Testing

  • go test ./lib/hypervisor/qemu/...

Context

This fixes the startup-path bug that left instances stuck in Unknown when QEMU launch retries left behind a stale monitor socket and stale PID metadata.


Note

Medium Risk
Touches QEMU process lifecycle management (start/kill/wait) and socket readiness logic; mistakes here could cause hangs or kill the wrong process, but the change is scoped and covered by focused tests.

Overview
Improves QEMU startup failure handling by tracking the launched process and ensuring it is killed and reaped (avoiding zombies) and that qemu.sock is removed on failed attempts.

Startup now fails fast if QEMU exits before the QMP socket becomes reachable (waitForSocketOrExit) and also checks for early exit while retrying QMP client creation. Adds unit tests to verify cleanup reaps exited processes and removes stale sockets, and that socket-wait returns quickly when the process dies.

Reviewed by Cursor Bugbot for commit e49b01c. Bugbot is set up for automated code reviews on this repo.

sjmiller609 requested a review from hiroTamada April 14, 2026 17:49
sjmiller609 marked this pull request as ready for review April 14, 2026 17:49
@firetiger-agent

Firetiger deploy monitoring skipped

This PR didn't match the auto-monitor filter configured on your GitHub connection:

Any PR that changes the kernel API. Monitor changes to API endpoints (packages/api/cmd/api/) and Temporal workflows (packages/api/lib/temporal) in the kernel repo

Reason: PR modifies QEMU hypervisor cleanup logic, not kernel API endpoints or Temporal workflows as specified in the filter.

To monitor this PR anyway, reply with @firetiger monitor this.

@hiroTamada
Contributor

@firetiger monitor this


cursor bot left a comment


Cursor Bugbot has reviewed your changes and found 1 potential issue.


Comment thread lib/hypervisor/qemu/process.go
sjmiller609 merged commit 2c5790d into main Apr 14, 2026
13 of 14 checks passed
sjmiller609 deleted the codex/qemu-startup-cleanup branch April 14, 2026 18:03
@firetiger-agent

I'm working on a monitoring plan for this PR. You can follow the progress here.

Tag me in a comment at any point to steer the plan. When this PR merges, I'll watch for deployments and use that as the signal to start monitoring.

@firetiger-agent

I'll monitor this QEMU startup cleanup refactor. The changes introduce proper process lifecycle management with faster failure detection when QEMU exits early.

What I'm watching:

  • Hypeman spawn error rate (baseline: 0.5-1.5%) - will alert if it exceeds 3% sustained
  • SpawnInvocationInstanceActivity errors - looking for new "qemu exited early" error patterns
  • InvocationWorkflowV2 failure rate - ensuring cascading failures don't increase

The change is low-medium risk since Hypeman handles ~7.5% of serverless spawn volume. The unit tests look solid, covering both cleanup and early exit detection. I'll post updates as the deployment progresses.



2 participants