Optimize Firecracker snapshot resume #261

Closed
sjmiller609 wants to merge 3 commits into main from hypeship/fc-resume-on-load-v2

Conversation

@sjmiller609 sjmiller609 commented Jun 1, 2026

Summary

  • Uses Firecracker snapshot resume_vm support so restored guests can resume during /snapshot/load instead of requiring a separate post-load Resume call.
  • Tracks whether a restored hypervisor has already resumed through hypervisor.RestoredResumed, and skips the extra resume step only in that case.
  • Speeds up Firecracker API socket readiness detection with inotify on Linux, while keeping polling fallback and non-Linux polling behavior.
  • Adds tests for resume-on-load config, restored-resumed propagation, socket readiness waiting, and tracing wrapper behavior.

Tests

  • go test ./lib/hypervisor ./lib/hypervisor/firecracker ./lib/mailbox ./lib/guest ./lib/system/guest_agent ./lib/oapi -count=1
  • git diff --check

@sjmiller609 sjmiller609 force-pushed the hypeship/network-handoff-v2 branch 2 times, most recently from dffc792 to 05bc363, June 1, 2026 13:52
Base automatically changed from hypeship/network-handoff-v2 to main June 1, 2026 19:18
@sjmiller609 sjmiller609 marked this pull request as ready for review June 1, 2026 19:26
}

parent := filepath.Dir(path)
fd, err := unix.InotifyInit1(unix.IN_CLOEXEC | unix.IN_NONBLOCK)
Collaborator Author

We first try to wait for the socket without polling (via inotify); if that fails, we fall back to polling.

@cursor cursor Bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Reviewed by Cursor Bugbot for commit b7c7c05.

  defer snapshotSourceAliasMu.Unlock()
  return withSnapshotSourceDirAlias(meta, filepath.Dir(socketPath), func() error {
-     return hv.loadSnapshot(ctx, snapshotPath, meta.NetworkOverrides)
+     return hv.loadSnapshot(ctx, snapshotPath, meta.NetworkOverrides, resumeOnLoad)
Guest runs before alias cleanup

Medium Severity

With resume-on-load enabled, snapshot load can start the guest while still inside withSnapshotSourceDirAlias, before the temporary source-data symlink is removed. Restore then skips the separate Resume call when RestoredResumed is set, so alias restores no longer guarantee the guest stays paused until after that teardown finishes.
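To make the ordering concern concrete, here is a small model of the restore steps. It also illustrates one conceivable mitigation, disabling resume-on-load whenever an alias is needed; that policy is an assumption for this sketch and is not something the PR implements. `restoreOrder` and the step names are hypothetical.

```go
package main

import "strings"

// restoreOrder returns the order of restore steps as a readable string.
// needsAlias models a restore that requires the temporary source alias;
// resumeOnLoadEnabled models the resume-on-load setting.
func restoreOrder(needsAlias, resumeOnLoadEnabled bool) string {
	var steps []string

	// Assumed policy for this sketch: never resume on load while an alias
	// is in place, so the guest stays paused until teardown finishes.
	resumeOnLoad := resumeOnLoadEnabled && !needsAlias

	load := func() {
		steps = append(steps, "load snapshot")
		if resumeOnLoad {
			steps = append(steps, "guest running")
		}
	}

	if needsAlias {
		steps = append(steps, "create alias")
		load()
		steps = append(steps, "remove alias")
	} else {
		load()
	}

	// Without resume-on-load, the guest starts via a separate step.
	if !resumeOnLoad {
		steps = append(steps, "guest running")
	}
	return strings.Join(steps, " -> ")
}
```

Without the `&& !needsAlias` condition, the alias case would place "guest running" before "remove alias", which is the situation the report describes.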


@firetiger-agent
Created a monitoring plan for this PR.

What this PR does: Speeds up Firecracker VM restore by resuming the VM inside the snapshot load call (instead of a separate step) and replacing 50ms-interval polling with inotify-based socket detection on Linux. Adds an env-var escape hatch (HYPEMAN_FIRECRACKER_RESTORE_RESUME_ON_LOAD) to revert resume behavior without a redeploy.

Intended effect:

  • Hypeman spawn success rate (kernel_invocation_spawn_total, backend_type=hypeman): baseline 2,500–21,000 successes/hr, zero failures; confirmed if rate holds with no success=false rows post-deploy.
  • Restore-to-running latency: Baseline not directly measurable in Firetiger; confirmed if spawn success rate remains healthy and no "failed to resume VM" ERROR logs appear (pre-deploy baseline: 0).

Risks:

  • Double-resume error — if RestoredResumed() guard is bypassed (e.g. interface stripping in wrapper), Firecracker returns an error on an already-running VM; alert if any success=false row in kernel_invocation_spawn_total for backend_type=hypeman.
  • Guest network race — VM executes too early before network reconfiguration completes; alert if any "failed to configure guest network after restore" ERROR log appears post-deploy.
  • inotify miss — socket creation event is missed, blocking until timeout; alert if hypeman spawn count drops below 1,000/hr during active hours (09:00–18:00 UTC).
  • Env var misconfiguration: HYPEMAN_FIRECRACKER_RESTORE_RESUME_ON_LOAD=0 at deploy silently reverts to the old behavior (safe, but it removes the improvement); verify via the "VM was resumed during snapshot load" INFO log appearing post-deploy.
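The "interface stripping in wrapper" risk in the first bullet can be illustrated with a short Go sketch. It assumes `RestoredResumed` is an optional method found by type assertion, which the PR text suggests but does not show; all type names here are hypothetical.

```go
package main

// Hypervisor is a hypothetical core interface without RestoredResumed.
type Hypervisor interface {
	Name() string
}

// resumedReporter is the assumed optional interface.
type resumedReporter interface {
	RestoredResumed() bool
}

// fcHypervisor stands in for a hypervisor that resumed during load.
type fcHypervisor struct{ resumed bool }

func (h *fcHypervisor) Name() string          { return "firecracker" }
func (h *fcHypervisor) RestoredResumed() bool { return h.resumed }

// strippingWrapper embeds the interface type, so only Name is promoted.
// The inner RestoredResumed method is no longer visible on the wrapper.
type strippingWrapper struct{ Hypervisor }

// forwardingWrapper also embeds the interface but forwards the optional
// method explicitly, so the guard still sees it.
type forwardingWrapper struct{ Hypervisor }

func (w forwardingWrapper) RestoredResumed() bool {
	rr, ok := w.Hypervisor.(resumedReporter)
	return ok && rr.RestoredResumed()
}

// alreadyResumed is the guard: true only if hv reports a resume on load.
func alreadyResumed(hv Hypervisor) bool {
	rr, ok := hv.(resumedReporter)
	return ok && rr.RestoredResumed()
}
```

With a stripping wrapper the guard returns false for a guest that is in fact running, so the caller would issue the second resume that the bullet warns about.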

Status updates will be posted automatically on this PR as monitoring progresses.

@sjmiller609
Collaborator Author

closing this one intentionally. resume-on-load saves a small Firecracker API step, but it complicates the restore invariant: when a temporary snapshot source alias is needed, resume_vm: true can let guest execution begin before alias cleanup. The child work does not fundamentally require this optimization, so we are dropping #261 from the production stack and rebasing the UFFD / fork-concurrency PRs directly onto main.

@sjmiller609
Collaborator Author

Closing intentionally; see latest comment. The dependent PRs are being rebased directly onto main without the resume-on-load layer.

@sjmiller609 sjmiller609 closed this Jun 1, 2026