Optimize Firecracker snapshot resume by sjmiller609 · Pull Request #261 · kernel/hypeman

sjmiller609 · 2026-06-01T11:49:41Z

Summary

Uses Firecracker snapshot resume_vm support so restored guests can resume during /snapshot/load instead of requiring a separate post-load Resume call.
Tracks whether a restored hypervisor has already resumed through hypervisor.RestoredResumed, and skips the extra resume step only in that case.
Speeds up Firecracker API socket readiness detection with inotify on Linux, while keeping polling fallback and non-Linux polling behavior.
Adds tests for resume-on-load config, restored-resumed propagation, socket readiness waiting, and tracing wrapper behavior.
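As a rough illustration of the first two points, the sketch below builds a `/snapshot/load` request body with `resume_vm` set and decides whether a follow-up Resume call is still needed. The struct and function names (`snapshotLoadRequest`, `buildLoadRequest`, `needsExplicitResume`) are hypothetical and not taken from this PR; only the JSON field names come from the Firecracker API.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// memBackend mirrors the mem_backend object in Firecracker's /snapshot/load body.
type memBackend struct {
	BackendType string `json:"backend_type"`
	BackendPath string `json:"backend_path"`
}

// snapshotLoadRequest is a hypothetical type holding a minimal subset of the
// /snapshot/load fields; the real request may carry more options.
type snapshotLoadRequest struct {
	SnapshotPath string     `json:"snapshot_path"`
	MemBackend   memBackend `json:"mem_backend"`
	ResumeVM     bool       `json:"resume_vm"`
}

// buildLoadRequest returns the JSON body for a snapshot load. When
// resumeOnLoad is true, Firecracker resumes the guest as part of the load.
func buildLoadRequest(snapshotPath, memPath string, resumeOnLoad bool) (string, error) {
	body, err := json.Marshal(snapshotLoadRequest{
		SnapshotPath: snapshotPath,
		MemBackend:   memBackend{BackendType: "File", BackendPath: memPath},
		ResumeVM:     resumeOnLoad,
	})
	if err != nil {
		return "", err
	}
	return string(body), nil
}

// needsExplicitResume reports whether the caller still has to issue a
// separate Resume call after the load has completed.
func needsExplicitResume(resumedOnLoad bool) bool {
	return !resumedOnLoad
}

func main() {
	body, err := buildLoadRequest("/tmp/vm.snap", "/tmp/vm.mem", true)
	if err != nil {
		fmt.Println("marshal failed:", err)
		return
	}
	fmt.Println("PUT /snapshot/load", body)
	fmt.Println("explicit resume needed:", needsExplicitResume(true))
}
```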

Tests

go test ./lib/hypervisor ./lib/hypervisor/firecracker ./lib/mailbox ./lib/guest ./lib/system/guest_agent ./lib/oapi -count=1
git diff --check

sjmiller609 · 2026-06-01T19:29:18Z

+	}
+
+	parent := filepath.Dir(path)
+	fd, err := unix.InotifyInit1(unix.IN_CLOEXEC | unix.IN_NONBLOCK)


We first try to wait without polling (via inotify), and fall back to polling if that is not possible.
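The "event wait first, polling as fallback" shape can be sketched with the standard library alone. In the sketch below, `eventWaiter` is a hypothetical hook standing in for the inotify-based wait, and `waitForPath` and `runDemo` are illustrative names rather than code from this PR. A regular file stands in for the API socket, and the 50ms poll interval matches the interval mentioned elsewhere in this thread.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"os"
	"path/filepath"
	"time"
)

// eventWaiter is a hypothetical hook standing in for an inotify-based wait.
// It returns nil when it believes path has appeared, or an error when
// event-based waiting is unavailable or failed.
type eventWaiter func(ctx context.Context, path string) error

func pathExists(path string) bool {
	_, err := os.Stat(path)
	return err == nil
}

// waitForPath tries the event-based waiter first and falls back to polling
// until the path exists or ctx is done.
func waitForPath(ctx context.Context, path string, wait eventWaiter, pollEvery time.Duration) error {
	if pathExists(path) {
		return nil
	}
	if wait != nil {
		if err := wait(ctx, path); err == nil && pathExists(path) {
			return nil
		}
		// The event wait failed or was inconclusive: fall through to polling.
	}
	ticker := time.NewTicker(pollEvery)
	defer ticker.Stop()
	for {
		if pathExists(path) {
			return nil
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-ticker.C:
		}
	}
}

// runDemo creates a file after createAfterMs milliseconds (never, if negative)
// and waits for it with a waiter that always reports "unavailable".
func runDemo(createAfterMs, timeoutMs int) error {
	dir, err := os.MkdirTemp("", "sock-wait")
	if err != nil {
		return err
	}
	defer os.RemoveAll(dir)
	path := filepath.Join(dir, "api.sock")

	if createAfterMs >= 0 {
		go func() {
			time.Sleep(time.Duration(createAfterMs) * time.Millisecond)
			_ = os.WriteFile(path, nil, 0o600) // a regular file stands in for the socket
		}()
	}
	ctx, cancel := context.WithTimeout(context.Background(), time.Duration(timeoutMs)*time.Millisecond)
	defer cancel()

	unavailable := func(context.Context, string) error {
		return errors.New("event-based wait unavailable")
	}
	return waitForPath(ctx, path, unavailable, 50*time.Millisecond)
}

func main() {
	fmt.Println("path appears:", runDemo(20, 2000))
	fmt.Println("path never appears:", runDemo(-1, 200))
}
```

Checking for the path both before and after the event wait covers the case where the socket is created between the initial check and the watch being set up.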

cursor

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Reviewed by Cursor Bugbot for commit b7c7c05.

cursor · 2026-06-01T19:29:38Z

 		defer snapshotSourceAliasMu.Unlock()
 		return withSnapshotSourceDirAlias(meta, filepath.Dir(socketPath), func() error {
-			return hv.loadSnapshot(ctx, snapshotPath, meta.NetworkOverrides)
+			return hv.loadSnapshot(ctx, snapshotPath, meta.NetworkOverrides, resumeOnLoad)


Guest runs before alias cleanup

Medium Severity

With resume-on-load enabled, snapshot load can start the guest while still inside withSnapshotSourceDirAlias, before the temporary source-data symlink is removed. Restore then skips the separate Resume call when RestoredResumed is set, so alias restores no longer guarantee the guest stays paused until after that teardown finishes.

Additional Locations (1)

lib/instances/restore.go#L284-L287
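To make the ordering concern concrete, the sketch below lists restore steps under one possible mitigation: disabling resume-on-load whenever a source alias is needed, so that the explicit resume always comes after alias removal. This is not what the PR does (it was closed instead), and `restoreSteps`, `indexOf`, and the step names are hypothetical.

```go
package main

import "fmt"

// restoreSteps returns the order of restore operations under the assumed
// mitigation: resume-on-load is only honored when no alias is required.
func restoreSteps(needsAlias, resumeOnLoadRequested bool) []string {
	resumeOnLoad := resumeOnLoadRequested && !needsAlias

	var steps []string
	if needsAlias {
		steps = append(steps, "create-alias")
	}
	if resumeOnLoad {
		steps = append(steps, "load-and-resume")
	} else {
		steps = append(steps, "load-paused")
	}
	if needsAlias {
		steps = append(steps, "remove-alias")
	}
	if !resumeOnLoad {
		// The guest only starts running here, after any alias teardown.
		steps = append(steps, "resume")
	}
	return steps
}

// indexOf returns the position of want in steps, or -1 if it is absent.
func indexOf(steps []string, want string) int {
	for i, s := range steps {
		if s == want {
			return i
		}
	}
	return -1
}

func main() {
	fmt.Println(restoreSteps(true, true))  // alias restore keeps the explicit resume
	fmt.Println(restoreSteps(false, true)) // no alias: resume happens during load
}
```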

Reviewed by Cursor Bugbot for commit b7c7c05.

firetiger-agent · 2026-06-01T19:32:20Z

Created a monitoring plan for this PR.

What this PR does: Speeds up Firecracker VM restore by resuming the VM inside the snapshot load call (instead of a separate step) and replacing 50ms-interval polling with inotify-based socket detection on Linux. Adds an env-var escape hatch (HYPEMAN_FIRECRACKER_RESTORE_RESUME_ON_LOAD) to revert resume behavior without a redeploy.
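A minimal sketch of how such an escape hatch could be parsed is shown below. The thread only states that setting the variable to `0` reverts to the old behavior; the default-on behavior for unset or unparsable values, and the function name `resumeOnLoadEnabled`, are assumptions made for illustration.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

// resumeOnLoadEnv is the variable named in the monitoring plan.
const resumeOnLoadEnv = "HYPEMAN_FIRECRACKER_RESTORE_RESUME_ON_LOAD"

// resumeOnLoadEnabled interprets the raw variable value. Assumption: the
// feature is on unless the value parses as false (for example "0").
func resumeOnLoadEnabled(raw string) bool {
	raw = strings.TrimSpace(raw)
	if raw == "" {
		return true // unset: assume the optimization is enabled
	}
	enabled, err := strconv.ParseBool(raw)
	if err != nil {
		return true // unparsable values keep the assumed default
	}
	return enabled
}

func main() {
	fmt.Println("resume on load:", resumeOnLoadEnabled(os.Getenv(resumeOnLoadEnv)))
}
```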

Intended effect:

Hypeman spawn success rate (kernel_invocation_spawn_total, backend_type=hypeman): baseline 2,500–21,000 successes/hr, zero failures; confirmed if rate holds with no success=false rows post-deploy.
Restore-to-running latency: Baseline not directly measurable in Firetiger; confirmed if spawn success rate remains healthy and no "failed to resume VM" ERROR logs appear (pre-deploy baseline: 0).

Risks:

Double-resume error — if RestoredResumed() guard is bypassed (e.g. interface stripping in wrapper), Firecracker returns an error on an already-running VM; alert if any success=false row in kernel_invocation_spawn_total for backend_type=hypeman.
Guest network race — VM executes too early before network reconfiguration completes; alert if any "failed to configure guest network after restore" ERROR log appears post-deploy.
inotify miss — socket creation event is missed, blocking until timeout; alert if hypeman spawn count drops below 1,000/hr during active hours (09:00–18:00 UTC).
Env var misconfiguration — HYPEMAN_FIRECRACKER_RESTORE_RESUME_ON_LOAD=0 at deploy silently reverts to old behavior (safe but removes the improvement); verify via "VM was resumed during snapshot load" INFO log appearing post-deploy.
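The double-resume risk comes from a general Go pattern: a wrapper that embeds a narrow interface hides any optional methods of the value it wraps. The sketch below shows the failure and one way to avoid it. All types here (`Hypervisor`, `restoredResumer`, `fakeVM`, and both wrappers) are hypothetical stand-ins; the actual shape of `RestoredResumed` in this PR may differ.

```go
package main

import "fmt"

// Hypervisor is a deliberately narrow stand-in interface.
type Hypervisor interface {
	Resume() error
}

// restoredResumer is the optional capability checked by the guard.
type restoredResumer interface {
	RestoredResumed() bool
}

// fakeVM counts Resume calls so a double resume is observable.
type fakeVM struct {
	resumedOnLoad bool
	resumeCalls   int
}

func (v *fakeVM) Resume() error         { v.resumeCalls++; return nil }
func (v *fakeVM) RestoredResumed() bool { return v.resumedOnLoad }

// strippingWrapper embeds only Hypervisor, so RestoredResumed is hidden.
type strippingWrapper struct{ Hypervisor }

// forwardingWrapper re-exposes the optional method by delegating to the
// wrapped value when it supports it.
type forwardingWrapper struct{ Hypervisor }

func (w forwardingWrapper) RestoredResumed() bool {
	r, ok := w.Hypervisor.(restoredResumer)
	return ok && r.RestoredResumed()
}

// resumeIfNeeded skips Resume only when the hypervisor reports that it
// already resumed during snapshot load.
func resumeIfNeeded(hv Hypervisor) error {
	if r, ok := hv.(restoredResumer); ok && r.RestoredResumed() {
		return nil
	}
	return hv.Resume()
}

func main() {
	stripped := &fakeVM{resumedOnLoad: true}
	_ = resumeIfNeeded(strippingWrapper{stripped})
	fmt.Println("resume calls through stripping wrapper:", stripped.resumeCalls)

	forwarded := &fakeVM{resumedOnLoad: true}
	_ = resumeIfNeeded(forwardingWrapper{forwarded})
	fmt.Println("resume calls through forwarding wrapper:", forwarded.resumeCalls)
}
```

With the stripping wrapper the guard's type assertion fails and Resume is called on an already-running VM; with the forwarding wrapper the call is skipped.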

Status updates will be posted automatically on this PR as monitoring progresses.

sjmiller609 · 2026-06-01T19:48:31Z

closing this one intentionally. resume-on-load saves a small Firecracker API step, but it complicates the restore invariant: when a temporary snapshot source alias is needed, resume_vm: true can let guest execution begin before alias cleanup. The child work does not fundamentally require this optimization, so we are dropping #261 from the production stack and rebasing the UFFD / fork-concurrency PRs directly onto main.

sjmiller609 · 2026-06-01T19:48:32Z

Closing intentionally; see latest comment. The dependent PRs are being rebased directly onto main without the resume-on-load layer.

sjmiller609 and others added 2 commits June 1, 2026 11:41

Add mailbox resume network handoff

87821dd

Optimize Firecracker snapshot resume

a8a47e5

This was referenced Jun 1, 2026

Add restore deep trace debug mode #255

Closed

Optimize Firecracker snapshot resume #256

Closed

Add Firecracker UFFD snapshot pager #257

Closed

sjmiller609 force-pushed the hypeship/network-handoff-v2 branch 2 times, most recently from dffc792 to 05bc363 Compare June 1, 2026 13:52

Base automatically changed from hypeship/network-handoff-v2 to main June 1, 2026 19:18

Merge origin/main into Firecracker resume optimization

b7c7c05

sjmiller609 marked this pull request as ready for review June 1, 2026 19:26

sjmiller609 commented Jun 1, 2026

cursor Bot reviewed Jun 1, 2026

sjmiller609 closed this Jun 1, 2026

sjmiller609 wants to merge 3 commits into main from hypeship/fc-resume-on-load-v2