Fast-start local emulator via RAM snapshot + live secret rotation#1340
Fast-start local emulator via RAM snapshot + live secret rotation#1340
Conversation
Ships a compressed RAM/device snapshot (stack-emulator-<arch>.savevm.zst) alongside the qcow2. `emulator start` resumes from it and rotates the per-install secrets in place, taking cold-boot from 30-120s to ~6-7s. Build phase adds a STACKCFG runtime ISO so stack.service can boot during image creation, starts qemu-guest-agent so its virtio-serial port stays open in the snapshot, then stop+migrate file:+quit via QMP. Runtime sends fresh secrets through QGA guest-exec input-data, which pipes them to trigger-fast-rotate and rotate-secrets inside the container: targeted sed on the placeholder PCK in built JS, UPDATE on the internal ApiKeySet, supervisorctl restart stack-app + cron-jobs. Placeholder hex values are baked in instead of random keys under STACK_EMULATOR_BUILD_SNAPSHOT=1 so no real secret ships in the snapshot. Device topology and SMP must match at capture and resume; runtime adds phantom seed/bundle drives and pins SMP=4. Cold-boot fallback kicks in automatically when the snapshot is missing, corrupt, or incompatible. supervisord.conf now uses stopasgroup/killasgroup for stack-app and cron-jobs so supervisor restart actually kills the Node children (they were keeping their port bindings and breaking rotation).
Snapshot resume drops from ~14s to ~5-7s with rotation, ~2.5s without. Build uses QEMU's mapped-ram + multifd migration capability so the RAM state is written at page-aligned offsets in a sparse file. Runtime decompresses the shipped .savevm.zst once to a local .raw cache and reloads via -incoming file: + migrate-incoming on subsequent starts, avoiding the per-start zstd decode. Adds EMULATOR_NO_ROTATION=1 for tests/CI that don't mind the placeholder secrets; saves the full ~3s rotation window. Misc runtime cleanups: tighter QMP/QGA poll intervals (1s → 0.2s), shorter socat keep-alive windows, 1s settle before the post-rotation health-check to avoid racing old Node processes, fallback path preserves the CLI-generated runtime-config.iso instead of blowing away VM_DIR. Build-time qmp_session keeps stdin open briefly after the caller's commands so migrate-set-capabilities is actually processed before socat closes — without this, mapped-ram was silently a no-op. CI workflow publishes .savevm.zst alongside the .qcow2 (optional asset; CLI falls back to cold boot when missing). Test + verify steps go through the CLI now that ISO generation is owned by packages/stack-cli.
|
The latest updates on your projects. Learn more about Vercel for GitHub.
|
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughAdds QEMU snapshot capture/resume and a Node-based emulator CLI: CI builds/caches QEMU from source, runtime ISO creation, QMP-based VM-state capture/restore, guest-agent secret rotation, updated emulator start/resume flows, and new ISO/CLI tests. Changes
Sequence Diagram(s)sequenceDiagram
participant Host as Host (build-image.sh)
participant QEMU as QEMU VM
participant QMP as QMP Socket
participant Disk as Disk (state.raw)
participant ZStd as Zstd
Host->>QEMU: Launch VM for provisioning
QEMU->>Host: Boot & provision (cloud-init)
Note over QEMU: Start services, emit STACK_SERVICES_READY
Host->>QMP: Set migration caps (mapped-ram, multifd)
Host->>QMP: migrate to file:state.raw
QMP->>Disk: Write VM state
Host->>ZStd: Compress state.raw -> .savevm.zst
Host->>Host: Persist snapshot artifact
sequenceDiagram
participant CLI as stack-cli
participant GH as GitHub API
participant CDN as Release CDN
participant Host as Host (run-emulator.sh)
participant ZStd as Zstd
participant QMP as QMP Socket
participant QEMU as QEMU VM
participant QGA as QGA Guest Agent
CLI->>GH: Query releases & assets
GH-->>CLI: Return qcow2 + optional .savevm.zst metadata
CLI->>CDN: Download qcow2 (streaming)
CLI->>CDN: Download optional .savevm.zst (or skip)
Host->>ZStd: Decompress .savevm.zst -> state.raw
Host->>QEMU: Launch with -incoming defer
Host->>QMP: migrate-incoming file:state.raw
QMP->>QEMU: Restore state
Host->>QMP: cont
QEMU->>QGA: Guest agent becomes available
Host->>QGA: trigger-fast-rotate (guest-exec) -> rotate secrets
QGA->>Host: Rotation complete, emulator ready
Estimated code review effort🎯 4 (Complex) | ⏱️ ~75 minutes Possibly related PRs
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 2 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
Greptile SummaryThis PR adds a RAM snapshot fast-start path to
Confidence Score: 4/5Safe to merge after fixing the snapshot artifact name mismatch in the --run/--pr pull path. One P1 bug: users who pull via --pr or --run get a misleading 'fast-start disabled' message even though the snapshot is present (the .savevm.zst is already extracted by the main artifact download). Functionally the emulator still works, but the false-negative log misleads users testing this feature. All other changes are well-structured, thoroughly tested, and include robust cold-boot fallback paths. packages/stack-cli/src/commands/emulator.ts (lines 521–532): snapshot artifact name mismatch in the --run/--pr download path. Important Files Changed
Sequence DiagramsequenceDiagram
participant CLI as stack-cli
participant Shell as run-emulator.sh
participant QEMU as QEMU VM
participant QGA as qemu-guest-agent
participant Container as stack container
CLI->>Shell: emulator start (generates runtime ISO)
Shell->>Shell: decompress .savevm.zst → .savevm.raw (one-time)
Shell->>QEMU: launch with -incoming defer
Shell->>QEMU: QMP: migrate-set-capabilities (mapped-ram + multifd)
Shell->>QEMU: QMP: migrate-incoming file:.savevm.raw
Shell->>QEMU: QMP: cont
QEMU-->>Shell: status=running (~2–5s)
Shell->>QGA: guest-sync (wait for agent)
Shell->>QGA: guest-exec trigger-fast-rotate (base64 secrets via input-data)
QGA->>Container: docker exec -e PCK/SSK/SAK/CRON_SECRET rotate-secrets
Container->>Container: sed JS files (replace placeholder PCK)
Container->>Container: psql UPDATE ApiKeySet
Container->>Container: supervisorctl restart stack-app cron-jobs
Container-->>Shell: exit 0
Shell->>Shell: wait_for_condition (health checks)
Shell-->>CLI: ready (~5–7s total)
Prompt To Fix All With AIThis is a comment left during a code review.
Path: packages/stack-cli/src/commands/emulator.ts
Line: 521-532
Comment:
**Wrong artifact name for snapshot lookup**
CI produces a single artifact `qemu-emulator-<arch>` containing **both** `.qcow2` and `.savevm.zst`, but this code searches for a non-existent separate artifact `qemu-emulator-<arch>-savevm`. `downloadArtifactByName` returns `false`, so `snapshotDownloaded` stays `false` and the user always sees "Snapshot artifact not available … fast-start disabled" — even though the `.savevm.zst` was already extracted to `imageDir` by the main download above.
The fix is to drop the second download call and instead check whether the file landed on disk:
```suggestion
// The savevm snapshot was packaged inside the same artifact as the qcow2;
// both files land in imageDir after extraction above.
if (existsSync(snapshotDest)) {
console.log(`Downloaded: ${snapshotDest}`);
} else {
console.log(`Snapshot not present in run ${runId}; fast-start disabled.`);
}
```
How can I resolve this? If you propose a fix, please make it concise.
---
This is a comment left during a code review.
Path: docker/local-emulator/qemu/run-emulator.sh
Line: 602-603
Comment:
**`qga.sock` not cleaned up on stop**
`stop_vm` removes `monitor.sock`, `serial.log`, and `runtime-config.iso`, but the new `qga.sock` (added for snapshot-resume QGA comms) is left behind. The socket is functionally harmless at the next start (QEMU recreates it with `server=on`), but stale sockets in `VM_DIR` can mislead `qga_wait_ready` if the code is invoked before the new QEMU process binds.
```suggestion
rm -f "$VM_DIR/qemu.pid" "$VM_DIR/monitor.sock" "$VM_DIR/qga.sock" "$VM_DIR/serial.log"
rm -f "$VM_DIR/runtime-config.iso"
```
How can I resolve this? If you propose a fix, please make it concise.
---
This is a comment left during a code review.
Path: docker/local-emulator/rotate-secrets.sh
Line: 4-6
Comment:
**Stale comment describes the old file-based secret delivery**
Lines 5–6 say "Host writes fresh secrets to `/host/stack-runtime/fresh-secrets.env`", but the current implementation never creates that file. Secrets now arrive exclusively via `docker exec -e` env vars (injected by `trigger-fast-rotate`). The `STACK_ROTATE_INPUT` fallback path at line 80 handles backward-compat, but the header comment will mislead future maintainers.
Suggest updating the description to match the actual flow:
```suggestion
# Called inside the stack container by the emulator snapshot-resume path.
# Fresh secrets arrive as environment variables (STACK_SEED_INTERNAL_PROJECT_*
# and CRON_SECRET), injected by the host via `docker exec -e` through
# trigger-fast-rotate. For backward compatibility, STACK_ROTATE_INPUT may
# point to a file as a fallback.
```
How can I resolve this? If you propose a fix, please make it concise.Reviews (1): Last reviewed commit: "faster snapshot resume via mapped-ram + ..." | Re-trigger Greptile |
There was a problem hiding this comment.
Actionable comments posted: 9
🧹 Nitpick comments (3)
packages/stack-cli/src/lib/iso.ts (1)
400-403: Consider adding error handling forwriteFileSync.The
writeIsofunction directly callswriteFileSyncwithout handling potential I/O errors. Per coding guidelines, avoid uncaught errors. Consider wrapping with appropriate error handling or documenting that callers must handle exceptions.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@packages/stack-cli/src/lib/iso.ts` around lines 400 - 403, The writeIso function calls writeFileSync directly and can throw uncaught I/O errors; wrap the writeFileSync(path, buf) call in a try/catch inside writeIso and handle errors by adding contextual information (include path and volumeId) and either logging via the module/logger used in this package and rethrowing a new Error or rethrowing the original error after attaching context, so callers get clear failure details; keep buildIso(...) unchanged and ensure the catch preserves the original stack (e.g., include err as cause or append err.message) so debugging remains possible.packages/stack-cli/src/commands/emulator.ts (2)
93-102: Validate the credentials payload instead of castingres.json().Line 93 trusts the backend shape via a type assertion, so a malformed response becomes
{ project_id: undefined, ... }and the CLI will happily print/inject bad credentials. Parseunknownand assert the three required string fields before returning. As per coding guidelines, "Validate all assumptions either through the type system (preferred), assertions, or tests." and "Do NOT useas/any/type casts to bypass the type system unless explicitly asked by the user."🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@packages/stack-cli/src/commands/emulator.ts` around lines 93 - 102, The code currently casts res.json() to a credential shape which can hide malformed responses; change to parse the JSON as unknown, then validate that the result is an object and that data.project_id, data.publishable_client_key, and data.secret_server_key exist and are strings (and non-empty if desired) before returning them; replace the type assertion on the `data` variable with a runtime type guard/assertion that throws a descriptive error if validation fails so the function that calls this code (the code around the `data` constant and its return) never returns undefined credentials.
57-69: Use a monotonic clock for elapsed-time calculations.Both the internal-PCK wait loop and the download progress math are derived from
Date.now(), which can jump with wall-clock changes and skew timeouts/ETA.performance.now()is the safer fit in both places. As per coding guidelines, "Don't useDate.now()for measuring elapsed real time. Instead useperformance.now()"Also applies to: 304-313
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@packages/stack-cli/src/commands/emulator.ts` around lines 57 - 69, The readInternalPck() loop (and the download-progress ETA math elsewhere) currently uses Date.now() for elapsed-time which can jump; replace those elapsed/timeout calculations to use a monotonic clock via performance.now() (or Node's perf_hooks.performance) instead: compute deadline as performance.now() + timeoutMs and compare performance.now() in the loop, and update any ETA/progress math that uses Date.now() to use performance.now() so timeout/duration arithmetic stays stable; ensure units remain milliseconds and import or reference the correct performance symbol used in this codebase.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In @.github/workflows/qemu-emulator-build.yaml:
- Around line 58-67: The workflow only installs pnpm and actions/setup-node when
matrix.arch == 'amd64', leaving arm64 builds to rely on the runner's
preinstalled Node; update the job matrix steps so that actions/setup-node@v4
(with node-version: 22 and cache: pnpm) and pnpm/action-setup@v4 (version:
10.23.0) run for both amd64 and arm64 (remove or adjust the if: matrix.arch ==
'amd64' guards around uses: pnpm/action-setup@v4 and uses:
actions/setup-node@v4), ensuring the arm64 build runs the same Node/pnpm setup
used by the amd64 job, including the later node usages in
docker/local-emulator/qemu/build-image.sh and the steps at lines 93-94.
- Around line 230-232: The failure-step "Print serial log on failure" is
triggering SC2086 because the serial log path uses an unquoted $HOME; update the
step's run command to quote the expanded path (i.e., use
"$HOME/.stack/emulator/run/vm/serial.log") so the shell does not perform
word-splitting or globbing; keep the rest of the redirection and fallback
(2>/dev/null || true) unchanged.
In `@docker/local-emulator/entrypoint.sh`:
- Around line 40-42: The CRON_SECRET assignment masks openssl failures because
the command is embedded in export; change it to run openssl rand separately,
check its exit status and exit on failure, then export the variable.
Specifically, in entrypoint.sh replace the inline export with: run openssl rand
-hex 32 into a temp/value variable (e.g., cron_val), if the command fails log an
error and exit non‑zero, otherwise export CRON_SECRET="$cron_val" so openssl
failures are not hidden.
In `@docker/local-emulator/qemu/run-emulator.sh`:
- Around line 691-700: The cold-boot fallback deletes the needed ISO because
stop_vm() always removes "$VM_DIR/runtime-config.iso"; change stop_vm to accept
an optional flag/parameter (e.g., preserve_runtime_config) and skip deleting
runtime-config.iso when that flag is true, then call stop_vm with
preserve_runtime_config=true from snapshot_fallback_to_cold_boot() before
calling cmd_start so ensure_runtime_config_iso can find the ISO; keep default
behavior of stop_vm unchanged for other callers.
- Around line 311-344: The snapshot branch removes the -virtfs hostfs mount so
the guest /host won't be exposed after resume; restore the hostfs virtfs option
into the snapshot resume path by adding the same -virtfs
"local,path=/,mount_tag=hostfs,security_model=none" entry to runtime_only_args
when snapshot_available is true (alongside where seed_phantom and bundle_phantom
are appended and runtime_only_args is modified), ensuring the mount_tag=hostfs
is present for resumed VMs so the CLI's publish-to-host workflow (used by
cloud-init/user-data and packages/stack-cli) continues to work.
- Around line 311-320: When a snapshot is available we must pin RAM to the
build-time size just like SMP is pinned: add a snapshot_ram variable (similar to
snapshot_smp) that defaults to EMULATOR_BUILD_RAM and, when snapshot_available
is true, override VM_RAM (or the -m argument) with that snapshot_ram so QEMU
restores with the same memory layout used at capture; update the snapshot branch
where snapshot_args and snapshot_smp are set (and where -m "$VM_RAM" is later
used) to use the snapshot_ram/EMULATOR_BUILD_RAM instead of the runtime
EMULATOR_RAM when resuming from a snapshot.
In `@docker/local-emulator/rotate-secrets.sh`:
- Around line 90-100: The UPDATE against the "ApiKeySet" table can silently
affect 0 rows; modify the rotate block that runs psql to perform the UPDATE ...
RETURNING id (or SELECT count(*) via GET DIAGNOSTICS) and capture its output,
then assert exactly one row was returned/updated and exit non‑zero (and log an
error) if not; specifically change the psql invocation around the UPDATE of
"ApiKeySet" (the hard-coded id '3142e763-b230-44b5-8636-aa62f7489c26' and the
environment keys STACK_SEED_INTERNAL_PROJECT_*) to return the updated id and
check in the shell that the result is non-empty (or equals one) before
proceeding to restart services.
In `@docker/server/entrypoint.sh`:
- Line 183: The find invocation uses an unquoted variable find $WORK_DIR/apps
... which can cause word splitting; update the command (the find line that
references WORK_DIR) to quote the variable as "$WORK_DIR" so the path is treated
as a single argument and shellcheck warnings are silenced.
In `@packages/stack-cli/src/commands/emulator.ts`:
- Around line 515-532: The code misreports snapshot availability because the
initial artifact extraction of `qemu-emulator-${arch}` may already create
`snapshotDest`; change the logic in the block around `downloaded`,
`snapshotDownloaded` and the `downloadArtifactByName` call so you first check if
`existsSync(snapshotDest)` immediately after the initial download/extract and
set `snapshotDownloaded = true` if present, and only call
`downloadArtifactByName(repo, runId, \`qemu-emulator-${arch}-savevm\`,
imageDir)` when `snapshotDest` does not exist; update the console messages to
reflect the actual condition (use `snapshotDest`, `dest`, `downloaded`,
`snapshotDownloaded`, `downloadArtifactByName`, `imageDir`, `runId`, and `repo`
to find the code).
---
Nitpick comments:
In `@packages/stack-cli/src/commands/emulator.ts`:
- Around line 93-102: The code currently casts res.json() to a credential shape
which can hide malformed responses; change to parse the JSON as unknown, then
validate that the result is an object and that data.project_id,
data.publishable_client_key, and data.secret_server_key exist and are strings
(and non-empty if desired) before returning them; replace the type assertion on
the `data` variable with a runtime type guard/assertion that throws a
descriptive error if validation fails so the function that calls this code (the
code around the `data` constant and its return) never returns undefined
credentials.
- Around line 57-69: The readInternalPck() loop (and the download-progress ETA
math elsewhere) currently uses Date.now() for elapsed-time which can jump;
replace those elapsed/timeout calculations to use a monotonic clock via
performance.now() (or Node's perf_hooks.performance) instead: compute deadline
as performance.now() + timeoutMs and compare performance.now() in the loop, and
update any ETA/progress math that uses Date.now() to use performance.now() so
timeout/duration arithmetic stays stable; ensure units remain milliseconds and
import or reference the correct performance symbol used in this codebase.
In `@packages/stack-cli/src/lib/iso.ts`:
- Around line 400-403: The writeIso function calls writeFileSync directly and
can throw uncaught I/O errors; wrap the writeFileSync(path, buf) call in a
try/catch inside writeIso and handle errors by adding contextual information
(include path and volumeId) and either logging via the module/logger used in
this package and rethrowing a new Error or rethrowing the original error after
attaching context, so callers get clear failure details; keep buildIso(...)
unchanged and ensure the catch preserves the original stack (e.g., include err
as cause or append err.message) so debugging remains possible.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 40549b75-ec4e-4b25-9d99-05d8519d737b
⛔ Files ignored due to path filters (1)
pnpm-lock.yamlis excluded by!**/pnpm-lock.yaml
📒 Files selected for processing (17)
.github/workflows/qemu-emulator-build.yaml.gitignoredocker/local-emulator/Dockerfiledocker/local-emulator/entrypoint.shdocker/local-emulator/qemu/build-image.shdocker/local-emulator/qemu/cloud-init/emulator/user-datadocker/local-emulator/qemu/run-emulator.shdocker/local-emulator/rotate-secrets.shdocker/local-emulator/run-cron-jobs.shdocker/local-emulator/supervisord.confdocker/server/entrypoint.shpackages/stack-cli/package.jsonpackages/stack-cli/src/commands/emulator.test.tspackages/stack-cli/src/commands/emulator.tspackages/stack-cli/src/lib/iso.test.tspackages/stack-cli/src/lib/iso.tspackages/stack-cli/vitest.config.ts
Ubuntu 24.04 (ubicloud-standard-8) ships QEMU 8.2, which predates the mapped-ram migration capability used by the fast-resume snapshot path. Compile 10.2.2 once per runner image and cache the resulting /opt/qemu so subsequent runs are fast.
There was a problem hiding this comment.
Actionable comments posted: 3
♻️ Duplicate comments (2)
.github/workflows/qemu-emulator-build.yaml (2)
300-300:⚠️ Potential issue | 🟡 MinorQuote the serial-log path in the failure handler.
Unquoted
$HOMEstill triggers SC2086 here. Use"$HOME/.stack/emulator/run/vm/serial.log"to avoid word-splitting and globbing.Suggested change
- run: tail -100 $HOME/.stack/emulator/run/vm/serial.log 2>/dev/null || true + run: tail -100 "$HOME/.stack/emulator/run/vm/serial.log" 2>/dev/null || true🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In @.github/workflows/qemu-emulator-build.yaml at line 300, The failure handler uses an unquoted path in the tail command which triggers shellcheck SC2086; update the run step string (the line containing the tail invocation) to quote the expanded path, e.g. replace the unquoted $HOME/.stack/emulator/run/vm/serial.log with "$HOME/.stack/emulator/run/vm/serial.log" (and apply the same quoting to any similar $HOME-based paths in that handler) so word-splitting and globbing are avoided.
63-67:⚠️ Potential issue | 🟠 MajorInstall Node for the arm64 build path too.
The
if: matrix.arch == 'amd64'guard leaves the arm64 build dependent on whatever Node happens to be preinstalled on the runner, even though Line 132 still runsnode docker/local-emulator/generate-env-development.mjsfor every matrix entry. That makes the arm64 path brittle as the runner image changes.Suggested change
- - uses: actions/setup-node@v4 - if: matrix.arch == 'amd64' + - uses: actions/setup-node@v4 with: node-version: 22 cache: pnpm🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In @.github/workflows/qemu-emulator-build.yaml around lines 63 - 67, The workflow only runs actions/setup-node@v4 when `if: matrix.arch == 'amd64'`, leaving arm64 to rely on runner defaults while the job later runs `node docker/local-emulator/generate-env-development.mjs`; remove or widen the `if` guard so setup-node@v4 is executed for all matrix.arch entries (or add a separate setup-node step for arm64) and keep `node-version: 22` and `cache: pnpm` so both amd64 and arm64 use the pinned Node version and pnpm caching.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In @.github/workflows/qemu-emulator-build.yaml:
- Around line 169-179: The "Package image" step currently treats a missing
SAVEVM ("docker/local-emulator/qemu/images/stack-emulator-${{ matrix.arch
}}.savevm.zst") as a note; change it so that when matrix.arch == amd64 and the
SAVEVM file is missing the step fails (exit non‑zero) instead of continuing,
while preserving the current best‑effort behavior for other arches; update the
downstream smoke test logic (the conditional around the cold/fast boot switch)
to not silently fall back for amd64 if the artifact is absent so the CI run
fails early and surfaces the missing fast-resume snapshot.
- Around line 144-167: The workflow only runs emulator start/status/stop for
amd64 so arm64 snapshots can be published without a resume check; update the
emulator verification steps (the jobs named "Start emulator and verify", "Verify
services are healthy", and "Stop emulator" that run node
packages/stack-cli/dist/index.js emulator start|status|stop) to also run for
arm64 (change the if condition from matrix.arch == 'amd64' to include arm64,
e.g. matrix.arch == 'amd64' || matrix.arch == 'arm64' or remove the arch guard)
so arm64 gets the same resume smoke test, or alternatively add a guard in the
publish step to skip uploading the arm64 .savevm.zst unless a prior
resume/status check passed.
- Around line 90-105: In the "Build QEMU 10.2.2 from source" step, download the
matching .tar.xz.asc signature, import the QEMU release signing key (via gpg
--import or gpg --keyserver ...) and run gpg --verify /tmp/qemu.tar.xz.asc
/tmp/qemu.tar.xz; if verification fails exit non‑zero and do not extract; apply
the same signature download/import/verify flow at the other QEMU download
location as well so both build and test jobs validate the tarball before tar
-xf.
---
Duplicate comments:
In @.github/workflows/qemu-emulator-build.yaml:
- Line 300: The failure handler uses an unquoted path in the tail command which
triggers shellcheck SC2086; update the run step string (the line containing the
tail invocation) to quote the expanded path, e.g. replace the unquoted
$HOME/.stack/emulator/run/vm/serial.log with
"$HOME/.stack/emulator/run/vm/serial.log" (and apply the same quoting to any
similar $HOME-based paths in that handler) so word-splitting and globbing are
avoided.
- Around line 63-67: The workflow only runs actions/setup-node@v4 when `if:
matrix.arch == 'amd64'`, leaving arm64 to rely on runner defaults while the job
later runs `node docker/local-emulator/generate-env-development.mjs`; remove or
widen the `if` guard so setup-node@v4 is executed for all matrix.arch entries
(or add a separate setup-node step for arm64) and keep `node-version: 22` and
`cache: pnpm` so both amd64 and arm64 use the pinned Node version and pnpm
caching.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: a447ac40-4b80-485b-9549-6f893e4754da
📒 Files selected for processing (1)
.github/workflows/qemu-emulator-build.yaml
Switch the CLI build step from `pnpm --filter @stackframe/stack-cli run build` to `turbo run build --filter=@stackframe/stack-cli...` so that stack-cli's workspace dependencies (@stackframe/js and @stackframe/stack-shared) also get compiled to their dist/ outputs. Without them, `node dist/index.js` fails with ERR_MODULE_NOT_FOUND at import time.
There was a problem hiding this comment.
🧹 Nitpick comments (1)
.github/workflows/qemu-emulator-build.yaml (1)
216-237: Consider extracting the QEMU build into a composite action.The QEMU cache restore + build-from-source logic is duplicated between the
buildandtestjobs. Extracting it into a composite action or reusable workflow would reduce duplication and ensure both jobs stay in sync.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In @.github/workflows/qemu-emulator-build.yaml around lines 216 - 237, Extract the duplicated QEMU restore+build steps into a reusable composite action (e.g., .github/actions/qemu-build) or a reusable workflow and replace the inline blocks in both the build and test jobs with a single call; move the actions/cache@v4 restore step and the shell build script (the block under the "Build QEMU 10.2.2 from source" step that references /opt/qemu, key qemu-10.2.2-${{ runner.os }}-${{ runner.arch }}-v1, and the configure/make/install sequence) into the composite action, expose inputs for version, prefix, target-list and cache key, and return/cache-hit as an output so both jobs can consume the same behavior without duplication.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In @.github/workflows/qemu-emulator-build.yaml:
- Around line 216-237: Extract the duplicated QEMU restore+build steps into a
reusable composite action (e.g., .github/actions/qemu-build) or a reusable
workflow and replace the inline blocks in both the build and test jobs with a
single call; move the actions/cache@v4 restore step and the shell build script
(the block under the "Build QEMU 10.2.2 from source" step that references
/opt/qemu, key qemu-10.2.2-${{ runner.os }}-${{ runner.arch }}-v1, and the
configure/make/install sequence) into the composite action, expose inputs for
version, prefix, target-list and cache key, and return/cache-hit as an output so
both jobs can consume the same behavior without duplication.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 24bafb7a-caa3-4fc2-872e-04f655beca5f
📒 Files selected for processing (1)
.github/workflows/qemu-emulator-build.yaml
First downloadArtifactByName already extracts both qcow2 and savevm.zst
from the single qemu-emulator-${arch} artifact; the second lookup for a
nonexistent -savevm artifact always failed and produced a misleading
'fast-start disabled' message.
The docker/server image runs as the unprivileged `node` user, which cannot write to /var/run. With `set -e` at the top of the script, the failed `touch` aborted execution after sentinel replacement but before the backend/dashboard were started — the Check server health CI step then saw connection refused on ports 8101/8102. Move the marker into $WORK_DIR (which is already created and owned by the running user). The emulator snapshot-resume path still benefits: the marker persists across supervisorctl restarts because $WORK_DIR lives on the container filesystem.
There was a problem hiding this comment.
🧹 Nitpick comments (5)
docker/server/entrypoint.sh (1)
156-159: Consider using null-delimited pipeline for robustness.The
find | xargschain without-print0/-0can fail on filenames containing spaces or newlines. While the built app tree is unlikely to have such filenames, using null-delimited handling would silence the shellcheck warning and be more robust.♻️ Suggested improvement
- unhandled_sentinels=$(find "$WORK_DIR/apps" -type f -exec grep -l "STACK_ENV_VAR_SENTINEL" {} + | \ - xargs grep -h "STACK_ENV_VAR_SENTINEL" | \ + unhandled_sentinels=$(find "$WORK_DIR/apps" -type f -print0 | \ + xargs -0 grep -l "STACK_ENV_VAR_SENTINEL" 2>/dev/null | \ + xargs grep -h "STACK_ENV_VAR_SENTINEL" 2>/dev/null | \ grep -o "STACK_ENV_VAR_SENTINEL[A-Z_]*" | \ sort -u | grep -v "^STACK_ENV_VAR_SENTINEL$")Alternatively, keep the current approach if you're confident the build output won't contain problematic filenames.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docker/server/entrypoint.sh` around lines 156 - 159, The pipeline that builds unhandled_sentinels using find/xargs/grep is unsafe for filenames with spaces/newlines; update the commands that produce and consume file lists (the find invocation used to populate unhandled_sentinels and the xargs usage) to use null-delimited output/inputs (e.g., find ... -print0 and xargs -0 or use grep -Z) so filenames are handled robustly, and preserve the subsequent grep/sort -u/grep -v processing to produce the same STACK_ENV_VAR_SENTINEL list.packages/stack-cli/src/commands/emulator.ts (4)
620-622: Voided promise in cleanup path.Per coding guidelines, promises should not be voided with
.catch(() => {}). While this is a best-effort cleanup beforeprocess.exit(), consider at minimum logging the error for debugging purposes.♻️ Suggested fix
console.log("\nStopping emulator..."); runEmulator("stop") - .catch(() => { /* best-effort stop */ }) + .catch((err) => { + // Best-effort stop — log but don't block exit + console.warn(`Warning: failed to stop emulator cleanly: ${err instanceof Error ? err.message : err}`); + }) .finally(() => process.exit(exitCode));🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@packages/stack-cli/src/commands/emulator.ts` around lines 620 - 622, The cleanup call currently voids errors from runEmulator("stop") with .catch(() => {}); change this to surface at least the error via logging (e.g., use the existing logger or console to log the error) so failures aren't silently ignored while preserving the finally that calls process.exit(exitCode); locate the runEmulator("stop") invocation in emulator.ts and replace the empty catch with a handler that logs the caught error (including context like "emulator stop failed") before allowing the finally block to run.
57-68: Useperformance.now()for elapsed time measurement.Per coding guidelines,
Date.now()should not be used for measuring elapsed real time. This timeout logic should useperformance.now()instead.♻️ Suggested fix
async function readInternalPck(timeoutMs = 60_000): Promise<string> { const path = internalPckPath(); - const deadline = Date.now() + timeoutMs; + const deadline = performance.now() + timeoutMs; let delay = 250; - while (Date.now() < deadline) { + while (performance.now() < deadline) { if (existsSync(path)) { const contents = readFileSync(path, "utf-8").trim(); if (contents) return contents; } await new Promise((r) => setTimeout(r, delay)); delay = Math.min(delay * 2, 2000); } throw new CliError(`Timed out waiting for emulator internal publishable client key at ${path}`); }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@packages/stack-cli/src/commands/emulator.ts` around lines 57 - 68, In readInternalPck replace use of Date.now() for deadline/loop checks with performance.now(): compute deadline = performance.now() + timeoutMs and change the while condition and any timestamps to compare performance.now() against the deadline; update any related variables (e.g., the deadline calculation and loop condition inside readInternalPck) and ensure the runtime provides performance (import from 'perf_hooks' if needed) so elapsed-time measurement follows the coding guideline.
174-174: Consider usingSTACK_prefix forPORT_PREFIX.Per coding guidelines, environment variables should be prefixed with
STACK_(orNEXT_PUBLIC_STACK_). The fallback toPORT_PREFIXwithout prefix may be intentional for backwards compatibility, but consider documenting this or migrating toSTACK_PORT_PREFIX.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@packages/stack-cli/src/commands/emulator.ts` at line 174, The environment variable lookup for portPrefix currently reads process.env.PORT_PREFIX and process.env.NEXT_PUBLIC_STACK_PORT_PREFIX; update it to prefer the STACK_ names by checking process.env.STACK_PORT_PREFIX then process.env.NEXT_PUBLIC_STACK_PORT_PREFIX, optionally retaining process.env.PORT_PREFIX only as a final fallback for backwards compatibility (or remove it if migrating). Modify the initializer that sets portPrefix (the const portPrefix assignment) to reflect the new precedence and add a brief inline comment noting the compatibility fallback if you keep process.env.PORT_PREFIX.
304-312: Useperformance.now()for download timing.The download speed calculation uses
Date.now()for elapsed time measurement, which should beperformance.now()per coding guidelines.♻️ Suggested fix
- const startedAt = Date.now(); + const startedAt = performance.now(); let downloaded = 0; let lastRender = 0; const render = (final: boolean) => { - const now = Date.now(); + const now = performance.now(); if (!final && now - lastRender < 100) return; lastRender = now; - const elapsed = Math.max(0.001, (now - startedAt) / 1000); + const elapsed = Math.max(0.001, (now - startedAt) / 1000);🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@packages/stack-cli/src/commands/emulator.ts` around lines 304 - 312, The timing logic in the download progress (startedAt, lastRender and render) uses Date.now(); replace it with performance.now() so elapsed uses high-resolution timing—set startedAt = performance.now(), compute now = performance.now() inside render, keep the 100ms throttle and compute elapsed = Math.max(0.001, (now - startedAt) / 1000). If this runs in Node and performance is not globally available, import performance from 'perf_hooks' at the top of the module; update references in the render function accordingly.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@docker/server/entrypoint.sh`:
- Around line 156-159: The pipeline that builds unhandled_sentinels using
find/xargs/grep is unsafe for filenames with spaces/newlines; update the
commands that produce and consume file lists (the find invocation used to
populate unhandled_sentinels and the xargs usage) to use null-delimited
output/inputs (e.g., find ... -print0 and xargs -0 or use grep -Z) so filenames
are handled robustly, and preserve the subsequent grep/sort -u/grep -v
processing to produce the same STACK_ENV_VAR_SENTINEL list.
In `@packages/stack-cli/src/commands/emulator.ts`:
- Around line 620-622: The cleanup call currently voids errors from
runEmulator("stop") with .catch(() => {}); change this to surface at least the
error via logging (e.g., use the existing logger or console to log the error) so
failures aren't silently ignored while preserving the finally that calls
process.exit(exitCode); locate the runEmulator("stop") invocation in emulator.ts
and replace the empty catch with a handler that logs the caught error (including
context like "emulator stop failed") before allowing the finally block to run.
- Around line 57-68: In readInternalPck replace use of Date.now() for
deadline/loop checks with performance.now(): compute deadline =
performance.now() + timeoutMs and change the while condition and any timestamps
to compare performance.now() against the deadline; update any related variables
(e.g., the deadline calculation and loop condition inside readInternalPck) and
ensure the runtime provides performance (import from 'perf_hooks' if needed) so
elapsed-time measurement follows the coding guideline.
- Line 174: The environment variable lookup for portPrefix currently reads
process.env.PORT_PREFIX and process.env.NEXT_PUBLIC_STACK_PORT_PREFIX; update it
to prefer the STACK_ names by checking process.env.STACK_PORT_PREFIX then
process.env.NEXT_PUBLIC_STACK_PORT_PREFIX, optionally retaining
process.env.PORT_PREFIX only as a final fallback for backwards compatibility (or
remove it if migrating). Modify the initializer that sets portPrefix (the const
portPrefix assignment) to reflect the new precedence and add a brief inline
comment noting the compatibility fallback if you keep process.env.PORT_PREFIX.
- Around line 304-312: The timing logic in the download progress (startedAt,
lastRender and render) uses Date.now(); replace it with performance.now() so
elapsed uses high-resolution timing—set startedAt = performance.now(), compute
now = performance.now() inside render, keep the 100ms throttle and compute
elapsed = Math.max(0.001, (now - startedAt) / 1000). If this runs in Node and
performance is not globally available, import performance from 'perf_hooks' at
the top of the module; update references in the render function accordingly.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 764ee52b-d2d7-4018-9826-9079a0127b7d
⛔ Files ignored due to path filters (1)
pnpm-lock.yamlis excluded by!**/pnpm-lock.yaml
📒 Files selected for processing (3)
docker/server/entrypoint.shpackages/stack-cli/package.jsonpackages/stack-cli/src/commands/emulator.ts
🚧 Files skipped from review as they are similar to previous changes (1)
- packages/stack-cli/package.json
- stop_vm no longer deletes runtime-config.iso; the CLI owns its lifecycle and the snapshot → cold-boot fallback needs it preserved (cmd_reset still wipes RUN_DIR for a full reset). Also sweeps qga.sock. - Write internal-pck to \$VM_DIR on the host in snapshot mode. Cold boot publishes this via virtfs/9p; snapshot mode drops virtfs, so --config-file flows would otherwise hang. Handles both the rotation path (fresh PCK) and EMULATOR_NO_ROTATION (placeholder PCK). - Pin RAM in snapshot mode to the build-time 4096 (overridable via EMULATOR_SNAPSHOT_RAM). Migration replay requires an identical -m value, same constraint as CPU count. - Fail amd64 build when .savevm.zst is missing rather than shipping a cold-boot-only release silently. arm64 stays best-effort for now because it runs under TCG and can't be verified end-to-end. - Install Node/pnpm on both arches. arm64 also runs generate-env-development.mjs, which otherwise relied on the runner image's preinstalled Node.
There was a problem hiding this comment.
🧹 Nitpick comments (1)
docker/local-emulator/qemu/run-emulator.sh (1)
690-700: Consider a more robust readiness check after rotation.The
sleep 1workaround (line 694) addresses a race between supervisor sending SIGTERM and the old Node processes dying. This is pragmatic but could still race if the old processes take longer than 1 second to stop. A more robust approach would be to wait for the services to become unavailable first (indicating old processes stopped), then wait for them to become available again.💡 Optional: Wait for services to go down before checking up
if ! qga_trigger_fast_rotate; then warn "Failed to trigger rotate-secrets — falling back to cold boot." snapshot_fallback_to_cold_boot return fi - # Wait for the *new* backend (post-supervisor-restart) to actually be - # listening. all_ready may briefly return true against the OLD Node - # processes between when supervisor sends SIGTERM and when the children - # die; sleep a beat so we measure the real readiness. - sleep 1 + # Wait for services to become unavailable (old processes stopping), + # then wait for them to come back up (new processes starting). + local down_deadline=$((SECONDS + 10)) + while [ "$SECONDS" -lt "$down_deadline" ] && all_ready; do + sleep 0.2 + done if ! wait_for_condition "rotated services" "$SNAPSHOT_READY_TIMEOUT" all_ready; then🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docker/local-emulator/qemu/run-emulator.sh` around lines 690 - 700, The single-second sleep after supervisor rotation is racy; replace the blind sleep with a two-step readiness check that first waits for services to become unavailable (old processes stopped) and then waits for them to be available again using wait_for_condition: call wait_for_condition (e.g., "services down") with a short timeout polling !all_ready or an all_down helper until services are confirmed down, then call wait_for_condition "rotated services" "$SNAPSHOT_READY_TIMEOUT" all_ready as before; update the block around wait_for_condition, all_ready, and SNAPSHOT_READY_TIMEOUT (and keep the existing fallback to snapshot_fallback_to_cold_boot and tail_vm_logs) so we avoid the race without a fixed sleep.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@docker/local-emulator/qemu/run-emulator.sh`:
- Around line 690-700: The single-second sleep after supervisor rotation is
racy; replace the blind sleep with a two-step readiness check that first waits
for services to become unavailable (old processes stopped) and then waits for
them to be available again using wait_for_condition: call wait_for_condition
(e.g., "services down") with a short timeout polling !all_ready or an all_down
helper until services are confirmed down, then call wait_for_condition "rotated
services" "$SNAPSHOT_READY_TIMEOUT" all_ready as before; update the block around
wait_for_condition, all_ready, and SNAPSHOT_READY_TIMEOUT (and keep the existing
fallback to snapshot_fallback_to_cold_boot and tail_vm_logs) so we avoid the
race without a fixed sleep.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 325e1ebd-d1ce-4c10-967a-664e70b2d258
📒 Files selected for processing (2)
.github/workflows/qemu-emulator-build.yamldocker/local-emulator/qemu/run-emulator.sh
🚧 Files skipped from review as they are similar to previous changes (1)
- .github/workflows/qemu-emulator-build.yaml
- run-emulator.sh: drop wait_for_condition poll interval from 1s to 0.2s - emulator.ts: replace existsSync+readFileSync TOCTOU in readInternalPck with try/ENOENT; tighten initial backoff to 50ms; drop redundant mkdirSync in startEmulator; surface stop-failure on stderr instead of swallowing silently - iso.ts: inline trivial buildRootDirRecordInVD wrapper
…ct start - build-image.sh: move runtime.iso drive before netdev so its virtio-blk slot precedes virtio-net-pci, matching run-emulator.sh's resume argv. Previously migrate-incoming against CI's savevm hit a device-tree mismatch and only looked green because snapshot_fallback_to_cold_boot silently retried as cold boot. - run-emulator.sh: drop early-return in ensure_runtime_config_iso so PORT_PREFIX/EMULATOR_*_PORT changes take effect on every start; the preserved ISO from a prior run would otherwise silently override the host-forward ports picked up by QEMU's netdev. - common.sh: fix backslash-escaped JSON in capture_vm_state's migrate- timeout diagnostic; single-quoted printf was emitting literal backslashes, so QMP replied with a parse error instead of the real query-migrate status.
ensure_runtime_config_iso unconditionally fell through to make_iso_from_dir, which still required hdiutil/mkisofs/genisoimage — the host dep the lib/iso.ts TS writer was supposed to remove. The Smoke Test job doesn't install genisoimage, so emulator start failed. CLI now sets STACK_EMULATOR_CLI_WROTE_ISO=1 and the shell short-circuits when that flag plus a non-empty ISO are present.
…-host ISO Two bugs surfaced by end-to-end testing against a freshly-built qcow2: 1. $STACK_EMULATOR_CLI_WROTE_ISO was referenced unguarded under `set -u`, so any code path that didn't set it (capture, direct-shell) tripped `unbound variable` before reaching the early-return. Use :- default. 2. ensure_runtime_config_iso was overwriting cmd_capture's specialized empty-VM_DIR_HOST ISO with the host-dir variant. Since virtfs is detached in capture mode, run-stack-container then tried to publish internal-pck to /host/... and restart-looped stack.service, so no service ever became healthy and capture aborted after 240s. Previously masked by snapshot_fallback_to_cold_boot; 510ef38 fixed the fallback mask and exposed this. Skip regen when EMULATOR_CAPTURING_SNAPSHOT=1.
Splits EMULATOR_BUILD_SNAPSHOT into two independent flags: * EMULATOR_BUILD_SNAPSHOT (default 1) — bake placeholder PCK/SSK/SAK/ CRON_SECRET into the qcow2 so runtime rotate-secrets can swap them per install. Cheap; no extra wall time. * EMULATOR_CAPTURE_SAVEVM (default 0) — start the stack, wait for backend+dashboard health, then capture savevm.zst via QMP. Implies BUILD_SNAPSHOT. CI never captures (snapshots aren't portable across KVM/HVF/TCG; users capture locally on first `stack emulator pull`). The previous default of capturing in CI was wasted work on amd64 and made arm64 fail — wait-for-stack-ready couldn't bring the stack up under cross-arch TCG inside its 600s budget, so cloud-final.service was marked failed.
Prisma's default interactive-transaction timeout is 5s, but under cross-arch arm64 TCG in the emulator qcow2 build this single batch (deleteMany + createMany for events + ipInfos) takes ~10s. Bump just this call to 30s. Production (KVM/native) runs it in <1s, so the looser bound only engages when the DB is genuinely slow. Per-call option — no other transaction is affected.
runBulldozerPaymentsInit / paginatedIngress issues a single $executeRaw per row that writes a JSONB payload. Under cross-arch arm64 TCG in the qcow2 build it takes ~31s per row and Postgres kills it with code 57014 (canceling statement due to statement timeout). 120s covers the observed time with a ~4× safety margin.
|
Promptless prepared a documentation update related to this change. Triggered by PR #1340 Updated emulator documentation with fast-start capability (~5-8s startup via VM snapshot resume vs 30-120s cold boot), new CLI options ( Review: Document local emulator fast-start and CLI enhancements |
Summary
stack emulator startnow resumes a fully-warm VM snapshot instead of cold-booting, bringing startup from 30–120s down to ~5–8s with per-install secret rotation, or ~2.5s with rotation opt-out. The snapshot is captured locally on firststack emulator pull, not shipped from CI — QEMU migration state isn't portable across accelerators (KVM/HVF/TCG) or-cpu maxfeature sets, so a CI-captured snapshot couldn't resume reliably on arbitrary user hardware.Also bundles a pile of CLI QoL fixes (progress bars, PR/run artifact pulls, PR-build download, native-TS ISO writer replacing
hdiutil/mkisofs/genisoimagehost dep, unit tests).stack emulator pull(one-time, includes local snapshot capture)EMULATOR_NO_ROTATION=1Backend (
/health?db=1) and dashboard (/handler/sign-in) return 200 on all paths. Two successive snapshot resumes produce different rotated PCK/SSK/SAK/CRON_SECRET values per install.How it works
Build (CI) —
docker/local-emulator/qemu/build-image.sh:stack-emulator-<arch>.qcow2.STACK_EMULATOR_BUILD_SNAPSHOT=1)..savevm.zstships.Pull (user's machine) —
packages/stack-cli/src/commands/emulator.ts+run-emulator.sh capture:stack emulator pulldownloads the qcow2 with a progress bar (or from a PR / workflow run via--pr/--run).run-emulator.sh capture: cold-boots the qcow2 with a matching device layout (phantom ISOs, fsdev, pcie-root-port, virtfs detached — migration-incompatible), waits for backend+dashboard health, then drives QMP:stop→ setmapped-ram+multifdcaps →migrate file:state.raw→ pollquery-migrate→quit. Raw mapped-ram file is zstd-compressed tostack-emulator-<arch>.savevm.zstin the images dir.--skip-snapshotopts out (firststartwill then cold-boot).Runtime —
run-emulator.sh start:-incoming deferwhen a.savevm.zstis present; decompress on first use, keep the.rawcached for subsequent starts.mapped-ram+multifdcaps →migrate-incoming file:<.raw>→ poll forpaused→cont.guest-exec input-data→trigger-fast-rotatein the guest →docker exec -e … rotate-secrets.rotate-secretsin the container: validate keys (hex-only), targetedsedon the placeholder PCK across built JS,UPDATE ApiKeySet,supervisorctl restart stack-app cron-jobs(withstopasgroup/killasgroupso the Node children actually die and release their ports).Security model: placeholder hex values are baked into the snapshot (
00…ffPCK,00…eeSSK,00…ddSAK,00…ccCRON_SECRET). They are non-secret by construction. Real per-install secrets are generated at eachemulator startand never leave the host.CLI changes (
packages/stack-cli)src/lib/iso.ts(new): native TypeScript ISO 9660 + Joliet writer, replacing the host-sidehdiutil/mkisofs/genisoimagedependency for generating the STACKCFG runtime config disk. Unit tests insrc/lib/iso.test.ts.src/commands/emulator.ts:pull: streamed downloads with progress bar + ETA;--pr <number>and--run <id>to pull from a PR build's CI artifacts (usesextract-zipfor the nested zip);--skip-snapshotto opt out of the one-time local capture.start(existing, extended): auto-pulls AND auto-captures when no image exists, so first-everstartis self-bootstrapping; emitsSTACK_EMULATOR_CLI_WROTE_ISO=1so the shell helper skips its own ISO regen (avoids the genisoimage host dep).capture(new, invoked bypulland the auto-pull path ofstart): drives the local snapshot capture viarun-emulator.sh.status,stop,reset,list-releases: preflight + path-resolution tightening (STACK_EMULATOR_HOME→ images/run dirs).src/commands/emulator.test.ts.EMULATOR_NO_ROTATION=1env var skips the post-resume rotation (intended for tests/CI where the placeholder secrets are fine — comes with a loud warning).CI (
.github/workflows/qemu-emulator-build.yaml)mapped-ram/multifdmigration capabilities aren't available in the distro's QEMU. Enables KVM on ubicloud runners so amd64 boots at hardware speed.ubicloud-standard-8); arm64 runs under cross-arch TCG (provisioning only — boot/verify smoke test is amd64-only).emulator start→emulator status→emulator stopagainst the freshly-built qcow2 (viaSTACK_EMULATOR_HOMEpointing at the workspace, so the CLI doesn't silently auto-pull a prior release)..savevm.zstupload / publish.Key files
Shell / guest:
docker/local-emulator/qemu/build-image.sh— snapshot-compatible device topology + STACKCFG runtime ISO at build timedocker/local-emulator/qemu/run-emulator.sh—start,capture,stop,reset,status;-incoming defer,.rawcache, QGA-driven rotation, cold-boot fallbackdocker/local-emulator/qemu/common.sh(new) — sharedqmp_session+capture_vm_state(factored out so build-image.sh and run-emulator.sh share the capture path)docker/local-emulator/qemu/cloud-init/emulator/user-data— placeholder secrets in snapshot mode,wait-for-stack-ready,trigger-fast-rotate, qemu-guest-agent enableddocker/local-emulator/rotate-secrets.sh(new) — in-container rotation (sed + UPDATE + supervisorctl)docker/local-emulator/supervisord.conf—stopasgroup/killasgrouponstack-appandcron-jobsdocker/local-emulator/entrypoint.sh— only mint CRON_SECRET if unset (placeholder supplied in snapshot mode via --env-file)docker/local-emulator/Dockerfile— shipsrotate-secretsto/usr/local/bindocker/server/entrypoint.sh— source/run/stack-auth/rotated-secrets.env; skip full-tree sentinel scan on warm restarts via markerCLI:
packages/stack-cli/src/lib/iso.ts(new) +iso.test.ts(new)packages/stack-cli/src/commands/emulator.ts+emulator.test.ts(new)packages/stack-cli/vitest.config.ts(new)CI:
.github/workflows/qemu-emulator-build.yamlTest plan
docker/local-emulator/qemu/build-image.sh {amd64,arm64}producesstack-emulator-<arch>.qcow2with snapshot-compatible topologystack emulator pulldownloads qcow2 with progress, then captures locally (~1–3 min) and writesstack-emulator-<arch>.savevm.zstin the images dirstack emulator pull --skip-snapshotstops after downloadstack emulator pull --pr <n>/--run <id>pull from PR / workflow run artifactsstack emulator starton a fresh dir auto-pulls and auto-captures, then starts; subsequent starts fast-resume in ~5–8s; backend + dashboard return 200EMULATOR_NO_ROTATION=1 stack emulator startcompletes in ~2.5s; backend + dashboard return 200 with warning printedemulator startinvocations produce different PCK values in the internalApiKeySetrowstack emulator status/stop/resetresolve paths fromSTACK_EMULATOR_HOMEpnpm lintandpnpm typecheckpass; stack-cli unit tests (iso + emulator) passgh release download emulator-<branch>-latestcontains onlystack-emulator-<arch>.qcow2once this PR merges and publish runsSummary by CodeRabbit
New Features
Bug Fixes
Tests