Skip to content

local emulator image optimization#1318

Closed
BilalG1 wants to merge 4 commits intodevfrom
local-emulator-image-optimization
Closed

local emulator image optimization#1318
BilalG1 wants to merge 4 commits intodevfrom
local-emulator-image-optimization

Conversation

@BilalG1
Copy link
Copy Markdown
Collaborator

@BilalG1 BilalG1 commented Apr 9, 2026

Summary by CodeRabbit

  • Chores

    • Optimized emulator images with binary stripping and UPX compression, producing slimmer final images and preserving a node_modules snapshot during image slimming.
    • Improved build/provisioning flow with incremental progress output, better failure diagnostics, persisted provisioning logs, and guest-mounted progress logging.
    • Propagated per-architecture build env and added runtime host-port mappings.
    • CI: added KVM enablement/checks for QEMU jobs.
  • Tests

    • Added a fast QEMU boot-and-serial verification script for emulator validation.

@vercel
Copy link
Copy Markdown

vercel Bot commented Apr 9, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

Project Deployment Actions Updated (UTC)
stack-auth-hosted-components Ready Ready Preview, Comment Apr 10, 2026 4:41pm
stack-backend Ready Ready Preview, Comment Apr 10, 2026 4:41pm
stack-dashboard Ready Ready Preview, Comment Apr 10, 2026 4:41pm
stack-demo Ready Ready Preview, Comment Apr 10, 2026 4:41pm
stack-docs Ready Ready Preview, Comment Apr 10, 2026 4:41pm
stack-preview-backend Ready Ready Preview, Comment Apr 10, 2026 4:41pm
stack-preview-dashboard Ready Ready Preview, Comment Apr 10, 2026 4:41pm

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented Apr 9, 2026

📝 Walkthrough

Walkthrough

Adds UPX compression and stripping of emulator service binaries; enhances QEMU build tooling with provision log tracking, incremental output, virtfs-mounted provisioning, and direct image conversion; restructures cloud-init to run build migrations, slim the emulator image, and emit completion markers.

Changes

Cohort / File(s) Summary
Dockerfile / binary optimization
docker/local-emulator/Dockerfile
Added upx-compress build stage installing upx-ucl/binutils, strips and UPX-compresses service binaries (Svix/MinIO/MC/QStash; ClickHouse left stripped), updates final image to copy from compressed artifacts, and persists /app/node_modules to /app/node_modules.standalone before migration-pruner overwrites it.
QEMU build tooling & helpers
docker/local-emulator/qemu/build-image.sh, docker/local-emulator/qemu/test-serial.sh
Introduced provision-log handling, functions to detect success/failure markers, incremental progress printing, enhanced timeout/diagnostics, persisting logs, virtfs mount of host tmp, disk discard/unmap qemu options, direct tmp-img convert/compress flow, build-env generation, and a new fast serial verification script.
Cloud-init & provisioning pipeline
docker/local-emulator/qemu/cloud-init/emulator/user-data
Replaced inline runcmd with /usr/local/bin/provision-build; added log-provision, run-build-migrations, slim-docker-image; conditionally copy /mnt/stack-bundle/build.env; write /etc/stack-build-computed.env; compute host-port vars and set STACK_SKIP_* flags; emit STACK_CLOUD_INIT_DONE and handle failure exit markers.
Runtime config generation
docker/local-emulator/qemu/run-emulator.sh
prepare_runtime_config_iso() now includes additional STACK_EMULATOR_*_HOST_PORT entries (dashboard/backend/minio/inbucket) into runtime.env used by the guest.
CI workflow
.github/workflows/qemu-emulator-build.yaml
Install qemu-kvm, add step to enable KVM access via udev rule and probe /dev/kvm, and emit warning/fallback message when KVM not writable.

Sequence Diagram(s)

sequenceDiagram
    participant Guest as QEMU Guest
    participant CI as cloud-init (guest scripts)
    participant Build as Build Scripts (/usr/local/bin)
    participant DockerHost as Docker Host

    CI->>Guest: mount runtime-config.iso (with STACK_EMULATOR_*_HOST_PORT)
    CI->>Guest: exec /usr/local/bin/provision-build
    provision-build->>Build: log-provision -> run-build-migrations
    Build->>DockerHost: start deps-only container (uses /etc/stack-build.env + computed env)
    DockerHost->>DockerHost: wait for init/deps markers
    DockerHost->>DockerHost: run backend migrations & seed
    DockerHost-->>Build: migrations complete
    Build->>DockerHost: slim-docker-image (swap node_modules, smoke test, flatten/export)
    DockerHost-->>Build: return slim image
    Build->>Guest: write STACK_CLOUD_INIT_DONE to serial/console
    Guest-->>CI: serial shows STACK_CLOUD_INIT_DONE (provision complete)
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

  • emu with a q stuff #1266: Modifies the same docker/local-emulator Dockerfile and emulator tooling; likely overlapping changes to binary handling and image build steps.

Poem

🐰 I nudged the bins, I hopped and chewed,

Stripped their coats and made them shrewd.
Logs that whisper, mounts that sing,
Migrations run — the dev bells ring.
A tiny hop, the emulator's tuned.

🚥 Pre-merge checks | ✅ 1 | ❌ 2

❌ Failed checks (2 warnings)

Check name Status Explanation Resolution
Description check ⚠️ Warning The pull request description is empty except for the template boilerplate, providing no explanation of the changes, rationale, or implementation details. Add a comprehensive description explaining the optimization strategy, key changes (UPX compression, provisioning refactoring, KVM support), and the impact on build/image size.
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (1 passed)
Check name Status Explanation
Title check ✅ Passed The title directly relates to the main objective of the changeset: optimizing the local emulator Docker image through binary compression, build optimization, and provisioning improvements.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch local-emulator-image-optimization

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (1)
docker/local-emulator/qemu/cloud-init/emulator/user-data (1)

291-311: Aggressive Docker storage nuke is intentional but worth documenting.

This approach (stop Docker → delete /var/lib/docker → restart → restore) is the most reliable way to reclaim space since docker image prune can't free layers referenced by build cache. It's appropriate for a build-time operation where we control the full Docker state.

Consider adding a brief comment explaining why a simple prune isn't sufficient, to prevent future maintainers from "simplifying" this.

📝 Optional: Add explanatory comment
       # Save the final image and volume data, nuke ALL Docker storage
       # (images, build cache, overlay2 layers), then reload. This is the
       # only reliable way to reclaim space — the build cache holds refs
       # to old layers, preventing docker image prune from freeing them.
+      # A simple `docker system prune` is insufficient because build cache
+      # references keep old layers allocated even after intermediate images
+      # are removed.
       docker rm flatten
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docker/local-emulator/qemu/cloud-init/emulator/user-data` around lines 291 -
311, Add a short explanatory comment above the Docker storage reset sequence in
the cloud-init user-data (the block that runs docker rm flatten, docker save
stack-local-emulator:final, cp -a /var/lib/docker/volumes, systemctl stop docker
containerd, rm -rf /var/lib/docker /var/lib/containerd, and subsequent restore)
stating that a full removal of /var/lib/docker is required because build cache
and image references prevent docker image prune / docker image prune -a from
freeing layered storage, and that this is intentional and safe here because this
is a controlled build-time operation; keep the comment concise and reference
that the script restores images/volumes afterward to reassure maintainers.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@docker/local-emulator/qemu/cloud-init/emulator/user-data`:
- Around line 182-190: The wait loop for the sentinel file
(/var/run/stack-local-init-services.done) can time out silently; after the while
loop in user-data that runs `docker exec stack-build-init test -f
/var/run/stack-local-init-services.done` add an explicit check: if the sentinel
still doesn't exist, emit an error (e.g., echo with context and optionally dump
`docker logs stack-build-init`) and exit non-zero (exit 1) so the build stops
instead of proceeding to migrations; reference the timeout/elapsed variables and
the same docker exec test to perform the check.

---

Nitpick comments:
In `@docker/local-emulator/qemu/cloud-init/emulator/user-data`:
- Around line 291-311: Add a short explanatory comment above the Docker storage
reset sequence in the cloud-init user-data (the block that runs docker rm
flatten, docker save stack-local-emulator:final, cp -a /var/lib/docker/volumes,
systemctl stop docker containerd, rm -rf /var/lib/docker /var/lib/containerd,
and subsequent restore) stating that a full removal of /var/lib/docker is
required because build cache and image references prevent docker image prune /
docker image prune -a from freeing layered storage, and that this is intentional
and safe here because this is a controlled build-time operation; keep the
comment concise and reference that the script restores images/volumes afterward
to reassure maintainers.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: cb920ace-540a-4aed-a942-841cb08cc523

📥 Commits

Reviewing files that changed from the base of the PR and between 0301961 and 7d9e156.

📒 Files selected for processing (4)
  • docker/local-emulator/Dockerfile
  • docker/local-emulator/qemu/build-image.sh
  • docker/local-emulator/qemu/cloud-init/emulator/user-data
  • docker/local-emulator/qemu/run-emulator.sh

Comment thread docker/local-emulator/qemu/cloud-init/emulator/user-data
@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented Apr 9, 2026

Greptile Summary

This PR optimizes the local emulator image size through three main strategies: (1) adding UPX compression for svix-server, minio, mc, and qstash binaries and strip --strip-all for ClickHouse, (2) slimming the Docker image by swapping the full migration-pruner node_modules back to the smaller standalone-traced ones at QEMU provision time, and (3) aggressively reclaiming disk space by wiping Docker's entire storage layer after baking in migrations, then restoring only the final flattened image and named volumes.

Confidence Score: 5/5

Safe to merge — all findings are P2 suggestions; build-time gates (UPX exit code) and the smoke test provide reasonable safeguards.

All remaining comments are P2: one is a speculative runtime risk for UPX-compressed binaries (caught at build time if upx fails, and by the smoke test for the backend), one is a harmless glob-quoting style issue, and one is a silent-failure risk that would also be caught by the smoke test. No P0/P1 defects found.

docker/local-emulator/Dockerfile (upx-compress stage) and docker/local-emulator/qemu/cloud-init/emulator/user-data (slim-docker-image smoke test coverage)

Vulnerabilities

No security concerns identified. The build.env file (API keys / development secrets) is bundled into the QEMU ISO only for the build phase and is not present in the final runtime image or exposed over the network. The SSH service is explicitly disabled and masked in cloud-init.

Important Files Changed

Filename Overview
docker/local-emulator/Dockerfile Adds strip-clickhouse and upx-compress intermediate stages; saves standalone node_modules before overwriting with migration-pruner. UPX compression applied to four binaries without service-level smoke test coverage.
docker/local-emulator/qemu/cloud-init/emulator/user-data Adds run-build-migrations and slim-docker-image scripts; performs full Docker storage wipe and reload cycle to reclaim space; includes smoke test for backend health but not for UPX-compressed services (MinIO, Svix, QStash).
docker/local-emulator/qemu/build-image.sh Adds discard=on,detect-zeroes=unmap to QEMU drive (TRIM support), bundles build.env into the ISO, and generates the env file at script level before the build loop. Clean improvement.
docker/local-emulator/qemu/run-emulator.sh Adds four host-side port variables (dashboard, backend, MinIO, inbucket) to the runtime config ISO so the VM can construct accurate browser-facing URLs. Straightforward and correct.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[build-image.sh] --> B[Docker buildx: stack-local-emulator]
    B --> B1[upx-compress stage\nminio / mc / svix-server / qstash]
    B --> B2[strip-clickhouse stage\nclickhouse binary]
    B --> B3[node_modules.standalone\nsaved before migration-pruner overwrite]
    B1 --> C[Docker image bundled into ISO]
    B2 --> C
    B3 --> C
    A --> D[QEMU VM boots with cloud-init]
    D --> E[install-emulator-containers\nloads Docker image + build.env]
    E --> F[run-build-migrations\nstart deps-only container\nrun migrations & seed]
    F --> G[slim-docker-image\nswap node_modules to standalone-traced\nbuild stack-local-emulator-slim]
    G --> H[Smoke test: backend /health?db=1]
    H -->|pass| I[Flatten to single layer\ndocker export + import]
    H -->|fail| Z[Build fails]
    I --> J[Wipe Docker storage\nsave + reload final image\nrestore volumes]
    J --> K[Zero free space + fstrim]
    K --> L[STACK_CLOUD_INIT_DONE emitted]
    L --> M[qemu-img convert -c\ncompressed qcow2 final image]
Loading
Prompt To Fix All With AI
This is a comment left during a code review.
Path: docker/local-emulator/Dockerfile
Line: 121

Comment:
**UPX on Go binaries — smoke test doesn't cover these services**

`upx -9` is applied to `minio`, `mc`, `qstash` (Go binaries) and `svix-server` (Rust). UPX compression can succeed at build time yet produce a binary that crashes at runtime, particularly for newer statically-linked Go binaries on ARM64. The smoke test that follows only checks `http://127.0.0.1:8102/health?db=1` (backend + Postgres), so a broken MinIO, Svix, or QStash binary would pass the build gate and only surface when users try to upload files or fire webhooks.

Consider extending the smoke test to include at least one health probe per compressed binary before the flatten/wipe step:

```bash
# After smoke_passed check, verify UPX-compressed services are up
curl -sf http://127.0.0.1:9090/minio/health/live || { echo "MinIO health failed"; exit 1; }
curl -sf http://127.0.0.1:8071/api/v1/health/ || { echo "Svix health failed"; exit 1; }
```

How can I resolve this? If you propose a fix, please make it concise.

---

This is a comment left during a code review.
Path: docker/local-emulator/qemu/cloud-init/emulator/user-data
Line: 264

Comment:
**Unquoted URL with glob-special character `?`**

The URL `http://127.0.0.1:8102/health?db=1` is passed unquoted to `curl`. Under `set -euo pipefail`, bash will attempt glob expansion on the `?` character (which matches any single character). This works in practice only because no file matching `http:/127.0.0.1:8102/health.db=1` exists, but it is fragile. Quote the URL for correctness:

```suggestion
        code=$(curl -s -o /dev/null -w "%{http_code}" --max-time 3 "http://127.0.0.1:8102/health?db=1" 2>/dev/null || true)
```

How can I resolve this? If you propose a fix, please make it concise.

---

This is a comment left during a code review.
Path: docker/local-emulator/Dockerfile
Line: 188

Comment:
**Silent `cp` failure creates an empty `node_modules.standalone`**

The `2>/dev/null || mkdir -p /app/node_modules.standalone` fallback means that if the `cp -a` silently fails (e.g. disk full, source missing), the build continues with an empty `node_modules.standalone`. When `slim-docker-image` later swaps it in, the container would have no node_modules and fail at runtime — likely caught only by the smoke test. Consider removing the suppression and fallback to surface the failure fast:

```suggestion
RUN cp -a /app/node_modules /app/node_modules.standalone
```

How can I resolve this? If you propose a fix, please make it concise.

Reviews (1): Last reviewed commit: "local emulator image opt" | Re-trigger Greptile

Comment thread docker/local-emulator/Dockerfile Outdated
COPY --from=minio-bin /usr/bin/minio /out/minio
COPY --from=mc-bin /usr/bin/mc /out/mc
COPY --from=qstash-bin /qstash-binary /out/qstash
RUN upx -9 /out/minio /out/svix-server /out/mc /out/qstash
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 UPX on Go binaries — smoke test doesn't cover these services

upx -9 is applied to minio, mc, qstash (Go binaries) and svix-server (Rust). UPX compression can succeed at build time yet produce a binary that crashes at runtime, particularly for newer statically-linked Go binaries on ARM64. The smoke test that follows only checks http://127.0.0.1:8102/health?db=1 (backend + Postgres), so a broken MinIO, Svix, or QStash binary would pass the build gate and only surface when users try to upload files or fire webhooks.

Consider extending the smoke test to include at least one health probe per compressed binary before the flatten/wipe step:

# After smoke_passed check, verify UPX-compressed services are up
curl -sf http://127.0.0.1:9090/minio/health/live || { echo "MinIO health failed"; exit 1; }
curl -sf http://127.0.0.1:8071/api/v1/health/ || { echo "Svix health failed"; exit 1; }
Prompt To Fix With AI
This is a comment left during a code review.
Path: docker/local-emulator/Dockerfile
Line: 121

Comment:
**UPX on Go binaries — smoke test doesn't cover these services**

`upx -9` is applied to `minio`, `mc`, `qstash` (Go binaries) and `svix-server` (Rust). UPX compression can succeed at build time yet produce a binary that crashes at runtime, particularly for newer statically-linked Go binaries on ARM64. The smoke test that follows only checks `http://127.0.0.1:8102/health?db=1` (backend + Postgres), so a broken MinIO, Svix, or QStash binary would pass the build gate and only surface when users try to upload files or fire webhooks.

Consider extending the smoke test to include at least one health probe per compressed binary before the flatten/wipe step:

```bash
# After smoke_passed check, verify UPX-compressed services are up
curl -sf http://127.0.0.1:9090/minio/health/live || { echo "MinIO health failed"; exit 1; }
curl -sf http://127.0.0.1:8071/api/v1/health/ || { echo "Svix health failed"; exit 1; }
```

How can I resolve this? If you propose a fix, please make it concise.

smoke_elapsed=0
smoke_passed=false
while [ "$smoke_elapsed" -lt "$smoke_timeout" ]; do
code=$(curl -s -o /dev/null -w "%{http_code}" --max-time 3 http://127.0.0.1:8102/health?db=1 2>/dev/null || true)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Unquoted URL with glob-special character ?

The URL http://127.0.0.1:8102/health?db=1 is passed unquoted to curl. Under set -euo pipefail, bash will attempt glob expansion on the ? character (which matches any single character). This works in practice only because no file matching http:/127.0.0.1:8102/health.db=1 exists, but it is fragile. Quote the URL for correctness:

Suggested change
code=$(curl -s -o /dev/null -w "%{http_code}" --max-time 3 http://127.0.0.1:8102/health?db=1 2>/dev/null || true)
code=$(curl -s -o /dev/null -w "%{http_code}" --max-time 3 "http://127.0.0.1:8102/health?db=1" 2>/dev/null || true)
Prompt To Fix With AI
This is a comment left during a code review.
Path: docker/local-emulator/qemu/cloud-init/emulator/user-data
Line: 264

Comment:
**Unquoted URL with glob-special character `?`**

The URL `http://127.0.0.1:8102/health?db=1` is passed unquoted to `curl`. Under `set -euo pipefail`, bash will attempt glob expansion on the `?` character (which matches any single character). This works in practice only because no file matching `http:/127.0.0.1:8102/health.db=1` exists, but it is fragile. Quote the URL for correctness:

```suggestion
        code=$(curl -s -o /dev/null -w "%{http_code}" --max-time 3 "http://127.0.0.1:8102/health?db=1" 2>/dev/null || true)
```

How can I resolve this? If you propose a fix, please make it concise.

# Save the standalone-traced node_modules (runtime deps only) before the full
# migration-pruner copy overwrites it. The slim-docker-image step in the QEMU
# build restores this after migrations are baked in.
RUN cp -a /app/node_modules /app/node_modules.standalone 2>/dev/null || mkdir -p /app/node_modules.standalone
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Silent cp failure creates an empty node_modules.standalone

The 2>/dev/null || mkdir -p /app/node_modules.standalone fallback means that if the cp -a silently fails (e.g. disk full, source missing), the build continues with an empty node_modules.standalone. When slim-docker-image later swaps it in, the container would have no node_modules and fail at runtime — likely caught only by the smoke test. Consider removing the suppression and fallback to surface the failure fast:

Suggested change
RUN cp -a /app/node_modules /app/node_modules.standalone 2>/dev/null || mkdir -p /app/node_modules.standalone
RUN cp -a /app/node_modules /app/node_modules.standalone
Prompt To Fix With AI
This is a comment left during a code review.
Path: docker/local-emulator/Dockerfile
Line: 188

Comment:
**Silent `cp` failure creates an empty `node_modules.standalone`**

The `2>/dev/null || mkdir -p /app/node_modules.standalone` fallback means that if the `cp -a` silently fails (e.g. disk full, source missing), the build continues with an empty `node_modules.standalone`. When `slim-docker-image` later swaps it in, the container would have no node_modules and fail at runtime — likely caught only by the smoke test. Consider removing the suppression and fallback to surface the failure fast:

```suggestion
RUN cp -a /app/node_modules /app/node_modules.standalone
```

How can I resolve this? If you propose a fix, please make it concise.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (2)
docker/local-emulator/qemu/test-serial.sh (2)

65-74: Hardcoded hvf acceleration limits portability to macOS.

The accelerator is hardcoded to hvf which only works on macOS. If this script needs to run on Linux (e.g., CI runners using KVM), it will fail. Consider detecting the host OS and selecting the appropriate accelerator.

Also, there's no fallback case—if $ARCH is neither arm64 nor amd64, qemu_base remains unset and the script will fail with an unclear error.

♻️ Proposed fix to support both macOS and Linux
 case "$ARCH" in
   arm64)
-    accel="hvf"
+    case "$(uname -s)" in
+      Darwin) accel="hvf" ;;
+      Linux)  accel="kvm" ;;
+      *)      accel="tcg" ;;
+    esac
     firmware="$(find_aarch64_firmware)"
     qemu_base="qemu-system-aarch64 -machine virt -accel $accel -cpu max -bios $firmware"
     ;;
   amd64)
-    qemu_base="qemu-system-x86_64 -machine q35 -accel hvf -cpu max"
+    case "$(uname -s)" in
+      Darwin) accel="hvf" ;;
+      Linux)  accel="kvm" ;;
+      *)      accel="tcg" ;;
+    esac
+    qemu_base="qemu-system-x86_64 -machine q35 -accel $accel -cpu max"
+    ;;
+  *)
+    echo "Unsupported architecture: $ARCH" >&2
+    exit 1
     ;;
 esac
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docker/local-emulator/qemu/test-serial.sh` around lines 65 - 74, The script
hardcodes accel="hvf" and leaves qemu_base unset for unknown ARCH; update the
qemu setup to detect the host OS (via uname or similar) and choose the
accelerator dynamically (e.g., use "hvf" on Darwin/macOS, "kvm" on Linux, and
fallback to "tcg" if none available), apply this logic for both the arm64 branch
(where firmware is set via find_aarch64_firmware) and the amd64 branch (set
qemu_base using the chosen $accel), and add a default case for the case
statement that emits a clear error and exits if $ARCH is not supported so
qemu_base cannot remain unset.

104-109: Duplicate progress lines printed on each poll iteration.

The grep "STACK_PROVISION" inside the polling loop prints all matching lines every 2 seconds, causing repeated output. Consider tracking which lines have already been shown.

♻️ Proposed fix to track and show only new lines
 elapsed=0
 timeout=120
+last_line_count=0
 while [ "$elapsed" -lt "$timeout" ]; do
   if grep -q "STACK_CLOUD_INIT_DONE" "$TMP_DIR/serial.log" 2>/dev/null; then
     echo ""
     echo "=== SUCCESS: STACK_CLOUD_INIT_DONE received ==="
     echo ""
     echo "=== All STACK_PROVISION lines ==="
     grep "STACK_PROVISION" "$TMP_DIR/serial.log" || echo "(none found)"
     exit 0
   fi

   # Show any STACK_PROVISION lines as they appear
-  if grep -q "STACK_PROVISION" "$TMP_DIR/serial.log" 2>/dev/null; then
-    grep "STACK_PROVISION" "$TMP_DIR/serial.log" | while IFS= read -r line; do
-      echo "  [${elapsed}s] $line"
-    done
-  fi
+  current_lines=$(grep -c "STACK_PROVISION" "$TMP_DIR/serial.log" 2>/dev/null || echo 0)
+  if [ "$current_lines" -gt "$last_line_count" ]; then
+    grep "STACK_PROVISION" "$TMP_DIR/serial.log" | tail -n +"$((last_line_count + 1))" | while IFS= read -r line; do
+      echo "  [${elapsed}s] $line"
+    done
+    last_line_count="$current_lines"
+  fi

   sleep 2
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docker/local-emulator/qemu/test-serial.sh` around lines 104 - 109, The
polling loop currently reprints all "STACK_PROVISION" matches from
"$TMP_DIR/serial.log" every iteration; modify the loop that contains grep
"STACK_PROVISION" "$TMP_DIR/serial.log" | while ... to only print new lines by
tracking progress (e.g., keep a shown_lines_count or last_offset variable for
"$TMP_DIR/serial.log"), use that to read/grep only the unseen portion (sed -n or
tail --lines=+N or tail --follow/--bytes with stored offset) and update the
tracker after printing; ensure the tracker variable is initialized before
polling and updated inside the while so each "STACK_PROVISION" line is shown
exactly once.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@docker/local-emulator/qemu/cloud-init/emulator/user-data`:
- Around line 308-324: The current sequence moves Docker volumes to
/var/tmp/volumes-backup then nukes /var/lib/docker which can leave volumes
orphaned if any step (e.g., docker load, docker tag, systemctl) fails; add a
rollback/cleanup mechanism (use set -e and a trap ERR/EXIT or explicit error
checks) around the critical section that ensures on any failure you move
/var/tmp/volumes-backup back to /var/lib/docker/volumes and restart
docker/containerd, and only delete /var/tmp/final-image.tar and the backup after
confirming docker load, docker tag (stack-local-emulator:final ->
stack-local-emulator) and the service restarts succeeded; reference the mv
/var/tmp/volumes-backup, docker load -i /var/tmp/final-image.tar, docker tag
stack-local-emulator:final stack-local-emulator, and systemctl stop/start docker
containerd lines to locate where to add the trap and checks.

---

Nitpick comments:
In `@docker/local-emulator/qemu/test-serial.sh`:
- Around line 65-74: The script hardcodes accel="hvf" and leaves qemu_base unset
for unknown ARCH; update the qemu setup to detect the host OS (via uname or
similar) and choose the accelerator dynamically (e.g., use "hvf" on
Darwin/macOS, "kvm" on Linux, and fallback to "tcg" if none available), apply
this logic for both the arm64 branch (where firmware is set via
find_aarch64_firmware) and the amd64 branch (set qemu_base using the chosen
$accel), and add a default case for the case statement that emits a clear error
and exits if $ARCH is not supported so qemu_base cannot remain unset.
- Around line 104-109: The polling loop currently reprints all "STACK_PROVISION"
matches from "$TMP_DIR/serial.log" every iteration; modify the loop that
contains grep "STACK_PROVISION" "$TMP_DIR/serial.log" | while ... to only print
new lines by tracking progress (e.g., keep a shown_lines_count or last_offset
variable for "$TMP_DIR/serial.log"), use that to read/grep only the unseen
portion (sed -n or tail --lines=+N or tail --follow/--bytes with stored offset)
and update the tracker after printing; ensure the tracker variable is
initialized before polling and updated inside the while so each
"STACK_PROVISION" line is shown exactly once.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 7d7d3843-8e75-47a3-9856-02b083c3954c

📥 Commits

Reviewing files that changed from the base of the PR and between 7d9e156 and cd087c5.

📒 Files selected for processing (4)
  • docker/local-emulator/Dockerfile
  • docker/local-emulator/qemu/build-image.sh
  • docker/local-emulator/qemu/cloud-init/emulator/user-data
  • docker/local-emulator/qemu/test-serial.sh
🚧 Files skipped from review as they are similar to previous changes (2)
  • docker/local-emulator/Dockerfile
  • docker/local-emulator/qemu/build-image.sh

Comment on lines +308 to +324
log "Saving final image to /var/tmp..."
docker rm flatten
docker save stack-local-emulator:final -o /var/tmp/final-image.tar
mv /var/lib/docker/volumes /var/tmp/volumes-backup
log "Nuking Docker storage and reloading..."
systemctl stop docker containerd
rm -rf /var/lib/docker /var/lib/containerd
systemctl start docker containerd
until docker info >/dev/null 2>&1; do sleep 1; done
docker load -i /var/tmp/final-image.tar
docker tag stack-local-emulator:final stack-local-emulator
docker rmi stack-local-emulator:final || true
rm -f /var/tmp/final-image.tar
systemctl stop docker
rm -rf /var/lib/docker/volumes
mv /var/tmp/volumes-backup /var/lib/docker/volumes
systemctl start docker
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Risky volume backup during Docker storage rebuild.

The sequence of moving volumes to /var/tmp, nuking Docker storage, then moving them back is fragile. If the script fails between lines 311 and 323 (e.g., docker load fails, system crashes), the volumes remain orphaned in /var/tmp and Docker storage is gone.

Consider adding error handling to restore volumes on failure:

🛡️ Proposed fix to add rollback on failure
       log "Saving final image to /var/tmp..."
       docker rm flatten
       docker save stack-local-emulator:final -o /var/tmp/final-image.tar
       mv /var/lib/docker/volumes /var/tmp/volumes-backup
+
+      restore_volumes() {
+        if [ -d /var/tmp/volumes-backup ]; then
+          log "Restoring volumes backup..."
+          rm -rf /var/lib/docker/volumes 2>/dev/null || true
+          mv /var/tmp/volumes-backup /var/lib/docker/volumes
+        fi
+      }
+      trap 'restore_volumes' ERR
+
       log "Nuking Docker storage and reloading..."
       systemctl stop docker containerd
       rm -rf /var/lib/docker /var/lib/containerd
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
log "Saving final image to /var/tmp..."
docker rm flatten
docker save stack-local-emulator:final -o /var/tmp/final-image.tar
mv /var/lib/docker/volumes /var/tmp/volumes-backup
log "Nuking Docker storage and reloading..."
systemctl stop docker containerd
rm -rf /var/lib/docker /var/lib/containerd
systemctl start docker containerd
until docker info >/dev/null 2>&1; do sleep 1; done
docker load -i /var/tmp/final-image.tar
docker tag stack-local-emulator:final stack-local-emulator
docker rmi stack-local-emulator:final || true
rm -f /var/tmp/final-image.tar
systemctl stop docker
rm -rf /var/lib/docker/volumes
mv /var/tmp/volumes-backup /var/lib/docker/volumes
systemctl start docker
log "Saving final image to /var/tmp..."
docker rm flatten
docker save stack-local-emulator:final -o /var/tmp/final-image.tar
mv /var/lib/docker/volumes /var/tmp/volumes-backup
restore_volumes() {
if [ -d /var/tmp/volumes-backup ]; then
log "Restoring volumes backup..."
rm -rf /var/lib/docker/volumes 2>/dev/null || true
mv /var/tmp/volumes-backup /var/lib/docker/volumes
fi
}
trap 'restore_volumes' ERR
log "Nuking Docker storage and reloading..."
systemctl stop docker containerd
rm -rf /var/lib/docker /var/lib/containerd
systemctl start docker containerd
until docker info >/dev/null 2>&1; do sleep 1; done
docker load -i /var/tmp/final-image.tar
docker tag stack-local-emulator:final stack-local-emulator
docker rmi stack-local-emulator:final || true
rm -f /var/tmp/final-image.tar
systemctl stop docker
rm -rf /var/lib/docker/volumes
mv /var/tmp/volumes-backup /var/lib/docker/volumes
systemctl start docker
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docker/local-emulator/qemu/cloud-init/emulator/user-data` around lines 308 -
324, The current sequence moves Docker volumes to /var/tmp/volumes-backup then
nukes /var/lib/docker which can leave volumes orphaned if any step (e.g., docker
load, docker tag, systemctl) fails; add a rollback/cleanup mechanism (use set -e
and a trap ERR/EXIT or explicit error checks) around the critical section that
ensures on any failure you move /var/tmp/volumes-backup back to
/var/lib/docker/volumes and restart docker/containerd, and only delete
/var/tmp/final-image.tar and the backup after confirming docker load, docker tag
(stack-local-emulator:final -> stack-local-emulator) and the service restarts
succeeded; reference the mv /var/tmp/volumes-backup, docker load -i
/var/tmp/final-image.tar, docker tag stack-local-emulator:final
stack-local-emulator, and systemctl stop/start docker containerd lines to locate
where to add the trap and checks.

Comment on lines +67 to +72
accel="hvf"
firmware="$(find_aarch64_firmware)"
qemu_base="qemu-system-aarch64 -machine virt -accel $accel -cpu max -bios $firmware"
;;
amd64)
qemu_base="qemu-system-x86_64 -machine q35 -accel hvf -cpu max"
Copy link
Copy Markdown

@vercel vercel Bot Apr 10, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The test-serial.sh script hardcodes macOS-specific HVF accelerator, causing it to fail on Linux systems.

Fix on Vercel

Provisioning used to silently wait out the full 6000s timeout on any
guest-side failure because the cleanup trap only logged the error. Now
it writes STACK_CLOUD_INIT_FAILED and shuts the VM down, and the host
waiter breaks on that marker and reports it distinctly.

Also bump smoke test timeout 120s->300s, dump docker ps / container
logs / free -m / verbose curl when it fails, log the qemu accel path,
and enable /dev/kvm on the CI runner so the VM isn't stuck in TCG.
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In @.github/workflows/qemu-emulator-build.yaml:
- Around line 52-63: The "Enable KVM access" step and its log are misleading
because build-image.sh only enables -accel kvm when host arch == guest arch, so
making /dev/kvm writable on an x86 runner does not speed up the arm64 matrix
leg; update the workflow to route the arm64 job to the same-arch runner (use
ubicloud-standard-8-arm for the arm64 matrix entry) or, if you prefer not to
change runners, modify the step's log to explicitly state "KVM here only helps
same-arch guests (host arch must match guest arch for -accel kvm)" and keep the
existing udev commands; reference the "Enable KVM access" step, the matrix entry
for arm64, and build-image.sh to ensure the fix is consistent.

In `@docker/local-emulator/qemu/build-image.sh`:
- Around line 367-369: The script calls node to run generate-env-development.mjs
but check_deps() doesn’t include node, so add "node" to the dependency list used
by check_deps() (the same place where other tools like docker/awk are listed) so
that check_deps() fails early if node is missing; update the dependency array or
call to check_deps() that feeds into the check_deps function and keep the
BUILD_ENV_FILE and the node
"$REPO_ROOT/docker/local-emulator/generate-env-development.mjs" invocation
unchanged.
- Around line 358-360: The qemu-img convert currently writes directly to
"$final_img" which can leave a partially overwritten corrupt file on failure;
change the workflow around the qemu-img convert invocation (the block using
tmp_img and final_img) to write the output to a temporary file (e.g.,
"$final_img.tmp" or similar) and then atomically rename/move it to "$final_img"
only after convert succeeds, and still ensure tmp_dir cleanup (rm -rf
"$tmp_dir") happens appropriately on both success and failure; additionally,
update the dependency check in the check_deps function to include "node" in the
tooling list so the script validates node is installed before invoking it later.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 0da9cb3f-5870-4986-83d6-34baa6e95ce0

📥 Commits

Reviewing files that changed from the base of the PR and between cd087c5 and 784f17c.

📒 Files selected for processing (3)
  • .github/workflows/qemu-emulator-build.yaml
  • docker/local-emulator/qemu/build-image.sh
  • docker/local-emulator/qemu/cloud-init/emulator/user-data
🚧 Files skipped from review as they are similar to previous changes (1)
  • docker/local-emulator/qemu/cloud-init/emulator/user-data

Comment on lines +52 to +63
- name: Enable KVM access
run: |
echo 'KERNEL=="kvm", GROUP="kvm", MODE="0666", OPTIONS+="static_node=kvm"' \
| sudo tee /etc/udev/rules.d/99-kvm4all.rules
sudo udevadm control --reload-rules
sudo udevadm trigger --name-match=kvm || true
ls -la /dev/kvm || echo "no /dev/kvm present"
if [ -w /dev/kvm ]; then
echo "KVM is writable — hardware acceleration will be used"
else
echo "WARNING: /dev/kvm is not writable — will fall back to TCG (very slow)"
fi
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

This KVM step does not speed up the arm64 matrix leg on ubicloud-standard-8.

build-image.sh only switches to -accel kvm when the host arch matches the guest arch, so on the current x86 runner the arm64 build still falls back to TCG even if /dev/kvm is writable. If the goal is to optimize both architectures, route the arm64 job to ubicloud-standard-8-arm or make the log message explicit that KVM here only helps same-arch guests.

Based on learnings, this workflow can use both ubicloud-standard-8 and ubicloud-standard-8-arm, so there is an existing same-arch runner option for the arm64 build.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.github/workflows/qemu-emulator-build.yaml around lines 52 - 63, The "Enable
KVM access" step and its log are misleading because build-image.sh only enables
-accel kvm when host arch == guest arch, so making /dev/kvm writable on an x86
runner does not speed up the arm64 matrix leg; update the workflow to route the
arm64 job to the same-arch runner (use ubicloud-standard-8-arm for the arm64
matrix entry) or, if you prefer not to change runners, modify the step's log to
explicitly state "KVM here only helps same-arch guests (host arch must match
guest arch for -accel kvm)" and keep the existing udev commands; reference the
"Enable KVM access" step, the matrix entry for arm64, and build-image.sh to
ensure the fix is consistent.

Comment on lines 358 to +360
log "Compressing final image (this may take several minutes)..."
qemu-img convert -p -O qcow2 -c "$final_img" "$final_img.tmp"
mv "$final_img.tmp" "$final_img"
qemu-img convert -p -O qcow2 -c "$tmp_img" "$final_img"
rm -rf "$tmp_dir"
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

# Check the file size and read lines around 358-360
wc -l docker/local-emulator/qemu/build-image.sh

Repository: stack-auth/stack-auth

Length of output: 110


🏁 Script executed:

# Read the relevant section
sed -n '350,375p' docker/local-emulator/qemu/build-image.sh

Repository: stack-auth/stack-auth

Length of output: 951


🏁 Script executed:

# Search for node invocations and check_deps function
rg -n "node " docker/local-emulator/qemu/build-image.sh | head -20

Repository: stack-auth/stack-auth

Length of output: 137


🏁 Script executed:

# Find the check_deps function to see if node is validated
rg -n "check_deps\|qemu-img\|node" docker/local-emulator/qemu/build-image.sh | head -30

Repository: stack-auth/stack-auth

Length of output: 47


🏁 Script executed:

# Search for check_deps function more broadly
rg -n "check_deps|declare -a|qemu-img" docker/local-emulator/qemu/build-image.sh

Repository: stack-auth/stack-auth

Length of output: 274


🏁 Script executed:

# Get the full check for dependencies/requirements
sed -n '1,100p' docker/local-emulator/qemu/build-image.sh | rg -n "check|qemu|curl|docker|gzip|node"

Repository: stack-auth/stack-auth

Length of output: 453


🏁 Script executed:

# Search for any validation of dependencies in the entire file
rg -B2 -A5 "command -v|which " docker/local-emulator/qemu/build-image.sh

Repository: stack-auth/stack-auth

Length of output: 610


Avoid writing the final qcow2 in place.

Line 359 converts directly into "$final_img". If qemu-img convert fails or is interrupted, an existing image can be left partially overwritten, which makes reruns or package steps pick up a corrupt artifact. Use a temporary file with atomic rename to ensure failure doesn't corrupt the final artifact.

Additionally, node is invoked at line 368 but is not validated in the check_deps() function (line 50), creating an incomplete dependency check.

Suggested fix
-  qemu-img convert -p -O qcow2 -c "$tmp_img" "$final_img"
-  rm -rf "$tmp_dir"
+  local tmp_final="${final_img}.tmp"
+  rm -f "$tmp_final"
+  qemu-img convert -p -O qcow2 -c "$tmp_img" "$tmp_final"
+  mv "$tmp_final" "$final_img"
+  rm -rf "$tmp_dir"

Also add node to the dependency check loop at line 50.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
log "Compressing final image (this may take several minutes)..."
qemu-img convert -p -O qcow2 -c "$final_img" "$final_img.tmp"
mv "$final_img.tmp" "$final_img"
qemu-img convert -p -O qcow2 -c "$tmp_img" "$final_img"
rm -rf "$tmp_dir"
log "Compressing final image (this may take several minutes)..."
local tmp_final="${final_img}.tmp"
rm -f "$tmp_final"
qemu-img convert -p -O qcow2 -c "$tmp_img" "$tmp_final"
mv "$tmp_final" "$final_img"
rm -rf "$tmp_dir"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docker/local-emulator/qemu/build-image.sh` around lines 358 - 360, The
qemu-img convert currently writes directly to "$final_img" which can leave a
partially overwritten corrupt file on failure; change the workflow around the
qemu-img convert invocation (the block using tmp_img and final_img) to write the
output to a temporary file (e.g., "$final_img.tmp" or similar) and then
atomically rename/move it to "$final_img" only after convert succeeds, and still
ensure tmp_dir cleanup (rm -rf "$tmp_dir") happens appropriately on both success
and failure; additionally, update the dependency check in the check_deps
function to include "node" in the tooling list so the script validates node is
installed before invoking it later.

Comment on lines +367 to +369
log "Generating emulator build env file..."
node "$REPO_ROOT/docker/local-emulator/generate-env-development.mjs"
BUILD_ENV_FILE="$REPO_ROOT/docker/local-emulator/.env.development"
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

# First, let's examine the script file around lines 367-369
cat -n docker/local-emulator/qemu/build-image.sh | sed -n '360,375p'

Repository: stack-auth/stack-auth

Length of output: 718


🏁 Script executed:

# Now let's find the check_deps() function
rg -A 15 "check_deps\(\)" docker/local-emulator/qemu/build-image.sh

Repository: stack-auth/stack-auth

Length of output: 576


🏁 Script executed:

# Let's also search for all occurrences of "node" in the script to see if it's used elsewhere
rg -n "node" docker/local-emulator/qemu/build-image.sh

Repository: stack-auth/stack-auth

Length of output: 137


Add node to the upfront dependency check.

This script unconditionally calls node at line 368, but check_deps() doesn't verify it's available. Adding it to the existing dependency check will catch missing node early instead of failing at runtime.

Suggested fix
-  for cmd in qemu-img curl docker gzip; do
+  for cmd in qemu-img curl docker gzip node; do
     command -v "$cmd" >/dev/null 2>&1 || missing+=("$cmd")
   done
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docker/local-emulator/qemu/build-image.sh` around lines 367 - 369, The script
calls node to run generate-env-development.mjs but check_deps() doesn’t include
node, so add "node" to the dependency list used by check_deps() (the same place
where other tools like docker/awk are listed) so that check_deps() fails early
if node is missing; update the dependency array or call to check_deps() that
feeds into the check_deps function and keep the BUILD_ENV_FILE and the node
"$REPO_ROOT/docker/local-emulator/generate-env-development.mjs" invocation
unchanged.

@BilalG1 BilalG1 closed this Apr 11, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant