ci(release): build docker images on native arm64 runners#940

Closed
vanducng wants to merge 1 commit into dev from fix/docker-build-native-arm64

@vanducng
Contributor

Root cause

Run 24516158412 showed docker-images (full) and docker-images (latest) stalled for 6 hours before GHA cancelled them at the job timeout. The base (19 min) and otel (27 min) variants on the same workflow finished fine.

The bottleneck is QEMU emulating linux/arm64 on an amd64 runner. When ENABLE_PYTHON=true, the Alpine build layer runs pnpm and pip, and pip compiles pandas, numpy, and lxml from source under QEMU, which is effectively single-threaded software emulation of a different ISA. That build reliably hangs or takes 10× longer than native.

Fix

Replace the single cross-arch buildx job with a 2-stage pipeline:

Stage 1 — *-build (per-variant × per-arch, native runners)

  • linux/amd64 jobs run on ubuntu-latest (existing free runner)
  • linux/arm64 jobs run on ubuntu-24.04-arm (GitHub-hosted ARM runner, free for public repos)
  • Each job builds a single-platform image and pushes it by content digest (no tags yet)
  • Digest files uploaded as artifacts (digest-<variant>-<arch>)
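
The Stage 1 layout could look roughly like the workflow fragment below. This is a sketch under assumptions, not the PR's actual diff: the image name `ghcr.io/OWNER/IMAGE`, the step id `build`, and the digest directory are placeholders; the variant list is inferred from the run described under Root cause; variant-specific build arguments are omitted.

```yaml
jobs:
  docker-images-build:
    strategy:
      fail-fast: false
      matrix:
        variant: [base, otel, full, latest]   # assumed variant names
        arch: [amd64, arm64]
        include:
          # Map each arch to a native runner so nothing is emulated.
          - arch: amd64
            runner: ubuntu-latest
          - arch: arm64
            runner: ubuntu-24.04-arm
    runs-on: ${{ matrix.runner }}
    permissions:
      contents: read
      packages: write
    steps:
      - uses: actions/checkout@v4
      - uses: docker/setup-buildx-action@v3
      - uses: docker/login-action@v3
        with:
          registry: ghcr.io
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }}
      - id: build
        uses: docker/build-push-action@v6
        with:
          context: .
          platforms: linux/${{ matrix.arch }}
          # Push by digest only; tags are attached later in the merge job.
          outputs: type=image,name=ghcr.io/OWNER/IMAGE,push-by-digest=true,name-canonical=true,push=true
      - name: Export digest
        run: |
          mkdir -p "${RUNNER_TEMP}/digests"
          digest="${{ steps.build.outputs.digest }}"
          # Store the digest as an empty file named after its hex value.
          touch "${RUNNER_TEMP}/digests/${digest#sha256:}"
      - uses: actions/upload-artifact@v4
        with:
          name: digest-${{ matrix.variant }}-${{ matrix.arch }}
          path: ${{ runner.temp }}/digests/*
          if-no-files-found: error
          retention-days: 1
```

Encoding the digest in a file name keeps the artifact tiny and lets the downstream job rebuild the image references with a shell glob.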

Stage 2 — *-merge (per-variant, fuses digests into multi-arch manifest)

  • Downloads per-arch digest artifacts
  • Runs docker buildx imagetools create twice — once for GHCR tags, once for Docker Hub tags — using GHCR digests as the source (content-addressable, accepted by both registries)
  • No QEMU, no emulation at any point
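
A merge job along these lines would fuse the per-arch digests. Again a sketch under assumptions: `OWNER/IMAGE`, the literal `TAG`, and the secret names `DOCKERHUB_USERNAME` / `DOCKERHUB_TOKEN` are placeholders, and the real workflows compute several tags per variant rather than one.

```yaml
jobs:
  docker-images-merge:
    needs: docker-images-build
    runs-on: ubuntu-latest
    permissions:
      contents: read
      packages: write
    strategy:
      fail-fast: false
      matrix:
        variant: [base, otel, full, latest]   # assumed variant names
    steps:
      - uses: actions/download-artifact@v4
        with:
          path: ${{ runner.temp }}/digests
          pattern: digest-${{ matrix.variant }}-*
          merge-multiple: true
      - uses: docker/setup-buildx-action@v3
      - uses: docker/login-action@v3
        with:
          registry: ghcr.io
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }}
      - uses: docker/login-action@v3
        with:
          username: ${{ secrets.DOCKERHUB_USERNAME }}   # hypothetical secret name
          password: ${{ secrets.DOCKERHUB_TOKEN }}      # hypothetical secret name
      - name: Create multi-arch manifests
        working-directory: ${{ runner.temp }}/digests
        run: |
          # Each file name is a digest without its sha256: prefix.
          sources=$(printf 'ghcr.io/OWNER/IMAGE@sha256:%s ' *)
          # One invocation per registry, both reading the GHCR digests.
          docker buildx imagetools create -t ghcr.io/OWNER/IMAGE:TAG $sources
          docker buildx imagetools create -t docker.io/OWNER/IMAGE:TAG $sources
```

`$sources` is left unquoted on purpose so the shell splits it into one argument per digest reference.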

Same pattern applied to:

  • release.yaml: docker-images → docker-images-build + docker-images-merge; docker-web → docker-web-build + docker-web-merge
  • release-beta.yaml: docker-images → docker-images-build + docker-images-merge

notify-discord.needs updated to reference docker-images-merge and docker-web-merge.
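
In workflow terms the dependency change amounts to something like the fragment below. The step body is a placeholder, and any other jobs that notify-discord already depends on are not shown here.

```yaml
jobs:
  notify-discord:
    # Wait for the multi-arch manifests, not the per-arch build jobs.
    needs: [docker-images-merge, docker-web-merge]
    runs-on: ubuntu-latest
    steps:
      - run: echo "placeholder for the existing notification step"
```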

Validation

  • YAML syntax: validated with python3 yaml.safe_load (actionlint not installed locally — CI is the authoritative validator)
  • Job key sets verified correct via Python
  • No setup-qemu-action remaining in either file
  • No combined linux/amd64,linux/arm64 platform lines remaining
  • Tag shapes unchanged: :vX.Y.Z, :vX.Y.Z-full, :full, :latest, :beta, :beta-full, etc.
  • Both GHCR and Docker Hub receive all tags via separate imagetools create invocations

Checklist

  • No changes outside the two workflow files
  • No Dockerfile or requirements-*.txt changes
  • notify-discord needs updated
  • fail-fast: false on all new matrices
  • Cache scopes updated to include arch suffix (avoids cross-contamination)
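
The arch-suffixed cache scope could be expressed as in this step fragment. It assumes the builds use the GitHub Actions cache backend (`type=gha`); the exact scope naming is hypothetical.

```yaml
      - uses: docker/build-push-action@v6
        with:
          context: .
          platforms: linux/${{ matrix.arch }}
          # A separate scope per variant and arch keeps amd64 and arm64 layers apart.
          cache-from: type=gha,scope=${{ matrix.variant }}-${{ matrix.arch }}
          cache-to: type=gha,mode=max,scope=${{ matrix.variant }}-${{ matrix.arch }}
```

Without the arch suffix, both per-arch jobs of a variant would write to the same scope and overwrite each other's cache entries.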

Split each cross-arch docker build into per-arch build jobs on native
ubuntu-24.04-arm/ubuntu-latest runners, then fuse per-arch digests into
a multi-arch manifest in a downstream merge job. Eliminates QEMU
emulation which stalled the full/latest variants past the 6h job
timeout (see run 24516158412).

Applies to release.yaml (docker-images, docker-web) and
release-beta.yaml (docker-images).
@vanducng vanducng marked this pull request as draft April 16, 2026 21:26
vanducng added a commit to vanducng/goclaw that referenced this pull request Apr 16, 2026
vanducng added a commit to vanducng/goclaw that referenced this pull request Apr 17, 2026
Validation branch for upstream PR. Contains:
- Native arm64 runners (upstream PR nextlevelbuilder#940 equivalent)
- docker-registry-login composite action
- docker-multiarch reusable workflow
- release.yaml + release-beta.yaml refactored to callers
- ubuntu-24.04 pinned across workflows
- DOCKERHUB_IMAGE: dataplanelabs/goclaw (fork-local patch)
@vanducng
Contributor Author

Superseded by #946, which combines this arm64 fix with a DRY refactor (composite action + reusable workflow). Fork-validated on both beta and stable paths.

@vanducng vanducng closed this Apr 17, 2026