
ci: pilot per-arch split + manifest merge for faster-whisper and llama-cpp-quantization#9727

Merged
mudler merged 1 commit into master from ci/per-arch-split-pilot on May 8, 2026

Conversation

@localai-bot (Collaborator)

Summary

Pilots Phase 2.3 + 2.4 of the CI migration plan: convert two backends from QEMU-emulated multi-arch to native per-arch + manifest-list merge. This validates the split-and-merge pattern end-to-end on real CI before fanning out to the other 34 multi-arch entries (Task 2.5, follow-up PR).

What changes

Two pilot backends split (.github/backend-matrix.yml):

  • -cpu-faster-whisper (small Python, fast baseline)
  • -cpu-llama-cpp-quantization (heavier compile, stress test)

For each, the single platforms: 'linux/amd64,linux/arm64' matrix entry is replaced with two per-arch entries: the amd64 leg on ubuntu-latest and the arm64 leg on ubuntu-24.04-arm (native, ~5–10× faster than emulated). Each new entry carries platform-tag: 'amd64' | 'arm64', which the previously merged Phase 2.1 plumbing wires into backend_build.yml to scope the registry cache and the digest artifact name.
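Illustratively, the split replaces one matrix entry with two. This is a sketch, not the literal file contents; field names beyond platforms/platform-tag (tag-suffix, runs-on) are assumptions based on the description above:

```yaml
# Before: one QEMU-emulated multi-arch entry
- tag-suffix: '-cpu-faster-whisper'
  platforms: 'linux/amd64,linux/arm64'
  runs-on: 'ubuntu-latest'

# After: two native per-arch entries, merged later by backend_merge.yml
- tag-suffix: '-cpu-faster-whisper'
  platforms: 'linux/amd64'
  platform-tag: 'amd64'
  runs-on: 'ubuntu-latest'
- tag-suffix: '-cpu-faster-whisper'
  platforms: 'linux/arm64'
  platform-tag: 'arm64'
  runs-on: 'ubuntu-24.04-arm'
```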

Merge-job infrastructure (reused by Task 2.5+):

  • .github/workflows/backend.yml and backend_pr.yml forward platform-tag from matrix to backend_build.yml.
  • A new backend-merge-jobs job in both workflows consumes a merge-matrix output from generate-matrix and calls the existing backend_merge.yml (already shipped in PR #9726, "ci: phase 1-3 of GHA free tier migration (path filter, multi-arch split prep, /mnt disk relief)").
  • scripts/changed-backends.js gains a computeMergeMatrix(entries) helper that groups filtered linux entries by tag-suffix, emits an entry only for groups of size ≥ 2, and warns if tag-latest disagrees across legs (cheap insurance for the 34-backend fan-out coming next).
  • PR-side merge job is also event-gated on github.event_name != 'pull_request' so the no-op-on-PR run doesn't even start.
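The grouping helper described above can be sketched as follows. This is a hypothetical reconstruction from the description, not the actual contents of scripts/changed-backends.js; field names and output shape are assumptions:

```javascript
// Sketch of computeMergeMatrix: group filtered linux entries by
// tag-suffix, emit a merge entry only for groups of size >= 2, and
// warn when legs of a group disagree on tag-latest.
function computeMergeMatrix(entries) {
  const groups = new Map();
  for (const e of entries) {
    if (!groups.has(e['tag-suffix'])) groups.set(e['tag-suffix'], []);
    groups.get(e['tag-suffix']).push(e);
  }

  const merge = [];
  for (const [suffix, legs] of groups) {
    // Singletons push by digest only and need no manifest list.
    if (legs.length < 2) continue;
    // Cheap insurance for the 34-backend fan-out: surface mismatches.
    const latest = new Set(legs.map((l) => l['tag-latest']));
    if (latest.size > 1) {
      console.warn(`tag-latest mismatch for ${suffix}: ${[...latest].join(', ')}`);
    }
    merge.push({ 'tag-suffix': suffix, 'tag-latest': legs[0]['tag-latest'] });
  }
  return { 'merge-matrix': merge, 'has-merges': merge.length > 0 };
}
```

With the two pilot legs plus an untouched single-arch entry as input, only the pilot pair would produce a merge entry.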

What's NOT in here (follow-ups)

  • Task 2.5: fan out the same shape to the other 34 multi-arch entries.
  • Task 2.6: same pattern for image.yml / image-pr.yml (3 multi-arch entries).
  • Phases 4–5: migrate bigger-runner and arc-runner-set jobs to free tier (depends on Phase 3 disk relief, already shipped).

Decisions worth flagging

  • Singletons not merged: backends with a single matrix entry (single-arch) push by digest only and don't need a manifest list. The computeMergeMatrix helper skips them.
  • tag-latest mismatch guard: cheap warning surfaced if the two legs disagree on tag-latest. Won't fire today (the two pilot legs both say 'auto'); future-proofs the 34-entry fan-out.
  • PR variant gating: the merge job's if: is at the job level (github.event_name != 'pull_request'), so the matrix doesn't even instantiate on PRs — saves a runner over relying on backend_merge.yml's internal step gates.
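The job-level gate described above looks roughly like this in the PR workflow (job names, matrix plumbing, and inputs are illustrative assumptions, not the literal workflow contents):

```yaml
backend-merge-jobs:
  # Job-level gate: on pull requests the matrix never instantiates,
  # saving a runner versus backend_merge.yml's internal step gates.
  if: ${{ github.event_name != 'pull_request' }}
  needs: [generate-matrix, backend-jobs]
  strategy:
    matrix:
      include: ${{ fromJSON(needs.generate-matrix.outputs.merge-matrix) }}
  uses: ./.github/workflows/backend_merge.yml
```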

Test plan

  • On this PR, the backend_pr.yml generate-matrix job runs and emits an empty merge-matrix (no backend dir touched), so backend-merge-jobs is correctly skipped via has-merges == 'false'.
  • After merge, the first push that touches backend/python/faster-whisper/ schedules:
    • 2 per-arch backend_build.yml jobs (amd64 + arm64 native), each pushing by digest under digests-cpu-faster-whisper-amd64 / -arm64.
    • 1 backend-merge-jobs matrix entry for -cpu-faster-whisper that downloads both digests and runs docker buildx imagetools create to produce the final tagged manifest list.
  • docker buildx imagetools inspect quay.io/go-skynet/local-ai-backends:master-cpu-faster-whisper shows two platforms (linux/amd64, linux/arm64).
  • arm64 native build of faster-whisper finishes faster than the previous emulated multi-arch run (compare wall-clock from before/after).
  • Same checks for -cpu-llama-cpp-quantization (the heavier one).
  • The weekly Sunday cron (added in PR #9726, "ci: phase 1-3 of GHA free tier migration (path filter, multi-arch split prep, /mnt disk relief)") still rebuilds the full matrix, and the merge-matrix correctly contains the two pilots.

Plan reference: docs/superpowers/plans/2026-05-08-ci-migration-to-gha-free-tier.md (uncommitted working artifact).

Assisted-by: Claude:claude-opus-4-7

Convert two backends from QEMU-emulated multi-arch (linux/amd64,linux/arm64
on a single ubuntu-latest) to native per-arch + manifest-list merge:
- amd64 leg on ubuntu-latest
- arm64 leg on ubuntu-24.04-arm (native, ~5-10x faster than emulated)
- merge job assembles both digests under the final tag via
  docker buildx imagetools create

Backends piloted:
- -cpu-faster-whisper (small Python, fast baseline)
- -cpu-llama-cpp-quantization (heavier compile path, stress test)

Infrastructure changes that the rest of Phase 2 (Tasks 2.5+) will reuse:
- .github/backend-matrix.yml entries gain a `platform-tag` field
  ('amd64'/'arm64') for matrix entries that participate in the split.
  Other entries omit it; backend_build.yml already defaults missing
  values to '' (empty cache key suffix preserved as cache<suffix>-).
- backend.yml + backend_pr.yml forward `platform-tag` from matrix to
  the reusable backend_build.yml.
- scripts/changed-backends.js groups filtered entries by tag-suffix
  and emits a `merge-matrix` (plus `has-merges`) for groups of size>=2.
  Singletons aren't merged.
- backend.yml + backend_pr.yml gain a `backend-merge-jobs` job that
  consumes merge-matrix and calls backend_merge.yml after backend-jobs.
  PR variant is also event-gated so the no-op-on-PR merge job doesn't
  even start.

The other 34 multi-arch entries are unchanged in this PR -- Task 2.5
fans out the same shape to them once the pilot is observed green.

Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
@mudler (Owner)

mudler commented May 8, 2026

can only test on master

@mudler mudler merged commit cb68cd1 into master May 8, 2026
51 checks passed
@mudler mudler deleted the ci/per-arch-split-pilot branch May 8, 2026 22:04
mudler added a commit that referenced this pull request May 8, 2026
The PR that introduced the per-arch + manifest-merge pilot (#9727)
only touched CI infrastructure files, so the path filter correctly
skipped backend builds on its merge commit. To observe the new
backend-merge-jobs flow assemble a real manifest list, this commit
touches faster-whisper's Makefile so its two new per-arch entries
schedule and the merge job runs.

The trailing comment is the smallest possible diff and is harmless
to the build.

Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
mudler added a commit that referenced this pull request May 9, 2026
…etire self-hosted, fix provenance) (#9730)

* ci: add per-arch + manifest-merge support for LocalAI server image

Mirror the backend_build.yml + backend_merge.yml pattern shipped in
PR #9726 for the LocalAI server image:

- image_build.yml accepts optional platform-tag (default ''), scopes
  registry cache to cache-localai<suffix>-<platform-tag>, and pushes
  by canonical digest only on push events. Digests upload as artifacts
  named digests-localai<suffix>-<platform-tag>, with a "-core"
  placeholder when tag-suffix is empty so the merge job's download
  pattern doesn't over-match across multiple suffixes.
- image_merge.yml is a new reusable workflow that downloads matching
  digest artifacts and assembles the final tagged manifest list via
  docker buildx imagetools create.

Image names differ from backend_*.yml: the LocalAI server is published
under quay.io/go-skynet/local-ai and localai/localai (not -backends).

Not yet wired into image.yml / image-pr.yml — Commit C does that.

Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* ci: fan out per-arch split to remaining 34 backends

Convert all remaining linux/amd64,linux/arm64 entries in
backend-matrix.yml to per-arch + manifest-merge form. Each was a
single matrix entry running both arches on x86 under QEMU emulation;
each becomes two entries — amd64 on ubuntu-latest, arm64 on
ubuntu-24.04-arm (native).

Four backends that were on bigger-runner (-cpu-llama-cpp,
-cpu-turboquant, -gpu-vulkan-llama-cpp, -gpu-vulkan-turboquant) have
both legs moved to free tier as part of the same change. They are
compile-only (no torch/CUDA install) and fit comfortably with the
setup-build-disk /mnt relocation. Phase 4 (next commit) retires the
remaining 5 single-arch bigger-runner entries.

After this commit:
- 271 total matrix entries (was 237)
- 0 multi-arch entries left
- 36 per-arch pairs (34 new + 2 pilots from PR #9727)
- 5 bigger-runner entries remaining (single-arch, Phase 4 target)

Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* ci: split LocalAI image multi-arch entries per arch + merge

Mirror the backend per-arch split for the main LocalAI image:

- image.yml's core-image-build matrix: split the core ('') and
  -gpu-vulkan entries into amd64 + arm64 legs each. amd64 on
  ubuntu-latest, arm64 on ubuntu-24.04-arm (native).
- New top-level core-image-merge and gpu-vulkan-image-merge jobs
  call image_merge.yml after core-image-build completes.
- image-pr.yml's image-build matrix: split the -vulkan-core entry.
  No merge job added on the PR side — image_build.yml's digest-push
  is push-only-event-gated, so a PR-side merge would have nothing
  to download.

After this commit, no workflow file references
linux/amd64,linux/arm64 in a single matrix slot.

Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* ci: retire bigger-runner from backend matrix (Phase 4)

Migrate the remaining 5 single-arch bigger-runner entries to
ubuntu-latest. Combined with the Phase 3 setup-build-disk /mnt
relocation (PR #9726), free-tier ubuntu-latest now has ~100 GB of
working space — enough for ROCm dev image (~16 GB), CUDA toolkit
(~5 GB), and the per-backend compile/install steps these entries do.

Backends migrated:
- -gpu-nvidia-cuda-12-llama-cpp
- -gpu-nvidia-cuda-12-turboquant
- -gpu-rocm-hipblas-faster-whisper
- -gpu-rocm-hipblas-coqui
- -cpu-ik-llama-cpp

After this commit, .github/backend-matrix.yml has zero bigger-runner
references. The bigger-runner used in tests-vibevoice-cpp-grpc-
transcription (test-extra.yml) is a separate concern handled in a
follow-up.

Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* ci: migrate 9 Intel oneAPI backends to free tier (Phase 5.1)

Intel oneAPI base image is ~6 GB; each backend's wheel install
stays well within the ~100 GB working space provided by Phase 3's
setup-build-disk /mnt relocation. Lowest-risk batch of the
arc-runner-set retirement.

Backends migrated:
  vllm, sglang, vibevoice, qwen-asr, nemo, qwen-tts,
  fish-speech, voxcpm, pocket-tts (all -gpu-intel-* variants).

Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* ci: migrate 15 ROCm Python backends to free tier (Phase 5.2)

ROCm dev image (~16 GB) plus per-backend torch/wheels install fits
on ubuntu-latest with the /mnt-relocated Docker root. These entries
include the heavier vLLM/sglang/transformers/diffusers stack on
ROCm; if any specific backend OOMs or runs out of disk, individual
flips back to arc-runner-set are revertable per-entry.

Backends migrated: all 15 -gpu-rocm-hipblas-* entries previously on
arc-runner-set (vllm/vllm-omni/sglang/transformers/diffusers/
ace-step/kokoro/vibevoice/qwen-asr/nemo/qwen-tts/fish-speech/
voxcpm/pocket-tts/neutts).

Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* ci: migrate 6 CUDA Python backends to free tier (Phase 5.3)

vLLM/sglang stacks on CUDA 12 and CUDA 13 are the heaviest
backends in the matrix — flash-attn intermediate layers can spike
disk usage during build. setup-build-disk's /mnt relocation gives
~100 GB working space which fits the documented peak.

Highest-risk batch of the arc-runner-set retirement; if any
backend fails to build on free tier, the per-entry runs-on flip
is the unit of revert.

Backends migrated: -gpu-nvidia-cuda-{12,13}-{vllm,vllm-omni,sglang}.

After this commit, .github/backend-matrix.yml has zero references
to arc-runner-set or bigger-runner. The migration is complete.

Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* ci: disable provenance on multi-registry digest pushes

Root-caused on master via PR #9727's pilot: when docker/build-push-action@v7
pushes a single build to TWO registries simultaneously with
push-by-digest=true, buildx generates a per-registry provenance
attestation manifest (because mode=max — the default for push:true —
includes the runner ID). That makes the resulting manifest-list digest
diverge across registries:

  arm64 -cpu-faster-whisper build:
    image manifest:        sha256:d3bdd34b... (identical, content-only)
    quay manifest list:    sha256:66b4cfc8... (with quay attestation)
    dockerhub manifest list: sha256:e0733c3b... (with dockerhub attestation)

steps.build.outputs.digest returns only one of the list digests
(empirically the dockerhub one). The merge job then asks
"quay.io/...@sha256:e0733c3b..." which doesn't exist on quay — that
list has digest 66b4cfc8 there. Result: imagetools create fails with
"not found" and the merge job fails (run 25581983094, job 75110021491).

Setting provenance: false drops the per-registry attestation; the
manifest-list digest becomes pure content, identical across both
registries, and steps.build.outputs.digest works on either lookup.

Applied to backend_build.yml and image_build.yml — both refactored
to use the same multi-registry digest-push pattern in the prior PRs.
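The fix amounts to one input on the build step. provenance is a real docker/build-push-action input; the surrounding step details below are an illustrative sketch, not the literal workflow contents:

```yaml
- name: Build and push by digest
  uses: docker/build-push-action@v7
  with:
    platforms: ${{ matrix.platforms }}
    outputs: type=image,push-by-digest=true,name-canonical=true,push=true
    # Drop the per-registry provenance attestation so the manifest-list
    # digest is pure content, identical on quay.io and Docker Hub.
    provenance: false
```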

Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
@localai-bot localai-bot added the enhancement New feature or request label May 9, 2026