PUMA v2.6.0 — Apple Silicon M3/M4/M5 support
PUMA v2.6.0 Release Notes
Release date: 2026-05-16
Previous release: v2.5.0 (2026-05-16)
Branch: develop → main (post-tag)
Summary
This release consolidates Sprint 9 (Apple Silicon M3/M4/M5 support)
onto the v2.5.0 base. It adds first-class detection of Apple Silicon
hosts (9 new profile identifiers covering M3 base/Pro/Max, M4
base/Pro/Max, M5 base/Pro/Max, and M5 Ultra), a native runtime mode
via ./start_puma.sh --native that boots Ollama with Metal on
macOS (no Docker), macOS-aware CodeCarbon tracking with a
powermetrics-availability probe, and the cross-architecture
reproducibility question documented as a testable hypothesis for a
future empirical close-out.
Empirical validation status: all apple-silicon-* profiles
declare empirical_validation: pending. PUMA's validation hardware
is the RTX 2060 Mobile 6 GB (gpu-entry); Apple Silicon hardware
joins the validation set in a future Sprint when MacBook M-series
hardware becomes available to the project. The dispatch
infrastructure shipped here enables that validation; the testing
protocol is documented in docs/CROSS_ARCH_REPRODUCIBILITY.md
§ "Testing protocol".
Highlights
Apple Silicon catalogued end-to-end
| Layer | Artefact |
|---|---|
| Profiles | 9 new apple-silicon-* identifiers in config/profiles.yaml; schema extended additively with apple_silicon_required, chip_brand_match, min_unified_memory_gb |
| Catalog | catalog_version bumped 2.5.0 → 2.6.0; conservative profiles_compatible[] additions per a ≈ 2× GGUF memory-headroom rule |
| Detection | src/puma/preflight/apple_silicon.py (NEW) — platform-isolated sysctl-based detection; testable on Linux via unittest.mock |
| Dispatch | SystemCapabilities gains chip_brand + unified_memory_gb; Profile gains optional Apple Silicon fields; select_profile() runs a new branch BEFORE the existing GPU/CPU dispatch |
| Runtime | ./start_puma.sh --native boots native Ollama + Python venv on macOS; ./stop_puma_native.sh teardown companion |
| Emissions | get_tracking_mode_and_warnings() resolver in puma.sustainability.codecarbon_wrapper; powermetrics probe; graceful process-mode fallback on macOS |
| Docs | docs/CROSS_ARCH_REPRODUCIBILITY.md (NEW); extensions in MACOS_NOTES.md, HARDWARE.md, CATALOG_HISTORY.md |
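
The detection layer can be pictured with a minimal sketch. It assumes the chip brand arrives as the string reported by `sysctl -n machdep.cpu.brand_string` (for example "Apple M4 Pro") and memory as the byte count from `sysctl -n hw.memsize`. The helper names `brand_to_profile` and `memsize_to_gb` and the regular expression are illustrative assumptions, not the contents of src/puma/preflight/apple_silicon.py.

```python
import platform
import re
from typing import Optional


def is_apple_silicon() -> bool:
    """Gate: only Darwin on arm64 counts as an Apple Silicon host."""
    return platform.system() == "Darwin" and platform.machine() == "arm64"


def brand_to_profile(brand: str) -> Optional[str]:
    """Map 'Apple M4 Pro' to 'apple-silicon-m4-pro'; return None for unmapped chips."""
    match = re.fullmatch(r"Apple M([345])(?: (Pro|Max|Ultra))?", brand.strip())
    if match is None:
        return None
    generation, tier = match.groups()
    if tier == "Ultra" and generation != "5":
        return None  # assumption: only the M5 Ultra is catalogued
    suffix = f"-{tier.lower()}" if tier else ""
    return f"apple-silicon-m{generation}{suffix}"


def memsize_to_gb(raw_memsize: str) -> int:
    """Convert the hw.memsize byte count (a decimal string) to whole GiB."""
    return int(raw_memsize.strip()) // (1024 ** 3)
```

Because the parsing helpers take plain strings, they can be exercised on any platform; only `is_apple_silicon()` depends on the host.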
Conservative model compatibility
Apple Silicon variants need at least roughly 2× the model's GGUF
size in unified memory plus a small OS overhead. Applied as a rule:
| Model | Compatible Apple Silicon variants |
|---|---|
| qwen2.5:1.5b, gemma3:1b | All 10 (m3 base → m5-ultra) |
| qwen2.5:3b, gemma3:4b | All except m3 base (8 GB tight) |
| qwen2.5:7b, mistral:7b, llama3.1:8b, deepseek-r1:7b | Pro / Max / Ultra only (≥ 18 GB) |
| qwen2.5:14b, deepseek-r1:14b, gemma3:27b | Max / Ultra only (≥ 36 GB) |
| gemma3:12b | m3-max, m4-pro+, m5-pro+ (≥ 24 GB) |
| gemma4:* | Excluded from every apple-silicon-* (P6 enforcement) |
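
The headroom rule can be sketched as a one-line inequality. The function name, the 2 GB OS overhead default, and the 4.7 GB example file size are assumptions chosen for illustration; the catalog's actual thresholds are the conservative ones listed in the table, which this sketch does not reproduce row for row.

```python
def fits_unified_memory(
    gguf_size_gb: float,
    unified_memory_gb: float,
    headroom_factor: float = 2.0,   # the "roughly 2x" rule
    os_overhead_gb: float = 2.0,    # assumed value for "a small OS overhead"
) -> bool:
    """Return True when the host has headroom_factor x GGUF size plus OS overhead."""
    required_gb = headroom_factor * gguf_size_gb + os_overhead_gb
    return unified_memory_gb >= required_gb


# A hypothetical 4.7 GB GGUF file needs 2 * 4.7 + 2 = 11.4 GB.
print(fits_unified_memory(4.7, 8))    # 8 GB host: not enough headroom
print(fits_unified_memory(4.7, 18))   # 18 GB host: fits
```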
gemma4 family — exclusion preserved AND extended
The gemma4 family (gemma4:e2b, gemma4:e4b, gemma4:26b-a4b)
stays excluded from gpu-entry per F8 / D18.
test_gemma4_family_excluded_from_gpu_entry is preserved
unchanged. v2.6.0 extends the exclusion with a new invariant
test
test_gemma4_family_not_compatible_with_any_apple_silicon
ensuring the same VRAM-pressure failure mode is not re-introduced
on small unified-memory variants by accidental copy-paste during
future catalog edits. Re-enabling any (gemma4, apple-silicon-*)
pair requires new empirical evidence on Mac hardware and an
explicit debt entry referencing the prior exclusion.
CodeCarbon survives on macOS (Mode B)
tracking_mode="machine" is what PUMA's split-container Linux
architecture relies upon (the D15 fix). On macOS Mode B without
passwordless powermetrics, machine-mode silently fails to record
energy. v2.6.0 adopts a graceful fallback in
get_tracking_mode_and_warnings():
- Passwordless powermetrics configured →
  tracking_mode="machine", no warnings.
- Default macOS state (sudo required) →
  tracking_mode="process" + one warning pointing at docs/MACOS_NOTES.md.
- puma run --no-emissions → tracking disabled entirely.
Linux + NVIDIA path is byte-identical to v2.5.0:
tracking_mode="machine" with no warnings.
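
The fallback described above amounts to a small decision table. The sketch below is a hedged illustration: the function name `resolve_tracking_mode`, its boolean parameters, the `None` return for disabled tracking, and the warning text are assumptions, not the signature of get_tracking_mode_and_warnings().

```python
from typing import List, Optional, Tuple


def resolve_tracking_mode(
    on_apple_silicon: bool,
    powermetrics_passwordless: bool,
    emissions_enabled: bool = True,
) -> Tuple[Optional[str], List[str]]:
    """Return (tracking_mode, warnings); a mode of None means tracking is disabled."""
    if not emissions_enabled:
        # Corresponds to `puma run --no-emissions`.
        return None, []
    if not on_apple_silicon:
        # Linux path: the same value v2.5.0 hardcoded.
        return "machine", []
    if powermetrics_passwordless:
        return "machine", []
    # Default macOS state: powermetrics needs sudo, so degrade gracefully.
    warning = (
        "powermetrics is not available without a password; "
        "using process-mode tracking (see docs/MACOS_NOTES.md)"
    )
    return "process", [warning]
```

Keeping the resolver a pure function of three booleans means the powermetrics probe can be mocked or skipped entirely on Linux runners.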
Cross-architecture reproducibility — open question, testable hypothesis
v2.6.0 frames bit-exact reproducibility between x86_64 Linux and
arm64 macOS as an open empirical question. Because Q4_K_M uses
integer quantisation, F1 and MAE are expected to be bit-exact across
architectures; logprobs (and therefore ECE) are expected to differ
by floating-point rounding. docs/CROSS_ARCH_REPRODUCIBILITY.md
records the H0 / H1 / H2 / H3 hypotheses and a 6-step testing
protocol for closing them out when Mac hardware joins the
validation set.
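
One way to picture the distinction between the two expectations is a comparison that uses strict equality for F1 and MAE and a tolerance for ECE. This is a sketch only: the function name, the metric keys, and the caller-supplied `ece_tolerance` are assumptions, and it is not the 6-step protocol from the reproducibility document.

```python
import math
from typing import Dict, List


def compare_cross_arch(
    linux: Dict[str, float],
    macos: Dict[str, float],
    ece_tolerance: float,
) -> List[str]:
    """Return mismatch descriptions; an empty list means the expectations held."""
    mismatches = []
    # F1 and MAE are hypothesised to match bit-for-bit, so compare with ==.
    for metric in ("f1", "mae"):
        if linux[metric] != macos[metric]:
            mismatches.append(f"{metric}: {linux[metric]!r} != {macos[metric]!r}")
    # ECE derives from logprobs, so only closeness within a tolerance is checked.
    if not math.isclose(linux["ece"], macos["ece"], rel_tol=0.0, abs_tol=ece_tolerance):
        mismatches.append(f"ece differs by more than {ece_tolerance}")
    return mismatches
```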
Tests
- 354 → 402 passing (-m "not ollama"), 7 deselected.
- New tests/unit/test_apple_silicon.py: 28 tests covering every
  public entry point of the new module with mocks for the
  Darwin/arm64 gate, sysctl success + 3 failure modes
  (FileNotFoundError, TimeoutExpired, CalledProcessError), a
  parametrised mapping for all 10 chip brands, forward-compat for
  unmapped chips, the get_apple_silicon_info dict shape, and
  consistency checks on CHIP_BRAND_TO_PROFILE.
- New tests/unit/test_codecarbon_macos.py: 7 tests for the
  tracking-mode helper and powermetrics probe: Linux
  short-circuit (never invokes subprocess), macOS with/without
  sudoers, probe behaviour on FileNotFoundError / non-zero exit /
  zero exit.
- Extended tests/unit/test_catalog_metadata.py: +5 tests
  (VALID_PROFILES inclusion, profiles.yaml definitions for all 9
  apple-silicon-*, chip_brand_match uniqueness, gemma4 exclusion
  from every apple-silicon-*, qwen2.5:3b anchor on
  apple-silicon-m4-pro). Pre-existing
  test_model_metadata_is_internally_consistent and
  test_gemma4_family_excluded_from_gpu_entry preserved
  unchanged.
- Extended tests/unit/test_preflight_profile.py: +7 tests for the
  auto-dispatch path (M4/M4 Pro/M5 Max), boundary at 8 GB unified
  memory, fall-through cases for insufficient unified memory,
  unmapped chips, and non-Apple chip-brand values; +1 manual
  override test for apple-silicon-m4.
- pre-commit run --all-files: all hooks green.
- puma validate-baseline (triage, F1 path, fresh Ollama):
  PASS f1=0.5831, delta=-0.0036, ±0.01.
- puma validate-baseline --expected-mae 5.7150 (estimation,
  fresh Ollama): PASS mae=5.7150, delta=+0.0000, ±0.05
  (bit-exact).
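
The mocking approach used for the sysctl success and failure modes can be illustrated with a self-contained sketch. `read_chip_brand` is a hypothetical stand-in for the module's reader, and the test class is not taken from tests/unit/test_apple_silicon.py; only the three exception types and the use of unittest.mock come from the notes above.

```python
import subprocess
import unittest
from typing import Optional
from unittest import mock


def read_chip_brand() -> Optional[str]:
    """Hypothetical reader: return the sysctl brand string, or None on failure."""
    try:
        completed = subprocess.run(
            ["sysctl", "-n", "machdep.cpu.brand_string"],
            capture_output=True, text=True, check=True, timeout=2.0,
        )
    except (FileNotFoundError, subprocess.TimeoutExpired, subprocess.CalledProcessError):
        return None
    return completed.stdout.strip()


class ReadChipBrandTests(unittest.TestCase):
    def test_success(self):
        fake = subprocess.CompletedProcess(args=[], returncode=0, stdout="Apple M4 Pro\n")
        with mock.patch("subprocess.run", return_value=fake):
            self.assertEqual(read_chip_brand(), "Apple M4 Pro")

    def test_failure_modes_return_none(self):
        failures = [
            FileNotFoundError("sysctl"),
            subprocess.TimeoutExpired(cmd="sysctl", timeout=2.0),
            subprocess.CalledProcessError(returncode=1, cmd="sysctl"),
        ]
        for exc in failures:
            with self.subTest(exc=type(exc).__name__):
                with mock.patch("subprocess.run", side_effect=exc):
                    self.assertIsNone(read_chip_brand())
```

Since `subprocess.run` is patched in every case, the tests never invoke a real `sysctl` binary and behave the same on Linux and macOS runners.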
Quality
- Coverage: 61 % (no significant change from v2.5.0).
- CI: green on both main and develop. The integration-tests-ollama
  job introduced in v2.5.0 continues to run on push to those
  branches only.
- Baseline reproducibility: F1 = 0.5867 ± 0.01 on triage_jira
  preserved. MAE = 5.7150 ± 0.05 on estimation_tawos preserved.
- src/puma/cli.py LOC unchanged from v2.5.0; the only
  meaningfully-grown file is src/puma/preflight/apple_silicon.py
  (NEW, 141 LOC).
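
The baseline figures reduce to a simple tolerance check: the observed metric minus the expected baseline must stay within the stated band. The sketch below reproduces the arithmetic for the two reported runs; the function name and the inclusive `<=` comparison are assumptions, not the implementation behind puma validate-baseline.

```python
from typing import Tuple


def check_baseline(observed: float, expected: float, tolerance: float) -> Tuple[str, float]:
    """Return (status, delta), with delta rounded to four decimal places."""
    delta = observed - expected
    status = "PASS" if abs(delta) <= tolerance else "FAIL"
    return status, round(delta, 4)


# Triage: observed F1 0.5831 against the 0.5867 baseline, tolerance 0.01.
print(check_baseline(0.5831, 0.5867, 0.01))   # ('PASS', -0.0036)
# Estimation: observed MAE 5.7150 against the 5.7150 baseline, tolerance 0.05.
print(check_baseline(5.7150, 5.7150, 0.05))   # ('PASS', 0.0)
```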
Design decisions
- Linux path byte-identical to v2.5.0. Every Apple Silicon
  code path returns None / no-ops when caps.chip_brand is None
  or is_apple_silicon() is False, so no CI invocation on Linux
  changes behaviour. Governed by P5 (additive over modification)
  and P3 (reproducibility non-negotiable).
- apple-silicon-* profiles ship with empirical_validation: pending.
  Cataloguing without validation is a deliberate signalling
  choice: the dispatch infrastructure exists, the numbers do not.
  Frames the future Mac-hardware Sprint as empirical close-out
  rather than groundwork. Aligns with the F8/D18 lesson: nominal
  specifications do not predict runtime compatibility on
  constrained hardware.
- gemma4 stays excluded across both gpu-entry AND
  apple-silicon-*. P6 generalises from "do not re-introduce
  previously-rejected (model, profile) pairs" to "do not
  re-introduce the failure mode in a different profile family".
- get_tracking_mode_and_warnings() is additive. The Linux
  branch returns ("machine", []), byte-identical to the
  v2.5.0-hardcoded value. The macOS branch is the new behaviour;
  it only runs when is_apple_silicon() returns True, which is
  False on every Linux test runner.
- Cross-arch reproducibility documented as testable, not as
  fact. v2.6.0 ships a hypothesis (H0/H1/H2/H3) and a protocol,
  not a claim of bit-exactness. Honest about the empirical gap.
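
The additive dispatch design can be sketched as an Apple Silicon branch that either returns a profile name or returns None so the pre-existing dispatch runs untouched. The dataclass field names follow those listed in these notes; the helper `select_apple_silicon_profile`, the `legacy_dispatch` callable, and the 16 GB minimum in the usage example are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class SystemCapabilities:
    chip_brand: Optional[str] = None          # None on non-Apple hosts
    unified_memory_gb: Optional[float] = None


@dataclass
class Profile:
    name: str
    apple_silicon_required: bool = False
    chip_brand_match: Optional[str] = None
    min_unified_memory_gb: Optional[float] = None


def select_apple_silicon_profile(
    caps: SystemCapabilities, profiles: List[Profile]
) -> Optional[str]:
    """Return a matching apple-silicon-* profile name, or None to fall through."""
    if caps.chip_brand is None:
        return None  # Linux and other non-Apple hosts never enter this branch
    for profile in profiles:
        if not profile.apple_silicon_required:
            continue
        if profile.chip_brand_match != caps.chip_brand:
            continue
        if (caps.unified_memory_gb or 0.0) < (profile.min_unified_memory_gb or 0.0):
            continue  # insufficient unified memory: fall through
        return profile.name
    return None


def select_profile(
    caps: SystemCapabilities,
    profiles: List[Profile],
    legacy_dispatch: Callable[[SystemCapabilities], str],
) -> str:
    """Try the Apple Silicon branch first, then the unchanged GPU/CPU dispatch."""
    return select_apple_silicon_profile(caps, profiles) or legacy_dispatch(caps)
```

For example, with a single `Profile("apple-silicon-m4", True, "Apple M4", 16.0)` entry, a host reporting `chip_brand="Apple M4"` and 16 GB selects that profile, while a host with `chip_brand=None` goes straight to `legacy_dispatch`.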
Debt tracking
- No new open debt introduced by this release.
- No closure of pre-existing debt: Sprint 9 is forward-looking
  infrastructure, not a debt-paydown Sprint.
- Empirical validation of apple-silicon-* profiles is the
  explicit follow-up; it is not tracked as "debt" because the
  catalogue declares its own status (empirical_validation: pending).
Known limitations
Unchanged from v2.5.0:
- Single hardware tier empirically evaluated (gpu-entry); models
  requiring gpu-mid/gpu-high are catalogued but not yet
  validated.
- AMD ROCm not yet detected.
- TAWOS SHA-256 end-to-end fetch test pending (Gate D criterion 3).
- input_text not persisted in triage_jira instances (D22, Low).
New in v2.6.0:
- All apple-silicon-* profiles declare
  empirical_validation: pending. Users on Mac hardware should
  validate cross-arch reproducibility on their specific chip
  before treating F1/MAE results as comparable to the Linux
  baselines; see docs/CROSS_ARCH_REPRODUCIBILITY.md § "Testing
  protocol".
- macOS Mode B (native Ollama) is opt-in via
  ./start_puma.sh --native; the default ./start_puma.sh Docker
  path is unchanged.
Upgrade notes
- No breaking changes. Existing CI invocations of
  puma validate-baseline (no flags) continue to work unchanged.
  Linux + NVIDIA dispatch is byte-identical to v2.5.0.
- macOS users can opt into native mode (Metal acceleration,
  no Docker) via ./start_puma.sh --native. Requires
  brew install ollama first. See docs/MACOS_NOTES.md for the
  Mode A / Mode B operational comparison.
- For energy tracking on macOS native mode, consider
  configuring passwordless powermetrics per the section in
  docs/MACOS_NOTES.md. Without it, PUMA falls back to
  process-mode and emits a single warning; energy data will be
  recorded but less precise.
- New docs to know about: CROSS_ARCH_REPRODUCIBILITY.md,
  extended MACOS_NOTES.md (powermetrics section),
  CATALOG_HISTORY.md (catalog_version 2.6.0 entry).
Future work pointer
Sprint 10 (planned, awaiting explicit user confirmation per P8):
catalog expansion to add forthcoming Qwen 3.6 and Kimi K2.6 model
families to the gpu-high profile with
empirical_validation: pending — same restraint as Apple Silicon,
catalogued without claiming compatibility on PUMA's current
validation hardware. Empirical close-out of cross-arch
reproducibility (H0–H3) is gated on Mac hardware availability and
is not tied to Sprint 10.
Acknowledgments
Development assistance provided by generative AI tooling. All
commits are attributed to the project's git identity per
repository convention.