PUMA v2.6.0 — Apple Silicon M3/M4/M5 support

@pumacp pumacp released this 16 May 02:40
· 235 commits to main since this release

PUMA v2.6.0 Release Notes

Release date: 2026-05-16
Previous release: v2.5.0 (2026-05-16)
Branch: develop → main (post-tag)

Summary

This release consolidates Sprint 9 (Apple Silicon M3/M4/M5 support)
onto the v2.5.0 base. It adds first-class detection of Apple Silicon
hosts (9 new profile identifiers covering M3 base/Pro/Max, M4
base/Pro/Max, M5 base/Pro/Max, and M5 Ultra), a native runtime mode
via ./start_puma.sh --native that boots Ollama with Metal on
macOS (no Docker), macOS-aware CodeCarbon tracking with a
powermetrics-availability probe, and the cross-architecture
reproducibility question documented as a testable hypothesis for a
future empirical close-out.

Empirical validation status: all apple-silicon-* profiles
declare empirical_validation: pending. PUMA's validation hardware
is the RTX 2060 Mobile 6 GB (gpu-entry); Apple Silicon hardware
joins the validation set in a future Sprint when MacBook M-series
hardware becomes available to the project. The dispatch
infrastructure shipped here enables that validation; the testing
protocol is documented in docs/CROSS_ARCH_REPRODUCIBILITY.md
§ "Testing protocol".

Highlights

Apple Silicon catalogued end-to-end

| Layer | Artefact |
| --- | --- |
| Profiles | 9 new apple-silicon-* identifiers in config/profiles.yaml; schema extended additively with apple_silicon_required, chip_brand_match, min_unified_memory_gb |
| Catalog | catalog_version bumped 2.5.0 → 2.6.0; conservative profiles_compatible[] additions per a ≈ 2× GGUF memory-headroom rule |
| Detection | src/puma/preflight/apple_silicon.py (NEW): platform-isolated sysctl-based detection; testable on Linux via unittest.mock |
| Dispatch | SystemCapabilities gains chip_brand + unified_memory_gb; Profile gains optional Apple Silicon fields; select_profile() runs a new branch BEFORE the existing GPU/CPU dispatch |
| Runtime | ./start_puma.sh --native boots native Ollama + Python venv on macOS; ./stop_puma_native.sh teardown companion |
| Emissions | get_tracking_mode_and_warnings() resolver in puma.sustainability.codecarbon_wrapper; powermetrics probe; graceful process-mode fallback on macOS |
| Docs | docs/CROSS_ARCH_REPRODUCIBILITY.md (NEW); extensions in MACOS_NOTES.md, HARDWARE.md, CATALOG_HISTORY.md |
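
The detection layer can be pictured with a minimal sketch. It is not the contents of src/puma/preflight/apple_silicon.py: the helper names read_sysctl, detect_chip_brand and detect_unified_memory_gb, the timeout, and the choice of the sysctl keys machdep.cpu.brand_string and hw.memsize are assumptions made for illustration. Only the Darwin/arm64 gate and the three handled failure modes come from these notes.

```python
import platform
import subprocess
from typing import Optional


def is_apple_silicon() -> bool:
    """Gate: only Darwin on arm64 counts as an Apple Silicon host."""
    return platform.system() == "Darwin" and platform.machine() == "arm64"


def read_sysctl(key: str, timeout_s: float = 2.0) -> Optional[str]:
    """Return the stripped output of `sysctl -n <key>`, or None on any failure."""
    try:
        result = subprocess.run(
            ["sysctl", "-n", key],
            capture_output=True,
            text=True,
            check=True,  # non-zero exit raises CalledProcessError
            timeout=timeout_s,
        )
    except (
        FileNotFoundError,
        subprocess.TimeoutExpired,
        subprocess.CalledProcessError,
    ):
        return None
    return result.stdout.strip() or None


def detect_chip_brand() -> Optional[str]:
    """Return a brand string such as 'Apple M4 Pro', or None off Apple Silicon."""
    if not is_apple_silicon():
        return None  # never shells out on Linux
    return read_sysctl("machdep.cpu.brand_string")  # assumed key


def detect_unified_memory_gb() -> Optional[int]:
    """Return installed memory in whole GiB, or None if it cannot be read."""
    if not is_apple_silicon():
        return None
    raw = read_sysctl("hw.memsize")  # assumed key; reports bytes
    return int(raw) // (1024 ** 3) if raw and raw.isdigit() else None
```

Because every call goes through platform and subprocess module attributes, a Linux test can patch platform.system, platform.machine and subprocess.run with unittest.mock and exercise both the success path and each failure mode without a Mac.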

Conservative model compatibility

Each Apple Silicon variant needs unified memory of roughly 2× the
model's GGUF size, plus a small OS overhead. Applied as a rule:

| Model | Compatible Apple Silicon variants |
| --- | --- |
| qwen2.5:1.5b, gemma3:1b | All 10 (m3 base → m5-ultra) |
| qwen2.5:3b, gemma3:4b | All except m3 base (8 GB tight) |
| qwen2.5:7b, mistral:7b, llama3.1:8b, deepseek-r1:7b | Pro / Max / Ultra only (≥ 18 GB) |
| qwen2.5:14b, deepseek-r1:14b, gemma3:27b | Max / Ultra only (≥ 36 GB) |
| gemma3:12b | m3-max, m4-pro+, m5-pro+ (≥ 24 GB) |
| gemma4:* | Excluded from every apple-silicon-* (P6 enforcement) |
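
The headroom rule can be written as a one-line check. The sketch below is illustrative only: the function name, the 2 GB OS overhead default, and the file sizes in the usage note are assumptions, since these notes state the 2× factor but not the overhead value or per-model sizes.

```python
def fits_in_unified_memory(
    gguf_size_gb: float,
    unified_memory_gb: float,
    os_overhead_gb: float = 2.0,   # assumed value; the notes only say "small"
    headroom_factor: float = 2.0,  # the ≈ 2× rule from the notes
) -> bool:
    """Return True if a model of the given GGUF size fits under the rule."""
    required_gb = headroom_factor * gguf_size_gb + os_overhead_gb
    return unified_memory_gb >= required_gb
```

Under these assumptions a hypothetical 4.5 GB GGUF file needs 11 GB, so it passes on an 18 GB variant and fails on an 8 GB one, while a 3 GB file sits exactly on the 8 GB boundary.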

gemma4 family — exclusion preserved AND extended

The gemma4 family (gemma4:e2b, gemma4:e4b, gemma4:26b-a4b)
stays excluded from gpu-entry per F8 / D18.
test_gemma4_family_excluded_from_gpu_entry is preserved
unchanged. v2.6.0 extends the exclusion with a new invariant test,
test_gemma4_family_not_compatible_with_any_apple_silicon,
ensuring the same VRAM-pressure failure mode is not re-introduced
on small unified-memory variants by accidental copy-paste during
future catalog edits. Re-enabling any (gemma4, apple-silicon-*)
pair requires new empirical evidence on Mac hardware and an
explicit debt entry referencing the prior exclusion.
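
An invariant of this kind might look like the following sketch. The CATALOG fragment and its profiles_compatible values are placeholders invented for the example (only the key name profiles_compatible, the model tags, and the test name come from these notes), and the helper gemma4_apple_silicon_violations is hypothetical.

```python
# Placeholder catalog fragment; the real entries live in PUMA's catalog files.
CATALOG = {
    "qwen2.5:3b": {"profiles_compatible": ["apple-silicon-m4-pro"]},
    "gemma4:e2b": {"profiles_compatible": ["gpu-high"]},
    "gemma4:e4b": {"profiles_compatible": ["gpu-high"]},
    "gemma4:26b-a4b": {"profiles_compatible": ["gpu-high"]},
}


def gemma4_apple_silicon_violations(catalog: dict) -> list:
    """List every (gemma4 model, apple-silicon profile) pair in the catalog."""
    return sorted(
        (model, profile)
        for model, entry in catalog.items()
        if model.startswith("gemma4:")
        for profile in entry["profiles_compatible"]
        if profile.startswith("apple-silicon-")
    )


def test_gemma4_family_not_compatible_with_any_apple_silicon():
    # Fails as soon as a catalog edit adds any such pair.
    assert gemma4_apple_silicon_violations(CATALOG) == []
```

Returning the offending pairs rather than a bare boolean means a failing run names exactly which catalog entry was copy-pasted.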

CodeCarbon survives on macOS (Mode B)

tracking_mode="machine" is what PUMA's split-container Linux
architecture relies upon (the D15 fix). On macOS Mode B without
passwordless powermetrics, machine-mode silently fails to record
energy. v2.6.0 adopts a graceful fallback in
get_tracking_mode_and_warnings():

  1. Passwordless powermetrics configured → tracking_mode="machine", no warnings.
  2. Default macOS state (sudo required) → tracking_mode="process" + one warning pointing at docs/MACOS_NOTES.md.
  3. puma run --no-emissions → tracking disabled entirely.

Linux + NVIDIA path is byte-identical to v2.5.0:
tracking_mode="machine" with no warnings.
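
The fallback can be sketched as below. This is an approximation of the described behaviour, not the code in puma.sustainability.codecarbon_wrapper: the names resolve_tracking_mode and powermetrics_is_passwordless, the function parameters, the warning text, and the exact probe command (a non-interactive `sudo -n` invocation taking a single powermetrics sample) are all assumptions. The --no-emissions case is left out because it bypasses the resolver entirely.

```python
import subprocess
from typing import Callable, List, Tuple


def powermetrics_is_passwordless(timeout_s: float = 5.0) -> bool:
    """Probe: can powermetrics run under sudo without a password prompt?"""
    try:
        result = subprocess.run(
            # `sudo -n` fails instead of prompting; the command line is assumed.
            ["sudo", "-n", "powermetrics", "-n", "1", "-i", "100"],
            capture_output=True,
            timeout=timeout_s,
            check=False,
        )
    except (FileNotFoundError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0


def resolve_tracking_mode(
    on_apple_silicon: bool,
    probe: Callable[[], bool] = powermetrics_is_passwordless,
) -> Tuple[str, List[str]]:
    """Return (tracking_mode, warnings) following the cases listed above."""
    if not on_apple_silicon:
        return "machine", []  # Linux short-circuit: the probe is never called
    if probe():
        return "machine", []
    return "process", [
        "powermetrics requires sudo; falling back to process-mode energy "
        "tracking. See docs/MACOS_NOTES.md."
    ]
```

Passing the probe in as a parameter keeps the resolver testable on Linux: a test can supply a stub that returns True or False, or one that raises to prove the Linux branch never touches subprocess.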

Cross-architecture reproducibility — open question, testable hypothesis

v2.6.0 frames bit-exact reproducibility between x86_64 Linux and
arm64 macOS as an open empirical question. Because Q4_K_M is an
integer quantisation, F1 and MAE are expected to be bit-exact across
architectures; logprobs (and therefore ECE) are expected to differ
by floating-point rounding. The accompanying document records H0 / H1 / H2 / H3 hypotheses
and a 6-step testing protocol for closing them out when Mac
hardware joins the validation set. See
docs/CROSS_ARCH_REPRODUCIBILITY.md.
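
One way to check the two expectations on a pair of runs is sketched below. It is not the 6-step protocol from docs/CROSS_ARCH_REPRODUCIBILITY.md: the result-dict layout, the function names, and the 1e-4 absolute tolerance for logprobs are assumptions chosen for illustration.

```python
import math
import struct


def bit_exact(a: float, b: float) -> bool:
    """Compare two floats by their IEEE 754 double bit patterns."""
    return struct.pack("<d", a) == struct.pack("<d", b)


def compare_runs(linux: dict, macos: dict, logprob_tol: float = 1e-4) -> dict:
    """Summarise agreement between an x86_64 Linux run and an arm64 macOS run.

    Each run is assumed to look like {"f1": float, "mae": float,
    "logprobs": [float, ...]}.
    """
    same_length = len(linux["logprobs"]) == len(macos["logprobs"])
    return {
        "f1_bit_exact": bit_exact(linux["f1"], macos["f1"]),
        "mae_bit_exact": bit_exact(linux["mae"], macos["mae"]),
        # Logprobs are only required to agree within an absolute tolerance.
        "logprobs_close": same_length and all(
            math.isclose(a, b, rel_tol=0.0, abs_tol=logprob_tol)
            for a, b in zip(linux["logprobs"], macos["logprobs"])
        ),
    }
```

Comparing packed bytes rather than using == makes the bit-exact claim literal: it distinguishes 0.0 from -0.0 and treats identical NaN payloads as equal, which plain float equality does not.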

Tests

  • 354 → 402 passing (-m "not ollama"), 7 deselected.
  • New tests/unit/test_apple_silicon.py: 28 tests covering every
    public entry point of the new module with mocks for the
    Darwin/arm64 gate, sysctl success + 3 failure modes
    (FileNotFoundError, TimeoutExpired, CalledProcessError), a
    parametrised mapping for all 10 chip brands, forward-compat for
    unmapped chips, the get_apple_silicon_info dict shape, and
    consistency checks on CHIP_BRAND_TO_PROFILE.
  • New tests/unit/test_codecarbon_macos.py: 7 tests for the
    tracking-mode helper and powermetrics probe — Linux
    short-circuit (never invokes subprocess), macOS with/without
    sudoers, probe behaviour on FileNotFoundError / non-zero exit /
    zero exit.
  • Extended tests/unit/test_catalog_metadata.py: +5 tests
    (VALID_PROFILES inclusion, profiles.yaml definitions for all 9
    apple-silicon-* profiles, chip_brand_match uniqueness, gemma4
    exclusion from every apple-silicon-* profile, qwen2.5:3b anchor
    on apple-silicon-m4-pro). Pre-existing
    test_model_metadata_is_internally_consistent and
    test_gemma4_family_excluded_from_gpu_entry preserved
    unchanged.
  • Extended tests/unit/test_preflight_profile.py: +7 tests for the
    auto-dispatch path (M4/M4 Pro/M5 Max), boundary at 8 GB unified
    memory, fall-through cases for insufficient unified memory,
    unmapped chips, and non-Apple chip-brand values; +1 manual
    override test for apple-silicon-m4.
  • pre-commit run --all-files: all hooks green.
  • puma validate-baseline (triage, F1 path, fresh Ollama):
    PASS f1=0.5831, delta=-0.0036, ±0.01.
  • puma validate-baseline --expected-mae 5.7150 (estimation,
    fresh Ollama): PASS mae=5.7150, delta=+0.0000, ±0.05
    (bit-exact).
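
The two PASS lines above reduce to a tolerance check on the difference between the observed and expected metric. The sketch below reproduces that arithmetic with the numbers from these notes (expected F1 0.5867, expected MAE 5.7150); the function name check_baseline is hypothetical and this is not the implementation of puma validate-baseline.

```python
from typing import Tuple


def check_baseline(
    observed: float, expected: float, tolerance: float
) -> Tuple[bool, float]:
    """Return (passed, delta) where delta = observed - expected."""
    delta = observed - expected
    return abs(delta) <= tolerance, delta


f1_passed, f1_delta = check_baseline(0.5831, 0.5867, 0.01)
mae_passed, mae_delta = check_baseline(5.7150, 5.7150, 0.05)
print(f"f1:  passed={f1_passed} delta={f1_delta:+.4f}")
print(f"mae: passed={mae_passed} delta={mae_delta:+.4f}")
```

The F1 run lands 0.0036 below the baseline, well inside the ±0.01 band, while the MAE run has a delta of exactly zero, which is why only the latter is described as bit-exact.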

Quality

  • Coverage: 61 % (no significant change from v2.5.0).
  • CI: green on both main and develop. The integration-tests-ollama
    job introduced in v2.5.0 continues to run on push to those
    branches only.
  • Baseline reproducibility: F1 = 0.5867 ± 0.01 on triage_jira
    preserved. MAE = 5.7150 ± 0.05 on estimation_tawos preserved.
  • src/puma/cli.py LOC unchanged from v2.5.0; the only
    meaningfully-grown file is src/puma/preflight/apple_silicon.py
    (NEW, 141 LOC).

Design decisions

  • Linux path byte-identical to v2.5.0. Every Apple Silicon
    code path returns None / no-ops when caps.chip_brand is None
    or is_apple_silicon() is False, so no CI invocation on Linux
    changes behaviour. Governed by P5 (additive over modification)
    and P3 (reproducibility non-negotiable).
  • apple-silicon-* profiles ship with empirical_validation: pending. Cataloguing without validation is a deliberate
    signalling choice — the dispatch infrastructure exists, the
    numbers do not. Frames the future Mac-hardware Sprint as
    empirical close-out rather than groundwork. Aligns with the
    F8/D18 lesson: nominal specifications do not predict runtime
    compatibility on constrained hardware.
  • gemma4 stays excluded across both gpu-entry AND
    apple-silicon-*. P6 generalises from "do not re-introduce
    previously-rejected (model, profile) pairs" to "do not
    re-introduce the failure mode in a different profile family".
  • get_tracking_mode_and_warnings() is additive. The Linux
    branch returns ("machine", []), byte-identical to the
    v2.5.0-hardcoded value. The macOS branch is the new behaviour;
    it only runs when is_apple_silicon() returns True, which never
    happens on a Linux test runner.
  • Cross-arch reproducibility documented as testable, not as
    fact. v2.6.0 ships a hypothesis (H0/H1/H2/H3) and a protocol,
    not a claim of bit-exactness. Honest about the empirical gap.
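
The "additive branch that falls through" design can be sketched as follows. The dataclass fields chip_brand, unified_memory_gb, chip_brand_match and min_unified_memory_gb are named in these notes, but everything else is assumed: the PROFILES values are placeholders, select_apple_silicon_profile and the legacy_dispatch callback are hypothetical, and the real select_profile() signature may differ.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass(frozen=True)
class SystemCapabilities:
    chip_brand: Optional[str] = None        # None on every non-Apple host
    unified_memory_gb: Optional[int] = None


@dataclass(frozen=True)
class Profile:
    name: str
    chip_brand_match: Optional[str] = None
    min_unified_memory_gb: Optional[int] = None


# Placeholder entries; real definitions live in config/profiles.yaml.
PROFILES = [
    Profile("apple-silicon-m4", "Apple M4", 8),
    Profile("apple-silicon-m4-pro", "Apple M4 Pro", 24),
]


def select_apple_silicon_profile(
    caps: SystemCapabilities, profiles: Sequence[Profile]
) -> Optional[Profile]:
    """Return the matching Apple Silicon profile, or None to fall through."""
    if caps.chip_brand is None or caps.unified_memory_gb is None:
        return None
    for profile in profiles:
        # Exact match so "Apple M4" never captures "Apple M4 Pro".
        if (
            profile.chip_brand_match == caps.chip_brand
            and caps.unified_memory_gb >= (profile.min_unified_memory_gb or 0)
        ):
            return profile
    return None


def select_profile(
    caps: SystemCapabilities,
    profiles: Sequence[Profile],
    legacy_dispatch: Callable[[SystemCapabilities], str],
) -> str:
    """Try the Apple Silicon branch first, then the unchanged GPU/CPU path."""
    apple = select_apple_silicon_profile(caps, profiles)
    return apple.name if apple is not None else legacy_dispatch(caps)
```

On a Linux host chip_brand is None, so the new branch returns None immediately and the pre-existing dispatch decides the profile exactly as before. Unmapped chips and hosts with too little unified memory take the same fall-through path.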

Debt tracking

  • No new open debt introduced by this release.
  • No closure of pre-existing debt — Sprint 9 is forward-looking
    infrastructure, not a debt-paydown Sprint.
  • Empirical validation of apple-silicon-* profiles is the
    explicit follow-up; it is not tracked as "debt" because the
    catalogue declares its own status (empirical_validation: pending).

Known limitations

Unchanged from v2.5.0:

  • Single hardware tier empirically evaluated (gpu-entry); models
    requiring gpu-mid/gpu-high are catalogued but not yet
    validated.
  • AMD ROCm not yet detected.
  • TAWOS SHA-256 end-to-end fetch test pending (Gate D criterion 3).
  • input_text not persisted in triage_jira instances (D22, Low).

New in v2.6.0:

  • All apple-silicon-* profiles declare
    empirical_validation: pending. Users on Mac hardware should
    validate cross-arch reproducibility on their specific chip
    before treating F1/MAE results as comparable to the Linux
    baselines — see docs/CROSS_ARCH_REPRODUCIBILITY.md § "Testing
    protocol".
  • macOS Mode B (native Ollama) is opt-in via
    ./start_puma.sh --native; the default ./start_puma.sh Docker
    path is unchanged.

Upgrade notes

  • No breaking changes. Existing CI invocations of
    puma validate-baseline (no flags) continue to work unchanged.
    Linux + NVIDIA dispatch is byte-identical to v2.5.0.
  • macOS users can opt into native mode (Metal acceleration,
    no Docker) via ./start_puma.sh --native. Requires
    brew install ollama first. See docs/MACOS_NOTES.md for the
    Mode A / Mode B operational comparison.
  • For energy tracking on macOS native mode, consider
    configuring passwordless powermetrics per the section in
    docs/MACOS_NOTES.md. Without it, PUMA falls back to
    process-mode and emits a single warning — energy data will be
    recorded but less precise.
  • New docs to know about: CROSS_ARCH_REPRODUCIBILITY.md,
    extended MACOS_NOTES.md (powermetrics section),
    CATALOG_HISTORY.md (catalog_version 2.6.0 entry).

Future work pointer

Sprint 10 (planned, awaiting explicit user confirmation per P8):
catalog expansion to add forthcoming Qwen 3.6 and Kimi K2.6 model
families to the gpu-high profile with
empirical_validation: pending — same restraint as Apple Silicon,
catalogued without claiming compatibility on PUMA's current
validation hardware. Empirical close-out of cross-arch
reproducibility (H0–H3) is gated on Mac hardware availability and
is not tied to Sprint 10.

Acknowledgments

Development assistance provided by generative AI tooling. All
commits are attributed to the project's git identity per
repository convention.