PUMA v2.6.0 Release Notes

Release date: 2026-05-16
Previous release: v2.5.0 (2026-05-16)
Branch: develop → main (post-tag)

Summary

This release consolidates Sprint 9 (Apple Silicon M3/M4/M5 support)
onto the v2.5.0 base. It adds first-class detection of Apple Silicon
hosts (9 new profile identifiers covering M3 base/Pro/Max, M4
base/Pro/Max, M5 base/Pro/Max, and M5 Ultra) and a native runtime
mode, ./start_puma.sh --native, that boots Ollama with Metal on
macOS (no Docker). CodeCarbon tracking becomes macOS-aware through a
powermetrics-availability probe, and the cross-architecture
reproducibility question is documented as a testable hypothesis for
a future empirical close-out.

Empirical validation status: all apple-silicon-* profiles
declare empirical_validation: pending. PUMA's validation hardware
is the RTX 2060 Mobile 6 GB (gpu-entry); Apple Silicon hardware
joins the validation set in a future Sprint when MacBook M-series
hardware becomes available to the project. The dispatch
infrastructure shipped here enables that validation; the testing
protocol is documented in docs/CROSS_ARCH_REPRODUCIBILITY.md
§ "Testing protocol".

Highlights

Apple Silicon catalogued end-to-end

Layer	Artefact
Profiles	9 new `apple-silicon-*` identifiers in `config/profiles.yaml`; schema extended additively with `apple_silicon_required`, `chip_brand_match`, `min_unified_memory_gb`
Catalog	`catalog_version` bumped 2.5.0 → 2.6.0; conservative `profiles_compatible[]` additions per a ≈ 2× GGUF memory-headroom rule
Detection	`src/puma/preflight/apple_silicon.py` (NEW) — platform-isolated sysctl-based detection; testable on Linux via `unittest.mock`
Dispatch	`SystemCapabilities` gains `chip_brand` + `unified_memory_gb`; `Profile` gains optional Apple Silicon fields; `select_profile()` runs a new branch BEFORE the existing GPU/CPU dispatch
Runtime	`./start_puma.sh --native` boots native Ollama + Python venv on macOS; `./stop_puma_native.sh` teardown companion
Emissions	`get_tracking_mode_and_warnings()` resolver in `puma.sustainability.codecarbon_wrapper`; powermetrics probe; graceful process-mode fallback on macOS
Docs	`docs/CROSS_ARCH_REPRODUCIBILITY.md` (NEW); extensions in `MACOS_NOTES.md`, `HARDWARE.md`, `CATALOG_HISTORY.md`
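
The detection layer in the table above can be pictured with a short sketch. This is illustrative only and is not the contents of src/puma/preflight/apple_silicon.py: the helper name read_chip_brand, the sysctl key machdep.cpu.brand_string, and the 2-second timeout are assumptions. Only is_apple_silicon() and the three handled failure modes are named in these notes.

```python
import platform
import subprocess
from typing import Optional


def is_apple_silicon() -> bool:
    """Gate: true only on Darwin/arm64 hosts."""
    return platform.system() == "Darwin" and platform.machine() == "arm64"


def read_chip_brand(timeout_s: float = 2.0) -> Optional[str]:
    """Return the chip brand string (e.g. "Apple M4 Pro"), or None.

    Hypothetical helper: the sysctl key is an assumption about which
    key the real module queries.
    """
    if not is_apple_silicon():
        return None  # Linux and Intel Macs never reach subprocess
    try:
        result = subprocess.run(
            ["sysctl", "-n", "machdep.cpu.brand_string"],
            capture_output=True,
            text=True,
            check=True,
            timeout=timeout_s,
        )
    except (
        FileNotFoundError,
        subprocess.TimeoutExpired,
        subprocess.CalledProcessError,
    ):
        return None  # degrade to "not detected" instead of raising
    return result.stdout.strip() or None
```

Because the gate reads the platform at call time, both the gate and the sysctl call can be patched with unittest.mock, which is what makes a module of this shape testable on a Linux runner.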

Conservative model compatibility

A model is listed as compatible with an Apple Silicon variant only
if the variant's unified memory is at least roughly 2× the model's
GGUF size, plus a small OS overhead. Applying that rule gives:

Model	Compatible Apple Silicon variants
`qwen2.5:1.5b`, `gemma3:1b`	All 10 (m3 base → m5-ultra)
`qwen2.5:3b`, `gemma3:4b`	All except m3 base (8 GB tight)
`qwen2.5:7b`, `mistral:7b`, `llama3.1:8b`, `deepseek-r1:7b`	Pro / Max / Ultra only (≥ 18 GB)
`qwen2.5:14b`, `deepseek-r1:14b`, `gemma3:27b`	Max / Ultra only (≥ 36 GB)
`gemma3:12b`	m3-max, m4-pro+, m5-pro+ (≥ 24 GB)
`gemma4:*`	Excluded from every apple-silicon-* profile (P6 enforcement)
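
As a rough illustration of the headroom rule behind this table, the check can be written as a single comparison. The function name and the 2 GB os_overhead_gb default are assumptions made for this sketch; the notes only say "a small OS overhead" and do not give a figure.

```python
def fits_unified_memory(
    gguf_size_gb: float,
    unified_memory_gb: float,
    headroom_factor: float = 2.0,
    os_overhead_gb: float = 2.0,  # assumed value, not taken from the catalog
) -> bool:
    """True if the variant has ~2x the GGUF size plus OS overhead."""
    required_gb = headroom_factor * gguf_size_gb + os_overhead_gb
    return unified_memory_gb >= required_gb
```

Under these assumed defaults, a hypothetical 4.7 GB GGUF file would need 11.4 GB, so it would be rejected on an 8 GB variant and accepted on an 18 GB one, which matches the direction of the table rows.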

gemma4 family — exclusion preserved AND extended

The gemma4 family (gemma4:e2b, gemma4:e4b, gemma4:26b-a4b)
stays excluded from gpu-entry per F8 / D18.
test_gemma4_family_excluded_from_gpu_entry is preserved
unchanged. v2.6.0 extends the exclusion with a new invariant test,
test_gemma4_family_not_compatible_with_any_apple_silicon, which
ensures that the same VRAM-pressure failure mode is not re-introduced
on small unified-memory variants by accidental copy-paste during
future catalog edits. Re-enabling any (gemma4, apple-silicon-*)
pair requires new empirical evidence on Mac hardware and an
explicit debt entry referencing the prior exclusion.
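
An invariant test of this kind might look like the sketch below. The CATALOG dictionary is a hypothetical stand-in for the real catalog data (its shape and the empty gemma4 profile lists are assumptions), and the helper gemma4_apple_silicon_pairs is introduced here only for illustration.

```python
from typing import Dict, List, Tuple

# Hypothetical excerpt: model name -> compatible profile identifiers.
CATALOG: Dict[str, List[str]] = {
    "qwen2.5:3b": ["apple-silicon-m4-pro"],
    "gemma4:e2b": [],
    "gemma4:e4b": [],
    "gemma4:26b-a4b": [],
}


def gemma4_apple_silicon_pairs(
    catalog: Dict[str, List[str]],
) -> List[Tuple[str, str]]:
    """Return every (model, profile) pair that violates the exclusion."""
    return [
        (model, profile)
        for model, profiles in catalog.items()
        if model.startswith("gemma4:")
        for profile in profiles
        if profile.startswith("apple-silicon-")
    ]


def test_gemma4_family_not_compatible_with_any_apple_silicon() -> None:
    # Fails as soon as any gemma4 entry lists an apple-silicon-* profile.
    assert gemma4_apple_silicon_pairs(CATALOG) == []
```

Returning the offending pairs, rather than a bare boolean, means a failing run names exactly which catalog edit re-introduced the pairing.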

CodeCarbon survives on macOS (Mode B)

tracking_mode="machine" is what PUMA's split-container Linux
architecture relies upon (the D15 fix). On macOS Mode B without
passwordless powermetrics, machine-mode silently fails to record
energy. v2.6.0 adopts a graceful fallback in
get_tracking_mode_and_warnings():

Passwordless powermetrics configured → tracking_mode="machine", no warnings.
Default macOS state (sudo required) → tracking_mode="process" + one warning pointing at docs/MACOS_NOTES.md.
puma run --no-emissions → tracking disabled entirely.

Linux + NVIDIA path is byte-identical to v2.5.0:
tracking_mode="machine" with no warnings.
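
The fallback described above can be sketched as a small resolver. This is not the code of get_tracking_mode_and_warnings(): the names resolve_tracking_mode and powermetrics_is_passwordless, the probe command line, the timeout, and the warning text are all assumptions. The sketch relies on `sudo -n`, which fails instead of prompting when a password would be required.

```python
import subprocess
from typing import Callable, List, Tuple


def powermetrics_is_passwordless(timeout_s: float = 5.0) -> bool:
    """Probe whether powermetrics runs under sudo without a password.

    Assumed probe command: take a single short sample and check the
    exit code.
    """
    try:
        result = subprocess.run(
            ["sudo", "-n", "powermetrics", "-n", "1", "-i", "100"],
            capture_output=True,
            timeout=timeout_s,
        )
    except (FileNotFoundError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0


def resolve_tracking_mode(
    on_apple_silicon: bool,
    probe: Callable[[], bool] = powermetrics_is_passwordless,
) -> Tuple[str, List[str]]:
    """Return (tracking_mode, warnings) following the rules above."""
    if not on_apple_silicon:
        return "machine", []  # Linux path: the probe is never invoked
    if probe():
        return "machine", []
    return "process", [
        "powermetrics requires sudo; falling back to process-mode "
        "tracking. See docs/MACOS_NOTES.md."
    ]
```

Passing the probe in as a parameter keeps the Linux short-circuit easy to verify: a test can supply a probe that records calls and assert it was never used.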

Cross-architecture reproducibility — open question, testable hypothesis

v2.6.0 frames bit-exact reproducibility between x86_64 Linux and
arm64 macOS as an open empirical question. Because Q4_K_M
quantisation is integer-based, F1 and MAE are expected to be
bit-exact across architectures, while logprobs (and therefore ECE)
are expected to differ by floating-point rounding.
docs/CROSS_ARCH_REPRODUCIBILITY.md records the H0 / H1 / H2 / H3
hypotheses and a 6-step testing protocol for closing them out when
Mac hardware joins the validation set.
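
The expectation stated here (exact equality for F1 and MAE, a tolerance for ECE) could be checked with a comparison like the one below. It is a sketch and not part of the documented protocol: the function name, the metric dictionary keys, and the ece_abs_tol default are assumptions chosen for illustration.

```python
import math
from typing import Dict


def compare_runs(
    linux: Dict[str, float],
    macos: Dict[str, float],
    ece_abs_tol: float = 1e-3,  # assumed tolerance, not from the protocol
) -> Dict[str, bool]:
    """Compare metrics from one x86_64 Linux run and one arm64 macOS run."""
    return {
        # Expected to hold exactly if the hypothesis about F1/MAE is right.
        "f1_bit_exact": linux["f1"] == macos["f1"],
        "mae_bit_exact": linux["mae"] == macos["mae"],
        # ECE depends on logprobs, so only closeness is checked.
        "ece_within_tol": math.isclose(
            linux["ece"], macos["ece"], rel_tol=0.0, abs_tol=ece_abs_tol
        ),
    }
```

Keeping the three outcomes separate makes it clear which expectation failed if a Mac run diverges from the Linux baseline.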

Tests

354 → 402 passing (-m "not ollama"), 7 deselected.
New tests/unit/test_apple_silicon.py: 28 tests covering every
public entry point of the new module with mocks for the
Darwin/arm64 gate, sysctl success + 3 failure modes
(FileNotFoundError, TimeoutExpired, CalledProcessError), a
parametrised mapping for all 10 chip brands, forward-compat for
unmapped chips, the get_apple_silicon_info dict shape, and
consistency checks on CHIP_BRAND_TO_PROFILE.
New tests/unit/test_codecarbon_macos.py: 7 tests for the
tracking-mode helper and powermetrics probe — Linux
short-circuit (never invokes subprocess), macOS with/without
sudoers, probe behaviour on FileNotFoundError / non-zero exit /
zero exit.
Extended tests/unit/test_catalog_metadata.py: +5 tests
(VALID_PROFILES inclusion, profiles.yaml definitions for all 9
apple-silicon-* profiles, chip_brand_match uniqueness, gemma4
exclusion from every apple-silicon-* profile, qwen2.5:3b anchor on
apple-silicon-m4-pro). Pre-existing
test_model_metadata_is_internally_consistent and
test_gemma4_family_excluded_from_gpu_entry preserved
unchanged.
Extended tests/unit/test_preflight_profile.py: +7 tests for the
auto-dispatch path (M4/M4 Pro/M5 Max), boundary at 8 GB unified
memory, fall-through cases for insufficient unified memory,
unmapped chips, and non-Apple chip-brand values; +1 manual
override test for apple-silicon-m4.
pre-commit run --all-files: all hooks green.
puma validate-baseline (triage, F1 path, fresh Ollama):
PASS f1=0.5831, delta=-0.0036, ±0.01.
puma validate-baseline --expected-mae 5.7150 (estimation,
fresh Ollama): PASS mae=5.7150, delta=+0.0000, ±0.05 —
bit-exact.
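
The PASS lines above follow a simple tolerance rule: the run passes when the absolute delta between the observed and expected metric is within the stated band. The sketch below reproduces that arithmetic under the assumption that this is how puma validate-baseline decides. check_baseline and BaselineResult are hypothetical names, and the expected F1 of 0.5867 is the baseline quoted under Quality.

```python
from typing import NamedTuple


class BaselineResult(NamedTuple):
    passed: bool
    delta: float


def check_baseline(
    observed: float, expected: float, tolerance: float
) -> BaselineResult:
    """Pass when |observed - expected| is within the tolerance band."""
    delta = observed - expected
    return BaselineResult(passed=abs(delta) <= tolerance, delta=delta)


# Figures taken from the two validate-baseline runs reported above.
triage = check_baseline(observed=0.5831, expected=0.5867, tolerance=0.01)
estimation = check_baseline(observed=5.7150, expected=5.7150, tolerance=0.05)
print(f"triage: passed={triage.passed} delta={triage.delta:+.4f}")
print(f"estimation: passed={estimation.passed} delta={estimation.delta:+.4f}")
```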

Quality

Coverage: 61 % (no significant change from v2.5.0).
CI: green on both main and develop. The integration-tests-ollama
job introduced in v2.5.0 continues to run on push to those
branches only.
Baseline reproducibility: F1 = 0.5867 ± 0.01 on triage_jira
preserved. MAE = 5.7150 ± 0.05 on estimation_tawos preserved.
src/puma/cli.py LOC unchanged from v2.5.0; the only substantial
addition is the new file src/puma/preflight/apple_silicon.py
(141 LOC).

Design decisions

Linux path byte-identical to v2.5.0. Every Apple Silicon
code path returns None / no-ops when caps.chip_brand is None
or is_apple_silicon() is False, so no CI invocation on Linux
changes behaviour. Governed by P5 (additive over modification)
and P3 (reproducibility non-negotiable).
apple-silicon-* profiles ship with empirical_validation: pending. Cataloguing without validation is a deliberate
signalling choice — the dispatch infrastructure exists, the
numbers do not. Frames the future Mac-hardware Sprint as
empirical close-out rather than groundwork. Aligns with the
F8/D18 lesson: nominal specifications do not predict runtime
compatibility on constrained hardware.
gemma4 stays excluded across both gpu-entry AND
apple-silicon-*. P6 generalises from "do not re-introduce
previously-rejected (model, profile) pairs" to "do not
re-introduce the failure mode in a different profile family".
get_tracking_mode_and_warnings() is additive. The Linux
branch returns ("machine", []), byte-identical to the
v2.5.0-hardcoded value. The macOS branch is the new behaviour;
it only runs when is_apple_silicon() returns True, which never
happens on a Linux test runner.
Cross-arch reproducibility documented as testable, not as
fact. v2.6.0 ships a hypothesis (H0/H1/H2/H3) and a protocol,
not a claim of bit-exactness. Honest about the empirical gap.
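
The "returns None and falls through" behaviour described in these decisions can be sketched as follows. This is a simplified stand-in, not PUMA's select_profile(): the SystemCapabilities class below carries only the two fields these notes mention, the function name select_apple_silicon_profile is hypothetical, and the two mapping entries with their memory thresholds are illustrative assumptions.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass(frozen=True)
class SystemCapabilities:
    """Stand-in with only the Apple Silicon fields named in the notes."""

    chip_brand: Optional[str] = None
    unified_memory_gb: Optional[float] = None


# Illustrative entries only: chip brand -> (profile id, min_unified_memory_gb).
CHIP_BRAND_TO_PROFILE: Dict[str, Tuple[str, float]] = {
    "Apple M4": ("apple-silicon-m4", 8.0),
    "Apple M4 Pro": ("apple-silicon-m4-pro", 24.0),
}


def select_apple_silicon_profile(caps: SystemCapabilities) -> Optional[str]:
    """Return a profile id, or None to fall through to GPU/CPU dispatch."""
    if caps.chip_brand is None or caps.unified_memory_gb is None:
        return None  # non-Apple host: existing dispatch is untouched
    entry = CHIP_BRAND_TO_PROFILE.get(caps.chip_brand)
    if entry is None:
        return None  # unmapped or non-Apple chip brand
    profile, min_gb = entry
    if caps.unified_memory_gb < min_gb:
        return None  # insufficient unified memory
    return profile
```

A caller would try this branch first and continue with the existing GPU/CPU dispatch whenever it returns None, which is why a Linux host sees no behavioural change.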

Debt tracking

No new open debt introduced by this release.
No closure of pre-existing debt — Sprint 9 is forward-looking
infrastructure, not a debt-paydown Sprint.
Empirical validation of apple-silicon-* profiles is the
explicit follow-up; it is not tracked as "debt" because the
catalogue declares its own status (empirical_validation: pending).

Known limitations

Unchanged from v2.5.0:

Single hardware tier empirically evaluated (gpu-entry); models
requiring gpu-mid/gpu-high are catalogued but not yet
validated.
AMD ROCm not yet detected.
TAWOS SHA-256 end-to-end fetch test pending (Gate D criterion 3).
input_text not persisted in triage_jira instances (D22, Low).

New in v2.6.0:

All apple-silicon-* profiles declare
empirical_validation: pending. Users on Mac hardware should
validate cross-arch reproducibility on their specific chip
before treating F1/MAE results as comparable to the Linux
baselines — see docs/CROSS_ARCH_REPRODUCIBILITY.md § "Testing
protocol".
macOS Mode B (native Ollama) is opt-in via
./start_puma.sh --native; the default ./start_puma.sh Docker
path is unchanged.

Upgrade notes

No breaking changes. Existing CI invocations of
puma validate-baseline (no flags) continue to work unchanged.
Linux + NVIDIA dispatch is byte-identical to v2.5.0.
macOS users can opt into native mode (Metal acceleration,
no Docker) via ./start_puma.sh --native. Requires
brew install ollama first. See docs/MACOS_NOTES.md for the
Mode A / Mode B operational comparison.
For energy tracking on macOS native mode, consider
configuring passwordless powermetrics per the section in
docs/MACOS_NOTES.md. Without it, PUMA falls back to
process-mode and emits a single warning — energy data will be
recorded but less precise.
New docs to know about: CROSS_ARCH_REPRODUCIBILITY.md,
extended MACOS_NOTES.md (powermetrics section),
CATALOG_HISTORY.md (catalog_version 2.6.0 entry).

Future work pointer

Sprint 10 (planned, awaiting explicit user confirmation per P8):
catalog expansion to add forthcoming Qwen 3.6 and Kimi K2.6 model
families to the gpu-high profile with
empirical_validation: pending — same restraint as Apple Silicon,
catalogued without claiming compatibility on PUMA's current
validation hardware. Empirical close-out of cross-arch
reproducibility (H0–H3) is gated on Mac hardware availability and
is not tied to Sprint 10.

Acknowledgments

Development assistance provided by generative AI tooling. All
commits are attributed to the project's git identity per
repository convention.
