PUMA v2.7.0 — Catalog expansion: Qwen3 (pending validation)

@pumacp pumacp released this 16 May 04:54
· 232 commits to main since this release

PUMA v2.7.0 Release Notes

Release date: 2026-05-16
Previous release: v2.6.0 (2026-05-16)
Branch: develop → main (post-tag)

Summary

This release consolidates Sprint 10 (catalog expansion) onto the
v2.6.0 base. It adds two Alibaba Qwen3 family entries to the
catalog — both verified against the Ollama registry before
inclusion — and formally excludes Kimi K2.6 after a 13-tag
registry probe confirmed it is not distributed via Ollama. The
catalog schema is preserved at the 8 fields used since v2.0.0;
all v2.7.0 metadata lives within the existing notes field per
the project's minimum-complexity discipline.

Highlights

Two Qwen3 entries — registry-verified before cataloguing

| Tag | Type | GGUF (verified) | Profile | Notes |
|---|---|---|---|---|
| qwen3:30b | Dense | 17.3 GB | gpu-high | Hybrid Gated DeltaNet + self-attention, 262144 context |
| qwen3:30b-a3b | MoE | 17.3 GB | gpu-high | 30B total / ~3B active per token; F8/D18 MoE caveat preserved in notes |

Both entries:

  • Verified via Ollama registry manifest probe
    (registry.ollama.ai/v2/library/qwen3/manifests/* returned HTTP
    200 with the GGUF size derived from the sum of layer sizes).
  • Declare logprobs_supported: false conservatively until
    empirical verification on appropriate hardware.
  • Excluded from gpu-entry AND every apple-silicon-* profile by
    the P11 pending-validation invariant. Five new regression-guard
    tests in tests/unit/test_catalog_metadata.py pin this contract
    via exact-equality on profiles_compatible == ['gpu-high'].
  • params_b: 30.0 follows the gemma4:26b-a4b precedent (TOTAL
    when the tag encodes both numbers). The MoE/F8 caveat in the
    notes field documents that active-params count does NOT
    predict GGUF size or runtime VRAM consumption.
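
The size derivation described in the first bullet can be sketched as follows. This is a minimal illustration, not PUMA's actual probe code. It assumes the manifest is a JSON document with a layers array whose items carry a size in bytes, and that GB means 10^9 bytes. The function names manifest_size_gb and fetch_manifest are hypothetical.

```python
import json
import urllib.request

REGISTRY = "https://registry.ollama.ai/v2/library"


def manifest_size_gb(manifest: dict) -> float:
    """Sum the byte sizes of all layers and convert to decimal GB."""
    total_bytes = sum(layer["size"] for layer in manifest.get("layers", []))
    return round(total_bytes / 1e9, 1)


def fetch_manifest(repo: str, tag: str, timeout: float = 10.0) -> dict:
    """Fetch a manifest; urllib raises HTTPError for a 404 response."""
    url = f"{REGISTRY}/{repo}/manifests/{tag}"
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


# Offline example with made-up layer sizes (not real registry data).
sample = {"layers": [{"size": 17_000_000_000}, {"size": 300_000_000}]}
print(manifest_size_gb(sample))
```

Keeping the arithmetic in a pure function separate from the network call lets the size calculation be tested without registry access.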

Kimi K2.6 — formally excluded after 13-tag registry probe

A registry probe on 2026-05-16 returned HTTP 404 on every
plausible Ollama tag naming for Kimi K2.6:

kimi-k2:6              kimi-k2:latest        kimi-k2:1t
kimi-k2:1t-instruct    kimi-k2:0905          kimi-k2:base
kimi-k2:instruct       kimi:latest           kimi-k2.6:latest
moonshot:latest        moonshot:kimi-k2      kimi-k2-base:latest
kimi-k2-instruct:latest

The model is not distributed via the Ollama registry as of the
v2.7.0 cut. Cataloguing a non-existent ollama_tag would violate
the project's empirical-first principle (P10) and produce a
broken puma models pull command for users following the catalog
metadata. The exclusion decision is recorded in
docs/CATALOG_HISTORY.md v2.7.0 § "Considered but not catalogued"
with the full probe table for academic traceability. It may be
reconsidered in a future release if Moonshot AI or a third-party
distributor publishes K2.6 to the Ollama registry.
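
A multi-tag probe of this kind could look roughly like the sketch below. It is an assumption-laden illustration rather than the script used for the release: the helper names (http_status, probe_tags, is_distributed) are hypothetical, and the status lookup is injectable so the logic can be exercised without network access.

```python
import urllib.error
import urllib.request
from typing import Callable, Dict, Iterable

BASE = "https://registry.ollama.ai/v2/library"


def http_status(url: str, timeout: float = 10.0) -> int:
    """Return the HTTP status code for a GET request to url."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code


def probe_tags(
    candidates: Iterable[str],
    status_of: Callable[[str], int] = http_status,
) -> Dict[str, int]:
    """Map each 'repo:tag' candidate to the status of its manifest URL."""
    results = {}
    for candidate in candidates:
        repo, tag = candidate.split(":", 1)
        results[candidate] = status_of(f"{BASE}/{repo}/manifests/{tag}")
    return results


def is_distributed(results: Dict[str, int]) -> bool:
    """Treat the model as distributed if any candidate tag resolved."""
    return any(code == 200 for code in results.values())


# Offline usage with a stub that answers 404 for every URL.
stub_results = probe_tags(["kimi-k2:latest", "moonshot:kimi-k2"], lambda url: 404)
print(is_distributed(stub_results))
```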

Deferred — known on Ollama but out of v2.7.0 scope

The registry probe confirmed these tags exist (HTTP 200) but they
are deferred from v2.7.0 for scope discipline:

| Tag | Real GGUF | Reason |
|---|---|---|
| qwen3:32b (dense) | 18.8 GB | Marginal upgrade over qwen3:30b; defer until empirical validation on gpu-high can distinguish them |
| qwen3:235b-a22b (MoE) | 132.4 GB | Requires multi-GPU rigs well beyond gpu-high (24+ GB VRAM); pending hardware tier extension |
| qwen3-coder:30b, qwen3-coder:480b | | Coder family is task-specific; out of scope for PMO benchmarks |

Schema unchanged — minimum-complexity preserved

The original Sprint 10 plan proposed ~12 new YAML fields (family,
parameters_total_b, parameters_active_b, profile_recommended,
size_gb_disk_estimate, size_gb_vram_estimate, quantization,
license, release_date, capabilities, empirical_validation,
validation_blockers). The project's minimum-complexity decision
kept the catalog at the v2.0.0–v2.6.0 schema (8 fields:
ollama_tag, params_b, gguf_size_gb, context_window,
logprobs_supported, profiles_compatible, timeout_s,
notes). All v2.7.0 metadata (license, release date, MoE caveat,
validation blockers, architecture details) lives within
multi-line notes: text. src/puma/preflight/catalog.py and the
ModelEntry dataclass are byte-identical to v2.6.0.
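
For readers who have not opened the catalog, the 8-field shape can be pictured with the stand-in below. The field names come from the list above. The Python types, the timeout_s value, and the wording of the notes text are assumptions made for illustration, and this class is not a copy of the ModelEntry in src/puma/preflight/catalog.py.

```python
from dataclasses import dataclass, fields
from typing import List


@dataclass(frozen=True)
class ModelEntry:
    # Field names follow the release notes; the types are assumed.
    ollama_tag: str
    params_b: float
    gguf_size_gb: float
    context_window: int
    logprobs_supported: bool
    profiles_compatible: List[str]
    timeout_s: int
    notes: str


entry = ModelEntry(
    ollama_tag="qwen3:30b",
    params_b=30.0,
    gguf_size_gb=17.3,
    context_window=262144,
    logprobs_supported=False,
    profiles_compatible=["gpu-high"],
    timeout_s=600,  # placeholder value, not taken from the catalog
    notes=(
        "Validation: pending on gpu-high hardware.\n"
        "License, release date and architecture details live here as free text."
    ),
)

print(len(fields(ModelEntry)))
```

Anything that the rejected fields would have carried ends up as free text in notes, which keeps existing YAML readers working at the cost of that metadata not being machine-queryable.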

Invariants generalised, not relaxed

The pending-validation exclusion from gpu-entry (established in
Sprint 9 for Apple Silicon entries) is reaffirmed for the new
Qwen3 entries and extended to every apple-silicon-* profile via
explicit tests:

| Test | Status | Invariant |
|---|---|---|
| test_gemma4_family_excluded_from_gpu_entry | PASSED (preserved) | D18/F8 (Sprint 2) |
| test_gemma4_family_not_compatible_with_any_apple_silicon | PASSED (preserved) | P6 extension to Apple Silicon (Sprint 9) |
| test_qwen3_entries_excluded_from_gpu_entry | PASSED (new) | P10/P11 (Sprint 10) |
| test_qwen3_entries_excluded_from_all_apple_silicon | PASSED (new) | P11 generalisation across profile families |
| test_qwen3_entries_target_gpu_high_only | PASSED (new) | Exact-equality anchor against accidental loosening |

The pattern is now: new entries default to the safest profile
only; loosening requires empirical evidence and an explicit
debt-tracker entry referencing the prior exclusion.
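
The three Qwen3 guards might be written along these lines. This is a sketch only: the real tests in tests/unit/test_catalog_metadata.py load the actual catalog, whereas CATALOG here is a hypothetical in-memory stand-in that maps each tag to its profiles_compatible list.

```python
# Hypothetical stand-in for the loaded catalog: tag -> profiles_compatible.
CATALOG = {
    "qwen3:30b": ["gpu-high"],
    "qwen3:30b-a3b": ["gpu-high"],
}
QWEN3_TAGS = ("qwen3:30b", "qwen3:30b-a3b")


def test_qwen3_entries_excluded_from_gpu_entry():
    for tag in QWEN3_TAGS:
        assert "gpu-entry" not in CATALOG[tag]


def test_qwen3_entries_excluded_from_all_apple_silicon():
    for tag in QWEN3_TAGS:
        assert not any(p.startswith("apple-silicon-") for p in CATALOG[tag])


def test_qwen3_entries_target_gpu_high_only():
    # Exact equality fails on ANY added profile, not only on known-bad ones.
    for tag in QWEN3_TAGS:
        assert CATALOG[tag] == ["gpu-high"]
```

The first two tests name the profiles that must stay excluded; the third is stricter and would also catch a profile family that does not exist yet.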

Tests

  • 402 → 407 passing (-m "not ollama"), 7 deselected.
  • 5 new regression-guard tests in
    tests/unit/test_catalog_metadata.py (see invariant table
    above).
  • tests/unit/test_preflight_catalog.py::test_load_catalog_returns_all_entries:
    entry-count expectation updated 15 → 17 to reflect the two new
    Qwen3 additions.
  • pre-commit run --all-files: all hooks green.
  • puma validate-baseline (triage, F1 path, fresh Ollama):
    PASS f1=0.5831, delta=-0.0036, ±0.01.
  • puma validate-baseline --expected-mae 5.7150 (estimation,
    fresh Ollama): PASS mae=5.7150, delta=+0.0000, ±0.05
    bit-exact.
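
The pass/fail arithmetic behind the two baseline lines can be reproduced with a small check. This is a sketch that assumes a symmetric absolute tolerance around the expected value; check_baseline is a hypothetical helper, not the puma validate-baseline implementation. The expected F1 of 0.5867 is the baseline quoted under Quality.

```python
from typing import Tuple


def check_baseline(observed: float, expected: float, tolerance: float) -> Tuple[bool, float]:
    """Return (passed, delta) where delta = observed - expected."""
    delta = round(observed - expected, 4)
    return abs(delta) <= tolerance, delta


print(check_baseline(0.5831, 0.5867, 0.01))  # triage F1: (True, -0.0036)
print(check_baseline(5.7150, 5.7150, 0.05))  # estimation MAE: (True, 0.0)
```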

Quality

  • Coverage: 61 % (no significant change from v2.6.0; new entries
    are YAML-only, no Python statements added).
  • CI: green on both main and develop. The
    integration-tests-ollama job introduced in v2.5.0 continues to
    run on push to those branches.
  • Baseline reproducibility: F1 = 0.5867 ± 0.01 on triage_jira
    preserved; MAE = 5.7150 ± 0.05 on estimation_tawos preserved.
  • Linux + NVIDIA dispatch byte-identical to v2.6.0 (no new
    profiles, no new dispatch logic, no new code paths). The Qwen3
    entries appear in models_for_profile('gpu-high') only.

Design decisions

  • Schema unchanged at 8 fields. The original Sprint 10 plan
    proposed 12 new fields; the schema was kept minimal instead. All
    metadata that would have required new fields now lives in the
    multi-line notes text. Governed by P5 (additive over
    modification) and the project's minimum-complexity discipline.
  • Real Ollama tags only. Every catalogued ollama_tag is
    verified against registry.ollama.ai/v2/library/<repo>/manifests/<tag>
    before inclusion. The originally-planned qwen3:27b and
    qwen3:35b-a3b were remapped to the real qwen3:30b and
    qwen3:30b-a3b after probe; Kimi K2.6 was removed entirely
    after every plausible tag returned 404.
  • Conservative params_b for MoE. The qwen3:30b-a3b entry
    declares params_b: 30.0 following the gemma4:26b-a4b
    precedent — TOTAL params when the tag encodes both numbers. The
    F8/D18 caveat in notes documents that active-params count
    does NOT predict VRAM consumption.
  • logprobs_supported: false conservatively. The Qwen3 family
    announces logprob support upstream, but PUMA has not yet
    empirically verified token-level confidence on these specific
    tags. Flipping to true is part of the empirical validation
    protocol when hardware becomes available.
  • gpu-high as the only target. A 17.3 GB GGUF falls inside
    gpu-mid's 12–24 GB VRAM range on paper, but does not fit reliably
    once OS and context overhead are accounted for; gpu-high
    (24+ GB VRAM) is the only safe default. The exact-equality
    anchor test test_qwen3_entries_target_gpu_high_only pins this
    so that future loosening requires deliberate intent.

Debt tracking

  • No new open debt introduced by this release.
  • No closure of pre-existing debt — Sprint 10 is
    forward-looking catalog expansion, not a debt-paydown Sprint.
  • Empirical validation of qwen3:30b and qwen3:30b-a3b is the
    explicit follow-up; tracked via the notes text on each entry
    and via the validation roadmap in
    docs/CATALOG_HISTORY.md § "Empirical validation roadmap".

Known limitations

Unchanged from v2.6.0:

  • Single hardware tier empirically evaluated (gpu-entry);
    models requiring gpu-mid/gpu-high are catalogued but not
    yet validated.
  • AMD ROCm not yet detected.
  • All apple-silicon-* profiles declare
    empirical_validation: pending (Sprint 9 forward-work).
  • TAWOS SHA-256 end-to-end fetch test pending (Gate D criterion
    3).

New in v2.7.0:

  • Both qwen3:30b and qwen3:30b-a3b are catalogued with
    validation pending. Users on gpu-high hardware can use them
    via puma run and report empirical results to close the
    validation gap.
  • Kimi K2.6 is not distributed via Ollama; the catalog
    intentionally omits it. If a third-party distributor publishes
    K2.6 to Ollama, a future Sprint can revisit cataloguing.

Empirical validation roadmap (when gpu-high hardware available)

The protocol for closing the validation gap is documented in
docs/CATALOG_HISTORY.md v2.7.0 § "Empirical validation roadmap":

  1. Pull the model via ollama pull qwen3:30b and verify the
    digest matches the registry manifest probed at cataloguing.
  2. Run the canonical baselines: triage_jira (F1) and
    estimation_tawos (MAE).
  3. Measure parse_failure_rate (should be 0 for usable models)
    and reproducibility (bit-exact under T=0.0 + seed=42 on the
    gpu-high hardware).
  4. If validation succeeds: bump logprobs_supported to true
    (after a logprobs-enabled probe), extend profiles_compatible
    to vetted Apple Silicon Max/Ultra variants (≥36 GB unified
    memory) pending separate Apple-side validation, and document
    the validating run_id in CATALOG_HISTORY.
  5. If validation fails: document the failure mode (analogous to
    gemma4 D18) in docs/known_debt.md and keep the entry at its
    current [gpu-high] restriction.
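
Step 3 of the protocol can be made concrete with two small helpers. These are illustrative assumptions, not PUMA code: parse_failure_rate treats a None result from a caller-supplied parser as a failure, and run_fingerprint hashes the ordered raw outputs so that two runs under T=0.0 and seed=42 can be compared for bit-exactness.

```python
import hashlib
from typing import Callable, Optional, Sequence


def parse_failure_rate(
    raw_outputs: Sequence[str],
    parse: Callable[[str], Optional[object]],
) -> float:
    """Fraction of outputs for which parse returns None."""
    if not raw_outputs:
        return 0.0
    failures = sum(1 for text in raw_outputs if parse(text) is None)
    return failures / len(raw_outputs)


def run_fingerprint(raw_outputs: Sequence[str]) -> str:
    """SHA-256 over the ordered outputs; equal fingerprints mean identical runs."""
    digest = hashlib.sha256()
    for text in raw_outputs:
        digest.update(text.encode("utf-8"))
        digest.update(b"\x00")  # separator so ["ab", "c"] differs from ["a", "bc"]
    return digest.hexdigest()


# Toy parser: accept only two known labels (hypothetical label set).
def toy_parse(text: str) -> Optional[str]:
    return text if text in {"bug", "task"} else None


run_a = ["bug", "task", "bug"]
run_b = ["bug", "task", "bug"]
print(parse_failure_rate(run_a, toy_parse))
print(run_fingerprint(run_a) == run_fingerprint(run_b))
```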

Upgrade notes

  • No breaking changes. Existing CI invocations of
    puma validate-baseline continue to work unchanged. Linux +
    NVIDIA dispatch is byte-identical to v2.6.0.
  • Catalog grows by 2 entries, total 17. puma models list
    shows the new entries; models_for_profile('gpu-high') now
    returns 6 entries (4 from earlier releases + 2 from v2.7.0).
    models_for_profile('gpu-entry') is unchanged.
  • Schema is byte-identical to v2.6.0 — no migration needed
    for tooling that reads the catalog YAML directly.
  • New docs to know about: docs/CATALOG_HISTORY.md v2.7.0
    section (full Kimi probe table and deferred-variants list).
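
To see why gpu-entry results are unaffected, consider a profile filter of the following shape. The function name models_for_profile appears in these notes, but its signature and the sample data here are assumptions; example-small:7b is a hypothetical entry used only to populate the gpu-entry result.

```python
from typing import Dict, List

# Hypothetical sample: tag -> profiles_compatible.
SAMPLE_CATALOG: Dict[str, List[str]] = {
    "example-small:7b": ["gpu-entry", "gpu-mid", "gpu-high"],  # hypothetical entry
    "qwen3:30b": ["gpu-high"],
    "qwen3:30b-a3b": ["gpu-high"],
}


def models_for_profile(
    profile: str, catalog: Dict[str, List[str]] = SAMPLE_CATALOG
) -> List[str]:
    """Return the tags whose profiles_compatible list contains profile."""
    return [tag for tag, profiles in catalog.items() if profile in profiles]


print(models_for_profile("gpu-entry"))
print(models_for_profile("gpu-high"))
```

Because the new entries list only gpu-high, adding them changes the gpu-high result and nothing else.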

Future work pointer

Sprint 10 closes the originally-planned multi-Sprint sequence
(Sprint 8 hardening → Sprint 9 Apple Silicon → Sprint 10 catalog
expansion). The project's technical infrastructure is complete;
the open follow-ups are empirical validations gated on hardware
availability:

  • MacBook Pro M4/M5: validates the Apple Silicon detection,
    native runtime mode, and cross-arch reproducibility hypotheses
    (H0/H1/H2/H3 per docs/CROSS_ARCH_REPRODUCIBILITY.md).
  • gpu-high NVIDIA hardware (24+ GB VRAM): validates
    qwen3:30b and qwen3:30b-a3b empirically; closes the
    validation roadmap described in these release notes.
  • Future catalog Sprints may add the deferred Qwen3 variants
    (qwen3:32b, qwen3:235b-a22b, qwen3-coder:*) and
    reconsider Kimi K2.6 if it becomes available on Ollama.

Acknowledgments

Development assistance provided by generative AI tooling. All
commits are attributed to the project's git identity per
repository convention.