feat(falsify-ship-002): MODEL-1 apr run emits valid Python (zero syntax errors) PARTIAL discharge by noahgift · Pull Request #1016 · paiml/aprender

noahgift · 2026-04-22T19:41:52Z

Summary

6th compute-free MODEL-1 PARTIAL lever. Wires AC-SHIP1-002
(`apr run <model>.safetensors --prompt "def fib(n):"` emits
syntactically valid Python) to a pure
`verdict_from_syntax_error_count(usize) -> Ship002Verdict` const fn.
Targets main directly (upstream stack #1012/#1014 merged).

- Contract `contracts/qwen2-e2e-verification-v1.yaml` v1.2.0 → v1.3.0: adds `FALSIFY-QW2E-SHIP-002` (ship_blocking, PARTIAL_ALGORITHM_LEVEL) binding `AC_SHIP1_002_MAX_TOLERATED_SYNTAX_ERRORS = 0` to `verdict_from_syntax_error_count(usize) -> Ship002Verdict` in `crates/aprender-core/src/qa/ship_002.rs`.
- 6-section mutation survey: zero-errors Pass / exactly-one-error Fail at boundary / many-errors Fail band {2, 10, 100} / monotonicity 0..=256 / usize::MAX sanity Fail / provenance pin on tolerance = 0.
- Spec v2.27.0 → v2.28.0: AC-SHIP1-002 + FALSIFY-SHIP-002 rows annotated `(PARTIAL_ALGORITHM_LEVEL v2.28.0)`; amendment history entry notes MODEL-1 coverage 5/10 → 6/10 touched and 12 PARTIAL + 3 DISCHARGED across both models.

Mirrors the MODEL-2 SHIP-017 `verdict_from_syntax_error_count` shape
with a tighter zero-tolerance rule (MODEL-2 tolerates ≤ 1 out of 100;
the MODEL-1 spec wording "emits valid Python" has no slack). Authored
self-contained because SHIP-017 PR #1004 is not yet on main.
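
As a rough illustration of the rule described above, the zero-tolerance verdict might look like the following minimal sketch. The constant, enum, and function names come from this description; the derives, the parameter name `errors`, and the exact body are assumptions, not the contents of `ship_002.rs`.

```rust
/// Tolerance from the PR description: "emits valid Python" leaves no slack.
pub const AC_SHIP1_002_MAX_TOLERATED_SYNTAX_ERRORS: usize = 0;

/// Two-valued verdict; the derives are assumed for easy comparison in tests.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Ship002Verdict {
    Pass,
    Fail,
}

/// Pass iff the reported syntax-error count does not exceed the tolerance.
/// With a tolerance of 0 this reduces to `errors == 0`.
#[must_use]
pub const fn verdict_from_syntax_error_count(errors: usize) -> Ship002Verdict {
    if errors <= AC_SHIP1_002_MAX_TOLERATED_SYNTAX_ERRORS {
        Ship002Verdict::Pass
    } else {
        Ship002Verdict::Fail
    }
}
```

Because the function is pure and `const`, the boundary cases (0, 1, `usize::MAX`) can be checked without running any model, which is what makes this a compute-free lever.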

Test plan

- `cargo test -p aprender-core --lib falsify_ship_002_python_syntax_error_threshold_logic` → 1 passed
- `cargo run --quiet -p aprender-contracts-cli --bin pv -- validate contracts/qwen2-e2e-verification-v1.yaml` → Contract is valid
- Full discharge: live `apr run paiml/qwen2.5-coder-7b-apache-q4k-v1 --prompt "def fib(n):"` on RTX 4090 with `--features cuda`; `rustpython` or `ruff check --select=E` → 0 syntax errors

🤖 Generated with Claude Code

Discharge FALSIFY-SHIP-008 / AC-SHIP1-008 at PARTIAL_ALGORITHM_LEVEL.

- contracts/chat-template-v1.yaml v1.0.0 -> v1.1.0: adds GATE-CHAT-SHIP-008 binding ChatMLTemplate::format_conversation to the canonical Qwen2.5-Coder-7B (system, user) golden via a pure verdict_from_chat_template_render const fn. ship_blocking: true, discharge_status: PARTIAL_ALGORITHM_LEVEL; full discharge blocks on live `apr run paiml/qwen2.5-coder-7b-apache-q4k-v1` completion diff against golden.
- crates/aprender-core/src/text/chat_template/ship_008.rs (new): AC_SHIP1_008_CANONICAL_{SYSTEM,USER,GOLDEN} constants + Ship008Verdict enum + verdict_from_chat_template_render const fn (byte-equality, UTF-8-safe) + 5-section mutation survey (engine-binding, empty Fail, missing-gen-prompt Fail, wrong-delim Fail, swapped-roles Fail, single-byte flip Fail) + symmetry + provenance pin.
- crates/aprender-core/src/text/chat_template/mod.rs: include! ship_008.rs alongside existing template.rs, raw_template.rs.
- docs/specifications/aprender-train/ship-two-models-spec.md v2.23.0 -> v2.24.0: AC-SHIP1-008 row + FALSIFY-SHIP-008 row annotated PARTIAL_ALGORITHM_LEVEL; v2.24.0 amendment entry records MODEL-1 coverage 1/10 -> 2/10 (first MODEL-1 non-provenance PARTIAL; mirrors SHIP-016/017/018/020 pattern).

Test: cargo test -p aprender-core --lib falsify_ship_008_chat_template_render_bind -> 1 passed
Contract: pv validate contracts/chat-template-v1.yaml -> Contract is valid
Refs: SHIP-TWO-001, task #155
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
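
The byte-equality verdict named in this commit could be sketched roughly as below. Only the enum and function names are taken from the commit message; the two-argument signature `(rendered, golden)`, the derives, and the empty-input rule are assumptions, and the real `AC_SHIP1_008_CANONICAL_*` constants are deliberately not reproduced here.

```rust
/// Two-valued verdict; derives are assumed.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Ship008Verdict {
    Pass,
    Fail,
}

/// Pass iff `rendered` is non-empty and byte-for-byte identical to `golden`.
/// Comparing the raw UTF-8 bytes avoids any normalization or char-boundary
/// concerns. The (rendered, golden) signature is a guess at the real one.
#[must_use]
pub const fn verdict_from_chat_template_render(rendered: &str, golden: &str) -> Ship008Verdict {
    let a = rendered.as_bytes();
    let b = golden.as_bytes();
    // An empty render fails outright, as does any length mismatch.
    if a.is_empty() || a.len() != b.len() {
        return Ship008Verdict::Fail;
    }
    // `while` loop rather than iterators so the fn stays usable in const contexts.
    let mut i = 0;
    while i < a.len() {
        if a[i] != b[i] {
            return Ship008Verdict::Fail;
        }
        i += 1;
    }
    Ship008Verdict::Pass
}
```

Under this sketch a render that drops the trailing generation prompt, swaps roles, or flips a single byte fails, because any of those changes either the length or at least one byte.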

…arge

Wires AC-SHIP1-006 "apr qa <model> — all 8 gates PASS" at PARTIAL_ALGORITHM_LEVEL: a pure aggregate-AND verdict fn bound to the 8-gate ship criterion from `docs/specifications/components/qa.md` §3 (golden / throughput / ollama parity / gpu speedup / tensor contracts / format parity / ptx parity / metadata).

Files:
- `crates/aprender-core/src/qa/ship_006.rs` (NEW, 217 lines): `verdict_from_qa_gates(&[bool]) -> Ship006Verdict` const fn with 7-section mutation survey: all-Pass→Pass, all-Fail→Fail, single-gate-flip × 8, exhaustive 2^8=256 bitmask proof, Pass→Fail monotonicity, length-drift counter-examples (0 / 7 / 9 / 16), provenance pin (AC_SHIP1_006_REQUIRED_QA_GATE_COUNT = 8).
- `crates/aprender-core/src/qa/mod.rs`: register `pub mod ship_006;`.
- `contracts/apr-model-qa-v1.yaml` v1.1.0 → v1.2.0: adds `FALSIFY-QA-SHIP-006` with `ship_blocking: true`, `discharge_status: PARTIAL_ALGORITHM_LEVEL`, `evidence_discharged_by` pointing at ship_006.rs + the harness test, and `full_discharge_blocks_on` live `apr qa paiml/qwen2.5-coder-7b-apache-q4k-v1 --json` on an RTX 4090 host (8× `"pass": true` entries in the JSON body).
- `docs/specifications/aprender-train/ship-two-models-spec.md` v2.24.0 → v2.25.0: annotates AC-SHIP1-006 + FALSIFY-SHIP-006 rows with PARTIAL_ALGORITHM_LEVEL markers and adds v2.25.0 amendment entry.

Design: mirrors the aggregate-AND shape set by MODEL-2 SHIP-016 (task #152 on `feat/falsify-ship-016-partial-discharge`, not yet on main). Authored self-contained because SHIP-016 hasn't landed; once both ship, the two `verdict_from_qa_gates_*` fns should be deduplicated into a single parameterized helper. Required gate count differs by model (both 8 today; the spec's "All must Pass" is model-independent).

MODEL-1 AC-SHIP1 coverage: 2/10 touched (SHIP-008 + SHIP-009) → **3/10** touched (+ SHIP-006). First MODEL-1 aggregate-AND PARTIAL.

Full discharge blocks on a live `apr qa` run against the teacher weights on RTX 4090; the compute-heavy portion is intentionally out of scope here.

Test: `cargo test -p aprender-core --lib falsify_ship_006_apr_qa_eight_gates_aggregate` → 1 passed.
Contract: `cargo run --quiet -p aprender-contracts-cli --bin pv -- validate contracts/apr-model-qa-v1.yaml` → 0 errors.

Stacked on #1012 (feat/falsify-ship-008-partial-discharge). Spec v2.25.0 builds on v2.24.0.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
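
The aggregate-AND rule described in this commit might be sketched as follows. The function signature, enum name, and gate-count constant are taken from the commit message; the derives and the body are assumptions rather than the actual `ship_006.rs`.

```rust
/// Gate count pinned by the commit message.
pub const AC_SHIP1_006_REQUIRED_QA_GATE_COUNT: usize = 8;

/// Two-valued verdict; derives are assumed.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Ship006Verdict {
    Pass,
    Fail,
}

/// Pass iff exactly the required number of gates is supplied and every one
/// of them passed. Any other slice length fails, which covers the
/// length-drift counter-examples (0 / 7 / 9 / 16).
#[must_use]
pub const fn verdict_from_qa_gates(gates: &[bool]) -> Ship006Verdict {
    if gates.len() != AC_SHIP1_006_REQUIRED_QA_GATE_COUNT {
        return Ship006Verdict::Fail;
    }
    // Manual loop keeps the function `const`.
    let mut i = 0;
    while i < gates.len() {
        if !gates[i] {
            return Ship006Verdict::Fail;
        }
        i += 1;
    }
    Ship006Verdict::Pass
}
```

With only 2^8 = 256 possible inputs of the right length, a test can enumerate every bitmask and confirm that the all-ones mask is the single passing case.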

…scharge

Wires AC-SHIP1-007 "apr bench decode throughput ≥30 tok/s on RTX 4090 (7B Q4_K target)" at PARTIAL_ALGORITHM_LEVEL: a pure f32 threshold verdict fn bound to the MODEL-1 teacher ship floor. The decision rule is proven today; the compute-heavy half (live `apr bench` on RTX 4090) is deferred to hardware evidence collection.

Files:
- `crates/aprender-core/src/bench/ship_007.rs` (NEW, 158 lines): `AC_SHIP1_007_MIN_DECODE_TPS_RTX4090_7B = 30.0`, `Ship007Verdict { Pass, Fail }`, `verdict_from_decode_tps(f32) -> Ship007Verdict`, `falsify_ship_007_decode_tps_threshold_logic` 7-section survey:
  1. boundary (30.0 exactly → Pass; the contract is ≥, not >)
  2. one-ULP-below → Fail (sharpest off-by-one counter-example)
  3. clear Pass band (45 / 100 tok/s)
  4. clear Fail band (0 / 10 / 29.999999)
  5. monotonicity above floor + below floor
  6. non-finite → Fail conservatively (NaN, +∞, -∞)
  7. provenance pin binding the 30.0 constant to spec §4.2.
- `crates/aprender-core/src/bench/mod.rs`: register `pub mod ship_007;`.
- `contracts/qwen2-e2e-verification-v1.yaml` v1.0.0 → v1.1.0: adds `FALSIFY-QW2E-SHIP-007` with `ship_blocking: true`, `discharge_status: PARTIAL_ALGORITHM_LEVEL`, `evidence_discharged_by` pointing at ship_007.rs + the harness test, and `full_discharge_blocks_on` live `apr bench --iterations 5 --max-tokens 128 paiml/qwen2.5-coder-7b-apache-q4k-v1` on RTX 4090 with --features cuda; median of 5 iterations must be ≥ 30.0. Also 4 `counter_example_classes` (regressed_kernel / drifted_constant / relaxed_rule / nan_promoted).
- `docs/specifications/aprender-train/ship-two-models-spec.md` v2.25.0 → v2.26.0: annotates AC-SHIP1-007 + FALSIFY-SHIP-007 rows with PARTIAL_ALGORITHM_LEVEL markers and adds v2.26.0 amendment entry.

Design: mirrors the MODEL-2 SHIP-020 single-f32-threshold shape (task #150 on `feat/falsify-ship-020-partial-discharge`, PR #1005 not yet on main). Authored self-contained because SHIP-020 hasn't landed; once both ship, the two `verdict_from_decode_tps_*` fns should be deduplicated into a single parameterized helper `verdict_from_decode_tps(measured, floor) -> ThresholdVerdict` with the model-specific floor pinned as a module-level const. MODEL-1 floor is 30.0 (7B Q4_K, bandwidth-bound at ~3.5× the size of 370M); MODEL-2 floor is 100.0 (370M sovereign, compute-bound at RTX 4090 bandwidth).

MODEL-1 AC-SHIP1 coverage: 3/10 touched (SHIP-008 + SHIP-009 + SHIP-006) → **4/10** touched (+ SHIP-007).

Test: `cargo test -p aprender-core --lib falsify_ship_007_decode_tps_threshold_logic` → 1 passed.
Contract: `cargo run --quiet -p aprender-contracts-cli --bin pv -- validate contracts/qwen2-e2e-verification-v1.yaml` → 0 errors.

Stacked on #1013 (feat/falsify-ship-006-partial-discharge), which is itself stacked on #1012 (feat/falsify-ship-008-partial-discharge). Spec v2.26.0 builds on v2.25.0 which builds on v2.24.0.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
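
A minimal sketch of the proposed deduplicated helper, assuming the `(measured, floor)` signature and `ThresholdVerdict` name suggested in the commit message. The MODEL-1 constant name is from the commit; `MODEL_2_MIN_DECODE_TPS` is a hypothetical name for the 100.0 floor the commit mentions, and treating a non-finite floor as Fail is an added assumption.

```rust
/// MODEL-1 floor from the commit message (7B Q4_K on RTX 4090).
pub const AC_SHIP1_007_MIN_DECODE_TPS_RTX4090_7B: f32 = 30.0;
/// Hypothetical constant name; the 100.0 value is the MODEL-2 floor quoted above.
pub const MODEL_2_MIN_DECODE_TPS: f32 = 100.0;

/// Shared two-valued verdict; derives are assumed.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ThresholdVerdict {
    Pass,
    Fail,
}

/// Pass iff `measured >= floor` and both values are finite.
/// NaN and ±∞ fail conservatively, matching section 6 of the survey;
/// note that +∞ fails even though it compares greater than any floor.
#[must_use]
pub fn verdict_from_decode_tps(measured: f32, floor: f32) -> ThresholdVerdict {
    if measured.is_finite() && floor.is_finite() && measured >= floor {
        ThresholdVerdict::Pass
    } else {
        ThresholdVerdict::Fail
    }
}
```

The explicit `is_finite` check matters because a bare `measured >= floor` would already reject NaN but would let `f32::INFINITY` through, which is the `nan_promoted` style of counter-example the contract wants to rule out.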

…nEval pass@1 ≥86.00% (1.2 pp noise → effective 84.80%)

Wires MODEL-1 `apr eval --benchmark humaneval` ship floor (AC-SHIP1-005) to a pure two-number threshold verdict fn. 5th compute-free MODEL-1 lever (SHIP-008 + SHIP-009 + SHIP-006 + SHIP-007 + SHIP-005) brings MODEL-1 AC-SHIP1 coverage to 5/10 touched. Mirrors MODEL-2 SHIP-018 pattern (pass@1 threshold) but uniquely carries a 1.2 pp noise allowance called out by spec §4.2 AC-SHIP1-005.

contracts/qwen2-e2e-verification-v1.yaml v1.1.0 → v1.2.0:
- Adds FALSIFY-QW2E-SHIP-005 binding
    AC_SHIP1_005_NOMINAL_HUMANEVAL_PASS_AT_1_PCT = 86.00
    AC_SHIP1_005_NOISE_ALLOWANCE_PP = 1.20
    AC_SHIP1_005_EFFECTIVE_HUMANEVAL_PASS_AT_1_PCT ≈ 84.80
  to `verdict_from_pass_at_1(correct, total, threshold_pct) -> Ship005Verdict` in `crates/aprender-core/src/metrics/ship_005.rs`.
- 8-section mutation survey:
  1. Safe-margin Pass above effective floor (85/100 = 85.0%)
  2. Above nominal floor (87/100 = 87.0%) Pass
  3. Noise-window Fail at nominal (85/100 Fails nominal)
  4. Below-effective Fail incl. HumanEval-canonical 139/164 = 84.756%
  5. Monotonicity sweep correct=0..=164 at effective
  6. Div-safety (total=0) + sanity (correct>total) → Fail
  7. Non-finite threshold (NaN, ±∞) → Fail conservatively
  8. Tolerance-bounded provenance pin on all three constants (86.0 − 1.2 in f32 yields ~84.79999924, not exact 84.80).
- `ship_blocking: true`, `discharge_status: PARTIAL_ALGORITHM_LEVEL`, `full_discharge_blocks_on: live apr eval --benchmark humaneval ...` on RTX 4090; 6 named counter_example_classes.

crates/aprender-core/src/metrics/ship_005.rs (NEW, 310 lines):
- Three-constant design unique to MODEL-1 (SHIP-007/018 had one).
- `#[must_use] pub fn verdict_from_pass_at_1(...)` returns `Ship005Verdict::Fail` conservatively on: total=0 (div guard), correct>total (sanity), !threshold.is_finite() (NaN/±∞).
- `falsify_ship_005_humaneval_pass_at_1_threshold_logic`: 1 passing.

Spec `docs/specifications/aprender-train/ship-two-models-spec.md` v2.26.0 → v2.27.0 annotates AC-SHIP1-005 + FALSIFY-SHIP-005 rows `**(PARTIAL_ALGORITHM_LEVEL v2.27.0)**` and appends amendment entry noting 11 PARTIAL + 3 DISCHARGED across both models, MODEL-1 5/10.

Authored self-contained because MODEL-2 SHIP-018 sibling PR has not yet landed on main. Once it does, the two `verdict_from_pass_at_1_*` fns should be dedup'd into a single parameterized helper.

Full discharge blocks on: live `apr eval --benchmark humaneval paiml/qwen2.5-coder-7b-apache-q4k-v1 --json` on RTX 4090 with --features cuda; median pass@1 across 3 seed=0 runs ≥ 86.00 (or ≥ 84.80 under the 1.2 pp noise allowance).

Tests:
  cargo test -p aprender-core --lib \
    falsify_ship_005_humaneval_pass_at_1_threshold_logic
Contract:
  cargo run --quiet -p aprender-contracts-cli --bin pv -- validate \
    contracts/qwen2-e2e-verification-v1.yaml

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
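
The three-constant design and the conservative guards listed in this commit could be sketched as below. Constant names, the function name, and the three Fail guards come from the commit message; the parameter types (`usize`, `usize`, `f32`), the derives, and the use of `f64` for the ratio are assumptions.

```rust
pub const AC_SHIP1_005_NOMINAL_HUMANEVAL_PASS_AT_1_PCT: f32 = 86.00;
pub const AC_SHIP1_005_NOISE_ALLOWANCE_PP: f32 = 1.20;
/// Derived rather than written as a literal, so in f32 it is only
/// approximately 84.80 (the commit notes ~84.79999924).
pub const AC_SHIP1_005_EFFECTIVE_HUMANEVAL_PASS_AT_1_PCT: f32 =
    AC_SHIP1_005_NOMINAL_HUMANEVAL_PASS_AT_1_PCT - AC_SHIP1_005_NOISE_ALLOWANCE_PP;

/// Two-valued verdict; derives are assumed.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Ship005Verdict {
    Pass,
    Fail,
}

/// Pass iff `correct / total`, as a percentage, meets `threshold_pct`.
/// Fails conservatively on total == 0, correct > total, or a non-finite threshold.
#[must_use]
pub fn verdict_from_pass_at_1(correct: usize, total: usize, threshold_pct: f32) -> Ship005Verdict {
    if total == 0 || correct > total || !threshold_pct.is_finite() {
        return Ship005Verdict::Fail;
    }
    // f64 for the ratio is an illustrative choice to keep rounding out of the comparison.
    let pct = (correct as f64 / total as f64) * 100.0;
    if pct >= f64::from(threshold_pct) {
        Ship005Verdict::Pass
    } else {
        Ship005Verdict::Fail
    }
}
```

On the 164-problem HumanEval set, 139 correct gives about 84.756% and falls just under the effective floor, while 140 correct gives about 85.366% and clears it, so the effective threshold sits between two adjacent scores.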

… run --prompt "def fib(n):"` emits valid Python (zero syntax errors)

6th compute-free MODEL-1 lever. Wires AC-SHIP1-002 (`apr run <model>.safetensors --prompt "def fib(n):"` emits syntactically valid Python) to a pure `verdict_from_syntax_error_count(usize) -> Ship002Verdict` const fn. Brings MODEL-1 AC-SHIP1 coverage to 6/10 touched (SHIP-008 + SHIP-009 + SHIP-006 + SHIP-007 + SHIP-005 + SHIP-002). Mirrors MODEL-2 SHIP-017 shape but with a tighter zero-tolerance rule.

contracts/qwen2-e2e-verification-v1.yaml v1.2.0 → v1.3.0:
- Adds FALSIFY-QW2E-SHIP-002 binding AC_SHIP1_002_MAX_TOLERATED_SYNTAX_ERRORS = 0 to `verdict_from_syntax_error_count(usize) -> Ship002Verdict` const fn in `crates/aprender-core/src/qa/ship_002.rs`.
- 6-section mutation survey:
  1. Zero errors → Pass (trivial unanimous-parse case)
  2. Exactly-one error → Fail (boundary; MODEL-1 zero-tolerance)
  3. Many-errors Fail band {2, 10, 100}
  4. Monotonicity sweep 0..=256
  5. usize::MAX sanity guard → Fail
  6. Provenance pin on tolerance constant = 0
- `ship_blocking: true`, `discharge_status: PARTIAL_ALGORITHM_LEVEL`, `full_discharge_blocks_on: live apr run ... + rustpython/ruff AST parse with zero errors`; 5 named counter_example_classes (regressed_sampling, widened_tolerance, relaxed_rule, flipped_inequality, mask_byte_malformed).

crates/aprender-core/src/qa/ship_002.rs (NEW, ~130 lines):
- `#[must_use] pub const fn verdict_from_syntax_error_count(usize)` returns Pass iff `n <= 0` (i.e., `n == 0`).
- `AC_SHIP1_002_MAX_TOLERATED_SYNTAX_ERRORS: usize = 0`: spec §4.2 AC-SHIP1-002 has no noise allowance.
- `falsify_ship_002_python_syntax_error_threshold_logic`: 1 passing.

Spec `docs/specifications/aprender-train/ship-two-models-spec.md` v2.27.0 → v2.28.0 annotates AC-SHIP1-002 + FALSIFY-SHIP-002 rows `**(PARTIAL_ALGORITHM_LEVEL v2.28.0)**` and appends amendment entry noting 12 PARTIAL + 3 DISCHARGED across both models, MODEL-1 6/10.

Authored self-contained because MODEL-2 SHIP-017 sibling PR #1004 has not yet landed on main. Once it does, the two `verdict_from_syntax_error_count_*` fns should be dedup'd into a single parameterized helper `verdict_from_syntax_error_count(errors, max_allowed) -> SyntaxVerdict` with the per-model tolerance held externally.

Full discharge blocks on: live `apr run paiml/qwen2.5-coder-7b-apache-q4k-v1 --prompt "def fib(n):"` on RTX 4090 with `--features cuda`; completion parses cleanly via `rustpython` or `ruff check --select=E` with zero syntax errors.

Tests:
  cargo test -p aprender-core --lib \
    falsify_ship_002_python_syntax_error_threshold_logic
Contract:
  cargo run --quiet -p aprender-contracts-cli --bin pv -- validate \
    contracts/qwen2-e2e-verification-v1.yaml

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
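
The deduplicated helper proposed in this commit might take roughly the following shape. The function signature and `SyntaxVerdict` name are from the commit message; both tolerance constant names are hypothetical, with the MODEL-2 value of 1 taken from the "≤ 1 out of 100" wording in the PR description.

```rust
/// Shared two-valued verdict; derives are assumed.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SyntaxVerdict {
    Pass,
    Fail,
}

/// Hypothetical per-model tolerances, held outside the helper.
/// MODEL-1: zero tolerance. MODEL-2: at most 1 error (out of 100 samples).
pub const MODEL_1_MAX_TOLERATED_SYNTAX_ERRORS: usize = 0;
pub const MODEL_2_MAX_TOLERATED_SYNTAX_ERRORS: usize = 1;

/// Pass iff the observed error count is within the caller-supplied tolerance.
#[must_use]
pub const fn verdict_from_syntax_error_count(errors: usize, max_allowed: usize) -> SyntaxVerdict {
    if errors <= max_allowed {
        SyntaxVerdict::Pass
    } else {
        SyntaxVerdict::Fail
    }
}
```

Keeping the tolerance as a parameter means the same one-error input yields Fail for MODEL-1 and Pass for MODEL-2, so the difference between the two ship criteria lives in a pinned constant rather than in duplicated logic.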

noahgift · 2026-04-22T20:07:33Z

Superseded by fresh clean-branch PR from main (c1263178a3) — no stacked SHIP-005/007 dependency. New PR at feat/falsify-ship-002-clean.

noahgift and others added 5 commits April 22, 2026 18:25

noahgift closed this Apr 22, 2026

This was referenced Apr 22, 2026

feat(falsify-ship-002): MODEL-1 Python syntax PARTIAL discharge (clean branch) #1017

Merged

CI infra: disk-guard cross-runner race wipes target/ mid-build (ci/coverage persistent flake) #1020

Closed

noahgift wants to merge 5 commits into main from feat/falsify-ship-002-partial-discharge
