feat(falsify-ship-002): MODEL-1 apr run emits valid Python (zero syntax errors) PARTIAL discharge#1016
Closed
noahgift wants to merge 5 commits into
Closed
feat(falsify-ship-002): MODEL-1 apr run emits valid Python (zero syntax errors) PARTIAL discharge#1016noahgift wants to merge 5 commits into
noahgift wants to merge 5 commits into
Conversation
Discharge FALSIFY-SHIP-008 / AC-SHIP1-008 at PARTIAL_ALGORITHM_LEVEL.
- contracts/chat-template-v1.yaml v1.0.0 -> v1.1.0: adds
GATE-CHAT-SHIP-008 binding ChatMLTemplate::format_conversation to
the canonical Qwen2.5-Coder-7B (system, user) golden via a pure
verdict_from_chat_template_render const fn. ship_blocking: true,
discharge_status: PARTIAL_ALGORITHM_LEVEL; full discharge blocks
on live `apr run paiml/qwen2.5-coder-7b-apache-q4k-v1` completion
diff against golden.
- crates/aprender-core/src/text/chat_template/ship_008.rs (new):
AC_SHIP1_008_CANONICAL_{SYSTEM,USER,GOLDEN} constants +
Ship008Verdict enum + verdict_from_chat_template_render const fn
(byte-equality, UTF-8-safe) + 5-section mutation survey
(engine-binding, empty Fail, missing-gen-prompt Fail, wrong-delim
Fail, swapped-roles Fail, single-byte flip Fail) + symmetry +
provenance pin.
- crates/aprender-core/src/text/chat_template/mod.rs: include!
ship_008.rs alongside existing template.rs, raw_template.rs.
- docs/specifications/aprender-train/ship-two-models-spec.md
v2.23.0 -> v2.24.0: AC-SHIP1-008 row + FALSIFY-SHIP-008 row
annotated PARTIAL_ALGORITHM_LEVEL; v2.24.0 amendment entry
records MODEL-1 coverage 1/10 -> 2/10 (first MODEL-1
non-provenance PARTIAL; mirrors SHIP-016/017/018/020 pattern).
Test: cargo test -p aprender-core --lib
falsify_ship_008_chat_template_render_bind -> 1 passed
Contract: pv validate contracts/chat-template-v1.yaml -> Contract is valid
Refs: SHIP-TWO-001, task #155
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…arge Wires AC-SHIP1-006 "apr qa <model> — all 8 gates PASS" at PARTIAL_ALGORITHM_LEVEL: a pure aggregate-AND verdict fn bound to the 8-gate ship criterion from `docs/specifications/components/qa.md` §3 (golden / throughput / ollama parity / gpu speedup / tensor contracts / format parity / ptx parity / metadata). Files: - `crates/aprender-core/src/qa/ship_006.rs` (NEW, 217 lines) — `verdict_from_qa_gates(&[bool]) -> Ship006Verdict` const fn with 7-section mutation survey: all-Pass→Pass, all-Fail→Fail, single-gate-flip × 8, exhaustive 2^8=256 bitmask proof, Pass→Fail monotonicity, length-drift counter-examples (0 / 7 / 9 / 16), provenance pin (AC_SHIP1_006_REQUIRED_QA_GATE_COUNT = 8). - `crates/aprender-core/src/qa/mod.rs` — register `pub mod ship_006;`. - `contracts/apr-model-qa-v1.yaml` v1.1.0 → v1.2.0 — adds `FALSIFY-QA-SHIP-006` with `ship_blocking: true`, `discharge_status: PARTIAL_ALGORITHM_LEVEL`, `evidence_discharged_by` pointing at ship_006.rs + the harness test, and `full_discharge_blocks_on` live `apr qa paiml/qwen2.5-coder-7b-apache-q4k-v1 --json` on an RTX 4090 host (8× `"pass": true` entries in the JSON body). - `docs/specifications/aprender-train/ship-two-models-spec.md` v2.24.0 → v2.25.0 — annotates AC-SHIP1-006 + FALSIFY-SHIP-006 rows with PARTIAL_ALGORITHM_LEVEL markers and adds v2.25.0 amendment entry. Design: mirrors the aggregate-AND shape set by MODEL-2 SHIP-016 (task #152 on `feat/falsify-ship-016-partial-discharge`, not yet on main). Authored self-contained because SHIP-016 hasn't landed; once both ship, the two `verdict_from_qa_gates_*` fns should be deduplicated into a single parameterized helper. Required gate count differs by model (both 8 today — the spec's "All must Pass" is model-independent). MODEL-1 AC-SHIP1 coverage: 2/10 touched (SHIP-008 + SHIP-009) → **3/10** touched (+ SHIP-006). First MODEL-1 aggregate-AND PARTIAL. Full discharge blocks on a live `apr qa` run against the teacher weights on RTX 4090; the compute-heavy portion is intentionally out of scope here. Test: `cargo test -p aprender-core --lib falsify_ship_006_apr_qa_eight_gates_aggregate` → 1 passed. Contract: `cargo run --quiet -p aprender-contracts-cli --bin pv -- validate contracts/apr-model-qa-v1.yaml` → 0 errors. Stacked on #1012 (feat/falsify-ship-008-partial-discharge). Spec v2.25.0 builds on v2.24.0. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…scharge
Wires AC-SHIP1-007 "apr bench decode throughput ≥30 tok/s on RTX 4090
(7B Q4_K target)" at PARTIAL_ALGORITHM_LEVEL: a pure f32 threshold
verdict fn bound to the MODEL-1 teacher ship floor. The decision rule
is proven today; the compute-heavy half (live `apr bench` on RTX 4090)
is deferred to hardware evidence collection.
Files:
- `crates/aprender-core/src/bench/ship_007.rs` (NEW, 158 lines) —
`AC_SHIP1_007_MIN_DECODE_TPS_RTX4090_7B = 30.0`,
`Ship007Verdict { Pass, Fail }`,
`verdict_from_decode_tps(f32) -> Ship007Verdict`,
`falsify_ship_007_decode_tps_threshold_logic` 7-section survey:
1. boundary (30.0 exactly → Pass; the contract is ≥, not >)
2. one-ULP-below → Fail (sharpest off-by-one counter-example)
3. clear Pass band (45 / 100 tok/s)
4. clear Fail band (0 / 10 / 29.999999)
5. monotonicity above floor + below floor
6. non-finite → Fail conservatively (NaN, +∞, -∞)
7. provenance pin binding the 30.0 constant to spec §4.2.
- `crates/aprender-core/src/bench/mod.rs` — register `pub mod ship_007;`.
- `contracts/qwen2-e2e-verification-v1.yaml` v1.0.0 → v1.1.0 — adds
`FALSIFY-QW2E-SHIP-007` with `ship_blocking: true`,
`discharge_status: PARTIAL_ALGORITHM_LEVEL`, `evidence_discharged_by`
pointing at ship_007.rs + the harness test, and
`full_discharge_blocks_on` live `apr bench --iterations 5
--max-tokens 128 paiml/qwen2.5-coder-7b-apache-q4k-v1` on RTX 4090
with --features cuda; median of 5 iterations must be ≥ 30.0. Also
4 `counter_example_classes` (regressed_kernel / drifted_constant /
relaxed_rule / nan_promoted).
- `docs/specifications/aprender-train/ship-two-models-spec.md`
v2.25.0 → v2.26.0 — annotates AC-SHIP1-007 + FALSIFY-SHIP-007 rows
with PARTIAL_ALGORITHM_LEVEL markers and adds v2.26.0 amendment entry.
Design: mirrors the MODEL-2 SHIP-020 single-f32-threshold shape (task
#150 on `feat/falsify-ship-020-partial-discharge`, PR #1005 not yet on
main). Authored self-contained because SHIP-020 hasn't landed; once
both ship, the two `verdict_from_decode_tps_*` fns should be
deduplicated into a single parameterized helper
`verdict_from_decode_tps(measured, floor) -> ThresholdVerdict` with
the model-specific floor pinned as a module-level const. MODEL-1 floor
is 30.0 (7B Q4_K, bandwidth-bound at ~3.5× the size of 370M); MODEL-2
floor is 100.0 (370M sovereign, compute-bound at RTX 4090 bandwidth).
MODEL-1 AC-SHIP1 coverage: 3/10 touched (SHIP-008 + SHIP-009 +
SHIP-006) → **4/10** touched (+ SHIP-007).
Test: `cargo test -p aprender-core --lib falsify_ship_007_decode_tps_threshold_logic` → 1 passed.
Contract: `cargo run --quiet -p aprender-contracts-cli --bin pv -- validate contracts/qwen2-e2e-verification-v1.yaml` → 0 errors.
Stacked on #1013 (feat/falsify-ship-006-partial-discharge), which is
itself stacked on #1012 (feat/falsify-ship-008-partial-discharge).
Spec v2.26.0 builds on v2.25.0 which builds on v2.24.0.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…nEval pass@1 ≥86.00% (1.2 pp noise → effective 84.80%)
Wires MODEL-1 `apr eval --benchmark humaneval` ship floor (AC-SHIP1-005)
to a pure two-number threshold verdict fn. 5th compute-free MODEL-1
lever (SHIP-008 + SHIP-009 + SHIP-006 + SHIP-007 + SHIP-005) brings
MODEL-1 AC-SHIP1 coverage to 5/10 touched. Mirrors MODEL-2 SHIP-018
pattern (pass@1 threshold) but uniquely carries a 1.2 pp noise
allowance called out by spec §4.2 AC-SHIP1-005.
contracts/qwen2-e2e-verification-v1.yaml v1.1.0 → v1.2.0:
- Adds FALSIFY-QW2E-SHIP-005 binding
AC_SHIP1_005_NOMINAL_HUMANEVAL_PASS_AT_1_PCT = 86.00
AC_SHIP1_005_NOISE_ALLOWANCE_PP = 1.20
AC_SHIP1_005_EFFECTIVE_HUMANEVAL_PASS_AT_1_PCT ≈ 84.80
to `verdict_from_pass_at_1(correct, total, threshold_pct) ->
Ship005Verdict` in `crates/aprender-core/src/metrics/ship_005.rs`.
- 8-section mutation survey:
1. Safe-margin Pass above effective floor (85/100 = 85.0%)
2. Above nominal floor (87/100 = 87.0%) Pass
3. Noise-window Fail at nominal (85/100 Fails nominal)
4. Below-effective Fail incl. HumanEval-canonical 139/164 = 84.756%
5. Monotonicity sweep correct=0..=164 at effective
6. Div-safety (total=0) + sanity (correct>total) → Fail
7. Non-finite threshold (NaN, ±∞) → Fail conservatively
8. Tolerance-bounded provenance pin on all three constants
(86.0 − 1.2 in f32 yields ~84.79999924, not exact 84.80).
- `ship_blocking: true`, `discharge_status: PARTIAL_ALGORITHM_LEVEL`,
`full_discharge_blocks_on: live apr eval --benchmark humaneval ...`
on RTX 4090; 6 named counter_example_classes.
crates/aprender-core/src/metrics/ship_005.rs (NEW, 310 lines):
- Three-constant design unique to MODEL-1 (SHIP-007/018 had one).
- `#[must_use] pub fn verdict_from_pass_at_1(...)` returns
`Ship005Verdict::Fail` conservatively on: total=0 (div guard),
correct>total (sanity), !threshold.is_finite() (NaN/±∞).
- `falsify_ship_005_humaneval_pass_at_1_threshold_logic` — 1 passing.
Spec `docs/specifications/aprender-train/ship-two-models-spec.md`
v2.26.0 → v2.27.0 annotates AC-SHIP1-005 + FALSIFY-SHIP-005 rows
`**(PARTIAL_ALGORITHM_LEVEL v2.27.0)**` and appends amendment entry
noting 11 PARTIAL + 3 DISCHARGED across both models, MODEL-1 5/10.
Authored self-contained because MODEL-2 SHIP-018 sibling PR has not
yet landed on main. Once it does, the two `verdict_from_pass_at_1_*`
fns should be dedup'd into a single parameterized helper.
Full discharge blocks on: live `apr eval --benchmark humaneval
paiml/qwen2.5-coder-7b-apache-q4k-v1 --json` on RTX 4090 with
--features cuda; median pass@1 across 3 seed=0 runs ≥ 86.00 (or
≥ 84.80 under the 1.2 pp noise allowance).
Tests:
cargo test -p aprender-core --lib \
falsify_ship_005_humaneval_pass_at_1_threshold_logic
Contract:
cargo run --quiet -p aprender-contracts-cli --bin pv -- validate \
contracts/qwen2-e2e-verification-v1.yaml
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
… run --prompt "def fib(n):"` emits valid Python (zero syntax errors)
6th compute-free MODEL-1 lever. Wires AC-SHIP1-002 (`apr run <model>.safetensors
--prompt "def fib(n):"` emits syntactically valid Python) to a pure
`verdict_from_syntax_error_count(usize) -> Ship002Verdict` const fn.
Brings MODEL-1 AC-SHIP1 coverage to 6/10 touched (SHIP-008 + SHIP-009
+ SHIP-006 + SHIP-007 + SHIP-005 + SHIP-002). Mirrors MODEL-2 SHIP-017
shape but with a tighter zero-tolerance rule.
contracts/qwen2-e2e-verification-v1.yaml v1.2.0 → v1.3.0:
- Adds FALSIFY-QW2E-SHIP-002 binding
AC_SHIP1_002_MAX_TOLERATED_SYNTAX_ERRORS = 0
to `verdict_from_syntax_error_count(usize) -> Ship002Verdict`
const fn in `crates/aprender-core/src/qa/ship_002.rs`.
- 6-section mutation survey:
1. Zero errors → Pass (trivial unanimous-parse case)
2. Exactly-one error → Fail (boundary; MODEL-1 zero-tolerance)
3. Many-errors Fail band {2, 10, 100}
4. Monotonicity sweep 0..=256
5. usize::MAX sanity guard → Fail
6. Provenance pin on tolerance constant = 0
- `ship_blocking: true`, `discharge_status: PARTIAL_ALGORITHM_LEVEL`,
`full_discharge_blocks_on: live apr run ... + rustpython/ruff
AST parse with zero errors`; 5 named counter_example_classes
(regressed_sampling, widened_tolerance, relaxed_rule,
flipped_inequality, mask_byte_malformed).
crates/aprender-core/src/qa/ship_002.rs (NEW, ~130 lines):
- `#[must_use] pub const fn verdict_from_syntax_error_count(usize)`
returns Pass iff `n <= 0` (i.e., `n == 0`).
- `AC_SHIP1_002_MAX_TOLERATED_SYNTAX_ERRORS: usize = 0` — spec
§4.2 AC-SHIP1-002 has no noise allowance.
- `falsify_ship_002_python_syntax_error_threshold_logic` — 1 passing.
Spec `docs/specifications/aprender-train/ship-two-models-spec.md`
v2.27.0 → v2.28.0 annotates AC-SHIP1-002 + FALSIFY-SHIP-002 rows
`**(PARTIAL_ALGORITHM_LEVEL v2.28.0)**` and appends amendment entry
noting 12 PARTIAL + 3 DISCHARGED across both models, MODEL-1 6/10.
Authored self-contained because MODEL-2 SHIP-017 sibling PR #1004
has not yet landed on main. Once it does, the two
`verdict_from_syntax_error_count_*` fns should be dedup'd into a
single parameterized helper `verdict_from_syntax_error_count(errors,
max_allowed) -> SyntaxVerdict` with the per-model tolerance held
externally.
Full discharge blocks on: live `apr run paiml/qwen2.5-coder-7b-apache-q4k-v1
--prompt "def fib(n):"` on RTX 4090 with `--features cuda`;
completion parses cleanly via `rustpython` or `ruff check --select=E`
with zero syntax errors.
Tests:
cargo test -p aprender-core --lib \
falsify_ship_002_python_syntax_error_threshold_logic
Contract:
cargo run --quiet -p aprender-contracts-cli --bin pv -- validate \
contracts/qwen2-e2e-verification-v1.yaml
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Contributor
Author
|
Superseded by fresh clean-branch PR from main (c1263178a3) — no stacked SHIP-005/007 dependency. New PR at feat/falsify-ship-002-clean. |
This was referenced Apr 22, 2026
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
6th compute-free MODEL-1 PARTIAL lever. Wires AC-SHIP1-002 (`apr
run .safetensors --prompt "def fib(n):"` emits syntactically
valid Python) to a pure `verdict_from_syntax_error_count(usize) ->
Ship002Verdict` const fn. Targets main directly (upstream stack
#1012/#1014 merged).
adds `FALSIFY-QW2E-SHIP-002` (ship_blocking, PARTIAL_ALGORITHM_LEVEL)
binding `AC_SHIP1_002_MAX_TOLERATED_SYNTAX_ERRORS = 0` to
`verdict_from_syntax_error_count(usize) -> Ship002Verdict` in
`crates/aprender-core/src/qa/ship_002.rs`.
Fail at boundary / many-errors Fail band {2, 10, 100} / monotonicity
0..=256 / usize::MAX sanity Fail / provenance pin on tolerance = 0.
annotated `(PARTIAL_ALGORITHM_LEVEL v2.28.0)`; amendment
history entry notes MODEL-1 coverage 5/10 → 6/10 touched and
12 PARTIAL + 3 DISCHARGED across both models.
Mirrors MODEL-2 SHIP-017 `verdict_from_syntax_error_count` shape
with a tighter zero-tolerance rule (MODEL-2 tolerates ≤ 1 out of 100;
MODEL-1 spec wording "emits valid Python" has no slack). Authored
self-contained because SHIP-017 PR #1004 not yet on main.
Test plan
falsify_ship_002_python_syntax_error_threshold_logic` — 1 passed
validate contracts/qwen2-e2e-verification-v1.yaml` — Contract is valid
--prompt "def fib(n):"` on RTX 4090 with `--features cuda`;
`rustpython` or `ruff check --select=E` → 0 syntax errors
🤖 Generated with Claude Code