Skip to content

feat(falsify-ship-002): MODEL-1 apr run emits valid Python (zero syntax errors) PARTIAL discharge#1016

Closed
noahgift wants to merge 5 commits into
mainfrom
feat/falsify-ship-002-partial-discharge
Closed

feat(falsify-ship-002): MODEL-1 apr run emits valid Python (zero syntax errors) PARTIAL discharge#1016
noahgift wants to merge 5 commits into
mainfrom
feat/falsify-ship-002-partial-discharge

Conversation

@noahgift
Copy link
Copy Markdown
Contributor

Summary

6th compute-free MODEL-1 PARTIAL lever. Wires AC-SHIP1-002 (`apr
run .safetensors --prompt "def fib(n):"` emits syntactically
valid Python) to a pure `verdict_from_syntax_error_count(usize) ->
Ship002Verdict` const fn. Targets main directly (upstream stack
#1012/#1014 merged).

  • Contract `contracts/qwen2-e2e-verification-v1.yaml` v1.2.0 → v1.3.0:
    adds `FALSIFY-QW2E-SHIP-002` (ship_blocking, PARTIAL_ALGORITHM_LEVEL)
    binding `AC_SHIP1_002_MAX_TOLERATED_SYNTAX_ERRORS = 0` to
    `verdict_from_syntax_error_count(usize) -> Ship002Verdict` in
    `crates/aprender-core/src/qa/ship_002.rs`.
  • 6-section mutation survey: zero-errors Pass / exactly-one-error
    Fail at boundary / many-errors Fail band {2, 10, 100} / monotonicity
    0..=256 / usize::MAX sanity Fail / provenance pin on tolerance = 0.
  • Spec v2.27.0 → v2.28.0: AC-SHIP1-002 + FALSIFY-SHIP-002 rows
    annotated `(PARTIAL_ALGORITHM_LEVEL v2.28.0)`; amendment
    history entry notes MODEL-1 coverage 5/10 → 6/10 touched and
    12 PARTIAL + 3 DISCHARGED across both models.

Mirrors MODEL-2 SHIP-017 `verdict_from_syntax_error_count` shape
with a tighter zero-tolerance rule (MODEL-2 tolerates ≤ 1 out of 100;
MODEL-1 spec wording "emits valid Python" has no slack). Authored
self-contained because SHIP-017 PR #1004 not yet on main.

Test plan

  • `cargo test -p aprender-core --lib
    falsify_ship_002_python_syntax_error_threshold_logic` — 1 passed
  • `cargo run --quiet -p aprender-contracts-cli --bin pv --
    validate contracts/qwen2-e2e-verification-v1.yaml` — Contract is valid
  • Full discharge: live `apr run paiml/qwen2.5-coder-7b-apache-q4k-v1
    --prompt "def fib(n):"` on RTX 4090 with `--features cuda`;
    `rustpython` or `ruff check --select=E` → 0 syntax errors

🤖 Generated with Claude Code

noahgift and others added 5 commits April 22, 2026 18:25
Discharge FALSIFY-SHIP-008 / AC-SHIP1-008 at PARTIAL_ALGORITHM_LEVEL.

- contracts/chat-template-v1.yaml v1.0.0 -> v1.1.0: adds
  GATE-CHAT-SHIP-008 binding ChatMLTemplate::format_conversation to
  the canonical Qwen2.5-Coder-7B (system, user) golden via a pure
  verdict_from_chat_template_render const fn. ship_blocking: true,
  discharge_status: PARTIAL_ALGORITHM_LEVEL; full discharge blocks
  on live `apr run paiml/qwen2.5-coder-7b-apache-q4k-v1` completion
  diff against golden.
- crates/aprender-core/src/text/chat_template/ship_008.rs (new):
  AC_SHIP1_008_CANONICAL_{SYSTEM,USER,GOLDEN} constants +
  Ship008Verdict enum + verdict_from_chat_template_render const fn
  (byte-equality, UTF-8-safe) + 5-section mutation survey
  (engine-binding, empty Fail, missing-gen-prompt Fail, wrong-delim
  Fail, swapped-roles Fail, single-byte flip Fail) + symmetry +
  provenance pin.
- crates/aprender-core/src/text/chat_template/mod.rs: include!
  ship_008.rs alongside existing template.rs, raw_template.rs.
- docs/specifications/aprender-train/ship-two-models-spec.md
  v2.23.0 -> v2.24.0: AC-SHIP1-008 row + FALSIFY-SHIP-008 row
  annotated PARTIAL_ALGORITHM_LEVEL; v2.24.0 amendment entry
  records MODEL-1 coverage 1/10 -> 2/10 (first MODEL-1
  non-provenance PARTIAL; mirrors SHIP-016/017/018/020 pattern).

Test: cargo test -p aprender-core --lib
  falsify_ship_008_chat_template_render_bind -> 1 passed
Contract: pv validate contracts/chat-template-v1.yaml -> Contract is valid

Refs: SHIP-TWO-001, task #155

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…arge

Wires AC-SHIP1-006 "apr qa <model> — all 8 gates PASS" at
PARTIAL_ALGORITHM_LEVEL: a pure aggregate-AND verdict fn bound
to the 8-gate ship criterion from `docs/specifications/components/qa.md`
§3 (golden / throughput / ollama parity / gpu speedup / tensor contracts
/ format parity / ptx parity / metadata).

Files:
- `crates/aprender-core/src/qa/ship_006.rs` (NEW, 217 lines) —
  `verdict_from_qa_gates(&[bool]) -> Ship006Verdict` const fn with
  7-section mutation survey: all-Pass→Pass, all-Fail→Fail,
  single-gate-flip × 8, exhaustive 2^8=256 bitmask proof, Pass→Fail
  monotonicity, length-drift counter-examples (0 / 7 / 9 / 16),
  provenance pin (AC_SHIP1_006_REQUIRED_QA_GATE_COUNT = 8).

- `crates/aprender-core/src/qa/mod.rs` — register `pub mod ship_006;`.

- `contracts/apr-model-qa-v1.yaml` v1.1.0 → v1.2.0 — adds
  `FALSIFY-QA-SHIP-006` with `ship_blocking: true`,
  `discharge_status: PARTIAL_ALGORITHM_LEVEL`, `evidence_discharged_by`
  pointing at ship_006.rs + the harness test, and
  `full_discharge_blocks_on` live `apr qa paiml/qwen2.5-coder-7b-apache-q4k-v1
  --json` on an RTX 4090 host (8× `"pass": true` entries in the JSON body).

- `docs/specifications/aprender-train/ship-two-models-spec.md`
  v2.24.0 → v2.25.0 — annotates AC-SHIP1-006 + FALSIFY-SHIP-006 rows
  with PARTIAL_ALGORITHM_LEVEL markers and adds v2.25.0 amendment entry.

Design: mirrors the aggregate-AND shape set by MODEL-2 SHIP-016
(task #152 on `feat/falsify-ship-016-partial-discharge`, not yet on
main). Authored self-contained because SHIP-016 hasn't landed;
once both ship, the two `verdict_from_qa_gates_*` fns should be
deduplicated into a single parameterized helper. Required gate
count differs by model (both 8 today — the spec's "All must Pass"
is model-independent).

MODEL-1 AC-SHIP1 coverage: 2/10 touched (SHIP-008 + SHIP-009) →
**3/10** touched (+ SHIP-006). First MODEL-1 aggregate-AND PARTIAL.

Full discharge blocks on a live `apr qa` run against the teacher
weights on RTX 4090; the compute-heavy portion is intentionally
out of scope here.

Test: `cargo test -p aprender-core --lib falsify_ship_006_apr_qa_eight_gates_aggregate` → 1 passed.
Contract: `cargo run --quiet -p aprender-contracts-cli --bin pv -- validate contracts/apr-model-qa-v1.yaml` → 0 errors.

Stacked on #1012 (feat/falsify-ship-008-partial-discharge). Spec
v2.25.0 builds on v2.24.0.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…scharge

Wires AC-SHIP1-007 "apr bench decode throughput ≥30 tok/s on RTX 4090
(7B Q4_K target)" at PARTIAL_ALGORITHM_LEVEL: a pure f32 threshold
verdict fn bound to the MODEL-1 teacher ship floor. The decision rule
is proven today; the compute-heavy half (live `apr bench` on RTX 4090)
is deferred to hardware evidence collection.

Files:
- `crates/aprender-core/src/bench/ship_007.rs` (NEW, 158 lines) —
  `AC_SHIP1_007_MIN_DECODE_TPS_RTX4090_7B = 30.0`,
  `Ship007Verdict { Pass, Fail }`,
  `verdict_from_decode_tps(f32) -> Ship007Verdict`,
  `falsify_ship_007_decode_tps_threshold_logic` 7-section survey:
    1. boundary (30.0 exactly → Pass; the contract is ≥, not >)
    2. one-ULP-below → Fail (sharpest off-by-one counter-example)
    3. clear Pass band (45 / 100 tok/s)
    4. clear Fail band (0 / 10 / 29.999999)
    5. monotonicity above floor + below floor
    6. non-finite → Fail conservatively (NaN, +∞, -∞)
    7. provenance pin binding the 30.0 constant to spec §4.2.

- `crates/aprender-core/src/bench/mod.rs` — register `pub mod ship_007;`.

- `contracts/qwen2-e2e-verification-v1.yaml` v1.0.0 → v1.1.0 — adds
  `FALSIFY-QW2E-SHIP-007` with `ship_blocking: true`,
  `discharge_status: PARTIAL_ALGORITHM_LEVEL`, `evidence_discharged_by`
  pointing at ship_007.rs + the harness test, and
  `full_discharge_blocks_on` live `apr bench --iterations 5
  --max-tokens 128 paiml/qwen2.5-coder-7b-apache-q4k-v1` on RTX 4090
  with --features cuda; median of 5 iterations must be ≥ 30.0. Also
  4 `counter_example_classes` (regressed_kernel / drifted_constant /
  relaxed_rule / nan_promoted).

- `docs/specifications/aprender-train/ship-two-models-spec.md`
  v2.25.0 → v2.26.0 — annotates AC-SHIP1-007 + FALSIFY-SHIP-007 rows
  with PARTIAL_ALGORITHM_LEVEL markers and adds v2.26.0 amendment entry.

Design: mirrors the MODEL-2 SHIP-020 single-f32-threshold shape (task
#150 on `feat/falsify-ship-020-partial-discharge`, PR #1005 not yet on
main). Authored self-contained because SHIP-020 hasn't landed; once
both ship, the two `verdict_from_decode_tps_*` fns should be
deduplicated into a single parameterized helper
`verdict_from_decode_tps(measured, floor) -> ThresholdVerdict` with
the model-specific floor pinned as a module-level const. MODEL-1 floor
is 30.0 (7B Q4_K, bandwidth-bound at ~3.5× the size of 370M); MODEL-2
floor is 100.0 (370M sovereign, compute-bound at RTX 4090 bandwidth).

MODEL-1 AC-SHIP1 coverage: 3/10 touched (SHIP-008 + SHIP-009 +
SHIP-006) → **4/10** touched (+ SHIP-007).

Test: `cargo test -p aprender-core --lib falsify_ship_007_decode_tps_threshold_logic` → 1 passed.
Contract: `cargo run --quiet -p aprender-contracts-cli --bin pv -- validate contracts/qwen2-e2e-verification-v1.yaml` → 0 errors.

Stacked on #1013 (feat/falsify-ship-006-partial-discharge), which is
itself stacked on #1012 (feat/falsify-ship-008-partial-discharge).
Spec v2.26.0 builds on v2.25.0 which builds on v2.24.0.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…nEval pass@1 ≥86.00% (1.2 pp noise → effective 84.80%)

Wires MODEL-1 `apr eval --benchmark humaneval` ship floor (AC-SHIP1-005)
to a pure two-number threshold verdict fn. 5th compute-free MODEL-1
lever (SHIP-008 + SHIP-009 + SHIP-006 + SHIP-007 + SHIP-005) brings
MODEL-1 AC-SHIP1 coverage to 5/10 touched. Mirrors MODEL-2 SHIP-018
pattern (pass@1 threshold) but uniquely carries a 1.2 pp noise
allowance called out by spec §4.2 AC-SHIP1-005.

contracts/qwen2-e2e-verification-v1.yaml v1.1.0 → v1.2.0:
  - Adds FALSIFY-QW2E-SHIP-005 binding
      AC_SHIP1_005_NOMINAL_HUMANEVAL_PASS_AT_1_PCT = 86.00
      AC_SHIP1_005_NOISE_ALLOWANCE_PP              = 1.20
      AC_SHIP1_005_EFFECTIVE_HUMANEVAL_PASS_AT_1_PCT ≈ 84.80
    to `verdict_from_pass_at_1(correct, total, threshold_pct) ->
    Ship005Verdict` in `crates/aprender-core/src/metrics/ship_005.rs`.
  - 8-section mutation survey:
      1. Safe-margin Pass above effective floor (85/100 = 85.0%)
      2. Above nominal floor (87/100 = 87.0%) Pass
      3. Noise-window Fail at nominal (85/100 Fails nominal)
      4. Below-effective Fail incl. HumanEval-canonical 139/164 = 84.756%
      5. Monotonicity sweep correct=0..=164 at effective
      6. Div-safety (total=0) + sanity (correct>total) → Fail
      7. Non-finite threshold (NaN, ±∞) → Fail conservatively
      8. Tolerance-bounded provenance pin on all three constants
         (86.0 − 1.2 in f32 yields ~84.79999924, not exact 84.80).
  - `ship_blocking: true`, `discharge_status: PARTIAL_ALGORITHM_LEVEL`,
    `full_discharge_blocks_on: live apr eval --benchmark humaneval ...`
    on RTX 4090; 6 named counter_example_classes.

crates/aprender-core/src/metrics/ship_005.rs (NEW, 310 lines):
  - Three-constant design unique to MODEL-1 (SHIP-007/018 had one).
  - `#[must_use] pub fn verdict_from_pass_at_1(...)` returns
    `Ship005Verdict::Fail` conservatively on: total=0 (div guard),
    correct>total (sanity), !threshold.is_finite() (NaN/±∞).
  - `falsify_ship_005_humaneval_pass_at_1_threshold_logic` — 1 passing.

Spec `docs/specifications/aprender-train/ship-two-models-spec.md`
  v2.26.0 → v2.27.0 annotates AC-SHIP1-005 + FALSIFY-SHIP-005 rows
  `**(PARTIAL_ALGORITHM_LEVEL v2.27.0)**` and appends amendment entry
  noting 11 PARTIAL + 3 DISCHARGED across both models, MODEL-1 5/10.

Authored self-contained because MODEL-2 SHIP-018 sibling PR has not
yet landed on main. Once it does, the two `verdict_from_pass_at_1_*`
fns should be dedup'd into a single parameterized helper.

Full discharge blocks on: live `apr eval --benchmark humaneval
paiml/qwen2.5-coder-7b-apache-q4k-v1 --json` on RTX 4090 with
--features cuda; median pass@1 across 3 seed=0 runs ≥ 86.00 (or
≥ 84.80 under the 1.2 pp noise allowance).

Tests:
  cargo test -p aprender-core --lib \
    falsify_ship_005_humaneval_pass_at_1_threshold_logic
Contract:
  cargo run --quiet -p aprender-contracts-cli --bin pv -- validate \
    contracts/qwen2-e2e-verification-v1.yaml

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
… run --prompt "def fib(n):"` emits valid Python (zero syntax errors)

6th compute-free MODEL-1 lever. Wires AC-SHIP1-002 (`apr run <model>.safetensors
--prompt "def fib(n):"` emits syntactically valid Python) to a pure
`verdict_from_syntax_error_count(usize) -> Ship002Verdict` const fn.
Brings MODEL-1 AC-SHIP1 coverage to 6/10 touched (SHIP-008 + SHIP-009
+ SHIP-006 + SHIP-007 + SHIP-005 + SHIP-002). Mirrors MODEL-2 SHIP-017
shape but with a tighter zero-tolerance rule.

contracts/qwen2-e2e-verification-v1.yaml v1.2.0 → v1.3.0:
  - Adds FALSIFY-QW2E-SHIP-002 binding
      AC_SHIP1_002_MAX_TOLERATED_SYNTAX_ERRORS = 0
    to `verdict_from_syntax_error_count(usize) -> Ship002Verdict`
    const fn in `crates/aprender-core/src/qa/ship_002.rs`.
  - 6-section mutation survey:
      1. Zero errors → Pass (trivial unanimous-parse case)
      2. Exactly-one error → Fail (boundary; MODEL-1 zero-tolerance)
      3. Many-errors Fail band {2, 10, 100}
      4. Monotonicity sweep 0..=256
      5. usize::MAX sanity guard → Fail
      6. Provenance pin on tolerance constant = 0
  - `ship_blocking: true`, `discharge_status: PARTIAL_ALGORITHM_LEVEL`,
    `full_discharge_blocks_on: live apr run ... + rustpython/ruff
    AST parse with zero errors`; 5 named counter_example_classes
    (regressed_sampling, widened_tolerance, relaxed_rule,
    flipped_inequality, mask_byte_malformed).

crates/aprender-core/src/qa/ship_002.rs (NEW, ~130 lines):
  - `#[must_use] pub const fn verdict_from_syntax_error_count(usize)`
    returns Pass iff `n <= 0` (i.e., `n == 0`).
  - `AC_SHIP1_002_MAX_TOLERATED_SYNTAX_ERRORS: usize = 0` — spec
    §4.2 AC-SHIP1-002 has no noise allowance.
  - `falsify_ship_002_python_syntax_error_threshold_logic` — 1 passing.

Spec `docs/specifications/aprender-train/ship-two-models-spec.md`
  v2.27.0 → v2.28.0 annotates AC-SHIP1-002 + FALSIFY-SHIP-002 rows
  `**(PARTIAL_ALGORITHM_LEVEL v2.28.0)**` and appends amendment entry
  noting 12 PARTIAL + 3 DISCHARGED across both models, MODEL-1 6/10.

Authored self-contained because MODEL-2 SHIP-017 sibling PR #1004
has not yet landed on main. Once it does, the two
`verdict_from_syntax_error_count_*` fns should be dedup'd into a
single parameterized helper `verdict_from_syntax_error_count(errors,
max_allowed) -> SyntaxVerdict` with the per-model tolerance held
externally.

Full discharge blocks on: live `apr run paiml/qwen2.5-coder-7b-apache-q4k-v1
--prompt "def fib(n):"` on RTX 4090 with `--features cuda`;
completion parses cleanly via `rustpython` or `ruff check --select=E`
with zero syntax errors.

Tests:
  cargo test -p aprender-core --lib \
    falsify_ship_002_python_syntax_error_threshold_logic
Contract:
  cargo run --quiet -p aprender-contracts-cli --bin pv -- validate \
    contracts/qwen2-e2e-verification-v1.yaml

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@noahgift
Copy link
Copy Markdown
Contributor Author

Superseded by fresh clean-branch PR from main (c1263178a3) — no stacked SHIP-005/007 dependency. New PR at feat/falsify-ship-002-clean.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant