
feat(falsify-ship-002): MODEL-1 Python syntax PARTIAL discharge (clean branch)#1017

Merged: noahgift merged 2 commits into main from feat/falsify-ship-002-clean on Apr 22, 2026

@noahgift
Contributor

Summary

Binds AC-SHIP1-002 (MODEL-1 teacher apr run emits syntactically valid Python on the canonical def fib(n): prompt) to a zero-tolerance algorithm-level verdict rule via a pure const fn verdict_from_syntax_error_count(usize) -> Ship002Verdict in crates/aprender-core/src/qa/ship_002.rs.

Supersedes #1016 — clean rebuild on fresh branch from main (no stacked SHIP-005/007 dependency).

Deltas

  • Contract: contracts/qwen2-e2e-verification-v1.yaml v1.0.0 → v1.1.0 adds FALSIFY-QW2E-SHIP-002 with discharge_status: PARTIAL_ALGORITHM_LEVEL, evidence_discharged_by (3 symbols), full_discharge_blocks_on (live apr run + rustpython/ruff AST parse on RTX 4090), and 5 counter_example_classes.
  • Spec: docs/specifications/aprender-train/ship-two-models-spec.md v2.25.0 → v2.26.0; AC-SHIP1-002 + FALSIFY-SHIP-002 rows annotated (PARTIAL_ALGORITHM_LEVEL v2.26.0).
  • New module: crates/aprender-core/src/qa/ship_002.rs (156 lines) with const AC_SHIP1_002_MAX_TOLERATED_SYNTAX_ERRORS = 0, Ship002Verdict {Pass, Fail}, and the verdict_from_syntax_error_count const fn.
  • Module registration: crates/aprender-core/src/qa/mod.rs gains pub mod ship_002;.
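
For readers without the diff open, the new module listed above might look roughly like the sketch below. The three names come from this PR; the `usize` type on the constant, the derives, and the doc comments are assumptions, and the real `ship_002.rs` is 156 lines, so this is only the core shape.

```rust
/// Maximum number of Python syntax errors tolerated for AC-SHIP1-002.
/// Zero tolerance: any parse error blocks shipping.
/// (The `usize` type is an assumption; the PR only states the value 0.)
pub const AC_SHIP1_002_MAX_TOLERATED_SYNTAX_ERRORS: usize = 0;

/// Algorithm-level verdict for AC-SHIP1-002.
/// (The derives are an assumption, added so verdicts can be compared.)
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Ship002Verdict {
    Pass,
    Fail,
}

/// Pure verdict rule: Pass only when the error count is within tolerance.
/// The `<=` shape is kept on purpose even though the tolerance is 0,
/// hence the clippy allow described in the follow-up commit.
#[allow(clippy::absurd_extreme_comparisons)]
pub const fn verdict_from_syntax_error_count(syntax_errors: usize) -> Ship002Verdict {
    if syntax_errors <= AC_SHIP1_002_MAX_TOLERATED_SYNTAX_ERRORS {
        Ship002Verdict::Pass
    } else {
        Ship002Verdict::Fail
    }
}
```

Because the function is a `const fn` over a `usize`, the verdict can be evaluated at compile time and has no I/O or model dependency, which is what makes an algorithm-level PARTIAL discharge possible before the live run exists.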

Mutation Survey (6 sections)

  1. zero-errors → Pass (only shipping scenario)
  2. exactly-one-error → Fail (boundary; tighter than MODEL-2 SHIP-017 which tolerates 1/100)
  3. many-errors Fail band {2, 10, 100}
  4. monotonicity sweep 0..=256 (no Fail→Pass flip once Fail observed)
  5. usize::MAX sanity Fail
  6. provenance pin locks tolerance = 0 (spec §4.2)
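
The six sections above could be exercised along the lines of the sketch below. It is not the actual `falsify_ship_002_python_syntax_error_threshold_logic` test: `run_mutation_survey` is a hypothetical name, and the definitions are repeated (with assumed types and derives) only so the block compiles standalone.

```rust
// Definitions repeated from the PR summary; types and derives are assumptions.
pub const AC_SHIP1_002_MAX_TOLERATED_SYNTAX_ERRORS: usize = 0;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Ship002Verdict {
    Pass,
    Fail,
}

#[allow(clippy::absurd_extreme_comparisons)]
pub const fn verdict_from_syntax_error_count(syntax_errors: usize) -> Ship002Verdict {
    if syntax_errors <= AC_SHIP1_002_MAX_TOLERATED_SYNTAX_ERRORS {
        Ship002Verdict::Pass
    } else {
        Ship002Verdict::Fail
    }
}

/// Hypothetical survey runner: panics on the first violated section.
pub fn run_mutation_survey() {
    use Ship002Verdict::{Fail, Pass};

    // 1. zero errors is the only shipping scenario
    assert_eq!(verdict_from_syntax_error_count(0), Pass);

    // 2. boundary: exactly one error already fails
    assert_eq!(verdict_from_syntax_error_count(1), Fail);

    // 3. many-errors Fail band
    for n in [2usize, 10, 100].iter().copied() {
        assert_eq!(verdict_from_syntax_error_count(n), Fail);
    }

    // 4. monotonicity sweep: once Fail is observed it never flips back
    let mut seen_fail = false;
    for n in 0..=256usize {
        let verdict = verdict_from_syntax_error_count(n);
        if seen_fail {
            assert_eq!(verdict, Fail, "Fail -> Pass flip at n = {}", n);
        }
        if verdict == Fail {
            seen_fail = true;
        }
    }

    // 5. usize::MAX sanity
    assert_eq!(verdict_from_syntax_error_count(usize::MAX), Fail);

    // 6. provenance pin: tolerance is locked at 0
    assert_eq!(AC_SHIP1_002_MAX_TOLERATED_SYNTAX_ERRORS, 0);
}
```

Each section is meant to kill a specific mutant: flipping `<=` to `<` breaks section 1, raising the constant breaks sections 2 and 6, and inverting the branches breaks sections 3 through 5.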

Why

4th MODEL-1 PARTIAL discharged from main's baseline (after SHIP-009 + SHIP-008 + SHIP-006). Uniquely tight rule — zero tolerance on the single canonical def fib(n): prompt — because spec §4.2 AC-SHIP1-002 "emits valid Python" carries no noise allowance, unlike MODEL-2 SHIP-017 which tolerates ≤1/100.

Full discharge blocks on

apr run paiml/qwen2.5-coder-7b-apache-q4k-v1.safetensors --prompt "def fib(n):"

on RTX 4090 + rustpython/ruff AST parse of the emitted completion. Any parse error count > 0 is a ship-blocker.
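
As a rough illustration of how the live step could feed the verdict, the sketch below takes the parser as a closure. Everything here except `Ship002Verdict` is hypothetical: `gate_completion` is an invented name, and `toy_paren_check` is a toy stand-in that only checks parenthesis balance. It is not a Python parser and does not represent the rustpython or ruff APIs.

```rust
/// Verdict type named in the PR summary (derives are an assumption).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Ship002Verdict {
    Pass,
    Fail,
}

/// Hypothetical gate: `count_syntax_errors` stands in for the real
/// AST parse of the emitted completion, which this PR does not include.
pub fn gate_completion<P>(completion: &str, count_syntax_errors: P) -> Ship002Verdict
where
    P: Fn(&str) -> usize,
{
    // Any parse error count > 0 is a ship-blocker.
    if count_syntax_errors(completion) > 0 {
        Ship002Verdict::Fail
    } else {
        Ship002Verdict::Pass
    }
}

/// Toy stand-in for illustration only: reports one "error" when
/// parentheses are unbalanced, zero otherwise.
pub fn toy_paren_check(source: &str) -> usize {
    let mut depth: i64 = 0;
    for ch in source.chars() {
        match ch {
            '(' => depth += 1,
            ')' => {
                depth -= 1;
                if depth < 0 {
                    return 1; // closing paren with no matching opener
                }
            }
            _ => {}
        }
    }
    if depth == 0 {
        0
    } else {
        1 // at least one paren left unclosed
    }
}
```

Keeping the parser behind a closure means the verdict rule stays pure and testable today, while the real parser can be swapped in once the RTX 4090 run is available.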

Test plan

  • cargo fmt -p aprender-core -- --check — exit 0
  • cargo test -p aprender-core --lib falsify_ship_002_python_syntax_error_threshold_logic — green (1 passed, 0 failed)
  • cargo run --quiet -p aprender-contracts-cli --bin pv -- validate contracts/qwen2-e2e-verification-v1.yaml → 0 error(s), 0 warning(s). Contract is valid.
  • CI: ci / gate + workspace-test

Refs: task #159

🤖 Generated with Claude Code

Bind AC-SHIP1-002 ("teacher emits syntactically valid Python on
canonical `def fib(n):` prompt") to a zero-tolerance algorithm-level
verdict rule via a pure `const fn verdict_from_syntax_error_count(
usize) -> Ship002Verdict` in `crates/aprender-core/src/qa/ship_002.rs`.

Contract delta — `contracts/qwen2-e2e-verification-v1.yaml`
v1.0.0 → v1.1.0 adds `FALSIFY-QW2E-SHIP-002` with
`discharge_status: PARTIAL_ALGORITHM_LEVEL`,
`evidence_discharged_by` (const + fn + test), and
`full_discharge_blocks_on` (live `apr run` + `rustpython`/`ruff`
AST parse on RTX 4090).

Spec delta — `docs/specifications/aprender-train/ship-two-models-spec.md`
v2.25.0 → v2.26.0; AC-SHIP1-002 + FALSIFY-SHIP-002 rows annotated
`(PARTIAL_ALGORITHM_LEVEL v2.26.0)`.

Mutation survey — 6 sections (single test) exercised by
`falsify_ship_002_python_syntax_error_threshold_logic`:
1. zero-errors → Pass (only shipping scenario)
2. exactly-one-error → Fail (boundary; tighter than MODEL-2
   SHIP-017 which tolerates 1/100)
3. many-errors Fail band {2, 10, 100}
4. monotonicity sweep 0..=256 (no Fail→Pass flip)
5. `usize::MAX` sanity Fail
6. provenance pin locks tolerance = 0 (spec §4.2)

Why: 4th MODEL-1 PARTIAL touched from main's baseline (SHIP-009 +
SHIP-008 + SHIP-006 + now SHIP-002). Uniquely tight rule — zero
tolerance on the single canonical `def fib(n):` prompt — because
spec §4.2 AC-SHIP1-002 "emits valid Python" carries no noise
allowance, unlike MODEL-2 SHIP-017 which tolerates ≤1 error across
100 held-out prompts.

Full discharge blocks on live `apr run
paiml/qwen2.5-coder-7b-apache-q4k-v1.safetensors --prompt
"def fib(n):"` on RTX 4090 + `rustpython`/`ruff` AST parse of the
emitted completion.

Refs: task #159

Verify:
  cargo test -p aprender-core --lib \
    falsify_ship_002_python_syntax_error_threshold_logic
  cargo run --quiet -p aprender-contracts-cli --bin pv -- \
    validate contracts/qwen2-e2e-verification-v1.yaml

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@noahgift noahgift force-pushed the feat/falsify-ship-002-clean branch from 6bbf1ce to 9d39857 Compare April 22, 2026 20:27
…rance <=

Clippy flags `syntax_errors <= AC_SHIP1_002_MAX_TOLERATED_SYNTAX_ERRORS`
because the constant is `0` (unsigned minimum), so `<=` is vacuously
equivalent to `==`. CI ci/lint went red on this.

Keep the `<=` shape intentionally — it mirrors MODEL-2 SHIP-017's
`verdict_from_syntax_error_count` (tolerance = 1, where `<=` is
non-vacuous) so the two can be deduplicated into a single
parameterized helper once both PRs land. Add
`#[allow(clippy::absurd_extreme_comparisons)]` with an inline
justification block above the attribute.

Verify: cargo clippy -p aprender-core --lib -- -D warnings → clean.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
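
The deduplication mentioned in the commit above might eventually take a shape like the sketch below. This is speculative: `SyntaxVerdict`, `verdict_with_tolerance`, and the two tolerance constants are hypothetical names, and only the tolerance values (0 for MODEL-1 SHIP-002, 1 for MODEL-2 SHIP-017) come from this PR.

```rust
/// Hypothetical shared verdict type for both models.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SyntaxVerdict {
    Pass,
    Fail,
}

/// Tolerances as stated in this PR; the constant names are hypothetical.
pub const MODEL1_SHIP_002_TOLERANCE: usize = 0;
pub const MODEL2_SHIP_017_TOLERANCE: usize = 1;

/// Hypothetical parameterized helper: the same `<=` shape serves both
/// rules, with the tolerance passed in instead of baked into the body.
pub const fn verdict_with_tolerance(syntax_errors: usize, max_tolerated: usize) -> SyntaxVerdict {
    if syntax_errors <= max_tolerated {
        SyntaxVerdict::Pass
    } else {
        SyntaxVerdict::Fail
    }
}
```

With the tolerance as a runtime parameter, the comparison is no longer against a literal unsigned minimum, so the `#[allow(clippy::absurd_extreme_comparisons)]` attribute would likely become unnecessary in the shared helper.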
@noahgift noahgift merged commit f615148 into main Apr 22, 2026
10 checks passed
@noahgift noahgift deleted the feat/falsify-ship-002-clean branch April 22, 2026 21:10