feat(falsify-ship-002): MODEL-1 Python syntax PARTIAL discharge (clean branch) #1017
Merged
Bind AC-SHIP1-002 ("teacher emits syntactically valid Python on
canonical `def fib(n):` prompt") to a zero-tolerance algorithm-level
verdict rule via a pure `const fn verdict_from_syntax_error_count(
usize) -> Ship002Verdict` in `crates/aprender-core/src/qa/ship_002.rs`.
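A minimal sketch of what such a verdict rule could look like follows. The identifiers `AC_SHIP1_002_MAX_TOLERATED_SYNTAX_ERRORS`, `Ship002Verdict`, and `verdict_from_syntax_error_count` come from this PR; the derives, the constant's type, and the body are assumptions, not the contents of `ship_002.rs`.

```rust
/// Zero tolerance on the canonical prompt (value from this PR; the type is an assumption).
pub const AC_SHIP1_002_MAX_TOLERATED_SYNTAX_ERRORS: usize = 0;

/// Two-state verdict. The derives are assumed so the type can be compared in tests.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Ship002Verdict {
    Pass,
    Fail,
}

/// Pure mapping from a syntax-error count to a verdict.
/// With the bound at 0, `<=` behaves like `==`, which Clippy reports;
/// the `<=` shape is kept on purpose, hence the allow.
#[allow(clippy::absurd_extreme_comparisons)]
pub const fn verdict_from_syntax_error_count(syntax_errors: usize) -> Ship002Verdict {
    if syntax_errors <= AC_SHIP1_002_MAX_TOLERATED_SYNTAX_ERRORS {
        Ship002Verdict::Pass
    } else {
        Ship002Verdict::Fail
    }
}

// Because the function is `const`, the zero-error case can also be checked at compile time.
const _: () = assert!(matches!(
    verdict_from_syntax_error_count(0),
    Ship002Verdict::Pass
));
```

Keeping the rule a `const fn` with no I/O is what lets it be exercised without the live model run.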
Contract delta — `contracts/qwen2-e2e-verification-v1.yaml`
v1.0.0 → v1.1.0 adds `FALSIFY-QW2E-SHIP-002` with
`discharge_status: PARTIAL_ALGORITHM_LEVEL`,
`evidence_discharged_by` (const + fn + test), and
`full_discharge_blocks_on` (live `apr run` + `rustpython`/`ruff`
AST parse on RTX 4090).
Spec delta — `docs/specifications/aprender-train/ship-two-models-spec.md`
v2.25.0 → v2.26.0; AC-SHIP1-002 + FALSIFY-SHIP-002 rows annotated
`(PARTIAL_ALGORITHM_LEVEL v2.26.0)`.
Mutation survey — 6 sections (single test) exercised by
`falsify_ship_002_python_syntax_error_threshold_logic`:
1. zero-errors → Pass (only shipping scenario)
2. exactly-one-error → Fail (boundary; tighter than MODEL-2
SHIP-017 which tolerates 1/100)
3. many-errors Fail band {2, 10, 100}
4. monotonicity sweep 0..=256 (no Fail→Pass flip)
5. `usize::MAX` sanity Fail
6. provenance pin locks tolerance = 0 (spec §4.2)
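The six sections above could be laid out roughly as in the sketch below. It is an illustration only: the function name `mutation_survey_sketch` is hypothetical (in the crate this would presumably be the `#[test]` named above), and the definitions are repeated in assumed form so the block compiles on its own.

```rust
pub const AC_SHIP1_002_MAX_TOLERATED_SYNTAX_ERRORS: usize = 0;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Ship002Verdict {
    Pass,
    Fail,
}

#[allow(clippy::absurd_extreme_comparisons)]
pub const fn verdict_from_syntax_error_count(syntax_errors: usize) -> Ship002Verdict {
    if syntax_errors <= AC_SHIP1_002_MAX_TOLERATED_SYNTAX_ERRORS {
        Ship002Verdict::Pass
    } else {
        Ship002Verdict::Fail
    }
}

/// Hypothetical stand-in for the single test; panics on the first violated section.
pub fn mutation_survey_sketch() {
    // 1. zero errors is the only passing input
    assert_eq!(verdict_from_syntax_error_count(0), Ship002Verdict::Pass);

    // 2. exactly one error already fails (boundary)
    assert_eq!(verdict_from_syntax_error_count(1), Ship002Verdict::Fail);

    // 3. many-errors band
    for n in [2usize, 10, 100] {
        assert_eq!(verdict_from_syntax_error_count(n), Ship002Verdict::Fail);
    }

    // 4. monotonicity: once Fail is observed, no larger count may Pass
    let mut seen_fail = false;
    for n in 0..=256usize {
        match verdict_from_syntax_error_count(n) {
            Ship002Verdict::Fail => seen_fail = true,
            Ship002Verdict::Pass => assert!(!seen_fail, "Fail to Pass flip at {}", n),
        }
    }

    // 5. extreme input
    assert_eq!(verdict_from_syntax_error_count(usize::MAX), Ship002Verdict::Fail);

    // 6. provenance pin: changing the tolerance must break this assertion
    assert_eq!(AC_SHIP1_002_MAX_TOLERATED_SYNTAX_ERRORS, 0);
}
```

Sections 2 and 6 are the ones that would catch a mutation of the constant from 0 to 1; the sweep in section 4 guards against an inverted comparison.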
Why: 4th MODEL-1 PARTIAL touched from main's baseline (SHIP-009 +
SHIP-008 + SHIP-006 + now SHIP-002). Uniquely tight rule — zero
tolerance on the single canonical `def fib(n):` prompt — because
spec §4.2 AC-SHIP1-002 "emits valid Python" carries no noise
allowance, unlike MODEL-2 SHIP-017 which tolerates ≤1 error across
100 held-out prompts.
Full discharge blocks on live `apr run
paiml/qwen2.5-coder-7b-apache-q4k-v1.safetensors --prompt
"def fib(n):"` on RTX 4090 + `rustpython`/`ruff` AST parse of the
emitted completion.
Refs: task #159
Verify:
cargo test -p aprender-core --lib \
falsify_ship_002_python_syntax_error_threshold_logic
cargo run --quiet -p aprender-contracts-cli --bin pv -- \
validate contracts/qwen2-e2e-verification-v1.yaml
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
6bbf1ce to 9d39857
…rance <=
Clippy flags `syntax_errors <= AC_SHIP1_002_MAX_TOLERATED_SYNTAX_ERRORS` because the constant is `0` (unsigned minimum), so `<=` is vacuously equivalent to `==`. CI ci/lint went red on this.
Keep the `<=` shape intentionally: it mirrors MODEL-2 SHIP-017's `verdict_from_syntax_error_count` (tolerance = 1, where `<=` is non-vacuous), so the two can be deduplicated into a single parameterized helper once both PRs land.
Add `#[allow(clippy::absurd_extreme_comparisons)]` with an inline justification block above the attribute.
Verify:
cargo clippy -p aprender-core --lib -- -D warnings → clean.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
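The deduplication mentioned above might eventually take a shape like the following sketch. Every name here (`SyntaxVerdict`, `verdict_with_tolerance`, the two constants and wrappers) is hypothetical; only the tolerances, 0 for SHIP-002 and 1 for SHIP-017, are taken from this PR.

```rust
/// Hypothetical shared verdict type for both models.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SyntaxVerdict {
    Pass,
    Fail,
}

/// Hypothetical parameterized helper: the same `<=` shape for any tolerance.
pub const fn verdict_with_tolerance(syntax_errors: usize, max_tolerated: usize) -> SyntaxVerdict {
    if syntax_errors <= max_tolerated {
        SyntaxVerdict::Pass
    } else {
        SyntaxVerdict::Fail
    }
}

/// MODEL-1 SHIP-002: zero tolerance on the single canonical prompt.
pub const SHIP_002_MAX_TOLERATED: usize = 0;
/// MODEL-2 SHIP-017: at most one error across the held-out prompts.
pub const SHIP_017_MAX_TOLERATED: usize = 1;

/// Thin wrapper binding the SHIP-002 tolerance.
pub const fn ship_002_verdict(syntax_errors: usize) -> SyntaxVerdict {
    verdict_with_tolerance(syntax_errors, SHIP_002_MAX_TOLERATED)
}

/// Thin wrapper binding the SHIP-017 tolerance.
pub const fn ship_017_verdict(syntax_errors: usize) -> SyntaxVerdict {
    verdict_with_tolerance(syntax_errors, SHIP_017_MAX_TOLERATED)
}
```

With one error, `ship_002_verdict(1)` fails while `ship_017_verdict(1)` passes, which is the difference in tightness that the two rules are meant to encode.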
Summary
Binds AC-SHIP1-002 (MODEL-1 teacher `apr run` emits syntactically valid Python on the canonical `def fib(n):` prompt) to a zero-tolerance algorithm-level verdict rule via a pure `const fn verdict_from_syntax_error_count(usize) -> Ship002Verdict` in `crates/aprender-core/src/qa/ship_002.rs`.
Supersedes #1016: clean rebuild on a fresh branch from main (no stacked SHIP-005/007 dependency).
Deltas
- `contracts/qwen2-e2e-verification-v1.yaml` v1.0.0 → v1.1.0: adds `FALSIFY-QW2E-SHIP-002` with `discharge_status: PARTIAL_ALGORITHM_LEVEL`, `evidence_discharged_by` (3 symbols), `full_discharge_blocks_on` (live `apr run` + `rustpython`/`ruff` AST parse on RTX 4090), and 5 `counter_example_classes`.
- `docs/specifications/aprender-train/ship-two-models-spec.md` v2.25.0 → v2.26.0: AC-SHIP1-002 + FALSIFY-SHIP-002 rows annotated `(PARTIAL_ALGORITHM_LEVEL v2.26.0)`.
- `crates/aprender-core/src/qa/ship_002.rs` (156 lines): const `AC_SHIP1_002_MAX_TOLERATED_SYNTAX_ERRORS = 0`, `Ship002Verdict {Pass, Fail}`, `verdict_from_syntax_error_count` const fn.
- `crates/aprender-core/src/qa/mod.rs` gains `pub mod ship_002;`.
Mutation Survey (6 sections)
- `0..=256` (no Fail→Pass flip once Fail observed)
- `usize::MAX` sanity Fail
Why
4th MODEL-1 PARTIAL discharged from main's baseline (after SHIP-009 + SHIP-008 + SHIP-006). Uniquely tight rule: zero tolerance on the single canonical `def fib(n):` prompt, because spec §4.2 AC-SHIP1-002 "emits valid Python" carries no noise allowance, unlike MODEL-2 SHIP-017 which tolerates ≤1/100.
Full discharge blocks on
Live `apr run` on RTX 4090 + `rustpython`/`ruff` AST parse of the emitted completion. Any parse error count > 0 is a ship-blocker.
Test plan
- `cargo fmt -p aprender-core -- --check`: exit 0
- `cargo test -p aprender-core --lib falsify_ship_002_python_syntax_error_threshold_logic`: green (1 passed, 0 failed)
- `cargo run --quiet -p aprender-contracts-cli --bin pv -- validate contracts/qwen2-e2e-verification-v1.yaml`: `0 error(s), 0 warning(s). Contract is valid.`
- `ci / gate` + `workspace-test`
Refs: task #159
🤖 Generated with Claude Code