Critical modules lack Python tests despite Dafny companions

## Problem

Four modules have zero Python test coverage:

| Module | Functions | Dafny Companion | Risk |
|--------|-----------|-----------------|------|
| `state.py` | 9 (make_state, start_phase, complete_phase, fail_phase, transition_to_next, feedback_loop, retry_phase, diagnostic_transition, is_review_complete) | `spec/state.dfy`, `lib/verified/state.dfy` | **High** — incorrect transitions corrupt review state |
| `saturation.py` | 4 (discovery_saturation, conceptual_saturation, should_terminate_discovery, should_feedback_loop) | `spec/saturation.dfy`, `lib/verified/saturation.dfy` | **High** — incorrect thresholds cause premature termination or infinite loops |
| `metrics.py` | 8 (count_excluded, count_flagged, count_included, recompute_all, etc.) | `lib/verified/metrics.dfy` | **Medium** — incorrect counts reported in state.json |
| `cli.py` | 21 (all CLI commands, JSONL readers, workspace loader) | None | **Medium** — I/O boundary, parse failures, missing file handling |

The Dafny companions prove properties about the *specification*, not the Python implementation. A conformance gap between the two is possible and would be invisible without Python-level tests.

## What to Test

**state.py** — Mirror the Dafny lemmas:
- Forward progress: complete phase N, transition to N+1
- No skipping: cannot complete phase N without N-1
- Retry bound: at most 1 retry
- Feedback bound: feedback_iterations never exceeds max
- Invalid transitions raise ValueError

**saturation.py** — Mirror the Dafny lemmas:
- Zero denominator returns Fraction(0, 1)
- Saturation range [0, 1]
- Threshold boundary: saturation == threshold exactly
- should_feedback_loop dual condition

**metrics.py** — Mirror the Dafny lemmas:
- Accounting identity: excluded + flagged + included == total screened
- Non-negative counts
- Idempotency: same inputs produce same output
- Empty inputs return zero counts

**cli.py** — I/O boundary tests:
- JSONL parsing (valid, empty, malformed)
- Missing workspace directory
- Missing individual data files (graceful degradation)

## Context

Identified during panel review (2026-03-23). Dijkstra (Formal Methods) panelist flagged as non-blocking concern. The 85 existing tests cover validators, postconditions, oracle contracts, preprocessing, PRISMA, export, and section parsing well — the gap is specifically in state machine, saturation, metrics, and CLI.

Module	Functions	Dafny Companion	Risk
`state.py`	9 (make_state, start_phase, complete_phase, fail_phase, transition_to_next, feedback_loop, retry_phase, diagnostic_transition, is_review_complete)	`spec/state.dfy`, `lib/verified/state.dfy`	High — incorrect transitions corrupt review state
`saturation.py`	4 (discovery_saturation, conceptual_saturation, should_terminate_discovery, should_feedback_loop)	`spec/saturation.dfy`, `lib/verified/saturation.dfy`	High — incorrect thresholds cause premature termination or infinite loops
`metrics.py`	8 (count_excluded, count_flagged, count_included, recompute_all, etc.)	`lib/verified/metrics.dfy`	Medium — incorrect counts reported in state.json
`cli.py`	21 (all CLI commands, JSONL readers, workspace loader)	None	Medium — I/O boundary, parse failures, missing file handling

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Critical modules lack Python tests despite Dafny companions #2

Problem

What to Test

Context

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Critical modules lack Python tests despite Dafny companions #2

Description

Problem

What to Test

Context

Metadata

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Issue actions