Sprint 07 Integration and property tests
This page was migrated from the paxman repository's `docs/sprints/` folder as part of the Sprint 11 repo spring clean. The original git history is preserved in the paxman repo (commit 3121eb2 and earlier).
Duration: 2 weeks

Goal: Build the full test pyramid (property tests, integration tests, end-to-end fixtures, golden artifacts) and ship the `paxman.testing` module (Hypothesis strategies). End of sprint: the test suite proves Paxman's determinism and replay claims with high confidence.

Status: This is the sprint where V1's quality bar is met: 90% coverage on `contract/`, `planner/`, `executor/`, `reconciler/`; 95% on `artifact/`; 100% on `errors.py` and `versioning.py`. Property tests for determinism pass.

- `paxman.testing` module: public Hypothesis strategies (`contracts()`, `inputs()`, `budgets()`, `policies()`, `registries()`, `candidate_sets()`, `artifacts()`)
- `tests/fixtures/contracts/`: fill in remaining planned contracts (Pydantic, JSON Schema, Dict DSL, OpenAPI)
- `tests/fixtures/inputs/{invoices,receipts,quotations,procurement,multi_page,adversarial}/`: fully vendored
- `tests/fixtures/artifacts/`: ≥5 golden `ExecutionArtifact` JSON files (bootstrapped from real runs, per `tests/fixtures/README.md`)
- `tests/fixtures/generated/`: `factory_boy` + `faker` factories (Layer 2 programmatic fixtures)
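
To make the idea of a public strategy concrete, here is a minimal sketch of what a `contracts()` strategy could look like. It is not Paxman's implementation: the dictionary-shaped contract, the `FIELD_TYPES` tuple (apart from `MONEY`, which this plan mentions) and the `is_valid_contract` helper are all assumptions made for illustration. The point it shows is that generated values are validated before they are yielded.

```python
from hypothesis import given, settings, strategies as st

# Hypothetical field types; the real set lives in Paxman's contract model.
FIELD_TYPES = ("STRING", "INTEGER", "DATE", "MONEY")


def is_valid_contract(contract: dict) -> bool:
    """Stand-in validator: at least one field, and every field type is known."""
    fields = contract.get("fields", {})
    return bool(fields) and all(t in FIELD_TYPES for t in fields.values())


def contracts() -> st.SearchStrategy:
    """Sketch of a strategy that only ever yields valid (toy) contracts."""
    names = st.text(alphabet="abcdefghijklmnopqrstuvwxyz_", min_size=1, max_size=12)
    fields = st.dictionaries(names, st.sampled_from(FIELD_TYPES), min_size=1, max_size=6)
    raw = st.fixed_dictionaries({"name": names, "fields": fields})
    # Validate before yielding, so downstream tests never see a malformed contract.
    return raw.filter(is_valid_contract)


@settings(max_examples=100, derandomize=True)
@given(contracts())
def test_contracts_strategy_yields_valid_contracts(contract: dict) -> None:
    assert is_valid_contract(contract)
```

Calling `test_contracts_strategy_yields_valid_contracts()` directly (or collecting it with pytest) runs the 100 examples. A factory-backed strategy would presumably swap the `raw` strategy for one that draws from the `factory_boy` factories, while keeping the same validation step.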

- `determinism/test_planner_is_deterministic.py`: same inputs → same plan (Hypothesis, 100 examples)
- `determinism/test_executor_is_deterministic.py`: same plan → same candidates
- `determinism/test_reconciler_is_deterministic.py`: same candidates → same resolved
- `replay/test_replay_is_byte_equal.py`: replay reproduces artifact byte-for-byte
- `replay/test_hash_detects_modification.py`: any change → hash changes
- `reconciler/test_reconciler_is_monotonic.py`: better evidence → higher confidence
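
The shape of two of these properties can be sketched with a toy planner. Everything below is a stand-in: `plan()` and `artifact_hash()` are hypothetical helpers, not Paxman functions, and the canonical-JSON-plus-SHA-256 hash is only one plausible way to get a stable digest. What the sketch shows is the pattern "same logical input, same output" and "any change, different hash", each run for 100 derandomized Hypothesis examples.

```python
import hashlib
import json

from hypothesis import given, settings, strategies as st


def plan(contract: dict) -> list:
    """Toy planner: one step per field, ordered by field name."""
    return [{"step": i, "field": name, "type": contract[name]}
            for i, name in enumerate(sorted(contract))]


def artifact_hash(artifact) -> str:
    """SHA-256 over a canonical JSON encoding (sorted keys, fixed separators)."""
    payload = json.dumps(artifact, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


# Toy contracts: field name -> field type.
toy_contracts = st.dictionaries(
    st.text(min_size=1, max_size=8),
    st.sampled_from(["STRING", "INTEGER", "MONEY"]),
    min_size=1,
    max_size=5,
)


@settings(max_examples=100, derandomize=True)
@given(toy_contracts)
def test_planner_is_deterministic(contract):
    # Rebuild the dict in reverse insertion order: the plan must not depend on it.
    reordered = dict(reversed(list(contract.items())))
    assert plan(contract) == plan(reordered)
    assert artifact_hash(plan(contract)) == artifact_hash(plan(reordered))


@settings(max_examples=100, derandomize=True)
@given(toy_contracts)
def test_hash_detects_modification(contract):
    original = plan(contract)
    tampered = [dict(step) for step in original]
    tampered[0]["type"] = tampered[0]["type"] + "_X"  # change a single value
    assert artifact_hash(original) != artifact_hash(tampered)
```

The real tests would draw from the `paxman.testing` strategies instead of `toy_contracts` and call the actual planner, executor and reconciler.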

- `end_to_end/test_invoice_pipeline.py`: full pipeline on a Pydantic invoice contract
- `end_to_end/test_quotation_pipeline.py`: quotation with MONEY
- `end_to_end/test_adversarial_inputs.py`: empty, unicode, prompt injection, mismatched currency
- `cross_subsystem/test_planner_executor_integration.py`
- `cross_subsystem/test_executor_reconciler_integration.py`

- `pytest-cov` configured to fail under 90% on `contract/`, `planner/`, `executor/`, `reconciler/`
- `pytest-cov` configured to fail under 95% on `artifact/`
- `pytest-cov` configured to fail under 100% on `errors.py` and `versioning.py`
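
Since `--cov-fail-under` takes a single overall percentage, per-subsystem thresholds probably need a small post-processing step. The sketch below is one possible approach, not the project's actual configuration: it reads the JSON report that coverage.py can emit (`coverage json`) and checks each path fragment against its own minimum. The `check_thresholds` and `load_report` helpers and the fragment-matching rule are assumptions.

```python
import json
from collections import defaultdict

# Thresholds from this sprint's quality bar, keyed by path fragment.
THRESHOLDS = {
    "contract/": 90.0, "planner/": 90.0, "executor/": 90.0, "reconciler/": 90.0,
    "artifact/": 95.0, "errors.py": 100.0, "versioning.py": 100.0,
}


def load_report(path: str) -> dict:
    """Load a coverage.py JSON report from disk."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def _matches(path: str, fragment: str) -> bool:
    """Directory fragments match anywhere in the path; file fragments match the tail."""
    normalized = "/" + path.replace("\\", "/")
    if fragment.endswith("/"):
        return "/" + fragment in normalized
    return normalized.endswith("/" + fragment)


def check_thresholds(report: dict, thresholds: dict) -> list:
    """Return one failure message per fragment whose line coverage is too low."""
    covered = defaultdict(int)
    total = defaultdict(int)
    for path, data in report["files"].items():
        for fragment in thresholds:
            if _matches(path, fragment):
                covered[fragment] += data["summary"]["covered_lines"]
                total[fragment] += data["summary"]["num_statements"]
    failures = []
    for fragment, minimum in thresholds.items():
        if total[fragment] == 0:
            failures.append(f"{fragment}: no measured files")
            continue
        percent = 100.0 * covered[fragment] / total[fragment]
        if percent < minimum:
            failures.append(f"{fragment}: {percent:.1f}% < {minimum:.0f}%")
    return failures
```

A `make test-cov` target could then call `check_thresholds(load_report("coverage.json"), THRESHOLDS)` and exit non-zero when the returned list is not empty. Note that a file fragment such as `errors.py` would also match a nested `contract/errors.py`; a real script would likely use full package paths.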

- `make test-unit`, `make test-property`, `make test-integration`, `make test-cov` all green
- CI matrix runs all 4 test categories

Deferred to later sprints or releases:

- Documentation beyond what's needed for tests (Sprint 8).
- Performance optimization (Sprint 9).
- External user validation (Sprint 10).
- Mypy/pyright cross-validation (Sprint 8).
- Mutation testing (V2).

| ID | Deliverable | Effort (id-ed) |
|---|---|---|
| D7.1 | `paxman.testing` module: 7 public strategies | 3.0 |
| D7.2 | Fixture contracts: complete `pydantic/` (10 files), `json_schema/` (10 files + drafts), `dict_dsl/` (6 files), `openapi/` (3 files) | 4.0 |
| D7.3 | `tests/fixtures/artifacts/`: ≥5 golden `ExecutionArtifact` JSON files | 3.0 |
| D7.4 | `tests/fixtures/generated/`: `factory_boy` + `faker` factories (5 files) | 3.0 |
| D7.5 | Property tests: planner determinism | 1.0 |
| D7.6 | Property tests: executor determinism | 1.0 |
| D7.7 | Property tests: reconciler determinism | 1.0 |
| D7.8 | Property tests: replay byte-equal | 1.0 |
| D7.9 | Property tests: hash modification detection | 0.5 |
| D7.10 | Property tests: reconciler monotonicity | 1.0 |
| D7.11 | Integration test: invoice pipeline (end-to-end) | 1.5 |
| D7.12 | Integration test: quotation pipeline with MONEY | 1.5 |
| D7.13 | Integration test: adversarial inputs | 1.5 |
| D7.14 | Integration test: cross-subsystem | 1.0 |
| D7.15 | `pytest-cov` configuration: per-subsystem thresholds | 0.5 |
| D7.16 | Replay golden test: full pipeline reproducibility (subprocess + same hash) | 1.0 |
| D7.17 | Update Makefile to add `make test-property` and `make test-integration` targets | 0.3 |
| D7.18 | CI workflow: add separate jobs for unit, property, integration | 0.5 |

Total: ~26.3 id-ed. Sized for 3 engineers × 2 weeks (1 on testing infrastructure, 1 on fixtures + goldens, 1 on property + integration tests).

| Type | Item | Notes |
|---|---|---|
| People | 3 engineers (1 senior, 2 mid-level) | Senior needed for golden artifacts |
| Tools | `hypothesis` (already dev), `factory_boy`, `faker` (new dev deps) | Install this sprint |
| Tests | All Sprint 1-6 deliverables | Done |
| External | Network access to vendored datasets (HuggingFace, GitHub) | First time we need this in CI |
| Storage | ~50 MB disk space on dev machines for the vendored corpus | Already in Sprint 5 |

| Tool | Version | Purpose | Notes |
|---|---|---|---|
| `hypothesis` | ≥ 6.0 | Property-based tests | Already dev dep |
| `factory_boy` | latest | Programmatic fixture generation (Layer 2) | New dev dep |
| `faker` | latest | Synthetic data generation | New dev dep |
| `pytest-subtests` | latest | Subtests for parameterized golden artifact checks | Optional |

None.

- `paxman.testing` exposes 7 public strategies: `contracts()`, `inputs()`, `budgets()`, `policies()`, `registries()`, `candidate_sets()`, `artifacts()`.
- ≥5 golden `ExecutionArtifact` JSON files in `tests/fixtures/artifacts/`, bootstrapped from real runs (not predicted).
- Every property test passes 100 Hypothesis examples without failure.
- `make test-cov` reports:
  - `contract/` ≥ 90%
  - `planner/` ≥ 90%
  - `executor/` ≥ 90%
  - `reconciler/` ≥ 90%
  - `artifact/` ≥ 95%
  - `errors.py` = 100%
  - `versioning.py` = 100%
  - Overall ≥ 90%
- Adversarial inputs (empty, unicode, prompt injection, mismatched currency) all return an `ExecutionArtifact` with `UNRESOLVED` or `PARTIAL_SUCCESS` status, never a crash.
- Replay reproducibility test: a real fixture's artifact, when replayed, produces the same `replay_hash` across two separate Python invocations.
- CI runs `make test-unit`, `make test-property`, `make test-integration` as separate jobs; all green.
- `tests/fixtures/artifacts/*.json` are deterministic (the same fixture run twice produces byte-equal JSON).
- `make ci` is green.
- The `paxman.testing` strategies are importable: `from paxman.testing import contracts, inputs, budgets, policies, registries` works.
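
The "two separate Python invocations" criterion can be checked with a subprocess-based test along the following lines. This is a sketch under assumptions: the child program builds a toy artifact rather than calling Paxman, and `hash_in_fresh_interpreter` is a hypothetical helper. Varying `PYTHONHASHSEED` between runs is a cheap way to surface accidental dependence on set or hash ordering.

```python
import os
import subprocess
import sys

# Child program: builds a small artifact from a set (whose iteration order can
# vary with PYTHONHASHSEED), canonicalizes it, and prints a SHA-256 digest.
CHILD = r'''
import hashlib, json
fields = {"total", "currency", "vendor", "date"}
artifact = {"status": "PARTIAL_SUCCESS", "fields": sorted(fields)}
payload = json.dumps(artifact, sort_keys=True, separators=(",", ":"))
print(hashlib.sha256(payload.encode("utf-8")).hexdigest())
'''


def hash_in_fresh_interpreter(seed: str) -> str:
    """Run CHILD in a new interpreter with the given hash seed; return its digest."""
    env = dict(os.environ, PYTHONHASHSEED=seed)
    result = subprocess.run(
        [sys.executable, "-c", CHILD],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


def test_replay_hash_is_stable_across_invocations() -> None:
    assert hash_in_fresh_interpreter("1") == hash_in_fresh_interpreter("2")
```

Removing the `sorted(...)` call in the child is a quick way to see the test do its job: the digest may then differ between seeds.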

| Risk | Likelihood | Impact | Mitigation |
|---|---|---|---|
| Golden artifacts are "predicted" instead of bootstrapped from real runs | Medium | High | Code review must reject any golden artifact PR that does not include a script or command showing how it was generated from a real `paxman.normalize()` call. Add a `tests/fixtures/artifacts/GENERATION.md` explaining how each golden was produced. |
| A property test finds a real bug (e.g., planner is non-deterministic) | Medium | High | The Hypothesis output gives a minimal counterexample. Fix the bug, add the counterexample as an `@example`, ensure the fix is correct, then re-run with 1000 examples. |
| Hypothesis `derandomize=True` produces a flaky test (e.g., due to time-dependent code) | Low | High | Set a 100 ms deadline (`@settings(deadline=100)`) to catch accidental non-determinism. Add `assume()` to filter out impossible cases. |
| The vendored corpus inflates the dev environment | Low | Low | Add `make test-data-vendor` as a separate step. CI uses `--verify` only. |
| `factory_boy` + `faker` factories produce invalid contracts (e.g., unknown field type) | Medium | Medium | Validate every factory-generated contract before yielding it. The `contracts()` strategy wraps the factory with validation. |
| The integration tests run too slowly (>2 min) | Low | Low | Use `@pytest.mark.slow` and split into a separate CI job. |
| Golden artifacts have a `replay_hash` that is not stable across Paxman versions | Low | High | Golden artifacts are pinned to a specific `paxman_version` in their JSON. Replay checks the version. If the version changes, the golden is regenerated as part of a release checklist. |
| `tests/public_api/test_public_api.py` snapshot is too strict and blocks the `paxman.testing` module addition | Low | Low | Sprint 7 does not add `paxman.testing` re-exports to `__init__.py`: the strategies are accessed via `from paxman.testing import ...`, not `from paxman import ...`. The public surface remains unchanged. |
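
The version-pinning mitigation for golden artifacts can be illustrated with a small guard that runs before any replay comparison. This is only a sketch: the `paxman_version` key comes from the table above, but `check_golden` and its three return values are hypothetical.

```python
import json


def check_golden(golden_text: str, current_version: str) -> str:
    """Decide what to do with a golden artifact before replaying it.

    Returns "replay" when the pinned version matches, a "regenerate: ..."
    message when it does not, and an "invalid: ..." message when the
    golden carries no version pin at all.
    """
    golden = json.loads(golden_text)
    pinned = golden.get("paxman_version")
    if pinned is None:
        return "invalid: golden has no paxman_version"
    if pinned != current_version:
        return f"regenerate: pinned to {pinned}, running {current_version}"
    return "replay"
```

A replay test would compare hashes only when the result is `"replay"`, and fail with the returned message otherwise, so that a version bump produces a clear "regenerate the golden" signal instead of an unexplained hash mismatch.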

- `../V1_ACCEPTANCE_CRITERIA.md` §1.5, §2.2 (coverage), §2.4.
- `../TESTING_STRATEGY.md` §3 (property tests), §8 (E2E fixtures), §9 (coverage).
- `../REPLAY_AND_DETERMINISM.md` §5, §6.
- `../tests/fixtures/README.md`: 5-layer test data model.
- `../docs/TEST_DATA.md`: vendoring procedure.
- `../tests/fixtures/contracts/README.md`: planned contract fixtures.
- `../tests/fixtures/inputs/README.md`: vendored + synthetic input catalog.
- `../tests/fixtures/artifacts/README.md`: golden artifact rules.