release(v1.1.0): fitness functions + intermittent-test integrity#10

Merged
jed72 merged 32 commits into main from release/v1.1.0
May 26, 2026

jed72 (Owner) commented May 26, 2026

Summary

Cuts Compass 1.1.0 — semver minor (additive mechanism, backward-compat preserved).

Lands five candidate suggestions:

  • A1 — Intermittent-test integrity: new strategy S5 "Intermittency is failure", new G4 check no-trusted-rerun, new governance/quarantine.yml, cmd_tdd_green now records attempts + rerun_without_change (source-tree SHA-256 detection)
  • A2 — Architectural fitness functions: new G4 check command-passes, new gate verify.fitness (route-promoted via RG-FLOOR-006/RG-FLOOR-007), ADR-009 "Architectural fitness functions are project guardrails, not framework guardrails"
  • B1 — TDD-as-design rebalance: S2 amended to name both governance and design-feedback roles, tdd-discipline skill gains "Listen to your tests" section + "test behaviour, not implementation" anti-pattern, signals.yml gains design_smell advisory category, coverage-as-floor caveat in evidence-gates skill
  • C1 — Example-first refinement in bdd-specification skill (user-stories refusal kept per ADR-004)
  • C2 — Commit-vs-acceptance vocabulary in evidence-gates skill (Release/Production stages stay out of scope per safety-contract guarantee 6)

Surface added stays within the USP-5 legibility budget: 1 strategy, 1 gate, 2 checks, 1 evidence-record field pair, 1 signal category, 1 ADR. No sixth guardrail, no fifth routing dimension — ADR-002 honoured.

Version bumped on all five published surfaces: VERSION, .claude-plugin/plugin.json, .claude-plugin/marketplace.json ($.metadata.version and $.plugins[0].version), and cli/compass COMPASS_VERSION.
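
A minimal sketch of how the five surfaces could be checked for agreement. It assumes `plugin.json` keeps its version under a top-level `version` key and that `cli/compass` declares `COMPASS_VERSION = "x.y.z"` on a single line; neither detail is stated in this PR. `read_versions` and `mismatches` are hypothetical helper names, not functions from the repository.

```python
import json
import re
from pathlib import Path


def read_versions(root: Path) -> dict[str, str]:
    """Collect the version string from each of the five published surfaces."""
    plugin = json.loads((root / ".claude-plugin/plugin.json").read_text())
    market = json.loads((root / ".claude-plugin/marketplace.json").read_text())
    cli_src = (root / "cli/compass").read_text()
    # Assumption: the CLI holds a line such as COMPASS_VERSION = "1.1.0".
    match = re.search(
        r'^COMPASS_VERSION\s*=\s*["\']([^"\']+)["\']', cli_src, re.MULTILINE
    )
    return {
        "VERSION": (root / "VERSION").read_text().strip(),
        # Assumption: plugin.json uses a top-level "version" key.
        "plugin.json": plugin["version"],
        "marketplace.json metadata": market["metadata"]["version"],
        "marketplace.json plugins[0]": market["plugins"][0]["version"],
        "cli/compass": match.group(1) if match else "<missing>",
    }


def mismatches(versions: dict[str, str], expected: str) -> list[str]:
    """Return the names of surfaces whose version differs from `expected`."""
    return sorted(name for name, value in versions.items() if value != expected)
```

Calling `mismatches(read_versions(Path(".")), "1.1.0")` from the repository root would return an empty list only if all five surfaces agree.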

Test plan

  • CI self-check passes
  • Full pytest suite green: 372 passed, 2 skipped locally
  • compass check --task farley-guidebook passes — all 13 checks (including the new no-trusted-rerun, command-passes, dod-evidence-typed)
  • compass check --task version-bump-1-1-0 passes — all 13 checks
  • compass policy lint passes — governance YAML structurally valid
  • After merge: tag v1.1.0 pushed; GitHub release created with dist/compass-1.1.0.tar.gz (6.7 MB, 262 files)

🤖 Generated with Claude Code

jed72 and others added 30 commits May 25, 2026 14:02
Delivers TRC-C1..C10, TRC-F3, TRC-F6 (12 scenarios) for task
comparison-requirements:

- cli/compass: new `next` subcommand (+160 lines) reading task.yml +
  route.md, prints one-line phase/gate status; read-only; <200ms p95.
- CLAUDE.md: additive paragraph under "Never skip Frame" — Trigger Frame
  on intent, not just the literal command. Includes the no-re-frame
  clause (TRC-F3) so already-framed tasks proceed without re-Frame.
- agents/{spec-author,planner,builder,orchestrator}.md: one-sentence
  intent-trigger reinforcement on each agent's description.

26 tests added (tests/next/, tests/invisible/), all green. No new
hooks, no new code surface for triggering — Q4's resolution honoured.

Co-Authored-By: Claude (compass:builder) <noreply@anthropic.com>
Implements `compass analyze` (cross-artifact coherence check), the
`add_gate` floor property in routing-policy (RG-FLOOR-004/005), the
`coherence-check` evidence type + `coherence-check-passes` CHECK_FN
under G4, and the full 16-scenario test suite for stream A.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Implements all 13 stream-B scenarios (TRC-B1 through TRC-B11 and TRC-F2):

- derive_system_spec() in cli/compass: walks .compass/work/*/task.yml for
  status==landed, applies supersession (same intent_id → latest-landed wins,
  older → archive appendix), writes docs/system-spec.md with DERIVED FILE
  header; idempotent and reconstructible (ADR-008).

- compass _derive-system-spec --internal: private subparser hidden from
  compass --help via _choices_actions filter + metavar override (DD-4);
  --internal flag mandatory to prevent accidental shell invocation.

- scripts/integrate.sh: post-combined-regression block writes status: landed
  + land_timestamp to task.yml via Python heredoc, then invokes the private
  subparser (DD-4, TRC-B11).

- schemas/task.schema.json: added optional status (active|landed) and
  land_timestamp fields; schema_version 1.0 files still valid (Inv-8, DD-3).

- templates/task.yml: bumped schema_version to "1.1", added status: active.

- ADR-008: records the cross-task derived artifacts decision; added to
  architecture/decisions/README.md index.

- tests/test_self_architecture.py: repaired test_adrs_cover_p1_to_p8 to
  accept >= 6 ADRs (founding six remain required; expansion ADRs allowed).

35 tests in tests/derive/ all green; compass policy lint PASS.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
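
The supersession rule described for `derive_system_spec()` (same `intent_id`, latest-landed wins, older entries go to the archive appendix) can be sketched as below. This is an illustration under assumptions, not the code in `cli/compass`: tasks are plain dicts, `task_id` is a hypothetical field, and `land_timestamp` values are assumed to be ISO-8601 strings in one timezone so that string order matches time order.

```python
def apply_supersession(tasks: list[dict]) -> tuple[list[dict], list[dict]]:
    """Split landed tasks into (current, archived) by intent_id.

    For each intent_id the most recently landed task is current;
    every earlier landed task with that intent_id is archived.
    """
    landed = [t for t in tasks if t.get("status") == "landed"]
    current: dict[str, dict] = {}
    archived: list[dict] = []
    # Oldest first, so a later task displaces an earlier one.
    for task in sorted(landed, key=lambda t: t["land_timestamp"]):
        previous = current.get(task["intent_id"])
        if previous is not None:
            archived.append(previous)
        current[task["intent_id"]] = task
    # Sort by intent_id so repeated runs give identical output (idempotence).
    return [current[k] for k in sorted(current)], archived
```

Because the result depends only on the task records, rerunning the derivation over the same `task.yml` files would reproduce the same split, which is the reconstructibility property ADR-008 asks for.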
…task derived artifacts) + comparison-requirements analysis

Planning artifacts for the `comparison-requirements` task, captured before
Build started:

- architecture/decisions/ADR-007 — gates may be conditionally promoted from
  advisory to blocking via routing-policy floors; advisory gates write
  evidence but do not block Land
- architecture/decisions/ADR-008 — cross-task derived artifacts are generated
  from landed task scenarios at Land time; the derivation is reconstructible,
  idempotent, and never a source-of-truth
- docs/analysis/comparison-requirements.md — business-analysis pass over
  docs/proposals/comparison.md; converts the three shortlist candidates
  (compass analyze, living system spec, compass next + invisible triggering)
  into buildable BRs, scenarios, and NFRs

Both ADRs are in 'proposed' status — they document architecturally-novel
patterns (the third gate-lifecycle class for verify.analyze; the framework's
first cross-task derived artifact). Promoted to 'accepted' after the build
streams ship and the patterns prove out in practice.
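
The ADR-007 idea (a gate is advisory by default, a routing-policy floor can promote it to blocking, and only blocking gates stop Land) might look like the sketch below. The `Floor` class, the `blast_radius` route field, and the floor id are hypothetical; the real floor conditions live in `governance/routing-policy.yml` and are not reproduced here.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Floor:
    floor_id: str                      # hypothetical id for illustration
    applies: Callable[[dict], bool]    # predicate over the route's attributes
    add_gate: str                      # gate this floor promotes to blocking


def gate_modes(route: dict, advisory_gates: list[str], floors: list[Floor]) -> dict[str, str]:
    """Start every listed gate as advisory, then promote gates whose floor fires."""
    modes = {gate: "advisory" for gate in advisory_gates}
    for floor in floors:
        if floor.add_gate in modes and floor.applies(route):
            modes[floor.add_gate] = "blocking"
    return modes


def can_land(modes: dict[str, str], results: dict[str, bool]) -> bool:
    """Advisory gates record evidence but never block; blocking gates must pass."""
    return all(results.get(gate, False) for gate, mode in modes.items() if mode == "blocking")
```

With this shape, a failing `verify.analyze` result blocks Land only on routes where a floor has fired, and is informational everywhere else.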
# Conflicts:
#	architecture/decisions/ADR-008-cross-task-derived-artifacts.md
# Conflicts:
#	cli/compass
…ext.md headings

Stream-B's test rewrite at Build introduced "## Boundaries" + "## Principles"
expectations that don't match the file (which has "## Boundary conditions"
and no Principles section). Aligned at Land integration of
comparison-requirements (TRC-D5 honoured).
… ADR index

Closes the comparison-requirements task with everything green:

- tests/cross_cutting/test_stream_d_invariants.py — 10 cross-cutting
  invariant tests (TRC-D1..D10) asserting on the integrated state of
  streams A/B/C:
    D1: five-point mental model unchanged
    D2: only `analyze` and `next` added publicly (`_derive-system-spec` private)
    D3: no tier ladder
    D4: no new agent personas, role enum unchanged
    D5: pipeline phases still flex by route
    D6: immovable gates remain immovable
    D7: TDD remains a strategy that Spike suspends
    D8: bare-repo zero-setup preserved
    D9: route evaluate deterministic
    D10: no LLM SDK in cli/compass

- docs/system-spec.md — first run of the new living system spec
  derivation (ADR-008). Includes the 51 scenarios from this task plus
  a DERIVED FILE header.

- architecture/decisions/README.md — added ADR-007 to the index
  alongside ADR-008 (Stream-B added ADR-008 via its branch merge).

Combined regression: 260 passed, 2 skipped.
`compass check`: PASS — all 11 checks pass.
All 7 gates closed (verify.correctness/governance/traceability/regression/
security/clarity/claims).
BF-1 paid (latency measurements confirm provisional targets).
task.yml.status: landed.
… failure.'

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…k roles

Amend S2 in governance/strategies.md to declare both purposes of TDD:
governance role (satisfies G1) and design-feedback loop. Quotes Farley:
"TDD is less about testing and more about good design."

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…n evidence

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…i-pattern

Add "Listen to your tests" section to tdd-discipline explaining that a
hard-to-write test is a design smell — change the design, not the test.
Add "Test behaviour, not implementation" anti-pattern with the
"swap the implementation — does the test survive?" check.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
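
The "swap the implementation, does the test survive?" check from the commit above can be shown with a toy example. Both basket classes and both test functions are invented for illustration; they are not taken from the tdd-discipline skill.

```python
class ListBasket:
    """Stores every item and sums on demand."""

    def __init__(self):
        self._items = []

    def add(self, name, price):
        self._items.append((name, price))

    def total(self):
        return sum(price for _, price in self._items)


class RunningTotalBasket:
    """Same public behaviour, different internals: keeps only a running sum."""

    def __init__(self):
        self._total = 0

    def add(self, name, price):
        self._total += price

    def total(self):
        return self._total


def behaviour_test(basket_cls) -> bool:
    """Asserts on observable behaviour only, so it survives the swap."""
    basket = basket_cls()
    basket.add("tea", 3)
    basket.add("milk", 2)
    return basket.total() == 5


def implementation_test(basket_cls) -> bool:
    """Asserts on a private attribute, so it breaks when internals change."""
    basket = basket_cls()
    basket.add("tea", 3)
    return getattr(basket, "_items", None) == [("tea", 3)]
```

`behaviour_test` passes for both classes, while `implementation_test` passes only for `ListBasket`: the second test is coupled to a storage detail rather than to what the basket does.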
… pass

Add _source_tree_hash, _load_tdd_state, _save_tdd_state helpers and
DD-7 rerun detection to cmd_tdd_green.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
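
A rough sketch of source-tree SHA-256 rerun detection, in the spirit of the helpers named above. The function names, the `.py`-only filter, and the state layout (`attempts`, `last_hash`) are assumptions; only the `attempts` and `rerun_without_change` field names come from this PR. The comparison here is against the previous attempt only, which may differ from what `cmd_tdd_green` actually does.

```python
import hashlib
from pathlib import Path


def source_tree_hash(root: Path, suffixes=(".py",)) -> str:
    """SHA-256 over the relative paths and contents of matching files, in sorted order."""
    digest = hashlib.sha256()
    files = sorted(p for p in root.rglob("*") if p.is_file() and p.suffix in suffixes)
    for path in files:
        digest.update(path.relative_to(root).as_posix().encode())
        digest.update(b"\0")  # separator so path and content cannot run together
        digest.update(path.read_bytes())
        digest.update(b"\0")
    return digest.hexdigest()


def record_green_attempt(state: dict, current_hash: str) -> dict:
    """Return the updated TDD state after one tdd-green attempt."""
    attempts = state.get("attempts", 0) + 1
    new_state = {"attempts": attempts, "last_hash": current_hash}
    if attempts > 1:
        # True when the tree is byte-identical to the previous attempt,
        # i.e. the test was simply rerun until it passed.
        new_state["rerun_without_change"] = state.get("last_hash") == current_hash
    return new_state
```

A second attempt against an unchanged tree yields `rerun_without_change: True`, which is the signal a check such as `no-trusted-rerun` can then act on.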
…1 implementation)

TRC-A2 test was already passing because DD-7 rerun detection added in TRC-A1
handles both scenarios. Red step showed test passing (no separate red recorded
for TRC-A2). Evidence: evidence/green-TRC-A2.json.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ls.yml

Add design_smell top-level key to governance/signals.yml with three default
patterns: tests dominated by mocking/setup, assertions on internal method
calls, single test asserts more than one behaviour. Update
schemas/signals.schema.json to accept the new optional property.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…aveat

Add inline comment to guardrails.yml Q1 coverage-floor example and add a
"Coverage as evidence" section to evidence-gates/SKILL.md explaining that
coverage is a floor, never a target — a side effect of test discipline.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…registry

- governance/quarantine.yml: new file, shipped empty (DD-1)
- governance/guardrails.yml: no-trusted-rerun check added to checks: and G4.checks;
  evidence_types.test-run.description updated per DD-9
- cli/compass: _check_no_trusted_rerun + _load_quarantine_registry + quarantine
  policy lint (_lint_errors_quarantine) added to CHECK_FNS and cmd_policy_lint

Also covers TRC-A4, TRC-A5, TRC-FM3 (all exercised by TestNoTrustedRerunCheck).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
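
One way a `no-trusted-rerun` style check could consume those evidence fields is sketched below. The commit does not spell out how the quarantine registry interacts with the check, so treating a quarantined scenario as exempt is an assumption here, as are the `scenario` field and the function name.

```python
def check_no_trusted_rerun(evidence_records, quarantined=frozenset()):
    """Return (passed, offenders).

    A record offends when a test went green on a rerun with no source change,
    unless its scenario is listed in the quarantine registry (assumed exemption).
    """
    offenders = []
    for record in evidence_records:
        reran = record.get("attempts", 1) > 1
        unchanged = record.get("rerun_without_change", False)
        if reran and unchanged and record["scenario"] not in quarantined:
            offenders.append(record["scenario"])
    return (not offenders, offenders)
```

With an empty registry, which is how `governance/quarantine.yml` ships, every unchanged-source rerun would fail the check under this sketch.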
…r gates for B1

Add negative assertion tests verifying stream-C (B1) adds no new checks or
gates to guardrails.yml. Add provenance comment to guardrails.yml confirming
the constraint. B1 is judgement-side only — no new mechanism.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…d-specification

Add "Example-first refinement chain" section to bdd-specification skill:
vague idea → concrete examples → acceptance criteria → executable spec.
Includes ubiquitous language discipline and "should" naming prefix.
Explicitly refuses user stories as a per-role spec format (ADR-004).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Stream A — A1: Intermittent-test integrity
Scenarios: TRC-A1, TRC-A2, TRC-A3, TRC-A4, TRC-A5, TRC-A6, TRC-A7, TRC-FM2, TRC-FM3
Tests added: 21 (in tests/test_intermittent_tests.py)
Files changed: cli/compass, governance/guardrails.yml, governance/quarantine.yml (new),
               governance/strategies.md, tests/test_intermittent_tests.py

Suite: 291 passed, 2 skipped; policy lint: PASS; task lint: PASS

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…vidence-gates

Add "Pipeline stage vocabulary" section to evidence-gates/SKILL.md:
- verify.correctness as the acceptance/releasability gate ("anything that defines releasable")
- tdd-red/tdd-green loop as the commit stage ("anything that can fail fast")
- explicitly state Release and Production stages are out of scope (safety-contract guarantee 6)
- cite G4 as the standing falsification principle

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…s functions (TRC-B1..B8, TRC-FM1)

All 9 stream-B scenarios implemented and green (298 tests pass, 2 pre-existing skips).

Changes:
- cli/compass: add _check_command_passes (command-passes CHECK_FN); policy lint
  validates command-passes params on project guardrails only (G4 exempt)
- governance/guardrails.yml: register command-passes check; add to G4.checks;
  add verify.fitness gate evidence requirements
- governance/routing-policy.yml: add RG-FLOOR-006 (cross-cutting/critical) and
  RG-FLOOR-007 (irreversible domains) promoting verify.fitness from advisory to blocking
- architecture/decisions/ADR-009-...: new ADR, status proposed — fitness functions
  are project guardrails, not framework guardrails
- architecture/decisions/README.md, architecture/system-context.md,
  skills/evidence-gates/SKILL.md: reference ADR-009 and verify.fitness pattern
- tests/test_fitness_functions.py: 12 tests (TRC-B1, B2, B3, B6, B7, FM1)
- tests/test_verify_fitness_route_promotion.py: 16 tests (TRC-B4, B5, B8)
- tests/fixtures/route-baseline.yml: expedition gains verify.fitness (RG-FLOOR-006 fires)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
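
A `command-passes` style check can be as small as "run the project's fitness command and pass on exit code 0". The sketch below assumes that reading; the parameter names (`command`, `timeout_s`) and the return shape are hypothetical and may not match `_check_command_passes`.

```python
import subprocess
import sys


def check_command_passes(command: list[str], timeout_s: float = 60.0) -> tuple[bool, str]:
    """Run `command` without a shell; pass only when it exits with status 0."""
    try:
        result = subprocess.run(
            command, capture_output=True, text=True, timeout=timeout_s
        )
    except FileNotFoundError:
        return False, f"command not found: {command[0]}"
    except subprocess.TimeoutExpired:
        return False, f"timed out after {timeout_s}s"
    if result.returncode == 0:
        return True, "exit 0"
    return False, f"exit {result.returncode}"


if __name__ == "__main__":
    # Example: a trivially passing fitness function run through the current interpreter.
    print(check_command_passes([sys.executable, "-c", "raise SystemExit(0)"]))
```

Keeping the command in project configuration rather than in the framework matches the ADR-009 position that fitness functions are project guardrails: the framework only needs to know how to run a command and read its exit status.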
# Conflicts:
#	cli/compass
#	governance/guardrails.yml
# Conflicts:
#	skills/evidence-gates/SKILL.md
…ks PASS

Brings the Farley proposal's full shortlist into main:
- A1 intermittent-test integrity (S5 strategy + no-trusted-rerun check on G4
  + governance/quarantine.yml + tdd-green attempts/rerun_without_change)
- A2 architectural fitness functions (command-passes check + verify.fitness
  gate + RG-FLOOR-006/007 + ADR-009)
- B1 TDD-as-design rebalance (S2 dual-role + tdd-discipline Listen-to-tests
  + design_smell advisory signal + coverage-as-floor caveat)
- C1 example-first refinement in bdd-specification skill
- C2 commit-vs-acceptance vocabulary in evidence-gates skill

Surface added stays within the USP-5 budget: 1 strategy, 1 gate, 2 checks,
1 evidence field, 1 signal category, 1 ADR. No sixth guardrail, no fifth
routing dimension (ADR-002 honoured).

Integration: A → B → C → D merged; combined regression 371 passed / 2 skip;
final compass check PASS — all 13 checks (including the new no-trusted-rerun
and command-passes that this task ships).

Backfills paid:
- BF-1: cmd_tdd_green now always records rerun_without_change when attempts
  > 1 (true/false per source-hash); stream-A TRC-A1 evidence regenerated.
- BF-INTEGRATION: stream-D's 8 cross-cutting invariant tests added.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Lands the Farley guidebook shortlist (farley-guidebook task) — additive
mechanism only:
- new strategy S5 — Intermittency is failure
- new gate verify.fitness (route-promoted via RG-FLOOR-006/007)
- new G4 checks no-trusted-rerun and command-passes
- new ADR-009 — fitness functions as project guardrails
- new evidence-record fields attempts + rerun_without_change
- new signal category design_smell

Semver minor: no breaking change; ADR-002 / ADR-006 honoured.

Bumps version on five published surfaces:
- VERSION
- .claude-plugin/plugin.json
- .claude-plugin/marketplace.json ($.metadata.version and $.plugins[0].version)
- cli/compass COMPASS_VERSION

Replaces the obsolete tests/test_version_bump_1_0_0.py with the parametrised
1.1.0 equivalent (TRC-1).

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
jed72 and others added 2 commits May 26, 2026 08:58
The .compass/work/farley-guidebook/ files for the framework's own task got
pulled into main via the stream-B merge — but the framework repo's own
.gitignore has /.compass/work/ root-anchored, deliberately. Comparison-
requirements (the prior expedition) has zero tracked task files for the
same reason: the framework dogfoods Compass on its own work, but the work-
state itself is local audit-trail only, not part of the public framework
artifact that ships to adopters.

The partial-tracked state was the actual bug — task.yml referenced evidence
paths (verification-report.md and stream-A/C/D green-*.json) that .gitignore
keeps off-disk on a fresh clone. CI's `compass ci` then fails resolving the
evidence pointers.

Removes from index (keeps local copies):
- task.yml
- devlog.md
- stream-B's bled-through evidence/*.json + *.log

No content change; restores the gitignore's intent.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Rephrases the substantive ideas in Compass's own voice; renames
test_farley_invariants.py to test_release_invariants.py; updates the test
that asserted the literal external quote to assert the framing instead.

No behavioural change — all 372 tests still pass.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
jed72 changed the title from "release(v1.1.0): Farley guidebook shortlist" to "release(v1.1.0): fitness functions + intermittent-test integrity" on May 26, 2026
jed72 merged commit f9769b4 into main on May 26, 2026
1 check passed
jed72 deleted the release/v1.1.0 branch on May 26, 2026 08:48