diff --git a/.opencode/agents/product-owner.md b/.opencode/agents/product-owner.md index 40e6d05..e211ce7 100644 --- a/.opencode/agents/product-owner.md +++ b/.opencode/agents/product-owner.md @@ -59,4 +59,5 @@ When a gap is reported (by software-engineer or reviewer): ## Available Skills - `session-workflow` — session start/end protocol +- `feature-selection` — when TODO.md is idle: score and select next backlog feature using WSJF - `scope` — Step 1: 3-session discovery (Phase 1 + 2), stories (Phase 3), and criteria (Phase 4) \ No newline at end of file diff --git a/.opencode/agents/software-engineer.md b/.opencode/agents/software-engineer.md index d7a8bff..a802229 100644 --- a/.opencode/agents/software-engineer.md +++ b/.opencode/agents/software-engineer.md @@ -38,7 +38,7 @@ Load `skill session-workflow` first — it reads TODO.md, orients you to the cur | Step | Action | |---|---| | **Step 2 — ARCH** | Load `skill implementation` — contains Step 2 architecture protocol | -| **Step 3 — TDD LOOP** | Load `skill implementation` — contains Step 3 TDD Loop | +| **Step 3 — TDD LOOP** | Load `skill implementation` — contains Step 3 TDD Loop; load `skill refactor` when entering REFACTOR phase or doing preparatory refactoring | | **Step 5 — after PO accepts** | Load `skill pr-management` and `skill git-release` as needed | ## Ownership Rules @@ -57,7 +57,8 @@ If during implementation you discover behavior not covered by existing acceptanc - `session-workflow` — session start/end protocol - `implementation` — Steps 2-3: architecture + TDD loop -- `design-patterns` — on-demand when smell detected during refactor +- `refactor` — REFACTOR phase and preparatory refactoring (load on-demand) +- `design-patterns` — on-demand when smell detected during architecture or refactor - `pr-management` — Step 5: PRs with conventional commits - `git-release` — Step 5: calver versioning and themed release naming - `create-skill` — meta: create new skills when needed \ No newline at end of 
file diff --git a/.opencode/skills/create-skill/SKILL.md b/.opencode/skills/create-skill/SKILL.md index 8d94116..db8a679 100644 --- a/.opencode/skills/create-skill/SKILL.md +++ b/.opencode/skills/create-skill/SKILL.md @@ -27,7 +27,7 @@ Before writing any skill, research the domain to ground the skill in industry st - Vendor documentation (OpenAI, Anthropic, Google, Microsoft) - Industry standards (ISO, NIST, OMG) - Established methodologies (e.g., FDD, Scrum, Kanban for process skills) -3. **Read existing research**: Check `docs/academic_research.md` for related entries +3. **Read existing research**: Check `docs/scientific-research/` for related entries — each file covers a domain (testing, oop-design, architecture, ai-agents, etc.) 4. **Synthesize conclusions**: Extract actionable conclusions — what works, why, and when to apply it 5. **Embed as guidance**: Write the skill's steps, checklists, and decision rules based on those conclusions — not as academic citations but as direct guidance ("Use X because it produces Y outcome") @@ -133,6 +133,7 @@ Add the skill name to the agent's "Available Skills" section so the agent knows | Skill | Used By | Purpose | |---|---|---| | `session-workflow` | all agents | Session start/end protocol | +| `feature-selection` | product-owner | Score and select next backlog feature (WSJF) | | `scope` | product-owner | Step 1: define acceptance criteria | | `implementation` | software-engineer | Steps 2-3: architecture + TDD loop | | `design-patterns` | software-engineer | Steps 2, 3: refactor when smell detected | diff --git a/.opencode/skills/design-patterns/SKILL.md b/.opencode/skills/design-patterns/SKILL.md index 591ab6d..106c4f1 100644 --- a/.opencode/skills/design-patterns/SKILL.md +++ b/.opencode/skills/design-patterns/SKILL.md @@ -1,7 +1,7 @@ --- name: design-patterns -description: Reference skill for GoF design patterns, SOLID, Object Calisthenics, Python Zen, and other SE principles — with smell triggers and Python 
before/after examples -version: "1.0" +description: GoF design pattern catalogue — smell triggers and Python before/after examples +version: "2.0" author: software-engineer audience: software-engineer workflow: feature-lifecycle @@ -9,18 +9,18 @@ workflow: feature-lifecycle # Design Patterns Reference -Load this skill when: -- Running the architecture smell check in Step 2 and a smell is detected -- Refactoring in Step 3 and a pattern smell appears in the self-declaration +Load this skill when the refactor skill's smell table points to a GoF pattern and you need the Python before/after example. + +Sources: Gamma, Helm, Johnson, Vlissides. *Design Patterns: Elements of Reusable Object-Oriented Software*. Addison-Wesley, 1995. See `docs/scientific-research/oop-design.md` entry 34. --- ## How to Use This Skill -1. **Identify the smell** from the checklist in your self-declaration or architecture check +1. **Identify the smell** from the refactor skill's lookup table 2. **Find the smell category** below (Creational / Structural / Behavioral) 3. **Read the trigger and the before/after example** -4. **Apply the pattern** and update the Architecture section (Step 2) or the refactored code (Step 3) +4. **Apply the pattern** — update the stub files (Step 2) or the refactored code (Step 3) --- @@ -178,7 +178,7 @@ def apply_discount(order: Order, strategy: DiscountStrategy) -> Money: #### Smell: Feature Envy **Signal**: A method in class A uses data from class B more than its own data. The method "envies" class B. 
-**Pattern**: Move Method to the envied class (not a GoF pattern — a Fowler refactoring that often precedes Strategy or Command) +**Pattern**: Move Method to the envied class (Fowler refactoring that often precedes Strategy or Command) ```python # BEFORE — OrderPrinter knows too much about Order internals @@ -309,9 +309,9 @@ class Order: class Order: def confirm(self) -> None: self.status = "confirmed" - EmailService().send_confirmation(self) # direct coupling - InventoryService().reserve(self) # direct coupling - AnalyticsService().record_conversion(self) # direct coupling + EmailService().send_confirmation(self) # direct coupling + InventoryService().reserve(self) # direct coupling + AnalyticsService().record_conversion(self) # direct coupling ``` ```python @@ -380,182 +380,6 @@ class JsonImporter(Importer): --- -## SOLID — Python Examples - -### S — Single Responsibility -One class, one reason to change. - -```python -# WRONG — Report handles both data and formatting -class Report: - def generate(self) -> dict: ... - def to_pdf(self) -> bytes: ... # separate concern - def to_csv(self) -> str: ... # separate concern - -# RIGHT — split concerns -class Report: - def generate(self) -> ReportData: ... - -class PdfRenderer: - def render(self, data: ReportData) -> bytes: ... -``` - -### O — Open/Closed -Open for extension, closed for modification. - -```python -# WRONG — must edit this function to add a new format -def export(data: ReportData, fmt: str) -> bytes: - if fmt == "pdf": ... - elif fmt == "csv": ... - -# RIGHT — new formats extend without touching existing code -class Exporter(Protocol): - def export(self, data: ReportData) -> bytes: ... - -class PdfExporter: - def export(self, data: ReportData) -> bytes: ... -``` - -### L — Liskov Substitution -Subtypes must be fully substitutable for their base type. - -```python -# WRONG — ReadOnlyFile violates the contract of File -class File: - def write(self, content: str) -> None: ... 
- -class ReadOnlyFile(File): - def write(self, content: str) -> None: - raise PermissionError # narrows the contract — LSP violation - -# RIGHT — separate interfaces for readable and writable -class ReadableFile(Protocol): - def read(self) -> str: ... - -class WritableFile(Protocol): - def write(self, content: str) -> None: ... -``` - -### I — Interface Segregation -No implementor should be forced to implement methods it doesn't use. - -```python -# WRONG — Printer is forced to implement scan() and fax() -class Machine(Protocol): - def print(self, doc: Document) -> None: ... - def scan(self, doc: Document) -> None: ... - def fax(self, doc: Document) -> None: ... - -# RIGHT — each capability is its own Protocol -class Printer(Protocol): - def print(self, doc: Document) -> None: ... - -class Scanner(Protocol): - def scan(self, doc: Document) -> None: ... -``` - -### D — Dependency Inversion -Domain depends on abstractions (Protocols), not on concrete I/O or frameworks. - -```python -# WRONG — domain imports infrastructure directly -from app.db import PostgresConnection - -class OrderRepository: - def __init__(self) -> None: - self.db = PostgresConnection() # domain imports infra - -# RIGHT — domain defines the Protocol; infra implements it -class OrderRepository(Protocol): - def find(self, order_id: OrderId) -> Order: ... - def save(self, order: Order) -> None: ... - -class PostgresOrderRepository: # in adapters/ - def find(self, order_id: OrderId) -> Order: ... - def save(self, order: Order) -> None: ... -``` - ---- - -## Object Calisthenics — Python Rules - -Jeff Bay's 9 rules for object-oriented discipline. Each has a Python signal. 
- -| Rule | Constraint | Python Signal of Violation | -|---|---|---| -| **OC-1** | One indent level per method | `for` inside `if` inside a method body | -| **OC-2** | No `else` after `return` | `if cond: return x \n else: return y` | -| **OC-3** | Wrap all primitives that have domain meaning | `def process(user_id: int)` instead of `def process(user_id: UserId)` | -| **OC-4** | Wrap all collections that have domain meaning | `list[Order]` passed around instead of `OrderCollection` | -| **OC-5** | One dot per line | `obj.repo.find(id).name` | -| **OC-6** | No abbreviations | `usr`, `mgr`, `cfg`, `val`, `tmp` | -| **OC-7** | Keep classes small (≤50 lines) and methods short (≤20 lines) | Any method requiring scrolling | -| **OC-8** | No class with more than 2 instance variables | `__init__` with 3+ `self.x =` assignments | -| **OC-9** | No getters/setters | `def get_name(self)` / `def set_name(self, v)` | - ---- - -## Python Zen — Mapped to Code Practices - -The relevant items from PEP 20 (`import this`) with concrete code implications: - -| Zen Item | Code Practice | -|---|---| -| Beautiful is better than ugly | Name things clearly; prefer named types over bare primitives | -| Explicit is better than implicit | Explicit return types; explicit Protocol dependencies; no magic | -| Simple is better than complex | KISS — one function, one job; prefer a plain function over a class | -| Complex is better than complicated | A well-designed abstraction is acceptable; an accidental tangle is not | -| Flat is better than nested | OC-1 — one indent level; early returns | -| Sparse is better than dense | One statement per line; no semicolons; no lambda chains | -| Readability counts | OC-6 — no abbreviations; docstrings on every public function | -| Special cases aren't special enough to break the rules | Do not add `if isinstance` branches to avoid refactoring | -| Errors should never pass silently | No bare `except:`; no `except Exception: pass` | -| In the face of ambiguity, 
refuse the temptation to guess | Raise on invalid input; never silently return a default | -| There should be one obvious way to do it | DRY — every shared concept in exactly one place | -| If the implementation is hard to explain, it's a bad idea | KISS — if you can't describe the function in one sentence, split it | - ---- - -## Other Principles - -### Law of Demeter (Tell, Don't Ask) -A method should only call methods on: -- `self` -- Objects passed as parameters -- Objects it creates -- Direct component objects (`self.x`) - -**Violation signal**: `a.b.c()` — two dots. Assign `b = a.b` and call `b.c()`, or better: ask `a` to do what you need (`a.do_thing()`). - -### Command-Query Separation (CQS) -A method either **changes state** (command) or **returns a value** (query) — never both. - -```python -# WRONG — pop() both returns and mutates -value = stack.pop() - -# RIGHT (CQS strict) -value = stack.peek() # query — no mutation -stack.remove_top() # command — no return value -``` - -Note: Python's standard library violates CQS in places (`list.pop()`, `dict.update()`). Apply CQS to your domain objects; do not fight the stdlib. - -### Tell, Don't Ask -Instead of querying an object's state and acting on it externally, tell the object to do the work itself. 
- -```python -# WRONG — ask state, decide externally -if order.status == OrderStatus.PENDING: - order.status = OrderStatus.CONFIRMED - -# RIGHT — tell the object -order.confirm() # Order decides if the transition is valid -``` - ---- - ## Quick Smell → Pattern Lookup | Smell | Pattern | @@ -569,4 +393,3 @@ order.confirm() # Order decides if the transition is valid | Class directly calls B, C, D on state change | Observer | | Two functions share the same skeleton, differ in one step | Template Method | | Subsystem is complex and callers need a simple entry point | Facade | -| Object needs logging/caching without changing its class | Decorator / Proxy | diff --git a/.opencode/skills/feature-selection/SKILL.md b/.opencode/skills/feature-selection/SKILL.md new file mode 100644 index 0000000..567e3ef --- /dev/null +++ b/.opencode/skills/feature-selection/SKILL.md @@ -0,0 +1,118 @@ +--- +name: feature-selection +description: Score and select the next backlog feature by value, effort, and dependencies +version: "1.0" +author: product-owner +audience: product-owner +workflow: feature-lifecycle +--- + +# Feature Selection + +Select the next most valuable, unblocked feature from the backlog using a lightweight scoring model grounded in flow economics and dependency analysis. + +**Research basis**: Weighted Shortest Job First (WSJF) — Reinertsen *Principles of Product Development Flow* (2009); INVEST criteria — Wake (2003); Kano model — Kano (1984); Dependency analysis — PMBOK Critical Path Method. See `docs/scientific-research/requirements-elicitation.md`. + +**Core principle**: Cost of Delay ÷ Duration. Features with high user value and low implementation effort should start first. Features blocked by unfinished work should wait regardless of value. + +## When to Use + +Load this skill when `TODO.md` says "No feature in progress" — before moving any feature to `in-progress/`. + +## Step-by-Step + +### 1. 
Verify WIP is Zero
+
+```bash
+ls docs/features/in-progress/
+```
+
+- 0 files → proceed
+- 1 file → a feature is already in progress; do not start another; exit this skill
+- 2+ files → WIP violation; stop and resolve before proceeding
+
+### 2. List BASELINED Candidates
+
+Read each `.feature` file in `docs/features/backlog/`. Check its discovery section for `Status: BASELINED`.
+
+- Non-BASELINED features are not eligible — they need Step 1 (scope) first
+- If no BASELINED features exist: inform the stakeholder; run `@product-owner` with `skill scope` to baseline the most promising backlog item first
+
+### 3. Score Each Candidate
+
+For each BASELINED feature, fill this table:
+
+| Feature | Value (1–5) | Effort (1–5) | Dependency (0/1) | WSJF |
+|---|---|---|---|---|
+| `<feature-name>` | | | | Value ÷ Effort |
+
+**Value (1–5)** — estimate user/business impact:
+- 5: Must-have — core workflow blocked without it (Kano: basic need)
+- 4: High — significantly improves the primary use case
+- 3: Medium — useful but not blocking (Kano: performance)
+- 2: Low — nice-to-have (Kano: delighter)
+- 1: Minimal — cosmetic or out-of-scope edge case
+
+Use the number of `Must` Examples in the feature's `Rule:` blocks as a tiebreaker: more Musts → higher value.
+
+**Effort (1–5)** — estimate implementation complexity:
+- 1: Trivial — 1–2 `@id` Examples, no new domain concepts
+- 2: Small — 3–5 `@id` Examples, one new domain entity
+- 3: Medium — 6–8 `@id` Examples or cross-cutting concern
+- 4: Large — >8 Examples or multiple interacting domain entities
+- 5: Very large — spans multiple modules or has unknown complexity
+
+**Dependency (0/1)** — does this feature assume another backlog feature is already built?
+- 0: Independent — no hard prerequisite
+- 1: Blocked — requires another backlog feature to be completed first
+
+A Dependency=1 feature is **ineligible for selection** regardless of WSJF score. Apply WSJF only to Dependency=0 features.
+
+### 4. 
Select
+
+Pick the BASELINED, Dependency=0 feature with the highest WSJF score.
+
+Ties: prefer higher Value (user impact matters more than effort optimization).
+
+If all BASELINED features have Dependency=1: stop and resolve the blocking dependency first — select and complete the depended-upon feature.
+
+### 5. Move and Update TODO.md
+
+```bash
+mv docs/features/backlog/<feature-name>.feature docs/features/in-progress/<feature-name>.feature
+```
+
+Update `TODO.md`:
+
+```markdown
+# Current Work
+
+Feature: <feature-name>
+Step: 1 (SCOPE) or 2 (ARCH) — whichever is next
+Source: docs/features/in-progress/<feature-name>.feature
+
+## Next
+Run @<agent>
+```
+
+- If the feature has no `Rule:` blocks yet → Step 1 (SCOPE): `Run @product-owner — load skill scope and write stories`
+- If the feature has `Rule:` blocks but no `@id` Examples → Step 1 Phase 4 (Criteria): `Run @product-owner — load skill scope and write acceptance criteria`
+- If the feature has `@id` Examples → Step 2 (ARCH): `Run @software-engineer — load skill implementation and write architecture stubs`
+
+### 6. Commit
+
+```bash
+git add docs/features/in-progress/<feature-name>.feature TODO.md
+git commit -m "chore: select <feature-name> as next feature"
+```
+
+## Checklist
+
+- [ ] `in-progress/` confirmed empty before selection
+- [ ] Only BASELINED features considered
+- [ ] Dependency=1 features excluded from scoring
+- [ ] WSJF scores filled for all candidates
+- [ ] Selected feature has highest WSJF among Dependency=0 candidates
+- [ ] Feature moved to `in-progress/`
+- [ ] `TODO.md` updated with correct Step and `Next` line
+- [ ] Changes committed
diff --git a/.opencode/skills/implementation/SKILL.md b/.opencode/skills/implementation/SKILL.md
index 27187fc..c3a5a89 100644
--- a/.opencode/skills/implementation/SKILL.md
+++ b/.opencode/skills/implementation/SKILL.md
@@ -52,6 +52,7 @@ Update `TODO.md` Source path from `backlog/` to `in-progress/`.
 1. Read `docs/features/discovery.md` (project-level)
 2. 
Read **ALL** `.feature` files in `docs/features/backlog/` (discovery + entities sections)
 3. Read in-progress `.feature` file (full: Rules + Examples + @id)
+4. Read **ALL** existing `.py` files in `<package>/` — understand what already exists before adding anything
 
 ### Domain Analysis
@@ -78,44 +79,73 @@ For each noun:
 
 If pattern smell detected, load `skill design-patterns`.
 
-### Write Architecture Section
+### Write Stubs into Package
 
-Append to `docs/features/in-progress/.feature` (before first `Rule:`):
+From the domain analysis, write or extend `.py` files in `<package>/`. For each entity:
 
-```gherkin
-  Architecture:
+- **If the file already exists**: add the new class or method signature — do not remove or alter existing code.
+- **If the file does not exist**: create it with the new signatures only.
 
-  ### Module Structure
-  - `/domain/.py` — named class + responsibilities
-  - `/domain/service.py` — cross-entity operations
-  - `/adapters/.py` — Protocol implementation
+**Stub rules (strictly enforced):**
+- Method bodies must be `...` — no logic, no conditionals, no imports beyond `typing`, `dataclasses`, and domain types
+- No docstrings — signatures will change; add docstrings after GREEN (lint enforces this at the quality gate)
+- No inline comments, no TODO comments, no speculative code
 
-  ### Key Decisions
-  ADR-001:
-    Decision: <what>
-    Reason: <why in one sentence>
-    Alternatives considered: <what was rejected and why>
+**Example — correct stub style:**
 
-  ### Build Changes (needs PO approval: yes/no)
-  - New runtime dependency: <name> — reason: <why>
+```python
+from dataclasses import dataclass
+from typing import Protocol
+
+
+@dataclass(frozen=True, slots=True)
+class EmailAddress:
+    value: str
+
+    def validate(self) -> None: ...
+
+
+class UserRepository(Protocol):
+    def save(self, user: "User") -> None: ...
+    def find_by_email(self, email: EmailAddress) -> "User | None": ...
+``` + +**File placement (common patterns, not required names):** +- `<package>/domain/<noun>.py` — entities, value objects +- `<package>/domain/service.py` — cross-entity operations + +Place stubs where responsibility dictates — do not pre-create `ports/` or `adapters/` folders unless a concrete external dependency was identified in scope. Structure follows domain analysis, not a template. + +### Write ADR Files (significant decisions only) + +For each significant architectural decision, create `docs/architecture/adr-NNN-<title>.md`: + +```markdown +# ADR-NNN: <title> + +**Decision:** <what was decided> +**Reason:** <why, one sentence> +**Alternatives considered:** <what was rejected and why> ``` -Signatures are informative — tests/implementation may refine them. Record significant changes as ADR updates. +Only write an ADR if the decision is non-obvious or has meaningful trade-offs. Routine YAGNI choices do not need an ADR. ### Architecture Smell Check (hard gate) -- [ ] No planned class with >2 responsibilities (SOLID-S) -- [ ] No planned class with >2 instance variables (OC-8) -- [ ] All external deps assigned a Protocol/Adapter (SOLID-D + Hexagonal) -- [ ] No noun with different meaning across planned modules (DDD BC) -- [ ] No missing Creational pattern -- [ ] No missing Structural pattern -- [ ] No missing Behavioral pattern +Apply to the stub files just written: + +- [ ] No class with >2 responsibilities (SOLID-S) +- [ ] No class with >2 instance variables (OC-8) +- [ ] All external deps assigned a Protocol (SOLID-D + Hexagonal) — N/A if no external dependencies identified in scope +- [ ] No noun with different meaning across modules (DDD Bounded Context) +- [ ] No missing Creational pattern: repeated construction without Factory/Builder +- [ ] No missing Structural pattern: type-switching without Strategy/Visitor +- [ ] No missing Behavioral pattern: state machine or scattered notification without State/Observer - [ ] Each ADR consistent with each @id AC — 
no contradictions -If any check fails: fix before committing. +If any check fails: fix the stub files before committing. -Commit: `feat(<feature-name>): add architecture` +Commit: `feat(<feature-name>): add architecture stubs` --- @@ -123,14 +153,35 @@ Commit: `feat(<feature-name>): add architecture` ### Prerequisites -- [ ] Architecture section present in in-progress `.feature` file -- [ ] All tests written in `tests/features/<feature-name>/` +- [ ] Architecture stubs present in `<package>/` (committed by Step 2) +- [ ] Read all `docs/architecture/adr-NNN-*.md` files — understand the architectural decisions before writing any test +- [ ] Test stub files exist in `tests/features/<feature-name>/` — one file per `Rule:` block, all `@id` functions present with `@pytest.mark.skip`; if missing, write them now before entering RED + +### Write Test Stubs (if not present) + +For each `Rule:` block in the in-progress `.feature` file, create `tests/features/<feature-name>/<rule-slug>_test.py` if it does not already exist. Write one function per `@id` Example, all skipped: + +```python +@pytest.mark.skip(reason="not yet implemented") +def test_<rule_slug>_<8char_hex>() -> None: + """ + Given: ... + When: ... + Then: ... + """ + # Given + # When + # Then +``` + +Run `uv run task gen-todo` after writing stubs to sync `@id` rows into `TODO.md`. ### Build TODO.md Test List 1. List all `@id` tags from in-progress `.feature` file 2. Order: fewest dependencies first; most impactful within that set 3. Each `@id` = one TODO item, status: `pending` +4. 
Confirm each `@id` has a corresponding skipped stub in `tests/features/<feature-name>/` — if any are missing, add them before proceeding ### Outer Loop — One @id at a time @@ -141,7 +192,10 @@ For each pending `@id`: ``` INNER LOOP ├── RED -│ ├── Write test body (Given/When/Then → Arrange/Act/Assert) +│ ├── Confirm stub for this @id exists in tests/features/<feature-name>/ with @pytest.mark.skip +│ ├── Read existing stubs in `<package>/` — base the test on the current data model and signatures +│ ├── Write test body (Given/When/Then → Arrange/Act/Assert); remove @pytest.mark.skip +│ ├── Update stub signatures as needed — edit the `.py` file directly │ ├── uv run task test-fast │ └── EXIT: this @id FAILS │ (if it passes: test is wrong — fix it first) @@ -154,10 +208,8 @@ INNER LOOP │ (fix implementation only; do not advance to next @id) │ └── REFACTOR - ├── Apply: DRY → SOLID → OC → patterns - ├── Load design-patterns skill if smell detected - ├── Add type hints and docstrings - ├── uv run task test-fast after each change + ├── Load `skill refactor` — follow its protocol for this phase + ├── uv run task test-fast after each individual change └── EXIT: test-fast passes; no smells remain Mark @id completed in TODO.md @@ -344,25 +396,28 @@ Extra tests in `tests/unit/` are allowed freely (coverage, edge cases, etc.) — ## Signature Design -Design signatures before writing bodies. Use Python protocols for abstractions: +Signatures are written during Step 2 (Architecture) and refined during Step 3 (RED). They live directly in the package `.py` files — never in the `.feature` file. 
+ +Key rules: +- Bodies are always `...` in the architecture stub +- GREEN phase replaces `...` with the minimum implementation +- REFACTOR phase cleans up the result + +Use Python Protocols for external dependencies if they are identified in scope — never depend on a concrete class directly: ```python from typing import Protocol from dataclasses import dataclass + @dataclass(frozen=True, slots=True) class EmailAddress: - """A validated email address.""" - value: str - def __post_init__(self) -> None: - if "@" not in self.value: - raise ValueError(f"Invalid email: {self.value!r}") + def validate(self) -> None: ... -class UserRepository(Protocol): - """Persistence interface for users.""" +class UserRepository(Protocol): def save(self, user: "User") -> None: ... def find_by_email(self, email: EmailAddress) -> "User | None": ... ``` \ No newline at end of file diff --git a/.opencode/skills/refactor/SKILL.md b/.opencode/skills/refactor/SKILL.md new file mode 100644 index 0000000..fcf27e2 --- /dev/null +++ b/.opencode/skills/refactor/SKILL.md @@ -0,0 +1,417 @@ +--- +name: refactor +description: Safe refactoring protocol for TDD — green bar rule, two-hats discipline, preparatory refactoring, and Fowler catalogue +version: "1.0" +author: software-engineer +audience: software-engineer +workflow: feature-lifecycle +--- + +# Refactor + +Load this skill when entering the REFACTOR phase of a TDD cycle, or before starting RED on a new `@id` when preparatory refactoring is needed. + +Sources: Fowler *Refactoring* 2nd ed. (2018); Beck *Canon TDD* (2023); Beck *Tidy First?* (2023); Martin *SOLID* (2000); Bay *Object Calisthenics* (2005). See `docs/scientific-research/oop-design.md` and `docs/scientific-research/refactoring-empirical.md`. + +--- + +## The Definition + +A refactoring is a **behavior-preserving** transformation of internal structure. 
If the transformation changes observable behavior, it is not a refactoring — it is a feature change, and requires its own RED-GREEN-REFACTOR cycle. + +--- + +## The Green Bar Rule (absolute) + +**Refactoring is only permitted while all existing tests pass.** + +Every individual refactoring step must leave `test-fast` green. There are no exceptions. + +--- + +## The Two-Hats Rule + +Wear one hat at a time: + +| Hat | Activity | Allowed during this hat | +|---|---|---| +| **Feature hat** | RED → GREEN | Write failing test, write minimum code to pass | +| **Refactoring hat** | REFACTOR | Restructure passing code; never add new behavior | + +**Never mix hats in the same step.** If you discover a refactoring is needed while making a test pass (GREEN), note it — finish GREEN first, then switch hats. + +--- + +## When to Load This Skill + +### 1. REFACTOR phase (opportunistic) + +After GREEN: `test-fast` passes for the current `@id`. Now restructure. + +### 2. Preparatory refactoring (before RED) + +When the current structure would make the next `@id` awkward to implement: +- Put on the **refactoring hat first** +- Refactor until the feature is easy to add +- Commit the preparatory refactoring separately (see Commit Discipline) +- Then put on the feature hat and run RED-GREEN-REFACTOR normally + +Beck: *"For each desired change, make the change easy (warning: this may be hard), then make the easy change."* + +--- + +## Refactoring Protocol + +### Step 1 — Identify the smell + +Run the smell checklist from your Self-Declaration or from the Architecture Smell Check: + +| Smell | Likely catalogue entry | +|---|---| +| Function needs a comment to explain it | Extract Function | +| Class does two jobs | Extract Class | +| Method uses another class's data more than its own | Move Function | +| Same parameter group in multiple signatures | Introduce Parameter Object | +| Primitive with behaviour (money, email, range) | Replace Primitive with Object | +| Local variable holds a 
computed result | Replace Temp with Query | +| `isinstance` / type-flag conditionals | Replace Conditional with Polymorphism | +| Multiple functions share a data cluster | Combine Functions into Class | +| Nested conditions beyond 2 levels | Decompose Conditional / Guard Clauses | +| Object construction scattered without pattern | Factory Method / Builder | +| Scattered notification or state transition | Observer / State | +| Type-switching across callers | Strategy / Visitor | + +If pattern smell detected: load `skill design-patterns` for before/after examples. + +### Step 2 — Apply one catalogue entry at a time + +Apply a **single** catalogue entry, then run `test-fast` before moving to the next. + +Never batch multiple catalogue entries into one step — you lose the ability to pinpoint which step broke something. + +### Step 3 — Run after each step + +```bash +uv run task test-fast +``` + +All tests green → proceed to next catalogue entry. +Any test red → see "When a Refactoring Breaks a Test" below. + +### Step 4 — Commit when smell-free + +Once no smells remain and `test-fast` is green: + +```bash +uv run task test-fast # must pass +``` + +Commit (see Commit Discipline below). + +--- + +## Key Catalogue Entries + +### Extract Function +Pull a cohesive fragment into a named function. Trigger: the fragment needs a comment to explain it. + +```python +# Before +def process(order): + # apply 10% discount + order.total = order.total * Decimal("0.9") + send_confirmation(order) + +# After +def apply_discount(order: Order) -> None: + """Apply the standard 10% discount.""" + order.total = order.total * Decimal("0.9") + +def process(order: Order) -> None: + """Process an order.""" + apply_discount(order) + send_confirmation(order) +``` + +### Extract Class +Split a class doing two jobs. Trigger: data cluster + related behaviours that travel together. 
+ +```python +# Before +@dataclass +class Order: + id: str + street: str + city: str + total: Decimal + +# After +@dataclass(frozen=True, slots=True) +class Address: + """A delivery address.""" + street: str + city: str + +@dataclass +class Order: + """An order placed by a customer.""" + id: str + address: Address + total: Decimal +``` + +### Introduce Parameter Object +Replace a recurring parameter group with a value object. Trigger: same 2+ params appear together across multiple signatures. + +```python +# Before +def summarise(start_date: date, end_date: date) -> Report: ... +def filter_events(start_date: date, end_date: date) -> list[Event]: ... + +# After +@dataclass(frozen=True, slots=True) +class DateRange: + """An inclusive date range.""" + start: date + end: date + +def summarise(period: DateRange) -> Report: ... +def filter_events(period: DateRange) -> list[Event]: ... +``` + +### Replace Primitive with Object +Elevate a domain primitive to a class with behaviour. Trigger: primitive has validation rules or operations. + +```python +# Before +def send_invoice(email: str) -> None: ... + +# After +@dataclass(frozen=True, slots=True) +class EmailAddress: + """A validated email address.""" + value: str + + def validate(self) -> None: + """Validate the email format. + + Raises: + ValueError: if the address has no '@' character. + """ + if "@" not in self.value: + raise ValueError(f"Invalid email: {self.value!r}") + +def send_invoice(email: EmailAddress) -> None: ... +``` + +### Decompose Conditional / Guard Clauses +Flatten nested logic to ≤2 levels. Trigger: OC-1 violation or deeply nested `if` chains. 
+ +```python +# Before +def process(order): + if order is not None: + if order.total > 0: + if order.is_confirmed: + ship(order) + +# After +def process(order: Order | None) -> None: + """Ship a confirmed order.""" + if order is None: + return + if order.total <= 0: + return + if not order.is_confirmed: + return + ship(order) +``` + +--- + +## When a Refactoring Breaks a Test + +A refactoring that breaks a test is **not a refactoring**. Stop. Diagnose: + +### Diagnosis flow + +``` +Test fails after a structural change + │ + ▼ +Is the test testing internal structure +(private methods, specific call chains, +concrete types) rather than observable behavior? + │ + YES │ NO + │ └──→ The "refactoring" changed observable behavior. + │ This is a FEATURE CHANGE. + │ Revert the step. + │ Put on the feature hat. + │ Run RED-GREEN-REFACTOR for it explicitly. + ▼ +Rewrite the test to use the public interface. +Re-apply the refactoring step. +Run test-fast — must be green. +``` + +**Never delete a failing test without diagnosing it first.** + +--- + +## Commit Discipline + +Refactoring commits are always **separate** from feature commits. + +| Commit type | Message format | When | +|---|---|---| +| Preparatory refactoring | `refactor(<feature-name>): <what>` | Before RED, to make the feature easier | +| REFACTOR phase | `refactor(<feature-name>): <what>` | After GREEN, cleaning up the green code | +| Feature addition | `feat(<feature-name>): <what>` | After GREEN (never mixed with refactor) | + +Never mix a structural cleanup with a behavior addition in one commit. This keeps history bisectable and CI green at every commit. + +--- + +## Self-Declaration Check (before exiting REFACTOR) + +Before marking the `@id` complete, verify all of the following. Each failed item is a smell — apply the catalogue entry, run `test-fast`, then re-check. 
+
+### Green Bar
+- [ ] `test-fast` passes
+- [ ] No smell from the checklist in Step 1 remains
+
+### Object Calisthenics (Bay 2008)
+| Rule | Constraint | Violation signal |
+|---|---|---|
+| OC-1 | One indent level per method | `for` inside `if` inside a method body |
+| OC-2 | No `else` after `return` | `if cond: return x` then `else: return y` |
+| OC-3 | Wrap primitives with domain meaning | `def process(user_id: int)` instead of `UserId` |
+| OC-4 | Wrap collections with domain meaning | `list[Order]` passed around instead of `OrderCollection` |
+| OC-5 | One dot per line | `obj.repo.find(id).name` |
+| OC-6 | No abbreviations | `usr`, `mgr`, `cfg`, `val`, `tmp` |
+| OC-7 | Classes ≤ 50 lines, methods ≤ 20 lines | Any method requiring scrolling |
+| OC-8 | ≤ 2 instance variables per class | `__init__` with 3+ `self.x =` assignments |
+| OC-9 | No getters/setters | `def get_name(self)` / `def set_name(self, v)` |
+
+### SOLID (Martin 2000)
+| Principle | Check |
+|---|---|
+| **S** — Single Responsibility | Does this class have exactly one reason to change? |
+| **O** — Open/Closed | Can new behavior be added without editing this class? |
+| **L** — Liskov Substitution | Do all subtypes honor the full contract of their base type? |
+| **I** — Interface Segregation | Does every implementor use every method in the Protocol? |
+| **D** — Dependency Inversion | Does domain code depend only on Protocols, not concrete I/O? |
+
+#### SOLID Python signals
+
+**S — Single Responsibility**
+```python
+# WRONG — Report handles both data and formatting
+class Report:
+    def generate(self) -> dict: ...
+    def to_pdf(self) -> bytes: ...  # separate concern
+    def to_csv(self) -> str: ...  # separate concern
+
+# RIGHT
+class Report:
+    def generate(self) -> ReportData: ...
+
+class PdfRenderer:
+    def render(self, data: ReportData) -> bytes: ...
+``` + +**O — Open/Closed** +```python +# WRONG — must edit this function to add a new format +def export(data: ReportData, fmt: str) -> bytes: + if fmt == "pdf": ... + elif fmt == "csv": ... + +# RIGHT — new formats extend without touching existing code +class Exporter(Protocol): + def export(self, data: ReportData) -> bytes: ... +``` + +**L — Liskov Substitution** +```python +# WRONG — ReadOnlyFile narrows the contract of File +class ReadOnlyFile(File): + def write(self, content: str) -> None: + raise PermissionError # LSP violation + +# RIGHT — separate interfaces +class ReadableFile(Protocol): + def read(self) -> str: ... + +class WritableFile(Protocol): + def write(self, content: str) -> None: ... +``` + +**I — Interface Segregation** +```python +# WRONG — Printer forced to implement scan() and fax() +class Machine(Protocol): + def print(self, doc: Document) -> None: ... + def scan(self, doc: Document) -> None: ... + def fax(self, doc: Document) -> None: ... + +# RIGHT +class Printer(Protocol): + def print(self, doc: Document) -> None: ... + +class Scanner(Protocol): + def scan(self, doc: Document) -> None: ... +``` + +**D — Dependency Inversion** +```python +# WRONG — domain imports infrastructure directly +from app.db import PostgresConnection + +class OrderRepository: + def __init__(self) -> None: + self.db = PostgresConnection() + +# RIGHT — domain defines the Protocol; infra implements it +class OrderRepository(Protocol): + def find(self, order_id: OrderId) -> Order: ... + def save(self, order: Order) -> None: ... + +class PostgresOrderRepository: # in adapters/ + def find(self, order_id: OrderId) -> Order: ... + def save(self, order: Order) -> None: ... +``` + +### Law of Demeter / Tell, Don't Ask / CQS + +**Law of Demeter** — a method should only call methods on: `self`, parameters, objects it creates, direct components (`self.x`). +- Violation signal: `a.b.c()` — two dots. Ask `a` to do the thing instead: `a.do_thing()`. 
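A minimal sketch of the same violation and fix, in the style of the other signals. The `Order`/`Customer` names are illustrative only, not part of this template:

```python
from dataclasses import dataclass
from decimal import Decimal


@dataclass
class Customer:
    """A customer with a loyalty-point balance."""

    loyalty_points: int


@dataclass
class Order:
    """An order that owns its own discount rule."""

    customer: Customer
    total: Decimal

    def apply_loyalty_discount(self) -> None:
        """Apply 10% off when the customer qualifies."""
        if self.customer.loyalty_points > 100:
            self.total = self.total * Decimal("0.9")


# WRONG — two dots: the caller reaches through Order into Customer
#   if order.customer.loyalty_points > 100: ...

# RIGHT — one dot: tell the Order to apply its own rule
order = Order(customer=Customer(loyalty_points=150), total=Decimal("100"))
order.apply_loyalty_discount()
```

The fix is the same move as Tell, Don't Ask below: the decision migrates into the object that owns the data, so callers never traverse the object graph.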
+ +**Tell, Don't Ask** — tell objects what to do; don't query state and decide externally. +```python +# WRONG +if order.status == OrderStatus.PENDING: + order.status = OrderStatus.CONFIRMED + +# RIGHT +order.confirm() +``` + +**Command-Query Separation** — a method either changes state (command) or returns a value (query), never both. +- Apply to domain objects. Do not fight stdlib (`list.pop()` is a known violation). + +### Python Zen (PEP 20) signals + +| Zen item | Code implication | +|---|---| +| Explicit is better than implicit | Explicit return types; explicit Protocol dependencies; no magic | +| Simple is better than complex | One function, one job; prefer a plain function over a class | +| Flat is better than nested | OC-1 — one indent level; early returns | +| Readability counts | OC-6 — no abbreviations; docstrings on every public item | +| Errors should never pass silently | No bare `except:`; no `except Exception: pass` | +| In the face of ambiguity, refuse to guess | Raise on invalid input; never silently return a default | + +### Type and docstring hygiene +- [ ] Type hints present on all public signatures +- [ ] Docstrings present on all public classes and methods diff --git a/.opencode/skills/session-workflow/SKILL.md b/.opencode/skills/session-workflow/SKILL.md index 85da1f7..d8c02bb 100644 --- a/.opencode/skills/session-workflow/SKILL.md +++ b/.opencode/skills/session-workflow/SKILL.md @@ -27,7 +27,7 @@ Every session starts by reading state. Every session ends by writing state. This 3. Run `git status` — understand what is committed vs. what is not 4. Confirm scope: you are working on exactly one step of one feature -If TODO.md says "No feature in progress", report to the PO that backlog features are waiting. 
**The software-engineer never self-selects a feature from the backlog — only the PO picks.** The PO must verify the feature has `Status: BASELINED` in its discovery section before moving it to `in-progress/` — if not baselined, the PO must complete Step 1 first. +If TODO.md says "No feature in progress", load `skill feature-selection` — it guides the PO through scoring and selecting the next BASELINED backlog feature. **The software-engineer never self-selects a feature from the backlog — only the PO picks.** The PO must verify the feature has `Status: BASELINED` in its discovery section before moving it to `in-progress/` — if not baselined, the PO must complete Step 1 first. ## Session End @@ -70,9 +70,15 @@ Source: docs/features/in-progress/<name>.feature - [ ] `@id:<hex>`: <description> ## Next -<One sentence: exactly what to do in the next session> +Run @<agent-name> — <one concrete action> ``` +**"Next" line format**: Always prefix with `Run @<agent-name>` so the human knows exactly which agent to invoke. Examples: +- `Run @software-engineer — implement @id:a1b2c3d4 (Step 3 RED)` +- `Run @reviewer — verify feature display-version at Step 4` +- `Run @product-owner — pick next BASELINED feature from backlog` +- `Run @product-owner — accept feature display-version at Step 5` + **Source path by step:** - Step 1: `Source: docs/features/backlog/<name>.feature` - Steps 2–4: `Source: docs/features/in-progress/<name>.feature` @@ -89,7 +95,7 @@ When no feature is active: # Current Work No feature in progress. -Next: PO picks a feature from docs/features/backlog/ that has Status: BASELINED and moves it to docs/features/in-progress/. +Next: Run @product-owner — load skill feature-selection and pick the next BASELINED feature from backlog. 
``` ## Step 3 (TDD Loop) Cycle-Aware TODO Format diff --git a/.opencode/skills/verify/SKILL.md b/.opencode/skills/verify/SKILL.md index 9a738a7..7d145f2 100644 --- a/.opencode/skills/verify/SKILL.md +++ b/.opencode/skills/verify/SKILL.md @@ -123,6 +123,7 @@ Read the source files changed in this feature. **Do this before running lint/sta | Contract test | Would test survive internal rewrite? | Yes | No | | No internal attribute access | Search for `_x` in assertions | None found | `_x`, `isinstance`, `type()` | | Every `@id` has a mapped test | Match `@id` to test functions | All mapped | Missing test | +| No orphaned skipped stubs | Search for `@pytest.mark.skip` in `tests/features/` | None found | Any found — stub was written but never implemented | | Function naming | Matches `test_<rule_slug>_<8char_hex>` | All match | Mismatch | | Hypothesis tests have `@slow` | Read every `@given` for `@slow` marker | All present | Any missing | @@ -218,6 +219,10 @@ Undeclared violations → REJECT. OR **REJECTED** — fix the following: 1. `<file>:<line>` — <specific, actionable fix> + +### Next Steps +**If APPROVED**: Run `@product-owner` — accept the feature at Step 5. +**If REJECTED**: Run `@software-engineer` — apply the fixes listed above, re-run quality gate, update Self-Declaration, then signal Step 4 again. ``` ## Standards Summary diff --git a/AGENTS.md b/AGENTS.md index ebde9e6..88102af 100644 --- a/AGENTS.md +++ b/AGENTS.md @@ -11,7 +11,7 @@ Features flow through 5 steps with a WIP limit of 1 feature at a time. 
The files ``` STEP 1: SCOPE (product-owner) → discovery + Gherkin stories + criteria -STEP 2: ARCH (software-engineer) → read all backlog features, design module structure +STEP 2: ARCH (software-engineer) → read all features + existing package files, write domain stubs (signatures only, no bodies); ADRs to docs/architecture/ STEP 3: TDD LOOP (software-engineer) → RED → GREEN → REFACTOR, one @id at a time STEP 4: VERIFY (reviewer) → run all commands, review code STEP 5: ACCEPT (product-owner) → demo, validate, move folder to completed/ @@ -40,9 +40,11 @@ STEP 5: ACCEPT (product-owner) → demo, validate, move folder to compl | Skill | Used By | Step | |---|---|---| | `session-workflow` | all agents | every session | +| `feature-selection` | product-owner | between features (idle state) | | `scope` | product-owner | 1 | | `implementation` | software-engineer | 2, 3 | -| `design-patterns` | software-engineer | 2 (on-demand, if smell detected), 3 (refactor) | +| `design-patterns` | software-engineer | 2, 3 (on-demand, when GoF pattern needed) | +| `refactor` | software-engineer | 3 (REFACTOR phase + preparatory refactoring) | | `verify` | reviewer | 4 | | `code-quality` | software-engineer | pre-handoff (redirects to `verify`) | | `pr-management` | software-engineer | 5 | @@ -87,6 +89,10 @@ docs/features/ in-progress/<feature-name>.feature ← file moves here at Step 2 completed/<feature-name>.feature ← file moves here at Step 5 +docs/architecture/ + STEP2-ARCH.md ← Step 2 reference diagram (canonical) + adr-NNN-<title>.md ← one per significant architectural decision + tests/ features/<feature-name>/ <rule-slug>_test.py ← one per Rule: block, software-engineer-written diff --git a/CHANGELOG.md b/CHANGELOG.md index b047d83..9ffc167 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -2,6 +2,25 @@ All notable changes to this template will be documented in this file. 
+## [v5.1.20260418] - Emergent Colugo - 2026-04-18 + +### Added +- **refactor skill**: Standalone skill with Fowler's full catalogue, green-bar rule, two-hats rule, SOLID/OC self-declaration table, and preparatory refactoring protocol — loaded on demand at REFACTOR phase +- **feature-selection skill**: WSJF-based backlog prioritisation (Reinertsen 2009) with Kano value scoring and dependency gate — PO loads this when `TODO.md` is idle +- **ADR template**: `docs/architecture/adr-template.md` for Step 2 architectural decisions +- **Logo and banner**: visual identity added to README (SVG assets in `docs/images/`) + +### Changed +- **Architecture stubs**: Step 2 now writes stubs directly into `<package>/` instead of an Architecture section in the feature file; stubs have no docstrings (add after GREEN when lint enforces them); folder structure is suggested, not prescribed — `ports/` and `adapters/` only created when a concrete external dependency is confirmed +- **design-patterns skill**: Narrowed to pure GoF catalogue (23 patterns, smell-triggered before/after examples); SOLID, OC, LoD, CQS, Python Zen moved to refactor skill self-declaration checklist +- **session-workflow**: `Next` line in TODO.md now requires `Run @<agent-name>` prefix so the human always knows which agent to invoke; idle state loads `skill feature-selection` instead of a vague prompt +- **verify skill**: Added orphaned-stub check (skip-marked tests that were never implemented); report template now includes structured `Next Steps` block directing the human to the correct agent +- **Scientific research**: `docs/academic_research.md` split into 9 domain files under `docs/scientific-research/` (cognitive-science, testing, architecture, oop-design, refactoring-empirical, requirements-elicitation, domain-modeling, software-economics, ai-agents) + +### Fixed +- Stale `docs/architecture/STEP2-ARCH.md` reference removed from workflow diagram and skill +- Protocol smell-check gate now marked N/A when no 
external dependencies are identified in scope + ## [v5.0.20260418] - Structured Phascolarctos - 2026-04-18 ### Added diff --git a/README.md b/README.md index 06db549..cc4aed1 100644 --- a/README.md +++ b/README.md @@ -1,4 +1,8 @@ -# Python Project Template +<div align="center"> + +<img src="docs/images/banner.svg" alt="Python Project Template" width="100%"/> + +<br/><br/> [![Contributors][contributors-shield]][contributors-url] [![Forks][forks-shield]][forks-url] @@ -6,165 +10,129 @@ [![Issues][issues-shield]][issues-url] [![MIT License][license-shield]][license-url] [![Coverage](https://img.shields.io/badge/coverage-100%25-brightgreen?style=for-the-badge)](https://nullhack.github.io/python-project-template/coverage/) - [![CI](https://img.shields.io/github/actions/workflow/status/nullhack/python-project-template/ci.yml?style=for-the-badge&label=CI)](https://github.com/nullhack/python-project-template/actions/workflows/ci.yml) [![Python](https://img.shields.io/badge/python-3.13-blue?style=for-the-badge)](https://www.python.org/downloads/) -> Python template to quickstart any project with production-ready workflow, quality tooling, and AI-assisted development. +**Production-ready Python scaffolding with a structured AI-agent workflow — from idea to shipped feature.** + +</div> + +--- ## Quick Start ```bash -# 1. Clone the template git clone https://github.com/nullhack/python-project-template cd python-project-template - -# 2. Install UV package manager (if not installed) -curl -LsSf https://astral.sh/uv/install.sh | sh - -# 3. Set up the development environment +curl -LsSf https://astral.sh/uv/install.sh | sh # skip if uv installed uv sync --all-extras - -# 4. Customize template placeholders for your project -opencode && @setup-project - -# 5. 
Validate everything works -uv run task test && uv run task lint && uv run task static-check && timeout 10s uv run task run +opencode && @setup-project # personalise for your project +uv run task test && uv run task lint && uv run task static-check ``` -## What This Template Provides +--- -### Development Workflow +## What You Get -A **5-step Kanban workflow** with WIP=1 (one feature at a time), enforced by the filesystem: +### A structured 5-step development cycle ``` -docs/features/backlog/ ← features waiting to be worked on -docs/features/in-progress/ ← exactly one feature being built -docs/features/completed/ ← accepted and shipped features +SCOPE → ARCH → TDD LOOP → VERIFY → ACCEPT ``` -**4 roles, 5 steps:** - -| Step | Role | What happens | -|------|------|-------------| -| 1. SCOPE | Product Owner | Discovery + Gherkin stories + `@id` criteria | -| 2. ARCH | Software Engineer | Design module structure, read all backlog features | -| 3. TDD LOOP | Software Engineer | RED→GREEN→REFACTOR, one `@id` at a time | -| 4. VERIFY | Reviewer | Run all commands, adversarial code review | -| 5. ACCEPT | Product Owner | Demo, validate, move folder to completed/ | +| Step | Who | What | +|------|-----|------| +| **SCOPE** | Product Owner | Discovery interviews → Gherkin stories → `@id` criteria | +| **ARCH** | Software Engineer | Module design, ADRs, test stubs | +| **TDD LOOP** | Software Engineer | RED → GREEN → REFACTOR, one `@id` at a time | +| **VERIFY** | Reviewer | Adversarial verification — default hypothesis: broken | +| **ACCEPT** | Product Owner | Demo, validate, ship | -### AI Agents +WIP limit of 1. 
Features are `.feature` files that move between filesystem folders: -```bash -@product-owner # Defines features, picks from backlog, accepts deliveries -@software-engineer # Architecture, tests, code, git, releases -@reviewer # Runs commands, reviews code — read+bash only -@setup-project # One-time template initialization +``` +docs/features/backlog/ ← waiting +docs/features/in-progress/ ← building (max 1) +docs/features/completed/ ← shipped ``` -### Skills +### AI agents included -```bash -/skill session-workflow # Read TODO.md, continue, hand off cleanly -/skill scope # Write user stories + acceptance criteria -/skill implementation # Steps 2-3: architecture + TDD loop -/skill design-patterns # Refactor with patterns when smell detected -/skill code-quality # Redirects to verify (quick reference) -/skill verify # Step 4 verification checklist -/skill pr-management # Branch naming, PR template, squash merge -/skill git-release # Hybrid calver versioning, themed naming -/skill create-skill # Add new skills to the system -/skill create-agent # Add new agents (human-user only) ``` +@product-owner — scope, stories, acceptance +@software-engineer — architecture, TDD, git, releases +@reviewer — adversarial verification +@setup-project — one-time project initialisation +``` + +### Quality tooling, pre-configured -## Development Commands +| Tool | Role | +|------|------| +| `uv` | Package & environment management | +| `ruff` | Lint + format (Google docstrings) | +| `pyright` | Static type checking — 0 errors | +| `pytest` + `hypothesis` | Tests + property-based testing | +| `pytest-cov` | Coverage — 100% required | +| `pdoc` | API docs → GitHub Pages | +| `taskipy` | Task runner | + +--- + +## Commands ```bash -uv run task run # Run the application (humans) -timeout 10s uv run task run # Run with timeout (agents — exit 124 = hung = FAIL) -uv run task test # Full test suite with 100% coverage (Step 4 handoff) -uv run task test-fast # Fast tests no coverage (Step 3 
Red-Green-Refactor cycle) -uv run task lint # ruff check + format (Step 4 handoff) -uv run task static-check # pyright type checking (Step 4 handoff) +uv run task test # Full suite + coverage +uv run task test-fast # Fast, no coverage (use during TDD loop) +uv run task lint # ruff check + format +uv run task static-check # pyright +uv run task run # Run the app ``` -## Code Quality Standards +--- + +## Code Standards -| Standard | Target | -|----------|--------| +| | | +|---|---| | Coverage | 100% | -| Type checking | pyright, 0 errors | -| Linting | ruff, 0 issues, Google docstrings | +| Type errors | 0 | | Function length | ≤ 20 lines | | Class length | ≤ 50 lines | | Max nesting | 2 levels | -| Principles | YAGNI > KISS > DRY > SOLID > Object Calisthenics | +| Principles | YAGNI › KISS › DRY › SOLID › Object Calisthenics | + +--- -## Test Conventions +## Test Convention ```python @pytest.mark.skip(reason="not yet implemented") -def test_bounce_physics_a3f2b1c4() -> None: +def test_feature_a3f2b1c4() -> None: """ - Given: A ball moving upward reaches y=0 - When: The physics engine processes the next frame - Then: The ball velocity y-component becomes positive + Given: ... + When: ... + Then: ... """ - # Given - ... - # When - ... - # Then - ... 
``` -**Markers**: `@pytest.mark.slow` · `@pytest.mark.deprecated` - -## Technology Stack - -| Category | Tools | -|----------|-------| -| Package management | uv | -| Task automation | taskipy | -| Linting + formatting | Ruff | -| Type checking | PyRight | -| Testing | pytest + Hypothesis | -| Coverage | pytest-cov (100% required) | -| Documentation | pdoc + ghp-import | -| AI development | OpenCode agents + skills | - -## Documentation Site - -Published at [nullhack.github.io/python-project-template](https://nullhack.github.io/python-project-template): -- **API Reference** — pdoc-generated from source docstrings -- **Coverage Report** — line-by-line coverage breakdown -- **Test Results** — full pytest run results +Each test is traced to exactly one `@id` acceptance criterion. -## Release Versioning - -Format: `v{major}.{minor}.{YYYYMMDD}` +--- -Each release gets a unique **adjective-animal** name generated from the commit/PR content. +## Versioning -## Contributing +`v{major}.{minor}.{YYYYMMDD}` — each release gets a unique adjective-animal name. -```bash -git clone https://github.com/nullhack/python-project-template -uv sync --all-extras -uv run task test && uv run task lint -``` +--- ## License MIT — see [LICENSE](LICENSE). 
---- - -**Author:** eol ([@nullhack](https://github.com/nullhack)) -**Documentation:** [nullhack.github.io/python-project-template](https://nullhack.github.io/python-project-template) +**Author:** [@nullhack](https://github.com/nullhack) · [Documentation](https://nullhack.github.io/python-project-template) -<!-- MARKDOWN LINKS & IMAGES --> +<!-- MARKDOWN LINKS --> [contributors-shield]: https://img.shields.io/github/contributors/nullhack/python-project-template.svg?style=for-the-badge [contributors-url]: https://github.com/nullhack/python-project-template/graphs/contributors [forks-shield]: https://img.shields.io/github/forks/nullhack/python-project-template.svg?style=for-the-badge diff --git a/docs/academic_research.md b/docs/academic_research.md deleted file mode 100644 index 129a1e1..0000000 --- a/docs/academic_research.md +++ /dev/null @@ -1,912 +0,0 @@ -# Academic Research — Theoretical Foundations - -This document explains the cognitive and social-science mechanisms that justify the workflow reforms in this template. Each mechanism is grounded in peer-reviewed research. - ---- - -## Mechanisms - -### 1. Pre-mortem (Prospective Hindsight) - -| | | -|---|---| -| **Source** | Klein, G. (1998). *Sources of Power: How People Make Decisions*. MIT Press. | -| **Date** | 1998 | -| **URL** | — | -| **Alternative** | — | -| **Status** | Confirmed | -| **Core finding** | Asking "imagine this failed — why?" catches 30% more issues than forward-looking review. | -| **Mechanism** | Prospective hindsight shifts from prediction (weak) to explanation (strong). The brain is better at explaining past events than predicting future ones. By framing as "it already failed," you activate explanation mode. | -| **Where used** | PO pre-mortem at scope, developer pre-mortem before handoff. | - ---- - -### 2. Implementation Intentions - -| | | -|---|---| -| **Source** | Gollwitzer, P. M. (1999). Implementation intentions: Strong effects of simple planning aids. 
*American Journal of Preventive Medicine*, 16(4), 257–276. | -| **Date** | 1999 | -| **URL** | — | -| **Alternative** | — | -| **Status** | Confirmed | -| **Core finding** | "If X then Y" plans are 2–3x more likely to execute than general intentions. | -| **Mechanism** | If-then plans create automatic cue-response links in memory. The brain processes "if function > 20 lines then extract helper" as an action trigger, not a suggestion to consider. | -| **Where used** | Refactor Self-Check Gates in `implementation/SKILL.md`, Code Quality checks in `verify/SKILL.md`. | - ---- - -### 3. Commitment Devices - -| | | -|---|---| -| **Source** | Cialdini, R. B. (2001). *Influence: The Psychology of Persuasion* (rev. ed.). HarperBusiness. | -| **Date** | 2001 | -| **URL** | — | -| **Alternative** | — | -| **Status** | Confirmed | -| **Core finding** | Forcing an explicit micro-commitment (filling in a PASS/FAIL cell) creates resistance to reversals. A checkbox checked is harder to uncheck than a todo noted. | -| **Mechanism** | Structured tables with PASS/FAIL cells create commitment-device effects. The act of marking "FAIL" requires justification, making silent passes psychologically costly. | -| **Where used** | SOLID enforcement table, ObjCal enforcement table, Design Patterns table — all require explicit PASS/FAIL with evidence. | - ---- - -### 4. System 2 Before System 1 - -| | | -|---|---| -| **Source** | Kahneman, D. (2011). *Thinking, Fast and Slow*. Farrar, Straus and Giroux. | -| **Date** | 2011 | -| **URL** | — | -| **Alternative** | — | -| **Status** | Confirmed | -| **Core finding** | System 1 (fast, automatic) is vulnerable to anchoring and confirmation bias. System 2 (slow, deliberate) must be activated before System 1's judgments anchor. | -| **Mechanism** | Running semantic review *before* automated commands prevents the "all green" dopamine hit from anchoring the reviewer's judgment. Doing hard cognitive work first protects against System 1 shortcuts. 
| -| **Where used** | Verification order in `verify/SKILL.md`: semantic alignment check before commands. | - ---- - -### 5. Adversarial Collaboration - -| | | -|---|---| -| **Source** | Mellers, B. A., Hertwig, R., & Kahneman, D. (2001). Do frequency representations eliminate cooperative bias? *Psychological Review*, 108(4), 709–735. | -| **Date** | 2001 | -| **URL** | — | -| **Alternative** | — | -| **Status** | Confirmed | -| **Core finding** | Highest-quality thinking emerges when parties hold different hypotheses and are charged with finding flaws in each other's reasoning. | -| **Mechanism** | Explicitly framing the reviewer as "your job is to break this feature" activates the adversarial collaboration mode. The reviewer seeks disconfirmation rather than confirmation. | -| **Where used** | Adversarial mandate in `reviewer.md` and `verify/SKILL.md`. | - ---- - -### 6. Accountability to Unknown Audience - -| | | -|---|---| -| **Source** | Tetlock, P. E. (1983). Accountability: A social determinant of judgment. In M. D. B. T. Strother (Ed.), *Psychology of Learning and Motivation* (Vol. 17, pp. 295–332). Academic Press. | -| **Date** | 1983 | -| **URL** | — | -| **Alternative** | — | -| **Status** | Confirmed | -| **Core finding** | Accountability to an unknown audience with unknown views improves reasoning quality. The agent anticipates being audited and adjusts reasoning. | -| **Mechanism** | The explicit report format (APPROVED/REJECTED with evidence) creates an accountability structure — the reviewer's reasoning will be read by the PO. | -| **Where used** | Report format in `verify/SKILL.md`, structured evidence columns in all enforcement tables. | - ---- - -### 7. Chunking and Cognitive Load Reduction - -| | | -|---|---| -| **Source** | Miller, G. A. (1956). The magical number seven, plus or minus two. *Psychological Review*, 63(2), 81–97. | -| **Date** | 1956 | -| **URL** | — | -| **Alternative** | Sweller, J. (1988). Cognitive load during problem solving. 
*Cognitive Science*, 12(2), 257–285. | -| **Status** | Confirmed | -| **Core finding** | Structured tables reduce working memory load vs. narrative text. Chunking related items into table rows enables parallel processing. | -| **Mechanism** | Replacing prose checklists ("Apply SOLID principles") with structured tables (5 rows, 4 columns) allows the reviewer to process all items in a single pass. | -| **Where used** | All enforcement tables in `verify/SKILL.md` and `reviewer.md`. | - ---- - -### 8. Elaborative Encoding - -| | | -|---|---| -| **Source** | Craik, F. I. M., & Lockhart, R. S. (1972). Levels of processing: A framework for memory research. *Journal of Verbal Learning and Verbal Behavior*, 11(6), 671–684. | -| **Date** | 1972 | -| **URL** | — | -| **Alternative** | — | -| **Status** | Confirmed | -| **Core finding** | Deeper processing — explaining *why* a rule matters — leads to better retention and application than shallow processing (just listing rules). | -| **Mechanism** | Adding a "Why it matters" column to enforcement tables forces the reviewer to process the rationale, not just scan the rule name. | -| **Where used** | SOLID table, ObjCal table, Design Patterns table — all have "Why it matters" column. | - ---- - -### 9. Error-Specific Feedback - -| | | -|---|---| -| **Source** | Hattie, J., & Timperley, H. (2007). The power of feedback. *Review of Educational Research*, 77(1), 81–112. | -| **Date** | 2007 | -| **URL** | — | -| **Alternative** | — | -| **Status** | Confirmed | -| **Core finding** | Feedback is most effective when it tells the agent exactly what went wrong and what the correct action is. "FAIL: function > 20 lines at file:47" is actionable. "Apply function length rules" is not. | -| **Mechanism** | The evidence column in enforcement tables requires specific file:line references, turning vague rules into actionable directives. | -| **Where used** | Evidence column in all enforcement tables. | - ---- - -### 10. 
Prospective Memory Cues - -| | | -|---|---| -| **Source** | McDaniel, M. A., & Einstein, G. O. (2000). Strategic and automatic processes in prospective memory retrieval. *Applied Cognitive Psychology*, 14(7), S127–S144. | -| **Date** | 2000 | -| **URL** | — | -| **Alternative** | — | -| **Status** | Confirmed | -| **Core finding** | Memory for intended actions is better when cues are embedded at the point of action, not in a separate appendix. | -| **Mechanism** | Placing if-then gates inline (in the REFACTOR section) rather than in a separate "reference" document increases adherence. The cue appears exactly when the developer is about to make the relevant decision. | -| **Where used** | Refactor Self-Check Gates embedded inline in `implementation/SKILL.md`. | - ---- - -### 11. Observable Behavior Testing - -| | | -|---|---| -| **Source** | Fowler, M. (2018). *The Practical Test Pyramid*. Thoughtworks. https://martinfowler.com/articles/practical-test-pyramid.html | -| **Date** | 2018 | -| **URL** | https://martinfowler.com/articles/practical-test-pyramid.html | -| **Alternative** | — | -| **Status** | Confirmed | -| **Core finding** | Tests should answer "if I enter X and Y, will the result be Z?" — not "will method A call class B first?" | -| **Mechanism** | A test is behavioral if its assertion describes something a caller/user can observe without knowing the implementation. The test should still pass if you completely rewrite the internals. | -| **Where used** | Contract test rule in `tdd/SKILL.md`: "Write every test as if you cannot see the production code." | - ---- - -### 12. Test-Behavior Alignment - -| | | -|---|---| -| **Source** | Google Testing Blog (2013). *Testing on the Toilet: Test Behavior, Not Implementation*. 
| -| **Date** | 2013 | -| **URL** | — | -| **Alternative** | — | -| **Status** | Confirmed | -| **Core finding** | Test setup may need to change if implementation changes, but the actual test shouldn't need to change if the code's user-facing behavior doesn't change. | -| **Mechanism** | Tests that are tightly coupled to implementation break on refactoring and become a drag on design improvement. Behavioral tests survive internal rewrites. | -| **Where used** | Contract test rule + bad example in `tdd/SKILL.md`, reviewer verification check in `reviewer.md`. | - ---- - -### 13. Tests as First-Class Citizens - -| | | -|---|---| -| **Source** | Martin, R. C. (2017). *First-Class Tests*. Clean Coder Blog. | -| **Date** | 2017 | -| **URL** | — | -| **Alternative** | — | -| **Status** | Confirmed | -| **Core finding** | Tests should be treated as first-class citizens of the system — not coupled to implementation. Bad tests are worse than no tests because they give false confidence. | -| **Mechanism** | Tests written as "contract tests" — describing what the caller observes — remain stable through refactoring. Tests that verify implementation details are fragile and create maintenance burden. | -| **Where used** | Contract test rule in `tdd/SKILL.md`, verification check in `reviewer.md`. | - ---- - -### 14. Property-Based Testing (Invariant Discovery) - -| | | -|---|---| -| **Source** | MacIver, D. R. (2016). *What is Property Based Testing?* Hypothesis. https://hypothesis.works/articles/what-is-property-based-testing/ | -| **Date** | 2016 | -| **URL** | https://hypothesis.works/articles/what-is-property-based-testing/ | -| **Alternative** | — | -| **Status** | Confirmed | -| **Core finding** | Property-based testing is "the construction of tests such that, when these tests are fuzzed, failures reveal problems that could not have been revealed by direct fuzzing." 
Property tests test *invariants* — things that must always be true about the contract, not things that fall out of how you wrote it. | -| **Mechanism** | Meaningful property tests assert invariants: "assert Score(x).value >= 0" tests the contract. Tautological tests assert reconstruction: "assert Score(x).value == x" tests the implementation. | -| **Where used** | Meaningful vs. Tautological table in `tdd/SKILL.md`, Property-Based Testing Decision Rule table in `tdd/SKILL.md`. | - ---- - -### 15. Mutation Testing (Test Quality Verification) - -| | | -|---|---| -| **Source** | King, K. N., & Offutt, A. J. (1991). A Fortran language system for mutation-based software testing. *Software: Practice and Experience*, 21(7), 685–718. | -| **Date** | 1991 | -| **URL** | — | -| **Alternative** | Mutation testing tools: Cosmic Ray, mutmut (Python) | -| **Status** | Confirmed | -| **Core finding** | A meaningful test fails when a mutation (small deliberate code change) is introduced. A tautological test passes even with mutations because it doesn't constrain the behavior. | -| **Mechanism** | If a test survives every mutation of the production code without failing, it tests nothing. Only tests that fail on purposeful "damage" to the code are worth keeping. | -| **Where used** | Note in `tdd/SKILL.md` Quality Rules (implicitly encouraged: tests must describe contracts, not implementation, which is the theoretical complement to mutation testing). | - ---- - -### 16. Cost of Change Curve (Shift Left) - -| | | -|---|---| -| **Source** | Boehm, B. W. (1981). *Software Engineering Economics*. Prentice-Hall. | -| **Date** | 1981 | -| **URL** | — | -| **Alternative** | Boehm, B., & Papaccio, P. N. (1988). Understanding and controlling software costs. *IEEE Transactions on Software Engineering*, 14(10), 1462–1477. | -| **Status** | Confirmed | -| **Core finding** | The cost to fix a defect climbs steeply with each SDLC phase: requirements (1x) → design (5x) → coding (10x) → testing (20x) → production (200x).
A defect caught during requirements costs 200x less than the same defect found after release. | -| **Mechanism** | Defects compound downstream: a wrong requirement becomes a wrong design, which becomes wrong code, which becomes wrong tests, all of which must be unwound. Catching errors at the source eliminates the entire cascade. This is the empirical foundation for "shift left" — investing earlier in quality always dominates fixing later. | -| **Where used** | Justifies the multi-session PO elicitation model: every acceptance criterion clarified at scope prevents 10–200x rework downstream. Also justifies the adversarial pre-mortem at the end of each elicitation cycle, and the adversarial mandate in `verify/SKILL.md`. The entire 6-step pipeline is ordered to surface defects at the earliest (cheapest) phase. | - ---- - -### 17. INVEST Criteria for User Stories - -| | | -|---|---| -| **Source** | Wake, B. (2003). *INVEST in Good Stories, and SMART Tasks*. XP123.com. | -| **Date** | 2003 | -| **URL** | — | -| **Alternative** | Cohn, M. (2004). *User Stories Applied: For Agile Software Development*. Addison-Wesley. | -| **Status** | Confirmed | -| **Core finding** | Stories that are Independent, Negotiable, Valuable, Estimable, Small, and Testable produce fewer downstream defects and smoother development cycles. Stories that fail INVEST — especially "Testable" and "Small" — are the leading cause of scope creep and unbounded iteration. | -| **Mechanism** | INVEST serves as a quality gate before stories enter development. "Testable" forces the PO to express observable outcomes (directly enabling Given/When/Then). "Small" forces decomposition, which reduces cognitive load and makes estimation feasible. "Independent" prevents hidden ordering dependencies between stories. | -| **Where used** | INVEST gate in Phase 3 of `scope/SKILL.md`. PO verifies every story against all 6 letters before committing. | - ---- - -### 18. 
Example Mapping (Rules Layer) - -| | | -|---|---| -| **Source** | Wynne, M. (2015). *Introducing Example Mapping*. Cucumber Blog. https://cucumber.io/blog/bdd/example-mapping-introduction/ | -| **Date** | 2015 | -| **URL** | https://cucumber.io/blog/bdd/example-mapping-introduction/ | -| **Alternative** | — | -| **Status** | Confirmed | -| **Core finding** | Inserting a "rules" layer between stories and examples prevents redundant or contradictory acceptance criteria. A story with many rules needs splitting; a story with many open questions is not ready for development. | -| **Mechanism** | Example Mapping uses four card types: Story (yellow), Rules (blue), Examples (green), Questions (red). The rules layer groups related examples under the business rule they illustrate. Without this layer, POs jump from story directly to examples and lose the reasoning that connects them. Red cards (unanswered questions) are a first-class signal to stop and investigate rather than assume. | -| **Where used** | `## Rules` section in per-feature `discovery.md` (Phase 2). PO identifies business rules before writing Examples in Phase 4, making the reasoning behind Example clusters visible and reviewable. | - ---- - -### 19. Declarative Gherkin - -| | | -|---|---| -| **Source** | Cucumber Team. (2024). *Better Gherkin*. Cucumber Documentation. https://cucumber.io/docs/bdd/better-gherkin/ | -| **Date** | 2024 | -| **URL** | https://cucumber.io/docs/bdd/better-gherkin/ | -| **Alternative** | — | -| **Status** | Confirmed | -| **Core finding** | Declarative Gherkin ("When Bob logs in") produces specifications that survive UI changes. Imperative Gherkin ("When I click the Login button") couples specs to implementation details and breaks on every UI redesign. | -| **Mechanism** | Declarative steps describe *what happens* at the business level. Imperative steps describe *how the user interacts with a specific UI*. 
The distinction maps to the abstraction level: declarative = behavior contract, imperative = interaction script. AI agents are especially prone to writing imperative Gherkin because they mirror literal steps. | -| **Where used** | Declarative vs. imperative table in Phase 4 of `scope/SKILL.md`. PO is explicitly instructed to write behavior descriptions, not UI interaction scripts. | - ---- - -### 20. MoSCoW Prioritization (Within-Story Triage) - -| | | -|---|---| -| **Source** | Clegg, D., & Barker, R. (1994). *Case Method Fast-Track: A RAD Approach*. Addison-Wesley (DSDM origin). | -| **Date** | 1994 | -| **URL** | — | -| **Alternative** | — | -| **Status** | Confirmed | -| **Core finding** | Classifying requirements as Must/Should/Could/Won't forces explicit negotiation about what is essential vs. desired. When applied *within* a single story (not just across a backlog), it reveals bloated stories that should be split. | -| **Mechanism** | DSDM mandates that Musts cannot exceed 60% of total effort. At the story level: if a story has 12 Examples and only 3 are Musts, the remaining 9 can be deferred or split into a follow-up story. This prevents gold-plating and keeps stories small. | -| **Where used** | MoSCoW triage in Phase 4 of `scope/SKILL.md`. PO applies Must/Should/Could when a story exceeds 5 Examples. | - ---- - -### 21. Minimal-Scope Agent Design - -| | | -|---|---| -| **Source** | OpenAI. (2024). *Agent definitions*. OpenAI Agents SDK Documentation. | -| **Date** | 2024 | -| **URL** | https://platform.openai.com/docs/guides/agents/define-agents | -| **Alternative** | Anthropic. (2024). *Building effective agents*. Anthropic Engineering Blog. https://www.anthropic.com/engineering/building-effective-agents | -| **Status** | Confirmed — corrects the belief that subagents should be "lean routing agents" | -| **Core finding** | "Define the smallest agent that can own a clear task. 
Add more agents only when you need separate ownership, different instructions, different tool surfaces, or different approval policies." The split criterion is ownership boundary, not instruction volume. | -| **Mechanism** | Multiple agents competing to own the same concern create authority conflicts and inconsistent tool access. The right unit is the smallest coherent domain that requires exclusive responsibility. Keeping handoff descriptions short and concrete enables routing agents to select the right specialist. | -| **Where used** | Agent design in `.opencode/agents/*.md` — 4 agents, each owning a distinct domain (PO, developer, reviewer, setup). | - ---- - -### 22. Context Isolation via Subagents - -| | | -|---|---| -| **Source** | Anthropic. (2025). *Best practices for Claude Code*. Anthropic Documentation. | -| **Date** | 2025 | -| **URL** | https://www.anthropic.com/engineering/claude-code-best-practices | -| **Alternative** | — | -| **Status** | Confirmed — the primary reason subagents exist is context isolation, not routing | -| **Core finding** | Subagents run in their own context windows and report back summaries, keeping the main conversation clean for implementation. Every file read in a subagent burns tokens in a child window, not the primary window. | -| **Mechanism** | Context window is the primary performance constraint for LLM agents. Investigation tasks (reading many files, exploring a codebase) rapidly exhaust context if done inline. Delegating to a subagent quarantines that cost; the primary agent receives only the distilled result. A fresh context in the subagent also prevents anchoring bias from prior conversation state. | -| **Where used** | OpenCode `task` tool usage in all agents; `explore` and `general` built-in subagents; explicit subagent invocations in `.opencode/agents/developer.md`. | - ---- - -### 23. On-Demand Skill Loading (Context Budget) - -| | | -|---|---| -| **Source** | Anthropic. (2025). *Best practices for Claude Code*. 
Anthropic Documentation. | -| **Date** | 2025 | -| **URL** | https://www.anthropic.com/engineering/claude-code-best-practices | -| **Alternative** | OpenCode. (2026). *Agent Skills*. OpenCode Documentation. https://opencode.ai/docs/skills/ | -| **Status** | Confirmed (vendor guidance) — benefit on task completion quality is extrapolated from RAG retrieval literature; not directly A/B-tested on agent instruction architectures | -| **Core finding** | "CLAUDE.md is loaded every session, so only include things that apply broadly. For domain knowledge or workflows only relevant sometimes, use skills instead. Claude loads them on demand without bloating every conversation." Bloated always-loaded files cause Claude to ignore critical instructions. | -| **Mechanism** | Every token in an unconditionally-loaded file competes for attention against the task prompt. Long AGENTS.md/CLAUDE.md files push important instructions beyond effective attention range, causing silent non-compliance. Procedural workflows moved to skills are injected only when the task calls for them, preserving the primary context budget. This is the same principle as lazy loading in software: pay the cost only when needed. | -| **Where used** | `AGENTS.md` carries only shared project conventions and commands; all step-specific workflows live in `.opencode/skills/*.md` and are loaded via the `skill` tool only when the relevant step begins. | - ---- - -### 24. Instruction Conflict Resolution Failure in LLMs - -| | | -|---|---| -| **Source** | Geng et al. (2025). Control Illusion: The Failure of Instruction Hierarchies in Large Language Models. AAAI-26. arXiv:2502.15851. | -| **Date** | 2025 | -| **URL** | https://arxiv.org/abs/2502.15851 | -| **Alternative** | Wallace et al. (2024). The Instruction Hierarchy: Training LLMs to Prioritize Privileged Instructions. arXiv:2404.13208. 
| -| **Status** | Confirmed — peer-reviewed (AAAI-26), replicated across 6 models; corroborated by OpenAI training research (Wallace et al.) | -| **Core finding** | LLMs do not reliably prioritize system-prompt instructions over conflicting instructions from other sources. Resolution is inconsistent and biased by pretraining-derived priors, not by prompt structure or position. A dedicated training regime is required to make hierarchy reliable; without it, conflicts are resolved unpredictably. | -| **Mechanism** | No structural separation between instruction sources enforces reliable priority at inference time. When the same directive appears in two locations with divergent content, the model selects between them based on statistical priors from pretraining, not on explicit authority. | -| **Where used** | Justifies single source of truth in `AGENTS.md`: workflow details duplicated across agent files and skills that drift out of sync produce conflicting instructions the model cannot resolve reliably. | - ---- - -### 25. Positional Attention Degradation in Long Contexts - -| | | -|---|---| -| **Source** | Liu et al. (2023). Lost in the Middle: How Language Models Use Long Contexts. *Transactions of the Association for Computational Linguistics*. arXiv:2307.03172. | -| **Date** | 2023 | -| **URL** | https://arxiv.org/abs/2307.03172 | -| **Alternative** | McKinnon (2025). arXiv:2511.05850 — effect attenuated for simple retrieval in Gemini 2.5+; persists for multi-hop reasoning (HAMLET, EMNLP 2025; SealQA, ICLR 2026). | -| **Status** | Confirmed with caveat — robust for multi-hop reasoning; attenuated for simple retrieval in frontier models (2025–2026) | -| **Core finding** | Performance on tasks requiring retrieval from long contexts follows a U-shaped curve: highest when relevant content is at the beginning or end of the context, degraded when content falls in the middle. | -| **Mechanism** | Transformer attention is not uniform across token positions. 
Content placed in the middle of a long context receives less attention weight regardless of its relevance. | -| **Where used** | Supports keeping always-loaded files (`AGENTS.md`, agent routing files) lean. Duplicated workflow detail in always-loaded files increases total context length, pushing other content into lower-attention positions. | - ---- - -### 26. Modular Prompt De-duplication Reduces Interference - -| | | -|---|---| -| **Source** | Sharma & Henley (2026). Modular Prompt Optimization. arXiv:2601.04055. | -| **Date** | 2026 | -| **URL** | https://arxiv.org/abs/2601.04055 | -| **Alternative** | — | -| **Status** | Partially confirmed — single-agent reasoning benchmarks (ARC-Challenge, MMLU) only; not tested on multi-file agent architectures | -| **Core finding** | Structured prompts with explicit section de-duplication outperform both monolithic prompts and unstructured modular prompts. The mechanism cited is "reducing redundancy and interference between components." | -| **Mechanism** | Redundant content across prompt sections creates competing attention targets. De-duplication concentrates relevant signal in one canonical location per concern. | -| **Where used** | Supports the rule that skills and agent routing files contain no duplication of `AGENTS.md` content or of each other. | - ---- - -### 27. Agent File Architecture — Three-File Separation - -| | | -|---|---| -| **Source** | Convergence of entries 23, 24, 25, 26. | -| **Date** | — | -| **URL** | — | -| **Alternative** | — | -| **Status** | Inferred — no direct A/B test of this architecture exists; supported by convergence of confirmed and partially confirmed findings above | -| **Core finding** | Three distinct failure modes (instruction conflict on drift, positional attention degradation in long contexts, redundancy interference) converge to produce a three-file split with defined content rules for each. | -| **Mechanism** | Each file runs at a different time and serves a different purpose. 
Mixing concerns across files reintroduces the failure modes the split is designed to prevent. | -| **Where used** | Structural rule for `AGENTS.md`, `.opencode/agents/*.md`, and `.opencode/skills/*.md`. | - -| File | Runs when | Contains | Does NOT contain | -|---|---|---|---| -| `AGENTS.md` | Every session, always loaded | Project conventions, shared commands, formats, standards | Step procedures, role-specific rules, path specs | -| `.opencode/agents/*.md` | When that role is invoked | Role identity, step ownership, skill load instructions, tool permissions, escalation paths | Workflow details, principle lists, path specs, commit formats | -| `.opencode/skills/*.md` | On demand, when that step begins | Full procedural instructions for that step, self-contained | Duplication of `AGENTS.md` content or other skills | - ---- - -### 28. Active Listening — Paraphrase-Clarify-Summarize - -| | | -|---|---| -| **Source** | Rogers, C. R., & Farson, R. E. (1957). *Active Listening*. Industrial Relations Center, University of Chicago. | -| **Date** | 1957 | -| **URL** | — | -| **Alternative** | McNaughton, D. et al. (2008). Learning to Listen. *Topics in Early Childhood Special Education*, 27(4), 223–231. (LAFF strategy: Listen, Ask, Focus, Find) | -| **Status** | Confirmed — foundational clinical research; widely replicated across professional and educational contexts | -| **Core finding** | Active listening — paraphrasing what was heard in the listener's own words, asking clarifying questions, then summarizing the main points and intent — reduces misunderstanding, builds trust, and confirms mutual understanding before proceeding. The three-step responding sequence (Paraphrase → Clarify → Summarize) is the operationalizable form of the broader active listening construct. | -| **Mechanism** | Paraphrasing forces the listener to reconstruct the speaker's meaning in their own language, surfacing gaps immediately. Clarifying questions address residual ambiguity. 
Summarizing creates a shared record that both parties can confirm or correct. Together they eliminate the assumption that "I heard" equals "I understood." Without this protocol, agents (human or AI) proceed on partial or misread requirements, producing work that is technically complete but semantically wrong. | -| **Where used** | PO summarization protocol in `scope/SKILL.md`: after each interview round, the PO must produce a "Here is what I understood" block (paraphrase → clarify → summarize) before moving to Phase 3 (Stories) or Phase 4 (Criteria). The stakeholder confirms or corrects before the PO proceeds. | - ---- - -### 28a. Active Listening — Three-Level Structure and Level 3 Uses (Synthesis) - -| | | -|---|---| -| **Source** | Synthesis of: Nielsen, J. (2010). *Interviewing Users*. Nielsen Norman Group. + Farrell, S. (2017). UX Research Cheat Sheet. NN/G. + Ambler, S. W. (2002). *Agile Modeling*. Wiley (agilemodeling.com). + Wynne, M. (2015). Introducing Example Mapping. Cucumber Blog. | -| **Date** | 2010–2015 (synthesis) | -| **URL** | https://www.nngroup.com/articles/interviewing-users/ ; https://www.agilemodeling.com/essays/fdd.htm ; https://cucumber.io/blog/bdd/example-mapping-introduction/ | -| **Alternative** | — | -| **Status** | Synthesized rule of thumb — each component individually confirmed; the three-level structure is a practitioner synthesis | -| **Core finding** | Active listening in requirements interviews operates at three granularities simultaneously, not as a single end-of-interview act: **Level 1** (per answer) — immediate paraphrase to catch misunderstanding on the spot; **Level 2** (per topic cluster) — transition summary before moving to the next area, acting as a recovery point; **Level 3** (end of interview) — full synthesis, which serves four distinct downstream purposes. | -| **Mechanism** | Each level addresses a different failure mode. Level 1 prevents individual answer misreads from propagating. 
Level 2 prevents topic-cluster drift and allows mid-interview correction. Level 3 crystallizes scope and triggers the formal baseline. Without the level structure, practitioners collapse all three into a single end-of-interview summary, which is too late for Level 1 and 2 misunderstandings to be caught cheaply. | -| **Level 3 — four uses** | 1. **Accuracy gate** (NN/G): stakeholder confirms or corrects the summary before it is used downstream — prevents misread requirements from being frozen. 2. **Scope crystallization** (Ambler/FDD): the summary answers "what problems must this system solve?" and becomes the initial requirements stack. 3. **Input to domain modeling** (Ambler/FDD): nouns and verbs extracted from the Level 3 summary are the raw material for the Entities table — domain analysis cannot begin before this summary exists. 4. **Baseline trigger** (Wynne/Cucumber Example Mapping): when the stakeholder says "yes, that's right" to the summary, discovery is considered complete and frozen. | -| **Where used** | Phase 1 and Phase 2 of `scope/SKILL.md`: PO applies Level 1 during each exchange, Level 2 when transitioning between topic areas, and Level 3 at the end of each interview phase before proceeding to feature stubs (Phase 1) or user stories (Phase 2). | - ---- - -### 29. The Kipling Method — Five Ws and One H - -| | | -|---|---| -| **Source** | Kipling, R. (1902). *Just So Stories*. Macmillan. | -| **Date** | 1902 | -| **URL** | — | -| **Alternative** | Hermagoras of Temnos (2nd century BCE) — seven circumstances of rhetoric; Thomas Wilson (1560) — "The Arte of Rhetoric"; Aristotle's Nicomachean Ethics | -| **Status** | Practitioner synthesis — journalism, business analysis, and investigative methodology | -| **Core finding** | The six interrogative questions (Who, What, When, Where, Why, How) form a complete framework for gathering all essential facts about any event or situation. No single question can be answered with a simple yes/no. 
Together they ensure completeness and prevent gaps in understanding. | -| **Mechanism** | The framework originated in ancient Greek rhetoric (Aristotle's "elements of circumstance"), was formalized in 16th-century English rhetoric (Wilson), popularized by Kipling's 1902 poem calling them "six honest serving-men," and became standard in journalism by 1917. The BA community adapted it to requirements gathering by adding "How" as the sixth question, creating the 5W1H framework used in business analysis today. | -| **Where used** | Phase 1 project discovery: the initial seven questions (Who, What, Why, When, Where, Success, Failure, Out-of-scope) are an adaptation of the 5W1H framework. "Success" maps to "Why" (purpose), "Failure" maps to constraints, "Out-of-scope" defines project boundaries. | - ---- - -### 30. BA Requirements Question Framework - -| | | -|---|---| -| **Source** | Brandenburg, L. (2025). *Requirements Discovery Checklist Pack*. TechCanvass. | -| **Date** | 2025 | -| **URL** | https://businessanalyst.techcanvass.com/requirements-gathering-questions-for-ba/ | -| **Alternative** | Sherwen (2025). "10 Questions to Consider During Requirements Gathering."; Practical Analyst (2024). "Requirements Elicitation: Most Valuable Questions." | -| **Status** | Practitioner synthesis — consolidated BA methodology, not peer-reviewed | -| **Core finding** | Ten questions consistently make the most difference in requirements elicitation: (1) What problem are we solving? (2) What happens if we do nothing? (3) Who uses this? (4) What does success look like? (5) Walk me through how this works today (6) Where does this usually break? (7) What decisions will this help? (8) What should definitely not happen? (9) What happens if input is wrong? (10) What assumptions are we making? | -| **Mechanism** | The first four questions define scope and purpose. Questions 5-6 probe current state and pain points. Questions 7-8 identify business value and constraints. 
Questions 9-10 surface edge cases and hidden assumptions. This sequence ensures negative requirements (what should NOT happen) are captured, which often contain the most important business rules. | -| **Where used** | Phase 1 project discovery: the "Success" question maps to "What does success look like?" (question 4), "Failure" maps to "What should definitely not happen?" (question 8), "Out-of-scope" maps to boundary-setting from the 10-question framework. | - ---- - -### 31. Domain-Driven Design — Bounded Contexts and Feature Identification - -| | | -|---|---| -| **Source** | Evans, E. (2003). *Domain-Driven Design: Tackling Complexity in the Heart of Software*. Addison-Wesley. | -| **Date** | 2003 | -| **URL** | — | -| **Alternative** | Context Mapper (2025). Rapid Object-Oriented Analysis and Design. https://contextmapper.org/docs/rapid-ooad | -| **Status** | Confirmed — foundational DDD literature | -| **Core finding** | A Bounded Context is a boundary within which a particular ubiquitous language is consistent. Features are identified by grouping related user stories that share the same language. Features can be decomposed into subdomains, and subdomains can be grouped into Bounded Contexts. The decomposition criterion is "single responsibility per context" + "consistency of language." | -| **Mechanism** | In DDD: (1) Extract ubiquitous language from requirements → (2) Group by language consistency → (3) Each group is a candidate bounded context → (4) Each bounded context maps to a feature. Context Mapper automates this: User Stories → Subdomains (via noun/verb extraction) → Bounded Contexts of type FEATURE. | -| **Where used** | Phase 1: after feature list identification, verify each feature has consistent language. Phase 2: noun/verb extraction from project discovery answers populates the Entities table, which is the DDD candidate model. The "Rules (Business)" section captures the ubiquitous language rules that govern each feature. | - ---- - -### 32. 
Object Calisthenics — Nine Rules - -| | | -|---|---| -| **Source** | Bay, J. "Object Calisthenics." *The Thoughtworks Anthology* (PragProg, 2008). Original in IEEE Software/DevX, ~2005. | -| **Date** | ~2005 | -| **URL** | https://www.bennadel.com/resources/uploads/2012/objectcalisthenics.pdf | -| **Alternative** | — | -| **Status** | Practitioner synthesis | -| **Core finding** | 9 rules to internalize OOP: (1) One level indentation per method, (2) No ELSE, (3) Wrap primitives/Strings, (4) First class collections, (5) One dot per line, (6) No abbreviations, (7) Classes ≤50 lines, (8) ≤2 instance variables, (9) No getters/setters. 7 of 9 enforce data encapsulation; 1 drives polymorphism; 1 drives naming. | -| **Mechanism** | Restrictions force decomposition. When you cannot use getters, behavior must move into the object. When you cannot use ELSE, you use polymorphism. When classes must be ≤2 ivars, you discover missing abstractions. | -| **Where used** | Refactor phase in `implementation/SKILL.md`: rule checklist with PASS/FAIL per rule. | - ---- - -### 33. Refactoring - -| | | -|---|---| -| **Source** | Fowler, M. (1999/2018). *Refactoring: Improving the Design of Existing Code* (2nd ed.). Addison-Wesley. | -| **Date** | 1999, 2018 | -| **URL** | https://martinfowler.com/books/refactoring.html | -| **Alternative** | — | -| **Status** | Confirmed — foundational | -| **Core finding** | Refactoring = behavior-preserving transformations. 68 catalogued refactorings, each small enough to do safely but cumulative effect significant. Code smells (duplicate code, long methods, feature envy) indicate refactoring opportunities. | -| **Mechanism** | Small steps reduce risk. Each refactoring is reversible. Test suite validates behavior unchanged. | -| **Where used** | Refactor phase in `implementation/SKILL.md`: smell detection triggers refactoring. | - ---- - -### 34. Design Patterns - -| | | -|---|---| -| **Source** | Gamma, E., Helm, R., Johnson, R., Vlissides, J. (1995). 
*Design Patterns: Elements of Reusable Object-Oriented Software*. Addison-Wesley. | -| **Date** | 1995 | -| **URL** | — | -| **Alternative** | — | -| **Status** | Confirmed — foundational | -| **Core finding** | 23 patterns catalogued in 3 categories: Creational (5), Structural (7), Behavioral (11). Key principles: "Favor composition over inheritance," "Program to an interface, not an implementation." | -| **Mechanism** | Patterns are recurring solutions to common problems. Named and catalogued so developers don't rediscover them. | -| **Where used** | Refactor phase: when ObjCal rules fail, patterns provide alternative structure. | - ---- - -### 35. SOLID Principles - -| | | -|---|---| -| **Source** | Martin, R. C. (2000). "Principles of OOD." *ButUncleBob.com*. Acronym coined by Michael Feathers (2004). | -| **Date** | 2000 | -| **URL** | https://blog.interface-solv.com/wp-content/uploads/2020/07/Principles-Of-OOD.pdf | -| **Alternative** | — | -| **Status** | Confirmed | -| **Core finding** | S: One reason to change. O: Open extension, closed modification. L: Subtypes substitutable. I: No forced stub methods. D: Depend on abstractions, not concretes. | -| **Mechanism** | Each principle targets a specific coupling failure mode. Together they produce low coupling, high cohesion. | -| **Where used** | Refactor self-check table in `implementation/SKILL.md`: 5-row SOLID table with PASS/FAIL. | - ---- - -### 36. QDIR — Bad-Smells + OO Metrics Prioritization - -| | | -|---|---| -| **Source** | Malhotra, R., Singh, P. (2020). "Exploiting bad-smells and object-oriented characteristics to prioritize classes for refactoring." *Int. J. Syst. Assur. Eng. Manag.* 11(Suppl 2), 133–144. Springer. 
| -| **Date** | 2020 | -| **URL** | https://doi.org/10.1007/s13198-020-01001-x | -| **Alternative** | — | -| **Status** | Confirmed — empirical | -| **Core finding** | QDIR (Quality Depreciation Index Rule) combines bad-smell severity with OO metrics (LOC, WMC, CBO, RFC, DIT) to prioritize classes for refactoring. Validated on 8 open-source Java systems. | -| **Mechanism** | Classes with high smell severity AND high OO metrics are prioritized. QDIR = weighted sum. | -| **Where used** | Refactor prioritization in Step 4: when smell detected, check OO metrics to prioritize. | - ---- - -### 37. Smells + Architectural Refactoring - -| | | -|---|---| -| **Source** | Silva, C. et al. (2020). "When Are Smells Indicators of Architectural Refactoring Opportunities." *Proc. 28th Int. Conf. on Program Comprehension*. ACM. | -| **Date** | 2020 | -| **URL** | https://doi.org/10.1145/3387904.3389276 | -| **Alternative** | — | -| **Status** | Confirmed — empirical | -| **Core finding** | Study of 50 projects, 52,667 refactored elements. 67.53% of smells co-occur. Smells that co-occur are indicators of architectural refactoring in 88.53% of cases. | -| **Mechanism** | Single smells are often code-level; co-occurring smells indicate architectural problems. Pattern catalog for smells→specific architectural refactorings. | -| **Where used** | Smell detection triggers architectural analysis when co-occurrence patterns detected. | - ---- - -### 38. SPIRIT Tool — Code Smell Prioritization - -| | | -|---|---| -| **Source** | Vidal, S. A., Marcos, C., Díaz-Pace, J. A. (2014). "An Approach to Prioritize Code Smells for Refactoring." *Automated Software Engineering*, 23(3), 501–532. Carleton University/Springer. 
| -| **Date** | 2014 | -| **URL** | https://doi.org/10.1007/s10515-014-0175-x | -| **Alternative** | — | -| **Status** | Confirmed — tool | -| **Core finding** | SPIRIT (Smart Identification of Refactoring opportunITies) prioritizes smells by 3 criteria: (1) component stability, (2) impact on modifiability scenarios, (3) smell relevance. Top-ranked smells correlate with expert developer judgment. | -| **Mechanism** | Semi-automated ranking. Combines version history (stable vs. unstable), impact analysis, and smell type. | -| **Where used** | Refactor prioritization: stability = has the class changed recently? Unstable + smelly = prioritize. | - ---- - -### 39. Bad Engineering Properties of OOP - -| | | -|---|---| -| **Source** | Cardelli, L. (1996). "Bad Engineering Properties of Object-Oriented Languages." *ACM Computing Surveys*, 28(4), 150. | -| **Date** | 1996 | -| **URL** | https://www.microsoft.com/en-us/research/publication/bad-engineering-properties-of-object-oriented-languages/ | -| **Alternative** | — | -| **Status** | Confirmed — foundational critique | -| **Core finding** | OOP has 5 "economy" problems: (1) Execution (virtual methods prevent inlining), (2) Compilation (no code/interface separation), (3) Small-scale dev (expressive type systems missing), (4) Large-scale dev (poor class extension/modification), (5) Language features (baroque complexity). | -| **Mechanism** | OOP is not universally superior. Trade-offs exist. Knowing these helps avoid over-engineering. | -| **Where used** | Anti-pre-pattern: know when OOP adds complexity vs. value. Feedback item 2 rationale. | - ---- - -### 40. Code Complexity Model of OOP - -| | | -|---|---| -| **Source** | Aluthwaththage, J. H., Thathsarani, H. A. N. N. (2024). "A Novel OO-Based Code Complexity Metric." *Proc. Future Technologies Conference (FTC)*, 616–628. Springer/IEEE. | -| **Date** | 2024 | -| **URL** | https://link.springer.com/chapter/10.1007/978-3-031-73125-9_39 | -| **Alternative** | Misra et al. 
(2024). "A Suite of Object Oriented Cognitive Complexity Metrics." IEEE. | -| **Status** | Partially confirmed — recent | -| **Core finding** | CWC (Combined Weighted Complexity) measures OOP complexity at statement level, considering 8 factors: nesting depth, control types, compound conditions, try-catch, threads, pointers, references, dynamic memory. Addresses gap in existing metrics ignoring cognitive load. | -| **Mechanism** | Granular complexity scoring. Higher scores indicate more cognitively demanding code. | -| **Where used** | Complexity measurement in Step 4 refactor: when function >20 lines, compute CWC-style granular score. | - ---- - -### 41. Metric Thresholds for Smell Detection - -| | | -|---|---| -| **Source** | Bigonha, M. A. S., et al. (2019). "The usefulness of software metric thresholds for detection of bad smells and fault prediction." *Information and Software Technology*, 115, 79–92. | -| **Date** | 2019 | -| **URL** | https://doi.org/10.1016/j.infsof.2019.08.005 | -| **Alternative** | Catal et al. (2018). "Software metrics thresholds calculation techniques." Info. Softw. Technol. | -| **Status** | Confirmed | -| **Core finding** | Metric thresholds (e.g., LOC > 600) used for smell detection are unreliable. Study on 92 open-source systems found precision too low for practical use. Neither heuristic-based (DECOR) nor ML approaches achieve acceptable accuracy. ROC Curves best of 3 threshold techniques but still insufficient alone. | -| **Mechanism** | Fixed thresholds are context-dependent. Thresholds should be project-specific, not universal. | -| **Where used** | Anti-pre-pattern: do not rely on fixed thresholds. Use co-occurrence patterns (Entry 37) instead. | - ---- - -### 42. Hexagonal Architecture — Ports and Adapters - -| | | -|---|---| -| **Source** | Cockburn, A. (2005). "Hexagonal Architecture." *alistair.cockburn.us*. 
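The CWC mechanism — granular, factor-weighted scoring at statement level — can be sketched as follows. The factor names follow entry 40; the weights are illustrative placeholders, not values from the paper:

```python
# Placeholder weights for the 8 CWC complexity factors (entry 40).
FACTOR_WEIGHTS = {
    "nesting_depth": 2.0,
    "control_type": 1.5,
    "compound_condition": 1.5,
    "try_catch": 1.0,
    "thread": 3.0,
    "pointer": 2.0,
    "reference": 1.0,
    "dynamic_memory": 2.0,
}


def statement_score(factors: dict[str, int]) -> float:
    """Score one statement from a count of each complexity factor it contains."""
    return sum(FACTOR_WEIGHTS[name] * count for name, count in factors.items())


def function_cwc(statements: list[dict[str, int]]) -> float:
    """Sum per-statement scores; higher totals flag cognitively demanding code."""
    return sum(statement_score(s) for s in statements)
```

In the Step 4 refactor check, a function over 20 lines would be scored statement by statement this way rather than judged on length alone.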
https://alistair.cockburn.us/hexagonal-architecture/ | -| **Date** | 2005 | -| **URL** | https://alistair.cockburn.us/hexagonal-architecture/ | -| **Alternative** | Freeman, S., & Pryce, N. (2009). *Growing Object-Oriented Software, Guided by Tests*. Addison-Wesley. (Chapter 7: "Ports and Adapters") | -| **Status** | Confirmed — foundational; widely adopted as Clean Architecture, Onion Architecture | -| **Core finding** | The application domain should have no knowledge of external systems (databases, filesystems, network, UI). All contact between the domain and the outside world passes through a **port** (an interface / Protocol) and an **adapter** (a concrete implementation of that port). This makes the domain independently testable without any infrastructure. The key structural rule: dependency arrows point inward — domain code never imports from adapters; adapters import from domain. | -| **Mechanism** | Two distinct sides of any application: the "driving side" (actors who initiate action — tests, UI, CLI) and the "driven side" (actors the application drives — databases, filesystems, external services). Each driven-side dependency is hidden behind a port. Tests supply a test adapter; production supplies a real adapter. Substituting adapters requires no domain code changes. This is what SOLID-D ("depend on abstractions") looks like at the architectural layer — not just at the class level. | -| **Where used** | Step 2 (Architecture): every external dependency identified during domain analysis must be assigned a port (Protocol) and a concrete adapter. Module structure always includes `<package>/adapters/<dep>.py` alongside `<package>/domain/`. The `adapters/` layer is decided at Step 2, not discovered during Step 4 refactoring. | - ---- - -### 43. Feature-Driven Development — Domain Modeling to Feature List - -| | | -|---|---| -| **Source** | Ambler, S. W. (2002). *Agile Modeling: Effective Practices for eXtreme Programming and the Unified Process*. Wiley. 
Supplemented by: agilemodeling.com — "Feature Driven Development and Agile Modeling." | -| **Date** | 2002 | -| **URL** | https://www.agilemodeling.com/essays/fdd.htm | -| **Alternative** | Palmer, S. R., & Felsing, J. M. (2002). *A Practical Guide to Feature-Driven Development*. Prentice Hall. | -| **Status** | Confirmed | -| **Core finding** | FDD requires domain modeling *before* feature naming. Features are expressed as "Action result object" triples (e.g., "Enroll a student in a seminar"). Features group into Feature Sets (shared domain object), which group into Subject Areas. 78% of organisations doing Agile also do initial high-level agile requirements modeling; 85% find it worthwhile. | -| **Mechanism** | Domain modeling extracts the vocabulary (nouns = candidate classes, verbs = candidate methods). Feature identification then asks: "what verbs act on each noun?" This produces a list of small, deliverable units that are coherent with the domain rather than reflecting technical or organisational boundaries. | -| **Where used** | Phase 1 of `scope/SKILL.md`: after the interview summary is confirmed, PO performs domain analysis (nouns/verbs → subject areas → FDD "Action object" feature names) before creating `.feature` file stubs. | - ---- - -### 44. Affinity Mapping / KJ Method — Bottom-Up Feature Identification - -| | | -|---|---| -| **Source** | Krause, R., & Pernice, K. (2024). Affinity Diagramming for Collaboratively Sorting UX Findings and Design Ideas. *Nielsen Norman Group*. https://www.nngroup.com/articles/affinity-diagram/ | -| **Date** | 2024 (method origin: Kawakita, J., 1960s) | -| **URL** | https://www.nngroup.com/articles/affinity-diagram/ | -| **Alternative** | Kawakita, J. (1967). *Abduction*. Chuokoronsha (KJ Method original). | -| **Status** | Confirmed | -| **Core finding** | Affinity diagramming (KJ Method) groups raw observations/requirements into clusters by bottom-up similarity — no categories are named until grouping is complete. 
This prevents confirmation bias from top-down pre-labelling. Each named cluster becomes a candidate feature. Dot voting on clusters produces a prioritized feature list. Small clusters must not be discarded — they often represent minority concerns or genuinely novel features. | -| **Mechanism** | Bottom-up category emergence: when categories are not imposed in advance, the grouping reflects actual similarity in the data rather than the analyst's prior mental model. NN/G: "the journey is more important than the destination — the discussions that occurred while building the diagram are more impactful than the final format." | -| **Where used** | Phase 1 of `scope/SKILL.md` (alternative to FDD domain modeling): PO uses affinity mapping on interview answers to derive feature clusters before creating `.feature` stubs. Best suited when working from interview transcripts solo rather than with a cross-silo team. | - ---- - -### 45. Event Storming — Domain Events to Functional Areas - -| | | -|---|---| -| **Source** | Brandolini, A. (2013–present). *EventStorming*. Leanpub / eventstorming.com. https://eventstorming.com | -| **Date** | 2013 | -| **URL** | https://eventstorming.com; Bourgau, P. (2017). Detailed Agenda of a DDD Big Picture Event Storming. https://philippe.bourgau.net/detailed-agenda-of-a-ddd-big-picture-event-storming-part-1/ | -| **Alternative** | Brandolini, A. (2021). *Introducing EventStorming*. Leanpub. | -| **Status** | Confirmed | -| **Core finding** | Event Storming is a collaborative workshop where domain experts place past-tense domain events on a timeline. Sorting the events creates natural Functional Area clusters — these are candidate feature groups / Subject Areas. The workshop also produces Ubiquitous Language (shared vocabulary), a Problem Inventory (open questions), and Actor roles (for user story "As a [role]" parts). It does NOT produce Gherkin directly; its output feeds into Example Mapping per story. 
| -| **Mechanism** | Temporal sequencing of domain events forces resolution of conflicting mental models across organisational silos. Clusters emerge from shared vocabulary and causal proximity — not from the facilitator's prior structure. Bourgau: "Although nobody understands Bounded Context from the start, everyone gets Functional Area." | -| **Where used** | Optional alternative in Phase 1 of `scope/SKILL.md` for cross-silo discovery. Best suited when multiple stakeholders from different departments need to build shared understanding. Outputs (Functional Areas + Ubiquitous Language) map directly to Subject Areas (feature groups) and the Entities table in `.feature` file discovery sections. | - ---- - -### 46. Critical Incident Technique — Gap-Finding via Past Events - -| | | -|---|---| -| **Source** | Flanagan, J. C. (1954). "The critical incident technique." *Psychological Bulletin*, 51(4), 327–357. | -| **Date** | 1954 | -| **URL** | https://doi.org/10.1037/h0061470 | -| **Alternative** | Rosala, M. (2020). The Critical Incident Technique in UX. *Nielsen Norman Group*. https://www.nngroup.com/articles/critical-incident-technique/ | -| **Status** | Confirmed — foundational; ~200 follow-on empirical studies in marketing alone (Gremler 2004) | -| **Core finding** | Anchoring an interview on a specific past incident ("Tell me about a time when X broke down") breaks schema-based recall. Stakeholders describing actual past events report real workarounds, edge cases, and failure modes that never surface when asked "how does this usually work?" The technique explicitly requires both positive and negative incidents — positive first to establish rapport, negative second to surface failures. | -| **Mechanism** | Direct questions ("how does the system work?") elicit the stakeholder's mental schema — a sanitized, normalized, gap-free description of how things *should* work. 
Incidents bypass the schema because episodic memory is anchored to specific sensory and emotional detail that the schema lacks. Flanagan: "a critical incident must occur in a situation where the purpose or intent of the act seems fairly clear to the observer and where its consequences are sufficiently definite to leave little doubt." | -| **Where used** | Session 2 (gap-finding) of Phase 1 and Phase 2 in `scope/SKILL.md`. CIT prompts: "Tell me about a specific time this worked well / broke down." Follow up: "What were you trying to do? What made it difficult? What did you do instead?" | - ---- - -### 47. Cognitive Interview — Memory-Enhancing Elicitation Technique - -| | | -|---|---| -| **Source** | Fisher, R. P., & Geiselman, R. E. (1992). *Memory-Enhancing Techniques for Investigative Interviewing: The Cognitive Interview*. Charles C. Thomas. | -| **Date** | 1984 (original); 1987 (enhanced CI); 1992 (manual) | -| **URL** | DOI: 10.1037/0021-9010.74.5.722 (1989 field study) | -| **Alternative** | Moody, W., Will, R. P., & Blanton, J. E. (1996). "Enhancing knowledge elicitation using the cognitive interview." *Expert Systems with Applications*, 10(1), 127–133. DOI: 10.1016/0957-4174(95)00039-9 | -| **Status** | Confirmed — meta-analysis: Köhnken, Milne, Memon & Bull (1999), *Psychology, Crime & Law*, 5(1-2), 3–27. DOI: 10.1080/10683169908414991 | -| **Core finding** | The enhanced CI elicits ~35% more correct information than standard interviews with equal accuracy rates (85% vs. 82%). Moody et al. (1996) directly applied CI to knowledge elicitation from domain experts, finding it superior for capturing episodic knowledge that standard structured interviews miss. 
| -| **Mechanism** | Four retrieval mnemonics: (1) **Mental reinstatement of context** — stakeholder mentally returns to a specific past situation; (2) **Report everything** — all details including seemingly minor ones; (3) **Temporal reversal** — narrate events from a different starting point to disrupt schema-based reconstruction; (4) **Perspective change** — describe the situation from another actor's viewpoint. Each mnemonic opens a different memory access route, collectively surfacing what direct questions cannot. | -| **Where used** | Session 2 (gap-finding) of Phase 1 and Phase 2 in `scope/SKILL.md`. CI perspective change prompt: "What do you think the end user experiences in that situation?" CI reversal prompt: "Walk me through that scenario starting from when it went wrong." | - ---- - -### 48. Laddering / Means-End Chain — Surfacing Unstated Motivations - -| | | -|---|---| -| **Source** | Reynolds, T. J., & Gutman, J. (1988). "Laddering theory, method, analysis, and interpretation." *Journal of Advertising Research*, 28(1), 11–31. | -| **Date** | 1988 (method origin: Kelly, G. (1955). *The Psychology of Personal Constructs*. Norton.) | -| **URL** | https://en.wikipedia.org/wiki/Repertory_grid | -| **Alternative** | Hunter, M. G., & Beck, J. E. (2000). "Using repertory grids to conduct cross-cultural information systems research." *Information Systems Research*, 11(1), 93–101. DOI: 10.1287/isre.11.1.93.11786 | -| **Status** | Confirmed — operationalised in IS research (Hunter & Beck 2000); embedded in NNG interview protocols (Rosala 2021) | -| **Core finding** | Repeatedly asking "Why is that important to you?" climbs a means-end chain from concrete attribute → functional consequence → psychosocial consequence → terminal value. The stakeholder's first answer is rarely the real constraint — it is the socially expected, conscious-level response. 
The real motivation (and the actual constraint that requirements must satisfy) emerges two or three levels up the ladder. | -| **Mechanism** | The Gherkin "So that [benefit]" clause is structurally a single-rung means-end ladder. Full laddering reveals the value conflicts between stakeholders whose surface requirements look identical but whose ladders diverge at the consequence level. Without laddering, requirements that satisfy the stated attribute may fail the underlying goal. | -| **Where used** | Session 2 (gap-finding) of Phase 1 and Phase 2 in `scope/SKILL.md`. Laddering probe: "Why is that important to you?", "What does that enable for you?", "What would break if that weren't available?" Climb until the stakeholder reaches a terminal value they cannot explain further. | - ---- - -### 49. Funnel Technique — Question Ordering to Prevent Priming - -| | | -|---|---| -| **Source** | Rosala, M., & Moran, K. (2022). The Funnel Technique in Qualitative User Research. *Nielsen Norman Group*. https://www.nngroup.com/articles/the-funnel-technique-in-qualitative-user-research/ | -| **Date** | 2022 | -| **URL** | https://www.nngroup.com/articles/the-funnel-technique-in-qualitative-user-research/ | -| **Alternative** | Christel, M. G., & Kang, K. C. (1992). *Issues in Requirements Elicitation*. CMU/SEI-92-TR-012. https://www.sei.cmu.edu/library/abstracts/reports/92tr012.cfm | -| **Status** | Confirmed — standard NNG qualitative research protocol | -| **Core finding** | Starting with broad open-ended questions before narrowing to specifics prevents the interviewer from priming the interviewee's responses. Once a category label is introduced, the interviewee interprets subsequent questions through that frame and under-reports items that don't fit it. Broad-to-narrow sequencing within each topic cluster is the evidence-based default for discovery interviews. 
| -| **Mechanism** | Priming bias is structural: human memory is associative, so any category name the interviewer introduces activates a schema that filters what the interviewee considers worth reporting. The funnel sequences questions so the interviewee's own categories emerge first, before the interviewer's categories are introduced. | -| **Where used** | Within each session of Phase 1 and Phase 2 in `scope/SKILL.md`. Within each topic cluster: start with "Tell me about..." before asking specific follow-up probes. Applies alongside CIT, CI, and Laddering — all of which are also open-ended by design. | - ---- - -### 50. Issues in Requirements Elicitation — Why Direct Questions Fail - -| | | -|---|---| -| **Source** | Christel, M. G., & Kang, K. C. (1992). *Issues in Requirements Elicitation*. CMU/SEI-92-TR-012. Software Engineering Institute, Carnegie Mellon University. | -| **Date** | 1992 | -| **URL** | https://www.sei.cmu.edu/library/abstracts/reports/92tr012.cfm | -| **Alternative** | Sommerville, I., & Sawyer, P. (1997). *Requirements Engineering: A Good Practice Guide*. Wiley. | -| **Status** | Confirmed — foundational SEI technical report; widely cited in RE literature | -| **Core finding** | Stakeholders have three structural problems that make direct questioning insufficient: (1) they omit information that is "obvious" to them but unknown to the analyst; (2) they have trouble communicating needs they have never had to articulate; (3) they may not know what they want until they see what they don't want. These are not stakeholder failures — they are structural properties of tacit knowledge. | -| **Mechanism** | Expert knowledge is largely procedural and tacit. When asked "how does the system work?", experts describe what they believe happens, not what actually happens. This sanitized account is internally consistent but incomplete. 
Gap-finding techniques (CIT, CI, Laddering) are required because they bypass the expert's mental schema and access the episodic memory layer where real complexity lives. | -| **Where used** | Theoretical justification for the 3-session interview structure and the use of CIT, CI, and Laddering in `scope/SKILL.md`. Answers the question: "why not just ask the stakeholder directly what they need?" | - ---- - -### 51. Canon TDD — Authoritative Red-Green-Refactor Definition - -| | | -|---|---| -| **Source** | Beck, K. (2023). "Canon TDD." *tidyfirst.substack.com*. December 11, 2023. | -| **Date** | 2023 | -| **URL** | https://tidyfirst.substack.com/p/canon-tdd | -| **Alternative** | Fowler, M. (2023). "Test Driven Development." *martinfowler.com*. December 11, 2023. https://martinfowler.com/bliki/TestDrivenDevelopment.html | -| **Status** | Confirmed — canonical source; explicitly authored to stop strawman critiques | -| **Core finding** | The canonical TDD loop is: (1) write a list of test scenarios; (2) convert exactly one item into a runnable test; (3) make it pass; (4) optionally refactor; (5) repeat. Writing all test code before any implementation is an explicit anti-pattern ("Mistake: convert all items on the list into concrete tests, then make them pass"). | -| **Mechanism** | The interleaving of test-writing and implementation is not cosmetic — each test drives interface decisions at the moment they are cheapest to make. Batch-writing tests first forces speculative interface decisions that later require rework when earlier tests reveal structural problems. | -| **Where used** | Justifies merging Step 3 (test bodies) into the implementation loop. Removing the separate "write all tests" phase and replacing it with one-@id-at-a-time interleaved TDD. | - ---- - -### 52. Growing Object-Oriented Software, Guided by Tests (GOOS) — Outer/Inner TDD Loop - -| | | -|---|---| -| **Source** | Freeman, S., & Pryce, N. (2009). 
*Growing Object-Oriented Software, Guided by Tests*. Addison-Wesley. | -| **Date** | 2009 | -| **URL** | — | -| **Alternative** | — | -| **Status** | Confirmed — canonical ATDD/BDD integration model | -| **Core finding** | Acceptance tests and unit tests operate at two separate, nested timescales. The outer loop: write one failing acceptance test (Gherkin/feature-level) before writing any implementation. The inner loop: drive implementation with unit-level Red-Green-Refactor cycles until the acceptance test passes. The acceptance test stays red throughout all inner cycles and goes green only when the feature is complete. | -| **Mechanism** | The outer loop provides direction (what to build); the inner loop provides momentum (how to build it). Running acceptance tests first prevents tunnel vision during unit-level work — the developer always has a red acceptance test as the north star. This is the canonical model for integrating Gherkin acceptance criteria (@id Examples) with unit TDD. | -| **Where used** | Justifies the two-level structure in Step 3 (TDD Loop): outer loop per @id acceptance test, inner loop per unit. Each @id Example is the acceptance test for one outer loop iteration. | - ---- - -### 53. Is TDD Dead? — Anti-Bureaucracy Evidence - -| | | -|---|---| -| **Source** | Beck, K., Fowler, M., & Hansson, D. H. (2014). "Is TDD Dead?" Video series, *martinfowler.com*. May–June 2014. https://martinfowler.com/articles/is-tdd-dead/ | -| **Date** | 2014 | -| **URL** | https://martinfowler.com/articles/is-tdd-dead/ | -| **Alternative** | — | -| **Status** | Confirmed — primary evidence for what TDD practitioners reject as overhead | -| **Core finding** | Per-cycle human reviewer gates, per-cycle checklists, and tests that provide zero delta coverage are all explicitly identified as harmful overhead in TDD workflows. The green bar is the quality gate — not a checklist. DHH: "Many people used to think that documentation was more important than code. 
Now he's concerned that people think tests are more important than functional code." Beck: "Tests with zero delta coverage should be deleted unless they provide some kind of communication purpose." | -| **Mechanism** | Administrative overhead added to TDD workflows increases the cost per cycle without increasing coverage or catching defects. The optimal TDD loop is as lean as productive — ceremony that does not eliminate a failure mode should be eliminated. Fowler: "The sign of too much testing is whenever you change the code you think you expend more effort changing the tests than changing the code." | -| **Where used** | Justifies removing per-test reviewer gates and per-test 24-item self-declaration from the TDD loop. Self-declaration moves to end-of-feature (once), preserving Cialdini+Tetlock accountability at feature granularity without interrupting cycle momentum. | - ---- - -### 54. Introducing BDD — Behavioural-Driven Development Origin - -| | | -|---|---| -| **Source** | North, D. (2006). "Introducing BDD." *Better Software Magazine*, March 2006. https://dannorth.net/introducing-bdd/ | -| **Date** | 2006 | -| **URL** | https://dannorth.net/introducing-bdd/ | -| **Alternative** | Fowler, M. (2013). "Given When Then." *martinfowler.com*. https://martinfowler.com/bliki/GivenWhenThen.html | -| **Status** | Confirmed — primary BDD source | -| **Core finding** | BDD evolved directly from TDD to address persistent practitioner confusion: where to start, what to test, how much to test in one go, and what to call tests. BDD reframes TDD vocabulary around observable behavior: scenarios instead of tests, Given-When-Then (G/W/T) instead of Arrange-Act-Assert (AAA). The underlying mechanics are identical — G/W/T is AAA with shared-vocabulary semantics for collaboration between technical and non-technical stakeholders. 
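Because G/W/T is AAA with shared vocabulary, one @id Example translates mechanically into a unit test. A minimal pytest-style sketch — the `Cart` domain object and its methods are hypothetical, invented to show the mapping:

```python
class Cart:
    """Hypothetical domain object for the example."""

    def __init__(self) -> None:
        self.items: list[str] = []

    def add_item(self, name: str) -> None:
        self.items.append(name)


def test_adding_an_item_puts_it_in_the_cart() -> None:
    # Given an empty cart            (Arrange)
    cart = Cart()
    # When the shopper adds a book   (Act)
    cart.add_item("book")
    # Then the cart holds the book   (Assert)
    assert cart.items == ["book"]
```

The three comments are the Gherkin scenario verbatim; the three statements beneath them are the AAA mechanics — which is the translation step the "Mechanism" row describes.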
| -| **Mechanism** | The "Given" clause captures preconditions (Arrange), "When" captures the triggering event (Act), and "Then" captures the observable outcome (Assert). Translating from AAA to G/W/T shifts the focus from implementation mechanics to user-observable behavior, making acceptance criteria verifiable by non-technical stakeholders and executable by the test suite simultaneously. | -| **Where used** | Theoretical link between Gherkin @id Examples (Step 1 output) and the TDD inner loop (Step 3). Each @id Example is a G/W/T specification that maps directly to a test function. The outer GOOS loop is an acceptance test written in BDD vocabulary; the inner loop is unit TDD. | - ---- - -## Bibliography - -1. Cialdini, R. B. (2001). *Influence: The Psychology of Persuasion* (rev. ed.). HarperBusiness. -2. Craik, F. I. M., & Lockhart, R. S. (1972). Levels of processing: A framework for memory research. *Journal of Verbal Learning and Verbal Behavior*, 11(6), 671–684. -3. Gollwitzer, P. M. (1999). Implementation intentions: Strong effects of simple plans. *American Psychologist*, 54(7), 493–503. -4. Hattie, J., & Timperley, H. (2007). The power of feedback. *Review of Educational Research*, 77(1), 81–112. -5. Kahneman, D. (2011). *Thinking, Fast and Slow*. Farrar, Straus and Giroux. -6. Klein, G. (1998). *Sources of Power: How People Make Decisions*. MIT Press. -7. McDaniel, M. A., & Einstein, G. O. (2000). Strategic and automatic processes in prospective memory retrieval. *Applied Cognitive Psychology*, 14(7), S127–S144. -8. Mellers, B. A., Hertwig, R., & Kahneman, D. (2001). Do frequency representations eliminate conjunction effects? An exercise in adversarial collaboration. *Psychological Science*, 12(4), 269–275. -9. Miller, G. A. (1956). The magical number seven, plus or minus two. *Psychological Review*, 63(2), 81–97. -10. Sweller, J. (1988). Cognitive load during problem solving. *Cognitive Science*, 12(2), 257–285. -11. Tetlock, P. E. (1983).
Accountability: A social determinant of judgment. In M. D. B. T. Strother (Ed.), *Psychology of Learning and Motivation* (Vol. 17, pp. 295–332). Academic Press. -12. Fowler, M. (2018). The Practical Test Pyramid. *Thoughtworks*. https://martinfowler.com/articles/practical-test-pyramid.html -13. Google Testing Blog. (2013). Testing on the Toilet: Test Behavior, Not Implementation. -14. Martin, R. C. (2017). First-Class Tests. *Clean Coder Blog*. -15. MacIver, D. R. (2016). What is Property Based Testing? *Hypothesis*. https://hypothesis.works/articles/what-is-property-based-testing/ -16. Boehm, B. W. (1981). *Software Engineering Economics*. Prentice-Hall. -17. Boehm, B., & Papaccio, P. N. (1988). Understanding and controlling software costs. *IEEE Transactions on Software Engineering*, 14(10), 1462–1477. -18. Wake, B. (2003). INVEST in Good Stories, and SMART Tasks. *XP123.com*. -19. Cohn, M. (2004). *User Stories Applied: For Agile Software Development*. Addison-Wesley. -20. Wynne, M. (2015). Introducing Example Mapping. *Cucumber Blog*. https://cucumber.io/blog/bdd/example-mapping-introduction/ -21. Cucumber Team. (2024). Better Gherkin. *Cucumber Documentation*. https://cucumber.io/docs/bdd/better-gherkin/ -22. Clegg, D., & Barker, R. (1994). *Case Method Fast-Track: A RAD Approach*. Addison-Wesley. -23. OpenAI. (2024). Agent definitions. *OpenAI Agents SDK Documentation*. https://platform.openai.com/docs/guides/agents/define-agents -24. Anthropic. (2024). Building effective agents. *Anthropic Engineering Blog*. https://www.anthropic.com/engineering/building-effective-agents -25. Anthropic. (2025). Best practices for Claude Code. *Anthropic Documentation*. https://www.anthropic.com/engineering/claude-code-best-practices -26. OpenCode. (2026). Agent Skills. *OpenCode Documentation*. https://opencode.ai/docs/skills/ -27. Geng et al. (2025). Control Illusion: The Failure of Instruction Hierarchies in Large Language Models. AAAI-26. arXiv:2502.15851. 
https://arxiv.org/abs/2502.15851 -28. Wallace, E. et al. (2024). The Instruction Hierarchy: Training LLMs to Prioritize Privileged Instructions. arXiv:2404.13208. https://arxiv.org/abs/2404.13208 -29. Liu, N. F. et al. (2023). Lost in the Middle: How Language Models Use Long Contexts. *Transactions of the Association for Computational Linguistics*. arXiv:2307.03172. https://arxiv.org/abs/2307.03172 -30. McKinnon, R. (2025). arXiv:2511.05850. https://arxiv.org/abs/2511.05850 -31. Sharma, A., & Henley, A. (2026). Modular Prompt Optimization. arXiv:2601.04055. https://arxiv.org/abs/2601.04055 -32. Rogers, C. R., & Farson, R. E. (1957). *Active Listening*. Industrial Relations Center, University of Chicago. -33. McNaughton, D., Hamlin, D., McCarthy, J., Head-Reeves, D., & Schreiner, M. (2008). Learning to Listen: Teaching an Active Listening Strategy to Preservice Education Professionals. *Topics in Early Childhood Special Education*, 27(4), 223–231. -34. Kipling, R. (1902). *Just So Stories*. Macmillan. -35. Brandenburg, L. (2025). *Requirements Discovery Checklist Pack*. TechCanvass. https://www.businessanalyststoolkit.com/requirements-elicitation-questions/ -36. Sherwen. (2025). "10 Questions to Consider During Requirements Gathering." https://www.sherwen.com/insights/10-questions-you-must-ask-during-requirements-gathering -37. Evans, E. (2003). *Domain-Driven Design: Tackling Complexity in the Heart of Software*. Addison-Wesley. -38. Context Mapper. (2025). Rapid Object-Oriented Analysis and Design. https://contextmapper.org/docs/rapid-ooad -39. Bay, J. (2005). "Object Calisthenics." *IEEE Software/DevX*. https://www.bennadel.com/resources/uploads/2012/objectcalisthenics.pdf -40. Fowler, M. (1999/2018). *Refactoring: Improving the Design of Existing Code*. Addison-Wesley. https://martinfowler.com/books/refactoring.html -41. Gamma, E., Helm, R., Johnson, R., Vlissides, J. (1995). *Design Patterns: Elements of Reusable Object-Oriented Software*. Addison-Wesley. -42. 
Martin, R. C. (2000). "Principles of OOD." *ButUncleBob.com*. https://blog.interface-solv.com/wp-content/uploads/2020/07/Principles-Of-OOD.pdf -43. Malhotra, R., & Singh, P. (2020). Exploiting bad-smells and object-oriented characteristics to prioritize classes for refactoring. *Int. J. Syst. Assur. Eng. Manag.*, 11(Suppl 2), 133–144. https://doi.org/10.1007/s13198-020-01001-x -44. Silva, C. et al. (2020). When Are Smells Indicators of Architectural Refactoring Opportunities. *Proc. 28th Int. Conf. on Program Comprehension*. ACM. https://doi.org/10.1145/3387904.3389276 -45. Vidal, S. A., Marcos, C., & Díaz-Pace, J. A. (2014). An Approach to Prioritize Code Smells for Refactoring. *Automated Software Engineering*, 23(3), 501–532. https://doi.org/10.1007/s10515-014-0175-x -46. Cardelli, L. (1996). Bad Engineering Properties of Object-Oriented Languages. *ACM Computing Surveys*, 28(4), 150. https://www.microsoft.com/en-us/research/publication/bad-engineering-properties-of-object-oriented-languages/ -47. Aluthwaththage, J. H., & Thathsarani, H. A. N. N. (2024). A Novel OO-Based Code Complexity Metric. *Proc. Future Technologies Conference (FTC)*, 616–628. https://link.springer.com/chapter/10.1007/978-3-031-73125-9_39 -48. Bigonha, M. A. S., et al. (2019). The usefulness of software metric thresholds for detection of bad smells and fault prediction. *Information and Software Technology*, 115, 79–92. https://doi.org/10.1016/j.infsof.2019.08.005 -49. Ambler, S. W. (2002). *Agile Modeling: Effective Practices for eXtreme Programming and the Unified Process*. Wiley. https://www.agilemodeling.com/essays/fdd.htm -50. Palmer, S. R., & Felsing, J. M. (2002). *A Practical Guide to Feature-Driven Development*. Prentice Hall. -51. Krause, R., & Pernice, K. (2024). Affinity Diagramming for Collaboratively Sorting UX Findings and Design Ideas. *Nielsen Norman Group*. https://www.nngroup.com/articles/affinity-diagram/ -52. Brandolini, A. (2013–present). *EventStorming*. 
Leanpub / eventstorming.com. https://eventstorming.com -53. Bourgau, P. (2017). Detailed Agenda of a DDD Big Picture Event Storming. https://philippe.bourgau.net/detailed-agenda-of-a-ddd-big-picture-event-storming-part-1/ -54. Nielsen, J. (2010). *Interviewing Users*. Nielsen Norman Group. https://www.nngroup.com/articles/interviewing-users/ -55. Farrell, S. (2017). UX Research Cheat Sheet. *Nielsen Norman Group*. https://www.nngroup.com/articles/ux-research-cheat-sheet/ -56. Flanagan, J. C. (1954). The critical incident technique. *Psychological Bulletin*, 51(4), 327–357. https://doi.org/10.1037/h0061470 -57. Fisher, R. P., & Geiselman, R. E. (1992). *Memory-Enhancing Techniques for Investigative Interviewing: The Cognitive Interview*. Charles C. Thomas. -58. Fisher, R. P., Geiselman, R. E., & Amador, M. (1989). Field test of the cognitive interview: Enhancing the recollection of actual victims and witnesses of crime. *Journal of Applied Psychology*, 74(5), 722–727. https://doi.org/10.1037/0021-9010.74.5.722 -59. Köhnken, G., Milne, R., Memon, A., & Bull, R. (1999). The cognitive interview: A meta-analysis. *Psychology, Crime & Law*, 5(1-2), 3–27. https://doi.org/10.1080/10683169908414991 -60. Moody, W., Will, R. P., & Blanton, J. E. (1996). Enhancing knowledge elicitation using the cognitive interview. *Expert Systems with Applications*, 10(1), 127–133. https://doi.org/10.1016/0957-4174(95)00039-9 -61. Reynolds, T. J., & Gutman, J. (1988). Laddering theory, method, analysis, and interpretation. *Journal of Advertising Research*, 28(1), 11–31. -62. Christel, M. G., & Kang, K. C. (1992). *Issues in Requirements Elicitation*. CMU/SEI-92-TR-012. Software Engineering Institute, Carnegie Mellon University. https://www.sei.cmu.edu/library/abstracts/reports/92tr012.cfm -63. Rosala, M. (2020). The Critical Incident Technique in UX. *Nielsen Norman Group*. https://www.nngroup.com/articles/critical-incident-technique/ -64. Rosala, M., & Moran, K. (2022). 
The Funnel Technique in Qualitative User Research. *Nielsen Norman Group*. https://www.nngroup.com/articles/the-funnel-technique-in-qualitative-user-research/ -65. Cockburn, A. (2005). Hexagonal Architecture. *alistair.cockburn.us*. https://alistair.cockburn.us/hexagonal-architecture/ -66. Freeman, S., & Pryce, N. (2009). *Growing Object-Oriented Software, Guided by Tests*. Addison-Wesley. -67. Beck, K. (2023). "Canon TDD." *tidyfirst.substack.com*. https://tidyfirst.substack.com/p/canon-tdd -68. Beck, K., Fowler, M., & Hansson, D. H. (2014). "Is TDD Dead?" Video series. *martinfowler.com*. https://martinfowler.com/articles/is-tdd-dead/ -69. Fowler, M. (2014). "Self Testing Code." *martinfowler.com*. https://martinfowler.com/bliki/SelfTestingCode.html -70. North, D. (2006). "Introducing BDD." *Better Software Magazine*. https://dannorth.net/introducing-bdd/ diff --git a/docs/architecture/adr-template.md b/docs/architecture/adr-template.md new file mode 100644 index 0000000..d86faf9 --- /dev/null +++ b/docs/architecture/adr-template.md @@ -0,0 +1,10 @@ +# ADR-NNN: <title> + +**Status:** PROPOSED | ACCEPTED | SUPERSEDED by ADR-NNN + +**Decision:** <what was decided — one sentence> + +**Reason:** <why — one sentence> + +**Alternatives considered:** +- <option>: <why rejected> diff --git a/docs/images/banner.svg b/docs/images/banner.svg new file mode 100644 index 0000000..01cc920 --- /dev/null +++ b/docs/images/banner.svg @@ -0,0 +1,123 @@ +<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 830 160" width="830" height="160" role="img" aria-label="Python Project Template"> + <defs> + + <!-- Antique gold for border and accents --> + <linearGradient id="g-gold" x1="0%" y1="0%" x2="100%" y2="0%"> + <stop offset="0%" stop-color="#c9a84c" stop-opacity="0"/> + <stop offset="15%" stop-color="#e8c96a"/> + <stop offset="85%" stop-color="#c9a84c"/> + <stop offset="100%" stop-color="#c9a84c" stop-opacity="0"/> + </linearGradient> + <linearGradient id="g-gold-solid" x1="0%" 
y1="0%" x2="0%" y2="100%"> + <stop offset="0%" stop-color="#e8c96a"/> + <stop offset="100%" stop-color="#a07830"/> + </linearGradient> + + <!-- "Project" — deep warm brown, harmonizes with marble and gold --> + <linearGradient id="g-main-text" x1="0%" y1="0%" x2="0%" y2="100%"> + <stop offset="0%" stop-color="#5c3d1e"/> + <stop offset="100%" stop-color="#3b2410"/> + </linearGradient> + + <!-- "Python" — muted steel blue, smaller, secondary --> + <linearGradient id="g-python" x1="0%" y1="0%" x2="0%" y2="100%"> + <stop offset="0%" stop-color="#7baabf"/> + <stop offset="100%" stop-color="#4a7a96"/> + </linearGradient> + + <!-- Logo badge --> + <linearGradient id="g-bg" x1="0%" y1="0%" x2="60%" y2="100%"> + <stop offset="0%" stop-color="#faf7f2"/> + <stop offset="100%" stop-color="#ede8e0"/> + </linearGradient> + <linearGradient id="g-marble" x1="0%" y1="0%" x2="100%" y2="100%"> + <stop offset="0%" stop-color="#f0ece4"/> + <stop offset="100%" stop-color="#c8c0b8"/> + </linearGradient> + <linearGradient id="g-stone" x1="0%" y1="0%" x2="100%" y2="100%"> + <stop offset="0%" stop-color="#d6cfc6"/> + <stop offset="100%" stop-color="#a89f96"/> + </linearGradient> + <linearGradient id="g-shadow-lg" x1="0%" y1="0%" x2="100%" y2="100%"> + <stop offset="0%" stop-color="#9c9189"/> + <stop offset="100%" stop-color="#7a6f67"/> + </linearGradient> + <linearGradient id="g-ring" x1="0%" y1="0%" x2="100%" y2="100%"> + <stop offset="0%" stop-color="#b5a99e"/> + <stop offset="100%" stop-color="#7a6f67"/> + </linearGradient> + + </defs> + + <!-- White background --> + <rect x="0" y="0" width="830" height="160" rx="8" fill="white"/> + + <!-- Golden border — top --> + <line x1="12" y1="6" x2="818" y2="6" stroke="url(#g-gold)" stroke-width="1.2"/> + <!-- Golden border — bottom --> + <line x1="12" y1="154" x2="818" y2="154" stroke="url(#g-gold)" stroke-width="1.2"/> + <!-- Golden border — left --> + <line x1="6" y1="12" x2="6" y2="148" stroke="#c9a84c" stroke-width="1.2" opacity="0.6"/> + 
<!-- Golden border — right --> + <line x1="824" y1="12" x2="824" y2="148" stroke="#c9a84c" stroke-width="1.2" opacity="0.6"/> + <!-- Corner dots --> + <circle cx="6" cy="6" r="2" fill="#c9a84c" opacity="0.7"/> + <circle cx="824" cy="6" r="2" fill="#c9a84c" opacity="0.7"/> + <circle cx="6" cy="154" r="2" fill="#c9a84c" opacity="0.7"/> + <circle cx="824" cy="154" r="2" fill="#c9a84c" opacity="0.7"/> + + <!-- ── LOGO — centered at x=82, y=80 ── --> + <circle cx="82" cy="80" r="60" fill="url(#g-bg)"/> + <circle cx="82" cy="80" r="60" fill="none" stroke="url(#g-ring)" stroke-width="1.8"/> + + <g transform="translate(82,80) scale(0.65) translate(-100,-100)"> + <polygon points="100,34 168,62 32,62" fill="url(#g-marble)"/> + <line x1="100" y1="34" x2="168" y2="62" stroke="url(#g-stone)" stroke-width="2" stroke-linecap="round"/> + <line x1="100" y1="34" x2="32" y2="62" stroke="url(#g-stone)" stroke-width="2" stroke-linecap="round"/> + <rect x="32" y="62" width="136" height="11" fill="url(#g-marble)"/> + <rect x="32" y="71" width="136" height="2" fill="url(#g-shadow-lg)" opacity="0.25"/> + <rect x="50" y="73" width="17" height="64" fill="url(#g-marble)"/> + <rect x="91" y="73" width="17" height="64" fill="url(#g-marble)"/> + <rect x="133" y="73" width="17" height="64" fill="url(#g-marble)"/> + <rect x="64" y="73" width="3" height="64" fill="url(#g-shadow-lg)" opacity="0.22"/> + <rect x="105" y="73" width="3" height="64" fill="url(#g-shadow-lg)" opacity="0.22"/> + <rect x="147" y="73" width="3" height="64" fill="url(#g-shadow-lg)" opacity="0.22"/> + <rect x="32" y="137" width="136" height="10" fill="url(#g-marble)"/> + <rect x="38" y="147" width="124" height="8" fill="url(#g-stone)"/> + <rect x="44" y="155" width="112" height="7" fill="url(#g-shadow-lg)" opacity="0.45"/> + </g> + + <!-- Vertical gold divider between logo and text --> + <line x1="158" y1="22" x2="158" y2="138" stroke="#c9a84c" stroke-width="1" opacity="0.5"/> + + <!-- "PYTHON" — small label, widely tracked, 
steel blue --> + <text x="178" y="58" + font-family="'Gill Sans', 'Optima', Candara, sans-serif" + font-size="13" + font-weight="400" + letter-spacing="7" + fill="url(#g-python)" + text-anchor="start">PYTHON</text> + + <!-- Thin gold rule under label --> + <line x1="178" y1="65" x2="818" y2="65" stroke="#c9a84c" stroke-width="0.8" opacity="0.45"/> + + <!-- "Project" — large, warm brown --> + <text x="175" y="112" + font-family="'Gill Sans', 'Optima', Candara, sans-serif" + font-size="46" + font-weight="600" + letter-spacing="2" + fill="url(#g-main-text)" + text-anchor="start">Project</text> + + <!-- "Template" — large, antique gold --> + <text x="175" y="145" + font-family="'Gill Sans', 'Optima', Candara, sans-serif" + font-size="32" + font-weight="600" + letter-spacing="10" + fill="url(#g-gold-solid)" + text-anchor="start">TEMPLATE</text> + +</svg> diff --git a/docs/images/logo.svg b/docs/images/logo.svg new file mode 100644 index 0000000..be5bf7f --- /dev/null +++ b/docs/images/logo.svg @@ -0,0 +1,61 @@ +<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 200 200" width="200" height="200" role="img" aria-label="python-project-template"> + <defs> + <!-- Marble warm white to cool grey --> + <linearGradient id="g-marble" x1="0%" y1="0%" x2="100%" y2="100%"> + <stop offset="0%" stop-color="#f0ece4"/> + <stop offset="100%" stop-color="#c8c0b8"/> + </linearGradient> + <!-- Slightly darker for base/shadow faces --> + <linearGradient id="g-stone" x1="0%" y1="0%" x2="100%" y2="100%"> + <stop offset="0%" stop-color="#d6cfc6"/> + <stop offset="100%" stop-color="#a89f96"/> + </linearGradient> + <!-- Deep warm shadow --> + <linearGradient id="g-shadow" x1="0%" y1="0%" x2="100%" y2="100%"> + <stop offset="0%" stop-color="#9c9189"/> + <stop offset="100%" stop-color="#7a6f67"/> + </linearGradient> + <!-- Badge background: warm off-white parchment --> + <linearGradient id="g-bg" x1="0%" y1="0%" x2="60%" y2="100%"> + <stop offset="0%" stop-color="#faf7f2"/> + <stop offset="100%" 
stop-color="#ede8e0"/> + </linearGradient> + <!-- Ring: warm taupe --> + <linearGradient id="g-ring" x1="0%" y1="0%" x2="100%" y2="100%"> + <stop offset="0%" stop-color="#b5a99e"/> + <stop offset="100%" stop-color="#7a6f67"/> + </linearGradient> + </defs> + + <!-- Badge background — parchment --> + <circle cx="100" cy="100" r="92" fill="url(#g-bg)"/> + <circle cx="100" cy="100" r="92" fill="none" stroke="url(#g-ring)" stroke-width="2.5"/> + + <!-- Temple group — scaled to fit inside circle with margin --> + <!-- Width: 136px (x 32..168), Height: ~138px (y 34..172) — all within r=92 circle --> + + <!-- Pediment --> + <polygon points="100,34 168,62 32,62" fill="url(#g-marble)"/> + <line x1="100" y1="34" x2="168" y2="62" stroke="url(#g-stone)" stroke-width="2" stroke-linecap="round"/> + <line x1="100" y1="34" x2="32" y2="62" stroke="url(#g-stone)" stroke-width="2" stroke-linecap="round"/> + + <!-- Entablature --> + <rect x="32" y="62" width="136" height="11" rx="0" fill="url(#g-marble)"/> + <rect x="32" y="71" width="136" height="2" fill="url(#g-shadow)" opacity="0.25"/> + + <!-- 3 columns — centers at 59, 100, 141 — shaft 17px wide, gaps ~25px --> + <rect x="50" y="73" width="17" height="64" fill="url(#g-marble)"/> + <rect x="91" y="73" width="17" height="64" fill="url(#g-marble)"/> + <rect x="133" y="73" width="17" height="64" fill="url(#g-marble)"/> + + <!-- Column right-edge shadows --> + <rect x="64" y="73" width="3" height="64" fill="url(#g-shadow)" opacity="0.22"/> + <rect x="105" y="73" width="3" height="64" fill="url(#g-shadow)" opacity="0.22"/> + <rect x="147" y="73" width="3" height="64" fill="url(#g-shadow)" opacity="0.22"/> + + <!-- Stylobate / base steps --> + <rect x="32" y="137" width="136" height="10" fill="url(#g-marble)"/> + <rect x="38" y="147" width="124" height="8" fill="url(#g-stone)"/> + <rect x="44" y="155" width="112" height="7" fill="url(#g-shadow)" opacity="0.45"/> + +</svg> diff --git a/docs/scientific-research/README.md 
b/docs/scientific-research/README.md new file mode 100644 index 0000000..cb9fd99 --- /dev/null +++ b/docs/scientific-research/README.md @@ -0,0 +1,15 @@ +# Scientific Research — Index + +Theoretical and empirical foundations for the decisions made in this template, organized by domain. + +| File | Entries | Domain | +|---|---|---| +| `cognitive-science.md` | 1–10 | Pre-mortem, implementation intentions, commitment devices, System 2, adversarial collaboration, accountability, chunking, elaborative encoding, error feedback, prospective memory | +| `testing.md` | 11–15, 51–54 | Observable behavior testing, test-behavior alignment, first-class tests, property-based testing, mutation testing, Canon TDD, GOOS outer/inner loop, Is TDD Dead, BDD origin | +| `software-economics.md` | 16 | Cost of change curve (shift left) | +| `requirements-elicitation.md` | 17–20, 28–30, 43–50 | INVEST, Example Mapping, declarative Gherkin, MoSCoW, active listening, Kipling 5Ws, BA framework, FDD, affinity mapping, Event Storming, CIT, cognitive interview, laddering, funnel technique, RE issues | +| `domain-modeling.md` | 31 | DDD bounded contexts, ubiquitous language, feature identification | +| `oop-design.md` | 32–35 | Object Calisthenics, Refactoring (Fowler), GoF Design Patterns, SOLID | +| `refactoring-empirical.md` | 36–41 | QDIR smell prioritization, smells + architectural refactoring, SPIRIT tool, bad OOP engineering properties, CWC complexity metric, metric threshold unreliability | +| `architecture.md` | 42 | Hexagonal Architecture — ports and adapters | +| `ai-agents.md` | 21–27 | Minimal-scope agent design, context isolation, on-demand skills, instruction conflict resolution failure, positional attention degradation, modular prompt de-duplication, three-file separation | diff --git a/docs/scientific-research/ai-agents.md b/docs/scientific-research/ai-agents.md new file mode 100644 index 0000000..0960b13 --- /dev/null +++ b/docs/scientific-research/ai-agents.md @@ -0,0 +1,118 
@@ +# Scientific Research — AI Agent Design + +Foundations for the agent architecture, file structure, and context management decisions in this template. + +--- + +### 21. Minimal-Scope Agent Design + +| | | +|---|---| +| **Source** | OpenAI. (2024). *Agent definitions*. OpenAI Agents SDK Documentation. https://platform.openai.com/docs/guides/agents/define-agents | +| **Date** | 2024 | +| **Alternative** | Anthropic. (2024). *Building effective agents*. Anthropic Engineering Blog. https://www.anthropic.com/engineering/building-effective-agents | +| **Status** | Confirmed — corrects the belief that subagents should be "lean routing agents" | +| **Core finding** | "Define the smallest agent that can own a clear task. Add more agents only when you need separate ownership, different instructions, different tool surfaces, or different approval policies." The split criterion is ownership boundary, not instruction volume. | +| **Mechanism** | Multiple agents competing to own the same concern create authority conflicts and inconsistent tool access. The right unit is the smallest coherent domain that requires exclusive responsibility. | +| **Where used** | Agent design in `.opencode/agents/*.md` — 4 agents, each owning a distinct domain (PO, developer, reviewer, setup). | + +--- + +### 22. Context Isolation via Subagents + +| | | +|---|---| +| **Source** | Anthropic. (2025). *Best practices for Claude Code*. Anthropic Documentation. https://www.anthropic.com/engineering/claude-code-best-practices | +| **Date** | 2025 | +| **Status** | Confirmed — the primary reason subagents exist is context isolation, not routing | +| **Core finding** | Subagents run in their own context windows and report back summaries, keeping the main conversation clean for implementation. Every file read in a subagent burns tokens in a child window, not the primary window. | +| **Mechanism** | Context window is the primary performance constraint for LLM agents. 
Investigation tasks rapidly exhaust context if done inline. Delegating to a subagent quarantines that cost; the primary agent receives only the distilled result. A fresh context in the subagent also prevents anchoring bias from prior conversation state. | +| **Where used** | OpenCode `task` tool usage in all agents; `explore` and `general` built-in subagents. | + +--- + +### 23. On-Demand Skill Loading (Context Budget) + +| | | +|---|---| +| **Source** | Anthropic. (2025). *Best practices for Claude Code*. Anthropic Documentation. https://www.anthropic.com/engineering/claude-code-best-practices | +| **Date** | 2025 | +| **Alternative** | OpenCode. (2026). *Agent Skills*. OpenCode Documentation. https://opencode.ai/docs/skills/ | +| **Status** | Confirmed (vendor guidance) — benefit on task completion quality extrapolated from RAG retrieval literature | +| **Core finding** | "CLAUDE.md is loaded every session, so only include things that apply broadly. For domain knowledge or workflows only relevant sometimes, use skills instead. Claude loads them on demand without bloating every conversation." Bloated always-loaded files cause Claude to ignore critical instructions. | +| **Mechanism** | Every token in an unconditionally-loaded file competes for attention against the task prompt. Long always-loaded files push important instructions beyond effective attention range, causing silent non-compliance. Skills are injected only when the task calls for them, preserving the primary context budget. | +| **Where used** | `AGENTS.md` carries only shared project conventions and commands; all step-specific workflows live in `.opencode/skills/*.md` and are loaded via the `skill` tool only when the relevant step begins. | + +--- + +### 24. Instruction Conflict Resolution Failure in LLMs + +| | | +|---|---| +| **Source** | Geng et al. (2025). Control Illusion: The Failure of Instruction Hierarchies in Large Language Models. AAAI-26. arXiv:2502.15851. 
https://arxiv.org/abs/2502.15851 | +| **Date** | 2025 | +| **Alternative** | Wallace et al. (2024). The Instruction Hierarchy: Training LLMs to Prioritize Privileged Instructions. arXiv:2404.13208. | +| **Status** | Confirmed — peer-reviewed (AAAI-26), replicated across 6 models | +| **Core finding** | LLMs do not reliably prioritize system-prompt instructions over conflicting instructions from other sources. Resolution is inconsistent and biased by pretraining-derived priors, not by prompt structure or position. | +| **Mechanism** | No structural separation between instruction sources enforces reliable priority at inference time. When the same directive appears in two locations with divergent content, the model selects between them based on statistical priors from pretraining. | +| **Where used** | Justifies single source of truth in `AGENTS.md`: workflow details duplicated across agent files and skills that drift out of sync produce conflicting instructions the model cannot resolve reliably. | + +--- + +### 25. Positional Attention Degradation in Long Contexts + +| | | +|---|---| +| **Source** | Liu et al. (2023). Lost in the Middle: How Language Models Use Long Contexts. *Transactions of the Association for Computational Linguistics*. arXiv:2307.03172. https://arxiv.org/abs/2307.03172 | +| **Date** | 2023 | +| **Alternative** | McKinnon (2025). arXiv:2511.05850 — effect attenuated for simple retrieval in Gemini 2.5+; persists for multi-hop reasoning. | +| **Status** | Confirmed with caveat — robust for multi-hop reasoning; attenuated for simple retrieval in frontier models (2025–2026) | +| **Core finding** | Performance on tasks requiring retrieval from long contexts follows a U-shaped curve: highest when relevant content is at the beginning or end of the context, degraded when content falls in the middle. | +| **Mechanism** | Transformer attention is not uniform across token positions. 
Content placed in the middle of a long context receives less attention weight regardless of its relevance. | +| **Where used** | Supports keeping always-loaded files lean. Duplicated workflow detail in always-loaded files increases total context length, pushing other content into lower-attention positions. | + +--- + +### 26. Modular Prompt De-duplication Reduces Interference + +| | | +|---|---| +| **Source** | Sharma & Henley (2026). Modular Prompt Optimization. arXiv:2601.04055. https://arxiv.org/abs/2601.04055 | +| **Date** | 2026 | +| **Status** | Partially confirmed — single-agent reasoning benchmarks only; not tested on multi-file agent architectures | +| **Core finding** | Structured prompts with explicit section de-duplication outperform both monolithic prompts and unstructured modular prompts. The mechanism cited is "reducing redundancy and interference between components." | +| **Mechanism** | Redundant content across prompt sections creates competing attention targets. De-duplication concentrates relevant signal in one canonical location per concern. | +| **Where used** | Supports the rule that skills and agent routing files contain no duplication of `AGENTS.md` content or of each other. | + +--- + +### 27. Agent File Architecture — Three-File Separation + +| | | +|---|---| +| **Source** | Convergence of entries 23, 24, 25, 26. | +| **Date** | — | +| **Status** | Inferred — no direct A/B test of this architecture exists; supported by convergence of confirmed and partially confirmed findings above | +| **Core finding** | Three distinct failure modes (instruction conflict on drift, positional attention degradation, redundancy interference) converge to produce a three-file split with defined content rules for each. | +| **Mechanism** | Each file runs at a different time and serves a different purpose. Mixing concerns across files reintroduces the failure modes the split is designed to prevent. 
| +| **Where used** | Structural rule for `AGENTS.md`, `.opencode/agents/*.md`, and `.opencode/skills/*.md`. | + +| File | Runs when | Contains | Does NOT contain | +|---|---|---|---| +| `AGENTS.md` | Every session, always loaded | Project conventions, shared commands, formats, standards | Step procedures, role-specific rules, path specs | +| `.opencode/agents/*.md` | When that role is invoked | Role identity, step ownership, skill load instructions, tool permissions, escalation paths | Workflow details, principle lists, path specs, commit formats | +| `.opencode/skills/*.md` | On demand, when that step begins | Full procedural instructions for that step, self-contained | Duplication of `AGENTS.md` content or other skills | + +--- + +## Bibliography + +1. Anthropic. (2024). Building effective agents. https://www.anthropic.com/engineering/building-effective-agents +2. Anthropic. (2025). Best practices for Claude Code. https://www.anthropic.com/engineering/claude-code-best-practices +3. Geng et al. (2025). Control Illusion. AAAI-26. arXiv:2502.15851. https://arxiv.org/abs/2502.15851 +4. Liu, N. F. et al. (2023). Lost in the Middle. *TACL*. arXiv:2307.03172. https://arxiv.org/abs/2307.03172 +5. McKinnon, R. (2025). arXiv:2511.05850. https://arxiv.org/abs/2511.05850 +6. OpenAI. (2024). Agent definitions. https://platform.openai.com/docs/guides/agents/define-agents +7. OpenCode. (2026). Agent Skills. https://opencode.ai/docs/skills/ +8. Sharma, A., & Henley, A. (2026). Modular Prompt Optimization. arXiv:2601.04055. https://arxiv.org/abs/2601.04055 +9. Wallace, E. et al. (2024). The Instruction Hierarchy. arXiv:2404.13208. diff --git a/docs/scientific-research/architecture.md b/docs/scientific-research/architecture.md new file mode 100644 index 0000000..5b5bb5f --- /dev/null +++ b/docs/scientific-research/architecture.md @@ -0,0 +1,24 @@ +# Scientific Research — Architecture + +Foundations for the architectural decisions and patterns used in this template. 
+ +--- + +### 42. Hexagonal Architecture — Ports and Adapters + +| | | +|---|---| +| **Source** | Cockburn, A. (2005). "Hexagonal Architecture." *alistair.cockburn.us*. https://alistair.cockburn.us/hexagonal-architecture/ | +| **Date** | 2005 | +| **Alternative** | Freeman, S., & Pryce, N. (2009). *Growing Object-Oriented Software, Guided by Tests*. Addison-Wesley. (Chapter 7: "Ports and Adapters") | +| **Status** | Confirmed — foundational; widely adopted as Clean Architecture, Onion Architecture | +| **Core finding** | The application domain should have no knowledge of external systems (databases, filesystems, network, UI). All contact between the domain and the outside world passes through a **port** (an interface / Protocol) and an **adapter** (a concrete implementation of that port). The domain is independently testable without any infrastructure. The key structural rule: dependency arrows point inward — domain code never imports from adapters; adapters import from domain. | +| **Mechanism** | Two distinct sides of any application: the "driving side" (actors who initiate action — tests, UI, CLI) and the "driven side" (actors the application drives — databases, filesystems, external services). Each driven-side dependency is hidden behind a port. Tests supply a test adapter; production supplies a real adapter. Substituting adapters requires no domain code changes. This is SOLID-D at the architectural layer. | +| **Where used** | Step 2 (Architecture): if an external dependency is identified during domain analysis, assign it a Protocol. `ports/` and `adapters/` folders emerge when a concrete dependency is confirmed — do not pre-create them. The dependency-inversion principle (SOLID-D) is the goal; the folder names are convention, not law. | + +--- + +## Bibliography + +1. Cockburn, A. (2005). Hexagonal Architecture. *alistair.cockburn.us*. https://alistair.cockburn.us/hexagonal-architecture/ +2. Freeman, S., & Pryce, N. (2009). 
*Growing Object-Oriented Software, Guided by Tests*. Addison-Wesley. diff --git a/docs/scientific-research/cognitive-science.md b/docs/scientific-research/cognitive-science.md new file mode 100644 index 0000000..fa2b1b8 --- /dev/null +++ b/docs/scientific-research/cognitive-science.md @@ -0,0 +1,150 @@ +# Scientific Research — Cognitive Science + +Mechanisms from cognitive and social psychology that justify workflow design decisions in this template. + +--- + +### 1. Pre-mortem (Prospective Hindsight) + +| | | +|---|---| +| **Source** | Klein, G. (1998). *Sources of Power: How People Make Decisions*. MIT Press. | +| **Date** | 1998 | +| **Status** | Confirmed | +| **Core finding** | Asking "imagine this failed — why?" catches 30% more issues than forward-looking review. | +| **Mechanism** | Prospective hindsight shifts from prediction (weak) to explanation (strong). The brain is better at explaining past events than predicting future ones. By framing as "it already failed," you activate explanation mode. | +| **Where used** | PO pre-mortem at scope, developer pre-mortem before handoff. | + +--- + +### 2. Implementation Intentions + +| | | +|---|---| +| **Source** | Gollwitzer, P. M. (1999). Implementation intentions: Strong effects of simple plans. *American Psychologist*, 54(7), 493–503. | +| **Date** | 1999 | +| **Status** | Confirmed | +| **Core finding** | "If X then Y" plans are 2–3x more likely to execute than general intentions. | +| **Mechanism** | If-then plans create automatic cue-response links in memory. The brain processes "if function > 20 lines then extract helper" as an action trigger, not a suggestion to consider. | +| **Where used** | Refactor Self-Check Gates in `implementation/SKILL.md`, Code Quality checks in `verify/SKILL.md`. | + +--- + +### 3. Commitment Devices + +| | | +|---|---| +| **Source** | Cialdini, R. B. (2001). *Influence: The Psychology of Persuasion* (rev. ed.). HarperBusiness. 
| +| **Date** | 2001 | +| **Status** | Confirmed | +| **Core finding** | Forcing an explicit micro-commitment (filling in a PASS/FAIL cell) creates resistance to reversals. A checkbox checked is harder to uncheck than a todo noted. | +| **Mechanism** | Structured tables with PASS/FAIL cells create commitment-device effects. The act of marking "FAIL" requires justification, making silent passes psychologically costly. | +| **Where used** | SOLID enforcement table, ObjCal enforcement table, Design Patterns table — all require explicit PASS/FAIL with evidence. | + +--- + +### 4. System 2 Before System 1 + +| | | +|---|---| +| **Source** | Kahneman, D. (2011). *Thinking, Fast and Slow*. Farrar, Straus and Giroux. | +| **Date** | 2011 | +| **Status** | Confirmed | +| **Core finding** | System 1 (fast, automatic) is vulnerable to anchoring and confirmation bias. System 2 (slow, deliberate) must be activated before System 1's judgments anchor. | +| **Mechanism** | Running semantic review *before* automated commands prevents the "all green" dopamine hit from anchoring the reviewer's judgment. Doing hard cognitive work first protects against System 1 shortcuts. | +| **Where used** | Verification order in `verify/SKILL.md`: semantic alignment check before commands. | + +--- + +### 5. Adversarial Collaboration + +| | | +|---|---| +| **Source** | Mellers, B. A., Hertwig, R., & Kahneman, D. (2001). Do frequency representations eliminate conjunction effects? An exercise in adversarial collaboration. *Psychological Science*, 12(4), 269–275. | +| **Date** | 2001 | +| **Status** | Confirmed | +| **Core finding** | Highest-quality thinking emerges when parties hold different hypotheses and are charged with finding flaws in each other's reasoning. | +| **Mechanism** | Explicitly framing the reviewer as "your job is to break this feature" activates the adversarial collaboration mode. The reviewer seeks disconfirmation rather than confirmation. | +| **Where used** | Adversarial mandate in `reviewer.md` and `verify/SKILL.md`. 
| + +--- + +### 6. Accountability to Unknown Audience + +| | | +|---|---| +| **Source** | Tetlock, P. E. (1983). Accountability: A social determinant of judgment. In *Psychology of Learning and Motivation* (Vol. 17, pp. 295–332). Academic Press. | +| **Date** | 1983 | +| **Status** | Confirmed | +| **Core finding** | Accountability to an unknown audience with unknown views improves reasoning quality. The agent anticipates being audited and adjusts reasoning. | +| **Mechanism** | The explicit report format (APPROVED/REJECTED with evidence) creates an accountability structure — the reviewer's reasoning will be read by the PO. | +| **Where used** | Report format in `verify/SKILL.md`, structured evidence columns in all enforcement tables. | + +--- + +### 7. Chunking and Cognitive Load Reduction + +| | | +|---|---| +| **Source** | Miller, G. A. (1956). The magical number seven, plus or minus two. *Psychological Review*, 63(2), 81–97. | +| **Date** | 1956 | +| **Alternative** | Sweller, J. (1988). Cognitive load during problem solving. *Cognitive Science*, 12(2), 257–285. | +| **Status** | Confirmed | +| **Core finding** | Structured tables reduce working memory load vs. narrative text. Chunking related items into table rows enables parallel processing. | +| **Mechanism** | Replacing prose checklists with structured tables (rows × columns) allows the reviewer to process all items in a single pass. | +| **Where used** | All enforcement tables in `verify/SKILL.md` and `reviewer.md`. | + +--- + +### 8. Elaborative Encoding + +| | | +|---|---| +| **Source** | Craik, F. I. M., & Lockhart, R. S. (1972). Levels of processing: A framework for memory research. *Journal of Verbal Learning and Verbal Behavior*, 11(6), 671–684. | +| **Date** | 1972 | +| **Status** | Confirmed | +| **Core finding** | Deeper processing — explaining *why* a rule matters — leads to better retention and application than shallow processing. 
| +| **Mechanism** | Adding a "Why it matters" column to enforcement tables forces the reviewer to process the rationale, not just scan the rule name. | +| **Where used** | SOLID table, ObjCal table, Design Patterns table — all have "Why it matters" column. | + +--- + +### 9. Error-Specific Feedback + +| | | +|---|---| +| **Source** | Hattie, J., & Timperley, H. (2007). The power of feedback. *Review of Educational Research*, 77(1), 81–112. | +| **Date** | 2007 | +| **Status** | Confirmed | +| **Core finding** | Feedback is most effective when it tells the agent exactly what went wrong and what the correct action is. "FAIL: function > 20 lines at file:47" is actionable; "Apply function length rules" is not. | +| **Mechanism** | The evidence column in enforcement tables requires specific file:line references, turning vague rules into actionable directives. | +| **Where used** | Evidence column in all enforcement tables. | + +--- + +### 10. Prospective Memory Cues + +| | | +|---|---| +| **Source** | McDaniel, M. A., & Einstein, G. O. (2000). Strategic and automatic processes in prospective memory retrieval. *Applied Cognitive Psychology*, 14(7), S127–S144. | +| **Date** | 2000 | +| **Status** | Confirmed | +| **Core finding** | Memory for intended actions is better when cues are embedded at the point of action, not in a separate appendix. | +| **Mechanism** | Placing if-then gates inline (in the REFACTOR section) rather than in a separate "reference" document increases adherence. The cue appears exactly when the developer is about to make the relevant decision. | +| **Where used** | Refactor Self-Check Gates embedded inline in `refactor/SKILL.md`. | + +--- + +## Bibliography + +1. Cialdini, R. B. (2001). *Influence: The Psychology of Persuasion* (rev. ed.). HarperBusiness. +2. Craik, F. I. M., & Lockhart, R. S. (1972). Levels of processing: A framework for memory research. *Journal of Verbal Learning and Verbal Behavior*, 11(6), 671–684. +3. Gollwitzer, P. M. (1999). 
Implementation intentions: Strong effects of simple plans. *American Psychologist*, 54(7), 493–503. +4. Hattie, J., & Timperley, H. (2007). The power of feedback. *Review of Educational Research*, 77(1), 81–112. +5. Kahneman, D. (2011). *Thinking, Fast and Slow*. Farrar, Straus and Giroux. +6. Klein, G. (1998). *Sources of Power: How People Make Decisions*. MIT Press. +7. McDaniel, M. A., & Einstein, G. O. (2000). Strategic and automatic processes in prospective memory retrieval. *Applied Cognitive Psychology*, 14(7), S127–S144. +8. Mellers, B. A., Hertwig, R., & Kahneman, D. (2001). Do frequency representations eliminate conjunction effects? An exercise in adversarial collaboration. *Psychological Science*, 12(4), 269–275. +9. Miller, G. A. (1956). The magical number seven, plus or minus two. *Psychological Review*, 63(2), 81–97. +10. Sweller, J. (1988). Cognitive load during problem solving. *Cognitive Science*, 12(2), 257–285. +11. Tetlock, P. E. (1983). Accountability: A social determinant of judgment. In *Psychology of Learning and Motivation* (Vol. 17). Academic Press. diff --git a/docs/scientific-research/domain-modeling.md b/docs/scientific-research/domain-modeling.md new file mode 100644 index 0000000..d49be2e --- /dev/null +++ b/docs/scientific-research/domain-modeling.md @@ -0,0 +1,24 @@ +# Scientific Research — Domain Modeling + +Foundations for bounded context identification, ubiquitous language, and feature decomposition used in this template. + +--- + +### 31. Domain-Driven Design — Bounded Contexts and Feature Identification + +| | | +|---|---| +| **Source** | Evans, E. (2003). *Domain-Driven Design: Tackling Complexity in the Heart of Software*. Addison-Wesley. | +| **Date** | 2003 | +| **Alternative** | Context Mapper (2025). Rapid Object-Oriented Analysis and Design. https://contextmapper.org/docs/rapid-ooad | +| **Status** | Confirmed — foundational DDD literature | +| **Core finding** | A Bounded Context is a boundary within which a particular ubiquitous language is consistent.
Features are identified by grouping related user stories that share the same language. The decomposition criterion is "single responsibility per context" + "consistency of language." | +| **Mechanism** | In DDD: (1) Extract ubiquitous language from requirements → (2) Group by language consistency → (3) Each group is a candidate bounded context → (4) Each bounded context maps to a feature. Context Mapper automates this: User Stories → Subdomains (via noun/verb extraction) → Bounded Contexts of type FEATURE. | +| **Where used** | Phase 1: after feature list identification, verify each feature has consistent language. Phase 2: noun/verb extraction from project discovery answers populates the Entities table — domain analysis cannot begin before this. The "Rules (Business)" section captures the ubiquitous language rules that govern each feature. | + +--- + +## Bibliography + +1. Context Mapper. (2025). Rapid Object-Oriented Analysis and Design. https://contextmapper.org/docs/rapid-ooad +2. Evans, E. (2003). *Domain-Driven Design: Tackling Complexity in the Heart of Software*. Addison-Wesley. diff --git a/docs/scientific-research/oop-design.md b/docs/scientific-research/oop-design.md new file mode 100644 index 0000000..4b0637d --- /dev/null +++ b/docs/scientific-research/oop-design.md @@ -0,0 +1,64 @@ +# Scientific Research — OOP Design + +Foundations for object-oriented design principles used in this template. + +--- + +### 32. Object Calisthenics — Nine Rules + +| | | +|---|---| +| **Source** | Bay, J. "Object Calisthenics." *The Thoughtworks Anthology* (PragProg, 2008). Original in IEEE Software/DevX, ~2005. 
https://www.bennadel.com/resources/uploads/2012/objectcalisthenics.pdf | +| **Date** | ~2005 | +| **Status** | Practitioner synthesis | +| **Core finding** | 9 rules to internalize OOP: (1) One level indentation per method, (2) No ELSE, (3) Wrap primitives/Strings, (4) First class collections, (5) One dot per line, (6) No abbreviations, (7) Classes ≤50 lines, (8) ≤2 instance variables, (9) No getters/setters. 7 of 9 enforce data encapsulation; 1 drives polymorphism; 1 drives naming. | +| **Mechanism** | Restrictions force decomposition. When you cannot use getters, behavior must move into the object. When you cannot use ELSE, you use polymorphism. When classes must be ≤2 ivars, you discover missing abstractions. | +| **Where used** | Refactor self-declaration checklist in `refactor/SKILL.md`. | + +--- + +### 33. Refactoring + +| | | +|---|---| +| **Source** | Fowler, M. (1999/2018). *Refactoring: Improving the Design of Existing Code* (2nd ed.). Addison-Wesley. https://martinfowler.com/books/refactoring.html | +| **Date** | 1999, 2018 | +| **Status** | Confirmed — foundational | +| **Core finding** | Refactoring = behavior-preserving transformations. 68 catalogued refactorings, each small enough to do safely but cumulative effect significant. Code smells (duplicate code, long methods, feature envy) indicate refactoring opportunities. | +| **Mechanism** | Small steps reduce risk. Each refactoring is reversible. Test suite validates behavior unchanged. | +| **Where used** | `refactor/SKILL.md`: smell detection triggers refactoring; full protocol and catalogue entries. | + +--- + +### 34. Design Patterns + +| | | +|---|---| +| **Source** | Gamma, E., Helm, R., Johnson, R., Vlissides, J. (1995). *Design Patterns: Elements of Reusable Object-Oriented Software*. Addison-Wesley. | +| **Date** | 1995 | +| **Status** | Confirmed — foundational | +| **Core finding** | 23 patterns catalogued in 3 categories: Creational (5), Structural (7), Behavioral (11). 
Key principles: "Favor composition over inheritance," "Program to an interface, not an implementation." | +| **Mechanism** | Patterns are recurring solutions to common problems. Named and catalogued so developers don't rediscover them. | +| **Where used** | `design-patterns/SKILL.md`: full GoF catalogue with smell-triggered Python before/after examples. | + +--- + +### 35. SOLID Principles + +| | | +|---|---| +| **Source** | Martin, R. C. (2000). "Principles of OOD." *ButUncleBob.com*. Acronym coined by Michael Feathers (2004). https://blog.interface-solv.com/wp-content/uploads/2020/07/Principles-Of-OOD.pdf | +| **Date** | 2000 | +| **Status** | Confirmed | +| **Core finding** | S: One reason to change. O: Open extension, closed modification. L: Subtypes substitutable. I: No forced stub methods. D: Depend on abstractions, not concretes. | +| **Mechanism** | Each principle targets a specific coupling failure mode. Together they produce low coupling, high cohesion. | +| **Where used** | Refactor self-declaration checklist in `refactor/SKILL.md`: 5-row SOLID table with Python before/after examples. | + +--- + +## Bibliography + +1. Bay, J. (~2005). "Object Calisthenics." *IEEE Software/DevX*. https://www.bennadel.com/resources/uploads/2012/objectcalisthenics.pdf +2. Fowler, M. (1999/2018). *Refactoring: Improving the Design of Existing Code* (2nd ed.). Addison-Wesley. https://martinfowler.com/books/refactoring.html +3. Gamma, E., Helm, R., Johnson, R., & Vlissides, J. (1995). *Design Patterns: Elements of Reusable Object-Oriented Software*. Addison-Wesley. +4. Martin, R. C. (2000). "Principles of OOD." *ButUncleBob.com*. 
https://blog.interface-solv.com/wp-content/uploads/2020/07/Principles-Of-OOD.pdf diff --git a/docs/scientific-research/refactoring-empirical.md b/docs/scientific-research/refactoring-empirical.md new file mode 100644 index 0000000..61d666c --- /dev/null +++ b/docs/scientific-research/refactoring-empirical.md @@ -0,0 +1,100 @@ +# Scientific Research — Refactoring (Empirical) + +Empirical studies on code smells, refactoring prioritization, and OOP complexity used in this template. + +--- + +### 36. QDIR — Bad-Smells + OO Metrics Prioritization + +| | | +|---|---| +| **Source** | Malhotra, R., Singh, P. (2020). "Exploiting bad-smells and object-oriented characteristics to prioritize classes for refactoring." *Int. J. Syst. Assur. Eng. Manag.* 11(Suppl 2), 133–144. Springer. | +| **Date** | 2020 | +| **URL** | https://doi.org/10.1007/s13198-020-01001-x | +| **Status** | Confirmed — empirical | +| **Core finding** | QDIR (Quality Depreciation Index Rule) combines bad-smell severity with OO metrics (LOC, WMC, CBO, RFC, DIT) to prioritize classes for refactoring. Validated on 8 open-source Java systems. | +| **Mechanism** | Classes with high smell severity AND high OO metrics are prioritized. QDIR = weighted sum. | +| **Where used** | Refactor prioritization: when smell detected, check OO metrics to prioritize. | + +--- + +### 37. Smells + Architectural Refactoring + +| | | +|---|---| +| **Source** | Silva, C. et al. (2020). "When Are Smells Indicators of Architectural Refactoring Opportunities." *Proc. 28th Int. Conf. on Program Comprehension*. ACM. | +| **Date** | 2020 | +| **URL** | https://doi.org/10.1145/3387904.3389276 | +| **Status** | Confirmed — empirical | +| **Core finding** | Study of 50 projects, 52,667 refactored elements. 67.53% of smells co-occur. Smells that co-occur are indicators of architectural refactoring in 88.53% of cases. | +| **Mechanism** | Single smells are often code-level; co-occurring smells indicate architectural problems. 
Pattern catalog for smells → specific architectural refactorings. | +| **Where used** | Smell detection triggers architectural analysis when co-occurrence patterns detected. | + +--- + +### 38. SPIRIT Tool — Code Smell Prioritization + +| | | +|---|---| +| **Source** | Vidal, S. A., Marcos, C., Díaz-Pace, J. A. (2014). "An Approach to Prioritize Code Smells for Refactoring." *Automated Software Engineering*, 23(3), 501–532. | +| **Date** | 2014 | +| **URL** | https://doi.org/10.1007/s10515-014-0175-x | +| **Status** | Confirmed — tool | +| **Core finding** | SPIRIT (Smart Identification of Refactoring opportunITies) prioritizes smells by 3 criteria: (1) component stability, (2) impact on modifiability scenarios, (3) smell relevance. Top-ranked smells correlate with expert developer judgment. | +| **Mechanism** | Semi-automated ranking. Combines version history (stable vs. unstable), impact analysis, and smell type. | +| **Where used** | Refactor prioritization: stability = has the class changed recently? Unstable + smelly = prioritize. | + +--- + +### 39. Bad Engineering Properties of OOP + +| | | +|---|---| +| **Source** | Cardelli, L. (1996). "Bad Engineering Properties of Object-Oriented Languages." *ACM Computing Surveys*, 28(4), 150. | +| **Date** | 1996 | +| **URL** | https://www.microsoft.com/en-us/research/publication/bad-engineering-properties-of-object-oriented-languages/ | +| **Status** | Confirmed — foundational critique | +| **Core finding** | OOP has 5 "economy" problems: (1) Execution (virtual methods prevent inlining), (2) Compilation (no code/interface separation), (3) Small-scale dev (expressive type systems missing), (4) Large-scale dev (poor class extension/modification), (5) Language features (baroque complexity). | +| **Mechanism** | OOP is not universally superior. Trade-offs exist. Knowing these helps avoid over-engineering. | +| **Where used** | Anti-pre-pattern: know when OOP adds complexity vs. value. | + +--- + +### 40. 
Code Complexity Model of OOP + +| | | +|---|---| +| **Source** | Aluthwaththage, J. H., Thathsarani, H. A. N. N. (2024). "A Novel OO-Based Code Complexity Metric." *Proc. Future Technologies Conference (FTC)*, 616–628. Springer/IEEE. | +| **Date** | 2024 | +| **URL** | https://link.springer.com/chapter/10.1007/978-3-031-73125-9_39 | +| **Alternative** | Misra et al. (2024). "A Suite of Object Oriented Cognitive Complexity Metrics." IEEE. | +| **Status** | Partially confirmed — recent | +| **Core finding** | CWC (Combined Weighted Complexity) measures OOP complexity at statement level, considering 8 factors: nesting depth, control types, compound conditions, try-catch, threads, pointers, references, dynamic memory. Addresses gap in existing metrics ignoring cognitive load. | +| **Mechanism** | Granular complexity scoring. Higher scores indicate more cognitively demanding code. | +| **Where used** | Complexity measurement: when function > 20 lines, consider CWC-style granular scoring. | + +--- + +### 41. Metric Thresholds for Smell Detection + +| | | +|---|---| +| **Source** | Bigonha, M. A. S., et al. (2019). "The usefulness of software metric thresholds for detection of bad smells and fault prediction." *Information and Software Technology*, 115, 79–92. | +| **Date** | 2019 | +| **URL** | https://doi.org/10.1016/j.infsof.2019.08.005 | +| **Alternative** | Catal et al. (2018). "Software metrics thresholds calculation techniques." *Info. Softw. Technol.* | +| **Status** | Confirmed | +| **Core finding** | Metric thresholds (e.g., LOC > 600) used for smell detection are unreliable. Study on 92 open-source systems found precision too low for practical use. Neither heuristic-based nor ML approaches achieve acceptable accuracy. | +| **Mechanism** | Fixed thresholds are context-dependent. Thresholds should be project-specific, not universal. | +| **Where used** | Anti-pre-pattern: do not rely on fixed thresholds. Use co-occurrence patterns (entry 37) instead. 
| + +--- + +## Bibliography + +1. Aluthwaththage, J. H., & Thathsarani, H. A. N. N. (2024). A Novel OO-Based Code Complexity Metric. *Proc. Future Technologies Conference (FTC)*, 616–628. https://link.springer.com/chapter/10.1007/978-3-031-73125-9_39 +2. Bigonha, M. A. S., et al. (2019). The usefulness of software metric thresholds. *Information and Software Technology*, 115, 79–92. https://doi.org/10.1016/j.infsof.2019.08.005 +3. Cardelli, L. (1996). Bad Engineering Properties of Object-Oriented Languages. *ACM Computing Surveys*, 28(4), 150. https://www.microsoft.com/en-us/research/publication/bad-engineering-properties-of-object-oriented-languages/ +4. Malhotra, R., & Singh, P. (2020). Exploiting bad-smells and OO characteristics. *Int. J. Syst. Assur. Eng. Manag.*, 11(Suppl 2), 133–144. https://doi.org/10.1007/s13198-020-01001-x +5. Silva, C. et al. (2020). When Are Smells Indicators of Architectural Refactoring Opportunities. *Proc. 28th ICPC*. https://doi.org/10.1145/3387904.3389276 +6. Vidal, S. A., Marcos, C., & Díaz-Pace, J. A. (2014). An Approach to Prioritize Code Smells. *Automated Software Engineering*, 23(3), 501–532. https://doi.org/10.1007/s10515-014-0175-x diff --git a/docs/scientific-research/requirements-elicitation.md b/docs/scientific-research/requirements-elicitation.md new file mode 100644 index 0000000..ec5e68f --- /dev/null +++ b/docs/scientific-research/requirements-elicitation.md @@ -0,0 +1,246 @@ +# Scientific Research — Requirements Elicitation + +Foundations for the PO interview structure, Gherkin criteria, and feature discovery in this template. + +--- + +### 17. INVEST Criteria for User Stories + +| | | +|---|---| +| **Source** | Wake, B. (2003). *INVEST in Good Stories, and SMART Tasks*. XP123.com. | +| **Date** | 2003 | +| **Alternative** | Cohn, M. (2004). *User Stories Applied: For Agile Software Development*. Addison-Wesley. 
| +| **Status** | Confirmed | +| **Core finding** | Stories that are Independent, Negotiable, Valuable, Estimable, Small, and Testable produce fewer downstream defects and smoother development cycles. | +| **Mechanism** | INVEST serves as a quality gate before stories enter development. "Testable" forces the PO to express observable outcomes (directly enabling Given/When/Then). "Small" forces decomposition. "Independent" prevents hidden ordering dependencies. | +| **Where used** | INVEST gate in Phase 3 of `scope/SKILL.md`. | + +--- + +### 18. Example Mapping (Rules Layer) + +| | | +|---|---| +| **Source** | Wynne, M. (2015). *Introducing Example Mapping*. Cucumber Blog. https://cucumber.io/blog/bdd/example-mapping-introduction/ | +| **Date** | 2015 | +| **Status** | Confirmed | +| **Core finding** | Inserting a "rules" layer between stories and examples prevents redundant or contradictory acceptance criteria. A story with many rules needs splitting; a story with many open questions is not ready for development. | +| **Mechanism** | Four card types: Story (yellow), Rules (blue), Examples (green), Questions (red). The rules layer groups related examples under the business rule they illustrate. Red cards (unanswered questions) are a first-class signal to stop and investigate. | +| **Where used** | `## Rules` section in per-feature `discovery.md` (Phase 2). PO identifies business rules before writing Examples in Phase 4. | + +--- + +### 19. Declarative Gherkin + +| | | +|---|---| +| **Source** | Cucumber Team. (2024). *Better Gherkin*. Cucumber Documentation. https://cucumber.io/docs/bdd/better-gherkin/ | +| **Date** | 2024 | +| **Status** | Confirmed | +| **Core finding** | Declarative Gherkin ("When Bob logs in") produces specifications that survive UI changes. Imperative Gherkin ("When I click the Login button") couples specs to implementation details and breaks on every UI redesign. 
| +| **Mechanism** | Declarative steps describe *what happens* at the business level. Imperative steps describe *how the user interacts with a specific UI*. AI agents are especially prone to writing imperative Gherkin because they mirror literal steps. | +| **Where used** | Declarative vs. imperative table in Phase 4 of `scope/SKILL.md`. | + +--- + +### 20. MoSCoW Prioritization (Within-Story Triage) + +| | | +|---|---| +| **Source** | Clegg, D., & Barker, R. (1994). *Case Method Fast-Track: A RAD Approach*. Addison-Wesley (DSDM origin). | +| **Date** | 1994 | +| **Status** | Confirmed | +| **Core finding** | Classifying requirements as Must/Should/Could/Won't forces explicit negotiation about what is essential vs. desired. When applied *within* a single story, it reveals bloated stories that should be split. | +| **Mechanism** | DSDM mandates that Musts cannot exceed 60% of total effort. At the story level: if a story has 12 Examples and only 3 are Musts, the remaining 9 can be deferred. This prevents gold-plating and keeps stories small. | +| **Where used** | MoSCoW triage in Phase 4 of `scope/SKILL.md`. | + +--- + +### 28. Active Listening — Paraphrase-Clarify-Summarize + +| | | +|---|---| +| **Source** | Rogers, C. R., & Farson, R. E. (1957). *Active Listening*. Industrial Relations Center, University of Chicago. | +| **Date** | 1957 | +| **Alternative** | McNaughton, D. et al. (2008). Learning to Listen. *Topics in Early Childhood Special Education*, 27(4), 223–231. | +| **Status** | Confirmed — foundational clinical research; widely replicated | +| **Core finding** | Active listening — paraphrasing what was heard in the listener's own words, asking clarifying questions, then summarizing the main points and intent — reduces misunderstanding, builds trust, and confirms mutual understanding before proceeding. | +| **Mechanism** | Paraphrasing forces the listener to reconstruct the speaker's meaning, surfacing gaps immediately. 
Clarifying questions address residual ambiguity. Summarizing creates a shared record that both parties can confirm or correct. | +| **Where used** | PO summarization protocol in `scope/SKILL.md`: after each interview round, PO produces a "Here is what I understood" block before proceeding. | + +--- + +### 28a. Active Listening — Three-Level Structure + +| | | +|---|---| +| **Source** | Synthesis of: Nielsen (2010); Farrell (2017); Ambler (2002); Wynne (2015). | +| **Date** | 2010–2015 | +| **Status** | Synthesized rule of thumb — each component individually confirmed | +| **Core finding** | Active listening in requirements interviews operates at three granularities: **Level 1** (per answer) — immediate paraphrase; **Level 2** (per topic cluster) — transition summary; **Level 3** (end of interview) — full synthesis serving four downstream purposes. | +| **Level 3 — four uses** | 1. Accuracy gate (NN/G). 2. Scope crystallization (Ambler/FDD). 3. Input to domain modeling (Ambler/FDD). 4. Baseline trigger (Wynne/Cucumber). | +| **Where used** | Phase 1 and Phase 2 of `scope/SKILL.md`. | + +--- + +### 29. The Kipling Method — Five Ws and One H + +| | | +|---|---| +| **Source** | Kipling, R. (1902). *Just So Stories*. Macmillan. | +| **Date** | 1902 | +| **Alternative** | Hermagoras of Temnos (2nd century BCE) — seven circumstances of rhetoric. | +| **Status** | Practitioner synthesis — journalism, business analysis, investigative methodology | +| **Core finding** | The six interrogative questions (Who, What, When, Where, Why, How) form a complete framework for gathering all essential facts about any situation. Together they ensure completeness and prevent gaps. | +| **Where used** | Phase 1 project discovery: the initial seven questions are an adaptation of the 5W1H framework. | + +--- + +### 30. BA Requirements Question Framework + +| | | +|---|---| +| **Source** | Brandenburg, L. (2025). *Requirements Discovery Checklist Pack*. TechCanvass. 
| **Date** | 2025 | +| **Status** | Practitioner synthesis — consolidated BA methodology, not peer-reviewed | +| **Core finding** | Ten questions consistently make the most difference in requirements elicitation: (1) What problem are we solving? (2) What happens if we do nothing? (3) Who uses this? (4) What does success look like? (5) Walk me through how this works today. (6) Where does this usually break? (7) What decisions will this help? (8) What should definitely not happen? (9) What happens if input is wrong? (10) What assumptions are we making? | +| **Where used** | Phase 1 project discovery: the "Success", "Failure", and "Out-of-scope" questions map to this framework. | + +--- + +### 43. Feature-Driven Development — Domain Modeling to Feature List + +| | | +|---|---| +| **Source** | Ambler, S. W. (2002). *Agile Modeling*. Wiley. https://www.agilemodeling.com/essays/fdd.htm | +| **Date** | 2002 | +| **Alternative** | Palmer, S. R., & Felsing, J. M. (2002). *A Practical Guide to Feature-Driven Development*. Prentice Hall. | +| **Status** | Confirmed | +| **Core finding** | FDD requires domain modeling *before* feature naming. Features are expressed as "Action result object" triples. Features group into Feature Sets (shared domain object), which group into Subject Areas. | +| **Mechanism** | Domain modeling extracts the vocabulary (nouns = candidate classes, verbs = candidate methods). Feature identification then asks: "what verbs act on each noun?" | +| **Where used** | Phase 1 of `scope/SKILL.md`: after interview summary is confirmed, PO performs domain analysis (nouns/verbs → subject areas → FDD "Action result object" feature names). | + +--- + +### 44. Affinity Mapping / KJ Method — Bottom-Up Feature Identification + +| | | +|---|---| +| **Source** | Krause, R., & Pernice, K. (2024). Affinity Diagramming. *Nielsen Norman Group*.
https://www.nngroup.com/articles/affinity-diagram/ | +| **Date** | 2024 (method origin: Kawakita, J., 1960s) | +| **Alternative** | Kawakita, J. (1967). *Abduction*. Chuokoronsha. | +| **Status** | Confirmed | +| **Core finding** | Affinity diagramming groups raw observations/requirements into clusters by bottom-up similarity — no categories are named until grouping is complete. This prevents confirmation bias from top-down pre-labelling. | +| **Where used** | Phase 1 of `scope/SKILL.md` (alternative to FDD domain modeling): PO uses affinity mapping on interview answers to derive feature clusters. Best suited when working from interview transcripts solo. | + +--- + +### 45. Event Storming — Domain Events to Functional Areas + +| | | +|---|---| +| **Source** | Brandolini, A. (2013–present). *EventStorming*. Leanpub / eventstorming.com. https://eventstorming.com | +| **Date** | 2013 | +| **Status** | Confirmed | +| **Core finding** | Event Storming is a collaborative workshop where domain experts place past-tense domain events on a timeline. Sorting the events creates natural Functional Area clusters — these are candidate feature groups. The workshop also produces Ubiquitous Language, a Problem Inventory, and Actor roles. | +| **Mechanism** | Temporal sequencing of domain events forces resolution of conflicting mental models across organisational silos. Clusters emerge from shared vocabulary and causal proximity. | +| **Where used** | Optional alternative in Phase 1 of `scope/SKILL.md` for cross-silo discovery. | + +--- + +### 46. Critical Incident Technique — Gap-Finding via Past Events + +| | | +|---|---| +| **Source** | Flanagan, J. C. (1954). "The critical incident technique." *Psychological Bulletin*, 51(4), 327–357. https://doi.org/10.1037/h0061470 | +| **Date** | 1954 | +| **Alternative** | Rosala, M. (2020). The Critical Incident Technique in UX. *Nielsen Norman Group*. 
https://www.nngroup.com/articles/critical-incident-technique/ | +| **Status** | Confirmed — foundational; ~200 follow-on empirical studies | +| **Core finding** | Anchoring an interview on a specific past incident ("Tell me about a time when X broke down") breaks schema-based recall. Stakeholders describing actual past events report real workarounds, edge cases, and failure modes that never surface when asked "how does this usually work?" | +| **Mechanism** | Direct questions elicit the stakeholder's mental schema — a sanitized, gap-free description of how things *should* work. Incidents bypass the schema because episodic memory is anchored to specific sensory and emotional detail. | +| **Where used** | Session 2 (gap-finding) of Phase 1 and Phase 2 in `scope/SKILL.md`. | + +--- + +### 47. Cognitive Interview — Memory-Enhancing Elicitation Technique + +| | | +|---|---| +| **Source** | Fisher, R. P., & Geiselman, R. E. (1992). *Memory-Enhancing Techniques for Investigative Interviewing: The Cognitive Interview*. Charles C. Thomas. | +| **Date** | 1984 (original); 1987 (enhanced CI); 1992 (manual) | +| **Alternative** | Moody, W., Will, R. P., & Blanton, J. E. (1996). Enhancing knowledge elicitation using the cognitive interview. *Expert Systems with Applications*, 10(1), 127–133. | +| **Status** | Confirmed — meta-analysis: Köhnken et al. (1999), *Psychology, Crime & Law*, 5(1-2), 3–27. | +| **Core finding** | The enhanced CI elicits ~35% more correct information than standard interviews with equal accuracy rates. | +| **Mechanism** | Four retrieval mnemonics: (1) mental reinstatement of context; (2) report everything; (3) temporal reversal; (4) perspective change. Each mnemonic opens a different memory access route, collectively surfacing what direct questions cannot. | +| **Where used** | Session 2 (gap-finding) of Phase 1 and Phase 2 in `scope/SKILL.md`. | + +--- + +### 48. 
Laddering / Means-End Chain — Surfacing Unstated Motivations + +| | | +|---|---| +| **Source** | Reynolds, T. J., & Gutman, J. (1988). "Laddering theory, method, analysis, and interpretation." *Journal of Advertising Research*, 28(1), 11–31. | +| **Date** | 1988 | +| **Status** | Confirmed — operationalised in IS research (Hunter & Beck 2000) | +| **Core finding** | Repeatedly asking "Why is that important to you?" climbs a means-end chain from concrete attribute → functional consequence → psychosocial consequence → terminal value. The stakeholder's first answer is rarely the real constraint. | +| **Mechanism** | The Gherkin "So that [benefit]" clause is structurally a single-rung means-end ladder. Full laddering reveals value conflicts between stakeholders whose surface requirements look identical but whose ladders diverge at the consequence level. | +| **Where used** | Session 2 (gap-finding) of Phase 1 and Phase 2 in `scope/SKILL.md`. | + +--- + +### 49. Funnel Technique — Question Ordering to Prevent Priming + +| | | +|---|---| +| **Source** | Rosala, M., & Moran, K. (2022). The Funnel Technique in Qualitative User Research. *Nielsen Norman Group*. https://www.nngroup.com/articles/the-funnel-technique-in-qualitative-user-research/ | +| **Date** | 2022 | +| **Alternative** | Christel, M. G., & Kang, K. C. (1992). *Issues in Requirements Elicitation*. CMU/SEI-92-TR-012. | +| **Status** | Confirmed — standard NNG qualitative research protocol | +| **Core finding** | Starting with broad open-ended questions before narrowing to specifics prevents the interviewer from priming the interviewee's responses. | +| **Mechanism** | Priming bias is structural: any category name the interviewer introduces activates a schema that filters what the interviewee considers worth reporting. The funnel sequences questions so the interviewee's own categories emerge first. | +| **Where used** | Within each session of Phase 1 and Phase 2 in `scope/SKILL.md`. | + +--- + +### 50. 
Issues in Requirements Elicitation — Why Direct Questions Fail + +| | | +|---|---| +| **Source** | Christel, M. G., & Kang, K. C. (1992). *Issues in Requirements Elicitation*. CMU/SEI-92-TR-012. Software Engineering Institute, Carnegie Mellon University. https://www.sei.cmu.edu/library/abstracts/reports/92tr012.cfm | +| **Date** | 1992 | +| **Alternative** | Sommerville, I., & Sawyer, P. (1997). *Requirements Engineering: A Good Practice Guide*. Wiley. | +| **Status** | Confirmed — foundational SEI technical report | +| **Core finding** | Stakeholders have three structural problems that make direct questioning insufficient: (1) they omit information that is "obvious" to them; (2) they have trouble communicating needs they have never had to articulate; (3) they may not know what they want until they see what they don't want. | +| **Mechanism** | Expert knowledge is largely procedural and tacit. When asked "how does the system work?", experts describe what they believe happens, not what actually happens. Gap-finding techniques are required because they bypass the expert's mental schema. | +| **Where used** | Theoretical justification for the 3-session interview structure and use of CIT, CI, and Laddering in `scope/SKILL.md`. | + +--- + +## Bibliography + +1. Ambler, S. W. (2002). *Agile Modeling*. Wiley. https://www.agilemodeling.com/essays/fdd.htm +2. Brandenburg, L. (2025). *Requirements Discovery Checklist Pack*. TechCanvass. +3. Brandolini, A. (2013–present). *EventStorming*. https://eventstorming.com +4. Christel, M. G., & Kang, K. C. (1992). *Issues in Requirements Elicitation*. CMU/SEI-92-TR-012. https://www.sei.cmu.edu/library/abstracts/reports/92tr012.cfm +5. Clegg, D., & Barker, R. (1994). *Case Method Fast-Track: A RAD Approach*. Addison-Wesley. +6. Cohn, M. (2004). *User Stories Applied*. Addison-Wesley. +7. Cucumber Team. (2024). Better Gherkin. https://cucumber.io/docs/bdd/better-gherkin/ +8. Farrell, S. (2017). UX Research Cheat Sheet. 
*Nielsen Norman Group*. https://www.nngroup.com/articles/ux-research-cheat-sheet/ +9. Fisher, R. P., & Geiselman, R. E. (1992). *Memory-Enhancing Techniques for Investigative Interviewing*. Charles C. Thomas. +10. Flanagan, J. C. (1954). The critical incident technique. *Psychological Bulletin*, 51(4), 327–357. https://doi.org/10.1037/h0061470 +11. Kawakita, J. (1967). *Abduction*. Chuokoronsha. +12. Kipling, R. (1902). *Just So Stories*. Macmillan. +13. Köhnken, G., Milne, R., Memon, A., & Bull, R. (1999). The cognitive interview: A meta-analysis. *Psychology, Crime & Law*, 5(1-2), 3–27. +14. Krause, R., & Pernice, K. (2024). Affinity Diagramming. *Nielsen Norman Group*. https://www.nngroup.com/articles/affinity-diagram/ +15. McNaughton, D. et al. (2008). Learning to Listen. *Topics in Early Childhood Special Education*, 27(4), 223–231. +16. Moody, W., Will, R. P., & Blanton, J. E. (1996). Enhancing knowledge elicitation using the cognitive interview. *Expert Systems with Applications*, 10(1), 127–133. +17. Nielsen, J. (2010). *Interviewing Users*. Nielsen Norman Group. https://www.nngroup.com/articles/interviewing-users/ +18. Palmer, S. R., & Felsing, J. M. (2002). *A Practical Guide to Feature-Driven Development*. Prentice Hall. +19. Reynolds, T. J., & Gutman, J. (1988). Laddering theory, method, analysis, and interpretation. *Journal of Advertising Research*, 28(1), 11–31. +20. Rogers, C. R., & Farson, R. E. (1957). *Active Listening*. Industrial Relations Center, University of Chicago. +21. Rosala, M. (2020). The Critical Incident Technique in UX. *Nielsen Norman Group*. https://www.nngroup.com/articles/critical-incident-technique/ +22. Rosala, M., & Moran, K. (2022). The Funnel Technique. *Nielsen Norman Group*. https://www.nngroup.com/articles/the-funnel-technique-in-qualitative-user-research/ +23. Wake, B. (2003). INVEST in Good Stories, and SMART Tasks. *XP123.com*. +24. Wynne, M. (2015). Introducing Example Mapping. *Cucumber Blog*. 
https://cucumber.io/blog/bdd/example-mapping-introduction/ diff --git a/docs/scientific-research/software-economics.md b/docs/scientific-research/software-economics.md new file mode 100644 index 0000000..becd695 --- /dev/null +++ b/docs/scientific-research/software-economics.md @@ -0,0 +1,24 @@ +# Scientific Research — Software Economics + +Foundations for the shift-left, early defect detection, and workflow ordering decisions in this template. + +--- + +### 16. Cost of Change Curve (Shift Left) + +| | | +|---|---| +| **Source** | Boehm, B. W. (1981). *Software Engineering Economics*. Prentice-Hall. | +| **Date** | 1981 | +| **Alternative** | Boehm, B., & Papaccio, P. N. (1988). Understanding and controlling software costs. *IEEE Transactions on Software Engineering*, 14(10), 1462–1477. | +| **Status** | Confirmed | +| **Core finding** | The cost to fix a defect escalates steeply across SDLC phases: requirements (1x) → design (5x) → coding (10x) → testing (20x) → production (200x). A defect caught during requirements costs as little as 1/200th of what the same defect costs to fix after release. | +| **Mechanism** | Defects compound downstream: a wrong requirement becomes a wrong design, which becomes wrong code, which becomes wrong tests, all of which must be unwound. Catching errors at the source eliminates the entire cascade. This is the empirical foundation for "shift left" — investing earlier in quality always dominates fixing later. | +| **Where used** | Justifies the multi-session PO elicitation model: every acceptance criterion clarified at scope prevents 10–200x rework downstream. Also justifies the adversarial pre-mortem at the end of each elicitation cycle, and the adversarial mandate in `verify/SKILL.md`. The entire 5-step pipeline is ordered to surface defects at the earliest (cheapest) phase. | + +--- + +## Bibliography + +1. Boehm, B. W. (1981). *Software Engineering Economics*. Prentice-Hall. +2. Boehm, B., & Papaccio, P. N. (1988).
Understanding and controlling software costs. *IEEE Transactions on Software Engineering*, 14(10), 1462–1477. diff --git a/docs/scientific-research/testing.md b/docs/scientific-research/testing.md new file mode 100644 index 0000000..2c7f7d7 --- /dev/null +++ b/docs/scientific-research/testing.md @@ -0,0 +1,137 @@ +# Scientific Research — Testing + +Foundations for test design, TDD, BDD, and property-based testing used in this template. + +--- + +### 11. Observable Behavior Testing + +| | | +|---|---| +| **Source** | Vocke, H. (2018). *The Practical Test Pyramid*. martinfowler.com. https://martinfowler.com/articles/practical-test-pyramid.html | +| **Date** | 2018 | +| **Status** | Confirmed | +| **Core finding** | Tests should answer "if I enter X and Y, will the result be Z?" — not "will method A call class B first?" | +| **Mechanism** | A test is behavioral if its assertion describes something a caller/user can observe without knowing the implementation. The test should still pass if you completely rewrite the internals. | +| **Where used** | Contract test rule in `implementation/SKILL.md`: "Write every test as if you cannot see the production code." | + +--- + +### 12. Test-Behavior Alignment + +| | | +|---|---| +| **Source** | Google Testing Blog (2013). *Testing on the Toilet: Test Behavior, Not Implementation*. | +| **Date** | 2013 | +| **Status** | Confirmed | +| **Core finding** | Test setup may need to change if implementation changes, but the actual test shouldn't need to change if the code's user-facing behavior doesn't change. | +| **Mechanism** | Tests that are tightly coupled to implementation break on refactoring and become a drag on design improvement. Behavioral tests survive internal rewrites. | +| **Where used** | Contract test rule in `implementation/SKILL.md`, reviewer verification check in `reviewer.md`. | + +--- + +### 13. Tests as First-Class Citizens + +| | | +|---|---| +| **Source** | Martin, R. C. (2017). *First-Class Tests*.
Clean Coder Blog. | +| **Date** | 2017 | +| **Status** | Confirmed | +| **Core finding** | Tests should be treated as first-class citizens of the system — not coupled to implementation. Bad tests are worse than no tests because they give false confidence. | +| **Mechanism** | Tests written as "contract tests" — describing what the caller observes — remain stable through refactoring. Tests that verify implementation details are fragile and create maintenance burden. | +| **Where used** | Contract test rule in `implementation/SKILL.md`, verification check in `reviewer.md`. | + +--- + +### 14. Property-Based Testing (Invariant Discovery) + +| | | +|---|---| +| **Source** | MacIver, D. R. (2016). *What is Property Based Testing?* Hypothesis. https://hypothesis.works/articles/what-is-property-based-testing/ | +| **Date** | 2016 | +| **Status** | Confirmed | +| **Core finding** | Property-based testing is "the construction of tests such that, when these tests are fuzzed, failures reveal problems that could not have been revealed by direct fuzzing." Property tests assert *invariants* — things that must always be true about the contract. | +| **Mechanism** | Meaningful property tests assert invariants: `assert Score(x).value >= 0` tests the contract. Tautological tests assert reconstruction: `assert Score(x).value == x` tests the implementation. | +| **Where used** | Meaningful vs. Tautological table in `implementation/SKILL.md`. | + +--- + +### 15. Mutation Testing (Test Quality Verification) + +| | | +|---|---| +| **Source** | King, K. N., & Offutt, A. J. (1991). A Fortran Language System for Mutation-Based Software Testing. *Software: Practice and Experience*, 21(7), 685–718. | +| **Date** | 1991 | +| **Alternative** | Mutation testing tools: Cosmic Ray, mutmut (Python) | +| **Status** | Confirmed | +| **Core finding** | A meaningful test fails when a mutation (small deliberate code change) is introduced. A tautological test passes even with mutations because it doesn't constrain the behavior.
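The invariant-versus-reconstruction distinction above can be sketched in plain Python. This is a sketch only: `Score` is a hypothetical value object invented for illustration, and in practice a property-based tool such as Hypothesis would generate the inputs instead of the `range` loop.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Score:
    raw: int

    @property
    def value(self) -> int:
        # Contract: a score's value is never negative.
        return max(self.raw, 0)


def test_score_invariant() -> None:
    # Meaningful: asserts an invariant of the contract, over many inputs.
    for x in range(-100, 100):
        assert Score(x).value >= 0


def test_score_tautology() -> None:
    # Tautological: restates a single input/output pair; a mutation that
    # breaks the clamping for negative inputs would still pass this test.
    assert Score(5).value == 5
```

Dropping the clamp (returning `self.raw` directly) is a mutant that `test_score_invariant` kills but `test_score_tautology` survives.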
| +| **Mechanism** | If a test survives every mutation of the production code without failing, it tests nothing. Only tests that fail on purposeful "damage" to the code are worth keeping. | +| **Where used** | Implicitly encouraged: tests must describe contracts, not implementation, which is the theoretical complement to mutation testing. | + +--- + +### 51. Canon TDD — Authoritative Red-Green-Refactor Definition + +| | | +|---|---| +| **Source** | Beck, K. (2023). "Canon TDD." *tidyfirst.substack.com*. December 11, 2023. https://tidyfirst.substack.com/p/canon-tdd | +| **Date** | 2023 | +| **Alternative** | Fowler, M. (2023). "Test Driven Development." *martinfowler.com*. https://martinfowler.com/bliki/TestDrivenDevelopment.html | +| **Status** | Confirmed — canonical source; explicitly authored to stop strawman critiques | +| **Core finding** | The canonical TDD loop is: (1) write a list of test scenarios; (2) convert exactly one item into a runnable test; (3) make it pass; (4) optionally refactor; (5) repeat. Writing all test code before any implementation is an explicit anti-pattern. | +| **Mechanism** | The interleaving of test-writing and implementation is not cosmetic — each test drives interface decisions at the moment they are cheapest to make. | +| **Where used** | Justifies one-@id-at-a-time interleaved TDD in Step 3 of `implementation/SKILL.md`. | + +--- + +### 52. GOOS — Outer/Inner TDD Loop + +| | | +|---|---| +| **Source** | Freeman, S., & Pryce, N. (2009). *Growing Object-Oriented Software, Guided by Tests*. Addison-Wesley. | +| **Date** | 2009 | +| **Status** | Confirmed — canonical ATDD/BDD integration model | +| **Core finding** | Acceptance tests and unit tests operate at two separate, nested timescales. The outer loop: write one failing acceptance test before any implementation. The inner loop: drive implementation with unit-level Red-Green-Refactor cycles until the acceptance test passes. 
| +| **Mechanism** | The outer loop provides direction (what to build); the inner loop provides momentum (how to build it). The acceptance test stays red throughout all inner cycles and goes green only when the feature is complete. | +| **Where used** | Justifies the two-level structure in Step 3: outer loop per `@id` acceptance test, inner loop per unit. | + +--- + +### 53. Is TDD Dead? — Anti-Bureaucracy Evidence + +| | | +|---|---| +| **Source** | Beck, K., Fowler, M., & Hansson, D. H. (2014). "Is TDD Dead?" Video series. *martinfowler.com*. https://martinfowler.com/articles/is-tdd-dead/ | +| **Date** | 2014 | +| **Status** | Confirmed — primary evidence for what TDD practitioners reject as overhead | +| **Core finding** | Per-cycle human reviewer gates, per-cycle checklists, and tests with zero delta coverage are all explicitly identified as harmful overhead. The green bar is the quality gate — not a checklist. | +| **Mechanism** | Administrative overhead added to TDD workflows increases the cost per cycle without increasing coverage or catching defects. The optimal TDD loop is as lean as it can be while remaining productive. | +| **Where used** | Justifies removing per-test reviewer gates. Self-declaration moves to end-of-feature (once), preserving accountability at feature granularity without interrupting cycle momentum. | + +--- + +### 54. Introducing BDD — Behaviour-Driven Development Origin + +| | | +|---|---| +| **Source** | North, D. (2006). "Introducing BDD." *Better Software Magazine*. https://dannorth.net/introducing-bdd/ | +| **Date** | 2006 | +| **Alternative** | Fowler, M. (2013). "Given When Then." *martinfowler.com*. https://martinfowler.com/bliki/GivenWhenThen.html | +| **Status** | Confirmed — primary BDD source | +| **Core finding** | BDD evolved directly from TDD to address persistent practitioner confusion. BDD reframes TDD vocabulary around observable behavior: scenarios instead of tests, Given-When-Then instead of Arrange-Act-Assert.
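North's mapping can be made concrete with a minimal pytest-style sketch; the `Cart` class and its methods are hypothetical, invented here only to show the Given-When-Then to Arrange-Act-Assert correspondence.

```python
class Cart:
    """Hypothetical domain object, invented for this illustration."""

    def __init__(self) -> None:
        self._items: list[str] = []

    def add(self, item: str) -> None:
        self._items.append(item)

    @property
    def count(self) -> int:
        return len(self._items)


def test_adding_an_item_increases_the_count() -> None:
    cart = Cart()           # Given an empty cart        (Arrange)
    cart.add("book")        # When an item is added      (Act)
    assert cart.count == 1  # Then the count is one      (Assert)
```

The assertion describes only what a caller can observe (`count`), so the scenario survives any internal rewrite of `Cart`.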
| +| **Mechanism** | "Given" captures preconditions (Arrange), "When" captures the triggering event (Act), "Then" captures the observable outcome (Assert). Translating to G/W/T shifts focus from implementation mechanics to user-observable behavior. | +| **Where used** | Theoretical link between Gherkin `@id` Examples (Step 1 output) and the TDD inner loop (Step 3). | + +--- + +## Bibliography + +1. Beck, K. (2023). "Canon TDD." *tidyfirst.substack.com*. https://tidyfirst.substack.com/p/canon-tdd +2. Beck, K., Fowler, M., & Hansson, D. H. (2014). "Is TDD Dead?" *martinfowler.com*. https://martinfowler.com/articles/is-tdd-dead/ +3. Vocke, H. (2018). *The Practical Test Pyramid*. https://martinfowler.com/articles/practical-test-pyramid.html +4. Freeman, S., & Pryce, N. (2009). *Growing Object-Oriented Software, Guided by Tests*. Addison-Wesley. +5. Google Testing Blog. (2013). Testing on the Toilet: Test Behavior, Not Implementation. +6. King, K. N., & Offutt, A. J. (1991). A Fortran Language System for Mutation-Based Software Testing. *Software: Practice and Experience*, 21(7), 685–718. +7. MacIver, D. R. (2016). What is Property Based Testing? *Hypothesis*. https://hypothesis.works/articles/what-is-property-based-testing/ +8. Martin, R. C. (2017). First-Class Tests. *Clean Coder Blog*. +9. North, D. (2006). Introducing BDD. *Better Software Magazine*. https://dannorth.net/introducing-bdd/ diff --git a/docs/workflow.md b/docs/workflow.md index e7d2797..f63e559 100644 --- a/docs/workflow.md +++ b/docs/workflow.md @@ -125,6 +125,7 @@ Each step has a designated agent and a specific deliverable. No step is skipped. │ docs/features/discovery.md (project-level) │ │ ALL backlog .feature files (discovery + entities sections) │ │ in-progress .feature file (full: Rules + Examples + @id) │ +│ ALL existing .py files in <package>/ ← know what exists first │ │ │ │ DOMAIN ANALYSIS │ │ From Entities table + Rules (Business) in .feature file: │ @@ -143,30 +144,28 @@ Each step has a designated agent and a specific deliverable. No step is skipped.
│ Any structure missing a named design pattern? │ │ → If pattern smell detected: load skill design-patterns │ │ │ -│ Write Architecture section in in-progress .feature file │ -│ ### Module Structure │ -│ <package>/domain/<noun>.py │ -│ class <Noun>: ← named class + responsibilities │ -│ field: Type │ -│ def <verb>(<Noun>) -> <Type>: ... ← typed signatures │ -│ class <DepName>(Protocol): ... ← external dep contract │ -│ <package>/domain/service.py ← cross-entity operations │ -│ <package>/adapters/<dep>.py ← Protocol impl │ -│ ### Key Decisions │ -│ ADR-NNN: <title> │ -│ Decision: <what> │ -│ Reason: <why in one sentence> │ +│ WRITE STUBS INTO PACKAGE (signatures only — bodies must be `...`) │ +│ If file exists → add class/method; do not remove existing code │ +│ If file does not exist → create with signatures only │ +│ File placement (common patterns, not required names): │ +│ <package>/domain/<noun>.py ← entities, value objects │ +│ <package>/domain/service.py ← cross-entity operations │ +│ Do not pre-create ports/ or adapters/ without a concrete dep │ +│ Stub rules: │ +│ Bodies: `...` only — no logic, no conditionals │ +│ No docstrings — add after GREEN when signatures are stable │ +│ No inline comments, no TODO, no speculative code │ +│ │ +│ WRITE ADR FILES (significant decisions only) │ +│ docs/architecture/adr-NNN-<title>.md │ +│ Decision: <what> Reason: <why> │ │ Alternatives considered: <what was rejected and why> │ -│ ### Build Changes (new runtime deps — requires PO approval) │ -│ │ -│ NOTE: signatures are informative — tests/implementation may │ -│ refine them; record significant changes as ADR updates │ │ │ │ ARCHITECTURE SMELL CHECK — hard gate (fix before commit) │ -│ [ ] No planned class with >2 responsibilities (SOLID-S) │ -│ [ ] No planned class with >2 instance variables (OC-8) │ -│ [ ] All external deps assigned a Protocol/Adapter (SOLID-D + │ -│ Hexagonal Architecture) │ +│ [ ] No class with >2 responsibilities (SOLID-S) │ +│ [ ] No class with >2 
instance variables (OC-8) │ +│ [ ] All external deps assigned a Protocol (SOLID-D + Hexagonal) │ +│ N/A if no external dependencies identified in scope │ │ [ ] No noun with different meaning across planned modules │ │ (DDD Bounded Context) │ │ [ ] No missing Creational pattern: repeated construction │ @@ -178,7 +177,7 @@ Each step has a designated agent and a specific deliverable. No step is skipped. │ [ ] Each ADR consistent with each @id AC — no contradictions │ │ [ ] Technically infeasible story → escalate to PO │ │ │ -│ commit: feat(<name>): add architecture │ +│ commit: feat(<name>): add architecture stubs │ │ │ └─────────────────────────────────────────────────────────────────────┘ ↓ @@ -187,7 +186,8 @@ Each step has a designated agent and a specific deliverable. No step is skipped. ├─────────────────────────────────────────────────────────────────────┤ │ │ │ PREREQUISITES (stop if any fail — escalate to PO) │ -│ [ ] Architecture section present in in-progress .feature file │ +│ [ ] Architecture stubs present in <package>/ (Step 2 committed) │ +│ [ ] Read all docs/architecture/adr-NNN-*.md files │ │ [ ] All tests written in tests/features/<feature>/ │ │ │ │ Build TODO.md test list │ @@ -203,7 +203,9 @@ Each step has a designated agent and a specific deliverable. No step is skipped. │ │ INNER LOOP │ │ │ │ ┌───────────────────────────────────────────────────────┐ │ │ │ │ │ RED │ │ │ +│ │ │ Read stubs in <package>/ — base test on them │ │ │ │ │ │ Write test body (Given/When/Then → Arrange/Act/Assert) │ │ +│ │ │ Update stub signatures as needed — edit .py directly │ │ │ │ │ │ uv run task test-fast │ │ │ │ │ │ EXIT: this @id FAILS │ │ │ │ │ │ (if it passes: test is wrong — fix it first) │ │ │ @@ -216,10 +218,8 @@ Each step has a designated agent and a specific deliverable. No step is skipped. 
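The stubs that RED reads are written at Step 2 with `...` bodies only. A sketch of what such a stub file might contain, with hypothetical names (`Invoice`, `Clock`); the comments here are for the reader and would be omitted under the stub rules:

```python
from typing import Protocol


class Clock(Protocol):
    # External-dependency contract (SOLID-D): implementations live in adapters/.
    def now(self) -> float: ...


class Invoice:
    # Signatures only: no logic, no docstrings, no speculative code.
    def __init__(self, total_cents: int) -> None: ...

    def is_overdue(self, clock: Clock) -> bool: ...
```

A test written against these stubs fails at RED (the stub has no behavior), which is exactly the exit condition of the RED phase.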
│ │ │ (fix implementation only; do not advance @id) │ │ │ │ │ ├───────────────────────────────────────────────────────┤ │ │ │ │ │ REFACTOR │ │ │ -│ │ │ Apply: DRY → SOLID → OC → patterns │ │ │ -│ │ │ Load design-patterns skill if smell detected │ │ │ -│ │ │ Add type hints and docstrings │ │ │ -│ │ │ uv run task test-fast after each change │ │ │ +│ │ │ Load skill refactor — follow its protocol │ │ │ +│ │ │ uv run task test-fast after each individual change │ │ │ │ │ │ EXIT: test-fast passes; no smells remain │ │ │ │ │ └───────────────────────────────────────────────────────┘ │ │ │ │ │ │ @@ -368,25 +368,19 @@ Feature: <title> Session 1 — Individual Entity Elicitation: | ID | Question | Answer | Status | ← OPEN / ANSWERED + Template §1: PENDING | CONFIRMED Synthesis: <PO synthesis — confirmed by stakeholder> + Pre-mortem: <gaps identified; new questions added above> Session 2 — Behavior Groups / Big Picture: | ID | Question | Answer | Status | - Behavior Groups: <named behavior groups derived from answers> + Template §2: PENDING | CONFIRMED + Behavior Groups: + - <behavior group name>: <one-sentence summary> Session 3 — Feature Synthesis: - Synthesis: <full synthesis across behavior groups> - Approved: YES / NO - - Architecture: ← added at Step 2 by software-engineer - - ### Module Structure - - <package>/domain/entity.py — ... 
- - ### Key Decisions - ADR-001: <title> - Decision: <what> - Reason: <why> + Template §3: PENDING | CONFIRMED — stakeholder approved YYYY-MM-DD + Synthesis: <full synthesis across all behavior groups> Rule: <story title> As a <role> @@ -402,7 +396,40 @@ Feature: <title> Two discovery sources: - `docs/features/discovery.md` — project-level 3-session discovery (once per project; additive for new features) -- Feature file description — per-feature 3-session discovery, entities, clusters, architecture +- Feature file description — per-feature 3-session discovery, entities, business rules, and acceptance criteria + +--- + +## Architecture Artifacts + +Architectural decisions made during Step 2 are recorded as ADR files: + +``` +docs/architecture/ + adr-template.md ← blank template — copy to create a new ADR + adr-001-<title>.md ← one file per significant decision + adr-002-<title>.md + ... +``` + +**ADR format** (copy `adr-template.md` and fill in): + +```markdown +# ADR-NNN: <title> + +**Status:** PROPOSED | ACCEPTED | SUPERSEDED by ADR-NNN + +**Decision:** <what was decided — one sentence> + +**Reason:** <why — one sentence> + +**Alternatives considered:** +- <option>: <why rejected> +``` + +Write an ADR only for non-obvious decisions with real trade-offs — module boundaries, external dependency strategy, Protocol vs. concrete class, data model choices. Routine YAGNI choices do not need an ADR. + +Domain entity and service stubs (signatures, no bodies) live directly in the package under `<package>/domain/`, `<package>/ports/`, and `<package>/adapters/` — written at Step 2, filled in during Step 3. --- diff --git a/feedback.md b/feedback.md deleted file mode 100644 index 87b57c2..0000000 --- a/feedback.md +++ /dev/null @@ -1,312 +0,0 @@ -# Workflow Improvement Feedback - -Collected and clarified: 2026-04-17 - ---- - -## 1. PO Summarization Before Proceeding - -**Problem:** The PO moves on too quickly without demonstrating understanding of what the stakeholder said. 
- -**Research basis:** Active listening (Rogers & Farson, 1957) — the listener paraphrases what they heard in their own words, asks clarifying questions, then offers a concise summary of main points and intent before proceeding. This reduces misunderstanding, builds trust, and confirms the PO captured the right requirements. - -**Proposed fix:** After each interview round, the PO must produce a "Here is what I understood" block before moving to stories or criteria: -1. Paraphrase the stakeholder's intent in the PO's own words -2. Identify any remaining ambiguities and ask targeted follow-up questions -3. Summarize the main points and confirm with the stakeholder before freezing discovery - -This applies at Phase 1 (project discovery), Phase 2 (feature discovery), and before baseline. - ---- - -## 2. Developer Avoids OO and Design Patterns — Code Smell Detection - -**Problem:** The developer uses plain functions to avoid classes, SOLID, and Object Calisthenics. It does not smell the code to recognize when a simple function should be refactored into a class or design pattern. - -**Root cause:** The rules list principles but do not teach the developer to recognize when complexity warrants a structural upgrade. The developer lacks a smell-triggered refactoring instinct. - -**Proposed fix:** Add a code smell detection step to the REFACTOR phase. When a solution grows complex (e.g. a function accumulates state, multiple functions share data, behavior varies by type), the developer must ask: "Does this smell indicate I should refactor to a class or design pattern?" The answer drives the refactor, not just the line count or nesting rules. - -See also: Item 5 (self-check examples) and Item 6 (ObjCal rule clarity). - ---- - -## 3. Design Principle Priority Order Misleads - -**Problem:** `YAGNI > KISS > DRY > SOLID > ObjCal > design patterns` implies that design patterns are a last resort and rarely needed. This is incorrect. Good design patterns are better than complex code. 
- -**Python Zen:** The Zen of Python is missing from the priority order. Specifically: "Complex is better than complicated." This matters because: -- Good design patterns > complex code (patterns reduce complexity) -- Complex code > complicated code (complicated is hard to reason about) -- Complicated code > failing code (at least it runs) -- Failing code > no code (at least it exists) - -**Proposed fix:** Replace the flat priority order with a quality hierarchy that reflects this: - -``` -1. No code (nothing implemented) ← worst -2. Failing code (broken) -3. Complicated code (hard to reason about) -4. Complex code (many parts, but clear) -5. Code following YAGNI/KISS/DRY/SOLID/OC -6. Code using appropriate design patterns ← best -``` - -Add the Python Zen reference: "Simple is better than complex. Complex is better than complicated." The goal is to reach level 6, not to stop at level 5 because "YAGNI says don't add patterns." - ---- - -## 4. Architecture Approval by PO Is Hollow - -**Problem:** The PO approves architecture at Step 2 but has no knowledge of ObjCal, SOLID, or whether entities are properly modeled. The PO always approves, making the gate meaningless. - -**Additional problem:** The developer starts architecture for the in-progress feature without reading the full backlog. This leads to solutions that work for the current feature but require extensive rework when the next feature arrives, because the architecture did not account for the big picture. - -**Proposed fix:** -- Remove PO architecture approval -- Replace with a developer self-declaration mental check covering: - 1. Read all backlog and in-progress feature files (discovery + entities sections at minimum) - 2. Identify entities, interactions, and constraints across all planned features - 3. Declare: "I have considered the full feature set. This architecture is the best design for the known requirements." 
-- The developer must justify the architecture against the full feature set, not just the current feature - ---- - -## 5. Self-Check Table Lost Generalized Examples - -**Problem:** The self-check table contains examples like `_x`, but the AI treats `_x` as a literal match rather than understanding it represents any short, meaningless variable name (e.g. `_val`, `_tmp`, `_data`). The rule lacks generalization guidance. - -**Proposed fix:** For each ObjCal rule (and SOLID rule), add: -- A plain-English explanation of what the rule means -- A "bad" code example showing a violation -- A "good" code example showing compliance -- A generalization note: e.g. "This applies to any single-letter or abbreviation variable name, not just `_x`" - -This makes the rules learnable and independently verifiable by both the developer and the reviewer. - ---- - -## 6. Self-Declaration Accountability Format - -**Problem:** The current self-declaration checklist is passive. The developer ticks boxes without being accountable for each claim. - -**Proposed format:** - -``` -As a [agent-alias] I declare [item] follows [rule] — YES | NO -``` - -**If NO:** -- The developer generates a self-correction plan for the failed item -- The developer restarts the Red-Green-Refactor cycle from the affected tests -- Affected tests are marked as rework in TODO.md (format: open to proposal — consider `[R]` prefix or a `## Rework` section, respecting the 15-line limit) -- The cycle does not proceed to the reviewer until all declarations are YES - -**If all YES:** proceed to reviewer as normal. - ---- - -## 7. Reviewer Must Independently Verify — No Blind Acceptance - -**Problem:** The reviewer accepts self-declared YES claims without independently verifying them. Worse, when the reviewer does not understand a rule (e.g. "types are first class" in ObjCal), it skips the check or accepts the developer's claim unchallenged. - -**Two-part fix:** - -1. 
**ObjCal (and SOLID) rules must have plain-English explanations + code examples** (see Item 5). The reviewer should never accept a claim it cannot independently verify. - -2. **Reviewer scope:** The reviewer only audits YES declarations. Self-declared NO items are already known failures — the reviewer does not need to re-report them. The reviewer's job is to try to break what the developer claims is correct. - ---- - -## 8. PO Not Refining Enough Before Proceeding - -**Problem:** The PO moves through discovery phases without pushing back, asking follow-up questions, or confirming understanding. Stories and criteria are written on incomplete understanding. - -**Proposed fix:** Enforce the active listening summarization protocol (Item 1) at every phase transition. The PO must not move to Phase 3 (Stories) until the stakeholder has confirmed the PO's paraphrase is accurate. The PO must not move to Phase 4 (Criteria) until each Rule has been validated against the stakeholder's intent. - ---- - -## 9. Workflow Verbosity — Test Output and Fail-Fast - -**Problem:** The workflow has unnecessary checks, fast test output is too verbose, and there is no fail-fast limit. - -**Proposed fixes:** -- Fast test path (`test-fast`) should suppress passing test output — show only failures. Follow pytest best practice: use `-q` (quiet) flag or equivalent for the fast path. -- Add a fail-fast threshold configurable in `pyproject.toml` (e.g. `--maxfail=N`). Suggested default: 5. -- Remove redundant checks that are already covered by tooling (see Item 13). - ---- - -## 10. Offload Templated Checks to Scripts - -**Problem:** Repetitive checks (e.g. verifying `@id` uniqueness, orphan detection) are performed manually by agents, consuming context and introducing error. - -**Proposed fix:** Identify all templated checks currently done by agents and create scripts for them. Agents invoke the script and act on the result. 
Candidates include: -- `@id` uniqueness check (already partially done by `gen-tests`) -- Orphan test detection (`gen-tests --orphans`) -- Self-declaration formatting validation -- Session state consistency check - ---- - -## 11. docs/workflow.md Is Out of Date - -**Problem:** `docs/workflow.md` does not reflect the current workflow. Specifically: -- It references a separate `discovery.md` file; discovery is now merged into `.feature` files -- The feature folder structure and conventions have changed -- The self-declaration section references a 21-item checklist that may no longer match - -**Proposed fix:** Update `docs/workflow.md` to reflect the current state of the workflow, including: -- Discovery merged into `.feature` file description section -- Current folder structure (`backlog/`, `in-progress/`, `completed/`) -- Current self-declaration format (post Item 6 changes) -- Removal of references to `discovery.md` as a separate file - ---- - -## 12. Squash on Merge for Feature Branches - -**Problem:** Feature branches produce many small commits (one per test). Merging into `main` with a standard merge commit preserves all of them, making history noisy. - -**Proposed fix:** Feature branches into `main` should be squashed. Document this in the `pr-management` skill as a required step. One squash commit per feature, with a summary message covering all tests implemented. - ---- - -## 13. Code Smell in Self-Declaration - -**Problem:** The self-declaration checklist does not include code smell detection. A developer can declare all SOLID/ObjCal rules as YES while the code is full of smells. 
- -**Proposed fix:** Add a code smell section to the self-declaration, covering both categories: - -**Standard smells (language-agnostic):** -- Long method -- Feature envy (method uses another class's data more than its own) -- Data clumps (same group of variables appear together repeatedly) -- Primitive obsession (using primitives instead of small objects) -- Shotgun surgery (one change requires many small changes across many classes) -- Divergent change (one class changed for many different reasons) -- Middle man (class delegates most of its work) - -**Python-specific smells:** -- Bare `except:` clauses -- Mutable default arguments -- LBYL (Look Before You Leap) where EAFP (Easier to Ask Forgiveness than Permission) is idiomatic -- Using `type()` instead of `isinstance()` -- Overuse of `*args` / `**kwargs` hiding interface contracts - ---- - -## 14. Session Continuity — Pick Up Where Left Off - -**Problem:** When a session ends and a new one begins, the agent cannot reliably determine the current step, cycle phase, and next action. TODO.md provides some context but lacks precision for mid-cycle resumption. - -**Proposed fix:** Open to proposal. Consider extending TODO.md with a `## Cycle State` section that captures: -- Current step (1-6) -- Current `@id` under work -- Current phase (RED / GREEN / REFACTOR / SELF-DECLARE / REVIEWER / COMMITTED) -- Last completed action -- Exact next action - -The session-workflow skill should enforce reading and updating this section at session start and end. The goal: any agent, in any session, can read TODO.md and know exactly what to do next without re-reading the entire feature file. - ---- - -## 15. ID Checks Are Redundant for Agents - -**Problem:** Agents manually verify `@id` uniqueness and coverage. This is already done by `gen-tests`. Duplicating this check wastes context and distracts agents from implementation cycles. - -**Proposed fix:** Remove manual `@id` verification from all agent checklists. 
Rely on `gen-tests` for ID validation. Agents should only run `gen-tests` and act on its output. - ---- - -## 16. Session Memory and State Tracking - -**Problem:** Agents lose session state between sessions. TODO.md is a 15-line bookmark but may not capture enough metadata to track complex multi-session features reliably. - -**Proposed fix:** Open to proposal. Evaluate whether TODO.md extensions (Item 14) are sufficient, or whether a separate lightweight state file (e.g. `CYCLE.md` or `.opencode/state.json`) is needed. The artifact should be: -- Machine-readable by agents -- Human-readable for debugging -- Minimal — only what is needed to resume a session - ---- - -## 17. AGENTS.md Is Generally Outdated - -**Problem:** `AGENTS.md` does not fully reflect the current workflow, conventions, and tooling. - -**Proposed fix:** After all other items are resolved, perform a full pass on `AGENTS.md` to align it with: -- Current feature file structure (discovery merged into `.feature`) -- Updated self-declaration format -- Updated principle priority order -- Removal of hollow PO architecture approval -- Any new scripts or tools added - ---- - -## 18. Developer Does Not Read Full Backlog Before Architecture - -**Problem:** The developer starts implementing the in-progress feature without reading the full backlog. This produces a working solution that requires extensive rework when the next feature arrives, because the architecture did not account for future requirements. - -**Concrete example:** A feature was implemented correctly in isolation, but the next feature required significant structural changes because the original architecture assumed a design that did not compose well. - -**Proposed fix:** At Step 2 (Architecture), the developer must: -1. Read the discovery and entities sections of every feature in `backlog/` and `in-progress/` -2. Identify cross-feature entities, shared interfaces, and likely extension points -3. 
Design the architecture to accommodate the known future, not just the current feature -4. Self-declare: "I have read all backlog features and this architecture accounts for the full known feature set" - -This is distinct from Item 4 (hollow PO approval) — the fix here is about the developer's reading obligation before making architectural decisions. - --- - -## 19. Workflow Diagram — Redundancies and Late Error Detection - -### Redundancies - -**19a. Step 3 reviewer gate is a subset of Step 4's per-test reviewer gate** - -Step 3 stops for reviewer approval of test design and semantic alignment before any implementation starts. Step 4 then repeats the same semantic alignment check in every per-test cycle. The Step 3 check reviews all tests at once before any code exists — but semantic alignment is best verified when both the test and the implementation can be seen side by side. The Step 3 review is premature and likely redone anyway during Step 4. - -**19b. Step 5 code review overlaps heavily with Step 4 self-declaration + per-test reviewer** - -Step 5 checks Correctness, KISS, SOLID, ObjCal, Design Patterns, Tests, Code Quality (4a–4g). All of these except tooling (lint/coverage) were already covered by the 21-item self-declaration and the per-test reviewer in Step 4. Step 5 implies a full re-audit of already-reviewed work, rather than a targeted spot-check of what is novel or risky. - -**19c. `gen-tests --check` listed as a separate pre-step that nothing uses** - -The `--check` dry-run appears in the tools table as "Before gen-tests" but is never referenced in the actual workflow steps. Either make it a mandatory gate or remove it. - -**19d. Step 2 architecture commit and Step 3 gen-tests commit are always consecutive** - -These two commits are always paired and never independently useful. Step 2 commits architecture, Step 3 immediately runs `gen-tests` and commits stubs. Combining them into one step would reduce overhead without losing traceability.
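If the decision on 19c lands on "mandatory gate", the dry-run could be wired into a Git pre-commit hook so it cannot be skipped. A minimal sketch, assuming `gen-tests --check` exits non-zero when `@id` uniqueness or coverage is broken — an assumption, since its exit-code contract is not specified here:

```shell
#!/bin/sh
# Hypothetical .git/hooks/pre-commit gate for 19c.
# Assumes `gen-tests --check` exits non-zero on @id problems.
if ! gen-tests --check; then
    echo "gen-tests --check failed: fix @id uniqueness/coverage before committing" >&2
    exit 1
fi
```

If the dry-run cannot be made a blocking gate along these lines, removing it from the tools table is the cleaner option.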
- -### Late Error Detection - -**19e. Architecture locked before test bodies reveal structural problems** - -Test bodies are written in Step 3, after the architecture is committed in Step 2. If a test body reveals an architectural flaw (wrong abstraction, missing entity), the developer must return to Step 2 — but the diagram has no explicit back-arrow from Step 3 to Step 2. The diagram implies Step 3 only moves forward. - -**19f. Decomposition check happens at the end of Phase 2, after all discovery is done** - -If a feature is too large (>2 concerns, >8 examples), the split happens after discovery questions are already answered. The check should happen earlier — at Phase 1 when the feature list is identified, or at the start of Phase 2 before generating questions. - -**19g. `lint + static-check` run only at handoff (end of Step 4)** - -A type error or lint violation introduced in cycle 3 is not caught until all cycles are complete. Running these tools only at handoff means multiple commits may need to be unwound. - -**19h. Production-grade input→output check first appears in Step 5** - -Step 5 verifies that "output changes with input". This basic correctness property is not checked by the developer; the reviewer is the first to catch a violation. The developer's pre-mortem at the end of Step 4 exists but is vague — it does not mandate the input→output check explicitly. - -### Proposed Improvements - -| # | Issue | Proposed change | -|---|---|---| -| A | Step 3 reviewer gate redundant with Step 4 | Merge Step 3 into Step 2: after the architecture commit, run `gen-tests` to create stubs. Test body writing becomes the first action of Step 4 (write test → RED → GREEN → REFACTOR → SELF-DECLARE → REVIEWER → COMMIT). Removes one full reviewer interaction.
| -| B | Step 5 is a full re-audit of already-reviewed work | Reframe Step 5 as a spot-check + tooling run: skip re-checking items covered by per-test reviewers; focus on (a) tooling — lint, static-check, coverage, orphans, (b) integration/system behavior, (c) semantic alignment of the feature as a whole. | -| C | Decomposition check too late | Move to Phase 1 (when feature stubs are created) and add a lightweight re-check at the start of Phase 2 (before generating questions). | -| D | `lint + static-check` run only at handoff | Run `lint + static-check` (not coverage) after each Step 4 commit as a fast sanity check. Keep the full `test` run with coverage at handoff only. | -| E | Step 2 + Step 3 always consecutive | Merge into one step: architecture + `gen-tests` stubs in one commit. Test bodies are the opening move of Step 4. | -| F | No back-arrow from Step 3 to Step 2 | Add an explicit "if test body reveals arch flaw → back to Step 2" path in the diagram. | -| G | Input→output check first found by reviewer | Make it explicit in the developer's Step 4 self-verification (before handoff): run with two different inputs, confirm the output differs. | - -**Highest-value change: A + E combined.** Collapsing Steps 2+3 removes a full reviewer interaction, and writing test bodies as the opening move of Step 4 surfaces architectural flaws immediately: the moment the developer cannot make a test fail for the right reason.
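The item G check is small enough to codify as a test rather than a prose instruction. A sketch in Python, where `process` is a hypothetical stand-in for the feature's real entry point:

```python
def process(data: str) -> str:
    # Hypothetical stand-in for the feature's real entry point.
    return data.upper()


def test_output_varies_with_input() -> None:
    # A hard-coded or constant implementation fails here:
    # two different inputs must yield two different outputs.
    assert process("alpha") != process("beta")


test_output_varies_with_input()
```

Checked into the suite, this makes the developer, not the reviewer, the first to see a constant-output implementation fail.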
diff --git a/pyproject.toml b/pyproject.toml index 24ebda6..0ac8962 100644 --- a/pyproject.toml +++ b/pyproject.toml @@ -1,6 +1,6 @@ [project] name = "python-project-template" -version = "5.0.20260418" +version = "5.1.20260418" description = "Python template with some awesome tools to quickstart any Python project" readme = "README.md" requires-python = ">=3.13" diff --git a/uv.lock b/uv.lock index 81e96c9..69aaa7e 100644 --- a/uv.lock +++ b/uv.lock @@ -735,7 +735,7 @@ wheels = [ [[package]] name = "python-project-template" -version = "5.0.20260418" +version = "5.1.20260418" source = { virtual = "." } dependencies = [ { name = "fire" },