From 33bea2f97cedaeaa71058d0bdf3ec505b624e892 Mon Sep 17 00:00:00 2001 From: Alan Jowett Date: Fri, 20 Mar 2026 07:59:58 -0700 Subject: [PATCH 1/5] Add Sonde specification audit case study Real-world case study auditing 5 components (260 requirements) of the Sonde IoT runtime using PromptKit's trifecta audit. Key results: - 60 findings across 5 components using one reusable prompt - Systemic BLE design gap found across modem, node, and gateway - Cross-reference with prior manual audit: 49% of findings were net-new, almost all design traceability gaps (D1/D6) that the manual audit's test-focused lens missed - Manual and automated audits are complementary, not competing Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> --- .../case-studies/sonde-specification-audit.md | 324 ++++++++++++++++++ 1 file changed, 324 insertions(+) create mode 100644 docs/case-studies/sonde-specification-audit.md diff --git a/docs/case-studies/sonde-specification-audit.md b/docs/case-studies/sonde-specification-audit.md new file mode 100644 index 0000000..f2a559b --- /dev/null +++ b/docs/case-studies/sonde-specification-audit.md @@ -0,0 +1,324 @@ +# Case Study: Auditing a Real Project with PromptKit + +## The Project + +[Sonde](https://github.com/alan-jowett/sonde) is a programmable runtime +for distributed sensor nodes. It runs verified BPF bytecode (RFC 9669) +on ESP32 microcontrollers, with a gateway distributing programs over +ESP-NOW radio and a BLE pairing tool for device onboarding. The system +comprises five components: a wire protocol codec, node firmware, a +gateway server, a USB modem bridge, and a BLE pairing tool. + +The project follows a specification-first methodology. Every component +has formal requirements (250+ REQ-IDs across the system), a design +document, and a validation plan with test matrices. 
This makes it an
ideal candidate for PromptKit's traceability audit — and a realistic
stress test, because the specs were written by humans over months of
active development.

## The Problem

Sonde's specifications were authored incrementally. Core requirements
and designs were written first (protocol, node, gateway), then BLE
pairing support was added as a cross-cutting feature touching all
components. The team suspected that the design documents hadn't kept
pace with the expanded requirements — but manually cross-checking
250+ requirements across 15 specification documents was impractical.

Specific concerns:
- Did every requirement have a corresponding design section?
- Did every requirement have a test case?
- Were there design decisions that didn't trace back to any requirement?
- Had assumptions drifted between documents as BLE was added?

## The PromptKit Approach

### Prompt Assembly

A single PromptKit prompt was assembled once and reused across all five
components — only the input documents changed:

```bash
npx @alan-jowett/promptkit assemble audit-traceability \
  -p project_name="Sonde" \
  -p requirements_doc="$(cat docs/<component>-requirements.md)" \
  -p design_doc="$(cat docs/<component>-design.md)" \
  -p validation_plan="$(cat docs/<component>-validation.md)" \
  -p focus_areas="all" \
  -p audience="engineering team" \
  -o <component>-trifecta-audit.md
```

The assembled prompt composes four layers:

- **Persona**: `specification-analyst` — adversarial toward completeness
  claims, systematic rather than impressionistic
- **Protocols**: `anti-hallucination` + `self-verification` +
  `traceability-audit` (6-phase cross-document methodology)
- **Taxonomy**: `specification-drift` (D1–D7 classification)
- **Format**: `investigation-report` (F-NNN findings with severity)

### Execution

Each component audit was run in a single LLM session. The assembled
prompt (~27K chars) plus three specification documents were provided
as input.
The LLM executed the traceability-audit protocol's six +phases: artifact inventory, forward traceability, backward +traceability, cross-document consistency, classification, and coverage +summary. + +## Results + +### Summary Across All Five Components + +| Metric | Modem | Node | Gateway | BLE Tool | Protocol | +|--------|-------|------|---------|----------|----------| +| **Requirements** | 31 | 57 | 76 | 55 | 41 | +| **Reqs → Design** | 51.6% | 68.4% | 70% | 100% | ~85% | +| **Reqs → Tests** | 93.5% | 93.0% | 93% | 89.7% | 67% | +| **Total findings** | 13 | 11 | 7 | 15 | 14 | +| **Critical/High** | 3 | 0 | 2 | 2 | 0 | + +**60 findings** across 260 requirements — and a clear systemic pattern +that no manual review would have caught. + +### The Systemic Finding: BLE Design Gap + +The most striking result was a pattern that emerged across three +independent audits. The modem, node, and gateway audits all +independently flagged the same root cause: **BLE pairing requirements +were added to all three components after the design documents were +finalized, and the design documents were never updated.** + +| Component | Untraced BLE reqs | Design coverage gap | +|-----------|-------------------|---------------------| +| Modem | 14 of 31 requirements (45%) | No BLE module in design at all | +| Node | 18 of 57 requirements (32%) | BLE section added but incomplete | +| Gateway | 23 of 76 requirements (30%) | No BLE design section | + +The modem audit surfaced this most dramatically: + +> **F-001 (Critical, D1)**: 14 BLE pairing relay requirements +> (MD-0400–MD-0414) have zero representation in the design document. +> No BLE module, no BLE architecture, no BLE data flow described. +> Implementers have no design guidance for 45% of active requirements. 
+ +> **F-002 (High, D6)**: Design §1 overview claims "bidirectional bridge +> between USB-CDC and ESP-NOW" and "no crypto, no CBOR parsing, no +> sessions," but BLE requirements mandate LESC pairing (crypto), +> connection lifecycle (session-like), and tri-directional bridging. +> Direct textual contradiction. + +A manual reviewer might have noticed that the modem design felt +incomplete. But they would not have systematically identified that the +same gap existed across three components, that it affected exactly 55 +requirements total, and that the design documents actively contradicted +the requirements (D6) rather than merely omitting them (D1). + +### Finding Distribution by Drift Type + +| Drift Type | Total | Description | +|------------|-------|-------------| +| D1 (Untraced requirement) | 12 | Requirements not referenced in design | +| D2 (Untested requirement) | 10 | Requirements with no test case | +| D3 (Orphaned design) | 7 | Design decisions with no requirement | +| D5 (Assumption drift) | 7 | Cross-document assumption conflicts | +| D6 (Constraint violation) | 3 | Design contradicts a requirement | +| D7 (Acceptance mismatch) | 9 | Test doesn't verify its linked criteria | +| D4 (Orphaned test) | 0 | No orphaned tests found | + +The zero count for D4 (orphaned test cases) indicates strong backward +traceability — every test traces to a real requirement. The issues are +concentrated in forward traceability (requirements that aren't fully +realized in design and validation). + +### Severity Profile + +No findings were informational-only. 
The distribution skews toward +actionable items: + +| Severity | Count | Percentage | +|----------|-------|------------| +| Critical | 1 | 2% | +| High | 6 | 10% | +| Medium | 19 | 32% | +| Low | 28 | 47% | +| Informational | 6 | 10% | + +### Notable Findings Beyond the BLE Gap + +**Internal design contradiction (Node, D6)**: The node design document's +§4.1 and §14 describe "sleep indefinitely" for unpaired nodes, directly +contradicting requirement ND-0900 which mandates entering BLE pairing +mode. §15 (added later) does say "enter BLE pairing mode" — creating an +internal contradiction within the same document. + +**Traceability mislabel (BLE Tool, D7)**: Test T-PT-309 ("Ed25519 → +X25519 low-order point rejection") is traced to PT-0405 but actually +validates PT-0902, a Must-priority security requirement. The mislabel +makes PT-0902 appear untested in the traceability matrix — a false +negative that could delay security sign-off. + +**Redundant field consistency risk (Protocol, D5)**: The +`GatewayMessage::Command` struct carries both a `command_type: u8` field +and a typed `CommandPayload` enum. The field is fully determined by the +enum variant, making it redundant and writable. No test enforces +consistency. An encode path writing `command_type` from the field rather +than deriving it from the variant could silently produce malformed +frames. + +## What Made This Work + +### 1. Reusable prompt, variable inputs + +One assembled prompt was used across all five components. The +`specification-analyst` persona and `traceability-audit` protocol don't +change — only the three input documents change. This made auditing five +components no more complex than auditing one. + +### 2. The taxonomy forced precision + +Without the D1–D7 taxonomy, the LLM would report vague observations +("the design seems incomplete for BLE"). 
With it, every finding gets a +precise classification that distinguishes between a missing design +section (D1), a design that contradicts a requirement (D6), and a test +that checks the wrong thing (D7). These distinctions matter for +remediation — D1 means "add a design section," D6 means "fix the +design," D7 means "fix the test." + +### 3. The protocol prevented shortcuts + +The traceability-audit protocol requires building a complete inventory +of every identifier before comparing documents. This is tedious but +essential — it's how the audit caught the 14/31 untraced BLE +requirements in the modem spec rather than just noting "some BLE +coverage seems light." + +### 4. Anti-hallucination prevented false positives + +The anti-hallucination guardrail forced the LLM to distinguish between +requirements that are genuinely missing from the design (D1) and +requirements that are addressed under different section headings +(not a finding). The gateway audit correctly identified that admin API +requirements were functionally addressed in design §13 but simply +lacked explicit REQ-ID cross-references — a bookkeeping issue (Low), +not a missing design section (High). + +## Estimated Remediation + +The audits produced actionable remediation guidance. Total estimated +effort across all five components: **~10–15 hours**, broken down as: + +- Add BLE design sections to modem, node, and gateway design docs +- Add explicit REQ-ID cross-references where design coverage is implicit +- Define 5 deferred test cases with concrete T-NNNN IDs +- Fix 1 traceability mislabel (T-PT-309 → PT-0902) +- Add missing message variant tests in protocol crate + +No breaking changes required. All findings are additive — add design +sections, add tests, add cross-references. + +## PromptKit vs. Manual Audit + +The Sonde project had already been through a manual audit pass before +this exercise. The maintainer had filed ~20 GitHub issues based on +hand-rolled prompts and direct review. 
This created a natural +experiment: what does a structured PromptKit audit find that a manual +audit misses — and vice versa? + +### Cross-Reference Results + +The 59 PromptKit findings were cross-referenced against existing +GitHub issues: + +| Component | Findings | Direct Match | Partial Match | No Issue | +|-----------|----------|--------------|---------------|----------| +| Protocol | 14 | 8 | 2 | 4 | +| BLE Tool | 15 | 6 | 3 | 6 | +| Modem | 12 | 2 | 3 | 7 | +| Gateway | 7 | 1 | 2 | 4 | +| Node | 11 | 0 | 3 | 8 | +| **Total** | **59** | **17 (29%)** | **13 (22%)** | **29 (49%)** | + +**29% of findings were already known** — the manual audit had caught +them. **22% partially overlapped** — the issue existed but the finding +was more specific or broader. **49% were net-new** — issues the manual +audit did not surface at all. + +### Different Audits Find Different Things + +The most revealing pattern was not how many findings overlapped, but +which *types* each approach caught: + +**Manual audit strength: validation gaps.** The maintainer's manual +review excelled at finding missing tests, incomplete test cases, and +validation plan gaps — particularly in the protocol crate and BLE tool. +These are the issues an engineer naturally notices when reading test +plans against requirements. + +**PromptKit audit strength: design traceability.** The automated +trifecta audit found a fundamentally different class of issues — +cross-document traceability gaps between requirements and design. 
The +10 highest-severity net-new findings were almost entirely D1/D3/D6 +(design drift), not D2/D7 (test drift): + +- Node F-001: 18 requirements untraced in design doc +- Node F-006/F-008: Design boot sequence contradicts BLE pairing + requirements +- Modem F-002/F-003: Design overview contradicts and omits BLE module +- Gateway F-006: Module table missing Admin API and BLE modules +- Protocol F-010: Redundant `command_type` field inconsistency risk + +**The two approaches are complementary, not competing.** A human +reviewer reads a test plan and thinks "is this test right?" — catching +D2 and D7 issues. The structured protocol reads all three documents +and mechanically checks "does every REQ-ID appear in the design?" — +catching D1 and D6 issues that require comparing documents a human +doesn't hold in working memory simultaneously. + +### Why the Manual Audit Missed Design Drift + +The manual audit focused on a natural question: "are the tests +complete?" This is the question engineers instinctively ask, because +test gaps have immediate consequences (bugs in production). Design +traceability gaps have deferred consequences — the code might still be +correct even if the design document is stale. But when BLE pairing +design was absent from three design documents simultaneously, the risk +was not just stale documentation — it was that future implementers +would have no design guidance for 30-45% of requirements. + +The PromptKit audit caught this because the traceability-audit protocol +requires building a **complete identifier inventory** before comparing +documents. A human skims; the protocol enumerates. Enumeration is +tedious but exhaustive — and that's exactly what makes it effective for +finding what's missing rather than what's wrong. + +## Takeaways + +- **Specification drift is real and systemic.** BLE pairing was added + to requirements across three components, and all three design + documents lagged. 
Manual review would catch this in one component; + the audit caught it in all three and quantified the gap precisely. + +- **Structured audits find different issues than manual review.** + 49% of findings were net-new — almost all design traceability gaps + that the manual audit's test-focused lens naturally missed. The two + approaches are complementary. + +- **One prompt, five audits.** The assembled prompt is reusable — + the methodology doesn't change, only the inputs. This scales to + any project with structured specification documents. + +- **Taxonomy classification drives remediation.** D1 (add a section) + requires different effort than D6 (fix a contradiction) or D7 (fix + a test). The taxonomy makes prioritization mechanical rather than + subjective. + +- **Coverage metrics tell the story.** "93% test coverage but 52% + design coverage" immediately identifies where to focus. The metrics + are calculated from actual identifier counts, not impressions. + +- **Zero orphaned tests.** Strong backward traceability (D4 = 0 across + all components) shows the project's testing discipline. The drift is + forward-only — requirements outpacing design, not tests diverging + from requirements. From 57f741e147573a5ddf9fe7ace70b9aff03b6da67 Mon Sep 17 00:00:00 2001 From: Alan Jowett Date: Fri, 20 Mar 2026 08:02:24 -0700 Subject: [PATCH 2/5] Fix framing: both audits were LLM-driven, difference is prompt quality MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit The prior audit used ad-hoc LLM prompts, not manual human review. The comparison is structured PromptKit prompt vs. ad-hoc prompt — same tool (LLM), different prompt engineering. 
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> --- .../case-studies/sonde-specification-audit.md | 75 ++++++++++--------- 1 file changed, 40 insertions(+), 35 deletions(-) diff --git a/docs/case-studies/sonde-specification-audit.md b/docs/case-studies/sonde-specification-audit.md index f2a559b..b5718a3 100644 --- a/docs/case-studies/sonde-specification-audit.md +++ b/docs/case-studies/sonde-specification-audit.md @@ -218,18 +218,18 @@ effort across all five components: **~10–15 hours**, broken down as: No breaking changes required. All findings are additive — add design sections, add tests, add cross-references. -## PromptKit vs. Manual Audit +## PromptKit vs. Ad-Hoc Prompts -The Sonde project had already been through a manual audit pass before -this exercise. The maintainer had filed ~20 GitHub issues based on -hand-rolled prompts and direct review. This created a natural -experiment: what does a structured PromptKit audit find that a manual -audit misses — and vice versa? +The Sonde project had already been through an audit pass before this +exercise. The maintainer had used ad-hoc LLM prompts to review the +specifications, filing ~20 GitHub issues from those results. This +created a natural experiment: what does a structured PromptKit audit +find that an ad-hoc LLM audit misses — and vice versa? ### Cross-Reference Results The 59 PromptKit findings were cross-referenced against existing -GitHub issues: +GitHub issues filed from the ad-hoc audit: | Component | Findings | Direct Match | Partial Match | No Issue | |-----------|----------|--------------|---------------|----------| @@ -240,23 +240,26 @@ GitHub issues: | Node | 11 | 0 | 3 | 8 | | **Total** | **59** | **17 (29%)** | **13 (22%)** | **29 (49%)** | -**29% of findings were already known** — the manual audit had caught +**29% of findings were already known** — the ad-hoc audit had caught them. 
**22% partially overlapped** — the issue existed but the finding -was more specific or broader. **49% were net-new** — issues the manual +was more specific or broader. **49% were net-new** — issues the ad-hoc audit did not surface at all. -### Different Audits Find Different Things +### Different Prompts Find Different Things -The most revealing pattern was not how many findings overlapped, but -which *types* each approach caught: +Both audits used LLMs. The difference was the prompt: one was ad-hoc +("review this spec for gaps"), the other was a composed PromptKit prompt +with a defined persona, a 6-phase protocol, and a classification +taxonomy. The most revealing pattern was not how many findings +overlapped, but which *types* each approach caught: -**Manual audit strength: validation gaps.** The maintainer's manual -review excelled at finding missing tests, incomplete test cases, and -validation plan gaps — particularly in the protocol crate and BLE tool. -These are the issues an engineer naturally notices when reading test -plans against requirements. +**Ad-hoc prompt strength: validation gaps.** The ad-hoc audit excelled +at finding missing tests, incomplete test cases, and validation plan +gaps — particularly in the protocol crate and BLE tool. These are the +issues an LLM naturally surfaces when asked to "review" a spec, because +test gaps are concrete and obvious. -**PromptKit audit strength: design traceability.** The automated +**PromptKit audit strength: design traceability.** The structured trifecta audit found a fundamentally different class of issues — cross-document traceability gaps between requirements and design. The 10 highest-severity net-new findings were almost entirely D1/D3/D6 @@ -269,18 +272,19 @@ cross-document traceability gaps between requirements and design. 
The - Gateway F-006: Module table missing Admin API and BLE modules - Protocol F-010: Redundant `command_type` field inconsistency risk -**The two approaches are complementary, not competing.** A human -reviewer reads a test plan and thinks "is this test right?" — catching -D2 and D7 issues. The structured protocol reads all three documents -and mechanically checks "does every REQ-ID appear in the design?" — -catching D1 and D6 issues that require comparing documents a human -doesn't hold in working memory simultaneously. +**The two approaches are complementary, not competing.** An ad-hoc +prompt tends to read each document in isolation and spot issues within +it — catching D2 and D7 issues. The structured protocol forces the LLM +to build a complete identifier inventory across all three documents +and check every cell in the traceability matrix — catching D1 and D6 +issues that require holding three documents in working memory +simultaneously. -### Why the Manual Audit Missed Design Drift +### Why the Ad-Hoc Audit Missed Design Drift -The manual audit focused on a natural question: "are the tests -complete?" This is the question engineers instinctively ask, because -test gaps have immediate consequences (bugs in production). Design +The ad-hoc audit focused on a natural question: "are the tests +complete?" This is the question engineers instinctively ask (and +prompt for), because test gaps have immediate consequences. Design traceability gaps have deferred consequences — the code might still be correct even if the design document is stale. But when BLE pairing design was absent from three design documents simultaneously, the risk @@ -289,9 +293,9 @@ would have no design guidance for 30-45% of requirements. The PromptKit audit caught this because the traceability-audit protocol requires building a **complete identifier inventory** before comparing -documents. A human skims; the protocol enumerates. 
Enumeration is -tedious but exhaustive — and that's exactly what makes it effective for -finding what's missing rather than what's wrong. +documents. An ad-hoc prompt skims; the protocol enumerates. +Enumeration is tedious but exhaustive — and that's exactly what makes +it effective for finding what's missing rather than what's wrong. ## Takeaways @@ -300,10 +304,11 @@ finding what's missing rather than what's wrong. documents lagged. Manual review would catch this in one component; the audit caught it in all three and quantified the gap precisely. -- **Structured audits find different issues than manual review.** - 49% of findings were net-new — almost all design traceability gaps - that the manual audit's test-focused lens naturally missed. The two - approaches are complementary. +- **Structured prompts find different issues than ad-hoc prompts.** + Both used LLMs. The difference was prompt engineering: the ad-hoc + prompt caught test gaps (what an LLM naturally surfaces), while the + PromptKit prompt caught design traceability gaps (what the protocol + forces the LLM to check). 49% of findings were net-new. - **One prompt, five audits.** The assembled prompt is reusable — the methodology doesn't change, only the inputs. 
This scales to From aa791a75a03717296c08870ea45d035c85002e8b Mon Sep 17 00:00:00 2001 From: Alan Jowett Date: Fri, 20 Mar 2026 08:09:58 -0700 Subject: [PATCH 3/5] Fix data inconsistencies: reconcile counts, fix severity text - Modem findings: 13 consistently (was 12 in cross-ref table) - Total: 60 consistently (was 59 in cross-ref section) - Remove 'no informational' claim that contradicted severity table - Percentages updated to match corrected totals Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> --- docs/case-studies/sonde-specification-audit.md | 15 +++++++-------- 1 file changed, 7 insertions(+), 8 deletions(-) diff --git a/docs/case-studies/sonde-specification-audit.md b/docs/case-studies/sonde-specification-audit.md index b5718a3..0c5a4d5 100644 --- a/docs/case-studies/sonde-specification-audit.md +++ b/docs/case-studies/sonde-specification-audit.md @@ -134,8 +134,7 @@ realized in design and validation). ### Severity Profile -No findings were informational-only. The distribution skews toward -actionable items: +The severity distribution skews toward actionable items: | Severity | Count | Percentage | |----------|-------|------------| @@ -228,21 +227,21 @@ find that an ad-hoc LLM audit misses — and vice versa? 
### Cross-Reference Results -The 59 PromptKit findings were cross-referenced against existing +The 60 PromptKit findings were cross-referenced against existing GitHub issues filed from the ad-hoc audit: | Component | Findings | Direct Match | Partial Match | No Issue | |-----------|----------|--------------|---------------|----------| | Protocol | 14 | 8 | 2 | 4 | | BLE Tool | 15 | 6 | 3 | 6 | -| Modem | 12 | 2 | 3 | 7 | +| Modem | 13 | 2 | 3 | 8 | | Gateway | 7 | 1 | 2 | 4 | | Node | 11 | 0 | 3 | 8 | -| **Total** | **59** | **17 (29%)** | **13 (22%)** | **29 (49%)** | +| **Total** | **60** | **17 (28%)** | **13 (22%)** | **30 (50%)** | -**29% of findings were already known** — the ad-hoc audit had caught +**28% of findings were already known** — the ad-hoc audit had caught them. **22% partially overlapped** — the issue existed but the finding -was more specific or broader. **49% were net-new** — issues the ad-hoc +was more specific or broader. **50% were net-new** — issues the ad-hoc audit did not surface at all. ### Different Prompts Find Different Things @@ -308,7 +307,7 @@ it effective for finding what's missing rather than what's wrong. Both used LLMs. The difference was prompt engineering: the ad-hoc prompt caught test gaps (what an LLM naturally surfaces), while the PromptKit prompt caught design traceability gaps (what the protocol - forces the LLM to check). 49% of findings were net-new. + forces the LLM to check). 50% of findings were net-new. - **One prompt, five audits.** The assembled prompt is reusable — the methodology doesn't change, only the inputs. 
This scales to From 7caf52b43ffcb6d2255b780cc2c9775005422448 Mon Sep 17 00:00:00 2001 From: Alan Jowett Date: Fri, 20 Mar 2026 08:11:58 -0700 Subject: [PATCH 4/5] Add reverse comparison: what ad-hoc audit caught that PromptKit missed Adds three categories of findings the trifecta audit cannot detect: - Semantic test gaps (6 issues, ~45 gaps): tests exist but don't verify deeply enough - Domain safety (4 issues, ~50+ gaps): BPF safety invariants from a spec outside the trifecta - Cross-component integration (1 issue, 5 gaps): flows spanning multiple components Adds complementarity summary table and updates takeaways. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> --- .../case-studies/sonde-specification-audit.md | 58 +++++++++++++++++-- 1 file changed, 54 insertions(+), 4 deletions(-) diff --git a/docs/case-studies/sonde-specification-audit.md b/docs/case-studies/sonde-specification-audit.md index 0c5a4d5..ed550be 100644 --- a/docs/case-studies/sonde-specification-audit.md +++ b/docs/case-studies/sonde-specification-audit.md @@ -296,6 +296,55 @@ documents. An ad-hoc prompt skims; the protocol enumerates. Enumeration is tedious but exhaustive — and that's exactly what makes it effective for finding what's missing rather than what's wrong. +### What the Ad-Hoc Audit Caught That PromptKit Missed + +The comparison cuts both ways. The ad-hoc audit found **11 spec-relevant +issues (~100+ individual gaps)** that the trifecta audit did not surface. +They fall into three categories the structural audit cannot detect by +design: + +**Semantic test gaps (6 issues, ~45 individual gaps).** Tests exist and +are linked to requirements, but don't verify deeply enough. For example: +issue #357 found that protocol tests check output exists but not that +randomness is cryptographic, CBOR is deterministic, or HMAC state is +isolated. Issue #354 found 11 node tests that check outcomes but not +timing/ordering constraints ("MUST wait for X before Y"). 
Issue #359 +found requirements with happy-path tests but no negative tests. + +*Why missed:* The trifecta audit checks "does a test case exist for this +requirement?" (D2). It does not read test procedures to judge whether +they're thorough enough. That's D7 territory, which it only spot-checks +at the acceptance-criteria level. + +**Domain-specific safety gaps (4 issues, ~50+ individual gaps).** These +require deep understanding of the BPF interpreter's safety model. Issue +#330 found 28 tagged register safety invariants with zero test coverage. +Issue #334 found 8 BPF helper trust boundary gaps. These come from +`safe-bpf-interpreter.md` — a separate specification not included in +any component's trifecta. + +*Why missed:* The audit examines the three documents it's given. Specs +outside the trifecta are invisible to it. + +**Cross-component integration (1 issue, 5 gaps).** The end-to-end BLE +onboarding flow across gateway + modem + pairing tool was never +integration-tested (#361). + +*Why missed:* The trifecta audit examines each component independently. +Cross-component flows are invisible to per-component audits. + +### The Complementarity Is the Point + +| Approach | Strength | Blind spot | +|----------|----------|------------| +| **PromptKit trifecta** | Structural traceability — missing cross-references, orphaned IDs, numbering gaps (30 net-new findings) | Cannot judge test depth, domain safety invariants, or cross-component flows | +| **Ad-hoc prompt** | Semantic depth — are tests thorough enough? are safety invariants verified? are negative cases covered? (11 issues, ~100+ gaps) | Misses systematic traceability gaps across document sets | + +Neither approach alone gives full coverage. The structural audit is +exhaustive but shallow (does a test *exist*?). The ad-hoc audit is +deep but selective (is this test *good enough*?). Used together, they +cover both dimensions. 
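
The "exhaustive but shallow" structural check is mechanical enough to
sketch in a few lines of shell. This is a hypothetical approximation of
the forward-traceability phase, not PromptKit's actual implementation;
the REQ-ID pattern, file names, and toy contents are invented for
illustration:

```shell
# Toy inputs (assumed REQ-ID pattern and file names, not Sonde's real layout).
printf 'MD-0001 parse frames\nMD-0400 BLE pairing relay\n' > requirements.md
printf 'MD-0001 framing module\n' > design.md
printf 'MD-0001 -> T-0001\nMD-0400 -> T-0042\n' > validation.md

# Forward traceability: every REQ-ID in the requirements doc must appear
# in the design (else a D1 candidate) and in the validation plan (else a
# D2 candidate).
grep -oE '[A-Z]{2}-[0-9]{4}' requirements.md | sort -u |
while read -r id; do
  grep -q "$id" design.md     || echo "D1 candidate: $id untraced in design"
  grep -q "$id" validation.md || echo "D2 candidate: $id untested"
done
# → D1 candidate: MD-0400 untraced in design
```

Against real documents, a check like this catches exactly the D1/D2
class of gap: an identifier present in the requirements but absent
elsewhere. What it cannot do is judge whether the design section or
test it finds is any good, which is the semantic depth described above.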
+ ## Takeaways - **Specification drift is real and systemic.** BLE pairing was added @@ -304,10 +353,11 @@ it effective for finding what's missing rather than what's wrong. the audit caught it in all three and quantified the gap precisely. - **Structured prompts find different issues than ad-hoc prompts.** - Both used LLMs. The difference was prompt engineering: the ad-hoc - prompt caught test gaps (what an LLM naturally surfaces), while the - PromptKit prompt caught design traceability gaps (what the protocol - forces the LLM to check). 50% of findings were net-new. + Both used LLMs. PromptKit found 30 net-new structural traceability + gaps. The ad-hoc prompt found 11 issues (~100+ individual gaps) in + semantic test depth and domain safety that the structural audit + can't see. Neither alone gives full coverage — the two are + complementary by design, not competing. - **One prompt, five audits.** The assembled prompt is reusable — the methodology doesn't change, only the inputs. This scales to From a5e54917ce80808b670a5ec5e5f17d2be14b3989 Mon Sep 17 00:00:00 2001 From: Alan Jowett Date: Fri, 20 Mar 2026 08:22:10 -0700 Subject: [PATCH 5/5] Qualify Sonde issue references with repo links MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Issue numbers like #357 were ambiguous — could be read as PromptKit issues. Now qualified as 'Sonde #357' with full GitHub URLs. 
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> --- .../case-studies/sonde-specification-audit.md | 25 ++++++++++++------- 1 file changed, 16 insertions(+), 9 deletions(-) diff --git a/docs/case-studies/sonde-specification-audit.md b/docs/case-studies/sonde-specification-audit.md index ed550be..2c64baf 100644 --- a/docs/case-studies/sonde-specification-audit.md +++ b/docs/case-studies/sonde-specification-audit.md @@ -305,11 +305,15 @@ design: **Semantic test gaps (6 issues, ~45 individual gaps).** Tests exist and are linked to requirements, but don't verify deeply enough. For example: -issue #357 found that protocol tests check output exists but not that -randomness is cryptographic, CBOR is deterministic, or HMAC state is -isolated. Issue #354 found 11 node tests that check outcomes but not -timing/ordering constraints ("MUST wait for X before Y"). Issue #359 -found requirements with happy-path tests but no negative tests. +Sonde issue +[#357](https://github.com/alan-jowett/sonde/issues/357) found that +protocol tests check output exists but not that randomness is +cryptographic, CBOR is deterministic, or HMAC state is isolated. Sonde +[#354](https://github.com/alan-jowett/sonde/issues/354) found 11 node +tests that check outcomes but not timing/ordering constraints ("MUST +wait for X before Y"). Sonde +[#359](https://github.com/alan-jowett/sonde/issues/359) found +requirements with happy-path tests but no negative tests. *Why missed:* The trifecta audit checks "does a test case exist for this requirement?" (D2). It does not read test procedures to judge whether @@ -317,9 +321,11 @@ they're thorough enough. That's D7 territory, which it only spot-checks at the acceptance-criteria level. **Domain-specific safety gaps (4 issues, ~50+ individual gaps).** These -require deep understanding of the BPF interpreter's safety model. Issue -#330 found 28 tagged register safety invariants with zero test coverage. 
-Issue #334 found 8 BPF helper trust boundary gaps. These come from +require deep understanding of the BPF interpreter's safety model. Sonde +[#330](https://github.com/alan-jowett/sonde/issues/330) found 28 tagged +register safety invariants with zero test coverage. Sonde +[#334](https://github.com/alan-jowett/sonde/issues/334) found 8 BPF +helper trust boundary gaps. These come from `safe-bpf-interpreter.md` — a separate specification not included in any component's trifecta. @@ -328,7 +334,8 @@ outside the trifecta are invisible to it. **Cross-component integration (1 issue, 5 gaps).** The end-to-end BLE onboarding flow across gateway + modem + pairing tool was never -integration-tested (#361). +integration-tested (Sonde +[#361](https://github.com/alan-jowett/sonde/issues/361)). *Why missed:* The trifecta audit examines each component independently. Cross-component flows are invisible to per-component audits.