Skip to content

fix(artifacts): tolerate Scientist YAML-style HYPOTHESES + OPEN_QUESTIONS (#5)#6

Merged
omerakben merged 1 commit into
mainfrom
fix/issue-5-scientist-hypotheses
May 2, 2026
Merged

fix(artifacts): tolerate Scientist YAML-style HYPOTHESES + OPEN_QUESTIONS (#5)#6
omerakben merged 1 commit into
mainfrom
fix/issue-5-scientist-hypotheses

Conversation

@omerakben
Copy link
Copy Markdown
Owner

@omerakben omerakben commented May 2, 2026

Two-layer fix for issue #5. Scientist persona generates YAML-style content; parser expects Markdown H2 blocks.

Layer 1 — Prompt: Added canonical schema examples to scientist.md with wrong/right comparisons.
Layer 2 — Parser: YAML-to-Markdown adapters for both hypotheses.ts and open-questions.ts. Field mapping, status normalization, missing-field synthesis.
Tests: 22 new regression tests (13 hypotheses + 9 open-questions). Full suite: 2153 pass / 0 fail / 1 skip.
Codex review: No BLOCK-PUSH findings.

Closes #5

Summary by CodeRabbit

  • New Features

    • Added support for YAML-style input in hypothesis and open question artifacts with automatic conversion to canonical Markdown format.
  • Documentation

    • Enhanced schema documentation with detailed canonical format specifications, formatting rules, status enums, and example formats for hypotheses and open questions.

…IONS drift (#5)

Every PLAN gate failed `scientist_hypotheses_invalid` (and after that, the
mirror `scientist_open_questions_invalid`) because the Scientist persona
emits YAML-style block entries instead of the canonical `## H-NNN:` /
`## Q-NNN:` H2-block schema. Two layers of root cause:

1. Persona: src/agents/defaults/scientist.md described the protocol in
   prose ("emit the full canonical HYPOTHESES.md") but carried no concrete
   schema example. With no anchor, the LLM defaulted to YAML — `- id: H-001`
   followed by indented `claim:` / `falsifier:` / `phase_introduced:` /
   `status: proposed` / `sources: [...]` continuation lines. `Risk if false`
   was missing entirely. Same shape on OPEN_QUESTIONS.md. Add a "Canonical
   schemas (read before emitting)" section with explicit wrong / right
   examples for both sidecars, the full bullet order, and the locked enums.

2. Parser: src/artifacts/hypotheses.ts and open-questions.ts were strict —
   any line that wasn't `## H-NNN:` / `## Q-NNN:` plus the canonical bullet
   set raised hypothesis_unexpected_content / question_unexpected_content
   errors. Add a narrow tolerance: `adaptYamlStyleHypotheses(raw)` and
   `adaptYamlStyleOpenQuestions(raw)` pre-rewrite YAML-style blocks
   (`- id: <ID>` followed by indented `key:` continuation lines) into
   canonical H2 blocks before strict parsing. Mixed-format input (some
   canonical, some YAML) is supported — only `- id:` blocks are rewritten.

   Field mapping (HYPOTHESES): claim → title (first non-empty line);
   phase_introduced / phase → Phase; status: proposed / draft / pending →
   open; falsifier → Falsifier; sources / evidence → Evidence (inline
   `[a, b, c]` lists are joined as `a, b, c`); risk_if_false / risk → Risk
   if false. Missing falsifier or risk synthesizes a placeholder tagged
   `(auto-synthesized — Scientist did not specify; investigate before
   depending on this hypothesis)` so the round-trip stays honest.

   Field mapping (OPEN_QUESTIONS): question / text / title → heading text;
   phase_introduced / phase → Phase; status mapping mirrors hypotheses;
   importance / due_by / context / resolution_attempts → their canonical
   counterparts. Missing DueBy defaults to `-`; missing Resolution attempts
   defaults to `none yet.` Both are existing contract-permitted defaults,
   not invented semantics.

   Discipline boundary: the adapter fires only on `- id: <ID>` lines.
   Canonical blocks pass through verbatim. The strict parser still owns
   final validation; an unconvertible YAML block (e.g., malformed id) still
   surfaces via the strict id-format error.

Tests: 22 new regression cases across both files covering (a) YAML→canonical
rewrite of every supported field, (b) status:proposed → open mapping,
(c) inline-list sources joined as Evidence, (d) plain-string sources passed
through, (e) synthesized Risk if false / Resolution attempts placeholders,
(f) mixed-format input, (g) round-trip serialize → reparse, (h) unchanged
output for canonical input, (i) malformed YAML id still rejected by strict
parser. Full offline suite: 2153 pass / 0 fail / 1 skip (gated live xAI).

Closes #5
Copilot AI review requested due to automatic review settings May 2, 2026 20:24
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 2, 2026

📝 Walkthrough

Walkthrough

This PR fixes a critical bug where the Scientist persona generates YAML-style sidecar artifacts instead of Markdown H2-block format. It adds YAML-to-Markdown adapters to the hypothesis and question parsers, updates the Scientist prompt with canonical schema examples, and includes comprehensive regression tests for dual-format tolerance.

Changes

YAML-to-Markdown Artifact Tolerance

Layer / File(s) Summary
Prompt Schema Guidance
src/agents/defaults/scientist.md
Adds explicit "Canonical schemas" section forbidding YAML-style blocks and mandating strict Markdown H2 format (## H-NNN:, ## Q-NNN:) with required bullet orders, enum constraints (status, importance, phase), ISO DueBy formatting, and resolved-question Resolved: convention. Includes rejected/accepted output examples.
Hypothesis YAML Adapter
src/artifacts/hypotheses.ts
Introduces adaptYamlStyleHypotheses() to detect - id: H-NNN YAML blocks, parse indented key: value fields, map field names (phase_introducedPhase, sourcesEvidence, status: proposedStatus: open), synthesize missing Risk if false, and render canonical ## H-NNN: blocks with required bullets.
Question YAML Adapter
src/artifacts/open-questions.ts
Introduces adaptYamlStyleOpenQuestions() with parallel logic: detects - id: Q-NNN YAML blocks, maps synonyms (proposed/draft/pendingopen), defaults missing due_by to -, synthesizes Resolution attempts, and renders canonical ## Q-NNN: H2 blocks with required fields.
Parser Integration
src/artifacts/{hypotheses,open-questions}.ts
Both parseHypotheses() and parseOpenQuestions() now preprocess input via their respective adapters before BOM stripping, allowing mixed or pure-YAML input to normalize into canonical Markdown before strict validation.
Regression Tests
tests/artifacts-hypotheses.test.ts, tests/artifacts-open-questions.test.ts
Comprehensive issue #5 suites verifying YAML detection, field mapping, enum status mapping, synthesis of missing bullets, preservation of canonical input, parsing/serialization round-trips, and malformed ID rejection. Tests confirm serializeHypotheses/serializeOpenQuestions produce canonical form.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

🐰 Hark! The Scientist once spoke in YAML's nested code,
But now adapters bend those blocks to the canonical road.
H's and Q's in Markdown dress, with bullets in their place—
From YAML's tangle to clear prose, we've smoothed the parsing race!

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 22.22% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately describes the primary change: adding YAML tolerance for HYPOTHESES and OPEN_QUESTIONS in the artifacts parser, directly addressing issue #5.
Linked Issues check ✅ Passed The PR implements all coding requirements from issue #5: prompt examples in scientist.md, YAML-to-Markdown adapters in hypotheses.ts and open-questions.ts with field mapping and status normalization, and comprehensive regression tests.
Out of Scope Changes check ✅ Passed All changes directly support issue #5 requirements: prompt documentation, parser adapters with field mapping, and regression tests; no unrelated modifications present.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch fix/issue-5-scientist-hypotheses

Tip

💬 Introducing [Slack Agent](https://www.coderabbit.ai/agent): Turn conversations into code.

Slack Agent is built on CodeRabbit's deep understanding of your code, so your team can collaborate across the entire SDLC without losing context.

  • Generate code and open pull requests
  • Plan features and break down work
  • Investigate incidents and troubleshoot customer tickets together
  • Automate recurring tasks and respond to alerts with triggers
  • Summarize progress and report instantly

Built for teams:

  • Shared memory across your entire org—no repeating context
  • Per-thread sandboxes to safely plan and execute work
  • Governance built-in—scoped access, auditability, and budget controls

One agent for your entire SDLC. Right inside Slack.

👉 Get your free trial and get 200 agent minutes per Slack user (a $50 value).


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share
Review rate limit: 0/1 reviews remaining, refill in 60 minutes.

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces tolerance for YAML-style formatting in HYPOTHESES.md and OPEN_QUESTIONS.md artifacts, which LLMs occasionally produce instead of the canonical Markdown schema. It adds adapter functions to pre-process and convert these YAML blocks before strict parsing, along with comprehensive regression tests and updated documentation for the Scientist persona. Feedback focuses on improving the robustness of the YAML adapters, specifically by handling multi-line values, ensuring correct Byte Order Mark (BOM) stripping before adaptation, and maintaining determinism by avoiding dynamic dates in auto-synthesized fields.

Comment on lines +142 to +150
while (i < lines.length) {
const cont = lines[i]!
const contMatch = cont.match(YAML_HYP_CONTINUATION)
if (contMatch === null) break
const key = contMatch[1]!.toLowerCase()
const value = contMatch[2]!.trim()
if (!fields.has(key)) fields.set(key, value)
i++
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The current YAML adapter only supports single-line values. If an LLM produces a multi-line value (e.g., a long claim or falsifier that wraps), the parser will break the block prematurely and likely trigger validation errors on the subsequent lines.

Consider improving the loop to collect indented lines that don't match a new key as continuations of the previous key's value.

    let lastKey: string | null = null;
    while (i < lines.length) {
      const cont = lines[i]!;
      const contMatch = cont.match(YAML_HYP_CONTINUATION);
      if (contMatch) {
        lastKey = contMatch[1]!.toLowerCase();
        const value = contMatch[2]!.trim();
        if (!fields.has(lastKey)) fields.set(lastKey, value);
        i++;
      } else if (lastKey && /^[ \t]+/.test(cont)) {
        const existing = fields.get(lastKey) ?? '';
        fields.set(lastKey, (existing + ' ' + cont.trim()).trim());
        i++;
      } else {
        break;
      }
    }

Comment on lines +169 to +170
const adapted = adaptYamlStyleHypotheses(raw)
const text = adapted.startsWith(BOM) ? adapted.slice(BOM.length) : adapted
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The Byte Order Mark (BOM) should be stripped before calling the YAML adapter. If a file starts with a BOM and a YAML-style block, the YAML_HYP_ID_PROBE and YAML_HYP_ID_LINE regexes will fail to match the first line because of the leading character, causing the adapter to skip the first block.

Suggested change
const adapted = adaptYamlStyleHypotheses(raw)
const text = adapted.startsWith(BOM) ? adapted.slice(BOM.length) : adapted
const stripped = raw.startsWith(BOM) ? raw.slice(BOM.length) : raw;
const text = adaptYamlStyleHypotheses(stripped);

if (resolved.length > 0) {
out.push(`- Resolved: ${resolved}`)
} else {
const today = new Date().toISOString().slice(0, 10)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Using new Date() makes the parser non-deterministic. If a YAML-style question is marked as resolved but lacks a resolution date, the resulting Markdown will change every day the tool is run. While this is a fallback for malformed input, consider using a fixed placeholder date (e.g., 1970-01-01) to maintain determinism in the artifact generation.

Suggested change
const today = new Date().toISOString().slice(0, 10)
const today = '1970-01-01' // Placeholder for missing resolution date

Comment on lines +158 to +166
while (i < lines.length) {
const cont = lines[i]!
const contMatch = cont.match(YAML_Q_CONTINUATION)
if (contMatch === null) break
const key = contMatch[1]!.toLowerCase()
const value = contMatch[2]!.trim()
if (!fields.has(key)) fields.set(key, value)
i++
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Similar to the hypotheses adapter, this logic fails to capture multi-line values. Indented lines following a key-value pair should be appended to the current value to ensure long questions or context blocks are preserved correctly.

    let lastKey: string | null = null;
    while (i < lines.length) {
      const cont = lines[i]!;
      const contMatch = cont.match(YAML_Q_CONTINUATION);
      if (contMatch) {
        lastKey = contMatch[1]!.toLowerCase();
        const value = contMatch[2]!.trim();
        if (!fields.has(lastKey)) fields.set(lastKey, value);
        i++;
      } else if (lastKey && /^[ \t]+/.test(cont)) {
        const existing = fields.get(lastKey) ?? '';
        fields.set(lastKey, (existing + ' ' + cont.trim()).trim());
        i++;
      } else {
        break;
      }
    }

Comment on lines +188 to +189
const adapted = adaptYamlStyleOpenQuestions(raw)
const text = adapted.startsWith(BOM) ? adapted.slice(BOM.length) : adapted
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Stripping the BOM after adaptation can cause the adapter to miss a YAML block at the very beginning of the file. Strip the BOM from the raw input before passing it to the adapter.

Suggested change
const adapted = adaptYamlStyleOpenQuestions(raw)
const text = adapted.startsWith(BOM) ? adapted.slice(BOM.length) : adapted
const stripped = raw.startsWith(BOM) ? raw.slice(BOM.length) : raw;
const text = adaptYamlStyleOpenQuestions(stripped);

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 46c5f56ac1

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +125 to +126
const today = new Date().toISOString().slice(0, 10)
out.push(
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Remove wall-clock dependency from YAML question adaptation

Synthesizing Resolved with new Date() makes parseOpenQuestions time-dependent: the same YAML input (status: resolved without an explicit resolved: field) produces different parsed artifacts on different days, so parse→serialize output is not stable and can create date-driven diff churn in artifacts and flaky expectations in deterministic/offline workflows. Parsers should be pure for identical input; use a fixed sentinel or surface a validation error/intervention instead of injecting the current date.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR addresses issue #5 where the Scientist agent emits YAML-style HYPOTHESES / OPEN QUESTIONS blocks that the artifact parsers expect in canonical Markdown H2-block format, by tightening the Scientist prompt and adding YAML→Markdown adaptation before strict parsing.

Changes:

  • Add canonical “wrong vs right” schema examples to the Scientist persona prompt to prevent YAML-style output.
  • Add YAML-style adapters in hypotheses.ts and open-questions.ts that rewrite - id: ... blocks into canonical ## ...: blocks prior to parsing (including status normalization and missing-field synthesis).
  • Add regression tests covering YAML-only and mixed-format inputs for both artifacts.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 4 comments.

Show a summary per file
File Description
src/artifacts/hypotheses.ts Adds YAML-style pre-parse adapter and routes parsing through it.
src/artifacts/open-questions.ts Adds YAML-style pre-parse adapter and routes parsing through it.
src/agents/defaults/scientist.md Documents canonical non-YAML schemas with explicit wrong/right examples.
tests/artifacts-hypotheses.test.ts Adds YAML-tolerance regression tests for hypotheses parsing and adaptation.
tests/artifacts-open-questions.test.ts Adds YAML-tolerance regression tests for open-questions parsing and adaptation.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +61 to +63
// what was inferred. The strict parser still owns final validation; an
// unconvertible YAML block (e.g., missing `falsifier:`) fails just as a missing
// `- Falsifier:` bullet does.
Comment on lines +120 to +129
if (status === 'resolved') {
const resolved = fields.get('resolved') ?? ''
if (resolved.length > 0) {
out.push(`- Resolved: ${resolved}`)
} else {
const today = new Date().toISOString().slice(0, 10)
out.push(
`- Resolved: ${today} — (auto-synthesized — Scientist did not specify resolution; investigate)`,
)
}
Comment on lines +341 to +355
test('still rejects YAML missing falsifier (tolerance does not invent evidence)', () => {
const bad = `# HYPOTHESES

- id: H-001
claim: missing falsifier on purpose.
status: open
phase_introduced: plan
`
// Adapter synthesizes a placeholder falsifier; downstream contract
// discipline should still surface the gap. The strict parser accepts the
// synthesized placeholder, but the placeholder is tagged so reviewers
// can spot it.
const art = parseHypotheses(bad)
expect(art.hypotheses[0]!.falsifier).toContain('auto-synthesized')
})
Comment on lines +363 to +367
// The probe regex requires H-\d+, but H-1 (single digit) does not match
// \d{3,} in the strict ID validator. The adapter passes the line through
// unchanged because YAML_HYP_ID_LINE requires H-\d+ (1+ digits) — H-1
// does match the adapter, so it gets rewritten, then the strict parser
// rejects the id format.
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 4

🧹 Nitpick comments (2)
tests/artifacts-hypotheses.test.ts (1)

341-355: ⚡ Quick win

Test title/comment contradict expected behavior.

Line 341 says “still rejects YAML missing falsifier”, but the assertion expects successful parsing with an auto-synthesized falsifier. Rename to reflect actual contract being tested.

Suggested fix
-  test('still rejects YAML missing falsifier (tolerance does not invent evidence)', () => {
+  test('accepts YAML missing falsifier by synthesizing a placeholder marker', () => {
...
-    // Adapter synthesizes a placeholder falsifier; downstream contract
-    // discipline should still surface the gap. The strict parser accepts the
-    // synthesized placeholder, but the placeholder is tagged so reviewers
-    // can spot it.
+    // Adapter synthesizes a placeholder falsifier; strict parsing accepts it,
+    // and the marker makes the gap explicit for downstream review.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/artifacts-hypotheses.test.ts` around lines 341 - 355, The test title
and comment are misleading: change the test name and any surrounding comment so
it reflects that missing falsifiers are accepted by the parser with an
"auto-synthesized" placeholder rather than being rejected; update the test
description string in the test that calls parseHypotheses(...) and asserts
expect(art.hypotheses[0]!.falsifier).toContain('auto-synthesized') (and any
nearby explanatory comment) to something like "accepts YAML missing falsifier
and synthesizes placeholder" so the title matches the assertion and behavior.
src/artifacts/hypotheses.ts (1)

62-63: ⚡ Quick win

Commented behavior conflicts with implementation.

Lines 62–63 say missing falsifier fails, but lines 99–101 synthesize a placeholder and parsing succeeds. Please align this comment with actual behavior to avoid misleading future edits.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/artifacts/hypotheses.ts` around lines 62 - 63, The comment claiming an
unconvertible YAML block (missing `falsifier`) fails is inaccurate because the
parser synthesizes a placeholder `falsifier` (see the logic around the
placeholder creation at lines ~99–101), allowing parsing to succeed; update the
comment above the parsing logic (the comment at lines 62–63) to state that when
`falsifier` is missing the code creates a placeholder/default falsifier rather
than failing, or alternatively change the implementation to actually throw on
missing `falsifier` if you prefer failure—refer to the `falsifier` handling and
the placeholder-synthesis code near the parsing function in
src/artifacts/hypotheses.ts to make the comment and behavior consistent.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/agents/defaults/scientist.md`:
- Around line 52-98: Add "md" language identifiers to the triple-fenced code
blocks that show the HYPOTHESES and OPEN QUESTIONS examples so they satisfy
markdownlint MD040; update the three fences surrounding the HYPOTHESES example
blocks and the OPEN QUESTIONS example block (the fences that wrap the text
starting with "# HYPOTHESES" and "# OPEN QUESTIONS") to use ```md instead of ```
so the examples render and lint correctly.

In `@src/artifacts/hypotheses.ts`:
- Around line 96-98: The YAML status lookup is currently case-sensitive (uses
rawStatus directly), so update the mapping to be case-insensitive by normalizing
rawStatus before lookup: compute a normalized value (e.g., const normalized =
String(rawStatus).trim().toLowerCase()) and use YAML_STATUS_MAP[normalized] ??
rawStatus (or ?? normalized if you want the normalized form downstream) when
setting status; update the code around rawStatus, YAML_STATUS_MAP and status to
use this normalized lookup so mixed-case model outputs map correctly.

In `@src/artifacts/open-questions.ts`:
- Around line 102-104: Normalize the raw YAML status value before looking it up
in YAML_Q_STATUS_MAP: when retrieving rawStatus from fields.get('status') (used
to compute status), trim whitespace and convert to lowercase (and optionally
collapse internal whitespace) before using YAML_Q_STATUS_MAP[rawStatus]; keep
the fallback to the original raw value if mapping fails so strict validation
still receives a sensible value. Ensure this change touches the rawStatus ->
status assignment (the variables rawStatus and status) so values like "Proposed"
or "PENDING" are correctly mapped.
- Around line 120-129: The code in open-questions.ts fabricates a time-dependent
"Resolved" value using new Date() when fields.get('resolved') is empty, making
artifacts non-deterministic; change the logic in the block handling status ===
'resolved' (the resolved variable and the out.push call) to avoid inserting the
current date — either omit the Resolved line when resolved is empty or push a
deterministic placeholder like "— (unspecified resolution)" and emit/return a
validation warning or error instead of auto-synthesizing; remove the new Date()
usage so artifacts remain deterministic and update any related tests to expect
the new deterministic behavior.

---

Nitpick comments:
In `@src/artifacts/hypotheses.ts`:
- Around line 62-63: The comment claiming an unconvertible YAML block (missing
`falsifier`) fails is inaccurate because the parser synthesizes a placeholder
`falsifier` (see the logic around the placeholder creation at lines ~99–101),
allowing parsing to succeed; update the comment above the parsing logic (the
comment at lines 62–63) to state that when `falsifier` is missing the code
creates a placeholder/default falsifier rather than failing, or alternatively
change the implementation to actually throw on missing `falsifier` if you prefer
failure—refer to the `falsifier` handling and the placeholder-synthesis code
near the parsing function in src/artifacts/hypotheses.ts to make the comment and
behavior consistent.

In `@tests/artifacts-hypotheses.test.ts`:
- Around line 341-355: The test title and comment are misleading: change the
test name and any surrounding comment so it reflects that missing falsifiers are
accepted by the parser with an "auto-synthesized" placeholder rather than being
rejected; update the test description string in the test that calls
parseHypotheses(...) and asserts
expect(art.hypotheses[0]!.falsifier).toContain('auto-synthesized') (and any
nearby explanatory comment) to something like "accepts YAML missing falsifier
and synthesizes placeholder" so the title matches the assertion and behavior.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 1d93a010-d950-439c-8138-77079accee25

📥 Commits

Reviewing files that changed from the base of the PR and between f4fc6ea and 46c5f56.

📒 Files selected for processing (5)
  • src/agents/defaults/scientist.md
  • src/artifacts/hypotheses.ts
  • src/artifacts/open-questions.ts
  • tests/artifacts-hypotheses.test.ts
  • tests/artifacts-open-questions.test.ts

Comment on lines +52 to +98
```
# HYPOTHESES

- id: H-001
claim: The scorer ranks 5 candidates within 50ms.
falsifier: microbenchmark median > 50ms.
status: proposed
phase_introduced: plan
sources: [SC-SPEC-001]
```

Right (Markdown H2 blocks — parser accepts):

```
# HYPOTHESES

## H-001: scorer ranks 5 candidates within 50ms

- Phase: plan
- Status: open
- Falsifier: microbenchmark on M1 emulator profile shows median > 50ms.
- Evidence: SPEC.md AC-1; SPEC constraint phone-class device.
- Risk if false: SPEC AC-1 fails; PLAN T-001 needs rework.
```

Required HYPOTHESES.md rules:

- Heading form: `## H-NNN: <one-line title>` where `H-NNN` is zero-padded three or more digits (`H-001`, `H-042`).
- Five required bullets in this canonical order: `Phase`, `Status`, `Falsifier`, `Evidence`, `Risk if false`.
- Status enum: `open | confirmed | rejected | obsolete`. Use `open` for live claims; do NOT emit `proposed`.
- Phase: a valid SDLC phase (e.g., `define`, `plan`, `build`, `verify`, `review`, `ship`, `audit`).
- `Risk if false:` is required, not optional. Every hypothesis names what breaks if the claim is wrong.

Required OPEN_QUESTIONS.md rules:

```
# OPEN QUESTIONS

## Q-001: Should the app produce gender-neutral suggestions only?

- Phase: define
- Status: open
- Importance: medium
- DueBy: 2026-05-15
- Context: SPEC.md ## Open questions, bullet 1.
- Resolution attempts: none yet.
```
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Add language identifiers to fenced code blocks (MD040).

The fences introduced at Lines 52, 65, and 87 are missing a language tag, which matches the markdownlint warnings.

Suggested fix
-```
+```md
 # HYPOTHESES
 ...
-```
+```

-```
+```md
 # HYPOTHESES
 ...
-```
+```

-```
+```md
 # OPEN QUESTIONS
 ...
-```
+```
🧰 Tools
🪛 markdownlint-cli2 (0.22.1)

[warning] 52-52: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


[warning] 65-65: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


[warning] 87-87: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/agents/defaults/scientist.md` around lines 52 - 98, Add "md" language
identifiers to the triple-fenced code blocks that show the HYPOTHESES and OPEN
QUESTIONS examples so they satisfy markdownlint MD040; update the three fences
surrounding the HYPOTHESES example blocks and the OPEN QUESTIONS example block
(the fences that wrap the text starting with "# HYPOTHESES" and "# OPEN
QUESTIONS") to use ```md instead of ``` so the examples render and lint
correctly.

Comment on lines +96 to +98
const rawStatus = fields.get('status') ?? 'open'
const status = YAML_STATUS_MAP[rawStatus] ?? rawStatus
const falsifier =
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Make YAML status mapping case-insensitive.

Line 97 only maps lowercase synonyms; mixed-case model outputs will leak through and fail downstream status validation.

Suggested fix
-  const rawStatus = fields.get('status') ?? 'open'
+  const rawStatus = (fields.get('status') ?? 'open').trim().toLowerCase()
   const status = YAML_STATUS_MAP[rawStatus] ?? rawStatus
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
const rawStatus = fields.get('status') ?? 'open'
const status = YAML_STATUS_MAP[rawStatus] ?? rawStatus
const falsifier =
const rawStatus = (fields.get('status') ?? 'open').trim().toLowerCase()
const status = YAML_STATUS_MAP[rawStatus] ?? rawStatus
const falsifier =
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/artifacts/hypotheses.ts` around lines 96 - 98, The YAML status lookup is
currently case-sensitive (uses rawStatus directly), so update the mapping to be
case-insensitive by normalizing rawStatus before lookup: compute a normalized
value (e.g., const normalized = String(rawStatus).trim().toLowerCase()) and use
YAML_STATUS_MAP[normalized] ?? rawStatus (or ?? normalized if you want the
normalized form downstream) when setting status; update the code around
rawStatus, YAML_STATUS_MAP and status to use this normalized lookup so
mixed-case model outputs map correctly.

Comment on lines +102 to +104
const rawStatus = fields.get('status') ?? 'open'
const status = YAML_Q_STATUS_MAP[rawStatus] ?? rawStatus
const importance = fields.get('importance') ?? 'medium'
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Normalize YAML status before mapping.

Line 103 maps only exact lowercase values. Proposed/PENDING from model output will bypass mapping and fail strict validation unnecessarily.

Suggested fix
-  const rawStatus = fields.get('status') ?? 'open'
+  const rawStatus = (fields.get('status') ?? 'open').trim().toLowerCase()
   const status = YAML_Q_STATUS_MAP[rawStatus] ?? rawStatus
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
const rawStatus = fields.get('status') ?? 'open'
const status = YAML_Q_STATUS_MAP[rawStatus] ?? rawStatus
const importance = fields.get('importance') ?? 'medium'
const rawStatus = (fields.get('status') ?? 'open').trim().toLowerCase()
const status = YAML_Q_STATUS_MAP[rawStatus] ?? rawStatus
const importance = fields.get('importance') ?? 'medium'
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/artifacts/open-questions.ts` around lines 102 - 104, Normalize the raw
YAML status value before looking it up in YAML_Q_STATUS_MAP: when retrieving
rawStatus from fields.get('status') (used to compute status), trim whitespace
and convert to lowercase (and optionally collapse internal whitespace) before
using YAML_Q_STATUS_MAP[rawStatus]; keep the fallback to the original raw value
if mapping fails so strict validation still receives a sensible value. Ensure
this change touches the rawStatus -> status assignment (the variables rawStatus
and status) so values like "Proposed" or "PENDING" are correctly mapped.

Comment on lines +120 to +129
if (status === 'resolved') {
const resolved = fields.get('resolved') ?? ''
if (resolved.length > 0) {
out.push(`- Resolved: ${resolved}`)
} else {
const today = new Date().toISOString().slice(0, 10)
out.push(
`- Resolved: ${today} — (auto-synthesized — Scientist did not specify resolution; investigate)`,
)
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Avoid fabricating Resolved with runtime date.

Line 125 makes parsing time-dependent (new Date()), so the same input can produce different artifacts across days and missing resolution details are silently accepted instead of validated.

Suggested fix
   if (status === 'resolved') {
     const resolved = fields.get('resolved') ?? ''
     if (resolved.length > 0) {
       out.push(`- Resolved: ${resolved}`)
-    } else {
-      const today = new Date().toISOString().slice(0, 10)
-      out.push(
-        `- Resolved: ${today} — (auto-synthesized — Scientist did not specify resolution; investigate)`,
-      )
     }
   }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/artifacts/open-questions.ts` around lines 120 - 129, The code in
open-questions.ts fabricates a time-dependent "Resolved" value using new Date()
when fields.get('resolved') is empty, making artifacts non-deterministic; change
the logic in the block handling status === 'resolved' (the resolved variable and
the out.push call) to avoid inserting the current date — either omit the
Resolved line when resolved is empty or push a deterministic placeholder like "—
(unspecified resolution)" and emit/return a validation warning or error instead
of auto-synthesizing; remove the new Date() usage so artifacts remain
deterministic and update any related tests to expect the new deterministic
behavior.

@omerakben omerakben merged commit 71a7022 into main May 2, 2026
5 checks passed
@omerakben omerakben deleted the fix/issue-5-scientist-hypotheses branch May 2, 2026 23:56
omerakben added a commit that referenced this pull request May 3, 2026
…o orchestrator (F6)

Closes Codex M14 R1 finding #6. runReviewPanel now calls
assertPanelWithinBudget before any panelist invokes — a partial panel
has no valid quorum, so the policy is "refuse the whole round" rather
than "let the panelist loop hit a per-call refusal mid-round." Soft
warnings emit the existing M13 budget_warning event (no new vocabulary,
per anti-pattern lock).

RunReviewPanelOptions extends with:
- config: CodeOzConfig — caps lookup
- events: readonly LoggedEvent[] — cumulative spend (caller already
  reads the log for cross-family check)
- perPanelistTokensEstimate: number — manifest equality means each
  panelist sees the same files, so one estimate broadcasts across
  panel.length entries
- panelRole?: string — byRole budget routing, defaults to 'reviewer'

Refusal path returns a panel_budget_exceeded intervention, mirroring
the per-call ProviderError → intervention plumbing pattern from M13.
No staging artifacts are written, no panelist invoker is called, and
review_panel_started is never emitted.

Adds 2 orchestrator-level regression tests:
- aggregate over per-phase cap → intervention + 0 invoker calls + no
  review_panel_started event
- soft-warn band (≥0.75, <1.0) → budget_warning emitted BEFORE
  review_panel_started in the event log
omerakben added a commit that referenced this pull request May 9, 2026
…ia emitFired callback (M15 C13a)

Reshapes the fire-path executor contract so debate_scheduler_fired emits
BEFORE the executor invokes requestDebate. Fixes Codex R1 #6
(docs/research/CODEX_REVIEW_M15.md): the previous shape emitted `fired`
AFTER the executor returned, but a real executor would call requestDebate
inside its body (which synchronously emits debate_started + debate_resolved).
That landed the trace as
  evaluated -> debate_started -> debate_resolved -> fired -> postreview
contradicting docs/contracts/DEBATE_POLICY.md "Resume semantics" which
assumes `fired` precedes `debate_started`.

Contract change:

- New SchedulerFirePathHooks interface with `emitFired(selection)` callback.
- New SchedulerFirePathSelection interface for the (opposingProvider, debateTopic) pair.
- SchedulerFirePathExecutor now takes (input, hooks). Executor MUST call
  hooks.emitFired exactly once, after selecting opposingProvider + debateTopic
  and before invoking requestDebate. The hook synchronously appends
  debate_scheduler_fired inside emitFired, locking the trace order to
    evaluated -> fired -> debate_started -> debate_resolved -> postreview/error.

Failure paths:

- Pre-emit throw: executor errors before calling emitFired. No `fired`
  event lands. fireOutcome.fired === false. debate_scheduler_error reason=other
  records the underlying code. Pre-fire budget is preserved.
- Post-emit throw: executor calls emitFired then errors (e.g., requestDebate
  raised). `fired` is in events.jsonl; debate_scheduler_error follows.
  fireOutcome.fired === true and result carries the recorded selection so
  the rule-21 reducer can attribute the failure.
- Returns without emitFired: programming-error path. Surfaced as
  debate_scheduler_error reason=other code='executor_did_not_emit_fired'.
- emitFired called twice: contract violation. The second call throws,
  which routes through the post-emit error path.

Test surface (+4 tests, 2635 -> 2639 pass / 0 fail):

- tests/review-scheduler-fire.test.ts:
  * pre-emit throw (replaces old "thrown error coerced to scheduler_error";
    fired is now correctly false because the event was never written)
  * post-emit throw (new — verifies fired+error sequence with selection)
  * returns without emitFired (new — verifies programming-error path)
  * emitFired called twice (new — verifies once-only contract)
  * event ordering: emitFired returns with `fired` already in events.jsonl
- tests/review-scheduler-postreview.test.ts: mockExec updated to honor contract.
- tests/review-scheduler-fire.test.ts: makeMockExecutor updated to honor contract.

Refs: docs/design/SESSION_M15_PHASE2_KICKOFF.md (C13a row)
omerakben added a commit that referenced this pull request May 11, 2026
…(thread 019e188a)

Closes 4 of 5 actionable Codex retro findings on the 3-session + demo sweep
(docs/design/CODEX_BRIEFING_RETRO_3SESSION_SWEEP.md). Block-tag #2 (asciicast)
needs Ozzy decision and is surfaced separately. Block-next-comparison #4
(comparison/ vs comparisons/ index routing) deferred to next comparison-prep
session per Codex's own block-next-comparison severity.

block-tag #1 — production prose was wrong, not just demo prose

src/phases/verify-mutation.ts:186,191 hardcoded "reverted code passed/failed
the new tests" regardless of what the actual validation command was. The
prose was wrong for ANY cycle whose validation command isn't a test runner.
Strings now read "reverted code passed/failed the validation command (exit X
=== / !== expected Y)" — neutral for `bun test`, `pytest`, `cargo test`, and
demo-style file-existence checks alike. No existing test asserted the old
string, so the change is mechanical. Regenerated all 3 demo captures
(balanced + lite + beast) so the committed VERIFY.md proses match the fix.

block-tag #3 — status docs harmonization

- CLAUDE.md Status line: 3244 tests → 3299 tests; added the 2026-05-12 entries
  (B1a flag landing, opencode triage merge, demo prep) and the 2 opencode
  candidate slots reserved on the roadmap.
- README.md "22 template-comparison borrows" → "22-template comparison sweep
  produced 12 substantive borrows" (Codex flagged the conflation; the sweep
  audited 22 templates and shipped 12 borrows + 4 rule changes).

fix-soon #6 — cross-family REVIEW headline wording

docs/demo/01-todo-cli/README.md cross-family highlight now leads with the
honest framing: "The family labels come from the registry route — both
invocations execute through FakeProvider scripts in this demo, so the
cross-family check is real (the orchestrator's M14 Reviewer panel enforcement
actually compares the registered family IDs) but the underlying LLM
responses are scripted." Disclosure no longer lives only in the bottom
"What's real and what's simulated" table.

nit #7 — ARCHITECTURE.md drift

Architecture doc updated to match shipped runner: Validation: test -f
src/todo.ts (was `true`), lite multiplier 0.4 (was 0.5-ish), VERIFY note
explains the validation-command swap as a post-Codex fix-first move with
the mutation-gate-tautological failure mode that forced it.

block-next-comparison #5 — Session 3 memory entry

Adds feedback_stash_on_stale_base.md to autoMemory: when stashing work whose
merge-base is many milestones behind current main, name the merge-base SHA
in the stash message and the re-apply surfaces in the closing handoff.
Session 3's Q7 stash on 0dce4b0 hit Chorus §3.5 actor-attribution wrapping
drift on src/state/schemas.ts:1187 — exactly the bug class this memory
entry prevents future sessions from rediscovering.

Tests still 3299 / 0 / 2; demo cycle still completes cleanly at all three
effort levels. Pre-tag state preserved; block-tag #2 (asciicast) is the
remaining gate.

Codex retro trail:
- docs/design/CODEX_BRIEFING_RETRO_3SESSION_SWEEP.md
- docs/design/CODEX_RESPONSE_RETRO_3SESSION_SWEEP.md
omerakben added a commit that referenced this pull request May 12, 2026
Drafts the Homebrew formula source-of-truth in this repo and documents
the manual bump workflow that publishes it to the
omerakben/homebrew-code-oz tap repo at release time. Closes fix-soon #6
from the W3a synthesis.

docs/homebrew/code-oz.rb.template — per the synthesis minimum:

  - class CodeOz < Formula, license "MIT", no depends_on (binary is
    self-contained).
  - Per-arch on_macos { on_arm / on_intel } and on_linux { on_arm /
    on_intel } blocks. Each binds a tagged release URL + sha256.
  - bin.install "code-oz".
  - test do block runs `code-oz init` (stronger smoke than --version
    alone, per synthesis); asserts .code-oz/ directory was created in
    Homebrew's per-test temp path.

Placeholder tokens use __TOKEN__ form (visible to grep, replaceable by
sed) since {{...}} and ${...} would collide with Ruby string syntax:
__VERSION__, __SHA256_{DARWIN,LINUX}_{ARM64,X64}__. Substitution
happens at release time from the published checksums.txt.

docs/homebrew/README.md — tap-repo setup recipe:

  1. One-time `gh repo create omerakben/homebrew-code-oz` + Formula/
     dir bootstrap.
  2. Per-release bump: fetch checksums.txt → 4 awk lookups → sed
     substitution → render Formula/code-oz.rb → brew audit --strict
     --online → commit + push.
  3. End-to-end install verification script
     (brew untap → tap → install → init smoke → uninstall).

The "why a template" section explains the source-of-truth split:
template rides with the build pipeline; rendered formula lives in the
tap repo where Homebrew expects it.

tests/homebrew-formula.test.ts — 12 structural assertions on the
template: Ruby class shape, MIT license, no depends_on, per-arch
url/sha256 blocks, bin.install, test-block exercises `code-oz init`,
all expected placeholder tokens present, asset URLs pinned to the
tagged release path.

3341 → 3353 pass / 0 fail / 2 skip. Typecheck silent.

Locked deferral: actual tap-repo creation (gh repo create
omerakben/homebrew-code-oz) and the rendered formula push are part of
the C7 tag commit's release flow per the synthesis, gated on Codex R2
push verdict.
omerakben added a commit that referenced this pull request May 14, 2026
The GUI's spawn primitive previously only rejected runIdDeferred when
exitCode !== 0. When code-oz exited cleanly (exit 0) before emitting a
runId — the v0.20.2 dogfood failure mode — the deferred hung until
pollActiveRunId timed out 30s later with the misleading message
"active.json never updated."

Three changes:

1. createRunIdDeferred now returns an idempotent, settling-aware deferred
   so the three concurrent settlement paths (parseRunIdFromStdout,
   pollActiveRunId, subprocess.exited) cannot double-settle.

2. subprocess.exited.then now skips on settled, awaits both pumps via
   Promise.allSettled to flush buffered output, and composes an error
   from stderr (fallback stdout, fallback "no output captured") with the
   real exit code. Works for exit 0 and any non-zero exit.

3. spawnCodeOzRun gained an optional deps parameter (spawn, resolveBinary,
   supportsRunIdFlag) for DI. Production behavior unchanged.

Three RED tests in tests/unit/code-oz-spawn.test.ts:
- exit 0 with no output rejects fast (<2s) with "exit 0"
- exit 0 with stderr surfaces the stderr content in the error message
- exit 137 with stderr surfaces the exit code + stderr content

RED-first discipline verified by stashing the fix and re-running: all
three tests fail with the 2s budget assertion, proving they catch the
buggy behavior. After unstash: 8 pass / 0 fail in ~110ms. Typecheck and
lint clean.

Codex review thread 019e280c on the implementation: all 8 substantive
risks ok, one fix-first finding ("commit the changes") closed by this
commit. Verdict effectively push.

References Codex design verdict thread 019e27ba (Option 1 locked).
Handoff: docs/handoffs/2026-05-14-v0.20.2-bug-free-handoff.md
omerakben added a commit that referenced this pull request May 14, 2026
Closes task #6 of the v0.20.2 work sequence: pre-implementation Codex
debate for showstopper #0a (BUILD prompt has no TASK_BLOCK injection).

V0_20_2_SHOWSTOPPER_0A_BRIEFING.md is the structured debate brief —
goal, bug + dogfood proof, constraints (rules 1/7/15/19/22/23),
acceptance criteria, the handoff's recommended fix shape, and eight
debate prompts.

V0_20_2_SHOWSTOPPER_0A_CODEX_RESPONSE.md is Codex's structured response
to all eight prompts, model gpt-5.5 xhigh in read-only sandbox, thread
019e281e-d4db-7700-869c-cf3f05b0ea31. Synthesis ends with a
nine-line "Locked decisions" block.

Three notable Codex moves:

1. Pushback on Prompt 4: production ProviderRequest is a single prompt
   blob today, not split into system/user roles. The brief's framing
   was wrong. TASK_BLOCK goes in the composed prompt; role-split is
   #0b's debate to have.

2. Content surface narrowed (Prompt 1): hypotheses + sources excluded
   from BUILD prompt — they are PLAN/Scientist/REVIEW evidence, not
   builder consumer needs. Optional bugfix.existingTest stays because
   it prevents a bugfix BUILD from editing the reproduction test.

3. TRUST.md mitigation (Prompt 7): PLAN task fields (Validation, Risk)
   become BUILD provider prompt content under this fix. PLAN authors
   must use env var names, not literal secrets. TRUST.md addendum
   bundles with the implementation PR.

Critical dependency note from Prompt 8: #0a alone unblocks scaffold-only
tasks (creating a new file from intent, no read needed). For modification
tasks the builder still requires #0b's tool-use loop. The prdiff dogfood
T-001 ("Scaffold src/version.ts") is scaffold-only and should pass with
#0a alone; T-002+ depend on #0b.

No code changes. Doc-only. Implementation follows in a separate PR.
omerakben added a commit that referenced this pull request May 14, 2026
Captures the end-to-end real-provider validation of the v0.20.2 fix.
Fresh prdiff run 01KRM4EBQRX0GK1RT85RP1EF6D (the original
01KRKW0D94C3F80002CSAR29NT exhausted its 96-min lite envelope before
the fix landed) drove DEFINE -> PLAN -> BUILD -> VERIFY -> REVIEW with
real Opus + cross-family GPT-5.5.

What passed:
- BUILD T-001 attempt 1 emitted <build-ready/> + 1659-byte valid
  unified-diff patch (vs 3 failures pre-fix). #0a TASK_BLOCK validated.
- VERIFY ran `bun run src/prdiff.ts --version` and captured exit 0 +
  6 bytes stdout in 8ms.
- REVIEW round 1 with #0b file-manifest wiring returned needs-revision
  score=5 with 2 fix-first findings on the ACTUAL CODE (vs round-1
  block score=1 "worktree cwd empty" pre-REVIEW-wiring). Cross-family
  reviewer is doing its job: catching what the builder + verifier
  missed.

What the dogfood exposed (out of v0.20.2 scope):
- BUILD attempt 2 failed because the orchestrator does NOT reset the
  worktree to base between attempts. Attempt 2's all-added patch can't
  apply on top of attempt 1's already-created files. This is a runtime
  restart-on-fail bug in src/phases/build.ts, separate from #0a/#0b.
  Filed as v0.20.3 finding.

What v0.20.2 actually ships (vs the original handoff plan):
- #6, #2, #4, #0a, #0b BUILD: as planned.
- #0b REVIEW wiring: scope-expanded mid-session because the dogfood
  proved it was needed. Codex's locked decision #7 from thread
  019e2827 explicitly named REVIEW as a symmetric beneficiary;
  scope-deferring it was incorrect.

The doc is the BUG-free verdict the motto requires. Without this
dogfood, the REVIEW gap would have shipped silently.

This commit is doc-only. No code changes.
omerakben added a commit that referenced this pull request May 14, 2026
)

The GUI composer previously had no way to choose the effort envelope
for spawned runs, so every real-provider run defaulted to the CLI's
default. The v0.20.2 prdiff dogfood (handoff finding #5) needed
--effort lite for budgeted cost control; users had to drop to the
terminal to set it.

Implementation threads `EffortLevel` ('lite' | 'normal' | 'high')
through four layers:

1. lib/code-oz-spawn.ts: new exported EFFORT_LEVELS constant + EffortLevel
   type. SpawnInput gains optional `effortOverride`. commandForRun emits
   `--effort <value>` argv when the field is set, omits otherwise. Test
   pattern uses the existing spawn-factory DI to capture argv.

2. components/Composer.tsx: new `effort` + `onEffortChange` props.
   Inline <select> styled to match the composer aesthetic, between the
   textarea and the Compose button. Disabled when no repo is open or
   when a submission is in flight. aria-label="Effort envelope" for a11y.

3. app/page.tsx: new `composerEffort` state (default 'lite'),
   serialized into the /api/run/start body.

4. app/api/run/start/route.ts: body parser validates `effortOverride`
   against EFFORT_LEVELS (rejects unknown values at the API boundary);
   spreads the field into the spawn call when present.

Two new unit tests (tests/unit/code-oz-spawn.test.ts):
- argv contains `--effort high` when effortOverride is 'high'
- argv omits `--effort` entirely when effortOverride is undefined

Tests use the existing spawn-factory DI seam (added in #6) to capture
the command array passed to the subprocess factory. Bun test count
in code-oz-gui/: 18 pass / 0 fail (+2 from new tests).

Typecheck clean.
omerakben added a commit that referenced this pull request May 14, 2026
Trio-bump per m5-fix-first.test.ts discipline ("bump these three
together in the same chore"):

- package.json
- src/cli.ts PKG_VERSION
- src/config/schema.ts DEFAULT_CONFIG.version

Plus the three test literals that assert the trio matches:

- tests/m5-fix-first.test.ts CURRENT
- tests/smoke-test.test.ts VERSION
- tests/cli-init.test.ts config.version

This is the version the v0.20.2 tag will carry. The dogfood-smoke
script (scripts/release/dogfood-smoke.sh) gates the tag on
real-provider lifecycle success.

What lands in v0.20.2-alpha.0:
- #6 GUI exit-0 silent swallow → fixed via settled-flag + Promise.allSettled
- #2 npm @Tuel scope-routing trap documented in TRUST.md
- #4 GUI spawn timeout config (60s default, CODE_OZ_GUI_SPAWN_TIMEOUT_MS env)
- #0a BUILD prompt TASK_BLOCK injection (real-provider validated)
- #0b BUILD + REVIEW file-manifest expansion via provider-file-refs helper
- #5 GUI Composer --effort selector (lite/normal/high)
- #3 FakeProvider honesty banner above Composer
- dogfood-smoke pre-tag release gate script

Test suite: 3432 pass / 0 fail / 2 skip (live xAI). Typecheck clean.
omerakben added a commit that referenced this pull request May 14, 2026
…findings

End-to-end GUI + CLI validation on a brand-new project (quizr trivia CLI,
~$0.55 real-provider spend). All five v0.20.2 fixes validated under real
first-friend interaction:

- #3 FakeProvider banner renders + gates correctly on provider switch
- #5 effort selector emits canonical values; --effort lite reaches CLI
  (effort_envelope_applied lite event)
- #6 no silent swallow; spawn fires + GUI updates immediately
- #4 spawn completes well within 60s ceiling
- #0a TASK_BLOCK reaches builder — builder's CoT cites PLAN's verbatim
  risk text ("under-asserting here makes downstream tests trust bad
  data") and produces <build-ready/> + valid unified-diff patch for the
  scaffold task on first attempt
- #0b file-manifest expansion correctly returns empty for scaffold-only
  task (all change: 'added' paths don't exist yet)

Five new findings surfaced (all v0.20.3 polish, none block v0.20.2):

1. Brownfield detector treats INTENT.md as user content -> AUDIT phase
   fires with no AUDIT runtime -> stuck state. Workaround: move
   INTENT.md aside before init.

2. BUILD requires git HEAD; empty repo (just git init) breaks with raw
   `fatal: ambiguous argument 'HEAD'` instead of a friendly "your repo
   has no commits; run git commit --allow-empty -m init first."

3. build_report_notes_too_long intervention lacks actionable payload
   (which bullet, character count, hint to re-run for carry-forward).

4. GUI doesn't persist workspace/provider/effort across page reloads.
   Refresh -> back to fixture + DEMO MODE.

5. GUI's run-registry survives backend-state deletion (rm -rf
   .code-oz/state) -> ghost "IN PROGRESS" run in sidebar.

Plus the v0.20.3 worktree-reset-between-attempts finding from prdiff
dogfood: 6 polish items queued for v0.20.3.

Verdict: ready to share v0.20.2 with technically-comfortable friends
today (with the "git commit --allow-empty first" recipe caveat).
Ready for any-friend smooth experience after v0.20.3.
omerakben added a commit that referenced this pull request May 14, 2026
Three-workstream handoff for the session after v0.20.2-alpha.0 ships:

A. Finish v0.20.2 distribution (~30-60 min total)
   A1. Rich release notes (replace auto-generated thin one-liner)
   A2. npm publish @tuel/code-oz@0.20.2-alpha.0
   A3. Homebrew tap update

B. v0.20.3 — 6 polish findings (~3-4 h for all six)
   Ordered: #3 empty-repo intervention -> #2 INTENT.md greenfield seed
   -> #1 worktree-reset (Codex debate first) -> #4 Notes payload ->
   #5+#6 GUI persistence (single branch).
   Each follows v0.20.2's per-finding cadence: RED-first per rule 22 ->
   minimal fix -> Codex implementation review -> merge.

C. How to work Claude + Codex together (proven pattern from v0.20.2)
   Codex debate for non-trivial code; 2-3 review rounds typical;
   per-commit review when touching shared infra; deferral of locked
   decisions is risky (dogfood caught the REVIEW wiring miss).

Session-opening checklist, memory pins to read, and an explicit
"what NOT to do" section.

Author note: this handoff is the bridge between the v0.20.2 ship
(commit fb73128 + tag v0.20.2-alpha.0) and the v0.20.3 work. Memory
pin v0_20_2_shipped_local_2026-05-14.md has the persistent state;
this doc has the next-session action plan.
omerakben added a commit that referenced this pull request May 15, 2026
Closes greenfield-friend dogfood findings #5 (localStorage
persistence) and #6 (run-registry disk validation). Bonus typecheck
fix on tests/build-phase.test.ts for the #1 worktree-reset event
type guard.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

BUG: Scientist HYPOTHESES output uses YAML format, parser expects Markdown H2 blocks

2 participants