
docs(comparisons/gptme): gptme template comparison + M17-M20 candidate slots (5 Codex rounds) #22

Merged: omerakben merged 6 commits into main from feat/gptme-borrow-set-ratified on May 11, 2026

@omerakben
Owner

Doc-only PR adding gptme template comparison (5 Codex rounds R1-R5 to convergence) and reserving M17-M20 candidate slots for gptme-derived borrows (AGENT_FILES intake, context-projection probe, worktree topology refusal, eval harness).

Merge-pre-staged: combined gptme's M17-M20 ROADMAP slots with main's RULE21_BENCHMARK reference.

Test plan

  • Doc-only changes; no runtime test impact
  • ROADMAP merge preserves both gptme candidate slots and the RULE21_BENCHMARK link

Part of the AFK release-readiness loop — bundled with 10 already-merged comparison PRs landing on main.

omerakben added 6 commits May 10, 2026 14:31
… synthesis

Lands the four comparison-record files for the gptme template review
under docs/comparisons/gptme/:

- COMPARISON.md: structural review of gptme v0.31.x (chat-loop +
  autocompact + checkpoint + AGENT_FILES + hooks + subagent + plugins)
  vs code-oz (phase-FSM + cross-family REVIEW + debate runtime +
  budgets.global + scientist tails + Rule 20/21 authority discipline);
  final decision matrix after Codex R1 fix-first.

- CODEX_BRIEFING.md: planning-convergence briefing for Codex (gpt-5.5
  xhigh, sandbox: read-only) — recommended verdict + locked answers +
  five challenge prompts.

- CODEX_RESPONSE.md: Codex R1 verdict (thread
  019e12ed-4038-7fe2-8800-5520e5f2048a). Verdict: fix-first.
  Narrowed B1 (compaction) + B3 (AGENT_FILES); demoted B2
  (checkpoint) to defer; added new D3 (release-quality eval harness)
  the briefer missed. Net: 2 narrowed-borrow / 4 defer / 5 reject.

- SYNTHESIS.md (NEW): single source-of-truth post-debate synthesis.
  Records the round-2 thread 019e1319-2169-7ab0-8ca7-036d6252fe60
  ratifying Option A (RATIFY-ONLY — land comparison record + roadmap
  slot reservations only, no implementation). Final aligned decision
  matrix. Why borrows are not implemented in this PR (Rule 20 — one
  authority per milestone; src/phases/audit.ts does not exist; B1
  needs a contract before extending the existing tokensEstimate at
  src/providers/manifest.ts:111 neighborhood). Forward-looking M17
  (B3) / M18 (B1) / M19+ (B2) / M20+ (D3) candidate slots with
  measurement plans. Codex alignment statement quoted verbatim.

No source or test code touched. No telemetry schema changes. Only the
comparison record lands here; ROADMAP.md slot reservations land in a
separate commit per CLAUDE.md cross-model peer review rule (each
milestone gets its own authority boundary).

…ved borrows

Adds a "Template-comparison-derived deferred milestones" subsection
under the existing milestones list, between the M16+ deferred line
and the Provider-expansion track. Reserves four slots, each as a
single-authority milestone per CLAUDE.md rule 20 with a Rule-21
measurement plan:

- M17 candidate — feat(intake): cross-tool AGENT_FILES discovery +
  AUDIT/DEFINE opt-in (gptme borrow B3-narrowed). Discovery only;
  per-file user opt-in; no parent/home walk. Telemetry events:
  agent_files_discovered, agent_files_accepted, agent_files_rejected,
  agent_instruction_conflicts. Hard precondition: not before
  src/phases/audit.ts exists.

- M18 candidate — feat(provider): deterministic context-projection +
  compaction-opportunity probe (gptme borrow B1-narrowed). Telemetry-
  only; no LLM resume summarization; no view-branch swap; no
  automatic provider-context mutation. Extends ProviderContextMetrics
  (src/providers/manifest.ts:111 neighborhood) with
  context_projection_tokens, compaction_opportunity_savings_ratio,
  compaction_skipped_savings_ratio. Discipline rule "no phase
  artifact may exceed N tokens at gate write" lands first as a
  separate gate-preflight check. Rule-21 floor: > 0.10 ratio
  observable across runs.

- M19+ candidate — feat(diagnostic): worktree topology refusal modes
  (gptme borrow B2-deferred). Diagnostic-only kind classification
  (clean_run_worktree | dirty_run_worktree | no_worktree |
  multi_root_repo). NO destructive restore (gptme's git reset --hard
  is incompatible with code-oz's user-change preservation). On
  audit-completeness failure: classify and refuse rather than reset.

- M20+ candidate — feat(eval): release/run-quality eval harness
  (gptme borrow D3, Codex-flagged miss). Separate run-quality
  evaluation surface, not the unit/integration test suite. Inspired
  by gptme/docs/evals.rst (model leaderboards, CSV/JSON, Docker,
  SWE-bench). Trigger: a release-cadence quality regression slips
  through unit tests after v0.2 stabilizes.

Trail: docs/comparisons/gptme/SYNTHESIS.md.
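
To make the M17 candidate's constraints concrete (discovery only, per-file opt-in, no parent/home walk), here is a minimal TypeScript sketch. The candidate file names are the ones listed in the ROADMAP entry for this slot. The function names `discoverAgentFiles`, `applyOptIn`, and `acceptanceRate`, and the choice to pass in a pre-listed set of workspace-relative paths, are assumptions made for illustration; none of them is an existing code-oz API.

```typescript
// Candidate names as listed in the M17 slot text (attributed there to gptme/prompts/__init__.py).
const AGENT_FILE_CANDIDATES = [
  "AGENTS.md",
  "CLAUDE.md",
  "COPILOT.md",
  "GEMINI.md",
  ".cursorrules",
  ".windsurfrules",
  ".github/copilot-instructions.md",
] as const;

// Hypothetical input shape: paths relative to the workspace root, POSIX separators.
function discoverAgentFiles(workspaceRelativePaths: readonly string[]): string[] {
  const present = new Set(workspaceRelativePaths);
  // Only exact workspace-relative matches count: no parent walk, no home-directory walk.
  return AGENT_FILE_CANDIDATES.filter((name) => present.has(name));
}

// Hypothetical per-file opt-in: a file is accepted only if the user callback says so.
function applyOptIn(
  discovered: readonly string[],
  userAccepts: (file: string) => boolean,
): { accepted: string[]; rejected: string[] } {
  const accepted: string[] = [];
  const rejected: string[] = [];
  for (const file of discovered) {
    (userAccepts(file) ? accepted : rejected).push(file);
  }
  return { accepted, rejected };
}

// The measurement named in the slot: agent_files_accepted / agent_files_discovered.
// Returns null when nothing was discovered, so callers cannot mistake "no data" for 0.
function acceptanceRate(acceptedCount: number, discoveredCount: number): number | null {
  return discoveredCount === 0 ? null : acceptedCount / discoveredCount;
}
```

In this sketch a path such as `../AGENTS.md` is ignored because it is not an exact workspace-relative candidate name, which is one simple way to express the "no parent/home walk" boundary.
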
Three findings closed (1 block-push + 1 fix-soon + 1 nit, all from
Codex review thread 019e1323-413a-7612-a767-12fca418fad7):

1. block-push: branch rebased onto current main (e18d127) so the diff
   no longer appears to delete docs/comparison/06-codex/* (those files
   landed on main from a parallel session while this branch was being
   prepared).
2. fix-soon: COMPARISON.md TL;DR + Borrow set heading normalized to the
   post-debate count "2 narrow-borrow candidates / 4 defer / 5 reject",
   and the per-item "Borrow now" write-ups marked SUPERSEDED with a
   pointer to SYNTHESIS.md as the canonical post-debate record.
3. nit: AI_SOFTWARE_COMPANY_THESIS.md citation in CODEX_RESPONSE.md
   B2 line widened from :196 to :196-197 to cover the
   builders-mutate-isolated-worktrees principle accurately.

Two follow-ups from convergence review (thread 019e132a):

1. fix-soon: R3-2 was not fully closed. Several lines still carried
   "3 deferred" or "Borrow now (narrowed)" or the inline self-correction
   note. Normalized to "2 narrow-borrow candidates / 4 defer / 5 reject"
   throughout COMPARISON.md final borrow set + outcome counts and
   CODEX_RESPONSE.md final-borrow line, with an explicit pointer back to
   SYNTHESIS.md as the canonical post-debate record.
2. nit: SYNTHESIS.md said "four files" but listed five. Corrected to
   "five files" and updated the closure language to reflect the actual
   review trail (R1 fix-first -> R2 scope lock -> R3 fix-first ->
   R4 convergence) instead of the stale "no round 3 required" wording.

After this commit, R3-1, R3-2, R3-3 are all closed and no R4 finding
remains open.

Two remaining mismatches from R5 strict re-scan (thread 019e132f):

1. fix-first: CODEX_RESPONSE.md:69 and :71 still labeled B1/B3 as
   "Borrow now (narrowed)" in the post-fix "Fixed" column of the
   classification-changes table. R2 scope-lock demoted both to
   "narrow-borrow candidate, deferred to own milestone" with M17/M18
   slots in ROADMAP. Updated the table cells to match.
2. fix-first: SYNTHESIS.md:81 closure language said R4 "closed two
   follow-up nits and declared push-clean" but R4 was fix-first. Rewrote
   the trail to accurately list R1 fix-first -> R2 scope lock ->
   R3 fix-first -> R4 fix-first -> R5 fix-first -> post-R5 convergence
   re-check.

Copilot AI review requested due to automatic review settings May 11, 2026 02:48
@coderabbitai

coderabbitai Bot commented May 11, 2026

Warning

Rate limit exceeded

@omerakben has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 55 minutes and 6 seconds before requesting another review.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 44249f25-3531-4e79-8e92-b9805f2f833a

📥 Commits

Reviewing files that changed from the base of the PR and between 6724165 and 314f3a3.

📒 Files selected for processing (5)
  • docs/comparisons/gptme/CODEX_BRIEFING.md
  • docs/comparisons/gptme/CODEX_RESPONSE.md
  • docs/comparisons/gptme/COMPARISON.md
  • docs/comparisons/gptme/SYNTHESIS.md
  • docs/design/ROADMAP.md

@omerakben omerakben merged commit 4ac6f4c into main May 11, 2026
2 checks passed

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 314f3a3149

Comment thread docs/design/ROADMAP.md
- **M17 — Reviewer Memory v1.** Kickoff source for the memory-hygiene rubric: `docs/contracts/REVIEWER_MEMORY.md`.
- **M16+ (deferred until measurable need):** Researcher phase-tail (when Lead-persona source verification overflows), parallel builder candidates (security-wedge trigger), multi-opponent debate (when single-opponent proves insufficient on real disagreement cases), Skills layer architecture (when M9/M10 produce duplication pain).
- **Template-comparison-derived deferred milestones (slots reserved 2026-05-10):**
- **M17 candidate — `feat(intake): cross-tool AGENT_FILES discovery + AUDIT/DEFINE opt-in (gptme borrow B3-narrowed).`** Authority boundary: cross-tool agent-instruction-file intake at AUDIT and DEFINE phase entry. Discovery only (file list per `gptme/prompts/__init__.py`: `AGENTS.md`, `CLAUDE.md`, `COPILOT.md`, `GEMINI.md`, `.cursorrules`, `.windsurfrules`, `.github/copilot-instructions.md`). NO parent/home walk. NO automatic prompt injection. Files become available to the user as a confirm UI in AUDIT/DEFINE intake; user accepts or rejects per file. Telemetry events: `agent_files_discovered`, `agent_files_accepted`, `agent_files_rejected`, `agent_instruction_conflicts`. Rule-21 measurement: `agent_files_accepted / agent_files_discovered` rate; intake-question-count delta vs baseline (no AGENT_FILES) on a brownfield corpus. Trail: `docs/comparisons/gptme/SYNTHESIS.md`. Trigger: lands when brownfield AUDIT runtime ships (W4) OR when greenfield DEFINE intake earns the authority. NOT before `src/phases/audit.ts` exists.

P2: Renumber the new candidate milestone to avoid M17 collision

This adds a second M17 entry (M17 candidate) while M17 — Reviewer Memory v1 already exists just above, which makes milestone references ambiguous for planning, reviews, and any tooling or docs that key off milestone IDs. Use a non-conflicting label (for example M18 candidate or a TC-* namespace) so each roadmap slot remains uniquely addressable.
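
The collision described above is the kind of thing a small check over the roadmap text could catch. The following is a minimal sketch, assuming that every roadmap slot is a bullet whose bold label starts with `M<number>`; the regex and the function name `findMilestoneCollisions` are hypothetical and not part of the repository.

```typescript
// Assumed convention: a roadmap slot is a bullet whose bold label starts with "M<number>".
const MILESTONE_ID = /^\s*-\s+\*\*(M\d+)/;

// Returns each milestone ID that labels more than one line, with its 1-based line numbers.
function findMilestoneCollisions(roadmapLines: readonly string[]): Map<string, number[]> {
  const linesById = new Map<string, number[]>();
  roadmapLines.forEach((line, index) => {
    const id = MILESTONE_ID.exec(line)?.[1];
    if (id === undefined) return; // not a milestone bullet
    const lines = linesById.get(id) ?? [];
    lines.push(index + 1);
    linesById.set(id, lines);
  });

  // Keep only the IDs that appear on more than one line.
  const collisions = new Map<string, number[]>();
  linesById.forEach((lines, id) => {
    if (lines.length > 1) collisions.set(id, lines);
  });
  return collisions;
}
```

Run against the four bullets quoted in this thread, the sketch would report `M17` as appearing twice, while `M16+` and the "Template-comparison-derived" heading bullet would not be flagged.
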

Comment thread docs/design/ROADMAP.md

P2: Resolve contradictory trigger and precondition for AGENT_FILES slot

The trigger says this can land when either brownfield AUDIT ships or greenfield DEFINE earns authority, but the same line then requires NOT before src/phases/audit.ts exists, which blocks the DEFINE-only path you just listed. This contradiction makes the activation criteria non-actionable; split the conditions so the DEFINE path can proceed independently or remove the OR branch.
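
The contradiction is easier to see when the activation criteria are written as a predicate. The sketch below is one reading of the slot text and of the reviewer's "split the conditions" suggestion; the `SlotState` shape and both function names are hypothetical, and the split shown is only one possible resolution, not a decision recorded in this PR.

```typescript
// Hypothetical state for the AGENT_FILES slot; field names are illustrative only.
interface SlotState {
  auditRuntimeShipped: boolean;      // "brownfield AUDIT runtime ships (W4)"
  defineIntakeHasAuthority: boolean; // "greenfield DEFINE intake earns the authority"
  auditPhaseFileExists: boolean;     // src/phases/audit.ts exists
}

// Criteria as currently written: (AUDIT ships OR DEFINE earns authority) AND audit.ts exists.
function canLandAsWritten(s: SlotState): boolean {
  const trigger = s.auditRuntimeShipped || s.defineIntakeHasAuthority;
  return trigger && s.auditPhaseFileExists;
}

// One possible split: the audit.ts precondition applies to the AUDIT path only.
function canLandSplit(s: SlotState): boolean {
  const auditPath = s.auditRuntimeShipped && s.auditPhaseFileExists;
  const definePath = s.defineIntakeHasAuthority;
  return auditPath || definePath;
}
```

With DEFINE authority earned but no `src/phases/audit.ts`, `canLandAsWritten` returns `false` and `canLandSplit` returns `true`, which is the blocked DEFINE-only path the comment points out.
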

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request documents a structural comparison between gptme and code-oz, including a detailed synthesis of the findings and a revised roadmap for future milestones. The reviewer identified several numbering conflicts where new candidate milestones (M17-M18) overlapped with existing entries in the roadmap. The feedback provides actionable corrections to renumber these slots to M18-M21 across multiple documentation files to ensure a unique and consistent sequence.

```
User → chat() loop
├── prompts/ (system prompt + AGENTS.md/CLAUDE.md/GEMINI.md ingestion)
├── tools/ (shell, ipython, patch, browser, vision, computer, subagent, rag, gh, tmux, todo, …)
```

medium

This summary is inconsistent with the final verdict reached after the Codex debate. The final count is 2 narrow-borrow candidates, 4 deferred, and 5 rejected, as correctly stated on line 16 and in the final verdict section.

Suggested change
├── tools/ (shell, ipython, patch, browser, vision, computer, subagent, rag, gh, tmux, todo, …)
Borrow set: **2 narrow-borrow candidates, 4 deferred, 5 rejected.**

Comment on lines +297 to +302
| Narrow borrow candidate, deferred to own milestone | **B1** | deterministic context-size + compaction-opportunity probe only; no LLM summarization, no view-branch swap |
| Narrow borrow candidate, deferred to own milestone | **B3** | AGENT_FILES discovery + explicit AUDIT/DEFINE opt-in; no home/parent walk |
| Defer | **B2** (renamed) | worktree topology/refusal diagnostics; revisit when audit-completeness recovery measurably fails |
| Defer | **D1** | generalized hook lifecycle |
| Defer | **D2** | subagent executor/planner/batch |
| Defer | **D3 (new)** | release-quality eval harness inspired by gptme evals |

medium

Renumbering these slots to resolve the conflict with the existing M17 milestone in the roadmap.

Suggested change
| Narrow borrow candidate, deferred to own milestone | **B1** | deterministic context-size + compaction-opportunity probe only; no LLM summarization, no view-branch swap | M19 candidate |
| Narrow borrow candidate, deferred to own milestone | **B3** | AGENT_FILES discovery + explicit AUDIT/DEFINE opt-in; no home/parent walk | M18 candidate |
| Defer | **B2** (renamed) | worktree topology/refusal diagnostics; revisit when audit-completeness recovery measurably fails | M20+ candidate |
| Defer | **D1** | generalized hook lifecycle | post-v0.2 |
| Defer | **D2** | subagent executor/planner/batch | post-v0.2 |
| Defer | **D3 (new)** | release-quality eval harness inspired by gptme evals | M21+ candidate |

Comment on lines +29 to +34
| **B1 — Compaction-opportunity probe** | borrow-deferred-to-own-milestone | Deterministic context projection + compaction-opportunity probe is useful telemetry, but it requires a separate milestone authority because gptme's full engine performs LLM resume summarization and view-branch swaps that violate code-oz's "files in `ProviderRequest.files` are explicit, never silently mutated" discipline. | M18 candidate | Extend the existing `tokensEstimate` field on `ProviderContextMetrics` (`src/providers/manifest.ts:111`) with `context_projection_tokens`, `compaction_opportunity_savings_ratio`, `compaction_skipped_savings_ratio`. Rule-21 ship gate: observed `compaction_opportunity_savings_ratio` distribution > 0.10 across runs before any compaction-action authority is added. |
| **B3 — AGENT_FILES discovery + AUDIT/DEFINE opt-in** | borrow-deferred-to-own-milestone | Cross-tool agent-instruction-file discovery is worth doing, but it requires its own milestone authority because the AUDIT runtime does not yet exist (`src/phases/audit.ts` is absent) and the trust-boundary discipline (no parent/home walk, explicit per-file opt-in) is itself a contract surface. | M17 candidate | Telemetry events `agent_files_discovered`, `agent_files_accepted`, `agent_files_rejected`, `agent_instruction_conflicts`. Rule-21 ship gate: `agent_files_accepted / agent_files_discovered` rate observable; intake-question-count delta vs. baseline (no AGENT_FILES) on a brownfield corpus. |
| **B2 — Worktree topology refusal diagnostics** | defer | gptme's restore primitive (`git reset --hard` + optional `git clean -fd`) is incompatible with code-oz's per-run isolated worktrees and user-change preservation discipline; the worktree IS the checkpoint. Only the topology-classification idea has lift-value. | M19+ candidate | Rule-21 ship gate: count of resumes where audit-completeness recovery would have benefited from `kind`-classification refusal vs. count where current recovery is sufficient. |
| **D1 — Generalized hook lifecycle (16+ types)** | defer | Rule 20 — extension authority. gptme's hook surface is wider than the briefing claimed (transforms, confirmations, elicitation, cwd, cache-invalidation per `gptme/hooks/types.py:61,68,100,103`), and code-oz has exactly one production hook today (`review-scheduler-hook.ts` from M15). Revisit when ≥3 features want to subscribe to the same lifecycle event. | post-v0.2 | (n/a — defer) |
| **D2 — Subagent batch + planner pattern** | defer | Rule 21 — parallel-agent execution surface. gptme's subagent API includes executor/planner modes, parallel/sequential subtasks, subprocess mode, ACP mode, profiles, model routing, and optional isolated worktrees (`gptme/tools/subagent/api.py:32,80,95`); pinned to measurable need before adoption. | post-v0.2 | (n/a — defer) |
| **D3 (new) — Release/run-quality eval harness** | defer | Codex-flagged gap the briefer missed: gptme's `docs/evals.rst` has model leaderboards, CSV/JSON export, Docker guidance, and SWE-bench compatibility; code-oz's offline tests validate orchestration but not live run quality across model/release combos. | M20+ candidate | Rule-21 ship gate: a release-cadence quality regression slips through unit tests, motivating a separate run-quality evaluation surface. |

medium

Renumbering these slots to resolve the conflict with the existing M17 milestone in the roadmap.

Suggested change
| **B1 — Compaction-opportunity probe** | borrow-deferred-to-own-milestone | Deterministic context projection + compaction-opportunity probe is useful telemetry, but it requires a separate milestone authority because gptme's full engine performs LLM resume summarization and view-branch swaps that violate code-oz's "files in `ProviderRequest.files` are explicit, never silently mutated" discipline. | M19 candidate | Extend the existing `tokensEstimate` field on `ProviderContextMetrics` (`src/providers/manifest.ts:111`) with `context_projection_tokens`, `compaction_opportunity_savings_ratio`, `compaction_skipped_savings_ratio`. Rule-21 ship gate: observed `compaction_opportunity_savings_ratio` distribution > 0.10 across runs before any compaction-action authority is added. |
| **B3 — AGENT_FILES discovery + AUDIT/DEFINE opt-in** | borrow-deferred-to-own-milestone | Cross-tool agent-instruction-file discovery is worth doing, but it requires its own milestone authority because the AUDIT runtime does not yet exist (`src/phases/audit.ts` is absent) and the trust-boundary discipline (no parent/home walk, explicit per-file opt-in) is itself a contract surface. | M18 candidate | Telemetry events `agent_files_discovered`, `agent_files_accepted`, `agent_files_rejected`, `agent_instruction_conflicts`. Rule-21 ship gate: `agent_files_accepted / agent_files_discovered` rate observable; intake-question-count delta vs. baseline (no AGENT_FILES) on a brownfield corpus. |
| **B2 — Worktree topology refusal diagnostics** | defer | gptme's restore primitive (`git reset --hard` + optional `git clean -fd`) is incompatible with code-oz's per-run isolated worktrees and user-change preservation discipline; the worktree IS the checkpoint. Only the topology-classification idea has lift-value. | M20+ candidate | Rule-21 ship gate: count of resumes where audit-completeness recovery would have benefited from `kind`-classification refusal vs. count where current recovery is sufficient. |
| **D1 — Generalized hook lifecycle (16+ types)** | defer | Rule 20 — extension authority. gptme's hook surface is wider than the briefing claimed (transforms, confirmations, elicitation, cwd, cache-invalidation per `gptme/hooks/types.py:61,68,100,103`), and code-oz has exactly one production hook today (`review-scheduler-hook.ts` from M15). Revisit when ≥3 features want to subscribe to the same lifecycle event. | post-v0.2 | (n/a — defer) |
| **D2 — Subagent batch + planner pattern** | defer | Rule 21 — parallel-agent execution surface. gptme's subagent API includes executor/planner modes, parallel/sequential subtasks, subprocess mode, ACP mode, profiles, model routing, and optional isolated worktrees (`gptme/tools/subagent/api.py:32,80,95`); pinned to measurable need before adoption. | post-v0.2 | (n/a — defer) |
| **D3 (new) — Release/run-quality eval harness** | defer | Codex-flagged gap the briefer missed: gptme's `docs/evals.rst` has model leaderboards, CSV/JSON export, Docker guidance, and SWE-bench compatibility; code-oz's offline tests validate orchestration but not live run quality across model/release combos. | M21+ candidate | Rule-21 ship gate: a release-cadence quality regression slips through unit tests, motivating a separate run-quality evaluation surface. |

Comment on lines +59 to +63
**M17 candidate — AGENT_FILES intake authority (B3-narrowed).** Lands the discovery list (`AGENTS.md`, `CLAUDE.md`, `COPILOT.md`, `GEMINI.md`, `.cursorrules`, `.windsurfrules`, `.github/copilot-instructions.md` per `gptme/prompts/__init__.py:23`) at AUDIT and DEFINE phase entry. Discovery only — no parent/home walk (gptme walks home → workspace per `gptme/prompts/workspace.py:121,215,233`; code-oz refuses), no automatic prompt injection. Files become a confirm UI that accepts or rejects per file. New telemetry events (`agent_files_discovered`, `agent_files_accepted`, `agent_files_rejected`, `agent_instruction_conflicts`) extend `PhaseEvent`. Trigger: lands when brownfield AUDIT runtime ships (W4) or when greenfield DEFINE intake earns the authority. Hard precondition: not before `src/phases/audit.ts` exists.

**M18 candidate — Compaction-opportunity probe authority (B1-narrowed).** Telemetry-only context projection that reports compaction opportunity without mutating provider invocations. No LLM resume summarization (gptme's `gptme/tools/autocompact/hook.py:164`), no view-branch swap (`gptme/tools/autocompact/hook.py:128`), no automatic provider-context mutation. The discipline rule "no phase artifact may exceed N tokens at gate write" lands first as a separate gate-preflight check (`src/phases/gate-preflight.ts` extension); the probe extends the existing `ProviderContextMetrics` (`src/providers/manifest.ts:111` neighborhood) with `context_projection_tokens`, `compaction_opportunity_savings_ratio`, `compaction_skipped_savings_ratio`. Trigger: M14 Reviewer-panel + M15 debate-scheduler accumulate large enough contexts to make the > 0.10 floor measurable.

**Parallel deferred slots — M19+ (B2) and M20+ (D3).** B2's worktree topology refusal diagnostics waits on actual operator-intervention evidence in the resume corpus. D3's release/run-quality eval harness waits on a release-cadence quality regression that slips through unit tests. Both are reserved with measurement triggers in `ROADMAP.md`; neither is committed.

medium

Renumbering these slots to resolve the conflict with the existing M17 milestone in the roadmap.

Suggested change
**M18 candidate — AGENT_FILES intake authority (B3-narrowed).** Lands the discovery list (`AGENTS.md`, `CLAUDE.md`, `COPILOT.md`, `GEMINI.md`, `.cursorrules`, `.windsurfrules`, `.github/copilot-instructions.md` per `gptme/prompts/__init__.py:23`) at AUDIT and DEFINE phase entry. Discovery only — no parent/home walk (gptme walks home → workspace per `gptme/prompts/workspace.py:121,215,233`; code-oz refuses), no automatic prompt injection. Files become a confirm UI that accepts or rejects per file. New telemetry events (`agent_files_discovered`, `agent_files_accepted`, `agent_files_rejected`, `agent_instruction_conflicts`) extend `PhaseEvent`. Trigger: lands when brownfield AUDIT runtime ships (W4) or when greenfield DEFINE intake earns the authority. Hard precondition: not before `src/phases/audit.ts` exists.
**M19 candidate — Compaction-opportunity probe authority (B1-narrowed).** Telemetry-only context projection that reports compaction opportunity without mutating provider invocations. No LLM resume summarization (gptme's `gptme/tools/autocompact/hook.py:164`), no view-branch swap (`gptme/tools/autocompact/hook.py:128`), no automatic provider-context mutation. The discipline rule "no phase artifact may exceed N tokens at gate write" lands first as a separate gate-preflight check (`src/phases/gate-preflight.ts` extension); the probe extends the existing `ProviderContextMetrics` (`src/providers/manifest.ts:111` neighborhood) with `context_projection_tokens`, `compaction_opportunity_savings_ratio`, `compaction_skipped_savings_ratio`. Trigger: M14 Reviewer-panel + M15 debate-scheduler accumulate large enough contexts to make the > 0.10 floor measurable.
**Parallel deferred slots — M20+ (B2) and M21+ (D3).** B2's worktree topology refusal diagnostics waits on actual operator-intervention evidence in the resume corpus. D3's release/run-quality eval harness waits on a release-cadence quality regression that slips through unit tests. Both are reserved with measurement triggers in `ROADMAP.md`; neither is committed.

Comment thread docs/design/ROADMAP.md
Comment on lines +385 to +388
- **M17 candidate — `feat(intake): cross-tool AGENT_FILES discovery + AUDIT/DEFINE opt-in (gptme borrow B3-narrowed).`** Authority boundary: cross-tool agent-instruction-file intake at AUDIT and DEFINE phase entry. Discovery only (file list per `gptme/prompts/__init__.py`: `AGENTS.md`, `CLAUDE.md`, `COPILOT.md`, `GEMINI.md`, `.cursorrules`, `.windsurfrules`, `.github/copilot-instructions.md`). NO parent/home walk. NO automatic prompt injection. Files become available to the user as a confirm UI in AUDIT/DEFINE intake; user accepts or rejects per file. Telemetry events: `agent_files_discovered`, `agent_files_accepted`, `agent_files_rejected`, `agent_instruction_conflicts`. Rule-21 measurement: `agent_files_accepted / agent_files_discovered` rate; intake-question-count delta vs baseline (no AGENT_FILES) on a brownfield corpus. Trail: `docs/comparisons/gptme/SYNTHESIS.md`. Trigger: lands when brownfield AUDIT runtime ships (W4) OR when greenfield DEFINE intake earns the authority. NOT before `src/phases/audit.ts` exists.
- **M18 candidate — `feat(provider): deterministic context-projection + compaction-opportunity probe (gptme borrow B1-narrowed).`** Authority boundary: telemetry-only context projection that reports compaction opportunity without mutating provider invocations. NO LLM resume summarization. NO view-branch swap. NO automatic provider-context mutation. The discipline rule "no phase artifact may exceed N tokens at gate write" lands FIRST as a separate gate-preflight check; the probe extends existing `tokensEstimate` (`src/providers/manifest.ts:109` neighborhood, computed at line 111) with `context_projection_tokens`, `compaction_opportunity_savings_ratio`, `compaction_skipped_savings_ratio`. Rule-21 measurement: observed `compaction_opportunity_savings_ratio` distribution across runs > 0.10 floor before any compaction action authority is added. Trail: `docs/comparisons/gptme/SYNTHESIS.md`. Trigger: M14 Reviewer-panel + M15 debate-scheduler runs accumulate large enough contexts to make the floor measurable.
- **M19+ candidate — `feat(diagnostic): worktree topology refusal modes (gptme borrow B2-deferred).`** Authority boundary: diagnostic-only kind-classification of worktree state (`clean_run_worktree | dirty_run_worktree | no_worktree | multi_root_repo`). NO destructive restore primitive (gptme's `git reset --hard` is incompatible with code-oz's user-change preservation). On audit-completeness failure, classify and refuse rather than reset. Trigger: lands when actual resumes show audit-completeness recovery cannot recover from a dirty run worktree without operator intervention. Rule-21 measurement: count of resumes where current recovery is destructive vs. count where classification would have made the next action obvious.
- **M20+ candidate — `feat(eval): release/run-quality eval harness (gptme borrow D3, Codex-flagged).`** Authority boundary: a separate run-quality evaluation suite, not the unit/integration test surface. Inspired by `gptme/docs/evals.rst` (model leaderboards, CSV/JSON export, Docker guidance, SWE-bench compatibility). Validates orchestration AND live run quality across model/release combos. Trigger: when v0.2 stabilizes and a release-cadence quality regression slips through unit tests.

medium

There is a numbering conflict here. M17 is already assigned to "Reviewer Memory v1" on line 382. These new candidate slots should be renumbered (e.g., M18-M21) to maintain a unique sequence in the roadmap.

Suggested change
- **M17 candidate — `feat(intake): cross-tool AGENT_FILES discovery + AUDIT/DEFINE opt-in (gptme borrow B3-narrowed).`** Authority boundary: cross-tool agent-instruction-file intake at AUDIT and DEFINE phase entry. Discovery only (file list per `gptme/prompts/__init__.py`: `AGENTS.md`, `CLAUDE.md`, `COPILOT.md`, `GEMINI.md`, `.cursorrules`, `.windsurfrules`, `.github/copilot-instructions.md`). NO parent/home walk. NO automatic prompt injection. Files become available to the user as a confirm UI in AUDIT/DEFINE intake; user accepts or rejects per file. Telemetry events: `agent_files_discovered`, `agent_files_accepted`, `agent_files_rejected`, `agent_instruction_conflicts`. Rule-21 measurement: `agent_files_accepted / agent_files_discovered` rate; intake-question-count delta vs baseline (no AGENT_FILES) on a brownfield corpus. Trail: `docs/comparisons/gptme/SYNTHESIS.md`. Trigger: lands when brownfield AUDIT runtime ships (W4) OR when greenfield DEFINE intake earns the authority. NOT before `src/phases/audit.ts` exists.
- **M18 candidate — `feat(provider): deterministic context-projection + compaction-opportunity probe (gptme borrow B1-narrowed).`** Authority boundary: telemetry-only context projection that reports compaction opportunity without mutating provider invocations. NO LLM resume summarization. NO view-branch swap. NO automatic provider-context mutation. The discipline rule "no phase artifact may exceed N tokens at gate write" lands FIRST as a separate gate-preflight check; the probe extends existing `tokensEstimate` (`src/providers/manifest.ts:109` neighborhood, computed at line 111) with `context_projection_tokens`, `compaction_opportunity_savings_ratio`, `compaction_skipped_savings_ratio`. Rule-21 measurement: observed `compaction_opportunity_savings_ratio` distribution across runs > 0.10 floor before any compaction action authority is added. Trail: `docs/comparisons/gptme/SYNTHESIS.md`. Trigger: M14 Reviewer-panel + M15 debate-scheduler runs accumulate large enough contexts to make the floor measurable.
- **M19+ candidate — `feat(diagnostic): worktree topology refusal modes (gptme borrow B2-deferred).`** Authority boundary: diagnostic-only kind-classification of worktree state (`clean_run_worktree | dirty_run_worktree | no_worktree | multi_root_repo`). NO destructive restore primitive (gptme's `git reset --hard` is incompatible with code-oz's user-change preservation). On audit-completeness failure, classify and refuse rather than reset. Trigger: lands when actual resumes show audit-completeness recovery cannot recover from a dirty run worktree without operator intervention. Rule-21 measurement: count of resumes where current recovery is destructive vs. count where classification would have made the next action obvious.
- **M20+ candidate — `feat(eval): release/run-quality eval harness (gptme borrow D3, Codex-flagged).`** Authority boundary: a separate run-quality evaluation suite, not the unit/integration test surface. Inspired by `gptme/docs/evals.rst` (model leaderboards, CSV/JSON export, Docker guidance, SWE-bench compatibility). Validates orchestration AND live run quality across model/release combos. Trigger: when v0.2 stabilizes and a release-cadence quality regression slips through unit tests.
- **M18 candidate — `feat(intake): cross-tool AGENT_FILES discovery + AUDIT/DEFINE opt-in (gptme borrow B3-narrowed).`** Authority boundary: cross-tool agent-instruction-file intake at AUDIT and DEFINE phase entry. Discovery only (file list per `gptme/prompts/__init__.py`: `AGENTS.md`, `CLAUDE.md`, `COPILOT.md`, `GEMINI.md`, `.cursorrules`, `.windsurfrules`, `.github/copilot-instructions.md`). NO parent/home walk. NO automatic prompt injection. Files become available to the user as a confirm UI in AUDIT/DEFINE intake; user accepts or rejects per file. Telemetry events: `agent_files_discovered`, `agent_files_accepted`, `agent_files_rejected`, `agent_instruction_conflicts`. Rule-21 measurement: `agent_files_accepted / agent_files_discovered` rate; intake-question-count delta vs baseline (no AGENT_FILES) on a brownfield corpus. Trail: `docs/comparisons/gptme/SYNTHESIS.md`. Trigger: lands when brownfield AUDIT runtime ships (W4) OR when greenfield DEFINE intake earns the authority. NOT before `src/phases/audit.ts` exists.
- **M19 candidate — `feat(provider): deterministic context-projection + compaction-opportunity probe (gptme borrow B1-narrowed).`** Authority boundary: telemetry-only context projection that reports compaction opportunity without mutating provider invocations. NO LLM resume summarization. NO view-branch swap. NO automatic provider-context mutation. The discipline rule "no phase artifact may exceed N tokens at gate write" lands FIRST as a separate gate-preflight check; the probe extends existing `tokensEstimate` (`src/providers/manifest.ts:109` neighborhood, computed at line 111) with `context_projection_tokens`, `compaction_opportunity_savings_ratio`, `compaction_skipped_savings_ratio`. Rule-21 measurement: observed `compaction_opportunity_savings_ratio` distribution across runs > 0.10 floor before any compaction action authority is added. Trail: `docs/comparisons/gptme/SYNTHESIS.md`. Trigger: M14 Reviewer-panel + M15 debate-scheduler runs accumulate large enough contexts to make the floor measurable.
- **M20+ candidate — `feat(diagnostic): worktree topology refusal modes (gptme borrow B2-deferred).`** Authority boundary: diagnostic-only kind-classification of worktree state (`clean_run_worktree | dirty_run_worktree | no_worktree | multi_root_repo`). NO destructive restore primitive (gptme's `git reset --hard` is incompatible with code-oz's user-change preservation). On audit-completeness failure, classify and refuse rather than reset. Trigger: lands when actual resumes show audit-completeness recovery cannot recover from a dirty run worktree without operator intervention. Rule-21 measurement: count of resumes where current recovery is destructive vs. count where classification would have made the next action obvious.
- **M21+ candidate — `feat(eval): release/run-quality eval harness (gptme borrow D3, Codex-flagged).`** Authority boundary: a separate run-quality evaluation suite, not the unit/integration test surface. Inspired by `gptme/docs/evals.rst` (model leaderboards, CSV/JSON export, Docker guidance, SWE-bench compatibility). Validates orchestration AND live run quality across model/release combos. Trigger: when v0.2 stabilizes and a release-cadence quality regression slips through unit tests.

Copilot AI left a comment

Pull request overview

Doc-only PR that records a gptme vs code-oz template comparison (briefing, Codex response, and synthesis) and reserves roadmap slots for potential future borrows (AGENT_FILES intake, context-projection probe, worktree-topology refusal diagnostics, eval harness).

Changes:

  • Add gptme comparison docs: COMPARISON, Codex briefing/response, and post-debate synthesis.
  • Update docs/design/ROADMAP.md to reserve M17–M20(+)-style candidate slots derived from the gptme comparison.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 7 comments.

| File | Description |
|---|---|
| docs/design/ROADMAP.md | Adds reserved candidate milestone slots derived from the gptme comparison. |
| docs/comparisons/gptme/COMPARISON.md | New side-by-side comparison + borrow/defer/reject rationale. |
| docs/comparisons/gptme/CODEX_BRIEFING.md | New briefing prompt used for cross-model peer review. |
| docs/comparisons/gptme/CODEX_RESPONSE.md | New Codex round-1 response capturing fix-first findings and revised classification. |
| docs/comparisons/gptme/SYNTHESIS.md | New single-source-of-truth synthesis + rationale for “ratify-only” scope and roadmap slot reservation. |

| **Knowledge injection** | Lessons (keyword/tool/pattern auto-load) + Anthropic skills | Universal anti-slop rules + per-persona prompts; no auto-load by keywords |
| **Plugins** | Python entry-points; packages tools+hooks+commands | None — agentpacks (skill bundles) only |
| **Subagents** | `subagent` tool: executor + planner + batch + completion-hooks | Phase agents; reviewer panel; debate participants — but not an in-chat subagent surface |
| **CLI agent files** | Loads AGENTS.md/CLAUDE.md/GEMINI.md/COPILOT.md/.cursorrules/.windsurfrules | Loads its own CLAUDE.md only |
Comment thread docs/design/ROADMAP.md
Comment on lines +384 to +388
- **Template-comparison-derived deferred milestones (slots reserved 2026-05-10):**
- **M17 candidate — `feat(intake): cross-tool AGENT_FILES discovery + AUDIT/DEFINE opt-in (gptme borrow B3-narrowed).`** Authority boundary: cross-tool agent-instruction-file intake at AUDIT and DEFINE phase entry. Discovery only (file list per `gptme/prompts/__init__.py`: `AGENTS.md`, `CLAUDE.md`, `COPILOT.md`, `GEMINI.md`, `.cursorrules`, `.windsurfrules`, `.github/copilot-instructions.md`). NO parent/home walk. NO automatic prompt injection. Files become available to the user as a confirm UI in AUDIT/DEFINE intake; user accepts or rejects per file. Telemetry events: `agent_files_discovered`, `agent_files_accepted`, `agent_files_rejected`, `agent_instruction_conflicts`. Rule-21 measurement: `agent_files_accepted / agent_files_discovered` rate; intake-question-count delta vs baseline (no AGENT_FILES) on a brownfield corpus. Trail: `docs/comparisons/gptme/SYNTHESIS.md`. Trigger: lands when brownfield AUDIT runtime ships (W4) OR when greenfield DEFINE intake earns the authority. NOT before `src/phases/audit.ts` exists.
- **M18 candidate — `feat(provider): deterministic context-projection + compaction-opportunity probe (gptme borrow B1-narrowed).`** Authority boundary: telemetry-only context projection that reports compaction opportunity without mutating provider invocations. NO LLM resume summarization. NO view-branch swap. NO automatic provider-context mutation. The discipline rule "no phase artifact may exceed N tokens at gate write" lands FIRST as a separate gate-preflight check; the probe extends existing `tokensEstimate` (`src/providers/manifest.ts:109` neighborhood, computed at line 111) with `context_projection_tokens`, `compaction_opportunity_savings_ratio`, `compaction_skipped_savings_ratio`. Rule-21 measurement: observed `compaction_opportunity_savings_ratio` distribution across runs > 0.10 floor before any compaction action authority is added. Trail: `docs/comparisons/gptme/SYNTHESIS.md`. Trigger: M14 Reviewer-panel + M15 debate-scheduler runs accumulate large enough contexts to make the floor measurable.
- **M19+ candidate — `feat(diagnostic): worktree topology refusal modes (gptme borrow B2-deferred).`** Authority boundary: diagnostic-only kind-classification of worktree state (`clean_run_worktree | dirty_run_worktree | no_worktree | multi_root_repo`). NO destructive restore primitive (gptme's `git reset --hard` is incompatible with code-oz's user-change preservation). On audit-completeness failure, classify and refuse rather than reset. Trigger: lands when actual resumes show audit-completeness recovery cannot recover from a dirty run worktree without operator intervention. Rule-21 measurement: count of resumes where current recovery is destructive vs. count where classification would have made the next action obvious.
- **M20+ candidate — `feat(eval): release/run-quality eval harness (gptme borrow D3, Codex-flagged).`** Authority boundary: a separate run-quality evaluation suite, not the unit/integration test surface. Inspired by `gptme/docs/evals.rst` (model leaderboards, CSV/JSON export, Docker guidance, SWE-bench compatibility). Validates orchestration AND live run quality across model/release combos. Trigger: when v0.2 stabilizes and a release-cadence quality regression slips through unit tests.
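
The worktree topology refusal slot (B2) above describes "classify and refuse rather than reset". A minimal sketch of that idea follows, assuming a hypothetical `WorktreeObservation` input and hypothetical `classifyWorktree` / `decideRecovery` helpers; only the four kind names come from the roadmap text. The policy of refusing on every kind except `clean_run_worktree` is also an assumption made for illustration, not something the roadmap specifies.

```typescript
// Worktree kinds as named in the roadmap slot for B2.
type WorktreeKind =
  | "clean_run_worktree"
  | "dirty_run_worktree"
  | "no_worktree"
  | "multi_root_repo";

// Hypothetical observation shape; not an existing code-oz type.
interface WorktreeObservation {
  repoRootCount: number; // number of repository roots detected
  hasRunWorktree: boolean; // whether a run worktree exists at all
  uncommittedPaths: string[]; // paths with uncommitted user changes
}

interface RecoveryDecision {
  kind: WorktreeKind;
  refuse: boolean;
  reason: string;
}

// Diagnostic only: maps an observation to a kind and never touches the worktree.
function classifyWorktree(obs: WorktreeObservation): WorktreeKind {
  if (obs.repoRootCount > 1) return "multi_root_repo";
  if (!obs.hasRunWorktree) return "no_worktree";
  return obs.uncommittedPaths.length > 0
    ? "dirty_run_worktree"
    : "clean_run_worktree";
}

// Assumed policy: anything other than a clean run worktree is refused,
// so no destructive restore (such as a hard reset) is ever attempted.
function decideRecovery(obs: WorktreeObservation): RecoveryDecision {
  const kind = classifyWorktree(obs);
  if (kind === "clean_run_worktree") {
    return { kind, refuse: false, reason: "run worktree is clean; recovery may proceed" };
  }
  return {
    kind,
    refuse: true,
    reason: `refusing automatic recovery: worktree classified as ${kind}`,
  };
}
```
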
| Item | Status | Reason (one sentence) | Target slot | Measurement plan (if borrow) |
|---|---|---|---|---|
| **B1 — Compaction-opportunity probe** | borrow-deferred-to-own-milestone | Deterministic context projection + compaction-opportunity probe is useful telemetry, but it requires a separate milestone authority because gptme's full engine performs LLM resume summarization and view-branch swaps that violate code-oz's "files in `ProviderRequest.files` are explicit, never silently mutated" discipline. | M18 candidate | Extend the existing `tokensEstimate` field on `ProviderContextMetrics` (`src/providers/manifest.ts:111`) with `context_projection_tokens`, `compaction_opportunity_savings_ratio`, `compaction_skipped_savings_ratio`. Rule-21 ship gate: observed `compaction_opportunity_savings_ratio` distribution > 0.10 across runs before any compaction-action authority is added. |
| **B3 — AGENT_FILES discovery + AUDIT/DEFINE opt-in** | borrow-deferred-to-own-milestone | Cross-tool agent-instruction-file discovery is worth doing, but it requires its own milestone authority because the AUDIT runtime does not yet exist (`src/phases/audit.ts` is absent) and the trust-boundary discipline (no parent/home walk, explicit per-file opt-in) is itself a contract surface. | M17 candidate | Telemetry events `agent_files_discovered`, `agent_files_accepted`, `agent_files_rejected`, `agent_instruction_conflicts`. Rule-21 ship gate: `agent_files_accepted / agent_files_discovered` rate observable; intake-question-count delta vs. baseline (no AGENT_FILES) on a brownfield corpus. |
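
The B1 measurement plan in the table above can be read as a pure, telemetry-only computation. The sketch below is one possible interpretation, not the planned implementation: `projectContext` and `clearsFloor` are hypothetical names, the ratio is assumed to be `(tokensEstimate - context_projection_tokens) / tokensEstimate`, and "distribution > 0.10" is assumed to mean that more than half of the observed runs exceed the floor. The `compaction_skipped_savings_ratio` field is omitted because the source does not define it.

```typescript
// Field names follow the table; the interface itself is hypothetical.
interface ContextProjectionMetrics {
  tokensEstimate: number;
  context_projection_tokens: number;
  compaction_opportunity_savings_ratio: number;
}

// Pure projection: reports what compaction could save, mutates nothing.
function projectContext(
  tokensEstimate: number,
  projectedTokens: number,
): ContextProjectionMetrics {
  // A projection larger than the estimate is treated as zero savings.
  const saved = Math.max(0, tokensEstimate - projectedTokens);
  const ratio = tokensEstimate > 0 ? saved / tokensEstimate : 0;
  return {
    tokensEstimate,
    context_projection_tokens: projectedTokens,
    compaction_opportunity_savings_ratio: ratio,
  };
}

// Assumed ship-gate reading: a majority of runs must exceed the floor.
function clearsFloor(ratios: number[], floor: number = 0.1): boolean {
  if (ratios.length === 0) return false;
  const above = ratios.filter((r) => r > floor).length;
  return above / ratios.length > 0.5;
}
```
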
Comment on lines +59 to +63
**M17 candidate — AGENT_FILES intake authority (B3-narrowed).** Lands the discovery list (`AGENTS.md`, `CLAUDE.md`, `COPILOT.md`, `GEMINI.md`, `.cursorrules`, `.windsurfrules`, `.github/copilot-instructions.md` per `gptme/prompts/__init__.py:23`) at AUDIT and DEFINE phase entry. Discovery only — no parent/home walk (gptme walks home → workspace per `gptme/prompts/workspace.py:121,215,233`; code-oz refuses), no automatic prompt injection. Files become a confirm UI that accepts or rejects per file. New telemetry events (`agent_files_discovered`, `agent_files_accepted`, `agent_files_rejected`, `agent_instruction_conflicts`) extend `PhaseEvent`. Trigger: lands when brownfield AUDIT runtime ships (W4) or when greenfield DEFINE intake earns the authority. Hard precondition: not before `src/phases/audit.ts` exists.

**M18 candidate — Compaction-opportunity probe authority (B1-narrowed).** Telemetry-only context projection that reports compaction opportunity without mutating provider invocations. No LLM resume summarization (gptme's `gptme/tools/autocompact/hook.py:164`), no view-branch swap (`gptme/tools/autocompact/hook.py:128`), no automatic provider-context mutation. The discipline rule "no phase artifact may exceed N tokens at gate write" lands first as a separate gate-preflight check (`src/phases/gate-preflight.ts` extension); the probe extends the existing `ProviderContextMetrics` (`src/providers/manifest.ts:111` neighborhood) with `context_projection_tokens`, `compaction_opportunity_savings_ratio`, `compaction_skipped_savings_ratio`. Trigger: M14 Reviewer-panel + M15 debate-scheduler accumulate large enough contexts to make the > 0.10 floor measurable.

**Parallel deferred slots — M19+ (B2) and M20+ (D3).** B2's worktree topology refusal diagnostics waits on actual operator-intervention evidence in the resume corpus. D3's release/run-quality eval harness waits on a release-cadence quality regression that slips through unit tests. Both are reserved with measurement triggers in `ROADMAP.md`; neither is committed.
3. **Rule 21 measurement plans documented for future milestones.** Each borrow's measurement plan is recorded in the decision matrix above and in the `ROADMAP.md` slot reservation. Implementation cannot land without telemetry first.
4. **No source or test code touched.** Only `docs/comparisons/gptme/*` and `docs/design/ROADMAP.md` are modified. No `src/**`, no `tests/**`, no schema changes.

When the convergence Codex round returns `push-clean`, the PR ships and the slots are reserved. The next milestone (M17 candidate or otherwise) opens its own briefing → debate → implementation → review cycle.
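
The AGENT_FILES intake slot described above (discovery only, no parent/home walk, per-file confirmation) might be sketched as follows. Everything except the file list and the event names is an assumption for illustration: `IntakeEvent`, `discoverAgentFiles`, `runIntake`, and `acceptanceRate` are hypothetical, the existence probe and the confirm callback stand in for the real filesystem and UI, and `agent_instruction_conflicts` is left out because the source does not say how conflicts are detected.

```typescript
// Discovery list as quoted in the roadmap slot.
const AGENT_FILES: readonly string[] = [
  "AGENTS.md",
  "CLAUDE.md",
  "COPILOT.md",
  "GEMINI.md",
  ".cursorrules",
  ".windsurfrules",
  ".github/copilot-instructions.md",
];

// Hypothetical event shape; the source only names the event types.
type IntakeEvent =
  | { type: "agent_files_discovered"; files: string[] }
  | { type: "agent_files_accepted"; file: string }
  | { type: "agent_files_rejected"; file: string };

// Probes only the fixed workspace-relative names, so no parent or home
// directory is ever consulted.
function discoverAgentFiles(
  existsInWorkspace: (relPath: string) => boolean,
): string[] {
  return AGENT_FILES.filter((name) => existsInWorkspace(name));
}

// Nothing is injected into a prompt here; each file is only accepted or rejected.
function runIntake(
  existsInWorkspace: (relPath: string) => boolean,
  confirm: (file: string) => boolean,
): IntakeEvent[] {
  const files = discoverAgentFiles(existsInWorkspace);
  const events: IntakeEvent[] = [{ type: "agent_files_discovered", files }];
  for (const file of files) {
    if (confirm(file)) {
      events.push({ type: "agent_files_accepted", file });
    } else {
      events.push({ type: "agent_files_rejected", file });
    }
  }
  return events;
}

// Rule-21 style rate: accepted files divided by discovered files.
function acceptanceRate(events: IntakeEvent[]): number {
  let discovered = 0;
  let accepted = 0;
  for (const e of events) {
    if (e.type === "agent_files_discovered") discovered += e.files.length;
    if (e.type === "agent_files_accepted") accepted += 1;
  }
  return discovered > 0 ? accepted / discovered : 0;
}
```

Passing the existence probe in as a function keeps the sketch free of filesystem access; a real intake would presumably resolve each name against the workspace root only.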
Comment on lines +71 to +77
| B3: Borrow now — Cross-tool AGENT_FILES ingestion | **B3: Narrow borrow candidate, deferred to own milestone — AGENT_FILES discovery plus explicit AUDIT/DEFINE opt-in.** No parent/home walk. Cross-tool files are informational until the user accepts them. Measurement: `agent_files_discovered`, `agent_files_accepted`, `agent_files_rejected`, `agent_instruction_conflicts`, intake-question delta. Reserved as M17 candidate slot in `ROADMAP.md`. |
| (none) | **D3 (new): Release/run-quality eval harness inspired by gptme evals.** Defer unless it becomes the single milestone authority. |
| D1, D2, R1, R2, R3, R4, R5 | unchanged |

Borrow set after fix: **2 narrow-borrow candidates (each deferred to its own milestone), 4 deferred (B2 demoted + D1 + D2 + new D3 eval harness), 5 rejected.**

> **Note (post-R2 scope lock, thread `019e1319`):** Codex round 2 ratified Option A — RATIFY-ONLY. The two narrow-borrow candidates above (B1, B3) are reserved for their own future milestones (M17/M18 candidate slots in `docs/design/ROADMAP.md`); they are NOT implemented in this PR. The canonical post-debate settlement is `SYNTHESIS.md`.
