feat(orchestrator): Phase B Units 10 + 11 — synthetic gate, expiry sweeper, shadow-summary, live-flip runbook (depends on #72068)#72086
Conversation
…lity inference (Unit 3)
… lock CAS (Unit 4)
…through (Unit 5b)
…summary + live-flip runbook (Units 10 + 11)
…rvice stop signature
Greptile SummaryThis PR closes Phase B of the orchestrator extension by adding the R30 synthetic-task observability gate ( Confidence Score: 4/5Safe to merge with minor design concerns; no blocking bugs found. All findings are P2: the production R30 gate reading from
Prompt To Fix All With AIThis is a comment left during a code review.
Path: extensions/orchestrator/src/synthetic.ts
Line: 59-65
Comment:
**Production gate reads from `test/fixtures/`**
`defaultFixturePath()` resolves the R30 fixture data through the `test/fixtures/` directory. The `openclaw orchestrator synthetic-all` command is the documented operator precondition for flipping out of synthetic mode, so `synthetic-tasks.json` is a production asset — not just a dev artifact. Binding it to the test directory makes the path brittle: any future build step, packaging pass, or directory restructuring that omits `test/` will silently break the gate at runtime. Consider moving `synthetic-tasks.json` to `src/fixtures/` (or a sibling `fixtures/` directory) and adjusting the path accordingly.
How can I resolve this? If you propose a fix, please make it concise.
---
This is a comment left during a code review.
Path: extensions/orchestrator/src/cli.ts
Line: 88-94
Comment:
**`init` exits non-zero when a token already exists**
When `init` finds a credential file and `--force` is not set, it prints the advisory and then sets `process.exitCode = 1`. Idiomatic CLI setup verbs (e.g. `git init`) treat an already-initialized state as a no-op success. Any automated script or CI step that runs `openclaw orchestrator init` as a setup guard will fail on every subsequent run after the first, forcing operators to either handle the exit code explicitly or always pass `--force`. Exiting 0 here and reserving the non-zero code for actual write failures would make the command safely composable.
How can I resolve this? If you propose a fix, please make it concise.
---
This is a comment left during a code review.
Path: extensions/orchestrator/src/store.ts
Line: 535-556
Comment:
**`in_progress` tasks are not eligible for expiry**
`STALE_ELIGIBLE` excludes `in_progress`, so `sweepExpired` silently skips any task in that state. In shadow/live mode, if a specialist session crashes without emitting a terminal event, the `spawn-watch` watcher has no timeout and the task stays `in_progress` indefinitely — the sweeper will never reclaim it. The `applyAction` guard also enforces this gap (`expire` throws on `in_progress`). This creates a category of tasks that can accumulate unboundedly and can never be expired or swept. If the intent is to avoid expiring truly active sessions, a documented stale-`in_progress` eviction path (e.g. a separate action type or a TTL-based fallback on the watcher side) would close the gap.
How can I resolve this? If you propose a fix, please make it concise.Reviews (1): Last reviewed commit: "fix(orchestrator): align expiry-sweeper ..." | Re-trigger Greptile |
| // synthetic.ts lives at extensions/orchestrator/src/synthetic.ts. The | ||
| // canonical fixture file ships under test/fixtures/. | ||
| const here = dirname(fileURLToPath(import.meta.url)); | ||
| return resolve(here, "..", "test", "fixtures", FIXTURE_FILE); | ||
| } | ||
|
|
||
| export function loadSyntheticFixtures(path?: string): SyntheticFixtureFile { |
There was a problem hiding this comment.
Production gate reads from
test/fixtures/
defaultFixturePath() resolves the R30 fixture data through the test/fixtures/ directory. The openclaw orchestrator synthetic-all command is the documented operator precondition for flipping out of synthetic mode, so synthetic-tasks.json is a production asset — not just a dev artifact. Binding it to the test directory makes the path brittle: any future build step, packaging pass, or directory restructuring that omits test/ will silently break the gate at runtime. Consider moving synthetic-tasks.json to src/fixtures/ (or a sibling fixtures/ directory) and adjusting the path accordingly.
Prompt To Fix With AI
This is a comment left during a code review.
Path: extensions/orchestrator/src/synthetic.ts
Line: 59-65
Comment:
**Production gate reads from `test/fixtures/`**
`defaultFixturePath()` resolves the R30 fixture data through the `test/fixtures/` directory. The `openclaw orchestrator synthetic-all` command is the documented operator precondition for flipping out of synthetic mode, so `synthetic-tasks.json` is a production asset — not just a dev artifact. Binding it to the test directory makes the path brittle: any future build step, packaging pass, or directory restructuring that omits `test/` will silently break the gate at runtime. Consider moving `synthetic-tasks.json` to `src/fixtures/` (or a sibling `fixtures/` directory) and adjusting the path accordingly.
How can I resolve this? If you propose a fix, please make it concise.| path, | ||
| token: generateToken(), | ||
| ...(deps.now != null ? { now: deps.now } : {}), | ||
| }); | ||
| out.write(describe(credentials, "created")); | ||
| }); | ||
|
|
There was a problem hiding this comment.
init exits non-zero when a token already exists
When init finds a credential file and --force is not set, it prints the advisory and then sets process.exitCode = 1. Idiomatic CLI setup verbs (e.g. git init) treat an already-initialized state as a no-op success. Any automated script or CI step that runs openclaw orchestrator init as a setup guard will fail on every subsequent run after the first, forcing operators to either handle the exit code explicitly or always pass --force. Exiting 0 here and reserving the non-zero code for actual write failures would make the command safely composable.
Prompt To Fix With AI
This is a comment left during a code review.
Path: extensions/orchestrator/src/cli.ts
Line: 88-94
Comment:
**`init` exits non-zero when a token already exists**
When `init` finds a credential file and `--force` is not set, it prints the advisory and then sets `process.exitCode = 1`. Idiomatic CLI setup verbs (e.g. `git init`) treat an already-initialized state as a no-op success. Any automated script or CI step that runs `openclaw orchestrator init` as a setup guard will fail on every subsequent run after the first, forcing operators to either handle the exit code explicitly or always pass `--force`. Exiting 0 here and reserving the non-zero code for actual write failures would make the command safely composable.
How can I resolve this? If you propose a fix, please make it concise.| if (TERMINAL.has(task.state)) { | ||
| continue; | ||
| } | ||
| if (!STALE_ELIGIBLE.has(task.state)) { | ||
| continue; | ||
| } | ||
| if (localNow() <= new Date(task.expiresAt).getTime()) { | ||
| continue; | ||
| } | ||
| try { | ||
| const expired = transition(task.id, { type: "expire" }, { kind, holderId: "sweeper" }); | ||
| swept.push(expired); | ||
| } catch (err) { | ||
| if ((err as StoreError).code === "lock_held") { | ||
| continue; | ||
| } | ||
| throw err; | ||
| } | ||
| } | ||
| } | ||
| return swept; | ||
| } |
There was a problem hiding this comment.
in_progress tasks are not eligible for expiry
STALE_ELIGIBLE excludes in_progress, so sweepExpired silently skips any task in that state. In shadow/live mode, if a specialist session crashes without emitting a terminal event, the spawn-watch watcher has no timeout and the task stays in_progress indefinitely — the sweeper will never reclaim it. The applyAction guard also enforces this gap (expire throws on in_progress). This creates a category of tasks that can accumulate unboundedly and can never be expired or swept. If the intent is to avoid expiring truly active sessions, a documented stale-in_progress eviction path (e.g. a separate action type or a TTL-based fallback on the watcher side) would close the gap.
Prompt To Fix With AI
This is a comment left during a code review.
Path: extensions/orchestrator/src/store.ts
Line: 535-556
Comment:
**`in_progress` tasks are not eligible for expiry**
`STALE_ELIGIBLE` excludes `in_progress`, so `sweepExpired` silently skips any task in that state. In shadow/live mode, if a specialist session crashes without emitting a terminal event, the `spawn-watch` watcher has no timeout and the task stays `in_progress` indefinitely — the sweeper will never reclaim it. The `applyAction` guard also enforces this gap (`expire` throws on `in_progress`). This creates a category of tasks that can accumulate unboundedly and can never be expired or swept. If the intent is to avoid expiring truly active sessions, a documented stale-`in_progress` eviction path (e.g. a separate action type or a TTL-based fallback on the watcher side) would close the gap.
How can I resolve this? If you propose a fix, please make it concise.There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: b3ba1e66c7
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| requiredCapabilities: body.requiredCapabilities ?? [], | ||
| submittedBy: body.submittedBy ?? submittedByDefault, | ||
| kind: body.kind ?? "synthetic", |
There was a problem hiding this comment.
Enforce synthetic kind in synthetic-only submit route
POST /orchestrator/tasks is explicitly gated to synthetic mode, but the handler persists caller-controlled body.kind directly. A client can submit kind: "live" or "shadow" while mode === "synthetic", which writes tasks into the wrong namespace and bypasses the intended mode boundary. This can contaminate live/shadow task stores and undermine the synthetic/shadow gating flow; the route should derive kind from mode or reject non-synthetic kinds here.
Useful? React with 👍 / 👎.
| const reason = body.reason ?? ""; | ||
| if (reason.trim() === "" || reason.length > 1024) { |
There was a problem hiding this comment.
Type-check reject reason before calling trim
The reject transition path assumes reason is a string and immediately calls reason.trim(). If the client sends a non-string JSON value (for example {"action":"reject","reason":123}), this throws at runtime and the request fails as an internal error instead of returning invalid_reason. Add a string type guard before trim/length validation so invalid payloads are handled as 400s.
Useful? React with 👍 / 👎.
…t idempotency, kind boundary, reason type guard - Move synthetic-tasks.json from test/fixtures/ to src/fixtures/. The fixture is a production asset (the live-flip runbook gates on synthetic-all), so it must ship under the package boundary. - init: drop process.exitCode = 1 when a token already exists. Idempotent re-runs in setup scripts now exit 0; nonzero is reserved for actual write failures. - POST /orchestrator/tasks: force kind='synthetic' since the route is mode-gated. Trusting body.kind would let a client write live/shadow tasks into the synthetic namespace. - POST /tasks/<id>/transition reject: type-guard reason before .trim(). A non-string reason now returns 400 invalid_reason instead of crashing to 500.
… (companion to 06924c4 fixture relocation) The previous commit added src/fixtures/synthetic-tasks.json but the staging step missed deleting the original at test/fixtures/. The runtime resolver already points at src/fixtures/, so the leftover was unreferenced — this just removes the dead copy so the move is complete. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Bot review followups landedPushed `06924c4985` + `799c312aa5` addressing four of the five bot findings:
Deferred: the `in_progress` expiry gap (Greptile P2) is dormant in v0 (synthetic mode never produces real `in_progress` liveness), but matters before the shadow/live cutover. Tracked in #72095. Local: `pnpm test extensions/orchestrator` → 175/175 ✅ CI is currently red on this branch due to upstream lockfile drift (`extensions/diagnostics-prometheus/package.json` added `@openclaw/plugin-sdk@workspace:*` without a corresponding `pnpm-lock.yaml` refresh in 0f2e7510cb). Once that lands, this branch should rebase green. |
|
Closing this as better suited for ClawHub/community plugin work after Codex automated review. Close as ClawHub/plugin work. Current main does not contain the orchestrator plugin, and the PR adds an optional heavy orchestration layer using plugin-style CLI, HTTP route, and service surfaces that OpenClaw already exposes. The project vision directs optional capabilities and heavy orchestration layers away from core unless there is explicit maintainer product sponsorship. Best possible solution: Close this OpenClaw core PR and move the orchestrator work to an external ClawHub/npm plugin that uses the existing plugin CLI, HTTP route, and service APIs. If external implementation exposes a concrete missing SDK seam, open a narrow plugin API design issue or a maintainer-sponsored core proposal instead. What I checked:
So I’m closing this as a scope-fit item for the plugin/community path rather than keeping it open as an OpenClaw core request. Codex Review notes: model gpt-5.5, reasoning high; reviewed against 6d60b035b4e7. |
Summary
Closes the Phase B implementation arc. Three deliverables stacked on #72068:
What landed
CLI surface (cumulative across Units 7-11)
Live-flip procedure (now documented in README)
```
```
Rollback at any point is a config edit + restart; in-flight `awaiting_approval` tasks remain operator-actionable from the Approvals tab.
Boundaries respected
Test plan
Phase B is now feature-complete
This is the last openclaw-side unit in Plan 005. After this stack lands and the MC commits ship:
🤖 Generated with Claude Code