
feat(orchestrator): Phase B Units 10 + 11 — synthetic gate, expiry sweeper, shadow-summary, live-flip runbook (depends on #72068)#72086

Closed
PeterPlatinum wants to merge 16 commits into openclaw:main from PeterPlatinum:feat/orchestrator-unit-10-11

@PeterPlatinum

Summary

Closes the Phase B implementation arc. Three deliverables stacked on #72068:

  1. R30 synthetic-task observability gate — `openclaw orchestrator synthetic-all` runs a 5-fixture deterministic harness end-to-end through routing + store + dispatch + trajectory. Exits non-zero if any fixture diverges from its expected agent / rule / terminal state. The gate is the operator-facing precondition for ever flipping mode away from synthetic.
  2. Expiry sweeper service — `createExpirySweeper` registered via `api.registerService(...)` with the `stop` lifecycle hook on the service object (recon Q-3 settled — `OpenClawPluginService` is the right model for periodic gateway-resident work). Fires every 60 minutes by default; logs swept count.
  3. Shadow-summary CLI verb + live-flip runbook — `openclaw orchestrator shadow-summary [--window 24]` reads the shadow archive, prints by-state counts + mean duration + window span, and exits non-zero if any spawn failure landed inside the window. The README's new "Live-flip runbook" makes the synthetic → shadow → live transition explicit.
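The gate in item 1 can be pictured with a minimal sketch. Everything below is an assumption made for illustration: the `Fixture` and `Expected` shapes, the agent names, the `done` state, and the `runGate` helper are hypothetical and are not taken from `src/synthetic.ts`.

```typescript
// Hypothetical shapes; the real schema of synthetic-tasks.json may differ.
interface Expected {
  agent: string;
  rule: string;
  terminalState: string;
}

interface Fixture {
  id: string;
  expected: Expected;
}

interface GateResult {
  exitCode: number;
  failures: string[];
}

// Runs every fixture through `run` and records one reason per diverging field.
function runGate(fixtures: Fixture[], run: (f: Fixture) => Expected): GateResult {
  const failures: string[] = [];
  for (const fixture of fixtures) {
    const actual = run(fixture);
    for (const key of ["agent", "rule", "terminalState"] as const) {
      if (actual[key] !== fixture.expected[key]) {
        failures.push(
          `${fixture.id}: ${key} expected ${fixture.expected[key]}, got ${actual[key]}`,
        );
      }
    }
  }
  // Any divergence turns into a non-zero exit code.
  return { exitCode: failures.length === 0 ? 0 : 1, failures };
}

// Illustrative fixtures; agent and state names are made up.
const gateFixtures: Fixture[] = [
  { id: "code-1", expected: { agent: "coder", rule: "code", terminalState: "done" } },
  { id: "fallback-1", expected: { agent: "generalist", rule: "fallback", terminalState: "done" } },
];

// A stub pipeline that misroutes the fallback fixture.
const misrouting = (f: Fixture): Expected =>
  f.id === "fallback-1" ? { ...f.expected, agent: "coder" } : f.expected;

const gatePass = runGate(gateFixtures, (f) => f.expected);
const gateFail = runGate(gateFixtures, misrouting);
```

Returning the exit code instead of setting it directly keeps the comparison logic testable without spawning a process.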

Stacked PR: depends on #72068, #72054, #72039, #72029. Land in that order; this PR's diff narrows after each merge.

What landed

| Commit | Sub-unit | Files |
| --- | --- | --- |
| `5de1c34002` | Units 10 + 11 main | `src/synthetic.ts` (deterministic harness + fixture loader + result formatter), `test/fixtures/synthetic-tasks.json` (5 fixtures: code-1, ops-1, research-1, writing-1, fallback-1), `src/expiry-sweeper.ts` (start/stop/runOnce + crash-tolerant logging), `src/shadow-summary.ts` (windowed shadow stats), three new CLI verbs in `src/cli.ts`, README runbook |
| follow-up fix | Unit 11 service shape | Aligns the registered service with `OpenClawPluginService.stop?` (object-level, not return-from-start), and hardens one spawn-watch test against TS narrowing weirdness around closure-captured `let` |

CLI surface (cumulative across Units 7-11)

| Verb | Purpose |
| --- | --- |
| `openclaw orchestrator init` | Generate bearer token (Unit 7). |
| `openclaw orchestrator rotate-token` | Rotate bearer token (Unit 7). |
| `openclaw orchestrator synthetic ` | One synthetic fixture end-to-end. |
| `openclaw orchestrator synthetic-all` | Full synthetic harness (R30 gate). |
| `openclaw orchestrator shadow-summary [--window ]` | Shadow archive stats + live-flip gate. |

Live-flip procedure (now documented in README)

```
1. openclaw orchestrator init                          # bearer token
2. openclaw orchestrator synthetic-all                 # data-plane gate
3. mode = "shadow" in ~/.openclaw/openclaw.json        # 24h soak
4. openclaw orchestrator shadow-summary --window 24    # live-flip gate
5. mode = "live"                                       # production
```

Rollback at any point is a config edit + restart; in-flight `awaiting_approval` tasks remain operator-actionable from the Approvals tab.
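The step 4 gate can be sketched as a pure function over archive records. This is a hedged illustration only: the `ShadowRecord` shape, the `done` state name, and the `summarizeShadow` helper are assumptions, since the real shadow archive format is not shown in this PR. Only the `failed` state and the non-zero exit on failure come from the description above.

```typescript
// Hypothetical record shape for one archived shadow task.
interface ShadowRecord {
  state: string; // "failed" is from the PR text; other values are illustrative
  startedAtMs: number;
  finishedAtMs: number;
}

interface ShadowSummary {
  byState: Record<string, number>;
  meanDurationMs: number;
  exitCode: number;
}

// Counts tasks by state inside the window and fails the gate on any "failed" task.
function summarizeShadow(
  records: ShadowRecord[],
  nowMs: number,
  windowHours: number,
): ShadowSummary {
  const cutoffMs = nowMs - windowHours * 60 * 60 * 1000;
  const inWindow = records.filter((r) => r.finishedAtMs >= cutoffMs);
  const byState: Record<string, number> = {};
  let totalMs = 0;
  for (const r of inWindow) {
    byState[r.state] = (byState[r.state] ?? 0) + 1;
    totalMs += r.finishedAtMs - r.startedAtMs;
  }
  const meanDurationMs = inWindow.length === 0 ? 0 : totalMs / inWindow.length;
  const exitCode = (byState["failed"] ?? 0) > 0 ? 1 : 0;
  return { byState, meanDurationMs, exitCode };
}

const HOUR_MS = 60 * 60 * 1000;
const shadowNow = 100 * HOUR_MS;
const shadowRecords: ShadowRecord[] = [
  { state: "done", startedAtMs: shadowNow - 2 * HOUR_MS, finishedAtMs: shadowNow - 2 * HOUR_MS + 4000 },
  { state: "done", startedAtMs: shadowNow - HOUR_MS, finishedAtMs: shadowNow - HOUR_MS + 2000 },
  { state: "failed", startedAtMs: shadowNow - 30 * HOUR_MS, finishedAtMs: shadowNow - 30 * HOUR_MS + 1000 },
];

// The failure is 30 hours old, so it only trips the gate with a wider window.
const last24h = summarizeShadow(shadowRecords, shadowNow, 24);
const last48h = summarizeShadow(shadowRecords, shadowNow, 48);
```

Passing `nowMs` in explicitly keeps the window arithmetic deterministic under test.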

Boundaries respected

  • `synthetic.ts`, `expiry-sweeper.ts`, `shadow-summary.ts` import only Node built-ins, types from `./types/schema.ts`, and the previously-shipped `store.ts` / `routing.ts` / `dispatch.ts` / `trajectory.ts`. No `src/**` imports.
  • The expiry sweeper is the first time this extension uses `api.registerService`; the service shape matches `src/plugins/types.ts:1996-1999` with separate top-level `start` and `stop` hooks (not `stop` returned from `start`).
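As a rough illustration of the service shape described in the last bullet, the sketch below puts `start` and `stop` side by side on the service object. The `PluginServiceSketch` interface, the `SweeperDeps` fields, and `createExpirySweeperSketch` are hypothetical stand-ins; the actual `OpenClawPluginService` type and `createExpirySweeper` signature are not reproduced here.

```typescript
// Hypothetical service shape: `start` and `stop` are sibling hooks on the object.
interface PluginServiceSketch {
  id: string;
  start(): void;
  stop?(): void;
}

interface SweeperDeps {
  sweep: () => number; // returns how many tasks were expired
  log: (message: string) => void;
  intervalMs?: number; // defaults to 60 minutes
}

function createExpirySweeperSketch(
  deps: SweeperDeps,
): PluginServiceSketch & { runOnce(): void } {
  const intervalMs = deps.intervalMs ?? 60 * 60 * 1000;
  let timer: ReturnType<typeof setInterval> | undefined;

  const runOnce = (): void => {
    try {
      deps.log(`swept ${deps.sweep()} expired task(s)`);
    } catch (err) {
      // Crash-tolerant: a failed sweep is logged and the next tick still runs.
      deps.log(`sweep failed: ${String(err)}`);
    }
  };

  return {
    id: "expiry-sweeper",
    runOnce,
    start() {
      // Starting twice must not create a second timer.
      if (timer === undefined) timer = setInterval(runOnce, intervalMs);
    },
    stop() {
      if (timer !== undefined) clearInterval(timer);
      timer = undefined;
    },
  };
}

const sweeperLogs: string[] = [];
const sweeperSketch = createExpirySweeperSketch({
  sweep: () => 3,
  log: (m) => { sweeperLogs.push(m); },
});
sweeperSketch.runOnce();
```

Keeping `stop` on the object, rather than returning it from `start`, lets the host call it even if `start` never ran.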

Test plan

  • `pnpm test extensions/orchestrator` — 172/172 pass (16 files; +30 new tests across synthetic, expiry-sweeper, shadow-summary)
  • `pnpm tsgo:all` — clean
  • Boundary contract still passing
  • `openclaw orchestrator synthetic-all` exits 0 when all fixtures pass; non-zero with structured reasons when any fail
  • `openclaw orchestrator shadow-summary` exits non-zero when any task in window is in `failed`
  • CI green
  • Live-flip dry run on Peter's machine after PR-set lands

Phase B is now feature-complete

This is the last openclaw-side unit in Plan 005. After this stack lands and the MC commits ship:

  • Synthetic mode produces visible task records with full trajectory in the Pipeline tab
  • Approvals tab handles approve/reject for synthetic awaiting_approval tasks
  • Shadow + live modes are gated behind the documented runbook
  • The expiry sweeper keeps the task archive bounded

🤖 Generated with Claude Code

@greptile-apps

greptile-apps Bot commented Apr 26, 2026

Greptile Summary

This PR closes Phase B of the orchestrator extension by adding the R30 synthetic-task observability gate (synthetic.ts), an expiry sweeper service (expiry-sweeper.ts), and the shadow-summary CLI verb with live-flip runbook (shadow-summary.ts). The implementation is well-structured and the 30 new tests cover the happy paths and failure injection cases thoroughly.

Confidence Score: 4/5

Safe to merge with minor design concerns; no blocking bugs found.

All findings are P2: the production R30 gate reading from test/fixtures/, the non-idempotent init exit code, and the in_progress expiry gap. No P0/P1 issues were identified. The core state-machine logic, atomic IO, and service registration are sound.

extensions/orchestrator/src/synthetic.ts (fixture path), extensions/orchestrator/src/cli.ts (init exit code), extensions/orchestrator/src/store.ts (in_progress expiry gap)


Reviews (1): Last reviewed commit: "fix(orchestrator): align expiry-sweeper ..."

Comment on lines +59 to +65
```ts
  // synthetic.ts lives at extensions/orchestrator/src/synthetic.ts. The
  // canonical fixture file ships under test/fixtures/.
  const here = dirname(fileURLToPath(import.meta.url));
  return resolve(here, "..", "test", "fixtures", FIXTURE_FILE);
}

export function loadSyntheticFixtures(path?: string): SyntheticFixtureFile {
```

P2 Production gate reads from test/fixtures/

defaultFixturePath() resolves the R30 fixture data through the test/fixtures/ directory. The openclaw orchestrator synthetic-all command is the documented operator precondition for flipping out of synthetic mode, so synthetic-tasks.json is a production asset — not just a dev artifact. Binding it to the test directory makes the path brittle: any future build step, packaging pass, or directory restructuring that omits test/ will silently break the gate at runtime. Consider moving synthetic-tasks.json to src/fixtures/ (or a sibling fixtures/ directory) and adjusting the path accordingly.


Comment on lines +88 to +94
```ts
      path,
      token: generateToken(),
      ...(deps.now != null ? { now: deps.now } : {}),
    });
    out.write(describe(credentials, "created"));
  });
```


P2 init exits non-zero when a token already exists

When init finds a credential file and --force is not set, it prints the advisory and then sets process.exitCode = 1. Idiomatic CLI setup verbs (e.g. git init) treat an already-initialized state as a no-op success. Any automated script or CI step that runs openclaw orchestrator init as a setup guard will fail on every subsequent run after the first, forcing operators to either handle the exit code explicitly or always pass --force. Exiting 0 here and reserving the non-zero code for actual write failures would make the command safely composable.
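A minimal sketch of the suggested behavior, under stated assumptions: `InitDeps`, `runInitSketch`, and the in-memory credential holder are hypothetical and stand in for the real file-backed logic in `src/cli.ts`, which is only partially visible in the hunk above.

```typescript
// Hypothetical dependencies; the real command reads and writes a credential file.
interface InitDeps {
  exists: () => boolean;
  write: (token: string) => void;
  generateToken: () => string;
  out: (line: string) => void;
}

// Returns the exit code: 0 for "created" and for "already initialized",
// non-zero only when the write itself fails.
function runInitSketch(deps: InitDeps, force: boolean): number {
  if (deps.exists() && !force) {
    deps.out("token already exists; pass --force to overwrite");
    return 0;
  }
  try {
    deps.write(deps.generateToken());
    deps.out("token created");
    return 0;
  } catch (err) {
    deps.out(`failed to write token: ${String(err)}`);
    return 1;
  }
}

// In-memory stand-in for the credential file.
const credentialHolder: { token?: string } = {};
const initLines: string[] = [];
const initDeps: InitDeps = {
  exists: () => credentialHolder.token !== undefined,
  write: (token) => { credentialHolder.token = token; },
  generateToken: () => `token-${initLines.length + 1}`,
  out: (line) => { initLines.push(line); },
};

const firstInit = runInitSketch(initDeps, false); // creates the token
const secondInit = runInitSketch(initDeps, false); // no-op success
```

With this shape, a setup script can call `init` on every run without special-casing the exit code.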


Comment on lines +535 to +556
```ts
      if (TERMINAL.has(task.state)) {
        continue;
      }
      if (!STALE_ELIGIBLE.has(task.state)) {
        continue;
      }
      if (localNow() <= new Date(task.expiresAt).getTime()) {
        continue;
      }
      try {
        const expired = transition(task.id, { type: "expire" }, { kind, holderId: "sweeper" });
        swept.push(expired);
      } catch (err) {
        if ((err as StoreError).code === "lock_held") {
          continue;
        }
        throw err;
      }
    }
  }
  return swept;
}
```

P2 in_progress tasks are not eligible for expiry

STALE_ELIGIBLE excludes in_progress, so sweepExpired silently skips any task in that state. In shadow/live mode, if a specialist session crashes without emitting a terminal event, the spawn-watch watcher has no timeout and the task stays in_progress indefinitely — the sweeper will never reclaim it. The applyAction guard also enforces this gap (expire throws on in_progress). This creates a category of tasks that can accumulate unboundedly and can never be expired or swept. If the intent is to avoid expiring truly active sessions, a documented stale-in_progress eviction path (e.g. a separate action type or a TTL-based fallback on the watcher side) would close the gap.
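One way the suggested eviction path might look is sketched below. It is purely illustrative: the `abandon` action type, the `lastHeartbeatMs` liveness field, and the `planSweep` helper are all hypothetical and do not exist in `store.ts` as shown; only `expire`, `in_progress`, and `awaiting_approval` appear in the PR.

```typescript
// Hypothetical task shape with an assumed liveness timestamp.
interface TaskSketch {
  id: string;
  state: string;
  expiresAtMs: number;
  lastHeartbeatMs: number;
}

type SweepAction =
  | { id: string; type: "expire" }
  | { id: string; type: "abandon" }; // hypothetical action for stale in_progress tasks

// Plans sweeper actions without mutating anything, so the policy is easy to test.
function planSweep(
  tasks: TaskSketch[],
  nowMs: number,
  staleEligible: Set<string>,
  inProgressTtlMs: number,
): SweepAction[] {
  const actions: SweepAction[] = [];
  for (const task of tasks) {
    if (staleEligible.has(task.state) && nowMs > task.expiresAtMs) {
      actions.push({ id: task.id, type: "expire" });
    } else if (
      task.state === "in_progress" &&
      nowMs - task.lastHeartbeatMs > inProgressTtlMs
    ) {
      // A separate action type lets `expire` keep rejecting in_progress tasks.
      actions.push({ id: task.id, type: "abandon" });
    }
  }
  return actions;
}

const sweepPlan = planSweep(
  [
    { id: "a", state: "awaiting_approval", expiresAtMs: 1000, lastHeartbeatMs: 0 },
    { id: "b", state: "in_progress", expiresAtMs: 1000, lastHeartbeatMs: 4500 },
    { id: "c", state: "in_progress", expiresAtMs: 1000, lastHeartbeatMs: 1000 },
  ],
  5000,
  new Set(["awaiting_approval"]),
  1000,
);
```

Task `b` is left alone because its last heartbeat is recent, which is the property that protects truly active sessions.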



@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: b3ba1e66c7


Comment thread extensions/orchestrator/src/http.ts Outdated
Comment on lines +281 to +283
```ts
  requiredCapabilities: body.requiredCapabilities ?? [],
  submittedBy: body.submittedBy ?? submittedByDefault,
  kind: body.kind ?? "synthetic",
```

P2 Enforce synthetic kind in synthetic-only submit route

POST /orchestrator/tasks is explicitly gated to synthetic mode, but the handler persists caller-controlled body.kind directly. A client can submit kind: "live" or "shadow" while mode === "synthetic", which writes tasks into the wrong namespace and bypasses the intended mode boundary. This can contaminate live/shadow task stores and undermine the synthetic/shadow gating flow; the route should derive kind from mode or reject non-synthetic kinds here.
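A small sketch of deriving `kind` from the mode rather than from the request body. The `decideSubmitKind` helper, the `409` status, and the `mode_not_synthetic` error code are assumptions for illustration; the real handler's response shape is not shown in this thread.

```typescript
type Mode = "synthetic" | "shadow" | "live";

type SubmitDecision =
  | { ok: true; kind: "synthetic"; overrodeCaller: boolean }
  | { ok: false; status: number; error: string };

// Decides the persisted kind for the synthetic-only submit route.
function decideSubmitKind(mode: Mode, body: { kind?: unknown }): SubmitDecision {
  if (mode !== "synthetic") {
    // Hypothetical status and error code for the mode gate.
    return { ok: false, status: 409, error: "mode_not_synthetic" };
  }
  // The caller-supplied kind is never persisted; it is only noted for logging.
  const overrodeCaller = body.kind !== undefined && body.kind !== "synthetic";
  return { ok: true, kind: "synthetic", overrodeCaller };
}

const spoofedSubmit = decideSubmitKind("synthetic", { kind: "live" });
const blockedSubmit = decideSubmitKind("shadow", {});
```

Rejecting non-synthetic kinds with a 400 would be an equally valid reading of the suggestion; forcing the value is simply the smaller change.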


Comment thread extensions/orchestrator/src/http.ts Outdated
Comment on lines +307 to +308
```ts
  const reason = body.reason ?? "";
  if (reason.trim() === "" || reason.length > 1024) {
```

P2 Type-check reject reason before calling trim

The reject transition path assumes reason is a string and immediately calls reason.trim(). If the client sends a non-string JSON value (for example {"action":"reject","reason":123}), this throws at runtime and the request fails as an internal error instead of returning invalid_reason. Add a string type guard before trim/length validation so invalid payloads are handled as 400s.
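The guard could look roughly like this. The `validateRejectReason` helper and its result type are hypothetical; the 1024-character limit, the empty-string check, and the `invalid_reason` code come from the hunk and comment above.

```typescript
type ReasonCheck =
  | { ok: true; reason: string }
  | { ok: false; status: 400; error: "invalid_reason" };

// Validates an untrusted JSON value before any string method is called on it.
function validateRejectReason(raw: unknown): ReasonCheck {
  if (typeof raw !== "string" || raw.trim() === "" || raw.length > 1024) {
    return { ok: false, status: 400, error: "invalid_reason" };
  }
  return { ok: true, reason: raw };
}

const numericReason = validateRejectReason(123); // previously threw on .trim()
const missingReason = validateRejectReason(undefined);
const validReason = validateRejectReason("duplicate of an earlier task");
```

Typing the input as `unknown` forces the `typeof` check at compile time, so the guard cannot be skipped by accident.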


…t idempotency, kind boundary, reason type guard

- Move synthetic-tasks.json from test/fixtures/ to src/fixtures/. The fixture is a production asset (the live-flip runbook gates on synthetic-all), so it must ship under the package boundary.
- init: drop process.exitCode = 1 when a token already exists. Idempotent re-runs in setup scripts now exit 0; nonzero is reserved for actual write failures.
- POST /orchestrator/tasks: force kind='synthetic' since the route is mode-gated. Trusting body.kind would let a client write live/shadow tasks into the synthetic namespace.
- POST /tasks/<id>/transition reject: type-guard reason before .trim(). A non-string reason now returns 400 invalid_reason instead of crashing to 500.
… (companion to 06924c4 fixture relocation)

The previous commit added src/fixtures/synthetic-tasks.json but the staging step missed deleting the original at test/fixtures/. The runtime resolver already points at src/fixtures/, so the leftover was unreferenced — this just removes the dead copy so the move is complete.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@PeterPlatinum

Bot review followups landed

Pushed `06924c4985` + `799c312aa5` addressing four of the five bot findings:

| Finding | Source | Fix |
| --- | --- | --- |
| Production gate reads from `test/fixtures/` | Greptile | Moved `synthetic-tasks.json` → `src/fixtures/`. Resolver updated; new test verifies the relocated path. |
| `init` exits 1 when token already exists | Greptile | Removed `process.exitCode = 1` so re-runs are idempotent (`exit 0`). Added `init: idempotent re-run (exit 0)` test. |
| Synthetic-mode submit honors `body.kind` | Codex | Forced `kind: "synthetic"` since the route is already mode-gated. New test asserts `{kind: "live"}` payload still lands in the synthetic namespace. |
| `reason.trim()` crashes on non-string | Codex | Type-guarded with `typeof reason !== "string"` before `.trim()`. New test asserts `{reason: 123}` returns 400 `invalid_reason` instead of 500. |

Deferred: the `in_progress` expiry gap (Greptile P2) is dormant in v0 (synthetic mode never produces real `in_progress` liveness), but matters before the shadow/live cutover. Tracked in #72095.

Local: `pnpm test extensions/orchestrator` → 175/175

CI is currently red on this branch due to upstream lockfile drift (`extensions/diagnostics-prometheus/package.json` added `@openclaw/plugin-sdk@workspace:*` without a corresponding `pnpm-lock.yaml` refresh in 0f2e7510cb). Once that lands, this branch should rebase green.

@clawsweeper

clawsweeper Bot commented Apr 26, 2026

Closing this as better suited for ClawHub/community plugin work after Codex automated review.

Close as ClawHub/plugin work. Current main does not contain the orchestrator plugin, and the PR adds an optional heavy orchestration layer using plugin-style CLI, HTTP route, and service surfaces that OpenClaw already exposes. The project vision directs optional capabilities and heavy orchestration layers away from core unless there is explicit maintainer product sponsorship.

Best possible solution:

Close this OpenClaw core PR and move the orchestrator work to an external ClawHub/npm plugin that uses the existing plugin CLI, HTTP route, and service APIs. If external implementation exposes a concrete missing SDK seam, open a narrow plugin API design issue or a maintainer-sponsored core proposal instead.

What I checked:

  • Project scope guardrail: VISION.md says optional capability should usually ship as plugins, plugin discovery/promotion belongs in ClawHub, and the bar for adding optional plugins to core is intentionally high. (VISION.md:52, 6d60b035b4e7)
  • Heavy orchestration guardrail: VISION.md lists agent-hierarchy frameworks and heavy orchestration layers as things OpenClaw will not merge by default for now. (VISION.md:106, 6d60b035b4e7)
  • External plugin path exists: Plugin docs state plugins extend OpenClaw with new capabilities and do not need to be added to the OpenClaw repository; they can be published to ClawHub or npm and installed by users. Public docs: docs/plugins/building-plugins.md. (docs/plugins/building-plugins.md:11, 6d60b035b4e7)
  • Needed plugin APIs already exist: Current plugin API includes registerHttpRoute, registerCli, and registerService, matching the PR's claimed implementation surfaces without showing a missing core SDK seam. (src/plugins/types.ts:2088, 6d60b035b4e7)
  • Not implemented on current main: Current main has no extensions/orchestrator tree and no orchestrator labeler/runtime command strings such as synthetic-all, shadow-summary, or orchestrator-bearer. (6d60b035b4e7)
  • PR discussion handled review followups: The PR discussion records useful bot-review fixes in 06924c4 and 799c312, while deferring the remaining in_progress expiry design to #72095 ("orchestrator store: in_progress tasks have no expiry path"). (799c312aa59e)

So I’m closing this as a scope-fit item for the plugin/community path rather than keeping it open as an OpenClaw core request.

Codex Review notes: model gpt-5.5, reasoning high; reviewed against 6d60b035b4e7.

@clawsweeper clawsweeper Bot closed this Apr 26, 2026