Skip to content

feat: add experimental Codex session supervisor#207

Closed
ndycode wants to merge 1 commit intomainfrom
git-clean/pr147-session-supervisor-core
Closed

feat: add experimental Codex session supervisor#207
ndycode wants to merge 1 commit intomainfrom
git-clean/pr147-session-supervisor-core

Conversation

@ndycode
Copy link
Owner

@ndycode ndycode commented Mar 21, 2026

Summary

  • add the experimental session supervisor runtime and route Codex CLI launches through it when enabled
  • keep the feature disabled by default behind codexCliSessionSupervisor
  • add abort-aware quota probing and the config/schema plumbing the supervisor needs
  • add wrapper and supervisor regressions covering startup sync, abort handling, env overrides, and session restart behavior
  • keep this replacement PR scoped to the supervisor runtime surface only

Validation

  • npm run test:session-supervisor:smoke
  • npm run typecheck
  • npm run build

note: greptile review for oc-chatgpt-multi-auth. cite files like lib/foo.ts:123. confirm regression tests + windows concurrency/token redaction coverage.

Greptile Summary

this pr introduces the codexCliSessionSupervisor feature: a new scripts/codex-supervisor.js runtime (~2400 lines) that wraps interactive codex cli sessions, monitors per-account quota snapshots, pre-warms a successor account when the current account nears its 5h/7d quota threshold, and transparently restarts sessions via codex resume <session-id> with a freshly selected account. the feature is disabled by default and gated by the codexCliSessionSupervisor config field / CODEX_AUTH_CLI_SESSION_SUPERVISOR env var.

key design highlights:

  • file-based supervisor lock with heartbeat renewal, win32 EPERM/EBUSY retry, and stale-lock eviction — concurrency is explicitly handled and tested
  • abort-aware quota probe cache with in-flight promise deduplication: concurrent callers wait on the same fetch; an aborting caller yields QuotaProbeUnavailableError to non-aborting co-waiters instead of cancelling their probe
  • token safety: buildCodexChildEnv() strips all CODEX_AUTH_* and CODEX_MULTI_AUTH_* env keys from the child process environment so token material is never inherited by the real codex binary
  • pre-warm overlap: successor account selection begins while the current session is still active, reducing cutover latency
  • windows path safety: symlinks are skipped in the jsonl session scan; safeUnlink retries on EPERM/EBUSY exclusively on win32

one P1 issue found in test/codex-supervisor.test.ts: process.env.PATH is overwritten globally in a test without being restored by afterEach, which can corrupt the test process PATH on linux/mac runners. two minor P2 issues: a redundant double abort-check in lib/quota-probe.ts and the safeUnlink retry timer not calling .unref().

Confidence Score: 4/5

  • safe to merge after fixing the PATH leak in the test; the supervisor runtime itself is solid
  • the core supervisor logic is well-architected with thorough concurrency coverage (lock serialization, heartbeat, abort propagation, win32 retries all tested). the P1 issue is isolated to a test — it doesn't affect production behavior but can silently corrupt the PATH env for downstream tests on non-windows CI runners. fixing it is a one-liner. the two P2s are minor style issues. feature is disabled by default so no production risk even if shipped as-is.
  • test/codex-supervisor.test.ts — process.env.PATH mutation at line 574 needs restoration

Important Files Changed

Filename Overview
scripts/codex-supervisor.js new 2418-line supervisor runtime: file-lock with heartbeat, abort-aware quota probe cache with in-flight deduplication, session binding via jsonl scan, child restart escalation (SIGINT→SIGTERM→SIGKILL with win32 awareness). module-level maps (snapshotProbeCache, sessionRolloutPathById) are shared across concurrent supervisor calls by design. one minor issue: safeUnlink retry timer not .unref()'d.
test/codex-supervisor.test.ts 2715-line vitest suite covering lock serialization, heartbeat loss, abort propagation, cache eviction, pre-warm/commit flow, windows EPERM/EBUSY retries, and interactive supervision. one test (strips CODEX_AUTH env keys) mutates process.env.PATH globally without restoring it in afterEach, which can corrupt the PATH for the rest of the suite on non-windows runners.
lib/quota-probe.ts adds signal?: AbortSignal to ProbeCodexQuotaOptions with correct abort forwarding to the internal AbortController and event-listener cleanup; one redundant consecutive throwIfQuotaProbeAborted call before and inside the same try {} block.
scripts/codex-routing.js adds findPrimaryCodexCommand, hasTopLevelHelpOrVersionFlag, and splitCodexCommandArgs helpers with proper --config= inline flag, -- stop-parsing, and case-normalization. clean and well-scoped.
scripts/codex.js routes codex launches through supervisor when enabled; startup auto-sync logic correctly deferred to syncBeforeSupervisorLaunch so it runs exactly once per forwarded or interactive session. abort sentinel (130) skips post-run sync correctly.
lib/config.ts adds codexCliSessionSupervisor: false default and getCodexCliSessionSupervisor accessor via existing resolveBooleanSetting pattern; synchronous, side-effect free, documented correctly.
test/codex-bin-wrapper.test.ts adds supervisor integration tests via real subprocess runs: non-interactive forward, auto-sync deduplication, abort sentinel skip, and interactive session with post-run sync. correctly copies codex-supervisor.js into fixture dir.

Sequence Diagram

sequenceDiagram
    participant CX as codex.js
    participant SUP as codex-supervisor.js
    participant LK as StorageLock
    participant AM as AccountManager
    participant QP as quota-probe
    participant CH as Codex child

    CX->>SUP: runCodexSupervisorIfEnabled()
    SUP->>SUP: loadSupervisorRuntime() (dynamic dist/ import)
    SUP->>LK: withSupervisorStorageLock()
    LK->>AM: loadFromDisk()
    AM-->>LK: manager
    LK->>QP: probeAccountSnapshot(account, signal)
    QP-->>LK: quota snapshot (cached or live)
    LK-->>SUP: ready account
    SUP->>CH: spawnRealCodex(codexBin, args) [env stripped of CODEX_AUTH_*]
    loop monitor loop (pollMs)
        SUP->>QP: probeAccountSnapshot(currentAccount)
        QP-->>SUP: pressure
        alt prewarm threshold crossed
            SUP-->>SUP: prepareResumeSelection() [background]
        end
        alt rotate threshold + idle
            SUP->>CH: requestChildRestart (SIGINT→SIGTERM→SIGKILL)
            SUP->>LK: markCurrentAccountForRestart()
            SUP->>SUP: commitPreparedSelection() or ensureLaunchableAccount()
            SUP->>CH: spawnRealCodex(codexBin, resume args)
        end
    end
    CH-->>SUP: exit code
    SUP-->>CX: supervisedExitCode (null = not handled)
    CX->>CX: autoSyncManagerActiveSelectionIfEnabled() if needed
Loading

Fix All in Codex

Prompt To Fix All With AI
This is a comment left during a code review.
Path: test/codex-supervisor.test.ts
Line: 574

Comment:
**`process.env.PATH` mutation leaks across test suite**

`process.env.PATH` is set to `"C:\\Windows\\System32"` here but `PATH` is not in the `envKeys` list that `afterEach` restores. on every platform except win32 this permanently replaces `PATH` for all subsequent tests in the same vitest worker, breaking any test that later needs to fork a real process (npm, node, etc.).

fix: either save/restore `PATH` manually or include it in `envKeys`:

```suggestion
		const originalPath = process.env.PATH;
		process.env.PATH = "C:\\Windows\\System32";
		try {
			const env = supervisorTestApi?.buildCodexChildEnv?.();
			expect(env).toBeTruthy();
			expect(env?.CODEX_AUTH_ACCESS_TOKEN).toBeUndefined();
			expect(env?.CODEX_AUTH_REFRESH_TOKEN).toBeUndefined();
```

and close with `} finally { process.env.PATH = originalPath; }`. alternatively just add `"PATH"` to the `envKeys` tuple at the top of the file.

How can I resolve this? If you propose a fix, please make it concise.

---

This is a comment left during a code review.
Path: lib/quota-probe.ts
Line: 344-351

Comment:
**Consecutive `throwIfQuotaProbeAborted` calls without an `await` between them**

the call on the line before `try {` and the call on the first line inside `try {` check the same condition with no async gap between them — only one is needed. the outer check (before `try`) is the useful one since it throws outside the catch scope; the inner one is a no-op.

```suggestion
	for (const model of models) {
		throwIfQuotaProbeAborted(options.signal);
		try {
			const instructions = await getCodexInstructions(model);
```

How can I resolve this? If you propose a fix, please make it concise.

---

This is a comment left during a code review.
Path: scripts/codex-supervisor.js
Line: 722-725

Comment:
**`safeUnlink` retry timer not `.unref()`'d — windows exit-delay risk**

the backoff `setTimeout` inside the retry loop holds the event loop open. per project convention (AGENTS.md: "Windows filesystem safety: all `fs.rm` calls in scripts/tests use `removeWithRetry` with EBUSY/EPERM/ENOTEMPTY backoff"), retry timers should be unref'd so a supervisor cleanup during process exit on windows doesn't stall the process.

```suggestion
			const timer = setTimeout(resolve, DEFAULT_UNLINK_RETRY_BASE_DELAY_MS * (attempt + 1));
			if (typeof timer.unref === "function") timer.unref();
			await new Promise((resolve) => timer);
```

or inline it similarly to how `sleep()` calls `timer.unref()` already.

How can I resolve this? If you propose a fix, please make it concise.

Last reviewed commit: "feat: add experiment..."

Greptile also left 1 inline comment on this PR.

@chatgpt-codex-connector
Copy link

Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits.
Credits must be used to enable repository wide code reviews.

@coderabbitai
Copy link
Contributor

coderabbitai bot commented Mar 21, 2026

Important

Review skipped

Auto reviews are limited based on label configuration.

🚫 Review skipped — only excluded labels are configured. (1)
  • skip-review

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: f91f4e6a-ccd7-49b3-a884-f91a3caefe8f

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

  • 🔍 Trigger review
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch git-clean/pr147-session-supervisor-core
✨ Simplify code
  • Create PR with simplified code
  • Commit simplified code in branch git-clean/pr147-session-supervisor-core

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

process.env.CODEX_AUTH_REFRESH_TOKEN = "refresh-secret";
process.env.CODEX_AUTH_CLI_SESSION_SUPERVISOR = "1";
process.env.CODEX_HOME = "/tmp/codex-home";
process.env.PATH = "C:\\Windows\\System32";
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 process.env.PATH mutation leaks across test suite

process.env.PATH is set to "C:\\Windows\\System32" here but PATH is not in the envKeys list that afterEach restores. on every platform except win32 this permanently replaces PATH for all subsequent tests in the same vitest worker, breaking any test that later needs to fork a real process (npm, node, etc.).

fix: either save/restore PATH manually or include it in envKeys:

Suggested change
process.env.PATH = "C:\\Windows\\System32";
const originalPath = process.env.PATH;
process.env.PATH = "C:\\Windows\\System32";
try {
const env = supervisorTestApi?.buildCodexChildEnv?.();
expect(env).toBeTruthy();
expect(env?.CODEX_AUTH_ACCESS_TOKEN).toBeUndefined();
expect(env?.CODEX_AUTH_REFRESH_TOKEN).toBeUndefined();

and close with } finally { process.env.PATH = originalPath; }. alternatively just add "PATH" to the envKeys tuple at the top of the file.

Prompt To Fix With AI
This is a comment left during a code review.
Path: test/codex-supervisor.test.ts
Line: 574

Comment:
**`process.env.PATH` mutation leaks across test suite**

`process.env.PATH` is set to `"C:\\Windows\\System32"` here but `PATH` is not in the `envKeys` list that `afterEach` restores. on every platform except win32 this permanently replaces `PATH` for all subsequent tests in the same vitest worker, breaking any test that later needs to fork a real process (npm, node, etc.).

fix: either save/restore `PATH` manually or include it in `envKeys`:

```suggestion
		const originalPath = process.env.PATH;
		process.env.PATH = "C:\\Windows\\System32";
		try {
			const env = supervisorTestApi?.buildCodexChildEnv?.();
			expect(env).toBeTruthy();
			expect(env?.CODEX_AUTH_ACCESS_TOKEN).toBeUndefined();
			expect(env?.CODEX_AUTH_REFRESH_TOKEN).toBeUndefined();
```

and close with `} finally { process.env.PATH = originalPath; }`. alternatively just add `"PATH"` to the `envKeys` tuple at the top of the file.

How can I resolve this? If you propose a fix, please make it concise.

Fix in Codex

@ndycode ndycode added the needs-human-review Flagged by automated PR quality screening; maintainer review required. label Mar 21, 2026
@ndycode
Copy link
Owner Author

ndycode commented Mar 21, 2026

Superseded by a cleaner split:\n- #212 for the config/quota plumbing\n- #213 for the stacked supervisor runtime\n\nClosing this 5.7k-line PR so reviewers can work through the same behavior in two smaller passes.

@ndycode ndycode closed this Mar 21, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

needs-human-review Flagged by automated PR quality screening; maintainer review required.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant