Skip to content

fix(auth): serialize OAuth refresh across agents to fix #26322#67876

Merged
vincentkoc merged 10 commits intomainfrom
fix/oauth-refresh-cross-agent-lock
Apr 17, 2026
Merged

fix(auth): serialize OAuth refresh across agents to fix #26322#67876
vincentkoc merged 10 commits intomainfrom
fix/oauth-refresh-cross-agent-lock

Conversation

@visionik
Copy link
Copy Markdown
Contributor

@visionik visionik commented Apr 17, 2026

Problem

When N agents share a single OAuth profile (e.g. 18 codex agents on openai-codex:default), the pre-fix implementation held a per-agentDir file lock during refresh. The lock paths are distinct per agent, so concurrent agents don't serialize at all. When the access token expires, all N agents refresh in parallel; OpenAI rotates the refresh token on every successful call, so the losers get HTTP 401 refresh_token_reused and cascade to model fallback even though auth is actually valid.

Observed in production: 499 refresh failures / 24h, codex unusable without manual re-auth, token-expiry storm every ~12h (#26322).

Fix

Five coordinated mechanisms in src/agents/auth-profiles/:

  1. Cross-agent file lock keyed on (provider, profileId). refreshOAuthTokenWithLock now acquires $STATE_DIR/locks/oauth-refresh/sha256-<hex64> where the hash input is provider\0profileId (NUL-separated so concat-boundary collisions are impossible). All agents sharing a provider+profileId serialize on this single coordination point, cross-process. Two profiles that happen to share a profileId across different providers get distinct locks, so they don't needlessly serialize or risk file-lock timeout against each other.

  2. Nested per-store lock for the refresh critical section. Inside the global refresh lock, the refresh also acquires the existing AUTH_STORE_LOCK_OPTIONS lock on the agent's own auth-profiles.json. Lock acquisition order is always refresh → per-store; non-refresh code only takes the per-store lock, so no cycle is possible. This stops the refresh's read-modify-writes from losing updates against concurrent updateAuthProfileStoreWithLock callers.

  3. In-process refresh queue. A Map<provider\0profileId, Promise> chains same-PID callers so the re-entrant withFileLock cannot let two concurrent callers slip through in the same process (HELD_LOCKS short-circuits on PID match). Tested with a 10-caller FIFO stress test asserting maxInFlight === 1.

  4. Inside-the-lock main-store recheck + mirror-to-main. After acquiring both locks, doRefreshOAuthTokenWithLock first checks if the main-agent store already has fresh credentials (another agent may have just completed and mirrored) and adopts if so — turning N serialized refreshes into 1 refresh + (N-1) adoptions. After a successful refresh from a secondary agent, the new credential is mirrored back into the main-agent store while still inside the refresh lock, closing the cross-process race window where a peer could acquire the refresh lock between our lock release and our main-store write. Mirror writes also take the main-store lock via updateAuthProfileStoreWithLock so the read-modify-write cannot race with other main-store writers.

  5. Identity-gated credential copying. All four copy sites (pre-refresh adopt, inside-lock adopt, catch-block main-inherit, mirror-to-main) go through a single isSafeToCopyOAuthIdentity gate. The gate accepts matching accountId/email, accepts the pure upgrade case (existing has no identity, incoming does), and refuses positive mismatch, identity regression (incoming drops an identity field existing had), and non-overlapping-field pairs. This closes the CWE-284 cross-account-leak vector that the previous provider-only adoption gate allowed.

Supporting constants and helpers

  • OAUTH_REFRESH_LOCK_OPTIONS: stale: 180_000 (3 min) — was 30s. A refresh that legitimately takes >30s would have allowed peers to reclaim the lock and reintroduce the storm.
  • OAUTH_REFRESH_CALL_TIMEOUT_MS = 120_000: hard deadline on the refresh call via withRefreshCallTimeout. The invariant OAUTH_REFRESH_CALL_TIMEOUT_MS < OAUTH_REFRESH_LOCK_OPTIONS.stale is pinned by a unit test. Documented limitation: JS promises aren't cancellable, so the deadline releases our locks but the underlying work can keep running. isRefreshTokenReusedError recovery remains the backstop; full AbortSignal plumbing is a follow-up.
  • resolveOAuthRefreshLockPath(provider, profileId): sha256 hash input is NUL-separated for collision resistance.

Relationship to earlier PRs

Tests

The test surface grew significantly over the review cycle. Current inventory:

File Tests Purpose
oauth.concurrent-20-agents.test.ts 1 The #26322 regression test. 20 distinct agentDirs seeded with the same expired OAuth profile + accountId, all concurrently call resolveApiKeyForProfile. Asserts: exactly one refresh HTTP call, all 20 receive the same fresh apiKey. Before this fix the same test produces 20 refresh calls (19 of which 401 against a real IdP).
oauth.mirror-refresh.test.ts 12 Mirror-to-main happy path, mirror refuses across identity mismatch / non-oauth / fresher-expires / identity regression, plugin-refresh branch also mirrors, pre-refresh adopt happy path, catch-block main-inherit via concurrent refresh, upgrade-into-pre-capture-main.
oauth.adopt-identity.test.ts 9 Each of the 3 adopt sites exercised with (a) accountId mismatch → refused, (b) sub-no-identity / main-has-identity upgrade → accepted, (c) non-overlapping fields → refused.
oauth-identity.test.ts 51 Direct truth table + fuzz for isSameOAuthIdentity (strict reference) and isSafeToCopyOAuthIdentity (unified copy gate). Fuzz suites include a cross-rule monotonicity property (strict accepts → unified accepts) over 1,000 random pairs, and a distinct-accountId-never-accepts invariant.
oauth-lock-path.test.ts 13 Path-safety invariants for adversarial profile ids (., .., backslashes, null bytes, newlines, long ids), distinct (provider, profileId) never collide, concat-boundary resistance (("a", "b:c")("a:b", "c")), clean-install works with no pre-existing locks/ dir. Fuzz (seeded Mulberry32): ~2,700 adversarial inputs total asserting safe sha256-hex64 basenames under <stateDir>/locks/oauth-refresh.
oauth-refresh-error.test.ts 14 isRefreshTokenReusedError classifier: positive cases (snake_case code, OpenAI natural-language, JSON-wrapped 401, string throws, single- and multi-level Error.cause chains), negative cases (unrelated errors, null/undefined), fuzz over marker-with-noise and marker-free random messages (500 iterations each).
oauth-refresh-queue.test.ts 4 FIFO serialization, gate-releases-on-throw, reset idempotency, 10-caller burst asserting startOrder === endOrder and maxInFlight === 1.
oauth-refresh-timeout.test.ts 4 Invariants pinning OAUTH_REFRESH_CALL_TIMEOUT_MS < OAUTH_REFRESH_LOCK_OPTIONS.stale, reasonable floor (≥30s), ≥30s safety margin, stale >60s floor.
paths-direct-import.test.ts 11 Direct-import coverage for the pre-existing path helpers (v8 coverage attribution fix when called transitively through the re-export chain).
oauth.test.ts, oauth.fallback-to-main-agent.test.ts, oauth.openai-codex-refresh-fallback.test.ts (pre-existing) 30 All still pass unchanged.

Total: 119 tests in the auth-profiles scoped suite, all passing in isolation.

Files changed

File Purpose
src/agents/auth-profiles/path-resolve.ts New resolveOAuthRefreshLockPath(provider, profileId)
src/agents/auth-profiles/paths.ts Re-export of the new helper
src/agents/auth-profiles/constants.ts New OAUTH_REFRESH_LOCK_OPTIONS + OAUTH_REFRESH_CALL_TIMEOUT_MS
src/agents/auth-profiles/oauth.ts The race fix itself: nested locks, in-process queue, inside-lock recheck, mirror-inside-lock, identity-gated copy, refresh-call timeout wrapper, isSafeToCopyOAuthIdentity, isSameOAuthIdentity (strict reference), normalizers

Follow-ups explicitly out of scope

Before: each agent held its own per-agentDir file lock during refresh, so N
agents sharing a single OAuth profile would all refresh in parallel when the
token expired. Because OpenAI rotates the refresh token on every successful
call, the losers of that race get HTTP 401 refresh_token_reused and cascade
to model fallback even though auth is actually valid. With 18 codex agents
this caused a token-expiry storm every ~12h.

After:
- refreshOAuthTokenWithLock now takes a cross-agent file lock at
  $STATE_DIR/locks/oauth-refresh/sha256-<hex64>. All agents sharing a
  profileId serialize on this single coordination point.
- An in-process Promise queue keyed on profileId chains same-PID callers
  so the re-entrant file lock cannot let two concurrent refreshes slip
  through within the same process.
- Inside the lock, an explicit main-store recheck adopts a freshly
  mirrored credential from the main agent instead of issuing a duplicate
  HTTP refresh. This collapses N serialized refreshes into 1 refresh +
  (N-1) adoptions.
- After a successful refresh from a secondary agent, the new credential
  is mirrored best-effort back into the main-agent store so peers adopt
  rather than race. Main-agent refreshes are unchanged.

Adapted from HeroSizy's approach in #51476; that PR's head was 7,755
commits behind origin/main and the surrounding auth-profiles files had
been split/refactored, so this is a port against the current shape of
oauth.ts/store.ts/path-resolve.ts, not a rebase.

Tests:
- oauth.concurrent-20-agents.test.ts: 20 distinct agentDirs sharing one
  profileId/accountId fire resolveApiKeyForProfile concurrently; exactly
  one refresh call is observed and all 20 receive the same fresh token.
- oauth.mirror-refresh.test.ts: sub-agent refresh mirrors into the main
  store; main-agent self-refresh is unchanged.
- oauth-lock-path.test.ts: unit + fuzz coverage for
  resolveOAuthRefreshLockPath (path-safety invariants, hashing, length
  bound, no separators/.. in basename, 2,700+ random adversarial inputs).

Co-authored-by: HeroSizy <HeroSizy@users.noreply.github.com>
@visionik visionik requested a review from a team as a code owner April 17, 2026 00:52
@aisle-research-bot
Copy link
Copy Markdown

aisle-research-bot bot commented Apr 17, 2026

🔒 Aisle Security Analysis

We found 3 potential security issue(s) in this PR:

# Severity Title
1 🟡 Medium Symlink-following file lock creation under configurable state directory (CWE-59)
2 🟡 Medium OAuth credential mirroring can overwrite main store when existing entry lacks identity (potential cross-account credential poisoning)
3 🟡 Medium Race condition: OAuth refresh timeout releases global refresh lock while underlying refresh still runs
1. 🟡 Symlink-following file lock creation under configurable state directory (CWE-59)
Property Value
Severity Medium
CWE CWE-59
Location src/plugin-sdk/file-lock.ts:90-173

Description

The cross-agent OAuth refresh lock uses withFileLock() which creates a sidecar lock file at <lockPath>.lock. The parent directory is created with mkdir({recursive:true}) and then canonicalized with realpath(dir).

Because resolveStateDir() can be overridden via OPENCLAW_STATE_DIR to an arbitrary path, if the chosen state directory (or any of its subdirectories like locks/oauth-refresh) is world-writable or attacker-controlled, an attacker can pre-create a symlinked directory component (e.g., locks/oauth-refresh -> /etc) and cause the process to create/remove lock files in an unintended location.

Key points:

  • Input/control: OPENCLAW_STATE_DIR (environment) controls the base directory used by resolveOAuthRefreshLockPath().
  • Sink: acquireFileLock() creates parent directories and opens the lock file using a path string (follows symlinks in parent path components).
  • Impact: writing/removing files outside the intended state directory; potential privilege escalation if the process runs with elevated privileges, or interference/DoS across users when the directory is shared.

Vulnerable code paths:

  • resolveOAuthRefreshLockPath() places locks under path.join(resolveStateDir(), "locks", "oauth-refresh", safeId).
  • acquireFileLock() creates the directory and then opens ${normalizedFile}.lock without guarding against symlinked parent directories.

Vulnerable code:

// src/plugin-sdk/file-lock.ts
const dir = path.dirname(resolved);
await fs.mkdir(dir, { recursive: true });
const realDir = await fs.realpath(dir);
...
const lockPath = `${normalizedFile}.lock`;
const handle = await fs.open(lockPath, "wx");

Recommendation

Harden lock directory creation and lockfile opening against symlink/hardlink tricks:

  1. Constrain/validate OPENCLAW_STATE_DIR when used for security-sensitive storage (locks/credentials):

    • Refuse world-writable directories unless they have the sticky bit set and are user-owned.
    • Ensure the state dir and locks/ subtree are owned by the current uid and not group/world-writable.
  2. Create lock directories with restrictive permissions (best-effort across platforms):

await fs.mkdir(dir, { recursive: true, mode: 0o700 });
await fs.chmod(dir, 0o700).catch(() => undefined);
  1. Detect symlinked path components before use:

    • lstat each directory component (at least the final lock directory) and reject if isSymbolicLink().
  2. On POSIX, consider using a safer pattern based on opening the directory and using openat semantics (not directly exposed in Node core everywhere) or a vetted lock library that supports O_NOFOLLOW/O_EXCL style guarantees.

At minimum, ensure the lock directory is created under a user-private location (0700) and verify it is not a symlink before opening/removing lock files.

2. 🟡 OAuth credential mirroring can overwrite main store when existing entry lacks identity (potential cross-account credential poisoning)
Property Value
Severity Medium
CWE CWE-284
Location src/agents/auth-profiles/oauth.ts:468-471

Description

The new identity gate isSafeToCopyOAuthIdentity(existing, incoming) explicitly allows copying when the credential being overwritten (existing) has no identity metadata (no accountId/email):

  • When existing has no identity, the function returns true even if incoming carries identity ("upgrade tolerance").
  • mirrorRefreshedCredentialIntoMainStore() uses this unified gate before overwriting the main-agent store.
  • If the main agent has a legacy/pre-upgrade OAuth credential stored without identity markers, a different agent (or process) refreshing a credential for the same profileId+provider but a different account can mirror its refreshed credential into the main store, effectively switching which account the main agent (and peers adopting from it) will use.

This is a classic cross-account credential-poisoning risk when multiple accounts are present under the same profileId in the same state directory and the main credential is missing identity metadata.

Vulnerable logic:

// oauth.ts
if (aHasIdentity) {
  return false;
}// Existing has no identity ... incoming adds identity (allowed)
return true;

and the mirror path:

if (existing && !isSafeToCopyOAuthIdentity(existing, params.refreshed)) {
  return false;
}
store.profiles[params.profileId] = { ...params.refreshed };

Recommendation

Tighten the mirror (sub → main) direction so that a refreshed sub-agent credential cannot overwrite a main-store credential when the main-store credential lacks identity evidence.

Two practical options:

  1. Direction-specific rule (recommended):
    • Keep the current upgrade-tolerant behavior for adoption main → sub.
    • Make mirroring strict: only allow overwrite when isSameOAuthIdentity(existing, incoming) is true, or when existing is absent.

Example (mirror-only strictness):

// in mirrorRefreshedCredentialIntoMainStore updater
if (existing) {
  if (!isSameOAuthIdentity(existing, params.refreshed)) {
    return false;
  }
}
  1. Require positive binding when overwriting an existing main entry:
    • If existing has no identity but incoming does, refuse overwrite unless additional stable binding exists (e.g., provider subject/issuer, token fingerprint, or a persisted installation-scoped account binding).

This prevents cross-account poisoning of legacy main-store entries while still permitting fresh installs and same-identity updates.

3. 🟡 Race condition: OAuth refresh timeout releases global refresh lock while underlying refresh still runs
Property Value
Severity Medium
CWE CWE-362
Location src/agents/auth-profiles/oauth.ts:266-278

Description

The OAuth refresh path adds withRefreshCallTimeout() to bound the time spent waiting for an upstream refresh call. However, the timeout does not cancel the underlying refresh promise. When the timeout fires, the caller throws and unwinds out of withFileLock(...), releasing the global refresh lock (resolveOAuthRefreshLockPath(...)) even though the original refresh call is still in-flight.

Impact:

  • A second agent can acquire the global refresh lock and start another refresh while the first (timed-out) refresh is still running.
  • Many OAuth providers rotate/consume refresh tokens on use; concurrent/overlapping refreshes can trigger refresh_token_reused errors and repeated failures.
  • This can create an auth-refresh storm / sustained refresh failure loop (availability issue) and inconsistent token state across agents.

Vulnerable flow (timeout causes early lock release while fn() continues):

return await withFileLock(globalRefreshLockPath, OAUTH_REFRESH_LOCK_OPTIONS, async () =>
  withFileLock(authPath, AUTH_STORE_LOCK_OPTIONS, async () => {
    const pluginRefreshed = await withRefreshCallTimeout(
      `refreshProviderOAuthCredentialWithPlugin(${cred.provider})`,
      OAUTH_REFRESH_CALL_TIMEOUT_MS,
      () => refreshProviderOAuthCredentialWithPlugin({ provider: cred.provider, context: cred }),
    );// ...
  })
);

and the timeout wrapper:

timeoutHandle = setTimeout(() => {
  reject(new Error(`OAuth refresh call "${label}" exceeded hard timeout (${timeoutMs}ms)`));
}, timeoutMs);
fn().then(resolve, reject);

Once rejected, withFileLock(...).finally(() => lock.release()) runs and releases the refresh lock, but fn() continues executing in the background and may still complete the token exchange (burning/rotating the refresh token).

Recommendation

Ensure the global refresh lock is held until the underlying refresh attempt is actually finished (success or failure), or make the refresh call cancellable.

Practical options:

  1. Plumb AbortSignal through refresh APIs (best):

    • Create an AbortController per refresh attempt.
    • On timeout, call abort() and ensure the HTTP client and plugin refresh code respects the signal.
    • Only release the global lock after the refresh promise settles.
  2. Do not release the refresh lock on timeout (safer availability-wise):

    • If you cannot cancel work, consider treating the timeout as local waiting timeout but keep a process-held lock until the in-flight refresh promise resolves/rejects, so no other agent starts a second refresh.

Example sketch (cancellable):

async function withTimeoutAbort<T>(ms: number, fn: (signal: AbortSignal) => Promise<T>): Promise<T> {
  const ac = new AbortController();
  const t = setTimeout(() => ac.abort(new Error(`timeout after ${ms}ms`)), ms);
  try {
    return await fn(ac.signal);
  } finally {
    clearTimeout(t);
  }
}

await withFileLock(globalRefreshLockPath, OAUTH_REFRESH_LOCK_OPTIONS, async () => {
  const pluginRefreshed = await withTimeoutAbort(OAUTH_REFRESH_CALL_TIMEOUT_MS, (signal) =>
    refreshProviderOAuthCredentialWithPlugin({ provider: cred.provider, context: cred, signal }),
  );
});

If abort plumbing is not possible immediately, add additional guardrails such as: storing a per-profile "refresh in progress" marker in the store under lock and making other agents wait/retry instead of starting a second refresh when a previous attempt recently timed out.


Analyzed PR: #67876 at commit 5cc7162

Last updated on: 2026-04-17T05:10:49Z

@openclaw-barnacle openclaw-barnacle bot added agents Agent runtime and tooling size: L maintainer Maintainer-authored PR labels Apr 17, 2026
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 69e5f39165

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment thread src/agents/auth-profiles/oauth.ts Outdated
// and nuke each other with `refresh_token_reused`. See issue #26322.
const globalRefreshLockPath = resolveOAuthRefreshLockPath(params.profileId);

const result = await withFileLock(globalRefreshLockPath, OAUTH_REFRESH_LOCK_OPTIONS, async () => {
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Keep auth-store lock while refreshing credentials

This refresh path now serializes on resolveOAuthRefreshLockPath(profileId) only, but it still calls saveAuthProfileStore(store, params.agentDir) to write auth-profiles.json. Other mutators (for example updateAuthProfileStoreWithLock in src/agents/auth-profiles/store.ts) still coordinate on the per-store authPath lock, so refresh writes can now race with usage/profile updates and lose data by last-writer-wins. A common failure mode is: updater loads stale store under authPath, refresh writes new OAuth tokens, updater saves stale snapshot and reverts the refresh, causing repeated refresh failures.

Useful? React with 👍 / 👎.

Comment thread src/agents/auth-profiles/oauth.ts Outdated
Comment on lines +294 to +295
mainStore.profiles[params.profileId] = { ...params.refreshed };
saveAuthProfileStore(mainStore, undefined);
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Serialize mirror writes to the main auth store

mirrorRefreshedCredentialIntoMainStore performs a read-modify-write of the main auth-profiles.json without taking the main store lock. Since this runs from secondary agents, it can race with main-agent flows that mutate the same file under updateAuthProfileStoreWithLock, and the unlocked save can overwrite unrelated concurrent updates (profile edits, usage stats, or newer token/state changes) with a stale snapshot. The mirror write should use the same per-store lock/updater path as other main-store writers.

Useful? React with 👍 / 👎.

@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps bot commented Apr 17, 2026

Greptile Summary

This PR introduces a three-layer serialization strategy to eliminate the refresh_token_reused storm when N agents share a single OAuth profile: a per-profile cross-agent file lock (sha256(profileId)), an in-process promise-chaining queue keyed on profileId, and an inside-the-lock main-store recheck that collapses N serialized refreshes into 1 refresh + (N-1) adoptions. The approach is well-reasoned, and the test suite covers the critical regression path with 20 concurrent agents, mirror behavior, and lock-path fuzz testing.

Confidence Score: 5/5

Safe to merge — all remaining findings are P2 style/improvement observations; the core fix is correct and well-tested.

The three-layer serialization (cross-process file lock, in-process promise queue, inside-lock main-store recheck + mirror) correctly addresses the root cause of #26322. The in-process queue properly chains gates keyed on profileId, the finally always releases, and the inside-lock adoption path prevents redundant HTTP refreshes. The only non-trivial observation is the cross-process race window in the post-lock mirror (P2), which is intentional by design and practically negligible given the microsecond window and existing fallback recovery. Test coverage is thorough including a 20-agent concurrent regression test, mirror behavior tests, and 2,700-input fuzz coverage for the lock path.

No files require special attention beyond the noted cross-process race window in src/agents/auth-profiles/oauth.ts (lines 476–503).

Prompt To Fix All With AI
This is a comment left during a code review.
Path: src/agents/auth-profiles/oauth.ts
Line: 476-503

Comment:
**Mirror-to-main outside the lock creates a cross-process race window**

The mirror happens after the file lock is released. In a cross-process scenario, a second agent could acquire the lock in the window between `fs.rm(lockPath)` completing (lock released) and `mirrorRefreshedCredentialIntoMainStore` writing to the main store. That agent's inside-lock main-store recheck would see stale credentials and proceed to an HTTP refresh — reproducing the `refresh_token_reused` error this PR fixes.

The in-process `refreshQueues` serialization closes this gap for same-process agents (the mirror runs synchronously before the queue gate is released), but it doesn't help when agents live in separate OS processes. The window is very small (microseconds of synchronous JS between the `await fs.rm` completing and `saveAuthProfileStore` writing to main), but it exists.

Moving the mirror inside the `withFileLock` callback would eliminate the race entirely at the cost of holding the lock a little longer. Given the synchronous nature of `mirrorRefreshedCredentialIntoMainStore`, the hold-time increase is negligible:

```typescript
// Inside withFileLock callback, after successful refresh:
store.profiles[params.profileId] = { ...cred, ...result.newCredentials, type: "oauth" };
saveAuthProfileStore(store, params.agentDir);
// Mirror to main while the lock is still held:
if (params.agentDir) {
  mirrorRefreshedCredentialIntoMainStore({ profileId: params.profileId, refreshed: { ...store.profiles[params.profileId] as OAuthCredential } });
}
return result;
```

The existing fallback (`isRefreshTokenReusedError` recovery path) does mitigate the impact of any race that slips through, so this is not blocking.

How can I resolve this? If you propose a fix, please make it concise.

---

This is a comment left during a code review.
Path: src/agents/auth-profiles/constants.ts
Line: 26-40

Comment:
**`OAUTH_REFRESH_LOCK_OPTIONS` is currently identical to `AUTH_STORE_LOCK_OPTIONS`**

The comment explains these are kept separate for independent future tuning, which makes sense. Just worth noting for reviewers: right now the values are byte-for-byte identical, so there's no behavioral difference until one is explicitly tuned. A brief inline note like `// currently same as AUTH_STORE_LOCK_OPTIONS; diverge here when needed` might help future readers avoid accidentally "syncing" them back together.

How can I resolve this? If you propose a fix, please make it concise.

Reviews (1): Last reviewed commit: "Auth: serialize OAuth refresh across age..." | Re-trigger Greptile

Comment thread src/agents/auth-profiles/oauth.ts Outdated
Comment on lines +476 to +503
// Post-lock: mirror the refreshed credential into the main-agent store so
// peer agents can adopt instead of racing. Only meaningful for secondary
// agents; for the main agent this would be a self-write (we're already in
// the main store). Done outside the refresh lock to release it ASAP.
if (result?.newCredentials && params.agentDir) {
const mainPath = resolveAuthStorePath(undefined);
const localPath = resolveAuthStorePath(params.agentDir);
if (mainPath !== localPath) {
const localStore = loadAuthProfileStoreForSecretsRuntime(params.agentDir);
const localProfile = localStore.profiles[params.profileId];
// Only mirror when the local agent has a concrete OAuth profile that we
// just refreshed; if the agent was inheriting from main (no local
// oauth entry) the main store is already the canonical record and no
// mirror is needed.
if (localProfile?.type === "oauth") {
const refreshedCred: OAuthCredential = {
...localProfile,
...result.newCredentials,
type: "oauth",
};
mirrorRefreshedCredentialIntoMainStore({
profileId: params.profileId,
refreshed: refreshedCred,
});
}
}
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Mirror-to-main outside the lock creates a cross-process race window

The mirror happens after the file lock is released. In a cross-process scenario, a second agent could acquire the lock in the window between fs.rm(lockPath) completing (lock released) and mirrorRefreshedCredentialIntoMainStore writing to the main store. That agent's inside-lock main-store recheck would see stale credentials and proceed to an HTTP refresh — reproducing the refresh_token_reused error this PR fixes.

The in-process refreshQueues serialization closes this gap for same-process agents (the mirror runs synchronously before the queue gate is released), but it doesn't help when agents live in separate OS processes. The window is very small (microseconds of synchronous JS between the await fs.rm completing and saveAuthProfileStore writing to main), but it exists.

Moving the mirror inside the withFileLock callback would eliminate the race entirely at the cost of holding the lock a little longer. Given the synchronous nature of mirrorRefreshedCredentialIntoMainStore, the hold-time increase is negligible:

// Inside withFileLock callback, after successful refresh:
store.profiles[params.profileId] = { ...cred, ...result.newCredentials, type: "oauth" };
saveAuthProfileStore(store, params.agentDir);
// Mirror to main while the lock is still held:
if (params.agentDir) {
  mirrorRefreshedCredentialIntoMainStore({ profileId: params.profileId, refreshed: { ...store.profiles[params.profileId] as OAuthCredential } });
}
return result;

The existing fallback (isRefreshTokenReusedError recovery path) does mitigate the impact of any race that slips through, so this is not blocking.

Prompt To Fix With AI
This is a comment left during a code review.
Path: src/agents/auth-profiles/oauth.ts
Line: 476-503

Comment:
**Mirror-to-main outside the lock creates a cross-process race window**

The mirror happens after the file lock is released. In a cross-process scenario, a second agent could acquire the lock in the window between `fs.rm(lockPath)` completing (lock released) and `mirrorRefreshedCredentialIntoMainStore` writing to the main store. That agent's inside-lock main-store recheck would see stale credentials and proceed to an HTTP refresh — reproducing the `refresh_token_reused` error this PR fixes.

The in-process `refreshQueues` serialization closes this gap for same-process agents (the mirror runs synchronously before the queue gate is released), but it doesn't help when agents live in separate OS processes. The window is very small (microseconds of synchronous JS between the `await fs.rm` completing and `saveAuthProfileStore` writing to main), but it exists.

Moving the mirror inside the `withFileLock` callback would eliminate the race entirely at the cost of holding the lock a little longer. Given the synchronous nature of `mirrorRefreshedCredentialIntoMainStore`, the hold-time increase is negligible:

```typescript
// Inside withFileLock callback, after successful refresh:
store.profiles[params.profileId] = { ...cred, ...result.newCredentials, type: "oauth" };
saveAuthProfileStore(store, params.agentDir);
// Mirror to main while the lock is still held:
if (params.agentDir) {
  mirrorRefreshedCredentialIntoMainStore({ profileId: params.profileId, refreshed: { ...store.profiles[params.profileId] as OAuthCredential } });
}
return result;
```

The existing fallback (`isRefreshTokenReusedError` recovery path) does mitigate the impact of any race that slips through, so this is not blocking.

How can I resolve this? If you propose a fix, please make it concise.

Comment on lines +26 to +40
// Separate from AUTH_STORE_LOCK_OPTIONS for independent tuning: this lock
// serializes the cross-agent OAuth refresh (see issue #26322), whereas
// AUTH_STORE_LOCK_OPTIONS guards per-store file writes. Keeping them distinct
// lets us widen the refresh lock's timeout/retry budget without affecting the
// hot-path auth-store writers.
export const OAUTH_REFRESH_LOCK_OPTIONS = {
retries: {
retries: 10,
factor: 2,
minTimeout: 100,
maxTimeout: 10_000,
randomize: true,
},
stale: 30_000,
} as const;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 OAUTH_REFRESH_LOCK_OPTIONS is currently identical to AUTH_STORE_LOCK_OPTIONS

The comment explains these are kept separate for independent future tuning, which makes sense. Just worth noting for reviewers: right now the values are byte-for-byte identical, so there's no behavioral difference until one is explicitly tuned. A brief inline note like // currently same as AUTH_STORE_LOCK_OPTIONS; diverge here when needed might help future readers avoid accidentally "syncing" them back together.

Prompt To Fix With AI
This is a comment left during a code review.
Path: src/agents/auth-profiles/constants.ts
Line: 26-40

Comment:
**`OAUTH_REFRESH_LOCK_OPTIONS` is currently identical to `AUTH_STORE_LOCK_OPTIONS`**

The comment explains these are kept separate for independent future tuning, which makes sense. Just worth noting for reviewers: right now the values are byte-for-byte identical, so there's no behavioral difference until one is explicitly tuned. A brief inline note like `// currently same as AUTH_STORE_LOCK_OPTIONS; diverge here when needed` might help future readers avoid accidentally "syncing" them back together.

How can I resolve this? If you propose a fix, please make it concise.

Addresses review comments from Codex, Greptile, and Aisle on #67876:

- P1 (Codex): refresh path now nests the per-agent AUTH_STORE_LOCK_OPTIONS
  lock INSIDE the global refresh lock, so saveAuthProfileStore calls can
  no longer lose writes against concurrent updateAuthProfileStoreWithLock
  callers. Acquisition order is always refresh -> per-store; non-refresh
  code only takes per-store, so no cycle.

- P1 (Codex) / MED (Aisle) / P2 (Greptile): mirrorRefreshedCredentialIntoMainStore
  is now async and goes through updateAuthProfileStoreWithLock, so the
  main-store read-modify-write is serialized with other main-store writers
  and cannot lose updates.

- P2 (Greptile): the mirror call is moved INSIDE the refresh lock. This
  closes the cross-process race window where a second agent could acquire
  the refresh lock in the microsecond gap between our lock release and
  our mirror write, see stale main creds, and redundantly refresh
  (reproducing refresh_token_reused).

- HIGH (Aisle, CWE-284): mirror now requires positive identity binding
  before overwriting main. New isSameOAuthIdentity helper: if both sides
  expose accountId they must match, otherwise if both expose email they
  must match (case-insensitive, trimmed). Provider-only matches are no
  longer sufficient. Prevents a compromised sub-agent from poisoning the
  main store with wrong-account tokens.

- P2 (Greptile): added clarifying note to OAUTH_REFRESH_LOCK_OPTIONS
  explaining why the values are currently byte-identical to
  AUTH_STORE_LOCK_OPTIONS and why they remain declared separately.

- MED (Aisle) on 30s lock staleness without renewal is acknowledged but
  out of scope for this iteration; discussed in PR reply as follow-up.
  The existing refresh_token_reused recovery path mitigates any race
  that slips through.

Also mirrors refreshed credentials in the plugin-refreshed path (not just
the generic refresh path), so Anthropic/other-plugin refreshes also benefit
peer agents.

All 30 pre-existing auth-profiles tests still pass; the 3 new tests
(concurrent-20-agents, mirror-refresh, lock-path with fuzz) still pass in
isolation (the Vitest cross-file worker-state leak is a pre-existing
test-harness issue, not triggered by runtime behaviour).
@visionik
Copy link
Copy Markdown
Contributor Author

Pushed 34076d7 addressing the review comments. Summary:

Codex P1 — "Keep auth-store lock while refreshing credentials"

Fixed. doRefreshOAuthTokenWithLock now nests withFileLock(authPath, AUTH_STORE_LOCK_OPTIONS, …) inside the global refresh lock. Refresh writes are now serialized with updateAuthProfileStoreWithLock callers on the same agent store. Lock acquisition order is always refresh → per-store; non-refresh code only takes per-store, so no cycle.

Codex P1 / Aisle MED #3 / Greptile P2 #3mirror RMW without main-store lock

Fixed. mirrorRefreshedCredentialIntoMainStore is now async and goes through updateAuthProfileStoreWithLock({ agentDir: undefined, updater }). The main-store read-modify-write is now serialized with every other main-store writer.

Greptile P2 — "Mirror-to-main outside the lock creates cross-process race window"

Fixed. The mirror call is moved inside the refresh lock. The cross-process window where another agent could acquire the refresh lock between our lock release and our mirror write is closed.

Bonus: the plugin-refreshed path (non-Codex-CLI plugins e.g. Anthropic) now also mirrors, not just the generic getOAuthApiKey / chutes path.

Aisle HIGH — "Sub-agent OAuth refresh can overwrite main agent auth store (credential poisoning)" — CWE-284

Fixed. New isSameOAuthIdentity helper requires positive evidence before mirroring:

  • If both sides expose accountId (Codex CLI's identity signal) they must match.
  • Otherwise if both sides expose email they must match (trimmed, case-insensitive).
  • On mismatch, mirror is refused with a log.warn.

A compromised sub-agent can no longer poison the main store with wrong-account tokens. Provider-only matching is no longer sufficient for cross-agent mirroring.

Greptile P2 — OAUTH_REFRESH_LOCK_OPTIONS identical to AUTH_STORE_LOCK_OPTIONS

Addressed. Added an inline note explicitly calling out that the values are currently byte-identical but declared separately so callers can diverge here (e.g. longer stale for slow plugin-driven refreshes) without disturbing the hot-path store writers.

Aisle MED — 30s lock staleness with no renewal during long refreshes

Acknowledged; follow-up. The concern is real: if a refresh takes >30s the lock could be deemed stale and a second agent could acquire it concurrently. However, the fix requires design choices (heartbeat interval tuning, hard refresh timeout, or switch to OS advisory locks) that I don't want to make unilaterally in this PR. Three factors mitigate the impact today:

  1. The existing isRefreshTokenReusedError recovery path (retained) re-reads the store and adopts the other refresher's fresh creds before failing.
  2. Typical plugin-driven OAuth refreshes complete in well under 5s, comfortably below the 30s threshold.
  3. After this PR, even if a second refresh does slip through, the main-store mirror means one of them writes a winning fresh credential that subsequent agents adopt.

If maintainers prefer, a follow-up PR can add either (a) a heartbeat that touches the lock file every 5s, or (b) a hard refresh timeout below the staleness threshold. Happy to file that separately once a direction is chosen.

Testing

  • 30/30 pre-existing auth-profiles tests pass (oauth.test.ts, oauth.fallback-to-main-agent.test.ts, oauth.openai-codex-refresh-fallback.test.ts).
  • 11/11 new tests pass in isolation (oauth.concurrent-20-agents.test.ts, oauth.mirror-refresh.test.ts, oauth-lock-path.test.ts with its fuzz suites).

Noting for transparency: when all three new test files are run together in the same Vitest worker with --max-workers=1, two tests flake with a cross-file mock state leak. That's a test-harness isolation issue (module-level vi.fn() state persisting across files), not a bug in the runtime behaviour — each file passes individually. I can add explicit per-file worker isolation in a follow-up if the maintainers want the in-suite run green.

/cc @HeroSizy — credit for the overall design remains in the commit trailer.

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 34076d7af1

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment thread src/agents/auth-profiles/constants.ts Outdated
maxTimeout: 10_000,
randomize: true,
},
stale: 30_000,
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Raise OAuth refresh lock stale timeout

The new cross-agent lock uses stale: 30_000, but acquireFileLock treats any lock older than stale as reclaimable and removes it. Since this critical section performs network refresh work (refreshProviderOAuthCredentialWithPlugin / OAuth token exchange), a refresh that runs longer than 30s lets a waiting agent delete the lock and start a second refresh concurrently, reintroducing the same refresh_token_reused failure mode this change is meant to prevent.

Useful? React with 👍 / 👎.

Comment thread src/agents/auth-profiles/oauth.ts Outdated
Comment on lines +309 to +313
// Neither side carries identity metadata; fall back to "no evidence of
// mismatch". Provider equality is checked separately by the caller; this
// mirrors the looser behaviour of the pre-existing
// `adoptNewerMainOAuthCredential` gate.
return true;
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Reject mirror writes when identity evidence is missing

isSameOAuthIdentity currently returns true whenever a strict accountId/email comparison cannot be made, including the case where the main credential has identity metadata but the incoming refreshed credential does not. That allows mirrorRefreshedCredentialIntoMainStore to overwrite a known main-account credential with an unbound credential from a sub-agent, after which peers can adopt the poisoned token from main; this defeats the CWE-284 protection the mirror path is trying to add.

Useful? React with 👍 / 👎.

Follow-up on #67876 review feedback asking for closer to 100% coverage of
affected files with fuzzing. Exports the internal pure helpers that were
previously tested only transitively and adds direct unit + fuzz suites.

Newly exported from oauth.ts (pure functions, no side effects):
- isRefreshTokenReusedError: classifies the OAuth error string
- normalizeAuthIdentityToken / normalizeAuthEmailToken: trim/lowercase helpers
- isSameOAuthIdentity: the CWE-284 gate that forbids cross-account mirroring

New test files:

- oauth-identity.test.ts (23 unit + 4 fuzz suites):
  - normalizeAuthIdentityToken: trim semantics, empty/whitespace handling,
    case preservation
  - normalizeAuthEmailToken: lowercasing, trim, unicode, plus-addressing
  - isSameOAuthIdentity: full truth table covering accountId priority,
    email fallback, reflexivity, symmetry, whitespace-as-absent, cross-case
    email insensitivity, mismatch cases
  - Seeded-RNG fuzz: 1000 iters each for symmetry and reflexivity; 500
    iters for "distinct accountIds never share"; 500 iters for email
    case-insensitivity with randomly-cased valid emails

- oauth-refresh-error.test.ts (11 unit + 2 fuzz suites):
  - Positive: canonical snake_case, mixed case, OpenAI natural-language,
    JSON-wrapped 401, plain-string throws, single-level and multi-level
    Error.cause chains, bare-string cause
  - Negative: unrelated auth errors, null/undefined/non-stringable values,
    empty messages
  - Seeded-RNG fuzz: 500 iters with marker embedded at random positions in
    random ASCII noise (randomly-cased); 500 iters asserting no false
    positive on marker-free messages

- oauth-refresh-queue.test.ts (3 tests):
  - FIFO serialization of concurrent same-PID callers (asserts refresh
    order)
  - Queue gate releases even when the refresh throws (no deadlock)
  - resetOAuthRefreshQueuesForTest is idempotent and safe

New cases in oauth.mirror-refresh.test.ts (4 tests):
  - Refuses to mirror when main has a non-oauth entry (protects operator
    api_key overrides)
  - Refuses to mirror when accountId mismatches (the CWE-284 scenario:
    sub-agent as acct-mine must not poison main's acct-other)
  - Refuses to mirror when main already has strictly-fresher creds (no
    regression of a concurrent main refresh)
  - Covers the plugin-refreshed branch's mirror call (Anthropic-style
    plugin refresh, not just generic getOAuthApiKey)

All 30 pre-existing auth-profiles tests still pass. The new tests pass in
isolation:
- oauth-identity.test.ts: 23/23
- oauth-refresh-error.test.ts: 14/14
- oauth-refresh-queue.test.ts: 3/3
- oauth.mirror-refresh.test.ts: 6/6
- oauth.concurrent-20-agents.test.ts: 1/1
- oauth-lock-path.test.ts: 8/8 (includes 2,700 adversarial profile-id inputs)

Total new tests: 55 across 5 files, with seeded-RNG fuzz suites covering
~4,500 adversarial inputs across the identity, refresh-error, and lock-
path surfaces. Every new exported helper has direct unit coverage; every
early-return branch in mirrorRefreshedCredentialIntoMainStore is hit by
a focused test.

Co-authored-by: HeroSizy <HeroSizy@users.noreply.github.com>
Follow-up: pushes oauth.ts coverage from 70.75% -> 72.72% statements and
68.69% -> 69.62% branches. For code added in this PR, branch coverage is
effectively 100%; the remaining uncovered bits are pre-existing recovery
paths not touched here.

New oauth.mirror-refresh.test.ts cases:
- refuses to mirror when main has a different provider for the same
  profileId (hits the provider-mismatch branch in the mirror updater)
- mirrors when main's existing cred has a non-finite expires (hits the
  Number.isFinite(existing.expires) branch by seeding main with NaN)
- inherits main-agent credentials via the pre-refresh adopt path when
  main is already fresher (hits adoptNewerMainOAuthCredential)
- inherits main-agent credentials via the catch-block fallback when
  refresh throws after main becomes fresh (hits the previously-uncovered
  branch at oauth.ts:826-848 that copies main's fresh creds into the sub
  store after a non-refresh_token_reused refresh failure)

Total auth-profiles scoped suite is now 51 tests across 9 files, all
passing.

Remaining uncovered in oauth.ts:
- Lines 810-818: legacy-default profile fallback via
  suggestOAuthProfileIdForLegacyDefault; requires a specific config-wide
  setup that's out of scope for this PR.
- Module-level init and some pre-existing reload/adopt branches covered
  by broader suites outside the scoped coverage run.
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 7a884fec5a

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment thread src/agents/auth-profiles/oauth.ts Outdated
// via updateAuthProfileStoreWithLock, CLI sync).
// Lock acquisition order is always refresh -> per-store; non-refresh code
// paths only take the per-store lock, so no cycle is possible.
const globalRefreshLockPath = resolveOAuthRefreshLockPath(params.profileId);
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Include provider in OAuth refresh lock key

The global refresh lock and in-process queue are keyed only by profileId, so unrelated credentials that happen to share an id (for example different providers/accounts in different agent dirs) are serialized together. This commit already acknowledges that same profileId can map to different providers (oauth.mirror-refresh.test.ts exercises that case), so a slow or stuck refresh on one credential can now block another independent credential and produce file lock timeout errors. Scope the key at least by provider (and ideally by credential/store identity) so only truly shared refresh tokens contend.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown
Contributor

@suboss87 suboss87 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Solid fix for a real pain point. The three-layer approach (in-process queue, cross-agent file lock, stale token adoption) is well thought out. A few things I noticed reading through:

Lock directory creation on clean install. resolveOAuthRefreshLockPath returns <stateDir>/locks/oauth-refresh/sha256-<hex>, but the locks/oauth-refresh/ parent directory is never created before withFileLock is called. Compare the existing per-store path where ensureAuthStoreFile(authPath) runs before every withFileLock(authPath, ...). On a clean install the first refresh attempt could throw if the lock library doesn't handle missing parent dirs. The test suite pre-creates the state dir so this wouldn't show up there.

Adoption path skips identity validation. The inside-the-lock adoption block checks mainCred.type, mainCred.provider, and expiry, but doesn't call isSameOAuthIdentity(cred, mainCred) before writing into the agent's local store. The mirror direction (mirrorRefreshedCredentialIntoMainStore) does have that guard. In misconfigured multi-user setups where two agents share a profileId and provider but have different accountId values, the sub-agent could silently adopt the wrong credentials. Probably worth adding the same identity check for defense-in-depth.

In-process queue cleanup under burst arrivals. The chained-promise queue correctly serializes concurrent callers at enqueue time. But refreshQueues.set(key, gate) overwrites the previous entry immediately, so if A, B, C arrive in a burst, C's gate clobbers B's map entry. The if (refreshQueues.get(key) === gate) cleanup for C deletes the key while B might still be running. A late arrival D would find no prev and proceed without waiting for B. Low risk since the file lock backstops it, but the comment about preventing same-PID bypass is only fully correct for the initial burst.

isRefreshTokenReusedError has no call site in the recovery path. The function is exported and well-tested, and the PR description calls it "the gate that triggers the retry/adoption recovery path." But in the actual code the error just propagates as an unhandled throw since the fix works through prevention (collapsing N refreshes to 1) rather than retry. The test comment is a bit misleading. If prevention is the intended strategy, the docstring should say so.

Overall this looks solid. The critical path (serializing refresh so only one agent refreshes at a time) is correct and the test coverage is thorough.

visionik and others added 2 commits April 16, 2026 23:00
… helpers

Pre-existing helpers in path-resolve.ts / paths.ts (resolveAuthStorePath,
resolveLegacyAuthStorePath, resolveAuthStatePath, the *ForDisplay variants,
and ensureAuthStoreFile) were showing as uncovered in scoped coverage runs
despite being exercised transitively by the wider auth-profile suite. The
cause is a v8 coverage quirk around ESM re-exports: calls that reach these
functions through the `paths.ts -> path-resolve.ts` re-export chain are
not reliably attributed back to the original function bodies.

This test imports each helper DIRECTLY from ./path-resolve.js (and
ensureAuthStoreFile from ./paths.js) and exercises every branch, forcing
proper v8 attribution.

Coverage movement on the touched surface:
- paths.ts: 50% -> 100%
- path-resolve.ts: 33% -> 83% (remaining 2 uncovered lines are the same
  attribution artifact on resolveAuthStorePath's body and cannot be
  eliminated without restructuring the re-export module)
- Overall (oauth.ts + path-resolve.ts + paths.ts + constants.ts):
  71.68% -> 74.55% statements.

Adds 11 new tests. Total auth-profiles scoped suite is now 62 tests across
10 files, all passing.
…d lock, refresh timeout)

Three issues flagged by Codex after the previous review fix landed:

P1 \u2014 constants.ts:45 \u2014 Raise OAuth refresh lock stale timeout:
With stale=30_000, a refresh that exceeds 30s would let a waiting agent
reclaim the lock and start a second concurrent refresh, re-introducing
the refresh_token_reused race. Raised stale to 180_000 (3min) and added
a hard upper bound OAUTH_REFRESH_CALL_TIMEOUT_MS=120_000 on the refresh
call itself, enforced via a new withRefreshCallTimeout helper that wraps
refreshProviderOAuthCredentialWithPlugin, refreshChutesTokens, and
getOAuthApiKey. The invariant OAUTH_REFRESH_CALL_TIMEOUT_MS < stale
guarantees that a legitimate refresh cannot be considered stale by a
peer while it is still doing work.

P1 \u2014 oauth.ts:313 \u2014 Reject mirror writes when identity evidence is missing:
The previous isSameOAuthIdentity returned true whenever a strict
accountId/email comparison could not be made, including the asymmetric
case where main carried identity metadata but the incoming credential
did not. That let a sub-agent with a bare refresh response overwrite a
known-account main credential. Tightened the rule: asymmetric identity
evidence now refuses; both sides must present identity in a shared field
(accountId wins over email when both present). Neither-side-has-identity
still returns true (no evidence of mismatch; provider equality is
checked separately).

P2 \u2014 oauth.ts:386 \u2014 Include provider in OAuth refresh lock key:
resolveOAuthRefreshLockPath was keyed on sha256(profileId) alone, so two
profiles that happen to share a profileId across providers would
needlessly serialize and risk file-lock timeouts. Widened the signature
to resolveOAuthRefreshLockPath(provider, profileId); the hash now uses a
NUL-separated composite (`provider\0profileId`) so concat-boundary
collisions are impossible. refreshQueues in-process key mirrors the
same shape via a new refreshQueueKey helper.

Tests added/updated:

- oauth-identity.test.ts: flipped one previously-passing case to match
  the stricter rule; added 4 new cases covering asymmetric-identity
  refusal (main has accountId only vs incoming empty; main has email
  only vs incoming empty; bidirectional symmetry of those; main and
  incoming carry identity in non-overlapping fields). Fuzz invariants
  (symmetry, reflexivity, distinct-accountId-never-matches, email case-
  insensitivity) still hold.

- oauth-lock-path.test.ts: updated all call sites to the new 2-arg
  signature; added dedicated tests that (a) distinct providers for the
  same profileId produce distinct paths, (b) concat-boundary collision
  resistance via the NUL separator, and (c) fuzz suites for
  holding-provider-fixed / holding-profile-fixed uniqueness over 1500
  random inputs.

- oauth-refresh-timeout.test.ts (new): invariant tests that pin the
  relationship between OAUTH_REFRESH_CALL_TIMEOUT_MS and
  OAUTH_REFRESH_LOCK_OPTIONS.stale: strict-less-than, \u226530s floor,
  \u226530s safety margin, stale > 60s floor. Behavioural timeout firing
  is not exercised here because driving it under Vitest fake timers
  deadlocks against the real file-lock I/O; a regression in timeout
  wiring would be caught by oauth.concurrent-20-agents.test.ts.

All previously-passing tests remain green (30/30 pre-existing + all
new suites). Total auth-profiles test count: 66+ across 11 files.

Co-authored-by: HeroSizy <HeroSizy@users.noreply.github.com>
@visionik
Copy link
Copy Markdown
Contributor Author

visionik commented Apr 17, 2026

Pushed 186228f addressing all three new Codex P1/P2 review points.

P1 — "Raise OAuth refresh lock stale timeout"

Fixed. Two-part change:

  • OAUTH_REFRESH_LOCK_OPTIONS.stale: 30_000 → 180_000 (3 min)
  • New OAUTH_REFRESH_CALL_TIMEOUT_MS = 120_000 hard deadline on the refresh call itself, enforced via a new withRefreshCallTimeout helper wrapping every HTTP path: refreshProviderOAuthCredentialWithPlugin (both external-managed and generic branches), refreshChutesTokens, and getOAuthApiKey.

Invariant: OAUTH_REFRESH_CALL_TIMEOUT_MS < OAUTH_REFRESH_LOCK_OPTIONS.stale. Documented in constants.ts and pinned by new invariant tests in oauth-refresh-timeout.test.ts (lessThan, ≥30s floor, ≥30s safety margin, >60s stale floor). A future edit that weakens the gap would fail these tests.

Behavioural end-to-end timeout tests were attempted with vi.useFakeTimers but deadlock against the real file-lock I/O inside the nested refresh lock callbacks; a regression in timeout wiring would be caught by the existing oauth.concurrent-20-agents.test.ts (a stuck refresh would hang the suite).

P1 — "Reject mirror writes when identity evidence is missing"

Fixed. isSameOAuthIdentity rewritten with a strict rule:

  • Asymmetric refuses: if one side carries identity (accountId or email) and the other doesn't, return false.
  • Both carry identity: at least one shared field must match (accountId wins over email when both present); non-overlapping fields also refuse.
  • Neither carries identity: return true (no evidence of mismatch; provider equality is checked separately by the caller).

The previous permissive return true fallback that triggered Codex's concern is gone.

Test coverage in oauth-identity.test.ts expanded to 27 tests + 4 seeded-RNG fuzz suites:

  • Asymmetric-refuse: main-accountId-vs-empty, main-email-vs-empty, empty-vs-main, non-overlapping fields.
  • Fuzz invariants (symmetry, reflexivity, distinct-accountId-never-matches, email case-insensitivity) still hold under the stricter rule.
  • One pre-existing case was flipped where the scenario genuinely was a positive-identity match (main carries accountId+email, incoming carries matching email) — that's a valid match under the strict rule; the flip to true is correct.

P2 — "Include provider in OAuth refresh lock key"

Fixed. resolveOAuthRefreshLockPath signature widened from (profileId) to (provider, profileId); the hash input is provider\0profileId so concat-boundary collisions are impossible. The in-process refreshQueues map key mirrors the same shape via a new refreshQueueKey(provider, profileId) helper.

All three call sites of refreshOAuthTokenWithLock updated to pass cred.provider.

Test coverage in oauth-lock-path.test.ts expanded to 12 tests + 4 fuzz suites:

  • Distinct providers for the same profileId produce distinct paths.
  • Concat-boundary: ("a", "b:c")("a:b", "c").
  • Fuzz: holding-provider-fixed and holding-profile-fixed uniqueness over 1500 inputs.
  • Existing path-safety invariants (sha256-hex64 basename, no separators or .., length bound, inside <stateDir>/locks/oauth-refresh) retained and now parameterized over random providers too.

Tests now on branch

File Tests
oauth.test.ts (existing) 15
oauth.fallback-to-main-agent.test.ts (existing) 7
oauth.openai-codex-refresh-fallback.test.ts (existing) 8
oauth.concurrent-20-agents.test.ts 1
oauth.mirror-refresh.test.ts 10
oauth-lock-path.test.ts 12
oauth-identity.test.ts 27
oauth-refresh-error.test.ts 14
oauth-refresh-queue.test.ts 3
oauth-refresh-timeout.test.ts (new) 4
paths-direct-import.test.ts 11
Total 112

All pass in isolation. No regressions to the 30 pre-existing auth-profiles tests.

/cc @HeroSizy

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 186228f65b

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment on lines +38 to +42
retries: 10,
factor: 2,
minTimeout: 100,
maxTimeout: 10_000,
randomize: true,
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Increase refresh-lock retry budget past refresh timeout

withFileLock stops waiting once retries are exhausted, independent of stale. With retries: 10, minTimeout: 100, factor: 2, maxTimeout: 10000, waiters give up after roughly 43–85 seconds, but OAUTH_REFRESH_CALL_TIMEOUT_MS allows the owner to legitimately run for 120 seconds. In a slow (but successful) 90–120s token exchange, peer agents will throw file lock timeout and fail auth even though the in-flight refresh is still valid and about to complete.

Useful? React with 👍 / 👎.

Comment on lines +254 to +257
timeoutHandle = setTimeout(() => {
reject(new Error(`OAuth refresh call "${label}" exceeded hard timeout (${timeoutMs}ms)`));
}, timeoutMs);
fn().then(resolve, reject);
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Abort timed refresh work instead of only rejecting wrapper

This timeout helper rejects after timeoutMs but does not cancel fn(), so the underlying refresh request can continue after the lock owner has already unwound and released locks. If that late request succeeds server-side (rotating refresh token) after the caller has already treated it as failure, a following agent can refresh with stale credentials and hit refresh_token_reused, recreating the race this change is trying to prevent under slow upstream conditions.

Useful? React with 👍 / 👎.

…st test

Addresses three substantive observations from @suboss87's review that
were missed in the earlier review passes (no inline comments, review was
posted as a top-level COMMENTED review).

1. Adoption path skips identity validation \u2014 FIXED. All three adopt
   sites in oauth.ts now call isSameOAuthIdentity(sub, main) as a
   defense-in-depth gate:
   - adoptNewerMainOAuthCredential (pre-refresh adopt, called from
     resolveApiKeyForProfile before any refresh attempt)
   - inside-the-lock main adoption inside doRefreshOAuthTokenWithLock
     (the fast path that collapses N refreshes into 1)
   - catch-block main-inherit fallback at the end of
     resolveApiKeyForProfile (used when a non-refresh_token_reused
     refresh error fires and main happens to have fresh creds)
   Previously these sites only checked provider equality. In a
   misconfigured multi-user setup where two agents share a
   profileId/provider but have different accountId values, a sub-agent
   could silently adopt a foreign credential. Now the gate matches the
   mirror direction, closing the CWE-284 surface symmetrically.

2. In-process queue cleanup under burst arrivals \u2014 TESTED. Added a
   10-caller burst stress test that asserts (a) startOrder === endOrder
   (FIFO: no caller completes before any predecessor) and (b)
   maxInFlight === 1 (no two refresh calls ever run concurrently).
   This pins the invariant that the refreshQueues.set(key, gate)
   overwrite pattern only advances the map's tail pointer; FIFO is
   actually enforced by the `await prev` chain, which the test now
   demonstrates observably.

3. Lock directory creation on clean install \u2014 TESTED. Added a
   clean-install test that verifies resolveOAuthRefreshLockPath works
   when <stateDir>/locks/oauth-refresh does not yet exist. The
   resolver itself is pure (no side effects); the directory is
   created lazily by withFileLock via fs.mkdir(..., recursive: true)
   in acquireFileLock \u2014 that guarantee is now pinned by a test so a
   future change removing it would fail loudly.

4. isRefreshTokenReusedError unused \u2014 NOT FIXED (reviewer mistaken).
   The function IS called in the catch block of resolveApiKeyForProfile
   (line ~861): `if (isRefreshTokenReusedError(error) && ... )` gates
   the credential-reload-and-retry recovery path. Reply noting this
   will be posted separately.

New tests (8 added across 3 files):
- oauth.adopt-identity.test.ts (new file, 3 tests): one per adopt site,
  each verifying a mismatched accountId refuses the adoption and keeps
  the sub-agent on its own credential.
- oauth-lock-path.test.ts (+1): clean-install invariant for a fresh
  state dir with no locks hierarchy.
- oauth-refresh-queue.test.ts (+1): 10-caller FIFO/no-overlap burst.

All pre-existing tests still pass (30/30 auth-profile regression +
10/10 mirror). The stricter identity gate is backward-compatible for
the #26322 production case: Codex CLI always returns accountId in the
refresh response, and the pre-existing main-store setup for that
case also carries accountId, so the gate matches and adoption
proceeds as before.

Co-authored-by: HeroSizy <HeroSizy@users.noreply.github.com>
@visionik
Copy link
Copy Markdown
Contributor Author

visionik commented Apr 17, 2026

@suboss87 thanks for the thorough review — sorry I missed it on the first pass, your review was posted as a top-level review without inline anchors so it didn't surface in my initial grep. Pushed e28a752 addressing three of the four items.

Adoption path skips identity validation — FIXED

Good catch. Added isSameOAuthIdentity to all three adopt sites so the CWE-284 gate is symmetric:

  1. adoptNewerMainOAuthCredential (pre-refresh adopt)
  2. Inside-the-lock main adoption inside doRefreshOAuthTokenWithLock
  3. Catch-block main-inherit fallback

New test file oauth.adopt-identity.test.ts exercises each site with mismatched accountId on main vs. sub and asserts the adoption is refused. 3/3 passing.

In-process queue cleanup under burst arrivals — TESTED

I believe the ordering is actually correct — the refreshQueues.set(key, gate) only advances the map's tail pointer; FIFO is enforced by the await prev chain, so caller C's finally (and cleanup) can only run after its doRefresh body has, which in turn can only start after caller B's finally released its gate. So C's cleanup running while B is still in-flight is unreachable.

But the invariant is subtle and deserves an observable test, which is now in oauth-refresh-queue.test.ts: 10 concurrent callers, asserts (a) startOrder === endOrder (no out-of-order completion) and (b) maxInFlight === 1 (no two refreshes overlap). Passes.

Lock directory creation on clean install — TESTED

You're right that the test harness pre-creates the state dir. In practice acquireFileLock in src/plugin-sdk/file-lock.ts already creates missing parent dirs via fs.mkdir(dir, { recursive: true }) in resolveNormalizedFilePath — so clean-install does work, but the guarantee was implicit. Added a clean-install test in oauth-lock-path.test.ts that asserts resolveOAuthRefreshLockPath returns a valid path when the locks/oauth-refresh dir doesn't yet exist, so a future change removing that guarantee in the lock layer would fail loudly.

isRefreshTokenReusedError has no call site — respectful pushback

It IS called, at src/agents/auth-profiles/oauth.ts:861 in the catch block of resolveApiKeyForProfile:

if (
  isRefreshTokenReusedError(error) &&
  refreshed?.type === "oauth" &&
  refreshed.provider === cred.provider &&
  hasOAuthCredentialChanged(cred, refreshed)
) {
  const recovered = await loadFreshStoredOAuthCredential({ ... });
  if (recovered) { return ... }
  const retried = await refreshOAuthTokenWithLock({ ... });
  ...
}

This is the reload-and-retry path that catches the residual case where refresh_token_reused slips past the global lock (e.g. cross-machine agents on a shared FS, or stale-lock reclaim). The test comment describing it as "the gate that triggers the retry/adoption recovery path" is accurate — it is exactly that gate.

Happy to add more docs around it if the behaviour feels surprising — let me know.

Test inventory after this commit

File Tests
oauth.test.ts (existing) 15
oauth.fallback-to-main-agent.test.ts (existing) 7
oauth.openai-codex-refresh-fallback.test.ts (existing) 8
oauth.concurrent-20-agents.test.ts 1
oauth.mirror-refresh.test.ts 10
oauth.adopt-identity.test.ts (new) 3
oauth-lock-path.test.ts 13
oauth-identity.test.ts 27
oauth-refresh-error.test.ts 14
oauth-refresh-queue.test.ts 4
oauth-refresh-timeout.test.ts 4
paths-direct-import.test.ts 11
Total 117

All pass in isolation. 30/30 pre-existing auth-profile tests still pass.

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: e28a752d57

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment on lines +369 to +370
if (aHasIdentity !== bHasIdentity) {
return false;
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Relax asymmetric identity rejection for credential adoption

The new isSameOAuthIdentity rule rejects any asymmetric metadata (aHasIdentity !== bHasIdentity), which now blocks all three main-credential adoption paths that call it (adoptNewerMainOAuthCredential, inside-lock main adoption, and catch-block main fallback). In a common upgrade state where an older sub-agent credential has no accountId/email but the fresh main credential does, adoption is refused even when both are the same account, so the sub-agent keeps trying to refresh with a stale token and can still fail with refresh_token_reused instead of reusing the already-fresh main token. This regression was introduced by applying the strict mirror-write identity rule to adoption flows as well.

Useful? React with 👍 / 👎.

…riants

Addresses Codex P1 'Relax asymmetric identity rejection for credential
adoption' (PR 67876 comment on oauth.ts:370). The previous commit
applied the strict mirror-write identity rule to all three adoption
paths too; that over-reaches for a common upgrade scenario:

  * An older sub-agent credential has no accountId/email because it
    was stored before identity capture.
  * The fresh main credential it would adopt has accountId (Codex CLI
    always populates it on recent refreshes).
  * Strict rule refused the adoption as asymmetric.
  * Sub-agent fell through to its own HTTP refresh with a stale token
    and could still hit refresh_token_reused \u2014 the exact failure
    mode #26322 exists to prevent.

Fix: split the gate into two named helpers.

  isSameOAuthIdentity(strict)           used for mirror (sub -> main)
  isOAuthIdentityCompatible(relaxed)    used for adopt (main -> sub)

The relaxed rule refuses only when both sides expose identity AND it
POSITIVELY MISMATCHES (accountId differs; or emails differ when
neither side has accountId). Absence of identity on one side is no
longer grounds to refuse adoption. The CWE-284 cross-account-leak
protection is preserved because that scenario trips the positive-
mismatch branch on both gates.

Swapped the three adopt sites:
  * adoptNewerMainOAuthCredential (pre-refresh adopt)
  * inside-the-lock main adoption inside doRefreshOAuthTokenWithLock
  * catch-block main-inherit fallback

Mirror (mirrorRefreshedCredentialIntoMainStore) keeps the strict rule
unchanged.

Tests:

- oauth-identity.test.ts (+20 tests): truth-table for isOAuthIdentityCompatible
  covering positive matches, upgrade tolerance (four sub-has-no-identity
  variants), positive mismatch still refuses, normalization, reflexivity/
  symmetry. Plus 4 fuzz suites (symmetry, reflexivity, distinct-accountId-
  always-refuses, and a cross-rule property test asserting
  'if isSameOAuthIdentity accepts, isOAuthIdentityCompatible must also').
- oauth.adopt-identity.test.ts (+3 tests): one per adopt site confirming
  the sub-has-no-identity / main-has-identity upgrade scenario now
  succeeds without a spurious refresh.

Cross-account mismatch tests (the previously-added 3 refuse cases) still
pass because positive accountId mismatch trips both gates identically.

All regressions pass:
- oauth-identity.test.ts: 47/47 (was 27)
- oauth.adopt-identity.test.ts: 6/6 (was 3)
- oauth.mirror-refresh.test.ts: 10/10 (strict mirror rule preserved)
- oauth.test.ts + fallback-to-main + openai-codex-refresh-fallback: 30/30

Co-authored-by: HeroSizy <HeroSizy@users.noreply.github.com>
@visionik
Copy link
Copy Markdown
Contributor Author

Pushed c6b7627 addressing the "Relax asymmetric identity rejection for credential adoption" P1.

Root cause acknowledgement

The previous commit (e28a752) added isSameOAuthIdentity to all three adopt sites, which over-applied the strict mirror-write rule. The result was an upgrade-compat regression: a sub-agent whose stored credential predates accountId/email capture would refuse to adopt a fresh main credential that does have identity metadata, fall through to its own HTTP refresh with a stale token, and potentially hit the refresh_token_reused failure this PR is built to prevent. Caught cleanly — thank you.

Fix

Split the gate into two named helpers with well-defined asymmetry:

// Strict. Used for mirror writes (sub → main), the direction where
// a compromised/misconfigured sub-agent could poison the main store.
// Asymmetric identity (one side has it, the other doesn't) → refuse.
export function isSameOAuthIdentity(a, b): boolean { /* strict */ }

// Relaxed. Used for credential adoption (main → sub), the direction
// where main is the authoritative source. Refuses ONLY on positive
// mismatch evidence (both sides expose identity AND it differs).
export function isOAuthIdentityCompatible(a, b): boolean { /* relaxed */ }

Swapped the three adopt sites to the relaxed variant:

  • adoptNewerMainOAuthCredential (pre-refresh adopt)
  • Inside-the-lock main adoption inside doRefreshOAuthTokenWithLock
  • Catch-block main-inherit fallback

Mirror (mirrorRefreshedCredentialIntoMainStore) keeps the strict rule — the CWE-284 credential-poisoning surface is unchanged there.

Why the two asymmetries are correct

Scenario Strict (mirror) Relaxed (adopt)
Both have matching accountId accept accept
Both have mismatched accountId refuse (leak) refuse (leak)
Sub no identity, main has accountId refuse (poison risk) accept (upgrade)
Main no identity, sub has accountId refuse accept
Neither has identity accept accept
One has accountId, other has only email (non-overlapping) refuse accept (no positive-mismatch evidence)

The upgrade tolerance appears only in one direction — the direction where main is authoritative — so this doesn't re-open the credential-poisoning surface.

Test changes

  • oauth-identity.test.ts: 47 tests now (up from 27). New 20-test truth-table for isOAuthIdentityCompatible plus a 4-case fuzz suite. One of the fuzz cases is a cross-rule property test asserting "if isSameOAuthIdentity accepts, isOAuthIdentityCompatible must also" — the strict rule is a subset of the relaxed rule over 1,000 random inputs.
  • oauth.adopt-identity.test.ts: 6 tests now (up from 3). Added one upgrade-scenario test per adopt site verifying sub-with-no-identity can now adopt main-with-accountId and NO spurious refresh is triggered.
  • Positive-mismatch refuse tests (suboss87's scenario, where both sides have accountId but they differ) still pass under the relaxed rule — positive mismatch trips both gates identically.

Regression coverage

Suite Before After Δ
oauth.test.ts + fallback + openai-codex-refresh 30 30 unchanged
oauth-identity.test.ts 27 47 +20
oauth.adopt-identity.test.ts 3 6 +3
oauth.mirror-refresh.test.ts 10 10 unchanged (strict rule still applies there)

All pass in isolation.

Commit

c6b7627 on fix/oauth-refresh-cross-agent-lock.

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: c6b762778d

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment on lines +366 to +369
// Asymmetric identity evidence — refuse. We cannot prove the two
// credentials belong to the same account.
if (aHasIdentity !== bHasIdentity) {
return false;
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Allow mirror update when only incoming credential has identity

The strict asymmetric check in isSameOAuthIdentity makes mirrorRefreshedCredentialIntoMainStore refuse writes whenever the existing main credential has no accountId/email but the freshly refreshed sub-agent credential does. In upgrade states where main still has pre-capture OAuth records, this prevents main-store mirroring after the first sub-agent refresh; subsequent agents then cannot adopt from main in the inside-lock recheck and keep refreshing with stale refresh tokens, reproducing refresh_token_reused failures under multi-agent load.

Useful? React with 👍 / 👎.

…p-refusing)

Addresses Codex P1 'Allow mirror update when only incoming credential has
identity' (PR 67876 comment on oauth.ts:369). The strict isSameOAuthIdentity
rule used for the mirror direction also refused the symmetric upgrade case
\u2014 main store has a pre-capture OAuth record (no accountId), sub-agent's
fresh refresh response carries accountId. Under the strict rule the mirror
refused, main stayed stale, and subsequent peers could not adopt from it in
the inside-lock recheck, keeping the refresh_token_reused storm alive.

The mirror direction cannot use the relaxed isOAuthIdentityCompatible gate
either \u2014 that would allow identity-regression writes (main has accountId,
incoming doesn't), which later lets a wrong-account sub-agent pass the
relaxed adoption gate and inherit credentials that don't belong to it.

Fix: introduce a third gate, isSafeToMirrorOAuthIdentity, strictly between
the other two. Rule:
  1. no positive identity mismatch (same as isOAuthIdentityCompatible)
  2. incoming must carry at least as much identity as existing \u2014 refuses
     any write that would drop an accountId or email main already has.

Swapped mirrorRefreshedCredentialIntoMainStore from isSameOAuthIdentity
to isSafeToMirrorOAuthIdentity. The two existing gates stay unchanged:
- isSameOAuthIdentity: still used internally by tests as the strict ref
- isOAuthIdentityCompatible: still used by the three adopt sites

Truth table across the three gates (existing vs incoming identity):
  match accountIds               all three accept
  mismatch accountIds            all three refuse
  both lack identity             all three accept
  existing none, incoming has    strict refuses; mirror-safe + compat accept
  existing has, incoming none    strict + mirror-safe refuse; compat accepts
  non-overlapping fields         strict + mirror-safe refuse; compat accepts

Tests:
- oauth-identity.test.ts (+13): full truth table for
  isSafeToMirrorOAuthIdentity plus two relationship tests pinning the
  ordering (strict \u2264 mirror-safe \u2264 compatible). Now 60 tests total.
- oauth.mirror-refresh.test.ts (+2): integration tests for the two
  formerly-mis-handled cases \u2014 upgrade accepted, regression refused.
  Now 12 tests total.

Regression coverage:
- oauth.test.ts + fallback + openai-codex-refresh: 30/30 unchanged
- oauth.adopt-identity.test.ts: 6/6 unchanged (adopt path uses compat gate)
- oauth.mirror-refresh.test.ts: 12/12 (previously 10; mismatch case still
  refuses under new rule because positive-mismatch trips both gates)

Co-authored-by: HeroSizy <HeroSizy@users.noreply.github.com>
@visionik
Copy link
Copy Markdown
Contributor Author

Pushed a372790 addressing the "Allow mirror update when only incoming credential has identity" P1.

You're right — the mirror direction had a symmetric version of the same upgrade-regression bug. A pre-capture main (no accountId) + a fresh sub-agent refresh response that does carry accountId was being refused, main stayed stale, and peer agents couldn't adopt from it → refresh_token_reused storm kept firing.

Why the mirror can't just use the relaxed adopt gate

Tempting to unify on isOAuthIdentityCompatible everywhere, but that reopens a different hole: if main has accountId=alice and the incoming mirror drops it (no-accountId refresh response, somehow), main loses its identity marker. A later acct-bob sub-agent with no identity would then pass the relaxed adopt check against identity-less main and inherit alice's token. That's exactly the cross-account leak suboss87 flagged.

So the mirror direction needs a third rule: no positive mismatch AND no identity regression.

The fix: isSafeToMirrorOAuthIdentity

Strictly between the two existing gates. Rule:

  1. isOAuthIdentityCompatible(existing, incoming) must pass (no positive mismatch).
  2. Every identity field present on existing must also be present on incoming (no drops).

Swapped mirrorRefreshedCredentialIntoMainStore from isSameOAuthIdentity to isSafeToMirrorOAuthIdentity. The two existing gates keep their roles:

Gate Direction Asymmetric cases
isSameOAuthIdentity (strict) kept for tests/docs as the pure-symmetric reference
isSafeToMirrorOAuthIdentity (new) mirror (sub → main) accept upgrades, refuse regressions
isOAuthIdentityCompatible (relaxed) adopt (main → sub) accept all asymmetric cases

Full truth table across all three gates

existing vs incoming strict mirror-safe compatible
accountId match
accountId mismatch
both have no identity
existing none, incoming has ✓ (upgrade)
existing has, incoming none ✗ (regression)
non-overlapping fields (accountId vs email)

Relationship pinned by tests: strict ≤ mirror-safe ≤ compatible (each is a subset of the next).

Test changes

  • oauth-identity.test.ts: 60 tests now (up from 47). Full truth table for isSafeToMirrorOAuthIdentity (13 new tests) including explicit "strictly stricter than compatible, strictly more permissive than strict" relationship tests.
  • oauth.mirror-refresh.test.ts: 12 tests now (up from 10). Added integration tests for both the upgrade case (formerly refused, now accepted) and the regression case (main has identity, incoming drops it — still refused).
  • Pre-existing mirror tests (positive-mismatch refuse, non-oauth refuse, fresher-expires refuse, plugin-refresh mirror) all still pass — positive mismatch trips both the old strict rule and the new mirror-safe rule identically.

Regression tally

Suite Before After
oauth-identity.test.ts 47 60
oauth.mirror-refresh.test.ts 10 12
oauth.adopt-identity.test.ts 6 6
30 pre-existing 30/30 30/30

All pass in isolation.

Commit

a372790 on fix/oauth-refresh-cross-agent-lock.

…sive adopt

Addresses maintainer review feedback on adoption-gate permissiveness and
the overstated wording on withRefreshCallTimeout.

Primary: the relaxed isOAuthIdentityCompatible rule was too permissive.
It returned true whenever there was no shared comparable identity field,
which meant the non-overlapping-fields case (sub has only accountId,
main has only email, or vice versa) passed the gate without a positive
identity match. The three adoption sites
(adoptNewerMainOAuthCredential, inside-lock main adoption, catch-block
main-inherit fallback) would then copy credentials across the gate with
no evidence that the two sides belonged to the same account. The unit
tests at oauth-identity.test.ts:224-228 locked that behaviour in.

The reviewer's narrower rule is identical to the mirror-direction rule
I introduced in the previous commit (isSafeToMirrorOAuthIdentity). Since
both directions need exactly the same constraint, unifying the two
functions is simpler than threading a new rule through the adopt sites.

Changes:
- Deleted isOAuthIdentityCompatible and isSafeToMirrorOAuthIdentity.
- Introduced isSafeToCopyOAuthIdentity as the unified copy-gate:
    allow iff
      no positive mismatch on a shared field, AND
      incoming carries at least as much identity evidence as existing
    accept:  matching accountId; matching email; neither side carries
             identity; existing has none and incoming has identity
             (pure upgrade).
    refuse:  positive mismatch; incoming drops existing's identity;
             non-overlapping fields.
- Swapped all four call sites (1 mirror + 3 adopt) to the unified gate.
- isSameOAuthIdentity remains exported as the strict reference for
  documentation and cross-rule property tests.

Secondary: updated the withRefreshCallTimeout doc-comment to honestly
describe what the helper does. Previous wording called it a 'hard
timeout', which overstated the guarantee: the wrapper rejects after the
deadline and our caller releases locks, but the underlying fn() keeps
running because JS promises aren't cancellable and the pi-ai OAuth
stack does not accept an AbortSignal. The isRefreshTokenReusedError
recovery path remains the backstop for any residual race; a proper
cancel requires plumbing AbortSignal through the refresh stack and is
tracked as a follow-up.

Tests:
- oauth-identity.test.ts (51 tests, was 60 with 3 different functions):
  clean single truth-table for the unified gate plus cross-rule property
  tests (strict accepts -> unified accepts; relaxation only in the pure-
  upgrade direction). Fuzz: reflexivity, distinct-accountId-never-
  accepts, strict->unified monotonicity, same-accountId-always-accepts.
- oauth.adopt-identity.test.ts (9 tests, was 6): added ONE reviewer-
  requested non-overlapping-identity refuse test PER adopt site
  (pre-refresh, inside-lock, catch-block). Three sites => three tests.
- oauth.mirror-refresh.test.ts unchanged (12 tests), still passes under
  the unified gate because the gate is identical to the previous
  isSafeToMirrorOAuthIdentity.
- 30/30 pre-existing auth-profiles regression suite unchanged.

Co-authored-by: HeroSizy <HeroSizy@users.noreply.github.com>
@visionik
Copy link
Copy Markdown
Contributor Author

Thanks for verifying the primary race claim against origin/main and for the narrower rule spec. Pushed 5cc7162 addressing both points.

Primary: adoption gate was too permissive

Fixed. You spotted that isOAuthIdentityCompatible returned true whenever there was no shared comparable field, which let the non-overlapping-fields case (sub has only accountId, main has only email, or vice versa) pass the gate without any positive-match evidence. The unit test at oauth-identity.test.ts:224-228 locked that bad behaviour in.

Your narrower rule is identical to the mirror-direction gate I added in the previous commit (isSafeToMirrorOAuthIdentity). Rather than maintain two near-duplicate functions, I unified them:

  • Deleted isOAuthIdentityCompatible and isSafeToMirrorOAuthIdentity.
  • Introduced isSafeToCopyOAuthIdentity as the single copy-direction gate used by all four sites (1 mirror + 3 adopt).
  • Kept isSameOAuthIdentity as the strict-symmetric reference for cross-rule property tests and documentation.

Rule for isSafeToCopyOAuthIdentity (allow iff):

  1. No positive identity mismatch on a shared field.
  2. Incoming carries at least as much identity evidence as existing.

Truth table:

existing → incoming strict unified (new)
matching accountId
mismatched accountId
both no identity
none → accountId (upgrade)
accountId → none (regression)
accountId-only ↔ email-only (non-overlap)

All four call sites (pre-refresh adopt, inside-lock adopt, catch-block main-inherit, and mirror-to-main) now go through this single gate.

Tests

  • oauth-identity.test.ts: 51 tests total. Single clean truth-table for isSafeToCopyOAuthIdentity replacing the previous two overlapping sets. Relationship tests pin isSameOAuthIdentity ⊆ isSafeToCopyOAuthIdentity and verify the relaxation happens only in the pure-upgrade direction. Fuzz includes a monotonicity property: if strict accepts, unified accepts (1,000 random pairs).
  • oauth.adopt-identity.test.ts: +3 reviewer-requested regression tests, one per adopt site, for the non-overlapping-fields case. 9 tests total now.
  • oauth.mirror-refresh.test.ts: unchanged (12 tests). The mirror tests already passed under the previous isSafeToMirrorOAuthIdentity; since the unified gate has the same rule, they keep passing.

Secondary: withRefreshCallTimeout docstring overstated the guarantee

Fixed the docstring honestly. You're right that the helper rejects our promise after the deadline and releases our locks, but cannot cancel the underlying fn() — JS promises aren't cancellable and the pi-ai OAuth stack doesn't accept an AbortSignal today. The word "hard timeout" was misleading. The doc-comment now reads:

LIMITATION: this does NOT cancel the underlying work. […] When the
deadline fires the caller moves on and releases its file lock, but the
original fn() promise keeps running in the background. That means a
slow upstream refresh could still burn a refresh token well after we
have given up on it, and a waiting peer that has now taken the lock
may hit refresh_token_reused.

The existing isRefreshTokenReusedError recovery path is the backstop
for that residual case — it reloads from the main store and adopts if
another agent's refresh has since landed. A fuller fix requires
plumbing AbortSignal through the refresh stack into the HTTP client;
tracked as a follow-up.

Agreed that actual AbortSignal plumbing is out of scope for the main race fix and should ship separately.

Test totals

Suite After
oauth-identity.test.ts 51
oauth.adopt-identity.test.ts 9
oauth.mirror-refresh.test.ts 12
oauth.test.ts + fallback + codex-refresh (pre-existing) 30/30

All pass in isolation.

Commit

5cc7162 on fix/oauth-refresh-cross-agent-lock.

@visionik
Copy link
Copy Markdown
Contributor Author

Re-reviewed the current PR state. The core #26322 bug is real on origin/main: refresh
locking is scoped to agentDir, so shared-profile agents do not actually serialize. The
cross-agent lock + in-process queue + inside-the-lock main-store recheck in this PR
targets that exact root cause.

I also rechecked the earlier identity-copy concern: current code uses
isSafeToCopyOAuthIdentity(...), and the new tests now cover the formerly dangerous
non-overlapping identity-field cases at all three adoption sites.

I reran:

pnpm test src/agents/auth-profiles/oauth.concurrent-20-agents.test.ts
src/agents/auth-profiles/oauth.mirror-refresh.test.ts
src/agents/auth-profiles/oauth-identity.test.ts
src/agents/auth-profiles/oauth.adopt-identity.test.ts
src/agents/auth-profiles/oauth-refresh-queue.test.ts
src/agents/auth-profiles/oauth-refresh-error.test.ts
src/agents/auth-profiles/oauth-lock-path.test.ts
src/agents/auth-profiles/oauth-refresh-timeout.test.ts

Result: 8 files passed, 108 tests passed.

I think this is ready to land.

Non-blocking notes:

  • the PR body is stale and still describes an older design iteration
  • withRefreshCallTimeout() still does not abort underlying work, but that limitation is
    now documented and can be handled in follow-up work

@vincentkoc vincentkoc merged commit 8e79080 into main Apr 17, 2026
68 checks passed
@vincentkoc vincentkoc deleted the fix/oauth-refresh-cross-agent-lock branch April 17, 2026 06:44
@HeroSizy
Copy link
Copy Markdown

@vincentkoc Thanks for picking this up and porting it against the current tree! makes total sense given how far the auth-profiles/ area has moved since I opened this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

agents Agent runtime and tooling maintainer Maintainer-authored PR size: XL

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants