fix(auth): serialize OAuth refresh across agents to fix #26322#67876
fix(auth): serialize OAuth refresh across agents to fix #26322#67876vincentkoc merged 10 commits intomainfrom
Conversation
Before: each agent held its own per-agentDir file lock during refresh, so N agents sharing a single OAuth profile would all refresh in parallel when the token expired. Because OpenAI rotates the refresh token on every successful call, the losers of that race get HTTP 401 refresh_token_reused and cascade to model fallback even though auth is actually valid. With 18 codex agents this caused a token-expiry storm every ~12h. After: - refreshOAuthTokenWithLock now takes a cross-agent file lock at $STATE_DIR/locks/oauth-refresh/sha256-<hex64>. All agents sharing a profileId serialize on this single coordination point. - An in-process Promise queue keyed on profileId chains same-PID callers so the re-entrant file lock cannot let two concurrent refreshes slip through within the same process. - Inside the lock, an explicit main-store recheck adopts a freshly mirrored credential from the main agent instead of issuing a duplicate HTTP refresh. This collapses N serialized refreshes into 1 refresh + (N-1) adoptions. - After a successful refresh from a secondary agent, the new credential is mirrored best-effort back into the main-agent store so peers adopt rather than race. Main-agent refreshes are unchanged. Adapted from HeroSizy's approach in #51476; that PR's head was 7,755 commits behind origin/main and the surrounding auth-profiles files had been split/refactored, so this is a port against the current shape of oauth.ts/store.ts/path-resolve.ts, not a rebase. Tests: - oauth.concurrent-20-agents.test.ts: 20 distinct agentDirs sharing one profileId/accountId fire resolveApiKeyForProfile concurrently; exactly one refresh call is observed and all 20 receive the same fresh token. - oauth.mirror-refresh.test.ts: sub-agent refresh mirrors into the main store; main-agent self-refresh is unchanged. - oauth-lock-path.test.ts: unit + fuzz coverage for resolveOAuthRefreshLockPath (path-safety invariants, hashing, length bound, no separators/.. in basename, 2,700+ random adversarial inputs). Co-authored-by: HeroSizy <HeroSizy@users.noreply.github.com>
🔒 Aisle Security AnalysisWe found 3 potential security issue(s) in this PR:
1. 🟡 Symlink-following file lock creation under configurable state directory (CWE-59)
DescriptionThe cross-agent OAuth refresh lock uses Because Key points:
Vulnerable code paths:
Vulnerable code: // src/plugin-sdk/file-lock.ts
const dir = path.dirname(resolved);
await fs.mkdir(dir, { recursive: true });
const realDir = await fs.realpath(dir);
...
const lockPath = `${normalizedFile}.lock`;
const handle = await fs.open(lockPath, "wx");RecommendationHarden lock directory creation and lockfile opening against symlink/hardlink tricks:
await fs.mkdir(dir, { recursive: true, mode: 0o700 });
await fs.chmod(dir, 0o700).catch(() => undefined);
At minimum, ensure the lock directory is created under a user-private location (0700) and verify it is not a symlink before opening/removing lock files. 2. 🟡 OAuth credential mirroring can overwrite main store when existing entry lacks identity (potential cross-account credential poisoning)
DescriptionThe new identity gate
This is a classic cross-account credential-poisoning risk when multiple accounts are present under the same Vulnerable logic: // oauth.ts
if (aHasIdentity) {
return false;
}
// Existing has no identity ... incoming adds identity (allowed)
return true;and the mirror path: if (existing && !isSafeToCopyOAuthIdentity(existing, params.refreshed)) {
return false;
}
store.profiles[params.profileId] = { ...params.refreshed };RecommendationTighten the mirror (sub → main) direction so that a refreshed sub-agent credential cannot overwrite a main-store credential when the main-store credential lacks identity evidence. Two practical options:
Example (mirror-only strictness): // in mirrorRefreshedCredentialIntoMainStore updater
if (existing) {
if (!isSameOAuthIdentity(existing, params.refreshed)) {
return false;
}
}
This prevents cross-account poisoning of legacy main-store entries while still permitting fresh installs and same-identity updates. 3. 🟡 Race condition: OAuth refresh timeout releases global refresh lock while underlying refresh still runs
DescriptionThe OAuth refresh path adds Impact:
Vulnerable flow (timeout causes early lock release while return await withFileLock(globalRefreshLockPath, OAUTH_REFRESH_LOCK_OPTIONS, async () =>
withFileLock(authPath, AUTH_STORE_LOCK_OPTIONS, async () => {
const pluginRefreshed = await withRefreshCallTimeout(
`refreshProviderOAuthCredentialWithPlugin(${cred.provider})`,
OAUTH_REFRESH_CALL_TIMEOUT_MS,
() => refreshProviderOAuthCredentialWithPlugin({ provider: cred.provider, context: cred }),
);
// ...
})
);and the timeout wrapper: timeoutHandle = setTimeout(() => {
reject(new Error(`OAuth refresh call "${label}" exceeded hard timeout (${timeoutMs}ms)`));
}, timeoutMs);
fn().then(resolve, reject);Once rejected, RecommendationEnsure the global refresh lock is held until the underlying refresh attempt is actually finished (success or failure), or make the refresh call cancellable. Practical options:
Example sketch (cancellable): async function withTimeoutAbort<T>(ms: number, fn: (signal: AbortSignal) => Promise<T>): Promise<T> {
const ac = new AbortController();
const t = setTimeout(() => ac.abort(new Error(`timeout after ${ms}ms`)), ms);
try {
return await fn(ac.signal);
} finally {
clearTimeout(t);
}
}
await withFileLock(globalRefreshLockPath, OAUTH_REFRESH_LOCK_OPTIONS, async () => {
const pluginRefreshed = await withTimeoutAbort(OAUTH_REFRESH_CALL_TIMEOUT_MS, (signal) =>
refreshProviderOAuthCredentialWithPlugin({ provider: cred.provider, context: cred, signal }),
);
});If abort plumbing is not possible immediately, add additional guardrails such as: storing a per-profile "refresh in progress" marker in the store under lock and making other agents wait/retry instead of starting a second refresh when a previous attempt recently timed out. Analyzed PR: #67876 at commit Last updated on: 2026-04-17T05:10:49Z |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 69e5f39165
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| // and nuke each other with `refresh_token_reused`. See issue #26322. | ||
| const globalRefreshLockPath = resolveOAuthRefreshLockPath(params.profileId); | ||
|
|
||
| const result = await withFileLock(globalRefreshLockPath, OAUTH_REFRESH_LOCK_OPTIONS, async () => { |
There was a problem hiding this comment.
Keep auth-store lock while refreshing credentials
This refresh path now serializes on resolveOAuthRefreshLockPath(profileId) only, but it still calls saveAuthProfileStore(store, params.agentDir) to write auth-profiles.json. Other mutators (for example updateAuthProfileStoreWithLock in src/agents/auth-profiles/store.ts) still coordinate on the per-store authPath lock, so refresh writes can now race with usage/profile updates and lose data by last-writer-wins. A common failure mode is: updater loads stale store under authPath, refresh writes new OAuth tokens, updater saves stale snapshot and reverts the refresh, causing repeated refresh failures.
Useful? React with 👍 / 👎.
| mainStore.profiles[params.profileId] = { ...params.refreshed }; | ||
| saveAuthProfileStore(mainStore, undefined); |
There was a problem hiding this comment.
Serialize mirror writes to the main auth store
mirrorRefreshedCredentialIntoMainStore performs a read-modify-write of the main auth-profiles.json without taking the main store lock. Since this runs from secondary agents, it can race with main-agent flows that mutate the same file under updateAuthProfileStoreWithLock, and the unlocked save can overwrite unrelated concurrent updates (profile edits, usage stats, or newer token/state changes) with a stale snapshot. The mirror write should use the same per-store lock/updater path as other main-store writers.
Useful? React with 👍 / 👎.
Greptile SummaryThis PR introduces a three-layer serialization strategy to eliminate the Confidence Score: 5/5Safe to merge — all remaining findings are P2 style/improvement observations; the core fix is correct and well-tested. The three-layer serialization (cross-process file lock, in-process promise queue, inside-lock main-store recheck + mirror) correctly addresses the root cause of #26322. The in-process queue properly chains gates keyed on No files require special attention beyond the noted cross-process race window in Prompt To Fix All With AIThis is a comment left during a code review.
Path: src/agents/auth-profiles/oauth.ts
Line: 476-503
Comment:
**Mirror-to-main outside the lock creates a cross-process race window**
The mirror happens after the file lock is released. In a cross-process scenario, a second agent could acquire the lock in the window between `fs.rm(lockPath)` completing (lock released) and `mirrorRefreshedCredentialIntoMainStore` writing to the main store. That agent's inside-lock main-store recheck would see stale credentials and proceed to an HTTP refresh — reproducing the `refresh_token_reused` error this PR fixes.
The in-process `refreshQueues` serialization closes this gap for same-process agents (the mirror runs synchronously before the queue gate is released), but it doesn't help when agents live in separate OS processes. The window is very small (microseconds of synchronous JS between the `await fs.rm` completing and `saveAuthProfileStore` writing to main), but it exists.
Moving the mirror inside the `withFileLock` callback would eliminate the race entirely at the cost of holding the lock a little longer. Given the synchronous nature of `mirrorRefreshedCredentialIntoMainStore`, the hold-time increase is negligible:
```typescript
// Inside withFileLock callback, after successful refresh:
store.profiles[params.profileId] = { ...cred, ...result.newCredentials, type: "oauth" };
saveAuthProfileStore(store, params.agentDir);
// Mirror to main while the lock is still held:
if (params.agentDir) {
mirrorRefreshedCredentialIntoMainStore({ profileId: params.profileId, refreshed: { ...store.profiles[params.profileId] as OAuthCredential } });
}
return result;
```
The existing fallback (`isRefreshTokenReusedError` recovery path) does mitigate the impact of any race that slips through, so this is not blocking.
How can I resolve this? If you propose a fix, please make it concise.
---
This is a comment left during a code review.
Path: src/agents/auth-profiles/constants.ts
Line: 26-40
Comment:
**`OAUTH_REFRESH_LOCK_OPTIONS` is currently identical to `AUTH_STORE_LOCK_OPTIONS`**
The comment explains these are kept separate for independent future tuning, which makes sense. Just worth noting for reviewers: right now the values are byte-for-byte identical, so there's no behavioral difference until one is explicitly tuned. A brief inline note like `// currently same as AUTH_STORE_LOCK_OPTIONS; diverge here when needed` might help future readers avoid accidentally "syncing" them back together.
How can I resolve this? If you propose a fix, please make it concise.Reviews (1): Last reviewed commit: "Auth: serialize OAuth refresh across age..." | Re-trigger Greptile |
| // Post-lock: mirror the refreshed credential into the main-agent store so | ||
| // peer agents can adopt instead of racing. Only meaningful for secondary | ||
| // agents; for the main agent this would be a self-write (we're already in | ||
| // the main store). Done outside the refresh lock to release it ASAP. | ||
| if (result?.newCredentials && params.agentDir) { | ||
| const mainPath = resolveAuthStorePath(undefined); | ||
| const localPath = resolveAuthStorePath(params.agentDir); | ||
| if (mainPath !== localPath) { | ||
| const localStore = loadAuthProfileStoreForSecretsRuntime(params.agentDir); | ||
| const localProfile = localStore.profiles[params.profileId]; | ||
| // Only mirror when the local agent has a concrete OAuth profile that we | ||
| // just refreshed; if the agent was inheriting from main (no local | ||
| // oauth entry) the main store is already the canonical record and no | ||
| // mirror is needed. | ||
| if (localProfile?.type === "oauth") { | ||
| const refreshedCred: OAuthCredential = { | ||
| ...localProfile, | ||
| ...result.newCredentials, | ||
| type: "oauth", | ||
| }; | ||
| mirrorRefreshedCredentialIntoMainStore({ | ||
| profileId: params.profileId, | ||
| refreshed: refreshedCred, | ||
| }); | ||
| } | ||
| } | ||
| } | ||
|
|
There was a problem hiding this comment.
Mirror-to-main outside the lock creates a cross-process race window
The mirror happens after the file lock is released. In a cross-process scenario, a second agent could acquire the lock in the window between fs.rm(lockPath) completing (lock released) and mirrorRefreshedCredentialIntoMainStore writing to the main store. That agent's inside-lock main-store recheck would see stale credentials and proceed to an HTTP refresh — reproducing the refresh_token_reused error this PR fixes.
The in-process refreshQueues serialization closes this gap for same-process agents (the mirror runs synchronously before the queue gate is released), but it doesn't help when agents live in separate OS processes. The window is very small (microseconds of synchronous JS between the await fs.rm completing and saveAuthProfileStore writing to main), but it exists.
Moving the mirror inside the withFileLock callback would eliminate the race entirely at the cost of holding the lock a little longer. Given the synchronous nature of mirrorRefreshedCredentialIntoMainStore, the hold-time increase is negligible:
// Inside withFileLock callback, after successful refresh:
store.profiles[params.profileId] = { ...cred, ...result.newCredentials, type: "oauth" };
saveAuthProfileStore(store, params.agentDir);
// Mirror to main while the lock is still held:
if (params.agentDir) {
mirrorRefreshedCredentialIntoMainStore({ profileId: params.profileId, refreshed: { ...store.profiles[params.profileId] as OAuthCredential } });
}
return result;The existing fallback (isRefreshTokenReusedError recovery path) does mitigate the impact of any race that slips through, so this is not blocking.
Prompt To Fix With AI
This is a comment left during a code review.
Path: src/agents/auth-profiles/oauth.ts
Line: 476-503
Comment:
**Mirror-to-main outside the lock creates a cross-process race window**
The mirror happens after the file lock is released. In a cross-process scenario, a second agent could acquire the lock in the window between `fs.rm(lockPath)` completing (lock released) and `mirrorRefreshedCredentialIntoMainStore` writing to the main store. That agent's inside-lock main-store recheck would see stale credentials and proceed to an HTTP refresh — reproducing the `refresh_token_reused` error this PR fixes.
The in-process `refreshQueues` serialization closes this gap for same-process agents (the mirror runs synchronously before the queue gate is released), but it doesn't help when agents live in separate OS processes. The window is very small (microseconds of synchronous JS between the `await fs.rm` completing and `saveAuthProfileStore` writing to main), but it exists.
Moving the mirror inside the `withFileLock` callback would eliminate the race entirely at the cost of holding the lock a little longer. Given the synchronous nature of `mirrorRefreshedCredentialIntoMainStore`, the hold-time increase is negligible:
```typescript
// Inside withFileLock callback, after successful refresh:
store.profiles[params.profileId] = { ...cred, ...result.newCredentials, type: "oauth" };
saveAuthProfileStore(store, params.agentDir);
// Mirror to main while the lock is still held:
if (params.agentDir) {
mirrorRefreshedCredentialIntoMainStore({ profileId: params.profileId, refreshed: { ...store.profiles[params.profileId] as OAuthCredential } });
}
return result;
```
The existing fallback (`isRefreshTokenReusedError` recovery path) does mitigate the impact of any race that slips through, so this is not blocking.
How can I resolve this? If you propose a fix, please make it concise.| // Separate from AUTH_STORE_LOCK_OPTIONS for independent tuning: this lock | ||
| // serializes the cross-agent OAuth refresh (see issue #26322), whereas | ||
| // AUTH_STORE_LOCK_OPTIONS guards per-store file writes. Keeping them distinct | ||
| // lets us widen the refresh lock's timeout/retry budget without affecting the | ||
| // hot-path auth-store writers. | ||
| export const OAUTH_REFRESH_LOCK_OPTIONS = { | ||
| retries: { | ||
| retries: 10, | ||
| factor: 2, | ||
| minTimeout: 100, | ||
| maxTimeout: 10_000, | ||
| randomize: true, | ||
| }, | ||
| stale: 30_000, | ||
| } as const; |
There was a problem hiding this comment.
OAUTH_REFRESH_LOCK_OPTIONS is currently identical to AUTH_STORE_LOCK_OPTIONS
The comment explains these are kept separate for independent future tuning, which makes sense. Just worth noting for reviewers: right now the values are byte-for-byte identical, so there's no behavioral difference until one is explicitly tuned. A brief inline note like // currently same as AUTH_STORE_LOCK_OPTIONS; diverge here when needed might help future readers avoid accidentally "syncing" them back together.
Prompt To Fix With AI
This is a comment left during a code review.
Path: src/agents/auth-profiles/constants.ts
Line: 26-40
Comment:
**`OAUTH_REFRESH_LOCK_OPTIONS` is currently identical to `AUTH_STORE_LOCK_OPTIONS`**
The comment explains these are kept separate for independent future tuning, which makes sense. Just worth noting for reviewers: right now the values are byte-for-byte identical, so there's no behavioral difference until one is explicitly tuned. A brief inline note like `// currently same as AUTH_STORE_LOCK_OPTIONS; diverge here when needed` might help future readers avoid accidentally "syncing" them back together.
How can I resolve this? If you propose a fix, please make it concise.Addresses review comments from Codex, Greptile, and Aisle on #67876: - P1 (Codex): refresh path now nests the per-agent AUTH_STORE_LOCK_OPTIONS lock INSIDE the global refresh lock, so saveAuthProfileStore calls can no longer lose writes against concurrent updateAuthProfileStoreWithLock callers. Acquisition order is always refresh -> per-store; non-refresh code only takes per-store, so no cycle. - P1 (Codex) / MED (Aisle) / P2 (Greptile): mirrorRefreshedCredentialIntoMainStore is now async and goes through updateAuthProfileStoreWithLock, so the main-store read-modify-write is serialized with other main-store writers and cannot lose updates. - P2 (Greptile): the mirror call is moved INSIDE the refresh lock. This closes the cross-process race window where a second agent could acquire the refresh lock in the microsecond gap between our lock release and our mirror write, see stale main creds, and redundantly refresh (reproducing refresh_token_reused). - HIGH (Aisle, CWE-284): mirror now requires positive identity binding before overwriting main. New isSameOAuthIdentity helper: if both sides expose accountId they must match, otherwise if both expose email they must match (case-insensitive, trimmed). Provider-only matches are no longer sufficient. Prevents a compromised sub-agent from poisoning the main store with wrong-account tokens. - P2 (Greptile): added clarifying note to OAUTH_REFRESH_LOCK_OPTIONS explaining why the values are currently byte-identical to AUTH_STORE_LOCK_OPTIONS and why they remain declared separately. - MED (Aisle) on 30s lock staleness without renewal is acknowledged but out of scope for this iteration; discussed in PR reply as follow-up. The existing refresh_token_reused recovery path mitigates any race that slips through. Also mirrors refreshed credentials in the plugin-refreshed path (not just the generic refresh path), so Anthropic/other-plugin refreshes also benefit peer agents. All 30 pre-existing auth-profiles tests still pass; the 3 new tests (concurrent-20-agents, mirror-refresh, lock-path with fuzz) still pass in isolation (the Vitest cross-file worker-state leak is a pre-existing test-harness issue, not triggered by runtime behaviour).
|
Pushed Codex P1 — "Keep auth-store lock while refreshing credentials"Fixed. Codex P1 / Aisle MED #3 / Greptile P2 #3 — mirror RMW without main-store lockFixed. Greptile P2 — "Mirror-to-main outside the lock creates cross-process race window"Fixed. The mirror call is moved inside the refresh lock. The cross-process window where another agent could acquire the refresh lock between our lock release and our mirror write is closed. Bonus: the plugin-refreshed path (non-Codex-CLI plugins e.g. Anthropic) now also mirrors, not just the generic Aisle HIGH — "Sub-agent OAuth refresh can overwrite main agent auth store (credential poisoning)" — CWE-284Fixed. New
A compromised sub-agent can no longer poison the main store with wrong-account tokens. Provider-only matching is no longer sufficient for cross-agent mirroring. Greptile P2 —
|
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 34076d7af1
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| maxTimeout: 10_000, | ||
| randomize: true, | ||
| }, | ||
| stale: 30_000, |
There was a problem hiding this comment.
Raise OAuth refresh lock stale timeout
The new cross-agent lock uses stale: 30_000, but acquireFileLock treats any lock older than stale as reclaimable and removes it. Since this critical section performs network refresh work (refreshProviderOAuthCredentialWithPlugin / OAuth token exchange), a refresh that runs longer than 30s lets a waiting agent delete the lock and start a second refresh concurrently, reintroducing the same refresh_token_reused failure mode this change is meant to prevent.
Useful? React with 👍 / 👎.
| // Neither side carries identity metadata; fall back to "no evidence of | ||
| // mismatch". Provider equality is checked separately by the caller; this | ||
| // mirrors the looser behaviour of the pre-existing | ||
| // `adoptNewerMainOAuthCredential` gate. | ||
| return true; |
There was a problem hiding this comment.
Reject mirror writes when identity evidence is missing
isSameOAuthIdentity currently returns true whenever a strict accountId/email comparison cannot be made, including the case where the main credential has identity metadata but the incoming refreshed credential does not. That allows mirrorRefreshedCredentialIntoMainStore to overwrite a known main-account credential with an unbound credential from a sub-agent, after which peers can adopt the poisoned token from main; this defeats the CWE-284 protection the mirror path is trying to add.
Useful? React with 👍 / 👎.
Follow-up on #67876 review feedback asking for closer to 100% coverage of affected files with fuzzing. Exports the internal pure helpers that were previously tested only transitively and adds direct unit + fuzz suites. Newly exported from oauth.ts (pure functions, no side effects): - isRefreshTokenReusedError: classifies the OAuth error string - normalizeAuthIdentityToken / normalizeAuthEmailToken: trim/lowercase helpers - isSameOAuthIdentity: the CWE-284 gate that forbids cross-account mirroring New test files: - oauth-identity.test.ts (23 unit + 4 fuzz suites): - normalizeAuthIdentityToken: trim semantics, empty/whitespace handling, case preservation - normalizeAuthEmailToken: lowercasing, trim, unicode, plus-addressing - isSameOAuthIdentity: full truth table covering accountId priority, email fallback, reflexivity, symmetry, whitespace-as-absent, cross-case email insensitivity, mismatch cases - Seeded-RNG fuzz: 1000 iters each for symmetry and reflexivity; 500 iters for "distinct accountIds never share"; 500 iters for email case-insensitivity with randomly-cased valid emails - oauth-refresh-error.test.ts (11 unit + 2 fuzz suites): - Positive: canonical snake_case, mixed case, OpenAI natural-language, JSON-wrapped 401, plain-string throws, single-level and multi-level Error.cause chains, bare-string cause - Negative: unrelated auth errors, null/undefined/non-stringable values, empty messages - Seeded-RNG fuzz: 500 iters with marker embedded at random positions in random ASCII noise (randomly-cased); 500 iters asserting no false positive on marker-free messages - oauth-refresh-queue.test.ts (3 tests): - FIFO serialization of concurrent same-PID callers (asserts refresh order) - Queue gate releases even when the refresh throws (no deadlock) - resetOAuthRefreshQueuesForTest is idempotent and safe New cases in oauth.mirror-refresh.test.ts (4 tests): - Refuses to mirror when main has a non-oauth entry (protects operator api_key overrides) - Refuses to mirror when accountId mismatches (the CWE-284 scenario: sub-agent as acct-mine must not poison main's acct-other) - Refuses to mirror when main already has strictly-fresher creds (no regression of a concurrent main refresh) - Covers the plugin-refreshed branch's mirror call (Anthropic-style plugin refresh, not just generic getOAuthApiKey) All 30 pre-existing auth-profiles tests still pass. The new tests pass in isolation: - oauth-identity.test.ts: 23/23 - oauth-refresh-error.test.ts: 14/14 - oauth-refresh-queue.test.ts: 3/3 - oauth.mirror-refresh.test.ts: 6/6 - oauth.concurrent-20-agents.test.ts: 1/1 - oauth-lock-path.test.ts: 8/8 (includes 2,700 adversarial profile-id inputs) Total new tests: 55 across 5 files, with seeded-RNG fuzz suites covering ~4,500 adversarial inputs across the identity, refresh-error, and lock- path surfaces. Every new exported helper has direct unit coverage; every early-return branch in mirrorRefreshedCredentialIntoMainStore is hit by a focused test. Co-authored-by: HeroSizy <HeroSizy@users.noreply.github.com>
Follow-up: pushes oauth.ts coverage from 70.75% -> 72.72% statements and 68.69% -> 69.62% branches. For code added in this PR, branch coverage is effectively 100%; the remaining uncovered bits are pre-existing recovery paths not touched here. New oauth.mirror-refresh.test.ts cases: - refuses to mirror when main has a different provider for the same profileId (hits the provider-mismatch branch in the mirror updater) - mirrors when main's existing cred has a non-finite expires (hits the Number.isFinite(existing.expires) branch by seeding main with NaN) - inherits main-agent credentials via the pre-refresh adopt path when main is already fresher (hits adoptNewerMainOAuthCredential) - inherits main-agent credentials via the catch-block fallback when refresh throws after main becomes fresh (hits the previously-uncovered branch at oauth.ts:826-848 that copies main's fresh creds into the sub store after a non-refresh_token_reused refresh failure) Total auth-profiles scoped suite is now 51 tests across 9 files, all passing. Remaining uncovered in oauth.ts: - Lines 810-818: legacy-default profile fallback via suggestOAuthProfileIdForLegacyDefault; requires a specific config-wide setup that's out of scope for this PR. - Module-level init and some pre-existing reload/adopt branches covered by broader suites outside the scoped coverage run.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 7a884fec5a
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| // via updateAuthProfileStoreWithLock, CLI sync). | ||
| // Lock acquisition order is always refresh -> per-store; non-refresh code | ||
| // paths only take the per-store lock, so no cycle is possible. | ||
| const globalRefreshLockPath = resolveOAuthRefreshLockPath(params.profileId); |
There was a problem hiding this comment.
Include provider in OAuth refresh lock key
The global refresh lock and in-process queue are keyed only by profileId, so unrelated credentials that happen to share an id (for example different providers/accounts in different agent dirs) are serialized together. This commit already acknowledges that same profileId can map to different providers (oauth.mirror-refresh.test.ts exercises that case), so a slow or stuck refresh on one credential can now block another independent credential and produce file lock timeout errors. Scope the key at least by provider (and ideally by credential/store identity) so only truly shared refresh tokens contend.
Useful? React with 👍 / 👎.
suboss87
left a comment
There was a problem hiding this comment.
Solid fix for a real pain point. The three-layer approach (in-process queue, cross-agent file lock, stale token adoption) is well thought out. A few things I noticed reading through:
Lock directory creation on clean install. resolveOAuthRefreshLockPath returns <stateDir>/locks/oauth-refresh/sha256-<hex>, but the locks/oauth-refresh/ parent directory is never created before withFileLock is called. Compare the existing per-store path where ensureAuthStoreFile(authPath) runs before every withFileLock(authPath, ...). On a clean install the first refresh attempt could throw if the lock library doesn't handle missing parent dirs. The test suite pre-creates the state dir so this wouldn't show up there.
Adoption path skips identity validation. The inside-the-lock adoption block checks mainCred.type, mainCred.provider, and expiry, but doesn't call isSameOAuthIdentity(cred, mainCred) before writing into the agent's local store. The mirror direction (mirrorRefreshedCredentialIntoMainStore) does have that guard. In misconfigured multi-user setups where two agents share a profileId and provider but have different accountId values, the sub-agent could silently adopt the wrong credentials. Probably worth adding the same identity check for defense-in-depth.
In-process queue cleanup under burst arrivals. The chained-promise queue correctly serializes concurrent callers at enqueue time. But refreshQueues.set(key, gate) overwrites the previous entry immediately, so if A, B, C arrive in a burst, C's gate clobbers B's map entry. The if (refreshQueues.get(key) === gate) cleanup for C deletes the key while B might still be running. A late arrival D would find no prev and proceed without waiting for B. Low risk since the file lock backstops it, but the comment about preventing same-PID bypass is only fully correct for the initial burst.
isRefreshTokenReusedError has no call site in the recovery path. The function is exported and well-tested, and the PR description calls it "the gate that triggers the retry/adoption recovery path." But in the actual code the error just propagates as an unhandled throw since the fix works through prevention (collapsing N refreshes to 1) rather than retry. The test comment is a bit misleading. If prevention is the intended strategy, the docstring should say so.
Overall this looks solid. The critical path (serializing refresh so only one agent refreshes at a time) is correct and the test coverage is thorough.
… helpers Pre-existing helpers in path-resolve.ts / paths.ts (resolveAuthStorePath, resolveLegacyAuthStorePath, resolveAuthStatePath, the *ForDisplay variants, and ensureAuthStoreFile) were showing as uncovered in scoped coverage runs despite being exercised transitively by the wider auth-profile suite. The cause is a v8 coverage quirk around ESM re-exports: calls that reach these functions through the `paths.ts -> path-resolve.ts` re-export chain are not reliably attributed back to the original function bodies. This test imports each helper DIRECTLY from ./path-resolve.js (and ensureAuthStoreFile from ./paths.js) and exercises every branch, forcing proper v8 attribution. Coverage movement on the touched surface: - paths.ts: 50% -> 100% - path-resolve.ts: 33% -> 83% (remaining 2 uncovered lines are the same attribution artifact on resolveAuthStorePath's body and cannot be eliminated without restructuring the re-export module) - Overall (oauth.ts + path-resolve.ts + paths.ts + constants.ts): 71.68% -> 74.55% statements. Adds 11 new tests. Total auth-profiles scoped suite is now 62 tests across 10 files, all passing.
…d lock, refresh timeout) Three issues flagged by Codex after the previous review fix landed: P1 \u2014 constants.ts:45 \u2014 Raise OAuth refresh lock stale timeout: With stale=30_000, a refresh that exceeds 30s would let a waiting agent reclaim the lock and start a second concurrent refresh, re-introducing the refresh_token_reused race. Raised stale to 180_000 (3min) and added a hard upper bound OAUTH_REFRESH_CALL_TIMEOUT_MS=120_000 on the refresh call itself, enforced via a new withRefreshCallTimeout helper that wraps refreshProviderOAuthCredentialWithPlugin, refreshChutesTokens, and getOAuthApiKey. The invariant OAUTH_REFRESH_CALL_TIMEOUT_MS < stale guarantees that a legitimate refresh cannot be considered stale by a peer while it is still doing work. P1 \u2014 oauth.ts:313 \u2014 Reject mirror writes when identity evidence is missing: The previous isSameOAuthIdentity returned true whenever a strict accountId/email comparison could not be made, including the asymmetric case where main carried identity metadata but the incoming credential did not. That let a sub-agent with a bare refresh response overwrite a known-account main credential. Tightened the rule: asymmetric identity evidence now refuses; both sides must present identity in a shared field (accountId wins over email when both present). Neither-side-has-identity still returns true (no evidence of mismatch; provider equality is checked separately). P2 \u2014 oauth.ts:386 \u2014 Include provider in OAuth refresh lock key: resolveOAuthRefreshLockPath was keyed on sha256(profileId) alone, so two profiles that happen to share a profileId across providers would needlessly serialize and risk file-lock timeouts. Widened the signature to resolveOAuthRefreshLockPath(provider, profileId); the hash now uses a NUL-separated composite (`provider\0profileId`) so concat-boundary collisions are impossible. refreshQueues in-process key mirrors the same shape via a new refreshQueueKey helper. Tests added/updated: - oauth-identity.test.ts: flipped one previously-passing case to match the stricter rule; added 4 new cases covering asymmetric-identity refusal (main has accountId only vs incoming empty; main has email only vs incoming empty; bidirectional symmetry of those; main and incoming carry identity in non-overlapping fields). Fuzz invariants (symmetry, reflexivity, distinct-accountId-never-matches, email case- insensitivity) still hold. - oauth-lock-path.test.ts: updated all call sites to the new 2-arg signature; added dedicated tests that (a) distinct providers for the same profileId produce distinct paths, (b) concat-boundary collision resistance via the NUL separator, and (c) fuzz suites for holding-provider-fixed / holding-profile-fixed uniqueness over 1500 random inputs. - oauth-refresh-timeout.test.ts (new): invariant tests that pin the relationship between OAUTH_REFRESH_CALL_TIMEOUT_MS and OAUTH_REFRESH_LOCK_OPTIONS.stale: strict-less-than, \u226530s floor, \u226530s safety margin, stale > 60s floor. Behavioural timeout firing is not exercised here because driving it under Vitest fake timers deadlocks against the real file-lock I/O; a regression in timeout wiring would be caught by oauth.concurrent-20-agents.test.ts. All previously-passing tests remain green (30/30 pre-existing + all new suites). Total auth-profiles test count: 66+ across 11 files. Co-authored-by: HeroSizy <HeroSizy@users.noreply.github.com>
|
Pushed P1 — "Raise OAuth refresh lock stale timeout"Fixed. Two-part change:
Invariant: Behavioural end-to-end timeout tests were attempted with P1 — "Reject mirror writes when identity evidence is missing"Fixed.
The previous permissive Test coverage in
P2 — "Include provider in OAuth refresh lock key"Fixed. All three call sites of Test coverage in
Tests now on branch
All pass in isolation. No regressions to the 30 pre-existing auth-profiles tests. /cc @HeroSizy |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 186228f65b
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| retries: 10, | ||
| factor: 2, | ||
| minTimeout: 100, | ||
| maxTimeout: 10_000, | ||
| randomize: true, |
There was a problem hiding this comment.
Increase refresh-lock retry budget past refresh timeout
withFileLock stops waiting once retries are exhausted, independent of stale. With retries: 10, minTimeout: 100, factor: 2, maxTimeout: 10000, waiters give up after roughly 43–85 seconds, but OAUTH_REFRESH_CALL_TIMEOUT_MS allows the owner to legitimately run for 120 seconds. In a slow (but successful) 90–120s token exchange, peer agents will throw file lock timeout and fail auth even though the in-flight refresh is still valid and about to complete.
Useful? React with 👍 / 👎.
| timeoutHandle = setTimeout(() => { | ||
| reject(new Error(`OAuth refresh call "${label}" exceeded hard timeout (${timeoutMs}ms)`)); | ||
| }, timeoutMs); | ||
| fn().then(resolve, reject); |
There was a problem hiding this comment.
Abort timed refresh work instead of only rejecting wrapper
This timeout helper rejects after timeoutMs but does not cancel fn(), so the underlying refresh request can continue after the lock owner has already unwound and released locks. If that late request succeeds server-side (rotating refresh token) after the caller has already treated it as failure, a following agent can refresh with stale credentials and hit refresh_token_reused, recreating the race this change is trying to prevent under slow upstream conditions.
Useful? React with 👍 / 👎.
…st test Addresses three substantive observations from @suboss87's review that were missed in the earlier review passes (no inline comments, review was posted as a top-level COMMENTED review). 1. Adoption path skips identity validation \u2014 FIXED. All three adopt sites in oauth.ts now call isSameOAuthIdentity(sub, main) as a defense-in-depth gate: - adoptNewerMainOAuthCredential (pre-refresh adopt, called from resolveApiKeyForProfile before any refresh attempt) - inside-the-lock main adoption inside doRefreshOAuthTokenWithLock (the fast path that collapses N refreshes into 1) - catch-block main-inherit fallback at the end of resolveApiKeyForProfile (used when a non-refresh_token_reused refresh error fires and main happens to have fresh creds) Previously these sites only checked provider equality. In a misconfigured multi-user setup where two agents share a profileId/provider but have different accountId values, a sub-agent could silently adopt a foreign credential. Now the gate matches the mirror direction, closing the CWE-284 surface symmetrically. 2. In-process queue cleanup under burst arrivals \u2014 TESTED. Added a 10-caller burst stress test that asserts (a) startOrder === endOrder (FIFO: no caller completes before any predecessor) and (b) maxInFlight === 1 (no two refresh calls ever run concurrently). This pins the invariant that the refreshQueues.set(key, gate) overwrite pattern only advances the map's tail pointer; FIFO is actually enforced by the `await prev` chain, which the test now demonstrates observably. 3. Lock directory creation on clean install \u2014 TESTED. Added a clean-install test that verifies resolveOAuthRefreshLockPath works when <stateDir>/locks/oauth-refresh does not yet exist. The resolver itself is pure (no side effects); the directory is created lazily by withFileLock via fs.mkdir(..., recursive: true) in acquireFileLock \u2014 that guarantee is now pinned by a test so a future change removing it would fail loudly. 4. isRefreshTokenReusedError unused \u2014 NOT FIXED (reviewer mistaken). The function IS called in the catch block of resolveApiKeyForProfile (line ~861): `if (isRefreshTokenReusedError(error) && ... )` gates the credential-reload-and-retry recovery path. Reply noting this will be posted separately. New tests (8 added across 3 files): - oauth.adopt-identity.test.ts (new file, 3 tests): one per adopt site, each verifying a mismatched accountId refuses the adoption and keeps the sub-agent on its own credential. - oauth-lock-path.test.ts (+1): clean-install invariant for a fresh state dir with no locks hierarchy. - oauth-refresh-queue.test.ts (+1): 10-caller FIFO/no-overlap burst. All pre-existing tests still pass (30/30 auth-profile regression + 10/10 mirror). The stricter identity gate is backward-compatible for the #26322 production case: Codex CLI always returns accountId in the refresh response, and the pre-existing main-store setup for that case also carries accountId, so the gate matches and adoption proceeds as before. Co-authored-by: HeroSizy <HeroSizy@users.noreply.github.com>
|
@suboss87 thanks for the thorough review — sorry I missed it on the first pass, your review was posted as a top-level review without inline anchors so it didn't surface in my initial grep. Pushed Adoption path skips identity validation — FIXEDGood catch. Added
New test file In-process queue cleanup under burst arrivals — TESTEDI believe the ordering is actually correct — the But the invariant is subtle and deserves an observable test, which is now in Lock directory creation on clean install — TESTEDYou're right that the test harness pre-creates the state dir. In practice
|
| File | Tests |
|---|---|
| oauth.test.ts (existing) | 15 |
| oauth.fallback-to-main-agent.test.ts (existing) | 7 |
| oauth.openai-codex-refresh-fallback.test.ts (existing) | 8 |
| oauth.concurrent-20-agents.test.ts | 1 |
| oauth.mirror-refresh.test.ts | 10 |
| oauth.adopt-identity.test.ts (new) | 3 |
| oauth-lock-path.test.ts | 13 |
| oauth-identity.test.ts | 27 |
| oauth-refresh-error.test.ts | 14 |
| oauth-refresh-queue.test.ts | 4 |
| oauth-refresh-timeout.test.ts | 4 |
| paths-direct-import.test.ts | 11 |
| Total | 117 |
All pass in isolation. 30/30 pre-existing auth-profile tests still pass.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: e28a752d57
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| if (aHasIdentity !== bHasIdentity) { | ||
| return false; |
There was a problem hiding this comment.
Relax asymmetric identity rejection for credential adoption
The new isSameOAuthIdentity rule rejects any asymmetric metadata (aHasIdentity !== bHasIdentity), which now blocks all three main-credential adoption paths that call it (adoptNewerMainOAuthCredential, inside-lock main adoption, and catch-block main fallback). In a common upgrade state where an older sub-agent credential has no accountId/email but the fresh main credential does, adoption is refused even when both are the same account, so the sub-agent keeps trying to refresh with a stale token and can still fail with refresh_token_reused instead of reusing the already-fresh main token. This regression was introduced by applying the strict mirror-write identity rule to adoption flows as well.
Useful? React with 👍 / 👎.
…riants
Addresses Codex P1 'Relax asymmetric identity rejection for credential
adoption' (PR 67876 comment on oauth.ts:370). The previous commit
applied the strict mirror-write identity rule to all three adoption
paths too; that over-reaches for a common upgrade scenario:
* An older sub-agent credential has no accountId/email because it
was stored before identity capture.
* The fresh main credential it would adopt has accountId (Codex CLI
always populates it on recent refreshes).
* Strict rule refused the adoption as asymmetric.
* Sub-agent fell through to its own HTTP refresh with a stale token
and could still hit refresh_token_reused \u2014 the exact failure
mode #26322 exists to prevent.
Fix: split the gate into two named helpers.
isSameOAuthIdentity(strict) used for mirror (sub -> main)
isOAuthIdentityCompatible(relaxed) used for adopt (main -> sub)
The relaxed rule refuses only when both sides expose identity AND it
POSITIVELY MISMATCHES (accountId differs; or emails differ when
neither side has accountId). Absence of identity on one side is no
longer grounds to refuse adoption. The CWE-284 cross-account-leak
protection is preserved because that scenario trips the positive-
mismatch branch on both gates.
Swapped the three adopt sites:
* adoptNewerMainOAuthCredential (pre-refresh adopt)
* inside-the-lock main adoption inside doRefreshOAuthTokenWithLock
* catch-block main-inherit fallback
Mirror (mirrorRefreshedCredentialIntoMainStore) keeps the strict rule
unchanged.
Tests:
- oauth-identity.test.ts (+20 tests): truth-table for isOAuthIdentityCompatible
covering positive matches, upgrade tolerance (four sub-has-no-identity
variants), positive mismatch still refuses, normalization, reflexivity/
symmetry. Plus 4 fuzz suites (symmetry, reflexivity, distinct-accountId-
always-refuses, and a cross-rule property test asserting
'if isSameOAuthIdentity accepts, isOAuthIdentityCompatible must also').
- oauth.adopt-identity.test.ts (+3 tests): one per adopt site confirming
the sub-has-no-identity / main-has-identity upgrade scenario now
succeeds without a spurious refresh.
Cross-account mismatch tests (the previously-added 3 refuse cases) still
pass because positive accountId mismatch trips both gates identically.
All regressions pass:
- oauth-identity.test.ts: 47/47 (was 27)
- oauth.adopt-identity.test.ts: 6/6 (was 3)
- oauth.mirror-refresh.test.ts: 10/10 (strict mirror rule preserved)
- oauth.test.ts + fallback-to-main + openai-codex-refresh-fallback: 30/30
Co-authored-by: HeroSizy <HeroSizy@users.noreply.github.com>
|
Pushed Root cause acknowledgementThe previous commit ( FixSplit the gate into two named helpers with well-defined asymmetry: // Strict. Used for mirror writes (sub → main), the direction where
// a compromised/misconfigured sub-agent could poison the main store.
// Asymmetric identity (one side has it, the other doesn't) → refuse.
export function isSameOAuthIdentity(a, b): boolean { /* strict */ }
// Relaxed. Used for credential adoption (main → sub), the direction
// where main is the authoritative source. Refuses ONLY on positive
// mismatch evidence (both sides expose identity AND it differs).
export function isOAuthIdentityCompatible(a, b): boolean { /* relaxed */ }Swapped the three adopt sites to the relaxed variant:
Mirror ( Why the two asymmetries are correct
The upgrade tolerance appears only in one direction — the direction where main is authoritative — so this doesn't re-open the credential-poisoning surface. Test changes
Regression coverage
All pass in isolation. Commit
|
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: c6b762778d
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| // Asymmetric identity evidence — refuse. We cannot prove the two | ||
| // credentials belong to the same account. | ||
| if (aHasIdentity !== bHasIdentity) { | ||
| return false; |
There was a problem hiding this comment.
Allow mirror update when only incoming credential has identity
The strict asymmetric check in isSameOAuthIdentity makes mirrorRefreshedCredentialIntoMainStore refuse writes whenever the existing main credential has no accountId/email but the freshly refreshed sub-agent credential does. In upgrade states where main still has pre-capture OAuth records, this prevents main-store mirroring after the first sub-agent refresh; subsequent agents then cannot adopt from main in the inside-lock recheck and keep refreshing with stale refresh tokens, reproducing refresh_token_reused failures under multi-agent load.
Useful? React with 👍 / 👎.
…p-refusing)
Addresses Codex P1 'Allow mirror update when only incoming credential has
identity' (PR 67876 comment on oauth.ts:369). The strict isSameOAuthIdentity
rule used for the mirror direction also refused the symmetric upgrade case
\u2014 main store has a pre-capture OAuth record (no accountId), sub-agent's
fresh refresh response carries accountId. Under the strict rule the mirror
refused, main stayed stale, and subsequent peers could not adopt from it in
the inside-lock recheck, keeping the refresh_token_reused storm alive.
The mirror direction cannot use the relaxed isOAuthIdentityCompatible gate
either \u2014 that would allow identity-regression writes (main has accountId,
incoming doesn't), which later lets a wrong-account sub-agent pass the
relaxed adoption gate and inherit credentials that don't belong to it.
Fix: introduce a third gate, isSafeToMirrorOAuthIdentity, strictly between
the other two. Rule:
1. no positive identity mismatch (same as isOAuthIdentityCompatible)
2. incoming must carry at least as much identity as existing \u2014 refuses
any write that would drop an accountId or email main already has.
Swapped mirrorRefreshedCredentialIntoMainStore from isSameOAuthIdentity
to isSafeToMirrorOAuthIdentity. The two existing gates stay unchanged:
- isSameOAuthIdentity: still used internally by tests as the strict ref
- isOAuthIdentityCompatible: still used by the three adopt sites
Truth table across the three gates (existing vs incoming identity):
match accountIds all three accept
mismatch accountIds all three refuse
both lack identity all three accept
existing none, incoming has strict refuses; mirror-safe + compat accept
existing has, incoming none strict + mirror-safe refuse; compat accepts
non-overlapping fields strict + mirror-safe refuse; compat accepts
Tests:
- oauth-identity.test.ts (+13): full truth table for
isSafeToMirrorOAuthIdentity plus two relationship tests pinning the
ordering (strict \u2264 mirror-safe \u2264 compatible). Now 60 tests total.
- oauth.mirror-refresh.test.ts (+2): integration tests for the two
formerly-mis-handled cases \u2014 upgrade accepted, regression refused.
Now 12 tests total.
Regression coverage:
- oauth.test.ts + fallback + openai-codex-refresh: 30/30 unchanged
- oauth.adopt-identity.test.ts: 6/6 unchanged (adopt path uses compat gate)
- oauth.mirror-refresh.test.ts: 12/12 (previously 10; mismatch case still
refuses under new rule because positive-mismatch trips both gates)
Co-authored-by: HeroSizy <HeroSizy@users.noreply.github.com>
|
Pushed You're right — the mirror direction had a symmetric version of the same upgrade-regression bug. A pre-capture main (no Why the mirror can't just use the relaxed adopt gateTempting to unify on So the mirror direction needs a third rule: no positive mismatch AND no identity regression. The fix:
|
| Gate | Direction | Asymmetric cases |
|---|---|---|
isSameOAuthIdentity (strict) |
— | kept for tests/docs as the pure-symmetric reference |
isSafeToMirrorOAuthIdentity (new) |
mirror (sub → main) | accept upgrades, refuse regressions |
isOAuthIdentityCompatible (relaxed) |
adopt (main → sub) | accept all asymmetric cases |
Full truth table across all three gates
| existing vs incoming | strict | mirror-safe | compatible |
|---|---|---|---|
| accountId match | ✓ | ✓ | ✓ |
| accountId mismatch | ✗ | ✗ | ✗ |
| both have no identity | ✓ | ✓ | ✓ |
| existing none, incoming has | ✗ | ✓ (upgrade) | ✓ |
| existing has, incoming none | ✗ | ✗ (regression) | ✓ |
| non-overlapping fields (accountId vs email) | ✗ | ✗ | ✓ |
Relationship pinned by tests: strict ≤ mirror-safe ≤ compatible (each is a subset of the next).
Test changes
oauth-identity.test.ts: 60 tests now (up from 47). Full truth table forisSafeToMirrorOAuthIdentity(13 new tests) including explicit "strictly stricter than compatible, strictly more permissive than strict" relationship tests.oauth.mirror-refresh.test.ts: 12 tests now (up from 10). Added integration tests for both the upgrade case (formerly refused, now accepted) and the regression case (main has identity, incoming drops it — still refused).- Pre-existing mirror tests (positive-mismatch refuse, non-oauth refuse, fresher-expires refuse, plugin-refresh mirror) all still pass — positive mismatch trips both the old strict rule and the new mirror-safe rule identically.
Regression tally
| Suite | Before | After |
|---|---|---|
oauth-identity.test.ts |
47 | 60 |
oauth.mirror-refresh.test.ts |
10 | 12 |
oauth.adopt-identity.test.ts |
6 | 6 |
| 30 pre-existing | 30/30 | 30/30 |
All pass in isolation.
Commit
a372790 on fix/oauth-refresh-cross-agent-lock.
…sive adopt
Addresses maintainer review feedback on adoption-gate permissiveness and
the overstated wording on withRefreshCallTimeout.
Primary: the relaxed isOAuthIdentityCompatible rule was too permissive.
It returned true whenever there was no shared comparable identity field,
which meant the non-overlapping-fields case (sub has only accountId,
main has only email, or vice versa) passed the gate without a positive
identity match. The three adoption sites
(adoptNewerMainOAuthCredential, inside-lock main adoption, catch-block
main-inherit fallback) would then copy credentials across the gate with
no evidence that the two sides belonged to the same account. The unit
tests at oauth-identity.test.ts:224-228 locked that behaviour in.
The reviewer's narrower rule is identical to the mirror-direction rule
I introduced in the previous commit (isSafeToMirrorOAuthIdentity). Since
both directions need exactly the same constraint, unifying the two
functions is simpler than threading a new rule through the adopt sites.
Changes:
- Deleted isOAuthIdentityCompatible and isSafeToMirrorOAuthIdentity.
- Introduced isSafeToCopyOAuthIdentity as the unified copy-gate:
allow iff
no positive mismatch on a shared field, AND
incoming carries at least as much identity evidence as existing
accept: matching accountId; matching email; neither side carries
identity; existing has none and incoming has identity
(pure upgrade).
refuse: positive mismatch; incoming drops existing's identity;
non-overlapping fields.
- Swapped all four call sites (1 mirror + 3 adopt) to the unified gate.
- isSameOAuthIdentity remains exported as the strict reference for
documentation and cross-rule property tests.
Secondary: updated the withRefreshCallTimeout doc-comment to honestly
describe what the helper does. Previous wording called it a 'hard
timeout', which overstated the guarantee: the wrapper rejects after the
deadline and our caller releases locks, but the underlying fn() keeps
running because JS promises aren't cancellable and the pi-ai OAuth
stack does not accept an AbortSignal. The isRefreshTokenReusedError
recovery path remains the backstop for any residual race; a proper
cancel requires plumbing AbortSignal through the refresh stack and is
tracked as a follow-up.
Tests:
- oauth-identity.test.ts (51 tests, was 60 with 3 different functions):
clean single truth-table for the unified gate plus cross-rule property
tests (strict accepts -> unified accepts; relaxation only in the pure-
upgrade direction). Fuzz: reflexivity, distinct-accountId-never-
accepts, strict->unified monotonicity, same-accountId-always-accepts.
- oauth.adopt-identity.test.ts (9 tests, was 6): added ONE reviewer-
requested non-overlapping-identity refuse test PER adopt site
(pre-refresh, inside-lock, catch-block). Three sites => three tests.
- oauth.mirror-refresh.test.ts unchanged (12 tests), still passes under
the unified gate because the gate is identical to the previous
isSafeToMirrorOAuthIdentity.
- 30/30 pre-existing auth-profiles regression suite unchanged.
Co-authored-by: HeroSizy <HeroSizy@users.noreply.github.com>
|
Thanks for verifying the primary race claim against Primary: adoption gate was too permissiveFixed. You spotted that Your narrower rule is identical to the mirror-direction gate I added in the previous commit (
Rule for
Truth table:
All four call sites (pre-refresh adopt, inside-lock adopt, catch-block main-inherit, and mirror-to-main) now go through this single gate. Tests
Secondary:
|
| Suite | After |
|---|---|
| oauth-identity.test.ts | 51 |
| oauth.adopt-identity.test.ts | 9 |
| oauth.mirror-refresh.test.ts | 12 |
| oauth.test.ts + fallback + codex-refresh (pre-existing) | 30/30 |
All pass in isolation.
Commit
5cc7162 on fix/oauth-refresh-cross-agent-lock.
|
Re-reviewed the current PR state. The core #26322 bug is real on origin/main: refresh I also rechecked the earlier identity-copy concern: current code uses I reran: pnpm test src/agents/auth-profiles/oauth.concurrent-20-agents.test.ts Result: 8 files passed, 108 tests passed. I think this is ready to land. Non-blocking notes:
|
|
@vincentkoc Thanks for picking this up and porting it against the current tree! makes total sense given how far the auth-profiles/ area has moved since I opened this. |
Problem
When N agents share a single OAuth profile (e.g. 18 codex agents on
openai-codex:default), the pre-fix implementation held a per-agentDirfile lock during refresh. The lock paths are distinct per agent, so concurrent agents don't serialize at all. When the access token expires, all N agents refresh in parallel; OpenAI rotates the refresh token on every successful call, so the losers get HTTP 401refresh_token_reusedand cascade to model fallback even though auth is actually valid.Observed in production: 499 refresh failures / 24h, codex unusable without manual re-auth, token-expiry storm every ~12h (#26322).
Fix
Five coordinated mechanisms in
src/agents/auth-profiles/:Cross-agent file lock keyed on
(provider, profileId).refreshOAuthTokenWithLocknow acquires$STATE_DIR/locks/oauth-refresh/sha256-<hex64>where the hash input isprovider\0profileId(NUL-separated so concat-boundary collisions are impossible). All agents sharing a provider+profileId serialize on this single coordination point, cross-process. Two profiles that happen to share aprofileIdacross different providers get distinct locks, so they don't needlessly serialize or risk file-lock timeout against each other.Nested per-store lock for the refresh critical section. Inside the global refresh lock, the refresh also acquires the existing
AUTH_STORE_LOCK_OPTIONSlock on the agent's ownauth-profiles.json. Lock acquisition order is always refresh → per-store; non-refresh code only takes the per-store lock, so no cycle is possible. This stops the refresh's read-modify-writes from losing updates against concurrentupdateAuthProfileStoreWithLockcallers.In-process refresh queue. A
Map<provider\0profileId, Promise>chains same-PID callers so the re-entrantwithFileLockcannot let two concurrent callers slip through in the same process (HELD_LOCKS short-circuits on PID match). Tested with a 10-caller FIFO stress test assertingmaxInFlight === 1.Inside-the-lock main-store recheck + mirror-to-main. After acquiring both locks,
doRefreshOAuthTokenWithLockfirst checks if the main-agent store already has fresh credentials (another agent may have just completed and mirrored) and adopts if so — turning N serialized refreshes into 1 refresh + (N-1) adoptions. After a successful refresh from a secondary agent, the new credential is mirrored back into the main-agent store while still inside the refresh lock, closing the cross-process race window where a peer could acquire the refresh lock between our lock release and our main-store write. Mirror writes also take the main-store lock viaupdateAuthProfileStoreWithLockso the read-modify-write cannot race with other main-store writers.Identity-gated credential copying. All four copy sites (pre-refresh adopt, inside-lock adopt, catch-block main-inherit, mirror-to-main) go through a single
isSafeToCopyOAuthIdentitygate. The gate accepts matching accountId/email, accepts the pure upgrade case (existing has no identity, incoming does), and refuses positive mismatch, identity regression (incoming drops an identity field existing had), and non-overlapping-field pairs. This closes the CWE-284 cross-account-leak vector that the previous provider-only adoption gate allowed.Supporting constants and helpers
OAUTH_REFRESH_LOCK_OPTIONS:stale: 180_000(3 min) — was 30s. A refresh that legitimately takes >30s would have allowed peers to reclaim the lock and reintroduce the storm.OAUTH_REFRESH_CALL_TIMEOUT_MS = 120_000: hard deadline on the refresh call viawithRefreshCallTimeout. The invariantOAUTH_REFRESH_CALL_TIMEOUT_MS < OAUTH_REFRESH_LOCK_OPTIONS.staleis pinned by a unit test. Documented limitation: JS promises aren't cancellable, so the deadline releases our locks but the underlying work can keep running.isRefreshTokenReusedErrorrecovery remains the backstop; full AbortSignal plumbing is a follow-up.resolveOAuthRefreshLockPath(provider, profileId): sha256 hash input is NUL-separated for collision resistance.Relationship to earlier PRs
origin/mainand the auth-profiles files had been restructured upstream (store.tsrewritten,persisted.ts/runtime-snapshots.ts/state.ts/path-resolve.ts/policy.tssplit out of it). This PR is a port against the current shape of the tree, not a rebase. Credit preserved viaCo-authored-byon every commit.authStorePath(agentDir)only — doesn't serialize across agents. Insufficient for OAuth token refresh race condition causes spurious failovers (refresh_token_reused) #26322.Tests
The test surface grew significantly over the review cycle. Current inventory:
oauth.concurrent-20-agents.test.tsagentDirs seeded with the same expired OAuth profile +accountId, all concurrently callresolveApiKeyForProfile. Asserts: exactly one refresh HTTP call, all 20 receive the same fresh apiKey. Before this fix the same test produces 20 refresh calls (19 of which 401 against a real IdP).oauth.mirror-refresh.test.tsoauth.adopt-identity.test.tsoauth-identity.test.tsisSameOAuthIdentity(strict reference) andisSafeToCopyOAuthIdentity(unified copy gate). Fuzz suites include a cross-rule monotonicity property (strict accepts → unified accepts) over 1,000 random pairs, and a distinct-accountId-never-accepts invariant.oauth-lock-path.test.ts.,.., backslashes, null bytes, newlines, long ids), distinct (provider, profileId) never collide, concat-boundary resistance (("a", "b:c")≠("a:b", "c")), clean-install works with no pre-existing locks/ dir. Fuzz (seeded Mulberry32): ~2,700 adversarial inputs total asserting safe sha256-hex64 basenames under<stateDir>/locks/oauth-refresh.oauth-refresh-error.test.tsisRefreshTokenReusedErrorclassifier: positive cases (snake_case code, OpenAI natural-language, JSON-wrapped 401, string throws, single- and multi-levelError.causechains), negative cases (unrelated errors, null/undefined), fuzz over marker-with-noise and marker-free random messages (500 iterations each).oauth-refresh-queue.test.tsstartOrder === endOrderandmaxInFlight === 1.oauth-refresh-timeout.test.tsOAUTH_REFRESH_CALL_TIMEOUT_MS < OAUTH_REFRESH_LOCK_OPTIONS.stale, reasonable floor (≥30s), ≥30s safety margin, stale >60s floor.paths-direct-import.test.tsoauth.test.ts,oauth.fallback-to-main-agent.test.ts,oauth.openai-codex-refresh-fallback.test.ts(pre-existing)Total: 119 tests in the auth-profiles scoped suite, all passing in isolation.
Files changed
src/agents/auth-profiles/path-resolve.tsresolveOAuthRefreshLockPath(provider, profileId)src/agents/auth-profiles/paths.tssrc/agents/auth-profiles/constants.tsOAUTH_REFRESH_LOCK_OPTIONS+OAUTH_REFRESH_CALL_TIMEOUT_MSsrc/agents/auth-profiles/oauth.tsisSafeToCopyOAuthIdentity,isSameOAuthIdentity(strict reference), normalizersFollow-ups explicitly out of scope
withRefreshCallTimeoutactually cancel the underlying HTTP work rather than just releasing locks. Documented as a limitation on the helper.cloneAuthProfileStoresnapshot-sync from its 58-file PR and stacking it on this.