Improve auth-expiry experience: retry refresh + non-nagging hooks #183
The two token-refresh calls (GitHub `/auth/refresh`, WorkOS `/user_management/authenticate`) used a bare single-shot `PostAsync` and swallowed every exception to `null`. A momentary transport blip (DNS stutter, connection reset, brief server slowness) therefore degraded instantly to "Authentication token has expired. Run 'kcap login'", even though the refresh credential was still valid and the next call would likely succeed. Users saw a red "Stop hook error" banner prompting a re-login they did not actually need.

Route both calls through the existing `PostWithRetryAsync` helper with a short 5s budget. It retries only transport failures, never non-success HTTP responses, so a genuinely expired refresh token still returns fast (400/401 -> null, no pointless retries). The short budget keeps a hook from blocking on the default 30s budget when the server is truly down.

Adding retry does not introduce a new failure mode for WorkOS's rotate-on-use refresh tokens: the reuse window (re-sending a token the server already rotated when the response was lost) already exists across separate refresh calls, and the cross-process lock + re-read-under-lock remains the mitigation.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
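The retry behaviour described above can be sketched roughly as follows. This is not the project's `PostWithRetryAsync`, whose implementation is not shown on this page; the class `RetrySketch`, the method `PostWithBudgetAsync`, its parameters, and the fixed delay between attempts are all assumptions made for illustration.

```csharp
#nullable enable
using System;
using System.Net.Http;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical sketch; not the project's PostWithRetryAsync.
public static class RetrySketch
{
    // Calls `send` until it yields an HTTP response or the budget runs out.
    // Only transport failures (HttpRequestException) are retried; any HTTP
    // response, including 400/401, is returned to the caller immediately.
    // Returns null when the budget is exhausted or the call is cancelled.
    public static async Task<HttpResponseMessage?> PostWithBudgetAsync(
        Func<CancellationToken, Task<HttpResponseMessage>> send,
        TimeSpan budget,
        TimeSpan delayBetweenAttempts)
    {
        using var cts = new CancellationTokenSource(budget);
        while (true)
        {
            try
            {
                return await send(cts.Token);
            }
            catch (HttpRequestException)
            {
                // Transport failure: fall through to the delay, then retry.
            }
            catch (OperationCanceledException)
            {
                // Budget elapsed (or the client's own timeout fired): give up.
                return null;
            }

            try
            {
                await Task.Delay(delayBetweenAttempts, cts.Token);
            }
            catch (OperationCanceledException)
            {
                return null;
            }
        }
    }
}
```

A refresh caller might pass `ct => client.PostAsync(url, new StringContent(body), ct)` with `TimeSpan.FromSeconds(5)` as the budget, building the content inside the lambda so each attempt gets its own body, and treat a `null` result or a non-success status as "refresh failed".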
When the access token is expired and can't be refreshed (re-login truly required), the Claude hook used to POST a request it knew would 401, print `HTTP 401`, and exit 1, which Claude Code renders as a red "Stop hook error" banner on *every* turn until the user runs `kcap login`. Recording is best-effort, so this is needlessly alarming and repetitive.

Add `HttpClientExtensions.CreateClientWithAuthStatusAsync`, which reports an `AuthStatus` (Ok / NoAuthRequired / Expired / NotAuthenticated) instead of silently swallowing it. `CreateAuthenticatedClientAsync` stays as a thin wrapper with the same stderr behaviour, so interactive callers are unchanged.

The Claude hook now short-circuits when auth has lapsed: it skips the doomed POST and exits 0 (no banner), and nudges the user to re-login only on session-start, once per session, instead of on the high-frequency stop/notification/subagent events. The nudge is emitted as a `systemMessage` JSON object so Claude Code shows it to the user without injecting it into the model's context (plain session-start stdout would leak into context, and on exit 0 stderr is hidden entirely).

Scope: applied to the Claude Code hook path (the one reported). Other agent hooks share the same client shape and can adopt the status-aware variant as a follow-up.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
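A minimal sketch of the short-circuit described above, under stated assumptions: the `AuthStatus` values are the ones named in the commit message, but `HookAction`, `Decide`, `BuildNudgeJson`, and the decision to treat both `Expired` and `NotAuthenticated` as "lapsed" are hypothetical names and choices made for illustration, not the project's code. The JSON is built with `Utf8JsonWriter` rather than reflection-based serialization.

```csharp
using System;
using System.IO;
using System.Text;
using System.Text.Json;

// Status values as named in the commit message.
public enum AuthStatus { Ok, NoAuthRequired, Expired, NotAuthenticated }

// Hypothetical: what the hook should do for one event.
public enum HookAction { Post, SkipSilently, SkipWithNudge }

public static class HookAuthSketch
{
    // Assumption: Expired and NotAuthenticated both count as "auth lapsed".
    public static HookAction Decide(AuthStatus status, string hookEvent)
    {
        if (status is AuthStatus.Ok or AuthStatus.NoAuthRequired)
            return HookAction.Post;

        // Nudge only on session-start; stay silent on per-turn events.
        return hookEvent == "session-start"
            ? HookAction.SkipWithNudge
            : HookAction.SkipSilently;
    }

    // Builds a one-field JSON object: {"systemMessage":"..."}.
    public static string BuildNudgeJson(string message)
    {
        using var stream = new MemoryStream();
        using (var writer = new Utf8JsonWriter(stream))
        {
            writer.WriteStartObject();
            writer.WriteString("systemMessage", message);
            writer.WriteEndObject();
        } // disposing the writer flushes it into the stream
        return Encoding.UTF8.GetString(stream.ToArray());
    }
}
```

In both skip cases the hook would exit 0; only `SkipWithNudge` writes the JSON object to stdout first.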
Routing the GitHub refresh through `PostWithRetryAsync` (prior commit) brought in `EnsureAbsolute`, which calls `Environment.Exit(2)` on a non-absolute URL. `RefreshGitHubAsync` is reached from daemon/background callers via `GetValidTokensAsync` (e.g. `ServerConnection`, `StatusCommand`), so a legacy scheme-less `server_url` during an expired-token refresh would abruptly terminate the process instead of failing gracefully. That is a regression from the old bare `PostAsync`, which was wrapped in catch -> null.

Validate the URL with `IsAcceptableUrl` before the retry call and return null when it's not absolute http/https, preserving refresh's graceful-failure contract. The hook entry paths still `EnsureAbsolute`-and-exit by design; this guard only covers the library refresh call. WorkOS refresh is unaffected (it posts to a hardcoded absolute URL).

Found by Qodo review on #183.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
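The guard described above amounts to an "absolute http/https" check made before any network call. The project's `IsAcceptableUrl` is not shown on this page, so the helper below, `IsAbsoluteHttpUrl`, is only an assumed shape of such a check, built on `Uri.TryCreate`.

```csharp
#nullable enable
using System;

// Hypothetical sketch; not the project's IsAcceptableUrl.
public static class UrlGuardSketch
{
    // True only for absolute URLs whose scheme is http or https.
    // Scheme-less values such as "example.com/api" fail to parse as absolute,
    // and other schemes (ftp, file, ...) are rejected by the scheme check.
    public static bool IsAbsoluteHttpUrl(string? url) =>
        Uri.TryCreate(url, UriKind.Absolute, out var uri)
        && (uri.Scheme == Uri.UriSchemeHttp || uri.Scheme == Uri.UriSchemeHttps);
}
```

A refresh method would return `null` early when the check fails, so a malformed `server_url` surfaces as a failed refresh rather than a process exit.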
Addressed the Qodo finding ("Refresh can exit process") in 75c711f.
Verified: build clean, no IL3050/IL2026 warnings, 1759/1759 unit tests pass.
* feat(hooks): apply auth-expiry handling to the non-Claude agent hooks

  Follow-up to #183, which taught the Claude hook to skip the doomed POST and stop nagging when auth has lapsed. The other agent hooks all still built their client with CreateAuthenticatedClientAsync and POSTed blindly, so an expired/absent token meant a guaranteed-to-401 POST plus a misleading per-turn "[kcap] <agent>-hook ...: HTTP 401" stderr line.

  - New AgentHookPoster shared helper: builds the client via CreateClientWithAuthStatusAsync and returns HookPostOutcome { Posted, AuthLapsed, Failed }. On AuthLapsed it skips the POST (no request, no stderr); Failed keeps the prior stderr line + exit code; Posted is unchanged.
  - Codex, Gemini, Copilot, Pi, Kiro, OpenCode: PostHookAsync delegates to the helper. Session-start handlers skip the watcher and exit cleanly on AuthLapsed (Codex still emits its required {"continue":true} stdout); Failed/Posted behaviour is unchanged.
  - Gemini/Copilot per-turn Notification paths and Cursor's HandleInternal use IsAuthLapsed to skip (Cursor also skips the spool drain, so a 401 can't Drop the backlog; it replays after re-login).
  - No user-facing re-login nudge: none of these agents has a safe stdout notice channel (all emit nothing or a JSON decision/context channel), so per the issue we soft-exit silently; the expired state is surfaced by `kcap status`.

  No-op for the None provider; unchanged when authenticated.

  Adds AgentHookPosterTests.

  Closes #184
  Linear: AI-993

  Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* fix(hooks): dispose the hook POST response in AgentHookPoster

  PostWithRetryAsync's HttpResponseMessage was stored in a local and never disposed; wrap it in `using var` so response streams/connections are released on every path (these hooks run per turn/session). Matches the disposal pattern used elsewhere for PostWithRetryAsync responses.

  Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
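The three-way outcome and the `using var` disposal described in these two commits might look roughly like the sketch below. Only the `HookPostOutcome` member names come from the commit message; the class `AgentHookPosterSketch`, the `authLapsed` flag (standing in for the status reported by `CreateClientWithAuthStatusAsync`), and the `post` delegate (standing in for the `PostWithRetryAsync` call) are assumptions, and the stderr line that the real `Failed` path keeps is omitted.

```csharp
#nullable enable
using System;
using System.Net.Http;
using System.Threading.Tasks;

// Member names as given in the commit message.
public enum HookPostOutcome { Posted, AuthLapsed, Failed }

// Hypothetical sketch; not the project's AgentHookPoster.
public static class AgentHookPosterSketch
{
    public static async Task<HookPostOutcome> PostAsync(
        bool authLapsed,
        Func<Task<HttpResponseMessage?>> post)
    {
        // Auth lapsed: skip the request entirely (no POST, no stderr).
        if (authLapsed)
            return HookPostOutcome.AuthLapsed;

        // `using var` disposes the response on every return path;
        // a null response is allowed and is simply not disposed.
        using var response = await post();

        if (response is null || !response.IsSuccessStatusCode)
            return HookPostOutcome.Failed;

        return HookPostOutcome.Posted;
    }
}
```

A session-start handler could then branch on the outcome: exit cleanly on `AuthLapsed`, and keep its existing handling for `Failed` and `Posted`.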
Why
A user hit this on a Stop hook after their auth lapsed:
Two separate problems were behind it, addressed in two commits.
1. Retry token refresh on transient failures
`TokenStore` refreshes tokens silently before every authenticated call, but the two refresh HTTP calls used a bare single-shot `PostAsync` and swallowed every exception to `null`. So a momentary transport blip (DNS stutter, connection reset, brief server slowness) degraded instantly to "Authentication token has expired. Run 'kcap login'", even when the refresh credential was still perfectly valid.

Both calls now go through the existing `PostWithRetryAsync` helper with a short 5s budget:

- `RefreshGitHubAsync` → `/auth/refresh`
- `RefreshWorkOSAsync` → WorkOS `/user_management/authenticate`

`PostWithRetryAsync` retries only transport failures, never non-success HTTP responses, so a genuinely expired refresh token still returns fast (`400`/`401` → `null`, no pointless retries). The short budget keeps a hook from blocking on the default 30s when the server is truly down. Adding retry introduces no new failure mode for WorkOS's rotate-on-use refresh tokens: that reuse window already existed across separate refresh calls, and the cross-process lock + re-read-under-lock remains the mitigation.

2. Make hook auth-expiry non-nagging and non-alarming
When auth has genuinely lapsed (refresh credential dead → re-login truly required), the Claude hook used to POST a request it knew would `401`, print `HTTP 401`, and exit `1`, which Claude Code renders as a red "Stop hook error" banner on every turn until the user re-authenticates.

Recording is best-effort, so this now:

- skips the doomed POST and exits `0` (no banner);
- nudges the user to re-login only on `session-start` (once per session), staying silent on the high-frequency `stop` / `notification` / `subagent-*` events.

Mechanism:
`HttpClientExtensions.CreateClientWithAuthStatusAsync` now reports an `AuthStatus` (`Ok` / `NoAuthRequired` / `Expired` / `NotAuthenticated`) instead of swallowing it. The existing `CreateAuthenticatedClientAsync` is preserved as a thin wrapper (same stderr behaviour for interactive commands), so non-hook callers are unaffected.

The session-start nudge is emitted as a `systemMessage` JSON object on stdout, not stderr. This matters: on exit 0 Claude Code hides hook stderr entirely, and plain `session-start` stdout is injected into the model's context; `systemMessage` is the one channel that shows the notice to the user without leaking into context. (Verified against the Claude Code hooks docs.)

Scope
Hook change applied to the Claude Code path (the one reported). The other agent hooks (Codex, Cursor, Gemini, …) share the same `CreateAuthenticatedClientAsync` shape and can adopt `CreateClientWithAuthStatusAsync` as a fast follow-up.

Follow-up
Proactive background token refresh in the daemon is tracked separately in #182.
Testing
`dotnet build`: clean.

🤖 Generated with Claude Code