Skip to content

Improve auth-expiry experience: retry refresh + non-nagging hooks#183

Merged
alexeyzimarev merged 3 commits into
mainfrom
fix/token-refresh-retry
Jun 26, 2026
Merged

Improve auth-expiry experience: retry refresh + non-nagging hooks#183
alexeyzimarev merged 3 commits into
mainfrom
fix/token-refresh-retry

Conversation

@alexeyzimarev

@alexeyzimarev alexeyzimarev commented Jun 26, 2026

Copy link
Copy Markdown
Member

Why

A user hit this on a Stop hook after their auth lapsed:

Stop hook error: Failed with non-blocking status code: Authentication token has expired. Run 'kcap login' to re-authenticate.
HTTP 401

Two separate problems were behind it, addressed in two commits.

1. Retry token refresh on transient failures

TokenStore refreshes tokens silently before every authenticated call, but the two refresh HTTP calls used a bare single-shot PostAsync and swallowed every exception to null. So a momentary transport blip (DNS stutter, connection reset, brief server slowness) degraded instantly to "Authentication token has expired. Run 'kcap login'" — even when the refresh credential was still perfectly valid.

Both calls now go through the existing PostWithRetryAsync helper with a short 5s budget:

  • RefreshGitHubAsync/auth/refresh
  • RefreshWorkOSAsync → WorkOS /user_management/authenticate

PostWithRetryAsync retries only transport failures, never non-success HTTP responses, so a genuinely-expired refresh token still returns fast (400/401null, no pointless retries). The short budget keeps a hook from blocking on the default 30s when the server is truly down. Adding retry introduces no new failure mode for WorkOS's rotate-on-use refresh tokens — that reuse window already existed across separate refresh calls, and the cross-process lock + re-read-under-lock remains the mitigation.

2. Make hook auth-expiry non-nagging and non-alarming

When auth has genuinely lapsed (refresh credential dead → re-login truly required), the Claude hook used to POST a request it knew would 401, print HTTP 401, and exit 1 — which Claude Code renders as a red "Stop hook error" banner on every turn until the user re-authenticates.

Recording is best-effort, so this now:

  • Skips the POST when auth is expired/missing (no point — it would 401).
  • Exits 0 — no red banner.
  • Nudges the user to re-login only on session-start (once per session), staying silent on the high-frequency stop / notification / subagent-* events.

Mechanism: HttpClientExtensions.CreateClientWithAuthStatusAsync now reports an AuthStatus (Ok / NoAuthRequired / Expired / NotAuthenticated) instead of swallowing it. The existing CreateAuthenticatedClientAsync is preserved as a thin wrapper (same stderr behaviour for interactive commands), so non-hook callers are unaffected.

The session-start nudge is emitted as a systemMessage JSON object on stdout, not stderr. This matters: on exit 0 Claude Code hides hook stderr entirely, and plain session-start stdout is injected into the model's context — systemMessage is the one channel that shows the notice to the user without leaking into context. (Verified against the Claude Code hooks docs.)

Scope

Hook change applied to the Claude Code path (the one reported). The other agent hooks (Codex, Cursor, Gemini, …) share the same CreateAuthenticatedClientAsync shape and can adopt CreateClientWithAuthStatusAsync as a fast follow-up.

Follow-up

Proactive background token refresh in the daemon is tracked separately in #182.

Testing

  • dotnet build — clean.
  • AOT publish — no IL3050/IL2026 warnings.
  • Unit tests — pass.
  • README: no change (no user-facing CLI surface change — internal behaviour only).

🤖 Generated with Claude Code

The two token-refresh calls (GitHub `/auth/refresh`, WorkOS
`/user_management/authenticate`) used a bare single-shot `PostAsync` and
swallowed every exception to `null`. A momentary transport blip — DNS
stutter, connection reset, brief server slowness — therefore degraded
instantly to "Authentication token has expired. Run 'kcap login'", even
though the refresh credential was still valid and the next call would
likely succeed. Users saw a red "Stop hook error" banner prompting a
re-login they did not actually need.

Route both calls through the existing `PostWithRetryAsync` helper with a
short 5s budget. It retries only transport failures, never non-success
HTTP responses, so a genuinely-expired refresh token still returns fast
(400/401 -> null, no pointless retries). The short budget keeps a hook
from blocking on the default 30s budget when the server is truly down.

Adding retry does not introduce a new failure mode for WorkOS's
rotate-on-use refresh tokens: the reuse window (re-sending a token the
server already rotated when the response was lost) already exists across
separate refresh calls, and the cross-process lock + re-read-under-lock
remains the mitigation.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@alexeyzimarev alexeyzimarev added enhancement New feature or request quality of life Improvement to user experience labels Jun 26, 2026
@qodo-code-review

Copy link
Copy Markdown

PR Summary by Qodo

Retry auth token refresh on transient failures
🐞 Bug fix 🕐 10-20 Minutes

Grey Divider

Description

• Add a 5s retry budget for token refresh HTTP calls.
• Route GitHub and WorkOS refresh POSTs through PostWithRetryAsync for resilience.
Diagram

graph TD
  TS["TokenStore"] --> PWR["PostWithRetryAsync (5s)"] --> GH{{"GitHub: /auth/refresh"}}
  PWR --> WO{{"WorkOS: /user_management/authenticate"}}
Loading
High-Level Assessment

The following are alternative approaches to this PR:

1. Adopt HttpClientFactory + resilience policies (e.g., Polly)
  • ➕ Centralized, consistent retry/timeout behavior across all HTTP calls
  • ➕ Easier to tune per-endpoint policies and add telemetry
  • ➖ Larger refactor and additional dependency/configuration surface
  • ➖ Risk of inadvertently changing behavior for unrelated HTTP paths
2. Add targeted catch/retry around PostAsync in TokenStore only
  • ➕ Minimal dependency/abstraction changes
  • ➕ Very explicit about which exceptions are retried
  • ➖ Duplicates retry logic already present in PostWithRetryAsync
  • ➖ Higher risk of policy drift (different retry semantics across callers)

Recommendation: Using the existing PostWithRetryAsync with a short, explicit refresh budget is the best tradeoff here: it improves resilience specifically for refresh flows without broad HTTP behavior changes, and it preserves the key constraint of not retrying non-success HTTP responses (so genuinely expired refresh tokens still fail fast).

Files changed (1) +18 / -3

Bug fix (1) +18 / -3
TokenStore.csRetry token refresh calls with a short 5s budget +18/-3

Retry token refresh calls with a short 5s budget

• Introduces a 5-second retry budget constant for refresh HTTP calls. Updates both GitHub (/auth/refresh) and WorkOS (/user_management/authenticate) refresh requests to use PostWithRetryAsync, with added rationale around transport-only retries and WorkOS refresh-token rotation behavior.

src/Capacitor.Cli.Core/Auth/TokenStore.cs

@qodo-code-review

qodo-code-review Bot commented Jun 26, 2026

Copy link
Copy Markdown

Code Review by Qodo

🐞 Bugs (0) 📘 Rule violations (0) 📎 Requirement gaps (0) 📜 Skill insights (0)

Context used

Grey Divider


Remediation recommended

1. Refresh can exit process ✓ Resolved 🐞 Bug ☼ Reliability
Description
TokenStore.RefreshGitHubAsync now calls PostWithRetryAsync, which invokes EnsureAbsolute() and will
call Environment.Exit(2) when the configured server URL is missing http/https. Because
GetValidTokensAsync is used from daemon/background paths, an expired-token refresh attempt can
abruptly terminate the daemon/hook process instead of failing gracefully.
Code

src/Capacitor.Cli.Core/Auth/TokenStore.cs[R237-241]

        try {
-            var response = await http.PostAsync($"{baseUrl}/auth/refresh", payload);
+            var response = await http.PostWithRetryAsync($"{baseUrl}/auth/refresh", payload, RefreshRetryBudget);

            if (!response.IsSuccessStatusCode) {
                return null;
Evidence
The modified refresh path calls PostWithRetryAsync, which always runs EnsureAbsolute() and can
terminate the process on scheme-less URLs; URL normalization does not enforce a scheme; and
GetValidTokensAsync is invoked by daemon flows, so this exit can occur in long-running processes.

src/Capacitor.Cli.Core/Auth/TokenStore.cs[227-260]
src/Capacitor.Cli.Core/HttpClientExtensions.cs[116-131]
src/Capacitor.Cli.Core/Config/AppConfig.cs[216-221]
src/Capacitor.Cli.Daemon/Services/ServerConnection.cs[67-76]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
`TokenStore.RefreshGitHubAsync` now uses `HttpClientExtensions.PostWithRetryAsync`, which calls `EnsureAbsolute(url)` and hard-exits the process (`Environment.Exit(2)`) if the URL is not an absolute http/https URL. This introduces a hard process-termination side effect into token refresh, which can be triggered from daemon/background flows calling `TokenStore.GetValidTokensAsync()`.

## Issue Context
- `AppConfig.NormalizeUrl()` does not add/validate a scheme, so legacy or user-provided scheme-less values (e.g. `localhost:5108`) can still flow into `baseUrl`.
- `EnsureAbsolute()` currently exits the process instead of throwing/returning an error, which bypasses normal error handling.

## Fix Focus Areas
- src/Capacitor.Cli.Core/Auth/TokenStore.cs[227-260]
- src/Capacitor.Cli.Core/HttpClientExtensions.cs[116-131]
- src/Capacitor.Cli.Core/Config/AppConfig.cs[216-221]

## Suggested fix approach
Choose one (prefer options that avoid process exit from library code):
1) **Local validation in TokenStore**: before calling `PostWithRetryAsync`, validate `baseUrl` with `HttpClientExtensions.IsAcceptableUrl(baseUrl)`; if invalid, return `null` (or throw and let callers handle) rather than calling a method that will `Environment.Exit`.
2) **Introduce a non-exiting retry helper**: add a variant like `PostWithRetryNoExitAsync` (or refactor `EnsureAbsolute` to throw) and use that inside `TokenStore`.

Add/adjust tests (if present) to cover scheme-less `server_url` during token refresh so the process does not exit unexpectedly.

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools


Grey Divider

Qodo Logo

When the access token is expired and can't be refreshed (re-login truly
required), the Claude hook used to POST a request it knew would 401, print
`HTTP 401`, and exit 1 — which Claude Code renders as a red "Stop hook
error" banner on *every* turn until the user runs `kcap login`. Recording
is best-effort, so this is needlessly alarming and repetitive.

Add `HttpClientExtensions.CreateClientWithAuthStatusAsync`, which reports an
`AuthStatus` (Ok / NoAuthRequired / Expired / NotAuthenticated) instead of
silently swallowing it. `CreateAuthenticatedClientAsync` stays as a thin
wrapper with the same stderr behaviour, so interactive callers are
unchanged.

The Claude hook now short-circuits when auth has lapsed: it skips the
doomed POST and exits 0 (no banner), and nudges the user to re-login only
on session-start — once per session — instead of on the high-frequency
stop/notification/subagent events. The nudge is emitted as a
`systemMessage` JSON object so Claude Code shows it to the user without
injecting it into the model's context (plain session-start stdout would
leak into context, and on exit 0 stderr is hidden entirely).

Scope: applied to the Claude Code hook path (the one reported). Other agent
hooks share the same client shape and can adopt the status-aware variant as
a follow-up.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Routing the GitHub refresh through PostWithRetryAsync (prior commit) brought
in EnsureAbsolute, which calls Environment.Exit(2) on a non-absolute URL.
RefreshGitHubAsync is reached from daemon/background callers via
GetValidTokensAsync (e.g. ServerConnection, StatusCommand), so a legacy
scheme-less server_url during an expired-token refresh would abruptly
terminate the process instead of failing gracefully — a regression from the
old bare PostAsync, which was wrapped in catch -> null.

Validate the URL with IsAcceptableUrl before the retry call and return null
when it's not absolute http/https, preserving refresh's graceful-failure
contract. The hook entry paths still EnsureAbsolute-and-exit by design; this
guard only covers the library refresh call. WorkOS refresh is unaffected (it
posts to a hardcoded absolute URL).

Found by Qodo review on #183.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@alexeyzimarev

Copy link
Copy Markdown
Member Author

Addressed the Qodo finding ("Refresh can exit process") in 75c711f.

RefreshGitHubAsync now validates the URL with IsAcceptableUrl before calling PostWithRetryAsync, returning null on a scheme-less server_url instead of letting EnsureAbsoluteEnvironment.Exit(2) terminate the process. This preserves the graceful-failure contract for the daemon/background callers that reach refresh via GetValidTokensAsync (ServerConnection, StatusCommand). The hook entry paths still EnsureAbsolute-and-exit by design; WorkOS refresh was never affected (hardcoded absolute URL).

Verified: build clean, no IL3050/IL2026 warnings, 1759/1759 unit tests pass.

@alexeyzimarev alexeyzimarev merged commit 27083c1 into main Jun 26, 2026
5 checks passed
@alexeyzimarev alexeyzimarev deleted the fix/token-refresh-retry branch June 26, 2026 09:09
alexeyzimarev pushed a commit that referenced this pull request Jul 5, 2026
* feat(hooks): apply auth-expiry handling to the non-Claude agent hooks

Follow-up to #183, which taught the Claude hook to skip the doomed POST and stop nagging when
auth has lapsed. The other agent hooks all still built their client with
CreateAuthenticatedClientAsync and POSTed blindly, so an expired/absent token meant a
guaranteed-to-401 POST plus a misleading per-turn "[kcap] <agent>-hook ...: HTTP 401" stderr line.

- New AgentHookPoster shared helper: builds the client via CreateClientWithAuthStatusAsync and
  returns HookPostOutcome { Posted, AuthLapsed, Failed }. On AuthLapsed it skips the POST (no
  request, no stderr); Failed keeps the prior stderr line + exit code; Posted is unchanged.
- Codex, Gemini, Copilot, Pi, Kiro, OpenCode: PostHookAsync delegates to the helper. Session-start
  handlers skip the watcher and exit cleanly on AuthLapsed (Codex still emits its required
  {"continue":true} stdout); Failed/Posted behaviour is unchanged.
- Gemini/Copilot per-turn Notification paths and Cursor's HandleInternal use IsAuthLapsed to skip
  (Cursor also skips the spool drain, so a 401 can't Drop the backlog — it replays after re-login).
- No user-facing re-login nudge: none of these agents has a safe stdout notice channel (all emit
  nothing or a JSON decision/context channel), so per the issue we soft-exit silently; the expired
  state is surfaced by `kcap status`.

No-op for the None provider; unchanged when authenticated. Adds AgentHookPosterTests.

Closes #184
Linear: AI-993

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* fix(hooks): dispose the hook POST response in AgentHookPoster

PostWithRetryAsync's HttpResponseMessage was stored in a local and never disposed; wrap it in `using var` so response streams/connections are released on every path (these hooks run per turn/session). Matches the disposal pattern used elsewhere for PostWithRetryAsync responses.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement New feature or request quality of life Improvement to user experience

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant