Skip to content

Remove token-refresher sidecar, generate tokens in-process#971

Merged
gjkim42 merged 3 commits intomainfrom
remove-token-refresher
Apr 12, 2026
Merged

Remove token-refresher sidecar, generate tokens in-process#971
gjkim42 merged 3 commits intomainfrom
remove-token-refresher

Conversation

@gjkim42
Copy link
Copy Markdown
Collaborator

@gjkim42 gjkim42 commented Apr 12, 2026

What type of PR is this?

/kind cleanup

What this PR does / why we need it:

Removes the kelos-token-refresher sidecar binary entirely. Spawner and ghproxy pods now generate GitHub App installation tokens in-process using a shared githubapp.TokenProvider, eliminating an extra container, two volumes, and a separate Docker image from every GitHub App pod.

Key changes:

  • internal/githubapp: New TokenProvider — thread-safe, caches tokens with a 5-minute expiry margin, accepts *TokenClient and *Credentials at the constructor level (no env-var reads inside the package)
  • Deployment builders: Replace sidecar + EmptyDir + Secret volume with GITHUB_APP_ID, GITHUB_APP_INSTALLATION_ID, GITHUB_APP_PRIVATE_KEY env vars injected from the workspace secret
  • Spawner / ghproxy / webhook-server: Accept GitHub App credentials as command-line flags (--github-app-id, etc.) defaulting to os.Getenv(…) for backward-compatible K8s deployments. Token resolver is created in main() and passed down — no credential reads inside library packages
  • internal/reporting: GitHubReporter.TokenFile replaced with TokenFunc func() string for dynamic token resolution without filesystem I/O
  • Deleted: cmd/kelos-token-refresher/, its Dockerfile, all --token-refresher-* controller flags, Helm values, and CLI install options
  • Upgrade path: Controller still reconciles InitContainers / Volumes diffs so existing deployments are cleaned up automatically

Which issue(s) this PR is related to:

N/A

Special notes for your reviewer:

  • The GITHUB_API_BASE_URL env var is also injected for GHES so the TokenProvider calls the correct API endpoint
  • Integration / e2e tests for the sidecar are rewritten to verify env-var injection instead

Does this PR introduce a user-facing change?

Action Required: The `--token-refresher-resource-requests`, `--token-refresher-resource-limits`, and `--token-refresher-image` flags are removed. Remove these flags from any automation scripts or custom controller manifests before upgrading.

Remove the token-refresher sidecar container from spawner and ghproxy pods. GitHub App installation tokens are now generated in-process, reducing per-pod resource usage and simplifying pod specs. Existing deployments are migrated automatically on the next controller reconcile.

🤖 Generated with Claude Code


Summary by cubic

Removed the kelos-token-refresher sidecar and now generate GitHub App tokens in-process for spawner, ghproxy, and the webhook. This simplifies deployments, cuts per-pod resources, and makes token refresh request-aware with a 10s default HTTP timeout.

  • New Features

    • Added internal/githubapp.TokenProvider (thread-safe; caches with a 5‑min expiry margin; falls back to cached token on refresh errors; NewTokenClient uses a 10s timeout).
    • ghproxy token resolver now accepts context.Context, so refresh is bound to each request.
    • Spawner, ghproxy, and webhook accept PATs via --github-token and GitHub App creds via flags/env; a resolver is built in main() (secrets read after flag parsing and not shown in --help).
    • Replaced file-based tokens with TokenFunc in internal/reporting.GitHubReporter; spawner errors if reporting is enabled without a token resolver.
  • Migration

    • Deleted cmd/kelos-token-refresher and removed all --token-refresher-* controller/CLI options and Helm values (images, docs, and dev workflow updated).
    • Removed --github-token-file across spawner, ghproxy, and webhook. Use --github-token or GitHub App flags/env (GITHUB_APP_ID, GITHUB_APP_INSTALLATION_ID, GITHUB_APP_PRIVATE_KEY); for GHES set --github-api-base-url.
    • Upgrades are automatic: the controller reconciles and removes stale initContainers and volumes.

Written for commit 7883042. Summary will update on new commits.

Copy link
Copy Markdown

@cubic-dev-ai cubic-dev-ai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

9 issues found across 28 files

Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="cmd/kelos-spawner/main_test.go">

<violation number="1" location="cmd/kelos-spawner/main_test.go:2254">
P2: This injected `TokenResolver` bypasses the env-var fallback, so the test no longer verifies `GITHUB_TOKEN`-based reporting auth.</violation>
</file>

<file name="cmd/kelos-webhook-server/main.go">

<violation number="1" location="cmd/kelos-webhook-server/main.go:55">
P1: Do not use secret env vars as `flag` defaults; they can be echoed in help/usage output and leak the token or private key.</violation>
</file>

<file name="cmd/kelos-spawner/reconciler.go">

<violation number="1" location="cmd/kelos-spawner/reconciler.go:81">
P2: Don't silently skip reporting when reporting is enabled but no token resolver is configured.</violation>

<violation number="2" location="cmd/kelos-spawner/reconciler.go:93">
P2: Returning an empty token here turns resolver failures into repeated unauthenticated GitHub write attempts for every task.</violation>
</file>

<file name="internal/controller/workspace_controller_test.go">

<violation number="1" location="internal/controller/workspace_controller_test.go:91">
P2: This test now covers only `GITHUB_APP_ID`, so it can miss regressions where the other required GitHub App env vars are absent or the deleted token-file wiring is still present.

(Based on your team's feedback about adding or extending tests for new behavior.) [FEEDBACK_USED]</violation>
</file>

<file name="cmd/kelos-spawner/main.go">

<violation number="1" location="cmd/kelos-spawner/main.go:202">
P2: Check `ts.Spec.Suspend` before resolving a GitHub token. As written, suspended GitHub-backed spawners can still mint tokens and fail the cycle instead of skipping cleanly.</violation>

<violation number="2" location="cmd/kelos-spawner/main.go:665">
P1: Do not return a nil resolver on invalid GitHub App credentials. That silently falls back to unauthenticated GitHub requests instead of failing fast on a bad secret.</violation>
</file>

<file name="cmd/ghproxy/main.go">

<violation number="1" location="cmd/ghproxy/main.go:388">
P2: Bound GitHub App token generation with a timeout. Using `context.Background()` here lets a stalled token endpoint hang proxied requests indefinitely.</violation>

<violation number="2" location="cmd/ghproxy/main.go:410">
P1: Avoid using env-derived defaults for secret flags; Go's usage output will print those values when help/usage is shown.</violation>
</file>

Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.

Comment thread cmd/kelos-webhook-server/main.go Outdated
Comment thread cmd/kelos-spawner/main.go Outdated
Comment thread cmd/ghproxy/main.go Outdated
Comment thread cmd/kelos-spawner/main_test.go Outdated
Comment thread cmd/kelos-spawner/reconciler.go
Comment thread cmd/kelos-spawner/reconciler.go Outdated
Comment thread internal/controller/workspace_controller_test.go
Comment thread cmd/kelos-spawner/main.go
}

src, err := buildSourceWithProxy(&ts, githubOwner, githubRepo, ghProxyURL, githubAPIBaseURL, githubTokenFile, jiraBaseURL, jiraProject, jiraJQL, httpClient)
src, err := buildSourceWithProxy(ctx, &ts, githubOwner, githubRepo, ghProxyURL, githubAPIBaseURL, tokenResolver, jiraBaseURL, jiraProject, jiraJQL, httpClient)
Copy link
Copy Markdown

@cubic-dev-ai cubic-dev-ai bot Apr 12, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2: Check ts.Spec.Suspend before resolving a GitHub token. As written, suspended GitHub-backed spawners can still mint tokens and fail the cycle instead of skipping cleanly.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At cmd/kelos-spawner/main.go, line 202:

<comment>Check `ts.Spec.Suspend` before resolving a GitHub token. As written, suspended GitHub-backed spawners can still mint tokens and fail the cycle instead of skipping cleanly.</comment>

<file context>
@@ -170,27 +179,27 @@ func runReportingCycle(ctx context.Context, cl client.Client, key types.Namespac
 	}
 
-	src, err := buildSourceWithProxy(&ts, githubOwner, githubRepo, ghProxyURL, githubAPIBaseURL, githubTokenFile, jiraBaseURL, jiraProject, jiraJQL, httpClient)
+	src, err := buildSourceWithProxy(ctx, &ts, githubOwner, githubRepo, ghProxyURL, githubAPIBaseURL, tokenResolver, jiraBaseURL, jiraProject, jiraJQL, httpClient)
 	if err != nil {
 		return fmt.Errorf("building source: %w", err)
</file context>
Fix with Cubic

Comment thread cmd/ghproxy/main.go Outdated
Copy link
Copy Markdown

@cubic-dev-ai cubic-dev-ai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

2 issues found across 8 files (changes from recent commits).

Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="internal/githubapp/token.go">

<violation number="1" location="internal/githubapp/token.go:248">
P1: Recompute the expiry check after the refresh attempt fails; using the pre-request timestamp can return an already expired cached token.</violation>
</file>

<file name="internal/githubapp/token_test.go">

<violation number="1" location="internal/githubapp/token_test.go:392">
P3: This test never verifies that the refresh was attempted, so an implementation that incorrectly skips the refresh path would still pass.</violation>
</file>

Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.

Comment thread internal/githubapp/token.go
Comment thread internal/githubapp/token_test.go
@gjkim42 gjkim42 force-pushed the remove-token-refresher branch 3 times, most recently from 271170d to 0b7b81e Compare April 12, 2026 12:39
gjkim42 added 3 commits April 12, 2026 22:06
Replace the token-refresher sidecar binary with in-process GitHub App
token generation using a shared TokenProvider. This eliminates an entire
binary, simplifies pod specs, and saves resources on every GitHub App
spawner/ghproxy pod.

- Add githubapp.TokenProvider: thread-safe, caches tokens with 5-min
  expiry margin, falls back to cached token on transient refresh errors,
  10-second default HTTP timeout on the token client
- Deployment builders inject GITHUB_APP_* env vars from the workspace
  secret instead of running a sidecar with shared volumes
- Spawner, ghproxy, and webhook server accept GitHub App credentials as
  command-line flags (with env var fallback for K8s compatibility);
  secrets are never exposed in flag defaults or -help output
- Replace GitHubReporter.TokenFile with TokenFunc for dynamic token
  resolution without filesystem I/O
- ghproxy token resolver accepts request context so refresh is bounded
  by request lifetime
- Error explicitly when GitHub reporting is enabled but no token
  resolver is configured
- Delete cmd/kelos-token-refresher, its Dockerfile, and all associated
  controller flags, Helm values, and CLI options
- Retain init container / volume reconciliation to clean up existing
  deployments during upgrade
@gjkim42 gjkim42 force-pushed the remove-token-refresher branch from 0b7b81e to 7883042 Compare April 12, 2026 13:17
@gjkim42 gjkim42 merged commit 4c4b9a5 into main Apr 12, 2026
19 checks passed
@gjkim42 gjkim42 deleted the remove-token-refresher branch April 12, 2026 13:28
kelos-bot bot pushed a commit that referenced this pull request Apr 12, 2026
…riable

Based on review feedback from recent PRs:

- PR #971: code review flagged P1 security issue where os.Getenv() for
  secrets was used as flag.StringVar default in two binaries. Go's flag
  package prints defaults in usage output, leaking secret values. Added
  as convention to AGENTS.md, agentconfig, workers config, and reviewer
  security checklist.

- PR #965: human reviewer flagged that .Branch template variable is empty
  for issue_comment events from issues (not PRs). Clarified the template
  variable table in self-development/README.md.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
kelos-bot bot pushed a commit that referenced this pull request Apr 14, 2026
…eedback

Adds conventions learned from recent PR reviews:

1. Fail fast on invalid configuration (PR #971): three P1 and four P2
   issues flagged silent degradation when credentials or config were
   invalid, falling back to unauthenticated requests instead of
   erroring.

2. No manual PR branch checkout in TaskSpawner prompts (PR #974): Kelos
   already checks out the PR branch automatically; manual checkout
   instructions are redundant and confusing.

Also carries forward the previously proposed changes from PR #786:
- os.Getenv() secret-in-flag-defaults convention (PR #971)
- TaskSpawner creation conventions (PR #965)
- Branch template variable documentation fix (PR #965)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant