API: Add Workspace.spec.setupCommand for pre-agent setup#1056
Conversation
|
/kelos review |
There was a problem hiding this comment.
🤖 Kelos Reviewer Agent @gjkim42
Review Summary
Verdict: COMMENT (would be APPROVE — self-author blocks the approval event)
Overall correctness: patch is correct
Scope: Adds Workspace.spec.setupCommand ([]string, exec-form, MinItems=1), JSON-encoded into KELOS_SETUP_COMMAND and executed by each reference entrypoint via node spawnSync after init containers and before the agent starts; non-zero exit aborts the Task with banner markers on stderr.
Findings Overview
| Priority | Count | File:Line | Summary |
|---|---|---|---|
| P0 | 0 | — | none |
| P1 | 0 | — | none |
| P2 | 1 | docs/agent-image-interface.md:49-51 | Reserved-name claim is only enforced when the workspace populates the env var; PodOverrides.Env can still set KELOS_SETUP_COMMAND when setupCommand is unset |
| P3 | 0 | — | none |
Findings
Correctness / Defense-in-depth
- [P2]
docs/agent-image-interface.md:49-51(andinternal/controller/job_builder.go:599-611) — see inline comment. The PR's new "reserved names ... are dropped" claim is conditional on the controller having already populated the env var; whenWorkspace.spec.setupCommandis unset, a user-suppliedPodOverrides.Env: KELOS_SETUP_COMMANDflows through to the entrypoint and is executed. Not a privilege escalation today (task.Spec.Imagealready grants arbitrary code execution as the agent UID), but the new field turns an env var into a code-execution surface, so the documented invariant is worth tightening with a static reserved-name list.
Suggestions (optional)
- [P3]
internal/controller/job_builder_test.go— consider adding a complementary case toTestBuildAgentJob_SetupCommandPodOverrideIsDroppedwhere the workspace has nosetupCommandbutPodOverrides.EnvsetsKELOS_SETUP_COMMAND. Today that injection would silently succeed; a test pinning the desired behavior makes the invariant unambiguous. - [P3] The
node -esetup-runner block is duplicated verbatim across all five entrypoints. Acceptable and consistent with howKELOS_MCP_SERVERSdecoding is handled today, but a small/kelos/kelos-run-setuphelper colocated withkelos-capturewould centralize future fixes (e.g., signal forwarding, timeout) without touching five files.
Key takeaways
- API shape, ordering docs,
MinItems=1, and built-in-wins test coverage all match the API review feedback. - The reserved-name documentation introduced in this PR overstates the controller's protection; one static-list change in
job_builder.gowould make the docs literally true.
|
|
||
| > The names listed in this table are reserved. `PodOverrides.Env` entries that | ||
| > reuse a reserved name are dropped so the built-in value always wins; do not | ||
| > set them manually. |
There was a problem hiding this comment.
[P2] This new note ("PodOverrides.Env entries that reuse a reserved name are dropped so the built-in value always wins") overstates the actual filter at internal/controller/job_builder.go:599-611, which only drops collisions against names already populated on mainContainer.Env. When Workspace.spec.setupCommand is empty the controller does not append KELOS_SETUP_COMMAND, so a PodOverrides.Env: KELOS_SETUP_COMMAND=["sh","-c","curl ..."] entry passes through to the entrypoint and gets executed before the agent. For pre-existing reserved names (e.g. KELOS_MODEL) the asymmetry is harmless — the user value just becomes the effective config — but KELOS_SETUP_COMMAND turns an env var into a code-execution surface. Not a privilege escalation today (task.Spec.Image already grants the same blast radius), but worth closing the gap by maintaining a static reserved-name set that is filtered unconditionally, so the documented invariant is literally true and future image-pinning admission policies don't leave this path open.
gjkim42
left a comment
There was a problem hiding this comment.
Add an e2e test for this new feature.
/kelos pick-up
58bb6b7 to
1d1a5cc
Compare
Greptile SummaryThis PR adds Confidence Score: 5/5Safe to merge — no new blocking issues found; prior review threads cover the only meaningful feedback (signal-kill diagnostic and e2e CreateSecret idempotency). All remaining open items are P2 (signal-name propagation for better diagnostics, e2e CreateSecret cleanup) and were already flagged in prior threads. The security-critical path — reservedEnvNames blocking KELOS_SETUP_COMMAND injection via PodOverrides — is correctly implemented and tested for all four cases. No new P0/P1 defects found. No files require special attention beyond the prior thread items.
|
| Filename | Overview |
|---|---|
| internal/controller/job_builder.go | Adds KELOS_SETUP_COMMAND env var injection from workspace.SetupCommand and the reservedEnvNames guard that blocks PodOverrides from forging it; ordering relative to builtinNames is correct. |
| claude-code/kelos_entrypoint.sh | Adds the setup command block (node spawnSync, banners, exit propagation) identically across all five entrypoints; signal-kill masking already flagged in prior review thread. |
| api/v1alpha1/workspace_types.go | Adds SetupCommand []string with +optional / +kubebuilder:validation:MinItems=1 markers; deepcopy generated correctly. |
| internal/controller/job_builder_test.go | Four new unit tests cover: happy path, no-setupCommand, PodOverride-override-drop (with workspace setupCommand), and static-list override-drop (without workspace setupCommand) — full coverage of the new logic paths. |
| test/e2e/setup_command_test.go | Two e2e tests for success and failure paths; duplicate CreateSecret("claude-credentials") across both It blocks was flagged in a prior review thread. |
| docs/agent-image-interface.md | Documents KELOS_SETUP_COMMAND in the reserved-names table and adds a new section 6 describing pre-agent setup command semantics. |
Sequence Diagram
sequenceDiagram
participant K as Kubernetes API
participant C as Controller (JobBuilder)
participant I as Init Containers
participant E as Entrypoint (kelos_entrypoint.sh)
participant S as setupCommand process
participant A as Agent (claude/codex/…)
K->>C: Reconcile Task + Workspace
C->>C: json.Marshal(setupCommand) → KELOS_SETUP_COMMAND
C->>C: Drop KELOS_SETUP_COMMAND from PodOverrides.Env (reservedEnvNames)
C->>K: Create Job with KELOS_SETUP_COMMAND env var
I->>I: git clone → checkout Ref → configure remotes → write Files
E->>E: Check KELOS_SETUP_COMMAND
alt setupCommand defined
E->>E: Print KELOS_SETUP_COMMAND_START
E->>S: node spawnSync(cmd[0], cmd.slice(1))
alt exit 0
S-->>E: status=0
E->>E: Print KELOS_SETUP_COMMAND_DONE
E->>A: exec agent
else exit non-zero
S-->>E: status≠0
E->>E: Print KELOS_SETUP_COMMAND_FAILED exit=N
E->>K: exit(N) → Task → Failed
end
else no setupCommand
E->>A: exec agent directly
end
Reviews (2): Last reviewed commit: "feat(api): add Workspace.spec.setupComma..." | Re-trigger Greptile
|
/kelos squash-commits |
1d1a5cc to
6d81934
Compare
|
Squash complete. 🤖 Kelos Agent @gjkim42 Rebased on
Force-pushed to the branch. PR description and release note were already accurate — no changes needed. |
Add a project convention and reviewer checklist item that documentation should describe only what the code actually does, not aspirational behavior. Evidence: - PR #1035 (kelos-bot[bot] P1): docs claimed HMAC signature validation per source, but the GenericSource branch in handler.go never reads the webhook secret, never inspects X-Hub-Signature-256, and never calls validateHMACSignature — giving users a false sense of security. - PR #1056 (kelos-bot[bot] P2): doc note claimed PodOverrides.Env entries reusing reserved names are dropped so the built-in always wins, but the actual filter only drops collisions against names already populated on mainContainer.Env, leaving KELOS_SETUP_COMMAND passable when the field is empty. Applied to AGENTS.md (symlinked to CLAUDE.md), agentconfig.yaml, kelos-workers.yaml, and a new "Documentation accuracy" subsection in the kelos-reviewer.yaml checklist. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
What type of PR is this?
/kind feature
What this PR does / why we need it:
Adds
Workspace.spec.setupCommand: an optional exec-form slice that runs inthe agent container after the workspace is prepared and before the agent
starts. The controller JSON-encodes the command into
KELOS_SETUP_COMMAND,and each agent's
kelos_entrypoint.shdecodes the array and execs itdirectly via
node(no shell wrapping). A non-zero exit aborts the Task,and
---KELOS_SETUP_COMMAND_{START,DONE,FAILED}---banners on stderrdistinguish setup failures from agent failures when tailing pod logs.
This lets workspaces declare repo-specific bootstrapping (npm ci, pip
install, go mod download, code generation) once instead of paying the
typing/wall-clock cost on every Task.
Implementation notes:
sh -c— semantics match Kuberneteslifecycle.postStart.exec.command. Use["sh", "-c", "<script>"]forshell pipelines.
MinItems=1rejects an explicit empty array at admission.KELOS_SETUP_COMMANDis a statically reserved env-var name: thecontroller drops user-supplied
PodOverrides.Enventries with that nameunconditionally, even when the workspace does not set
setupCommand,so the entrypoint cannot be tricked into executing forged commands.
→ agent, as documented in the godoc.
Which issue(s) this PR is related to:
Fixes #1051
Special notes for your reviewer:
The setup runs inside the agent container (not as a separate init
container) so it shares the agent toolchain and credentials, matching the
issue's scope decision. All five reference agent images (
claude-code,codex,gemini,opencode,cursor) already install Node.js v22, sothe entrypoint uses
node -ewithchild_process.spawnSyncfor portableJSON decoding and exec.
Test coverage:
internal/controller/job_builder_test.gocover the happypath, the no-setup-command case, the
PodOverrides.Envoverride-dropcase (workspace defines setupCommand), and the new static-list
override-drop case (workspace does not define setupCommand).
test/e2e/setup_command_test.goexercise the successpath (setup writes a file the agent reads) and the failure path
(non-zero setup exit transitions the Task to Failed without running
the agent).
Does this PR introduce a user-facing change?
Yes — a new optional CRD field.