Skip to content

fix(worker): stop snapshot-baked stale JOB_DATA running the wrong agent (ucho MNG-1622/1702)#1470

Merged
zbigniewsobiecki merged 4 commits into
devfrom
fix/snapshot-stale-jobdata-and-env-scrub
Jul 1, 2026
Merged

fix(worker): stop snapshot-baked stale JOB_DATA running the wrong agent (ucho MNG-1622/1702)#1470
zbigniewsobiecki merged 4 commits into
devfrom
fix/snapshot-stale-jobdata-and-env-scrub

Conversation

@zbigniewsobiecki

Copy link
Copy Markdown
Member

Symptom

Prod (project ucho): MNG-1622 & MNG-1702 were moved into Splitting; the label triggered but no backlog story cards were produced.

Root cause (verified: prod logs + live docker image inspect)

The splitting workers silently re-ran the planning agent on a stale payload. A regression from the MNG-1660 offload fix × snapshot reuse:

  1. An earlier planning run passed its payload inline (-e JOB_DATA=<json>). Its container was committed to a snapshot — and docker commit bakes the container's Config.Env (incl. JOB_DATA=<planning json> and every secret) into the image.
  2. The splitting payload was large → the router offloaded to Redis, setting only JOB_DATA_REDIS_KEY (worker-env.ts is if/else — no JOB_DATA).
  3. The splitting worker spawned from that snapshot. docker run -e JOB_DATA_REDIS_KEY=… doesn't clear the baked JOB_DATA, so the container had both.
  4. resolveRawJobData() read inline JOB_DATA first → the stale planning payload (triggerResult.agentType: 'planning') → ran planning → no cards.

Smoking gun: docker image inspect cascade-snapshot-ucho-mng-1702:latest showed baked JOB_DATA with triggerResult.agentType: planning alongside the splitting JOB_DATA_REDIS_KEY — plus 10 plaintext secrets (DATABASE_URL, GITHUB_TOKEN_IMPLEMENTER/REVIEWER, LINEAR_API_KEY, CLAUDE_CODE_OAUTH_TOKEN, CODEX_AUTH_JSON, …).

Before MNG-1660, every run set inline JOB_DATA, which always overrode the baked value — masking this.

Fix (two parts)

  1. Correctness, self-healingworker-entry.ts resolveRawJobData: JOB_DATA_REDIS_KEY is now authoritative when present. The router sets exactly one channel per spawn, so an inline JOB_DATA co-present with a key can only be a stale snapshot artifact → ignored. Fixes snapshots already baked in prod without pruning.
  2. Hygiene + 🔒 securityworker-snapshots.ts commitWorkerSnapshot: inspect the container and commit a scrubbed Config.Env (docker-modem _query/_body idiom), stripping job env + secrets (static deny-list + dynamic credential names from CASCADE_CREDENTIAL_KEYS) while preserving PATH/PLAYWRIGHT_BROWSERS_PATH/NODE_*. On inspect failure → bare commit + Sentry (snapshot_env_scrub_inspect_failed), never silently baking secrets or losing the snapshot.

Blast radius

Any second+ agent on a work item where an earlier run baked inline JOB_DATA and the later run offloads (>96 KiB): planning→splitting, planning→implementation, implementation→review, etc. Every non-GitHub adapter embeds the pre-resolved triggerResult, so all are exposed.

Tests

  • +2 worker-entry: Redis key wins over stale inline; inline-only path unchanged.
  • +6 worker-snapshots: scrubSnapshotEnv (deny-list / credential-keys / =-in-value / no-=); scrubbed-commit _query/_body shape; inspect-failure fallback + Sentry.
  • typecheck, lint, unit-api (1814) + unit-core (6233) green; full pre-push suite green.

Ops after deploy

  • Prune existing cascade-snapshot-* images — they carry baked secrets + stale JOB_DATA: docker images 'cascade-snapshot-*' -q | xargs -r docker rmi -f.
  • Rotate any credentials baked into snapshot images that were ever pushed off-host.
  • Manual verify: after a snapshot-enabled run, docker image inspect <snap> --format '{{json .Config.Env}}' | grep -Ei 'JOB_DATA|TOKEN|API_KEY|DATABASE_URL|AUTH_JSON' → prints nothing.

Live unblock (already done, no deploy)

Pruned the MNG-1702 snapshot and force-triggered splitting for both items via CLI → they re-ran as real splitting on the base image.

🤖 Generated with Claude Code

…nt (ucho MNG-1622/1702)

Prod: MNG-1622 and MNG-1702 were moved into Splitting; the label triggered but
no backlog story cards were produced. Root cause (verified via prod logs +
live `docker image inspect`): the splitting workers RE-RAN the planning agent
on a stale payload.

Mechanism — an interaction between the MNG-1660 JOB_DATA offload and snapshot
reuse:
- An earlier `planning` run of the same issue passed its payload INLINE
  (`-e JOB_DATA=<json>`). Its container was committed to a snapshot image, and
  `docker commit` bakes the container's `Config.Env` — including
  `JOB_DATA=<planning json>` (and every secret: DATABASE_URL, REDIS_URL, GitHub
  /Linear/OpenAI creds, Claude OAuth token) — into the image.
- The `splitting` run's payload was large, so the router OFFLOADED it to Redis
  and set only `JOB_DATA_REDIS_KEY`, not `JOB_DATA` (worker-env.ts is if/else).
- The splitting worker spawned FROM that snapshot image. `docker run
  -e JOB_DATA_REDIS_KEY=...` does not clear the baked `JOB_DATA`, so the
  container had both.
- `resolveRawJobData()` read inline `JOB_DATA` first → the STALE planning
  payload (triggerResult.agentType 'planning') → ran planning → no cards.
This regressed from MNG-1660: before it, every run set inline JOB_DATA which
always overrode the baked value.

Two-part fix:
1. Correctness (self-healing, covers snapshots already baked in prod):
   worker-entry.ts resolveRawJobData now treats JOB_DATA_REDIS_KEY as
   AUTHORITATIVE when present — the router sets exactly one channel per spawn,
   so an inline JOB_DATA co-present with a key can only be a stale snapshot
   artifact and is ignored.
2. Hygiene + security: worker-snapshots.ts commitWorkerSnapshot now inspects the
   container and commits a SCRUBBED Config.Env (docker-modem _query/_body idiom),
   stripping job env (JOB_DATA*/JOB_ID/JOB_TYPE) and secrets (DATABASE_URL,
   REDIS_URL, CREDENTIAL_MASTER_KEY, CLAUDE_CODE_OAUTH_TOKEN, and the dynamic
   project-credential names from CASCADE_CREDENTIAL_KEYS) while preserving PATH /
   PLAYWRIGHT_BROWSERS_PATH / NODE_*. Closes the credential-in-image leak; on
   inspect failure, falls back to a bare commit + Sentry (snapshot_env_scrub_
   inspect_failed) rather than silently baking secrets or losing the snapshot.

Blast radius: any second+ agent on a work item where an earlier run baked inline
JOB_DATA and the later run offloads (>96 KiB) — planning→splitting,
planning→implementation, implementation→review, etc.

Tests: +2 worker-entry (Redis key wins over stale inline; inline-only path
unchanged); +6 worker-snapshots (scrubSnapshotEnv deny-list/credential-keys/=-in
-value/no-= edge cases; scrubbed-commit _query/_body shape; inspect-failure
fallback + Sentry). typecheck, lint, unit-api (1814) and unit-core (6233) green.

Ops after deploy: prune existing cascade-snapshot-* images (they carry baked
secrets + stale JOB_DATA); rotate any credentials baked into images shared
off-host.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01RyBTx5JozjbpUko5SRyz4X
@codecov

codecov Bot commented Jul 1, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 93.75000% with 6 lines in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
src/worker-entry.ts 66.66% 6 Missing ⚠️

📢 Thoughts on this report? Let us know!

@nhopeatall nhopeatall left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Summary

REQUEST_CHANGES. The worker-side reorder (part 1) correctly fixes the reported forward incident, but the snapshot env-scrub (part 2 — the headline 🔒 security + at-source hygiene fix) writes the Docker commit body in the wrong shape, so it almost certainly strips nothing. A mocked unit test locks in that wrong shape, which is why CI is green despite the bug. There is also an asymmetric edge in the worker reorder.

Code Issues

Blocking

  • src/router/worker-snapshots.ts:128 — commit body shape doesn't match the Docker ImageCommit API.
    container.commit({ _query, _body: { Config: { Env: scrubbed } } }) sends the HTTP body {"Config":{"Env":[…]}}. But Docker's POST /commit (ImageCommit) request body is itself a ContainerConfigEnv is a top-level field, not nested under Config. The Config nesting is the shape of the inspect response (info.Config.Env, which line 114 reads correctly), not the commit request.

    docker-modem sends _body verbatim (modem.js:208 → JSON.stringify(opts._body || opts)), so Docker receives {"Config":{…}}, treats Config as an unknown field (ignored), and sees a nil Env. The existing bare commit already sends a body that decodes to an all-nil config and still preserves the container env, so nil Env means "keep the container's env" → the scrubbed list is never applied. The committed snapshot keeps every secret + the stale JOB_DATA, while the strippedKeys: N info log (lines 130–133) reports a false positive. (If a given daemon instead treats the body as a full replace, the image gets an empty/broken config — also bad.)

    Fix: send a flat ContainerConfig: _body: { Env: scrubbed }. This needs verification against a real Docker daemon — the unit test mocks container.commit, so it only checks the argument the code passes, not what Docker accepts. The PR's own "Manual verify" grep (docker image inspect <snap> --format '{{json .Config.Env}}' | grep -Ei 'JOB_DATA|TOKEN|API_KEY|…') is exactly the right check and should be run before merge, not just after deploy.

Should Fix

  • src/worker-entry.ts:478 — Redis-key-first is asymmetric; the reverse stale-snapshot case now hard-fails.
    Making the key authoritative fixes the forward case (baked inline JOB_DATA + fresh key). But the symmetric case regresses: a snapshot committed from a prior offloaded run bakes a stale JOB_DATA_REDIS_KEY=cascade:jobdata:<priorJobId>. If the next run for the same work item is inline (payload < 96 KiB), docker run -e JOB_DATA=… does not clear the baked key, so process.env.JOB_DATA_REDIS_KEY is the stale key. readOffloadedJobData then reads a key the prior run already deleted (job-data-offload.ts:108) → throws → process.exit(1). With the old inline-first order this run succeeded. That is a brand-new failure mode in exactly the "self-heal already-baked snapshots without pruning" scenario this PR targets, and snapshot reuse is per (projectId, workItemId) so it is reachable (container-manager.ts:253).

    The offload key embeds the jobId and JOB_ID is set fresh on every spawn (worker-env.ts:113), so you can disambiguate both directions: treat the Redis key as authoritative only when key === buildJobDataRedisKey(process.env.JOB_ID); otherwise it is a stale baked artifact — ignore it and fall through to inline. That fixes forward and reverse.

Notes

  • tests/unit/router/worker-snapshots.test.ts:127 asserts commitArg._body.Config.Env — it validates the argument the code passes to the mock, not what a real daemon accepts, which is why CI passes despite the shape bug. Once _body is corrected to { Env: … }, update this assertion and ideally add a real-daemon integration check.
  • Deny-list completeness itself looks good: it covers the infra secrets + CLAUDE_CODE_OAUTH_TOKEN + the optional vars from appendOptionalEnvVars, and dynamically pulls per-project credential names from CASCADE_CREDENTIAL_KEYS (set in worker-env.ts:164). Preserving SENTRY_* / CASCADE_DASHBOARD_URL / PATH / PLAYWRIGHT_BROWSERS_PATH is correct — but none of it takes effect until the commit body shape is fixed.

🕵️ claude-code · claude-opus-4-8 · run details

Comment thread src/router/worker-snapshots.ts Outdated
const scrubbed = scrubSnapshotEnv(env, extractCredentialKeys(env));
await container.commit({
_query: { container: containerId, repo, tag: 'latest' },
_body: { Config: { Env: scrubbed } },

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BLOCKING — wrong Docker commit body shape. Docker's POST /commit (ImageCommit) request body is a ContainerConfig, so Env must be top-level: _body: { Env: scrubbed }. { Config: { Env } } is the inspect response shape (correctly read on line 114), not the commit request shape. docker-modem sends _body verbatim (modem.js:208), so Docker receives {"Config":{"Env":[…]}}, treats Config as an unknown field, and sees a nil Env → it keeps the container's original (unscrubbed) env. Net: every secret + stale JOB_DATA stay baked and strippedKeys logs a false positive. Verify against a real daemon: docker image inspect <snap> --format '{{json .Config.Env}}'.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 097b179a. The scrubbed-env commit now sends a flat ContainerConfig_body: { Env: scrubbed } (top-level Env) — instead of the { Config: { Env } } inspect-response shape.

Verified against docker-modem's source: modem.js:162 uses _query for the querystring and modem.js:208 (JSON.stringify(opts._body || opts)) sends _body verbatim as the ImageCommit request body, so the flat shape now maps to ContainerConfig.Env as the API expects. The old nested Config was an unknown top-level field Docker ignored, leaving Env nil and baking the raw env while strippedKeys logged a false positive — exactly as you described.

Docker isn't available in the agent sandbox for a live docker image inspect, so the PR's manual-verify grep stays the pre-merge gate.

Comment thread src/worker-entry.ts Outdated
captureException(err, { tags: { source: 'worker_env' } });
await flush();
process.exit(1);
if (key) {

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SHOULD_FIX — this is asymmetric. If a snapshot was committed from a prior offloaded run it bakes a stale JOB_DATA_REDIS_KEY=cascade:jobdata:<priorJobId>. When the next run for the same work item is inline, docker run -e JOB_DATA=… doesn't clear the baked key, so this reads the stale key; readOffloadedJobData finds it already deleted (the prior run del'd it) → throws → process.exit(1). That run succeeded before this change — a new failure mode in exactly the self-heal-without-pruning scenario the PR targets. Since the key embeds the jobId and JOB_ID is set fresh every spawn (worker-env.ts:113), only trust the key when key === buildJobDataRedisKey(process.env.JOB_ID); otherwise treat it as a stale baked artifact and fall through to inline. Handles both directions.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 097b179a. The Redis key is now authoritative only when it names THIS job: key === buildJobDataRedisKey(process.env.JOB_ID). Since JOB_ID is set fresh on every spawn and the offload key embeds that jobId (worker-env.ts:113/137), this disambiguates both directions:

  • Forward (fresh key + stale baked inline JOB_DATA): key matches → read Redis, ignore the stale inline value.
  • Reverse (fresh inline + stale baked JOB_DATA_REDIS_KEY from a prior offloaded run): key does not match → treat it as a stale baked artifact, fall through to inline. No more reading a key the prior run already deleted → no process.exit(1) regression.

Added a reverse-case regression test in tests/unit/worker-entry.test.ts and updated the missing/expired offload test to use a key that matches JOB_ID (as production always does).

repo: 'cascade-snapshot-proj-snap-card-snap',
tag: 'latest',
});
const bakedEnv = commitArg._body.Config.Env;

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This asserts the argument the code passes to the mocked container.commit (_body.Config.Env), not what a real Docker daemon accepts — which is why CI stays green even though the commit body shape is wrong for the ImageCommit API (see the comment on worker-snapshots.ts). Once _body is corrected to { Env: … }, update this assertion, and ideally add a real-daemon integration check.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated in 097b179a: the assertion now reads commitArg._body.Env (top-level) to match the corrected ImageCommit body shape. I also added a reverse-case regression to worker-entry.test.ts.

On the real-daemon integration check: the repo has no existing real-Docker test harness (all Docker interaction in tests is mocked via dockerode) and no daemon is available in the agent sandbox, so adding one here would be unrunnable/flaky. I verified the wire shape via docker-modem source instead (_body serialized verbatim at modem.js:208); the PR's manual-verify grep remains the pre-merge gate.

…stale offload key

Address review feedback on PR #1470.

1. worker-snapshots.ts (BLOCKING): the scrubbed-env commit wrote the Docker
   ImageCommit body as `{ Config: { Env } }`. Docker's POST /commit request body
   IS a ContainerConfig — Env is top-level. docker-modem serializes `_body`
   verbatim (modem.js:208), so the nested `Config` was an unknown field Docker
   ignored, leaving Env nil and baking the container's raw (unscrubbed) env while
   the `strippedKeys` log reported a false positive. Send `_body: { Env: scrubbed }`.

2. worker-entry.ts (SHOULD_FIX): making JOB_DATA_REDIS_KEY unconditionally
   authoritative fixed the forward case but regressed the reverse one — a snapshot
   baked from a prior OFFLOADED run carries a stale JOB_DATA_REDIS_KEY that a later
   INLINE run does not clear, so reading it hit a key the prior run already deleted
   → process.exit(1). Trust the key only when it names THIS job
   (`key === buildJobDataRedisKey(JOB_ID)`); otherwise treat it as a stale baked
   artifact and fall through to inline. Fixes forward AND reverse.

3. Tests: assert the corrected `_body.Env` shape; add a reverse-case regression
   (stale baked key mismatching JOB_ID → inline wins, no crash); make the
   missing/expired offload test use a key that matches JOB_ID (as production always
   does).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

@nhopeatall nhopeatall left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Summary

REQUEST_CHANGES. Part 1 (worker-entry resolveRawJobData) is correct and well-tested. Part 2 (snapshot env-scrub) commits a partial ContainerConfig to Docker, which replaces the image's entire runtime config — wiping Cmd/WorkingDir/User. Every reused snapshot after this change fails to launch, which regresses the exact snapshot-reuse path this PR targets.

Code Issues

Blocking

src/router/worker-snapshots.ts (commitScrubbed) — the scrubbed commit passes _body: { Env: scrubbed }, a partial config. Docker's POST /commit treats a non-nil request-body Config as the complete runtime config of the new image; it does not merge it over the container's config:

  • moby daemon/commit.go: c.Config only falls back to container.Config when it is nil. A non-nil {Env} is used as-is.
  • image.NewChildImage sets Config: child.Config directly — no merge with the parent image's config.

So the snapshot image ends up with only Env and loses everything else the worker relies on:

Lost field Baked by Consequence on reuse
Cmd ["node","--import","./dist/instrument.js","dist/worker-entry.js"] Dockerfile.worker:145 createContainer/start → "No command specified"
WorkingDir /app Dockerfile.worker:16 resets to / → relative dist/worker-entry.js not found
User node Dockerfile.worker:123 resets to root → Claude Code refuses --dangerously-skip-permissions (Dockerfile.worker:107)
Labels / ExposedPorts / Volumes Dockerfile.worker dropped

The reuse path sets no explicit Cmd/WorkingDir/Userworker-container-launcher.ts:74 createContainer passes only Image/Env/HostConfig/Labels, and resolveSpawnSettings swaps workerImage = snapshot.imageName (worker-spawn-settings.ts:141). So a reused snapshot depends entirely on the image's baked config. Result: the first snapshot-enabled run commits a broken image; the 2nd+ run for that work item can't start at all — strictly worse than the "wrong agent" bug being fixed, and it lands on the same ucho/snapshot path.

Fix — spread the full inspected Config, replacing only Env (the inspect extraction must widen too — today it only keeps info?.Config?.Env):

const info = (await container.inspect()) as { Config?: Record<string, unknown> & { Env?: string[] } };
const fullConfig = info?.Config ?? {};
const env = fullConfig.Env;
// …after computing `scrubbed`…
await container.commit({
  _query: { container: containerId, repo, tag: 'latest' },
  _body: { ...fullConfig, Env: scrubbed },
});

I could not run Docker in the review env — please confirm on a live daemon: after a scrubbed commit, docker image inspect <snap> --format '{{json .Config.Cmd}}' should print the CMD, not null. The new unit test asserts only _body.Env and mocks container.commit, so it can't catch this; once fixed, add assertions that Cmd/WorkingDir/User survive.

Note (non-blocking)

The inspect-failure fallback container.commit({ repo, tag }) (bare) is the config-preserving form (nil body → daemon keeps container.Config). So today the failure path yields a working (unscrubbed) snapshot while the success path yields a broken one. Fixing the main path removes the asymmetry.

What's correct

worker-entry.ts resolveRawJobData: gating trust on JOB_ID === buildJobDataRedisKey(jobId) cleanly disambiguates both stale-snapshot directions. The one-channel-per-spawn invariant holds (worker-env.ts:133-142), JOB_ID is validated present in main() before this runs, and the forward/reverse/inline-only cases are all covered by tests. 👍

🕵️ claude-code · claude-opus-4-8 · run details

let env: string[] | undefined;
try {
const info = (await container.inspect()) as { Config?: { Env?: string[] } } | undefined;
env = info?.Config?.Env;

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This only captures Config.Env, so the full container config isn't available to pass back on commit (see the blocking comment below). Widen this to keep the whole Config (Cmd/WorkingDir/User/…) so it can be spread into the commit body.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 630e39be. Widened the inspect extraction to keep the whole Config (fullConfig) rather than only Config.Env:

const info = (await container.inspect()) as
  | { Config?: Record<string, unknown> & { Env?: string[] } }
  | undefined;
fullConfig = info?.Config;

fullConfig now carries Cmd/WorkingDir/User/Labels/ExposedPorts so they can be spread back into the commit body (see the reply on the commit call below).

Comment thread src/router/worker-snapshots.ts Outdated
// nested under Config (that is the inspect-response shape). See fn doc.
await container.commit({
_query: { container: containerId, repo, tag: 'latest' },
_body: { Env: scrubbed },

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Blocking: _body: { Env: scrubbed } is a partial ContainerConfig. Docker's POST /commit uses a non-nil body config as the image's complete runtime config (moby daemon/commit.go only falls back to container.Config when the body is nil; image.NewChildImage assigns Config: child.Config with no parent merge). So the snapshot loses Cmd (Dockerfile.worker:145), WorkingDir=/app (:16) and User=node (:123). Since worker-container-launcher.ts sets none of those at createContainer, a reused snapshot (worker-spawn-settings.ts:141) can't start — Cmd gone → "No command specified"; even otherwise, User→root breaks Claude Code and WorkingDir→/ breaks the relative dist/worker-entry.js. Spread the full inspected config instead: _body: { ...fullConfig, Env: scrubbed }. Please verify on a live daemon that .Config.Cmd survives the commit.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 630e39be. The scrubbed commit now spreads the full inspected config and replaces only Env:

await container.commit({
  _query: { container: containerId, repo, tag: 'latest' },
  _body: { ...fullConfig, Env: scrubbed },
});

This preserves Cmd (the worker entrypoint), WorkingDir=/app, User=node, Labels, and ExposedPorts baked by Dockerfile.worker, so a reused snapshot (which sets no explicit Cmd/WorkingDir/User at createContainer) still launches. The inspect-failure bare-commit fallback was already the config-preserving form (nil body → daemon keeps container.Config), so this also removes the success/failure asymmetry you flagged in the non-blocking note.

Added unit assertions that Cmd/WorkingDir/User/Labels/ExposedPorts survive the scrubbed commit body (the mock inspect now returns those fields alongside Env).

On the live-daemon check: Docker isn't available in the agent sandbox and the repo has no real-Docker test harness (all Docker interaction is mocked via dockerode), so I verified the shape against moby's source — daemon/commit.go falls back to container.Config only when the body config is nil, and image.NewChildImage assigns Config: child.Config with no parent merge, confirming a non-nil {Env}-only body would wipe the rest. The PR's manual-verify grep plus docker image inspect <snap> --format '{{json .Config.Cmd}}' (should print the CMD, not null) remains the pre-merge gate.

The scrubbed-env commit passed a partial `_body: { Env: scrubbed }`. Docker's
POST /commit uses a non-nil body config as the new image's COMPLETE runtime
config (moby daemon/commit.go falls back to container.Config only when the body
is nil; image.NewChildImage assigns Config: child.Config with no parent merge),
so this wiped Cmd/WorkingDir/User/Labels/ExposedPorts baked by Dockerfile.worker.
Every reused snapshot then failed to launch ("No command specified", User→root,
WorkingDir→/) — regressing the exact snapshot-reuse path this PR targets.

Widen the inspect extraction to keep the whole Config and spread it into the
commit body, replacing only Env: `_body: { ...fullConfig, Env: scrubbed }`. The
inspect-failure bare-commit fallback was already config-preserving; this removes
the success/failure asymmetry. Unit test now asserts Cmd/WorkingDir/User/Labels/
ExposedPorts survive the scrubbed commit.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

@nhopeatall nhopeatall left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Summary

REQUEST_CHANGES — scoped to the snapshot env-scrub only. The worker-side fix (resolveRawJobData) is correct, well-reasoned, and fully resolves the wrong-agent regression on its own. The scrub half relies on an unverified assumption about Docker's commit semantics, and the tests can't catch a failure because they mock container.commit() and assert on its arguments, not on the committed image.

Code Issues

Blocking (verify against a real daemon before merge)

src/router/worker-snapshots.ts_body: { Env: scrubbed } scrub is unverified end-to-end.

Docker's POST /commit applies the request-body config on top of the container's existing config; it does not simply swap in what you pass. For the Env list specifically, moby re-appends the container's env vars whose keys are absent from the provided list — there is no supported way to remove an env var via docker commit (that's why --change only ever adds/overrides ENV, never deletes). If that holds here, every deny-listed key (JOB_DATA, DATABASE_URL, CLAUDE_CODE_OAUTH_TOKEN, the project credentials) is re-added at commit time, the committed image is byte-for-byte identical to today's bare commit, and the "🔒 security" scrub is a silent no-op.

The opposite semantics are no safer: if the body config did replace wholesale, the snapshot would lose its baked CMD/ENTRYPOINT/WORKINGDIRworker-container-launcher.ts starts the container with Image + Env and no explicit Cmd (relies on the image's baked command), so a reused snapshot would fail to boot.

Either way the change needs a real-daemon check. The unit tests can't distinguish these outcomes: they assert on commitArg._body.Env (the mocked call args), and the in-test comment "a nested Config would be silently dropped by Docker, baking the raw env" is itself an unverifiable claim about Docker that a mock cannot exercise.

Ask: run your own "Ops after deploy → Manual verify" grep pre-merge
docker image inspect <snap> --format '{{json .Config.Env}}' | grep -Ei 'JOB_DATA|TOKEN|API_KEY|DATABASE_URL|AUTH_JSON' must print nothing — and ideally lock it down with an integration test that commits a throwaway container and inspects the resulting image env. If the keys survive, scrub-by-omission won't work; blanking values (--change 'ENV KEY=') or not injecting persistable secrets into snapshot-enabled containers would be the fallback.

Note: the correctness fix does not depend on the scrub — resolveRawJobData already ignores the stale baked JOB_DATA, so the wrong-agent bug is fixed regardless of how commit resolves this.

What's solid (no changes needed)

  • resolveRawJobData — the JOB_ID↔key match guard cleanly disambiguates both the forward case (offloaded run + stale baked inline JOB_DATA) and the reverse case (inline run + stale baked JOB_DATA_REDIS_KEY). Cross-checked against worker-env.ts: JOB_ID=${job.id} is always set, exactly one of JOB_DATA/JOB_DATA_REDIS_KEY is set per spawn, the offload key is built from the same job id, and job ids are ${type}-${Date.now()}-${random} (collision-free), so a stale baked key never matches a fresh JOB_ID. Tests cover both directions plus inline-only and the Redis-miss exit path.
  • scrubSnapshotEnv as a pure function is correct and well-tested (static deny-set + dynamic CASCADE_CREDENTIAL_KEYS, first-= split, no-= lines).
  • Inspect-failure → bare-commit + Sentry (snapshot_env_scrub_inspect_failed) is a reasonable non-silent degrade.

🕵️ claude-code · claude-opus-4-8 · run details

Comment thread src/router/worker-snapshots.ts Outdated
const scrubbed = scrubSnapshotEnv(env, extractCredentialKeys(env));
// ImageCommit request body IS a ContainerConfig → Env is top-level, NOT
// nested under Config (that is the inspect-response shape). See fn doc.
await container.commit({

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This scrub is unverified end-to-end, and the tests can't catch a failure — they assert on the mocked container.commit() args (commitArg._body.Env), not on the resulting image's Config.Env.

Docker's POST /commit applies the body config on top of the container's config; moby re-appends the container's env vars whose keys aren't in the provided list (you can't remove an env var via docker commit--change ENV only adds/overrides). If that's the case here, JOB_DATA + every secret get re-added at commit time → the image is identical to a bare commit and the security scrub is a silent no-op.

If instead the body replaced wholesale, the snapshot would lose its baked CMD/ENTRYPOINT/WORKINGDIR (worker-container-launcher.ts starts with Image+Env and no explicit Cmd) → reused snapshots fail to boot.

Please verify pre-merge against a real daemon: docker image inspect <snap> --format '{{json .Config.Env}}' | grep -Ei 'JOB_DATA|TOKEN|API_KEY|DATABASE_URL' must print nothing. The worker-side resolveRawJobData fix already covers the correctness bug, so this path is security-only.

@aaight

aaight commented Jul 1, 2026

Copy link
Copy Markdown
Collaborator

🛠️ Addressing the snapshot env-scrub feedback now

🕵️ run details

@zbigniewsobiecki

Copy link
Copy Markdown
Member Author

Real-daemon verification (the reviewer's blocking ask) — done. @nhopeatall is correct: the current env-scrub is a silent no-op. Here's the proof and the working fix.

Ran both approaches against a live Docker daemon (postgres:16-alpine container with JOB_DATA=STALE, MY_SECRET=topsecret, KEEP_ME=yes):

❌ Scrub-by-omission (current PR — pass a filtered Env in the commit body):

POST /commit  body {"Config":{"Env":["KEEP_ME=yes","PATH=..."]}}
→ image .Config.Env STILL contains: JOB_DATA=STALE_PLANNING, MY_SECRET=topsecret

moby re-appends every container env var whose key is absent from the body list — there is no "remove env" via commit. The committed image is byte-identical to a bare commit. The scrub (and the 🔒 security claim) does nothing. Unit tests can't catch this because they assert on the mocked commit() args, not the resulting image.

✅ Blank the values via changes: ['ENV KEY='] (verified through the exact dockerode call the code uses):

container.commit({ repo, tag: 'latest',
  changes: ['ENV JOB_DATA=', 'ENV JOB_DATA_REDIS_KEY=', 'ENV MY_SECRET='] })
 image .Config.Env: ["JOB_DATA=", "JOB_DATA_REDIS_KEY=", "MY_SECRET=", "KEEP_ME=yes"]
 Cmd preserved: ["sleep","120"]

Values are emptied (secret gone; empty JOB_DATA is falsy so it's ignored), and Cmd/ENTRYPOINT/all other env are preserved — no boot breakage.

Recommended change to commitWorkerSnapshot: keep scrubSnapshotEnv/extractCredentialKeys to compute the deny-key set, but instead of a scrubbed _body.Env, inspect the container's env, intersect the deny-set (static + CASCADE_CREDENTIAL_KEYS) with the keys actually present, and commit with changes: presentDenyKeys.map(k => \ENV ${k}=`). Drop the _query/_body` Config approach entirely.

Test gap to close: the mock-args assertion should be replaced/augmented — assert the changes array is built from the present deny-keys. A true end-to-end guarantee needs an integration test that commits a throwaway container and inspects the image env (gated on Docker availability), since only that exercises moby's real merge semantics.

Note (agreeing with the review): the resolveRawJobData correctness fix stands on its own and fully resolves the wrong-agent regression — the stale baked JOB_DATA is ignored regardless of how commit resolves. The scrub is security/hygiene only. If it's simpler, the scrub can even land as a follow-up so the correctness fix isn't blocked.

@nhopeatall nhopeatall left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Summary

APPROVE — this correctly fixes a verified prod bug (stale snapshot-baked JOB_DATA running the wrong agent) plus the associated secret-baking, and the Docker mechanics behind the fix hold up under scrutiny.

What I verified

  • resolveRawJobData disambiguation is sound. worker-env.ts sets exactly one channel per spawn (if/else) with JOB_ID=job.id and JOB_DATA_REDIS_KEY=buildJobDataRedisKey(job.id) — both from the same job.id, so a fresh offload key always matches JOB_ID. BullMQ job IDs are unique per spawn (queue.ts timestamp+random), so a stale baked key embeds a prior job's ID and can never collide with the fresh JOB_ID. Every channel combination (fresh key + stale inline; fresh inline + stale key; overridden key) resolves correctly. Only worker-entry.ts reads process.env.JOB_DATA, and the subprocess allowlist (envFilter.ts) already blocks it from agents.
  • The _query/_body docker-modem idiom is real in the pinned docker-modem (modem.js:162 opts._query, :208 opts._body).
  • The "non-nil commit body = complete config replacement" reasoning is correct. The original bare {repo,tag} body decodes to a nil container.Config (no matching fields) so the daemon kept the container config; {...fullConfig, Env: scrubbed} decodes to a populated config used verbatim, so spreading the full inspected Config and replacing only Env is what preserves Cmd/WorkingDir/User/Labels. Scrubbing is safe because the launcher re-injects full Env on every spawn (worker-container-launcher.ts:77); only base-image vars need to survive, and they do.
  • Deny-list coverage is complete — static infra keys + CLAUDE_CODE_OAUTH_TOKEN + every dynamic project credential via CASCADE_CREDENTIAL_KEYS (covers CODEX_AUTH_JSON, GitHub/Linear tokens, etc.), while preserving PATH/PLAYWRIGHT_BROWSERS_PATH/NODE_*/SENTRY_*.

Tests are meaningful and pass locally (39 + 19); CI 7/7 green.

Minor (optional, non-blocking)

One negligible doc/code nuance noted inline — no action required.

🕵️ claude-code · claude-opus-4-8 · run details

}

// inspect unavailable / empty env → degrade to the prior bare-commit behavior.
await container.commit({ repo, tag: 'latest' });

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Optional nit (non-blocking): the function docstring says "If inspect fails or returns no env, falls back to a bare commit … and captures Sentry under snapshot_env_scrub_inspect_failed so the regression to baking secrets is loud rather than silent." But Sentry is only captured on the inspect throw path — this fallback (reached when inspect() resolves yet Config/Env is missing) bare-commits silently. In practice this is unreachable for a real worker container (Config.Env always has at least PATH + injected vars), and an empty env has no secrets to leak, so the behavior is fine. Only the docstring slightly overstates the Sentry coverage. Consider tightening the wording, or capturing Sentry here too if you want the "loud on unexpected shape" guarantee to literally hold.

…changes`)

The prior scrub committed a filtered `Env` array in the commit body — a proven
NO-OP. Verified against a live Docker daemon: moby re-appends every container
env var whose key is absent from the body list, so the committed image is
byte-identical to a bare commit — `JOB_DATA` and every credential survive.
docker commit cannot REMOVE an env var.

Switch to the supported mechanism: `container.commit({ changes: ['ENV KEY='] })`
blanks the value of each deny-listed / credential key actually present. Empty
`JOB_DATA` is falsy (worker ignores it); an emptied secret carries no value to
leak; `Cmd`/`Entrypoint`/`WorkingDir`/all other env are preserved (bare commit
keeps the full container config, `changes` overlays only the named keys).

End-to-end verified on a live daemon (exact code path): JOB_DATA, DATABASE_URL,
GITHUB_TOKEN_IMPLEMENTER, LINEAR_API_KEY all committed as `KEY=` (blank);
PATH/SENTRY_DSN/Cmd preserved.

- New `buildSnapshotEnvScrubChanges(env)` → `ENV KEY=` list for present deny keys.
- `commitScrubbed` uses `{repo, tag, changes}`; drops the `_query`/`_body` shape.
- Tests assert the `changes` array (present secret/job keys blanked, safe keys
  untouched) instead of the unverifiable `_body.Env` mock args.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01RyBTx5JozjbpUko5SRyz4X
@zbigniewsobiecki

Copy link
Copy Markdown
Member Author

Applied the corrected scrub in 5031c1ba and verified end-to-end against a live daemon (the pre-merge real-daemon check @nhopeatall asked for).

The _body:{Env:scrubbed} approach was a confirmed no-op — ran the exact dockerode call from 630e39be (commit({_query, _body:{...fullConfig, Env: scrubbed}})) against a real container: JOB_DATA and MY_SECRET survived in the committed image (moby re-appends container env whose keys are absent from the body; only Cmd/Entrypoint are honored). So the security claim was not holding.

Now using changes: ['ENV KEY='] (the supported mechanism — docker commit can't remove an env var, only blank it). End-to-end on the exact code path (buildSnapshotEnvScrubChangescommit({repo,tag,changes})):

changes: ["ENV JOB_DATA=","ENV DATABASE_URL=","ENV CASCADE_CREDENTIAL_KEYS=","ENV GITHUB_TOKEN_IMPLEMENTER=","ENV LINEAR_API_KEY="]
→ committed image env:  JOB_DATA=""  DATABASE_URL=""  GITHUB_TOKEN_IMPLEMENTER=""  LINEAR_API_KEY=""   (all BLANK ✓)
→ preserved:            PATH=/usr/local/bin, SENTRY_DSN=..., Cmd=["sleep","120"]

Empty JOB_DATA is falsy (worker ignores it, on top of the resolveRawJobData fix); emptied secrets carry no value to leak; Cmd/WorkingDir/other env preserved so reuse still boots.

Tests updated to assert the changes array (present secret/job keys blanked, safe keys untouched) rather than the un-exercisable _body.Env mock args. typecheck / lint / unit-api + unit-core green.

@zbigniewsobiecki zbigniewsobiecki merged commit c7ed44e into dev Jul 1, 2026
9 checks passed

@nhopeatall nhopeatall left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Summary

APPROVE. Both halves of the fix are correct, backward-compatible, and well-tested (I ran the two touched suites locally: 61/61 green). The final state also resolves every BLOCKING concern from the earlier review thread — the commit body no-op, the lost Cmd/WorkingDir/User, and the reverse-direction asymmetry are all handled.

What I verified

  • resolveRawJobData (worker-entry.ts). The key === buildJobDataRedisKey(jobId) gate is the right disambiguator. worker-env.ts sets exactly one channel per spawn (if/else at :133-137) and a fresh JOB_ID (:113); addJob job IDs are unique (${type}-${Date.now()}-${rand}, queue.ts:150), so a baked key from a prior run can never collide. Forward case (fresh key + stale baked inline) → reads Redis; reverse case (fresh inline + stale baked key) → key mismatch → falls through to inline. The old normal paths (inline-only, offload-only) are unchanged. Correct on all four combinations.
  • commitScrubbed (worker-snapshots.ts). The switch to changes: ['ENV KEY='] is the correct mechanism. Confirmed via docker-modem: container.commit({repo,tag,changes}) sends changes as a repeated query param (modem.js:162 → buildQuerystring array handling) while the body is the full opts object treated as ContainerConfigrepo/tag/changes/container are not ContainerConfig fields (no case-insensitive collisions), so it behaves as a bare/config-preserving commit with the ENV KEY= overlay. Cmd/WorkingDir/User survive; secret values are blanked to empty (falsy → the worker ignores them, and empty carries nothing to leak).
  • Scrub coverage. Static deny-list covers every infra secret pushed in worker-env.ts; project credentials are covered dynamically via CASCADE_CREDENTIAL_KEYS, which worker-env.ts:164 writes atomically alongside the individual credential vars — so the dynamic list always matches what's actually baked.

Non-blocking notes

1. PR description / ops runbook is stale vs. the final implementation. The description's "Fix part 2" and "Tests" still describe the superseded _body:{Env:scrubbed} / _query/_body approach; the merged code uses changes:['ENV KEY=']. Most importantly, the "Manual verify" step is now misleading:

docker image inspect <snap> --format '{{json .Config.Env}}' | grep -Ei 'JOB_DATA|TOKEN|API_KEY|DATABASE_URL|AUTH_JSON'  → prints nothing

ENV KEY= blanks the value but keeps the key, so Config.Env still contains JOB_DATA=, GITHUB_TOKEN_IMPLEMENTER=, DATABASE_URL= (empty). That grep will print the blanked keys, not "nothing" — an operator following the runbook would wrongly conclude the scrub failed. The value-focused check is what actually proves the fix, e.g.:

... | grep -Eio '(JOB_DATA|TOKEN|API_KEY|DATABASE_URL|AUTH_JSON)=.+'  → prints nothing

Worth correcting wherever this runbook lives so post-deploy verification isn't a false alarm.

2. Docstring overstates Sentry coverage (commitScrubbed, already raised as a nit in the prior thread and still present) — see inline.

3. Deny-list maintenance coupling (defense-in-depth) — see inline.

None of these block merge.

🕵️ claude-code · claude-opus-4-8 · run details

* latter is a proven no-op). `Cmd`/`Entrypoint`/`WorkingDir`/all other env are
* preserved, so a reused snapshot still boots.
*
* If inspect fails or the env is empty, falls back to a bare commit (an

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit (persists from the earlier thread): the docstring says "If inspect fails or the env is empty, falls back to a bare commit ... and captures Sentry under snapshot_env_scrub_inspect_failed". Sentry is only captured in the inspect() throw path (catch block). The resolves-but-empty-env / changes.length === 0 path bare-commits silently (no capture). Harmless in practice (a real worker container always has a non-empty Config.Env, and an empty env has no secrets to leak), but the wording overstates the coverage — either tighten it to "on inspect failure" or capture Sentry on the unexpected-shape path too.

* Static deny-set covers job + infra-secret keys; per-project credential names
* are dynamic and enumerated at runtime from `CASCADE_CREDENTIAL_KEYS`.
*/
const SNAPSHOT_ENV_DENYLIST: ReadonlySet<string> = new Set([

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Defense-in-depth (non-blocking): this is a static deny-list. Project credentials are safe because they flow through the dynamic CASCADE_CREDENTIAL_KEYS path, but a new infra secret added to worker-env.ts (e.g. another optional var in appendOptionalEnvVars) that isn't added here would be baked into every snapshot with no test catching it. The current list is complete, but consider a small guard test that asserts every non-credential secret key pushed by worker-env.ts is present in SNAPSHOT_ENV_DENYLIST, so the coupling can't silently drift.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants