Skip to content

fix: retry stale coordinator lease before command#158

Merged
steipete merged 1 commit into
mainfrom
fix/aws-before-command-lease-retry
May 25, 2026
Merged

fix: retry stale coordinator lease before command#158
steipete merged 1 commit into
mainfrom
fix/aws-before-command-lease-retry

Conversation

@steipete
Copy link
Copy Markdown
Contributor

Summary

  • Replace a coordinator-backed one-shot lease once when SSH drops after sync but before the command starts.
  • Release the stale lease before acquiring the replacement so coordinator active-lease and budget limits are respected.
  • Reset coordinator heartbeat, run telemetry, workspace state, and sync timing before retrying sync on the replacement lease.
  • Document the bounded retry behavior and add changelog coverage.

Verification

  • go test ./internal/cli
  • go test ./...
  • npm run docs:check
  • /Users/steipete/Projects/agent-scripts/skills/autoreview/scripts/autoreview --mode local (clean)

@clawsweeper
Copy link
Copy Markdown

clawsweeper Bot commented May 25, 2026

Codex review: needs real behavior proof before merge. Reviewed May 25, 2026, 4:40 AM ET / 08:40 UTC.

Summary
The branch adds a bounded coordinator-backed one-shot run retry that releases a stale lease, acquires one replacement, resets run state and telemetry, retries sync, and documents/tests the behavior.

Reproducibility: no. live reproduction was established in this review. Source inspection shows current main returns the before-command SSH wait error directly, and the PR adds the replacement branch for that failure path.

Review metrics: 1 noteworthy metric.

  • Files changed: 6 files, 185 additions, 3 deletions. The PR touches the run command lease lifecycle plus tests, docs, and changelog, so live behavior proof matters more than diff size alone.

Merge readiness
Overall: 🧂 unranked krab
Proof: 🧂 unranked krab
Patch quality: 🐚 platinum hermit
Result: blocked until real behavior proof is added.

Overall follows the weaker of proof and patch quality, so missing proof can cap an otherwise strong patch.

Rank-up moves:

  • Add redacted live output or logs showing stale lease release, replacement lease acquisition, resync, and command start; update the PR body so ClawSweeper can re-review automatically.

Proof guidance:
Needs real behavior proof before merge: The PR body lists tests and autoreview only; it needs redacted terminal output, logs, screenshot, recording, or linked artifact showing the after-fix live coordinator replacement path before merge. After adding proof, update the PR body; ClawSweeper should re-review automatically. If it does not, the PR author or someone with repository write access can comment @clawsweeper re-review.

Risk before merge

  • The merge result changes coordinator lease release/acquire behavior in a live failure branch; unit tests cover the predicate and telemetry reset, but not an actual coordinator-backed replacement run.

Maintainer options:

  1. Require live coordinator replacement proof (recommended)
    Have the contributor update the PR body with redacted terminal output or logs showing an after-sync before-command SSH drop, stale lease release, one replacement acquire, resync, and command start.
  2. Accept unit-test-only coverage
    Maintainers can merge if they are comfortable that the source-level guards cover a live coordinator failure path that was not exercised in proof.

Next step before merge
Human review should wait on contributor-supplied real behavior proof; there is no narrow code repair for automation to perform.

Security
Cleared: No concrete security or supply-chain issue was found; the diff uses existing coordinator lease APIs and does not add dependencies, workflows, permissions, or secret handling.

Review details

Best possible solution:

Land the bounded retry after redacted live proof shows stale lease release, one replacement acquire, resync, and command start while preserving the documented exclusions for explicit, kept, no-sync, sync-only, and custom-slug runs.

Do we have a high-confidence way to reproduce the issue?

No live reproduction was established in this review. Source inspection shows current main returns the before-command SSH wait error directly, and the PR adds the replacement branch for that failure path.

Is this the best way to solve the issue?

Yes, the bounded one-shot replacement is a narrow maintainable fix for this failure mode. The remaining blocker is proof that the live coordinator path behaves as intended, not a different implementation direction.

AGENTS.md: found and applied where relevant.

Codex review notes: model gpt-5.5, reasoning high; reviewed against 388c99fe94e7.

Label changes

Label changes:

  • add P2: This is a bounded runtime reliability fix for coordinator-backed one-shot runs with limited blast radius.
  • add merge-risk: 🚨 availability: The new retry path releases and replaces live coordinator leases, so an unproven failure in that path could affect command availability.
  • add rating: 🧂 unranked krab: Overall readiness is 🧂 unranked krab; proof is 🧂 unranked krab and patch quality is 🐚 platinum hermit.
  • add status: 📣 needs proof: The PR needs real behavior proof before ClawSweeper can clear the contributor ask. Needs real behavior proof before merge: The PR body lists tests and autoreview only; it needs redacted terminal output, logs, screenshot, recording, or linked artifact showing the after-fix live coordinator replacement path before merge. After adding proof, update the PR body; ClawSweeper should re-review automatically. If it does not, the PR author or someone with repository write access can comment @clawsweeper re-review.

Label justifications:

  • P2: This is a bounded runtime reliability fix for coordinator-backed one-shot runs with limited blast radius.
  • merge-risk: 🚨 availability: The new retry path releases and replaces live coordinator leases, so an unproven failure in that path could affect command availability.
  • rating: 🧂 unranked krab: Overall readiness is 🧂 unranked krab; proof is 🧂 unranked krab and patch quality is 🐚 platinum hermit.
  • status: 📣 needs proof: The PR needs real behavior proof before ClawSweeper can clear the contributor ask. Needs real behavior proof before merge: The PR body lists tests and autoreview only; it needs redacted terminal output, logs, screenshot, recording, or linked artifact showing the after-fix live coordinator replacement path before merge. After adding proof, update the PR body; ClawSweeper should re-review automatically. If it does not, the PR author or someone with repository write access can comment @clawsweeper re-review.
Evidence reviewed

What I checked:

  • Current main failure path: Current main waits for SSH before the command and returns the wait error directly; there is no lease replacement/retry branch in that path. (internal/cli/run.go:899, 388c99fe94e7)
  • Proposed replacement path: The PR adds one guarded replacement attempt, stops heartbeat/telemetry, releases the stale lease, acquires a replacement, resets run state, and jumps back to sync. (internal/cli/run.go:696, fc6d61b4f37a)
  • Retry hook: The before-command SSH wait now invokes the replacement helper and retries sync only when the helper reports a replacement. (internal/cli/run.go:985, fc6d61b4f37a)
  • Regression coverage: The PR adds table coverage for the retry predicate and coverage that telemetry reset stops the sampler and clears captured telemetry. (internal/cli/run_test.go:198, fc6d61b4f37a)
  • Documentation coverage: The run command docs describe the bounded retry and excluded modes, and the changelog records the user-visible fix. (docs/commands/run.md:38, fc6d61b4f37a)
  • Merge result check: The synthetic merge contains the same six-file diff and git diff --check reported no whitespace errors. (4de46a767c8b)

Likely related people:

  • steipete: Git blame and git log show the current run command, coordinator lease backend, stop-after policy, and before-command SSH path primarily date to commits by Peter Steinberger; the PR author also has merged history in this area beyond this proposal. (role: feature-history owner; confidence: high; commits: e1f2f9317a65, c3570dce8655, 68b6a594d14b; files: internal/cli/run.go, internal/cli/provider_coordinator.go, internal/cli/run_recorder.go)
  • obviyus: Current main includes adjacent coordinator-backed AWS SSH ingress work preserving active lease source CIDRs, which is related to the live coordinator SSH availability surface this PR affects. (role: adjacent provider-access contributor; confidence: medium; commits: 388c99fe94e7; files: worker/src/aws.ts, worker/src/fleet.ts, docs/features/network.md)
What the crustacean ranks mean
  • 🦀 challenger crab: rare, exceptional readiness with strong proof, clean implementation, and convincing validation.
  • 🦞 diamond lobster: very strong readiness with only minor maintainer review expected.
  • 🐚 platinum hermit: good normal PR, likely mergeable with ordinary maintainer review.
  • 🦐 gold shrimp: useful signal, but proof or patch confidence is still limited.
  • 🦪 silver shellfish: thin signal; proof, validation, or implementation needs work.
  • 🧂 unranked krab: not merge-ready because proof is missing/unusable or there are serious correctness or safety concerns.
  • 🌊 off-meta tidepool: rating does not apply to this item.

Shiny media proof means a screenshot, video, or linked artifact directly shows the changed behavior. Runtime, network, CSP, and security claims still need visible diagnostics.

How this review workflow works
  • ClawSweeper keeps one durable marker-backed review comment per issue or PR.
  • Re-runs edit this comment so the latest verdict, findings, and automation markers stay together instead of adding duplicate bot comments.
  • A fresh review can be triggered by eligible @clawsweeper re-review comments, exact-item GitHub events, scheduled/background review runs, or manual workflow dispatch.
  • PR/issue authors and users with repository write access can comment @clawsweeper re-review or @clawsweeper re-run on an open PR or issue to request a fresh review only.
  • Maintainers can also comment @clawsweeper review to request a fresh review only.
  • Fresh-review commands do not start repair, autofix, rebase, CI repair, or automerge.
  • Maintainer-only repair and merge flows require explicit commands such as @clawsweeper autofix, @clawsweeper automerge, @clawsweeper fix ci, or @clawsweeper address review.
  • Maintainers can comment @clawsweeper explain to ask for more context, or @clawsweeper stop to stop active automation.

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: fc6d61b4f3

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread internal/cli/run.go
strings.TrimSpace(requestedSlug) != "" {
return false
}
return shouldReleaseRunLease(acquired, keep, keepOnFailure, stopAfter, err)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Skip replacement when keep-on-failure is requested

shouldReplaceLeaseAfterBeforeCommandSSHFailure currently delegates to shouldReleaseRunLease, which makes replacement depend on --stop-after; as a result, runs with --keep-on-failure still replace the lease when --stop-after=always or --stop-after=failure is set. That contradicts the new run behavior contract for keep-on-failure runs and can unexpectedly discard the original failing environment (and trigger an extra acquire/sync cycle) exactly when users asked to preserve failure state.

Useful? React with 👍 / 👎.

@clawsweeper clawsweeper Bot added rating: 🧂 unranked krab Not merge-ready due to missing proof or serious correctness/safety concerns. status: 📣 needs proof The PR needs real behavior proof before ClawSweeper can clear the contributor ask. labels May 25, 2026
@clawsweeper
Copy link
Copy Markdown

clawsweeper Bot commented May 25, 2026

ClawSweeper PR egg

🎁 Pass real behavior proof to wake the egg and unlock a hatchable treat.

Where did the egg go?
  • The egg game starts only after the PR passes the real-behavior proof check.
  • Before that, no creature or rarity is rolled. The treat waits for real proof.
  • This is still just collectible flavor: proof affects review readiness, not creature quality.

@steipete steipete merged commit 8d32442 into main May 25, 2026
11 checks passed
@steipete steipete deleted the fix/aws-before-command-lease-retry branch May 25, 2026 08:49
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

rating: 🧂 unranked krab Not merge-ready due to missing proof or serious correctness/safety concerns. status: 📣 needs proof The PR needs real behavior proof before ClawSweeper can clear the contributor ask.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant