Skip to content

fix(worker): roll back partial Hetzner provisioning#783

Open
steipete wants to merge 3 commits into
mainfrom
codex/rollback-hetzner-partial-create
Open

fix(worker): roll back partial Hetzner provisioning#783
steipete wants to merge 3 commits into
mainfrom
codex/rollback-hetzner-partial-create

Conversation

@steipete

@steipete steipete commented Jul 2, 2026

Copy link
Copy Markdown
Contributor

Summary

  • roll back a Hetzner server when post-create readiness fails, treating already-absent resources as successfully cleaned up
  • continue through retryable server-type fallbacks after a successful rollback
  • delete only newly created canonical lease-owned SSH keys; retain shared workspace, reused, and custom keys
  • persist exact key cleanup work when immediate deletion fails, retry it with coordinator backoff, and finish it before workspace reprovisioning
  • persist the exact server and key IDs when rollback deletion fails, defer cleanup without hot-looping, and retry server-before-key so an attached key is never removed first

Closes #761

Verification

Exact head: 732a909fcd4e5495cb47c6fdf61fdd2fdc4d11d6 (rebased onto the 0.34.1 development head). Exact-head CI is terminal green: https://github.com/openclaw/crabbox/actions/runs/28615334393

  • npm test --prefix worker — 762 tests passed after rebasing onto current main
  • npm run format:check --prefix worker
  • npm run lint --prefix worker
  • npm run check --prefix worker
  • npm run build --prefix worker
  • scripts/check-docs.sh — 51 command docs, 69 providers, 202 Markdown files, and the generated docs site
  • whole-branch and focused autoreview rounds; final review reported no actionable findings

No new live Hetzner canary was run because the credential session remains locked. The regression suite exercises successful rollback, server and key 404 idempotency, fallback continuation, durable server/key ID recovery after rollback failure, server-before-key cleanup, exact-ID persistence/retry, retry backoff and early-alarm suppression, and shared/reused-key retention.

@clawsweeper

clawsweeper Bot commented Jul 2, 2026

Copy link
Copy Markdown
Contributor

Codex review: needs real behavior proof before merge. Reviewed July 2, 2026, 3:40 PM ET / 19:40 UTC.

Summary
The branch updates Worker Hetzner provisioning and fleet cleanup to roll back or persist failed server/key cleanup IDs, adds LeaseRecord fields and Worker tests, and edits CHANGELOG.md.

Reproducibility: yes. Source inspection of the PR head shows a delete=false release can leave providerKeyCleanupPending in the stored lease, and the alarm predicate later treats that ended lease as cleanup-needed.

Review metrics: 2 noteworthy metrics.

  • Diff surface: 5 files, +780/-29. The functional change spans provider rollback, fleet cleanup metadata, LeaseRecord typing, tests, and release notes.
  • Mocked test surface: worker/test/fleet.ts +603/-0. The branch adds substantial regression coverage, but it does not replace the missing live provider proof.

Root-cause cluster
Relationship: fixed_by_candidate
Canonical: #761
Summary: This PR is a candidate fix for the linked brokered Hetzner partial-provisioning cleanup issue; the older overlapping PR remains open but is conflicting and not proof-positive.

Members:

Proposal only: this assessment does not dispatch repair, suppress jobs, mutate sibling items, close, or merge anything.

Merge readiness
Overall: 🧂 unranked krab
Proof: 🧂 unranked krab
Patch quality: 🧂 unranked krab
Result: blocked until real behavior proof from a real setup is added.

Overall follows the weaker of proof and patch quality, so missing proof can cap an otherwise strong patch.

Rank-up moves:

  • [P1] Fix delete=false release handling for pending Hetzner cleanup.
  • [P1] Add redacted current-head live Hetzner terminal/log proof for forced readiness failure, cleanup or retry, SSH-key behavior, and clean inventory.
  • Remove the release-owned changelog entry.

Proof guidance:

  • [P1] Needs real behavior proof before merge: The PR body lists tests and green CI but says no live Hetzner canary was run; contributor action is needed to add redacted terminal/log proof of the real rollback and inventory path before merge. After adding proof, update the PR body; ClawSweeper should re-review automatically. If it does not, the PR author or someone with repository write access can comment @clawsweeper re-review.

Risk before merge

  • [P1] A failed Hetzner lease with providerKeyCleanupPending can be released with delete=false but later alarm cleanup still deletes the server/key, violating the retention request.
  • [P1] No current-head live Hetzner proof exercises actual server create, forced readiness failure, rollback or retry, SSH-key cleanup, and clean inventory.
  • [P1] The overlapping fix: roll back failed Hetzner readiness #768 remains open, conflicting, and not proof-positive, so maintainers should consolidate on one landing path.

Maintainer options:

  1. Fix retention and prove live rollback (recommended)
    Clear or suppress pending provider cleanup for delete=false releases, then add redacted current-head Hetzner terminal/log proof for forced readiness failure, cleanup or retry, SSH-key behavior, and clean inventory.
  2. Accept mocked validation after code fix
    Maintainers could accept the mocked regression suite after the retention bug is fixed, but that intentionally leaves real Hetzner API cleanup and final inventory unproven.
  3. Pause and consolidate candidates
    Keep this branch paused while choosing it or fix: roll back failed Hetzner readiness #768 as the single landing path for [security] Brokered Hetzner partial provisioning can leave a billable server outside cleanup #761.

Next step before merge

  • [P1] Human follow-up is needed because live Hetzner proof cannot be supplied by automation, and the retention behavior must be fixed before merge.

Security
Needs attention: The diff changes security-sensitive provider resource deletion behavior and still has a concrete retained-resource deletion bug.

Review findings

  • [P1] Honor no-delete releases during pending cleanup — worker/src/fleet.ts:14938
  • [P3] Remove the release-owned changelog entry — CHANGELOG.md:7
Review details

Best possible solution:

Preserve no-delete releases as a hard stop for failed Hetzner cleanup while keeping exact-ID rollback/retry cleanup for normal delete paths, remove changelog churn, and require redacted live Hetzner proof before merge.

Do we have a high-confidence way to reproduce the issue?

Yes. Source inspection of the PR head shows a delete=false release can leave providerKeyCleanupPending in the stored lease, and the alarm predicate later treats that ended lease as cleanup-needed.

Is this the best way to solve the issue?

No. The rollback/retry direction is narrow, but this implementation is not the best landing path until it honors no-delete release semantics and real provider proof confirms actual Hetzner cleanup behavior.

Full review comments:

  • [P1] Honor no-delete releases during pending cleanup — worker/src/fleet.ts:14938
    When a failed Hetzner lease with providerKeyCleanupPending is released with deleteServer=false, this path can still make the ended lease cleanup-needed later. The alarm cleanup then calls the Hetzner provider and can delete the server/key the user asked to retain.
    Confidence: 0.92
  • [P3] Remove the release-owned changelog entry — CHANGELOG.md:7
    Repository review policy treats CHANGELOG.md as release-owned for normal PRs. Keep this release-note context in the PR body or commit message and leave the changelog edit for the release process.
    Confidence: 0.9

Overall correctness: patch is incorrect
Overall confidence: 0.9

AGENTS.md: found and applied where relevant.

Codex review notes: model internal, reasoning high; reviewed against ab1c29f3fd40.

Label changes

Label justifications:

  • P2: The PR targets a bounded but security-sensitive Hetzner provider lifecycle cleanup bug with limited provider-specific blast radius.
  • merge-risk: 🚨 compatibility: The patch can violate existing delete=false release behavior for failed Hetzner leases with pending cleanup.
  • merge-risk: 🚨 security-boundary: The diff changes brokered cloud server and SSH-key deletion behavior around provider resources and credential material.
  • rating: 🧂 unranked krab: Overall readiness is 🧂 unranked krab; proof is 🧂 unranked krab and patch quality is 🧂 unranked krab.
  • status: 📣 needs proof: The PR needs real behavior proof before ClawSweeper can clear the contributor ask. Needs real behavior proof before merge: The PR body lists tests and green CI but says no live Hetzner canary was run; contributor action is needed to add redacted terminal/log proof of the real rollback and inventory path before merge. After adding proof, update the PR body; ClawSweeper should re-review automatically. If it does not, the PR author or someone with repository write access can comment @clawsweeper re-review.
Evidence reviewed

Security concerns:

  • [medium] Retained Hetzner resources can still be deleted — worker/src/fleet.ts:14938
    A failed lease released with no deletion requested can keep pending cleanup metadata, allowing alarm cleanup to delete the persisted Hetzner server and SSH key despite the retention request.
    Confidence: 0.9

What I checked:

  • AGENTS.md policy read: Repository policy was read fully and applies here because the PR touches Worker provider lifecycle code, security-sensitive provider cleanup, and release-owned changelog content. (AGENTS.md:1, ab1c29f3fd40)
  • PR proof gap: The PR body lists mocked Worker tests and green CI but explicitly says no new live Hetzner canary was run because the credential session remains locked. (732a909fcd4e)
  • PR persists failed Hetzner cleanup IDs: The PR head records failed Hetzner server IDs and provider key cleanup IDs on failed leases, scheduling cleanup retry when either is present. (worker/src/fleet.ts:2282, 732a909fcd4e)
  • No-delete retention regression: On the PR head, delete=false release can finalize without clearing providerKeyCleanupPending, while the new cleanup predicate later treats providerKeyCleanupPending as cleanup-needed. (worker/src/fleet.ts:14938, 732a909fcd4e)
  • Release-owned changelog edit: The PR adds an Unreleased changelog entry even though this review lane treats CHANGELOG.md as release-owned for normal PRs. (CHANGELOG.md:7, 732a909fcd4e)
  • Related work search: The linked issue remains open and the older overlapping rollback PR remains open but conflicting, so this PR is a candidate fix rather than a superseded close target.

Likely related people:

  • steipete: Current-main blame and log history for Hetzner provisioning, release cleanup, and Worker fleet lifecycle paths repeatedly point to recent commits by this user. (role: recent area contributor; confidence: high; commits: ef898c27570a, 9e208c80cd1a, 16fa99f89f60; files: worker/src/hetzner.ts, worker/src/fleet.ts, worker/test/fleet.test.ts)
  • vincentkoc: Shortlog and GitHub commit metadata show substantial adjacent Worker fleet/provider/test history, including workspace and cleanup-adjacent commits in the same files. (role: heavy adjacent contributor; confidence: medium; commits: 4fccd01713b3, 938df1d9dfd5, ddeacd6a7fe4; files: worker/src/fleet.ts, worker/src/hetzner.ts, worker/test/fleet.test.ts)
What the crustacean ranks mean
  • 🦀 challenger crab: rare, exceptional readiness with strong proof, clean implementation, and convincing validation.
  • 🦞 diamond lobster: very strong readiness with only minor maintainer review expected.
  • 🐚 platinum hermit: good normal PR, likely mergeable with ordinary maintainer review.
  • 🦐 gold shrimp: useful signal, but proof or patch confidence is still limited.
  • 🦪 silver shellfish: thin signal; proof, validation, or implementation needs work.
  • 🧂 unranked krab: not merge-ready because proof is missing/unusable or there are serious correctness or safety concerns.
  • 🌊 off-meta tidepool: rating does not apply to this item.

Shiny media proof means a screenshot, video, or linked artifact directly shows the changed behavior. Runtime, network, CSP, and security claims still need visible diagnostics.

How this review workflow works
  • ClawSweeper keeps one durable marker-backed review comment per issue or PR.
  • Re-runs edit this comment so the latest verdict, findings, and automation markers stay together instead of adding duplicate bot comments.
  • A fresh review can be triggered by eligible @clawsweeper re-review comments, exact-item GitHub events, scheduled/background review runs, or manual workflow dispatch.
  • PR/issue authors and users with repository write access can comment @clawsweeper re-review or @clawsweeper re-run on an open PR or issue to request a fresh review only.
  • Maintainers can also comment @clawsweeper review to request a fresh review only.
  • Fresh-review commands do not start repair, autofix, rebase, CI repair, or automerge.
  • Maintainer-only repair and merge flows require explicit commands such as @clawsweeper autofix, @clawsweeper automerge, @clawsweeper fix ci, or @clawsweeper address review.
  • Maintainers can comment @clawsweeper explain to ask for more context, or @clawsweeper stop to stop active automation.

@clawsweeper clawsweeper Bot added rating: 🧂 unranked krab Not merge-ready due to missing proof or serious correctness/safety concerns. status: 📣 needs proof The PR needs real behavior proof before ClawSweeper can clear the contributor ask. P2 Normal priority bug or improvement with limited blast radius. merge-risk: 🚨 security-boundary 🚨 Merging this PR could weaken sandboxing, authorization, credentials, or sensitive data. labels Jul 2, 2026
@steipete steipete force-pushed the codex/rollback-hetzner-partial-create branch from 7f59c64 to ff66ca2 Compare July 2, 2026 17:22

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: ff66ca2444

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread worker/src/fleet.ts
Comment on lines +14929 to +14931
!leaseIsLive(lease) &&
((lease.cloudID && (lease.cleanupError || lease.cleanupStartedAt)) ||
lease.providerKeyCleanupPending),

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Honor no-delete releases with pending Hetzner cleanup

When a partial Hetzner create persists providerKeyCleanupPending together with a serverID/cloudID, the owner can still release the failed lease with {"delete": false}. That path marks releaseDeletesServer=false but leaves providerKeyCleanupPending, so this predicate schedules background cleanup anyway; HetznerProvider.releaseLease then deletes the server before deleting the SSH key. This violates the no-delete release request and can remove a server the user explicitly chose to retain.

Useful? React with 👍 / 👎.

@steipete steipete force-pushed the codex/rollback-hetzner-partial-create branch from ff66ca2 to 7106f10 Compare July 2, 2026 18:05
@clawsweeper clawsweeper Bot added the merge-risk: 🚨 compatibility 🚨 Merging this PR could break existing users, config, migrations, defaults, or upgrades. label Jul 2, 2026
@steipete steipete force-pushed the codex/rollback-hetzner-partial-create branch from 7106f10 to 732a909 Compare July 2, 2026 19:12
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

merge-risk: 🚨 compatibility 🚨 Merging this PR could break existing users, config, migrations, defaults, or upgrades. merge-risk: 🚨 security-boundary 🚨 Merging this PR could weaken sandboxing, authorization, credentials, or sensitive data. P2 Normal priority bug or improvement with limited blast radius. rating: 🧂 unranked krab Not merge-ready due to missing proof or serious correctness/safety concerns. status: 📣 needs proof The PR needs real behavior proof before ClawSweeper can clear the contributor ask.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[security] Brokered Hetzner partial provisioning can leave a billable server outside cleanup

1 participant