Harden autocert controller against ACME failure modes by phinze · Pull Request #730 · mirendev/runtime

phinze · 2026-04-06T14:57:24Z

I exercised the new failure handling during a move of my blog across domains, where I added the route for the new domain before DNS had fully propagated. That's how I discovered all these issues. Glad to be improving this stuff because "DNS not quite right yet" is going to be a really common setup scenario.

The initial HTTP-01 challenge failed (as expected, DNS wasn't ready), but then every subsequent TLS handshake spawned a new ACME attempt because the timeout codepath in GetCertificate never recorded a failure, so the cooldown mechanism never activated. Let's Encrypt's 5-failed-authorizations-per-hour limit was exhausted in about 9 minutes. Reconcile compounded things by having no timeout (blocking for autocert.Manager's full 5-minute internal deadline, which also prevented graceful shutdown) and no cooldown check of its own.

There was also a subtler issue where the synthetic ClientHello used for eager provisioning had nil cipher suite fields. autocert.Manager's supportsECDSA check treats that as "no ECDSA support" and provisions an RSA-only cert. Browsers connect with full ECDSA-capable hellos, find no matching cert in cache, and fall through to inline ACME provisioning. So even after eager provisioning reported success, the site would still show a self-signed fallback.

The core changes: Reconcile checks the failure cooldown before attempting ACME and wraps the call in a 30-second timeout. GetCertificate now records a failure on timeout (not just on error), so the cooldown actually kicks in. The synthetic ClientHello includes realistic TLS parameters with ECDSA support. And failure tracking now honors server-sent Retry-After headers via acme.RateLimit(), with the default cooldown dropped from 5 minutes to 1 minute (matching autocert's internal createCertRetryAfter) so users who fix the underlying problem get faster feedback.

We also investigated the goroutine leak from timed-out calls into autocert.Manager. The Manager's GetCertificate API doesn't accept a context, so orphaned goroutines can't be cancelled. But with the cooldown fix, the leak rate is bounded to at most one goroutine per domain per cooldown period, which is acceptable.

Closes MIR-971

A real-world incident exposed several gaps in the autocert controller's failure handling. When an ACME attempt failed, every subsequent TLS handshake would spawn a new goroutine into autocert.Manager because the timeout path in GetCertificate never recorded a failure, so the cooldown never kicked in. Meanwhile, Reconcile had no timeout at all (blocking up to 5 minutes on the Manager's internal deadline, which also wedged graceful shutdown) and no cooldown check (so every resync fired a fresh ACME attempt even while rate-limited). On top of that, the synthetic ClientHello used for eager provisioning had empty cipher suite fields, which caused autocert's supportsECDSA check to return false. The Manager would provision an RSA cert while browsers prefer ECDSA, so the eagerly-provisioned cert went unused and browsers fell through to inline provisioning anyway. The fixes: Reconcile now checks the failure cooldown before attempting ACME and wraps the Manager call in a 30-second timeout. GetCertificate records a failure on timeout so the cooldown actually activates. The synthetic ClientHello includes realistic TLS 1.3/1.2 parameters with ECDSA cipher suites so eager provisioning gets the cert type browsers want. Failure tracking now honors ACME Retry-After headers via acme.RateLimit, and the default cooldown drops from 5 minutes to 1 minute (matching the Manager's internal createCertRetryAfter) so users who fix the underlying problem don't have to wait as long.

coderabbitai · 2026-04-06T14:57:41Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 42e197b2-727f-467d-bdfe-7820cb7fc2be

📥 Commits

Reviewing files that changed from the base of the PR and between f1047a3 and c750e20.

📒 Files selected for processing (2)

controllers/certificate/autocert_controller.go
controllers/certificate/autocert_controller_test.go

📝 Walkthrough

Walkthrough

The changes implement ACME failure tracking with per-domain cooldown metadata in the autocert controller. The failures map values were restructured from time.Time to an acmeFailure type storing both failure timestamp and cooldown duration. New helper functions recordFailure and inCooldown were introduced. Both Reconcile and GetCertificate now check cooldown status before attempting ACME provisioning. Eager provisioning in Reconcile was modified to execute GetCertificate in a goroutine with a 30-second timeout. The synthetic ClientHelloInfo was expanded with additional handshake details. The default cooldown was reduced from 5 minutes to 1 minute, with logic to honor ACME-provided Retry-After headers. Four new tests were added covering cooldown behavior and failure recording.

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

phinze requested a review from a team as a code owner April 6, 2026 14:57

evanphx approved these changes Apr 7, 2026

View reviewed changes

phinze merged commit 37b8040 into main Apr 7, 2026
11 checks passed

phinze deleted the phinze/mir-971-harden-autocert-controller-against-acme-failure-modes branch April 7, 2026 19:22

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Harden autocert controller against ACME failure modes#730

Harden autocert controller against ACME failure modes#730
phinze merged 1 commit into
mainfrom
phinze/mir-971-harden-autocert-controller-against-acme-failure-modes

phinze commented Apr 6, 2026

Uh oh!

coderabbitai Bot commented Apr 6, 2026 •

edited

Loading

Walkthrough

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

phinze commented Apr 6, 2026

Uh oh!

coderabbitai Bot commented Apr 6, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

coderabbitai Bot commented Apr 6, 2026 •

edited

Loading