Skip to content

Harden autocert controller against ACME failure modes#730

Merged
phinze merged 1 commit into
mainfrom
phinze/mir-971-harden-autocert-controller-against-acme-failure-modes
Apr 7, 2026
Merged

Harden autocert controller against ACME failure modes#730
phinze merged 1 commit into
mainfrom
phinze/mir-971-harden-autocert-controller-against-acme-failure-modes

Conversation

@phinze
Copy link
Copy Markdown
Contributor

@phinze phinze commented Apr 6, 2026

I exercised the new failure handling during a move of my blog across domains, where I added the route for the new domain before DNS had fully propagated. That's how I discovered all these issues. Glad to be improving this stuff because "DNS not quite right yet" is going to be a really common setup scenario.

The initial HTTP-01 challenge failed (as expected, DNS wasn't ready), but then every subsequent TLS handshake spawned a new ACME attempt because the timeout codepath in GetCertificate never recorded a failure, so the cooldown mechanism never activated. Let's Encrypt's 5-failed-authorizations-per-hour limit was exhausted in about 9 minutes. Reconcile compounded things by having no timeout (blocking for autocert.Manager's full 5-minute internal deadline, which also prevented graceful shutdown) and no cooldown check of its own.

There was also a subtler issue where the synthetic ClientHello used for eager provisioning had nil cipher suite fields. autocert.Manager's supportsECDSA check treats that as "no ECDSA support" and provisions an RSA-only cert. Browsers connect with full ECDSA-capable hellos, find no matching cert in cache, and fall through to inline ACME provisioning. So even after eager provisioning reported success, the site would still show a self-signed fallback.

The core changes: Reconcile checks the failure cooldown before attempting ACME and wraps the call in a 30-second timeout. GetCertificate now records a failure on timeout (not just on error), so the cooldown actually kicks in. The synthetic ClientHello includes realistic TLS parameters with ECDSA support. And failure tracking now honors server-sent Retry-After headers via acme.RateLimit(), with the default cooldown dropped from 5 minutes to 1 minute (matching autocert's internal createCertRetryAfter) so users who fix the underlying problem get faster feedback.

We also investigated the goroutine leak from timed-out calls into autocert.Manager. The Manager's GetCertificate API doesn't accept a context, so orphaned goroutines can't be cancelled. But with the cooldown fix, the leak rate is bounded to at most one goroutine per domain per cooldown period, which is acceptable.

Closes MIR-971

A real-world incident exposed several gaps in the autocert controller's
failure handling. When an ACME attempt failed, every subsequent TLS
handshake would spawn a new goroutine into autocert.Manager because
the timeout path in GetCertificate never recorded a failure, so the
cooldown never kicked in. Meanwhile, Reconcile had no timeout at all
(blocking up to 5 minutes on the Manager's internal deadline, which
also wedged graceful shutdown) and no cooldown check (so every resync
fired a fresh ACME attempt even while rate-limited).

On top of that, the synthetic ClientHello used for eager provisioning
had empty cipher suite fields, which caused autocert's supportsECDSA
check to return false. The Manager would provision an RSA cert while
browsers prefer ECDSA, so the eagerly-provisioned cert went unused and
browsers fell through to inline provisioning anyway.

The fixes: Reconcile now checks the failure cooldown before attempting
ACME and wraps the Manager call in a 30-second timeout. GetCertificate
records a failure on timeout so the cooldown actually activates. The
synthetic ClientHello includes realistic TLS 1.3/1.2 parameters with
ECDSA cipher suites so eager provisioning gets the cert type browsers
want. Failure tracking now honors ACME Retry-After headers via
acme.RateLimit, and the default cooldown drops from 5 minutes to 1
minute (matching the Manager's internal createCertRetryAfter) so users
who fix the underlying problem don't have to wait as long.
@phinze phinze requested a review from a team as a code owner April 6, 2026 14:57
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Apr 6, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 42e197b2-727f-467d-bdfe-7820cb7fc2be

📥 Commits

Reviewing files that changed from the base of the PR and between f1047a3 and c750e20.

📒 Files selected for processing (2)
  • controllers/certificate/autocert_controller.go
  • controllers/certificate/autocert_controller_test.go

📝 Walkthrough

Walkthrough

The changes implement ACME failure tracking with per-domain cooldown metadata in the autocert controller. The failures map values were restructured from time.Time to an acmeFailure type storing both failure timestamp and cooldown duration. New helper functions recordFailure and inCooldown were introduced. Both Reconcile and GetCertificate now check cooldown status before attempting ACME provisioning. Eager provisioning in Reconcile was modified to execute GetCertificate in a goroutine with a 30-second timeout. The synthetic ClientHelloInfo was expanded with additional handshake details. The default cooldown was reduced from 5 minutes to 1 minute, with logic to honor ACME-provided Retry-After headers. Four new tests were added covering cooldown behavior and failure recording.


Comment @coderabbitai help to get the list of available commands and usage tips.

@phinze phinze merged commit 37b8040 into main Apr 7, 2026
11 checks passed
@phinze phinze deleted the phinze/mir-971-harden-autocert-controller-against-acme-failure-modes branch April 7, 2026 19:22
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants