Harden autocert controller against ACME failure modes #730
A real-world incident exposed several gaps in the autocert controller's failure handling. When an ACME attempt failed, every subsequent TLS handshake would spawn a new goroutine into autocert.Manager because the timeout path in GetCertificate never recorded a failure, so the cooldown never kicked in. Meanwhile, Reconcile had no timeout at all (blocking up to 5 minutes on the Manager's internal deadline, which also wedged graceful shutdown) and no cooldown check (so every resync fired a fresh ACME attempt even while rate-limited). On top of that, the synthetic ClientHello used for eager provisioning had empty cipher suite fields, which caused autocert's supportsECDSA check to return false. The Manager would provision an RSA cert while browsers prefer ECDSA, so the eagerly-provisioned cert went unused and browsers fell through to inline provisioning anyway.
The fixes: Reconcile now checks the failure cooldown before attempting ACME and wraps the Manager call in a 30-second timeout. GetCertificate records a failure on timeout so the cooldown actually activates. The synthetic ClientHello includes realistic TLS 1.3/1.2 parameters with ECDSA cipher suites so eager provisioning gets the cert type browsers want. Failure tracking now honors ACME Retry-After headers via acme.RateLimit, and the default cooldown drops from 5 minutes to 1 minute (matching the Manager's internal createCertRetryAfter) so users who fix the underlying problem don't have to wait as long.
I exercised the new failure handling during a move of my blog across domains, where I added the route for the new domain before DNS had fully propagated. That's how I discovered all these issues. Glad to be improving this stuff because "DNS not quite right yet" is going to be a really common setup scenario.
The initial HTTP-01 challenge failed (as expected, DNS wasn't ready), but then every subsequent TLS handshake spawned a new ACME attempt because the timeout codepath in GetCertificate never recorded a failure, so the cooldown mechanism never activated. Let's Encrypt's 5-failed-authorizations-per-hour limit was exhausted in about 9 minutes. Reconcile compounded things by having no timeout (blocking for autocert.Manager's full 5-minute internal deadline, which also prevented graceful shutdown) and no cooldown check of its own.
There was also a subtler issue where the synthetic ClientHello used for eager provisioning had nil cipher suite fields. autocert.Manager's `supportsECDSA` check treats that as "no ECDSA support" and provisions an RSA-only cert. Browsers connect with full ECDSA-capable hellos, find no matching cert in cache, and fall through to inline ACME provisioning. So even after eager provisioning reported success, the site would still show a self-signed fallback.
The core changes: Reconcile checks the failure cooldown before attempting ACME and wraps the call in a 30-second timeout. GetCertificate now records a failure on timeout (not just on error), so the cooldown actually kicks in. The synthetic ClientHello includes realistic TLS parameters with ECDSA support. And failure tracking now honors server-sent Retry-After headers via `acme.RateLimit()`, with the default cooldown dropped from 5 minutes to 1 minute (matching autocert's internal `createCertRetryAfter`) so users who fix the underlying problem get faster feedback.
We also investigated the goroutine leak from timed-out calls into `autocert.Manager`. The Manager's `GetCertificate` API doesn't accept a context, so orphaned goroutines can't be cancelled. But with the cooldown fix, the leak rate is bounded to at most one goroutine per domain per cooldown period, which is acceptable.
Closes MIR-971