If one of the domains in an ingress fails reachability, kube-lego should not try to authorize any of the domains #330

ftab · 2018-05-08T18:27:42Z

I had a power outage on Saturday that kicked off an issue I still can't fully explain. The hostAliases I had defined in my kube-lego deployment, which had been working before, were suddenly ignored. As a result, one of the two domains in an ingress (ourdomain.com) wasn't reachable, while the other (www.ourdomain.com) did resolve to the correct IP. See the footnote* for more info. Because ourdomain.com failed the reachability test, kube-lego proceeded to get a certificate for just www.ourdomain.com. Because that certificate didn't cover all of the domains in the ingress, kube-lego kept requesting a new one, lather-rinse-repeat: each time it couldn't reach the fake ACME challenge at ourdomain.com, requested a new certificate for just www.ourdomain.com, and so on.

I got rate limited and then IP blocked before I was even woken by the alert emails that tell me when the site goes down. While I try to figure out how to get ourdomain.com to resolve to the correct IP again, I would also like to learn how to prevent kube-lego from getting into that kind of loop in the future. Hopefully I won't have to bug the Let's Encrypt admins more than once, even if I do run into a DNS problem again.

It doesn't make sense to me that kube-lego would ever request a certificate when the reachability test fails for any of the domains in that certificate. Unless I'm missing something, such a situation indicates that the ingress was configured for the wrong domain, that there is a DNS resolution issue (as in my case), or something else similarly fatal. If I'm wrong and there is a good reason for it to go ahead and get a certificate even if it can't reach all of the domains, then at the very least it shouldn't try to re-obtain the certificate until all domains pass the reachability test.
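To make the proposal above concrete, here is a minimal Go sketch of an all-or-nothing gate. It is not kube-lego's actual code: `reachabilityCheck`, `unreachableDomains`, and `shouldRequestCertificate` are hypothetical names, and the check function is only a stand-in for whatever reachability test kube-lego really performs.

```go
package main

import "fmt"

// reachabilityCheck is a hypothetical stand-in for kube-lego's
// reachability test; it reports whether a domain can be reached.
type reachabilityCheck func(domain string) bool

// unreachableDomains returns the domains that fail the check,
// in the order they were given.
func unreachableDomains(domains []string, check reachabilityCheck) []string {
	var failed []string
	for _, d := range domains {
		if !check(d) {
			failed = append(failed, d)
		}
	}
	return failed
}

// shouldRequestCertificate applies the all-or-nothing rule proposed in
// this issue: only request a certificate if every domain is reachable.
func shouldRequestCertificate(domains []string, check reachabilityCheck) bool {
	return len(domains) > 0 && len(unreachableDomains(domains, check)) == 0
}

func main() {
	// Simulate the situation from this issue: only the www name resolves.
	reachable := map[string]bool{"www.ourdomain.com": true}
	check := func(d string) bool { return reachable[d] }

	domains := []string{"ourdomain.com", "www.ourdomain.com"}
	fmt.Println("unreachable:", unreachableDomains(domains, check))
	fmt.Println("request certificate:", shouldRequestCertificate(domains, check))
}
```

With a gate like this, a single failing domain stops the request entirely instead of producing a partial certificate that is then re-requested on every sync.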

I'm opening this under the assumption that the cause is at least partly different from the retry issue that was addressed in #329. If I am incorrect in that assumption, then please accept my apology as I eagerly await the next release :)

*: I don't know much about Windows networks, but for some reason, the nameserver on the inside of the LAN has to point ourdomain.com to a different IP than the one I want my cluster to point to for the purposes of reachability/readiness/liveness tests of that site. To the outside world, both ourdomain.com and www.ourdomain.com point to the same IP which gets forwarded to this nginx ingress. To get kube-lego to work, I had originally set it up with a hostAliases entry that mapped ourdomain.com to the correct IP. This worked until a power outage / reboot on Saturday. I don't know why it doesn't work now. I'm trying a few other things like setting up dnsmasq on the host nodes with an appropriate /etc/hosts entry.

When requesting a certificate, kube-lego drops unreachable domains from the request. While this allows partial certificates to be provisioned, as far as kube-lego is concerned the certificate remains unfulfilled, and kube-lego continues in its attempts to deliver.

In the face of domains that remain unreachable for extended periods of time (e.g., permanently), kube-lego would repeatedly request a partial certificate, eventually triggering Let's Encrypt rate-limit for duplicate certificates.

With this change, kube-lego keeps track of missing domains in an existing certificate, and will only request a new certificate if it includes at least one of the missing domains.

This prevents certificate request loops where one or more domains are unreachable permanently (or for an extended period of time). The loop condition remains if the set of unreachable domains is not stable.

This also prevents kube-lego from requesting a certificate for a narrower set of domains, than currently supported by an existing certificate.

Logic in newCertNeeded() was tweaked so that expired certificates are checked for before checking for missing domains, to allow for partial certificates to be renewed, and not skipped.

Fixes: jetstack#330

Signed-off-by: João Abecasis <joao@schibsted.com>
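The decision order described in the commit message above can be sketched roughly as follows. This is an illustration under stated assumptions, not the actual body of `newCertNeeded()`: the `certState` type, the `needsNewCert` and `contains` names, and the day-based expiry field are all hypothetical simplifications.

```go
package main

import "fmt"

// certState is a hypothetical summary of an existing certificate.
type certState struct {
	Domains         []string // domains the certificate currently covers
	DaysUntilExpiry int      // simplified expiry information (assumption)
}

// contains reports whether s is present in list.
func contains(list []string, s string) bool {
	for _, v := range list {
		if v == s {
			return true
		}
	}
	return false
}

// needsNewCert mirrors the order described in the commit message:
// expiry is checked first, so partial certificates still get renewed;
// otherwise a request is made only if it would add a missing domain.
func needsNewCert(cert *certState, wanted, reachable []string, renewWithinDays int) bool {
	if cert == nil {
		// No certificate yet: request one if anything is reachable.
		return len(reachable) > 0
	}
	if cert.DaysUntilExpiry <= renewWithinDays {
		return true
	}
	for _, d := range wanted {
		// A missing domain that is now reachable justifies a new request.
		if !contains(cert.Domains, d) && contains(reachable, d) {
			return true
		}
	}
	// Every missing domain is still unreachable: do not loop.
	return false
}

func main() {
	wanted := []string{"ourdomain.com", "www.ourdomain.com"}
	partial := &certState{Domains: []string{"www.ourdomain.com"}, DaysUntilExpiry: 60}

	fmt.Println("still unreachable:", needsNewCert(partial, wanted, []string{"www.ourdomain.com"}, 30))
	fmt.Println("now reachable:", needsNewCert(partial, wanted, wanted, 30))
}
```

As the commit message notes, a check of this kind only breaks the loop while the set of unreachable domains stays stable; if domains flap in and out of reachability, repeated requests remain possible.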
