This repository has been archived by the owner on Aug 26, 2021. It is now read-only.

If one of the domains in an ingress fails reachability, kube-lego should not try to authorize any of the domains #330

Open
ftab opened this issue May 8, 2018 · 0 comments

ftab commented May 8, 2018

I had a power outage on Saturday that kicked off an issue I still can't fully explain. The hostAliases I defined in my kube-lego deployment, which had been working before, were suddenly ignored. As a result, one of the 2 domains in an ingress (ourdomain.com) wasn't reachable, while the other domain in the ingress did resolve to the correct IP. See footnote* for more info. Because ourdomain.com failed the reachability test, kube-lego proceeded to get a certificate for just the www.ourdomain.com domain. Because that certificate didn't cover all of the domains in the ingress, kube-lego kept requesting a new one, lather-rinse-repeat: it couldn't reach the fake acme challenge at ourdomain.com, requested a new certificate for just www.ourdomain.com, and so on.

I got rate limited and then IP blocked before I was even awoken by the alert emails that tell me when the site goes down. While I try to figure out how to get it to actually resolve to the correct IP, I would also like to learn how to prevent it from causing that kind of loop in the future. Hopefully I don't have to bug Let's Encrypt admins more than once even if I do run into a DNS thing again.

It doesn't make sense to me that kube-lego would ever try to request a certificate if the reachability test didn't pass for all domains in the certificate. Unless I'm missing something, such a situation indicates that the ingress was configured for the wrong domain, that there is a DNS resolution issue (like in my case), or something else similarly fatal. If I'm wrong and there is a good reason for it to go ahead and get a certificate even if it can't reach all of them, then at the very least it shouldn't ever try to re-obtain the certificate (at least until all domains pass the reachability test).
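To make the proposed policy concrete, here is a minimal sketch in Go of an all-or-nothing reachability gate. It is not kube-lego's actual code: `unreachableDomains`, `shouldRequestCert`, and the `probe` callback are hypothetical names standing in for whatever reachability test kube-lego really performs.

```go
package main

import "fmt"

// unreachableDomains returns every domain for which the probe fails.
// probe is a hypothetical stand-in for the real reachability test.
func unreachableDomains(domains []string, probe func(string) bool) []string {
	var failed []string
	for _, d := range domains {
		if !probe(d) {
			failed = append(failed, d)
		}
	}
	return failed
}

// shouldRequestCert implements the all-or-nothing policy proposed above:
// request a certificate only when every domain in the ingress is reachable.
func shouldRequestCert(domains []string, probe func(string) bool) bool {
	return len(domains) > 0 && len(unreachableDomains(domains, probe)) == 0
}

func main() {
	// Simulate the situation from this report: the bare domain does not
	// resolve correctly from inside the cluster, the www domain does.
	probe := func(d string) bool { return d != "ourdomain.com" }
	domains := []string{"ourdomain.com", "www.ourdomain.com"}

	fmt.Println(shouldRequestCert(domains, probe))  // false
	fmt.Println(unreachableDomains(domains, probe)) // [ourdomain.com]
}
```

With a gate like this, a single unreachable domain means no request is sent at all, so a DNS problem cannot turn into a loop of partial certificate requests.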

I'm opening this under the assumption that the cause is different, at least in part, from the retry issue that was addressed in #329. If I am incorrect in that assumption, then please accept my apology as I eagerly await the next release :)

*: I don't know much about Windows networks, but for some reason, the nameserver on the inside of the LAN has to point ourdomain.com to a different IP than the one I want my cluster to point to for the purposes of reachability/readiness/liveness tests of that site. To the outside world, both ourdomain.com and www.ourdomain.com point to the same IP which gets forwarded to this nginx ingress. To get kube-lego to work, I had originally set it up with a hostAliases entry that mapped ourdomain.com to the correct IP. This worked until a power outage / reboot on Saturday. I don't know why it doesn't work now. I'm trying a few other things like setting up dnsmasq on the host nodes with an appropriate /etc/hosts entry.

biochimia added a commit to biochimia/kube-lego that referenced this issue Nov 23, 2018
When requesting a certificate, kube-lego drops unreachable domains from
the request. While this allows partial certificates to be provisioned,
as far as kube-lego is concerned the certificate remains unfulfilled,
and kube-lego continues in its attempts to deliver.

In the face of domains that remain unreachable for extended periods of
time (e.g., permanently), kube-lego would repeatedly request a
partial certificate, eventually triggering the Let's Encrypt rate limit
for duplicate certificates.

With this change, kube-lego keeps track of missing domains in an
existing certificate, and will only request a new certificate if it
includes at least one of the missing domains.

This prevents certificate request loops where one or more domains are
unreachable permanently (or for an extended period of time). The loop
condition remains if the set of unreachable domains is not stable.

This also prevents kube-lego from requesting a certificate for a
narrower set of domains than is currently covered by an existing
certificate.

Logic in newCertNeeded() was tweaked so that expired certificates are
checked for before checking for missing domains, to allow for partial
certificates to be renewed, and not skipped.

Fixes: jetstack#330
Signed-off-by: João Abecasis <joao@schibsted.com>