Solutions for split-horizon and restricted DNS environment issues #903
I have seen a lot of issues regarding the DNS01 self check come up.
There seem to be two categories of problem, and this issue tries to encapsulate them in order to come up with a solution that resolves both.
So far, the two big problems I see:
Restricted DNS environments
Here, a user has configured their cluster/VPC so that all outbound traffic on port 53 is denied, except to the one cluster DNS server (e.g. kube-dns, or their Route53 resolver).
As a consequence, the DNS01 self-check will time out, because cert-manager recurses up the DNS records to find the authoritative nameserver and queries it directly.
We attempt to query the authoritative nameserver so we can be sure the record has been updated at the root - this is how Let's Encrypt perform their own validations, and as such is the most concrete way to verify a validation will succeed.
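To make the failure mode concrete, here is a rough sketch of that kind of walk. It is not cert-manager's actual implementation: `candidateZones`, `findAuthoritativeNS`, `resolverFor` and `systemLookup` are names made up for illustration, and the NS lookup is injected so the walk can be run without network access. In a restricted environment, the dial inside `resolverFor` is the step that would time out.

```go
package main

import (
	"context"
	"fmt"
	"net"
	"strings"
	"time"
)

// nsLookup returns the nameserver hostnames for a zone. It is injected so the
// walk can be exercised without network access (hypothetical type).
type nsLookup func(zone string) ([]string, error)

// candidateZones returns fqdn and each of its parent domains, most specific
// first, stopping before the root.
func candidateZones(fqdn string) []string {
	name := strings.TrimSuffix(fqdn, ".")
	if name == "" {
		return nil
	}
	labels := strings.Split(name, ".")
	zones := make([]string, 0, len(labels))
	for i := range labels {
		zones = append(zones, strings.Join(labels[i:], ".")+".")
	}
	return zones
}

// findAuthoritativeNS walks up from fqdn and returns the first zone for which
// the lookup reports nameservers.
func findAuthoritativeNS(fqdn string, lookup nsLookup) (string, []string, error) {
	for _, zone := range candidateZones(fqdn) {
		servers, err := lookup(zone)
		if err == nil && len(servers) > 0 {
			return zone, servers, nil
		}
	}
	return "", nil, fmt.Errorf("no nameservers found for %q", fqdn)
}

// resolverFor returns a resolver that sends every query to server on port 53.
// If outbound port 53 is blocked, this dial is what times out.
func resolverFor(server string) *net.Resolver {
	return &net.Resolver{
		PreferGo: true,
		Dial: func(ctx context.Context, network, _ string) (net.Conn, error) {
			d := net.Dialer{Timeout: 5 * time.Second}
			return d.DialContext(ctx, network, net.JoinHostPort(server, "53"))
		},
	}
}

// systemLookup is an nsLookup backed by the default (cluster) resolver.
func systemLookup(zone string) ([]string, error) {
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()
	records, err := net.DefaultResolver.LookupNS(ctx, zone)
	if err != nil {
		return nil, err
	}
	hosts := make([]string, 0, len(records))
	for _, r := range records {
		hosts = append(hosts, r.Host)
	}
	return hosts, nil
}

func main() {
	// Stubbed lookup so the example runs offline; swap in systemLookup to
	// query real DNS.
	stub := func(zone string) ([]string, error) {
		if zone == "example.com." {
			return []string{"ns1.example.net."}, nil
		}
		return nil, fmt.Errorf("no NS records for %s", zone)
	}
	zone, servers, err := findAuthoritativeNS("_acme-challenge.app.example.com.", stub)
	fmt.Println(zone, servers, err)
}
```

With a real lookup, the self-check would then be something like `resolverFor(servers[0]).LookupTXT(ctx, fqdn)`, which talks to the authoritative server directly rather than to the cluster DNS server.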
Split-horizon DNS environments
Here, a user has a public and private zone configured in Route53 (or similar) for the same domain. This is typically done to allow applications in the cluster to resolve hostnames to internal endpoints instead of using the regular 'user facing' endpoints.
This creates issues for cert-manager because when we perform a DNS query to find the DNS authority, the internal nameserver will respond with the private DNS zone root, consequently failing the self-check (as cert-manager will have updated the public zone).
This is mitigated by allowing users to specify the
In a worse class of this problem, cert-manager may actually update the private DNS zone without being aware it is internal only, and consequently pass the self check despite the record not being publicly available, eventually resulting in the Let's Encrypt quota/rate limit being used up.
In order to mitigate this, cert-manager also allows specifying a
Both of the above combined
Some users experience both of these issues together.
cert-manager will update the public DNS zone (assuming they use the hostedZoneID field in Route53's case), but they'll only be able to query the local DNS server due to network policies.
In order to mitigate this, a user would need to set up a separate DNS server that uses a public resolver upstream and point cert-manager at that using the
It seems the majority of users experiencing this are on AWS, so we've not heard many complaints about the lack of
The field was actually not meant to be a part of cert-manager, but I didn't notice its addition until after we began shipping releases. Ideally, I'd like to remove the field before we reach 1.0.
In order to do so, we need to work out how to either auto-detect public/private zones for all DNS providers, or otherwise add some dns-provider specific configuration to our Certificate resource.
I believe there's a third option, which is not as nice, but may provide an escape hatch for users.
I think this leaves open the possibility that we can just allow for disabling of the self check entirely, and have a user-specified timeout. This is essentially just hoping the user will configure things correctly to avoid using up their rate-limit, so such an option should come with a big warning.
However, it would be simple to implement and allow for people to bypass this issue if their local dns environment is a problem.
(e.g. the user can only use internal resolvers, and they don't have a resolver that will actually view the real state, they only have the internal zone to hand)
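Roughly what I have in mind, as a sketch only: `propagationPolicy` and `waitForPropagation` are hypothetical names, not existing cert-manager options. When the self check is skipped, we just wait for the user-specified time and then let the challenge proceed. Otherwise, we poll until the check passes or the timeout elapses.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// propagationPolicy is a hypothetical set of user-facing options.
type propagationPolicy struct {
	SkipSelfCheck bool          // trust the user and skip the DNS query
	FixedWait     time.Duration // how long to wait when the check is skipped
	Timeout       time.Duration // upper bound on polling when the check runs
	PollInterval  time.Duration // pause between polling attempts
}

var errPropagationTimeout = errors.New("record not visible before timeout")

// waitForPropagation blocks until it is reasonable to ask the ACME server to
// validate. check reports whether the challenge record is visible.
func waitForPropagation(p propagationPolicy, check func() (bool, error)) error {
	if p.SkipSelfCheck {
		// No verification at all: a misconfigured zone will burn rate limit.
		time.Sleep(p.FixedWait)
		return nil
	}
	deadline := time.Now().Add(p.Timeout)
	for {
		ok, err := check()
		if err != nil {
			return err
		}
		if ok {
			return nil
		}
		if !time.Now().Before(deadline) {
			return errPropagationTimeout
		}
		time.Sleep(p.PollInterval)
	}
}

func main() {
	policy := propagationPolicy{SkipSelfCheck: true, FixedWait: 10 * time.Millisecond}
	err := waitForPropagation(policy, func() (bool, error) {
		return false, nil // never consulted when the check is skipped
	})
	fmt.Println("skip self check:", err)
}
```

The skip path never calls `check`, which is exactly why it needs the big warning: nothing stops a misconfigured zone from failing validation repeatedly.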
That said, I think the second option is better, but I don't know that it's viable in all environments.
The RFC does support this, and describes this case quite well.
Given we now have a 'back-off' on re-issuance attempts (currently hardcoded at every 5 minutes), we could consider implementing this.
Users would still hit the quota for attempts quite quickly in failure cases. With #809 we could extend this backoff behaviour further as well.
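For illustration, an extended back-off could look something like the sketch below. The 5 minute base matches the current fixed interval; the doubling and the 24 hour cap are assumptions of mine, not what #809 specifies, and `nextAttemptDelay` is a made-up name.

```go
package main

import (
	"fmt"
	"time"
)

const (
	baseDelay = 5 * time.Minute // the current fixed retry interval
	maxDelay  = 24 * time.Hour  // assumed cap, chosen only for this sketch
)

// nextAttemptDelay returns how long to wait before the next issuance attempt
// after the given number of consecutive failures. The delay doubles per
// failure and is capped at maxDelay.
func nextAttemptDelay(failures int) time.Duration {
	if failures <= 0 {
		return 0
	}
	delay := baseDelay
	for i := 1; i < failures; i++ {
		delay *= 2
		if delay >= maxDelay {
			return maxDelay
		}
	}
	return delay
}

func main() {
	for failures := 1; failures <= 10; failures++ {
		fmt.Printf("after %2d failures: wait %v\n", failures, nextAttemptDelay(failures))
	}
}
```

With doubling, a persistently failing Certificate backs off to roughly one attempt per day within ten failures, instead of retrying every 5 minutes indefinitely.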
For Route 53, the GetChange API call will help to validate that the record is resolvable.
Hence I think it is safe to assume that when R53 says
That covers the Challenge validation.
However, cert-manager also checks that the hostname requesting a cert is resolvable. As far as I know, the DNS01 challenge doesn't require this, so why is the check implemented in cert-manager? It makes split DNS even harder.
I think solution
In my case, I have a dnsmasq server just for cert-manager, it only forwards queries for
And this works like a charm if
We are also facing this issue and @mikebryant's suggestion with a backoff policy seems very reasonable.
Split DNS resolves public and internal IPs for the same domains, depending on the origin. Worse yet, we cannot simply work around this using