Solutions for split-horizon and restricted DNS environment issues #903

munnerz · 2018-09-13T19:47:41Z

I have seen a lot of issues regarding the DNS01 self check come up.

There seem to be two categories of problem, and this issue tries to capture both in order to come up with a solution that resolves them.

So far, the two big problems I see:

Restricted DNS environments

Here, a user has configured their cluster/VPC so that all outbound traffic on port 53 is denied, except for the one cluster DNS server (i.e. kube-dns, or their route53 resolver).

This consequently means the DNS01 self-check will time out, because cert-manager will recurse up the DNS records to find the authoritative nameserver and query it directly.

We attempt to query the authoritative nameserver so we can be sure the record has been updated at the root - this is how Let's Encrypt perform their own validations, and as such is the most concrete way to verify a validation will succeed.
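To make the "recurse up" step concrete, here is a minimal sketch of the names a client could probe for an SOA record while walking from the challenge name toward the root. It is an illustration and not cert-manager's actual code; the function name `candidate_zones` and the example hostname are assumptions. The first name that returns an SOA identifies the zone, and that zone's NS hosts are then queried directly, which is the query that gets blocked when outbound port 53 is restricted.

```python
def candidate_zones(fqdn: str) -> list[str]:
    """Return the names to probe for an SOA record, most specific first.

    Hypothetical helper: it only computes the names, it performs no DNS I/O.
    """
    labels = fqdn.rstrip(".").split(".")
    # Drop one leading label at a time: a.b.c. -> b.c. -> c.
    return [".".join(labels[i:]) + "." for i in range(len(labels))]


if __name__ == "__main__":
    for name in candidate_zones("_acme-challenge.www.example.com"):
        print(name)
```

No network traffic is involved in the sketch itself, so it can be run inside a restricted cluster to see which zones a check would need to reach.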

#877

Split-horizon DNS environments

Here, a user has a public and private zone configured in Route53 (or similar) for the same domain. This is typically done to allow applications in the cluster to resolve hostnames to internal endpoints instead of using the regular 'user facing' endpoints.

This creates issues for cert-manager because when we perform a DNS query to find the DNS authority, the internal nameserver will respond with the private DNS zone root, consequently failing the self-check (as cert-manager will have updated the public zone).

This is mitigated by allowing users to specify the --dns01-self-check-nameservers flag, which will alter the 'root' nameserver used to perform the initial query - the idea here being that by specifying e.g. 8.8.8.8, they will begin recursing the public zone to find the authority that Let's Encrypt will see.

In a worse class of this problem, cert-manager may actually update the private DNS zone without being aware it is internal only, and consequently pass the self check despite the record not being publicly available, eventually resulting in the Let's Encrypt quota/rate limit being used up.

In order to mitigate this, cert-manager also allows specifying a hostedZoneID in some DNS provider configurations (e.g. route53), which allows the user to override the automatic hosted zone selection logic. This works well, but then requires a separate DNS provider configuration per DNS zone, which is awkward and not how we designed the API to be used.

Both of the above combined

Some users experience both of these issues together.

cert-manager will update the public DNS zone (assuming they use the hostedZoneID field in Route53's case), but they'll only be able to query the local DNS server due to network policies.

In order to mitigate this, a user would need to set up a separate DNS server that uses a public resolver upstream, and point cert-manager at it using the --dns01-self-check-nameservers flag.

#894

Possible solutions

1. Allow forcing all DNS queries to go through a single resolver - this solution kind of works, but will fall down in the face of DNS caches/TTLs. It would not be possible to point all traffic at the authoritative nameserver, as that would restrict users to only being able to obtain certificates for domains hosted within that zone.
2. Clearly document, or provide assistance in configuring, a DNS resolver that will handle this for us. Users in restricted network environments could then allow just that DNS resolver access to the wider internet (or simply 8.8.8.8 et al).
3. ???

It seems the majority of users experiencing this are on AWS, so we've not heard many complaints about the lack of hostedZoneID field in other providers.

The field was actually not meant to be a part of cert-manager, but I didn't notice its addition until after we began shipping releases. Ideally, I'd like to remove the field before we reach 1.0.

In order to do so, we need to work out how to either auto-detect public/private zones for all DNS providers, or otherwise add some dns-provider specific configuration to our Certificate resource.
Related #783

/area provider-acme

mikebryant · 2018-09-13T19:56:35Z

I believe there's a third option, which is not as nice, but may provide an escape hatch for users.

From https://github.com/ietf-wg-acme/acme/blob/master/draft-ietf-acme-acme.md#retrying-challenges:
Clients SHOULD NOT respond to challenges until they believe that the server's queries will succeed.

I think this leaves open the possibility that we can just allow for disabling of the self check entirely, and have a user-specified timeout. This is essentially just hoping the user will configure things correctly to avoid using up their rate-limit, so such an option should come with a big warning.

However, it would be simple to implement and would allow people to bypass this issue if their local DNS environment is a problem.

(e.g. the user can only use internal resolvers, and they don't have a resolver that will actually view the real state, they only have the internal zone to hand)

That said, I think the second option is better, but I don't know that it's viable in all environments.

munnerz · 2018-09-13T20:02:00Z

4. SHOULD NOT   This phrase, or the phrase "NOT RECOMMENDED" mean that
   there may exist valid reasons in particular circumstances when the
   particular behavior is acceptable or even useful, but the full
   implications should be understood and the case carefully weighed
   before implementing any behavior described with this label.

https://tools.ietf.org/html/rfc2119

The RFC does support this, and describes this case quite well.

Given we now have a 'back-off' on re-issuance attempts (currently hardcoded at every 5 minutes), we could consider implementing this.

Users would still hit the quota for attempts quite quickly in failure cases. With #809 we could extend this backoff behaviour further as well.

jescarri · 2018-09-13T20:51:03Z

@munnerz / @mikebryant

For Route 53, the GetChange API call will help validate that the record is resolvable.

https://docs.aws.amazon.com/sdk-for-go/api/service/route53/#Route53.GetChange
https://docs.aws.amazon.com/Route53/latest/APIReference/API_GetChange.html

This is already done at https://github.com/jetstack/cert-manager/blob/master/pkg/issuer/acme/dns/route53/route53.go#L181-L193

Hence I think it is safe to assume that when R53 says INSYNC the record is resolvable.
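As an illustration of that idea, here is a minimal polling sketch. It is written in Python against boto3's Route 53 `get_change` call rather than the Go SDK linked above, purely for brevity, and it is not the cert-manager implementation. The function name `wait_for_insync`, the default timeout and interval, and the change ID in the usage comment are all assumptions.

```python
import time


def wait_for_insync(client, change_id, timeout=120.0, interval=5.0,
                    sleep=time.sleep, clock=time.monotonic):
    """Poll GetChange until the status is INSYNC; return False on timeout.

    `client` is anything exposing boto3's Route 53 `get_change(Id=...)`.
    `sleep` and `clock` are injectable so the loop can be tested offline.
    """
    deadline = clock() + timeout
    while True:
        status = client.get_change(Id=change_id)["ChangeInfo"]["Status"]
        if status == "INSYNC":
            return True
        if clock() >= deadline:
            return False
        sleep(interval)


# Usage (requires AWS credentials; the change ID below is a placeholder):
#   import boto3
#   client = boto3.client("route53")
#   wait_for_insync(client, "/change/C0000000EXAMPLE")
```

Note that INSYNC only reports that Route 53's own authoritative servers have the change. It says nothing about what a split-horizon internal resolver will answer.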

That covers the Challenge validation.

However, cert-manager also checks that the hostname requesting a cert is resolvable. AFAIK the DNS01 challenge doesn't require this, so why is the check implemented in cert-manager? It makes split DNS even harder.

I think the "Allow forcing all DNS queries to go through a single resolver" solution works perfectly if:

- No negative caching is done at that DNS resolver.
- Users are OK with the latency introduced by the upstream servers' caches. Could this be solved by decreasing the TTL of the TXT record?

In my case, I have a dnsmasq server just for cert-manager. It only forwards queries for the foo.bar zone to an outside DNS server (it could also be pointed at the authoritative NS), and it answers with a bogus IP address when cert-manager asks for a hostname it is requesting a certificate for.

And this works like a charm, provided "Allow forcing all DNS queries to go through a single resolver" happens.

zystem · 2018-09-26T20:42:54Z

Additional info for ticket #877:
Acme.sh works well. It does not make direct DNS requests and uses the allowed internal DNS server. The problem with Let's Encrypt appears only with cert-manager.

juliohm1978 · 2018-10-10T18:02:19Z

We are also facing this issue, and @mikebryant's suggestion with a backoff policy seems very reasonable.

Split DNS resolves public and internal IPs for the same domains, depending on the origin. Worse yet, we cannot simply work around this using --dns01-self-check-nameservers. Due to security policies adopted here, none of our servers are allowed to query DNS externally, so using 1.1.1.1 or 8.8.8.8 is not an option in this case.

jescarri · 2018-12-15T01:07:27Z

Hey guys, any update on this?

Is removing the check a valid option here?

I could submit a PR.

In my case I do the validation using --dns01-self-check-nameservers, pointed at a DNS server that has the external view of our split-horizon DNS.

It works great.

takmatsu · 2018-12-25T05:54:48Z

The problem still seems to occur, even in 0.5.2.

--dns01-self-check-nameservers="INTERNAL_DNS_IP:53"

I1225 05:07:23.512534       1 start.go:79] starting cert-manager v0.5.2 (revision 9e8c3ad899c5aafaa360ca947eac7f5ba6301035)
I1225 05:07:23.523922       1 controller.go:126] Using the following nameservers for DNS01 checks: [INTERNAL_DNS_IP:53]
I1225 05:08:53.472111       1 dns.go:99] Checking DNS propagation for "hogehoge.net" using name servers: [INTERNAL_DNS_IP:53]
I1225 05:09:13.484780       1 sync.go:276] Error preparing issuer for certificate default/hogehoge.net: read udp 10.100.4.67:35222->205.251.196.87:53: i/o timeout

tlm · 2019-01-09T03:01:29Z

Hey Guys,

Have just run into this issue where our environment has no internet access and cannot get to the NS of the zone to check propagation. It's understandable why this would be the desired behaviour, as it removes the chance that negative caching comes into effect.

For our purposes we understand this possibility and ran a quick test with modified code that uses the dns01-self-check-nameservers as the authoritative name servers instead. It works as expected and allowed the process to complete.

I would propose a new flag --dns01-check-authoritative that defaults to true so we can toggle the behaviour.

Happy to put the PR together as we have most of the work already done.
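A rough sketch of what the proposed toggle amounts to, under stated assumptions: the function name `nameservers_to_query` and the example addresses are hypothetical, and this is not the code from the tested patch. With the flag left at its default of true the check targets the zone's authoritative servers; with it set to false the check stays on the servers given via the self-check nameservers flag.

```python
def nameservers_to_query(check_authoritative, authoritative, configured):
    """Pick the servers used for the propagation check.

    check_authoritative: value of the proposed --dns01-check-authoritative flag.
    authoritative: NS hosts discovered for the zone (needs outbound port 53).
    configured: servers passed via the self-check nameservers flag.
    """
    if check_authoritative:
        return list(authoritative)
    return list(configured)


if __name__ == "__main__":
    # Example addresses are placeholders, not real infrastructure.
    print(nameservers_to_query(False, ["192.0.2.53:53"], ["10.0.0.10:53"]))
```

The trade-off noted above still applies: answers from a configured resolver may be cached, including negative answers, so a passing check is weaker evidence than one made against the authoritative servers.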

juliohm1978 · 2019-01-23T20:40:44Z

Happy to report that the new 0.6.0 release works perfectly as a workaround for this issue. Using the --dns01-recursive-nameservers and --dns01-recursive-nameservers-only parameters, I was able to issue certs, both staging and production.

Many thanks to the community.

mdgreenwald · 2020-05-28T21:38:16Z

For anyone running split-horizon DNS who finds this like I did, the config changes end up looking like this, despite the docs specifying otherwise:

- --dns01-recursive-nameservers="1.1.1.1:53"
- --dns01-recursive-nameservers="1.0.0.1:53"
- --dns01-recursive-nameservers-only

An [issue][0] describes a problem I was experiencing where the domain I was trying to update is overridden by my local DNS settings (split DNS). This change makes it so that, when performing a DNS01 challenge, `cert-manager` will use a public DNS server instead of the local one.

[0]: cert-manager/cert-manager#903

The [docs][0] don't seem to be correct. A [comment on an issue][1] might be more correct.

[0]: https://cert-manager.io/docs/configuration/acme/dns01/#setting-nameservers-for-dns01-self-check
[1]: cert-manager/cert-manager#903 (comment)

jetstack-bot added the area/acme label (Indicates a PR directly modifies the ACME Issuer code) Sep 13, 2018

This was referenced Sep 13, 2018

Route53 DNS01 Validation does not work in split horizon dns setups. #894

Closed

looks like cert-manager ignores resolv.conf and make direct requests to dns #877

Closed

juliohm1978 mentioned this issue Oct 10, 2018

Cert-manager ignores nameservers passed to --dns01-self-check-nameservers #946

Closed

This was referenced Jan 9, 2019

Control authoritative dns01 server check. #1184

Merged

Allow internal DNS servers to be used for TXT lookup #1177

Closed

jetstack-bot closed this as completed in #1184 Jan 12, 2019

cemo mentioned this issue Feb 19, 2019

CloudDNS and Split DNS issue #1383

Closed
