Solutions for split-horizon and restricted DNS environment issues #903

Closed
munnerz opened this issue Sep 13, 2018 · 10 comments · Fixed by #1184
Labels
area/acme Indicates a PR directly modifies the ACME Issuer code

Comments

@munnerz
Member

munnerz commented Sep 13, 2018

I have seen a lot of issues regarding the DNS01 self check come up.

There seem to be two categories of problem, and this issue tries to capture both in order to come up with a solution that resolves them.

So far, the two big problems I see:

Restricted DNS environments

Here, a user has configured their cluster/VPC so that all outbound traffic on port 53 is denied, except for the one cluster DNS server (i.e. kube-dns, or their route53 resolver).

This consequently means the DNS01 self-check will time out, because cert-manager will recurse up the DNS records to find the authoritative nameserver, and query that.

We attempt to query the authoritative nameserver so we can be sure the record has been updated at the root - this is how Let's Encrypt perform their own validations, and as such is the most concrete way to verify a validation will succeed.
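A minimal sketch of what "recursing up the DNS records" can look like, under stated assumptions: this is not cert-manager's actual code, the function names are hypothetical, and `has_soa` is a stand-in callback for a real SOA query so the example runs without network access.

```python
from typing import Callable, Iterator, Optional


def candidate_zones(fqdn: str) -> Iterator[str]:
    """Yield the name itself and each parent domain, most specific first."""
    labels = fqdn.rstrip(".").split(".")
    for i in range(len(labels)):
        yield ".".join(labels[i:]) + "."


def find_zone_apex(fqdn: str, has_soa: Callable[[str], bool]) -> Optional[str]:
    """Return the first candidate for which has_soa reports an SOA record.

    has_soa is a hypothetical callback standing in for a real DNS query.
    """
    for zone in candidate_zones(fqdn):
        if has_soa(zone):
            return zone
    return None


# Toy data standing in for DNS: only example.com. is a zone apex here.
apexes = {"example.com."}
apex = find_zone_apex("_acme-challenge.www.example.com.", lambda z: z in apexes)
print(apex)
```

Once the apex is known, the zone's NS records identify the authoritative servers, and those are the servers a restricted network cannot reach on port 53.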

#877

Split-horizon DNS environments

Here, a user has a public and private zone configured in Route53 (or similar) for the same domain. This is typically done to allow applications in the cluster to resolve hostnames to internal endpoints instead of using the regular 'user facing' endpoints.

This creates issues for cert-manager because when we perform a DNS query to find the DNS authority, the internal nameserver will respond with the private DNS zone root, consequently failing the self-check (as cert-manager will have updated the public zone).

This is mitigated by allowing users to specify the --dns01-self-check-nameservers flag, which will alter the 'root' nameserver used to perform the initial query - the idea here being that by specifying e.g. 8.8.8.8, they will begin recursing the public zone to find the authority that Let's Encrypt will see.

In a worse class of this problem, cert-manager may actually update the private DNS zone without being aware it is internal only, and consequently pass the self check despite the record not being publicly available, eventually resulting in the Let's Encrypt quota/rate limit being used up.

In order to mitigate this, cert-manager also allows specifying a hostedZoneID in some DNS provider configurations (e.g. route53), which allows the user to override the automatic hosted zone selection logic. This works well, but it requires a separate DNS provider configuration for each DNS zone, which is awkward and not how we designed the API to be used.
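The two split-horizon failure modes described in this section can be modelled with a toy example. This is a sketch only: the dictionaries stand in for the public and private views of the same zone, and the record name and token are made-up placeholders.

```python
from typing import Dict, Set

# Toy model: each "view" maps a record name to the TXT values it serves.
View = Dict[str, Set[str]]


def self_check(view: View, name: str, expected: str) -> bool:
    """Return True if the given view serves the expected TXT value."""
    return expected in view.get(name, set())


record = "_acme-challenge.example.com."
token = "challenge-token"  # placeholder value

public_view: View = {}
private_view: View = {}

# Case 1: the record is written to the public zone, but the check runs
# against the internal view. The check fails although the CA would succeed.
public_view[record] = {token}
false_negative = (not self_check(private_view, record, token)
                  and self_check(public_view, record, token))

# Case 2: the record is written to the private zone by mistake. The check
# passes, but the CA (which sees only the public view) would fail.
public_view.clear()
private_view[record] = {token}
false_positive = (self_check(private_view, record, token)
                  and not self_check(public_view, record, token))

print(false_negative, false_positive)
```

Case 2 is the one that burns rate limit, because every attempt looks healthy locally and only fails at the CA.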

Both of the above combined

Some users experience both of these issues together.

cert-manager will update the public DNS zone (assuming they use the hostedZoneID field in Route53's case), but they'll only be able to query the local DNS server due to network policies.

In order to mitigate this, a user would need to set up a separate DNS server that uses a public resolver upstream and point cert-manager at it using the --dns01-self-check-nameservers flag.

#894

Possible solutions

  • Allow forcing all DNS queries to go through a single resolver - this solution kind of works, but will fall down in the face of DNS caches/ttls. It would not be possible to point all traffic at the authoritative nameserver, as that would restrict users to only being able to obtain certificates for domains hosted within that zone.

  • Clearly document, or provide assistance in configuring a DNS resolver that will handle this for us. Users in restricted network environments could then allow just that DNS resolver access to the wider internet (or simply 8.8.8.8 et al).

  • ???

It seems the majority of users experiencing this are on AWS, so we've not heard many complaints about the lack of hostedZoneID field in other providers.

The field was actually not meant to be a part of cert-manager, but I didn't notice its addition until after we began shipping releases. Ideally, I'd like to remove the field before we reach 1.0.

In order to do so, we need to work out how to either auto-detect public/private zones for all DNS providers, or otherwise add some dns-provider specific configuration to our Certificate resource.
Related #783

/area provider-acme

@mikebryant
Member

I believe there's a third option, which is not as nice, but may provide an escape hatch for users.

From https://github.com/ietf-wg-acme/acme/blob/master/draft-ietf-acme-acme.md#retrying-challenges:
Clients SHOULD NOT respond to challenges until they believe that the server's queries will succeed.

I think this leaves open the possibility that we can just allow for disabling of the self check entirely, and have a user-specified timeout. This is essentially just hoping the user will configure things correctly to avoid using up their rate-limit, so such an option should come with a big warning.

However, it would be simple to implement and allow for people to bypass this issue if their local dns environment is a problem.

(e.g. the user can only use internal resolvers, and they don't have a resolver that will actually view the real state, they only have the internal zone to hand)

That said, I think the second option is better, but I don't know that it's viable in all environments.

@munnerz
Member Author

munnerz commented Sep 13, 2018

4. SHOULD NOT   This phrase, or the phrase "NOT RECOMMENDED" mean that
   there may exist valid reasons in particular circumstances when the
   particular behavior is acceptable or even useful, but the full
   implications should be understood and the case carefully weighed
   before implementing any behavior described with this label.

https://tools.ietf.org/html/rfc2119

The RFC does support this, and describes this case quite well.

Given we now have a 'back-off' on re-issuance attempts (currently hardcoded at every 5 minutes), we could consider implementing this.

Users would still hit the quota for attempts quite quickly in failure cases. With #809 we could extend this backoff behaviour further as well.
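One way the back-off could be extended is a capped exponential schedule. The sketch below is purely illustrative: only the 5-minute base comes from the comment above, while the doubling, the 24-hour cap, and the function name are assumptions, not anything cert-manager implements.

```python
from datetime import timedelta


def backoff_delay(failures: int,
                  base: timedelta = timedelta(minutes=5),
                  cap: timedelta = timedelta(hours=24)) -> timedelta:
    """Delay before the next attempt after `failures` consecutive failures.

    The base matches the 5-minute interval mentioned above; the doubling
    and the 24-hour cap are assumptions chosen for illustration.
    """
    if failures < 1:
        return timedelta(0)
    # Bound the exponent so the intermediate timedelta cannot overflow.
    exponent = min(failures - 1, 32)
    return min(base * (2 ** exponent), cap)


for n in range(1, 6):
    print(n, backoff_delay(n))
```

With a schedule like this, a misconfigured setup with the self check disabled would make far fewer failed validation attempts per day than a fixed 5-minute retry.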

@jescarri

@munnerz / @mikebryant

For Route 53, the GetChange API call will help validate that the record is resolvable.

https://docs.aws.amazon.com/sdk-for-go/api/service/route53/#Route53.GetChange
https://docs.aws.amazon.com/Route53/latest/APIReference/API_GetChange.html

This is already done at https://github.com/jetstack/cert-manager/blob/master/pkg/issuer/acme/dns/route53/route53.go#L181-L193

Hence I think it is safe to assume that when R53 says INSYNC the record is resolvable.

That covers the Challenge validation.
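A minimal sketch of the GetChange polling idea, written in Python rather than the Go used by cert-manager. It assumes a boto3-style Route 53 client whose `get_change(Id=...)` call returns a `ChangeInfo` dictionary with a `Status` of `PENDING` or `INSYNC`; the function name, attempt count, and delay are hypothetical.

```python
import time
from typing import Any, Callable


def wait_for_insync(client: Any, change_id: str, attempts: int = 30,
                    delay_seconds: float = 4.0,
                    sleep: Callable[[float], None] = time.sleep) -> bool:
    """Poll GetChange until the change is INSYNC or attempts run out.

    `client` is assumed to behave like boto3.client("route53"); `sleep`
    is injectable so the loop can be exercised without waiting.
    """
    for attempt in range(attempts):
        response = client.get_change(Id=change_id)
        if response["ChangeInfo"]["Status"] == "INSYNC":
            return True
        if attempt < attempts - 1:
            sleep(delay_seconds)
    return False
```

With a real client this would be called with the change ID returned when the TXT record was submitted. Note that INSYNC only says the change has reached Route 53's own authoritative servers for that hosted zone; it says nothing about which zone (public or private) was updated.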

However, cert-manager also checks that the hostname requesting a cert is resolvable. As far as I know, the DNS01 challenge doesn't require this, so why is the check implemented in cert-manager? It makes split DNS even harder.

I think the solution "Allow forcing all DNS queries to go through a single resolver" works perfectly if:

  • No negative caching is done at that DNS resolver.
  • Users are OK with the latency introduced by upstream servers' caches. Could this be solved by decreasing the TTL of the TXT record?
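To illustrate why negative caching matters for the two conditions above, here is a toy simulation of a caching forwarder. Everything in it is hypothetical: the class, the TTL values, and the explicit `now` clock are assumptions made so the behaviour can be shown without a real resolver.

```python
from typing import Dict, Optional, Tuple


class CachingResolver:
    """Toy forwarding resolver that caches answers, including misses."""

    def __init__(self, upstream: Dict[str, str], positive_ttl: int, negative_ttl: int):
        self.upstream = upstream
        self.positive_ttl = positive_ttl
        self.negative_ttl = negative_ttl
        self._cache: Dict[str, Tuple[Optional[str], int]] = {}

    def query(self, name: str, now: int) -> Optional[str]:
        cached = self._cache.get(name)
        if cached is not None and now < cached[1]:
            return cached[0]  # served from cache, possibly a cached miss
        answer = self.upstream.get(name)
        ttl = self.positive_ttl if answer is not None else self.negative_ttl
        self._cache[name] = (answer, now + ttl)
        return answer


zone: Dict[str, str] = {}
resolver = CachingResolver(zone, positive_ttl=60, negative_ttl=300)
name = "_acme-challenge.foo.bar."

first = resolver.query(name, now=0)     # record not created yet: the miss is cached
zone[name] = "challenge-token"          # record is now published upstream
during = resolver.query(name, now=100)  # still inside the negative TTL
after = resolver.query(name, now=300)   # negative entry expired, upstream is asked
print(first, during, after)
```

If the resolver is queried before the TXT record exists, the cached miss hides the record for the whole negative TTL, which is why a single-resolver setup works best with negative caching disabled.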

In my case, I have a dnsmasq server just for cert-manager. It only forwards queries for the foo.bar zone to an outside DNS server (it can also be pointed at the authoritative NS), and it answers with a bogus IP address when cert-manager asks for a hostname that is requesting a certificate.

And this works like a charm if "Allow forcing all DNS queries to go through a single resolver" happens.

@zystem

zystem commented Sep 26, 2018

Additional info for ticket #877:
acme.sh works well. It does not make direct DNS requests and uses the allowed internal DNS server. The problem with Let's Encrypt appears only in cert-manager.

@juliohm1978

We are also facing this issue, and @mikebryant's suggestion with a backoff policy seems very reasonable.

Split DNS resolves public and internal IPs for the same domains, depending on the origin. Worse yet, we cannot simply work around this using --dns01-self-check-nameservers. Due to security policies adopted here, none of our servers are allowed to query DNS externally, so using 1.1.1.1 or 8.8.8.8 is not an option in this case.

@jescarri

Hey guys, any update on this?

Is removing the check a valid option here?

I could submit a PR.

In my case I do the validation using --dns01-self-check-nameservers, pointed at a DNS server that has the external view of our split-horizon DNS.

It works great.

@takmatsu

Even in 0.5.2, the problem still seems to occur.

--dns01-self-check-nameservers="INTERNAL_DNS_IP:53"
I1225 05:07:23.512534       1 start.go:79] starting cert-manager v0.5.2 (revision 9e8c3ad899c5aafaa360ca947eac7f5ba6301035)
I1225 05:07:23.523922       1 controller.go:126] Using the following nameservers for DNS01 checks: [INTERNAL_DNS_IP:53]
I1225 05:08:53.472111       1 dns.go:99] Checking DNS propagation for "hogehoge.net" using name servers: [INTERNAL_DNS_IP:53]
I1225 05:09:13.484780       1 sync.go:276] Error preparing issuer for certificate default/hogehoge.net: read udp 10.100.4.67:35222->205.251.196.87:53: i/o timeout

@tlm
Contributor

tlm commented Jan 9, 2019

Hey Guys,

We have just run into this issue: our environment has no internet access and cannot reach the NS of the zone to check propagation. It's understandable why querying the authoritative servers is the desired behaviour, as it removes the chance that negative caching comes into effect.

For our purposes we understand this possibility and ran a quick test with modified code that uses the dns01-self-check-nameservers as the authoritative name servers instead. It works as expected and allowed the process to complete.

I would propose a new flag --dns01-check-authoritative that defaults to true so we can toggle the behaviour.
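A rough sketch of the toggle being proposed, in Python for brevity. It is not the modified code mentioned above: the function and parameter names are hypothetical, and the addresses are placeholders.

```python
from typing import List, Sequence


def servers_for_self_check(configured: Sequence[str],
                           authoritative: Sequence[str],
                           check_authoritative: bool = True) -> List[str]:
    """Pick the servers the propagation check should query.

    With check_authoritative=True (the proposed default), the zone's
    authoritative nameservers are queried directly. With False, only the
    servers configured via dns01-self-check-nameservers are queried.
    """
    if check_authoritative:
        return list(authoritative)
    return list(configured)


internal = ["10.0.0.10:53"]            # placeholder internal resolver
zone_ns = ["ns-1.example.net:53"]      # placeholder authoritative server
print(servers_for_self_check(internal, zone_ns, check_authoritative=False))
```

Defaulting to true keeps today's behaviour for everyone who can reach the authoritative servers, and makes the less strict check an explicit opt-in.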

Happy to put the PR together as we have most of the work already done.

@juliohm1978

juliohm1978 commented Jan 23, 2019

Happy to report that the new 0.6.0 release works perfectly as a workaround for this issue. Using the --dns01-recursive-nameservers and --dns01-recursive-nameservers-only parameters, I was able to issue certs in both staging and production.

Many thanks to the community.

@mdgreenwald

For anyone running split-horizon DNS who finds this like I did, the config changes end up looking like this, despite the docs specifying otherwise:

- --dns01-recursive-nameservers="1.1.1.1:53"
- --dns01-recursive-nameservers="1.0.0.1:53"
- --dns01-recursive-nameservers-only

cjlarose added a commit to cjlarose/homelab that referenced this issue Jun 28, 2020
An [issue][0] describes a problem I was experiencing where the domain I
was trying to update is overridden by my local DNS settings (split DNS).

This change makes it so that, when performing a DNS01 challenge,
`cert-manager` will use a public DNS server instead of the local one.

[0]: cert-manager/cert-manager#903