Solutions for split-horizon and restricted DNS environment issues #903

Open
munnerz opened this Issue Sep 13, 2018 · 5 comments

6 participants
@munnerz
Member

munnerz commented Sep 13, 2018

I have seen a lot of issues regarding the DNS01 self check come up.

There seem to be two categories of problem, and this issue tries to capture both in order to come up with a solution that resolves them.

So far, the two big problems I see:

Restricted DNS environments

Here, a user has configured their cluster/VPC so that all outbound traffic on port 53 is denied, except for the one cluster DNS server (i.e. kube-dns, or their route53 resolver).

Consequently, the DNS01 self-check will time out, because cert-manager recurses up the DNS records to find the authoritative nameserver and then queries it directly.

We attempt to query the authoritative nameserver so we can be sure the record has been updated at the root - this is how Let's Encrypt perform their own validations, and as such is the most concrete way to verify a validation will succeed.
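The "recurse up the DNS records" step can be sketched roughly as follows. This is a minimal illustration, not cert-manager's actual code: `candidate_zones`, `find_zone` and the `has_soa` callback are hypothetical names, and `has_soa` stands in for a real SOA lookup so the sketch runs without network access.

```python
from typing import Callable, List, Optional


def candidate_zones(fqdn: str) -> List[str]:
    """Return every suffix of fqdn as a possible zone apex, most specific first."""
    labels = fqdn.rstrip(".").split(".")
    return [".".join(labels[i:]) + "." for i in range(len(labels))]


def find_zone(fqdn: str, has_soa: Callable[[str], bool]) -> Optional[str]:
    """Walk up from fqdn and return the first name that has an SOA record.

    has_soa is an assumption of this sketch: it represents "ask a resolver
    whether this name is a zone apex". The answer depends on which resolver
    is asked, which is what breaks in restricted or split-horizon setups.
    """
    for name in candidate_zones(fqdn):
        if has_soa(name):
            return name
    return None


# Example with a fake resolver view in which only example.com. is a zone apex.
zone = find_zone("_acme-challenge.www.example.com.", lambda n: n == "example.com.")
```

Once the zone is known, the self-check would look up that zone's NS records and query those servers directly on port 53, which is the traffic a restricted environment blocks.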

#877

Split-horizon DNS environments

Here, a user has a public and private zone configured in Route53 (or similar) for the same domain. This is typically done to allow applications in the cluster to resolve hostnames to internal endpoints instead of using the regular 'user facing' endpoints.

This creates issues for cert-manager because when we perform a DNS query to find the DNS authority, the internal nameserver will respond with the private DNS zone root, consequently failing the self-check (as cert-manager will have updated the public zone).

This is mitigated by allowing users to specify the --dns01-self-check-nameservers flag, which will alter the 'root' nameserver used to perform the initial query - the idea here being that by specifying e.g. 8.8.8.8, they will begin recursing the public zone to find the authority that Let's Encrypt will see.

In a worse class of this problem, cert-manager may actually update the private DNS zone without being aware it is internal only, and consequently pass the self check despite the record not being publicly available, eventually resulting in the Let's Encrypt quota/rate limit being used up.

In order to mitigate this, cert-manager also allows specifying a hostedZoneID in some DNS provider configurations (e.g. route53), which lets the user override the automatic hosted zone selection logic. This works well, but it then requires a separate DNS provider configuration for each DNS zone, which is awkward and not how we designed the API to be used.

Both of the above combined

Some users experience both of these issues together.

cert-manager will update the public DNS zone (assuming they use the hostedZoneID field in Route53's case), but they'll only be able to query the local DNS server due to network policies.

In order to mitigate this, a user would need to set up a separate DNS server that uses a public resolver upstream and point cert-manager at it using the --dns01-self-check-nameservers flag.

#894

Possible solutions

  • Allow forcing all DNS queries to go through a single resolver - this solution kind of works, but will fall down in the face of DNS caches/ttls. It would not be possible to point all traffic at the authoritative nameserver, as that would restrict users to only being able to obtain certificates for domains hosted within that zone.

  • Clearly document, or provide assistance in configuring a DNS resolver that will handle this for us. Users in restricted network environments could then allow just that DNS resolver access to the wider internet (or simply 8.8.8.8 et al).

  • ???

It seems the majority of users experiencing this are on AWS, so we've not heard many complaints about the lack of hostedZoneID field in other providers.

The field was actually not meant to be a part of cert-manager, but I didn't notice its addition until after we began shipping releases. Ideally, I'd like to remove the field before we reach 1.0.

In order to do so, we need to work out how to either auto-detect public/private zones for all DNS providers, or otherwise add some dns-provider specific configuration to our Certificate resource.
Related #783

/area provider-acme

@mikebryant

Collaborator

mikebryant commented Sep 13, 2018

I believe there's a third option, which is not as nice, but may provide an escape hatch for users.

From https://github.com/ietf-wg-acme/acme/blob/master/draft-ietf-acme-acme.md#retrying-challenges:
Clients SHOULD NOT respond to challenges until they believe that the server's queries will succeed.

I think this leaves open the possibility that we can just allow for disabling of the self check entirely, and have a user-specified timeout. This is essentially just hoping the user will configure things correctly to avoid using up their rate-limit, so such an option should come with a big warning.

However, it would be simple to implement and allow for people to bypass this issue if their local dns environment is a problem.

(e.g. the user can only use internal resolvers, and they don't have a resolver that will actually view the real state, they only have the internal zone to hand)

That said, I think the second option is better, but I don't know that it's viable in all environments.
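The escape hatch described in this comment could be sketched as below. This is only an illustration under stated assumptions: `wait_before_accepting` and all of its parameters are hypothetical names, not cert-manager options, and the self-check is passed in as a callback so the logic can be exercised without DNS.

```python
import time
from typing import Callable, Optional


def wait_before_accepting(
    self_check: Optional[Callable[[], bool]],
    fixed_wait: float,
    timeout: float,
    poll_interval: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Return True when the challenge should be submitted to the ACME server.

    self_check=None models "self-check disabled": wait the user-specified
    time and then proceed without any verification (hypothetical behaviour).
    """
    if self_check is None:
        sleep(fixed_wait)
        return True
    # Self-check enabled: poll until it passes or the timeout expires.
    deadline = clock() + timeout
    while True:
        if self_check():
            return True
        if clock() >= deadline:
            return False
        sleep(poll_interval)
```

With the self-check disabled, nothing protects the user from submitting a challenge that will fail, so each misconfiguration costs a validation attempt against the rate limit.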

@munnerz

Member

munnerz commented Sep 13, 2018

4. SHOULD NOT   This phrase, or the phrase "NOT RECOMMENDED" mean that
   there may exist valid reasons in particular circumstances when the
   particular behavior is acceptable or even useful, but the full
   implications should be understood and the case carefully weighed
   before implementing any behavior described with this label.

https://tools.ietf.org/html/rfc2119

The RFC does support this, and it describes this case quite well.

Given we now have a 'back-off' on re-issuance attempts (currently hardcoded at every 5 minutes), we could consider implementing this.

Users would still hit the quota for attempts quite quickly in failure cases. With #809 we could extend this backoff behaviour further as well.
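One way the back-off could be extended is an exponential schedule that starts at the current 5 minutes. The sketch below is an assumption for illustration only: the doubling factor, the one-day cap and the `retry_delay` name are not taken from cert-manager or from #809.

```python
def retry_delay(failures: int, base: float = 300.0, factor: float = 2.0,
                cap: float = 86400.0) -> float:
    """Seconds to wait before the next attempt after `failures` consecutive failures.

    base=300.0 mirrors the 5-minute back-off mentioned above; factor and cap
    are hypothetical values chosen for this sketch.
    """
    if failures < 1:
        return 0.0
    return min(cap, base * factor ** (failures - 1))


# First few delays: 5 min, 10 min, 20 min, 40 min, ...
schedule = [retry_delay(n) for n in range(1, 5)]
```

A growing delay like this slows down how fast a persistently failing configuration consumes validation attempts, at the cost of slower recovery once the problem is fixed.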

@jescarri

jescarri commented Sep 13, 2018

@munnerz / @mikebryant

For Route 53, the GetChange API call helps validate that the record is resolvable.

https://docs.aws.amazon.com/sdk-for-go/api/service/route53/#Route53.GetChange
https://docs.aws.amazon.com/Route53/latest/APIReference/API_GetChange.html

This is already done at https://github.com/jetstack/cert-manager/blob/master/pkg/issuer/acme/dns/route53/route53.go#L181-L193

Hence I think it is safe to assume that when R53 says INSYNC the record is resolvable.

That covers the Challenge validation.
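Polling GetChange until it reports INSYNC might look like the sketch below. `wait_for_insync` and its parameters are hypothetical names; the status lookup is injected as a callback so the loop runs without AWS credentials. With boto3, the callback could be `lambda: client.get_change(Id=change_id)["ChangeInfo"]["Status"]`, where `client` and `change_id` are assumed to come from an earlier record change call.

```python
import time
from typing import Callable


def wait_for_insync(
    get_status: Callable[[], str],
    attempts: int = 30,
    interval: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll get_status until it returns "INSYNC" or the attempts run out.

    Route 53 reports a change as either "PENDING" or "INSYNC". The attempt
    count and interval are arbitrary values chosen for this sketch.
    """
    for attempt in range(attempts):
        if get_status() == "INSYNC":
            return True
        if attempt < attempts - 1:
            sleep(interval)  # no need to sleep after the final attempt
    return False
```

INSYNC only says the change has propagated within Route 53. It does not say whether the record was written to the public or the private hosted zone, so the split-horizon problem described in the issue still applies.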

However, cert-manager also checks that the hostname requesting the certificate is resolvable. As far as I know, the DNS01 challenge doesn't require this, so why is the check implemented in cert-manager? It makes split DNS even harder.

I think the "Allow forcing all DNS queries to go through a single resolver" solution works perfectly if:

  • No negative caching is done at that DNS resolver.
  • Users are OK with the latency introduced by the upstream servers' caches. Could this be solved by decreasing the TTL of the TXT record?

In my case, I have a dnsmasq server just for cert-manager. It only forwards queries for the foo.bar zone to an outside DNS server (it could also be pointed at the authoritative NS), and it answers with a bogus IP address when cert-manager asks for a hostname that it is requesting a certificate for.

This would work like a charm if the "Allow forcing all DNS queries to go through a single resolver" option were implemented.

@zystem

zystem commented Sep 26, 2018

Additional info to ticket
#877
acme.sh works well: it does not make direct DNS requests and uses the allowed internal DNS server. The problem with Let's Encrypt appears only in cert-manager.

@juliohm1978

juliohm1978 commented Oct 10, 2018

We are also facing this issue, and @mikebryant's suggestion combined with a backoff policy seems very reasonable.

Split DNS resolves public and internal IPs for the same domains, depending on the origin. Worse yet, we cannot simply work around this using --dns01-self-check-nameservers. Due to security policies adopted here, none of our servers are allowed to query DNS externally, so using 1.1.1.1 or 8.8.8.8 is not an option in this case.
