Kube-dns add-on should accept option ndots for SkyDNS or document ConfigMap alternative subPath #33554

Closed
bogdando opened this issue Sep 27, 2016 · 36 comments
Labels: area/dns, sig/network

Comments

@bogdando

bogdando commented Sep 27, 2016

BUG REPORT:

Kubernetes version (use kubectl version):
Kubernetes v1.3.5+coreos.0,
Kube-dns add-on v19 from gcr.io/google_containers/kubedns-amd64:1.7

Environment:

  • Cloud provider or hardware configuration: on-premise VMs
  • OS (e.g. from /etc/os-release): Ubuntu Xenial
  • Kernel (e.g. uname -a): 4.4.0-38-generic
  • Install tools: Kube-dns add-on v19 from gcr.io/google_containers/kubedns-amd64:1.7
  • Others:

What happened:
ndots:5 is hardcoded to the containers' base /etc/resolv.conf by kubelet running with --cluster_dns, --cluster_domain, --resolv-conf=/etc/resolv.conf flags.

What you expected to happen:
ndots should be configurable via the kubedns app definition or a ConfigMap, to let users choose whether lookups against skydns attempt absolute domains first or go through the search domains, e.g.:

        args:
        # command = "/kube-dns"
        - --domain=cluster.local.
        - --ndots=5
        - --dns-port=10053

AFAICT, DNS SRV records expect ndots:7 and thus will fail to resolve via skydns (or maybe not! #33554 (comment)).
Also, this might affect DNS performance by generating undesired additional queries for the search subdomains before the absolute domain is actually tried, which happens whenever the number of dots in the initial query is below the given ndots threshold.

How to reproduce it (as minimally and precisely as possible):
Deploy the kube-dns cluster add-on, then check the containers' /etc/resolv.conf within pods.

Anything else we need to know:
This is rather a docs issue, see #35525 (comment) for details and please address that in docs.

@MrHohn
Member

MrHohn commented Oct 3, 2016

Why would resolving SRV records fail?

Suppose a hostname with X dots is queried and X is below the ndots threshold. Won't it have the search paths appended before it is sent to the name server?

For the SRV record _my-port-name._my-port-protocol.my-svc.my-namespace.svc.cluster.local, I think the inputs below will resolve properly:

  • _my-port-name._my-port-protocol.my-svc.my-namespace
  • _my-port-name._my-port-protocol.my-svc.my-namespace.svc
  • _my-port-name._my-port-protocol.my-svc.my-namespace.svc.cluster.local

Given that the search paths are:

  • default.svc.cluster.local
  • svc.cluster.local
  • cluster.local

Please correct me if I am missing something.

@macb

macb commented Oct 5, 2016

Though SRV records may not fail, the ndots configuration should be exposed somewhere. The current 5 is a pretty rough default to expect everyone to adhere to. It results in something like 10+ dns queries per non-cluster domain attempted in our clusters.

#14051 (comment) describes some scenarios where lowering it "won't work". That doesn't actually seem to be the case, it just changes the behavior from first resolving all search domains and then trying the absolute domain to instead attempting the absolute domain then trying the search domains.

cc @thockin (linked your comment above)

Our clusters use cluster domains along the lines of: .int.clustername.region.internal.my.domain
This allows us to manage PKI off of our .internal.my.domain domain. It also makes search-domain expansion quite pricey when combined with the host's /etc/resolv.conf search domains (6 configured search domains).

We have an existing application that lives outside our clusters at:
application.internal.my.domain

The 3 dots means we'll attempt the 6 cluster search domains before actually trying the absolute domain (which ultimately is resolved by the regional DNS servers outside the k8s cluster). That's 12 DNS queries (A and AAAA for each search domain) which ultimately fail since application.internal.my.domain doesn't exist within our cluster.

If instead we could configure ndots to, say, ndots:3, the absolute domain would be attempted first. If it NXDOMAINs, all of the search domains will be applied just as before. A value of 3 in this case would also keep the previous behavior of application, application.default, and application.default.svc utilizing the search domains first.

In any case, it would give the option to the user to decide which is more important for their usage; attempting absolute domains or utilizing the search domains.
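A rough way to see where the 12 queries come from is sketched below. It assumes a glibc-style resolver that sends one A and one AAAA query per search domain and only tries the absolute name first when the name has at least ndots dots. The function name `queriesBeforeAbsolute` is made up for illustration.

```go
package main

import (
	"fmt"
	"strings"
)

// queriesBeforeAbsolute estimates how many DNS queries are spent on
// search-list candidates before the name is tried as given.
// Assumptions: one A and one AAAA query per search domain, and every
// search candidate fails because the name only exists outside the cluster.
func queriesBeforeAbsolute(name string, ndots, searchDomains int) int {
	dots := strings.Count(strings.TrimSuffix(name, "."), ".")
	if strings.HasSuffix(name, ".") || dots >= ndots {
		return 0 // the name is tried as an absolute name first
	}
	return searchDomains * 2
}

func main() {
	name := "application.internal.my.domain" // 3 dots

	fmt.Println(queriesBeforeAbsolute(name, 5, 6)) // 12
	fmt.Println(queriesBeforeAbsolute(name, 3, 6)) // 0

	// A short in-cluster name still goes through the search list with ndots:3.
	fmt.Println(queriesBeforeAbsolute("application.default.svc", 3, 6)) // 12
}
```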

@thockin
Member

thockin commented Oct 6, 2016

I don't think that behavior is consistent. The resolvers I have seen never try search if the query has >= ndots dots.

$ k run thdbg --rm --restart=Never -ti --image=ubuntu
Waiting for pod default/thdbg to be running, status is Pending, pod ready: false
Waiting for pod default/thdbg to be running, status is Pending, pod ready: false
Waiting for pod default/thdbg to be running, status is Pending, pod ready: false
Waiting for pod default/thdbg to be running, status is Pending, pod ready: false
If you don't see a command prompt, try pressing enter.

root@thdbg:/# apt-get update >/dev/null 2>&1

root@thdbg:/# apt-get install -y dnsutils >/dev/null 2>&1

root@thdbg:/# cat /etc/resolv.conf 
search default.svc.cluster.local svc.cluster.local cluster.local google.internal c.thockin-dev.internal
nameserver 10.0.0.10
options ndots:5

root@thdbg:/# cat > /etc/resolv.conf << EOF
> search default.svc.cluster.local svc.cluster.local cluster.local google.internal c.thockin-dev.internal
> nameserver 10.0.0.10
> options ndots:1
> EOF

root@thdbg:/# cat /etc/resolv.conf 
search default.svc.cluster.local svc.cluster.local cluster.local google.internal c.thockin-dev.internal
nameserver 10.0.0.10
options ndots:1

root@thdbg:/# nslookup kubernetes
Server:     10.0.0.10
Address:    10.0.0.10#53

Non-authoritative answer:
Name:   kubernetes.default.svc.cluster.local
Address: 10.0.0.1

root@thdbg:/# nslookup kubernetes.default
Server:     10.0.0.10
Address:    10.0.0.10#53

** server can't find kubernetes.default: NXDOMAIN

@macb

macb commented Oct 6, 2016

From the resolv.conf man page:

The default for n is 1, meaning that if there are any dots in a name, the name will be tried first as an absolute name before any search list elements are appended to it.

This implies search domains should still be respected even if the absolute domain is attempted. It does seem that not everything respects that, though. Seems like something still best left up to the user to determine for themselves?

@thockin
Member

thockin commented Oct 6, 2016

We had a rough proposal from someone to add a DNS policy for "ClusterNoSearch" or something that was the same as before, but only set ndots:1 for the requesting pod.

I would accept such a PR...


@macb

macb commented Oct 6, 2016

tl;dr:

It seems like libbind/netresolv and musl are consistent in ignoring search domains when the ndots option is met. However glibc (and other similar implementations such as golang's dns impl) would continue to respect search domains with a lower ndots.

I wouldn't want to just go from ndots:5 to ndots:1 since the vast majority of search signaling could still be preserved with ndots:3 in our case. Though given the choice ndots:1 is likely more useful for users that have a lot of services outside their clusters since it will be 1 failed absolute domain request then falling back to search for internal requests instead of 10+ failed searches before falling back to the absolute domain for external services.

Do you have the ClusterNoSearch proposal handy? A search didn't turn anything up in issues.


With ndots:1 and an otherwise normal k8s resolv.conf:

nslookup

All of the bind utilities (dig, nslookup, etc) use libbind (or whatever it's called).

root@macb-debug:/# nslookup kubernetes.default
Server:         172.17.240.2
Address:        172.17.240.2#53

** server can't find kubernetes.default: NXDOMAIN

And tcpdump from that request where it just tries the absolute:

17:35:56.959854 IP 172.17.41.7.44089 > 172.17.240.2.53: 56391+ A? kubernetes.default. (36)
17:35:56.963720 IP 172.17.240.2.53 > 172.17.41.7.44089: 56391 NXDomain 0/0/0 (36)

This makes sense given bind documents ndots as:

+ndots=D
Set the number of dots that have to appear in name to D for it to be considered absolute. The default value is that defined using the ndots statement in /etc/resolv.conf, or 1 if no ndots statement is present. Names with fewer dots are interpreted as relative names and will be searched for in the domains listed in the search or domain directive in /etc/resolv.conf.

curl

However, if I use curl, it will resolve:

root@macb-debug:/# curl http://kubernetes.default -v
* Rebuilt URL to: http://kubernetes.default/
* Hostname was NOT found in DNS cache
*   Trying 172.17.240.1...

And tcpdump from that request where we can see it trying the absolute then falling back to search domains:

17:34:14.753159 IP 172.17.41.7.42383 > 172.17.240.2.53: 5843+ A? kubernetes.default. (36)
17:34:14.753246 IP 172.17.41.7.42383 > 172.17.240.2.53: 5736+ AAAA? kubernetes.default. (36)
17:34:14.753510 IP 172.17.240.2.53 > 172.17.41.7.42383: 5843 NXDomain 0/0/0 (36)
17:34:14.753604 IP 172.17.240.2.53 > 172.17.41.7.42383: 5736 NXDomain 0/0/0 (36)
17:34:14.753727 IP 172.17.41.7.60146 > 172.17.240.2.53: 48770+ A? kubernetes.default.default.svc.int.frog.nyc3.internal.my.domain. (88)
17:34:14.753798 IP 172.17.41.7.60146 > 172.17.240.2.53: 46767+ AAAA? kubernetes.default.default.svc.int.frog.nyc3.internal.my.domain. (88)
17:34:14.754942 IP 172.17.240.2.53 > 172.17.41.7.60146: 46767 NXDomain 0/1/0 (259)
17:34:14.755472 IP 172.17.240.2.53 > 172.17.41.7.60146: 48770 NXDomain 0/1/0 (259)
17:34:14.755721 IP 172.17.41.7.34184 > 172.17.240.2.53: 36544+ A? kubernetes.default.svc.int.frog.nyc3.internal.my.domain. (80)
17:34:14.755860 IP 172.17.41.7.34184 > 172.17.240.2.53: 26345+ AAAA? kubernetes.default.svc.int.frog.nyc3.internal.my.domain. (80)
17:34:14.756634 IP 172.17.240.2.53 > 172.17.41.7.34184: 36544 1/0/0 A 172.17.240.1 (96)
17:34:14.756660 IP 172.17.240.2.53 > 172.17.41.7.34184: 26345 0/0/0 (80)

golang (1.7)

Tried out a simple go program(go1.7):

package main

import "net/http"

func main() {
        http.Get("http://kubernetes.default")
}

root@macb-debug:~# ./main

And the tcpdump:

17:42:55.799352 IP 172.17.41.7.54780 > 172.17.240.2.53: 49389+ AAAA? kubernetes.default. (36)
17:42:55.801342 IP 172.17.41.7.52301 > 172.17.240.2.53: 61981+ A? kubernetes.default. (36)
17:42:55.805505 IP 172.17.240.2.53 > 172.17.41.7.54780: 49389 NXDomain 0/0/0 (36)
17:42:55.805636 IP 172.17.240.2.53 > 172.17.41.7.52301: 61981 NXDomain 0/0/0 (36)
17:42:55.806171 IP 172.17.41.7.45104 > 172.17.240.2.53: 32960+ AAAA? kubernetes.default.default.svc.int.frog.nyc3.internal.my.domain. (88)
17:42:55.806411 IP 172.17.41.7.47877 > 172.17.240.2.53: 41316+ A? kubernetes.default.default.svc.int.frog.nyc3.internal.my.domain. (88)
17:42:55.807575 IP 172.17.240.2.53 > 172.17.41.7.47877: 41316 NXDomain 0/1/0 (259)
17:42:55.807816 IP 172.17.240.2.53 > 172.17.41.7.45104: 32960 NXDomain 0/1/0 (259)
17:42:55.808179 IP 172.17.41.7.54148 > 172.17.240.2.53: 11120+ AAAA? kubernetes.default.svc.int.frog.nyc3.internal.my.domain. (80)
17:42:55.808339 IP 172.17.240.2.53 > 172.17.41.7.54148: 11120 0/0/0 (80)
17:42:55.808365 IP 172.17.41.7.40203 > 172.17.240.2.53: 65043+ A? kubernetes.default.svc.int.frog.nyc3.internal.my.domain. (80)
17:42:55.808502 IP 172.17.240.2.53 > 172.17.41.7.40203: 65043 1/0/0 A 172.17.240.1 (96)

The golang stdlib spells out the case nicely in the unix dnsclient (https://golang.org/src/net/dnsclient_unix.go).

musl

musl documents its difference at least:

queries with fewer dots than the ndots configuration variable are processed with search first then tried literally (just like glibc), but those with at least as many dots as ndots are only tried in the global namespace (never falling back to search, which glibc would do if the name is not found in the global DNS namespace)
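The difference between the two families of resolvers can be sketched as an ordered list of candidate names. This is an illustrative model of the behavior described in this comment, not code taken from any of these libraries; the `candidates` function and its `fallbackToSearch` flag are hypothetical names.

```go
package main

import (
	"fmt"
	"strings"
)

// candidates returns the names a resolver would try, in order.
// fallbackToSearch=true models the glibc / Go behavior observed above;
// fallbackToSearch=false models the musl / libbind behavior.
func candidates(name string, ndots int, search []string, fallbackToSearch bool) []string {
	if strings.HasSuffix(name, ".") {
		return []string{name} // already fully qualified, never searched
	}
	absolute := name + "."
	var expanded []string
	for _, s := range search {
		expanded = append(expanded, name+"."+s+".")
	}
	if strings.Count(name, ".") < ndots {
		// Fewer dots than ndots: search first, then the literal name.
		return append(expanded, absolute)
	}
	if fallbackToSearch {
		// Enough dots: literal name first, then fall back to search.
		return append([]string{absolute}, expanded...)
	}
	// Enough dots and no fallback: only the literal name is tried.
	return []string{absolute}
}

func main() {
	search := []string{"default.svc.cluster.local", "svc.cluster.local", "cluster.local"}

	fmt.Println("glibc-like:", candidates("kubernetes.default", 1, search, true))
	fmt.Println("musl-like: ", candidates("kubernetes.default", 1, search, false))
}
```

With ndots:1, the glibc-like list still reaches kubernetes.default.svc.cluster.local after the absolute attempt fails, while the musl-like list stops at kubernetes.default., which matches the curl and nslookup traces above.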

@thockin
Member

thockin commented Oct 7, 2016

So we cannot depend on this behavior. Unfortunately, resolv.conf behavior is under-specified here and in other ways.

The "proposal" was an offhand remark in a bug or email or something. If this is really a pain point, I would be happy for someone to open a new proposal (small) on this.


@bogdando
Author

This conversation definitely filled the gaps I had in understanding ndots in action, thank you for the details. I've updated the bug description. This doesn't change the relevance of the bug, though.

@bogdando
Author

@thockin the hardcoded ndots:5 is still a pain point, regardless of the implementation details of the DNS stack. This keeps this issue relevant.

@bogdando
Author

bogdando commented Nov 3, 2016

NOTE: this is rather a docs issue, see #35525 (comment)
I suggest fixing this in the docs, as described by @fluxrad.

@bogdando changed the title from "Kube-dns add-on should accept option ndots for SkyDNS in order to resolve SRV DNS records" to "Kube-dns add-on should accept option ndots for SkyDNS or document ConfigMap alternative subPath" on Nov 3, 2016
@jkemp101

jkemp101 commented Dec 9, 2016

This issue snuck up and bit me. I was surprised two DNS pods couldn’t handle the load of a 7 node/81 pod cluster when the node running the DNS containers did a GC to delete old docker images. I’m finding it hard to accept a design that causes a single connection to api.example.com to result in 8 failed NXDOMAIN responses before I get the correct address. We can scale kube-dns, dnsmasq, etcd instances as much as we want but it just seems wrong.

I’ve started testing the configmap workaround for my cluster. There is a chance the dnsmasq --neg-ttl option might help with caching the NXDOMAIN responses but I prefer the configmap option for now so I haven’t played with it.

I’m curious what the real world use cases are for having ndots set to 5. I’m a little new to k8s but I have traditionally always configured critical software to use FQDNs when possible. The only use case I can think of is if you are trying to find a service that is in your namespace but you don’t know what namespace you are in. What am I missing? When do you have to rely on the search list?

And more importantly am I going to break something if I run the majority of my services with a ndots set at 1? Can we make sure no Kubernetes components are built requiring ndots = 5 which would then potentially restrict users running it set to 1.

@MrHohn
Member

MrHohn commented Dec 9, 2016

cc @bowei

@thockin
Member

thockin commented Dec 10, 2016

I apologize for the girth of this, but I have a lot to say :)

This is a tradeoff between automagic and performance. There are a number of considerations that went into this design. I can explain them, but of course reasonable people can disagree.

  1. Same-namespace lookups are the vast majority of lookups, so we need "my local namespace" in the search path.
  2. There are multiple "classes" of things that exist in DNS so the class must be part of the DNS name.
  3. Services are the vast majority of lookups so class "svc" must be in the search path and ndots must be >= 1 (e.g. name must resolve)
  4. Reasonable people want to configure the cluster zone suffix (e.g. for corp names, multi-cluster, etc)

Ergo, the name of a Service is $service.$namespace.svc.$zone, and $namespace.svc.$zone is the first search path.

  5. The second most common lookup is cross-namespace services, e.g. the kubernetes master in the default ns, so cross-namespace lookups should be easy
  6. Because of (4), we don't really want apps to hardcode the FQDN (bad for portability), so svc.$zone must be in the search path and ndots must be >= 2 (e.g. kubernetes.default must resolve)
  7. Because of (2) and (4), same-namespace and cross-namespace lookups of non-service names should be easy. Therefore $zone must be in the search path and ndots must be >= 3 (e.g. name.namespace.svc must resolve).
  8. Because of (1) and (4), and the fact that petsets have per-endpoint names, local and cross-namespace petnames should be easy. Given the previous search paths and ndots >= 4, we can ensure that petname.service.namespace.svc resolves.
  9. We also support SRV records of the form _$port._$proto.$service.$namespace.svc.$zone. Given (6) and (2), we must enable _$port._$proto.$service.$namespace.svc to resolve. That requires ndots = 5.

This explains how we got to ndots = 5.

  10. Given (9) and (8), we pathologically get _$port._$proto.$petname.$service.$namespace.svc.$zone. Therefore we must enable _$port._$proto.$petname.$service.$namespace.svc to resolve. That actually requires ndots = 6, unless I can't count. In truth, SRV doesn't make much sense except for Services, but now we have federated services...

We did not change ndots to 6 because this is getting out of hand. I'd very much like to revisit some of the assumptions and the schema. The problem, of course, is how to make a transition, once we have a better schema.
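The dot counting behind the list above can be checked with a small sketch. It relies only on the rule that a relative name is expanded through the search list first when it has fewer dots than ndots, so the smallest workable ndots is the number of dots plus one. The helper `minNdots` is a hypothetical name, and the sample names are the shortened forms mentioned in the list.

```go
package main

import (
	"fmt"
	"strings"
)

// minNdots returns the smallest ndots value for which a relative name
// still has fewer dots than ndots, i.e. is still searched first.
func minNdots(name string) int {
	return strings.Count(name, ".") + 1
}

func main() {
	names := []string{
		"name",               // same-namespace service
		"kubernetes.default", // cross-namespace service
		"name.namespace.svc", // class-qualified name
		"petname.service.namespace.svc",
		"_port._proto.service.namespace.svc",
		"_port._proto.petname.service.namespace.svc",
	}
	for _, n := range names {
		fmt.Printf("%-45s needs ndots >= %d\n", n, minNdots(n))
	}
}
```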

Consider an alternative:

  • The canonical name for a service becomes $service.s.$ns.$zone
  • The pathological case for SRV becomes _$port._$proto.$petname.$service.s.$ns.$zone
  • Most common lookup being same-namespace services, search path = s.$ns.$zone (nslookup myservice)
  • Second most common lookup being cross-namespace services, search path += $zone (nslookup kubernetes.s.default)

That is a better, safer, more appropriate schema that only requires 2 search paths, and in fact, you could argue that it only REQUIRES $zone, while the other is sugar. People love sugar. This still leaves pathologically ndots = 6.

If we exposed $zone through downward API (as we do $namespace), then maybe we don't need so much magic. I'm reticent to require $zone to access the kube-master, but maybe we can get away with ndots = 3 (kubernetes.s.default) or ndots = 4 (petname.kubernetes.s.default). That's not much better.

We could mitigate some of the perf penalties by always trying names as upstream FQDNs first, but that means that all intra-cluster lookups get slower. Which do we expect more frequently? I'll argue intra-cluster names, if only because the TTL is so low. So that's the wrong tradeoff. Also, that is a client-side thing, so we'd have to implement server-side logic to do search expansion. But namespace is really variable, so it's some hybrid. Blech.

OTOH, good caching means that external lookups are slow the first time but fast subsequently. So that's where we've been focused. The schema change would be nice (uses fewer search domains, but is a little more verbose), but requires some serious ballet to transition, and we have not figured that out.

Now, we could make a case for a new DNSPolicy value that cut down on search paths and ndots. We could even make a case for per-namespace defaults that override the global API defaults. We can't make a global change, and I doubt we can make a per-cluster change because ~every example out there will break.

@johnbelamaric (spec)
@matchstick (fyi)
@madhusudancs (federation)
@kubernetes/sig-network (discuss)
@smarterclayton (smart guy)
@jbeda (smart guy)

@jbeda
Contributor

jbeda commented Dec 11, 2016

First off -- I'm totally in favor of moving to a schema where the "class" is under namespace, i.e. $service.s.$ns.$zone. (Note that this is the schema we picked for GCE: $host.c.$project.internal)

Second -- SRV records are uncommon enough that it is probably okay to make that use case less smooth. Something similar can be set for petsets. Anything using petsets like that will require some elbow grease already. We could expose the zone and namespace as env variables to smooth this over.

That means we have 2 cases we care about:

  • same-namespace service: $otherservice -> $otherservice.s.$namespace.$zone
  • cross-namespace service: $otherservice.s.$othernamespace -> $otherservice.s.$othernamespace.$zone

That means ndots needs to be 3, right?

Can we have it both ways here? Can we have a local DNS cache that can cache and answer these queries super fast? If we make this be per-node (in the proxy or kubelet or a new binary {ug}) then it is faster, cheaper and config scales with cluster size.

@thockin
Member

thockin commented Dec 12, 2016 via email

@caseydavenport
Member

SRV records are uncommon enough that it is probably okay to make that use case less smooth. Something similar can be set for petsets

Yes, reading through this thread I found myself reaching the same place in my head. ndots=3 seems like the right value assuming we also switch service.namespace to service.s.namespace, otherwise I think we could get away with ndots=2 using the current schema, yeah?

OTOH, good caching means that external lookups are slow the first time but fast subsequently. So that's where we've been focused.

I think that's probably the right place to focus.

a smallish per-node cache ... This does mean that DNS will not get original client IPs, but I think we can live with that.

You mean kube-dns won't see the client IPs? Yeah, that could be interesting in the context of the multi-tenant DNS discussions that have popped up before. I guess so long as each tenant gets its own cache and that cache forwards to the right tenant kube-dns service...

@bogdando
Author

bogdando commented Dec 12, 2016

@thockin it's interesting you've mentioned the per-node cache in front of the KubeDNS app. That's how Kargo currently configures DNS (see the Dnsmasq svc/dset in the drawing). However, I fail to see how that cache would fix or improve the situation for hostnet pods, which rely on the hosts' /etc/resolv.conf files; those hosts sometimes start to behave badly when given options ndots:5 there.

PS. The transition is always the hard part, but that is not a problem given properly organized change management (deprecation rules) and docs, right?

@johnbelamaric
Member

For the transition, as long as we don't have a namespace that is "svc" or "pod", the server can differentiate between the new schema and the old one, so both can be active at the same time. We could use a different DnsPolicy on the client side with the new search and ndots and eventually, with good lead time, make it the default.

@jkemp101

Thanks everyone for continuing this discussion so we can figure out our best options. Here are two thoughts:

  • I would very much like to see a new DNS Policy that has ndots=1. That would let us configure pods with high external DNS requests to not strain the cluster DNS or suffer other performance issues. For instance, I will have many pods that will be doing a majority of non-cluster DNS lookups and would regularly use this DNS Policy. Making it a new DNS Policy will ensure this is a tested/supported configuration in the future versus the config map workaround technique.
  • With DNS caching we need to factor in that a lot (maybe most) of these responses are NXDOMAIN so that the caching needs to handle that. I think NXDOMAIN responses are often not cached in a “typical” configuration. And if they are cached they have a very short TTL causing them to not be as beneficial as normal positive responses.

@johnbelamaric
Member

@thockin @caseydavenport kube-dns not seeing the client IPs can be mitigated by having the local cache append it (and/or other data) as an EDNS0 option.

@johnbelamaric
Member

If and when that becomes necessary.

@jkemp101

Just as a point of reference. My relatively small cluster was running 1,246 packets per second for all DNS related traffic in the cluster with the default settings. After I implemented the config map workaround for most of the pods to set ndots to 1 the same cluster is now running at 109 pps for DNS traffic.

@thockin
Member

thockin commented Dec 13, 2016 via email

@bowei
Member

bowei commented Dec 13, 2016

@thockin: I'll take a crack at the proposal
@jkemp101: 10x QPS, yikes

@zihaoyu

zihaoyu commented Mar 14, 2017

@tonylambiris Could you elaborate why --no-negcache is a good idea? From what I learned negative caching is good but that flag disables it. Maybe I misunderstood something.

@BrianGallew

Frankly, I don't understand the reluctance to allow this to be configurable per site (or better yet, per container). "We're smart enough to solve this for everyone" is not realistic. There are a number of assumptions in the SkyDNS design which, while undoubtedly true for the given developer's environment, are clearly not true for many of us who are trying to get work done, only to run into this issue.

@thockin thockin added sig/network Categorizes an issue or PR as relevant to SIG Network. and removed sig/network Categorizes an issue or PR as relevant to SIG Network. labels May 16, 2017
@bowei
Member

bowei commented May 25, 2017

/assign

@shivangipatwardhan2018

I am having this exact problem and wish to change ndots to 1. @jkemp101 mentioned a configmap workaround. What is that, and where can I find it? Or is there another workaround to set ndots = 1?
Thank you!

@jkemp101

jkemp101 commented Aug 3, 2017

@shivangipatwardhan2018 Just create a resolv.conf file without the options ndots:5 line (look at the file in an existing pod). Create a config map with that new resolv.conf. Then add something like this to your deployments to override the k8s-provided resolv.conf with the file in your config map.

        volumeMounts:
        - name: resolv-conf
          mountPath: /etc/resolv.conf
          subPath: resolv.conf
...
      volumes:
        - name: resolv-conf
          configMap:
            name: resolv.conf
            items:
            - key: resolv.conf
              path: resolv.conf
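For completeness, the ConfigMap referenced by that volume could look roughly like the following. This is only a sketch: the search list and nameserver address are placeholders borrowed from values seen earlier in this thread, and the real values should be copied from the /etc/resolv.conf of an existing pod in the same namespace. Leaving out the options line means resolvers fall back to their default of ndots:1.

```yaml
# Hypothetical ConfigMap backing the resolv-conf volume above.
# The search and nameserver values are placeholders; copy the real ones
# from a running pod, since the first search domain is namespace-specific.
apiVersion: v1
kind: ConfigMap
metadata:
  name: resolv.conf
data:
  resolv.conf: |
    search default.svc.cluster.local svc.cluster.local cluster.local
    nameserver 10.0.0.10
```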

@bowei
Member

bowei commented Dec 12, 2017

https://github.com/kubernetes/community/blob/master/contributors/design-proposals/network/pod-resolv-conf.md

This feature is in 1.9 as alpha

@bowei
Member

bowei commented Dec 20, 2017

I am going to close this as we now have dnsConfig
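As a sketch of what that looks like in a pod spec, the fragment below lowers ndots for a single pod through the dnsConfig options list while keeping the cluster DNS policy. The pod name, image, and command are placeholders, and because the feature is alpha in 1.9 it may need to be enabled explicitly on the cluster.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: ndots-example        # placeholder name
spec:
  containers:
  - name: app
    image: ubuntu            # placeholder image
    command: ["sleep", "3600"]
  dnsPolicy: ClusterFirst    # keep the cluster nameserver and search list
  dnsConfig:
    options:
    - name: ndots
      value: "1"             # overrides the default ndots:5 for this pod only
```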
