avoid excessive search lines in CI #1860

Closed
BenTheElder opened this issue Sep 18, 2020 · 17 comments · Fixed by #3178
Assignees
Labels
kind/bug Categorizes issue or PR as related to a bug. lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release.

Comments

@BenTheElder (Member)

There's not a good way to manage this in KIND: it's reasonable to expect the host to have sane DNS, and it's intentional that we use upstream DNS pointed at the host's DNS resolution. In CI, however, we pick up additional search paths by way of running inside another cluster. We're already using FQDNs for interacting with services in that cluster outside of KIND (namely the bazel build cache), so we shouldn't need search paths at all.

Basically we should mitigate this up front:
kubernetes/test-infra#19080 (comment)

/assign
/priority important-soon

xref: #303

@BenTheElder BenTheElder added the kind/bug Categorizes issue or PR as related to a bug. label Sep 18, 2020
@k8s-ci-robot k8s-ci-robot added the priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. label Sep 18, 2020
@aojea (Contributor) commented Sep 18, 2020

/cc

@aojea (Contributor) commented Sep 18, 2020

CoreDNS config kubernetes/kubernetes#94794 (comment)

@BenTheElder (Member, Author)

@aojea if you check the context link, the issue we had was in the kubelet sooo ...

@aojea (Contributor) commented Sep 19, 2020

Does not forwarding the queries to the upstream DNS server solve the problem?

@BenTheElder (Member, Author) commented Sep 20, 2020 via email

@aojea (Contributor) commented Sep 20, 2020

Sep 01 19:07:11 kind-worker kubelet[611]: E0901 19:07:11.320994     611 dns.go:125] Search Line limits were exceeded, some search paths have been omitted, the applied search line is: volume-expand-4184-7003.svc.cluster.local svc.cluster.local cluster.local test-pods.svc.cluster.local us-central1-b.c.k8s-infra-prow-build.internal c.k8s-infra-prow-build.internal

too many layers 😄

maybe we should add an option to kubelet in addition to the cluster-domain one?
https://kubernetes.io/docs/reference/command-line-tools-reference/kubelet/

--cluster-domain string
  Domain for this cluster. If set, kubelet will configure all containers to search this domain in addition to the host's search domains (DEPRECATED: This parameter should be set via the config file specified by the Kubelet's `--config` flag. See https://kubernetes.io/docs/tasks/administer-cluster/kubelet-config-file/ for more information.)

this will require a KEP but seems very simple to implement, something like:

--cluster-domain-only bool
  If set, kubelet will not add the host's search domains to the pods
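For reference, the existing --cluster-domain knob is already expressed as the clusterDomain field in the kubelet config file, so a new option of this kind would presumably live next to it. A rough sketch is below; the commented-out clusterDomainOnly field is purely hypothetical and only mirrors the flag proposed above:

  apiVersion: kubelet.config.k8s.io/v1beta1
  kind: KubeletConfiguration
  clusterDNS:
    - 10.96.0.10            # cluster DNS service IP (kind's default service subnet)
  clusterDomain: cluster.local
  # hypothetical, does not exist today: config-file mirror of the proposed
  # --cluster-domain-only flag
  # clusterDomainOnly: true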

@aojea (Contributor) commented Sep 20, 2020

🤔 I'm not sure a new flag will require a KEP, because it is not changing current behavior, just adding a new configuration option.

@thockin what do you think?
What would be the best way to avoid (globally) appending the host's resolv.conf search domains to pods?

@BenTheElder (Member, Author) commented Sep 20, 2020 via email

@aojea (Contributor) commented Sep 20, 2020

my point is that this is not only a KIND problem :) , it is possible that you don't want to have your host search domains in your pods, and AFAIK the possibilities that we have now are:

  • kubelet can use a DNS config file that is not the host's global one

    this option is not great because it forces the admins to keep both files in sync; it would require another layer of orchestration/configuration for the cluster

  • using pod dnsConfig https://kubernetes.io/docs/concepts/services-networking/dns-pod-service/#pod-dns-config

    that seems overkill if we can just tell the kubelet not to copy the search domains from the host
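For context, the per-pod route looks roughly like the sketch below; it works, but it has to be carried by every pod (or pod template) that should skip the host search domains, which is why it feels like overkill here. The nameserver address and search list are placeholders:

  apiVersion: v1
  kind: Pod
  metadata:
    name: dns-example
  spec:
    containers:
      - name: app
        image: registry.k8s.io/pause:3.9
    dnsPolicy: "None"            # ignore the kubelet-generated resolv.conf entirely
    dnsConfig:
      nameservers:
        - 10.96.0.10             # placeholder: the cluster DNS service IP
      searches:                  # only the cluster domains, no host search domains
        - default.svc.cluster.local
        - svc.cluster.local
        - cluster.local
      options:
        - name: ndots
          value: "5"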

@BenTheElder (Member, Author)

This gets a little confusing to talk about. Reminding anyone in the thread that the layering is:

[GKE Node] -- runs a kubelet, has GKE managed DNS config

[[GKE Pod / ProwJob Pod]] -- docker in docker runs here. We could manage this config ourselves in the same script that sets up docker.

[[[kind "node" container"]]] -- container created by kind to be a "node". We want this to respect the "host" DNS locally using docker embedded DNS. "host" here is the prowjob pod. we should improve the prowjob pod to be a better environment for running KIND.

[[[[kind cluster pods]]]] -- not actually relevant here, the problem in kubernetes/test-infra#19080 (comment) was with kubelet, at the level above (kind "node" container)

> my point is that this is not only a KIND problem :) , it is possible that you don't want to have your host search domains in your pods, and AFAIK the possibilities that we have now are

My point was that it's not a KIND problem, it's a "kind inside kubernetes" problem, which is not something reasonable to optimize for in Kubernetes.

> that seems overkill if we can just tell the kubelet not to copy the search domains from the host

In this case we need the GKE kubelet to do that, which we're not going to be able to configure regardless of whatever upstream options are available 🙃, since it's managed. Upstream options to customize DNS for the kubelet already exist, though.

Just for the GKE cluster pods in which we run kind, we want to reduce the searches; we don't need them in the inner cluster nodes.


We should reduce ndots and searches at the prowjob pod level to look more like a typical host. We're already hacking around Kubernetes in abnormal ways to do the docker-in-docker bit, so tweaking DNS there is not a big deal.
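A minimal sketch of that tweak, assuming it lives in the same wrapper script that already sets up docker-in-docker for the prowjob pod (the path, values, and approach are illustrative, not the actual test-infra change):

  #!/usr/bin/env bash
  # Rewrite the prowjob pod's resolv.conf before starting docker/kind so the
  # kind "node" containers inherit a host-like DNS config: keep the outer
  # cluster's nameserver, drop its search domains, and use the default ndots.
  set -o errexit -o nounset -o pipefail
  nameserver="$(awk '/^nameserver/ {print $2; exit}' /etc/resolv.conf)"
  printf 'nameserver %s\noptions ndots:1\n' "${nameserver}" > /etc/resolv.conf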

@aojea (Contributor) commented Sep 21, 2020

ic, I was trying to kill two birds with one stone, but it seems one was not really a bird 😄 and after looking at the other today I'm not sure that would solve it rather than just mitigate it

@BenTheElder BenTheElder changed the title from "avoid excessive search lines in" to "avoid excessive search lines in CI" on Sep 23, 2020
@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Dec 22, 2020
@BenTheElder (Member, Author)

This one is not the worst thing ever, but it should definitely happen eventually. When we finish sorting out the cgroups in the entrypoint I'm going to try to take a moment to port the dind fixes, including CGROUP_PARENT and this.

@kubernetes-sigs kubernetes-sigs deleted a comment from fejta-bot Jan 14, 2021
@BenTheElder BenTheElder added lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Jan 14, 2021
@KenMacD commented Dec 8, 2022

I'm not sure if this is the same issue, so please tell me if I should open a new one. I've seen similar behaviour when using a host with an entry in the search domains that points to a domain hosted at Cloudflare with DNSSEC enabled.

In these cases a combination of factors leads to things like service DNS lookups failing. Here's what happens:

  • A golang service compiled against musl attempts to look up 'service.ns.svc.cluster.local'
  • Because Kubernetes/kind set ndots to 5, this query is tried against the search domains first
  • service.ns.svc.cluster.local.ns.svc.cluster.local is tested, NXDOMAIN is received
  • service.ns.svc.cluster.local.svc.cluster.local is tested, NXDOMAIN
  • service.ns.svc.cluster.local.cluster.local is tested, NXDOMAIN
  • service.ns.svc.cluster.local.HOSTDOMAIN is tested. Cloudflare does not return an NXDOMAIN

Because of what Cloudflare returns, musl gives up on resolving the domain, meaning services can no longer be found. This would happen with any golang program linked against musl, but with the default ndots=1 it's rarely seen.

Workarounds exist: either use service.ns to look up the name, or add a . to the end of the name to force a direct lookup.
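To illustrate those two workarounds (assuming an nslookup binary is available in the pod):

  # short form: fewer than 5 dots, so the cluster search list resolves it
  # via the svc.cluster.local suffix before any host domain is consulted
  nslookup service.ns
  # trailing dot: an absolute (fully qualified) name, so the search list,
  # including the host's domains, is skipped entirely
  nslookup service.ns.svc.cluster.local.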

@BenTheElder (Member, Author)

You can customize ndots on your pod, but I recommend not using musl in Kubernetes. There's a whole history of DNS resolver issues there that is not specific to KIND.

@KenMacD commented Dec 8, 2022

Thanks @BenTheElder. Yes, some more workarounds are:

  • Not using musl
  • GODEBUG=netdns=go
  • Setting dns_searches = ["."] in containers.conf, though this affects more than just the cluster. I have no reason for my cluster to need the host search domains, so it works for me.
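In case it helps anyone else, that last setting lives in the [containers] table of containers.conf (a sketch for a podman / containers-common setup; the file path, e.g. /etc/containers/containers.conf, is an assumption here):

  [containers]
  # hand containers a root-only search list instead of inheriting the
  # host's resolv.conf search domains
  dns_searches = ["."]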

You mentioned setting ndots on the pod... there's still no way to set that globally on the cluster, is there? Also, I figured that if ndots was specifically set to 5 instead of the default 1 there was probably a good reason; is there?

@BenTheElder (Member, Author)

Even if there were a way to configure ndots globally on the cluster, that configuration would be highly non-standard and the workload would be broken on other clusters, which isn't really in alignment with the spirit of kind enabling conformant cluster testing.

The host searches are awkward; I don't know the best way for kind to handle this. Ideally the host environment should be reasonable. Previously on this issue we were discussing clusters inside clusters, which causes this problem but can be solved by configuring the DNS on the pod in the outer cluster that kind runs within.

It's 5 by default because of service SRV records.
https://dev.to/imjoseangel/tune-up-your-kubernetes-application-performance-with-a-small-dns-configuration-1o46
This has a better explanation of that already written 😅
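For reference, a pod in a kind (or any default Kubernetes) cluster typically ends up with a resolv.conf along these lines (the nameserver IP and namespace vary), which is why any name with fewer than five dots walks the whole search list before being tried as-is:

  nameserver 10.96.0.10
  search default.svc.cluster.local svc.cluster.local cluster.local
  options ndots:5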

@BenTheElder (Member, Author)

The e2e script should adopt #3097
