
Lowers the default nodelocaldns denial cache TTL #74093

Merged
merged 2 commits into from Feb 22, 2019

Conversation

@blakebarnett
Contributor

blakebarnett commented Feb 14, 2019

What type of PR is this?
/kind bug

What this PR does / why we need it:
Similar to --no-negcache on dnsmasq, this prevents issues for systems that poll DNS for orchestration, such as operators managing StatefulSets. Negative caching can also be very confusing for users: a change they just made appears to be broken until the cache expires. This assumes that 5 seconds is reasonable and will still absorb repeated negative AAAA responses. We could instead set the denial cache size to zero, which should effectively disable it entirely (like dnsmasq in kube-dns), but testing shows this approach works well in our (albeit small) test clusters.

Which issue(s) this PR fixes:

Fixes #74092

Does this PR introduce a user-facing change?:

Reduces the cache TTL for negative responses to a 5s minimum.
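For reference, the kind of Corefile change this implies can be sketched as follows. This is an illustrative fragment, not the exact nodelocaldns template: the zone name, capacities, and surrounding plugins are assumptions; the CoreDNS `cache` plugin takes `success CAPACITY [TTL]` and `denial CAPACITY [TTL]` sub-directives.

```
cluster.local:53 {
    errors
    cache {
        success 9984      # positive-response cache, default-sized capacity
        denial 9984 5     # cap cached negative (NXDOMAIN/NODATA) responses at 5s
    }
    forward . 10.96.0.10  # assumed upstream cluster DNS address
}
```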
@blakebarnett

Contributor Author

blakebarnett commented Feb 14, 2019

/sig network

@k8s-ci-robot

Contributor

k8s-ci-robot commented Feb 14, 2019

Hi @blakebarnett. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@MrHohn

Member

MrHohn commented Feb 14, 2019

/assign @prameshj

@MrHohn

Member

MrHohn commented Feb 14, 2019

/ok-to-test

@chrisohaver

Contributor

chrisohaver commented Feb 14, 2019

In case negative responses in the cluster DNS domain are the primary concern here: these negative responses default to a TTL of 5 as of CoreDNS 1.3.1.

e.g.

dnstools# dig notthere.default.svc.cluster.local. A

; <<>> DiG 9.11.3 <<>> notthere.default.svc.cluster.local. A
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: 34026
;; flags: qr aa rd; QUERY: 1, ANSWER: 0, AUTHORITY: 1, ADDITIONAL: 1
;; WARNING: recursion requested but not available

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
;notthere.default.svc.cluster.local. IN	A

;; AUTHORITY SECTION:
cluster.local.		5	IN	SOA	ns.dns.cluster.local. hostmaster.cluster.local. 1549979885 7200 1800 86400 5

;; Query time: 0 msec
;; SERVER: 10.96.0.10#53(10.96.0.10)
;; WHEN: Thu Feb 14 20:15:31 UTC 2019
;; MSG SIZE  rcvd: 168
@blakebarnett

Contributor Author

blakebarnett commented Feb 14, 2019

Thanks!

I was unaware of that. I verified we've been testing against a build using CoreDNS v1.2.5; I'll update and test again now.

@blakebarnett

Contributor Author

blakebarnett commented Feb 14, 2019

Ah, it seems the latest published image is still only v1.2.6. @prameshj, is upgrading to 1.3.1 and publishing an image viable?

@chrisohaver

Contributor

chrisohaver commented Feb 14, 2019

CoreDNS 1.3.1 is already in the k8s.gcr.io repo: k8s.gcr.io/coredns:1.3.1.

These negative responses originate from the cluster DNS, not the node-local DNS, so the version of the node-local DNS should not make a difference, IIUC.

@prameshj

Contributor

prameshj commented Feb 14, 2019

CoreDNS 1.3.1 is already in the k8s.gcr.io repo: k8s.gcr.io/coredns:1.3.1.

These negative responses originate from the cluster DNS, not the node-local DNS, so the version of the node-local DNS should not make a difference, IIUC.

Yes, that's how I understand it too: the cluster DNS is setting the TTL on the record. If we use CoreDNS as the cluster DNS, the TTL on the record will be 5s, so there is no need to set any TTL on the cache side.
However, with kube-dns, the default negative TTL is 30s.
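To make the interaction concrete, the cache's denial TTL acts as a cap on whatever TTL the cluster DNS put on the record. A minimal sketch (the helper name is hypothetical, and it assumes the cache simply caps the stored TTL at the configured denial TTL):

```python
def cached_negative_ttl(record_ttl_s: int, denial_cap_s: int = 5) -> int:
    """TTL a cached negative answer is served with, assuming the cache
    caps entries at the configured denial TTL (an assumption, not the
    literal CoreDNS implementation)."""
    return min(record_ttl_s, denial_cap_s)

# kube-dns's 30s default negative TTL gets capped to 5s by the cache;
# CoreDNS 1.3.1+ already hands back 5s, so the cap is a no-op there.
print(cached_negative_ttl(30))  # -> 5
print(cached_negative_ttl(5))   # -> 5
```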

@blakebarnett

Contributor Author

blakebarnett commented Feb 14, 2019

Ah, we've not moved to CoreDNS yet, so in this case I suppose this change is still needed.

@blakebarnett

Contributor Author

blakebarnett commented Feb 14, 2019

GKE is still defaulting to kube-dns also AFAIK (at least when we built our 1.11.5 cluster)

@prameshj

Contributor

prameshj commented Feb 14, 2019

In this change, the cache size for success and denial is specified as 10000, but it will be rounded down to the nearest multiple of 256, i.e. 9984, which is the default cache size when nothing is specified.
https://github.com/coredns/coredns/tree/master/plugin/cache

So the only change here is specifying a lower TTL for denial, that looks good to me.
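The rounding described above is easy to check with a quick computation (the helper name is hypothetical; it only models the round-down-to-a-multiple-of-256 behavior, not the actual CoreDNS code):

```python
def effective_cache_capacity(requested: int, bucket: int = 256) -> int:
    """Round a requested cache capacity down to the nearest multiple of `bucket`."""
    return (requested // bucket) * bucket

# A requested capacity of 10000 lands on the same value as the default.
print(effective_cache_capacity(10000))  # -> 9984
```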

@prameshj

Contributor

prameshj commented Feb 14, 2019

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm label Feb 14, 2019

@prameshj

Contributor

prameshj commented Feb 14, 2019

/priority important-soon
/approved

@k8s-ci-robot k8s-ci-robot removed the lgtm label Feb 21, 2019

@prameshj

Contributor

prameshj commented Feb 21, 2019

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm label Feb 21, 2019

@MrHohn

MrHohn approved these changes Feb 21, 2019

Member

MrHohn left a comment

/approve

@k8s-ci-robot

Contributor

k8s-ci-robot commented Feb 21, 2019

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: blakebarnett, MrHohn, prameshj

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot merged commit 042f9ed into kubernetes:master Feb 22, 2019

16 checks passed

cla/linuxfoundation blakebarnett authorized
pull-kubernetes-bazel-build Job succeeded.
pull-kubernetes-bazel-test Job succeeded.
pull-kubernetes-cross Skipped
pull-kubernetes-e2e-gce Job succeeded.
pull-kubernetes-e2e-gce-100-performance Job succeeded.
pull-kubernetes-e2e-gce-device-plugin-gpu Job succeeded.
pull-kubernetes-godeps Skipped
pull-kubernetes-integration Job succeeded.
pull-kubernetes-kubemark-e2e-gce-big Job succeeded.
pull-kubernetes-local-e2e Skipped
pull-kubernetes-node-e2e Job succeeded.
pull-kubernetes-typecheck Job succeeded.
pull-kubernetes-verify Job succeeded.
pull-publishing-bot-validate Skipped
tide In merge pool.

@blakebarnett blakebarnett deleted the blakebarnett:lower-neg-cache-ttl branch Feb 22, 2019
