Lowers the default nodelocaldns denial cache TTL #74093

blakebarnett · 2019-02-14T20:04:21Z

What type of PR is this?
/kind bug

What this PR does / why we need it:
Similar to `--no-negcache` on dnsmasq, this prevents problems for components that poll DNS for orchestration, such as operators managing StatefulSets. Negative caching can also be very confusing for users: a change they just made appears to be broken until the cache entry expires. This assumes that 5 seconds is a reasonable TTL that will still absorb repeated negative AAAA responses. We could also set the denial cache size to zero, which should effectively disable negative caching entirely (as dnsmasq is configured in kube-dns), but testing shows that this approach works well in our (albeit small) test clusters.

Which issue(s) this PR fixes:

Fixes #74092

Does this PR introduce a user-facing change?:

Reduces the cache TTL for negative responses to 5s minimum.

blakebarnett · 2019-02-14T20:04:28Z

/sig network

k8s-ci-robot · 2019-02-14T20:04:29Z

Hi @blakebarnett. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

MrHohn · 2019-02-14T20:07:54Z

/assign @prameshj

MrHohn · 2019-02-14T20:08:21Z

/ok-to-test

chrisohaver · 2019-02-14T20:22:23Z

In case negative responses in the cluster DNS domain are the primary concern here: these negative responses default to a TTL of 5 seconds as of CoreDNS 1.3.1.

e.g.

dnstools# dig notthere.default.svc.cluster.local. A

; <<>> DiG 9.11.3 <<>> notthere.default.svc.cluster.local. A
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: 34026
;; flags: qr aa rd; QUERY: 1, ANSWER: 0, AUTHORITY: 1, ADDITIONAL: 1
;; WARNING: recursion requested but not available

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
;notthere.default.svc.cluster.local. IN	A

;; AUTHORITY SECTION:
cluster.local.		5	IN	SOA	ns.dns.cluster.local. hostmaster.cluster.local. 1549979885 7200 1800 86400 5

;; Query time: 0 msec
;; SERVER: 10.96.0.10#53(10.96.0.10)
;; WHEN: Thu Feb 14 20:15:31 UTC 2019
;; MSG SIZE  rcvd: 168
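
The TTL a resolver may cache this NXDOMAIN for can be read off the SOA record in the authority section. As a rough sketch in Python, assuming RFC 2308 semantics (the negative TTL is the smaller of the SOA record's own TTL and its MINIMUM field); the helper name `negative_ttl_from_soa` is made up for illustration:

```python
def negative_ttl_from_soa(soa_line: str) -> int:
    """Return the negative-caching TTL implied by a dig-style SOA line.

    Assumes the single-line presentation format printed by dig:
    name TTL class SOA mname rname serial refresh retry expire minimum
    """
    fields = soa_line.split()
    if len(fields) != 11 or fields[3].upper() != "SOA":
        raise ValueError("not a single-line SOA record: %r" % soa_line)
    record_ttl = int(fields[1])    # TTL of the SOA record itself
    soa_minimum = int(fields[10])  # last SOA field (MINIMUM)
    # RFC 2308: negative answers are cached for min(SOA TTL, SOA MINIMUM).
    return min(record_ttl, soa_minimum)


# The SOA line from the dig output above.
soa = ("cluster.local.\t\t5\tIN\tSOA\tns.dns.cluster.local. "
       "hostmaster.cluster.local. 1549979885 7200 1800 86400 5")
print(negative_ttl_from_soa(soa))  # prints 5
```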

blakebarnett · 2019-02-14T20:41:15Z

Thanks!

I was unaware of that. I verified that we've been testing against a build using CoreDNS v1.2.5; I'll update and test again now.

blakebarnett · 2019-02-14T20:45:46Z

Ah, seems the latest published image is still only v1.2.6. @prameshj is upgrading to 1.3.1 and publishing an image viable?

chrisohaver · 2019-02-14T20:58:20Z

CoreDNS 1.3.1 is already in the k8s.gcr.io repo: k8s.gcr.io/coredns:1.3.1.

These negative responses originate from the cluster dns, not the node local dns. The version of the node local dns should not make a difference. IIUC

prameshj · 2019-02-14T21:10:51Z

> CoreDNS 1.3.1 is already in the k8s.gcr.io repo: k8s.gcr.io/coredns:1.3.1.
>
> These negative responses originate from the cluster dns, not the node local dns. The version of the node local dns should not make a difference. IIUC

Yes, that's how I understand it too: the cluster DNS sets the TTL on the record. If we use CoreDNS as the cluster DNS, the TTL on the record will be 5s, so there is no need to set a TTL on the cache side.
However, with kube-dns, the default TTL is 30s.
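
To make the difference concrete, here is a toy Python model of a denial cache that caps the upstream TTL. It is only a sketch: `NegativeCache`, `max_ttl`, the injectable clock, and the example names are hypothetical and are not taken from CoreDNS or the node-local DNS code.

```python
class NegativeCache:
    """Toy denial cache: remembers NXDOMAIN answers for a capped TTL."""

    def __init__(self, max_ttl, clock):
        self.max_ttl = max_ttl  # cap applied on the cache side, in seconds
        self.clock = clock      # callable returning the current time in seconds
        self._expires = {}      # name -> absolute expiry time

    def store_nxdomain(self, name, upstream_ttl):
        # The entry lives for the smaller of the upstream TTL and the cap.
        ttl = min(upstream_ttl, self.max_ttl)
        self._expires[name] = self.clock() + ttl
        return ttl

    def is_cached_nxdomain(self, name):
        expiry = self._expires.get(name)
        if expiry is None:
            return False
        if self.clock() >= expiry:
            del self._expires[name]  # expired: the next query goes upstream
            return False
        return True


now = [0.0]  # fake clock so the example is deterministic
cache = NegativeCache(max_ttl=5, clock=lambda: now[0])

# kube-dns style upstream: negative answers carry a 30s TTL, capped to 5s here.
kube_dns_ttl = cache.store_nxdomain("web-0.svc.cluster.local.", upstream_ttl=30)
# CoreDNS 1.3.1 style upstream: the TTL is already 5s, so the cap changes nothing.
coredns_ttl = cache.store_nxdomain("web-1.svc.cluster.local.", upstream_ttl=5)

now[0] = 6.0  # six seconds later
still_cached = cache.is_cached_nxdomain("web-0.svc.cluster.local.")
print(kube_dns_ttl, coredns_ttl, still_cached)  # prints: 5 5 False
```

With a 30s cap instead of 5s, the same lookup at six seconds would still be answered from the cache, which is the behaviour described for kube-dns without this change.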

blakebarnett · 2019-02-14T21:11:03Z

Ah, we've not moved to CoreDNS yet, so in this case I suppose this change is still needed.

blakebarnett · 2019-02-14T21:18:45Z

GKE is still defaulting to kube-dns also AFAIK (at least when we built our 1.11.5 cluster)

prameshj · 2019-02-14T21:30:38Z

In this change, the cache size for success and denial is specified as 10000, but it will be rounded down to the nearest multiple of 256 (9984), which is also the default cache size if nothing is specified.
https://github.com/coredns/coredns/tree/master/plugin/cache

So the only change here is specifying a lower TTL for denial, that looks good to me.
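
The rounding can be checked with a couple of lines of arithmetic. This Python snippet is only a sketch of the calculation described above, not the CoreDNS code; `effective_cache_size` and `ROUNDING_UNIT` are illustrative names.

```python
ROUNDING_UNIT = 256  # the multiple mentioned in the comment above


def effective_cache_size(requested: int, unit: int = ROUNDING_UNIT) -> int:
    """Round a requested capacity down to the nearest multiple of `unit`."""
    return (requested // unit) * unit


print(effective_cache_size(10000))  # 39 * 256 = 9984
```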

prameshj · 2019-02-14T21:30:47Z

/lgtm

prameshj · 2019-02-14T21:32:11Z

/priority important-soon
/approved

prameshj · 2019-02-21T23:13:31Z

/lgtm
/approve

MrHohn

/approve

k8s-ci-robot · 2019-02-21T23:15:19Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: blakebarnett, MrHohn, prameshj

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~cluster/addons/dns/OWNERS~~ [MrHohn]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

k8s-ci-robot added the needs-ok-to-test label Feb 14, 2019

k8s-ci-robot added the sig/network label and removed the needs-sig label Feb 14, 2019

k8s-ci-robot requested review from bowei and MrHohn February 14, 2019 20:04

k8s-ci-robot added the sig/cluster-lifecycle label Feb 14, 2019

k8s-ci-robot assigned prameshj Feb 14, 2019

k8s-ci-robot added the ok-to-test label and removed the needs-ok-to-test label Feb 14, 2019

k8s-ci-robot added the lgtm label Feb 14, 2019

k8s-ci-robot added the priority/important-soon label and removed the needs-priority label Feb 14, 2019

Match default cache size of 10000

46c299c

https://github.com/coredns/coredns/blob/master/plugin/cache/cache.go#L236 This gets rounded down to the nearest multiple of 256: 9984

k8s-ci-robot removed the lgtm label Feb 21, 2019

k8s-ci-robot added the lgtm label Feb 21, 2019

MrHohn approved these changes Feb 21, 2019

k8s-ci-robot added the approved label Feb 21, 2019

k8s-ci-robot merged commit 042f9ed into kubernetes:master Feb 22, 2019

blakebarnett deleted the lower-neg-cache-ttl branch February 22, 2019 18:19
