Lowers the default nodelocaldns denial cache TTL #74093

Merged
merged 2 commits into from
Feb 22, 2019

blakebarnett

What type of PR is this?
/kind bug

What this PR does / why we need it:
Similar to `--no-negcache` on dnsmasq, this prevents problems for workloads that poll DNS for orchestration, such as operators managing StatefulSets. Negative caching can also confuse users: a change they just made appears broken until the cached denial expires. This assumes that 5 seconds is a reasonable TTL that will still absorb repeated negative AAAA responses. We could instead set the denial cache size to zero, which should effectively disable negative caching entirely, as dnsmasq does in kube-dns, but testing shows that the lower TTL works well in our (albeit small) test clusters.
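To illustrate the symptom described above, here is a minimal sketch of a denial cache with a TTL cap. It is a toy model, not CoreDNS code: the class `NegativeCache`, the helper `stale_window`, and the example name `web-0.web.default.svc.cluster.local.` are all hypothetical, and the 30-second upstream TTL is the kube-dns default mentioned later in this thread.

```python
from dataclasses import dataclass, field


@dataclass
class NegativeCache:
    """Toy denial cache: remembers NXDOMAIN answers for at most `max_ttl` seconds."""

    max_ttl: float
    _expires: dict = field(default_factory=dict)  # name -> expiry timestamp

    def store_denial(self, name: str, record_ttl: float, now: float) -> None:
        # The entry lives for the record's TTL, but never longer than the cap.
        self._expires[name] = now + min(record_ttl, self.max_ttl)

    def is_denied(self, name: str, now: float) -> bool:
        expiry = self._expires.get(name)
        if expiry is None or now >= expiry:
            self._expires.pop(name, None)  # expired entries are dropped
            return False
        return True


def stale_window(max_ttl: float, upstream_ttl: float = 30.0) -> float:
    """Seconds during which a just-created name still resolves as NXDOMAIN."""
    name = "web-0.web.default.svc.cluster.local."  # hypothetical StatefulSet pod
    cache = NegativeCache(max_ttl=max_ttl)
    # The name is queried (and denied) at t=0, then created right afterwards.
    cache.store_denial(name, upstream_ttl, now=0.0)
    t = 0.0
    while cache.is_denied(name, now=t):
        t += 1.0
    return t
```

Under these assumptions, `stale_window(30.0)` models a cache that honors the full upstream TTL, while `stale_window(5.0)` models the cap proposed here; the difference is how long a poller keeps seeing the stale denial.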

Which issue(s) this PR fixes:

Fixes #74092

Does this PR introduce a user-facing change?:

Reduces the maximum cache TTL for negative responses to 5s.

@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. kind/bug Categorizes issue or PR as related to a bug. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Feb 14, 2019
@blakebarnett
Author

/sig network

@k8s-ci-robot k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Feb 14, 2019
@k8s-ci-robot
Contributor

Hi @blakebarnett. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added sig/network Categorizes an issue or PR as relevant to SIG Network. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Feb 14, 2019
@k8s-ci-robot k8s-ci-robot added the sig/cluster-lifecycle Categorizes an issue or PR as relevant to SIG Cluster Lifecycle. label Feb 14, 2019
@MrHohn
Member

MrHohn commented Feb 14, 2019

/assign @prameshj

@MrHohn
Member

MrHohn commented Feb 14, 2019

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Feb 14, 2019
@chrisohaver
Contributor

In case negative responses in the cluster DNS domain are the primary concern here: as of CoreDNS 1.3.1, these negative responses default to a TTL of 5 seconds.

e.g.

dnstools# dig notthere.default.svc.cluster.local. A

; <<>> DiG 9.11.3 <<>> notthere.default.svc.cluster.local. A
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: 34026
;; flags: qr aa rd; QUERY: 1, ANSWER: 0, AUTHORITY: 1, ADDITIONAL: 1
;; WARNING: recursion requested but not available

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
;notthere.default.svc.cluster.local. IN	A

;; AUTHORITY SECTION:
cluster.local.		5	IN	SOA	ns.dns.cluster.local. hostmaster.cluster.local. 1549979885 7200 1800 86400 5

;; Query time: 0 msec
;; SERVER: 10.96.0.10#53(10.96.0.10)
;; WHEN: Thu Feb 14 20:15:31 UTC 2019
;; MSG SIZE  rcvd: 168
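As a rough aid for reading the output above: under RFC 2308, a resolver caches a negative answer for the lesser of the SOA record's own TTL and the SOA MINIMUM field (the last number in the record). The sketch below extracts both from a single-line SOA record as printed by `dig`; the function name `negative_ttl_from_soa` is hypothetical, and the parser assumes the record fits on one whitespace-separated line.

```python
def negative_ttl_from_soa(soa_line: str) -> int:
    """Return the negative-caching TTL implied by a single-line SOA record."""
    fields = soa_line.split()
    # Expected layout:
    # owner TTL class type MNAME RNAME serial refresh retry expire minimum
    if len(fields) != 11 or fields[3] != "SOA":
        raise ValueError("not a single-line SOA record")
    record_ttl = int(fields[1])    # TTL of the SOA record itself
    soa_minimum = int(fields[10])  # SOA MINIMUM field
    return min(record_ttl, soa_minimum)


# The AUTHORITY SECTION line from the dig output above, whitespace normalized.
SOA = (
    "cluster.local. 5 IN SOA ns.dns.cluster.local. "
    "hostmaster.cluster.local. 1549979885 7200 1800 86400 5"
)
print(negative_ttl_from_soa(SOA))
```

Here both values are 5, so the denial is cacheable for 5 seconds regardless of which one a resolver consults.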

@blakebarnett
Author

Thanks!

I was unaware of that. I verified that we've been testing against a build using CoreDNS v1.2.5; I'll update and test again now.

@blakebarnett
Author

Ah, it seems the latest published image is still only v1.2.6. @prameshj, is upgrading to 1.3.1 and publishing an image viable?

@chrisohaver
Contributor

chrisohaver commented Feb 14, 2019

CoreDNS 1.3.1 is already in the k8s.gcr.io repo: k8s.gcr.io/coredns:1.3.1.

IIUC, these negative responses originate from the cluster DNS, not the node-local DNS, so the version of the node-local DNS should not make a difference.

@prameshj
Contributor

> CoreDNS 1.3.1 is already in the k8s.gcr.io repo: k8s.gcr.io/coredns:1.3.1.
>
> These negative responses originate from the cluster dns, not the node local dns. The version of the node local dns should not make a difference. IIUC

Yes, that's how I understand it too: the cluster DNS sets the TTL on the record. If CoreDNS is the cluster DNS, the TTL on the record will be 5s, so there is no need to set a TTL on the cache side.
However, with kube-dns, the default TTL is 30s.
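The comparison above can be put in numbers with a small sketch. It assumes a simplified model in which the node-local cache keeps a denial for the upstream record TTL, capped by the cache's own maximum if one is configured; `cached_denial_ttl` and `UPSTREAM_TTLS` are hypothetical names, and the TTL values are the ones quoted in this thread.

```python
from typing import Optional

# Negative-response TTLs as stated in this thread (seconds).
UPSTREAM_TTLS = {"coredns": 5, "kube-dns": 30}


def cached_denial_ttl(upstream_ttl: int, cache_max_ttl: Optional[int] = None) -> int:
    """How long the node-local cache keeps a denial in this simplified model."""
    if cache_max_ttl is None:
        # Simplification: treat "no cap configured" as "honor the upstream TTL".
        return upstream_ttl
    return min(upstream_ttl, cache_max_ttl)


for backend, ttl in UPSTREAM_TTLS.items():
    print(backend, cached_denial_ttl(ttl), cached_denial_ttl(ttl, cache_max_ttl=5))
```

In this model the 5-second cap is a no-op when CoreDNS is upstream and only shortens the cached lifetime when kube-dns is upstream.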

@blakebarnett
Author

Ah, we've not moved to CoreDNS yet, so in this case I suppose this change is still needed.

@blakebarnett
Author

AFAIK, GKE also still defaults to kube-dns (at least it did when we built our 1.11.5 cluster).

@prameshj
Contributor

In this change, the cache size for both success and denial is specified as 10000, but it will be rounded down to the nearest multiple of 256, i.e. 9984, which is also the default cache size when nothing is specified.
https://github.com/coredns/coredns/tree/master/plugin/cache

So the only change here is specifying a lower TTL for denial. That looks good to me.
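The rounding mentioned above is simple integer arithmetic. A minimal sketch, taking the multiple of 256 from the comment itself rather than from the CoreDNS source (`effective_capacity` and `SHARDS` are hypothetical names):

```python
SHARDS = 256  # assumption: the multiple stated in the comment above


def effective_capacity(requested: int, shards: int = SHARDS) -> int:
    """Round a requested cache capacity down to the nearest multiple of `shards`."""
    return (requested // shards) * shards


print(effective_capacity(10000))  # 39 * 256 = 9984
```

Since 10000 // 256 is 39 and 39 * 256 is 9984, requesting 10000 yields the same capacity as the stated default.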

@prameshj
Contributor

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 14, 2019
@prameshj
Contributor

/priority important-soon
/approved

@k8s-ci-robot k8s-ci-robot added priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. and removed needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Feb 14, 2019
@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 21, 2019
@prameshj
Contributor

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 21, 2019
Member

@MrHohn MrHohn left a comment

/approve

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: blakebarnett, MrHohn, prameshj

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Feb 21, 2019
@k8s-ci-robot k8s-ci-robot merged commit 042f9ed into kubernetes:master Feb 22, 2019
@blakebarnett blakebarnett deleted the lower-neg-cache-ttl branch February 22, 2019 18:19
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/bug Categorizes issue or PR as related to a bug. lgtm "Looks good to me", indicates that a PR is ready to be merged. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. release-note Denotes a PR that will be considered when it comes time to generate release notes. sig/cluster-lifecycle Categorizes an issue or PR as relevant to SIG Cluster Lifecycle. sig/network Categorizes an issue or PR as relevant to SIG Network. size/XS Denotes a PR that changes 0-9 lines, ignoring generated files.
Development

Successfully merging this pull request may close these issues.

node-local-dns caching negative responses too long
5 participants