
Idea: DNS "diagnosis" program #45934

Closed
thockin opened this issue May 17, 2017 · 13 comments
Labels
area/dns · help wanted · lifecycle/rotten · sig/network

Comments

@thockin
Member

thockin commented May 17, 2017

We still have a lot of reports of DNS issues. It would be super cool to have a diagnoser program that could run as a pod in your cluster and gather information about DNS: how many replicas are running, how many restarts they have had, and the results of a bunch of DNS lookups of various kinds (in-cluster and out-of-cluster; A, PTR, SRV), collecting the latencies and dropped requests.

Something like `kubectl apply -f http://kubernetes.io/diagnose/dns.yaml && kubectl attach -ti dns-diagnoser | tee dns.out` or similar.

I'm filing this as help-wanted; it seems like something that a newcomer could tackle to learn how to use Kubernetes and produce a valuable result!
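
A minimal sketch, in Go, of the kind of query battery the diagnoser might run; the in-cluster names and the DNS service IP below are hypothetical examples, not part of any agreed design:

```go
package main

import (
	"context"
	"fmt"
	"net"
	"time"
)

// lookup runs one query with a timeout and reports its latency --
// roughly what a diagnoser pod could record per query type.
func lookup(label string, fn func(ctx context.Context) error) {
	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
	defer cancel()
	start := time.Now()
	err := fn(ctx)
	fmt.Printf("%-50s %-8v err=%v\n", label, time.Since(start).Round(time.Millisecond), err)
}

func main() {
	r := &net.Resolver{}
	// In-cluster A record (hypothetical service name).
	lookup("A kubernetes.default.svc.cluster.local", func(ctx context.Context) error {
		_, err := r.LookupHost(ctx, "kubernetes.default.svc.cluster.local")
		return err
	})
	// Out-of-cluster A record.
	lookup("A kubernetes.io", func(ctx context.Context) error {
		_, err := r.LookupHost(ctx, "kubernetes.io")
		return err
	})
	// SRV record (empty service/proto makes LookupSRV query the name directly).
	lookup("SRV _https._tcp.kubernetes.default.svc.cluster.local", func(ctx context.Context) error {
		_, _, err := r.LookupSRV(ctx, "", "", "_https._tcp.kubernetes.default.svc.cluster.local")
		return err
	})
	// PTR lookup of the cluster DNS service IP (hypothetical address).
	lookup("PTR 10.96.0.10", func(ctx context.Context) error {
		_, err := r.LookupAddr(ctx, "10.96.0.10")
		return err
	})
}
```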

@thockin added the area/dns, help wanted, and sig/network labels on May 17, 2017
@resouer
Contributor

resouer commented May 19, 2017

@thockin Where should the diagnose program live? The contrib repo?

@cmluciano

I think we should consider putting it in dns or creating a new repo. I find the contrib repo hard to navigate, and I believe there was a motion to split most components out into separate repos.

@thockin
Member Author

thockin commented May 24, 2017 via email

@fgimenez
Contributor

I'd like to work on this; where should I begin? I've been reading through https://github.com/kubernetes/dns. AIUI the Dockerfile of dns-diagnoser should live there and the build process should be changed to build it; is that OK? Also, about the initial suggestion, what should dns.yaml define?

Thanks!
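
No spec existed at this point; purely as a hypothetical illustration, dns.yaml might define little more than a one-shot pod running the diagnoser image, so that the `kubectl apply && kubectl attach` flow from the original proposal works:

```yaml
# Hypothetical manifest -- the image name and pod spec are illustrative,
# not an agreed design.
apiVersion: v1
kind: Pod
metadata:
  name: dns-diagnoser
spec:
  restartPolicy: Never        # run once, report, exit
  containers:
  - name: dns-diagnoser
    image: gcr.io/google-containers/dns-diagnoser:latest  # hypothetical image
    stdin: true               # so `kubectl attach -ti` can stream output
    tty: true
```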

@thockin
Member Author

thockin commented May 28, 2017 via email

@fgimenez
Contributor

@thockin great thx, on it

@someword

Based on my personal experience with cluster DNS issues, they are mostly short-lived, in the 1-10 minute range, and stop as abruptly as they started with nobody fixing anything. Would it be too heavyweight to have the proposed diagnostic tool running continuously to catch intermittent issues? In my situation, by the time I run the diagnostic tool the mysterious problem may have subsided and I've missed the event. An alternative to running the diagnostic tool continuously would be a document describing what data users should be gathering all the time to support post-incident review: things like UDP packet loss, conntrack drops, dnsmasq/kube-dns metrics, CPU/memory consumption of the kube-dns pod, etc. (see the sketch below for one example).
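
As one concrete example of that "gather all the time" data, a sketch in Go that dumps the kernel's UDP counters (InErrors, RcvbufErrors, and friends) from /proc/net/snmp. This is Linux-only and the field layout is kernel-dependent, so treat it as illustrative:

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
)

// Print the kernel's UDP counters from /proc/net/snmp. The file holds
// two "Udp:" lines: the first with column names, the second with values.
func main() {
	f, err := os.Open("/proc/net/snmp")
	if err != nil {
		panic(err)
	}
	defer f.Close()

	var header []string
	sc := bufio.NewScanner(f)
	for sc.Scan() {
		fields := strings.Fields(sc.Text())
		if len(fields) == 0 || fields[0] != "Udp:" {
			continue
		}
		if header == nil {
			header = fields[1:] // first Udp: line: column names
			continue
		}
		for i, v := range fields[1:] { // second Udp: line: values
			fmt.Printf("%-15s %s\n", header[i], v)
		}
	}
}
```

Scraped periodically (and paired with conntrack statistics and kube-dns metrics), counters like these would give a post-incident baseline even when no diagnoser pod was running at the time.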

@bowei
Member

bowei commented Jun 1, 2017

@someword Some of this could be integrated into the node problem detector? (https://github.com/kubernetes/node-problem-detector)

@thockin
Member Author

thockin commented Jun 2, 2017 via email

@fgimenez
Contributor

fgimenez commented Jun 7, 2017

@thockin Cool, thx a lot. I can add that description to an initial spec proposal for the process the diagnose tool should perform, so we can discuss the implementation further, WDYT? Also, in terms of the checks themselves (not managing the results), do you think DNSPerf could be useful?

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

Prevent issues from auto-closing with a /lifecycle frozen comment.

If this issue is safe to close now, please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or @fejta.
/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label on Dec 26, 2017
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now, please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or @fejta.
/lifecycle rotten
/remove-lifecycle stale

@k8s-ci-robot added the lifecycle/rotten label and removed the lifecycle/stale label on Jan 25, 2018
@fejta-bot

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or @fejta.
/close
