CoreDNS 1 replica only / slow detecting unhealthy node #3092

Closed
joaogl opened this issue Mar 18, 2021 · 2 comments
joaogl commented Mar 18, 2021

Environmental Info:
K3s Version: v1.20.4+k3s1

Node(s) CPU architecture, OS, and Version:
Ubuntu, AMD Epyc

Cluster Configuration:
3 masters, 2 agents, cloudflare LB

Describe the bug:
I can see that only one CoreDNS replica is deployed. If I kill the master where that replica is running, k3s takes around 5 minutes to recognize that the node is down and to redeploy a new replica on another master.
We should either be able to define more replicas from the installation script, or the detection of unhealthy nodes should be much faster.
The Cloudflare LB takes at most 60 seconds to mark the node as unhealthy, and kubectl get nodes reflects it almost instantly, but CoreDNS is still reported as Running on the unhealthy node for around 5 minutes.
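As a stopgap this can be worked around by hand after install; a minimal sketch, assuming the deployment name, namespace, and label used by the stock k3s CoreDNS manifest:

```sh
# Scale the packaged CoreDNS deployment to 2 replicas so a single master
# failure does not take DNS down with it.
kubectl -n kube-system scale deployment coredns --replicas=2

# Verify the replicas landed on different nodes.
kubectl -n kube-system get pods -l k8s-app=kube-dns -o wide
```

Note that k3s may reapply its bundled coredns manifest (which pins a single replica), so a manual scale can be reverted later; that is part of why exposing the replica count at install time would help.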

Steps To Reproduce:
Deploy k3s and leave the cluster unmodified afterwards (i.e. do not increase the CoreDNS replica count).
Check which node CoreDNS is running on, then kill that master node (see the commands below).
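For example, assuming the stock k3s namespace and label for CoreDNS:

```sh
# See which node the single CoreDNS pod is scheduled on.
kubectl -n kube-system get pods -l k8s-app=kube-dns -o wide

# Power off that master (or stop the k3s service on it), then watch how
# long the pod is still reported as Running before it is rescheduled.
kubectl -n kube-system get pods -l k8s-app=kube-dns -o wide -w
```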

Expected behavior:
Either a new CoreDNS pod should be started almost immediately, or multiple replicas should already be available. Traffic should be redirected to a healthy replica with almost no noticeable downtime.

Actual behavior:
It takes around 5 to 10 minutes for the cluster to recognize that CoreDNS is down and to redeploy a new instance.

brandond (Contributor) commented

The unhealthy-node detection time is governed by upstream Kubernetes; there are some parameters you can pass to the apiserver to tune this, but it will never be quite instantaneous, since Kubernetes isn't really architected that way. Running more replicas of the service is probably the best way to handle this.
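For reference, the roughly 5-minute delay matches the upstream defaults: the node controller marks the node NotReady after node-monitor-grace-period (40s by default), and pods are only evicted once the default not-ready/unreachable tolerations (300s by default) expire. A rough sketch of where those knobs would be passed on the k3s servers; the values are only illustrative, and overly aggressive settings can cause flapping:

```sh
k3s server \
  --kube-controller-manager-arg=node-monitor-grace-period=30s \
  --kube-apiserver-arg=default-not-ready-toleration-seconds=60 \
  --kube-apiserver-arg=default-unreachable-toleration-seconds=60
```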

There is some discussion at #1328 about gotchas to be aware of when disabling the built-in coredns manifest so that you can provide your own customized one.
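A minimal sketch of that approach (see #1328 for the actual caveats):

```sh
# Start every server with the packaged CoreDNS manifest disabled.
k3s server --disable coredns

# Then provide your own CoreDNS deployment (e.g. more replicas plus pod
# anti-affinity) by dropping a manifest into the auto-deploy directory
# on a server node:
#   /var/lib/rancher/k3s/server/manifests/coredns-custom.yaml
```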

stale bot commented Sep 14, 2021

This repository uses a bot to automatically label issues which have not had any activity (commit/comment/label) for 180 days. This helps us manage the community issues better. If the issue is still relevant, please add a comment to the issue so the bot can remove the label and we know it is still valid. If it is no longer relevant (or possibly fixed in the latest release), the bot will automatically close the issue in 14 days. Thank you for your contributions.

@stale stale bot added the status/stale label Sep 14, 2021
@stale stale bot closed this as completed Sep 28, 2021