CoreDNS 1 replica only / slow detecting unhealthy node #3092

Closed
joaogl opened this issue Mar 18, 2021 · 2 comments
joaogl commented Mar 18, 2021

Environmental Info:
K3s Version: v1.20.4+k3s1

Node(s) CPU architecture, OS, and Version:
Ubuntu, AMD Epyc

Cluster Configuration:
3 masters, 2 agents, cloudflare LB

Describe the bug:
I can see that only one CoreDNS replica is deployed. If I kill the master where that replica is running, k3s takes around 5 minutes to recognize that the node is down and to redeploy a new replica on another master.
We should either be able to define more replicas from the installation script, or the detection of unhealthy nodes should be much faster.
The Cloudflare LB takes at most 60 seconds to mark the node as unhealthy, and kubectl get nodes reflects it almost instantly, but CoreDNS is still reported as Running on the unhealthy node for around 5 minutes.
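As a stopgap this can be worked around by hand after install; a minimal sketch, assuming the deployment name, namespace, and label used by the stock k3s CoreDNS manifest:

```sh
# Scale the packaged CoreDNS deployment to 2 replicas so a single master
# failure does not take DNS down with it.
kubectl -n kube-system scale deployment coredns --replicas=2

# Verify the replicas landed on different nodes.
kubectl -n kube-system get pods -l k8s-app=kube-dns -o wide
```

Note that k3s may reapply its bundled coredns manifest (which pins a single replica), so a manual scale can be reverted later; that is part of why exposing the replica count at install time would help.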

Steps To Reproduce:
Deploy k3s and leave the cluster unmodified afterwards (i.e. do not increase the CoreDNS replica count).
Check which node CoreDNS is running on, then kill that master node (see the commands below).
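For example, assuming the stock k3s namespace and label for CoreDNS:

```sh
# See which node the single CoreDNS pod is scheduled on.
kubectl -n kube-system get pods -l k8s-app=kube-dns -o wide

# Power off that master (or stop the k3s service on it), then watch how
# long the pod is still reported as Running before it is rescheduled.
kubectl -n kube-system get pods -l k8s-app=kube-dns -o wide -w
```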

Expected behavior:
Either a new CoreDNS pod should be started almost immediately, or multiple replicas should already be available. Traffic should be redirected to a healthy replica with almost no noticeable downtime.

Actual behavior:
It takes around 5 to 10 minutes for the cluster to recognize that CoreDNS is down and to redeploy a new instance.

brandond (Contributor) commented

The unhealthy-node detection time is governed by upstream Kubernetes; there are some parameters you can pass to the apiserver to tune this, but it will never be quite instantaneous, since Kubernetes isn't really architected that way. Running more replicas of the service is probably the best way to handle this.
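For reference, the roughly 5-minute delay matches the upstream defaults: the node controller marks the node NotReady after node-monitor-grace-period (40s by default), and pods are only evicted once the default not-ready/unreachable tolerations (300s by default) expire. A rough sketch of where those knobs would be passed on the k3s servers; the values are only illustrative, and overly aggressive settings can cause flapping:

```sh
k3s server \
  --kube-controller-manager-arg=node-monitor-grace-period=30s \
  --kube-apiserver-arg=default-not-ready-toleration-seconds=60 \
  --kube-apiserver-arg=default-unreachable-toleration-seconds=60
```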

There is some discussion at #1328 about gotchas to be aware of when disabling the built-in coredns manifest so that you can provide your own customized one.
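A minimal sketch of that approach (see #1328 for the actual caveats):

```sh
# Start every server with the packaged CoreDNS manifest disabled.
k3s server --disable coredns

# Then provide your own CoreDNS deployment (e.g. more replicas plus pod
# anti-affinity) by dropping a manifest into the auto-deploy directory
# on a server node:
#   /var/lib/rancher/k3s/server/manifests/coredns-custom.yaml
```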

stale bot commented Sep 14, 2021

This repository uses a bot to automatically label issues which have not had any activity (commit/comment/label) for 180 days. This helps us manage the community issues better. If the issue is still relevant, please add a comment to the issue so the bot can remove the label and we know it is still valid. If it is no longer relevant (or possibly fixed in the latest release), the bot will automatically close the issue in 14 days. Thank you for your contributions.

@stale stale bot added the status/stale label Sep 14, 2021
@stale stale bot closed this as completed Sep 28, 2021