
CPU spikes on the first control plane machine, when the second machine tries to join #214

Closed
yastij opened this issue Jun 2, 2021 · 12 comments
Labels
bug (Something isn't working), control plane, outside-cluster, STALE

Comments

@yastij
Collaborator

yastij commented Jun 2, 2021

In some cases, when using kube-vip as the control plane endpoint to bootstrap clusters, resource consumption (especially CPU) spikes on the first control plane machine when the second CP machine tries to join. This leads to a loss of quorum, which ultimately makes the bootstrap fail.

By relaxing the leader election params, clusters are able to bootstrap. Let's use this issue to track ways to mitigate this.

@yastij added the bug, control plane and outside-cluster labels on Jun 2, 2021
@nickperry

This is the problem I described here: https://twitter.com/nickwperry/status/1385687098322214919

VMware's recommended workaround of backing off the LeaderElection params from the defaults of 15, 10, 2 to 30, 20, 4 works in our environments. Running oversized control plane nodes (8 vCPU) also mitigates the issue successfully.
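For anyone mapping those three numbers onto kube-vip's configuration, here is a rough sketch of the leader election entries in the kube-vip static pod manifest. The env var names (vip_leaseduration, vip_renewdeadline, vip_retryperiod) are my assumption of how kube-vip exposes these settings, so double-check them against your own generated manifest:

```yaml
# Sketch only: relaxed LeaderElection timings (in seconds) from the workaround above.
# Env var names are assumed; verify them in your own kube-vip manifest.
env:
- name: vip_leaderelection
  value: "true"
- name: vip_leaseduration
  value: "30"   # default 15
- name: vip_renewdeadline
  value: "20"   # default 10
- name: vip_retryperiod
  value: "4"    # default 2
```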

@thebsdbox
Collaborator

I think the first port of call is to start with larger params (we introduce a lot of etcd thrashing when we join a second member to the cluster, causing the API server to fall over); we can then reduce the params once the cluster is stable and enjoy faster failover.

@yastij
Collaborator Author

yastij commented Jun 10, 2021

Another option is to support etcd learner mode in kubeadm and CAPI: https://etcd.io/docs/v3.3/learning/learner/#background cc @fabriziopandini

@thebsdbox
Collaborator

☝️ That seems like a great idea. It would certainly stop the etcd/API server flakiness.

@nickperry

There is an existing RFE for joining as a learner: kubernetes/kubeadm#1793

@fabriziopandini

From the kubeadm side the idea is still on the table, but I don't see this happening soon unless someone picks up the work.
As far as I remember, there were two main issues to be addressed with etcd learner mode:

  • Only one learner was allowed at a time; this implies some coordination for the parallel join scenario.
  • There was no easy/automatic mechanism for promoting learners to actual members.

But my information might be a little bit outdated...

@sammcgeown
Contributor

> This is the problem I described here: https://twitter.com/nickwperry/status/1385687098322214919
>
> VMware's recommended workaround of backing off the LeaderElection params from the defaults of 15, 10, 2 to 30, 20, 4 works in our environments. Running oversized control plane nodes (8 vCPU) also mitigates the issue successfully.

@nickperry could you give me details of the workaround you used for this? I'm struggling to build a cluster and seeing this behaviour. Thanks!

@nickperry

nickperry commented Jun 14, 2021

@sammcgeown the workaround is to make the LeaderElection timings more relaxed. On VMware TKGM 1.3.x you can do this in a YTT overlay file before building your workload cluster: copy the attached overlay file to ~/.tanzu/tkg/providers/infrastructure-vsphere/ytt/vsphere-overlay.yaml (ensure you remove the .txt suffix).

This workaround was provided by VMware R&D and relaxes the LeaderElection timings to 30, 20, 4.
vsphere-overlay.yaml.txt

@sammcgeown
Contributor

@nickperry thank you, that's perfect. For a kube-vip installation with kubeadm I'm just editing /etc/kubernetes/manifests/kube-vip.yaml to match those settings. It appears to be working for me now!
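For context, the edited static pod manifest ends up looking roughly like this. It is only a sketch under the same assumptions about the env var names, with only the relevant fields shown and a placeholder image tag:

```yaml
# Sketch of /etc/kubernetes/manifests/kube-vip.yaml after relaxing the timings;
# adapt field names and values to whatever your manifest already contains.
apiVersion: v1
kind: Pod
metadata:
  name: kube-vip
  namespace: kube-system
spec:
  containers:
  - name: kube-vip
    image: ghcr.io/kube-vip/kube-vip:<tag>   # keep the image your manifest already uses
    env:
    - name: vip_leaseduration
      value: "30"
    - name: vip_renewdeadline
      value: "20"
    - name: vip_retryperiod
      value: "4"
```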

@nickperry

@sammcgeown are you seeing the problem when manually deploying with kubeadm on top of existing VMs, rather than using CAPI? That would be quite a useful data point if so. Are you running static pods for etcd or independently managed etcd?

@sammcgeown
Contributor

Yes - Ubuntu 21.04 on Raspberry Pi 4s, with static pods for etcd.

@thebsdbox
Collaborator

Default timeouts have been improved and there have been no further discussions around this issue for some time.
