
ETCD becomes the bottleneck of cluster when we have 1K+ nodes #20540

Closed
mqliang opened this issue Feb 3, 2016 · 23 comments
Labels
priority/backlog Higher priority than priority/awaiting-more-evidence. sig/scalability Categorizes an issue or PR as relevant to SIG Scalability.

Comments

@mqliang
Contributor

mqliang commented Feb 3, 2016

etcd becomes the bottleneck of the cluster when we have 1K+ nodes.

After close scrutiny, the cause is that the 1K+ nodes each frequently send PUT requests to etcd to update their status (every 10s with the default configuration), which puts a heavy load on etcd.

May I suggest, instead:
_Let the NodeController actively perform the health checks itself. That way we could significantly reduce the number of requests sent to etcd._

IIRC, in Kubernetes v0.12 the NodeController actively performed the health checks (DoChecks), but since v1.0 the kubelet updates its own status. Is there a good reason for this change? @gmarek @fgrzadkowski

By the way, I think it may be time to think about making etcd horizontally scalable.
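
To make the write pattern concrete: with the default configuration every kubelet posts its NodeStatus on a 10-second timer, so 1,000 nodes amount to roughly 100 status PUTs per second flowing through the apiserver into etcd. A minimal sketch of that loop, assuming a hypothetical `putNodeStatus` helper rather than the actual kubelet code:

```go
package main

import (
	"log"
	"time"
)

// putNodeStatus stands in for the apiserver call the kubelet makes to
// persist its NodeStatus; in reality this ends up as a write to etcd.
func putNodeStatus(nodeName string) error {
	log.Printf("PUT /api/v1/nodes/%s/status", nodeName)
	return nil
}

func main() {
	const statusUpdateFrequency = 10 * time.Second // kubelet default at the time

	ticker := time.NewTicker(statusUpdateFrequency)
	defer ticker.Stop()

	// One such loop runs on every node, so the aggregate write rate scales
	// linearly with cluster size: 1,000 nodes ≈ 100 writes/second.
	for range ticker.C {
		if err := putNodeStatus("node-1"); err != nil {
			log.Printf("status update failed: %v", err)
		}
	}
}
```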

@mqliang
Contributor Author

mqliang commented Feb 3, 2016

For a large cluster, it's not reasonable to have just one health-check routine. We could use several routines to perform the health checks, and the number of routines could be self-adapting (see the sketch below). Any thoughts?
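
A rough sketch of such a pool of health-check routines, with the worker count adapting to the node count; the `checkNode` stub and the one-worker-per-100-nodes ratio are illustrative assumptions, not an existing Kubernetes mechanism:

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// checkNode stands in for the NodeController probing one node's health,
// e.g. by hitting the kubelet's healthz endpoint; stubbed out here.
func checkNode(node string) bool {
	time.Sleep(10 * time.Millisecond)
	return true
}

// workersFor picks a self-adapting worker count: one goroutine per 100 nodes,
// with a floor of 1. The ratio is an arbitrary illustrative choice.
func workersFor(nodeCount int) int {
	n := nodeCount / 100
	if n < 1 {
		n = 1
	}
	return n
}

func healthCheckAll(nodes []string) {
	jobs := make(chan string)
	var wg sync.WaitGroup

	for i := 0; i < workersFor(len(nodes)); i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for node := range jobs {
				if !checkNode(node) {
					fmt.Printf("node %s unhealthy\n", node)
				}
			}
		}()
	}

	for _, node := range nodes {
		jobs <- node
	}
	close(jobs)
	wg.Wait()
}

func main() {
	nodes := make([]string, 1000)
	for i := range nodes {
		nodes[i] = fmt.Sprintf("node-%d", i)
	}
	healthCheckAll(nodes)
}
```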

@adohe-zz

adohe-zz commented Feb 3, 2016

+1, multiple health-check routines are necessary for a large cluster.

@xiang90
Contributor

xiang90 commented Feb 3, 2016

@mqliang In etcd v3 we introduced a concept called a lease. A client can acquire a lease and keep it alive at low cost. 10k nodes doing keepalives should not be a problem in general.
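
A minimal sketch of that lease keepalive pattern with the etcd v3 Go client (clientv3); the endpoint, key, and 30-second TTL are placeholder choices:

```go
package main

import (
	"context"
	"log"
	"time"

	"github.com/coreos/etcd/clientv3"
)

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"localhost:2379"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	ctx := context.Background()

	// Acquire a lease with a 30-second TTL.
	lease, err := cli.Grant(ctx, 30)
	if err != nil {
		log.Fatal(err)
	}

	// Attach the node's liveness key to the lease; if keepalives stop,
	// the key disappears when the TTL expires.
	if _, err := cli.Put(ctx, "/nodes/node-1/alive", "ok", clientv3.WithLease(lease.ID)); err != nil {
		log.Fatal(err)
	}

	// KeepAlive sends cheap heartbeats instead of rewriting the full
	// node status every 10 seconds.
	ch, err := cli.KeepAlive(ctx, lease.ID)
	if err != nil {
		log.Fatal(err)
	}
	for resp := range ch {
		log.Printf("lease %x refreshed, TTL=%d", resp.ID, resp.TTL)
	}
}
```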

@xiang90
Contributor

xiang90 commented Feb 3, 2016

Also, even with today's setup, hundreds of PUTs per second is actually not a big problem for etcd.

@xiang90
Contributor

xiang90 commented Feb 3, 2016

BTW, I am not opposed to adding a "centralized" control functionality that coalesces node keepalives, but I would like to see the results/motivation for it. If you can show with some data that this is indeed a problem, that would be great.

@mqliang
Contributor Author

mqliang commented Feb 3, 2016

@xiang90 We have a cluster with 1K+ nodes. Hundreds of PUTs per second _alone_ is not a big problem, but there are many other requests. When we observed etcd under heavy pressure, as a workaround we made the NodeController do the health checks itself and moved the Pod and Node resources to their own dedicated etcd clusters.

BTW, we observe that etcd writes a snapshot every few minutes, and while it is writing the snapshot, a lot of requests time out. IIRC, etcd v2 writes full snapshots; does etcd v3 have a plan to support incremental snapshots?

@xiang90
Contributor

xiang90 commented Feb 3, 2016

> but there are many other requests.

What percentage of all requests do node updates account for? I would suggest investigating the most significant part first.

> When we observed etcd under heavy pressure,

I think we need data to demonstrate the issue.

> we observe that etcd writes a snapshot every few minutes, and while it is writing the snapshot, a lot of requests time out

I do not know why. Have you tried to figure out the root cause of this?

> etcd v2 writes full snapshots; does etcd v3 have a plan to support incremental snapshots?

v3 does incremental snapshots.

@mqliang
Contributor Author

mqliang commented Feb 3, 2016

In addition, it's hard to monitor the status of the etcd cluster. When etcd is under heavy pressure, requests time out, but CPU/memory usage shows no significant increase, which makes it hard for us to monitor and alert.

@xiang90
Contributor

xiang90 commented Feb 3, 2016

> In addition, it's hard to monitor the status of the etcd cluster. When etcd is under heavy pressure, requests time out, but CPU/memory usage shows no significant increase, which makes it hard for us to monitor and alert.

What exactly is the heavy pressure? Can you reproduce this with etcd itself?

@xiang90
Contributor

xiang90 commented Feb 3, 2016

@mqliang Again, to report any etcd issue you have to reproduce it with etcd itself. These metrics basically mean nothing to etcd.

@gmarek
Contributor

gmarek commented Feb 3, 2016

The reason was architectural, not related to performance. We don't want to establish connections from the master machine to anywhere; this allows, for example, having a hosted master somewhere that oversees on-prem Nodes. Currently that's possible and safe, as all communication is initiated by the Nodes.

I don't think that the fact that Kubelet updates node status is a performance problem by itself. I think that the problem is that we keep those timestamps as a part of NodeStatus. I wrote a proposal to change this, but it was kind of rejected: #14735

@magicwang-cn
Contributor

/cc @kubernetes/huawei

@wojtek-t wojtek-t added priority/backlog Higher priority than priority/awaiting-more-evidence. sig/scalability Categorizes an issue or PR as relevant to SIG Scalability. team/control-plane labels Feb 3, 2016
@wojtek-t
Member

wojtek-t commented Feb 3, 2016

@kubernetes/sig-scalability

@timothysc
Member

Could you please describe your deployment and tests in greater detail?

  1. What tests are you running? (Is it high churn?)
  2. Have you load-balanced your api-servers?
  3. Have you sharded events?
  4. Could you grab your api-server metrics to show the counts of operations, as well as their latencies? (See the sketch after this comment.)
  5. Are you running a fully secure deployment?
    ...

The more detail, the better.
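
For item 4, a rough way to pull the request counts and latencies, assuming the apiserver's insecure local port is reachable at localhost:8080 and that the relevant series start with the `apiserver_request_` prefix (both assumptions about this particular deployment):

```go
package main

import (
	"bufio"
	"fmt"
	"net/http"
	"strings"
)

func main() {
	resp, err := http.Get("http://localhost:8080/metrics")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	scanner := bufio.NewScanner(resp.Body)
	for scanner.Scan() {
		line := scanner.Text()
		// Keep only the request count / latency series.
		if strings.HasPrefix(line, "apiserver_request_") {
			fmt.Println(line)
		}
	}
	if err := scanner.Err(); err != nil {
		panic(err)
	}
}
```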

@magicwang-cn
Contributor

ref #18266

@mqliang
Contributor Author

mqliang commented Feb 4, 2016

Some details:

  1. We removed the rate limit of kube-controller-manager/kube-scheduler.
  2. We don't connect through the secure server, so the max-requests-inflight flag doesn't take effect.
  3. We use k8s v1.0, which has no cache in the API server, and kube-controller-manager relists every 5 minutes (AFAIK in the latest version this has been changed to 12h); frequent list operations put a heavy load on etcd.
  4. Didn't shard Events; all resources are in the same etcd cluster.
  5. No load balancing, since multiple API servers without quorum reads may return stale values (see the sketch after this comment). See "add a knob to enable quorum read" #20145 and "should introduce a knob to enable quorum read of etcd for HA" #19902.

I think I have found the cause of the problem:

  1. frequent re-listing
  2. all resources are in the same etcd cluster
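
To illustrate the stale-read concern in point 5: a plain etcd v2 read may be served from a follower that lags the leader, while a quorum read goes through the leader. A minimal sketch with the etcd v2 Go client; the endpoint and registry key are placeholders:

```go
package main

import (
	"context"
	"fmt"
	"log"

	"github.com/coreos/etcd/client"
)

func main() {
	c, err := client.New(client.Config{
		Endpoints: []string{"http://127.0.0.1:2379"},
	})
	if err != nil {
		log.Fatal(err)
	}
	kapi := client.NewKeysAPI(c)

	// Default read: may be answered from a follower's local state and
	// therefore return a slightly stale value.
	stale, err := kapi.Get(context.Background(), "/registry/minions/node-1", nil)
	if err != nil {
		log.Fatal(err)
	}

	// Quorum read: goes through the leader, so it reflects the latest
	// committed value, at the cost of extra latency.
	fresh, err := kapi.Get(context.Background(), "/registry/minions/node-1",
		&client.GetOptions{Quorum: true})
	if err != nil {
		log.Fatal(err)
	}

	fmt.Println(stale.Node.Value, fresh.Node.Value)
}
```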

@mqliang
Contributor Author

mqliang commented Feb 4, 2016

May I ask why in v1.0 kube-controller-manager relists every 5 minutes? Does it mean the watch of etcd is not reliable, so that we should periodically relist?

@magicwang-cn
Contributor

> Does it mean the watch of etcd is not reliable, so that we should periodically relist?

+1, the list-watch mechanism is really puzzling. But if you enable the watch cache, the watch/relist activity happens mostly on the kube-apiserver side, which reduces etcd's load.

@xiang90 On the etcd side, if watch() returns an error and we do not list() again, might some changes be missed?
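
For the question above, the usual answer is the list-then-watch pattern: list once to get a snapshot and its revision, watch from the next revision, and relist whenever the watch breaks. A minimal sketch against the etcd v3 Go client (not Kubernetes' actual reflector code); the endpoint and prefix are placeholders:

```go
package main

import (
	"context"
	"log"
	"time"

	"github.com/coreos/etcd/clientv3"
)

// listAndWatch lists once to get a consistent snapshot and its revision,
// then watches from the next revision. If the watch fails (e.g. the
// revision has been compacted away), it falls back to another list
// instead of silently missing changes.
func listAndWatch(cli *clientv3.Client, prefix string) {
	for {
		// List: a consistent snapshot of all keys under the prefix.
		resp, err := cli.Get(context.Background(), prefix, clientv3.WithPrefix())
		if err != nil {
			log.Printf("list failed: %v", err)
			time.Sleep(time.Second)
			continue
		}
		rev := resp.Header.Revision

		// Watch: stream every change after the snapshot's revision.
		wch := cli.Watch(context.Background(), prefix,
			clientv3.WithPrefix(), clientv3.WithRev(rev+1))
		for wresp := range wch {
			if err := wresp.Err(); err != nil {
				// e.g. the revision was compacted: our position is gone,
				// so we must relist to avoid missing events.
				log.Printf("watch broken: %v; relisting", err)
				break
			}
			for _, ev := range wresp.Events {
				log.Printf("%s %q -> %q", ev.Type, ev.Kv.Key, ev.Kv.Value)
			}
		}
	}
}

func main() {
	cli, err := clientv3.New(clientv3.Config{Endpoints: []string{"localhost:2379"}})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()
	listAndWatch(cli, "/registry/")
}
```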

@wojtek-t
Member

wojtek-t commented Feb 4, 2016

@mqliang - if you are using the 1.0 release, there is no mystery that this doesn't work. The 1.0 release supports only 100-node clusters, and the 1.1 release supports 250-node clusters. Only the 1.2 release will support 1000-node clusters.

Also note that removing rate-limiting in the controller-manager can significantly degrade performance, so it's definitely not recommended.

Also note that enabling the watch cache significantly improves performance (this was the main factor enabling us to go from 100 to 1000 nodes).

@mqliang
Contributor Author

mqliang commented Feb 4, 2016

@wojtek-t May I ask whether k8s has further plans to support larger clusters? I also highly recommend making etcd horizontally scalable, i.e. splitting data across different etcd clusters. Currently k8s supports moving one kind of resource to its own dedicated etcd, but I think it would be much more helpful if we supported "move all the resources of a set of namespaces to their own dedicated etcd".

@wojtek-t
Member

wojtek-t commented Feb 4, 2016

@mqliang

  • yes, we plan to support larger clusters - at some point we would like to support 5k-node clusters, but it will take us a few months to get there (probably not even in the 1.3 release)
  • regarding splitting data per namespace - if we need to do it, we may, but I haven't thought deeply about it yet

@adohe-zz

adohe-zz commented Feb 4, 2016

I just noticed #20504; I think etcd v3 will be helpful for larger clusters.

@mqliang
Contributor Author

mqliang commented Feb 4, 2016

This could be closed.
