ETCD becomes the bottleneck of cluster when we have 1K+ nodes #20540
For a large cluster, it's not reasonable to have just one health-check routine. We could use several routines to perform the health checks, and the number of routines could be self-adapting. Any thoughts?
+1. Multiple routines for health checks are necessary for a large cluster.
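A minimal sketch of what sharding the health checks across several goroutines could look like; `Node`, `checkNode`, and the worker count are hypothetical placeholders, not Kubernetes APIs:

```go
package main

import "sync"

// Node is a hypothetical stand-in for a cluster node record.
type Node struct{ Name string }

// checkNode is a placeholder for probing a single node's health endpoint.
func checkNode(n Node) { /* e.g. HTTP GET against the node's kubelet */ }

// healthCheckAll fans the per-node checks out over `workers` goroutines;
// the caller can scale `workers` with cluster size.
func healthCheckAll(nodes []Node, workers int) {
	jobs := make(chan Node)
	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for n := range jobs {
				checkNode(n)
			}
		}()
	}
	for _, n := range nodes {
		jobs <- n
	}
	close(jobs)
	wg.Wait()
}

func main() {
	healthCheckAll([]Node{{"node-1"}, {"node-2"}}, 4)
}
```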
@mqliang For etcd v3, we introduced a concept called lease. The client can acquire a lease and do low-cost lease keepalives. 10k nodes doing keepalives should not be a problem in general.
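For reference, a minimal sketch of the v3 lease keepalive pattern described above, using etcd's Go client (import path is `go.etcd.io/etcd/client/v3` in current releases; older releases used `github.com/coreos/etcd/clientv3`); the endpoint, key, and TTL are illustrative:

```go
package main

import (
	"context"
	"log"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	cli, err := clientv3.New(clientv3.Config{Endpoints: []string{"localhost:2379"}})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	ctx := context.Background()

	// Grant a 10-second lease and attach the node's liveness key to it.
	lease, err := cli.Grant(ctx, 10)
	if err != nil {
		log.Fatal(err)
	}
	if _, err := cli.Put(ctx, "/nodes/node-1/alive", "ok",
		clientv3.WithLease(lease.ID)); err != nil {
		log.Fatal(err)
	}

	// KeepAlive sends cheap periodic heartbeats; if this client dies,
	// the lease expires and the key disappears on its own.
	ch, err := cli.KeepAlive(ctx, lease.ID)
	if err != nil {
		log.Fatal(err)
	}
	for range ch {
		// each message confirms the lease was refreshed
	}
}
```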
Also, even in today's setting, 100s of puts per second is actually not a big problem for etcd.
BTW, I am not opposed to adding a "centralized" control functionality to coalesce node keepalives, but I would like to see the result/motivation for it. If you can prove with some data that this is indeed a problem, that would be great.
@xiang90 We have a cluster with 1K+ nodes. 100s of puts per second _alone_ is not a big problem, but there are many other requests. When we observed etcd under great pressure, we made NodeController do the health checks proactively, and moved the Pod and Node resources to their own dedicated etcd clusters as a workaround.
BTW, we observe that etcd writes a snapshot every few minutes, and when it does, a lot of requests time out. IIRC, etcd v2 writes full snapshots; does etcd v3 have a plan to support incremental snapshots?
What percentage of all requests do node updates account for? I would suggest investigating the most significant part.
I think we need data to prove the issue.
I do not know why. Have you tried to figure out the root cause of this?
v3 does incremental snapshots.
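For the v2-era snapshot pauses described above, one knob worth knowing is the snapshot trigger threshold; raising it trades longer WAL replay and more memory for fewer full-snapshot writes (a mitigation, not a fix):

```sh
# etcd writes a snapshot after this many committed entries (default was
# 10000 in v2-era releases); a higher value reduces snapshot frequency.
etcd --snapshot-count=50000
```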
In addition, it's hard to monitor the status of the etcd cluster. When etcd is under great pressure, requests time out, but CPU/memory usage shows no significant increase, which makes it hard for us to monitor and alert.
What is the great pressure? Can you reproduce this with etcd itself?
@mqliang Again, to report any etcd issue you have to reproduce it with etcd itself. These metrics basically mean nothing to etcd.
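As a concrete starting point, etcd exposes Prometheus metrics over HTTP, and the disk-latency histograms often show pressure that CPU/memory graphs miss (the metric name below assumes etcd v3):

```sh
# etcd serves Prometheus metrics on its client port; WAL fsync latency
# is usually the first thing to check when requests are timing out.
curl -s http://127.0.0.1:2379/metrics | grep etcd_disk_wal_fsync_duration_seconds
```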
The reason was architectural, not related to performance. We don't want to establish connections from the master machine to anywhere. This allows, e.g., having a hosted master somewhere that oversees on-prem Nodes. Currently it's possible and safe, as all communication is initiated by Nodes. I don't think that the fact that Kubelet updates node status is a performance problem by itself. I think the problem is that we keep those timestamps as part of NodeStatus. I wrote a proposal to change this, but it was kind of rejected: #14735
/cc @kubernetes/huawei
@kubernetes/sig-scalability |
Could you please describe your deployment & tests in greater detail? The more detail, the better.
ref #18266
Some details:
I think I have found the cause of the problem:
May I ask why, in v1.0, kube-controller-manager relists every 5 minutes? Does it mean that the etcd watch is not reliable, so we have to periodically relist?
+1. The list-watch mechanism is really puzzling. But if you enable the watch cache, the watch-relist traffic mainly happens on the kube-apiserver side, which can reduce etcd's burden. @xiang90: for etcd, if watch() returns an error and we do not list(), might some changes be missed?
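A sketch of why the relist exists, answering the question above: the list establishes a consistent snapshot plus a resourceVersion, the watch resumes from it, and any watch error forces a fresh list so changes are not silently missed. The types and functions here are illustrative stand-ins, not the actual client-go reflector:

```go
// Package watchutil sketches the list-then-watch resume pattern.
package watchutil

// Event is a placeholder for a watch notification.
type Event struct{ Object string }

// listAll returns a consistent snapshot plus the resourceVersion to
// resume watching from.
func listAll() (items []string, resourceVersion string) { return nil, "0" }

// watchFrom opens a watch starting just after resourceVersion rv.
func watchFrom(rv string) (<-chan Event, error) {
	ch := make(chan Event)
	close(ch) // stub: a real client streams server events here
	return ch, nil
}

// Run loops forever: list, then watch from the listed version. If the
// watch errors and we skipped the re-list, any change between the
// failure and the next watch would be lost, because a watch can only
// resume from a version we already know about.
func Run(handle func(Event)) {
	for {
		_, rv := listAll() // 1) snapshot + resume point
		ch, err := watchFrom(rv)
		if err != nil {
			continue // 2) watch failed: re-list to avoid missing changes
		}
		for ev := range ch {
			handle(ev) // 3) incremental updates
		}
		// watch closed (e.g. "version too old"): loop back and re-list
	}
}
```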
@mqliang - if you are using the 1.0 release, there is no mystery that this doesn't work. The 1.0 release supports only 100-node clusters; the 1.1 release supports 250-node clusters. Also note that removing rate-limiting in controller-manager can significantly degrade performance, so it's definitely not recommended. Also note that enabling the watch cache significantly improves performance (this was the main factor enabling us to go from 100 to 1000 nodes).
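For reference, the watch cache mentioned above is toggled with an apiserver flag (flag name as of the 1.1/1.2 era):

```sh
# Serve watches from an in-memory cache inside the apiserver instead of
# opening a separate watch against etcd for every client.
kube-apiserver --watch-cache=true
```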
@wojtek-t May I ask whether k8s has further plans to support larger clusters? And I highly recommend making etcd horizontally scalable, i.e. splitting data across different etcd clusters. Currently, k8s supports moving one kind of resource to its own dedicated etcd, but I think it would be much more helpful to support "moving all the resources of a set of namespaces to a dedicated etcd".
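For what it's worth, the per-resource split mentioned above is exposed as an apiserver flag; events are the usual candidate (the endpoints below are illustrative):

```sh
# Store events in a separate etcd cluster; all other resources stay on
# the default --etcd-servers endpoints.
kube-apiserver \
  --etcd-servers=http://etcd-main:2379 \
  --etcd-servers-overrides=/events#http://etcd-events:2379
```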
I just noticed #20504; I think etcd v3 will be helpful for larger clusters.
This could be closed.
etcd becomes the bottleneck of the cluster when we have 1K+ nodes.
After close scrutiny, it turns out that the 1K+ nodes frequently send PUT requests to etcd to update their status (every 10s with the default configuration), which puts a heavy burden on etcd.
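For context, that 10s default comes from a kubelet flag and can be relaxed as a partial mitigation, at the cost of slower failure detection (the node controller's grace period must be adjusted to match):

```sh
# Report node status less often to cut PUT volume on etcd.
kubelet --node-status-update-frequency=20s
```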
May I suggest, instead:
_Let NodeController proactively do the health checks. Thus we could significantly reduce the number of requests sent to etcd._
IIRC, in Kubernetes v0.12 we made NodeController proactively do the health checks (DoChecks), but after v1.0 we made the kubelet update its own status. Is there a good reason for this change? @gmarek @fgrzadkowski
By the way, I think it may be time to think about making etcd extensible.