
ETCD becomes the bottleneck of cluster when we have 1K+ nodes #20540

Closed
mqliang opened this issue Feb 3, 2016 · 23 comments
Labels
priority/backlog Higher priority than priority/awaiting-more-evidence. sig/scalability Categorizes an issue or PR as relevant to SIG Scalability.

Comments

@mqliang
Contributor

mqliang commented Feb 3, 2016

etcd becomes the bottleneck of the cluster when we have 1K+ nodes.

After close scrutiny, the cause is that the 1K+ nodes each frequently send PUT requests to etcd to update their status (every 10s with the default configuration), which puts a heavy load on etcd.

May I suggest, instead:
_Let the NodeController actively perform the health checks itself. That way we could significantly reduce the number of requests sent to etcd._

IIRC, in Kubernetes v0.12 the NodeController actively performed the health checks (DoChecks), but since v1.0 the kubelet updates its own status. Is there a good reason for this change? @gmarek @fgrzadkowski

By the way, I think it may be time to think about making etcd horizontally scalable.
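
To make the write pattern concrete: with the default configuration every kubelet posts its NodeStatus on a 10-second timer, so 1,000 nodes amount to roughly 100 status PUTs per second flowing through the apiserver into etcd. A minimal sketch of that loop, assuming a hypothetical `putNodeStatus` helper rather than the actual kubelet code:

```go
package main

import (
	"log"
	"time"
)

// putNodeStatus stands in for the apiserver call the kubelet makes to
// persist its NodeStatus; in reality this ends up as a write to etcd.
func putNodeStatus(nodeName string) error {
	log.Printf("PUT /api/v1/nodes/%s/status", nodeName)
	return nil
}

func main() {
	const statusUpdateFrequency = 10 * time.Second // kubelet default at the time

	ticker := time.NewTicker(statusUpdateFrequency)
	defer ticker.Stop()

	// One such loop runs on every node, so the aggregate write rate scales
	// linearly with cluster size: 1,000 nodes ≈ 100 writes/second.
	for range ticker.C {
		if err := putNodeStatus("node-1"); err != nil {
			log.Printf("status update failed: %v", err)
		}
	}
}
```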

@mqliang
Contributor Author

mqliang commented Feb 3, 2016

For a large cluster, it's not reasonable to have just one health-check routine. We could use several routines to perform the health checks, and the number of routines could be self-adapting (see the sketch below). Any thoughts?
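
A rough sketch of such a pool of health-check routines, with the worker count adapting to the node count; the `checkNode` stub and the one-worker-per-100-nodes ratio are illustrative assumptions, not an existing Kubernetes mechanism:

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// checkNode stands in for the NodeController probing one node's health,
// e.g. by hitting the kubelet's healthz endpoint; stubbed out here.
func checkNode(node string) bool {
	time.Sleep(10 * time.Millisecond)
	return true
}

// workersFor picks a self-adapting worker count: one goroutine per 100 nodes,
// with a floor of 1. The ratio is an arbitrary illustrative choice.
func workersFor(nodeCount int) int {
	n := nodeCount / 100
	if n < 1 {
		n = 1
	}
	return n
}

func healthCheckAll(nodes []string) {
	jobs := make(chan string)
	var wg sync.WaitGroup

	for i := 0; i < workersFor(len(nodes)); i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for node := range jobs {
				if !checkNode(node) {
					fmt.Printf("node %s unhealthy\n", node)
				}
			}
		}()
	}

	for _, node := range nodes {
		jobs <- node
	}
	close(jobs)
	wg.Wait()
}

func main() {
	nodes := make([]string, 1000)
	for i := range nodes {
		nodes[i] = fmt.Sprintf("node-%d", i)
	}
	healthCheckAll(nodes)
}
```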

@adohe-zz

adohe-zz commented Feb 3, 2016

+1, multiple health-check routines are necessary for a large cluster.

@xiang90
Contributor

xiang90 commented Feb 3, 2016

@mqliang In etcd v3 we introduced a concept called a lease. A client can acquire a lease and keep it alive at low cost. 10k nodes doing keepalives should not be a problem in general.
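
A minimal sketch of that lease keepalive pattern with the etcd v3 Go client (clientv3); the endpoint, key, and 30-second TTL are placeholder choices:

```go
package main

import (
	"context"
	"log"
	"time"

	"github.com/coreos/etcd/clientv3"
)

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"localhost:2379"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	ctx := context.Background()

	// Acquire a lease with a 30-second TTL.
	lease, err := cli.Grant(ctx, 30)
	if err != nil {
		log.Fatal(err)
	}

	// Attach the node's liveness key to the lease; if keepalives stop,
	// the key disappears when the TTL expires.
	if _, err := cli.Put(ctx, "/nodes/node-1/alive", "ok", clientv3.WithLease(lease.ID)); err != nil {
		log.Fatal(err)
	}

	// KeepAlive sends cheap heartbeats instead of rewriting the full
	// node status every 10 seconds.
	ch, err := cli.KeepAlive(ctx, lease.ID)
	if err != nil {
		log.Fatal(err)
	}
	for resp := range ch {
		log.Printf("lease %x refreshed, TTL=%d", resp.ID, resp.TTL)
	}
}
```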

@xiang90
Contributor

xiang90 commented Feb 3, 2016

Also, even with today's setup, hundreds of PUTs per second is actually not a big problem for etcd.

@xiang90
Contributor

xiang90 commented Feb 3, 2016

BTW, I am not opposed to adding a "centralized" control functionality that coalesces node keepalives, but I would like to see the results/motivation for it. If you can show with some data that this is indeed a problem, that would be great.

@mqliang
Contributor Author

mqliang commented Feb 3, 2016

@xiang90 We have a cluster with 1K+ nodes. Hundreds of PUTs per second _alone_ is not a big problem, but there are many other requests. When we observed etcd under heavy pressure, as a workaround we made the NodeController do the health checks itself and moved the Pod and Node resources to their own dedicated etcd clusters.

BTW, we observe that etcd writes a snapshot every few minutes, and while it is writing the snapshot, a lot of requests time out. IIRC, etcd v2 writes full snapshots; does etcd v3 have a plan to support incremental snapshots?

@xiang90
Contributor

xiang90 commented Feb 3, 2016

> but there are many other requests.

What percentage of all requests do node updates account for? I would suggest investigating the most significant part first.

> When we observed etcd under heavy pressure,

I think we need data to demonstrate the issue.

> we observe that etcd writes a snapshot every few minutes, and while it is writing the snapshot, a lot of requests time out

I do not know why. Have you tried to figure out the root cause of this?

> etcd v2 writes full snapshots; does etcd v3 have a plan to support incremental snapshots?

v3 does incremental snapshots.

@mqliang
Contributor Author

mqliang commented Feb 3, 2016

In addition, it's hard to monitor the status of the etcd cluster. When etcd is under heavy pressure, requests time out, but CPU/memory usage shows no significant increase, which makes it hard for us to monitor and alert.

@xiang90
Contributor

xiang90 commented Feb 3, 2016

> In addition, it's hard to monitor the status of the etcd cluster. When etcd is under heavy pressure, requests time out, but CPU/memory usage shows no significant increase, which makes it hard for us to monitor and alert.

What exactly is the heavy pressure? Can you reproduce this with etcd itself?

@xiang90
Contributor

xiang90 commented Feb 3, 2016

@mqliang Again, to report any etcd issue you have to reproduce it with etcd itself. These metrics basically mean nothing to etcd.

@gmarek
Contributor

gmarek commented Feb 3, 2016

The reason was architectural, not related to performance. We don't want to establish connections from the master machine to anywhere; this allows, for example, having a hosted master somewhere that oversees on-prem Nodes. Currently that's possible and safe, as all communication is initiated by the Nodes.

I don't think that the fact that Kubelet updates node status is a performance problem by itself. I think that the problem is that we keep those timestamps as a part of NodeStatus. I wrote a proposal to change this, but it was kind of rejected: #14735

@magicwang-cn
Contributor

/cc @kubernetes/huawei

@wojtek-t wojtek-t added priority/backlog Higher priority than priority/awaiting-more-evidence. sig/scalability Categorizes an issue or PR as relevant to SIG Scalability. team/control-plane labels Feb 3, 2016
@wojtek-t
Member

wojtek-t commented Feb 3, 2016

@kubernetes/sig-scalability

@timothysc
Member

Could you please describe your deployment and tests in greater detail?

  1. What tests are you running? (Is it high churn?)
  2. Have you load-balanced your api-servers?
  3. Have you sharded events?
  4. Could you grab your api-server metrics to show the counts of operations, as well as their latencies? (See the sketch after this comment.)
  5. Are you running a fully secure deployment?
    ...

The more detail, the better.
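
For item 4, a rough way to pull the request counts and latencies, assuming the apiserver's insecure local port is reachable at localhost:8080 and that the relevant series start with the `apiserver_request_` prefix (both assumptions about this particular deployment):

```go
package main

import (
	"bufio"
	"fmt"
	"net/http"
	"strings"
)

func main() {
	resp, err := http.Get("http://localhost:8080/metrics")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	scanner := bufio.NewScanner(resp.Body)
	for scanner.Scan() {
		line := scanner.Text()
		// Keep only the request count / latency series.
		if strings.HasPrefix(line, "apiserver_request_") {
			fmt.Println(line)
		}
	}
	if err := scanner.Err(); err != nil {
		panic(err)
	}
}
```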

@magicwang-cn
Contributor

ref #18266

@mqliang
Contributor Author

mqliang commented Feb 4, 2016

Some details:

  1. We removed the rate limit of kube-controller-manager/kube-scheduler.
  2. We don't connect through the secure server, so the max-requests-inflight flag doesn't take effect.
  3. We use k8s v1.0, which has no cache in the API server, and kube-controller-manager relists every 5 minutes (AFAIK in the latest version this has been changed to 12h); frequent list operations put a heavy load on etcd.
  4. Didn't shard Events; all resources are in the same etcd cluster.
  5. No load balancing, since multiple API servers without quorum reads may return stale values (see the sketch after this comment). See "add a knob to enable quorum read" #20145 and "should introduce a knob to enable quorum read of etcd for HA" #19902.

I think I have found the cause of the problem:

  1. frequent re-listing
  2. all resources are in the same etcd cluster
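
To illustrate the stale-read concern in point 5: a plain etcd v2 read may be served from a follower that lags the leader, while a quorum read goes through the leader. A minimal sketch with the etcd v2 Go client; the endpoint and registry key are placeholders:

```go
package main

import (
	"context"
	"fmt"
	"log"

	"github.com/coreos/etcd/client"
)

func main() {
	c, err := client.New(client.Config{
		Endpoints: []string{"http://127.0.0.1:2379"},
	})
	if err != nil {
		log.Fatal(err)
	}
	kapi := client.NewKeysAPI(c)

	// Default read: may be answered from a follower's local state and
	// therefore return a slightly stale value.
	stale, err := kapi.Get(context.Background(), "/registry/minions/node-1", nil)
	if err != nil {
		log.Fatal(err)
	}

	// Quorum read: goes through the leader, so it reflects the latest
	// committed value, at the cost of extra latency.
	fresh, err := kapi.Get(context.Background(), "/registry/minions/node-1",
		&client.GetOptions{Quorum: true})
	if err != nil {
		log.Fatal(err)
	}

	fmt.Println(stale.Node.Value, fresh.Node.Value)
}
```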

@mqliang
Contributor Author

mqliang commented Feb 4, 2016

May I ask why in v1.0 kube-controller-manager relists every 5 minutes? Does it mean the watch of etcd is not reliable, so that we should periodically relist?

@magicwang-cn
Contributor

> Does it mean the watch of etcd is not reliable, so that we should periodically relist?

+1, the list-watch mechanism is really puzzling. But if you enable the watch cache, the watch/relist activity happens mostly on the kube-apiserver side, which reduces etcd's load.

@xiang90 On the etcd side, if watch() returns an error and we do not list() again, might some changes be missed?
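
For the question above, the usual answer is the list-then-watch pattern: list once to get a snapshot and its revision, watch from the next revision, and relist whenever the watch breaks. A minimal sketch against the etcd v3 Go client (not Kubernetes' actual reflector code); the endpoint and prefix are placeholders:

```go
package main

import (
	"context"
	"log"
	"time"

	"github.com/coreos/etcd/clientv3"
)

// listAndWatch lists once to get a consistent snapshot and its revision,
// then watches from the next revision. If the watch fails (e.g. the
// revision has been compacted away), it falls back to another list
// instead of silently missing changes.
func listAndWatch(cli *clientv3.Client, prefix string) {
	for {
		// List: a consistent snapshot of all keys under the prefix.
		resp, err := cli.Get(context.Background(), prefix, clientv3.WithPrefix())
		if err != nil {
			log.Printf("list failed: %v", err)
			time.Sleep(time.Second)
			continue
		}
		rev := resp.Header.Revision

		// Watch: stream every change after the snapshot's revision.
		wch := cli.Watch(context.Background(), prefix,
			clientv3.WithPrefix(), clientv3.WithRev(rev+1))
		for wresp := range wch {
			if err := wresp.Err(); err != nil {
				// e.g. the revision was compacted: our position is gone,
				// so we must relist to avoid missing events.
				log.Printf("watch broken: %v; relisting", err)
				break
			}
			for _, ev := range wresp.Events {
				log.Printf("%s %q -> %q", ev.Type, ev.Kv.Key, ev.Kv.Value)
			}
		}
	}
}

func main() {
	cli, err := clientv3.New(clientv3.Config{Endpoints: []string{"localhost:2379"}})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()
	listAndWatch(cli, "/registry/")
}
```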

@wojtek-t
Member

wojtek-t commented Feb 4, 2016

@mqliang - if you are using the 1.0 release, there is no mystery that this doesn't work. The 1.0 release supports only 100-node clusters, and the 1.1 release supports 250-node clusters. Only the 1.2 release will support 1000-node clusters.

Also note that removing rate-limiting in the controller-manager can significantly degrade performance, so it's definitely not recommended.

Also note that enabling the watch cache significantly improves performance (this was the main factor enabling us to go from 100 to 1000 nodes).

@mqliang
Contributor Author

mqliang commented Feb 4, 2016

@wojtek-t May I ask whether k8s has further plans to support larger clusters? I also highly recommend making etcd horizontally scalable, i.e. splitting data across different etcd clusters. Currently k8s supports moving one kind of resource to its own dedicated etcd, but I think it would be much more helpful if we supported "move all the resources of a set of namespaces to their own dedicated etcd".

@wojtek-t
Member

wojtek-t commented Feb 4, 2016

@mqliang

  • yes, we plan to support larger clusters - at some point we would like to support 5k-node clusters, but it will take us a few months to get there (probably not even in the 1.3 release)
  • regarding splitting data per namespace - if we need to do it, we may, but I haven't thought deeply about it yet

@adohe-zz

adohe-zz commented Feb 4, 2016

I just noticed #20504; I think etcd v3 will be helpful for larger clusters.

@mqliang
Contributor Author

mqliang commented Feb 4, 2016

This could be closed.
