Resource Management of ETCD Load #120781

Sharpz7 · 2023-09-20T18:16:26Z

What would you like to be added?

There should be Resource Management for tracking etcd load. This can then be used as a Resource Quota for Jobs to ensure that they (and potentially kubelet) do not spin up Pods faster than etcd can handle.

K8s Docs:
https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/#extended-resources
https://kubernetes.io/docs/concepts/policy/resource-quotas/

Why is this needed?

For people dealing with extremely high-throughput batch work (e.g. thousands of jobs per second, each lasting 1-2 minutes), etcd starts to become a real problem. There should be an in-k8s solution to this.

Whether it is through Resource Quotas or some other mechanism, I think this is worth exploring. Happy to make this ticket more detailed as required, and to start a KEP if folks are interested.

Links to back up this point:

https://etcd.io/docs/v3.5/op-guide/performance/
https://github.com/armadaproject/armada: A scheduling solution partially designed around this problem.
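
To give a rough sense of scale for the workload described under "Why is this needed?", here is a back-of-the-envelope sketch using Little's law (average number in the system = arrival rate × average time in the system). The figure of 1000 jobs per second is an assumption taken from the low end of "thousands", and `steady_state_concurrency` is a hypothetical helper name used only for illustration.

```python
def steady_state_concurrency(arrival_rate_per_s, mean_duration_s):
    """Little's law: average jobs in flight = arrival rate * mean job duration."""
    return arrival_rate_per_s * mean_duration_s


# Assumed figures: 1000 jobs/second, each job lasting 1 to 2 minutes.
low = steady_state_concurrency(1000, 60)    # 1-minute jobs
high = steady_state_concurrency(1000, 120)  # 2-minute jobs
print(low, high)
```

Under these assumptions, roughly 60,000 to 120,000 jobs would be in flight at any moment, each backed by at least one Pod object stored in etcd.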

Sharpz7 · 2023-09-20T18:17:31Z

/sig api-machinery
/wg batch

jiahuif · 2023-09-21T16:45:02Z

/assign @wenjiaswe
Could you relay this issue to the etcd maintainers? Thank you.
/triage accepted

Sharpz7 · 2023-09-21T21:04:19Z

@jiahuif I don't really think this is an etcd problem - it's as efficient as it's going to get.

Kubernetes needs to be responsible for not overloading it.

wenjiaswe · 2023-09-21T21:13:31Z

> it's as efficient as it's going to get. Kubernetes needs to be responsible for not overloading it.

Agree.

But... interestingly, and somewhat related, @logicalhan recently proposed his extensible-etcd project: https://docs.google.com/document/d/16XEGyPBisZvmmoIHSZzv__LoyOeluC5a4x353CX0SIM/edit#bookmark=id.n170uancbaqb. It could potentially help with etcd's performance limits.

cc @logicalhan @jpbetz @serathius

Sharpz7 · 2023-09-23T22:32:00Z

We had seen projects like this, but really etcd is very nice the way it is, and we don't have much interest in switching to something else.

That is why I think a Resource Quota is an interesting idea. Or maybe something like Kubelet being able to "queue" etcd requests to stop overloading.
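
A minimal sketch of the "queue" idea, purely for illustration: writes are held in a FIFO queue and released no faster than a configured rate, using a token bucket. `WriteThrottle`, `submit`, and `drain` are hypothetical names, not an existing Kubernetes or etcd API, and the rate and burst values would have to be chosen from measured etcd capacity.

```python
import time
from collections import deque


class WriteThrottle:
    """Hypothetical throttle: queues writes and releases them at no more
    than `rate` per second, allowing short bursts of up to `burst`."""

    def __init__(self, rate, burst, clock=time.monotonic):
        self.rate = float(rate)
        self.burst = float(burst)
        self.clock = clock            # injectable so the sketch can be tested
        self.tokens = float(burst)    # start with a full bucket
        self.last = clock()
        self.pending = deque()

    def submit(self, request):
        """Queue a request instead of sending it straight away."""
        self.pending.append(request)

    def _refill(self):
        # Add tokens for the time elapsed since the last refill, capped at burst.
        now = self.clock()
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now

    def drain(self):
        """Return the requests that may be sent now, in FIFO order."""
        self._refill()
        released = []
        while self.pending and self.tokens >= 1.0:
            self.tokens -= 1.0
            released.append(self.pending.popleft())
        return released
```

A caller would `submit` every write and periodically call `drain`, forwarding only what it returns; anything still pending simply waits for the next call rather than reaching etcd.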

alculquicondor · 2023-10-26T14:51:07Z

cc @serathius for thoughts

alculquicondor · 2023-10-26T14:51:55Z

/sig etcd
?

logicalhan · 2023-10-26T14:53:09Z

> We had seen projects like this, but really etcd is very nice the way it is, and we don't have much interest in switching to something else.

If we pursued this, it would be in-tree for etcd.

Sharpz7 · 2023-10-26T15:01:12Z

Okay, so should I re-create the ticket there?

I would also be happy to work on etcd.

I am also not convinced that we want to handle this in etcd. I don't want to hold back the creation of key-value pairs; I want to stop the key-value pairs from even being sent to etcd in the first place.

serathius · 2023-10-27T08:34:02Z

This sounds like a flow control issue. Are you using https://kubernetes.io/docs/concepts/cluster-administration/flow-control/?

Sharpz7 added the kind/feature label Sep 20, 2023

k8s-ci-robot added the needs-sig and needs-triage labels Sep 20, 2023

k8s-ci-robot added the sig/api-machinery and wg/batch labels and removed the needs-sig label Sep 20, 2023

k8s-ci-robot assigned wenjiaswe Sep 21, 2023

k8s-ci-robot added the triage/accepted label and removed the needs-triage label Sep 21, 2023

k8s-ci-robot added the sig/etcd label Oct 26, 2023

Sharpz7 mentioned this issue Oct 27, 2023

Handle Extremely High Throughput by holding back requests to etcd until the throughput decreases. etcd-io/etcd#16837

Open
