Resource Management of ETCD Load #120781

Open
Sharpz7 opened this issue Sep 20, 2023 · 10 comments
Labels
kind/feature Categorizes issue or PR as related to a new feature. sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. sig/etcd Categorizes an issue or PR as relevant to SIG Etcd. triage/accepted Indicates an issue or PR is ready to be actively worked on. wg/batch Categorizes an issue or PR as relevant to WG Batch.

Comments

@Sharpz7
Contributor

Sharpz7 commented Sep 20, 2023

What would you like to be added?

Kubernetes should offer a way to track etcd load as a managed resource. That resource could then back a Resource Quota for Jobs, ensuring that they (and potentially the kubelet) do not spin up Pods faster than etcd can handle.

K8s Docs:
https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/#extended-resources
https://kubernetes.io/docs/concepts/policy/resource-quotas/
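
For context, here is a minimal sketch of the closest existing mechanism, an object-count ResourceQuota. It caps how many Job and Pod objects may exist in a namespace, not how fast they are created, so it only approximates what is being asked for here. The manifest is built as a Python dict and printed as JSON, which `kubectl apply -f` accepts. The namespace `batch-team`, the quota name, and the limits are hypothetical values chosen for illustration; no etcd-load resource exists today.

```python
import json


def build_job_count_quota(namespace: str, max_jobs: int, max_pods: int) -> dict:
    """Build a ResourceQuota manifest that caps Job and Pod object counts."""
    return {
        "apiVersion": "v1",
        "kind": "ResourceQuota",
        # The quota name is hypothetical; any valid object name works.
        "metadata": {"name": "batch-object-counts", "namespace": namespace},
        "spec": {
            "hard": {
                # Object-count quota for Jobs in the batch API group.
                "count/jobs.batch": str(max_jobs),
                # Cap on Pods in a non-terminal state in the namespace.
                "pods": str(max_pods),
            }
        },
    }


if __name__ == "__main__":
    # Hypothetical namespace and limits, for illustration only.
    print(json.dumps(build_job_count_quota("batch-team", 5000, 10000), indent=2))
```

Once the cap is reached, further Job creation in that namespace is rejected until existing objects are deleted. That bounds the stored object count but does not smooth the write rate.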

Why is this needed?

For people dealing with extremely high-throughput batch work (e.g., thousands of jobs per second, each lasting 1-2 minutes), etcd starts to become a real problem. There should be an in-Kubernetes solution to this.
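
To give a feel for the scale, here is a back-of-the-envelope sketch using Little's law (average number in the system = arrival rate x time in the system). The 1000 jobs per second figure is an assumed lower bound for "thousands"; the function name is just for illustration.

```python
def concurrent_jobs(arrival_rate_per_s: float, duration_s: float) -> float:
    """Little's law: average jobs in flight = arrival rate * job duration."""
    return arrival_rate_per_s * duration_s


if __name__ == "__main__":
    # Assumed rate of 1000 jobs/s; durations of 1 and 2 minutes from the text.
    for duration_s in (60, 120):
        in_flight = concurrent_jobs(1000, duration_s)
        print(f"{duration_s}s jobs -> about {in_flight:.0f} jobs in flight")
```

Under these assumptions, on the order of 60,000 to 120,000 jobs are in flight at any moment, and each one corresponds to at least a Job object and a Pod object, plus their status updates.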

Whether through Resource Quotas or some other mechanism, I think this is worth exploring. I am happy to add detail to this ticket as required, and to start the KEP if folks are interested.

Links to back up this point:

@Sharpz7 Sharpz7 added the kind/feature Categorizes issue or PR as related to a new feature. label Sep 20, 2023
@k8s-ci-robot k8s-ci-robot added needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Sep 20, 2023
@Sharpz7
Contributor Author

Sharpz7 commented Sep 20, 2023

/sig api-machinery
/wg batch

@k8s-ci-robot k8s-ci-robot added sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. wg/batch Categorizes an issue or PR as relevant to WG Batch. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Sep 20, 2023
@jiahuif
Member

jiahuif commented Sep 21, 2023

/assign @wenjiaswe
Could you relay this issue to the etcd maintainers? Thank you.
/triage accepted

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Sep 21, 2023
@Sharpz7
Contributor Author

Sharpz7 commented Sep 21, 2023

@jiahuif I don't really think this is an etcd problem - it's as efficient as it's going to get.

Kubernetes needs to be responsible for not overloading it.

@wenjiaswe
Contributor

> it's as efficient as it's going to get. Kubernetes needs to be responsible for not overloading it.

Agree.

Somewhat related, though: @logicalhan recently proposed a project called extensible-etcd: https://docs.google.com/document/d/16XEGyPBisZvmmoIHSZzv__LoyOeluC5a4x353CX0SIM/edit#bookmark=id.n170uancbaqb. It could potentially help with etcd's performance limits.

cc @logicalhan @jpbetz @serathius

@Sharpz7
Contributor Author

Sharpz7 commented Sep 23, 2023

We had seen projects like this, but really etcd is very nice the way it is, and we don't have much interest in switching to something else.

That is why I think a Resource Quota is an interesting idea. Another option might be for the kubelet to "queue" etcd requests so that etcd is not overloaded.
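
To make the "queue" idea concrete, here is a minimal sketch of a token-bucket throttle that a client could put in front of its create calls. This is not an existing kubelet or Kubernetes feature; the class and parameter names are hypothetical. (client-go's `rest.Config` does expose `QPS` and `Burst` for client-side rate limiting, which is similar in spirit.)

```python
import time
from typing import Callable


class TokenBucket:
    """Allow at most `rate` operations per second, with bursts up to `burst`."""

    def __init__(self, rate: float, burst: int,
                 clock: Callable[[], float] = time.monotonic):
        self.rate = rate
        self.burst = burst
        self.clock = clock            # injectable so the bucket can be tested
        self.tokens = float(burst)    # start with a full bucket
        self.last = clock()

    def try_acquire(self) -> bool:
        """Take one token if available; on False the caller should queue or retry."""
        now = self.clock()
        # Refill in proportion to elapsed time, capped at the burst size.
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

A submitter would call `try_acquire()` before each create request and hold the request in its own queue when it returns `False`, so the excess never reaches the API server or etcd.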

@alculquicondor
Member

cc @serathius for thoughts

@alculquicondor
Member

/sig etcd
?

@k8s-ci-robot k8s-ci-robot added the sig/etcd Categorizes an issue or PR as relevant to SIG Etcd. label Oct 26, 2023
@logicalhan
Member

> We had seen projects like this, but really etcd is very nice the way it is, and we don't have much interest in switching to something else.

If we pursued this, it would be in-tree for etcd.

@Sharpz7
Contributor Author

Sharpz7 commented Oct 26, 2023

Okay, so should I re-create the ticket there?

I would also be happy to work on etcd.

I am also not convinced that we want to handle this in etcd. I don't want to hold back the creation of key-value pairs; I want to stop the key-value pairs from being sent to etcd in the first place.

@serathius
Contributor

serathius commented Oct 27, 2023

This sounds like a flow control issue. Are you using https://kubernetes.io/docs/concepts/cluster-administration/flow-control/?
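
For anyone following along, here is a rough sketch of what an API Priority and Fairness setup for a batch submitter might look like. It is built as Python dicts that can be dumped to JSON for `kubectl apply -f`. The object names, the service account `job-submitter`, the namespace `batch-team`, and all numeric values are hypothetical. The `flowcontrol.apiserver.k8s.io/v1` API version is an assumption about the cluster release; older clusters serve beta versions with slightly different field names.

```python
import json

API_VERSION = "flowcontrol.apiserver.k8s.io/v1"  # assumed; depends on cluster release


def build_priority_level(name: str, shares: int) -> dict:
    """Priority level that queues excess requests instead of rejecting them."""
    return {
        "apiVersion": API_VERSION,
        "kind": "PriorityLevelConfiguration",
        "metadata": {"name": name},
        "spec": {
            "type": "Limited",
            "limited": {
                "nominalConcurrencyShares": shares,
                "limitResponse": {
                    "type": "Queue",
                    # Illustrative queuing values; handSize must not exceed queues.
                    "queuing": {"queues": 16, "handSize": 4, "queueLengthLimit": 50},
                },
            },
        },
    }


def build_flow_schema(name: str, priority_level: str,
                      service_account: str, namespace: str) -> dict:
    """Route Job creation by one service account to the given priority level."""
    return {
        "apiVersion": API_VERSION,
        "kind": "FlowSchema",
        "metadata": {"name": name},
        "spec": {
            "priorityLevelConfiguration": {"name": priority_level},
            "matchingPrecedence": 1000,
            "distinguisherMethod": {"type": "ByUser"},
            "rules": [{
                "subjects": [{
                    "kind": "ServiceAccount",
                    "serviceAccount": {"name": service_account, "namespace": namespace},
                }],
                "resourceRules": [{
                    "verbs": ["create"],
                    "apiGroups": ["batch"],
                    "resources": ["jobs"],
                    "namespaces": [namespace],
                }],
            }],
        },
    }


if __name__ == "__main__":
    level = build_priority_level("batch-submitters", 10)
    schema = build_flow_schema("batch-job-creates", "batch-submitters",
                               "job-submitter", "batch-team")
    print(json.dumps([level, schema], indent=2))
```

Note that this limits concurrent in-flight requests from the submitter at the API server rather than metering etcd writes directly, and the Pods themselves are created by the Job controller, which this schema does not match.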

7 participants