Consider providing separate etcd destination for CRDs #118858

Open
geetasg opened this issue Jun 25, 2023 · 12 comments
Labels
kind/feature Categorizes issue or PR as related to a new feature. sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. triage/accepted Indicates an issue or PR is ready to be actively worked on.

Comments

@geetasg

geetasg commented Jun 25, 2023

What would you like to be added?

Today, the Kubernetes API server can send events to a separate etcd cluster via the --etcd-servers-overrides="/events#" flag. This issue requests a similar mechanism for sending CRDs to a separate etcd cluster.
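
For reference, here is a minimal sketch of how a value in this flag's format breaks down. It assumes the shape described in the kube-apiserver help text: comma-separated overrides, each written as group/resource#servers, with servers separated by semicolons. The function name, the host names, and the CRD entry are hypothetical, and this is not the kube-apiserver implementation.

```python
from typing import Dict, List


def parse_etcd_servers_overrides(value: str) -> Dict[str, List[str]]:
    """Split a value shaped like --etcd-servers-overrides into a mapping.

    Illustrative parser only (hypothetical helper, not kube-apiserver code).
    Assumed shape: comma-separated items, each `group/resource#servers`,
    where `servers` is a semicolon-separated list of URLs.
    """
    overrides: Dict[str, List[str]] = {}
    for item in value.split(","):
        item = item.strip()
        if not item:
            continue  # tolerate empty segments such as a trailing comma
        resource, sep, servers = item.partition("#")
        server_list = [s.strip() for s in servers.split(";") if s.strip()]
        if not sep or not resource or not server_list:
            raise ValueError(f"malformed override: {item!r}")
        overrides[resource] = server_list
    return overrides


if __name__ == "__main__":
    # The first entry mirrors the existing events override; the second is the
    # kind of CRD entry this issue asks for (hypothetical, with made-up hosts).
    example = (
        "/events#https://etcd-events.example:2379,"
        "argoproj.io/workflows#https://etcd-crd.example:2379"
    )
    for resource, servers in parse_etcd_servers_overrides(example).items():
        print(resource, "->", servers)
```

The sketch only shows what an extended flag value might look like to an operator. Whether the flag is the right long-term mechanism is a separate question.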

Why is this needed?

The primary motivation is to keep the main etcd cluster performant.

  • CRD listing - Some workloads use their CRDs for events (for example, Argo). These events cause issues similar to those of native Kubernetes events: their writes are spiky, and they receive frequent LIST calls, typically from monitoring tools. The motivation for moving them out of the main etcd cluster is the same as the reasoning for moving Kubernetes events out of it.

  • CRD count - Some CRDs produce millions of objects and degrade the performance of the main etcd cluster.

@geetasg geetasg added the kind/feature Categorizes issue or PR as related to a new feature. label Jun 25, 2023
@k8s-ci-robot k8s-ci-robot added needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Jun 25, 2023
@geetasg
Author

geetasg commented Jun 25, 2023

similar to #4432

@aojea
Member

aojea commented Jun 26, 2023

/sig api-machinery
/cc @wojtek-t @sttts

@k8s-ci-robot k8s-ci-robot added sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Jun 26, 2023
@sttts
Contributor

sttts commented Jun 26, 2023

--etcd-servers-overrides could certainly be extended to cover CRDs too. I remember that there was such a discussion in the past. If I remember correctly, the only argument against it was that we were not really confident that --etcd-servers-overrides is the right long-term solution. But that discussion was years ago. I don't think there has been much progress toward finding a more abstract way to configure storage.

@alexzielenski
Contributor

/triage accepted

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Jun 27, 2023
@geetasg
Author

geetasg commented Jul 5, 2023

@sttts Can you please comment on the best way forward here? I can start investigating an implementation of the etcd cluster override for CRDs, but I would like to verify that it is aligned with the long-term direction. /cc @serathius

@sttts
Contributor

sttts commented Jul 6, 2023

Formally, SIG API Machinery is responsible for this topic. The SIG meeting, held every second Wednesday, might be a good place to bring it up. There is an agenda document; just add it there. cc @fedebongio

@geetasg
Author

geetasg commented Jul 19, 2023

Thanks @sttts. I will attend the next meeting to discuss this with the community.

@wenjiaswe
Contributor

cc @jpbetz

@jberkus

jberkus commented Aug 24, 2023

This is a good idea, except that it would also need to include a way to deploy the second etcd cluster. Would we adopt a standard operator?

@jmhbnz
Member

jmhbnz commented Aug 24, 2023

The proliferation of CRDs, along with the behavior of their controllers, is definitely causing scalability ceilings for single clusters.

This idea sounds helpful and would cover one part of the equation: prioritising and maintaining the availability of the core etcd cluster while providing additional capacity. The other side of the equation we need to address is mitigating the API server's massive memory growth and spikes when it deals with vast numbers of objects.

@jpbetz
Contributor

jpbetz commented Aug 24, 2023

Binary protocols for CRDs (@benluddy is planning to submit a KEP for 1.29) should help a lot with CRD scalability. I'd love to see what scale limits clusters with lots of CRDs hit once that is available.

I'm also very curious which limit clusters are hitting. Is it apiserver CPU? etcd CPU or storage space? Depending on the limit hit, a separate etcd may or may not help.
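
One way to start on the storage side of that question is to look at per-resource object counts. Below is a rough sketch that assumes the API server's /metrics output exposes the apiserver_storage_objects gauge with a single resource label. The helper name and the sample numbers are made up for illustration, and object counts say nothing about CPU limits.

```python
import re
from typing import List, Tuple

# Matches lines such as: apiserver_storage_objects{resource="pods"} 3500
# Assumption: the gauge carries exactly one label named `resource`.
_SAMPLE_LINE = re.compile(
    r'^apiserver_storage_objects\{resource="([^"]+)"\}\s+([0-9.eE+-]+)$'
)


def top_resources_by_object_count(
    metrics_text: str, limit: int = 5
) -> List[Tuple[str, int]]:
    """Return the `limit` resources with the most stored objects.

    Hypothetical helper: it scans Prometheus text-format output and ignores
    every line that is not an apiserver_storage_objects sample.
    """
    counts: List[Tuple[str, int]] = []
    for line in metrics_text.splitlines():
        match = _SAMPLE_LINE.match(line.strip())
        if match:
            counts.append((match.group(1), int(float(match.group(2)))))
    counts.sort(key=lambda pair: pair[1], reverse=True)
    return counts[:limit]


# Made-up sample data, shaped like /metrics output; not from a real cluster.
SAMPLE_METRICS = """\
# TYPE apiserver_storage_objects gauge
apiserver_storage_objects{resource="events"} 12000
apiserver_storage_objects{resource="pods"} 3500
apiserver_storage_objects{resource="workflows.argoproj.io"} 2.1e+06
"""

if __name__ == "__main__":
    for resource, count in top_resources_by_object_count(SAMPLE_METRICS):
        print(f"{resource}: {count}")
```

If a handful of custom resources dominate the counts, that points toward storage pressure that a separate etcd could relieve. If the counts are modest, the bottleneck is more likely elsewhere.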

@liangyuanpeng
Contributor

Binary protocols for CRD (@benluddy is planning to submit a KEP for 1.29) should help a lot with CRD scalability

This is https://github.com/kubernetes/enhancements/tree/master/keps/sig-api-machinery/4222-cbor-serializer
