
Only create one etcd client per transport #111559

Closed · wants to merge 1 commit

Conversation

@negz negz commented Jul 30, 2022

What type of PR is this?

/kind cleanup

What this PR does / why we need it:

Previously we would create one etcd client per type of custom resource, resulting in one TCP connection to the etcd cluster per resource type and excessive memory usage (mostly because we also created an expensive logger per client).
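The caching approach can be sketched as follows. This is a minimal illustration with invented names: `TransportConfig`, `Client`, and `clientCache` here are simplified stand-ins, not the PR's actual types; the real change works against `*clientv3.Client` and the apiserver's storage backend transport config.

```go
package main

import (
	"fmt"
	"sync"
)

// TransportConfig is a simplified stand-in for the fields that distinguish
// one etcd transport from another: server list plus TLS credential files.
type TransportConfig struct {
	ServerList                []string
	KeyFile, CertFile, CAFile string
}

// key flattens the config into a comparable cache key, mirroring the
// fmt.Sprintf("%v", tc) approach quoted later in this review.
func (tc TransportConfig) key() string {
	return fmt.Sprintf("%v", tc)
}

// Client is a stand-in for an etcd v3 client (*clientv3.Client in the real code).
type Client struct {
	endpoints []string
}

// clientCache hands out at most one Client per distinct transport config,
// so N resource types served by the same etcd cluster share one client.
type clientCache struct {
	mu      sync.Mutex
	clients map[string]*Client
}

func (c *clientCache) For(tc TransportConfig) *Client {
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.clients == nil {
		c.clients = map[string]*Client{}
	}
	if cl, ok := c.clients[tc.key()]; ok {
		return cl // reuse the existing client and its TCP connection
	}
	cl := &Client{endpoints: tc.ServerList}
	c.clients[tc.key()] = cl
	return cl
}

func main() {
	cache := &clientCache{}
	tc := TransportConfig{ServerList: []string{"https://127.0.0.1:2379"}}
	fmt.Println(cache.For(tc) == cache.For(tc)) // true: one client per transport
}
```

With this shape, registering a 1,900th CRD that talks to the same etcd cluster reuses the existing client instead of dialing a new connection.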

With ~1,900 CRDs loaded on an otherwise idle kind cluster I'm seeing a ~35% reduction in RSS memory with this change. Garbage collection appears to be kicking in at ~5.2GiB RSS rather than ~8GiB RSS. Here's a profile:

[pprof memory profile: with one client per transport]

You can see that we're using dramatically fewer etcd connections:

root@e092b6d27bf-clientpertransport-control-plane:/# ps -C kube-apiserver
    PID TTY          TIME CMD
    605 ?        00:16:41 kube-apiserver

# Column 3 is the remote host and port per https://www.kernel.org/doc/Documentation/networking/proc_net_tcp.txt
# 0100007F:094B is hex for 127.0.0.1:2379 (the etcd client port)
root@e092b6d27bf-clientpertransport-control-plane:/# cat /proc/605/net/tcp|awk '{ print $3 }'|grep '0100007F:094B'|wc -l
5
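For reference, decoding the hex address from /proc/<pid>/net/tcp can be done mechanically; this sketch (a hypothetical helper, not part of the PR) shows how "0100007F:094B" maps to 127.0.0.1:2379 per the kernel documentation linked above:

```go
package main

import (
	"encoding/binary"
	"fmt"
	"net"
	"strconv"
	"strings"
)

// decodeProcTCPAddr converts an address column from /proc/<pid>/net/tcp
// (e.g. "0100007F:094B") to dotted-quad form. Per proc_net_tcp.txt, the
// IPv4 part is a 32-bit value in little-endian hex; the port is plain hex.
func decodeProcTCPAddr(s string) (string, error) {
	parts := strings.Split(s, ":")
	if len(parts) != 2 {
		return "", fmt.Errorf("malformed address %q", s)
	}
	ipRaw, err := strconv.ParseUint(parts[0], 16, 32)
	if err != nil {
		return "", err
	}
	port, err := strconv.ParseUint(parts[1], 16, 16)
	if err != nil {
		return "", err
	}
	ip := make(net.IP, 4)
	binary.LittleEndian.PutUint32(ip, uint32(ipRaw))
	return fmt.Sprintf("%s:%d", ip, port), nil
}

func main() {
	addr, err := decodeProcTCPAddr("0100007F:094B")
	if err != nil {
		panic(err)
	}
	fmt.Println(addr) // 127.0.0.1:2379, the etcd client port
}
```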

Which issue(s) this PR fixes:

Fixes #111622

Special notes for your reviewer:

I'm opening this as a draft to illustrate my thinking. Notably it's missing any new tests at the moment - I'd like to get some signal that this direction looks good before I spend time on that.

Does this PR introduce a user-facing change?

NONE

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. release-note-none Denotes a PR that doesn't merit a release note. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Jul 30, 2022
@k8s-ci-robot

Hi @negz. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: negz
Once this PR has been reviewed and has the lgtm label, please assign liggitt for approval by writing /assign @liggitt in a comment. For more information see: The Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added area/apiserver sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. and removed do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Jul 30, 2022

negz commented Jul 30, 2022

/cc @aojea @jpbetz @sttts


aojea commented Aug 1, 2022

/ok-to-test

let's get a taste of it on CI; can you undraft this to run one round of tests and see the results?

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Aug 1, 2022

aojea commented Aug 1, 2022

/cc @wojtek-t

@p0lyn0mial

does anyone know why we create one etcd client per type?

return fmt.Sprintf("%v", tc) // gives: {[server1 server2] keyFile certFile caFile}
}

// For returns a compacted etcd v3 client for the supplied transport config. One

Suggested change
// For returns a compacted etcd v3 client for the supplied transport config. One
// For returns a compacting etcd v3 client for the supplied transport config. One


sttts commented Aug 4, 2022

@wojtek-t worth running any of the bigger scaling tests?


marseel commented Aug 4, 2022

/test pull-kubernetes-kubemark-e2e-gce-scale


cici37 commented Aug 4, 2022

/triage accepted

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Aug 4, 2022

negz commented Aug 19, 2022

Hey folks, just checking in. Should I update this PR with some new tests? I'd like to get some signal around what we'd need to feel comfortable moving forward with this.

@MikeSpreitzer

/test pull-kubernetes-dependencies

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Sep 13, 2022
@k8s-ci-robot

@negz: PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.


liggitt commented Oct 31, 2022

the lifecycle of etcd clients held by various storage implementations came up during review of #112050 and I wanted to make sure that by making the CRD etcd clients shared, we weren't going to tear down the shared etcd client when we update storage in response to a CRD update or removal at

// Tear down the old storage
go r.tearDown(oldInfo)
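One way to make that teardown safe with shared clients is reference counting, so a storage teardown only closes the underlying client when its last user releases it. A minimal sketch, with names invented here (this is not what the PR implements, just an illustration of the concern):

```go
package main

import (
	"fmt"
	"sync"
)

// Client is a stand-in for an etcd v3 client.
type Client struct{ closed bool }

func (c *Client) Close() { c.closed = true }

// refCounted wraps a shared Client so each CRD storage can release its
// handle without tearing the client down for everyone else.
type refCounted struct {
	mu   sync.Mutex
	refs int
	c    *Client
}

func (r *refCounted) Acquire() *Client {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.refs++
	return r.c
}

func (r *refCounted) Release() {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.refs--
	if r.refs == 0 {
		r.c.Close() // only the last user actually closes the connection
	}
}

func main() {
	shared := &refCounted{c: &Client{}}
	shared.Acquire() // storage for CRD A
	shared.Acquire() // storage for CRD B
	shared.Release() // CRD A is deleted; client must stay open for B
	fmt.Println(shared.c.closed) // false
	shared.Release() // last user gone; now it is safe to close
	fmt.Println(shared.c.closed) // true
}
```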

@wojtek-t

/assign

Will take a look at it.


dims commented Dec 12, 2022

If you still need this PR then please rebase, if not, please close the PR


negz commented Dec 13, 2022

If you still need this PR then please rebase, if not, please close the PR

I wouldn't say I need this per se, but it seems like it would be a good thing for the API server not to create potentially hundreds of superfluous etcd clients (with the associated network and memory overhead). I'll find time to rebase this week, but more broadly I'm not sure I'll have the bandwidth in the near future to proactively lobby to get this fixed, which I imagine will be needed to drive this PR to completion given it's been sitting open for months with little momentum.


dims commented Dec 13, 2022

cc @wojtek-t @liggitt

Jordan, Wojtek, could we please make a determination on this?


liggitt commented Dec 13, 2022

I'm +1 on the goal of not creating duplicate etcd clients. I'm -1 on adding a new client cache... the ones we have created for client-go have been sort of a nightmare, and I'm not eager to repeat/expand that experience

Since we control all constructions of etcd clients (as opposed to client-go which is constructed by callers from The Internet), I would expect us to be able to move the client construction earlier in the server startup so a single instance can be shared by multiple registries.
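That alternative (no cache; construct once, inject everywhere) can be sketched as follows. The names here are hypothetical; in the real apiserver this wiring would live in the storage factory during server setup.

```go
package main

import "fmt"

// Client is a stand-in for an etcd v3 client (*clientv3.Client in the real code).
type Client struct{ endpoints []string }

// Registry is a stand-in for a resource storage/registry that needs etcd access.
type Registry struct {
	name   string
	client *Client // injected at construction, never built per registry
}

// newServer builds one shared client during startup and hands the same
// instance to every registry, so no per-resource construction (and no
// cache) is needed later.
func newServer(endpoints, resources []string) []*Registry {
	shared := &Client{endpoints: endpoints}
	regs := make([]*Registry, 0, len(resources))
	for _, r := range resources {
		regs = append(regs, &Registry{name: r, client: shared})
	}
	return regs
}

func main() {
	regs := newServer([]string{"https://127.0.0.1:2379"}, []string{"pods", "widgets"})
	fmt.Println(regs[0].client == regs[1].client) // true: one shared instance
}
```

Because there is no cache, there is no cache invalidation or lifecycle problem: the client lives exactly as long as the server does.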

@enj @ritazh, would any of the storage cleanup work you did in 1.26 around unifying encryption make it easier to ensure we construct etcd clients for a single etcd server a single time?

As an aside, I do wonder if this will change the performance characteristics of large servers with lots of requests going to parallel resources at the same time, which previously got their own connections, and now would multiplex over a single (?) connection. @wojtek-t, do our scale tests exercise saturating the apiserver <-> etcd connection?


enj commented Dec 13, 2022

I'm +1 on the goal of not creating duplicate etcd clients. I'm -1 on adding a new client cache... the ones we have created for client-go have been sort of a nightmare, and I'm not eager to repeat/expand that experience

💯% agree here, let's not make the same mistake twice.

Since we control all constructions of etcd clients (as opposed to client-go which is constructed by callers from The Internet), I would expect us to be able to move the client construction earlier in the server startup so a single instance can be shared by multiple registries.

@enj @ritazh, would any of the storage cleanup work you did in 1.26 around unifying encryption make it easier to ensure we construct etcd clients for a single etcd server a single time?

#114458 might work.


enj commented Dec 13, 2022

@negz can you see if #114458 helps in the scenarios you were testing?


negz commented Dec 13, 2022

Thanks for the direction folks. I'll close this PR and we can follow up on the issue that tracks the problem (#111622).

@enj unfortunately I'm oversubscribed at the moment (and about to take some vacation) so I won't get a chance to dig into this until early January. Definitely happy to do so then. If you want to take a shot in the meantime what I'd be looking for is:

  • Does this change affect performance, per @liggitt's concerns? I personally don't have access to large-scale installations to test this, so I'm not sure how to build confidence apart from E2E scale tests.
  • Does this change reduce API server per-CRD memory consumption?
  • Does this change reduce API server per-CRD TCP socket usage?

My personal motivation is driving down the memory usage. Currently when you load too many @crossplane providers (and thus too many CRDs) on an EKS or GKE cluster the API server does not scale up elegantly: it seems to be OOM-killed, and it takes some time for the control plane to recover and become available (presumably it's eventually given more memory). If using fewer etcd clients does indeed result in the kind of improvement I saw (i.e. ~30% less memory with 1,800 CRDs), that's ~30% more CRDs we can load before things go south.

(CC @apelisse who has been working to drive down API server memory and CPU usage under CRD load.)

@negz negz closed this Dec 13, 2022

aojea commented Dec 15, 2022

As an aside, I do wonder if this will change the performance characteristics of large servers with lots of requests going to parallel resources at the same time, which previously got their own connections, and now would multiplex over a single (?) connection. @wojtek-t, do our scale tests exercise saturating the apiserver <-> etcd connection?

I think the etcd client uses gRPC, and gRPC should load balance, no?


liggitt commented Dec 15, 2022

As an aside, I do wonder if this will change the performance characteristics of large servers with lots of requests going to parallel resources at the same time, which previously got their own connections, and now would multiplex over a single (?) connection. @wojtek-t, do our scale tests exercise saturating the apiserver <-> etcd connection?

I think that etcd client uses GRPC, GRPC should loadbalance , no?

sure, but multiplexing unrelated resources over a single connection at least opens the door to contention or capacity issues that separate connections would have side-stepped, right?

@lavalamp

In practice I believe all connections were made to the same backend previously. I will be a little surprised if gRPC's multiplexing (over single HTTP/2 connection) is worse than the kernel/network fabric's multiplexing (over multiple connections).


wojtek-t commented Jan 2, 2023

As an aside, I do wonder if this will change the performance characteristics of large servers with lots of requests going to parallel resources at the same time, which previously got their own connections, and now would multiplex over a single (?) connection. @wojtek-t, do our scale tests exercise saturating the apiserver <-> etcd connection?

I don't think we're really saturating it. Most of the problems we were and are suffering from are at the kube-apiserver level, and the tests focus on that. It shouldn't be hard to slightly extend the tests to saturate it (e.g. by adding N pods running LIST calls), but we don't have that as part of the default tests.

I think my intuition is what @lavalamp wrote above, but we should probably test that anyway before making such a change.

I will queue up taking a look at the other PR and figure out the quickest way of testing it.

Successfully merging this pull request may close these issues.

API server creates one etcd client per CRD