[prototype] Make apiserver admission plugins retrieve k8s objects dir… #121979

Closed
wants to merge 1 commit

Conversation

linxiulei
Contributor

Make apiserver admission plugins retrieve k8s objects directly from its storage

Before this commit, admission plugins use informers to sync k8s objects, and as a result there is an additional copy of each k8s object in kube-apiserver. This commit allows referencing k8s objects from storage without actually copying them, which significantly reduces memory usage in both serving and storage.

Also observed a visible reduction in CPU usage during kube-apiserver startup, since the admission plugins no longer need to fetch objects from the HTTP endpoints and kube-apiserver no longer needs to serve those requests.

Memory measurements and comparisons

```
# create about 9k pods
kubectl get --raw /metrics | grep go_memstats_alloc_bytes
```

Origin:

Peak memory alloc: go_memstats_alloc_bytes 7.16614552e+08
Stabilized memory alloc: go_memstats_alloc_bytes 4.2804752e+08

Optimized:

Peak memory alloc: go_memstats_alloc_bytes 3.46012872e+08
Stabilized memory alloc: go_memstats_alloc_bytes 2.82520896e+08

What type of PR is this?

/kind cleanup

What this PR does / why we need it:

This commit allows referencing k8s objects from storage without actually copying them, which significantly reduces memory and CPU usage in both serving and storage.

Which issue(s) this PR fixes:

Fixes #121657

Special notes for your reviewer:

The current PR contains only minimal changes, for easier review and discussion.

The implementation contains three parts:

  1. Populating the restStorage object in the kube-apiserver initialization code, which is needed for retrieving k8s objects
  2. Creating a StorageInformers factory, reusing the existing code generation for staging/src/k8s.io/client-go/informers/factory.go, that is initialized with restStorage instead of a Kubernetes client. The new factory exposes exactly the same interfaces as the current informer factory
  3. Adding an interface that allows apiserver admission plugins to be initialized with informers created by StorageInformers

As for plugin migration, the changes should be minimal: adding an interface and replacing the informers with the ones created by StorageInformers, as sketched below. We will likely migrate all plugins to the new informers for lower overhead. For a safe rollout, we could guard the change with a feature gate.
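For illustration only, a minimal sketch of what that plugin-side interface could look like, modeled on the existing WantsExternalKubeInformerFactory pattern; the names StorageInformerFactory and WantsStorageInformerFactory are hypothetical and not taken from this PR:

```go
// Hypothetical sketch only: StorageInformerFactory and
// WantsStorageInformerFactory are illustrative names, not code from this PR.
package initializer

import (
	"k8s.io/apiserver/pkg/admission"
	"k8s.io/client-go/informers"
)

// StorageInformerFactory would expose the same interface as the generated
// SharedInformerFactory, but would be backed by the apiserver's REST storage
// rather than an HTTP client.
type StorageInformerFactory interface {
	informers.SharedInformerFactory
}

// WantsStorageInformerFactory would be implemented by admission plugins that
// can consume storage-backed informers, analogous to the existing
// WantsExternalKubeInformerFactory interface.
type WantsStorageInformerFactory interface {
	SetStorageInformerFactory(StorageInformerFactory)
	admission.InitializationValidator
}
```

Plugin migration would then amount to implementing the extra setter and using whichever informer factory the initializer provides.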

Does this PR introduce a user-facing change?


Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:

NONE

@k8s-ci-robot
Contributor

Adding the "do-not-merge/release-note-label-needed" label because no release-note block was detected; please follow our release note process to remove it.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt. do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Nov 21, 2023
@k8s-ci-robot
Contributor

This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the needs-priority Indicates a PR lacks a `priority/foo` label and requires one. label Nov 21, 2023
@linxiulei linxiulei marked this pull request as draft November 21, 2023 11:59
@k8s-ci-robot k8s-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Nov 21, 2023
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: linxiulei
Once this PR has been reviewed and has the lgtm label, please assign deads2k for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added area/apiserver sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. and removed do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Nov 21, 2023
@linxiulei
Contributor Author

/cc @deads2k @jpbetz @liggitt @wojtek-t

@jpbetz
Contributor

jpbetz commented Nov 21, 2023

Thanks for prototyping this @linxiulei!

Have we considered an alternative approach where, instead of introducing an alternate implementation of informers, we implement an alternate implementation of the client? I'm imagining an "internal client" that is optimized to minimize copies. This client would probably only be optimized for reads, since writes MUST go through the admission chain. Would this be viable? Would it eliminate the need to generate alternate informers?

It also just occurred to me that bypassing the normal client flow also bypasses auth..

@linxiulei
Contributor Author

re: internal client, I am not sure I understand it correctly, but if the internal client doesn't have the same API as the informers, then it would be a larger effort to migrate the code in all those admission plugins (30+ plugins, although not all use informers) to the new client. It would be even more non-trivial to keep both the new client and the current design if we need to handle safe fallback for special cases.

This was the major reason I wanted to implement the informer API. Also, please correct me if I'm wrong, but I believe admission plugins don't write.

re: bypassing auth. Yeah, it bypasses the HTTP interface and therefore everything attached to it. I guess admission plugins don't need it in common cases.

@jpbetz
Contributor

jpbetz commented Nov 21, 2023

re: internal client, I am not sure I understand it correctly, but if the internal client doesn't have the same API as the informers, then it would be a larger effort to migrate the code in all those admission plugins (30+ plugins, although not all use informers) to the new client. It would be even more non-trivial to keep both the new client and the current design if we need to handle safe fallback for special cases.

My thought was along the lines of: What if we could use the informers exactly as they exist today, but instead of wiring in the normal client, we somehow wire in a client (implementation of k8s.io/client-go/kubernetes.Interface) that retrieves data directly from storage. Then, if we use that client here to perform reads directly from storage, would this optimization "just work"?
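For illustration only, a rough sketch of that wiring, assuming a hypothetical newStorageBackedClient that implements k8s.io/client-go/kubernetes.Interface with reads served from storage; the generated informers are reused unchanged:

```go
// Rough sketch of the idea above. newStorageBackedClient is hypothetical: it
// would implement kubernetes.Interface with Get/List/Watch served from the
// apiserver's storage and writes delegated to a normal client.
package main

import (
	"time"

	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
)

func newStorageBackedClient() kubernetes.Interface {
	// Illustrative placeholder only; not implemented here.
	return nil
}

func main() {
	client := newStorageBackedClient()

	// The stock factory only needs a kubernetes.Interface, so the generated
	// informers and listers would work unchanged on top of the new client.
	factory := informers.NewSharedInformerFactory(client, 10*time.Minute)
	_ = factory.Core().V1().Pods().Lister()
}
```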

This was the major reason I wanted to implement the informer API. Also, please correct me if I'm wrong, but I believe admission plugins don't write.

Yes, that's right. I brought up writes because, if we reimplement k8s.io/client-go/kubernetes.Interface, we'd have to decide what the implementation of writes does. I think I'd be in favor of having writes delegate to the normal client.

re: bypassing auth. Yeah, it bypasses the HTTP interface and therefore everything attached to it. I guess admission plugins don't need it in common cases.

This makes me uncomfortable. If we do this, I'd want to keep this optimization as narrow as possible. It might make my client idea useless, since we'd be handling just a very few special cases. Do we know which admission plugins benefit most from this optimization? Everything that touches pod?

@wojtek-t
Member

re: bypassing auth. Yeah, it bypasses the HTTP interface and therefore everything attached to it. I guess admission plugins don't need it in common cases.

This makes me uncomfortable. If we do this, I'd want to keep this optimization as narrow as possible. It might make my client idea useless, since we'd be handling just a very few special cases. Do we know which admission plugins benefit most from this optimization? Everything that touches pod?

This makes me uncomfortable too, because pretty much everything we're doing in the handler chain now no longer works. Auth seems to be the most serious example (@liggitt @enj - please chime in), but there are other consequences too, e.g. auditing, APF (rate-limiting), timeout handling, the whole observability stack (logging/monitoring), etc.

I understand the motivation and I admit that the savings are pretty significant, but I'm wondering how we can better ensure safety. Can we somehow hack around it differently to preserve the whole handler chain? Without putting deeper thought into it, what if the model were something like:

  • we create a simple wrapper around rest storage that, instead of returning the data, puts it into the context under some well-known key
  • we wrap this with the whole set of existing handlers from the handler chain
  • we implement only the ListWatch interface for that, and the implementation uses the above and gets the data from the context...

I admit it sounds much more hacky, but at least it seems safer to me. That's just brainstorming, though - I would like to hear better options...
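For illustration only, a rough sketch of the ListWatch piece of that idea, assuming hypothetical listThroughHandlerChain/watchThroughHandlerChain helpers that run the request through the full handler chain in-process and pull the result back out of the context:

```go
// Rough sketch of the brainstormed idea above: an in-process ListerWatcher
// whose List and Watch calls are served by a wrapped rest storage plus the
// existing handler chain (auth, audit, APF, ...). listThroughHandlerChain and
// watchThroughHandlerChain are hypothetical names for that wrapped path.
package storageinformer

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime"
	"k8s.io/apimachinery/pkg/watch"
	"k8s.io/client-go/tools/cache"
)

func listThroughHandlerChain(resource string, opts metav1.ListOptions) (runtime.Object, error) {
	panic("illustrative only: not implemented")
}

func watchThroughHandlerChain(resource string, opts metav1.ListOptions) (watch.Interface, error) {
	panic("illustrative only: not implemented")
}

// NewInProcessListWatch plugs into the standard reflector/informer machinery,
// which only needs a cache.ListerWatcher.
func NewInProcessListWatch(resource string) *cache.ListWatch {
	return &cache.ListWatch{
		ListFunc: func(opts metav1.ListOptions) (runtime.Object, error) {
			return listThroughHandlerChain(resource, opts)
		},
		WatchFunc: func(opts metav1.ListOptions) (watch.Interface, error) {
			return watchThroughHandlerChain(resource, opts)
		},
	}
}
```

The reflector/informer machinery would consume this like any other ListerWatcher.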

@wojtek-t wojtek-t self-assigned this Nov 22, 2023
@jpbetz
Contributor

jpbetz commented Nov 22, 2023

This makes me uncomfortable too, because pretty much everything we're doing in the handler chain now no longer works.

I was thinking about this last night and was finding it very difficult to reason through all the implications of skipping the handler chain. I agree we just shouldn't do that.

  • we create a simple wrapper around rest storage that, instead of returning the data, puts it into the context under some well-known key
  • we wrap this with the whole set of existing handlers from the handler chain
  • we implement only the ListWatch interface for that, and the implementation uses the above and gets the data from the context...

Do we still end up losing any defaulting? I'm worried there might be some old objects in etcd that need defaulting and would break the admission chain when it doesn't happen.

@enj
Member

enj commented Nov 22, 2023

I haven't looked at the code, but we used to have tons of direct etcd calls back in the etcd2 days in OpenShift. It was fast, but keeping the behavior consistent between the REST API and direct-to-storage paths was near impossible. Any changes we make for performance should keep the code paths for all calls as identical as possible. Audit, auth and similar layers should not be bypassed. I think we do some of the conversion and defaulting in the REST layers, so that would need to be preserved.

cc @deads2k @soltysh

@enj
Member

enj commented Nov 22, 2023

Also, don't we need a copy of one of everything for the watch cache anyway? Can we not rely on that instead?

Nevermind, I think this has the same issues if used directly without the REST layer.

@logicalhan
Member

Audit, auth and similar layers should not be bypassed. I think we do some of the conversion and defaulting in the REST layers so that would need to be preserved.

+1000 (skipping those handlers seems pretty sketchy to me)

@logicalhan
Member

/hold

I think we need sig-auth to really scrutinize this.

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Nov 22, 2023
@linxiulei
Contributor Author

Everything that touches pod?

Yes, pod is the biggest factor in memory usage for most use cases.

Apologies in advance for asking more questions rather than addressing comments.

re: auth, the current kubeClientConfig for apiserver admission plugins uses either an ephemeral token created at initialization time or a self-signed cert; does this make any difference?

re: handler chain, this only does READ requests from restStorage and no writes; I wonder whether any handlers in the chain do mutations on READ requests?

re: audit, is it by design that we want to audit the apiserver's own READ requests?

@wojtek-t
Member

I think we do some of the conversion and defaulting in the REST layers so that would need to be preserved.

I think conversion isn't a problem here, but yeah, I forgot about defaulting, which isn't part of the chain itself...
So my idea doesn't address it, or at least not in the obvious way I had hoped.

@wojtek-t
Member

I think at this point it seems that, if it's even possible at all, this would require a KEP.
But I wouldn't recommend writing a KEP without first having a discussion with the people on this PR...

@linxiulei
Contributor Author

I think we do some of the conversion and defaulting in the REST layers so that would need to be preserved.

@enj https://github.com/kubernetes/kubernetes/pull/121979/files#diff-05cd394e18da9e91dbf8b3d82e1e0813be670ee37db43ca5af6f91a39f78d08fR704 does reference the REST object, which performs defaulting in List()/Get(), so this should be transparently preserved.

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Mark this PR as fresh with /remove-lifecycle stale
  • Close this PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Feb 21, 2024
@linxiulei
Contributor Author

Update after bringing this to sig-auth (please see the meeting notes).

Relevant to

/hold

I think we need sig-auth to really scrutinize this.

  • Auth is not a big risk here, given that the current implementation uses a hardcoded method just to let admission plugins fetch k8s objects
  • Audit might be a problem, but given that most objects are fetched via watches, the auditing is skipped

We also discussed major risks on which we did not reach consensus, because the overall risk is high:

  • There is no mechanism to thoroughly test that the objects returned from the new code path are identical to those returned by informers, which are well tested across k8s use cases; although for now, the objects returned under the proposal should theoretically be the same.
  • With the proposed design, potential modifications made by the clients (i.e. admission plugins) would have a larger blast radius, since those modifications would be applied to the apiserver REST storage, essentially changing the k8s objects and resulting in incorrect objects.

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Mar 6, 2024
@k8s-ci-robot
Contributor

PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Mark this PR as fresh with /remove-lifecycle rotten
  • Close this PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Apr 5, 2024
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Reopen this PR with /reopen
  • Mark this PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

@k8s-ci-robot
Contributor

@k8s-triage-robot: Closed this PR.

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Reopen this PR with /reopen
  • Mark this PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

Development

Successfully merging this pull request may close these issues.

apiserver admission plugins should zero-copy k8s objects