
add AssumeCache from volume binding to client-go #112202

Closed
wants to merge 7 commits

Conversation

pohly
Contributor

@pohly pohly commented Sep 2, 2022

What type of PR is this?

/kind feature

What this PR does / why we need it:

The AssumeCache in the kube-scheduler volume binding plugin solves a problem that other components also have: when storing objects in an informer cache, that cache is out-dated directly after making a change in the apiserver. If the component happens to do some work at that point that involves the updated object, it will do so based on stale information.

Components can be designed to handle this, for example by discarding their local changes and retrying when storing an updated object in the apiserver fails with a conflict error. Components probably have to be prepared for such a case anyway, but it is confusing.
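
For illustration, that conflict-handling pattern is typically written with client-go's retry helper. This is a minimal sketch; the PersistentVolumeClaim resource and the annotation are arbitrary examples and not part of this PR:

```go
// Minimal sketch of the mitigation described above: re-read the object and
// retry on a conflict error instead of trusting a possibly stale informer copy.
package main

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/util/retry"
)

func markProcessed(ctx context.Context, client kubernetes.Interface, ns, name string) error {
	return retry.RetryOnConflict(retry.DefaultRetry, func() error {
		// Fetch the latest object from the apiserver; an informer lister might
		// still return a version that predates our own previous update.
		pvc, err := client.CoreV1().PersistentVolumeClaims(ns).Get(ctx, name, metav1.GetOptions{})
		if err != nil {
			return err
		}
		if pvc.Annotations == nil {
			pvc.Annotations = map[string]string{}
		}
		pvc.Annotations["example.com/processed"] = "true"
		_, err = client.CoreV1().PersistentVolumeClaims(ns).Update(ctx, pvc, metav1.UpdateOptions{})
		return err // a conflict here triggers another round of the closure
	})
}
```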

Special notes for your reviewer:

This PR leaves the volume binding plugin unchanged. Migrating that code can be handled separately.

Because this uses generics, each commit comes with benchmark results.

In contrast to the original code, the final version of the AssumeCache in this PR no longer makes assumptions about the content of the ResourceVersion field.

Does this PR introduce a user-facing change?

NONE

@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. kind/feature Categorizes issue or PR as related to a new feature. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Sep 2, 2022
@k8s-ci-robot k8s-ci-robot added sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. sig/storage Categorizes an issue or PR as relevant to SIG Storage. and removed do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Sep 2, 2022
@pohly
Contributor Author

pohly commented Sep 2, 2022

The best solution that I could come up with that doesn't depend on ResourceVersion is this:

  • before retrieving an object from the cache with the intent to modify it, lock it
  • the informer callbacks also lock it
  • after the Update call, unlock and store the modified object or cancel the lock in case of a failure

This ensures that the apiserver always wins. But this may also lead to not having the very latest object locally:

  • apiserver sends new object with revision X, gets added to AssumeCache
  • apiserver sends update Y, not added yet
  • component locks object with revision X
  • patches object (can succeed even when not using the latest revision) -> revision Z stored in AssumeCache
  • AssumeCache processes object with revision Y, replacing the newer assumed object with revision Z

With this approach, the AssumeCache is still an improvement over just using the informer cache, but it cannot fully guarantee that the local object is always the latest.
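
Very roughly, that locking scheme could look like the sketch below. All of the names (`lockingCache`, `GetForUpdate`, `Commit`, `Abort`) are hypothetical and not what this PR implements; the sketch also simplifies the informer side by skipping a locked key instead of blocking on it, which is exactly where the revision Y vs. Z race described above comes from.

```go
// Hypothetical sketch of the locking variant discussed above; none of these
// names exist in client-go or in this PR.
package assume

import (
	"fmt"
	"sync"

	"k8s.io/apimachinery/pkg/runtime"
)

type lockingCache struct {
	mu      sync.Mutex
	objects map[string]runtime.Object // keyed by namespace/name
	locked  map[string]bool           // keys with an update in flight
}

// GetForUpdate returns the cached object and marks the key as locked so that
// it is not overwritten while the caller talks to the apiserver.
func (c *lockingCache) GetForUpdate(key string) (runtime.Object, error) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.locked[key] {
		return nil, fmt.Errorf("%s: update already in flight", key)
	}
	obj, ok := c.objects[key]
	if !ok {
		return nil, fmt.Errorf("%s: not found", key)
	}
	c.locked[key] = true
	return obj, nil
}

// Commit stores the object returned by Update and releases the lock.
func (c *lockingCache) Commit(key string, updated runtime.Object) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.objects[key] = updated
	delete(c.locked, key)
}

// Abort releases the lock without replacing the cached object, for example
// after a failed Update call.
func (c *lockingCache) Abort(key string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	delete(c.locked, key)
}

// onInformerEvent stores what the apiserver sent. In the scheme described
// above the informer callback would block on the lock instead; this sketch
// simply skips locked keys for brevity.
func (c *lockingCache) onInformerEvent(key string, obj runtime.Object) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.locked[key] {
		return
	}
	c.objects[key] = obj
}
```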

@pohly
Contributor Author

pohly commented Sep 3, 2022

There is a simpler solution that achieves the same result: Assume must be passed the original object (= the one retrieved from the cache) and the modified one (= the one returned as the response to Update or Patch). The modified object then gets stored only if the cache hasn't received a different object from the apiserver in the meantime.
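
A minimal sketch of that rule, using hypothetical names (the PR's actual AssumeCache API is generic and differs in detail). Note that it only relies on ResourceVersion *equality*, which the API conventions do allow:

```go
// Hypothetical sketch of the "store only if unchanged" rule described above.
package assume

import (
	"sync"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

type cache struct {
	mu      sync.Mutex
	objects map[string]metav1.Object // keyed by namespace/name
}

// Assume replaces the cached object with modified (the response from
// Update/Patch), but only if the cache still holds exactly the original
// object that the caller started from. If the informer delivered something
// else in the meantime, that apiserver state wins and modified is dropped.
func (c *cache) Assume(original, modified metav1.Object) bool {
	key := original.GetNamespace() + "/" + original.GetName()
	c.mu.Lock()
	defer c.mu.Unlock()
	current, ok := c.objects[key]
	// Equality comparison of ResourceVersion is permitted by the API
	// conventions; only ordering comparisons are not.
	if !ok || current.GetResourceVersion() != original.GetResourceVersion() {
		return false
	}
	c.objects[key] = modified
	return true
}

// onInformerEvent unconditionally stores whatever the apiserver sent.
func (c *cache) onInformerEvent(obj metav1.Object) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.objects[obj.GetNamespace()+"/"+obj.GetName()] = obj
}
```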

@aojea
Member

aojea commented Sep 4, 2022

/assign @lavalamp @liggitt @deads2k

I think this is a recurring topic, but I couldn't find the references explaining why this was not implemented before; the assigned persons will know better for sure 😄

@pohly
Contributor Author

pohly commented Sep 5, 2022

Let me quote from @cofyc's comment in #sig-storage:

Although the docs about resource versions ask clients not to compare them, the GET/WATCH methods can accept a ResourceVersion query parameter to get results "not older than" or "starting at" a specified resource version. IMHO the resource version is designed to be monotonic (at least per object, I guess), but that is currently an internal implementation detail of the API server. Submit a proposal to standardize the ordering of ResourceVersion, e.g. alphanumeric ordering?

@pohly
Contributor Author

pohly commented Sep 5, 2022

alphanumeric ordering

That would imply padding the integers with enough leading zeros so that, for example, "9" sorts before "10" (i.e. "09" < "10"). Not sure whether that could be done on-the-fly for existing objects. How many zeros are "enough"?

It might be better to just specify the current behavior (convert to integer, compare the result).
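
For reference, "convert to integer, compare the result" is roughly what the volume binding plugin's AssumeCache does today. A sketch, and exactly the kind of ResourceVersion parsing the documentation discourages, which is the concern raised further below:

```go
// Roughly the comparison the existing volume binding AssumeCache performs:
// parse ResourceVersion as a signed 64-bit integer and compare numerically.
// The API documentation tells clients not to rely on this.
package assume

import (
	"fmt"
	"strconv"
)

func isNewer(a, b string) (bool, error) {
	av, err := strconv.ParseInt(a, 10, 64)
	if err != nil {
		return false, fmt.Errorf("parsing ResourceVersion %q: %w", a, err)
	}
	bv, err := strconv.ParseInt(b, 10, 64)
	if err != nil {
		return false, fmt.Errorf("parsing ResourceVersion %q: %w", b, err)
	}
	return av > bv, nil
}
```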

@pohly pohly force-pushed the client-go-assume-cache branch 2 times, most recently from 568da8c to bcf48e3 on September 6, 2022 09:03
@pohly
Contributor Author

pohly commented Sep 6, 2022

Thanks to some optimization (not allocating a "not found" error when the caller in add just discards it again), the current implementation is now faster than the original one.

@liggitt
Member

liggitt commented Sep 6, 2022

One potential concern is that the AssumeCache must make an assumption (sic!) about the content of the ResourceVersion field.

That is a concern... I don't think we should provide libraries to clients that go against the explicit documentation that clients should not compare resource versions other than for equality. I'd like to hear @deads2k's and @lavalamp's thoughts as well.

@pohly
Contributor Author

pohly commented Sep 6, 2022

The AssumeCache under client-go could be implemented without the "older than" comparison. It would still be useful for my purposes. But can it then be used by the volume binding plugin or do we need to keep that code unchanged?

@jsafrane, @msau42: WDYT?

@leilajal
Contributor

leilajal commented Sep 6, 2022

/triage accepted

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Sep 6, 2022
@lavalamp
Member

lavalamp commented Sep 6, 2022

I'm not very excited about clients parsing RV; our explicit guidance is not to do that.

You can use .generation for this if your changes are in the spec. Ideal would be a logical clock like generation that covers the whole object.
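
A sketch of that generation-based check; the helper name `specIsNewer` is hypothetical, and the comparison is only meaningful for resources whose metadata.generation is bumped on spec changes:

```go
// Sketch: compare metadata.generation instead of parsing ResourceVersion.
// This says nothing about status-only updates.
package assume

import metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

func specIsNewer(a, b metav1.Object) bool {
	return a.GetGeneration() > b.GetGeneration()
}
```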

I'll try and think of better alternatives...

@pohly
Contributor Author

pohly commented Sep 12, 2022

Perhaps two examples in the documentation of AssumeCache will make this clearer. I've pushed an update...

@k8s-ci-robot
Contributor

@pohly: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name | Commit | Details | Required | Rerun command
pull-kubernetes-unit | f794435 | link | true | /test pull-kubernetes-unit
pull-kubernetes-e2e-kind-ipv6 | f794435 | link | true | /test pull-kubernetes-e2e-kind-ipv6

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@lavalamp
Member

The "is equal" comparison does not depend on that. The logic is "if the AssumeCache currently holds the same object that I modified, then it is safe to replace it with the update that I just got back from Update/Patch". There is no comparison when receiving an update through the informer. That update always overwrites the object in the AssumeCache.

Ah. (I still haven't actually read this code, btw -- I am still trying to figure out if anything in this space is possible without API changes, and right now I'm thinking not.)

That's why the revised AssumeCache is less precise: sometimes it might drop the newer object and replace it with an older one.

Yeah, there's at least one case where this technique can cause backwards time travel, which is the one thing clients basically can't tolerate at all.

I want this kind of cache to exist (I talked about the necessity of this in the least understood part of this old talk), but I think we need to do it right.

Currently it seems like we either need to add an entire-object logical clock, or relax our RV client usage constraints. I think this is worth talking about at the next api machinery SIG meeting.

@lavalamp
Member

(I put this on the agenda)

@pohly
Contributor Author

pohly commented Sep 12, 2022

Yeah, there's at least one case where this technique can cause backwards time travel, which is the one thing clients basically can't tolerate at all. [...] I think we need to do it right.

I can't speak for other clients, but at least the controller that I am currently working on for dynamic resource allocation has no problem with this. It's entirely stateless and will just react to whatever it currently gets from the cache. The gRPC calls it makes are idempotent, so it's not a problem to do the same operation twice. If a request is invalid because the object was too old, then an error gets logged and it will try again, probably with a newer object.

TLDR: I think this would already be useful in client-go as it is now. But I can also host a local copy elsewhere while something better gets figured out for client-go.

@lavalamp
Member

It's entirely stateless and will just react to whatever it currently gets from the cache.

That means you can have oscillation and actuation amplification. (user changes A->B; controller changes external system A->B->A->B before settling)

@pohly
Contributor Author

pohly commented Sep 13, 2022

In my case, the controller triggers two external changes, allocation and deallocation of a claim, and records the result in the claim status.

When A = "not allocated, pending" and B = "needs to be allocated", then the external change is an Allocate call when seeing B. The third state then is C = "is allocated" which the AssumeCache needs to store locally to prevent repeating the Allocate call. When the controller sees A->B->A, it only calls Allocate once. Without the AssumeCache, it acts twice on B because of the stale informer cache. That's still harmless (the Allocate call is idempotent), but causes noise in the logs and extra work.

The inverse direction (deallocating a claim) is similar.

@lavalamp
Member

A is "needs to be allocated", B is "allocated", C is "needs to be deallocated", D is "unallocated"

I don't know the specifics enough to say the exact sequence that triggers the behavior, but I'll be surprised if there's no A->C write combo which, combined with a time travel event, causes your controller to allocate, deallocate, allocate, deallocate.

Idempotence doesn't protect against time travel! If the input (desired state) flaps, the output will flap too.

Maybe the trigger will be exceedingly rare or never happen in your case, but that definitely isn't generally true.

@pohly
Contributor Author

pohly commented Sep 13, 2022

Maybe the trigger will be exceedingly rare or never happen in your case,

I think it is indeed. I can easily trigger the situation where the lack of an AssumeCache causes redundant Allocate calls, but I can't think of a way to trigger thrashing.

Anyway, we don't need to merge this. I can copy the end-result of this PR into a package elsewhere and just use it in that controller where it is known to help more than it hurts.

Shall I close this PR or keep it open as a reminder?

@lavalamp
Member

Let's leave this open at least until the sig discusses this.

pohly added a commit to pohly/kubernetes that referenced this pull request Sep 19, 2022
This is identical to the proposal from
kubernetes#112202. In contrast to the
AssumeCache in the volume binding scheduler plugin, this version of the code
does not make assumptions about the content of the ResourceVersion fields,
i.e. it does no "is newer than" comparison. Therefore objects might go back in
time under some circumstances.

This was seen as insufficient for inclusion in client-go, but for DRA it's
better than not having the AssumeCache, so the code gets included here for use
in that controller.
pohly added a commit to pohly/kubernetes that referenced this pull request Sep 19, 2022
This is identical to the proposal from
kubernetes#112202. In contrast to the
AssumeCache in the volume binding scheduler plugin, this version of the code
does not make assumptions about the content of the ResourceVersion fields,
i.e. it does no "is newer than" comparison. Therefore objects might go back in
time under some circumstances.

This was seen as insufficient for inclusion in client-go, but for DRA it's
better than not having the AssumeCache, so the code gets included here for use
in that controller.
@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Sep 20, 2022
@k8s-ci-robot
Contributor

@pohly: PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

pohly added a commit to pohly/kubernetes that referenced this pull request Sep 20, 2022
This is identical to the proposal from
kubernetes#112202. In contrast to the
AssumeCache in the volume binding scheduler plugin, this version of the code
does not make assumptions about the content of the ResourceVersion fields,
i.e. it does no "is newer than" comparison. Therefore objects might go back in
time under some circumstances.

This was seen as insufficient for inclusion in client-go, but for DRA it's
better than not having the AssumeCache, so the code gets included here for use
in that controller.
pohly added a commit to pohly/kubernetes that referenced this pull request Sep 22, 2022
This is identical to the proposal from
kubernetes#112202. In contrast to the
AssumeCache in the volume binding scheduler plugin, this version of the code
does not make assumptions about the content of the ResourceVersion fields,
i.e. it does no "is newer than" comparison. Therefore objects might go back in
time under some circumstances.

This was seen as insufficient for inclusion in client-go, but for DRA it's
better than not having the AssumeCache, so the code gets included here for use
in that controller.
@pohly
Contributor Author

pohly commented Sep 22, 2022

This was discussed in the September 21st, 2022 SIG API Machinery meeting.

There was a certain tendency to just allow parsing ResourceVersion, but also concerns that it might not always be a signed 64-bit int. The next step is for @lavalamp to write an email and/or a KEP documenting the options and making a formal proposal.

My two cents:

  • It's not just me who wants this. The kube-scheduler volume binding plugin already has such an AssumeCache with ResourceVersion parsing.
  • A new field with better-defined semantics may be nicer, but it doesn't solve the problem for a long time, because Kubernetes releases without it still need to be supported. Can we allow parsing ResourceVersion as a fallback for those?

@lavalamp
Member

Discussion in: #112684

@stevekuznetsov
Contributor

Doesn't the mutation cache in client-go already implement this (more useful, but an invalid use of API semantics) functionality?

@lavalamp
Member

Mutation cache is test-only, and is there to detect people modifying items in the cache accidentally. Unless it has changed substantially since I last looked.

@stevekuznetsov
Contributor

Sorry, to be super clear, I mean the cache here, imported in k/k and used here in the ServiceAccount tokens controller.
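
For context, the MutationCache referenced here can be layered on top of an informer's store roughly as sketched below. The constructor arguments shown match client-go around the time of this PR and may differ in other releases; the Secret example is arbitrary. As its name suggests, NewIntegerResourceVersionMutationCache also parses ResourceVersion as an integer internally, so it has the same documentation problem discussed above.

```go
// Sketch of layering client-go's existing MutationCache over an informer store.
package main

import (
	"context"
	"time"

	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/cache"
)

func newSecretMutationCache(client kubernetes.Interface) (cache.SharedIndexInformer, cache.MutationCache) {
	factory := informers.NewSharedInformerFactory(client, 0)
	informer := factory.Core().V1().Secrets().Informer()
	// Reads are answered from the informer store, but objects passed to
	// Mutation() stay visible locally for the given TTL, i.e. before the
	// corresponding watch event has arrived. The informer still needs to be
	// started and synced by the caller.
	mc := cache.NewIntegerResourceVersionMutationCache(
		informer.GetStore(),   // backing store
		informer.GetIndexer(), // indexer used by ByIndex
		time.Minute,           // how long locally "mutated" entries are remembered
		false,                 // includeAdds: whether to remember objects not yet in the store
	)
	return informer, mc
}

// updateAndRemember writes a Secret and immediately makes the result visible
// to local readers via the mutation cache.
func updateAndRemember(ctx context.Context, client kubernetes.Interface, mc cache.MutationCache, secret *v1.Secret) error {
	updated, err := client.CoreV1().Secrets(secret.Namespace).Update(ctx, secret, metav1.UpdateOptions{})
	if err != nil {
		return err
	}
	mc.Mutation(updated)
	return nil
}
```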

@lavalamp
Member

I either didn't know about that or had blocked it from my memory. My first impression is that it seems to be constructed of razor wire :)

@pohly
Contributor Author

pohly commented Sep 22, 2022

I didn't know about it either. Now that I do I am wondering how it would interact with an informer. There's the "TODO find a way to layer this into an informer/lister". I suppose it's meant to be passed the store that is used by an informer?

I like the implementation less than the one from the volume binder, but I guess it would do the job and this PR isn't needed - unless a more modern, type-safe API is desired.

@pohly
Contributor Author

pohly commented Oct 4, 2022

Let me close this PR. The mutation cache is already in client-go and does essentially the same thing, just with a different API.

It would still be good to get the whole question of "is it okay to compare versions" clarified, but we don't need this PR for that.

@pohly pohly closed this Oct 4, 2022