
client-go/cache: informer misses delete events on relist for elements only present in the queue #115658

Closed
odinuge opened this issue Feb 9, 2023 · 37 comments
Labels
kind/bug · release-blocker · sig/api-machinery · triage/accepted

Comments

@odinuge
Member

odinuge commented Feb 9, 2023

What happened?

For quite a while we have seen that some controllers/informers suddenly get out of sync with the state of the world, and keep x stale objects around (a different number every time).

After more investigation, we think we narrowed it down to the informer having a queue of events while a relist occurs; for some objects, we never see the DELETED event. This leads to a leak of some sort, and the informer is stuck in a bad state. The informers in question have resyncs disabled, so this only goes back to normal after a new relist. It should be noted that the queue at the point of the relist has had 1000+ elements, so the controllers are clearly lagging very far behind.

What did you expect to happen?

Overall we expect that after a reflector relist, no objects should be stuck in a dangling state where they are added to the informer store but never cleared, even though they are no longer present in k8s.

How can we reproduce it (as minimally and precisely as possible)?

A rough timeline about what we think happens;

1. A new object O is added; the agent gets a CREATED event for it
2. The watch is terminated, and the agent runs a new list, L
3. The CREATE event for O is still on the queue to be processed
4. The informer replaces the old data in the store with L, and O is not in L
  - Since O is not in the store and not in the list L, no DELETED event is queued
5. The CREATE event for O remains on the queue, still unprocessed
6. The CREATE event for O is processed
7. O is <leaked>; it's present in the agent's external "store" (and I assume also the internal informer store) but not in k8s

I've been trying to write an informer/controller test to reproduce this, but it's a bit tricky since it depends on the order in which the DeltaFIFO.Replace and DeltaFIFO.Pop functions acquire the DeltaFIFO lock.
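As a rough illustration of the timeline above, here is a minimal sketch (not the eventual upstream test) that drives a DeltaFIFO directly, assuming client-go v0.26 signatures; the namespace/name "ns"/"pod-o" and the empty relist are just illustrative:

```go
package main

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/tools/cache"
)

func main() {
	// knownObjects plays the role of the informer's local store; O is not in it yet
	// because its Added delta has not been popped.
	knownObjects := cache.NewStore(cache.MetaNamespaceKeyFunc)

	f := cache.NewDeltaFIFOWithOptions(cache.DeltaFIFOOptions{
		KeyFunction:  cache.MetaNamespaceKeyFunc,
		KnownObjects: knownObjects,
	})

	o := &v1.Pod{ObjectMeta: metav1.ObjectMeta{Namespace: "ns", Name: "pod-o"}}

	// Step 1: O's CREATE event arrives and is queued, but not yet processed.
	f.Add(o)

	// Steps 2-4: the watch breaks and a relist happens; O has been deleted, so the
	// new list L does not contain it. Pre-fix, Replace only diffed L against
	// knownObjects, so no DELETED delta was queued for O.
	f.Replace([]interface{}{}, "2")

	// Steps 5-7: the queued Added delta is eventually popped and handed to the
	// informer, which adds O to its store even though O no longer exists in k8s.
	f.Pop(func(obj interface{}) error {
		for _, d := range obj.(cache.Deltas) {
			fmt.Printf("delta: %s %T\n", d.Type, d.Object)
		}
		return nil
	})
}
```

If I read the fix right, a pre-fix client pops only the Added delta and O is never deleted, while a fixed DeltaFIFO also consults the queued keys during Replace and the same Pop additionally delivers a Deleted delta carrying a DeletedFinalStateUnknown tombstone.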

Anything else we need to know?

Given this is indeed the case, I assume it would affect all users of informers, and most heavily those with the resync period set to 0, i.e. disabled. After some digging, I think the opposite is true: since resyncs don't refetch the list from k8s, they just put all the cached objects back on the queue, slowing down processing even more and possibly causing even more leaks.

Kubernetes version

$ kubectl version
# paste output here

Cloud provider

OS version

# On Linux:
$ cat /etc/os-release
# paste output here
$ uname -a
# paste output here

# On Windows:
C:\> wmic os get Caption, Version, BuildNumber, OSArchitecture
# paste output here

Install tools

Container runtime (CRI) and version (if applicable)

Related plugins (CNI, CSI, ...) and versions (if applicable)

@odinuge added the kind/bug label Feb 9, 2023
@k8s-ci-robot added the needs-sig and needs-triage labels Feb 9, 2023
@odinuge
Member Author

odinuge commented Feb 9, 2023

I did an additional implementation while experimenting in #115620. There is more info in the PR description, but the tl;dr is that it takes the union of the objects in the store and the queue, subtracts the elements returned by the relist call, and creates DELETE events for the resulting set. Previously, it used only the store and subtracted the elements from the relist.
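Not the merged patch itself, just a sketch of the key-set arithmetic described above (keysToDelete and the parameter names are mine):

```go
package deltasketch

import "k8s.io/apimachinery/pkg/util/sets"

// keysToDelete sketches the deletion detection in DeltaFIFO.Replace: everything we
// know about (store ∪ queue) that is absent from the relist result gets a synthetic
// DELETE event. Before the fix, only storeKeys was consulted.
func keysToDelete(storeKeys, queuedKeys, relistKeys []string) []string {
	known := sets.NewString(storeKeys...)
	known.Insert(queuedKeys...) // this is the part the fix adds
	return known.Difference(sets.NewString(relistKeys...)).List()
}
```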

Another approach could be popping the object out of the queue inside of .Replace instead, so that the controller and store never get the object. I'm not too sure about the implications of that, so happy to hear what others think.

@odinuge
Member Author

odinuge commented Feb 9, 2023

/sig api-machinery
/cc @wojtek-t @aojea

@k8s-ci-robot added the sig/api-machinery label and removed the needs-sig label Feb 9, 2023
@aojea
Member

aojea commented Feb 9, 2023

6. CREATE event for O is processed
7. O is <leaked>; it's present in the agent but not in k8s.

is O present in the informer cache?

you said it is not present in the list L

@odinuge
Member Author

odinuge commented Feb 9, 2023

is O present in the informer cache?

you said it is not present in the list L

Yeah, maybe that was a bit ambiguous, and my terminology around the internals here is not 100% on point, so sorry for that.

The "store" we have metrics for is a separate "store" populated from ResourceEventHandlerFuncs, and not the underlying store in the informer itself. However, I would assume that since the code updates the internal store then runs the handler function, they should be the same; https://github.com/kubernetes/kubernetes/blob/v1.26.0/staging/src/k8s.io/client-go/tools/cache/controller.go#L442-L445

If that makes sense?

And by L I mean the result from the list call from the apiserver after the watch was terminated.

@aojea
Member

aojea commented Feb 9, 2023

the informer cache represents the "current state"; the handlers are used to fill a workqueue, to implement this "controller" model: https://github.com/kubernetes/community/blob/master/contributors/devel/sig-api-machinery/controllers.md

so you queue the events and you reconcile by syncing from the workqueue. In this case I expect that you may queue O due to the stuck event, but it should not be present in the informer cache, so the reconciliation will work correctly

Anyway, I defer to apimachinery people to expand on this or correct me if necessary

/assign @lavalamp @deads2k

@cici37
Contributor

cici37 commented Feb 9, 2023

/triage accepted
/cc @MikeSpreitzer

@k8s-ci-robot added the triage/accepted label and removed the needs-triage label Feb 9, 2023
@lavalamp
Member

lavalamp commented Feb 9, 2023

This sounds like it is going wrong in the DeltaFIFO before it even gets to the application handler (and workqueue, if any).

Specifically we possibly don't add the object to the cache until the create message is processed.

Thanks for the report.

@aojea
Member

aojea commented Feb 9, 2023

Specifically we possibly don't add the object to the cache until the create message is processed.

That is the key thing, @lavalamp: he says the object is not in the list after the relist; it is the event that got stuck and goes to the handler... I think it should not be in the informer cache, no?

@lavalamp
Member

lavalamp commented Feb 9, 2023

@odinuge
Member Author

odinuge commented Feb 9, 2023

Yup @lavalamp, that was my conclusion as well ref. #115620.

@lavalamp
Member

lavalamp commented Feb 9, 2023

ah, thanks, I didn't see that -- left a comment.

@wojtek-t
Member

wojtek-t commented Feb 9, 2023

Thanks for the report - I agree with @lavalamp's pointer (and clearly it's also what you found).

@MikeSpreitzer
Member

MikeSpreitzer commented Feb 13, 2023

To spell it out a little more: the bug is that func (f *DeltaFIFO) Replace(list ...) left an entry in f.queue and f.items if it was there beforehand and absent from both list and f.knownObjects. This is consistent with the comments on DeltaFIFO::Replace but is not the behavior that suits the clients (informers); it causes this ghost item to later (when the DeltaFIFO's Pop method is called) get added to the informer's local cache and delivered in a notification to the informer's client(s).

@odinuge
Member Author

odinuge commented Feb 17, 2023

Fixed in #115620. The fix will be part of k8s v1.27. We are also planning to cherry-pick it to older releases.

@odinuge closed this as completed Feb 17, 2023
@odinuge reopened this Mar 1, 2023
@odinuge
Member Author

odinuge commented Mar 1, 2023

Hmm. We originally saw this issue with cilium, and the patch in #115620 now seems to correctly generate delete events. However, it seems like they clear the original object data [0] from the delta when it's popped [1] from the queue and added to the store. As a result, the delete events now created in Replace point to an "empty" object when the delete delta is popped. This was news to me and looks very strange, but it seems like cilium is doing it, so maybe others are too?

Right now I'm trying a cilium version with the "clearing" mechanism [0] removed, to verify that things then behave as expected.

Any thoughts here @lavalamp @MikeSpreitzer @wojtek-t? I guess this behavior change is fine for v1.27 (or?), given we add proper warning messages, but what about the backports? I'll put a /hold on the backports until we decide on a go or no-go here. Thanks!

[0]: https://github.com/cilium/cilium/blob/77c7d6d36cbc3cd16762ff093be575aa3d2ceed0/pkg/k8s/factory_functions.go#L622
[1]: https://github.com/cilium/cilium/blob/77c7d6d36cbc3cd16762ff093be575aa3d2ceed0/pkg/k8s/informer/informer.go#L103
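To make the "clearing" pattern concrete, here is a hypothetical illustration (not cilium's actual code; slimDownPod is an invented name) of a conversion step that wipes the object it was handed:

```go
package sketch

import (
	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// slimDownPod builds a slimmed-down copy and then clears the original in place.
// DeltaFIFO, the informer's store and any tombstones can all hold the same pointer,
// so the wiped object is what a later Deleted delta ends up carrying.
func slimDownPod(obj interface{}) interface{} {
	pod := obj.(*v1.Pod)
	slim := &v1.Pod{ObjectMeta: metav1.ObjectMeta{Name: pod.Name, Namespace: pod.Namespace}}
	*pod = v1.Pod{} // in-place clearing of the shared object: this is the problematic part
	return slim
}
```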

@odinuge
Member Author

odinuge commented Mar 1, 2023

cc @aanm

@lavalamp
Member

lavalamp commented Mar 1, 2023

@odinuge I'm not sure I followed that, but let's definitely hold the backports until we're sure there are no regressions!

@liggitt added this to the v1.27 milestone Mar 1, 2023
@liggitt
Member

liggitt commented Mar 1, 2023

adding to 1.27 and marking as a blocker to resolve before release (not because it's necessarily urgent to resolve today, just so it doesn't get lost/forgotten)... it's unclear to me if the change that was already merged is acceptable for 1.27, or if it needs rework to avoid breaking existing delta/informer consumers

@odinuge
Member Author

odinuge commented Mar 1, 2023

Thanks @liggitt! I'm also scratching my head here, because there were already code paths doing the same inside the DeltaFIFO before this patch, so it's possible to "reproduce" the same behavior on e.g. v1.26.0 as well. It's not clear to me if more people are doing this kind of "object cleanup" after popping deltas though, especially since people haven't noticed the previous leak, or the strange behavior when not having a backing store/knownObjects.

Made a quick reproduction test here: odinuge@e15eb7e

By uncommenting the line //*test = testFifoObject{}, one can mimic the behavior in cilium.

Though cilium is using a backing store like other informers, so if one provides knownObjects to the DeltaFIFO and uncomments that line, it will:

  • on v1.26.0: not create a delete event at all.
  • on master: create a delete event with an object that is "empty"/cleared out.

I'll think some more tomorrow

@odinuge
Member Author

odinuge commented Mar 1, 2023

Hmm. Maybe we have to use the latest object in knownObjects to get around this in case people mutate the deltas, and just pop the key from the queue when it's not present in knownObjects and we have some elements in the queue; a basic proof-of-concept for it is here: #116190. Maybe this has other side effects though?

I guess the most important thing we should do is figure out if we support users mutating the objects they pop from the queue, and if so, write some tests verifying that we handle it gracefully.

@lavalamp
Member

lavalamp commented Mar 1, 2023

The tests looked pretty good to me; it should keep the most recent known object, and that's what it seems to do.

Judging from the code you linked, I think cilium should stop mutating the object passed in the notification. The code does NOT expect objects in deltas or the store to be modified out of band. Maybe an easy fix is for them to make a copy before processing the notification?
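A minimal sketch of the "make a copy" suggestion from the consumer side: deep-copy before any mutation so the pointer shared with DeltaFIFO and the informer store is never touched (newCopyingHandler and the process callback are illustrative names, not an existing API):

```go
package sketch

import (
	v1 "k8s.io/api/core/v1"
	"k8s.io/client-go/tools/cache"
)

// newCopyingHandler hands the caller's process func a private DeepCopy of every
// object, so downstream code can mutate freely without corrupting the shared cache.
func newCopyingHandler(process func(*v1.Pod)) cache.ResourceEventHandlerFuncs {
	return cache.ResourceEventHandlerFuncs{
		AddFunc: func(obj interface{}) {
			process(obj.(*v1.Pod).DeepCopy())
		},
		UpdateFunc: func(_, newObj interface{}) {
			process(newObj.(*v1.Pod).DeepCopy())
		},
		DeleteFunc: func(obj interface{}) {
			// deletes may arrive wrapped in a DeletedFinalStateUnknown tombstone
			if ts, ok := obj.(cache.DeletedFinalStateUnknown); ok {
				obj = ts.Obj
			}
			if pod, ok := obj.(*v1.Pod); ok {
				process(pod.DeepCopy())
			}
		},
	}
}
```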

@liggitt
Member

liggitt commented Mar 1, 2023

have we swept in-tree code to make sure we don't have any similar issues?

@lavalamp
Member

lavalamp commented Mar 1, 2023

I think the object in the Deltas{} gets directly placed in the store, and we use a mutation cache in tests to detect that it doesn't get modified, so I'm at least 30% sure that we don't do this in-tree

@dilyevsky

Also hitting this with Calico controllers on very large clusters. Is anyone aware of a workaround while this is getting merged? Enable resync? Empty queue faster?

@lavalamp
Member

lavalamp commented Mar 7, 2023

The current state as I understand it is that the k8s go client library change fixes the bug, but cilium is mutating objects in place and needs to fix that before they'll be able to use the new library. @odinuge is that correct?

@dilyevsky I think calico should re-compile with the current client. The condition for hitting the problem should be very rare but if your environment is triggering it I think it'll be hard to tweak the environment to make it stop. E.g., you probably can't make people stop deleting and recreating the same object.

@odinuge
Member Author

odinuge commented Mar 8, 2023

The current state as I understand it is that the k8s go client library change fixes the bug, but cilium is mutating objects in place and needs to fix that before they'll be able to use the new library. @odinuge is that correct?

Correct.

I have now tested #115620 + cilium/cilium#24241 on our internal cilium v1.10 version, and it does fix the problem described in this issue. After deploying it I haven't seen a single leak the way we did previously. We did see it hit a handful of clients on the biggest cluster each day before the patches.

@dilyevsky I think calico should re-compile with the current client. The condition for hitting the problem should be very rare but if your environment is triggering it I think it'll be hard to tweak the environment to make it stop. E.g., you probably can't make people stop deleting and recreating the same object.

Yes, compiling with the latest client should fix the problem, and afaik it's the only way to properly fix it right now. In theory, if an object is leaked, it can only be properly removed by a full relist, or by someone creating and deleting an object with the same key (so namespace + name for namespaced resources).

@lavalamp
Member

lavalamp commented Mar 8, 2023

Thanks for the update @odinuge! Shall we continue with the backports, then?

@odinuge
Member Author

odinuge commented Mar 9, 2023

Shall we continue with the backports, then?

I'm kind of debating the correct path here myself, and I think I'll defer it to you SIG API Machinery owners. If there are other clients like cilium clearing the object, it could now lead to a regression, since we now prefer returning the latest element from the queue instead of the store. Any thoughts?

There is also the situation where the informer objType is X, and the transformer transforms it to Y for the store. Previously (for informers where KnownObjects == nil) we only returned DeletedFinalStateUnknown tombstones with objects of type Y from the store; now we can also return objects of type X. If client code doesn't handle that, it could result in strange things, I guess.

I think the code in master is the correct path long term; I'm just unsure about how to inform users to make sure their controllers deal with these things correctly.

Also, it looks like errors returned from those transformFuncs are swallowed here, so unless users log them before returning, they just silently disappear (unless I'm missing something): https://github.com/kubernetes/kubernetes/blob/v1.26.0/staging/src/k8s.io/client-go/tools/cache/controller.go#L184

Also found this test that has a transformFunc that doesn't handle tombstones, so I updated it to be more correct: #116432 (I'll take another peek at it tomorrow).

cc @Verolop ref. #115899 (comment)

@lavalamp
Member

lavalamp commented Mar 9, 2023

we now prefer returning the latest element from the queue instead of the store. Any thoughts?

Yes, the technical term for this is "bug fix" :) The previous behavior was incorrect, it was not showing users the final known state of the object even though it claimed to.

the transformer transforms it to Y for the store

This does seem like something we missed. Maybe we need to make https://github.com/kubernetes/kubernetes/blob/v1.26.0/staging/src/k8s.io/client-go/tools/cache/controller.go#L419 dig inside tombstone objects?
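A sketch of what "dig inside tombstone objects" could look like (not the fix that was eventually merged; tombstoneAwareTransform is an invented name): wrap the user-supplied TransformFunc so DeletedFinalStateUnknown tombstones are unwrapped, transformed, and re-wrapped with the original key:

```go
package sketch

import "k8s.io/client-go/tools/cache"

// tombstoneAwareTransform applies the inner transform to the object carried inside
// a DeletedFinalStateUnknown tombstone instead of to the tombstone wrapper itself.
func tombstoneAwareTransform(inner cache.TransformFunc) cache.TransformFunc {
	return func(obj interface{}) (interface{}, error) {
		if ts, ok := obj.(cache.DeletedFinalStateUnknown); ok {
			out, err := inner(ts.Obj)
			if err != nil {
				return nil, err
			}
			return cache.DeletedFinalStateUnknown{Key: ts.Key, Obj: out}, nil
		}
		return inner(obj)
	}
}
```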

@wojtek-t
Member

This does seem like something we missed. Maybe we need to make https://github.com/kubernetes/kubernetes/blob/v1.26.0/staging/src/k8s.io/client-go/tools/cache/controller.go#L419 dig inside tombstone objects?

+1 - let's fix that for everyone

@odinuge
Member Author

odinuge commented Mar 10, 2023

Yes, the technical term for this is "bug fix" :) The previous behavior was incorrect, it was not showing users the final known state of the object even though it claimed to.

That's a good point, thanks! That makes sense, and I agree that cherry-picking it should be good to go. The possible races when mutating (ref. the discussion in #116432) are there independent of the backports anyway.

I have to head out of the office for the weekend now, so if you want to land it in the next patch release, it would be nice if you could remove the /hold on the backport PRs, @lavalamp. Thanks!

This does seem like something we missed. Maybe we need to make https://github.com/kubernetes/kubernetes/blob/v1.26.0/staging/src/k8s.io/client-go/tools/cache/controller.go#L419 dig inside tombstone objects?

Yeah, that's a very good point! Do you want to prototype a possible solution there?

@lavalamp
Member

I think we need to hold the backports until we make sure we're calling transform functions exactly once on everything. I'm not sure if I will have time to make a PR to fix this, but maybe.

@liggitt
Member

liggitt commented Mar 13, 2023

programming note: this is still marked as a release-blocker since the behavior changed at HEAD for 1.27 but it sounds like there's work remaining to make the changed behavior as safe as possible for consumers... are the assignees for this issue correct to ensure that work is done before 1.27 cuts?

@neoaggelos

Hi @liggitt @lavalamp, bug triage team for 1.27 here. I see there is no activity on this issue since last week, but it's still marked as a release blocker. Do you believe this will be dealt with appropriately in time for the 1.27 release?

@liggitt
Member

liggitt commented Mar 20, 2023

my understanding is that #116623 resolved this issue for 1.27, but I'll let @lavalamp confirm and close this

@lavalamp
Member

Yes that's correct, sorry I didn't get the "fixed" metadata right. This is done, backports are free to proceed, the release is unblocked.

@odinuge
Member Author

odinuge commented Mar 20, 2023

Yes, thanks! I'll double-check the backports today or tomorrow so that they are all good to go for the next patch release.
