client-go/cache: informer misses delete events on relist for elements only present in the queue #115658
I did an additional implementation while experimenting in #115620. There is more info in the PR description, but the tl;dr is: it will take the union of the objects in the store and the queue, subtract the elements from the relist call, and create delete deltas for the difference. Another approach could be popping the object out of the queue inside of `Replace`.
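For illustration, the union-and-subtract idea could look roughly like this inside the `DeltaFIFO` (a simplified sketch, not the actual code merged in #115620; it assumes the internal `items`, `knownObjects`, and `queueActionLocked` names from delta_fifo.go, and `sets` is k8s.io/apimachinery/pkg/util/sets):

```go
// Sketch: emit delete deltas for every key that is known to the store or
// pending in the queue but absent from the relisted set of keys.
func (f *DeltaFIFO) syncDeletesSketch(listKeys sets.String) error {
	// Union of keys in the backing store and keys pending in the queue.
	union := sets.NewString(f.knownObjects.ListKeys()...)
	for key := range f.items {
		union.Insert(key)
	}
	// Anything in the union but missing from the new list was deleted
	// while we were not watching; queue a tombstone for it.
	for key := range union.Difference(listKeys) {
		deletedObj, exists, err := f.knownObjects.GetByKey(key)
		if err != nil || !exists {
			deletedObj = nil // last known state unavailable
		}
		if err := f.queueActionLocked(Deleted, DeletedFinalStateUnknown{
			Key: key,
			Obj: deletedObj,
		}); err != nil {
			return err
		}
	}
	return nil
}
```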
Is O present in the informer cache? You said it is not present in the list L.
Yes, maybe that was a bit ambiguous, and my terminology around the internals here is not 100% on point, so sorry for that. The "store" we have metrics for is a separate "store" populated from the informer's event handlers, if that makes sense.
The informer cache represents the "current state"; the handlers are used to fill a workqueue, to implement this "controller" model: https://github.com/kubernetes/community/blob/master/contributors/devel/sig-api-machinery/controllers.md. So you queue the events and you reconcile, syncing from the workqueue. In this case I expect that you may queue O due to the stuck event, but it should not be present in the informer cache, so the reconciliation will work correctly. Anyway, I defer to apimachinery people to expand on this or correct me if necessary.
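As a minimal sketch of that model (the standard client-go pattern, not code from this issue; the `wireHandlers` name and wiring are illustrative):

```go
package controller

import (
	"k8s.io/client-go/tools/cache"
	"k8s.io/client-go/util/workqueue"
)

// Handlers only enqueue keys; workers reconcile by reading the current
// state back from the informer cache via the key.
func wireHandlers(informer cache.SharedIndexInformer, queue workqueue.RateLimitingInterface) {
	informer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		AddFunc: func(obj interface{}) {
			if key, err := cache.MetaNamespaceKeyFunc(obj); err == nil {
				queue.Add(key)
			}
		},
		UpdateFunc: func(_, newObj interface{}) {
			if key, err := cache.MetaNamespaceKeyFunc(newObj); err == nil {
				queue.Add(key)
			}
		},
		DeleteFunc: func(obj interface{}) {
			// Unwraps DeletedFinalStateUnknown tombstones too.
			if key, err := cache.DeletionHandlingMetaNamespaceKeyFunc(obj); err == nil {
				queue.Add(key)
			}
		},
	})
}
```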
/triage accepted
This sounds like it is going wrong in the DeltaFIFO before it even gets to the application handler (and workqueue, if any). Specifically, we possibly don't add the object to the cache until the create message is processed. Thanks for the report.
That is the key thing @lavalamp: he says the object is not in the List after the relist; it is the event that got stuck that goes to the handler... I think it should not be in the informer cache, no?
Yeah. The bug is that this code: https://github.com/kubernetes/kubernetes/blob/master/staging/src/k8s.io/client-go/tools/cache/delta_fifo.go#L634 doesn't check pending changes: https://github.com/kubernetes/kubernetes/blob/master/staging/src/k8s.io/client-go/tools/cache/delta_fifo.go#L104
ah, thanks, I didn't see that -- left a comment.
Thanks for the report - I agree with @lavalamp's pointer (and clearly it's also what you found).
To spell it out a little more: the bug is that the behavior of `Replace` is driven only by the listed objects and the known objects in the store, ignoring deltas still pending in the queue, so an object present only in the queue never gets a delete delta.
Fixed in #115620. The fix will be part of k8s v1.27. We are also planning to cherry-pick it to older releases.
Hmm. We originally saw this issue with cilium, and the patch in #115620 seems to now correctly generate delete events. However, it seems like they clear the original object data [0] from the delta when it's popped [1] from the queue and added to the store. This results in the delete events carrying the cleared object instead of the last full state. Right now I'm trying a cilium version with the "clearing" mechanism [0] removed, to verify that things then behave as expected. Any thoughts here @lavalamp @MikeSpreitzer @wojtek-t? I guess this behavior change is fine for v1.27 (or??), given we add proper warning messages, but what about the backports? I'll put the backports on hold for now.

[0]: https://github.com/cilium/cilium/blob/77c7d6d36cbc3cd16762ff093be575aa3d2ceed0/pkg/k8s/factory_functions.go#L622
cc @aanm
@odinuge I'm not sure I followed that, but let's definitely hold the backports until we're sure there are no regressions!
adding to 1.27 and marking as a blocker to resolve before release (not because it's necessarily urgent to resolve today, just so it doesn't get lost/forgotten)... it's unclear to me if the change that was already merged is acceptable for 1.27, or if it needs rework to avoid breaking existing delta/informer consumers |
Thanks @liggitt! I'm also scratching my head here, because there were already code paths doing the same inside the `DeltaFIFO`. Made a quick reproduction test here: odinuge@e15eb7e. By uncommenting the line ... Though cilium is using a backing store like other informers, so if one provides `KnownObjects` ...
I'll think some more tomorrow
Hmm. Maybe we have to use the latest object in ... I guess the most important thing we should do is figure out if we support users mutating the objects they pop from the queue, and if so, write some tests verifying that we handle it gracefully.
The tests looked pretty good to me: it should keep the most recent known object, and that's what it seems to do. Judging from the code you linked, I think cilium should stop mutating the object passed in the notification. The code does NOT expect objects in deltas or the store to be modified out of band. Maybe an easy fix is for them to make a copy before processing the notification?
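To illustrate the copy-before-mutate fix on the consumer side (a hedged sketch; `trimPod` stands in for cilium-style field clearing and is not their actual code):

```go
package controller

import v1 "k8s.io/api/core/v1"

// trimPod mimics memory-saving mutations such as clearing managed fields.
// It must operate on a private copy: objects delivered by an informer are
// shared with the store and with pending deltas in the DeltaFIFO.
func trimPod(obj interface{}) *v1.Pod {
	pod, ok := obj.(*v1.Pod)
	if !ok {
		return nil
	}
	scratch := pod.DeepCopy() // never mutate the shared object in place
	scratch.ManagedFields = nil
	return scratch
}
```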
have we swept in-tree code to make sure we don't have any similar issues?
I think the object in the Deltas{} gets directly placed in the store, and we use a mutation cache in tests to detect that it doesn't get modified, so I'm at least 30% sure that we don't do this in-tree.
Also hitting this with Calico controllers on very large clusters. Is anyone aware of a workaround while this is getting merged? Enable resync? Empty the queue faster?
The current state as I understand it is that the k8s go client library change fixes the bug, but cilium is mutating objects in place and needs to fix that before they'll be able to use the new library. @odinuge is that correct? @dilyevsky I think calico should recompile with the current client. The condition for hitting the problem should be very rare, but if your environment is triggering it I think it'll be hard to tweak the environment to make it stop. E.g., you probably can't make people stop deleting and recreating the same object.
Correct. I have now tested #115620 + cilium/cilium#24241 on our internal cilium v1.10 version, and it does fix the problem described in this issue. After deploying it I haven't seen a single leak the way we did previously. We did see it hit a handful of clients on the biggest cluster each day before the patches.
Yes, compiling with the latest client should fix the problem, and afaik it's the only way to properly fix it right now. In theory, if an object is leaked, it can only be properly removed by a full relist, or by someone creating and deleting an object with the same key (i.e. the same namespace and name).
Thanks for the update @odinuge! Shall we continue with the backports, then?
I'm kinda debating the correct path here myself, and I think I'll defer it to you sig api-machinery owners. If there are other clients like cilium clearing the object, it could now lead to a regression, since we now prefer returning the latest element from the queue instead of the store. Any thoughts? There is also the situation where the informer ... I think the code in master is the correct path long term, I'm just unsure about how to inform users to make sure their controllers deal with these things correctly. Also, it looks like errors returned from those ... Also found this test that has a ...

cc @Verolop ref. #115899 (comment)
Yes, the technical term for this is "bug fix" :) The previous behavior was incorrect; it was not showing users the final known state of the object even though it claimed to.
This does seem like something we missed. Maybe we need to make https://github.com/kubernetes/kubernetes/blob/v1.26.0/staging/src/k8s.io/client-go/tools/cache/controller.go#L419 dig inside tombstone objects?
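A sketch of what "digging inside tombstones" could look like (a hypothetical `applyTransform` helper, not the merged fix): unwrap the tombstone, transform the inner object, and re-wrap, so every object passes through the transform function exactly once and in the same shape.

```go
package controller

import "k8s.io/client-go/tools/cache"

// applyTransform digs inside DeletedFinalStateUnknown tombstones so the
// transform function always sees the real object, never the wrapper.
func applyTransform(transform cache.TransformFunc, obj interface{}) (interface{}, error) {
	if tombstone, ok := obj.(cache.DeletedFinalStateUnknown); ok {
		inner, err := transform(tombstone.Obj)
		if err != nil {
			return nil, err
		}
		return cache.DeletedFinalStateUnknown{Key: tombstone.Key, Obj: inner}, nil
	}
	return transform(obj)
}
```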
+1 - let's fix that for everyone
That's a good point, thanks! That makes sense, and I agree, cherry-picking it should be good to go. The possible races when mutating (ref. the discussion in #116432) are there independent of the backports anyway. I have to head out of the office for the weekend now, so if you want to land it in the next patch release, it would be nice if you could remove the hold.
Yeah, that's a very good point! Do you want to prototype a possible solution there?
I think we need to hold the backports until we make sure we're calling transform functions exactly once on everything. I'm not sure if I will have time to make a PR to fix this, but maybe.
programming note: this is still marked as a release-blocker since the behavior changed at HEAD for 1.27 but it sounds like there's work remaining to make the changed behavior as safe as possible for consumers... are the assignees for this issue correct to ensure that work is done before 1.27 cuts?
Yes that's correct, sorry I didn't get the "fixed" metadata right. This is done, backports are free to proceed, the release is unblocked.
Yes, thanks! I'll double check backports today or tomorrow so that they are all good to go for the next patch release. |
What happened?
For quite a while we have seen that some controllers/informers suddenly get out of sync with the state of the world, and keep `x` stale objects around (a different number every time). After more investigation, we think we narrowed it down to the informer having a queue of events while a relist occurs; for some objects, we never see the `DELETED` event. This leads to a leak of some sort, and the informer is stuck in a bad state. The informers in question have resyncs disabled, so this only goes back to normal after a new relist. It should be noted that the queue at the point of the relist had 1000+ elements, so the controllers are clearly in a state where they are lagging very far behind.

What did you expect to happen?
Overall we expect that after a reflector relist, no objects should be stuck in a dangling state where they are added to the informer store but never cleared, even when they are no longer present in k8s.
How can we reproduce it (as minimally and precisely as possible)?
A rough timeline of what we think happens:

1. The controller lags far behind, so the informer's `DeltaFIFO` accumulates a large queue of unprocessed events.
2. An object is created; its `ADDED` delta is queued, but it has not yet been popped into the informer store.
3. The object is deleted, and the reflector relists (e.g. after a watch error), so the object is absent from the new list.
4. `DeltaFIFO.Replace` only diffs the list against the known objects in the store, so it never generates a `DELETED` delta for the object that exists only in the queue.
5. The queued `ADDED` delta is eventually popped, the object lands in the store, and it stays there until the next full relist.
I've been trying to write an informer/controller test to reproduce this, but it's kinda tricky, since it depends on the order in which the `DeltaFIFO.Replace` and `DeltaFIFO.Pop` functions are able to acquire the `DeltaFIFO` lock.
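For what it's worth, driving a `DeltaFIFO` directly sidesteps the lock-ordering problem; a rough sketch of that shape (assumes the pre-v1.27 `Pop` callback signature; on unpatched client-go this prints only `Added`, while fixed versions also deliver a `Deleted` delta):

```go
package main

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/tools/cache"
)

func main() {
	// knownObjects plays the role of the informer store; it stays empty,
	// mimicking an object whose ADDED delta has not been popped yet.
	store := cache.NewStore(cache.MetaNamespaceKeyFunc)
	f := cache.NewDeltaFIFOWithOptions(cache.DeltaFIFOOptions{
		KeyFunction:  cache.MetaNamespaceKeyFunc,
		KnownObjects: store,
	})

	pod := &v1.Pod{ObjectMeta: metav1.ObjectMeta{Name: "o", Namespace: "ns"}}
	f.Add(pod) // ADDED delta is queued, object not yet in the store

	// Relist while the delta is still queued; the object was deleted
	// server-side, so it is absent from the new list.
	f.Replace([]interface{}{}, "1")

	// Buggy versions deliver only [Added]; fixed versions also deliver
	// a Deleted delta wrapping DeletedFinalStateUnknown.
	f.Pop(func(obj interface{}) error {
		for _, d := range obj.(cache.Deltas) {
			fmt.Println(d.Type)
		}
		return nil
	})
}
```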
Anything else we need to know?
Given this is indeed the case, I assume this would affect all users of informers, and most heavily those with the resync period set to `0`, aka disabled. After some digging, I think the opposite is true: since resyncs won't refetch the list from k8s, they will just put all the cached objects back on the queue, slowing down processing even more and possibly causing even more leaks.

Kubernetes version
Cloud provider
OS version
Install tools
Container runtime (CRI) and version (if applicable)
Related plugins (CNI, CSI, ...) and versions (if applicable)