
client-go/cache: informer misses delete events on relist for elements only present in the queue #115658

Closed
odinuge opened this issue Feb 9, 2023 · 37 comments
Labels
kind/bug · release-blocker · sig/api-machinery · triage/accepted

Comments

@odinuge
Member

odinuge commented Feb 9, 2023

What happened?

For quite a while we have seen that some controllers/informers suddenly get out of sync with the state of the world, and keep x stale objects around (a different number every time).

After more investigation, we think we narrowed it down to the informer having a queue of events while a relist occurs; for some objects, we never see the DELETED event. This leads to a leak of some sort, and the informer is stuck in a bad state. The informers in question have resyncs disabled, so this only goes back to normal after a new relist. It should be noted that the queue at the point of the relist has had 1000+ elements, so the controllers are clearly lagging very far behind.

What did you expect to happen?

Overall we expect that after a reflector relist, no objects should be stuck in a dangling state where they are added to the informer store but never cleared, even though they are no longer present in k8s.

How can we reproduce it (as minimally and precisely as possible)?

A rough timeline about what we think happens;

1. A new object O is added; the agent gets a CREATED event for it
2. The watch is terminated, and the agent runs a new list, L
3. The CREATE event for O is still on the queue to be processed
4. The informer replaces the old data in the store with L, and O is not in L
  - Since O is not in the store and not in the list L, no DELETED event is queued
5. The CREATE event for O remains on the queue, still unprocessed
6. The CREATE event for O is processed
7. O is <leaked>; it's present in the agent's external "store" (and I assume also the internal informer store) but not in k8s

I've been trying to write an informer/controller test to reproduce this, but it's a bit tricky since it depends on the order in which the DeltaFIFO.Replace and DeltaFIFO.Pop functions acquire the DeltaFIFO lock.
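As a rough illustration of the timeline above, here is a minimal sketch (not the eventual upstream test) that drives a DeltaFIFO directly, assuming client-go v0.26 signatures; the namespace/name "ns"/"pod-o" and the empty relist are just illustrative:

```go
package main

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/tools/cache"
)

func main() {
	// knownObjects plays the role of the informer's local store; O is not in it yet
	// because its Added delta has not been popped.
	knownObjects := cache.NewStore(cache.MetaNamespaceKeyFunc)

	f := cache.NewDeltaFIFOWithOptions(cache.DeltaFIFOOptions{
		KeyFunction:  cache.MetaNamespaceKeyFunc,
		KnownObjects: knownObjects,
	})

	o := &v1.Pod{ObjectMeta: metav1.ObjectMeta{Namespace: "ns", Name: "pod-o"}}

	// Step 1: O's CREATE event arrives and is queued, but not yet processed.
	f.Add(o)

	// Steps 2-4: the watch breaks and a relist happens; O has been deleted, so the
	// new list L does not contain it. Pre-fix, Replace only diffed L against
	// knownObjects, so no DELETED delta was queued for O.
	f.Replace([]interface{}{}, "2")

	// Steps 5-7: the queued Added delta is eventually popped and handed to the
	// informer, which adds O to its store even though O no longer exists in k8s.
	f.Pop(func(obj interface{}) error {
		for _, d := range obj.(cache.Deltas) {
			fmt.Printf("delta: %s %T\n", d.Type, d.Object)
		}
		return nil
	})
}
```

If I read the fix right, a pre-fix client pops only the Added delta and O is never deleted, while a fixed DeltaFIFO also consults the queued keys during Replace and the same Pop additionally delivers a Deleted delta carrying a DeletedFinalStateUnknown tombstone.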

Anything else we need to know?

Given this is indeed the case, I assume it would affect all users of informers, and most heavily those with the resync period set to 0, i.e. disabled. After some digging, I think the opposite is true: since resyncs don't refetch the list from k8s, they just put all the cached objects back on the queue, slowing down processing even more and possibly causing even more leaks.

Kubernetes version

$ kubectl version
# paste output here

Cloud provider

OS version

# On Linux:
$ cat /etc/os-release
# paste output here
$ uname -a
# paste output here

# On Windows:
C:\> wmic os get Caption, Version, BuildNumber, OSArchitecture
# paste output here

Install tools

Container runtime (CRI) and version (if applicable)

Related plugins (CNI, CSI, ...) and versions (if applicable)

@odinuge added the kind/bug label Feb 9, 2023
@k8s-ci-robot added the needs-sig and needs-triage labels Feb 9, 2023
@odinuge
Member Author

odinuge commented Feb 9, 2023

I did an additional implementation while experimenting in #115620. There is more info in the PR description, but the tl;dr is that it takes the union of the objects in the store and the queue, subtracts the elements returned by the relist call, and creates DELETE events for the resulting set. Previously, it used only the store and subtracted the elements from the relist.
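Not the merged patch itself, just a sketch of the key-set arithmetic described above (keysToDelete and the parameter names are mine):

```go
package deltasketch

import "k8s.io/apimachinery/pkg/util/sets"

// keysToDelete sketches the deletion detection in DeltaFIFO.Replace: everything we
// know about (store ∪ queue) that is absent from the relist result gets a synthetic
// DELETE event. Before the fix, only storeKeys was consulted.
func keysToDelete(storeKeys, queuedKeys, relistKeys []string) []string {
	known := sets.NewString(storeKeys...)
	known.Insert(queuedKeys...) // this is the part the fix adds
	return known.Difference(sets.NewString(relistKeys...)).List()
}
```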

Another approach could be popping the object out of the queue inside of .Replace instead, so that the controller and store never get the object. I'm not too sure about the implications of that, so happy to hear what others think.

@odinuge
Member Author

odinuge commented Feb 9, 2023

/sig api-machinery
/cc @wojtek-t @aojea

@k8s-ci-robot added the sig/api-machinery label and removed the needs-sig label Feb 9, 2023
@aojea
Member

aojea commented Feb 9, 2023

6. CREATE event for O is processed
7. O is <leaked>; it's present in the agent but not in k8s.

is O present in the informer cache?

you said it is not present in the list L

@odinuge
Member Author

odinuge commented Feb 9, 2023

is O present in the informer cache?

you said it is not present in the list L

Yeah, maybe that was a bit ambiguous, and my terminology around the internals here is not 100% on point, so sorry for that.

The "store" we have metrics for is a separate "store" populated from ResourceEventHandlerFuncs, and not the underlying store in the informer itself. However, I would assume that since the code updates the internal store then runs the handler function, they should be the same; https://github.com/kubernetes/kubernetes/blob/v1.26.0/staging/src/k8s.io/client-go/tools/cache/controller.go#L442-L445

If that makes sense?

And by L I mean the result from the list call from the apiserver after the watch was terminated.

@aojea
Member

aojea commented Feb 9, 2023

the informer cache represents the "current state"; the handlers are used to fill a workqueue, to implement this "controller" model: https://github.com/kubernetes/community/blob/master/contributors/devel/sig-api-machinery/controllers.md

so you queue the events and you reconcile by syncing from the workqueue. In this case I expect that you may queue O due to the stuck event, but it should not be present in the informer cache, so the reconciliation will work correctly

Anyway, I defer to apimachinery people to expand on this or correct me if necessary

/assign @lavalamp @deads2k

@cici37
Contributor

cici37 commented Feb 9, 2023

/triage accepted
/cc @MikeSpreitzer

@k8s-ci-robot added the triage/accepted label and removed the needs-triage label Feb 9, 2023
@lavalamp
Member

lavalamp commented Feb 9, 2023

This sounds like it is going wrong in the DeltaFIFO before it even gets to the application handler (and workqueue, if any).

Specifically we possibly don't add the object to the cache until the create message is processed.

Thanks for the report.

@aojea
Member

aojea commented Feb 9, 2023

Specifically we possibly don't add the object to the cache until the create message is processed.

That is the key thing, @lavalamp: he says the object is not in the list after the relist; it is the event that got stuck and goes to the handler... I think it should not be in the informer cache, no?

@lavalamp
Member

lavalamp commented Feb 9, 2023

@odinuge
Member Author

odinuge commented Feb 9, 2023

Yup @lavalamp, that was my conclusion as well ref. #115620.

@lavalamp
Member

lavalamp commented Feb 9, 2023

ah, thanks, I didn't see that -- left a comment.

@wojtek-t
Member

wojtek-t commented Feb 9, 2023

Thanks for the report - I agree with @lavalamp's pointer (and clearly it's also what you found).

@MikeSpreitzer
Member

MikeSpreitzer commented Feb 13, 2023

To spell it out a little more: the bug is that func (f *DeltaFIFO) Replace(list ...) left an entry in f.queue and f.items if it was there beforehand and absent from both list and f.knownObjects. This is consistent with the comments on DeltaFIFO::Replace but is not the behavior that suits the clients (informers); it causes this ghost item to later (when the DeltaFIFO's Pop method is called) get added to the informer's local cache and delivered in a notification to the informer's client(s).

@odinuge
Member Author

odinuge commented Feb 17, 2023

Fixed in #115620. The fix will be part of k8s v1.27. We are also planning to cherry-pick it to older releases.

@odinuge closed this as completed Feb 17, 2023
@odinuge reopened this Mar 1, 2023
@odinuge
Member Author

odinuge commented Mar 1, 2023

Hmm. We originally saw this issue with cilium, and the patch in #115620 now seems to correctly generate delete events. However, it seems like they clear the original object data [0] from the delta when it's popped [1] from the queue and added to the store. As a result, the delete events now created in Replace point to an "empty" object when the delete delta is popped. This was news to me and looks very strange, but it seems like cilium is doing it, so maybe others are too?

Right now I'm trying a cilium version with the "clearing" mechanism [0] removed, to verify that things then behave as expected.

Any thoughts here @lavalamp @MikeSpreitzer @wojtek-t? I guess this behavior change is fine for v1.27 (or?), given we add proper warning messages, but what about the backports? I'll put a /hold on the backports until we decide on a go or no-go here. Thanks!

[0]: https://github.com/cilium/cilium/blob/77c7d6d36cbc3cd16762ff093be575aa3d2ceed0/pkg/k8s/factory_functions.go#L622
[1]: https://github.com/cilium/cilium/blob/77c7d6d36cbc3cd16762ff093be575aa3d2ceed0/pkg/k8s/informer/informer.go#L103
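To make the "clearing" pattern concrete, here is a hypothetical illustration (not cilium's actual code; slimDownPod is an invented name) of a conversion step that wipes the object it was handed:

```go
package sketch

import (
	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// slimDownPod builds a slimmed-down copy and then clears the original in place.
// DeltaFIFO, the informer's store and any tombstones can all hold the same pointer,
// so the wiped object is what a later Deleted delta ends up carrying.
func slimDownPod(obj interface{}) interface{} {
	pod := obj.(*v1.Pod)
	slim := &v1.Pod{ObjectMeta: metav1.ObjectMeta{Name: pod.Name, Namespace: pod.Namespace}}
	*pod = v1.Pod{} // in-place clearing of the shared object: this is the problematic part
	return slim
}
```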

@odinuge
Member Author

odinuge commented Mar 1, 2023

cc @aanm

@lavalamp
Member

lavalamp commented Mar 1, 2023

@odinuge I'm not sure I followed that, but let's definitely hold the backports until we're sure there are no regressions!

@liggitt added this to the v1.27 milestone Mar 1, 2023
@liggitt
Member

liggitt commented Mar 1, 2023

adding to 1.27 and marking as a blocker to resolve before release (not because it's necessarily urgent to resolve today, just so it doesn't get lost/forgotten)... it's unclear to me if the change that was already merged is acceptable for 1.27, or if it needs rework to avoid breaking existing delta/informer consumers

@odinuge
Member Author

odinuge commented Mar 1, 2023

Thanks @liggitt! I'm also scratching my head here, because there were already code paths doing the same inside the DeltaFIFO before this patch, so it's possible to "reproduce" the same behavior on e.g. v1.26.0 as well. It's not clear to me if more people are doing this kind of "object cleanup" after popping deltas though, especially since people haven't noticed the previous leak, or the strange behavior when not having a backing store/knownObjects.

Made a quick reproduction test here: odinuge@e15eb7e

By uncommenting the line //*test = testFifoObject{}, one can mimic the behavior in cilium.

Though cilium is using a backing store like other informers, so if one provides knownObjects to the DeltaFIFO and uncomments that line, it will:

  • on v1.26.0: not create a delete event at all.
  • on master: create a delete event with an object that is "empty"/cleared out.

I'll think some more tomorrow

@odinuge
Member Author

odinuge commented Mar 1, 2023

Hmm. Maybe we have to use the latest object in knownObjects to get around this in case people mutate the deltas, and just pop the key from the queue when it's not present in knownObjects and we have some elements in the queue; a basic proof-of-concept for it is here: #116190. Maybe this has other side effects though?

I guess the most important thing we should do is figure out if we support users mutating the objects they pop from the queue, and if so, write some tests verifying that we handle it gracefully.

@lavalamp
Member

lavalamp commented Mar 1, 2023

The tests looked pretty good to me; it should keep the most recent known object, and that's what it seems to do.

Judging from the code you linked, I think cilium should stop mutating the object passed in the notification. The code does NOT expect objects in deltas or the store to be modified out of band. Maybe an easy fix is for them to make a copy before processing the notification?
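A minimal sketch of the "make a copy" suggestion from the consumer side: deep-copy before any mutation so the pointer shared with DeltaFIFO and the informer store is never touched (newCopyingHandler and the process callback are illustrative names, not an existing API):

```go
package sketch

import (
	v1 "k8s.io/api/core/v1"
	"k8s.io/client-go/tools/cache"
)

// newCopyingHandler hands the caller's process func a private DeepCopy of every
// object, so downstream code can mutate freely without corrupting the shared cache.
func newCopyingHandler(process func(*v1.Pod)) cache.ResourceEventHandlerFuncs {
	return cache.ResourceEventHandlerFuncs{
		AddFunc: func(obj interface{}) {
			process(obj.(*v1.Pod).DeepCopy())
		},
		UpdateFunc: func(_, newObj interface{}) {
			process(newObj.(*v1.Pod).DeepCopy())
		},
		DeleteFunc: func(obj interface{}) {
			// deletes may arrive wrapped in a DeletedFinalStateUnknown tombstone
			if ts, ok := obj.(cache.DeletedFinalStateUnknown); ok {
				obj = ts.Obj
			}
			if pod, ok := obj.(*v1.Pod); ok {
				process(pod.DeepCopy())
			}
		},
	}
}
```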

@liggitt
Member

liggitt commented Mar 1, 2023

have we swept in-tree code to make sure we don't have any similar issues?

@lavalamp
Member

lavalamp commented Mar 1, 2023

I think the object in the Deltas{} gets directly placed in the store, and we use a mutation cache in tests to detect that it doesn't get modified, so I'm at least 30% sure that we don't do this in-tree

@dilyevsky

Also hitting this with Calico controllers on very large clusters. Is anyone aware of a workaround while this is getting merged? Enable resync? Empty queue faster?

@lavalamp
Member

lavalamp commented Mar 7, 2023

The current state as I understand it is that the k8s go client library change fixes the bug, but cilium is mutating objects in place and needs to fix that before they'll be able to use the new library. @odinuge is that correct?

@dilyevsky I think calico should re-compile with the current client. The condition for hitting the problem should be very rare but if your environment is triggering it I think it'll be hard to tweak the environment to make it stop. E.g., you probably can't make people stop deleting and recreating the same object.

@odinuge
Member Author

odinuge commented Mar 8, 2023

The current state as I understand it is that the k8s go client library change fixes the bug, but cilium is mutating objects in place and needs to fix that before they'll be able to use the new library. @odinuge is that correct?

Correct.

I have now tested #115620 + cilium/cilium#24241 on our internal cilium v1.10 version, and it does fix the problem described in this issue. After deploying it I haven't seen a single leak the way we did previously. We did see it hit a handful of clients on the biggest cluster each day before the patches.

@dilyevsky I think calico should re-compile with the current client. The condition for hitting the problem should be very rare but if your environment is triggering it I think it'll be hard to tweak the environment to make it stop. E.g., you probably can't make people stop deleting and recreating the same object.

Yes, compiling with the latest client should fix the problem, and afaik it's the only way to properly fix it right now. In theory, if an object is leaked, it can only be properly removed by a full relist, or by someone creating and deleting an object with the same key (so namespace + name for namespaced resources).

@lavalamp
Member

lavalamp commented Mar 8, 2023

Thanks for the update @odinuge! Shall we continue with the backports, then?

@odinuge
Member Author

odinuge commented Mar 9, 2023

Shall we continue with the backports, then?

I'm kind of debating the correct path here myself, and I think I'll defer it to you SIG API Machinery owners. If there are other clients like cilium clearing the object, it could now lead to a regression, since we now prefer returning the latest element from the queue instead of the store. Any thoughts?

There is also the situation where the informer objType is X, and the transformer transforms it to Y for the store. Previously (for informers where KnownObjects == nil) we only returned DeletedFinalStateUnknown tombstones with objects of type Y from the store; now we can also return objects of type X. If client code doesn't handle that, it could result in strange things, I guess.

I think the code in master is the correct path long term; I'm just unsure about how to inform users to make sure their controllers deal with these things correctly.

Also, it looks like errors returned from those transformFuncs are swallowed here, so unless users log them before returning, they just silently disappear (unless I'm missing something): https://github.com/kubernetes/kubernetes/blob/v1.26.0/staging/src/k8s.io/client-go/tools/cache/controller.go#L184

Also found this test that has a transformFunc that doesn't handle tombstones, so I updated it to be more correct: #116432 (I'll take another peek at it tomorrow).

cc @Verolop ref. #115899 (comment)

@lavalamp
Member

lavalamp commented Mar 9, 2023

we now prefer returning the latest element from the queue instead of the store. Any thoughts?

Yes, the technical term for this is "bug fix" :) The previous behavior was incorrect, it was not showing users the final known state of the object even though it claimed to.

the transformer transforms it to Y for the store

This does seem like something we missed. Maybe we need to make https://github.com/kubernetes/kubernetes/blob/v1.26.0/staging/src/k8s.io/client-go/tools/cache/controller.go#L419 dig inside tombstone objects?
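A sketch of what "dig inside tombstone objects" could look like (not the fix that was eventually merged; tombstoneAwareTransform is an invented name): wrap the user-supplied TransformFunc so DeletedFinalStateUnknown tombstones are unwrapped, transformed, and re-wrapped with the original key:

```go
package sketch

import "k8s.io/client-go/tools/cache"

// tombstoneAwareTransform applies the inner transform to the object carried inside
// a DeletedFinalStateUnknown tombstone instead of to the tombstone wrapper itself.
func tombstoneAwareTransform(inner cache.TransformFunc) cache.TransformFunc {
	return func(obj interface{}) (interface{}, error) {
		if ts, ok := obj.(cache.DeletedFinalStateUnknown); ok {
			out, err := inner(ts.Obj)
			if err != nil {
				return nil, err
			}
			return cache.DeletedFinalStateUnknown{Key: ts.Key, Obj: out}, nil
		}
		return inner(obj)
	}
}
```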

@wojtek-t
Member

This does seem like something we missed. Maybe we need to make https://github.com/kubernetes/kubernetes/blob/v1.26.0/staging/src/k8s.io/client-go/tools/cache/controller.go#L419 dig inside tombstone objects?

+1 - let's fix that for everyone

@odinuge
Member Author

odinuge commented Mar 10, 2023

Yes, the technical term for this is "bug fix" :) The previous behavior was incorrect, it was not showing users the final known state of the object even though it claimed to.

That's a good point, thanks! That makes sense, and I agree that cherry-picking it should be good to go. The possible races when mutating (ref. the discussion in #116432) are there independent of the backports anyway.

I have to head out of the office for the weekend now, so if you want to land it in the next patch release, it would be nice if you could remove the /hold on the backport PRs, @lavalamp. Thanks!

This does seem like something we missed. Maybe we need to make https://github.com/kubernetes/kubernetes/blob/v1.26.0/staging/src/k8s.io/client-go/tools/cache/controller.go#L419 dig inside tombstone objects?

Yeah, that's a very good point! Do you want to prototype a possible solution there?

@lavalamp
Member

I think we need to hold the backports until we make sure we're calling transform functions exactly once on everything. I'm not sure if I will have time to make a PR to fix this, but maybe.

@liggitt
Member

liggitt commented Mar 13, 2023

programming note: this is still marked as a release-blocker since the behavior changed at HEAD for 1.27 but it sounds like there's work remaining to make the changed behavior as safe as possible for consumers... are the assignees for this issue correct to ensure that work is done before 1.27 cuts?

@neoaggelos

Hi @liggitt @lavalamp, bug triage team for 1.27 here. I see there is no activity on this issue since last week, but it's still marked as a release blocker. Do you believe this will be dealt with appropriately in time for the 1.27 release?

@liggitt
Member

liggitt commented Mar 20, 2023

my understanding is that #116623 resolved this issue for 1.27, but I'll let @lavalamp confirm and close this

@lavalamp
Member

Yes that's correct, sorry I didn't get the "fixed" metadata right. This is done, backports are free to proceed, the release is unblocked.

@odinuge
Member Author

odinuge commented Mar 20, 2023

Yes, thanks! I'll double-check the backports today or tomorrow so that they are all good to go for the next patch release.
