
Informer frequent desyncs on slow-moving resources in <= 1.15 #219

Closed
clux opened this issue Apr 6, 2020 · 12 comments · Fixed by #445
Labels
runtime (controller runtime related), wontfix (This will not be worked on)

Comments

clux (Member) commented Apr 6, 2020

We need to revisit the choice made in #134, in particular the one discussed here:

There are two different options when desyncs happen:

  1. Do a LIST to get the latest resourceVersion, losing all events between the desync and the LIST
  2. Always set resourceVersion to "0", generating WatchEvent::Added events for all resources that already exist

We went with option 2 back then, and it generally works fine... but...

...if we have an informer on a namespace with particularly slow-moving resources, then option 2 can cause us to reset the resourceVersion to 0 on every poll after the first one (because the last returned resourceVersion is already too old for a new watch call). (An example will link to this shortly.)

Ran into this in #218 while trying to see if we could just use the Informer's logic to drive a Reflector to simplify some code. The list call + watch there is what saves the current Reflector from ever encountering this. Unfortunately, Informer as it stands is effectively unusable in certain pathological cases.

Maybe we should consider having Informer do a list call after all. It's possible to bunch up the returned data as WatchEvents to emulate the current "start from zero" behaviour.
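A minimal sketch of the two recovery strategies, written against the kube Api surface of the era (a LIST plus a resourceVersion-based watch); `recover_after_desync` is a hypothetical helper for illustration, not the actual Informer code:

```rust
use k8s_openapi::api::core::v1::Pod;
use kube::api::{Api, ListParams};

// Hypothetical helper illustrating the two desync-recovery strategies.
async fn recover_after_desync(pods: &Api<Pod>, relist: bool) -> Result<String, kube::Error> {
    if relist {
        // Option 1: a LIST returns a collection-level resourceVersion that is
        // current "now", so the follow-up watch will not be immediately expired,
        // at the cost of losing events between the desync and the LIST.
        let list = pods.list(&ListParams::default()).await?;
        Ok(list.metadata.resource_version.unwrap_or_else(|| "0".into()))
    } else {
        // Option 2 (what Informer did): restart from "0" and let the apiserver
        // replay every existing object as WatchEvent::Added. On a slow-moving
        // namespace the last event's resourceVersion may already be too old by
        // the next poll, so this reset can happen on every single iteration.
        Ok("0".to_string())
    }
}
```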

clux added the bug (Something isn't working) and runtime (controller runtime related) labels on Apr 6, 2020
clux added a commit that referenced this issue Apr 6, 2020
clux (Member, Author) commented Apr 6, 2020

Related issue: kubernetes-client/python#819

nightkr (Member) commented Apr 6, 2020

Or, we could implement support for watch bookmarks (https://github.com/kubernetes/enhancements/blob/master/keps/sig-api-machinery/20190206-watch-bookmark.md), which makes it the apiserver's responsibility to send the client the new resourceVersion before the old one is invalidated.
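To make the mechanism concrete, here is a rough sketch of what bookmarks buy the client, written against the raw watch wire format from the KEP rather than kube's own WatchEvent type; the struct and function names are illustrative, and the watch request has to be opened with allowWatchBookmarks=true for the server to send them:

```rust
use serde::Deserialize;

// Illustrative mirror of one line of a watch response stream; not a kube-rs type.
#[derive(Deserialize)]
struct RawWatchEvent {
    #[serde(rename = "type")]
    kind: String, // "ADDED" | "MODIFIED" | "DELETED" | "BOOKMARK" | "ERROR"
    object: serde_json::Value,
}

// A BOOKMARK event carries nothing but a fresh resourceVersion in its object's
// metadata, so even a watch on a namespace where nothing changes keeps a valid
// resumption point and never has to fall back to "start from 0".
fn bump_resume_point(current: &mut String, line: &str) -> serde_json::Result<()> {
    let ev: RawWatchEvent = serde_json::from_str(line)?;
    match ev.kind.as_str() {
        "ADDED" | "MODIFIED" | "DELETED" | "BOOKMARK" => {
            if let Some(rv) = ev
                .object
                .pointer("/metadata/resourceVersion")
                .and_then(|v| v.as_str())
            {
                *current = rv.to_string();
            }
        }
        // An ERROR event (e.g. 410 Gone) still forces a full resync.
        _ => {}
    }
    Ok(())
}
```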

clux added a commit that referenced this issue Apr 7, 2020
was an attempt to get around #219 but didn't work
clux added a commit that referenced this issue Apr 7, 2020
clux (Member, Author) commented Apr 7, 2020

Have implemented this now in the same branch, but it's not much use until everyone has at least Kubernetes 1.16 (where bookmarks first became beta).

nightkr (Member) commented Apr 7, 2020

I don't think that's unreasonable, given that 1.15 is EOL now. And if it's a major problem for them then they can turn on the WatchBookmark feature gate.

clux (Member, Author) commented Apr 7, 2020

What's EOL on their side isn't what's readily available though:

We are pretty close, but there's still some time to go. I still rely on EKS and would like Reflectors to keep working. Though perhaps a few extra watch calls every 300s isn't such a bad hit if it helps incentivise people to upgrade 🤔

nightkr (Member) commented Apr 7, 2020

Oh, ouch. But yeah, it's not like anything is /broken/ on 1.15. Sadly it doesn't look like EKS allows you to customize the feature gates either: aws/containers-roadmap#487, aws/containers-roadmap#512

clux (Member, Author) commented Apr 7, 2020

Yeah, I guess it's not broken-broken. It's a network inefficiency (which will probably last for months for consumers).

Even if we don't yet do the part of #218 that makes Reflectors use Informers internally (which is such a simplification), there's still the problem of Informers themselves hitting this, unless we start porting the Reflector's list-then-watch functionality into the Informer.

Still, with EKS probably getting it soon™, maybe we just mark the next kube version's runtime module as "works best with 1.16 and above". It's not like it's been particularly stable over the past two months anyway.

clux added several commits that referenced this issue Apr 7, 2020, with messages:
this will work in kubernetes >= 1.16 - see #219; for now, don't break it.
was an attempt to get around #219 but didn't work
clux (Member, Author) commented Apr 7, 2020

Fixed it in Reflectors by reverting the change for now. Probably wontfix for Informer. Put a "recommended with Kubernetes >= 1.16" note on it.

clux added the wontfix (This will not be worked on) label on Apr 7, 2020
clux changed the title from "Informer::poll always reset on slow-moving resources" to "Informer::poll always reset on slow-moving resources in <= 1.15" on Apr 7, 2020
clux changed the title from "Informer::poll always reset on slow-moving resources in <= 1.15" to "Informer frequent desyncs on slow-moving resources in <= 1.15" on Apr 7, 2020
nightkr (Member) commented Apr 7, 2020

FWIW I'm not sure I understand how Reflector could do a better job on its own than Informer. The three possible reactions would be:

  1. Clear the cache, set resourceVersion=0, restart watch
    • Informer's approach
  2. Reset cache and resourceVersion from a LIST call, restart watch
  3. Do a LIST with an empty filter to get a recent resourceVersion, restart watch

1 and 2 should be pretty much equivalent, since you still need to download all of the objects. LIST might have slightly less overhead per item, while rewatching from 0 avoids an API call.

3 is broken-broken, since it will lose writes and get stuck with stale objects.

clux (Member, Author) commented Apr 7, 2020

I think approach 2, used by Reflector, is better than 1, because the resourceVersion returned by the LIST call is a current value for the whole collection, whereas the last WatchEvent's resourceVersion is only as new as the last object that actually changed.
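In code, the distinction looks roughly like this (a hedged sketch assuming a recent k8s-openapi where metadata is non-optional; `compare_resume_points` is a hypothetical helper, not Reflector internals):

```rust
use k8s_openapi::api::core::v1::Pod;
use kube::api::{Api, ListParams};

async fn compare_resume_points(pods: &Api<Pod>) -> Result<(), kube::Error> {
    let list = pods.list(&ListParams::default()).await?;

    // Dynamic, collection-level value: current at the time the LIST was served,
    // so it is a valid watch starting point even if nothing has changed in days.
    let list_rv = list.metadata.resource_version.clone();

    // Per-object values: each one is frozen at that object's last update, so on
    // a slow-moving namespace they may all be far behind the collection version.
    for pod in &list.items {
        println!("{:?}: {:?}", pod.metadata.name, pod.metadata.resource_version);
    }
    println!("collection resourceVersion: {:?}", list_rv);
    Ok(())
}
```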

nightkr (Member) commented Apr 7, 2020

True, that's a good point.

clux (Member, Author) commented Jul 26, 2020

Unmarking this as a bug. It's essentially documented behaviour of the pre-bookmark world.

People are preliminarily reporting that this doesn't happen with bookmarks, but I haven't been able to fully test it myself yet.

clux added a commit that referenced this issue Feb 28, 2021
Since this flag does nothing for older apiservers, it's safe to enable it by default. This will also close #219 in the process.

Note that ListParams does not pick up on the flag for non-watch calls.
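For reference, a hedged sketch of what the default looks like client-side, assuming a kube version from around the fix where ListParams carries a public bookmarks flag (newer releases have since reorganised these parameters):

```rust
use kube::api::ListParams;

fn main() {
    // After #445, ListParams::default() requests bookmarks, i.e. watch calls
    // are sent with allowWatchBookmarks=true. Older apiservers simply ignore
    // the flag, and (as the commit note says) plain LIST calls never send it.
    let lp = ListParams::default();
    assert!(lp.bookmarks, "bookmarks are on by default after the fix");
}
```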
clux closed this as completed in #445 on Mar 1, 2021