
Informer frequent desyncs on slow-moving resources in <= 1.15 #219

Closed
clux opened this issue Apr 6, 2020 · 12 comments · Fixed by #445
Labels
runtime (controller runtime related), wontfix (This will not be worked on)

Comments

clux (Member) commented Apr 6, 2020

We need to revisit the choice made in #134, in particular the one discussed here:

There are two different options when desyncs happen:

  1. Do a LIST to get the latest resourceVersion, losing all events between the desync and the LIST
  2. Always set resourceVersion to "0", generating WatchEvent::Added events for all resources that already exist

We went with option 2 back then, and it generally works fine... but...

...if we have an informer on a namespace with particularly slow-moving resources, then option 2 can cause us to reset the resourceVersion to 0 on every poll after the first one (because the last returned resourceVersion is already too old for a new watch call). (An example will link to this shortly.)

Ran into this in #218 while trying to see if we could just use the Informer's logic to drive a Reflector to simplify some code. The list call + watch there is what saves the current Reflector from ever encountering this. Unfortunately, Informer as it stands is effectively unusable in certain pathological cases.

Maybe we should consider having Informer do a list call after all. It's possible to bunch up the returned data as WatchEvents to emulate the current "start from zero" behaviour.
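A minimal sketch of the two recovery strategies, written against the kube Api surface of the era (a LIST plus a resourceVersion-based watch); `recover_after_desync` is a hypothetical helper for illustration, not the actual Informer code:

```rust
use k8s_openapi::api::core::v1::Pod;
use kube::api::{Api, ListParams};

// Hypothetical helper illustrating the two desync-recovery strategies.
async fn recover_after_desync(pods: &Api<Pod>, relist: bool) -> Result<String, kube::Error> {
    if relist {
        // Option 1: a LIST returns a collection-level resourceVersion that is
        // current "now", so the follow-up watch will not be immediately expired,
        // at the cost of losing events between the desync and the LIST.
        let list = pods.list(&ListParams::default()).await?;
        Ok(list.metadata.resource_version.unwrap_or_else(|| "0".into()))
    } else {
        // Option 2 (what Informer did): restart from "0" and let the apiserver
        // replay every existing object as WatchEvent::Added. On a slow-moving
        // namespace the last event's resourceVersion may already be too old by
        // the next poll, so this reset can happen on every single iteration.
        Ok("0".to_string())
    }
}
```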

clux added the bug (Something isn't working) and runtime (controller runtime related) labels on Apr 6, 2020
clux added a commit that referenced this issue Apr 6, 2020
clux (Member, Author) commented Apr 6, 2020

Related issue: kubernetes-client/python#819

nightkr (Member) commented Apr 6, 2020

Or, we could implement support for watch bookmarks (https://github.com/kubernetes/enhancements/blob/master/keps/sig-api-machinery/20190206-watch-bookmark.md), which makes it the apiserver's responsibility to send the client the new resourceVersion before the old one is invalidated.
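To make the mechanism concrete, here is a rough sketch of what bookmarks buy the client, written against the raw watch wire format from the KEP rather than kube's own WatchEvent type; the struct and function names are illustrative, and the watch request has to be opened with allowWatchBookmarks=true for the server to send them:

```rust
use serde::Deserialize;

// Illustrative mirror of one line of a watch response stream; not a kube-rs type.
#[derive(Deserialize)]
struct RawWatchEvent {
    #[serde(rename = "type")]
    kind: String, // "ADDED" | "MODIFIED" | "DELETED" | "BOOKMARK" | "ERROR"
    object: serde_json::Value,
}

// A BOOKMARK event carries nothing but a fresh resourceVersion in its object's
// metadata, so even a watch on a namespace where nothing changes keeps a valid
// resumption point and never has to fall back to "start from 0".
fn bump_resume_point(current: &mut String, line: &str) -> serde_json::Result<()> {
    let ev: RawWatchEvent = serde_json::from_str(line)?;
    match ev.kind.as_str() {
        "ADDED" | "MODIFIED" | "DELETED" | "BOOKMARK" => {
            if let Some(rv) = ev
                .object
                .pointer("/metadata/resourceVersion")
                .and_then(|v| v.as_str())
            {
                *current = rv.to_string();
            }
        }
        // An ERROR event (e.g. 410 Gone) still forces a full resync.
        _ => {}
    }
    Ok(())
}
```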

clux added a commit that referenced this issue Apr 7, 2020
was an attempt to get around #219 but didn't work
clux added a commit that referenced this issue Apr 7, 2020
clux (Member, Author) commented Apr 7, 2020

Have implemented this now in the same branch, but it's not much use until everyone has at least Kubernetes 1.16 (where bookmarks first became beta).

nightkr (Member) commented Apr 7, 2020

I don't think that's unreasonable, given that 1.15 is EOL now. And if it's a major problem for them then they can turn on the WatchBookmark feature gate.

clux (Member, Author) commented Apr 7, 2020

What's EOL on their side isn't what's readily available though:

We are pretty close, but there's still some time to go. I still rely on EKS and would like Reflectors to keep working. Though perhaps a few extra watch calls every 300s isn't such a bad hit if it helps incentivise people to upgrade 🤔

nightkr (Member) commented Apr 7, 2020

Oh, ouch. But yeah, it's not like anything is /broken/ on 1.15. Sadly it doesn't look like EKS allows you to customize the feature gates either: aws/containers-roadmap#487, aws/containers-roadmap#512

clux (Member, Author) commented Apr 7, 2020

Yeah, I guess it's not broken-broken. It's a network inefficiency (which will probably last for months for consumers).

Even if we don't yet do the part of #218 that makes Reflectors use Informers internally (which is such a simplification), there's still the problem of Informers themselves hitting this, unless we start porting the Reflector's list-then-watch functionality into the Informer.

Still, with EKS probably getting it soon™, maybe we just mark the next kube version's runtime module as "works best with 1.16 and above". It's not like it's been particularly stable over the past two months anyway.

clux added several commits that referenced this issue Apr 7, 2020, with messages:
this will work in kubernetes >= 1.16 - see #219; for now, don't break it.
was an attempt to get around #219 but didn't work
clux (Member, Author) commented Apr 7, 2020

Fixed it in Reflectors by reverting the change for now. Probably wontfix for Informer. Put a "recommended with Kubernetes >= 1.16" note on it.

clux added the wontfix (This will not be worked on) label on Apr 7, 2020
clux changed the title from "Informer::poll always reset on slow-moving resources" to "Informer::poll always reset on slow-moving resources in <= 1.15" on Apr 7, 2020
clux changed the title from "Informer::poll always reset on slow-moving resources in <= 1.15" to "Informer frequent desyncs on slow-moving resources in <= 1.15" on Apr 7, 2020
nightkr (Member) commented Apr 7, 2020

FWIW I'm not sure I understand how Reflector could do a better job on its own than Informer. The three possible reactions would be:

  1. Clear the cache, set resourceVersion=0, restart watch
    • Informer's approach
  2. Reset cache and resourceVersion from a LIST call, restart watch
  3. Do a LIST with an empty filter to get a recent resourceVersion, restart watch

1 and 2 should be pretty much equivalent, since you still need to download all of the objects. LIST might have slightly less overhead per item, while rewatching from 0 avoids an API call.

3 is broken-broken, since it will lose writes and get stuck with stale objects.

clux (Member, Author) commented Apr 7, 2020

I think approach 2, used by Reflector, is better than 1, because the resourceVersion returned by the LIST call is a current value for the whole collection, whereas the last WatchEvent's resourceVersion is only as new as the last object that actually changed.
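In code, the distinction looks roughly like this (a hedged sketch assuming a recent k8s-openapi where metadata is non-optional; `compare_resume_points` is a hypothetical helper, not Reflector internals):

```rust
use k8s_openapi::api::core::v1::Pod;
use kube::api::{Api, ListParams};

async fn compare_resume_points(pods: &Api<Pod>) -> Result<(), kube::Error> {
    let list = pods.list(&ListParams::default()).await?;

    // Dynamic, collection-level value: current at the time the LIST was served,
    // so it is a valid watch starting point even if nothing has changed in days.
    let list_rv = list.metadata.resource_version.clone();

    // Per-object values: each one is frozen at that object's last update, so on
    // a slow-moving namespace they may all be far behind the collection version.
    for pod in &list.items {
        println!("{:?}: {:?}", pod.metadata.name, pod.metadata.resource_version);
    }
    println!("collection resourceVersion: {:?}", list_rv);
    Ok(())
}
```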

nightkr (Member) commented Apr 7, 2020

True, that's a good point.

clux (Member, Author) commented Jul 26, 2020

Unmarking this as a bug. It's essentially documented behaviour of the pre-bookmark world.

People are preliminarily reporting that this doesn't happen with bookmarks, but I haven't been able to fully test it myself yet.

clux added a commit that referenced this issue Feb 28, 2021
Since this flag does nothing for older apiservers, it's safe to enable it by default. This will also close #219 in the process.

Note that ListParams does not pick up on the flag for non-watch calls.
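For reference, a hedged sketch of what the default looks like client-side, assuming a kube version from around the fix where ListParams carries a public bookmarks flag (newer releases have since reorganised these parameters):

```rust
use kube::api::ListParams;

fn main() {
    // After #445, ListParams::default() requests bookmarks, i.e. watch calls
    // are sent with allowWatchBookmarks=true. Older apiservers simply ignore
    // the flag, and (as the commit note says) plain LIST calls never send it.
    let lp = ListParams::default();
    assert!(lp.bookmarks, "bookmarks are on by default after the fix");
}
```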
clux closed this as completed in #445 on Mar 1, 2021