Reconsider using etcd+chunking for reflector lists #82655

Open
shyamjvs opened this issue Sep 12, 2019 · 11 comments
Labels
kind/feature Categorizes issue or PR as related to a new feature. lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. sig/scalability Categorizes an issue or PR as relevant to SIG Scalability.

Comments

@shyamjvs
Member

shyamjvs commented Sep 12, 2019

Filing an issue for the discussion I brought up in the SIG Scalability meeting today, as I couldn't find an existing one for it.

Background:
Currently all our informers, which use reflectors underneath, list from the apiserver's watch cache. The decision to use the watch cache instead of etcd was made quite some time ago, to avoid overloading etcd with many simultaneous lists.

Since the watch cache doesn't support chunking (it lacks the necessary history), informers make a full, unchunked list against the apiserver. We recently hit an issue where multiple watchers relisting at the same time caused apiserver CPU/memory to spike quite a bit:

[Screenshot (2019-09-12): apiserver CPU/memory usage spiking]

Such a scenario can happen if, for example, the watch from the apiserver to etcd doesn't receive an event before that resource version gets compacted in etcd:

E0824 09:43:50.201103 8 watcher.go:208] watch chan error: etcdserver: mvcc: required revision has been compacted
W0824 09:43:50.201169 8 reflector.go:256] storage/cacher.go:/pods: watch of *core.Pod ended with: The resourceVersion for the provided watch is too old.
W0824 09:43:51.201450 8 cacher.go:125] Terminating all watchers from cacher *core.Pod

As a result, when the apiserver terminated all watches from the cacher, multiple clients relisted at the same time, causing the resource spike on the apiserver.
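
For reference, chunking on the API side is driven by the limit/continue fields in ListOptions. Below is a minimal sketch of a chunked list using a modern client-go clientset (the 500-item page size and the kubeconfig loading are illustrative, not something prescribed here). Leaving resourceVersion unset routes the list through etcd, where the limit is honored; resourceVersion=0 is served from the watch cache, which ignores the limit:

```go
// Sketch: paginated ("chunked") LIST of pods, as opposed to the single full
// LIST a reflector issues against the watch cache today.
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func listPodsInChunks(ctx context.Context, client kubernetes.Interface) error {
	// ResourceVersion is deliberately left unset ("") so the request is
	// served from etcd and the Limit below is honored.
	opts := metav1.ListOptions{Limit: 500} // illustrative page size
	for {
		pods, err := client.CoreV1().Pods(metav1.NamespaceAll).List(ctx, opts)
		if err != nil {
			return err
		}
		fmt.Printf("received a chunk of %d pods\n", len(pods.Items))
		if pods.Continue == "" {
			return nil // no more chunks
		}
		opts.Continue = pods.Continue // token for the next chunk
	}
}

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	if err := listPodsInChunks(context.Background(), kubernetes.NewForConfigOrDie(cfg)); err != nil {
		panic(err)
	}
}
```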

From our discussion, it seems the apiserver's watch falling out of etcd's window can currently happen in at least two scenarios:

  • The set of objects being watched isn't changing frequently enough, so the latest known RV falls out of etcd's compaction window. @wojtek-t - IIUC this will be solved by an etcd 3.4 feature you mentioned that's similar to our watch bookmarks. Can you confirm?

  • The watch cache can't handle the throughput of incoming events from etcd. From my understanding, this problem remains even after moving to etcd 3.4. So to avoid apiserver spikes in such cases, IMO we should reconsider allowing informers to list from etcd with chunking (since we have it now) instead of from the watch cache. Per Wojtek's thoughts, this seems feasible starting from etcd 3.4, which has significant read-concurrency improvements.

What would you like to be changed:

  • Switch to etcd 3.4
  • Allow informers to list from etcd (with chunking) - see the client-side sketch after this list
  • Scale test it
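
As referenced in the second item above, here is a hedged client-side sketch, assuming the WatchListPageSize field exposed on client-go's Reflector (the page size and the choice of pods are illustrative). Whether such a paginated list actually bypasses the watch cache still depends on the resourceVersion the reflector sends, which is the server-side part this issue asks to reconsider:

```go
// Sketch: a standalone reflector asked to paginate its (re)list requests.
package main

import (
	"time"

	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/fields"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/cache"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	lw := cache.NewListWatchFromClient(
		client.CoreV1().RESTClient(), "pods", metav1.NamespaceAll, fields.Everything())
	store := cache.NewStore(cache.MetaNamespaceKeyFunc)

	r := cache.NewReflector(lw, &v1.Pod{}, store, 0)
	r.WatchListPageSize = 500 // illustrative: request chunks of 500 items

	stopCh := make(chan struct{})
	defer close(stopCh)
	go r.Run(stopCh)

	time.Sleep(30 * time.Second) // let it sync; illustrative only
}
```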

cc @kubernetes/sig-scalability-feature-requests @wojtek-t @liggitt @smarterclayton

@shyamjvs shyamjvs added the kind/feature Categorizes issue or PR as related to a new feature. label Sep 12, 2019
@k8s-ci-robot k8s-ci-robot added the sig/scalability Categorizes an issue or PR as relevant to SIG Scalability. label Sep 12, 2019
@wojtek-t
Member

The set of objects being watched isn't changing frequently enough, so the latest known RV falls out of etcd's compaction window. @wojtek-t - IIUC this will be solved by an etcd 3.4 feature you mentioned that's similar to our watch bookmarks. Can you confirm?

At a high level, yes - though I'm not sure in which release it becomes useful (it may be before 3.4 - we need to check). The feature is called etcd "progress notify".
#77584 is related.
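
For reference, this is roughly what progress notifications look like from the etcd client side. A minimal sketch using etcd's clientv3 (the endpoint, key prefix, and import path are illustrative and depend on the etcd client version); the apiserver would still need to turn these into something like our watch bookmarks:

```go
// Sketch: an etcd v3 watch that requests progress notifications, so the
// watcher's revision keeps advancing even when the watched keys see no
// events, which keeps it from falling behind the compaction window.
package main

import (
	"context"
	"fmt"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"localhost:2379"}, // illustrative endpoint
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		panic(err)
	}
	defer cli.Close()

	// WithProgressNotify asks the server to periodically send empty
	// responses carrying the current revision.
	ch := cli.Watch(context.Background(), "/registry/pods/",
		clientv3.WithPrefix(), clientv3.WithProgressNotify())

	for resp := range ch {
		if resp.IsProgressNotify() {
			fmt.Printf("progress notify at revision %d\n", resp.Header.Revision)
			continue
		}
		for _, ev := range resp.Events {
			fmt.Printf("%s %q at revision %d\n", ev.Type, ev.Kv.Key, ev.Kv.ModRevision)
		}
	}
}
```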

The watch cache can't handle the throughput of incoming events from etcd. From my understanding, this problem remains even after moving to etcd 3.4. So to avoid apiserver spikes in such cases, IMO we should reconsider allowing informers to list from etcd with chunking (since we have it now) instead of from the watch cache. Per Wojtek's thoughts, this seems feasible starting from etcd 3.4, which has significant read-concurrency improvements.

I didn't say it's feasible - I only said that I don't have full confidence it's NOT IMPOSSIBLE :)
That said - I also want to emphasize that if the watch cache is unable to keep up with the load for a longer period of time (it's fine for it to fall behind for a couple of seconds if it recovers later), then we have a much bigger problem than periodic relists, and we need to fix that more fundamentally.

===============

But after thinking about this a bit more, switching to etcd + chunking is still more of a mitigation than a fix in itself.

I believe the proper solution is rate-limiting and load shedding. Priority and fairness is a great step toward this (we're targeting 1.17 for Alpha).
Once that happens, what we really need to do is introduce a different cost for different calls (e.g. a GET of a single pod is much cheaper than a LIST of all pods).

Even with chunking, it will still be possible to overload the apiserver and etcd - load shedding is the only proper solution to this problem.
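
To illustrate the "different cost for different calls" idea, here is a hedged sketch of cost-weighted admission using a plain token bucket from golang.org/x/time/rate. The costs and budget are made-up numbers, and this is not how API Priority and Fairness is implemented - it only shows why charging a full LIST more than a GET lets the server shed the expensive calls first:

```go
// Sketch: charge expensive calls more tokens than cheap ones and shed load
// once the budget is exhausted. Illustrative only; APF uses a far more
// sophisticated fair-queuing scheme.
package main

import (
	"fmt"
	"time"

	"golang.org/x/time/rate"
)

// Made-up relative costs: a full LIST of a large collection is far more
// expensive than a GET of a single object; a chunked LIST sits in between.
var costs = map[string]int{
	"GET pod":           1,
	"LIST pods (chunk)": 5,
	"LIST pods (full)":  50,
}

func main() {
	// Budget: refill 10 cost units per second, with a burst of 60.
	limiter := rate.NewLimiter(rate.Limit(10), 60)

	requests := []string{"GET pod", "LIST pods (full)", "LIST pods (chunk)", "LIST pods (full)"}
	for _, req := range requests {
		if limiter.AllowN(time.Now(), costs[req]) {
			fmt.Printf("admit %-18s (cost %d)\n", req, costs[req])
		} else {
			fmt.Printf("shed  %-18s (cost %d) - over budget\n", req, costs[req])
		}
	}
}
```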

@shyamjvs
Member Author

Thanks for those clarifications!

So I agree load-shedding is a nice way to protect the apiserver from getting overloaded, and is a necessary (and sufficient) guard rail for such scenarios.

That said, I think there's still a fundamental problem for which supporting etcd + chunking is a fix rather than a mitigation: informers currently have to make "high cost" calls to the apiserver because chunking isn't available to them. As a rough illustration, if a full list call costs 10x more than a chunked call (w.r.t. CPU/memory), the apiserver might be able to handle 10x more chunked calls than full list calls within its budget. I believe this is more of an optimization than a requirement at this point, though.

@wojtek-t
Member

As a rough illustration, if a full list call costs 10x more than a chunked call (w.r.t. CPU/memory),

CPU is a bad argument, because we will need to process all the chunks anyway - in total we will use even more CPU. The "current memory" argument is a valid one (in addition to how long a single call takes).

@lavalamp
Member

lavalamp commented Sep 12, 2019

field/label-selector-based lists aren't currently supported with etcd

Are you sure? When did this change? (Edit: to be clear, the apiserver should be doing the filtering whether the request goes to the watch cache or directly to etcd.)

Did you just mean it's O(N) in the size of the collection rather than the size of the results?

@wojtek-t
Member

Did you just mean it's O(N) in the size of the collection rather than the size of the results?

Yes - it means we need to list everything from etcd (and deserialize it) and do the filtering in the apiserver.
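
A small client-side illustration of that point (a hedged sketch; the selector, namespace, and kubeconfig loading are illustrative): the selector reduces what comes back to the client, but when the request is served from etcd the apiserver still reads and deserializes the entire collection before filtering:

```go
// Sketch: a label-selector LIST. Only matching pods come back over the wire,
// but the filtering itself happens in the apiserver, after it has read and
// decoded every pod in the range from etcd.
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	pods, err := client.CoreV1().Pods("default").List(context.Background(), metav1.ListOptions{
		LabelSelector: "app=nginx", // illustrative selector; evaluated server-side by the apiserver
	})
	if err != nil {
		panic(err)
	}
	fmt.Printf("matched %d pods\n", len(pods.Items))
}
```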

@shyamjvs
Member Author

shyamjvs commented Sep 12, 2019

Did you just mean it's O(N) in the size of the collection rather than the size of the results?

Sorry, I wasn't clear earlier - what I had in mind is that etcd doesn't natively support filtering based on those selectors. From my understanding, if that were possible, it could further reduce the apiserver's CPU/memory footprint at the cost of extra CPU in etcd. But that's a separate concern I haven't fully thought through, so let me keep that discussion separate. I've edited the issue description.

@smarterclayton
Contributor

smarterclayton commented Sep 13, 2019 via email

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Dec 12, 2019
@shyamjvs
Member Author

/remove-lifecycle stale
/lifecycle frozen

@k8s-ci-robot k8s-ci-robot added lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Dec 12, 2019
@JohnRusk

@shyamjvs There's some additional discussion of this in #102672.

@wojtek-t
Member

kubernetes/enhancements#3142 is the path we're currently exploring (and the direction I would like to pursue).
