Reconsider using etcd+chunking for reflector lists #82655

Open
shyamjvs opened this issue Sep 12, 2019 · 11 comments
Labels
kind/feature Categorizes issue or PR as related to a new feature. lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. sig/scalability Categorizes an issue or PR as relevant to SIG Scalability.

Comments

@shyamjvs
Member

shyamjvs commented Sep 12, 2019

Filing an issue for the discussion I brought up in the SIG Scalability meeting today, as I couldn't find an existing one for it.

Background:
Currently all our informers, which use reflectors underneath, list from the apiserver's watch cache. The decision to use the watch cache instead of etcd was made quite some time ago, to avoid overloading etcd with many simultaneous lists.

Since the watch cache doesn't support chunking (it lacks the necessary history), informers make a full, unchunked list against the apiserver. We recently hit an issue where multiple watchers relisting at the same time caused apiserver CPU/memory to spike quite a bit:

[Screenshot (2019-09-12): apiserver CPU/memory usage spiking]

Such a scenario can happen if, for example, the watch from the apiserver to etcd doesn't receive an event before that resource version gets compacted in etcd:

E0824 09:43:50.201103 8 watcher.go:208] watch chan error: etcdserver: mvcc: required revision has been compacted
W0824 09:43:50.201169 8 reflector.go:256] storage/cacher.go:/pods: watch of *core.Pod ended with: The resourceVersion for the provided watch is too old.
W0824 09:43:51.201450 8 cacher.go:125] Terminating all watchers from cacher *core.Pod

As a result, when the apiserver terminated all watches from the cacher, multiple clients relisted at the same time, causing the resource spike on the apiserver.
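
For reference, chunking on the API side is driven by the limit/continue fields in ListOptions. Below is a minimal sketch of a chunked list using a modern client-go clientset (the 500-item page size and the kubeconfig loading are illustrative, not something prescribed here). Leaving resourceVersion unset routes the list through etcd, where the limit is honored; resourceVersion=0 is served from the watch cache, which ignores the limit:

```go
// Sketch: paginated ("chunked") LIST of pods, as opposed to the single full
// LIST a reflector issues against the watch cache today.
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func listPodsInChunks(ctx context.Context, client kubernetes.Interface) error {
	// ResourceVersion is deliberately left unset ("") so the request is
	// served from etcd and the Limit below is honored.
	opts := metav1.ListOptions{Limit: 500} // illustrative page size
	for {
		pods, err := client.CoreV1().Pods(metav1.NamespaceAll).List(ctx, opts)
		if err != nil {
			return err
		}
		fmt.Printf("received a chunk of %d pods\n", len(pods.Items))
		if pods.Continue == "" {
			return nil // no more chunks
		}
		opts.Continue = pods.Continue // token for the next chunk
	}
}

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	if err := listPodsInChunks(context.Background(), kubernetes.NewForConfigOrDie(cfg)); err != nil {
		panic(err)
	}
}
```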

From our discussion, it seems the apiserver's watch falling out of etcd's window can currently happen in at least two scenarios:

  • The set of objects being watched isn't changing frequently enough, so the latest known RV falls out of etcd's compaction window. @wojtek-t - IIUC this will be solved by an etcd 3.4 feature you mentioned that's similar to our watch bookmarks. Can you confirm?

  • The watch cache can't handle the throughput of incoming events from etcd. From my understanding, this problem remains even after moving to etcd 3.4. So to avoid apiserver spikes in such cases, IMO we should reconsider allowing informers to list from etcd with chunking (since we have it now) instead of from the watch cache. Per Wojtek's thoughts, this seems feasible starting from etcd 3.4, which has significant read-concurrency improvements.

What would you like to be changed:

  • Switch to etcd 3.4
  • Allow informers to list from etcd (with chunking) - see the client-side sketch after this list
  • Scale test it
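
As referenced in the second item above, here is a hedged client-side sketch, assuming the WatchListPageSize field exposed on client-go's Reflector (the page size and the choice of pods are illustrative). Whether such a paginated list actually bypasses the watch cache still depends on the resourceVersion the reflector sends, which is the server-side part this issue asks to reconsider:

```go
// Sketch: a standalone reflector asked to paginate its (re)list requests.
package main

import (
	"time"

	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/fields"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/cache"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	lw := cache.NewListWatchFromClient(
		client.CoreV1().RESTClient(), "pods", metav1.NamespaceAll, fields.Everything())
	store := cache.NewStore(cache.MetaNamespaceKeyFunc)

	r := cache.NewReflector(lw, &v1.Pod{}, store, 0)
	r.WatchListPageSize = 500 // illustrative: request chunks of 500 items

	stopCh := make(chan struct{})
	defer close(stopCh)
	go r.Run(stopCh)

	time.Sleep(30 * time.Second) // let it sync; illustrative only
}
```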

cc @kubernetes/sig-scalability-feature-requests @wojtek-t @liggitt @smarterclayton

@shyamjvs shyamjvs added the kind/feature Categorizes issue or PR as related to a new feature. label Sep 12, 2019
@k8s-ci-robot k8s-ci-robot added the sig/scalability Categorizes an issue or PR as relevant to SIG Scalability. label Sep 12, 2019
@wojtek-t
Member

The set of objects being watched isn't changing frequently enough, so the latest known RV falls out of etcd's compaction window. @wojtek-t - IIUC this will be solved by an etcd 3.4 feature you mentioned that's similar to our watch bookmarks. Can you confirm?

At a high level, yes - though I'm not sure in which release it becomes useful (it may be before 3.4 - we need to check). The feature is called etcd "progress notify".
#77584 is related.
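
For reference, this is roughly what progress notifications look like from the etcd client side. A minimal sketch using etcd's clientv3 (the endpoint, key prefix, and import path are illustrative and depend on the etcd client version); the apiserver would still need to turn these into something like our watch bookmarks:

```go
// Sketch: an etcd v3 watch that requests progress notifications, so the
// watcher's revision keeps advancing even when the watched keys see no
// events, which keeps it from falling behind the compaction window.
package main

import (
	"context"
	"fmt"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"localhost:2379"}, // illustrative endpoint
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		panic(err)
	}
	defer cli.Close()

	// WithProgressNotify asks the server to periodically send empty
	// responses carrying the current revision.
	ch := cli.Watch(context.Background(), "/registry/pods/",
		clientv3.WithPrefix(), clientv3.WithProgressNotify())

	for resp := range ch {
		if resp.IsProgressNotify() {
			fmt.Printf("progress notify at revision %d\n", resp.Header.Revision)
			continue
		}
		for _, ev := range resp.Events {
			fmt.Printf("%s %q at revision %d\n", ev.Type, ev.Kv.Key, ev.Kv.ModRevision)
		}
	}
}
```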

The watch cache can't handle the throughput of incoming events from etcd. From my understanding, this problem remains even after moving to etcd 3.4. So to avoid apiserver spikes in such cases, IMO we should reconsider allowing informers to list from etcd with chunking (since we have it now) instead of from the watch cache. Per Wojtek's thoughts, this seems feasible starting from etcd 3.4, which has significant read-concurrency improvements.

I didn't say it's feasible - I only said that I don't have full confidence it's NOT IMPOSSIBLE :)
That said - I also want to emphasize that if the watch cache is unable to keep up with the load for a longer period of time (it's fine for it to fall behind for a couple of seconds if it recovers later), then we have a much bigger problem than periodic relists, and we need to fix that more fundamentally.

===============

But after thinking about this a bit more, switching to etcd + chunking is still more of a mitigation than a fix in itself.

I believe the proper solution is rate-limiting and load shedding. Priority and fairness is a great step toward this (we're targeting 1.17 for Alpha).
Once that happens, what we really need to do is introduce a different cost for different calls (e.g. a GET of a single pod is much cheaper than a LIST of all pods).

Even with chunking, it will still be possible to overload the apiserver and etcd - load shedding is the only proper solution to this problem.
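
To illustrate the "different cost for different calls" idea, here is a hedged sketch of cost-weighted admission using a plain token bucket from golang.org/x/time/rate. The costs and budget are made-up numbers, and this is not how API Priority and Fairness is implemented - it only shows why charging a full LIST more than a GET lets the server shed the expensive calls first:

```go
// Sketch: charge expensive calls more tokens than cheap ones and shed load
// once the budget is exhausted. Illustrative only; APF uses a far more
// sophisticated fair-queuing scheme.
package main

import (
	"fmt"
	"time"

	"golang.org/x/time/rate"
)

// Made-up relative costs: a full LIST of a large collection is far more
// expensive than a GET of a single object; a chunked LIST sits in between.
var costs = map[string]int{
	"GET pod":           1,
	"LIST pods (chunk)": 5,
	"LIST pods (full)":  50,
}

func main() {
	// Budget: refill 10 cost units per second, with a burst of 60.
	limiter := rate.NewLimiter(rate.Limit(10), 60)

	requests := []string{"GET pod", "LIST pods (full)", "LIST pods (chunk)", "LIST pods (full)"}
	for _, req := range requests {
		if limiter.AllowN(time.Now(), costs[req]) {
			fmt.Printf("admit %-18s (cost %d)\n", req, costs[req])
		} else {
			fmt.Printf("shed  %-18s (cost %d) - over budget\n", req, costs[req])
		}
	}
}
```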

@shyamjvs
Member Author

Thanks for those clarifications!

So I agree load-shedding is a nice way to protect the apiserver from getting overloaded, and is a necessary (and sufficient) guard rail for such scenarios.

That said, I think there's still a fundamental problem for which supporting etcd + chunking is a fix rather than a mitigation: informers currently have to make "high cost" calls to the apiserver because chunking isn't available to them. As a rough illustration, if a full list call costs 10x more than a chunked call (w.r.t. CPU/memory), the apiserver might be able to handle 10x more chunked calls than full list calls within its budget. I believe this is more of an optimization than a requirement at this point, though.

@wojtek-t
Member

As a rough illustration, if a full list call costs 10x more than a chunked call (w.r.t. CPU/memory),

CPU is a bad argument, because we will need to process all the chunks anyway - in total we will use even more CPU. The "current memory" argument is a valid one (in addition to how long a single call takes).

@lavalamp
Member

lavalamp commented Sep 12, 2019

field/label-selector-based lists aren't currently supported with etcd

Are you sure? When did this change? (Edit: to be clear, the apiserver should be doing the filtering whether the request goes to the watch cache or directly to etcd.)

Did you just mean it's O(N) in the size of the collection rather than the size of the results?

@wojtek-t
Member

Did you just mean it's O(N) in the size of the collection rather than the size of the results?

Yes - it means we need to list everything from etcd (and deserialize it) and do the filtering in the apiserver.
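
A small client-side illustration of that point (a hedged sketch; the selector, namespace, and kubeconfig loading are illustrative): the selector reduces what comes back to the client, but when the request is served from etcd the apiserver still reads and deserializes the entire collection before filtering:

```go
// Sketch: a label-selector LIST. Only matching pods come back over the wire,
// but the filtering itself happens in the apiserver, after it has read and
// decoded every pod in the range from etcd.
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	pods, err := client.CoreV1().Pods("default").List(context.Background(), metav1.ListOptions{
		LabelSelector: "app=nginx", // illustrative selector; evaluated server-side by the apiserver
	})
	if err != nil {
		panic(err)
	}
	fmt.Printf("matched %d pods\n", len(pods.Items))
}
```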

@shyamjvs
Member Author

shyamjvs commented Sep 12, 2019

Did you just mean it's O(N) in the size of the collection rather than the size of the results?

Sorry, I wasn't clear earlier - what I had in mind is that etcd doesn't natively support filtering based on those selectors. From my understanding, if that were possible, it could further reduce the apiserver's CPU/memory footprint at the cost of extra CPU in etcd. But that's a separate concern I haven't fully thought through, so let me keep that discussion separate. I've edited the issue description.

@smarterclayton
Contributor

smarterclayton commented Sep 13, 2019 via email

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Dec 12, 2019
@shyamjvs
Member Author

/remove-lifecycle stale
/lifecycle frozen

@k8s-ci-robot k8s-ci-robot added lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Dec 12, 2019
@JohnRusk

@shyamjvs There's some additional discussion of this in #102672.

@wojtek-t
Member

kubernetes/enhancements#3142 is the path we're currently exploring (and the direction I would like to pursue).
