Reconsider using etcd+chunking for reflector lists #82655
Comments
High-level yes, though I'm not sure in which release it becomes useful (it may be before 3.4 - we need to check). This feature is called "etcd progress notify".
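[Editor's note: a minimal sketch of what progress notify looks like at the etcd client level, assuming the go.etcd.io/etcd/client/v3 module, an endpoint at localhost:2379, and the /registry/pods/ prefix (all illustrative). The server periodically sends empty watch responses carrying the current revision, so an idle watcher's revision doesn't silently fall behind the compaction window.]

```go
package main

import (
	"context"
	"fmt"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"localhost:2379"}, // illustrative endpoint
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		panic(err)
	}
	defer cli.Close()

	// WithProgressNotify asks the server to periodically send watch
	// responses with no events but an up-to-date revision, so an idle
	// watcher's resource version stays inside the compaction window.
	wch := cli.Watch(context.Background(), "/registry/pods/",
		clientv3.WithPrefix(), clientv3.WithProgressNotify())

	for wresp := range wch {
		if wresp.IsProgressNotify() {
			fmt.Printf("progress notify at revision %d\n", wresp.Header.Revision)
			continue
		}
		for _, ev := range wresp.Events {
			fmt.Printf("%s %s\n", ev.Type, ev.Kv.Key)
		}
	}
}
```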
I didn't say it's feasible - I only said that I don't have full confidence it's NOT IMPOSSIBLE :)

But after thinking about this a bit more, switching to etcd + chunking is still more a mitigation than a fix in itself. I believe the proper solution is proper rate-limiting and load shedding. Priority and fairness is a great step toward this (we're targeting 1.17 for Alpha). Even with chunking, it will still be possible to overload the apiserver and etcd - load shedding is the only option for a proper solution to this problem.
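[Editor's note: to make the load-shedding idea concrete, here is a toy Go sketch of an inflight limiter in the spirit of the apiserver's max-inflight filter. This is not the actual priority-and-fairness implementation; names and the limit are illustrative.]

```go
package main

import (
	"io"
	"net/http"
)

// maxInFlight is a toy load-shedding filter: at most `limit` requests
// run concurrently; anything beyond that is rejected immediately with
// 429 instead of queuing up and exhausting server memory.
func maxInFlight(limit int, next http.Handler) http.Handler {
	sem := make(chan struct{}, limit)
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		select {
		case sem <- struct{}{}: // a slot is free: serve the request
			defer func() { <-sem }()
			next.ServeHTTP(w, r)
		default: // saturated: shed load rather than buffer it
			w.Header().Set("Retry-After", "1")
			http.Error(w, "too many requests", http.StatusTooManyRequests)
		}
	})
}

func main() {
	backend := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		io.WriteString(w, "ok\n")
	})
	http.ListenAndServe(":8080", maxInFlight(100, backend))
}
```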
Thanks for those clarifications! I agree load shedding is a nice way to protect the apiserver from getting overloaded, and is a necessary (and sufficient) guard rail for such scenarios. Though I think there's still a fundamental problem for which supporting etcd + chunking is a fix rather than a mitigation: informers currently need to make "high cost" calls to the apiserver due to the lack of chunking. As a rough example just to illustrate, if a full list call costs 10x more than a chunked call (w.r.t. CPU/memory), the apiserver might be able to handle 10x more chunked calls than full list calls within its budget. That said, I believe this is more of an optimization than a requirement at this point.
CPU is a bad argument - we will need to process all chunks anyway, so in total we will use even more CPU. The "current memory" argument is a valid one (in addition to how long a single call takes).
Are you sure? When did this change? (Edit: to be clear, the apiserver should be doing the filtering whether the request goes to the watch cache or to etcd directly.) Did you just mean it's O(N) with the size of the collection rather than the size of the results?
Yes - I meant that we need to list everything from etcd (and deserialize it) and do the filtering in the apiserver.
Sorry I wasn't clear earlier - what I had in mind was that etcd doesn't natively support filtering based on those selectors. From my understanding, if it were possible, that could potentially further reduce the CPU/memory footprint of the apiserver at the cost of extra CPU in etcd. Though I feel that's a different concern I haven't fully thought about, so let me keep that discussion separate. Edited the issue description.
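[Editor's note: a hypothetical Go sketch of what selector-style filtering on top of etcd costs today - listAndFilter and keep are made-up names, not real Kubernetes code. The range read is O(N) in the collection size no matter how selective the filter is, because filtering can only happen after the full read and deserialization.]

```go
package etcdlist

import (
	"context"

	"go.etcd.io/etcd/api/v3/mvccpb"
	clientv3 "go.etcd.io/etcd/client/v3"
)

// listAndFilter illustrates why selector-based lists are expensive
// without native filtering in etcd: the caller must fetch every key
// under the prefix and only then drop entries the selector rejects.
func listAndFilter(ctx context.Context, cli *clientv3.Client, prefix string,
	keep func(kv *mvccpb.KeyValue) bool) ([]*mvccpb.KeyValue, error) {

	// The range read returns the whole collection: O(N) in the size of
	// the collection, regardless of how selective the filter is.
	resp, err := cli.Get(ctx, prefix, clientv3.WithPrefix())
	if err != nil {
		return nil, err
	}
	var out []*mvccpb.KeyValue
	for _, kv := range resp.Kvs {
		if keep(kv) { // filtering happens only after the full read
			out = append(out, kv)
		}
	}
	return out, nil
}
```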
I’m in favor of all three things being tested.
Issues go stale after 90d of inactivity. If this issue is safe to close now please do so with /close. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/remove-lifecycle stale
kubernetes/enhancements#3142 is the path we're currently exploring (and the direction I would like to pursue further).
Filing issue for the discussion I brought up in SIG scalability meeting today, as I couldn't find one for it already.
Background:
Currently, all our informers, which use reflectors underneath, list from the apiserver's watch cache. This was a decision made quite some time ago - to use the watch cache instead of etcd, to avoid overloading etcd with several lists.
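[Editor's note: for readers less familiar with the mechanics, a minimal client-go sketch of the informer path being discussed (kubeconfig path and handler are illustrative). The initial LIST issued by the reflector inside this machinery is the call at the center of this issue.]

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/cache"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	cs := kubernetes.NewForConfigOrDie(cfg)

	// The shared informer factory wires up a reflector per resource. On
	// startup, each reflector issues the LIST discussed in this issue
	// (served from the apiserver's watch cache) and then opens a WATCH.
	factory := informers.NewSharedInformerFactory(cs, 0)
	podInformer := factory.Core().V1().Pods().Informer()
	podInformer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		AddFunc: func(obj interface{}) {
			fmt.Println("added:", obj.(*corev1.Pod).Name)
		},
	})

	stop := make(chan struct{})
	defer close(stop)
	factory.Start(stop)
	cache.WaitForCacheSync(stop, podInformer.HasSynced)
	select {} // block forever; handlers fire as events arrive
}
```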
Now, since the watch cache doesn't support chunking (due to lack of history), informers make a full list call (without chunking) to the apiserver. This caused an issue we recently encountered, where multiple watchers trying to list at the same time made apiserver CPU/memory spike quite a bit:

[Screenshot: apiserver CPU/memory usage spike during concurrent relists]
And indeed such a scenario can happen if, e.g., the watch from apiserver to etcd doesn't receive an event before that resource version gets compacted from etcd:

E0824 09:43:50.201103 8 watcher.go:208] watch chan error: etcdserver: mvcc: required revision has been compacted
W0824 09:43:50.201169 8 reflector.go:256] storage/cacher.go:/pods: watch of *core.Pod ended with: The resourceVersion for the provided watch is too old.
W0824 09:43:51.201450 8 cacher.go:125] Terminating all watchers from cacher *core.Pod
As a result, when the apiserver terminated all watchers from the cacher, multiple clients relisted at the same time, causing the resource spike on the apiserver.
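[Editor's note: a minimal etcd clientv3 sketch of the failure-and-recovery loop described above - function and variable names are illustrative, not the actual cacher code.]

```go
package relist

import (
	"context"

	clientv3 "go.etcd.io/etcd/client/v3"
)

// watchWithRelist sketches the recovery path: watch from a known
// revision and, when etcd reports that revision as compacted, fall back
// to a fresh LIST and resume watching from the revision it returned.
func watchWithRelist(ctx context.Context, cli *clientv3.Client, prefix string, rev int64) error {
	for {
		wch := cli.Watch(ctx, prefix, clientv3.WithPrefix(), clientv3.WithRev(rev+1))
		for wresp := range wch {
			if wresp.CompactRevision != 0 {
				// "mvcc: required revision has been compacted": our
				// revision fell out of etcd's history, so relist. This
				// is the expensive step that spikes when many clients
				// hit it at once.
				resp, err := cli.Get(ctx, prefix, clientv3.WithPrefix())
				if err != nil {
					return err
				}
				rev = resp.Header.Revision
				// ...rebuild local state from resp.Kvs here...
				break // restart the watch from the fresh revision
			}
			for _, ev := range wresp.Events {
				rev = ev.Kv.ModRevision
				// ...apply the event to local state here...
			}
		}
		if err := ctx.Err(); err != nil {
			return err
		}
	}
}
```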
From our discussion, it seems that the apiserver watch falling out of etcd's window can currently happen in at least two scenarios:

- The set of objects being watched is not changing frequently enough, causing the latest RV to fall out of etcd's window. @wojtek-t - IIUC this issue will be solved with an etcd 3.4 feature you mentioned that's similar to our watch bookmark. Can you confirm?
- The watch cache is not able to handle the throughput of incoming events from etcd. From my understanding, this problem remains even after moving to etcd 3.4. So to avoid apiserver spikes in such cases, IMO we should reconsider the option of allowing informers to list from etcd with chunking (since we have it now) instead of from the watch cache. Per Wojtek's thoughts, it seems feasible to do this starting from etcd 3.4, as it has significant read concurrency improvements.
What would you like to be changed:

- Switch to etcd 3.4
- Allow "full" informers to list from etcd (with chunking) - see the sketch after this list
- Scale test it
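[Editor's note: for illustration, a sketch of what a chunked list could look like from a client-go consumer, using the Limit/Continue pagination the API already exposes for etcd-backed lists. The function name and chunk size of 500 are illustrative choices.]

```go
package chunked

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// listPodsChunked pulls the collection in fixed-size chunks instead of
// issuing one huge unpaginated LIST, bounding per-request server cost.
func listPodsChunked(ctx context.Context, cs kubernetes.Interface) ([]corev1.Pod, error) {
	var pods []corev1.Pod
	opts := metav1.ListOptions{Limit: 500}
	for {
		list, err := cs.CoreV1().Pods(metav1.NamespaceAll).List(ctx, opts)
		if err != nil {
			return nil, err
		}
		pods = append(pods, list.Items...)
		if list.Continue == "" { // no more chunks
			return pods, nil
		}
		// The continue token encodes where the next chunk starts.
		opts.Continue = list.Continue
	}
}
```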
cc @kubernetes/sig-scalability-feature-requests @wojtek-t @liggitt @smarterclayton