Speed-up LIST operations by serving them from memory #15945
Comments
Today, we don't necessarily enforce that guarantee in a cluster anyway, because we don't do quorum reads. We had proposed taking the resourceVersion as a minimum read before; I'm in favor of that.
Was this discussed somewhere on github? I didn't see that...
Very old: #846 (comment)
@wojtek-t and I discussed this IRL yesterday and I'm also in favor of accepting a resourceVersion on list. It means clients that care have to change, e.g. kubectl may want to cache the resource version it got on a POST for reuse in a subsequent LIST.
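To make the "clients that care" pattern concrete, here is a minimal Go sketch; `PodClient`, `createThenList`, and the tiny `Pod` type are hypothetical stand-ins, not a real client library API:

```go
package sketch

// Pod is a stand-in for a real API object.
type Pod struct {
	Name            string
	ResourceVersion string
}

// PodClient is a hypothetical client interface.
type PodClient interface {
	// Create POSTs the pod and returns the stored object, including its RV.
	Create(pod Pod) (Pod, error)
	// List returns pods at least as fresh as minResourceVersion.
	List(minResourceVersion string) ([]Pod, error)
}

// createThenList remembers the RV returned by the POST and passes it as the
// minimum RV on the subsequent LIST, so the list must reflect the write.
func createThenList(c PodClient, p Pod) ([]Pod, error) {
	created, err := c.Create(p)
	if err != nil {
		return nil, err
	}
	return c.List(created.ResourceVersion)
}
```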
@wojtek-t do you plan on fixing watches at the same time or will that be a second step? I think most of the benefit is quite possibly in fixing the number of times the watches invoke filterfuncs...
We might want to make clients stateful and observe this themselves.
Hah. I'm not the only one who remembers old issues. :-) I'm in favor of accepting resourceVersion on GET, as well, as you could guess from that comment. That would only help new, smart clients, however. An option not mentioned was implementing a full-blown cache coherence protocol, to invalidate on write and subsequently fetch the most recent copy of the resource.
@bgrant0607 that's the write-through cache that I proposed above. It sounds like you're also in favor of that. While the "minimum resource version on get" option is theoretically correct, it does impose quite a burden on less sophisticated client developers. We will definitely cause a lot of difficult-to-track-down bugs in clients which write and then read dumbly, no matter how loudly we tell developers not to do that.
That sounds quite complex--does a write on apiserver A invalidate the cache of apiserver B? It must, to be useful, but then we're replicating much of etcd's functionality.
I can see making a correct write-through cache on a single apiserver by just stopping the world until your write shows up in the watch from etcd. Simply adding the written object to the cache before seeing it in the watch means that you decouple the cache's state from etcd's state and sounds dangerous-but-possibly-fixable in the single apiserver case, and infeasible in the multiple apiserver case.
Seems like the "semantics" you are worried about changing already happen in an HA cluster? I think that is what clayton means in his comment about quorum=true (though even with quorum=true a read could still occasionally get stale data I think?) If so, then isn't option 1 the best choice?
Also, option 4: serialize operations and serve reads in batches from the in-memory snapshot; whenever a write arrives, wait for the snapshot to catch up with it before serving the next batch of reads.
This maintains the "semantics" in the case of a single apiserver, and in the case of multiple apiservers with session affinity (which I predict will become common for larger clusters). |
Option 4 does not require updating the cache bit by bit, and so may be much less bug-prone and easier to implement. Assuming read operations are 10x to 1000x more common than write operations, you still get huge batching benefits. Also, since many writes are going to be status updates, and since the semantics you want to preserve are less relevant for status writes, you could possibly modify the algorithm to handle "readonly and status-only-writing" operations in a single batch. That would increase the reuse frequency of the cache and indices.
@erictune I don't see how that works. How does apiserver know that a given read needs to see a given write?
In option 4, it is conservatively assumed that all reads need to see any write.
OK, so that's what I said above "I can see making a correct write-through cache on a single apiserver by just stopping the world until your write shows up in the watch from etcd" + go to etcd instead of stopping the world. The problem with going to etcd is that it reverts to n^2 behavior for a little bit after every write. Since the time for the cache to see the write should be measured in 100's of ms, I think it may be better to just block everyone until it shows up.
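A rough sketch of this "block until the cache catches up" idea, assuming a cache fed by watch events carrying monotonically increasing resource versions (all names here are illustrative, not the actual apiserver code):

```go
package sketch

import "sync"

// watchCache tracks the newest resource version observed on the etcd watch.
type watchCache struct {
	mu   sync.Mutex
	cond *sync.Cond
	rv   uint64
}

func newWatchCache() *watchCache {
	c := &watchCache{}
	c.cond = sync.NewCond(&c.mu)
	return c
}

// observe is called for every incoming watch event from etcd.
func (c *watchCache) observe(rv uint64) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if rv > c.rv {
		c.rv = rv
		c.cond.Broadcast() // wake reads blocked on freshness
	}
}

// waitUntilFresh blocks a read until the cache reflects at least rv;
// per the discussion, this should normally take 100's of ms at most.
func (c *watchCache) waitUntilFresh(rv uint64) {
	c.mu.Lock()
	defer c.mu.Unlock()
	for c.rv < rv {
		c.cond.Wait()
	}
}
```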
Okay, what @lavalamp said SGTM. |
Yuck. This still seems like an optimization for infrastructure components.
In addition to caching performed by the apiserver, caching is also performed outside it: in clients, and potentially in intermediate caches. We do have to make it possible for a client to DTRT in the presence of an arbitrary number of intermediate caches. So, we at least need the resourceVersion-based solution.
Obviously not in a single PR, but yes - I would treat it as part of the same effort.
I completely agree. So do I understand correctly that we want to do the resourceVersion-based solution first and then possibly implement something more difficult in the future (if needed)? I think that's reasonable.
@wojtek-t, SGTM.
When you say intermediate caches, which ones are you referring to?
This may affect deployment. Deployment needs to keep track of the number of available pods, scale rcs accordingly, and make sure the number of available pods is within a certain range (depends on deployment strategy).
@janetkuo - it will still be possible to list "the current" version from etcd (by setting parameters correctly), so it will not be a breaking change.
The initial version (without indexers) is very close to done. However, after some deeper thinking I would like to suggest slightly different semantics:
- resourceVersion unset: serve the list from etcd (the current behavior)
- resourceVersion="0": serve the list from the in-memory cache, at whatever version it currently has
- resourceVersion="X": serve the list from the in-memory cache, guaranteeing the result is at least as fresh as X
@lavalamp @smarterclayton @timothysc - any thoughts on it ^^
@wojtek-t I agree with your suggested semantics (actually I thought that's what we were going to do originally-- maybe I misunderstood). Another solution--possibly less invasive--is to make it a pointer so you can pass nil.
@lavalamp - I'm also fine with a pointer, but since in the whole system we are using "resourceVersion" as a string, it would be more consistent to pass it as a string (we will not need any transformations in that case).
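For illustration, a small sketch of keeping resourceVersion a string at the API boundary and hiding the numeric conversion inside the storage layer; `parseListRV` is a hypothetical helper, and the cases follow the semantics proposed above:

```go
package sketch

import (
	"fmt"
	"strconv"
)

// parseListRV interprets the resourceVersion string of a LIST request:
//
//	""  -> serve from etcd (today's behavior)
//	"0" -> serve from the cache at whatever version it currently has
//	"N" -> serve from the cache, at least as fresh as N
func parseListRV(rv string) (fromCache bool, minRV uint64, err error) {
	switch rv {
	case "":
		return false, 0, nil
	case "0":
		return true, 0, nil
	default:
		n, perr := strconv.ParseUint(rv, 10, 64)
		if perr != nil {
			return false, 0, fmt.Errorf("invalid resourceVersion %q: %v", rv, perr)
		}
		return true, n, nil
	}
}
```

Callers above the storage interface keep passing strings; only this one spot knows the cache compares versions numerically.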
By which you mean the transformations can be hidden behind the storage interface?
Yes - exactly
Thanks - I will prepare a PR tomorrow.
Yes, hiding RV transformation behind storage is right. Question on 2, though.
While I like this, I wonder at what point we can/should just be passing a constraint filter to the KV store itself. @xiang90 ^ SELECT resource FROM table WHERE (constraint/filter)
Range scans are part of the etcd 3 API design.
I'm afraid I didn't fully understand your question. [Once this is implemented] if you specify an RV in the LIST operation, you are guaranteed to get a response at least that fresh. However, you also get the exact RV from which the result is returned, so you can then start watching from exactly that point.
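A sketch of the list-then-watch pattern this enables; `Client`, `ListResult`, and `listAndWatch` are illustrative stand-ins, not a specific client library API:

```go
package sketch

// Object and Event are stand-ins for real API types.
type Object struct{ Name string }

type Event struct {
	Type   string // "ADDED", "MODIFIED", "DELETED"
	Object Object
}

// ListResult carries the exact RV the list was served at.
type ListResult struct {
	Items           []Object
	ResourceVersion string
}

// Client is a hypothetical client interface.
type Client interface {
	List(minResourceVersion string) (ListResult, error)
	Watch(fromResourceVersion string) (<-chan Event, error)
}

// listAndWatch lists, then resumes watching from the exact RV the list was
// served at, so no events are missed between the two calls.
func listAndWatch(c Client, handle func(Event)) error {
	list, err := c.List("0") // any cached version is fine as a starting point
	if err != nil {
		return err
	}
	// ... populate local state from list.Items here ...
	events, err := c.Watch(list.ResourceVersion)
	if err != nil {
		return err
	}
	for ev := range events {
		handle(ev)
	}
	return nil
}
```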
My original concern was a change that breaks naive clients. I was trying to make sure they won't be affected.
Yes - naive clients will be unaffected. Passing an old rv for a list will result in returning the ~current state (the guarantee that the result is at least as fresh as the rv passed is then satisfied).
How long do we block for? Timeout specified on the request? What error is returned on timeout?
Currently a specific timeout is not supported (other than the default http server timeouts, which is a generic mechanism IIUC). Do you think that's not enough?
I randomly made it 60 seconds. I think we need to be defensive about clients asking for RVs in the distant future (or from other collections).
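Building on the `watchCache` sketch above (and additionally importing "fmt" and "time"), the defensive deadline could look roughly like this; `waitUntilFreshWithDeadline` is a hypothetical name and the 60s value is the arbitrary one mentioned:

```go
// waitUntilFreshWithDeadline gives up after the deadline, so a bogus RV
// from the distant future (or from another collection) can't hang the
// request forever.
func (c *watchCache) waitUntilFreshWithDeadline(rv uint64, deadline time.Duration) error {
	// Wake waiters when the deadline passes, even if rv never shows up.
	timer := time.AfterFunc(deadline, c.cond.Broadcast)
	defer timer.Stop()
	start := time.Now()
	c.mu.Lock()
	defer c.mu.Unlock()
	for c.rv < rv {
		if time.Since(start) >= deadline {
			return fmt.Errorf("resource version %d not reached within %v", rv, deadline)
		}
		c.cond.Wait()
	}
	return nil
}
```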
@lavalamp - what do you mean by "or from other collections"?
E.g., reading from /pods, sending that RV to /services.
That shouldn't be a big deal - resource version is common for all resources (there is a single RV in etcd). I agree that technically it's incorrect, but that shouldn't cause problems in my opinion.
It does cause problems, though! See the other commit in #20433. Someone read from one table and passed that (big) RV into a list of services. But the service watch RV was stuck at the last service write, not the global number--which had since advanced--so apiserver hung on startup.
And anyway, clients need to treat it as separate, because we could swap out the storage, like we do with events.
I agree that clients need to treat it as separate - I just thought that it's not urgent...
Anything left to do here? @wojtek-t can we close this?
The missing thing here is the Indexer mechanism. But this is definitely not for 1.2 and I think we can create a separate issue for it.
#4817 already exists, we can reuse that. Thanks!
Background
Currently, to serve a LIST operation in apiserver we do the following:
- read all objects of the given type (e.g. all pods) from etcd
- deserialize all of them
- apply the requested filter (e.g. a label selector) to each of them
- return the objects that pass the filter
This means that the complexity of a LIST operation is proportional to the number of all objects of a given type, not to the number of objects returned to the user.
As an example, consider a 1000-node cluster with 30.000 pods running in it. If a user has a ReplicationController with, say, 10 replicas, then listing them requires reading all 30.000 pods from etcd, filtering them, and returning just the 10 which are interesting to the user.
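To illustrate where the cost goes, here is a hedged sketch of today's flow in miniature; `listPods`, `matches`, and the tiny `Pod` type are illustrative, not the real apiserver code:

```go
package sketch

// Pod is a stand-in for a real, much larger API object.
type Pod struct {
	Name   string
	Labels map[string]string
}

// listPods models today's flow: every object of the type is read from etcd
// and deserialized before the selector is applied, so a list that returns
// 10 pods still pays for all 30.000.
func listPods(rawPods [][]byte, decode func([]byte) Pod, selector map[string]string) []Pod {
	var result []Pod
	for _, raw := range rawPods { // O(all pods in the cluster)
		pod := decode(raw) // deserialization dominates the cost
		if matches(pod.Labels, selector) {
			result = append(result, pod)
		}
	}
	return result
}

// matches reports whether labels satisfy every key/value in selector.
func matches(labels, selector map[string]string) bool {
	for k, v := range selector {
		if labels[k] != v {
			return false
		}
	}
	return true
}
```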
Proposal
We would like to make the cost of a LIST operation proportional to the number of elements it returns as a result.
With the "Cacher" layer (used by watch in apiserver: #10475), we already store a copies of objects with a given type in apiserver so if we add an Indexer to it (which should be simple), we would be able to effectively serve list operations from it.
The problem is that the cache in apiserver is delayed (usually by tens to hundreds of milliseconds). So if we just start serving list operations from there, we will change the semantics - e.g. it can happen that if you POST an object to apiserver and call LIST immediately after, the returned list may not contain the just-POSTed object.
In many cases this doesn't really matter - for example, if LIST is only done to establish a point to start watching from, we get eventual consistency anyway (just the starting point changes). However, it can possibly break some other clients.
I can see 3 main options:
1. Accept the semantic change and just start serving LIST operations from the in-memory cache.
2. Make the cache write-through, so that a read following a write through the same apiserver is guaranteed to observe that write.
3. Allow passing a resourceVersion in the LIST request and guarantee that the result is at least that fresh, serving it from the cache once the cache has caught up.

[The third option is the best in my opinion.]
What do you think about it?
@kubernetes/goog-control-plane @kubernetes/goog-csi
@lavalamp @brendandburns @quinton-hoole @bgrant0607
@smarterclayton @derekwaynecarr