Speed-up LIST operations by serving them from memory #15945

Closed
wojtek-t opened this issue Oct 20, 2015 · 48 comments

Assignee: wojtek-t
Labels: priority/important-soon · sig/api-machinery · sig/scalability
Milestone: v1.2

Comments

@wojtek-t
Member

Background

Currently, to serve a LIST operation in the apiserver we do the following:

  • read all objects of a given resource from etcd
  • filter out objects that the user is not interested in
  • return the result

This means that the complexity of a LIST operation is proportional to the total number of objects of a given type, not to the number of objects returned to the user.

As an example, consider a 1000-node cluster with 30,000 pods running in it. If a user has a ReplicationController with, say, 10 replicas, then listing them requires reading all 30,000 pods from etcd, filtering them, and returning just the 10 that are interesting to the user.
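
To illustrate why, here is a rough sketch of the current behaviour (the types and the etcd read helper are hypothetical stand-ins, not the actual registry code):

```go
package main

import "fmt"

// Pod is a minimal stand-in for the real API object.
type Pod struct {
	Name   string
	Labels map[string]string
}

// readAllFromEtcd stands in for the full range read of /registry/pods;
// in a large cluster this can be tens of thousands of objects.
func readAllFromEtcd() []Pod {
	return []Pod{
		{Name: "frontend-1", Labels: map[string]string{"app": "frontend"}},
		{Name: "backend-1", Labels: map[string]string{"app": "backend"}},
	}
}

// listPods filters after reading everything, so the work done is proportional
// to the total number of pods, not to the number of pods returned.
func listPods(selector map[string]string) []Pod {
	var matched []Pod
	for _, p := range readAllFromEtcd() { // O(total pods)
		ok := true
		for k, v := range selector {
			if p.Labels[k] != v {
				ok = false
				break
			}
		}
		if ok {
			matched = append(matched, p)
		}
	}
	return matched // possibly only a handful of objects
}

func main() {
	fmt.Println(listPods(map[string]string{"app": "frontend"}))
}
```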

Proposal

We would like to make the cost of a LIST operation proportional to the number of elements it returns.

With the "Cacher" layer (used by watch in apiserver: #10475), we already store a copies of objects with a given type in apiserver so if we add an Indexer to it (which should be simple), we would be able to effectively serve list operations from it.

The problem is that the cache in the apiserver is delayed (usually by tens to hundreds of milliseconds). So if we just start serving LIST operations from it, we will change the semantics - e.g. if you POST an object to the apiserver and call LIST immediately afterwards, the returned list may not contain the already-POSTed object.

In many cases this doesn't really matter - for example, if the LIST is only done to start watching from that point, we get eventual consistency anyway (just the starting point changes). However, it can possibly break other clients.

I can see 3 main options:

  • just change the semantics
  • add an option (parameter) to the LIST operation to force listing from the database; however, in this case listing from the database should probably be switched off by default, so that we take advantage of the cache wherever possible
  • change the LIST operation to take a ResourceVersion parameter and ensure that the result we return is no older than the given ResourceVersion (however, to make it efficient, we don't give any guarantee on how fresh the result is - we only promise that it is not older than the given ResourceVersion).

[The third option is the best in my opinion].
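
As a rough sketch of how the third option could be served from the watch cache (the cache structure and helpers below are made up for illustration; in particular the polling loop stands in for proper condition-variable signalling):

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// watchCache is an illustrative in-memory mirror of one resource type.
// An etcd watch (not shown) is assumed to call applyEvent for every change.
type watchCache struct {
	mu              sync.RWMutex
	resourceVersion uint64            // RV of the last event applied to the cache
	objects         map[string]string // key -> serialized object
}

// applyEvent would be called by the (omitted) etcd watch loop.
func (c *watchCache) applyEvent(key, obj string, rv uint64) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.objects[key] = obj
	c.resourceVersion = rv
}

// list serves a LIST once the cache is at least as fresh as minRV.
// The result is only guaranteed to be "not older than minRV", not current.
func (c *watchCache) list(minRV uint64, timeout time.Duration) (map[string]string, uint64, error) {
	deadline := time.Now().Add(timeout)
	for {
		c.mu.RLock()
		cur := c.resourceVersion
		if cur >= minRV {
			out := make(map[string]string, len(c.objects))
			for k, v := range c.objects {
				out[k] = v
			}
			c.mu.RUnlock()
			return out, cur, nil
		}
		c.mu.RUnlock()
		if time.Now().After(deadline) {
			return nil, 0, fmt.Errorf("cache too old: have %d, want at least %d", cur, minRV)
		}
		time.Sleep(10 * time.Millisecond) // a real implementation would use condition-variable signalling
	}
}

func main() {
	c := &watchCache{objects: map[string]string{}}
	c.applyEvent("default/pod-1", "{...}", 42)
	fmt.Println(c.list(40, time.Second)) // fresh enough: served from the cache
}
```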

What do you think about it?

@kubernetes/goog-control-plane @kubernetes/goog-csi
@lavalamp @brendandburns @quinton-hoole @bgrant0607
@smarterclayton @derekwaynecarr

@wojtek-t wojtek-t added sig/scalability Categorizes an issue or PR as relevant to SIG Scalability. sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. labels Oct 20, 2015
@wojtek-t wojtek-t added this to the v1.2 milestone Oct 20, 2015
@wojtek-t wojtek-t self-assigned this Oct 20, 2015
@smarterclayton
Contributor

Today, we don't enforce that guarantee necessarily in a cluster because we don't specify quorum=true reads. This will make it more likely though.

We had proposed taking the resourceVersion as minimum read before, I'm in favor of that as well.

@wojtek-t
Member Author

We had proposed taking the resourceVersion as minimum read before, I'm in favor of that as well.

Was this discussed somewhere on github? I didn't see that...

@smarterclayton
Contributor

smarterclayton commented Oct 20, 2015 via email

@lavalamp
Member

@wojtek-t and I discussed this IRL yesterday and I'm also in favor of accepting a resourceVersion on list. It means clients that care have to change, e.g. kubectl may want to cache the resource version it got on a POST for reuse in a subsequent LIST.

@lavalamp
Member

@wojtek-t do you plan on fixing watches at the same time or will that be a second step? I think most of the benefit is quite possibly in fixing the number of times the watches invoke filterfuncs...

@smarterclayton
Contributor

We might want to make clients stateful and observe this themselves.


@bgrant0607
Member

Hah. I'm not the only one who remembers old issues. :-)

I'm in favor of accepting resourceVersion on GET, as well, as you could guess from that comment. That would only help new, smart clients, however.

An option not mentioned was implementing a full-blown cache coherence protocol, to invalidate on write and subsequently fetch the most recent copy of the resource.

@ghost

ghost commented Oct 20, 2015

@bgrant0607 that's the write-through cache that I proposed above. It sounds like you're also in favor of that.

While the "minimum resource version on get" option is theoretically correct, it does impose quite a burden on less sophisticated client developers. We will definitely cause a lot of difficult to track down bugs in clients which write and then read dumbly, no matter how loudly we tell developers not to do that.

@lavalamp
Member

An option not mentioned was implementing a full-blown cache coherence protocol, to invalidate on write and subsequently fetch the most recent copy of the resource.

That sounds quite complex-- does a write on apiserver A invalidate the cache of apiserver B? It must, to be useful, but then we're replicating much of etcd's functionality.

@lavalamp
Member

I can see making a correct write-through cache on a single apiserver by just stopping the world until your write shows up in the watch from etcd.

Simply adding the written object to the cache before seeing it in the watch means that you decouple the cache's state from etcd's state and sounds dangerous-but-possibly-fixable in the single apiserver case, and infeasible in the multiple apiserver case.

@erictune
Member

Seems like the "semantics" you are worried about changing already happen in an HA cluster? I think that is what clayton means in his comment about quorum=true (though even with quorum=true a read could still occasionally get stale data I think?)

If so, then isn't option 1 the best choice?

@erictune
Member

Also, option 4:
Add indexer to cache. Within a single apiserver, queue up pending requests, then do the following:

for ever:
   refresh cache and indexes
   while (head of queue is readonly op):
     pop(head) and handle op using cache/index
   while (head of queue is read/write op):
     pop(head) and handle op using direct etcd access.

This maintains the "semantics" in the case of a single apiserver, and in the case of multiple apiservers with session affinity (which I predict will become common for larger clusters).

@erictune
Member

Option 4 does not require updating the cache bit by bit, and so may be much less bug-prone and easier to implement.

Assuming read operations are 10x to 1000x more common than write operations, you still get huge batching benefits.

Also, since many writes are going to be status updates, and since the semantics you want to preserve are less relevant for status writes, you could possibly modify the algorithm to handle "readonly and status-only-writing" operations in a single batch. That would increase the reuse frequency of the cache and indices.

@lavalamp
Member

@erictune I don't see how that works. How does apiserver know that a given list needs to see a given write?


@erictune
Member

Option 4 conservatively assumes that all reads need to see any write.

@lavalamp
Member

OK, so that's what I said above "I can see making a correct write-through cache on a single apiserver by just stopping the world until your write shows up in the watch from etcd" + go to etcd instead of stopping the world.

The problem with going to etcd is that it reverts to n^2 behavior for a little bit after every write. Since the time for the cache to see the write should be measured in 100's of ms, I think it may be better to just block everyone until it shows up.

@erictune
Member

Okay, what @lavalamp said SGTM.

@smarterclayton
Contributor

Yuck. This still seems like an optimization for infrastructure components and advanced clients, so forcing clients to be smarter is 100% ok by me. No one runs list all namespaces who isn't a list watcher and a reflector. I'd prefer to optimize smart clients first.


@bgrant0607
Member

In addition to caching performed by the apiserver, caching is also performed outside it, in clients, but potentially also in intermediate caches.

We do have to make it possible for a client to DTRT in the presence of an arbitrary number of intermediate caches. So, we at least need the resourceVersion-based solution.

@wojtek-t
Member Author

@wojtek-t do you plan on fixing watches at the same time or will that be a second step? I think most of the benefit is quite possibly in fixing the number of times the watches invoke filterfuncs...

Obviously not in a single PR, but yes - I would treat it as part of the same effort.

I can see making a correct write-through cache on a single apiserver by just stopping the world until your write shows up in the watch from etcd.

Simply adding the written object to the cache before seeing it in the watch means that you decouple the cache's state from etcd's state and sounds dangerous-but-possibly-fixable in the single apiserver case, and infeasible in the multiple apiserver case.

I completely agree.

So do I understand correctly that we want to do the resourceVersion-based solution first and then possibly implement something more difficult in the future (if needed)? I think that's reasonable.

@lavalamp
Member

@wojtek-t, SGTM


@smarterclayton
Contributor

When you say intermediate caches, which ones are you referring to? Anyone using an intermediate cache is probably doing so for resilience or performance, and is by definition a sophisticated client. The vast majority of clients have no sophistication. If we accept multiple apiservers, DTRT for unsophisticated clients is slow but correct. DTRT for sophisticated clients is resource version tracking with resource version preconditions. What else do we need?


@lavalamp lavalamp added the priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. label Oct 29, 2015
@janetkuo
Member

janetkuo commented Dec 3, 2015

This may affect Deployments. A Deployment needs to keep track of the number of available pods, scale RCs accordingly, and make sure the number of available pods stays within a certain range (depending on the deployment strategy).

@wojtek-t
Member Author

wojtek-t commented Dec 3, 2015

@janetkuo - it will still be possible to list "the current" version from etcd (by setting the parameters correctly). So it will not be a breaking change.

@wojtek-t
Member Author

wojtek-t commented Dec 3, 2015

The initial version (without indexers) is very close to being done.

However, after some deeper thinking, I would like to suggest slightly different semantics:

  1. if "resourceVersion" parameter is NOT specified for the request (today this is the case for all requests), we still serve this request from etcd

  2. if "resourceVersion" parameter is specified for the request - than we only guarantee that the returned result is at least that fresh

    This makes this mechanism backward compatible (which I think is a huge advantage).

    However, to do it, we need to be able to distinguish set & not-set "resourceVersion". And to do it, I would need to change the "storage.Interface":
    https://github.com/kubernetes/kubernetes/blob/master/pkg/storage/interfaces.go#L74
    so that resourceVersion is passed as string there (not uin64).

    If we do that - it's easy to distinguish empty and non-empty string (it doesn't work for uint64, because it should be fine to set "0" as resourceVersion if we don't need any guarantees).
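
For illustration, a simplified sketch of the shape this could take (not the real pkg/storage interface - just the part relevant here):

```go
// Package storage sketches the relevant slice of the proposed interface;
// it is a simplification, not the actual pkg/storage code.
package storage

import "context"

// Interface shows only the List signature under discussion.
type Interface interface {
	// List retrieves objects under the given key prefix into listObj.
	//
	// resourceVersion is a string so that "unset" can be distinguished from "0":
	//   ""     -> serve directly from etcd (today's behaviour, backward compatible)
	//   "0"    -> no freshness guarantee needed; any cached result is acceptable
	//   "<rv>" -> the result must be at least as fresh as <rv>, so it can be
	//             served from the watch cache once the cache catches up
	List(ctx context.Context, key string, resourceVersion string, listObj interface{}) error
}
```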

@lavalamp @smarterclayton @timothysc - any thoughts on it ^^

@lavalamp
Member

lavalamp commented Dec 3, 2015

@wojtek-t I agree with your suggested semantics (actually I thought that's what we were going to do originally-- maybe I misunderstood).

Another solution--possibly less invasive--is to make it a pointer so you can pass nil.

@wojtek-t
Member Author

wojtek-t commented Dec 3, 2015

@lavalamp - I'm also fine with a pointer, but since in the whole system we are using "resourceVersion" as a string - see e.g.
https://github.com/kubernetes/kubernetes/blob/master/pkg/api/types.go#L96

it would be more consistent to pass it as a string (we will not need any transformations in that case).

@lavalamp
Member

lavalamp commented Dec 3, 2015

By which you mean the transformations can be hidden behind the storage interface? I think I'm fine with that.


@wojtek-t
Member Author

wojtek-t commented Dec 3, 2015

By which you mean the transformations can be hidden behind the storage interface?

Yes - exactly

I think I'm fine with that.

Thanks - I will prepare a PR tomorrow.

@smarterclayton
Contributor

Yes, hiding RV transformation behind storage is right.

Question on 2:

  • as a controller doing list watch - if I list, I don't specify RV. I get the latest. If I watch from that RV it's extremely likely to be current (as close as possible)
  • as a sophisticated client, if I specify RV, I get RV or newer or an error?


@timothysc
Member

While I like this, I wonder at what point we can/should just pass a constraint filter to the KV store itself.

@xiang90 ^ SELECT resource FROM table WHERE (constraint/filter)

@smarterclayton
Contributor

Range scans are part of the etcd 3 API design.


@wojtek-t
Member Author

wojtek-t commented Dec 4, 2015

@smarterclayton

  • as a controller doing list watch - If I list, I don't specify RV. I get the latest. If I watch from that RV it's extremely likely to be current (as close as possible)
  • as a sophisticated client, if I specify RV, I get RV or newer or an error?

I'm afraid I didn't fully understand your question.

[Once this is implemented] if you specify an RV in the LIST operation, you are guaranteed to get a response at least that fresh. However, you also get the exact RV at which the result was produced, so you can then start watching from exactly that point.
Does that answer your question, or did I misunderstand it?
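
For example, with a modern client-go (which postdates this thread, so the exact API is illustrative rather than what existed in 2015), the list-then-watch pattern looks roughly like this:

```go
// Package example sketches the list-then-watch pattern with a client-go style
// API; clientset construction is omitted for brevity.
package example

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// listThenWatch asks for a list that is at least as fresh as minRV, then
// starts a watch from exactly the resourceVersion the list was served at.
func listThenWatch(ctx context.Context, cs kubernetes.Interface, ns, minRV string) error {
	list, err := cs.CoreV1().Pods(ns).List(ctx, metav1.ListOptions{
		ResourceVersion: minRV, // "not older than minRV"; "" would force the freshest read
	})
	if err != nil {
		return err
	}
	fmt.Printf("listed %d pods at resourceVersion %s\n", len(list.Items), list.ResourceVersion)

	// Watch from the exact point the list reflects, so no events are missed.
	w, err := cs.CoreV1().Pods(ns).Watch(ctx, metav1.ListOptions{
		ResourceVersion: list.ResourceVersion,
	})
	if err != nil {
		return err
	}
	defer w.Stop()
	for ev := range w.ResultChan() {
		fmt.Println("event:", ev.Type)
	}
	return nil
}
```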

@smarterclayton
Contributor

My original concern was a change that breaks naive clients. I was trying to validate that naive clients (no rv) are unaffected. The question about error was for expired rv versions or rv versions in the future for List.


@wojtek-t
Member Author

wojtek-t commented Dec 4, 2015

Yes - naive clients will be unaffected.

Passing an rv for a list that is old will result in returning the ~current state (the guarantee that the result is at least as fresh as the rv passed is then satisfied).
Passing an rv that is in the future will simply block the LIST call until then.

@smarterclayton
Contributor

How long do we block for? Timeout specified on request? What error is returned on timeout (our server timeout error)?


@wojtek-t
Member Author

wojtek-t commented Dec 7, 2015

How long do we block for? Timeout specified on request? What error is returned on timeout (our server timeout error)?

Currently a specific timeout is not supported (other than the default HTTP server timeouts, which are a generic mechanism IIUC). Do you think that's not enough?

@wojtek-t
Member Author

wojtek-t commented Feb 3, 2016

Actually, @lavalamp added a timeout in #20433

@lavalamp
Member

lavalamp commented Feb 3, 2016

I randomly made it 60 seconds; I think we need to be defensive about clients asking for RVs in the distant future (or from other collections).

@wojtek-t
Member Author

wojtek-t commented Feb 4, 2016

@lavalamp - what do you mean by "or from other collections" ?

@lavalamp
Member

lavalamp commented Feb 4, 2016

E.g., reading from /pods, sending that RV to /services.

@wojtek-t
Member Author

wojtek-t commented Feb 5, 2016

That shouldn't be a big deal - the resource version is common to all resources (there is a single RV in etcd). I agree that technically it's incorrect, but it shouldn't cause problems in my opinion.

@lavalamp
Member

lavalamp commented Feb 5, 2016

It does cause problems, though! See the other commit in #20433. Someone read from one table and passed that (big) RV into a list of services. But the service watch RV was stuck at the last service write, not the global number--which had since advanced--so the apiserver hung on startup.

@lavalamp
Member

lavalamp commented Feb 5, 2016

And anyway, clients need to treat it as separate, because we could swap out the storage, like we do with events.

@wojtek-t
Member Author

wojtek-t commented Feb 5, 2016

I agree that clients need to treat it as separate - I just thought that it's not urgent...

@lavalamp
Member

lavalamp commented Feb 8, 2016

Anything left to do here? @wojtek-t can we close this?

@wojtek-t
Member Author

wojtek-t commented Feb 9, 2016

The missing thing here is the Indexer mechanism. But this is definitely not for 1.2 and I think we can create a separate issue for it.

@lavalamp
Member

#4817 already exists, we can reuse that. Thanks!
