Speed-up LIST operations by serving them from memory #15945

Closed
wojtek-t opened this Issue Oct 20, 2015 · 48 comments

@wojtek-t
Member

wojtek-t commented Oct 20, 2015

Background

Currently, to serve a LIST operation in the apiserver we do the following:

  • read all objects of a given resource from etcd
  • filter out objects that user is not interested in
  • return the result

This means that the complexity of a LIST operation is proportional to the total number of objects of a given type, not to the number of objects returned to the user.

As an example, consider a 1000-node cluster with 30,000 pods running in it. If a user has a ReplicationController with, say, 10 replicas, then listing them requires reading all 30,000 pods from etcd, filtering them, and returning just the 10 that are interesting to the user.

Proposal

We would like to make the cost of the LIST operation proportional to the number of elements it returns as a result.

With the "Cacher" layer (used by watch in apiserver: #10475), we already store copies of objects of a given type in the apiserver, so if we add an Indexer to it (which should be simple), we would be able to serve list operations from it efficiently.

The problem is that the cache in the apiserver is delayed (usually by tens to hundreds of milliseconds). So if we simply start serving list operations from there, we will change the semantics - e.g. it can happen that if you POST an object to the apiserver and call LIST immediately afterwards, the returned list may not contain the just-POSTed object.

In many cases this doesn't really matter - for example, if LIST is only done to start watching from that point, we get eventual consistency anyway (just the starting point changes). However, it could break some other clients.

I can see 3 main options:

  • just change the semantics
  • add an option (parameter) to the LIST operation to force listing from the database; however, in this case we should probably make listing from the database switched off by default, so that we take advantage of the cache wherever possible
  • change the LIST operation to take a ResourceVersion parameter and ensure that the result we return is no older than the given ResourceVersion (however, to make it efficient, we don't give any guarantees on how fresh the result is - we only promise to return a result not older than the given ResourceVersion).

[The third option is the best in my opinion].

What do you think about it?

@kubernetes/goog-control-plane @kubernetes/goog-csi
@lavalamp @brendandburns @quinton-hoole @bgrant0607
@smarterclayton @derekwaynecarr

@wojtek-t wojtek-t added this to the v1.2 milestone Oct 20, 2015

@wojtek-t wojtek-t self-assigned this Oct 20, 2015

@smarterclayton


smarterclayton Oct 20, 2015

Contributor

Today, we don't necessarily enforce that guarantee in a cluster, because we don't specify quorum=true reads. This will make it more likely though.

We had proposed taking the resourceVersion as a minimum read before; I'm in favor of that as well.


@wojtek-t


wojtek-t Oct 20, 2015

Member

We had proposed taking the resourceVersion as minimum read before, I'm in
favor of that as well.

Was this discussed somewhere on GitHub? I didn't see that...


@smarterclayton

Contributor

smarterclayton commented Oct 20, 2015

@lavalamp


lavalamp Oct 20, 2015

Member

@wojtek-t and I discussed this IRL yesterday and I'm also in favor of accepting a resourceVersion on list. It means clients that care have to change, e.g. kubectl may want to cache the resource version it got on a POST for reuse in a subsequent LIST.


@lavalamp


lavalamp Oct 20, 2015

Member

@wojtek-t do you plan on fixing watches at the same time or will that be a second step? I think most of the benefit is quite possibly in fixing the number of times the watches invoke filterfuncs...


@smarterclayton


smarterclayton Oct 20, 2015

Contributor

We might want to make clients stateful and observe this themselves.


@bgrant0607


bgrant0607 Oct 20, 2015

Member

Hah. I'm not the only one who remembers old issues. :-)

I'm in favor of accepting resourceVersion on GET, as well, as you could guess from that comment. That would only help new, smart clients, however.

An option not mentioned was implementing a full-blown cache coherence protocol, to invalidate on write and subsequently fetch the most recent copy of the resource.


@quinton-hoole


quinton-hoole Oct 20, 2015

Member

@bgrant0607 that's the write-through cache that I proposed above. It sounds like you're also in favor of that.

While the "minimum resource version on get" option is theoretically correct, it does impose quite a burden on less sophisticated client developers. We will definitely cause a lot of difficult-to-track-down bugs in clients which write and then read dumbly, no matter how loudly we tell developers not to do that.


@lavalamp


lavalamp Oct 20, 2015

Member

An option not mentioned was implementing a full-blown cache coherence protocol, to invalidate on write and subsequently fetch the most recent copy of the resource.

That sounds quite complex-- does a write on apiserver A invalidate the cache of apiserver B? It must, to be useful, but then we're replicating much of etcd's functionality.


@lavalamp


lavalamp Oct 20, 2015

Member

I can see making a correct write-through cache on a single apiserver by just stopping the world until your write shows up in the watch from etcd.

Simply adding the written object to the cache before seeing it in the watch means that you decouple the cache's state from etcd's state and sounds dangerous-but-possibly-fixable in the single apiserver case, and infeasible in the multiple apiserver case.


@erictune


erictune Oct 20, 2015

Member

Seems like the "semantics" you are worried about changing already happen in an HA cluster? I think that is what Clayton means in his comment about quorum=true (though even with quorum=true a read could still occasionally get stale data, I think?)

If so, then isn't option 1 the best choice?


@erictune


erictune Oct 20, 2015

Member

Also, option 4:
Add indexer to cache. Within a single apiserver, queue up pending requests, then do the following:

forever:
   refresh cache and indexes
   while (head of queue is readonly op):
     pop(head) and handle op using cache/index
   while (head of queue is read/write op):
     pop(head) and handle op using direct etcd access

This maintains the "semantics" in the case of a single apiserver, and in the case of multiple apiservers with session affinity (which I predict will become common for larger clusters).


@erictune


erictune Oct 20, 2015

Member

Option 4 does not require updating the cache bit by bit, and so may be much less bug-prone and easier to implement.

Assuming read operations are 10x to 1000x more common than write operations, then you still get huge batching benefits.

Also, since many writes are going to be status updates, and since the semantics you want to preserve are less relevant for status writes, you could possibly modify the algorithm to handle "readonly and status-only-writing" operations in a single batch. That would increase the reuse frequency of the cache and indices.


@lavalamp


lavalamp Oct 20, 2015

Member

@erictune I don't see how that works. How does apiserver know that a given list needs to see a given write?


@erictune


erictune Oct 20, 2015

Member

In option 4, it conservatively assumes that all reads need to see every write.


@lavalamp


lavalamp Oct 20, 2015

Member

OK, so that's what I said above "I can see making a correct write-through cache on a single apiserver by just stopping the world until your write shows up in the watch from etcd" + go to etcd instead of stopping the world.

The problem with going to etcd is that it reverts to n^2 behavior for a little bit after every write. Since the time for the cache to see the write should be measured in 100's of ms, I think it may be better to just block everyone until it shows up.


@erictune


erictune Oct 20, 2015

Member

Okay, what @lavalamp said SGTM.


@smarterclayton


smarterclayton Oct 20, 2015

Contributor

Yuck. This still seems like an optimization for infrastructure components and advanced clients, so forcing clients to be smarter is 100% OK by me. No one who isn't a list-watcher and a reflector runs a list across all namespaces. I'd prefer to optimize smart clients first.


@bgrant0607


bgrant0607 Oct 21, 2015

Member

In addition to caching performed by the apiserver, caching is also performed outside it - in clients, and potentially also in intermediate caches.

We do have to make it possible for a client to DTRT in the presence of an arbitrary number of intermediate caches. So, we at least need the resourceVersion-based solution.


@wojtek-t


wojtek-t Oct 21, 2015

Member

@wojtek-t do you plan on fixing watches at the same time or will that be a second step? I think most of the benefit is quite possibly in fixing the number of times the watches invoke filterfuncs...

Obviously not in a single PR, but yes - I would treat it as part of the same effort.

I can see making a correct write-through cache on a single apiserver by just stopping the world until your write shows up in the watch from etcd.

Simply adding the written object to the cache before seeing it in the watch means that you decouple the cache's state from etcd's state and sounds dangerous-but-possibly-fixable in the single apiserver case, and infeasible in the multiple apiserver case.

I completely agree.

So do I understand correctly that we want to do the resourceVersion-based solution first, and then possibly implement something more difficult in the future (if needed)? I think that's reasonable.


@lavalamp


lavalamp Oct 21, 2015

Member

@wojtek-t, SGTM


@smarterclayton


smarterclayton Oct 21, 2015

Contributor

When you say intermediate caches, which ones are you referring to? Anyone using an intermediate cache is probably doing so for resilience or performance, and is by definition a sophisticated client. The vast majority of clients have no sophistication. If we accept multiple apiservers, DTRT for unsophisticated clients is slow but correct. DTRT for sophisticated clients is resource version tracking with resource version preconditions. What else do we need?


Member

janetkuo commented Dec 3, 2015

This may affect deployment. Deployment needs to keep track of the number of available pods, scale rcs accordingly, and make sure the number of available pods is within a certain range (depends on deployment strategy).

Member

wojtek-t commented Dec 3, 2015

@janetkuo - it will still be possible to list "the current" version from etcd (by setting parameters correctly), so it will not be a breaking change.

Member

wojtek-t commented Dec 3, 2015

The initial version (without indexers) is very close to done.

However, after some deeper thinking I would like to suggest slightly different semantics:

  1. if "resourceVersion" parameter is NOT specified for the request (today this is the case for all requests), we still serve this request from etcd

  2. if "resourceVersion" parameter is specified for the request - then we only guarantee that the returned result is at least that fresh

    This makes this mechanism backward compatible (which I think is a huge advantage).

    However, to do it, we need to be able to distinguish set & not-set "resourceVersion". And to do that, I would need to change the "storage.Interface":
    https://github.com/kubernetes/kubernetes/blob/master/pkg/storage/interfaces.go#L74
    so that resourceVersion is passed as a string there (not uint64).

    If we do that - it's easy to distinguish empty and non-empty string (it doesn't work for uint64, because it should be fine to set "0" as resourceVersion if we don't need any guarantees).

@lavalamp @smarterclayton @timothysc - any thoughts on it ^^
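A minimal sketch of the proposed dispatch, under the semantics above. The function name and return values are illustrative assumptions, not the actual apiserver code; the point is only that an unset `resourceVersion` keeps today's etcd read, while any set value (including `"0"`) may be answered from the watch cache:

```go
package main

import "fmt"

// serveListFrom is a hypothetical helper illustrating the proposed
// semantics:
//   - resourceVersion unset ("")     -> read directly from etcd
//     (today's behavior, so naive clients are unaffected)
//   - resourceVersion set (even "0") -> serve from the in-memory cache
//     once it has caught up to at least that version
func serveListFrom(resourceVersion string) string {
	if resourceVersion == "" {
		// Backward compatible: no RV means the client gets the
		// current state from etcd.
		return "etcd"
	}
	// Any set RV (including "0", i.e. "no freshness requirement") can
	// be answered from the cache at version >= RV.
	return "cache"
}

func main() {
	fmt.Println(serveListFrom(""))  // etcd
	fmt.Println(serveListFrom("0")) // cache
}
```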

Member

lavalamp commented Dec 3, 2015

@wojtek-t I agree with your suggested semantics (actually I thought that's what we were going to do originally-- maybe I misunderstood).

Another solution--possibly less invasive--is to make it a pointer so you can pass nil.

Member

wojtek-t commented Dec 3, 2015

@lavalamp - I'm also fine with pointer, but since in the whole system we are using "resourceVersion" as string - see e.g.
https://github.com/kubernetes/kubernetes/blob/master/pkg/api/types.go#L96

it would be more consistent to pass it as string (we will not need any transformations in that case).

Member

lavalamp commented Dec 3, 2015

By which you mean the transformations can be hidden behind the storage
interface? I think I'm fine with that.

Member

wojtek-t commented Dec 3, 2015

By which you mean the transformations can be hidden behind the storage
interface?

Yes - exactly

I think I'm fine with that.

Thanks - I will prepare a PR tomorrow.
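The agreed convention - keep `resourceVersion` a string at the `storage.Interface` boundary and hide the numeric conversion inside the storage layer - might look roughly like this. `parseResourceVersion` is a hypothetical helper, not the actual code:

```go
package main

import (
	"fmt"
	"strconv"
)

// parseResourceVersion (illustrative) converts the string form used
// throughout the API into the numeric form the storage layer needs,
// while keeping "" (unset) distinguishable from "0" (set, but no
// freshness guarantee requested).
func parseResourceVersion(rv string) (version uint64, set bool, err error) {
	if rv == "" {
		// Unset: caller should fall back to reading from etcd.
		return 0, false, nil
	}
	v, err := strconv.ParseUint(rv, 10, 64)
	if err != nil {
		return 0, true, fmt.Errorf("invalid resourceVersion %q: %v", rv, err)
	}
	// "0" parses to an explicit version of 0, distinct from unset.
	return v, true, nil
}

func main() {
	v, set, _ := parseResourceVersion("42")
	fmt.Println(v, set) // 42 true
}
```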

Contributor

smarterclayton commented Dec 3, 2015

Yes, hiding RV transformation behind storage is right.

Question on 2

  • as a controller doing list watch - If I list, I don't specify RV. I get
    the latest. If I watch from that RV it's extremely likely to be current
    (as close as possible)
  • as a sophisticated client, if I specify RV, I get RV or newer or an error?


Member

timothysc commented Dec 3, 2015

While I like this, I wonder at what point we can/should just pass a constraint filter to the KV store itself.

@xiang90 ^ SELECT resource FROM table WHERE (constraint/filter)

Contributor

smarterclayton commented Dec 3, 2015

Range scans are part of the etcd 3 API design.


Member

wojtek-t commented Dec 4, 2015

@smarterclayton

  • as a controller doing list watch - If I list, I don't specify RV. I get
    the latest. If I watch from that RV it's extremely likely to be current
    (as close as possible)
  • as a sophisticated client, if I specify RV, I get RV or newer or an error?

I'm afraid I didn't fully understand your question.

[Once this is implemented] if you specify RV in the LIST operation, you are guaranteed to get a response that is at least that fresh. You also get the exact RV at which the result was generated, so you can then start watching from exactly that point.
Does that answer your question, or did I misunderstand it?
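The list-then-watch pattern this guarantee enables can be sketched as follows. `ListResult` and the function signatures are hypothetical stand-ins, not a real client API; the key idea is that the list reports the exact RV it was served at, so the watch starts with no gap:

```go
package main

import "fmt"

// ListResult is an illustrative stand-in for a list response.
type ListResult struct {
	ResourceVersion string // exact RV the result was generated at
	Items           []string
}

// listThenWatch lists with a freshness floor and then starts the watch
// at exactly the RV the list reported, so no events are missed or
// replayed between the listed state and the first watch event.
func listThenWatch(list func(minRV string) ListResult, watch func(fromRV string)) string {
	res := list("100") // "at least as fresh as RV 100"
	watch(res.ResourceVersion)
	return res.ResourceVersion
}

func main() {
	// A fake list that happened to be served at RV 105 (>= 100).
	fakeList := func(minRV string) ListResult {
		return ListResult{ResourceVersion: "105", Items: []string{"pod-a"}}
	}
	fmt.Println(listThenWatch(fakeList, func(fromRV string) {})) // 105
}
```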

Contributor

smarterclayton commented Dec 4, 2015

My original concern was a change that breaks naive clients. I was trying
to validate that naive clients (no rv) are unaffected. The question about
error was for expired rv versions or rv versions in the future for List.


Member

wojtek-t commented Dec 4, 2015

Yes - naive clients will be unaffected.

Passing an old rv to list will result in returning the ~current state (the guarantee that the result is at least as fresh as the rv passed is then satisfied).
Passing an rv that is in the future will simply block the LIST call until then.

Contributor

smarterclayton commented Dec 4, 2015

How long do we block for? Timeout specified on request? What error is
returned on timeout (our server timeout error)?


Member

wojtek-t commented Dec 7, 2015

How long do we block for? Timeout specified on request? What error is
returned on timeout (our server timeout error)?

Currently a request-specific timeout is not supported (other than the default HTTP server timeouts, which are a generic mechanism IIUC). Do you think that's not enough?

Member

wojtek-t commented Feb 3, 2016

Actually, @lavalamp added timeout in #20433

Member

lavalamp commented Feb 3, 2016

I somewhat arbitrarily made it 60 seconds; I think we need to be defensive about clients asking for RVs in the distant future (or from other collections).

Member

wojtek-t commented Feb 4, 2016

@lavalamp - what do you mean by "or from other collections" ?

Member

lavalamp commented Feb 4, 2016

E.g., reading from /pods, sending that RV to /services.

Member

wojtek-t commented Feb 5, 2016

That shouldn't be a big deal - the resource version is shared across all resources (there is a single RV counter in etcd). I agree that technically it's incorrect, but in my opinion it shouldn't cause problems.

Member

lavalamp commented Feb 5, 2016

It does cause problems, though! See the other commit in #20433. Someone read from one table and passed that (big) RV into a list of services. But the service watch RV was stuck at the last service write, not the global number--which had since advanced--so the apiserver hung on startup.

Member

lavalamp commented Feb 5, 2016

And anyway, clients need to treat it as separate, because we could swap out the storage, like we do with events.

Member

wojtek-t commented Feb 5, 2016

I agree that clients need to treat it as separate - I just thought that it's not urgent...

Member

lavalamp commented Feb 8, 2016

Anything left to do here? @wojtek-t can we close this?

Member

wojtek-t commented Feb 9, 2016

The missing thing here is the Indexer mechanism. But this is definitely not for 1.2 and I think we can create a separate issue for it.

Member

lavalamp commented Feb 10, 2016

#4817 already exists, we can reuse that. Thanks!

@lavalamp lavalamp closed this Feb 10, 2016
