
Paginate watch cache->etcd List calls & reflector init/resync List calls not served by watch cache #75389

Open · wants to merge 2 commits into base: master
Conversation

@jpbetz
Contributor

jpbetz commented Mar 14, 2019

Paginate requests from the kube-apiserver watch cache to etcd in chunks. Also, paginate reflector init and resync List calls that are not served by watch cache.

Main changes:

  • watch cache: The Reflector used internally to initialize and resync its data from etcd now requests data in chunks, and the cacheListerWatcher used by the watch cache to list from etcd now supports pagination.
  • Client reflector init and resync list requests...
    • ...served from the watch cache: Behavior will remain as it is today. Pagination will be ignored by the kube-apiserver since the watch cache does not support pagination. The watch cache scales quite well to large List requests already and we don't plan to add watch cache pagination support. The reflector will simply get a full list result from the first list request and use it, so there is no compatibility problem.
    • ...not served from the watch cache: The List request will be paginated in chunks from the client to the kube-apiserver. e.g. when kube-apiserver is configured with --watch-cache=false.

Fixing this will allow runaway processes that create resources indefinitely to reach even higher resource counts. At some point it will take longer to list the resources than the etcd compaction interval allows (default: 5 minutes), and the list operations will fail with "resource expired" errors. This is arguably a more graceful failure mode than we have today, where the kube-apiserver becomes entirely unavailable, but it can still result in broken controllers (or the watch cache consuming excessive memory?). We can add a metric on the duration of the watch cache list operations, which could be used to monitor and alert if the list operation duration gets too high. We are testing to see what failure mode actually occurs with this fix in place when resources are added indefinitely.
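To make the compaction-interval failure mode concrete, here is a back-of-the-envelope sketch. The object count and per-page latency below are illustrative assumptions, not measurements; only the 10k page size and the 5-minute default compaction interval come from this PR and etcd's defaults:

```go
package main

import "fmt"

func main() {
	// Illustrative assumptions, not measured values.
	const (
		totalObjects     = 10_000_000 // runaway controller has created this many objects
		pageSize         = 10_000     // watch cache -> etcd chunk size proposed in this PR
		secondsPerPage   = 0.5        // assumed etcd range-read latency per chunk
		compactionWindow = 300.0      // default etcd compaction interval: 5 minutes
	)
	pages := totalObjects / pageSize
	listSeconds := float64(pages) * secondsPerPage
	// If the paginated list outlives the compaction window, the continue
	// token expires and the list fails with "resource expired".
	fmt.Printf("pages=%d listSeconds=%.0f exceedsCompaction=%v\n",
		pages, listSeconds, listSeconds > compactionWindow)
}
```

Under these assumptions the full paginated list takes longer than the compaction window, which is exactly the "resource expired" scenario described above.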

cc @wojtek-t @cheftako @jingyih

/priority important-longterm
/kind feature

Paginate requests from the kube-apiserver watch cache to etcd in chunks.
Paginate reflector init and resync List calls that are not served by watch cache.
return r.listerWatcher.List(opts)
}))
pager.PageSize = listPageSize
// Since listPageSize is already large, if the list expires, do not attempt a full list.


@wojtek-t

wojtek-t Mar 15, 2019

Member

I'm not convinced about this - we ultimately want to initialize the cacher, because otherwise watch will not work at all.


@jpbetz

jpbetz Mar 15, 2019

Author Contributor

Yes, I do need to do more here. I'm concerned that a paginated list failure due to an "expired" error indicates the resource count is very high, since even after 5 minutes of paging 10k chunks it still couldn't load, and that attempting a full list is bound to fail or make etcd unavailable. But I need to do some digging to better understand exactly how the watch cache handles errors. If I can't find a better way to handle this case, I'll re-enable the full-list fallback, and maybe add a metric to count occurrences of it.


@jpbetz

jpbetz Mar 19, 2019

Author Contributor

I've put the fallback to perform a full list back in.


@jpbetz

jpbetz Mar 19, 2019

Author Contributor

I plan to circle back on this in a future PR, potentially (a) having requests miss the watch cache until its initial/resync list requests succeed, and/or (b) setting a max resource count on the watch cache, deactivating it for resource kinds that have more than that number of resources (e.g. 500k). If we have these in place, then we might be able to remove the full-list fallback safely.

@jpbetz jpbetz force-pushed the jpbetz:pagination-v1 branch from 8eb7697 to d5e720a Mar 15, 2019

@k8s-ci-robot k8s-ci-robot added size/L and removed size/S labels Mar 15, 2019

@yliaog

Contributor

yliaog commented Mar 18, 2019

/cc

@k8s-ci-robot k8s-ci-robot requested a review from yliaog Mar 18, 2019

@jpbetz jpbetz force-pushed the jpbetz:pagination-v1 branch from d5e720a to 4372be6 Mar 19, 2019

@jpbetz jpbetz changed the title [WIP] Paginate watch cache->etcd List calls & reflector init/resync List calls not served by watch cache Paginate watch cache->etcd List calls & reflector init/resync List calls not served by watch cache Mar 19, 2019

@jpbetz

Contributor Author

jpbetz commented Mar 19, 2019

@k8s-ci-robot k8s-ci-robot requested a review from smarterclayton Mar 19, 2019

@jpbetz

Contributor Author

jpbetz commented Mar 19, 2019

/retest

@@ -74,6 +76,9 @@ var (
// We try to spread the load on apiserver by setting timeouts for
// watch requests - it is random in [minWatchTimeout, 2*minWatchTimeout].
minWatchTimeout = 5 * time.Minute

// Gather initial and resync lists in chunks of this size.
listPageSize = int64(10000)


@smarterclayton

smarterclayton Mar 20, 2019

Contributor

Experimentally chosen? When I did the first round 500 worked ok for clients.


@jpbetz

jpbetz Mar 20, 2019

Author Contributor

We did some etcd range read experiments and found this to be a good number for watch cache to etcd traffic. From API clients we should probably stick with the 500 you previously found works well. I'll update this PR to only use 10k for watch cache to etcd.


@wojtek-t

wojtek-t Mar 20, 2019

Member

I don't understand why those are supposed to be different - this is the same thing, isn't it?


@jpbetz

jpbetz Mar 21, 2019

Author Contributor

I've updated the PR, maybe that will clarify? I'm viewing them as:

  1. watchcache->etcd list requests (proposing we go from full list to 10k item chunks)
  2. clients using client reflector->apiserver->etcd list requests (proposing we go from full list to 500 item chunks)


@smarterclayton

smarterclayton Mar 22, 2019

Contributor

clients -> apiserver is more about transfer bandwidth (today) until we get gzip on by default. With gzip on by default I think it's likely that we could bump the chunk size up from the client.


@wojtek-t

wojtek-t Mar 22, 2019

Member

OK - now I understand. Thanks.

transfer bandwidth (today) until we get gzip on by default

I guess it depends on the environment? We didn't see real issues with bandwidth on GCP.

@jpbetz

Contributor Author

jpbetz commented Mar 21, 2019

Increased test coverage with a reflector unit test and an integration test validating that a paginated List with resourceVersion=0 is correct both when the watch cache is enabled and when it is disabled. PTAL!

@jpbetz jpbetz force-pushed the jpbetz:pagination-v1 branch from a0e6605 to 2fc6448 Mar 21, 2019

@k8s-ci-robot

Contributor

k8s-ci-robot commented Mar 21, 2019

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: jpbetz
To fully approve this pull request, please assign additional approvers.
We suggest the following additional approver: wojtek-t

If they are not already assigned, you can assign the PR to them by writing /assign @wojtek-t in a comment when ready.

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@jpbetz jpbetz force-pushed the jpbetz:pagination-v1 branch from 2fc6448 to 98ce7df Mar 21, 2019

@jpbetz jpbetz force-pushed the jpbetz:pagination-v1 branch from 98ce7df to 51c6756 Mar 21, 2019

@smarterclayton

Contributor

smarterclayton commented Mar 22, 2019

e2e-gcp-100 perf broken on #75570

if r.WatchListPageSize != 0 {
pager.PageSize = r.WatchListPageSize
}
list, err = pager.List(context.Background(), options)


@wojtek-t

wojtek-t Mar 22, 2019

Member

What I'm a bit afraid of is that we now don't have any fallback if this paginated list doesn't work.
Maybe it's fine given the early phase of the release cycle, but this is a pretty crucial part (if this won't initialize, watch pretty much won't work at all), so I think we need to be careful.


@jpbetz

jpbetz Mar 22, 2019

Author Contributor

Ah, I should have pointed this out more explicitly: pager.List automatically falls back to a full list (see the pager.List implementation), which is what we do today. So the behavior will be: first we attempt a paginated request, and if that fails we fall back to a full list. If both fail, the cacher retries using the same flow it has today.
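For readers unfamiliar with that flow, the try-paginated-then-fall-back behavior can be sketched like this. The function and error names here are illustrative, not the real pager implementation:

```go
package main

import (
	"errors"
	"fmt"
)

// errExpired stands in for the "resource expired" error returned when a
// continue token outlives the etcd compaction window.
var errExpired = errors.New("resource expired")

// listFunc abstracts a list call; limit 0 means an unpaginated full list.
type listFunc func(limit int) ([]string, error)

// listWithFallback mirrors the behavior described above: attempt a
// paginated list first and, if the pagination expires midway, retry
// once as a single full (unpaginated) list.
func listWithFallback(paginated, full listFunc, pageSize int) ([]string, error) {
	items, err := paginated(pageSize)
	if err == nil {
		return items, nil
	}
	if errors.Is(err, errExpired) {
		// Fall back to the pre-pagination behavior: one full list.
		return full(0)
	}
	return nil, err
}

func main() {
	// Simulate a paginated list that expires and a full list that works.
	paginated := func(limit int) ([]string, error) { return nil, errExpired }
	full := func(limit int) ([]string, error) { return []string{"a", "b"}, nil }
	items, err := listWithFallback(paginated, full, 10000)
	fmt.Println(len(items), err)
}
```

If both paths fail, the error propagates to the caller, which in the cacher's case means retrying via its existing flow.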

@jpbetz

Contributor Author

jpbetz commented Mar 22, 2019

/retest
