Sketch design: API response paging for large data sets #49338
Comments
See #48921 for a design exploration. I did some testing on a very large data set (100k secrets, 10k namespaces, 10k pods, about 1 million keys) and peak memory load during controller restart was significantly reduced. For a migration job using paging, it meant that clients started processing results almost immediately instead of having to wait to receive the full data set.
@kubernetes/sig-scalability-misc @kubernetes/sig-api-machinery-misc We've been exploring how very large clusters work in practice - I think paging could be a very low-cost win for large clusters based on the explorations so far.
/cc
cc @jpbetz
So long as the pagination is passed through to storage/etcd, perhaps through optional list parameters, it makes sense. It would get really weird if you did multi-layered pagination with the api-server serving as another caching layer.
/cc
Automatic merge from submit-queue: Design for consistent API chunking in Kubernetes. In order to reduce the memory allocation impact of very large LIST operations against the Kubernetes apiserver, it should be possible to receive a single large resource list as many individual page requests to the server. Part of kubernetes/enhancements#365. Taken from kubernetes/kubernetes#49338. Original discussion in kubernetes/kubernetes#2349
Implemented
With the move to etcd3, efficient paging from clients down to the underlying store is now possible. A number of issues have been opened around this (#2349).
Problem statement:
On large clusters, performing large queries (GET /api/v1/pods, GET /api/v1/secrets) can cause several problems:

- kube-apiserver also has to load all that into memory (in a larger form than etcd), which can cause large spikes in memory usage
- kubectl get of a large fetch must also load that entirely into memory

In general, if we can efficiently page from clients to etcd, we can reduce the spikiness of memory and CPU allocations and smear that cost over time.
Proposed change:
Expose a simple paging mechanism to perform consistent reads across a large list. Clients would indicate a limit with their LIST calls that indicates how many objects they wish to receive. The server would return up to that amount of objects, and if more exist it would return a `continue` parameter that the client could pass to receive the next set of data (along with a limit). The server would be allowed to ignore the limit if it does not implement limiting (backward compatible), but it is not allowed to support limiting without supporting a way to continue the query past the limit (may not implement `limit` without `continue`).

Backends
Etcd2 would not be supported. Other databases that might back Kubernetes could either choose not to implement limiting, or use an MVCC-style approach that provides a consistent read. Non-consistent LISTs would not be supported, since this would not enable informers and other clients to perform a resync safely.
For etcd3, the `continue` parameter would need to contain a resource version (the snapshot that we are reading, which is consistent across the entire LIST) and the start key for the range. The end key would be calculated by storage before invoking etcd.
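The mechanics above can be sketched in a few lines. The token layout below (base64 JSON of the snapshot's resource version plus the next start key) is an illustration of the idea, not the exact wire format Kubernetes settled on, and the in-memory snapshot map stands in for etcd's MVCC store.

```python
import base64
import json

def encode_continue(resource_version, start_key):
    """Pack the snapshot version and the next start key into an opaque token."""
    raw = json.dumps({"rv": resource_version, "start": start_key})
    return base64.urlsafe_b64encode(raw.encode()).decode()

def decode_continue(token):
    data = json.loads(base64.urlsafe_b64decode(token.encode()))
    return data["rv"], data["start"]

def paged_list(snapshot, resource_version, limit, continue_token=None):
    """Return up to `limit` (key, value) pairs from one consistent snapshot.

    `snapshot` maps resource versions to sorted lists of (key, value);
    every page of a single LIST is served from the same version, which is
    what makes the paged read consistent end to end.
    """
    start = ""
    if continue_token:
        rv, start = decode_continue(continue_token)
        assert rv == resource_version, "a token is bound to one snapshot"
    items = [kv for kv in snapshot[resource_version] if kv[0] >= start]
    page, rest = items[:limit], items[limit:]
    next_token = encode_continue(resource_version, rest[0][0]) if rest else None
    return page, next_token

# Usage: walk a 5-item snapshot (resource version 7) in pages of 2.
snap = {7: sorted((f"key{i}", i) for i in range(5))}
collected, token = [], None
while True:
    page, token = paged_list(snap, 7, limit=2, continue_token=token)
    collected.extend(page)
    if token is None:
        break
print(len(collected))  # all 5 items, retrieved in 3 requests
```

Because the token carries the resource version, the client never has to hold the whole result set, yet every page reflects the same point in time.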
Handling expired resource versions
If the resource version has expired (default 5m expiration today), the server would return a 410 Gone ResourceExpired status response (the same as for watch), which means clients must either start their page sequence over again, or relist.

The 5 minute default compaction interval for etcd3 bounds how long a list can run. Since clients may wish to perform processing over very large sets, increasing that timeout may make sense for large clusters.
Types of clients and impact
Some clients, such as controllers, receiving a 410 error may instead wish to perform a full LIST without paging. Clients like kubectl get could handle 410 ResourceExpired as "continue from the last key I processed".

For clients that do not care about consistency, the storage should return a `continue` value with the ResourceExpired error that allows the client to restart from the same prefix key, but using the latest resource version. This would allow clients that do not require a fully consistent LIST to opt in to partially consistent LISTs but still be able to scan the entire working set.

Rate limiting
Since the goal is to reduce spikiness of load, the standard API rate limiter might prefer to rate limit page requests differently from global lists, allowing full LISTs only slowly while smaller pages can proceed more quickly.
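The idea above can be sketched with two separate admission budgets: unpaged full LISTs draw from a small budget while paged requests draw from a large one, so bursty full scans get throttled while pages flow freely. A plain counter-based budget stands in here for a real token-bucket limiter; the names and budget sizes are illustrative.

```python
class Budget:
    """Simplified stand-in for a rate limiter: a fixed pool of request slots."""
    def __init__(self, tokens):
        self.tokens = tokens

    def allow(self):
        if self.tokens > 0:
            self.tokens -= 1
            return True
        return False

full_list_budget = Budget(tokens=2)  # expensive: only a couple at a time
paged_budget = Budget(tokens=100)    # cheap: many small pages allowed

def admit(request_has_limit):
    """Route a LIST to the budget matching its cost class."""
    bucket = paged_budget if request_has_limit else full_list_budget
    return bucket.allow()

# Usage: three full LISTs exhaust their budget; three paged requests do not.
full_results = [admit(False) for _ in range(3)]
paged_results = [admit(True) for _ in range(3)]
print(full_results, paged_results)  # [True, True, False] [True, True, True]
```

A production limiter would refill tokens over time rather than use a fixed pool, but the routing decision — classify by the presence of a limit parameter, then apply a per-class budget — is the point being illustrated.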
Enabling by default?
On a very large data set, paging trades total memory allocated in etcd, the apiserver, and the client for higher overhead per request (request/response processing, authentication, authorization). Picking a sufficiently high paging value like 500 or 1000 would not impact smaller clusters, but would reduce the peak memory load of a very large cluster (10k resources and up). In testing, no significant overhead was shown in etcd3 for a paged historical query which is expected since the etcd3 store is an MVCC store and must always filter some values to serve a list.
For clients that must perform sequential processing of lists (kubectl get, migration commands) this change dramatically improves initial latency - clients got their first chunk of data in milliseconds, rather than seconds for the full set.
Other solutions
Compression from the apiserver and between the apiserver and etcd can reduce total network bandwidth, but cannot reduce the peak CPU and memory used inside the three processes.