Sketch design: API response paging for large data sets #49338

Closed
smarterclayton opened this issue Jul 20, 2017 · 7 comments
Labels
sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. sig/scalability Categorizes an issue or PR as relevant to SIG Scalability.

Comments

@smarterclayton
Contributor

smarterclayton commented Jul 20, 2017

With the move to etcd3, efficient paging from clients down to the underlying store is now possible. A number of issues have been opened around this (#2349).

Problem statement:

On large clusters, performing large queries (GET /api/v1/pods, GET /api/v1/secrets) can cause several problems:

  • Large reads from etcd block writes to the same range (Question: Should a large get with WithSerializable() option block puts? etcd-io/etcd#7719) - this may be fixed in the near future
  • On very large ranges (100k secrets, 100k pods), the amount of data transferred can be in the hundreds of megabytes (500M secret reads have been observed in the wild) - that data has to be loaded into memory in etcd all at once
  • The data from etcd has to be transferred to the apiserver
  • The kube-apiserver also has to load all that into memory (in a larger form than etcd) which can cause large spikes in memory usage
  • A slow client reading from the kube-apiserver may time out if there is significant contention on the master, or if its network can't transfer the data within the 60s window
  • kubectl get of a large fetch must also load that entirely into memory

In general, if we can efficiently page from clients down to etcd, we can reduce the spikiness of memory and CPU allocation and smear that cost over time.

Proposed change:

Expose a simple paging mechanism to perform consistent reads across a large list. Clients would indicate a limit with their LIST calls specifying how many objects they wish to receive. The server would return up to that number of objects, and if more exist it would return a continue parameter that the client could pass (along with a limit) to receive the next set of data. The server would be allowed to ignore the limit if it does not implement limiting (backward compatible), but it would not be allowed to support limiting without also supporting a way to continue the query past the limit (may not implement limit without continue).

GET /api/v1/pods?limit=500
{
  "metadata": {"continue": "ABC...", "resourceVersion": "147"},
  "items": [
     // no more than 500 items
   ]
}
GET /api/v1/pods?limit=500&continue=ABC...
{
  "metadata": {"continue": "DEF...", "resourceVersion": "147"},
  "items": [
     // no more than 500 items
   ]
}
GET /api/v1/pods?limit=500&continue=DEF...
{
  "metadata": {"resourceVersion": "147"},
  "items": [
     // no more than 500 items
   ]
}
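
For illustration, a minimal client-side loop consuming these limit/continue parameters might look like the sketch below. It assumes a hypothetical unauthenticated apiserver at localhost:8080 and that an empty metadata.continue marks the last page, following the example responses above; it is not part of any existing client library.

package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/url"
)

// apiServer is a hypothetical, unauthenticated apiserver endpoint.
const apiServer = "http://localhost:8080"

type podList struct {
	Metadata struct {
		Continue        string `json:"continue"`
		ResourceVersion string `json:"resourceVersion"`
	} `json:"metadata"`
	Items []json.RawMessage `json:"items"`
}

// listPage issues GET /api/v1/pods?limit=...&continue=... and decodes the result.
func listPage(limit int, continueToken string) (*podList, error) {
	q := url.Values{}
	q.Set("limit", fmt.Sprint(limit))
	if continueToken != "" {
		q.Set("continue", continueToken)
	}
	resp, err := http.Get(apiServer + "/api/v1/pods?" + q.Encode())
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	var list podList
	if err := json.NewDecoder(resp.Body).Decode(&list); err != nil {
		return nil, err
	}
	return &list, nil
}

func main() {
	token := ""
	for {
		page, err := listPage(500, token)
		if err != nil {
			panic(err)
		}
		fmt.Printf("received %d items at resourceVersion %s\n", len(page.Items), page.Metadata.ResourceVersion)
		// An empty continue token means this was the last page.
		if page.Metadata.Continue == "" {
			break
		}
		token = page.Metadata.Continue
	}
}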

Backends

Etcd2 would not be supported. Other databases that might back Kubernetes could either choose not to implement limiting, or use an MVCC-style approach that provides a consistent read. Non-consistent LISTs would not be supported, since that would not allow informers and other clients to perform a resync safely.

For etcd3, the continue parameter would need to contain a resource version (the snapshot that we are reading that is consistent across the entire LIST) and the start key for the range. The end key would be calculated by storage before invoking etcd.
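
As a rough sketch, the continue token could be an opaque, base64-encoded blob carrying exactly those two pieces of state. The exact wire format would be an implementation detail of the storage layer; the field names below are illustrative, not part of the API.

package main

import (
	"encoding/base64"
	"encoding/json"
	"fmt"
)

type continueToken struct {
	ResourceVersion int64  `json:"resourceVersion"` // snapshot the LIST is reading from
	StartKey        string `json:"startKey"`        // first key of the next page
}

func encodeContinue(rv int64, startKey string) (string, error) {
	raw, err := json.Marshal(continueToken{ResourceVersion: rv, StartKey: startKey})
	if err != nil {
		return "", err
	}
	return base64.RawURLEncoding.EncodeToString(raw), nil
}

func decodeContinue(token string) (*continueToken, error) {
	raw, err := base64.RawURLEncoding.DecodeString(token)
	if err != nil {
		return nil, err
	}
	var c continueToken
	if err := json.Unmarshal(raw, &c); err != nil {
		return nil, err
	}
	return &c, nil
}

func main() {
	token, err := encodeContinue(147, "/registry/pods/ns1/pod-500")
	if err != nil {
		panic(err)
	}
	fmt.Println("continue:", token)
	c, err := decodeContinue(token)
	if err != nil {
		panic(err)
	}
	fmt.Printf("resume range at key %q, snapshot revision %d\n", c.StartKey, c.ResourceVersion)
}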

Handling expired resource versions

If the resource version has expired (default 5m expiration today), the server would return a 410 Gone ResourceExpired status response (the same as for watch), which means clients must either start their paged list over again or relist.

# resourceVersion is expired
GET /api/v1/pods?limit=500&continue=DEF...
{
  "kind": "Status",
  "code": 410,
  "reason": "ResourceExpired"
}

The 5 minute default compaction interval for etcd3 bounds how long a paged list can run. Since clients may wish to perform processing over very large sets, increasing that interval may make sense for large clusters.
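
For a client that requires consistency, handling this 410 is straightforward, if expensive: discard the partial result and restart the paged LIST from the beginning. A minimal sketch, where listPage is a hypothetical stub standing in for the paged GET shown earlier rather than an existing client call:

package main

import "fmt"

type page struct {
	Continue string
	Items    []string
}

// listPage stands in for GET /api/v1/pods?limit=...&continue=...; expired is
// true when the server answers with a 410 ResourceExpired Status.
func listPage(limit int, continueToken string) (p page, expired bool, err error) {
	// ... perform the request and decode either a list or a Status ...
	return page{}, false, nil
}

// listAllConsistently retries the whole paged LIST whenever the snapshot it
// was reading from has been compacted away.
func listAllConsistently(limit int) ([]string, error) {
restart:
	for {
		var all []string
		token := ""
		for {
			p, expired, err := listPage(limit, token)
			if err != nil {
				return nil, err
			}
			if expired {
				// A consistent client cannot mix pages from different
				// snapshots, so it must start over (or fall back to an
				// unpaged LIST).
				continue restart
			}
			all = append(all, p.Items...)
			if p.Continue == "" {
				return all, nil
			}
			token = p.Continue
		}
	}
}

func main() {
	items, err := listAllConsistently(500)
	if err != nil {
		panic(err)
	}
	fmt.Printf("received %d items\n", len(items))
}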

Types of clients and impact

Some clients, such as controllers, may on receiving a 410 error wish instead to perform a full LIST without paging.

  • Controllers with full caches
    • Any controller with a full in-memory cache of one or more resources almost certainly depends on having a consistent view of resources, and so will either need to perform a full list or a paged list, without dropping results
  • kubectl get
    • Most administrators would probably prefer to see a very large set with some inconsistency rather than no results (due to a timeout under load). They would likely be ok with handling 410 ResourceExpired as "continue from the last key I processed"
  • Migration style commands
    • Assuming a migration command has to run on the full data set (to upgrade a resource from json to protobuf, or to check a large set of resources for errors) and is performing some expensive calculation on each, very large sets may not complete within the server's expiration window.

For clients that do not care about consistency, the storage should return a continue value with the ResourceExpired error that allows the client to restart from the same prefix key, but using the latest resource version. This would allow clients that do not require a fully consistent LIST to opt in to partially consistent LISTs but still be able to scan the entire working set.
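
A sketch of that opt-in behaviour from the client side, assuming the resume token is surfaced as metadata.continue on the 410 Status; where exactly it would live on the Status object is part of the proposal, not an existing API guarantee, and listPage is again a hypothetical stub:

package main

import (
	"errors"
	"fmt"
)

type pageResult struct {
	Items    []string
	Continue string // next-page token on success, resume token on a 410
	Expired  bool   // set when the server answered 410 ResourceExpired
}

// listPage stands in for GET /api/v1/pods?limit=...&continue=...; on a 410 it
// would copy the Status' metadata.continue (if present) into Continue.
func listPage(limit int, continueToken string) (pageResult, error) {
	// ... perform the request and decode the response ...
	return pageResult{}, nil
}

// scanAll tolerates partially consistent results: on expiry it keeps what has
// already been processed and resumes from the same key at the latest
// resourceVersion instead of starting over.
func scanAll(limit int, process func(string)) error {
	token := ""
	for {
		r, err := listPage(limit, token)
		if err != nil {
			return err
		}
		if r.Expired {
			if r.Continue == "" {
				return errors.New("resourceVersion expired and no resume token was returned")
			}
			token = r.Continue
			continue
		}
		for _, item := range r.Items {
			process(item)
		}
		if r.Continue == "" {
			return nil
		}
		token = r.Continue
	}
}

func main() {
	if err := scanAll(500, func(name string) { fmt.Println(name) }); err != nil {
		panic(err)
	}
}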

Rate limiting

Since the goal is to reduce spikiness of load, the standard API rate limiter might prefer to rate limit page requests differently from global lists, allowing full LISTs only slowly while smaller pages can proceed more quickly.
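
One way to express that (illustrative only; the token costs and rates below are not proposed defaults) is a shared token bucket where an unpaged LIST is charged far more tokens than a single page, using golang.org/x/time/rate:

package main

import (
	"context"
	"fmt"

	"golang.org/x/time/rate"
)

const (
	pageCost     = 1  // a single page is cheap
	fullListCost = 10 // an unpaged LIST drains a large share of the budget
)

func main() {
	// Refill 5 tokens per second, allow a burst of 20.
	limiter := rate.NewLimiter(rate.Limit(5), 20)
	ctx := context.Background()

	// Paged requests proceed with little or no delay.
	for i := 0; i < 5; i++ {
		if err := limiter.WaitN(ctx, pageCost); err != nil {
			panic(err)
		}
		fmt.Println("admitted page", i)
	}

	// A full LIST must wait until enough of the budget has refilled.
	if err := limiter.WaitN(ctx, fullListCost); err != nil {
		panic(err)
	}
	fmt.Println("admitted full LIST")
}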

Enabling by default?

On a very large data set, paging trades total memory allocated in etcd, the apiserver, and the client for higher per-request overhead (request/response processing, authentication, authorization). Picking a sufficiently high default page size such as 500 or 1000 would not impact smaller clusters, but would reduce the peak memory load of a very large cluster (10k resources and up). In testing, no significant overhead was observed in etcd3 for a paged historical query, which is expected since the etcd3 store is an MVCC store and must always filter some values to serve a list.

For clients that must perform sequential processing of lists (kubectl get, migration commands) this change dramatically improves initial latency - in testing, clients received their first chunk of data in milliseconds rather than waiting seconds for the full set.

Other solutions

Compression from the apiserver and between the apiserver and etcd can reduce total network bandwidth, but cannot reduce the peak CPU and memory used inside the three processes.

@smarterclayton smarterclayton added sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. sig/scalability Categorizes an issue or PR as relevant to SIG Scalability. labels Jul 20, 2017
@smarterclayton
Contributor Author

See #48921 for a design exploration. I did some testing on a very large data set (100k secrets, 10k namespaces, 10k pods, about 1 million keys) and peak memory load during controller restart was significantly reduced. For a migration job using paging, it meant that clients started processing results almost immediately instead of having to wait to receive the full data set.

@smarterclayton
Contributor Author

smarterclayton commented Jul 20, 2017

@kubernetes/sig-scalability-misc @kubernetes/sig-api-machinery-misc We've been exploring how very large clusters work in practice - I think paging could be a very low cost win for large clusters based on the explorations so far.

@k8s-ci-robot k8s-ci-robot added the sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. label Jul 20, 2017
@smarterclayton smarterclayton removed the sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. label Jul 21, 2017
@dixudx
Member

dixudx commented Jul 21, 2017

/cc

@caesarxuchao
Member

cc @jpbetz

@timothysc
Member

So long as the pagination is passed through to storage/etcd, perhaps through optional list parameters, it makes sense. It would get really weird if you did multi-layered pagination with the api-server serving as another caching layer.

@ravisantoshgudimetla
Contributor

/cc

k8s-github-robot pushed a commit to kubernetes/community that referenced this issue Aug 29, 2017
Automatic merge from submit-queue

Design for consistent API chunking in Kubernetes

In order to reduce the memory allocation impact of very large LIST operations against the Kubernetes apiserver, it should be possible to receive a single large resource list as many individual page requests to the server.

Part of kubernetes/enhancements#365. Taken from kubernetes/kubernetes#49338. Original discussion in kubernetes/kubernetes#2349
@smarterclayton
Contributor Author

Implemented
