Sketch design: API response paging for large data sets #49338

Closed
smarterclayton opened this issue Jul 20, 2017 · 7 comments
Labels
sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. sig/scalability Categorizes an issue or PR as relevant to SIG Scalability.

Comments

@smarterclayton
Contributor

smarterclayton commented Jul 20, 2017

With the move to etcd3, efficient paging from clients down to the underlying store is now possible. A number of issues have been opened around this (#2349).

Problem statement:

On large clusters, performing large queries (GET /api/v1/pods, GET /api/v1/secrets) can cause several problems:

  • Large reads from etcd block writes to the same range (Question: Should a large get with WithSerializable() option block puts? etcd-io/etcd#7719) - this may be fixed in the near future
  • On very large ranges (100k secrets, 100k pods), the amount of data transferred can be in the hundreds of megabytes (500M secret reads have been observed in the wild) - that data has to be loaded into memory in etcd all at once
  • The data from etcd has to be transferred to the apiserver
  • The kube-apiserver also has to load all that into memory (in a larger form than etcd) which can cause large spikes in memory usage
  • A slow client reading from the kube-apiserver may time out if there is significant contention on the master, or if its network can't transfer the data within the 60s window
  • kubectl get of a large fetch must also load that entirely into memory

In general, if we can efficiently page from clients down to etcd, we can reduce the spikiness of memory and CPU allocation and smear that cost over time.

Proposed change:

Expose a simple paging mechanism to perform consistent reads across a large list. Clients would indicate a limit with their LIST calls specifying how many objects they wish to receive. The server would return up to that number of objects, and if more exist it would return a continue parameter that the client could pass (along with a limit) to receive the next set of data. The server would be allowed to ignore the limit if it does not implement limiting (backward compatible), but it would not be allowed to support limiting without also supporting a way to continue the query past the limit (may not implement limit without continue).

GET /api/v1/pods?limit=500
{
  "metadata": {"continue": "ABC...", "resourceVersion": "147"},
  "items": [
     // no more than 500 items
   ]
}
GET /api/v1/pods?limit=500&continue=ABC...
{
  "metadata": {"continue": "DEF...", "resourceVersion": "147"},
  "items": [
     // no more than 500 items
   ]
}
GET /api/v1/pods?limit=500&continue=DEF...
{
  "metadata": {"resourceVersion": "147"},
  "items": [
     // no more than 500 items
   ]
}
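
For illustration, a minimal client-side loop consuming these limit/continue parameters might look like the sketch below. It assumes a hypothetical unauthenticated apiserver at localhost:8080 and that an empty metadata.continue marks the last page, following the example responses above; it is not part of any existing client library.

package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/url"
)

// apiServer is a hypothetical, unauthenticated apiserver endpoint.
const apiServer = "http://localhost:8080"

type podList struct {
	Metadata struct {
		Continue        string `json:"continue"`
		ResourceVersion string `json:"resourceVersion"`
	} `json:"metadata"`
	Items []json.RawMessage `json:"items"`
}

// listPage issues GET /api/v1/pods?limit=...&continue=... and decodes the result.
func listPage(limit int, continueToken string) (*podList, error) {
	q := url.Values{}
	q.Set("limit", fmt.Sprint(limit))
	if continueToken != "" {
		q.Set("continue", continueToken)
	}
	resp, err := http.Get(apiServer + "/api/v1/pods?" + q.Encode())
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	var list podList
	if err := json.NewDecoder(resp.Body).Decode(&list); err != nil {
		return nil, err
	}
	return &list, nil
}

func main() {
	token := ""
	for {
		page, err := listPage(500, token)
		if err != nil {
			panic(err)
		}
		fmt.Printf("received %d items at resourceVersion %s\n", len(page.Items), page.Metadata.ResourceVersion)
		// An empty continue token means this was the last page.
		if page.Metadata.Continue == "" {
			break
		}
		token = page.Metadata.Continue
	}
}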

Backends

Etcd2 would not be supported. Other databases that might back Kubernetes could either choose not to implement limiting, or use an MVCC-style approach that provides a consistent read. Non-consistent LISTs would not be supported, since that would not allow informers and other clients to perform a resync safely.

For etcd3, the continue parameter would need to contain a resource version (the snapshot that we are reading that is consistent across the entire LIST) and the start key for the range. The end key would be calculated by storage before invoking etcd.
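
As a rough sketch, the continue token could be an opaque, base64-encoded blob carrying exactly those two pieces of state. The exact wire format would be an implementation detail of the storage layer; the field names below are illustrative, not part of the API.

package main

import (
	"encoding/base64"
	"encoding/json"
	"fmt"
)

type continueToken struct {
	ResourceVersion int64  `json:"resourceVersion"` // snapshot the LIST is reading from
	StartKey        string `json:"startKey"`        // first key of the next page
}

func encodeContinue(rv int64, startKey string) (string, error) {
	raw, err := json.Marshal(continueToken{ResourceVersion: rv, StartKey: startKey})
	if err != nil {
		return "", err
	}
	return base64.RawURLEncoding.EncodeToString(raw), nil
}

func decodeContinue(token string) (*continueToken, error) {
	raw, err := base64.RawURLEncoding.DecodeString(token)
	if err != nil {
		return nil, err
	}
	var c continueToken
	if err := json.Unmarshal(raw, &c); err != nil {
		return nil, err
	}
	return &c, nil
}

func main() {
	token, err := encodeContinue(147, "/registry/pods/ns1/pod-500")
	if err != nil {
		panic(err)
	}
	fmt.Println("continue:", token)
	c, err := decodeContinue(token)
	if err != nil {
		panic(err)
	}
	fmt.Printf("resume range at key %q, snapshot revision %d\n", c.StartKey, c.ResourceVersion)
}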

Handling expired resource versions

If the resource version has expired (default 5m expiration today), the server would return a 410 Gone ResourceExpired status response (the same as for watch), which means clients must either start their paged list over again or relist.

# resourceVersion is expired
GET /api/v1/pods?limit=500&continue=DEF...
{
  "kind": "Status",
  "code": 410,
  "reason": "ResourceExpired"
}

The 5 minute default compaction interval for etcd3 bounds how long a paged list can run. Since clients may wish to perform processing over very large sets, increasing that interval may make sense for large clusters.
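
For a client that requires consistency, handling this 410 is straightforward, if expensive: discard the partial result and restart the paged LIST from the beginning. A minimal sketch, where listPage is a hypothetical stub standing in for the paged GET shown earlier rather than an existing client call:

package main

import "fmt"

type page struct {
	Continue string
	Items    []string
}

// listPage stands in for GET /api/v1/pods?limit=...&continue=...; expired is
// true when the server answers with a 410 ResourceExpired Status.
func listPage(limit int, continueToken string) (p page, expired bool, err error) {
	// ... perform the request and decode either a list or a Status ...
	return page{}, false, nil
}

// listAllConsistently retries the whole paged LIST whenever the snapshot it
// was reading from has been compacted away.
func listAllConsistently(limit int) ([]string, error) {
restart:
	for {
		var all []string
		token := ""
		for {
			p, expired, err := listPage(limit, token)
			if err != nil {
				return nil, err
			}
			if expired {
				// A consistent client cannot mix pages from different
				// snapshots, so it must start over (or fall back to an
				// unpaged LIST).
				continue restart
			}
			all = append(all, p.Items...)
			if p.Continue == "" {
				return all, nil
			}
			token = p.Continue
		}
	}
}

func main() {
	items, err := listAllConsistently(500)
	if err != nil {
		panic(err)
	}
	fmt.Printf("received %d items\n", len(items))
}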

Types of clients and impact

Some clients, such as controllers, may on receiving a 410 error wish instead to perform a full LIST without paging.

  • Controllers with full caches
    • Any controller with a full in-memory cache of one or more resources almost certainly depends on having a consistent view of resources, and so will either need to perform a full list or a paged list, without dropping results
  • kubectl get
    • Most administrators would probably prefer to see a very large set with some inconsistency rather than no results (due to a timeout under load). They would likely be ok with handling 410 ResourceExpired as "continue from the last key I processed"
  • Migration style commands
    • Assuming a migration command has to run on the full data set (to upgrade a resource from json to protobuf, or to check a large set of resources for errors) and is performing some expensive calculation on each, very large sets may not complete within the server's expiration window.

For clients that do not care about consistency, the storage should return a continue value with the ResourceExpired error that allows the client to restart from the same prefix key, but using the latest resource version. This would allow clients that do not require a fully consistent LIST to opt in to partially consistent LISTs but still be able to scan the entire working set.
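
A sketch of that opt-in behaviour from the client side, assuming the resume token is surfaced as metadata.continue on the 410 Status; where exactly it would live on the Status object is part of the proposal, not an existing API guarantee, and listPage is again a hypothetical stub:

package main

import (
	"errors"
	"fmt"
)

type pageResult struct {
	Items    []string
	Continue string // next-page token on success, resume token on a 410
	Expired  bool   // set when the server answered 410 ResourceExpired
}

// listPage stands in for GET /api/v1/pods?limit=...&continue=...; on a 410 it
// would copy the Status' metadata.continue (if present) into Continue.
func listPage(limit int, continueToken string) (pageResult, error) {
	// ... perform the request and decode the response ...
	return pageResult{}, nil
}

// scanAll tolerates partially consistent results: on expiry it keeps what has
// already been processed and resumes from the same key at the latest
// resourceVersion instead of starting over.
func scanAll(limit int, process func(string)) error {
	token := ""
	for {
		r, err := listPage(limit, token)
		if err != nil {
			return err
		}
		if r.Expired {
			if r.Continue == "" {
				return errors.New("resourceVersion expired and no resume token was returned")
			}
			token = r.Continue
			continue
		}
		for _, item := range r.Items {
			process(item)
		}
		if r.Continue == "" {
			return nil
		}
		token = r.Continue
	}
}

func main() {
	if err := scanAll(500, func(name string) { fmt.Println(name) }); err != nil {
		panic(err)
	}
}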

Rate limiting

Since the goal is to reduce spikiness of load, the standard API rate limiter might prefer to rate limit page requests differently from global lists, allowing full LISTs only slowly while smaller pages can proceed more quickly.
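
One way to express that (illustrative only; the token costs and rates below are not proposed defaults) is a shared token bucket where an unpaged LIST is charged far more tokens than a single page, using golang.org/x/time/rate:

package main

import (
	"context"
	"fmt"

	"golang.org/x/time/rate"
)

const (
	pageCost     = 1  // a single page is cheap
	fullListCost = 10 // an unpaged LIST drains a large share of the budget
)

func main() {
	// Refill 5 tokens per second, allow a burst of 20.
	limiter := rate.NewLimiter(rate.Limit(5), 20)
	ctx := context.Background()

	// Paged requests proceed with little or no delay.
	for i := 0; i < 5; i++ {
		if err := limiter.WaitN(ctx, pageCost); err != nil {
			panic(err)
		}
		fmt.Println("admitted page", i)
	}

	// A full LIST must wait until enough of the budget has refilled.
	if err := limiter.WaitN(ctx, fullListCost); err != nil {
		panic(err)
	}
	fmt.Println("admitted full LIST")
}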

Enabling by default?

On a very large data set, paging trades total memory allocated in etcd, the apiserver, and the client for higher per-request overhead (request/response processing, authentication, authorization). Picking a sufficiently high default page size such as 500 or 1000 would not impact smaller clusters, but would reduce the peak memory load of a very large cluster (10k resources and up). In testing, no significant overhead was observed in etcd3 for a paged historical query, which is expected since the etcd3 store is an MVCC store and must always filter some values to serve a list.

For clients that must perform sequential processing of lists (kubectl get, migration commands) this change dramatically improves initial latency - in testing, clients received their first chunk of data in milliseconds rather than waiting seconds for the full set.

Other solutions

Compression from the apiserver and between the apiserver and etcd can reduce total network bandwidth, but cannot reduce the peak CPU and memory used inside the three processes.

@smarterclayton smarterclayton added sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. sig/scalability Categorizes an issue or PR as relevant to SIG Scalability. labels Jul 20, 2017
@smarterclayton
Contributor Author

See #48921 for a design exploration. I did some testing on a very large data set (100k secrets, 10k namespaces, 10k pods, about 1 million keys) and peak memory load during controller restart was significantly reduced. For a migration job using paging, it meant that clients started processing results almost immediately instead of having to wait to receive the full data set.

@smarterclayton
Contributor Author

smarterclayton commented Jul 20, 2017

@kubernetes/sig-scalability-misc @kubernetes/sig-api-machinery-misc We've been exploring how very large clusters work in practice - I think paging could be a very low cost win for large clusters based on the explorations so far.

@k8s-ci-robot k8s-ci-robot added the sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. label Jul 20, 2017
@smarterclayton smarterclayton removed the sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. label Jul 21, 2017
@dixudx
Member

dixudx commented Jul 21, 2017

/cc

@caesarxuchao
Member

cc @jpbetz

@timothysc
Member

So long as the pagination is passed through to storage/etcd, perhaps through optional list parameters, it makes sense. It would get really weird if you did multi-layered pagination with the api-server serving as another caching layer.

@ravisantoshgudimetla
Contributor

/cc

k8s-github-robot pushed a commit to kubernetes/community that referenced this issue Aug 29, 2017
Automatic merge from submit-queue

Design for consistent API chunking in Kubernetes

In order to reduce the memory allocation impact of very large LIST operations against the Kubernetes apiserver, it should be possible to receive a single large resource list as many individual page requests to the server.

Part of kubernetes/enhancements#365. Taken from kubernetes/kubernetes#49338. Original discussion in kubernetes/kubernetes#2349
@smarterclayton
Contributor Author

Implemented
