# Design for consistent API chunking in Kubernetes (#896)

**Merged** (1 commit, Aug 29, 2017)
File: `contributors/design-proposals/api-chunking.md` (182 additions, 0 deletions)
# Allow clients to retrieve consistent API lists in chunks

On large clusters, performing API queries that return all of the objects of a given resource type (GET /api/v1/pods, GET /api/v1/secrets) can lead to significant variations in peak memory use on the server and contribute substantially to long tail request latency.

When loading very large sets of objects -- some clusters are now reaching 100k pods or equivalent numbers of supporting resources -- the system must:

* Construct the full range description in etcd in memory and serialize it as protobuf in the client
  * Some clusters have reported over 500MB being stored in a single object type
  * This data is read from the underlying datastore and converted to a protobuf response
  * Large reads to etcd can block writes to the same range (https://github.com/coreos/etcd/issues/7719)
* The data from etcd has to be transferred to the apiserver in one large chunk
* The `kube-apiserver` also has to deserialize that response into a single object, and then re-serialize it back to the client
  * Much of the decoded etcd memory is copied into the struct used to serialize to the client
* An API client like `kubectl get` will then decode the response from JSON or protobuf
  * An API client with a slow connection may not be able to receive the entire response body within the default 60s timeout
  * This may cause other failures downstream of that API client with their own timeouts
  * The recently introduced client compression feature can assist
  * The large response will also be loaded entirely into memory

The standard solution for reducing the impact of large reads is to allow them to be broken into smaller reads via a technique commonly referred to as paging or chunking. By efficiently splitting large list ranges from etcd to clients into many smaller list ranges, we can reduce the peak memory allocation on etcd and the apiserver, without losing the consistent read invariant our clients depend on.

This proposal does not cover general purpose ranging or paging for arbitrary clients, such as allowing web user interfaces to offer paged output, but does define some parameters for future extension. To that end, this proposal uses the phrase "chunking" to describe retrieving a consistent snapshot range read from the API server in distinct pieces.

Our primary consistent store etcd3 offers support for efficient chunking with minimal overhead, and mechanisms exist for other potential future stores such as SQL databases or Consul to also implement a simple form of consistent chunking.

Relevant issues:

* https://github.com/kubernetes/kubernetes/issues/2349

## Terminology

**Consistent list** - A snapshot of all resources at a particular moment in time that has a single `resourceVersion` that clients can begin watching from to receive updates. All Kubernetes controllers depend on this semantic. Allows a controller to refresh its internal state, and then receive a stream of changes from the initial state.

**API paging** - API parameters designed to allow a human to view results in a series of "pages".

**API chunking** - API parameters designed to allow a client to break one large request into multiple smaller requests without changing the semantics of the original request.


## Proposed change:

Expose a simple chunking mechanism to allow large API responses to be broken into consistent partial responses. Clients would indicate a tolerance for chunking (opt-in) by specifying a desired maximum number of results to return in a `LIST` call. The server would return up to that number of objects, and if more exist it would return a `continue` parameter that the client could pass to receive the next set of results. The server would be allowed to ignore the limit if it does not implement limiting (backward compatible), but it is not allowed to support limiting without supporting a way to continue the query past the limit (may not implement `limit` without `continue`).

```
GET /api/v1/pods?limit=500
{
"metadata": {"continue": "ABC...", "resourceVersion": "147"},
"items": [
// no more than 500 items
]
}
GET /api/v1/pods?limit=500&continue=ABC...
{
"metadata": {"continue": "DEF...", "resourceVersion": "147"},
"items": [
// no more than 500 items
]
}
GET /api/v1/pods?limit=500&continue=DEF...
{
"metadata": {"resourceVersion": "147"},
"items": [
// no more than 500 items
]
}
```

The token returned by the server for `continue` would be an opaque serialized string that would contain a simple serialization of a version identifier (to allow future extension), and any additional data needed by the server storage to identify where to start the next range.
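
As a rough illustration only (the real schema is chosen by the storage layer and is deliberately opaque to clients), a continue token could be a base64-encoded structure along these lines; the `continueToken` type, its fields, and the helper names are hypothetical, not the actual apiserver code:

```go
package chunking

import (
	"encoding/base64"
	"encoding/json"
)

// continueToken is a hypothetical sketch of what the opaque continue value
// could carry. The real schema is chosen by the storage layer and clients
// must never depend on it.
type continueToken struct {
	Version         int    `json:"v"`     // version identifier, to allow future extension
	ResourceVersion int64  `json:"rv"`    // the snapshot the whole list is served from
	StartKey        string `json:"start"` // where the next chunk of the range begins
}

// encodeContinue serializes the token into the opaque string handed to clients.
func encodeContinue(t continueToken) (string, error) {
	data, err := json.Marshal(t)
	if err != nil {
		return "", err
	}
	return base64.RawURLEncoding.EncodeToString(data), nil
}

// decodeContinue recovers the token from the string a client passed back.
func decodeContinue(s string) (continueToken, error) {
	var t continueToken
	data, err := base64.RawURLEncoding.DecodeString(s)
	if err != nil {
		return t, err
	}
	err = json.Unmarshal(data, &t)
	return t, err
}
```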

The continue token is not required to encode other filtering parameters present on the initial request, so a client could in principle alter its filter parameters on subsequent chunk reads. However, the server implementation **may** reject such changes with a `400 Bad Request` error, and clients should treat that behavior as undefined and left to future clarification. Because chunking is intended to return consistent lists, clients **should not** alter their filter parameters on subsequent chunk reads.

If the resource version parameter specified on the request is inconsistent with the `continue` token, the server **must** reject the request with a `400 Bad Request` error.

The schema of the continue token is chosen by the storage layer and is not guaranteed to remain consistent for clients - clients **must** consider the continue token as opaque. Server implementations **should** ensure that continue tokens can persist across server restarts and across upgrades.
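
Continuing the hypothetical sketch above, the consistency rule for an explicit `resourceVersion` parameter might be enforced like this (again, `validateContinue` is an illustrative helper, not the actual apiserver code):

```go
package chunking

import (
	"fmt"
	"strconv"
)

// validateContinue enforces the rule above: if the request carries an explicit
// resourceVersion, it must agree with the snapshot recorded in the continue
// token; otherwise the server answers 400 Bad Request.
func validateContinue(t continueToken, requestedResourceVersion string) error {
	if requestedResourceVersion == "" {
		return nil // nothing to cross-check
	}
	rv, err := strconv.ParseInt(requestedResourceVersion, 10, 64)
	if err != nil || rv != t.ResourceVersion {
		return fmt.Errorf("resourceVersion %q is inconsistent with the continue token (snapshot %d)",
			requestedResourceVersion, t.ResourceVersion)
	}
	return nil
}
```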

Servers **may** return fewer results than `limit` if server side filtering returns no results such as when a `label` or `field` selector is used. If the entire result set is filtered, the server **may** return zero results with a valid `continue` token. A client **must** use the presence of a `continue` token in the response to determine whether more results are available, regardless of the number of results returned. A server that supports limits **must not** return more results than `limit` if a `continue` token is also returned. If the server does not return a `continue` token, the server **must** return all remaining results. The server **may** return zero results with no `continue` token on the last call.
> **Reviewer:** I understand why you have it this way, but does it prevent us from moving the filtering step into the db layer in the future? I guess not. But it is an implementation detail leaking through.
>
> **Author:** No, it doesn't. In fact, filtering at the DB level should be opaque and possible - the continue token is generated by storage and surfaced all the way up, and filtering at the DB would simply identify the appropriate next key.

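To make the client-side contract concrete, the following is a minimal paging loop written against the raw REST API using only the Go standard library. It is a sketch under stated assumptions (an unauthenticated proxy address, a hypothetical `listAllPods` helper), not how `kubectl` or generated client libraries actually implement paging:

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/url"
)

// partialList decodes only the fields the paging loop needs from a LIST response.
type partialList struct {
	Metadata struct {
		Continue        string `json:"continue"`
		ResourceVersion string `json:"resourceVersion"`
	} `json:"metadata"`
	Items []json.RawMessage `json:"items"`
}

// listAllPods pages through /api/v1/pods in chunks of `limit`, following the
// continue token until the server stops returning one.
func listAllPods(base string, limit int) (int, error) {
	total := 0
	continueToken := ""
	for {
		q := url.Values{}
		q.Set("limit", fmt.Sprint(limit))
		if continueToken != "" {
			q.Set("continue", continueToken)
		}
		resp, err := http.Get(base + "/api/v1/pods?" + q.Encode())
		if err != nil {
			return total, err
		}
		var page partialList
		err = json.NewDecoder(resp.Body).Decode(&page)
		resp.Body.Close()
		if err != nil {
			return total, err
		}
		total += len(page.Items)
		// The presence of a continue token, not the number of items returned,
		// is what signals that more results remain.
		if page.Metadata.Continue == "" {
			return total, nil
		}
		continueToken = page.Metadata.Continue
	}
}

func main() {
	// Assumes an unauthenticated proxy such as `kubectl proxy` on 127.0.0.1:8001.
	total, err := listAllPods("http://127.0.0.1:8001", 500)
	if err != nil {
		fmt.Println("list failed:", err)
		return
	}
	fmt.Println("total pods:", total)
}
```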

The server **may** limit the amount of time a continue token is valid for. Clients **should** assume continue tokens last only a few minutes.
> **Reviewer:** Servers must give a clear error on expired continue tokens. What happens if the collection is large enough that the last page has always expired by the time the client gets there?
>
> **Author:** They do: `410 Resource Expired`. For that case the default pager implementation falls back to an unpaged list. On our largest pathological data sets (1.2 GB at rest, 20 GB in memory) I was able to read 100k secrets (500 MB of data) in chunks of 500 in under 4 seconds.


The server **must** support `continue` tokens that are valid across multiple API servers. The server **must** support a mechanism for rolling restart such that continue tokens are valid after one or all API servers have been restarted.
> **Reviewer:** And storage backend restarts.



### Proposed Implementations

etcd3 is the primary Kubernetes store and has been designed to support consistent range reads in chunks for this use case. The etcd3 store is an ordered map of keys to values, and Kubernetes places all keys within a resource type under a common prefix, with namespaces being a further prefix of those keys. A read of all keys within a resource type is an in-order scan of the etcd3 map, and therefore we can retrieve in chunks by defining a start key for the next chunk that skips the last key read.

etcd2 will not be supported as it has no option to perform a consistent read and is on track to be deprecated in Kubernetes. Other databases that might back Kubernetes could either choose to not implement limiting, or leverage their own transactional characteristics to return a consistent list. In the near term our primary store remains etcd3 which can provide this capability at low complexity.

Implementations that cannot offer consistent ranging (returning a set of results that are logically equivalent to receiving all results in one response) must not allow continuation, because consistent listing is a requirement of the Kubernetes API list and watch pattern.
> **Reviewer:** +1, we should come up with a stress test for this.


#### etcd3

For etcd3 the continue token would contain a resource version (the snapshot that we are reading that is consistent across the entire LIST) and the start key for the next set of results. Upon receiving a valid continue token the apiserver would instruct etcd3 to retrieve the set of results at the given resource version, beginning at the provided start key, limited by the maximum number of results provided by the continue token (or optionally, by a different limit specified by the client). If more results remain after reading up to the limit, the storage should calculate a continue token that begins at the next possible key, and set that continue token on the returned list.
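
A sketch of how the storage layer could drive etcd3 for a single chunk is shown below. This is illustrative only, not the actual apiserver storage code: the `readChunk` helper and `chunk` type are hypothetical, and the `go.etcd.io` import path is the current one (at the time of this proposal the client package lived under `github.com/coreos/etcd/clientv3`).

```go
package storage

import (
	"context"

	clientv3 "go.etcd.io/etcd/client/v3"
)

// chunk is one page of an etcd3 range read at a fixed revision.
type chunk struct {
	Keys     [][]byte
	Values   [][]byte
	Revision int64  // resource version the whole list is served from
	NextKey  string // start key for the next chunk; empty when the range is exhausted
}

// readChunk reads up to limit keys under prefix, starting at startKey, at the
// given revision. Passing rev=0 reads the latest revision; the revision in the
// response should then be reused for every subsequent chunk so the list stays
// consistent.
func readChunk(ctx context.Context, c *clientv3.Client, prefix, startKey string, rev, limit int64) (*chunk, error) {
	if startKey == "" {
		startKey = prefix
	}
	resp, err := c.Get(ctx, startKey,
		clientv3.WithRange(clientv3.GetPrefixRangeEnd(prefix)), // scan to the end of the resource's key prefix
		clientv3.WithRev(rev),
		clientv3.WithLimit(limit),
		clientv3.WithSort(clientv3.SortByKey, clientv3.SortAscend),
	)
	if err != nil {
		return nil, err
	}
	out := &chunk{Revision: resp.Header.Revision}
	for _, kv := range resp.Kvs {
		out.Keys = append(out.Keys, kv.Key)
		out.Values = append(out.Values, kv.Value)
	}
	if resp.More && len(resp.Kvs) > 0 {
		// The next chunk starts just past the last key returned; appending a
		// zero byte gives the smallest key strictly greater than it.
		out.NextKey = string(resp.Kvs[len(resp.Kvs)-1].Key) + "\x00"
	}
	return out, nil
}
```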

The storage layer in the apiserver must apply consistency checking to the provided continue token to ensure that malicious users cannot trick the server into serving results outside of its range. The storage layer must perform defensive checking on the provided value, check for path traversal attacks, and have stable versioning for the continue token.
> **Reviewer:** This cannot happen, as etcd has no notion of a path; a key is just a blob, and it's Kubernetes that puts paths in keys.

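As the reviewer notes, etcd keys are opaque blobs, so "path traversal" here really means escaping the authorized key prefix. A minimal sketch of that kind of defensive check (a hypothetical helper, not the actual storage code) might be:

```go
package chunking

import (
	"fmt"
	"strings"
)

// checkKeyWithinPrefix rejects a start key recovered from a continue token if
// it would direct the range read outside the key prefix the request was
// authorized for (for example "/registry/pods/default/").
func checkKeyWithinPrefix(startKey, prefix string) error {
	if !strings.HasPrefix(startKey, prefix) {
		return fmt.Errorf("continue key %q is not within the authorized prefix %q", startKey, prefix)
	}
	return nil
}
```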

#### Possible SQL database implementation

A SQL database backing a Kubernetes server would need to implement a consistent snapshot read of an entire resource type, plus support changefeed style updates in order to implement the WATCH primitive. A likely implementation in SQL would be a table that stores multiple versions of each object, ordered by key and version, and filters out all historical versions of an object. A consistent paged list over such a table might be similar to:
> **Reviewer:** Wow, do you really expect anyone to try this? Does it really need to be in this proposal?
>
> **Author:** I just want to be really clear on the equivalence between an MVCC store like etcd and someone who really believes they want to have an alternate backend. At our scales it doesn't look like it's necessary anymore, but I want to be able to point back to a clear consideration.


```
SELECT * FROM resource_type
WHERE resourceVersion < ? AND deleted = false AND (namespace, name) > (?, ?)
ORDER BY namespace, name ASC
LIMIT ?
```

where `namespace` and `name` are part of the continuation token and an index exists over `(namespace, name, resourceVersion, deleted)` that makes the range query performant. The tuple comparison `(namespace, name) > (?, ?)` resumes immediately after the last key returned, and the row with the highest resource version for each `(namespace, name)` tuple would be returned.


### Security implications of returning last or next key in the continue token

If the continue token encodes the next key in the range, that key may expose info that is considered security sensitive, whether simply the name or namespace of resources not under the current tenant's control, or more seriously the name of a resource which is also a shared secret (for example, an access token stored as a kubernetes resource). There are a number of approaches to mitigating this impact:
> **Reviewer:** The keys of all the things the user actually got are present in the list; I don't know if I follow why exposing the next one would be problematic.
>
> **Author:** If we filter by ACL on a subset (user A can see namespaces 1, 3, and 10) and the last page of namespace 3 extends into namespace 4, then unless we continue reading until the next visible key, we would return a start key of `4/somename`, exposing both the existence of namespace 4 and the name `somename` to a user who shouldn't know that 4 exists.


1. Disable chunking on specific resources
2. Disable chunking when the user does not have permission to view all resources within a range
3. Encrypt the next key or the continue token using a shared secret across all API servers
   > **Reviewer:** Generate a random string as the continuation token; a shared map maps from that string to the actual data structure.
   >
   > **Author:** Wouldn't work across API servers unless persisted to etcd.

4. When chunking, continue reading until the next visible start key is located after filtering, so that start keys are always keys the user has access to.
   > **Reviewer:** If you're worried only about construction and not about making it opaque, you could sign or HMAC the key to prevent users from constructing them.


In the short term we have no supported subset filtering (i.e. a user who can LIST can also LIST ?fields= and vice versa), so 1 is sufficient to address the sensitive key name issue. Because clients are required to proceed as if limiting is not possible, the server is always free to ignore a chunked request for other reasons. In the future, 4 may be the best option because we assume that most users starting a consistent read intend to finish it, unlike more general user interface paging where only a small fraction of requests continue to the next page.


### Handling expired resource versions

If the required data to perform a consistent list is no longer available in the storage backend (by default, old versions of objects in etcd3 are removed after 5 minutes), the server **must** return a `410 Gone ResourceExpired` status response (the same as for watch), which means clients must start from the beginning.

```
# resourceVersion is expired
GET /api/v1/pods?limit=500&continue=DEF...
{
"kind": "Status",
"code": 410,
"reason": "ResourceExpired"
}
```

Some clients may wish to follow a failed paged list with a full list attempt.
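
As an illustration of that fallback, a raw-HTTP client could detect the expiration and restart as sketched here; the address, the placeholder continue token, and the choice to fall back to an unpaged list are assumptions, not mandated behavior:

```go
package main

import (
	"fmt"
	"net/http"
)

// isResourceExpired reports whether a chunked LIST failed because the snapshot
// behind the continue token has been compacted away (HTTP 410 Gone).
func isResourceExpired(resp *http.Response) bool {
	return resp.StatusCode == http.StatusGone
}

func main() {
	// A continue token from an earlier chunked list; "DEF" is a placeholder.
	resp, err := http.Get("http://127.0.0.1:8001/api/v1/pods?limit=500&continue=DEF")
	if err != nil {
		fmt.Println("request failed:", err)
		return
	}
	defer resp.Body.Close()
	if isResourceExpired(resp) {
		// Restart from the beginning: either retry the paged list without a
		// continue token, or issue one full unpaged LIST.
		fmt.Println("continue token expired; restarting the list")
	}
}
```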

The 5 minute default compaction interval for etcd3 bounds how long a list can run. Since clients may wish to perform processing over very large sets, increasing that timeout may make sense for large clusters. It should be possible to alter the interval at which compaction runs to accommodate larger clusters.


#### Types of clients and impact

Some clients, such as controllers, may on receiving a 410 error instead wish to perform a full LIST without chunking.

* Controllers with full caches
  * Any controller with a full in-memory cache of one or more resources almost certainly depends on having a consistent view of resources, and so will either need to perform a full list or a paged list, without dropping results
* `kubectl get`
  * Most administrators would probably prefer to see a very large set with some inconsistency rather than no results (due to a timeout under load). They would likely be ok with handling `410 ResourceExpired` as "continue from the last key I processed"
  > **Reviewer:** Doesn't etcd's ordering potentially change with each update? I would expect lots of duplicates / omissions in this case.
  >
  > **Author:** It does not. Etcd guarantees key order.

* Migration style commands
  * Assuming a migration command has to run on the full data set (to upgrade a resource from json to protobuf, or to check a large set of resources for errors) and is performing some expensive calculation on each, very large sets may not complete over the server expiration window.

For clients that do not care about consistency, the server **may** return a `continue` value on the `ResourceExpired` error that allows the client to restart from the same prefix key, but using the latest resource version. This would allow clients that do not require a fully consistent LIST to opt in to partially consistent LISTs but still be able to scan the entire working set. It is likely this could be a sub field (opaque data) of the `Status` response under `statusDetails`.
> **Reviewer:** Are etcd responses sorted by key? For some reason I thought it was by position in the etcd db file. @jpbetz
>
> **Author:** Etcd is a sorted map in lexicographic key order.



### Rate limiting

Since the goal is to reduce spikiness of load, the standard API rate limiter might prefer to rate limit page requests differently from global lists, allowing full LISTs only slowly while smaller pages can proceed more quickly.


### Chunk by default?

On a very large data set, chunking trades total memory allocated in etcd, the apiserver, and the client for higher overhead per request (request/response processing, authentication, authorization). Picking a sufficiently high chunk value like 500 or 1000 would not impact smaller clusters, but would reduce the peak memory load of a very large cluster (10k resources and up). In testing, no significant overhead was shown in etcd3 for a paged historical query which is expected since the etcd3 store is an MVCC store and must always filter some values to serve a list.

For clients that must perform sequential processing of lists (kubectl get, migration commands) this change dramatically improves initial latency - clients got their first chunk of data in milliseconds, rather than seconds for the full set. It also improves user experience for web consoles that may be accessed by administrators with access to large parts of the system.

It is recommended that most clients attempt to page by default at a large page size (500 or 1000) and gracefully degrade to not chunking.


### Other solutions

Compression from the apiserver and between the apiserver and etcd can reduce total network bandwidth, but cannot reduce the peak CPU and memory used inside the client, apiserver, or etcd processes.
> **Reviewer:** Why can't the apiserver do stream processing?
>
> **Author:** Depends on how you define a stream. It can, but it would require a dramatic change to the API shape of LIST.


Various optimizations exist that can and should be applied to minimize the amount of data transferred from etcd to the client, or the number of allocations made in each location, but they do not change how response size scales with the number of entries.


## Plan

The initial chunking implementation would focus on consistent listing on server and client as well as measuring the impact of chunking on total system load, since chunking will slightly increase the cost to view large data sets because of the additional per page processing. The initial implementation should make the fewest assumptions possible in constraining future backend storage.

For the initial alpha release, chunking would be behind a feature flag and attempts to provide the `continue` or `limit` flags should be ignored. While disabled, a `continue` token should never be returned by the server as part of a list.

Future work might offer more options for clients to page in an inconsistent fashion, or allow clients to directly specify the parts of the namespace / name keyspace they wish to range over (paging).

