Topology Aware Infrastructure Disruptions for Statefulsets #114010
Comments
@null-sleep: This issue is currently awaiting triage. If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.
/sig sig-apps
@null-sleep: The label(s) sig/sig-apps cannot be applied, because the repository doesn't have them. In response to this: /sig sig-apps
/sig apps
In case it helps, here is one example from AWS: https://github.com/aws/zone-aware-controllers-for-k8s
And more details on why the ZoneDisruptionBudgets (ZDB) controller was created can be found in the following blog post: https://aws.amazon.com/blogs/opensource/speed-up-highly-available-deployments-on-kubernetes/
@marianafranco Thanks for sharing the links. Reading your blog post, it seems your motivations and solution are quite similar to what we did at Shopify.
I have a use case for this feature to allow for faster cross-cluster migrations. With #112744 (kubernetes/enhancements#3336), StatefulSet slices can be scaled down and scaled up across clusters. With the […] Having a way to specify a […] There are probably some safeguards we want to enforce with this feature, as it doesn't make sense for certain configurations. For example, if using […]
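In case it helps to picture the slice mechanism the linked enhancement adds, here is a minimal sketch using the spec.ordinals.start field from the StatefulSetStartOrdinal feature, so a StatefulSet owns only part of the ordinal range; the names, numbers, and image are illustrative assumptions, not taken from this thread:

```yaml
# Sketch only: this cluster owns ordinals 5..7 of a logical app whose
# ordinals 0..4 live in another cluster.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: db                # hypothetical name
spec:
  replicas: 3
  ordinals:
    start: 5              # this cluster runs db-5, db-6, db-7
  serviceName: db
  selector:
    matchLabels:
      app: db
  template:
    metadata:
      labels:
        app: db
    spec:
      containers:
      - name: db
        image: registry.example.com/db:1.0   # placeholder image
```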
Other similar requests made in the past: #41442 and kubernetes/community#2045
Tying this back to existing work in this space: the proposal of modifying the StatefulSet spec is characterized by the "New Update Strategy" alternative from the "topology-aware workload controllers" KEP. The other approach (which was the main proposal of that KEP) is to modify the PDB API and have workload controllers use the PDB API to control upgrades, rather than the rollingUpdate strategy. Both StatefulSet and Deployment have update strategy fields. This would allow for a new update strategy that could defer to the PDB API to identify which pods should be updated (evicted and re-created). Deployment may be more challenging to orchestrate, as it does not actuate pods directly.
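For context on that second approach: today's PodDisruptionBudget expresses only a single flat bound with no topology dimension. A minimal sketch of the existing API (the name, selector, and value are assumptions for illustration):

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: kafka-pdb          # hypothetical name
spec:
  maxUnavailable: 1        # one flat bound across all zones; nothing zone-aware today
  selector:
    matchLabels:
      app: kafka           # assumed pod label
```

The KEP approach discussed above would presumably extend these semantics (or add a new object) so a budget could be expressed per topology domain.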
The Kubernetes project currently lacks enough contributors to adequately respond to all issues. This bot triages un-triaged issues according to its standard inactivity rules. Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
/remove-lifecycle-stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues. This bot triages un-triaged issues according to its standard inactivity rules. Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten
/remove-lifecycle-stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs. This bot triages issues according to its standard inactivity rules. Please send feedback to sig-contributor-experience at kubernetes/community.
/close not-planned
@k8s-triage-robot: Closing this issue, marking it as "Not Planned". In response to this: /close not-planned
/reopen
@jeromeinsf: You can't reopen an issue/PR unless you authored it or you are a collaborator. In response to this: /reopen
/reopen
@msau42: Reopened this issue. In response to this: /reopen
The Kubernetes project currently lacks enough contributors to adequately respond to all issues. This bot triages un-triaged issues according to its standard inactivity rules. Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues. This bot triages un-triaged issues according to its standard inactivity rules. Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten
/remove-lifecycle rotten
The Kubernetes project currently lacks enough contributors to adequately respond to all issues. This bot triages un-triaged issues according to its standard inactivity rules. Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
What would you like to be added?
Add the following features to the StatefulSet API for Kubernetes to make the management of stateful apps easier:
Why is this needed?
A key focus of most stateful applications is data availability and consistency. In service of this, many stateful applications replicate data and often have these replicas spread across multiple zones. Due to this zonal replication, these apps can often tolerate the disruptions of several pods if limited to the same zone.
We propose Kubernetes should provide native features to allow users to take advantage of the zonal replication for stateful apps when performing any kind of planned disruption ex. rollouts or node upgrades.
The two features we think can have a high impact on easing the management of such applications are:
The aim for both these features is to speed up deployment times without having an impact on availability.
There are several popular apps (Kafka, ZooKeeper, Redis, Elasticsearch, TiKV, and Cortex, to name a few) which would benefit from the addition of these features to Kubernetes; one would expect most distributed apps would. These apps often support zonal replicas by using some kind of topology or rack aware setting in their configuration.
Ideally these are native features in Kubernetes that can be developed to be robust, generic and be of use to several stateful apps.
More context for the proposal is captured in this public doc here
Similar requests have been made in the past: #104583
Current attempts at solving the problem
The current way most teams support such features is by writing custom operators. These operators are either designed to be generic, which is how we solve this at Shopify (and, from @marianafranco's links, it seems AWS does the same), or these features get baked into existing operators written to manage specific apps.
Another pattern we have seen, sometimes in conjunction with a custom operator, is to have multiple StatefulSet definitions for a single logical app, with each StatefulSet definition restricting pods to a single zone.
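A rough sketch of that multi-StatefulSet pattern, assuming the standard topology.kubernetes.io/zone node label; the names, counts, and image are illustrative, and one such object would be created per zone:

```yaml
# One StatefulSet per zone for a single logical app; repeated for each zone.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: kafka-us-east-1a            # hypothetical per-zone name
spec:
  replicas: 10
  serviceName: kafka
  selector:
    matchLabels:
      app: kafka
      zone: us-east-1a
  template:
    metadata:
      labels:
        app: kafka
        zone: us-east-1a
    spec:
      nodeSelector:
        topology.kubernetes.io/zone: us-east-1a   # pin this slice to one zone
      containers:
      - name: kafka
        image: registry.example.com/kafka:3.4     # placeholder image
```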
A simple case study of speeding up rollouts for Kafka using a custom operator at Shopify
Kafka is a distributed event store, deployed as a StatefulSet on Kubernetes. Users can read/write ordered events from topics, which are further made up of partitions. Kafka performs replication at this partition level.
Kafka, like many other apps, is rack/topology aware and spreads the partition replicas across the available zones. We usually have 3 replicas for each partition, spread across 3 zones.
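On the pod-placement side, keeping the broker pods themselves evenly spread across zones is usually done with topology spread constraints. A minimal pod-template snippet (an assumed configuration, not quoted from the issue) that would sit under the StatefulSet's spec.template.spec:

```yaml
# Keep broker pods spread evenly across zones so each zone holds a similar
# share of partition replicas.
topologySpreadConstraints:
- maxSkew: 1
  topologyKey: topology.kubernetes.io/zone
  whenUnsatisfiable: DoNotSchedule
  labelSelector:
    matchLabels:
      app: kafka          # hypothetical pod label
```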
And so, if 3 or more pods are down at random we run the risk of data loss and unavailable partitions. So for deploys we were limited to the default behaviour provided by RollingRestart of updating one pod at a time. This strategy worked for small clusters with fewer than 10 pods, but for our larger clusters with over 200 pods, deploys could easily take 20+ hours.

We realized we could significantly speed up the deploy process by updating multiple pods at the same time as long as they were part of the same zone. We could tolerate a single replica being unavailable for planned disruptions like rollouts and still keep the minimum quorum of 2.
To support this "zonal rollout" of a StatefulApp we built a custom controller, designed to be generic. The deploys for our largest clusters now take ~20 min on average, down from 20 hours.
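(Back-of-envelope arithmetic, with per-pod timing inferred rather than stated in the issue: 200 pods updated one at a time over 20 hours is roughly 6 minutes per pod; updating zone by zone across 3 zones means only 3 sequential waits of about that length, which is consistent with the ~20 minute figure.)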
Besides the operational win of being able to deploy rapidly and safely within working hours for Kafka, there are other factors specific to Kafka which motivate faster rollouts. Ex. Kafka consumer groups, which read from a topic in parallel (a topic is made of multiple partitions, which can be read from in parallel to maximize throughput), are sensitive to changes in partition status. These changes often cause the entire group to stop reading, rebalance and resume work. During a Kafka rollout, a smaller time window of disruption is preferred to minimize time wasted by these consumer groups rebalancing due to pod restarts.
The AWS use case for Cortex is another case study to consider and reads quite similarly to this one.
What does the operator look like?
The zonal deploy operator is written using the Kubebuilder SDK. At a high level:
- The StatefulSet is set to the OnDelete update strategy, so pods only pick up a new revision when the operator deletes them, which it does one zone at a time (see the sketch after this list).
- It waits on the readinessProbe for the updated pod to signal that it is running and successfully updated. For Kafka this check returns true once that pod is "caught up" with replication of data.
- The rollout can be paused, similar to how Deployments can be paused in Kubernetes.

From the same AWS use case, they seem to have implemented and open sourced their own operator which is very similar.
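A rough sketch of the StatefulSet shape such an operator drives; the names, image, and probe endpoint are assumptions, not Shopify's actual configuration:

```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: kafka
spec:
  replicas: 200
  serviceName: kafka
  selector:
    matchLabels:
      app: kafka
  updateStrategy:
    type: OnDelete          # pods only pick up the new revision when deleted,
                            # letting the operator delete them zone by zone
  template:
    metadata:
      labels:
        app: kafka
    spec:
      containers:
      - name: kafka
        image: registry.example.com/kafka:3.4   # placeholder image
        readinessProbe:
          httpGet:
            path: /ready    # hypothetical endpoint; succeeds only once the broker
            port: 8080      # has caught up on partition replication
          periodSeconds: 30
```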
It is worth noting that for Kafka's use case, restarting all pods in a zone works. For many apps it would be useful to have an option like MaxUnavailable to control how many pods should be restarted within a zone. Most would expect this to mean that if MaxUnavailable for that zone is 5, then 5 pods can be restarted in parallel, but the AWS controller went with an exponential approach to updating pods until the configured MaxUnavailable value for the zone is reached.

What would such improvements look like for zonal rollouts?
To add support for the ability to roll out pods by zone for a StatefulSet, I think these are the key semantics:
- Wait on the readinessProbe. This gives the pod/application the flexibility to signal the health of that pod or the application as a whole and provides the application control over the conditions to continue with the rollout, or run any custom commands, etc. if needed.

What could the API changes possibly look like:
- A new updateStrategy: ZonalUpdate, with the ability to define what label to use for zone information (sketched below).
- A maxUnavailablePerZone setting, and allow configuration on how zones may be picked (e.g. alphabetical).
- Or extend the RollingUpdate configuration to support a label selector which can be the zone.
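Purely to illustrate the first option, a hypothetical sketch of what the field shape could look like; none of these fields (ZonalUpdate, zoneLabel, maxUnavailablePerZone, zoneOrdering) exist in the StatefulSet API today, and the names are only stand-ins for the proposal:

```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: kafka
spec:
  updateStrategy:
    type: ZonalUpdate                            # hypothetical strategy from this proposal
    zonalUpdate:
      zoneLabel: topology.kubernetes.io/zone     # which label carries zone information
      maxUnavailablePerZone: 5                   # pods restarted in parallel within the active zone
      zoneOrdering: alphabetical                 # how the next zone is picked
  # ... rest of the spec (selector, serviceName, template) omitted
```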