Topology Aware Infrastructure Disruptions for Statefulsets #114010

Open
null-sleep opened this issue Nov 18, 2022 · 23 comments
Labels
kind/feature Categorizes issue or PR as related to a new feature. lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. sig/apps Categorizes an issue or PR as relevant to SIG Apps.

Comments

@null-sleep

null-sleep commented Nov 18, 2022

What would you like to be added?

Add the following features to the StatefulSet API for Kubernetes to make the management of stateful apps easier:

  1. The ability to roll out pods by zone for a StatefulSet.
  2. Allow PDBs to have a budget of disruptions per zone for a StatefulSet.

Why is this needed?

A key focus of most stateful applications is data availability and consistency. In service of this, many stateful applications replicate data and often have these replicas spread across multiple zones. Due to this zonal replication, these apps can often tolerate several pods being disrupted, as long as the disruption is limited to a single zone.

We propose that Kubernetes should provide native features to allow users to take advantage of this zonal replication for stateful apps when performing any kind of planned disruption, e.g. rollouts or node upgrades.

The two features we think can have a high impact on easing the management of such applications are:

  1. The ability to roll out pods by zone for a StatefulSet.
  2. Allow PDBs to have a budget of disruptions per zone for a StatefulSet (a hypothetical sketch follows this list).
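
For the second item, a purely hypothetical sketch of what a per-zone disruption budget could look like if the PodDisruptionBudget API were extended; the maxUnavailablePerZone and topologyKey fields below do not exist today and are only illustrative:

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: kafka-zonal-pdb
spec:
  selector:
    matchLabels:
      app: kafka
  # hypothetical fields, not part of the current PDB API:
  maxUnavailablePerZone: 1
  topologyKey: topology.kubernetes.io/zone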

The aim for both these features is to speed up deployment times without having an impact on availability.

There are several popular apps (Kafka, ZooKeeper, Redis, Elasticsearch, TiKV, and Cortex, to name a few) that would benefit from the addition of these features to Kubernetes; one would expect most distributed apps would. These apps often support zonal replicas through some kind of topology- or rack-aware setting in their configuration.

Ideally these would be native Kubernetes features, developed to be robust, generic, and useful to many stateful apps.

More context for the proposal is captured in this public doc here.
Similar requests have been made in the past: #104583

Current attempts at solving the problem

The current way most teams support such features is by writing custom operators. These operators are either designed to be generic (which is how we solve this at Shopify, and thanks to @marianafranco it seems AWS does the same), or the features get baked into existing operators written to manage specific apps.

Another pattern we have seen, sometimes in conjunction with a custom operator, is to have multiple StatefulSet definitions for a single logical app, with each StatefulSet restricting its pods to a single zone.

A simple case study of speeding up rollouts for Kafka using a custom operator at Shopify

Kafka is a distributed event store, deployed as a StatefulSet on Kubernetes. Users can read/write ordered events from topics, which are further made up of partitions. Kafka performs replication at the partition level.

Kafka, like many other apps, is rack/topology aware and spreads partition replicas across the available zones. We usually have 3 replicas for each partition, spread across 3 zones.

If 3 or more pods are down at random, we run the risk of data loss and unavailable partitions. So for deploys we were limited to the default rolling-restart behaviour of updating one pod at a time. This strategy worked for small clusters with fewer than 10 pods, but for our larger cluster with over 200 pods, deploys could easily take 20+ hours.

We realized we could significantly speed up the deploy process by updating multiple pods at the same time as long as they were part of the same zone. We could tolerate a single replica being unavailable for planned disruptions like rollouts and still keep the minimum quorum of 2.

To support this "zonal rollout" of a stateful app we built a custom controller, designed to be generic. Deploys for our largest clusters now take ~20 min on average, down from 20+ hours (roughly three zone-wide batches instead of 200+ sequential pod restarts).

Besides the operational win of being able to deploy rapidly and safely within working hours, there are other factors specific to Kafka that motivate faster rollouts. For example, Kafka consumer groups, which read from a topic in parallel (a topic is made of multiple partitions that can be read in parallel to maximize throughput), are sensitive to changes in partition status. These changes often cause the entire group to stop reading, rebalance, and resume work. During a Kafka rollout, a smaller window of disruption is preferred to minimize time wasted on consumer-group rebalances caused by pod restarts.

The AWS use case for Cortex is another case study to consider and reads quite similarly to this one.

What does the operator look like?

The zonal deploy operator is written using the Kubebuilder SDK. At a high level:

  • The operator watches a single StatefulSet for a change in revision and triggers a rollout on a new revision.
  • The StatefulSet is deployed using the OnDelete update strategy (see the sketch after this list).
  • Pods are always updated in the same order. This helps with rollbacks or fix-it-forward updates, since the most recently updated pods can be updated first.
  • All pods in a zone are updated at the same time. I would imagine further control over how many pods in that zone should be restarted at a time (e.g. 30% of pods) would be a useful configuration to have.
  • We rely on the readinessProbe for the updated pod to signal that it is running and successfully updated. For Kafka this check returns true once the pod is "caught up" with data replication.
  • If pods are unhealthy after restarting a zone, the operator does not make any further progress.
  • The operator can be paused to stop an in progress deploy, similar to how Deployments can be paused in Kubernetes.
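
For illustration, a minimal sketch of the StatefulSet shape such an operator drives; the OnDelete strategy and readinessProbe are standard apps/v1 fields, while the image and the check command are placeholders rather than our actual configuration:

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: kafka
spec:
  serviceName: kafka
  replicas: 100
  podManagementPolicy: Parallel
  updateStrategy:
    type: OnDelete          # the operator, not the built-in controller, decides when pods are updated
  selector:
    matchLabels:
      app: kafka
  template:
    metadata:
      labels:
        app: kafka
    spec:
      containers:
        - name: kafka
          image: example.com/kafka:latest       # placeholder image
          readinessProbe:
            exec:
              # hypothetical check: succeeds once the broker has caught up on replication
              command: ["/bin/check-replication-caught-up"]
            periodSeconds: 30
            failureThreshold: 3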

From the same AWS use case, they seem to have implemented and open-sourced their own operator, which is very similar.

It is worth noting that for Kafka's use case, restarting all pods in a zone works. For many apps it would be useful to have an option like MaxUnavailable to control how many pods should be restarted within a zone. Most would expect this to mean that if MaxUnavailable for that zone is 5, then 5 pods can be restarted in parallel, but the AWS controller went with an exponential approach to updating pods until the configured MaxUnavailable value for the zone is reached.

What would such improvements look like for zonal rollouts?

To add support for rolling out pods by zone for a StatefulSet, I think these are the key semantics:

  • Zones and the pods within them are always updated in the same order (e.g. alphabetically).
  • All pods in a zone have to be updated before moving to the next zone.
  • Make good use of the readinessProbe. This gives the pod/application the flexibility to signal the health of that pod or of the application as a whole, and gives the application control over the conditions for continuing the rollout, or the ability to run custom checks if needed.
  • If pods are unhealthy after restarting them in a zone, the controller stalls.
  • Ability to pause rollouts like with deployments.

What could the API changes possibly look like:

  1. Add a new updateStrategy type, ZonalUpdate, with the ability to define which label to use for zone information.
apiVersion: apps/v1
kind: StatefulSet
spec:
  paused: false
  replicas: 100
  serviceName: kafka
  podManagementPolicy: Parallel
  updateStrategy:
    type: ZonalUpdate
    zonalUpdate:
      maxUnavailable: 5
      topologyKey: topology.kubernetes.io/zone
  template:
  2. Similar to the first option, but more explicit with maxUnavailablePerZone and allowing configuration of how zones are ordered (e.g. alphabetically).
apiVersion: apps/v1
kind: StatefulSet
spec:
  paused: false
  replicas: 100
  serviceName: kafka
  podManagementPolicy: Parallel
  updateStrategy:
    type: ZonalUpdate
    zonalUpdate:
      maxUnavailablePerZone: 25%
      zoneOrder: alphabetical
      topologyKey: topology.kubernetes.io/zone
  template:
  3. Expand the existing RollingUpdate configuration to support a label selector, which can be the zone.
apiVersion: apps/v1
kind: StatefulSet
spec:
  paused: false
  replicas: 100
  serviceName: kafka
  podManagementPolicy: Parallel
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 25%
      partitionType: Ordinal | Label Selector
      partition: topology.kubernetes.io/zone
  template:
@null-sleep null-sleep added the kind/feature Categorizes issue or PR as related to a new feature. label Nov 18, 2022
@k8s-ci-robot k8s-ci-robot added needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Nov 18, 2022
@k8s-ci-robot
Contributor

@null-sleep: This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@null-sleep
Author

/sig sig-apps

@k8s-ci-robot
Contributor

@null-sleep: The label(s) sig/sig-apps cannot be applied, because the repository doesn't have them.

In response to this:

/sig sig-apps

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@null-sleep
Author

/sig apps

@k8s-ci-robot k8s-ci-robot added sig/apps Categorizes an issue or PR as relevant to SIG Apps. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Nov 18, 2022
@marianafranco

marianafranco commented Nov 23, 2022

Currently custom solutions like operators/controllers are being used to take advantage of this common pattern.

In case it helps, here is one example from AWS: https://github.com/aws/zone-aware-controllers-for-k8s

And more details on why the ZoneDisruptionBudgets (ZDB) controller was created can be found in the following blog post: https://aws.amazon.com/blogs/opensource/speed-up-highly-available-deployments-on-kubernetes/

@null-sleep null-sleep changed the title [WIP] Topology Aware Infrastructure Disruptions for Statefulsets Topology Aware Infrastructure Disruptions for Statefulsets Nov 24, 2022
@null-sleep
Author

null-sleep commented Nov 24, 2022

@marianafranco Thanks for sharing the links. Reading your blog post, your motivations and solution are quite similar to what we did at Shopify.

@pwschuurman
Contributor

I have a use case for this feature to allow for faster cross cluster migrations. With #112744 (kubernetes/enhancements#3336), StatefulSet slices can be scaled down and scaled up across clusters. With the StatefulSetStartOrdinal feature, this is possible on an ordinal level, but not a zone level. For applications that can support it, moving a zone at a time can greatly speed up migration, similar to the in-cluster case.
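
For context, a rough sketch of how that ordinal-level split looks today with spec.ordinals.start (assuming the StatefulSetStartOrdinal feature is enabled; the names and numbers are illustrative). The ZonalUpdate sketch from the proposal, quoted below, would extend this idea to a zone level:

# Source cluster: keeps ordinals 0-49 while scaling down
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: kafka
spec:
  replicas: 50
  ordinals:
    start: 0
---
# Destination cluster: owns ordinals 50-99
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: kafka
spec:
  replicas: 50
  ordinals:
    start: 50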

apiVersion: apps/v1
kind: StatefulSet
spec:
  paused: false
  replicas: 100
  serviceName: kafka
  podManagementPolicy: Parallel
  updateStrategy:
    type: ZonalUpdate
    zonalUpdate:
      maxUnavailablePerZone: 25%
      zoneOrder: alphabetical
      topologyKey: topology.kubernetes.io/zone
  template:

Having a way to specify a partition for the ZonalUpdate strategy would be ideal to ensure a deterministic roll forward and rollback. This would probably need to incorporate a topology value (e.g. order zones alphabetically, stop at zone b, with 10% of replicas in that zone migrated). So the StatefulSet would become somewhat topology aware.
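
A purely hypothetical extension along those lines; none of the partition fields below exist, they only illustrate the deterministic stop point described above:

  updateStrategy:
    type: ZonalUpdate
    zonalUpdate:
      topologyKey: topology.kubernetes.io/zone
      zoneOrder: alphabetical
      # hypothetical fields: stop the rollout once 10% of replicas in zone "b" have been migrated
      partitionZone: zone-b
      partitionZoneMaxReplicasPercent: 10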

There are probably some safeguards we want to enforce with this feature, as it doesn't make sense for certain configurations. For example, when using topology spread constraints with maxSkew and whenUnsatisfiable: DoNotSchedule: if an entire zone is updated at once, it may prevent a replica in a different zone from being scheduled until the skew is satisfied (e.g. on pod evict and re-create). This can lead to an unexpected availability reduction, depending on the replication factor of the application.
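
For reference, the spread configuration being described is the standard topologySpreadConstraints field on the pod template; a minimal sketch of the kind of constraint that could conflict with updating a whole zone at once:

  template:
    spec:
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: DoNotSchedule   # recreated pods may stay Pending if scheduling them would violate maxSkew
          labelSelector:
            matchLabels:
              app: kafka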

@marianafranco

Other similar requests made in the past: #41442 and kubernetes/community#2045

@pwschuurman
Contributor

Tying this back to existing work in this space: the proposal of modifying the StatefulSet spec is characterized by the New Update Strategy alternative from the "topology-aware workload controllers" KEP.

The other approach (the main proposal of the topology-aware workload controllers KEP) is to modify the PDB API and have workload controllers use the PDB API to control upgrades, rather than the rollingUpdate strategy.

Both StatefulSet and Deployment have update strategy fields. This would allow for a new update strategy that defers to the PDB API to identify which pods should be updated (evicted and re-created). Deployment may be more challenging to orchestrate, however, as it does not actuate pods directly.

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label May 14, 2023
@ashrayjain

/remove-lifecycle-stale

@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle rotten
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Jun 18, 2023
@jeromeinsf

/remove-lifecycle-stale

@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

@k8s-ci-robot
Contributor

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot closed this as not planned Jul 19, 2023
@jeromeinsf

/reopen

@k8s-ci-robot
Contributor

@jeromeinsf: You can't reopen an issue/PR unless you authored it or you are a collaborator.

In response to this:

/reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@msau42
Member

msau42 commented Nov 2, 2023

/reopen
/remove-lifecycle rotten

@k8s-ci-robot
Contributor

@msau42: Reopened this issue.

In response to this:

/reopen
/remove-lifecycle rotten

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot reopened this Nov 2, 2023
@k8s-ci-robot k8s-ci-robot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Nov 2, 2023
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 31, 2024
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle rotten
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Mar 1, 2024
@ashrayjain

/remove-lifecycle rotten

@k8s-ci-robot k8s-ci-robot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Mar 1, 2024
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label May 30, 2024