Topology Aware Infrastructure Disruptions for Statefulsets #114010

Open
null-sleep opened this issue Nov 18, 2022 · 23 comments
Labels
kind/feature Categorizes issue or PR as related to a new feature. lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. sig/apps Categorizes an issue or PR as relevant to SIG Apps.

Comments

@null-sleep

null-sleep commented Nov 18, 2022

What would you like to be added?

Add the following features to the StatefulSet API for Kubernetes to make the management of stateful apps easier:

  1. The ability to roll out pods by zone for a StatefulSet.
  2. Allow PDBs to have a budget of disruptions per zone for a StatefulSet.

Why is this needed?

A key focus of most stateful applications is data availability and consistency. In service of this, many stateful applications replicate data and often have these replicas spread across multiple zones. Due to this zonal replication, these apps can often tolerate several pods being disrupted, as long as the disruption is limited to a single zone.

We propose that Kubernetes should provide native features to allow users to take advantage of this zonal replication for stateful apps when performing any kind of planned disruption, e.g. rollouts or node upgrades.

The two features we think can have a high impact on easing the management of such applications are:

  1. The ability to roll out pods by zone for a StatefulSet.
  2. Allow PDBs to have a budget of disruptions per zone for a StatefulSet (a hypothetical sketch follows this list).
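
For the second item, a purely hypothetical sketch of what a per-zone disruption budget could look like if the PodDisruptionBudget API were extended; the maxUnavailablePerZone and topologyKey fields below do not exist today and are only illustrative:

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: kafka-zonal-pdb
spec:
  selector:
    matchLabels:
      app: kafka
  # hypothetical fields, not part of the current PDB API:
  maxUnavailablePerZone: 1
  topologyKey: topology.kubernetes.io/zone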

The aim for both these features is to speed up deployment times without having an impact on availability.

There are several popular apps (Kafka, ZooKeeper, Redis, Elasticsearch, TiKV, and Cortex, to name a few) that would benefit from the addition of these features to Kubernetes; one would expect most distributed apps would. These apps often support zonal replicas through some kind of topology- or rack-aware setting in their configuration.

Ideally these would be native Kubernetes features, developed to be robust, generic, and useful to many stateful apps.

More context for the proposal is captured in this public doc here.
Similar requests have been made in the past: #104583

Current attempts at solving the problem

The current way most teams support such features is by writing custom operators. These operators are either designed to be generic (which is how we solve this at Shopify, and thanks to @marianafranco it seems AWS does the same), or the features get baked into existing operators written to manage specific apps.

Another pattern we have seen, sometimes in conjunction with a custom operator, is to have multiple StatefulSet definitions for a single logical app, with each StatefulSet restricting its pods to a single zone.

A simple case study of speeding up rollouts for Kafka using a custom operator at Shopify

Kafka is a distributed event store, deployed as a StatefulSet on Kubernetes. Users can read/write ordered events from topics, which are further made up of partitions. Kafka performs replication at the partition level.

Kafka, like many other apps, is rack/topology aware and spreads partition replicas across the available zones. We usually have 3 replicas for each partition, spread across 3 zones.

If 3 or more pods are down at random, we run the risk of data loss and unavailable partitions. So for deploys we were limited to the default rolling-restart behaviour of updating one pod at a time. This strategy worked for small clusters with fewer than 10 pods, but for our larger cluster with over 200 pods, deploys could easily take 20+ hours.

We realized we could significantly speed up the deploy process by updating multiple pods at the same time as long as they were part of the same zone. We could tolerate a single replica being unavailable for planned disruptions like rollouts and still keep the minimum quorum of 2.

To support this "zonal rollout" of a stateful app we built a custom controller, designed to be generic. Deploys for our largest clusters now take ~20 min on average, down from 20+ hours (roughly three zone-wide batches instead of 200+ sequential pod restarts).

Besides the operational win of being able to deploy rapidly and safely within working hours, there are other factors specific to Kafka that motivate faster rollouts. For example, Kafka consumer groups, which read from a topic in parallel (a topic is made of multiple partitions that can be read in parallel to maximize throughput), are sensitive to changes in partition status. These changes often cause the entire group to stop reading, rebalance, and resume work. During a Kafka rollout, a smaller window of disruption is preferred to minimize time wasted on consumer-group rebalances caused by pod restarts.

The AWS use case for Cortex is another case study to consider and reads quite similarly to this one.

What does the operator look like?

The zonal deploy operator is written using the Kubebuilder SDK. At a high level:

  • The operator watches a single StatefulSet for a change in revision and triggers a rollout on a new revision.
  • The StatefulSet is deployed using the OnDelete update strategy (see the sketch after this list).
  • Pods are always updated in the same order. This helps with rollbacks or fix-it-forward updates, since the most recently updated pods can be updated first.
  • All pods in a zone are updated at the same time. I would imagine further control over how many pods in that zone should be restarted at a time (e.g. 30% of pods) would be a useful configuration to have.
  • We rely on the readinessProbe for the updated pod to signal that it is running and successfully updated. For Kafka this check returns true once the pod is "caught up" with data replication.
  • If pods are unhealthy after restarting a zone, the operator does not make any further progress.
  • The operator can be paused to stop an in progress deploy, similar to how Deployments can be paused in Kubernetes.
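
For illustration, a minimal sketch of the StatefulSet shape such an operator drives; the OnDelete strategy and readinessProbe are standard apps/v1 fields, while the image and the check command are placeholders rather than our actual configuration:

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: kafka
spec:
  serviceName: kafka
  replicas: 100
  podManagementPolicy: Parallel
  updateStrategy:
    type: OnDelete          # the operator, not the built-in controller, decides when pods are updated
  selector:
    matchLabels:
      app: kafka
  template:
    metadata:
      labels:
        app: kafka
    spec:
      containers:
        - name: kafka
          image: example.com/kafka:latest       # placeholder image
          readinessProbe:
            exec:
              # hypothetical check: succeeds once the broker has caught up on replication
              command: ["/bin/check-replication-caught-up"]
            periodSeconds: 30
            failureThreshold: 3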

From the same AWS use case, they seem to have implemented and open-sourced their own operator, which is very similar.

It is worth noting that for Kafka's use case, restarting all pods in a zone works. For many apps it would be useful to have an option like MaxUnavailable to control how many pods should be restarted within a zone. Most would expect this to mean that if MaxUnavailable for that zone is 5, then 5 pods can be restarted in parallel, but the AWS controller went with an exponential approach to updating pods until the configured MaxUnavailable value for the zone is reached.

What would such improvements look like for zonal rollouts?

To add support for rolling out pods by zone for a StatefulSet, I think these are the key semantics:

  • Zones and the pods within them are always updated in the same order (e.g. alphabetically).
  • All pods in a zone have to be updated before moving to the next zone.
  • Make good use of the readinessProbe. This gives the pod/application the flexibility to signal the health of that pod or of the application as a whole, and gives the application control over the conditions for continuing the rollout, or the ability to run custom checks if needed.
  • If pods are unhealthy after restarting them in a zone, the controller stalls.
  • Ability to pause rollouts like with deployments.

What could the API changes possibly look like:

  1. Add a new updateStrategy type, ZonalUpdate, with the ability to define which label to use for zone information.
apiVersion: apps/v1
kind: StatefulSet
spec:
  paused: false
  replicas: 100
  serviceName: kafka
  podManagementPolicy: Parallel
  updateStrategy:
    type: ZonalUpdate
    zonalUpdate:
      maxUnavailable: 5
      topologyKey: topology.kubernetes.io/zone
  template:
  2. Similar to the first option, but more explicit with maxUnavailablePerZone and allowing configuration of how zones are ordered (e.g. alphabetically).
apiVersion: apps/v1
kind: StatefulSet
spec:
  paused: false
  replicas: 100
  serviceName: kafka
  podManagementPolicy: Parallel
  updateStrategy:
    type: ZonalUpdate
    zonalUpdate:
      maxUnavailablePerZone: 25%
      zoneOrder: alphabetical
      topologyKey: topology.kubernetes.io/zone
  template:
  3. Expand the existing RollingUpdate configuration to support a label selector, which can be the zone.
apiVersion: apps/v1
kind: StatefulSet
spec:
  paused: false
  replicas: 100
  serviceName: kafka
  podManagementPolicy: Parallel
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 25%
      partitionType: Ordinal | Label Selector
      partition: topology.kubernetes.io/zone
  template:
@null-sleep null-sleep added the kind/feature Categorizes issue or PR as related to a new feature. label Nov 18, 2022
@k8s-ci-robot k8s-ci-robot added needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Nov 18, 2022
@k8s-ci-robot
Contributor

@null-sleep: This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@null-sleep
Author

/sig sig-apps

@k8s-ci-robot
Contributor

@null-sleep: The label(s) sig/sig-apps cannot be applied, because the repository doesn't have them.

In response to this:

/sig sig-apps

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@null-sleep
Author

/sig apps

@k8s-ci-robot k8s-ci-robot added sig/apps Categorizes an issue or PR as relevant to SIG Apps. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Nov 18, 2022
@marianafranco

marianafranco commented Nov 23, 2022

Currently custom solutions like operators/controllers are being used to take advantage of this common pattern.

In case it helps, here is one example from AWS: https://github.com/aws/zone-aware-controllers-for-k8s

And more details on why the ZoneDisruptionBudgets (ZDB) controller was created can be found in the following blog post: https://aws.amazon.com/blogs/opensource/speed-up-highly-available-deployments-on-kubernetes/

@null-sleep null-sleep changed the title [WIP] Topology Aware Infrastructure Disruptions for Statefulsets Topology Aware Infrastructure Disruptions for Statefulsets Nov 24, 2022
@null-sleep
Author

null-sleep commented Nov 24, 2022

@marianafranco Thanks for sharing the links. Reading your blog post, your motivations and solution are quite similar to what we did at Shopify.

@pwschuurman
Contributor

I have a use case for this feature to allow for faster cross cluster migrations. With #112744 (kubernetes/enhancements#3336), StatefulSet slices can be scaled down and scaled up across clusters. With the StatefulSetStartOrdinal feature, this is possible on an ordinal level, but not a zone level. For applications that can support it, moving a zone at a time can greatly speed up migration, similar to the in-cluster case.
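
For context, a rough sketch of how that ordinal-level split looks today with spec.ordinals.start (assuming the StatefulSetStartOrdinal feature is enabled; the names and numbers are illustrative). The ZonalUpdate sketch from the proposal, quoted below, would extend this idea to a zone level:

# Source cluster: keeps ordinals 0-49 while scaling down
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: kafka
spec:
  replicas: 50
  ordinals:
    start: 0
---
# Destination cluster: owns ordinals 50-99
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: kafka
spec:
  replicas: 50
  ordinals:
    start: 50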

apiVersion: apps/v1
kind: StatefulSet
spec:
  paused: false
  replicas: 100
  serviceName: kafka
  podManagementPolicy: Parallel
  updateStrategy:
    type: ZonalUpdate
    zonalUpdate:
      maxUnavailablePerZone: 25%
      zoneOrder: alphabetical
      topologyKey: topology.kubernetes.io/zone
  template:

Having a way to specify a partition for the ZonalUpdate strategy would be ideal to ensure a deterministic roll forward and rollback. This would probably need to incorporate a topology value (e.g. order zones alphabetically, stop at zone b, with 10% of replicas in that zone migrated). So the StatefulSet would become somewhat topology aware.
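
A purely hypothetical extension along those lines; none of the partition fields below exist, they only illustrate the deterministic stop point described above:

  updateStrategy:
    type: ZonalUpdate
    zonalUpdate:
      topologyKey: topology.kubernetes.io/zone
      zoneOrder: alphabetical
      # hypothetical fields: stop the rollout once 10% of replicas in zone "b" have been migrated
      partitionZone: zone-b
      partitionZoneMaxReplicasPercent: 10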

There are probably some safeguards we want to enforce with this feature, as it doesn't make sense for certain configurations. For example, when using topology spread constraints with maxSkew and whenUnsatisfiable: DoNotSchedule: if an entire zone is updated at once, it may prevent a replica in a different zone from being scheduled until the skew is satisfied (e.g. on pod evict and re-create). This can lead to an unexpected availability reduction, depending on the replication factor of the application.
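
For reference, the spread configuration being described is the standard topologySpreadConstraints field on the pod template; a minimal sketch of the kind of constraint that could conflict with updating a whole zone at once:

  template:
    spec:
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: DoNotSchedule   # recreated pods may stay Pending if scheduling them would violate maxSkew
          labelSelector:
            matchLabels:
              app: kafka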

@marianafranco

Other similar requests made in the past: #41442 and kubernetes/community#2045

@pwschuurman
Contributor

Tying this back to existing work in this space: the proposal of modifying the StatefulSet spec is characterized by the New Update Strategy alternative from the "topology-aware workload controllers" KEP.

The other approach (the main proposal of the topology-aware workload controllers KEP) is to modify the PDB API and have workload controllers use the PDB API to control upgrades, rather than the rollingUpdate strategy.

Both StatefulSet and Deployment have update strategy fields. This would allow for a new update strategy that defers to the PDB API to identify which pods should be updated (evicted and re-created). Deployment may be more challenging to orchestrate, however, as it does not actuate pods directly.

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label May 14, 2023
@ashrayjain

/remove-lifecycle-stale

@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle rotten
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Jun 18, 2023
@jeromeinsf

/remove-lifecycle-stale

@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

@k8s-ci-robot
Contributor

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot closed this as not planned Jul 19, 2023
@jeromeinsf

/reopen

@k8s-ci-robot
Contributor

@jeromeinsf: You can't reopen an issue/PR unless you authored it or you are a collaborator.

In response to this:

/reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@msau42
Member

msau42 commented Nov 2, 2023

/reopen
/remove-lifecycle rotten

@k8s-ci-robot
Contributor

@msau42: Reopened this issue.

In response to this:

/reopen
/remove-lifecycle rotten

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot reopened this Nov 2, 2023
@k8s-ci-robot k8s-ci-robot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Nov 2, 2023
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 31, 2024
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle rotten
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Mar 1, 2024
@ashrayjain

/remove-lifecycle rotten

@k8s-ci-robot k8s-ci-robot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Mar 1, 2024
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label May 30, 2024