diff --git a/contributors/design-proposals/statefulset-update.md b/contributors/design-proposals/statefulset-update.md
new file mode 100644
index 00000000000..c880186127b
--- /dev/null
+++ b/contributors/design-proposals/statefulset-update.md
@@ -0,0 +1,828 @@
+# StatefulSet Updates
+
+**Author**: kow3ns@
+
+**Status**: Proposal
+
+## Abstract
+Currently (as of Kubernetes 1.6), `.Spec.Replicas` and
+`.Spec.Template.Containers` are the only mutable fields of the
+StatefulSet API object. Updating `.Spec.Replicas` will scale the number of Pods
+in the StatefulSet. Updating `.Spec.Template.Containers` causes all subsequently
+created Pods to have the specified containers. In order to cause the
+StatefulSet controller to apply its updated `.Spec`, users must manually delete
+each Pod. This manual method of applying updates is error prone. The
+implementation of this proposal will add the capability to perform ordered,
+automated, sequential updates.
+
+## Affected Components
+1. API Server
+1. Kubectl
+1. StatefulSet Controller
+1. StatefulSetSpec API object
+1. StatefulSetStatus API object
+
+## Use Cases
+Upon implementation, this design will support the following in-scope use cases,
+and it will not rule out the future implementation of the out-of-scope use
+cases.
+
+### In Scope
+- As the administrator of a stateful application, in order to vertically scale
+my application, I want to update resource limits or requested resources.
+- As the administrator of a stateful application, in order to deploy critical
+security updates, break-fix patches, and feature releases, I want to update
+container images.
+- As the administrator of a stateful application, in order to update my
+application's configuration, I want to update environment variables, container
+entry point commands or parameters, or configuration files.
+- As the administrator of the logging and monitoring infrastructure for my
+organization, in order to add logging and monitoring sidecars, I want to patch
+a Pod's containers to add images.
+
+### Out of Scope
+- As the administrator of a stateful application, in order to increase the
+application's storage capacity, I want to update PersistentVolumes.
+- As the administrator of a stateful application, in order to update the
+network configuration of the application, I want to update Services and
+container ports in a consistent way.
+- As the administrator of a stateful application, when I scale my application
+horizontally, I want associated PodDisruptionBudgets to be adjusted to
+compensate for the application's scaling.
+
+## Assumptions
+ - StatefulSet update must support singleton StatefulSets. However, an update in
+   this case will cause a temporary outage. This is acceptable, as a single
+   process application is, by definition, not highly available.
+ - Disruption in Kubernetes is controlled by PodDisruptionBudgets. As
+   StatefulSet updates progress one Pod at a time, and only occur when all
+   other Pods have a Status of Running and a Ready Condition, they cannot
+   violate reasonable PodDisruptionBudgets.
+ - Without priority and preemption, there is no guarantee that an update will
+   not block due to a loss of capacity or due to the scheduling of another Pod
+   between Pod termination and Pod creation. This is mitigated by blocking the
+   update when a Pod fails to schedule. Remediation will require operator
+   intervention. This implementation is no worse than the current behavior with
+   respect to eviction.
+ - We will eventually implement a signal that is delivered to Pods to indicate
+   the
+   [reason for termination](https://github.com/kubernetes/community/pull/541).
+ - StatefulSet updates will use the methodology outlined in the
+   [controller history](https://github.com/kubernetes/community/pull/594) proposal
+   for version tracking, update detection, and rollback detection.
+   This will be a general implementation, usable for any Pod in a Kubernetes
+   cluster. It is, therefore, out of scope to design such a mechanism here.
+ - Kubelet does not support resizing a container's resources without terminating
+   the Pod. In place resource reallocation is out of scope for this design.
+   Vertical scaling must be performed destructively.
+ - The primary means of configuration update will be configuration files,
+   command line flags, environment variables, or ConfigMaps consumed as one
+   of the former.
+ - In place configuration update via SIGHUP is not universally
+   supported, and Kubelet currently provides no mechanism to perform it. Pod
+   reconfiguration will be performed destructively.
+ - Stateful applications are likely to evolve wire protocols and storage formats
+   between versions. In most cases, when updating the containers of the
+   application's Pods, it will not be safe to roll back or forward to an
+   arbitrary version. Controller based Pod update should work well when rolling
+   out an update, or performing a rollback, between two specific revisions of
+   the controlled API object. This is how Deployment functions, and this
+   property is, perhaps, even more critical for stateful applications.
+
+## Requirements
+This design is based on the following requirements.
+- Users must be able to update the containers of a StatefulSet's Pods.
+  - Updates to container commands, images, resources, and configuration must be
+    supported.
+- The update must progress in a sequential, deterministic order and respect the
+  StatefulSet
+  [identity](https://kubernetes.io/docs/concepts/abstractions/controllers/statefulsets/#pod-identity)
+  and
+  [deployment and scaling](https://kubernetes.io/docs/concepts/abstractions/controllers/statefulsets/#deployment-and-scaling-guarantee)
+  guarantees.
+- A failed update must halt.
+- Users must be able to roll back an update.
+- Users must be able to roll forward to fix a failing/failed update.
+- Users must be able to view the status of an update.
+- Users should be able to view a bounded history of the updates that have been
+applied to the StatefulSet.
+
+## API Objects
+
+The following modifications will be made to the StatefulSetSpec API object.
+
+```go
+// StatefulSetUpdateStrategy indicates the strategy that the StatefulSet
+// controller will use to perform updates. It includes any additional parameters
+// necessary to perform the update for the indicated strategy.
+type StatefulSetUpdateStrategy struct {
+	// Type indicates the type of the StatefulSetUpdateStrategy.
+	Type StatefulSetUpdateStrategyType
+	// Partition is used to communicate the ordinal at which to partition
+	// the StatefulSet when Type is PartitionStatefulSetStrategyType. This
+	// value must be set when Type is PartitionStatefulSetStrategyType,
+	// and it must be nil otherwise.
+	Partition *PartitionStatefulSetStrategy
+}
+
+// StatefulSetUpdateStrategyType is a string enumeration type that enumerates
+// all possible update strategies for the StatefulSet controller.
+type StatefulSetUpdateStrategyType string
+
+const (
+	// PartitionStatefulSetStrategyType indicates that updates will only be
+	// applied to a partition of the StatefulSet. This is useful for canaries
+	// and phased roll outs. When a scale operation is performed with this
+	// strategy, new Pods will be created from the updated specification.
+	PartitionStatefulSetStrategyType StatefulSetUpdateStrategyType = "Partition"
+	// RollingUpdateStatefulSetStrategyType indicates that updates will be
+	// applied to all Pods in the StatefulSet with respect to the StatefulSet
+	// ordering constraints. When a scale operation is performed with this
+	// strategy, new Pods will be created from the updated specification.
+	RollingUpdateStatefulSetStrategyType StatefulSetUpdateStrategyType = "RollingUpdate"
+	// OnDeleteStatefulSetStrategyType triggers the legacy behavior. Version
+	// tracking and ordered rolling restarts are disabled. Pods are recreated
+	// from the StatefulSetSpec when they are manually deleted. When a scale
+	// operation is performed with this strategy, new Pods will be created
+	// from the current specification.
+	OnDeleteStatefulSetStrategyType StatefulSetUpdateStrategyType = "OnDelete"
+)
+
+// PartitionStatefulSetStrategy contains the parameters used with the
+// PartitionStatefulSetStrategyType.
+type PartitionStatefulSetStrategy struct {
+	// Ordinal indicates the ordinal at which the StatefulSet should be
+	// partitioned.
+	Ordinal int32
+}
+
+type StatefulSetSpec struct {
+	// Replicas, Selector, Template, VolumeClaimsTemplate, and ServiceName
+	// omitted for brevity.
+
+	// UpdateStrategy indicates the StatefulSetUpdateStrategy that will be
+	// employed to update Pods in the StatefulSet when a revision is made to
+	// Template or VolumeClaimsTemplate.
+	UpdateStrategy StatefulSetUpdateStrategy `json:"updateStrategy,omitempty"`
+
+	// RevisionHistoryLimit is the maximum number of revisions that will
+	// be maintained in the StatefulSet's revision history. The revision history
+	// consists of all revisions not represented by a currently applied
+	// StatefulSetSpec version. The default value is 2.
+	RevisionHistoryLimit *int32 `json:"revisionHistoryLimit,omitempty"`
+}
+```
+
+The following modifications will be made to the StatefulSetStatus API object.
+
+```go
+type StatefulSetStatus struct {
+	// ObservedGeneration and Replicas fields are omitted for brevity.
+
+	// CurrentRevision, if not empty, indicates the version of the
+	// PodTemplateSpec, VolumeClaimsTemplate tuple used to generate Pods in
+	// the sequence [0,CurrentReplicas).
+	CurrentRevision string `json:"currentRevision,omitempty"`
+
+	// UpdateRevision, if not empty, indicates the version of the
+	// PodTemplateSpec, VolumeClaimsTemplate tuple used to generate Pods in
+	// the sequence [Replicas-UpdatedReplicas,Replicas).
+	UpdateRevision string `json:"updateRevision,omitempty"`
+
+	// ReadyReplicas is the current number of Pods, created by the StatefulSet
+	// controller, that have a Status of Running and a Ready Condition.
+	ReadyReplicas int32 `json:"readyReplicas,omitempty"`
+
+	// CurrentReplicas is the number of Pods created by the StatefulSet
+	// controller from the PodTemplateSpec, VolumeClaimsTemplate tuple indicated
+	// by CurrentRevision.
+	CurrentReplicas int32 `json:"currentReplicas,omitempty"`
+
+	// UpdatedReplicas is the number of Pods created by the StatefulSet
+	// controller from the PodTemplateSpec, VolumeClaimsTemplate tuple indicated
+	// by UpdateRevision.
+	UpdatedReplicas int32 `json:"updatedReplicas,omitempty"`
+}
+```
+
+Additionally, we introduce the following constant.
+
+```go
+// StatefulSetRevisionLabel is the label used by the StatefulSet controller to
+// track which version of a StatefulSet's StatefulSetSpec was used to generate
+// a Pod.
+const StatefulSetRevisionLabel = "statefulset.kubernetes.io/revision"
+```
+
+## StatefulSet Controller
+The StatefulSet controller will watch for modifications to StatefulSet and Pod
+API objects. When a StatefulSet is created or updated, or when one
+of the Pods in a StatefulSet is updated or deleted, the StatefulSet
+controller will attempt to create, update, or delete Pods to conform the
+current state of the system to the user declared [target state](#target-state).
+
+### Revised Controller Algorithm
+The StatefulSet controller will use the following algorithm to continue to
+make progress toward the user declared [target state](#target-state) while
+respecting the controller's
+[identity](https://kubernetes.io/docs/concepts/abstractions/controllers/statefulsets/#pod-identity)
+and
+[deployment and scaling](https://kubernetes.io/docs/concepts/abstractions/controllers/statefulsets/#deployment-and-scaling-guarantee)
+guarantees. The StatefulSet controller will use the technique proposed in
+[Controller History](https://github.com/kubernetes/community/pull/594) to
+snapshot and version its [target Object state](#target-pod-state). A sketch of
+a single reconciliation pass follows the list below.
+
+1. The controller will reconstruct the
+[revision history](#history-reconstruction) of the StatefulSet.
+1. The controller will
+[process any updates to its StatefulSetSpec](#specification-updates) to
+ensure that the StatefulSet's revision history is consistent with the user
+declared desired state.
+1. The controller will select all Pods in the StatefulSet, filter any Pods not
+owned by the StatefulSet, and sort the remaining Pods in ordinal order.
+1. For all created Pods, the controller will perform any necessary
+[non-destructive state reconciliation](#pod-state-reconciliation).
+1. If any Pods with ordinals in the sequence `[0,.Spec.Replicas)` have not been
+created, for the Pod corresponding to the lowest such ordinal, the controller
+will create the Pod with the declared [target Pod state](#target-pod-state).
+1. If all Pods in the sequence `[0,.Spec.Replicas)` have been created, but any
+do not have a Ready Condition, the StatefulSet controller will wait for these
+Pods to either become Ready or to be completely deleted.
+1. If all Pods in the sequence `[0,.Spec.Replicas)` have a Ready Condition, and
+if `.Spec.Replicas` is less than `.Status.Replicas`, the controller will delete
+the Pod corresponding to the largest ordinal. This implies that scaling takes
+precedence over Pod updates.
+1. If all Pods in the sequence `[0,.Spec.Replicas)` have a Status of Running and
+a Ready Condition, if `.Spec.Replicas` is equal to `.Status.Replicas`, and if
+there are Pods that do not match their [target Pod state](#target-pod-state),
+the Pod with the largest ordinal in that set will be deleted.
+1. If the StatefulSet controller has achieved the
+[declared target state](#target-state), the StatefulSet controller will
+[complete any in progress updates](#update-completion).
+1. The controller will [report its status](#status-reporting).
+1. The controller will perform any necessary
+[maintenance of its revision history](#history-maintenance).
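+
+The precedence encoded in these steps can be illustrated with a short sketch.
+The following is a minimal, self-contained Go illustration (not the
+controller's actual code) of how a single reconciliation pass would choose one
+mutation under the `RollingUpdate` strategy; the `pod` type, the `nextAction`
+function, and the revision strings are hypothetical stand-ins for the real API
+objects.
+
+```go
+package main
+
+import "fmt"
+
+// pod is a simplified stand-in for a StatefulSet Pod: its ordinal, the
+// revision it was created from, and whether it is Running and Ready.
+type pod struct {
+	ordinal  int32
+	revision string
+	ready    bool
+}
+
+// nextAction sketches one pass of the algorithm above. pods must be sorted in
+// ordinal order and contain only Pods owned by the StatefulSet.
+func nextAction(pods []pod, replicas int32, updateRevision string) string {
+	// Create the Pod with the lowest missing ordinal in [0, replicas).
+	for i := int32(0); i < replicas; i++ {
+		if int(i) >= len(pods) || pods[i].ordinal != i {
+			return fmt.Sprintf("create Pod %d at revision %s", i, updateRevision)
+		}
+	}
+	// Wait for any Pod in [0, replicas) that is not Running and Ready.
+	for _, p := range pods[:replicas] {
+		if !p.ready {
+			return fmt.Sprintf("wait for Pod %d to become Ready", p.ordinal)
+		}
+	}
+	// Scaling takes precedence over updates: delete the largest ordinal.
+	if int32(len(pods)) > replicas {
+		return fmt.Sprintf("delete Pod %d (scale down)", pods[len(pods)-1].ordinal)
+	}
+	// Delete the largest ordinal Pod that does not match its target revision.
+	for i := len(pods) - 1; i >= 0; i-- {
+		if pods[i].revision != updateRevision {
+			return fmt.Sprintf("delete Pod %d (update)", pods[i].ordinal)
+		}
+	}
+	return "at target state; complete the update and report status"
+}
+
+func main() {
+	pods := []pod{{0, "rev-1", true}, {1, "rev-1", true}, {2, "rev-2", true}}
+	fmt.Println(nextAction(pods, 3, "rev-2")) // delete Pod 1 (update)
+}
+```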
+
+### Target State
+The target state of the StatefulSet controller with respect to an individual
+StatefulSet is defined as follows.
+
+1. The StatefulSet contains exactly `.Spec.Replicas` Pods, with ordinals in the
+sequence `[0,.Spec.Replicas)`.
+1. All Pods in the StatefulSet have the correct
+[target Pod state](#target-pod-state).
+
+### Target Pod State
+As in the [Controller History](https://github.com/kubernetes/community/pull/594)
+proposal, we define the target Object state of the StatefulSetSpec to be its
+`.Template` and `.VolumeClaimsTemplate`. The latter is currently immutable, but
+we will version it, as this constraint may one day be lifted. This
+state provides enough information to generate a Pod and its associated
+PersistentVolumeClaims. The target Pod state for a Pod in a StatefulSet is as
+follows.
+1. The Pod's PersistentVolumeClaims have been created.
+   - Note that we do not currently delete PersistentVolumeClaims.
+1. If the Pod's ordinal is in the sequence `[0,.Spec.Replicas)`, the Pod should
+have a Ready Condition. This implies the Pod is Running.
+1. If the Pod's ordinal is greater than or equal to `.Spec.Replicas`, the Pod
+should be completely terminated and deleted.
+1. If the StatefulSet's `.Spec.UpdateStrategy.Type` is equal to
+`OnDeleteStatefulSetStrategyType`, no version tracking is performed, Pods
+can be at an arbitrary version, and they will be recreated from the current
+`.Spec.Template` and `.Spec.VolumeClaimsTemplate` when they are deleted.
+1. If the StatefulSet's `.Spec.UpdateStrategy.Type` is equal to
+`RollingUpdateStatefulSetStrategyType`, then the version of the Pod should be
+as follows.
+   1. If the Pod's ordinal is in the sequence `[0,.Status.CurrentReplicas)`,
+   the Pod should be consistent with the version indicated by
+   `.Status.CurrentRevision`.
+   1. If the Pod's ordinal is in the sequence
+   `[.Status.Replicas - .Status.UpdatedReplicas, .Status.Replicas)`,
+   the Pod should be consistent with the version indicated by
+   `.Status.UpdateRevision`.
+1. If the StatefulSet's `.Spec.UpdateStrategy.Type` is equal to
+`PartitionStatefulSetStrategyType`, then the version of the Pod should be
+as follows.
+   1. If the Pod's ordinal is in the sequence `[0,.Status.CurrentReplicas)`,
+   the Pod should be consistent with the version indicated by
+   `.Status.CurrentRevision`.
+   1. If the Pod's ordinal is in the sequence
+   `[.Status.Replicas - .Status.UpdatedReplicas, .Status.Replicas)`, the Pod
+   should be consistent with the version indicated by `.Status.UpdateRevision`.
+   1. If the Pod does not meet either of the prior two conditions, and if its
+   ordinal is in the sequence `[0, .Spec.UpdateStrategy.Partition.Ordinal)`,
+   it should be consistent with the version indicated by
+   `.Status.CurrentRevision`.
+   1. Otherwise, the Pod should be consistent with the version indicated
+   by `.Status.UpdateRevision`.
+
+### Pod State Reconciliation
+In order to reconcile a Pod with its declared
+[target Pod state](#target-pod-state), the StatefulSet controller will do the
+following (a sketch of this decision follows the list).
+
+1. If the Pod is already consistent with its target state, the controller will
+do nothing.
+1. If the Pod is labeled with a `StatefulSetRevisionLabel` that indicates
+the Pod was generated from a version of the StatefulSetSpec that is semantically
+equivalent to, but not equal to, the [target version](#target-pod-state), the
+StatefulSet controller will update the Pod with a `StatefulSetRevisionLabel`
+indicating the new semantically equivalent version. This form of reconciliation
+is non-destructive.
+1. If the Pod was not created from the target version, the Pod will be deleted
+and recreated from that version. This form of reconciliation is destructive.
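+
+The decision between these three outcomes can be sketched as follows. This is a
+minimal Go illustration of the reconciliation choice described above; the
+`reconcile` function, its revision strings, and the `equivalent` comparison are
+hypothetical stand-ins (the real semantic equivalence test comes from the
+controller history mechanism).
+
+```go
+package main
+
+import "fmt"
+
+// reconcileAction enumerates the three reconciliation outcomes listed above.
+type reconcileAction string
+
+const (
+	doNothing         reconcileAction = "do nothing"
+	patchLabel        reconcileAction = "patch StatefulSetRevisionLabel (non-destructive)"
+	deleteAndRecreate reconcileAction = "delete and recreate (destructive)"
+)
+
+// reconcile chooses an action for a Pod. podRevision is the value of the Pod's
+// statefulset.kubernetes.io/revision label, target is the revision the Pod
+// should be at, and equivalent reports whether two revisions were generated
+// from semantically equivalent StatefulSetSpecs.
+func reconcile(podRevision, target string, equivalent func(a, b string) bool) reconcileAction {
+	switch {
+	case podRevision == target:
+		return doNothing
+	case equivalent(podRevision, target):
+		// Same semantic version under a different name: relabel in place.
+		return patchLabel
+	default:
+		// The Pod was generated from a different version: recreate it.
+		return deleteAndRecreate
+	}
+}
+
+func main() {
+	neverEquivalent := func(a, b string) bool { return false } // hypothetical comparison
+	fmt.Println(reconcile("rev-1", "rev-2", neverEquivalent))  // delete and recreate (destructive)
+}
+```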
+
+### Specification Updates
+The StatefulSet controller will [snapshot](#snapshot-creation) its target
+Object state when mutations are made to its `.Spec.Template` or
+`.Spec.VolumeClaimsTemplate` (Note that the latter is currently immutable).
+
+1. When the StatefulSet controller observes a mutation to a StatefulSet's
+`.Spec.Template`, it will snapshot its target Object state and compare
+the snapshot with the version indicated by its `.Status.UpdateRevision`.
+1. If the current state is equivalent to the version indicated by
+`.Status.UpdateRevision`, no update has occurred.
+1. If the `.Status.CurrentRevision` field is empty, then the StatefulSet has no
+revision history. To initialize its revision history, the StatefulSet controller
+will set both `.Status.CurrentRevision` and `.Status.UpdateRevision` to the
+version of the current snapshot.
+1. If the `.Status.CurrentRevision` is not empty, and if the
+`.Status.UpdateRevision` is not equal to the version of the current snapshot,
+the StatefulSet controller will set the `.Status.UpdateRevision` to the version
+indicated by the current snapshot.
+
+### StatefulSet Revision History
+The StatefulSet controller will use the technique proposed in
+[Controller History](https://github.com/kubernetes/community/pull/594) to
+snapshot and version its target Object state.
+
+#### Snapshot Creation
+In order to snapshot a version of its target Object state, the StatefulSet
+controller will serialize and store the `.Spec.Template` and
+`.Spec.VolumeClaimsTemplate`, along with the `.Generation`, in each snapshot.
+Each snapshot will be labeled with the StatefulSet's `.Selector`.
+
+#### History Reconstruction
+As proposed in
+[Controller History](https://github.com/kubernetes/community/pull/594), in
+order to reconstruct the revision history of a StatefulSet, the StatefulSet
+controller will select all snapshots based on its `.Spec.Selector` and sort them
+by the contained `.Generation`. This will produce an ordered set of
+revisions to the StatefulSet's target Object state.
+
+#### History Maintenance
+In order to prevent the revision history of the StatefulSet from exceeding
+memory or storage limits, the StatefulSet controller will periodically prune
+its revision history so that no more than `.Spec.RevisionHistoryLimit` non-live
+versions of target Object state are preserved.
+
+### Update Completion
+The criteria for update completion are as follows.
+
+1. If the StatefulSet's `.Spec.UpdateStrategy.Type` is equal to
+`OnDeleteStatefulSetStrategyType`, then no version tracking is performed. In
+this case, an update can never be in progress.
+1. If the StatefulSet's `.Spec.UpdateStrategy.Type` is equal to
+`PartitionStatefulSetStrategyType`, updates cannot complete. The version
+indicated by `.Status.UpdateRevision` will only be applied to Pods with ordinals
+in the sequence `[.Spec.UpdateStrategy.Partition.Ordinal,.Spec.Replicas)`.
+1. If the StatefulSet's `.Spec.UpdateStrategy.Type` is equal to
+`RollingUpdateStatefulSetStrategyType`, then an update is complete when the
+StatefulSet is at its [target state](#target-state). The StatefulSet controller
+will signal update completion as follows (see the sketch after this list).
+   1. The controller will set `.Status.CurrentRevision` to the value of
+   `.Status.UpdateRevision`.
+   1. The controller will set `.Status.CurrentReplicas` to
+   `.Status.UpdatedReplicas`. Note that this value will be equal to
+   `.Status.Replicas`.
+   1. The controller will set `.Status.UpdatedReplicas` to 0.
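+
+The completion bookkeeping for the `RollingUpdate` strategy can be sketched as
+follows. This is a minimal, self-contained Go illustration; the `status` type
+and `completeRollingUpdate` function are hypothetical stand-ins for the
+StatefulSetStatus object and the controller's actual logic.
+
+```go
+package main
+
+import "fmt"
+
+// status is a simplified stand-in for StatefulSetStatus.
+type status struct {
+	Replicas        int32
+	CurrentRevision string
+	UpdateRevision  string
+	CurrentReplicas int32
+	UpdatedReplicas int32
+}
+
+// completeRollingUpdate applies the three completion steps above: once every
+// Pod has been recreated from the update revision, that revision becomes the
+// current revision.
+func completeRollingUpdate(s *status) {
+	if s.UpdatedReplicas != s.Replicas {
+		return // the update is still in progress
+	}
+	s.CurrentRevision = s.UpdateRevision
+	s.CurrentReplicas = s.UpdatedReplicas // equal to Replicas at this point
+	s.UpdatedReplicas = 0
+}
+
+func main() {
+	s := &status{Replicas: 3, CurrentRevision: "rev-1", UpdateRevision: "rev-2",
+		UpdatedReplicas: 3}
+	completeRollingUpdate(s)
+	fmt.Printf("%+v\n", s)
+	// &{Replicas:3 CurrentRevision:rev-2 UpdateRevision:rev-2 CurrentReplicas:3 UpdatedReplicas:0}
+}
+```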
+
+### Status Reporting
+After processing the creation, update, or deletion of a StatefulSet or Pod,
+the StatefulSet controller will record its status by persisting a
+StatefulSetStatus object. This has two purposes.
+
+1. It allows the StatefulSet controller to recreate the exact StatefulSet
+membership in the event of a hard restart of the entire system.
+1. It communicates the current state of the StatefulSet to clients. Using the
+`.Status.ObservedGeneration`, clients can construct a linearizable view of
+the operations performed by the controller.
+
+When the StatefulSet controller records the status of a StatefulSet, it will
+do the following.
+
+1. The controller will set the `.Status.ObservedGeneration` to communicate
+the `.Generation` of the StatefulSet object that was observed.
+1. The controller will set the `.Status.Replicas` to the current number of
+created Pods.
+1. The controller will set the `.Status.ReadyReplicas` to the current number of
+Pods that have a Ready Condition.
+1. The controller will set the `.Status.CurrentRevision` and
+`.Status.UpdateRevision` in accordance with the StatefulSet's
+[revision history](#statefulset-revision-history) and
+any [complete updates](#update-completion).
+1. The controller will set the `.Status.CurrentReplicas` to the number of
+Pods that it has created from the version indicated by
+`.Status.CurrentRevision`.
+1. The controller will set the `.Status.UpdatedReplicas` to the number of Pods
+that it has created from the version indicated by `.Status.UpdateRevision`.
+1. The controller will then persist the StatefulSetStatus to make it durable
+and to communicate it to observers.
+
+## API Server
+The API Server will perform validation for StatefulSet creation and updates.
+
+### StatefulSet Validation
+As is currently implemented, the API Server will not allow mutation to any
+fields of the StatefulSet object other than `.Spec.Replicas` and
+`.Spec.Template.Containers`. This design imposes the following additional
+constraints.
+
+1. If the `.Spec.UpdateStrategy.Type` is equal to
+`PartitionStatefulSetStrategyType`, the API Server should fail validation
+if any of the following conditions are true.
+   1. `.Spec.UpdateStrategy.Partition` is nil.
+   1. `.Spec.UpdateStrategy.Partition` is not nil, and
+   `.Spec.UpdateStrategy.Partition.Ordinal` is not in the sequence
+   `(0,.Spec.Replicas)`.
+1. The API Server will fail validation on any update to a StatefulSetStatus
+object if any of the following conditions are true (a sketch of these checks
+follows the list).
+   1. `.Status.Replicas` is negative.
+   1. `.Status.ReadyReplicas` is negative or greater than `.Status.Replicas`.
+   1. `.Status.CurrentReplicas` is negative or greater than `.Status.Replicas`.
+   1. `.Status.UpdatedReplicas` is negative or greater than `.Status.Replicas`.
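+
+The following is a minimal Go sketch of the StatefulSetStatus checks listed
+above; the `statefulSetStatus` type and `validateStatus` function are
+hypothetical stand-ins, as the real API Server expresses these rules through
+its field validation machinery.
+
+```go
+package main
+
+import (
+	"errors"
+	"fmt"
+)
+
+// statefulSetStatus mirrors the fields validated above.
+type statefulSetStatus struct {
+	Replicas        int32
+	ReadyReplicas   int32
+	CurrentReplicas int32
+	UpdatedReplicas int32
+}
+
+// validateStatus returns an error if any of the conditions enumerated above
+// are true.
+func validateStatus(s statefulSetStatus) error {
+	if s.Replicas < 0 {
+		return errors.New("replicas must not be negative")
+	}
+	if s.ReadyReplicas < 0 || s.ReadyReplicas > s.Replicas {
+		return errors.New("readyReplicas must be in [0, replicas]")
+	}
+	if s.CurrentReplicas < 0 || s.CurrentReplicas > s.Replicas {
+		return errors.New("currentReplicas must be in [0, replicas]")
+	}
+	if s.UpdatedReplicas < 0 || s.UpdatedReplicas > s.Replicas {
+		return errors.New("updatedReplicas must be in [0, replicas]")
+	}
+	return nil
+}
+
+func main() {
+	err := validateStatus(statefulSetStatus{Replicas: 3, ReadyReplicas: 4})
+	fmt.Println(err) // readyReplicas must be in [0, replicas]
+}
+```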
+
+## Kubectl
+Kubectl will use the `rollout` command to control and provide the status of
+StatefulSet updates.
+
+ - `kubectl rollout status statefulset <name>`: displays the status
+   of a StatefulSet update.
+ - `kubectl rollout undo statefulset <name>`: triggers a rollback
+   of the current update.
+ - `kubectl rollout history statefulset <name>`: displays the
+   StatefulSet's revision history.
+
+## Usage
+This section demonstrates how the design functions in typical usage scenarios.
+
+### Initial Deployment
+Users can create a StatefulSet using `kubectl apply`.
+
+Given the following manifest `web.yaml`
+
+```yaml
+apiVersion: apps/v1beta1
+kind: StatefulSet
+metadata:
+  name: web
+spec:
+  serviceName: "nginx"
+  replicas: 3
+  template:
+    metadata:
+      labels:
+        app: nginx
+    spec:
+      containers:
+      - name: nginx
+        image: gcr.io/google_containers/nginx-slim:0.8
+        ports:
+        - containerPort: 80
+          name: web
+        volumeMounts:
+        - name: www
+          mountPath: /usr/share/nginx/html
+  volumeClaimTemplates:
+  - metadata:
+      name: www
+      annotations:
+        volume.alpha.kubernetes.io/storage-class: anything
+    spec:
+      accessModes: [ "ReadWriteOnce" ]
+      resources:
+        requests:
+          storage: 1Gi
+```
+
+Users can use the following command to create the StatefulSet.
+
+```shell
+kubectl apply -f web.yaml
+```
+
+The only difference between the proposed and current implementation is that
+the proposed implementation will initialize the StatefulSet's revision history
+upon initial creation.
+
+### Rolling out an Update
+Users can create a rolling update using `kubectl apply`. If a user creates a
+StatefulSet [as above](#initial-deployment), the user can trigger a rolling
+update by updating the image (as in the manifest below).
+
+```yaml
+apiVersion: apps/v1beta1
+kind: StatefulSet
+metadata:
+  name: web
+spec:
+  serviceName: "nginx"
+  replicas: 3
+  updateStrategy:
+    type: RollingUpdate
+  template:
+    metadata:
+      labels:
+        app: nginx
+    spec:
+      containers:
+      - name: nginx
+        image: gcr.io/google_containers/nginx-slim:0.9
+        ports:
+        - containerPort: 80
+          name: web
+        volumeMounts:
+        - name: www
+          mountPath: /usr/share/nginx/html
+  volumeClaimTemplates:
+  - metadata:
+      name: www
+      annotations:
+        volume.alpha.kubernetes.io/storage-class: anything
+    spec:
+      accessModes: [ "ReadWriteOnce" ]
+      resources:
+        requests:
+          storage: 1Gi
+```
+
+Users can use the following command to trigger a rolling update.
+
+```shell
+kubectl apply -f web.yaml
+```
+
+### Canaries
+Users can create a canary using `kubectl apply`. The only difference between a
+[rolling update](#rolling-out-an-update) and a canary is that the
+`.Spec.UpdateStrategy.Type` is set to `PartitionStatefulSetStrategyType` and
+the `.Spec.UpdateStrategy.Partition.Ordinal` is set to `.Spec.Replicas-1`.
+
+```yaml
+apiVersion: apps/v1beta1
+kind: StatefulSet
+metadata:
+  name: web
+spec:
+  serviceName: "nginx"
+  replicas: 3
+  updateStrategy:
+    type: Partition
+    partition:
+      ordinal: 2
+  template:
+    metadata:
+      labels:
+        app: nginx
+    spec:
+      containers:
+      - name: nginx
+        image: gcr.io/google_containers/nginx-slim:0.9
+        ports:
+        - containerPort: 80
+          name: web
+        volumeMounts:
+        - name: www
+          mountPath: /usr/share/nginx/html
+  volumeClaimTemplates:
+  - metadata:
+      name: www
+      annotations:
+        volume.alpha.kubernetes.io/storage-class: anything
+    spec:
+      accessModes: [ "ReadWriteOnce" ]
+      resources:
+        requests:
+          storage: 1Gi
+```
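+
+Under the Partition strategy, a Pod's ordinal determines which revision it
+should run. The following is a minimal Go sketch of that selection for the
+manifest above (ignoring the `CurrentReplicas`/`UpdatedReplicas` bookkeeping);
+the `targetRevision` function and the revision strings are hypothetical
+stand-ins.
+
+```go
+package main
+
+import "fmt"
+
+// targetRevision returns the revision a Pod should run under the Partition
+// strategy: ordinals below the partition ordinal stay at the current
+// revision, and the remaining ordinals receive the update revision.
+func targetRevision(ordinal, partitionOrdinal int32, currentRevision, updateRevision string) string {
+	if ordinal < partitionOrdinal {
+		return currentRevision
+	}
+	return updateRevision
+}
+
+func main() {
+	// The canary manifest above: replicas 3, partition ordinal 2.
+	for ordinal := int32(0); ordinal < 3; ordinal++ {
+		fmt.Printf("web-%d -> %s\n", ordinal, targetRevision(ordinal, 2, "rev-1", "rev-2"))
+	}
+	// web-0 -> rev-1, web-1 -> rev-1, web-2 -> rev-2
+}
+```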
+
+Users can also simultaneously scale up and add a canary. This reduces risk
+for some deployment scenarios by adding additional capacity for the canary.
+For example, in the manifest below, `.Spec.Replicas` is increased to `4` while
+`.Spec.UpdateStrategy.Partition.Ordinal` is set to `.Spec.Replicas-1`.
+
+```yaml
+apiVersion: apps/v1beta1
+kind: StatefulSet
+metadata:
+  name: web
+spec:
+  serviceName: "nginx"
+  replicas: 4
+  updateStrategy:
+    type: Partition
+    partition:
+      ordinal: 3
+  template:
+    metadata:
+      labels:
+        app: nginx
+    spec:
+      containers:
+      - name: nginx
+        image: gcr.io/google_containers/nginx-slim:0.9
+        ports:
+        - containerPort: 80
+          name: web
+        volumeMounts:
+        - name: www
+          mountPath: /usr/share/nginx/html
+  volumeClaimTemplates:
+  - metadata:
+      name: www
+      annotations:
+        volume.alpha.kubernetes.io/storage-class: anything
+    spec:
+      accessModes: [ "ReadWriteOnce" ]
+      resources:
+        requests:
+          storage: 1Gi
+```
+
+### Phased Roll Outs
+Users can create a phased roll out using `kubectl apply`. The only difference
+between a [canary](#canaries) and a phased roll out is that the
+`.Spec.UpdateStrategy.Partition.Ordinal` is set to a value less than
+`.Spec.Replicas-1`.
+
+```yaml
+apiVersion: apps/v1beta1
+kind: StatefulSet
+metadata:
+  name: web
+spec:
+  serviceName: "nginx"
+  replicas: 4
+  updateStrategy:
+    type: Partition
+    partition:
+      ordinal: 2
+  template:
+    metadata:
+      labels:
+        app: nginx
+    spec:
+      containers:
+      - name: nginx
+        image: gcr.io/google_containers/nginx-slim:0.9
+        ports:
+        - containerPort: 80
+          name: web
+        volumeMounts:
+        - name: www
+          mountPath: /usr/share/nginx/html
+  volumeClaimTemplates:
+  - metadata:
+      name: www
+      annotations:
+        volume.alpha.kubernetes.io/storage-class: anything
+    spec:
+      accessModes: [ "ReadWriteOnce" ]
+      resources:
+        requests:
+          storage: 1Gi
+```
+
+Phased roll outs can be used to roll out a configuration, image, or resource
+update to some portion of the fleet maintained by the StatefulSet prior to
+updating the entire fleet. This is useful for supporting linear, geometric,
+and exponential roll out of an update. Users can modify the
+`.Spec.UpdateStrategy.Partition.Ordinal` to allow the roll out to progress, as
+in the manifest below, which lowers the partition ordinal from `2` to `1`.
+
+```yaml
+apiVersion: apps/v1beta1
+kind: StatefulSet
+metadata:
+  name: web
+spec:
+  serviceName: "nginx"
+  replicas: 4
+  updateStrategy:
+    type: Partition
+    partition:
+      ordinal: 1
+  template:
+    metadata:
+      labels:
+        app: nginx
+    spec:
+      containers:
+      - name: nginx
+        image: gcr.io/google_containers/nginx-slim:0.9
+        ports:
+        - containerPort: 80
+          name: web
+        volumeMounts:
+        - name: www
+          mountPath: /usr/share/nginx/html
+  volumeClaimTemplates:
+  - metadata:
+      name: www
+      annotations:
+        volume.alpha.kubernetes.io/storage-class: anything
+    spec:
+      accessModes: [ "ReadWriteOnce" ]
+      resources:
+        requests:
+          storage: 1Gi
+```
+
+### Rollbacks
+To roll back an update, users can use the `kubectl rollout` command.
+
+The command below will roll back the `web` StatefulSet to the previous revision
+in its history. If a roll out is in progress, it will stop deploying the target
+revision and roll back to the current revision.
+
+```shell
+kubectl rollout undo statefulset web
+```
+
+### Rolling Forward
+Rolling back is usually the safest, and often the fastest, strategy to mitigate
+deployment failure, but rolling forward is sometimes the only practical solution
+for stateful applications (e.g. a user has a minor configuration error but has
+already modified the storage format for the application). Users can use
+sequential `kubectl apply` operations to update the StatefulSet's current
+[target state](#target-state). The StatefulSet's `.Spec.UpdateStrategy.Partition`
+will be respected, so rolling forward interacts well with canaries and phased
+roll outs.
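+
+Both rollbacks and roll forwards reduce to selecting a revision from the
+reconstructed history and making it the StatefulSet's update revision. The
+following is a minimal Go sketch, assuming the snapshot store described in the
+[Controller History](https://github.com/kubernetes/community/pull/594)
+proposal; the `revision` type and `rollbackTarget` function are hypothetical
+stand-ins for that mechanism.
+
+```go
+package main
+
+import "fmt"
+
+// revision is a simplified stand-in for a stored snapshot of target Object
+// state, ordered by the StatefulSet's .Generation.
+type revision struct {
+	generation int64
+	name       string
+}
+
+// rollbackTarget selects the snapshot that `kubectl rollout undo` would roll
+// back to: the most recent revision in the history that precedes the revision
+// currently being applied.
+func rollbackTarget(history []revision, updateRevision string) (revision, bool) {
+	for i := len(history) - 1; i >= 0; i-- {
+		if history[i].name == updateRevision && i > 0 {
+			return history[i-1], true
+		}
+	}
+	return revision{}, false // no prior revision to roll back to
+}
+
+func main() {
+	history := []revision{{1, "rev-1"}, {2, "rev-2"}, {3, "rev-3"}}
+	if target, ok := rollbackTarget(history, "rev-3"); ok {
+		fmt.Printf("undo rolls back to %s\n", target.name) // undo rolls back to rev-2
+	}
+}
+```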
+
+## Tests
+- Updating a StatefulSet's containers will trigger updates to the StatefulSet's
+Pods respecting the
+[identity](https://kubernetes.io/docs/concepts/abstractions/controllers/statefulsets/#pod-identity)
+and [deployment and scaling](https://kubernetes.io/docs/concepts/abstractions/controllers/statefulsets/#deployment-and-scaling-guarantee)
+guarantees.
+- A StatefulSet update will block on failure.
+- A StatefulSet update can be rolled back.
+- A StatefulSet update can be rolled forward by applying another update.
+- A StatefulSet update's status can be retrieved.
+- A StatefulSet's revision history contains all updates with respect to the
+configured revision history limit.
+- A StatefulSet update can create a canary.
+- A StatefulSet update can be performed in stages.
+
+## Future Work
+In the future, we may implement the following features to enhance StatefulSet
+updates.
+
+### Termination Reason
+Without communicating a signal indicating the reason for termination to a Pod in
+a StatefulSet, as proposed [here](https://github.com/kubernetes/community/pull/541),
+the tenant application has no way to determine if it is being terminated due to
+a scale down operation or due to an update.
+
+Consider a BASE distributed storage application like Cassandra, where 2 TiB of
+persistent data is not atypical, and the data distribution is not identical on
+every server. We want to enable two distinct behaviors based on the reason for
+termination.
+
+- If the termination is due to scale down, during the configured termination
+grace period, the entry point of the Pod should cause the application to drain
+its client connections, replicate its persisted data (so that the cluster is not
+left under-replicated), and decommission the application to remove it from the
+cluster.
+- If the termination is due to a temporary capacity loss (e.g. an update or an
+image upgrade), the application should drain all of its client connections,
+flush any in memory data structures to the file system, and synchronize the
+file system with storage media. It should not redistribute its data.
+
+If the application implements the strategy of always redistributing its data,
+we unnecessarily increase recovery time during an update and incur the
+additional network and storage cost of two full data redistributions for every
+updated node.
+It should be noted that this is already an issue for Node cordon and Pod eviction
+(due to drain or taints), and applications can use the same mitigation for
+StatefulSet updates as they would for these events.
+
+### VolumeTemplatesSpec Updates
+While this proposal does not address
+[VolumeTemplateSpec updates](https://github.com/kubernetes/kubernetes/issues/41015),
+this would be a valuable feature for production users of storage systems that use
+intermittent compaction as a form of garbage collection. Applications that use
+log structured merge trees with size tiered compaction (e.g. Cassandra) or append
+only B(+/*) Trees (e.g. Couchbase) can temporarily double their storage requirement
+during compaction. If there is insufficient space for compaction
+to progress, these applications will either fail or degrade until
+additional capacity is added. While there are valid manual workarounds to
+expand the size of an AWS EBS volume or GCE PD, it would be
+useful to automate the resize via updates to the StatefulSet's
+VolumeClaimsTemplate.
+
+### In Place Updates
+Currently, configuration, image, and resource request/limit updates are all
+performed destructively. Without a
+[termination reason](https://github.com/kubernetes/community/pull/541)
+implementation, there is little value in implementing in place image updates,
+and in place configuration and resource request/limit updates are not possible.
+When [termination reason](https://github.com/kubernetes/kubernetes/issues/1462)
+is implemented, we may modify the behavior of StatefulSet update to only update,
+rather than delete and create, Pods when the only mutated value is the container
+image. If resizable resource requests/limits are implemented, we may extend the
+above to allow for updates to Pod resources.