update proposal for maxUnavailable for statefulsets #1010

Merged (1 commit, merged on Sep 12, 2019)
196 changes: 133 additions & 63 deletions keps/sig-apps/20190226-maxunavailable-for-statefulsets.md
@@ -7,12 +7,14 @@ participating-sigs:
- sig-apps
reviewers:
- "@janetkuo"
- "@kow3ns"
approvers:
- "@janetkuo"
- "@kow3ns"
editor: TBD
creation-date: 2018-12-29
last-updated: 2019-08-10
status: implementable
see-also:
- n/a
replaces:
@@ -35,63 +37,69 @@ superseded-by:
- [Story 1](#story-1)
- [Implementation Details](#implementation-details)
- [API Changes](#api-changes)
- [Recommended Choice](#recommended-choice)
- [Implementation](#implementation)
- [Risks and Mitigations](#risks-and-mitigations)
- [Upgrades/Downgrades](#upgradesdowngrades)
- [Tests](#tests)
- [Graduation Criteria](#graduation-criteria)
- [Implementation History](#implementation-history)
- [Drawbacks](#drawbacks)
- [Alternatives](#alternatives)
<!-- /toc -->

## Summary

The purpose of this enhancement is to implement maxUnavailable for StatefulSet during RollingUpdate.
When a StatefulSet’s `.spec.updateStrategy.type` is set to `RollingUpdate`, the StatefulSet controller
deletes and recreates each Pod in the StatefulSet. Today, each Pod is updated one at a time. With support for `maxUnavailable`, the update will proceed `maxUnavailable` Pods at a time.

## Motivation

Consider the following scenarios:

1. My containers publish metrics to a time series system. If I am using a Deployment, each rolling
update creates a new Pod name, and hence the metrics published by each new Pod start a new time series,
which makes tracking metrics for the application difficult. While this could be mitigated, it requires
some tricks on the time series collection side. It would be much better if I could use a
StatefulSet, so my Pod names don't change and all metrics go to a single time series. This will be easier if StatefulSet is at feature parity with Deployments.
2. My container does some initial startup tasks, like loading up a cache, that take a lot of
time. If we used a StatefulSet, we could only go one Pod at a time, which would result in a slow rolling
update. If StatefulSet supported maxUnavailable with a value greater than 1, it would allow for a faster
rollout, since up to maxUnavailable Pods could be loading up the cache at the same time.
3. My stateful clustered application has leaders and followers, with many more followers than leaders. My application can tolerate many followers going down at the same time. I want to be able to do faster
rollouts by bringing down two or more followers at the same time. This is only possible if StatefulSet
supports maxUnavailable in rolling updates.
4. Sometimes I just want easier tracking of revisions of a rolling update. Deployment does it through
ReplicaSets, which has its own nuances. Understanding that requires diving into the complexity of hashing
and how ReplicaSets are named. Over and above that, there were issues with hash collisions which
further complicated the situation (they have since been resolved). StatefulSet introduced ControllerRevisions
in 1.7, which are much easier to think and reason about. They are used by DaemonSet and StatefulSet for
tracking revisions. It would be much nicer if all the use cases of Deployments could be met by
StatefulSets while also tracking revisions with ControllerRevisions. Another way of saying this is:
all my Deployment use cases are easily met by StatefulSet, and I additionally get easier revision
tracking, but only if StatefulSet supports `maxUnavailable`.

With this feature in place, a user choosing StatefulSet with maxUnavailable > 1 is making a
conscious choice that more than one Pod going down at the same time during a rolling update will not
cause issues for their stateful application, which has per-Pod state and identity. Other stateful
applications that cannot tolerate more than one Pod going down will keep the current behavior of one-Pod-at-a-time rolling updates.

### Goals
The StatefulSet RollingUpdate strategy will contain an additional parameter called `maxUnavailable` to
control how many Pods can be brought down at a time during a rolling update.

### Non-Goals
NA

## Proposal

### User Stories

#### Story 1
As a user of Kubernetes, I should be able to update my StatefulSet more than one Pod at a time, in a
RollingUpdate manner, if my stateful app can tolerate more than one of its Pods being down, thus
allowing my update to finish much faster.

### Implementation Details

@@ -121,32 +129,79 @@ type RollingUpdateStatefulSetStrategy struct {
}
```
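
A minimal sketch of what the updated strategy type could look like with the new field; the comments and
JSON tags below are assumptions modeled on the existing `Partition` field, not the final API:

```go
// Sketch of the proposed API shape; field comments and json tags are
// assumptions, not the final API.
package v1

import "k8s.io/apimachinery/pkg/util/intstr"

// RollingUpdateStatefulSetStrategy is used to communicate parameters for
// RollingUpdateStatefulSetStrategyType.
type RollingUpdateStatefulSetStrategy struct {
	// Partition indicates the ordinal at which the StatefulSet should be
	// partitioned for updates.
	Partition *int32 `json:"partition,omitempty"`

	// MaxUnavailable is the maximum number of Pods that can be unavailable
	// during the update. The value can be an absolute number or a percentage
	// of desired Pods. Defaults to 1, which preserves today's behavior.
	MaxUnavailable *intstr.IntOrString `json:"maxUnavailable,omitempty"`
}
```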

- By default, if maxUnavailable is not specified, its value will be assumed to be 1 and StatefulSets
will follow their old behavior. This will also help while upgrading from a release which doesn't support maxUnavailable to a release which supports this field.
- If maxUnavailable is specified, it cannot be greater than the total number of replicas.
- If maxUnavailable is specified and a partition is also specified, maxUnavailable cannot be greater than `replicas - partition`. (A sketch of these defaulting and validation rules follows the discussion of the behavior choices below.)
- If a partition is specified, maxUnavailable will only apply to the Pods which are staged by the
partition. That means all Pods with an ordinal greater than or equal to the partition will be
updated when the StatefulSet’s `.spec.template` is updated. Say total replicas is 5, the partition is set to 2, and maxUnavailable is set to 2. If the image is changed in this scenario, the following
are the possible behavior choices we have:

1. Pods with ordinals 4 and 3 will start terminating at the same time (because of maxUnavailable). Once they are both running and ready, the Pod with ordinal 2 will start terminating. Pods with ordinals 0 and 1
will remain untouched due to the partition. In this choice, the number of Pods terminating is not always
maxUnavailable, but sometimes less. For example, if the Pod with ordinal 3 is running and ready but 4 is not, we still wait for 4 to be running and ready before moving on to 2. This implementation avoids
out-of-order terminations of Pods.
2. Pods with ordinals 4 and 3 will start terminating at the same time (because of maxUnavailable). When either 4 or 3 is running and ready, the Pod with ordinal 2 will start terminating. This could violate
ordering guarantees, since if 3 is running and ready first, then both 4 and 2 are terminating at the same
time, out of order. If 4 is running and ready first, then both 3 and 2 are terminating at the same time and no ordering guarantees are violated. This implementation guarantees that there are always maxUnavailable Pods terminating, except for the last batch.
3. Pods with ordinals 4 and 3 will start terminating at the same time (because of maxUnavailable). When 4 is running and ready, 2 will start terminating; at this point both 2 and 3 are terminating. If 3 is
running and ready before 4, 2 won't start terminating, to preserve ordering semantics. So at that point
only one Pod is unavailable although we requested 2.
4. Introduce a field in RollingUpdate which decides whether we want maxUnavailable with or without ordering guarantees. Depending on what the user wants, this choice can select behavior 1 or 3 if ordering guarantees are needed, or behavior 2 if they are not. To simplify this further,
PodManagementPolicy today supports `OrderedReady` or `Parallel`. The `Parallel` mode only applies to scale-up and tear-down of StatefulSets and currently doesn't apply to rolling updates. So instead of coming up
with a new field, we could use the PodManagementPolicy to choose the behavior the user wants.

   1. PMP=Parallel will now also apply to RollingUpdate. This selects the behavior described in 2 above:
      maxUnavailable Pods are always terminating at the same time (except in the last batch) and no
      ordering guarantees are provided.
   2. PMP=OrderedReady with maxUnavailable can select either behavior 1 or 3.

NOTE: The goal is faster updates of an application. In some cases, people need both ordering
and faster updates. In other cases they just need faster updates and they don't care about ordering as
long as they get identity.

Choice 1 is simpler to reason about. It does not always have maxUnavailable Pods in the
Terminating state. It does not guarantee ordering within the batch of maxUnavailable Pods. The maximum
difference between the ordinals that are terminating out of order cannot be more than maxUnavailable.

Choice 2 always has maxUnavailable Pods in the Terminating state. This can sometimes lead to
Pods terminating out of order. It will always lead to the fastest rollouts. The maximum difference between the ordinals that are terminating out of order can be more than maxUnavailable.

Choice 3 guarantees that no two Pods are ever terminating out of order. It sometimes does so
at the cost of not being able to terminate maxUnavailable Pods. The implementation for this might be
complicated.

Choice 4 gives the user the choice and hence takes the guessing out of what they should expect.
Implementing Choice 4 using PMP would be the easiest.
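
Independent of which behavior is chosen, the defaulting and validation rules listed earlier could be
sketched roughly as follows; the helper name, signature, and error messages are assumptions for
illustration, not the final implementation:

```go
package statefulset

import (
	"fmt"

	intstrutil "k8s.io/apimachinery/pkg/util/intstr"
)

// resolveMaxUnavailable is a hypothetical helper illustrating the defaulting
// and validation rules above.
func resolveMaxUnavailable(maxUnavailable *intstrutil.IntOrString, replicas, partition int32) (int, error) {
	// Default to 1 so existing StatefulSets keep the one-Pod-at-a-time behavior.
	value, err := intstrutil.GetValueFromIntOrPercent(
		intstrutil.ValueOrDefault(maxUnavailable, intstrutil.FromInt(1)), int(replicas), false)
	if err != nil {
		return 0, err
	}
	if value < 1 {
		return 0, fmt.Errorf("maxUnavailable must be at least 1, got %d", value)
	}
	// maxUnavailable cannot be greater than the total number of replicas.
	if int32(value) > replicas {
		return 0, fmt.Errorf("maxUnavailable (%d) cannot be greater than replicas (%d)", value, replicas)
	}
	// With a partition, only ordinals >= partition are eligible for update, so
	// maxUnavailable cannot be greater than replicas-partition.
	if int32(value) > replicas-partition {
		return 0, fmt.Errorf("maxUnavailable (%d) cannot be greater than replicas-partition (%d)", value, replicas-partition)
	}
	return value, nil
}
```

Using `intstr.IntOrString` keeps parity with Deployments, where maxUnavailable also accepts either an
absolute number or a percentage.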

##### Recommended Choice

I recommend Choice 4, using PMP=Parallel for the first Alpha phase. This would give users fast
rollouts without having to second-guess what the behavior should be. This choice also allows for
easily extending the behavior with PMP=OrderedReady in the future to choose either behavior 1 or 3.

> Reviewer comment (Member): In particular, I think users will want the behavior of (1) above if they
> are using Ordered PodManagement and the behavior of (2) if they are using Parallel. Consider
> implementing just these semantics. That is, if PMP is Ordered, the user is declaring that they care
> about termination ordering, even if they are willing to tolerate a larger number of disruptions
> during an update. If PMP is Parallel, the user does not care about the termination ordering during
> turn-up/turn-down. If they do care about the termination ordering during update, they can set the
> maxUnavailable field to 1 to preserve the current behavior. If they wish to tolerate a larger number
> of disruptions, they can increase its value.
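
To make the recommendation concrete, here is a hypothetical client-side sketch combining PMP=Parallel
with the proposed field, assuming `MaxUnavailable` lands in `apps/v1` as described; the Pod template
and other required fields are omitted for brevity:

```go
package main

import (
	"fmt"

	appsv1 "k8s.io/api/apps/v1"
	"k8s.io/apimachinery/pkg/util/intstr"
)

func main() {
	replicas := int32(5)
	partition := int32(2)
	maxUnavailable := intstr.FromInt(2) // proposed field; assumed to exist in apps/v1

	set := appsv1.StatefulSet{
		Spec: appsv1.StatefulSetSpec{
			Replicas: &replicas,
			// Parallel means the user does not care about termination ordering
			// during the update, i.e. behavior 2 above.
			PodManagementPolicy: appsv1.ParallelPodManagement,
			UpdateStrategy: appsv1.StatefulSetUpdateStrategy{
				Type: appsv1.RollingUpdateStatefulSetStrategyType,
				RollingUpdate: &appsv1.RollingUpdateStatefulSetStrategy{
					Partition:      &partition,
					MaxUnavailable: &maxUnavailable,
				},
			},
		},
	}

	// With replicas=5, partition=2, maxUnavailable=2: ordinals 4 and 3 can be
	// updated at the same time, followed by ordinal 2; ordinals 0 and 1 keep
	// the old revision because of the partition.
	fmt.Printf("maxUnavailable=%s partition=%d\n",
		set.Spec.UpdateStrategy.RollingUpdate.MaxUnavailable.String(),
		*set.Spec.UpdateStrategy.RollingUpdate.Partition)
}
```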

#### Implementation

TBD: Will be updated after we have agreed on the semantics being discussed above.

https://github.com/kubernetes/kubernetes/blob/v1.13.0/pkg/controller/statefulset/stateful_set_control.go#L504
```go
...
	// we compute the minimum ordinal of the target sequence for a destructive update based on the strategy.
	updateMin := 0
	maxUnavailable := 1
	if set.Spec.UpdateStrategy.RollingUpdate != nil {
		updateMin = int(*set.Spec.UpdateStrategy.RollingUpdate.Partition)

		// NEW CODE HERE
		// Resolve the proposed maxUnavailable field (absolute number or percentage),
		// defaulting to 1 to preserve the current one-Pod-at-a-time behavior.
		var err error
		maxUnavailable, err = intstrutil.GetValueFromIntOrPercent(intstrutil.ValueOrDefault(set.Spec.UpdateStrategy.RollingUpdate.MaxUnavailable, intstrutil.FromInt(1)), int(replicaCount), false)
		if err != nil {
			return &status, err
		}
	}

	var unavailablePods []string
	// we terminate the Pod with the largest ordinal that does not match the update revision.
	for target := len(replicas) - 1; target >= updateMin; target-- {

		// delete the Pod if it is not already terminating and does not match the update revision.
		if getPodRevision(replicas[target]) != updateRevision.Name && !isTerminating(replicas[target]) {
			klog.V(2).Infof("StatefulSet %s/%s terminating Pod %s for update",
				set.Namespace,
				set.Name,
				replicas[target].Name)
			if err := ssc.podControl.DeleteStatefulPod(set, replicas[target]); err != nil {
				return &status, err
			}

			// NEW CODE HERE
			// After deleting a Pod, don't return from here yet.
			// We might have maxUnavailable greater than 1.
			status.CurrentReplicas--
		}

		// wait for unhealthy Pods on update
		if !isHealthy(replicas[target]) {
			// NEW CODE HERE
			// If this Pod is unhealthy, regardless of revision, count it among
			// the unavailable Pods.
			unavailablePods = append(unavailablePods, replicas[target].Name)
			klog.V(4).Infof(
				"StatefulSet %s/%s is waiting for Pod %s to update",
				set.Namespace,
				set.Name,
				replicas[target].Name)
		}

		// NEW CODE HERE
		// If at any time the total number of unavailable Pods reaches maxUnavailable,
		// we stop deleting more Pods for this update.
		if len(unavailablePods) >= maxUnavailable {
			klog.V(4).Infof(
				"StatefulSet %s/%s is waiting for unavailable Pods %v to update, max allowed to update simultaneously %v",
				set.Namespace,
				set.Name,
				unavailablePods,
				maxUnavailable)
			return &status, nil
		}
	}
...
```

@@ -201,7 +269,9 @@ tried this feature in Alpha, we would have time to fix issues.

### Tests

- maxUnavailable = 1: same behavior as today, with PodManagementPolicy as `OrderedReady` or `Parallel`
- Each of these tests can be run with PodManagementPolicy = `OrderedReady` or `Parallel`, and the update
should proceed at most maxUnavailable Pods at a time, in ordered or parallel fashion respectively
- maxUnavailable greater than 1 without partition
- maxUnavailable greater than replicas without partition
- maxUnavailable greater than 1 with partition and staged pods less than maxUnavailable
@@ -218,11 +288,11 @@ tried this feature in Alpha, we would have time to fix issues.
## Implementation History

- KEP Started on 1/1/2019
- Implementation PR and UT by 8/30

## Drawbacks

NA

## Alternatives
