update proposal for maxUnavailable for statefulsets #1010

Merged (1 commit, merged on Sep 12, 2019)
196 changes: 133 additions & 63 deletions keps/sig-apps/20190226-maxunavailable-for-statefulsets.md
@@ -7,12 +7,14 @@ participating-sigs:
- sig-apps
reviewers:
- "@janetkuo"
- "@kow3ns"
approvers:
- "@janetkuo"
- "@kow3ns"
editor: TBD
creation-date: 2018-12-29
last-updated: 2019-08-10
status: implementable
see-also:
- n/a
replaces:
@@ -35,63 +37,69 @@ superseded-by:
- [Story 1](#story-1)
- [Implementation Details](#implementation-details)
- [API Changes](#api-changes)
- [Recommended Choice](#recommended-choice)
- [Implementation](#implementation)
- [Risks and Mitigations](#risks-and-mitigations)
- [Upgrades/Downgrades](#upgradesdowngrades)
- [Tests](#tests)
- [Graduation Criteria](#graduation-criteria)
- [Implementation History](#implementation-history)
- [Drawbacks](#drawbacks)
- [Alternatives](#alternatives)
<!-- /toc -->

## Summary

The purpose of this enhancement is to implement maxUnavailable for StatefulSet during RollingUpdate.
When a StatefulSet’s `.spec.updateStrategy.type` is set to `RollingUpdate`, the StatefulSet controller
deletes and recreates each Pod in the StatefulSet. Today, each Pod is updated one at a time. With support for `maxUnavailable`, the update will proceed `maxUnavailable` Pods at a time.

## Motivation

Consider the following scenarios:

1. My containers publish metrics to a time series system. If I am using a Deployment, each rolling
update creates a new Pod name, and hence the metrics published by each new Pod start a new time series,
which makes tracking metrics for the application difficult. While this could be mitigated, it requires
some tricks on the time series collection side. It would be much better if I could use a
StatefulSet, so my Pod names don't change and all metrics go to a single time series. This will be easier if StatefulSet is at feature parity with Deployments.
2. My container does some initial startup tasks, like loading up a cache, that take a lot of
time. If we used a StatefulSet, we could only go one Pod at a time, which would result in a slow rolling
update. If StatefulSet supported maxUnavailable with a value greater than 1, it would allow for a faster
rollout, since up to maxUnavailable Pods could be loading up the cache at the same time.
3. My stateful clustered application has leaders and followers, with many more followers than leaders. My application can tolerate many followers going down at the same time. I want to be able to do faster
rollouts by bringing down two or more followers at the same time. This is only possible if StatefulSet
supports maxUnavailable in rolling updates.
4. Sometimes I just want easier tracking of revisions of a rolling update. Deployment does it through
ReplicaSets, which has its own nuances. Understanding that requires diving into the complexity of hashing
and how ReplicaSets are named. Over and above that, there were issues with hash collisions which
further complicated the situation (they have since been resolved). StatefulSet introduced ControllerRevisions
in 1.7, which are much easier to think and reason about. They are used by DaemonSet and StatefulSet for
tracking revisions. It would be much nicer if all the use cases of Deployments could be met by
StatefulSets while also tracking revisions with ControllerRevisions. Another way of saying this is:
all my Deployment use cases are easily met by StatefulSet, and I additionally get easier revision
tracking, but only if StatefulSet supports `maxUnavailable`.

With this feature in place, a user choosing StatefulSet with maxUnavailable > 1 is making a
conscious choice that more than one Pod going down at the same time during a rolling update will not
cause issues for their stateful application, which has per-Pod state and identity. Other stateful
applications that cannot tolerate more than one Pod going down will keep the current behavior of one-Pod-at-a-time rolling updates.

### Goals
The StatefulSet RollingUpdate strategy will contain an additional parameter called `maxUnavailable` to
control how many Pods can be brought down at a time during a rolling update.

### Non-Goals
NA

## Proposal

### User Stories

#### Story 1
As a user of Kubernetes, I should be able to update my StatefulSet more than one Pod at a time, in a
RollingUpdate manner, if my stateful app can tolerate more than one of its Pods being down, thus
allowing my update to finish much faster.

### Implementation Details

@@ -121,32 +129,79 @@ type RollingUpdateStatefulSetStrategy struct {
}
```
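
A minimal sketch of what the updated strategy type could look like with the new field; the comments and
JSON tags below are assumptions modeled on the existing `Partition` field, not the final API:

```go
// Sketch of the proposed API shape; field comments and json tags are
// assumptions, not the final API.
package v1

import "k8s.io/apimachinery/pkg/util/intstr"

// RollingUpdateStatefulSetStrategy is used to communicate parameters for
// RollingUpdateStatefulSetStrategyType.
type RollingUpdateStatefulSetStrategy struct {
	// Partition indicates the ordinal at which the StatefulSet should be
	// partitioned for updates.
	Partition *int32 `json:"partition,omitempty"`

	// MaxUnavailable is the maximum number of Pods that can be unavailable
	// during the update. The value can be an absolute number or a percentage
	// of desired Pods. Defaults to 1, which preserves today's behavior.
	MaxUnavailable *intstr.IntOrString `json:"maxUnavailable,omitempty"`
}
```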

- By default, if maxUnavailable is not specified, its value will be assumed to be 1 and StatefulSets
will follow their old behavior. This will also help while upgrading from a release which doesn't support maxUnavailable to a release which supports this field.
- If maxUnavailable is specified, it cannot be greater than the total number of replicas.
- If maxUnavailable is specified and a partition is also specified, maxUnavailable cannot be greater than `replicas - partition`. (A sketch of these defaulting and validation rules follows the discussion of the behavior choices below.)
- If a partition is specified, maxUnavailable will only apply to the Pods which are staged by the
partition. That means all Pods with an ordinal greater than or equal to the partition will be
updated when the StatefulSet’s `.spec.template` is updated. Say total replicas is 5, the partition is set to 2, and maxUnavailable is set to 2. If the image is changed in this scenario, the following
are the possible behavior choices we have:

1. Pods with ordinals 4 and 3 will start terminating at the same time (because of maxUnavailable). Once they are both running and ready, the Pod with ordinal 2 will start terminating. Pods with ordinals 0 and 1
will remain untouched due to the partition. In this choice, the number of Pods terminating is not always
maxUnavailable, but sometimes less. For example, if the Pod with ordinal 3 is running and ready but 4 is not, we still wait for 4 to be running and ready before moving on to 2. This implementation avoids
out-of-order terminations of Pods.
2. Pods with ordinals 4 and 3 will start terminating at the same time (because of maxUnavailable). When either 4 or 3 is running and ready, the Pod with ordinal 2 will start terminating. This could violate
ordering guarantees, since if 3 is running and ready first, then both 4 and 2 are terminating at the same
time, out of order. If 4 is running and ready first, then both 3 and 2 are terminating at the same time and no ordering guarantees are violated. This implementation guarantees that there are always maxUnavailable Pods terminating, except for the last batch.
3. Pods with ordinals 4 and 3 will start terminating at the same time (because of maxUnavailable). When 4 is running and ready, 2 will start terminating; at this point both 2 and 3 are terminating. If 3 is
running and ready before 4, 2 won't start terminating, to preserve ordering semantics. So at that point
only one Pod is unavailable although we requested 2.
4. Introduce a field in RollingUpdate which decides whether we want maxUnavailable with or without ordering guarantees. Depending on what the user wants, this choice can select behavior 1 or 3 if ordering guarantees are needed, or behavior 2 if they are not. To simplify this further,
PodManagementPolicy today supports `OrderedReady` or `Parallel`. The `Parallel` mode only applies to scale-up and tear-down of StatefulSets and currently doesn't apply to rolling updates. So instead of coming up
with a new field, we could use the PodManagementPolicy to choose the behavior the user wants.

   1. PMP=Parallel will now also apply to RollingUpdate. This selects the behavior described in 2 above:
      maxUnavailable Pods are always terminating at the same time (except in the last batch) and no
      ordering guarantees are provided.
   2. PMP=OrderedReady with maxUnavailable can select either behavior 1 or 3.

NOTE: The goal is faster updates of an application. In some cases, people need both ordering
and faster updates. In other cases they just need faster updates and they don't care about ordering as
long as they get identity.

Choice 1 is simpler to reason about. It does not always have maxUnavailable Pods in the
Terminating state. It does not guarantee ordering within the batch of maxUnavailable Pods. The maximum
difference between the ordinals that are terminating out of order cannot be more than maxUnavailable.

Choice 2 always has maxUnavailable Pods in the Terminating state. This can sometimes lead to
Pods terminating out of order. It will always lead to the fastest rollouts. The maximum difference between the ordinals that are terminating out of order can be more than maxUnavailable.

Choice 3 guarantees that no two Pods are ever terminating out of order. It sometimes does so
at the cost of not being able to terminate maxUnavailable Pods. The implementation for this might be
complicated.

Choice 4 gives the user the choice and hence takes the guessing out of what they should expect.
Implementing Choice 4 using PMP would be the easiest.
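
Independent of which behavior is chosen, the defaulting and validation rules listed earlier could be
sketched roughly as follows; the helper name, signature, and error messages are assumptions for
illustration, not the final implementation:

```go
package statefulset

import (
	"fmt"

	intstrutil "k8s.io/apimachinery/pkg/util/intstr"
)

// resolveMaxUnavailable is a hypothetical helper illustrating the defaulting
// and validation rules above.
func resolveMaxUnavailable(maxUnavailable *intstrutil.IntOrString, replicas, partition int32) (int, error) {
	// Default to 1 so existing StatefulSets keep the one-Pod-at-a-time behavior.
	value, err := intstrutil.GetValueFromIntOrPercent(
		intstrutil.ValueOrDefault(maxUnavailable, intstrutil.FromInt(1)), int(replicas), false)
	if err != nil {
		return 0, err
	}
	if value < 1 {
		return 0, fmt.Errorf("maxUnavailable must be at least 1, got %d", value)
	}
	// maxUnavailable cannot be greater than the total number of replicas.
	if int32(value) > replicas {
		return 0, fmt.Errorf("maxUnavailable (%d) cannot be greater than replicas (%d)", value, replicas)
	}
	// With a partition, only ordinals >= partition are eligible for update, so
	// maxUnavailable cannot be greater than replicas-partition.
	if int32(value) > replicas-partition {
		return 0, fmt.Errorf("maxUnavailable (%d) cannot be greater than replicas-partition (%d)", value, replicas-partition)
	}
	return value, nil
}
```

Using `intstr.IntOrString` keeps parity with Deployments, where maxUnavailable also accepts either an
absolute number or a percentage.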

##### Recommended Choice

I recommend Choice 4, using PMP=Parallel for the first Alpha phase. This would give users fast
rollouts without having to second-guess what the behavior should be. This choice also allows for
easily extending the behavior with PMP=OrderedReady in the future to choose either behavior 1 or 3.

> Reviewer comment (Member): In particular, I think users will want the behavior of (1) above if they
> are using Ordered PodManagement and the behavior of (2) if they are using Parallel. Consider
> implementing just these semantics. That is, if PMP is Ordered, the user is declaring that they care
> about termination ordering, even if they are willing to tolerate a larger number of disruptions
> during an update. If PMP is Parallel, the user does not care about the termination ordering during
> turn-up/turn-down. If they do care about the termination ordering during update, they can set the
> maxUnavailable field to 1 to preserve the current behavior. If they wish to tolerate a larger number
> of disruptions, they can increase its value.
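
To make the recommendation concrete, here is a hypothetical client-side sketch combining PMP=Parallel
with the proposed field, assuming `MaxUnavailable` lands in `apps/v1` as described; the Pod template
and other required fields are omitted for brevity:

```go
package main

import (
	"fmt"

	appsv1 "k8s.io/api/apps/v1"
	"k8s.io/apimachinery/pkg/util/intstr"
)

func main() {
	replicas := int32(5)
	partition := int32(2)
	maxUnavailable := intstr.FromInt(2) // proposed field; assumed to exist in apps/v1

	set := appsv1.StatefulSet{
		Spec: appsv1.StatefulSetSpec{
			Replicas: &replicas,
			// Parallel means the user does not care about termination ordering
			// during the update, i.e. behavior 2 above.
			PodManagementPolicy: appsv1.ParallelPodManagement,
			UpdateStrategy: appsv1.StatefulSetUpdateStrategy{
				Type: appsv1.RollingUpdateStatefulSetStrategyType,
				RollingUpdate: &appsv1.RollingUpdateStatefulSetStrategy{
					Partition:      &partition,
					MaxUnavailable: &maxUnavailable,
				},
			},
		},
	}

	// With replicas=5, partition=2, maxUnavailable=2: ordinals 4 and 3 can be
	// updated at the same time, followed by ordinal 2; ordinals 0 and 1 keep
	// the old revision because of the partition.
	fmt.Printf("maxUnavailable=%s partition=%d\n",
		set.Spec.UpdateStrategy.RollingUpdate.MaxUnavailable.String(),
		*set.Spec.UpdateStrategy.RollingUpdate.Partition)
}
```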

#### Implementation

TBD: Will be updated after we have agreed on the semantics being discussed above.

https://github.com/kubernetes/kubernetes/blob/v1.13.0/pkg/controller/statefulset/stateful_set_control.go#L504
```go
...
	// we compute the minimum ordinal of the target sequence for a destructive update based on the strategy.
	updateMin := 0
	maxUnavailable := 1
	if set.Spec.UpdateStrategy.RollingUpdate != nil {
		updateMin = int(*set.Spec.UpdateStrategy.RollingUpdate.Partition)

		// NEW CODE HERE
		// Resolve the proposed maxUnavailable field (absolute number or percentage),
		// defaulting to 1 to preserve the current one-Pod-at-a-time behavior.
		var err error
		maxUnavailable, err = intstrutil.GetValueFromIntOrPercent(intstrutil.ValueOrDefault(set.Spec.UpdateStrategy.RollingUpdate.MaxUnavailable, intstrutil.FromInt(1)), int(replicaCount), false)
		if err != nil {
			return &status, err
		}
	}

	var unavailablePods []string
	// we terminate the Pod with the largest ordinal that does not match the update revision.
	for target := len(replicas) - 1; target >= updateMin; target-- {

		// delete the Pod if it is not already terminating and does not match the update revision.
		if getPodRevision(replicas[target]) != updateRevision.Name && !isTerminating(replicas[target]) {
			klog.V(2).Infof("StatefulSet %s/%s terminating Pod %s for update",
				set.Namespace,
				set.Name,
				replicas[target].Name)
			if err := ssc.podControl.DeleteStatefulPod(set, replicas[target]); err != nil {
				return &status, err
			}

			// NEW CODE HERE
			// After deleting a Pod, don't return from here yet.
			// We might have maxUnavailable greater than 1.
			status.CurrentReplicas--
		}

		// wait for unhealthy Pods on update
		if !isHealthy(replicas[target]) {
			// NEW CODE HERE
			// If this Pod is unhealthy, regardless of revision, count it among
			// the unavailable Pods.
			unavailablePods = append(unavailablePods, replicas[target].Name)
			klog.V(4).Infof(
				"StatefulSet %s/%s is waiting for Pod %s to update",
				set.Namespace,
				set.Name,
				replicas[target].Name)
		}

		// NEW CODE HERE
		// If at any time the total number of unavailable Pods reaches maxUnavailable,
		// we stop deleting more Pods for this update.
		if len(unavailablePods) >= maxUnavailable {
			klog.V(4).Infof(
				"StatefulSet %s/%s is waiting for unavailable Pods %v to update, max allowed to update simultaneously %v",
				set.Namespace,
				set.Name,
				unavailablePods,
				maxUnavailable)
			return &status, nil
		}
	}
...
```

@@ -201,7 +269,9 @@ tried this feature in Alpha, we would have time to fix issues.

### Tests

- maxUnavailable = 1: same behavior as today, with PodManagementPolicy as `OrderedReady` or `Parallel`
- Each of these tests can be run with PodManagementPolicy = `OrderedReady` or `Parallel`, and the update
should proceed at most maxUnavailable Pods at a time, in ordered or parallel fashion respectively
- maxUnavailable greater than 1 without partition
- maxUnavailable greater than replicas without partition
- maxUnavailable greater than 1 with partition and staged pods less than maxUnavailable
@@ -218,11 +288,11 @@ tried this feature in Alpha, we would have time to fix issues.
## Implementation History

- KEP Started on 1/1/2019
- Implementation PR and UT by 8/30

## Drawbacks

NA

## Alternatives
