Robust rollingupdate and rollback #1353

Closed
bgrant0607 opened this Issue Sep 18, 2014 · 37 comments

@bgrant0607
Member

In PR #1325, we agreed that rollingupdate should be factored out of kubecfg as part of the kubecfg overhaul.

#492 (comment) and PR #1007 discussed alternative approaches to rollingupdate.

What I'd recommend is that we implement a new rollingupdate library and a corresponding command-line wrapper. The library may be invoked by more general libraries/tools in the future.

The rolling update approach that should be used is that of creating a new replication controller with 1 replica, resizing the new (+1) and old (-1) controllers one by one, and then deleting the old controller once it reaches 0 replicas. Unlike the current approach, this predictably updates the set of pods regardless of unexpected failures.
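
A minimal sketch of that loop in Go, assuming a hypothetical rcClient interface rather than the real API client:

package rollingupdate

// rcClient is a hypothetical, minimal view of the replication controller API.
type rcClient interface {
	GetReplicas(name string) (int, error)
	SetReplicas(name string, count int) error
	Delete(name string) error
}

// Update moves replicas from oldRC to newRC one at a time, then deletes
// oldRC once it reaches 0 replicas. Because progress is recorded in the
// controllers themselves, an interrupted update can be resumed by rerunning it.
func Update(c rcClient, oldRC, newRC string) error {
	oldCount, err := c.GetReplicas(oldRC)
	if err != nil {
		return err
	}
	newCount, err := c.GetReplicas(newRC)
	if err != nil {
		return err
	}
	for oldCount > 0 {
		newCount++
		if err := c.SetReplicas(newRC, newCount); err != nil {
			return err
		}
		oldCount--
		if err := c.SetReplicas(oldRC, oldCount); err != nil {
			return err
		}
	}
	return c.Delete(oldRC)
}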

The two replication controllers would need at least one differentiating label; it could use the image tag of the pod's primary container, since image updates are typically what motivate rolling updates.

It should be possible to apply the operation to services containing multiple release tracks, such as daily and weekly or canary and stable, by applying it to each track individually.

@bgrant0607 bgrant0607 added the config label Sep 19, 2014
@bgrant0607 bgrant0607 added this to the v1.0 milestone Oct 4, 2014
@alex-mohr
Member

The approach described above would seem to work for stateless services and/or blue-green deployments, but I'm not sure how it would work for something like a Redis master, where the pod itself has storage it cares about.

Are stateful pods a use case we want to support? If so, they would seem to require some form of in-place update?

@KyleAMathews
Contributor

I'd say yes, with some sort of hook system that lets people write custom migration scripts to support the transition to, say, a new Redis master.

@bgrant0607
Member

@alex-mohr See #598 and #1515. You could also put the data into a PD.

@stp-ip
Member
stp-ip commented Nov 21, 2014

The thing is, we have to start somewhere, and I would say we should start with a rolling update mechanism for stateless services only.

Stateful services have a lot of different needs. I would try to avoid turning k8s into a workflow-integration system. The question is: would it be possible to let containers handle it themselves? On shutdown, lock and flush the database; on restart, unlock the database, etc. It could be handled on a per-container basis, but that makes the containers a lot more complex.
Another idea, which in my opinion only works for migrations, would be to spin up an intermediate pod.

database example:
-> let the container lock and flush the database
-> start an intermediate container to migrate the database
-> remove the intermediate container when finished (it exits by itself)
-> start the new pod, one container at a time
-> remove the old pod, one container at a time

This would be the easiest example to start with. I agree there are other ways to remove the complexity from the containers themselves. For example, a script could be executed on each old container via docker exec, which kills the container once the script has run. Then either the migration script could run once all containers are killed (downtime...), or the new containers could already be spawned.

@KyleAMathews
Contributor

@bgrant0607 looks like we'd need a pre-prestop, though, as SIGTERM is still sent to the container :)

@bgrant0607
Member

Copying the detailed design of rolling update from #3061:

Requirements, principles, assumptions, etc.:

  • As a matter of principle, any functionality available in kubectl should be available by library call and declaratively, as well as by command.
  • Rolling update should not mix configuration generation and orchestration of the update.
  • It needs to handle arbitrary (e.g., multi-container) pods.
  • Durable data objects are assumed to have lifetimes independent of the pods themselves.
  • If the client fails or is killed in the middle of the rolling update, there needs to be a reasonable way to recover and complete or roll back the rollout.
  • Users should be able to specify the update rate on the command line, especially given that we don't yet check pod health or readiness. If it just blazes through the pods, it might as well not be a rolling update.

Proposed syntax:

kubectl rollingupdate <old-controller-name>  -f <new-replication-controller.json>

If the number of replicas in the new template is unspecified, I think it would default to 0 after parsing. The behavior in this case would be that we gradually increase it to the replica count of the original. We could also allow the user to specify a new size, which would do a rolling update for min(old, new) replicas and then delete or add the remainder.
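
As a tiny sketch of that sizing rule (illustrative names only, not the kubectl implementation):

package rollingupdate

// targetSize returns how many replicas to move one-by-one and the final size
// of the new controller. A new controller whose size parsed as 0 simply
// inherits the old controller's size.
func targetSize(oldReplicas, newSpecified int) (rolled, final int) {
	if newSpecified == 0 {
		return oldReplicas, oldReplicas
	}
	rolled = oldReplicas
	if newSpecified < rolled {
		rolled = newSpecified
	}
	// The remaining replicas are added to (or deleted from) the new
	// controller after the rolling phase.
	return rolled, newSpecified
}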

Since kubectl can reason about the type of the json parameter, we could add support for specifying just the pod template later (like when we have config generation and v1beta3 sorted), or for other types of controllers (e.g., per-node controller).

Regarding recovery in the case of failure in the middle (i.e., resumption of the rolling update), and rollback of a partial rolling update: This is where annotations would be useful -- to store the final number of replicas desired, for instance. I think the same syntax as above could work for recovery, if kubectl first checked whether the new replicationController already existed and, if so, merely continued. If we also supported just specifying a name for the new replication controller, that could also be used either to finish the rollingupdate or to roll back, by swapping old and new on the command line.

We should use an approach friendly to resizing, either via kubectl or via an auto-scaler. We should keep track of the number of replicas subtracted in an annotation on the original replication controller, so that the total desired is the current replica count plus the number subtracted -- unless the user explicitly specified the desired number in the new replication controller, in which case that can be stored in an annotation on the new replication controller.
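
A sketch of that bookkeeping, with a hypothetical annotation key (not an actual key used by kubectl):

package rollingupdate

import "strconv"

// Hypothetical annotation recorded on the original replication controller.
const replicasSubtractedKey = "kubectl.kubernetes.io/replicas-subtracted"

// desiredTotal derives the rollout target from the old controller's current
// replica count plus the replicas already moved to the new controller, so a
// concurrent resize by a user or auto-scaler is still respected.
func desiredTotal(oldReplicas int, oldAnnotations map[string]string) int {
	subtracted, _ := strconv.Atoi(oldAnnotations[replicasSubtractedKey])
	return oldReplicas + subtracted
}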

I also eventually want to support "roll over" -- replace N replication controllers with one. I think that's straightforward if we just use the convention that the file or last name corresponds to the target replication controller, though it may be easier for users to shoot themselves in the foot. Perhaps a specific command-line argument to identify the new/target controller would therefore be useful.

kubectl rollingupdate old1 old2 ... oldN new
kubectl rollingupdate old1 old2 ... oldN -f new.json

This issue should not be closed until we support recovery and rollover.

@davidopp
Member
davidopp commented Jan 7, 2015

I don't understand why you need annotations. Isn't the rolling update essentially stateless, in the sense that you can figure out where you left off and what remains to be done just by looking at the old and new replication controllers and the pods?

@bgrant0607
Member

Almost, but not quite. The two replication controllers are not changed atomically, so without keeping track, the count could be off by one. We also allow the size to be changed during the rolling update.

@bgrant0607 bgrant0607 changed the title from Robust rollingupdate to Robust rollingupdate and rollback Jan 10, 2015
@bgrant0607
Member

How to make rolling update friendly to auto-scaling is described here: #2863 (comment)

@bgrant0607 bgrant0607 added the team/UX label Feb 5, 2015
@goltermann goltermann removed this from the v1.0 milestone Feb 6, 2015
@bgrant0607 bgrant0607 modified the milestone: v1.0 Feb 6, 2015
@bgrant0607
Member

See also rollingupdate-related issues in the cli roadmap.

@vipulnayyar

@bgrant0607 Just wanted to confirm whether this idea is really open or someone has started work on this. I went through @jlowdermilk's commit consisting of the current rolling update code and the future points in the cli roadmap. Building an auto-scaler which is by default easy to use or configure, with an option for inserting custom scripts, is definitely exciting.

This would require using and expanding the current rolling update code so that it supports the different scaling scenarios the user chooses inside the auto-scaler, right?

Anything else you may want to guide me about, before working on a more concrete design proposal based on ideas discussed in #2863?

@bgrant0607
Member

@vipulnayyar This issue specifically relates to improvements/extensions of rollingupdate. AFAIK, nobody is actively working on this issue at the moment.

Improvements/extensions:

  • Change the annotations and termination condition, as described by #2863 (comment), to be resilient to concurrent resizing for auto- or manual scaling
  • Make the operation reversible in order to support rollback
  • Create a controller generator (see run-container for another example of a generator) to facilitate creation of the new controller from an existing one by updating the image and perhaps other fields
  • Use readiness probe results to limit the number of not-ready pods during the rolling update
  • Support rollover: replace N controllers with 1
  • Allow the controllers to be replaced to be specified using multiple object names, label selector, objects gleaned from files, etc., as with other kubectl operations, using Builder
  • Unify with Openshift's DeploymentController #1743

Work on auto-scaling, OTOH, is ongoing -- e.g., #5492.

@davidopp
Member

On the controller generator issue: We recently worked with some people who were new to Kubernetes who were playing around with simple deployments as the first step of moving larger workloads onto the system. One of the confusions that a couple of different people ran into was the fact that the "id" of the replication controller has to change in order to do a rolling update. I agree with them that this is pretty unnatural; the user is thinking of "the controller for service X" and would presumably like it to keep a stable name even as the configuration of X changes over time. I can understand why we don't want Pods to keep a stable identity across reincarnations, but I don't think the same reasoning applies to replication controllers (or services or nodes or ...).

I'm very aware from our internal cluster management system that the system internals can become somewhat ugly if you want to maintain a stable collection name across rolling updates. However, if it simplifies the user experience, it might be worthwhile. The main problem is that I don't know how you make it work with the "everything is declarative/an object" semantics we have in Kubernetes -- our internal system is able to keep a stable job name because you don't manipulate the job directly, whereas in Kubernetes you do manipulate the RCs directly. I realize it would be a big change, but is there any way we could expose some kind of meta-object that encapsulates a rolling update (and which the user would manipulate via REST), and hide under the covers the RC manipulations it triggers? (Ideally such that at the end of the update, there is only one RC left and it has the same name as the RC you started with.) We sort of approached a vaguely related idea in the discussion of a resize verb in the context of the autoscaler.

@davidopp
Member

It's possible that a good config system could serve the same role as the meta-object I described (hiding the low-level details of the changing RC identity), but I'm not sure -- for example, monitoring would probably see both RC names while you're in the middle of the update, with no obvious link between them, and at the end of the update you end up with an RC with a different identity than the one you started with.

@jlowdermilk
Member

@davidopp, we've discussed several times changing the rollingupdate approach to start and end with the same controller. I agree that ending with a different controller id is non-intuitive. IIRC, the argument against preserving the original RC id is that RCs are supposed to be ephemeral. I don't know how strong an argument that is, though.

@davidopp
Member

Yeah, I don't think that any objects should be ephemeral across updates. For actual death (like a pod dying or an RC being deleted), then yes, the replacement should be a new "thing," but not for an update.

On a related note, one of the same users was confused about the semantics of updating an RC. What happens if you update the replication count for an RC? What if you update the image? (Not using kubectl rollingupdate, but just using kubectl update)

@bgrant0607
Member

@davidopp @jlowdermilk

RCs are "helper" mechanisms intended to be manipulated by configuration/deployment tools/systems, auto-scalers, etc. They aren't tied to pod or service lifetimes or identities, and they're not intended to be pets. They aren't managed instance groups or jobs. Yes, it's different, but so are pods and services.

The argument for treating RCs as ephemeral is not circular. Trying to maintain the same RC name will make deployment tools more complex and restrict flexibility, while not simplifying auto-scalers, and I think the simplification for users is dubious as well, especially in the presence of failures, where they'll see multiple controllers, need to roll back, etc.

I envision using labels to manage sets of replication controllers.

As for a higher-level entity -- that sounds similar to Openshift's Deployment controller #1743. I agree something like that would be useful, probably implemented as an API plugin (as we want to do with RC itself).

We don't yet have true declarative updates. That's being worked on: #1702, #4889. Once we do, we'll need a way to declaratively specify the update policy: recreate, patch, rolling update. For now, we just have the imperative update and rollingupdate commands.

@bgrant0607
Member

RC update semantics are simple: Changing the replica count causes pods to be created or deleted. Changing the pod template only affects new pods. It has no effect on existing pods -- no quantum entanglement. It's a cookie cutter, not a representation of the desired state of all pods.

@davidopp
Member

I discussed with @bgrant0607 and @thockin in person. The summary from my perspective is that for some users the ReplicationController semantics are not well-matched to their mental model, but longer-term we anticipate some users will interact with something different, for example a Deployment object that uses a model people may be more familiar with (for example, a 1:1 correspondence to a micro-service, keeps a stable identity over time, is thought of as a pet with a name rather than as a cow addressed by label, updates to it update existing pods rather than only changing what new pods will look like, etc.). This might be built as a layer on top of RC, or as a separate controller that lives at the same level of the hierarchy as the RC.

@thockin
Member
thockin commented Mar 19, 2015

I think there's a certain amount of re-education that people will eventually want to undergo to adapt to the kubernetes mental model, but I do think something like a deployment object could ease the transitions and provide a lot of value.

I know @smarterclayton and his gang have done some work here already, but I am not very familiar with their design.

@smarterclayton
Contributor

The key requirement for the "micro service / self-updating component" is level-based mutation. With Docker 1.6, it will be possible to do level-based updates directly against a registry (although you'll have to poll the registry to do it). The poll (something has changed externally from before) and the mutation (what you change within the pod template) are what we call the trigger -- a detected change results in a mutation to the running template, which then results in a new deployment (using the rolling update pattern of two RCs, with a "job" that runs the rolling update inside a pod). I think of that as a controller object above an RC that offers RC features plus the rollout. The triggering is the thing that has coupling issues -- I don't have a good story today for how you'd let someone add their own level-based behaviors to the trigger pattern.

I'm pretty convinced that a unit like this is an 80% case for most organizations, so I'm incentivized to make sure it's reusable on its own.
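
A rough sketch of that poll/trigger/mutate cycle, with hypothetical interfaces (OpenShift's actual trigger implementation differs):

package trigger

import "time"

// Registry resolves the current image ID for a repository:tag, e.g. by
// polling a Docker registry.
type Registry interface {
	LatestImageID(repo, tag string) (string, error)
}

// Deployer mutates the running pod template, which kicks off a new deployment.
type Deployer interface {
	CurrentImageID(name string) (string, error)
	RolloutWithImage(name, imageID string) error
}

// Watch polls the registry; when the resolved image changes, it mutates the
// template, and the resulting new deployment uses the two-RC rolling update.
func Watch(reg Registry, dep Deployer, name, repo, tag string, interval time.Duration) error {
	for {
		latest, err := reg.LatestImageID(repo, tag)
		if err != nil {
			return err
		}
		current, err := dep.CurrentImageID(name)
		if err != nil {
			return err
		}
		if latest != current {
			if err := dep.RolloutWithImage(name, latest); err != nil {
				return err
			}
		}
		time.Sleep(interval)
	}
}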

@bgrant0607 bgrant0607 added priority/P1 and removed priority/P2 labels Apr 8, 2015
@bgrant0607
Member

An example of someone using rolling update can be found here:
http://thepracticalsysadmin.com/kubernetes-resize-and-rolling-updates/

It's a typical example: someone wants to roll out a new image, and maybe command/args and/or env vars. They change the image (or at least the image tag), and need to change the uniquifying label value on the pod template (and correspondingly in the RC selector and labels) and RC name.

In lieu of a file for the new RC in rolling-update, it should be possible to specify those changes and then generate a new RC. Flags should be similar to run-container.
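
A sketch of such a generator over a simplified controller type (the real one would operate on the API's ReplicationController object; the "deployment" label key is illustrative only):

package generate

// RC is a simplified stand-in for a replication controller spec.
type RC struct {
	Name      string
	Image     string
	Selector  map[string]string // label selector
	PodLabels map[string]string // labels on the pod template
}

// NextRC derives the new controller from the old one: a new name, a new image,
// and an updated uniquifying label in both the selector and the pod template.
func NextRC(old RC, newName, newImage, uniquifier string) RC {
	next := RC{
		Name:      newName,
		Image:     newImage,
		Selector:  map[string]string{},
		PodLabels: map[string]string{},
	}
	for k, v := range old.Selector {
		next.Selector[k] = v
	}
	for k, v := range old.PodLabels {
		next.PodLabels[k] = v
	}
	next.Selector["deployment"] = uniquifier
	next.PodLabels["deployment"] = uniquifier
	return next
}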

@bgrant0607
Member

Will comment on #1743 and/or #503 re. a higher-level deployment controller.

@quinton-hoole
Member

cc: quinton-hoole

@bgrant0607
Member

I think the common scenarios are:

  • rolling update
  • rolling update, pause in middle, rollback, create a patched image, rolling update
  • create a canary, run it for a while, then perform a rolling update by growing the canary and shrinking the original RC
  • create a canary/experiment, run it for a while, delete it
  • rolling update, pause in middle, create a patched image, perform a rolling update to replace the original image and the botched image (aka rollover)
  • rolling update with a resize in middle (possibly without pausing, since the resize might be due to an auto-scaler)
  • run multiple "release tracks" continuously, which are updated independently

cc @rjnagal

@smarterclayton
Contributor

Rolling update, then at the end automatically perform a migration (a post-deployment step).

@bgrant0607
Member

@smarterclayton Migration meaning traffic shifting?

@smarterclayton
Contributor

No, migration as in a schema upgrade (code version 2 is rolled out; once code version 1 is gone, trigger the automatic DB schema update from schema 3 -> 4).

@bgrant0607
Member

Ah, this is the post-deployment hook. Got it.

@smarterclayton
Contributor

Certainly doesn't have to be part of this, but if you think of deployment as a process of going from whatever the cluster previously had to something new, then many people may want to define the process (canaries, etc., as you laid out). The process has to do a reasonable job of trying to converge, but it's acceptable to wedge and report being wedged due to irreconcilable differences (from simple, like ports changing, to complex, like an image requiring new security procedures). Since the assumption is that you're transforming between two steady states, you either move forward, move back, or stay stuck.

@bgrant0607
Member

I agree that hooks seem useful, certainly in the case where deployments are triggered automatically. If we were to defer triggers, hooks probably could also be deferred.

@bgrant0607
Member

Using #4140 for rollback/abort.

@bgrant0607 bgrant0607 modified the milestone: v1.0-post, v1.0 Apr 27, 2015
@bgrant0607 bgrant0607 added priority/P2 and removed priority/P1 labels Apr 27, 2015
@bgrant0607
Member

Updating from cli-roadmap.md:

Deployment (#1743) will need rollover (replace multiple replication controllers with one) in order to kick off new rollouts as soon as the pod template is updated.

We should think about whether we still want annotations on the underlying replication controllers and, if so, whether they need to be improved: #2863 (comment)

@bgrant0607 bgrant0607 removed this from the v1.0-post milestone Jul 24, 2015
@bgrant0607
Member

I believe we have more specific issues filed for remaining work, so I'm closing this "metaphysical" issue.

@bgrant0607 bgrant0607 closed this Jul 27, 2015
@huang195
Contributor
huang195 commented Dec 7, 2015

@bgrant0607 are readiness and/or liveness probes taken into account during rolling update? What is the expected behavior when these probes are in good/bad condition?
