New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Service update failure thresholds and rollback #26421

Merged
merged 2 commits into from Oct 18, 2016

Conversation

@aaronlehmann
Contributor

aaronlehmann commented Sep 8, 2016

This adds support for two enhancements to swarm service rolling updates:

  • Failure thresholds: In Docker 1.12, a service update could be set up
    to either pause or continue after a single failure occurs. This adds
    an --update-max-failure-ratio flag that controls how many tasks need to
    fail to update for the update as a whole to be considered a failure. A
    counterpart flag, --update-monitor, controls how long to monitor each
    task for a failure after starting it during the update.
  • Rollback flag: service update --rollback reverts the service to its
    previous version. If a service update encounters task failures, or
    fails to function properly for some other reason, the user can roll back
    the update.

SwarmKit also has the ability to roll back updates automatically after
hitting the failure thresholds, but we've decided not to expose this in
the Docker API/CLI for now, favoring a workflow where the decision to
roll back is always made by an admin. Depending on user feedback, we may
add a "rollback" option to --update-failure-action in the future.

SwarmKit PR: docker/swarmkit#1380

@aaronlehmann aaronlehmann force-pushed the aaronlehmann:update-thresholds-rollbacks branch 2 times, most recently from 23d0359 to 0e0d238 Sep 8, 2016

flagWorkdir = "workdir"
flagRegistryAuth = "with-registry-auth"
flagLogDriver = "log-driver"
flagLogOpt = "log-opt"

This comment has been minimized.

@dperny

dperny Sep 13, 2016

Contributor

gofmt working as intended

@@ -36,6 +36,7 @@ func newUpdateCommand(dockerCli *command.DockerCli) *cobra.Command {
flags := cmd.Flags()
flags.String("image", "", "Service image tag")
flags.String("args", "", "Service command args")
flags.Bool("rollback", false, "Rollback to previous specification")

This comment has been minimized.

@dperny

dperny Sep 13, 2016

Contributor

Is rollback a flag you invoke by itself. Like, docker service update --rollback?

@@ -1178,6 +1180,7 @@ __docker_service_subcommand() {
"($help)*--container-label-add=[Add or update container labels]:label: " \
"($help)*--container-label-rm=[Remove a container label by its key]:label: " \
"($help)--image=[Service image tag]:image:__docker_repositories" \
"($help)--rollback[Rollback to previous specification]" \

This comment has been minimized.

@dperny

dperny Sep 13, 2016

Contributor

Nit: misaligned line.

to the configuration that was in place before the most recent
`docker service update` command. Other options can be combined with
`--rollback`; for example, `--update-delay 0s` to execute the rollback without
a delay between tasks.

This comment has been minimized.

@dperny

dperny Sep 13, 2016

Contributor

i suppose this answers my above question

@dperny

This comment has been minimized.

Contributor

dperny commented Sep 13, 2016

FWIW, LGTM. I am not a maintainer though.

@aaronlehmann aaronlehmann force-pushed the aaronlehmann:update-thresholds-rollbacks branch from 0e0d238 to e04ccb3 Sep 13, 2016

flagStopGracePeriod = "stop-grace-period"
flagUpdateDelay = "update-delay"
flagUpdateFailureAction = "update-failure-action"
flagUpdateFailureFraction = "update-failure-fraction"

This comment has been minimized.

@aluzzardi

aluzzardi Sep 17, 2016

Member

Not a big fan of --update-failure-fraction overall as I mentioned IRL but unfortunately I don't have a better idea.

@dnephin @vieux @aanand @ehazlett maybe?

This comment has been minimized.

@dnephin

dnephin Sep 17, 2016

Member

I agree, I usually prefer the term "ratio" over "fraction". Some ideas:

  • --update-min-success-ratio
  • --update-max-failure-ratio

This comment has been minimized.

@aaronlehmann

aaronlehmann Sep 19, 2016

Contributor

I think --update-max-failure-ratio could work. Ratio is indeed a better description.

@aluzzardi WDYT? Also, if we adopt this terminology, should we update SwarmKit to rename the protobuf field accordingly?

@aluzzardi

Overall LGTM - nits on flag names.

@dnephin

How does rollback behave with respect to service version? I'm thinking of something like this:

The service starts at version 1
User A performs an update (service is at version 2)
User B performs an update immediately after (service is at version 3)
User A sees things going bad, so attempts to rollback.
Service ends up on version 2 (I assume?) instead of version 1

If the --rollback flag accepted a service version it would prevent that scenario? The user would be informed of the error and they could force an update to the specific version they want manually.

} else if rollback {
updateOpts.RegistryAuthFrom = types.RegistryAuthFromPreviousSpec
} else {
updateOpts.RegistryAuthFrom = types.RegistryAuthFromSpec

This comment has been minimized.

@dnephin

dnephin Sep 17, 2016

Member

I think a switch/case would be appropriate here

`docker service update`'s `--rollback` flag. This will revert the service
to the configuration that was in place before the most recent
`docker service update` command. Other options can be combined with
`--rollback`; for example, `--update-delay 0s` to execute the rollback without

This comment has been minimized.

@dnephin

dnephin Sep 17, 2016

Member

When other options are combined with rollback are they applied to the service spec, or are they only used for the "update", so only "update" options would work?

My understanding (which may be out of date by now) was that if you ran an update which changed the update config of a service, that change wouldn't be reflected in the current update, it would only apply to the next one, because the service uses the current configuration to perform the update, not the new configuration.

Has that changed? Or is rollback a special case that works differently?

This comment has been minimized.

@aaronlehmann

aaronlehmann Sep 19, 2016

Contributor

When other options are combined with rollback are they applied to the service spec, or are they only used for the "update", so only "update" options would work?

The current implementation allows all options supported by service update. I think update-* options are the most useful options to specify in combination with a rollback. But I am allowing all options for simplicity. If there's a good UX argument for restricting the set of options that can be combined with rollback, we could consider doing that.

My understanding (which may be out of date by now) was that if you ran an update which changed the update config of a service, that change wouldn't be reflected in the current update, it would only apply to the next one, because the service uses the current configuration to perform the update, not the new configuration.

I don't believe that's the case. When I've tested these options, they seem to take effect immediately. If there are execptions to this, I don't think it's the intended behavior.

@aaronlehmann

This comment has been minimized.

Contributor

aaronlehmann commented Sep 19, 2016

How does rollback behave with respect to service version? I'm thinking of something like this:

The service starts at version 1
User A performs an update (service is at version 2)
User B performs an update immediately after (service is at version 3)
User A sees things going bad, so attempts to rollback.
Service ends up on version 2 (I assume?) instead of version 1

Yes, that scenario is accurate. However I don't see this as any different from the way docker service update works today. If User A runs docker service update --env-add A=B and immediately afterwards, User B runs docker service update --env-add C=D, the service ends up with both environment variables. We don't have the concept of a "user" in the daemon (no multitenancy), or a way of associating a service version with a particular user, so I'm not sure how this could work differently. I think for now it's a reasonable assumption that if the user asks for a rollback, they want to go to the immediately preceding version.

If the --rollback flag accepted a service version it would prevent that scenario? The user would be informed of the error and they could force an update to the specific version they want manually.

In theory, yes, but I see a few hurdles:

  • We'd need to store multiple historic versions of the service spec. This raises a few resource management issues (how many do you keep?)
  • The UI around this would be pretty involved. Internal version numbers aren't meaningful to the user, so I think we'd need a way to diff service specs and present the differences in a way that's easy to interpret. Then we'd need to design a reasonable CLI workflow for this. (service diff subcommand? Something that shows the incremental evolution of the service from version to version?)
  • Once we get into the business of storing generalized history, it probably doesn't make sense to do this as a one-off for services. Ideally we would extend SwarmKit's store to keep historical versions of all objects. This could definitely come in handy for other types such as tasks (looking at the different version of a particular task object could serve as an execution log). But this is a big redesign/refactor and @aluzzardi and I feel it's outside the scope of this rollback project.

@aaronlehmann aaronlehmann force-pushed the aaronlehmann:update-thresholds-rollbacks branch from e04ccb3 to 0e8b69c Sep 22, 2016

@aaronlehmann

This comment has been minimized.

Contributor

aaronlehmann commented Sep 22, 2016

Rebased and updated in response to feeback. --update-failure-fraction has been renamed to --update-max-failure-ratio.

@dnephin @aluzzardi PTAL

@dnephin

This comment has been minimized.

Member

dnephin commented Sep 22, 2016

design LGTM

1 similar comment
@thaJeztah

This comment has been minimized.

Member

thaJeztah commented Sep 22, 2016

design LGTM

@aaronlehmann aaronlehmann force-pushed the aaronlehmann:update-thresholds-rollbacks branch from 0e8b69c to 29e19b6 Sep 27, 2016

@aaronlehmann

This comment has been minimized.

Contributor

aaronlehmann commented Sep 27, 2016

Rebased.

Review ping?

@dnephin

This comment has been minimized.

Member

dnephin commented Sep 27, 2016

LGTM

@aaronlehmann aaronlehmann force-pushed the aaronlehmann:update-thresholds-rollbacks branch 2 times, most recently from c2b9fd2 to 0f5df94 Oct 13, 2016

@aaronlehmann

This comment has been minimized.

Contributor

aaronlehmann commented Oct 15, 2016

User docs here: docker/docker.github.io#219

@thaJeztah

This comment has been minimized.

Member

thaJeztah commented Oct 15, 2016

Oh, thanks!

@vdemeester

This comment has been minimized.

Member

vdemeester commented Oct 18, 2016

arf, @aaronlehmann needs a rebase 👼

@vdemeester

One comment/question, but docs LGTM 🐸

- **Delay** – Delay between restart attempts.
- **Attempts** – Maximum attempts to restart a given container before giving up (default value
- **Delay** – Delay between restart attempts, in nanoseconds.
- **MaxAttempts** – Maximum attempts to restart a given container before giving up (default value

This comment has been minimized.

@vdemeester

vdemeester Oct 18, 2016

Member

Was it wrongely documented, or has this changed ?

This comment has been minimized.

@aaronlehmann

aaronlehmann Oct 18, 2016

Contributor

It was wrongly documented.

@@ -41,10 +41,14 @@ Placement:
{{- if .HasUpdateConfig }}
UpdateConfig:
Parallelism: {{ .UpdateParallelism }}
{{- if .HasUpdateDelay -}}

This comment has been minimized.

@mistyhacks

mistyhacks Oct 18, 2016

Contributor

Just curious, what does the - do here? What does removing it do?

This comment has been minimized.

@aaronlehmann

aaronlehmann Oct 18, 2016

Contributor

The - tells the template engine to remove whitespace that follows. This was incorrect because it ended up merging Delay onto the same line as Parallelism. Notice that the other conditionals don't have a - at the end.

I could have opened another PR for this, but found it easiest to bundle it with the other changes I was making to the template.

Delay: opts.update.delay,
Monitor: opts.update.monitor,
FailureAction: opts.update.onFailure,
MaxFailureRatio: opts.update.maxFailureRatio,

This comment has been minimized.

@mistyhacks

mistyhacks Oct 18, 2016

Contributor

This one is missing onFailure?

aaronlehmann added some commits Sep 7, 2016

API changes for service rollback and failure threshold
Signed-off-by: Aaron Lehmann <aaron.lehmann@docker.com>
Service update failure thresholds and rollback
This adds support for two enhancements to swarm service rolling updates:

- Failure thresholds: In Docker 1.12, a service update could be set up
  to either pause or continue after a single failure occurs. This adds
  an --update-max-failure-ratio flag that controls how many tasks need to
  fail to update for the update as a whole to be considered a failure. A
  counterpart flag, --update-monitor, controls how long to monitor each
  task for a failure after starting it during the update.

- Rollback flag: service update --rollback reverts the service to its
  previous version. If a service update encounters task failures, or
  fails to function properly for some other reason, the user can roll back
  the update.

SwarmKit also has the ability to roll back updates automatically after
hitting the failure thresholds, but we've decided not to expose this in
the Docker API/CLI for now, favoring a workflow where the decision to
roll back is always made by an admin. Depending on user feedback, we may
add a "rollback" option to --update-failure-action in the future.

Signed-off-by: Aaron Lehmann <aaron.lehmann@docker.com>

@aaronlehmann aaronlehmann force-pushed the aaronlehmann:update-thresholds-rollbacks branch from 8b24fdc to 6d4b527 Oct 18, 2016

@aaronlehmann

This comment has been minimized.

Contributor

aaronlehmann commented Oct 18, 2016

Rebased.

@thaJeztah

LGTM, thanks!

@thaJeztah thaJeztah merged commit 3b0660d into moby:master Oct 18, 2016

4 checks passed

docker/dco-signed All commits signed
Details
experimental Jenkins build Docker-PRs-experimental 25064 has succeeded
Details
janky Jenkins build Docker-PRs 33661 has succeeded
Details
windowsRS1 Jenkins build Docker-PRs-WoW-RS1 4516 has succeeded
Details

@aaronlehmann aaronlehmann deleted the aaronlehmann:update-thresholds-rollbacks branch Oct 18, 2016

@ralphtheninja

This comment has been minimized.

Contributor

ralphtheninja commented Jan 12, 2017

How would a rollback look in terms of using the docker api?

thaJeztah added a commit to thaJeztah/docker that referenced this pull request Jan 16, 2017

Add version annotation to various flags added in 1.13
Pull request moby#27745 added support for the
client to talk to older versions of the daemon. Various flags were added to
docker 1.13 that are not compatible with older daemons.

This PR adds annotations to those flags, so that they are automatically hidden
if the daemon is older than docker 1.13 (API 1.25).

Not all new flags affect the API (some are client-side only). The following
PR's added new flags to docker 1.13 that affect the API;

- moby#23430 added `--cpu-rt-period`and `--cpu-rt-runtime`
- moby#27800 / moby#25317 added `--group` / `--group-add` / `--group-rm`
- moby#27702 added `--network` to `docker build`
- moby#25962 added `--attachable` to `docker network create`
- moby#27998 added `--compose-file` to `docker stack deploy`
- moby#22566 added `--stop-timeout` to `docker run` and `docker create`
- moby#26061 added `--init` to `docker run` and `docker create`
- moby#26941 added `--init-path` to `docker run` and `docker create`
- moby#27958 added `--cpus` on `docker run` / `docker create`
- moby#27567 added `--dns`, `--dns-opt`, and `--dns-search` to `docker service create`
- moby#27596 added `--force` to `docker service update`
- moby#27857 added `--hostname` to `docker service create`
- moby#28031 added `--hosts`, `--host-add` / `--host-rm` to `docker service create` and `docker service update`
- moby#28076 added `--tty` on `docker service create` / `docker service update`
- moby#26421 added `--update-max-failure-ratio`, `--update-monitor` and `--rollback` on `docker service update`
- moby#27369 added `--health-cmd`, `--health-interval`, `--health-retries`, `--health-timeout` and `--no-healthcheck` options to `docker service create` and `docker service update`

Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
@darklow

This comment has been minimized.

darklow commented Mar 6, 2017

@aaronlehmann Having rollback as option for --update-failure-action would be great. Actually I thought it is already how it is working now, so I kept running
docker service update --rollback=true my_service and wondering what I see no rollback parameter in docker service inspect.

Using --update-failure-action=rollback would help in following scenario, which I had recently. For example I have a service that is running nginx and I made a mistake in config file, so when I called docker service update ... new container failed to start, old was stopped and no one was listening at port 80 and immediately amazon ELB health check failed and took down whole cluster. While I realised what happened 1-2 minutes passed while I updated fix and amazon ELB re-added instances and resume services to work.

Having --update-failure-action=rollback would eliminate such case and because nginx failed it would roll back and there would be no 2 minutes downtime at our services.

@thaJeztah

This comment has been minimized.

Member

thaJeztah commented Mar 6, 2017

@darklow see #31108, does that cover your use case?

@darklow

This comment has been minimized.

darklow commented Mar 6, 2017

@thaJeztah very nice, that's a good one, wasn't aware of it yet, thanks! It would totally cover my case.

@darklow

This comment has been minimized.

darklow commented Mar 6, 2017

@thaJeztah one more question maybe you could help with. Rollback will definitely help. But I wonder why in my case/example old task was killed even if new one started unhealthy, nginx config had error, so main command failed to start and it still killed existing task and keep restarting new one and failing all the time.

Any existing settings would help solving this? It would make sense not to kill old service tasks until new ones are ready, I wonder why it didn't happened. Thank you.

@thaJeztah

This comment has been minimized.

Member

thaJeztah commented Mar 6, 2017

@darklow I think this may cover that #30261. Also the discussion on #30321 may be relevant

andrewhsu pushed a commit to docker/docker-ce that referenced this pull request May 19, 2017

Add version annotation to various flags added in 1.13
Pull request moby/moby#27745 added support for the
client to talk to older versions of the daemon. Various flags were added to
docker 1.13 that are not compatible with older daemons.

This PR adds annotations to those flags, so that they are automatically hidden
if the daemon is older than docker 1.13 (API 1.25).

Not all new flags affect the API (some are client-side only). The following
PR's added new flags to docker 1.13 that affect the API;

- moby/moby#23430 added `--cpu-rt-period`and `--cpu-rt-runtime`
- moby/moby#27800 / moby/moby#25317 added `--group` / `--group-add` / `--group-rm`
- moby/moby#27702 added `--network` to `docker build`
- moby/moby#25962 added `--attachable` to `docker network create`
- moby/moby#27998 added `--compose-file` to `docker stack deploy`
- moby/moby#22566 added `--stop-timeout` to `docker run` and `docker create`
- moby/moby#26061 added `--init` to `docker run` and `docker create`
- moby/moby#26941 added `--init-path` to `docker run` and `docker create`
- moby/moby#27958 added `--cpus` on `docker run` / `docker create`
- moby/moby#27567 added `--dns`, `--dns-opt`, and `--dns-search` to `docker service create`
- moby/moby#27596 added `--force` to `docker service update`
- moby/moby#27857 added `--hostname` to `docker service create`
- moby/moby#28031 added `--hosts`, `--host-add` / `--host-rm` to `docker service create` and `docker service update`
- moby/moby#28076 added `--tty` on `docker service create` / `docker service update`
- moby/moby#26421 added `--update-max-failure-ratio`, `--update-monitor` and `--rollback` on `docker service update`
- moby/moby#27369 added `--health-cmd`, `--health-interval`, `--health-retries`, `--health-timeout` and `--no-healthcheck` options to `docker service create` and `docker service update`

Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
Upstream-commit: 5d2722f83db9e301c6dcbe1c562c2051a52905db
Component: cli

srust added a commit to srust/moby that referenced this pull request Nov 30, 2017

Add version annotation to various flags added in 1.13
Pull request moby#27745 added support for the
client to talk to older versions of the daemon. Various flags were added to
docker 1.13 that are not compatible with older daemons.

This PR adds annotations to those flags, so that they are automatically hidden
if the daemon is older than docker 1.13 (API 1.25).

Not all new flags affect the API (some are client-side only). The following
PR's added new flags to docker 1.13 that affect the API;

- moby#23430 added `--cpu-rt-period`and `--cpu-rt-runtime`
- moby#27800 / moby#25317 added `--group` / `--group-add` / `--group-rm`
- moby#27702 added `--network` to `docker build`
- moby#25962 added `--attachable` to `docker network create`
- moby#27998 added `--compose-file` to `docker stack deploy`
- moby#22566 added `--stop-timeout` to `docker run` and `docker create`
- moby#26061 added `--init` to `docker run` and `docker create`
- moby#26941 added `--init-path` to `docker run` and `docker create`
- moby#27958 added `--cpus` on `docker run` / `docker create`
- moby#27567 added `--dns`, `--dns-opt`, and `--dns-search` to `docker service create`
- moby#27596 added `--force` to `docker service update`
- moby#27857 added `--hostname` to `docker service create`
- moby#28031 added `--hosts`, `--host-add` / `--host-rm` to `docker service create` and `docker service update`
- moby#28076 added `--tty` on `docker service create` / `docker service update`
- moby#26421 added `--update-max-failure-ratio`, `--update-monitor` and `--rollback` on `docker service update`
- moby#27369 added `--health-cmd`, `--health-interval`, `--health-retries`, `--health-timeout` and `--no-healthcheck` options to `docker service create` and `docker service update`

Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment