More complex restart rules #9

IanCal · 2014-05-23T09:05:12Z

(apologies for the mostly copy/paste from the mailing list)

At the moment, there is a useful but simple approach which is to restart containers when they die.

On a very simple level, the two obvious approaches are:

Let single containers die alone
If one container dies, kill all

However, it would then be useful to add slightly different rules for stopping cleanly versus dying unexpectedly:

Let containers stop alone, kill all if one dies
If one container dies, die alone
If one container stops, kill all

Then there's an additional rule, which would simply be to restart containers if they die.

Then we may have different rules for different containers: you may want to stop everything if your computation finishes, but restart a database if it crashes; or restart a stateless webserver, but panic and restart everything if the load balancer crashes.
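One way to picture per-container rules is as a table mapping how a container ended (clean stop or crash) to an action. This is only a minimal sketch: the names `Exit`, `Action`, `POLICIES`, `decide` and the container names are hypothetical and not part of any existing manifest format, and cases the examples above don't mention fall back to a default.

```python
from enum import Enum


class Exit(Enum):
    STOPPED = "stopped"  # container exited cleanly
    DIED = "died"        # container exited unexpectedly


class Action(Enum):
    IGNORE = "ignore"            # let the container end alone
    RESTART = "restart"          # restart only this container
    RESTART_ALL = "restart_all"  # kill and restart everything
    STOP_ALL = "stop_all"        # kill everything and stay down


# Hypothetical rules mirroring the examples in the paragraph above.
POLICIES = {
    "computation": {Exit.STOPPED: Action.STOP_ALL},
    "database": {Exit.DIED: Action.RESTART},
    "webserver": {Exit.DIED: Action.RESTART},
    "load_balancer": {Exit.DIED: Action.RESTART_ALL},
}


def decide(container, exit_kind, default=Action.IGNORE):
    """Look up the action for a container ending in the given way."""
    return POLICIES.get(container, {}).get(exit_kind, default)
```

For instance, `decide("load_balancer", Exit.DIED)` gives `Action.RESTART_ALL`, while a clean database stop falls through to the default.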

Beyond this are different strategies for restarting: try restarting one of your webservers 5 times, and if that doesn't work kill all of them and start again, but only 4 times in 60 seconds before terminating. If the database crashes, though, kill all of the webservers, start the DB again, and then fire the webservers up again.
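The "only N times in T seconds" part could be tracked with a sliding window of restart timestamps. A rough sketch, assuming the caller passes in the current time in seconds; `RestartLimiter` and `allow` are made-up names for illustration:

```python
from collections import deque


class RestartLimiter:
    """Allow at most max_restarts restarts within any max_time-second window."""

    def __init__(self, max_restarts, max_time):
        self.max_restarts = max_restarts
        self.max_time = max_time
        self._stamps = deque()  # times of recent restarts, oldest first

    def allow(self, now):
        """Record a restart at time `now` if the limit permits it."""
        # Forget restarts that have fallen out of the window.
        while self._stamps and now - self._stamps[0] >= self.max_time:
            self._stamps.popleft()
        if len(self._stamps) >= self.max_restarts:
            return False  # limit reached: caller should give up or escalate
        self._stamps.append(now)
        return True
```

With `RestartLimiter(4, 60)`, four restarts inside a minute are allowed and a fifth is refused until the oldest one ages out of the window.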

My examples may not match the behaviour you'd want, but I'm sure we can all think of cases where we'd want to start/stop/restart things differently depending on how things die.

I think it would be great if this was extended to consider trees of supervisors and restart strategies as used heavily in Erlang.

Some background on supervisor trees:

A gentle introduction: http://learnyousomeerlang.com/supervisors
Official guide: http://www.erlang.org/doc/design_principles/sup_princ.html
Actual docs: http://www.erlang.org/doc/man/supervisor.html

A quick example of what this could look like in the manifest file (may contain syntax errors):

group:
    - name: root
      restart_strategy: one_for_all
      group:
        - name: webserver
          restart_strategy: one_for_one
          max_restarts: 4
          max_time: 60
          containers:
            - name: web1
              ...
        - name: api
          restart_strategy: one_for_one
          max_restarts: 4
          max_time: 60
          containers:
            - name: api1
              ...

If your web serving containers die, try restarting them, but no more often than 4 times in a minute. If that limit is exceeded, kill the group and let the layer above deal with the problem. The layer above then kills all containers and restarts them.
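The escalation described here could be sketched roughly as follows. This is an illustration only, not how any existing tool behaves: `Group` and `on_failure` are hypothetical names, timestamps are passed in by the caller, and a group with no limit (like `root` in the example) always handles the failure itself.

```python
class Group:
    """A group that handles failures itself until its restart limit is hit."""

    def __init__(self, name, max_restarts=None, max_time=None, parent=None):
        self.name = name
        self.max_restarts = max_restarts  # None means no limit
        self.max_time = max_time          # window length in seconds
        self.parent = parent
        self.restarts = []                # times of restarts handled here

    def on_failure(self, now):
        """Return the name of the group that ends up handling the failure."""
        if self.max_restarts is not None:
            recent = [t for t in self.restarts if now - t < self.max_time]
            if len(recent) >= self.max_restarts:
                # Limit exceeded: give up and let the layer above decide.
                self.restarts = []  # a restarted group starts with a clean slate
                if self.parent is None:
                    return None  # the root gave up; what to do next is open
                return self.parent.on_failure(now)
            self.restarts = recent
        self.restarts.append(now)
        return self.name
```

With `root = Group("root")` and `web = Group("webserver", 4, 60, parent=root)`, the first four failures within a minute are handled by `webserver` and the fifth is passed up to `root`.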

There are probably several different stages to this, each valuable:

Some basic restart strategies, "restart" or "don't restart"
Groups, with a few extra strategies (one_for_one, one_for_all, rest_for_one)
Trees of these groups
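For anyone not familiar with the Erlang strategy names (`one_for_one`, `one_for_all`, `rest_for_one`), here is a small sketch of which containers in a group each one would restart when a single container dies. The function name and the container names are made up, and the list is assumed to be in start order.

```python
def containers_to_restart(strategy, containers, failed):
    """Return the containers to restart, in start order, when `failed` dies."""
    index = containers.index(failed)
    if strategy == "one_for_one":
        # Only the container that died is restarted.
        return [failed]
    if strategy == "one_for_all":
        # Every container in the group is restarted.
        return list(containers)
    if strategy == "rest_for_one":
        # The failed container and everything started after it.
        return containers[index:]
    raise ValueError(f"unknown strategy: {strategy}")
```

`rest_for_one` is the useful one when later containers depend on earlier ones, e.g. webservers started after a database.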

One bit I'm not sure about is what to do when the root dies. Personally I'd like the instance to be stopped given my use cases but maybe others would rather keep it up to inspect?


thockin · 2014-06-19T23:36:23Z

There are some interesting thoughts here. I'd like to keep this open, and consider it as we shift focus to Kubernetes.

bgrant0607 · 2015-09-17T18:55:40Z

cc @erictune @mikedanese

bgrant0607 mentioned this issue Jun 16, 2014

Configurable restart behavior kubernetes/kubernetes#127

Closed

mikedanese mentioned this issue Sep 17, 2015

Request for feature: provide mechanism for one of many containers to cause a pod to exit kubernetes/kubernetes#13847

Closed
