This repository has been archived by the owner on Apr 24, 2021. It is now read-only.

More complex restart rules #9

Open
IanCal opened this issue May 23, 2014 · 2 comments

IanCal commented May 23, 2014

(apologies for the mostly copy/paste from the mailing list)

At the moment there is a useful but simple approach: restart containers when they die.

On a very simple level, the two obvious approaches are:

  • Let single containers die alone
  • If one container dies, kill all

However, it would then be useful to add slightly different rules for stopping cleanly versus dying unexpectedly:

  • Let containers stop alone, kill all if one dies
  • If one container dies, die alone
  • If one container stops, kill all

Then there's an additional rule, which would simply be to restart containers if they die.

Then we may have different rules for different containers: you may want to stop everything if your computation finishes, but restart a database if it crashes; or restart a stateless webserver on its own, but panic and restart everything if the load balancer crashes.
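To make the per-container idea concrete, here is a minimal Python sketch of a rule table keyed by container name and by how the container exited. Everything in it is hypothetical: the names `Exit`, `Action`, `RULES` and `action_for` are made up for illustration, and treating exit code 0 as a clean stop is an assumption rather than anything the agent does today.

```python
from enum import Enum


class Exit(Enum):
    STOPPED = "stopped"  # assumed: exit code 0 means a clean stop
    DIED = "died"        # assumed: any other exit code is a crash


class Action(Enum):
    IGNORE = "ignore"            # let the container stop or die alone
    RESTART = "restart"          # restart just this container
    KILL_ALL = "kill_all"        # stop every container
    RESTART_ALL = "restart_all"  # kill and restart every container


# Hypothetical rules mirroring the examples in the paragraph above.
RULES = {
    "computation": {Exit.STOPPED: Action.KILL_ALL},
    "database": {Exit.DIED: Action.RESTART},
    "webserver": {Exit.DIED: Action.RESTART},
    "load_balancer": {Exit.DIED: Action.RESTART_ALL},
}


def classify(exit_code: int) -> Exit:
    """Map an exit code onto 'stopped cleanly' or 'died unexpectedly'."""
    return Exit.STOPPED if exit_code == 0 else Exit.DIED


def action_for(container: str, exit_code: int) -> Action:
    """Look up the rule for this container; default to leaving it alone."""
    return RULES.get(container, {}).get(classify(exit_code), Action.IGNORE)
```

With this table, `action_for("database", 137)` yields `Action.RESTART`, while a container with no matching rule falls back to `Action.IGNORE`, which corresponds to the "let single containers die alone" default.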

Beyond this are different strategies for restarting. For example: try restarting one of your webservers up to 5 times; if that doesn't work, kill all of them and start again, but only 4 times in 60 seconds before terminating. If the database crashes, however, kill all of the webservers, start the DB again, and then fire up the webservers again.

My examples may not match the behaviour you'd want, but I'm sure we can all think of cases where we'd want to start/stop/restart things differently depending on how things die.

I think it would be great if this was extended to consider trees of supervisors and restart strategies as used heavily in Erlang.

Some background on supervisor trees:

A quick example of what this could look like in the manifest file (there may be syntax errors):

group:
    name: root
    restart_strategy: one_for_all
    groups:
        - name: webserver
          restart_strategy: one_for_one
          max_restarts: 4
          max_time: 60  # seconds
          containers:
            - name: web1
              ...
        - name: api
          restart_strategy: one_for_one
          max_restarts: 4
          max_time: 60  # seconds
          containers:
            - name: api1
              ...

If your web serving containers die, try restarting them, but no more often than 4 times in a minute. If that limit is exceeded, kill the group and let the layer above deal with the problem. The layer above then kills all containers and restarts them.
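The "no more than 4 restarts in a minute" rule could be tracked with a sliding window, roughly as in the sketch below. The class name `RestartWindow` and its method are hypothetical, and the caller passes the current time in explicitly so the behaviour is easy to follow; a real agent would presumably use a monotonic clock.

```python
from collections import deque


class RestartWindow:
    """Allow at most max_restarts restarts within any max_time seconds."""

    def __init__(self, max_restarts: int, max_time: float):
        self.max_restarts = max_restarts
        self.max_time = max_time
        self._restarts = deque()  # timestamps of recent restarts, oldest first

    def allow_restart(self, now: float) -> bool:
        # Forget restarts that have fallen out of the time window.
        while self._restarts and now - self._restarts[0] > self.max_time:
            self._restarts.popleft()
        # Limit reached: the group should give up and escalate to its parent.
        if len(self._restarts) >= self.max_restarts:
            return False
        self._restarts.append(now)
        return True
```

When `allow_restart` returns `False`, the group would stop restarting its own containers and report failure upwards, at which point the parent applies its own strategy (`one_for_all` for the root in the example manifest).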

There are probably several different stages to this, each valuable:

  • Some basic restart strategies, "restart" or "don't restart"
  • Groups, with a few extra strategies (one_for_one, one_for_all, rest_for_one)
  • Trees of these groups
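For the group strategies, the main difference is which containers get restarted when one of them fails. The sketch below follows the Erlang/OTP supervisor names (`one_for_one`, `one_for_all`, `rest_for_one`); the function `containers_to_restart` is hypothetical, and it assumes the order of the list is the order in which the containers were started.

```python
from typing import List


def containers_to_restart(strategy: str, children: List[str], failed: str) -> List[str]:
    """Return the containers a group should restart after `failed` dies.

    `children` is assumed to be in start order.
    """
    index = children.index(failed)  # raises ValueError for an unknown container
    if strategy == "one_for_one":
        # Only the failed container is restarted.
        return [failed]
    if strategy == "one_for_all":
        # Every container in the group is restarted.
        return list(children)
    if strategy == "rest_for_one":
        # The failed container and everything started after it are restarted.
        return children[index:]
    raise ValueError("unknown restart strategy: " + strategy)
```

For a group started as `["db", "api", "web"]`, a crash of `api` restarts only `api` under `one_for_one`, all three under `one_for_all`, and `api` plus `web` under `rest_for_one`.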

One bit I'm not sure about is what to do when the root dies. Personally I'd like the instance to be stopped given my use cases but maybe others would rather keep it up to inspect?

@thockin
Contributor

thockin commented Jun 19, 2014

There are some interesting thoughts here. I'd like to keep this open, and consider it as we shift focus to Kubernetes.

@bgrant0607

cc @erictune @mikedanese
