Nomad Restart of Unhealthy Containers #876

Closed
vrenjith opened this Issue Mar 3, 2016 · 13 comments

Contributor

vrenjith commented Mar 3, 2016

Assume that Nomad is configured with a check block when registering with Consul, so that Consul health-checks the containers.
In this scenario, if Consul reports that one of the containers is not healthy, will Nomad restart or reschedule that container?

Collaborator

diptanu commented Mar 3, 2016

@vrenjith Not yet, but this is on our roadmap.

mattcl commented Jul 5, 2016

Any updates on where in the roadmap this is?

sbvitok commented Nov 16, 2016

Any news?

Contributor

dadgar commented Nov 17, 2016

Hey, no update on this quite yet. We are refactoring the way we do Consul registrations in 0.5.X. This will make it easier to add new features like this.

dmzaytsev Mar 27, 2017

Hi, any news? 0.6.0 is coming, I guess.

Contributor

clstokes commented Apr 4, 2017

Related to #164.

alxark commented Jul 4, 2017

It would be great to add an option to restart some containers while leaving others in a failing state, and to add a timeout for the restart. For example, during a long-running operation a container should be unavailable for client connections but stay active, and be killed only after a deadline.

epetrovich commented Jul 12, 2017

Here is a workaround.
I have an autoscaled spot fleet in AWS running Nomad agents, and this feature was essential for me.
Our main application is written in Java, and in some cases the JVM does not exit on its own when it fails. It just stops responding to the health check on the HTTP port.

What I've done:

  1. I added an inline template to the main task so that a change in Consul KV triggers a restart.
    I used the hostname in the key. The allocation index, or anything else that identifies the allocation, would also work. Hostnames are fine for me because I use operator = "distinct_hosts".
  template {
    data = <<EOH
          last_restart:    {{ key_or_default (printf "apps/backend-rtb/backend-rtb-task/%s" (env "attr.unique.hostname")) "no_signal"  }}
          EOH
    destination   = "local/nomad_task_status"
    change_mode   = "restart"
  }  

Consul check for the main app

  service {
    tags = ["backend-rtb"]
    port = "backend"
    check {
        type     = "http"
        port     = "backend"
        path     = "/system/ping"
        interval = "2s"
        timeout  = "1s"
      }
  }
  2. I wrote a very simple watcher script and added it to the task group as a second task.
    It checks the status of the main job and writes a timestamp into Consul KV, which triggers the restart. It queries the Consul agent on localhost. Here is an example of the parameters I used.
    command = "python"
    args    = [ "local/watcher.py",
                "--consul_host", "127.0.0.1",
                "--node","${node.unique.name}",
                "--consul_kv_key","apps/backend-rtb/backend-rtb-task/${attr.unique.hostname}",
                "--service","backend-rtb-backend-rtb-group-backend-rtb-task",
                "--filter","passing",
                "--check_interval","15",
                "--start_after","30",
                "--fails","6"
              ]
  }

With Python and python-consul that was quite simple. Any custom restart logic is possible here.
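The watcher script itself is not shown in this thread. As a rough, hypothetical sketch of the logic described above (count consecutive failed checks, then write a timestamp that the template picks up), something like the following could work. The function name `watch` and its parameters are assumptions modelled on the `--fails`, `--check_interval` and `--start_after` arguments; the Consul calls are left as injected callables, which in a real script might be backed by python-consul (for example its `kv.put`).

```python
import time
from typing import Callable, Optional


def watch(
    is_passing: Callable[[], bool],          # hypothetical: asks Consul whether the service check passes
    trigger_restart: Callable[[str], None],  # hypothetical: writes the value to the Consul KV key
    fails: int = 6,
    check_interval: float = 15.0,
    start_after: float = 30.0,
    max_checks: Optional[int] = None,        # None means run forever
    sleep: Callable[[float], None] = time.sleep,
    now: Callable[[], float] = time.time,
) -> int:
    """Poll the health check and trigger a restart after `fails` consecutive failures.

    Returns the number of restarts triggered (only useful when max_checks is set).
    """
    sleep(start_after)  # give the main task time to come up
    consecutive = 0
    restarts = 0
    checks = 0
    while max_checks is None or checks < max_checks:
        checks += 1
        if is_passing():
            consecutive = 0
        else:
            consecutive += 1
            if consecutive >= fails:
                # A changed KV value is what makes the template restart the task.
                trigger_restart(str(int(now())))
                restarts += 1
                consecutive = 0
                sleep(start_after)  # wait for the restarted task before checking again
        sleep(check_interval)
    return restarts
```

Keeping the Consul access behind two callables makes the counting logic easy to test without a running agent; any other restart rule can be swapped in at the same place.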

anishmashankar Jul 22, 2017

+1
Any news on the road map for this feature?

Contributor

dadgar commented Aug 7, 2017

From @samart:

It would be nice if a failing check restarted a task or re-scheduled it elsewhere.

Consul provides a service discovery healthcheck, but when a service is unresponsive we'd like to restart it. Our mesos clusters do this with marathon keep-alive healthchecks and it works well to keep applications responsive.

We should be able to specify at least:

  - gracePeriod to wait before starting to healthcheck, after the task has started
  - number of check failures before a restart/re-schedule
  - interval between checks
  - timeout on the check attempt

Including the current nomad restart options would be nice as well. Max restart attempts, mode, etc.
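To make the four requested settings concrete, here is a small illustrative model in Python. It is not Nomad's implementation; `CheckRestartPolicy`, `first_restart_time` and the `probe` callable are hypothetical names invented for this sketch. Times are in seconds relative to task start, and a check that takes longer than the timeout is counted as a failure.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass(frozen=True)
class CheckRestartPolicy:
    grace_period: float  # wait this long after task start before the first check
    failure_limit: int   # consecutive failed checks before a restart/re-schedule
    interval: float      # time between checks
    timeout: float       # a check slower than this counts as failed


def first_restart_time(
    policy: CheckRestartPolicy,
    probe: Callable[[float], Tuple[bool, float]],  # probe(t) -> (ok, duration)
    max_checks: int,
) -> Optional[float]:
    """Return the time of the check that would trigger a restart, or None."""
    consecutive = 0
    for k in range(max_checks):
        t = policy.grace_period + k * policy.interval
        ok, duration = probe(t)
        failed = (not ok) or duration > policy.timeout
        consecutive = consecutive + 1 if failed else 0
        if consecutive >= policy.failure_limit:
            return t
    return None
```

For instance, with a grace period of 10, an interval of 5, a timeout of 1 and a limit of 3, a service that starts answering too slowly from t = 20 onward would be restarted at the third failed check, at t = 30.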

tino commented Aug 11, 2017

@dadgar Is there a way to manually resolve/restart an unhealthy allocation?

Because I currently have one marked unhealthy, while it is perfectly responsive (as Consul also shows). However, when I run plan to roll out an update, it only wants to update one of them and ignores the other: Task Group: "web" (1 create/destroy update, 1 ignore).

How can I get nomad to re-evaluate the allocation status?

Contributor

dadgar commented Aug 11, 2017

@tino Currently there is no way to restart a particular allocation. Further, I think the plan only shows that because you likely have count = 2 and max_parallel = 1. It will update one at a time but will replace all of them.
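As a simplified illustration of that batching (not Nomad's scheduler code; `rolling_batches` and the allocation names are made up for this sketch), splitting the allocations into groups of at most max_parallel shows why the first plan step touches only one of two allocations while the other is left for the next step:

```python
from typing import List, Sequence


def rolling_batches(allocs: Sequence[str], max_parallel: int) -> List[List[str]]:
    """Split allocations into update batches of at most max_parallel each."""
    if max_parallel < 1:
        raise ValueError("max_parallel must be at least 1")
    return [
        list(allocs[i:i + max_parallel])
        for i in range(0, len(allocs), max_parallel)
    ]


# count = 2, max_parallel = 1: two batches of one allocation each.
batches = rolling_batches(["web[0]", "web[1]"], max_parallel=1)
```

This ignores health gating between batches; it only shows the grouping.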

@schmichael schmichael referenced this issue Aug 26, 2017

Merged

Restart unhealthy tasks #3105

4 of 4 tasks complete

alxark commented Aug 28, 2017

This feature is critical! It would also be great to have restart limits for the whole cluster. That would cover the situation where a service is overloaded and can't handle all requests, but a mass restart would cause more problems, so services need to be restarted one by one.

@schmichael schmichael closed this in #3105 Sep 18, 2017
