Job status running despite of health checks failing #3875

sgnosti · 2018-02-15T09:51:16Z

Nomad version

Nomad v0.7.0-rc3

Operating system and Environment details

Linux ... 4.4.0-79-generic - Ubuntu 16.04.2 LTS

Issue

Job status shows running on all allocations even though one of the nodes is not responding.
On the Nomad Web UI, the status has been running for all allocations since the job started (no re-allocations needed so far). However, Consul shows one node as critical because its health check is failing. The Docker container running on that node is indeed not responding.

Job file

traefik.nomad

preetapan · 2018-02-16T12:01:03Z

@sgnosti I see you have the following check stanza defined:

    check {
      name     = "traefik healthcheck"
      type     = "http"
      port     = "admin"
      path     = "/ping"
      interval = "10s"
      timeout  = "1s"
    }

Is that check returning a 200 when this happens? That may be why Nomad's view of the allocation running on that node is that it's healthy.

sgnosti · 2018-02-16T13:26:58Z

Hi @preetapan, thanks for answering.
I guess the health check is timing out. The service is unavailable, but I don't know whether the Nomad agent logs successful or unsuccessful checks anywhere. The job logs don't provide any information either.
I thought my health check definition might be wrong, but Consul does show a warning because of the failing health check.

preetapan · 2018-02-16T22:05:35Z

@sgnosti While Nomad registers the checks for you, it does not expose the status of failed health checks through any API or CLI option. There is some debug-level logging about it. Health checks are a loosely coupled feature; their true status lives in Consul and is available from there. So this is working as designed.

A failing health check currently does not affect the running state of the allocation it belongs to. Consider using the check_restart stanza if you want Nomad to try restarting a task with a failing health check. Note that simply restarting the task on the same node may not fix the underlying issue if the check is failing because Consul is unavailable.
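
For illustration, here is a rough sketch of how the check shown earlier in this thread might look with a check_restart stanza added. The limit, grace, and ignore_warnings values are assumptions chosen for the example, not settings taken from the reporter's job file, so tune them to the service's actual startup time and tolerance for failures.

```hcl
check {
  name     = "traefik healthcheck"
  type     = "http"
  port     = "admin"
  path     = "/ping"
  interval = "10s"
  timeout  = "1s"

  # Assumed values for illustration only.
  check_restart {
    limit           = 3      # restart after 3 consecutive failing checks
    grace           = "30s"  # wait this long after (re)start before counting failures
    ignore_warnings = false  # treat a "warning" check status as a failure too
  }
}
```

With something like this in place, restarts triggered by the failing check are still governed by the task group's restart policy, so the allocation may eventually be marked failed if the restarts are exhausted.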

We are also planning to improve Nomad's behavior for running tasks when Consul is entirely unavailable: instead of attempting to register the check again in the background, Nomad will fail the task. This is coming in a future release.

github-actions · 2022-12-03T02:14:43Z

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

preetapan closed this as completed Feb 16, 2018

github-actions bot locked as resolved and limited conversation to collaborators Dec 3, 2022
