
Job status running despite health checks failing #3875

Closed
sgnosti opened this issue Feb 15, 2018 · 4 comments

sgnosti commented Feb 15, 2018

Nomad version

Nomad v0.7.0-rc3

Operating system and Environment details

Linux ... 4.4.0-79-generic - Ubuntu 16.04.2 LTS

Issue

The job status shows running for all allocations even though one of the nodes is not responding.
In the Nomad Web UI, the status has been running for all allocations since the job started (no re-allocations so far). However, Consul shows one node as critical because its health check is failing, and the Docker container running on that node is indeed not responding.

Job file

traefik.nomad

preetapan (Member) commented

@sgnosti I see you have the following check stanza defined:

  check {
    name     = "traefik healthcheck"
    type     = "http"
    port     = "admin"
    path     = "/ping"
    interval = "10s"
    timeout  = "1s"
  }

Is that check returning a 200 when this happens? That may be why Nomad's view of the allocation running on that node is that it's healthy.


sgnosti commented Feb 16, 2018

Hi @preetapan, thanks for answering.
I assume the health check is timing out. The service is unavailable, but I don't know whether the Nomad agent logs successful/unsuccessful checks anywhere; the job logs don't provide any information either.
I thought my health check definition might be wrong, but Consul does show a warning because of the failing health check.

preetapan (Member) commented

@sgnosti While Nomad registers the checks for you, it does not expose the status of failing health checks via any API or CLI options; there is some debug-level logging about them. Health checks are a loosely coupled feature: their true status lives in Consul and is available from there. So this is working as intended.

A failing health check currently does not affect the running state of an allocation. Consider using the check_restart stanza if you want Nomad to restart a task whose health check is failing; a sketch follows below. Note that simply restarting the task on the same node may not fix the underlying issue if it is caused by Consul being unavailable.
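A minimal sketch of what that could look like, combined with the check stanza from the job file above; the limit and grace values here are illustrative assumptions, not taken from the original traefik.nomad:

  check {
    name     = "traefik healthcheck"
    type     = "http"
    port     = "admin"
    path     = "/ping"
    interval = "10s"
    timeout  = "1s"

    check_restart {
      # Illustrative values, not from the original job file.
      limit           = 3      # restart the task after 3 consecutive failed checks
      grace           = "30s"  # allow 30s after task (re)start before counting failures
      ignore_warnings = false  # checks in a "warning" state also count as unhealthy
    }
  }

With a 10s interval and a limit of 3, Nomad would restart the task roughly 30 seconds after the check starts failing; how many restarts are attempted, and whether the task is eventually rescheduled, is governed by the group's restart policy.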

We are also planning to improve Nomad's behavior for running tasks when Consul is entirely unavailable: instead of attempting to re-register the check in the background, Nomad will fail the task. This is coming in a future release.


github-actions bot commented Dec 3, 2022

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

github-actions bot locked as resolved and limited conversation to collaborators Dec 3, 2022