Job status running despite of health checks failing #3875
Comments
@sgnosti I see you have the following
Is that check returning a 200 when this happens? That may be why Nomad's view of the allocation running on that node is that it's healthy.
Hi @preetapan, thanks for answering.
@sgnosti While Nomad registers the checks for you, it does not provide information about the failed health check status via any API or CLI options. There is some debug-level logging about it. Health checks are a loosely coupled feature: their true status lives in Consul and is available from there, so this is working as designed. A failing health check currently does not affect the running state of an allocation. Consider using the `check_restart` stanza if you want Nomad to try restarting a task with a failing health check. Note that simply restarting the task on the same node may not fix the underlying issue if the cause is Consul being unavailable. We are also planning to improve Nomad's behavior for running tasks when Consul is unavailable entirely: instead of attempting to register the check again in the background, Nomad will fail the task. This is coming in a future release.
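For readers who want to try the `check_restart` suggestion, here is a minimal sketch of a `service` stanza, not taken from the job file in this issue. The service name, port label, health path, and all timing values are assumptions chosen for illustration, and the fragment belongs inside a `task` stanza of a complete job.

```hcl
# Fragment only: place inside a task stanza of your job file.
# The name, port label, path, and timings below are hypothetical.
service {
  name = "traefik"
  port = "http"

  check {
    type     = "http"
    path     = "/health"
    interval = "10s"
    timeout  = "2s"

    # Ask Nomad to restart the task once the check has failed
    # `limit` times in a row.
    check_restart {
      limit           = 3      # consecutive failures before a restart
      grace           = "90s"  # wait after (re)start before counting failures
      ignore_warnings = false  # treat "warning" status as unhealthy
    }
  }
}
```

Restarts triggered this way stay on the same node, so as noted above they may not help if the root cause is Consul itself being unreachable.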
Nomad version
Nomad v0.7.0-rc3
Operating system and Environment details
Linux ... 4.4.0-79-generic - Ubuntu 16.04.2 LTS
Issue
Job status shows running on all allocations even though one of the nodes is not responding.
On the Nomad Web UI, the status has been running for all allocations since the job started (no re-allocations needed so far). However, Consul does show one critical node on which the health check is failing. The Docker container running on that node is indeed not responding.
Job file
traefik.nomad