COMMAND Health Check does not resume after failure? #2179

Closed
jolexa opened this issue Sep 3, 2015 · 4 comments

Comments

@jolexa
Contributor

jolexa commented Sep 3, 2015

I have a command health check like this:

"healthChecks": [
    {
      "protocol": "COMMAND",
      "command": { "value": "date;curl --max-time 21 http://$HOST:$PORT/health | grep green" },
      "intervalSeconds": 10,
      "timeoutSeconds": 20,
      "maxConsecutiveFailures": 2
    }
]

I'm curious why the health check is not restarted after the 21 seconds have elapsed. As a result, my app is unhealthy but nothing restarts it: the health check is no longer running, so the maxConsecutiveFailures threshold can never be reached. Before I added the --max-time flag, the command would run forever once I forced a network partition between the app server and its dependency, because the health endpoint just hangs. Any insight or thoughts?

The Mesos stderr log looks like this:

Thu Aug 27 16:19:22 CDT 2015
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed

0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0
0 0 0 0 0 0 0 0 --:--:-- 0:00:01 --:--:-- 0
0 0 0 0 0 0 0 0 --:--:-- 0:00:02 --:--:-- 0
0 0 0 0 0 0 0 0 --:--:-- 0:00:03 --:--:-- 0
0 0 0 0 0 0 0 0 --:--:-- 0:00:04 --:--:-- 0
0 0 0 0 0 0 0 0 --:--:-- 0:00:05 --:--:-- 0
0 0 0 0 0 0 0 0 --:--:-- 0:00:06 --:--:-- 0
0 0 0 0 0 0 0 0 --:--:-- 0:00:07 --:--:-- 0
0 0 0 0 0 0 0 0 --:--:-- 0:00:08 --:--:-- 0
0 0 0 0 0 0 0 0 --:--:-- 0:00:09 --:--:-- 0
0 0 0 0 0 0 0 0 --:--:-- 0:00:10 --:--:-- 0
0 0 0 0 0 0 0 0 --:--:-- 0:00:11 --:--:-- 0
0 0 0 0 0 0 0 0 --:--:-- 0:00:12 --:--:-- 0
0 0 0 0 0 0 0 0 --:--:-- 0:00:13 --:--:-- 0
0 0 0 0 0 0 0 0 --:--:-- 0:00:14 --:--:-- 0
0 0 0 0 0 0 0 0 --:--:-- 0:00:15 --:--:-- 0
0 0 0 0 0 0 0 0 --:--:-- 0:00:16 --:--:-- 0
0 0 0 0 0 0 0 0 --:--:-- 0:00:17 --:--:-- 0
0 0 0 0 0 0 0 0 --:--:-- 0:00:18 --:--:-- 0
0 0 0 0 0 0 0 0 --:--:-- 0:00:19 --:--:-- 0
W0827 16:19:42.832844 11540 main.cpp:375] Health check failed Command check failed with reason: status still pending after timeout 20secs

0 0 0 0 0 0 0 0 --:--:-- 0:00:20 --:--:-- 0
0 0 0 0 0 0 0 0 --:--:-- 0:00:21 --:--:-- 0
curl: (28) Operation timed out after 21001 milliseconds with 0 bytes received
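
For comparison, the behavior this report expects can be sketched roughly as follows: a check that exceeds its timeout is counted as one failure, checking continues, and the task is considered unhealthy once the consecutive-failure threshold is reached. This is only an illustration of the expectation, not how the Mesos health checker is implemented. The function names `check_passes` and `is_unhealthy` are hypothetical, the wait between checks (`intervalSeconds`) is left out, and a sleeping child process stands in for the hanging health endpoint.

```python
import subprocess
import sys


def check_passes(argv, timeout_seconds):
    """Run one check; a non-zero exit code or a timeout both count as a failure."""
    try:
        completed = subprocess.run(
            argv,
            timeout=timeout_seconds,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child on timeout; checking simply continues.
        return False
    return completed.returncode == 0


def is_unhealthy(argv, timeout_seconds, max_consecutive_failures, attempts):
    """Return True once max_consecutive_failures checks fail in a row."""
    failures = 0
    for _ in range(attempts):
        if check_passes(argv, timeout_seconds):
            failures = 0  # a success resets the streak
        else:
            failures += 1
            if failures >= max_consecutive_failures:
                return True
    return False


# Stand-in for a hanging health endpoint: a child process that sleeps for 5 seconds.
hanging = [sys.executable, "-c", "import time; time.sleep(5)"]
print(is_unhealthy(hanging, timeout_seconds=0.2, max_consecutive_failures=2, attempts=3))
```

With the hanging stand-in, both of the first two checks time out, so the threshold of 2 is reached and the sketch reports the task as unhealthy instead of stalling.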

@nfnt
Contributor

nfnt commented Sep 18, 2015

The health checker expects the command to finish within timeoutSeconds and checks the command's return code to decide whether it succeeded or failed. A command that has not finished within that time is unexpected behavior and stops the health checker.
Try setting the --max-time parameter in the curl command to a value that is smaller than your timeoutSeconds parameter.
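
As a rough illustration of this advice, the sketch below checks a health check definition like the one in the report and flags a curl `--max-time` that is not smaller than `timeoutSeconds`. It is not part of Marathon or Mesos: the helper name `curl_max_time_ok` and the regex-based parsing are assumptions made for this example, and the replacement value of 15 seconds is an arbitrary choice below the 20-second timeout.

```python
import re


def curl_max_time_ok(health_check):
    """Return True if curl's --max-time is strictly below timeoutSeconds.

    Hypothetical helper for illustration; it only understands the
    `--max-time <seconds>` form used in this issue.
    """
    command = health_check["command"]["value"]
    match = re.search(r"--max-time\s+(\d+(?:\.\d+)?)", command)
    if match is None:
        # No explicit limit: curl may keep running past timeoutSeconds.
        return False
    return float(match.group(1)) < health_check["timeoutSeconds"]


# Definition from the report: --max-time 21 exceeds timeoutSeconds 20.
reported = {
    "protocol": "COMMAND",
    "command": {"value": "date;curl --max-time 21 http://$HOST:$PORT/health | grep green"},
    "intervalSeconds": 10,
    "timeoutSeconds": 20,
    "maxConsecutiveFailures": 2,
}

# Same definition with the curl limit lowered below timeoutSeconds.
adjusted = dict(
    reported,
    command={"value": reported["command"]["value"].replace("--max-time 21", "--max-time 15")},
)

print(curl_max_time_ok(reported))  # False
print(curl_max_time_ok(adjusted))  # True
```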

@jolexa
Contributor Author

jolexa commented Sep 20, 2015 via email

@aquamatthias
Contributor

@jolexa I agree with your assumption. I consider this a Mesos bug and have created a ticket for it: https://issues.apache.org/jira/browse/MESOS-3479
Please vote if you think this should be fixed.

@aquamatthias
Contributor

Closing this ticket, since it cannot be solved in Marathon. Watch https://issues.apache.org/jira/browse/MESOS-3479 for progress.

@mesosphere mesosphere locked and limited conversation to collaborators Mar 27, 2017