Run "critical" health checks immediately when node is restarted/reloaded #954

Open
onnimonni opened this issue May 18, 2015 · 12 comments
Labels
theme/health-checks Health Check functionality type/enhancement Proposed improvement or new feature

Comments

@onnimonni

When I start one node of the cluster, all of the checks start in the critical state. Some checks have really long intervals (30 minutes), and they stay critical until the first check runs. Is this by design?

Do you know of any way to run checks more often while they're failing, i.e. in the 'critical' state?

@onnimonni onnimonni changed the title Run "critical" checks immediately when node is restarted/reloaded Run "critical" health checks immediately when node is restarted/reloaded May 18, 2015
@armon
Member

armon commented May 18, 2015

@onnimonni We never considered the case of such long intervals. Most checks are defined with fairly short intervals. We purposely stagger the start of the checks to avoid a thundering herd, but in this case the stagger may be spread over a 30 minute period, which is rather long. Tagging as bug, so we can add a cap to the stagger period.

@armon armon added the type/bug Feature does not function as expected label May 18, 2015
@onnimonni
Author

I'm running a rather intensive security scanning check for WordPress using Consul. It's OK if the node is under pressure during startup because of this, but after that it's fine to run it only rarely.

Thanks for considering it :)

@stevendborrelli

We've been bitten by this too. We run some low-priority host configuration checks intermittently. Critical checks cause the node to go offline in DNS, which causes problems across the board.

@langston-barrett
Contributor

I wonder if this problem could be fixed by having a special case - the first round of health checks could be run immediately after the node is registered in Consul, and the timer can begin from there.

@langston-barrett
Contributor

@onnimonni Solution posted in issue #1085:
"[#962] actually should solve this case by allowing the ability to provide the check status when registering the check. The confusion is probably that we forgot to document it on the web docs, so we can probably just create a ticket for that instead. @siddharthist does that satisfy your use case?"
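
A minimal sketch of what the quoted suggestion could look like: a check definition that carries an explicit initial status, so the check does not sit in critical until its first run. The `status` key is an assumption based on the quoted description of #962, and the `id`, `script`, and `interval` keys mirror the check definitions used elsewhere in this thread; newer Consul versions may expect different key names, so verify against the docs for your version.

```python
import json

# Statuses a Consul check can report.
ALLOWED_STATUSES = {"passing", "warning", "critical"}


def check_definition(check_id, script, interval, initial_status="passing"):
    """Build a check definition dict with an explicit initial status.

    The "status" key is an assumption taken from the discussion of #962.
    """
    if initial_status not in ALLOWED_STATUSES:
        raise ValueError(f"unsupported initial status: {initial_status!r}")
    return {
        "check": {
            "id": check_id,
            "script": script,
            "interval": interval,
            "status": initial_status,
        }
    }


# Placeholder values; write the result to the agent's config directory.
print(json.dumps(check_definition("mem", "/bin/check_mem", "30m"), indent=2))
```

Note that this only changes the state the check starts in; the first real run is still subject to the stagger.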

@onnimonni
Author

The solution in #1085 is okay, but I would still like to run these tests ASAP on the first run, even if it causes a thundering herd.

Could we have it as a variable?

{
  "check": {
    "id": "mem",
    "script": "/bin/check_mem",
    "interval": "10s",
    "allow-thundering-heard": "yes-please"
  }
}

Just kidding, but for me it would be really useful to run some tests once at startup and then only once in a while after that. This could be used to prioritize checks as well.

@armon
Member

armon commented Jul 24, 2015

Hmm. What if we just bound the maximum stagger period to say 60 seconds? So even with a 30m check it will be run in a relatively timely manner. I prefer sane default behavior over having a million knobs to turn.
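
A rough sketch of the bounded stagger proposed here, for illustration only. It is not Consul's actual implementation; the function name `initial_delay` and the 60 second default are hypothetical and simply follow the numbers mentioned in this comment.

```python
import random


def initial_delay(interval_s, max_stagger_s=60.0, rng=random):
    """Pick a random first-run delay in [0, min(interval, cap)).

    Short-interval checks are still spread over their whole interval,
    while long-interval checks wait at most `max_stagger_s` seconds.
    """
    if interval_s <= 0 or max_stagger_s <= 0:
        raise ValueError("interval and cap must be positive")
    window = min(interval_s, max_stagger_s)
    # rng.random() returns a float in [0.0, 1.0).
    return rng.random() * window


# A 30 minute check starts within the first minute instead of the first 30.
print(initial_delay(30 * 60))
# A 10 second check is still staggered across its own 10 second interval.
print(initial_delay(10))
```

The randomness still avoids a thundering herd on restart; the cap only limits how long a slow check can stay in its initial state.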

@jvoorhis

jvoorhis commented Oct 2, 2015

👍 for this ticket. I would like faster feedback while writing/debugging hourly checks. Alternatively, it would be great to have an API or CLI command to force re-execution of a check.

@qingatqlik

Is it possible to allow an optional parameter that specifies the maximum time within which the first check should run? That way you could set the interval to 30 minutes but specify that the first check should run within, for instance, 1 second of registering the service. This is much like scheduling a task at a fixed rate in Java, where you can specify an initial delay and a recurring period. If the initial delay is set to 0, the first health check should run immediately after registration.
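
To make the requested semantics concrete, here is a small sketch of a schedule with a separate initial delay and recurring interval. The parameter name `initial_delay_s` is hypothetical; no such option is confirmed to exist in Consul.

```python
import itertools


def run_times(interval_s, initial_delay_s=0.0):
    """Yield offsets, in seconds after registration, at which the check runs.

    The first run happens after `initial_delay_s`; every later run follows
    the previous one by `interval_s`.
    """
    if interval_s <= 0 or initial_delay_s < 0:
        raise ValueError("interval must be positive and delay non-negative")
    t = initial_delay_s
    while True:
        yield t
        t += interval_s


# 30 minute interval, first run 1 second after registration.
print(list(itertools.islice(run_times(1800, initial_delay_s=1), 3)))
# → [1, 1801, 3601]
```

With `initial_delay_s=0` the first run lands at registration time, which matches the "run immediately" case described above.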

@slackpad slackpad added type/enhancement Proposed improvement or new feature and removed type/bug Feature does not function as expected labels Nov 22, 2016
@slackpad slackpad added this to the 0.7.4 milestone Nov 22, 2016
@slackpad slackpad removed this from the Triaged milestone Apr 18, 2017
@slackpad slackpad added the theme/health-checks Health Check functionality label May 25, 2017
@c-robinson

I would also really appreciate a solution to this problem. We have a similar use case where we'd like to be able to force a host to run its checks ASAP. The goal for us is to be able to prove the status of a subset of our fleet before making a change to a member of it. For example, we monitor the health of a pair of network gateways; before taking one out of service, we'd like to ensure that there are no problems on the other.

@asteven

asteven commented Jun 19, 2018

I would also really appreciate a solution to this problem. We currently have checks that run once a day :-(.
The solution outlined by @armon some 3 years ago seems reasonable, no?

maximum stagger period to say 60 seconds

Or make it configurable. Default to what it is now, but allow me to change it to, say, 60 seconds or whatever in the Consul agent's config.

Alternatively, if the agent's check HTTP API were writable, I could at least work around the issue by running my checks manually and patching the check status.

Something like:

status="$(run-check-manually my-check-id)"
curl \
    --request PATCH \
    --data "$(printf '{"Status": "%s"}' "$status")" \
    https://consul.rocks/v1/agent/check/update/my-check-id

Unfortunately the above currently does not work as the /check/update/ endpoint only works for checks of type TTL :-(
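
Within that limitation, one possible workaround is to register the check as a TTL check and push results from an external runner. The sketch below only builds the request and does not send it. It assumes a local agent on the default port 8500 and that the update endpoint accepts a `PUT` with `Status` and `Output` fields; the check ID and helper name are placeholders, so verify the details against the agent API docs for your Consul version.

```python
import json
import urllib.request
from urllib.parse import quote


def build_ttl_update(check_id, status, output="",
                     base_url="http://127.0.0.1:8500"):
    """Build (but do not send) a status update for a TTL check.

    Assumes the check was registered with a TTL rather than a script.
    """
    body = json.dumps({"Status": status, "Output": output}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/v1/agent/check/update/{quote(check_id, safe='')}",
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


req = build_ttl_update("my-check-id", "passing", "ran manually")
print(req.get_method(), req.full_url)
# To actually send it: urllib.request.urlopen(req)
```

The trade-off is that the agent no longer runs the check itself, so the external runner has to report before the TTL expires or the check turns critical again.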

@zorro786

zorro786 commented Dec 10, 2020

@onnimonni We purposely stagger start the checks to avoid a thundering herd

@armon For normal shorter-interval checks, like every 30 seconds, what is the bound on the initial delay after staggering?


10 participants