Run "critical" health checks immediately when node is restarted/reloaded #954

onnimonni · 2015-05-18T21:10:54Z

When I start one node of the cluster, all of the checks start in the critical state. Some checks have really long intervals (30 minutes), and they stay critical until the first check runs. Is this by design?

Do you know of any way to run checks more often while they are failing, i.e. in the 'critical' state?

armon · 2015-05-18T22:43:17Z

@onnimonni We never considered the case of such long intervals. Most checks are defined on fairly short intervals. We purposely stagger the start of the checks to avoid a thundering herd, but in this case the stagger may be spread over a 30 minute period, which is rather long. Tagging as bug, so we can add a cap to the stagger period.

onnimonni · 2015-05-18T23:39:03Z

I'm running a rather intensive security scanning check for WordPress using Consul. It's OK if the node is under pressure during startup because of this, but after that it's fine to run it only seldom.

Thanks for considering it :)


stevendborrelli · 2015-06-29T18:52:23Z

We've been bitten by this too. We run some low-priority host configuration checks intermittently. Critical checks cause the node to go offline in DNS, which causes problems across the board.

langston-barrett · 2015-06-29T20:44:18Z

I wonder if this problem could be fixed by having a special case - the first round of health checks could be run immediately after the node is registered in Consul, and the timer can begin from there.

langston-barrett · 2015-07-23T21:53:08Z

@onnimonni Solution posted in issue #1085:
"[#962] actually should solve this case by allowing the ability to provide the check status when registering the check. The confusion is probably that we forgot to document it on the web docs, so we can probably just create a ticket for that instead. @siddharthist does that satisfy your use case?"
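The initial-status idea in the quote above could look roughly like the following registration payload. This is only a sketch: the field names (`ID`, `Name`, `Args`, `Interval`, `Status`) are my assumption of what the agent's check registration accepts and should be verified against the docs for your Consul version (older versions used a script field instead of `Args`). The check ID and the `/bin/check_wp` command are hypothetical.

```python
import json


def check_registration(check_id, name, args, interval, initial_status="passing"):
    """Build a check registration payload that starts in `initial_status`.

    Field names are assumptions to verify against the agent HTTP API docs.
    """
    return {
        "ID": check_id,
        "Name": name,
        "Args": list(args),        # command to execute (hypothetical here)
        "Interval": interval,      # e.g. "30m"
        "Status": initial_status,  # status reported until the first real run
    }


if __name__ == "__main__":
    payload = check_registration(
        "wp-scan", "WordPress security scan", ["/bin/check_wp"], "30m"
    )
    print(json.dumps(payload, indent=2))
```

Note that this only changes what is reported before the first run; it does not make the first run happen any sooner.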

onnimonni · 2015-07-24T11:50:55Z

The solution in #1085 is okay, but I would still like to run these tests ASAP on the first run, even if it causes a thundering herd.

Could we have it as a variable?

{
  "check": {
    "id": "mem",
    "script": "/bin/check_mem",
    "interval": "10s",
    "allow-thundering-herd": "yes-please"
  }
}

Just kidding, but for me it would be really useful to run some tests once at startup and then only once in a while after that. This could be used to prioritize checks as well.

armon · 2015-07-24T17:42:45Z

Hmm. What if we just bound the maximum stagger period to say 60 seconds? So even with a 30m check it will be run in a relatively timely manner. I prefer sane default behavior over having a million knobs to turn.
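To make the proposal concrete, here is a minimal sketch of a capped stagger. It is not Consul's actual implementation; it assumes the stagger is a uniformly random delay over the check interval, and `initial_delay` and `max_stagger_s` are hypothetical names.

```python
import random
from typing import Optional


def initial_delay(
    interval_s: float,
    max_stagger_s: Optional[float] = None,
    rng: Optional[random.Random] = None,
) -> float:
    """Pick a random delay (in seconds) before the first run of a check.

    Without a cap the delay is spread over the whole interval; with a cap
    it is spread over min(interval, cap).
    """
    rng = rng or random.Random()
    window = interval_s if max_stagger_s is None else min(interval_s, max_stagger_s)
    return rng.uniform(0.0, window)


if __name__ == "__main__":
    # A 30 minute check: uncapped vs. capped at 60 seconds.
    print("uncapped:", initial_delay(30 * 60))
    print("capped:  ", initial_delay(30 * 60, max_stagger_s=60))
    # A 10 second check is unaffected by a 60 second cap.
    print("short:   ", initial_delay(10, max_stagger_s=60))
```

With the cap, a 30 minute check first runs within a minute of startup, while checks with intervals shorter than the cap keep their current behavior.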

jvoorhis · 2015-10-02T05:33:11Z

👍 for this ticket. I would like to have faster feedback while writing/debugging hourly checks. Alternatively, it would be great to have an API or CLI command to force re-execution of a check.

qingatqlik · 2015-10-17T00:08:41Z

Would it be possible to add an optional parameter specifying the maximum time within which the first check should run? You could then set the interval to 30 minutes but specify that the first check should run within, for instance, 1 second of registering the service. This is much like scheduling a task at a fixed rate in Java, where you can specify an initial delay and a recurring period. If the initial delay is set to 0, the first health check should run immediately after registration.
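The initial-delay-plus-period idea (as in Java's `scheduleAtFixedRate`) can be sketched as a generator of run times. This only illustrates the requested semantics; `run_times` and its parameters are hypothetical and do not correspond to any existing Consul option.

```python
from itertools import islice
from typing import Iterator


def run_times(
    initial_delay_s: float, period_s: float, start_s: float = 0.0
) -> Iterator[float]:
    """Yield the times (seconds after `start_s`) at which a check would run.

    The first run happens after `initial_delay_s`; later runs follow every
    `period_s` seconds, independent of the initial delay.
    """
    if initial_delay_s < 0 or period_s <= 0:
        raise ValueError("initial delay must be >= 0 and period must be > 0")
    t = start_s + initial_delay_s
    while True:
        yield t
        t += period_s


if __name__ == "__main__":
    # 30 minute interval, first run 1 second after registration.
    print(list(islice(run_times(1, 30 * 60), 3)))
```

Setting the initial delay to 0 gives the "run immediately after registration" behavior asked for above.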

c-robinson · 2017-07-24T18:49:21Z

I would also really appreciate a solution to this problem. We have a similar use case where we'd like to be able to force a host to run its checks ASAP. The goal for us is to be able to prove the status of a subset of our fleet before making a change to a member of it (for example: we monitor the health of a pair of network gateways. Before taking one out of service, we'd like to ensure that there are no problems on the other).

asteven · 2018-06-19T16:11:41Z

I would also really appreciate a solution to this problem. We currently have checks that run once a day :-(.
The solution outlined by @armon some 3 years ago seems reasonable, no?

maximum stagger period to say 60 seconds

Or make it configurable: default to what it is now, but allow me to change it to, say, 60 seconds in the Consul agent's config.

Alternatively, if the agent's check HTTP API were writable, I could at least work around the issue by running my checks manually and patching the check status.

Something like:

# run-check-manually stands in for whatever command executes the check
status="$(run-check-manually my-check-id)"
curl \
    --request PUT \
    --data "$(printf '{"Status": "%s"}' "$status")" \
    https://consul.rocks/v1/agent/check/update/my-check-id

Unfortunately the above currently does not work as the /check/update/ endpoint only works for checks of type TTL :-(
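Since the update endpoint does work for TTL checks, one possible workaround is to register the check as a TTL check and push results from your own scheduler. Below is a rough sketch that only builds the request. It assumes the endpoint accepts a PUT with `Status` and `Output` fields (verify against the agent HTTP API docs for your version); `build_ttl_update` is a hypothetical helper and the agent address is the usual local default.

```python
import json
import urllib.parse
import urllib.request


def build_ttl_update(
    agent_url: str, check_id: str, status: str, output: str = ""
) -> urllib.request.Request:
    """Build (but do not send) a status update request for a TTL check."""
    if status not in ("passing", "warning", "critical"):
        raise ValueError(f"unexpected status: {status!r}")
    body = json.dumps({"Status": status, "Output": output}).encode("utf-8")
    path = "/v1/agent/check/update/" + urllib.parse.quote(check_id, safe="")
    return urllib.request.Request(
        url=agent_url.rstrip("/") + path,
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


if __name__ == "__main__":
    req = build_ttl_update("http://127.0.0.1:8500", "my-check-id", "passing")
    print(req.get_method(), req.full_url)
    # To actually send it: urllib.request.urlopen(req)
```

The trade-off is that scheduling moves out of Consul: whatever runs the check must also report within the TTL, or the check goes critical.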

zorro786 · 2020-12-10T17:16:09Z

@onnimonni We purposely stagger start the checks to avoid a thundering herd

@armon For normal shorter-interval checks, like every 30 seconds, what is the initial delay bound after staggering?

onnimonni changed the title ~~Run "critical" checks immediately when node is restarted/reloaded~~ Run "critical" health checks immediately when node is restarted/reloaded May 18, 2015

armon added the type/bug Feature does not function as expected label May 18, 2015

This was referenced Jun 29, 2015

[misfiled] #1068

Closed

Register Distributive as a health check rather than a service mantl/mantl#535

Closed

langston-barrett mentioned this issue Jul 6, 2015

Create a "Pending" health check state #1085

Closed

slackpad mentioned this issue Aug 17, 2015

Allow immediate helthcheck upon registration #1179

Closed

slackpad added type/enhancement Proposed improvement or new feature and removed type/bug Feature does not function as expected labels Nov 22, 2016

slackpad added this to the 0.7.4 milestone Nov 22, 2016

slackpad removed this from the Triaged milestone Apr 18, 2017

slackpad added the theme/health-checks Health Check functionality label May 25, 2017

jkirschner-hashicorp mentioned this issue Sep 1, 2021

checks: add support for delaying the first health check #10864

Open

ekmixon mentioned this issue Apr 25, 2023

[Snyk] Fix for 1 vulnerabilities ekmixon/consul#510

Open
