
Checks to pass critical only after reaching a failure threshold #5739

Merged
1 commit merged on Oct 14, 2019

Conversation

PHBourquin
Contributor

A check may be configured to become critical only after a specified number of
consecutive check runs return critical. Until that threshold is reached, the
status remains unchanged.
This feature is available for HTTP, TCP, gRPC, Docker & Monitor checks.
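To make the behaviour concrete, here is a minimal, self-contained Go sketch of the idea (illustrative code, not the PR's implementation; the field names loosely mirror those that appear in the diff excerpts further down): the reported status only flips to critical once the configured number of consecutive failures has been seen, and any passing result resets the run.

```go
package main

import "fmt"

// check is a hypothetical, simplified stand-in for a health check with a
// failure threshold; it is not the PR's StatusHandler.
type check struct {
	status                 string // status currently reported to the agent
	failuresCounter        int    // consecutive failures observed so far
	failuresBeforeCritical int    // configured threshold
}

func (c *check) observe(result string) {
	if result == "passing" {
		c.failuresCounter = 0 // any pass breaks the failure run
		c.status = "passing"
		return
	}
	c.failuresCounter++
	if c.failuresCounter >= c.failuresBeforeCritical {
		c.status = "critical"
	}
}

func main() {
	c := &check{status: "passing", failuresBeforeCritical: 3}
	for _, r := range []string{"critical", "critical", "passing", "critical", "critical", "critical"} {
		c.observe(r)
		fmt.Printf("result=%-8s reported=%s\n", r, c.status)
	}
	// Only the final run of three consecutive failures flips the reported
	// status to critical; the earlier interrupted run does not.
}
```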

@hashicorp-cla

hashicorp-cla commented Apr 30, 2019

CLA assistant check
All committers have signed the CLA.

@PHBourquin force-pushed the check-failure-count branch 2 times, most recently from 7309450 to 5ddddae on May 5, 2019 at 15:12
statusHandler.updateCheck(checkID, api.HealthCritical, "bar")

retry.Run(t, func(r *retry.R) {
if got, want := notif.Updates("foo"), 1; got != want {
Contributor

You can use the require lib for this, which is easier to read. In this case:

require.Equal(t, 1, notif.Updates("foo"))

@banks added the type/enhancement label (Proposed improvement or new feature) on Jun 4, 2019
@banks added the needs-discussion label (Topic needs discussion with the larger Consul maintainers before committing to for a release) on Jul 2, 2019
@hanshasselberg
Member

@PHBourquin Thanks for your PR, this would be a nice addition.
As long as we only store counters, we don't have any concerns about memory consumption. I was wondering what you think about adding the same for marking a service as healthy?

Would you be able to rebase your PR so that we can review it?

@PHBourquin
Contributor Author

@i0rek Rebase done :)
Regarding doing the same thing for marking a service as healthy, I don't see the benefit: we want to avoid sending two notifications (passing -> critical -> passing) when a service instance is flapping (for example, a GC pause occurring during the health check), but I don't see in which scenario we would want to delay marking a service as healthy while the instance is passing the health check.

@shantanugadgil
Contributor

@PHBourquin A success count in addition to a failure count is already found in many places, like AWS load balancer configuration, HAProxy's rise/fall options, and most typical health check probers. :)

Using only 1 for healthy is ok, but I typically use 2 healthy probes for success and 5 unhealthy probes for failure. (Just my thought)

Great work on the PR, btw.

@PHBourquin force-pushed the check-failure-count branch 3 times, most recently from 24409ee to 6d3fbc2 on September 12, 2019 at 13:27
@banks removed the needs-discussion label (Topic needs discussion with the larger Consul maintainers before committing to for a release) on Oct 1, 2019
Member

@banks left a comment

Sorry this has sat for so long, @PHBourquin. There is clear interest and I think it's looking good.

I can see one issue with the threshold logic.

website/source/docs/agent/checks.html.md (two resolved review comments; outdated)
return
}
s.logger.Printf("[WARN] agent: Check %q failed but has not reached failure threshold %d/%d", checkID, s.failuresCounter, s.failuresBeforeCritical)
}
Member

We never reset the counters, which means the thresholds will only work the first time round, won't they?

e.g. if you have successBeforePassing = 2 and failuresBeforeCritical = 2, then you get the sequence: OK, OK, OK, FAIL, FAIL, FAIL, OK

The check should not yet be passing again, since there has only been one OK since the last FAIL, but the state here would be successCounter = 4, successBeforePassing = 2, and so s.successCounter >= s.successBeforePassing would be true.

Unless I'm missing something? I don't think there is a test case that exercises the path beyond a single threshold traversal, which would presumably catch this issue.

I think it's really easy to fix though: just reset the success counter when the failure threshold is hit, and vice versa. Some thorough table tests would be good here, because if you reset too soon you also break the thresholding effect. I think there are at least 6-8 interesting check histories to enumerate and test that this does the right thing in each, so a table test seems like an efficient way to be confident in the logic.
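For reference, a minimal, self-contained Go sketch (assumed, simplified names; not the actual code from this PR) of the reset discussed here — clearing the opposite counter on each result, as the reply below describes — showing that with both thresholds at 2 the sequence OK, OK, OK, FAIL, FAIL, FAIL, OK is still reported as critical at the end, because only one OK has been seen since the last FAIL:

```go
package main

import "fmt"

// statusHandler is an illustrative stand-in for the PR's StatusHandler,
// reduced to the two counters and the last reported status.
type statusHandler struct {
	status                 string
	successCounter         int
	failuresCounter        int
	successBeforePassing   int
	failuresBeforeCritical int
}

func (s *statusHandler) updateCheck(result string) {
	if result == "passing" {
		s.successCounter++
		s.failuresCounter = 0 // reset the opposite counter on every pass
		if s.successCounter >= s.successBeforePassing {
			s.status = "passing"
		}
		return
	}
	s.failuresCounter++
	s.successCounter = 0 // reset the opposite counter on every failure
	if s.failuresCounter >= s.failuresBeforeCritical {
		s.status = "critical"
	}
}

func main() {
	s := &statusHandler{status: "passing", successBeforePassing: 2, failuresBeforeCritical: 2}
	for _, r := range []string{"passing", "passing", "passing", "critical", "critical", "critical", "passing"} {
		s.updateCheck(r)
		fmt.Printf("result=%-8s reported=%s\n", r, s.status)
	}
	// The final reported status is "critical": the trailing single pass is
	// below successBeforePassing because the counter was reset by the FAILs.
}
```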

Contributor Author

Indeed, the counters were not reset as they should be; they are now set to 0 when updateCheck() is called with the opposite status, plus a test to ensure that sequences such as FAIL PASS FAIL FAIL won't update the status to critical if the threshold is set to 3.
I think the current tests provide good coverage, but if you think there is added value in going for the full table we can do it, your call.

Thanks for the catch!

Member

I agree those tests seem to cover the cases. I have a preference for table tests over multiple long story-based tests for cases like this, because it's easier to read the table and see which cases are tested than to keep state in your head as you read, but I think this is fine for now.

website/source/docs/agent/checks.html.md (resolved review comment)
@PHBourquin force-pushed the check-failure-count branch 3 times, most recently from a667e3c to a7f2ea1 on October 14, 2019 at 16:16
Member

@banks left a comment

I think this looks great, thanks!


checks return passing/critical. Status will stay identical as before until
the threshold is reached.
This feature is available for HTTP, TCP, gRPC, Docker & Monitor checks.
By default, will be set to 0 for both passing & critical - a single check is needed to update status.
Member

Suggested change
By default, will be set to 0 for both passing & critical - a single check is needed to update status.
By default, both passing and critical thresholds will be set to 0 so the check status will always reflect the last check result.

Member

@PHBourquin are you able to commit this tweak? If not I can make it, but I can't commit to your fork, so it's a bit more cumbersome.

Contributor Author

On it

@banks
Member

banks commented Oct 14, 2019

Thanks, I think this is ready to merge. The integration test failure is a tooling issue caused by a change in nomad master and is fixed in an adjacent PR; the acceptance tests with released versions of nomad all pass, so this is OK to go.

…failure threshold

A check may be configured to become passing/critical only after a specified number of
consecutive check runs return passing/critical. Until that threshold is reached, the
status remains unchanged.
This feature is available for HTTP, TCP, gRPC, Docker & Monitor checks.
@banks
Member

banks commented Oct 14, 2019

Thanks @PHBourquin! This is great.

Labels
type/enhancement Proposed improvement or new feature

6 participants