Introduce fail/pass check thresholds to prevent flapping #78
Left a few notes, but otherwise great work!
b516b77 to ebf6c7a
I addressed all comments. What do you think, @lornasong?
@edevil thanks again for your work! I think this is looking good.
I left two specific comments (see below). Would you be able to add/update some tests? That would be great for ESM's maintainability:
- updating TestDecodeMergeConfig to include the two new configs
- testing to ensure the threshold is being respected and reset
Please let me know if you have any questions. Thanks!
10fc999 to 70d913e
@lornasong I've updated
70d913e to bc8a0ff
@edevil - thanks for your work! TestDecodeMergeConfig looks good. I left a few questions around clarification, decrement logic, and uint.
As for testing the counters, here's one way:
- You could use a setup similar to TestCheck_MinimumInterval in check_test.go, where you set up a test server, API client, and new check runner, and create a new check.
- Then register the check in a similar way, like:
    checks := api.HealthChecks{check}
    runner.UpdateChecks(checks)
- Then retrieve the check and do any checks:
    originalCheck, ok := runner.checks[checkHash(check)]
- Then "update" the check, confirm any values, and repeat. ex:
    id := checkHash(check)
    runner.UpdateCheck(id, api.HealthCritical, "")
    assert.Equal(t, 1, originalCheck.failuresCounter)
    assert.Equal(t, 0, originalCheck.successCounter)
    assert.Equal(t, api.HealthPassing, originalCheck.Status)
    ...
2f724e7 to 8196a6b
Hello @lornasong. I added a test case that exercises the new code.
Thanks for adding the test @edevil! It helped me better understand the expected behavior of your feature. It looks really good. I left a follow-up on the uint vs. int discussion.
64c2534 to 1eff889
Hi @edevil, thanks for the updates! I have a couple of minor suggestions to the
I was also reviewing hashicorp/consul#5739 again and realized that the behavior here in ESM is different from Consul. As I understand it, in ESM the configs are the number of additional checks, consecutive or non-consecutive, needed before switching status. In Consul, the configs are the total number of checks, consecutive only, needed before switching status. I have some hesitation about deviating from Consul's behavior, since users might not expect it. You mentioned earlier that the ESM behavior is closer to what people will want in production, and you gave a great use case where ESM's behavior is better suited. If you (or anyone reading) have any additional thoughts, let me know. In the meantime, I will check in with our team early next week to get their input.
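To make the difference described above concrete, here is a minimal Go sketch of the two semantics as this comment characterizes them. It is not the ESM or Consul source: the type and function names are hypothetical, a single threshold is used for both directions for brevity, and the assumption that an agreeing result decrements the counter (rather than resetting it) is inferred from the "decrement logic" mentioned earlier in the review.

```go
package main

import "fmt"

// observer is a hypothetical interface; observe reports the status
// (true = passing) after taking one raw check result into account.
type observer interface {
	observe(ok bool) bool
}

// additional models the ESM-style behavior as described in the thread:
// threshold counts *additional* contrary results needed before switching,
// and they need not be consecutive. Assumption: a result that agrees with
// the current status decrements the counter instead of resetting it.
type additional struct {
	passing   bool
	threshold int
	counter   int // contrary results accumulated so far
}

func (a *additional) observe(ok bool) bool {
	if ok == a.passing {
		if a.counter > 0 {
			a.counter--
		}
		return a.passing
	}
	if a.counter >= a.threshold {
		// Enough contrary results were already seen: switch and start over.
		a.passing = ok
		a.counter = 0
		return a.passing
	}
	a.counter++
	return a.passing
}

// consecutive models the Consul-style behavior as described in the thread:
// threshold is the *total* number of contrary results, and they must be
// consecutive, so any agreeing result resets the streak.
type consecutive struct {
	passing   bool
	threshold int
	streak    int
}

func (c *consecutive) observe(ok bool) bool {
	if ok == c.passing {
		c.streak = 0
		return c.passing
	}
	c.streak++
	if c.streak >= c.threshold {
		c.passing = ok
		c.streak = 0
	}
	return c.passing
}

// run feeds every result to o and returns the final status.
func run(o observer, results []bool) bool {
	status := false
	for _, r := range results {
		status = o.observe(r)
	}
	return status
}

func main() {
	// Two failures, one success, two more failures.
	results := []bool{false, false, true, false, false}

	esmLike := &additional{passing: true, threshold: 2}     // 2 additional = 3 total
	consulLike := &consecutive{passing: true, threshold: 3} // 3 consecutive

	fmt.Println("additional, non-consecutive ends passing:", run(esmLike, results))
	fmt.Println("total, consecutive-only ends passing:", run(consulLike, results))
}
```

Under these assumptions, the single success in the middle of the sequence only sets the first model back by one step, so it still turns critical on the last failure, while the consecutive-only model restarts its streak and stays passing.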
1eff889 to af07150
I've implemented the changes requested. The test failures are due to general flakiness of the test suite, which I attempted to improve in #80.
@edevil thanks for the changes and for putting up #80 to fix the test flakiness! This is awesome. I am planning to review #80 tomorrow. As an update regarding the difference in implementation between Consul and ESM: I reached out to our team and scheduled a discussion for the middle of next week. I will update as soon as I learn more. Let me know if you have any questions. Thanks!
af07150 to 9ca670d
@edevil I spoke with our team. We discussed that your implementation captures more use cases than the Consul implementation and that it would be ok for ESM to diverge from Consul. Practitioners may jump to the incorrect conclusion that ESM has the same implementation as Consul since it has the same configuration name. To avoid this, let's choose a new, different configuration name. Do you (or anyone else reading) have any thoughts on a new config name that would capture this feature? I'll include some initial ideas but we don't have to use them:
Other ideas welcome!
I'm not very good with naming things but I would vote for threshold_to_pass. It's sufficiently different but still captures the gist of the feature.
I'd suggest
Let's go with
9ca670d to a051d04
Changed the names accordingly.
a051d04 to d7b84be
Amazing work André! Thank you! Hope @lornasong can approve quickly 🙏.
@edevil - this is looking great! Thanks for renaming the configurations. I added a couple of final suggestions for documentation and comments. Feel free to modify as you see fit or let me know if anything is inaccurate. Thanks!
Thank you for the documentation help.
Adds two new config variables, critical_threshold and passing_threshold, that respectively configure the number of failed checks needed to go from healthy to critical and the number of passed checks to go from critical to healthy.
c9313b4 to 0350cc6
This looks great @edevil! Thank you for all of your work here and all of the discussion! I think the community will really benefit from this feature.
I would love to use this feature, any chance we could have a 0.5.0 release soon? EDIT: I think a note in the CHANGELOG for this would be prudent.
@rbjorklin, thanks for the question. I did another round of updates to the CHANGELOG, so you should be able to see this feature in there now. Regarding the 0.5.0 release, we have some work scheduled this week, so I am planning to release early next week. Let me know if you have any concerns. Thank you!
Adds two new config variables, critical_threshold and passing_threshold,
that respectively configure the number of failed checks needed to go
from healthy to critical and the number of passed checks to go from
critical to healthy.
Addresses #50