Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Introduce fail/pass check thresholds to prevent flapping #78

Merged
merged 1 commit into from
Sep 14, 2020

Conversation

edevil
Copy link
Contributor

@edevil edevil commented Aug 12, 2020

Adds two new config variables, passing_threshold and critical_threshold,
that respectively configure the number of failed checks needed to go
from healthy to critical and the number of passed checks to go from
critical to healthy.

Addresses #50

@hashicorp-cla
Copy link

hashicorp-cla commented Aug 12, 2020

CLA assistant check
All committers have signed the CLA.

Copy link
Contributor

@yurkeen yurkeen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left a few notes, but otherwise great work!

config.go Outdated Show resolved Hide resolved
check.go Outdated Show resolved Hide resolved
check.go Outdated Show resolved Hide resolved
config.go Outdated Show resolved Hide resolved
check.go Outdated Show resolved Hide resolved
check.go Show resolved Hide resolved
@edevil
Copy link
Contributor Author

edevil commented Aug 13, 2020

I addressed all comments. What do you think @lornasong ?

@lornasong lornasong self-assigned this Aug 13, 2020
@lornasong
Copy link
Member

@edevil - thanks for your work here! This is really exciting! To keep you updated, I am planning to switch gears and take a look at this tomorrow (Friday) and Monday.

@yurkeen & @cbroglie thanks for engaging and reviewing!

Copy link
Member

@lornasong lornasong left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@edevil thanks again for your work! I think this is looking good.

I left two specific comments (see below). Would you be able add/update some tests that would be great for ESM's maintainability:

  • updating TestDecodeMergeConfig to include the two new configs
  • testing to ensure the threshold is being respected and reset

Please let me know if you have any questions. Thanks!

README.md Outdated Show resolved Hide resolved
check.go Show resolved Hide resolved
@edevil
Copy link
Contributor Author

edevil commented Aug 14, 2020

@lornasong I've updated TestDecodeMergeConfig to include the new config variables. However, I'm unsure of how to test the state of the counters since this is not exposed to the test suite.

Copy link
Member

@lornasong lornasong left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@edevil - thanks for your work! TestDecodeMergeConfig looks good. I left a few questions around clarification, decrement logic, and uint.

As for testing the counters, here's one way:

  • You could use a set up similar to TestCheck_MinimumInterval in check_test.go where you can set up: test server, api client, and new check runner, and create a new check
  • Then register the check in a similar way like
checks := api.HealthChecks{check}
runner.UpdateChecks(checks)
  • Then retrieve the check originalCheck, ok := runner.checks[checkHash(check)] and do any checks
  • Then "update" the check and confirm any values and repeat. ex:
id := checkHash(check)
runner.UpdateCheck(id, api.HealthCritical, "")
assert.Equal(t, 1, originalCheck.failuresCounter)
assert.Equal(t, 0, originalCheck.successCounter)
assert.Equal(t, api.HealthPassing, originalCheck.Status)
...

README.md Outdated Show resolved Hide resolved
check.go Show resolved Hide resolved
config.go Outdated Show resolved Hide resolved
@edevil edevil force-pushed the prevent_flapping branch 3 times, most recently from 2f724e7 to 8196a6b Compare August 19, 2020 15:43
@edevil
Copy link
Contributor Author

edevil commented Aug 19, 2020

Hello @lornasong. I added a test case that exercises the new code.

Copy link
Member

@lornasong lornasong left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for adding the test @edevil! It helped me better understand the expected behavior of your feature. It looks really good. I left a follow up on the uint vs. int discussion

config.go Outdated Show resolved Hide resolved
check.go Show resolved Hide resolved
README.md Outdated Show resolved Hide resolved
@edevil edevil force-pushed the prevent_flapping branch 2 times, most recently from 64c2534 to 1eff889 Compare August 21, 2020 09:50
@lornasong
Copy link
Member

Hi @edevil, thanks for the updates! I have a couple of minor suggestions to the IntValue code, apologies to not realize sooner:

  • Remove Set() and String() - it looks like we aren't using them yet
  • Make IntValue and Merge() private
  • Update the merge() method receiver from u to i

I was also reviewing hashicorp/consul#5739 again and realized that the behavior here in ESM is different from Consul. As I understand, in ESM, the configs are for the number of additional checks, consecutive or non-consecutive, needed before switching status. In Consul, the configs are the number of total checks, consecutive only, needed before switching status.

I have some hesitation have deviant behavior from Consul since users might not expect it. You mentioned earlier that the ESM behavior is more what people will want in production and you gave a great use-case where ESM’s behavior is better suited. If you (or anyone reading) have any additional thoughts, let me know. In the meantime, I will check-in with our team early next week to get their input.

@edevil
Copy link
Contributor Author

edevil commented Aug 24, 2020

I've implemented the changes requested. The test failures are due to general flakiness of the test suite which I attempted to improve in #80.

@eikenb
Copy link
Contributor

eikenb commented Aug 24, 2020

I restarted the test suite a couple times and they failed each time. I see #80 passed, maybe re-test this after #80 is merged to be sure it fixes the failures here?

@lornasong
Copy link
Member

@edevil thanks for the changes and putting up #80 to fix the test flakiness! This is awesome. I am planning to review #80 tomorrow.

As an update regarding the difference in implementation between Consul and ESM: I reached out to our team and scheduled to discuss it mid next-week. I will update as soon as I learn more.

Let me know if you have any questions. Thanks!

@lornasong
Copy link
Member

@edevil I spoke with our team. We discussed that your implementation captures more use cases than the Consul implementation and that it would be ok for ESM to diverge from Consul. Practitioners may jump to the incorrect conclusion that ESM has the same implementation as Consul since it has the same configuration name. To avoid this, let's choose a new, different configuration name.

Do you (or anyone else reading) have any thoughts on a new config name that would capture this feature? I'll include some initial ideas but we don't have to use them:

  • counts_before_pass / counts_before_critical (could also use 'until' or 'to' instead of 'before)
  • threshold_to_pass / threshold_to_critical
  • additional_counts_to_pass / additional_counts_to_critical

Other ideas welcome!

@edevil
Copy link
Contributor Author

edevil commented Sep 2, 2020

I'm not very good with naming things but I would vote for threshold_to_pass. It's sufficiently different but still captures the gist of the feature.

@yurkeen
Copy link
Contributor

yurkeen commented Sep 2, 2020

I'd suggest passing_threshold / critical_threshold.

@lornasong
Copy link
Member

Let's go with passing_threshold/critical_threshold. Thanks for the suggestion @yurkeen !

@edevil
Copy link
Contributor Author

edevil commented Sep 9, 2020

Changed the names accordingly.

check.go Outdated Show resolved Hide resolved
Copy link
Contributor

@yurkeen yurkeen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Amazing work André! Thank you! Hope @lornasong can approve quickly 🙏.

Copy link
Member

@lornasong lornasong left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@edevil - this is looking great! Thanks for renaming the configurations. I added a couple final suggestions that are documentation and comments. Feel free modify as you see fit or let me know if anything is inaccurate. Thanks!

README.md Outdated Show resolved Hide resolved
README.md Outdated Show resolved Hide resolved
README.md Show resolved Hide resolved
check_test.go Show resolved Hide resolved
check_test.go Show resolved Hide resolved
@edevil
Copy link
Contributor Author

edevil commented Sep 11, 2020

Thank you for the documentation help.

Adds two new config variables, critical_threshold and
passing_threshold, that respectively configure the number
of failed checks needed to go from healthy to critical and the
number of passed checks to go from critical to healthy.
Copy link
Member

@lornasong lornasong left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks great @edevil! Thank you for all of your work here and all of the discussion! I think the community will really benefit from this feature.

@lornasong lornasong added this to the 0.5.0 milestone Sep 14, 2020
@lornasong lornasong linked an issue Sep 14, 2020 that may be closed by this pull request
@lornasong lornasong merged commit 3ac28ae into hashicorp:master Sep 14, 2020
@rbjorklin
Copy link

rbjorklin commented Nov 26, 2020

I would love to use this feature, any chance we could have a 0.5.0 release soon?

EDIT: I think a note in the CHANGELOG for this would be prudent.

@lornasong
Copy link
Member

@rbjorklin, thanks for the question. I did another round of updates to the CHANGELOG, so you should be able to see this feature in there now. Regarding the 0.5.0 release, we have some work scheduled this week, so I am planning to release early next week. Let me know if you have any concerns. Thank you!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Consul ESM anti-flapping protection proposal
7 participants