-
Notifications
You must be signed in to change notification settings - Fork 314
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add splay to health check execution #7850
Conversation
Signed-off-by: David McNeil <mcneil.david2@gmail.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall I like this change as it will hopefully help a lot with a large number of services that were started quickly by a supervisor. However, I think it might be a good exercise to update the docs for the health checks to make the new semantics clear.
splay.as_secs(), | ||
service_group); | ||
first_successful_health_check = true; | ||
splay.into() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reading the log message I would have expected to see nominal_interval + splay.into()
here. In the end it doesn't matter, just a minor point and I don't have a great suggestion for better wording
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I thought making the health check interval shorter than the configured interval was more acceptable than making it longer.
Agreed the debug message was confusing. cd76d3 hopefully improves it.
Just curious why it matters that we only add splay after the first health check. Isn't each bundle sent to automate? |
My take was that this was being done once to randomly distributed services over the time range, but then we otherwise want the interval to be what the user set. |
Yes, that was my thought. All intervals are sent to automate. I am not sure how/if they are displayed, but I thought it could be a source of confusion if the intervals were always changing and not what the user explicitly configured. If we only add splay once there is only one "odd" interval. If we added jitter to every health check, it becomes harder to know what amount of time to add (ie random 10%, 20%, 50%). Whereas if we just do it on startup, I think we can get by choosing a random value over the entire range. If the startup splay is not effective we could probably do something on every execution, but I thought I would go with the simpler option first. |
Signed-off-by: David McNeil <mcneil.david2@gmail.com>
Signed-off-by: David McNeil <mcneil.david2@gmail.com>
Signed-off-by: David McNeil <mcneil.david2@gmail.com>
@@ -29,7 +29,10 @@ Optionally you may add an extension to the hook file. For example, you might cre | |||
File location: `<plan>/hooks/file-updated`. This hook is run whenever a configuration file that is not related to a user or about the state of the service instances is updated. | |||
|
|||
### health-check | |||
File location: `<plan>/hooks/health-check`. This hook is run when the Chef Habitat HTTP API receives a request at `/health`. | |||
File location: `<plan>/hooks/health-check`. This hook is run periodically on a configurable interval. There are two exceptions to the interval used between `health-check` runs: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
File location: `<plan>/hooks/health-check`. This hook is run periodically on a configurable interval. There are two exceptions to the interval used between `health-check` runs: | |
File location: `<plan>/hooks/health-check`.(Default: 30 seconds) This hook repeats at a configured interval. There are two exceptions to the interval used between `health-check` runs: |
File location: `<plan>/hooks/health-check`. This hook is run periodically on a configurable interval. There are two exceptions to the interval used between `health-check` runs: | ||
|
||
- If the `health-check` hook exits with a non-`ok` status the next `health-check` will run after the default `health-check` interval (thirty seconds). This is only done when the configured interval is greater than the default interval. | ||
- Following the first `ok` `health-check` the next `health-check` will run after a randomly chosen delay between zero and the configured interval. This is done to introduce splay between `health-check` runs. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- Following the first `ok` `health-check` the next `health-check` will run after a randomly chosen delay between zero and the configured interval. This is done to introduce splay between `health-check` runs. | |
- If the `health-check` hook returns an `ok` status for the first time, then the next `health-check` will run after a randomly chosen delay between 0 and the configured `health-check` interval. This introduces a splay---a degree of difference---in the timing between the first and second `health-check` runs. All following health-check hooks run at the configured interval. The splay prevents more than one health-check hook from starting at the same time by giving each of them a unique starting point. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some suggestions on wording. Splay is a hard concept to put into words.
Signed-off-by: David McNeil <mcneil.david2@gmail.com>
Resolves #6867
This PR adds splay to health check hooks runs following the first successful health check.
This is slightly different than what was discussed in standup which proposed adding splay to every health check run. I decided to only add splay following the first successful run because the health check interval is reported to automate. I also did not add a config value for the splay and instead simply used the health check interval. If we think an explicit interval would be useful I can add one.
Signed-off-by: David McNeil mcneil.david2@gmail.com