Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add splay to health check execution #7850

Merged
merged 5 commits into from
Aug 12, 2020
Merged

Conversation

davidMcneil
Copy link
Contributor

@davidMcneil davidMcneil commented Jul 30, 2020

Resolves #6867

This PR adds splay to health check hooks runs following the first successful health check.

This is slightly different than what was discussed in standup which proposed adding splay to every health check run. I decided to only add splay following the first successful run because the health check interval is reported to automate. I also did not add a config value for the splay and instead simply used the health check interval. If we think an explicit interval would be useful I can add one.

Signed-off-by: David McNeil mcneil.david2@gmail.com

Signed-off-by: David McNeil <mcneil.david2@gmail.com>
@davidMcneil davidMcneil self-assigned this Jul 30, 2020
@davidMcneil davidMcneil added the Type:Breaking Change PRs that are classified as a change to existing behavior label Jul 30, 2020
Copy link
Contributor

@stevendanna stevendanna left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall I like this change as it will hopefully help a lot with a large number of services that were started quickly by a supervisor. However, I think it might be a good exercise to update the docs for the health checks to make the new semantics clear.

splay.as_secs(),
service_group);
first_successful_health_check = true;
splay.into()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reading the log message I would have expected to see nominal_interval + splay.into() here. In the end it doesn't matter, just a minor point and I don't have a great suggestion for better wording

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I thought making the health check interval shorter than the configured interval was more acceptable than making it longer.

Agreed the debug message was confusing. cd76d3 hopefully improves it.

@mwrock
Copy link
Contributor

mwrock commented Jul 31, 2020

Just curious why it matters that we only add splay after the first health check. Isn't each bundle sent to automate?

@stevendanna
Copy link
Contributor

Just curious why it matters that we only add splay after the first health check. Isn't each bundle sent to automate?

My take was that this was being done once to randomly distributed services over the time range, but then we otherwise want the interval to be what the user set.

@davidMcneil
Copy link
Contributor Author

davidMcneil commented Jul 31, 2020

Just curious why it matters that we only add splay after the first health check. Isn't each bundle sent to automate?

My take was that this was being done once to randomly distributed services over the time range, but then we otherwise want the interval to be what the user set.

Yes, that was my thought. All intervals are sent to automate. I am not sure how/if they are displayed, but I thought it could be a source of confusion if the intervals were always changing and not what the user explicitly configured. If we only add splay once there is only one "odd" interval.

If we added jitter to every health check, it becomes harder to know what amount of time to add (ie random 10%, 20%, 50%). Whereas if we just do it on startup, I think we can get by choosing a random value over the entire range.

If the startup splay is not effective we could probably do something on every execution, but I thought I would go with the simpler option first.

Signed-off-by: David McNeil <mcneil.david2@gmail.com>
Signed-off-by: David McNeil <mcneil.david2@gmail.com>
Signed-off-by: David McNeil <mcneil.david2@gmail.com>
@@ -29,7 +29,10 @@ Optionally you may add an extension to the hook file. For example, you might cre
File location: `<plan>/hooks/file-updated`. This hook is run whenever a configuration file that is not related to a user or about the state of the service instances is updated.

### health-check
File location: `<plan>/hooks/health-check`. This hook is run when the Chef Habitat HTTP API receives a request at `/health`.
File location: `<plan>/hooks/health-check`. This hook is run periodically on a configurable interval. There are two exceptions to the interval used between `health-check` runs:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
File location: `<plan>/hooks/health-check`. This hook is run periodically on a configurable interval. There are two exceptions to the interval used between `health-check` runs:
File location: `<plan>/hooks/health-check`.(Default: 30 seconds) This hook repeats at a configured interval. There are two exceptions to the interval used between `health-check` runs:

File location: `<plan>/hooks/health-check`. This hook is run periodically on a configurable interval. There are two exceptions to the interval used between `health-check` runs:

- If the `health-check` hook exits with a non-`ok` status the next `health-check` will run after the default `health-check` interval (thirty seconds). This is only done when the configured interval is greater than the default interval.
- Following the first `ok` `health-check` the next `health-check` will run after a randomly chosen delay between zero and the configured interval. This is done to introduce splay between `health-check` runs.
Copy link
Contributor

@kagarmoe kagarmoe Aug 5, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
- Following the first `ok` `health-check` the next `health-check` will run after a randomly chosen delay between zero and the configured interval. This is done to introduce splay between `health-check` runs.
- If the `health-check` hook returns an `ok` status for the first time, then the next `health-check` will run after a randomly chosen delay between 0 and the configured `health-check` interval. This introduces a splay---a degree of difference---in the timing between the first and second `health-check` runs. All following health-check hooks run at the configured interval. The splay prevents more than one health-check hook from starting at the same time by giving each of them a unique starting point.

Copy link
Contributor

@kagarmoe kagarmoe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some suggestions on wording. Splay is a hard concept to put into words.

Signed-off-by: David McNeil <mcneil.david2@gmail.com>
@davidMcneil
Copy link
Contributor Author

Thanks @kagarmoe! Your feedback should be addressed in 657397

@davidMcneil davidMcneil merged commit 5a8454d into master Aug 12, 2020
@davidMcneil davidMcneil deleted the dmcneil/health-check-splay branch September 9, 2020 15:19
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Type:Breaking Change PRs that are classified as a change to existing behavior
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Introduce splay when running health-checks
5 participants