@KaylaBrady KaylaBrady commented May 1, 2025

Summary

What is this PR for?
No ticket. This is in response to some recent noisy health check count pages and is intended to avoid noisy failing health check pages in the future.

Previously, we checked whether alert data was stale by comparing the count of alerts in our Store to the count of alerts returned from the API. We knew this first iteration would be flaky, since it can take a bit of time for our Store to catch up, and that lag causes frequent health check failures (65 in the last 4 hours on prod). So far this hasn't resulted in any pages, but it does create noise in our dashboard and could erroneously page someone outside our team, who wouldn't be able to immediately recognize it as noise.

This PR adds a grace period of 5 minutes to our alert count check to reduce health check failure noisiness.
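
For readers who want the idea in code, here is a minimal sketch of a count check with a grace period. It is not the implementation from this PR: the module name, the `check/4` function, and the way the first-mismatch timestamp is threaded through are all hypothetical, and treating a mismatch as unhealthy at exactly 5 minutes is an assumption.

```elixir
defmodule AlertCountCheckSketch do
  # Hypothetical illustration only; names and state handling are assumptions.

  # Grace period from the PR description: 5 minutes, in milliseconds.
  @grace_period_ms 5 * 60 * 1000

  # Returns {healthy?, mismatch_since_ms}.
  #
  # mismatch_since_ms is nil while the counts agree, or the timestamp (ms)
  # at which the current mismatch was first observed.
  def check(store_count, api_count, now_ms, mismatch_since_ms) do
    cond do
      # Counts agree: healthy, and any running mismatch timer is cleared.
      store_count == api_count ->
        {true, nil}

      # First time we see a mismatch: start the timer, still report healthy.
      is_nil(mismatch_since_ms) ->
        {true, now_ms}

      # Ongoing mismatch: healthy only while inside the grace period.
      true ->
        {now_ms - mismatch_since_ms < @grace_period_ms, mismatch_since_ms}
    end
  end
end

# A mismatch starts at t = 0 and is tolerated until the grace period elapses.
{true, since} = AlertCountCheckSketch.check(10, 12, 0, nil)
{true, ^since} = AlertCountCheckSketch.check(10, 12, 60_000, since)
{false, _} = AlertCountCheckSketch.check(10, 12, 300_000, since)

# Once the Store catches up, the check is healthy again and the timer resets.
{true, nil} = AlertCountCheckSketch.check(12, 12, 360_000, since)
```

Passing the timestamp in and out keeps the sketch pure and easy to unit test. A real health check would need to hold that value somewhere between invocations.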

Testing

  • Added unit tests
  • Deployed to dev-orange hoping to evaluate whether this cuts down on failures, but it turns out we don't see many failures on dev-orange (presumably because alerts change less frequently there). Will evaluate on staging for a while once merged.

@KaylaBrady KaylaBrady added the deploy to dev-orange Automatically deploy this PR to dev-orange label May 1, 2025
@KaylaBrady KaylaBrady marked this pull request as ready for review May 1, 2025 20:26
@KaylaBrady KaylaBrady requested a review from a team as a code owner May 1, 2025 20:26
@KaylaBrady KaylaBrady requested review from boringcactus and removed request for a team May 1, 2025 20:26
 @impl true
 def check_health do
-  store_count = length(Store.Alerts.fetch([]))
+  store_count = length(Store.Alerts.fetch(fields: [alert: []]))

We have sometimes seen timeouts on health checks (ticket). My hypothesis is that, since we don't actually need the alert fields here, requesting sparse data will be faster.

@KaylaBrady KaylaBrady merged commit 30425df into main May 1, 2025
6 checks passed
@KaylaBrady KaylaBrady deleted the kb-alert-healthcheck branch May 1, 2025 20:55