
Distributor: add ingester append timeouts error #10456

Merged

Conversation

dannykopping
Contributor

@dannykopping dannykopping commented Sep 5, 2023

What this PR does / why we need it:
Failing to send samples to ingesters because the request exceeded its timeout is a very clear signal that ingesters are unable to keep up with demand. In an incident today we saw that ingesters' push latencies increased sharply due to an expensive regex query that was starving other goroutines of CPU time.

This new metric, `loki_distributor_ingester_append_timeouts_total`, gives us a high-signal counter which we can use for alerting.
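For illustration, an alerting rule built on this counter might look like the following. This is only a sketch: the alert name, threshold, `for` duration, and labels are assumptions and not part of this PR.

```yaml
groups:
  - name: loki_distributor_alerts
    rules:
      - alert: LokiDistributorIngesterAppendTimeouts
        # Fire when distributors are timing out appends to ingesters at a
        # sustained rate, a strong signal that ingesters cannot keep up.
        expr: sum(rate(loki_distributor_ingester_append_timeouts_total[5m])) > 0
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Distributor appends to ingesters are timing out"
```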

Which issue(s) this PR fixes:
N/A

Special notes for your reviewer:
This replaces the `loki_distributor_ingester_append_failures_total` metric, which was never high-signal: it also counted samples that could not be appended because of user-related errors such as stream limits or samples that were too old.

Checklist

  • Reviewed the CONTRIBUTING.md guide (required)
  • Documentation added
  • Tests updated
  • CHANGELOG.md updated
    • If the change is worth mentioning in the release notes, add the add-to-release-notes label
  • Changes that require user attention or interaction to upgrade are documented in docs/sources/setup/upgrade/_index.md
  • For Helm chart changes bump the Helm chart version in production/helm/loki/Chart.yaml and update production/helm/loki/CHANGELOG.md and production/helm/loki/README.md. Example PR

Signed-off-by: Danny Kopping <danny.kopping@grafana.com>
@dannykopping dannykopping requested a review from a team as a code owner September 5, 2023 11:43
@github-actions github-actions bot added the type/docs label Sep 5, 2023
@dannykopping dannykopping changed the title Adding ingester append timeout error Distributor: add ingester append timeouts error Sep 5, 2023
@dannykopping dannykopping merged commit 2c84959 into grafana:main Sep 5, 2023
5 checks passed
@dannykopping dannykopping deleted the dannykopping/ingester-timeouts branch September 5, 2023 12:40
rhnasc pushed a commit to inloco/loki that referenced this pull request Apr 12, 2024
Labels: size/S, type/docs