Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Persist keep firing for state across restarts #14018

Open
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

mustafain117
Copy link

@mustafain117 mustafain117 commented Apr 30, 2024

This PR resolves issue #13957

In this PR:

  • alerts using the keep_firing_for duration will be stored to persistence storage after each evaluation.
  • alerts are restored from storage before the first evaluation for the group after server restarts.
  • implemented the persistence store using badgerDB (open to feedback on choice of implementation)
    • store entries have a configurable TTL, runs GC to cleanup old entries
  • users of Prometheus as a library can provide their own store implementation when starting the rule manager
  • Feature flag to enable this feature

Testing:

Apart from unit tests, functionally tested with upto 1000 alerts:

image First annotation represents feature being enabled, second annotation is for server restart after which alerts state is restored. Firing alerts are no longer firing when the feature is disabled and server restarts, whereas they keep firing when server restarts and feature is enabled. Minimal spike in eval latency when restoring state.

Signed-off-by: Mustafain Ali Khan <mustalik@amazon.com>
@mustafain117 mustafain117 force-pushed the keep-firing-for branch 4 times, most recently from 97b4f9b to 73e9900 Compare May 14, 2024 14:15
Signed-off-by: Mustafain Ali Khan <mustalik@amazon.com>
@mustafain117 mustafain117 changed the title persist keep firing for state across restarts Persist keep firing for state across restarts May 14, 2024
@mustafain117 mustafain117 marked this pull request as ready for review May 14, 2024 16:34
@beorn7 beorn7 requested a review from roidelapluie May 15, 2024 12:17
@roidelapluie
Copy link
Member

Why can't we use the TSDB like we do for the alerts timestamps?

@mustafain117
Copy link
Author

Why can't we use the TSDB like we do for the alerts timestamps?

To restore the alert we need more than just the KeepFiringSince timestamp. Restored alert needs to have the same value, labels and annotations, therefore, restoring just the timestamp is not sufficient. When Prometheus restarts, the expression is already resolved so we can't evaluate the rule again to restore the value, labels and annotations (approach used to restore 'for' state)

While the alert labels can be stored in TSDB as part of the sample, was not sure how annotations can be stored in the same sample (perhaps as labels with custom prefix to differentiate between the alert labels and annotations). Additionally, would also need to store the expression value as a separate metric.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants