Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Alerting: Support hysteresis command expression #75189

Merged
merged 55 commits into from
Jan 4, 2024

Conversation

yuri-tceretian
Copy link
Contributor

@yuri-tceretian yuri-tceretian commented Sep 20, 2023

What is this feature?
This PR adds support for a new type of command introduced in #70998. The hysteresis command requires information about the results of the previous evaluation. This PR updates the state manager and scheduler to facilitate this information for the evaluation engine:

  • state manager is updated to keep a fingerprint of the result instance for each State. It differs from CacheID or Labelshash because it is calculated from the labels of the raw result.
  • scheduler wraps the state manager and the rule into an intermediate structure AlertingResultsFromRuleState and provides it to the rule evaluator.
  • The evaluator analyzes the expression tree (parsed rule's query) and if it contains HysteresisCommand, it populates it with the data taken from the AlertingResultsFromRuleState.

The structure AlertingResultsFromRuleState implements the logic that determines what metric\dimension\state is considered as "loaded":

  • state Pending or Alerting
  • state reason is empty. This is to avoid execution of exceptional states such as Error\NoData as Alerting.

Why do we need this feature?
To support hysteresis expression in alerting. This will help users reduce flapping on the alert rules when the metric value frequently crosses threshold boundaries.

Who is this feature for?
Alerting

Which issue(s) does this PR fix?:

Related #6202

Special notes for your reviewer:

  • PR can be reviewed by commit. In some commits I provide the explanations for chosen approach.
  • The majority of the changes are tests. The core backend changes are in just two commits da01d9a and ef96be8
  • the feature is behind the flag recoveryThreshold. if it is disabled the expr engine does not create HysteresisCommand struct and code in this PR does not do anything.
UI screen recording
hysteresis_UI.mp4

When viewing the rule

Screenshot 2023-11-06 at 09 45 23

Please check that:

  • It works as expected from a user's perspective.
  • If this is a pre-GA feature, it is behind a feature toggle.
  • The docs are updated, and if this is a notable improvement, it's added to our What's New doc.

@ephemeral-instances-bot
Copy link

@soniaAguilarPeiron
Copy link
Member

last commit : skip hysteresis values being validated when unchecking hysteresis.

fix-hysteresis-validation.mp4

@soniaAguilarPeiron
Copy link
Member

/deploy-to-hg

@grafana grafana deleted a comment from ephemeral-instances-bot bot Dec 7, 2023
@grafana grafana deleted a comment from ephemeral-instances-bot bot Dec 7, 2023
@soniaAguilarPeiron
Copy link
Member

/deploy-to-hg

@grafana grafana deleted a comment from ephemeral-instances-bot bot Dec 14, 2023
@grafana grafana deleted a comment from ephemeral-instances-bot bot Dec 14, 2023
@soniaAguilarPeiron
Copy link
Member

My last commit is a fix for the style for hysteresis fields in dashboard panel expressions.

Before the change: Screenshot 2023-12-15 at 09 59 07 Screenshot 2023-12-15 at 09 58 56
After the change: Screenshot 2023-12-15 at 09 50 52 Screenshot 2023-12-15 at 09 50 44

@soniaAguilarPeiron
Copy link
Member

After a conversation with @yuri-tceretian we agreed on hiding hsyteresis in panel expressions.

@grafana grafana deleted a comment from ephemeral-instances-bot bot Jan 3, 2024
@grafana grafana deleted a comment from ephemeral-instances-bot bot Jan 4, 2024
@JacobsonMT
Copy link
Member

/deploy-to-hg

@ephemeral-instances-bot
Copy link

  • Preparing your instance. A comment containing your instance's url will be added to this PR when the instance is ready.
  • Your instance will be ready in ~10 minutes.
  • Check the GitHub actions tab to follow the workflow progress
  • Slack channel: #proj-ephemeral-hg-instances
  • Building instance with yuri-tceretian/expr-hysteresis-alerting oss branch and main enterprise branch. How to choose a branch

@ephemeral-instances-bot
Copy link

Copy link
Member

@JacobsonMT JacobsonMT left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great job! Confirmed working as expected, there are parts I'm not 100% sure on the implementation but considering this is behind a feature flag it's not necessary to dwell on those details at this stage. 🚀

@yuri-tceretian yuri-tceretian merged commit f6a4674 into main Jan 4, 2024
15 checks passed
@yuri-tceretian yuri-tceretian deleted the yuri-tceretian/expr-hysteresis-alerting branch January 4, 2024 16:47
zserge pushed a commit that referenced this pull request Jan 22, 2024
Backend: 

* Update the Grafana Alerting engine to provide feedback to HysteresisCommand. The feedback information is stored in state.Manager as a fingerprint of each state. The fingerprint is persisted to the database. Only fingerprints that belong to Pending and Alerting states are considered as "loaded" and provided back to the command.
   - add ResultFingerprint to state.State. It's different from other fingerprints we store in the state because it is calculated from the result labels.
  -  add rule_fingerprint column to alert_instance
   - update alerting evaluator to accept AlertingResultsReader via context, and update scheduler to provide it.
   - add AlertingResultsFromRuleState that implements the new interface in eval package
   - update getExprRequest to patch the hysteresis command.

* Only one "Recovery Threshold" query is allowed to be used in the alert rule and it must be the Condition.


Frontend:

* Add hysteresis option to Threshold in UI. It's called "Recovery Threshold"
* Add test for getUnloadEvaluatorTypeFromCondition
* Hide hysteresis in panel expressions

* Refactor isInvalid and add test for it
* Remove unnecesary React.memo
* Add tests for updateEvaluatorConditions

---------

Co-authored-by: Sonia Aguilar <soniaaguilarpeiron@gmail.com>
@summerwollin summerwollin modified the milestones: 10.3.x, 10.3.0 Jan 22, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet