Alerting: Support for simplified notification settings in rule API #81011
for stable JSON marshaling
- Validation interfaces now all in notifier/validation.go
- Naming and type visibility cleaned up
- Store no longer needs knowledge of validator interfaces
This introduces a slightly larger chance for stale AM configs to be applied if multiple rule updates race, because the gap between AM config fetch and apply is larger. Any drift will still be reconciled on the next MAM periodic sync.
This interface relies on field getters rather than methods to get a set of names. Steps to do this:
- Create receiver interface + implement that interface for both receiver types (currently only requirement is GetName)
- Refactor receiver/mute time name methods to return slices of the structs, e.g. ReceiverNames -> GetReceivers
- Introduce generic constraint over existing apiAlertingConfig interface that the return value of GetReceivers should follow
- That's it, the compiler can infer the type for this generic constraint so call sites remain unchanged
/deploy-to-hg --enterprise-ref yuri-tceretian/simplified-notificiations
LGTM, great effort and new feature! 🚀 🚀 🚀
LGTM! We got there, let's
Raised an issue about the missing unit tests in this PR: #90115
What is this feature?
This PR introduces support for assigning an alert rule to a contact point. We call it "simplified notification policies".
Alert Rule API changes
The Alert Rule API (including provisioning) supports a new set of settings available for each alert rule definition.
Above is an example configuration that has all fields defined. These are the same fields we use in notification policies. Only the receiver is mandatory; all other fields are optional.
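For illustration only, here is a minimal Go sketch of what such a settings object could look like. The type name, the JSON field names other than receiver, and the use of plain strings for durations are assumptions modelled on common notification policy options, not the PR's actual types. The only rule taken from the description is that receiver is mandatory.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// NotificationSettings is a hypothetical shape for per-rule settings.
// Field names are assumed to mirror notification policy options.
type NotificationSettings struct {
	Receiver          string   `json:"receiver"`
	GroupBy           []string `json:"group_by,omitempty"`
	GroupWait         string   `json:"group_wait,omitempty"`
	GroupInterval     string   `json:"group_interval,omitempty"`
	RepeatInterval    string   `json:"repeat_interval,omitempty"`
	MuteTimeIntervals []string `json:"mute_time_intervals,omitempty"`
}

// Validate enforces the single rule stated in the description:
// the receiver is mandatory, everything else is optional.
func (s NotificationSettings) Validate() error {
	if s.Receiver == "" {
		return errors.New("receiver must be specified")
	}
	return nil
}

func main() {
	s := NotificationSettings{
		Receiver:  "grafana-default-email",
		GroupBy:   []string{"alertname"},
		GroupWait: "30s",
	}
	if err := s.Validate(); err != nil {
		fmt.Println("invalid settings:", err)
		return
	}
	out, err := json.MarshalIndent(s, "", "  ")
	if err != nil {
		fmt.Println("marshal failed:", err)
		return
	}
	fmt.Println(string(out))
}
```

Optional fields left empty are omitted from the JSON, which matches the idea that a rule may specify only a receiver.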
These settings are consumed by the scheduler and Alertmanager configurator.
However, it is important to mention a few differences in how the settings are handled, depending on which API they are submitted through:
Alertmanager configuration
This PR changes how we apply the configuration to the Grafana Alertmanager. Currently, only the embedded Alertmanager can handle these settings. When the Alertmanager configuration is applied (via an API request or a timer), a new route is added to the notification policies. The route is added at the top of the routing tree, as the first route after the root, as described in the diagram below.
The auto-generated route can contain up to 3 levels:
- __grafana_autogenerated__=true. All alerts created by rules that have notification settings are caught by this route.
- __grafana_receiver__=<contact_point_name>. These routes are always created for each existing contact point. NOTE: This is needed because rule scheduling and Alertmanager configuration are asynchronous processes, and we want to make sure that the alert is sent to a receiver even if it is evaluated before the new config is applied.
- __grafana_route_settings_hash__=<hash_of_optional_settings>. This route is created for each unique permutation of optional settings. At this level, the custom group_by, group_wait, etc. settings are defined.
The auto-generated routes are visible only to administrators and cannot be updated by anyone but the Grafana server. We decided to show them to administrators for troubleshooting purposes.
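The three-level structure described above can be sketched roughly as follows. This is an illustration under stated assumptions: Route and RuleSettings are simplified, hypothetical types (not the Alertmanager or Grafana definitions), matchers are modelled as plain equality maps, and the hash is taken as an opaque precomputed string.

```go
package main

import "fmt"

// Route is a simplified, hypothetical stand-in for an Alertmanager route.
type Route struct {
	Receiver string
	Matchers map[string]string
	GroupBy  []string
	Routes   []*Route
}

// RuleSettings is a hypothetical summary of one rule's notification settings.
// Hash is empty when only the receiver is set.
type RuleSettings struct {
	Receiver string
	Hash     string
	GroupBy  []string
}

// BuildAutogenRoute builds the auto-generated subtree:
// level 1 matches all autogenerated alerts, level 2 has one route per
// contact point, level 3 has one route per unique set of optional settings.
func BuildAutogenRoute(contactPoints []string, rules []RuleSettings) *Route {
	top := &Route{Matchers: map[string]string{"__grafana_autogenerated__": "true"}}
	byReceiver := make(map[string]*Route, len(contactPoints))
	for _, cp := range contactPoints {
		r := &Route{Receiver: cp, Matchers: map[string]string{"__grafana_receiver__": cp}}
		byReceiver[cp] = r
		top.Routes = append(top.Routes, r)
	}
	type key struct{ receiver, hash string }
	seen := make(map[key]bool)
	for _, rs := range rules {
		parent, ok := byReceiver[rs.Receiver]
		k := key{rs.Receiver, rs.Hash}
		if !ok || rs.Hash == "" || seen[k] {
			continue // unknown receiver, receiver-only settings, or duplicate
		}
		seen[k] = true
		parent.Routes = append(parent.Routes, &Route{
			Receiver: rs.Receiver,
			Matchers: map[string]string{"__grafana_route_settings_hash__": rs.Hash},
			GroupBy:  rs.GroupBy,
		})
	}
	return top
}

// Prepend places the auto-generated route as the first child of the root.
func Prepend(root, autogen *Route) {
	root.Routes = append([]*Route{autogen}, root.Routes...)
}

func main() {
	root := &Route{Receiver: "default", Routes: []*Route{{Receiver: "team-a"}}}
	autogen := BuildAutogenRoute(
		[]string{"email", "slack"},
		[]RuleSettings{
			{Receiver: "slack", Hash: "abc123", GroupBy: []string{"alertname"}},
			{Receiver: "slack", Hash: "abc123", GroupBy: []string{"alertname"}},
			{Receiver: "email"},
		},
	)
	Prepend(root, autogen)
	fmt.Println(len(root.Routes), len(autogen.Routes), len(autogen.Routes[1].Routes))
}
```

Creating the per-contact-point level unconditionally reflects the NOTE above: an alert carrying a receiver label still finds a route even if its settings-hash route has not been applied yet.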
Scheduler
When a rule with notification settings is evaluated and the state manager decides to create an alert from the evaluation results, the alert will contain an additional 2 or 3 labels, depending on the notification settings:
- __grafana_autogenerated__: true
- __grafana_receiver__: <receiver_name>. The value is the value of the field receiver in the notification settings.
- __grafana_route_settings_hash__: <hash_of_optional_settings>. The value is the fingerprint of all optional settings. If the notification settings have only the receiver specified, this label is not created.
State manager
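As a rough illustration of the scheduler labels listed above, the sketch below derives the 2 or 3 labels from a rule's settings. The Settings type and the FNV-based fingerprint are assumptions made for the example; the PR description only says the value is a fingerprint of all optional settings and does not specify how it is computed.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// Settings is a hypothetical, trimmed-down view of notification settings.
type Settings struct {
	Receiver  string
	GroupBy   []string
	GroupWait string
}

// hasOptional reports whether anything besides the receiver is set.
func (s Settings) hasOptional() bool {
	return len(s.GroupBy) > 0 || s.GroupWait != ""
}

// fingerprint hashes the optional settings only (assumed scheme: FNV-1a).
func (s Settings) fingerprint() string {
	h := fnv.New64a()
	for _, g := range s.GroupBy {
		h.Write([]byte(g))
		h.Write([]byte{0}) // separator so ["ab"] and ["a","b"] differ
	}
	h.Write([]byte{1})
	h.Write([]byte(s.GroupWait))
	return fmt.Sprintf("%016x", h.Sum64())
}

// NotificationLabels returns the extra labels for an alert, or nil when
// the rule has no notification settings.
func NotificationLabels(s Settings) map[string]string {
	if s.Receiver == "" {
		return nil
	}
	labels := map[string]string{
		"__grafana_autogenerated__": "true",
		"__grafana_receiver__":      s.Receiver,
	}
	if s.hasOptional() {
		labels["__grafana_route_settings_hash__"] = s.fingerprint()
	}
	return labels
}

func main() {
	fmt.Println(NotificationLabels(Settings{Receiver: "slack"}))
	fmt.Println(NotificationLabels(Settings{Receiver: "slack", GroupBy: []string{"alertname"}}))
}
```

Because the receiver is not part of the fingerprint in this sketch, two rules with different receivers but identical optional settings share a hash value and are told apart by the receiver label.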
When state is calculated, we merge 3 sets of labels into the state labels:
According to the logic, the above order reflects the priority in case of a conflict. However, the auto-generated labels are optional and are not created if the rule does not have notification settings, so merging would not override such labels if they are provided via the result. In other words, a user could create a query that results in labels matching the auto-generated labels and exploit the routing. To prevent this, the state manager is updated to rename such result labels: the suffix _user is appended to the label name. If the renamed label conflicts with an existing label in the result, the original is simply removed and a warning message is logged.
Why do we need this feature?
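The renaming rule for conflicting result labels, described in the State manager section, could be sketched as follows. The function name and the use of the standard log package are assumptions for the example. The behaviour follows the description: rename with a _user suffix, or drop the label with a warning if the renamed name is already taken.

```go
package main

import (
	"fmt"
	"log"
)

// reservedLabels are the auto-generated label names named in the description.
var reservedLabels = []string{
	"__grafana_autogenerated__",
	"__grafana_receiver__",
	"__grafana_route_settings_hash__",
}

// SanitizeResultLabels returns a copy of the query result labels in which
// any reserved label is renamed with a "_user" suffix. If the renamed label
// already exists in the result, the reserved label is dropped and a warning
// is logged.
func SanitizeResultLabels(result map[string]string) map[string]string {
	out := make(map[string]string, len(result))
	for k, v := range result {
		out[k] = v
	}
	for _, name := range reservedLabels {
		v, ok := out[name]
		if !ok {
			continue
		}
		delete(out, name)
		renamed := name + "_user"
		if _, exists := out[renamed]; exists {
			log.Printf("warning: dropping result label %q: %q already exists", name, renamed)
			continue
		}
		out[renamed] = v
	}
	return out
}

func main() {
	fmt.Println(SanitizeResultLabels(map[string]string{
		"instance":             "db-1",
		"__grafana_receiver__": "attacker-chosen",
	}))
}
```

After sanitizing, the reserved names are free in the result labels, so a query cannot redirect an alert to another contact point by emitting them.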
Special notes for your reviewer:
Please check that: