Alerting: Support for simplified notification settings in rule API #81011

Merged
merged 96 commits into from
Feb 15, 2024


@yuri-tceretian yuri-tceretian commented Jan 22, 2024

What is this feature?
This PR introduces support for assigning an alert rule to a contact point. We call it "simplified notification policies".

Alert Rule API changes

The Alert Rule API (including provisioning) supports a new set of settings available for each alert rule definition.

"notification_settings":{
  "receiver": "Test-Receiver",
  "group_by": ["alertname","grafana_folder","test"],
  "group_wait": "1s",
  "group_interval": "5s",
  "repeat_interval": "5m",
  "mute_time_intervals": ["test-mute"]
}

Above is an example configuration that has all fields defined. These are the same fields we use in notification policies. Only receiver is mandatory; all other fields are optional.
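For illustration, here is a minimal Go sketch of how such a payload could be decoded and checked. The `NotificationSettings` type, its `Validate` method, and `parseSettings` are hypothetical names, not the definitions added by this PR, and parsing durations with `time.ParseDuration` is an assumption (the real API may accept a wider duration syntax).

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
	"time"
)

// NotificationSettings mirrors the JSON fields shown above.
// Hypothetical type for illustration; not Grafana's actual definition.
type NotificationSettings struct {
	Receiver          string   `json:"receiver"`
	GroupBy           []string `json:"group_by,omitempty"`
	GroupWait         string   `json:"group_wait,omitempty"`
	GroupInterval     string   `json:"group_interval,omitempty"`
	RepeatInterval    string   `json:"repeat_interval,omitempty"`
	MuteTimeIntervals []string `json:"mute_time_intervals,omitempty"`
}

// Validate enforces the only mandatory field (receiver) and checks that
// every duration that is present can be parsed.
// Assumption: durations use Go's time.ParseDuration syntax.
func (s NotificationSettings) Validate() error {
	if s.Receiver == "" {
		return errors.New("receiver is required")
	}
	durations := map[string]string{
		"group_wait":      s.GroupWait,
		"group_interval":  s.GroupInterval,
		"repeat_interval": s.RepeatInterval,
	}
	for name, value := range durations {
		if value == "" {
			continue // optional field left unset
		}
		if _, err := time.ParseDuration(value); err != nil {
			return fmt.Errorf("invalid %s: %w", name, err)
		}
	}
	return nil
}

// parseSettings decodes a JSON object and validates the result.
func parseSettings(raw string) (NotificationSettings, error) {
	var s NotificationSettings
	if err := json.Unmarshal([]byte(raw), &s); err != nil {
		return s, err
	}
	return s, s.Validate()
}

func main() {
	s, err := parseSettings(`{"receiver":"Test-Receiver","group_wait":"1s"}`)
	fmt.Println(s.Receiver, err)
}
```

A payload that contains only `receiver` passes this check, which matches the rule that every other field is optional.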

These settings are consumed by the scheduler and Alertmanager configurator.

However, the settings are handled differently depending on which API is used to submit them:

  • When a rule's notification settings are updated via the UI (or the "regular" API), the settings are applied synchronously. Therefore, it is guaranteed that as soon as the rule is evaluated, it will be handled by the specified receiver with the specified grouping settings and mute intervals.
  • When a rule is updated via the provisioning API (or the Terraform provider), the Alertmanager configuration is applied asynchronously via a timer (the default is every 60 seconds). Therefore, the rule may be evaluated and an alert created before the Alertmanager configuration is updated with the rule's settings. This is the trade-off of keeping the performance of the provisioning API intact. We will address this in a follow-up PR.

Alertmanager configuration

The PR changes how we apply the configuration to the Grafana Alertmanager. Currently, only the embedded Alertmanager can handle these settings. When the Alertmanager configuration is applied (via an API request or via the timer), a new route is added to the notification policies. The route is added at the top of the routing tree, as the first route after the root, as shown in the diagram below.

[Diagram: Untitled-2024-01-31-1557]

The auto-generated route can contain up to 3 levels:

  • The top-level route has only one matcher, __grafana_autogenerated__=true. All alerts created by rules that have notification settings are caught by this route.
  • The second-level routes have the matcher __grafana_receiver__=<contact_point_name>. These routes are always created, one for each existing contact point. NOTE: This is needed because rule scheduling and Alertmanager configuration are asynchronous processes, and we want to make sure that an alert is sent to a receiver even if the rule is evaluated before the new configuration is applied.
  • The third-level routes are optional and are created only if there are notification settings with optional fields defined. Such a route has the matcher __grafana_route_settings_hash__=<hash_of_optional_settings> and is created for each unique combination of optional settings. The custom group_by, group_wait, etc. settings are defined at this level.
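The three levels above can be sketched roughly as follows. This is a simplified illustration, not the code in this PR: the `Route` and `RuleSettings` types and `buildAutogenRoute` are hypothetical, matchers are reduced to a plain map, and placing each third-level route under its receiver's route is an assumption about the layout.

```go
package main

import (
	"fmt"
	"sort"
)

// Route is a simplified, hypothetical stand-in for an Alertmanager route.
type Route struct {
	Matchers  map[string]string
	Receiver  string
	GroupBy   []string
	GroupWait string
	Routes    []*Route
}

// RuleSettings is a hypothetical view of one rule's notification settings.
// Hash is empty when only the receiver is specified.
type RuleSettings struct {
	Receiver  string
	Hash      string
	GroupBy   []string
	GroupWait string
}

// buildAutogenRoute builds the auto-generated subtree:
// level 1 matches all auto-routed alerts, level 2 has one route per
// contact point, level 3 has one route per unique set of optional settings.
func buildAutogenRoute(contactPoints []string, rules []RuleSettings) *Route {
	top := &Route{Matchers: map[string]string{"__grafana_autogenerated__": "true"}}

	// Level 2: always one route per existing contact point (sorted for stable output).
	names := append([]string(nil), contactPoints...)
	sort.Strings(names)
	byReceiver := make(map[string]*Route, len(names))
	for _, name := range names {
		r := &Route{Matchers: map[string]string{"__grafana_receiver__": name}, Receiver: name}
		byReceiver[name] = r
		top.Routes = append(top.Routes, r)
	}

	// Level 3: only for settings that define optional fields, deduplicated by hash.
	seen := map[string]bool{}
	for _, rs := range rules {
		parent, ok := byReceiver[rs.Receiver]
		key := rs.Receiver + "/" + rs.Hash
		if !ok || rs.Hash == "" || seen[key] {
			continue
		}
		seen[key] = true
		parent.Routes = append(parent.Routes, &Route{
			Matchers:  map[string]string{"__grafana_route_settings_hash__": rs.Hash},
			Receiver:  rs.Receiver,
			GroupBy:   rs.GroupBy,
			GroupWait: rs.GroupWait,
		})
	}
	return top
}

func main() {
	auto := buildAutogenRoute(
		[]string{"slack", "email"},
		[]RuleSettings{{Receiver: "slack", Hash: "abc", GroupWait: "1s"}},
	)
	// The auto-generated route goes first under the root route.
	root := &Route{Receiver: "default"}
	root.Routes = append([]*Route{auto}, root.Routes...)
	fmt.Println(len(root.Routes), len(auto.Routes))
}
```

Because the second level exists for every contact point regardless of rule settings, an alert that arrives before its third-level route is configured still reaches the right receiver, only without the custom grouping.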

The auto-generated routes are visible only to administrators and cannot be updated by anyone except the Grafana server. We decided to show them to administrators for troubleshooting purposes.

Scheduler

When a rule with notification settings is evaluated and the state manager decides to create an alert from the evaluation results, the alert will contain 2 or 3 additional labels, depending on the notification settings:

  • __grafana_autogenerated__: true
  • __grafana_receiver__: <receiver_name>. The value is taken from the receiver field of the notification settings.
  • (optional) __grafana_route_settings_hash__: <hash_of_optional_settings>. The value is the fingerprint of all optional settings. If the notification settings specify only the receiver, this label is not created.
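A rough sketch of how these labels could be derived from a rule's settings is shown below. The `Settings` type and the `routingLabels` helper are hypothetical, and the fingerprint is an assumption for illustration (FNV-1a over a canonical string with sorted lists); the PR does not state here which hash function is used.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sort"
	"strings"
)

// Settings is a hypothetical view of a rule's notification settings.
type Settings struct {
	Receiver          string
	GroupBy           []string
	GroupWait         string
	GroupInterval     string
	RepeatInterval    string
	MuteTimeIntervals []string
}

// hasOptional reports whether any field other than Receiver is set.
func (s Settings) hasOptional() bool {
	return len(s.GroupBy) > 0 || s.GroupWait != "" || s.GroupInterval != "" ||
		s.RepeatInterval != "" || len(s.MuteTimeIntervals) > 0
}

// sortedCopy returns a sorted copy so that list order does not affect the hash.
func sortedCopy(in []string) []string {
	out := append([]string(nil), in...)
	sort.Strings(out)
	return out
}

// optionalHash fingerprints the optional fields only; the receiver is excluded.
// Assumption: FNV-1a is used purely for illustration.
func (s Settings) optionalHash() string {
	parts := []string{
		strings.Join(sortedCopy(s.GroupBy), ","),
		s.GroupWait,
		s.GroupInterval,
		s.RepeatInterval,
		strings.Join(sortedCopy(s.MuteTimeIntervals), ","),
	}
	h := fnv.New64a()
	h.Write([]byte(strings.Join(parts, "\x00")))
	return fmt.Sprintf("%016x", h.Sum64())
}

// routingLabels returns the 2 or 3 labels attached to alerts of the rule.
func routingLabels(s Settings) map[string]string {
	labels := map[string]string{
		"__grafana_autogenerated__": "true",
		"__grafana_receiver__":      s.Receiver,
	}
	if s.hasOptional() {
		labels["__grafana_route_settings_hash__"] = s.optionalHash()
	}
	return labels
}

func main() {
	fmt.Println(len(routingLabels(Settings{Receiver: "Test-Receiver"})))
	fmt.Println(len(routingLabels(Settings{Receiver: "Test-Receiver", GroupWait: "1s"})))
}
```

Keeping the receiver out of the fingerprint means two rules with identical optional settings but different receivers get the same hash label and differ only in `__grafana_receiver__`.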

State manager

When state is calculated, we merge 3 sets of labels into the state labels:

  1. System-owned: rule UID, folder title, etc. + 3 new labels that are introduced in this PR
  2. Rule labels that are defined by the user. API validation ensures that the user cannot specify those 3 reserved labels.
  3. Result labels that come from result evaluation.

The order above reflects the priority in case of a conflict. However, the auto-generated labels are optional and are not created if the rule does not have notification settings, so in that case merging would not override labels with the same names that come from the result. In other words, a user could create a query whose result carries labels that match the auto-generated labels, and exploit the routing. To prevent this, the state manager is updated to rename such result labels by appending the suffix _user. If the renamed label conflicts with an existing label in the result, the original label is removed and a warning message is written to the log.
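As an illustration of the renaming rule described above, here is a minimal sketch. The `sanitizeResultLabels` function and the `reservedLabels` list are hypothetical names; only the label names and the `_user` suffix come from this description.

```go
package main

import (
	"fmt"
	"log"
)

// reservedLabels are the three labels introduced in this PR.
var reservedLabels = []string{
	"__grafana_autogenerated__",
	"__grafana_receiver__",
	"__grafana_route_settings_hash__",
}

// sanitizeResultLabels returns a copy of the result labels in which every
// reserved label is renamed to <name>_user. If the renamed label already
// exists in the result, the reserved label is dropped and a warning is logged.
func sanitizeResultLabels(result map[string]string) map[string]string {
	out := make(map[string]string, len(result))
	for k, v := range result {
		out[k] = v
	}
	for _, name := range reservedLabels {
		value, ok := out[name]
		if !ok {
			continue
		}
		delete(out, name)
		renamed := name + "_user"
		if _, exists := out[renamed]; exists {
			log.Printf("warn: dropping result label %q because %q already exists", name, renamed)
			continue
		}
		out[renamed] = value
	}
	return out
}

func main() {
	fmt.Println(sanitizeResultLabels(map[string]string{
		"__grafana_receiver__": "someone-elses-contact-point",
		"job":                  "api",
	}))
}
```

After this step the result labels can no longer carry any of the reserved names, so they cannot steer an alert into the auto-generated routes.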

Why do we need this feature?

  1. This allows users to configure the entire alerting workflow for a specific rule in one place.
  2. Eliminates the necessity to maintain notification policies.
  3. Simplifies role-based access control: regular users do not need access to notification policies to be able to create an end-to-end flow for a rule.

Special notes for your reviewer:

Please check that:

  • It works as expected from a user's perspective.
  • If this is a pre-GA feature, it is behind a feature toggle.
  • The docs are updated, and if this is a notable improvement, it's added to our What's New doc.

@grafana-delivery-bot grafana-delivery-bot bot added this to the 10.4.x milestone Jan 22, 2024
@yuri-tceretian yuri-tceretian changed the title from "Yuri tceretian/simplified notificiations" to "Alerting: Support for simplified notification settings in rule API" Jan 22, 2024
@yuri-tceretian yuri-tceretian force-pushed the yuri-tceretian/simplified-notificiations branch from 927ee20 to b30ce42 Compare January 23, 2024 21:03
@JacobsonMT JacobsonMT force-pushed the yuri-tceretian/simplified-notificiations branch 3 times, most recently from f3ffbf6 to e02229b Compare February 1, 2024 06:40
JacobsonMT and others added 7 commits February 13, 2024 00:52
- Validation interfaces now all in notifier/validation.go
- Naming and type visibility cleaned up
- Store no longer needs knowledge of validator interfaces
This introduces a slightly larger chance for stale AM configs to be applied if
multiple rule updates race. This is because the gap between am config fetch and
apply is larger. Any drift will still be reconciled on next mam periodic sync.
This interface relies on field getters rather than methods to get a set of names.

Steps to do this:
- Create receiver interface + implement that interface for both receiver types (currently only requirement is GetName)
- Refactor receiver/mute time name methods to return slices of the structs - e.g. ReceiverNames -> GetReceivers
- Introduce generic constraint over existing apiAlertingConfig interface that the return value of GetReceivers should follow
- That's it, the compiler can infer the type for this generic constraint so call sites remain unchanged
@rwwiv rwwiv force-pushed the yuri-tceretian/simplified-notificiations branch from b8a55f6 to 657dcb7 Compare February 13, 2024 15:35
@JacobsonMT

/deploy-to-hg --enterprise-ref yuri-tceretian/simplified-notificiations

@ephemeral-instances-bot

  • Preparing your instance. A comment containing your instance's url will be added to this PR when the instance is ready.
  • Your instance will be ready in ~10 minutes.
  • Check the GitHub actions tab to follow the workflow progress
  • Slack channel: #proj-ephemeral-hg-instances
  • Building instance with yuri-tceretian/simplified-notificiations oss branch and yuri-tceretian/simplified-notificiations enterprise branch. How to choose a branch

@JacobsonMT JacobsonMT left a comment

LGTM, great effort and new feature! 🚀 🚀 🚀

@rwwiv rwwiv left a comment

LGTM! We got there, let's :shipit:

stevesg commented Jul 5, 2024

Raised an issue about the missing unit tests in this PR: #90115
