Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

docs: add proposal for grafana alerting #1349

Merged
merged 2 commits into from Jan 9, 2024
Merged

Conversation

theSuess
Copy link
Member

This picks up #911 and #1144. The proposal contains three different options for realizing alerting support in the operator and should serve as a base for discussion regarding this topic.

@theSuess
Copy link
Member Author

Personally, I'm in favor of option three. This would also allow us to implement this in multiple passes by implementing alert rules first, as the routing is label based anyway.

@NissesSenap
Copy link
Collaborator

Thanks for taking another shot at this @theSuess , here are my initial thoughts.

Trying to figure out what I feel about the suggestion, I will probably change my mind 3 times ;)
My main concern about reusing GrafanaFolder is that none will find Alerts. It's an awful name, and unless you read the docs of the grafana-opreator or know how the Grafana HTTP API works, you have no idea why it is like this.

I lean towards following how it's done in terraform and have a specific resource for Rule Groups, so doing option 1 but instead of calling itGrafanaAlertFolder I would call it GrafanaRuleGroup or GrafanaAlertGroup.
But then we of course have to ask our self, should we auto create a folder if it don't exist, just as we do for dashboards? Or should we instead start forcing people to use GrafanaFolder?

One of my main concern with this, due to the number of resources that point to each other is how we should connect them.
In a AlertRule should we point to a Datasource UID or a Datasource name? If we use a name we will have to lookup all datasource during every reconciliation and make sure the UID is still the same.
One issue with Datasource is that you can't specify UID in the HTTP request which makes using UID a bit painful.
This is not the same issue in Dashboards since you can create a variable that find Prometheus instances.

This is an easy problem to solve in terraform since you can just reference the Datasource resource in your RuleGroup resource, this is sadly not an easy thing to do in Kubernetes.
It can of course be solved using label selectors in Kubernetes, but my feeling is that it will be a bit painful for end users since it will be a few selectors.

The issue is the same in contact point and notification policy, but in contact point we can at least hard-code UID if we want to. For example, using the Kubernetes UID which we have as a backup solution for dashboards if it's not set.

How do you see GrafanaMuteTimings and GrafanaMessageTemplate? For me, I feel they are rather low priority, but we probably want to implement them in the future so we should at least consider how they should connect with the other resources.

@theSuess
Copy link
Member Author

Calling the resource GrafanaRuleGroup Sounds good to me. Creating a folder if it does not exist can certainly be done, not sure if it should be the default though.

Regarding references to the datasource, UIDs can be set during creation - might require some changes in the operator though.

image

I'm trying to get every API to support UIDs at creation on the grafana side 🤞

Mute timings and Notification templates are identified by name, we should be good here.

@NissesSenap
Copy link
Collaborator

That is great, I have been looking at the http API docs allot https://grafana.com/docs/grafana/latest/developers/http_api/data_source/, but thinking of it now, those are mostly examples and not the full schema 🤦

Getting API to support UID at creation would be awesome, Igor got an old issue around that for teams if I remember correctly.

How do you see the linking happening between the resources? Should we use UID? It's not the most user-friendly way of solving it. But it sure makes life easy for us ;)

Around performing changes to our existing APIs I'm all open for it. We can always release a v1beta2 CRD version if needed. But if we feel that we need to do that, it would be nice to explain that part in this document as well.

Also, to make this design document easier to read in the future, it would be nice with full examples on how we want the CRDs to look like. I think it will also make it clearer when we do the implementation so we have something to compare it to. Of course, we might change the design a bit during the implementation, but that we at least have a discussion point before doing so.

@theSuess
Copy link
Member Author

theSuess commented Dec 18, 2023

I think UID linking is the way to go here. We don't have many places where we actually need to link things.

I think this linking makes sense:

flowchart LR
    AlertGroup-- uid -->Folder
    AlertGroup-- label selector --> GrafanaInstance
    AlertGroup-- uid --> DataSource
    Folder-- label selector --> GrafanaInstance

    NotificationPolicy-- label selector --> GrafanaInstance
    NotificationPolicy-- uid --> ContactPoint
    ContactPoint-- label selector --> GrafanaInstance

Everything needs a label selector to decide to which grafana instance this applies to and cross-references are made using UIDs

@pb82
Copy link
Collaborator

pb82 commented Dec 19, 2023

@theSuess is the idea then to use predictable UIDs (chosen by the user) to make it easier to link the resources together?

@theSuess
Copy link
Member Author

Yes, that's the way I'd go about it

@theSuess
Copy link
Member Author

We'll also need to gather infos on how alert rule groups interact with subfolders

@theSuess
Copy link
Member Author

Updated the proposal with more examples and some more findings

Copy link
Collaborator

@NissesSenap NissesSenap left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added an implementation detail in a comment. Don't know if it needs to be added in the design document, though. But good to think about.

When implementing this feature it would be nice to use the new go-sdk for the grafana API as stataed in #1357.

All in all I think this looks great. Nice job @theSuess

docs/docs/proposals/002-alerting-support.md Show resolved Hide resolved
@NissesSenap
Copy link
Collaborator

Would love some feedback on this PR from the community.
So I will ping some old requestors
@apryor6 @R-Studio @siegenthalerroger @rammelmueller

@theSuess theSuess marked this pull request as ready for review January 9, 2024 12:47
This picks up #911 and #1144. The proposal contains three different options for
realizing alerting support in the operator and should serve as a base for
discussion regarding this topic.
@pb82 pb82 enabled auto-merge January 9, 2024 12:48
@pb82 pb82 merged commit 7df688e into master Jan 9, 2024
10 checks passed
@pb82 pb82 deleted the docs/alerting-proposal branch January 9, 2024 12:51
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants