Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: initial alerting support #1420

Merged
merged 26 commits into from
Mar 12, 2024
Merged

feat: initial alerting support #1420

merged 26 commits into from
Mar 12, 2024

Conversation

theSuess
Copy link
Member

This PR adds initial alerting support as described in the Alerting Support Proposal.

To keep changes small, this PR only adds the AlertRuleGroup custom resource.

This is also the first resource implemented using the new auto generated client (relevant for #1357)

@theSuess
Copy link
Member Author

To modernize the code & API a bit, I've also performed the following actions for the new CR:

  • Use kubernetes conditions instead of lastMessage and NoMatchingInstances
  • Use finalizers instead of relying on instance status fields for cleanup

If there are reasons why these paradigms were not adopted in the first place, please let me know and I'll get this in line with other resources

@theSuess
Copy link
Member Author

Waiting for grafana/grafana-openapi-client-go#75 to be merged before removing the draft status from the PR

@theSuess theSuess marked this pull request as ready for review February 16, 2024 08:20
@NissesSenap
Copy link
Collaborator

@theSuess I'm currently looking through everything, but it's not a small PR, so it takes some time ;).

RBAC

Here our CI is a bit lacking, but there is no magic to get new RBAC rules in to kustomize or helm. Sadly, you will have to copy that yaml manually.

Tests

As you have seen in the code we haven't written many tests, so I can definitely live with merging this PR without having a bunch of them, even though it of course would be ideal to add some. But it would be nice to at least add some e2e test for alerts. This way we have some kind of basic knowledge that it's working.
Take a look in https://github.com/grafana/grafana-operator/tree/master/tests/e2e/example-test for inspiration on how to set it up.

Comment

About conditions, I think we have some old issue about looking in to it, but I think we just never got that far. So I'm happy to see it used.

About finalizers, lets talk about it during our next meeting :)

@lsoica
Copy link

lsoica commented Feb 19, 2024

Will it be able to provision both grafana managed alerts and datasource managed alers ?

Copy link
Collaborator

@NissesSenap NissesSenap left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few minor comments, but in general I think it looks great.

controllers/client/grafana_client.go Outdated Show resolved Hide resolved
controllers/grafanaalertrulegroup_controller.go Outdated Show resolved Hide resolved
@NissesSenap
Copy link
Collaborator

I started to play around with the alertgroup and I managed to create the following issue.
To recreate, I ran sh hack/kind/start-kind.sh to create my kind cluster.
Then I added applied config/samples/grafana_v1beta1_grafanaalertrulegroup.yaml.

2024-02-19T15:28:43+01:00       ERROR   updating status {"controller": "grafanaalertrulegroup", "controllerGroup": "grafana.integreatly.org", "controllerKind": "GrafanaAlertRuleGroup", "GrafanaAlertRuleGroup": {"name":"grafanaalertrulegroup-sample","namespace":"default"}, "namespace": "default", "name": "grafanaalertrulegroup-sample", "reconcileID": "e60b64c1-74f8-4c9d-ac22-335839f2b149", "error": "Operation cannot be fulfilled on grafanaalertrulegroups.grafana.integreatly.org \"grafanaalertrulegroup-sample\": the object has been modified; please apply your changes to the latest version and try again"}
github.com/grafana/grafana-operator/v5/controllers.(*GrafanaAlertRuleGroupReconciler).Reconcile.func1
        /home/edvin/go/src/github.com/NissesSenap/grafana-operator/controllers/grafanaalertrulegroup_controller.go:102
github.com/grafana/grafana-operator/v5/controllers.(*GrafanaAlertRuleGroupReconciler).Reconcile
        /home/edvin/go/src/github.com/NissesSenap/grafana-operator/controllers/grafanaalertrulegroup_controller.go:158
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile
        /home/edvin/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.17.0/pkg/internal/controller/controller.go:119
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
        /home/edvin/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.17.0/pkg/internal/controller/controller.go:316
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
        /home/edvin/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.17.0/pkg/internal/controller/controller.go:266
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
        /home/edvin/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.17.0/pkg/internal/controller/controller.go:227

Initially I thought this was related to the error I saw In grafana about the folder missing, which isn't great and we probably need some kind of check for that as well.

logger=ngalert t=2024-02-19T14:27:00.00159844Z level=error msg="failed to fetch alert rule namespace" err="folder not found" uid=4843de5c-4f8a-4af0-9509-23526a04faf8 org=1 namespace_uid=f9b0a98d-2ed3-45a6-9521-18679c74d4f1

I don't have any more time to look into why this error happens but it could also be something around datasources, it seems like Alert actually want to try to reach the datasource instead of just saying that it's there like a dashboard can live with.

Another thing I noticed was that we should probably bump our default grafana instance to 10 before releasing this feature.
If you don't have version 10, there is no export function for alerts, which makes this feature more or less impossible to use.

Another issue I faced was that I couldn't create a folder with a specific UID. Which created issues when I wanted to create the alert using the operator.

If anyone else will be playing around with folders I recommend these two curl commands.

# List all folders
curl -X GET -H Accept:application/json -H Content-Type:application/json -H "Authorization: Bearer SOOO_LONG_TOKEN" http://localhost:3000/api/folders
# Create a folder with a specific UID
curl -X POST -H Accept:application/json -H Content-Type:application/json -H "Authorization: Bearer SOOO_LONG_TOKEN" http://localhost:3000/api/folders -d '{"uid":"f9b0a98d-2ed3-45a6-9521-18679c74d4f1", "title":"foo"}'

We might need to solve the folder creation similarly as we did in dashboards, to make sure we didn't get any strange errors. Or we need to update grafanaFolder to take UID as an option so we can set it there.

@theSuess
Copy link
Member Author

the object has been modified; please apply your changes to the latest version and try again

This happens when we try to update the status on a resource which has been changed by another client. It might be caused by adding the finalizer. I'll build a fix for this

Regarding folders: what's your take on adding a folderSelector field in the alertrulegroup? This would allow users to select a folder CR instead of having to find the UID themselves

Adding a UID to the folder spec would work as well. I don't think it makes sense to dynamically create folders for alert rule groups as multiple alert rule groups can be stored in the same folder which would require further matching logic

@theSuess theSuess force-pushed the feat/alerting branch 4 times, most recently from 8e28b22 to ce00ec2 Compare February 21, 2024 16:15
@theSuess
Copy link
Member Author

Alright, most issues ironed out.

We now support a folderSelector as well as folderUID (which overrides the selector)

The only issue is that there is no way to enforce at least one field must be set without an admission hook so currently the CRD just requires the folderSelector to be set.

I've also added an e2e test & fixed the instrumentedroundtripper

@NissesSenap
Copy link
Collaborator

@theSuess could an option be to use validation rules instead of an adminsion webhook?
It was in beta in 1.25 and GA in 1.29, so I think it's reasonable that we can use it now.
https://kubernetes.io/docs/tasks/extend-kubernetes/custom-resources/custom-resource-definitions/#validation-rules

@theSuess
Copy link
Member Author

@theSuess could an option be to use validation rules instead of an adminsion webhook? It was in beta in 1.25 and GA in 1.29, so I think it's reasonable that we can use it now. https://kubernetes.io/docs/tasks/extend-kubernetes/custom-resources/custom-resource-definitions/#validation-rules

oooh that's interesting - will see if I can get this working with kubebuilder

@theSuess
Copy link
Member Author

Validation rules seem to work! Be careful with them though, if you get them wrong while having existing resources, it can be hard to get rid of them

@theSuess
Copy link
Member Author

Will it be able to provision both grafana managed alerts and datasource managed alers ?

@lsoica this will only support grafana managed alerts. Datasource managed alerts are much more complex and would require a direct connection to the datasource which is out of scope for the operator.

@theSuess
Copy link
Member Author

Regarding the folder selector: maybe we can simplify this to just be a folderRef? Selectors only make sense if it's possible to match multiple resources, which does not make sense for targeting a single folder.

Thoughts?

@NissesSenap
Copy link
Collaborator

Regarding the folder selector: maybe we can simplify this to just be a folderRef? Selectors only make sense if it's possible to match multiple resources, which does not make sense for targeting a single folder.

Thoughts?

Agree, it sounds very resonable.

@theSuess
Copy link
Member Author

Switched over to folderRef in tests & examples now. All open topics should thus be resolved 🤞

Copy link
Collaborator

@NissesSenap NissesSenap left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit

return ctrl.Result{}, nil
}
controllerLog.Error(err, "error getting grafana folder cr")
return ctrl.Result{RequeueAfter: RequeueDelay}, err
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Started reading the finalizer docs https://book.kubebuilder.io/reference/using-finalizers
I wonder if we shoulden't use return ctrl.Result{}, client.IgnoreNotFound(err) instead.

@NissesSenap
Copy link
Collaborator

I took the freedom to change a few labels and run some make commands to be more consistent with the current setup.
Added a few minor comments, but I think this looks great. I can't wait for the feature, my self.
Great job @theSuess !

Copy link

@HOCKNAS HOCKNAS left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @theSuess @NissesSenap ,

Wow, just checked out your PR for alerting in Grafana Operator – absolutely killer work! 🚀 You guys really knocked it out of the park. Integrating alert management like this is a game-changer for us Kubernetes folks. Mad props for the effort and genius you poured into this. Can't wait to see what you hack together next!

Cheers,

@pb82
Copy link
Collaborator

pb82 commented Mar 8, 2024

Currently testing this using the provided example. I saw that the alerting rule group got applied:

  status:
    conditions:
    - lastTransitionTime: "2024-03-08T10:16:02Z"
      message: Alert Rule Group was successfully applied to 1 instances
      observedGeneration: 3
      reason: ApplySuccesfull
      status: "True"
      type: AlertGroupSynchronized

however, in Grafana I don't see it:

grafik

Am I missing something?

Copy link
Collaborator

@NissesSenap NissesSenap left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, great job

@theSuess theSuess merged commit 68801e9 into master Mar 12, 2024
10 checks passed
@theSuess theSuess deleted the feat/alerting branch March 12, 2024 14:42
@theSuess
Copy link
Member Author

I've tried to implement a safeguard for alerting on older Grafana versions but was blocked by not being able to infer the version of the instance. Tracking this in #1451

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants