Add PrometheusRule and basic alert. Fixes #3342 #3368

PaulusTM · 2020-10-10T20:37:24Z

This PR adds a PrometheusRule with a basic alert that detects whether cert-manager is running.
Let's discuss whether more alerts should be included.

Note: this is my first PR to cert-manager, so please let me know if I missed any important steps.

Added PrometheusRule to Helm charts

jetstack-bot · 2020-10-10T20:37:39Z

Hi @PaulusTM. Thanks for your PR.

I'm waiting for a jetstack or cert-manager member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

deploy/charts/cert-manager/values.yaml

meyskens · 2020-10-13T08:43:44Z

deploy/charts/cert-manager/templates/prometheusrule.yaml

```diff
+    app.kubernetes.io/name: {{ include "cert-manager.name" . }}
+    app.kubernetes.io/instance: {{ .Release.Name }}
+    app.kubernetes.io/managed-by: {{ .Release.Service }}
+    app.kubernetes.io/component: "controller"
```

This label can be removed, as it applies to the Deployment.

Only the app.kubernetes.io/component label, or should all labels be removed?

I think only the app.kubernetes.io/component label should be removed.

deploy/charts/cert-manager/values.yaml

meyskens · 2020-10-13T08:49:18Z

Looks good to me! Just a few nits about naming. I wouldn't yet say that it fully solves #3342, but it is a great start!

/ok-to-test

munnerz · 2020-10-13T11:15:04Z

deploy/charts/cert-manager/values.yaml

```diff
+    enabled: false
+    labels: {}
+    rules:
+    - alert: CertManagerAbsent
```

It seems weird to me that we are basically embedding the PrometheusRule spec here. Why would users choose to embed their Alertmanager alerts in our values.yaml, which has no schema and gives them less control over the resulting resource? To me, at least, it would make more sense to have your own standalone chart or set of YAML for this. It doesn't seem like this provides a meaningful or useful abstraction to users...

As a user, I expect the cert-manager Helm chart to provide me with a basic set of alerts for the application (behind a feature flag, so I can enable or disable them). This is a good starting point for such a set of alerts.

It can also serve as a reference for alerts that make sense for cert-manager, which I can reuse in my Prometheus Operator setup.

Sure, but I don't think exposing the entirety of the PrometheusRule spec is desirable, as this basically makes the cert-manager Helm chart a 'deployment tool' for PrometheusRule resources. I agree that it's great if we can provide some out of the box configuration to guide people & to get them started though.

Instead of exposing all this in values.yaml, can we define this as a resource in the templates/ directory and gate it behind a simple boolean, prometheus.alerts.enabled: true/false (defaulting to false, as a lot of users won't have the PrometheusRule CRD installed)?

I think that would make this a lot easier to accept. And if in future there is demand for users to configure their own custom alerts via our Helm chart, we are not 'boxing ourselves in': it would still be possible to add prometheus.alerts.customRules or something similar.
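A minimal sketch of this suggestion, assuming the `cert-manager.fullname` and `cert-manager.name` helpers from the chart's `_helpers.tpl`, and a Prometheus Operator setup that scrapes cert-manager through a ServiceMonitor (which adds `namespace` and `service` labels to the `up` series). `prometheus.alerts.enabled` and `prometheus.alerts.customRules` are the names floated in the comment, not existing chart values; the 10-minute window and `critical` severity are illustrative choices.

```yaml
# values.yaml (sketch): one boolean, off by default because many clusters
# do not have the PrometheusRule CRD installed.
prometheus:
  alerts:
    enabled: false
    # Hypothetical escape hatch: extra rules appended after the built-in ones.
    customRules: []

{{- /* templates/prometheusrule.yaml (sketch) */}}
{{- if .Values.prometheus.alerts.enabled }}
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: {{ include "cert-manager.fullname" . }}
  namespace: {{ .Release.Namespace | quote }}
  labels:
    app.kubernetes.io/name: {{ include "cert-manager.name" . }}
    app.kubernetes.io/instance: {{ .Release.Name }}
    app.kubernetes.io/managed-by: {{ .Release.Service }}
spec:
  groups:
  - name: cert-manager
    rules:
    - alert: CertManagerAbsent
      # Fires when no target for this release's Service has been scraped
      # successfully (up == 1) for 10 minutes; scoping by namespace and
      # service keeps another cert-manager installation from masking it.
      expr: absent(up{namespace="{{ .Release.Namespace }}", service="{{ include "cert-manager.fullname" . }}"} == 1)
      for: 10m
      labels:
        severity: critical
    {{- with .Values.prometheus.alerts.customRules }}
    {{- toYaml . | nindent 4 }}
    {{- end }}
{{- end }}
```

Keeping the built-in rules in `templates/` means the chart owns their shape, while `customRules`, if it is ever added, only appends entries to the rule list instead of exposing the whole PrometheusRule spec.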

Usually, charts ship default rules (with prometheusRule.enabled: false by default) that use specific selectors and templating so that they are NOT triggered by another cert-manager installation, while the whole PrometheusRule value can still be supplied. Of course, the chart itself only validates this weakly, but the Prometheus Operator's validating webhook can do the validation anyway.

See the Bitnami Redis chart for an example: https://github.com/bitnami/charts/blob/master/bitnami/redis/values.yaml#L1222
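The values-driven pattern described here could be sketched as follows. Default rules ship in values.yaml, disabled by default, and the template renders them through Helm's `tpl` function, so a value can reference the release. Here that is used to scope the selector to this installation's Service. The `prometheusRule.*` keys and the rule name are hypothetical. The `service` label assumes scraping via a Prometheus Operator ServiceMonitor, and `cert-manager.fullname` is assumed to be the chart's helper.

```yaml
# values.yaml (sketch)
prometheusRule:
  enabled: false
  rules:
  - alert: CertManagerCertificateNotReady
    # Rendered by tpl at install time, so only series scraped from this
    # release's Service match, not those of another cert-manager.
    expr: certmanager_certificate_ready_status{service="{{ include "cert-manager.fullname" . }}", condition="False"} == 1
    labels:
      severity: critical
    annotations:
      # Prometheus' own template syntax is escaped so that tpl leaves it
      # for Prometheus to evaluate at alert time.
      description: 'Certificate {{`{{ $labels.name }}`}} is not ready.'

{{- /* templates/prometheusrule.yaml (sketch) */}}
{{- if .Values.prometheusRule.enabled }}
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: {{ include "cert-manager.fullname" . }}
  namespace: {{ .Release.Namespace | quote }}
spec:
  groups:
  - name: cert-manager
    rules:
    {{- tpl (toYaml .Values.prometheusRule.rules) . | nindent 4 }}
{{- end }}
```

The trade-off is the one raised in this thread. Users get full control over the rules, but the chart can only validate them weakly, and any Prometheus template syntax inside the values has to be escaped because `tpl` renders it first.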

It would be very useful to have such default alerts, so that alerting works out of the box. Today, everybody has to re-implement them.

Default alerts, such as one for Certificates that are not ready (certmanager_certificate_ready_status{condition!="True"} > 0 for more than 30 minutes?), would be helpful.
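Here is a sketch of that suggested default alert as a single rule entry. It is written for a PrometheusRule applied directly, not through Helm; Helm would need the Prometheus template escaped. It assumes, as the expression above implies, that `certmanager_certificate_ready_status` is 1 for the Certificate's current `Ready` condition and 0 for the others. The alert name, `warning` severity, and summary text are illustrative.

```yaml
# One entry for spec.groups[].rules in a PrometheusRule (sketch).
- alert: CertManagerCertificateNotReady
  # condition!="True" matches both "False" and "Unknown", and > 0 keeps only
  # the series for the condition the Certificate is actually in.
  expr: certmanager_certificate_ready_status{condition!="True"} > 0
  # Only fire after 30 minutes, giving an in-progress issuance or renewal
  # time to complete before anyone is alerted.
  for: 30m
  labels:
    severity: warning
  annotations:
    summary: 'Certificate {{ $labels.name }} has not been ready for 30 minutes'
```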

I understand your remarks @munnerz. I'll update the PR to reflect your idea.

muffl0n · 2021-10-22T07:01:47Z

I added this PrometheusRule to our clusters to check for Certificates that can't be issued (in state READY = False):

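```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: cert-manager
spec:
  groups:
  - name: cert-manager
    rules:
    - alert: CertManagerCertificateReadyStatus
      annotations:
        # The backtick wrapping below escapes the Prometheus template from Helm,
        # so this manifest is meant to be rendered through a Helm chart. If you
        # apply it directly with kubectl, drop the outer Helm escaping.
        description: 'Certificate for "{{`{{ $labels.name }}`}}" is not ready.'
        summary: Certificate is not ready
      expr: certmanager_certificate_ready_status{condition="False"} == 1
      labels:
        severity: critical
```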

Would love to see this PR getting on the road again! :)

Signed-off-by: Daniel Paulus <d.paulus@gmail.com>

jetstack-bot · 2021-12-31T11:54:36Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: PaulusTM
To complete the pull request process, please assign munnerz after the PR has been reviewed.
You can assign the PR to them by writing /assign @munnerz in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

deploy/charts/cert-manager/OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

jetstack-bot · 2021-12-31T12:06:38Z

@PaulusTM: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Rerun command |
| --- | --- | --- |
| pull-cert-manager-e2e-v1-20 | `789fbc5` | `/test pull-cert-manager-e2e-v1-20` |
| pull-cert-manager-chart | `eae4239` | `/test pull-cert-manager-chart` |
| pull-cert-manager-bazel | `eae4239` | `/test pull-cert-manager-bazel` |
| pull-cert-manager-e2e-v1-22 | `eae4239` | `/test pull-cert-manager-e2e-v1-22` |
| pull-cert-manager-upgrade | `eae4239` | `/test pull-cert-manager-upgrade` |

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

jetstack-bot · 2022-01-04T14:26:10Z

@PaulusTM: PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

jetstack-bot · 2022-04-04T16:53:39Z

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close.
Send feedback to jetstack.
/lifecycle stale

jetstack-bot · 2022-05-04T17:09:38Z

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
If this issue is safe to close now please do so with /close.
Send feedback to jetstack.
/lifecycle rotten
/remove-lifecycle stale

jetstack-bot · 2022-06-03T18:09:37Z

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.
Send feedback to jetstack.
/close

jetstack-bot · 2022-06-03T18:09:40Z

@jetstack-bot: Closed this PR.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.
Send feedback to jetstack.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

sebastiangaiser · 2022-06-12T13:41:06Z

/reopen

jetstack-bot · 2022-06-12T13:41:07Z

@sebastiangaiser: You can't reopen an issue/PR unless you authored it or you are a collaborator.

In response to this:

/reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

jetstack-bot added the size/M label (denotes a PR that changes 30-99 lines, ignoring generated files) Oct 10, 2020

jetstack-bot requested a review from munnerz October 10, 2020 20:37

PaulusTM mentioned this pull request Oct 10, 2020

Add PrometheusRule #3342

Closed

meyskens reviewed Oct 13, 2020

deploy/charts/cert-manager/values.yaml (resolved)

meyskens reviewed Oct 13, 2020

deploy/charts/cert-manager/values.yaml (outdated, resolved)

meyskens reviewed Oct 13, 2020

deploy/charts/cert-manager/values.yaml (outdated, resolved)

jetstack-bot added the ok-to-test label and removed the needs-ok-to-test label (indicates a PR that requires an org member to verify it is safe to test) Oct 13, 2020

munnerz reviewed Oct 13, 2020

jetstack-bot added the dco-signoff: no label (indicates that at least one commit in this pull request is missing the DCO sign-off message) and removed the dco-signoff: yes label Oct 13, 2020

PaulusTM force-pushed the prometheus-rule branch 2 times, most recently from 81a2891 to 51e3e86 Compare October 14, 2020 09:24

jetstack-bot added the dco-signoff: yes label (indicates that all commits in the pull request have the valid DCO sign-off message) and removed the dco-signoff: no label Oct 14, 2020

jetstack-bot added the needs-rebase label (indicates a PR cannot be merged because it has merge conflicts with HEAD) Dec 11, 2021

PaulusTM added 2 commits December 31, 2021 12:53

resolve merge conflict

b29d720

Signed-off-by: Daniel Paulus <d.paulus@gmail.com>

fix merge conflict

eae4239

Signed-off-by: Daniel Paulus <d.paulus@gmail.com>

PaulusTM force-pushed the prometheus-rule branch from 789fbc5 to eae4239 Compare December 31, 2021 11:54

jetstack-bot removed the needs-rebase label Dec 31, 2021

jetstack-bot added the needs-rebase label Jan 4, 2022

davidcollom requested review from davidcollom and removed request for davidcollom January 4, 2022 16:01

maelvls mentioned this pull request Mar 6, 2022

Retry on conflict for the end-to-end test "CA Injector for api services should update data when the certificate changes" #4925

Merged

jetstack-bot added the lifecycle/stale label (denotes an issue or PR that has remained open with no activity and has become stale) Apr 4, 2022

jetstack-bot added the lifecycle/rotten label (denotes an issue or PR that has aged beyond stale and will be auto-closed) and removed the lifecycle/stale label May 4, 2022

jetstack-bot closed this Jun 3, 2022
