Add PrometheusRule and basic alert. Fixes #3342 #3368

Closed
wants to merge 2 commits into from

@PaulusTM PaulusTM commented Oct 10, 2020

This PR adds a PrometheusRule with a basic alert to detect if Cert-Manager is running.
Let's discuss whether more alerts should be included.

Note: This is my first PR to cert-manager, please let me know if I missed any important steps.

Added PrometheusRule to Helm charts

@jetstack-bot jetstack-bot added the dco-signoff: yes, do-not-merge/release-note-label-needed, needs-kind, and needs-ok-to-test labels Oct 10, 2020
@jetstack-bot
Contributor

Hi @PaulusTM. Thanks for your PR.

I'm waiting for a jetstack or cert-manager member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@jetstack-bot jetstack-bot added the size/M label Oct 10, 2020
@jetstack-bot jetstack-bot added the area/deploy and release-note labels and removed the do-not-merge/release-note-label-needed label Oct 10, 2020
@PaulusTM PaulusTM mentioned this pull request Oct 10, 2020
app.kubernetes.io/name: {{ include "cert-manager.name" . }}
app.kubernetes.io/instance: {{ .Release.Name }}
app.kubernetes.io/managed-by: {{ .Release.Service }}
app.kubernetes.io/component: "controller"
Contributor

This label can be removed, as it applies to the deployment.

Author

Should only the app.kubernetes.io/component label be removed, or all labels?

I think only the app.kubernetes.io/component label should be removed.

@meyskens
Contributor

Looks good to me! Just a few nits about naming. I wouldn't yet say that it fully solves #3342, but it is a great start!

/ok-to-test

@jetstack-bot jetstack-bot added the ok-to-test label and removed the needs-ok-to-test label Oct 13, 2020
enabled: false
labels: {}
rules:
- alert: CertManagerAbsent
Member

It seems odd to me that we are basically embedding the PrometheusRule spec here. Why would a user choose to embed their Alertmanager alerts in our values.yaml, where they get no schema and less control over the resulting resource? To me at least, it would make more sense to have your own standalone chart or set of YAML for this. This doesn't seem to provide a meaningful or useful abstraction to users.

Author

@PaulusTM PaulusTM Oct 13, 2020

As a user, I expect the cert-manager Helm chart to provide basic alerts grouped by application (behind a feature flag, so I can enable or disable them). This is a good starting point for a basic set of alerts.

It can also be used to find alerts that make sense for cert-manager and that I can use in my Prometheus Operator setup.

Member

Sure, but I don't think exposing the entirety of the PrometheusRule spec is desirable, as this basically makes the cert-manager Helm chart a 'deployment tool' for PrometheusRule resources. I agree that it's great if we can provide some out of the box configuration to guide people & to get them started though.

Instead of exposing all this in values.yaml, can we instead define this as a resource in the templates/ directory and gate it behind a simple boolean prometheus.alerts.enabled: true/false? (with a default of false as a lot of users won't have PrometheusRule as a CRD installed?)

I think that'd make this a lot more palatable to accept, and if in future there is demand for users being able to configure their own custom alerts via our Helm chart, we are not 'boxing ourselves in' and it'd be possible to still add prometheus.alerts.customRules or something?
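For illustration, the suggestion above could look roughly like the following template. This is only a sketch under stated assumptions, not the content of this PR: the file name `templates/prometheusrule.yaml`, the `job="cert-manager"` selector, the `for` duration, and the severity are all hypothetical, and it assumes `values.yaml` defines `prometheus.alerts.enabled: false` as the default.

```yaml
{{- /* templates/prometheusrule.yaml (hypothetical file name) */ -}}
{{- /* Rendered only when prometheus.alerts.enabled is true, so clusters */ -}}
{{- /* without the PrometheusRule CRD are unaffected by default. */ -}}
{{- if .Values.prometheus.alerts.enabled }}
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: {{ include "cert-manager.name" . }}
  labels:
    app.kubernetes.io/name: {{ include "cert-manager.name" . }}
    app.kubernetes.io/instance: {{ .Release.Name }}
    app.kubernetes.io/managed-by: {{ .Release.Service }}
spec:
  groups:
    - name: cert-manager
      rules:
        - alert: CertManagerAbsent
          # Assumption: the scrape job is named "cert-manager".
          # Adjust the selector to match your own scrape configuration.
          expr: absent(up{job="cert-manager"})
          for: 10m
          labels:
            severity: critical
          annotations:
            summary: cert-manager has disappeared from Prometheus target discovery
{{- end }}
```

With a layout like this, a user would opt in with `--set prometheus.alerts.enabled=true`, and a later `prometheus.alerts.customRules` value could be added without changing the existing flag.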

@desaintmartin desaintmartin Sep 14, 2021

The usual pattern is to ship default rules (with prometheusRule.enabled: false by default) that use specific selectors and templating so that they are NOT triggered by another cert-manager, while the whole PrometheusRule value is also exposed. Of course, validation is weak from the chart's point of view, but the Prometheus Operator validating webhook can do the validation anyway.

See bitnami for example: https://github.com/bitnami/charts/blob/master/bitnami/redis/values.yaml#L1222

It would be very useful to have such default alerts so that alerting works out of the box. Today, everybody has to re-implement them.

Default alerts, such as one for Certificates that are not ready (certmanager_certificate_ready_status{condition!="True"} > 0 for more than 30 minutes?), would be helpful.
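As a rough sketch of the not-ready alert described above, written as a standalone manifest rather than a chart template: the resource name, alert name, and severity are hypothetical, and the `name` label is assumed to be present on the metric.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: cert-manager-certificates   # hypothetical name
spec:
  groups:
    - name: cert-manager
      rules:
        - alert: CertManagerCertificateNotReady   # hypothetical alert name
          # Matches any Certificate whose Ready condition is reported as
          # something other than "True".
          expr: certmanager_certificate_ready_status{condition!="True"} > 0
          # Fire only after the condition has held for 30 minutes.
          for: 30m
          labels:
            severity: warning   # assumption; choose what fits your routing
          annotations:
            summary: Certificate is not ready
            description: 'Certificate "{{ $labels.name }}" has not been ready for 30 minutes.'
```

To keep the rule from firing for a second cert-manager installation, the expression would also need a matcher that identifies one instance (for example on the scrape job), which depends on how the targets are labelled in a given setup. If the rule is shipped inside a Helm template instead, the `{{ $labels.name }}` placeholder has to be escaped so that Helm does not try to evaluate it.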

Author

I understand your remarks @munnerz. I'll update the PR to reflect your idea.

@jetstack-bot jetstack-bot added the dco-signoff: no label and removed the dco-signoff: yes label Oct 13, 2020
@PaulusTM PaulusTM force-pushed the prometheus-rule branch 2 times, most recently from 81a2891 to 51e3e86 Compare October 14, 2020 09:24
@jetstack-bot jetstack-bot added the dco-signoff: yes label and removed the dco-signoff: no label Oct 14, 2020
@muffl0n

muffl0n commented Oct 22, 2021

I added this PrometheusRule to our clusters to check for Certificates that can't be issued (in state READY = False):

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: cert-manager
spec:
  groups:
  - name: cert-manager
    rules:
    - alert: CertManagerCertificateReadyStatus
      annotations:
        description: 'Certificate for "{{`{{ $labels.name }}`}}" is not ready.'
        summary: Certificate is not ready
      expr: certmanager_certificate_ready_status{condition="False"} == 1
      labels:
        severity: critical

Would love to see this PR getting on the road again! :)

@jetstack-bot jetstack-bot added the needs-rebase label Dec 11, 2021
Signed-off-by: Daniel Paulus <d.paulus@gmail.com>
Signed-off-by: Daniel Paulus <d.paulus@gmail.com>
@jetstack-bot jetstack-bot removed the needs-rebase label Dec 31, 2021
@jetstack-bot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: PaulusTM
To complete the pull request process, please assign munnerz after the PR has been reviewed.
You can assign the PR to them by writing /assign @munnerz in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@jetstack-bot
Contributor

@PaulusTM: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Rerun command |
| --- | --- | --- | --- |
| pull-cert-manager-e2e-v1-20 | 789fbc5 | link | /test pull-cert-manager-e2e-v1-20 |
| pull-cert-manager-chart | eae4239 | link | /test pull-cert-manager-chart |
| pull-cert-manager-bazel | eae4239 | link | /test pull-cert-manager-bazel |
| pull-cert-manager-e2e-v1-22 | eae4239 | link | /test pull-cert-manager-e2e-v1-22 |
| pull-cert-manager-upgrade | eae4239 | link | /test pull-cert-manager-upgrade |

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

@jetstack-bot jetstack-bot added the needs-rebase label Jan 4, 2022
@jetstack-bot
Contributor

@PaulusTM: PR needs rebase.

@jetstack-bot
Contributor

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close.
Send feedback to jetstack.
/lifecycle stale

@jetstack-bot jetstack-bot added the lifecycle/stale label Apr 4, 2022
@jetstack-bot
Contributor

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
If this issue is safe to close now please do so with /close.
Send feedback to jetstack.
/lifecycle rotten
/remove-lifecycle stale

@jetstack-bot jetstack-bot added the lifecycle/rotten label and removed the lifecycle/stale label May 4, 2022
@jetstack-bot
Contributor

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.
Send feedback to jetstack.
/close

@jetstack-bot
Contributor

@jetstack-bot: Closed this PR.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.
Send feedback to jetstack.
/close

@sebastiangaiser

/reopen

@jetstack-bot
Contributor

@sebastiangaiser: You can't reopen an issue/PR unless you authored it or you are a collaborator.

In response to this:

/reopen

Labels
area/deploy Indicates a PR modifies deployment configuration dco-signoff: yes Indicates that all commits in the pull request have the valid DCO sign-off message. lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. needs-kind Indicates a PR lacks a `kind/foo` label and requires one. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. ok-to-test release-note Denotes a PR that will be considered when it comes time to generate release notes. size/M Denotes a PR that changes 30-99 lines, ignoring generated files.
8 participants