Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[MON-2358] add alert PrometheusOperatorRejectedResources #126

Conversation

raptorsun
Copy link
Contributor

This PR add runbook for the alert PrometheusOperatorRejectedResources.
I have different opinion in the mitigation sections than the runbook of Prometheus Operator, which suggests the CR does not conform CRD schema. I find if the CR does not conform the schema, kubelet API server will reject before Prometheus Operator does. The error should come from the value of certain fields instead.

@openshift-ci openshift-ci bot requested review from jan--f and slashpai July 3, 2023 21:18
@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jul 3, 2023
@raptorsun raptorsun force-pushed the alert/PrometheusOperatorRejectedResources branch from 74aae14 to 9f677a9 Compare July 4, 2023 15:58
Comment on lines 67 to 64
The custom resource itself should already conform CRD schema, elsewise the API server
will reject it before the Prometheus Operator does.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not sure that it's worth mentioning it.

Copy link
Contributor Author

@raptorsun raptorsun Jul 10, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suggest we tell users not to bother with the schema, the problem comes from value, because user may not know that schema is validated before writing the resource to database.

The custom resource itself should already conform CRD schema, elsewise the API server
will reject it before the Prometheus Operator does.

The error often comes from the value of certain fields that the operator cannot
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would write down a list of possible causes and how to remediate

  • ServiceMonitor or PodMonitor referencing a file in the filesystem => use secret/configmap key reference
  • Missing secret or configmap key reference => verify that the secret/key exists
  • Invalid relabeling configuration => fix the configuration
  • ...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a list of possible causes, not exhaustive though.

## Diagnosis

Find out the type of custom resource and its namespace in the `description` field
of the alert `PrometheusOperatorRejectedResources`.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIUC namespace is either openshift-monitoring or openshift-user-workload-monitoring. In the first case, there's nothing much to do except filing a support case (e.g. it's a product bug). In the second case, the user deploying the resource should fix something on their end. This can be explicitly mentioned in the runbook.

Copy link
Contributor Author

@raptorsun raptorsun Jul 10, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yup, I will add "If the namespace is named openshift-.*, for example, openshift-monitoring,
this may be a bug in Openshift, please file a support case."

@raptorsun raptorsun force-pushed the alert/PrometheusOperatorRejectedResources branch from 9f677a9 to 6b98c9d Compare July 16, 2023 22:46
@raptorsun raptorsun force-pushed the alert/PrometheusOperatorRejectedResources branch from 6b98c9d to fe426bc Compare July 17, 2023 16:41
@raptorsun
Copy link
Contributor Author

Thank you for the review @bburt-rh :) The runbook has been updated accordingly.

@jan--f
Copy link
Contributor

jan--f commented Jul 18, 2023

/lgtm
but I'll add a
/hold
for @bburt-rh to give his final approval. Feel free to unhold after.

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jul 18, 2023
@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Jul 18, 2023
@raptorsun raptorsun force-pushed the alert/PrometheusOperatorRejectedResources branch from 3076ee3 to 274fe97 Compare August 4, 2023 16:45
@raptorsun
Copy link
Contributor Author

Thank you for the review @bburt-rh :)
The runbook has been updated accordingly.

a link, followed by the resource type.

After you know the namespace and the resource type, you can search for a recently
modified custom resource of that type in that namespace.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This paragraph is a bit unclear to me. The namespace label can be either openshift-monitoring or openshift-user-workload-monitoring but the faulty resource might live elsewhere.

Copy link
Contributor

@bburt-rh bburt-rh Aug 16, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@simonpasquier - WDYT of adding the following sentence:

"However, note that the namespace label in the description can be either openshift-monitoring or openshift-user-workload-monitoring, but the faulty resource might still be located elsewhere."

deployment of Prometheus Operator.
Possible causes include the following:

- Violation of file system access rules through a bearer token file or
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we rephrase it to make it more clear about what isn't allowed? E.g. "A ServiceMonitor or PodMonitor object references a file to use as bearer token or TLS file which isn't allowed for user-defined monitoring. Instead it is necessary to create a secret holding the credential data in the same namespace as the ServiceMonitor or PodMonitor object and use a secret key reference in the ServiceMonitor or PodMonitor."

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Slightly edited version of @simonpasquier's suggested wording:

"A violation can occur when a ServiceMonitor or PodMonitor object references a file to use as a bearer token or references a TLS file. These configurations are not allowed in user-defined monitoring. Instead, in user-defined monitoring, you must create a secret that contains the credential data in the same namespace as the ServiceMonitor or PodMonitor object and use a secret key reference in the ServiceMonitor or PodMonitor configuration."

WDYT?


- Violation of file system access rules through a bearer token file or
a TLS certification file.
- **AlertManager**
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd list by monitor resources:

  • PodMonitor and ServiceMonitor
    • ...
  • AlertmanagerConfig
    • ...
  • PrometheusRules
    • ...

@raptorsun raptorsun force-pushed the alert/PrometheusOperatorRejectedResources branch from bc1c6f6 to 40a627a Compare September 4, 2023 17:29
@raptorsun
Copy link
Contributor Author

Thanks @bburt-rh @simonpasquier for the review.
Runbook is updated according to latest comments, please have a look :)

### Diagnose using the Openshift web console

1. Browse to **Observe** -> **Alerting**.
2. If the **Filter** for **Alert State** is not set to **Firing**,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the page should always filter by firing alerts -> I'd remove this step

a link, followed by the resource type.

Knowing the namespace and the resource type, you can search for recently modified
custom resources of that type in that namespace.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that it's confusing because the namespace isn't the namespace where the faulty resource lives.

3. Search for the `PrometheusOperatorRejectedResources` alert.
4. Click the alert to view its details.
5. Scroll down and view the **Description** field. The namespace is displayed as
a link, followed by the resource type.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd rather tell to look at the labels directly.

located elsewhere.

**Note**: If the namespace is named `openshift-monitoring`,
this might be a bug in OpenShift. Please file a support case.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd make the difference between openshift-monitoring and openshift-user-workload-monitoring more explicit.

The course of action depends on the value of the `namespace` label:
* When the value is `openshift-monitoring`, this is an issue with the platform monitoring stack. Please submit request to the support.
* When the value is `openshift-user-workload-monitoring`, this is an issue with a user-defined monitoring resource.

deployment of Prometheus Operator.
Possible causes include the following:

- A Violation of file system access rules can occur when a `ServiceMonitor` or `PodMonitor`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should be moved with the other ServiceMonitor/PodMonitor causes?

@raptorsun raptorsun force-pushed the alert/PrometheusOperatorRejectedResources branch from 40a627a to 6f4af8b Compare September 5, 2023 23:28
@raptorsun
Copy link
Contributor Author

/unhold

@openshift-ci openshift-ci bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Sep 13, 2023
Copy link
Contributor

@simonpasquier simonpasquier left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One small nit otherwise lgtm.


## Diagnosis

Find the custom resource type and the namespace in the `description` label
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
Find the custom resource type and the namespace in the `description` label
Identify the custom resource type and the namespace from the `resource` and `namespace` labels

@raptorsun raptorsun force-pushed the alert/PrometheusOperatorRejectedResources branch from 6f4af8b to 9c2319f Compare September 19, 2023 17:23
@raptorsun
Copy link
Contributor Author

Thank you for the review, @bburt-rh @simonpasquier :D
The runbook is updated accoridngly.

@openshift-bot
Copy link

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

@openshift-ci openshift-ci bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Dec 19, 2023
@jan--f
Copy link
Contributor

jan--f commented Dec 19, 2023

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Dec 19, 2023
Copy link
Contributor

openshift-ci bot commented Dec 19, 2023

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jan--f, raptorsun

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Copy link
Contributor

openshift-ci bot commented Dec 19, 2023

@raptorsun: all tests passed!

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@openshift-merge-bot openshift-merge-bot bot merged commit 5802296 into openshift:master Dec 19, 2023
2 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. lgtm Indicates that a PR is ready to be merged. lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants