[MON-2358] add alert PrometheusOperatorRejectedResources #126

raptorsun · 2023-07-03T21:18:26Z

This PR adds a runbook for the alert PrometheusOperatorRejectedResources.
My view of the mitigation section differs from the Prometheus Operator runbook, which suggests that the CR does not conform to the CRD schema. I find that if the CR does not conform to the schema, the Kubernetes API server rejects it before the Prometheus Operator does. The error should instead come from the values of certain fields.

simonpasquier · 2023-07-05T07:21:48Z

alerts/cluster-monitoring-operator/PrometheusOperatorRejectedResources.md

+The custom resource itself should already conform CRD schema, elsewise the API server
+will reject it before the Prometheus Operator does.


Not sure that it's worth mentioning.

raptorsun · 2023-07-10 (edited)

I suggest we tell users not to worry about the schema and that the problem comes from field values, because users may not know that the schema is validated before the resource is written to the database.

simonpasquier · 2023-07-05T07:25:07Z

alerts/cluster-monitoring-operator/PrometheusOperatorRejectedResources.md

+The custom resource itself should already conform CRD schema, elsewise the API server
+will reject it before the Prometheus Operator does.
+
+The error often comes from the value of certain fields that the operator cannot


I would write down a list of possible causes and how to remediate them:

- ServiceMonitor or PodMonitor referencing a file in the filesystem => use a secret/configmap key reference
- Missing secret or configmap key reference => verify that the secret/key exists
- Invalid relabeling configuration => fix the configuration
- ...
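
For illustration, the first cause in this list could be detected with a check along the following lines. This is a minimal sketch and not part of the runbook: it assumes the monitor has already been loaded as a Python dict, and that the file-based fields are named `bearerTokenFile` and `tlsConfig.caFile`/`certFile`/`keyFile` as in the Prometheus Operator API. The helper name `find_file_references` is hypothetical.

```python
from typing import Any

# Endpoint fields that point at files on the Prometheus filesystem.
# The field names are an assumption based on the Prometheus Operator API;
# verify them against the CRD version deployed in your cluster.
FILE_FIELDS = ("bearerTokenFile",)
TLS_FILE_FIELDS = ("caFile", "certFile", "keyFile")


def find_file_references(monitor: dict[str, Any]) -> list[str]:
    """Return dotted paths of file-based credential fields in a monitor."""
    spec = monitor.get("spec", {})
    # ServiceMonitor uses "endpoints"; PodMonitor uses "podMetricsEndpoints".
    endpoints = spec.get("endpoints") or spec.get("podMetricsEndpoints") or []
    found = []
    for i, endpoint in enumerate(endpoints):
        for field in FILE_FIELDS:
            if endpoint.get(field):
                found.append(f"endpoints[{i}].{field}")
        tls = endpoint.get("tlsConfig") or {}
        for field in TLS_FILE_FIELDS:
            if tls.get(field):
                found.append(f"endpoints[{i}].tlsConfig.{field}")
    return found


# Hypothetical ServiceMonitor that references a token file.
example = {
    "kind": "ServiceMonitor",
    "spec": {"endpoints": [{"port": "web", "bearerTokenFile": "/var/run/token"}]},
}
print(find_file_references(example))
```

An empty result only rules out this one cause; the other causes in the list need their own checks.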

raptorsun · 2023-07-16

Added a list of possible causes, though it is not exhaustive.

simonpasquier · 2023-07-05T07:43:57Z

alerts/cluster-monitoring-operator/PrometheusOperatorRejectedResources.md

+## Diagnosis
+
+Find out the type of custom resource and its namespace in the `description` field
+of the alert `PrometheusOperatorRejectedResources`.


IIUC namespace is either openshift-monitoring or openshift-user-workload-monitoring. In the first case, there's nothing much to do except filing a support case (e.g. it's a product bug). In the second case, the user deploying the resource should fix something on their end. This can be explicitly mentioned in the runbook.

raptorsun · 2023-07-10 (edited)

Yup, I will add: "If the namespace is named openshift-.*, for example, openshift-monitoring,
this may be a bug in OpenShift. Please file a support case."

raptorsun · 2023-07-17T16:42:11Z

Thank you for the review @bburt-rh :) The runbook has been updated accordingly.

jan--f · 2023-07-18T14:32:27Z

/lgtm
but I'll add a
/hold
for @bburt-rh to give his final approval. Feel free to unhold after.

raptorsun · 2023-08-04T16:46:20Z

Thank you for the review @bburt-rh :)
The runbook has been updated accordingly.

simonpasquier · 2023-08-08T12:00:21Z

alerts/cluster-monitoring-operator/PrometheusOperatorRejectedResources.md

+a link, followed by the resource type.
+
+After you know the namespace and the resource type, you can search for a recently
+modified custom resource of that type in that namespace.


This paragraph is a bit unclear to me. The namespace label can be either openshift-monitoring or openshift-user-workload-monitoring but the faulty resource might live elsewhere.

bburt-rh · 2023-08-16 (edited)

@simonpasquier - WDYT of adding the following sentence:

"However, note that the namespace label in the description can be either openshift-monitoring or openshift-user-workload-monitoring, but the faulty resource might still be located elsewhere."

simonpasquier · 2023-08-08T12:05:18Z

alerts/cluster-monitoring-operator/PrometheusOperatorRejectedResources.md

+deployment of Prometheus Operator.
+Possible causes include the following:
+
+- Violation of file system access rules through a bearer token file or


Can we rephrase it to make it more clear about what isn't allowed? E.g. "A ServiceMonitor or PodMonitor object references a file to use as bearer token or TLS file which isn't allowed for user-defined monitoring. Instead it is necessary to create a secret holding the credential data in the same namespace as the ServiceMonitor or PodMonitor object and use a secret key reference in the ServiceMonitor or PodMonitor."

bburt-rh · 2023-08-16

Slightly edited version of @simonpasquier's suggested wording:

"A violation can occur when a ServiceMonitor or PodMonitor object references a file to use as a bearer token or references a TLS file. These configurations are not allowed in user-defined monitoring. Instead, in user-defined monitoring, you must create a secret that contains the credential data in the same namespace as the ServiceMonitor or PodMonitor object and use a secret key reference in the ServiceMonitor or PodMonitor configuration."

WDYT?
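
The remediation described in this wording can be sketched as a small transformation of one endpoint entry. This is an illustration under stated assumptions, not the runbook's text: it assumes the endpoint is a plain Python dict and that the secret-based field is `bearerTokenSecret` with `name` and `key`, as in the Prometheus Operator API. The helper name, the secret name `scrape-credentials`, and the key `token` are hypothetical.

```python
from typing import Any


def use_secret_for_bearer_token(
    endpoint: dict[str, Any], secret_name: str, secret_key: str
) -> dict[str, Any]:
    """Return a copy of the endpoint that reads its token from a secret."""
    updated = dict(endpoint)
    # Drop the file reference, which is not allowed in user-defined monitoring.
    updated.pop("bearerTokenFile", None)
    # Reference a key of a secret in the same namespace as the monitor object.
    updated["bearerTokenSecret"] = {"name": secret_name, "key": secret_key}
    return updated


before = {"port": "web", "bearerTokenFile": "/etc/creds/token"}
after = use_secret_for_bearer_token(before, "scrape-credentials", "token")
print(after)
```

The secret itself still has to be created in the same namespace as the ServiceMonitor or PodMonitor object; the sketch only rewrites the reference.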

simonpasquier · 2023-08-08T12:11:11Z

alerts/cluster-monitoring-operator/PrometheusOperatorRejectedResources.md

+
+- Violation of file system access rules through a bearer token file or
+a TLS certification file.
+- **AlertManager**


I'd list by monitor resources:

PodMonitor and ServiceMonitor

...

AlertmanagerConfig

...

PrometheusRules

...

raptorsun · 2023-09-04T17:29:57Z

Thanks @bburt-rh @simonpasquier for the review.
Runbook is updated according to latest comments, please have a look :)

simonpasquier · 2023-09-05T07:51:55Z

alerts/cluster-monitoring-operator/PrometheusOperatorRejectedResources.md

+### Diagnose using the Openshift web console
+
+1. Browse to **Observe** -> **Alerting**.
+2. If the **Filter** for **Alert State** is not set to **Firing**,


The page should always filter by firing alerts, so I'd remove this step.

simonpasquier · 2023-09-05T07:56:47Z

alerts/cluster-monitoring-operator/PrometheusOperatorRejectedResources.md

+a link, followed by the resource type.
+
+Knowing the namespace and the resource type, you can search for recently modified
+custom resources of that type in that namespace.


I think that it's confusing because the namespace isn't the namespace where the faulty resource lives.

simonpasquier · 2023-09-05T07:57:15Z

alerts/cluster-monitoring-operator/PrometheusOperatorRejectedResources.md

+3. Search for the `PrometheusOperatorRejectedResources` alert.
+4. Click the alert to view its details.
+5. Scroll down and view the **Description** field. The namespace is displayed as
+a link, followed by the resource type.


I'd rather tell users to look at the labels directly.

simonpasquier · 2023-09-05T08:02:08Z

alerts/cluster-monitoring-operator/PrometheusOperatorRejectedResources.md

+located elsewhere.
+
+**Note**: If the namespace is named `openshift-monitoring`,
+this might be a bug in OpenShift. Please file a support case.


I'd make the difference between openshift-monitoring and openshift-user-workload-monitoring more explicit.

The course of action depends on the value of the `namespace` label:
* When the value is `openshift-monitoring`, this is an issue with the platform monitoring stack. Please submit a support request.
* When the value is `openshift-user-workload-monitoring`, this is an issue with a user-defined monitoring resource.
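
As a rough illustration of this decision, the following sketch maps the alert labels to a next step. It assumes the alert labels are available as a Python dict with the `namespace` and `resource` labels discussed in this review; the function name `next_step` and the returned messages are hypothetical.

```python
def next_step(alert_labels: dict[str, str]) -> str:
    """Suggest a course of action from the alert's namespace label."""
    namespace = alert_labels.get("namespace", "")
    resource = alert_labels.get("resource", "unknown resource")
    if namespace == "openshift-monitoring":
        # Platform monitoring stack: users cannot fix this themselves.
        return f"Platform monitoring issue ({resource}): submit a support request."
    if namespace == "openshift-user-workload-monitoring":
        # User-defined monitoring: the owner of the resource must fix it.
        return f"User-defined monitoring issue ({resource}): fix the rejected resource."
    return f"Unexpected namespace {namespace!r}: inspect the alert manually."


print(next_step({"namespace": "openshift-user-workload-monitoring",
                 "resource": "ServiceMonitor"}))
```

The `namespace` label identifies the Prometheus Operator deployment, not necessarily the namespace where the faulty resource lives.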

simonpasquier · 2023-09-05T08:05:22Z

alerts/cluster-monitoring-operator/PrometheusOperatorRejectedResources.md

+deployment of Prometheus Operator.
+Possible causes include the following:
+
+- A Violation of file system access rules can occur when a `ServiceMonitor` or `PodMonitor`


Should be moved with the other ServiceMonitor/PodMonitor causes?

raptorsun · 2023-09-13T07:12:43Z

/unhold

simonpasquier left a comment

One small nit, otherwise lgtm.

simonpasquier · 2023-09-14T12:01:01Z

alerts/cluster-monitoring-operator/PrometheusOperatorRejectedResources.md

+
+## Diagnosis
+
+Find the custom resource type and the namespace in the `description` label


Suggested change

-Find the custom resource type and the namespace in the `description` label
+Identify the custom resource type and the namespace from the `resource` and `namespace` labels

raptorsun · 2023-09-19T17:23:35Z

Thank you for the review, @bburt-rh @simonpasquier :D
The runbook is updated accordingly.

openshift-bot · 2023-12-19T01:00:19Z

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

jan--f · 2023-12-19T09:13:38Z

/lgtm

openshift-ci · 2023-12-19T09:14:20Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jan--f, raptorsun

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~alerts/cluster-monitoring-operator/OWNERS~~ [jan--f,raptorsun]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

openshift-ci · 2023-12-19T09:19:54Z

@raptorsun: all tests passed!

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

openshift-ci bot requested review from jan--f and slashpai July 3, 2023 21:18

openshift-ci bot added the approved label Jul 3, 2023

raptorsun force-pushed the alert/PrometheusOperatorRejectedResources branch from 74aae14 to 9f677a9 Compare July 4, 2023 15:58

simonpasquier reviewed Jul 5, 2023

raptorsun force-pushed the alert/PrometheusOperatorRejectedResources branch from 9f677a9 to 6b98c9d Compare July 16, 2023 22:46

raptorsun requested a review from simonpasquier July 16, 2023 22:47

bburt-rh reviewed Jul 17, 2023

raptorsun force-pushed the alert/PrometheusOperatorRejectedResources branch from 6b98c9d to fe426bc Compare July 17, 2023 16:41

openshift-ci bot added the do-not-merge/hold label Jul 18, 2023

openshift-ci bot assigned jan--f Jul 18, 2023

openshift-ci bot added the lgtm label Jul 18, 2023

bburt-rh reviewed Jul 20, 2023

bburt-rh reviewed Aug 4, 2023

raptorsun force-pushed the alert/PrometheusOperatorRejectedResources branch from 3076ee3 to 274fe97 Compare August 4, 2023 16:45

simonpasquier reviewed Aug 8, 2023

raptorsun force-pushed the alert/PrometheusOperatorRejectedResources branch from bc1c6f6 to 40a627a Compare September 4, 2023 17:29

simonpasquier reviewed Sep 5, 2023

raptorsun force-pushed the alert/PrometheusOperatorRejectedResources branch from 40a627a to 6f4af8b Compare September 5, 2023 23:28

openshift-ci bot removed the do-not-merge/hold label Sep 13, 2023

simonpasquier reviewed Sep 14, 2023

bburt-rh reviewed Sep 18, 2023

add alert PrometheusOperatorRejectedResources

9c2319f

raptorsun force-pushed the alert/PrometheusOperatorRejectedResources branch from 6f4af8b to 9c2319f Compare September 19, 2023 17:23

openshift-ci bot added the lifecycle/stale label Dec 19, 2023

openshift-ci bot added the lgtm label Dec 19, 2023

openshift-merge-bot bot merged commit 5802296 into openshift:master Dec 19, 2023
2 checks passed
