-
Notifications
You must be signed in to change notification settings - Fork 96
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[MON-2358] add alert PrometheusOperatorRejectedResources #126
[MON-2358] add alert PrometheusOperatorRejectedResources #126
Conversation
74aae14
to
9f677a9
Compare
alerts/cluster-monitoring-operator/PrometheusOperatorRejectedResources.md
Outdated
Show resolved
Hide resolved
alerts/cluster-monitoring-operator/PrometheusOperatorRejectedResources.md
Outdated
Show resolved
Hide resolved
alerts/cluster-monitoring-operator/PrometheusOperatorRejectedResources.md
Outdated
Show resolved
Hide resolved
alerts/cluster-monitoring-operator/PrometheusOperatorRejectedResources.md
Outdated
Show resolved
Hide resolved
The custom resource itself should already conform CRD schema, elsewise the API server | ||
will reject it before the Prometheus Operator does. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not sure that it's worth mentioning it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suggest we tell users not to bother with the schema, the problem comes from value, because user may not know that schema is validated before writing the resource to database.
The custom resource itself should already conform CRD schema, elsewise the API server | ||
will reject it before the Prometheus Operator does. | ||
|
||
The error often comes from the value of certain fields that the operator cannot |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would write down a list of possible causes and how to remediate
- ServiceMonitor or PodMonitor referencing a file in the filesystem => use secret/configmap key reference
- Missing secret or configmap key reference => verify that the secret/key exists
- Invalid relabeling configuration => fix the configuration
- ...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added a list of possible causes, not exhaustive though.
alerts/cluster-monitoring-operator/PrometheusOperatorRejectedResources.md
Outdated
Show resolved
Hide resolved
alerts/cluster-monitoring-operator/PrometheusOperatorRejectedResources.md
Outdated
Show resolved
Hide resolved
## Diagnosis | ||
|
||
Find out the type of custom resource and its namespace in the `description` field | ||
of the alert `PrometheusOperatorRejectedResources`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IIUC namespace is either openshift-monitoring or openshift-user-workload-monitoring. In the first case, there's nothing much to do except filing a support case (e.g. it's a product bug). In the second case, the user deploying the resource should fix something on their end. This can be explicitly mentioned in the runbook.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yup, I will add "If the namespace is named openshift-.*
, for example, openshift-monitoring
,
this may be a bug in Openshift, please file a support case."
9f677a9
to
6b98c9d
Compare
alerts/cluster-monitoring-operator/PrometheusOperatorRejectedResources.md
Outdated
Show resolved
Hide resolved
alerts/cluster-monitoring-operator/PrometheusOperatorRejectedResources.md
Outdated
Show resolved
Hide resolved
alerts/cluster-monitoring-operator/PrometheusOperatorRejectedResources.md
Outdated
Show resolved
Hide resolved
alerts/cluster-monitoring-operator/PrometheusOperatorRejectedResources.md
Outdated
Show resolved
Hide resolved
6b98c9d
to
fe426bc
Compare
Thank you for the review @bburt-rh :) The runbook has been updated accordingly. |
/lgtm |
alerts/cluster-monitoring-operator/PrometheusOperatorRejectedResources.md
Outdated
Show resolved
Hide resolved
alerts/cluster-monitoring-operator/PrometheusOperatorRejectedResources.md
Outdated
Show resolved
Hide resolved
alerts/cluster-monitoring-operator/PrometheusOperatorRejectedResources.md
Outdated
Show resolved
Hide resolved
alerts/cluster-monitoring-operator/PrometheusOperatorRejectedResources.md
Outdated
Show resolved
Hide resolved
alerts/cluster-monitoring-operator/PrometheusOperatorRejectedResources.md
Outdated
Show resolved
Hide resolved
alerts/cluster-monitoring-operator/PrometheusOperatorRejectedResources.md
Outdated
Show resolved
Hide resolved
alerts/cluster-monitoring-operator/PrometheusOperatorRejectedResources.md
Outdated
Show resolved
Hide resolved
alerts/cluster-monitoring-operator/PrometheusOperatorRejectedResources.md
Outdated
Show resolved
Hide resolved
alerts/cluster-monitoring-operator/PrometheusOperatorRejectedResources.md
Outdated
Show resolved
Hide resolved
alerts/cluster-monitoring-operator/PrometheusOperatorRejectedResources.md
Outdated
Show resolved
Hide resolved
alerts/cluster-monitoring-operator/PrometheusOperatorRejectedResources.md
Outdated
Show resolved
Hide resolved
alerts/cluster-monitoring-operator/PrometheusOperatorRejectedResources.md
Outdated
Show resolved
Hide resolved
alerts/cluster-monitoring-operator/PrometheusOperatorRejectedResources.md
Outdated
Show resolved
Hide resolved
alerts/cluster-monitoring-operator/PrometheusOperatorRejectedResources.md
Outdated
Show resolved
Hide resolved
alerts/cluster-monitoring-operator/PrometheusOperatorRejectedResources.md
Outdated
Show resolved
Hide resolved
alerts/cluster-monitoring-operator/PrometheusOperatorRejectedResources.md
Outdated
Show resolved
Hide resolved
alerts/cluster-monitoring-operator/PrometheusOperatorRejectedResources.md
Outdated
Show resolved
Hide resolved
alerts/cluster-monitoring-operator/PrometheusOperatorRejectedResources.md
Outdated
Show resolved
Hide resolved
3076ee3
to
274fe97
Compare
Thank you for the review @bburt-rh :) |
a link, followed by the resource type. | ||
|
||
After you know the namespace and the resource type, you can search for a recently | ||
modified custom resource of that type in that namespace. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This paragraph is a bit unclear to me. The namespace label can be either openshift-monitoring or openshift-user-workload-monitoring but the faulty resource might live elsewhere.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@simonpasquier - WDYT of adding the following sentence:
"However, note that the namespace label in the description can be either openshift-monitoring
or openshift-user-workload-monitoring
, but the faulty resource might still be located elsewhere."
alerts/cluster-monitoring-operator/PrometheusOperatorRejectedResources.md
Outdated
Show resolved
Hide resolved
deployment of Prometheus Operator. | ||
Possible causes include the following: | ||
|
||
- Violation of file system access rules through a bearer token file or |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we rephrase it to make it more clear about what isn't allowed? E.g. "A ServiceMonitor or PodMonitor object references a file to use as bearer token or TLS file which isn't allowed for user-defined monitoring. Instead it is necessary to create a secret holding the credential data in the same namespace as the ServiceMonitor or PodMonitor object and use a secret key reference in the ServiceMonitor or PodMonitor."
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Slightly edited version of @simonpasquier's suggested wording:
"A violation can occur when a ServiceMonitor
or PodMonitor
object references a file to use as a bearer token or references a TLS file. These configurations are not allowed in user-defined monitoring. Instead, in user-defined monitoring, you must create a secret that contains the credential data in the same namespace as the ServiceMonitor
or PodMonitor
object and use a secret key reference in the ServiceMonitor
or PodMonitor
configuration."
WDYT?
|
||
- Violation of file system access rules through a bearer token file or | ||
a TLS certification file. | ||
- **AlertManager** |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd list by monitor resources:
- PodMonitor and ServiceMonitor
- ...
- AlertmanagerConfig
- ...
- PrometheusRules
- ...
bc1c6f6
to
40a627a
Compare
Thanks @bburt-rh @simonpasquier for the review. |
alerts/cluster-monitoring-operator/PrometheusOperatorRejectedResources.md
Outdated
Show resolved
Hide resolved
alerts/cluster-monitoring-operator/PrometheusOperatorRejectedResources.md
Outdated
Show resolved
Hide resolved
alerts/cluster-monitoring-operator/PrometheusOperatorRejectedResources.md
Outdated
Show resolved
Hide resolved
### Diagnose using the Openshift web console | ||
|
||
1. Browse to **Observe** -> **Alerting**. | ||
2. If the **Filter** for **Alert State** is not set to **Firing**, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the page should always filter by firing alerts -> I'd remove this step
a link, followed by the resource type. | ||
|
||
Knowing the namespace and the resource type, you can search for recently modified | ||
custom resources of that type in that namespace. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that it's confusing because the namespace isn't the namespace where the faulty resource lives.
3. Search for the `PrometheusOperatorRejectedResources` alert. | ||
4. Click the alert to view its details. | ||
5. Scroll down and view the **Description** field. The namespace is displayed as | ||
a link, followed by the resource type. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd rather tell to look at the labels directly.
located elsewhere. | ||
|
||
**Note**: If the namespace is named `openshift-monitoring`, | ||
this might be a bug in OpenShift. Please file a support case. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd make the difference between openshift-monitoring and openshift-user-workload-monitoring more explicit.
The course of action depends on the value of the `namespace` label:
* When the value is `openshift-monitoring`, this is an issue with the platform monitoring stack. Please submit request to the support.
* When the value is `openshift-user-workload-monitoring`, this is an issue with a user-defined monitoring resource.
alerts/cluster-monitoring-operator/PrometheusOperatorRejectedResources.md
Outdated
Show resolved
Hide resolved
alerts/cluster-monitoring-operator/PrometheusOperatorRejectedResources.md
Outdated
Show resolved
Hide resolved
deployment of Prometheus Operator. | ||
Possible causes include the following: | ||
|
||
- A Violation of file system access rules can occur when a `ServiceMonitor` or `PodMonitor` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should be moved with the other ServiceMonitor/PodMonitor causes?
40a627a
to
6f4af8b
Compare
/unhold |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One small nit otherwise lgtm.
|
||
## Diagnosis | ||
|
||
Find the custom resource type and the namespace in the `description` label |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Find the custom resource type and the namespace in the `description` label | |
Identify the custom resource type and the namespace from the `resource` and `namespace` labels |
alerts/cluster-monitoring-operator/PrometheusOperatorRejectedResources.md
Outdated
Show resolved
Hide resolved
alerts/cluster-monitoring-operator/PrometheusOperatorRejectedResources.md
Outdated
Show resolved
Hide resolved
alerts/cluster-monitoring-operator/PrometheusOperatorRejectedResources.md
Outdated
Show resolved
Hide resolved
alerts/cluster-monitoring-operator/PrometheusOperatorRejectedResources.md
Outdated
Show resolved
Hide resolved
alerts/cluster-monitoring-operator/PrometheusOperatorRejectedResources.md
Outdated
Show resolved
Hide resolved
6f4af8b
to
9c2319f
Compare
Thank you for the review, @bburt-rh @simonpasquier :D |
Issues go stale after 90d of inactivity. Mark the issue as fresh by commenting If this issue is safe to close now please do so with /lifecycle stale |
/lgtm |
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: jan--f, raptorsun The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
@raptorsun: all tests passed! Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here. |
This PR add runbook for the alert PrometheusOperatorRejectedResources.
I have different opinion in the mitigation sections than the runbook of Prometheus Operator, which suggests the CR does not conform CRD schema. I find if the CR does not conform the schema, kubelet API server will reject before Prometheus Operator does. The error should come from the value of certain fields instead.