[regression] Prometheus Operator fails to start due to RBAC permissions #1045

Closed · leifmadsen opened this issue Apr 12, 2022 · 10 comments

@leifmadsen (Author)

I've been trying to figure out why the Prometheus Operator recently stopped working in our environment, and I've tracked it down to a change made in #958, which was filed in response to #942.

The change results in the following output when starting the Prometheus Operator.

level=error ts=2022-04-12T12:49:41.346331692Z caller=klog.go:116 component=k8s_client_runtime func=ErrorDepth msg="pkg/mod/k8s.io/client-go@v0.21.0/tools/cache/reflector.go:167: Failed to watch *v1.Secret: failed to list *v1.Secret: secrets is forbidden: User \"system:serviceaccount:testing1:prometheus-operator\" cannot list resource \"secrets\" in API group \"\" at the cluster scope"
level=error ts=2022-04-12T12:49:41.747235163Z caller=klog.go:116 component=k8s_client_runtime func=ErrorDepth msg="pkg/mod/k8s.io/client-go@v0.21.0/tools/cache/reflector.go:167: Failed to watch *v1.Namespace: failed to list *v1.Namespace: namespaces is forbidden: User \"system:serviceaccount:testing1:prometheus-operator\" cannot list resource \"namespaces\" in API group \"\" at the cluster scope"
level=error ts=2022-04-12T12:49:41.887071286Z caller=klog.go:116 component=k8s_client_runtime func=ErrorDepth msg="pkg/mod/k8s.io/client-go@v0.21.0/tools/cache/reflector.go:167: Failed to watch *v1.Probe: failed to list *v1.Probe: probes.monitoring.coreos.com is forbidden: User \"system:serviceaccount:testing1:prometheus-operator\" cannot list resource \"probes\" in API group \"monitoring.coreos.com\" at the cluster scope"
level=error ts=2022-04-12T12:49:41.918912086Z caller=klog.go:116 component=k8s_client_runtime func=ErrorDepth msg="pkg/mod/k8s.io/client-go@v0.21.0/tools/cache/reflector.go:167: Failed to watch *v1.PrometheusRule: failed to list *v1.PrometheusRule: prometheusrules.monitoring.coreos.com is forbidden: User \"system:serviceaccount:testing1:prometheus-operator\" cannot list resource \"prometheusrules\" in API group \"monitoring.coreos.com\" at the cluster scope"
level=error ts=2022-04-12T12:49:42.004230409Z caller=klog.go:116 component=k8s_client_runtime func=ErrorDepth msg="pkg/mod/k8s.io/client-go@v0.21.0/tools/cache/reflector.go:167: Failed to watch *v1.ServiceMonitor: failed to list *v1.ServiceMonitor: servicemonitors.monitoring.coreos.com is forbidden: User \"system:serviceaccount:testing1:prometheus-operator\" cannot list resource \"servicemonitors\" in API group \"monitoring.coreos.com\" at the cluster scope"
level=error ts=2022-04-12T12:49:42.167747675Z caller=klog.go:116 component=k8s_client_runtime func=ErrorDepth msg="pkg/mod/k8s.io/client-go@v0.21.0/tools/cache/reflector.go:167: Failed to watch *v1.Namespace: failed to list *v1.Namespace: namespaces is forbidden: User \"system:serviceaccount:testing1:prometheus-operator\" cannot list resource \"namespaces\" in API group \"\" at the cluster scope"
...etc
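
As a quick sanity check (a sketch on my part, not something taken from the operator docs), the missing cluster-scope permissions can be confirmed by impersonating the operator's service account and asking about the resources named in the errors above:

oc auth can-i list secrets --all-namespaces --as=system:serviceaccount:testing1:prometheus-operator
oc auth can-i list servicemonitors.monitoring.coreos.com --all-namespaces --as=system:serviceaccount:testing1:prometheus-operator

# <<< both are expected to answer "no", given the errors above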

You can reproduce this by running the bundle directly. First, create an OperatorGroup that limits the installation to the local namespace so it doesn't conflict with existing deployments in other namespaces. I am doing this on OpenShift.

oc new-project testing1

oc create -f - <<EOF
apiVersion: operators.coreos.com/v1
kind: OperatorGroup
metadata:
  name: operator-sdk-og
  namespace: testing1
spec:
  targetNamespaces:
  - testing1
EOF

operator-sdk run bundle -ntesting1 quay.io/operatorhubio/prometheus:v0.47.0

oc logs -f --selector=k8s-app=prometheus-operator

# <<< RBAC permission errors are seen in STDOUT

oc edit csv prometheusoperator.0.47.0

# <<< edit spec.install.spec.deployments[0].spec.template.spec.containers[0].args and add:
# - -namespaces=$(NAMESPACES)
# ... back into the container arguments (see the sketch below)
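
For reference, after the edit the relevant part of the CSV should look roughly like this (a sketch only; the deployment name, the other arguments, and the NAMESPACES environment variable come from the bundle as-is):

spec:
  install:
    spec:
      deployments:
      - name: prometheus-operator
        spec:
          template:
            spec:
              containers:
              - name: prometheus-operator
                args:
                # ... existing arguments from the bundle ...
                - -namespaces=$(NAMESPACES)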

# wait for new pod to spin up

oc logs -f --selector=k8s-app=prometheus-operator

# <<< all RBAC permissions are resolved

operator-sdk cleanup prometheus

This is a request to either revert the changes in 963cb69, or to add the -namespaces=$(NAMESPACES) top-level configuration option back and provide a separate environment variable that allows overriding the other instance-namespaces options.
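
For context, the $(NAMESPACES) value is normally wired into the operator deployment through the annotation OLM injects from the OperatorGroup; roughly, as a container-level fragment (a sketch, not the exact bundle contents):

env:
- name: NAMESPACES
  valueFrom:
    fieldRef:
      fieldPath: metadata.annotations['olm.targetNamespaces']
args:
- -namespaces=$(NAMESPACES)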

@leifmadsen (Author)

CC @zfrhv

@zfrhv (Contributor) commented Apr 12, 2022

Hello, thanks for the mention.

It's true that the operator tries to watch resources outside its namespace, which is why it gets the RBAC errors. I'm not sure, though, whether that is the main reason Prometheus stopped monitoring.

I will try to look into this tomorrow (or in a few days) when I can get my hands on a lab.

Meanwhile, do you see any other errors? Or could you please provide your Prometheus custom resource configuration?

Also, what do you mean by "stopped working": is it a CrashLoopBackOff, or does Prometheus not detect any ServiceMonitors?

I also get "forbidden to list secrets" errors, but Prometheus still monitors fine.

@leifmadsen (Author)

An existing Prometheus will likely continue to work, but if you create a new Prometheus Operator subscription in a new namespace and then try to create a Prometheus resource, the Prometheus Operator is unable to create the new Prometheus workload.

This is a pretty serious regression, as new deployments are no longer possible.
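
For illustration, the kind of Prometheus resource that fails to produce a workload is nothing exotic; a minimal example (the name and service account here are placeholders, not my actual configuration) looks like:

apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: example
  namespace: testing1
spec:
  replicas: 1
  serviceAccountName: prometheus      # placeholder; use a service account with the expected RBAC
  serviceMonitorSelector: {}          # select all ServiceMonitors in the watched namespaces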

@leifmadsen (Author)

Also, I'm wondering why these types of changes are happening in a stable release channel without a bump in the CSV version.

@zfrhv (Contributor) commented Apr 12, 2022

> Also, I'm wondering why these types of changes are happening in a stable release channel without a bump in the CSV version.

The CSV version goes with the operator version, so if you bump the CSV, the operator also needs to be bumped.


When using the -namespaces=$(NAMESPACES) option, the operator won't even try to look for ServiceMonitors and the other CRs outside of its namespace.
And deploying a Prometheus in every namespace is not ideal.
(Plus, that's what Red Hat does for openshift-user-workload-monitoring, so I feel pretty confident about this configuration.)

> This is a pretty serious regression, as new deployments are no longer possible.

I'm not sure why it doesn't create the Prometheus workload for you. I didn't try to install the operator with operator-sdk; I install all of my operators with OLM.

It seems to work fine for me, but I will check it later. Maybe you are right.


Maybe a good solution would be to enable MultiNamespace installation and then configure -namespaces=$(NAMESPACES), where NAMESPACES would somehow be the targetNamespaces list from the OperatorGroup.

I'm not sure how to do that, though; I'm a little busy these days, but I can check later. I'm also not sure why the MultiNamespace install mode is disabled.
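
For what it's worth, the pieces involved would roughly be the CSV's installModes plus the annotation OLM injects from the OperatorGroup; something like this (a sketch, with assumed values):

spec:
  installModes:
  - type: OwnNamespace
    supported: true
  - type: SingleNamespace
    supported: true
  - type: MultiNamespace
    supported: true       # currently disabled in the bundle
  - type: AllNamespaces
    supported: true

# NAMESPACES could then be read from metadata.annotations['olm.targetNamespaces']
# on the pod template via the downward API, as in the earlier sketch.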

@zfrhv (Contributor) commented Apr 13, 2022

Hi,
I tried to install the operator on a different cluster and reproduced the issue.
Thanks for the update, I will create a new pull request.

@zfrhv (Contributor) commented Apr 13, 2022

@leifmadsen can you please verify that it works for you?

@leifmadsen (Author)

> @leifmadsen can you please verify that it works for you?

I will try it tomorrow, thanks.

@leifmadsen (Author)

> Also, I'm wondering why these types of changes are happening in a stable release channel without a bump in the CSV version.

> The CSV version goes with the operator version, so if you bump the CSV, the operator also needs to be bumped.

That's neither true nor necessary. The CSV version is completely independent of the workload version, and by overwriting the CSV version you break OLM's ability to perform automatic updates. For my own operators' releases we use a major.minor.unixdate versioning scheme so that the CSV version is always increasing, which lets installations update whenever the CSV/bundle is modified. The workload version/tag doesn't need to change.
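
To make that concrete, a bundle rebuilt around the same workload could look roughly like this (hypothetical names and versions, sketched from the scheme described above):

apiVersion: operators.coreos.com/v1alpha1
kind: ClusterServiceVersion
metadata:
  name: prometheusoperator.0.47.1649764800   # hypothetical major.minor.unixdate version
spec:
  version: 0.47.1649764800
  replaces: prometheusoperator.0.47.0         # lets OLM upgrade existing installs
  # ... rest of the CSV unchanged, still shipping the v0.47.0 operator image ...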

@leifmadsen (Author)

It looks as if the changes in #1058 resolved this issue. I am no longer seeing the RBAC errors in the log, and I have confirmed my CSV contains the changes as merged in the referenced pull request. Thanks for the quick resolution.
