[regression] Prometheus Operator fails to start due to RBAC permissions #1045

Closed · leifmadsen opened this issue Apr 12, 2022 · 10 comments

@leifmadsen (Author)

I've been trying to figure out why the Prometheus Operator recently stopped working in our environment, and I've tracked it down to a change made in #958, which was filed in response to #942.

The change results in the following output when starting the Prometheus Operator.

level=error ts=2022-04-12T12:49:41.346331692Z caller=klog.go:116 component=k8s_client_runtime func=ErrorDepth msg="pkg/mod/k8s.io/client-go@v0.21.0/tools/cache/reflector.go:167: Failed to watch *v1.Secret: failed to list *v1.Secret: secrets is forbidden: User \"system:serviceaccount:testing1:prometheus-operator\" cannot list resource \"secrets\" in API group \"\" at the cluster scope"
level=error ts=2022-04-12T12:49:41.747235163Z caller=klog.go:116 component=k8s_client_runtime func=ErrorDepth msg="pkg/mod/k8s.io/client-go@v0.21.0/tools/cache/reflector.go:167: Failed to watch *v1.Namespace: failed to list *v1.Namespace: namespaces is forbidden: User \"system:serviceaccount:testing1:prometheus-operator\" cannot list resource \"namespaces\" in API group \"\" at the cluster scope"
level=error ts=2022-04-12T12:49:41.887071286Z caller=klog.go:116 component=k8s_client_runtime func=ErrorDepth msg="pkg/mod/k8s.io/client-go@v0.21.0/tools/cache/reflector.go:167: Failed to watch *v1.Probe: failed to list *v1.Probe: probes.monitoring.coreos.com is forbidden: User \"system:serviceaccount:testing1:prometheus-operator\" cannot list resource \"probes\" in API group \"monitoring.coreos.com\" at the cluster scope"
level=error ts=2022-04-12T12:49:41.918912086Z caller=klog.go:116 component=k8s_client_runtime func=ErrorDepth msg="pkg/mod/k8s.io/client-go@v0.21.0/tools/cache/reflector.go:167: Failed to watch *v1.PrometheusRule: failed to list *v1.PrometheusRule: prometheusrules.monitoring.coreos.com is forbidden: User \"system:serviceaccount:testing1:prometheus-operator\" cannot list resource \"prometheusrules\" in API group \"monitoring.coreos.com\" at the cluster scope"
level=error ts=2022-04-12T12:49:42.004230409Z caller=klog.go:116 component=k8s_client_runtime func=ErrorDepth msg="pkg/mod/k8s.io/client-go@v0.21.0/tools/cache/reflector.go:167: Failed to watch *v1.ServiceMonitor: failed to list *v1.ServiceMonitor: servicemonitors.monitoring.coreos.com is forbidden: User \"system:serviceaccount:testing1:prometheus-operator\" cannot list resource \"servicemonitors\" in API group \"monitoring.coreos.com\" at the cluster scope"
level=error ts=2022-04-12T12:49:42.167747675Z caller=klog.go:116 component=k8s_client_runtime func=ErrorDepth msg="pkg/mod/k8s.io/client-go@v0.21.0/tools/cache/reflector.go:167: Failed to watch *v1.Namespace: failed to list *v1.Namespace: namespaces is forbidden: User \"system:serviceaccount:testing1:prometheus-operator\" cannot list resource \"namespaces\" in API group \"\" at the cluster scope"
...etc
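
As a quick sanity check (a sketch on my part, not something taken from the operator docs), the missing cluster-scope permissions can be confirmed by impersonating the operator's service account and asking about the resources named in the errors above:

oc auth can-i list secrets --all-namespaces --as=system:serviceaccount:testing1:prometheus-operator
oc auth can-i list servicemonitors.monitoring.coreos.com --all-namespaces --as=system:serviceaccount:testing1:prometheus-operator

# <<< both are expected to answer "no", given the errors above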

You can reproduce this by running the bundle directly. First, create an OperatorGroup that limits the installation to the local namespace so it doesn't conflict with existing deployments in other namespaces. I am doing this on OpenShift.

oc new-project testing1

oc create -f - <<EOF
apiVersion: operators.coreos.com/v1
kind: OperatorGroup
metadata:
  name: operator-sdk-og
  namespace: testing1
spec:
  targetNamespaces:
  - testing1
EOF

operator-sdk run bundle -ntesting1 quay.io/operatorhubio/prometheus:v0.47.0

oc logs -f --selector=k8s-app=prometheus-operator

# <<< RBAC permission errors are seen in STDOUT

oc edit csv prometheusoperator.0.47.0

# <<< edit spec.install.spec.deployments[0].spec.template.spec.containers[0].args and add:
# - -namespaces=$(NAMESPACES)
# ... back into the container arguments (see the sketch below)
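
For reference, after the edit the relevant part of the CSV should look roughly like this (a sketch only; the deployment name, the other arguments, and the NAMESPACES environment variable come from the bundle as-is):

spec:
  install:
    spec:
      deployments:
      - name: prometheus-operator
        spec:
          template:
            spec:
              containers:
              - name: prometheus-operator
                args:
                # ... existing arguments from the bundle ...
                - -namespaces=$(NAMESPACES)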

# wait for new pod to spin up

oc logs -f --selector=k8s-app=prometheus-operator

# <<< all RBAC permissions are resolved

operator-sdk cleanup prometheus

This is a request to either revert the changes in 963cb69, or to add the -namespaces=$(NAMESPACES) top-level configuration option back and provide a separate environment variable that allows overriding the other instance-namespaces options.
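
For context, the $(NAMESPACES) value is normally wired into the operator deployment through the annotation OLM injects from the OperatorGroup; roughly, as a container-level fragment (a sketch, not the exact bundle contents):

env:
- name: NAMESPACES
  valueFrom:
    fieldRef:
      fieldPath: metadata.annotations['olm.targetNamespaces']
args:
- -namespaces=$(NAMESPACES)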

@leifmadsen (Author)

CC @zfrhv

@zfrhv (Contributor) commented Apr 12, 2022

Hello, thanks for the mention.

It's true that the operator tries to watch resources outside its namespace, which is why it gets the RBAC errors. I'm not sure, though, whether that is the main reason Prometheus stopped monitoring.

I will try to look into this tomorrow (or in a few days) when I can get my hands on a lab.

Meanwhile, do you see any other errors? Or could you please provide your Prometheus custom resource configuration?

Also, what do you mean by "stopped working": is it a CrashLoopBackOff, or does Prometheus not detect any ServiceMonitors?

I also get "forbidden to list secrets" errors, but Prometheus still monitors fine.

@leifmadsen (Author)

An existing Prometheus will likely continue to work, but if you create a new Prometheus Operator subscription in a new namespace and then try to create a Prometheus resource, the Prometheus Operator is unable to create the new Prometheus workload.

This is a pretty serious regression, as new deployments are no longer possible.
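
For illustration, the kind of Prometheus resource that fails to produce a workload is nothing exotic; a minimal example (the name and service account here are placeholders, not my actual configuration) looks like:

apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: example
  namespace: testing1
spec:
  replicas: 1
  serviceAccountName: prometheus      # placeholder; use a service account with the expected RBAC
  serviceMonitorSelector: {}          # select all ServiceMonitors in the watched namespaces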

@leifmadsen (Author)

Also, I'm wondering why these types of changes are happening in a stable release channel without a bump in the CSV version.

@zfrhv (Contributor) commented Apr 12, 2022

> Also, I'm wondering why these types of changes are happening in a stable release channel without a bump in the CSV version.

The CSV version goes with the operator version, so if you bump the CSV, the operator also needs to be bumped.


When using the -namespaces=$(NAMESPACES) option, the operator won't even try to look for ServiceMonitors and the other CRs outside of its namespace.
And deploying a Prometheus in every namespace is not ideal.
(Plus, that's what Red Hat does for openshift-user-workload-monitoring, so I feel pretty confident about this configuration.)

> This is a pretty serious regression, as new deployments are no longer possible.

I'm not sure why it doesn't create the Prometheus workload for you. I didn't try to install the operator with operator-sdk; I install all of my operators with OLM.

It seems to work fine for me, but I will check it later. Maybe you are right.


Maybe a good solution would be to enable MultiNamespace installation and then configure -namespaces=$(NAMESPACES), where NAMESPACES would somehow be the targetNamespaces list from the OperatorGroup.

I'm not sure how to do that, though; I'm a little busy these days, but I can check later. I'm also not sure why the MultiNamespace install mode is disabled.
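
For what it's worth, the pieces involved would roughly be the CSV's installModes plus the annotation OLM injects from the OperatorGroup; something like this (a sketch, with assumed values):

spec:
  installModes:
  - type: OwnNamespace
    supported: true
  - type: SingleNamespace
    supported: true
  - type: MultiNamespace
    supported: true       # currently disabled in the bundle
  - type: AllNamespaces
    supported: true

# NAMESPACES could then be read from metadata.annotations['olm.targetNamespaces']
# on the pod template via the downward API, as in the earlier sketch.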

@zfrhv (Contributor) commented Apr 13, 2022

Hi,
I tried to install the operator on a different cluster and reproduced the issue.
Thanks for the update, I will create a new pull request.

@zfrhv (Contributor) commented Apr 13, 2022

@leifmadsen can you please verify that it works for you?

@leifmadsen (Author)

> @leifmadsen can you please verify that it works for you?

I will try it tomorrow, thanks.

@leifmadsen (Author)

> Also, I'm wondering why these types of changes are happening in a stable release channel without a bump in the CSV version.

> The CSV version goes with the operator version, so if you bump the CSV, the operator also needs to be bumped.

That's neither true nor necessary. The CSV version is completely independent of the workload version, and by overwriting the CSV version you break OLM's ability to perform automatic updates. For my own operators' releases we use a major.minor.unixdate versioning scheme so that the CSV version is always increasing, which lets installations update whenever the CSV/bundle is modified. The workload version/tag doesn't need to change.
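
To make that concrete, a bundle rebuilt around the same workload could look roughly like this (hypothetical names and versions, sketched from the scheme described above):

apiVersion: operators.coreos.com/v1alpha1
kind: ClusterServiceVersion
metadata:
  name: prometheusoperator.0.47.1649764800   # hypothetical major.minor.unixdate version
spec:
  version: 0.47.1649764800
  replaces: prometheusoperator.0.47.0         # lets OLM upgrade existing installs
  # ... rest of the CSV unchanged, still shipping the v0.47.0 operator image ...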

@leifmadsen (Author)

It looks as if the changes in #1058 resolved this issue. I am no longer seeing the RBAC errors in the log, and I have confirmed my CSV contains the changes as merged in the referenced pull request. Thanks for the quick resolution.
