LOG-2732: Fix ES servicemonitor for user-workload-monitoring #903

periklis · 2022-06-28T12:34:31Z

Description

For legacy reasons the elasticsearch-operator assumes that Elasticsearch and its owned ServiceMonitor resources are installed only in openshift- namespaces or in namespaces labeled with openshift.io/cluster-monitoring: true. In both cases the cluster-monitoring stack takes responsibility for reconciling the ServiceMonitor resources for the cluster-monitoring Prometheus. In detail, the ServiceMonitor endpoints used the prometheus-k8s serviceaccount token to scrape metrics from elasticsearch and elasticsearch-proxy. This is the legacy and nowadays not-recommended practice that is still sustained in OCP's cluster-monitoring for compatibility reasons (i.e. prometheus CR ArbitraryFSAccessThroughSMsConfig.Deny: false).

Moving forward in time, with the addition of User Workload Monitoring in OCP (since 4.8) the monitoring stack is extended by a second instance of prometheus-operator where ArbitraryFSAccessThroughSMsConfig.Deny: true is applied by default. In turn this means that the following ServiceMonitor fields are no longer allowed:

Spec.Endpoints[].TLSConfig.CAFile: Certificate Authority file for verifying server-side certificates when scraping metrics.
Spec.Endpoints[].BearerTokenFile: Bearer Token file for authorizing against server-side when scraping metrics.

In summary this PR makes ServiceMonitor resources compliant with ArbitraryFSAccessThroughSMsConfig.Deny: true and in turn extends the support of monitoring Elasticsearch from cluster-monitoring only to cluster-monitoring and user-workload-monitoring. In detail the denied fields for CAFile and BearerTokenFile are replaced by:

Instead of CAFile the endpoints use a local object reference to a configmap annotated with service.beta.openshift.io/inject-cabundle: true
Instead of BearerTokenFile the endpoints use a local object reference to the serviceaccount token secret of the elasticsearch-metrics serviceaccount.
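
The replacement described above can be sketched roughly as follows. This is a minimal illustration, not the operator's code: the `Endpoint` and `KeyRef` types are hypothetical stand-ins for the prometheus-operator ServiceMonitor types, and the configmap and secret names, the port name, and the file paths are made up for the example. The keys `service-ca.crt` (injected CA bundle) and `token` (ServiceAccount token secret) are the standard key names.

```go
package main

import "fmt"

// KeyRef is a hypothetical stand-in for a local object reference: the name of
// a ConfigMap or Secret in the ServiceMonitor's own namespace plus a key in it.
type KeyRef struct {
	Name string
	Key  string
}

// Endpoint is a hypothetical, trimmed-down model of a ServiceMonitor endpoint.
type Endpoint struct {
	Port string

	// File-based fields, rejected when ArbitraryFSAccessThroughSMsConfig.Deny is true.
	CAFile          string
	BearerTokenFile string

	// Object-reference replacements.
	CAConfigMap       *KeyRef
	BearerTokenSecret *KeyRef
}

// UseObjectRefs returns a copy of ep with the file-based fields cleared and
// replaced by references to the given configmap and secret.
func UseObjectRefs(ep Endpoint, caConfigMap, tokenSecret string) Endpoint {
	ep.CAFile = ""
	ep.BearerTokenFile = ""
	ep.CAConfigMap = &KeyRef{Name: caConfigMap, Key: "service-ca.crt"}
	ep.BearerTokenSecret = &KeyRef{Name: tokenSecret, Key: "token"}
	return ep
}

// Compliant reports whether ep avoids the two denied file-based fields.
func Compliant(ep Endpoint) bool {
	return ep.CAFile == "" && ep.BearerTokenFile == ""
}

func main() {
	legacy := Endpoint{
		Port:            "elasticsearch",   // hypothetical port name
		CAFile:          "/path/to/ca.crt", // placeholder path
		BearerTokenFile: "/path/to/token",  // placeholder path
	}
	// Both object names below are assumptions made for this example.
	fixed := UseObjectRefs(legacy, "elasticsearch-metrics-ca", "elasticsearch-metrics-token")

	fmt.Println("legacy compliant:", Compliant(legacy))
	fmt.Println("fixed compliant:", Compliant(fixed))
	fmt.Printf("CA from configmap %s[%s], token from secret %s[%s]\n",
		fixed.CAConfigMap.Name, fixed.CAConfigMap.Key,
		fixed.BearerTokenSecret.Name, fixed.BearerTokenSecret.Key)
}
```

Because both references are local to the ServiceMonitor's namespace, Prometheus does not need access to arbitrary files on its own filesystem to resolve them.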

Notes for reviewer

To make the above settings work in parallel with cluster-monitoring (i.e. openshift-logging) and user-workload-monitoring (i.e. openshift distributed tracing platform), the PR changes the elasticsearch-proxy backend role mapping from using a cluster-scoped non-resource-url (i.e. /metrics) to a custom virtual namespace-scoped resource (i.e. elasticsearch.openshift.io/metrics). This simplifies RBAC by providing for each stack a serviceaccount (elasticsearch-metrics) and a Role/RoleBinding pair.
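
To illustrate why the switch matters for RBAC, here is a small sketch with a hypothetical `PolicyRule` type that mirrors the shape of a Kubernetes RBAC rule. A rule on a non-resource URL can only be granted cluster-wide (ClusterRole), whereas a rule on a resource in an API group can live in a namespaced Role. The verb `get` is an assumption for the example; the PR description does not state which verb is checked.

```go
package main

import "fmt"

// PolicyRule is a hypothetical local type mirroring the shape of an RBAC rule.
type PolicyRule struct {
	APIGroups       []string
	Resources       []string
	NonResourceURLs []string
	Verbs           []string
}

// LegacyRule models access to the cluster-scoped non-resource URL /metrics.
func LegacyRule() PolicyRule {
	return PolicyRule{
		NonResourceURLs: []string{"/metrics"},
		Verbs:           []string{"get"}, // assumed verb
	}
}

// NamespacedRule models access to the virtual resource
// elasticsearch.openshift.io/metrics.
func NamespacedRule() PolicyRule {
	return PolicyRule{
		APIGroups: []string{"elasticsearch.openshift.io"},
		Resources: []string{"metrics"},
		Verbs:     []string{"get"}, // assumed verb
	}
}

// FitsInRole reports whether the rule can be expressed in a namespaced Role,
// i.e. it names resources and does not use non-resource URLs.
func FitsInRole(r PolicyRule) bool {
	return len(r.NonResourceURLs) == 0 && len(r.Resources) > 0
}

func main() {
	fmt.Println("legacy rule fits in a Role:", FitsInRole(LegacyRule()))
	fmt.Println("namespaced rule fits in a Role:", FitsInRole(NamespacedRule()))
}
```

With the namespaced form, each stack only needs its own elasticsearch-metrics serviceaccount bound to such a Role in its own namespace.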

/cc @xperimental

/cherry-pick release-5.4

Links

JIRA: https://issues.redhat.com/browse/LOG-2732

openshift-ci · 2022-06-28T12:37:08Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: periklis

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [periklis]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

periklis · 2022-06-29T16:23:03Z

/retest-required

periklis · 2022-06-29T16:27:24Z

/test e2e-upgrade

periklis · 2022-06-30T07:00:39Z

/test e2e-upgrade

Red-GV

Looks good to me.

xperimental

Works nicely on a 4.10 cluster, but I think the existing code will not work on 4.11 anymore. Not putting an lgtm on it yet to clear up this question first.

xperimental · 2022-07-18T15:09:15Z

internal/elasticsearch/service_monitor.go

+	}
+
+	var tokenSecret string
+	for _, oref := range sa.Secrets {


This code will probably not work correctly on OCP 4.11 as ServiceAccounts do not get a Secret containing the token by default anymore on Kubernetes 1.24.

Good catch, I have updated the PR to create a ServiceAccountToken Secret manually. PTAL

This works for me on 4.11.
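
For readers unfamiliar with the approach discussed in this thread, the following is a rough sketch of what a manually created ServiceAccountToken Secret looks like. The `Secret` type is a hypothetical local stand-in rather than the Kubernetes API type, and the secret name suffix and the namespace are assumptions for illustration. The secret type `kubernetes.io/service-account-token` and the annotation `kubernetes.io/service-account.name` are the standard Kubernetes mechanism for asking the control plane to populate a token for a given ServiceAccount.

```go
package main

import "fmt"

// Secret is a hypothetical, trimmed-down model of a Kubernetes Secret manifest.
type Secret struct {
	Name        string
	Namespace   string
	Type        string
	Annotations map[string]string
}

// TokenSecretFor builds a ServiceAccountToken Secret for the given
// serviceaccount. The "-token" name suffix is an assumption for this example.
func TokenSecretFor(namespace, serviceAccount string) Secret {
	return Secret{
		Name:      serviceAccount + "-token",
		Namespace: namespace,
		Type:      "kubernetes.io/service-account-token",
		Annotations: map[string]string{
			// Tells the token controller which ServiceAccount to issue a token for.
			"kubernetes.io/service-account.name": serviceAccount,
		},
	}
}

func main() {
	s := TokenSecretFor("openshift-logging", "elasticsearch-metrics")
	fmt.Printf("%s/%s type=%s sa=%s\n",
		s.Namespace, s.Name, s.Type,
		s.Annotations["kubernetes.io/service-account.name"])
}
```

Creating the Secret explicitly avoids relying on the ServiceAccount's Secrets list, which is no longer populated by default on Kubernetes 1.24.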

periklis · 2022-07-20T09:55:31Z

/retest

xperimental

Works fine on 4.10, 4.11 cluster is still booting...

internal/elasticsearch/service_monitor.go

internal/elasticsearch/serviceaccount.go

xperimental · 2022-07-20T11:46:38Z

/lgtm

periklis · 2022-07-20T13:06:55Z

/retest

openshift-ci-robot · 2022-07-20T15:37:17Z

/retest-required

Remaining retests: 2 against base HEAD 36a0cae and 8 for PR HEAD 19f7482 in total

periklis · 2022-07-20T17:25:56Z

/hold Investigating e2e failures

periklis · 2022-07-20T17:42:32Z

/hold cancel

periklis · 2022-07-20T17:42:53Z

/retest-required

openshift-ci · 2022-07-20T18:55:57Z

@periklis: all tests passed!

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

periklis · 2022-07-20T19:26:55Z

/cherry-pick release-5.4

openshift-cherrypick-robot · 2022-07-20T19:27:37Z

@periklis: #903 failed to apply on top of branch "release-5.4":

Applying: Fix ES servicemonitor for user-workload-monitoring
Using index info to reconstruct a base tree...
M	internal/elasticsearch/common.go
M	internal/elasticsearch/configmaps.go
M	internal/elasticsearch/rbac.go
M	internal/elasticsearch/reconciler.go
M	internal/elasticsearch/service_monitor.go
M	internal/elasticsearch/service_monitor_test.go
M	internal/elasticsearch/serviceaccount.go
M	internal/manifests/configmap/configmap.go
M	internal/manifests/secret/secret.go
M	internal/manifests/serviceaccount/serviceaccount.go
Falling back to patching base and 3-way merge...
Auto-merging internal/manifests/serviceaccount/serviceaccount.go
Auto-merging internal/manifests/secret/secret.go
Auto-merging internal/manifests/configmap/configmap.go
CONFLICT (content): Merge conflict in internal/manifests/configmap/configmap.go
Auto-merging internal/elasticsearch/serviceaccount.go
CONFLICT (content): Merge conflict in internal/elasticsearch/serviceaccount.go
Auto-merging internal/elasticsearch/service_monitor_test.go
CONFLICT (content): Merge conflict in internal/elasticsearch/service_monitor_test.go
Auto-merging internal/elasticsearch/service_monitor.go
CONFLICT (content): Merge conflict in internal/elasticsearch/service_monitor.go
Auto-merging internal/elasticsearch/reconciler.go
Auto-merging internal/elasticsearch/rbac.go
CONFLICT (content): Merge conflict in internal/elasticsearch/rbac.go
Auto-merging internal/elasticsearch/configmaps.go
Auto-merging internal/elasticsearch/common.go
error: Failed to merge in the changes.
hint: Use 'git am --show-current-patch=diff' to see the failed patch
Patch failed at 0001 Fix ES servicemonitor for user-workload-monitoring
When you have resolved this problem, run "git am --continue".
If you prefer to skip this patch, run "git am --skip" instead.
To restore the original branch and stop patching, run "git am --abort".

In response to this:

/cherry-pick release-5.4

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

openshift-ci bot requested a review from xperimental June 28, 2022 12:34

openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jun 28, 2022

periklis force-pushed the fix-service-monitor-uwm branch from fa21b26 to 695cc7b Compare June 29, 2022 13:12

Red-GV reviewed Jun 30, 2022

xperimental reviewed Jul 18, 2022

periklis force-pushed the fix-service-monitor-uwm branch from 695cc7b to d31ca01 Compare July 20, 2022 08:07

xperimental reviewed Jul 20, 2022

internal/elasticsearch/service_monitor.go (outdated, resolved)

internal/elasticsearch/serviceaccount.go (outdated, resolved)

Fix ES servicemonitor for user-workload-monitoring

19f7482

periklis force-pushed the fix-service-monitor-uwm branch from d31ca01 to 19f7482 Compare July 20, 2022 11:11

openshift-ci bot assigned xperimental Jul 20, 2022

openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Jul 20, 2022

openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jul 20, 2022

openshift-ci bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jul 20, 2022

openshift-merge-robot merged commit 8b4e198 into openshift:master Jul 20, 2022

periklis mentioned this pull request Jul 21, 2022

[release-5.4] LOG-2845: Fix ES servicemonitor for user-workload-monitoring #919

Merged

periklis mentioned this pull request Sep 12, 2022

Add missing owner for ca bundle configmap #937

Merged
