Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

LOG-2732: Fix ES servicemonitor for user-workload-monitoring #903

Merged

Conversation

periklis
Copy link
Contributor

Description

For legacy reasons the elasticsearch-operator assumes that Elasticsearch and owned ServiceMonitor resources installed only on openshift- namespaces or those annotated with openshift.io/cluster-monitoring: true. In both case the cluster-monitoring stack takes responsibility to reconcile the ServiceMonitor resources for the cluster-monitoring Prometheus. In detail ServiceMonitor endpoints used the prometheus-k8s's serviceaccount token to scrape metrics from elasticsearch and elasticsearch-proxy. This is the legacy and nowadays not-recommended practice that is still sustained in OCP's cluster-monitoring for compatibility reason (i.e. prometheus CR ArbitraryFSAccessThroughSMsConfig.Deny: false)

Moving forward in time with the addition of User Workload Monitoring in OCP (since 4.8) the monitoring stack is amended by a second instance of prometheus-operator where ArbitraryFSAccessThroughSMsConfig.Deny: true is applied by default. In turn this means that the following fields in ServiceMonitor's are not allowed for use any more:

  • Spec.Endpoints[].TLSConfig.CAFile: Certificate Authority file for verifying server-side certificates when scraping metrics.
  • Spec.Endpoints[].BearerTokenFile: Bearer Token file for authorizing against server-side when scraping metrics.

In summary this PR makes ServiceMonitor resources compliant with ArbitraryFSAccessThroughSMsConfig.Deny: true and in turn extends the support of monitoring Elasticsearch from cluster-monitoring only to cluster-monitoring and user-workload-monitoring. In detail the denied fields for CAFile and BearerTokenFile are replaced by:

  • Instead of CAFile the endpoints use a local object reference to a configmap annotated with service.beta.openshift.io/inject-cabundle: true
  • Instead of BearerTokenFile the endpoints use a local object reference to the serviceaccount token secret of the elasticsearch-metrics serviceaccount.

Notes for reviewer

To make the above settings work in parallel with cluster-monitoring (i.e. openshift-logging) and user-workload-monitoring (i.e. openshift distributed tracing platform), the PR changes the elasticsearch-proxy backend role mapping from using a cluster-scoped non-resource-url (i.e. /metrics) to a custom virtual namespace scoped resource (i.e. elasticsearch.openshift.io/metrics). This simplifies RBAC by providing for each stack a serviceaccount (elasticsearch-metrics) and a pair of Role/Rolebinding.

/cc @xperimental

/cherry-pick release-5.4

Links

@openshift-ci openshift-ci bot requested a review from xperimental June 28, 2022 12:34
@openshift-ci
Copy link
Contributor

openshift-ci bot commented Jun 28, 2022

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: periklis

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jun 28, 2022
@periklis
Copy link
Contributor Author

/retest-required

@periklis
Copy link
Contributor Author

/test e2e-upgrade

1 similar comment
@periklis
Copy link
Contributor Author

/test e2e-upgrade

Copy link
Contributor

@Red-GV Red-GV left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me.

Copy link
Contributor

@xperimental xperimental left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Works nicely on a 4.10 cluster, but I think the existing code will not work on 4.11 anymore. Not putting an lgtm on it yet to clear up this question first.

}

var tokenSecret string
for _, oref := range sa.Secrets {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This code will probably not work correctly on OCP 4.11 as ServiceAccounts do not get a Secret containing the token by default anymore on Kubernetes 1.24.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch, I have update the PR to create a ServiceAccountToken Secret manually. PTAL

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This works for me on 4.11.

@periklis
Copy link
Contributor Author

/retest

Copy link
Contributor

@xperimental xperimental left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Works fine on 4.10, 4.11 cluster is still booting...

internal/elasticsearch/service_monitor.go Outdated Show resolved Hide resolved
internal/elasticsearch/serviceaccount.go Outdated Show resolved Hide resolved
@xperimental
Copy link
Contributor

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Jul 20, 2022
@periklis
Copy link
Contributor Author

/retest

@openshift-ci-robot
Copy link

/retest-required

Remaining retests: 2 against base HEAD 36a0cae and 8 for PR HEAD 19f7482 in total

@periklis
Copy link
Contributor Author

/hold Investigating e2e failures

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jul 20, 2022
@periklis
Copy link
Contributor Author

/hold cancel

@openshift-ci openshift-ci bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jul 20, 2022
@periklis
Copy link
Contributor Author

/retest-required

@openshift-ci
Copy link
Contributor

openshift-ci bot commented Jul 20, 2022

@periklis: all tests passed!

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@openshift-merge-robot openshift-merge-robot merged commit 8b4e198 into openshift:master Jul 20, 2022
@periklis
Copy link
Contributor Author

/cherry-pick release-5.4

@openshift-cherrypick-robot

@periklis: #903 failed to apply on top of branch "release-5.4":

Applying: Fix ES servicemonitor for user-workload-monitoring
Using index info to reconstruct a base tree...
M	internal/elasticsearch/common.go
M	internal/elasticsearch/configmaps.go
M	internal/elasticsearch/rbac.go
M	internal/elasticsearch/reconciler.go
M	internal/elasticsearch/service_monitor.go
M	internal/elasticsearch/service_monitor_test.go
M	internal/elasticsearch/serviceaccount.go
M	internal/manifests/configmap/configmap.go
M	internal/manifests/secret/secret.go
M	internal/manifests/serviceaccount/serviceaccount.go
Falling back to patching base and 3-way merge...
Auto-merging internal/manifests/serviceaccount/serviceaccount.go
Auto-merging internal/manifests/secret/secret.go
Auto-merging internal/manifests/configmap/configmap.go
CONFLICT (content): Merge conflict in internal/manifests/configmap/configmap.go
Auto-merging internal/elasticsearch/serviceaccount.go
CONFLICT (content): Merge conflict in internal/elasticsearch/serviceaccount.go
Auto-merging internal/elasticsearch/service_monitor_test.go
CONFLICT (content): Merge conflict in internal/elasticsearch/service_monitor_test.go
Auto-merging internal/elasticsearch/service_monitor.go
CONFLICT (content): Merge conflict in internal/elasticsearch/service_monitor.go
Auto-merging internal/elasticsearch/reconciler.go
Auto-merging internal/elasticsearch/rbac.go
CONFLICT (content): Merge conflict in internal/elasticsearch/rbac.go
Auto-merging internal/elasticsearch/configmaps.go
Auto-merging internal/elasticsearch/common.go
error: Failed to merge in the changes.
hint: Use 'git am --show-current-patch=diff' to see the failed patch
Patch failed at 0001 Fix ES servicemonitor for user-workload-monitoring
When you have resolved this problem, run "git am --continue".
If you prefer to skip this patch, run "git am --skip" instead.
To restore the original branch and stop patching, run "git am --abort".

In response to this:

/cherry-pick release-5.4

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. lgtm Indicates that a PR is ready to be merged.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants