
Prometheus Agent support #5385

Merged
26 commits merged from the prometheusagent branch into prometheus-operator:main on Mar 27, 2023

Conversation

@ArthurSens (Member) commented Mar 2, 2023:

Description

This PR implements the PrometheusAgent CRD, following the design document.

Even though we've broken the implementation down into several smaller PRs[1][2][3][4][5], this one still got pretty chunky. Although not perfect, I tried my best to implement it in several small commits. For the reviewers: if reviewing the whole change at once seems overwhelming, I recommend going commit by commit :)

Type of change

What type of changes does your code introduce to the Prometheus operator? Put an x in the boxes that apply.

  • CHANGE (fix or feature that would cause existing functionality to not work as expected)
  • FEATURE (non-breaking change which adds functionality)
  • BUGFIX (non-breaking change which fixes an issue)
  • ENHANCEMENT (non-breaking change which improves existing functionality)
  • NONE (if none of the other choices apply. Example, tooling, build system, CI, docs, etc.)

Changelog entry

Please put a one-line changelog entry below. This will be copied to the changelog file during the release process.

Introduce PrometheusAgent CRD

@ArthurSens force-pushed the prometheusagent branch 2 times, most recently from 079aa0c to 6bacbb3 on March 4, 2023 19:02
@ArthurSens changed the title from "[WIP] Prometheus Agent support" to "Prometheus Agent support" on Mar 16, 2023
@ArthurSens marked this pull request as ready for review on March 16, 2023 18:19
@ArthurSens requested a review from a team as a code owner on March 16, 2023 18:19
@ArthurSens (Member, Author) commented:

Currently struggling to fix the failing e2e tests. They fail because the Prometheus operator restarts, twice, in the middle of the tests.

When reading the logs, I can see failures while the operator tries to update the PrometheusAgent cache, but I couldn't figure out why yet 😢

Here are the logs in case anyone is able to easily tell me why 😬

Last 50 logs from the operator during e2e-test failure
level=debug ts=2023-03-16T19:01:39.675540111Z caller=klog.go:84 component=k8s_client_runtime func=Infof msg="GET https://10.96.0.1:443/api/v1/namespaces/allns-y-promremotewritewithtls-variant-15-rrmni9-0-635c20ef/secrets/prometheus-test 200 OK in 2 milliseconds"
level=debug ts=2023-03-16T19:01:39.679255366Z caller=klog.go:84 component=k8s_client_runtime func=Infof msg="PUT https://10.96.0.1:443/api/v1/namespaces/allns-y-promremotewritewithtls-variant-15-rrmni9-0-635c20ef/secrets/prometheus-test 200 OK in 3 milliseconds"
level=debug ts=2023-03-16T19:01:39.68165243Z caller=klog.go:84 component=k8s_client_runtime func=Infof msg="GET https://10.96.0.1:443/api/v1/namespaces/allns-y-promremotewritewithtls-variant-15-rrmni9-0-635c20ef/secrets/prometheus-test-tls-assets-0 200 OK in 2 milliseconds"
level=debug ts=2023-03-16T19:01:39.685715209Z caller=klog.go:84 component=k8s_client_runtime func=Infof msg="PUT https://10.96.0.1:443/api/v1/namespaces/allns-y-promremotewritewithtls-variant-15-rrmni9-0-635c20ef/secrets/prometheus-test-tls-assets-0 200 OK in 2 milliseconds"
level=debug ts=2023-03-16T19:01:39.691479704Z caller=klog.go:84 component=k8s_client_runtime func=Infof msg="DELETE https://10.96.0.1:443/api/v1/namespaces/allns-y-promremotewritewithtls-variant-15-rrmni9-0-635c20ef/secrets/prometheus-test-tls-assets-1 404 Not Found in 4 milliseconds"
level=debug ts=2023-03-16T19:01:39.693473941Z caller=klog.go:84 component=k8s_client_runtime func=Infof msg="DELETE https://10.96.0.1:443/api/v1/namespaces/allns-y-promremotewritewithtls-variant-15-rrmni9-0-635c20ef/secrets/prometheus-test-tls-assets 404 Not Found in 1 milliseconds"
level=debug ts=2023-03-16T19:01:39.693680755Z caller=operator.go:1717 component=prometheusoperator msg="tls-asset secret: stored"
level=debug ts=2023-03-16T19:01:39.696055218Z caller=klog.go:84 component=k8s_client_runtime func=Infof msg="GET https://10.96.0.1:443/api/v1/namespaces/allns-y-promremotewritewithtls-variant-15-rrmni9-0-635c20ef/secrets/prometheus-test-web-config 200 OK in 1 milliseconds"
level=debug ts=2023-03-16T19:01:39.698843609Z caller=klog.go:84 component=k8s_client_runtime func=Infof msg="PUT https://10.96.0.1:443/api/v1/namespaces/allns-y-promremotewritewithtls-variant-15-rrmni9-0-635c20ef/secrets/prometheus-test-web-config 200 OK in 2 milliseconds"
level=debug ts=2023-03-16T19:01:39.700981856Z caller=klog.go:84 component=k8s_client_runtime func=Infof msg="GET https://10.96.0.1:443/api/v1/namespaces/allns-y-promremotewritewithtls-variant-15-rrmni9-0-635c20ef/services/prometheus-operated 200 OK in 1 milliseconds"
level=debug ts=2023-03-16T19:01:39.704111471Z caller=klog.go:84 component=k8s_client_runtime func=Infof msg="PUT https://10.96.0.1:443/api/v1/namespaces/allns-y-promremotewritewithtls-variant-15-rrmni9-0-635c20ef/services/prometheus-operated 200 OK in 2 milliseconds"
level=debug ts=2023-03-16T19:01:39.705341755Z caller=operator.go:1140 component=prometheusoperator key=allns-y-promremotewritewithtls-variant-15-rrmni9-0-635c20ef/test statefulset=prometheus-test shard=0 msg="reconciling statefulset"
level=debug ts=2023-03-16T19:01:39.706606942Z caller=operator.go:1203 component=prometheusoperator key=allns-y-promremotewritewithtls-variant-15-rrmni9-0-635c20ef/test statefulset=prometheus-test shard=0 msg="new statefulset generation inputs match current, skipping any actions"
level=info ts=2023-03-16T19:01:39.706704049Z caller=operator.go:1290 component=prometheusoperator key=allns-y-promremotewritewithtls-variant-15-rrmni9-0-635c20ef/test msg="update prometheus status"
level=debug ts=2023-03-16T19:01:39.709905668Z caller=klog.go:84 component=k8s_client_runtime func=Infof msg="GET https://10.96.0.1:443/api/v1/namespaces/allns-y-promremotewritewithtls-variant-15-rrmni9-0-635c20ef/pods?labelSelector=app.kubernetes.io%2Finstance%3Dtest%2Capp.kubernetes.io%2Fmanaged-by%3Dprometheus-operator%2Capp.kubernetes.io%2Fname%3Dprometheus%2Coperator.prometheus.io%2Fname%3Dtest%2Coperator.prometheus.io%2Fshard%3D0%2Cprometheus%3Dtest 200 OK in 3 milliseconds"
level=debug ts=2023-03-16T19:01:39.716507721Z caller=klog.go:84 component=k8s_client_runtime func=Infof msg="PUT https://10.96.0.1:443/apis/monitoring.coreos.com/v1/namespaces/allns-y-promremotewritewithtls-variant-15-rrmni9-0-635c20ef/prometheuses/test/status 200 OK in 5 milliseconds"
level=debug ts=2023-03-16T19:01:42.987438545Z caller=klog.go:55 component=k8s_client_runtime func=Verbose.Infof msg="pkg/mod/k8s.io/client-go@v0.26.2/tools/cache/reflector.go:169: Watch close - *v1.ConfigMap total 11 items received"
level=debug ts=2023-03-16T19:01:42.989027654Z caller=klog.go:84 component=k8s_client_runtime func=Infof msg="GET https://10.96.0.1:443/api/v1/configmaps?allowWatchBookmarks=true&labelSelector=thanos-ruler-name&resourceVersion=26268&timeout=9m33s&timeoutSeconds=573&watch=true 200 OK in 1 milliseconds"
level=debug ts=2023-03-16T19:01:51.528708437Z caller=klog.go:55 component=k8s_client_runtime func=Verbose.Infof msg="Listing and watching *v1.PrometheusAgent from pkg/mod/k8s.io/client-go@v0.26.2/tools/cache/reflector.go:169"
level=debug ts=2023-03-16T19:01:51.530467457Z caller=klog.go:84 component=k8s_client_runtime func=Infof msg="GET https://10.96.0.1:443/apis/monitoring.coreos.com/v1/prometheusagents?limit=500&resourceVersion=0 403 Forbidden in 1 milliseconds"
level=warn ts=2023-03-16T19:01:51.530664071Z caller=klog.go:108 component=k8s_client_runtime func=Warningf msg="pkg/mod/k8s.io/client-go@v0.26.2/tools/cache/reflector.go:169: failed to list *v1.PrometheusAgent: prometheusagents.monitoring.coreos.com is forbidden: User \"system:serviceaccount:allns-rrmn1y-0-2af3e0df:prometheus-operator\" cannot list resource \"prometheusagents\" in API group \"monitoring.coreos.com\" at the cluster scope"
level=error ts=2023-03-16T19:01:51.530705274Z caller=klog.go:116 component=k8s_client_runtime func=ErrorDepth msg="pkg/mod/k8s.io/client-go@v0.26.2/tools/cache/reflector.go:169: Failed to watch *v1.PrometheusAgent: failed to list *v1.PrometheusAgent: prometheusagents.monitoring.coreos.com is forbidden: User \"system:serviceaccount:allns-rrmn1y-0-2af3e0df:prometheus-operator\" cannot list resource \"prometheusagents\" in API group \"monitoring.coreos.com\" at the cluster scope"
level=debug ts=2023-03-16T19:01:54.916183261Z caller=resource_reconciler.go:300 component=thanosoperator msg="update handler" old=26283 cur=26329
level=debug ts=2023-03-16T19:01:54.917730267Z caller=resource_reconciler.go:153 component=thanosoperator msg="different resource versions" current=26329 old=26283 object=allns-y-promremotewritewithtls-variant-15-rrmni9-0-635c20ef/prometheus-test
level=debug ts=2023-03-16T19:01:54.921316813Z caller=resource_reconciler.go:300 component=prometheusoperator msg="update handler" old=26283 cur=26329
level=debug ts=2023-03-16T19:01:54.921428821Z caller=resource_reconciler.go:153 component=prometheusoperator msg="different resource versions" current=26329 old=26283 object=allns-y-promremotewritewithtls-variant-15-rrmni9-0-635c20ef/prometheus-test
level=debug ts=2023-03-16T19:01:54.921695039Z caller=resource_reconciler.go:315 component=prometheusoperator msg="StatefulSet updated"
level=info ts=2023-03-16T19:01:54.922650505Z caller=operator.go:1290 component=prometheusoperator key=allns-y-promremotewritewithtls-variant-15-rrmni9-0-635c20ef/test msg="update prometheus status"
level=debug ts=2023-03-16T19:01:54.924672043Z caller=resource_reconciler.go:300 component=alertmanageroperator msg="update handler" old=26283 cur=26329
level=debug ts=2023-03-16T19:01:54.92491286Z caller=resource_reconciler.go:153 component=alertmanageroperator msg="different resource versions" current=26329 old=26283 object=allns-y-promremotewritewithtls-variant-15-rrmni9-0-635c20ef/prometheus-test
level=debug ts=2023-03-16T19:01:54.925171177Z caller=operator.go:543 component=alertmanageroperator msg="StatefulSet key did not match an Alertmanager key format" key=allns-y-promremotewritewithtls-variant-15-rrmni9-0-635c20ef/prometheus-test
level=debug ts=2023-03-16T19:01:54.92900004Z caller=klog.go:84 component=k8s_client_runtime func=Infof msg="GET https://10.96.0.1:443/api/v1/namespaces/allns-y-promremotewritewithtls-variant-15-rrmni9-0-635c20ef/pods?labelSelector=app.kubernetes.io%2Finstance%3Dtest%2Capp.kubernetes.io%2Fmanaged-by%3Dprometheus-operator%2Capp.kubernetes.io%2Fname%3Dprometheus%2Coperator.prometheus.io%2Fname%3Dtest%2Coperator.prometheus.io%2Fshard%3D0%2Cprometheus%3Dtest 200 OK in 4 milliseconds"
level=debug ts=2023-03-16T19:01:54.938075962Z caller=klog.go:84 component=k8s_client_runtime func=Infof msg="PUT https://10.96.0.1:443/apis/monitoring.coreos.com/v1/namespaces/allns-y-promremotewritewithtls-variant-15-rrmni9-0-635c20ef/prometheuses/test/status 200 OK in 7 milliseconds"
level=debug ts=2023-03-16T19:02:01.987698191Z caller=klog.go:55 component=k8s_client_runtime func=Verbose.Infof msg="pkg/mod/k8s.io/client-go@v0.26.2/tools/cache/reflector.go:169: Watch close - *v1.PrometheusRule total 31 items received"
level=debug ts=2023-03-16T19:02:01.990288939Z caller=klog.go:84 component=k8s_client_runtime func=Infof msg="GET https://10.96.0.1:443/apis/monitoring.coreos.com/v1/prometheusrules?allowWatchBookmarks=true&resourceVersion=26334&timeout=7m53s&timeoutSeconds=473&watch=true 200 OK in 2 milliseconds"
level=debug ts=2023-03-16T19:02:04.818718111Z caller=klog.go:55 component=k8s_client_runtime func=Verbose.Infof msg="stop requested"
level=warn ts=2023-03-16T19:02:04.818739712Z caller=operator.go:387 component=prometheusagentoperator informer=PrometheusAgent msg="cache sync not yet completed"
level=error ts=2023-03-16T19:02:04.818774215Z caller=klog.go:116 component=k8s_client_runtime func=ErrorDepth msg="unable to sync caches for prometheusagent"
level=error ts=2023-03-16T19:02:04.818795316Z caller=operator.go:396 component=prometheusagentoperator informer=PrometheusAgent msg="failed to sync cache"
level=debug ts=2023-03-16T19:02:04.819281649Z caller=klog.go:55 component=k8s_client_runtime func=Verbose.Infof msg="Stopping reflector *v1.ThanosRuler (5m0s) from pkg/mod/k8s.io/client-go@v0.26.2/tools/cache/reflector.go:169"
level=debug ts=2023-03-16T19:02:04.819331753Z caller=klog.go:55 component=k8s_client_runtime func=Verbose.Infof msg="Stopping reflector *v1.PrometheusAgent (5m0s) from pkg/mod/k8s.io/client-go@v0.26.2/tools/cache/reflector.go:169"
level=debug ts=2023-03-16T19:02:04.819391757Z caller=klog.go:55 component=k8s_client_runtime func=Verbose.Infof msg="Stopping reflector *v1.Namespace (5m0s) from pkg/mod/k8s.io/client-go@v0.26.2/tools/cache/reflector.go:169"
level=debug ts=2023-03-16T19:02:04.819464262Z caller=klog.go:55 component=k8s_client_runtime func=Verbose.Infof msg="Stopping reflector *v1.Probe (5m0s) from pkg/mod/k8s.io/client-go@v0.26.2/tools/cache/reflector.go:169"
level=debug ts=2023-03-16T19:02:04.819488863Z caller=klog.go:55 component=k8s_client_runtime func=Verbose.Infof msg="Stopping reflector *v1.ConfigMap (5m0s) from pkg/mod/k8s.io/client-go@v0.26.2/tools/cache/reflector.go:169"
level=debug ts=2023-03-16T19:02:04.819524966Z caller=klog.go:55 component=k8s_client_runtime func=Verbose.Infof msg="Stopping reflector *v1.PrometheusRule (5m0s) from pkg/mod/k8s.io/client-go@v0.26.2/tools/cache/reflector.go:169"
level=debug ts=2023-03-16T19:02:04.819561968Z caller=klog.go:55 component=k8s_client_runtime func=Verbose.Infof msg="Stopping reflector *v1.ConfigMap (5m0s) from pkg/mod/k8s.io/client-go@v0.26.2/tools/cache/reflector.go:169"
level=debug ts=2023-03-16T19:02:04.81958257Z caller=klog.go:55 component=k8s_client_runtime func=Verbose.Infof msg="Stopping reflector *v1.Prometheus (5m0s) from pkg/mod/k8s.io/client-go@v0.26.2/tools/cache/reflector.go:169"
level=debug ts=2023-03-16T19:02:04.819658975Z caller=klog.go:55 component=k8s_client_runtime func=Verbose.Infof msg="Stopping reflector *v1.ConfigMap (5m0s) from pkg/mod/k8s.io/client-go@v0.26.2/tools/cache/reflector.go:169"
level=debug ts=2023-03-16T19:02:04.819721879Z caller=klog.go:55 component=k8s_client_runtime func=Verbose.Infof msg="Stopping reflector *v1.Namespace (5m0s) from pkg/mod/k8s.io/client-go@v0.26.2/tools/cache/reflector.go:169"
level=debug ts=2023-03-16T19:02:04.819817786Z caller=klog.go:55 component=k8s_client_runtime func=Verbose.Infof msg="Stopping reflector *v1.Namespace (5m0s) from pkg/mod/k8s.io/client-go@v0.26.2/tools/cache/reflector.go:169"
level=warn ts=2023-03-16T19:02:04.845799851Z caller=main.go:429 msg="Unhandled error received. Exiting..." err="failed to sync cache for PrometheusAgent informer"

@simonpasquier (Contributor) commented:

@ArthurSens I suspect that the Prometheus operator doesn't have the right permissions to list/watch/get the PrometheusAgent objects. As we discussed on Monday for #2787, it would be good if the operator started the agent controller loop only when it has the correct permissions, verified with a SelfSubjectAccessReview.

@ArthurSens (Member, Author) commented:

> @ArthurSens I suspect that the Prometheus operator doesn't have the right permissions to list/watch/get the PrometheusAgent objects. As we discussed on Monday for #2787, it would be good if the operator started the agent controller loop only when it has the correct permissions, verified with a SelfSubjectAccessReview.

Nice, I was trying to figure out how to do such a thing... SelfSubjectAccessReview worked like a charm :)
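
For readers following along, here is a minimal sketch of the SelfSubjectAccessReview approach, assuming client-go; the helper name canOperatePrometheusAgents is hypothetical and this is not the PR's exact code:

import (
	"context"
	"fmt"

	authv1 "k8s.io/api/authorization/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// canOperatePrometheusAgents asks the API server whether the operator's own
// service account may get/list/watch PrometheusAgent objects.
func canOperatePrometheusAgents(ctx context.Context, client kubernetes.Interface) (bool, error) {
	for _, verb := range []string{"get", "list", "watch"} {
		ssar := &authv1.SelfSubjectAccessReview{
			Spec: authv1.SelfSubjectAccessReviewSpec{
				ResourceAttributes: &authv1.ResourceAttributes{
					Verb:     verb,
					Group:    "monitoring.coreos.com",
					Resource: "prometheusagents",
				},
			},
		}
		res, err := client.AuthorizationV1().SelfSubjectAccessReviews().Create(ctx, ssar, metav1.CreateOptions{})
		if err != nil {
			return false, fmt.Errorf("creating SelfSubjectAccessReview: %w", err)
		}
		if !res.Status.Allowed {
			// Permission missing: the caller can skip starting the agent
			// controller instead of crash-looping on forbidden list/watch calls.
			return false, nil
		}
	}
	return true, nil
}

This way a missing ClusterRole rule degrades to a disabled controller rather than the repeated 403s and failed cache syncs seen in the logs above.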

@simonpasquier (Contributor) left a comment:

Very good! Excited to finally see this merged.

Resolved review thread (outdated): pkg/prometheus/agent/operator.go
@@ -265,6 +266,13 @@ func Main() int {
		return 1
	}

	pao, err := prometheusagentcontroller.New(ctx, cfg, log.With(logger, "component", "prometheusagentoperator"), r)
Contributor: We might want to add --prometheus-agent-instance-selector and --prometheus-agent-instance-namespaces flags, but it doesn't have to be in this PR.

Contributor: Though it might be OK to use the same flags for both Prometheus and PrometheusAgent...

Contributor: I'm leaning a bit more toward having them as separate flags, but I don't have a strong case for it.

Member Author: I believe the original flags are good enough for the agent as well, but I can open an issue and work on it if you folks think the change is worth it :)

Contributor: IMHO it's fine to leave it for now (e.g. the prometheus-label-selector and prometheus-instance-namespaces arguments apply to both Prometheus controllers), but the help text needs to be updated to mention it.

Resolved review threads: pkg/apis/monitoring/v1/prometheusagent_types.go (outdated) and pkg/prometheus/agent/operator.go (several, some outdated).
// UpdateStatus updates the status subresource of the object identified by the given
// key.
// UpdateStatus implements the operator.Syncer interface.
func (c *Operator) UpdateStatus(ctx context.Context, key string) error {
Contributor: Same remarks here about duplication.

Resolved review threads (outdated): pkg/prometheus/agent/statefulset.go
@ArthurSens force-pushed the prometheusagent branch 4 times, most recently from ad4830b to f551d16 on March 17, 2023 23:51
@simonpasquier (Contributor) left a comment:

One good test might be to create a Prometheus object and a PrometheusAgent object with the same name in the same namespace and verify that the two controllers don't fight over the same resources :)

Resolved review threads: pkg/prometheus/agent/operator.go and pkg/prometheus/agent/statefulset.go (several, mostly outdated).
@ArthurSens (Member, Author) commented Mar 20, 2023:

> One good test might be to create a Prometheus object and a PrometheusAgent object with the same name in the same namespace and verify that the two controllers don't fight over the same resources :)

I'm struggling a bit with this one. Even though I added the permissions to examples/rbac/prometheus-operator/prometheus-operator-cluster-role.yaml in cdf164d, the e2e tests are still failing and I'm seeing this in the logs:

level=info ts=2023-03-20T22:13:11.304683967Z caller=operator.go:314 component=prometheusagentoperator msg="Prometheus agent controller disabled because it lacks the required permissions on PrometheusAgent objects."

I'm not sure what I'm doing wrong 🤔

@simonpasquier (Contributor) left a comment:

I think that jsonnet/prometheus-operator/prometheus-operator.libsonnet needs an update to include all the bits for prometheusagent.

You need to add one line to import prometheusagents-crd.json here:

'0alertmanagerCustomResourceDefinition': import 'alertmanagers-crd.json',
'0alertmanagerConfigCustomResourceDefinition': (import 'alertmanagerconfigs-crd.json') +
  if po.config.enableAlertmanagerConfigV1beta1 then
    (import 'alertmanagerconfigs-v1beta1-crd.libsonnet')
  else {},
'0prometheusCustomResourceDefinition': import 'prometheuses-crd.json',
'0servicemonitorCustomResourceDefinition': import 'servicemonitors-crd.json',
'0podmonitorCustomResourceDefinition': import 'podmonitors-crd.json',
'0probeCustomResourceDefinition': import 'probes-crd.json',
'0prometheusruleCustomResourceDefinition': import 'prometheusrules-crd.json',
'0thanosrulerCustomResourceDefinition': import 'thanosrulers-crd.json',

And add prometheusagents and prometheusagents/status there:

resources: [
  'alertmanagers',
  'alertmanagers/finalizers',
  'alertmanagers/status',
  'alertmanagerconfigs',
  'prometheuses',
  'prometheuses/finalizers',
  'prometheuses/status',
  'thanosrulers',
  'thanosrulers/finalizers',
  'servicemonitors',
  'podmonitors',
  'probes',
  'prometheusrules',
],

Then regenerate all YAML manifests including the bundle.

I'm not sure what the /finalizers subresources are about...
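
Concretely, the two additions described above might look like this; a sketch only, and the '0prometheusagentCustomResourceDefinition' key name is an assumption that follows the file's existing naming pattern:

// In the CRD map:
'0prometheusagentCustomResourceDefinition': import 'prometheusagents-crd.json',

// In the ClusterRole rule's resources list:
'prometheusagents',
'prometheusagents/status',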

Resolved review thread (outdated): pkg/prometheus/statefulset.go
Spec: authv1.SelfSubjectAccessReviewSpec{
	ResourceAttributes: &authv1.ResourceAttributes{
		Verb: verb,
		Group: "monitoring.coreos.com/v1alpha1",

Contributor, suggested change:
- Group: "monitoring.coreos.com/v1alpha1",
+ Group: monitoring.GroupName,

ResourceAttributes: &authv1.ResourceAttributes{
	Verb: verb,
	Group: "monitoring.coreos.com/v1alpha1",
	Resource: "prometheusagents",

Contributor, suggested change:
- Resource: "prometheusagents",
+ Resource: monitoringv1alpha1.PrometheusAgentName,

Resolved review threads (outdated): pkg/prometheus/agent/operator.go
(Commit timeline: a series of commits signed off by Arthur Silva Sens, including cherry-picks of 5b83359 and 6507964; commit subjects are truncated in the original.)
@ArthurSens (Member, Author) commented Mar 23, 2023:

Alright, I believe I've addressed everything that couldn't be left to future PRs. The only thing I'm not certain about is how I addressed the OperatorUpgrade e2e test 🤔

If the operator's service account has all permissions on the cluster and the CRD isn't installed, the PrometheusAgent controller will run but fail because of the absence of the CRD.

Signed-off-by: Simon Pasquier <spasquie@redhat.com>
Signed-off-by: Simon Pasquier <spasquie@redhat.com>
@simonpasquier (Contributor) commented:

I've tested locally and it worked almost fine!
I added two small commits:

  • 84d2ca7, which checks that the PrometheusAgent CRD exists (when testing, my account had all permissions on the cluster, hence checkAgentAuthorization() returned true, but the controller was stuck; see the sketch below).
  • 4eddf7e, which creates a dedicated governing service for the Prometheus agent.

Otherwise the PR looks good to me. In particular, the operator doesn't get confused when Prometheus and PrometheusAgent objects with the same name exist in the same namespace, which was my biggest concern.
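
For illustration, such a CRD-availability check can be sketched with the discovery client; this is an assumption about the approach, not necessarily what 84d2ca7 does, and the function name is hypothetical:

import (
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	"k8s.io/client-go/kubernetes"
)

// prometheusAgentCRDInstalled reports whether the API server actually serves
// the prometheusagents resource, i.e. whether the CRD is installed.
func prometheusAgentCRDInstalled(client kubernetes.Interface) (bool, error) {
	list, err := client.Discovery().ServerResourcesForGroupVersion("monitoring.coreos.com/v1alpha1")
	if err != nil {
		if apierrors.IsNotFound(err) {
			return false, nil // group/version not served, so the CRD is absent
		}
		return false, err
	}
	for _, r := range list.APIResources {
		if r.Name == "prometheusagents" {
			return true, nil
		}
	}
	return false, nil
}

Pairing this with the authorization check covers the scenario in the commit message above: full RBAC but no CRD, which would otherwise leave the informer stuck waiting for a cache sync.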

@JoaoBraveCoding (Contributor) left a comment:

Just a small note, but overall LGTM. Great work!

	},
},
Selector: map[string]string{
	"app.kubernetes.io/name": "prometheus-agent",
Contributor:

If we have a namespace where we deploy more than one PrometheusAgent CR, isn't this going to send traffic to all instances? Shouldn't we include the name of the CR ("app.kubernetes.io/instance") as we do for the statefulsets?

Contributor:

Yes in principle, though this service is used as the statefulset's governing service, which exists to create DNS entries rather than to load-balance. It takes the same approach as the Prometheus CRD controller, but we might be able to do better: like you said, two objects in the same namespace would be aggregated into the same *-operated service...
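
For what it's worth, a per-instance selector along the suggested lines could look like this hypothetical helper (a sketch, not what the PR implements):

// governingServiceSelector targets a single PrometheusAgent instance rather
// than every agent pod in the namespace.
func governingServiceSelector(crName string) map[string]string {
	return map[string]string{
		"app.kubernetes.io/name":     "prometheus-agent",
		"app.kubernetes.io/instance": crName, // the CR's name, matching the statefulset's pod labels
	}
}

Trade-off: per-instance services give cleaner targeting, while a single *-operated service keeps the governing-service pattern the Prometheus controller already uses for DNS.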

@simonpasquier merged commit cc47b1e into prometheus-operator:main on Mar 27, 2023
@simonpasquier (Contributor) commented:

Great work @ArthurSens! There are a few follow-up issues to file so we don't forget about them ;)
