-
Notifications
You must be signed in to change notification settings - Fork 3.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Prometheus Agent support #5385
Prometheus Agent support #5385
Conversation
079aa0c
to
6bacbb3
Compare
6bacbb3
to
ec9e324
Compare
Currently struggling to fix the failing e2e-test. They fail because the prometheus operator restarts, twice, in the middle of the tests. When reading the logs, I can find failures while trying to update the PrometheusAgent cache, but I couldn't find out why yet 😢 Here are the logs in case anyone is able to easily tell me why 😬 Last 50 logs from the operator during e2e-test failure
|
@ArthurSens I suspect that the prometheus operator doesn't have the right permissions to list/watch/get the PrometheusAgent objects. As what we discussed on monday for #2787, it would be good that the operator starts the agent controller loop only if it has the correct permissions using SelfSubjectAccessReview. |
b1732bf
to
6507964
Compare
e99bb05
to
147c048
Compare
Nice, I was trying to think how to do such thing... SelfSubjectAccessReview worked like a charm :) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very good! Excited to see this merged finally.
@@ -265,6 +266,13 @@ func Main() int { | |||
return 1 | |||
} | |||
|
|||
pao, err := prometheusagentcontroller.New(ctx, cfg, log.With(logger, "component", "prometheusagentoperator"), r) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we might want to add --prometheus-agent-instance-selector
and --prometheus-agent-instance-namespaces
flags but it doesn't have to be in this PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
though it might be ok to use the same flags for both Prometheus and PrometheusAgent...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm leaning a bit more to have them as separate labels but don't have a strong case for it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe the original flags are good enough for agent as well, but I can open an issue and work on it if you folks think that the change is worth it :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMHO it's fine to leave it for now (e.g. prometheus-label-selector and prometheus-instance-namespaces arguments apply to both Prometheus controllers) but the help text needs to be updated to mention it.
// UpdateStatus updates the status subresource of the object identified by the given | ||
// key. | ||
// UpdateStatus implements the operator.Syncer interface. | ||
func (c *Operator) UpdateStatus(ctx context.Context, key string) error { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same remarks here about duplication.
ad4830b
to
f551d16
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
one good test might be to create a prometheus object and a prometheusagent object with the same name in the same namespace and verify that the 2 controllers don't fight for the same resources :)
f551d16
to
269e1cf
Compare
269e1cf
to
606f552
Compare
e5c471e
to
cdf164d
Compare
I'm struggling a bit with this one, even though I added the permissions to
I'm not sure what I'm doing wrong 🤔 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that jsonnet/prometheus-operator/prometheus-operator.libsonnet
needs an update to include all the bits for prometheusagent.
You need to add one line to import prometheusagents-crd.json
here:
prometheus-operator/jsonnet/prometheus-operator/prometheus-operator.libsonnet
Lines 31 to 41 in 2de1db7
'0alertmanagerCustomResourceDefinition': import 'alertmanagers-crd.json', | |
'0alertmanagerConfigCustomResourceDefinition': (import 'alertmanagerconfigs-crd.json') + | |
if po.config.enableAlertmanagerConfigV1beta1 then | |
(import 'alertmanagerconfigs-v1beta1-crd.libsonnet') | |
else {}, | |
'0prometheusCustomResourceDefinition': import 'prometheuses-crd.json', | |
'0servicemonitorCustomResourceDefinition': import 'servicemonitors-crd.json', | |
'0podmonitorCustomResourceDefinition': import 'podmonitors-crd.json', | |
'0probeCustomResourceDefinition': import 'probes-crd.json', | |
'0prometheusruleCustomResourceDefinition': import 'prometheusrules-crd.json', | |
'0thanosrulerCustomResourceDefinition': import 'thanosrulers-crd.json', |
And add prometheusagents
and prometheusagents/status
there:
prometheus-operator/jsonnet/prometheus-operator/prometheus-operator.libsonnet
Lines 72 to 86 in 2de1db7
resources: [ | |
'alertmanagers', | |
'alertmanagers/finalizers', | |
'alertmanagers/status', | |
'alertmanagerconfigs', | |
'prometheuses', | |
'prometheuses/finalizers', | |
'prometheuses/status', | |
'thanosrulers', | |
'thanosrulers/finalizers', | |
'servicemonitors', | |
'podmonitors', | |
'probes', | |
'prometheusrules', | |
], |
Then regenerate all YAML manifests including the bundle.
I'm not sure what the /finalizers subresources are about...
pkg/prometheus/agent/operator.go
Outdated
Spec: authv1.SelfSubjectAccessReviewSpec{ | ||
ResourceAttributes: &authv1.ResourceAttributes{ | ||
Verb: verb, | ||
Group: "monitoring.coreos.com/v1alpha1", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Group: "monitoring.coreos.com/v1alpha1", | |
Group: monitoring.GroupName, |
pkg/prometheus/agent/operator.go
Outdated
ResourceAttributes: &authv1.ResourceAttributes{ | ||
Verb: verb, | ||
Group: "monitoring.coreos.com/v1alpha1", | ||
Resource: "prometheusagents", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Resource: "prometheusagents", | |
Resource: monitoringv1alpha1.PrometheusAgentName, |
Signed-off-by: Arthur Silva Sens <arthursens2005@gmail.com> (cherry picked from commit 5b83359)
Signed-off-by: Arthur Silva Sens <arthursens2005@gmail.com> (cherry picked from commit 6507964)
Signed-off-by: Arthur Silva Sens <arthursens2005@gmail.com>
The server operator already handles it Signed-off-by: ArthurSens <arthursens2005@gmail.com>
Signed-off-by: ArthurSens <arthursens2005@gmail.com>
Signed-off-by: ArthurSens <arthursens2005@gmail.com>
Signed-off-by: ArthurSens <arthursens2005@gmail.com>
…vicemonitor and podmonitor selectors are empty Signed-off-by: ArthurSens <arthursens2005@gmail.com>
Signed-off-by: ArthurSens <arthursens2005@gmail.com>
Signed-off-by: ArthurSens <arthursens2005@gmail.com>
Signed-off-by: Arthur Silva Sens <arthursens2005@gmail.com>
Signed-off-by: Arthur Silva Sens <arthursens2005@gmail.com>
Signed-off-by: Arthur Silva Sens <arthursens2005@gmail.com>
…gent Signed-off-by: Arthur Silva Sens <arthursens2005@gmail.com>
c5cb595
to
23bc229
Compare
Alright, I believe I've addressed everything that couldn't be left to future PRs. The only thing I'm not certain about is how I addressed the OperatorUpgrade e2e test 🤔 |
If the operator's service account has all permissions on the cluster and the CRD isn't installed then the PrometheusAgent controller will run but fail because of the absence of the CRD. Signed-off-by: Simon Pasquier <spasquie@redhat.com>
Signed-off-by: Simon Pasquier <spasquie@redhat.com>
I've tested locally and it worked almost fine!
Otherwise the PR looks good to me. In particular, the operator doesn't mess it up when Prometheus and PrometheusAgent objects with the same name exist in the same namespace which was my biggest concern. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just a small note, but overall lgtm great work!
}, | ||
}, | ||
Selector: map[string]string{ | ||
"app.kubernetes.io/name": "prometheus-agent", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we have a namespace where we deploy more than one Prometheus Agent CR isn't this going to send traffic to all instances? Shouldn't we include the name of the CR ("app.kubernetes.io/instance"
) as we do for the statefulsets?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes in principle though this service is used as the statefulset's governing service which exists for creating DNS entries rather than load-balancing.
It takes the same approach than the Prometheus CRD controller but we might be able to do better: like you said, 2 objects in the same namespace would be aggregated in the same *-operated
service...
Great work @ArthurSens! There are a few follow-up issues to file so we don't forget about them ;) |
Description
This PR implements the PrometheusAgent CRD, following the design document.
Even though we've broken down the implementation is several smaller PRs[1][2][3][4][5], this one still got pretty chunky. Although not perfect, I tried my best to implement it in several small commits. For the reviewers, if reviewing the whole change at once seems overwhelming, I recommend trying commit by commit :)
Type of change
What type of changes does your code introduce to the Prometheus operator? Put an
x
in the box that apply.CHANGE
(fix or feature that would cause existing functionality to not work as expected)FEATURE
(non-breaking change which adds functionality)BUGFIX
(non-breaking change which fixes an issue)ENHANCEMENT
(non-breaking change which improves existing functionality)NONE
(if none of the other choices apply. Example, tooling, build system, CI, docs, etc.)Changelog entry
Please put a one-line changelog entry below. This will be copied to the changelog file during the release process.