Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

enhancement: Replace UpdateStatus() for ApplyStatus() to avoid informer errors #5883

Merged

Conversation

marioferh
Copy link
Contributor

@marioferh marioferh commented Sep 4, 2023

Description

To set the status subresource, the operator make use of ApplyStatus() instead of UpdateStatus() to avoid informer errors.

Type of change

  • CHANGE (fix or feature that would cause existing functionality to not work as expected)
  • FEATURE (non-breaking change which adds functionality)
  • BUGFIX (non-breaking change which fixes an issue)
  • ENHANCEMENT (non-breaking change which improves existing functionality)
  • NONE (if none of the other choices apply. Example, tooling, build system, CI, docs, etc.)

Changelog entry

  • replace UpdateStatus() for ApplyStatus() in order to avoid informer errors.

@marioferh marioferh requested a review from a team as a code owner September 4, 2023 08:34
pkg/alertmanager/operator.go Outdated Show resolved Hide resolved
pkg/alertmanager/operator.go Outdated Show resolved Hide resolved
pkg/prometheus/agent/operator.go Outdated Show resolved Hide resolved
@simonpasquier
Copy link
Contributor

To avoid too many back-and-forth with the e2e tests you might want to modify only one controller (e.g. Alertmanager) first.

Copy link
Contributor

@danielmellado danielmellado left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

getting a bit cryptic with all these 'trac', 'aac' and so but they come from the applyconfiguration changes 🗡️

Copy link
Contributor

@simonpasquier simonpasquier left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The call to stsReporter.Update() at L806 also sets some of the status fields so we need to adapt it.

pkg/alertmanager/operator.go Outdated Show resolved Hide resolved
pkg/alertmanager/operator.go Show resolved Hide resolved
@marioferh marioferh force-pushed the replace_UpdateStatus branch 3 times, most recently from e50417b to f3d7700 Compare September 7, 2023 16:29
Signed-off-by: Mario Fernandez <mariofer@redhat.com>
@ArthurSens
Copy link
Member

👋 -- Just out of curiosity, what errors are we seeing in UpdateStatus() that we're fixing here? :)

@marioferh
Copy link
Contributor Author

👋 -- Just out of curiosity, what errors are we seeing in UpdateStatus() that we're fixing here? :)

if _, err = c.mclient.MonitoringV1().Alertmanagers(a.Namespace).UpdateStatus(ctx, a, metav1.UpdateOptions{}); err != nil { return errors.Wrap(err, "failed to update status subresource") }

Trying to update the operator status, the error: "failed to update status subresource" is shown and could be confused.

@simonpasquier
Copy link
Contributor

To add more context, we're seeing this type of messages in the OpenShift's prometheus operator logs (I've seen in plain Kind clusters too):

level=error ts=2023-09-11T07:11:06.75456597Z caller=klog.go:116 component=k8s_client_runtime func=ErrorDepth msg="status \"prominstancens-allowlist-s0t7xg-2-137e5aff/instance\" failed: faile
d to update prometheus status subresource: Operation cannot be fulfilled on prometheuses.monitoring.coreos.com \"instance\": the object has been modified; please apply your changes to the la
test version and try again"

My guess is that since UpdateStatus() reads the Prometheus object from the cache, it may have a slightly outdated version with an old versionNumber. While the status eventually converges (e.g. the operator will requeue the status update), it's somehow disturbing for users to see errors in the logs. Using an apply operation of update should get rid of this annoyance.

@ArthurSens
Copy link
Member

My guess is that since UpdateStatus() reads the Prometheus object from the cache, it may have a slightly outdated version with an old versionNumber. While the status eventually converges (e.g. the operator will requeue the status update), it's somehow disturbing for users to see errors in the logs. Using an apply operation of update should get rid of this annoyance.

Would it be possible to have the Status set to an older version if we use apply, even if for a little bit?

I don't really see a problem with that, just thinking out loud at the moment, but do you see that as a problem?

@simonpasquier
Copy link
Contributor

@ArthurSens my assumption is that the update will retrigger a status update if required. In most cases, it should be a no-op.

pkg/alertmanager/operator.go Outdated Show resolved Hide resolved
pkg/alertmanager/operator.go Outdated Show resolved Hide resolved
pkg/alertmanager/operator.go Outdated Show resolved Hide resolved
marioferh and others added 2 commits September 11, 2023 16:12
Co-authored-by: Simon Pasquier <spasquie@redhat.com>
change fieldManager to const andd add force true

Signed-off-by: Mario Fernandez <mariofer@redhat.com>
pkg/alertmanager/operator.go Outdated Show resolved Hide resolved
pkg/alertmanager/operator.go Outdated Show resolved Hide resolved
Signed-off-by: Mario Fernandez <mariofer@redhat.com>
@@ -53,7 +54,8 @@ import (
)

const (
resyncPeriod = 5 * time.Minute
resyncPeriod = 5 * time.Minute
prometheusOperatorFieldManager = "PrometheusOperator"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: It doesn't hurt, but if it has only one usage I wouldn't bother creating a var for it. Do you plan to use this in any follow-up PRs? Don't bother changing it back just because of this, just curious.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The same change will be applied to the Prometheus and Thanos controllers so we'll need a constant in the end.

Copy link
Contributor

@simonpasquier simonpasquier left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great! Can you proceed with the other UpdateStatus() functions?

@@ -53,7 +54,8 @@ import (
)

const (
resyncPeriod = 5 * time.Minute
resyncPeriod = 5 * time.Minute
prometheusOperatorFieldManager = "PrometheusOperator"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The same change will be applied to the Prometheus and Thanos controllers so we'll need a constant in the end.

@simonpasquier
Copy link
Contributor

Clarified offline with @marioferh. Let's merge this and PRs will follow for the other controllers.

@simonpasquier simonpasquier enabled auto-merge (squash) September 13, 2023 09:59
Copy link
Contributor

@simonpasquier simonpasquier left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

@simonpasquier simonpasquier merged commit 60368c1 into prometheus-operator:main Sep 13, 2023
17 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants