New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fixes race condition when deleting alertmanager, prometheus and thanosruler instances #5954
Conversation
Thanks for the PR! For completeness can you share the operator's version where the issue happens? |
I've tested and experienced it on
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ideally we'd need the same for pkg/prometheus/agent/operator.go and pkg/thanos/operator.go.
pkg/alertmanager/operator.go
Outdated
@@ -644,6 +644,11 @@ func (c *Operator) sync(ctx context.Context, key string) error { | |||
return nil | |||
} | |||
|
|||
// Check if the Alertmanager instance is marked for deletion. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd suggest to get inspiration from the UpdateStatus() method and do the check at the very beginning of the function.
prometheus-operator/pkg/alertmanager/operator.go
Lines 782 to 789 in 535cd8c
a, err := c.getAlertmanagerFromKey(key) | |
if err != nil { | |
return err | |
} | |
if a == nil || c.rr.DeletionInProgress(a) { | |
return nil | |
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for your comment. I've added thanos and the agent, reused the DeletionInProgress function and moved the checks as high as possible.
pkg/prometheus/agent/operator.go
Outdated
|
||
// Check if the Agent instance is marked for deletion. | ||
if c.rr.DeletionInProgress(p) { | ||
level.Info(logger).Log("msg", "the resource is deleting, not reconciling") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(nit) c.rr.DeletionInProgress
already logs (though at debug level)
level.Info(logger).Log("msg", "the resource is deleting, not reconciling") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
removed
pkg/prometheus/server/operator.go
Outdated
@@ -1166,6 +1166,12 @@ func (c *Operator) sync(ctx context.Context, key string) error { | |||
logger := log.With(c.logger, "key", key) | |||
logDeprecatedFields(logger, p) | |||
|
|||
// Check if the Prometheus instance is marked for deletion. | |||
if c.rr.DeletionInProgress(p) { | |||
level.Info(logger).Log("msg", "the resource is deleting, not reconciling") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(nit) same
level.Info(logger).Log("msg", "the resource is deleting, not reconciling") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
removed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks!
Description
A race condition in appears when using Pulumi to delete a prometheus instance custom resouce from a Kubernetes cluster.
I've created a demo repo https://github.com/mheers/pulumi-prometheus-operator-race-condition to reproduce the behaviour.
Deleting the CR with pulumi will get the operator stuck in a loop. The operator will try to delete the prometheus instance but will fail with errors exported to https://github.com/mheers/pulumi-prometheus-operator-race-condition/blob/main/logs/operator.log
Type of change
CHANGE
(fix or feature that would cause existing functionality to not work as expected)FEATURE
(non-breaking change which adds functionality)BUGFIX
(non-breaking change which fixes an issue)ENHANCEMENT
(non-breaking change which improves existing functionality)NONE
(if none of the other choices apply. Example, tooling, build system, CI, docs, etc.)Changelog entry