Clean up watch manager #308
Conversation
```go
default:
	time.Sleep(5 * time.Second)
	return nil
case <-ticker.C:
	if _, err := wm.updateOrPause(); err != nil {
		log.Error(err, "error in updateManagerLoop")
```
return this err
I don't think we want to do that, as it is possible for transient errors (e.g. network issue connecting to the API server) to cause an error, which would then cause the server to crash. This would cause the webhook to become unavailable.
I think it's better to have a graceful degradation model where the webhook continues to serve and custom watches could potentially recover on the next restart loop. This failure state should likely be surfaced via Prometheus metrics.
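Roughly, a minimal sketch of that log-and-continue shape (everything here besides `updateOrPause` and the error log from the diff above is illustrative; the real loop in this PR may differ):

```go
package watch

import (
	"time"

	"github.com/go-logr/logr"
)

// WatchManager is a stub standing in for the real manager type; only what
// this sketch needs is shown.
type WatchManager struct {
	log logr.Logger
}

// updateOrPause stands in for the method from the diff; its real signature
// may differ.
func (wm *WatchManager) updateOrPause() (bool, error) { return false, nil }

// updateManagerLoop logs transient errors and keeps ticking, so a failed
// update cannot take the webhook down with it.
func (wm *WatchManager) updateManagerLoop(stop <-chan struct{}) {
	ticker := time.NewTicker(5 * time.Second)
	defer ticker.Stop()
	for {
		select {
		case <-stop:
			return
		case <-ticker.C:
			if _, err := wm.updateOrPause(); err != nil {
				// Swallow the error: the next tick may recover, and
				// persistent failures should be surfaced via metrics.
				wm.log.Error(err, "error in updateManagerLoop")
			}
		}
	}
}
```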
WDYT?
I see. We definitely don't want this to impact the webhook. So what if it continues to fail to restart the watch manager?
Prometheus alerts. We could export metrics for restart failures. Users should also monitor the status fields of constraints/templates to make sure they are operating properly.
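For instance, a hypothetical sketch using prometheus/client_golang (the metric name and wiring are made up here, not part of this PR):

```go
package watch

import "github.com/prometheus/client_golang/prometheus"

// watchManagerRestartFailures counts failed attempts to restart the watch
// manager, so an alert can fire if failures persist. The metric name is
// illustrative.
var watchManagerRestartFailures = prometheus.NewCounter(prometheus.CounterOpts{
	Name: "watch_manager_restart_failures_total",
	Help: "Total number of failed watch manager restart attempts.",
})

func init() {
	prometheus.MustRegister(watchManagerRestartFailures)
}

// recordRestartFailure would be called from the restart loop whenever
// updateOrPause (or its equivalent) returns an error.
func recordRestartFailure() {
	watchManagerRestartFailures.Inc()
}
```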
We should make sure to test this when we add liveness probes. WDYT?
Likely not on a continual basis, as failing liveness probes force the server to reboot, which is effectively the same as exit-on-failure.
We can test the initial state on startup indirectly by validating cache warming (sketched after this list):
- List all constraints/templates/cached resources on startup
- Do not report as healthy until we validate those resources have been handled
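A rough sketch of that gating, assuming a plain HTTP readiness endpoint (the handler and the `markCacheWarmed` call site are illustrative, not this PR's code):

```go
package watch

import (
	"net/http"
	"sync/atomic"
)

// ready flips to 1 once the initial list of constraints/templates/cached
// resources has been handled.
var ready int32

// markCacheWarmed would be called after the startup listing is processed.
func markCacheWarmed() {
	atomic.StoreInt32(&ready, 1)
}

// readyzHandler reports healthy only after the cache has been warmed, so
// the pod is not marked Ready while watches are still being established.
func readyzHandler(w http.ResponseWriter, _ *http.Request) {
	if atomic.LoadInt32(&ready) == 0 {
		http.Error(w, "cache not yet warmed", http.StatusServiceUnavailable)
		return
	}
	w.WriteHeader(http.StatusOK)
}
```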
Sounds good. @sozercan ^
Fixes open-policy-agent#295
Signed-off-by: Max Smythe <smythe@google.com>
Force-pushed from c8b42cf to 72ea3ad