
Clean up watch manager #308

Merged
maxsmythe merged 3 commits into open-policy-agent:master from sync-mgr-cleanup on Dec 12, 2019

Conversation

maxsmythe (Contributor)

Fixes #295

Signed-off-by: Max Smythe <smythe@google.com>

    default:
        time.Sleep(5 * time.Second)
        return nil
    case <-ticker.C:
        if _, err := wm.updateOrPause(); err != nil {
            log.Error(err, "error in updateManagerLoop")
Member

return this err

maxsmythe (Contributor, Author), Dec 10, 2019

I don't think we want to do that: transient issues (e.g. a network problem connecting to the API server) can produce an error here, and returning it would crash the server and make the webhook unavailable.

I think it's better to have a graceful-degradation model where the webhook continues to serve and custom watches can potentially recover on the next restart loop. This failure state should likely be detected via Prometheus metrics instead.

WDYT?
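For illustration, here is a minimal, self-contained sketch of the log-and-continue behavior being argued for; the watchManager type, the 5-second ticker, and the simulated transient error are assumptions for the example, not the actual gatekeeper code.

    package main

    import (
        "errors"
        "log"
        "time"
    )

    // watchManager is a stand-in for the real WatchManager; updateOrPause here
    // just simulates an operation that can hit transient API-server errors.
    type watchManager struct{}

    func (wm *watchManager) updateOrPause() (bool, error) {
        return false, errors.New("transient: connection to API server refused")
    }

    func (wm *watchManager) updateManagerLoop(stop <-chan struct{}) {
        ticker := time.NewTicker(5 * time.Second)
        defer ticker.Stop()
        for {
            select {
            case <-stop:
                return
            case <-ticker.C:
                if _, err := wm.updateOrPause(); err != nil {
                    // Log and retry on the next tick instead of returning the
                    // error, which would tear down the manager and the webhook.
                    log.Printf("error in updateManagerLoop: %v", err)
                }
            }
        }
    }

    func main() {
        stop := make(chan struct{})
        go (&watchManager{}).updateManagerLoop(stop)
        time.Sleep(12 * time.Second) // let a couple of ticks fire, then shut down
        close(stop)
    }

The point is that an error only skips the current iteration: the loop, the process, and the webhook all stay up.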

Member

I see. We definitely don't want this to impact the webhook. So what if it continues to fail to restart the watch manager?

maxsmythe (Contributor, Author)

Prometheus alerts. We could export metrics for restart failures. Users should also monitor the status fields of constraints/templates to make sure they are operating properly.
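A rough sketch of what exporting such a metric could look like with the Prometheus Go client; the metric name, help text, and port are hypothetical, not existing gatekeeper metrics.

    package main

    import (
        "log"
        "net/http"

        "github.com/prometheus/client_golang/prometheus"
        "github.com/prometheus/client_golang/prometheus/promhttp"
    )

    // watchManagerRestartFailures is a hypothetical counter; the name and help
    // text are illustrative only.
    var watchManagerRestartFailures = prometheus.NewCounter(prometheus.CounterOpts{
        Name: "watch_manager_restart_failures_total",
        Help: "Number of times the watch manager failed to restart its watches.",
    })

    func init() {
        prometheus.MustRegister(watchManagerRestartFailures)
    }

    // recordRestartFailure would be called from the update loop whenever a watch
    // restart returns an error; operators can then alert on the counter.
    func recordRestartFailure(err error) {
        log.Printf("watch manager restart failed: %v", err)
        watchManagerRestartFailures.Inc()
    }

    func main() {
        // Expose the metric for Prometheus to scrape.
        http.Handle("/metrics", promhttp.Handler())
        log.Fatal(http.ListenAndServe(":8888", nil))
    }

An operator could then alert on any increase in the counter, alongside watching the status fields of constraints and templates.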

Member

We should make sure to test this when we add liveness probes. WDYT?

maxsmythe (Contributor, Author)

Likely not on a continual basis, as failing liveness probes force the server to reboot, which is effectively the same as exit-on-failure.

We can test the initial state on startup indirectly by validating cache warming (sketched below):

  1. List all constraints/templates/cached resources on startup
  2. Do not report as healthy until we validate those resources have been handled
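A hedged sketch of what that startup validation could look like as a readiness check; the readinessTracker type, the resource keys, and the /readyz endpoint are all hypothetical, not part of gatekeeper.

    package main

    import (
        "net/http"
        "sync"
    )

    // readinessTracker records every constraint/template/cached resource listed at
    // startup as "expected" and reports ready only once all of them are handled.
    type readinessTracker struct {
        mu       sync.Mutex
        expected map[string]bool
    }

    func newReadinessTracker(keys []string) *readinessTracker {
        t := &readinessTracker{expected: make(map[string]bool)}
        for _, k := range keys {
            t.expected[k] = true
        }
        return t
    }

    // Observe marks a resource as handled (e.g. its template compiled or its data synced).
    func (t *readinessTracker) Observe(key string) {
        t.mu.Lock()
        defer t.mu.Unlock()
        delete(t.expected, key)
    }

    // Ready reports whether everything listed at startup has been handled.
    func (t *readinessTracker) Ready() bool {
        t.mu.Lock()
        defer t.mu.Unlock()
        return len(t.expected) == 0
    }

    // ServeHTTP lets the tracker back a /readyz endpoint for a readiness probe.
    func (t *readinessTracker) ServeHTTP(w http.ResponseWriter, _ *http.Request) {
        if t.Ready() {
            w.WriteHeader(http.StatusOK)
            return
        }
        w.WriteHeader(http.StatusServiceUnavailable)
    }

    func main() {
        // In real use the keys would come from listing constraints/templates at
        // startup; these two are placeholders.
        tracker := newReadinessTracker([]string{"templates/k8srequiredlabels", "constraints/ns-must-have-gk"})
        http.Handle("/readyz", tracker)
        _ = http.ListenAndServe(":9090", nil)
    }

With a readiness probe pointed at the /readyz endpoint, the pod would not be considered ready until everything listed at startup has been handled.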

Member

Sounds good. @sozercan ^

Fixes open-policy-agent#295

Signed-off-by: Max Smythe <smythe@google.com>
Signed-off-by: Max Smythe <smythe@google.com>
Signed-off-by: Max Smythe <smythe@google.com>
maxsmythe merged commit d07c4bb into open-policy-agent:master on Dec 12, 2019
maxsmythe deleted the sync-mgr-cleanup branch on December 12, 2019 at 22:27
Development

Successfully merging this pull request may close these issues.

WatchManager should implement manager.Runnable interface
2 participants