Webhook YAML causes constant config drift (reopened) #13449
Looking at the original issue, @dprotaso's rationale for closing was that "this drift is unfortunately unavoidable." I think the recommendation in the original issue is probably still the same today: to look into your tooling's options to handle merges between what's declared and what's present on the API server.
Hi @psschwei

Again, to point out: I've never run into other Kubernetes manifests or Helm charts that do this. My view is that it is probably an anti-pattern to specify a spec as X at deploy time and then modify the spec to Y at runtime. If you want to do this, configure nothing for the rules in the manifest. For declarative configuration, sure, there are things that get set and even change at runtime; the solution is just not to set those in the declarative part of the configuration. I don't see this as a technical issue; let me know if I'm missing something.

Based on some of the prior comments (i.e. settings …), a good example of this is right here in knative/serving: serving/config/core/300-secret.yaml, line 32 at deed9f4.
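To illustrate the pattern being pointed at, here is a sketch of how such a Secret can be declared (illustrative only, not the exact file contents; the name and labels are assumptions):

```yaml
# Illustrative sketch of the pattern: the Secret is declared with no
# `data` key, so the certificate material the controller populates at
# runtime never drifts against the applied manifest.
apiVersion: v1
kind: Secret
metadata:
  name: webhook-certs        # assumed name, for illustration
  namespace: knative-serving
# No `data:` here -- the contents are filled in at runtime and are
# intentionally not part of the declarative configuration.
```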
That said, I propose knative/serving should just expand the rules in the manifest. It's not really helping anything to keep them unexpanded, though I would agree the unexpanded form is easier for humans to read. I've created PR #13450 to fix this.
That's not how Kubernetes works: any admission webhook can mutate your resources (i.e. defaulting, applying opinions, etc.), and the end result is different from the YAML that was applied. Even Microsoft AKS modifies mutating webhooks to include additional selectors: Azure/AKS#1771. You need to set up GitOps tooling to accommodate this. Regarding …
I'm going to close this out as it's working as we expect.
Hi @dprotaso

I still feel like there is a misunderstanding here. Modifying / mutating a resource is fine, but the anti-pattern here is defining your manifests as X and then letting the controller modify them at runtime to Y. I also feel you're referring to mutating resources as being an issue, which is not the issue here. The issue is defining keys and values in one place for install, and changing them in another place later on. This anti-pattern would be more easily understood as problematic if it were applied to something like a …

There are really 3 ways to approach this:
So, since you rejected a synchronized resource manifest: why are the rules even configured in the manifest at all if you want the controller to be authoritative over the webhook rules? Why not just remove the rules altogether? This would make it clear that rules are not declared as part of the install (the controller will manage them) and avoids config drift against the stored manifests. I've opened PR #13453 to remove the inaccurate placeholder values, giving the controller complete control over webhook rules.
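A hedged sketch of what this proposal amounts to (hypothetical, not the actual PR #13453 diff; the service name is assumed): ship the webhook configuration without any rules, so the controller owns them from the start.

```yaml
# Hypothetical sketch: declare the webhook with `rules` omitted,
# leaving the controller authoritative over them from the start.
apiVersion: admissionregistration.k8s.io/v1
kind: MutatingWebhookConfiguration
metadata:
  name: webhook.domainmapping.serving.knative.dev
webhooks:
  - name: webhook.domainmapping.serving.knative.dev
    admissionReviewVersions: ["v1"]
    sideEffects: None
    failurePolicy: Fail
    clientConfig:
      service:
        name: domainmapping-webhook   # assumed service name
        namespace: knative-serving
    # `rules` intentionally omitted: with no rules, the webhook matches
    # nothing until the controller reconciles the real rules in, and
    # there is nothing in the manifest to drift.
```

The trade-off, as the maintainers note below, is that an empty rule set leaves a window during install/upgrade where nothing is intercepted.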
/reopen
@mbrancato: Reopened this issue. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Context is here: #7576. If we remove the rules, there's a window during install & upgrade that allows folks to create resources where the webhook isn't hooked up. This means the API server can accept Knative resources that are not valid. So we have this initial seed of rules to prevent that race.
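The seed described above can be sketched roughly as follows (illustrative values, not the actual Knative manifest; names are assumptions): a broad rule present at install time, combined with failurePolicy: Fail, makes the API server reject Knative resources while the webhook is not yet serving, instead of silently admitting invalid ones.

```yaml
# Illustrative seed rule (not the real manifest): present from the
# moment of `kubectl apply`, it closes the window where resources could
# be admitted before the controller reconciles the real rules.
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingWebhookConfiguration
metadata:
  name: validation.webhook.serving.knative.dev
webhooks:
  - name: validation.webhook.serving.knative.dev
    admissionReviewVersions: ["v1"]
    sideEffects: None
    failurePolicy: Fail   # reject requests while the webhook is unreachable
    clientConfig:
      service:
        name: webhook               # assumed service name
        namespace: knative-serving
    rules:
      - apiGroups: ["serving.knative.dev"]
        apiVersions: ["*"]
        operations: ["CREATE", "UPDATE"]
        resources: ["*/*"]
```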
So the issue here is that before the webhook configuration is installed, a bad … For example, if you look at the process for install, the first step is to install the Knative CRDs. At that point, a bad …

I did walk through that specific issue you linked. To clarify, on an upgrade (let's assume …).

My thought to fix this would be to go back to using a manifest with expanded rules (#13450). I think we have to ignore the fact that admission webhooks are not guaranteed until they're set up (initial install); everything after that is a reconciliation issue. In a normal upgrade, I'd argue the webhook rules should never allow this (e.g. unless the user deletes the webhook config). So if something made it through the admission process in that time, the new controller should catch it on reconciliation and reach eventual consistency.
We include admission webhooks as part of our installation YAML, which prevents resources from being created until the webhook is up and has updated the configuration (serving/config/core/webhooks/resource-validation.yaml, lines 25 to 28 at 0511892).
What's missing in the config? |
Since CRDs are installed first, there is a period where the …

There is no …
I'm not certain what is missing in all of the reconciliation process, but I did notice that creating a … This can be simulated by deleting the webhook config and then creating a …

{"severity":"ERROR","timestamp":"2022-11-09T15:46:44.489865Z","logger":"controller","caller":"revision/reconciler.go:302","message":"Returned an error","commit":"e82287d","knative.dev/pod":"controller-76d69dd7fc-q74sr","knative.dev/controller":"knative.dev.serving.pkg.reconciler.revision.Reconciler","knative.dev/kind":"serving.knative.dev.Revision","knative.dev/traceid":"9d2bfec3-b38b-46bb-aa23-d852b8af4e7e","knative.dev/key":"default/hello-00001","targetMethod":"ReconcileKind","error":"failed to update deployment \"hello-00001-deployment\": Operation cannot be fulfilled on deployments.apps \"hello-00001-deployment\": the object has been modified; please apply your changes to the latest version and try again","stacktrace":"knative.dev/serving/pkg/client/injection/reconciler/serving/v1/revision.(*reconcilerImpl).Reconcile\n\tknative.dev/serving/pkg/client/injection/reconciler/serving/v1/revision/reconciler.go:302\nknative.dev/pkg/controller.(*Impl).processNextWorkItem\n\tknative.dev/pkg@v0.0.0-20221011175852-714b7630a836/controller/controller.go:542\nknative.dev/pkg/controller.(*Impl).RunContext.func3\n\tknative.dev/pkg@v0.0.0-20221011175852-714b7630a836/controller/controller.go:491"}

{"severity":"ERROR","timestamp":"2022-11-09T15:46:44.489996Z","logger":"controller","caller":"controller/controller.go:566","message":"Reconcile error","commit":"e82287d","knative.dev/pod":"controller-76d69dd7fc-q74sr","knative.dev/controller":"knative.dev.serving.pkg.reconciler.revision.Reconciler","knative.dev/kind":"serving.knative.dev.Revision","knative.dev/traceid":"9d2bfec3-b38b-46bb-aa23-d852b8af4e7e","knative.dev/key":"default/hello-00001","duration":"100.233ms","error":"failed to update deployment \"hello-00001-deployment\": Operation cannot be fulfilled on deployments.apps \"hello-00001-deployment\": the object has been modified; please apply your changes to the latest version and try again","stacktrace":"knative.dev/pkg/controller.(*Impl).handleErr\n\tknative.dev/pkg@v0.0.0-20221011175852-714b7630a836/controller/controller.go:566\nknative.dev/pkg/controller.(*Impl).processNextWorkItem\n\tknative.dev/pkg@v0.0.0-20221011175852-714b7630a836/controller/controller.go:543\nknative.dev/pkg/controller.(*Impl).RunContext.func3\n\tknative.dev/pkg@v0.0.0-20221011175852-714b7630a836/controller/controller.go:491"}

The controller then stops; it never attempts to reconcile the revision deployment again, even though it received an error.
We split up the instructions like that so it doesn't trip up new users: our installation includes a CRD and an instance of that CRD (the image resource); see serving/config/core/999-cache.yaml, lines 15 to 27 at bd88e05.
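The shape of what's being described looks roughly like this (a sketch from memory, not the referenced file; the name and image reference are assumptions): an instance of the caching Image CRD ships in the same install as the CRD itself, which is why the CRDs must be applied first.

```yaml
# Illustrative: an instance of a CRD shipped in the same install as the
# CRD it depends on. If the CRD isn't applied first, this fails.
apiVersion: caching.internal.knative.dev/v1alpha1
kind: Image
metadata:
  name: queue-proxy             # assumed name
  namespace: knative-serving
spec:
  image: gcr.io/knative-releases/knative.dev/serving/cmd/queue  # assumed image ref
```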
You can get an error like …
I'm going to close this off. This is behaving as we expect, and I recommend looking into tooling with better rebasing support; e.g. we use https://carvel.dev/blog/kapp-rebase-rules/
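For readers unfamiliar with kapp rebase rules: a rule of roughly the following shape tells kapp to keep the server-managed field from the live cluster copy when computing diffs, so controller-expanded webhook rules stop registering as drift. This is a sketch based on the kapp Config format, not a tested configuration:

```yaml
# Sketch of a kapp rebase rule (verify against kapp docs before use):
# keep the live cluster's webhook `rules` during diffing, since the
# Knative controller, not git, owns that field.
apiVersion: kapp.k14s.io/v1alpha1
kind: Config
rebaseRules:
  - path: [webhooks, {allIndexes: true}, rules]
    type: copy
    sources: [existing]   # prefer what's already on the cluster
    resourceMatchers:
      - apiVersionKindMatcher:
          apiVersion: admissionregistration.k8s.io/v1
          kind: MutatingWebhookConfiguration
      - apiVersionKindMatcher:
          apiVersion: admissionregistration.k8s.io/v1
          kind: ValidatingWebhookConfiguration
```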
This issue is going to keep arising. These configurations really should be declarative. Mutating fields is fine when there is an owner defined and managing the resource, but using implicit wildcards to cover all rules/permutations is not a best practice with permissions; permissions granted should be explicit. Kubernetes flattens out the wildcard when the config is applied, and that is the entire reason we are seeing configuration drift in the first place.

Kubernetes does support dynamic configuration in the case of aggregated ClusterRoles (https://kubernetes.io/docs/reference/access-authn-authz/rbac/#aggregated-clusterroles), where Kubernetes handles updating permissions as referenced ClusterRoles are granted permissions. That is not the case with MutatingWebhookConfiguration or ValidatingWebhookConfiguration. Ignoring differences in those resources is going to lead to issues down the line, because the thing that needs to be updated cannot be known / is ignored.
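For contrast, this is the aggregated-ClusterRole pattern being referenced (adapted from the linked Kubernetes RBAC documentation; the label key is the docs' example, not anything Knative-specific). Here the runtime-managed field is explicitly declared empty, so the control plane filling it in is part of the contract rather than drift:

```yaml
# Aggregated ClusterRole (pattern from the Kubernetes RBAC docs):
# `rules` is declared empty, and the control plane fills it in from
# any ClusterRole matching the selector. The manifest never claims to
# own the field it doesn't manage.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: monitoring
aggregationRule:
  clusterRoleSelectors:
    - matchLabels:
        rbac.example.com/aggregate-to-monitoring: "true"
rules: []   # automatically populated by the control plane
```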
In what area(s)?
/area build
What version of Knative?
1.8.0
Expected Behavior
The generated YAML for webhook rules is modified after apply, changing the defined spec and causing constant configuration drift. I've reopened this from #12474, which was closed as fixed. I've confirmed this still happens in 1.8.0.
Actual Behavior
As before, the webhook rules for webhook.domainmapping.serving.knative.dev are declared one way in the release YAML; after applying, the spec is modified/expanded to a different set of rules. When using kubectl apply, config drift shows up as MutatingWebhookConfiguration/webhook.domainmapping.serving.knative.dev configured, generating a new resource version, whereas it would say unchanged if what had been defined in the spec had not changed.

Since we use GitOps to deploy, and git is the source of truth, this causes a lot of notifications. Also, for comparison, I've never had this issue with the dozens of other pieces of software installed in our Kubernetes clusters.
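The drift looks roughly like the following (illustrative rule contents, not the exact release YAML; the specific apiVersions are assumptions):

```yaml
# What a manifest might declare (compact form with a wildcard):
rules:
  - apiGroups: ["serving.knative.dev"]
    apiVersions: ["*"]
    operations: ["CREATE", "UPDATE"]
    resources: ["domainmappings", "domainmappings/status"]
---
# What the controller rewrites it to at runtime (expanded per version),
# so every re-apply of the original reports `configured`:
rules:
  - apiGroups: ["serving.knative.dev"]
    apiVersions: ["v1alpha1", "v1beta1"]   # assumed versions
    operations: ["CREATE", "UPDATE"]
    resources: ["domainmappings", "domainmappings/status"]
```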
Steps to Reproduce the Problem
Apply the YAML more than once, notice the resource has changed.