Webhook YAML causes constant config drift (reopened) #13449

Closed
mbrancato opened this issue Nov 7, 2022 · 13 comments
Labels
area/build Build topics specifically related to Knative kind/bug Categorizes issue or PR as related to a bug.

Comments

@mbrancato

In what area(s)?

/area build

What version of Knative?

1.8.0

Expected Behavior

The generated YAML for webhook rules is modified after apply, changing the defined spec and causing constant configuration drift. I've reopened this from #12474 which has been closed as fixed. I've confirmed this still happens in 1.8.0.

Actual Behavior

As before, the webhook rules for webhook.domainmapping.serving.knative.dev are in the release YAML as:

    rules:
      - apiGroups:
          - serving.knative.dev
        apiVersions:
          - "*"
        operations:
          - CREATE
          - UPDATE
        scope: "*"
        resources:
          - domainmappings
          - domainmappings/status

After applying, the spec is modified, expanding the rules to:

  rules:
  - apiGroups:
    - serving.knative.dev
    apiVersions:
    - v1alpha1
    operations:
    - CREATE
    - UPDATE
    resources:
    - domainmappings
    - domainmappings/status
    scope: '*'
  - apiGroups:
    - serving.knative.dev
    apiVersions:
    - v1beta1
    operations:
    - CREATE
    - UPDATE
    resources:
    - domainmappings
    - domainmappings/status
    scope: '*'

When using kubectl apply, the config drift shows up as MutatingWebhookConfiguration/webhook.domainmapping.serving.knative.dev configured, and a new resource version is generated, whereas it would say unchanged if what had been defined in the spec was not changed.

Since we use GitOps to deploy, and git is the source of truth, this causes a lot of notifications. Also, for comparison, I've never had this issue with the dozens of other pieces of software installed in our Kubernetes clusters.

Steps to Reproduce the Problem

Apply the YAML more than once and notice that the resource has changed.
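
A minimal way to observe it (assuming kubectl and the 1.8.0 serving-core manifest; the same pattern applies to whichever release YAML carries the webhook configurations):

    # Apply the same release YAML twice; a resource whose stored spec still
    # matches the manifest reports "unchanged" on the second apply.
    kubectl apply -f https://github.com/knative/serving/releases/download/knative-v1.8.0/serving-core.yaml
    # ... wait for the webhook deployment to come up and reconcile ...
    kubectl apply -f https://github.com/knative/serving/releases/download/knative-v1.8.0/serving-core.yaml | grep -i webhook
    # The webhook configurations instead report "configured" on every apply,
    # because the controller has expanded the wildcard apiVersions in between.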

@mbrancato mbrancato added the kind/bug Categorizes issue or PR as related to a bug. label Nov 7, 2022
@knative-prow knative-prow bot added the area/build Build topics specifically related to Knative label Nov 7, 2022
@psschwei
Contributor

psschwei commented Nov 7, 2022

I've reopened this from #12474 which has been closed as fixed. I've confirmed this still happens in 1.8.0.

Looking at the original issue, @dprotaso 's rationale for closing was that "this drift is unfortunately unavoidable."

I think the recommendation in the original issue is probably still the same today: to look into your tooling's options to handle merges between what's declared and what's present on the API server.
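
For example, if the GitOps tool in question were Argo CD (an assumption; the issue doesn't say which tool is in use), the controller-managed field can be excluded from diffing with an ignoreDifferences entry roughly like this:

    # Sketch of an Argo CD Application excerpt; the JSON pointer targets the
    # rules list that the Knative controller expands at runtime.
    spec:
      ignoreDifferences:
        - group: admissionregistration.k8s.io
          kind: MutatingWebhookConfiguration
          name: webhook.domainmapping.serving.knative.dev
          jsonPointers:
            - /webhooks/0/rules

Other tools have comparable options for excluding server-managed fields from drift detection.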

@mbrancato
Author

Looking at the original issue, @dprotaso 's rationale for closing was that "this drift is unfortunately unavoidable."

I think the recommendation in the original issue is probably still the same today: to look into your tooling's options to handle merges between what's declared and what's present on the API server.

Hi @psschwei

Again, to point out: I've never run into other Kubernetes manifests or Helm charts that do this. My view is that it is probably an anti-pattern to specify a spec as X at deploy time and modify the spec to Y at runtime. If you want to do this, configure nothing for the rules in the manifest. With declarative configuration there are, of course, things that get set and even change at runtime; the solution is simply not to set those in the declarative part of the configuration. I don't see this as a technical issue, so let me know if I'm missing something. Based on some of the prior comments (i.e. setting caBundle or spec.clusterIP), it seems there is just a misunderstanding about how declarative configuration works when applied to Kubernetes manifests.

A good example of this is right here in knative/serving:

# The data is populated when internal-encryption is enabled.
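
That comment sits on a resource the release YAML deliberately ships empty; the pattern looks roughly like this (the metadata below is illustrative, not the exact knative/serving object):

    # Sketch only: the manifest declares no data, so the controller can
    # populate the Secret at runtime without creating drift.
    apiVersion: v1
    kind: Secret
    metadata:
      name: serving-certs            # illustrative name
      namespace: knative-serving
    # The data is populated when internal-encryption is enabled.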

That said, I propose knative/serving should just expand the rules in the manifest. Keeping them unexpanded isn't really helping anything, although I agree the unexpanded form is easier for humans to read.

I've created PR #13450 to fix this.

@dprotaso
Member

dprotaso commented Nov 8, 2022

My view is that it is probably an anti-pattern to specify a spec as X at deploy time and modify the spec to Y at runtime.

That's not how Kubernetes works - any admission webhook can mutate your resources (i.e. defaulting, applying opinions, etc.) and the end result is different from the YAML that was applied.

Even Microsoft AKS modifies mutating webhooks to include additional selectors - Azure/AKS#1771

You need to set up GitOps tooling to accommodate this.

Regarding caBundle as an example: our controllers generate that certificate and rotate it when it expires, so it will change.
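
That field can be seen being managed at runtime, for example with:

    # The caBundle is absent from the release YAML and injected/rotated by the
    # webhook controller; this prints the first bytes of the live value.
    kubectl get mutatingwebhookconfiguration webhook.domainmapping.serving.knative.dev \
      -o jsonpath='{.webhooks[0].clientConfig.caBundle}' | head -c 40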

@dprotaso
Member

dprotaso commented Nov 8, 2022

I'm going to close this out as it's working as we expect.

@dprotaso dprotaso closed this as completed Nov 8, 2022
@mbrancato
Author

Hi @dprotaso

I still feel like there is a misunderstanding here. Modifying or mutating a resource is fine; the anti-pattern is defining your manifests as X and then letting the controller modify them at runtime to Y. I also feel you're treating mutation of resources in general as the issue, which is not the issue here. The issue is defining keys and values in one place for install and then changing them somewhere else later on. This anti-pattern would be more easily recognized as problematic if it applied to something like a Role resource: permissions approved at install time as one thing, which a controller then modified later to grant more.

There are really 3 ways to approach this:

  1. placeholder values of which the controller later becomes the authoritative owner - this causes configuration drift
  2. synchronized manifests that define the resource the same way the controller will, giving a little more transparency (this is what my PR did)
  3. sparse manifests that only define the parts of the resource that should remain, allowing the controller to own the rest.

So, since you rejected a synchronized resource manifest: why are the rules even configured at all in the manifest if you want the controller to be authoritative over the webhook rules? Why not just remove the rules altogether? This would make it clear that the rules are not declared as part of the install (the controller will manage them) and avoids config drift against the stored manifests.

I've opened PR #13453 to remove the inaccurate placeholder values, giving the controller complete control over the webhook rules; a sketch of what that sparse manifest looks like is below.
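
For illustration, the sparse variant would look roughly like this (trimmed to the relevant fields; the service name is illustrative):

    # Sketch of the "sparse" option: no rules declared in the manifest,
    # the webhook controller owns and reconciles them after startup.
    apiVersion: admissionregistration.k8s.io/v1
    kind: MutatingWebhookConfiguration
    metadata:
      name: webhook.domainmapping.serving.knative.dev
    webhooks:
      - name: webhook.domainmapping.serving.knative.dev
        admissionReviewVersions: ["v1"]
        sideEffects: None
        clientConfig:
          service:
            name: domainmapping-webhook   # illustrative service name
            namespace: knative-serving
        # rules intentionally omitted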

@mbrancato
Author

/reopen

@knative-prow

knative-prow bot commented Nov 8, 2022

@mbrancato: Reopened this issue.

In response to this:

/reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@knative-prow knative-prow bot reopened this Nov 8, 2022
@dprotaso
Member

dprotaso commented Nov 8, 2022

Why are the rules even configured at all in the manifest if you want the controller to be authoritative over the webhook rules?

Context is here: #7576

If we remove the rules there's a window during install & upgrade that allows folks to create resources where the webhook isn't hooked up. This means the API server can accept Knative resources that are not valid.

So we have this initial seed of rules to prevent this race.

@mbrancato
Author

So the issue here is that before the webhook configuration is installed, a bad ksvc can be created. I understand this, but I'm not sure having these default rules is actually solving that specific problem.

For example, if you look at the process for install, the first step is to install knative CRDs. At that point, a bad ksvc can be created. So having rules here on the webhook isn't solving this race condition for install, but may be a workaround for some.
https://knative.dev/docs/install/yaml-install/serving/install-serving-with-yaml/#install-the-knative-serving-component

I did walk through that specific issue you linked. To clarify, on an upgrade (let's assume kubectl apply) the expanded webhook configuration rules would never be empty and won't allow this issue to happen. With a sparse configuration, it would. Before the controller is running, a deployed ksvc will eventually deploy, with 2 revisions. However, if the controller is already running, a missing webhook config will cause it to get stuck. But that said, after testing this in several ways (before deploy, after deploy without webhook cfg, etc), it seems to me this is a bug in the reconciliation loop for the rev and ksvc resources. This obviously has "eventual consistency" implications.

My thought to fix this would be to go back to using a manifest with expanded rules (#13450). I think we have to accept that admission webhooks are not guaranteed until they're set up (initial install), and everything after that is a reconciliation issue. In a normal upgrade, I'd argue the webhook rules should never allow this (e.g. unless the user deletes the webhook config). So if something made it through the admission process in that window, the new controller should catch it on reconciliation and reach eventual consistency.

@dprotaso
Member

dprotaso commented Nov 9, 2022

For example, if you look at the process for install, the first step is to install knative CRDs. At that point, a bad ksvc can be created. So having rules here on the webhook isn't solving this race condition for install, but may be a workaround for some.

We include admission webhook configurations as part of our installation YAML; they prevent resources from being created until the webhook is up and has updated the configuration.

    clientConfig:
      service:
        name: webhook
        namespace: knative-serving
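
(The blocking presumably relies on the webhook's failure policy; with the excerpt below, the API server rejects matching requests while the webhook service is unreachable, so nothing slips through before the pod is up.)

    # Assumed excerpt: with failurePolicy set to Fail, admission requests that
    # match the rules are rejected until the backing service answers.
    failurePolicy: Fail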

However, if the controller is already running, a missing webhook config will cause it to get stuck. But that said, after testing this in several ways (before deploy, after deploy without webhook cfg, etc), it seems to me this is a bug in the reconciliation loop for the rev and ksvc resources. This obviously has "eventual consistency" implications.

What's missing in the config?

@mbrancato
Author

We include admission webhook configurations as part of our installation YAML; they prevent resources from being created until the webhook is up and has updated the configuration.

Since CRDs are installed first, there is a period where a ksvc can be created and no ValidatingWebhookConfiguration exists yet. From the install docs:

1. Install the required custom resources by running the command:

kubectl apply -f https://github.com/knative/serving/releases/download/knative-v1.8.0/serving-crds.yaml

There is no ValidatingWebhookConfiguration after installing these. But this doesn't really matter, as on initial install this issue doesn't appear to come up (though it does generate 2 revisions).

What's missing in the config?

I'm not certain what is missing in all of the reconciliation process, but I did notice that creating a ksvc before the webhooks are created causes the ksvc to lack the field spec.enableServiceLinks. However, there is probably a lot more to why Knative can't get the ksvc to a working state with active revisions when it is created while the controller is running but the webhooks are not configured.

This can be simulated by deleting the webhook config and then creating a ksvc.
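
For example (the webhook configuration name is what the 1.8 release installs, as far as I recall; hello-ksvc.yaml stands for any minimal Knative Service manifest):

    # Simulate the missing-webhook scenario: delete the defaulting webhook
    # config, then create a Service while the controller is running.
    kubectl delete mutatingwebhookconfiguration webhook.serving.knative.dev
    kubectl apply -f hello-ksvc.yaml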

{"severity":"ERROR","timestamp":"2022-11-09T15:46:44.489865Z","logger":"controller","caller":"revision/reconciler.go:302","message":"Returned an error","commit":"e82287d","knative.dev/pod":"controller-76d69dd7fc-q74sr","knative.dev/controller":"knative.dev.serving.pkg.reconciler.revision.Reconciler","knative.dev/kind":"serving.knative.dev.Revision","knative.dev/traceid":"9d2bfec3-b38b-46bb-aa23-d852b8af4e7e","knative.dev/key":"default/hello-00001","targetMethod":"ReconcileKind","error":"failed to update deployment \"hello-00001-deployment\": Operation cannot be fulfilled on deployments.apps \"hello-00001-deployment\": the object has been modified; please apply your changes to the latest version and try again","stacktrace":"knative.dev/serving/pkg/client/injection/reconciler/serving/v1/revision.(*reconcilerImpl).Reconcile\n\tknative.dev/serving/pkg/client/injection/reconciler/serving/v1/revision/reconciler.go:302\nknative.dev/pkg/controller.(*Impl).processNextWorkItem\n\tknative.dev/pkg@v0.0.0-20221011175852-714b7630a836/controller/controller.go:542\nknative.dev/pkg/controller.(*Impl).RunContext.func3\n\tknative.dev/pkg@v0.0.0-20221011175852-714b7630a836/controller/controller.go:491"}
{"severity":"ERROR","timestamp":"2022-11-09T15:46:44.489996Z","logger":"controller","caller":"controller/controller.go:566","message":"Reconcile error","commit":"e82287d","knative.dev/pod":"controller-76d69dd7fc-q74sr","knative.dev/controller":"knative.dev.serving.pkg.reconciler.revision.Reconciler","knative.dev/kind":"serving.knative.dev.Revision","knative.dev/traceid":"9d2bfec3-b38b-46bb-aa23-d852b8af4e7e","knative.dev/key":"default/hello-00001","duration":"100.233ms","error":"failed to update deployment \"hello-00001-deployment\": Operation cannot be fulfilled on deployments.apps \"hello-00001-deployment\": the object has been modified; please apply your changes to the latest version and try again","stacktrace":"knative.dev/pkg/controller.(*Impl).handleErr\n\tknative.dev/pkg@v0.0.0-20221011175852-714b7630a836/controller/controller.go:566\nknative.dev/pkg/controller.(*Impl).processNextWorkItem\n\tknative.dev/pkg@v0.0.0-20221011175852-714b7630a836/controller/controller.go:543\nknative.dev/pkg/controller.(*Impl).RunContext.func3\n\tknative.dev/pkg@v0.0.0-20221011175852-714b7630a836/controller/controller.go:491"}

The controller then stops; it never attempts to reconcile the revision deployment again, even though it received an error.

@dprotaso
Member

dprotaso commented Nov 9, 2022

We split up the instructions like that so they don't trip up new users: our installation includes both a CRD and an instance of that CRD (the image resource).

    apiVersion: caching.internal.knative.dev/v1alpha1
    kind: Image
    metadata:
      name: queue-proxy
      namespace: knative-serving
      labels:
        app.kubernetes.io/component: queue-proxy
        app.kubernetes.io/name: knative-serving
        app.kubernetes.io/version: devel
    spec:
      # This is the Go import path for the binary that is containerized
      # and substituted here.
      image: ko://knative.dev/serving/cmd/queue

You can get an error like

error: resource mapping not found for name: "queue-proxy" namespace: "knative-serving" from "https://github.com/knative/serving/releases/download/knative-v1.8.0/serving-core.yaml": no matches for kind "Image" in version "caching.internal.knative.dev/v1alpha1"
ensure CRDs are installed first

I'm going to close this off - this is behaving as we expect, and I recommend looking into tooling with better rebase support; e.g. we use https://carvel.dev/blog/kapp-rebase-rules/
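
For reference, a rebase rule along these lines keeps the cluster-expanded rules instead of reporting them as drift (a sketch based on the linked post, not a tested configuration):

    # Sketch: prefer the existing (cluster-side) rules over the manifest's
    # seed rules so the controller's expansion is not treated as a change.
    apiVersion: kapp.k14s.io/v1alpha1
    kind: Config
    rebaseRules:
      - path: [webhooks, {allIndexes: true}, rules]
        type: copy
        sources: [existing, new]
        resourceMatchers:
          - apiVersionKindMatcher:
              apiVersion: admissionregistration.k8s.io/v1
              kind: MutatingWebhookConfiguration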

@acelinkio

acelinkio commented Aug 13, 2023

This issue is going to keep arising. These configurations really should be declarative. Mutating fields is fine, because there is an owner defined and managing the resource.

Using implicit wildcards to handle all rules/permutations is not a best practice for permissions. Permissions granted should be explicit. Kubernetes does flatten out the wildcard when applied, and that is the entire reason we are seeing configuration drift in the first place.

Kubernetes does support dynamic configuration in the case of aggregated ClusterRoles (https://kubernetes.io/docs/reference/access-authn-authz/rbac/#aggregated-clusterroles), where Kubernetes updates the aggregated permissions as the referenced ClusterRoles are granted permissions.
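
For comparison, the aggregation pattern is declared on the resource itself, so the runtime-filled field is part of the contract (a minimal sketch adapted from the linked docs):

    # The control plane fills in .rules from every ClusterRole carrying the
    # matching label; the empty rules list is the declared, expected state.
    apiVersion: rbac.authorization.k8s.io/v1
    kind: ClusterRole
    metadata:
      name: monitoring              # illustrative name
    aggregationRule:
      clusterRoleSelectors:
        - matchLabels:
            rbac.example.com/aggregate-to-monitoring: "true"
    rules: []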

That is not the case with MutatingWebhookConfiguration or ValidatingWebhookConfiguration. Ignoring differences in those resources is going to lead to issues down the line, because the thing that needs to be updated is either unknown or ignored.
