Webhook YAML causes constant config drift (reopened) #13449

Closed
mbrancato opened this issue Nov 7, 2022 · 13 comments
Labels
area/build Build topics specifically related to Knative kind/bug Categorizes issue or PR as related to a bug.

Comments

@mbrancato

In what area(s)?

/area build

What version of Knative?

1.8.0

Expected Behavior

The generated YAML for webhook rules is modified after apply, changing the defined spec and causing constant configuration drift. I've reopened this from #12474 which has been closed as fixed. I've confirmed this still happens in 1.8.0.

Actual Behavior

As before, the webhook rules for webhook.domainmapping.serving.knative.dev are in the release YAML as:

    rules:
      - apiGroups:
          - serving.knative.dev
        apiVersions:
          - "*"
        operations:
          - CREATE
          - UPDATE
        scope: "*"
        resources:
          - domainmappings
          - domainmappings/status

After applying, the spec is modified, expanding the rules to:

  rules:
  - apiGroups:
    - serving.knative.dev
    apiVersions:
    - v1alpha1
    operations:
    - CREATE
    - UPDATE
    resources:
    - domainmappings
    - domainmappings/status
    scope: '*'
  - apiGroups:
    - serving.knative.dev
    apiVersions:
    - v1beta1
    operations:
    - CREATE
    - UPDATE
    resources:
    - domainmappings
    - domainmappings/status
    scope: '*'

When using kubectl apply, the config drift shows up as MutatingWebhookConfiguration/webhook.domainmapping.serving.knative.dev configured, and a new resource version is generated, whereas it would say unchanged if what had been defined in the spec was not changed.

Since we use GitOps to deploy, and git is the source of truth, this causes a lot of notifications. Also, for comparison, I've never had this issue with the dozens of other pieces of software installed in our Kubernetes clusters.

Steps to Reproduce the Problem

Apply the YAML more than once and notice that the resource has changed.
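
A minimal way to observe it (assuming kubectl and the 1.8.0 serving-core manifest; the same pattern applies to whichever release YAML carries the webhook configurations):

    # Apply the same release YAML twice; a resource whose stored spec still
    # matches the manifest reports "unchanged" on the second apply.
    kubectl apply -f https://github.com/knative/serving/releases/download/knative-v1.8.0/serving-core.yaml
    # ... wait for the webhook deployment to come up and reconcile ...
    kubectl apply -f https://github.com/knative/serving/releases/download/knative-v1.8.0/serving-core.yaml | grep -i webhook
    # The webhook configurations instead report "configured" on every apply,
    # because the controller has expanded the wildcard apiVersions in between.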

@mbrancato mbrancato added the kind/bug Categorizes issue or PR as related to a bug. label Nov 7, 2022
@knative-prow knative-prow bot added the area/build Build topics specifically related to Knative label Nov 7, 2022
@psschwei
Contributor

psschwei commented Nov 7, 2022

I've reopened this from #12474 which has been closed as fixed. I've confirmed this still happens in 1.8.0.

Looking at the original issue, @dprotaso 's rationale for closing was that "this drift is unfortunately unavoidable."

I think the recommendation in the original issue is probably still the same today: to look into your tooling's options to handle merges between what's declared and what's present on the API server.
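
For example, if the GitOps tool in question were Argo CD (an assumption; the issue doesn't say which tool is in use), the controller-managed field can be excluded from diffing with an ignoreDifferences entry roughly like this:

    # Sketch of an Argo CD Application excerpt; the JSON pointer targets the
    # rules list that the Knative controller expands at runtime.
    spec:
      ignoreDifferences:
        - group: admissionregistration.k8s.io
          kind: MutatingWebhookConfiguration
          name: webhook.domainmapping.serving.knative.dev
          jsonPointers:
            - /webhooks/0/rules

Other tools have comparable options for excluding server-managed fields from drift detection.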

@mbrancato
Author

Looking at the original issue, @dprotaso 's rationale for closing was that "this drift is unfortunately unavoidable."

I think the recommendation in the original issue is probably still the same today: to look into your tooling's options to handle merges between what's declared and what's present on the API server.

Hi @psschwei

Again, to point out: I've never run into other Kubernetes manifests or Helm charts that do this. My view is that it is probably an anti-pattern to specify a spec as X at deploy time and modify the spec to Y at runtime. If you want to do this, configure nothing for the rules in the manifest. With declarative configuration there are, of course, things that get set and even change at runtime; the solution is simply not to set those in the declarative part of the configuration. I don't see this as a technical issue, so let me know if I'm missing something. Based on some of the prior comments (i.e. setting caBundle or spec.clusterIP), it seems there is just a misunderstanding about how declarative configuration works when applied to Kubernetes manifests.

A good example of this is right here in knative/serving:

# The data is populated when internal-encryption is enabled.
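
That comment sits on a resource the release YAML deliberately ships empty; the pattern looks roughly like this (the metadata below is illustrative, not the exact knative/serving object):

    # Sketch only: the manifest declares no data, so the controller can
    # populate the Secret at runtime without creating drift.
    apiVersion: v1
    kind: Secret
    metadata:
      name: serving-certs            # illustrative name
      namespace: knative-serving
    # The data is populated when internal-encryption is enabled.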

That said, I propose knative/serving should just expand the rules in the manifest. Keeping them unexpanded isn't really helping anything, although I agree the unexpanded form is easier for humans to read.

I've created PR #13450 to fix this.

@dprotaso
Member

dprotaso commented Nov 8, 2022

My view is that it is probably an anti-pattern to specify a spec as X at deploy time and modify the spec to Y at runtime.

That's not how Kubernetes works - any admission webhook can mutate your resources (i.e. defaulting, applying opinions, etc.) and the end result is different from the YAML that was applied.

Even Microsoft AKS modifies mutating webhooks to include additional selectors - Azure/AKS#1771

You need to set up GitOps tooling to accommodate this.

Regarding caBundle as an example: our controllers generate that certificate and rotate it when it expires, so it will change.
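
That field can be seen being managed at runtime, for example with:

    # The caBundle is absent from the release YAML and injected/rotated by the
    # webhook controller; this prints the first bytes of the live value.
    kubectl get mutatingwebhookconfiguration webhook.domainmapping.serving.knative.dev \
      -o jsonpath='{.webhooks[0].clientConfig.caBundle}' | head -c 40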

@dprotaso
Member

dprotaso commented Nov 8, 2022

I'm going to close this out as it's working as we expect.

@dprotaso dprotaso closed this as completed Nov 8, 2022
@mbrancato
Author

Hi @dprotaso

I still feel like there is a misunderstanding here. Modifying or mutating a resource is fine; the anti-pattern is defining your manifests as X and then letting the controller modify them at runtime to Y. I also feel you're treating mutation of resources in general as the issue, which is not the issue here. The issue is defining keys and values in one place for install and then changing them somewhere else later on. This anti-pattern would be more easily recognized as problematic if it applied to something like a Role resource: permissions approved at install time as one thing, which a controller then modified later to grant more.

There are really 3 ways to approach this:

  1. placeholder values of which the controller later becomes the authoritative owner - this causes configuration drift
  2. synchronized manifests that define the resource the same way the controller will, giving a little more transparency (this is what my PR did)
  3. sparse manifests that only define the parts of the resource that should remain, allowing the controller to own the rest.

So, since you rejected a synchronized resource manifest: why are the rules even configured at all in the manifest if you want the controller to be authoritative over the webhook rules? Why not just remove the rules altogether? This would make it clear that the rules are not declared as part of the install (the controller will manage them) and avoids config drift against the stored manifests.

I've opened PR #13453 to remove the inaccurate placeholder values, giving the controller complete control over the webhook rules; a sketch of what that sparse manifest looks like is below.
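
For illustration, the sparse variant would look roughly like this (trimmed to the relevant fields; the service name is illustrative):

    # Sketch of the "sparse" option: no rules declared in the manifest,
    # the webhook controller owns and reconciles them after startup.
    apiVersion: admissionregistration.k8s.io/v1
    kind: MutatingWebhookConfiguration
    metadata:
      name: webhook.domainmapping.serving.knative.dev
    webhooks:
      - name: webhook.domainmapping.serving.knative.dev
        admissionReviewVersions: ["v1"]
        sideEffects: None
        clientConfig:
          service:
            name: domainmapping-webhook   # illustrative service name
            namespace: knative-serving
        # rules intentionally omitted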

@mbrancato
Author

/reopen

@knative-prow

knative-prow bot commented Nov 8, 2022

@mbrancato: Reopened this issue.

In response to this:

/reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@knative-prow knative-prow bot reopened this Nov 8, 2022
@dprotaso
Member

dprotaso commented Nov 8, 2022

Why are the rules even configured at all in the manifest if you want the controller to be authoritative over the webhook rules?

Context is here: #7576

If we remove the rules there's a window during install & upgrade that allows folks to create resources where the webhook isn't hooked up. This means the API server can accept Knative resources that are not valid.

So we have this initial seed of rules to prevent this race.

@mbrancato
Author

So the issue here is that before the webhook configuration is installed, a bad ksvc can be created. I understand this, but I'm not sure having these default rules is actually solving that specific problem.

For example, if you look at the process for install, the first step is to install knative CRDs. At that point, a bad ksvc can be created. So having rules here on the webhook isn't solving this race condition for install, but may be a workaround for some.
https://knative.dev/docs/install/yaml-install/serving/install-serving-with-yaml/#install-the-knative-serving-component

I did walk through that specific issue you linked. To clarify, on an upgrade (let's assume kubectl apply) the expanded webhook configuration rules would never be empty and won't allow this issue to happen. With a sparse configuration, it would. Before the controller is running, a deployed ksvc will eventually deploy, with 2 revisions. However, if the controller is already running, a missing webhook config will cause it to get stuck. But that said, after testing this in several ways (before deploy, after deploy without webhook cfg, etc), it seems to me this is a bug in the reconciliation loop for the rev and ksvc resources. This obviously has "eventual consistency" implications.

My thought to fix this would be to go back to using a manifest with expanded rules (#13450). I think we have to accept that admission webhooks are not guaranteed until they're set up (initial install), and everything after that is a reconciliation issue. In a normal upgrade, I'd argue the webhook rules should never allow this (e.g. unless the user deletes the webhook config). So if something made it through the admission process in that window, the new controller should catch it on reconciliation and reach eventual consistency.

@dprotaso
Member

dprotaso commented Nov 9, 2022

For example, if you look at the process for install, the first step is to install knative CRDs. At that point, a bad ksvc can be created. So having rules here on the webhook isn't solving this race condition for install, but may be a workaround for some.

We include admission webhook configurations as part of our installation YAML; they prevent resources from being created until the webhook is up and has updated the configuration.

    clientConfig:
      service:
        name: webhook
        namespace: knative-serving
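
(The blocking presumably relies on the webhook's failure policy; with the excerpt below, the API server rejects matching requests while the webhook service is unreachable, so nothing slips through before the pod is up.)

    # Assumed excerpt: with failurePolicy set to Fail, admission requests that
    # match the rules are rejected until the backing service answers.
    failurePolicy: Fail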

However, if the controller is already running, a missing webhook config will cause it to get stuck. But that said, after testing this in several ways (before deploy, after deploy without webhook cfg, etc), it seems to me this is a bug in the reconciliation loop for the rev and ksvc resources. This obviously has "eventual consistency" implications.

What's missing in the config?

@mbrancato
Author

We include admission webhook configurations as part of our installation YAML; they prevent resources from being created until the webhook is up and has updated the configuration.

Since CRDs are installed first, there is a period where a ksvc can be created and no ValidatingWebhookConfiguration exists yet. From the install docs:

1. Install the required custom resources by running the command:

kubectl apply -f https://github.com/knative/serving/releases/download/knative-v1.8.0/serving-crds.yaml

There is no ValidatingWebhookConfiguration after installing these. But this doesn't really matter, as on initial install this issue doesn't appear to come up (though it does generate 2 revisions).

What's missing in the config?

I'm not certain what is missing in all of the reconciliation process, but I did notice that creating a ksvc before the webhooks are created causes the ksvc to lack the field spec.enableServiceLinks. However, there is probably a lot more to why Knative can't get the ksvc to a working state with active revisions when it is created while the controller is running but the webhooks are not configured.

This can be simulated by deleting the webhook config and then creating a ksvc.
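
For example (the webhook configuration name is what the 1.8 release installs, as far as I recall; hello-ksvc.yaml stands for any minimal Knative Service manifest):

    # Simulate the missing-webhook scenario: delete the defaulting webhook
    # config, then create a Service while the controller is running.
    kubectl delete mutatingwebhookconfiguration webhook.serving.knative.dev
    kubectl apply -f hello-ksvc.yaml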

{"severity":"ERROR","timestamp":"2022-11-09T15:46:44.489865Z","logger":"controller","caller":"revision/reconciler.go:302","message":"Returned an error","commit":"e82287d","knative.dev/pod":"controller-76d69dd7fc-q74sr","knative.dev/controller":"knative.dev.serving.pkg.reconciler.revision.Reconciler","knative.dev/kind":"serving.knative.dev.Revision","knative.dev/traceid":"9d2bfec3-b38b-46bb-aa23-d852b8af4e7e","knative.dev/key":"default/hello-00001","targetMethod":"ReconcileKind","error":"failed to update deployment \"hello-00001-deployment\": Operation cannot be fulfilled on deployments.apps \"hello-00001-deployment\": the object has been modified; please apply your changes to the latest version and try again","stacktrace":"knative.dev/serving/pkg/client/injection/reconciler/serving/v1/revision.(*reconcilerImpl).Reconcile\n\tknative.dev/serving/pkg/client/injection/reconciler/serving/v1/revision/reconciler.go:302\nknative.dev/pkg/controller.(*Impl).processNextWorkItem\n\tknative.dev/pkg@v0.0.0-20221011175852-714b7630a836/controller/controller.go:542\nknative.dev/pkg/controller.(*Impl).RunContext.func3\n\tknative.dev/pkg@v0.0.0-20221011175852-714b7630a836/controller/controller.go:491"}
{"severity":"ERROR","timestamp":"2022-11-09T15:46:44.489996Z","logger":"controller","caller":"controller/controller.go:566","message":"Reconcile error","commit":"e82287d","knative.dev/pod":"controller-76d69dd7fc-q74sr","knative.dev/controller":"knative.dev.serving.pkg.reconciler.revision.Reconciler","knative.dev/kind":"serving.knative.dev.Revision","knative.dev/traceid":"9d2bfec3-b38b-46bb-aa23-d852b8af4e7e","knative.dev/key":"default/hello-00001","duration":"100.233ms","error":"failed to update deployment \"hello-00001-deployment\": Operation cannot be fulfilled on deployments.apps \"hello-00001-deployment\": the object has been modified; please apply your changes to the latest version and try again","stacktrace":"knative.dev/pkg/controller.(*Impl).handleErr\n\tknative.dev/pkg@v0.0.0-20221011175852-714b7630a836/controller/controller.go:566\nknative.dev/pkg/controller.(*Impl).processNextWorkItem\n\tknative.dev/pkg@v0.0.0-20221011175852-714b7630a836/controller/controller.go:543\nknative.dev/pkg/controller.(*Impl).RunContext.func3\n\tknative.dev/pkg@v0.0.0-20221011175852-714b7630a836/controller/controller.go:491"}

The controller then stops; it never attempts to reconcile the revision deployment again, even though it received an error.

@dprotaso
Member

dprotaso commented Nov 9, 2022

We split up the instructions like that so they don't trip up new users: our installation includes both a CRD and an instance of that CRD (the image resource).

    apiVersion: caching.internal.knative.dev/v1alpha1
    kind: Image
    metadata:
      name: queue-proxy
      namespace: knative-serving
      labels:
        app.kubernetes.io/component: queue-proxy
        app.kubernetes.io/name: knative-serving
        app.kubernetes.io/version: devel
    spec:
      # This is the Go import path for the binary that is containerized
      # and substituted here.
      image: ko://knative.dev/serving/cmd/queue

You can get an error like

error: resource mapping not found for name: "queue-proxy" namespace: "knative-serving" from "https://github.com/knative/serving/releases/download/knative-v1.8.0/serving-core.yaml": no matches for kind "Image" in version "caching.internal.knative.dev/v1alpha1"
ensure CRDs are installed first

I'm going to close this off - this is behaving as we expect, and I recommend looking into tooling with better rebase support; e.g. we use https://carvel.dev/blog/kapp-rebase-rules/
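
For reference, a rebase rule along these lines keeps the cluster-expanded rules instead of reporting them as drift (a sketch based on the linked post, not a tested configuration):

    # Sketch: prefer the existing (cluster-side) rules over the manifest's
    # seed rules so the controller's expansion is not treated as a change.
    apiVersion: kapp.k14s.io/v1alpha1
    kind: Config
    rebaseRules:
      - path: [webhooks, {allIndexes: true}, rules]
        type: copy
        sources: [existing, new]
        resourceMatchers:
          - apiVersionKindMatcher:
              apiVersion: admissionregistration.k8s.io/v1
              kind: MutatingWebhookConfiguration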

@acelinkio

acelinkio commented Aug 13, 2023

This issue is going to keep arising. These configurations really should be declarative. Mutating fields is fine, because there is an owner defined and managing the resource.

Using implicit wildcards to handle all rules/permutations is not a best practice for permissions. Permissions granted should be explicit. Kubernetes does flatten out the wildcard when applied, and that is the entire reason we are seeing configuration drift in the first place.

Kubernetes does support dynamic configuration in the case of aggregated ClusterRoles (https://kubernetes.io/docs/reference/access-authn-authz/rbac/#aggregated-clusterroles), where Kubernetes updates the aggregated permissions as the referenced ClusterRoles are granted permissions.
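
For comparison, the aggregation pattern is declared on the resource itself, so the runtime-filled field is part of the contract (a minimal sketch adapted from the linked docs):

    # The control plane fills in .rules from every ClusterRole carrying the
    # matching label; the empty rules list is the declared, expected state.
    apiVersion: rbac.authorization.k8s.io/v1
    kind: ClusterRole
    metadata:
      name: monitoring              # illustrative name
    aggregationRule:
      clusterRoleSelectors:
        - matchLabels:
            rbac.example.com/aggregate-to-monitoring: "true"
    rules: []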

That is not the case with MutatingWebhookConfiguration or ValidatingWebhookConfiguration. Ignoring differences in those resources is going to lead to issues down the line, because the thing that needs to be updated is either unknown or ignored.
