Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(crdvalidator): adding webhook to ensure safety of crd create/upgrades #213

Merged

Conversation

tylerslaton
Copy link
Contributor

@tylerslaton tylerslaton commented Apr 4, 2022

NOTES

Summary

Doing a few things here that we should talk through:

  • Creating a new webhook (crdvalidator) under the cmd directory. There has been talk of moving this webhook into its own repo in the future and this should make that very easy.
  • Adding Makefile targets to deploy the webhook
  • Moving the crd util package along side the webhook code.

Open Questions

  • Should this have an option to force crd applies through or would that option just be removing this webhook?

@tylerslaton tylerslaton requested a review from a team as a code owner April 4, 2022 15:00
@openshift-ci openshift-ci bot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Apr 4, 2022
@tylerslaton tylerslaton force-pushed the crd-admission-webhook branch 2 times, most recently from 711a12c to 834a63c Compare April 4, 2022 15:10
@openshift-ci openshift-ci bot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Apr 4, 2022
@tylerslaton tylerslaton force-pushed the crd-admission-webhook branch 2 times, most recently from f4c2747 to 4a84301 Compare April 4, 2022 15:12
Copy link
Member

@exdx exdx left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very nice work! This looks really good.

For an example idea of how an e2e could look like, check out https://github.com/operator-framework/operator-lifecycle-manager/blob/ac3aa278ddc8d59add546bf9e33e583f17d62bb0/test/e2e/crd_e2e_test.go#L115

cmd/crdvalidator/manifests/00_service_account.yaml Outdated Show resolved Hide resolved
cmd/crdvalidator/internal/handlers/crd.go Outdated Show resolved Hide resolved
cmd/crdvalidator/internal/handlers/crd.go Outdated Show resolved Hide resolved
cmd/crdvalidator/internal/handlers/crd.go Outdated Show resolved Hide resolved
cmd/crdvalidator/manifests/04_webhook.yaml Outdated Show resolved Hide resolved
Makefile Outdated Show resolved Hide resolved
@tylerslaton tylerslaton force-pushed the crd-admission-webhook branch 7 times, most recently from 7a4760f to 97000b1 Compare April 8, 2022 20:53

RUN make bin/crdvalidator

FROM gcr.io/distroless/static:debug
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does the webhook need a shell, is there a reason to use the :debug tag? Best practices for building container images are to not include a shell unless it's necessary for the application somehow.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤔 It doesn't, you're right. I was mainly copying this over from the root Dockerfile. Do you think that means we should be moving that container away from having a shell as well?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think so, yes. I would be for it personally.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think @exdx brings up a good point but I wonder whether it's good to use a debug tag in the short term while we continue to iterate, and then later on work towards removing it.

Makefile Outdated Show resolved Hide resolved
Makefile Outdated Show resolved Hide resolved
cmd/crdvalidator/internal/crd/crd.go Outdated Show resolved Hide resolved
cmd/crdvalidator/manifests/01_cluster_role.yaml Outdated Show resolved Hide resolved
cmd/crdvalidator/test/e2e/crdvalidator_test.go Outdated Show resolved Hide resolved
// 3. New CRD adds a version that Old CRD does not have =>
// - If conversion strategy is None, ensure existing CRs validate with new schema.
// - If conversion strategy is Webhook, allow update (assume webhook handles conversion correctly)
func validateCRDCompatibility(ctx context.Context, cl client.Client, oldCRD *apiextensionsv1.CustomResourceDefinition, newCRD *apiextensionsv1.CustomResourceDefinition) error {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: all the code in this function and following it is fairly in-depth, and could use some unit tests as a follow-up. Since it came from elsewhere, largely from OLM, it's fairly safe to rely on, but we could test it more on the rukpak side as a follow-up.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree, I'll make a follow-up item for this PR that adds unit testing for this particular package.

Copy link
Member

@exdx exdx left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great, thanks for all the work on this. Let's get one more approval before merging.

Makefile Outdated Show resolved Hide resolved
Comment on lines 86 to 113
It("should deny admission", func() {
Eventually(func() bool {
if err := c.Get(ctx, client.ObjectKeyFromObject(crd), crd); err != nil {
return false
}

crd.Spec.Versions = []apiextensionsv1.CustomResourceDefinitionVersion{
{
Name: "v1alpha2",
Served: true,
Storage: true,
Schema: &apiextensionsv1.CustomResourceValidation{
OpenAPIV3Schema: &apiextensionsv1.JSONSchemaProps{
Type: "object",
Description: "my crd schema",
},
},
},
}

err := c.Create(ctx, crd)
if err == nil {
return false
}

return errorsLooselyEqual(err, errors.New("cannot remove stored versions"))
}).Should(BeTrue())
})
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A couple of things:

  • I'm slightly confused whether we should be firing off an update request vs. a create one
  • Can we return an error here and use gomega matchers to enforce the expected error message, e.g.
    ContainSubstring(`no matches for kind "CatalogSource" in version "operators.coreos.com/v1alpha1"`),

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah so the Update here was attempted and continuously failed. If you have the Create it succeeds - however, this is not what we should really be expecting and just obfuscated the issue. Something under-the-hood in Kube is now blocking unsafe storage removals and thus the error that we are expecting is not coming through. This is because the Update request does not seem to even hit the webhook and Kube returns its own error. I'm making some changes in an upcoming commit along with a todo comment to address this in the future by reevaluating if we need to ensure safe storage removals or if Kube does it by default now.

go.mod Outdated Show resolved Hide resolved
Copy link
Contributor

@timflannagan timflannagan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay I just did a quick passthrough of these changes. I think my main takeaways were the following:

  • I'm not a huge fan of the current directory structure as it's difficult to discover where packages live in what directories and we don't have a concrete want/need to split this kind of component into it's own repository, or attempt to introduce it to upstream k/k.
  • The make run command doesn't install this component like I would've expected
  • It's not entirely evident how we could disable this behavior dynamically (or whether that's something we'd need to support)
  • We can punt on the debug/static distroless tag discussion
  • I think we can drop that k8s.io/apiserver dependency that was introduced with these changes
  • We're potentially introducing a confusing installation UX by having a deploy and install Makefile targets
  • It doesn't look like the CRD validator e2e test Makefile target loads the kind-load-crdvalidator target unless I'm reading these changes wrong
  • We'll need to update goreleaser to build this new executable package

I also ran these changes locally and ran into e2e failures. I haven't dived too deep into those yet though.

@timflannagan
Copy link
Contributor

FYI - It looks like I ran into a container image build warning locally too:

Step 3/14 : COPY go.mod go.mod
 ---> 73952c5724e3
Step 4/14 : COPY go.sum go.sum
 ---> 358d31c7b255
Step 5/14 : RUN go mod download
 ---> Running in eeb714ec03fd
Removing intermediate container eeb714ec03fd
 ---> d370149889ea
Step 6/14 : COPY Makefile Makefile
 ---> ddfb93c559ff
Step 7/14 : COPY cmd/crdvalidator cmd/crdvalidator
 ---> 1ba657d6e072
Step 8/14 : RUN make bin/crdvalidator
 ---> Running in 49439f4699bf
fatal: not a git repository (or any of the parent directories): .git
CGO_ENABLED=0 go build -ldflags "-X github.com/operator-framework/rukpak/internal/version.GitCommit=" -o bin/crdvalidator ./cmd/crdvalidator

@tylerslaton
Copy link
Contributor Author

Okay I just did a quick passthrough of these changes. I think my main takeaways were the following:

  • The make run command doesn't install this component like I would've expected
  • It doesn't look like the CRD validator e2e test Makefile target loads the kind-load-crdvalidator target unless I'm reading these changes wrong

I also ran these changes locally and ran into e2e failures. I haven't dived too deep into those yet though.

Yep, so I recently rebased this PR and it removed my addition to make run that had the crdvalidator being deployed. This resulted in the crdvalidators e2e suite not passing as a side effect. I'll work to address the other issues you listed.

@tylerslaton tylerslaton force-pushed the crd-admission-webhook branch 2 times, most recently from 410d954 to 084afe5 Compare April 13, 2022 19:37
Makefile Outdated Show resolved Hide resolved
Makefile Outdated Show resolved Hide resolved
Copy link
Contributor

@timflannagan timflannagan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we're super close and then I'm ready to approve:

  • Can we remove that test-crdvalidator-e2e target which appears to be unused?
  • Can we remove that unused Dockerfile.crdvalidator now that we've consolidated this component in the current repository structure?

@timflannagan
Copy link
Contributor

Note: it looks like this branch references out-of-date files compared to the main branch. It might be worth rebasing against the main branch, and respinning e2e once more to make sense we're not potentially silently regressing, but I'll leave that up to you as I just approved this.

@timflannagan
Copy link
Contributor

Really nice work btw @tylerslaton

@tylerslaton
Copy link
Contributor Author

Note: it looks like this branch references out-of-date files compared to the main branch. It might be worth rebasing against the main branch, and respinning e2e once more to make sense we're not potentially silently regressing, but I'll leave that up to you as I just approved this.

Rebased and pushed. Looks like everything is good to go. Going to squash all the commits together. @exdx anything you want to touch on before we merge this?

Add a new webhook that integrates the CRD validation logic into a webhook.
With this enabled on cluster, bundle installs and pivots can be guranteed as
safe by the webhook. It works by intercepting all CRD create/update requests
that come through and declining or admitting them based on the internal/crd
package.

Signed-off-by: Tyler Slaton <tyslaton@redhat.com>
Copy link
Member

@exdx exdx left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great, just some small comments.

install: install-plain ## Install all rukpak core CRDs and provisioners

deploy: install-apis ## Deploy the operator to the current cluster
deploy: install-apis install-crdvalidator ## Deploy the operator to the current cluster
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: install-apis already calls cert-mgr so we end up installing cert-manager twice on deploy. I'm not sure a good way around this -- is there a way to only run a target if it hasn't already been run?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was also thinking on this. I haven't come up with a great solution other than having all the cert-mgr dependent installs in a single target.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That sounds reasonable, we can have something like install-webhooks which calls cert-mgr under the hood. I suppose we can address this is in a follow-up

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Totally, but I agree with you. Its an easy addition that I can just tack onto the follow-ups.

// it through. This is fine and ultimately what the safe storage logic of the
// CRDValidator was designed to prevent but is unknown why it is occurring. We
// Should come back to this test case, figure out what is preventing it from
// hitting the webhook and decide if we want to keep that logic or not.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we open an issue to track this and put the link to it here in the test comment?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe that this action will create a issue for it:

https://github.com/operator-framework/rukpak/blob/main/.github/workflows/todo.yaml

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

// CRDValidator was designed to prevent but is unknown why it is occurring. We
// Should come back to this test case, figure out what is preventing it from
// hitting the webhook and decide if we want to keep that logic or not.
PWhen("an incoming crd event removes a stored version", func() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know we had some issues getting this test to work -- but we did confirm the test case where the schema changed in an incompatible way between upgrades. It's worth adding that test so we have at least one scenario where the admission was denied due to the webhook in our tests. This can be added in a follow-up though.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep - good call out. I'm also cool with a follow-up here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep I'm planning on doing a follow-up PR like the recent git work after this lands.

@timflannagan
Copy link
Contributor

LGTM again - do you want to push this through @tylerslaton.

@tylerslaton
Copy link
Contributor Author

@timflannagan Yep, I'm good with this going through and addressing the remaining comments in a follow up. Merging now.

@tylerslaton tylerslaton merged commit 1efea88 into operator-framework:main Apr 14, 2022
@tylerslaton tylerslaton deleted the crd-admission-webhook branch April 14, 2022 15:38
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants