
Upgrade gatekeeper to 3.6.0 #8973

Merged
merged 14 commits into kubermatic:master from upgrade-gatekeeper-to-3.6.0 on Mar 3, 2022

Conversation

@lsviben (Contributor) commented Feb 11, 2022

What does this PR do / Why do we need it:
Upgrades Gatekeeper (the OPA integration) from 3.5.2 to 3.6.0.

The important change here is that ConstraintTemplates are now v1 instead of v1beta1.

Does this PR close any issues?:
Fixes #8832

Does this PR introduce a user-facing change?:

Upgraded the OPA integration Gatekeeper from version 3.5.2 to 3.6.0. The important change is that OPA ConstraintTemplates are upgraded from v1beta1 to v1. For Kubermatic ConstraintTemplates, this means that newly created ones need a structurally correct `spec.crd.spec.validation.openAPIV3Schema`. Old ConstraintTemplates will have a `legacySchema` flag in `spec.crd.spec.validation`, so they won't need to be migrated yet, although we suggest editing them and fixing the schema to be structurally correct.

More info about the change from v1beta1 to v1 is in the Gatekeeper docs: https://open-policy-agent.github.io/gatekeeper/website/docs/constrainttemplates#v1-constraint-template
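
To illustrate, here is a sketch of a v1 ConstraintTemplate with a structurally correct schema, modeled on the required-labels example from the Gatekeeper docs; the template name and the `labels` parameter are illustrative, not taken from this PR:

```yaml
apiVersion: templates.gatekeeper.sh/v1
kind: ConstraintTemplate
metadata:
  name: k8srequiredlabels   # illustrative template name
spec:
  crd:
    spec:
      names:
        kind: K8sRequiredLabels
      validation:
        # v1 templates require a structural schema: every property declares
        # its type, and arrays declare the type of their items. Templates
        # migrated from v1beta1 instead carry `legacySchema: true` here.
        openAPIV3Schema:
          type: object
          properties:
            labels:
              type: array
              items:
                type: string
  targets:
    - target: admission.k8s.gatekeeper.sh
      rego: |
        package k8srequiredlabels

        violation[{"msg": msg}] {
          provided := {label | input.review.object.metadata.labels[label]}
          required := {label | label := input.parameters.labels[_]}
          missing := required - provided
          count(missing) > 0
          msg := sprintf("you must provide labels: %v", [missing])
        }
```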

@kubermatic-bot added labels: release-note, dco-signoff: yes, sig/cluster-management, size/L, sig/crd, sig/api, sig/app-management — Feb 11, 2022
@lsviben (Contributor Author) commented Feb 11, 2022

/retest

2 similar comments
@lsviben (Contributor Author) commented Feb 11, 2022

/retest

@lsviben (Contributor Author) commented Feb 11, 2022

/retest

@xrstf (Contributor) commented Feb 12, 2022

/override pre-kubermatic-test-helm-charts

This one is broken because of the new Docker registry mirroring setup, specifically the dangling socat process at the end. The charts in this PR are actually fine.

/test pre-kubermatic-opa-e2e

This one is flaky.

Are we sure this can be applied over an existing OPA installation without issues? Changes to CRDs are always scary to me.

@kubermatic-bot (Contributor)

@xrstf: Overrode contexts on behalf of xrstf: pre-kubermatic-test-helm-charts

In response to this:

> /override pre-kubermatic-test-helm-charts
>
> This one is broken because of the new Docker registry mirroring setup, specifically the dangling socat process at the end. The charts in this PR are actually fine.
>
> /test pre-kubermatic-opa-e2e
>
> This one is flaky.
>
> Are we sure this can be applied over an existing OPA installation without issues? Changes to CRDs are always scary to me.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@lsviben (Contributor Author) commented Feb 12, 2022

> /override pre-kubermatic-test-helm-charts
>
> This one is broken because of the new Docker registry mirroring setup, specifically the dangling socat process at the end. The charts in this PR are actually fine.
>
> /test pre-kubermatic-opa-e2e
>
> This one is flaky.
>
> Are we sure this can be applied over an existing OPA installation without issues? Changes to CRDs are always scary to me.

Not sure. This OPA test became flaky again after the CRD migration; I have to check to see why.

So I don't know if it's flaky now or if it's something real. But no rush with this PR, I'll check it on Monday.

@xrstf (Contributor) commented Feb 12, 2022

Let me quote my notes from team A's Slack channel 2 hours ago:

I just checked, and the situation was:

  • Gatekeeper pods were scheduled, but could not run because the nodes weren't ready yet. The test should wait for node==Ready, not node==Exists.
  • The Gatekeeper webhook seems to block the Gatekeeper deployments, because k8s fails to create events because the webhook is not yet ready. Chicken & egg problem.

Maybe the webhook should only be deployed after the Gatekeeper pods are ready.

We already have 2 statuses on the ClusterStatus for the 2 Gatekeeper deployments, so tying the webhook deployment to the status should be trivial.

There's a chance the CRD migration made it flaky, but I would guess it's more that the improved ClusterStatus handling made things so fast that we now experience these issues 😁

@lsviben (Contributor Author) commented Feb 14, 2022

> Let me quote my notes from team A's Slack channel 2 hours ago:
>
> I just checked, and the situation was:
>
>   • Gatekeeper pods were scheduled, but could not run because the nodes weren't ready yet. The test should wait for node==Ready, not node==Exists.
>   • The Gatekeeper webhook seems to block the Gatekeeper deployments, because k8s fails to create events because the webhook is not yet ready. Chicken & egg problem.
>
> Maybe the webhook should only be deployed after the Gatekeeper pods are ready.
> We already have 2 statuses on the ClusterStatus for the 2 Gatekeeper deployments, so tying the webhook deployment to the status should be trivial.
>
> There's a chance the CRD migration made it flaky, but I would guess it's more that the improved ClusterStatus handling made things so fast that we now experience these issues 😁

Right, it could be the ClusterStatus.

Anyway, the webhook shouldn't block the deployment, because it has a failure policy set to Ignore (allow if the validation request fails), but maybe something changed; I'll check it out.

In the meantime, regarding this PR, I found one place in the e2e tests where the old v1beta1 ConstraintTemplate was being used; maybe it helps here.
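
For reference, a minimal sketch of a webhook configuration with that failure policy; the names, rules, and service path here are illustrative, not taken from the actual Gatekeeper manifest:

```yaml
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingWebhookConfiguration
metadata:
  name: gatekeeper-validating-webhook   # illustrative name
webhooks:
  - name: validation.gatekeeper.sh
    # failurePolicy: Ignore means requests are admitted when the webhook
    # endpoint cannot be reached, so an unready webhook should not be able
    # to deadlock the gatekeeper deployment itself.
    failurePolicy: Ignore
    clientConfig:
      service:
        name: gatekeeper-webhook-service   # illustrative service
        namespace: gatekeeper-system
        path: /v1/admit                    # illustrative path
    rules:
      - apiGroups: ["*"]
        apiVersions: ["*"]
        operations: ["CREATE", "UPDATE"]
        resources: ["*"]
    sideEffects: None
    admissionReviewVersions: ["v1", "v1beta1"]
```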

@lsviben (Contributor Author) commented Feb 14, 2022

> Let me quote my notes from team A's Slack channel 2 hours ago:
>
> I just checked, and the situation was:
>
>   • Gatekeeper pods were scheduled, but could not run because the nodes weren't ready yet. The test should wait for node==Ready, not node==Exists.
>   • The Gatekeeper webhook seems to block the Gatekeeper deployments, because k8s fails to create events because the webhook is not yet ready. Chicken & egg problem.
>
> Maybe the webhook should only be deployed after the Gatekeeper pods are ready.
> We already have 2 statuses on the ClusterStatus for the 2 Gatekeeper deployments, so tying the webhook deployment to the status should be trivial.
>
> There's a chance the CRD migration made it flaky, but I would guess it's more that the improved ClusterStatus handling made things so fast that we now experience these issues 😁
>
> Right, it could be the ClusterStatus.
>
> Anyway, the webhook shouldn't block the deployment, because it has a failure policy set to Ignore (allow if the validation request fails), but maybe something changed; I'll check it out.
>
> In the meantime, regarding this PR, I found one place in the e2e tests where the old v1beta1 ConstraintTemplate was being used; maybe it helps here.

I'll first focus on fixing the OPA test flakiness, then check here. It seems this one is failing for a different issue, so I'll put it on hold for now.

/hold

@kubermatic-bot added the do-not-merge/hold label — Feb 14, 2022
@kubermatic-bot added the size/XXL label and removed the size/L label — Feb 18, 2022
@lsviben (Contributor Author) commented Feb 18, 2022

There is some issue with the OPA e2e tests in this PR besides the flakiness, so I will be checking it here and keeping the PR on hold.

@lsviben (Contributor Author) commented Feb 18, 2022

/retest

@lsviben force-pushed the upgrade-gatekeeper-to-3.6.0 branch 3 times, most recently from b780ef7 to d79bd89 — February 21, 2022 18:10
@lsviben (Contributor Author) commented Feb 21, 2022

/retest

2 similar comments
@lsviben (Contributor Author) commented Feb 21, 2022

/retest

@lsviben (Contributor Author) commented Feb 21, 2022

/retest

@kubermatic-bot removed the size/XXL label — Feb 22, 2022
@lsviben (Contributor Author) commented Feb 23, 2022

/retest

3 similar comments
@lsviben (Contributor Author) commented Feb 23, 2022

/retest

@lsviben (Contributor Author) commented Feb 24, 2022

/retest

@lsviben (Contributor Author) commented Feb 24, 2022

/retest

@lsviben removed the do-not-merge/hold label — Feb 24, 2022
@lsviben (Contributor Author) commented Feb 25, 2022

@xrstf this is ready for re-review. The failing OPA test was fixed in the commit; the issue was that dynamically you can't (or at least I haven't found a way to) properly create an openAPIV3Schema for arrays that includes the item type. And since the new v1 ConstraintTemplates demand this, it was failing.
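
For context, the difference looks like this; a sketch with a hypothetical `labels` array parameter:

```yaml
# Not structural: rejected by v1 ConstraintTemplates, because the array
# does not declare the type of its items (v1beta1 tolerated this).
openAPIV3Schema:
  type: object
  properties:
    labels:
      type: array
---
# Structural: accepted, because the array declares its item type.
openAPIV3Schema:
  type: object
  properties:
    labels:
      type: array
      items:
        type: string
```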

Btw, the OPA test is still flaky because of #9047.

@lsviben (Contributor Author) commented Feb 28, 2022

@xrstf PTAL

@lsviben force-pushed the upgrade-gatekeeper-to-3.6.0 branch from 3e8f236 to bcac200 — March 2, 2022 16:15
@lsviben (Contributor Author) commented Mar 3, 2022

/retest

@xrstf (Contributor) left a comment

/approve

@kubermatic-bot added the lgtm label — Mar 3, 2022
@kubermatic-bot (Contributor)

LGTM label has been added.

Git tree hash: 557d8027fbd8d1c520041e63efd4cdc25c686350

@kubermatic-bot (Contributor)

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: lsviben, xrstf

The full list of commands accepted by this bot can be found here.

The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@kubermatic-bot added the approved label — Mar 3, 2022
@kubermatic-bot kubermatic-bot merged commit a36066e into kubermatic:master Mar 3, 2022
Labels
approved, dco-signoff: yes, lgtm, release-note, sig/api, sig/app-management, sig/cluster-management, sig/crd, size/L
Development

Successfully merging this pull request may close these issues.

Upgrade OPA integration gatekeeper to 3.6.0
3 participants