Conversation

@vivekbagade
Contributor

Also add test plan and answer scalability questionnaire

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Oct 5, 2020
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: vivekbagade
To complete the pull request process, please assign lavalamp after the PR has been reviewed.
You can assign the PR to them by writing /assign @lavalamp in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. labels Oct 5, 2020
@vivekbagade
Contributor Author

/assign @liggitt

This section details the behavior of the mechanism when encountering non-optimal
situations.
In cases where:
Contributor

and if any of these are true when reading the file on a process start, does the process fail to start? I would expect it to.

Member

that matches my expectation, but should be called out explicitly

@deads2k
Contributor

deads2k commented Oct 5, 2020

This answers the outstanding questions that I know of. I'm ok with this moving to implementable if we can find a solid reviewer who can sign up for the work (I'm not available).

@lavalamp any comments you'd like addressed before implementable?


through the API.

### New AdmissionConfig schema

Member

I don't see this called out, but the treatment of non-absolute file paths for webhooksFile should be described. Absolute paths are required for the kubeconfigFile field in the same config, which is the simplest approach. If relative paths are allowed, I would expect them to be treated relative to the location of the admission config file, which requires replumbing some bits of admission config loading.
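For illustration only, the relative-path treatment described above might look like the following sketch; `resolveWebhooksFile` and its parameter names are invented here, not part of the real admission config loader:

```go
package main

import (
	"fmt"
	"path/filepath"
)

// resolveWebhooksFile is a hypothetical helper: absolute paths are used
// as-is, while relative paths are resolved against the directory
// containing the admission config file, as suggested in the comment.
func resolveWebhooksFile(admissionConfigPath, webhooksFile string) string {
	if filepath.IsAbs(webhooksFile) {
		return webhooksFile
	}
	return filepath.Join(filepath.Dir(admissionConfigPath), webhooksFile)
}

func main() {
	// Relative path: resolved next to the admission config file.
	fmt.Println(resolveWebhooksFile("/etc/kubernetes/admission-config.yaml", "webhooks.yaml"))
	// → /etc/kubernetes/webhooks.yaml

	// Absolute path: passed through unchanged.
	fmt.Println(resolveWebhooksFile("/etc/kubernetes/admission-config.yaml", "/var/lib/webhooks.yaml"))
	// → /var/lib/webhooks.yaml
}
```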


a. Webhook metadata: This metric would contain metadata for each webhook invoked,
both API based and manifest based. The metadata includes the webhook type
(mutating/validating) and whether it is manifest based.
Member

detail the name and labels of the metric?

Member

also, add the metric to the metrics section of the kep.yaml file (same for all the metrics)
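Purely as an illustration of what is being asked for (the metric name and labels below are invented here, not taken from the KEP), the invocation metric might be exposed as:

```
# HELP apiserver_admission_webhook_invocations_total Number of admission webhook invocations.
# TYPE apiserver_admission_webhook_invocations_total counter
apiserver_admission_webhook_invocations_total{type="mutating",manifest_based="true"} 7
apiserver_admission_webhook_invocations_total{type="validating",manifest_based="false"} 42
```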


b. Manifest errors: Since the manifest file can be reconfigured after API server
startup, we would add a metric informing users whether there were errors
loading/parsing the current manifest file.
Member

detail the name and labels of the metric?
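Again for illustration only (invented name and labels), the manifest-error metric could be a gauge that is nonzero while the current file fails to load:

```
# HELP apiserver_admission_manifest_load_error Whether the last reload of the webhook manifest file failed.
# TYPE apiserver_admission_manifest_load_error gauge
apiserver_admission_manifest_load_error{reason="parse_error"} 1
```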


Audit annotations for mutating webhooks would also carry an additional field
`manifestBased` indicating the invocation of a manifest based webhook.
Member

give an example of the audit annotation including the manifestBased field
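One possible shape for such an annotation, assuming the existing mutating webhook audit annotation format with the proposed field added (the configuration and webhook names here are invented):

```json
{
  "mutation.webhook.admission.k8s.io/round_0_index_0": "{\"configuration\":\"example-config\",\"webhook\":\"example.example.com\",\"mutated\":true,\"manifestBased\":true}"
}
```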

@liggitt
Member

liggitt commented Oct 5, 2020

Asked for a few clarifications, then lgtm.

I can review the implementation.

@lavalamp
Contributor

lavalamp commented Oct 5, 2020

This design has the following problems:

  • Doesn't work for other types (e.g. CRDs).
  • No way for status to be reported on the "virtual" resources.
  • There's no longer a single source of truth.
  • It's very hard for a user to figure out what the system is doing or how to change that.

So, I'm a big -1 on this design.

I'd accept a design that either

  1. works for all resources, by adding a controller that reads a directory for manifests and puts them in etcd, and also configures an authz plugin to block non-system changes to those resources, OR
  2. provides just an authz mechanism that prevents non-system changes to certain resources, plus a blessed super-super-user account that can change those resources (and then it's up to the holder of that account to fix them).

Honestly I think the authz addition is too heavy-handed and in the common case will prevent people from fixing clusters as frequently as it prevents them from breaking clusters; but I recognize that it is greatly desired.

@liggitt
Member

liggitt commented Oct 5, 2020

This design has the following problems:

  • Doesn't work for other types (e.g. CRDs).
  • No way for status to be reported on the "virtual" resources.
  • There's no longer a single source of truth.
  • It's very hard for a user to figure out what the system is doing or how to change that.

I'm ok with taking a step back to see if we can come up with an approach that is more coherent, and I agree there are multiple resources that have similar use cases, but some of those characteristics are features in some scenarios :-)

  • local-only policy config allows controlled rollout in a multi-server environment without worrying about another server stomping config ("my apiserver will load policy file X into etcd, then enforce whatever is present in etcd" is a way weaker guarantee than "my apiserver will enforce policy file X")
  • stratifying policy specified by the cluster operator and policy specified/inspectable/changeable by the cluster user is useful in managed environments

@vivekbagade
Contributor Author

vivekbagade commented Oct 6, 2020

  • No way for status to be reported on the "virtual" resources.

Webhook config does not have a status field, IIUC.

  • There's no longer a single source of truth.

Yes, I partially agree with this. But the source of truth moves from the API to the API server metrics, which are not as useful.

  • It's very hard for a user to figure out what the system is doing or how to change that.

I don't completely agree. This mechanism is meant to give operators a better option than compiled-in admission control, and it provides better visibility than compiled-in admission control does: there are audit annotations (for mutating webhooks) and metrics that help with this. As for what "users" can do to alter the behavior, that is a feature of the mechanism (extremely desirable in managed platforms); users would just have to reach out to their operator.

So, I'm a big -1 on this design.

I'd accept a design that either

  1. works for all resources, by adding a controller that reads a directory for manifests and puts them in etcd, and also configures an authz plugin to block non-system changes to those resources, OR
  2. provides just an authz mechanism that prevents non-system changes to certain resources, plus a blessed super-super-user account that can change those resources (and then it's up to the holder of that account to fix them).

Honestly I think the authz addition is too heavy-handed and in the common case will prevent people from fixing clusters as frequently as it prevents them from breaking clusters; but I recognize that it is greatly desired.

I also want to +1 Jordan's comment on the usefulness of stratifying policy specified by the operator vs policy specified by the cluster user.

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 4, 2021
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Feb 3, 2021
@fejta-bot

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-contributor-experience at kubernetes/community.
/close

@k8s-ci-robot
Contributor

@fejta-bot: Closed this PR.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-contributor-experience at kubernetes/community.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
