Validation Tightening is not Generally Possible #64841

Open
lavalamp opened this issue Jun 6, 2018 · 37 comments
Comments

@lavalamp (Member) commented Jun 6, 2018

I have made this comment on a few PRs so maybe it's time to document and/or build a mechanism.

You cannot generally make validation tighter, because any existing objects that violate the new rule will not be able to be updated, and in many cases, deleted (because deletions involving finalizers require a sequence of updates).

This is because, to validate updates, we validate BOTH the transition, AND the final state of the object. It's this last part that is problematic.

It should be possible to address this by adding a new class of validation function that ONLY gets called on new objects, but then we have these problems:

  • When is it safe to promote the logic into the main validation function? There's no way to know.
  • This solves the problem for the apiserver, but not for any clients that were sending invalid objects. Automated clients in particular will suddenly be broken, with no recourse.

Even at version boundaries it's hard to tighten validation, since objects have to be round-trippable and there are still old objects in the system.

So, new API authors: start out with restrictive validation; it's easier to relax than to tighten.

Existing API authors: prepare to write defensive clients.

/kind bug
/sig api-machinery

@liggitt (Member) commented Jun 6, 2018

a methodical way to do ratcheting validation would be ideal:

  • apply tightened validation on create
  • apply tightened validation on update where the old object passed this validation (don't let a valid object become invalid)
  • skip tightened validation on update where the old object fails this validation
@lavalamp (Member, Author) commented Jun 6, 2018

Yes, that's the mechanism I had in mind when I wrote the two bullet points in the original post.

(also thanks for the extensive list of cross references)

@jennybuckley (Contributor) commented Jun 7, 2018

/cc @mbohlool

@liggitt (Member) commented Jun 8, 2018

/assign

@liggitt (Member) commented Jun 8, 2018

WIP in #64907

@tallclair (Member) commented Jul 6, 2018

xref #64532

@bgrant0607 (Member) commented Jul 17, 2018

Sorry for not noticing this earlier.

Tightening validation can break clients. I don't see much of a way around that. Special-casing updates protects the apiserver but not clients and users.

It seems like the only recourse would be to downgrade to the prior release until the clients can be fixed. And downgrade is known to not work in the general case.

Maybe we need an advisory mode that also applies to create, so that users can debug problems before they irreversibly hose their clusters?

@bgrant0607 (Member) commented Jul 17, 2018

https://github.com/kubernetes/community/blob/master/contributors/devel/api_changes.md#backward-compatibility-gotchas

Changing any validation rules always has the potential of breaking some client, since it changes the assumptions about part of the API, similar to adding new enum values. Validation rules on spec fields can neither be relaxed nor strengthened. Strengthening cannot be permitted because any requests that previously worked must continue to work. Weakening validation has the potential to break other consumers and generators of the API resource. Status fields whose writers are under our control (e.g., written by non-pluggable controllers), may potentially tighten validation, since that would cause a subset of previously valid values to be observable by clients.

@bgrant0607 (Member) commented Jul 17, 2018

The end of that should be: "to no longer be observable by clients"

@bgrant0607 (Member) commented Jul 17, 2018

The list of cross references was useful.

We should document how to handle common cases, such as duplicate list entries.

It looks like maybe a few of them wouldn't have worked, but with asynchronous failures rather than synchronous ones.

However, it's possible for cases we didn't intend to be supported, such as multiple instances of the same environment variable, to be working for someone.

@bgrant0607 (Member) commented Jul 17, 2018

Other thoughts:

We also need to demand validation tests that fail for every field that doesn't accept all values of the base type.

This doesn't handle cases where contributors introduce new validation checks outside the ratcheting mechanism. What are we trying to defend against?

The cluster admin will want to know if any resources have potential problems. We need to introduce exported variables that surface the affected resource types.

@liggitt (Member) commented Jul 17, 2018

Tightening validation can break clients. I don't see much of a way around that. Special-casing updates protects the apiserver but not clients and users.

What are we protecting clients and users from? My intent with these was to only address validation gaps that were allowing in fatally flawed objects.

It seems like the only recourse would be to downgrade to the prior release until the clients can be fixed.

Is it a great loss to block a client that is submitting deployments/pods that cannot succeed? Is the concern that such a client would not back off and would perpetually attempt to create the objects? Doesn't that risk already exist with dynamic reasons for denial (quota, authorization, webhook admission)?

any requests that previously worked must continue to work

What does that mean for an object which, if created, cannot work properly, and will inevitably fail asynchronously (sometimes taking a long time to do so, as in the case of the duplicate PVC volume refs)? The fixes in the linked PR were intended to resolve problems of that variety.

The only one I could see being questionable is the duplicate envvar fix. That seemed valuable to prevent because it is known to cause data loss when using apply to remove the extra envvar on an update, but I can pull that one out if we'd rather think of a different way to address it (since a pod with that duplication can still run successfully).

@bgrant0607 (Member) commented Jul 17, 2018

@liggitt I agree with respect to requests that were fatally flawed. However, we haven't always been correct when trying to identify such cases in the past.

@bgrant0607 (Member) commented Jul 17, 2018

Example:

It looks like unknown emptyDir medium values succeed:

```go
	if buf.Type == linuxTmpfsMagic {
		return v1.StorageMediumMemory, !notMnt, nil
	} else if int64(buf.Type) == linuxHugetlbfsMagic {
		return v1.StorageMediumHugePages, !notMnt, nil
	}
	return v1.StorageMediumDefault, !notMnt, nil
```
@liggitt (Member) commented Jul 17, 2018

It looks like unknown emptyDir medium values succeed

If that's the case, I agree it shouldn't be tightened. I was going off of #52936 (comment)

@bgrant0607 (Member) commented Jul 17, 2018

Sorry, that's not the right check.

So, maybe not:

```go
switch ed.medium {
```

But, anyway, my point is that sometimes these can unintentionally succeed.

@liggitt (Member) commented Jul 17, 2018

Yeah, point taken. I put a hold on the PR until we can think through whether this approach is a net positive. If we end up moving forward with it, we'll need to be really clear about the types of scenarios where it is an appropriate tool to use.

@bgrant0607 (Member) commented Jul 17, 2018

@liggitt One approach could be to write tests that demonstrate asynchronous failure and commit them before enabling new validation.

I also think we need to start requiring tests for issues such as duplicate entries in associative lists ASAP.

@tallclair (Member) commented Jul 17, 2018

Along those lines, I actually don't think the integer memory requirement should be included. I can successfully start a pod with fractional memory requirements, which just get rounded when necessary.

@apelisse (Member) commented Jul 17, 2018

Would it be possible to implement a tighter validation as a validating webhook, that the cluster admin can decide to enable/disable? There could even be a way (annotation?) to ignore specific objects known to be broken.

@liggitt (Member) commented Jul 17, 2018

One approach could be to write tests that demonstrate asynchronous failure and commit them before enabling new validation.

Good idea.

I also think we need to start requiring tests for issues such as duplicate entries in associative lists ASAP.

Yes. Validation in general tends to get little attention in new API PRs.

@yue9944882 (Member) commented Dec 12, 2018

/remove-lifecycle stale

@lavalamp (Member, Author) commented Mar 28, 2019

https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/api_changes.md#backward-compatibility-gotchas

There's another way validation changes can break clients: by permitting formerly disallowed field contents, e.g. a new character in a name. Clients that depended on the former limited expressiveness could be broken. This is similar to the problem that arises when one adds an enum value.

Detecting this in a fully general way is going to be hard to impossible, because the error happens after the client has done a read. The error may actually be very subtle and not cause a crash (imagine allowing / in some field which then leads to a client having a path traversal vulnerability).

@liggitt (Member) commented Mar 28, 2019

There's another way validation changes can break clients. By e.g. permitting formerly disallowed field contents, e.g. a new character in a name. Clients that depended on the former limited expressiveness could be broken.

Correct. #63362 (comment) are the types of questions we need to ask in cases like that.

This is similar to the problem that obtains when one adds an enum value.

At least with enums, we have the option (when defining the field for the first time) to specify what a client should do if it encounters an unknown enum value.

@lavalamp (Member, Author) commented Mar 28, 2019

At least with enums, we have the option (when defining the field for the first time) to specify what a client should do if it encounters an unknown enum value.

Clients which ignore suggestions like that will still exist, so it doesn't seem like a solution to me.

@apelisse (Member) commented Mar 28, 2019

we have the option (when defining the field for the first time) to specify what a client should do if it encounters an unknown enum value.

Yeah, it feels more like it's up to the client to decide what to do with unknown enum values. At least for enum values they can expect that MAYBE it's going to change in the future and plan accordingly. For value validation, there isn't much they can do.

@khenidak (Contributor) commented Aug 22, 2019

We don't currently validate that loadBalancerIP, loadBalancerSourceRanges, and externalIPs match the ipfamily of the cluster. And we do have users with resources that have this broken behavior. This comment is for tracking only. Actions will be planned later. @liggitt @thockin

@lavalamp (Member, Author) commented Aug 22, 2019

  • this is still important
  • serverside apply in particular is probably going to drive some schema changes that will need this
  • My #64841 (comment) above (plus following clarifications) is still how I want to deal with this
  • validation relaxation actually needs to go through this process in reverse to preserve downgradability.
@julianvmodesto (Member) commented Mar 3, 2020

/wg apply
