Navigation Menu

Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

KEP-2149: Cluster id api review request #3084

Merged

Conversation

lauralorenz
Copy link
Contributor

@lauralorenz lauralorenz commented Dec 13, 2021

  • Other comments:
  1. This KEP proposes an out-of-tree CRD to be hosted at (provisionally named) kubernetes-sigs/about-api. The names selected in this proposal (about.k8s.io API group; ClusterProperty kind name; id.k8s.io and clusterset.k8s.io well-known CRs) were voted on by surveys sent to the SIG-Multicluster mailing list:
    • CR name announcement with data is in this email.
    • API name announcement with data in it is in this email.
  2. Another well-known CR is under discussion right now in KEP-1645: Add specification for multi-network scenario #3045. I'm interested to know if additions like these need to undergo their own API review if/when they are included in the base KEP.
  3. If I understand from the directions correctly API review starts with the KEP, but let me know if you need an implementation to go with it

@k8s-ci-robot
Copy link
Contributor

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Dec 13, 2021
@lauralorenz
Copy link
Contributor Author

/label api-review

@k8s-ci-robot k8s-ci-robot added the api-review Categorizes an issue or PR as actively needing an API review. label Dec 14, 2021
@lauralorenz lauralorenz marked this pull request as ready for review December 14, 2021 05:36
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Dec 14, 2021
@lauralorenz
Copy link
Contributor Author

cc @thockin @JeremyOT @pmorie

@liggitt liggitt added this to Assigned in API Reviews Dec 14, 2021
@thockin
Copy link
Member

thockin commented Dec 17, 2021

  • ClusterProperty.name: The KEP specs "id.k8s.io" as a well-known name and reserves a set of suffixes for future use. Did we consider "bare" (no suffix) names? Are they reserved for system use or for end-users? As I think about how this will be used, it seems like it might be more valuable to claim bare names for system-use and demand that non-well-known names use a suffix.

E.g. kubectl get clusterproperty id => well-known meaning. kubectl get clusterproperty fingerprint.mycompany.com => custom.

  • ClusterProperty.value defines a max-length of 128k. Does the CRD subsystem support declarative validation of that flavor? It would be really nice to take a position of "no webhooks required" for this apigroup. If not, we should talk to CRD folks (I know Joe Betz was doing work here) and see if we can get there.

@JeremyOT
Copy link
Member

JeremyOT commented Feb 8, 2022

it seems like it might be more valuable to claim bare names for system-use and demand that non-well-known names use a suffix.

Agreed - this seems nice and convenient. We can still claim k8s.io and other 1st party suffixes (we have to) but claiming bare for first party makes sense. Given that we had previously required suffixes for everything, I don't think this will be controversial.

max-length of 128k. Does the CRD subsystem support declarative validation of that flavor?

Yes, I believe maxLength validation has been supported since CRD went v1. Huge +1 on "no webhooks required" for this apigroup. Users can install their own if they want, but the basic API should be simple

@k8s-ci-robot k8s-ci-robot added kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory sig/multicluster Categorizes an issue or PR as relevant to SIG Multicluster. labels Feb 11, 2022
@lauralorenz
Copy link
Contributor Author

Reserved bare names and provided examples in ff80ff9.

I confirmed maxLength validation is in OpenAPIv3 (and v2 for that matter) which CRD validation is built on, and maxLength specifically survived the apparent purge in KEP-2335 which limited the number and types of OpenAPIv3 validations "vanilla" v1 CRDs allow (aka without enabling further features). CRD validation (including maxLength) is on by default in CRD v1 and was feature gated in beta.

That being said the OpenAPI maxLength validation is set as an integer and by definition it validates against the string's number of characters. OpenAPI spec says all strings should be unicode code points, so unless something else in Kubernetes will be restricting this to ASCII, seems to me maxLength validation on a OpenAPIv3 string can't help us validate byte lengths. Should/can we even reasonably change this to a maximum character length? Are there other examples of byte-limited fields we can draw from (either CRDs or in core -- seems like annotations get some custom validation if I'm looking in the right place).

@liggitt
Copy link
Member

liggitt commented Feb 11, 2022

unless something else in Kubernetes will be restricting this to ASCII

couldn't the pattern validation restrict the value to ASCII?

@lauralorenz
Copy link
Contributor Author

@liggitt yes, though I brought ASCII up only for the purposes of discussing the limitations of maxLength validating against a byte limit. In reality I don't think ClusterProperty.value should be limited to ASCII.

I don't have very good context on whether restricting input to ASCII is generally acceptable in "arbitrary text" type fields in Kubernetes or out-of-tree APIs. I just checked live against a cluster I had up that annotation values are not restricted to ASCII, while things like names/namespaces of course are (as they must be RFC 1123 subdomains). To me the ClusterProperty.value field operates much more like an annotation value between those two examples and would be expected to behave similarly when it comes to input.

@thockin
Copy link
Member

thockin commented Feb 16, 2022

Agree that ASCII seems like an artificial restriction here (and in fact, we probably should have done unicode all over the place, but that's a different argument).

I am not a unicode expert, but the IIUC the max unicode code point is 4 bytes, right? So, worst case, can we say that the payload is 128k bytes (as few as 32k unicode code points) or 128K code points (up to 512k bytes)? Which one does openAPI validation gurantee?

Copy link
Member

@thockin thockin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we use the "bare" names for the well-known properties defined above?

keps/sig-multicluster/2149-clusterid/README.md Outdated Show resolved Hide resolved
@lauralorenz
Copy link
Contributor Author

Yes 4 bytes is the max.

Honestly I'm link spelunking back and forth between OpenAPI and JSON spec and the language of "Unicode character" vs "Unicode code point" seems inconsistent or at best very confusing, but based on some open API client code examples and as much of the spec as I understand, I think all that OpenAPI can promise is Unicode code point count (which is not necessarily the same as perceived "characters" -- for example 🇺🇸 is 2 Unicode code points). Therefore I think it's best for us to state it in Unicode code point count. I changed the text to try and get all this across in 4f2a5b8; as for 128k code points vs 32k code points, I assume our bottleneck is etcd value limits which both of those*4 bytes are very low compared to so I went with the higher limit, 128k code points.

# python 3 where strings are Unicode code points
>>> len("𐀀".encode("utf-8"))
4
>>> len("𐀀".encode("utf-16"))
6
>>> len("𐀀".encode("utf-32"))
8
>>> len("𐀀")
1
>>> len("🇺🇸")
2

@lauralorenz
Copy link
Contributor Author

Should we use the "bare" names for the well-known properties defined above?

Personally I've become partial to the .k8s.io suffix and I feel it also marks it more clearly as system managed/dependent. Also, while the risk is quite minor, even though we explicitly reserve bare names for system use in the spec, I could see people in test clusters messing around with ClusterProperty CRDs and independently invent the need for an id and overwrite MCS' id property by accident and get confused, but its unlikely they would do that to an id.k8s.io property by accident.

@liggitt
Copy link
Member

liggitt commented Feb 22, 2022

Should we use the "bare" names for the well-known properties defined above?

Personally I've become partial to the .k8s.io suffix and I feel it also marks it more clearly as system managed/dependent. Also, while the risk is quite minor, even though we explicitly reserve bare names for system use in the spec, I could see people in test clusters messing around with ClusterProperty CRDs and independently invent the need for an id and overwrite MCS' id property by accident and get confused, but its unlikely they would do that to an id.k8s.io property by accident.

I agree... I wish we'd consistently included kubernetes suffixes on existing annotations and API groups

@JeremyOT
Copy link
Member

I don't have a very strong opinion on which we should use, bare names, or .k8s.io for well-known properties. The arguments for the suffix are clear, and the arguments against seems to be minor usability improvements with kubectl. I'm inclined to lean toward to the suffix.

Either way, the important part is that we require users to provide a suffix, and that's been done

Copy link
Member

@thockin thockin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm OK with suffix being required.

@@ -300,10 +300,12 @@ _For example, [CAPN's virtualcluster project](https://github.com/kubernetes-sigs
The `ClusterProperty` resource provides a way to store identification related, cluster scoped information for multi-cluster tools while creating flexibility for implementations. A cluster may have multiple `ClusterProperty`s, each holding a different identification related value. Each property contains the following information:

* **Name** - a well known or custom name to identify the property.
* **Value** - a property-dependent string, up to 128 KB.
* **Value** - a property-dependent string, up 128k Unicode code points (see _Note_).
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Uggh I hate this topic. I can't keep it straight in my head.

If we are asserting length in OpenAPI, do we need to assert an ecoding (e.g. UTF8)? What if the value is NOT valid unicode - what will the validator do? E.g. can I craft a value of entirely invalid bytes, and will that trip up the validator?

Sorry to keep poking this, but if we get it wrong it's a trap for consumers.

IMO:

If it's a string we should validate that it is actually all unicode characters in some encoding, of some max length in code points.

If it's an array of bytes we should validate the length in bytes and nothing more.

Thoughts?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

tl;dr I think we leave the wording as is. UTF-32 is weird but I don't think we should solve that here

Ok tested this empirically on a kind cluster using a CRD with ClusterProperty.value's maxLength set to 3 (see describe crd), and trying to insert one CR whose value uses 2 code points ("🇺🇸"), one CR whose value uses 4 code points ("🇺🇸🇺🇸"), and also trying with my YAML encoded in UTF-8, UTF-16, and UTF-32 (methodology).

If we are asserting length in OpenAPI, do we need to assert an ecoding (e.g. UTF8)?

Empirically I confirmed that OpenAPIv3 maxLength is based on the number of Unicode code points of the decoded string because "🇺🇸" passes validation while "🇺🇸🇺🇸" does not [1]. This is regardless of if my YAML was originally in UTF-8 or UTF-16 (and FYI "🇺🇸" uses 8 and 10 bytes in UTF-8 and UTF-16 respectively).

Watching from the outside my best guess is by the time OpenAPIv3's maxLength validation code gets involved, the input is already decoded, though there is a way to specify encoding validation in OpenAPIv3 BUT it is not exposed in the k8s CRD version of OpenAPIv3 (ref) and I also empirically observed it was not allowed when I tried anyways. I don't think we need to enforce an encoding since ultimately we are limiting based on code points -- as long as someone's bytes can get decoded into Unicode code points by the time the validator checks, OpenAPIv3 maxLength doesn't care about the encoding and neither do I.

Empirically I also observed using UTF-8 OR UTF-16 encoded YAML both work fine, but something doesn't seem to be able to decode UTF-32 as I cannot apply my YAML if it's in UTF-32 [1]. Assuming my methodology to get to UTF-32 encoded YAML file was all correct, then from the error thrown it seems like the yaml->json converter library used by kubectl maybe can't sniff that my file was UTF-32. I'm not sure k8s generally really cares if people have their YAML in UTF-32 but if they do this might be worth investigating more outside of the context of this KEP.

[1] See the output.

What if the value is NOT valid unicode - what will the validator do

I observe that if I put an invalid UTF-8 byte sequence in as the value aka something that cannot be interpreted into a Unicode code point when decoding from UTF-8, it gets validated against number of resulting replacement characters ("�") which is one per invalid byte. I think for our purposes this means it didn't blow up and that is acceptable (as I'm sure we don't intend to support invalid bytes in UTF-* for this field). Specifically I tried cases 3.1.3 and 3.5.1 in this UTF-8 test file.

Importantly regarding the methodology of this test, I did not copy paste between browsers / text editors but instead painstakingly (though probably not in the most efficient way) manipulated it in a Python3 interpreter (which is the language I know the best when it comes to the quirks of the string implementation). In summary I wgetted that file, read that file into a Python interpreter as bytes only, concatenated the bad bytes into the value field of the bytes of my YAML, and wrote it to file as bytes and never decoded / reencoded it, and even double checked by reading it in as bytes again that the byte sequence was not the byte sequence for the replacement character -- in other words, I feel very confident the bad bytes were not turned into "�" until k8s did so sometime during kubectl apply.

IMO:

If it's a string we should validate that it is actually all unicode characters in some encoding, of some max length in code points.

It is a string, agree on max length is against code points, and for reasons above I don't think we should enforce an encoding and also we can't easily.

If it's an array of bytes we should validate the length in bytes and nothing more.

It's not. The only openings this way seem to be OpenAPIv3 formats "byte" (base64 encoded) or "binary" (currently not supported by the k8s CRD implementation of OpenAPIv3), and which would require everyone to write CRs in a non-human readable way like this which we would never do:

apiVersion: about.x-k8s.io/v1alpha1
kind: ClusterProperty
metadata:
  name: clusterproperty-fits
spec:
  value: "8J+HuvCfh7g=" # base64(flag emoji in UTF-8)

@lauralorenz lauralorenz requested a review from thockin March 8, 2022 02:27
@lauralorenz
Copy link
Contributor Author

For the literary Unicode fans in the house: https://twitter.com/FakeUnicode/status/893305253420027904?s=20&t=18SvHLy3NyffBZGWXXOq1w

@thockin
Copy link
Member

thockin commented Mar 8, 2022

Thanks!

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Mar 8, 2022
@lauralorenz
Copy link
Contributor Author

/assign @JeremyOT

@JeremyOT
Copy link
Member

Thanks Laura!

/lgtm
/approve

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: JeremyOT, lauralorenz, thockin

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Mar 10, 2022
@k8s-ci-robot k8s-ci-robot merged commit c87c669 into kubernetes:master Mar 10, 2022
@k8s-ci-robot k8s-ci-robot added this to the v1.24 milestone Mar 10, 2022
@liggitt liggitt moved this from Assigned to API review completed, 1.24 in API Reviews Aug 17, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
api-review Categorizes an issue or PR as actively needing an API review. approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory lgtm "Looks good to me", indicates that a PR is ready to be merged. sig/multicluster Categorizes an issue or PR as relevant to SIG Multicluster. size/XS Denotes a PR that changes 0-9 lines, ignoring generated files.
Projects
Status: API review completed, 1.24
Development

Successfully merging this pull request may close these issues.

None yet

5 participants