Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

apimachinery: Add a strict YAML and JSON deserializer option #71589

Closed
wants to merge 1 commit into from

Conversation

neolit123
Copy link
Member

@neolit123 neolit123 commented Nov 30, 2018

What type of PR is this?
/kind feature

What this PR does / why we need it:
pkg/runtime: implement a strict YAML and JSON deserializer

Add a new universal decoder and universal deserializer.
This enables checks for unknown and duplicate fields in input YAML
and JSON data.

Example usage:

runtime.DecodeInto(MyCodecFactory.UniversalStrictDecoder(), content, into)
MyCodecFactory.UniversalStrictDeserializer().Decode(content, gvk, into)

The same CodecFactory can also return the non-strict variants.

A custom json-iterator API object is used to check for unknown fields.
For duplicate fields the sigs.k8s.io/yaml.YAMLToJSONStrict() function
is used.

Also add:

  • Unit tests in json_test.go.
  • New error types StrictDecoderError, DuplicateFieldError,
    UnknownFieldError.

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
xref: kubernetes/community#2977
?

Special notes for your reviewer:
NONE

Does this PR introduce a user-facing change?:

apimachinery: Add a strict YAML and JSON deserializer option

/assign @liggitt @luxas
cc @BenTheElder
/priority important-longterm
/sig api-machinery

@k8s-ci-robot k8s-ci-robot added the release-note Denotes a PR that will be considered when it comes time to generate release notes. label Nov 30, 2018
@k8s-ci-robot k8s-ci-robot added kind/feature Categorizes issue or PR as related to a new feature. priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Nov 30, 2018
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: neolit123
To fully approve this pull request, please assign additional approvers.
We suggest the following additional approver: liggitt

If they are not already assigned, you can assign the PR to them by writing /assign @liggitt in a comment when ready.

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Copy link
Member

@luxas luxas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @neolit123! Great to see this moving forward 👏

As a consumer, I want something in between in the meantime that warns the user, not completely exits the application on a failed strict decode, so I think we need to be able to check the error type, and also it'd be nice if we could get some more programmatic information from the error instead of "just" the string, e.g. the gvk metadata for the type decode that failed.

@luxas luxas added this to the v1.14 milestone Nov 30, 2018
@luxas luxas changed the title pkg/runtime: implement a strict YAML and JSON deserializer apimachinery: Add a strict YAML and JSON deserializer option Nov 30, 2018
@luxas
Copy link
Member

luxas commented Nov 30, 2018

@neolit123 edited the title/relnote a bit to make it clear that this is optional, and not enforced by default.

@neolit123
Copy link
Member Author

neolit123 commented Nov 30, 2018

As a consumer, I want something in between in the meantime that warns the user, not completely exits the application on a failed strict decode, so I think we need to be able to check the error type, and also it'd be nice if we could get some more programmatic information from the error instead of "just" the string, e.g. the gvk metadata for the type decode that failed.

i will definitely include the GVK and properly type the errors.
the lack of proper typing in the backend libraries like encoding/json and yaml.2 is concerning.
they just dump untyped string errors.

currently there are a couple of ways to do this:

  1. create two decoders
if err := runtime.DecodeInto(MyCodecFactory.UniversalStrictDecoder(), content, into); err != nil {
    return fmt.Println(err) // prints warning
}
if err := runtime.DecodeInto(MyCodecFactory.UniversalDecoder(), content, into); err != nil {
    return err;
}
  1. check for error types and make the UniversalStrictDecoder not fail:
if err := runtime.DecodeInto(MyCodecFactory.UniversalStrictDecoder(), content, into); err != nil {
    switch err.(type) {
    case UnknownFieldError:
    case DuplicateFieldError:
        fmt.Println(err)
    default:
        return err
    }
}

both ways would be computationally similar as unmarshaling has to be done twice.
my assumption is that we want to go for 2.

@luxas
Copy link
Member

luxas commented Nov 30, 2018

Option 2 is what I prefer, it is way better.

both ways would be computationally similar as unmarshaling has to be done twice.

See my comment in #71589 (comment) to avoid doing unmarshal twice.

In any case, I think we need some Go benchmarks to know how much slower the strict decoding is, if we're ever gonna use it in places like the API server where milliseconds matter.

@neolit123
Copy link
Member Author

neolit123 commented Nov 30, 2018

@luxas

See my comment in #71589 (comment) to avoid doing unmarshal twice.

the problem is that if we enable strict unmarshal for json-iterator it will fail and not trow a warning (finish the unmashal).
so if we want to go for 2) we still have to unmarshal twice. :\

@neolit123 neolit123 force-pushed the strict-decoder branch 2 times, most recently from 07bf78f to 79837db Compare December 10, 2018 23:08
@neolit123
Copy link
Member Author

@liggitt @luxas
updated, i think i addressed the comments.

this ends up being in the lines of:

strictDecode() <--- decide to what to do with the error.
decode() <--- can continue to process regularly.

please TAL at this part:
https://github.com/kubernetes/kubernetes/pull/71589/files/79837dbe919c2c64d624cbc78ffd2436b3a3b54d#diff-f216f544515d2fd05d66d92c5f95a248
as we may have to remove the custom error types and only leave the StrictDecoderError one.

@neolit123
Copy link
Member Author

neolit123 commented Dec 15, 2018

added unit tests for valid input to the strict decoders.
plus some refactor/optimization.

@liggitt
Copy link
Member

liggitt commented Jan 4, 2019

cc @smarterclayton for strict decoding mechanism

@luxas
Copy link
Member

luxas commented Jan 4, 2019

/assign @smarterclayton

@neolit123
Copy link
Member Author

i will update the PR with the comments by @luxas on Monday.

Add a new universal decoder and universal deserializer.
This enables checks for unknown and duplicate fields in input YAML
and JSON data.

Example usage:
runtime.DecodeInto(MyCodecFactory.UniversalStrictDecoder(), content, into)
MyCodecFactory.UniversalStrictDeserializer().Decode(content, gvk, into)

The same CodecFactory can also return the non-strict variants.

A custom json-iterator API object is used to check for unknown fields.
For duplicate fields the sigs.k8s.io/yaml.YAMLToJSONStrict() function
is used.

Also add:
- Unit tests in json_test.go.
- New error types StrictDecoderError, DuplicateFieldError,
UnknownFieldError.
@neolit123
Copy link
Member Author

i will update the PR with the comments by @luxas on Monday.

updated.

@neolit123
Copy link
Member Author

some benchmarks

pseudo test code:

start := time.Now()
for i := 0; i < 10000; i++ {
    runtime.DecodeInto(myScheme.Codecs.UniversalDecoder(), fileContent, targetObject)
}
elapsed := time.Since(start)
fmt.Printf("elapsed non-strict %v\n", elapsed)
start = time.Now()
for i := 0; i < 10000; i++ {
    runtime.DecodeInto(myScheme.Codecs.UniversalStrictDecoder(), fileContent, targetObject)
}
elapsed = time.Since(start)
fmt.Printf("elapsed strict %v\n", elapsed)

test data:

apiVersion: kubeadm.k8s.io/v1beta1
kind: ClusterConfiguration
etcd:
  local:
    imageRepository: "k8s.gcr.io"
    imageTag: "3.2.24"
    dataDir: "/var/lib/etcd"
    extraArgs:
      listen-client-urls: "http://10.100.0.1:2379"
    serverCertSANs:
    -  "ec2-10-100-0-1.compute-1.amazonaws.com"
    peerCertSANs:
    - "10.100.0.1"
networking:
  serviceSubnet: "10.96.0.0/12"
  podSubnet: "10.100.0.1/24"
  dnsDomain: "cluster.local"
kubernetesVersion: "v1.12.0"
controlPlaneEndpoint: "10.100.0.1:6443"
apiServer:
  extraArgs:
    authorization-mode: "Node,RBAC"
  extraVolumes:
  - name: "some-volume"
    hostPath: "/etc/some-path"
    mountPath: "/etc/some-pod-path"
    readOnly: false
    pathType: File
  certSANs:
  - "10.100.1.1"
  - "ec2-10-100-0-1.compute-1.amazonaws.com"
  timeoutForControlPlane: 4m0s
controllerManager:
  extraArgs:
    "node-cidr-mask-size": "20"
  extraVolumes:
  - name: "some-volume"
    hostPath: "/etc/some-path"
    mountPath: "/etc/some-pod-path"
    readOnly: false
    pathType: File
scheduler:
  extraArgs:
    address: "10.100.0.1"
  extraVolumes:
  - name: "some-volume"
    hostPath: "/etc/some-path"
    mountPath: "/etc/some-pod-path"
    readOnly: false
    pathType: File
certificatesDir: "/etc/kubernetes/pki"
imageRepository: "k8s.gcr.io"
useHyperKubeImage: false
clusterName: "example-cluster"

test cases:

  1. test data as YAML:
elapsed non-strict 3.552124169s
elapsed strict 3.846053951s
  1. test data as JSON:
elapsed non-strict 822.631525ms
elapsed strict 3.799129945s

summary
looks like the main bottleneck (regardless of this PR) is the YAML to JSON converter.
in 2) we observe a big difference due to the fact that the YAML to JSON converter is always used to catch duplicate field errors.

@liggitt
Copy link
Member

liggitt commented Jan 7, 2019

There are three types of behavior we'll eventually want from decoders:

  • ignore duplicate/unknown fields (current behavior)
  • warn on duplicate/unknown fields (useful for surfacing potential issues while keeping API compatibility)
  • error on duplicate/unknown fields (what this PR partially adds)

All of those could be handled uniformly if the decoder returned structured duplicate/unknown field info separately, and the caller decided whether to ignore, warn, or error on it. I'm not sure adding factory APIs to construct alternate decoders with unstructured fail-fast errors for duplicate/unknown fields takes us in the right direction. Would like @smarterclayton's thoughts on the direction of that approach.

@neolit123
Copy link
Member Author

neolit123 commented Jan 7, 2019

There are three types of behavior we'll eventually want from decoders:

with the current usage of low level libraries, a warning state is not possible without multi-pass unmarshal.
if avoiding multi-pass is required, said libraries have to be replaced (and there is nothing to replace them with, really).

I'm not sure adding factory APIs to construct alternate decoders with unstructured fail-fast errors for duplicate/unknown fields takes us in the right direction

i can still see usage for the separate strict / non strict decoder and the space in between if one wants to handle warnings.

@smarterclayton
Copy link
Contributor

smarterclayton commented Jan 8, 2019 via email

@neolit123
Copy link
Member Author

ok, i'm going to leave this PR to the lifecycle bots.
if there is a need for the PR to merge before it rots we can rebase and update.

@neolit123
Copy link
Member Author

closing in favor of: #72883

@neolit123 neolit123 closed this Feb 16, 2019
component-base automation moved this from In progress to Done Feb 16, 2019
pohly added a commit to pohly/kubernetes that referenced this pull request Mar 4, 2019
It is useful to apply the storage testsuite also to "external" (=
out-of-tree) storage drivers. One way of doing that is setting up a
custom E2E test suite, but that's still quite a bit of work.

An easier alternative is to parameterize the Kubernetes e2e.test
binary at runtime so that it instantiates the testsuite for one or
more drivers. Some parameters have to be provided before starting the
test because they define configuration and capabilities of the driver
and its storage backend that cannot be discovered at runtime. This is
done by populating the DriverDefinition with the content of the file
that the new -storage.testdriver parameters points to.

The universal .yaml and .json decoder from Kubernetes is used. It's
flexible, but has some downsides:
- currently ignores unknown fields (see kubernetes#71589)
- poor error messages when fields have the wrong type

Storage drivers have to be installed in the test cluster before
starting e2e.test. Only tests involving dynamically provisioned
volumes are currently supported.
pohly added a commit to pohly/kubernetes that referenced this pull request Mar 6, 2019
It is useful to apply the storage testsuite also to "external" (=
out-of-tree) storage drivers. One way of doing that is setting up a
custom E2E test suite, but that's still quite a bit of work.

An easier alternative is to parameterize the Kubernetes e2e.test
binary at runtime so that it instantiates the testsuite for one or
more drivers. Some parameters have to be provided before starting the
test because they define configuration and capabilities of the driver
and its storage backend that cannot be discovered at runtime. This is
done by populating the DriverDefinition with the content of the file
that the new -storage.testdriver parameters points to.

The universal .yaml and .json decoder from Kubernetes is used. It's
flexible, but has some downsides:
- currently ignores unknown fields (see kubernetes#71589)
- poor error messages when fields have the wrong type

Storage drivers have to be installed in the test cluster before
starting e2e.test. Only tests involving dynamically provisioned
volumes are currently supported.
@nikhita nikhita added this to Done in component-base Apr 8, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/feature Categorizes issue or PR as related to a new feature. priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. release-note Denotes a PR that will be considered when it comes time to generate release notes. sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
Development

Successfully merging this pull request may close these issues.

None yet

7 participants