apimachinery: Add a strict YAML and JSON deserializer option #71589

neolit123 · 2018-11-30T07:06:18Z

What type of PR is this?
/kind feature

What this PR does / why we need it:
pkg/runtime: implement a strict YAML and JSON deserializer

Add a new universal decoder and universal deserializer.
This enables checks for unknown and duplicate fields in input YAML
and JSON data.

Example usage:

runtime.DecodeInto(MyCodecFactory.UniversalStrictDecoder(), content, into)
MyCodecFactory.UniversalStrictDeserializer().Decode(content, gvk, into)

The same CodecFactory can also return the non-strict variants.

A custom json-iterator API object is used to check for unknown fields.
For duplicate fields the sigs.k8s.io/yaml.YAMLToJSONStrict() function
is used.
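The duplicate-field check can be illustrated without the YAML converter. The following is a minimal sketch using only the standard library's `encoding/json` token stream, not the `sigs.k8s.io/yaml.YAMLToJSONStrict()` path this PR relies on; `findDuplicateKeys` and `walkValue` are hypothetical helper names introduced here for illustration.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
)

// findDuplicateKeys walks the JSON token stream and returns the path of every
// object key that appears more than once within the same object.
// Hypothetical helper: it is not part of apimachinery.
func findDuplicateKeys(data []byte) ([]string, error) {
	dec := json.NewDecoder(bytes.NewReader(data))
	var dups []string
	if err := walkValue(dec, "", &dups); err != nil {
		return nil, err
	}
	return dups, nil
}

// walkValue consumes exactly one JSON value (scalar, object, or array).
func walkValue(dec *json.Decoder, path string, dups *[]string) error {
	tok, err := dec.Token()
	if err != nil {
		return err
	}
	delim, ok := tok.(json.Delim)
	if !ok {
		return nil // scalar value, nothing to descend into
	}
	switch delim {
	case '{':
		seen := map[string]bool{} // keys seen so far in this object only
		for dec.More() {
			keyTok, err := dec.Token()
			if err != nil {
				return err
			}
			key := keyTok.(string) // object keys are always strings
			child := key
			if path != "" {
				child = path + "." + key
			}
			if seen[key] {
				*dups = append(*dups, child)
			}
			seen[key] = true
			if err := walkValue(dec, child, dups); err != nil {
				return err
			}
		}
	case '[':
		for i := 0; dec.More(); i++ {
			if err := walkValue(dec, fmt.Sprintf("%s[%d]", path, i), dups); err != nil {
				return err
			}
		}
	}
	_, err = dec.Token() // consume the closing '}' or ']'
	return err
}

func main() {
	input := []byte(`{"kind":"ClusterConfiguration","etcd":{"local":{"dataDir":"/a","dataDir":"/b"}},"kind":"Other"}`)
	dups, err := findDuplicateKeys(input)
	if err != nil {
		fmt.Println("invalid JSON:", err)
		return
	}
	fmt.Println("duplicate keys:", dups)
}
```

A plain `json.Unmarshal` accepts the input above without complaint, which is why a separate pass (or a strict converter) is needed to surface duplicates.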

Also add:

Unit tests in json_test.go.
New error types StrictDecoderError, DuplicateFieldError,
UnknownFieldError.

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
xref: kubernetes/community#2977

Special notes for your reviewer:
NONE

Does this PR introduce a user-facing change?:

apimachinery: Add a strict YAML and JSON deserializer option

/assign @liggitt @luxas
cc @BenTheElder
/priority important-longterm
/sig api-machinery

k8s-ci-robot · 2018-11-30T07:08:53Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: neolit123
To fully approve this pull request, please assign additional approvers.
We suggest the following additional approver: liggitt

If they are not already assigned, you can assign the PR to them by writing /assign @liggitt in a comment when ready.

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

staging/src/k8s.io/apimachinery/pkg/OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

luxas

Thanks @neolit123! Great to see this moving forward 👏

As a consumer, I want something in between in the meantime that warns the user, not completely exits the application on a failed strict decode, so I think we need to be able to check the error type, and also it'd be nice if we could get some more programmatic information from the error instead of "just" the string, e.g. the gvk metadata for the type decode that failed.

staging/src/k8s.io/apimachinery/pkg/runtime/serializer/json/json.go

staging/src/k8s.io/apimachinery/pkg/runtime/serializer/codec_factory.go

staging/src/k8s.io/apimachinery/pkg/runtime/serializer/json/json.go

luxas · 2018-11-30T10:58:07Z

@neolit123 edited the title/relnote a bit to make it clear that this is optional, and not enforced by default.

staging/src/k8s.io/apimachinery/pkg/runtime/serializer/codec_factory.go

staging/src/k8s.io/apimachinery/pkg/runtime/serializer/json/json.go

neolit123 · 2018-11-30T17:56:10Z

As a consumer, I want something in between in the meantime that warns the user, not completely exits the application on a failed strict decode, so I think we need to be able to check the error type, and also it'd be nice if we could get some more programmatic information from the error instead of "just" the string, e.g. the gvk metadata for the type decode that failed.

i will definitely include the GVK and properly type the errors.
the lack of proper typing in the backend libraries like encoding/json and yaml.v2 is concerning.
they just dump untyped string errors.

currently there are a couple of ways to do this:

1) create two decoders

if err := runtime.DecodeInto(MyCodecFactory.UniversalStrictDecoder(), content, into); err != nil {
    fmt.Println(err) // print the strict error as a warning and keep going
}
if err := runtime.DecodeInto(MyCodecFactory.UniversalDecoder(), content, into); err != nil {
    return err
}

2) check for error types and make the UniversalStrictDecoder not fail:

if err := runtime.DecodeInto(MyCodecFactory.UniversalStrictDecoder(), content, into); err != nil {
    switch err.(type) {
    // Go cases do not fall through, so both error types share one case
    case UnknownFieldError, DuplicateFieldError:
        fmt.Println(err) // warn only
    default:
        return err
    }
}

both ways would be computationally similar as unmarshaling has to be done twice.
my assumption is that we want to go for 2.
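To make the error-type approach concrete, here is a minimal sketch of typed errors that carry GVK metadata, plus a helper that downgrades strictness violations to warnings. The struct fields, the local `GroupVersionKind` stand-in, `errMalformed`, and `handleStrictError` are all assumptions made for illustration; they are not the types defined in this PR.

```go
package main

import (
	"errors"
	"fmt"
)

// GroupVersionKind is a local stand-in for apimachinery's schema.GroupVersionKind.
type GroupVersionKind struct{ Group, Version, Kind string }

// UnknownFieldError is a hypothetical shape for the error type named in the PR.
type UnknownFieldError struct {
	GVK   GroupVersionKind
	Field string
}

func (e *UnknownFieldError) Error() string {
	return fmt.Sprintf("%s/%s, Kind=%s: unknown field %q", e.GVK.Group, e.GVK.Version, e.GVK.Kind, e.Field)
}

// DuplicateFieldError is a hypothetical shape for the error type named in the PR.
type DuplicateFieldError struct {
	GVK   GroupVersionKind
	Field string
}

func (e *DuplicateFieldError) Error() string {
	return fmt.Sprintf("%s/%s, Kind=%s: duplicate field %q", e.GVK.Group, e.GVK.Version, e.GVK.Kind, e.Field)
}

// errMalformed stands in for any failure that is not a strictness violation.
var errMalformed = errors.New("malformed input")

// handleStrictError reports strictness violations through warn and returns nil
// for them; every other error is returned unchanged.
func handleStrictError(err error, warn func(string)) error {
	var unknown *UnknownFieldError
	var duplicate *DuplicateFieldError
	switch {
	case err == nil:
		return nil
	case errors.As(err, &unknown), errors.As(err, &duplicate):
		warn(err.Error())
		return nil
	default:
		return err
	}
}

func main() {
	gvk := GroupVersionKind{Group: "kubeadm.k8s.io", Version: "v1beta1", Kind: "ClusterConfiguration"}
	warn := func(msg string) { fmt.Println("warning:", msg) }

	// A strictness violation, even when wrapped with context, becomes a warning.
	wrapped := fmt.Errorf("strict decode failed: %w", &UnknownFieldError{GVK: gvk, Field: "clusterNam"})
	fmt.Println("returned:", handleStrictError(wrapped, warn))

	// Any other failure is still returned to the caller.
	fmt.Println("returned:", handleStrictError(errMalformed, warn))
}
```

Using `errors.As` rather than a bare type switch means the check still works if a caller wraps the error with additional context.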

luxas · 2018-11-30T18:21:14Z

Option 2 is what I prefer, it is way better.

both ways would be computationally similar as unmarshaling has to be done twice.

See my comment in #71589 (comment) to avoid doing unmarshal twice.

In any case, I think we need some Go benchmarks to know how much slower the strict decoding is, if we're ever gonna use it in places like the API server where milliseconds matter.

neolit123 · 2018-11-30T19:03:02Z

@luxas

See my comment in #71589 (comment) to avoid doing unmarshal twice.

the problem is that if we enable strict unmarshal for json-iterator it will fail and not throw a warning (finish the unmarshal).
so if we want to go for 2) we still have to unmarshal twice. :\

staging/src/k8s.io/apimachinery/pkg/runtime/serializer/json/json.go

neolit123 · 2018-12-10T23:16:58Z

@liggitt @luxas
updated, i think i addressed the comments.

this ends up being along the lines of:

strictDecode() <--- decide what to do with the error.
decode() <--- can continue to process regularly.

please TAL at this part:
https://github.com/kubernetes/kubernetes/pull/71589/files/79837dbe919c2c64d624cbc78ffd2436b3a3b54d#diff-f216f544515d2fd05d66d92c5f95a248
as we may have to remove the custom error types and only leave the StrictDecoderError one.
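The two-step flow described above can be sketched with the standard library alone. This is an analogy, not the PR's implementation: `encoding/json`'s `DisallowUnknownFields` stands in for the strict json-iterator configuration, and `Config`, `strictDecode`, `decode`, and `decodeWithWarnings` are hypothetical names.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
)

// Config is a small illustrative target type.
type Config struct {
	Kind        string `json:"kind"`
	ClusterName string `json:"clusterName"`
}

// strictDecode fails on fields that do not map onto the target type.
func strictDecode(data []byte, into interface{}) error {
	dec := json.NewDecoder(bytes.NewReader(data))
	dec.DisallowUnknownFields()
	return dec.Decode(into)
}

// decode is the regular, lenient path that silently drops unknown fields.
func decode(data []byte, into interface{}) error {
	return json.Unmarshal(data, into)
}

// decodeWithWarnings runs the strict pass first. If it fails, the lenient pass
// decides whether the input is usable at all: a lenient failure is a hard
// error, while a lenient success turns the strict failure into a warning.
func decodeWithWarnings(data []byte, into interface{}) (warning, err error) {
	strictErr := strictDecode(data, into)
	if strictErr == nil {
		return nil, nil
	}
	if err := decode(data, into); err != nil {
		return nil, err
	}
	return strictErr, nil
}

func main() {
	var cfg Config
	warning, err := decodeWithWarnings([]byte(`{"kind":"ClusterConfiguration","clusterNam":"typo"}`), &cfg)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	if warning != nil {
		fmt.Println("warning:", warning)
	}
	fmt.Printf("decoded: %+v\n", cfg)
}
```

As noted in the thread, this shape pays for two unmarshal passes whenever the strict pass fails.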

neolit123 · 2018-12-15T18:21:46Z

added unit tests for valid input to the strict decoders.
plus some refactor/optimization.

liggitt · 2019-01-04T17:28:24Z

cc @smarterclayton for strict decoding mechanism

luxas · 2019-01-04T17:38:12Z

/assign @smarterclayton

neolit123 · 2019-01-05T09:56:26Z

i will update the PR with the comments by @luxas on Monday.


neolit123 · 2019-01-07T12:00:58Z

i will update the PR with the comments by @luxas on Monday.

updated.

neolit123 · 2019-01-07T12:29:01Z

some benchmarks

pseudo test code:

start := time.Now()
for i := 0; i < 10000; i++ {
    runtime.DecodeInto(myScheme.Codecs.UniversalDecoder(), fileContent, targetObject)
}
elapsed := time.Since(start)
fmt.Printf("elapsed non-strict %v\n", elapsed)
start = time.Now()
for i := 0; i < 10000; i++ {
    runtime.DecodeInto(myScheme.Codecs.UniversalStrictDecoder(), fileContent, targetObject)
}
elapsed = time.Since(start)
fmt.Printf("elapsed strict %v\n", elapsed)

test data:

apiVersion: kubeadm.k8s.io/v1beta1
kind: ClusterConfiguration
etcd:
  local:
    imageRepository: "k8s.gcr.io"
    imageTag: "3.2.24"
    dataDir: "/var/lib/etcd"
    extraArgs:
      listen-client-urls: "http://10.100.0.1:2379"
    serverCertSANs:
    -  "ec2-10-100-0-1.compute-1.amazonaws.com"
    peerCertSANs:
    - "10.100.0.1"
networking:
  serviceSubnet: "10.96.0.0/12"
  podSubnet: "10.100.0.1/24"
  dnsDomain: "cluster.local"
kubernetesVersion: "v1.12.0"
controlPlaneEndpoint: "10.100.0.1:6443"
apiServer:
  extraArgs:
    authorization-mode: "Node,RBAC"
  extraVolumes:
  - name: "some-volume"
    hostPath: "/etc/some-path"
    mountPath: "/etc/some-pod-path"
    readOnly: false
    pathType: File
  certSANs:
  - "10.100.1.1"
  - "ec2-10-100-0-1.compute-1.amazonaws.com"
  timeoutForControlPlane: 4m0s
controllerManager:
  extraArgs:
    "node-cidr-mask-size": "20"
  extraVolumes:
  - name: "some-volume"
    hostPath: "/etc/some-path"
    mountPath: "/etc/some-pod-path"
    readOnly: false
    pathType: File
scheduler:
  extraArgs:
    address: "10.100.0.1"
  extraVolumes:
  - name: "some-volume"
    hostPath: "/etc/some-path"
    mountPath: "/etc/some-pod-path"
    readOnly: false
    pathType: File
certificatesDir: "/etc/kubernetes/pki"
imageRepository: "k8s.gcr.io"
useHyperKubeImage: false
clusterName: "example-cluster"

test cases:

1) test data as YAML:

elapsed non-strict 3.552124169s
elapsed strict 3.846053951s

2) test data as JSON:

elapsed non-strict 822.631525ms
elapsed strict 3.799129945s

summary
looks like the main bottleneck (regardless of this PR) is the YAML to JSON converter.
in 2) we observe a big difference because the strict decoder always runs the YAML to JSON converter to catch duplicate field errors, even when the input is already JSON.
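For reference, the timings above work out to roughly 8% overhead for YAML input (3.85s vs 3.55s) and about a 4.6x slowdown for JSON input (3.80s vs 0.82s). A timing loop like the one above can also be written with Go's built-in benchmark support, which picks the iteration count automatically. The sketch below is an assumption-laden stand-in: it uses the standard library's `encoding/json` rather than the apimachinery decoders, so its numbers say nothing about this PR; `Config`, `decodeLenient`, `decodeStrict`, and `benchmarkDecode` are hypothetical names.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"testing"
)

// Config is a small illustrative target type.
type Config struct {
	Kind              string `json:"kind"`
	ClusterName       string `json:"clusterName"`
	KubernetesVersion string `json:"kubernetesVersion"`
}

var input = []byte(`{"kind":"ClusterConfiguration","clusterName":"example-cluster","kubernetesVersion":"v1.12.0"}`)

// decodeLenient ignores unknown fields.
func decodeLenient(data []byte, into interface{}) error {
	return json.Unmarshal(data, into)
}

// decodeStrict rejects unknown fields.
func decodeStrict(data []byte, into interface{}) error {
	dec := json.NewDecoder(bytes.NewReader(data))
	dec.DisallowUnknownFields()
	return dec.Decode(into)
}

// benchmarkDecode measures one decode function using testing.Benchmark,
// which works outside of "go test" as well.
func benchmarkDecode(decode func([]byte, interface{}) error) testing.BenchmarkResult {
	return testing.Benchmark(func(b *testing.B) {
		for i := 0; i < b.N; i++ {
			var c Config
			if err := decode(input, &c); err != nil {
				b.Fatal(err)
			}
		}
	})
}

func main() {
	fmt.Println("non-strict:", benchmarkDecode(decodeLenient))
	fmt.Println("strict:    ", benchmarkDecode(decodeStrict))
}
```

In a real package the same loops would normally live in `BenchmarkXxx` functions in a `_test.go` file and be run with `go test -bench`.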

liggitt · 2019-01-07T17:36:38Z

There are three types of behavior we'll eventually want from decoders:

- ignore duplicate/unknown fields (current behavior)
- warn on duplicate/unknown fields (useful for surfacing potential issues while keeping API compatibility)
- error on duplicate/unknown fields (what this PR partially adds)

All of those could be handled uniformly if the decoder returned structured duplicate/unknown field info separately, and the caller decided whether to ignore, warn, or error on it. I'm not sure adding factory APIs to construct alternate decoders with unstructured fail-fast errors for duplicate/unknown fields takes us in the right direction. Would like @smarterclayton's thoughts on the direction of that approach.
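One way to picture the "return structured info, let the caller decide" idea is the sketch below. It is deliberately simplified and rests on several assumptions: it uses the standard library only, inspects top-level keys only, needs two unmarshal passes, and expects a pointer to a struct. `Policy`, `decodeReportingUnknown`, and `applyPolicy` are hypothetical names, not proposed apimachinery APIs.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
	"reflect"
	"sort"
	"strings"
)

// Policy is the caller's choice of what to do with unknown fields.
type Policy int

const (
	PolicyIgnore Policy = iota
	PolicyWarn
	PolicyError
)

// Config is a small illustrative target type.
type Config struct {
	Kind        string `json:"kind"`
	ClusterName string `json:"clusterName"`
}

// decodeReportingUnknown decodes data into the struct that into points to and
// separately returns the top-level keys that match no struct field.
// Simplifications: nested objects and the "-" tag are not handled.
func decodeReportingUnknown(data []byte, into interface{}) ([]string, error) {
	if err := json.Unmarshal(data, into); err != nil {
		return nil, err
	}
	var raw map[string]json.RawMessage
	if err := json.Unmarshal(data, &raw); err != nil {
		return nil, err
	}
	known := map[string]bool{}
	t := reflect.TypeOf(into).Elem()
	for i := 0; i < t.NumField(); i++ {
		name := strings.Split(t.Field(i).Tag.Get("json"), ",")[0]
		if name == "" {
			name = t.Field(i).Name
		}
		// encoding/json matches field names case-insensitively.
		known[strings.ToLower(name)] = true
	}
	var unknown []string
	for key := range raw {
		if !known[strings.ToLower(key)] {
			unknown = append(unknown, key)
		}
	}
	sort.Strings(unknown) // map iteration order is random
	return unknown, nil
}

// applyPolicy turns the structured report into ignore, warn, or error behavior.
func applyPolicy(policy Policy, unknown []string, warn func(string)) error {
	if len(unknown) == 0 || policy == PolicyIgnore {
		return nil
	}
	msg := "unknown fields: " + strings.Join(unknown, ", ")
	if policy == PolicyWarn {
		warn(msg)
		return nil
	}
	return errors.New(msg)
}

func main() {
	var cfg Config
	unknown, err := decodeReportingUnknown([]byte(`{"kind":"ClusterConfiguration","clusterNam":"typo"}`), &cfg)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	warn := func(msg string) { fmt.Println("warning:", msg) }
	fmt.Println("warn policy returned:", applyPolicy(PolicyWarn, unknown, warn))
	fmt.Println("error policy returned:", applyPolicy(PolicyError, unknown, warn))
}
```

The point of the design is that the decoder has one behavior and the three outcomes differ only in a small policy function owned by the caller.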

neolit123 · 2019-01-07T18:46:00Z

There are three types of behavior we'll eventually want from decoders:

with the current usage of low level libraries, a warning state is not possible without multi-pass unmarshal.
if avoiding multi-pass is required, said libraries have to be replaced (and there is nothing to replace them with, really).

I'm not sure adding factory APIs to construct alternate decoders with unstructured fail-fast errors for duplicate/unknown fields takes us in the right direction

i can still see usage for the separate strict / non strict decoder and the space in between if one wants to handle warnings.

smarterclayton · 2019-01-08T06:28:25Z

Yeah, I am really concerned with adding a new path to the factory that doesn’t take those into account. We want to reduce the complexity of decoding, not increase it. There are three rough decoding angles at play:

1. An apiserver needs to decode into a target version, get an accounting of everything it does not recognize, and then make a decision based on other API input whether to warn, error, or continue (and definitely needs structured errors a la the invalid structure which identifies field names).
2. A client talking to the apiserver needs the choice of whether to warn or ignore, but handles it differently (based on the caller's needs for the use case).
3. Disk / stable storage reading code needs to perform minimal transformation of the input where possible and delegate to the server (unstructured / kubectl), or it needs to have strictly defined behavior (reading config from disk or loading data from etcd).

We have talked about dramatically simplifying the serialization stack for 1 and 2, and the first part of 3 (likely we would either remove or simplify codec and the factory). The second part of 3 would probably also go through some simplification to make conversion explicit. It might be best if we talk through what the changes above might mean before we grow the factory.

neolit123 · 2019-01-08T10:34:49Z

ok, i'm going to leave this PR to the lifecycle bots.
if there is a need for the PR to merge before it rots we can rebase and update.

neolit123 · 2019-02-16T15:35:50Z

closing in favor of: #72883

It is useful to apply the storage testsuite also to "external" (= out-of-tree) storage drivers. One way of doing that is setting up a custom E2E test suite, but that's still quite a bit of work.

An easier alternative is to parameterize the Kubernetes e2e.test binary at runtime so that it instantiates the testsuite for one or more drivers. Some parameters have to be provided before starting the test because they define configuration and capabilities of the driver and its storage backend that cannot be discovered at runtime. This is done by populating the DriverDefinition with the content of the file that the new -storage.testdriver parameter points to.

The universal .yaml and .json decoder from Kubernetes is used. It's flexible, but has some downsides:
- currently ignores unknown fields (see kubernetes#71589)
- poor error messages when fields have the wrong type

Storage drivers have to be installed in the test cluster before starting e2e.test. Only tests involving dynamically provisioned volumes are currently supported.

k8s-ci-robot added the release-note label Nov 30, 2018

k8s-ci-robot assigned liggitt and luxas Nov 30, 2018

k8s-ci-robot requested review from caesarxuchao and ncdc November 30, 2018 07:08

neolit123 mentioned this pull request Nov 30, 2018

KEP: Create a k8s.io/component repo kubernetes/community#2977

Merged

luxas approved these changes Nov 30, 2018


luxas added this to the v1.14 milestone Nov 30, 2018

luxas changed the title ~~pkg/runtime: implement a strict YAML and JSON deserializer~~ apimachinery: Add a strict YAML and JSON deserializer option Nov 30, 2018