How to associate a schema to a YAML data? #29

Closed
nowox opened this issue Nov 29, 2015 · 9 comments

nowox (Contributor) commented Nov 29, 2015

In XML you can specify your XSD with the xsi:schemaLocation attribute. The YAML spec lacks this kind of information.

So in my opinion two options are possible:

  1. Using a global tag:
%YAML 1.2
%SCHEMA foo.schema.yml

---
some: data
...
  2. Having a special key in the YAML description:
%YAML 1.2

---
some: data
<schema>:
    location: 'foo.schema.yml'
...

The second method allows embedding the schema inside the YAML description, which could be nice. It also allows a schema to be applied to just one particular node:

%YAML 1.2

---
foo:
   bar:
      - list
<schema>:
    foo:
      bar: 
         <location>: 'foo.schema.yml'   
...

I don't know which option would be the best...

Any idea?

Grokzen (Owner) commented Nov 29, 2015

The only one that I might consider implementing is the first suggestion. The main reason is that I do not like mixing data and schema in the same file; I think it is very messy. The best thing about the first option is that, since it is a pointer to a resource, it can be implemented in a way that allows the resource to live in any kind of location. It would make sense to implement the standard patterns such as file://, git:// and ftp://. It would also let users of the library implement their own handlers in case a default one does not exist yet in the lib.

The bad thing is that the Python YAML parser does not support retrieving the global tags out of the box after the data has been loaded. On the other hand, it is very easy to just parse the file, take the first %SCHEMA pattern and use it.
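
A minimal sketch of that "just parse the file" idea, using only the Python standard library. The function name `find_schema_directive` is hypothetical and not part of pykwalify or PyYAML; the sketch assumes the directive is written as `%SCHEMA <location>` at the start of a line, before the first `---` marker.

```python
import io
from typing import Optional, TextIO


def find_schema_directive(stream: TextIO) -> Optional[str]:
    """Return the argument of the first %SCHEMA directive, or None.

    Directives may only appear before the document-start marker ('---'),
    so scanning stops as soon as that marker is seen.
    """
    for raw_line in stream:
        line = raw_line.rstrip()
        if line == "---" or line.startswith("--- "):
            break  # the document body starts here; no directives follow
        if line.startswith("%SCHEMA"):
            parts = line.split(None, 1)
            # Guard against look-alike names such as %SCHEMAX.
            if parts[0] == "%SCHEMA" and len(parts) == 2:
                return parts[1].strip()
    return None


# Hypothetical input mirroring the first option proposed in this thread.
example = io.StringIO(
    "%YAML 1.2\n"
    "%SCHEMA foo.schema.yml\n"
    "\n"
    "---\n"
    "some: data\n"
    "...\n"
)
schema_location = find_schema_directive(example)
print(schema_location)
```

The location string is returned as-is; deciding how to resolve it (local path, URI, and so on) is left to the caller.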

nowox (Contributor, Author) commented Nov 29, 2015

I agree with you. I also prefer the first option, although %SCHEMA may not be recognized by all parsers. That said, it is not a forbidden tag, so I guess we are free to use it.

From the YAML specs the two recognized % commands are %YAML and %TAG. So we could also use something like this:

%TAG !schema! file:///usr/local/share/foo.yml

Last but not least, using the word %SCHEMA is perhaps a bit pretentious, in that it implies PyKwalify is the default YAML validator and will become the standard one. I see this as a very good thing, but some will not, I guess.

Next step would be to add this validation support to the PyYAML module...

Grokzen (Owner) commented Nov 29, 2015

I am still more in favor of the initial suggestion of

%SCHEMA file:///usr/foo.yaml

because it is cleaner and easier to use. At least PyYAML does not throw up if I add %SCHEMA at the top of the file, but at the same time I understand that the spec is more in favor of %TAG !schema! .... On the other hand, I did find this in the spec:

Directives are instructions to the YAML processor. This specification defines two directives, “YAML” and “TAG”, and reserves all other directives for future use. There is no way to define private directives. This is intentional.

So any tag can be implemented, and the spec says that it should be compatible, so if any other client is not compatible with a custom tag, then that client is broken, not pykwalify's use of it :]

nowox (Contributor, Author) commented Nov 30, 2015

So let's choose %SCHEMA and make a new standard for %YAML 1.3 that the whole world will use!

The next step will be to extend PyYAML to support PyKwalify...

Grokzen (Owner) commented Nov 30, 2015

And after that, world domination :]

[image: worlddomination]

But %SCHEMA ... it will be. I think that initially only file:// will be supported out of the box, but a plugin type of system shall also be added so that it can be extended to support other formats in the future.
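
One way such a plugin system could look, as a rough sketch only: a registry that maps URI schemes to loader functions, with a file:// loader registered by default. The names `register_loader`, `load_schema` and `load_file_uri` are hypothetical and do not exist in pykwalify.

```python
import tempfile
from pathlib import Path
from typing import Callable, Dict
from urllib.parse import urlparse
from urllib.request import url2pathname

# A loader takes the full URI and returns the schema text.
SchemaLoader = Callable[[str], str]

_LOADERS: Dict[str, SchemaLoader] = {}


def register_loader(scheme: str, loader: SchemaLoader) -> None:
    """Register (or replace) the loader used for one URI scheme."""
    _LOADERS[scheme.lower()] = loader


def load_file_uri(uri: str) -> str:
    """Default loader: read a schema from a local file:// URI."""
    path = url2pathname(urlparse(uri).path)
    return Path(path).read_text(encoding="utf-8")


def load_schema(uri: str) -> str:
    """Dispatch to the loader registered for the URI's scheme."""
    scheme = urlparse(uri).scheme.lower()
    try:
        loader = _LOADERS[scheme]
    except KeyError:
        raise ValueError(f"no schema loader registered for scheme {scheme!r}") from None
    return loader(uri)


# Only file:// is supported out of the box; other schemes are opt-in.
register_loader("file", load_file_uri)

# Demonstration with a throwaway schema file.
with tempfile.TemporaryDirectory() as tmp:
    schema_path = Path(tmp) / "foo.schema.yml"
    schema_path.write_text("type: map\n", encoding="utf-8")
    loaded = load_schema(schema_path.as_uri())

print(loaded)
```

A user who wants git:// or ftp:// support would call `register_loader` with their own function, so the core never fetches from a scheme nobody asked for.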

@Grokzen Grokzen modified the milestone: 2.0.0 Mar 22, 2016

nowox (Contributor, Author) commented Sep 7, 2016

I would like to work on this implementation, but I need more input. What do we decide? Is it worth informing the maintainers of the YAML standard?

flyx commented Sep 28, 2016

xsi:schemaLocation always has been kind of a dirty hack. Normally, you specify the schema URI in XML with xmlns and then have the application that takes the XML as input provide the schema file.

This is even more true with YAML: You specify the type of the top element and all other elements that will not be resolved automatically to the correct type as a tag:

%YAML 1.2
--- !my:data:schema
some: data
...

Since YAML is designed to be deserialized into values native to the implementation language, it makes little sense to define a schema language for it. The native types the loader transforms the YAML into are the schema. In PyYAML, for example, you can derive from YAMLObject in order to create a schema for your YAML file.

Now I am aware that this project has created a schema language nonetheless. Which is fine. But I think it would be a big mistake to implement something akin to xsi:schemaLocation in YAML because YAML is designed to be portable, and that would be greatly harmed if you start to reference a local file within it (how would you send that file along in some kind of network stream? YAML is designed to work well within streaming environments, which a schema location directive would totally destroy).

However, is an explicit tag at the root element not enough for a validator to search for the relevant schema? Given the YAML above, the loader could say „oh, I know this tag, I have a schema for it“ and then validate against that schema. It is even possible to only parse parts of the YAML against a schema:

%YAML 1.2
---
some: data
key with typed value: !my:data:schema
    some: more
    data: here
...

According to the YAML spec, the top value (a mapping) will implicitly get the !!map tag, which allows all elements as content. Then, the loader sees a tag on one of the values in it, and can validate that subtree according to some schema.

The only difference to the approaches discussed here is that there needs to be a mapping from tags to schema files outside the YAML. Which I think is fine; if you use XML within an application, you also ship the schema with the application and do not search for it in xsi:schemaLocation. That is little more than a hint for XML editors, but most of them are also able to define a URI -> schema file mapping in their configuration. So xsi:schemaLocation is a superfluous alien meta information in your XML file that actually does not belong there and harms portability. I for one would like to avoid carrying that mistake over to YAML.
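
The tag-to-schema mapping described above could be sketched roughly as follows. This is a simplification under stated assumptions: the registry `TAG_TO_SCHEMA`, its schema path, and both function names are hypothetical, and the root tag is found with a regular expression on the `---` line instead of a real YAML parser. A fuller version would read the tag from the parser's node tree (for instance, the nodes returned by PyYAML's `yaml.compose` carry a `tag` attribute), which would also cover tags on nested values.

```python
import re
from typing import Dict, Optional

# Hypothetical application-side registry: tag -> schema file shipped with the app.
TAG_TO_SCHEMA: Dict[str, str] = {
    "!my:data:schema": "schemas/my_data.schema.yml",
}

# A document-start marker followed by an explicit tag, e.g. "--- !my:data:schema".
_ROOT_TAG = re.compile(r"^---[ \t]+(!\S*)", re.MULTILINE)


def root_tag(yaml_text: str) -> Optional[str]:
    """Return the explicit tag on the first document's root node, if any."""
    match = _ROOT_TAG.search(yaml_text)
    return match.group(1) if match else None


def schema_for(yaml_text: str) -> Optional[str]:
    """Look up the schema registered for the document's root tag."""
    tag = root_tag(yaml_text)
    return TAG_TO_SCHEMA.get(tag) if tag is not None else None


document = "%YAML 1.2\n--- !my:data:schema\nsome: data\n...\n"
print(schema_for(document))
```

The YAML itself stays free of file locations; only the application knows where its schemas live.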

Grokzen (Owner) commented Dec 21, 2016

@flyx @nowox I think I will postpone implementing this feature for now.

@flyx I agree with you that taking this over to PyYAML/ruamel.yaml or even the YAML org itself is not the best idea. This validation language is not the "one and only and best YAML validation language", and I do not intend or have any motivation to bring it to that level.

Another thing that has been bothering me about this feature is the security around it all. Say that we implemented fetching from an HTTP or git source, for example; then we would have to sandbox it very well, so that it cannot escape and do bad things to your system by making it download something that you do not intend or that can cause harm to your system.

This is kind of a problem for extensions as well, but the main difference there is that you cannot, through the data or schema definition, tell pykwalify to download something from a source and then execute it. I might even back off on the entire feature just based on these security implications. It is then better that you implement this kind of feature outside of pykwalify and keep pykwalify as is, where you must specify a schema and data explicitly.
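
For anyone implementing this outside of pykwalify, a cautious wrapper might refuse everything except local files under a directory the caller trusts, and only then hand the resolved path to the validator explicitly. The sketch below is one possible shape under those assumptions; `resolve_trusted_schema` is a hypothetical helper, not an existing API, and it is not a substitute for real sandboxing.

```python
import tempfile
from pathlib import Path
from urllib.parse import urlparse
from urllib.request import url2pathname


def resolve_trusted_schema(uri: str, trusted_root: Path) -> Path:
    """Map a file:// schema location to a path inside trusted_root.

    Raises ValueError for any other scheme, and for paths that escape
    the trusted directory (for example via '..').
    """
    parsed = urlparse(uri)
    if parsed.scheme != "file":
        raise ValueError(f"refusing non-file schema location: {uri!r}")
    candidate = Path(url2pathname(parsed.path)).resolve()
    root = trusted_root.resolve()
    if candidate != root and root not in candidate.parents:
        raise ValueError(f"schema location escapes {root}: {uri!r}")
    return candidate


# Demonstration with a throwaway directory standing in for the trusted root.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "schemas"
    root.mkdir()
    inside = (root / "foo.schema.yml").as_uri()
    outside = (root / ".." / "evil.yml").as_uri()

    accepted_name = resolve_trusted_schema(inside, root).name
    try:
        resolve_trusted_schema(outside, root)
        escaped = True
    except ValueError:
        escaped = False

print(accepted_name, escaped)
```

Nothing is downloaded or executed here; the data file can only select among schemas that were already placed in the trusted directory.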

Grokzen (Owner) commented Oct 19, 2019

I do not plan to implement any of this. The feature seems too far out to really be useful right now.

@Grokzen Grokzen closed this as completed Oct 19, 2019