-
Notifications
You must be signed in to change notification settings - Fork 71
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add schema_format to aid parsers #51
Conversation
validation/schema.json
Outdated
"schema_format", | ||
"schema_version", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@oliverchang Knowing which schema to validate against is critical for parsers and something I think we should require for all OSV objects. This will be immensely helpful as we add new fields and data types, such as new reference types or relationship data to the OSV format.
(The fact that this will break existing OSVs is an example of why I feel we should require this - as we parse through old data it will be much easier to update it to newer formats if we know what version the data was originally created against.)
We've been considering adding this as a database specific field value. I agree it would be helpful to denote in the file what the format is. At the moment consuming systems just have to assume this based on info that the publishing system may provide, but it's possible for a publisher to distribute more than one file format, possibly even more than one JSON file format. |
Thanks for raising this PR! I have some initial concerns here:
Perhaps this would be better kept out-of-band, or introduced as some kind of protocol. i.e. either as part of the filename/filepath ("ID.osv.json") or through some other mechanism for publishers to state that they are using the OSV format. @rsc @chrisbloom7 thoughts? |
If a system publishes more than one format in the same stream then the processing systems need a way to interrogate each file to decide how to handle it. In some cases this could be handled by using custom headers, but for systems that don't transmit the files directly (i.e. a repo full of JSON files that processing systems fetch), then if the type is supposed to be inferred from the file name then each publishing system must make their format known and each processing system must account for differences in naming conventions. Adding it to the schema means a processing system need only interrogate the parsed JSON to see if an expected format node is present, and if that node is part of the schema then that makes it easy for consumers to ask. Granted there is no standard here. There was a discussion about this in the JSON schema project and the decision at the time was that they wouldn't standardize on anything but they might document a recommended practice of using a root node such as |
For use-cases like the GSD where we are potentially embedding multiple formats within the same json file as a part of different namespaces, it's critical for the parser to be able to interrogate each namespace for known formats to aid in display, validation, and translation to other formats. We can't rely on anything outside of the json object itself for this purpose (e.g. filename or headers). Using |
For embedding formats, would it work at all to do something like: {
"NAMESPACE-1": {
"format": "OSV",
"entry": {
},
},
"NAMESPACE-2": {
"format": "CVE-5",
"entry": {
},
}
} This way it's both consistent and clear what format each embedded format is (without having to parse the format itself).
For distribution, I do think we need to standardize how OSV's are distributed -- today we just have a list of git repos in a README.md file, but if there was a consistent way to discover/determine if a source is producing OSV entries (e.g. through network headers or filename conventions), this may also solve the identification problem. |
Hmm, that's definitely worth considering, and could help in scenarios where a format doesn't expose an easy way to parse identification itself (including scenarios where it's XML or some other format encoded as base64). @joshbressers @kurtseifried any input on the format/version declarations and pros/cons of having it at the different levels? |
Friendly ping on this :) |
Regarding format/version declarations I think we should support this because we support namespaces and the namespaces can do whatever they want. And not all formats (e.g. OSV) support a simple declaritive way to determine what they are (e.g. CVE JSON does). I think we need to be flexible though, e.g. I think the bigger players MUST label their data properly (GSD, OSV) but if Wendy wants to throw some stuff into her namespace she SHOULD include some identifying info/schema but it's not required (since she may not have one yet). Obviously it would also give preference to the namespaces that do declare what they are cleanly, e.g.: "$schema": "https://json-schema.org/draft/2020-12/schema" Links: So my thought is this: We ask for a "$schema" tag, we apply a $schema tag when we definitely know what it is (e.g. the robot that pulls CVE/NVD data in knows it is CVE JSON format) and we suggest/educate people where possible, but ultimately it's ok if there is no schema tag. The only downside I see here is if the wrong declaration is used (e.g. type or format), so we should be careful when we apply the tag to data (we should make sure it's correct). And it means $schema becomes a reserved keyword for keys (but not for values), which I think is an acceptable tradeoff. |
What does it mean that we support namespaces? That's surprising to me. |
we == GSD. Since we consume data from multiple sources and in multiple formats we support namespaces so e.g. nvd.nist.org can have whatever they want/need, and a vendor can have their own data (e.g. different CVSS scores) and so on. |
I like the idea of including a schema key when possible. It would be redundant for OSV to include a schema for all of their entries. This would make the most sense as an optional field. |
Especially if OSV wants to be included with other data sets, this makes it far simpler. |
Signed-off-by: Josh Buker <crypto@joshbuker.com>
992f101
to
adbcb3e
Compare
Signed-off-by: Josh Buker <crypto@joshbuker.com>
adbcb3e
to
847c254
Compare
Hey @oliverchang, updated this PR per some of your feedback. Mind taking another look? |
Sorry for the silence on this! I was out on a vacation for a few weeks and just got back! I am a little hesitant to move this one forward, unless the other core committers (@chrisbloom7 and @rsc) express their support for this. The reasons being that for the two main use cases that have been brought up:
|
@oliverchang I noted earlier that GitHub had been considering adding this as a database specific field and that I was in support of it. Your point about this particular solution only addressing the question for a single format (OSV) rather than all potential formats for that producer has made me rethink this. If a single source provides multiple versions of vulnerability data which are indistinguishable from each other based on URI, request info, or content namespacing then adding a field to the OSV schema only solves the problem for one format and plasters over the problem. It seems reasonable to me that if a user is establishing a way to query a given source and if said source has a documented protocol defining the structure of their data, then the user already has a way of distinguishing one format from another without need for an internal identifier. |
Adds a schema_format field so that parsers can positively identify if an object is trying to represent an OSV object.
I think having this somewhere in the data itself will be critical in conjunction with the schema_version field, but I don't really have any opinions on how best to format that. I went with what I felt was a simple implementation that follows the naming convention of the version field. @oliverchang what are your thoughts on how to incorporate that into the schema?