Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add schema_format to aid parsers #51

Closed
wants to merge 1 commit into from

Conversation

joshbuker
Copy link
Contributor

Adds a schema_format field so that parsers can positively identify if an object is trying to represent an OSV object.

I think having this somewhere in the data itself will be critical in conjunction with the schema_version field, but I don't really have any opinions on how best to format that. I went with what I felt was a simple implementation that follows the naming convention of the version field. @oliverchang what are your thoughts on how to incorporate that into the schema?

Comment on lines 306 to 307
"schema_format",
"schema_version",
Copy link
Contributor Author

@joshbuker joshbuker Jun 25, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@oliverchang Knowing which schema to validate against is critical for parsers and something I think we should require for all OSV objects. This will be immensely helpful as we add new fields and data types, such as new reference types or relationship data to the OSV format.

(The fact that this will break existing OSVs is an example of why I feel we should require this - as we parse through old data it will be much easier to update it to newer formats if we know what version the data was originally created against.)

@chrisbloom7
Copy link
Collaborator

We've been considering adding this as a database specific field value. I agree it would be helpful to denote in the file what the format is. At the moment consuming systems just have to assume this based on info that the publishing system may provide, but it's possible for a publisher to distribute more than one file format, possibly even more than one JSON file format.

@oliverchang
Copy link
Contributor

oliverchang commented Jun 28, 2022

Thanks for raising this PR!

I have some initial concerns here:

  • This is largely redundant information for most people consuming this format.
  • There is no standardized way to do this for all the different schemas out there. i.e. we may use "schema_format" here, but other formats will use different conventions and field names. What ends up happening is there will need to be customized parsing/detection logic for each schema out there anyway.

Perhaps this would be better kept out-of-band, or introduced as some kind of protocol. i.e. either as part of the filename/filepath ("ID.osv.json") or through some other mechanism for publishers to state that they are using the OSV format.

@rsc @chrisbloom7 thoughts?

@chrisbloom7
Copy link
Collaborator

chrisbloom7 commented Jun 28, 2022

If a system publishes more than one format in the same stream then the processing systems need a way to interrogate each file to decide how to handle it. In some cases this could be handled by using custom headers, but for systems that don't transmit the files directly (i.e. a repo full of JSON files that processing systems fetch), then if the type is supposed to be inferred from the file name then each publishing system must make their format known and each processing system must account for differences in naming conventions. Adding it to the schema means a processing system need only interrogate the parsed JSON to see if an expected format node is present, and if that node is part of the schema then that makes it easy for consumers to ask. Granted there is no standard here. There was a discussion about this in the JSON schema project and the decision at the time was that they wouldn't standardize on anything but they might document a recommended practice of using a root node such as "http://json-schema.org/#$schema": "..." for JSON objects with the value pointing back to the schema it is intended to be validated against. (Which then possibly relates to #52)

@joshbuker
Copy link
Contributor Author

joshbuker commented Jun 28, 2022

For use-cases like the GSD where we are potentially embedding multiple formats within the same json file as a part of different namespaces, it's critical for the parser to be able to interrogate each namespace for known formats to aid in display, validation, and translation to other formats. We can't rely on anything outside of the json object itself for this purpose (e.g. filename or headers).

Using "$id": "<URI for specific version of OSV here>" could be one approach, although having the format (OSV, CVE, etc) and version directly available for inspection would be simpler to parse in most scenarios. Having both would ultimately be useful.

The CVE json format already does this, for example: v4, v5

@joshbuker joshbuker marked this pull request as ready for review June 28, 2022 15:52
@oliverchang
Copy link
Contributor

oliverchang commented Jun 30, 2022

For use-cases like the GSD where we are potentially embedding multiple formats within the same json file as a part of different namespaces, it's critical for the parser to be able to interrogate each namespace for known formats to aid in display, validation, and translation to other formats. We can't rely on anything outside of the json object itself for this purpose (e.g. filename or headers).

For embedding formats, would it work at all to do something like:

{
  "NAMESPACE-1": {
    "format": "OSV",
    "entry": {
     },
  },
  "NAMESPACE-2": {
    "format": "CVE-5",
    "entry": {
    },
  }
}

This way it's both consistent and clear what format each embedded format is (without having to parse the format itself).

Re distribution

For distribution, I do think we need to standardize how OSV's are distributed -- today we just have a list of git repos in a README.md file, but if there was a consistent way to discover/determine if a source is producing OSV entries (e.g. through network headers or filename conventions), this may also solve the identification problem.

@joshbuker
Copy link
Contributor Author

Hmm, that's definitely worth considering, and could help in scenarios where a format doesn't expose an easy way to parse identification itself (including scenarios where it's XML or some other format encoded as base64).

@joshbressers @kurtseifried any input on the format/version declarations and pros/cons of having it at the different levels?

@oliverchang
Copy link
Contributor

Hmm, that's definitely worth considering, and could help in scenarios where a format doesn't expose an easy way to parse identification itself (including scenarios where it's XML or some other format encoded as base64).

@joshbressers @kurtseifried any input on the format/version declarations and pros/cons of having it at the different levels?

Friendly ping on this :)

@kurtseifried
Copy link
Contributor

Regarding format/version declarations I think we should support this because we support namespaces and the namespaces can do whatever they want. And not all formats (e.g. OSV) support a simple declaritive way to determine what they are (e.g. CVE JSON does).

I think we need to be flexible though, e.g. I think the bigger players MUST label their data properly (GSD, OSV) but if Wendy wants to throw some stuff into her namespace she SHOULD include some identifying info/schema but it's not required (since she may not have one yet).

Obviously it would also give preference to the namespaces that do declare what they are cleanly, e.g.:

"$schema": "https://json-schema.org/draft/2020-12/schema"

Links:
https://json-schema.org/learn/getting-started-step-by-step.html

So my thought is this:

We ask for a "$schema" tag, we apply a $schema tag when we definitely know what it is (e.g. the robot that pulls CVE/NVD data in knows it is CVE JSON format) and we suggest/educate people where possible, but ultimately it's ok if there is no schema tag.

The only downside I see here is if the wrong declaration is used (e.g. type or format), so we should be careful when we apply the tag to data (we should make sure it's correct). And it means $schema becomes a reserved keyword for keys (but not for values), which I think is an acceptable tradeoff.

@rsc
Copy link

rsc commented Jul 13, 2022

Regarding format/version declarations I think we should support this because we support namespaces

What does it mean that we support namespaces? That's surprising to me.

@kurtseifried
Copy link
Contributor

we == GSD. Since we consume data from multiple sources and in multiple formats we support namespaces so e.g. nvd.nist.org can have whatever they want/need, and a vendor can have their own data (e.g. different CVSS scores) and so on.

@joshbressers
Copy link

I like the idea of including a schema key when possible. It would be redundant for OSV to include a schema for all of their entries. This would make the most sense as an optional field.

@kurtseifried
Copy link
Contributor

Especially if OSV wants to be included with other data sets, this makes it far simpler.

joshbuker added a commit to joshbuker/osv-schema that referenced this pull request Oct 6, 2022
Signed-off-by: Josh Buker <crypto@joshbuker.com>
Signed-off-by: Josh Buker <crypto@joshbuker.com>
@joshbuker
Copy link
Contributor Author

Hey @oliverchang, updated this PR per some of your feedback. Mind taking another look?

@oliverchang
Copy link
Contributor

Hey @oliverchang, updated this PR per some of your feedback. Mind taking another look?

Sorry for the silence on this! I was out on a vacation for a few weeks and just got back!

I am a little hesitant to move this one forward, unless the other core committers (@chrisbloom7 and @rsc) express their support for this.

The reasons being that for the two main use cases that have been brought up:

  • For aggregators/producers supporting multiple formats: this info could be stored through namespacing in a way that's consistent across all of that user's formats, rather than customized logic to understand schema_format/data_type/other variants of this.

  • For discovery of OSV entries, this could be addressed by a protocol that clearly defines how users can discover OSV entries. If someone is using this protocol to discover OSV, then they know it's OSV.

@chrisbloom7
Copy link
Collaborator

@oliverchang I noted earlier that GitHub had been considering adding this as a database specific field and that I was in support of it. Your point about this particular solution only addressing the question for a single format (OSV) rather than all potential formats for that producer has made me rethink this.

If a single source provides multiple versions of vulnerability data which are indistinguishable from each other based on URI, request info, or content namespacing then adding a field to the OSV schema only solves the problem for one format and plasters over the problem. It seems reasonable to me that if a user is establishing a way to query a given source and if said source has a documented protocol defining the structure of their data, then the user already has a way of distinguishing one format from another without need for an internal identifier.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants