Add schema_format to aid parsers #51

joshbuker · 2022-06-25T15:32:05Z

Adds a schema_format field so that parsers can positively identify if an object is trying to represent an OSV object.

I think having this somewhere in the data itself will be critical in conjunction with the schema_version field, but I don't really have any opinions on how best to format that. I went with what I felt was a simple implementation that follows the naming convention of the version field. @oliverchang what are your thoughts on how to incorporate that into the schema?

joshbuker · 2022-06-25T15:36:04Z

validation/schema.json

+    "schema_format",
+    "schema_version",


@oliverchang Knowing which schema to validate against is critical for parsers and something I think we should require for all OSV objects. This will be immensely helpful as we add new fields and data types, such as new reference types or relationship data to the OSV format.

(The fact that this will break existing OSVs is an example of why I feel we should require this - as we parse through old data it will be much easier to update it to newer formats if we know what version the data was originally created against.)

chrisbloom7 · 2022-06-27T14:02:38Z

We've been considering adding this as a database specific field value. I agree it would be helpful to denote in the file what the format is. At the moment consuming systems just have to assume this based on info that the publishing system may provide, but it's possible for a publisher to distribute more than one file format, possibly even more than one JSON file format.

oliverchang · 2022-06-28T03:25:05Z

Thanks for raising this PR!

I have some initial concerns here:

This is largely redundant information for most people consuming this format.
There is no standardized way to do this for all the different schemas out there. i.e. we may use "schema_format" here, but other formats will use different conventions and field names. What ends up happening is there will need to be customized parsing/detection logic for each schema out there anyway.

Perhaps this would be better kept out-of-band, or introduced as some kind of protocol. i.e. either as part of the filename/filepath ("ID.osv.json") or through some other mechanism for publishers to state that they are using the OSV format.

@rsc @chrisbloom7 thoughts?

chrisbloom7 · 2022-06-28T14:53:48Z

If a system publishes more than one format in the same stream then the processing systems need a way to interrogate each file to decide how to handle it. In some cases this could be handled by using custom headers, but for systems that don't transmit the files directly (i.e. a repo full of JSON files that processing systems fetch), then if the type is supposed to be inferred from the file name then each publishing system must make their format known and each processing system must account for differences in naming conventions. Adding it to the schema means a processing system need only interrogate the parsed JSON to see if an expected format node is present, and if that node is part of the schema then that makes it easy for consumers to ask. Granted there is no standard here. There was a discussion about this in the JSON schema project and the decision at the time was that they wouldn't standardize on anything but they might document a recommended practice of using a root node such as "http://json-schema.org/#$schema": "..." for JSON objects with the value pointing back to the schema it is intended to be validated against. (Which then possibly relates to #52)

joshbuker · 2022-06-28T15:50:03Z

For use-cases like the GSD where we are potentially embedding multiple formats within the same json file as a part of different namespaces, it's critical for the parser to be able to interrogate each namespace for known formats to aid in display, validation, and translation to other formats. We can't rely on anything outside of the json object itself for this purpose (e.g. filename or headers).

Using "$id": "<URI for specific version of OSV here>" could be one approach, although having the format (OSV, CVE, etc) and version directly available for inspection would be simpler to parse in most scenarios. Having both would ultimately be useful.

The CVE json format already does this, for example: v4, v5

oliverchang · 2022-06-30T23:39:52Z

For use-cases like the GSD where we are potentially embedding multiple formats within the same json file as a part of different namespaces, it's critical for the parser to be able to interrogate each namespace for known formats to aid in display, validation, and translation to other formats. We can't rely on anything outside of the json object itself for this purpose (e.g. filename or headers).

For embedding formats, would it work at all to do something like:

{
  "NAMESPACE-1": {
    "format": "OSV",
    "entry": {
     },
  },
  "NAMESPACE-2": {
    "format": "CVE-5",
    "entry": {
    },
  }
}

This way it's both consistent and clear what format each embedded format is (without having to parse the format itself).

Re distribution

For distribution, I do think we need to standardize how OSV's are distributed -- today we just have a list of git repos in a README.md file, but if there was a consistent way to discover/determine if a source is producing OSV entries (e.g. through network headers or filename conventions), this may also solve the identification problem.

joshbuker · 2022-06-30T23:48:35Z

Hmm, that's definitely worth considering, and could help in scenarios where a format doesn't expose an easy way to parse identification itself (including scenarios where it's XML or some other format encoded as base64).

@joshbressers @kurtseifried any input on the format/version declarations and pros/cons of having it at the different levels?

oliverchang · 2022-07-08T06:25:24Z

Hmm, that's definitely worth considering, and could help in scenarios where a format doesn't expose an easy way to parse identification itself (including scenarios where it's XML or some other format encoded as base64).

@joshbressers @kurtseifried any input on the format/version declarations and pros/cons of having it at the different levels?

Friendly ping on this :)

kurtseifried · 2022-07-08T17:22:52Z

Regarding format/version declarations I think we should support this because we support namespaces and the namespaces can do whatever they want. And not all formats (e.g. OSV) support a simple declaritive way to determine what they are (e.g. CVE JSON does).

I think we need to be flexible though, e.g. I think the bigger players MUST label their data properly (GSD, OSV) but if Wendy wants to throw some stuff into her namespace she SHOULD include some identifying info/schema but it's not required (since she may not have one yet).

Obviously it would also give preference to the namespaces that do declare what they are cleanly, e.g.:

"$schema": "https://json-schema.org/draft/2020-12/schema"

Links:
https://json-schema.org/learn/getting-started-step-by-step.html

So my thought is this:

We ask for a "$schema" tag, we apply a $schema tag when we definitely know what it is (e.g. the robot that pulls CVE/NVD data in knows it is CVE JSON format) and we suggest/educate people where possible, but ultimately it's ok if there is no schema tag.

The only downside I see here is if the wrong declaration is used (e.g. type or format), so we should be careful when we apply the tag to data (we should make sure it's correct). And it means $schema becomes a reserved keyword for keys (but not for values), which I think is an acceptable tradeoff.

rsc · 2022-07-13T13:48:51Z

Regarding format/version declarations I think we should support this because we support namespaces

What does it mean that we support namespaces? That's surprising to me.

kurtseifried · 2022-07-13T15:20:28Z

we == GSD. Since we consume data from multiple sources and in multiple formats we support namespaces so e.g. nvd.nist.org can have whatever they want/need, and a vendor can have their own data (e.g. different CVSS scores) and so on.

joshbressers · 2022-07-14T13:41:58Z

I like the idea of including a schema key when possible. It would be redundant for OSV to include a schema for all of their entries. This would make the most sense as an optional field.

kurtseifried · 2022-07-14T14:02:01Z

Especially if OSV wants to be included with other data sets, this makes it far simpler.

Signed-off-by: Josh Buker <crypto@joshbuker.com>

joshbuker · 2022-10-06T17:21:48Z

Hey @oliverchang, updated this PR per some of your feedback. Mind taking another look?

oliverchang · 2022-10-24T06:02:16Z

Hey @oliverchang, updated this PR per some of your feedback. Mind taking another look?

Sorry for the silence on this! I was out on a vacation for a few weeks and just got back!

I am a little hesitant to move this one forward, unless the other core committers (@chrisbloom7 and @rsc) express their support for this.

The reasons being that for the two main use cases that have been brought up:

For aggregators/producers supporting multiple formats: this info could be stored through namespacing in a way that's consistent across all of that user's formats, rather than customized logic to understand schema_format/data_type/other variants of this.
For discovery of OSV entries, this could be addressed by a protocol that clearly defines how users can discover OSV entries. If someone is using this protocol to discover OSV, then they know it's OSV.

chrisbloom7 · 2022-10-24T19:18:25Z

@oliverchang I noted earlier that GitHub had been considering adding this as a database specific field and that I was in support of it. Your point about this particular solution only addressing the question for a single format (OSV) rather than all potential formats for that producer has made me rethink this.

If a single source provides multiple versions of vulnerability data which are indistinguishable from each other based on URI, request info, or content namespacing then adding a field to the OSV schema only solves the problem for one format and plasters over the problem. It seems reasonable to me that if a user is establishing a way to query a given source and if said source has a documented protocol defining the structure of their data, then the user already has a way of distinguishing one format from another without need for an internal identifier.

joshbuker commented Jun 25, 2022

View reviewed changes

joshbuker mentioned this pull request Jun 27, 2022

Add $id to validation/schema.json #52

Closed

joshbuker marked this pull request as ready for review June 28, 2022 15:52

joshbuker mentioned this pull request Jul 11, 2022

Define a protocol / conventions for discovery #64

Open

joshbuker added a commit to joshbuker/osv-schema that referenced this pull request Oct 6, 2022

Update schema per PR ossf#51 feedback from @oliverchang

992f101

Signed-off-by: Josh Buker <crypto@joshbuker.com>

joshbuker force-pushed the feature/format-id branch from 992f101 to adbcb3e Compare October 6, 2022 17:17

Add schema_format to aid parsers

847c254

Signed-off-by: Josh Buker <crypto@joshbuker.com>

joshbuker force-pushed the feature/format-id branch from adbcb3e to 847c254 Compare October 6, 2022 17:19

joshbuker closed this Mar 29, 2023

joshbuker deleted the feature/format-id branch March 29, 2023 01:41

joshbuker mentioned this pull request Mar 29, 2023

Add optional schema_type for data format hinting (ala CVE) #134

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add schema_format to aid parsers #51

Add schema_format to aid parsers #51

joshbuker commented Jun 25, 2022

joshbuker Jun 25, 2022 •

edited

Loading

chrisbloom7 commented Jun 27, 2022

oliverchang commented Jun 28, 2022 •

edited

Loading

chrisbloom7 commented Jun 28, 2022 •

edited

Loading

joshbuker commented Jun 28, 2022 •

edited

Loading

oliverchang commented Jun 30, 2022 •

edited

Loading

joshbuker commented Jun 30, 2022

oliverchang commented Jul 8, 2022

kurtseifried commented Jul 8, 2022

rsc commented Jul 13, 2022

kurtseifried commented Jul 13, 2022

joshbressers commented Jul 14, 2022

kurtseifried commented Jul 14, 2022

joshbuker commented Oct 6, 2022

oliverchang commented Oct 24, 2022

chrisbloom7 commented Oct 24, 2022

Add schema_format to aid parsers #51

Add schema_format to aid parsers #51

Conversation

joshbuker commented Jun 25, 2022

joshbuker Jun 25, 2022 • edited Loading

Choose a reason for hiding this comment

chrisbloom7 commented Jun 27, 2022

oliverchang commented Jun 28, 2022 • edited Loading

chrisbloom7 commented Jun 28, 2022 • edited Loading

joshbuker commented Jun 28, 2022 • edited Loading

oliverchang commented Jun 30, 2022 • edited Loading

joshbuker commented Jun 30, 2022

oliverchang commented Jul 8, 2022

kurtseifried commented Jul 8, 2022

rsc commented Jul 13, 2022

kurtseifried commented Jul 13, 2022

joshbressers commented Jul 14, 2022

kurtseifried commented Jul 14, 2022

joshbuker commented Oct 6, 2022

oliverchang commented Oct 24, 2022

chrisbloom7 commented Oct 24, 2022

joshbuker Jun 25, 2022 •

edited

Loading

oliverchang commented Jun 28, 2022 •

edited

Loading

chrisbloom7 commented Jun 28, 2022 •

edited

Loading

joshbuker commented Jun 28, 2022 •

edited

Loading

oliverchang commented Jun 30, 2022 •

edited

Loading