Aliases, and how they are supposed to be used #888

Open
nscuro opened this issue Dec 5, 2022 · 6 comments
Labels
backlog Important but currently unprioritized data quality Issues with data quality

Comments

@nscuro

nscuro commented Dec 5, 2022

Hey OSV team, thanks for your great work!

We're currently looking at how we can correlate vulnerabilities that describe the same thing.

As per specification, OSV has the aliases field for this:

> The aliases field gives a list of IDs of the same vulnerability in other databases, in the form of the id field.

At least in my interpretation, aliasing is a bidirectional relationship that also applies transitively.
If X aliases Y and Z, Y should also alias X, and Y should also alias Z. If they all describe the same thing, that should be a valid assumption.
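
A minimal sketch of this interpretation, assuming each record is a dict with an `id` and an optional `aliases` list (the IDs `X`, `Y`, `Z`, `W` are placeholders, and `alias_closure` is a hypothetical helper, not part of any OSV tooling). It treats every alias as a symmetric, transitive link and groups IDs with a small union-find:

```python
from collections import defaultdict

def alias_closure(records):
    """Group IDs into equivalence classes, treating aliases as symmetric and transitive."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    for rec in records:
        find(rec["id"])  # register records that have no aliases
        for alias in rec.get("aliases", []):
            union(rec["id"], alias)

    groups = defaultdict(set)
    for node in list(parent):
        groups[find(node)].add(node)
    return sorted(sorted(group) for group in groups.values())

# Placeholder IDs: X lists Y and Z, Y lists nothing, W is unrelated.
records = [
    {"id": "X", "aliases": ["Y", "Z"]},
    {"id": "Y"},
    {"id": "W"},
]
groups = alias_closure(records)
print(groups)  # [['W'], ['X', 'Y', 'Z']]
```

Under this reading Y and Z end up in the same group even though neither record lists the other.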

However, in reality, we see that many vulnerability databases (ab-)use the OSV schema to publish advisories. In my understanding, a vulnerability would describe one defect, and that one defect only. Whereas an advisory can potentially refer to multiple vulnerabilities (as in "we patched all these vulnerabilities in version 1.2.3 of our package"). This appears to be a common thing for at least the Go, Rust, and (especially) Debian ecosystems in the OSV database. There are most likely more, but these have been the most obvious candidates to us.

For example, GO-2022-0586 presumably aliases four CVEs and four GHSAs:

These are four different vulnerabilities, with different CWEs, descriptions and severities. CVEs and GHSAs actually alias each other in pairs of two (GHSA-28r2-q6m8-9hpx aliases CVE-2022-30323, but not CVE-2022-26945 etc.):

[Image: Aliases of GO-2022-0586]

In cases of advisories like this, the "aliases" are neither bidirectional (GHSA-28r2-q6m8-9hpx isn't really the same as GO-2022-0586), nor are they fully transitive (CVE-2022-26945 is not the same as CVE-2022-30323). If one were to attempt to find all aliases for GHSA-28r2-q6m8-9hpx here, traversing this graph would yield wrong results.
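
The failure mode can be reproduced with a reduced illustration. The records below contain only the IDs named in this issue and are not the full OSV entries; `reachable_aliases` is a hypothetical helper that walks the alias graph breadth-first, the way a naive consumer might:

```python
from collections import defaultdict, deque

# Reduced illustration built from IDs mentioned in this issue; not the full OSV records.
records = [
    {"id": "GO-2022-0586",
     "aliases": ["CVE-2022-26945", "CVE-2022-30323", "GHSA-28r2-q6m8-9hpx"]},
    {"id": "GHSA-28r2-q6m8-9hpx", "aliases": ["CVE-2022-30323"]},
]

def reachable_aliases(start, records):
    """Return every ID reachable from `start` when aliases are followed in both directions."""
    graph = defaultdict(set)
    for rec in records:
        for alias in rec.get("aliases", []):
            graph[rec["id"]].add(alias)
            graph[alias].add(rec["id"])

    seen, queue = {start}, deque([start])
    while queue:
        node = queue.popleft()
        for neighbour in graph[node]:
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return seen - {start}

found = reachable_aliases("GHSA-28r2-q6m8-9hpx", records)
print(sorted(found))
```

The result includes CVE-2022-26945, which is reached only through the GO-2022-0586 entry and is a different vulnerability.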

The Debian ecosystem especially has many of these scenarios, where one DLA can refer to loads of CVEs:

[Image not preserved]

I have the feeling that OSV entries of type "advisory" (maybe such a distinction would be good to have?) should instead use the related field. Although I imagine this will be hard to enforce, and even harder to apply in an automated fashion.
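
One possible heuristic for flagging such entries automatically, offered as a sketch only: if a record lists two or more aliases issued by the same database, those aliases are unlikely to all be the same vulnerability. This rests on two assumptions that do not always hold: that the text before the first `-` identifies the issuing database, and that one database does not assign two IDs to the same defect. `id_namespace` and `looks_like_advisory` are hypothetical names.

```python
from collections import Counter

def id_namespace(vuln_id):
    # Assumption: the text before the first "-" names the issuing database.
    return vuln_id.split("-", 1)[0]

def looks_like_advisory(record):
    """Flag records whose aliases include several IDs from the same database."""
    counts = Counter(id_namespace(alias) for alias in record.get("aliases", []))
    return any(n > 1 for n in counts.values())

# Reduced illustration built from IDs mentioned in this issue.
go_entry = {"id": "GO-2022-0586",
            "aliases": ["CVE-2022-26945", "CVE-2022-30323", "GHSA-28r2-q6m8-9hpx"]}
ghsa_entry = {"id": "GHSA-28r2-q6m8-9hpx", "aliases": ["CVE-2022-30323"]}

print(looks_like_advisory(go_entry))    # True
print(looks_like_advisory(ghsa_entry))  # False
```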

Am I understanding aliasing in OSV correctly? Is this a data quality issue with the databases that use the OSV schema? Is there anything we can do about it?

@oliverchang
Collaborator

Hey @nscuro !

Thanks for the very detailed issue!

Can you explain a bit more about the exact use case you're trying to achieve here with the OSV data? Are you trying to build your own graph representation?

For the OSV schema, we actively avoided trying to make an explicit distinction between a group of vulnerabilities (advisory) vs a single vulnerability to keep things simple. In terms of the end result we want to enable, it's the same -- the ability to identify which package versions are affected and which versions to update to.

How we envision a vulnerability scanner working with our data would be this:

  1. Extract a list of packages and versions to query. Say this is just Package "Foo" at Version "1.0.0".
  2. Query OSV and get the list of vulnerability entries that say "Foo" at "1.0.0" is vulnerable.
  3. Use "aliases"/"related" to group them together for presentation, e.g. in a bug filed.
  4. Suggest a fix/resolution such that all the entries in a single group agree.
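
Steps 3 and 4 above can be sketched as follows. This is an illustration under stated assumptions, not OSV's implementation: the entries stand in for the result of step 2, all IDs are made up, the flat `fixed` field is a simplification of the schema's per-range fix events, and versions are assumed to be dotted integers.

```python
def parse_version(version):
    # Assumption: versions are dotted integers such as "1.2.3".
    return tuple(int(part) for part in version.split("."))

def group_entries(entries):
    """Step 3: merge entries that are linked through 'aliases' or 'related'."""
    groups = []  # list of (set of IDs, list of entries); ID sets stay disjoint
    for entry in entries:
        merged_ids = {entry["id"], *entry.get("aliases", []), *entry.get("related", [])}
        merged_entries = [entry]
        remaining = []
        for group_ids, group_entries_ in groups:
            if group_ids & merged_ids:
                merged_ids |= group_ids
                merged_entries = group_entries_ + merged_entries
            else:
                remaining.append((group_ids, group_entries_))
        groups = remaining + [(merged_ids, merged_entries)]
    return [group for _, group in groups]

def suggest_fix(group):
    """Step 4: the lowest version that every entry in the group considers fixed."""
    return max((entry["fixed"] for entry in group), key=parse_version)

# Hypothetical entries reported for package "Foo" at version "1.0.0".
entries = [
    {"id": "EX-ADVISORY-1", "related": ["EX-VULN-1", "EX-VULN-2"], "fixed": "1.2.3"},
    {"id": "EX-VULN-1", "aliases": ["EX-OTHER-1"], "fixed": "1.1.0"},
    {"id": "EX-VULN-3", "fixed": "1.0.1"},
]

for group in group_entries(entries):
    print([entry["id"] for entry in group], "->", suggest_fix(group))
```

Grouping this way is enough for remediation, but it deliberately does not say which members of a group are the same vulnerability.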

Under this workflow, it seems to make sense to group all of the related vulnerabilities together, so users have the full context on what all the vulnerability sources say, and updates/remediation steps account for all relevant entries in the same group. The fact that some of these are "advisories" should not matter -- having them be split up would have the same effect.

If this is an issue of semantics and representation, we can certainly ask our data sources to use related instead in the cases where an entry they're exporting consists of multiple other vulnerabilities from a different source, and this is likely more correct. I think this would be a relatively easy ask for our current sources to adopt.

@nscuro
Author

nscuro commented Dec 8, 2022

> Can you explain a bit more about the exact use case you're trying to achieve here with the OSV data?

Our use case is not primarily about recommending a fixed version to an end user ("updating to version X will resolve all these issues"), it's more about tracking risk, and making it transparent. So knowing which vulnerabilities are the same and which are not does matter to us.

We also have a VEX-like use case, where users (or machines) evaluate whether a project is actually affected by a vulnerability, and record their decision. Obviously we want to avoid redundant work being done, a decision should not have to be recorded for GHSA-28r2-q6m8-9hpx and CVE-2022-30323 separately, as they describe the same thing. On the other hand, we don't want the same decision being applied to different vulnerabilities (CVE-2022-30323 vs. CVE-2022-26945), because the exposure, attack vector, impact etc. may differ.

Approaching this use case the other way around, if a vendor provided a VEX document stating that their product is not affected by CVE-2022-30323, this should also be applicable to actual aliases like GHSA-28r2-q6m8-9hpx, but not CVE-2022-26945.
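
A small sketch of that propagation rule, assuming the true alias groups are already known (here they are written by hand from the IDs in this issue). The `not_affected` status string and the `vex_status` helper are illustrative choices, not a reference to a specific VEX format or tool:

```python
# Hand-written alias groups based on the IDs discussed in this issue.
alias_groups = [
    {"GHSA-28r2-q6m8-9hpx", "CVE-2022-30323"},
    {"CVE-2022-26945"},
]

# A vendor statement recorded against one ID only.
vex_statements = {"CVE-2022-30323": "not_affected"}

def vex_status(vuln_id, statements, groups):
    """Return the status recorded for vuln_id or for any of its true aliases."""
    for group in groups:
        if vuln_id in group:
            for member in sorted(group):
                if member in statements:
                    return statements[member]
    return None  # no statement applies

print(vex_status("GHSA-28r2-q6m8-9hpx", vex_statements, alias_groups))  # not_affected
print(vex_status("CVE-2022-26945", vex_statements, alias_groups))       # None
```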

> If this is an issue of semantics and representation, we can certainly ask our data sources to use related instead in the cases where an entry they're exporting consists of multiple other vulnerabilities from a different source, and this is likely more correct.

That'd be great!

@oliverchang
Collaborator

oliverchang commented Dec 9, 2022

> > Can you explain a bit more about the exact use case you're trying to achieve here with the OSV data?
>
> Our use case is not primarily about recommending a fixed version to an end user ("updating to version X will resolve all these issues"), it's more about tracking risk, and making it transparent. So knowing which vulnerabilities are the same and which are not does matter to us.
>
> We also have a VEX-like use case, where users (or machines) evaluate whether a project is actually affected by a vulnerability, and record their decision. Obviously we want to avoid redundant work being done, a decision should not have to be recorded for GHSA-28r2-q6m8-9hpx and CVE-2022-30323 separately, as they describe the same thing. On the other hand, we don't want the same decision being applied to different vulnerabilities (CVE-2022-30323 vs. CVE-2022-26945), because the exposure, attack vector, impact etc. may differ.

Got it, thanks for explaining! Are you thinking of recording VEX on a per-package basis, such that users can transitively determine from the entire dependency graph if they're actually indirectly affected by a vulnerability?

> Approaching this use case the other way around, if a vendor provided a VEX document stating that their product is not affected by CVE-2022-30323, this should also be applicable to actual aliases like GHSA-28r2-q6m8-9hpx, but not CVE-2022-26945.
>
> > If this is an issue of semantics and representation, we can certainly ask our data sources to use related instead in the cases where an entry they're exporting consists of multiple other vulnerabilities from a different source, and this is likely more correct.
>
> That'd be great!

We'll start conversations here with Go, and fix up the Debian ones.


This issue has not had any activity for 60 days and will be automatically closed in two weeks

@github-actions github-actions bot added the stale The issue or PR is stale and pending automated closure label Jul 27, 2024
@nscuro
Author

nscuro commented Jul 27, 2024

Commenting to signal that this issue is still relevant.

I am enlightened however to see there is a continuous effort to improve the situation :)

@oliverchang oliverchang added backlog Important but currently unprioritized and removed stale The issue or PR is stale and pending automated closure labels Jul 28, 2024
@oliverchang
Collaborator

> Commenting to signal that this issue is still relevant.
>
> I am enlightened however to see there is a continuous effort to improve the situation :)

Thanks! Removed the stale tags.
