Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for lower bound exclusive version ranges #31

Closed
chrisbloom7 opened this issue Feb 8, 2022 · 6 comments
Closed

Add support for lower bound exclusive version ranges #31

chrisbloom7 opened this issue Feb 8, 2022 · 6 comments

Comments

@chrisbloom7
Copy link
Collaborator

chrisbloom7 commented Feb 8, 2022

When defining vulnerable version ranges, the > operator, aka lower bound exclusive, represents the last-known-good release before a vulnerability is introduced. It might be argued that if one knows the last-known-good, one should also know the first-known-bad. However, this isn't always possible, and in some cases there are legitimate reasons for using > even if one has access to the first-known-bad version.

Example 1 - prereleases

In SemVer syntax, 1.2.3.alpha1 is less than 1.2.3 because the former is a prerelease of the latter. If a vulnerability is known to have been introduced in a particular prerelease then one could simply say >= 1.2.3.rc2 which can be converted to an introduced range event type to cover all latter prereleases plus the release itself, assuming the order of the prereleases is alphabetical and machine parseable. However, sometimes there may be so many prereleases that it's not possible to immediately identify which ones are affected. In this case, one way to ensure that all prereleases are covered is to specify > last-known-good.

Example 2 - interlaced affected versions

If a package has multiple affected versions with interlaced fixes, it's reasonable to list them like so:

>= 2.0-beta7, < 2.3.2
> 2.3.2, < 2.12.4
> 2.12.4, < 2.17.1

That example is from the recent log4j vulnerability.

Proposal

One suggestion for handling this is to extend affected[].ranges[].events[] to accept an event with a "last known good" event, and change the validation so that at least one introduced OR "last known good" field is present. This can still be programmatically evaluated by comparing a given version to see if it's > the last known good, just as you could compare it to see if it's >= and introduced event. This could be expressed in the evaluation example in the schema docs by slightly altering the IncludedInRanges method:

func IncludedInRanges(v, ranges)
  vulnerable = false
  for range in ranges
    if BeforeLimits(v, range)
      for evt in sorted(range.events)
        if evt.last_known_good is present && v > evt.last_known_good
           vulnerable = true
        else if evt.introduced is present && v >= evt.introduced
           vulnerable = true
        if evt.fixed is present && v >= evt.fixed
           vulnerable = false
@oliverchang
Copy link
Contributor

oliverchang commented Feb 8, 2022

Thanks for filing this issue!

The meaning of ">" seems overloaded in this case, in that there are two potential definitions? Either as a "last known good" (i.e. the definition of an exclusive lower bound), or meaning that the given version is affected but it potentially goes back further into a pre-release (so not really an "exclusive" lower bound) ?

We could potentially add a "introduced_after" event to capture an exclusive lower bound, but this could potentially introduce incompatibilities and prevent interop with other standards such as the CVE 5.0 schema which does not have a similar operator. Would it be feasible at all to just encode introduced: "next_available_version" in these ~45 cases, and add a field to database_specific to indicate that this lower bound needs more investigation? i.e. using your example:

{
  "ranges": {
    "events": [
      {"introduced": "2.3.3"},
    ], 
  }
  "database_specific": {
    "needs_investigation": ["2.3.3"]
  }
}

@rsc for thoughts too

@chrisbloom7
Copy link
Collaborator Author

chrisbloom7 commented Feb 8, 2022

I can see how overloading the use here is problematic. Let me think about how we might make a case for both interpretations though, as both are legitimate scenarios that deserve representation.

In the LKG scenario, the trouble with determining "next_available_version" from the LKG version is it requires interacting with each ecosystem to determine what comes next and that's not always possible. If we could do that, then this wouldn't be an issue at all because > LKG is the same as saying >= NAV. In this case, having a LKG event type would help and can still be used by a machine to determine if a given version is affected.

In the case where we need to cover 1.2.3 and all prebuilds... We run into a similar issue as above. If we had access to all ecosystems and could determine what prebuilds there are, then we could just use the versions field. But we can't do this uniformly, and so we're trying to avoid doing it at all. I think in this case we could potentially use a database_specific field as you describe above. Of course the issue becomes trying to determine which usage the curators intended when they used the > operator, and obviously this is a problem on our end to manage though it does highlight how the overloaded use is a problem in general.

I will go back to my curation team about the latter use case to see how we might better capture this intent in our systems, and I'll update the description above to focus solely on the former use case where > represents only the LKG version. all of this. I need to develop a better understanding of which problem we're really trying to solve for here. Hold please...

@chrisbloom7
Copy link
Collaborator Author

@oliverchang I've revised the OP to try to remove the operator overloading problem. I think in many cases the argument can be made that one could simply locate the first-known-bad from the last-known-good, but I think there are reasonable use cases for supporting a lower bound exclusive for instances where that's not possible or practical. I want to call out that this isn't a blocker for us at at the moment, not specifically, but I think it would be helpful to be able to support it going forward.

@oliverchang
Copy link
Contributor

Thanks for updating the examples!

To clarify example 1 -- do you mean specifying "> 1.2.2" ? "> 1.2.3" would exclude "1.2.3-alpha1" as "1.2.3-alpha < 1.2.3"? This implies you'd still have to interface with the ecosystem to get the previous non-prerelease version.

Assuming access to the ecosystem by checking the list of released versions (or just bumping the version number for well defined schemes such as SemVer), the second example could also be encoded as

>= 2.0-beta7, < 2.3.2
>= 2.4, < 2.12.4
>= 2.13.0, < 2.17.1

But I can definitely see the ease of allowing >! I'm just concerned about introducing an incompatibility with CVE 5.0 (which I believe has no analogue to > either). This was also discussed here as part of the 5.0 schema discussions where it was decided against to have "versionAfter" due to the small number of uses: CVEProject/cve-schema#87 (comment)

@rsc do you have any thoughts here?

@rsc
Copy link

rsc commented Feb 11, 2022

The reasons against adding > seem to be:

  • It implies that either (1) we don't know which commit added the vulnerability, or (2) we don't know the full set of relevant versions of the software. In either case, we'd like to aim to do better.
  • We didn't include it in CVE JSON 5.0, and we want OSV to map losslessly into CVE and back.

Concrete examples are always essential for thinking through these issues, so thank you for pointing at Log4j. The GHSA description says:

Apache Log4j2 versions 2.0-beta7 through 2.17.0 (excluding security fix releases 2.3.2 and 2.12.4) are vulnerable to an attack where an attacker with permission to modify the logging configuration file can construct a malicious configuration using a JDBC Appender with a data source referencing a JNDI URI which can execute remote code. This issue is fixed by limiting JNDI data source names to the java protocol in Log4j2 versions 2.17.1, 2.12.4, and 2.3.2.

I see that the GHSA encoded this text as:

>= 2.0-beta7, < 2.3.2
> 2.3.2, < 2.12.4
> 2.12.4, < 2.17.1

But that seems wrong to me.

Is the vuln fixed in 2.3.2 but not in 2.3.3? Surely not. It seems that 2.3.3 simply doesn't exist right now. Writing it down that way is problematic because when the next Log4j vuln is found and they do issue 2.3.3 and people upgrade, vuln checkers will start reporting that 2.3.3 is vulnerable to the original vuln. Same issue for the 2.12.x branch starting at 2.12.5.

So in this case at least, making > available would lead to an inaccurate encoding of versions that would lead automated version decisions to incorrect results. This seems like a good reason not to add >.

Also in this case, log4j's source repo is available (as it will be for essentially any open source vulnerability in OSV), and we can inspect the tags in the repo to identify precise start points. It's worth noting that they use Maven ordering and not SemVer ordering, although it doesn't affect the outcome too much in this case. Reading through the tags after cloning the repo, there were no 2.4 prereleases, so >= 2.4 is more accurate than > 2.3.2, and the first 2.13 sequence was 2.13.0-rc1, so >= 2.13.0-rc1 is more accurate than > 2.12.4.

So in this case the non-buggy way to write the versions for Log4j is:

>= 2.0-beta7, < 2.3.2
>= 2.4, < 2.12.4
>= 2.13.0-rc1, < 2.17.1

My conclusion from this is that Log4j by itself does not justify adding >.

Going back to the more hypothetical case from Example 1:

However, sometimes there may be so many prereleases that it's not possible to immediately identify which ones are affected. In this case, one way to ensure that all prereleases are covered is to specify > last-known-good.

For an open source project, I don't see why this would be the case. First, there should be a commit that introduced the vuln, and 'git tag --contains' should tell us which prereleases are vulnerable. If we don't know the commit, we really should find out to improve the precision. Second, even if we don't know the commit and cannot take the time to find out, we can certainly fetch a list of all the prereleases and identify the very first one in semver order and use that.

Also, in this scenario, using > last-known-good, when last-known-good is on a different minor version branch, produces exactly the same future false positive as in the Log4j case. That's a significant downside, because once again it gets in the way of accurate machine-generated reports of vulnerabilities.

My conclusion from this is that the hypothetical Example 1 does not justify adding > either.

Trying to generalize, what these two examples seem to show is:

  1. Almost all uses of > indicate some kind of uncertainty that is likely to introduce false positives.
  2. Because we are considering open source, it is always possible to enumerate the known versions and find the least version along a particular minor branch, rather than needing to write > a specific version on a different major branch.
  3. Because we are considering open source, it is also always possible, with more work, to identify the commit or commits where a vuln was introduced and precisely identify the least versions following those commits. But this is not strictly necessary to avoid using >.

Are there other examples we should consider that would contradict these conclusions?

@chrisbloom7
Copy link
Collaborator Author

Closing this topic for now. Our reliance on this operator is primarily due to a limitation in our internal curation process and I don't believe we can make a compelling enough argument for adding support for it externally.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants