
Version range #66

Open
iamwillbar opened this issue Oct 16, 2019 · 29 comments

Comments

@iamwillbar
Member

Is there desire for PURL to support version ranges or is that out of scope? For example, to describe vulnerable versions of a package.

@brianf
Contributor

brianf commented Oct 16, 2019

If nothing else, capturing the various ecosystem version sorting is really key. Then you could at least use purl ids as upper and lower bounds and still evaluate what's in between.

@pombredanne
Member

@iamwillbar there is a need alright to have a common way to express version ranges... but I wonder if this is possible, because there is no universal way to express versions. Semver comes close (but cannot handle some epochs or the Debian "or" AFAIK)

@brianf
There is no universal way to compare versions either (which is why RPM- and Debian-based distros had to adopt the concept of an epoch), so documenting the way versions are compared for each package type would be a great start.

@iamwillbar if you were to provide some unified specification for version ranges what would it look like?
(leaving aside for now if this could be stuffed in a PURL or not)

@mprpic
Contributor

mprpic commented Sep 28, 2020

@pombredanne I suppose nothing is preventing purl users from specifying the versioning scheme in a qualifier, e.g. pkg:pypi/django@1.11.1?version_scheme=semver. Given a set of these, you could order them and determine vulnerable versions by comparing against a known fixed version. This, however, falls apart as soon as you have multiple supported version streams that are updated independently (as is the case for Django). There are a ton of "standards" on version range specifications.

If we were to use the Python one, and use this security release as an example, the purls for the released versions could be:

  • pkg:pypi/django@3.1.1?version_scheme=semver&vulnerable_versions=3.1.*
  • pkg:pypi/django@3.0.10?version_scheme=semver&vulnerable_versions=3.0.*
  • pkg:pypi/django@2.2.16?version_scheme=semver&vulnerable_versions=2.2.*

Mandatory qualifiers are not a thing in the spec, however, so using them would depend entirely on the maintainers of said projects.
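A rough sketch of how a consumer could act on such a qualifier (hypothetical code; the naive dotted-numeric comparison below stands in for a real scheme-aware comparator such as a full PEP 440 implementation):

```python
from urllib.parse import parse_qs

def version_and_scheme(purl: str):
    """Extract the version and the hypothetical version_scheme qualifier from a purl."""
    path, _, query = purl.partition("?")
    version = path.rpartition("@")[2]
    scheme = parse_qs(query).get("version_scheme", ["unknown"])[0]
    return version, scheme

def naive_key(version: str):
    # Naive: only handles plain dotted-numeric versions; a real tool would
    # dispatch on the declared scheme (PEP 440, semver, RPM epochs, ...).
    return tuple(int(part) for part in version.split("."))

purls = [
    "pkg:pypi/django@1.11.1?version_scheme=semver",
    "pkg:pypi/django@1.9.13?version_scheme=semver",
    "pkg:pypi/django@1.10.7?version_scheme=semver",
]
versions = sorted((version_and_scheme(p)[0] for p in purls), key=naive_key)
print(versions)  # → ['1.9.13', '1.10.7', '1.11.1']
```

As noted above, sorting alone is not enough once several version streams are maintained independently.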

@pombredanne
Member

@mprpic

I suppose nothing is preventing purl users from specifying the versioning scheme in a qualifier, e.g. pkg:pypi/django@1.11.1?version_scheme=semver.

Sure, nothing prevents this, and we could even make it part of the spec too.
Yet, as you mentioned, this falls apart as soon as you have multiple supported version streams that are updated independently. If there were anything practical to do, it would be to adopt the many semantics of how each package type handles version constraints for dependencies, which is often more complex than just a scheme.

That said, in the context of vulnerability reporting, is there really a need for, value in, and correctness to a version range? I started wondering about this when @sbs2001 mentioned it in https://gitter.im/aboutcode-org/vulnerablecode?at=5f70857f5a56b467a5f2a835

At a point in time I can state that a list of concrete and discrete versions (not a range but a list) is subject to a certain vulnerability, and that there is a list of concrete and discrete versions (not a range but a list) in which that vulnerability has been patched/resolved/fixed.

This could be a list of Package URLs or a list of versions, not ranges. Anything that uses a range or some wildcard is potentially incorrect, misleading, or both, which to me makes the value of a range low and/or dangerous. This is likely even more so when looking at distro packages such as RPM or Debian packages, which add to the upstream version scheme a release/build number and/or an epoch: the affected versions would rarely resolve to a proper range, but would always be correct when expressed as a list.

Does this make some sense?
What would be the benefits of a range?

@pombredanne
Member

@mprpic this thread has a good argument from @copernico for the need of version ranges https://gitter.im/aboutcode-org/vulnerablecode?at=5f7231fa6e85e0058c5f4aaf

@pombredanne
Member

Now here are some aesthetic considerations:

A purl pkg:npm/foo with a (complex) version_range of ~= 0.9, >= 1.0, != 1.3.4.*, < 2.0, defined using the PEP 440 syntax and used as a qualifier, would come out once encoded as:

pkg:npm/foo?version_range=%7E%3D%200.9%2C%20%3E%3D%201.0%2C%20%21%3D%201.3.4.%2A%2C%20%3C%202.0

Hum 😒
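For what it's worth, the round trip can be checked mechanically; a small sketch (note that Python's quote() leaves "~" unescaped by default, so different encoders can produce different but equivalent escapings):

```python
from urllib.parse import quote, unquote

range_expr = "~= 0.9, >= 1.0, != 1.3.4.*, < 2.0"  # the PEP-440-style range above
encoded = "%7E%3D%200.9%2C%20%3E%3D%201.0%2C%20%21%3D%201.3.4.%2A%2C%20%3C%202.0"

# Decoding the qualifier value recovers the readable range...
assert unquote(encoded) == range_expr
# ...and re-encoding survives a round trip, even if the escaped form differs.
assert unquote(quote(range_expr, safe="")) == range_expr
```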

@stevespringett
Member

Ugh, I figured it would make purl version ranges unreadable. If ranges will be included, I don't see a way to eliminate the aesthetic problems it creates. The only thing I can think of is maybe to break down the version clauses into individual purl qualifiers rather than a single version_range string. That would likely make the purl readable without as much encoding.
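One possible shape of that idea, purely as an illustration (the vr_* qualifier names are invented here, not part of any spec):

```python
from urllib.parse import parse_qs, urlencode

# One qualifier per clause instead of a single encoded version_range string.
clauses = {"vr_gte": "1.0", "vr_lt": "2.0", "vr_ne": "1.3.4"}
purl = "pkg:npm/foo?" + urlencode(clauses)
print(purl)  # → pkg:npm/foo?vr_gte=1.0&vr_lt=2.0&vr_ne=1.3.4

# The clauses parse back out without any percent-encoding noise.
parsed = {k: v[0] for k, v in parse_qs(purl.partition("?")[2]).items()}
assert parsed == clauses
```

Plain numeric versions need no escaping this way, though wildcard clauses like 1.3.4.* would still require encoding.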

@jhutchings1
Contributor

Hey folks, looks like this issue has gone stale, but I'd love to restart the conversation. @pombredanne 's suggestion seems entirely reasonable, even if the URL encoding of the characters makes it less human readable.

@pombredanne
Member

@jhutchings1 the issue has not gone stale at all ... it is just that we have been making practical experiments with version ranges separately, in another repo!

There is a draft spec there (it would need to be extracted and brought here):
https://github.com/nexB/univers/blob/386eb32468c75ecac25ec872ea004b3257962946/VERSION-RANGE-SPEC.rst

ATM the draft is starting to have some legs ... but I need to play with actual, real, working code to validate that this can work at scale.
The WIP experimental code in https://github.com/nexB/univers needs to be beefed up to match the spec and tested with vulnerablecode... but another implementation would be welcome too!

@pombredanne
Member

@jhutchings1 actually I separated a draft spec in a clean branch here aboutcode-org/univers#11

@sdboyer

sdboyer commented Nov 30, 2021

/me takes deep breath

i'm gonna break a personal and OSS best practice rule and spread unsupported FUD. Sorry. I'm only doing it because i see concrete progress being made here, and i think not saying something may be more harmful.

i've come to believe that version ranges are, in general, harmful. i do have an alternative that i've been working on for a while - it's not public because it's unfinished, but the relevant bits are plausibly finished enough for purl and the purposes of this discussion. @pombredanne, it's been quite a while since we've talked (FOSDEM 2018, right?), but if you want to catch up about it, i'd be happy to find an hour sometime - DM me on twitter?

Totally understood that my unspecified, general concern should not block actual progress, though.

@jbmaillet

Maybe I'm playing MisterObvious, but IMHO the issue is not version range vs. version list: the root cause of our headaches is version computability (operators ==, <, >, <=, >=), and while it is more or less OK for Semver or CalVer, except for some wildcard and attribute corner cases, it is indeed a tough one.

When the NVD introduced the new way of defining ranges with

  • versionStartIncluding
  • versionStartExcluding
  • versionEndIncluding
  • versionEndExcluding

...in JSON CVE Schema 0.1_beta (2017-11-01), in a very discreet way, some tools that relied on a full list of impacted versions per CVE broke, and to make matters worse, broke silently. The one I used at the time took 2 years to fix it (I found a better one in the meantime).

I'd suggest extreme care: the NVD people have been working on software inventory for 20 years, are not stupid, and yet kind of failed (at least for the use cases we now have).

There has been a very long discussion on the topic in the upcoming CVE JSON Schema development, now in v5.0.0 release candidate 5. I can't find the exact discussion reference, but as of today, on the CVE / CPE side, the outcomes are here:
schema/v5.0: introduce computable version ranges
Merge pull request #100 from rsc/computable-versions
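For context, the four NVD bounds are mechanical to evaluate once a comparison function is fixed; a minimal sketch (naive dotted-numeric comparison and made-up versions, only to show the inclusive/exclusive semantics):

```python
def key(version):
    # Naive dotted-numeric comparison; real version schemes need more care.
    return tuple(int(part) for part in version.split("."))

def affected(version, start_incl=None, start_excl=None, end_incl=None, end_excl=None):
    """Evaluate NVD-style versionStart*/versionEnd* bounds against one version."""
    v = key(version)
    if start_incl and v < key(start_incl):
        return False
    if start_excl and v <= key(start_excl):
        return False
    if end_incl and v > key(end_incl):
        return False
    if end_excl and v >= key(end_excl):
        return False
    return True

assert affected("1.2.3", end_excl="1.3.0")       # below the exclusive end
assert not affected("1.3.0", end_excl="1.3.0")   # the boundary itself is excluded
```

What this sketch deliberately leaves out, per the comments above, is the hard part: how two versions compare in each ecosystem.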

@pombredanne
Member

@jbmaillet re:

Maybe playing MisterObvious, but IMHO the issue is not version range or version list: the root cause of our headaches is version computability (operators ==, <, >, <=, >=),

Yes it is! And a range in any notation needs to be informed by how two versions are compared.

In the experimental spec at aboutcode-org/univers#11 for a compact range notation, and in the WIP companion implementation at https://github.com/nexB/univers/tree/main/src/univers by @sbs2001:

  • the range syntax or notation is unified and shared by every package type and versioning scheme
  • the version comparison semantics are unique per versioning scheme (which practically equals a package type)

The WIP spec has extensive research on the topic, both when used for vulnerable ranges (including the NVD approach) and when used for package dependency ranges.

The NVD versionStartIncluding, versionStartExcluding, versionEndIncluding, and versionEndExcluding are missing one important piece of data, which is how to compare two versions as greater or lesser; that is something that has been integrated as versionType in CVEProject/cve-schema@e3d43c6 and the related https://github.com/ossf/osv-schema spec

The draft "vers" spec tries to address this, with the slightly different goal of having a compact yet obvious notation for version ranges.
Practically, the companion "univers" Python library at https://github.com/nexB/univers/tree/main/src/univers relies on multiple package-type/ecosystem-specific comparison functions: these include for now node-semver (as used in npm), rpm, maven, debian, gentoo, arch, semver, ruby, and even one specific to a single package, nginx, which has its own peculiar way of defining vulnerable ranges in its advisories (see https://github.com/nexB/univers/blob/7a99ab9288ff8e20bcc69b4b383015be6615c2b9/src/univers/version_range.py#L375 )

At this stage it makes sense that I move the draft at aboutcode-org/univers#11 to a PR here as the two are closely tied! :)

@pombredanne
Member

@jbmaillet and everyone here ... See #139 .... comments are badly needed.

@sdboyer

sdboyer commented Nov 30, 2021

the root cause of our headaches is version computability (operators ==, <, >, <=, >=)

Indeed. But,

while it is more or less OK for Semver or CalVer

i'd say "less," at least for semver, where the tendency is to construct ranges with bounds on versions that may not yet exist - and even if they do, versions may come to exist after the publication of the range.

But, i see this comment CVEProject/cve-schema#87 (comment), particularly:

I came to appreciate that version ranges can only ever be an approximation; and that a complete enumeration of all affected versions is the only correct statement

and suspect that if you're embracing ranges while having accepted this, then the additional things i could add will be of marginal value, which is OK. /me bows out

@brianf
Contributor

brianf commented Nov 30, 2021

I'd agree that version ranges as a mechanism for choosing dependencies are generally bad. (Hence why things like LATEST and RELEASE were deprecated in Maven 3 years ago.) However, for this spec, we still need a way to express ranges, e.g. "this vulnerability applies to versions x to y". IOW, ranges are required to be expressive generally, but using them to declare dependencies is a bridge too far... though I don't think that's the point of the spec here.

@pombredanne
Member

@sdboyer Hey! It has been a while... great of you to drop by! Let me ping you on Twitter; I am pombr there.

i've come to believe that version ranges are, in general, harmful.

I agree ++. I am intrigued by what your alternative could be!

Ultimately, ranges are all leaky and make false promises at some level; in practice, only full enumerations might be correct. Yet... they do exist in the wild, and capturing the wild beasts is what I want somehow.

@brianf re:

However for this spec, we still need a way to express ranges, eg this vulnerability applies to versions x to y.
IOW ranges are required to be expressive generally,

Exactly.

but using them to declare dependencies is a bridge too far...but I don't think that's the point of the spec here.

The spec does not take a stand on how ranges would be used; it could be used to depict vulnerable or dependency ranges.
I have a question though wrt. Maven: how common would you say ranges are? https://maven.apache.org/pom.html#Dependency_Version_Requirement_Specification

@brianf
Contributor

brianf commented Nov 30, 2021 via email

@pombredanne
Member

@brianf re:

Ranges in Maven are very rarely used.

Thanks... this confirms my impression. As an aside, it's funny that dependency ranges have been mostly abandoned by Maven, yet are fairly prevalent in Python, npm, and Ruby package manifests, commonly accompanied by an extra full enumeration of pinned versions, a.k.a. a lockfile.

@brianf
Contributor

brianf commented Nov 30, 2021 via email

@jbmaillet

jbmaillet commented Dec 1, 2021

(This is a long rant, but you can jump to the conclusion.)

To give a bit of context to my comments, and to introduce myself: I work in the IoT/embedded field, on automotive systems, in cybersecurity (plus a bit of OSS licensing). That's Linux, Android, AUTOSAR, FreeRTOS as environments, and a SLOC count that is 95% C/C++, the rest being Java or Kotlin. Everything is built from sources, either from archives + good old autotools, or from plain git repositories (think a la AOSP or buildroot or Yocto).

The source code is often a fork from upstream, at least for the Linux kernels (the SoC vendors fork the kernel, and we fork it again for our own customizations, or for some CVE or plain bug backports, because upgrading is extremely painful in such a context). Last time I checked, the Linux kernel on an LTS branch such as those we use has 13000+ Kconfig options: in a typical industrial product, only 20% of the code is actually compiled (and hence only roughly 20% of the CVEs are relevant, for example).

An Android source tree (which is more than AOSP, because AOSP does not come with a kernel for your SoC CPU, nor a bootloader, nor a hypervisor, etc.) + our added OSS and our proprietary code is about 100GB before build, with about 800 git repositories (AOSP for Android 12, without kernel nor added customization, consists of 1079 git repositories as of today). And that is only part of a system/product, and of course we have several systems/products.

As a result, a complete system/product will have about 1000 CVEs to track (yes, a thousand), 95% of which are false positives (code not compiled per the build configuration, fix backported, but mostly poorly documented CVEs, more on this later), across hundreds of repositories. Plus our suppliers' private advisories, etc.

In this context, I use CVE (and other sources), and so I'm stuck with CPE for now, ready to jump to SWID when it is used by the NVD. I do not (yet) use purl, nor SPDX, both for lack of need and for lack of spare time (also, we have our own internal tooling and processes in place). But I consider any effort in software inventory, version computation/matching, dependency tracking, and SBOM as important for my security assessments and OSS licensing compliance, and I try to follow developments in these areas. So BTW: thank you all for your work.

This being said:

  • projects like univers are, sadly, not usable for me because my ecosystem is C/C++ bare metal code built from sources

  • @sdboyer "a complete enumeration of all affected versions is the only correct statement": yes, I 100% agree, but this does not work, never did, and I'm afraid never will. Take the Linux kernel. Right now, there are 7 branches developed and maintained (4.4, 4.9, 4.14, 4.19, 5.4, 5.10, 5.15), plus the old products that are still in use in some cars on the road (product serial life). New releases / tags of these 7 branches are made every week (RSS feed here). Now let's pick randomly one of my thousand CVEs: CVE-2021-43057. As you can see, it is just documented as "Up to (excluding) 5.14.8". In particular, it does not:

    • give the status for each of the other 6 current branches (and this is very rarely done; maybe 1% of kernel CVEs have this level of detail)
    • say in which version(s) the issue was introduced

And this is not done because it is too much of an analysis workload... so this workload is transferred to auditors/analysts like me.

Considering that the kernel is by far my biggest volume and flow of continuously incoming CVEs, that most kernel maintainers don't care about CVEs (some of them even making it a personal matter), that this situation has always been so (even when versions were, partially, listed in the NVD before the UpToExcluding etc. syntax), and that the kernel organization or Linux Foundation is not and does not want to be a CNA, a full enumeration of versions will never, ever, work. A version range is the "least worst" option.

Google with Android is even worse: they put all their CVEs in a unique CPE, o:google:android, and good luck tracking in which of its 800-1000 repositories the issue is if you are not an official Google partner with privileged access to their bulletins, relying only on the public ones.

CONCLUSION:

Don't get me wrong: a full version list could work in theory, it would be suitable and great, but it does not match industrial reality. And it's in great part a question of people and organizations, not a question of specification. So computable version ranges are hard, but they are a MUST. You can enumerate versions for a libfoobar that gets a new CVE once per month or quarter, but that does not matter if you do not address code bases such as the kernel (with, as of today, more than 500 CVEs on its 4.14 branch and close to 2500 CVEs in all its history) or Android (close to 3800 CVEs in all its history), with new ones coming every week: these are (some of) the hard cases I deal with daily. I imagine there are others in other ecosystems/industries; I've seen hundreds of CVEs on Windows/Oracle/Citrix/IT products or Jenkins/Atlassian/tooling passing every week.

Sorry for this long rant. At least, if you never worked in the embedded field, now you know why "the S in IoT is for Security". ;-)

@pombredanne
Member

pombredanne commented Dec 9, 2021

@jbmaillet re #66 (comment)

This all makes sense, and I agree: saying this is a mess is an understatement!

I am rather familiar with contexts similar to yours, and this brings a question (and possibly something we could craft into some project): assuming that you can efficiently determine and trace the subset of kernel code that you use in a given build, what is the minimum you would need to be able to sort CVEs there? Would knowing the fixing commit (and therefore a fixing patch) be enough, as a first pass, to determine whether the built code subset contains the fixable code? I feel that you are likely solving an important problem, and that there may be a way to pull and pool energies to fix this together (probably elsewhere, not in purl proper).

(side note: I have somewhat efficiently used strace to trace kernel and full Android device builds to find out which code subset is baked into a build, with https://github.com/nexB/tracecode-toolkit )

@jbmaillet

jbmaillet commented Dec 10, 2021

@pombredanne , in my experience, on the kernel which is my hard case, there are:

  • 80% of CVEs are false positives per the build configuration: the buggy file(s) are simply not compiled.
  • Between 20 and 50% of CVEs are false positives because either you already have the fix in your history, or you (or the upstream) backported the patch, depending on how close/far you are from the tip of a given branch.

These 2 sets of course overlap.

CVE documentation is terrible most of the time, but you can still cross-leverage it to improve the situation.

First, the build configuration: it is very easy and build-agnostic to generate a compilation database using, for example, a tool such as Bear (don't get blocked on the Clang aspect: it works just as well with regular or cross gcc; same for the CMake mentions: it works fine with good old GNU make or with totally alien build systems such as Android with its ninja/soong). The only limitation is C/C++. Note that there are tools similar to Bear for other languages/ecosystems.

It is also very easy to get the list of files implicated in a CVE, if they are mentioned in the CVE description, as is often the case for the kernel, just by using good old regexps.

Then you cross both sources of information and voila: you know whether a file was compiled or not, and hence whether the CVE is relevant or not. 80% of kernel CVEs automatically sorted out as false positives.
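The cross-check described above can be sketched as follows (file names and CVE text are fabricated; a real compilation database would come from Bear's compile_commands.json):

```python
import json
import re

# 1) Files actually compiled, per a Bear-style compilation database.
compile_db = json.loads("""[
  {"directory": "/src", "file": "drivers/net/wireless/foo.c", "arguments": []},
  {"directory": "/src", "file": "fs/ext4/inode.c", "arguments": []}
]""")
compiled = {entry["file"] for entry in compile_db}

# 2) Files mentioned in the CVE description, via a plain path regexp.
cve_description = "A flaw in drivers/gpu/drm/bar.c allows a local user to ..."
mentioned = set(re.findall(r"\b[\w/.-]+\.[ch]\b", cve_description))

# 3) If none of the mentioned files were compiled, the CVE is a likely false positive.
print("relevant" if compiled & mentioned else "likely false positive")
```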

Then, for the fixes and backports: the kernel does not mandate mentioning a CVE id in a commit, so this is unusable[*]. But there is official kernel documentation for a backport formalism in the git commit message. It is easy to spot CVE references that include a full git sha1, again using regexps. So you take the git sha1 of the fix(es), you explore your git history searching for either the sha1 "as is" or the sha1 as a backport, and voila: 20-50% of the false positive CVEs automatically sorted out.
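The backport pass hinges on that commit-message convention; a sketch of the sha1 extraction (the commit message below is fabricated):

```python
import re

# The kernel stable-backport convention records the upstream commit in the
# message body as "commit <sha1> upstream."
commit_message = """ext4: fix a race in inode handling

commit 1234567890abcdef1234567890abcdef12345678 upstream.

Backport of the upstream fix to a stable branch.
"""
upstream_sha1s = re.findall(r"commit ([0-9a-f]{40}) upstream", commit_message)
print(upstream_sha1s)  # sha1(s) to match against a CVE's known fix commit
```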

There are a few corner cases in both passes, but you get the idea. Also note that this can be used on other pieces of software that are either similarly highly configurable or use the same backport formalism (for example GNOME GLib).

The information still missing is: when, i.e. in which version(s), was the bug first introduced? Some people do this, without giving full details.

PS:

(side note: I have somewhat efficiently used strace to trace kernel and full Android devices builds to find out which code subset is baked into a built with https://github.com/nexB/tracecode-toolkit )

Some colleagues and I used such an strace strategy circa 2011 for OSS and commercial licensing compliance, which was our concern at the time, with good results. But there are other and better technologies now; we would do it differently today (see Bear above, for example).

*: It makes sense not to mandate a CVE id in a commit. For example, you might not have a CVE id yet if you are not a CNA (which the kernel should be!) and would still want to ship the fix. My metrics show that since 2002, CVE ids have never been mentioned in more than 25% of the cases, topping out with CVE-2013-NNNN. In August, it was around 15% for CVE-2021-NNNN.

@tschmidtb51

@pombredanne What is the current status of https://github.com/package-url/purl-spec/blob/version-range-spec/VERSION-RANGE-SPEC.rst? Ready for use? 🤔

@pombredanne
Member

@tschmidtb51 I am pretty satisfied with it at this stage. Unless there are objections I will likely merge it this week.

@tschmidtb51

@tschmidtb51 I am pretty satisfied with it at this stage. Unless there are objections I will likely merge it this week.

@pombredanne In general, I like the approach. I flagged some details where I think the spec should be improved for clarity and for the benefit of simplicity (e.g. prohibiting consecutive pipes and empty <version-constraint>s).

@jkowalleck
Contributor

@tschmidtb51 I am pretty satisfied with it at this stage. Unless there are objections I will likely merge it this week.

@pombredanne status?

@fproulx-boostsecurity

Ping ^
