
GSoC 2024 Project Idea: Product Mapping using PURLs #3771

Open
terriko opened this issue Feb 1, 2024 · 35 comments
Labels
gsoc Tasks related to our participation in Google Summer of Code

Comments

@terriko
Contributor

terriko commented Feb 1, 2024

cve-bin-tool: Product Mapping using PURLs

Project description

CVE Binary Tool needs to identify components in order to scan for vulnerabilities, but uniquely identifying software is not always an easy thing to do. Some examples:

  • Some projects have very common dictionary words as names, so multiple projects with the same name exist (e.g. false positive: name collision for python arrow vs rust arrow #3193 and Name collision with "docutils" #3152)
  • Some projects are wrappers around a popular library, but the wrapper may have its own different set of version numbers (e.g. name collision with zstandard #3179 )
  • A single product may have a lot of names/identifiers depending on who packaged it and some other context
    • For example: python package BeautifulSoup can be known as ...
      • 'beautifulsoup4' when installed from pip
      • 'python3-bs4' on debian and ubuntu systems
      • 'python3-beautifulsoup4' on fedora based systems (and thus redhat based ones)
      • 'bs4' or 'python-bs4' or python-beautifulsoup in some automated tools intended to list software (e.g. sbom tools, yocto)
  • Don't assume any identifier will be unique, or that the same identifier will be used in all databases.
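The BeautifulSoup aliases above can be pictured as a many-to-one lookup table. This is only an illustrative sketch: the alias names come from the list above, but the choice of `pkg:pypi/beautifulsoup4` as the canonical PURL and the `canonical_purl` helper are assumptions, not anything cve-bin-tool currently ships.

```python
# Hypothetical alias table: many ecosystem-specific names, one canonical PURL.
# Aliases are from the BeautifulSoup example above; the canonical PURL chosen
# here is an assumption for illustration.
BS4_ALIASES = {
    "beautifulsoup4": "pkg:pypi/beautifulsoup4",          # pip
    "python3-bs4": "pkg:pypi/beautifulsoup4",             # debian/ubuntu
    "python3-beautifulsoup4": "pkg:pypi/beautifulsoup4",  # fedora/redhat
    "bs4": "pkg:pypi/beautifulsoup4",                     # automated tools
    "python-bs4": "pkg:pypi/beautifulsoup4",
    "python-beautifulsoup": "pkg:pypi/beautifulsoup4",
}


def canonical_purl(name):
    """Return the canonical PURL for a known alias, else None."""
    return BS4_ALIASES.get(name.lower())
```

Any real mapping would of course need to key on the packaging ecosystem as well, since (per the collision bugs above) the same bare name can mean different products in different ecosystems.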

CVE Binary Tool currently has explicit, pre-defined mappings between our binary signatures and a list of CPE identifiers for that "product." This works pretty well (although it does need to be updated somewhat regularly as different groups handle filing of known vulnerabilities).

Where we struggle is matching arbitrary product names found in component lists such as python's requirements.txt files. We have enough information to do better, but we need tooling that turns that information into a mapping.

This project is intended to improve our product mapping and reduce false positives (like in the bugs linked above). We've noodled around on ideas and my current plan is:

  1. Generate internal PURL identifiers within our "language" parsers found here: https://github.com/intel/cve-bin-tool/tree/main/cve_bin_tool/parsers PURL would let us say the equivalent of "this is python arrow." This may happen before GSoC starts (e.g. someone may make a pull request to do this in February to get us started)
  2. Integrate purl2cpe to provide direct mapping between our PURLs and known CPEs
  3. Some things won't have CPE entries and thus won't be in purl2CPE. But we may know (from bug reports) that there's a product with the same name that is absolutely not the same thing. So we'll need to provide an "is not" database to reduce false positives.
    • I suggest using a similar setup to what purl2cpe does -- allow humans to submit pull requests, make all the data readable, provide a way to load it into a queryable database. We could/should spend some time arguing out those details before you begin, but give us your best guess of how this will work as part of your project proposal.
  4. Once we have these integrated for language parsers, see if we can also integrate them into our SBOM matching routines (currently our SBOMs can and do read PURL but don't use the purl2cpe database)
  5. Make our "is not" de-duplication database available to the general public similar to purl2cpe -- i.e. make a library so that anyone else can use the data easily. I'd expect the library initially to be released as part of cve-bin-tool itself, but we could consider packaging it separately if that turns out to be useful to folk (it would certainly have a lot fewer dependencies).

If there's more time after that, I think we may want to consider pulling out other ideas and other sources of data to use. Do some brainstorming and include those as stretch goals.

Related reading

Skills

  • python
  • understanding of software identifiers such as CPE, PURL, SWID would be helpful (you can learn this as you write your application)

Difficulty level

  • medium/hard.

Project Length

  • 350 hours (e.g. full-time for 10 weeks or part-time for longer)
  • It would be possible to do part of this project in a 175 hour project, but we may prefer candidates who have the time to do more assuming similar levels of ability

Mentor

  • The primary mentor for this project will likely be @terriko . Please DO NOT EMAIL TERRI DIRECTLY and ask all questions on this issue instead so you can benefit from the expertise of other contributors and mentors.

GSoC Participants Only

This issue is a potential project idea for GSoC 2024, and is reserved for completion by a selected GSoC contributor. Please do not work on it outside of that program. If you'd like to apply to do it through GSoC, please start by reading #3550.

@terriko terriko added the gsoc Tasks related to our participation in Google Summer of Code label Feb 1, 2024
@terriko terriko changed the title GSoC 2024 Project Idea: Improved Product Mapping using PURLs GSoC 2024 Project Idea: Product Mapping using PURLs Feb 5, 2024
@inosmeet
Contributor

inosmeet commented Feb 15, 2024

Generate internal PURL identifiers within our "language" parsers

For this, do we want hard-coded strings following scheme:{type}/{name}@{version}, which could look something like pkg:npm/foobar@12.3.1, or do you have something specific in mind?

@terriko
Contributor Author

terriko commented Feb 15, 2024

Hard-coded strings are what I had in mind for step 1, since we should know what type of data we parsed to make the listing. We might need something fancier for npm/javascript because the same code can parse more than one type of listing, but I think it may be the only one like that.

It should be pretty simple, which is why I said that step 1 could probably be done right now. I was actually intending to do it myself before I got pulled into some more urgent stuff this month.
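A minimal sketch of the hard-coded string approach discussed above might look like this. Note this is a simplification I'm assuming for illustration, not cve-bin-tool's actual code: the real PURL spec also requires percent-encoding and per-type normalization rules that are omitted here.

```python
def generate_purl(pkg_type, name, version=None, namespace=None):
    """Build a pkg:type/[namespace/]name[@version] string.

    A deliberately minimal sketch of the hard-coded string idea above.
    The full PURL spec's percent-encoding and per-type normalization
    rules are intentionally omitted.
    """
    if namespace:
        purl = f"pkg:{pkg_type}/{namespace}/{name}"
    else:
        purl = f"pkg:{pkg_type}/{name}"
    if version:
        purl += f"@{version}"
    return purl
```

With this, `generate_purl("npm", "foobar", "12.3.1")` yields the `pkg:npm/foobar@12.3.1` example from the question above.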

@inosmeet
Contributor

So, can I try this one?

@terriko
Contributor Author

terriko commented Feb 20, 2024

@Dev-Voldemort yes, feel free to try step 1. The rest is reserved as a gsoc project so if you want to do it please apply through gsoc when applications open (assuming we get selected, yada yada)

@lavi20

lavi20 commented Feb 24, 2024

@terriko I want to work on this issue, please assign it to me

@joydeep049
Contributor

Step-1 seems interesting.
Think I'm gonna give it a try

@terriko
Contributor Author

terriko commented Feb 26, 2024

@crazytrain328 If you do, start with something other than Rust. You may also want to read the code review on @Dev-Voldemort 's PR #3859 so you can learn from that feedback!

@joydeep049
Contributor

joydeep049 commented Feb 27, 2024

@terriko
Sure! I think I'll write PURL generation for python and java.
Also, Could you review Debian Parser PR #3543 when you have time?
That one has been open for about 2 months now :(

@joydeep049
Contributor

@crazytrain328 If you do, start with something other than Rust. You may also want to read the code review on @Dev-Voldemort 's PR #3859 so you can learn from that feedback!

My code for the python and java PURL generators is ready. Waiting for #3859 to get merged. Next I'll be working on adding PURL generation to swift, ruby and perl.

@inosmeet
Contributor

Hey @crazytrain328! I have already coded it for r, ruby and swift. Feel free to choose any other :)

@joydeep049
Contributor

@Dev-Voldemort , Not a Problem!
I'll keep perl, php and javascript on my agenda for next week then.

@terriko
Contributor Author

terriko commented Feb 27, 2024

I had a lot to say in the code review there which probably applies to anyone else working on PURL generation.

The short summary: we need to generate a valid PURL no matter what, even if it's kind of a crappy one (e.g. vendor = UNKNOWN). You can use re.sub if you need to just get rid of characters not on an approved list.

e.g.

>>> re.sub(r"[^a-zA-Z0-9\-_]", "", "my golang product***")
'mygolangproduct'

(I'm open to other ideas if there's something more performant than the regex. But without any + or * wildcards in there I think it's probably fast enough and unlikely to get weird expansion effects.)
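The re.sub idea above could be wrapped in a small helper along these lines. This is a sketch, not the project's actual code: the fallback to "UNKNOWN" when nothing survives is my own assumption, extrapolated from the vendor = UNKNOWN remark above.

```python
import re

# Anything outside the approved character list gets stripped.
_DISALLOWED = re.compile(r"[^a-zA-Z0-9\-_]")


def sanitize_name(raw):
    """Strip characters outside the approved list so the PURL stays valid.

    Falls back to "UNKNOWN" if nothing survives -- that fallback is an
    assumption for illustration, mirroring the vendor = UNKNOWN idea,
    not cve-bin-tool's actual behaviour.
    """
    cleaned = _DISALLOWED.sub("", raw)
    return cleaned if cleaned else "UNKNOWN"
```

Precompiling the pattern avoids re-parsing the regex on every call, which helps a little if this runs once per component in a large scan.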

@tahifahimi
Contributor

Hey folks,
We all can help with the first part.
@crazytrain328, I can help with php and Javascript if you don't mind.

@joydeep049
Contributor

Hello @tahifahimi ,
I already have the code ready for both of those, and for perl as well.
Thank you for volunteering!
There are a few issues which you could help with. A fuzzer needs to be created for the RPM file parser.
And you could see if @Dev-Voldemort needs some help with the fuzzing reports.

If you have some further time, you can research ways of opening tar files without using the tarfile library (that work on both Windows and Linux), which @terriko would want integrated before the next release!
Cheers!!

@inosmeet
Contributor

Hey @tahifahimi!
I think we should pause the development of purl integration until the first one is merged (the other ones will be based on it).
I have to implement some requested changes, but I'm AFK right now, so that may take a couple of days to be merged.

Meanwhile, it would be great if you could help in #3800 (it would help you understand the fuzzer better), especially the java fuzzer. AFAIK only 1-2 are remaining, so we can complete that.

@joydeep049
Contributor

Agreed @Dev-Voldemort ,
I have all the scripts ready but they are not gonna be merged anytime soon (I think).
Meanwhile, there are other issues to handle

@joydeep049
Contributor

joydeep049 commented Mar 12, 2024

Hello,

  • I'm thinking of designing a separate database for integrating purl2cpe into our repo. Is this way of thinking correct? Or would it be more efficient to think of some other method?
  • Basically we would be mapping PURLs to CPEs using purl2cpe, then mapping CPEs to associated CVEs found in any source like NVD. Is this a good approach?
  • Relational database?
  • I'm definitely not proposing a 1-1 PURL-CPE mapping :)
  • Are we considering CWEs? It may increase the quality of the tool.
  • SQLAlchemy?
  • My ideas for the de-duplication database are:
    • Tables which contain info about a particular software component and their relation to valid or invalid CPEs.
    • Then, to access all the valid CPEs we could query it with something like "GET ME ALL VALID CPEs for PURL={purl}"

(I have a more detailed schema in mind but I guess that should be part of my proposal)
@terriko @anthonyharrison

@anthonyharrison
Contributor

@joydeep049 I think it would be worthwhile looking at what is currently available and see if these solutions could be integrated into cve-bin-tool rather than creating something new (it would be a very ambitious project!). For example, have a look at Vulnerable Code which already has good support for PURLs. SCANOSS also has a solution but when I looked at it it didn't have version information and the coverage was incomplete as the data was being manually curated.

cc @terriko

@terriko
Contributor Author

terriko commented Mar 13, 2024

Expected workflow here in dubious pythonic pseudocode

# 1. Get a valid purl identifier either from input or guessing one
purl = get_purl_from_input()
if purl is None:
    purl = generate_purl()

# 2. Look it up in the purl2cpe database.  If it's found then we're done.
cpe_list = purl2cpe_lookup(purl)
if cpe_list is not None:
    return cpe_list

# 3. If there's no match, try searching the data (current cve-bin-tool behaviour)
if purl.vendor != "UNKNOWN":
    cpe_list = nvd_lookup(purl.vendor, purl.product)
else: 
    cpe_list = nvd_lookup(purl.product)

# 4. Check these results against a de-dupe database
# (rebuild the list rather than removing items while iterating over it,
# which would skip elements)
bad_cpe_list = dedupe_lookup(purl)
cpe_list = [cpe for cpe in cpe_list if cpe not in bad_cpe_list]

# 5. return anything still potentially valid (will likely be an empty list in MOST cases)
return cpe_list

My current feeling is as follows:

  • we should integrate the purl2cpe database following their instructions and not do anything too fancy
  • we will need a second (ideally completely compatible) database for de-duplication (the "python arrow != rust arrow" database) and we will have to start and maintain that ourselves.
  • You could integrate those databases together but as you can see from the pseudocode above, there's no particular reason to do so since we're only querying the dedupe db if the first purl2cpe match failed.

Unless there's a particularly strong reason not to, I'd suggest that we have our second database follow the same pattern as purl2cpe. There's a few reasons for this:

  1. We would be able to use nearly identical tools for loading and using the data
  2. Users may find themselves in the position where they'll want to edit both those databases to improve the quality of matches, and we don't want them to have to learn two very different systems for data entry.
  3. Maybe (likely in the further future than gsoc) we'd want to combine the two databases into a single source of truth about purl->cpe mappings (likely by contributing our data back to purl2cpe rather than forking their project).

purl2cpe currently has an sqlite loader and works on data based in yaml files in a repo. No relational database, so no point in getting excited about sqlalchemy yet. (And honestly, sqlite is good enough for our purposes. We're really not doing complicated database queries.)
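Following the yaml-in-repo plus sqlite-loader pattern described above, the "is not" lookup could be as small as this sketch. The schema, table name, and `dedupe_lookup` helper are hypothetical, my own guess for illustration, not anything that exists in cve-bin-tool or purl2cpe.

```python
import sqlite3

# Hypothetical schema for the "is not" database, modelled on the
# yaml-in-repo / sqlite-loader pattern purl2cpe uses. Table and column
# names are assumptions for illustration.
SCHEMA = """
CREATE TABLE IF NOT EXISTS not_cpe (
    purl TEXT NOT NULL,
    cpe  TEXT NOT NULL,
    PRIMARY KEY (purl, cpe)
)
"""


def dedupe_lookup(conn, purl):
    """Return the set of CPEs known NOT to match this PURL."""
    rows = conn.execute("SELECT cpe FROM not_cpe WHERE purl = ?", (purl,))
    return {cpe for (cpe,) in rows}
```

The rows themselves would be loaded from human-submitted yaml files, exactly as purl2cpe does, so contributors never touch the sqlite file directly.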

@terriko
Contributor Author

terriko commented Mar 13, 2024

Just so we're clear:

The most urgent part of this project is the de-dupe part.

We've got multiple bugs related to de-duplication that need solving and don't have a good workaround at the moment. The only reason I didn't emphasize doing that first is that I think integration with an existing database will give you a better idea of how to organize the de-dupe data, so I think we'll wind up with a better architecture if we learn from others before setting up our de-dupe stuff.

I chose purl2cpe for this project because in my analysis it looked like the clearest path for a person who is new to the space -- it's just mappings, it's using sqlite which we already use, and yaml which we also already use. No need for complicated new dependencies, and it's already designed for integration into other products. But @anthonyharrison is absolutely right that we could equally use Vulnerable Code or SCANOSS in place of purl2cpe. You could absolutely use something other than purl2cpe as the first data source here. I'm willing to be convinced since I really only did a few hours of reading before choosing one to put in this project idea. But if you suggest a different one be prepared to explain and defend your choice in the project proposal!

Adding the ability to use multiple libraries for cpe lookup is a perfectly reasonable long-term goal for the cve-bin-tool project, but I wouldn't put it into this year's gsoc project except as a stretch goal. This isn't because I think a GSoC contributor couldn't do it, but because years of experience have taught me to leave a lot of extra time because docs, tests, life, and weird GitHub Actions network issues will eat up a lot of time and it's not always predictable in advance. I think focusing on one database instead of multiples will increase the chances of success on the project. Plus, people have a lot more fun if they feel like they're ahead and doing stretch goals in the last month than they do if they feel like they might fail at any moment, and people having fun write better code and are more likely to stick around doing open source stuff. 😄

terriko added a commit that referenced this issue Mar 13, 2024
* Related: #3771

Signed-off-by: Meet Soni <meetsoni3017@gmail.com>
Co-authored-by: Terri Oda <terri.oda@intel.com>
@terriko
Contributor Author

terriko commented Mar 13, 2024

We've got the first of our generated purl ids for the go parsers now:

Folk interested in this project may find our iteration/code review in that issue interesting, as it took us quite a few tries to figure out what we wanted and how.

I think we'll probably want others to follow that pattern; feel free to create one for any parser that hasn't been claimed here and doesn't already have a pull request.

@joydeep049
Contributor

We've got the first of our generated purl ids for the go parsers now:

Folk interested in this project may find our iteration/code review in that issue interesting, as it took us quite a few tries to figure out what we wanted and how.

I think we'll probably want others to follow that pattern; feel free to create one for any parser that hasn't been claimed here and doesn't already have a pull request.

Finally, I can continue my work on the parsers I mentioned somewhere above in this thread.

inosmeet added a commit to inosmeet/cve-bin-tool that referenced this issue Mar 14, 2024
* Related: intel#3771

Signed-off-by: Meet Soni <meetsoni3017@gmail.com>
@anthonyharrison
Contributor

As we move to support PURLs more extensively, I think it might be worth creating a class for handling all of the PURL related processing to make it easier to manage.

@terriko
Contributor Author

terriko commented Mar 14, 2024

As we move to support PURLs more extensively, I think it might be worth creating a class for handling all of the PURL related processing to make it easier to manage.

Definitely. Relative to the pseudocode above: sections 2-5 should be grouped into a function that accepts a purl and spits out a list of CPEs (or None), and that function should be in an easily imported class we can use in at least the language parsers and the sbom parser. (I wouldn't expect to use purl in the binary checkers, though. We could but it would be hours and hours of refactoring to little benefit, so I'd say it's a waste of time at the moment.)
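One way to group sections 2-5 into an easily imported class, per the suggestion above, is sketched below. The lookup callables are injected stubs because the real database interfaces don't exist yet; the class name, method names, and the simplified NVD-search signature are all assumptions for illustration.

```python
class PurlCpeResolver:
    """Sketch grouping steps 2-5 of the pseudocode above: purl2cpe
    lookup, NVD-search fallback, then de-dupe filtering.

    The three lookups are injected as callables since the real
    database interfaces are still hypothetical at this point.
    """

    def __init__(self, purl2cpe, nvd_search, dedupe):
        self.purl2cpe = purl2cpe      # purl -> list of CPEs, or None
        self.nvd_search = nvd_search  # (vendor, product) -> list of CPEs
        self.dedupe = dedupe          # purl -> set of known-bad CPEs

    def resolve(self, purl, vendor="UNKNOWN", product=""):
        # Step 2: direct purl2cpe mapping -- if found, we're done.
        cpes = self.purl2cpe(purl)
        if cpes is not None:
            return cpes
        # Step 3: fall back to searching the data.
        cpes = self.nvd_search(vendor, product)
        # Steps 4-5: filter out known-bad matches, return the rest.
        bad = self.dedupe(purl)
        return [cpe for cpe in cpes if cpe not in bad]
```

Because the dependencies are injected, both the language parsers and the sbom parser could share one instance, and tests can pass in plain lambdas instead of real databases.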

@joydeep049
Contributor

Hello @terriko , I am interested in this project and will share an initial draft proposal on gitter soon!

@joydeep049
Contributor

@terriko @anthonyharrison
I had a question. You have mentioned that step-2 would be integration of purl2cpe database into the language parsers.
So since we are also going to have to integrate the information that the de-duplication database will hold into the language parsers, should that be done in step-2 or step-4 ?

@terriko
Contributor Author

terriko commented Mar 18, 2024

@joydeep049 I would plan to integrate the purl2cpe database first. I think you'll learn enough from integrating it that it'll result in a better design and integration setup for the de-dupe db in the later steps.

@terriko
Contributor Author

terriko commented Mar 18, 2024

And no, you're not really putting the de-dupe info in the language parsers I don't think? As @anthonyharrison indicated, you'll be putting the de-dupe info into a function that gets called from the general cve lookup code eventually, because we'll want to use it in the SBOM lookup as well as the language parsers.

@joydeep049
Contributor

@joydeep049 I would plan to integrate the purl2cpe database first. I think you'll learn enough from integrating it that it'll result in a better design and integration setup for the de-dupe db in the later steps.

Thanx!
Does developing a library for de-dupe database mean developing something close to the utilities part of purl2cpe?

@joydeep049
Contributor

And no, you're not really putting the de-dupe info in the language parsers I don't think? As @anthonyharrison indicated, you'll be putting the de-dupe info into a function that gets called from the general cve lookup code eventually, because we'll want to use it in the SBOM lookup as well as the language parsers.

This would mean that I would just be adding extra steps to the general lookup that we have for all data sources?
And also to the SBOM parsers?

@terriko
Contributor Author

terriko commented Mar 18, 2024

Thanx! Does developing a library for de-dupe database mean developing something close to the utilities part of purl2cpe?

Possibly? We also potentially have the option to just publish the created purl-de-dupe.db file on our mirrors with instructions and sql info. It's ok if we don't know the best path here yet; I suspect you'll have many opinions about what purl2cpe did right and wrong by the time you get to that last piece and you should let that experience inform the final design rather than trying to get it all perfect at the proposal stage, so make a guess but leave yourself open to change there.

@terriko
Contributor Author

terriko commented Mar 18, 2024

This would mean that I would just be adding extra steps to the general lookup that we have for all data sources? And also to the SBOM parsers?

If I had to guess, you'd be effectively replacing get_vendor_product_pairs (which is approximately section 3 and 5 of the pseudocode I gave above) but there may be some refactoring involved so that the language parsers, the sbom parsers, and the binary parsers can use the same function and can pass in a purl in lieu of a product name. Feel free to poke around and check, but you're probably getting too detailed for a gsoc proposal at this point, though. It's ok to just add a step of "integrate de-dupe database with sboms" as a step 6 and leave it without more than a few sentences of how you expect that to work.

@joydeep049
Contributor

This would mean that I would just be adding extra steps to the general lookup that we have for all data sources? And also to the SBOM parsers?

If I had to guess, you'd be effectively replacing get_vendor_product_pairs (which is approximately section 3 and 5 of the pseudocode I gave above) but there may be some refactoring involved so that the language parsers, the sbom parsers, and the binary parsers can use the same function and can pass in a purl in lieu of a product name. Feel free to poke around and check, but you're probably getting too detailed for a gsoc proposal at this point, though. It's ok to just add a step of "integrate de-dupe database with sboms" as a step 6 and leave it without more than a few sentences of how you expect that to work.

Yes, probably! I just wanted to be as thorough as possible in my proposal. But I guess I should leave some of the decisions on how exactly things will work for when I get to work on the project.

@joydeep049
Contributor

Just a heads-up to anyone who wants to work on PURL generation for language parsers. Most of the parsers have been claimed but dart remains. So any contributor who wants to work on this can take it up!

If you have any doubt, please feel free to reach out to me or to @terriko or @anthonyharrison.

@terriko
Contributor Author

terriko commented Jun 26, 2024

Marking this as attached to the 3.3.1 milestone to make my tracking for that easier.
