
GSoC 2024 Project Idea: Product Mapping using PURLs #3771

Open
terriko opened this issue Feb 1, 2024 · 35 comments
Labels
gsoc Tasks related to our participation in Google Summer of Code

Comments

@terriko
Contributor

terriko commented Feb 1, 2024

cve-bin-tool: Product Mapping using PURLs

Project description

CVE Binary Tool needs to identify components in order to scan for vulnerabilities, but uniquely identifying software is not always an easy thing to do. Some examples:

  • Some projects have very common dictionary words as names, so multiple projects with the same name exist (e.g. false positive: name collision for python arrow vs rust arrow #3193 and Name collision with "docutils" #3152)
  • Some projects are wrappers around a popular library, but the wrapper may have its own different set of version numbers (e.g. name collision with zstandard #3179 )
  • A single product may have a lot of names/identifiers depending on who packaged it and some other context
    • For example: python package BeautifulSoup can be known as ...
      • 'beautifulsoup4' when installed from pip
      • 'python3-bs4' on debian and ubuntu systems
      • 'python3-beautifulsoup4' on fedora based systems (and thus redhat based ones)
      • 'bs4' or 'python-bs4' or python-beautifulsoup in some automated tools intended to list software (e.g. sbom tools, yocto)
  • Don't assume any identifier will be unique, or that the same identifier will be used in all databases.
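The BeautifulSoup aliases above can be pictured as a many-to-one lookup table. This is only an illustrative sketch: the alias names come from the list above, but the choice of `pkg:pypi/beautifulsoup4` as the canonical PURL and the `canonical_purl` helper are assumptions, not anything cve-bin-tool currently ships.

```python
# Hypothetical alias table: many ecosystem-specific names, one canonical PURL.
# Aliases are from the BeautifulSoup example above; the canonical PURL chosen
# here is an assumption for illustration.
BS4_ALIASES = {
    "beautifulsoup4": "pkg:pypi/beautifulsoup4",          # pip
    "python3-bs4": "pkg:pypi/beautifulsoup4",             # debian/ubuntu
    "python3-beautifulsoup4": "pkg:pypi/beautifulsoup4",  # fedora/redhat
    "bs4": "pkg:pypi/beautifulsoup4",                     # automated tools
    "python-bs4": "pkg:pypi/beautifulsoup4",
    "python-beautifulsoup": "pkg:pypi/beautifulsoup4",
}


def canonical_purl(name):
    """Return the canonical PURL for a known alias, else None."""
    return BS4_ALIASES.get(name.lower())
```

Any real mapping would of course need to key on the packaging ecosystem as well, since (per the collision bugs above) the same bare name can mean different products in different ecosystems.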

CVE Binary Tool currently has explicit, pre-defined mappings between our binary signatures and a list of CPE identifiers for that "product." This works pretty well (although it does need to be updated somewhat regularly as different groups handle filing of known vulnerabilities).

Where we struggle is matching arbitrary product names found in component lists such as python's requirements.txt files. We have enough information to do better, but we need tooling that turns that information into a mapping.

This project is intended to improve our product mapping and reduce false positives (like in the bugs linked above). We've noodled around on ideas and my current plan is:

  1. Generate internal PURL identifiers within our "language" parsers found here: https://github.com/intel/cve-bin-tool/tree/main/cve_bin_tool/parsers PURL would let us say the equivalent of "this is python arrow." This may happen before GSoC starts (e.g. someone may make a pull request to do this in February to get us started)
  2. Integrate purl2cpe to provide direct mapping between our PURLs and known CPEs
  3. Some things won't have CPE entries and thus won't be in purl2CPE. But we may know (from bug reports) that there's a product with the same name that is absolutely not the same thing. So we'll need to provide an "is not" database to reduce false positives.
    • I suggest using a similar setup to what purl2cpe does -- allow humans to submit pull requests, make all the data readable, provide a way to load it into a queryable database. We could/should spend some time arguing out those details before you begin, but give us your best guess of how this will work as part of your project proposal.
  4. Once we have these integrated for language parsers, see if we can also integrate them into our SBOM matching routines (currently our SBOMs can and do read PURL but don't use the purl2cpe database)
  5. Make our "is not" de-duplication database available to the general public similar to purl2cpe -- i.e. make a library so that anyone else can use the data easily. I'd expect the library initially to be released as part of cve-bin-tool itself, but we could consider packaging it separately if that turns out to be useful to folk (it would certainly have a lot fewer dependencies).

If there's more time after that, I think we may want to consider pulling out other ideas and other sources of data to use. Do some brainstorming and include those as stretch goals.

Related reading

Skills

  • python
  • understanding of software identifiers such as CPE, PURL, SWID would be helpful (you can learn this as you write your application)

Difficulty level

  • medium/hard.

Project Length

  • 350 hours (e.g. full-time for 10 weeks or part-time for longer)
  • It would be possible to do part of this project in a 175 hour project, but we may prefer candidates who have the time to do more assuming similar levels of ability

Mentor

  • The primary mentor for this project will likely be @terriko . Please DO NOT EMAIL TERRI DIRECTLY and ask all questions on this issue instead so you can benefit from the expertise of other contributors and mentors.

GSoC Participants Only

This issue is a potential project idea for GSoC 2024, and is reserved for completion by a selected GSoC contributor. Please do not work on it outside of that program. If you'd like to apply to do it through GSoC, please start by reading #3550.

@terriko terriko added the gsoc Tasks related to our participation in Google Summer of Code label Feb 1, 2024
@terriko terriko changed the title GSoC 2024 Project Idea: Improved Product Mapping using PURLs GSoC 2024 Project Idea: Product Mapping using PURLs Feb 5, 2024
@inosmeet
Contributor

inosmeet commented Feb 15, 2024

Generate internal PURL identifiers within our "language" parsers

For this, do we want hard-coded strings following scheme:{type}/{name}@{version}, which could look something like pkg:npm/foobar@12.3.1, or do you have something specific in mind?

@terriko
Contributor Author

terriko commented Feb 15, 2024

Hard-coded strings are what I had in mind for step 1, since we should know what type of data we parsed to make the listing. We might need something fancier for npm/javascript because the same code can parse more than one type of listing, but I think it may be the only one like that.

It should be pretty simple, which is why I said that step 1 could probably be done right now. I was actually intending to do it myself before I got pulled into some more urgent stuff this month.
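A minimal sketch of the hard-coded string approach discussed above might look like this. Note this is a simplification I'm assuming for illustration, not cve-bin-tool's actual code: the real PURL spec also requires percent-encoding and per-type normalization rules that are omitted here.

```python
def generate_purl(pkg_type, name, version=None, namespace=None):
    """Build a pkg:type/[namespace/]name[@version] string.

    A deliberately minimal sketch of the hard-coded string idea above.
    The full PURL spec's percent-encoding and per-type normalization
    rules are intentionally omitted.
    """
    if namespace:
        purl = f"pkg:{pkg_type}/{namespace}/{name}"
    else:
        purl = f"pkg:{pkg_type}/{name}"
    if version:
        purl += f"@{version}"
    return purl
```

With this, `generate_purl("npm", "foobar", "12.3.1")` yields the `pkg:npm/foobar@12.3.1` example from the question above.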

@inosmeet
Contributor

So, can I try this one?

@terriko
Contributor Author

terriko commented Feb 20, 2024

@Dev-Voldemort yes, feel free to try step 1. The rest is reserved as a gsoc project so if you want to do it please apply through gsoc when applications open (assuming we get selected, yada yada)

@lavi20

lavi20 commented Feb 24, 2024

@terriko I want to work on this issue, please assign it to me

@joydeep049
Contributor

Step-1 seems interesting.
Think I'm gonna give it a try

@terriko
Contributor Author

terriko commented Feb 26, 2024

@crazytrain328 If you do, start with something other than Rust. You may also want to read the code review on @Dev-Voldemort 's PR #3859 so you can learn from that feedback!

@joydeep049
Contributor

joydeep049 commented Feb 27, 2024

@terriko
Sure! I think I'll write PURL generation for python and java.
Also, Could you review Debian Parser PR #3543 when you have time?
That one has been open for about 2 months now :(

@joydeep049
Contributor

@crazytrain328 If you do, start with something other than Rust. You may also want to read the code review on @Dev-Voldemort 's PR #3859 so you can learn from that feedback!

My code for the python and java PURL generators is ready. Waiting for #3859 to get merged. Next I'll be working on adding PURL generation to swift, ruby and perl.

@inosmeet
Contributor

Hey @crazytrain328! I have already coded it for r, ruby and swift. Feel free to choose any other :)

@joydeep049
Contributor

@Dev-Voldemort , Not a Problem!
I'll keep perl, php and javascript on my agenda for next week then.

@terriko
Contributor Author

terriko commented Feb 27, 2024

I had a lot to say in the code review there which probably applies to anyone else working on PURL generation.

The short summary: we need to generate a valid PURL no matter what, even if it's kind of a crappy one (e.g. vendor = UNKNOWN). You can use re.sub if you need to just get rid of characters not on an approved list.

e.g.

>>> re.sub(r"[^a-zA-Z0-9\-_]", "", "my golang product***")
'mygolangproduct'

(I'm open to other ideas if there's something more performant than the regex. But without any + or * wildcards in there I think it's probably fast enough and unlikely to get weird expansion effects.)
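The re.sub idea above could be wrapped in a small helper along these lines. This is a sketch, not the project's actual code: the fallback to "UNKNOWN" when nothing survives is my own assumption, extrapolated from the vendor = UNKNOWN remark above.

```python
import re

# Anything outside the approved character list gets stripped.
_DISALLOWED = re.compile(r"[^a-zA-Z0-9\-_]")


def sanitize_name(raw):
    """Strip characters outside the approved list so the PURL stays valid.

    Falls back to "UNKNOWN" if nothing survives -- that fallback is an
    assumption for illustration, mirroring the vendor = UNKNOWN idea,
    not cve-bin-tool's actual behaviour.
    """
    cleaned = _DISALLOWED.sub("", raw)
    return cleaned if cleaned else "UNKNOWN"
```

Precompiling the pattern avoids re-parsing the regex on every call, which helps a little if this runs once per component in a large scan.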

@tahifahimi
Contributor

Hey folks,
We all can help with the first part.
@crazytrain328, I can help with php and Javascript if you don't mind.

@joydeep049
Contributor

Hello @tahifahimi ,
I already have the code ready for both of those, and for perl as well.
Thank you for volunteering!
There are a few issues which you could help with. A fuzzer needs to be created for the RPM file parser.
And you could see if @Dev-Voldemort needs some help with the fuzzing reports.

If you have some further time, you can research ways of opening tar files without using the tarfile library (that work on both Windows and Linux), which @terriko would want integrated before the next release!
Cheers!!

@inosmeet
Contributor

Hey @tahifahimi!
I think we should pause the development of purl integration until the first one is merged (the other ones will be based on it).
I have to implement some requested changes, but I'm AFK right now, so that may take a couple of days to be merged.

Meanwhile, it would be great if you could help in #3800 (it would help you understand the fuzzer better), especially the java fuzzer. AFAIK only 1-2 are remaining, so we can complete that.

@joydeep049
Contributor

Agreed @Dev-Voldemort ,
I have all the scripts ready but they are not gonna be merged anytime soon (I think).
Meanwhile, there are other issues to handle

@joydeep049
Contributor

joydeep049 commented Mar 12, 2024

Hello,

  • I'm thinking of designing a separate database for integrating purl2cpe into our repo. Is this way of thinking correct? Or would it be more efficient to think of some other method?
  • Basically we would be mapping PURLs to CPEs using purl2cpe, then mapping CPEs to associated CVEs found in any source like NVD. Is this a good approach?
  • Relational database?
  • I'm definitely not proposing a 1-1 PURL-CPE mapping :)
  • Are we considering CWEs? It may increase the quality of the tool.
  • SQLAlchemy?
  • My ideas for the de-duplication database are:
    • Tables which contain info about a particular software component and their relation to valid or invalid CPEs.
    • Then, to access all the valid CPEs we could query it with something like "GET ME ALL VALID CPEs for PURL={purl}"

(I have a more detailed schema in mind but I guess that should be part of my proposal)
@terriko @anthonyharrison

@anthonyharrison
Contributor

@joydeep049 I think it would be worthwhile looking at what is currently available and see if these solutions could be integrated into cve-bin-tool rather than creating something new (it would be a very ambitious project!). For example, have a look at Vulnerable Code which already has good support for PURLs. SCANOSS also has a solution but when I looked at it it didn't have version information and the coverage was incomplete as the data was being manually curated.

cc @terriko

@terriko
Contributor Author

terriko commented Mar 13, 2024

Expected workflow here in dubious pythonic pseudocode

# 1. Get a valid purl identifier either from input or guessing one
purl = get_purl_from_input()
if purl is None:
    purl = generate_purl()

# 2. Look it up in the purl2cpe database.  If it's found then we're done.
cpe_list = purl2cpe_lookup(purl)
if cpe_list is not None:
    return cpe_list

# 3. If there's no match, try searching the data (current cve-bin-tool behaviour)
if purl.vendor != "UNKNOWN":
    cpe_list = nvd_lookup(purl.vendor, purl.product)
else: 
    cpe_list = nvd_lookup(purl.product)

# 4. Check these results against a de-dupe database
# (rebuild the list rather than removing items while iterating over it,
# which would skip elements)
bad_cpe_list = dedupe_lookup(purl)
cpe_list = [cpe for cpe in cpe_list if cpe not in bad_cpe_list]

# 5. return anything still potentially valid (will likely be an empty list in MOST cases)
return cpe_list

My current feeling is as follows:

  • we should integrate the purl2cpe database following their instructions and not do anything too fancy
  • we will need a second (ideally completely compatible) database for de-duplication (the "python arrow != rust arrow" database) and we will have to start and maintain that ourselves.
  • You could integrate those databases together but as you can see from the pseudocode above, there's no particular reason to do so since we're only querying the dedupe db if the first purl2cpe match failed.

Unless there's a particularly strong reason not to, I'd suggest that we have our second database follow the same pattern as purl2cpe. There's a few reasons for this:

  1. We would be able to use nearly identical tools for loading and using the data
  2. Users may find themselves in the position where they'll want to edit both those databases to improve the quality of matches, and we don't want them to have to learn two very different systems for data entry.
  3. Maybe (likely in the further future than gsoc) we'd want to combine the two databases into a single source of truth about purl->cpe mappings (likely by contributing our data back to purl2cpe rather than forking their project).

purl2cpe currently has an sqlite loader and works on data based in yaml files in a repo. No relational database, so no point in getting excited about sqlalchemy yet. (And honestly, sqlite is good enough for our purposes. We're really not doing complicated database queries.)
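Following the yaml-in-repo plus sqlite-loader pattern described above, the "is not" lookup could be as small as this sketch. The schema, table name, and `dedupe_lookup` helper are hypothetical, my own guess for illustration, not anything that exists in cve-bin-tool or purl2cpe.

```python
import sqlite3

# Hypothetical schema for the "is not" database, modelled on the
# yaml-in-repo / sqlite-loader pattern purl2cpe uses. Table and column
# names are assumptions for illustration.
SCHEMA = """
CREATE TABLE IF NOT EXISTS not_cpe (
    purl TEXT NOT NULL,
    cpe  TEXT NOT NULL,
    PRIMARY KEY (purl, cpe)
)
"""


def dedupe_lookup(conn, purl):
    """Return the set of CPEs known NOT to match this PURL."""
    rows = conn.execute("SELECT cpe FROM not_cpe WHERE purl = ?", (purl,))
    return {cpe for (cpe,) in rows}
```

The rows themselves would be loaded from human-submitted yaml files, exactly as purl2cpe does, so contributors never touch the sqlite file directly.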

@terriko
Contributor Author

terriko commented Mar 13, 2024

Just so we're clear:

The most urgent part of this project is the de-dupe part.

We've got multiple bugs related to de-duplication that need solving and don't have a good workaround at the moment. The only reason I didn't emphasize doing that first is that I think integration with an existing database will give you a better idea of how to organize the de-dupe data, so I think we'll wind up with a better architecture if we learn from others before setting up our de-dupe stuff.

I chose purl2cpe for this project because in my analysis it looked like the clearest path for a person who is new to the space -- it's just mappings, it's using sqlite which we already use, and yaml which we also already use. No need for complicated new dependencies, and it's already designed for integration into other products. But @anthonyharrison is absolutely right that we could equally use Vulnerable Code or SCANOSS in place of purl2cpe. You could absolutely use something other than purl2cpe as the first data source here. I'm willing to be convinced since I really only did a few hours of reading before choosing one to put in this project idea. But if you suggest a different one be prepared to explain and defend your choice in the project proposal!

Adding the ability to use multiple libraries for cpe lookup is a perfectly reasonable long-term goal for the cve-bin-tool project, but I wouldn't put it into this year's gsoc project except as a stretch goal. This isn't because I think a GSoC contributor couldn't do it, but because years of experience have taught me to leave a lot of extra time because docs, tests, life, and weird GitHub Actions network issues will eat up a lot of time and it's not always predictable in advance. I think focusing on one database instead of multiples will increase the chances of success on the project. Plus, people have a lot more fun if they feel like they're ahead and doing stretch goals in the last month than they do if they feel like they might fail at any moment, and people having fun write better code and are more likely to stick around doing open source stuff. 😄

terriko added a commit that referenced this issue Mar 13, 2024
* Related: #3771

Signed-off-by: Meet Soni <meetsoni3017@gmail.com>
Co-authored-by: Terri Oda <terri.oda@intel.com>
@terriko
Contributor Author

terriko commented Mar 13, 2024

We've got the first of our generated purl ids for the go parsers now:

Folk interested in this project may find our iteration/code review in that issue interesting, as it took us quite a few tries to figure out what we wanted and how.

I think we'll probably want others to follow that pattern; feel free to create one for any parser that hasn't been claimed here and doesn't already have a pull request.

@joydeep049
Contributor

We've got the first of our generated purl ids for the go parsers now:

Folk interested in this project may find our iteration/code review in that issue interesting, as it took us quite a few tries to figure out what we wanted and how.

I think we'll probably want others to follow that pattern; feel free to create one for any parser that hasn't been claimed here and doesn't already have a pull request.

Finally, I can continue my work on the parsers I mentioned somewhere above in this thread.

inosmeet added a commit to inosmeet/cve-bin-tool that referenced this issue Mar 14, 2024
* Related: intel#3771

Signed-off-by: Meet Soni <meetsoni3017@gmail.com>
@anthonyharrison
Contributor

As we move to support PURLs more extensively, I think it might be worth creating a class for handling all of the PURL related processing to make it easier to manage.

@terriko
Contributor Author

terriko commented Mar 14, 2024

As we move to support PURLs more extensively, I think it might be worth creating a class for handling all of the PURL related processing to make it easier to manage.

Definitely. Relative to the pseudocode above: sections 2-5 should be grouped into a function that accepts a purl and spits out a list of CPEs (or None), and that function should be in an easily imported class we can use in at least the language parsers and the sbom parser. (I wouldn't expect to use purl in the binary checkers, though. We could but it would be hours and hours of refactoring to little benefit, so I'd say it's a waste of time at the moment.)
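One way to group sections 2-5 into an easily imported class, per the suggestion above, is sketched below. The lookup callables are injected stubs because the real database interfaces don't exist yet; the class name, method names, and the simplified NVD-search signature are all assumptions for illustration.

```python
class PurlCpeResolver:
    """Sketch grouping steps 2-5 of the pseudocode above: purl2cpe
    lookup, NVD-search fallback, then de-dupe filtering.

    The three lookups are injected as callables since the real
    database interfaces are still hypothetical at this point.
    """

    def __init__(self, purl2cpe, nvd_search, dedupe):
        self.purl2cpe = purl2cpe      # purl -> list of CPEs, or None
        self.nvd_search = nvd_search  # (vendor, product) -> list of CPEs
        self.dedupe = dedupe          # purl -> set of known-bad CPEs

    def resolve(self, purl, vendor="UNKNOWN", product=""):
        # Step 2: direct purl2cpe mapping -- if found, we're done.
        cpes = self.purl2cpe(purl)
        if cpes is not None:
            return cpes
        # Step 3: fall back to searching the data.
        cpes = self.nvd_search(vendor, product)
        # Steps 4-5: filter out known-bad matches, return the rest.
        bad = self.dedupe(purl)
        return [cpe for cpe in cpes if cpe not in bad]
```

Because the dependencies are injected, both the language parsers and the sbom parser could share one instance, and tests can pass in plain lambdas instead of real databases.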

@joydeep049
Contributor

Hello @terriko , I am interested in this project and will share an initial draft proposal on gitter soon!

@joydeep049
Contributor

@terriko @anthonyharrison
I had a question. You have mentioned that step-2 would be integration of purl2cpe database into the language parsers.
So since we are also going to have to integrate the information that the de-duplication database will hold into the language parsers, should that be done in step-2 or step-4 ?

@terriko
Contributor Author

terriko commented Mar 18, 2024

@joydeep049 I would plan to integrate the purl2cpe database first. I think you'll learn enough from integrating it that it'll result in a better design and integration setup for the de-dupe db in the later steps.

@terriko
Contributor Author

terriko commented Mar 18, 2024

And no, you're not really putting the de-dupe info in the language parsers I don't think? As @anthonyharrison indicated, you'll be putting the de-dupe info into a function that gets called from the general cve lookup code eventually, because we'll want to use it in the SBOM lookup as well as the language parsers.

@joydeep049
Contributor

@joydeep049 I would plan to integrate the purl2cpe database first. I think you'll learn enough from integrating it that it'll result in a better design and integration setup for the de-dupe db in the later steps.

Thanx!
Does developing a library for de-dupe database mean developing something close to the utilities part of purl2cpe?

@joydeep049
Contributor

And no, you're not really putting the de-dupe info in the language parsers I don't think? As @anthonyharrison indicated, you'll be putting the de-dupe info into a function that gets called from the general cve lookup code eventually, because we'll want to use it in the SBOM lookup as well as the language parsers.

This would mean that I would just be adding extra steps to the general lookup that we have for all data sources?
And also to the SBOM parsers?

@terriko
Contributor Author

terriko commented Mar 18, 2024

Thanx! Does developing a library for de-dupe database mean developing something close to the utilities part of purl2cpe?

Possibly? We also potentially have the option to just publish the created purl-de-dupe.db file on our mirrors with instructions and sql info. It's ok if we don't know the best path here yet; I suspect you'll have many opinions about what purl2cpe did right and wrong by the time you get to that last piece and you should let that experience inform the final design rather than trying to get it all perfect at the proposal stage, so make a guess but leave yourself open to change there.

@terriko
Contributor Author

terriko commented Mar 18, 2024

This would mean that I would just be adding extra steps to the general lookup that we have for all data sources? And also to the SBOM parsers?

If I had to guess, you'd be effectively replacing get_vendor_product_pairs (which is approximately section 3 and 5 of the pseudocode I gave above) but there may be some refactoring involved so that the language parsers, the sbom parsers, and the binary parsers can use the same function and can pass in a purl in lieu of a product name. Feel free to poke around and check, but you're probably getting too detailed for a gsoc proposal at this point, though. It's ok to just add a step of "integrate de-dupe database with sboms" as a step 6 and leave it without more than a few sentences of how you expect that to work.

@joydeep049
Contributor

This would mean that I would just be adding extra steps to the general lookup that we have for all data sources? And also to the SBOM parsers?

If I had to guess, you'd be effectively replacing get_vendor_product_pairs (which is approximately section 3 and 5 of the pseudocode I gave above) but there may be some refactoring involved so that the language parsers, the sbom parsers, and the binary parsers can use the same function and can pass in a purl in lieu of a product name. Feel free to poke around and check, but you're probably getting too detailed for a gsoc proposal at this point, though. It's ok to just add a step of "integrate de-dupe database with sboms" as a step 6 and leave it without more than a few sentences of how you expect that to work.

Yes, probably! I just wanted to be as thorough as possible in my proposal. But I guess I should leave some of the decisions on how exactly things will work for when I get to work on the project.

@joydeep049
Contributor

Just a heads-up to anyone who wants to work on PURL generation for language parsers. Most of the parsers have been claimed but dart remains. So any contributor who wants to work on this can take it up!

If you have any doubt, please feel free to reach out to me or to @terriko or @anthonyharrison.

@terriko
Contributor Author

terriko commented Jun 26, 2024

Marking this as attached to the 3.3.1 milestone to make my tracking for that easier.
