GSoC 2024 Project Idea: Product Mapping using PURLs #3771
For this, do we want hard-coded strings?
Hard-coded strings are what I had in mind for step 1, since we should know what type of data we parsed to make the listing. We might need something fancier for npm/javascript because the same code can parse more than one type of listing, but I think it may be the only one like that. It should be pretty simple, which is why I said that step 1 could probably be done right now. I was actually intending to do it myself before I got pulled into some more urgent stuff this month.
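As a sketch of what those hard-coded strings might look like (the parser names and the mapping below are illustrative, not cve-bin-tool's actual internals; the purl type strings themselves come from the purl spec), each language parser could simply carry its ecosystem's purl type:

```python
# Illustrative mapping from language parser to purl type.
# Parser keys here are hypothetical; the type values are real purl spec types.
PARSER_PURL_TYPE = {
    "python": "pypi",
    "rust": "cargo",
    "go": "golang",
    "ruby": "gem",
    # javascript parsers may handle more than one listing format,
    # but they all map to the npm ecosystem
    "javascript": "npm",
}


def purl_type_for(parser_language: str) -> str:
    # Fall back to the spec's catch-all "generic" type for anything unmapped
    return PARSER_PURL_TYPE.get(parser_language, "generic")


print(purl_type_for("python"))  # pypi
```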
So, can I try this one?
@Dev-Voldemort yes, feel free to try step 1. The rest is reserved as a gsoc project so if you want to do it please apply through gsoc when applications open (assuming we get selected, yada yada)
@terriko I want to work on this issue; please assign it to me.
Step-1 seems interesting. |
@crazytrain328 If you do, start with something other than Rust. You may also want to read the code review on @Dev-Voldemort 's PR #3859 so you can learn from that feedback!
My code for |
Hey @crazytrain328! I have already coded it for
@Dev-Voldemort, not a problem!
I had a lot to say in the code review there, which probably applies to anyone else working on PURL generation. The short summary: I need to generate a valid PURL no matter what, even if it's kind of a crappy one (e.g. vendor = UNKNOWN). You can use e.g.

>>> re.sub(r"[^a-zA-Z0-9_-]", "", "my golang product***")
'mygolangproduct'

(I'm open to other ideas if there's something more performant than the regex. But without any + or * wildcards in there I think it's probably fast enough and unlikely to get weird expansion effects.)
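A minimal sketch of that fallback sanitization (the function name and the UNKNOWN fallback are assumptions based on the comment above, not cve-bin-tool's actual code):

```python
import re


def sanitize_purl_component(raw: str) -> str:
    """Strip anything that isn't alphanumeric, hyphen, or underscore,
    so we always emit a valid purl component, even a crappy one.
    Hypothetical helper, not cve-bin-tool's real implementation."""
    cleaned = re.sub(r"[^a-zA-Z0-9_-]", "", raw)
    # Fall back to UNKNOWN rather than emitting an empty component
    return cleaned or "UNKNOWN"


print(sanitize_purl_component("my golang product***"))  # mygolangproduct
print(sanitize_purl_component("###"))  # UNKNOWN
```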
Hey folks, |
Hello @tahifahimi, if you have some further time, you can research ways of opening tar files without using
Hey @tahifahimi! Meanwhile, it would be great if you could help in #3800 (it would help you in understanding the fuzzer better), especially the java fuzzer. AFAIK only 1-2 are remaining, so we can complete that.
Agreed @Dev-Voldemort , |
Hello,
(I have a more detailed schema in mind, but I guess that should be part of my proposal)
@joydeep049 I think it would be worthwhile looking at what is currently available and seeing whether these solutions could be integrated into cve-bin-tool rather than creating something new (it would be a very ambitious project!). For example, have a look at Vulnerable Code, which already has good support for PURLs. SCANOSS also has a solution, but when I looked at it, it didn't have version information and the coverage was incomplete as the data was being manually curated. cc @terriko
Expected workflow here in dubious pythonic pseudocode:

```python
# 1. Get a valid purl identifier either from input or guessing one
purl = get_purl_from_input()
if purl is None:
    purl = generate_purl()

# 2. Look it up in the purl2cpe database. If it's found then we're done.
cpe_list = purl2cpe_lookup(purl)
if cpe_list is not None:
    return cpe_list

# 3. If there's no match, try searching the data (current cve-bin-tool behaviour)
if purl.vendor != "UNKNOWN":
    cpe_list = nvd_lookup(purl.vendor, purl.product)
else:
    cpe_list = nvd_lookup(purl.product)

# 4. Check these results against a de-dupe database
# (filter rather than remove-in-place so we don't skip entries while iterating)
bad_cpe_list = dedupe_lookup(purl)
cpe_list = [cpe for cpe in cpe_list if cpe not in bad_cpe_list]

# 5. Return anything still potentially valid (will likely be an empty list in most cases)
return cpe_list
```

My current feeling is as follows:
Unless there's a particularly strong reason not to, I'd suggest that we have our second database follow the same pattern as purl2cpe. There are a few reasons for this:
purl2cpe currently has an sqlite loader and works on data stored in yaml files in a repo. No relational database, so no point in getting excited about sqlalchemy yet. (And honestly, sqlite is good enough for our purposes. We're really not doing complicated database queries.)
Just so we're clear: the most urgent part of this project is the de-dupe part. We've got multiple bugs related to de-duplication that need solving and don't have a good workaround at the moment. The only reason I didn't emphasize doing that first is that I think integration with an existing database will give you a better idea of how to organize the de-dupe data, so I think we'll wind up with a better architecture if we learn from others before setting up our de-dupe stuff.

I chose purl2cpe for this project because in my analysis it looked like the clearest path for a person who is new to the space -- it's just mappings, it's using sqlite which we already use, and yaml which we also already use. No need for complicated new dependencies, and it's already designed for integration into other products. But @anthonyharrison is absolutely right that we could equally use Vulnerable Code or SCANOSS in place of purl2cpe. You could absolutely use something other than purl2cpe as the first data source here. I'm willing to be convinced, since I really only did a few hours of reading before choosing one to put in this project idea. But if you suggest a different one, be prepared to explain and defend your choice in the project proposal!

Adding the ability to use multiple libraries for cpe lookup is a perfectly reasonable long-term goal for the cve-bin-tool project, but I wouldn't put it into this year's gsoc project except as a stretch goal. This isn't because I think a GSoC contributor couldn't do it, but because years of experience have taught me to leave a lot of extra time: docs, tests, life, and weird GitHub Actions network issues will eat up a lot of time, and it's not always predictable in advance. I think focusing on one database instead of multiples will increase the chances of success on the project.
Plus, people have a lot more fun if they feel like they're ahead and doing stretch goals in the last month than they do if they feel like they might fail at any moment, and people having fun write better code and are more likely to stick around doing open source stuff. 😄
* Related: #3771 Signed-off-by: Meet Soni <meetsoni3017@gmail.com> Co-authored-by: Terri Oda <terri.oda@intel.com>
We've got the first of our generated purl ids for the go parsers now. Folks interested in this project may find our iteration/code review in that issue interesting, as it took us quite a few tries to figure out what we wanted and how. I think we'll probably want others to follow that pattern; feel free to create one for any parser that hasn't been claimed here and doesn't already have a pull request.
Finally, I can continue my work on the parsers I mentioned somewhere above in this chat.
As we move to support PURLs more extensively, I think it might be worth creating a class for handling all of the PURL related processing to make it easier to manage. |
Definitely. Relative to the pseudocode above: sections 2-5 should be grouped into a function that accepts a purl and spits out a list of CPEs (or None), and that function should be in an easily imported class we can use in at least the language parsers and the sbom parser. (I wouldn't expect to use purl in the binary checkers, though. We could but it would be hours and hours of refactoring to little benefit, so I'd say it's a waste of time at the moment.) |
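Sketching that grouping (all class, method, and helper names below are hypothetical stand-ins; the lookup steps just mirror the pseudocode earlier in the thread):

```python
class PurlLookup:
    """Groups steps 2-5 of the workflow into one reusable entry point:
    accept a purl, return a (possibly empty) list of CPEs.
    The three injected data sources are hypothetical interfaces."""

    def __init__(self, purl2cpe_db, nvd_db, dedupe_db):
        self.purl2cpe_db = purl2cpe_db
        self.nvd_db = nvd_db
        self.dedupe_db = dedupe_db

    def cpes_for(self, purl):
        # 2. A direct purl -> cpe mapping wins if one exists
        cpe_list = self.purl2cpe_db.lookup(purl)
        if cpe_list:
            return cpe_list
        # 3. Otherwise fall back to searching NVD-style data
        if purl.vendor != "UNKNOWN":
            cpe_list = self.nvd_db.search(purl.vendor, purl.product)
        else:
            cpe_list = self.nvd_db.search(purl.product)
        # 4-5. Filter out known-bad matches; an empty list is a valid answer
        bad = set(self.dedupe_db.lookup(purl) or [])
        return [cpe for cpe in (cpe_list or []) if cpe not in bad]
```

Injecting the three data sources in the constructor keeps the class easy to import from both the language parsers and the sbom parser, and easy to fake in tests.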
Hello @terriko , I am interested in this project and will share an initial draft proposal on gitter soon! |
@terriko @anthonyharrison |
@joydeep049 I would plan to integrate the purl2cpe database first. I think you'll learn enough from integrating it that it'll result in a better design and integration setup for the de-dupe db in the later steps.
And no, I don't think you're really putting the de-dupe info in the language parsers. As @anthonyharrison indicated, you'll be putting the de-dupe info into a function that gets called from the general cve lookup code eventually, because we'll want to use it in the SBOM lookup as well as the language parsers.
Thanx! |
This would mean that I would just be adding extra steps to the general lookup that we have for all data sources? |
Possibly? We also potentially have the option to just publish the created purl-de-dupe.db file on our mirrors with instructions and sql info. It's ok if we don't know the best path here yet; I suspect you'll have many opinions about what purl2cpe did right and wrong by the time you get to that last piece and you should let that experience inform the final design rather than trying to get it all perfect at the proposal stage, so make a guess but leave yourself open to change there. |
If I had to guess, you'd be effectively replacing
Yes probably! I just wanted to be as thorough as possible in my proposal. But i guess i should leave some of the decisions on how exactly things will work for when i get to work on the project. |
Just a heads-up to anyone who wants to work on PURL generation for language parsers: most of the parsers have been claimed, but if you have any doubt, please feel free to reach out to me or to @terriko or @anthonyharrison.
Marking this as attached to the 3.3.1 milestone to make my tracking for that easier. |
cve-bin-tool: Product Mapping using PURLs
Project description
CVE Binary Tool needs to identify components in order to scan for vulnerabilities, but uniquely identifying software is not always an easy thing to do. Some examples:

- python-beautifulsoup in some automated tools intended to list software (e.g. sbom tools, yocto)

CVE Binary Tool currently has explicit, pre-defined mappings between our binary signatures and a list of CPE identifiers for that "product." This works pretty well (although it does need to be updated somewhat regularly as different groups handle filing of known vulnerabilities).

Where we struggle is matching arbitrary product names found in component lists such as python's requirements.txt files. We have enough information to do better, but we need tools that turn that information into a mapping.
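For context, a purl packs the ecosystem, package name, and version into a single identifier of the form `pkg:type/name@version` (per the purl spec). A minimal sketch of building one for a requirements.txt entry (the helper function itself is illustrative, not part of cve-bin-tool):

```python
def make_pypi_purl(name: str, version: str = "") -> str:
    """Build a pypi purl of the form pkg:pypi/<name>[@<version>].
    Hypothetical helper; pypi names are lowercased per the purl spec."""
    purl = f"pkg:pypi/{name.lower()}"
    if version:
        purl += f"@{version}"
    return purl


print(make_pypi_purl("Beautifulsoup4", "4.12.3"))  # pkg:pypi/beautifulsoup4@4.12.3
```

An identifier like this is what the mapping layer would then translate into one or more CPEs.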
This project is intended to improve our product mapping and reduce false positives (like in the bugs linked above). We've noodled around on ideas and my current plan is:
If there's more time after that, I think we may want to consider pulling out other ideas and other sources of data to use. Do some brainstorming and include those as stretch goals.
Related reading
Skills
Difficulty level
Project Length
Mentor
GSoC Participants Only
This issue is a potential project idea for GSoC 2024, and is reserved for completion by a selected GSoC contributor. Please do not work on it outside of that program. If you'd like to apply to do it through GSoC, please start by reading #3550.