
Support local DBs #81

Closed
oliverchang opened this issue Dec 19, 2022 · 9 comments · Fixed by #183
Labels
enhancement New feature or request

Comments

@oliverchang
Collaborator

oliverchang commented Dec 19, 2022

Currently the scanner works by using the OSV.dev API, which ensures matching against the latest live DB with minimal delay (targeted <15 minute latency from the upstream source).
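
For context, an API-mode match boils down to one documented `/v1/query` request per package - a minimal sketch (the package name and version below are just placeholders):

```go
// Minimal sketch of an API-mode query against the documented OSV.dev
// /v1/query endpoint. The package and version are placeholders.
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

func main() {
	query := map[string]any{
		"version": "4.17.20",
		"package": map[string]string{"name": "lodash", "ecosystem": "npm"},
	}
	body, _ := json.Marshal(query)

	resp, err := http.Post("https://api.osv.dev/v1/query", "application/json", bytes.NewReader(body))
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	// The response is {"vulns": [...]} with one entry per matching advisory.
	var result struct {
		Vulns []struct {
			ID string `json:"id"`
		} `json:"vulns"`
	}
	if err := json.NewDecoder(resp.Body).Decode(&result); err != nil {
		panic(err)
	}
	for _, v := range result.Vulns {
		fmt.Println(v.ID)
	}
}
```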

We should support a local mode, which supports taking in a local OSV DB.

One of the key prerequisites here is:

  • Implementing version comparison rules for all our supported ecosystems. This is necessary for precise vulnerability matching based on the OSV version matching algorithm (@G-Rath has something that does this for many of our ecosystems); see the sketch below.
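
To illustrate what those rules involve, here's a minimal sketch of OSV-style range matching for the Go ecosystem only, using `golang.org/x/mod/semver`; a full implementation needs a comparator per ecosystem (npm ranges, PEP 440, Debian versions, ...), and the versions below are placeholders:

```go
package main

import (
	"fmt"

	"golang.org/x/mod/semver"
)

// affected reports whether v falls in the half-open OSV-style range
// [introduced, fixed). x/mod/semver requires the leading "v".
func affected(v, introduced, fixed string) bool {
	return semver.Compare(v, introduced) >= 0 && semver.Compare(v, fixed) < 0
}

func main() {
	fmt.Println(affected("v1.2.3", "v1.0.0", "v1.3.0")) // true: inside the range
	fmt.Println(affected("v1.3.0", "v1.0.0", "v1.3.0")) // false: the fixed version itself
}
```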

This will have some limitations:

  • Commit-based matching will not work -- the API indexes all commit hashes ingested, and it's not feasible to replicate this index locally (see the query sketch after this list).
  • Potential performance issues?
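
To make the first limitation concrete: in API mode a commit lookup is a single documented `/v1/query` request (sketch below, with a placeholder hash), but answering the same question offline would mean replicating OSV.dev's index of every ingested commit hash:

```go
package main

import (
	"bytes"
	"io"
	"net/http"
	"os"
)

func main() {
	// Commit-based query: only answerable by the hosted API, which indexes
	// every ingested commit hash. The hash below is a placeholder.
	body := []byte(`{"commit": "0000000000000000000000000000000000000000"}`)
	resp, err := http.Post("https://api.osv.dev/v1/query", "application/json", bytes.NewReader(body))
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	io.Copy(os.Stdout, resp.Body) // raw JSON: {"vulns": [...]}
}
```
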
@oliverchang oliverchang added the enhancement New feature or request label Dec 19, 2022
@G-Rath
Collaborator

G-Rath commented Dec 19, 2022

As you know this is (imo) pretty extensively supported by osv-detector - I'm happy for any code from that to be brought into the scanner, and think a good first step would be to look at that to see what people think is missing (or plain wrong).

Implementing version comparison rules for all our supported ecosystems.

I'm hoping to get around to PRing that over here this week 😅 (edit: done 🎉)

Potential performance issues?

Since the detector has been using a local database since day 1, I've spent a lot of time looking at the performance in a number of cases, especially when:

  • I added support for deduplicating aliases (which added like 3 extra full loops), and extra loops that came from misc fixes like handling Python packages with unnormalized names
  • support for the osv.dev API was landed, to see if it was faster than using the local database
  • I overhauled semantic to use ecosystem-specific version comparisons, since that greatly changed the work being done by the detector

In all cases the detector performed extremely fast, with the slowest part always being downloading the database (which even then would typically take <5 seconds for all the DBs needed for a project) - in contrast, with the API it would take 15+ seconds 😅

(not saying there couldn't be performance issues, just sharing my experience so far)

@oliverchang
Collaborator Author

@G-Rath any chance you'd have time to try to tackle this one within the next few weeks? Otherwise we might be interested in taking this on, as we've been seeing requests for this.

@G-Rath
Collaborator

G-Rath commented Feb 2, 2023

@oliverchang yup probably - do you want to start with just an --offline flag for now and go from there?

@oliverchang
Collaborator Author

oliverchang commented Feb 2, 2023

@oliverchang yup probably - do you want to start with just an --offline flag for now and go from there?

SGTM! Learning from our mistake with --json / --format though, perhaps something like --db may be more extensible. e.g.

--db='https://api.osv.dev' # default
--db=offline 

This lets us make things a bit more extensible in the future.

We can start this out as an experimental flag (subject to change) to get the ball rolling and change it later if we need to.

CC @another-rex

@G-Rath
Collaborator

G-Rath commented Feb 2, 2023

Hmm, ok, I'm not really a fan of that, but I can't think of a better approach beyond what's in the detector (which'd be a sizable change), so I'm happy to go with it if you think it's the way to go.

Though what if we followed Node's pattern of sticking --experimental- in front of these flags, to let us really play around with them?

@oliverchang
Collaborator Author

--experimental- sgtm!

@spencerschrock
Contributor

which supports taking in a local OSV DB

This DB in question is the one hosted at gs://osv-vulnerabilities?

This will have some limitations:

  • Commit based matching will not work -- the API indexes all commit hashes ingested, and it's not feasible to replicate this index locally.

This is for scenarios where a lockfile specifies a dependency by commit hash and the local DB wouldn't know how to map it to a version for matching against vulns?

oliverchang pushed a commit that referenced this issue Aug 9, 2023
Resolves #81

~This is based off a lot of the core of the detector - it's not working
yet because I need to figure out how to handle passing the queries to
the local db, given that the detector takes `PackageDetails`; but really
the key thing there is how to handle PURLs, which come from SBOMs that I
don't really know how to use 😅 (idk if I'm just dumb or what, but for
some reason I've still not been able to figure out how to accurately
generate one from a `Gemfile.lock`, `package-lock.json`, etc)~

~If someone could provide some sample SBOMs that would be very useful
(I'll also do a PR adding tests using them as fixtures), and also happy
to receive feedback on the general approach - there are some smaller
bits to discuss, like if fields should be omitted from the JSON output
vs an empty array, and the `Describe` related stuff too.~

This is now working, though personally it feels pretty awkward code-wise
- I know I'm biased, but I feel like it would be better to try to bring
across the whole `database` package from the detector, as the API db is
pretty much the same, and then you'd have support for zips, directories,
and the API, with extra configs like working directories, plus an
extensive test suite for all three (I don't think it would be as painful
as one might first think, even with `osvscanner` having just been made
public, because that's relatively small).

Still, this does work as advertised - there are definitely a few things
that could do with some cleaning up (including whether fields should be
omitted from the JSON output vs an empty array, and the `Describe`
related stuff too), but I'm leaving them for now until I hear what folks
think of the general implementation + my above comment.

I've also gone with two boolean flags rather than the URL-based flag
@oliverchang suggested, because I didn't feel comfortable trying to
shoehorn that into this PR as well; and now that we're using
`--experimental` it should be fine to completely change these flags in
the future.
@spencerschrock
Contributor

Friendly ping on these questions.

@another-rex
Collaborator

Sorry, missed this message!

which supports taking in a local OSV DB

This DB in question is the one hosted at gs://osv-vulnerabilities?

Yes, specifically the `all.zip` files under each ecosystem's directory (sketch of fetching one below).
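
For anyone wanting to poke at it: a minimal sketch of fetching one ecosystem's dump, assuming the bucket's standard GCS HTTP mirror at `osv-vulnerabilities.storage.googleapis.com` (each entry in the zip is a single OSV JSON record):

```go
package main

import (
	"archive/zip"
	"bytes"
	"fmt"
	"io"
	"net/http"
)

func main() {
	// Standard HTTP mirror of the gs://osv-vulnerabilities bucket.
	const url = "https://osv-vulnerabilities.storage.googleapis.com/npm/all.zip"

	resp, err := http.Get(url)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	data, err := io.ReadAll(resp.Body)
	if err != nil {
		panic(err)
	}

	// Each file inside the archive is one OSV advisory in JSON form.
	zr, err := zip.NewReader(bytes.NewReader(data), int64(len(data)))
	if err != nil {
		panic(err)
	}
	fmt.Printf("npm dump contains %d advisories\n", len(zr.File))
}
```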

This will have some limitations:

  • Commit based matching will not work -- the API indexes all commit hashes ingested, and it's not feasible to replicate this index locally.

This is for scenarios where a lockfile specifies a dependency by commit hash and the local DB wouldn't know how to map it to a version for matching against vulns?

Yes, and also for matching the commit of any git repo that's being scanned.
