-
-
Notifications
You must be signed in to change notification settings - Fork 548
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add CycloneDx output option #2698
Add CycloneDx output option #2698
Conversation
introduce new --cyclonedx{-json} output options and register associated plugins in setup.cfg Signed-off-by: Alexander Gschrei <alexander.gschrei@fau.de>
Signed-off-by: Alexander Gschrei <alexander.gschrei@fau.de>
…1888 suppress keys with None values from output Signed-off-by: Alexander Gschrei <alexander.gschrei@fau.de>
…code-org#1888 Signed-off-by: Alexander Gschrei <alexander.gschrei@fau.de>
Signed-off-by: Alexander Gschrei <alexander.gschrei@fau.de>
added custom JSONEncoder to rename fields in compliance with CycloneDx spec Signed-off-by: Alexander Gschrei <alexander.gschrei@fau.de>
avoid duplicate component entries by merging two components with the same bom-ref Signed-off-by: Alexander Gschrei <alexander.gschrei@fau.de>
Signed-off-by: Alexander Gschrei <alexander.gschrei@fau.de>
Signed-off-by: Alexander Gschrei <alexander.gschrei@fau.de>
Signed-off-by: Alexander Gschrei <alexander.gschrei@fau.de>
Signed-off-by: Alexander Gschrei <alexander.gschrei@fau.de>
Signed-off-by: Alexander Gschrei <alexander.gschrei@fau.de>
do not serialize packages without name or version Signed-off-by: Alexander Gschrei <alexander.gschrei@fau.de>
Signed-off-by: Alexander Gschrei <alexander.gschrei@fau.de>
…boutcode-org#1888 Signed-off-by: Alexander Gschrei <alexander.gschrei@fau.de>
38e3913
to
8346eb8
Compare
Looking at the jobs that are currently failing it seems that the failures are related to the introduction of the I'm not sure how to proceed here. @pombredanne could you kindly provide some guidance? |
@agschrei Thank you for contributing a CycloneDX output plugin! We will review the plugin in depth, but it looks good so far. We are looking into the test error you mentioned, as it is affecting other branches as well. |
@agschrei Thank youn ++ for this... let me review. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a beautiful PR!
I came with a few nitpickings for your consideration. One thing I am not sure I get is how licenses or license expressions are stored in CycloneDX.
We can have also some chat at your convenience; we meet at https://gitter.im/aboutcode-org/discuss
"purl": "pkg:npm/abbrev@1.0.7", | ||
"licenses": [ | ||
{ | ||
"license": { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like you are taking the declared license for licenses. This would be the license that's asserted in a package manifest, not detected and not normalized. You should instead use the license_expression. And likely convert this to an SPDX expression?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@pombredanne I already mentioned this a bit further down, but my reasoning was that the license expression inferred by scancode can sometimes be needlessly complex, see this example from the tests:
"purl": "pkg:npm/has@1.0.3",
"licenses": [
{
"expression": "mit AND (mit AND (mit AND unknown))"
},
{
"license": {
"id": "MIT"
}
}
],
Granted this can be normalized to mit AND unknown
but at that point it might still be helpful to have the declared license alongside it.
If license_expression and declared_license are one and the same, we only keep one of the entries, so this approach should never lead to duplicate information.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
my reasoning was that the license expression inferred by scancode can sometimes be needlessly complex, see this example from the tests:
There is no such thing as a "needlessly complex expression". There are two cases:
- an expression is correct and this should be the default case
- an expression is not correct and this a bug
In the case of the has
npm package this is a bug and we should treat this correctly
Can you enter a bug?
https://github.com/tarruda/has/blob/8146666148f1812f90e737aeefcded17a8cf81f4/package.json#L24
If license_expression and declared_license are one and the same, we only keep one of the entries, so this approach should never lead to duplicate information.
As explained above the declared_license
is not something that is a reliable data point as this has not gone through detection and normalization using the license detection engine, it should never be use as the license in a CycloneDX document.
return obj_dict | ||
|
||
|
||
class CycloneDxEncoder(json.JSONEncoder): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we instead have an to_dict()
methods on objects to serialize and avoid having JSON-specific serialization code?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@pombredanne I am open to this but would like to understand your reasoning about it first. Is there some downside to the serialization code that I am not seeing?
I solved it this way because this encoder throws away any fields with empty or None values and renames the bom_ref
field to bom-ref
for output no matter what we throw in, so it works generically for all objects we work with here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am open to this but would like to understand your reasoning about it first. Is there some downside to the serialization code that I am not seeing?
We also use a to_dict()
function throughout the codebase as a convention.
A JSONEncoder makes the serialization JSON-specific and impossible to use with other presentations such as YAML and I would rather have an explicit field-by-field mapping rather than something implicit.
Also why would you rename fields? Does the format demands to have "bom-ref" vs. "bom_ref" ?
url = license.osi_url | ||
if license.text_urls: | ||
url = license.text_urls[0] | ||
lic = CycloneDxLicense(id=license.spdx_license_key, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you need an SPDX license expression? if so we can get this directly instead.
What is CycloneDxLicenseEntry? a single license or an expression?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am currently using the scancode license expression here as is, because the BOM format doesn't specify anything further about the expression field. But if we can get an SPDX license expression from that directly I would be in favor of that. Where do I need to look? Does the codebase offer some kind of utility to convert to that?
A CycloneDxLicenseEntry
is a union type that can either hold a single expression or a CycloneDxLicense
that has an (spdx) id, name and optional URL. CycloneDx also allows to include the license text here, but we currently don't map this as I believe including the license text for every finding would only needlessly bloat the report.
My argument here is that if a user / downstream tool wants the license text they can just grab it from the provided URL
Thanks, I have to say that your contribution guide is really helpful for first-time contributors.
I'll have to set some time aside to implement these changes. Looking at your comments most of the open points should be quick fixes, So there's a good chance I'll try to find you on gitter like you suggested to clarify on that.
The way I understood the CycloneDx spec, a component may have a list of
but obviously not both. I used Atlassian's great schema viewer to explore the CycloneDx 1.3 schema here if you're interested. |
Signed-off-by: Alexander Gschrei <alexander.gschrei@fau.de>
Signed-off-by: Alexander Gschrei <alexander.gschrei@fau.de>
…#1888 put bom-ref, type and scope keys higher up in component entries, consistently use single quotes Signed-off-by: Alexander Gschrei <alexander.gschrei@fau.de>
It makes sense. So I think that we should instead likely:
|
Signed-off-by: Alexander Gschrei <alexander.gschrei@fau.de>
Signed-off-by: Alexander Gschrei <alexander.gschrei@fau.de>
@agschrei thanks for the updates. Let me review. |
…de-org#1888 Signed-off-by: Alexander Gschrei <alexander.gschrei@fau.de>
@agschrei gentle ping. I would like to merge this soon enough. |
Most of the requested changes are now implemented in the way I understood them. The only major thing that is still missing is the correct handling of I'll reach out to you on the gitter channel to double-check my solution then. |
Signed-off-by: Alexander Gschrei <alexander.gschrei@fau.de>
I triggered a re-run of the test failures which are due to Azure flakiness, not a real error. |
and there is a new lint/check that was merged recently for the documentation and you are failing it... that's an easy fix though:
|
…g#1888 Signed-off-by: Alexander Gschrei <alexander.gschrei@fau.de>
I just fixed the docs sections that were failing the lint check. Hopefully everything passes now. |
@agschrei gentle ping... there are few nitpickings left for your review |
@pombredanne I was still waiting on feedback for some of the points you raised. I have now resolved everything that is fixed/considered, but for the points that still remain open I either need additional input or just a nod that this is how you want to proceed. Could you please have a look at the comments there? Also, should I rebase this onto develop in the meantime to see if it plays nice with the latest changes there? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry for the late review!
See my comments inline for your consideration.
return (licenses, frozenset(spdx_keys)) | ||
|
||
|
||
known_licenses, spdx_ids = _get_set_of_known_licenses_and_spdx_license_ids() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My point was also to move this line into a function and not call the function at import time
) | ||
|
||
|
||
def merge_components(existing: CycloneDxComponent, new: CycloneDxComponent): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This may be a problem. ScanCode reports all instances of each package so there can be duplicated purls (and this is evolving further with #2748 )
The thing is that each package instance may have the same purl but can different attributes (for instanced when derived from a certain manifest vs. another format of style such as a Gemfile, gemspec and installed metadata.
return obj_dict | ||
|
||
|
||
class CycloneDxEncoder(json.JSONEncoder): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am open to this but would like to understand your reasoning about it first. Is there some downside to the serialization code that I am not seeing?
We also use a to_dict()
function throughout the codebase as a convention.
A JSONEncoder makes the serialization JSON-specific and impossible to use with other presentations such as YAML and I would rather have an explicit field-by-field mapping rather than something implicit.
Also why would you rename fields? Does the format demands to have "bom-ref" vs. "bom_ref" ?
"purl": "pkg:npm/abbrev@1.0.7", | ||
"licenses": [ | ||
{ | ||
"license": { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
my reasoning was that the license expression inferred by scancode can sometimes be needlessly complex, see this example from the tests:
There is no such thing as a "needlessly complex expression". There are two cases:
- an expression is correct and this should be the default case
- an expression is not correct and this a bug
In the case of the has
npm package this is a bug and we should treat this correctly
Can you enter a bug?
https://github.com/tarruda/has/blob/8146666148f1812f90e737aeefcded17a8cf81f4/package.json#L24
If license_expression and declared_license are one and the same, we only keep one of the entries, so this approach should never lead to duplicate information.
As explained above the declared_license
is not something that is a reliable data point as this has not gone through detection and normalization using the license detection engine, it should never be use as the license in a CycloneDX document.
Note that we are working on introducing a "primary license" for packages... which would address your concern. In the meantime, I will merge your branch patched to use the |
Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
This new function returns an SPDX license expression string from a ScanCode license_expression string. Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
Only deal with an SPDX license_expression and not plain licenses or declared licenses, because this is the only correct license information. Removing most type hints since we are not yet checking these. Remove dependency lookup which could be misleading. Instead add warning if dependencies are skipped. Add docstring to most methods and functions. Streamline nested list comprehensions and filter/map constructions and replace these with simpler constructs. Replace JSON encoder by a plain to_dict() serialization throughout. Move attrs attributes declarations towards the top of classes declarations rather than at the bottom. Move move method-like functions to their respective objects, including moving XML serialization functions to to_xml_element() methods for each model object. Use factories for fields defaults such as URN and timestamp. Split method that work on ScanCode packages lists in two: one operating on a single package mapping and one working on the packages list. Update tests to work on larger set of expected data. Always return all attributes, including empty values for now. Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
@agschrei I merged the latest develop and updated your code according to some of the comments above. Please review before I merge! The key changes are outlined in the commit message of 953a40e Refine CycloneDX support #1888
|
The ways of XML and encoding are sometimes complex. Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
All green now. Merging! |
Implements #1888
Tasks
Run tests locally to check for errors.