Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add CycloneDx output option #2698

Merged

Conversation

agschrei
Copy link
Contributor

@agschrei agschrei commented Sep 8, 2021

Implements #1888

Tasks

  • Reviewed contribution guidelines
  • PR is descriptively titled 📑 and links the original issue above 🔗
  • Tests pass -- look for a green checkbox ✔️ a few minutes after opening your PR
    Run tests locally to check for errors.
  • Commits are in uniquely-named feature branch and has no merge conflicts 📁

introduce new --cyclonedx{-json} output options and register associated plugins in setup.cfg

Signed-off-by: Alexander Gschrei <alexander.gschrei@fau.de>
Signed-off-by: Alexander Gschrei <alexander.gschrei@fau.de>
…1888

suppress keys with None values from output

Signed-off-by: Alexander Gschrei <alexander.gschrei@fau.de>
…code-org#1888

Signed-off-by: Alexander Gschrei <alexander.gschrei@fau.de>
Signed-off-by: Alexander Gschrei <alexander.gschrei@fau.de>
added custom JSONEncoder to rename fields in compliance with CycloneDx spec

Signed-off-by: Alexander Gschrei <alexander.gschrei@fau.de>
avoid duplicate component entries by merging two components with the same bom-ref

Signed-off-by: Alexander Gschrei <alexander.gschrei@fau.de>
Signed-off-by: Alexander Gschrei <alexander.gschrei@fau.de>
Signed-off-by: Alexander Gschrei <alexander.gschrei@fau.de>
Signed-off-by: Alexander Gschrei <alexander.gschrei@fau.de>
Signed-off-by: Alexander Gschrei <alexander.gschrei@fau.de>


Signed-off-by: Alexander Gschrei <alexander.gschrei@fau.de>
Signed-off-by: Alexander Gschrei <alexander.gschrei@fau.de>
do not serialize packages without name or version

Signed-off-by: Alexander Gschrei <alexander.gschrei@fau.de>
Signed-off-by: Alexander Gschrei <alexander.gschrei@fau.de>
…boutcode-org#1888

Signed-off-by: Alexander Gschrei <alexander.gschrei@fau.de>
@agschrei
Copy link
Contributor Author

agschrei commented Sep 8, 2021

Looking at the jobs that are currently failing it seems that the failures are related to the introduction of the output_format_version key in the JSON output which is independent of the changes this PR introduced.
See for example this job

I'm not sure how to proceed here.

@pombredanne could you kindly provide some guidance?

@JonoYang
Copy link
Member

@agschrei Thank you for contributing a CycloneDX output plugin! We will review the plugin in depth, but it looks good so far. We are looking into the test error you mentioned, as it is affecting other branches as well.

@pombredanne
Copy link
Member

@agschrei Thank youn ++ for this... let me review.

Copy link
Member

@pombredanne pombredanne left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a beautiful PR!
I came with a few nitpickings for your consideration. One thing I am not sure I get is how licenses or license expressions are stored in CycloneDX.
We can have also some chat at your convenience; we meet at https://gitter.im/aboutcode-org/discuss

tests/formattedcode/data/cyclonedx/expected.xml Outdated Show resolved Hide resolved
"purl": "pkg:npm/abbrev@1.0.7",
"licenses": [
{
"license": {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like you are taking the declared license for licenses. This would be the license that's asserted in a package manifest, not detected and not normalized. You should instead use the license_expression. And likely convert this to an SPDX expression?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pombredanne I already mentioned this a bit further down, but my reasoning was that the license expression inferred by scancode can sometimes be needlessly complex, see this example from the tests:

 "purl": "pkg:npm/has@1.0.3",
      "licenses": [
        {
          "expression": "mit AND (mit AND (mit AND unknown))"
        },
        {
          "license": {
            "id": "MIT"
          }
        }
      ],

Granted this can be normalized to mit AND unknown but at that point it might still be helpful to have the declared license alongside it.

If license_expression and declared_license are one and the same, we only keep one of the entries, so this approach should never lead to duplicate information.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

my reasoning was that the license expression inferred by scancode can sometimes be needlessly complex, see this example from the tests:

There is no such thing as a "needlessly complex expression". There are two cases:

  1. an expression is correct and this should be the default case
  2. an expression is not correct and this a bug

In the case of the has npm package this is a bug and we should treat this correctly
Can you enter a bug?

https://github.com/tarruda/has/blob/8146666148f1812f90e737aeefcded17a8cf81f4/package.json#L24

If license_expression and declared_license are one and the same, we only keep one of the entries, so this approach should never lead to duplicate information.

As explained above the declared_license is not something that is a reliable data point as this has not gone through detection and normalization using the license detection engine, it should never be use as the license in a CycloneDX document.

tests/formattedcode/data/cyclonedx/expected.json Outdated Show resolved Hide resolved
docs/source/rst_snippets/output_format_options.rst Outdated Show resolved Hide resolved
src/formattedcode/output_cyclonedx.py Outdated Show resolved Hide resolved
return obj_dict


class CycloneDxEncoder(json.JSONEncoder):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we instead have an to_dict() methods on objects to serialize and avoid having JSON-specific serialization code?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pombredanne I am open to this but would like to understand your reasoning about it first. Is there some downside to the serialization code that I am not seeing?

I solved it this way because this encoder throws away any fields with empty or None values and renames the bom_ref field to bom-ref for output no matter what we throw in, so it works generically for all objects we work with here

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am open to this but would like to understand your reasoning about it first. Is there some downside to the serialization code that I am not seeing?

We also use a to_dict() function throughout the codebase as a convention.
A JSONEncoder makes the serialization JSON-specific and impossible to use with other presentations such as YAML and I would rather have an explicit field-by-field mapping rather than something implicit.
Also why would you rename fields? Does the format demands to have "bom-ref" vs. "bom_ref" ?

src/formattedcode/output_cyclonedx.py Show resolved Hide resolved
src/formattedcode/output_cyclonedx.py Show resolved Hide resolved
url = license.osi_url
if license.text_urls:
url = license.text_urls[0]
lic = CycloneDxLicense(id=license.spdx_license_key,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you need an SPDX license expression? if so we can get this directly instead.

What is CycloneDxLicenseEntry? a single license or an expression?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am currently using the scancode license expression here as is, because the BOM format doesn't specify anything further about the expression field. But if we can get an SPDX license expression from that directly I would be in favor of that. Where do I need to look? Does the codebase offer some kind of utility to convert to that?

A CycloneDxLicenseEntry is a union type that can either hold a single expression or a CycloneDxLicense that has an (spdx) id, name and optional URL. CycloneDx also allows to include the license text here, but we currently don't map this as I believe including the license text for every finding would only needlessly bloat the report.
My argument here is that if a user / downstream tool wants the license text they can just grab it from the provided URL

src/formattedcode/output_cyclonedx.py Outdated Show resolved Hide resolved
@agschrei
Copy link
Contributor Author

This is a beautiful PR!

Thanks, I have to say that your contribution guide is really helpful for first-time contributors.

I came with a few nitpickings for your consideration.

I'll have to set some time aside to implement these changes.

Looking at your comments most of the open points should be quick fixes,
but there's some other stuff like the declared licenses topic that might require additional back and forth.

So there's a good chance I'll try to find you on gitter like you suggested to clarify on that.

One thing I am not sure I get is how licenses or license expressions are stored in CycloneDX.

The way I understood the CycloneDx spec, a component may have a list of License entries where each entry may either have

  • a license expression as a string
  • a License Object as the spec calls it

but obviously not both.
So I tried to implement that as an explicit union type with CycloneDxLicenseEntry.

I used Atlassian's great schema viewer to explore the CycloneDx 1.3 schema here if you're interested.

Signed-off-by: Alexander Gschrei <alexander.gschrei@fau.de>
Signed-off-by: Alexander Gschrei <alexander.gschrei@fau.de>
…#1888

put bom-ref, type and scope keys higher up in component entries,
consistently use single quotes

Signed-off-by: Alexander Gschrei <alexander.gschrei@fau.de>
@pombredanne
Copy link
Member

The way I understood the CycloneDx spec, a component may have a list of License entries where each entry may either have

* a license expression as a string

* a `License Object` as the spec calls it

but obviously not both.
So I tried to implement that as an explicit union type with CycloneDxLicenseEntry.

I used Atlassian's great schema viewer to explore the CycloneDx 1.3 schema here if you're interested.

It makes sense. So I think that we should instead likely:

  1. use only the license_expression field, as a bona fide SPDX expression
  2. in order to also provide information about all the LicenseRef-scancode which are numerous, we could provide some 'license' externalReference for each non-SPDX license (or each license) as explained in https://json-schema.app/view/%23/%23%2Fdefinitions%2Fcomponent/%23%2Fdefinitions%2FexternalReference?url=https%3A%2F%2Fraw.githubusercontent.com%2FCycloneDX%2Fspecification%2Fmaster%2Fschema%2Fbom-1.3.schema.json
  3. we could consider also keep the matched license notice text somewhere in the reference comment, but I do not think this can work as each license match may have a different notice even when matched to the same expression

Signed-off-by: Alexander Gschrei <alexander.gschrei@fau.de>
Signed-off-by: Alexander Gschrei <alexander.gschrei@fau.de>
@pombredanne
Copy link
Member

@agschrei thanks for the updates. Let me review.

…de-org#1888

Signed-off-by: Alexander Gschrei <alexander.gschrei@fau.de>
@pombredanne
Copy link
Member

@agschrei gentle ping. I would like to merge this soon enough.

…1888

Signed-off-by: Alexander Gschrei <alexander.gschrei@fau.de>
…#1888

Signed-off-by: Alexander Gschrei <alexander.gschrei@fau.de>
@agschrei
Copy link
Contributor Author

Most of the requested changes are now implemented in the way I understood them.
Please let me know if something still irks you about the modifications.

The only major thing that is still missing is the correct handling of LicenseRef-* SPDX IDs as suggested above, I will hopefully finish that tomorrow.

I'll reach out to you on the gitter channel to double-check my solution then.

Signed-off-by: Alexander Gschrei <alexander.gschrei@fau.de>
@pombredanne
Copy link
Member

I triggered a re-run of the test failures which are due to Azure flakiness, not a real error.

@pombredanne
Copy link
Member

and there is a new lint/check that was merged recently for the documentation and you are failing it... that's an easy fix though:

Run ./scripts/doc8_style_check.sh
source/rst_snippets/output_format_options.rst:34: D001 Line too long
source/cli-reference/output-format.rst:585: D001 Line too long
Error: Process completed with exit code 1.

…g#1888

Signed-off-by: Alexander Gschrei <alexander.gschrei@fau.de>
@agschrei
Copy link
Contributor Author

I just fixed the docs sections that were failing the lint check.

Hopefully everything passes now.

@pombredanne
Copy link
Member

@agschrei gentle ping... there are few nitpickings left for your review

@agschrei
Copy link
Contributor Author

agschrei commented Oct 27, 2021

@pombredanne I was still waiting on feedback for some of the points you raised. I have now resolved everything that is fixed/considered, but for the points that still remain open I either need additional input or just a nod that this is how you want to proceed. Could you please have a look at the comments there?

Also, should I rebase this onto develop in the meantime to see if it plays nice with the latest changes there?

Copy link
Member

@pombredanne pombredanne left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the late review!
See my comments inline for your consideration.

return (licenses, frozenset(spdx_keys))


known_licenses, spdx_ids = _get_set_of_known_licenses_and_spdx_license_ids()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My point was also to move this line into a function and not call the function at import time

src/formattedcode/output_cyclonedx.py Outdated Show resolved Hide resolved
src/formattedcode/output_cyclonedx.py Outdated Show resolved Hide resolved
src/formattedcode/output_cyclonedx.py Outdated Show resolved Hide resolved
src/formattedcode/output_cyclonedx.py Outdated Show resolved Hide resolved
src/formattedcode/output_cyclonedx.py Outdated Show resolved Hide resolved
)


def merge_components(existing: CycloneDxComponent, new: CycloneDxComponent):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This may be a problem. ScanCode reports all instances of each package so there can be duplicated purls (and this is evolving further with #2748 )
The thing is that each package instance may have the same purl but can different attributes (for instanced when derived from a certain manifest vs. another format of style such as a Gemfile, gemspec and installed metadata.

src/formattedcode/output_cyclonedx.py Outdated Show resolved Hide resolved
return obj_dict


class CycloneDxEncoder(json.JSONEncoder):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am open to this but would like to understand your reasoning about it first. Is there some downside to the serialization code that I am not seeing?

We also use a to_dict() function throughout the codebase as a convention.
A JSONEncoder makes the serialization JSON-specific and impossible to use with other presentations such as YAML and I would rather have an explicit field-by-field mapping rather than something implicit.
Also why would you rename fields? Does the format demands to have "bom-ref" vs. "bom_ref" ?

"purl": "pkg:npm/abbrev@1.0.7",
"licenses": [
{
"license": {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

my reasoning was that the license expression inferred by scancode can sometimes be needlessly complex, see this example from the tests:

There is no such thing as a "needlessly complex expression". There are two cases:

  1. an expression is correct and this should be the default case
  2. an expression is not correct and this a bug

In the case of the has npm package this is a bug and we should treat this correctly
Can you enter a bug?

https://github.com/tarruda/has/blob/8146666148f1812f90e737aeefcded17a8cf81f4/package.json#L24

If license_expression and declared_license are one and the same, we only keep one of the entries, so this approach should never lead to duplicate information.

As explained above the declared_license is not something that is a reliable data point as this has not gone through detection and normalization using the license detection engine, it should never be use as the license in a CycloneDX document.

@pombredanne
Copy link
Member

my reasoning was that the license expression inferred by scancode can sometimes be needlessly complex, see this example from the tests:

Note that we are working on introducing a "primary license" for packages... which would address your concern. In the meantime, I will merge your branch patched to use the license_expression and not the declared_license.

Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
This new function returns an SPDX license expression string from a
ScanCode license_expression string.

Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
Only deal with an SPDX license_expression and not plain licenses or
declared licenses, because this is the only correct license information.

Removing most type hints since we are not yet checking these.

Remove dependency lookup which could be misleading. Instead add warning
if dependencies are skipped.

Add docstring to most methods and functions.

Streamline nested list comprehensions and filter/map constructions and
replace these with simpler constructs.

Replace JSON encoder by a plain to_dict() serialization throughout.

Move attrs attributes declarations towards the top of classes
declarations rather than at the bottom.

Move move method-like functions to their respective objects, including
moving XML serialization functions to to_xml_element() methods for each
model object.

Use factories for fields defaults such as URN and timestamp.

Split method that work on ScanCode packages lists in two: one operating
on a single package mapping and one working on the packages list.

Update tests to work on larger set of expected data.

Always return all attributes, including empty values for now.

Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
@pombredanne
Copy link
Member

pombredanne commented Dec 25, 2021

@agschrei I merged the latest develop and updated your code according to some of the comments above. Please review before I merge!
Thank you for your patience.

The key changes are outlined in the commit message of 953a40e

Refine CycloneDX support #1888

  • Only deal with an SPDX license_expression and not plain licenses or
    declared licenses, because this is the only correct license information.

  • Removing most type hints since we are not yet checking these.

  • Remove dependency lookup which could be misleading. Instead add warning
    if dependencies are skipped.

  • Add docstring to most methods and functions.

  • Streamline nested list comprehensions and filter/map constructions and
    replace these with simpler constructs.

  • Replace JSON encoder by a plain to_dict() serialization throughout.

  • Move attrs attributes declarations towards the top of classes
    declarations rather than at the bottom.

  • Move move method-like functions to their respective objects, including
    moving XML serialization functions to to_xml_element() methods for each
    model object.

  • Use factories for fields defaults such as URN and timestamp.

  • Split method that work on ScanCode packages lists in two: one operating
    on a single package mapping and one working on the packages list.

  • Update tests to work on larger set of expected data.

  • Always return all attributes, including empty values for now.

The ways of XML and encoding are sometimes complex.

Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
@pombredanne
Copy link
Member

All green now. Merging!

@pombredanne pombredanne merged commit 4a97637 into aboutcode-org:develop Dec 29, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants