Add CycloneDx output option #2698

agschrei · 2021-09-08T07:26:41Z

Implements #1888

Tasks

Reviewed contribution guidelines
PR is descriptively titled 📑 and links the original issue above 🔗
Tests pass -- look for a green checkbox ✔️ a few minutes after opening your PR
Run tests locally to check for errors.
Commits are in uniquely-named feature branch and has no merge conflicts 📁

agschrei · 2021-09-08T08:33:40Z

Looking at the jobs that are currently failing, it seems the failures are related to the introduction of the output_format_version key in the JSON output, which is independent of the changes in this PR.
See for example this job

I'm not sure how to proceed here.

@pombredanne could you kindly provide some guidance?

JonoYang · 2021-09-10T00:22:09Z

@agschrei Thank you for contributing a CycloneDX output plugin! We will review the plugin in depth, but it looks good so far. We are looking into the test error you mentioned, as it is affecting other branches as well.

pombredanne · 2021-09-14T15:33:41Z

@agschrei Thank you ++ for this... let me review.

pombredanne

This is a beautiful PR!
I came up with a few nitpicks for your consideration. One thing I am not sure I understand is how licenses or license expressions are stored in CycloneDX.
We can also chat at your convenience; we meet at https://gitter.im/aboutcode-org/discuss

tests/formattedcode/data/cyclonedx/expected.xml

pombredanne · 2021-09-15T09:56:24Z

tests/formattedcode/data/cyclonedx/expected.json

+      "purl": "pkg:npm/abbrev@1.0.7",
+      "licenses": [
+        {
+          "license": {


It looks like you are taking the declared license for licenses. This would be the license that's asserted in a package manifest, not detected and not normalized. You should instead use the license_expression. And likely convert this to an SPDX expression?

@pombredanne I already mentioned this a bit further down, but my reasoning was that the license expression inferred by scancode can sometimes be needlessly complex, see this example from the tests:

"purl": "pkg:npm/has@1.0.3",
"licenses": [
  {
    "expression": "mit AND (mit AND (mit AND unknown))"
  },
  {
    "license": {
      "id": "MIT"
    }
  }
],

Granted this can be normalized to mit AND unknown but at that point it might still be helpful to have the declared license alongside it.

If license_expression and declared_license are one and the same, we only keep one of the entries, so this approach should never lead to duplicate information.
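The normalization mentioned above (collapsing `mit AND (mit AND (mit AND unknown))` to `mit AND unknown`) can be illustrated with a minimal sketch. It is not ScanCode's implementation: the helper name `dedup_and_only` is hypothetical, and it deliberately handles only expressions made of `AND`, refusing anything with `OR` or `WITH`, where flattening the parentheses would change the meaning.

```python
import re


def dedup_and_only(expression):
    """
    Flatten and deduplicate a license expression that uses only AND.
    Hypothetical helper for illustration; raises ValueError for OR/WITH.
    """
    # Split into parentheses and whitespace-separated words.
    tokens = re.findall(r"[()]|[^\s()]+", expression)
    keys = []
    for token in tokens:
        if token in ("(", ")"):
            # With AND as the only operator, grouping does not matter.
            continue
        upper = token.upper()
        if upper == "AND":
            continue
        if upper in ("OR", "WITH"):
            raise ValueError("only AND-only expressions are supported")
        if token not in keys:
            # Keep the first occurrence so the original order is preserved.
            keys.append(token)
    return " AND ".join(keys)


print(dedup_and_only("mit AND (mit AND (mit AND unknown))"))
```

A real implementation would parse the expression into a tree rather than working on tokens, so that mixed operators are handled correctly.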

my reasoning was that the license expression inferred by scancode can sometimes be needlessly complex, see this example from the tests:

There is no such thing as a "needlessly complex expression". There are two cases:

an expression is correct and this should be the default case

an expression is not correct and this is a bug

In the case of the has npm package this is a bug and we should treat this correctly
Can you enter a bug?

https://github.com/tarruda/has/blob/8146666148f1812f90e737aeefcded17a8cf81f4/package.json#L24

If license_expression and declared_license are one and the same, we only keep one of the entries, so this approach should never lead to duplicate information.

As explained above, the declared_license is not a reliable data point, as it has not gone through detection and normalization using the license detection engine. It should never be used as the license in a CycloneDX document.

tests/formattedcode/data/cyclonedx/expected.json

docs/source/rst_snippets/output_format_options.rst

src/formattedcode/output_cyclonedx.py

pombredanne · 2021-09-15T14:09:12Z

src/formattedcode/output_cyclonedx.py

+    return obj_dict
+
+
+class CycloneDxEncoder(json.JSONEncoder):


Could we instead have a to_dict() method on the objects to serialize and avoid having JSON-specific serialization code?

@pombredanne I am open to this but would like to understand your reasoning about it first. Is there some downside to the serialization code that I am not seeing?

I solved it this way because this encoder throws away any fields with empty or None values and renames the bom_ref field to bom-ref for output no matter what we throw in, so it works generically for all objects we work with here

I am open to this but would like to understand your reasoning about it first. Is there some downside to the serialization code that I am not seeing?

We also use a to_dict() function throughout the codebase as a convention.
A JSONEncoder makes the serialization JSON-specific and impossible to use with other presentations such as YAML and I would rather have an explicit field-by-field mapping rather than something implicit.
Also, why would you rename fields? Does the format demand "bom-ref" vs. "bom_ref"?
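To make the to_dict() convention concrete, here is a minimal sketch of an explicit field-by-field mapping. It assumes a simplified, hypothetical `Component` class built with stdlib dataclasses (the PR itself uses attrs) and covers only a handful of fields. The CycloneDX JSON schema does name the property `bom-ref`, which is not a valid Python attribute name, so the mapping performs the rename explicitly.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass
class Component:
    # Hypothetical, simplified model: the real plugin has more fields.
    name: str
    version: str
    bom_ref: str
    type: str = "library"
    purl: Optional[str] = None

    def to_dict(self):
        """Return a plain mapping with the key names used in the output."""
        return {
            "bom-ref": self.bom_ref,  # explicit rename, visible in one place
            "type": self.type,
            "name": self.name,
            "version": self.version,
            "purl": self.purl,
        }


component = Component(
    name="abbrev",
    version="1.0.7",
    bom_ref="pkg:npm/abbrev@1.0.7",
    purl="pkg:npm/abbrev@1.0.7",
)
# The same mapping can be handed to json, a YAML dumper, or an XML builder.
print(json.dumps(component.to_dict(), indent=2))
```

Because to_dict() returns plain Python data, the serialization format is chosen by the caller and not baked into an encoder class.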

src/formattedcode/output_cyclonedx.py

pombredanne · 2021-09-15T14:16:53Z

src/formattedcode/output_cyclonedx.py

+        url = license.osi_url
+        if license.text_urls:
+            url = license.text_urls[0]
+        lic = CycloneDxLicense(id=license.spdx_license_key,


Do you need an SPDX license expression? If so, we can get this directly instead.

What is CycloneDxLicenseEntry? a single license or an expression?

I am currently using the scancode license expression here as is, because the BOM format doesn't specify anything further about the expression field. But if we can get an SPDX license expression from that directly I would be in favor of that. Where do I need to look? Does the codebase offer some kind of utility to convert to that?

A CycloneDxLicenseEntry is a union type that holds either a single expression or a CycloneDxLicense with an (SPDX) id, a name and an optional URL. CycloneDx also allows including the license text here, but we currently don't map it, as I believe including the license text for every finding would only bloat the report needlessly.
My argument is that a user or downstream tool that wants the license text can fetch it from the provided URL.

src/formattedcode/output_cyclonedx.py

agschrei · 2021-09-15T17:01:02Z

This is a beautiful PR!

Thanks, I have to say that your contribution guide is really helpful for first-time contributors.

I came up with a few nitpicks for your consideration.

I'll have to set some time aside to implement these changes.

Looking at your comments most of the open points should be quick fixes,
but there's some other stuff like the declared licenses topic that might require additional back and forth.

So there's a good chance I'll try to find you on gitter like you suggested to clarify on that.

One thing I am not sure I get is how licenses or license expressions are stored in CycloneDX.

The way I understood the CycloneDx spec, a component may have a list of License entries where each entry may either have

a license expression as a string
a License Object as the spec calls it

but obviously not both.
So I tried to implement that as an explicit union type with CycloneDxLicenseEntry.
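The "either an expression or a License Object, but not both" rule can be sketched as follows. The class and field names here are hypothetical stand-ins written with stdlib dataclasses; they are not the PR's actual CycloneDxLicenseEntry code.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class License:
    # Hypothetical stand-in for the spec's "License Object".
    id: Optional[str] = None
    name: Optional[str] = None
    url: Optional[str] = None


@dataclass
class LicenseEntry:
    # Exactly one of these two fields may be set.
    expression: Optional[str] = None
    license: Optional[License] = None

    def __post_init__(self):
        # Reject entries that set both fields or neither of them.
        if (self.expression is None) == (self.license is None):
            raise ValueError("set exactly one of 'expression' or 'license'")

    def to_dict(self):
        if self.expression is not None:
            return {"expression": self.expression}
        # Keep only the license attributes that carry a value.
        fields = {k: v for k, v in vars(self.license).items() if v is not None}
        return {"license": fields}


print(LicenseEntry(expression="mit AND unknown").to_dict())
print(LicenseEntry(license=License(id="MIT")).to_dict())
```

Validating in `__post_init__` means an invalid entry fails at construction time and not later during serialization.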

I used Atlassian's great schema viewer to explore the CycloneDx 1.3 schema here if you're interested.

pombredanne · 2021-09-15T21:56:28Z

The way I understood the CycloneDx spec, a component may have a list of License entries where each entry may either have
* a license expression as a string

* a `License Object` as the spec calls it
but obviously not both.
So I tried to implement that as an explicit union type with CycloneDxLicenseEntry.

I used Atlassian's great schema viewer to explore the CycloneDx 1.3 schema here if you're interested.

It makes sense. So I think that we should instead likely:

use only the license_expression field, as a bona fide SPDX expression
to also provide information about the LicenseRef-scancode licenses, which are numerous, we could provide a 'license' externalReference for each non-SPDX license (or for each license), as explained in https://json-schema.app/view/%23/%23%2Fdefinitions%2Fcomponent/%23%2Fdefinitions%2FexternalReference?url=https%3A%2F%2Fraw.githubusercontent.com%2FCycloneDX%2Fspecification%2Fmaster%2Fschema%2Fbom-1.3.schema.json
we could also consider keeping the matched license notice text somewhere in the reference comment, but I do not think this can work, as each license match may have a different notice even when matched to the same expression
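A minimal sketch of the externalReference idea follows. It assumes that non-SPDX licenses appear in the SPDX expression as `LicenseRef-scancode-<key>` and that each key has a page in the ScanCode LicenseDB; the URL template and the function name are assumptions made for illustration, not taken from the PR.

```python
import re

# Assumed URL layout for the public ScanCode LicenseDB; verify before relying on it.
LICENSEDB_URL = "https://scancode-licensedb.aboutcode.org/{key}.html"

LICENSE_REF = re.compile(r"LicenseRef-scancode-([A-Za-z0-9.\-]+)")


def license_external_references(spdx_expression):
    """
    Return a list of CycloneDX-style externalReference mappings of type
    'license', one for each distinct LicenseRef-scancode-* key found.
    """
    references = []
    seen = set()
    for match in LICENSE_REF.finditer(spdx_expression):
        key = match.group(1)
        if key in seen:
            continue
        seen.add(key)
        references.append({
            "type": "license",
            "url": LICENSEDB_URL.format(key=key),
            "comment": f"Information about ScanCode license key {key}",
        })
    return references


expression = "MIT AND (LicenseRef-scancode-unknown AND LicenseRef-scancode-unknown)"
for reference in license_external_references(expression):
    print(reference)
```

Only `type`, `url` and `comment` are used here, which are the properties the CycloneDX 1.3 schema defines for an external reference.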

pombredanne · 2021-09-16T14:56:59Z

@agschrei thanks for the updates. Let me review.

pombredanne · 2021-09-21T08:35:05Z

@agschrei gentle ping. I would like to merge this soon enough.

agschrei · 2021-09-21T20:30:46Z

Most of the requested changes are now implemented in the way I understood them.
Please let me know if something still irks you about the modifications.

The only major thing that is still missing is the correct handling of LicenseRef-* SPDX IDs as suggested above, I will hopefully finish that tomorrow.

I'll reach out to you on the gitter channel to double-check my solution then.

pombredanne · 2021-09-23T10:47:58Z

I triggered a re-run of the test failures which are due to Azure flakiness, not a real error.

pombredanne · 2021-09-23T10:49:09Z

and there is a new lint/check that was merged recently for the documentation and you are failing it... that's an easy fix though:

Run ./scripts/doc8_style_check.sh
source/rst_snippets/output_format_options.rst:34: D001 Line too long
source/cli-reference/output-format.rst:585: D001 Line too long
Error: Process completed with exit code 1.

agschrei · 2021-09-23T12:44:25Z

I just fixed the docs sections that were failing the lint check.

Hopefully everything passes now.

pombredanne · 2021-10-21T14:30:20Z

@agschrei gentle ping... there are a few nitpicks left for your review

agschrei · 2021-10-27T08:03:35Z

@pombredanne I was still waiting on feedback for some of the points you raised. I have now resolved everything that is fixed/considered, but for the points that still remain open I either need additional input or just a nod that this is how you want to proceed. Could you please have a look at the comments there?

Also, should I rebase this onto develop in the meantime to see if it plays nice with the latest changes there?

pombredanne

Sorry for the late review!
See my comments inline for your consideration.

pombredanne · 2021-11-26T11:02:32Z

src/formattedcode/output_cyclonedx.py

+    return (licenses, frozenset(spdx_keys))
+
+
+known_licenses, spdx_ids = _get_set_of_known_licenses_and_spdx_license_ids()


My point was also to move this line into a function and not call the function at import time
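One common way to avoid doing this work at import time is to wrap it in a cached function that runs on first use. The sketch below uses `functools.lru_cache`; the loader `_load_licenses` and its tiny data set are hypothetical placeholders for the real license index lookup.

```python
from functools import lru_cache


def _load_licenses():
    # Hypothetical placeholder for an expensive license index lookup:
    # maps a ScanCode license key to its SPDX id, or None when it has none.
    return {"mit": "MIT", "apache-2.0": "Apache-2.0", "unknown": None}


@lru_cache(maxsize=None)
def get_known_licenses_and_spdx_ids():
    """
    Return a (known license keys, SPDX ids) tuple of frozensets.
    Computed on the first call only, not when the module is imported.
    """
    licenses = _load_licenses()
    spdx_keys = frozenset(spdx for spdx in licenses.values() if spdx)
    return frozenset(licenses), spdx_keys


def is_spdx_id(identifier):
    _known, spdx_ids = get_known_licenses_and_spdx_ids()
    return identifier in spdx_ids


print(is_spdx_id("MIT"), is_spdx_id("unknown"))
```

Importing the module stays cheap, and callers that never need the license data never pay for loading it.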

src/formattedcode/output_cyclonedx.py

pombredanne · 2021-11-26T12:01:43Z

src/formattedcode/output_cyclonedx.py

+    )
+
+
+def merge_components(existing: CycloneDxComponent, new: CycloneDxComponent):


This may be a problem. ScanCode reports all instances of each package, so there can be duplicated purls (and this is evolving further with #2748).
The thing is that each package instance may have the same purl but different attributes, for instance when derived from one manifest versus another format or style, such as a Gemfile, a gemspec, or installed metadata.
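One possible way to keep such instances apart, shown purely as an illustration and not as what this PR does, is to give every package instance its own bom-ref while leaving the purl untouched. The function name, the `datasource` field and the `::<n>` suffix are all arbitrary choices made for this sketch.

```python
from collections import defaultdict


def assign_bom_refs(packages):
    """
    Return one component mapping per package instance. Instances that
    share a purl get distinct bom-ref values instead of being merged.
    """
    counts = defaultdict(int)
    components = []
    for package in packages:
        purl = package["purl"]
        counts[purl] += 1
        occurrence = counts[purl]
        # The first instance reuses the purl; later ones get a numeric suffix.
        bom_ref = purl if occurrence == 1 else f"{purl}::{occurrence}"
        components.append({"bom-ref": bom_ref, **package})
    return components


packages = [
    {"purl": "pkg:gem/rake@13.0.6", "datasource": "Gemfile"},
    {"purl": "pkg:gem/rake@13.0.6", "datasource": "gemspec"},
]
for component in assign_bom_refs(packages):
    print(component)
```

Both entries keep their own attributes, so nothing is lost when two manifests describe the same purl differently.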

src/formattedcode/output_cyclonedx.py

pombredanne · 2021-11-26T12:07:45Z

src/formattedcode/output_cyclonedx.py

+    return obj_dict
+
+
+class CycloneDxEncoder(json.JSONEncoder):


I am open to this but would like to understand your reasoning about it first. Is there some downside to the serialization code that I am not seeing?

We also use a to_dict() function throughout the codebase as a convention.
A JSONEncoder makes the serialization JSON-specific and impossible to use with other presentations such as YAML and I would rather have an explicit field-by-field mapping rather than something implicit.
Also why would you rename fields? Does the format demands to have "bom-ref" vs. "bom_ref" ?

pombredanne · 2021-11-26T12:13:15Z

tests/formattedcode/data/cyclonedx/expected.json

+      "purl": "pkg:npm/abbrev@1.0.7",
+      "licenses": [
+        {
+          "license": {


my reasoning was that the license expression inferred by scancode can sometimes be needlessly complex, see this example from the tests:

There is no such thing as a "needlessly complex expression". There are two cases:

an expression is correct and this should be the default case

an expression is not correct and this a bug

In the case of the has npm package this is a bug and we should treat this correctly
Can you enter a bug?

https://github.com/tarruda/has/blob/8146666148f1812f90e737aeefcded17a8cf81f4/package.json#L24

If license_expression and declared_license are one and the same, we only keep one of the entries, so this approach should never lead to duplicate information.

As explained above the declared_license is not something that is a reliable data point as this has not gone through detection and normalization using the license detection engine, it should never be use as the license in a CycloneDX document.

pombredanne · 2021-12-24T06:26:21Z

my reasoning was that the license expression inferred by scancode can sometimes be needlessly complex, see this example from the tests:

Note that we are working on introducing a "primary license" for packages... which would address your concern. In the meantime, I will merge your branch patched to use the license_expression and not the declared_license.

pombredanne · 2021-12-25T21:21:29Z

@agschrei I merged the latest develop and updated your code according to some of the comments above. Please review before I merge!
Thank you for your patience.

The key changes are outlined in the commit message of 953a40e

Refine CycloneDX support #1888

Only deal with an SPDX license_expression and not plain licenses or declared licenses, because this is the only correct license information.
Remove most type hints since we are not yet checking these.
Remove dependency lookup, which could be misleading. Instead add a warning if dependencies are skipped.
Add docstrings to most methods and functions.
Streamline nested list comprehensions and filter/map constructions and replace these with simpler constructs.
Replace the JSON encoder with a plain to_dict() serialization throughout.
Move attrs attribute declarations towards the top of class declarations rather than at the bottom.
Move method-like functions to their respective objects, including moving XML serialization functions to to_xml_element() methods for each model object.
Use factories for field defaults such as URN and timestamp.
Split the method that works on ScanCode packages lists in two: one operating on a single package mapping and one working on the packages list.
Update tests to work on a larger set of expected data.
Always return all attributes, including empty values for now.
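The "factories for field defaults" item can be illustrated with a short sketch. The merged code uses attrs; this example uses stdlib dataclasses as a stand-in, and the class and helper names are hypothetical. The point is that the URN and timestamp are computed per instance and not once when the class is defined.

```python
import uuid
from dataclasses import dataclass, field
from datetime import datetime, timezone


def new_serial_number():
    # CycloneDX serial numbers are UUID URNs of the form urn:uuid:<uuid>.
    return f"urn:uuid:{uuid.uuid4()}"


def utc_timestamp():
    return datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


@dataclass
class BomHeader:
    # default_factory is called for every new instance, so two BOMs
    # never share a serial number.
    serial_number: str = field(default_factory=new_serial_number)
    timestamp: str = field(default_factory=utc_timestamp)


first, second = BomHeader(), BomHeader()
print(first.serial_number != second.serial_number)
```

A plain default such as `serial_number: str = new_serial_number()` would be evaluated once at class creation and reused by every instance, which is the pitfall factories avoid.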

pombredanne · 2021-12-29T11:00:01Z

All green now. Merging!

agschrei added 16 commits September 8, 2021 09:22

add scaffolding for cyclonedx output plugin aboutcode-org#1888

487851f

introduce new --cyclonedx{-json} output options and register associated plugins in setup.cfg Signed-off-by: Alexander Gschrei <alexander.gschrei@fau.de>

extract CycloneDX components from codebase packages aboutcode-org#1888

799d72d

Signed-off-by: Alexander Gschrei <alexander.gschrei@fau.de>

move external reference mapping into separate function aboutcode-org#…

b9877d6

…1888 suppress keys with None values from output Signed-off-by: Alexander Gschrei <alexander.gschrei@fau.de>

refine mapping of declared_license fields to CycloneDx licenses about…

24f64e9

…code-org#1888 Signed-off-by: Alexander Gschrei <alexander.gschrei@fau.de>

filter duplicate license entries from output aboutcode-org#1888

73f50bc

Signed-off-by: Alexander Gschrei <alexander.gschrei@fau.de>

add custom JSONEncoder aboutcode-org#1888

a2a99f1

added custom JSONEncoder to rename fields in compliance with CycloneDx spec Signed-off-by: Alexander Gschrei <alexander.gschrei@fau.de>

merge components avoiding duplicates aboutcode-org#1888

8bf7fc7

avoid duplicate component entries by merging two components with the same bom-ref Signed-off-by: Alexander Gschrei <alexander.gschrei@fau.de>

add indentation to make json output pretty-print aboutcode-org#1888

5110639

Signed-off-by: Alexander Gschrei <alexander.gschrei@fau.de>

check in expected result for cyclonedx json output aboutcode-org#1888

9c9c60b

Signed-off-by: Alexander Gschrei <alexander.gschrei@fau.de>

add XML serialization aboutcode-org#1888

bcce29c

Signed-off-by: Alexander Gschrei <alexander.gschrei@fau.de>

extract optional scope from dependency tree aboutcode-org#1888

f558da8

Signed-off-by: Alexander Gschrei <alexander.gschrei@fau.de>

docs: add cli-reference stub about new output options aboutcode-org#1888

ac698ad

Signed-off-by: Alexander Gschrei <alexander.gschrei@fau.de>

test: verify removal of None or empty collections aboutcode-org#1888

513c5c8

Signed-off-by: Alexander Gschrei <alexander.gschrei@fau.de>

handle scan results with empty packages aboutcode-org#1888

3ee1601

do not serialize packages without name or version Signed-off-by: Alexander Gschrei <alexander.gschrei@fau.de>

CHANGELOG: add info about CycloneDx output plugin aboutcode-org#1888

b81a2fd

Signed-off-by: Alexander Gschrei <alexander.gschrei@fau.de>

test: update test_scan_cli_help test resource with new output options a…

8346eb8

…boutcode-org#1888 Signed-off-by: Alexander Gschrei <alexander.gschrei@fau.de>

agschrei force-pushed the add-cyclonedx-output-option branch from 38e3913 to 8346eb8 Compare September 8, 2021 07:44

pombredanne requested changes Sep 15, 2021

agschrei added 3 commits September 15, 2021 22:54

replace xml with lxml to allow easy pretty-printing aboutcode-org#1888

80ff326

Signed-off-by: Alexander Gschrei <alexander.gschrei@fau.de>

rename cyclonedx cli options to favor json over xml aboutcode-org#1888

055c951

Signed-off-by: Alexander Gschrei <alexander.gschrei@fau.de>

reorder JSON keys and align with project's coding style aboutcode-org…

ef4748e

…#1888 put bom-ref, type and scope keys higher up in component entries, consistently use single quotes Signed-off-by: Alexander Gschrei <alexander.gschrei@fau.de>

agschrei added 2 commits September 16, 2021 00:34

apply black formatting without string normalization aboutcode-org#1888

1df94fc

Signed-off-by: Alexander Gschrei <alexander.gschrei@fau.de>

fix failing test by updating expected result aboutcode-org#1888

5d15d30

Signed-off-by: Alexander Gschrei <alexander.gschrei@fau.de>

rename variables and reliably extract scancode-toolkit header aboutco…

8bf905c

…de-org#1888 Signed-off-by: Alexander Gschrei <alexander.gschrei@fau.de>

agschrei added 2 commits September 21, 2021 21:02

refactor: turn standalone functions into class methods aboutcode-org#…

e633ba6

…1888 Signed-off-by: Alexander Gschrei <alexander.gschrei@fau.de>

map multiple authors and use license db url as fallback aboutcode-org…

d26d3b1

…#1888 Signed-off-by: Alexander Gschrei <alexander.gschrei@fau.de>

add external refs for scancode spdx ids aboutcode-org#1888

9f8d8d4

Signed-off-by: Alexander Gschrei <alexander.gschrei@fau.de>

docs: break lines at 100 chars length to pass lint check aboutcode-or…

da91fee

…g#1888 Signed-off-by: Alexander Gschrei <alexander.gschrei@fau.de>

pombredanne requested changes Nov 26, 2021

pombredanne added 4 commits December 24, 2021 07:37

Merge latest develop branch

93f046a

Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>

Sort and streamline imports

7500b64

Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>

Add new build_spdx_license_expression() function

4e2526c

This new function returns an SPDX license expression string from a ScanCode license_expression string. Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
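As a rough idea of what such a conversion involves, the sketch below rewrites each ScanCode license key to its SPDX id and falls back to a `LicenseRef-scancode-<key>` identifier when no SPDX id is known. It is not the function added in this commit: the name `to_spdx_expression` and the small key table are hypothetical, and a real implementation would parse the expression instead of substituting tokens.

```python
import re

# Hypothetical subset of the ScanCode key to SPDX id mapping.
SPDX_KEYS = {
    "mit": "MIT",
    "apache-2.0": "Apache-2.0",
    "gpl-2.0-plus": "GPL-2.0-or-later",
}
OPERATORS = {"and": "AND", "or": "OR", "with": "WITH"}


def to_spdx_expression(license_expression):
    """
    Return an SPDX-style expression string built from a ScanCode
    license expression string, keeping parentheses and operators.
    """
    def convert(match):
        token = match.group(0)
        lowered = token.lower()
        if lowered in OPERATORS:
            return OPERATORS[lowered]
        # Keys without an SPDX id become LicenseRef-scancode-* references.
        return SPDX_KEYS.get(lowered, f"LicenseRef-scancode-{token}")

    # Every run of characters that is neither whitespace nor a parenthesis
    # is either an operator or a license key.
    return re.sub(r"[^\s()]+", convert, license_expression)


print(to_spdx_expression("mit AND (apache-2.0 OR unknown)"))
```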

Revert change in SDPX tests XML loading

af286da

The ways of XML and encoding are sometimes complex. Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>

pombredanne merged commit 4a97637 into aboutcode-org:develop Dec 29, 2021

eastmadc mentioned this pull request Aug 31, 2022

Add support for CycloneDX Output format #1888

Closed
