RFC: Improve tabular output formats #3043

pombredanne · 2022-08-05T10:28:59Z

The current CSV output is a mess, albeit a convenient mess. We need something better, and quickly. I suggest these short-term and long-term actions:

For now in v31:

Deprecate and hide the CLI help the current --csv option so that we do not further the mess in r31.... drop it entirely in v32 #3047
The deprecation message should have a link to this issue.

In v32:

Create a new --csv-file option that would only list file level details in this way:
- file info columns, license expression, copyright, holder, urls, emails, "for_package".
- no line number columns, no package data
- exactly one row per file
- multiple values are joined with a line break in a single cell
- long values that exceed office tools' cell limits are truncated
Create a new --csv-package option that would only list package details:
- one row per package instance
Create a new --csv-dependency option that would only list dependency details:
- one row per dependency instance
Create a new --csv-license option that would only list file level license scan information, used for debugging and hidden from the CLI help:
- one row for each license match with diagnostic details for each match (rule, scores, etc)
Drop the hidden --csv option in v32
Add a new XLSX option that creates a proper spreadsheet with multiple tabs:
- files
- packages
- dependencies
- summary
- and possibly: file "package_data", file "license_detections"
  where these essentially mirror the new csv-* options
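
A minimal sketch of the proposed --csv-file behavior (exactly one row per file, multiple values joined with a line break in a single cell, long values truncated). The input dict shape, the column names, and the helper names are hypothetical and are not ScanCode's actual data model or API. The cap assumes Excel's limit of 32,767 characters per cell.

```python
import csv
import io

# Assumption: cap cells at Excel's per-cell limit of 32,767 characters.
MAX_CELL_CHARS = 32_767

# Hypothetical column names, loosely following the list in the proposal.
FIELDNAMES = [
    "path", "license_expression", "copyrights",
    "holders", "urls", "emails", "for_packages",
]


def to_cell(values, limit=MAX_CELL_CHARS):
    """Join several values with line breaks into one cell, then truncate."""
    text = "\n".join(str(value) for value in values)
    return text[:limit]


def file_row(file_info):
    """Build exactly one row for one file record (hypothetical dict shape)."""
    row = {
        "path": file_info["path"],
        "license_expression": file_info.get("license_expression") or "",
    }
    # Every remaining column is multi-valued and collapsed into a single cell.
    for name in FIELDNAMES[2:]:
        row[name] = to_cell(file_info.get(name) or [])
    return row


def write_file_csv(files, stream):
    """Write a header plus one row per file to an open text stream."""
    writer = csv.DictWriter(stream, fieldnames=FIELDNAMES)
    writer.writeheader()
    for file_info in files:
        writer.writerow(file_row(file_info))
```

Truncating after the join keeps the one-row-per-file guarantee even for files with very many detections, at the cost of silently losing the tail of a long cell.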

For reference, we have these related issues:


Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>

AyanSinhaMahapatra · 2022-11-14T11:06:00Z

I was looking into https://www.python-excel.org/ for what we can use to implement this:

https://foss.heptapod.net/openpyxl/openpyxl/ has pandas in its requirements, so it is probably not something we can use.

Others are libraries that only use the standard library, no other requirements:

https://github.com/jmcnamara/XlsxWriter seems the best choice, actively maintained and lots of functionality. Author maintains tools in other languages for the same too :P
https://github.com/PydPiper/pylightxl seems to be lightweight and can be used by vendoring a single file

https://github.com/python-excel/xlwt was widely used, but it is for old Excel formats and is no longer actively maintained.

@pombredanne what do you think?

pombredanne · 2022-11-14T11:35:02Z

@AyanSinhaMahapatra IMHO just reuse what is used in SCIO
See https://github.com/nexB/thirdparty-packages/blob/main/pypi/XlsxWriter-3.0.3.tar.gz.ABOUT
See https://github.com/nexB/scancode.io/blob/c4523ea0ab5d19d5dd22a5b5af785bac05be583c/scanpipe/views.py#L339

rspier · 2023-04-21T18:02:37Z

I'm not disputing that the current CSV file is unwieldy, but it's easier to process automatically than multiple tabs. It contains all the data, which is sometimes what you want in a hurry.

My other approach has been to process the JSON into a custom CSV, but sometimes that's a hassle.
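
For illustration, a rough sketch of that JSON-to-custom-CSV approach, emitting one row per finding with its line range. The key names (`files`, `copyrights`, `holders`, `emails`, `urls`, `start_line`, `end_line`) are assumptions about the scan JSON layout and may differ between ScanCode versions. The function names are hypothetical.

```python
import csv
import io
import json

# Assumed layout: each file has lists of dicts, e.g. "copyrights" holding
# items like {"copyright": "...", "start_line": 1, "end_line": 1}.
FINDING_KEYS = (
    ("copyrights", "copyright"),
    ("holders", "holder"),
    ("emails", "email"),
    ("urls", "url"),
)

COLUMNS = ["path", "kind", "value", "start_line", "end_line"]


def finding_rows(scan):
    """Yield one flat row per finding in a parsed scan dict."""
    for file_info in scan.get("files", []):
        for list_key, value_key in FINDING_KEYS:
            # "or []" tolerates keys that are missing or explicitly null.
            for item in file_info.get(list_key) or []:
                yield {
                    "path": file_info.get("path", ""),
                    "kind": value_key,
                    "value": item.get(value_key, ""),
                    "start_line": item.get("start_line", ""),
                    "end_line": item.get("end_line", ""),
                }


def json_to_csv(json_text, stream):
    """Convert scan JSON text into a long-format CSV on an open text stream."""
    writer = csv.DictWriter(stream, fieldnames=COLUMNS)
    writer.writeheader()
    for row in finding_rows(json.loads(json_text)):
        writer.writerow(row)
```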

Would you consider keeping the existing file around as --csv-everything-yolo or similar?

pombredanne · 2023-04-22T18:31:01Z

@rspier you wrote:

Would you consider keeping the existing file around as --csv-everything-yolo or similar?

Of course, but then maybe we can design it so that it has everything AND not too much at the same time, so that it is compact and efficient to review? For instance, returning the start and end lines of copyrights and license matches may not be needed there, and we could design something that has the key data and could fit on a single row per file?

rspier · 2023-04-24T22:09:51Z

I think "compact and efficient to review" might be something that comes from the other views, while the "big one" is for those cases where you want all the data, or things that aren't in the other formats. It's significantly easier to hide/remove data than it is to add/merge it back in, which is one reason I'd lean towards having a CSV with too much info.

For example, I often use matched_rule__identifier so I can see which rule matched and judge whether it's likely a false positive (e.g. lGpL).

There are a lot of nested fields that I would drop before I dropped start and end line. For example:

license__homepage_url
license__text_url
license__reference_url
license__scancode_text_url
license__scancode_data_url
license__spdx_license_key
license__spdx_url

are highly repetitive of each other and other fields (like license__key)

also license__name vs license__short_name.
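
As a sketch of the "easier to remove than to add back" point, the repetitive columns listed above can be dropped from an existing CSV with the standard library alone. The function name and the `path` column in the usage are hypothetical. The dropped column names are the ones listed in this comment.

```python
import csv
import io

# Columns named above as highly repetitive of each other.
REDUNDANT_COLUMNS = frozenset({
    "license__homepage_url",
    "license__text_url",
    "license__reference_url",
    "license__scancode_text_url",
    "license__scancode_data_url",
    "license__spdx_license_key",
    "license__spdx_url",
})


def drop_columns(in_stream, out_stream, drop=REDUNDANT_COLUMNS):
    """Copy a CSV while leaving out the named columns; return kept headers."""
    reader = csv.DictReader(in_stream)
    kept = [name for name in (reader.fieldnames or []) if name not in drop]
    # extrasaction="ignore" makes the writer skip the dropped keys in each row.
    writer = csv.DictWriter(out_stream, fieldnames=kept, extrasaction="ignore")
    writer.writeheader()
    for row in reader:
        writer.writerow(row)
    return kept
```

Usage would be along the lines of `drop_columns(open("scan.csv", newline=""), open("slim.csv", "w", newline=""))`, where both file names are placeholders.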

An alternative idea would be to leverage --custom-template FILE by providing pre-made "kitchen sink" templates, e.g. /path/to/scancode/lib/templates/oldstyle.csv

armijnhemel · 2023-11-24T16:07:28Z

The current CSV doesn't include matched_text which is what most of the people I work with are actually interested in.

pombredanne · 2023-11-24T16:46:42Z

@armijnhemel re:

The current CSV doesn't include matched_text which is what most of the people I work with are actually interested in.

The problem is that this is too big in practice to be routinely included in a CSV... all commercial and libre spreadsheets I know of will choke on an AGPL matched text :]

LibreOffice: likely 2**31 chars, but very slow with tens of thousands
Google Sheets: 50,000 chars
MS Excel: 32,767 chars

All tools start to choke with a few thousand characters in a cell.
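
A small sketch of how a writer could check a value against these limits before putting it in a cell. The numbers are the ones quoted above (the LibreOffice figure is only "likely"), and the function name is hypothetical.

```python
# Per-cell character limits as quoted above; the LibreOffice value is a guess.
CELL_LIMITS = {
    "LibreOffice": 2**31,
    "Google Sheets": 50_000,
    "MS Excel": 32_767,
}


def tools_that_reject(text, limits=CELL_LIMITS):
    """Return the sorted names of tools whose cell limit the text exceeds."""
    return sorted(name for name, limit in limits.items() if len(text) > limit)
```

This only covers the hard limits. It says nothing about the slowdown that starts at a few thousand characters.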

The way out is IMHO to use ScanCode.io or the ScanCode Workbench, which both display the license detection and matched text loaded from a JSON scan, even if you do not have the original scanned code.

pombredanne added must have GUI and outputs Priority: high labels Aug 5, 2022

pombredanne modified the milestones: v31.0, v32.0 Aug 11, 2022

pombredanne mentioned this issue Aug 17, 2022

Deprecate and hide the CLI help the current --csv option so that we do not further the mess in r31.... drop it entirely in v32 #3047

Closed

pombredanne added a commit that referenced this issue Aug 17, 2022

Use correct link to CSV issue #3043

cb86166

Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>

AyanSinhaMahapatra modified the milestones: v32.0, v32.1 Jan 4, 2023
