Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RFC: Improve tabular output formats #3043

Open
1 of 17 tasks
pombredanne opened this issue Aug 5, 2022 · 7 comments
Open
1 of 17 tasks

RFC: Improve tabular output formats #3043

pombredanne opened this issue Aug 5, 2022 · 7 comments

Comments

@pombredanne
Copy link
Member

pombredanne commented Aug 5, 2022

The current CSV output is a mess, albeit a convenient mess. We need something and quick. I suggest these short term and long term actions

For now in v31:

In v32:

  • Create a new --csv-file option that would only list file level details in this way:

    • file info columns, license expression, copyright, holder, urls, emails, "for_package".
    • no line number columns, no package data
    • exactly one row per file
    • multiple values are joined with a line in a single cell
    • long values that exceed offices tools limits are truncated
  • Create a new --csv-package option that would only list package details:

    • one row per package instance
  • Create a new --csv-dependency option that would only list dependency details:

    • one row per dependency instance
  • Create a new --csv-license option that would only list file level license scan information, used for debugging and hidden from the CLI help:

    • one row for each license match with diagnostic details for each match (rule, scores, etc)
  • Drop the hidden --csv option in v32

  • Add new XLSX option that creates a proper spreadsheet with multiples tabs

    • files
    • packages
    • dependencies
    • summary
    • and possibly: file "package_data", file "license_detections"
      where these essentially mirror the new csv-* options

For reference, we have these related issues:

@AyanSinhaMahapatra
Copy link
Member

I was looking into https://www.python-excel.org/ for what we can use to implement this:

https://foss.heptapod.net/openpyxl/openpyxl/ has pandas in it's requirements so probably not something we can use.

Others are libraries that only use the standard library, no other requirements:

https://github.com/jmcnamara/XlsxWriter seems the best choice, actively maintained and lots of functionality. Author maintains tools in other languages for the same too :P
https://github.com/PydPiper/pylightxl seems to lightweight and can be used by vendoring a single file

https://github.com/python-excel/xlwt was widely used but this is for old excel formats and not actively maintained anymore.

@pombredanne what do you think?

@pombredanne
Copy link
Member Author

@ayan Sinha Mahapatra IMHO just reuse what is used in SCIO
See https://github.com/nexB/thirdparty-packages/blob/main/pypi/XlsxWriter-3.0.3.tar.gz.ABOUT
See https://github.com/nexB/scancode.io/blob/c4523ea0ab5d19d5dd22a5b5af785bac05be583c/scanpipe/views.py#L339

@AyanSinhaMahapatra AyanSinhaMahapatra modified the milestones: v32.0, v32.1 Jan 4, 2023
@rspier
Copy link

rspier commented Apr 21, 2023

Not arguing that the current CSV file is unwieldy, but it's easier to automatically process than multiple-tabs. It contains all the data, which is sometimes what you want in a hurry.

My other approach has been to process the json into a custom CSV, but sometimes that's a hassle.

Would you consider keeping the existing file around as --csv-everything-yolo or similar?

@pombredanne
Copy link
Member Author

pombredanne commented Apr 22, 2023

@rspier you wrote:

Would you consider keeping the existing file around as --csv-everything-yolo or similar?

of course, but then we can may be design it so that it has everything AND not too much at the same time, so this is compact and efficient to review? For instance, returning the start and end lines of copyrights and license matches may not be needed there, and we could design something that has the key data that could fit on a single row per file?

@rspier
Copy link

rspier commented Apr 24, 2023

I think "compact and efficient to review" might be something that comes from the other views, while the "big one" is for those cases where you want all the data, or things that aren't in the other formats. It's significantly easier to hide/remove data than it is to add/merge it back in, which is one reason I'd lean towards having a CSV with too much info.

For example, I often use matched_rule__identifier so I can see what the rule was to determine if it's likely a false positive. (i.e. lGpL)

There are a lot of nested fields that I would drop before I dropped start and end line. For example:

  • license__homepage_url
  • license__text_url
  • license__reference_url
  • license__scancode_text_url
  • license__scancode_data_url
  • license__spdx_license_key
  • license__spdx_url

are highly repetitive of each other and other fields (like license__key)

also license__name vs license__short_name.

An alternative idea would be to leverage --custom-template FILE by providing pre-made "kitchen sink" templates. i.e. /path/to/scancode/lib/templates/oldstyle.csv

@armijnhemel
Copy link
Contributor

The current CSV doesn't include matched_text which is what most of the people I work with are actually interested in.

@pombredanne
Copy link
Member Author

@armijnhemel re:

The current CSV doesn't include matched_text which is what most of the people I work with are actually interested in.

The problem is that this is too big in practice to be routinely included in a CSV... all commercial and libre spreadsheets I know off will choke with an AGPL matched text :]

  • Libreoffice: likely 2**31 but very slow with tens of thousands
  • Google sheet: 50,000 chars
  • Ms Excel: 32,767 chars

All tools start to choke with a few 1000.

The way out is IMHO using ScanCode.io or the ScanCode workbench that both display the license detection and matched text loaded from a JSON scan even if you do not have the original scanned code.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants