Combine license matches in new LicenseDetection #2961

Merged: 86 commits, Nov 11, 2022
a6e2941
Import functions from `scancode-analyzer`
AyanSinhaMahapatra May 17, 2022
ce84de8
Modify LicenseDetection attributes
AyanSinhaMahapatra May 17, 2022
466e0e0
Create LicenseDetection serialization function
AyanSinhaMahapatra May 17, 2022
8de3111
Enable LicenseDetection in API
AyanSinhaMahapatra May 17, 2022
77ccbc9
Remove the --unknown-licenses option
AyanSinhaMahapatra May 17, 2022
3bd00bc
Update local license dereferencing code
AyanSinhaMahapatra May 17, 2022
56facc3
Update unknown license dereferencing for key-files
AyanSinhaMahapatra May 17, 2022
391383c
Update intro rules with is_license_intro as True
AyanSinhaMahapatra May 17, 2022
5b0dc11
Add test files for LicenseDetection
AyanSinhaMahapatra May 17, 2022
ce19d7a
Add LicenseDetection full scan tests
AyanSinhaMahapatra May 17, 2022
578bd8d
Add function to create index from test rules folders
AyanSinhaMahapatra May 17, 2022
dc25cd4
Remove unwanted code, update docstrings
AyanSinhaMahapatra May 24, 2022
d70acde
Move LicenseDetection tests to new file
AyanSinhaMahapatra May 24, 2022
b4ddc77
Modify LicenseMatch data in results #2416
AyanSinhaMahapatra May 24, 2022
6a441da
Add --licenses-reference option
AyanSinhaMahapatra May 25, 2022
bca7f4b
Add docs for LicenseDetection and detail referencing
AyanSinhaMahapatra May 25, 2022
d57f390
Do not summurize license detections
AyanSinhaMahapatra May 31, 2022
cc945a0
Apply LicenseDetection everywhere
AyanSinhaMahapatra May 31, 2022
6a91773
Regenerate test expectations for LicenseDetection
AyanSinhaMahapatra May 31, 2022
5826429
Merge branch 'develop' into add-license-detection
AyanSinhaMahapatra May 31, 2022
3e68f76
Fix SPDX and other output plugins
AyanSinhaMahapatra Jun 1, 2022
4fc7d24
Rename license fields for resource
AyanSinhaMahapatra Jun 8, 2022
5e20984
Update test expectations for license field renaming
AyanSinhaMahapatra Jun 8, 2022
2e3dd3e
Rename and Add package license attributes
AyanSinhaMahapatra Jul 12, 2022
855cdef
Add functions for package LicenseDetection
AyanSinhaMahapatra Jul 12, 2022
802c085
Modify package parsers to adopt new LicenseDetection
AyanSinhaMahapatra Jul 12, 2022
ab677c6
Align to package LicenseDetection
AyanSinhaMahapatra Jul 12, 2022
552f173
Modify system packages to use LicenseDetection
AyanSinhaMahapatra Jul 12, 2022
e94a19e
Add new license key `undetected-license`
AyanSinhaMahapatra Jul 12, 2022
0462ef4
Regenerate test expectations for package LicenseDetection
AyanSinhaMahapatra Jul 12, 2022
fd9cd57
Merge branch 'develop' into add-license-detection
AyanSinhaMahapatra Jul 12, 2022
1147381
Fix test failures
AyanSinhaMahapatra Jul 12, 2022
8929571
Fix pypi setup.py email bug
AyanSinhaMahapatra Jul 17, 2022
9df811e
Add manifest license references detection
AyanSinhaMahapatra Jul 18, 2022
c47ce89
Add license from file if empty manifest license
AyanSinhaMahapatra Jul 18, 2022
fce8e7c
Add feature to get package license from sibling file
AyanSinhaMahapatra Jul 18, 2022
eb88b56
Allow package LicenseDetection without --licenses
AyanSinhaMahapatra Jul 21, 2022
67872a4
Regen datadriven LicenseDetections
AyanSinhaMahapatra Jul 21, 2022
9a93552
Revert to `rule_identifier`
AyanSinhaMahapatra Jul 21, 2022
ce11ae4
Remove `undetected-license` in favour of `unknown`
AyanSinhaMahapatra Jul 21, 2022
7fc32ee
Regen test expectations and fix tests
AyanSinhaMahapatra Jul 22, 2022
5aae457
Reorder license expressions and detections in result
AyanSinhaMahapatra Jul 22, 2022
4c1c129
Add fucntions to test license detection with subset of rules
AyanSinhaMahapatra Jul 31, 2022
86ec441
Fix datadriven license test errors
AyanSinhaMahapatra Jul 31, 2022
ca823b7
Fix debian_copyright test failure
AyanSinhaMahapatra Jul 31, 2022
81793dd
Address review feedback
AyanSinhaMahapatra Aug 1, 2022
fb8f492
Add tests from eclipse foundation issues
AyanSinhaMahapatra Aug 3, 2022
08cb42d
Merge branch 'develop' into add-license-detection
AyanSinhaMahapatra Aug 4, 2022
dca0371
Regenerate test expectations after merging develop
AyanSinhaMahapatra Aug 4, 2022
174a097
Add `other_license*` attributes for packages #2065
AyanSinhaMahapatra Aug 8, 2022
c8ca7a3
Remove `compute_normalized_license` functions
AyanSinhaMahapatra Aug 9, 2022
539ebed
Add `default_license_relation` attribute to handlers
AyanSinhaMahapatra Aug 9, 2022
6f42f6f
Support NuGet license URLs #3037
AyanSinhaMahapatra Aug 11, 2022
4f29860
Fix test failures
AyanSinhaMahapatra Aug 11, 2022
0450138
Fix rpm tests
AyanSinhaMahapatra Aug 18, 2022
21648a5
Merge branch 'develop' into add-license-detection
AyanSinhaMahapatra Aug 18, 2022
57a3c2a
Fix test failures and expectations after merging develop
AyanSinhaMahapatra Aug 18, 2022
f1ee5f6
Do not return empty dict as exctracted license
AyanSinhaMahapatra Aug 19, 2022
bae1a30
Fix csv output after adding LicenseDetection`
AyanSinhaMahapatra Aug 19, 2022
b1e422c
Tag intro rule properly
AyanSinhaMahapatra Sep 5, 2022
1b9a8f7
Fix package license expression None bug
AyanSinhaMahapatra Sep 5, 2022
eadb7bd
Also classify license intro in false positives list
AyanSinhaMahapatra Sep 6, 2022
64bd3d8
Merge branch 'develop' into add-license-detection
AyanSinhaMahapatra Sep 6, 2022
8edfb92
Make LicenseMatch grouping affected by presence of license intro
AyanSinhaMahapatra Sep 6, 2022
0d8fba9
Modify license references to also effect clues
AyanSinhaMahapatra Sep 19, 2022
08afdfd
Restore returning whole_lines by default
AyanSinhaMahapatra Sep 19, 2022
6e2cad3
Add libxml files as unknown license reference test
AyanSinhaMahapatra Sep 20, 2022
a5c52fa
Tag license intro rules correctly
AyanSinhaMahapatra Sep 29, 2022
02ab56c
Update false positives and unknown intro heuristics
AyanSinhaMahapatra Oct 3, 2022
7d5c647
Add unknown license reference to package dereferencing #2965 #1379
AyanSinhaMahapatra Oct 13, 2022
8721bdf
Improve unknown reference to package dereferencing #2965 #1379
AyanSinhaMahapatra Oct 18, 2022
5b0efe4
Merge branch 'develop' into add-license-detection
AyanSinhaMahapatra Oct 18, 2022
afd4025
Merge branch 'develop' into add-license-detection
AyanSinhaMahapatra Nov 3, 2022
b1c999f
Rename `detection_rules` to `detection_log`
AyanSinhaMahapatra Nov 6, 2022
f0213ec
Update license clues heuristics
AyanSinhaMahapatra Nov 7, 2022
5de2744
Add `rule_url` and update scancode URLs
AyanSinhaMahapatra Nov 7, 2022
fd3f1d7
Restore the --unknown-licenses experimental CLI option
AyanSinhaMahapatra Nov 8, 2022
637c4dd
Adjust unknown licenses heuristics
AyanSinhaMahapatra Nov 8, 2022
01b1de4
Replace `package` in referenced_filename
AyanSinhaMahapatra Nov 8, 2022
5768c8c
Update docstrings and improve readability
AyanSinhaMahapatra Nov 9, 2022
0e00d28
Update changelog and docs
AyanSinhaMahapatra Nov 10, 2022
5ebca43
Improve CHANGELOG
pombredanne Nov 11, 2022
18aca01
Update CHANGELOG to remove duplications
AyanSinhaMahapatra Nov 11, 2022
05d163a
Bump output format to v3
pombredanne Nov 11, 2022
d264aec
Update CHANGELOG
pombredanne Nov 11, 2022
8f07fdf
Bump version
pombredanne Nov 11, 2022
168 changes: 121 additions & 47 deletions CHANGELOG.rst
@@ -1,19 +1,36 @@
Changelog
=========

v33.0.0 (next next, roadmap)

----------------------------

v32.0.0 (next next, roadmap)
----------------------------------

Package detection:
~~~~~~~~~~~~~~~~~~

- We now support new package manifest formats:

- OpenWRT packages.
- Yocto/BitBake .bb recipes.


v32.0.0 (next, roadmap)
-----------------------

Important API changes:
~~~~~~~~~~~~~~~~~~~~~~

This is a major release with major API and output format changes and significant
feature updates.

In particular, we changed the output format for licenses and packages, and
we changed some of the command line options.

The output format version is now 3.0.0



Package detection:
~~~~~~~~~~~~~~~~~~

- Update ``GemfileLockParser`` to track the gem which the Gemfile.lock is for,
which we assign to the new ``GemfileLockParser.primary_gem`` field. Update
``GemfileLockHandler.parse()`` to handle the case where there is a primary gem
@@ -39,48 +56,6 @@ Package detection:

https://github.com/nexB/scancode-toolkit/issues/3081

License detection:
~~~~~~~~~~~~~~~~~~~

- There is a major update to license detection where we now combine one or
more matches in a larger license detection. This removes a large number of
false positive or ambiguous license detections.

- The data structure of the JSON output has changed for licenses. We now
return match details once for each matched license expression rather than
once for each license in a matched expression. There is a new top-level
"license_references" attribute that contains the data details for each
detected license only once. This data can contain the reference license text
as an option.

- There is a new "scancode-reindex-licenses" command that replaces the
"scancode --reindex-licenses" command line option which has been
removed. This new command supports simpler reindexing using custom
license texts and license rules contributed by plugins or stored in an
additional directory. The "--reindex-licenses-for-all-languages" CLI option
is also moved to the "scancode-reindex-licenses" command as an option
"--all-languages".

- We can now detect licenses using custom license texts and license rules.
These can be provided as a one off in a directory or packaged as a plugin
for consistent reuse and deployment. There is an option "--additional-directory"
with the "scancode-reindex-licenses" command and also a new "--only-builtin"
option to only use the builtin licenses to build the cache.

- Scancode LICENSE and RULE files now also contain their data as YAML frontmatter,
which previously used to be in their respective YAML files. This reduces number of
files in those directories, 'rules' and 'licenses' to half. Git line history is
preserved for the files.

- A new command line option "--get-license-data" is added to dump license data in
JSON, YAML and HTML formats, and also generates a local index and a static website
to view the data. This will essentially be an API/way to get scancode license data
as opposed to just reading the files.


Package detection:
~~~~~~~~~~~~~~~~~~~~~

- Code for parsing a Maven POM, npm package.json, freebsd manifest and haxelib
JSON have been separated into two functions: one that creates a PackageData
object from the parsed Resource, and another that calls the previous function
@@ -89,6 +64,105 @@ Package detection:
libraries.


License detection:
~~~~~~~~~~~~~~~~~~~

- This is a major update to license detection where we now combine one or more
license matches in a larger license detection. This approach improves the
accuracy of license detection and removes a large number of false positive
or ambiguous license detections. For details, see
https://github.com/nexB/scancode-toolkit/issues/2878

- The data structure of the JSON output has changed for licenses at file level:

- The ``licenses`` attribute is deleted.

- A new ``license_detections`` attribute contains license detections in that file.
This object has three attributes: ``license_expression``, ``detection_log``
and ``matches``. ``matches`` is a list of license matches and is roughly
the same as ``licenses`` in the previous version with additional structure
changes detailed below.

- A new attribute ``license_clues`` contains license matches with the
same data structure as the ``matches`` attribute in ``license_detections``.
This contains license matches that are mere clues and were not considered
to be a proper conclusive license detection.

- The ``license_expressions`` list of license expressions is deleted and
replaced by a ``detected_license_expression`` single expression.
Similarly ``spdx_license_expressions`` was removed and replaced by
``detected_license_expression_spdx``.

- See `license updates documentation <https://scancode-toolkit.readthedocs.io/en/latest/explanations/license-detection-reference.html#change-in-license-data-format-resource>`_
for examples and details.
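
As a rough illustration of the file-level layout described above, the following Python sketch builds a hand-written resource entry and reads the new attributes. Only the attribute names come from the changelog. The path, expressions, the shape of ``detection_log`` and of each match, and the ``summarize_file_licenses`` helper are assumptions made for the example and are not part of ScanCode.

```python
# Hypothetical file-level entry in the new output format.
# Only the attribute names come from the changelog; every value is made up.
resource = {
    "path": "src/example.c",
    "detected_license_expression": "mit AND apache-2.0",
    "detected_license_expression_spdx": "MIT AND Apache-2.0",
    "license_detections": [
        {
            "license_expression": "mit",
            "detection_log": [],  # shape assumed: a list of log entries
            "matches": [{"license_expression": "mit"}],
        },
        {
            "license_expression": "apache-2.0",
            "detection_log": [],
            "matches": [
                {"license_expression": "apache-2.0"},
                {"license_expression": "apache-2.0"},
            ],
        },
    ],
    # Clues use the same structure as entries in "matches".
    "license_clues": [{"license_expression": "unknown-license-reference"}],
}


def summarize_file_licenses(res):
    """Return (expression, number of matches) for each detection in a file."""
    return [
        (det["license_expression"], len(det["matches"]))
        for det in res.get("license_detections", [])
    ]


summary = summarize_file_licenses(resource)
clue_count = len(resource.get("license_clues", []))
```

Code that used to iterate over the old ``licenses`` list would instead walk ``license_detections`` and then each detection's ``matches``, treating ``license_clues`` separately.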

- The data structure of license attributes in ``package_data`` and the codebase
level ``packages`` has been updated accordingly:

- There is a new ``license_detections`` attribute for the primary, top-level
declared licenses of a package and an ``other_license_detections`` attribute
for the other secondary detections.

- The ``license_expression`` is replaced by the ``declared_license_expression``
and ``other_license_expression`` attributes with their SPDX counterparts
``declared_license_expression_spdx`` and ``other_license_expression_spdx``.
These expressions are parallel to detections.

- The ``declared_license`` attribute is renamed ``extracted_license_statement``
and is now a YAML-encoded string.

See `license updates documentation <https://scancode-toolkit.readthedocs.io/en/latest/explanations/license-detection-reference.html#change-in-license-data-format-package>`_
for examples and details.
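
The renames listed above can be sketched as a small key-mapping helper. This is a minimal sketch under stated assumptions: ``upgrade_package_license_keys`` is a hypothetical function, the input is a simplified old-format package dict, and JSON encoding is used as a stand-in for the YAML encoding of ``extracted_license_statement``. It only renames keys; the detection attributes are left empty because only a real scan can populate them.

```python
import json

# Renames taken from the changelog; the helper itself is hypothetical.
RENAMED_KEYS = {
    "license_expression": "declared_license_expression",
    "declared_license": "extracted_license_statement",
}


def upgrade_package_license_keys(old_pkg):
    """Return a copy of an old-format package dict using the new key names."""
    new_pkg = {RENAMED_KEYS.get(key, key): value for key, value in old_pkg.items()}
    statement = new_pkg.get("extracted_license_statement")
    if statement is not None and not isinstance(statement, str):
        # Assumption: JSON stands in for the YAML encoding used by ScanCode.
        new_pkg["extracted_license_statement"] = json.dumps(statement)
    # New attributes that only an actual scan can fill in.
    new_pkg.setdefault("license_detections", [])
    new_pkg.setdefault("other_license_detections", [])
    new_pkg.setdefault("other_license_expression", None)
    return new_pkg


old = {
    "name": "demo",
    "license_expression": "mit",
    "declared_license": {"type": "MIT"},
}
new = upgrade_package_license_keys(old)
```

A consumer migrating stored scan results would more likely rescan with the new version than rewrite keys in place; the mapping is shown only to make the renames concrete.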

- The license matches structure has changed: we used to report one match for each
license ``key`` of a matched license expression. We now report instead one
single match for each matched license expression, and list the license keys
as a ``licenses`` attribute. This avoids data duplication.
Inside each match, we list the match and matched rule attributes directly,
avoiding nesting. See `license updates doc <https://scancode-toolkit.readthedocs.io/en/latest/explanations/license-detection-reference.html#licensematch-result-data>`_
for examples and details.
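
To make the change from one match per license key to one match per expression concrete, here is a hedged Python sketch that groups simplified old-style entries. The old-entry field names used here (``key``, ``rule_identifier``, ``license_expression``, ``start_line``, ``end_line``) are a flattened assumption for illustration, ``group_matches_by_expression`` is a hypothetical helper, and representing ``licenses`` as a list of plain key strings is also an assumption.

```python
def group_matches_by_expression(old_matches):
    """Collapse per-key entries into one match per matched expression.

    Entries are treated as the same match when they share a rule
    identifier and line span (an assumption made for this sketch).
    """
    grouped = {}  # dicts keep insertion order, so output order is stable
    for entry in old_matches:
        ident = (entry["rule_identifier"], entry["start_line"], entry["end_line"])
        match = grouped.setdefault(
            ident,
            {
                "license_expression": entry["license_expression"],
                "rule_identifier": entry["rule_identifier"],
                "start_line": entry["start_line"],
                "end_line": entry["end_line"],
                "licenses": [],
            },
        )
        match["licenses"].append(entry["key"])
    return list(grouped.values())


# Made-up sample data: one two-key expression and one single-key expression.
old_matches = [
    {"key": "gpl-2.0", "rule_identifier": "gpl-2.0_with_classpath.RULE",
     "license_expression": "gpl-2.0 WITH classpath-exception-2.0",
     "start_line": 1, "end_line": 12},
    {"key": "classpath-exception-2.0", "rule_identifier": "gpl-2.0_with_classpath.RULE",
     "license_expression": "gpl-2.0 WITH classpath-exception-2.0",
     "start_line": 1, "end_line": 12},
    {"key": "mit", "rule_identifier": "mit.LICENSE",
     "license_expression": "mit", "start_line": 20, "end_line": 40},
]
new_matches = group_matches_by_expression(old_matches)
```

The shared match data (rule, lines, expression) appears once per group, which is the duplication the new structure is meant to avoid.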

- There is a new ``--licenses-reference`` command line option to report
reference license metadata and texts once for each license matched across the
scan; we now have two codebase level attributes: ``license_references`` and
``rule_references`` that list unique detected license and license rules.
See `license updates documentation <https://scancode-toolkit.readthedocs.io/en/latest/explanations/license-detection-reference.html#comparision-before-after-license-references>`_
for examples and details.

- We replaced the ``scancode --reindex-licenses`` command line option with a
new separate command named ``scancode-reindex-licenses``.

- The ``--reindex-licenses-for-all-languages`` CLI option is also moved to
the ``scancode-reindex-licenses`` command as an option ``--all-languages``.

- We can now detect licenses using custom license texts and license rules
stored in a directory or packaged as a plugin for consistent reuse and deployment.

- There is an ``--additional-directory`` option with the ``scancode-reindex-licenses``
command to add the licenses from a directory.

- There is also a ``--only-builtin`` option to use only builtin licenses
ignoring any additional license plugins.

- See https://github.com/nexB/scancode-toolkit/issues/480 for more details.

- We combined the license data file and text file of each license in a single
file with a .LICENSE extension. The .yml data file is now included at the
top of each .LICENSE file as "YAML frontmatter". The same applies to license
rules and their .RULE and .yml files. This halves the number of data files
from about 60,000 to 30,000. Git line history is preserved for the combined
text + yml files.

- See https://github.com/nexB/scancode-toolkit/issues/3049

- There is a new ``--get-license-data`` scancode command line option to export
license data in JSON, YAML and HTML, with indexes and a static website for use
in the licensedb web site. This becomes the API way to get scancode license
data.

See https://github.com/nexB/scancode-toolkit/issues/2738


v31.2.1 - 2022-10-05
----------------------------------

1 change: 1 addition & 0 deletions docs/source/explanations/index.rst
@@ -7,6 +7,7 @@
:maxdepth: 2

overview
license-detection-reference

..
[ToAdd]