Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for external licenses in scans #2979

Merged
merged 45 commits into from
Oct 28, 2022

Conversation

KevinJi22
Copy link
Collaborator

@KevinJi22 KevinJi22 commented Jun 1, 2022

This is for the 2022 Google Summer of Code. This PR contains all the work that needs to be done for the project.

This adds --additional-license-directory as a command line option in license reindexing. This allows users to specify paths to directories of licenses and rules they'd like to use during license detection, but would not like to add to the ScanCode database of licenses. First, the user must run Scancode with the --reindex-licenses option and they can use --additional-license-directory to add directories of licenses or rules to the index.

This involves adding a new option in licensedcode/plugin_license.py, and this option is used as a parameter in scancode/api.py. In this approach, the licenses and rules contained in these additional directories are combined with the existing licenses and rules in the ScanCode database to produce a single index. The code for this is found in licensedcode/cache.py and the helper methods for loading these licenses and rules are found in licensedcode/models.py.

This commit also includes unit tests to verify that license detection succeeds with additional directories and installed licenses/rules. Part of the setup for the unit test and future tests involves creating a new directory in tests/licensedcode/data that contains sample external licenses used in the unit tests.

Fixes #480

Tasks

  • Reviewed contribution guidelines
  • PR is descriptively titled 📑 and links the original issue above 🔗
  • Tests pass -- look for a green checkbox ✔️ a few minutes after opening your PR
    Run tests locally to check for errors.
  • Commits are in uniquely-named feature branch and has no merge conflicts 📁

Copy link
Contributor

@JonoYang JonoYang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@KevinJi22 I think this is looking good so far! I left some notes about docstrings and style.

src/licensedcode/models.py Outdated Show resolved Hide resolved
src/licensedcode/models.py Outdated Show resolved Hide resolved
src/licensedcode/models.py Outdated Show resolved Hide resolved
src/licensedcode/models.py Outdated Show resolved Hide resolved
tests/licensedcode/test_plugin_license.py Outdated Show resolved Hide resolved
src/licensedcode/plugin_license.py Outdated Show resolved Hide resolved
src/licensedcode/models.py Outdated Show resolved Hide resolved
src/licensedcode/models.py Outdated Show resolved Hide resolved
@KevinJi22 KevinJi22 force-pushed the external-licenses-480 branch 2 times, most recently from fda3b2d to 93cc339 Compare June 2, 2022 01:44
@pombredanne
Copy link
Member

There is something in your changes that make the tests fail...
I do not think you can just use multiple cached directories. You should use a single cache and then validate, then combine multiple input a single cached index.

@KevinJi22
Copy link
Collaborator Author

KevinJi22 commented Jun 8, 2022

@pombredanne I'm actually combining all the licenses and rules in the input directories into a single index here. In load_licenses_from_multiple_dirs, all the licenses (including the ones in the ScanCode database) are combined into a single license database, and the same is done with all the rules. The _CACHED_DIRECTORIES variable is used to determine whether a new index needs to be built. For example, if we've created an index with additional directories A and B, but now we run a scan with additional directories B and C, we'd need to recreate the combined index.

I'll update this code with comments so that this is more clear.

if not licenses_db:
if additional_directories:
additional_license_dirs = get_license_dirs(additional_dirs=additional_directories)
combined_license_directories = [licenses_data_dir] + additional_license_dirs
licenses_db = load_licenses_from_multiple_dirs(license_dirs=combined_license_directories)
else:
licenses_db = load_licenses(licenses_data_dir=licenses_data_dir)
if additional_directories:
additional_rule_dirs = get_rule_dirs(additional_dirs=additional_directories)
combined_rule_directories = [rules_data_dir] + additional_rule_dirs
rules = get_rules_from_multiple_dirs(licenses_db=licenses_db, rule_directories=combined_rule_directories)
else:
rules = get_rules(licenses_db=licenses_db, rules_data_dir=rules_data_dir)

@KevinJi22 KevinJi22 force-pushed the external-licenses-480 branch 3 times, most recently from 93e90a4 to ec988c5 Compare June 9, 2022 22:21
@KevinJi22 KevinJi22 requested a review from JonoYang June 10, 2022 01:07
Copy link
Contributor

@JonoYang JonoYang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've added a few nitpicks about formatting. I have an issue trying out the new -dir option. I am running scancode -l -dir tests/licensedcode/data/example_external_licenses/example1 --json-pp ~/Desktop/out.json /home/jono/src/scancode-toolkit-kevinji22/tests/licensedcode/data/plugin_license/external_licenses/scan/, however I am not seeing the custom license being detected in the scan results.

src/licensedcode/models.py Show resolved Hide resolved
tests/licensedcode/test_plugin_license.py Show resolved Hide resolved
tests/licensedcode/test_plugin_license.py Outdated Show resolved Hide resolved
tests/licensedcode/test_plugin_license.py Outdated Show resolved Hide resolved
@KevinJi22 KevinJi22 force-pushed the external-licenses-480 branch 2 times, most recently from 7696b97 to c5afa46 Compare June 11, 2022 03:17
@KevinJi22
Copy link
Collaborator Author

KevinJi22 commented Jun 11, 2022

Thanks for the feedback! This bug was actually due to how os.path.join() works, since the path that you passed in wasn't an absolute path, so the paths were just concatenated in the function build_rule_from_license() in models.py. This meant that the external license wasn't being loaded in as a rule. I added some code in get_license_dirs() that converts the input path to an absolute path, which will hopefully prevent this from happening in the future.

This means that running the scan with the same parameters should work this time.

@KevinJi22 KevinJi22 requested a review from JonoYang June 11, 2022 18:56
Copy link
Contributor

@JonoYang JonoYang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good! After talking to @pombredanne, the next thing we want on top is to be able to install custom licenses using wheels. Something similar to the way we handle providing binaries to scancode via plugins: https://github.com/nexB/scancode-plugins/tree/main/builtins

We also merged the fix for the PyPI test in develop, so be sure to rebase your branch.

@KevinJi22 KevinJi22 force-pushed the external-licenses-480 branch 9 times, most recently from 65dbef5 to 2243f51 Compare June 16, 2022 20:20
@KevinJi22 KevinJi22 changed the title [WIP] Add support for external licenses in scans #480 [WIP] Add support for external licenses in scans Jun 16, 2022
@JonoYang
Copy link
Contributor

JonoYang commented Jun 16, 2022

@KevinJi22 Don't feel compelled to amend your commit and force push it every time you make a change. You can create as many commits as you want on this PR. 🙂

@KevinJi22 KevinJi22 force-pushed the external-licenses-480 branch 2 times, most recently from 60d0c6a to 0db7bab Compare June 23, 2022 13:50
@AyanSinhaMahapatra AyanSinhaMahapatra force-pushed the external-licenses-480 branch 5 times, most recently from f9e82e9 to 392bc01 Compare October 12, 2022 20:49
Signed-off-by: Ayan Sinha Mahapatra <ayansmahapatra@gmail.com>
@AyanSinhaMahapatra
Copy link
Member

@KevinJi22 we (@pombredanne and me) added the following changes:

CLI Options:

  • various reindexing operations are no longer CLI options
  • we have a seperate script scancode-reindex-licenses which does that
  • only one additional directory of licenses/rules can be added, i.e. this is not a multiple option

Tests:

  • Organize tests data under one single folder.
  • Organize test functions under one single file.
  • Keep three tests, one for additional directory, one for
    additional plugin, and one for combined testing.
  • Edit the CI file accordingly.

Misc:

  • code refactoring at get_rules_from_multiple_dirs, get_rules_from_multiple_dirs etc
  • external license directories/plugin info is returned in scan header

Remaining work:

  • update how license data (links and other) is returned in api.py
  • also report new attribute is_builtin
    @pombredanne ^ this will create a lot of test file updates, I was thinking I'll do this update in the main LicenseDetection PR as there's lot's of change in license data reporting already.

Otherwise tests are all green and this is ready AFAIK.
Please review. 🙇

Signed-off-by: Ayan Sinha Mahapatra <ayansmahapatra@gmail.com>
Signed-off-by: Ayan Sinha Mahapatra <ayansmahapatra@gmail.com>
@AyanSinhaMahapatra
Copy link
Member

Remaining work:

  • update how license data (links and other) is returned in api.py
  • also report new attribute is_builtin
    @pombredanne ^ this will create a lot of test file updates, I was thinking I'll do this update in the main LicenseDetection PR as
    there's lot's of change in license data reporting already.

Modified the code to add is_builtin flags to license match data when there is an additional license directory or additional license plugins installed. So this work is not remaining anymore.

Ready to review @pombredanne!

Was also wondering if there could be a --only-builtin CLI option in scancode-reindex-licenses to reindex the license freshly without having to delete and reconfigure the environment, but not sure how to uninstall plugins from the python script. @pombredanne is this doable?

@pombredanne
Copy link
Member

Was also wondering if there could be a --only-builtin CLI option in scancode-reindex-licenses to reindex the license freshly without having to delete and reconfigure the environment, but not sure how to uninstall plugins from the python script. @pombredanne is this doable?

Sure, but remember that each new added option may be a sign of our failure to automatically do the right thing without options :)

Signed-off-by: Ayan Sinha Mahapatra <ayansmahapatra@gmail.com>
Copy link
Member

@pombredanne pombredanne left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here some extra feedback. I think we should refactor this a bit more such that indexing licenses and rules just takes a list of directories. Everything else should IMHO just use the things already loaded. And we can simplify the detection of builtins based on a list of builtin licenses... This would help keep the code tidier IMHO. @AyanSinhaMahapatra @KevinJi22 let's chat at your convenience

azure-pipelines.yml Outdated Show resolved Hide resolved
docs/source/cli-reference/core-options.rst Outdated Show resolved Hide resolved
docs/source/how-to-guides/install_new_license_plugin.rst Outdated Show resolved Hide resolved
docs/source/rst_snippets/core_options.rst Outdated Show resolved Hide resolved
docs/source/rst_snippets/note_snippets/core_indep.rst Outdated Show resolved Hide resolved
src/licensedcode/cache.py Outdated Show resolved Hide resolved
src/licensedcode/cache.py Outdated Show resolved Hide resolved
src/licensedcode/cache.py Show resolved Hide resolved
src/licensedcode/models.py Outdated Show resolved Hide resolved
Signed-off-by: Ayan Sinha Mahapatra <ayansmahapatra@gmail.com>
Refactor according to the feedback.

Signed-off-by: Ayan Sinha Mahapatra <ayansmahapatra@gmail.com>
@pombredanne
Copy link
Member

Here some extra feedback. I think we should refactor this a bit more such that indexing licenses and rules just takes a list of directories. Everything else should IMHO just use the things already loaded. And we can simplify the detection of builtins based on a list of builtin licenses... This would help keep the code tidier IMHO. @AyanSinhaMahapatra @KevinJi22 let's chat at your convenience

I created #3137 as a follow up!

Copy link
Member

@pombredanne pombredanne left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All good to go. Merging

Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add support for "extra", e.g. private or local licenses
4 participants