Add new licenses and new detection rules #2765

pombredanne · 2021-11-22T15:07:23Z

This PR provides general purpose license detection improvements mostly in the form of new licenses and new license detection rules.

Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>

These code templates are used only in the cobra library. The code has not evolved much and is best treated as a false positive. Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>

Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>

Used to be a license detection rule Contributed-by: Dennis Clark <dmclark@nexb.com> Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>

Contributed-by: Dennis Clark <dmclark@nexb.com> Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>

Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>

@mrombout

Do not use defaultdict for Query.unknowns_by_pos and Query.stopwords_by_pos. Otherwise, querying these dictionaries after their creation has the pernicious side effect of adding new entries to them. Reported-by: Mike Rombout @mrombout Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
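The side effect described above can be reproduced with a few lines of plain Python. This is a minimal sketch: the `unknowns` and `plain` mappings are hypothetical stand-ins, not the actual Query attributes.

```python
from collections import defaultdict

# Hypothetical stand-in for a per-position mapping such as unknowns_by_pos.
unknowns = defaultdict(int)
unknowns[3] = 2

# A plain lookup of a missing key silently inserts that key.
_ = unknowns[7]
keys_after_lookup = sorted(unknowns)

# A regular dict queried with .get() is left unchanged by the lookup.
plain = {3: 2}
missing = plain.get(7, 0)
plain_keys = sorted(plain)
```

With a regular dict, a lookup for a position that was never recorded does not grow the mapping, so later iterations over it see only real entries.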

Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>

This is much more readable Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>

Some were marked as is_license_tag incorrectly. Also make the case of the text consistent, using the case style most likely to show up. Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>

An idx index was passed as an argument in a few places but never used. Format code and ensure that logging works. Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>

This helps to update rule text in code and then dump the rule back Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>

This is a partial implementation of some of the ideas of #2403.

Add a new filter_invalid_single_word_matches_in_binaries() function called in licensedcode.match.refine_matches(). This filters matches under these conditions:
- the match is for a binary file
- the matched rule has a single word (length 1)
- the matched rule has a low relevance, e.g., under 75
- the matched text has either:
  - one or more leading or trailing punctuation characters (except for +), unless the rule has a high relevance and is contained as-is in the matched text (considering case)
  - mixed upper and lower case characters (but not Title case), unless the case mix is exactly the same as in the rule text

Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
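The conditions of this filter can be sketched as a small predicate. This is a simplified illustration, not the code from this PR: the function names, the argument list, and the `low_relevance` default are assumptions, and the high-relevance exception for punctuation is left out because the description does not give its threshold.

```python
import string

# Leading or trailing punctuation makes a match suspicious, except for "+".
PUNCTUATION = set(string.punctuation) - {"+"}


def is_mixed_case(word):
    """Return True if word mixes upper and lower case and is not Title case."""
    has_upper = any(c.isupper() for c in word)
    has_lower = any(c.islower() for c in word)
    return has_upper and has_lower and not word.istitle()


def is_invalid_single_word_match(
    matched_text, rule_text, rule_length, rule_relevance, is_binary,
    low_relevance=75,  # assumed cutoff, taken from "e.g., under 75"
):
    """Return True if a single-word match in a binary looks like gibberish."""
    # Only low-relevance, single-word rules matched in binaries are checked.
    if not is_binary or rule_length != 1 or rule_relevance >= low_relevance:
        return False
    text = matched_text.strip()
    if not text:
        return False
    # Leading or trailing punctuation (other than "+") is suspicious.
    if text[0] in PUNCTUATION or text[-1] in PUNCTUATION:
        return True
    # Mixed case is suspicious unless it is exactly the case of the rule text.
    return is_mixed_case(text) and text != rule_text
```

For example, under these assumptions `"(GPL)"` and `"gPl"` would be discarded for a rule whose text is `"gpl"`, while `"GPL+"` and `"Gpl"` would be kept.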

- Rename match.filter_key_phrase_spans to filter_matches_missing_key_phrases
- Move query.Query.tokens_with_unknowns and query.QueryRun.tokens_with_unknowns to licensedcode_test_utils
- Remove obsolete u'' string prefixes from test_query.py
- Add tracing to query.Query.tokens_by_line() processing

Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>

- Call match.set_matched_lines() once centrally
- Call key phrase filtering early
- Implement a function to group matches in regions, used for debugging
- Implement and track a discard_reason for each LicenseMatch, a code that tells why a match is filtered
- Format code
- Fix Span and LicenseMatch distance_to for touching spans: the distance should be one then

Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
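The distance_to fix for touching spans can be illustrated with a small standalone function. This is a sketch under stated assumptions: it models a span as an inclusive `(start, end)` tuple of positions rather than using the actual Span class, and it assumes that overlapping spans are at distance zero.

```python
def distance_to(span, other):
    """Return the distance between two inclusive (start, end) position ranges."""
    start, end = span
    ostart, oend = other
    # Overlapping ranges are at distance zero (assumed convention).
    if start <= oend and ostart <= end:
        return 0
    # Otherwise the distance is the difference between the closest ends,
    # so touching ranges such as (1, 2) and (3, 4) are at distance one.
    if end < ostart:
        return ostart - end
    return start - oend
```

With this convention, `distance_to((1, 2), (3, 4))` is one and `distance_to((1, 2), (5, 6))` is three, in either argument order.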

Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>

Test key phrases more thoroughly, using more than one rule. Add tests for LicenseMatch regions. Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>

Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>

And remove duplicated combine_expressions functions. Tests have been moved upstream too. Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>

Rule.compute_relevance() was only setting a value, hence the renaming to set_relevance(). Also extract a plain "compute_relevance()" function that takes only a length as input. Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>

Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>

This returns an SPDX license expression from a plain scancode expression. Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>

Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>

…cense-improvements

The move to the license_expression.combine_expressions() function missed that it returns a LicenseExpression where the existing functions returned a string. This fixes the issue. Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>

This was updated incorrectly when switching to license_expression code in previous commits. Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>

We can be smarter and use a filter to catch them all. Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>

Update rare JTA license. Update rules. Add new Rule and License validation code to ensure that all keys are lowercase. Also make minor fixes to docstring and formatting. Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
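A lowercase key check of this kind can be sketched in a few lines. The function name, the error message, and the sample keys are hypothetical and only illustrate the idea; they are not the validation code added in this commit.

```python
def validate_lowercase_keys(keys):
    """Yield an error message for each key that is not entirely lowercase."""
    for key in keys:
        if key != key.lower():
            yield f"License key must be lowercase: {key!r}"


# Made-up sample keys: only the second one contains uppercase characters.
errors = list(validate_lowercase_keys(["apache-2.0", "Example-1.0", "mit"]))
```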

Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>

Use the shared get_licenses_by_spdx_key() function. Do not generate false positives. Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>

The new filter_false_positive_license_lists_matches() filter recognizes sequences of license matches that are false positive lists, such as a list of SPDX license ids seen in some license-related tools. These were previously detected using a large number of rules that were subsequences of these lists of licenses. With this filter, all the rules used before have been removed and new rules have been added to properly detect each license reference. As a result, the license detection is both more accurate (detecting more references) and less noisy (not reporting false positive lists of such references).

- "s" is reinstated as a regular word (and not a stopword)
- several new and improved rules have been added
- tests have been updated accordingly

Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
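The general idea of filtering long runs of adjacent license references can be sketched as follows. This is only an illustration under loud assumptions: the `Match` class, the `min_run` and `max_gap` thresholds, and the adjacency test are all hypothetical, since the description does not state the criteria the actual filter uses.

```python
from dataclasses import dataclass


@dataclass
class Match:
    # Hypothetical, minimal stand-in for a license match.
    license_key: str
    start_line: int
    end_line: int
    is_reference: bool


def filter_license_lists(matches, min_run=10, max_gap=1):
    """Drop long runs of adjacent reference matches; keep everything else."""
    kept, run = [], []

    def flush():
        # Short runs are treated as genuine detections; long runs as a list.
        if len(run) < min_run:
            kept.extend(run)
        run.clear()

    for match in sorted(matches, key=lambda m: m.start_line):
        is_adjacent = run and match.start_line - run[-1].end_line <= max_gap
        if match.is_reference and (not run or is_adjacent):
            run.append(match)
            continue
        # The current run ends here: decide its fate, then handle this match.
        flush()
        if match.is_reference:
            run.append(match)
        else:
            kept.append(match)
    flush()
    return kept
```

Under these assumptions, twelve reference matches on consecutive lines would be discarded as a list, while an isolated reference further down the file would still be reported.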

Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>

- Remove the non-continuous filter and use the key phrases filter instead, as a rule that must be matched continuously is the same as a rule with a single key phrase covering all its text
- Track stopwords on the index side in each rule, and use these to ensure that when key phrases are present and contain stopwords, the stopwords are the same on the query and index sides of a match before considering that the key phrases match
- For stopwords, introduce a new index tokenizer "index_tokenizer_with_stopwords"
- Sort stopwords for easier search in the future
- Streamline the key phrase tokenizer and run it in Rule.tokens()
- Remove query.Query.stopwords_span that was not used
- Remove match.LicenseMatch.qcontains_stopwords() that was not used
- Remove match.filter_non_continuous_matches(), no longer used
- Add new and improved rules, and requalify some non-English short rules as bona fide rules
- Apply multiple minor refactoring, doc, and formatting refinements

Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>

Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>

And spell spurious with one r, not two. Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>

This supports new ways to combine multiple license matches together, and to distinguish primary and secondary licenses. Most concepts have been drawn from early work on Debian copyright detection improvements. Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>

pombredanne · 2022-01-06T08:06:26Z

All green ... merging at last!

pombredanne added 21 commits October 15, 2021 15:41

Improve license metadata

82b7187

Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>

Merge branch 'develop' into omnibus-fall2-license-improvements

2af8e50

Do not report template as license #270

9e0c205

These code templates are used only in the cobra library. The code has not evolved much and is best treated as a false positive. Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>

Add new and improved license tests

db306db

Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>

Add new copyleft licenses

2cf3117

Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>

Add new permissive licenses rules

09d1601

Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>

Add new miscellaneous license rules

1aadc1c

Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>

Add new license detection tests

0956537

Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>

Promote rule as new license

2630654

Used to be a license detection rule Contributed-by: Dennis Clark <dmclark@nexb.com> Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>

Promote rule as new license

7bbaf3a

Used to be a license detection rule Contributed-by: Dennis Clark <dmclark@nexb.com> Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>

Improve license metadata

36e3491

Contributed-by: Dennis Clark <dmclark@nexb.com> Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>

Add new licenses

6c94508

Contributed-by: Dennis Clark <dmclark@nexb.com> Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>

Add new license detection tests

4a5f99b

Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>

Add new license tests

dc08c93

Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>

Add new license detection rules

63df817

Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>

Update licenses to SPDX license list 3.15

36cbe98

Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>

Improve license sync

1330682

Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>

Add new proprietary license rule

4ba7111

Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>

Update license metadata

93bd9f3

Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>

Fix typo in test data file

22f578c

Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>

Update SPDX tests with latest list version

cc30ed2

Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>

pombredanne mentioned this pull request Nov 25, 2021

GPL-2.0 false alarm #2757

Open

pombredanne added 8 commits November 25, 2021 13:59

Correct failing license detection tests

a8a75b7

Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>

Correct license detection test

5e3bfa7

Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>

Format functions arguments, black-style

dfc0b69

This is much more readable Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>

Ensure that all single word rule are references

d4820f2

Some were marked as is_license_tag incorrectly. Also make the case of the text consistent, using the case style most likely to show up. Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>

Remove unused arguments and format code

73e0376

An idx index was passed as an argument in a few places but never used. Format code and ensure that logging works. Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>

Check for Rule stored text first in Rule.text()

1ce8879

This helps to update rule text in code and then dump the rule back Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>

pombredanne added 27 commits December 26, 2021 23:20

Remove old "rule templates" markers from tests

d9789ac

Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>

Improve tests of LicenseMatch

b5972e1

Test key phrases more thoroughly, using more than one rule. Add tests for LicenseMatch regions. Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>

Improve LicenseMatch fields documentation

f84a23d

Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>

Use license_expression.combine_expressions()

5f92c93

And remove duplicated combine_expressions functions. Tests have been moved upstream too. Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>

Rename Rule.compute_relevance to set_relevance

a824594

Rule.compute_relevance() was only setting a value, hence the renaming to set_relevance(). Also extract a plain "compute_relevance()" function that takes only a length as input. Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>

Format docstrings

840527f

Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>

Use new cache.build_spdx_license_expression()

e138421

This returns an SPDX license expression from a plain scancode expression. Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>

Update CHANGELOG.rst

0595e65

Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>

Merge remote-tracking branch 'upstream/develop' into omnibus-fall3-li…

8d57889

…cense-improvements

Restore combine_expression behavior

107eea7

This was updated incorrectly when switching to license_expression code in previous commits. Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>

Remove SPDX license lists false positive rules

b7a7593

We can be smarter and use a filter to catch them all. Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>

Validate and use only lower case license keys

5637c82

Update rare JTA license. Update rules. Add new Rule and License validation code to ensure that all keys are lowercase. Also make minor fixes to docstring and formatting. Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>

Create shared get_licenses_by_spdx_key function

37db46f

Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>

Improve rule generation

10eadcc

Use the shared get_licenses_by_spdx_key() function. Do not generate false positives. Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>

Add new licenses

f4ddd14

Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>

Remove unused test options for Python2

56deead

Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>

Add more license detection rules

0adb718

Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>

Add missing test file

92a1dbb

Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>

Refine license sync

012800f

Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>

Refine short Apache license rules and tests

ba249d1

Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>

Refine license tests

ad4dfff

And spell spurious with one r, not two. Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>

pombredanne merged commit 37d574b into develop Jan 6, 2022

pombredanne deleted the omnibus-fall3-license-improvements branch January 6, 2022 08:07
