Add new licenses and new detection rules #2765

Merged: 68 commits merged into develop from omnibus-fall3-license-improvements on Jan 6, 2022

pombredanne
Member

This PR provides general purpose license detection improvements mostly in the form of new licenses and new license detection rules.

Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
These code templates are used only in the cobra library.
The code has not evolved much and is best treated as a false positive.

Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
Used to be a license detection rule

Contributed-by: Dennis Clark <dmclark@nexb.com>
Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
Used to be a license detection rule

Contributed-by: Dennis Clark <dmclark@nexb.com>
Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
Contributed-by: Dennis Clark <dmclark@nexb.com>
Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
Contributed-by: Dennis Clark <dmclark@nexb.com>
Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
Do not use defaultdict for Query.unknowns_by_pos and
Query.stopwords_by_pos. Otherwise, querying these dictionaries after
their creation has the pernicious side effect of adding new entries to
them.
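
The side effect can be reproduced with the standard library alone. This is a minimal sketch: the dictionary below is a hypothetical stand-in for Query.unknowns_by_pos, not the actual attribute.

```python
from collections import defaultdict

# Hypothetical position-to-count mapping standing in for unknowns_by_pos.
unknowns_by_pos = defaultdict(int)
unknowns_by_pos[3] = 2

# A plain lookup of a missing key silently inserts that key.
_ = unknowns_by_pos[7]
defaultdict_keys = sorted(unknowns_by_pos)

# A regular dict queried with .get() is left unchanged.
plain = {3: 2}
_ = plain.get(7, 0)
plain_keys = sorted(plain)
```

After the lookups, the defaultdict holds an extra key for position 7 while the plain dict still holds only position 3.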

Reported-by: Mike Rombout @mrombout
Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
This is much more readable

Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
Some were incorrectly marked as is_license_tag. Also make the case of
the text consistent, using the case style that is most likely to show
up.

Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
An idx index was passed as an argument in a few places but never used.
Format code and ensure that logging will work

Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
This helps to update rule text in code and then dump the rule back

Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
This is a partial implementation of some of the ideas of #2403
Add a new filter_invalid_single_word_matches_in_binaries() function
called in licensedcode.match.refine_matches().

This filters matches under these conditions:

- the match is for a binary file

- the matched rule has a single word (length 1)

- the matched rule has a low relevance, e.g., under 75

- the matched text has either:
  - one or more leading or trailing punctuation characters (except for
    +) unless this has a high relevance and the rule is contained as-is
    in the matched text (considering case)

  - mixed upper and lower case characters (but not Title case) unless
    exactly the same mixed case as the rule text
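
The conditions above can be sketched as a simplified predicate. This is an illustration under stated assumptions, not the code of filter_invalid_single_word_matches_in_binaries(): the function names, the argument list and the punctuation set are hypothetical, and the high-relevance exception for punctuation is left out.

```python
import string

# Assumption: every ASCII punctuation character except "+" is suspicious.
SUSPICIOUS_PUNCTUATION = set(string.punctuation) - {"+"}


def has_mixed_case(word):
    """Return True for mixed upper and lower case that is not Title case."""
    has_upper = any(c.isupper() for c in word)
    has_lower = any(c.islower() for c in word)
    return has_upper and has_lower and not word.istitle()


def is_invalid_single_word(matched_text, rule_text, rule_length, relevance,
                           is_binary, low_relevance=75):
    """Return True if a single-word match in a binary looks spurious."""
    # Only low-relevance, single-word rules matched in binaries qualify.
    if not is_binary or rule_length != 1 or relevance >= low_relevance:
        return False
    text = matched_text.strip()
    if not text:
        return False
    # Leading or trailing punctuation (other than "+") is suspicious.
    if text[0] in SUSPICIOUS_PUNCTUATION or text[-1] in SUSPICIOUS_PUNCTUATION:
        return True
    # Mixed case is suspicious unless it is exactly the rule's own casing.
    return has_mixed_case(text) and text != rule_text
```

With these assumptions, "(gpl" and "gPl" would be filtered for a rule "gpl" in a binary, while "gpl+" and "Gpl" would be kept.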

Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
Rename match.filter_key_phrase_spans to
filter_matches_missing_key_phrases

Move query.Query.tokens_with_unknowns
and query.QueryRun.tokens_with_unknowns
to licensedcode_test_utils

Remove obsolete u'' string prefixes from test_query.py

Add tracing to query.Query.tokens_by_line() processing

Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
Call match.set_matched_lines() once centrally

Call key phrase filtering early

Implement function to group matches in regions used for debugging

Implement and track a discard_reason for each LicenseMatch which is a
code that tells why a match is filtered.

Format code

Fix Span and LicenseMatch distance_to for touching spans: the distance
should be one then.
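
A minimal sketch of the intended distance semantics, assuming spans are (start, end) pairs with inclusive ends, that overlapping spans have a distance of zero and that touching spans have a distance of one. The span_distance() helper is hypothetical and is not the Span.distance_to method itself.

```python
def span_distance(a, b):
    """Return the distance between two (start, end) spans, ends inclusive."""
    first, second = sorted([a, b])
    # Overlapping spans share at least one position: distance zero.
    if second[0] <= first[1]:
        return 0
    # Touching spans such as (1, 3) and (4, 6) are one position apart.
    return second[0] - first[1]
```

The order of the arguments does not matter, since the spans are sorted first.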

Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
Improve key phrase tests further, using more than one rule

Add tests for LicenseMatch regions

Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
And remove duplicated combine_expressions functions.
Tests have been moved upstream too.

Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
Rule.compute_relevance() was only setting the relevance, hence the
rename to set_relevance()

Also extract a plain "compute_relevance()" function that takes only a
length as input.

Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
This returns an SPDX license expression from a plain scancode
expression.

Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
The move to the license_expression.combine_expressions() function
missed that it returns a LicenseExpression where the existing
functions returned a string. This fixes the issue.

Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
This was updated incorrectly when switching to license_expression code
in previous commits.

Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
We can be smarter and use a filter to catch them all.

Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
Update rare JTA license
Update rules
Add new Rule and License validation code to ensure that all keys
are lowercase.
Also make minor fixes to docstring and formatting
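
A lowercase-key check of this kind could look like the sketch below. The validate_lowercase_keys() helper and its error message are hypothetical and are not taken from the Rule or License validation code.

```python
def validate_lowercase_keys(keys):
    """Yield an error message for each key that is not all lowercase."""
    for key in keys:
        if key != key.lower():
            yield f"Invalid key: {key!r} must be lowercase"


errors = list(validate_lowercase_keys(["mit", "Apache-2.0", "gpl-2.0"]))
```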

Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
Use the shared get_licenses_by_spdx_key() function.
Do not generate false positives

Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
The new filter_false_positive_license_lists_matches() filter is able
to recognize sequences of license matches that are false positive lists
such as a list of SPDX license ids seen in some license related tools.
These were previously detected using a large number of rules that were
subsequences of these lists of licenses.

With this filter we have now removed all the rules used before and
added new rules to properly detect each license reference. As a result,
the license detection is both more accurate (detecting more references)
and less noisy (not reporting false positive lists of such references)
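
The idea of treating a long run of adjacent license references as a list can be sketched as follows. This is an illustration only: the Match class, the min_run and max_line_gap thresholds and the filter_license_lists() name are assumptions, not the actual filter_false_positive_license_lists_matches() implementation.

```python
from dataclasses import dataclass


@dataclass
class Match:
    line: int
    is_reference: bool  # a short license reference such as an SPDX id


def filter_license_lists(matches, min_run=15, max_line_gap=1):
    """Return (kept, discarded), discarding long runs of nearby references."""
    kept, discarded, run = [], [], []

    def flush():
        # A long enough run is considered a list of licenses and dropped.
        target = discarded if len(run) >= min_run else kept
        target.extend(run)
        run.clear()

    for match in sorted(matches, key=lambda m: m.line):
        close = not run or match.line - run[-1].line <= max_line_gap
        if match.is_reference and close:
            run.append(match)
            continue
        flush()
        if match.is_reference:
            run.append(match)
        else:
            kept.append(match)
    flush()
    return kept, discarded
```

An isolated reference survives because its run is too short, while twenty references on twenty consecutive lines would be discarded together.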

- "s" is reinstated as a regular word (and not a stopword)
- several new and improved rules have been added
- tests have been updated accordingly

Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
Remove non-continuous filter and use instead the key phrases filter,
as a rule that must be matched continuously is the same as a rule with
a single key phrase covering all its text
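
The equivalence can be illustrated with a rule-side check only. In this sketch the contains_key_phrases() helper and its inputs are hypothetical, key phrase spans are (start, end) token positions with inclusive ends, and the query-side continuity check is left out.

```python
def contains_key_phrases(matched_positions, key_phrase_spans):
    """Return True if every key phrase position of the rule was matched."""
    matched = set(matched_positions)
    return all(
        set(range(start, end + 1)) <= matched
        for start, end in key_phrase_spans
    )


# A five-token rule that must match continuously is modeled as a rule
# with one key phrase spanning positions 0 to 4.
whole_rule = [(0, 4)]
full_match = contains_key_phrases([0, 1, 2, 3, 4], whole_rule)
gapped_match = contains_key_phrases([0, 1, 3, 4], whole_rule)
```

Here full_match is True and gapped_match is False, which is the outcome a dedicated non-continuous filter would have produced.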

Track stopwords on the index side in each rule. When keyphrases are
present and contain stopwords, check that the stopwords are the same on
the query and index sides of a match before considering the keyphrases
matched.

For stopwords, introduce a new index tokenizer
"index_tokenizer_with_stopwords".

Sort stopwords for easier search in the future.

Streamline keyphrase tokenizer and run it in Rule.tokens()

Remove query.Query.stopwords_span that was not used
Remove match.LicenseMatch.qcontains_stopwords() that was not used
Remove match.filter_non_continuous_matches() no longer used

Add new and improved rules, and requalify some non-English short rules
as bona fide rules.

Apply multiple minor refactoring, doc, and formatting refinements

Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
And spell spurious with one r, not two.

Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
This supports new ways to combine multiple license matches together,
and to distinguish primary and secondary licenses.
Most concepts have been drawn from early work on Debian copyright
detection improvements.

Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
@pombredanne
Member Author

All green ... merging at last!

@pombredanne merged commit 37d574b into develop Jan 6, 2022
@pombredanne deleted the omnibus-fall3-license-improvements branch January 6, 2022 08:07