Add new licenses and new detection rules #2765
Merged
Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
These templates as code are used only in the cobra library. The code has not evolved much and is best treated as a false positive. Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
Used to be a license detection rule Contributed-by: Dennis Clark <dmclark@nexb.com> Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
Used to be a license detection rule Contributed-by: Dennis Clark <dmclark@nexb.com> Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
Contributed-by: Dennis Clark <dmclark@nexb.com> Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
Contributed-by: Dennis Clark <dmclark@nexb.com> Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
Do not use defaultdict for Query.unknowns_by_pos and Query.stopwords_by_pos. Otherwise, querying these dictionaries after their creation has the pernicious side effect of adding new entries to them. Reported-by: Mike Rombout @mrombout Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
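The side effect described in this commit is standard `collections.defaultdict` behavior. The following is a minimal sketch; the `unknowns_by_pos` name is borrowed from the commit message, but the position-to-count layout is an assumption and not the actual Query internals.

```python
from collections import defaultdict

# Hypothetical stand-in for a position -> count mapping such as
# Query.unknowns_by_pos; the real attribute layout is an assumption.
unknowns_by_pos = defaultdict(int)
unknowns_by_pos[3] = 2

# A plain subscript lookup of a missing key silently inserts that key.
_ = unknowns_by_pos[10]
size_after_lookup = len(unknowns_by_pos)

# A regular dict read through .get() does not mutate the mapping.
plain_by_pos = {3: 2}
_ = plain_by_pos.get(10, 0)
size_plain = len(plain_by_pos)
```

Reading through `.get()` on a regular dict keeps lookups free of side effects, which is presumably why the commit moves away from defaultdict.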
Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
This is much more readable Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
Some were marked as is_license_tag incorrectly. Also make the case of the text something which is consistent and likely to be the most common case style that would show up. Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
An idx index was passed as an argument in a few places but never used. Format code and ensure that logging will work. Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
This helps to update rule text in code and then dump the rule back Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
This is a partial implementation of some of the ideas of #2403.

Add a new filter_invalid_single_word_matches_in_binaries() function called in licensedcode.match.refine_matches(). This filters matches under these conditions:
- the match is for a binary file
- the matched rule has a single word (length 1)
- the matched rule has a low relevance, e.g., under 75
- the matched text has either:
  - one or more leading or trailing punctuation characters (except for +), unless this has a high relevance and the rule is contained as-is in the matched text (considering case)
  - mixed upper and lower case characters (but not Title case), unless this is exactly the same mixed case as the rule text

Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
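The conditions listed in this commit can be sketched roughly as below. This is not the actual filter_invalid_single_word_matches_in_binaries() code: the function name, its arguments and the `high_relevance` threshold of 60 are assumptions made for illustration. Only the "under 75" low-relevance cutoff comes from the commit message.

```python
import string

# Punctuation that makes a single-word match suspicious. "+" is kept
# out of the set because the commit message explicitly exempts it.
PUNCTUATION = set(string.punctuation) - {"+"}


def has_mixed_case(text):
    """Return True for mixed upper/lower case that is not Title case."""
    letters = [c for c in text if c.isalpha()]
    has_upper = any(c.isupper() for c in letters)
    has_lower = any(c.islower() for c in letters)
    return has_upper and has_lower and not text.istitle()


def is_invalid_single_word_match(
    matched_text,
    rule_text,
    rule_length,
    relevance,
    is_binary,
    low_relevance=75,   # from the commit message
    high_relevance=60,  # hypothetical value, not from the source
):
    """Return True if a match should be discarded (illustrative only)."""
    # Only single-word, low-relevance matches in binaries are candidates.
    if not is_binary or rule_length != 1 or relevance >= low_relevance:
        return False

    text = matched_text.strip()
    if not text:
        return False

    # Leading or trailing punctuation, unless the rule is relevant
    # enough and appears as-is (case-sensitive) in the matched text.
    if text[0] in PUNCTUATION or text[-1] in PUNCTUATION:
        exempt = relevance >= high_relevance and rule_text in text
        if not exempt:
            return True

    # Mixed case, unless it is exactly the casing of the rule text.
    if has_mixed_case(text) and rule_text not in text:
        return True

    return False
```

For example, under these assumptions a binary-file match on `$gpl` against a one-word `gpl` rule of relevance 50 is discarded, while `gpl+` or `GPL` is kept.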
Rename match.filter_key_phrase_spans to filter_matches_missing_key_phrases.
Move query.Query.tokens_with_unknowns and query.QueryRun.tokens_with_unknowns to licensedcode_test_utils.
Remove obsolete u'' string prefixes from test_query.py.
Add tracing to query.Query.tokens_by_line() processing.
Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
Call match.set_matched_lines() once centrally.
Call key phrase filtering early.
Implement a function to group matches in regions, used for debugging.
Implement and track a discard_reason for each LicenseMatch, which is a code that tells why a match is filtered.
Format code.
Fix Span and LicenseMatch distance_to for touching spans: the distance should be one in that case.
Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
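The distance_to fix for touching spans can be illustrated with a simplified sketch. It assumes contiguous, inclusive (start, end) ranges as a stand-in for Span, and the `distance_between` helper is hypothetical; the real Span class may hold non-contiguous positions and differ in detail.

```python
def distance_between(span_a, span_b):
    """Return the distance between two inclusive (start, end) ranges."""
    a_start, a_end = span_a
    b_start, b_end = span_b

    # Overlapping ranges are at distance zero.
    if a_start <= b_end and b_start <= a_end:
        return 0

    # Otherwise count the steps from the end of the earlier range to
    # the start of the later one, so that touching ranges yield one.
    if a_end < b_start:
        return b_start - a_end
    return a_start - b_end
```

With this convention, (1, 3) and (4, 6) touch and are at distance one, while (1, 5) and (4, 8) overlap and are at distance zero.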
Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
Improve the key phrase tests further, using more than one rule. Add tests for LicenseMatch regions. Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
And remove duplicated combine_expressions functions. Tests have been moved upstream too. Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
Rule.compute_relevance() was only setting a value, hence the renaming to set_relevance(). Also extract a plain "compute_relevance()" function to use with a length input only. Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
This returns an SPDX license expression from a plain scancode expression. Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
…cense-improvements
The move to use the license_expression.combine_expressions() function missed that it returns a LicenseExpression where the existing functions returned a string. This fixes the issue. Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
This was updated incorrectly when switching to license_expression code in previous commits. Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
We can be smarter and use a filter to catch them all. Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
Update rare JTA license. Update rules. Add new Rule and License validation code to ensure that all keys are lowercase. Also make minor fixes to docstrings and formatting. Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
Use the shared get_licenses_by_spdx_key() function. Do not generate false positives. Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
The new filter_false_positive_license_lists_matches() filter is able to recognize sequences of license matches that are false positive lists, such as a list of SPDX license ids seen in some license-related tools.

These were previously detected using a large number of rules that were subsequences of these lists of licenses. With this filter, we have now removed all the rules used before and added new rules to properly detect each license reference.

As a result, the license detection is both more accurate (detecting more references) and less noisy (not reporting false positive lists of such references).
- "s" is reinstated as a regular word (and not a stopword)
- several new and improved rules have been added
- tests have been updated accordingly

Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
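One way to picture a filter of this kind is to look for long runs of short license references that sit close together. The sketch below is not the filter_false_positive_license_lists_matches() implementation: the `Match` tuple, the `filter_license_lists` name, and the `min_run` and `max_gap` thresholds are all assumptions made for illustration.

```python
from typing import NamedTuple


class Match(NamedTuple):
    """Hypothetical, simplified stand-in for a license match."""
    start_line: int
    end_line: int
    is_reference: bool  # short license reference, e.g. an SPDX id


def filter_license_lists(matches, min_run=5, max_gap=1):
    """Return (kept, discarded) lists of matches.

    A run of at least min_run reference matches, each starting within
    max_gap lines of the previous one, is treated as a license list.
    Both thresholds are illustrative assumptions.
    """
    kept, discarded, run = [], [], []

    def flush():
        # A long enough run is a list of licenses: discard it whole.
        target = discarded if len(run) >= min_run else kept
        target.extend(run)
        run.clear()

    for match in sorted(matches, key=lambda m: m.start_line):
        close = run and match.start_line - run[-1].end_line <= max_gap
        if match.is_reference and (not run or close):
            run.append(match)
            continue
        # The run is broken: settle it, then handle the current match.
        flush()
        if match.is_reference:
            run.append(match)
        else:
            kept.append(match)

    flush()
    return kept, discarded


# Six adjacent references look like a list; two isolated ones do not.
sample = [Match(n, n, True) for n in range(1, 7)]
sample += [Match(20, 40, False), Match(50, 50, True), Match(51, 51, True)]
kept, discarded = filter_license_lists(sample)
```

Handling whole runs in one filter replaces the earlier approach of carrying many rules that each matched a subsequence of such lists.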
Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
Remove the non-continuous filter and use the key phrases filter instead, as a rule that must be matched continuously is the same as a rule with a single key phrase covering all its text.

Track stopwords on the index side in each rule, and use these to ensure that when key phrases are present and contain stopwords, we check that the stopwords are the same on the query and index side of a match before considering the key phrases matched. For stopwords, introduce a new index tokenizer "index_tokenizer_with_stopwords".

- Sort stopwords for easier search in the future.
- Streamline the key phrase tokenizer and run it in Rule.tokens().
- Remove query.Query.stopwords_span that was not used.
- Remove match.LicenseMatch.qcontains_stopwords() that was not used.
- Remove match.filter_non_continuous_matches(), no longer used.
- Add new and improved rules, and requalify some non-English short rules as bona fide rules.
- Apply multiple minor refactoring, doc, and formatting refinements.

Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
And spell spurious with one r, not two. Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
This supports new ways to combine multiple license matches together, and to distinguish primary and secondary licenses. Most concepts have been drawn from early work on Debian copyright detection improvements. Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
All green ... merging at last!
This PR provides general purpose license detection improvements mostly in the form of new licenses and new license detection rules.