-
-
Notifications
You must be signed in to change notification settings - Fork 548
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Allow license rules to require the presence of certain defining keywords #2773
Allow license rules to require the presence of certain defining keywords #2773
Conversation
tests/licensedcode/data/datadriven/lic1/boto-boto3-cw-example-creating-alarm.txt.yml
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks!
See a few comments in line.
tests/licensedcode/data/datadriven/lic1/boto-boto3-cw-example-creating-alarm.txt.yml
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is overall awesome!
I'd like to check if the impact on indexing is not too bad... but this is not a critical process anymore.
""" | ||
Return a filtered list of kept LicenseMatch matches and a list of | ||
discardable matches by removing all matches that do not contain all key | ||
phrases required by the rule. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What if there is only one match and this does not contain the keyphrases?
Would you discard it, therefore leaving no matches?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the only match does not contain the key phrase wouldn't that indicate a high chance of being a false positive? If so, for my use-case I would rather have no results, but maybe that's not in-line with Scancode philosophy(e.g. not having false negatives)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the only match does not contain the key phrase wouldn't that indicate a high chance of being a false positive?
yes in the general case, but let's take a practical example:
- in query I have
licensed under the AGPL 3.0
and in rule1 I haveAGPL 3.0
- in rule2 I have
licensed under the {{AGPL 3.0 license}}
- the query will initially match both rules
- the match to rule1 may be discarded as contained in the rule2 match
- the match to rule2 may be discarded as missing a key phrase
I am left with no matches yet licensed under the AGPL 3.0
was very clearly a license notice.
On the other hand if the match to rule2 is discarded as missing a key phrase early and before the containment filter, then we are at least left with the match to rule1?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay, that makes sense. But unfortunately I was unable to recreate this scenario in a unit test.
Instead I experimented and moved the key phrase filter up (before the containment filter) and 5 of the datadriven tests I've added broke. Looking into that now. E.g. tests/licensedcode/test_detection_datadriven1.py::TestLicenseDataDriven1::test_detection_datadriven_lic1_alexa_skills_kit_sdk_for_java_txt
now detects [apache-2.0, apache-2.0, unknown-license-reference]
as opposed to only apache-2.0
.
Also, are you suggesting to skip if there's only one rule like some of the other filters?
if len(matches) < 2:
return matches
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added some more comments for your review.
src/licensedcode/match.py
Outdated
@@ -1485,6 +1540,10 @@ def _log(_matches, _discarded, msg): | |||
all_discarded.extend(discarded) | |||
_log(matches, discarded, 'HIGH ENOUGH SCORE') | |||
|
|||
matches, discarded = filter_key_phrase_spans(matches) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this filter should be positioned much earlier in the process otherwise here is the risk:
- there was a smalller match that was contained
- it is filtered because of containment in a larger match
- the larger is discarded because of not having keyphrases
- there is no match left for this region and we now have a false negative
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've moved the key phrase filter up in 2c377b3, to right before the first containment filter. I was unable to reproduce this scenario exactly, but some datadriven tests at least demonstrated the effectiveness (tests/licensedcode/test_detection_datadriven1.py::TestLicenseDataDriven1::test_detection_datadriven_lic1_alexa_skills_kit_sdk_for_java_txt
)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@pombredanne Would the above suffice?
tests/licensedcode/data/datadriven/lic1/alexa-skills-kit-sdk-for-java.txt.yml
Outdated
Show resolved
Hide resolved
A key phrase can be defined in a rule by surrounding one or more words (tokens) with the `{{` and `}}` characters. Key phrases are words that **must** be present in order to successfully match. The key phrases are parsed for each rule during indexing. Then after matching using the standard matching algorithm a filter removes all matches that do not have the key phrases present. --- At the moment sometimes the `ispan` alone can not be relied upon to determine if key phrases are present. At one case stopwords and unknown tokens made it appear as if the `ispan` was more expansive than it should be. This has been solved by checking if the `qspan` corresponding with the `ispan` intersect with any unknown tokens or stopwords. Another problem that is not solved is with the example boto3 example which should match with `cc-by-nc-sa-4.0` but it now matching with several other CC licenses. For some reason `Creative Commons Attribution 4.0 International License` is considered to be matching in the ispan, and not intersecting with any unknown tokens or stopwords (citation needed). Signed-off-by: Mike Rombout <mike.rombout@elastisys.com>
License cc-by-sa-4.0 was detected as cc-by-4.0. Signed-off-by: Mike Rombout <mike.rombout@elastisys.com>
License mit was detected as gpl-2.0. Signed-off-by: Mike Rombout <mike.rombout@elastisys.com>
Non-license text is detected as bsd-simplified AND gpl-2.0. Signed-off-by: Mike Rombout <mike.rombout@elastisys.com>
Non-license text was detected as apache-2.0 AND cc-by-sa-4.0. Signed-off-by: Mike Rombout <mike.rombout@elastisys.com>
License mit was detected as ruby. Signed-off-by: Mike Rombout <mike.rombout@elastisys.com>
License mpl-2.0 was detected as apache-2.0. Signed-off-by: Mike Rombout <mike.rombout@elastisys.com>
License mit was detected as apache-2.0 OR epl-2.0. Signed-off-by: Mike Rombout <mike.rombout@elastisys.com>
License Bouncy Castyle License (mit) is detected as apache-2.0 OR gpl-2.0-plus WITH classpath-exception-2.0. Signed-off-by: Mike Rombout <mike.rombout@elastisys.com>
License erlangpl-1.1 was detected as mpl-1.1. Signed-off-by: Mike Rombout <mike.rombout@elastisys.com>
License mit was detected as cddl-1.0. Signed-off-by: Mike Rombout <mike.rombout@elastisys.com>
License apache-2.0 was detected as mit. Signed-off-by: Mike Rombout <mike.rombout@elastisys.com>
License apache-2.0 was detected as imagemagick. Signed-off-by: Mike Rombout <mike.rombout@elastisys.com>
Signed-off-by: Mike Rombout <mike.rombout@elastisys.com>
Co-authored-by: Philippe Ombredanne <pombredanne@gmail.com> Signed-off-by: Mike Rombout <mike.rombout@elastisys.com>
Signed-off-by: Mike Rombout <mike.rombout@elastisys.com>
Do not use defaultdict for Query.unknowns_by_pos and Query.stopwords_by_pos. Otherwise there are pernicious side effects to add new entries in these dctionaries when querying them after their creation. Reported-by: Mike Rombout @mrombout Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
When a key phrase definition is started in a rule (with `{{`) but is not closed (with `}}`) the rule is considered invalid and an error is thrown. Signed-off-by: Mike Rombout <mike.rombout@elastisys.com>
…_bootloader-exception Signed-off-by: Jiyeong Seok <jiyeong.seok@lge.com>
Signed-off-by: Mike Rombout <mike.rombout@elastisys.com>
Signed-off-by: Mike Rombout <mike.rombout@elastisys.com>
Signed-off-by: Mike Rombout <mike.rombout@elastisys.com>
Signed-off-by: Mike Rombout <mike.rombout@elastisys.com>
Signed-off-by: Mike Rombout <mike.rombout@elastisys.com>
These rules contained `{{` and `}}` characters, but were never intended to be considered key phrases. In order to avoid conflicts I've removed the syntax from these rules. Signed-off-by: Mike Rombout <mike.rombout@elastisys.com>
Signed-off-by: Mike Rombout <mike.rombout@elastisys.com>
This will prevent false-positives for the following scenario: 1. There are multiple matches, with a smalller match that id contained 2. It is filtered because of containment in the larger match 3. The larger is discarded because of not having keyphrases 4. There is no match left, we now have a false negative As a practical example: 1. In query there is `licensed under the AGPL 3.0` 2. In rule1 I have `AGPL 3.0` 3. In rule2 I have `licensed under the {{AGPL 3.0 license}}` 3. The query will initially match both rules 4. The match to rule1 may be discarded as contained in the rule2 match 5. The match to rule2 may be discarded as missing a key phrase By moving the key phrase filter up before the containment filter the larger match may be filtered out, giving the smaller matches a chance to stay. Signed-off-by: Mike Rombout <mike.rombout@elastisys.com>
Signed-off-by: Mike Rombout <mike.rombout@elastisys.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM and thank you ++ for this!
Fixes #2637
This PR depends on 97a1a57, contained in #2765. I have cherry-picked that commit into this PR as 2428723.
Tasks
Run tests locally to check for errors.