Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow license rules to require the presence of certain defining keywords #2773

Merged
merged 28 commits into from
Dec 25, 2021
Merged

Allow license rules to require the presence of certain defining keywords #2773

merged 28 commits into from
Dec 25, 2021

Conversation

mrombout
Copy link
Contributor

@mrombout mrombout commented Nov 30, 2021

Fixes #2637

This PR depends on 97a1a57, contained in #2765. I have cherry-picked that commit into this PR as 2428723.

Tasks

  • Reviewed contribution guidelines
  • PR is descriptively titled 📑 and links the original issue above 🔗
  • Tests pass -- look for a green checkbox ✔️ a few minutes after opening your PR
    Run tests locally to check for errors.
  • Commits are in uniquely-named feature branch and has no merge conflicts 📁

Copy link
Member

@pombredanne pombredanne left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!
See a few comments in line.

AUTHORS.rst Show resolved Hide resolved
CHANGELOG.rst Show resolved Hide resolved
src/licensedcode/data/rules/false-positive_ansible_2.RULE Outdated Show resolved Hide resolved
Copy link
Member

@pombredanne pombredanne left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is overall awesome!
I'd like to check if the impact on indexing is not too bad... but this is not a critical process anymore.

"""
Return a filtered list of kept LicenseMatch matches and a list of
discardable matches by removing all matches that do not contain all key
phrases required by the rule.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What if there is only one match and this does not contain the keyphrases?
Would you discard it, therefore leaving no matches?

Copy link
Contributor Author

@mrombout mrombout Dec 3, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the only match does not contain the key phrase wouldn't that indicate a high chance of being a false positive? If so, for my use-case I would rather have no results, but maybe that's not in-line with Scancode philosophy(e.g. not having false negatives)?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the only match does not contain the key phrase wouldn't that indicate a high chance of being a false positive?

yes in the general case, but let's take a practical example:

  1. in query I have licensed under the AGPL 3.0 and in rule1 I have AGPL 3.0
  2. in rule2 I have licensed under the {{AGPL 3.0 license}}
  3. the query will initially match both rules
  4. the match to rule1 may be discarded as contained in the rule2 match
  5. the match to rule2 may be discarded as missing a key phrase

I am left with no matches yet licensed under the AGPL 3.0 was very clearly a license notice.

On the other hand if the match to rule2 is discarded as missing a key phrase early and before the containment filter, then we are at least left with the match to rule1?

Copy link
Contributor Author

@mrombout mrombout Dec 3, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, that makes sense. But unfortunately I was unable to recreate this scenario in a unit test.

Instead I experimented and moved the key phrase filter up (before the containment filter) and 5 of the datadriven tests I've added broke. Looking into that now. E.g. tests/licensedcode/test_detection_datadriven1.py::TestLicenseDataDriven1::test_detection_datadriven_lic1_alexa_skills_kit_sdk_for_java_txt now detects [apache-2.0, apache-2.0, unknown-license-reference] as opposed to only apache-2.0.

Also, are you suggesting to skip if there's only one rule like some of the other filters?

    if len(matches) < 2:
        return matches

Copy link
Member

@pombredanne pombredanne left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added some more comments for your review.

@@ -1485,6 +1540,10 @@ def _log(_matches, _discarded, msg):
all_discarded.extend(discarded)
_log(matches, discarded, 'HIGH ENOUGH SCORE')

matches, discarded = filter_key_phrase_spans(matches)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this filter should be positioned much earlier in the process otherwise here is the risk:

  1. there was a smalller match that was contained
  2. it is filtered because of containment in a larger match
  3. the larger is discarded because of not having keyphrases
  4. there is no match left for this region and we now have a false negative

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've moved the key phrase filter up in 2c377b3, to right before the first containment filter. I was unable to reproduce this scenario exactly, but some datadriven tests at least demonstrated the effectiveness (tests/licensedcode/test_detection_datadriven1.py::TestLicenseDataDriven1::test_detection_datadriven_lic1_alexa_skills_kit_sdk_for_java_txt)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pombredanne Would the above suffice?

@mrombout mrombout marked this pull request as ready for review December 6, 2021 09:53
mrombout and others added 20 commits December 14, 2021 13:14
A key phrase can be defined in a rule by surrounding one or more
words (tokens) with the `{{` and `}}` characters. Key phrases are
words that **must** be present in order to successfully match.

The key phrases are parsed for each rule during indexing. Then after
matching using the standard matching algorithm a filter removes all
matches that do not have the key phrases present.

---

At the moment sometimes the `ispan` alone can not be relied upon to determine
if key phrases are present.

At one case stopwords and unknown tokens made it appear as if the `ispan` was
more expansive than it should be. This has been solved by checking if the
`qspan` corresponding with the `ispan` intersect with any unknown tokens or
stopwords.

Another problem that is not solved is with the example boto3 example which
should match with `cc-by-nc-sa-4.0` but it now matching with several other
CC licenses. For some reason `Creative Commons Attribution 4.0 International
License` is considered to be matching in the ispan, and not intersecting with
any unknown tokens or stopwords (citation needed).

Signed-off-by: Mike Rombout <mike.rombout@elastisys.com>
License cc-by-sa-4.0 was detected as cc-by-4.0.

Signed-off-by: Mike Rombout <mike.rombout@elastisys.com>
License mit was detected as gpl-2.0.

Signed-off-by: Mike Rombout <mike.rombout@elastisys.com>
Non-license text is detected as bsd-simplified AND gpl-2.0.

Signed-off-by: Mike Rombout <mike.rombout@elastisys.com>
Non-license text was detected as apache-2.0 AND cc-by-sa-4.0.

Signed-off-by: Mike Rombout <mike.rombout@elastisys.com>
License mit was detected as ruby.

Signed-off-by: Mike Rombout <mike.rombout@elastisys.com>
License mpl-2.0 was detected as apache-2.0.

Signed-off-by: Mike Rombout <mike.rombout@elastisys.com>
License mit was detected as apache-2.0 OR epl-2.0.

Signed-off-by: Mike Rombout <mike.rombout@elastisys.com>
License Bouncy Castyle License (mit) is detected
as apache-2.0 OR gpl-2.0-plus WITH
classpath-exception-2.0.

Signed-off-by: Mike Rombout <mike.rombout@elastisys.com>
License erlangpl-1.1 was detected as mpl-1.1.

Signed-off-by: Mike Rombout <mike.rombout@elastisys.com>
License mit was detected as cddl-1.0.

Signed-off-by: Mike Rombout <mike.rombout@elastisys.com>
License apache-2.0 was detected as mit.

Signed-off-by: Mike Rombout <mike.rombout@elastisys.com>
License apache-2.0 was detected as imagemagick.

Signed-off-by: Mike Rombout <mike.rombout@elastisys.com>
Signed-off-by: Mike Rombout <mike.rombout@elastisys.com>
Co-authored-by: Philippe Ombredanne <pombredanne@gmail.com>
Signed-off-by: Mike Rombout <mike.rombout@elastisys.com>
Signed-off-by: Mike Rombout <mike.rombout@elastisys.com>
Do not use defaultdict for Query.unknowns_by_pos and
Query.stopwords_by_pos. Otherwise there are pernicious side effects to
add new entries in these dctionaries when querying them after their
creation.

Reported-by: Mike Rombout @mrombout
Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
When a key phrase definition is started in a rule (with `{{`) but is
not closed (with `}}`) the rule is considered invalid and an error is
thrown.

Signed-off-by: Mike Rombout <mike.rombout@elastisys.com>
…_bootloader-exception

Signed-off-by: Jiyeong Seok <jiyeong.seok@lge.com>
Signed-off-by: Mike Rombout <mike.rombout@elastisys.com>
Signed-off-by: Mike Rombout <mike.rombout@elastisys.com>
Signed-off-by: Mike Rombout <mike.rombout@elastisys.com>
Signed-off-by: Mike Rombout <mike.rombout@elastisys.com>
Signed-off-by: Mike Rombout <mike.rombout@elastisys.com>
These rules contained `{{` and `}}` characters, but
were never intended to be considered key phrases.

In order to avoid conflicts I've removed the syntax
from these rules.

Signed-off-by: Mike Rombout <mike.rombout@elastisys.com>
Signed-off-by: Mike Rombout <mike.rombout@elastisys.com>
This will prevent false-positives for the following scenario:

1. There are multiple matches, with a smalller match that id contained
2. It is filtered because of containment in the larger match
3. The larger is discarded because of not having keyphrases
4. There is no match left, we now have a false negative

As a practical example:

1. In query there is `licensed under the AGPL 3.0`
2. In rule1 I have `AGPL 3.0`
3. In rule2 I have `licensed under the {{AGPL 3.0 license}}`
3. The query will initially match both rules
4. The match to rule1 may be discarded as contained in the rule2 match
5. The match to rule2 may be discarded as missing a key phrase

By moving the key phrase filter up before the containment filter the
larger match may be filtered out, giving the smaller matches a chance
to stay.

Signed-off-by: Mike Rombout <mike.rombout@elastisys.com>
Signed-off-by: Mike Rombout <mike.rombout@elastisys.com>
Copy link
Member

@pombredanne pombredanne left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM and thank you ++ for this!

@pombredanne pombredanne merged commit d515036 into aboutcode-org:develop Dec 25, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

allow license rules to require the presence of certain defining keywords
3 participants