Add new rule to fix #3738 #3750

Open: wants to merge 1 commit into develop

vasily-pozdnyakov commented Apr 26, 2024

Fixes #3738

Tasks

  • Reviewed contribution guidelines
  • PR is descriptively titled 📑 and links the original issue above 🔗
  • Commits are in uniquely-named feature branch and has no merge conflicts 📁

vasily-pozdnyakov commented Apr 26, 2024

Hi, I am new to creating rules and have a couple of questions:

  1. What does scancode-reindex-licenses do? It does not introduce any visible changes (validation is clear to me, but what does reindexing do?).
  2. Why are most of the rules weak (without "{{}}")? If I understand correctly, this might produce a lot of false positives. Say there is a license notice that matches an MIT rule but names a different license (Apache): scancode might detect it incorrectly. Is that right?

@AyanSinhaMahapatra
Member

@vasily-pozdnyakov welcome and thanks for the PR!

What does scancode-reindex-licenses do? It does not introduce any visible changes (validation is clear to me, but what does reindexing do?).

When you get SCTK from a git checkout and install scancode to run locally (or get scancode from the GitHub releases or via pip install), the license index lives at scancode-toolkit/src/licensedcode/data/cache/license_index/. This index (pre-built if you download a release or install via pip) is what scancode uses for license detection when you run a scan. If you update or add rules or licenses, run scancode-reindex-licenses so that the next time you run a scan locally these rules are in the index and are used in license detection. It makes no changes to the rules in the repository.
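As a rough illustration of the point above, here is a toy Python model of a cached index. It is not how scancode is implemented: `ToyScanner`, `reindex`, and `scan` are hypothetical names, and the "index" is just a dictionary. It only shows why a newly added rule is not used until the index is rebuilt, and that rebuilding leaves the rules themselves untouched.

```python
class ToyScanner:
    """Toy stand-in for a scanner that detects from a cached index."""

    def __init__(self, rules):
        self.rules = dict(rules)  # stands in for the rule files in the repository
        self.index = None         # stands in for the cached license index

    def reindex(self):
        # Rebuild the cache from the current rules; the rules are only read.
        self.index = {text.lower(): name for name, text in self.rules.items()}

    def scan(self, text):
        # Detection only ever consults the cached index, never the rules directly.
        if self.index is None:
            self.reindex()
        return self.index.get(text.lower())


scanner = ToyScanner({"mit_1": "Licensed under the MIT license"})
scanner.scan("Licensed under the MIT license")       # builds the index on first use

# Add a new rule: the cached index does not know about it yet.
scanner.rules["apache_1"] = "Licensed under the Apache license"
before = scanner.scan("Licensed under the Apache license")

scanner.reindex()                                     # analogous to reindexing
after = scanner.scan("Licensed under the Apache license")
```

Here `before` is `None` because the scan ran against the stale index, while `after` is the new rule's name.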

Why are most of the rules weak (without "{{}}")? If I understand correctly, this might produce a lot of false positives. Say there is a license notice that matches an MIT rule but names a different license (Apache): scancode might detect it incorrectly. Is that right?

That's mostly correct: this can (and does) generate a lot of false positives in many cases. See #2878.
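To make the false-positive scenario concrete, here is a minimal Python sketch of the idea behind "{{}}" required phrases. It is a simplified toy, not scancode's matching algorithm: `toy_match`, the token-coverage score, and the 0.8 threshold are all assumptions made for illustration. The only thing taken from the discussion is that a phrase wrapped in "{{ }}" must be present for the rule to match.

```python
import re

REQUIRED = re.compile(r"\{\{(.+?)\}\}")


def tokens(text):
    """Lowercase word tokens; punctuation is ignored."""
    return re.findall(r"[a-z0-9]+", text.lower())


def required_phrases(rule_text):
    """Token sequences marked with {{...}} in the rule text."""
    return [tokens(phrase) for phrase in REQUIRED.findall(rule_text)]


def contains_sequence(haystack, needle):
    """True if needle occurs as a contiguous run inside haystack."""
    n = len(needle)
    return any(haystack[i:i + n] == needle for i in range(len(haystack) - n + 1))


def toy_match(rule_text, candidate, min_coverage=0.8):
    """Approximate match: enough rule tokens present AND all required phrases present."""
    rule_tokens = tokens(REQUIRED.sub(r"\1", rule_text))  # drop the {{ }} markers
    cand_tokens = tokens(candidate)
    if not rule_tokens:
        return False
    cand_set = set(cand_tokens)
    coverage = sum(1 for t in rule_tokens if t in cand_set) / len(rule_tokens)
    if coverage < min_coverage:
        return False
    return all(contains_sequence(cand_tokens, p) for p in required_phrases(rule_text))


weak_rule = "Licensed under the MIT license. See the LICENSE file for details."
strong_rule = "Licensed under the {{MIT license}}. See the LICENSE file for details."
apache_notice = "Licensed under the Apache license. See the LICENSE file for details."
mit_notice = "Licensed under the MIT license. See the LICENSE file for details."

weak_hits_apache = toy_match(weak_rule, apache_notice)      # False positive
strong_hits_apache = toy_match(strong_rule, apache_notice)  # Blocked by required phrase
strong_hits_mit = toy_match(strong_rule, mit_notice)        # Still matches real MIT text
```

In this toy, the weak rule matches the Apache notice because only one token differs, while the rule with a required phrase rejects it and still matches the MIT notice.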

We are working on resolving this. I have a pending PR at #3254 that tests adding required phrases automatically and in bulk across all the rules, derived from:

  1. license names and other license keywords
  2. other instances of these required phrases already added in other rules.

This would significantly improve accuracy for these rules, and it would also make sure that when we add a required phrase, the same phrase is marked wherever it appears in all rules. There is just a bit of work remaining on testing this and verifying that it works correctly across all the rules. Otherwise, adding required phrases would be a continuous effort or a massive one-time effort.

Signed-off-by: Vasily Pozdnyakov <vasily.pozdnyakov@tngtech.com>
Development

Successfully merging this pull request may close these issues.

Surprising false positives of mit_or_gpl-3.0_17.RULE