Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Punish length differences more, give fill-in fields more leeway #475

Merged
merged 3 commits into from
Jan 14, 2021

Conversation

mlinksva
Copy link
Contributor

Also follow up on 5436184 by adding - (in addition to /) as a word character so to minimize wordset differences when one adds a whole URL with -'s in it to a fill in (would also work for hyphenated names)

Overall this seems right, though it'd be comforting to have a far larger test corpus. Length shouldn't vary -- licenses shouldn't be modified -- but it's inevitable with licenses with fill-in fields.

Fixes #474 which was about the former -- a modified Apache 2.0 license.

Also follow up on 5436184 by
adding - (in addition to /) as a word character so to minimize
wordset differences when one adds a whole URL with -'s in it to
a fill in (would also work for hyphenated names)

Overall this seems right, though it'd be comforting to have a far
larger test corpus. Length shouldn't vary -- licenses shouldn't be
modified -- but it's inevitable with licenses with fill-in fields.

Fixes #474 which was about the former -- a modified Apache 2.0 license.
@benbalter
Copy link
Contributor

Rather than playing whack-a-mole with length deltas (which was originally intended as a performance improvement - limiting the corpus of licenses to compare - not a precision improvement), would it make more sense to either (A) raise the confidence threshold or add the modified apache as a license that we recognize (which would then presumably have a higher match score over the original apache?

@mlinksva
Copy link
Contributor Author

would it make more sense to either (A) raise the confidence threshold or [B] add the modified apache as a license that we recognize (which would then presumably have a higher match score over the original apache?

Probably not. (A) causes false negatives for to explode; this PR effectively raises the confidence threshold for longer licenses (by penalizing differences more) but without impacting all licenses that much. (B) is more whack-a-mole than this PR -- adding negative licenses addresses specific instances, rather than a general pattern of variation. Adding known specifics to the test suite seems like the right thing to do. Plus, adding notion of specific licenses to not detect would require more refactoring.

Copy link
Contributor

@benbalter benbalter left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One question, otherwise 👍

@@ -98,7 +98,7 @@ markdown-artistic:
hash: fb7d858f0e6b9885b5ae4a7a763888f01624ebdf
markdown-gpl:
key: gpl-2.0
matcher: exact
matcher: dice
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Curious why the matcher when from exact to dice with theoretically better normalization?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The increased punishment for length delta exposed imperfect markdown normalization. I'm fairly certain that's what the reason is, but I'll try to confirm and fix when I have some time to dig in.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One character fix, but it does change the hashes of many licenses.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I benefit though is that -diff output will be less noisy now for some licenses.

@mlinksva mlinksva merged commit cbae8f6 into master Jan 14, 2021
@delete-merged-branch delete-merged-branch bot deleted the pixar-modified-apache branch January 14, 2021 01:35
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

Projects with modified Apache license are marked as Apache
2 participants