-
Notifications
You must be signed in to change notification settings - Fork 259
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Punish length differences more, give fill-in fields more leeway #475
Conversation
Also follow up on 5436184 by adding - (in addition to /) as a word character so to minimize wordset differences when one adds a whole URL with -'s in it to a fill in (would also work for hyphenated names) Overall this seems right, though it'd be comforting to have a far larger test corpus. Length shouldn't vary -- licenses shouldn't be modified -- but it's inevitable with licenses with fill-in fields. Fixes #474 which was about the former -- a modified Apache 2.0 license.
Rather than playing whack-a-mole with length deltas (which was originally intended as a performance improvement - limiting the corpus of licenses to compare - not a precision improvement), would it make more sense to either (A) raise the confidence threshold or add the modified apache as a license that we recognize (which would then presumably have a higher match score over the original apache? |
Probably not. (A) causes false negatives for to explode; this PR effectively raises the confidence threshold for longer licenses (by penalizing differences more) but without impacting all licenses that much. (B) is more whack-a-mole than this PR -- adding negative licenses addresses specific instances, rather than a general pattern of variation. Adding known specifics to the test suite seems like the right thing to do. Plus, adding notion of specific licenses to not detect would require more refactoring. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One question, otherwise 👍
spec/fixtures/fixtures.yml
Outdated
@@ -98,7 +98,7 @@ markdown-artistic: | |||
hash: fb7d858f0e6b9885b5ae4a7a763888f01624ebdf | |||
markdown-gpl: | |||
key: gpl-2.0 | |||
matcher: exact | |||
matcher: dice |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Curious why the matcher when from exact to dice with theoretically better normalization?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The increased punishment for length delta exposed imperfect markdown normalization. I'm fairly certain that's what the reason is, but I'll try to confirm and fix when I have some time to dig in.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One character fix, but it does change the hashes of many licenses.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I benefit though is that -diff
output will be less noisy now for some licenses.
…th emphasis also normalized to -
Also follow up on 5436184 by adding - (in addition to /) as a word character so to minimize wordset differences when one adds a whole URL with -'s in it to a fill in (would also work for hyphenated names)
Overall this seems right, though it'd be comforting to have a far larger test corpus. Length shouldn't vary -- licenses shouldn't be modified -- but it's inevitable with licenses with fill-in fields.
Fixes #474 which was about the former -- a modified Apache 2.0 license.