AUTO: Score repeated bigrams correctly, fixes #1959 #1994

Merged
merged 1 commit into from Jun 15, 2021

Conversation

github-actions[bot]
Contributor

No description provided.

The current Sørensen-Dice coefficient algorithm scores strings with repeated bigrams incorrectly: the result can exceed 1, the maximum possible score. The cause is that the algorithm does not consume bigrams as it matches them, so the match count becomes the size of the Cartesian product of the matching bigrams rather than the number of matched pairs. The revised algorithm in this change consumes each bigram as it is matched, which prevents the Cartesian-product overcount and yields correct scores.
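For illustration only, here is a minimal Python sketch of the difference described above. It is not the code from this change: the function names `bigrams`, `dice_naive`, and `dice_consuming` are hypothetical, and returning 0.0 for inputs shorter than two characters is an assumption made to keep the example short.

```python
from collections import Counter


def bigrams(text):
    """Return the adjacent character pairs of text, duplicates included."""
    return [text[i:i + 2] for i in range(len(text) - 1)]


def dice_naive(a, b):
    """Flawed variant: compares every bigram of a with every bigram of b.

    Repeated bigrams are counted once per pairing (a Cartesian product),
    so the score can exceed 1.
    """
    x, y = bigrams(a), bigrams(b)
    if not x or not y:
        return 0.0  # assumption: inputs without bigrams score 0
    matches = sum(1 for p in x for q in y if p == q)
    return 2.0 * matches / (len(x) + len(y))


def dice_consuming(a, b):
    """Corrected variant: each bigram of b can be matched at most once."""
    x, y = bigrams(a), bigrams(b)
    if not x or not y:
        return 0.0  # assumption: inputs without bigrams score 0
    remaining = Counter(y)  # multiset of bigrams still available in b
    matches = 0
    for p in x:
        if remaining[p] > 0:
            remaining[p] -= 1  # consume the matched bigram
            matches += 1
    return 2.0 * matches / (len(x) + len(y))
```

With `"aaaa"` compared against itself, each string has three `"aa"` bigrams. The naive count is 3 × 3 = 9 matches, giving 2 · 9 / 6 = 3.0, while the consuming count is 3 matches, giving 2 · 3 / 6 = 1.0. For strings without repeated bigrams, such as `"night"` and `"nacht"`, both variants agree (0.25).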
@conker84 conker84 merged commit 4a76eac into 4.3 Jun 15, 2021
@conker84 conker84 deleted the auto-4.3-c6d40191e7c97f0f3f60bfb4d26c8590023280a1 branch June 15, 2021 14:37
github-actions bot added a commit that referenced this pull request Jun 15, 2021

Co-authored-by: Tom Larsen <larsenthomasj@gmail.com>
conker84 pushed a commit that referenced this pull request Jul 1, 2021

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: Tom Larsen <larsenthomasj@gmail.com>