The results differ significantly from FuzzyWuzzy #112
Comments
FuzzyWuzzy uses either difflib (a pure Python implementation) or python-Levenshtein. Judging from your results, you used the python-Levenshtein variant. In fuzz.partial_ratio there are two relevant algorithms:
It tries to find a substring of len(shorter) in the longer string (the substring can be shorter if it sits at the end of the longer string). To find this substring, FuzzyWuzzy uses difflib.SequenceMatcher with get_matching_blocks. (This is not actually a valid way to find the best-aligned substring, but it works relatively well in most cases.) The implementation of this get_matching_blocks method in python-Levenshtein is completely broken, so for RapidFuzz I decided to use:
In [7]: rapidfuzz.fuzz.partial_ratio('vodafone', 'bentley')
Out[7]: 44.44444444444444
In [8]: fuzzywuzzy.fuzz.partial_ratio('vodafone', 'bentley')
Out[8]: 29
This is an example where python-Levenshtein produces incorrect results. python-Levenshtein and difflib return the following matching blocks:
>>> Levenshtein.StringMatcher.StringMatcher(None, 'bentley', 'vodafone').get_matching_blocks()
[(7, 8, 0)]
>>> difflib.SequenceMatcher(None, 'bentley', 'vodafone').get_matching_blocks()
[Match(a=1, b=7, size=1), Match(a=7, b=8, size=0)]
As a result, RapidFuzz tests the following alignments:
While FuzzyWuzzy using difflib tests the following alignments:
and FuzzyWuzzy using python-Levenshtein tests these alignments:
The same applies to the second example. This broken implementation has been known for quite a while (seatgeek/fuzzywuzzy#79), but apparently the results are good enough for SeatGeek and faster than the pure Python implementation.
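To make the dependence on the matching blocks concrete, here is a minimal sketch of how candidate alignments can be derived from them. It assumes FuzzyWuzzy takes each block's offset (b - a, clamped at 0) as the window start and cuts a window of len(shorter) from the longer string; the helper name candidate_windows is hypothetical and only difflib is used, so it illustrates the difflib variant, not python-Levenshtein or RapidFuzz.

```python
from difflib import SequenceMatcher


def candidate_windows(s1, s2):
    """Return the substrings of the longer string that would be compared
    against the shorter one, one per matching block (hypothetical helper)."""
    # The shorter string is aligned against windows of the longer one.
    shorter, longer = (s1, s2) if len(s1) <= len(s2) else (s2, s1)
    blocks = SequenceMatcher(None, shorter, longer).get_matching_blocks()

    windows = []
    for a, b, _size in blocks:
        # Assumption: the window starts where the block implies the
        # shorter string would begin inside the longer one.
        start = max(b - a, 0)
        # Slicing truncates the window at the end of the longer string.
        windows.append(longer[start:start + len(shorter)])
    return windows


print(candidate_windows('vodafone', 'bentley'))
```

With the difflib blocks quoted above this yields the windows 'ne' and 'odafone'. If only the (7, 8, 0) block is returned, as python-Levenshtein does here, the 'ne' window is never tested, which is consistent with the lower score.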
I appreciate your thorough and proficient explanation. It makes a lot of sense now.
These strings return different results depending on which library you are using.