Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

The results differ significantly from FuzzyWuzzy #112

Closed
wavenator opened this issue Jul 11, 2021 · 2 comments
Closed

The results differ significantly from FuzzyWuzzy #112

wavenator opened this issue Jul 11, 2021 · 2 comments

Comments

@wavenator
Copy link

These strings return different results depending on which library you are using.

In [7]: rapidfuzz.fuzz.partial_ratio('vodafone', 'bentley')
Out[7]: 44.44444444444444

In [8]: fuzzywuzzy.fuzz.partial_ratio('vodafone', 'bentley')
Out[8]: 29
In [3]: rapidfuzz.fuzz.partial_ratio('vodafone', 'blablalbaasjd')
Out[3]: 18.181818181818187

In [4]: fuzzywuzzy.fuzz.partial_ratio('vodafone', 'blablalbaasjd')
Out[4]: 12
@maxbachmann
Copy link
Member

maxbachmann commented Jul 11, 2021

FuzzyWuzzy uses either difflib as a pure Python implementation or python-Levenshtein. From your results you used the version using python-Levenshtein. In fuzz.partial_ratio there are two relevant algorithms:

  1. algorithm to calculate the optimal alignment
  2. algorithm to calculate the normalized edit distance

It tries to find a substring of len(shorter) in the longer string (the substring can be shorter if it is placed at the end of the longer string). To find this substring fuzzywuzzy uses difflib.SequenceMatcher with get_matching_blocks (In fact this is not a valid way to find the best aligned substring, but works relatively well in most cases). The implementation of this get_matching_blocks method in python-Levenshtein is completely broken, so for RapidFuzz I decided to use:

  1. The implementation of difflib for the optimal alignment
  2. the Levenshtein distance for the normalized edit distance
    So basically a combination of these two variants of FuzzyWuzzy
In [7]: rapidfuzz.fuzz.partial_ratio('vodafone', 'bentley')
Out[7]: 44.44444444444444

In [8]: fuzzywuzzy.fuzz.partial_ratio('vodafone', 'bentley')
Out[8]: 29

This is an example, where python-Levenshtein has incorrect results. difflib returns the following matching blocks:

>>> Levenshtein.StringMatcher.StringMatcher(None, 'bentley', 'vodafone').get_matching_blocks()
[(7, 8, 0)]
>>> difflib.SequenceMatcher(None, 'bentley', 'vodafone').get_matching_blocks()
[Match(a=1, b=7, size=1), Match(a=7, b=8, size=0)]

This has the result, that RapidFuzz tests the following alignments:

bentley <-> ne          -> normalized Levenshtein = 44.44444444444444
bentley <-> odafone  -> normalized Levenshtein = 28.57

While FuzzyWuzzy using difflib tests the following alignments:

bentley <-> ne          -> difflib.SequenceMatcher.ratio = 22
bentley <-> odafone  -> difflib.SequenceMatcher.ratio = 14

and FuzzyWuzzy using python-Levenshtein tests these alignments:

bentley <-> odafone  -> normalized Levenshtein = 29

The same applies to the second example. This broken implementation is already known for quite a while: seatgeek/fuzzywuzzy#79, but apparently the results are good enough for SeatGeek and faster than the pure Python implementation.

@wavenator
Copy link
Author

I appreciate your thorough and proficient explanation. It makes a lot of sense now.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants