The results differ significantly from FuzzyWuzzy #112

wavenator · 2021-07-11T14:55:24Z

These strings return different results depending on which library you are using.

In [7]: rapidfuzz.fuzz.partial_ratio('vodafone', 'bentley')
Out[7]: 44.44444444444444

In [8]: fuzzywuzzy.fuzz.partial_ratio('vodafone', 'bentley')
Out[8]: 29

In [3]: rapidfuzz.fuzz.partial_ratio('vodafone', 'blablalbaasjd')
Out[3]: 18.181818181818187

In [4]: fuzzywuzzy.fuzz.partial_ratio('vodafone', 'blablalbaasjd')
Out[4]: 12

maxbachmann · 2021-07-11T16:07:22Z

FuzzyWuzzy uses either difflib (a pure Python implementation) or python-Levenshtein as its backend. Judging from your results, you are using the python-Levenshtein version. fuzz.partial_ratio involves two relevant algorithms:

algorithm to calculate the optimal alignment
algorithm to calculate the normalized edit distance

partial_ratio tries to find a substring of length len(shorter) in the longer string (the substring can be shorter if it sits at the end of the longer string). To find this substring, FuzzyWuzzy uses difflib.SequenceMatcher with get_matching_blocks. (Strictly speaking, this is not a valid way to find the best-aligned substring, but it works relatively well in most cases.) The implementation of get_matching_blocks in python-Levenshtein is completely broken, so for RapidFuzz I decided to use:

the difflib implementation for the optimal alignment
the Levenshtein distance for the normalized edit distance

So it is basically a combination of the two FuzzyWuzzy variants.
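The combination described above can be sketched in pure Python. This is a minimal illustration, not RapidFuzz's actual code: the function names are hypothetical, and the scoring assumes the normalized Levenshtein similarity in which insertions and deletions cost 1 and substitutions cost 2 (equivalent to 2 * LCS / total length), which is the variant that reproduces the figures quoted in this thread.

```python
from difflib import SequenceMatcher


def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence, computed row by row."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def normalized_similarity(a: str, b: str) -> float:
    """Similarity in [0, 100]; assumes substitution cost 2 (hypothetical helper)."""
    total = len(a) + len(b)
    if total == 0:
        return 100.0
    return 100.0 * 2 * lcs_length(a, b) / total


def partial_ratio_sketch(s1: str, s2: str) -> float:
    """Align with difflib's matching blocks, score with normalized Levenshtein."""
    shorter, longer = (s1, s2) if len(s1) <= len(s2) else (s2, s1)
    blocks = SequenceMatcher(None, shorter, longer).get_matching_blocks()
    scores = []
    for block in blocks:
        # Shift the window so the matching block lines up in both strings.
        start = max(block.b - block.a, 0)
        window = longer[start:start + len(shorter)]
        scores.append(normalized_similarity(shorter, window))
    return max(scores)


if __name__ == "__main__":
    print(partial_ratio_sketch("vodafone", "bentley"))        # about 44.44
    print(partial_ratio_sketch("vodafone", "blablalbaasjd"))  # about 18.18
```

Under these assumptions the sketch gives roughly 44.44 and 18.18 for the two pairs from the original report, in line with the RapidFuzz values shown there.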

In [7]: rapidfuzz.fuzz.partial_ratio('vodafone', 'bentley')
Out[7]: 44.44444444444444

In [8]: fuzzywuzzy.fuzz.partial_ratio('vodafone', 'bentley')
Out[8]: 29

This is an example where python-Levenshtein returns incorrect results. Compare the matching blocks returned by python-Levenshtein and by difflib:

>>> Levenshtein.StringMatcher.StringMatcher(None, 'bentley', 'vodafone').get_matching_blocks()
[(7, 8, 0)]
>>> difflib.SequenceMatcher(None, 'bentley', 'vodafone').get_matching_blocks()
[Match(a=1, b=7, size=1), Match(a=7, b=8, size=0)]

As a result, RapidFuzz tests the following alignments:

bentley <-> ne       -> normalized Levenshtein = 44.44444444444444
bentley <-> odafone  -> normalized Levenshtein = 28.57

FuzzyWuzzy using difflib tests the same alignments, but scores them with difflib's ratio:

bentley <-> ne       -> difflib.SequenceMatcher.ratio = 22
bentley <-> odafone  -> difflib.SequenceMatcher.ratio = 14
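The difflib scores above can be checked with the standard library alone. This is a small sketch with a hypothetical helper name; it assumes the ratio is scaled to 0-100 and rounded to an integer, which matches the integer results FuzzyWuzzy reports in this thread.

```python
from difflib import SequenceMatcher


def difflib_score(a: str, b: str) -> int:
    """difflib's ratio (2 * matches / total length), scaled to 0-100 and rounded."""
    return int(round(100 * SequenceMatcher(None, a, b).ratio()))


if __name__ == "__main__":
    for window in ("ne", "odafone"):
        print(window, difflib_score("bentley", window))
```

difflib counts only the characters inside its matching blocks (a single 'e' in both cases), whereas the Levenshtein-based score also credits the 'n', which is why the two backends disagree on the same alignments.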

FuzzyWuzzy using python-Levenshtein tests only this alignment:

bentley <-> odafone  -> normalized Levenshtein = 29

The same applies to the second example. This broken implementation has been known for quite a while (see seatgeek/fuzzywuzzy#79), but apparently the results are good enough for SeatGeek, and it is faster than the pure Python implementation.

wavenator · 2021-07-12T08:39:38Z

I appreciate your thorough and proficient explanation. It makes a lot of sense now.

wavenator closed this as completed Jul 12, 2021
