
add support for rapidfuzz #77

Merged: 6 commits into life4:master on Jun 29, 2022

Conversation

maxbachmann (Contributor)

The implementation used by rapidfuzz provides the following algorithms:

  • Jaro/JaroWinkler (fastest by a large margin)
  • Hamming (slightly slower than python-Levenshtein)
  • Levenshtein (similarly fast to python-Levenshtein for very short strings, and fastest for longer strings)

Additionally, it supports any sequence of hashable types (e.g. lists of strings), not only text.
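That generality can be illustrated with a plain-Python sketch (not rapidfuzz's actual implementation, which is optimised C++): a textbook Levenshtein that works on any sequences of hashable items.

```python
# Illustrative pure-Python Levenshtein; rapidfuzz's C++ version is far faster.
# Works on any sequences of hashable items, not only strings.
def levenshtein(a, b):
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (x != y)))  # substitution
        prev = cur
    return prev[-1]

print(levenshtein('text', 'test'))                  # strings -> 1
print(levenshtein(['foo', 'bar'], ['foo', 'baz']))  # lists of strings -> 1
```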

Here is the benchmark result:

# Faster than textdistance:

| algorithm          | library                 | function                     |        time |
|--------------------+-------------------------+------------------------------+-------------|
| DamerauLevenshtein | jellyfish               | damerau_levenshtein_distance | 0.0181046   |
| DamerauLevenshtein | pyxdameraulevenshtein   | damerau_levenshtein_distance | 0.030925    |
| Hamming            | Levenshtein             | hamming                      | 0.000351586 |
| Hamming            | rapidfuzz.string_metric | hamming                      | 0.00040442  |
| Hamming            | jellyfish               | hamming_distance             | 0.0143502   |
| Jaro               | rapidfuzz.string_metric | jaro_similarity              | 0.000749048 |
| Jaro               | jellyfish               | jaro_similarity              | 0.0152322   |
| JaroWinkler        | rapidfuzz.string_metric | jaro_winkler_similarity      | 0.000776006 |
| JaroWinkler        | jellyfish               | jaro_winkler_similarity      | 0.0157833   |
| Levenshtein        | rapidfuzz.string_metric | levenshtein                  | 0.0010058   |
| Levenshtein        | Levenshtein             | distance                     | 0.00103176  |
| Levenshtein        | jellyfish               | levenshtein_distance         | 0.0147382   |
| Levenshtein        | pylev                   | levenshtein                  | 0.14116     |
Total: 13 libs.

And here are the benchmark results when adding slightly longer strings:

STMT = """
func('text', 'test')
func('qwer', 'asdf')
func('a' * 15, 'b' * 15)
func('a' * 30, 'b' * 30)
"""
# Faster than textdistance:

| algorithm          | library                 | function                     |        time |
|--------------------+-------------------------+------------------------------+-------------|
| DamerauLevenshtein | jellyfish               | damerau_levenshtein_distance | 0.0323887   |
| DamerauLevenshtein | pyxdameraulevenshtein   | damerau_levenshtein_distance | 0.143235    |
| Hamming            | Levenshtein             | hamming                      | 0.000489837 |
| Hamming            | rapidfuzz.string_metric | hamming                      | 0.000517879 |
| Hamming            | jellyfish               | hamming_distance             | 0.0182341   |
| Jaro               | rapidfuzz.string_metric | jaro_similarity              | 0.00111363  |
| Jaro               | jellyfish               | jaro_similarity              | 0.0201971   |
| JaroWinkler        | rapidfuzz.string_metric | jaro_winkler_similarity      | 0.00105238  |
| JaroWinkler        | jellyfish               | jaro_winkler_similarity      | 0.0206678   |
| Levenshtein        | rapidfuzz.string_metric | levenshtein                  | 0.00138601  |
| Levenshtein        | Levenshtein             | distance                     | 0.0034889   |
| Levenshtein        | jellyfish               | levenshtein_distance         | 0.0232467   |
| Levenshtein        | pylev                   | levenshtein                  | 0.599603    |
Total: 13 libs.
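These timings presumably come from running STMT through timeit; a minimal sketch of such a harness (hypothetical, using a naive stand-in func rather than any of the benchmarked libraries):

```python
import timeit

STMT = """
func('text', 'test')
func('qwer', 'asdf')
func('a' * 15, 'b' * 15)
func('a' * 30, 'b' * 30)
"""

def naive_hamming(s1, s2):
    # stand-in metric: differing positions plus the length difference
    return sum(a != b for a, b in zip(s1, s2)) + abs(len(s1) - len(s2))

elapsed = timeit.timeit(STMT, globals={'func': naive_hamming}, number=1000)
print(f'{elapsed:.6f} s for 1000 rounds')
```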

@maxbachmann (Contributor, Author)

maxbachmann commented Nov 7, 2021

@orsinium I am not quite sure how to fix the build errors, since I am not very familiar with Drone CI + Alpine Linux. Currently numpy + rapidfuzz do not provide musllinux wheels, so everything needs to be built from source. Is there any specific reason why the CI uses Alpine Linux? Pretty much no Python project provides wheels for musllinux so far.

@orsinium (Member)

Thank you! Looks cool. Please resolve CI, and I'll merge it. No, there is no particular reason to use Alpine except that it worked well before. You can migrate it to anything else; buster should do. Give it a try, and if you can't make it work, just ping me and I'll help.

@maxbachmann (Contributor, Author)

@orsinium This will likely take a bit longer, since I am currently working on v2.0.0 of RapidFuzz, which will simplify this PR.

I have one question regarding textdistance:
I generally like the way you classify metrics into categories such as edit-based distances, and I plan to do something similar when adding more metrics to RapidFuzz. However, I was unsure when a metric counts as sequence_based. I would personally have placed LCSSeq in edit_based, since it is a metric that uses only insertions/deletions but no substitutions.

@orsinium (Member)

orsinium commented Dec 8, 2021

If I remember correctly, sequence_based metrics are the ones that find common subsequences in both given strings.
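For illustration, LCSSeq reduces to the classic longest-common-subsequence DP, and the indel edit distance (insertions/deletions only, no substitutions) is len(a) + len(b) - 2 * LCS(a, b), which is why the metric fits either category. A minimal sketch:

```python
def lcs_length(a, b):
    # classic O(len(a) * len(b)) dynamic programme
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[-1][-1]

def indel_distance(a, b):
    # edit distance when only insertions/deletions are allowed
    return len(a) + len(b) - 2 * lcs_length(a, b)

print(lcs_length('AGCAT', 'GAC'))      # 2
print(indel_distance('AGCAT', 'GAC'))  # 4
```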

@maxbachmann (Contributor, Author)

maxbachmann commented Jun 28, 2022

@orsinium I finally had time for this. I updated rapidfuzz, which simplifies the addition. In addition, I updated the CI to Debian; as a side effect, this appears to decrease the CI runtime significantly.

I do not understand the test failures. From reading them, it sounds like the following occurs:

rapidfuzz.distance.Jaro.similarity('0', '0') -> 0.0
rapidfuzz.distance.JaroWinkler.similarity('0', '0') -> 0.0
rapidfuzz.distance.Jaro.similarity([], []) -> 0.0
rapidfuzz.distance.JaroWinkler.similarity([], []) -> 0.0

I cannot reproduce the first two issues:

>>> from rapidfuzz.distance import Jaro, JaroWinkler
>>> Jaro.similarity('0', '0')
1.0
>>> JaroWinkler.similarity('0', '0')
1.0

The other one comes down to the question of whether the result of matching two empty sequences should be a perfect match or no match:

>>> textdistance.algorithms.Jaro().similarity('', '')
1
>>> textdistance.algorithms.Jaro().similarity([], [])
1
>>> rapidfuzz.distance.Jaro.similarity('', '')
0.0
>>> rapidfuzz.distance.Jaro.similarity([], [])
0.0
>>> jellyfish.jaro_similarity('', '')
0.0
>>> Levenshtein.jaro_winkler('', '')
1.0

@maxbachmann (Contributor, Author)

maxbachmann commented Jun 28, 2022

As far as I can see, the behaviour for two empty sequences does not matter, since the external lib will not be called with two empty sequences anyway:

if not all(sequences):
    return self.maximum(*sequences)
# try get answer from external libs
answer = self.external_answer(*sequences)
if answer is not None:
    return answer

@orsinium (Member)

The point of the two failing tests is to ensure that it doesn't matter whether you use the external or the internal implementation of an algorithm: the result is the same in both cases. If there is a mismatch, we need to address it on one side or the other.

> I can not reproduce the first two issues

The test test_qval converts the input sequence into q-grams before running the external function. So, I guess, the input that rapidfuzz gets is something like [('0',)], not just '0'. You can run pytest with --pdb to see for sure.
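For reference, a naive sketch of that kind of q-gram split (textdistance's actual helper may differ in details): note that for a single character and q=2 the naive split is already empty, which fits the later finding that the lib was called with empty sequences.

```python
def find_ngrams(seq, n):
    # hypothetical helper: slide a window of size n over the sequence
    return list(zip(*(seq[i:] for i in range(n))))

print(find_ngrams('test', 2))  # [('t', 'e'), ('e', 's'), ('s', 't')]
print(find_ngrams('0', 2))     # [] -- a 1-char string has no 2-grams
```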

@orsinium (Member)

If you have issues only with Jaro and Jaro-Winkler, you can disable rapidfuzz for them for now and merge only the integration for Hamming and Levenshtein. You've done quite some work, and having something merged is better than nothing :)

@maxbachmann (Contributor, Author)

> The point of the two failed tests is to ensure that doesn't matter if you use an external or internal implementation of an algorithm, the result will be the same in both cases. If there is a mismatch, we need to address it on either side.

I worked around this issue by handling empty strings in textdistance.

> The test test_qval converts the input sequence into q-grams before running the external function. So, I guess, the input that rapidfuzz gets is something like [('0',)], not just '0'. You can run pytest with --pdb to see for sure.

It appears this was resolved by fixing the behaviour for empty sequences, so apparently the lib actually was called with empty sequences.


@maxbachmann (Contributor, Author)

Somehow my last changes caused the flake8 check to fail, even though it worked yesterday. Not quite sure what this is about.

It worked after re-triggering the CI.

I rebased everything on master; this should be ready to merge.

@@ -97,6 +97,18 @@ def get_function(self):
# object constructor - the distance metric method is
# called dist_abs() (whereas dist() gives a normalised distance)
obj = getattr(module, self.func_name)().dist_abs
elif self.module_name in {'rapidfuzz.distance.Jaro', 'rapidfuzz.distance.JaroWinkler'}:
orsinium (Member)

Interesting 🤔 So, rapidfuzz considers two empty strings to have 0 similarity? Why? An empty string is equal to another empty string.

orsinium (Member)

Also, it does this only for Jaro but not for Hamming, if I understand correctly. Very interesting. Can we fix it upstream? Or is there anything in the original algorithm that says otherwise?

maxbachmann (Contributor, Author)

It does this for everything right now. However, you only use the Hamming distance, which is 0 in this case, and perform the normalisation inside textdistance.
Looking at Wikipedia (https://en.wikipedia.org/wiki/Jaro%E2%80%93Winkler_distance), it appears that the similarity is supposed to be 0 when there are 0 matching characters, which is the case for two empty strings.
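A minimal pure-Python Jaro sketch under that definition (zero matching characters yields similarity 0; this is the textbook algorithm, not rapidfuzz's or textdistance's actual code):

```python
def jaro(s1, s2):
    if not s1 and not s2:
        return 0.0  # the convention under discussion: m = 0 -> similarity 0
    match_window = max(max(len(s1), len(s2)) // 2 - 1, 0)
    s1_flags = [False] * len(s1)
    s2_flags = [False] * len(s2)
    for i, c in enumerate(s1):          # greedy matching within the window
        lo = max(0, i - match_window)
        hi = min(len(s2), i + match_window + 1)
        for j in range(lo, hi):
            if not s2_flags[j] and s2[j] == c:
                s1_flags[i] = s2_flags[j] = True
                break
    m = sum(s1_flags)
    if m == 0:
        return 0.0
    s1_m = [c for c, f in zip(s1, s1_flags) if f]
    s2_m = [c for c, f in zip(s2, s2_flags) if f]
    t = sum(a != b for a, b in zip(s1_m, s2_m)) / 2  # half-transpositions
    return (m / len(s1) + m / len(s2) + (m - t) / m) / 3

print(jaro('0', '0'))  # 1.0
print(jaro('', ''))    # 0.0
```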

orsinium (Member)

Interesting. I need to think about it a bit. One of the goals of textdistance is to be consistent in the behavior of all algorithms. And if two strings are equal, they have a zero distance.

maxbachmann (Contributor, Author)

Similar to textdistance, I have the following methods:

distance
similarity
normalized_similarity
normalized_distance

For two empty strings I handle them in the following way:

distance -> 0
similarity -> max - distance = 0 - 0 = 0
normalized_distance -> dist / max if max else 0 -> 0
normalized_similarity -> 1.0 - normalized_distance -> 1.0
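With Hamming as a stand-in metric, that convention can be sketched as follows (hypothetical helper names, not rapidfuzz's actual API):

```python
def distance(a, b):
    # Hamming-style distance: differing positions plus the length gap
    return sum(x != y for x, y in zip(a, b)) + abs(len(a) - len(b))

def maximum(a, b):
    return max(len(a), len(b))

def similarity(a, b):
    return maximum(a, b) - distance(a, b)

def normalized_distance(a, b):
    m = maximum(a, b)
    return distance(a, b) / m if m else 0.0

def normalized_similarity(a, b):
    return 1.0 - normalized_distance(a, b)

# two empty strings: distance 0, similarity 0,
# normalized_distance 0.0, normalized_similarity 1.0
print(similarity('', ''), normalized_similarity('', ''))  # 0 1.0
```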

@orsinium (Member)

orsinium commented Jun 29, 2022

That should do it now. I've patched the tests to skip quick results, since those are calculated before calling any external libs anyway.

UPD: yep, you mentioned it :)

@orsinium orsinium merged commit b8dbc02 into life4:master Jun 29, 2022
@orsinium (Member)

Thank you!
