
add support for rapidfuzz #77

Merged: 6 commits into life4:master on Jun 29, 2022

Conversation

maxbachmann (Contributor)

The implementation used by rapidfuzz provides the following algorithms:

  • Jaro/JaroWinkler (fastest by a large margin)
  • Hamming (slightly slower than python-Levenshtein)
  • Levenshtein (similarly fast to python-Levenshtein for very short strings, and fastest for longer strings)

Additionally, it supports any sequence of hashable types (e.g. lists of strings), not only text.
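That generality can be illustrated with a plain-Python sketch (not rapidfuzz's actual implementation, which is optimised C++): a textbook Levenshtein that works on any sequences of hashable items.

```python
# Illustrative pure-Python Levenshtein; rapidfuzz's C++ version is far faster.
# Works on any sequences of hashable items, not only strings.
def levenshtein(a, b):
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (x != y)))  # substitution
        prev = cur
    return prev[-1]

print(levenshtein('text', 'test'))                  # strings -> 1
print(levenshtein(['foo', 'bar'], ['foo', 'baz']))  # lists of strings -> 1
```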

Here is the benchmark result:

# Faster than textdistance:

| algorithm          | library                 | function                     |        time |
|--------------------+-------------------------+------------------------------+-------------|
| DamerauLevenshtein | jellyfish               | damerau_levenshtein_distance | 0.0181046   |
| DamerauLevenshtein | pyxdameraulevenshtein   | damerau_levenshtein_distance | 0.030925    |
| Hamming            | Levenshtein             | hamming                      | 0.000351586 |
| Hamming            | rapidfuzz.string_metric | hamming                      | 0.00040442  |
| Hamming            | jellyfish               | hamming_distance             | 0.0143502   |
| Jaro               | rapidfuzz.string_metric | jaro_similarity              | 0.000749048 |
| Jaro               | jellyfish               | jaro_similarity              | 0.0152322   |
| JaroWinkler        | rapidfuzz.string_metric | jaro_winkler_similarity      | 0.000776006 |
| JaroWinkler        | jellyfish               | jaro_winkler_similarity      | 0.0157833   |
| Levenshtein        | rapidfuzz.string_metric | levenshtein                  | 0.0010058   |
| Levenshtein        | Levenshtein             | distance                     | 0.00103176  |
| Levenshtein        | jellyfish               | levenshtein_distance         | 0.0147382   |
| Levenshtein        | pylev                   | levenshtein                  | 0.14116     |
Total: 13 libs.

And here are the benchmark results when adding slightly longer strings:

STMT = """
func('text', 'test')
func('qwer', 'asdf')
func('a' * 15, 'b' * 15)
func('a' * 30, 'b' * 30)
"""
# Faster than textdistance:

| algorithm          | library                 | function                     |        time |
|--------------------+-------------------------+------------------------------+-------------|
| DamerauLevenshtein | jellyfish               | damerau_levenshtein_distance | 0.0323887   |
| DamerauLevenshtein | pyxdameraulevenshtein   | damerau_levenshtein_distance | 0.143235    |
| Hamming            | Levenshtein             | hamming                      | 0.000489837 |
| Hamming            | rapidfuzz.string_metric | hamming                      | 0.000517879 |
| Hamming            | jellyfish               | hamming_distance             | 0.0182341   |
| Jaro               | rapidfuzz.string_metric | jaro_similarity              | 0.00111363  |
| Jaro               | jellyfish               | jaro_similarity              | 0.0201971   |
| JaroWinkler        | rapidfuzz.string_metric | jaro_winkler_similarity      | 0.00105238  |
| JaroWinkler        | jellyfish               | jaro_winkler_similarity      | 0.0206678   |
| Levenshtein        | rapidfuzz.string_metric | levenshtein                  | 0.00138601  |
| Levenshtein        | Levenshtein             | distance                     | 0.0034889   |
| Levenshtein        | jellyfish               | levenshtein_distance         | 0.0232467   |
| Levenshtein        | pylev                   | levenshtein                  | 0.599603    |
Total: 13 libs.
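These timings presumably come from running STMT through timeit; a minimal sketch of such a harness (hypothetical, using a naive stand-in func rather than any of the benchmarked libraries):

```python
import timeit

STMT = """
func('text', 'test')
func('qwer', 'asdf')
func('a' * 15, 'b' * 15)
func('a' * 30, 'b' * 30)
"""

def naive_hamming(s1, s2):
    # stand-in metric: differing positions plus the length difference
    return sum(a != b for a, b in zip(s1, s2)) + abs(len(s1) - len(s2))

elapsed = timeit.timeit(STMT, globals={'func': naive_hamming}, number=1000)
print(f'{elapsed:.6f} s for 1000 rounds')
```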

@maxbachmann (Contributor, Author)

maxbachmann commented Nov 7, 2021

@orsinium I am not quite sure how to fix the build errors, since I am not very familiar with Drone CI + Alpine Linux. Currently numpy + rapidfuzz do not provide musllinux wheels, so everything needs to be built from source. Is there any specific reason why the CI uses Alpine Linux? Pretty much no Python project provides wheels for musllinux so far.

@orsinium (Member)

Thank you! Looks cool. Please resolve CI, and I'll merge it. No, there is no particular reason to use Alpine except that it worked well before. You can migrate it to anything else; buster should do. Give it a try, and if you can't make it work, just ping me and I'll help.

@maxbachmann (Contributor, Author)

@orsinium This will likely take a bit longer, since I am currently working on v2.0.0 of RapidFuzz, which will simplify this PR.

I have one question regarding textdistance:
I generally like the way you classify metrics into categories such as edit-based distances, and I plan to do something similar when adding more metrics to RapidFuzz. However, I was unsure when a metric counts as sequence_based. I would personally have placed LCSSeq in edit_based, since it is a metric that uses only insertions/deletions but no substitutions.

@orsinium (Member)

orsinium commented Dec 8, 2021

If I remember correctly, sequence_based metrics are the ones that find common subsequences in both given strings.
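For illustration, LCSSeq reduces to the classic longest-common-subsequence DP, and the indel edit distance (insertions/deletions only, no substitutions) is len(a) + len(b) - 2 * LCS(a, b), which is why the metric fits either category. A minimal sketch:

```python
def lcs_length(a, b):
    # classic O(len(a) * len(b)) dynamic programme
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[-1][-1]

def indel_distance(a, b):
    # edit distance when only insertions/deletions are allowed
    return len(a) + len(b) - 2 * lcs_length(a, b)

print(lcs_length('AGCAT', 'GAC'))      # 2
print(indel_distance('AGCAT', 'GAC'))  # 4
```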

@maxbachmann (Contributor, Author)

maxbachmann commented Jun 28, 2022

@orsinium I finally had time for this. I updated rapidfuzz, which simplifies the addition. In addition, I updated the CI to Debian; as a side effect, this appears to decrease the CI runtime significantly.

I do not understand the test failures. From reading them, it sounds like the following occurs:

rapidfuzz.distance.Jaro.similarity('0', '0') -> 0.0
rapidfuzz.distance.JaroWinkler.similarity('0', '0') -> 0.0
rapidfuzz.distance.Jaro.similarity([], []) -> 0.0
rapidfuzz.distance.JaroWinkler.similarity([], []) -> 0.0

I cannot reproduce the first two issues:

>>> from rapidfuzz.distance import Jaro, JaroWinkler
>>> Jaro.similarity('0', '0')
1.0
>>> JaroWinkler.similarity('0', '0')
1.0

The other one comes down to the question of whether the result of matching two empty sequences should be a perfect match or no match:

>>> textdistance.algorithms.Jaro().similarity('', '')
1
>>> textdistance.algorithms.Jaro().similarity([], [])
1
>>> rapidfuzz.distance.Jaro.similarity('', '')
0.0
>>> rapidfuzz.distance.Jaro.similarity([], [])
0.0
>>> jellyfish.jaro_similarity('', '')
0.0
>>> Levenshtein.jaro_winkler('', '')
1.0

@maxbachmann (Contributor, Author)

maxbachmann commented Jun 28, 2022

As far as I can see, the behaviour for two empty sequences does not matter, since the external lib will not be called with two empty sequences anyway:

if not all(sequences):
    return self.maximum(*sequences)
# try get answer from external libs
answer = self.external_answer(*sequences)
if answer is not None:
    return answer

@orsinium (Member)

The point of the two failing tests is to ensure that it doesn't matter whether you use the external or the internal implementation of an algorithm: the result is the same in both cases. If there is a mismatch, we need to address it on one side or the other.

> I can not reproduce the first two issues

The test test_qval converts the input sequence into q-grams before running the external function. So, I guess, the input that rapidfuzz gets is something like [('0',)], not just '0'. You can run pytest with --pdb to see for sure.
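For reference, a naive sketch of that kind of q-gram split (textdistance's actual helper may differ in details): note that for a single character and q=2 the naive split is already empty, which fits the later finding that the lib was called with empty sequences.

```python
def find_ngrams(seq, n):
    # hypothetical helper: slide a window of size n over the sequence
    return list(zip(*(seq[i:] for i in range(n))))

print(find_ngrams('test', 2))  # [('t', 'e'), ('e', 's'), ('s', 't')]
print(find_ngrams('0', 2))     # [] -- a 1-char string has no 2-grams
```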

@orsinium (Member)

If you have issues only with Jaro and Jaro-Winkler, you can disable rapidfuzz for them for now and merge only the integration for Hamming and Levenshtein. You've done quite some work, and having something merged is better than nothing :)

@maxbachmann (Contributor, Author)

> The point of the two failed tests is to ensure that doesn't matter if you use an external or internal implementation of an algorithm, the result will be the same in both cases. If there is a mismatch, we need to address it on either side.

I worked around this issue by handling empty strings in textdistance.

> The test test_qval converts the input sequence into q-grams before running the external function. So, I guess, the input that rapidfuzz gets is something like [('0',)], not just '0'. You can run pytest with --pdb to see for sure.

It appears this was resolved by fixing the behaviour for empty sequences, so apparently the lib actually was called with empty sequences.


@maxbachmann (Contributor, Author)

Somehow my last changes caused the flake8 check to fail, even though it worked yesterday. Not quite sure what this is about.

It worked after re-triggering the CI.

I rebased everything on master; this should be ready to merge.

@@ -97,6 +97,18 @@ def get_function(self):
# object constructor - the distance metric method is
# called dist_abs() (whereas dist() gives a normalised distance)
obj = getattr(module, self.func_name)().dist_abs
elif self.module_name in {'rapidfuzz.distance.Jaro', 'rapidfuzz.distance.JaroWinkler'}:
orsinium (Member)

Interesting 🤔 So, rapidfuzz considers two empty strings to have 0 similarity? Why? An empty string is equal to another empty string.

orsinium (Member)

Also, it does this only for Jaro but not for Hamming, if I understand correctly. Very interesting. Can we fix it upstream? Or is there anything in the original algorithm that says otherwise?

maxbachmann (Contributor, Author)

It does this for everything right now. However, you only use the Hamming distance, which is 0 in this case, and perform the normalisation inside textdistance.
Looking at Wikipedia (https://en.wikipedia.org/wiki/Jaro%E2%80%93Winkler_distance), it appears that the similarity is supposed to be 0 when there are 0 matching characters, which is the case for two empty strings.
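A minimal pure-Python Jaro sketch under that definition (zero matching characters yields similarity 0; this is the textbook algorithm, not rapidfuzz's or textdistance's actual code):

```python
def jaro(s1, s2):
    if not s1 and not s2:
        return 0.0  # the convention under discussion: m = 0 -> similarity 0
    match_window = max(max(len(s1), len(s2)) // 2 - 1, 0)
    s1_flags = [False] * len(s1)
    s2_flags = [False] * len(s2)
    for i, c in enumerate(s1):          # greedy matching within the window
        lo = max(0, i - match_window)
        hi = min(len(s2), i + match_window + 1)
        for j in range(lo, hi):
            if not s2_flags[j] and s2[j] == c:
                s1_flags[i] = s2_flags[j] = True
                break
    m = sum(s1_flags)
    if m == 0:
        return 0.0
    s1_m = [c for c, f in zip(s1, s1_flags) if f]
    s2_m = [c for c, f in zip(s2, s2_flags) if f]
    t = sum(a != b for a, b in zip(s1_m, s2_m)) / 2  # half-transpositions
    return (m / len(s1) + m / len(s2) + (m - t) / m) / 3

print(jaro('0', '0'))  # 1.0
print(jaro('', ''))    # 0.0
```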

orsinium (Member)

Interesting. I need to think about it a bit. One of the goals of textdistance is to be consistent in the behavior of all algorithms. And if two strings are equal, they have a zero distance.

maxbachmann (Contributor, Author)

Similar to textdistance, I have the following methods:

distance
similarity
normalized_similarity
normalized_distance

For two empty strings I handle them in the following way:

distance -> 0
similarity -> max - distance = 0 - 0 = 0
normalized_distance -> dist / max if max else 0 -> 0
normalized_similarity -> 1.0 - normalized_distance -> 1.0
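With Hamming as a stand-in metric, that convention can be sketched as follows (hypothetical helper names, not rapidfuzz's actual API):

```python
def distance(a, b):
    # Hamming-style distance: differing positions plus the length gap
    return sum(x != y for x, y in zip(a, b)) + abs(len(a) - len(b))

def maximum(a, b):
    return max(len(a), len(b))

def similarity(a, b):
    return maximum(a, b) - distance(a, b)

def normalized_distance(a, b):
    m = maximum(a, b)
    return distance(a, b) / m if m else 0.0

def normalized_similarity(a, b):
    return 1.0 - normalized_distance(a, b)

# two empty strings: distance 0, similarity 0,
# normalized_distance 0.0, normalized_similarity 1.0
print(similarity('', ''), normalized_similarity('', ''))  # 0 1.0
```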

@orsinium (Member)

orsinium commented Jun 29, 2022

That should do it now. I've patched the tests to skip quick results, since those are calculated before calling any external libs anyway.

UPD: yep, you mentioned it :)

@orsinium orsinium merged commit b8dbc02 into life4:master Jun 29, 2022
@orsinium (Member)

Thank you!
