Skip to content

nathanrooy/turboshtein

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

6 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

turboshtein

ci GitHub license PyPI pyversions PyPi Version

Background

I didn't study computer science so bitwise operations have always been a bit of a gray area for me. I finally gave in and decided to spend some time reading up on the subject and ended up implementing the Myers bit-parallel algorithm[1] for computing Levenshtein edit distance. I was fairly happy with the results so I wrapped the original C code in Python and uploaded it to PyPI. Coincidentally, it's also pretty fast.

Usage

Install via PyPI:

pip install turboshtein

Or not:

python setup.py install

Computing levenshtein edit distance between two strings:

>>> from turboshtein import levenshtein
>>> levenshtein("saturday", "sunday")
3

Note that there are several limitations, mainly:

  • Max string length must be less than 64.
  • ASCII characters only.

Performance

Values represent the number of string pairs processed per second.

library version m=n=8 m=n=16 m=n=24 m=n=32 m=n=40 m=n=48 m=n=56
turboshtein.levenshtein 0.0.4 7,625,612 6,134,367 5,160,757 4,455,692 3,897,966 3,477,885 3,147,237
Levenshtein.distance 0.20.5 6,448,035 5,335,239 4,556,680 4,034,226 3,550,695 3,211,664 2,910,581
rapidfuzz.distance.Levenshtein.distance 2.11.1 6,446,954 5,268,454 4,560,213 4,027,110 3,540,574 3,215,206 2,912,386
jellyfish.levenshtein_distance 0.9.0 3,823,843 1,494,087 718,785 399,693 262,541 183,554 135,081
textdistance.levenshtein 4.5.0 444,305 413,950 411,441 401,539 395,777 389,462 385,233

Test setup:

  • XPS 15 9570
  • Intel i7-8750H
  • Fedora Linux 36 (Workstation Edition)

References

[1]: Gene Myers. 1999. A fast bit-vector algorithm for approximate string matching based on dynamic programming. J. ACM 46, 3 (May 1999), 395-415. https://doi.org/10.1145/316542.316550

About

Possibly the fastest way of computing the Levenshtein distance between two fairly short ASCII strings in Python…

Topics

Resources

License

Stars

Watchers

Forks

Packages

No packages published