Prints out groups of strings that are similar based on their Levenshtein edit distances

maabdelatif/print-lev-groups

PRINT LEV GROUPS

This script prints groups of similar strings read from one or more input files or from stdin. Similarity is measured by the Levenshtein edit distance.
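For readers unfamiliar with the metric, the following is a minimal standard-library sketch of the Levenshtein edit distance. It is an illustration only: the function name `levenshtein` is hypothetical, and the script itself relies on the fuzzywuzzy and python-Levenshtein packages rather than on this code.

```python
def levenshtein(a: str, b: str) -> int:
    """Return the minimum number of single-character insertions,
    deletions, and substitutions needed to turn `a` into `b`."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible

    # previous[j] holds the distance between the processed prefix of `a`
    # and the first j characters of `b`.
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


print(levenshtein("kitten", "sitting"))  # prints 3
```

A distance of 0 means the strings are identical, and larger values mean more edits are needed.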

You can set the similarity factor to a number between 0 and 100, where 0 matches almost everything and 100 requires an exact match.
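The effect of the similarity factor can be sketched as follows. This is not the script's actual code: it assumes that two strings belong to the same group when they are linked, directly or through other strings, by a similarity score at or above the threshold. The score comes from the standard library's `difflib.SequenceMatcher` as a stand-in for a Levenshtein-based ratio, and the names `similarity` and `group_similar` are hypothetical.

```python
from difflib import SequenceMatcher
from itertools import combinations


def similarity(a: str, b: str) -> int:
    """Score two strings from 0 (nothing in common) to 100 (identical).

    Stand-in scorer; a Levenshtein-based ratio may give different numbers.
    """
    return round(100 * SequenceMatcher(None, a, b).ratio())


def group_similar(strings, ratio):
    """Group strings that are connected by pairwise scores >= ratio."""
    unique = list(dict.fromkeys(strings))  # drop duplicates, keep order
    neighbours = {s: set() for s in unique}
    for a, b in combinations(unique, 2):
        if similarity(a, b) >= ratio:
            neighbours[a].add(b)
            neighbours[b].add(a)

    groups, seen = [], set()
    for start in unique:
        if start in seen:
            continue
        # Depth-first walk over everything reachable from `start`.
        stack, component = [start], []
        seen.add(start)
        while stack:
            node = stack.pop()
            component.append(node)
            for nxt in neighbours[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        groups.append(sorted(component))
    return groups


names = ["john", "jon", "mary", "marie", "zed"]
for group in group_similar(names, 60):
    print(group)
```

With this sample list, a threshold of 60 pairs "john" with "jon" and "mary" with "marie" and leaves "zed" alone. A threshold of 0 merges everything into one group, and 100 keeps every distinct string separate.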

Requirements

This project uses Python 3.5. Python 2.7 may work but is not guaranteed.

The following Python packages are required:

  • fuzzywuzzy (0.15.0)
  • networkx (1.11)
  • python-Levenshtein (0.12.0)
  • matplotlib (2.0.2)

You can install them using pip:

pip3 install fuzzywuzzy networkx python-Levenshtein matplotlib

In addition, you need python-tk, which cannot be installed with pip. On Ubuntu, you can install it via:

sudo apt install python3-tk

Usage

# Get the latest snapshot
git clone --depth=1 https://github.com/maabdelatif/print-lev-groups.git myproject

# Change directory
cd myproject

# Run the script against the sample file small-file.txt with 60 as the similarity percentage
python3 print_lev_groups.py --files small-file.txt --ratio 60

Disclaimer

Please do not use this in production code.

Credits
