Fuzzymatch

This is some code to efficiently merge two (very large) data sets without common identifiers by comparing string variables. Match quality is evaluated based on parameters of a training dataset. The training data in the repository is about music artist names collected from weekly top chart data and musicbrainz.org. Example data sets are about book authors, collected from various public sources such as goodreads.com.

The procedure uses a minimum amount of indexing to reduce the consideration set, based on sub-strings (i.e. first letter). If that's not what you need, you can think of other substrings, e.g. n-grams, or functions of (sub-)strings, e.g. phonetic algorithms such as Soundex. The script uses multiple metrics for pair comparison (i.e. distance between strings, phonetics), building on fuzzywuzzy (https://github.com/seatgeek/fuzzywuzzy). Pair evaluation is aided by parameters obtained in a logistic regression on a manually coded training data set (”supervised learning”). The procedure uses parallel computing to enable fast and efficient matching of large data sets. The output file is a list of the best matches between two observations along with a probability score, which can later be used by the researcher to make a decision on the best match.

Feel free to contact me if you have any questions or comments!

Name		Name	Last commit message	Last commit date
Latest commit History 3 Commits
README.md		README.md
fuzzymatch.py		fuzzymatch.py
input_x.csv		input_x.csv
input_y.csv		input_y.csv
output.csv		output.csv
training_data.csv		training_data.csv

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

README.md

README.md

fuzzymatch.py

fuzzymatch.py

input_x.csv

input_x.csv

input_y.csv

input_y.csv

output.csv

output.csv

training_data.csv

training_data.csv

Repository files navigation

Fuzzymatch

About

Releases

Packages

Languages

patricklucescu/Fuzzymatch

Folders and files

Latest commit

History

Repository files navigation

Fuzzymatch

About

Resources

Stars

Watchers

Forks

Languages