Skip to content

kyleburton/fuzzy-string

Repository files navigation

Title: A Survey of Fuzzy String Matching Algorithms

Brief

Do you like how Google suggests a fix when you mis-spell (I mean typo) a word? Did you ever wonder how you could implement a spell checker?

Fuzzy string matching dates back to the 1800s to the US Census Bureau’s mandate of not overcounting the US population. My personal exploration into the record linkage aspect of data integration has touched on several fuzzy string matching algorithms. Come and hear about how several of these algorithms work and some of the techniques I have used in industry settings with large datasets.

This Project

The original slides are in Keynote, with PDF and PPT exports.

There is an interactive Rails application that demonstrates the major algorithms discussed in the talk. It is in the edist-app subdirectory.

bin/fuzzy.rb

This command line program supports executing each of the main algorithms discussed in the talk, including a search across /usr/share/dict/words using each of the algorithms (or any file you so choose).


Usage: bin/fuzzy.rb [options] command [args]
    -v, --[no-]verbose               Run verbosely
    -f, --file=                      File to search for the 'find' command.
    -q, --quiet                      Run quietly
    -t, --threshold=                 Threshold (percentage) for Edist and Brew Algorithms.
    -s, --score-table=               Set the Brew Score table to use (typo, abbrev, edist)
    -b, --brew-table=                Set the Brew Score table from an expression.  To be valid it must be a hash with the following form: {:initial=>0,:match=>0,:insert=>0.1,:delete=>5,:subst=>5,:transpose=>2}.  Behavior of this program is undefined in the case the table you provide is invalid.

About

Talk: Survey of Fuzzy String Matching Algorithms

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published