Skip to content

patrickdet/fuzzy_compare

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

8 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

FuzzyCompare

Getting started

In order to compare two strings with each other do the following:

iex> FuzzyCompare.similarity("Oscar-Claude Monet", "monet, claude")
0.95

Inner workings

Imagine you had to match some names.

Try to match the following list of painters:

  • "Oscar-Claude Monet"
  • "Edouard Manet"
  • "Monet, Claude"

For a human it is easy to see that some of the names have just been flipped and that others are different but similar sounding.

A first approrach could be to compare the strings with a string similarity function like the Jaro-Winkler function.

iex> String.jaro_distance("Oscar-Claude Monet", "Monet, Claude")
0.6032763532763533

iex> String.jaro_distance("Oscar-Claude Monet", "Edouard Manet")
0.6749287749287749

This is not an improvement over exact equality.

In order to improve the results this library uses two different approaches, FuzzyCompare.ChunkSet and FuzzyCompare.SortedChunks.

Sorted chunks

This approach yields good results when words within a string have been shuffled around. The strategy will sort all substrings by words and compare the sorted strings.

iex> FuzzyCompare.SortedChunks.substring_similarity("Oscar-Claude Monet", "Monet, Claude")
1.0

iex(4)> FuzzyCompare.SortedChunks.substring_similarity("Oscar-Claude Monet", "Edouard Manet")
0.6944444444444443

Chunkset

The chunkset approach is best in scenarios when the strings contain other substrings that are not relevant to what is being searched for.

iex> FuzzyCompare.ChunkSet.standard_similarity("Claude Monet", "Alice Hoschedé was the wife of Claude Monet")
1.0

Substring comparison

Should one of the strings be much longer than the other the library will attempt to compare matching substrings only.

Credits

This library is inspired by a seatgeek blogpost from 2011.