# Example of fuzzy matching 

updated 31/10/2020

These example routines use fuzzywuzzy, a Python library that implements several strategies for scoring string similarity. It is strongly recommended: other, more basic libraries exist, but with those you need to program the logic on top yourself to make the best choices.

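Note that this approach works for a relatively small number of records. Every record in one list is compared against every record in the other, so the work grows with the product of the two list sizes (quadratic rather than exponential growth). With several hundred thousand records, the run time may become impractical. There are some approaches for large-scale comparisons in a separate recipe.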

In [1]:
import pandas as pd
from fuzzywuzzy import fuzz
from fuzzywuzzy import process
#Note: apparently installing the python-Levenshtein module alongside fuzzywuzzy can increase performance


Inspired by  

https://www.datacamp.com/community/tutorials/fuzzy-string-python

The Levenshtein distance is a metric that measures how far apart two sequences of characters are. In other words, it is the minimum number of single-character edits needed to change one string into the other. These edits can be insertions, deletions or substitutions. The metric is named after Vladimir Levenshtein, who first considered it in 1965.
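
To make the definition concrete, here is a minimal pure-Python sketch of the edit distance. The function name `levenshtein` is hypothetical and not part of fuzzywuzzy. Note that fuzz.ratio() reports a normalised similarity score from 0 to 100 rather than this raw edit count.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions or
    substitutions needed to turn string a into string b."""
    # previous[j] = distance between the first i-1 characters of a
    # and the first j characters of b (starts with the empty prefix of a)
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]  # turning a[:i] into "" takes i deletions
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or keep a match)
            ))
        previous = current
    return previous[-1]


print(levenshtein("apple inc.", "apple inc"))  # 1 (drop the final dot)
print(levenshtein("kitten", "sitting"))        # 3
```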

In [16]:
Str1 = "Apple Inc."
Str2 = "apple Inc"
Ratio = fuzz.ratio(Str1.lower(),Str2.lower())
print(Ratio)

95


In [17]:
Str1 = "Los Angeles Lakers"
Str2 = "L.A. Lakers"
Ratio = fuzz.ratio(Str1.lower(),Str2.lower())
Partial_Ratio = fuzz.partial_ratio(Str1.lower(),Str2.lower())
Token_Sort_Ratio = fuzz.token_sort_ratio(Str1.lower(),Str2.lower())
print(Ratio)
print(Partial_Ratio)
print(Token_Sort_Ratio)

62
73
64


fuzz.partial_ratio() is better at detecting that both strings refer to the Lakers: it scores 73 here, against 62 for fuzz.ratio(). When the shorter string is contained in full in the longer one (for example "lakers" against "los angeles lakers"), it yields 100% similarity. It works by using an "optimal partial" logic: if the shorter string has length k and the longer string has length m, the algorithm seeks the score of the best-matching length-k substring.
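
The "best length-k substring" idea can be sketched with a brute-force sliding window. This is an illustration only: it uses the standard library's difflib.SequenceMatcher as a stand-in for fuzz.ratio() (an assumption), the function names are hypothetical, and fuzzywuzzy's own implementation picks candidate windows differently, so its scores may not match exactly.

```python
from difflib import SequenceMatcher


def simple_ratio(a, b):
    # Similarity from 0 to 100, standing in for fuzz.ratio() (assumption)
    return round(100 * SequenceMatcher(None, a, b).ratio())


def partial_ratio_sketch(a, b):
    # Slide a window of the shorter string's length k over the longer
    # string and keep the best score among all length-k substrings.
    shorter, longer = (a, b) if len(a) <= len(b) else (b, a)
    k = len(shorter)
    windows = (longer[i:i + k] for i in range(len(longer) - k + 1))
    return max(simple_ratio(shorter, window) for window in windows)


print(partial_ratio_sketch("lakers", "los angeles lakers"))  # 100
print(partial_ratio_sketch("l.a. lakers", "los angeles lakers"))
```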

Nevertheless, this approach is not foolproof. What happens when the strings contain the same words, but in a different order?

In [43]:
Str1 = "united states v. nixon"
Str2 = "Nixon v. United States"
Ratio = fuzz.ratio(Str1.lower(),Str2.lower())
Partial_Ratio = fuzz.partial_ratio(Str1.lower(),Str2.lower())
Token_Sort_Ratio = fuzz.token_sort_ratio(Str1,Str2)
print(Ratio)
print(Partial_Ratio)
print(Token_Sort_Ratio)

59
74
100


The fuzz.token functions have an important advantage over ratio and partial_ratio. They tokenize the strings and preprocess them by converting them to lower case and removing punctuation. In the case of fuzz.token_sort_ratio(), the string tokens are sorted alphabetically and then joined together. After that, a simple fuzz.ratio() is applied to obtain the similarity percentage. This allows pairs such as the two court case names in this example to be recognised as the same.
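
A rough sketch of the token-sort steps described above. The helper names are hypothetical, and difflib.SequenceMatcher stands in for fuzz.ratio() (an assumption), so scores for non-identical strings may differ slightly from fuzzywuzzy's.

```python
import re
from difflib import SequenceMatcher


def tokens(text):
    # Lower-case, replace every run of non-alphanumeric characters
    # with a space, then split on whitespace
    return re.sub(r"[\W_]+", " ", text.lower()).split()


def token_sort_ratio_sketch(a, b):
    # Sort each string's tokens alphabetically, join them back together,
    # then compare the two normalised strings
    sorted_a = " ".join(sorted(tokens(a)))
    sorted_b = " ".join(sorted(tokens(b)))
    return round(100 * SequenceMatcher(None, sorted_a, sorted_b).ratio())


# Both strings normalise to "nixon states united v"
print(token_sort_ratio_sketch("united states v. nixon", "Nixon v. United States"))  # 100
```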

Still, what happens if these two strings are of widely differing lengths? That's where fuzz.token_set_ratio() comes in.

In [46]:
Str1 = "The supreme court case of Nixon vs The United States"
Str2 = "Nixon v. United States"
Ratio = fuzz.ratio(Str1.lower(),Str2.lower())
Partial_Ratio = fuzz.partial_ratio(Str1.lower(),Str2.lower())
Token_Sort_Ratio = fuzz.token_sort_ratio(Str1,Str2)
Token_Set_Ratio = fuzz.token_set_ratio(Str1,Str2)
print(Ratio)
print(Partial_Ratio)
print(Token_Sort_Ratio)
print(Token_Set_Ratio)

57
77
58
95


fuzz.token_set_ratio() takes a more flexible approach than fuzz.token_sort_ratio(). Instead of just tokenizing the strings, sorting and then pasting the tokens back together, token_set_ratio performs a set operation that takes out the common tokens (the intersection) and then makes fuzz.ratio() pairwise comparisons between the following new strings:

s1 = Sorted_tokens_in_intersection  
s2 = Sorted_tokens_in_intersection + sorted_rest_of_str1_tokens  
s3 = Sorted_tokens_in_intersection + sorted_rest_of_str2_tokens  
The highest of the three pairwise scores is returned. The logic behind these comparisons is that, since Sorted_tokens_in_intersection is always the same, the score tends to go up as these words make up a larger share of the original strings or as the remaining tokens are closer to each other.
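
The set logic above can be sketched as follows. This is a simplified illustration under stated assumptions: the helper names are hypothetical, and difflib.SequenceMatcher stands in for fuzz.ratio(), so the scores are indicative rather than identical to fuzzywuzzy's.

```python
import re
from difflib import SequenceMatcher


def tokens(text):
    # Lower-case, strip punctuation, split into words
    return re.sub(r"[\W_]+", " ", text.lower()).split()


def ratio(a, b):
    # Stand-in for fuzz.ratio() (assumption)
    return round(100 * SequenceMatcher(None, a, b).ratio())


def token_set_ratio_sketch(a, b):
    set_a, set_b = set(tokens(a)), set(tokens(b))
    common = " ".join(sorted(set_a & set_b))  # tokens shared by both strings
    rest_a = " ".join(sorted(set_a - set_b))  # tokens only in the first string
    rest_b = " ".join(sorted(set_b - set_a))  # tokens only in the second string
    s1 = common
    s2 = (common + " " + rest_a).strip()
    s3 = (common + " " + rest_b).strip()
    # Compare the three strings pairwise and keep the best score
    return max(ratio(s1, s2), ratio(s1, s3), ratio(s2, s3))


# Every token of the first string also appears in the second, so s1 == s2
print(token_set_ratio_sketch("Nixon v. United States", "The United States v. Nixon"))  # 100
```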

## Example routine for checking one file against another

In [2]:
df_short = pd.read_excel("./test_data/data_short.xlsx")
df_long = pd.read_excel("./test_data/data_long.xlsx")


In [3]:
df_short.head()

Unnamed: 0.1,Unnamed: 0,address,city,country,dob,first_name,full_address,full_name,gender,last_name,member_id,middle_name,suffix,title
0,0,84724 Nicole Villages Suite 945,Leetown,New Zealand,2009-11-05,Deborah,"84724 Nicole Villages Suite 945, Leetown, New ...",Deborah Nunez,F,Nunez,5102279799502,,,Dr.
1,1,658 Aaron Vista Apt. 239,Mistyborough,Micronesia,1958-03-25,Nathan,"658 Aaron Vista Apt. 239, Mistyborough, Micron...",Nathan Baker,M,Baker,2719238400725,,,Mr.
2,2,66108 Vasquez Course,Jeffreymouth,Serbia,1961-07-01,Ronald,"66108 Vasquez Course, Jeffreymouth, Serbia",Ronald Smith,M,Smith,5344277168281,,,Mr.
3,3,117 Wood Turnpike Apt. 562,North Christopher,Montenegro,1951-12-14,Douglas,"117 Wood Turnpike Apt. 562, North Christopher,...",Douglas Stephanie Ramirez,M,Ramirez,9262734362101,Stephanie,,Mr.
4,4,7483 Nguyen Square,North Benjamin,Germany,1950-12-06,Alex,"7483 Nguyen Square, North Benjamin, Germany",Alex Curry,M,Curry,2504631179862,,,Mr.


In [4]:
df_long.head()

Unnamed: 0.1,Unnamed: 0,address,city,country,dob,first_name,full_address,full_name,gender,last_name,member_id,middle_name,suffix,title
0,0,98006 Daniel Causeway,Morenoborough,Palestinian Territory,1940-10-13,Cynthia,"98006 Daniel Causeway, Morenoborough, Palestin...",Cynthia Brandy Brown,F,Brown,4005263871981,Brandy,,Mrs.
1,1,3871 Stevens Lane Apt. 513,East Margaret,Seychelles,1982-02-03,John,"3871 Stevens Lane Apt. 513, East Margaret, Sey...",John Campbell,M,Campbell,5593277759078,,,Mr.
2,2,707 Nichole Run,New Jerrybury,France,2017-09-13,Tara,"707 Nichole Run, New Jerrybury, France",Tara Amanda Manning,F,Manning,9253704032896,Amanda,,Mrs.
3,3,29214 Christopher Lodge,Lake Andrew,Isle of Man,1955-11-13,Michaela,"29214 Christopher Lodge, Lake Andrew, Isle of Man",Michaela Brock,F,Brock,4666408668950,,,Dr.
4,4,31860 Earl Stravenue Suite 296,Vickiburgh,Belgium,1974-02-17,Peter,"31860 Earl Stravenue Suite 296, Vickiburgh, Be...",Peter Perez,M,Perez,4289840277902,,,Mr.


While the fuzzy matching used here already ignores case and special characters, it is probably convenient to do all this in advance: when you pick the unique values via set() and query for potential matches afterwards, you then have a cleaner and smaller set of data. It may also be worth removing other characters such as apostrophes and double spaces, trimming trailing spaces, etc.

Also note that using a generic "fuzzy" field makes the rest of the routine more reusable: only the line that fills this field needs to change when matching on a different column.
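
One possible clean-up helper covering the steps mentioned above (lower-casing, removing apostrophes, collapsing repeated spaces, trimming). The function name `clean_for_matching` is hypothetical, and which characters to strip is an assumption that depends on your data.

```python
import re


def clean_for_matching(value):
    # Leave non-string values (e.g. NaN from empty cells) untouched
    if not isinstance(value, str):
        return value
    value = value.lower().replace("'", "")  # lower-case, drop apostrophes
    value = re.sub(r"\s+", " ", value)      # collapse runs of whitespace
    return value.strip()                    # trim leading/trailing spaces


print(clean_for_matching("  O'Brien   Smith "))  # obrien smith
```

It could then be applied when filling the generic field, for example with `df_short['full_name'].map(clean_for_matching)` in place of `.str.lower()`.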

In [22]:
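df_short['fuzzy'] = df_short['full_name'].str.lower()
# 'fuzzy' is kept as a regular column because later cells filter on df_short['fuzzy']
# (the earlier set_index call discarded its result, so it had no effect)
list_to_check_against = list(set(df_short['fuzzy']))
print(len(list_to_check_against))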

500


In [23]:
df_long['fuzzy'] = df_long['full_name'].str.lower()
print(df_long.shape)
# Unique values only, so each distinct name is scored once
data_to_check = list(set(df_long['fuzzy']))
print(len(data_to_check))

(2000, 15)
1983


In [24]:
key_identifier="member_id"

In [25]:

matches = []
for item in data_to_check:
    # Skip non-string values (e.g. NaN from empty cells)
    if not isinstance(item, str):
        continue
    # extractOne returns a tuple: (best match, score from 0 to 100)
    match = process.extractOne(item, list_to_check_against, scorer=fuzz.token_sort_ratio)
    matches.append([item, match[0], match[1]])
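
Conceptually, the best-match lookup in the loop above scores the query against every candidate and keeps the highest score. The sketch below illustrates that idea with the standard library only. The function `extract_one_sketch`, its `score_cutoff` parameter and the difflib-based scorer are all hypothetical stand-ins, not fuzzywuzzy's implementation.

```python
from difflib import SequenceMatcher


def ratio(a, b):
    # Stand-in scorer returning a similarity from 0 to 100 (assumption)
    return round(100 * SequenceMatcher(None, a, b).ratio())


def extract_one_sketch(query, choices, scorer=ratio, score_cutoff=0):
    # Score the query against each candidate; return (choice, score) for
    # the best one, or None when no candidate reaches the cutoff
    best = None
    for choice in choices:
        score = scorer(query, choice)
        if score >= score_cutoff and (best is None or score > best[1]):
            best = (choice, score)
    return best


candidates = ["keith hardy", "robert nelson", "andrea smith"]
print(extract_one_sketch("seth hardy", candidates))
```

Each query costs one scorer call per candidate, which is why the total work is the product of the two list lengths.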
        

    
        

In [30]:
# Keep only the pairs scoring above 85
possible_matches = [x for x in matches if x[2] > 85]
possible_matches

[['andrew hernandez', 'andrew hernandez', 100],
 ['deborah hanson', 'deborah johnson', 90],
 ['jessica lee', 'levy, jessica', 87],
 ['david williams', 'david williams', 100],
 ['robert olson', 'robert nelson', 88],
 ['mary hernandez', 'mary hernandez', 100],
 ['james martin', 'james martin', 100],
 ['michael taylor', 'michael taylor', 100],
 ['keith hardy', 'seth hardy', 86],
 ['christopher brown', 'christopher brown', 100],
 ['blake, michael', 'michael blackwell', 87],
 ['robert anderson', 'robert nelson', 86],
 ['kelly williams', 'kelsey williams', 90],
 ['katherine garcia', 'garcia, katherine h', 94],
 ['george collins', 'george collins', 100],
 ['emily brown', 'burton, emily', 87],
 ['olivia smith', 'smith, olivia', 100],
 ['thomas anderson', 'thomas nelson', 86],
 ['katherine smith', 'katherine smith', 100],
 ['christopher ford', 'christopher ward', 88],
 ['andrew smith', 'andrea smith', 92],
 ['jamie owens', 'james owens', 91],
 ['daniel gonzales', 'daniel gonzales', 100],
 ['joe

In [31]:

output_list = []
for x in possible_matches:
    # x = [long-list name, short-list name, score]; look up the key(s) for each name
    key1 = df_short.loc[df_short['fuzzy'] == x[1], key_identifier].tolist()
    key2 = df_long.loc[df_long['fuzzy'] == x[0], key_identifier].tolist()
    output_list.append([x[1], x[0], x[2], key1, key2])
output = pd.DataFrame(output_list, columns=['short_list', 'long_list', 'perc_match', 'key_short', 'key_long'])

In [32]:
df_short.loc[df_short['fuzzy'] == 'smith, olivia', key_identifier]

432    4358287106424
Name: member_id, dtype: int64

In [33]:
output

Unnamed: 0,short_list,long_list,perc_match,key_short,key_long
0,andrew hernandez,andrew hernandez,100,[358339665507],[2340862914892]
1,deborah johnson,deborah hanson,90,[7076844059736],[2653643982561]
2,"levy, jessica",jessica lee,87,[6004324614398],[1523491204232]
3,david williams,david williams,100,[229720631152],[3922931562549]
4,robert nelson,robert olson,88,[6410876087878],[5064774225767]
5,mary hernandez,mary hernandez,100,[7534575761576],[6492089346744]
6,james martin,james martin,100,[7488960271747],[7957927902845]
7,michael taylor,michael taylor,100,[5847830015911],[2836812545211]
8,seth hardy,keith hardy,86,[4630418385398],[4902411601370]
9,christopher brown,christopher brown,100,[1970815105834],[502887664677]
