# Examples of fuzzy matching 

Updated 21/2/2021

These example routines use a Python library called fuzzywuzzy, which implements several strategies for scoring string similarity. It is well worth using: other, more basic libraries exist, but you would then have to program the selection logic on top of them yourself.

Note that this approach works for a relatively small number of records. The number of comparisons grows with the product of the sizes of the two lists (quadratically when matching a file against itself), so with several hundred thousand records you may not be able to run these routines in a reasonable time. Some approaches for large-scale comparisons are covered in a separate recipe.

In [4]:
import pandas as pd
import numpy as np
from fuzzywuzzy import fuzz
from fuzzywuzzy import process
from datetime import datetime
#import Levenshtein
#Note: apparently installing the python-Levenshtein module alongside fuzzywuzzy can increase performance

#we need this for the alternative method of self matching, ie fuzzy duplicates 
from itertools import combinations 


Inspired by  

https://www.datacamp.com/community/tutorials/fuzzy-string-python

The Levenshtein distance is a metric that measures how far apart two strings are. It counts the minimum number of single-character edits needed to change one string into the other. These edits can be insertions, deletions or substitutions. The metric is named after Vladimir Levenshtein, who originally considered it in 1965.
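To make the definition concrete, here is a minimal pure-Python sketch of the classic dynamic-programming calculation. The function name `levenshtein` is made up for this illustration and is not part of fuzzywuzzy; note also that fuzz.ratio reports a normalised 0-100 similarity rather than this raw edit count.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions or substitutions turning a into b."""
    # previous[j] = distance between the prefix of a processed so far and the first j characters of b
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]  # turning a[:i] into the empty string takes i deletions
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # delete ca
                               current[j - 1] + 1,       # insert cb
                               previous[j - 1] + cost))  # substitute, or keep if equal
        previous = current
    return previous[-1]

print(levenshtein("kitten", "sitting"))        # 3
print(levenshtein("apple inc.", "apple inc"))  # 1
```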

In [5]:
Str1 = "Apple Inc."
Str2 = "apple Inc"
Ratio = fuzz.ratio(Str1.lower(),Str2.lower())
print(Ratio)

95


In [6]:
Str1 = "Los Angeles Lakers"
Str2 = "L.A. Lakers"
Ratio = fuzz.ratio(Str1.lower(),Str2.lower())
Partial_Ratio = fuzz.partial_ratio(Str1.lower(),Str2.lower())
Token_Sort_Ratio = fuzz.token_sort_ratio(Str1.lower(),Str2.lower())
print(Ratio)
print(Partial_Ratio)
print(Token_Sort_Ratio)

62
73
64


fuzz.partial_ratio() scores higher than fuzz.ratio() here (73 against 62) because it can focus on the part both strings share, "lakers". It does not reach 100 because "l.a." has no exact counterpart in "los angeles"; with just "Lakers" as the shorter string it would. The way this works is by using an "optimal partial" logic. In other words, if the shorter string has length k and the longer string has length m, the algorithm seeks the score of the best matching length-k substring of the longer string.
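The "optimal partial" idea can be sketched as a sliding window. This is a simplified illustration, not fuzzywuzzy's actual implementation: it tries every length-k window by brute force, and it uses `difflib.SequenceMatcher` from the standard library as a stand-in for fuzz.ratio, so scores may differ slightly from the library's. The names `ratio` and `partial_ratio_sketch` are hypothetical.

```python
from difflib import SequenceMatcher

def ratio(a: str, b: str) -> int:
    """0-100 similarity from the standard library, standing in for fuzz.ratio."""
    return round(100 * SequenceMatcher(None, a, b).ratio())

def partial_ratio_sketch(s1: str, s2: str) -> int:
    """Best score of the shorter string against every same-length window of the longer one."""
    shorter, longer = sorted((s1, s2), key=len)
    k = len(shorter)
    return max(ratio(shorter, longer[i:i + k]) for i in range(len(longer) - k + 1))

print(partial_ratio_sketch("lakers", "los angeles lakers"))  # 100
```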

Nevertheless, this approach is not foolproof. What happens when the strings contain the same words, but in a different order?

In [7]:
Str1 = "united states v. nixon"
Str2 = "Nixon v. United States"
Ratio = fuzz.ratio(Str1.lower(),Str2.lower())
Partial_Ratio = fuzz.partial_ratio(Str1.lower(),Str2.lower())
Token_Sort_Ratio = fuzz.token_sort_ratio(Str1,Str2)
print(Ratio)
print(Partial_Ratio)
print(Token_Sort_Ratio)

59
74
100


The fuzz.token_* functions have an important advantage over ratio and partial_ratio. They tokenize the strings and preprocess them by turning them to lower case and getting rid of punctuation. In the case of fuzz.token_sort_ratio(), the string tokens get sorted alphabetically and then joined together. After that, a simple fuzz.ratio() is applied to obtain the similarity percentage. This allows pairs such as the two court case names in this example to be marked as being the same.
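The sorting step can be sketched in a few lines. This is only an approximation of the library's preprocessing, written with the standard library, and `sort_tokens` is a hypothetical helper name. It shows why the two court case names end up identical before any ratio is computed.

```python
import re

def sort_tokens(text: str) -> str:
    """Lower-case, replace non-alphanumerics with spaces, then sort the tokens alphabetically."""
    tokens = re.sub(r"[^a-z0-9]+", " ", text.lower()).split()
    return " ".join(sorted(tokens))

print(sort_tokens("united states v. nixon"))  # nixon states united v
print(sort_tokens("Nixon v. United States"))  # nixon states united v
```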

Still, what happens if the two strings are of widely differing lengths? That's where fuzz.token_set_ratio() comes in.

In [8]:
Str1 = "The supreme court case of Nixon vs The United States"
Str2 = "Nixon v. United States"
Ratio = fuzz.ratio(Str1.lower(),Str2.lower())
Partial_Ratio = fuzz.partial_ratio(Str1.lower(),Str2.lower())
Token_Sort_Ratio = fuzz.token_sort_ratio(Str1,Str2)
Token_Set_Ratio = fuzz.token_set_ratio(Str1,Str2)
print(Ratio)
print(Partial_Ratio)
print(Token_Sort_Ratio)
print(Token_Set_Ratio)

57
77
58
95


fuzz.token_set_ratio() takes a more flexible approach than fuzz.token_sort_ratio(). Instead of just tokenizing the strings, sorting and then pasting the tokens back together, token_set_ratio performs a set operation that takes out the common tokens (the intersection) and then makes fuzz.ratio() pairwise comparisons between the following new strings:

s1 = Sorted_tokens_in_intersection  
s2 = Sorted_tokens_in_intersection + sorted_rest_of_str1_tokens  
s3 = Sorted_tokens_in_intersection + sorted_rest_of_str2_tokens  
The logic behind these comparisons is that since Sorted_tokens_in_intersection is always the same, the score will tend to go up as these words make up a larger chunk of the original strings or the remaining tokens are closer to each other.
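The three strings can be built by hand for the example above. The sketch below is an approximation under stated assumptions: `tokens` and `token_set_sketch` are hypothetical helpers, the preprocessing is a simplified version of the library's, and `difflib.SequenceMatcher` stands in for fuzz.ratio. The final score is the best of the three pairwise comparisons.

```python
import re
from difflib import SequenceMatcher

def tokens(text: str) -> set:
    """Lower-case, strip punctuation and return the set of distinct tokens."""
    return set(re.sub(r"[^a-z0-9]+", " ", text.lower()).split())

def token_set_sketch(a: str, b: str):
    ta, tb = tokens(a), tokens(b)
    common = " ".join(sorted(ta & tb))                        # sorted intersection
    s1 = common
    s2 = (common + " " + " ".join(sorted(ta - tb))).strip()   # intersection + rest of a
    s3 = (common + " " + " ".join(sorted(tb - ta))).strip()   # intersection + rest of b

    def ratio(x, y):
        return round(100 * SequenceMatcher(None, x, y).ratio())

    return (s1, s2, s3), max(ratio(s1, s2), ratio(s1, s3), ratio(s2, s3))

strings, score = token_set_sketch(
    "The supreme court case of Nixon vs The United States",
    "Nixon v. United States",
)
print(strings[0])  # nixon states united
print(strings[2])  # nixon states united v
print(score)       # 95
```

Here s1 and s3 differ only by the trailing "v", which is why the score stays high even though the original strings differ a lot in length.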

In [41]:
import pandas as pd
import numpy as np

def create_fuzzy_column(df, col_name):
    """Add a cleaned 'fuzzy' column to df and return its unique non-null values."""
    s = df[col_name].str.lower()
    s = s.str.replace("'", "", regex=False)    # drop apostrophes: o'neill -> oneill
    s = s.str.replace(r"\s", " ", regex=True)  # tabs, newlines etc. -> plain space
    # Alternative 1: drop accented chars and ñ outright instead of mapping them to the vowel or n
    # s = s.str.encode('ascii', 'ignore').str.decode('ascii')
    # Alternative 2: replace every non-printable char with a space (also loses accented chars)
    # from string import printable
    # s = s.apply(lambda x: ''.join(i if i in set(printable) else ' ' for i in x))

    # NFKD splits accented letters into base letter + combining mark;
    # the ASCII round trip then drops the marks, so í -> i and ñ -> n.
    s = s.str.normalize('NFKD').str.encode('ascii', errors='ignore').str.decode('ascii')
    s = s.str.replace('[^a-z0-9 ]', ' ', regex=True)      # remaining punctuation -> space
    s = s.str.replace(' +', ' ', regex=True).str.strip()  # collapse repeated spaces, trim ends
    # Assign back rather than using inplace=True on a column, which newer pandas may not propagate.
    df['fuzzy'] = s.replace('', np.nan)                   # empty strings become missing values
    return list(set(df['fuzzy'].dropna()))

test_data = [["  New\rLine","new line"],['Iñaqui  ', 'inaqui'], ["Tab\tEntry", "tab entry"], ['Lucía',"lucia"], ["",""], ["Ryan     O'Neill","ryan oneill"],["Ana-María","ana maria"],["John Smith 2nd","john smith 2nd"],["Peter\uFF3FDrücker","peter drucker"],["Emma\u005FWatson","emma watson"]]
test_df = pd.DataFrame(test_data, columns = ['input', 'expected']) 
test_list= create_fuzzy_column(test_df,'input')
print(test_list)
test_df

['lucia', 'tab entry', 'emma watson', 'ana maria', 'new line', 'inaqui', 'ryan oneill', 'peter drucker', 'john smith 2nd']


Unnamed: 0,input,expected,fuzzy
0,New\rLine,new line,new line
1,Iñaqui,inaqui,inaqui
2,Tab\tEntry,tab entry,tab entry
3,Lucía,lucia,lucia
4,,,
5,Ryan O'Neill,ryan oneill,ryan oneill
6,Ana-María,ana maria,ana maria
7,John Smith 2nd,john smith 2nd,john smith 2nd
8,Peter＿Drücker,peter drucker,peter drucker
9,Emma_Watson,emma watson,emma watson


## Example of a routine of checking one file against another

In [9]:
df_short= pd.read_excel("./test_data/data_short.xlsx")
df_long=pd.read_excel("./test_data/data_long.xlsx")


In [10]:
df_short.head()

Unnamed: 0.1,Unnamed: 0,address,city,country,dob,first_name,full_address,full_name,gender,last_name,member_id,middle_name,suffix,title
0,0,84724 Nicole Villages Suite 945,Leetown,New Zealand,2009-11-05,Deborah,"84724 Nicole Villages Suite 945, Leetown, New ...",Deborah Nunez,F,Nunez,5102279799502,,,Dr.
1,1,658 Aaron Vista Apt. 239,Mistyborough,Micronesia,1958-03-25,Nathan,"658 Aaron Vista Apt. 239, Mistyborough, Micron...",Nathan Baker,M,Baker,2719238400725,,,Mr.
2,2,66108 Vasquez Course,Jeffreymouth,Serbia,1961-07-01,Ronald,"66108 Vasquez Course, Jeffreymouth, Serbia",Ronald Smith,M,Smith,5344277168281,,,Mr.
3,3,117 Wood Turnpike Apt. 562,North Christopher,Montenegro,1951-12-14,Douglas,"117 Wood Turnpike Apt. 562, North Christopher,...",Douglas Stephanie Ramirez,M,Ramirez,9262734362101,Stephanie,,Mr.
4,4,7483 Nguyen Square,North Benjamin,Germany,1950-12-06,Alex,"7483 Nguyen Square, North Benjamin, Germany",Alex Curry,M,Curry,2504631179862,,,Mr.


In [11]:
df_long.head()

Unnamed: 0.1,Unnamed: 0,address,city,country,dob,first_name,full_address,full_name,gender,last_name,member_id,middle_name,suffix,title
0,0,98006 Daniel Causeway,Morenoborough,Palestinian Territory,1940-10-13,Cynthia,"98006 Daniel Causeway, Morenoborough, Palestin...",Cynthia Brandy Brown,F,Brown,4005263871981,Brandy,,Mrs.
1,1,3871 Stevens Lane Apt. 513,East Margaret,Seychelles,1982-02-03,John,"3871 Stevens Lane Apt. 513, East Margaret, Sey...",John Campbell,M,Campbell,5593277759078,,,Mr.
2,2,707 Nichole Run,New Jerrybury,France,2017-09-13,Tara,"707 Nichole Run, New Jerrybury, France",Tara Amanda Manning,F,Manning,9253704032896,Amanda,,Mrs.
3,3,29214 Christopher Lodge,Lake Andrew,Isle of Man,1955-11-13,Michaela,"29214 Christopher Lodge, Lake Andrew, Isle of Man",Michaela Brock,F,Brock,4666408668950,,,Dr.
4,4,31860 Earl Stravenue Suite 296,Vickiburgh,Belgium,1974-02-17,Peter,"31860 Earl Stravenue Suite 296, Vickiburgh, Be...",Peter Perez,M,Perez,4289840277902,,,Mr.


While the fuzzy matching already ignores case and special characters, it is convenient to do some cleanup in advance. When you pick the unique values with the "set" command and later query for potential matches, you then work with a cleaner and smaller set of data. It is probably also worth removing other artifacts such as apostrophes and double spaces, trimming trailing spaces, and so on.

Also note that using a generic fuzzy field makes the rest of the routine more reusable.

In [14]:
primary_list=create_fuzzy_column(df_long,"full_name")
print(df_long.shape)
print(len(primary_list))

(2000, 15)
1983


In [15]:
secondary_list= create_fuzzy_column(df_short,"full_name")
print(df_short.shape)
print(len(secondary_list))

(500, 15)
500


In [16]:
matches = []
for item in primary_list:
    if not isinstance(item, str):  # skip anything that is not a string, e.g. NaN
        continue
    # extractOne returns a tuple of (best match, score)
    match = process.extractOne(item, secondary_list, scorer=fuzz.token_sort_ratio)
    matches.append([item, match[0], match[1]])

We then filter by a score threshold.

In [18]:
likely_matches=[]
likely_matches= [x for x in matches if x[2]>90]
len(likely_matches)

20

This step recovers the actual records in each file through an identifier key. Because we matched on unique cleaned values, a single match can correspond to more than one record if a file itself contains duplicates.

In [23]:
output_list=[]
for x in likely_matches:
    key1=df_long.loc[df_long['fuzzy'] == x[0]]['member_id'].tolist()
    key2=df_short.loc[df_short['fuzzy'] == x[1]]['member_id'].tolist()
    output_list.append([x[0], x[1], x[2], key1, key2])
output= pd.DataFrame(output_list, columns=['primary','secondary','perc_match','key_primary','key_secondary'])

In [24]:
output

Unnamed: 0,primary,secondary,perc_match,key_primary,key_secondary
0,mary hernandez,mary hernandez,100,[6492089346744],[7534575761576]
1,daniel gonzales,daniel gonzales,100,[7993593208263],[140312516230]
2,andrew smith,andrea smith,92,[1989665637228],[1896374821001]
3,joe smith,"smith, jose",95,[3880509524088],[4617626102085]
4,joseph smith,"smith, jose",91,"[2828385250821, 5400110077162]",[4617626102085]
5,jamie owens,james owens,91,[2377196554883],[410879923414]
6,james martin,james martin,100,[7957927902845],[7488960271747]
7,george collins,george collins,100,[5179019038436],[1306491549530]
8,christopher brown,christopher brown,100,[502887664677],[1970815105834]
9,katherine smith,katherine smith,100,[8986386973416],[7563384088130]


In [25]:
df_long[df_long['member_id'] == output.loc[6, 'key_primary'][0]]

Unnamed: 0.1,Unnamed: 0,address,city,country,dob,first_name,full_address,full_name,gender,last_name,member_id,middle_name,suffix,title,fuzzy
1124,1124,2360 Michelle Keys Apt. 147,Port Belindaburgh,Guam,1942-11-23,James,"2360 Michelle Keys Apt. 147, Port Belindaburgh...",James Martin,M,Martin,7957927902845,,,Dr.,james martin


In [22]:
df_short[df_short['member_id'] == output.loc[6, 'key_secondary'][0]]

Unnamed: 0.1,Unnamed: 0,address,city,country,dob,first_name,full_address,full_name,gender,last_name,member_id,middle_name,suffix,title,fuzzy
477,477,046 Melvin Parkway,South Christina,Antarctica (the territory South of 60 deg S),2017-05-12,James,"046 Melvin Parkway, South Christina, Antarctic...",James Martin,M,Martin,7488960271747,,,Dr.,james martin


## Example of checking for fuzzy duplicates within one file

Make sure every value being compared is a string.  
You can also concatenate several fields here to fuzzy match across them,  
though it may be better to run separate similarity checks per field and then work on those scores.

The routines below are useful for finding fuzzy duplicates within a file.
Because the file is compared against itself, the number of comparisons grows quadratically with its size, and very soon you will be looking at millions of iterations. Think carefully about how to reduce the dataset to something relevant and manageable.

Alternative methods for large datasets tokenise parts of the text and scale better, possibly at the cost of some accuracy. These are detailed in another notebook.
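As a rough sizing aid, the number of distinct pairs among n unique values is n(n-1)/2. The small sketch below (with a hypothetical helper name, `pair_count`) computes it for the 1983 unique names found earlier in this notebook and for a 100K list. A loop that scores every item against the full list does about twice that many comparisons.

```python
from math import comb

def pair_count(n: int) -> int:
    """Number of distinct unordered pairs among n items: n * (n - 1) / 2."""
    return comb(n, 2)

for n in (1_983, 100_000):
    print(f"{n:>7} unique values -> {pair_count(n):,} pairs")
# 1,983 values give 1,965,153 pairs; 100,000 values give 4,999,950,000 pairs
```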



In [1]:
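def create_fuzzy_column(df,col_name):
    # Simpler variant of the cleanup: lower-case, apostrophes -> space, collapse spaces, trim
    df['fuzzy'] = df[col_name].str.lower()
    df['fuzzy'] = df['fuzzy'].str.replace("'", ' ', regex=False)
    # regex=True is needed in pandas 2.0+, where str.replace matches literally by default
    df['fuzzy'] = df['fuzzy'].str.replace(' +', ' ', regex=True)
    df['fuzzy'] = df['fuzzy'].str.strip()
    # dropna() keeps missing values out of the list, since the matcher expects strings
    return list(set(df['fuzzy'].dropna()))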

In [2]:
import pandas as pd  # imported here as well so the cell also runs on a fresh kernel
df = pd.read_excel("./test_data/data_long.xlsx")
print(df.shape)
list_to_check = create_fuzzy_column(df, "full_name")
print(df.shape)
print(len(list_to_check))

In [12]:
matches = []
print(datetime.now())
for item in list_to_check:
    match = process.extract(item, list_to_check, scorer=fuzz.token_sort_ratio, limit=2 ) #Returns 2 best matches, one will be itself!
    matches.append([item,match[0],match[1]])
print(datetime.now())

2021-02-21 20:15:20.278468
2021-02-21 20:18:30.023952


In [13]:
# We need to remove the self hit.
# Each entry holds the item plus its two best hits as (candidate, score) tuples.
matches_clean = []
for item, first, second in matches:
    best = second if first[0] == item else first  # keep whichever hit is not the item itself
    matches_clean.append([item, best[0], best[1]])

likely_matches = [m for m in matches_clean if m[2] > 90]
likely_matches

[['joseph haynes', 'joseph hayes', 96],
 ['lewis, anthony', 'anthony lewis', 100],
 ['michael sullivan', 'mitchell sullivan', 91],
 ['billy taylor', 'taylor, billy', 100],
 ['tina simmons', 'tina simon', 91],
 ['amy torres', 'tammy torres', 91],
 ['tina simon', 'tina simmons', 91],
 ['joseph hayes', 'joseph haynes', 96],
 ['bradley curtis', 'curtis brady', 92],
 ['roy adams', 'troy adams', 95],
 ['daniel williams', 'danielle williams', 94],
 ['tammy torres', 'amy torres', 91],
 ['johnny miller', 'john miller', 92],
 ['kevin diaz', 'diaz, kevin', 100],
 ['john miller', 'johnny miller', 92],
 ['joseph gonzalez', 'jose gonzalez', 93],
 ['michael dunn', 'michael duncan', 92],
 ['diaz, kevin', 'kevin diaz', 100],
 ['joseph hall', 'joseph ball', 91],
 ['joseph ball', 'joseph hall', 91],
 ['curtis brady', 'bradley curtis', 92],
 ['michael duncan', 'michael dunn', 92],
 ['jose gonzalez', 'joseph gonzalez', 93],
 ['anthony lewis', 'lewis, anthony', 100],
 ['matthew j moore', 'matthew moore', 93],
 ...]

## Alternative method with a manual iteration  

This method takes a more manual approach and scores every combination of two items. That way we can retrieve every pair that meets the similarity threshold, not just the top one or two matches per item.

In [179]:
from itertools import combinations 
item_combinations=list(combinations(list_to_check, 2))

In [180]:
matches = []
for x in item_combinations:
    s = fuzz.token_sort_ratio(x[0],x[1])
    if s>60:
        matches.append([x[0], x[1], s]) 
        

In [185]:
df_matches=pd.DataFrame(matches, columns=['x1','x2','score'])

likely_matches = df_matches[df_matches.score > 90]
likely_matches.shape

(15, 3)

We still need to go back to the original dataset and find the matching records: the fuzzy match ran on the simplified, unique values, so there can be more than one record per hit.
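One way to picture this lookup is as an index from each cleaned value to all the record keys that share it. The sketch below uses hypothetical toy records in plain Python. With the real data, the same index could be built from the 'fuzzy' and 'member_id' columns (for example via `zip(df['fuzzy'], df['member_id'])`), which is an assumption about how you might wire it in rather than part of the original routine.

```python
from collections import defaultdict

# Hypothetical toy records standing in for (fuzzy, member_id) pairs from the DataFrame
records = [("john smith", 101), ("jon smith", 102), ("john smith", 103)]

# Build the index once; each cleaned name maps to every member_id that shares it
ids_by_name = defaultdict(list)
for name, member_id in records:
    ids_by_name[name].append(member_id)

print(ids_by_name["john smith"])  # [101, 103]
print(ids_by_name["jon smith"])   # [102]
```

Building the index once avoids rescanning the whole table for every matched pair, which matters little for 15 hits but adds up on longer match lists.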

In [189]:
output_list=[]
for index,row in likely_matches.iterrows():
    key1=df.loc[df['fuzzy'] == row['x1']]['member_id'].tolist()
    key2=df.loc[df['fuzzy'] == row['x2']]['member_id'].tolist()
    output_list.append([row['x1'],row['x2'], row['score'], key1, key2])
output= pd.DataFrame(output_list, columns=['x1','x2','score','k1','k2'])
output.shape

(15, 5)

In [190]:
output

Unnamed: 0,x1,x2,score,k1,k2
0,amy torres,tammy torres,91,[2700300227159],[4066256017823]
1,joseph ball,joseph hall,91,[4437763835086],[9554307555868]
2,billy taylor,"taylor, billy",100,[7794666749030],[5433696913272]
3,joseph gonzalez,jose gonzalez,93,[9305307561485],[4774921940030]
4,mitchell sullivan,michael sullivan,91,[1296517170926],"[825827228877, 9520235523057]"
5,daniel williams,danielle williams,94,[6978712771492],"[8127440448529, 952131193062]"
6,tina simon,tina simmons,91,[9856537464638],[2883744213749]
7,matthew j moore,matthew moore,93,[3450342297878],[3873363814785]
8,michael dunn,michael duncan,92,[8955508677235],[6750866325253]
9,joseph haynes,joseph hayes,96,[4272394152283],[4795669425207]
