# Examples of fuzzy matching 

Updated 21/2/2021

These example routines use fuzzywuzzy, a Python library that implements several strategies for scoring string similarity. It is well worth using: other, more basic libraries exist, but you would then have to program the selection logic on top of them yourself.

Note that this approach only works for a relatively small number of records. With several hundred thousand records the number of pairwise comparisons grows roughly quadratically, and you may not be able to run these routines in a reasonable time. Some approaches for large-scale comparisons are covered in a separate recipe.

In [1]:
import pandas as pd
import numpy as np
from fuzzywuzzy import fuzz
from fuzzywuzzy import process
from datetime import datetime
#import Levenshtein
#Note: apparently installing the python-Levenshtein module alongside fuzzywuzzy can increase performance

#we need this for the alternative method of self matching, ie fuzzy duplicates 
from itertools import combinations 



df_short= pd.read_excel("./test_data/data_short.xlsx")
df_long=pd.read_excel("./test_data/data_long.xlsx")







Inspired by  

https://www.datacamp.com/community/tutorials/fuzzy-string-python

The Levenshtein distance is a metric that measures how far apart two sequences are. In other words, it is the minimum number of single-character edits needed to change one string into the other. These edits can be insertions, deletions or substitutions. The metric is named after Vladimir Levenshtein, who first considered it in 1965.
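To make the definition concrete, here is a minimal pure-Python sketch of the edit distance using the classic row-by-row dynamic programme. The function name `levenshtein` is just an illustrative choice; this is not how fuzzywuzzy computes its scores internally, it only shows what is being counted.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions or substitutions turning a into b."""
    # previous[j] is the distance between the prefix of a handled so far and the first j chars of b
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]  # turning a[:i] into an empty string takes i deletions
        for j, char_b in enumerate(b, start=1):
            substitution = previous[j - 1] + (char_a != char_b)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]


print(levenshtein("apple inc.", "apple inc"))  # 1 (drop the final full stop)
print(levenshtein("kitten", "sitting"))        # 3
```

The similarity percentages in the rest of this notebook are normalised scores between 0 and 100, not raw edit counts, so the numbers are not directly comparable.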

In [2]:
Str1 = "Apple Inc."
Str2 = "apple Inc"
Ratio = fuzz.ratio(Str1.lower(),Str2.lower())
print(Ratio)

95


In [3]:
Str1 = "Los Angeles Lakers"
Str2 = "L.A. Lakers"
Ratio = fuzz.ratio(Str1.lower(),Str2.lower())
Partial_Ratio = fuzz.partial_ratio(Str1.lower(),Str2.lower())
Token_Sort_Ratio = fuzz.token_sort_ratio(Str1.lower(),Str2.lower())
print(Ratio)
print(Partial_Ratio)
print(Token_Sort_Ratio)

62
73
64


fuzz.partial_ratio() is designed for cases where one string is largely contained in the other. It uses an "optimal partial" logic: if the shorter string has length k and the longer string has length m, the algorithm looks for the best-matching length-k substring of the longer string and returns that score. Here it picks up the shared "Lakers" part and raises the score from 62 to 73. Had the shorter string been just "Lakers", it would have yielded 100% similarity.
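The "best length-k substring" idea can be sketched with a brute-force sliding window. This is a simplified illustration, not the library's code: it uses `difflib.SequenceMatcher` from the standard library as a stand-in for `fuzz.ratio()`, and the names `simple_ratio` and `sliding_partial_ratio` are made up for this example. As far as I know fuzzywuzzy picks its candidate windows more cleverly, so scores may differ slightly.

```python
from difflib import SequenceMatcher


def simple_ratio(a: str, b: str) -> int:
    """Similarity from 0 to 100, using difflib as a stand-in for fuzz.ratio()."""
    return round(100 * SequenceMatcher(None, a, b).ratio())


def sliding_partial_ratio(a: str, b: str) -> int:
    """Best simple_ratio between the shorter string and any same-length window of the longer one."""
    shorter, longer = sorted((a, b), key=len)
    k = len(shorter)
    if k == 0:
        return 0
    # Every substring of the longer string that has exactly k characters
    windows = (longer[start:start + k] for start in range(len(longer) - k + 1))
    return max(simple_ratio(shorter, window) for window in windows)


print(sliding_partial_ratio("lakers", "los angeles lakers"))  # 100: "lakers" appears verbatim
print(sliding_partial_ratio("l.a. lakers", "los angeles lakers"))
```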

Nevertheless, this approach is not foolproof. What happens when the strings contain the same words, but in a different order?

In [4]:
Str1 = "united states v. nixon"
Str2 = "Nixon v. United States"
Ratio = fuzz.ratio(Str1.lower(),Str2.lower())
Partial_Ratio = fuzz.partial_ratio(Str1.lower(),Str2.lower())
Token_Sort_Ratio = fuzz.token_sort_ratio(Str1,Str2)
print(Ratio)
print(Partial_Ratio)
print(Token_Sort_Ratio)

59
74
100


The fuzz token functions (token_sort_ratio and token_set_ratio) have an important advantage over ratio and partial_ratio: they tokenize the strings and preprocess them by converting them to lower case and removing punctuation. In the case of fuzz.token_sort_ratio(), the tokens are sorted alphabetically and joined back together, and then a simple fuzz.ratio() is applied to obtain the similarity percentage. This lets strings such as the two court case names in this example be recognised as the same.
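The sort-then-compare steps can be sketched with the standard library alone. This is an approximation for illustration: `tokens` and `token_sort_sketch` are hypothetical names, the regular expression only keeps ASCII letters and digits (an assumption, the library's own preprocessing may treat other characters differently), and `difflib` stands in for `fuzz.ratio()`.

```python
import re
from difflib import SequenceMatcher


def tokens(text: str) -> list[str]:
    """Lower-case the text, treat anything that is not a letter or digit as a separator, and split."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).split()


def token_sort_sketch(a: str, b: str) -> int:
    """Compare the two strings after sorting their tokens alphabetically."""
    sorted_a = " ".join(sorted(tokens(a)))
    sorted_b = " ".join(sorted(tokens(b)))
    return round(100 * SequenceMatcher(None, sorted_a, sorted_b).ratio())


print(" ".join(sorted(tokens("Nixon v. United States"))))                      # nixon states united v
print(token_sort_sketch("united states v. nixon", "Nixon v. United States"))  # 100
```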

Still, what happens if these two strings are of widely differing lengths? That's where fuzz.token_set_ratio() comes in.

In [5]:
Str1 = "The supreme court case of Nixon vs The United States"
Str2 = "Nixon v. United States"
Ratio = fuzz.ratio(Str1.lower(),Str2.lower())
Partial_Ratio = fuzz.partial_ratio(Str1.lower(),Str2.lower())
Token_Sort_Ratio = fuzz.token_sort_ratio(Str1,Str2)
Token_Set_Ratio = fuzz.token_set_ratio(Str1,Str2)
print(Ratio)
print(Partial_Ratio)
print(Token_Sort_Ratio)
print(Token_Set_Ratio)

57
77
58
95


fuzz.token_set_ratio() takes a more flexible approach than fuzz.token_sort_ratio(). Instead of just tokenizing the strings, sorting and then pasting the tokens back together, token_set_ratio performs a set operation that takes out the common tokens (the intersection) and then makes fuzz.ratio() pairwise comparisons between the following new strings:

s1 = Sorted_tokens_in_intersection  
s2 = Sorted_tokens_in_intersection + sorted_rest_of_str1_tokens  
s3 = Sorted_tokens_in_intersection + sorted_rest_of_str2_tokens  
The logic behind these comparisons is that, since Sorted_tokens_in_intersection is always the same, the score tends to go up as these common words make up a larger share of the original strings or as the remaining tokens get closer to each other. The highest of the pairwise scores is the one returned.
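The three strings and their pairwise comparison can be sketched as follows. Again this is only an illustration under assumptions: `token_set_sketch` is a made-up name, the tokenizer keeps ASCII letters and digits only, and `difflib` stands in for `fuzz.ratio()`, so the numbers are not guaranteed to match the library exactly.

```python
import re
from difflib import SequenceMatcher


def ratio(a: str, b: str) -> int:
    """Similarity from 0 to 100, using difflib as a stand-in for fuzz.ratio()."""
    return round(100 * SequenceMatcher(None, a, b).ratio())


def token_set_sketch(a: str, b: str) -> int:
    tokens_a = set(re.sub(r"[^a-z0-9]+", " ", a.lower()).split())
    tokens_b = set(re.sub(r"[^a-z0-9]+", " ", b.lower()).split())
    common = " ".join(sorted(tokens_a & tokens_b))
    s1 = common
    s2 = (common + " " + " ".join(sorted(tokens_a - tokens_b))).strip()
    s3 = (common + " " + " ".join(sorted(tokens_b - tokens_a))).strip()
    # Keep the best of the three pairwise comparisons
    return max(ratio(s1, s2), ratio(s1, s3), ratio(s2, s3))


# Every token of the first string also appears in the second, so s2 equals s1
print(token_set_sketch("Nixon United States", "The United States v. Nixon"))  # 100
```

When one string's tokens are a subset of the other's, the extra words in the longer string no longer pull the score down, which is the behaviour we want for strings of very different lengths.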

In [6]:
def create_fuzzy_column(df,col_name):
    #Adds a normalised 'fuzzy' column to df and returns its unique non-null values as a list
    fuzzy = df[col_name].str.lower()
    fuzzy = fuzzy.str.replace("'", '', regex=False)    #drop apostrophes: o'neill -> oneill
    fuzzy = fuzzy.str.replace(r'\s', ' ', regex=True)  #tabs, newlines etc. become plain spaces
    fuzzy = fuzzy.replace("", np.nan)                  #empty strings are treated as missing
    #This will directly remove accented chars and enie, not replace with the vowel or n
    #fuzzy = fuzzy.str.encode('ascii', 'ignore').str.decode('ascii')
    #This will also remove accented chars
    #from string import printable
    #st = set(printable)
    #fuzzy = fuzzy.apply(lambda x: ''.join([" " if  i not in  st else i for i in x]))
    #This will retain the characters but standardise for compatibility, there are various libraries that do that too.
    fuzzy = fuzzy.str.normalize('NFKD').str.encode('ascii', errors='ignore').str.decode('utf-8')
    fuzzy = fuzzy.str.replace('[^a-z0-9 ]', ' ', regex=True)  #anything left that is not a letter, digit or space
    fuzzy = fuzzy.str.replace(' +', ' ', regex=True)          #collapse runs of spaces
    #assign the result back instead of using inplace on a column selection, which newer pandas versions warn about
    df['fuzzy'] = fuzzy.str.strip()
    return list(set(df['fuzzy'].dropna()))

test_data = [["  New\rLine","new line"],['Iñaqui  ', 'inaqui'], ["Tab\tEntry", "tab entry"], ['Lucía',"lucia"], ["","NaN"], ["Ryan     O'Neill","ryan oneill"],["Ana-María","ana maria"],["John Smith 2nd","john smith 2nd"],["Peter\uFF3FDrücker","peter drucker"],["Emma\u005FWatson","emma watson"]]
test_df = pd.DataFrame(test_data, columns = ['input', 'expected']) 
test_list= create_fuzzy_column(test_df,'input')
print(test_list)
test_df

['new line', 'ana maria', 'peter drucker', 'emma watson', 'tab entry', 'inaqui', 'lucia', 'ryan oneill', 'john smith 2nd']


Unnamed: 0,input,expected,fuzzy
0,New\rLine,new line,new line
1,Iñaqui,inaqui,inaqui
2,Tab\tEntry,tab entry,tab entry
3,Lucía,lucia,lucia
4,,,
5,Ryan O'Neill,ryan oneill,ryan oneill
6,Ana-María,ana maria,ana maria
7,John Smith 2nd,john smith 2nd,john smith 2nd
8,Peter＿Drücker,peter drucker,peter drucker
9,Emma_Watson,emma watson,emma watson
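The NFKD step in create_fuzzy_column is the part that turns "Lucía" into "lucia" rather than dropping the accented letter. It can be tried in isolation with the standard library's `unicodedata` module; `strip_accents` below is a hypothetical helper written for this illustration.

```python
import unicodedata


def strip_accents(text: str) -> str:
    # NFKD splits an accented character into its base letter plus a combining mark,
    # and maps compatibility characters (such as the fullwidth underscore) to plain equivalents
    decomposed = unicodedata.normalize("NFKD", text)
    # The combining marks are not ASCII, so encoding with errors="ignore" drops them
    return decomposed.encode("ascii", errors="ignore").decode("ascii")


for name in ["Lucía", "Iñaqui", "Peter\uFF3FDrücker"]:
    print(strip_accents(name))
# Lucia
# Inaqui
# Peter_Drucker
```

Characters that have no ASCII base letter after decomposition are simply removed by this approach, so it is worth checking the results on your own data.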


## Example of a routine of checking one file against another

While the token-based fuzzy scorers already ignore case and special characters, it is still convenient to do a bit of cleanup in advance. When you pick the unique values with set() and query for potential matches afterwards, you then work with a cleaner and smaller set of data. It is also worth removing other artifacts such as apostrophes and double spaces, trimming trailing spaces, etc.

Also note that using a generic fuzzy field makes the rest of the routine more reusable.

In [7]:
primary_list=create_fuzzy_column(df_long,"full_name")
print(df_long.shape)
print(len(primary_list))

(2000, 15)
1983


In [8]:
secondary_list= create_fuzzy_column(df_short,"full_name")
print(df_short.shape)
print(len(secondary_list))

(500, 15)
500


In [9]:
matches = []
for item in primary_list:
    if not isinstance(item, str): # Skip anything that is not a string.
        continue
    # extractOne returns a tuple of (best match, percent fit).
    best_match, score = process.extractOne(item, secondary_list, scorer=fuzz.token_sort_ratio)
    matches.append([item, best_match, score])

We then keep only the matches above a score threshold (here 90).

In [10]:
likely_matches=[]
likely_matches= [x for x in matches if x[2]>90]
len(likely_matches)

20

This piece recreates the actual records in each file together with an identifier key. We need it because a single fuzzy value can correspond to more than one record (if the file itself contains duplicates), so each match found may produce more than one hit.

In [11]:
output_list=[]
for x in likely_matches:
    key1=df_long.loc[df_long['fuzzy'] == x[0]]['member_id'].tolist()
    key2=df_short.loc[df_short['fuzzy'] == x[1]]['member_id'].tolist()
    output_list.append([x[0], x[1], x[2], key1, key2])
output= pd.DataFrame(output_list, columns=['primary','secondary','perc_match','key_primary','key_secondary'])
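The loop above filters the whole dataframe once per match. With more matches, one option is to build a lookup from fuzzy value to identifiers once and reuse it. The sketch below shows the idea with plain Python; the `records` list and its ids are invented stand-ins for the `fuzzy` and `member_id` columns, not data from the files used here.

```python
from collections import defaultdict

# Hypothetical (fuzzy value, member_id) pairs standing in for the rows of a dataframe
records = [
    ("kevin diaz", 101),
    ("diaz kevin", 102),
    ("kevin diaz", 103),  # a second record that normalises to the same value
]

ids_by_fuzzy = defaultdict(list)
for fuzzy_value, member_id in records:
    ids_by_fuzzy[fuzzy_value].append(member_id)

print(ids_by_fuzzy["kevin diaz"])  # [101, 103]
print(ids_by_fuzzy["diaz kevin"])  # [102]
```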

In [12]:
output[0:20]

Unnamed: 0,primary,secondary,perc_match,key_primary,key_secondary
0,james martin,james martin,100,[7957927902845],[7488960271747]
1,christopher brown,christopher brown,100,[502887664677],[1970815105834]
2,daniel gonzales,daniel gonzales,100,[7993593208263],[140312516230]
3,katherine garcia,garcia katherine h,94,[7876034075542],[9706101060165]
4,joe smith,smith jose,95,[3880509524088],[4617626102085]
5,andrew hernandez,andrew hernandez,100,[2340862914892],[358339665507]
6,christina taylor,christian taylor,94,[648337018130],[1399534976844]
7,katherine smith,katherine smith,100,[8986386973416],[7563384088130]
8,michael taylor,michael taylor,100,[2836812545211],[5847830015911]
9,mary hernandez,mary hernandez,100,[6492089346744],[7534575761576]


In [13]:
df_long[df_long['member_id']==output.iloc[6]['key_primary'][0]]

Unnamed: 0.1,Unnamed: 0,address,city,country,dob,first_name,full_address,full_name,gender,last_name,member_id,middle_name,suffix,title,fuzzy
1216,1216,209 Richard Glens,Smithborough,Peru,2015-04-04,Christina,"209 Richard Glens, Smithborough, Peru",Christina Taylor,F,Taylor,648337018130,,,Mrs.,christina taylor


In [14]:
df_short[df_short['member_id']==output.iloc[6]['key_secondary'][0]]

Unnamed: 0.1,Unnamed: 0,address,city,country,dob,first_name,full_address,full_name,gender,last_name,member_id,middle_name,suffix,title,fuzzy
205,205,335 Becker Falls Suite 886,Lake Willieton,Belarus,1942-02-13,Christian,"335 Becker Falls Suite 886, Lake Willieton, Be...",Christian Taylor,M,Taylor,1399534976844,,,Mr.,christian taylor


## Example of checking for fuzzy duplicates within one file

Make sure the values being compared are strings.  
You can also concatenate several fields here and run the fuzzy match across them,  
though it may be better to run separate similarity checks per field and then work on those scores.

The routines below are useful for finding fuzzy duplicates within a file. 
Because we are comparing the file against itself, the number of comparisons in the main loop grows quadratically with the size of the file, and very soon you will be looking at millions of iterations. Think carefully about how to reduce the dataset to something relevant and manageable.
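To get a feel for the growth: the number of distinct pairs among n values is n(n-1)/2. The small sketch below prints that count for the 1,983 unique names used in this notebook and for two larger, purely illustrative sizes; `pair_count` is a name chosen for this example.

```python
import math


def pair_count(n: int) -> int:
    """Number of distinct unordered pairs among n items, i.e. n * (n - 1) / 2."""
    return math.comb(n, 2)


for n in (1_983, 20_000, 200_000):
    print(f"{n:>7,} unique values -> {pair_count(n):>14,} pairs")
```

Multiplying the number of values by ten multiplies the number of pairs by roughly a hundred, which is why the running time gets out of hand so quickly.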

There are alternative methods for large datasets that tokenise parts of the text and scale better, possibly at the cost of some accuracy. These are detailed in another notebook.



In [15]:
list_to_check= create_fuzzy_column(df_long,"full_name")
print(df_long.shape)
print(len(list_to_check))

(2000, 15)
1983


In [16]:
matches = []
print(datetime.now())
for item in list_to_check:
    match = process.extract(item, list_to_check, scorer=fuzz.token_sort_ratio, limit=2 ) #Returns 2 best matches, one will be itself!
    matches.append([item,match[0],match[1]])
print(datetime.now())

2021-04-15 21:15:29.168651
2021-04-15 21:17:24.201498


In [17]:
# We need to remove the self hit
matches_clean=[]
for x in matches:
    if x[0] == x[1][0]:
        a= x[0]
        b= x[2][0]
        c= x[2][1]
    else:
        a= x[0]
        b= x[1][0]
        c= x[1][1]        
    matches_clean.append([a,b,c])
    
likely_matches=[]
likely_matches= [[x[0],x[1], x[2]] for x in matches_clean if x[2]>90]
likely_matches

[['tina simon', 'tina simmons', 91],
 ['billy taylor', 'taylor billy', 100],
 ['diaz kevin', 'kevin diaz', 100],
 ['kevin diaz', 'diaz kevin', 100],
 ['johnny miller', 'john miller', 92],
 ['jose gonzalez', 'joseph gonzalez', 93],
 ['matthew j moore', 'matthew moore', 93],
 ['joseph hall', 'joseph ball', 91],
 ['joseph ball', 'joseph hall', 91],
 ['bradley curtis', 'curtis brady', 92],
 ['danielle williams', 'daniel williams', 94],
 ['taylor billy', 'billy taylor', 100],
 ['michael sullivan', 'mitchell sullivan', 91],
 ['curtis brady', 'bradley curtis', 92],
 ['matthew moore', 'matthew j moore', 93],
 ['anthony lewis', 'lewis anthony', 100],
 ['michael dunn', 'michael duncan', 92],
 ['john miller', 'johnny miller', 92],
 ['roy adams', 'troy adams', 95],
 ['michael duncan', 'michael dunn', 92],
 ['joseph haynes', 'joseph hayes', 96],
 ['tina simmons', 'tina simon', 91],
 ['joseph hayes', 'joseph haynes', 96],
 ['troy adams', 'roy adams', 95],
 ['tammy torres', 'amy torres', 91],
 ['lewi
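The list above contains most pairs twice, once in each direction (for example 'kevin diaz' / 'diaz kevin'). A minimal sketch for collapsing them is shown below, using a few rows copied from that output; treating a pair as an order-independent `frozenset` is just one possible way of doing it.

```python
# A few rows copied from the output above
likely_matches = [
    ["diaz kevin", "kevin diaz", 100],
    ["kevin diaz", "diaz kevin", 100],
    ["roy adams", "troy adams", 95],
    ["troy adams", "roy adams", 95],
    ["jose gonzalez", "joseph gonzalez", 93],
]

seen = set()
unique_pairs = []
for first, second, score in likely_matches:
    key = frozenset((first, second))  # the same key whichever way round the pair is listed
    if key not in seen:
        seen.add(key)
        unique_pairs.append([first, second, score])

for pair in unique_pairs:
    print(pair)
# ['diaz kevin', 'kevin diaz', 100]
# ['roy adams', 'troy adams', 95]
# ['jose gonzalez', 'joseph gonzalez', 93]
```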

## Alternative method with a manual iteration  

This method takes a more manual approach: it builds every pair of items explicitly, so we can retrieve all the pairs that meet the similarity threshold rather than just the top 1 or 2 hits per item.
It relies on the "combinations" function from the itertools library, which was imported at the top of the notebook.
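As a quick illustration of what "combinations" produces, here it is on a tiny made-up list of names. Each unordered pair appears exactly once and no item is paired with itself, so there is no self hit to remove afterwards.

```python
from itertools import combinations

names = ["kevin diaz", "diaz kevin", "roy adams"]  # made-up sample values
pairs = list(combinations(names, 2))

for pair in pairs:
    print(pair)
# ('kevin diaz', 'diaz kevin')
# ('kevin diaz', 'roy adams')
# ('diaz kevin', 'roy adams')
```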



In [52]:
list_to_check= create_fuzzy_column(df_long,"full_name")
print(df_long.shape)
print(len(list_to_check))

item_combinations=list(combinations(list_to_check, 2))
print(len(item_combinations), "combinations to compute. Careful if this number is large, it may take a long time")

(2000, 14)
1983
1965153 combinations to compute. Careful if this number is large, it may take a long time


In [19]:
matches = []
for x in item_combinations:
    s = fuzz.token_sort_ratio(x[0],x[1])
    if s>60:
        matches.append([x[0], x[1], s]) 
        

In [20]:
df_matches=pd.DataFrame(matches, columns=['x1','x2','score'])
likely_matches = df_matches[df_matches.score > 90]
likely_matches.shape

(15, 3)

We still need to go back to the original dataset and find the matching records: the fuzzy match ran on the simplified, unique values, so there could be more than one record per hit.

In [55]:
output_list_fat=[]
output_list_thin=[]
n=1
for index,row in likely_matches.iterrows():
    key1=df_long.loc[df_long['fuzzy'] == row['x1']]['member_id'].tolist()
    key2=df_long.loc[df_long['fuzzy'] == row['x2']]['member_id'].tolist()
    key_conso=key1+key2
    for k in key_conso:
        v=df_long[df_long['member_id']==k]['full_name'].to_list()[0] #optional we get the original value
        #we could use indexes for extra performance but anyway this routine cannot be run in huge tables
        #as the combination would quickly get to tens of millions to check.
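        output_list_thin.append([k,n,row['score'],v])
        n=n+1 #note: incremented per record, so the two sides of a pair get different match_group values
    output_list_fat.append([row['x1'], row['x2'], row['score'], key1, key2, len(key_conso)])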
    
output= pd.DataFrame(output_list_fat, columns=['x1','x2','score','k1','k2','countkeys']).sort_values(by=['score'],ascending=False)
output_thin=pd.DataFrame(output_list_thin,columns=['id','match_group','score','value']).sort_values(by=['score'],ascending=False)


In [56]:
output.head()

Unnamed: 0,x1,x2,score,k1,k2,countkeys
1,billy taylor,taylor billy,100,[7794666749030],[5433696913272],2
2,diaz kevin,kevin diaz,100,[8914835563291],[1445917465778],2
10,anthony lewis,lewis anthony,100,[2118643743364],[9294051750262],2
13,joseph haynes,joseph hayes,96,[4272394152283],[4795669425207],2
12,roy adams,troy adams,95,[8904361580121],[9415236987604],2


In [57]:
output_thin.head()

Unnamed: 0,id,match_group,score,value
24,9294051750262,25,100,"Lewis, Anthony"
2,7794666749030,3,100,Billy Taylor
3,5433696913272,4,100,"Taylor, Billy"
4,8914835563291,5,100,"Diaz, Kevin"
5,1445917465778,6,100,Kevin Diaz


In [None]:
output