# Examples of fuzzy matching 

Updated 21/2/2021

These example routines use fuzzywuzzy, a Python library that implements several strategies for scoring string similarity. It is well worth using: more basic libraries exist, but you would then need to program the logic on top of them to pick the best matches.

Note that this approach only works for a relatively low number of records. With several hundred thousand records, the number of pairwise comparisons grows roughly with the square of the record count, and you may not be able to run these routines in a reasonable time. Some approaches for large-scale comparisons are covered in a separate recipe.

In [2]:
import pandas as pd
import numpy as np
from fuzzywuzzy import fuzz
from fuzzywuzzy import process
from datetime import datetime
import Levenshtein
from Levenshtein import ratio
#Note: apparently installing the python-Levenshtein module alongside fuzzywuzzy can increase performance

#we need this for the alternative method of self matching, ie fuzzy duplicates 
from itertools import combinations 



df_short= pd.read_excel("./test_data/data_short.xlsx")
df_long=pd.read_excel("./test_data/data_long.xlsx")





Inspired by  

https://www.datacamp.com/community/tutorials/fuzzy-string-python

The Levenshtein distance is a metric that measures how far apart two sequences are. In other words, it is the minimum number of single-character edits needed to change one string into the other. These edits can be insertions, deletions or substitutions. The metric is named after Vladimir Levenshtein, who first considered it in 1965.
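
To make the definition concrete, here is a minimal pure-Python sketch of the edit-distance computation. The function name `levenshtein_distance` is only illustrative and is not part of fuzzywuzzy; in practice the library (or python-Levenshtein) does this work, and fuzz.ratio() reports a similarity score from 0 to 100 rather than the raw edit count.

```python
def levenshtein_distance(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions or
    substitutions needed to turn a into b (illustrative helper)."""
    # previous[j] = distance between the first i-1 characters of a
    # and the first j characters of b
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]  # turning a[:i] into an empty string takes i deletions
        for j, char_b in enumerate(b, start=1):
            substitution_cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,                      # delete char_a
                current[j - 1] + 1,                   # insert char_b
                previous[j - 1] + substitution_cost,  # substitute or keep
            ))
        previous = current
    return previous[-1]


print(levenshtein_distance("kitten", "sitting"))
print(levenshtein_distance("apple inc.", "apple inc"))
```

The second call differs only by the trailing full stop, so a single deletion is enough.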

In [3]:
Str1 = "Apple Inc."
Str2 = "apple Inc"
Ratio = fuzz.ratio(Str1.lower(),Str2.lower())
print(Ratio)

95


In [4]:
Str1 = "Los Angeles Lakers"
Str2 = "L.A. Lakers"
Ratio = fuzz.ratio(Str1.lower(),Str2.lower())
Partial_Ratio = fuzz.partial_ratio(Str1.lower(),Str2.lower())
Token_Sort_Ratio = fuzz.token_sort_ratio(Str1.lower(),Str2.lower())
print(Ratio)
print(Partial_Ratio)
print(Token_Sort_Ratio)

62
73
64


fuzz.partial_ratio() is better at detecting that both strings refer to the Lakers, so it gives a higher score than fuzz.ratio() (73 against 62). Had the shorter string been just "Lakers", which appears verbatim inside the longer one, the score would have been 100. It works by using an "optimal partial" logic: if the shorter string has length k and the longer string has length m, the algorithm looks for the score of the best-matching length-k substring.
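
The "best matching length-k substring" idea can be sketched with the standard library alone. This is a simplified illustration, not fuzzywuzzy's implementation: `partial_ratio_sketch` is a hypothetical name, it scores every window with difflib's SequenceMatcher, and the library may choose its windows differently, so the numbers will not always agree with fuzz.partial_ratio().

```python
from difflib import SequenceMatcher


def partial_ratio_sketch(s1: str, s2: str) -> int:
    """Score the shorter string against every same-length window of the
    longer one and keep the best score (0-100). Illustrative only."""
    shorter, longer = sorted((s1, s2), key=len)
    k = len(shorter)
    if k == 0:
        return 0
    best = 0.0
    for start in range(len(longer) - k + 1):
        window = longer[start:start + k]  # length-k substring of the longer string
        best = max(best, SequenceMatcher(None, shorter, window).ratio())
    return round(100 * best)


print(partial_ratio_sketch("los angeles lakers", "lakers"))
```

Because "lakers" occurs verbatim inside the longer string, one of the windows is an exact match.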

Nevertheless, this approach is not foolproof. What happens when the strings contain the same words, but in a different order?

In [5]:
Str1 = "united states v. nixon"
Str2 = "Nixon v. United States"
Ratio = fuzz.ratio(Str1.lower(),Str2.lower())
Partial_Ratio = fuzz.partial_ratio(Str1.lower(),Str2.lower())
Token_Sort_Ratio = fuzz.token_sort_ratio(Str1,Str2)
print(Ratio)
print(Partial_Ratio)
print(Token_Sort_Ratio)

59
74
100


The fuzz.token_* functions have an important advantage over ratio and partial_ratio: they tokenize the strings and preprocess them by converting them to lower case and stripping punctuation. In the case of fuzz.token_sort_ratio(), the tokens are sorted alphabetically and then joined back together. After that, a simple fuzz.ratio() is applied to obtain the similarity percentage. This is why the two court case names in this example are scored as identical.
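
The tokenize, sort and compare steps can be sketched as follows. This is a rough stand-in under stated assumptions: `sort_tokens` and `token_sort_ratio_sketch` are hypothetical names, the preprocessing is a simple regex, and difflib's SequenceMatcher replaces fuzz.ratio(), so scores may differ slightly from the library's.

```python
import re
from difflib import SequenceMatcher


def sort_tokens(text: str) -> str:
    """Lower-case, replace punctuation with spaces, then sort the tokens."""
    tokens = re.sub(r"[^a-z0-9]+", " ", text.lower()).split()
    return " ".join(sorted(tokens))


def token_sort_ratio_sketch(s1: str, s2: str) -> int:
    """Plain similarity ratio (0-100) applied to the sorted-token strings."""
    return round(100 * SequenceMatcher(None, sort_tokens(s1), sort_tokens(s2)).ratio())


print(sort_tokens("Nixon v. United States"))
print(token_sort_ratio_sketch("united states v. nixon", "Nixon v. United States"))
```

Both inputs reduce to the same sorted string, which is why word order stops mattering.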

Still, what happens if the two strings are of widely differing lengths? That's where fuzz.token_set_ratio() comes in.

In [6]:
Str1 = "The supreme court case of Nixon vs The United States"
Str2 = "Nixon v. United States"
Ratio = fuzz.ratio(Str1.lower(),Str2.lower())
Partial_Ratio = fuzz.partial_ratio(Str1.lower(),Str2.lower())
Token_Sort_Ratio = fuzz.token_sort_ratio(Str1,Str2)
Token_Set_Ratio = fuzz.token_set_ratio(Str1,Str2)
print(Ratio)
print(Partial_Ratio)
print(Token_Sort_Ratio)
print(Token_Set_Ratio)

57
77
58
95


fuzz.token_set_ratio() takes a more flexible approach than fuzz.token_sort_ratio(). Instead of just tokenizing the strings, sorting and then pasting the tokens back together, token_set_ratio performs a set operation that takes out the common tokens (the intersection) and then makes fuzz.ratio() pairwise comparisons between the following new strings:

s1 = Sorted_tokens_in_intersection  
s2 = Sorted_tokens_in_intersection + sorted_rest_of_str1_tokens  
s3 = Sorted_tokens_in_intersection + sorted_rest_of_str2_tokens  
The logic behind these comparisons is that since Sorted_tokens_in_intersection is always the same, the score will tend to go up as these words make up a larger chunk of the original strings or the remaining tokens are closer to each other.
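
The three strings s1, s2 and s3 described above can be built and compared in a few lines. This is a sketch of the idea only: `token_set_ratio_sketch`, `tokenize` and `simple_ratio` are hypothetical names, and difflib's SequenceMatcher stands in for fuzz.ratio(), so results are not guaranteed to match the library on every input.

```python
import re
from difflib import SequenceMatcher


def tokenize(text: str) -> set:
    """Lower-case, strip punctuation and return the set of tokens."""
    return set(re.sub(r"[^a-z0-9]+", " ", text.lower()).split())


def simple_ratio(a: str, b: str) -> int:
    """Stand-in for fuzz.ratio(): similarity from 0 to 100."""
    return round(100 * SequenceMatcher(None, a, b).ratio())


def token_set_ratio_sketch(str1: str, str2: str) -> int:
    t1, t2 = tokenize(str1), tokenize(str2)
    common = " ".join(sorted(t1 & t2))                           # s1: intersection only
    s1 = common
    s2 = (common + " " + " ".join(sorted(t1 - t2))).strip()      # intersection + rest of str1
    s3 = (common + " " + " ".join(sorted(t2 - t1))).strip()      # intersection + rest of str2
    # Best of the three pairwise comparisons
    return max(simple_ratio(s1, s2), simple_ratio(s1, s3), simple_ratio(s2, s3))


print(token_set_ratio_sketch(
    "The supreme court case of Nixon vs The United States",
    "Nixon v. United States",
))
```

Here the shared tokens "nixon states united" make up almost all of the shorter string, so the s1 versus s3 comparison dominates the score.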

In [34]:
from functools import lru_cache
import string
import math

@lru_cache(maxsize=None)
def token_sort(s):
    s=str(s)
    s=s.translate(str.maketrans('', '', string.punctuation))
    sl= str.split(s)
    sl.sort()
    s= "".join(sl)
    return s

def create_fuzzy_column(df,col_name):
    """Add a normalised 'fuzzy' key column to df (in place) and return its non-null values as a list."""
    # Results are assigned back rather than using chained inplace=True calls,
    # which newer pandas versions warn about or ignore.
    df['fuzzy'] = df[col_name].str.lower()
    df['fuzzy'] = df['fuzzy'].replace("'", '', regex=True)        # drop apostrophes: o'neill -> oneill
    df['fuzzy'] = df['fuzzy'].replace(r'\s', ' ', regex=True)     # tabs, newlines etc. become spaces
    df['fuzzy'] = df['fuzzy'].replace(r'^$', np.nan, regex=True)  # empty strings become missing values
    #This will directly remove accented chars and ñ, not replace with the vowel or n
    #df['fuzzy'] = df['fuzzy'].str.encode('ascii', 'ignore').str.decode('ascii')

    #This will also remove accented chars
    #from string import printable
    #st = set(printable)
    #df["fuzzy"] = df["fuzzy"].apply(lambda x: ''.join([" " if  i not in  st else i for i in x]))
    #This will retain the characters but standardise for compatibility, there are various libraries that do that too.
    df['fuzzy'] = df['fuzzy'].str.normalize('NFKD').str.encode('ascii', errors='ignore').str.decode('utf-8')
    df['fuzzy'] = df['fuzzy'].replace('[^a-z0-9 ]', ' ', regex=True)  # 0-9 (not 1-9) so the digit 0 is kept
    # Word boundaries (\b) stop the patterns from eating parts of names, e.g. the "ms" in "williams smith"
    df['fuzzy'] = df['fuzzy'].replace(r'\b(?:ltd|plc|llp|limited)\b', ' ', regex=True)
    df['fuzzy'] = df['fuzzy'].replace(r'\b(?:mr|mrs|ms|miss)\b', ' ', regex=True)
    df['fuzzy'] = df['fuzzy'].replace(' +', ' ', regex=True)
    df['fuzzy'] = df['fuzzy'].str.strip()
    # Note: token_sort() applies str(), so a missing value ends up as the string 'nan'
    df['fuzzy'] = [token_sort(s) for s in df['fuzzy']]
    return df['fuzzy'].dropna().tolist()


In [35]:
test_data = [["  New\rLine","new line"],['Iñaqui  ', 'inaqui'], ["Tab\tEntry", "tab entry"], ['Lucía',"lucia"], ["","NaN"], ["Ryan     O'Neill","ryan oneill"],["Ana-María","ana maria"],["John Smith 2nd","john smith 2nd"],["Peter\uFF3FDrücker","peter drucker"],["Emma\u005FWatson","emma watson"],["Jeff Bezos","jeff bezos"],["  jeff   . Bezos  ","jeff bezos"]]
test_df = pd.DataFrame(test_data, columns = ['input', 'expected']) 
test_list= create_fuzzy_column(test_df,'input')
print(test_list)
test_df

['linenew', 'inaqui', 'entrytab', 'lucia', 'nan', 'oneillryan', 'anamaria', '2ndjohnsmith', 'druckerpeter', 'emmawatson', 'bezosjeff', 'bezosjeff']


Unnamed: 0,input,expected,fuzzy
0,New\rLine,new line,linenew
1,Iñaqui,inaqui,inaqui
2,Tab\tEntry,tab entry,entrytab
3,Lucía,lucia,lucia
4,,,
5,Ryan O'Neill,ryan oneill,oneillryan
6,Ana-María,ana maria,anamaria
7,John Smith 2nd,john smith 2nd,2ndjohnsmith
8,Peter＿Drücker,peter drucker,druckerpeter
9,Emma_Watson,emma watson,emmawatson
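
The accent handling used inside create_fuzzy_column can be tried in isolation with the standard library. This is a small sketch of the same NFKD-then-ASCII idea; the helper name `to_ascii` is hypothetical and not part of the notebook.

```python
import unicodedata


def to_ascii(text: str) -> str:
    """Decompose accented characters (NFKD), then drop anything that is
    not plain ASCII, so the base letter survives."""
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", errors="ignore").decode("ascii")


for name in ["Lucía", "Iñaqui", "Peter\uFF3FDrücker"]:
    print(name, "->", to_ascii(name))
```

The fullwidth underscore in the last name is mapped to a regular underscore by the compatibility decomposition, which the later punctuation cleanup then turns into a space.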


## Example of a routine of checking one file against another

While the fuzzy matching already ignores case and special characters, it is convenient to do some cleanup in advance. This way, when you pick the unique values with set() and then query them for potential matches, you work with a cleaner and smaller set of data. It is probably also worth removing other artifacts such as apostrophes and double spaces, trimming trailing spaces, and so on.

Also note that using a generic fuzzy field makes the rest of the routine more reusable.

In [36]:
primary_list=list(set(create_fuzzy_column(df_long,"full_name"))) 
#we do set to remove duplicates and reduce the comparisons, we will pick all the exact duplicates
#anyway when we reconstruct the hits
print(df_long.shape)
print(len(primary_list))

(2000, 15)
1980


In [14]:
secondary_list= list(set(create_fuzzy_column(df_short,"full_name")))
print(df_short.shape)
print(len(secondary_list))

(500, 15)
500


In [15]:
matches = []
for item in primary_list:
    if not isinstance(item, str):  # skip anything that is not a string, e.g. missing values
        continue
    # extractOne returns a tuple: (best match, score from 0 to 100)
    match = process.extractOne(item, secondary_list, scorer=fuzz.token_sort_ratio)
    matches.append([item, match[0], match[1]])

We filter by a threshold, keeping only matches that score above 90.

In [38]:
likely_matches=[]
likely_matches= [x for x in matches if x[2]>90]
len(likely_matches)

18

This piece recreates the actual records in each file, with an identifier key. The reason is that we could technically have more than one hit for each match found (if a file itself contains duplicates).

In [39]:
output_list=[]
for x in likely_matches:
    key1=df_long.loc[df_long['fuzzy'] == x[0]]['member_id'].tolist()
    key2=df_short.loc[df_short['fuzzy'] == x[1]]['member_id'].tolist()
    output_list.append([x[0], x[1], x[2], key1, key2])
output= pd.DataFrame(output_list, columns=['primary','secondary','perc_match','key_primary','key_secondary'])

In [40]:
output[0:20]

Unnamed: 0,primary,secondary,perc_match,key_primary,key_secondary
0,jamesmartin,jamesmartin,100,[7957927902845],[7488960271747]
1,oliviasmith,oliviasmith,100,[1067119077629],[4358287106424]
2,andrewhernandez,andrewhernandez,100,[2340862914892],[358339665507]
3,collinsgeorge,collinsgeorge,100,[5179019038436],[1306491549530]
4,hernandezmary,hernandezmary,100,[6492089346744],[7534575761576]
5,brownchristopher,brownchristopher,100,[502887664677],[1970815105834]
6,davidwilliams,davidwilliams,100,[3922931562549],[229720631152]
7,christinataylor,christiantaylor,93,[648337018130],[1399534976844]
8,alvarezjessica,alvarezjessica,100,[9790621226361],[9250994501583]
9,danielgonzales,danielgonzales,100,[7993593208263],[140312516230]


In [41]:
df_long[df_long['member_id']==output.iloc[6]['key_primary'][0]]

Unnamed: 0.1,Unnamed: 0,address,city,country,dob,first_name,full_address,full_name,gender,last_name,member_id,middle_name,suffix,title,fuzzy
1957,1957,3069 Kimberly Ways Suite 864,Port Michael,Iceland,1949-10-01,David,"3069 Kimberly Ways Suite 864, Port Michael, Ic...",David Williams,M,Williams,3922931562549,,,Mr.,davidwilliams


In [42]:
df_short[df_short['member_id']==output.iloc[6]['key_secondary'][0]]

Unnamed: 0.1,Unnamed: 0,address,city,country,dob,first_name,full_address,full_name,gender,last_name,member_id,middle_name,suffix,title,fuzzy
6,6,753 Leonard Ridge,West Andrea,Monaco,1976-11-11,David,"753 Leonard Ridge, West Andrea, Monaco",David Williams,M,Williams,229720631152,,,Mr.,davidwilliams


## Checking for fuzzy duplicates within one file

Because we are comparing the file against itself, the main loop grows quadratically with the size of the file: n records give n(n-1)/2 pairs, so very soon you will be looking at millions of iterations. Think carefully about how to reduce the dataset to something relevant and manageable.

There are alternative methods for large datasets that involve tokenising parts of the text; they scale better, possibly at the cost of accuracy. These are detailed in another notebook.
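
To get a feel for how quickly the number of comparisons grows, the count of unordered pairs can be computed directly before running anything. A minimal sketch, where `pair_count` is just an illustrative helper name:

```python
from math import comb


def pair_count(n: int) -> int:
    """Number of unordered pairs among n records, i.e. n * (n - 1) / 2."""
    return comb(n, 2)


for n in (2_000, 20_000, 200_000):
    print(f"{n:>7,} records -> {pair_count(n):>14,} pairs")
```

Multiplying the number of records by 10 multiplies the number of pairs by roughly 100.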





The method below uses a more manual approach to select the combinations of items. This way we can retrieve every pair of similar items that meets the criteria based on the similarity score, not just the top 1 or 2.
We use the "combinations" function from the itertools library, imported at the top of the notebook.
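
The pattern itself is small enough to show on a toy list. This sketch assumes difflib's SequenceMatcher as a stand-in scorer so it runs without extra packages; `similar_pairs` and the sample names are illustrative, and the notebook's own cells use the Levenshtein ratio instead.

```python
from difflib import SequenceMatcher
from itertools import combinations


def similar_pairs(items, threshold=90):
    """Return every pair (a, b, score) whose similarity is above threshold."""
    pairs = []
    for a, b in combinations(items, 2):  # each unordered pair exactly once
        score = round(100 * SequenceMatcher(None, a, b).ratio())
        if score > threshold:
            pairs.append((a, b, score))
    return pairs


# Toy data: one near-duplicate pair and one exact duplicate
names = ["christinataylor", "davidwilliams", "christiantaylor", "davidwilliams"]
for pair in similar_pairs(names):
    print(pair)
```

Unlike a "best match" lookup, every pair above the threshold is kept, so an item with several near-duplicates shows up once per partner.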



In [47]:
list_to_check= create_fuzzy_column(df_long,"full_name")
print(df_long.shape)
print(len(list_to_check))

item_combinations=list(combinations(list_to_check, 2))
print(len(item_combinations), "combinations to compute. Careful if this number is large, it may take a long time")

(2000, 15)
2000
1999000 combinations to compute. Careful if this number is large, it may take a long time


In [48]:
df_dupes=df_long[df_long['fuzzy'].duplicated(keep=False)].sort_values(by="fuzzy", ascending=False)
df_dupes.shape

(40, 15)

In [49]:
df_dupes.groupby("fuzzy").count().sort_values(by="full_name",ascending=False)

Unnamed: 0_level_0,Unnamed: 0,address,city,country,dob,first_name,full_address,full_name,gender,last_name,member_id,middle_name,suffix,title
fuzzy,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1
amandathompson,2,2,2,2,2,2,2,2,2,2,2,0,0,2
anthonylewis,2,2,2,2,2,2,2,2,2,2,2,0,0,2
michaelsullivan,2,2,2,2,2,2,2,2,2,2,2,0,0,2
laurensmith,2,2,2,2,2,2,2,2,2,2,2,0,0,2
josephsmith,2,2,2,2,2,2,2,2,2,2,2,0,0,2
jonesthomas,2,2,2,2,2,2,2,2,2,2,2,0,0,2
johnmiller,2,2,2,2,2,2,2,2,2,2,2,0,0,2
huntmichael,2,2,2,2,2,2,2,2,2,2,2,0,0,2
hernandezjacqueline,2,2,2,2,2,2,2,2,2,2,2,0,0,2
heathertaylor,2,2,2,2,2,2,2,2,2,2,2,0,0,2


In [55]:
import datetime
start=datetime.datetime.now()
print(start.strftime("%d-%b-%Y %H:%M:%S"))
matches = []
@lru_cache(maxsize=None)
def get_ratio(a,b):
    #r = fuzz.token_sort_ratio(a,b) # not needed: the fuzzy preprocessing we did already sorts the tokens
    #r=fuzz.ratio(a,b)
    r=ratio(a,b)*100 #this uses the Levenshtein library directly, should be fast
    return r

for x in item_combinations:
    s=get_ratio(x[0],x[1])
    if s>80:
        matches.append([x[0], x[1], s]) 
        
print (datetime.datetime.now().strftime("%d-%b-%Y %H:%M:%S"))
print ("Minutes", round((datetime.datetime.now()-start).total_seconds()/(60),2))

09-May-2021 19:38:03
09-May-2021 19:38:06
Minutes 0.06


In [56]:
df_matches=pd.DataFrame(matches, columns=['x1','x2','score'])
likely_matches = df_matches[df_matches.score > 90]
print(likely_matches.shape)


(32, 3)


In [15]:
likely_matches_deduped=likely_matches.drop_duplicates() #we could have exact duplicates in the original file
print(likely_matches_deduped.shape)

(32, 3)


We still need to go back to the original dataset and find the matching groups.

In [53]:
output_list_fat=[]
output_list_thin=[]
match_group=1
for index,row in likely_matches_deduped.iterrows():
    key1=df_long.loc[df_long['fuzzy'] == row['x1']]['member_id'].tolist()
    if row['x1']==row['x2']:
        key2=[] #it is an self match so we count it only once
    else:
        key2=df_long.loc[df_long['fuzzy'] == row['x2']]['member_id'].tolist()
    key_conso=key1+key2
    for k in key_conso:
        v=df_long[df_long['member_id']==k]['full_name'].to_list()[0] #optional we get the original value
        #we could use indexes for extra performance but anyway this routine cannot be run in huge tables
        #as the combination would quickly get to tens of millions to check.
        output_list_thin.append([k,match_group,row['score'],v])
    output_list_fat.append([row['x1'], row['x2'], row['score'], key1, key2, len(key_conso),match_group])
    match_group +=1 
    
output= pd.DataFrame(output_list_fat, columns=['x1','x2','score','k1','k2','countkeys','match_group']).sort_values(by=['score'],ascending=False)
output_thin=pd.DataFrame(output_list_thin,columns=['id','match_group','score','value']).sort_values(by=['score'],ascending=False)


In [54]:
output.head()

Unnamed: 0,x1,x2,score,k1,k2,countkeys,match_group
0,christophermoore,christophermoore,100,"[1360800895169, 7072377543230]",[],2,1
12,anthonylewis,anthonylewis,100,"[9294051750262, 2118643743364]",[],2,13
25,brianwilliams,brianwilliams,100,"[406297274061, 55177040885]",[],2,26
24,laurensmith,laurensmith,100,"[8579400177275, 8950577347478]",[],2,25
23,huntmichael,huntmichael,100,"[2447864260207, 3632687508653]",[],2,24


In [37]:
output_thin.iloc[0:40].sort_values(by='match_group')

Unnamed: 0,id,match_group,score,value
0,1360800895169,1,100,Christopher Moore
1,7072377543230,1,100,Christopher Moore
4,825827228877,3,100,Michael Sullivan
5,9520235523057,3,100,Michael Sullivan
9,7039363481297,5,100,Amanda Thompson
10,231280510563,5,100,Amanda Thompson
12,5433696913272,6,100,"Taylor, Billy"
11,7794666749030,6,100,Billy Taylor
18,5400110077162,9,100,Joseph Smith
17,2828385250821,9,100,Joseph Smith
