# Examples of fuzzy matching 

updated 1/6/2021  

These example routines use a library in Python called fuzzywuzzy which implements various strategies for ranking similarity.

Note that this approach works for a relatively small number of records. If you are tackling several hundred thousand records, the number of comparisons grows with the product of the list sizes (quadratically when a file is matched against itself) and you may not be able to run these routines in a reasonable time. There are some approaches for large-scale comparisons in a separate recipe.

In [19]:
import pandas as pd
import numpy as np
from fuzzywuzzy import fuzz
from fuzzywuzzy import process
from datetime import datetime
from functools import lru_cache # This decorator implements in-memory storage, aka "caching", of the most recently used results (the least recently used ones are evicted first), which helps to speed up *re*calculations when the same value appears again.
import string
import math
 

try:
    # Alias the function so that the "distance" module imported below does not overwrite it
    from Levenshtein import ratio, distance as levenshtein_distance
except ImportError:
    print("Levenshtein library not installed")
# Installing the python-Levenshtein module alongside fuzzywuzzy increases performance,
# as it comes with a C implementation

try:
    import distance
except ImportError:
    print("Distance library not installed")
# This is another library that implements text distances, but it is pure Python,
# so it will be slower


from itertools import combinations_with_replacement
# We need this for the alternative method of self-matching, i.e. fuzzy duplicates.
# We use combinations_with_replacement instead of combinations to include the self-comparison,
# which matters when several records match the same preprocessed string.







Levenshtein library not installed


Inspired by  

https://www.datacamp.com/community/tutorials/fuzzy-string-python

The Levenshtein distance is a metric that measures how far apart two sequences (here, strings of characters) are.  
In other words, it measures the minimum number of edits needed to change one sequence into the other. These edits can be insertions, deletions or substitutions. The metric is named after Vladimir Levenshtein, who first considered it in 1965.
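
To make the definition concrete, here is a minimal pure-Python sketch of the classic dynamic-programming calculation. The function name `levenshtein` is only illustrative; it is not how the fuzzywuzzy, Levenshtein or distance libraries are implemented internally, and it will be much slower than they are.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions or
    substitutions needed to turn string a into string b."""
    # previous[j] = distance between the first i-1 characters of a
    # and the first j characters of b
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]  # turning i characters of a into an empty string takes i deletions
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute ca with cb (free if equal)
            ))
        previous = current
    return previous[-1]


print(levenshtein("apple inc.", "apple inc"))  # 1
print(levenshtein("kitten", "sitting"))        # 3
```

Only two rows of the table are kept in memory at any time, which is enough because each cell depends only on the current and the previous row.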

In [9]:
Str1 = "Apple Inc."
Str2 = "apple Inc"
Ratio = fuzz.ratio(Str1.lower(),Str2.lower())
Distance= distance.levenshtein(Str1.lower(),Str2.lower())
print("Ratio",Ratio)
print("Distance",Distance) # Should be 1 after making all string same case 

Ratio 95
Distance 1


In [10]:
Str1 = "Los Angeles Lakers"
Str2 = "L.A. Lakers"
Ratio = fuzz.ratio(Str1.lower(),Str2.lower())
Partial_Ratio = fuzz.partial_ratio(Str1.lower(),Str2.lower())
Token_Sort_Ratio = fuzz.token_sort_ratio(Str1.lower(),Str2.lower())
print(Ratio)
print(Partial_Ratio)
print(Token_Sort_Ratio)

62
73
64


fuzz.partial_ratio() scores this pair higher than fuzz.ratio() (73 against 62) because it only compares the shorter string with the best-matching part of the longer one. It uses an "optimal partial" logic: if the short string has length k and the longer string has length m, the algorithm seeks the score of the best-matching length-k substring. Had the second string been just "Lakers", it would appear verbatim inside "Los Angeles Lakers" and partial_ratio would yield 100.
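
The "best matching length-k substring" idea can be sketched with the standard library only. This is a brute-force illustration that tries every window; it assumes `difflib.SequenceMatcher` as the similarity measure, and the function names are made up for this example. fuzzywuzzy picks its candidate windows differently, so the scores will not always agree with `fuzz.partial_ratio()`.

```python
from difflib import SequenceMatcher


def simple_ratio(a: str, b: str) -> int:
    """Similarity of two strings as a 0-100 integer."""
    return round(100 * SequenceMatcher(None, a, b).ratio())


def brute_force_partial_ratio(a: str, b: str) -> int:
    """Best score of the shorter string against every window of the
    same length taken from the longer string."""
    shorter, longer = (a, b) if len(a) <= len(b) else (b, a)
    k = len(shorter)
    if k == 0:
        return 0
    return max(
        simple_ratio(shorter, longer[i:i + k])
        for i in range(len(longer) - k + 1)
    )


# "lakers" appears verbatim inside the longer string, so one window matches exactly
print(brute_force_partial_ratio("los angeles lakers", "lakers"))  # 100
```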

Nevertheless, this approach is not foolproof. What happens when the strings being compared contain the same words, but in a different order?

In [11]:
Str1 = "united states v. nixon"
Str2 = "Nixon v. United States"
Ratio = fuzz.ratio(Str1.lower(),Str2.lower())
Partial_Ratio = fuzz.partial_ratio(Str1.lower(),Str2.lower())
Token_Sort_Ratio = fuzz.token_sort_ratio(Str1,Str2)
print(Ratio)
print(Partial_Ratio)
print(Token_Sort_Ratio)

59
74
100


The fuzz.token functions have an important advantage over ratio and partial_ratio. They tokenize the strings and preprocess them by turning them to lower case and getting rid of punctuation. In the case of fuzz.token_sort_ratio(), the string tokens get sorted alphabetically and then joined together. After that, a simple fuzz.ratio() is applied to obtain the similarity percentage. This allows cases such as court cases in this example to be marked as being the same.
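
The steps described above (lower-casing, dropping punctuation, sorting the tokens, then a plain ratio) can be sketched as follows. This is an approximation under stated assumptions: the regular expression only keeps ASCII letters and digits, `difflib.SequenceMatcher` stands in for the ratio, and the function names are hypothetical. It is not fuzzywuzzy's own code.

```python
import re
from difflib import SequenceMatcher


def sorted_tokens(s: str) -> str:
    """Lower-case, replace punctuation with spaces, sort the tokens
    alphabetically and join them back with single spaces."""
    tokens = re.sub(r"[^a-z0-9]+", " ", s.lower()).split()
    return " ".join(sorted(tokens))


def sketch_token_sort_ratio(a: str, b: str) -> int:
    """Plain ratio applied to the sorted-token versions of both strings."""
    matcher = SequenceMatcher(None, sorted_tokens(a), sorted_tokens(b))
    return round(100 * matcher.ratio())


print(sorted_tokens("Nixon v. United States"))  # nixon states united v
print(sketch_token_sort_ratio("united states v. nixon", "Nixon v. United States"))  # 100
```

Both court-case strings reduce to the same sorted string, which is why the word order no longer matters.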

Still, what happens if these two strings are of widely differing lengths? That's where fuzz.token_set_ratio() comes in.

In [12]:
Str1 = "The supreme court case of Nixon vs The United States"
Str2 = "Nixon v. United States"
Ratio = fuzz.ratio(Str1.lower(),Str2.lower())
Partial_Ratio = fuzz.partial_ratio(Str1.lower(),Str2.lower())
Token_Sort_Ratio = fuzz.token_sort_ratio(Str1,Str2)
Token_Set_Ratio = fuzz.token_set_ratio(Str1,Str2)
print(Ratio)
print(Partial_Ratio)
print(Token_Sort_Ratio)
print(Token_Set_Ratio)

57
77
58
95


fuzz.token_set_ratio() takes a more flexible approach than fuzz.token_sort_ratio(). Instead of just tokenizing the strings, sorting and then pasting the tokens back together, token_set_ratio performs a set operation that takes out the common tokens (the intersection) and then makes fuzz.ratio() pairwise comparisons between the following new strings:

s1 = Sorted_tokens_in_intersection  
s2 = Sorted_tokens_in_intersection + sorted_rest_of_str1_tokens  
s3 = Sorted_tokens_in_intersection + sorted_rest_of_str2_tokens  

The logic behind these comparisons is that, since Sorted_tokens_in_intersection is always the same, the score tends to go up as these words make up a larger share of the original strings or as the remaining tokens are closer to each other. The highest of the pairwise scores is returned.
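
The construction of s1, s2 and s3 can be sketched with the standard library. As before, this is an illustration under assumptions (ASCII-only tokenising, `difflib.SequenceMatcher` as the ratio, hypothetical function names), not fuzzywuzzy's implementation, so the scores may differ from `fuzz.token_set_ratio()`.

```python
import re
from difflib import SequenceMatcher
from itertools import combinations


def tokens(s: str) -> set:
    """Set of lower-cased alphanumeric tokens in s."""
    return set(re.sub(r"[^a-z0-9]+", " ", s.lower()).split())


def ratio(a: str, b: str) -> int:
    return round(100 * SequenceMatcher(None, a, b).ratio())


def sketch_token_set_ratio(a: str, b: str) -> int:
    t1, t2 = tokens(a), tokens(b)
    common = " ".join(sorted(t1 & t2))                           # s1
    s1 = common
    s2 = (common + " " + " ".join(sorted(t1 - t2))).strip()      # s1 + rest of a
    s3 = (common + " " + " ".join(sorted(t2 - t1))).strip()      # s1 + rest of b
    # Compare s1/s2, s1/s3 and s2/s3, and keep the best score
    return max(ratio(x, y) for x, y in combinations((s1, s2, s3), 2))


# Every token of the first string also appears in the second one,
# so s1 and s2 are identical and the score is 100
print(sketch_token_set_ratio("Nixon v. United States",
                             "The case of Nixon v. United States"))  # 100
```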

## Practical examples  

We will now work through two examples, with slightly different approaches.  
One checks a file against another, to do a fuzzy lookup.  
The other checks a file against itself, to do fuzzy duplicate detection/grouping.
  

The first thing to do is to import the needed libraries and define a common "preprocessing" function to clean up the column we are going to use.  
Fuzzywuzzy also does preprocessing, but I find it better to do it ourselves, in bulk, to control exactly what we are dropping: there may be cases we want to ignore, like special characters or abbreviations.
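
Accented characters are one case where controlling the preprocessing matters. The sketch below uses only the standard library to contrast two choices: dropping every non-ASCII character outright, or decomposing it first (NFKD) so that the base letter survives. The helper names `drop_non_ascii` and `fold_to_ascii` are made up for this illustration.

```python
import unicodedata


def drop_non_ascii(s: str) -> str:
    """Remove every character that is not plain ASCII."""
    return s.encode("ascii", errors="ignore").decode("ascii")


def fold_to_ascii(s: str) -> str:
    """Split accented characters into base letter + accent mark (NFKD),
    then drop the non-ASCII accent marks."""
    decomposed = unicodedata.normalize("NFKD", s)
    return decomposed.encode("ascii", errors="ignore").decode("ascii")


names = [
    "I\u00f1aqui",   # Iñaqui
    "Luc\u00eda",    # Lucía
    "Dr\u00fccker",  # Drücker
]
for name in names:
    print(name, "->", drop_non_ascii(name), "/", fold_to_ascii(name))
```

With the first helper "Iñaqui" loses a letter and becomes "Iaqui"; with the second it becomes "Inaqui", which is usually the more useful value for matching.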

In [13]:
@lru_cache(maxsize=None)  # cache results for repeated values; maxsize=None means no limit (a finite limit can be marginally faster in some conditions)
def token_sort(s):
    s = str(s)
    s = s.translate(str.maketrans('', '', string.punctuation))
    sl = s.split()
    sl.sort()
    # Tokens are joined without a separator, so "jeff bezos" becomes "bezosjeff"
    return "".join(sl)

def create_fuzzy_column(df, col_name):
    fuzzy = df[col_name].str.lower()
    fuzzy = fuzzy.replace("'", '', regex=True)
    fuzzy = fuzzy.replace(r'\s', ' ', regex=True)
    fuzzy = fuzzy.replace('', np.nan)  # empty strings become missing values
    # Alternative 1: drop accented characters (and enie) outright instead of mapping them to the vowel or n:
    # fuzzy = fuzzy.str.encode('ascii', 'ignore').str.decode('ascii')
    # Alternative 2: replace every character outside string.printable (accented ones included) with a space:
    # from string import printable
    # fuzzy = fuzzy.apply(lambda x: ''.join(c if c in printable else ' ' for c in x))
    # Used here: decompose accented characters (NFKD) so that the base letter survives
    # the ASCII encoding; there are various libraries that do that too.
    fuzzy = fuzzy.str.normalize('NFKD').str.encode('ascii', errors='ignore').str.decode('utf-8')
    fuzzy = fuzzy.replace('[^a-z0-9 ]', ' ', regex=True)  # 0-9, so that the digit 0 is kept
    # Drop company suffixes and titles; \b stops e.g. "ms " from matching inside "williams "
    fuzzy = fuzzy.replace(r'\b(ltd|plc|llp|limited)\b', ' ', regex=True)
    fuzzy = fuzzy.replace(r'\b(mr|mrs|ms|miss)\b', ' ', regex=True)
    fuzzy = fuzzy.replace(' +', ' ', regex=True)
    fuzzy = fuzzy.str.strip()
    df['fuzzy'] = [token_sort(s) for s in fuzzy]
    # Note: token_sort() turns missing values into the string 'nan', so they stay in the list
    return list(df['fuzzy'][~pd.isnull(df['fuzzy'])])


This is a small test of the routine 

In [15]:
test_data = [
    ["  New\rLine", "linenew"],
    ['Iñaqui  ', 'inaqui'],
    ["Tab\tEntry", "entrytab"],
    ['Lucía', "lucia"],
    ["", "NaN"],
    ["Mr. Ryan     O'Neill", "oneillryan"],
    ["Ana-María", "anamaria"],
    ["John Smith 2nd", "2ndjohnsmith"],
    ["Peter\uFF3FDrücker", "druckerpeter"],
    ["Emma\u005FWatson", "emma watson"],
    ["Jeff Bezos", "bezosjeff"],
    ["  jeff   . Bezos  ", "bezosjeff"],
    ["Amazon Ltd.", "amazon"],
]
test_df = pd.DataFrame(test_data, columns = ['input', 'expected']) 
test_list= create_fuzzy_column(test_df,'input')
print(test_list)
test_df

['linenew', 'inaqui', 'entrytab', 'lucia', 'nan', 'oneillryan', 'anamaria', '2ndjohnsmith', 'druckerpeter', 'emmawatson', 'bezosjeff', 'bezosjeff', 'amazon']


Unnamed: 0,input,expected,fuzzy
0,New\rLine,linenew,linenew
1,Iñaqui,inaqui,inaqui
2,Tab\tEntry,entrytab,entrytab
3,Lucía,lucia,lucia
4,,,
5,Mr. Ryan O'Neill,oneillryan,oneillryan
6,Ana-María,anamaria,anamaria
7,John Smith 2nd,2ndjohnsmith,2ndjohnsmith
8,Peter＿Drücker,druckerpeter,druckerpeter
9,Emma_Watson,emma watson,emmawatson


## Example of a routine for checking one file against another

While the fuzzy matching already ignores case and special characters, it is convenient to do a bit of cleanup in advance. When you pick the unique values via the "set" command, and when you query for potential matches afterwards, you then work with a cleaner and smaller set of data. It is possibly also worth removing other artifacts such as apostrophes and double spaces, trimming trailing spaces, etc.

Also note that using a generic fuzzy field makes the rest of the routine more reusable.

In [42]:
df_short= pd.read_excel("./test_data/data_short.xlsx")
df_long=pd.read_excel("./test_data/data_long.xlsx")
field_to_match='full_name'
row_id='member_id'  # unique identifier of the row

In [43]:
primary_list=list(set(create_fuzzy_column(df_long,field_to_match))) 
#we do set to remove duplicates and reduce the comparisons, we will pick all the exact duplicates
#anyway when we reconstruct the hits
print(df_long.shape)

(2000, 15)


In [44]:
secondary_list= list(set(create_fuzzy_column(df_short,field_to_match)))
print(df_short.shape)

(500, 15)


In [35]:
matches = []
for item in primary_list:
    if not isinstance(item, str):  # skip anything that is not a string (e.g. missing values)
        continue
    # extractOne returns a tuple with the best match and its score (0-100).
    # We are letting the algorithm pick the winner, so it is important to choose
    # the right scorer. We used token_sort_ratio, but you should check the variants.
    match = process.extractOne(item, secondary_list, scorer=fuzz.token_sort_ratio)
    matches.append([item, match[0], match[1]])

This is how we sort descending a list of lists by some field (and retrieve first 5)

In [38]:
sorted(matches, reverse=True, key=lambda x: x[2])[0:5]

[['oliviasmith', 'oliviasmith', 100],
 ['katherinesmith', 'katherinesmith', 100],
 ['alvarezjessica', 'alvarezjessica', 100],
 ['collinsgeorge', 'collinsgeorge', 100],
 ['danielgonzales', 'danielgonzales', 100]]

We filter by some threshold:

In [39]:
likely_matches = [x for x in matches if x[2] > 90]
len(likely_matches)

18

The cell below recovers the actual records in each file via an identifier key. The keys are kept as lists because there could technically be more than one record per matched string (if a file itself contains duplicates).

In [50]:
output_list = []
for x in likely_matches:
    key1 = df_long.loc[df_long['fuzzy'] == x[0], row_id].tolist()
    key2 = df_short.loc[df_short['fuzzy'] == x[1], row_id].tolist()
    output_list.append([x[0], x[1], x[2], key1, key2])
# This produces yet another list of lists, so we load it into a dataframe for convenience.
# key1 and key2 are lists because there could be more than one record for that preprocessed string.
# One option is to just pick the first, i.e. use key1[0] and key2[0].
# Another option is to use a routine to convert these into individual rows; see the next example,
# where we do precisely that.
# Another option is to split the lists in Excel.
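
As a rough sketch of the "individual rows" option, the key lists can be expanded with `itertools.product`, giving one row per pair of matching records. The function name `expand_matches` and the ids in `sample` are made up for illustration; the row layout mirrors the `[primary, secondary, score, key1, key2]` lists built above.

```python
from itertools import product


def expand_matches(rows):
    """Turn [primary, secondary, score, keys_primary, keys_secondary] rows
    into one flat tuple per (primary key, secondary key) pair."""
    flat = []
    for primary, secondary, score, keys1, keys2 in rows:
        for k1, k2 in product(keys1, keys2):
            flat.append((primary, secondary, score, k1, k2))
    return flat


# Hypothetical ids: two records in the long file share the same preprocessed string
sample = [["joesmith", "josesmith", 94, [101, 102], [201]]]
for row in expand_matches(sample):
    print(row)
# ('joesmith', 'josesmith', 94, 101, 201)
# ('joesmith', 'josesmith', 94, 102, 201)
```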


In [70]:
output= pd.DataFrame(output_list, columns=['primary','secondary','perc_match','key_primary','key_secondary'])
output[0:40]

Unnamed: 0,primary,secondary,perc_match,key_primary,key_secondary
0,oliviasmith,oliviasmith,100,[1067119077629],[4358287106424]
1,katherinesmith,katherinesmith,100,[8986386973416],[7563384088130]
2,christinataylor,christiantaylor,93,[648337018130],[1399534976844]
3,alvarezjessica,alvarezjessica,100,[9790621226361],[9250994501583]
4,collinsgeorge,collinsgeorge,100,[5179019038436],[1306491549530]
5,danielgonzales,danielgonzales,100,[7993593208263],[140312516230]
6,davidwilliams,davidwilliams,100,[3922931562549],[229720631152]
7,henrymartinez,henrymartin,92,[6982911494472],[9971387465151]
8,joesmith,josesmith,94,[3880509524088],[4617626102085]
9,garciakatherine,garciahkatherine,97,[7876034075542],[9706101060165]


In [68]:
output.iat[0,2] = 200

In [69]:
output.iat[0,2]

200

In [71]:
output.head()

Unnamed: 0,primary,secondary,perc_match,key_primary,key_secondary
0,oliviasmith,oliviasmith,100,[1067119077629],[4358287106424]
1,katherinesmith,katherinesmith,100,[8986386973416],[7563384088130]
2,christinataylor,christiantaylor,93,[648337018130],[1399534976844]
3,alvarezjessica,alvarezjessica,100,[9790621226361],[9250994501583]
4,collinsgeorge,collinsgeorge,100,[5179019038436],[1306491549530]


In [48]:
df_long[df_long['member_id'] == output.iloc[6]['key_primary'][0]]

Unnamed: 0.1,Unnamed: 0,address,city,country,dob,first_name,full_address,full_name,gender,last_name,member_id,middle_name,suffix,title,fuzzy
1957,1957,3069 Kimberly Ways Suite 864,Port Michael,Iceland,1949-10-01,David,"3069 Kimberly Ways Suite 864, Port Michael, Ic...",David Williams,M,Williams,3922931562549,,,Mr.,davidwilliams


In [52]:
df_short[df_short['member_id'] == output.iloc[6]['key_secondary'][0]]

Unnamed: 0.1,Unnamed: 0,address,city,country,dob,first_name,full_address,full_name,gender,last_name,member_id,middle_name,suffix,title,fuzzy
6,6,753 Leonard Ridge,West Andrea,Monaco,1976-11-11,David,"753 Leonard Ridge, West Andrea, Monaco",David Williams,M,Williams,229720631152,,,Mr.,davidwilliams


## Checking for fuzzy duplicates within one file

Because we are comparing the file against itself, the number of comparisons in the main loop grows quadratically with the number of unique values, so very soon you will be looking at millions of iterations. Think carefully about how to reduce the dataset to something relevant and manageable.

There are alternative methods for large datasets that involve tokenising parts of the text; they scale better, possibly at the cost of some accuracy. These are detailed in another notebook.
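
To get a feel for the growth, the number of pairs produced by `combinations_with_replacement(items, 2)` for n unique values is n(n+1)/2. The short sketch below evaluates that formula for a few sizes; the function name and the larger sizes are only illustrative.

```python
from math import comb


def self_match_comparisons(n: int) -> int:
    """Number of pairs, self-pairs included, among n unique values."""
    return n * (n + 1) // 2


# The closed form agrees with the binomial coefficient C(n + 1, 2)
assert self_match_comparisons(1_980) == comb(1_981, 2)

for n in (1_980, 20_000, 200_000):
    print(f"{n:>8,} unique values -> {self_match_comparisons(n):>16,} comparisons")
#    1,980 unique values ->        1,961,190 comparisons
#   20,000 unique values ->      200,010,000 comparisons
#  200,000 unique values ->   20,000,100,000 comparisons
```

Multiplying the number of values by ten multiplies the work by roughly a hundred.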





The method below uses a more manual approach to select the combinations of items. This way we can retrieve every pair of items that meets the similarity threshold, and not just the top 1 or 2 matches.
It relies on the "combinations_with_replacement" function from the itertools library, imported at the top of the notebook.
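
The difference between the two itertools functions is easiest to see on a tiny list. The names below are made up; the point is that only `combinations_with_replacement` yields each item paired with itself, which is how exact duplicates of a preprocessed string get picked up later.

```python
from itertools import combinations, combinations_with_replacement

names = ["annlee", "bobray", "calvin"]  # made-up preprocessed values

pairs = list(combinations(names, 2))
pairs_with_self = list(combinations_with_replacement(names, 2))

print(len(pairs), pairs)
# 3 [('annlee', 'bobray'), ('annlee', 'calvin'), ('bobray', 'calvin')]
print(len(pairs_with_self))
# 6, the three pairs above plus ('annlee', 'annlee'), ('bobray', 'bobray'), ('calvin', 'calvin')
```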



In [9]:

df=df_long.copy()
list_to_check = sorted(set(create_fuzzy_column(df, "full_name")))
print(df.shape)
print(len(list_to_check))

item_combinations=list(combinations_with_replacement(list_to_check, 2))
print(len(item_combinations), "combinations to compute. Careful if this number is large, it may take a long time")

(2000, 15)
1980
1961190 combinations to compute. Careful if this number is large, it may take a long time


In [41]:
from datetime import datetime
start = datetime.now()
print(start.strftime("%d-%b-%Y %H:%M:%S"))
matches = []
# Note: every pair in item_combinations is unique, so this cache only helps
# if get_ratio is called again with pairs it has already seen
@lru_cache(maxsize=100000)
def get_ratio(a, b):
    # r = fuzz.token_sort_ratio(a, b)  # not needed, as we did the token sort in preprocessing
    r = fuzz.ratio(a, b)
    # r = ratio(a, b) * 100  # this uses the Levenshtein library directly, which should be fast
    return r

for x in item_combinations:
    s = get_ratio(x[0], x[1])
    if s > 89:
        matches.append([x[0], x[1], s])

end = datetime.now()
print(end.strftime("%d-%b-%Y %H:%M:%S"))
print("Minutes", round((end - start).total_seconds() / 60, 2))

06-Jun-2021 13:21:08
06-Jun-2021 13:21:53
Minutes 0.76


In [42]:
df_matches=pd.DataFrame(matches, columns=['x1','x2','score'])

We still need to go back to the original dataset and find the matching groups.
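
Going back to the original rows amounts to a lookup from each preprocessed string to the ids that share it. As a hedged sketch of the "use indexes for extra performance" idea, that lookup can be built once as a dictionary instead of filtering the dataframe for every match. The function name and the sample records are hypothetical.

```python
from collections import defaultdict


def build_key_index(records):
    """Map each preprocessed string to the list of row ids that share it.

    records: iterable of (row_id, fuzzy_value) pairs.
    """
    index = defaultdict(list)
    for row_id, fuzzy_value in records:
        index[fuzzy_value].append(row_id)
    return index


# Made-up ids and values
records = [(1, "billytaylor"), (2, "billytaylor"), (3, "joesmith")]
index = build_key_index(records)

print(index["billytaylor"])  # [1, 2]
print(index["joesmith"])     # [3]
```

With a dataframe, the pairs could come from something like `zip(df['member_id'], df['fuzzy'])`, after which each match only needs two dictionary lookups.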

In [44]:
output_list_fat = []
output_list_thin = []
match_group = 1
for index, row in df_matches.iterrows():
    key1 = df.loc[df['fuzzy'] == row['x1'], 'member_id'].tolist()
    if row['x1'] == row['x2']:
        if len(key1) <= 1:
            # there is only one instance in the population, so no real match; back to the for loop
            continue
        key2 = []  # these are exact matches against the preprocessed value, we count them only once
    else:
        key2 = df.loc[df['fuzzy'] == row['x2'], 'member_id'].tolist()
    key_conso = key1 + key2
    for k in key_conso:
        v = df.loc[df['member_id'] == k, 'full_name'].tolist()[0]  # optional: we get the original value
        # we could use indexes for extra performance, but this routine cannot be run on huge tables anyway,
        # as the combinations would quickly reach tens of millions to check
        output_list_thin.append([k, match_group, row['score'], v])
    output_list_fat.append([row['x1'], row['x2'], row['score'], key1, key2, len(key_conso), match_group])
    match_group += 1

output = pd.DataFrame(output_list_fat, columns=['x1','x2','score','k1','k2','countkeys','match_group']).sort_values(by=['score'], ascending=False)
output_thin = pd.DataFrame(output_list_thin, columns=['id','match_group','score','value']).sort_values(by=['score'], ascending=False)


In [47]:
output.head()

Unnamed: 0,x1,x2,score,k1,k2,countkeys,match_group
17,michaelsullivan,michaelsullivan,100,"[825827228877, 9520235523057]",[],2,18
13,michaelthomas,michaelthomas,100,"[3747334520428, 7537589811664]",[],2,14
32,johnmiller,johnmiller,100,"[6774632622452, 6261795470414]",[],2,33
31,brownjohn,brownjohn,100,"[8673512181554, 7506296213450]",[],2,32
30,laurensmith,laurensmith,100,"[8579400177275, 8950577347478]",[],2,31


In [46]:
output_thin.iloc[0:40].sort_values(by='match_group')

Unnamed: 0,id,match_group,score,value
7,55177040885,4,100,Brian Williams
6,406297274061,4,100,Brian Williams
8,3916487516460,5,100,Jacqueline Hernandez
9,9001733828207,5,100,Jacqueline Hernandez
11,5400110077162,6,100,Joseph Smith
10,2828385250821,6,100,Joseph Smith
17,5433696913272,9,100,"Taylor, Billy"
16,7794666749030,9,100,Billy Taylor
21,3632687508653,11,100,Michael Hunt
20,2447864260207,11,100,Michael Hunt
