In [25]:
import pandas as pd
from fuzzywuzzy import fuzz
from fuzzywuzzy import process
# Installing python-Levenshtein alongside fuzzywuzzy can noticeably speed up matching.


Inspired by  

https://www.datacamp.com/community/tutorials/fuzzy-string-python

The Levenshtein distance is a metric for how far apart two sequences are. It counts the minimum number of single-character edits needed to change one sequence into the other. These edits can be insertions, deletions or substitutions. The metric is named after Vladimir Levenshtein, who first considered it in 1965.
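
To make the definition concrete, here is a minimal pure-Python sketch of the edit distance using the standard dynamic-programming recurrence. The function name `levenshtein` is only illustrative and is not part of fuzzywuzzy. Note that fuzz.ratio() reports a normalised 0-100 similarity rather than this raw edit count.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions or substitutions."""
    # previous[j] is the distance between the prefix of a processed so far and b[:j].
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]  # turning a[:i] into the empty string takes i deletions
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute, or keep if the characters match
            ))
        previous = current
    return previous[-1]


print(levenshtein("apple inc.", "apple inc"))  # 1: delete the trailing "."
print(levenshtein("kitten", "sitting"))        # 3
```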

In [16]:
Str1 = "Apple Inc."
Str2 = "apple Inc"
Ratio = fuzz.ratio(Str1.lower(),Str2.lower())
print(Ratio)

95


In [17]:
Str1 = "Los Angeles Lakers"
Str2 = "L.A. Lakers"
Ratio = fuzz.ratio(Str1.lower(),Str2.lower())
Partial_Ratio = fuzz.partial_ratio(Str1.lower(),Str2.lower())
Token_Sort_Ratio = fuzz.token_sort_ratio(Str1.lower(),Str2.lower())
print(Ratio)
print(Partial_Ratio)
print(Token_Sort_Ratio)

62
73
64


fuzz.partial_ratio() scores this pair higher than fuzz.ratio() (73 versus 62) because it compares the shorter string only against the best-matching part of the longer one. If the shorter string were just "Lakers", it would be found verbatim inside "Los Angeles Lakers" and the score would be 100. This works through an "optimal partial" logic: if the shorter string has length k and the longer string has length m, the algorithm looks for the score of the best-matching length-k substring of the longer string.
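
The "optimal partial" idea can be sketched with the standard library alone. This is a brute-force illustration, not fuzzywuzzy's actual implementation: it tries every length-k window, and it uses `difflib.SequenceMatcher` for the underlying ratio, so scores can differ slightly from the library's. The names `ratio_sketch` and `partial_ratio_sketch` are made up for this example.

```python
from difflib import SequenceMatcher


def ratio_sketch(a: str, b: str) -> int:
    """Similarity of two strings on a 0-100 scale, based on difflib."""
    return round(100 * SequenceMatcher(None, a, b, autojunk=False).ratio())


def partial_ratio_sketch(a: str, b: str) -> int:
    """Best score of the shorter string against any equally long window of the longer one."""
    shorter, longer = sorted((a, b), key=len)
    k = len(shorter)
    return max(
        ratio_sketch(shorter, longer[start:start + k])
        for start in range(len(longer) - k + 1)
    )


print(partial_ratio_sketch("lakers", "los angeles lakers"))  # 100: "lakers" is a substring
```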

Nevertheless, this approach is not foolproof. What happens when the strings contain the same words, but in a different order?

In [43]:
Str1 = "united states v. nixon"
Str2 = "Nixon v. United States"
Ratio = fuzz.ratio(Str1.lower(),Str2.lower())
Partial_Ratio = fuzz.partial_ratio(Str1.lower(),Str2.lower())
Token_Sort_Ratio = fuzz.token_sort_ratio(Str1,Str2)
print(Ratio)
print(Partial_Ratio)
print(Token_Sort_Ratio)

59
74
100


The fuzz.token functions have an important advantage over ratio and partial_ratio: they tokenize the strings and preprocess them by converting them to lower case and stripping punctuation. In fuzz.token_sort_ratio(), the tokens are sorted alphabetically and joined back together, and then a plain fuzz.ratio() is applied to get the similarity percentage. This is why the two court case names in this example, which contain the same words in a different order, score 100.
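
The steps described above can be sketched as follows. This is an approximation under stated assumptions: the preprocessing regex and the difflib-based ratio stand in for fuzzywuzzy's own internals, and the names `sorted_tokens` and `token_sort_ratio_sketch` are hypothetical.

```python
import re
from difflib import SequenceMatcher


def ratio_sketch(a: str, b: str) -> int:
    """Similarity of two strings on a 0-100 scale, based on difflib."""
    return round(100 * SequenceMatcher(None, a, b, autojunk=False).ratio())


def sorted_tokens(text: str) -> str:
    """Lower-case, replace punctuation with spaces, then sort the tokens alphabetically."""
    cleaned = re.sub(r"[\W_]+", " ", text.lower())
    return " ".join(sorted(cleaned.split()))


def token_sort_ratio_sketch(a: str, b: str) -> int:
    """Plain ratio applied to the sorted-token versions of both strings."""
    return ratio_sketch(sorted_tokens(a), sorted_tokens(b))


print(sorted_tokens("Nixon v. United States"))  # nixon states united v
print(token_sort_ratio_sketch("united states v. nixon", "Nixon v. United States"))  # 100
```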

Still, what happens if the two strings differ widely in length? That's where fuzz.token_set_ratio() comes in.

In [46]:
Str1 = "The supreme court case of Nixon vs The United States"
Str2 = "Nixon v. United States"
Ratio = fuzz.ratio(Str1.lower(),Str2.lower())
Partial_Ratio = fuzz.partial_ratio(Str1.lower(),Str2.lower())
Token_Sort_Ratio = fuzz.token_sort_ratio(Str1,Str2)
Token_Set_Ratio = fuzz.token_set_ratio(Str1,Str2)
print(Ratio)
print(Partial_Ratio)
print(Token_Sort_Ratio)
print(Token_Set_Ratio)

57
77
58
95


fuzz.token_set_ratio() takes a more flexible approach than fuzz.token_sort_ratio(). Instead of just tokenizing the strings, sorting and then pasting the tokens back together, token_set_ratio performs a set operation that takes out the common tokens (the intersection) and then makes fuzz.ratio() pairwise comparisons between the following new strings:

s1 = Sorted_tokens_in_intersection  
s2 = Sorted_tokens_in_intersection + sorted_rest_of_str1_tokens  
s3 = Sorted_tokens_in_intersection + sorted_rest_of_str2_tokens  
The highest of the three pairwise scores is returned. Because Sorted_tokens_in_intersection is identical in every string, the score goes up as the shared words make up a larger share of the original strings, or as the remaining tokens get closer to each other.
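
The construction of s1, s2 and s3 can be illustrated with a short standard-library sketch. As before, the tokenizer regex and the difflib-based ratio are assumptions standing in for fuzzywuzzy's internals, and `token_set_ratio_sketch` is a hypothetical name, so exact scores may differ from the library's.

```python
import re
from difflib import SequenceMatcher


def ratio_sketch(a: str, b: str) -> int:
    """Similarity of two strings on a 0-100 scale, based on difflib."""
    return round(100 * SequenceMatcher(None, a, b, autojunk=False).ratio())


def tokens(text: str) -> set:
    """Lower-case the text, replace punctuation with spaces and return the set of tokens."""
    return set(re.sub(r"[\W_]+", " ", text.lower()).split())


def token_set_ratio_sketch(a: str, b: str) -> int:
    tokens_a, tokens_b = tokens(a), tokens(b)
    common = " ".join(sorted(tokens_a & tokens_b))  # sorted intersection
    rest_a = " ".join(sorted(tokens_a - tokens_b))  # tokens only in a
    rest_b = " ".join(sorted(tokens_b - tokens_a))  # tokens only in b
    s1 = common
    s2 = f"{common} {rest_a}".strip()
    s3 = f"{common} {rest_b}".strip()
    # Compare the three strings pairwise and keep the best score.
    return max(ratio_sketch(s1, s2), ratio_sketch(s1, s3), ratio_sketch(s2, s3))


# Every token of the first string appears in the second, so s1 equals s2.
print(token_set_ratio_sketch("Nixon United States",
                             "The case of Nixon versus United States"))  # 100
```

When one string's tokens are a subset of the other's, two of the three strings are identical, which is why extra words in the longer string do not pull the score down.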

In [22]:
df_short= pd.read_excel("./test_data/data_short.xlsx")
df_long=pd.read_excel("./test_data/data_long.xlsx")


In [34]:
list_to_check_against=df_short['full_name'].to_list()
list_to_check_against = [each_string.lower() for each_string in list_to_check_against]


In [35]:
data_to_check=df_long['full_name'].to_list()
data_to_check= [each_string.lower() for each_string in data_to_check]


In [38]:
output = []
for item in data_to_check:
    # Skip anything that is not a string (e.g. NaN from an empty spreadsheet cell).
    if not isinstance(item, str):
        continue
    # extractOne returns a tuple of the best match and its score for the given scorer.
    match = process.extractOne(item, list_to_check_against, scorer=fuzz.token_sort_ratio)
    output.append([item, match[0], match[1]])

In [42]:
[x for x in output if x[2]>90]

[['joe smith', 'smith, jose', 95],
 ['vasquez, robert', 'robert velasquez', 93],
 ['gary velazquez', 'vazquez, gary', 92],
 ['andrew smith', 'andrea smith', 92],
 ['joseph smith', 'smith, jose', 91],
 ['mary hernandez', 'mary hernandez', 100],
 ['olivia smith', 'smith, olivia', 100],
 ['jessica alvarez', 'jessica alvarez', 100],
 ['katherine garcia', 'garcia, katherine h', 94],
 ['michael taylor', 'michael taylor', 100],
 ['andrew hernandez', 'andrew hernandez', 100],
 ['james martin', 'james martin', 100],
 ['christina taylor', 'christian taylor', 94],
 ['george collins', 'george collins', 100],
 ['daniel gonzales', 'daniel gonzales', 100],
 ['katherine smith', 'katherine smith', 100],
 ['christopher brown', 'christopher brown', 100],
 ['henry martinez', 'henry martin', 92],
 ['david williams', 'david williams', 100],
 ['jamie owens', 'james owens', 91],
 ['joseph smith', 'smith, jose', 91]]