# FuzzyWuzzy Python library
## Install via pip

In [1]:
!pip install fuzzywuzzy

!pip install python-Levenshtein



## How to use this library

First, import these modules:

In [2]:
from fuzzywuzzy import fuzz 
from fuzzywuzzy import process

Simple ratio and partial ratio usage:

In [3]:
print(fuzz.ratio('geeksforgeeks', 'geeksgeeks'))
print(fuzz.ratio('GeeksforGeeks', 'GeeksforGeeks'))
print(fuzz.ratio('geeks for geeks', 'Geeks For Geeks '))

print(fuzz.partial_ratio("geeks for geeks", "geeks for geeks!")) 
print(fuzz.partial_ratio("geeks for geeks", "geeks geeks"))

87
100
77
100
64


Now, token sort ratio and token set ratio:

In [4]:
# Token Sort Ratio 
print(fuzz.token_sort_ratio("geeks for geeks", "for geeks geeks")) 

# Token Sort Ratio vs. Token Set Ratio on a pair with a repeated token
print(fuzz.token_sort_ratio("geeks for geeks", "geeks for for geeks")) 
print(fuzz.token_set_ratio("geeks for geeks", "geeks for for geeks"))

100
88
100


Now suppose we have a list of options and want to find the closest match(es). For that we can use the process module:

In [5]:
query = 'geeks for geeks' 
choices = ['geek for geek', 'geek geek', 'g. for geeks'] 

print(process.extract(query, choices)) 
print(process.extractOne(query, choices))

[('g. for geeks', 95), ('geek for geek', 93), ('geek geek', 86)]
('g. for geeks', 95)


Another commonly used scorer is WRatio. It is sometimes a better choice than the simple ratio, because WRatio handles differences between lower and upper case, along with some other variations such as trailing punctuation.

In [6]:
print(fuzz.WRatio('geeks for geeks', 'Geeks For Geeks'))
print(fuzz.WRatio('geeks for geeks!!!','geeks for geeks'))
print(fuzz.ratio('geeks for geeks!!!','geeks for geeks'))

100
100
91


## Full Code

In [7]:
# Python code showing all the ratios together
from fuzzywuzzy import fuzz 
from fuzzywuzzy import process 
s1 = "I love GeeksforGeeks" 
s2 = "I am loving GeeksforGeeks" 
print("FuzzyWuzzy Ratio: ", fuzz.ratio(s1, s2))
print("FuzzyWuzzy PartialRatio: ", fuzz.partial_ratio(s1, s2))
print("FuzzyWuzzy TokenSortRatio: ", fuzz.token_sort_ratio(s1, s2))
print("FuzzyWuzzy TokenSetRatio: ", fuzz.token_set_ratio(s1, s2))
print("FuzzyWuzzy WRatio: ", fuzz.WRatio(s1, s2),'\n\n')
# Using the process module
query = 'geeks for geeks' 
choices = ['geek for geek', 'geek geek', 'g. for geeks'] 
print("List of ratios:")
print(process.extract(query, choices), '\n')
print("Best among the above list: ",process.extractOne(query, choices))

FuzzyWuzzy Ratio:  84
FuzzyWuzzy PartialRatio:  85
FuzzyWuzzy TokenSortRatio:  84
FuzzyWuzzy TokenSetRatio:  86
FuzzyWuzzy WRatio:  84 


List of ratios:
[('g. for geeks', 95), ('geek for geek', 93), ('geek geek', 86)] 

Best among the above list:  ('g. for geeks', 95)


Let's start simple. FuzzyWuzzy has, just like the Levenshtein package, a ratio function that computes the standard Levenshtein distance similarity ratio between two sequences. You can see an example below:

In [8]:
from fuzzywuzzy import fuzz

Str1 = "Apple Inc."

Str2 = "apple Inc"

Ratio = fuzz.ratio(Str1.lower(),Str2.lower())

print(Ratio)

95
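To make the ratio concrete, here is a minimal sketch that reproduces the same kind of score with only the standard library. It assumes the common formula 2 * M / T, where M is the number of matching characters and T is the combined length of both strings; `simple_ratio` is a hypothetical helper name, not part of fuzzywuzzy. The library itself may use python-Levenshtein when it is installed, so scores can differ slightly in some cases.

```python
from difflib import SequenceMatcher


def simple_ratio(a: str, b: str) -> int:
    """Return a 0-100 similarity score based on 2 * M / T.

    M is the number of matching characters found by SequenceMatcher,
    T is len(a) + len(b). This is an illustrative stand-in, not the
    library's own implementation.
    """
    matcher = SequenceMatcher(None, a, b)
    return int(round(100 * matcher.ratio()))


if __name__ == "__main__":
    # "apple inc" (9 characters) matches inside "apple inc." (10 characters):
    # 2 * 9 / 19 = 0.947..., which rounds to 95.
    print(simple_ratio("Apple Inc.".lower(), "apple Inc".lower()))
```

Lowercasing before the call matters here, because this plain ratio treats "A" and "a" as different characters.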


This similarity ratio is in line with the earlier examples. However, fuzzywuzzy also has more powerful functions that let us deal with more complex situations, such as substring matching. Here is an example:

In [9]:
Str1 = "Los Angeles Lakers"

Str2 = "Lakers"

Ratio = fuzz.ratio(Str1.lower(),Str2.lower())

Partial_Ratio = fuzz.partial_ratio(Str1.lower(),Str2.lower())

print(Ratio)

print(Partial_Ratio)

50
100


fuzz.partial_ratio() is capable of detecting that both strings refer to the Lakers, so it yields 100% similarity. It works by using an "optimal partial" logic. In other words, if the shorter string has length k and the longer string has length m, the algorithm seeks the score of the best-matching length-k substring of the longer string.

Nevertheless, this approach is not foolproof. What happens when the strings being compared contain the same words, but in a different order? Luckily for us, fuzzywuzzy has a solution. You can see the example below:
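The "optimal partial" idea can be sketched as a brute-force sliding window: compare the shorter string against every substring of the same length in the longer string and keep the best score. This is a simplified illustration under that assumption; `window_partial_ratio` is a hypothetical name, and the library most likely selects candidate windows more cleverly than trying all of them.

```python
from difflib import SequenceMatcher


def window_partial_ratio(a: str, b: str) -> int:
    """Best 0-100 score of the shorter string against any same-length
    window of the longer string (brute-force illustration)."""
    shorter, longer = (a, b) if len(a) <= len(b) else (b, a)
    k = len(shorter)
    if k == 0:
        return 0

    best = 0.0
    # Slide a window of length k across the longer string.
    for start in range(len(longer) - k + 1):
        window = longer[start:start + k]
        score = SequenceMatcher(None, shorter, window).ratio()
        best = max(best, score)
    return int(round(100 * best))


if __name__ == "__main__":
    # The window "lakers" at the end of the longer string matches exactly.
    print(window_partial_ratio("los angeles lakers", "lakers"))
```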

In [10]:
Str1 = "united states v. nixon"

Str2 = "Nixon v. United States"

Ratio = fuzz.ratio(Str1.lower(),Str2.lower())

Partial_Ratio = fuzz.partial_ratio(Str1.lower(),Str2.lower())

Token_Sort_Ratio = fuzz.token_sort_ratio(Str1,Str2)

print(Ratio)

print(Partial_Ratio)

print(Token_Sort_Ratio)

59
74
100


The fuzz.token_* functions have an important advantage over ratio and partial_ratio: they tokenize the strings and preprocess them by converting them to lower case and removing punctuation. In the case of fuzz.token_sort_ratio(), the string tokens are sorted alphabetically and then joined together. After that, a simple fuzz.ratio() is applied to obtain the similarity percentage. This lets strings such as the court case names in this example be recognized as the same.

Still, what happens if the two strings are of widely differing lengths? That's where fuzz.token_set_ratio() comes in. Here is an example:
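The steps just described (lowercase, strip punctuation, sort the tokens, join, then apply a plain ratio) can be sketched with the standard library. The helper names `normalize` and `sorted_token_ratio` are hypothetical, and the exact preprocessing rules in fuzzywuzzy may differ in detail from the regular expression assumed here.

```python
import re
from difflib import SequenceMatcher


def normalize(text: str) -> str:
    """Lowercase and replace runs of non-alphanumeric characters with a space."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


def sorted_token_ratio(a: str, b: str) -> int:
    """Sort each string's tokens alphabetically, rejoin, then compare."""
    sorted_a = " ".join(sorted(normalize(a).split()))
    sorted_b = " ".join(sorted(normalize(b).split()))
    return int(round(100 * SequenceMatcher(None, sorted_a, sorted_b).ratio()))


if __name__ == "__main__":
    # Both inputs normalize and sort to "nixon states united v".
    print(sorted_token_ratio("united states v. nixon", "Nixon v. United States"))
```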

In [11]:
Str1 = "The supreme court case of Nixon vs The United States"

Str2 = "Nixon v. United States"

Ratio = fuzz.ratio(Str1.lower(),Str2.lower())

Partial_Ratio = fuzz.partial_ratio(Str1.lower(),Str2.lower())

Token_Sort_Ratio = fuzz.token_sort_ratio(Str1,Str2)

Token_Set_Ratio = fuzz.token_set_ratio(Str1,Str2)

print(Ratio)

print(Partial_Ratio)

print(Token_Sort_Ratio)

print(Token_Set_Ratio)

57
77
58
95
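The text above does not spell out how the token set comparison works, so the following is only a sketch of one common description of it: split both strings into sets of tokens, build a string from the shared tokens and two strings from the shared tokens plus each side's leftovers, then take the best pairwise ratio. All helper names are hypothetical, and the library's actual implementation may differ.

```python
import re
from difflib import SequenceMatcher


def normalize(text: str) -> str:
    """Lowercase and replace runs of non-alphanumeric characters with a space."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


def plain_ratio(a: str, b: str) -> int:
    return int(round(100 * SequenceMatcher(None, a, b).ratio()))


def token_set_strings(a: str, b: str):
    """Return (common, common + only_a, common + only_b) as sorted token strings."""
    tokens_a = set(normalize(a).split())
    tokens_b = set(normalize(b).split())
    common = " ".join(sorted(tokens_a & tokens_b))
    only_a = " ".join(sorted(tokens_a - tokens_b))
    only_b = " ".join(sorted(tokens_b - tokens_a))
    combined_a = (common + " " + only_a).strip()
    combined_b = (common + " " + only_b).strip()
    return common, combined_a, combined_b


def token_set_ratio_sketch(a: str, b: str) -> int:
    """Best ratio among the three pairings of the strings built above."""
    common, combined_a, combined_b = token_set_strings(a, b)
    return max(
        plain_ratio(common, combined_a),
        plain_ratio(common, combined_b),
        plain_ratio(combined_a, combined_b),
    )


if __name__ == "__main__":
    s1 = "The supreme court case of Nixon vs The United States"
    s2 = "Nixon v. United States"
    print(token_set_strings(s1, s2)[0])  # shared tokens only
    print(token_set_ratio_sketch(s1, s2))
```

Because the shared tokens "nixon states united" dominate the shorter string, the extra words in the longer string barely lower the score, which is why this scorer copes with strings of very different lengths.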


Finally, the fuzzywuzzy package has a module called process that lets you find the string with the highest similarity in a list of strings. You can see how this works below:

In [12]:
from fuzzywuzzy import process

str2Match = "apple inc"

strOptions = ["Apple Inc.","apple park","apple incorporated","iphone"]

Ratios = process.extract(str2Match,strOptions)

print(Ratios)

# You can also select the string with the highest matching percentage

highest = process.extractOne(str2Match,strOptions)

print(highest)

[('Apple Inc.', 100), ('apple incorporated', 90), ('apple park', 67), ('iphone', 30)]

('Apple Inc.', 100)
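Conceptually, this kind of lookup scores every candidate against the query and sorts the results. The sketch below illustrates that idea with a plain lowercase ratio as a stand-in scorer; `plain_score`, `rank_choices`, and `best_choice` are hypothetical names. Since the process module uses a different default scorer, the numbers here will not match the output above exactly.

```python
from difflib import SequenceMatcher


def plain_score(query: str, choice: str) -> int:
    """Stand-in scorer: case-insensitive 0-100 ratio (not the library default)."""
    matcher = SequenceMatcher(None, query.lower(), choice.lower())
    return int(round(100 * matcher.ratio()))


def rank_choices(query: str, choices, limit: int = 5):
    """Score every choice and return the top `limit` as (choice, score) pairs."""
    scored = [(choice, plain_score(query, choice)) for choice in choices]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:limit]


def best_choice(query: str, choices):
    """Return the single highest-scoring (choice, score) pair, or None."""
    ranked = rank_choices(query, choices, limit=1)
    return ranked[0] if ranked else None


if __name__ == "__main__":
    options = ["Apple Inc.", "apple park", "apple incorporated", "iphone"]
    print(rank_choices("apple inc", options))
    print(best_choice("apple inc", options))
```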
