# FuzzyWuzzy Tutorial

## About FuzzyWuzzy

FuzzyWuzzy is a Python library for fuzzy string matching: it scores how similar two strings are. One place this is useful is comparing names across different datasets. Some people go by their middle name, or by "Mike" instead of "Michael". With FuzzyWuzzy, we can compute a similarity score that accounts for either case.

## Tutorial

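### Part 1 - Installing and importing
See the code cell below for an example import. It's normal to receive a warning on import (it appears when the optional Levenshtein package is not installed). The warning only concerns performance, so it can be safely ignored.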

In [22]:
#!pip install fuzzywuzzy levenshtein
from fuzzywuzzy import fuzz

### Part 2 - Using Fuzz

The `fuzz` module of FuzzyWuzzy is useful for simple string comparison. It provides several scoring functions, each suited to differently formatted strings. Every function returns an integer score from 0 (no similarity) to 100 (identical).


#### Simple Ratio
Simple Ratio uses the Levenshtein distance to score the similarity between the two strings that are passed in. The comparison is case-sensitive and includes punctuation.

https://en.wikipedia.org/wiki/Levenshtein_distance

In [15]:
fuzz.ratio("Greg!", "gregory")

50
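
The score of 50 can be reproduced by hand. The sketch below is an approximation that assumes FuzzyWuzzy's pure-Python fallback, where the ratio is `2 * M / T` as computed by the standard library's `difflib.SequenceMatcher` (`M` is the number of matched characters, `T` the total length of both strings). `simple_ratio_sketch` is a hypothetical helper name, not part of FuzzyWuzzy.

```python
from difflib import SequenceMatcher


def simple_ratio_sketch(s1: str, s2: str) -> int:
    """Approximate fuzz.ratio: 2 * M / T, scaled to 0-100 and rounded."""
    matcher = SequenceMatcher(None, s1, s2)
    return int(round(100 * matcher.ratio()))


# The comparison is case-sensitive, so "G" and "g" do not match.
# Only "reg" is shared: M = 3, T = 5 + 7 = 12, and 2 * 3 / 12 = 0.5.
print(simple_ratio_sketch("Greg!", "gregory"))  # 50
```

The capital "G" and the "!" both count against the score, which is what the next two ratios address.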

#### Token Sort Ratio

Token Sort Ratio lowercases each string, removes punctuation, and sorts the tokens (words) alphabetically before comparing. This filters out noise in the data, since often we care only about the name itself, and it makes the score insensitive to word order.

In [18]:
fuzz.token_sort_ratio("Greg!", "gregory")

73
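
To make the preprocessing concrete, here is a minimal sketch using only the standard library. The helper names (`preprocess_sketch`, `token_sort_ratio_sketch`) are hypothetical, and the ASCII-only regular expression is a simplifying assumption, so treat this as an illustration rather than FuzzyWuzzy's exact implementation.

```python
import re
from difflib import SequenceMatcher


def preprocess_sketch(s: str) -> str:
    """Lowercase, then replace runs of non-alphanumeric characters with a space."""
    # Assumption: ASCII letters and digits only; the real library is Unicode-aware.
    return re.sub(r"[^a-z0-9]+", " ", s.lower()).strip()


def token_sort_ratio_sketch(s1: str, s2: str) -> int:
    """Sort each string's tokens alphabetically, then compare the rejoined strings."""
    sorted1 = " ".join(sorted(preprocess_sketch(s1).split()))
    sorted2 = " ".join(sorted(preprocess_sketch(s2).split()))
    return int(round(100 * SequenceMatcher(None, sorted1, sorted2).ratio()))


# "Greg!" becomes "greg", which shares 4 of 4 + 7 = 11 characters with "gregory".
print(token_sort_ratio_sketch("Greg!", "gregory"))  # 73

# Word order and punctuation no longer matter.
print(token_sort_ratio_sketch("Santos, George", "george santos"))  # 100
```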


#### Token Set Ratio

Token Set Ratio is useful when somebody goes by a middle name. In addition to the lowercasing and punctuation filtering done by Token Sort Ratio, it tokenizes each string (splits it into words) and compares the resulting sets of tokens. If the tokens of one string are a subset of the tokens of the other, the score is 100.

You can see that it performs better than token sort ratio in the following example:

In [19]:
s1 = "George Santos"
s2 = "George Anthony Devolder Santos"

print(fuzz.token_set_ratio(s1, s2))


100


In [20]:
print(fuzz.token_sort_ratio(s1, s2))

60
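
The subset behaviour can be sketched as follows: build one string from the shared tokens and two strings that each append one side's leftover tokens, then take the best pairwise score. All helper names here are hypothetical, the tokenizer is a simplified ASCII-only assumption, and edge cases such as empty strings are not handled the way the library handles them.

```python
import re
from difflib import SequenceMatcher


def ratio_sketch(a: str, b: str) -> int:
    """Plain similarity ratio scaled to 0-100."""
    return int(round(100 * SequenceMatcher(None, a, b).ratio()))


def tokens_sketch(s: str) -> set:
    """Lowercase, drop punctuation, and split into a set of words."""
    return set(re.sub(r"[^a-z0-9]+", " ", s.lower()).split())


def token_set_ratio_sketch(s1: str, s2: str) -> int:
    t1, t2 = tokens_sketch(s1), tokens_sketch(s2)
    shared = " ".join(sorted(t1 & t2))   # tokens present in both strings
    only1 = " ".join(sorted(t1 - t2))    # tokens unique to s1
    only2 = " ".join(sorted(t2 - t1))    # tokens unique to s2
    combined1 = f"{shared} {only1}".strip()
    combined2 = f"{shared} {only2}".strip()
    # If one token set is a subset of the other, `shared` equals one of the
    # combined strings and that comparison scores 100.
    return max(
        ratio_sketch(shared, combined1),
        ratio_sketch(shared, combined2),
        ratio_sketch(combined1, combined2),
    )


print(token_set_ratio_sketch("George Santos", "George Anthony Devolder Santos"))  # 100
```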


### Part 3 - Using Process

The `process` module can be used to extract the closest match to a query from a list of strings.

In [31]:
from fuzzywuzzy import process
list_of_strings = ["Gregory Zavalnitskiy", "Ben Ramsey", "Thao Nguyen", "Vivian Pavlica", "Okoniewski, Johnny"]

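`process.extract` returns the best matches, sorted from highest to lowest score. It takes in a string and a list of choices, and it returns a list of tuples, each holding a match and its similarity score. By default the score comes from `fuzz.WRatio`, a weighted combination of several ratios, rather than from Token Set Ratio alone. The number of results can be set with the `limit` keyword argument, which defaults to 5.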

In [34]:
process.extract("Viv", list_of_strings, limit=3)

[('Vivian Pavlica', 90),
 ('Gregory Zavalnitskiy', 30),
 ('Okoniewski, Johnny', 30)]
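
Conceptually, extraction scores every choice against the query, sorts by score, and keeps the top `limit` results. The sketch below illustrates that idea with the standard library only. `extract_sketch` and `score_sketch` are hypothetical names, and the scorer is a plain case-insensitive ratio rather than the library's default, so the scores will not match the output above.

```python
from difflib import SequenceMatcher


def score_sketch(query: str, choice: str) -> int:
    """Stand-in scorer: case-insensitive similarity ratio, scaled to 0-100."""
    matcher = SequenceMatcher(None, query.lower(), choice.lower())
    return int(round(100 * matcher.ratio()))


def extract_sketch(query, choices, scorer=score_sketch, limit=5):
    """Score each choice, sort best-first, and keep at most `limit` results."""
    scored = [(choice, scorer(query, choice)) for choice in choices]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:limit]


names = ["Gregory Zavalnitskiy", "Ben Ramsey", "Thao Nguyen",
         "Vivian Pavlica", "Okoniewski, Johnny"]

top3 = extract_sketch("Viv", names, limit=3)
print(top3[0][0])  # Vivian Pavlica
```

Passing a different function as `scorer` changes how the candidates are ranked without touching the extraction logic.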

`process.extractOne` extracts only one match. It takes in a string and a list of potential matches, and it returns the closest match as a tuple of name and score. In practice, this is the first element of the list that `process.extract` returns when `limit=1`.

In [35]:
process.extractOne("Viv", list_of_strings)

('Vivian Pavlica', 90)

In [29]:
process.extractOne("Greg", list_of_strings)

('Gregory Zavalnitskiy', 90)