# Part 1. FuzzyFinder in 10 lines of code

#### Stolen from 'Brain Spill' - https://blog.amjith.com/fuzzyfinder-in-10-lines-of-python

### Intro:
FuzzyFinder is a popular feature available in decent editors to open files. The idea is to start typing partial strings from the full path and the list of suggestions will be narrowed down to match the desired file. 

### Problem Statement:
We have a collection of strings (filenames). We're trying to filter down that collection based on user input. The user input can be partial strings from the filename. Let's walk this through with an example. Here is a collection of filenames:

    collection = ['django_migrations.py',
                'django_admin_log.py',
                'main_generator.py',
                'migrations.py',
                'api_user.doc',
                'user_group.doc',
                'accounts.txt',
                ]
                
When the user types 'djm' we are supposed to match 'django_migrations.py' and 'django_admin_log.py'. The simplest route to achieve this is to use regular expressions. 

### Solutions:
#### Naive Regex Matching:
Convert 'djm' into 'd.\*j.\*m' and try to match this regex against every item in the list. Items that match are the possible candidates.

NOTE: We use regex.search instead of regex.match, because match only succeeds if the pattern matches from the BEGINNING of the string. Both return a match object (or None when nothing matches). Here are the main methods of that match object:

- group(): Return the string matched by the RE
- start(): Return the starting position of the match
- end(): Return the ending position of the match
- span(): Return a tuple containing the (start, end) positions of the match
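
As a quick check of the note above, here is a small sketch (standard re module only, with a filename taken from the collection) showing search succeeding where match returns None, along with the match-object methods just listed:

```python
import re

# Build the fuzzy pattern the same way the text describes: 'mig' -> 'm.*i.*g'
pattern = re.compile('.*'.join('mig'))
filename = 'django_migrations.py'

anchored = pattern.match(filename)   # only tries position 0, so this is None
found = pattern.search(filename)     # scans the whole string for a match

print(anchored)
print(found.group(), found.start(), found.end(), found.span())
```

Because 'django_migrations.py' does not start with 'm', only search finds the 'mig' inside it.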

In [146]:
import re

collection = ['django_migrations.py',
            'django_admin_log.py',
            'main_generator.py',
            'migrations.py',
            'api_user.doc',
            'user_group.doc',
            'accounts.txt',
            ]

def FuzzyFinder(search_term, collection):
    """This function essentially returns the strings that have the specified characters in a sequential order"""
    suggestions = []
    pattern = '.*'.join(search_term)
    compiled_re = re.compile(pattern)
    for item in collection:
        match = compiled_re.search(item)
        if match:
            suggestions.append(item)
    return suggestions
    
FuzzyFinder('mig',collection)

['django_migrations.py',
 'django_admin_log.py',
 'main_generator.py',
 'migrations.py']
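
One caveat the function above does not handle: if the search term contains regex metacharacters (for example '.' or '('), '.*'.join(search_term) produces a pattern that means something else or fails to compile. Below is a minimal sketch of one possible guard, assuming every typed character should be treated literally. build_pattern is a hypothetical helper name, not part of the original post.

```python
import re

def build_pattern(search_term):
    """Join the characters with '.*' after escaping each one (hypothetical helper)."""
    return '.*'.join(re.escape(ch) for ch in search_term)

# The '.' typed by the user now matches only a literal dot.
safe = re.compile(build_pattern('g.p'))
print(bool(safe.search('main_generator.py')))  # has g, then '.', then p
print(bool(safe.search('user_group.doc')))     # no p after the dot

# An unbalanced '(' would raise re.error without escaping.
paren = re.compile(build_pattern('('))
print(bool(paren.search('notes(old).txt')))
```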

### Ranking based on match position:
We can rank the results based on the position of the **first occurrence of the matching character**. We store each suggestion as a tuple where the first item is the position of the match and the second item is the matching filename. Then we use a list comprehension to iterate over the sorted list of tuples and extract just the second item, which is the filename we're interested in.

NOTE: When this list is sorted using the 'sorted' function, Python sorts the tuples by their first item and uses the second item as a tie breaker.

In [147]:
#sorting tuples example
sorted([(5,'bTEST'),(5,'aTEST'),(5,'1TEST'),(7,'TEST'),(3,'TEST'),(1,'TEST')])

[(1, 'TEST'),
 (3, 'TEST'),
 (5, '1TEST'),
 (5, 'aTEST'),
 (5, 'bTEST'),
 (7, 'TEST')]

In [148]:
def FuzzyFinder(search_term, collection):
    """This function essentially returns the strings in a list that have the specified characters in a sequential order.
    It ranks the strings based on the position of the first occurrence of the matching character"""
    suggestions = []
    pattern = '.*'.join(search_term) # Converts 'djm' to 'd.*j.*m'
    compiled_re = re.compile(pattern) # Compiles a regex.
    for item in collection:
        match = compiled_re.search(item) # Checks if the current item matches the regex.
        if match:
            suggestions.append((match.start(), item))
    return [x for _,x in sorted(suggestions)]
    
FuzzyFinder('mig',collection)

['main_generator.py',
 'migrations.py',
 'django_migrations.py',
 'django_admin_log.py']

### Ranking based on compact match:

The last example got us close to the end result, but as shown it's not perfect. Our search for 'mig' returned 'main_generator.py' as the first suggestion when we wanted 'migrations.py'. Both matches start at position 0, so ranking on the position of the first character alone left the order to the alphabetical tie breaker.

Using regex.match() would not solve this, since both names match from the first character, but there is another way: compact match! When a user starts typing a partial string, they tend to continue typing consecutive letters in an effort to find the exact match. When someone types 'mig' they are looking for 'migrations' or 'django_migrations', NOT main_generator.py. **The key here is to find the most COMPACT MATCH for the user input.**

This is pretty easily implemented in Python:
- When we match a string against a regular expression, the matched text is available from match.group(). 
- We use the length of the matched text as our **PRIMARY RANK** and use the starting position as our secondary rank. 
    - to do this we add len(match.group()) as the first item in the tuple, match.start() as the second, and the filename as the third
    - Python will sort the list first by the primary key, then by the secondary, and then by the third key (the filename) as the tie breaker
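
To make the ranking keys concrete, here is a small sketch that builds the three-item tuples described above for the two filenames that tied on start position. It uses only the standard re module and the same greedy pattern as the surrounding cells.

```python
import re

pattern = re.compile('.*'.join('mig'))  # 'm.*i.*g'

keys = []
for name in ['main_generator.py', 'migrations.py']:
    match = pattern.search(name)
    # (primary rank: match length, secondary rank: start, tie breaker: name)
    keys.append((len(match.group()), match.start(), name))

for key in sorted(keys):
    print(key)
```

The match in 'main_generator.py' spans 'main_g' (6 characters) while the match in 'migrations.py' spans just 'mig' (3 characters), so the shorter match sorts first.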

In [300]:
collection = ['django_migrations.py',
            'django_admin_log.py',
            'main_generator.py',
            'migrations.py',
            'api_user.doc',
            'user_group.doc',
            'accounts.txt',
            ]

def FuzzyFinder(search_term, collection):
    """This function essentially returns the strings in a list that have the specified characters in a sequential order.
    It ranks the strings based on the compact match lengths and position of the first occurrence of the matching character"""
    suggestions = []
    pattern = '.*'.join(search_term) # Converts 'djm' to 'd.*j.*m'
    compiled_re = re.compile(pattern) # Compiles a regex.
    for item in collection:
        match = compiled_re.search(item) # Checks if the current item matches the regex.
        if match:
            suggestions.append((len(match.group()), match.start(), item))
    return [x for _,_,x in sorted(suggestions)]
    
FuzzyFinder('mig',collection)

['migrations.py',
 'django_migrations.py',
 'main_generator.py',
 'django_admin_log.py']

### Non-Greedy Match

Not bad so far!!! One more subtle adjustment we need to make:

Consider these two items in the collection: 
        
        ['api_user', 'user_group']. 

When you enter 'user' the ideal suggestion list would be ['user_group','api_user'], but here's what we actually get:

In [301]:
FuzzyFinder('user', collection)

['api_user.doc', 'user_group.doc']

The list is reversed!!! hmmm... Digging into this more, we notice that 'user_group' actually contains two 'r' characters... so what u.\*s.\*e.\*r actually matches is **user_gr** instead of just 'user'. That match is 7 characters long versus 4 for 'api_user', which is why 'user_group' comes second in our suggestion list. 

Fortunately, there is an easy fix for this. **We use the NON-GREEDY version of the regex notation (.\*? instead of .\*)**
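
A small sketch of the difference on the problematic filename, using only the standard re module:

```python
import re

name = 'user_group.doc'
greedy = re.search('.*'.join('user'), name)   # pattern: u.*s.*e.*r
lazy = re.search('.*?'.join('user'), name)    # pattern: u.*?s.*?e.*?r

print(greedy.group(), len(greedy.group()))
print(lazy.group(), len(lazy.group()))
```

The greedy version stretches to the last 'r' it can reach, while the lazy version stops at the first one, which gives the shorter match length we want to rank on.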

In [151]:
collection = ['django_migrations.py',
            'django_admin_log.py',
            'main_generator.py',
            'migrations.py',
            'api_user.doc',
            'user_group.doc',
            'accounts.txt',
            ]

def FuzzyFinder(search_term, collection):
    """This function essentially returns the strings in a list that have the specified characters in a sequential order.
    It ranks the strings based on the compact match lengths and position of the first occurrence of the matching character"""
    suggestions = []
    pattern = '.*?'.join(search_term) # Converts 'djm' to 'd.*?j.*?m'
    compiled_re = re.compile(pattern) # Compiles a regex.
    for item in collection:
        match = compiled_re.search(item) # Checks if the current item matches the regex.
        if match:
            suggestions.append((len(match.group()), match.start(), item))
    return [x for _,_,x in sorted(suggestions)]
    
FuzzyFinder('user',collection)

['user_group.doc', 'api_user.doc']

### Sidenote on the Greedy vs Lazy Quantifiers
http://www.rexegg.com/regex-quantifiers.html

#### The Greedy Trap
The classic trap with greedy quantifiers is that they may match more than you expect. Suppose you want to match tokens that begin with {START} and end with {END}. You may try this pattern:

    {START}.*{END}

However, you will find that this pattern matches this entire string from start to finish:

    {START} Mary {END} had a {START} little lamb {END} 

…whereas we wanted to find two matches:
    
    {START} Mary {END}
    {START} little lamb {END}
    
#### Lazy Quantifier Solution
The easiest way is to make the dot-star lazy by adding a ? question mark:
    
    {START}.*?{END}

The lazy .\*? quantifier guarantees that **the quantified dot only matches as many characters as needed for the rest of the pattern to succeed.** Therefore, the pattern only matches one {START}…{END} item at a time, which is what we want. 

In [302]:
#greedy search
string = "{START} Mary {END} had a {START} little lamb {END} "
pattern = r"{START}.*{END}"
print(re.search(pattern,string).group())
print(re.findall(pattern,string))

{START} Mary {END} had a {START} little lamb {END}
['{START} Mary {END} had a {START} little lamb {END}']


In [153]:
#lazy quantifier search
string = "{START} Mary {END} had a {START} little lamb {END} "
pattern = "{START}.*?{END}"
print(re.search(pattern,string).group())
print(re.findall(pattern,string))

{START} Mary {END}
['{START} Mary {END}', '{START} little lamb {END}']


#### Lazy Quantifiers are Expensive
It's important to understand how the lazy .\*? works in this example because there is a cost to using lazy quantifiers. 

When it first encounters .\*? the engine starts out by matching the minimum number of characters allowed by the quantifier—which is zero. The engine then advances in the pattern and tries the next token (which is {) against the M in Mary. This fails, so the engine backtracks and allows the .\*? to expand its match by one item, so that it matches the M. Once again, the engine advances in the pattern. It now tries the { against the a in Mary. This fails, so the engine backtracks and allows the .\*? to expand and match the a. The process then repeats itself—the engine advances, fails, backtracks, allows the lazy .\*? to expand its match by one item, advances, fails and so on. 

As you can see, for each character matched by the .\*?, the engine has to backtrack. From a computing standpoint, this process of matching one item, advancing, failing, backtracking, expanding is "expensive". 

On a modern processor, for simple patterns, this will likely not matter. But if you want to craft efficient regular expressions, you must pay attention to use lazy quantifiers only when they are needed.

# Part 2. Cleansing text
#### http://blog.keyrus.co.uk/fuzzy_matching_101_part_i.html

#### Stripping whitespace and unwanted characters

In [154]:
# Whitespace stripping
s = '   hello world    \n '
print(s.strip()) #strips whitespace on outside of string
print(s.lstrip()) #strips whitespace on leftside of string
print(s.rstrip()) #strips whitespace on rightside of string

hello world
hello world    
 
   hello world


In [155]:
# Character stripping
t = '-----hell-o====='
print(t.lstrip('-')) #strips the specified characters from the left
print(t.lstrip('=')) #this will do nothing (no '=' on the left side)
print(t.strip('-=')) #strips all listed characters from both ends

hell-o=====
-----hell-o=====
hell-o


In [156]:
# lower casing
t = 'HELLO World'
print(t.lower())

hello world


#### Replacing unwanted characters with maketrans()

The str.maketrans() method takes up to 3 arguments:

- x - If only one argument is supplied, it must be a dictionary. The dictionary should map single characters (or their Unicode code points, e.g. 97 for 'a') to a translation: a string, a code point, or None.
- y - If two arguments are passed, they must be two strings of equal length. Each character in the first string is replaced by the character at the same index in the second string.
- z - If three arguments are passed, each character in the third argument is mapped to None (i.e. deleted).
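
The one-argument (dictionary) form can be sketched like this. The sample string and the mapping are made up purely for illustration:

```python
# Dictionary form: keys are single characters, values are a replacement
# string, a code point, or None (None deletes the character).
table = str.maketrans({'a': '4', 'e': '3', '-': None})
cleaned = 'fuzzy-matched-text'.translate(table)
print(table)
print(cleaned)
```

Note that maketrans converts the character keys to their code points, which is the form translate() expects.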

In [157]:
#create unicode translation dictionary
str.maketrans('abcdefg','1234567')

{97: 49, 98: 50, 99: 51, 100: 52, 101: 53, 102: 54, 103: 55}

In [158]:
# replacing vowels with digits (two-argument form)
s = 'This is Mikes awesome string'
translation = str.maketrans('aeiou','12345')
s.translate(translation)

'Th3s 3s M3k2s 1w2s4m2 str3ng'

In [159]:
#use translation to remove punctuation (replace all punctuation with none... use three args)
s = "This is' a s&tring^ with *alot( of >random !@#punctu*)(@ation)"
translation = str.maketrans("","","!@@#$%>^'&*()")
s.translate(translation)

'This is a string with alot of random punctuation'
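
Rather than typing the punctuation characters by hand, one option is the standard library constant string.punctuation, which holds the ASCII punctuation characters. A minimal sketch, assuming ASCII punctuation is all that needs removing:

```python
import string

# Three-argument form: delete every ASCII punctuation character.
strip_punct = str.maketrans('', '', string.punctuation)

s = "This is' a s&tring^ with *alot( of >random !@#punctu*)(@ation)"
result = s.translate(strip_punct)
print(result)
```

Non-ASCII punctuation (curly quotes, for example) is not covered by string.punctuation and would need to be added to the third argument separately.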

#### Using the str.replace() function

The replace() method returns a copy of the string in which the occurrences of old have been replaced with new, optionally limiting the number of replacements to count. The replace method has 3 parameters:

- old − The substring to be replaced.
- new − The substring that replaces old.
- count − If this optional argument is given, only the first count occurrences are replaced.

In [160]:
string = "MikeMikeMikeMikeMike"
print(string.replace("Mike", "Hailey"))
print(string.replace("Mike", "Hailey", 3))

HaileyHaileyHaileyHaileyHailey
HaileyHaileyHaileyMikeMike


# Part 3. fuzzywuzzy

Fuzzy String Matching, also called Approximate String Matching, is the process of finding strings that approximately match a given pattern.
The closeness of a match is often measured in terms of edit distance, which is the number of primitive operations necessary to convert the string into an exact match.

Primitive operations are usually: 
- insertion (to insert a new character at a given position), 
- deletion (to delete a particular character) and 
- substitution (to replace a character with a new one).
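
To make edit distance concrete, here is a minimal sketch of the classic dynamic-programming Levenshtein distance, counting exactly the three operations listed above. It is written for illustration only and is not the code fuzzywuzzy itself runs; the function name levenshtein is our own.

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions and substitutions to turn a into b."""
    # prev[j] holds the distance between the processed prefix of a and b[:j]
    prev = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into '' takes i deletions
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution (or match)
        prev = curr
    return prev[-1]

print(levenshtein("New York Mets", "New York Meats"))
print(levenshtein("kitten", "sitting"))
```

"New York Mets" becomes "New York Meats" with a single insertion, so the first distance is 1.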

#### Fuzzy Wuzzy provides 4 types of fuzzy logic based matching, using LEVENSHTEIN DISTANCE to determine the similarity between two strings. This metric mathematically determines similarity by looking at the minimum number of edits required for two strings to converge / be equal.

To quickly summarise the matching methods offered, there is:

- Simple Ratio - Pure Levenshtein Distance based matching
- Partial Ratio - Matches based on best substrings
- Token Sort Ratio - Tokenises strings and sorts them alphabetically before matching
- Token Set Ratio - Tokenise and compare intersection and remainder

See this post for more info: http://chairnerd.seatgeek.com/fuzzywuzzy-fuzzy-string-matching-in-python/

“Fuzzywuzzy” needs only the **difflib python library** to run; the optional python-Levenshtein package can be installed to speed it up.

In [1]:
from difflib import SequenceMatcher
from fuzzywuzzy import fuzz

s1 = "New York Mets"
s2 = "New York Meats"

def ratio(s1,s2):
    m = SequenceMatcher(None, s1, s2)
    return(m.ratio())
print(ratio(s1,s2))

#fuzzy match equivalent
print(fuzz.ratio( "New York Mets", "New York Meats"))

0.9629629629629629
96


Great, so we’re done! Not quite. It turns out that the standard “string closeness” measurement **works fine for very short strings (such as a single word) and very long strings (such as a full book), but not so much for 3-10 word labels.** The naive approach is far too sensitive to minor differences in word order, missing or extra words, and other such issues.

For example, in the example below, the first two strings are clearly referring to the same team, but the second two are clearly referring to different ones. Yet, the score of the “bad” match is higher than the “right” one.

In [162]:
print(fuzz.ratio("YANKEES", "NEW YORK YANKEES"))
print(fuzz.ratio("NEW YORK METS", "NEW YORK YANKEES"))

61
76
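
Why does the "right" match score so low? difflib documents ratio() as 2.0*M / T, where M is the number of matched characters and T is the combined length of both strings. The sketch below recomputes those pieces with difflib only; explain_ratio is a hypothetical helper name used for illustration.

```python
from difflib import SequenceMatcher

def explain_ratio(a, b):
    """Return (M, T, ratio) for two strings, using difflib only."""
    matcher = SequenceMatcher(None, a, b)
    # Sum the sizes of all matching blocks (the final dummy block has size 0)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    total = len(a) + len(b)
    return matched, total, 2.0 * matched / total

print(explain_ratio("YANKEES", "NEW YORK YANKEES"))
print(explain_ratio("NEW YORK METS", "NEW YORK YANKEES"))
```

In the first pair all 7 characters of "YANKEES" are matched, but the 9 extra characters of "NEW YORK " still count toward T, which drags the score down to 14/23.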


## fuzz.partial_ratio

We use a heuristic called “best partial” when two strings are of noticeably different lengths (such as the case above). **If the shorter string is length m, and the longer string is length n, we’re basically interested in the score of the best matching length-m substring.** Partial_ratio effectively iterates through strings to find best match like so:        
        
        fuzz.ratio("YANKEES", "NEW YOR") ⇒ 14
        fuzz.ratio("YANKEES", "EW YORK") ⇒ 28
        fuzz.ratio("YANKEES", "W YORK ") ⇒ 28
        fuzz.ratio("YANKEES", " YORK Y") ⇒ 28
        ...
        fuzz.ratio("YANKEES", "YANKEES") ⇒ 100

##### NOTE: This basically assumes the 'short' string is the correct string, or the string we are searching for.  So this really should be called 'Substring match'
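
The window-by-window idea can be sketched as a brute-force loop over every length-m slice of the longer string. This is a simplified illustration built on difflib, not fuzzywuzzy's actual implementation (which uses matching blocks to pick candidate windows); best_window_ratio is a hypothetical name.

```python
from difflib import SequenceMatcher

def best_window_ratio(s1, s2):
    """Best ratio between the shorter string and any equal-length slice of the longer one."""
    shorter, longer = (s1, s2) if len(s1) <= len(s2) else (s2, s1)
    m = len(shorter)
    best = 0.0
    # Slide a window of length m across the longer string
    for start in range(len(longer) - m + 1):
        window = longer[start:start + m]
        score = SequenceMatcher(None, shorter, window).ratio()
        best = max(best, score)
    return best

print(best_window_ratio("YANKEES", "NEW YORK YANKEES"))
```

The last window is exactly "YANKEES", so the best score for this pair is 1.0.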

In [163]:
print(fuzz.partial_ratio("YANKEES", "NEW YORK YANKEES"))
print(fuzz.partial_ratio("YANKEEEES", "NEW YORK YANKEES"))
print(fuzz.partial_ratio("NEW YORK METS", "NEW YORK YANKEES"))

100
78
69


In [3]:
fuzz.ratio("YANKEEEES", "K YANKEES")

78

In [2]:
a = "Air Canada"
b = "The plane company Air Canada"
fuzz.partial_ratio(a,b)

100

In [164]:
#this should produce a less-than-perfect match because the shorter string does not perfectly align with the longer string
a = "NY YANKEES"
b = "NEW YORK YANKEES"
fuzz.partial_ratio(b,a)

80

##### Great! This gives us the answer we want. Let's now break apart the formula using https://github.com/seatgeek/fuzzywuzzy/tree/master/fuzzywuzzy

In [181]:
##################################### 
##### make string types UNICODE #####
#####################################

def make_type_consistent(s1, s2):
    """Return both inputs as str. Python 3 has a single text type (the Python 2
    'unicode' name no longer exists), so anything else is converted with str()."""
    if isinstance(s1, str) and isinstance(s2, str):
        return s1, s2

    return str(s1), str(s2)

#####################################
####### partial_ratio function ######
#####################################

def partial_ratio(s1, s2):
    """Return the ratio of the most similar substring
    as a number between 0.0 and 1.0."""
    s1, s2 = make_type_consistent(s1, s2) # make sure both are text strings
    
    if len(s1) <= len(s2):
        shorter = s1
        longer = s2
    else:
        shorter = s2
        longer=s1
    
    #create a sequence matcher using difflib
    m = SequenceMatcher(None, shorter, longer)
    
    #create match blocks
    blocks = m.get_matching_blocks()
    #print(blocks)
    # each block represents a sequence of matching characters in a string
    # of the form (idx_1, idx_2, len)
    # the best partial match will block align with at least one of those blocks
    #   e.g. shorter = "abcd", longer = XXXbcdeEEE
    #   block = (1,3,3) #shorter starts at 1, longer starts at 3, and block is 3 char long
    #   best score === ratio("abcd", "Xbcd")

    scores = []
    for block in blocks:
        long_start = block[1] - block[0] if (block[1] - block[0])>0 else 0
        long_end = long_start + len(shorter)
        long_substr = longer[long_start:long_end]
        
        m2 = SequenceMatcher(None, shorter, long_substr)
        r = m2.ratio()
        if r > 0.995:
            return 1.0
        else:
            scores.append(r)
    return round(max(scores),3)

print(partial_ratio("YANKEES", "NEW YORK YANKEES"))
print(partial_ratio("NEW YORK METS", "NEW YORK YANKEES"))
print(partial_ratio("METS", "NEW YORK METS"))

1.0
0.692
1.0


## fuzz.token_sort_ratio()

Sometimes strings may match, but they are in different order!  Here is an extremely common pattern, where one seller constructs strings as “HOME_TEAM vs AWAY_TEAM” and another constructs strings as “AWAY_TEAM vs HOME_TEAM”. See below:

In [166]:
print(fuzz.ratio("New York Mets vs Atlanta Braves", "Atlanta Braves vs New York Mets"))
print(fuzz.partial_ratio("New York Mets vs Atlanta Braves", "Atlanta Braves vs New York Mets"))

45
45


This is obviously not good enough. We use 'token_sort' and 'token_set' approaches to deal with this issue. 

#### Token Sort:
The token_sort approach involves tokenizing the string in question, sorting the tokens alphabetically, and then joining them back into a string. 

    "new york mets vs atlanta braves"   →→  "atlanta braves mets new vs york"

In [190]:
fuzz.token_sort_ratio("New York Mets vs Atlanta Braves", "Atlanta Braves vs New York Mets")

100

In [5]:
a,b = "Bank of Nova Scotia", "Scotia Bank"
print(fuzz.token_sort_ratio(a,b))
print(fuzz.partial_ratio(a,b))

73
55


##### Perfect! Let's break apart the function now. We need a couple of steps:

In [241]:
import re
import string
def StringProcessor(string_, rejoin=True):
    #clean string and replace non-alphanumeric characters with spaces
    regex = re.compile(r"\W")
    clean_string = regex.sub(" ", string_.strip().lower())
    
    #return tokenized string
    clean_sorted = sorted(clean_string.split())
    
    #join tokens into single string
    if rejoin:
        return " ".join(clean_sorted).strip()
    else:
        return clean_sorted

print(StringProcessor('     BTEsT1*&&&&ATEST2***          '))
print(StringProcessor('New York Mets* vs Atlanta Braves'))
print(StringProcessor('Atlanta Braves vs New York!! Mets!!!'))

atest2 btest1
atlanta braves mets new vs york
atlanta braves mets new vs york


In [279]:
def token_sort_ratio(s1, s2, partial=False):
    sorted1 = StringProcessor(s1)
    sorted2 = StringProcessor(s2)
    
    if partial:
        ratio_func = partial_ratio
    else:
        ratio_func = ratio
    
    t = "Mike's fuzzy score: "
    print(t, ratio_func(sorted1, sorted2))
    
token_sort_ratio("New York Mets vs Atlanta Braves", "Atlanta Braves vs New York Mets", partial=False)

print("\nFuzzyWuzzy fuzzy score: ",fuzz.token_sort_ratio("New York Mets vs Atlanta Braves", "Atlanta Braves vs New York Mets"))


Mike's fuzzy score:  1.0

FuzzyWuzzy fuzzy score:  100


#### Token Set:
The token_set approach is similar to token_sort, but a little more expensive. Here we tokenize both strings, **but instead of immediately sorting and comparing, we split the tokens into two groups:** 
- intersection and 
- remainder. 

We use those sets to build up a comparison string:


In [281]:
s1 = "mariners vs angels"
s2 = "los angeles angels of anaheim at seattle mariners"

In [284]:
#first try token sort ratio
token_sort_ratio(s1,s2, partial=False)
print("\nFuzzyWuzzy fuzzy score: ",fuzz.token_sort_ratio(s1,s2))

Mike's fuzzy score:  0.5074626865671642

FuzzyWuzzy fuzzy score:  51


In [285]:
#now try partial token sort ratio
token_sort_ratio(s1,s2, partial=True)
print("\nFuzzyWuzzy fuzzy score: ",fuzz.partial_token_sort_ratio(s1,s2))

Mike's fuzzy score:  0.722

FuzzyWuzzy fuzzy score:  72


##### Hmmm we should get a higher score because these are the same teams!!! 

So clearly token_sort is not sufficient here, because **the longer string has too many extra tokens that get interleaved with the sort... so we end up comparing:**

    t1 = "angels mariners vs"
    t2 = "anaheim angeles angels at los mariners of seattle"

The set method allows us to detect that 'angels' and 'mariners' are common to both strings, and separate those out (the set intersection). Now we construct and compare strings of the following form.

    t0 = [SORTED_INTERSECTION]
    t1 = [SORTED_INTERSECTION] + [SORTED_REST_OF_STRING1]
    t2 = [SORTED_INTERSECTION] + [SORTED_REST_OF_STRING2]
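
For the mariners/angels pair, the three comparison strings can be built with plain set operations. This is a small sketch that assumes the inputs are already lowercase and free of punctuation, so str.split() is enough for tokenizing:

```python
s1 = "mariners vs angels"
s2 = "los angeles angels of anaheim at seattle mariners"

tokens1 = set(s1.split())
tokens2 = set(s2.split())

common = sorted(tokens1 & tokens2)  # tokens present in both strings
t0 = " ".join(common)
t1 = " ".join(common + sorted(tokens1 - tokens2))
t2 = " ".join(common + sorted(tokens2 - tokens1))

print(repr(t0))
print(repr(t1))
print(repr(t2))
```

Because t0 is a prefix of both t1 and t2, comparing t0 against t1 scores highly whenever one string's tokens are mostly contained in the other.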

In [297]:
def token_set_ratio(s1,s2, partial=False):
    """Find all alphanumeric tokens in each string...
        - treat them as a set
        - construct two strings of the form:
            <sorted_intersection><sorted_remainder>
        - take ratios of those two strings"""
    #create token sets
    tokens1 = set(StringProcessor(s1, rejoin=False))
    tokens2 = set(StringProcessor(s2, rejoin=False))
    
    #parse intersection and differences
    intersection = sorted(tokens1.intersection(tokens2))
    diff1to2 = sorted(tokens1.difference(tokens2))
    diff2to1 = sorted(tokens2.difference(tokens1))
    
    joined_int = " ".join(intersection)
    joined_1to2 = " ".join(diff1to2)
    joined_2to1 = " ".join(diff2to1)
    
    t0 = joined_int.strip()
    t1 = (joined_int + " " + joined_1to2).strip()
    t2 = (joined_int + " " + joined_2to1).strip()
    
    if partial:
        ratio_func = partial_ratio
    else:
        ratio_func = ratio
    
    comps = [ratio_func(t0,t1),
             ratio_func(t0,t2),
             ratio_func(t1,t2)]
    t = "Mike's fuzzy score: "
    print(t, max(comps))

token_set_ratio('this is a new test','this is a different test', partial=False)    

Mike's fuzzy score:  0.875


#### Try running these again with the new token_set_ratio function!

In [290]:
s1 = "mariners vs angels"
s2 = "los angeles angels of anaheim at seattle mariners"

In [294]:
#token_set RATIO
token_set_ratio(s1,s2, partial=False)
print("\nFuzzyWuzzy fuzzy score: ",fuzz.token_set_ratio(s1,s2))

Mike's fuzzy score:  0.9090909090909091

FuzzyWuzzy fuzzy score:  91


In [298]:
#token_set PARTIAL RATIO
token_set_ratio(s1,s2, partial=True)
print("\nFuzzyWuzzy fuzzy score: ",fuzz.partial_token_set_ratio(s1,s2))

Mike's fuzzy score:  1.0

FuzzyWuzzy fuzzy score:  100


In [7]:
a,b = "Bank of Nova Scotia", "ScotiaBank"
print(fuzz.token_sort_ratio(a,b))
print(fuzz.partial_token_set_ratio(a,b))

41
75


### They match!!! WOOOT!