# Finding the Perfect Match

We often need to decide whether two differently written names refer to the same thing, and there are many ways to approach that kind of similarity problem.  This notebook shows how to use the Jaccard similarity measure to match a municipality name.

## Data Wrangling

Let's begin by pulling the geography index from CGR's data hub:

In [1]:
# Import python libraries we need to use
from sqlalchemy import create_engine
from sqlalchemy import text
import pandas as pd

# Connect to CGR's Data Hub
engine = create_engine('mysql+pymysql://dba:cgr1915@data.cgr.org/hub')
conn = engine.connect()

# Pull all the NYS records from the CGR Geography Index table
# Bind parameters are passed as a dict (keyword arguments are not accepted by SQLAlchemy 2.0)
results = conn.execute(text("SELECT * FROM CGR_GEOGRAPHY_INDEX WHERE NAME LIKE :string"), {"string": "%New York"})
haystack = results.fetchall()

# Let's see the first 10 records to get a sense of the data
print("('CGR_GEO_ID', 'NAME', 'TYPE')")
print("==============================")
for row in haystack[:10]:
    print(row)

('CGR_GEO_ID', 'NAME', 'TYPE')
==============================
('36', 'New York', 'State')
('36001', 'Albany County, New York', 'County')
('3600106211', 'Berne town, Albany County, New York', 'Town')
('3600106354', 'Bethlehem town, Albany County, New York', 'Town')
('3600116694', 'Coeymans town, Albany County, New York', 'Town')
('3600117343', 'Colonie town, Albany County, New York', 'Town')
('3600130532', 'Green Island town, Albany County, New York', 'Town')
('3600131104', 'Guilderland town, Albany County, New York', 'Town')
('3600140002', 'Knox town, Albany County, New York', 'Town')
('3600150672', 'New Scotland town, Albany County, New York', 'Town')


By design, each sub-county name in the geography index includes its county and state.  Now let's create a needle, the name we want to find in that haystack.  The New York Office of the State Comptroller (OSC) names municipalities differently.  Let's pull Fairport's name from the OSC data:

In [2]:
# Pull OSC data for Fairport
results = conn.execute(text("SELECT DISTINCT `MUNICIPAL_CODE`, `ENTITY_NAME`, `COUNTY` FROM `NY_OSC_DETAILED_ACCOUNT_LEVEL_DATA` WHERE ENTITY_NAME LIKE :string"), {"string": "%Fairport%"})
results = results.fetchall()
print(results)

[(260465001630, 'Village of Fairport', 'Monroe')]


So here's the problem.  OSC puts the municipality type before the name, while our geography index puts it after.  The OSC entity name includes neither the county nor "New York", and it contains the word "of", which the geography index omits.  We will build our needle by combining the OSC fields and adding the missing elements:

In [3]:
needle = results[0][1] + ', ' + results[0][2] + ' County, New York'
print(needle)

Village of Fairport, Monroe County, New York
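
If the same construction is needed for more than one OSC record, it could be wrapped in a small helper. The sketch below is only an illustration: `build_needle` is a hypothetical name, and it assumes each row has the `(MUNICIPAL_CODE, ENTITY_NAME, COUNTY)` shape returned by the query above and describes a sub-county entity.

```python
def build_needle(row, state="New York"):
    """Build a geography-index-style name from an OSC row.

    Assumes row is (MUNICIPAL_CODE, ENTITY_NAME, COUNTY), as in the query above.
    """
    _, entity_name, county = row
    return f"{entity_name}, {county} County, {state}"


# The Fairport row returned by the query above
print(build_needle((260465001630, 'Village of Fairport', 'Monroe')))
```

Names for counties or other entity types may need a different pattern, so treat this as a starting point rather than a general rule.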


## Jaccard Similarity
There is an excellent explanation of this method in [this blog post](http://dataaspirant.com/2015/04/11/five-most-popular-similarity-measures-implementation-in-python/).  The post explains:

>The Jaccard similarity measures the similarity between finite sample sets and is defined as the cardinality of the intersection of sets divided by the cardinality of the union of the sample sets. Suppose you want to find Jaccard similarity between two sets A and B; it is the ratio of cardinality of A ∩ B and A ∪ B.
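
In symbols, the quoted definition can be written as shown here. The two small sets in the worked line are made up for illustration and are not taken from the blog post.

```latex
J(A, B) = \frac{|A \cap B|}{|A \cup B|}, \qquad 0 \le J(A, B) \le 1

% Illustrative sets: A = \{1, 2, 3\}, \; B = \{2, 3, 4\}
A \cap B = \{2, 3\}, \qquad A \cup B = \{1, 2, 3, 4\}, \qquad
J(A, B) = \frac{2}{4} = 0.5
```

A score of 1 means the two sets are identical, and a score of 0 means they share no elements.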

Let's walk through this with an example.  Say we want to compare sets A and B.  Here's how we do it:

<img src="http://i0.wp.com/dataaspirant.com/wp-content/uploads/2015/04/jaccard_similariyt.png?resize=768%2C307"/>

<img src="http://i1.wp.com/dataaspirant.com/wp-content/uploads/2015/04/jaccaard2.png?resize=768%2C307"/>

<img src="http://i2.wp.com/dataaspirant.com/wp-content/uploads/2015/04/jaccaard3.png?w=700"/>

Translating this idea into Python code:

In [4]:
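def jaccard_similarity(a, b):
    # Convert both inputs to sets so word order and duplicates are ignored
    set_a, set_b = set(a), set(b)
    intersection_cardinality = len(set_a & set_b)
    union_cardinality = len(set_a | set_b)
    # Guard against dividing by zero when both inputs are empty
    return intersection_cardinality / float(union_cardinality) if union_cardinality else 0.0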
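
A quick check on small word lists can make the function's behavior concrete. The snippet below is a sketch that repeats the definition so it runs on its own; the words come from the Fairport names used in this notebook. It suggests that word order and repeated words do not affect the score, while one unmatched word lowers it.

```python
def jaccard_similarity(a, b):
    set_a, set_b = set(a), set(b)
    return len(set_a & set_b) / len(set_a | set_b)


osc = ['village', 'of', 'fairport']
index = ['fairport', 'village']

# Order does not matter: both arguments are converted to sets
print(jaccard_similarity(osc, ['fairport', 'of', 'village']))  # 1.0

# Repeated words count once
print(jaccard_similarity(index, ['fairport', 'fairport', 'village']))  # 1.0

# 'of' has no partner in the index words: 2 shared words out of 3 distinct words
print(jaccard_similarity(osc, index))
```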

So let's break each name into words and compare the resulting word sets with the Jaccard similarity measure:

In [5]:
# This will hold our findings
best_match = ''
best_match_score = -1
best_match_set = list()

# Strip out the commas and break it apart by the spaces.  Also put everything in lower case.
a = needle.replace(',','').lower().split(' ')
print('"'+needle+'" broken out to:')
print(a)
print()

# Step through the geographies: strip the commas, lower-case the name, split it on spaces,
# and append the word "of" so it lines up with the OSC naming.  Then compare using the Jaccard similarity function.
for hay in haystack:
    b = hay[1].replace(',','').lower().split(' ')
    b.append('of')
    # Calculate the similarity
    j = jaccard_similarity(a, b)
    # Check whether this Jaccard similarity score beats our best match so far
    if j > best_match_score:
        best_match_score = j
        best_match = hay[1]
        best_match_set = b

# Now that we have done that let's print the best match
print('Best Match: '+best_match+' (Score: '+str(best_match_score)+')')
print(best_match_set)

"Village of Fairport, Monroe County, New York" broken out to:
['village', 'of', 'fairport', 'monroe', 'county', 'new', 'york']

Best Match: Fairport village, Monroe County, New York (Score: 1.0)
['fairport', 'village', 'monroe', 'county', 'new', 'york', 'of']
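
Appending "of" to every geography name works here, but the perfect score depends on it: without that step the same pair would share only 6 of 7 distinct words. One alternative, sketched below under stated assumptions, is to drop filler words from both names and keep the top few candidates rather than only the best one. The names `tokenize`, `rank_matches`, and `STOP_WORDS` are hypothetical, and the short `sample_haystack` list stands in for the database rows (its names are copied from the output shown earlier).

```python
import heapq

STOP_WORDS = {'of'}  # assumption: 'of' is the only filler word that needs dropping


def tokenize(name):
    """Lower-case a name, strip commas, split on whitespace, drop stop words."""
    return {word for word in name.replace(',', '').lower().split()
            if word not in STOP_WORDS}


def jaccard_similarity(a, b):
    """Size of the intersection divided by the size of the union."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0


def rank_matches(needle, names, top_n=3):
    """Return the top_n (score, name) pairs, highest score first."""
    needle_tokens = tokenize(needle)
    scored = ((jaccard_similarity(needle_tokens, tokenize(name)), name)
              for name in names)
    return heapq.nlargest(top_n, scored)


# Stand-in for the NAME column of the geography index
sample_haystack = [
    'New York',
    'Albany County, New York',
    'Berne town, Albany County, New York',
    'Fairport village, Monroe County, New York',
]

for score, name in rank_matches('Village of Fairport, Monroe County, New York',
                                sample_haystack):
    print(f'{score:.3f}  {name}')
```

Seeing the runner-up scores next to the winner can help judge how decisive a match is, which a single best score does not show.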
