Entity Resolution
=========

This notebook gives an overview of why and how Entity Resolution is becoming an important 
discipline in [Computer Science and Data Science](http://www.datacommunitydc.org/blog/2013/08/entity-resolution-for-big-data). It explores why we need entity resolution and how to do it. Commercial options
are explained briefly. Open source options are also discussed, and some of them
are demonstrated using Python.

## Why do we need entity resolution?

When doing a statistical analysis, you need to first identify your units of analysis. 
Borrowing from Allison ([Multiple Regression: A Primer](http://www.amazon.com/Multiple-Regression-Research-Methods-Statistics/dp/0761985336), p. 7):

> To do a regression analysis, you first need a set of cases (also called units of analysis or observations).
> In the social sciences, the cases are most often persons, but they could also be organizations, countries, or
> other groups. In economics, the cases are sometimes units of time, like years or quarters. For each case, you
> need measurements on all of the variables in the regression equation. 

Often, our data does not come to us in one table that is ready for analysis. In this situation, we also need to be
aware of database concepts. In web development, this is the job of defining the database 
model (see the Python web framework Django for [their model specifications](https://docs.djangoproject.com/en/1.8/topics/db/models/)). In the case of Entity Resolution, we are concerned with two concepts:

1. Primary Keys
1. Foreign Keys

In Django, each model requires exactly one field to be the primary key (Django adds an automatic one if you do not declare it). A foreign key specifies a many-to-one 
relationship in Django: many rows of one model point at a single row of another. Django has separate field types for one-to-one and many-to-many relationships between models. For our purposes we will exclude many-to-many relationships and include one-to-one relationships in our discussion of the foreign key. The following demonstration is from the [Django documentation of the foreign key](https://docs.djangoproject.com/en/1.8/ref/models/fields/#django.db.models.ForeignKey). 

``` Python
from django.db import models

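class Car(models.Model):
    # Many cars can point at one manufacturer (many-to-one).
    # on_delete is optional in Django 1.8 but required from Django 2.0 onward.
    manufacturer = models.ForeignKey('Manufacturer', on_delete=models.CASCADE)
    # ...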

class Manufacturer(models.Model):
    # ...
    pass
```

We can see that each car has exactly one manufacturer, while a manufacturer can have many cars. Several manufacturers make mid-size sedans, but each individual car row points at the one manufacturer that makes it. Database models are not specific to Python, but it helps to have concrete examples with easily accessible documentation.

When conducting a statistical analysis, we want to run our model on one table. Before conducting the analysis, we must be sure of the following:

1. Each unit of analysis occurs only once 
  * Primary keys must be present and unique
1. Each unit of analysis includes variables that have been correctly mapped from their original tables
  * Foreign keys must be present
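
As a minimal sketch, these two checks can be written with pandas. The tables, column names, and values below are made up for illustration; they are not taken from the web application above.

``` Python
import pandas as pd

# Hypothetical tables; names and values are made up for illustration.
manufacturers = pd.DataFrame({
    "manufacturer_id": [1, 2],
    "name": ["GMC", "Toyota"],
})
cars = pd.DataFrame({
    "car_id": [10, 11, 12],
    "model": ["Sierra", "Camry", "Corolla"],
    "manufacturer_id": [1, 2, 2],
})

# 1. Primary keys must be present and unique.
pk_ok = cars["car_id"].notna().all() and cars["car_id"].is_unique

# 2. Foreign keys must be present and must point at an existing row.
fk_ok = cars["manufacturer_id"].isin(manufacturers["manufacturer_id"]).all()

# With both checks passing, the tables can be flattened into one analysis table.
# validate="many_to_one" makes pandas raise if manufacturer_id is duplicated
# on the manufacturers side.
analysis_table = cars.merge(manufacturers, on="manufacturer_id", how="left",
                            validate="many_to_one")
print(pk_ok, fk_ok, analysis_table.shape)
```

If either check fails, the merge would silently duplicate or drop units of analysis, which is exactly the situation entity resolution is meant to repair.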

Entity Resolution is a tool for defining primary and foreign keys when these relationships were not well defined beforehand. For example, suppose people start using the web application described above. They populate the models with cars and
the manufacturers who produce them. The developers did not enforce standard names for cars and manufacturers because
they were unsure of what the future might hold. When the data scientist arrives on the scene, they find that people have written
the manufacturer "GMC" in a variety of ways, like "General Motors Corporation" and "General Motors". People have also 
misspelled the manufacturer name. In fact, they have done this across much of the manufacturer and car information. 

Given the state of this manufacturer/car data, the data scientist is no longer able to ascertain their unit of analysis. They must recreate primary and foreign keys before they can perform their statistical analysis. 
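
To make the problem concrete, here is a rough sketch of one way to map messy manufacturer entries back to a single key using only the standard library. The alias table, the `canonical_manufacturer` helper, and the 0.8 cutoff are all assumptions made up for this illustration; a real dataset would need a far larger alias list or a learned matcher such as the ones discussed later.

``` Python
import difflib

# Hypothetical alias table: every known spelling maps to one canonical name.
ALIASES = {
    "gmc": "GMC",
    "general motors": "GMC",
    "general motors corporation": "GMC",
    "toyota": "Toyota",
}

def canonical_manufacturer(raw, cutoff=0.8):
    """Return the canonical name for a raw entry, or None if nothing is close."""
    key = " ".join(raw.lower().split())  # lowercase and collapse whitespace
    if key in ALIASES:
        return ALIASES[key]
    # Fall back to fuzzy matching to catch misspellings of known spellings.
    close = difflib.get_close_matches(key, list(ALIASES), n=1, cutoff=cutoff)
    return ALIASES[close[0]] if close else None

for raw in ["General Motors", "Genral Motors Corp", "toyotta", "Ford"]:
    print(repr(raw), "->", canonical_manufacturer(raw))
```

Entries that return `None` are the ones a person (or a better model) still has to resolve by hand.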

  
**TL;DR** We can't begin our analysis until we have defined and identified quantifiable units of analysis. Data often shows up in multiple tables, so we need to know about primary and foreign keys. We want to run our statistical analysis on a single table, and we must determine primary and foreign keys before we can manipulate our data into that table. Entity resolution can be used to establish primary and foreign keys when those relationships were not well defined beforehand.




## How can we accomplish entity resolution?

We have established that entity resolution is a way for the data scientist to reestablish primary and foreign keys once people have entered information that makes those relationships ambiguous from a statistical analysis standpoint. Here is the same goal restated by the [Stanford Entity Resolution Framework (SERF)](http://infolab.stanford.edu/serf/) project:

> The goal of the SERF project is to develop a generic infrastructure for Entity Resolution (ER). ER (also known as
> deduplication, or record linkage) is an important information integration problem: The same "real-world entities"
> (e.g., customers, or products) are referred to in different ways in multiple data records. For instance, two records
> on the same person may provide different name spellings, and addresses may differ. The goal of ER is to "resolve"
> entities, by identifying the records that represent the same entity and reconciling them to obtain one record per
> entity.
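
The two steps in that description, identifying records that belong together and reconciling them into one record, can be sketched in a few lines. Everything below is a made-up illustration: the records, the matched pairs, and the "keep the longest value" reconciliation rule are assumptions, not the SERF project's method.

``` Python
from collections import defaultdict

# Hypothetical records and match decisions, made up for illustration.
records = {
    0: {"name": "J. Smith", "city": "Chicago"},
    1: {"name": "John Smith", "city": ""},
    2: {"name": "Ann Lee", "city": "Boston"},
}
matched_pairs = [(0, 1)]  # pairs some matcher judged to be the same entity

# Union-find: records connected by any chain of matches share one root.
parent = {rid: rid for rid in records}

def find(rid):
    while parent[rid] != rid:
        parent[rid] = parent[parent[rid]]  # path halving
        rid = parent[rid]
    return rid

for a, b in matched_pairs:
    parent[find(a)] = find(b)

clusters = defaultdict(list)
for rid in records:
    clusters[find(rid)].append(rid)

def reconcile(ids):
    """Merge a cluster by keeping the longest value per field (an arbitrary rule)."""
    fields = records[ids[0]].keys()
    return {f: max((records[i][f] for i in ids), key=len) for f in fields}

entities = [reconcile(ids) for ids in clusters.values()]
print(entities)
```

The hard part in practice is producing `matched_pairs`; the options below are different ways of getting there.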

You can accomplish entity resolution by:

1. Hiring a company
1. Doing it yourself
1. A little of both

We will explore each option. 

### Hire a company

[Basis Technology](http://www.basistech.com/) has a product called the [Rosette Entity Resolver](http://www.basistech.com/text-analytics/rosette/entity-resolver/). Their technology has been used
by "Amazon.com, EMC, Endeca/Oracle, Exalead/Dassault, Fujitsu, Google, Hewlett-Packard, Microsoft, Oracle, and governments around the world". If you're a newer, smaller company, you might want to check out their [startup program](http://www.basistech.com/about/startup/).

### Do it yourself

I've found a number of open source tools and frameworks. None of them is an Apache project. Here's the list:

1. [Dedupe](https://github.com/datamade/dedupe) Python package
1. [Duke](https://github.com/larsga/Duke)
1. [elasticsearch-entity-resolution](https://github.com/YannBrrd/elasticsearch-entity-resolution)
1. [Berkeley Entity Resolution](https://github.com/gregdurrett/berkeley-entity)
1. [SERF Project](http://infolab.stanford.edu/serf/)
1. [Ch. 3 of Mining Massive Datasets](http://infolab.stanford.edu/~ullman/mmds/ch3.pdf)

Of these options, Dedupe is the only one maintained by an organization ([Datamade](http://datamade.us/)). In my experience Duke seemed to have some buzz, but none of the demo examples worked for me. The Elasticsearch entity resolution plugin is based on Duke (I haven't tried it). The Berkeley ER project looks like it's Greg Durrett's thesis work and appears to be a work in progress. When I contacted one of the researchers at the SERF project, they said it's no longer active. Ch. 3 of Mining Massive Datasets includes an entity resolution example that may work for certain use cases.

Let's take a closer look at *Dedupe* and Ch. 3 of Mining Massive Datasets (hereafter *MMDS*). See **Appendix 1 & 2** for examples.

### Do some of it yourself but hire a company

One approach is to refurbish your dataset by kicking the complexity to someone else's API. Depending on your
data, you may be able to take advantage of the [Google Places API](https://developers.google.com/places/). This would let you retrieve standard names for locations and organizations; you may even be able to cross-reference their phone numbers. As far as I can tell, this API differs considerably from the ones offered by Facebook or Foursquare because its data is not user-generated. See **Appendix 3** for an example.

## Appendix 0: Getting Data

Our data is from the Dedupe [test datasets](https://github.com/datamade/dedupe/tree/master/tests/datasets).

In [1]:
import pandas as pd

restaurant_1 = pd.read_csv("datasets/restaurant-1.csv")
restaurant_2 = pd.read_csv("datasets/restaurant-2.csv")

print("Columns are identical: {}".format(restaurant_1.columns == restaurant_2.columns))

Columns are identical: [ True  True  True  True  True]


In [2]:
restaurant_1.describe()

| | name | address | city | cuisine | unique_id |
|---|---|---|---|---|---|
| count | 112 | 112 | 112 | 112 | 112 |
| unique | 112 | 111 | 16 | 31 | 112 |
| top | san domenico | "3434 peachtree rd. ne" | "new york city" | "american (new)" | '36' |
| freq | 1 | 2 | 43 | 20 | 1 |


## Appendix 1: MMDS Approach

The main moving part here is the edit distance. 
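
Since this appendix is otherwise empty, here is a minimal sketch of an edit distance in plain Python. It implements the common Levenshtein variant (insertions, deletions, and substitutions). As I recall, MMDS defines edit distance with insertions and deletions only, so treat the choice of variant, the `similarity` helper, and its scaling as assumptions rather than the book's method.

``` Python
def edit_distance(a, b):
    """Levenshtein distance: minimum single-character insertions,
    deletions and substitutions needed to turn a into b."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        previous = current
    return previous[-1]

def similarity(a, b):
    """Scale the distance to [0, 1], where 1.0 means identical strings."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1 - edit_distance(a, b) / longest

print(edit_distance("virgil's", "virgil's real bbq"))
print(similarity("hotel bel-air", "bel-air hotel"))
```

Two records could then be treated as candidate duplicates when the similarity of their names (and perhaps addresses) exceeds some threshold; picking that threshold is the part that needs labeled data.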

## Appendix 2: Dedupe (Python Package)

Note that the [dedupe tests](https://github.com/datamade/dedupe/tree/master/tests) include the referenced datasets. I've modified the tests so they play well within a notebook. We will use Dedupe to solve the following problems:


1. Deduplication
1. Record linkage

In [3]:
import os
import time
import csv
import datetime
import multiprocessing
from itertools import combinations
from future.utils import viewitems

import dedupe
import exampleIO


### Deduplication

In [4]:
# filepath
settings_file = 'canonical_learned_settings.json'
raw_data = 'datasets/restaurant-nophone-training.csv'

# params
fields = [{'field': 'name', 'type': 'String'},
          {'field': 'name', 'type': 'Exact'},
          {'field': 'address', 'type': 'String'},
          {'field': 'cuisine', 'type': 'ShortString'},
          {'field': 'city', 'type': 'ShortString'}]



In [5]:
def dedupeToPandasDf(output):
    """ Convert dedupe output to Pandas df """

def canonicalImport(filename):
    """ Clean and import data """
    preProcess = exampleIO.preProcess

    data_d = {}

    with open(filename) as f:
        reader = csv.DictReader(f)
        for (i, row) in enumerate(reader):
            clean_row = [(k, preProcess(v)) for (k, v) in
                         viewitems(row)]
            data_d[i] = dedupe.core.frozendict(clean_row)

    return data_d, reader.fieldnames

def evaluateDuplicates(found_dupes, true_dupes):
    """ Log information about duplicates that are identified """
    true_positives = found_dupes.intersection(true_dupes)
    false_positives = found_dupes.difference(true_dupes)
    uncovered_dupes = true_dupes.difference(found_dupes)

    print('found duplicate')
    print('duplicate length: {}'.format(len(found_dupes)))

    print('precision: {}'.format(1 - len(false_positives) \
                                        / float(len(found_dupes))))

    print('recall: {}'.format(len(true_positives) \
                                     / float(len(true_dupes))))
    
    return (true_positives, false_positives, uncovered_dupes)
    
def handleSettingsFile(settings_file, cores, sample_size, 
                       training_pairs, fields):
    """ Load a trained deduper from the settings file, or train one and write the file
    
    Args:
        settings_file: str, path to settings file (json)
        cores:  int, multiprocessing.cpu_count()
        sample_size: int, # of samples
        training_pairs: dedupe.trainingDataDedupe()
        fields: list of dicts, field specifications for deduping
        
    Returns:
        deduper, loaded from the settings file if it exists, otherwise
        newly trained (the settings file is then written to disk)
    
    """
    if os.path.exists(settings_file):
        # the with block closes the file, so no explicit close() is needed
        with open(settings_file, 'rb') as f:
            deduper = dedupe.StaticDedupe(f, 1)
        return deduper

    # NOTE: data_d is not a parameter; it is read from the notebook's global scope
    deduper = dedupe.Dedupe(fields, num_cores=cores)
    deduper.sample(data_d, sample_size)
    deduper.markPairs(training_pairs)
    deduper.train()
    with open(settings_file, 'wb') as f:
        deduper.writeSettings(f)
    return deduper


In [6]:
## run the deduplication


data_d, header = canonicalImport(raw_data)

training_pairs = dedupe.trainingDataDedupe(data_d, 
                                           'unique_id', 
                                            5000)

deduper = handleSettingsFile(settings_file = settings_file, 
                       cores = multiprocessing.cpu_count(),
                       sample_size = 10000,
                       training_pairs = training_pairs,
                       fields = fields)


duplicates_s = set(frozenset(pair) for pair in training_pairs['match'])

t0 = time.time()

print('number of known duplicate pairs', len(duplicates_s))

alpha = deduper.threshold(data_d, 1.5)

# print candidates
print('clustering...')
clustered_dupes = deduper.match(data_d, threshold=alpha)

print('Evaluate Clustering')
confirm_dupes = set([])
for dupes, score in clustered_dupes:
    for pair in combinations(dupes, 2):
        confirm_dupes.add(frozenset((data_d[pair[0]], 
                                     data_d[pair[1]])))

true_positives, false_positives, uncovered_dupes = evaluateDuplicates(confirm_dupes, duplicates_s)

print('ran in ', time.time() - t0, 'seconds')


number of known duplicate pairs 112
clustering...
Evaluate Clustering
found duplicate
duplicate length: 115
precision: 0.9391304347826087
recall: 0.9642857142857143
ran in  0.37319064140319824 seconds
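
The counts behind these numbers can be recovered from the output itself: a recall of 0.964 against 112 known pairs means 108 true positives, so 7 of the 115 found pairs are false positives and 4 known duplicates were missed. A small sketch that redoes the arithmetic (the counts are derived from the printed output, not read from the notebook state, and it assumes Python 3 division):

``` Python
found_pairs = 115   # 'duplicate length' in the output above
known_pairs = 112   # 'number of known duplicate pairs'

# recall = true_positives / known_pairs, so invert it to get the count
true_positives = round(0.9642857142857143 * known_pairs)
false_positives = found_pairs - true_positives
missed = known_pairs - true_positives

precision = true_positives / found_pairs
recall = true_positives / known_pairs

print(true_positives, false_positives, missed)
print(round(precision, 4), round(recall, 4))
```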


In [20]:
# print each true positive pair
for item in true_positives:
    print("\nnew item")
    print(item)
    


new item
frozenset({<frozendict {'name': 'plumpjack cafe', 'city': 'san francisco', 'unique_id': '108', 'cuisine': 'american (new)', 'address': '3127 fillmore st.'}>, <frozendict {'name': 'plumpjack cafe', 'city': 'san francisco', 'unique_id': '108', 'cuisine': 'mediterranean', 'address': '3201 fillmore st.'}>})

new item
frozenset({<frozendict {'name': 'dining room ritz-carlton buckhead', 'city': 'atlanta', 'unique_id': '90', 'cuisine': 'international', 'address': '3434 peachtree rd.'}>, <frozendict {'name': 'ritz-carlton dining room (buckhead)', 'city': 'atlanta', 'unique_id': '90', 'cuisine': 'american (new)', 'address': '3434 peachtree rd. ne'}>})

new item
frozenset({<frozendict {'name': "virgil's", 'city': 'new york', 'unique_id': '66', 'cuisine': 'american', 'address': '152 w. 44th st.'}>, <frozendict {'name': "virgil's real bbq", 'city': 'new york city', 'unique_id': '66', 'cuisine': 'bbq', 'address': '152 w. 44th st.'}>})

new item
frozenset({<frozendict {'name': 'hedgerose h

In [21]:
for item in false_positives:
    print("\nnew item")
    print(item)


new item
frozenset({<frozendict {'name': 'restaurant ritz-carlton atlanta', 'city': 'atlanta', 'unique_id': '91', 'cuisine': 'continental', 'address': '181 peachtree st.'}>, <frozendict {'name': 'ritz-carlton cafe (atlanta)', 'city': 'atlanta', 'unique_id': '711', 'cuisine': 'american (new)', 'address': '181 peachtree st.'}>})

new item
frozenset({<frozendict {'name': 'dining room ritz-carlton buckhead', 'city': 'atlanta', 'unique_id': '90', 'cuisine': 'international', 'address': '3434 peachtree rd.'}>, <frozendict {'name': 'ritz-carlton cafe (buckhead)', 'city': 'atlanta', 'unique_id': '89', 'cuisine': 'american (new)', 'address': '3434 peachtree rd. ne'}>})

new item
frozenset({<frozendict {'name': 'palm', 'city': 'new york city', 'unique_id': '634', 'cuisine': 'steakhouses', 'address': '837 second ave.'}>, <frozendict {'name': 'palm too', 'city': 'new york city', 'unique_id': '635', 'cuisine': 'steakhouses', 'address': '840 second ave.'}>})

new item
frozenset({<frozendict {'name':

In [22]:
for item in uncovered_dupes:
    print("\nnew item")
    print(item)


new item
frozenset({<frozendict {'name': 'spago', 'city': 'los angeles', 'unique_id': '20', 'cuisine': 'californian', 'address': '1114 horn ave.'}>, <frozendict {'name': 'spago (los angeles)', 'city': 'w. hollywood', 'unique_id': '20', 'cuisine': 'californian', 'address': '8795 sunset blvd.'}>})

new item
frozenset({<frozendict {'name': 'shun lee west', 'city': 'new york', 'unique_id': '60', 'cuisine': 'asian', 'address': '43 w. 65th st.'}>, <frozendict {'name': 'shun lee palace', 'city': 'new york city', 'unique_id': '60', 'cuisine': 'chinese', 'address': '155 e. 55th st.'}>})

new item
frozenset({<frozendict {'name': 'bel-air hotel', 'city': 'bel air', 'unique_id': '2', 'cuisine': 'californian', 'address': '701 stone canyon rd.'}>, <frozendict {'name': 'hotel bel-air', 'city': 'bel air', 'unique_id': '2', 'cuisine': 'californian', 'address': '701 stone canyon rd.'}>})

new item
frozenset({<frozendict {'name': 'restaurant ritz-carlton atlanta', 'city': 'atlanta', 'unique_id': '91', '

In [23]:
for item in clustered_dupes:
    print("\nnew item")
    print(item)


new item
((0, 1), (0.99999946, 0.99999946))

new item
((2, 3), (0.99885875, 0.99885875))

new item
((6, 7), (0.99999994, 0.99999994))

new item
((8, 9), (0.99999976, 0.99999976))

new item
((10, 11), (0.99999982, 0.99999982))

new item
((12, 13), (0.99999994, 0.99999994))

new item
((781, 775), (0.38461211, 0.38461211))

new item
((14, 15), (0.99846447, 0.99846447))

new item
((16, 17), (0.99999994, 0.99999994))

new item
((18, 19), (0.99783939, 0.99783939))

new item
((20, 21), (0.51554132, 0.51554132))

new item
((22, 23), (1.0, 1.0))

new item
((24, 25), (0.99973643, 0.99973643))

new item
((26, 27), (0.99983519, 0.99983519))

new item
((28, 29), (0.9999997, 0.9999997))

new item
((30, 31), (0.83063316, 0.83063316))

new item
((32, 33), (0.99999994, 0.99999994))

new item
((34, 35), (0.99812686, 0.99812686))

new item
((36, 37), (1.0, 1.0))

new item
((38, 39), (0.99999988, 0.99999988))

new item
((42, 43), (0.99999994, 0.99999994))

new item
((44, 45), (0.99939847, 0.99939847))

n
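
The `dedupeToPandasDf` stub defined earlier is left empty, so here is one possible way to turn this output into a table. It assumes the structure printed above, a sequence of `(record ids, scores)` tuples; the `clusters_to_frame` name and the 0.6 review cutoff are made up for this sketch and are not part of dedupe.

``` Python
import pandas as pd

# A few (record ids, scores) clusters copied from the output above.
clustered_dupes = [
    ((0, 1), (0.99999946, 0.99999946)),
    ((781, 775), (0.38461211, 0.38461211)),
    ((20, 21), (0.51554132, 0.51554132)),
]

def clusters_to_frame(clusters):
    """One row per record: which cluster it belongs to and its score."""
    rows = [
        {"cluster_id": cluster_id, "record_id": record_id, "score": score}
        for cluster_id, (record_ids, scores) in enumerate(clusters)
        for record_id, score in zip(record_ids, scores)
    ]
    return pd.DataFrame(rows, columns=["cluster_id", "record_id", "score"])

clusters_df = clusters_to_frame(clustered_dupes)

# Low-scoring clusters are the ones worth reviewing by hand.
# The 0.6 cutoff is an arbitrary illustration, not a dedupe default.
needs_review = clusters_df[clusters_df["score"] < 0.6]
print(needs_review)
```

Joining `record_id` back to `data_d` would then give the name, address, and cuisine for each flagged record.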

### Record linkage

## Appendix 3: Google Places API

Please verify with Google that this is not a violation of their terms of service before implementing this approach.
