# Record Linkage

Adapted from [here](https://github.com/CSSIP-AIR/Big-Data-Workbooks/blob/master/08.%20Data%20Linkage/Record%20Linkage.ipynb)

---

## Introduction

The goal of record linkage is to determine whether pairs of records describe the same entity. This is important for removing duplicates from a data source or for joining two separate data sources. Depending on the field, record linkage also goes by the names data matching, merge/purge, duplicate detection, de-duping, reference matching and co-reference/anaphora resolution. Approaches to record linkage include exact matching, rule-based linking and probabilistic linking. An example of exact matching is joining records on social security number. Rule-based matching applies a cascading set of rules that reflect domain knowledge about the records being linked. In probabilistic record linkage, linkage weights are calculated for each pair of records and a threshold is applied to decide whether to link them. This tutorial covers preprocessing data, rule-based linkage and probabilistic linkage using the Fellegi-Sunter model.
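
As a concrete illustration of the simplest approach, exact matching can be sketched as a dictionary lookup on a shared key. The records, field names and the `exact_match` helper below are hypothetical and exist only for this example; they are not part of the datasets used later.

```python
# Hypothetical toy records; field names and values are made up for illustration
payroll = [
    {'ssn': '111-11-1111', 'title': 'PROFESSOR'},
    {'ssn': '222-22-2222', 'title': 'POSTDOC'},
]
awards = [
    {'ssn': '222-22-2222', 'award_id': 'A1'},
    {'ssn': '333-33-3333', 'award_id': 'A2'},
]

def exact_match(left, right, key):
    """Link records from left and right whose values for key are identical."""
    # index the right-hand records by key so each lookup is O(1)
    index = {}
    for rec in right:
        index.setdefault(rec[key], []).append(rec)

    linked = []
    for rec in left:
        for other in index.get(rec[key], []):
            merged = dict(other)   # copy so the inputs are not modified
            merged.update(rec)
            linked.append(merged)
    return linked

linked = exact_match(awards, payroll, 'ssn')
print(linked)
```

Only the award whose key appears in both lists is linked. A single typo in the key would break the link, which is the motivation for the fuzzier approaches covered in the rest of the tutorial.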

## Table of Contents

1. [Load the data](#Load-the-Data)
2. [Explore the data](#Data-Exploration)
3. [Preprocess the Data](#Preprocess:-clean-Up-Names-and-separate-to-first-middle-and-last-name)
4. [Explore Metrics](#String-Comparators)
5. [Create Rule-Based Linking](#Create-A-Rule-Based-System-for-record-matching.)
6. [Create Probabilistic Linking Using Fellegi-Sunter](#Probabilisic-Record-Linkage)

In [None]:
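%matplotlib inline
from __future__ import print_function
import math
import numpy as np
from six import text_type as unicode  # Python 3 has no built-in unicode(); this alias keeps the calls below working
from six.moves import zip, range
import pandas as pd
import jellyfish  # newer jellyfish releases rename jaro_winkler to jaro_winkler_similarity
from collections import OrderedDict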

# Load the Data 

Our first dataset contains all NSF grants awarded between 2010 and 2012.

In [None]:
df_nsf_awards = pd.read_csv('./Datasets/nsf_awards_2010-2012.csv')

In [None]:
df_nsf_awards.head()

Our second dataset is a list of all employees in the UC System in 2011. 

In [None]:
df_ucpay = pd.read_csv('./Datasets/ucpay2011.csv', sep='\t')

In [None]:
df_ucpay.head()

We can see that some names in the data are redacted. Let's do some exploratory analysis on the data.

# Data Exploration

In [None]:
df_ucpay.year.unique()

As expected, all employees are from 2011.

In [None]:
df_ucpay.campus.unique()

In [None]:
df_ucpay.groupby('campus').size().plot(kind='barh')

In [None]:
df_ucpay.title.unique()

In [None]:
len(df_ucpay.title.unique())

We can see that there are 2626 unique positions in the System. It is likely that only a very small subset of them, typically those with titles such as Professor, Postdoc or Research Professional, received grants from the NSF.

In [None]:
df_ucpay.shape

Remove all the redacted names from the UC dataset.

In [None]:
mask = df_ucpay.name != "***********"

In [None]:
df_ucpay[mask].shape

In [None]:
df_ucpay = df_ucpay[mask]

In [None]:
df_ucpay.head(15)

We have filtered out all the redacted names, and we can see that the name field in the UC data is given in *lastname*, *firstname* *middle* format.

In [None]:
sel_cols = ['ID','campus', 'name', 'title']
df_ucpay = df_ucpay[sel_cols]

In [None]:
df_ucpay.head()

Filter the NSF data to keep only awards from CA.

In [None]:
state_mask = df_nsf_awards['StateCode'] == 'CA'
df_nsf_awards = df_nsf_awards[state_mask]

In [None]:
df_nsf_awards.head()

# What kind of records do we want to match?

Given the name of an award recipient, match their record against the UC database to get their position and employee ID.

# Preprocess: clean Up Names and separate to first middle and last name

In [None]:
names = df_ucpay.name.values

In [None]:
def split_names(name):
    """
    Splits a name field into first, middle and last names
    and returns lower-case values.
    
    Parameters
    -----------
    name: str
        e.g., SHAPIRO, JORDAN ISAAC
    
    
    Returns
    -------
    (first, middle, last): tuple(str)
        e.g., ('jordan', 'isaac', 'shapiro')
    """
    
    #split on the comma to get the last name
    ls_name = name.lower().split(',')

    last_name = ls_name[0]
    #guard against names without a comma
    first_middle_name = ls_name[1] if len(ls_name) > 1 else ''
    
    #split by space to get the first and middle name;
    #missing parts default to the empty string
    ls_first_middle_name = first_middle_name.split()
    first_name = ls_first_middle_name[0] if ls_first_middle_name else ''
    middle_name = ls_first_middle_name[1] if len(ls_first_middle_name) > 1 else ''
    return unicode(first_name.strip()), unicode(middle_name.strip()), unicode(last_name.strip())

In [None]:
ls_cleaned_names = [split_names(name) for name in names]

In [None]:
ls_first, ls_middle, ls_last = zip(*ls_cleaned_names)

In [None]:
df_ucpay['first'] = ls_first
df_ucpay['middle'] = ls_middle
df_ucpay['last'] = ls_last

In [None]:
df_ucpay.head()

We now have cleaned fields for the first, middle and last names. The NSF data only has first and last name fields, so those are the only two we need for matching.

In [None]:
df_nsf_awards.dropna(subset=['FirstName','LastName'], inplace=True)

The cell above drops any rows that lack an entry in the FirstName or LastName field.

In [None]:
df_nsf_awards['first'] = [unicode(name.lower()) for name in df_nsf_awards['FirstName'].values]
df_nsf_awards['last'] = [unicode(name.lower()) for name in df_nsf_awards['LastName'].values]

>**Note**: In Python 2 we have to explicitly tell Python that we want a string to be encoded as unicode. In Python 3 all strings 
>are unicode by default. 
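
One common way to write code that runs on both interpreters is to pick the text type once and reuse it. The sketch below is only an illustration under that assumption; `text_type` and `to_text` are hypothetical names and are not used elsewhere in this notebook.

```python
import sys

# choose the unicode text type for the running interpreter;
# on Python 3 the `unicode` branch is never evaluated
text_type = str if sys.version_info[0] >= 3 else unicode  # noqa: F821

def to_text(value, encoding='utf-8'):
    """Return value as unicode text, decoding byte strings if needed."""
    if isinstance(value, bytes) and not isinstance(value, text_type):
        return value.decode(encoding)
    return text_type(value)

print(to_text(b'shapiro'), to_text('Jordan').lower())
```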


In [None]:
df_nsf_awards.head()

# String Comparators

Now that we have cleaned data, let's explore how to match the string fields. For continuous fields comparison is simple: the absolute difference between the values can be taken as a measure of closeness. For strings it is a little more complex. One metric is the *edit distance*, the minimum number of edits needed to transform one string into another. When the allowed edits are insertions, deletions and substitutions, this is known as the *Levenshtein distance*. If transpositions of adjacent letters are also allowed, it is known as the *Damerau-Levenshtein distance*. The *Jaro-Winkler* score is a fast-to-compute similarity measure that returns a normalized value between zero and one, where one means the strings are identical.
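
To make the two edit distances concrete, here is a minimal pure-Python sketch. It is not how `jellyfish` implements them; `levenshtein` and `osa_distance` are names chosen for this example, and `osa_distance` is the restricted "optimal string alignment" variant of Damerau-Levenshtein, which is enough to show the effect of allowing adjacent transpositions.

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    prev = list(range(len(b) + 1))          # distances from '' to prefixes of b
    for i, ca in enumerate(a, 1):
        curr = [i]                          # distance from a[:i] to ''
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

def osa_distance(a, b):
    """Levenshtein distance that also allows swapping two adjacent letters."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i
    for j in range(cols):
        d[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
            # adjacent transposition, e.g. 'se' <-> 'es'
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]

print(levenshtein('Jon', 'John'))        # one insertion
print(levenshtein('Joseph', 'Joesph'))   # a swap costs two substitutions
print(osa_distance('Joseph', 'Joesph'))  # the same swap costs one transposition
```

The last two lines show why the transposition-aware distance is often preferred for names: swapped letters are a common typing error, and counting them as one edit keeps such typos close to the intended name.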

In [None]:
class StringComparators(object):
    """
    Test various string comparators 
    """

    def test_levenshtein_distance(self):
        assert jellyfish.levenshtein_distance('John', 'John') == 0
        # one insertion: Jon -> John
        assert jellyfish.levenshtein_distance('Jon', 'John') == 1
        # a swap of adjacent letters costs two substitutions
        assert jellyfish.levenshtein_distance('Joseph', 'Joesph') == 2
        
    def test_damerau_levenshtein(self):
        # the same swap counts as a single transposition
        assert jellyfish.damerau_levenshtein_distance('Joseph', 'Joesph') == 1

    def test_jaro_winkler(self):
        assert np.isclose(jellyfish.jaro_winkler('Joseph', 'Joesph'), 0.955555)
        assert np.isclose(jellyfish.jaro_winkler('Chris', 'Christoper'), 0.9)
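
To see what goes into the Jaro-Winkler scores checked above, here is a pure-Python sketch of the textbook definition. It is an illustration only: the function names and the `prefix_scale` and `max_prefix` parameters are choices made for this example, and library implementations may differ in details such as when the prefix bonus is applied.

```python
def jaro(s1, s2):
    """Jaro similarity: 1.0 for identical strings, 0.0 for no matching characters."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # characters only count as matching if they are at most `window` positions apart
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # transpositions: matched characters that appear in a different order
    seq1 = [c for c, m in zip(s1, matched1) if m]
    seq2 = [c for c, m in zip(s2, matched2) if m]
    transpositions = sum(a != b for a, b in zip(seq1, seq2)) // 2
    m = float(matches)
    return (m / len1 + m / len2 + (m - transpositions) / m) / 3.0

def jaro_winkler(s1, s2, prefix_scale=0.1, max_prefix=4):
    """Jaro similarity with a bonus for a shared prefix of up to max_prefix characters."""
    score = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1, s2):
        if a != b or prefix == max_prefix:
            break
        prefix += 1
    return score + prefix * prefix_scale * (1.0 - score)

print(round(jaro_winkler('Joseph', 'Joesph'), 4))
print(round(jaro_winkler('Chris', 'Christoper'), 4))
```

The prefix bonus is what makes the score well suited to names: strings that agree on their first few letters, such as a name and its abbreviation, are pulled closer to one.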

# Let's get the top 10 matching first names in the UC data according to the Jaro-Winkler score

In [None]:
uniq_nsf_firstname = set( df_nsf_awards['first'].values ) #grab unique names from the nsf

In [None]:
uc_names = df_ucpay['first'].values #grab the uc_names

In [None]:
# Comparison of records

In [None]:
testname = unicode(uc_names[0])

In [None]:
def get_matching_first_name(testname, NUM_NAMES=10):
    """
    Return the NUM_NAMES first names in uc_names that are most
    similar to testname, best match first.
    """
    # score each distinct name once; the dict de-duplicates repeated names
    dict_name_pair = {}
    for name in uc_names:
        name = unicode(name)
        dict_name_pair[name] = jellyfish.jaro_winkler(testname, name)

    # sort names by descending Jaro-Winkler score and keep the top NUM_NAMES
    ls_sorted_name = sorted(dict_name_pair, key=dict_name_pair.get, reverse=True)

    return ls_sorted_name[:NUM_NAMES]

In [None]:
print(testname,get_matching_first_name(testname))

In [None]:
for nm in uc_names[:25]:
    testname = unicode(nm)
    print(testname, get_matching_first_name(testname))

# Create A Rule Based System for record matching. 

Let's try to merge data with the following rules. 

1. The first name Jaro-Winkler score has to be greater than 0.90
2. The last name Jaro-Winkler score has to be greater than 0.90

Together, these rules essentially mean that the names have to match apart from very minor typos. 


In [None]:
dict_nsf_awards = df_nsf_awards[:10].to_dict(orient='index')

In [None]:
def create_rule_mask(nsf_first_name, 
                     nsf_last_name,
                     df_ucpay,
                     first_name_thresh=0.90,
                     last_name_thresh=0.90):
    """
    Returns a boolean mask of records to match based on
    fixed thresholds. 
    
    Parameters
    ----------
    (nsf_first_name, nsf_last_name): str
        first and last name in the NSF dataset
        
    df_ucpay: DataFrame
        DataFrame of the UC directory
        
    (first_name_thresh, last_name_thresh): float
        Jaro-Winkler score that the first and last name
        must exceed to count as a match
        
    Returns
    -------
    jaro_mask: Series[bool]
        boolean Series, True for UC records to match
    """
    compare_first = lambda x: jellyfish.jaro_winkler(nsf_first_name,x)
    compare_last = lambda x: jellyfish.jaro_winkler(nsf_last_name,x)

    jaro_first = df_ucpay['first'].map(compare_first) 
    jaro_last = df_ucpay['last'].map(compare_last)

    jaro_mask = (jaro_first > first_name_thresh) & (jaro_last > last_name_thresh)
    
    return jaro_mask
    

In [None]:
def match_records(dict_nsf_awards, df_ucpay, f_create_rule_mask):
    """
    match records from the nsf and uc datasets based on the fields 'first' and 'last' name
    
    Parameters
    ---------
    dict_nsf_awards: dict
        dictionary of nsf awards
    df_ucpay: DataFrame
        DataFrame of UC employees
    f_create_rule_mask: function
        Function that takes a first name, last name and df_ucpay
        and returns a Boolean array of whether or not to match 
        records
    
    Returns
    -------
    df_linked_data: DataFrame
        one row per (NSF award, matching UC employee) pair
    """
    
    # collect linked rows in a list; DataFrame.append was removed in pandas 2.0
    ls_linked_rows = []
    for dict_test_row in dict_nsf_awards.values():
        nsf_first_name = dict_test_row['first']
        nsf_last_name = dict_test_row['last']

        jaro_mask = f_create_rule_mask(nsf_first_name, nsf_last_name, df_ucpay)
    
        df_matches = df_ucpay[jaro_mask]
        if len(df_matches) == 0:
            print('No Match: {} {}'.format(nsf_first_name,nsf_last_name))
        for _, uc_row in df_matches.iterrows():
            # copy the award so each linked row gets its own UC fields
            linked_row = dict(dict_test_row)
            for col in ['ID', 'campus', 'name', 'title']:
                linked_row[col] = uc_row[col]
            ls_linked_rows.append(linked_row)
            
    return pd.DataFrame(ls_linked_rows)

In [None]:
df_linked_data = match_records(dict_nsf_awards, df_ucpay, create_rule_mask )

In [None]:
sel_col = ['AwardId', 'CityName', 'FirstName', 'ID', 'LastName', 'Name', 'campus', 'title', 'first', 'last']
df_linked_data[sel_col]

As we can see, 4 records in the NSF database had no matches in the UC database. If we examine the output of the record linkage more closely, we also see a few false positives, for example Paul Davies and Joseph Pasquale.

There are several shortcomings to this approach:
    
1. There is no single score threshold that can be adjusted to set the tolerance for false positives and false negatives
2. As you apply more and more rules, it can become unclear what effect the combination of rules has on the final linkage
3. Rules often reflect their creator's domain knowledge.

# Probabilisic Record Linkage 

The Fellegi-Sunter probabilistic record linkage model compares selected fields in two records and computes a weighted score for how likely the records are to refer to the same entity. The algorithm is as follows. Two fields are first compared using a metric, in this case the Jaro-Winkler score. The score is then binned into one of three categories: exact match, close match or no match. Each category is associated with two probabilities per field: the probability of observing that category when the records are a true match (the m-weight), and the probability of observing it when they are not (the u-weight). The log probabilities are summed across fields, once for the match case and once for the non-match case. The final score is the summed log probability of a match minus the summed log probability of a non-match. If the final score is greater than a threshold, the records are considered a match.
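
The arithmetic can be worked through by hand before wrapping it in a class. The sketch below reuses the m- and u-weights that appear in the `FellegiSunter` class in the next cell and assumes each field is scored with its own weights; `log_likelihood_ratio` is a hypothetical helper written only for this illustration.

```python
import math

# m- and u-probabilities per match level (0 = no match, 1 = close, 2 = exact)
M_WEIGHTS = {'first_name': (0.01, 0.14, 0.85),
             'last_name': (0.01, 0.09, 0.90)}
U_WEIGHTS = {'first_name': (0.88, 0.10, 0.02),
             'last_name': (0.91, 0.08, 0.01)}

def log_likelihood_ratio(levels):
    """Sum of log(m) - log(u) over fields; levels maps field name -> match level."""
    score = 0.0
    for field, level in levels.items():
        score += math.log(M_WEIGHTS[field][level]) - math.log(U_WEIGHTS[field][level])
    return score

# both names agree exactly: log(0.85/0.02) + log(0.90/0.01)
both_exact = log_likelihood_ratio({'first_name': 2, 'last_name': 2})
# only the last name agrees: log(0.01/0.88) + log(0.90/0.01)
last_only = log_likelihood_ratio({'first_name': 0, 'last_name': 2})

print(round(both_exact, 2))  # 8.25
print(round(last_only, 2))   # 0.02
```

With these weights, an exact match on the last name alone is almost cancelled out by the disagreement on the first name, so the pair scores close to zero and falls below a threshold of 0.5.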

In [None]:
class FellegiSunter():
    """
    class to implement Fellegi Sunter model
    """
    
    m_weights = {'first_name': (0.01,0.14,0.85),
                 'last_name': (0.01,0.09,0.90)}
    
    u_weights = {'first_name': (0.88,0.10,0.02),
                 'last_name': (0.91,0.08,0.01)}
    
    
    def fuzzy_match(self,name1,name2):
        """
        Compares two strings using Jaro-Winkler and
        returns one of three match levels.
        
        * exact match is a jaro-winkler score >= 0.92
        * close match is a jaro-winkler score > 0.85 and < 0.92
        * no match is a jaro-winkler score <= 0.85
        
        Parameters
        ----------
        (name1, name2): str
            two text strings to compare
            
        Returns
        -------
        match_level: int
            one of three match levels
            2 - exact match 
            1 - close match
            0 - not a match
        """
        score = jellyfish.jaro_winkler(name1,name2)
        if score >= 0.92:
            return 2
        elif score > 0.85:
            return 1
        else:
            return 0
        
    def match_score(self,record1,record2):
        """
        computes the match score between a pair
        of records
        
        Parameters
        ----------
        (record1, record2): tuple(str)
            tuples of records to be compared
            
        Returns
        -------
        match_score: float
            log probability of a match minus log probability of a non-match
        
        Raises
        ------
        Exception
            tuples need to be the same size 
        """
        if not(len(record1) == len(record2)):
            raise Exception('records need to be same size')
        
        scores = [self.fuzzy_match(rec1,rec2) for rec1, rec2 in zip(record1, record2)]
        
        first_name_score, last_name_score = scores
        
        #grab the m and u weights for each field
        
        first_name_m_weight = self.m_weights['first_name'][first_name_score]
        first_name_u_weight = self.u_weights['first_name'][first_name_score]
        
        last_name_m_weight = self.m_weights['last_name'][last_name_score]
        last_name_u_weight = self.u_weights['last_name'][last_name_score]
        
        log_prob_match = math.log(first_name_m_weight) + math.log(last_name_m_weight)
        log_prob_umatch = math.log(first_name_u_weight) + math.log(last_name_u_weight)
        
        match_score = log_prob_match - log_prob_umatch
        
        return match_score
        
    def link_record(self, record1, record2, threshold=0.5):
        """
        Returns True if records should be linked
        False otherwise.
        
        Parameters
        ----------
        (record1, record2): tuple(str)
            tuples of records to be compared. must
            be the same length
        threshold: float
            records are linked when the match score exceeds this value
        
        Returns
        -------
        link: bool
            bool on whether to link two records or not
        """
        
        match_score = self.match_score(record1,record2)
        return match_score > threshold
        

In [None]:
fs = FellegiSunter()

In [None]:
print( fs.link_record(('Avishek','Kumar'), ('Avishek','Kumar')) )

In [None]:
print( fs.link_record( ('Avishek','Kumar'), ('Anup','Kumar') ) )

In [None]:
#let's take this new function for a spin
print('jonathon', 'jonthon',fs.fuzzy_match('jonathon','jonthon') )
print('john', 'mark',fs.fuzzy_match('john','mark') )
print('fred', 'frederick', fs.fuzzy_match('fred', 'frederick'))

In [None]:
def create_jaro_mask(nsf_first_name, nsf_last_name, df_ucpay):
    """
    create a boolean array for whether to link records based on
    the jaro-winkler distance
    
    Parameters
    ----------
    (nsf_first_name, nsf_last_name): str
        first and last name in the NSF dataset
        
    df_ucpay: DataFrame
        DataFrame of the UC directory
        
    Returns
    -------
    jaro_mask: ls[bool]
        boolean list of records to match
    """
    record = (nsf_first_name, nsf_last_name)
    uc_records = df_ucpay[['first','last']].values
    
    jaro_mask = [fs.link_record(record, uc_record) for uc_record in uc_records]
    
    return jaro_mask
    
    

In [None]:
df_linked_data = match_records(dict_nsf_awards, df_ucpay, create_jaro_mask )

In [None]:
sel_col = ['AwardId', 'CityName', 'FirstName', 'ID', 'LastName', 'Name', 'campus', 'title', 'first', 'last']
df_linked_data[sel_col]

Here are the results of probabilistic matching. We can change the threshold to see how the results vary.
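
One way to get a feel for the threshold without rerunning the whole linkage is to enumerate every combination of match levels and see which ones clear a given threshold. The sketch below does this with the weights from the `FellegiSunter` class, assuming each field uses its own m- and u-weights; `pattern_score` and `linked_patterns` are hypothetical helpers, and the thresholds tried are arbitrary.

```python
import math
from itertools import product

# m- and u-probabilities per match level (0 = no match, 1 = close, 2 = exact)
M_WEIGHTS = {'first_name': (0.01, 0.14, 0.85),
             'last_name': (0.01, 0.09, 0.90)}
U_WEIGHTS = {'first_name': (0.88, 0.10, 0.02),
             'last_name': (0.91, 0.08, 0.01)}
FIELDS = ('first_name', 'last_name')

def pattern_score(levels):
    """Match score for one (first_level, last_level) pattern."""
    return sum(math.log(M_WEIGHTS[field][level]) - math.log(U_WEIGHTS[field][level])
               for field, level in zip(FIELDS, levels))

def linked_patterns(threshold):
    """Return the match-level patterns whose score exceeds threshold."""
    return [levels for levels in product(range(3), repeat=len(FIELDS))
            if pattern_score(levels) > threshold]

for threshold in (0.0, 0.5, 4.0):
    print(threshold, linked_patterns(threshold))
```

Raising the threshold removes the weaker patterns first, so it trades false positives for false negatives in a single, inspectable step.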

[Back to Table of Contents](#Table-of-Contents)