# <center> RECORD LINKAGE IN PYTHON <br/><br/> CSCAR WORKSHOP <br/><br/> 12/08/2017
## <center> Marcio Mourao

# <center> Setup for Anaconda / Jupyter Notebook

<ul>
    <li>Go to the page https://marcio-mourao.github.io/</li>
    <li>Download the Python notebook under "Record Linkage" to your "username/Documents"</li><br/>
    
    <li>Click the Windows button (Bottom Left Corner)</li>
    <li>Click "All apps"</li>
    <li>Click "Anaconda3"</li>
    <li>Click "Anaconda Prompt" </li>
    <li>Enter "pip install recordlinkage"</li>
    <br/>
    
    <li>Click the Windows button (Bottom Left Corner)</li>
    <li>Click "All apps"</li>
    <li>Click "Anaconda3"</li>
    <li>Click "Jupyter Notebook" </li><br/>
    <li>Click "Workshop.ipynb" (this should open a new tab in the browser)</li>
</ul>

# <center> Introduction

<ul>
  <li>Don't forget to go to: http://cscar.research.umich.edu/ to know what we're offering!</li>
  <li>For questions or feedback, send an email to <a href="mailto:mdam@umich.edu" target="_top">Marcio</a>.</li>
</ul>

# <center> References

<ul>
  <li>https://www.continuum.io/anaconda-overview</li>
  <li>http://www.numpy.org/</li>
  <li>http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html</li>
  <li>http://pandas.pydata.org/pandas-docs/stable/10min.html</li>
  <li>https://pypi.python.org/pypi/recordlinkage/</li>
</ul>

# <center>What is record linkage?

The term record linkage refers to the procedure of bringing together information from two or more records that are believed to belong to the same entity. Record linkage is used to link data from multiple data sources or to find duplicates in a single data source. In computer science, record linkage is also known as data matching or, when searching for duplicate records within a single file, deduplication.
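
As a toy illustration of the idea, the sketch below links two small, made-up sources on a normalised key. The records, the field names and the `make_key` helper are hypothetical and are not part of the toolkit; real linkage usually needs approximate comparisons, which the rest of the workshop covers.

```python
# Toy illustration only: link two small sources on a normalised (name, birth year) key.
# All records and field names below are invented for this example.
source_a = [
    {"id": "a1", "name": "Maria  Silva", "birth_year": 1980},
    {"id": "a2", "name": "John Smith", "birth_year": 1975},
]
source_b = [
    {"id": "b1", "name": "maria silva", "birth_year": 1980},
    {"id": "b2", "name": "Jane Doe", "birth_year": 1990},
]

def make_key(record):
    """Lower-case the name and collapse whitespace so trivial differences vanish."""
    return (" ".join(record["name"].lower().split()), record["birth_year"])

# Index source B by key, then look up each record of source A.
b_by_key = {}
for rec in source_b:
    b_by_key.setdefault(make_key(rec), []).append(rec["id"])

links = [(rec["id"], b_id)
         for rec in source_a
         for b_id in b_by_key.get(make_key(rec), [])]
print(links)
```

Running the same lookup against a single source, and ignoring each record's match with itself, would give a simple form of deduplication.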

# <center> Main features of the Python Record Linkage Toolkit

<ul>
    <li>Clean and standardise data with easy to use tools</li>
    <li>Make pairs of records with smart indexing methods such as blocking and sorted neighbourhood indexing</li>
    <li>Compare records with a large number of comparison and similarity measures for different types of variables such as strings, numbers and dates</li>
    <li>Classify record pairs with several algorithms, both supervised and unsupervised</li>
    <li>Common record linkage evaluation tools</li>
    <li>Several built-in datasets</li>
</ul>

# <center> Advantages and Disadvantages

### Advantages
<ul>
    <li>Understandability </li>
    <li>Usability </li>
    <li>Extensibility </li>
</ul>

### Disadvantages
<ul>
    <li>The Python Record Linkage Toolkit is NOT developed with speed in mind</li>
    <li>As a result, the toolkit is best suited to linking small or medium-sized files</li>
</ul>

# <center> Summary of this workshop

<ul>
  <li>Summary of Python Data Types</li>
  <li>Load and describe the datasets</li>
  <li>Make Record Pairs or Indexing</li>
  <li>Compare record pairs</li>
  <li>Classify Record Pairs</li>
</ul>



## Import relevant general modules

In [None]:
import numpy as np
import pandas as pd
import recordlinkage

In [None]:
import sys
print(sys.version)

print(np.__version__)
print(pd.__version__)
print(recordlinkage.__version__)

## Summary of Python Data Types

### Python Simple Data Types
##### Integers
##### Floats
##### Strings
##### Booleans

### Relevant Python Data Structures

### Lists

In [None]:
#Defines one small list and common operations on it
example_list = [2,4,'fg',8,3]

print(example_list[0])
print(example_list[2:4])

example_list[2]=20 # modifying element of the list
print(example_list)

### Numpy arrays

In [None]:
#Defines one small numpy array and common operations on it
#Mixing numbers with a string makes numpy store every element as a string
example_array = np.array([2,4,'4',8,10])

print(example_array[0])
print(example_array[2:4])

example_array[2]='20' # modifying an element of the array
print(example_array)

### Pandas Series
#### A one dimensional labeled array

In [None]:
#Defines one small pandas series and common operations on it
example_dictionary = {'A':20,'B':40,'C':60,'D':55}
example_series = pd.Series(example_dictionary)

print(example_series)
print(example_series.iloc[0]) # access by position; use example_series['A'] to access by label
print(example_series['B':])

### Pandas Dataframes
#### A two-dimensional labeled data structure with columns of potentially different types

In [None]:
#Creation of a dataframe with a list
aux=[['ds',1.0],
     ['as',3],
     ['bq',5]]

example_DF = pd.DataFrame(aux,index=['Row1','Row2','Row3'],columns=['Col1','Col2'])
example_DF

In [None]:
#Check types
print(type(example_DF))
example_DF.dtypes

## Load and describe the datasets

In [None]:
#Import function to use in order to load example datasets
from recordlinkage.datasets import load_febrl4

In [None]:
#Obtain the two dataframes
dfA, dfB = load_febrl4()

In [None]:
#Obtains the number of rows and columns of the first dataframe
dfA.shape

In [None]:
#Obtains the number of rows and columns of the second dataframe
dfB.shape

In [None]:
dfA.head()

In [None]:
dfB.head()

In [None]:
#Obtains the data type of each column of the dataframe
dfA.dtypes

In [None]:
#Provides a statistical summary of the dataframe
dfA.describe()

In [None]:
#Summarizes just the column 'surname'
dfA['surname'].describe()

In [None]:
#Exemplify use of clean and phonetic methods
from recordlinkage.standardise import clean, phonetic

dfA_aux = pd.DataFrame()
dfA_aux['given_name'] = clean(dfA['given_name']) # cleaned names
dfA_aux['given_name_soundex'] = phonetic(dfA_aux['given_name'], method = 'soundex') # Soundex code of the cleaned names
dfA_aux.head(10)
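
To give a feel for what a phonetic code does, here is a minimal pure-Python sketch of the classic American Soundex rules. It is an illustration written for this notebook, not the implementation used by the toolkit, and the function name `soundex` is our own.

```python
def soundex(name):
    """Simplified American Soundex: first letter plus three digits."""
    # Consonant groups that share a digit.
    codes = {}
    for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                           ("l", "4"), ("mn", "5"), ("r", "6")):
        for ch in letters:
            codes[ch] = digit

    letters = [c for c in name.lower() if c.isalpha()]
    if not letters:
        return ""

    result = letters[0].upper()
    previous = codes.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w do not separate letters with the same code
        code = codes.get(ch, "")
        if code and code != previous:
            result += code
        previous = code  # vowels reset the previous code to ""
    return (result + "000")[:4]  # pad with zeros, keep four characters

for example in ["Robert", "Rupert", "Ashcraft"]:
    print(example, soundex(example))
```

Names that sound alike, such as "Robert" and "Rupert", receive the same code, which is why phonetic codes are often used as blocking or comparison keys.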

## Make Record Pairs or Indexing

In [None]:
?recordlinkage.FullIndex

In [None]:
?recordlinkage.FullIndex.index

In [None]:
#Full index method
indexer = recordlinkage.FullIndex() # first load the FullIndex class
pairs = indexer.index(dfA, dfB) # then create all possible and unique record pairs

In [None]:
#Summarize pairs
print(type(pairs))
print(pairs[0],pairs[1])
print(len(pairs))
print(dfA.shape[0]*dfB.shape[0])
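
The two counts above agree because a full index is simply the Cartesian product of the two sets of record identifiers. The sketch below shows this with made-up identifiers; the file sizes in the last line are hypothetical and only illustrate how quickly the number of pairs grows.

```python
from itertools import product

# Made-up record identifiers for two small sources.
ids_a = ["a1", "a2", "a3"]
ids_b = ["b1", "b2"]

# A full index pairs every record of A with every record of B.
full_pairs = list(product(ids_a, ids_b))
print(full_pairs[:2])
print(len(full_pairs), len(ids_a) * len(ids_b))

def n_full_pairs(n_a, n_b):
    """Number of candidate pairs when linking files with n_a and n_b records."""
    return n_a * n_b

# Hypothetical sizes: two files of 100,000 records each.
print(n_full_pairs(100000, 100000))
```

This quadratic growth is the reason the indexing methods that follow try to discard unlikely pairs before any comparison is made.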

In [None]:
?recordlinkage.SortedNeighbourhoodIndex

In [None]:
#Neighbourhood method
indexer = recordlinkage.SortedNeighbourhoodIndex(on='given_name', window=3)
pairs = indexer.index(dfA, dfB)

print(pairs[0],pairs[1])
print(len(pairs)) 
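
The idea behind sorted neighbourhood indexing can be sketched in plain Python. The version below is an assumption about the general technique, not the toolkit's code: it ranks the distinct key values of both sources in sorted order and keeps a pair when the two ranks are at most `window // 2` apart. The names and identifiers are made up.

```python
def sorted_neighbourhood_pairs(keys_a, keys_b, window=3):
    """Pair records whose sorting-key values are at most window // 2 ranks apart."""
    # Rank every distinct key value from both sources in sorted order.
    distinct = sorted(set(keys_a.values()) | set(keys_b.values()))
    rank = {value: i for i, value in enumerate(distinct)}
    half = window // 2
    pairs = []
    # The nested loop keeps the sketch short; it is not meant to be efficient.
    for id_a, key_a in keys_a.items():
        for id_b, key_b in keys_b.items():
            if abs(rank[key_a] - rank[key_b]) <= half:
                pairs.append((id_a, id_b))
    return pairs

# Made-up given names keyed by record identifier.
names_a = {"a1": "anna", "a2": "john", "a3": "mark"}
names_b = {"b1": "anne", "b2": "jon", "b3": "zoe"}

candidate_pairs = sorted_neighbourhood_pairs(names_a, names_b, window=3)
print(candidate_pairs)
print(len(candidate_pairs), "pairs instead of", len(names_a) * len(names_b))
```

Because neighbouring values in the sorted order are paired, small spelling differences such as "john" and "jon" still produce a candidate pair. With `window=1` only identical key values are paired.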

In [None]:
?recordlinkage.BlockIndex

In [None]:
#Blocking method
indexer = recordlinkage.BlockIndex(on='given_name')
pairs = indexer.index(dfA, dfB)

print(pairs[0],pairs[1])
print(len(pairs)) # Notice the reduction in the number of pairs compared with the SortedNeighbourhoodIndex class
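
Blocking can be sketched in a few lines of plain Python: group the records of one source by the blocking key, then pair each record of the other source only with the records in its own block. The helper `block_pairs` and the data are hypothetical and only illustrate the general technique.

```python
from collections import defaultdict

def block_pairs(keys_a, keys_b):
    """Pair records of A and B that share exactly the same blocking key."""
    # Group the identifiers of B by key value.
    blocks = defaultdict(list)
    for id_b, key in keys_b.items():
        blocks[key].append(id_b)
    # Each record of A is paired only with the B records in its block.
    return [(id_a, id_b)
            for id_a, key in keys_a.items()
            for id_b in blocks.get(key, [])]

# Made-up given names keyed by record identifier.
names_a = {"a1": "michaela", "a2": "john", "a3": "michaela"}
names_b = {"b1": "michaela", "b2": "jon"}

blocked_pairs = block_pairs(names_a, names_b)
print(blocked_pairs)
```

The pair ("a2", "b2") is missing because "john" and "jon" are not identical: blocking reduces the number of pairs sharply, at the risk of losing true matches whose key contains an error.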

In [None]:
#Just a test to compare the names
print(dfA.loc['rec-1070-org','given_name'])
print(dfB.loc['rec-3024-dup-0','given_name'])
print(sum(dfA['given_name']=='michaela'))
print(sum(dfB['given_name']=='michaela'))

## Compare record pairs

In [None]:
#Create a Compare object
compare_cl = recordlinkage.Compare()

In [None]:
#Check exact and string methods for comparison of record attributes
?compare_cl.string

In [None]:
compare_cl.exact('given_name', 'given_name', label='given_name')
compare_cl.string('surname', 'surname', method='jarowinkler', threshold=0.85, label='surname')
compare_cl.exact('date_of_birth', 'date_of_birth', label='date_of_birth')
compare_cl.exact('suburb', 'suburb', label='suburb')
compare_cl.exact('state', 'state', label='state')
compare_cl.string('address_1', 'address_1', threshold=0.85, label='address_1')

features = compare_cl.compute(pairs, dfA, dfB)
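
The thresholded string comparisons above turn a similarity score into a 0/1 feature. The sketch below illustrates the idea with the Levenshtein edit distance instead of Jaro-Winkler. The normalisation (one minus the distance divided by the longer length) and the `>=` comparison are assumptions made for this example and may differ from what the toolkit does internally.

```python
def levenshtein(s, t):
    """Edit distance: minimum number of insertions, deletions and substitutions."""
    previous = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        current = [i]
        for j, ct in enumerate(t, start=1):
            current.append(min(previous[j] + 1,                # deletion
                               current[j - 1] + 1,             # insertion
                               previous[j - 1] + (cs != ct)))  # substitution
        previous = current
    return previous[-1]

def string_feature(s, t, threshold=0.85):
    """Return 1 if the normalised similarity reaches the threshold, else 0."""
    longest = max(len(s), len(t))
    similarity = 1.0 if longest == 0 else 1 - levenshtein(s, t) / longest
    return int(similarity >= threshold)

print(levenshtein("kitten", "sitting"))
print(string_feature("williamson", "williamsen"))
print(string_feature("smith", "smyth"))
```

Short strings are penalised heavily by this normalisation: one substitution in a five-letter surname already drops the similarity to 0.8, below the 0.85 threshold used in this notebook.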

In [None]:
#Check pairs at the top
features.head(50)

In [None]:
#Summarize results from comparing records
features.describe()

In [None]:
#Sum the comparison results
features.sum(axis=1).value_counts().sort_index(ascending=False)

In [None]:
#Obtain a subselection of the features dataframe with large scores
features[features.sum(axis=1)>5].head()

## Classify Record Pairs

In [None]:
#Import another dataset
from recordlinkage.datasets import load_krebsregister

In [None]:
?load_krebsregister

In [None]:
krebs_data, krebs_match = load_krebsregister(missing_values = 0)
krebs_data.head()

In [None]:
krebs_data.shape

In [None]:
krebs_data.describe()

In [None]:
krebs_data.index[0:2].values

In [None]:
len(krebs_match)

In [None]:
krebs_match.values

#### Logistic Regression - Supervised Approach

In [None]:
#Set training data for the supervised approach
krebs_data_train = krebs_data[0:5000]
#Keep only the true matches that fall inside the training pairs
krebs_match_train = krebs_data_train.index.intersection(krebs_match)

In [None]:
?recordlinkage.LogisticRegressionClassifier.learn

In [None]:
#Initialize the classifier
logreg = recordlinkage.LogisticRegressionClassifier()

#Train the classifier
lr = logreg.learn(krebs_data_train, krebs_match_train)
print("Intercept: ", logreg.intercept)
print("Coefficients: ", logreg.coefficients)

In [None]:
?logreg.predict

In [None]:
#Set test data for the supervised approach
krebs_data_test = krebs_data[5000:]
#Keep only the true matches that fall inside the test pairs
krebs_match_test = krebs_data_test.index.intersection(krebs_match)

#Predict the match status for all record pairs
result_logreg = logreg.predict(krebs_data_test)

#Predict the probability for all record pairs
result_logreg_prob = logreg.prob(krebs_data_test)

In [None]:
print(len(result_logreg))
print(len(result_logreg_prob))

In [None]:
result_logreg_prob = result_logreg_prob.sort_values(ascending=False)
result_logreg_prob.head()

In [None]:
result_logreg_prob.tail()

In [None]:
?recordlinkage.confusion_matrix

In [None]:
conf_logreg = recordlinkage.confusion_matrix(krebs_match_test, result_logreg, len(krebs_data_test))
conf_logreg

In [None]:
recordlinkage.fscore(conf_logreg)
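
The quantities behind the confusion matrix and the F-score can be computed by hand from sets of record pairs. The sketch below uses made-up pairs and our own helper names. The order of the returned counts is a choice made for this example and should not be read as the layout the toolkit returns.

```python
def confusion_counts(true_links, predicted_links, total_pairs):
    """Return (TP, FN, FP, TN) for a set of true and predicted links."""
    true_links, predicted_links = set(true_links), set(predicted_links)
    tp = len(true_links & predicted_links)   # matches that were found
    fn = len(true_links - predicted_links)   # matches that were missed
    fp = len(predicted_links - true_links)   # non-matches predicted as matches
    tn = total_pairs - tp - fn - fp          # everything else
    return tp, fn, fp, tn

def f_score(tp, fn, fp):
    """Harmonic mean of precision and recall (0.0 when there are no true positives)."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Made-up example: 16 candidate pairs, 3 true matches, 3 predicted matches.
true_links = {(1, 1), (2, 2), (3, 3)}
predicted_links = {(1, 1), (2, 2), (4, 5)}

tp, fn, fp, tn = confusion_counts(true_links, predicted_links, total_pairs=16)
print(tp, fn, fp, tn)
print(f_score(tp, fn, fp))
```

The true negatives do not enter the F-score, which makes it a convenient summary in record linkage, where non-matching pairs vastly outnumber matching ones.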

#### Expectation-Maximization algorithm - Unsupervised approach

In [None]:
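#Binarize the similarity scores (1 if above 0.8, otherwise 0) and train the classifier
#No match labels are passed, so the approach is unsupervised
ecm = recordlinkage.ECMClassifier()
result_ecm = ecm.learn((krebs_data > 0.8).astype(int))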

len(result_ecm)

In [None]:
conf_ecm = recordlinkage.confusion_matrix(krebs_match, result_ecm, len(krebs_data))
conf_ecm

In [None]:
# The F-score for this classification is
recordlinkage.fscore(conf_ecm)