## Python Tools for Record Linking and Fuzzy Matching

This notebook accompanies the [article](https://pbpython.com/record-linking.html) on Practical Business Python.

It relies on [fuzzymatcher](https://github.com/RobinL/fuzzymatcher) and the [Python Record Linkage Toolkit](https://recordlinkage.readthedocs.io/en/latest/about.html).


In [1]:
pip install fuzzymatcher

Collecting fuzzymatcher
  Downloading fuzzymatcher-0.0.6-py3-none-any.whl (15 kB)
Collecting metaphone
  Downloading Metaphone-0.6.tar.gz (14 kB)
Collecting rapidfuzz
  Downloading rapidfuzz-2.6.0-cp39-cp39-win_amd64.whl (1.4 MB)
Collecting python-Levenshtein
  Downloading python-Levenshtein-0.12.2.tar.gz (50 kB)
Collecting jarowinkler<2.0.0,>=1.2.0
  Downloading jarowinkler-1.2.1-cp39-cp39-win_amd64.whl (61 kB)
Building wheels for collected packages: metaphone, python-Levenshtein
  Building wheel for metaphone (setup.py): started
  Building wheel for metaphone (setup.py): finished with status 'done'
  Created wheel for metaphone: filename=Metaphone-0.6-py3-none-any.whl size=13901 sha256=04001d4149e5e29bd968c927b0d367bd471fc466add6aa865a7160ef783f8e90
  Stored in directory: c:\users\cti110016\appdata\local\pip\cache\wheels\b2\9e\d9\26be7687b8fe36cd6cacbec34e825a3dbcd3bae54017cfb385
  Building wheel for python-Levenshtein (setup.py): started
  Building wheel for python-Levenshtein (setu

  ERROR: Command errored out with exit status 1:
   command: 'C:\Users\cti110016\Anaconda3\python.exe' -u -c 'import io, os, sys, setuptools, tokenize; sys.argv[0] = '"'"'C:\\Users\\cti110016\\AppData\\Local\\Temp\\pip-install-9jzk5vgx\\python-levenshtein_30ba0467f5294a7296f6acc1a0aab2d2\\setup.py'"'"'; __file__='"'"'C:\\Users\\cti110016\\AppData\\Local\\Temp\\pip-install-9jzk5vgx\\python-levenshtein_30ba0467f5294a7296f6acc1a0aab2d2\\setup.py'"'"';f = getattr(tokenize, '"'"'open'"'"', open)(__file__) if os.path.exists(__file__) else io.StringIO('"'"'from setuptools import setup; setup()'"'"');code = f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' bdist_wheel -d 'C:\Users\cti110016\AppData\Local\Temp\pip-wheel-78wrttx0'
       cwd: C:\Users\cti110016\AppData\Local\Temp\pip-install-9jzk5vgx\python-levenshtein_30ba0467f5294a7296f6acc1a0aab2d2\
  Complete output (28 lines):
  running bdist_wheel
  running build
  running build_py
  cr

In [3]:
import pandas as pd
from pathlib import Path
import fuzzymatcher
import recordlinkage

ModuleNotFoundError: No module named 'fuzzymatcher'
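
The import error above appears to follow from the failed `pip install` in the first cell. As a minimal sketch, a check like the one below lists which packages are missing before the rest of the notebook runs. The helper name `missing_packages` is hypothetical and not part of either library.

```python
import importlib.util


def missing_packages(names):
    """Return the top-level package names that cannot be found on the import path."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Packages this notebook imports
required = ["pandas", "fuzzymatcher", "recordlinkage"]
missing = missing_packages(required)
if missing:
    print("Install before continuing:", ", ".join(missing))
```

If the list is non-empty, the later cells will fail in the same way as the import cell.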

### Example using fuzzymatcher

In [None]:
hospital_accounts = pd.read_csv(
    'https://github.com/chris1610/pbpython/raw/master/data/hospital_account_info.csv'
)
hospital_reimbursement = pd.read_csv(
    'https://raw.githubusercontent.com/chris1610/pbpython/master/data/hospital_reimbursement.csv'
)

In [None]:
hospital_accounts.head()

In [None]:
hospital_reimbursement.head()

In [None]:
# Columns to match on from df_left
left_on = ["Facility Name", "Address", "City", "State"]

# Columns to match on from df_right
right_on = [
    "Provider Name", "Provider Street Address", "Provider City",
    "Provider State"
]

In [None]:
# Now perform the match
# It will take several minutes to run on this data set
matched_results = fuzzymatcher.fuzzy_left_join(hospital_accounts,
                                               hospital_reimbursement,
                                               left_on,
                                               right_on,
                                               left_id_col='Account_Num',
                                               right_id_col='Provider_Num')

In [None]:
matched_results.head()

In [None]:
# Reorder the columns to make viewing easier
cols = [
    "best_match_score", "Facility Name", "Provider Name", "Address", "Provider Street Address",
    "Provider City", "City", "Provider State", "State"
]

In [None]:
# Let's see the best matches
matched_results[cols].sort_values(by=['best_match_score'], ascending=False).head(5)

In [None]:
# Now the worst matches
matched_results[cols].sort_values(by=['best_match_score'],
                                  ascending=True).head(5)

In [None]:
# Look at the matches around 1
matched_results[cols].query("best_match_score <= 1").sort_values(
    by=['best_match_score'], ascending=False).head(10)

In [None]:
# Now the matches scoring 0.80 or below
matched_results[cols].query("best_match_score <= .80").sort_values(
    by=['best_match_score'], ascending=False).head(5)

### Example using Python Record Linkage Toolkit

In [None]:
# Re-read in the data using the index_col
hospital_accounts = pd.read_csv(
    'https://github.com/chris1610/pbpython/raw/master/data/hospital_account_info.csv',
    index_col='Account_Num'
)
hospital_reimbursement = pd.read_csv(
    'https://raw.githubusercontent.com/chris1610/pbpython/master/data/hospital_reimbursement.csv',
    index_col='Provider_Num'
)

In [None]:
hospital_accounts.head()

In [None]:
hospital_reimbursement.head()

In [None]:
# Build the indexer
indexer = recordlinkage.Index()
# Alternatives: a full index (every pair) or exact blocking on a key
#indexer.full()
#indexer.block(left_on='State', right_on='Provider State')

# Sorted neighbourhood is a good option when the data is not clean
indexer.sortedneighbourhood(left_on='State', right_on='Provider State')
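
To illustrate why the choice of index matters, here is a conceptual sketch in plain Python of how a full index and a blocked index differ in the number of candidate pairs they produce. It is not how `recordlinkage` is implemented; the toy records and the function names `full_pairs` and `blocked_pairs` are made up for illustration.

```python
from collections import defaultdict
from itertools import product

# Hypothetical toy records: (id, state)
accounts = [(1, "WI"), (2, "WI"), (3, "AL")]
providers = [(10, "WI"), (20, "AL"), (30, "AL")]


def full_pairs(left, right):
    """Pair every left record with every right record."""
    return [(l_id, r_id) for (l_id, _), (r_id, _) in product(left, right)]


def blocked_pairs(left, right):
    """Pair only the records that share exactly the same state."""
    by_state = defaultdict(list)
    for r_id, r_state in right:
        by_state[r_state].append(r_id)
    return [(l_id, r_id)
            for l_id, l_state in left
            for r_id in by_state.get(l_state, [])]


print(len(full_pairs(accounts, providers)))     # 9
print(len(blocked_pairs(accounts, providers)))  # 4
```

Sorted neighbourhood indexing sits between the two: it also pairs records whose key values are close in sort order, which tolerates small errors in the key at the cost of more candidates than exact blocking.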

In [None]:
candidates = indexer.index(hospital_accounts, hospital_reimbursement)

In [None]:
# Let's see how many candidate pairs will be compared
print(len(candidates))

In [None]:
# Takes 3 minutes using the full index.
# 14s using sorted neighbor
# 7s using blocking
compare = recordlinkage.Compare()
compare.exact('City', 'Provider City', label='City')
compare.string('Facility Name',
               'Provider Name',
               threshold=0.85,
               label='Hosp_Name')
compare.string('Address',
               'Provider Street Address',
               method='jarowinkler',
               threshold=0.85,
               label='Hosp_Address')
features = compare.compute(candidates, hospital_accounts,
                           hospital_reimbursement)
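
The `threshold` argument turns each string similarity into a 0 or 1 flag: values at or above the threshold become 1, everything else 0. The sketch below shows that idea using `difflib.SequenceMatcher` from the standard library as a stand-in similarity measure. This is an assumption for illustration; the toolkit uses its own string metrics (for example `jarowinkler`), so the actual scores will differ.

```python
from difflib import SequenceMatcher


def thresholded_similarity(a, b, threshold=0.85):
    """Return 1.0 if the strings are at least `threshold` similar, else 0.0.

    SequenceMatcher is a stand-in here, not the metric recordlinkage uses.
    """
    ratio = SequenceMatcher(None, a.lower(), b.lower()).ratio()
    return 1.0 if ratio >= threshold else 0.0


print(thresholded_similarity("Main St", "main st"))  # 1.0 (identical after lowercasing)
print(thresholded_similarity("abc", "xyz"))          # 0.0 (no characters in common)
```

Because every comparison yields 0 or 1, the row sums of `features` count how many fields agree for each candidate pair.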

In [None]:
features

In [None]:
# What are the score totals?
features.sum(axis=1).value_counts().sort_index(ascending=False)

In [None]:
# Get the potential matches
potential_matches = features[features.sum(axis=1) > 1].reset_index()
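
The filter above keeps only the pairs where more than one field agrees. As a small sketch of the same logic without pandas, using made-up flags keyed by a hypothetical `(Account_Num, Provider_Num)` pair:

```python
# Hypothetical comparison flags for three candidate pairs
toy_features = {
    (1, 10): {"City": 1.0, "Hosp_Name": 1.0, "Hosp_Address": 1.0},
    (1, 20): {"City": 1.0, "Hosp_Name": 0.0, "Hosp_Address": 0.0},
    (2, 30): {"City": 0.0, "Hosp_Name": 1.0, "Hosp_Address": 1.0},
}


def keep_potential_matches(features, min_score=1):
    """Sum the flags per pair and keep pairs scoring above `min_score`."""
    scored = {pair: sum(flags.values()) for pair, flags in features.items()}
    return {pair: score for pair, score in scored.items() if score > min_score}


print(keep_potential_matches(toy_features))
# {(1, 10): 3.0, (2, 30): 2.0}
```

Raising `min_score` trades recall for precision: fewer pairs to review, but a higher chance of dropping a true match that differs in several fields.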

In [None]:
potential_matches['Score'] = potential_matches.loc[:, 'City':'Hosp_Address'].sum(axis=1)
potential_matches.head()

In [None]:
hospital_accounts.loc[51216,:]

In [None]:
hospital_reimbursement.loc[268781,:]

In [None]:
# Add some convenience columns for comparing data
hospital_accounts['Acct_Name_Lookup'] = hospital_accounts[[
    'Facility Name', 'Address', 'City', 'State'
]].apply(lambda x: '_'.join(x), axis=1)

In [None]:
hospital_reimbursement['Reimbursement_Name_Lookup'] = hospital_reimbursement[[
    'Provider Name', 'Provider Street Address', 'Provider City',
    'Provider State'
]].apply(lambda x: '_'.join(x), axis=1)

In [None]:
reimbursement_lookup = hospital_reimbursement[['Reimbursement_Name_Lookup']].reset_index()
account_lookup = hospital_accounts[['Acct_Name_Lookup']].reset_index()

In [None]:
account_lookup.head()

In [None]:
reimbursement_lookup.head()

In [None]:
account_merge = potential_matches.merge(account_lookup, on='Account_Num', how='left')

In [None]:
account_merge.head()

In [None]:
reimbursement_lookup.head()

In [None]:
# Let's build a dataframe to compare
final_merge = account_merge.merge(reimbursement_lookup, on='Provider_Num', how='left')

In [None]:
cols = [
    'Account_Num', 'Provider_Num', 'Score', 'Acct_Name_Lookup',
    'Reimbursement_Name_Lookup'
]

In [None]:
final_merge[cols].sort_values(by=['Account_Num', 'Score'], ascending=False)

In [None]:
# If you need to save it to Excel:
#final_merge.sort_values(by=['Account_Num', 'Score'],
#                        ascending=False).to_excel('merge_list.xlsx',
#                                                  index=False)

In [None]:
final_merge[final_merge['Account_Num']==11035][cols]

In [None]:
final_merge[final_merge['Account_Num']==56375][cols]

### Dedupe the data

In [None]:
hospital_dupes = pd.read_csv(
    'https://github.com/chris1610/pbpython/raw/master/data/hospital_account_dupes.csv',
    index_col='Account_Num')

In [None]:
hospital_dupes.head()

In [None]:
# Deduping follows the same process, but uses a single dataframe
dupe_indexer = recordlinkage.Index()
dupe_indexer.sortedneighbourhood(left_on='State')
dupe_candidate_links = dupe_indexer.index(hospital_dupes)
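
With a single dataframe, candidate pairs are drawn from within the same set of records, so each pair appears once and no record is paired with itself. The sketch below shows this with exact blocking on a key; it is a conceptual illustration with made-up records and a hypothetical `dedupe_pairs` helper, not the toolkit's sorted neighbourhood algorithm.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical toy records: (id, state)
records = [(101, "WI"), (102, "WI"), (103, "AL"), (104, "WI")]


def dedupe_pairs(rows):
    """Return each unordered pair of ids that share a state exactly once."""
    by_state = defaultdict(list)
    for rec_id, state in rows:
        by_state[state].append(rec_id)
    pairs = []
    for ids in by_state.values():
        pairs.extend(combinations(ids, 2))
    return pairs


print(dedupe_pairs(records))
# [(101, 102), (101, 104), (102, 104)]
```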


In [None]:
# Comparison step
compare_dupes = recordlinkage.Compare()
compare_dupes.string('City', 'City', threshold=0.85, label='City')
compare_dupes.string('Phone Number',
                     'Phone Number',
                     threshold=0.85,
                     label='Phone_Num')
compare_dupes.string('Facility Name',
                     'Facility Name',
                     threshold=0.80,
                     label='Hosp_Name')
compare_dupes.string('Address',
                     'Address',
                     threshold=0.85,
                     label='Hosp_Address')
dupe_features = compare_dupes.compute(dupe_candidate_links, hospital_dupes)

In [None]:
dupe_features

In [None]:
dupe_features.sum(axis=1).value_counts().sort_index(ascending=False)

In [None]:
potential_dupes = dupe_features[dupe_features.sum(axis=1) > 2].reset_index()
potential_dupes['Score'] = potential_dupes.loc[:, 'City':'Hosp_Address'].sum(axis=1)

In [None]:
potential_dupes.sort_values(by=['Score'], ascending=True)

In [None]:
# Take a look at one of the potential duplicates
hospital_dupes[hospital_dupes.index.isin([51567, 41166])]