<a href="https://colab.research.google.com/github/jembi/mpi-toolkit-notebook/blob/main/fastLink-notebook/Linking/FastLinkRecordLinking.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **<font color="#007f68"> fastLink Record Linkage**


### fastLink: Fast Probabilistic Record Linkage with Missing Data
Implements a Fellegi-Sunter probabilistic record linkage model that allows for missing data and the inclusion of auxiliary information. This includes functionalities to conduct a merge of two datasets under the Fellegi-Sunter model using the Expectation-Maximization algorithm. In addition, tools for preparing, adjusting, and summarizing data merges are included.

The package implements the methods described in Enamorado, Fifield, and Imai (2019), "Using a Probabilistic Model to Assist Merging of Large-scale Administrative Records", American Political Science Review, and is available at <http://imai.fas.harvard.edu/research/linkage.html>.
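
To make the model a little more concrete, here is a minimal Python sketch of the posterior match probability for one record pair under the Fellegi-Sunter model with conditionally independent fields. It is not fastLink's implementation: the function name `match_probability`, the binary agreement coding, and all numeric values are assumptions chosen for illustration. Fields whose comparison is missing are simply left out of the likelihood, which is the idea behind the package's handling of missing data.

```python
from math import prod

def match_probability(gamma, m, u, lam):
    """Posterior probability that a record pair is a match.

    gamma: per-field agreement (1 = agree, 0 = disagree, None = missing)
    m, u:  per-field agreement probabilities among matches / non-matches
    lam:   prior share of matching pairs
    """
    def likelihood(p):
        # Fields with a missing comparison contribute nothing to the product.
        return prod(
            p_k if g == 1 else 1.0 - p_k
            for g, p_k in zip(gamma, p)
            if g is not None
        )

    matched = lam * likelihood(m)
    unmatched = (1.0 - lam) * likelihood(u)
    return matched / (matched + unmatched)

# Hypothetical parameters for three fields (illustration only).
m = [0.95, 0.90, 0.85]
u = [0.01, 0.05, 0.10]

full_agree = match_probability([1, 1, 1], m, u, lam=0.1)
with_missing = match_probability([1, None, 1], m, u, lam=0.1)
mostly_disagree = match_probability([0, 0, 1], m, u, lam=0.1)
print(full_agree, with_missing, mostly_disagree)
```

In practice the `m`, `u` and `lam` values are not supplied by hand; the Expectation-Maximization algorithm estimates them from the data.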

## <font color="#007f68">Links:
[Directory Notebook](https://colab.research.google.com/drive/1TqQ5sklvhw8I1f5m49ob-WTFfBtiElfs#scrollTo=M5NmyYErfDvt)

[Data Generator](https://colab.research.google.com/drive/1f1nnThx7sV47R_bbHq8CIlBfTVrcM2NA#scrollTo=9cXT_RyoLtv3%23offline%3Dtrue&sandboxMode=true)


## **<font color="#007f68">1. Installing rpy2** 
rpy2 runs an embedded R and provides access to it from Python through R's own C API, using either:

*   a high-level interface that makes R functions and objects behave like Python functions and converts seamlessly to numpy and pandas data structures
*   a low-level interface closer to the C API

In [None]:
!pip install rpy2

## **<font color="#007f68">2. Import Python packages**

In [None]:
import pandas as pd
import logging
import os

from rpy2.robjects import globalenv
from rpy2.robjects.vectors import StrVector
import rpy2.robjects as r_objects
import rpy2.robjects.packages as r_packages

r = r_objects.r

from ipywidgets import Dropdown
from ipywidgets import FloatSlider
from ipywidgets import Text
from tqdm.auto import tqdm

logging.basicConfig(format='%(asctime)s %(levelname)s:%(message)s', level=logging.INFO)
logging.info('Started')

## **<font color="#007f68">3. Import R packages**

In [None]:
utils = r_packages.importr('utils')
utils.chooseCRANmirror(ind=1)
pack_names = ('fastLink', 'tictoc', 'strex', 'data.table', 'csv', 'stringr')
names_to_install = [x for x in pack_names if not r_packages.isinstalled(x)]
if len(names_to_install) > 0:
  utils.install_packages(StrVector(names_to_install))

# Import the packages whether or not they had to be installed first.
base = r_packages.importr('base')
stats = r_packages.importr('stats')
fastLink = r_packages.importr('fastLink')
strex = r_packages.importr('strex')
data_table = r_packages.importr('data.table')
stringr = r_packages.importr('stringr')

## **<font color="#007f68">4. Import/Upload files**

### <font color="#18CF68">Mount Google Drive

In [None]:
from google.colab import drive
drive.mount('/content/drive')

### <font color="#18CF68">Upload File From Your Computer

In [None]:
from google.colab import files
uploaded = files.upload()


### <font color="#18CF68">Reading a csv file and splitting it into 2 dataframes
The two dataframes are:

*   The reference records
*   The duplicate records
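
As a rough illustration of this split outside R, the sketch below does the same thing in plain Python: empty cells become missing values, and rows are routed by whether their `ID` contains `-aaa-` (reference) or `-bbb-` (duplicate), the same markers the notebook filters on. The sample rows, the column names other than `ID`, and the helper name `split_records` are hypothetical.

```python
import csv
import io

# Hypothetical sample data; the real file comes from the upload step.
SAMPLE = """ID,name,city
rec-1-aaa-0,Thandi,Cape Town
rec-1-bbb-0,Thandy,
rec-2-aaa-0,Sipho,Durban
"""

def split_records(text):
    """Return (reference_rows, duplicate_rows), with empty strings as None."""
    rows = [
        {col: (val if val != "" else None) for col, val in row.items()}
        for row in csv.DictReader(io.StringIO(text))
    ]
    reference = [row for row in rows if "-aaa-" in row["ID"]]
    duplicates = [row for row in rows if "-bbb-" in row["ID"]]
    return reference, duplicates

reference, duplicates = split_records(SAMPLE)
print(len(reference), "reference rows,", len(duplicates), "duplicate rows")
```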


 

In [7]:
filename = next(iter(uploaded))
globalenv['csv'] = r['read.csv'](filename, header=True, stringsAsFactors=False)
r('csv[csv==""] <- NA')
col_names_r = r('colnames(csv)')
col_names = list(col_names_r)
r('dfA          <- csv[str_detect(csv$ID, "-aaa-"), ]')
r('dfB          <- csv[str_detect(csv$ID, "-bbb-"), ]')
s = r('structure(list(csv = csv, dfA = dfA, dfB = dfB))')

menu_1 = Dropdown(description="Choose your unique identifier", options=col_names)
menu_2 = Dropdown(description="Choose your string distance algorithm", options=["Jaro-Winkler", "Levenshtein"])
slider_1 = FloatSlider(description="cut.a:", value=0.94, min=0, max=1, step=0.01)
slider_2 = FloatSlider(description="cut.p:", value=0.88, min=0, max=1, step=0.01)
display(menu_1, menu_2, slider_1, slider_2)


Dropdown(description='Choose your unique identifier', options=('ID', 'hivCaseReportNumber', 'name', 'fathersNa…

Dropdown(description='Choose your string distance algorithm', options=('Jaro-Winkler', 'Levenshtein'), value='…

FloatSlider(value=0.94, description='cut.a:', max=1.0, step=0.01)

FloatSlider(value=0.88, description='cut.p:', max=1.0, step=0.01)

## **<font color="#007f68">5. Get user input**
### <font color="#18CF68">Jaro-Winkler
The Jaro–Winkler distance is a string metric measuring an edit distance between two sequences.
The lower the Jaro–Winkler distance for two strings is, the more similar the strings are. The score is normalized such that 0 means an exact match and 1 means there is no similarity. 
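
The definition above can be sketched in pure Python as follows. This is an illustrative implementation, not the routine fastLink calls internally; the prefix scaling factor `p=0.1` and the four-character prefix limit are the commonly used defaults and are assumptions here.

```python
def jaro(s1, s2):
    """Jaro similarity in [0, 1]; 1 means the strings are identical."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters count as matching only if they are at most `window` apart.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Transpositions: matched characters that appear in a different order.
    seq1 = [c for c, hit in zip(s1, matched1) if hit]
    seq2 = [c for c, hit in zip(s2, matched2) if hit]
    transpositions = sum(a != b for a, b in zip(seq1, seq2)) // 2
    return (matches / len1 + matches / len2
            + (matches - transpositions) / matches) / 3

def jaro_winkler_distance(s1, s2, p=0.1):
    """Jaro-Winkler distance: 0 is an exact match, 1 is no similarity."""
    similarity = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    # Reward a shared prefix of up to four characters.
    similarity += prefix * p * (1.0 - similarity)
    return 1.0 - similarity

print(round(jaro_winkler_distance("MARTHA", "MARHTA"), 4))
```

For the classic pair "MARTHA" / "MARHTA", the Jaro similarity is 17/18 and the shared prefix "MAR" raises the similarity to about 0.9611, giving a distance of about 0.0389.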

### <font color="#18CF68">Levenshtein
The Levenshtein distance is a string metric for measuring the difference between two sequences. Informally, the Levenshtein distance between two words is the minimum number of single-character edits (insertions, deletions or substitutions) required to change one word into the other.
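
A minimal Python sketch of the edit count follows, together with one plausible way thresholds such as `cut.a` and `cut.p` can turn a distance into an agreement level. The normalisation (one minus the distance divided by the longer string's length) and the helper name `agreement_level` are assumptions for illustration; the three-level outcome is shown for completeness even though the linkage call in this notebook does not enable partial matching.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions or substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute, or keep if equal
        prev = curr
    return prev[-1]

def agreement_level(a, b, cut_a=0.94, cut_p=0.88):
    """Classify a pair of strings by normalised similarity (assumed scheme)."""
    longest = max(len(a), len(b))
    similarity = 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest
    if similarity >= cut_a:
        return "agree"
    if similarity >= cut_p:
        return "partial"
    return "disagree"

print(levenshtein("kitten", "sitting"))
print(agreement_level("johannesen", "johannesan"))
```

With these defaults a single typo in a ten-character name still counts as a partial agreement, while the same typo in a five-character name falls below both thresholds.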


In [None]:
key = menu_1.value
string_distance = menu_2.value
if string_distance == "Jaro-Winkler":
  string_distance = "jw"
else:
  string_distance = "lv"
cut_a = slider_1.value
cut_p = slider_2.value
print(key)
print(string_distance)
print(cut_a)
print(cut_p)

## **<font color="#007f68">6. Record Linkage using fastLink**

In [9]:
logging.info("FastLink : initialized")

# Handles to the two R data frames created in section 4.
df_a = r('dfA')
df_b = r('dfB')

def fl_link(df_a, df_b, key, string_distance, cut_a, cut_p):
    # The R source is a str.format template: doubled braces are literal R braces,
    # and the numbered fields receive key, string_distance, cut_a and cut_p.
    get_links = r('''
        my_fl_link <- function(dfA, dfB) {{
            # Collapse a vector of indices into one sorted, comma-separated string.
            pasteT <- function(x) {{
                x <- sort(x)
                x <- paste(x, collapse = ",")
                x
            }}

            # Match on every column except the first one.
            varnames <- colnames(dfA[,-1])
            #varnames <- varnames[-which(varnames %in% c('{0}'))]
            print(varnames)
            fl_out <- fastLink(dfA = dfA, dfB = dfB, varnames = varnames,
                                stringdist.match = varnames, stringdist.method = '{1}', cut.a = {2}, cut.p = {3},
                                dedupe.matches = FALSE, linprog.dedupe = FALSE,
                                cond.indep = TRUE,
                                n.cores = 8,
                                verbose = TRUE)
            # One row per matched dfA index, listing all of its matched dfB indices.
            inds_ab <- data.table(cbind(fl_out$matches$inds.a, fl_out$matches$inds.b))
            inds_ab[, `:=`(V3, pasteT(V2)), by = V1]
            inds_ab <- inds_ab[,.(V1, V3)]
            inds_ab <- inds_ab[!duplicated(inds_ab)]
            setnames(inds_ab, 'V3', 'V2')
            structure(list(fl_out = fl_out, inds_ab = inds_ab))
        }}'''.format(key, string_distance, cut_a, cut_p))
    return get_links(df_a, df_b)

2021-08-02 13:18:04,928 INFO:FastLink : initialized
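
The last step of `my_fl_link` collapses the matched index pairs so that each `dfA` index appears once, alongside a sorted, comma-separated list of the `dfB` indices it was linked to. A small pure-Python sketch of that grouping is shown below; the helper name `collapse_matches` and the index values are hypothetical.

```python
from collections import defaultdict

def collapse_matches(inds_a, inds_b):
    """Group dfB indices by dfA index and join each group as 'i,j,k'."""
    grouped = defaultdict(list)
    for a, b in zip(inds_a, inds_b):
        grouped[a].append(b)
    return {a: ",".join(str(b) for b in sorted(bs)) for a, bs in grouped.items()}

# Hypothetical match pairs: row 1 of dfA is linked to rows 7 and 3 of dfB.
links = collapse_matches([1, 1, 2, 4], [7, 3, 5, 9])
print(links)

# Expanding one entry back into integer indices.
expanded = tuple(map(int, links[1].split(",")))
print(expanded)
```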


## **<font color="#007f68">7. Display fastLink Logs to the console**

In [10]:
def analytics(process):
  # `process` is the name of the R global that holds the result of fl_link().
  result = globalenv[process]
  fl_out = result.rx2('fl_out')
  em = fl_out.rx2('EM')
  varnames = tuple(em.rx2('varnames'))

  logging.info('%s: %s', process, tuple(result.names))
  logging.info('fl_out: %s', tuple(fl_out.names))
  logging.info('matches: %s', tuple(fl_out.rx2('matches').names))
  logging.info('EM: %s', tuple(em.names))
  logging.info('patterns.w: %s', tuple(em.rx2('patterns.w').names[1]))
  logging.info('varnames: %s', varnames)
  logging.info('patterns: %s', tuple(fl_out.rx2('patterns').names))
  logging.info('inds_ab: %s', tuple(result.rx2('inds_ab').names))
  # The posterior probability of a pair matching.
  logging.info('p.m: %f', em.rx2('p.m')[0])
  # The posterior probability of a pair not matching.
  logging.info('p.u: %f', em.rx2('p.u')[0])

  def field_probabilities(name):
    # One row per field: [log format, field name, first value, second value, their sum].
    fmt = name + ' ----- %-20s : %3.10f  %3.10f  %3.10f'
    gamma = em.rx2(name)
    rows = []
    # NOTE: this range skips the first and last entries of the list.
    for i in range(1, len(gamma) - 1):
      p0, p1 = gamma[i][0], gamma[i][1]
      # abs() leaves the values unchanged because they are probabilities.
      rows.append([fmt, varnames[i], abs(p0), abs(p1), p0 + p1])
      logging.info(fmt, *rows[-1][1:])
    return rows

  # The posterior of the matching probability for a specific matching field.
  logging.info("EM.p.gamma.k.m")
  em_p_gamma_k_m = field_probabilities('p.gamma.k.m')
  # The posterior of the non-matching probability for a specific matching field.
  logging.info("EM.p.gamma.k.u")
  em_p_gamma_k_u = field_probabilities('p.gamma.k.u')
  logging.info('iter.converge: %d', em.rx2('iter.converge')[0])
  return varnames, em_p_gamma_k_m, em_p_gamma_k_u

## **<font color="#007f68">8. Linking process**

In [None]:
globalenv['links'] = fl_link(df_a, df_b, key, string_distance, cut_a, cut_p)
log_info = analytics('links')
varnames = log_info[0]
em_p_gamma_k_m = log_info[1]
em_p_gamma_k_u = log_info[2]

v1 = tuple(map(int, r('links$inds_ab$V1')))
v2 = tuple(r('links$inds_ab$V2'))
false_positives = 0
true_positives = 0

fields = ('ID',) + varnames
left = pd.DataFrame(columns=('key',) + fields)
right = pd.DataFrame(columns=('key',) + fields)

p_df = pd.read_csv(filename)

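# Add a sequential row key. A separate name is used so that the identifier
# chosen in section 5 (`key`) is not overwritten.
row_keys = list(range(len(p_df)))
p_df.insert(0, "key", row_keys, True)
for field in fields:
  p_df[field] = p_df[field].astype(str)

k = 0  # next row label in `right`
for i in range(len(p_df)):
  left.loc[i] = p_df.loc[i]
  for b in range(len(v1)):
    # v1 holds 1-based row indices into dfA; comparing them with the file row
    # number assumes the dfA rows come first in the uploaded file.
    if int(v1[b]) == (i + 1):
      # v2 holds the matched dfB row indices as a comma-separated string.
      dupe_links = tuple(map(int, str(v2[b]).split(',')))
      for dupe_index in dupe_links:
        # fastLink reports inds.b relative to dfB, so look the row up there.
        dup = r('dfB[{},]'.format(dupe_index))
        right.loc[k] = (i,) + tuple(map(lambda x: str(dup.rx2(x)[0]), fields))
        k = k + 1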

# Omit R row names so that the first data column read back by pandas is ID.
r('write.csv(dfB, file="df_b.csv", row.names=FALSE)')
p_df_b = pd.read_csv('df_b.csv')

try:
  os.remove('redCapData.csv')
except OSError:
  pass

key_1 = []
for k in range(len(p_df_b)):
  key_1.append(k)
p_df_b.insert(0, "key", key_1, True)
len_left = len(p_df)

for h in range(len(p_df_b)):
  count = 0
  for g in range(len(right)):
    if p_df_b.values[h][1] == right.values[g][1]:
      count = 1
  if count == 0:
    left.loc[len_left] = p_df_b.loc[h]
    len_left = len_left + 1



## **<font color="#007f68">9. Displaying m and u values**

In [None]:
for weight, label in ((em_p_gamma_k_m, "Matches"), (em_p_gamma_k_u, "Non-matches")):
  w_label = Text(value=label, disabled=True)
  # Each row is [log format, field name, first value, second value, sum].
  values = [row[2] for row in weight]
  # All sliders share one range: the smallest and largest value, padded by 0.13.
  w_min = min(values)
  w_max = max(values)
  display(w_label)
  for row in weight:
    w_slider = FloatSlider(value=row[2], min=w_min-0.13, max=w_max+0.13, description=row[1], disabled=True)
    display(w_slider)
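
One common way to read m and u values is through the classical Fellegi-Sunter field weights: agreement on a field adds log2(m/u) to a pair's score, and disagreement adds log2((1-m)/(1-u)). The sketch below is a hedged illustration of that arithmetic only; the function name `field_weights` and the numbers are hypothetical, and fastLink itself reports posterior probabilities rather than these weights.

```python
from math import log2

def field_weights(m, u):
    """Return (agreement_weight, disagreement_weight) for one field.

    m: probability that the field agrees among true matches
    u: probability that the field agrees among non-matches
    """
    agreement = log2(m / u)
    disagreement = log2((1.0 - m) / (1.0 - u))
    return agreement, disagreement

# Hypothetical values: the field agrees for 95% of matches and 5% of non-matches.
agree_w, disagree_w = field_weights(m=0.95, u=0.05)
print(round(agree_w, 3), round(disagree_w, 3))
```

A field with m close to u carries almost no evidence either way, which is why fields with nearly equal m and u sliders contribute little to the linkage.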

## **<font color="#007f68">10. Display result to the screen**

In [None]:
from IPython.display import display_html
# display(left)
# display(right)

df1_styler = left.style.set_table_attributes("style='display:inline'").set_caption('Reference Records')
df2_styler = right.style.set_table_attributes("style='display:inline'").set_caption('Duplicate Records')

display_html(df1_styler._repr_html_()+df2_styler._repr_html_(), raw=True)