Record linkage
===

* Last modified: October 06, 2021

Record linkage is used when two datasets share no key field on which a join could be performed.

In [None]:
!pip3 install --quiet recordlinkage

In [1]:
import pandas as pd

pd.set_option("display.notebook_repr_html", False)

Data
---

The first dataset contains no duplicate records.

In [53]:
%%writefile /tmp/dfA.csv
first_name,last_name,birtdate,phone
Kayne M,Taffie,2/10/1985,+86 (669) 916-2473
Daisey S,Heisham,5/19/1990,+55 (858) 758-7630
Clair W,Brik,10/3/1976,+351 (509) 289-3191
Kippy L,Frome,9/18/1992,+420 (195) 491-9791
Burgess Jr,Klimes,2/4/1977,+86 (762) 990-4484
Dermot R,Garwill,9/27/1984,+86 (699) 948-9318
Hadley P,Gosker,2/15/1993,+48 (457) 883-3998
S Jackqueline,Papes,6/18/1983,+86 (784) 978-0726

Overwriting /tmp/dfA.csv


The second dataset contains records that duplicate entries in the first one.

In [54]:
%%writefile /tmp/dfB.csv
first_name,last_name,birtdate,phone
BURGESS,Klimes,2/4/1977,+86 (762) 990-4484
Dermot,Garwill,9/27/1984,+86 (699) 948-9318
Hadley,GOSKER,2/15/1993,+48 (457) 883-3998
Jackqueline,Papes,6/18/1983,+86 (784) 978-0726
Alastair,Barge,3/9/1971,+33 (182) 729-8581
Theobald,Bastian,11/15/1987,+62 (397) 242-4366
Pammi,Daffey,9/5/1986,+86 (761) 567-4803
Marcus,Charlo,7/7/1974,+86 (928) 602-4540
Burgess,KLIMES,2/4/1977,+86 (762) 990-4484
Dermot,Garwill,9/27/1984,+86 (699) 948-9318

Overwriting /tmp/dfB.csv


In [55]:
dfA = pd.read_csv('/tmp/dfA.csv')
dfB = pd.read_csv('/tmp/dfB.csv')

display(
    dfA,
    '-' * 60,
    dfB
)

      first_name last_name   birtdate                phone
0        Kayne M    Taffie  2/10/1985   +86 (669) 916-2473
1       Daisey S   Heisham  5/19/1990   +55 (858) 758-7630
2        Clair W      Brik  10/3/1976  +351 (509) 289-3191
3        Kippy L     Frome  9/18/1992  +420 (195) 491-9791
4     Burgess Jr    Klimes   2/4/1977   +86 (762) 990-4484
5       Dermot R   Garwill  9/27/1984   +86 (699) 948-9318
6       Hadley P    Gosker  2/15/1993   +48 (457) 883-3998
7  S Jackqueline     Papes  6/18/1983   +86 (784) 978-0726

'------------------------------------------------------------'

    first_name last_name    birtdate               phone
0      BURGESS    Klimes    2/4/1977  +86 (762) 990-4484
1       Dermot   Garwill   9/27/1984  +86 (699) 948-9318
2       Hadley    GOSKER   2/15/1993  +48 (457) 883-3998
3  Jackqueline     Papes   6/18/1983  +86 (784) 978-0726
4     Alastair     Barge    3/9/1971  +33 (182) 729-8581
5     Theobald   Bastian  11/15/1987  +62 (397) 242-4366
6        Pammi    Daffey    9/5/1986  +86 (761) 567-4803
7       Marcus    Charlo    7/7/1974  +86 (928) 602-4540
8      Burgess    KLIMES    2/4/1977  +86 (762) 990-4484
9       Dermot   Garwill   9/27/1984  +86 (699) 948-9318
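The two tables share no key column, and even the name fields differ in letter case and middle initials, so an exact join on the names finds nothing. The following is a minimal sketch in plain Python (no pandas required); the rows are copied from the tables above, and `exact_join` is a hypothetical helper introduced only for illustration.

```python
# Rows copied from dfA and dfB above as (first_name, last_name) tuples.
names_a = [("Burgess Jr", "Klimes"), ("Dermot R", "Garwill"),
           ("Hadley P", "Gosker"), ("S Jackqueline", "Papes")]
names_b = [("BURGESS", "Klimes"), ("Dermot", "Garwill"),
           ("Hadley", "GOSKER"), ("Jackqueline", "Papes")]


def exact_join(left, right):
    """Return the (i, j) index pairs whose tuples are exactly equal."""
    return [(i, j)
            for i, a in enumerate(left)
            for j, b in enumerate(right)
            if a == b]


matches = exact_join(names_a, names_b)
print(matches)  # []
```

Every one of these people appears in both tables, yet the exact comparison returns no pairs, which is the situation record linkage is meant to handle.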

Data Cleaning
---

In [68]:
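#
# Normalize the name columns with clean() and the phone column with
# phonenumbers() so that equivalent values compare as equal
#
from recordlinkage.preprocessing import clean, phonenumbers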

dfA.first_name = clean(dfA.first_name)
dfB.first_name = clean(dfB.first_name)

dfA.last_name = clean(dfA.last_name)
dfB.last_name = clean(dfB.last_name)

dfA.phone = phonenumbers(dfA.phone)
dfB.phone = phonenumbers(dfB.phone)

display(
    dfA,
    '-' * 60,
    dfB
)

      first_name last_name   birtdate           phone
0        kayne m    taffie  2/10/1985   +866699162473
1       daisey s   heisham  5/19/1990   +558587587630
2        clair w      brik  10/3/1976  +3515092893191
3        kippy l     frome  9/18/1992  +4201954919791
4     burgess jr    klimes   2/4/1977   +867629904484
5       dermot r   garwill  9/27/1984   +866999489318
6       hadley p    gosker  2/15/1993   +484578833998
7  s jackqueline     papes  6/18/1983   +867849780726

'------------------------------------------------------------'

    first_name last_name    birtdate          phone
0      burgess    klimes    2/4/1977  +867629904484
1       dermot   garwill   9/27/1984  +866999489318
2       hadley    gosker   2/15/1993  +484578833998
3  jackqueline     papes   6/18/1983  +867849780726
4     alastair     barge    3/9/1971  +331827298581
5     theobald   bastian  11/15/1987  +623972424366
6        pammi    daffey    9/5/1986  +867615674803
7       marcus    charlo    7/7/1974  +869286024540
8      burgess    klimes    2/4/1977  +867629904484
9       dermot   garwill   9/27/1984  +866999489318

In [56]:
#
# Building record pairs with full()
# ===============================================
# Forms every pair dfA x dfB and returns them as a MultiIndex
#
import recordlinkage

indexer = recordlinkage.Index()
indexer.full()
pairs = indexer.index(dfA, dfB)
pairs[:5], pairs[-5:]



(MultiIndex([(0, 0),
             (0, 1),
             (0, 2),
             (0, 3),
             (0, 4)],
            ),
 MultiIndex([(7, 5),
             (7, 6),
             (7, 7),
             (7, 8),
             (7, 9)],
            ))
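A full index amounts to the Cartesian product of the two row indexes, so 8 and 10 rows yield 80 candidate pairs. The sketch below reproduces this with the standard library only; the row counts are taken from the tables above.

```python
from itertools import product

n_a, n_b = 8, 10  # number of rows in dfA and dfB

# Every row of dfA paired with every row of dfB.
full_pairs = list(product(range(n_a), range(n_b)))

print(len(full_pairs))  # 80
print(full_pairs[:5])   # [(0, 0), (0, 1), (0, 2), (0, 3), (0, 4)]
print(full_pairs[-5:])  # [(7, 5), (7, 6), (7, 7), (7, 8), (7, 9)]
```

The number of pairs grows with the product of the table sizes, which is why a full index is only practical for small datasets.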

In [57]:
#
# Blocking: only pair records whose last_name values are equal.
#
indexer = recordlinkage.Index()
indexer.block('last_name')
candidate_links = indexer.index(dfA, dfB)
candidate_links

MultiIndex([(4, 0),
            (5, 1),
            (5, 9),
            (7, 3)],
           )
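Blocking can be pictured as grouping the second table by the blocking key and looking up each row of the first table in that grouping. The sketch below is plain Python with a hypothetical helper `block_pairs`; the last names are copied from the raw CSV files. On the raw values it reproduces the four pairs shown above, which is consistent with this cell (execution count 57) having run before the cleaning cell (execution count 68). On lowercased values two more pairs appear.

```python
from collections import defaultdict

last_a = ["Taffie", "Heisham", "Brik", "Frome",
          "Klimes", "Garwill", "Gosker", "Papes"]
last_b = ["Klimes", "Garwill", "GOSKER", "Papes", "Barge",
          "Bastian", "Daffey", "Charlo", "KLIMES", "Garwill"]


def block_pairs(left, right):
    """Return index pairs (i, j) whose blocking keys are exactly equal."""
    by_key = defaultdict(list)
    for j, key in enumerate(right):
        by_key[key].append(j)
    return [(i, j)
            for i, key in enumerate(left)
            for j in by_key.get(key, [])]


raw_pairs = block_pairs(last_a, last_b)
clean_pairs = block_pairs([s.lower() for s in last_a],
                          [s.lower() for s in last_b])

print(raw_pairs)    # [(4, 0), (5, 1), (5, 9), (7, 3)]
print(clean_pairs)  # [(4, 0), (4, 8), (5, 1), (5, 9), (6, 2), (7, 3)]
```

Because blocking relies on exact equality of the key, differences in letter case such as `GOSKER` versus `Gosker` silently remove true matches from the candidate set.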

In [58]:
#
# Comparing the candidate record pairs
#
compare_cl = recordlinkage.Compare()

compare_cl.exact(
    "last_name",
    "last_name",
    label="last_name",
)

compare_cl.string(
    "first_name",
    "first_name",
    method="jarowinkler",
    threshold=0.85,
    label="first_name",
)

features = compare_cl.compute(candidate_links, dfA, dfB)
features

     last_name  first_name
4 0          1         0.0
5 1          1         1.0
  9          1         1.0
7 3          1         1.0
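To see where the 0.0 and 1.0 values in the `first_name` column come from, the sketch below implements Jaro-Winkler similarity in plain Python and applies the 0.85 threshold. It follows the textbook definition (prefix scale 0.1, prefix capped at 4 characters, boost applied only when the Jaro score exceeds 0.7) and is not the code that recordlinkage runs internally, so scores could differ slightly in edge cases. The names are the raw values from the CSV files.

```python
def jaro(s1, s2):
    """Jaro similarity between two strings (case-sensitive)."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        # Look for an unused equal character within the match window.
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    k = 0
    out_of_order = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    transpositions = out_of_order // 2
    return (matches / len1 + matches / len2
            + (matches - transpositions) / matches) / 3


def jaro_winkler(s1, s2, scale=0.1):
    """Jaro score boosted by the length of the common prefix (max 4)."""
    score = jaro(s1, s2)
    if score <= 0.7:
        return score
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return score + prefix * scale * (1.0 - score)


# first_name values of the candidate pairs (4, 0), (5, 1), (5, 9), (7, 3).
name_pairs = [("Burgess Jr", "BURGESS"), ("Dermot R", "Dermot"),
              ("Dermot R", "Dermot"), ("S Jackqueline", "Jackqueline")]
flags = [1.0 if jaro_winkler(a, b) >= 0.85 else 0.0 for a, b in name_pairs]

print(flags)                                            # [0.0, 1.0, 1.0, 1.0]
print(round(jaro_winkler("burgess jr", "burgess"), 2))  # 0.94
```

The comparison is case-sensitive, so `Burgess Jr` and `BURGESS` share only their first letter and fall below the threshold. On lowercased names the same pair scores about 0.94 in this sketch and would pass.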

In [59]:
# Number of candidate pairs for each total score (highest score first)
features.sum(axis=1).value_counts().sort_index(ascending=False)

2.0    3
1.0    1
dtype: int64

In [60]:
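#
# A minimum total score decides which candidate pairs are treated
# as the same record. With a minimum of 1 every blocked pair
# qualifies, because the last_name comparison already contributes 1.
#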
potential_matches = features[features.sum(axis=1) >= 1]
potential_matches

     last_name  first_name
4 0          1         0.0
5 1          1         1.0
  9          1         1.0
7 3          1         1.0
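The minimum score controls how strict the matching is. The small sketch below copies the feature values printed above into a plain dictionary (so no library is needed) and compares a minimum of 1 with a minimum of 2; `select_matches` is a hypothetical helper used only for illustration.

```python
# (row in dfA, row in dfB) -> (last_name score, first_name score),
# copied from the features table above.
feature_scores = {
    (4, 0): (1, 0.0),
    (5, 1): (1, 1.0),
    (5, 9): (1, 1.0),
    (7, 3): (1, 1.0),
}


def select_matches(scores, minimum):
    """Keep the pairs whose summed comparison scores reach the minimum."""
    return [pair for pair, values in scores.items() if sum(values) >= minimum]


print(select_matches(feature_scores, 1))  # [(4, 0), (5, 1), (5, 9), (7, 3)]
print(select_matches(feature_scores, 2))  # [(5, 1), (5, 9), (7, 3)]
```

Requiring a total of 2 means both the last name and the first name comparison must agree, which drops the pair (4, 0).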

In [61]:
#
# Index of the selected pairs
#
potential_matches.index

MultiIndex([(4, 0),
            (5, 1),
            (5, 9),
            (7, 3)],
           )

In [62]:
#
# Row labels of the duplicated records in the
# second dataframe
#
duplicate_rows = potential_matches.index.get_level_values(1)
duplicate_rows

Int64Index([0, 1, 9, 3], dtype='int64')

In [63]:
#
# Duplicated records in dfB
#
dfB[dfB.index.isin(duplicate_rows)]

    first_name last_name   birtdate               phone
0      BURGESS    Klimes   2/4/1977  +86 (762) 990-4484
1       Dermot   Garwill  9/27/1984  +86 (699) 948-9318
3  Jackqueline     Papes  6/18/1983  +86 (784) 978-0726
9       Dermot   Garwill  9/27/1984  +86 (699) 948-9318

In [64]:
#
# Records in dfB that were not flagged as duplicates
#
dfB_new = dfB[~dfB.index.isin(duplicate_rows)]
dfB_new

  first_name last_name    birtdate               phone
2     Hadley    GOSKER   2/15/1993  +48 (457) 883-3998
4   Alastair     Barge    3/9/1971  +33 (182) 729-8581
5   Theobald   Bastian  11/15/1987  +62 (397) 242-4366
6      Pammi    Daffey    9/5/1986  +86 (761) 567-4803
7     Marcus    Charlo    7/7/1974  +86 (928) 602-4540
8    Burgess    KLIMES    2/4/1977  +86 (762) 990-4484

In [67]:
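#
# Concatenate dfA with the dfB rows that were not flagged as duplicates
# (pd.concat replaces DataFrame.append, which was removed in pandas 2.0)
#
dfA_new = pd.concat([dfA, dfB_new], ignore_index=True, sort=True)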
dfA_new.sort_values('birtdate')

      birtdate     first_name last_name                phone
2    10/3/1976        Clair W      Brik  +351 (509) 289-3191
10  11/15/1987       Theobald   Bastian   +62 (397) 242-4366
0    2/10/1985        Kayne M    Taffie   +86 (669) 916-2473
6    2/15/1993       Hadley P    Gosker   +48 (457) 883-3998
8    2/15/1993         Hadley    GOSKER   +48 (457) 883-3998
4     2/4/1977     Burgess Jr    Klimes   +86 (762) 990-4484
13    2/4/1977        Burgess    KLIMES   +86 (762) 990-4484
9     3/9/1971       Alastair     Barge   +33 (182) 729-8581
1    5/19/1990       Daisey S   Heisham   +55 (858) 758-7630
7    6/18/1983  S Jackqueline     Papes   +86 (784) 978-0726
12    7/7/1974         Marcus    Charlo   +86 (928) 602-4540
3    9/18/1992        Kippy L     Frome  +420 (195) 491-9791
5    9/27/1984       Dermot R   Garwill   +86 (699) 948-9318
11    9/5/1986          Pammi    Daffey   +86 (761) 567-4803