## Record linkage tutorial

In [1]:
!pip install recordlinkage

Please see https://github.com/pypa/pip/issues/5599 for advice on fixing the underlying issue.
To avoid this problem you can invoke Python with '-m pip' instead of running pip directly.
Collecting recordlinkage
  Downloading recordlinkage-0.14-py3-none-any.whl (944 kB)
Collecting jellyfish>=0.5.4
  Downloading jellyfish-0.7.2.tar.gz (133 kB)
Building wheels for collected packages: jellyfish
  Building wheel for jellyfish (setup.py) ... done
  Created wheel for jellyfish: filename=jellyfish-0.7.2-cp36-cp36m-macosx_10_7_x86_64.whl size=24685 sha256=23cbe1201d1f2048f7a65e13f883c9124639d91a8d776f921ef01ee216fc5288
  Stored in directory: /Users/ahmad/Library/Caches/pip/wheels/3b/de/5e/9a80586358562caf9f6b3913b998e45508b5748bce9a45d419
Successfully built jellyfish
Installing collected packages: jellyfish, recordlinkage
Successfully ins

In [2]:
import recordlinkage
import pandas as pd

Let's start by following the example from the record linkage documentation. For this example, we use the Febrl dataset 1. This dataset contains 1000 records, of which 500 are originals and 500 are duplicates, with exactly one duplicate per original record. It can be loaded with the function `load_febrl1`.

In [3]:
from recordlinkage.datasets import load_febrl1
dfA = load_febrl1()
dfA.head()

rec_id,given_name,surname,street_number,address_1,address_2,suburb,postcode,state,date_of_birth,soc_sec_id
rec-223-org,,waller,6.0,tullaroop street,willaroo,st james,4011,wa,19081209,6988048
rec-122-org,lachlan,berry,69.0,giblin street,killarney,bittern,4814,qld,19990219,7364009
rec-373-org,deakin,sondergeld,48.0,goldfinch circuit,kooltuo,canterbury,2776,vic,19600210,2635962
rec-10-dup-0,kayla,harrington,,maltby circuit,coaling,coolaroo,3465,nsw,19150612,9004242
rec-227-org,luke,purdon,23.0,ramsay place,mirani,garbutt,2260,vic,19831024,8099933


### Make record pairs
A natural starting point is to compare each record in DataFrame `dfA` with every other record in `dfA`. To do that, we first make record pairs, each containing two different records of `dfA`. This process of making record pairs is also called ‘indexing’. With the `recordlinkage` module, indexing is easy. First, create a `recordlinkage.Index` object and call its `.full` method. The object then generates a full index when `.index(...)` is called. For deduplication of a single DataFrame, that one DataFrame is the only input argument needed.

In [34]:
indexer = recordlinkage.Index()
indexer.full()
candidate_links_full = indexer.index(dfA)



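With the method `index`, all possible (and unique) record pairs are made. The method returns a `pandas.MultiIndex`. When deduplicating a single DataFrame with n records, the number of pairs is n(n-1)/2, because each record is paired with every other record exactly once and never with itself. For the 1000 records in `dfA`, that is 1000 × 999 / 2 = 499500 pairs. (When linking two different DataFrames, the number of pairs would instead be the product of their lengths.)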

In [35]:
print(len(dfA), len(candidate_links_full))

1000 499500
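The count printed above can be checked without the library. The following is a minimal standard-library sketch, not part of `recordlinkage`; the record IDs are made-up placeholders and `n_dedup_pairs` is a hypothetical helper name. It compares the closed-form count n(n-1)/2 with a brute-force enumeration of all unordered pairs.

```python
from itertools import combinations


def n_dedup_pairs(n: int) -> int:
    """Number of unordered pairs of distinct records among n records."""
    return n * (n - 1) // 2


# Placeholder record IDs, used only so that pairs can be enumerated explicitly.
record_ids = [f"rec-{i}" for i in range(1000)]

# Count the pairs lazily instead of materialising them all in a list.
n_enumerated = sum(1 for _ in combinations(record_ids, 2))

print(n_enumerated, n_dedup_pairs(1000))
```

Both numbers agree with the length of the full index, which is why the full index grows quadratically with the number of records.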


Many of these record pairs do not belong to the same person. The `recordlinkage` toolkit offers more advanced indexing methods that reduce the number of record pairs by leaving obvious non-matches out of the index. Note that if a matching record pair is not included in the index, it can no longer be matched.

One of the best-known indexing methods is blocking. This method includes only record pairs that are identical on one or more stored attributes of the person (or entity in general). Blocking is available in the `recordlinkage` module.

In [36]:
indexer = recordlinkage.Index()
indexer.block('given_name')
candidate_links = indexer.index(dfA)

print(len(candidate_links))

2082


The argument ‘given_name’ is the blocking variable. This variable has to be the name of a column in `dfA` (and in the second DataFrame, when two DataFrames are linked). It is possible to pass a list of column names to block on multiple variables. Blocking on multiple variables will reduce the number of record pairs even further.
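To make the effect of blocking on several variables concrete, here is a conceptual sketch in plain Python. It is not the library's implementation: the toy records are invented, `block_pairs` is a hypothetical helper, and skipping records with a missing blocking value is an assumption of this sketch.

```python
from collections import defaultdict
from itertools import combinations

# Toy records invented for illustration; they are not rows of the Febrl data.
records = {
    "a1": {"given_name": "kayla", "surname": "harrington"},
    "a2": {"given_name": "kayla", "surname": "harrington"},
    "a3": {"given_name": "kayla", "surname": "berry"},
    "a4": {"given_name": "luke", "surname": "purdon"},
    "a5": {"given_name": "luke", "surname": "purdon"},
}


def block_pairs(records, keys):
    """Return pairs of record IDs that agree exactly on every key in `keys`."""
    blocks = defaultdict(list)
    for rec_id, rec in records.items():
        values = tuple(rec[k] for k in keys)
        # Assumption of this sketch: a missing value never forms a block.
        if any(v is None for v in values):
            continue
        blocks[values].append(rec_id)
    pairs = set()
    for ids in blocks.values():
        # Only records inside the same block are paired with each other.
        pairs.update(combinations(sorted(ids), 2))
    return pairs


one_key = block_pairs(records, ["given_name"])
two_keys = block_pairs(records, ["given_name", "surname"])
print(len(one_key), len(two_keys))
```

Every pair produced with two blocking keys is also produced with one key, so adding keys can only shrink the candidate set. The price is that a true match with a typo in any blocking column is lost.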

### Compare records
Each record pair is a candidate match. To classify the candidate record pairs into matches and non-matches, compare the records on all attributes both records have in common. The `recordlinkage` module has a class named `Compare`. This class is used to compare the records. The following code shows how to compare attributes.

In [44]:
compare_cl = recordlinkage.Compare()

compare_cl.exact('given_name', 'given_name', label='given_name')
compare_cl.string('surname', 'surname', method='jarowinkler', threshold=0.85, label='surname')
compare_cl.exact('date_of_birth', 'date_of_birth', label='date_of_birth')
compare_cl.exact('suburb', 'suburb', label='suburb')
compare_cl.exact('state', 'state', label='state')
compare_cl.string('address_1', 'address_1', threshold=0.85, label='address_1')

features = compare_cl.compute(candidate_links, dfA)

As we can see, blocking reduces the number of record pairs significantly, from 499500 to 2082. This in turn reduces the number of comparisons to be made and saves a lot of computation. For the sake of comparison, let's compare the running time of a full indexer against a blocking indexer.

In [43]:
# this cell may take some time
import time

# Time the comparison on the full index.
print("full indexer running time")
start = time.time()
features = compare_cl.compute(candidate_links_full, dfA)
print(time.time() - start)

# Time the comparison on the blocked index; `features` ends up holding these results.
print("Blocking indexer running time")
start = time.time()
features = compare_cl.compute(candidate_links, dfA)
print(time.time() - start)

full indexer running time
5.6329357624053955
Blocking indexer running time
0.05978703498840332
0.05978703498840332


The comparison of record pairs starts when the `compute` method is called. All attribute comparisons are stored in a DataFrame with the features as columns and the record pairs as rows. The first 10 comparison vectors are:

In [46]:
features.head(10)

rec_id_1,rec_id_2,given_name,surname,date_of_birth,suburb,state,address_1
rec-183-dup-0,rec-122-org,1,0.0,0,0,0,0.0
rec-248-org,rec-122-org,1,0.0,0,0,1,0.0
rec-248-org,rec-183-dup-0,1,0.0,0,0,0,0.0
rec-122-dup-0,rec-122-org,1,1.0,1,1,1,1.0
rec-122-dup-0,rec-183-dup-0,1,0.0,0,0,0,0.0
rec-122-dup-0,rec-248-org,1,0.0,0,0,1,0.0
rec-469-org,rec-122-org,1,0.0,0,0,0,0.0
rec-469-org,rec-183-dup-0,1,0.0,0,0,1,0.0
rec-469-org,rec-248-org,1,0.0,0,0,0,0.0
rec-469-org,rec-122-dup-0,1,0.0,0,0,0,0.0
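The `surname` and `address_1` columns hold only 0.0 or 1.0 because the string comparisons were given a `threshold`: a similarity score at or above the threshold becomes 1.0, anything below becomes 0.0. The sketch below illustrates that idea with the standard library. Note that it swaps in `difflib.SequenceMatcher` as a stand-in similarity measure; it is not the Jaro-Winkler or Levenshtein implementation used by `recordlinkage`, and `thresholded_similarity` is a hypothetical helper.

```python
from difflib import SequenceMatcher


def thresholded_similarity(a, b, threshold=0.85):
    """Return 1.0 if the similarity of a and b is at least `threshold`, else 0.0."""
    if a is None or b is None:
        # Assumption of this sketch: a missing value never counts as agreement.
        return 0.0
    # ratio() is a similarity score in [0, 1]; 1.0 means identical strings.
    score = SequenceMatcher(None, a, b).ratio()
    return 1.0 if score >= threshold else 0.0


# A one-letter typo still counts as agreement, an unrelated surname does not.
typo_result = thresholded_similarity("harrington", "harington")
other_result = thresholded_similarity("harrington", "berry")
print(typo_result, other_result)
```

Thresholding lets small spelling variations count as agreement while keeping the comparison vector binary, which is what makes the simple score summation in the next step possible.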


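The last step is to decide which records belong to the same person. In this example, we keep it simple: we sum the comparison results of each record pair and treat pairs with a total score above 3 as matches.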

In [13]:
# Sum the comparison results.
features.sum(axis=1).value_counts().sort_index(ascending=False)

6.0     142
5.0     145
4.0      30
3.0       9
2.0     376
1.0    1380
dtype: int64

In [14]:
matches = features[features.sum(axis=1) > 3]

print(len(matches))
matches.head(10)

317


rec_id_1,rec_id_2,given_name,surname,date_of_birth,suburb,state,address_1
rec-122-dup-0,rec-122-org,1,1.0,1,1,1,1.0
rec-183-org,rec-183-dup-0,1,1.0,1,1,1,1.0
rec-248-dup-0,rec-248-org,1,1.0,1,1,1,1.0
rec-373-dup-0,rec-373-org,1,1.0,1,1,1,1.0
rec-10-org,rec-10-dup-0,1,1.0,1,1,1,1.0
rec-342-dup-0,rec-342-org,1,1.0,0,1,1,1.0
rec-397-org,rec-397-dup-0,1,1.0,1,1,1,0.0
rec-472-org,rec-472-dup-0,1,1.0,1,1,1,0.0
rec-330-org,rec-330-dup-0,1,0.0,1,1,1,0.0
rec-190-org,rec-190-dup-0,1,1.0,0,1,1,1.0
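The decision rule used here is simply "sum the comparison vector and keep pairs scoring above 3". As a minimal sketch outside pandas, the same rule can be written over a plain dictionary. The three comparison vectors are copied from rows of the tables above; `toy_vectors` and `classify` are hypothetical names chosen for this illustration.

```python
# Comparison vectors in the column order:
# given_name, surname, date_of_birth, suburb, state, address_1
toy_vectors = {
    ("rec-122-dup-0", "rec-122-org"): [1, 1.0, 1, 1, 1, 1.0],
    ("rec-248-org", "rec-122-org"): [1, 0.0, 0, 0, 1, 0.0],
    ("rec-330-org", "rec-330-dup-0"): [1, 0.0, 1, 1, 1, 0.0],
}


def classify(vectors, threshold=3):
    """Keep pairs whose summed agreement score is strictly above `threshold`."""
    return {pair for pair, scores in vectors.items() if sum(scores) > threshold}


matched = classify(toy_vectors)
print(sorted(matched))
```

The first and third pairs score 6 and 4 and are kept, while the second scores 2 and is rejected. Every attribute carries the same weight in this rule, which is what keeps it simple.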


### Precision and recall
Now we can evaluate how well our deduplication worked. We can do this by computing the precision and recall values.

First we have to find the correct pairs among the matches we found. We can do this by exploiting the structure of the record IDs and by simple string splitting.

The precision value is the number of correct matches found divided by the total number of matches found. Recall is the number of correct matches found divided by the total number of matching pairs in the dataset, which we know is 500.

In [53]:
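# A pair is correct when both record IDs share the same number,
# e.g. rec-122-org and rec-122-dup-0.
matches_index = matches.reset_index()["rec_id_1"].map(lambda x: x.split("-")[1]) == \
                matches.reset_index()["rec_id_2"].map(lambda x: x.split("-")[1])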

In [59]:
precision = len(matches[list(matches_index)]) / len(matches)
print("precision = ", precision)
recall = len(matches[list(matches_index)]) / 500
print("recall = ", recall)

precision =  1.0
recall =  0.634
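The two definitions can also be wrapped in a small reusable function. The sketch below is a plain-Python illustration under stated assumptions: `same_entity` and `precision_recall` are hypothetical helpers, the ID convention `rec-<number>-org` / `rec-<number>-dup-0` is taken from the tables above, and the example pairs and the count of four true pairs are invented.

```python
def same_entity(rec_id_1, rec_id_2):
    """True when both IDs share the numeric part, e.g. rec-122-org / rec-122-dup-0."""
    return rec_id_1.split("-")[1] == rec_id_2.split("-")[1]


def precision_recall(found_pairs, n_true_pairs):
    """Precision and recall for a collection of predicted matching pairs."""
    found_pairs = list(found_pairs)
    n_correct = sum(same_entity(a, b) for a, b in found_pairs)
    # With no predictions, precision is reported as 0.0 to avoid dividing by zero.
    precision = n_correct / len(found_pairs) if found_pairs else 0.0
    recall = n_correct / n_true_pairs
    return precision, recall


# Invented predictions: two correct pairs and one wrong pair,
# in an imagined dataset that contains four true pairs.
found = [
    ("rec-1-org", "rec-1-dup-0"),
    ("rec-2-dup-0", "rec-2-org"),
    ("rec-3-org", "rec-4-dup-0"),
]
precision, recall = precision_recall(found, n_true_pairs=4)
print(precision, recall)
```

In the notebook itself all 317 matches found are correct, so precision is 317 / 317 = 1.0, while recall is 317 / 500 = 0.634: the rule makes no false matches but misses more than a third of the true pairs.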
