# Active Learning

In [None]:
!pip install dedupe -q
!pip install -U numpy


Collecting numpy
  Downloading numpy-1.21.4-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (15.7 MB)
Installing collected packages: numpy
  Attempting uninstall: numpy
    Found existing installation: numpy 1.19.5
    Uninstalling numpy-1.19.5:
      Successfully uninstalled numpy-1.19.5
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
yellowbrick 1.3.post1 requires numpy<1.20,>=1.16.0, but you have numpy 1.21.4 which is incompatible.
datascience 0.10.6 requires folium==0.2.1, but you have folium 0.8.3 which is incompatible.
albumentations 0.1.12 requires imgaug<0.2.7,>=0.2.5, but you have imgaug 0.2.9 which is incompatible.
Successfully installed numpy-1.21.4


In [None]:
# Check if we're running locally, or in Google Colab.
try:
    import google.colab
    COLAB = True
except ModuleNotFoundError:
    COLAB = False
    
# If we're running in Colab, download the tutorial functions file 
# to the Colab session local directory, and install required libraries.
if COLAB:
    import requests
    
    tutorial_functions_url = "https://raw.githubusercontent.com/rachhouse/intro-to-data-linking/main/tutorial_notebooks/linking_tutorial_functions.py"
    r = requests.get(tutorial_functions_url)
    
    with open("linking_tutorial_functions.py", "w") as fh:
        fh.write(r.text)
    
    !pip install -q altair dedupe dedupe-variable-name jellyfish recordlinkage 

In [None]:
import datetime
import itertools
import os
import pathlib
import re
from typing import Any, Dict, Optional

import dedupe
import pandas as pd

import linking_tutorial_functions as tutorial

INFO:root:Generating grammar tables from /usr/lib/python3.7/lib2to3/Grammar.txt
INFO:root:Generating grammar tables from /usr/lib/python3.7/lib2to3/PatternGrammar.txt


## Define Working Filepaths

For convenience, we'll define a `pathlib.Path` to reference our current working directory.

In [None]:
WORKING_DIR = pathlib.Path.cwd()
WORKING_DIR

PosixPath('/content')

## Load Training Dataset and Ground Truth Labels

In [None]:
df_A, df_B, df_ground_truth = tutorial.load_febrl_training_data(COLAB)

Let's take a quick look at our training dataset to refresh on the columns, formats, and data.

In [None]:
df_A.head()

| person_id_A | first_name | surname | street_number | address_1 | address_2 | suburb | postcode | state | date_of_birth | age | phone_number | soc_sec_id |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| fbc4143d-15f9-4f27-b5f0-dedbadce6616 | matilda | struck | 8 | ballard place |  | west perth | 2470 | qld | 19611002 | 32.0 | 03 05903135 | 8276847 |
| 48a56cad-7ba6-45e1-97cd-517ba65bdab5 | lachlan | eglinton | 36 | kambalda crescent | villa 427 | auburn | 5109 |  | 19260108 | 27.0 |  | 9937958 |
| b1792d21-e4be-4b86-8dea-454ffa5194c5 | mikayla | asher | 588 | britten-jones drive |  | miami | 4218 | nsw | 19251102 | 32.0 | 03 33770501 | 7017310 |
| 96653d73-bebc-4459-94f3-c3f0a8c514d4 | grace | bristow | 7 |  | wandella park snowy | cardiff | 6163 | nsw | 19400120 |  | 07 37864073 | 3535974 |
| 41f038b8-77c0-45a5-9e1f-e62b8637ffd1 | wilson | bishop | 11 | chisholm street |  | bronte | 2490 | nsw | 19210305 | 27.0 | 04 15209769 | 5573522 |


## Data Augmentation

We'll do minimal data augmentation before feeding our training data to `dedupe`: we format the date of birth as `mm/dd/yy`, and we ensure that every column is a string stripped of leading and trailing whitespace. Additionally, `dedupe` expects each dataset as a dictionary keyed by record id, where each value is itself a dictionary mapping field names to values. So, we'll convert our dataframes to this format.

In [None]:
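def format_dob(dob: Optional[str]) -> Optional[str]:
    """ Transform date of birth format from YYYYMMDD to mm/dd/yy.
        If DOB cannot be transformed, return None.
    """
    try:
        # fullmatch rejects strings with extra characters after the eight digits.
        if re.fullmatch(r"\d{8}", dob):
            return datetime.datetime.strptime(dob, "%Y%m%d").strftime("%m/%d/%y")
    except (TypeError, ValueError):
        # TypeError: dob is None or not a string.
        # ValueError: eight digits that do not form a real calendar date.
        pass

    return None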

def strip_and_null(x: Any) -> Optional[str]:
    """ Stringify incoming variable, remove trailing/leading whitespace
        and return resulting string. Return None if resulting string is empty.
    """
    x = str(x).strip()
    
    if x == "":
        return None
    else:
        return x
    
def convert_df_to_dict(df: pd.DataFrame) -> Dict[str, Dict]:
    """ Convert pandas DataFrame to dict keyed by record id.
        Convert all fields to strings or Nones to satisfy dedupe.
        Transform date format of date_of_birth field.
    """    

    for col in df.columns:
        df[col] = df[col].apply(strip_and_null)

    df["date_of_birth"] = df["date_of_birth"].apply(format_dob)

    return df.to_dict("index")
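To make the date handling concrete, here is a minimal, self-contained sketch that restates the `format_dob` helper so it runs on its own and applies it to a few inputs. The sample values are hypothetical and chosen only to exercise each branch; they are not taken from the dataset.

```python
import datetime
import re
from typing import Optional


def format_dob(dob: Optional[str]) -> Optional[str]:
    """Restatement of the notebook helper: YYYYMMDD -> mm/dd/yy, else None."""
    try:
        if re.fullmatch(r"\d{8}", dob):
            return datetime.datetime.strptime(dob, "%Y%m%d").strftime("%m/%d/%y")
    except (TypeError, ValueError):
        pass
    return None


# Hypothetical inputs: a valid date, an impossible month,
# a differently formatted date, and a missing value.
samples = ["19611002", "19611302", "1961-10-02", None]
formatted = [format_dob(s) for s in samples]
print(formatted)
```

One consequence of the `mm/dd/yy` target format is that `%y` drops the century, so `19611002` and `20611002` both become `10/02/61`.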

In [None]:
records_A = convert_df_to_dict(df_A)
records_B = convert_df_to_dict(df_B)

We can examine a small sample of the resulting transformed records:

In [None]:
list(records_A.values())[:2]

[{'address_1': 'ballard place',
  'address_2': None,
  'age': '32',
  'date_of_birth': '10/02/61',
  'first_name': 'matilda',
  'phone_number': '03 05903135',
  'postcode': '2470',
  'soc_sec_id': '8276847',
  'state': 'qld',
  'street_number': '8',
  'suburb': 'west perth',
  'surname': 'struck'},
 {'address_1': 'kambalda crescent',
  'address_2': 'villa 427',
  'age': '27',
  'date_of_birth': '01/08/26',
  'first_name': 'lachlan',
  'phone_number': None,
  'postcode': '5109',
  'soc_sec_id': '9937958',
  'state': None,
  'street_number': '36',
  'suburb': 'auburn',
  'surname': 'eglinton'}]

## Prepare Training

When we linked our data via SimSum and supervised learning, we defined our blockers and comparators manually with `recordlinkage`. The `dedupe` library takes an active learning approach to blocking and classification and will use our feedback gathered during the labeling session to learn blocking rules and train a classifier. 

To prepare our `dedupe.RecordLink` object for training, we'll first define the fields that we think `dedupe` should pay attention to when matching records. These definitions serve as the comparators. In each definition, the `field` key names the record attribute to compare, and the `type` key selects how `dedupe` compares it.

In [None]:
%%time

fields = [
    { "field" : "first_name", "type" : "Name" },
    { "field" : "surname", "type" : "Name" },
    { "field" : "address_1", "type" : "ShortString" },
    { "field" : "address_2", "type" : "ShortString" },
    { "field" : "suburb", "type" : "ShortString" },
    { "field" : "postcode", "type" : "Exact" },
    { "field" : "state", "type" : "Exact" },
    { "field" : "date_of_birth", "type" : "DateTime" },
    { "field" : "soc_sec_id", "type" : "Exact" },
]

linker = dedupe.RecordLink(fields)
linker.prepare_training(records_A, records_B)

INFO:dedupe.canopy_index:Removing stop word re
INFO:dedupe.training:Final predicate set:
INFO:dedupe.training:SimplePredicate: (dayPredicate, date_of_birth)


CPU times: user 52.7 s, sys: 1.18 s, total: 53.8 s
Wall time: 53.1 s
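As a rough illustration of what a field definition implies, the sketch below compares two records field by field, dispatching on the `type` key. It is not how `dedupe` computes its distances: the `exact` and `fuzzy` helpers, the `COMPARATORS` table, and the use of `difflib.SequenceMatcher` are all assumptions made for this example, and only two of the types used above are covered. The two records are abbreviated from a pair shown in the labeling session.

```python
from difflib import SequenceMatcher
from typing import Callable, Dict, List, Optional

Record = Dict[str, Optional[str]]


def exact(a: Optional[str], b: Optional[str]) -> Optional[float]:
    """1.0 if both values are present and equal, 0.0 if they differ, None if missing."""
    if a is None or b is None:
        return None
    return 1.0 if a == b else 0.0


def fuzzy(a: Optional[str], b: Optional[str]) -> Optional[float]:
    """Stand-in string similarity in [0, 1]; None if either value is missing."""
    if a is None or b is None:
        return None
    return SequenceMatcher(None, a, b).ratio()


# Hypothetical mapping from a field "type" to a comparison function.
COMPARATORS: Dict[str, Callable[[Optional[str], Optional[str]], Optional[float]]] = {
    "Exact": exact,
    "ShortString": fuzzy,
}


def compare(rec_a: Record, rec_b: Record, fields: List[Dict[str, str]]) -> Dict[str, Optional[float]]:
    """Return one similarity score per configured field."""
    return {
        f["field"]: COMPARATORS[f["type"]](rec_a.get(f["field"]), rec_b.get(f["field"]))
        for f in fields
    }


demo_fields = [
    {"field": "surname", "type": "ShortString"},
    {"field": "address_1", "type": "ShortString"},
    {"field": "postcode", "type": "Exact"},
    {"field": "soc_sec_id", "type": "Exact"},
]

rec_a = {"surname": "white", "address_1": None, "postcode": "6110", "soc_sec_id": "7343175"}
rec_b = {"surname": "wighfe", "address_1": None, "postcode": "6110", "soc_sec_id": "7343175"}

scores = compare(rec_a, rec_b, demo_fields)
print(scores)
```

A classifier then has one numeric feature per field to work with, and missing values stay distinguishable from genuine disagreements.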


## Active Learning Labeling Session!

At this point, we're ready to provide feedback to `dedupe` via an active learning labeling session. For this, `dedupe` supplies a convenience method to iterate through pairs it is uncertain about. As you provide feedback for each pair, `dedupe` learns blocking rules and recalculates its linking model weights.

You can use `y` (yes, match), `n` (no, not match), and `u` (unsure) to provide feedback on candidate links. When you're ready to exit the labeling session, use `f`.
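The idea of asking about the pairs a model is least sure of can be sketched with plain uncertainty sampling: given a match probability for each candidate pair, pick the one closest to 0.5. This is a generic illustration, not `dedupe`'s actual selection logic, and the pair ids and scores below are invented for the example.

```python
from typing import Dict, Tuple

Pair = Tuple[str, str]


def most_uncertain(scores: Dict[Pair, float]) -> Pair:
    """Return the candidate pair whose match probability is closest to 0.5."""
    return min(scores, key=lambda pair: abs(scores[pair] - 0.5))


# Hypothetical match probabilities from a partially trained model.
candidate_scores = {
    ("a1", "b7"): 0.97,  # almost certainly a match
    ("a2", "b3"): 0.52,  # the model cannot decide
    ("a5", "b9"): 0.08,  # almost certainly not a match
}

next_pair = most_uncertain(candidate_scores)
print(next_pair)
```

Labeling the undecided pair gives the model more information than confirming a pair it already scores near 0 or 1, which is why a short labeling session can be enough.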

In [None]:
dedupe.console_label(linker)

first_name : dylan
surname : paine
address_1 : macalister crescent
address_2 : westmead accom
suburb : None
postcode : 2148
state : sa
date_of_birth : None
soc_sec_id : 2677567

first_name : dylan
surname : dixon
address_1 : None
address_2 : None
suburb : sippy vswns
postcode : 4161
state : qld
date_of_birth : 08/30/10
soc_sec_id : 7931074

0/10 positive, 0/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished


yes


(y)es / (n)o / (u)nsure / (f)inished


y


first_name : michaela
surname : wheatley
address_1 : None
address_2 : None
suburb : granville
postcode : 6281
state : act
date_of_birth : 07/16/27
soc_sec_id : 2515374

first_name : lauren
surname : dixon
address_1 : lansell civrcuit
address_2 : None
suburb : None
postcode : 3212
state : nss
date_of_birth : None
soc_sec_id : 5803741

1/10 positive, 0/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious


y


INFO:dedupe.training:Final predicate set:
INFO:dedupe.training:SimplePredicate: (suffixArray, first_name)
first_name : alana
surname : boyle
address_1 : monaro crescent
address_2 : None
suburb : deniliquin
postcode : 3216
state : sa
date_of_birth : 09/17/24
soc_sec_id : 2811388

first_name : sophie
surname : fergas
address_1 : monaro crescent
address_2 : None
suburb : None
postcode : 2259
state : wa
date_of_birth : None
soc_sec_id : 1623683

2/10 positive, 0/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious


y


INFO:dedupe.training:Final predicate set:
INFO:dedupe.training:SimplePredicate: (dayPredicate, date_of_birth)
INFO:dedupe.training:PartialIndexLevenshteinSearchPredicate: (1, first_name, Surname)
first_name : riley
surname : mccarthy
address_1 : howitt street
address_2 : None
suburb : macquarie fields
postcode : 2070
state : sa
date_of_birth : 11/20/86
soc_sec_id : 3084435

first_name : sophie
surname : campbell
address_1 : howitt street
address_2 : None
suburb : None
postcode : 3789
state : vic
date_of_birth : None
soc_sec_id : 7764578

3/10 positive, 0/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious


y


first_name : andrew
surname : gibbett
address_1 : macalister crescent
address_2 : harpers bridge
suburb : drouin east
postcode : 2866
state : vic
date_of_birth : 09/07/50
soc_sec_id : 4101336

first_name : dillon
surname : paino
address_1 : macalister crescent
address_2 : None
suburb : None
postcode : 2148
state : sa
date_of_birth : None
soc_sec_id : 2677567

4/10 positive, 0/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious


y


INFO:dedupe.training:Final predicate set:
INFO:dedupe.training:SimplePredicate: (firstTwoTokensPredicate, address_1)
INFO:dedupe.training:PartialIndexLevenshteinSearchPredicate: (1, first_name, Surname)
first_name : elysse
surname : royle
address_1 : gamor street
address_2 : None
suburb : lower templestowe
postcode : 5251
state : qld
date_of_birth : 05/30/48
soc_sec_id : 3986562

first_name : lewis
surname : mcgregor
address_1 : None
address_2 : rajamape
suburb : lower templestowe
postcode : 7320
state : nsw
date_of_birth : None
soc_sec_id : 3174574

5/10 positive, 0/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious


y


first_name : sophie
surname : campbell
address_1 : howitt street
address_2 : None
suburb : None
postcode : 3799
state : vic
date_of_birth : 09/20/19
soc_sec_id : 7746745

first_name : evab
surname : shephcrd
address_1 : gelane street
address_2 : dspq
suburb : robina
postcode : 4860
state : qld
date_of_birth : None
soc_sec_id : 9845540

6/10 positive, 0/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious


y


INFO:dedupe.training:Final predicate set:
INFO:dedupe.training:SimplePredicate: (firstTwoTokensPredicate, address_1)
INFO:dedupe.training:PartialIndexLevenshteinSearchPredicate: (1, first_name, Surname)
INFO:dedupe.training:SimplePredicate: (firstTwoTokensPredicate, suburb)
first_name : isabella
surname : hoang
address_1 : salsola street
address_2 : melima
suburb : ellalong
postcode : 2564
state : nsw
date_of_birth : None
soc_sec_id : 2131318

first_name : charles
surname : kahlon
address_1 : abbott street
address_2 : None
suburb : ellalong
postcode : 3222
state : nsw
date_of_birth : None
soc_sec_id : 3912353

7/10 positive, 0/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious


y


INFO:dedupe.training:Final predicate set:
INFO:dedupe.training:SimplePredicate: (commonSixGram, address_1)
INFO:dedupe.training:PartialIndexLevenshteinSearchPredicate: (1, first_name, Surname)
INFO:dedupe.training:SimplePredicate: (firstTwoTokensPredicate, suburb)
first_name : zara
surname : spark
address_1 : nanson place
address_2 : sp 11650016
suburb : willow grove
postcode : 4740
state : vic
date_of_birth : 02/06/92
soc_sec_id : 8386154

first_name : lauren
surname : dixon
address_1 : lansell civrcuit
address_2 : None
suburb : None
postcode : 3212
state : nss
date_of_birth : None
soc_sec_id : 5803741

8/10 positive, 0/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious


y


first_name : tuscany
surname : careless
address_1 : canopus crescent
address_2 : None
suburb : morwell
postcode : 6105
state : sa
date_of_birth : 01/09/29
soc_sec_id : 8326973

first_name : zachary
surname : belavckc
address_1 : crick pmace
address_2 : None
suburb : None
postcode : 3630
state : nsw
date_of_birth : None
soc_sec_id : 3065989

9/10 positive, 0/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious


y


INFO:dedupe.training:Final predicate set:
INFO:dedupe.training:TfidfNGramSearchPredicate: (0.2, address_1)
INFO:dedupe.training:PartialIndexLevenshteinSearchPredicate: (1, first_name, Surname)
INFO:dedupe.training:SimplePredicate: (firstTwoTokensPredicate, suburb)
first_name : esme
surname : elbanna
address_1 : battersby circuit
address_2 : None
suburb : frankston
postcode : 3227
state : nsw
date_of_birth : 02/11/05
soc_sec_id : 8999714

first_name : cooper
surname : maelszka
address_1 : manity court
address_2 : None
suburb : frankston
postcode : 2197
state : nsw
date_of_birth : None
soc_sec_id : 2462806

10/10 positive, 0/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious


y


INFO:dedupe.training:Final predicate set:
INFO:dedupe.training:PartialIndexLevenshteinSearchPredicate: (3, surname, CorporationName)
first_name : matthew
surname : white
address_1 : None
address_2 : mungo park
suburb : hopetoun
postcode : 6110
state : vic
date_of_birth : 09/26/33
soc_sec_id : 7343175

first_name : matthew
surname : wighfe
address_1 : None
address_2 : mungo park
suburb : hopetoun
postcode : 6110
state : vic
date_of_birth : None
soc_sec_id : 7343175

11/10 positive, 0/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious


y


first_name : talia
surname : de paoli
address_1 : verco street
address_2 : None
suburb : ngunnawal
postcode : 4740
state : nsw
date_of_birth : 01/29/92
soc_sec_id : 5078964

first_name : de paoli
surname : talua
address_1 : verco street
address_2 : None
suburb : ngunnawal
postcode : 4740
state : nsw
date_of_birth : 01/29/92
soc_sec_id : 5078964

12/10 positive, 0/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious


y


INFO:dedupe.training:Final predicate set:
INFO:dedupe.training:PartialIndexLevenshteinSearchPredicate: (3, surname, CorporationName)
INFO:dedupe.training:SimplePredicate: (firstTwoTokensPredicate, address_2)
first_name : william
surname : mislov
address_1 : champ place
address_2 : None
suburb : corowa
postcode : 2913
state : vic
date_of_birth : 07/22/54
soc_sec_id : 6582245

first_name : zara
surname : humpcys
address_1 : None
address_2 : southern wood
suburb : corowa
postcode : 5095
state : qld
date_of_birth : None
soc_sec_id : 7393243

13/10 positive, 0/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious


y


INFO:dedupe.training:Final predicate set:
INFO:dedupe.training:PartialIndexLevenshteinSearchPredicate: (3, surname, CorporationName)
INFO:dedupe.training:SimplePredicate: (wholeFieldPredicate, soc_sec_id)
first_name : david
surname : sznajder
address_1 : None
address_2 : bowtells caravn park
suburb : eaton
postcode : 2298
state : vic
date_of_birth : 09/15/78
soc_sec_id : 8971940

first_name : david
surname : sznadkr
address_1 : None
address_2 : bowtells caravn park
suburb : eaton
postcode : 2298
state : vic
date_of_birth : 09/15/78
soc_sec_id : 6770683

14/10 positive, 0/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious


y


INFO:dedupe.training:Final predicate set:
INFO:dedupe.training:PartialIndexLevenshteinSearchPredicate: (3, surname, CorporationName)
INFO:dedupe.training:SimplePredicate: (wholeFieldPredicate, suburb)
first_name : jack
surname : jeffries
address_1 : foott street
address_2 : None
suburb : None
postcode : 0850
state : nsw
date_of_birth : 05/28/70
soc_sec_id : 9939397

first_name : mitchell
surname : matthdws
address_1 : badgery street
address_2 : None
suburb : thornbury
postcode : 2097
state : qhle
date_of_birth : None
soc_sec_id : 3327718

15/10 positive, 0/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious


y


first_name : caitlyn
surname : beckwith
address_1 : burgan place
address_2 : None
suburb : greenacre
postcode : 3043
state : nsw
date_of_birth : None
soc_sec_id : 1049021

first_name : kayla
surname : de boar
address_1 : burraly court
address_2 : northbridge marina
suburb : None
postcode : 6020
state : wa
date_of_birth : 02/08/93
soc_sec_id : 9459451

16/10 positive, 0/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious


y


INFO:dedupe.training:Final predicate set:
INFO:dedupe.training:PartialIndexLevenshteinSearchPredicate: (2, first_name, CorporationName)
first_name : bailey
surname : matthews
address_1 : spafford crescent
address_2 : None
suburb : moree
postcode : 3141
state : nt
date_of_birth : 07/11/26
soc_sec_id : 7466221

first_name : matthdws
surname : baldy
address_1 : spafford crescent
address_2 : None
suburb : moree
postcode : 3141
state : nt
date_of_birth : 07/11/26
soc_sec_id : 7466221

17/10 positive, 0/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious





(y)es / (n)o / (u)nsure / (f)inished / (p)revious


f


Finished labeling


We can now train our linker, based on the labeling session feedback.

In [None]:
%%time
linker.train()

INFO:rlr.crossvalidation:using cross validation to find optimum alpha...
INFO:rlr.crossvalidation:optimum alpha: 0.000010, score 0.0
INFO:dedupe.training:Final predicate set:


CPU times: user 4.29 s, sys: 405 ms, total: 4.69 s
Wall time: 4.32 s
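The `Final predicate set` lines in the logs name the blocking rules that `dedupe` settles on, such as `wholeFieldPredicate` or `firstTwoTokensPredicate` applied to a field. As a minimal sketch of the general idea, the code below groups records by a key derived from one field and only pairs records that share a key. The `whole_field`, `first_two_tokens`, and `candidate_pairs` functions are hypothetical reimplementations for illustration, not `dedupe`'s own code, and the toy records are invented.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Optional, Tuple

Record = Dict[str, Optional[str]]


def whole_field(value: str) -> str:
    """Block on the entire field value."""
    return value


def first_two_tokens(value: str) -> str:
    """Block on the first two whitespace-separated tokens."""
    return " ".join(value.split()[:2])


def candidate_pairs(
    records_a: Dict[str, Record],
    records_b: Dict[str, Record],
    field: str,
    predicate: Callable[[str], str],
) -> List[Tuple[str, str]]:
    """Pair up records from A and B whose blocking keys for `field` agree."""
    blocks: Dict[str, List[str]] = defaultdict(list)
    for rid_b, rec in records_b.items():
        value = rec.get(field)
        if value:  # records with a missing value fall into no block
            blocks[predicate(value)].append(rid_b)

    pairs: List[Tuple[str, str]] = []
    for rid_a, rec in records_a.items():
        value = rec.get(field)
        if value:
            pairs.extend((rid_a, rid_b) for rid_b in blocks.get(predicate(value), []))
    return pairs


# Toy records with hypothetical ids.
toy_a = {
    "a1": {"suburb": "hopetoun", "address_1": "howitt street north"},
    "a2": {"suburb": None, "address_1": "verco street"},
}
toy_b = {
    "b1": {"suburb": "hopetoun", "address_1": "howitt street"},
    "b2": {"suburb": "corowa", "address_1": "verco street"},
}

by_suburb = candidate_pairs(toy_a, toy_b, "suburb", whole_field)
by_address = candidate_pairs(toy_a, toy_b, "address_1", first_two_tokens)
print(by_suburb, by_address)
```

Only pairs that survive blocking are scored by the classifier, so the choice of predicates controls both how many comparisons are made and which true matches can be found at all.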
