## Historical people: Quick and dirty

This example shows how to get some initial record linkage results as quickly as possible.

There are many ways to improve the accuracy of this model, but it is a good place to start if you just want to give Splink a try and see what it's capable of.


<a target="_blank" href="https://colab.research.google.com/github/moj-analytical-services/splink/blob/master/docs/demos/examples/duckdb/quick_and_dirty_persons.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>


In [1]:
# Run this cell if you're running in Google Colab.
!pip install splink

Collecting splink
  Downloading splink-4.0.7-py3-none-any.whl.metadata (12 kB)
Collecting igraph>=0.11.2 (from splink)
  Downloading igraph-0.11.8-cp39-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (3.8 kB)
Collecting texttable>=1.6.2 (from igraph>=0.11.2->splink)
  Downloading texttable-1.7.0-py2.py3-none-any.whl.metadata (9.8 kB)
Downloading splink-4.0.7-py3-none-any.whl (3.7 MB)
Downloading igraph-0.11.8-cp39-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.1 MB)
Downloading texttable-1.7.0-py2.py3-none-any.whl (10 kB)
Installing collected packages: texttable, igraph, splink
Successfully installed igraph-0.11.8 splink-4.0.7 texttable-1.7.0


In [2]:
from splink.datasets import splink_datasets

df = splink_datasets.historical_50k
df.head(5)

downloading: https://raw.githubusercontent.com/moj-analytical-services/splink_datasets/master/data/historical_figures_with_errors_50k.parquet



| | unique_id | cluster | full_name | first_and_surname | first_name | surname | dob | birth_place | postcode_fake | gender | occupation |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Q2296770-1 | Q2296770 | thomas clifford, 1st baron clifford of chudleigh | thomas chudleigh | thomas | chudleigh | 1630-08-01 | devon | tq13 8df | male | politician |
| 1 | Q2296770-2 | Q2296770 | thomas of chudleigh | thomas chudleigh | thomas | chudleigh | 1630-08-01 | devon | tq13 8df | male | politician |
| 2 | Q2296770-3 | Q2296770 | tom 1st baron clifford of chudleigh | tom chudleigh | tom | chudleigh | 1630-08-01 | devon | tq13 8df | male | politician |
| 3 | Q2296770-4 | Q2296770 | thomas 1st chudleigh | thomas chudleigh | thomas | chudleigh | 1630-08-01 | devon | tq13 8hu | | politician |
| 4 | Q2296770-5 | Q2296770 | thomas clifford, 1st baron chudleigh | thomas chudleigh | thomas | chudleigh | 1630-08-01 | devon | tq13 8df | | politician |


In [4]:
from splink import block_on, SettingsCreator
import splink.comparison_library as cl


settings = SettingsCreator(
    link_type="dedupe_only",
    blocking_rules_to_generate_predictions=[
        block_on("full_name"),
        block_on("substr(full_name,1,6)", "dob", "birth_place"),
        block_on("dob", "birth_place"),
        block_on("postcode_fake"),
    ],
    comparisons=[
        cl.ForenameSurnameComparison(
            "first_name",
            "surname",
            forename_surname_concat_col_name="first_and_surname",
        ),
        cl.DateOfBirthComparison(
            "dob",
            input_is_string=True,
        ),
        cl.LevenshteinAtThresholds("postcode_fake", 2),
        cl.JaroWinklerAtThresholds("birth_place", 0.9).configure(
            term_frequency_adjustments=True
        ),
        cl.ExactMatch("occupation").configure(term_frequency_adjustments=True),
    ],
)
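The blocking rules above limit which record pairs get scored: only pairs that agree on at least one rule's key are compared. The following is a minimal plain-Python sketch of that idea for a single rule, `block_on("postcode_fake")`, using the five sample rows shown earlier. The helper `candidate_pairs` is hypothetical and exists only for illustration; Splink itself generates pairs with SQL in the chosen backend, not with code like this.

```python
from collections import defaultdict
from itertools import combinations

# (unique_id, postcode_fake) copied from the five rows of df.head(5) above.
records = [
    ("Q2296770-1", "tq13 8df"),
    ("Q2296770-2", "tq13 8df"),
    ("Q2296770-3", "tq13 8df"),
    ("Q2296770-4", "tq13 8hu"),
    ("Q2296770-5", "tq13 8df"),
]


def candidate_pairs(rows):
    """Hypothetical helper: pair up records that share the same blocking key."""
    blocks = defaultdict(list)
    for unique_id, key in rows:
        blocks[key].append(unique_id)
    pairs = []
    for ids in blocks.values():
        # Every unordered pair inside one block becomes a candidate comparison.
        pairs.extend(combinations(sorted(ids), 2))
    return pairs


pairs = candidate_pairs(records)
all_pairs = len(records) * (len(records) - 1) // 2
print(f"{len(pairs)} candidate pairs out of {all_pairs} possible pairs")
```

The record with postcode `tq13 8hu` shares its key with no other row, so this rule alone would never compare it with the others. That is why the settings list several rules: a pair is kept if any one of them matches.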

In [5]:
from splink import Linker, DuckDBAPI


linker = Linker(df, settings, db_api=DuckDBAPI(), set_up_basic_logging=False)
deterministic_rules = [
    "l.full_name = r.full_name",
    "l.postcode_fake = r.postcode_fake and l.dob = r.dob",
]

linker.training.estimate_probability_two_random_records_match(
    deterministic_rules, recall=0.6
)
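The `recall` argument above tells Splink what fraction of all true matches the deterministic rules are expected to find. As a rough sketch of the arithmetic, assuming the estimate is the pair count found by the rules, scaled up by `1 / recall` and divided by the number of possible pairwise comparisons: the function name and the pair count below are made up for illustration, and the record count of 50,000 is only inferred from the dataset name.

```python
def estimate_prior(n_records, n_pairs_found, recall):
    """Hypothetical helper: prior probability that two random records match."""
    # Number of distinct record pairs when deduplicating a single table.
    total_comparisons = n_records * (n_records - 1) // 2
    # The rules only find a fraction `recall` of the true matches, so scale up.
    estimated_true_matches = n_pairs_found / recall
    return estimated_true_matches / total_comparisons


# Illustrative inputs only, not values taken from the dataset.
prior = estimate_prior(n_records=50_000, n_pairs_found=60_000, recall=0.6)
print(f"Estimated probability two random records match: {prior:.3g}")
```

A lower `recall` value implies the rules miss more true matches, so the estimated prior goes up.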

In [None]:
linker.training.estimate_u_using_random_sampling(max_pairs=2e6)


In [7]:
results = linker.inference.predict(threshold_match_probability=0.9)

You have called predict(), but there are some parameter estimates which have neither been estimated or specified in your settings dictionary.  To produce predictions the following untrained trained parameters will use default values.
Comparison: 'first_name_surname':
    m values not fully trained
Comparison: 'first_name_surname':
    u values not fully trained
Comparison: 'dob':
    m values not fully trained
Comparison: 'dob':
    u values not fully trained
Comparison: 'postcode_fake':
    m values not fully trained
Comparison: 'postcode_fake':
    u values not fully trained
Comparison: 'birth_place':
    m values not fully trained
Comparison: 'birth_place':
    u values not fully trained
Comparison: 'occupation':
    m values not fully trained
Comparison: 'occupation':
    u values not fully trained


In [8]:
results.as_pandas_dataframe(limit=5)

| | match_weight | match_probability | unique_id_l | unique_id_r | surname_l | surname_r | first_name_l | first_name_r | first_and_surname_l | first_and_surname_r | ... | gamma_postcode_fake | birth_place_l | birth_place_r | gamma_birth_place | occupation_l | occupation_r | gamma_occupation | full_name_l | full_name_r | match_key |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 13.570547 | 0.999918 | Q17484923-11 | Q17484923-4 | mckenzie | mackenzie | ellen | helen | ellen mckenzie | helen mackenzie | ... | -1 | dufftown | dufftown | 2 | activist | activist | 1 | ellen carruthers mckenzie | helen mackenzie | 2 |
| 1 | 28.438719 | 1.0 | Q7341588-14 | Q7341588-3 | armstrong-jones | armstrong-jones | rob | robert | rob armstrong-jones | robert armstrong-jones | ... | 2 | ynyscynhaearn | ynyscynhaearn | 2 | psychiatrist | psychiatrist | 1 | rob armstrong-jones | robert armstrong-jones | 2 |
| 2 | 23.4923 | 1.0 | Q65613275-12 | Q65613275-4 | bethell | bethell | zugusta | augusta | zugusta bethell | augusta bethell | ... | 2 | marylebone | marylebone | 2 | translator | translator | 1 | zugusta bethell | augusta bethell | 2 |
| 3 | 27.151567 | 1.0 | Q16854735-2 | Q16854735-5 | hamilton | lascelles | maud | maud | maud hamilton | maud lascelles | ... | 2 | northumberland | northumberland | 2 | translator | translator | 1 | maud hamilton | maud caroline lascelles | 2 |
| 4 | 10.677766 | 0.99939 | Q3195176-1 | Q3195176-13 | morton | morton | kenneth | kenn3th | kenneth morton | kenn3th morton | ... | 0 | wakefield | wakefield | 2 | entomologist | entomologist | 1 | kenneth morton | kenn3th morton | 2 |
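The `match_weight` and `match_probability` columns above carry the same information on two scales: the match weight is the base-2 logarithm of the odds of a match. A small sketch of the conversion, checked against two rows of the table, is shown here. The function name is an illustrative choice, not part of the Splink API.

```python
def match_weight_to_probability(match_weight):
    """Convert a log2-odds match weight into a match probability."""
    odds = 2.0 ** match_weight
    return odds / (1.0 + odds)


# (match_weight, match_probability) pairs copied from rows 0 and 4 above.
rows = [(13.570547, 0.999918), (10.677766, 0.99939)]
for weight, reported in rows:
    computed = match_weight_to_probability(weight)
    print(f"weight={weight}: computed={computed:.6f}, reported={reported}")
```

Because the scale is logarithmic, each extra unit of match weight doubles the odds of a match, which is why weights above roughly 20 all display as a probability of 1.0 at this precision.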
