## Linking a dataset of real historical persons with Deterministic Rules

While Splink is primarily a tool for probabilistic record linkage, it also includes functionality to perform deterministic (i.e. rules-based) linkage.

Significant work has gone into optimising the performance of rules-based matching, so Splink is likely to be considerably faster than basic hand-written SQL.
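For context, "basic SQL written by hand" for a single deterministic rule usually means a self-join on exact equality. The sketch below uses Python's built-in `sqlite3` module and a hypothetical four-row `person` table (the table name and values are illustrative assumptions, not taken from the dataset). It is not how Splink executes rules internally.

```python
import sqlite3

# Toy records (hypothetical values for illustration only).
rows = [
    ("a1", "thomas", "chudleigh", "1630-08-01"),
    ("a2", "thomas", "chudleigh", "1630-08-01"),
    ("a3", "tom", "chudleigh", "1630-08-01"),
    ("b1", "evan", "jackson", "1686-04-04"),
]

con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE person (unique_id TEXT, first_name TEXT, surname TEXT, dob TEXT)"
)
con.executemany("INSERT INTO person VALUES (?, ?, ?, ?)", rows)

# One deterministic rule written by hand: exact agreement on three columns.
# The unique_id inequality keeps each pair once and drops self-matches.
sql = """
SELECT l.unique_id, r.unique_id
FROM person AS l
JOIN person AS r
  ON l.first_name = r.first_name
 AND l.surname = r.surname
 AND l.dob = r.dob
WHERE l.unique_id < r.unique_id
ORDER BY l.unique_id, r.unique_id
"""
hand_written_pairs = con.execute(sql).fetchall()
print(hand_written_pairs)
```

With several rules, a naive approach repeats this join once per rule and then removes pairs found more than once, which is where hand-written SQL tends to become slow.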

In this example, we deduplicate a 50k-row dataset of historical persons scraped from Wikidata. Duplicate records have been introduced with a variety of errors. The probabilistic dedupe of the same dataset can be found at `Deduplicate 50k rows historical persons`.


<a target="_blank" href="https://colab.research.google.com/github/moj-analytical-services/splink/blob/splink4_examples_notebooks/docs/demos/examples/duckdb/deterministic_dedupe.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>


In [5]:
# Uncomment and run this cell if you're running in Google Colab.
# !pip install git+https://github.com/moj-analytical-services/splink.git@splink4_examples_notebooks

In [6]:
from splink.datasets import splink_datasets

import altair as alt

alt.renderers.enable("html")

import pandas as pd

pd.options.display.max_rows = 1000
df = splink_datasets.historical_50k
df.head()

Unnamed: 0,unique_id,cluster,full_name,first_and_surname,first_name,surname,dob,birth_place,postcode_fake,gender,occupation
0,Q2296770-1,Q2296770,"thomas clifford, 1st baron clifford of chudleigh",thomas chudleigh,thomas,chudleigh,1630-08-01,devon,tq13 8df,male,politician
1,Q2296770-2,Q2296770,thomas of chudleigh,thomas chudleigh,thomas,chudleigh,1630-08-01,devon,tq13 8df,male,politician
2,Q2296770-3,Q2296770,tom 1st baron clifford of chudleigh,tom chudleigh,tom,chudleigh,1630-08-01,devon,tq13 8df,male,politician
3,Q2296770-4,Q2296770,thomas 1st chudleigh,thomas chudleigh,thomas,chudleigh,1630-08-01,devon,tq13 8hu,,politician
4,Q2296770-5,Q2296770,"thomas clifford, 1st baron chudleigh",thomas chudleigh,thomas,chudleigh,1630-08-01,devon,tq13 8df,,politician


When defining the settings object, specify your deterministic rules in the `blocking_rules_to_generate_predictions` key.

For a deterministic linkage, the linkage methodology is based solely on these rules, so there is no need to define `comparisons` or any other parameters required for model training in a probabilistic model.


In [7]:
from splink.settings_creator import SettingsCreator
from splink.linker import Linker
from splink.blocking_rule_library import block_on
from splink.database_api import DuckDBAPI

settings = SettingsCreator(
    link_type="dedupe_only",
    blocking_rules_to_generate_predictions=[
        block_on("first_name", "surname", "dob"),
        block_on("surname", "dob", "postcode_fake"),
        block_on("first_name", "dob", "occupation"),
    ],
    retain_intermediate_calculation_columns=True,
)

linker = Linker(df, settings, database_api=DuckDBAPI())
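To make the semantics of the three rules concrete, here is a minimal pure-Python sketch of what "match on any rule" means. It assumes that a pair is linked when it agrees exactly on every column of at least one rule, that missing values never match (as with SQL `NULL`), and that the `match_key` column in Splink's output is the index of the first rule a pair satisfies. The helper names and toy records are hypothetical, and the brute-force loop is for illustration only.

```python
from itertools import combinations

# Each rule is a tuple of columns that must all agree exactly.
RULES = [
    ("first_name", "surname", "dob"),
    ("surname", "dob", "postcode_fake"),
    ("first_name", "dob", "occupation"),
]


def first_matching_rule(left, right, rules=RULES):
    """Return the index of the first rule the pair satisfies, else None."""
    for key, columns in enumerate(rules):
        # Treat None like SQL NULL: a missing value never equals anything.
        if all(left[c] is not None and left[c] == right[c] for c in columns):
            return key
    return None


def deterministic_pairs(records, rules=RULES):
    """Compare every pair of records; fine for a toy example, not for 50k rows."""
    out = []
    for left, right in combinations(records, 2):
        key = first_matching_rule(left, right, rules)
        if key is not None:
            out.append((left["unique_id"], right["unique_id"], key))
    return out


toy = [
    {"unique_id": "a1", "first_name": "thomas", "surname": "chudleigh",
     "dob": "1630-08-01", "postcode_fake": "tq13 8df", "occupation": "politician"},
    {"unique_id": "a2", "first_name": "thomas", "surname": "chudleigh",
     "dob": "1630-08-01", "postcode_fake": "tq13 8hu", "occupation": "politician"},
    {"unique_id": "a3", "first_name": "tom", "surname": "chudleigh",
     "dob": "1630-08-01", "postcode_fake": "tq13 8df", "occupation": None},
]
toy_pairs = deterministic_pairs(toy)
print(toy_pairs)
```

Here `a1` and `a2` link on the first rule, `a1` and `a3` link on the second (same surname, date of birth and postcode), and `a2` and `a3` satisfy no rule, so they are not linked directly.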


Prior to running the linkage, it's usually a good idea to check how many record comparisons will be generated by your deterministic rules:


In [8]:
linker.cumulative_num_comparisons_from_blocking_rules_chart()
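The number of comparisons a single exact-match rule generates can be estimated without enumerating pairs: a group of `n` records sharing the same key values yields `n * (n - 1) / 2` pairs. The sketch below is a hypothetical helper in plain Python, not part of Splink. It covers only the per-rule count, whereas the chart above is cumulative across rules.

```python
from collections import Counter


def pairs_from_rule(records, columns):
    """Count the pairs that agree on all `columns`, without listing them."""
    group_sizes = Counter(
        tuple(r[c] for c in columns)
        for r in records
        # Records with a missing key value cannot match on this rule.
        if all(r[c] is not None for c in columns)
    )
    # A group of n identical keys produces n choose 2 pairs.
    return sum(n * (n - 1) // 2 for n in group_sizes.values())


toy_records = [
    {"surname": "morata", "dob": "1836-01-01"},
    {"surname": "morata", "dob": "1836-01-01"},
    {"surname": "morata", "dob": "1836-01-01"},
    {"surname": "morata", "dob": "1836-01-01"},
    {"surname": None, "dob": "1836-01-01"},
    {"surname": "browne", "dob": "1859-01-01"},
]
n_pairs = pairs_from_rule(toy_records, ("surname", "dob"))
print(n_pairs)
```

Because the count grows quadratically with group size, a rule keyed on low-cardinality columns can generate far more comparisons than expected, which is why checking before running the linkage is worthwhile.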

The linkage is run with the `deterministic_link` function, which returns the pairs of records matched by the rules.


In [9]:
df_predict = linker.deterministic_link()
df_predict.as_pandas_dataframe().head()

Unnamed: 0,unique_id_l,unique_id_r,occupation_l,occupation_r,first_name_l,first_name_r,postcode_fake_l,postcode_fake_r,dob_l,dob_r,surname_l,surname_r,match_key,match_probability
0,Q55455287-12,Q55455287-2,,writer,jaido,jaido,ta4 2ug,ta4 2uu,1836-01-01,1836-01-01,morata,morata,0,1.0
1,Q55455287-12,Q55455287-3,,writer,jaido,jaido,ta4 2ug,ta4 2uu,1836-01-01,1836-01-01,morata,morata,0,1.0
2,Q55455287-12,Q55455287-4,,writer,jaido,jaido,ta4 2ug,ta4 2sz,1836-01-01,1836-01-01,morata,morata,0,1.0
3,Q55455287-12,Q55455287-5,,,jaido,jaido,ta4 2ug,ta4 2ug,1836-01-01,1836-01-01,morata,morata,0,1.0
4,Q55455287-12,Q55455287-6,,writer,jaido,jaido,ta4 2ug,,1836-01-01,1836-01-01,morata,morata,0,1.0


These pairwise results can be used to generate clusters.

Note that for deterministic linkage, each comparison is assigned a match probability of 1, so to generate clusters, set `threshold_match_probability=1` in the `cluster_pairwise_predictions_at_threshold` function.


In [10]:
clusters = linker.cluster_pairwise_predictions_at_threshold(
    df_predict, threshold_match_probability=1
)

Completed iteration 1, root rows count 94
Completed iteration 2, root rows count 10
Completed iteration 3, root rows count 0
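Conceptually, clustering at a threshold of 1 amounts to finding the connected components of the graph whose edges are the matched pairs. The sketch below illustrates the idea with a small union-find structure. This is a swapped-in technique for illustration, not Splink's own implementation (the log above suggests an iterative approach), and labelling each cluster with its smallest `unique_id` is an assumption. The function name and toy ids are hypothetical.

```python
def cluster_pairs(ids, pairs):
    """Map each id to a cluster label via connected components (union-find)."""
    parent = {i: i for i in ids}

    def find(x):
        # Walk up to the root, shortening the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            # Attach the larger root under the smaller one, so the
            # smallest id in each component ends up as its label.
            if root_b < root_a:
                root_a, root_b = root_b, root_a
            parent[root_b] = root_a
    return {i: find(i) for i in ids}


toy_ids = ["a1", "a2", "a3", "b1"]
toy_links = [("a1", "a2"), ("a2", "a3")]
toy_clusters = cluster_pairs(toy_ids, toy_links)
print(toy_clusters)
```

Note that `a1` and `a3` end up in the same cluster even though no rule linked them directly. Clustering is transitive, so a single loose rule can chain unrelated records together.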


In [11]:
clusters.as_pandas_dataframe(limit=5)

Unnamed: 0,cluster_id,unique_id,cluster,full_name,first_and_surname,first_name,surname,dob,birth_place,postcode_fake,gender,occupation,__splink_salt
0,Q64787791-1,Q64787791-16,Q64787791,samler browne,samler browne,samler,browne,1859-01-01,,sw19 5lj,male,publicist,0.336211
1,Q6773500-1,Q6773500-9,Q6773500,marshal brookes,marshal brookes,marshal,brookes,1855-05-30,,ol12 7ts,male,,0.211221
2,Q7790962-1,Q7790962-6,Q7790962,of andrews,of andrews,of,andrews,1101-01-01,leicester,le12 8tq,male,monk,0.60021
3,Q63871171-1,Q63871171-3,Q63871171,harold wight,harold wight,harold,wight,1850-01-01,malvern hills,wr13 6sb,female,writer,0.725846
4,Q6241382-1,Q6241382-6,Q6241382,evan jackson,evan jackson,evan,jackson,1686-04-04,sessay,yo7 3nn,male,author,0.948951


These results can then be passed into the `Cluster Studio Dashboard`.


In [12]:
linker.cluster_studio_dashboard(
    df_predict,
    clusters,
    "dashboards/50k_deterministic_cluster.html",
    sampling_method="by_cluster_size",
    overwrite=True,
)

from IPython.display import IFrame

IFrame(src="./dashboards/50k_deterministic_cluster.html", width="100%", height=1200)