## Linking a dataset of real historical persons with Deterministic Rules

While Splink is primarily a tool for probabilistic record linkage, it also supports deterministic (i.e. rules-based) linkage.

Significant work has gone into optimising the performance of rules-based matching, so Splink is likely to be considerably faster than basic hand-written SQL.

In this example, we deduplicate a 50k row dataset based on historical persons scraped from Wikidata. Duplicate records have been introduced with a variety of errors. The probabilistic dedupe of the same dataset can be found at `Deduplicate 50k rows historical persons`.


<a target="_blank" href="https://colab.research.google.com/github/moj-analytical-services/splink/blob/splink4_examples_notebooks/docs/demos/examples/duckdb/deterministic_dedupe.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>


In [None]:
# Uncomment and run this cell if you're running in Google Colab.
# !pip install git+https://github.com/moj-analytical-services/splink.git@splink4_examples_notebooks

In [1]:
from splink.datasets import splink_datasets

import altair as alt

alt.renderers.enable("html")

import pandas as pd

pd.options.display.max_rows = 1000
df = splink_datasets.historical_50k
df.head()

Unnamed: 0,unique_id,cluster,full_name,first_and_surname,first_name,surname,dob,birth_place,postcode_fake,gender,occupation
0,Q2296770-1,Q2296770,"thomas clifford, 1st baron clifford of chudleigh",thomas chudleigh,thomas,chudleigh,1630-08-01,devon,tq13 8df,male,politician
1,Q2296770-2,Q2296770,thomas of chudleigh,thomas chudleigh,thomas,chudleigh,1630-08-01,devon,tq13 8df,male,politician
2,Q2296770-3,Q2296770,tom 1st baron clifford of chudleigh,tom chudleigh,tom,chudleigh,1630-08-01,devon,tq13 8df,male,politician
3,Q2296770-4,Q2296770,thomas 1st chudleigh,thomas chudleigh,thomas,chudleigh,1630-08-01,devon,tq13 8hu,,politician
4,Q2296770-5,Q2296770,"thomas clifford, 1st baron chudleigh",thomas chudleigh,thomas,chudleigh,1630-08-01,devon,tq13 8df,,politician


When defining the settings object, specify your deterministic rules in the `blocking_rules_to_generate_predictions` key.

For a deterministic linkage, the linkage methodology is based solely on these rules, so there is no need to define `comparisons` or any of the other parameters that a probabilistic model requires for training.


In [2]:
from splink.blocking_rule_library import block_on
from splink.database_api import DuckDBAPI
from splink.linker import Linker

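settings = {
    "link_type": "dedupe_only",
    "blocking_rules_to_generate_predictions": [
        # Each rule links two records that agree exactly on every listed column.
        # A pair is linked if it satisfies at least one of the rules.
        block_on("first_name", "surname", "dob"),
        block_on("surname", "dob", "postcode_fake"),
        block_on("first_name", "dob", "occupation"),
    ],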
    "retain_matching_columns": True,
    "retain_intermediate_calculation_columns": True,
}
linker = Linker(df, settings, database_api=DuckDBAPI())
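
To make concrete what a deterministic rule means, here is a minimal standard-library sketch of exact-match linkage. It is not how Splink executes the rules (Splink generates SQL for the backend); the toy records and the helper `exact_match_pairs` are hypothetical and exist only for illustration. The sketch assumes SQL-style equality, where a missing value never matches anything.

```python
from collections import defaultdict
from itertools import combinations

# Toy records, invented for illustration (not rows from historical_50k).
records = [
    {"unique_id": "a-1", "first_name": "thomas", "surname": "chudleigh",
     "dob": "1630-08-01", "postcode_fake": "tq13 8df"},
    {"unique_id": "a-2", "first_name": "thomas", "surname": "chudleigh",
     "dob": "1630-08-01", "postcode_fake": "tq13 8hu"},
    {"unique_id": "b-1", "first_name": "tom", "surname": "chudleigh",
     "dob": "1630-08-01", "postcode_fake": "tq13 8df"},
    {"unique_id": "c-1", "first_name": "mary", "surname": "smith",
     "dob": None, "postcode_fake": "ab1 2cd"},
]


def exact_match_pairs(rows, columns):
    """Return the set of id pairs that agree exactly on every column."""
    blocks = defaultdict(list)
    for row in rows:
        key = tuple(row[col] for col in columns)
        # Assumption: mirror SQL equality, so a missing value never matches.
        if any(value is None for value in key):
            continue
        blocks[key].append(row["unique_id"])
    pairs = set()
    for ids in blocks.values():
        # Every pair inside a block satisfies the rule; sorting means each
        # unordered pair is emitted once as (smaller_id, larger_id).
        pairs.update(combinations(sorted(ids), 2))
    return pairs


rule_1 = exact_match_pairs(records, ["first_name", "surname", "dob"])
rule_2 = exact_match_pairs(records, ["surname", "dob", "postcode_fake"])

# A pair is linked if at least one rule accepts it.
linked = rule_1 | rule_2
print(sorted(linked))
```

In this toy data, `a-2` and `b-1` are never compared directly, but both are linked to `a-1` by different rules. This is why listing several rules can recover duplicates that any single rule would miss.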

Prior to running the linkage, it's usually a good idea to check how many record comparisons will be generated by your deterministic rules:


In [3]:
linker.cumulative_num_comparisons_from_blocking_rules_chart()
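
As a rough illustration of where these counts come from: a rule that groups records into blocks of size n generates n(n-1)/2 comparisons per block. The sketch below counts comparisons for one rule in isolation on hypothetical toy rows. It does not reproduce the chart, which, as its name suggests, reports cumulative counts across the rules; the helper `count_comparisons` is an illustrative name, not part of Splink.

```python
from collections import Counter
from math import comb

# Toy rows, invented for illustration.
rows = [
    {"first_name": "thomas", "surname": "chudleigh", "dob": "1630-08-01"},
    {"first_name": "thomas", "surname": "chudleigh", "dob": "1630-08-01"},
    {"first_name": "tom", "surname": "chudleigh", "dob": "1630-08-01"},
    {"first_name": "mary", "surname": "smith", "dob": None},
    {"first_name": "mary", "surname": "smith", "dob": None},
]


def count_comparisons(rows, columns):
    """Count the record pairs that agree exactly on all of `columns`."""
    block_sizes = Counter(
        tuple(row[col] for col in columns)
        for row in rows
        # Assumption: rows with a missing value in any rule column match nothing.
        if all(row[col] is not None for col in columns)
    )
    # A block of n records yields n choose 2 pairwise comparisons.
    return sum(comb(n, 2) for n in block_sizes.values())


for rule in (["surname"], ["surname", "dob"], ["first_name", "surname", "dob"]):
    print(rule, count_comparisons(rows, rule))
```

Adding columns to a rule can only shrink its blocks, so the count falls (here from 4 to 3 to 1) as the rule becomes stricter.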

The linkage itself is run with the `deterministic_link` function, which returns the record pairs matched by the rules.


In [4]:
df_predict = linker.deterministic_link()
df_predict.as_pandas_dataframe().head()

Unnamed: 0,unique_id_l,unique_id_r,postcode_fake_l,postcode_fake_r,surname_l,surname_r,occupation_l,occupation_r,first_name_l,first_name_r,dob_l,dob_r,match_key,match_probability
0,Q2296770-1,Q2296770-6,tq13 8df,tq13 8df,chudleigh,chudleigh,politician,politician,thomas,thomas,1630-08-01,1630-08-01,0,1.0
1,Q2296770-2,Q2296770-6,tq13 8df,tq13 8df,chudleigh,chudleigh,politician,politician,thomas,thomas,1630-08-01,1630-08-01,0,1.0
2,Q2296770-3,Q2296770-7,tq13 8df,tq13 8df,chudleigh,chudleigh,politician,,tom,tom,1630-08-01,1630-08-01,0,1.0
3,Q2296770-4,Q2296770-6,tq13 8hu,tq13 8df,chudleigh,chudleigh,politician,politician,thomas,thomas,1630-08-01,1630-08-01,0,1.0
4,Q2296770-5,Q2296770-6,tq13 8df,tq13 8df,chudleigh,chudleigh,politician,politician,thomas,thomas,1630-08-01,1630-08-01,0,1.0


These pairwise links can then be used to generate clusters.

Note that, for deterministic linkage, every comparison is assigned a match probability of 1, so to generate clusters, set `threshold_match_probability=1` in the `cluster_pairwise_predictions_at_threshold` function.


In [5]:
clusters = linker.cluster_pairwise_predictions_at_threshold(
    df_predict, threshold_match_probability=1
)

Completed iteration 1, root rows count 94
Completed iteration 2, root rows count 10
Completed iteration 3, root rows count 0
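
Conceptually, clustering at a threshold of 1 groups records into connected components: two records share a cluster if a chain of links joins them. The sketch below shows that idea with a small union-find structure. It is a swapped-in technique for illustration, not Splink's implementation (the log above indicates Splink uses an iterative procedure on the backend), and `assign_clusters` and the toy ids are hypothetical.

```python
def assign_clusters(node_ids, pairs):
    """Map each record id to a cluster label (the smallest id in its component)."""
    parent = {node: node for node in node_ids}

    def find(node):
        # Walk up to the root, shortening the path as we go (path halving).
        while parent[node] != node:
            parent[node] = parent[parent[node]]
            node = parent[node]
        return node

    for left, right in pairs:
        root_l, root_r = find(left), find(right)
        if root_l != root_r:
            # Always keep the smaller id as the root so labels are deterministic.
            if root_r < root_l:
                root_l, root_r = root_r, root_l
            parent[root_r] = root_l

    return {node: find(node) for node in node_ids}


# Toy input, invented for illustration: two links and one unlinked record.
ids = ["a-1", "a-2", "b-1", "c-1"]
links = [("a-1", "a-2"), ("a-1", "b-1")]

clusters_by_id = assign_clusters(ids, links)
print(clusters_by_id)
```

Records that appear in no link, such as `c-1` here, end up in a cluster of their own.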


In [6]:
clusters.as_pandas_dataframe(limit=5)

Unnamed: 0,cluster_id,unique_id,cluster,full_name,first_and_surname,first_name,surname,dob,birth_place,postcode_fake,gender,occupation,__splink_salt
0,Q21461054-1,Q43139177-8,Q43139177,fred valter,fred valter,fred,valter,1850-01-01,,wn4 0xl,female,,0.148775
1,Q84562127-1,Q84562127-19,Q84562127,elsie browne,elsie browne,elsie,browne,1853-01-01,,,male,,0.190937
2,Q20664532-1,Q21466387-6,Q21466387,harry brooker,harry brooker,harry,brooker,1848-01-01,plymouth,pl4 9hx,male,painter,0.299197
3,Q55595689-1,Q55595689-12,Q55595689,felice leigh,felice leigh,felice,leigh,1853-01-01,cheshire west and chester,,male,writer,0.241367
4,Q2076179-1,Q2076179-21,Q2076179,p.,p.,p.,,1860-12-28,birkenhead,ch41 9bp,,painter,0.670492


These results can then be passed into the `Cluster Studio Dashboard`.


In [7]:
linker.cluster_studio_dashboard(
    df_predict,
    clusters,
    "dashboards/50k_deterministic_cluster.html",
    sampling_method="by_cluster_size",
    overwrite=True,
)

from IPython.display import IFrame

IFrame(src="./dashboards/50k_deterministic_cluster.html", width="100%", height=1200)