## Linking a dataset of real historical persons with Deterministic Rules

While Splink is primarily a tool for probabilistic record linkage, it also includes functionality to perform deterministic (i.e. rules-based) linkage.

Significant work has gone into optimising the performance of rules-based matching, so Splink is likely to be considerably faster than writing the basic SQL by hand.

In this example, we deduplicate a 50k row dataset based on historical persons scraped from Wikidata. Duplicate records have been introduced with a variety of errors. The probabilistic dedupe of the same dataset can be found at `Deduplicate 50k rows historical persons`.


<a target="_blank" href="https://colab.research.google.com/github/moj-analytical-services/splink/blob/splink4_dev/docs/demos/examples/duckdb/deterministic_dedupe.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>


In [1]:
# Uncomment and run this cell if you're running in Google Colab.
# !pip install git+https://github.com/moj-analytical-services/splink.git@splink4_dev

In [2]:
import pandas as pd
from splink import splink_datasets

pd.options.display.max_rows = 1000
df = splink_datasets.historical_50k
df.head()

Unnamed: 0,unique_id,cluster,full_name,first_and_surname,first_name,surname,dob,birth_place,postcode_fake,gender,occupation
0,Q2296770-1,Q2296770,"thomas clifford, 1st baron clifford of chudleigh",thomas chudleigh,thomas,chudleigh,1630-08-01,devon,tq13 8df,male,politician
1,Q2296770-2,Q2296770,thomas of chudleigh,thomas chudleigh,thomas,chudleigh,1630-08-01,devon,tq13 8df,male,politician
2,Q2296770-3,Q2296770,tom 1st baron clifford of chudleigh,tom chudleigh,tom,chudleigh,1630-08-01,devon,tq13 8df,male,politician
3,Q2296770-4,Q2296770,thomas 1st chudleigh,thomas chudleigh,thomas,chudleigh,1630-08-01,devon,tq13 8hu,,politician
4,Q2296770-5,Q2296770,"thomas clifford, 1st baron chudleigh",thomas chudleigh,thomas,chudleigh,1630-08-01,devon,tq13 8df,,politician


When defining the settings object, specify your deterministic rules in the `blocking_rules_to_generate_predictions` key.

For a deterministic linkage, the linkage methodology is based solely on these rules, so there is no need to define `comparisons` or any of the other parameters required to train a probabilistic model.


In [3]:
from splink import SettingsCreator, Linker, block_on, DuckDBAPI

settings = SettingsCreator(
    link_type="dedupe_only",
    blocking_rules_to_generate_predictions=[
        block_on("first_name", "surname", "dob"),
        block_on("surname", "dob", "postcode_fake"),
        block_on("first_name", "dob", "occupation"),
    ],
    retain_intermediate_calculation_columns=True,
)

linker = Linker(df, settings, database_api=DuckDBAPI())
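
To make the meaning of these rules concrete, here is a minimal sketch of the kind of hand-written SQL they correspond to: two records are linked if they agree exactly on every column of at least one rule. It uses Python's built-in `sqlite3` module rather than Splink or DuckDB, and it is not the SQL that Splink generates. The table name `person` and the record `X-1` are hypothetical and used for illustration only; the other rows are copied from the sample output above.

```python
import sqlite3

# Illustrative rows: the first four are taken from the sample output above;
# the fifth ("X-1") is a hypothetical unrelated record added for contrast.
rows = [
    ("Q2296770-1", "thomas", "chudleigh", "1630-08-01", "tq13 8df", "politician"),
    ("Q2296770-2", "thomas", "chudleigh", "1630-08-01", "tq13 8df", "politician"),
    ("Q2296770-3", "tom", "chudleigh", "1630-08-01", "tq13 8df", "politician"),
    ("Q2296770-4", "thomas", "chudleigh", "1630-08-01", "tq13 8hu", "politician"),
    ("X-1", "jane", "smith", "1700-01-01", "ab1 2cd", "writer"),
]

con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE person (unique_id TEXT, first_name TEXT, surname TEXT, "
    "dob TEXT, postcode_fake TEXT, occupation TEXT)"
)
con.executemany("INSERT INTO person VALUES (?, ?, ?, ?, ?, ?)", rows)

# Self-join: each OR branch mirrors one of the three rules.
# `l.unique_id < r.unique_id` keeps each unordered pair exactly once.
# In SQL, NULL = NULL is not true, so a record with a missing value in one
# of a rule's columns cannot match on that rule.
sql = """
SELECT l.unique_id, r.unique_id
FROM person AS l
JOIN person AS r
  ON l.unique_id < r.unique_id
WHERE (l.first_name = r.first_name AND l.surname = r.surname AND l.dob = r.dob)
   OR (l.surname = r.surname AND l.dob = r.dob AND l.postcode_fake = r.postcode_fake)
   OR (l.first_name = r.first_name AND l.dob = r.dob AND l.occupation = r.occupation)
ORDER BY l.unique_id, r.unique_id
"""
pairs = con.execute(sql).fetchall()
for left_id, right_id in pairs:
    print(left_id, right_id)
```

In this toy data, records `Q2296770-3` and `Q2296770-4` are not linked directly, because they differ on both `first_name` and `postcode_fake` and so satisfy none of the three rules.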


Prior to running the linkage, it's usually a good idea to check how many record comparisons will be generated by your deterministic rules:


In [4]:
linker.cumulative_num_comparisons_from_blocking_rules_chart()
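
As a rough illustration of what such a count represents, the sketch below tallies, for each rule in order, how many record pairs it generates that were not already generated by an earlier rule. This is a plain-Python approximation written for this example, not Splink's implementation; the helper `new_pairs_per_rule` and the record `X-1` are hypothetical.

```python
from itertools import combinations

# Four rows from the sample output above plus one hypothetical record ("X-1").
columns = ("unique_id", "first_name", "surname", "dob", "postcode_fake", "occupation")
raw_rows = [
    ("Q2296770-1", "thomas", "chudleigh", "1630-08-01", "tq13 8df", "politician"),
    ("Q2296770-2", "thomas", "chudleigh", "1630-08-01", "tq13 8df", "politician"),
    ("Q2296770-3", "tom", "chudleigh", "1630-08-01", "tq13 8df", "politician"),
    ("Q2296770-4", "thomas", "chudleigh", "1630-08-01", "tq13 8hu", "politician"),
    ("X-1", "jane", "smith", "1700-01-01", "ab1 2cd", "writer"),
]
records = [dict(zip(columns, row)) for row in raw_rows]

# Each rule is the tuple of columns that must agree exactly.
rules = [
    ("first_name", "surname", "dob"),
    ("surname", "dob", "postcode_fake"),
    ("first_name", "dob", "occupation"),
]


def new_pairs_per_rule(records, rules):
    """Return, for each rule in order, the number of pairs it adds."""
    seen = set()
    counts = []
    for rule in rules:
        # Group record ids by their values on the rule's columns.
        blocks = {}
        for rec in records:
            key = tuple(rec[col] for col in rule)
            if any(value is None for value in key):
                continue  # missing values never match
            blocks.setdefault(key, []).append(rec["unique_id"])
        # A block of n records yields n * (n - 1) / 2 candidate pairs.
        new_pairs = set()
        for ids in blocks.values():
            for pair in combinations(sorted(ids), 2):
                if pair not in seen:
                    new_pairs.add(pair)
        seen |= new_pairs
        counts.append(len(new_pairs))
    return counts


counts = new_pairs_per_rule(records, rules)
print(counts)  # [3, 2, 0]
```

Here the third rule adds nothing new: every pair it produces was already produced by the first rule. On real data, a rule that adds a very large number of pairs is a sign that it may be too loose.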

The results of the linkage can be viewed with the `deterministic_link` function.


In [5]:
df_predict = linker.deterministic_link()
df_predict.as_pandas_dataframe().head()

Unnamed: 0,unique_id_l,unique_id_r,first_name_l,first_name_r,dob_l,dob_r,surname_l,surname_r,postcode_fake_l,postcode_fake_r,occupation_l,occupation_r,match_key,match_probability
0,Q55455287-12,Q55455287-2,jaido,jaido,1836-01-01,1836-01-01,morata,morata,ta4 2ug,ta4 2uu,,writer,0,1.0
1,Q55455287-12,Q55455287-3,jaido,jaido,1836-01-01,1836-01-01,morata,morata,ta4 2ug,ta4 2uu,,writer,0,1.0
2,Q55455287-12,Q55455287-4,jaido,jaido,1836-01-01,1836-01-01,morata,morata,ta4 2ug,ta4 2sz,,writer,0,1.0
3,Q55455287-12,Q55455287-5,jaido,jaido,1836-01-01,1836-01-01,morata,morata,ta4 2ug,ta4 2ug,,,0,1.0
4,Q55455287-12,Q55455287-6,jaido,jaido,1836-01-01,1836-01-01,morata,morata,ta4 2ug,,,writer,0,1.0


These pairwise links can then be used to generate clusters.

Note that, for deterministic linkage, every comparison is assigned a match probability of 1, so to generate clusters, set `threshold_match_probability=1` in the `cluster_pairwise_predictions_at_threshold` function.


In [6]:
clusters = linker.cluster_pairwise_predictions_at_threshold(
    df_predict, threshold_match_probability=1
)

Completed iteration 1, root rows count 94


Completed iteration 2, root rows count 10


Completed iteration 3, root rows count 0
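
Conceptually, clustering pairwise links amounts to finding the connected components of a graph whose nodes are records and whose edges are links. The following is a minimal union-find sketch of that idea, not Splink's own algorithm. The function name `connected_components`, the record `X-1`, and the choice of the smallest id as the cluster label are assumptions made for this example.

```python
def connected_components(node_ids, edges):
    """Map each node id to a cluster label (the smallest id in its component)."""
    parent = {node: node for node in node_ids}

    def find(node):
        # Follow parent pointers to the root, shortening the path as we go.
        while parent[node] != node:
            parent[node] = parent[parent[node]]
            node = parent[node]
        return node

    for left, right in edges:
        root_left, root_right = find(left), find(right)
        if root_left != root_right:
            # Attach the larger root to the smaller one so that the root of
            # every component is always its smallest id.
            smaller, larger = sorted((root_left, root_right))
            parent[larger] = smaller

    return {node: find(node) for node in node_ids}


# Hypothetical links between four records; note there is no direct link
# between "Q2296770-3" and "Q2296770-4".
node_ids = ["Q2296770-1", "Q2296770-2", "Q2296770-3", "Q2296770-4", "X-1"]
edges = [
    ("Q2296770-1", "Q2296770-2"),
    ("Q2296770-1", "Q2296770-3"),
    ("Q2296770-1", "Q2296770-4"),
    ("Q2296770-2", "Q2296770-3"),
    ("Q2296770-2", "Q2296770-4"),
]

cluster_of = connected_components(node_ids, edges)
for node, cluster_id in cluster_of.items():
    print(node, "->", cluster_id)
```

Records `Q2296770-3` and `Q2296770-4` end up in the same cluster even though no rule links them directly, because both are linked to `Q2296770-1`. This transitivity is why a single overly loose rule can merge otherwise unrelated clusters.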


In [7]:
clusters.as_pandas_dataframe(limit=5)

Unnamed: 0,cluster_id,unique_id,cluster,full_name,first_and_surname,first_name,surname,dob,birth_place,postcode_fake,gender,occupation,__splink_salt
0,Q20732676-1,Q20732676-2,Q20732676,frances mary ormsby-gore,frances ormsby-gore,frances,ormsby-gore,1845-01-01,flintshire,ch6 5uu,female,writer,0.296041
1,Q18508292-1,Q27919030-3,Q27919030,harry wallace,harry wallace,harry,wallace,1860-01-01,manchester,m40 1qx,male,painter,0.373205
2,Q20664532-1,Q21466387-3,Q21466387,harry broker,harry broker,harry,broker,1848-01-01,plymouth,pl4 9hx,male,painter,0.733643
3,Q30145265-1,Q30145265-12,Q30145265,monty stow,monty stow,monty,stow,1847-07-21,county durham,dh8 9ud,male,cricketer,0.399835
4,Q21453106-1,Q21453106-16,Q21453106,harry carl renard,harry renard,harry,renard,1855-01-01,staffordshire moorlands,st13 7pq,male,,0.706409


These results can then be passed into the `Cluster Studio Dashboard`.


In [8]:
linker.cluster_studio_dashboard(
    df_predict,
    clusters,
    "dashboards/50k_deterministic_cluster.html",
    sampling_method="by_cluster_size",
    overwrite=True,
)

from IPython.display import IFrame

IFrame(src="./dashboards/50k_deterministic_cluster.html", width="100%", height=1200)