## Historical people: Quick and dirty

This example shows how to get some initial record linkage results as quickly as possible.

The model built here can be made more accurate in many ways, but it is a good starting point if you just want to try Splink and see what it can do.


<a target="_blank" href="https://colab.research.google.com/github/moj-analytical-services/splink/blob/splink4_dev/docs/demos/examples/duckdb/quick_and_dirty_persons.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>


In [1]:
# Uncomment and run this cell if you're running in Google Colab.
# !pip install git+https://github.com/moj-analytical-services/splink.git@splink4_dev

In [2]:
from splink.datasets import splink_datasets

# Load Splink's built-in historical_50k demo dataset and preview the first rows.
df = splink_datasets.historical_50k
df.head(5)

In [3]:
from splink import block_on, SettingsCreator
import splink.comparison_library as cl
import splink.comparison_template_library as ctl


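settings = SettingsCreator(
    link_type="dedupe_only",
    # Only record pairs that satisfy at least one of these rules are scored.
    blocking_rules_to_generate_predictions=[
        block_on("full_name"),
        block_on("substr(full_name,1,6)", "dob", "birth_place"),
        block_on("dob", "birth_place"),
        block_on("postcode_fake"),
    ],
    # Each comparison defines how one column contributes to the match score.
    comparisons=[
        # Fuzzy name match at Jaro-Winkler similarity thresholds 0.9 and 0.7.
        # Term frequency adjustments give rare values more weight than common ones.
        cl.JaroWinklerAtThresholds("full_name", [0.9, 0.7]).configure(
            term_frequency_adjustments=True
        ),
        # Date of birth thresholds: within 5 days, 1 month, or 5 years.
        ctl.DateComparison(
            "dob",
            input_is_string=True,
            datetime_metrics=["day", "month", "year"],
            datetime_thresholds=[5, 1, 5],
        ),
        # Postcode threshold: Levenshtein distance of at most 2.
        cl.LevenshteinAtThresholds("postcode_fake", 2),
        cl.JaroWinklerAtThresholds("birth_place", 0.9).configure(
            term_frequency_adjustments=True
        ),
        cl.ExactMatch("occupation").configure(term_frequency_adjustments=True),
    ],
)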
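
The blocking rules in the settings above limit which record pairs are scored at all. As a rough illustration, and not Splink's implementation (Splink generates SQL that runs in the database backend), the sketch below builds candidate pairs from toy records that agree on a blocking key. The records and the `candidate_pairs` helper are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

# Toy records, invented for illustration only.
records = [
    {"unique_id": 1, "full_name": "john smith", "dob": "1850-01-01"},
    {"unique_id": 2, "full_name": "john smith", "dob": "1850-01-01"},
    {"unique_id": 3, "full_name": "jon smith", "dob": "1850-01-01"},
    {"unique_id": 4, "full_name": "mary jones", "dob": "1861-07-12"},
]


def candidate_pairs(rows, key_columns):
    """Return id pairs of rows that agree exactly on every column in key_columns."""
    buckets = defaultdict(list)
    for row in rows:
        key = tuple(row[col] for col in key_columns)
        buckets[key].append(row["unique_id"])
    pairs = set()
    for ids in buckets.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


# Candidate pairs are the union across rules, so a pair missed by one
# rule (here the "jon"/"john" typo) can still be picked up by another.
by_name = candidate_pairs(records, ["full_name"])
by_dob = candidate_pairs(records, ["dob"])
all_pairs = by_name | by_dob

print(sorted(by_name))
print(sorted(all_pairs))
```

Using several loose rules rather than one strict rule is what lets the model recover matches that contain a typo in any single column.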

In [4]:
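from splink import Linker, DuckDBAPI


linker = Linker(df, settings, database_api=DuckDBAPI(), set_up_basic_logging=False)

# Rules that identify near-certain matches. Splink counts the pairs they find
# and scales that count by the assumed recall to estimate the prior match probability.
deterministic_rules = [
    "l.full_name = r.full_name",
    "l.postcode_fake = r.postcode_fake and l.dob = r.dob",
]

# recall=0.6 assumes these rules find about 60% of all true matches.
linker.training.estimate_probability_two_random_records_match(
    deterministic_rules, recall=0.6
)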
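
The arithmetic behind this step can be sketched in a few lines. This is a simplified reading of the idea, not Splink's code: the number of pairs found by the deterministic rules is divided by the assumed recall to estimate the total number of true matches, which is then divided by the number of possible pairs. The function name and the input numbers are hypothetical.

```python
def prior_from_deterministic_rules(n_records, n_rule_matches, recall):
    """Estimate P(two random records match) for a deduplication job."""
    # Every unordered pair of distinct records is a possible comparison.
    total_pairs = n_records * (n_records - 1) // 2
    # If the rules find only a `recall` share of true matches, scale up.
    estimated_true_matches = n_rule_matches / recall
    return estimated_true_matches / total_pairs


# Hypothetical numbers, not taken from the historical_50k dataset.
prior = prior_from_deterministic_rules(
    n_records=1_000, n_rule_matches=300, recall=0.6
)
print(f"{prior:.6f}")
```

A lower `recall` guess raises the estimated prior, so it is worth checking how sensitive the results are to this value.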

In [5]:
# Estimate u probabilities from up to 2 million randomly sampled record pairs.
linker.training.estimate_u_using_random_sampling(max_pairs=2e6)
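
The idea behind random sampling is that almost all randomly chosen record pairs are non-matches, so the share of sampled pairs that falls into each comparison level approximates that level's u probability. The sketch below illustrates this on a single toy column with two levels. The data and the `estimate_u` helper are hypothetical and much simpler than what Splink does internally.

```python
import random
from collections import Counter

# Hypothetical column values, standing in for a column such as `occupation`.
occupations = ["politician"] * 50 + ["writer"] * 30 + ["painter"] * 20


def estimate_u(values, n_pairs, seed=0):
    """Share of randomly sampled pairs that land in each comparison level."""
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(n_pairs):
        left, right = rng.sample(values, 2)  # two distinct records
        level = "exact_match" if left == right else "all_other"
        counts[level] += 1
    return {level: count / n_pairs for level, count in counts.items()}


u = estimate_u(occupations, n_pairs=10_000)
print(u)
```

Sampling more pairs gives more stable estimates, which is the trade-off the `max_pairs` argument controls.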

In [6]:
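# Score the blocked pairs and keep those that clear the 0.9 match probability threshold.
# No m probabilities were estimated above, so Splink falls back to default m values.
results = linker.inference.predict(threshold_match_probability=0.9)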
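
For intuition about where a match probability comes from, the sketch below applies the Fellegi-Sunter style calculation that Splink's model is based on: the prior odds are multiplied by one Bayes factor (m divided by u) per comparison, and the resulting odds are converted back to a probability. All m, u, and prior values here are invented for illustration, and term frequency adjustments are ignored.

```python
def match_probability(prior, bayes_factors):
    """Combine a prior match probability with per-comparison Bayes factors."""
    odds = prior / (1 - prior)
    for factor in bayes_factors:
        odds *= factor
    return odds / (1 + odds)


# Hypothetical (m, u) values for the level each comparison fell into.
observed_levels = [(0.9, 0.001), (0.8, 0.01), (0.7, 0.05)]
bayes_factors = [m / u for m, u in observed_levels]

probability = match_probability(prior=0.001, bayes_factors=bayes_factors)
keep = probability >= 0.9  # same cut-off as the threshold used in this notebook
print(f"{probability:.4f}", keep)
```

Even with a small prior, strong agreement on several columns pushes the pair well past the threshold.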

In [7]:
# Preview the first five predicted pairs as a pandas DataFrame.
results.as_pandas_dataframe(limit=5)