In [1]:
from splink.duckdb.duckdb_linker import DuckDBLinker

## Read in data

In [2]:
import pandas as pd 
pd.options.display.max_rows = 1000
df = pd.read_csv("./data/fake_1000.csv")
df.head(5)

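|   Unnamed: 0 |   unique_id | first_name   | surname   | dob        | city   | email                   |   cluster |
|-------------:|------------:|:-------------|:----------|:-----------|:-------|:------------------------|----------:|
|            0 |           0 | Robert       | Alan      | 1971-06-24 |        | robert255@smith.net     |         0 |
|            1 |           1 | Robert       | Allen     | 1971-05-24 |        | roberta25@smith.net     |         0 |
|            2 |           2 | Rob          | Allen     | 1971-06-24 | London | roberta25@smith.net     |         0 |
|            3 |           3 | Robert       | Alen      | 1971-06-24 | Lonon  |                         |         0 |
|            4 |           4 | Grace        |           | 1997-04-26 | Hull   | grace.kelly52@jones.com |         1 |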


In [6]:
# Initialise the linker, passing in the input dataset(s)
linker = DuckDBLinker(df)

Note that the `cluster` column represents the 'ground truth': it tells us which rows refer to the same person. In most real linkage scenarios we wouldn't have this column (it is what Splink is trying to estimate).

##  Configure how Splink compares records using a `settings` dictionary

`splink` needs to know how to compare pairs of records from the input dataset, with the aim of computing an overall score that quantifies their similarity.

For example, here is a pair of records from our input dataset.  


|   unique_id | first_name   | surname   | dob        | city   | email               |
|------------:|:-------------|:----------|:-----------|:-------|:--------------------|
|           1 | Robert       | Allen     | 1971-05-24 | nan    | roberta25@smith.net |
|           2 | Rob          | Allen     | 1971-06-24 | London | roberta25@smith.net |

What functions should we use to assess the similarity of `Rob` vs. `Robert` in the `first_name` field?

This is configured using a `settings` dictionary.  

Rules are defined that map the similarity of a pair of values to a small set of discrete levels, known as "comparison levels".

In this introductory example, we will make these comparisons simple.  (In a real linkage model, they'd usually be more sophisticated).

- For the `first_name` column, there will be three comparison levels:
  - an 'exact match' (e.g. `John` vs `John`)
  - similar but not exactly the same (e.g. `John` vs `Jon`).  Specifically, this will be defined as a Levenshtein distance of either 1 or 2.
  - all other comparisons 

- For all other columns, we will model just two comparison levels: an 'exact match' (e.g. `Smith` vs `Smith`), or 'anything else' (e.g. `Smith` vs `Jones`, or even `Smith` vs `Smyth`).

- For `city`, we enable term frequency comparisons because we observed significant skew in the distribution of values
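
To make the three `first_name` levels concrete, here is a minimal pure-Python sketch of how a pair of values could be assigned to a level. It is an illustration only, not Splink's implementation (Splink generates SQL that runs in DuckDB). The helper name `first_name_level`, the level labels, and the separate handling of missing values are assumptions made for this example.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or keep if equal)
            ))
        previous = current
    return previous[-1]


def first_name_level(left, right, threshold=2):
    """Hypothetical helper: map a pair of values to a comparison level label."""
    if left is None or right is None:
        return "null"  # assumption: missing values get their own level
    if left == right:
        return "exact match"
    if levenshtein(left, right) <= threshold:
        return f"levenshtein <= {threshold}"
    return "all other comparisons"


for pair in [("John", "John"), ("John", "Jon"), ("Rob", "Robert")]:
    print(pair, "->", first_name_level(*pair))
```

Note that `Rob` vs `Robert` has an edit distance of 3, so with a threshold of 2 this pair falls into the final 'all other comparisons' level.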

In [7]:
import splink.duckdb.duckdb_comparison_library as cl

settings = {
    "probability_two_random_records_match": 4/1000,
    "link_type": "dedupe_only",
    "comparisons": [
        cl.levenshtein_at_thresholds("first_name", 2),
        cl.exact_match("surname"),
        cl.exact_match("dob"),
        cl.exact_match("city", term_frequency_adjustments=True),
        cl.exact_match("email"),
    ],
    "blocking_rules_to_generate_predictions": [
        "l.first_name = r.first_name",
        "l.surname = r.surname",
    ],
    "retain_matching_columns": True,
    "retain_intermediate_calculation_columns": True,
    "additional_columns_to_retain": ["cluster"],
}

In words, this settings dictionary says:

* We are performing a deduplication task (the other options are `link_only`, or `link_and_dedupe`, which may be used if there are multiple input datasets)

* When comparing records, we will use information from the `first_name`, `surname`, `dob`, `city` and `email` columns to compute a match score.
* The `blocking_rules_to_generate_predictions` states that we will only check for duplicates amongst records where either the `first_name` or `surname` is identical.
* We have enabled term frequency adjustments for the 'city' column, because some values (e.g. `London`) appear much more frequently than others
* We will retain the cluster column in the results even though this is not used as part of comparisons. Later we'll be able to use this to compare Splink scores to the ground truth.
* We have set `retain_matching_columns` and `retain_intermediate_calculation_columns` to `True` for the purposes of the demo, because this means the output datasets contain additional information that, whilst not strictly needed by Splink, helps the user understand the calculations. If these were omitted from the settings dictionary, they would be set to `False` (their default value).
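
The effect of the two blocking rules can be illustrated on the five rows shown at the top of the notebook. The sketch below is plain Python rather than the SQL Splink actually runs, and the helper names `equal_and_not_null` and `blocked_pairs` are made up for this example. It keeps a pair if either rule is satisfied, treating missing values the way SQL equality does.

```python
from itertools import combinations

# First five rows of the input data shown earlier (None marks a missing value).
records = [
    {"unique_id": 0, "first_name": "Robert", "surname": "Alan"},
    {"unique_id": 1, "first_name": "Robert", "surname": "Allen"},
    {"unique_id": 2, "first_name": "Rob", "surname": "Allen"},
    {"unique_id": 3, "first_name": "Robert", "surname": "Alen"},
    {"unique_id": 4, "first_name": "Grace", "surname": None},
]


def equal_and_not_null(left, right):
    """Mimic SQL equality, where a comparison involving NULL is never true."""
    return left is not None and right is not None and left == right


def blocked_pairs(rows):
    """Return the id pairs that satisfy at least one of the two blocking rules."""
    pairs = []
    for left, right in combinations(rows, 2):
        same_first = equal_and_not_null(left["first_name"], right["first_name"])
        same_surname = equal_and_not_null(left["surname"], right["surname"])
        if same_first or same_surname:
            pairs.append((left["unique_id"], right["unique_id"]))
    return pairs


all_pairs = list(combinations(records, 2))
candidate_pairs = blocked_pairs(records)
print(len(all_pairs), "possible pairs,", len(candidate_pairs), "kept by blocking")
print(candidate_pairs)
```

On these five rows, blocking keeps 4 of the 10 possible pairs. On a large dataset the reduction is far greater, which is what makes scoring tractable; the cost is that true matches which satisfy neither rule (such as records 0 and 2 here) are never scored.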

## Estimate the parameters of the model


Use the `estimate_u_using_random_sampling` method to compute the `u` values of the model.

In [9]:
linker.initialise_settings(settings)
linker.estimate_u_using_random_sampling(target_rows=1e6)

----- Estimating u probabilities using random sampling -----

Estimated u probabilities using random sampling

Your model is not yet fully trained. Missing estimates for:
    - first_name (no m values are trained).
    - surname (no m values are trained).
    - dob (no m values are trained).
    - city (no m values are trained).
    - email (no m values are trained).
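
As a rough intuition for what a `u` value is: it is the probability that a comparison level is observed among records that do *not* match. Because most randomly chosen pairs are non-matches, the share of random pairs that fall into a level approximates `u`. The sketch below uses a small, hypothetical list of city values and enumerates every pair instead of sampling; the function name `estimate_u_exact_match` is invented for this example and is not part of Splink.

```python
from itertools import combinations


def estimate_u_exact_match(values):
    """Share of all value pairs that agree exactly (a stand-in for random sampling)."""
    pairs = list(combinations(values, 2))
    agreeing = sum(1 for left, right in pairs if left == right)
    return agreeing / len(pairs)


# Hypothetical city values, one per record; assume all records are different people.
cities = ["London", "London", "London", "Hull", "Leeds", "Hull"]

u_exact_match = estimate_u_exact_match(cities)
u_all_other = 1 - u_exact_match
print(f"u(exact match) = {u_exact_match:.3f}, u(all other) = {u_all_other:.3f}")
```

A skewed column such as `city` produces a relatively high `u` for the exact match level, because common values like `London` coincide by chance. This is the motivation for the term frequency adjustment enabled in the settings.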


We then use the expectation maximisation algorithm to train the `m` values.

Note that in this first EM training session we block on `first_name` and `surname`, meaning that every pair compared will have `first_name` and `surname` exactly equal. As a result, this training session cannot estimate parameters for the `first_name` or `surname` comparisons, as seen in the log messages and in their absence from the match weights chart.

In [10]:
training_blocking_rule = "l.first_name = r.first_name and l.surname = r.surname"
training_session_names = linker.estimate_parameters_using_expectation_maximisation(training_blocking_rule)

training_session_names.match_weights_interactive_history_chart()


----- Starting EM training session -----

Estimating the m probabilities of the model by blocking on:
l.first_name = r.first_name and l.surname = r.surname

Parameter estimates will be made for the following comparison(s):
    - dob
    - city
    - email

Parameter estimates cannot be made for the following comparison(s) since they are used in the blocking rules: 
    - first_name
    - surname

Iteration 1: Largest change in params was 0.479 in the m_probability of dob, level `All other comparisons`
Iteration 2: Largest change in params was -0.0334 in the m_probability of email, level `Exact match`
Iteration 3: Largest change in params was 0.023 in probability_two_random_records_match
Iteration 4: Largest change in params was 0.0157 in probability_two_random_records_match
Iteration 5: Largest change in params was 0.0113 in probability_two_random_records_match
Iteration 6: Largest change in params was 0.00851 in probability_two_random_records_match
Iteration 7: Largest change in para
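
The iterations logged above follow the usual expectation maximisation pattern. The sketch below is a heavily simplified version, assuming every comparison has just two levels (agree or disagree), that `u` is held fixed, and that the columns are conditionally independent. The function name, the starting values, and the comparison vectors are all hypothetical; this is not Splink's code.

```python
def train_m_with_em(vectors, u, m_start, lam_start, n_iter=25):
    """Estimate m probabilities and the match proportion, holding u fixed.

    vectors: list of tuples of 0/1 agreement flags, one tuple per record pair.
    u / m_start: per-column probability of agreement among non-matches / matches.
    lam_start: initial guess for the proportion of pairs that are matches.
    """
    m, lam = list(m_start), lam_start
    n_cols = len(u)
    for _ in range(n_iter):
        # E-step: posterior probability that each pair is a match.
        weights = []
        for vec in vectors:
            p_match, p_non = lam, 1 - lam
            for k in range(n_cols):
                p_match *= m[k] if vec[k] else 1 - m[k]
                p_non *= u[k] if vec[k] else 1 - u[k]
            weights.append(p_match / (p_match + p_non))
        # M-step: re-estimate lam and m from the weighted pairs.
        total = sum(weights)
        lam = total / len(vectors)
        m = [
            sum(w * vec[k] for w, vec in zip(weights, vectors)) / total
            for k in range(n_cols)
        ]
    return m, lam


# Hypothetical agreement vectors for three columns across 100 record pairs.
vectors = (
    [(1, 1, 1)] * 30 + [(0, 0, 0)] * 40
    + [(1, 1, 0)] * 5 + [(1, 0, 1)] * 5 + [(0, 1, 1)] * 5
    + [(1, 0, 0)] * 5 + [(0, 1, 0)] * 5 + [(0, 0, 1)] * 5
)
estimated_m, estimated_lam = train_m_with_em(
    vectors, u=[0.1, 0.1, 0.1], m_start=[0.9, 0.9, 0.9], lam_start=0.5
)
print("m:", [round(value, 3) for value in estimated_m])
print("match proportion:", round(estimated_lam, 3))
```

If every pair agreed on a column, as happens for a column used in the blocking rule, the data would carry no information to separate matches from non-matches on that column. This is why parameters for blocked columns are not estimated in a given session.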

In a second training session, we block on `dob`. This allows us to estimate parameters for the `first_name` and `surname` comparisons.

Between the two training sessions, we now have parameter estimates for all comparisons.

In [11]:
training_blocking_rule = "l.dob = r.dob"
training_session_dob = linker.estimate_parameters_using_expectation_maximisation(training_blocking_rule)
training_session_dob.match_weights_interactive_history_chart()


----- Starting EM training session -----

Estimating the m probabilities of the model by blocking on:
l.dob = r.dob

Parameter estimates will be made for the following comparison(s):
    - first_name
    - surname
    - city
    - email

Parameter estimates cannot be made for the following comparison(s) since they are used in the blocking rules: 
    - dob

Iteration 1: Largest change in params was 0.456 in probability_two_random_records_match
Iteration 2: Largest change in params was 0.207 in probability_two_random_records_match
Iteration 3: Largest change in params was 0.0733 in probability_two_random_records_match
Iteration 4: Largest change in params was 0.0326 in probability_two_random_records_match
Iteration 5: Largest change in params was 0.0177 in probability_two_random_records_match
Iteration 6: Largest change in params was 0.0107 in probability_two_random_records_match
Iteration 7: Largest change in params was 0.00702 in probability_two_random_records_match
Iteration 8: Larg

In [12]:
linker.estimate_m_from_label_column("cluster")

-------- Estimating m probabilities using from column cluster --------

Your model is fully trained. All comparisons have at least one estimate for their m and u values


In [13]:
linker.parameter_estimate_comparisons_chart()

The final match weights can be viewed in the match weights chart:

In [14]:
linker.match_weights_chart()
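
For reference, the match weights shown in such a chart relate to the `m` and `u` probabilities through the standard Fellegi-Sunter formulas: the weight of a level is `log2(m / u)`, the prior weight comes from `probability_two_random_records_match`, and the sum of the weights converts to a match probability. The sketch below illustrates that arithmetic. The function names and the `m` and `u` values are hypothetical and are not taken from the trained model.

```python
import math


def match_weight(m, u):
    """Match weight (log2 of the Bayes factor m / u) for one comparison level."""
    return math.log2(m / u)


def prior_weight(probability_two_random_records_match):
    """Starting weight implied by the prior probability of a match."""
    p = probability_two_random_records_match
    return math.log2(p / (1 - p))


def match_probability(total_weight):
    """Convert a summed match weight back into a probability."""
    bayes_factor = 2 ** total_weight
    return bayes_factor / (1 + bayes_factor)


# Hypothetical (m, u) values for the levels observed in one record pair.
observed_levels = {
    "first_name": (0.50, 0.010),  # exact match on first_name
    "surname": (0.80, 0.002),     # exact match on surname
    "dob": (0.10, 0.990),         # dob differs, which counts against a match
}

total_weight = prior_weight(4 / 1000) + sum(
    match_weight(m, u) for m, u in observed_levels.values()
)
print("total match weight:", round(total_weight, 2))
print("match probability:", round(match_probability(total_weight), 4))
```

Levels with `m` greater than `u` have positive weights and push the score towards a match; levels with `m` less than `u` have negative weights and push it away.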