## Quickstart deduplication demo

In this demo we de-duplicate a small dataset.

The purpose is to demonstrate core Splink functionality as succinctly as possible.

## Step 1: Imports and setup
The following boilerplate imports pandas and raises the number of columns it will display, so that wide output tables are not truncated.

In [1]:
import pandas as pd
pd.options.display.max_columns = 500

## Step 2: Read in data
Read in a 1000-record dataset that contains duplicates.

Note that the `group` column holds the ground truth: rows that share the same value refer to the same person. In a real-world dataset we would not have this field, because it is the label that Splink is trying to estimate.

In [2]:
df = pd.read_csv("data/fake_1000.csv")
df.head(5)

|   | unique_id | first_name | surname | dob | city | email | group |
|---|---|---|---|---|---|---|---|
| 0 | 0 | Robert | Alan | 1971-06-24 |  | robert255@smith.net | 0 |
| 1 | 1 | Robert | Allen | 1971-05-24 |  | roberta25@smith.net | 0 |
| 2 | 2 | Rob | Allen | 1971-06-24 | London | roberta25@smith.net | 0 |
| 3 | 3 | Robert | Alen | 1971-06-24 | Lonon |  | 0 |
| 4 | 4 | Grace |  | 1997-04-26 | Hull | grace.kelly52@jones.com | 1 |
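
To make the role of `group` concrete, the following sketch counts the true duplicate pairs implied by the five rows shown above. It uses plain Python rather than pandas so that it runs on its own, and the `rows` list is a hand-copied subset of the output, not the full dataset.

```python
from collections import defaultdict
from itertools import combinations

# (unique_id, group) for the five rows displayed above
rows = [(0, 0), (1, 0), (2, 0), (3, 0), (4, 1)]

# Collect the unique_ids that share each group label
ids_by_group = defaultdict(list)
for unique_id, group in rows:
    ids_by_group[group].append(unique_id)

# Every pair of records within the same group is a true match
true_pairs = [
    pair
    for ids in ids_by_group.values()
    for pair in combinations(sorted(ids), 2)
]

print(len(true_pairs), "true pairs:", true_pairs)
```

The four records in group 0 give six true pairs, while the single record shown for group 1 gives none. These pairs are what a perfect linkage model would recover.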


## Step 3: Configure Splink `settings`

`splink` needs to know how to compare records from the input dataset: for example, which columns to compare and how to assess their similarity.

This is configured using a `settings` dictionary.  For the purposes of this simple example, we will make these comparisons as simple as possible:  for each input column, Splink will categorise comparisons as either an 'exact match' (e.g. `John` vs `John`), or 'anything else' (e.g. `John` vs `David`, or even `John` vs `Jon`).
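
As a rough illustration of what such a comparison does for a single column, the sketch below assigns a level to a pair of values. The function name `exact_match_level` is hypothetical and this is not Splink's implementation. Using `-1` for missing values is an assumption that mirrors the `gamma_*` columns in the prediction output later in this demo.

```python
def exact_match_level(left, right):
    """Return the comparison level for one column of a record pair.

     1 -> exact match
     0 -> anything else
    -1 -> at least one of the two values is missing
    """
    if left is None or right is None:
        return -1
    return 1 if left == right else 0


print(exact_match_level("John", "John"))
print(exact_match_level("John", "Jon"))
print(exact_match_level("John", None))
```

Note that `John` vs `Jon` falls into the same level as `John` vs `David`: with exact matching, near misses carry no extra credit.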

In [3]:
from splink.comparison_library import exact_match

# Settings dictionary: each exact_match(...) compares one column as
# 'exact match' versus 'anything else'
settings = {
    "proportion_of_matches": 0.01,
    "link_type": "dedupe_only",
    "blocking_rules_to_generate_predictions": [
        "l.first_name = r.first_name",
        "l.surname = r.surname",
    ],
    "comparisons": [
        exact_match("first_name"),
        exact_match("surname"),
        exact_match("dob"),
        exact_match("city", term_frequency_adjustments=True),
        exact_match("email"),
    ],
    "retain_matching_columns": True,
    "retain_intermediate_calculation_columns": True,
    "additional_columns_to_retain": ["group"],
    "max_iterations": 10,
}

In words, this settings dictionary says:

* We are performing a deduplication task (the other options are `link_only`, or `link_and_dedupe`, which may be used if there are multiple input datasets)
* The blocking rules state that we will only check for duplicates among pairs of records whose first names or surnames are identical.
* When comparing records, we will use information from the `first_name`, `surname`, `dob`, `city` and `email` columns to compute a match score.
* We have enabled term frequency adjustments for the `city` column, because some values (e.g. London) appear much more frequently than others.
* We will retain the group column in the results even though this is not used as part of comparisons. This is a labelled dataset and group contains the true match - i.e. where group matches, the records pertain to the same person
* We will stop EM training once the parameter changes between iterations fall below the convergence threshold, or after 10 iterations (`max_iterations`), whichever comes first. (Generally we would allow more iterations, but this makes the demo run more quickly.)
* We have set `retain_matching_columns` and `retain_intermediate_calculation_columns` to True for the purposes of the demo, because this means the output datasets contain additional information that, whilst not strictly needed by Splink, helps the user understand the calculations. If these keys were omitted from the settings dictionary, Splink would fall back to their default values.
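
The effect of the two blocking rules can be sketched in plain Python. This is an illustration only, not how Splink executes the rules (it generates SQL). The `records` list is copied by hand from the rows shown in Step 2, and the helper `equal_and_not_null` is a hypothetical stand-in for SQL's `=` operator, which is not true when either side is null.

```python
from itertools import combinations

# (unique_id, first_name, surname) copied from the rows shown in Step 2;
# None marks a missing value
records = [
    (0, "Robert", "Alan"),
    (1, "Robert", "Allen"),
    (2, "Rob", "Allen"),
    (3, "Robert", "Alen"),
    (4, "Grace", None),
]


def equal_and_not_null(a, b):
    # Mimics SQL `l.col = r.col`, which is not true if either side is null
    return a is not None and b is not None and a == b


all_pairs = list(combinations(records, 2))

# Keep a pair if it satisfies at least one blocking rule
candidate_pairs = [
    (left[0], right[0])
    for left, right in all_pairs
    if equal_and_not_null(left[1], right[1])   # l.first_name = r.first_name
    or equal_and_not_null(left[2], right[2])   # l.surname = r.surname
]

print(len(all_pairs), "possible pairs")
print(len(candidate_pairs), "candidate pairs:", candidate_pairs)
```

Of the ten possible pairs only four survive blocking. The pairs (0, 2) and (2, 3) are true duplicates according to `group` but are never scored, which shows the trade-off: tighter blocking rules mean fewer comparisons but more missed matches.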

## Step 4: Select your SQL engine

In `splink` version 3, you can choose which SQL backend runs your linking job.

Currently, `splink` offers three different SQL backends:
- `duckdb` with: `from splink.duckdb.duckdb_linker import DuckDBLinker`

- `SQLite` with: `from splink.sqlite.sqlite_linker import SQLiteLinker`

- `spark` with: `from splink.spark.spark_linker import SparkLinker`

For smaller datasets, `duckdb` is now the recommended option, but you are free to choose whichever linker best suits your setup.

## Step 5: Using your selected linker to predict match weights

Estimate the parameters of a Fellegi-Sunter model, then use the model to generate predictions.

Together, these steps are equivalent to running `linker.get_scored_comparisons()` in `splink` version 2.
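
The cell below first estimates the `u` probabilities by random sampling. A `u` probability is the chance that two records which are not a match nevertheless agree on a column. The sketch below illustrates the idea in plain Python, under the assumption that randomly drawn pairs are almost always non-matches. The `first_names` list is hypothetical, and `n_samples` only loosely plays the role of `target_rows`.

```python
import random

# Hypothetical first names for ten records (not taken from the demo data)
first_names = ["Grace", "Evie", "Oliver", "Caleb", "Darcy",
               "Grace", "Evie", "Oliver", "Noah", "Isla"]

rng = random.Random(0)  # fixed seed so the sketch is repeatable

n_samples = 20_000
agreements = 0
for _ in range(n_samples):
    # Draw two distinct records at random; most random pairs are non-matches
    i, j = rng.sample(range(len(first_names)), 2)
    agreements += first_names[i] == first_names[j]

# Estimated probability of an exact match among non-matching pairs
u_exact_match = agreements / n_samples
print(round(u_exact_match, 3))
```

In this toy list, 3 of the 45 distinct pairs share a first name, so the estimate should settle close to 3/45.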

In [4]:
from splink.duckdb.duckdb_linker import DuckDBLinker

linker = DuckDBLinker(settings, input_tables={"fake_1000": df})

# Estimate u probabilities from randomly sampled record pairs
linker.train_u_using_random_sampling(target_rows=1e6)

# Estimate m probabilities with EM on pairs that agree on both names
blocking_rule = "l.first_name = r.first_name and l.surname = r.surname"
training_session = linker.train_m_using_expectation_maximisation(blocking_rule)

training_session.match_weights_interactive_history_chart()


Iteration 0: Largest change in params was 0.48 in the m_probability of dob, level `All other comparisons`
Iteration 1: Largest change in params was -0.0143 in the m_probability of city, level `exact_match`
Iteration 2: Largest change in params was 0.0043 in proportion_of_matches
Iteration 3: Largest change in params was 0.0036 in proportion_of_matches
Iteration 4: Largest change in params was 0.00301 in proportion_of_matches
Iteration 5: Largest change in params was 0.00254 in proportion_of_matches
Iteration 6: Largest change in params was 0.00217 in proportion_of_matches
Iteration 7: Largest change in params was 0.00187 in proportion_of_matches
Iteration 8: Largest change in params was 0.00162 in proportion_of_matches
Iteration 9: Largest change in params was 0.00141 in proportion_of_matches
EM converged after 9 iterations
Proportion of matches not fully trained, current estimates are [0.000517585570113293]
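
The log above reports the largest parameter change at each iteration. The following is a minimal textbook sketch of an EM loop for a Fellegi-Sunter model with two-level comparisons, to show where such a trace comes from. It is not Splink's implementation: the function `em`, the starting values and the toy `gammas` data are all assumptions made for illustration. It also updates `m`, `u` and the proportion of matches together, whereas the demo's `train_m_using_expectation_maximisation` call is, as its name suggests, concerned with the `m` probabilities.

```python
from math import prod


def level_prob(p, g):
    # Probability of seeing indicator g (1 = agree, 0 = disagree)
    # when the probability of agreement is p
    return p if g else 1 - p


def em(gammas, lam=0.5, max_iterations=10, tolerance=1e-4):
    k = len(gammas[0])
    n = len(gammas)
    m = [0.9] * k  # starting guess for P(agree | match)
    u = [0.1] * k  # starting guess for P(agree | non-match)
    for iteration in range(max_iterations):
        # E-step: posterior probability that each pair is a match
        post = []
        for g in gammas:
            pm = lam * prod(level_prob(m[i], g[i]) for i in range(k))
            pu = (1 - lam) * prod(level_prob(u[i], g[i]) for i in range(k))
            post.append(pm / (pm + pu))
        # M-step: re-estimate parameters from the weighted pairs
        total = sum(post)
        new_lam = total / n
        new_m = [sum(p * g[i] for p, g in zip(post, gammas)) / total
                 for i in range(k)]
        new_u = [sum((1 - p) * g[i] for p, g in zip(post, gammas)) / (n - total)
                 for i in range(k)]
        change = max(abs(a - b) for a, b in
                     zip([new_lam, *new_m, *new_u], [lam, *m, *u]))
        lam, m, u = new_lam, new_m, new_u
        print(f"Iteration {iteration}: largest change in params was {change:.3g}")
        if change < tolerance:
            break
    return lam, m, u


# Toy comparison vectors for three columns (hypothetical data)
gammas = ([(1, 1, 1)] * 8 + [(1, 1, 0)] * 2 + [(0, 0, 0)] * 80
          + [(1, 0, 0)] * 5 + [(0, 0, 1)] * 5)

lam, m, u = em(gammas)
print("estimated proportion of matches:", round(lam, 3))
```

The loop stops either when the largest change drops below `tolerance` or when `max_iterations` is reached, which corresponds to the two stopping conditions described for the settings dictionary.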


In [5]:
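# Complete model training by blocking on a different column (dob),
# so that the first_name and surname comparisons can be estimated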

blocking_rule = "l.dob = r.dob"
linker.train_m_using_expectation_maximisation(blocking_rule)



Iteration 0: Largest change in params was 0.457 in proportion_of_matches
Iteration 1: Largest change in params was 0.178 in proportion_of_matches
Iteration 2: Largest change in params was 0.0645 in proportion_of_matches
Iteration 3: Largest change in params was 0.0322 in proportion_of_matches
Iteration 4: Largest change in params was 0.0191 in proportion_of_matches
Iteration 5: Largest change in params was 0.0124 in proportion_of_matches
Iteration 6: Largest change in params was 0.00859 in proportion_of_matches
Iteration 7: Largest change in params was 0.00618 in proportion_of_matches
Iteration 8: Largest change in params was 0.00457 in proportion_of_matches
Iteration 9: Largest change in params was 0.00345 in proportion_of_matches
EM converged after 9 iterations
Proportion of matches can now be estimated, estimates are [0.0017839386049928502, 0.03468205727124394]


<EMTrainingSession, blocking on l.dob = r.dob, deactivating comparisons dob>
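
The session summary above notes that the `dob` comparison is deactivated. A small sketch, using hypothetical records, suggests why: once pairs are restricted to those with equal `dob`, that column agrees on every pair and so cannot help separate matches from non-matches, whereas other columns still vary.

```python
from itertools import combinations

# Hypothetical (unique_id, dob, city) records, for illustration only
records = [
    (0, "1971-06-24", "London"),
    (1, "1971-06-24", None),
    (2, "1971-06-24", "Hull"),
    (3, "1997-04-26", "Hull"),
    (4, "1997-04-26", "Hull"),
]

# Apply the blocking rule l.dob = r.dob
blocked_pairs = [
    (left, right)
    for left, right in combinations(records, 2)
    if left[1] == right[1]
]

# Comparison levels observed within the block (pairs with a null city skipped)
dob_levels = {int(left[1] == right[1]) for left, right in blocked_pairs}
city_levels = {
    int(left[2] == right[2])
    for left, right in blocked_pairs
    if left[2] is not None and right[2] is not None
}

print("dob levels seen:", dob_levels)
print("city levels seen:", city_levels)
```

This is also why the two training sessions block on different columns: a comparison that cannot be estimated in one session can be estimated in the other.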

The final match weights can be viewed in the match weights chart:

In [6]:
linker.settings_obj.match_weights_chart()

## Step 6: Predicting match weights using the trained model 

In [7]:
df_e = linker.predict()

## Step 7: Visualising results

In [8]:
df_e.as_pandas_dataframe().head(5)

|   | match_weight | match_probability | unique_id_l | unique_id_r | first_name_l | first_name_r | gamma_first_name | bf_first_name | surname_l | surname_r | gamma_surname | bf_surname | dob_l | dob_r | gamma_dob | bf_dob | city_l | city_r | gamma_city | tf_city_l | tf_city_r | bf_city | bf_tf_adj_city | email_l | email_r | gamma_email | bf_email | group_l | group_r | match_key |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 8.031867 | 0.996194 | 4 | 5 | Grace | Grace | 1 | 88.923335 |  | Kelly | -1 | 1.0 | 1997-04-26 | 1991-04-26 | 0 | 0.592008 | Hull |  | -1 | 0.00123 |  | 1.0 | 1.0 | grace.kelly52@jones.com | grace.kelly52@jones.com | 1 | 267.694851 | 1 | 1 | 0 |
| 1 | -2.176367 | 0.181155 | 9 | 922 | Evie | Evie | 1 | 88.923335 | Dean | Jones | 0 | 0.547085 | 2015-03-03 | 2002-07-22 | 0 | 0.592008 | Pootsmruth |  | -1 | 0.00123 |  | 1.0 | 1.0 | evihd56@earris-bailey.net | eviejones@brewer-sparks.org | 0 | 0.413619 | 3 | 230 | 0 |
| 2 | -2.176367 | 0.181155 | 14 | 998 | Oliver | Oliver | 1 | 88.923335 | Griffiths | Bird | 0 | 0.547085 | 1991-10-26 | 2000-02-27 | 0 | 0.592008 | Lunton |  | -1 | 0.00123 |  | 1.0 | 1.0 | o.griffiths90@reyes-coleman.com | oliver.b@smith.net | 0 | 0.413619 | 5 | 250 | 0 |
| 3 | -0.902743 | 0.348479 | 18 | 475 | Caleb | Caleb | 1 | 88.923335 | Rwoe | Scott | 0 | 0.547085 | 1992-11-20 | 2000-12-10 | 0 | 0.592008 | Liverpool |  | -1 | 0.04059 |  | 1.0 | 1.0 |  | c.scott@brooks.com | -1 | 1.0 | 8 | 119 | 0 |
| 4 | -3.384779 | 0.087372 | 21 | 917 | Darcy | Darcy | 1 | 88.923335 | Bernass | Rhodes | 0 | 0.547085 | 1986-02-04 | 1979-01-14 | 0 | 0.592008 | Southampton | Birmingham | 0 | 0.00861 | 0.0492 | 0.432745 | 1.0 | darcy.b@silva.com | drhodes16@johnson-robinson.com | 0 | 0.413619 | 9 | 229 | 0 |
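
To see how the `bf_*` columns combine into `match_weight` and `match_probability`, the sketch below reworks the row for `unique_id_l` 9 and `unique_id_r` 922 by hand. It assumes the prior match probability is the mean of the two estimates logged at the end of training. That assumption is not stated in the output above, but it reproduces the figures for this row to within rounding.

```python
from math import log2, prod

# Assumption: prior match probability taken as the mean of the two
# estimates logged at the end of training
prior_probability = (0.0017839386049928502 + 0.03468205727124394) / 2
prior_odds = prior_probability / (1 - prior_probability)

# Bayes factors copied from the row with unique_id_l=9, unique_id_r=922
bayes_factors = {
    "first_name": 88.923335,
    "surname": 0.547085,
    "dob": 0.592008,
    "city": 1.0,
    "tf_adj_city": 1.0,
    "email": 0.413619,
}

# Posterior odds = prior odds multiplied by every Bayes factor
posterior_odds = prior_odds * prod(bayes_factors.values())

match_weight = log2(posterior_odds)
match_probability = posterior_odds / (1 + posterior_odds)

# Per-column contributions on the log2 scale
for column, bf in bayes_factors.items():
    print(f"{column:12s} {log2(bf):+.3f}")
print(f"match_weight      {match_weight:.4f}")
print(f"match_probability {match_probability:.4f}")
```

A Bayes factor above 1 (here, the exact match on `first_name`) pushes the weight up, a factor below 1 pulls it down, and a factor of exactly 1 (a null comparison) leaves it unchanged.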


You can also view rows in this dataset as a waterfall chart as follows:

In [9]:
from splink.charts import waterfall_chart
records_to_plot = df_e.as_pandas_dataframe().head(5).to_dict(orient="records")
waterfall_chart(records_to_plot, linker.settings_obj, filter_nulls=False)