## Quickstart deduplication demo

In this demo we de-duplicate a small dataset.

The purpose is to demonstrate core Splink functionality as succinctly as possible.

## Step 1: Imports and setup
The following is boilerplate code that imports some basic packages and sets our logging level.

In [8]:
import altair as alt
import pandas as pd

pd.options.display.max_columns = 500
alt.renderers.enable("mimetype")

RendererRegistry.enable('mimetype')

In [9]:
import logging 
logging.basicConfig()  # Means logs will print in Jupyter Lab

# Set to DEBUG if you want splink to log the SQL statements it's executing under the hood
logging.getLogger("splink").setLevel(logging.INFO)
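
To see what changing the level does, here is a minimal sketch using only the standard `logging` module. The logger name `splink_demo` and the messages are hypothetical stand-ins, not real Splink output; the point is only that DEBUG messages are dropped while the level is INFO.

```python
import io
import logging

# Hypothetical logger name, used only for this illustration.
demo_logger = logging.getLogger("splink_demo")
demo_logger.propagate = False  # keep the demo output out of the root logger

buffer = io.StringIO()
handler = logging.StreamHandler(buffer)
handler.setFormatter(logging.Formatter("%(levelname)s:%(message)s"))
demo_logger.addHandler(handler)


def emit_messages(level):
    """Set the logger level, emit one INFO and one DEBUG message, return the captured lines."""
    buffer.seek(0)
    buffer.truncate()
    demo_logger.setLevel(level)
    demo_logger.info("iteration summary")
    demo_logger.debug("SELECT ... (stand-in for a SQL statement)")
    return buffer.getvalue().splitlines()


info_lines = emit_messages(logging.INFO)    # only the INFO message is kept
debug_lines = emit_messages(logging.DEBUG)  # both messages are kept
print(info_lines)
print(debug_lines)
```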

## Step 2: Read in data
Note that the `group` column holds the ground truth: rows that share the same value refer to the same person. Real-world data would not include this field, because it is exactly the label that Splink is trying to estimate.

In [10]:
df = pd.read_csv("data/fake_1000.csv")
df.head(5)

| | unique_id | first_name | surname | dob | city | email | group |
|---|---|---|---|---|---|---|---|
| 0 | 0 | Robert | Alan | 1971-06-24 | | robert255@smith.net | 0 |
| 1 | 1 | Robert | Allen | 1971-05-24 | | roberta25@smith.net | 0 |
| 2 | 2 | Rob | Allen | 1971-06-24 | London | roberta25@smith.net | 0 |
| 3 | 3 | Robert | Alen | 1971-06-24 | Lonon | | 0 |
| 4 | 4 | Grace | | 1997-04-26 | Hull | grace.kelly52@jones.com | 1 |
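
Because `group` is the truth, it tells us which pairs of rows are real duplicates. The sketch below is plain Python, independent of Splink, and lists the true matching pairs among the five rows displayed above; the helper name `true_match_pairs` is ours.

```python
from itertools import combinations

# (unique_id, group) taken from the five rows displayed above.
rows = [(0, 0), (1, 0), (2, 0), (3, 0), (4, 1)]


def true_match_pairs(rows):
    """Return every pair of unique_ids whose rows share the same group value."""
    return [
        (id_l, id_r)
        for (id_l, group_l), (id_r, group_r) in combinations(rows, 2)
        if group_l == group_r
    ]


pairs = true_match_pairs(rows)
total_pairs = len(list(combinations(rows, 2)))
print(f"{len(pairs)} true matching pairs out of {total_pairs} possible pairs")
```

Rows 0 to 3 all belong to group 0, so every pair drawn from them is a true match, while row 4 matches none of the others.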


## Step 3: Configure the comparisons that feed into our main Splink `settings` object

`splink3` offers two ways to create the comparisons that go into the `settings` object:
1. Construct a comparison manually as a dictionary (such as `comparions_exact_email` below). This gives the user more control, at the cost of more verbose code.
2. Use a predefined comparison imported from `splink.comparison_library`.

Examples of both approaches are outlined below.

In [11]:
from splink.comparison_library import exact_match, levenshtein
levenshtein_first_name = levenshtein("first_name", distance_threshold=2, term_frequency_adjustments=True)
comparions_exact_dob = exact_match("dob")
comparions_exact_city = exact_match("city")

# And a manual version. This is equivalent to: exact_match("email")
# Initial m values must be defined, and further comparison levels (see waterfall chart) can be added as needed
comparions_exact_email = {
    "column_name": "email",
    "comparison_levels": [
        {
            "sql_condition": "email_l IS NULL OR email_r IS NULL",
            "label_for_charts": "Comparison includes null",
            "is_null_level": True,
        },
        {
            "sql_condition": "email_l = email_r",
            "label_for_charts": "Exact match",
            "m_probability": 0.9,
        },
        {
            "sql_condition": "ELSE",
            "label_for_charts": "All other comparisons",
            "m_probability": 0.1,
        },
    ],
}
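
The comparison levels in the manual version are checked from top to bottom, and a pair of records falls into the first level whose condition holds. The sketch below mimics that logic in plain Python for the `email` column. It illustrates the idea and is not how Splink evaluates the SQL conditions internally; the function name is ours.

```python
def email_comparison_level(email_l, email_r):
    """Return the label of the first level whose condition holds, checked top to bottom."""
    # Mirrors "email_l IS NULL OR email_r IS NULL"
    if email_l is None or email_r is None:
        return "Comparison includes null"
    # Mirrors "email_l = email_r"
    if email_l == email_r:
        return "Exact match"
    # Mirrors the ELSE level
    return "All other comparisons"


# Emails taken from the rows shown in Step 2 (None stands for a missing value).
print(email_comparison_level("roberta25@smith.net", "roberta25@smith.net"))
print(email_comparison_level("robert255@smith.net", "roberta25@smith.net"))
print(email_comparison_level("roberta25@smith.net", None))
```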


In [15]:
# Combine the individual comparisons into the main settings dictionary
settings = {
    "proportion_of_matches": 0.01,
    "link_type": "dedupe_only",
    "blocking_rules_to_generate_predictions": [
        "l.surname = r.surname"
    ],
    "comparisons": [
        levenshtein_first_name,
        comparions_exact_dob,
        comparions_exact_email,
        comparions_exact_city,
    ],
    "retain_matching_columns": True,
    "retain_intermediate_calculation_columns": True,
    "additional_columns_to_retain": ["group"],
    "max_iterations": 4,
}
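
The blocking rule `l.surname = r.surname` in the dictionary above limits which pairs of records are ever compared. As a rough illustration in plain Python (Splink itself does this in SQL), the sketch below applies the same rule to the five rows shown in Step 2; the helper name `candidate_pairs` is ours.

```python
from itertools import combinations

# (unique_id, first_name, surname) from the five rows shown in Step 2.
records = [
    (0, "Robert", "Alan"),
    (1, "Robert", "Allen"),
    (2, "Rob", "Allen"),
    (3, "Robert", "Alen"),
    (4, "Grace", None),
]


def candidate_pairs(records):
    """Yield (id_l, id_r) for pairs whose surnames are equal and not null."""
    for (id_l, _, surname_l), (id_r, _, surname_r) in combinations(records, 2):
        # In SQL, NULL = NULL is not true, so null surnames never satisfy the rule.
        if surname_l is not None and surname_l == surname_r:
            yield (id_l, id_r)


pairs = list(candidate_pairs(records))
print(pairs)  # [(1, 2)]
```

Only one of the ten possible pairs survives, which is why blocking makes the job faster. It is also why a typo in the surname (`Alan`, `Alen`) keeps true duplicates from being compared under this rule.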

In words, this settings dictionary says:

* We are performing a deduplication task (the other options are `link_only` and `link_and_dedupe`).
* The blocking rule states that we will only check for duplicates where the surnames are identical.
* When comparing records, we will use information from the `first_name`, `dob`, `city` and `email` columns to compute a match score.
* For `first_name`, string comparisons will have three levels:
    * Level 2: Strings are exactly the same
    * Level 1: Strings are similar (Levenshtein distance of 2 or less)
    * Level 0: No match
* We will make adjustments for term frequencies on the `first_name` column.
* We will retain the `group` column in the results even though it is not used in any comparison. This is a labelled dataset and `group` contains the true match, i.e. where `group` matches, the records pertain to the same person.
* We will consider the algorithm to have converged when no parameter changes by more than 0.01 between iterations, and we will stop after four iterations regardless. (Generally we would allow more than four, but this makes the demo run more quickly.)
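
To make the three `first_name` levels concrete, here is a small plain-Python sketch. The edit distance function is a textbook dynamic-programming implementation, not the one the SQL backend uses, and the function names are ours. It assumes that null values are handled by a separate null level, as in the manual `email` comparison.

```python
def levenshtein_distance(a, b):
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution (or match)
            ))
        previous = current
    return previous[-1]


def first_name_level(name_l, name_r, distance_threshold=2):
    """Map a pair of first names to the level numbering used in the list above."""
    if name_l is None or name_r is None:
        return None  # assumed to be handled by a separate null level
    if name_l == name_r:
        return 2
    if levenshtein_distance(name_l, name_r) <= distance_threshold:
        return 1
    return 0


print(first_name_level("Robert", "Robert"))  # exact match
print(first_name_level("Robert", "Robet"))   # one edit apart
print(first_name_level("Robert", "Rob"))     # three edits apart
```

Note that `Robert` and `Rob` are three edits apart, so with a threshold of 2 this pair lands in level 0 even though a human reader would call the names similar.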

Other variables:

> `retain_matching_columns` - specify whether you want to keep the values of each compared column in the output. In Step 7, you'll notice we request `_l` and `_r` versions of each column, which hold the values from the two records being compared.

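> `retain_intermediate_calculation_columns` - specify whether to keep the intermediate columns computed while scoring each comparison, which is useful for inspecting how a match score was built up.

> `additional_columns_to_retain` - columns that are not used in any comparison but should still appear in the results, such as `group` here.

> `max_iterations` - the maximum number of Expectation Maximisation iterations to run, here four.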
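
The interaction between the convergence threshold and `max_iterations` can be sketched as a simple stopping rule. The function below is a simplified illustration, not Splink's training loop; its name is ours and the sequence of changes is made up for the example.

```python
def run_until_converged(parameter_changes, max_iterations=4, convergence_threshold=0.01):
    """Return (iterations_run, converged) given the largest parameter change per iteration."""
    iterations_run = 0
    for change in parameter_changes:
        if iterations_run >= max_iterations:
            break  # iteration budget used up before convergence
        iterations_run += 1
        if abs(change) <= convergence_threshold:
            return iterations_run, True
    return iterations_run, False


# Hypothetical largest change in any parameter at each iteration.
changes = [0.3, 0.1, 0.05, 0.03, 0.02, 0.005]

print(run_until_converged(changes, max_iterations=4))   # (4, False)
print(run_until_converged(changes, max_iterations=10))  # (6, True)
```

With a cap of four the loop stops before the changes fall below the threshold, which is the trade-off the demo accepts in exchange for speed.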

## Step 5: Select your SQL engine

With `splink3`, you can now choose the SQL backend on which to run your linking job.

As of the current version of splink, we offer three different options for linkers:
> `duckdb` with: `from splink.duckdb.duckdb_linker import DuckDBInMemoryLinker`

> `SQLite` with: `from splink.sqlite.sqlite_linker import SQLiteLinker`

> `spark` with: `from splink.spark.spark_linker import SparkLinker`

While `duckdb` is now the default recommendation, you are free to choose whichever linker best suits your setup.
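
If you want to switch backends without editing import statements, one option is a small lookup like the sketch below. The module paths and class names are the ones listed above; the helper `load_linker_class` and the `LINKER_LOCATIONS` mapping are our own names and are not part of Splink. Whether each import succeeds depends on the Splink version and the optional dependencies you have installed.

```python
import importlib

# Module path and class name for each backend, as listed above.
LINKER_LOCATIONS = {
    "duckdb": ("splink.duckdb.duckdb_linker", "DuckDBInMemoryLinker"),
    "sqlite": ("splink.sqlite.sqlite_linker", "SQLiteLinker"),
    "spark": ("splink.spark.spark_linker", "SparkLinker"),
}


def load_linker_class(backend="duckdb"):
    """Import and return the linker class for the named backend."""
    try:
        module_path, class_name = LINKER_LOCATIONS[backend.lower()]
    except KeyError:
        raise ValueError(
            f"Unknown backend {backend!r}; choose from {sorted(LINKER_LOCATIONS)}"
        ) from None
    # The import happens only here, so unused backends need not be installed.
    module = importlib.import_module(module_path)
    return getattr(module, class_name)
```

Calling `load_linker_class("duckdb")` would then return the same class as the explicit import used in the next step.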

## Step 6: With your selected linker, estimate match scores using the Expectation Maximisation algorithm and predict to get the final output

This is equivalent to running `linker.get_scored_comparisons()` from within `splink2`.

In [17]:
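from splink.duckdb.duckdb_linker import DuckDBInMemoryLinker

linker = DuckDBInMemoryLinker(settings, input_tables={"fake_1000": df})

# Estimate the u probabilities from a random sample of record pairs
linker.train_u_using_random_sampling(target_rows=1e6)

# Estimate the m probabilities with EM, using only pairs that satisfy this blocking rule
blocking_rule = "l.first_name = r.first_name and l.surname = r.surname"
linker.train_m_using_expectation_maximisation(blocking_rule)

# Score the pairs generated by the blocking rules in the settings
df_e = linker.predict()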

Iteration 0: Largest change in params was -0.324 in the m_probability of dob, level `exact_match`
Iteration 1: Largest change in params was 0.119 in proportion_of_matches
Iteration 2: Largest change in params was 0.0566 in proportion_of_matches
Iteration 3: Largest change in params was 0.0315 in proportion_of_matches
EM converged after 3 iterations
Proportion of matches not fully trained, current estimates are [0.04992851502119216]
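
The m and u probabilities estimated here are what turn an agreement or disagreement on a column into evidence. Assuming the usual Fellegi-Sunter formulation, each comparison level contributes a partial match weight of log2(m / u). The sketch below uses the initial m values from the manual `email` comparison in Step 3 together with u values that are invented purely for illustration.

```python
import math


def partial_match_weight(m_probability, u_probability):
    """log2 of the Bayes factor m/u for one comparison level."""
    return math.log2(m_probability / u_probability)


# m values are the initial values from the manual email comparison;
# the u values (0.01 and 0.99) are hypothetical.
exact_match_weight = partial_match_weight(0.9, 0.01)
other_weight = partial_match_weight(0.1, 0.99)

print(f"Exact match:           {exact_match_weight:+.2f}")
print(f"All other comparisons: {other_weight:+.2f}")
```

A level that is much more common among matches than among non-matches gets a large positive weight, and a level that is more common among non-matches gets a negative one.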


## Step 7: Inspect results

In [18]:
cols_to_inspect = ["match_probability", "match_weight", "unique_id_l", "unique_id_r", "group_l", "group_r", "first_name_l", "first_name_r", "dob_l", "dob_r", "city_l", "city_r", "email_l", "email_r"]

df_e[cols_to_inspect].sort_values(["unique_id_l", "unique_id_r"]).head(5)

| | match_probability | match_weight | unique_id_l | unique_id_r | group_l | group_r | first_name_l | first_name_r | dob_l | dob_r | city_l | city_r | email_l | email_r |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.462605 | -0.2162 | 1 | 2 | 0 | 0 | Robert | Rob | 1971-05-24 | 1971-06-24 | | London | roberta25@smith.net | roberta25@smith.net |
| 1 | 0.001091 | -9.838068 | 5 | 150 | 1 | 40 | Grace | Alfie | 1991-04-26 | 2020-09-05 | | Birminmhag | grace.kelly52@jones.com | alfiekelly@walters.com |
| 790 | 0.930197 | 3.736171 | 8 | 9 | 3 | 3 | | Evie | 2015-03-03 | 2015-03-03 | | Pootsmruth | | evihd56@earris-bailey.net |
| 296 | 0.930197 | 3.736171 | 8 | 10 | 3 | 3 | | | 2015-03-03 | 2015-03-03 | | Portsmouth | | evied56@harris-bailey.net |
| 2 | 0.655703 | 0.929386 | 9 | 10 | 3 | 3 | Evie | | 2015-03-03 | 2015-03-03 | Pootsmruth | Portsmouth | evihd56@earris-bailey.net | evied56@harris-bailey.net |
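
The `match_weight` and `match_probability` columns carry the same information on different scales. Treating the match weight as the base-2 logarithm of the odds of a match, the probability is 2^w / (1 + 2^w). The sketch below checks this against the values displayed above; the function name is ours.

```python
def match_weight_to_probability(match_weight):
    """Convert a log2-odds match weight into a match probability."""
    odds = 2.0 ** match_weight
    return odds / (1.0 + odds)


# (match_weight, match_probability) pairs copied from the output above.
displayed = [
    (-0.2162, 0.462605),
    (-9.838068, 0.001091),
    (3.736171, 0.930197),
    (0.929386, 0.655703),
]

for weight, probability in displayed:
    computed = match_weight_to_probability(weight)
    print(f"weight {weight:>10.6f} -> computed {computed:.6f}, displayed {probability:.6f}")
```

A match weight of 0 corresponds to a probability of 0.5, so the first row (records 1 and 2, which share a `group`) sits just below that line.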


===== Waterfall chart when integrated =====