## Quality assurance when you have fully labelled data

In this example, our data contains a fully populated ground-truth column called `cluster`, which lets us analyse the accuracy of the final model.
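To make the role of the `cluster` column concrete, here is a minimal sketch of how a per-record cluster label implies a true/false label for every pair of records. The toy frame and the `is_true_match` column name are illustrative assumptions, not part of the dataset or of Splink's internals.

```python
import pandas as pd

# Hypothetical toy records; `cluster` plays the role of the ground-truth column.
toy = pd.DataFrame({"unique_id": [0, 1, 2, 3], "cluster": [0, 0, 0, 1]})

# Cross join the records with themselves, keeping each unordered pair once.
pairs = toy.merge(toy, how="cross", suffixes=("_l", "_r"))
pairs = pairs[pairs["unique_id_l"] < pairs["unique_id_r"]].copy()

# Two records are a true match when they share a cluster label.
pairs["is_true_match"] = pairs["cluster_l"] == pairs["cluster_r"]

n_pairs = len(pairs)
n_true = int(pairs["is_true_match"].sum())
print(n_pairs, n_true)
```

Comparing these pairwise labels with the model's predictions at a given threshold is what yields counts such as true positives and false negatives.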

In [1]:
import pandas as pd 
import altair as alt
alt.renderers.enable("mimetype")

df = pd.read_csv("../../data/fake_1000.csv")
df.head(2)

Unnamed: 0,unique_id,first_name,surname,dob,city,email,cluster
0,0,Robert,Alan,1971-06-24,,robert255@smith.net,0
1,1,Robert,Allen,1971-05-24,,roberta25@smith.net,0


In [2]:
from splink.duckdb.linker import DuckDBLinker
import splink.duckdb.comparison_template_library as ctl
import splink.duckdb.comparison_library as cl

settings = {
    "link_type": "dedupe_only",
    "blocking_rules_to_generate_predictions": [
        "l.first_name = r.first_name",
        "l.surname = r.surname",
    ],
    "comparisons": [
        ctl.name_comparison("first_name"),
        ctl.name_comparison("surname"),
        ctl.date_comparison("dob", cast_strings_to_date=True),
        cl.exact_match("city", term_frequency_adjustments=True),
        ctl.email_comparison("email", 2),
    ],
    "retain_matching_columns": True,
    "retain_intermediate_calculation_columns": True,
}

In [3]:
linker = DuckDBLinker(df, settings, set_up_basic_logging=False)
deterministic_rules = [
    "l.first_name = r.first_name and levenshtein(r.dob, l.dob) <= 1",
    "l.surname = r.surname and levenshtein(r.dob, l.dob) <= 1",
    "l.first_name = r.first_name and levenshtein(r.surname, l.surname) <= 2",
    "l.email = r.email"
]

linker.estimate_probability_two_random_records_match(deterministic_rules, recall=0.7)
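As a rough illustration of what the `recall` argument does, the sketch below assumes the estimate is formed by scaling the number of pairs found by the deterministic rules up by the stated recall, then dividing by the total number of pairwise comparisons. The helper `estimate_match_probability` and the pair count of 1,400 are hypothetical; they are not Splink's code and not values from this dataset.

```python
def estimate_match_probability(n_rule_pairs, n_records, recall):
    """Hypothetical helper: scale rule-matched pairs by recall, divide by all pairs."""
    # Number of distinct record pairs when deduplicating a single dataset.
    total_pairs = n_records * (n_records - 1) // 2
    # If the rules find only `recall` of all true matches, scale the count up.
    estimated_true_pairs = n_rule_pairs / recall
    return estimated_true_pairs / total_pairs


# Assumed inputs: 1,400 rule-matched pairs (made up), 1,000 records, recall of 0.7.
p = estimate_match_probability(n_rule_pairs=1400, n_records=1000, recall=0.7)
print(p)
```

Under this assumption, a lower `recall` produces a higher estimated probability, because the rules are taken to have missed more of the true matches.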


In [4]:
linker.estimate_u_using_random_sampling(max_pairs=1e6, seed=5)


In [5]:
session_dob = linker.estimate_parameters_using_expectation_maximisation("l.dob = r.dob")
session_email = linker.estimate_parameters_using_expectation_maximisation("l.email = r.email")


Level Jaro_winkler_similarity Username >= 0.88 on comparison email not observed in dataset, unable to train m value


In [6]:
linker.truth_space_table_from_labels_column(
    "cluster", match_weight_round_to_nearest=0.1
).as_pandas_dataframe(limit=5)


You have called predict(), but there are some parameter estimates which have neither been estimated or specified in your settings dictionary.  To produce predictions the following untrained trained parameters will use default values.
Comparison: 'email':
    m values not fully trained


Unnamed: 0,truth_threshold,match_probability,row_count,p,n,tp,tn,fp,fn,P_rate,N_rate,tp_rate,tn_rate,fp_rate,fn_rate,precision,recall,f1
0,-87.500001,4.569566e-27,4353.0,2031.0,2322.0,2031.0,0.0,2322.0,0.0,0.466575,0.533425,1.0,0.0,1.0,0.0,0.466575,1.0,0.636278
1,-87.400001,4.8975400000000005e-27,4353.0,2031.0,2322.0,2031.0,212.0,2110.0,0.0,0.466575,0.533425,1.0,0.091301,0.908699,0.0,0.490461,1.0,0.658134
2,-86.300001,1.0498109999999999e-26,4353.0,2031.0,2322.0,2031.0,380.0,1942.0,0.0,0.466575,0.533425,1.0,0.163652,0.836348,0.0,0.511201,1.0,0.676549
3,-86.200001,1.125159e-26,4353.0,2031.0,2322.0,2031.0,528.0,1794.0,0.0,0.466575,0.533425,1.0,0.22739,0.77261,0.0,0.53098,1.0,0.693648
4,-85.300001,2.099621e-26,4353.0,2031.0,2322.0,2031.0,616.0,1706.0,0.0,0.466575,0.533425,1.0,0.265289,0.734711,0.0,0.543484,1.0,0.70423
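The rate columns in this table follow from the four confusion-matrix counts. As a quick check, the sketch below recomputes a few of them from the counts in row 1 of the table above, using the standard definitions of precision, recall and F1; the variable names are ours.

```python
# Counts copied from row 1 of the truth space table.
tp, tn, fp, fn = 2031.0, 212.0, 2110.0, 0.0

# Precision: share of predicted matches that are true matches.
precision = tp / (tp + fp)
# Recall (tp_rate): share of true matches that were predicted as matches.
recall = tp / (tp + fn)
# F1: harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)
# Rates over the actual non-matches (n = tn + fp).
tn_rate = tn / (tn + fp)
fp_rate = fp / (tn + fp)

print(round(precision, 6), round(recall, 6), round(f1, 6))
```

At these very low thresholds every pair with a true label is predicted as a match, so recall is 1.0 and precision is close to the share of true matches among the labelled pairs.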


In [7]:
linker.roc_chart_from_labels_column("cluster")






In [8]:
linker.precision_recall_chart_from_labels_column("cluster")






In [9]:
# Show a sample of prediction errors (both false positives and false negatives)
linker.prediction_errors_from_labels_column(
    "cluster", include_false_negatives=True, include_false_positives=True
).as_pandas_dataframe(limit=5)




Unnamed: 0,clerical_match_score,found_by_blocking_rules,match_weight,match_probability,unique_id_l,unique_id_r,first_name_l,first_name_r,gamma_first_name,bf_first_name,...,tf_city_r,bf_city,bf_tf_adj_city,email_l,email_r,gamma_email,bf_email,cluster_l,cluster_r,match_key
0,0.0,True,2.116816,0.812641,115,844,Oliver,Oliver,4,85.428002,...,,1.0,1.0,oliver.atkinson@moran-smith.com,oliverwatson97@morgan.com,1,23.407484,31,211,0
1,0.0,True,0.895794,0.650427,461,603,Henry,Henry,4,85.428002,...,0.00738,0.428979,1.0,henry.w@miller-mitheiln.lnfo,henry.c35@love-banks.com,1,23.407484,117,149,0
2,0.0,True,0.895794,0.650427,115,845,Oliver,Oliver,4,85.428002,...,0.00123,0.428979,1.0,oliver.atkinson@moran-smith.com,oliverwatson97@morgan.com,1,23.407484,31,211,0
3,0.0,True,0.895794,0.650427,115,843,Oliver,Oliver,4,85.428002,...,0.00123,0.428979,1.0,oliver.atkinson@moran-smith.com,oliverwatson97@morgan.com,1,23.407484,31,211,0
4,1.0,True,-2.432082,0.15633,248,249,Joshua,Joshua,4,85.428002,...,,1.0,1.0,,j.williams@levine-johnson.com,-1,1.0,64,64,0


In [10]:
records = linker.prediction_errors_from_labels_column(
    "cluster", include_false_negatives=True, include_false_positives=True
).as_record_dict(limit=5)

linker.waterfall_chart(records)




