## Quality assurance when you have fully labelled data

In this example, our data contains a fully populated ground-truth column called `cluster`, which lets us analyse the accuracy of the final model.


In [1]:
from splink.datasets import splink_datasets
import altair as alt

df = splink_datasets.fake_1000

df.head(2)

| | unique_id | first_name | surname | dob | city | email | cluster |
|---|---|---|---|---|---|---|---|
| 0 | 0 | Robert | Alan | 1971-06-24 | | robert255@smith.net | 0 |
| 1 | 1 | Robert | Allen | 1971-05-24 | | roberta25@smith.net | 0 |
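To make concrete how a labels column such as `cluster` acts as ground truth, the sketch below counts the true matching pairs implied by a list of cluster labels: two records are treated as a true match exactly when they share a label. The function name and the toy labels are illustrative assumptions and are not part of Splink.

```python
from collections import Counter


def count_true_match_pairs(cluster_labels):
    """Count record pairs that share a cluster label (hypothetical helper)."""
    cluster_sizes = Counter(cluster_labels).values()
    # A cluster of k records contains k * (k - 1) / 2 distinct pairs.
    return sum(k * (k - 1) // 2 for k in cluster_sizes)


# Toy labels: one cluster of three, one of two, and one singleton.
toy_labels = [0, 0, 0, 1, 1, 2]
print(count_true_match_pairs(toy_labels))  # 3 + 1 + 0 = 4
```

The two rows shown above both carry `cluster` 0, so under this definition they form one true matching pair.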


In [2]:
from splink.linker import Linker


from splink.blocking_rule_library import block_on
import splink.comparison_template_library as ctl
import splink.comparison_library as cl

settings = {
    "link_type": "dedupe_only",
    "blocking_rules_to_generate_predictions": [
        block_on("first_name"),
        block_on("surname"),
    ],
    "comparisons": [
        ctl.NameComparison("first_name"),
        ctl.NameComparison("surname"),
        cl.LevenshteinAtThresholds("dob"),
        cl.ExactMatch("city").configure(term_frequency_adjustments=True),
        ctl.EmailComparison("email", include_username_fuzzy_level=False),
    ],
    "retain_matching_columns": True,
    "retain_intermediate_calculation_columns": True,
}

In [3]:
from splink.database_api import DuckDBAPI

db_api = DuckDBAPI()
linker = Linker(df, settings, database_api=db_api)
deterministic_rules = [
    "l.first_name = r.first_name and levenshtein(r.dob, l.dob) <= 1",
    "l.surname = r.surname and levenshtein(r.dob, l.dob) <= 1",
    "l.first_name = r.first_name and levenshtein(r.surname, l.surname) <= 2",
    "l.email = r.email",
]

linker.estimate_probability_two_random_records_match(deterministic_rules, recall=0.7)

Probability two random records match is estimated to be  0.00333.
This means that amongst all possible pairwise record comparisons, one in 300.13 are expected to match.  With 499,500 total possible comparisons, we expect a total of around 1,664.29 matching pairs
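The figures in this message can be reproduced with some simple arithmetic. The sketch below assumes the estimate is formed by dividing the number of pairs found by the deterministic rules by the assumed recall, then dividing by the total number of pairwise comparisons. The helper names are hypothetical, and the count of 1,165 deterministic matches is not printed anywhere above: it is back-derived from the output (1,664.29 × 0.7) and should be read as an assumption.

```python
def n_pairwise_comparisons(n_records):
    """Number of distinct record pairs in a dedupe-only job."""
    return n_records * (n_records - 1) // 2


def estimate_match_probability(n_deterministic_matches, recall, n_records):
    """Hypothetical sketch of the recall-adjusted estimate."""
    # Scale up the deterministic matches to account for the ones the rules miss.
    expected_matches = n_deterministic_matches / recall
    probability = expected_matches / n_pairwise_comparisons(n_records)
    return probability, expected_matches


total_pairs = n_pairwise_comparisons(1000)
print(total_pairs)  # 499500

# 1165 is an assumed count, inferred from the printed output rather than observed.
probability, expected_matches = estimate_match_probability(1165, 0.7, 1000)
print(f"{probability:.5f}, one in {1 / probability:.2f}, {expected_matches:.2f} pairs")
```

A lower assumed `recall` inflates the estimate, so it is worth checking how sensitive the final predictions are to that choice.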


In [4]:
linker.estimate_u_using_random_sampling(max_pairs=1e6, seed=5)

----- Estimating u probabilities using random sampling -----

Estimated u probabilities using random sampling

Your model is not yet fully trained. Missing estimates for:
    - first_name (no m values are trained).
    - surname (no m values are trained).
    - dob (no m values are trained).
    - city (no m values are trained).
    - email (no m values are trained).


In [5]:
session_dob = linker.estimate_parameters_using_expectation_maximisation(block_on("dob"))
session_email = linker.estimate_parameters_using_expectation_maximisation(
    block_on("email")
)


----- Starting EM training session -----



Estimating the m probabilities of the model by blocking on:
l."dob" = r."dob"

Parameter estimates will be made for the following comparison(s):
    - first_name
    - surname
    - city
    - email

Parameter estimates cannot be made for the following comparison(s) since they are used in the blocking rules: 
    - dob

Iteration 1: Largest change in params was -0.417 in the m_probability of surname, level `Exact match on surname`
Iteration 2: Largest change in params was 0.121 in probability_two_random_records_match
Iteration 3: Largest change in params was 0.0354 in probability_two_random_records_match
Iteration 4: Largest change in params was 0.0127 in probability_two_random_records_match
Iteration 5: Largest change in params was 0.00539 in probability_two_random_records_match
Iteration 6: Largest change in params was 0.0025 in probability_two_random_records_match
Iteration 7: Largest change in params was 0.0012 in probability_two_random_records_match
Iteration 8: Largest change in 

In [6]:
linker.truth_space_table_from_labels_column(
    "cluster", match_weight_round_to_nearest=0.1
).as_pandas_dataframe(limit=5)

| | truth_threshold | match_probability | row_count | p | n | tp | tn | fp | fn | P_rate | ... | precision | recall | specificity | npv | accuracy | f1 | f2 | f0_5 | p4 | phi |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -25.5 | 2.107342e-08 | 4353.0 | 2031.0 | 2322.0 | 2031.0 | 0.0 | 2322.0 | 0.0 | 0.466575 | ... | 0.466575 | 1.0 | 0.0 | 1.0 | 0.466575 | 0.636278 | 0.813898 | 0.522296 | 0.0 | 0.0 |
| 1 | -17.9 | 4.088474e-06 | 4353.0 | 2031.0 | 2322.0 | 2030.0 | 0.0 | 2322.0 | 1.0 | 0.466575 | ... | 0.466452 | 0.999508 | 0.0 | 0.0 | 0.466345 | 0.636065 | 0.813562 | 0.522146 | 0.0 | -0.016208 |
| 2 | -16.8 | 8.763795e-06 | 4353.0 | 2031.0 | 2322.0 | 2028.0 | 0.0 | 2322.0 | 3.0 | 0.466575 | ... | 0.466207 | 0.998523 | 0.0 | 0.0 | 0.465886 | 0.635637 | 0.812891 | 0.521847 | 0.0 | -0.02808 |
| 3 | -15.6 | 2.013368e-05 | 4353.0 | 2031.0 | 2322.0 | 2027.0 | 605.0 | 1717.0 | 4.0 | 0.466575 | ... | 0.5414 | 0.998031 | 0.260551 | 0.993432 | 0.60464 | 0.701991 | 0.853977 | 0.595931 | 0.519908 | 0.371884 |
| 4 | -15.5 | 2.157872e-05 | 4353.0 | 2031.0 | 2322.0 | 2027.0 | 969.0 | 1353.0 | 4.0 | 0.466575 | ... | 0.599704 | 0.998031 | 0.417313 | 0.995889 | 0.688261 | 0.749215 | 0.880998 | 0.651727 | 0.658992 | 0.497369 |
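Each row of this table is a confusion matrix at one threshold, and the rate columns follow from the `tp`, `tn`, `fp` and `fn` counts. As a minimal sketch using the standard definitions (the function name is an assumption, not a Splink API), the row at threshold -15.5 can be recomputed like this:

```python
def confusion_metrics(tp, tn, fp, fn):
    """Standard classification metrics from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else 0.0
    return {
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        "accuracy": accuracy,
        "f1": f1,
    }


# Counts copied from the row with truth_threshold = -15.5.
metrics = confusion_metrics(tp=2027, tn=969, fp=1353, fn=4)
for name, value in metrics.items():
    print(f"{name}: {value:.6f}")
```

At such a low threshold almost every labelled match is recovered (high recall), but many non-matches are accepted too, which is why precision and specificity are poor.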


In [7]:
linker.roc_chart_from_labels_column("cluster")
# Can also do linker.precision_recall_chart_from_labels_column("cluster")

In [8]:
linker.confusion_matrix_from_labels_column("cluster")

In [9]:
# Show some prediction errors (false positives and false negatives)
linker.prediction_errors_from_labels_column(
    "cluster", include_false_negatives=True, include_false_positives=True
).as_pandas_dataframe(limit=5)

| | clerical_match_score | found_by_blocking_rules | match_weight | match_probability | unique_id_l | unique_id_r | first_name_l | first_name_r | gamma_first_name | bf_first_name | ... | tf_city_r | bf_city | bf_tf_adj_city | email_l | email_r | gamma_email | bf_email | cluster_l | cluster_r | match_key |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.0 | False | -25.468865 | 2.153316e-08 | 417 | 418 | Florence | Brown | 0 | 0.213239 | ... | 0.00123 | 0.428426 | 1.0 | fb@reose.cem | f@b@reese.com | 0 | 0.001023 | 108 | 108 | 2 |
| 1 | 1.0 | False | -17.887988 | 4.122658e-06 | 796 | 797 | Taylor | | -1 | 1.0 | ... | 0.00738 | 0.428426 | 1.0 | jt40o@combs.net | jt40@cotbs.nm | 0 | 0.001023 | 201 | 201 | 2 |
| 2 | 1.0 | False | -17.887988 | 4.122658e-06 | 452 | 454 | | Davies | -1 | 1.0 | ... | 0.01599 | 0.428426 | 1.0 | rd@lewis.com | idlewrs.cocm | 0 | 0.001023 | 115 | 115 | 2 |
| 3 | 1.0 | True | -16.815434 | 8.670542e-06 | 594 | 595 | Grace | Grace | 3 | 85.863612 | ... | 0.00123 | 0.428426 | 1.0 | gk@frey-robinson.org | rgk@frey-robinon.org | 0 | 0.001023 | 146 | 146 | 0 |
| 4 | 1.0 | False | -15.53633 | 2.104213e-05 | 150 | 151 | Alfie | Kelly | 0 | 0.213239 | ... | 0.0492 | 0.428426 | 1.0 | alfiekelly@walters.com | | -1 | 1.0 | 40 | 40 | 2 |
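The `match_weight` and `match_probability` columns carry the same information on different scales: the match weight is the base-2 logarithm of the odds of a match. The helper names below are illustrative assumptions, but the conversion itself can be checked against the first row of the table.

```python
import math


def match_weight_to_probability(match_weight):
    """Convert a log2-odds match weight into a probability."""
    odds = 2.0 ** match_weight
    return odds / (1.0 + odds)


def probability_to_match_weight(probability):
    """Inverse conversion: probability to log2 odds."""
    return math.log2(probability / (1.0 - probability))


# Match weight taken from the first row of the table above.
print(match_weight_to_probability(-25.468865))
# A match weight of 0 corresponds to even odds.
print(match_weight_to_probability(0.0))  # 0.5
```

All five rows are labelled matches (`cluster_l` equals `cluster_r`) that the model scored with very low probabilities, so they are false negatives at any reasonable threshold.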


In [10]:
records = linker.prediction_errors_from_labels_column(
    "cluster", include_false_negatives=True, include_false_positives=True
).as_record_dict(limit=5)

linker.waterfall_chart(records)
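A waterfall chart breaks a record pair's final match weight into additive steps: a starting weight from the prior, then one contribution per comparison. As a rough sketch of that arithmetic, assuming the comparisons are combined as independent Bayes factors (the function and its inputs are hypothetical, not Splink's implementation):

```python
import math


def final_match_weight(prior_probability, bayes_factors):
    """Sum the prior's log2 odds and the log2 of each Bayes factor."""
    prior_odds = prior_probability / (1.0 - prior_probability)
    weight = math.log2(prior_odds)
    for bayes_factor in bayes_factors:
        # Factors above 1 push the weight up; factors below 1 pull it down.
        weight += math.log2(bayes_factor)
    return weight


# Illustrative values: a prior of 0.5 contributes 0, then +1 and +2.
print(final_match_weight(0.5, [2.0, 4.0]))  # 3.0
```

Reading the chart for a false negative therefore shows which comparisons contributed the large negative steps that dragged the pair below the threshold.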