## Evaluation when you have fully labelled data

In this example, the data contains a fully populated ground-truth column called `cluster`, which lets us analyse the accuracy of the final model.


<a target="_blank" href="https://colab.research.google.com/github/moj-analytical-services/splink/blob/splink4_dev/docs/demos/examples/duckdb/accuracy_analysis_from_labels_column.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>


In [1]:
# Uncomment and run this cell if you're running in Google Colab.
# !pip install git+https://github.com/moj-analytical-services/splink.git@splink4_dev

In [2]:
from splink import splink_datasets

df = splink_datasets.fake_1000
df.head(2)

| | unique_id | first_name | surname | dob | city | email | cluster |
|---|---|---|---|---|---|---|---|
| 0 | 0 | Robert | Alan | 1971-06-24 | | robert255@smith.net | 0 |
| 1 | 1 | Robert | Allen | 1971-05-24 | | roberta25@smith.net | 0 |
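Because `cluster` identifies the entity each record belongs to, any two records that share a `cluster` value form a true match. The following is a minimal sketch of that idea in plain Python, not Splink's implementation. The first two records come from the output above; the third record and the helper name `true_match_pairs` are hypothetical and exist only for illustration.

```python
from itertools import combinations

# The first two rows are taken from df.head(2) above. The third row is a
# hypothetical record, added only so the example contains a non-matching pair.
records = [
    {"unique_id": 0, "cluster": 0},
    {"unique_id": 1, "cluster": 0},
    {"unique_id": 1000, "cluster": 9999},  # hypothetical
]


def true_match_pairs(rows):
    """Return the (id_l, id_r) pairs whose records share a cluster label."""
    return [
        (a["unique_id"], b["unique_id"])
        for a, b in combinations(rows, 2)
        if a["cluster"] == b["cluster"]
    ]


labelled_pairs = true_match_pairs(records)
print(labelled_pairs)
```

Every other pair of records is treated as a true non-match, which is what makes a complete labels column sufficient for computing accuracy statistics.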


In [3]:
from splink import SettingsCreator, Linker, block_on, DuckDBAPI
import splink.comparison_template_library as ctl
import splink.comparison_library as cl

settings = SettingsCreator(
    link_type="dedupe_only",
    blocking_rules_to_generate_predictions=[
        block_on("first_name"),
        block_on("surname"),
    ],
    comparisons=[
        ctl.NameComparison("first_name"),
        ctl.NameComparison("surname"),
        ctl.DateComparison(
            "dob",
            input_is_string=True,
            datetime_metrics=["month", "year", "year"],
            datetime_thresholds=[1, 1, 10],
        ),
        cl.ExactMatch("city").configure(term_frequency_adjustments=True),
        ctl.EmailComparison("email", include_username_fuzzy_level=False),
    ],
    retain_intermediate_calculation_columns=True,
)

In [4]:
db_api = DuckDBAPI()
linker = Linker(df, settings, database_api=db_api)
deterministic_rules = [
    "l.first_name = r.first_name and levenshtein(r.dob, l.dob) <= 1",
    "l.surname = r.surname and levenshtein(r.dob, l.dob) <= 1",
    "l.first_name = r.first_name and levenshtein(r.surname, l.surname) <= 2",
    "l.email = r.email",
]

linker.estimate_probability_two_random_records_match(deterministic_rules, recall=0.7)

Probability two random records match is estimated to be  0.00333.
This means that amongst all possible pairwise record comparisons, one in 300.13 are expected to match.  With 499,500 total possible comparisons, we expect a total of around 1,664.29 matching pairs
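The figures in this message follow from simple counting: deduplicating n records gives n(n-1)/2 candidate pairs, and dividing that by the reported "one in 300.13" rate gives the expected number of matching pairs. The check below assumes 1,000 input records, which is consistent with the 499,500 comparisons in the log, and uses the rounded rate, so the result only approximately reproduces the reported 1,664.29.

```python
from math import comb

# Assumption: 1,000 input records, consistent with the 499,500 pairs in the log.
n_records = 1000
total_comparisons = comb(n_records, 2)

# Rounded value reported in the log: one in 300.13 pairs is expected to match.
one_in = 300.13
expected_matches = total_comparisons / one_in

print(total_comparisons)
print(round(expected_matches, 1))
```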


In [5]:
linker.estimate_u_using_random_sampling(max_pairs=1e6, seed=5)

----- Estimating u probabilities using random sampling -----



Estimated u probabilities using random sampling



Your model is not yet fully trained. Missing estimates for:
    - first_name (no m values are trained).
    - surname (no m values are trained).
    - dob (no m values are trained).
    - city (no m values are trained).
    - email (no m values are trained).


In [6]:
session_dob = linker.estimate_parameters_using_expectation_maximisation(block_on("dob"))
session_email = linker.estimate_parameters_using_expectation_maximisation(
    block_on("email")
)


----- Starting EM training session -----



Estimating the m probabilities of the model by blocking on:
l."dob" = r."dob"

Parameter estimates will be made for the following comparison(s):
    - first_name
    - surname
    - city
    - email

Parameter estimates cannot be made for the following comparison(s) since they are used in the blocking rules: 
    - dob





Iteration 1: Largest change in params was -0.417 in the m_probability of surname, level `Exact match on surname`


Iteration 2: Largest change in params was 0.121 in probability_two_random_records_match


Iteration 3: Largest change in params was 0.0354 in probability_two_random_records_match


Iteration 4: Largest change in params was 0.0127 in probability_two_random_records_match


Iteration 5: Largest change in params was 0.00539 in probability_two_random_records_match


Iteration 6: Largest change in params was 0.0025 in probability_two_random_records_match


Iteration 7: Largest change in params was 0.0012 in probability_two_random_records_match


Iteration 8: Largest change in params was 0.000599 in probability_two_random_records_match


Iteration 9: Largest change in params was 0.000313 in probability_two_random_records_match


Iteration 10: Largest change in params was 0.000186 in probability_two_random_records_match


Iteration 11: Largest change in params was 0.000147 in the m_probability of first_name, level `All other comparisons`


Iteration 12: Largest change in params was 0.000158 in the m_probability of first_name, level `All other comparisons`


Iteration 13: Largest change in params was 0.000184 in the m_probability of first_name, level `All other comparisons`


Iteration 14: Largest change in params was 0.000195 in the m_probability of first_name, level `All other comparisons`


Iteration 15: Largest change in params was 0.000179 in the m_probability of first_name, level `All other comparisons`


Iteration 16: Largest change in params was 0.000144 in the m_probability of first_name, level `All other comparisons`


Iteration 17: Largest change in params was 0.000105 in probability_two_random_records_match


Iteration 18: Largest change in params was 7.27e-05 in probability_two_random_records_match



EM converged after 18 iterations



Your model is not yet fully trained. Missing estimates for:
    - dob (no m values are trained).



----- Starting EM training session -----



Estimating the m probabilities of the model by blocking on:
l."email" = r."email"

Parameter estimates will be made for the following comparison(s):
    - first_name
    - surname
    - dob
    - city

Parameter estimates cannot be made for the following comparison(s) since they are used in the blocking rules: 
    - email





Iteration 1: Largest change in params was -0.466 in the m_probability of dob, level `Exact match on dob`


Iteration 2: Largest change in params was 0.0884 in probability_two_random_records_match


Iteration 3: Largest change in params was 0.0193 in probability_two_random_records_match


Iteration 4: Largest change in params was 0.00688 in probability_two_random_records_match


Iteration 5: Largest change in params was 0.00294 in probability_two_random_records_match


Iteration 6: Largest change in params was 0.00138 in probability_two_random_records_match


Iteration 7: Largest change in params was 0.000681 in probability_two_random_records_match


Iteration 8: Largest change in params was 0.000346 in probability_two_random_records_match


Iteration 9: Largest change in params was 0.000178 in probability_two_random_records_match


Iteration 10: Largest change in params was 9.26e-05 in probability_two_random_records_match



EM converged after 10 iterations



Your model is fully trained. All comparisons have at least one estimate for their m and u values


In [7]:
linker.truth_space_table_from_labels_column(
    "cluster", match_weight_round_to_nearest=0.1
).as_pandas_dataframe(limit=5)

| | truth_threshold | match_probability | row_count | p | n | tp | tn | fp | fn | P_rate | ... | precision | recall | specificity | npv | accuracy | f1 | f2 | f0_5 | p4 | phi |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -24.2 | 5.188884e-08 | 4353.0 | 2031.0 | 2322.0 | 2031.0 | 0.0 | 2322.0 | 0.0 | 0.466575 | ... | 0.466575 | 1.0 | 0.0 | 1.0 | 0.466575 | 0.636278 | 0.813898 | 0.522296 | 0.0 | 0.0 |
| 1 | -21.9 | 2.555306e-07 | 4353.0 | 2031.0 | 2322.0 | 2030.0 | 0.0 | 2322.0 | 1.0 | 0.466575 | ... | 0.466452 | 0.999508 | 0.0 | 0.0 | 0.466345 | 0.636065 | 0.813562 | 0.522146 | 0.0 | -0.016208 |
| 2 | -19.5 | 1.348697e-06 | 4353.0 | 2031.0 | 2322.0 | 2029.0 | 0.0 | 2322.0 | 2.0 | 0.466575 | ... | 0.46633 | 0.999015 | 0.0 | 0.0 | 0.466115 | 0.635851 | 0.813226 | 0.521996 | 0.0 | -0.022924 |
| 3 | -19.3 | 1.549246e-06 | 4353.0 | 2031.0 | 2322.0 | 2028.0 | 0.0 | 2322.0 | 3.0 | 0.466575 | ... | 0.466207 | 0.998523 | 0.0 | 0.0 | 0.465886 | 0.635637 | 0.812891 | 0.521847 | 0.0 | -0.02808 |
| 4 | -19.2 | 1.66044e-06 | 4353.0 | 2031.0 | 2322.0 | 2028.0 | 236.0 | 2086.0 | 3.0 | 0.466575 | ... | 0.492951 | 0.998523 | 0.101637 | 0.987448 | 0.520101 | 0.660049 | 0.828567 | 0.548494 | 0.288148 | 0.219355 |
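Each row of the truth space table is a confusion matrix at one threshold, and the rate columns are derived from the `tp`, `tn`, `fp` and `fn` counts. The sketch below recomputes a few of them for the last row shown (threshold -19.2) using the standard definitions. The function name and the zero-denominator handling are assumptions made for this illustration, and are not necessarily how Splink handles those edge cases.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute common accuracy metrics from confusion-matrix counts."""

    def safe_div(numerator, denominator):
        # Assumption: return 0.0 when the denominator is zero.
        return numerator / denominator if denominator else 0.0

    return {
        "precision": safe_div(tp, tp + fp),
        "recall": safe_div(tp, tp + fn),
        "specificity": safe_div(tn, tn + fp),
        "npv": safe_div(tn, tn + fn),
        "accuracy": safe_div(tp + tn, tp + tn + fp + fn),
        "f1": safe_div(2 * tp, 2 * tp + fp + fn),
    }


# Counts copied from the row with truth_threshold = -19.2 above.
row_4 = classification_metrics(tp=2028, tn=236, fp=2086, fn=3)
for name, value in row_4.items():
    print(f"{name}: {value:.6f}")
```

The printed values can be compared against the `precision`, `recall`, `specificity`, `npv`, `accuracy` and `f1` columns of that row.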


In [8]:
linker.roc_chart_from_labels_column("cluster")
# Can also do linker.precision_recall_chart_from_labels_column("cluster")
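A ROC chart plots the true positive rate against the false positive rate as the match threshold is varied. The sketch below shows that calculation in plain Python on a handful of made-up scored pairs. The data and the function name `roc_points` are hypothetical, and this is not how Splink computes the chart internally.

```python
def roc_points(scored_pairs):
    """Return one (false_positive_rate, true_positive_rate) point per threshold.

    scored_pairs is a list of (score, is_match) tuples. A pair is predicted
    to be a match when its score is at or above the threshold. The input
    must contain at least one true match and one true non-match.
    """
    positives = sum(1 for _, is_match in scored_pairs if is_match)
    negatives = len(scored_pairs) - positives

    points = []
    # Sweep thresholds from strictest to most lenient.
    for threshold in sorted({score for score, _ in scored_pairs}, reverse=True):
        tp = sum(1 for s, m in scored_pairs if s >= threshold and m)
        fp = sum(1 for s, m in scored_pairs if s >= threshold and not m)
        points.append((fp / negatives, tp / positives))
    return points


# Hypothetical match probabilities paired with ground-truth labels.
toy_pairs = [(0.95, True), (0.80, True), (0.60, False), (0.40, True), (0.10, False)]
points = roc_points(toy_pairs)
for fpr, tpr in points:
    print(f"FPR={fpr:.2f}  TPR={tpr:.2f}")
```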

In [9]:
linker.threshold_selection_tool_from_labels_column(
    "cluster", threshold_actual=0.5, add_metrics=["f1"]
)

In [10]:
# Show some prediction errors (both false positives and false negatives)
linker.prediction_errors_from_labels_column(
    "cluster", include_false_negatives=True, include_false_positives=True
).as_pandas_dataframe(limit=5)

| | clerical_match_score | found_by_blocking_rules | match_weight | match_probability | unique_id_l | unique_id_r | first_name_l | first_name_r | gamma_first_name | bf_first_name | ... | tf_city_r | bf_city | bf_tf_adj_city | email_l | email_r | gamma_email | bf_email | cluster_l | cluster_r | match_key |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.0 | False | -24.165914 | 5.31294e-08 | 417 | 418 | Florence | Brown | 0 | 0.213986 | ... | 0.00123 | 0.427845 | 1.0 | fb@reose.cem | f@b@reese.com | 0 | 0.001023 | 108 | 108 | 2 |
| 1 | 1.0 | False | -21.941506 | 2.482839e-07 | 796 | 797 | Taylor | | -1 | 1.0 | ... | 0.00738 | 0.427845 | 1.0 | jt40o@combs.net | jt40@cotbs.nm | 0 | 0.001023 | 201 | 201 | 2 |
| 2 | 1.0 | False | -19.517277 | 1.332642e-06 | 452 | 454 | | Davies | -1 | 1.0 | ... | 0.01599 | 0.427845 | 1.0 | rd@lewis.com | idlewrs.cocm | 0 | 0.001023 | 115 | 115 | 2 |
| 3 | 1.0 | False | -17.978364 | 3.872323e-06 | 717 | 718 | Mia | Jones | 0 | 0.213986 | ... | 0.00615 | 0.427845 | 1.0 | mia.j63@martinez.biz | | -1 | 1.0 | 182 | 182 | 2 |
| 4 | 1.0 | True | -15.51869 | 2.130097e-05 | 594 | 595 | Grace | Grace | 3 | 85.794621 | ... | 0.00123 | 0.427845 | 1.0 | gk@frey-robinson.org | rgk@frey-robinon.org | 0 | 0.001023 | 146 | 146 | 0 |
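The `match_weight` and `match_probability` columns express the same score on two scales: the match weight is the base-2 logarithm of the match odds. The sketch below converts between the two and checks the result against the first row above. The helper names are illustrative and are not part of the Splink API.

```python
import math


def match_weight_to_probability(match_weight):
    """Convert a log2-odds match weight into a match probability."""
    odds = 2.0 ** match_weight
    return odds / (1.0 + odds)


def probability_to_match_weight(probability):
    """Convert a match probability (strictly between 0 and 1) to a match weight."""
    return math.log2(probability / (1.0 - probability))


# match_weight taken from the first row of the table above.
probability = match_weight_to_probability(-24.165914)
print(f"{probability:.6e}")

# A match probability of 0.5 corresponds to a match weight of 0.
print(probability_to_match_weight(0.5))
```

All five rows shown have `cluster_l` equal to `cluster_r` and a very low match probability, so they are false negatives: true matches that the model scored as unlikely.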


In [11]:
records = linker.prediction_errors_from_labels_column(
    "cluster", include_false_negatives=True, include_false_positives=True
).as_record_dict(limit=5)

linker.waterfall_chart(records)