
New link accuracy chart #1478

Merged · samnlindsay merged 22 commits into master from accuracy_chart on Aug 10, 2023

Conversation

@samnlindsay (Contributor) commented Jul 27, 2023

Type of PR

  • BUG
  • FEAT
  • MAINT
  • DOC

Is your Pull Request linked to an existing Issue or Pull Request?

Closes #1467

Give a brief description for the solution you have provided

New chart created using linker.accuracy_chart_from_labels_column("cluster"):

  • Added additional quantities to truth_space_table_from_labels_with_predictions_sqls() (formulas sketched just after this list):
    • specificity (=tn_rate, aka selectivity)
    • npv (negative predictive value)
    • accuracy
    • F2 ($F_2$ - favours recall)
    • F0_5 ($F_{0.5}$ - favours precision)
    • P4 ($P_4$ - harmonic mean of precision, recall, NPV and accuracy, like an extension of $F_1$)
  • New chart plotting precision, recall, F1, P4 and accuracy (with just precision and recall highlighted initially) vs match weight threshold.
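
For reference, the standard confusion-matrix definitions behind these quantities, as a minimal Python sketch; the function name and plain-count inputs are illustrative only, and the real implementation builds the equivalent SQL expressions in truth_space_table_from_labels_with_predictions_sqls():

    def extra_metrics(tp, tn, fp, fn):
        # Standard confusion-matrix definitions; denominators assumed non-zero here.
        precision = tp / (tp + fp)                      # PPV
        recall = tp / (tp + fn)                         # TPR
        specificity = tn / (tn + fp)                    # TNR, aka selectivity
        npv = tn / (tn + fn)                            # negative predictive value
        accuracy = (tp + tn) / (tp + tn + fp + fn)
        f1 = 2 * precision * recall / (precision + recall)
        f2 = 5 * precision * recall / (4 * precision + recall)          # favours recall
        f0_5 = 1.25 * precision * recall / (0.25 * precision + recall)  # favours precision
        p4 = 4 / (1 / precision + 1 / recall + 1 / specificity + 1 / npv)
        return {
            "specificity": specificity, "npv": npv, "accuracy": accuracy,
            "f1": f1, "f2": f2, "f0_5": f0_5, "p4": p4,
        }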

Demo example (duckdb/accuracy_analysis_from_labels_column.ipynb)

[screenshot: demo accuracy chart]

Real example with courts data

[screenshot: accuracy chart for the courts data, 2023-07-27]

PR Checklist

  • Added documentation for changes
  • Added feature to example notebooks and/or tutorial in splink_demos (if appropriate)
  • Added tests (if appropriate)
  • Made changes based off the latest version of Splink
  • Run the linter

github-actions bot commented Jul 27, 2023

Test: test_2_rounds_1k_duckdb

Percentage change: -8.6%

|      | date       | time     | stats_mean | stats_min | commit_info_branch | commit_info_id | machine_info_cpu_brand_raw | machine_info_cpu_hz_actual_friendly | commit_hash |
|------|------------|----------|------------|-----------|--------------------|----------------|----------------------------|-------------------------------------|-------------|
| 849  | 2022-07-12 | 18:40:05 | 1.89098    | 1.87463   | splink3            | c334bb9        | Intel(R) Xeon(R) Platinum 8370C CPU @ 2.80GHz | 2.7934 GHz | c334bb9 |
| 1929 | 2023-08-09 | 18:50:44 | 1.73164    | 1.71324   | (detached head)    | 4bdc459        | Intel(R) Xeon(R) Platinum 8171M CPU @ 2.60GHz | 2.0952 GHz | 4bdc459 |

Test: test_2_rounds_1k_sqlite

Percentage change: 2.2%

|      | date       | time     | stats_mean | stats_min | commit_info_branch | commit_info_id | machine_info_cpu_brand_raw | machine_info_cpu_hz_actual_friendly | commit_hash |
|------|------------|----------|------------|-----------|--------------------|----------------|----------------------------|-------------------------------------|-------------|
| 851  | 2022-07-12 | 18:40:05 | 4.32179    | 4.25898   | splink3            | c334bb9        | Intel(R) Xeon(R) Platinum 8370C CPU @ 2.80GHz | 2.7934 GHz | c334bb9 |
| 1931 | 2023-08-09 | 18:50:44 | 4.47221    | 4.35404   | (detached head)    | 4bdc459        | Intel(R) Xeon(R) Platinum 8171M CPU @ 2.60GHz | 2.0952 GHz | 4bdc459 |


@samnlindsay (Contributor, Author) commented Jul 27, 2023

To do:

  • Fix error when running locally (no problem in Vega Editor)
  • Allow user to specify metrics up front (e.g. linker.accuracy_chart_from_labels_column("cluster", metrics=["F1", "F2"]))
  • Decide on appropriate text/naming (I just threw some words down)
  • Add documentation (incl. explainers for the different metrics)
  • Add to demo notebooks etc.

@samnlindsay changed the title from "[WIP] New chart" to "[WIP] New link accuracy chart" on Jul 27, 2023
@RobinL (Member) commented Jul 27, 2023

That is one beautiful chart! Just in case this is of any relevance to the error you're seeing, here is an error I was getting with a valid spec (i.e. working in the vega-lite editor, not working with altair saving to png).

@RossKen (Contributor) commented Jul 27, 2023

This is really great - I love it! 😍

@samnlindsay (Contributor, Author) commented Jul 27, 2023

The error is

Error rendering output item using 'jupyter-vega-renderer'
Duplicate signal name: "metric_point_tuple"

referring to

{
  "mark": {"type": "point"},
  "params": [
    {
      "name": "metric_point",
      "select": {"type": "point", "fields": ["score"]},
      "value": [{"score": "precision"}, {"score": "recall"}],
      "bind": "legend"
    },
    {
      "name": "point_selection",
      "select": {
        "type": "point",
        "fields": ["truth_threshold"],
        "encodings": ["x"],
        "nearest": true,
        "on": "mouseover"
      }
    }
  ],
  "transform": [
    {"filter": {"param": "metric_point", "empty": true}}
  ],
  "encoding": {
    "opacity": {
      "condition": {"param": "point_selection", "value": 1, "empty": false},
      "value": 0
    },
    "x": {"field": "truth_threshold", "type": "quantitative"},
    "y": {"field": "y", "type": "quantitative"},
    "color": {
      "field": "score",
      "type": "nominal",
      "sort": ["precision", "recall", "F1", "P4", "accuracy"]
    }
  }
},

where the "metric_point" param is defined and used (in the transform -> filter) exclusively within the point layer of the chart 🤷

@RobinL (Member) commented Jul 27, 2023

Does it save out and load correctly if, rather than rendering in jlab, you do a chart.save("chart.png") or chart.save("chart.html")?

@aalexandersson (Contributor) commented Jul 29, 2023

Is it feasible to add Matthews' correlation coefficient (MCC)? I recommend MCC as the standard binary classification metric because when MCC is high, each of the four basic rates of a confusion matrix is high, without exception.

Background context: My feature request for MCC in fastLink
New reference: Chicco, D. and Jurman, G. 2023. The Matthews correlation coefficient (MCC) should replace the ROC AUC as the standard metric for assessing binary classification. BioData Mining 16 (4), 1-23. URL: https://doi.org/10.1186/s13040-023-00322-4

Update: In case it could help, MCC is implemented in Python in the function matthews_corrcoef() in the library sklearn: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.matthews_corrcoef.html
Here is a blog: https://www.statology.org/matthews-correlation-coefficient-python/
Also, I too really, really like this chart!
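
For reference, a minimal usage sketch of the sklearn function mentioned above; the label and prediction arrays are toy values purely for illustration:

    from sklearn.metrics import matthews_corrcoef

    # MCC ranges from -1 (total disagreement) to +1 (perfect prediction); 0 is chance level
    y_true = [1, 1, 0, 0, 1, 0, 1, 0]
    y_pred = [1, 0, 0, 0, 1, 0, 1, 1]
    print(matthews_corrcoef(y_true, y_pred))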

@samnlindsay changed the title from "[WIP] New link accuracy chart" to "New link accuracy chart" on Jul 31, 2023
@samnlindsay (Contributor, Author) commented:

@aalexandersson I have added MCC ("phi") as an option (see docstring):

[screenshot: docstring for the new add_metrics option, 2023-07-31]

Example

linker.accuracy_chart_from_labels_column("ground_truth", add_metrics=["phi", "f1", "accuracy", "specificity"])

produces the following:
[screenshot: accuracy chart with phi, F1, accuracy and specificity added, 2023-07-31]

@aalexandersson (Contributor) commented:

This is great, thanks a lot for adding phi (MCC)!

Suggestion for very minor labeling improvements in the graph, to be consistent:

  1. Add "(MCC)" as in "phi (MCC) = 0.856". -- Alternatively, remove "(TPR)" in "Recall (TPR) = 0.922"
  2. Metric: "Precision (PPV)" instead of "Precision".

@samnlindsay (Contributor, Author) commented:

Thank you for spotting the copy-paste error. That's nudged me to tidy up the labels for consistency and ease of editing in future.

mapping = {
    'precision': 'Precision (PPV)',
    'recall': 'Recall (TPR)',
    'specificity': 'Specificity (TNR)',
    'accuracy': 'Accuracy',
    'npv': 'NPV',
    'f1': 'F1',
    'f2': 'F2',
    'f0_5': 'F0.5',
    'p4': 'P4',
    'phi': '\u03C6 (MCC)',
}

@aalexandersson (Contributor) commented Jul 31, 2023

Thank you, the new chart looks great to me. I wonder, though, how best to use this new chart in a splink workflow -- with and without a clerical review?

In the example, where does "ground_truth" come from? How do I create the ground truth?

linker.accuracy_chart_from_labels_column("ground_truth", add_metrics=["phi", "f1", "accuracy", "specificity"])

The new linker function seems similar to truth_space_table_from_labels_column() but it begs the same question.

Update: The documentation in splink/linker.py is useful but it would be nice to have a worked example of "ground truth" somewhere since accuracy (linkage quality) metrics are critical.

Update 2: Where will the new chart fit in Robin's demo workflow? Is that demo example with variable truth_status an example of "ground truth" or something else?

@RobinL RobinL closed this Jul 31, 2023
@RossKen (Contributor) commented Jul 31, 2023

Is there a reason why this was closed? 🤔

@RobinL RobinL reopened this Aug 1, 2023
@RobinL (Member) commented Aug 1, 2023

Sorry! Must have clicked something by accident!

@RobinL (Member) commented Aug 1, 2023

@aalexandersson yes, to use these charts you need labelled data from clerical review. You can't produce these charts without clerical review or some other source of labels

Specifically, you need a dataset that has pairwise unique IDs and the clerical score like:

| source_dataset_l | unique_id_l | source_dataset_r | unique_id_r | clerical_match_score |
|------------------|-------------|------------------|-------------|----------------------|
| df_1             | 1           | df_2             | 2           | 0.99                 |
| df_1             | 1           | df_2             | 3           | 0.2                  |

The purpose of the new beta clerical labelling tool is to more easily produce data in this format.

Occasionally, you may have fully labelled data. This is most commonly the case if you've generated synthetic data to benchmark/test matching algorithms.

In this case, the input dataset contains 'the answer', i.e. the cluster to which each record belongs is recorded in the source dataset. Here's an example where the cluster is in a column called 'group':
https://github.com/moj-analytical-services/splink/blob/master/tests/datasets/fake_1000_from_splink_demos.csv

In that case, it would be silly/inefficient to need to convert this into a large list of pairwise labels, so we allow the accuracy charts to run off this column (referred to as a labels column), e.g. via the roc_chart_from_labels_column function.

In the demo workflow, the labelled data is produced between the steps
'Write a labelling dashboard for a random sample of input records'
and
'Read in the labels and perform various types of accuracy analysis',

i.e. it is assumed that the user has used the labelling dashboard to produce the pairwise clerical labels, and then read them back in.
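
To make the two routes concrete, a minimal sketch; the pandas DataFrame construction and the trained linker object are assumptions for illustration, and only the column names and the *_from_labels_column functions come from this thread:

    import pandas as pd

    # Route 1: pairwise clerical labels, e.g. produced via the labelling dashboard
    pairwise_labels = pd.DataFrame(
        {
            "source_dataset_l": ["df_1", "df_1"],
            "unique_id_l": [1, 1],
            "source_dataset_r": ["df_2", "df_2"],
            "unique_id_r": [2, 3],
            "clerical_match_score": [0.99, 0.2],
        }
    )

    # Route 2: a fully labelled input dataset with a ground-truth column
    # (e.g. the "group" column in fake_1000_from_splink_demos.csv), which the
    # labels-column functions can use directly:
    # linker.accuracy_chart_from_labels_column("group")
    # linker.roc_chart_from_labels_column("group")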

[inline review thread on splink/accuracy.py (outdated, resolved)]
@ThomasHepworth (Contributor) left a comment:

This is fantastic! Just one small issue and I'm happy to approve - #1478 (comment)

@ThomasHepworth (Contributor) commented:

The following clauses are failing in Postgres at present:

cast(TN as float)/(TN+FN) as npv,

(2 * precision * recall)/(precision + recall) as f1

(5 * precision * recall)/(4 * precision + recall) as f2

(1.25 * precision * recall)/(0.25 * precision + recall) as f0_5

4/((1./precision)+(1./recall)+(1./specificity)+(1./npv)) as p4

cast((TP*TN)-(FP*FN) as float)/sqrt((TP+FP)*P*N*(TN+FN)) as phi

I haven't looked into why yet.

@samnlindsay (Contributor, Author) commented:

If we want some further documentation around what these charts show and how to interpret them, this document may be useful for pinching content for a topic guide or some other documentation.

@ThomasHepworth (Contributor) commented:

Have you had a chance to look at why this is failing in postgres?

Inline review comment on the line:

# Silently filter out invalid entries (except case errors - e.g. ["NPV", "F1"])
Contributor:

Do we want to throw a warning to the user that an entry is invalid? Happy to go with the current version if you're content.

Contributor Author:

Yeah, happy for it to quietly ignore stupid entries

@ThomasHepworth (Contributor) commented:

Can you add the new charts as integration tests to the full_example scripts?

@samnlindsay (Contributor, Author) commented:

The only test that was failing was SQLite for some reason (see 5463e41 for fix)

All (strictly necessary) changes made now 🤞

@samnlindsay (Contributor, Author) commented:

CI test (SQLite) failing with the following error: sqlite3.OperationalError: no such function: sqrt

Erm...

@RobinL (Member) commented Aug 9, 2023

CI test (SQLite) failing with the following error: sqlite3.OperationalError: no such function: sqrt

Erm...

SQLite can be annoying: "The math functions shown below are part of the SQLite amalgamation source file but are only active if the amalgamation is compiled using the -DSQLITE_ENABLE_MATH_FUNCTIONS compile-time option."

I would just register sqrt in Python as a SQLite UDF to make the tests pass.
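
For reference, a minimal sketch of registering sqrt as a SQLite user-defined function with the standard library sqlite3 module; the in-memory connection is illustrative and is not how Splink's SQLite linker wires up its connection:

    import math
    import sqlite3

    con = sqlite3.connect(":memory:")

    # Make sqrt available to SQL even when SQLite was built without
    # -DSQLITE_ENABLE_MATH_FUNCTIONS
    con.create_function("sqrt", 1, math.sqrt)

    print(con.execute("select sqrt(16.0)").fetchone())  # (4.0,)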

@samnlindsay (Contributor, Author) commented:

Thanks Robin, already done that in c21ceb3

Postgres tests now passing after covering off some division by zero issues not picked up by the local tests 🤔

@ThomasHepworth (Contributor) left a comment:

Thanks Sam, this is great!

@samnlindsay merged commit d64dcd5 into master on Aug 10, 2023
11 checks passed
@samnlindsay deleted the accuracy_chart branch on August 10, 2023 at 10:55
Development

Successfully merging this pull request may close these issues.

[FEAT] Additional QA chart to help select optimal match weight threshold
5 participants