
convert accuracy metrics to float #1893

Merged
ThomasHepworth merged 4 commits into master from cast_accuracy_measures_to_float on Feb 5, 2024

Conversation

ThomasHepworth
Contributor

@ThomasHepworth ThomasHepworth commented Jan 26, 2024

Type of PR

  • BUG
  • FEAT
  • MAINT
  • DOC

Give a brief description for the solution you have provided

Our production QA pipelines have been encountering the following error: TypeError: Object of type Decimal is not JSON serializable when attempting to serialise the Altair chart spec for any charts containing the following SQL.
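The failure is easy to reproduce outside of Splink/Altair; a minimal sketch of what the chart-spec serialisation hits:

import json
from decimal import Decimal

# The stdlib JSON encoder has no handler for Decimal values,
# so any Decimal in the chart data blows up serialisation:
json.dumps({"f1": Decimal("0.5")})
# TypeError: Object of type Decimal is not JSON serializable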

I've narrowed the error down to the following lines:

2.0*TP/(2*TP + FN + FP) as f1,
5.0*TP/(5*TP + 4*FN + FP) as f2,
1.25*TP/(1.25*TP + 0.25*FN + FP) as f0_5,
4.0*TP*TN/((4.0*TP*TN) + ((TP + TN)*(FP + FN))) as p4,
case when TN+FN=0 or TP+FP=0 or P=0 or N=0 then 0
    else cast((TP*TN)-(FP*FN) as float)/sqrt((TP+FP)*P*N*(TN+FN)) end as phi

which, when run in our Glue jobs (Spark 3.1.1), are cast to Python's Decimal type (from decimal import Decimal).
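For reference, the constants in those expressions are just the standard $F_\beta$ weights and the $\phi$ (Matthews) coefficient:

$$F_\beta = \frac{(1+\beta^2)\,TP}{(1+\beta^2)\,TP + \beta^2\,FN + FP} \qquad \phi = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP+FP)\,P\,N\,(TN+FN)}}$$

with $\beta = 1, 2, 0.5$ giving f1, f2 and f0_5, and $P = TP + FN$, $N = TN + FP$.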

To resolve this, I've added some explicit float casts, which seem to ensure that Spark doesn't convert to Decimals. This branch has been used to confirm that this resolves the issue.

I haven't looked into why this is happening, so there are no tests in this PR. However, if you'd like a more detailed explanation of what's going on, I'm happy to try to put together a reproducible example.

@ThomasHepworth
Contributor Author

I wonder if this bug is also related? #1822 (comment)

@RobinL RobinL self-assigned this Jan 30, 2024
@RobinL
Member

RobinL commented Jan 30, 2024

Yeah, so fundamentally this is because in Spark a literal number is a Decimal, not a Double. See also here

So whenever we have a numeric literal, we have to wrap it in a cast:

THEN cast({bayes_factor} as float8)

Then we have:

class CustomSpark(Spark):

which translates e.g. cast(0.12345 AS float8) into 0.12345D, which is of double type.

We need to do that because the float8 type doesn't exist in Spark:

spark.sql('select 0.1 as a').toPandas().iloc[0,0]
-> Decimal('0.1')

spark.sql('select 0.12345678D as a').toPandas().iloc[0,0]
-> 0.12345678 (i.e. a float)

spark.sql('select cast(0.12345678 as float8) as a').toPandas().iloc[0,0]
-> [UNSUPPORTED_DATATYPE] Unsupported data type "FLOAT8".(line 1, pos 26)
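To make the dialect trick concrete, here is a rough sketch of how a generator override can do the rewrite. This is not Splink's actual CustomSpark; the override point (cast_sql) and the class layout are assumptions based on sqlglot's public API:

import sqlglot
from sqlglot import exp
from sqlglot.dialects.spark import Spark

# Hypothetical sketch: when a numeric literal is cast to double
# (sqlglot parses float8 as DOUBLE), emit Spark's 0.12345D literal
# syntax instead of a CAST that Spark can't handle.
class CustomSpark(Spark):
    class Generator(Spark.Generator):
        def cast_sql(self, expression, safe_prefix=None):
            if (
                expression.to.is_type("double")
                and isinstance(expression.this, exp.Literal)
                and not expression.this.is_string
            ):
                return f"{self.sql(expression, 'this')}D"
            return super().cast_sql(expression, safe_prefix=safe_prefix)

print(sqlglot.transpile("select cast(0.12345 as float8)", write=CustomSpark)[0])
# -> SELECT 0.12345D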

@RobinL
Member

RobinL commented Jan 30, 2024

You can also see here (gamma_name) that you can't use cast(0.23489134 as double), as it truncates; it has to be the 0.23498234D syntax.

@RobinL
Member

RobinL commented Jan 30, 2024

spark.sql('select cast(0.349807234523750923475089457470478952340895740 as double) as a').toPandas().iloc[0,0]
-> ArithmeticException: [DECIMAL_PRECISION_EXCEEDS_MAX_PRECISION] Decimal precision 45 exceeds max precision 38.

spark.sql('select 0.349807234523750923475089457470478952340895740D as a').toPandas().iloc[0,0]
-> 0.3498072345237509

Member

@RobinL RobinL left a comment


Minor change in a comment on the code, but yeah, I think this is the right way to fix it.

cast(2.0*TP/(2*TP + FN + FP) as float) as f1,
5.0*TP/(5*TP + 4*FN + FP) as f2,
1.25*TP/(1.25*TP + 0.25*FN + FP) as f0_5,
4.0*TP*TN/((4.0*TP*TN) + ((TP + TN)*(FP + FN))) as p4,
Member


I think all of these should be float8 for consistency (see comments on PR)

@RobinL
Member

RobinL commented Jan 30, 2024

see also #694

@cinnq346
Contributor

cinnq346 commented Feb 5, 2024

I've just seen this exact problem as well. We're using the Postgres backend. I had to convert all the numeric columns to double precision to solve this. In other words, I think you need to cast row_count, p, n, tp, tn, fp, fn and p_rate in addition to the ones you've already done.

For reference, this is how the table is currently created:

                                   Table "splink.__splink__truth_space_table_337726f4d"
      Column       |       Type       | Collation | Nullable | Default | Storage | Compression | Stats target | Description
-------------------+------------------+-----------+----------+---------+---------+-------------+--------------+-------------
 truth_threshold   | double precision |           |          |         | plain   |             |              |
 match_probability | double precision |           |          |         | plain   |             |              |
 row_count         | numeric          |           |          |         | main    |             |              |
 p                 | numeric          |           |          |         | main    |             |              |
 n                 | numeric          |           |          |         | main    |             |              |
 tp                | numeric          |           |          |         | main    |             |              |
 tn                | numeric          |           |          |         | main    |             |              |
 fp                | numeric          |           |          |         | main    |             |              |
 fn                | numeric          |           |          |         | main    |             |              |
 p_rate            | numeric          |           |          |         | main    |             |              |
 n_rate            | double precision |           |          |         | plain   |             |              |
 tp_rate           | double precision |           |          |         | plain   |             |              |
 tn_rate           | double precision |           |          |         | plain   |             |              |
 fp_rate           | double precision |           |          |         | plain   |             |              |
 fn_rate           | double precision |           |          |         | plain   |             |              |
 precision         | double precision |           |          |         | plain   |             |              |
 recall            | double precision |           |          |         | plain   |             |              |
 specificity       | double precision |           |          |         | plain   |             |              |
 npv               | double precision |           |          |         | plain   |             |              |
 accuracy          | double precision |           |          |         | plain   |             |              |
 f1                | numeric          |           |          |         | main    |             |              |
 f2                | numeric          |           |          |         | main    |             |              |
 f0_5              | numeric          |           |          |         | main    |             |              |
 p4                | numeric          |           |          |         | main    |             |              |
 phi               | double precision |           |          |         | plain   |             |              |
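This is the same root cause as on Spark: the driver returns Postgres numeric values as Python Decimal, while double precision comes back as float. A minimal sketch of that mapping, assuming psycopg2 and a hypothetical connection string (Splink's Postgres backend may go through a different driver, but the DB-API type mapping is the same):

import psycopg2

# numeric maps to decimal.Decimal; double precision maps to float
conn = psycopg2.connect("dbname=splink")  # hypothetical DSN
with conn.cursor() as cur:
    cur.execute("select 1.5::numeric, 1.5::double precision")
    num, dbl = cur.fetchone()
print(type(num), type(dbl))
# -> <class 'decimal.Decimal'> <class 'float'>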

P.S. I'll be doing a PR to replace

cast(({label_colname}_l = {label_colname}_r) as float) as clerical_match_score,

with a CASE statement, because casting boolean to float doesn't work in Postgres:

case when ({label_colname}_l = {label_colname}_r) then 1.0 else 0.0 end AS clerical_match_score

@RobinL
Member

RobinL commented Feb 5, 2024

Thank you - that sounds good @cinnq346. When you do, please could you add additional casts around the 1.0 and 0.0 to make it 'Spark-proof', as follows:

 case when ({label_colname}_l = {label_colname}_r) then cast(1.0 as float8) else cast(0.0 as float8) end AS clerical_match_score

@ThomasHepworth
Contributor Author

> I've just seen this exact problem as well. We're using the Postgres backend. I had to convert all the numeric columns to double precision to solve this. In other words, I think you need to cast row_count, p, n, tp, tn, fp, fn and p_rate in addition to the ones you've already done. [...]

Thanks for this table, it's really useful. The additional casts have been added here.

Feel free to tag me into your follow-up PR alongside Robin.

@ThomasHepworth ThomasHepworth merged commit a5e0fd6 into master Feb 5, 2024
10 checks passed
@ThomasHepworth ThomasHepworth deleted the cast_accuracy_measures_to_float branch February 5, 2024 14:19