
convert accuracy metrics to float #1893

Merged
ThomasHepworth merged 4 commits into master from cast_accuracy_measures_to_float on Feb 5, 2024

Conversation

ThomasHepworth
Contributor

@ThomasHepworth ThomasHepworth commented Jan 26, 2024

Type of PR

  • BUG
  • FEAT
  • MAINT
  • DOC

Give a brief description for the solution you have provided

Our production QA pipelines have been encountering the following error: TypeError: Object of type Decimal is not JSON serializable when attempting to serialise the Altair chart spec for any charts containing the following SQL.
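The failure is easy to reproduce outside of Splink/Altair; a minimal sketch of what the chart-spec serialisation hits:

import json
from decimal import Decimal

# The stdlib JSON encoder has no handler for Decimal values,
# so any Decimal in the chart data blows up serialisation:
json.dumps({"f1": Decimal("0.5")})
# TypeError: Object of type Decimal is not JSON serializable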

I've narrowed the error down to the following lines:

2.0*TP/(2*TP + FN + FP) as f1,
5.0*TP/(5*TP + 4*FN + FP) as f2,
1.25*TP/(1.25*TP + 0.25*FN + FP) as f0_5,
4.0*TP*TN/((4.0*TP*TN) + ((TP + TN)*(FP + FN))) as p4,
case when TN+FN=0 or TP+FP=0 or P=0 or N=0 then 0
    else cast((TP*TN)-(FP*FN) as float)/sqrt((TP+FP)*P*N*(TN+FN)) end as phi

which, when run in our Glue jobs (Spark 3.1.1), are cast to Python's Decimal type (from decimal import Decimal).
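For reference, the constants in those expressions are just the standard $F_\beta$ weights and the $\phi$ (Matthews) coefficient:

$$F_\beta = \frac{(1+\beta^2)\,TP}{(1+\beta^2)\,TP + \beta^2\,FN + FP} \qquad \phi = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP+FP)\,P\,N\,(TN+FN)}}$$

with $\beta = 1, 2, 0.5$ giving f1, f2 and f0_5, and $P = TP + FN$, $N = TN + FP$.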

To resolve this, I've added some explicit float casts, which seem to ensure that Spark doesn't convert to Decimals. This branch has been used to confirm that this resolves the issue.

I haven't looked into why this is happening, so there are no tests in this PR. However, if you'd like a more detailed explanation of what's going on, I'm happy to try to put together a reproducible example.

@ThomasHepworth
Contributor Author

I wonder if this bug is also related? #1822 (comment)

@RobinL RobinL self-assigned this Jan 30, 2024
@RobinL
Member

RobinL commented Jan 30, 2024

Yeah, so fundamentally this is because in Spark a literal number is a Decimal, not a Double. See also here

So whenever we have a numeric literal, we have to wrap it in a cast:

THEN cast({bayes_factor} as float8)

Then we have:

class CustomSpark(Spark):

which translates e.g. cast(0.12345 AS float8) into 0.12345D, which is of double type.

We need to do that because the float8 type doesn't exist in Spark:

spark.sql('select 0.1 as a').toPandas().iloc[0,0]
-> Decimal('0.1')

spark.sql('select 0.12345678D as a').toPandas().iloc[0,0]
-> 0.12345678 (i.e. a float)

spark.sql('select cast(0.12345678 as float8) as a').toPandas().iloc[0,0]
-> [UNSUPPORTED_DATATYPE] Unsupported data type "FLOAT8".(line 1, pos 26)
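To make the dialect trick concrete, here is a rough sketch of how a generator override can do the rewrite. This is not Splink's actual CustomSpark; the override point (cast_sql) and the class layout are assumptions based on sqlglot's public API:

import sqlglot
from sqlglot import exp
from sqlglot.dialects.spark import Spark

# Hypothetical sketch: when a numeric literal is cast to double
# (sqlglot parses float8 as DOUBLE), emit Spark's 0.12345D literal
# syntax instead of a CAST that Spark can't handle.
class CustomSpark(Spark):
    class Generator(Spark.Generator):
        def cast_sql(self, expression, safe_prefix=None):
            if (
                expression.to.is_type("double")
                and isinstance(expression.this, exp.Literal)
                and not expression.this.is_string
            ):
                return f"{self.sql(expression, 'this')}D"
            return super().cast_sql(expression, safe_prefix=safe_prefix)

print(sqlglot.transpile("select cast(0.12345 as float8)", write=CustomSpark)[0])
# -> SELECT 0.12345D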

@RobinL
Member

RobinL commented Jan 30, 2024

You can also see here (gamma_name) that you can't use cast(0.23489134 as double), as it truncates; it has to be the 0.23498234D syntax.

@RobinL
Member

RobinL commented Jan 30, 2024

spark.sql('select cast(0.349807234523750923475089457470478952340895740 as double) as a').toPandas().iloc[0,0]
-> ArithmeticException: [DECIMAL_PRECISION_EXCEEDS_MAX_PRECISION] Decimal precision 45 exceeds max precision 38.

spark.sql('select 0.349807234523750923475089457470478952340895740D as a').toPandas().iloc[0,0]
-> 0.3498072345237509

Member

@RobinL RobinL left a comment


Minor change in a comment on the code, but yeah, I think this is the right way to fix it.

cast(2.0*TP/(2*TP + FN + FP) as float) as f1,
5.0*TP/(5*TP + 4*FN + FP) as f2,
1.25*TP/(1.25*TP + 0.25*FN + FP) as f0_5,
4.0*TP*TN/((4.0*TP*TN) + ((TP + TN)*(FP + FN))) as p4,
Member


I think all of these should be float8 for consistency (see comments on PR)

@RobinL
Member

RobinL commented Jan 30, 2024

see also #694

@cinnq346
Contributor

cinnq346 commented Feb 5, 2024

I've just seen this exact problem as well. We're using the Postgres backend. I had to convert all the numeric columns to double precision to solve this. In other words, I think you need to cast row_count, p, n, tp, tn, fp, fn and p_rate in addition to the ones you've already done.

For reference, this is how the table is currently created:

                                   Table "splink.__splink__truth_space_table_337726f4d"
      Column       |       Type       | Collation | Nullable | Default | Storage | Compression | Stats target | Description
-------------------+------------------+-----------+----------+---------+---------+-------------+--------------+-------------
 truth_threshold   | double precision |           |          |         | plain   |             |              |
 match_probability | double precision |           |          |         | plain   |             |              |
 row_count         | numeric          |           |          |         | main    |             |              |
 p                 | numeric          |           |          |         | main    |             |              |
 n                 | numeric          |           |          |         | main    |             |              |
 tp                | numeric          |           |          |         | main    |             |              |
 tn                | numeric          |           |          |         | main    |             |              |
 fp                | numeric          |           |          |         | main    |             |              |
 fn                | numeric          |           |          |         | main    |             |              |
 p_rate            | numeric          |           |          |         | main    |             |              |
 n_rate            | double precision |           |          |         | plain   |             |              |
 tp_rate           | double precision |           |          |         | plain   |             |              |
 tn_rate           | double precision |           |          |         | plain   |             |              |
 fp_rate           | double precision |           |          |         | plain   |             |              |
 fn_rate           | double precision |           |          |         | plain   |             |              |
 precision         | double precision |           |          |         | plain   |             |              |
 recall            | double precision |           |          |         | plain   |             |              |
 specificity       | double precision |           |          |         | plain   |             |              |
 npv               | double precision |           |          |         | plain   |             |              |
 accuracy          | double precision |           |          |         | plain   |             |              |
 f1                | numeric          |           |          |         | main    |             |              |
 f2                | numeric          |           |          |         | main    |             |              |
 f0_5              | numeric          |           |          |         | main    |             |              |
 p4                | numeric          |           |          |         | main    |             |              |
 phi               | double precision |           |          |         | plain   |             |              |
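This is the same root cause as on Spark: the driver returns Postgres numeric values as Python Decimal, while double precision comes back as float. A minimal sketch of that mapping, assuming psycopg2 and a hypothetical connection string (Splink's Postgres backend may go through a different driver, but the DB-API type mapping is the same):

import psycopg2

# numeric maps to decimal.Decimal; double precision maps to float
conn = psycopg2.connect("dbname=splink")  # hypothetical DSN
with conn.cursor() as cur:
    cur.execute("select 1.5::numeric, 1.5::double precision")
    num, dbl = cur.fetchone()
print(type(num), type(dbl))
# -> <class 'decimal.Decimal'> <class 'float'>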

P.S. I'll be doing a PR to replace

cast(({label_colname}_l = {label_colname}_r) as float) as clerical_match_score,

with a CASE statement, because casting boolean to float doesn't work in Postgres:

case when ({label_colname}_l = {label_colname}_r) then 1.0 else 0.0 end AS clerical_match_score

@RobinL
Member

RobinL commented Feb 5, 2024

Thank you - that sounds good @cinnq346. When you do, please could you add additional casts around the 1.0 and 0.0 to make it 'Spark-proof', as follows:

 case when ({label_colname}_l = {label_colname}_r) then cast(1.0 as float8) else cast(0.0 as float8) end AS clerical_match_score

@ThomasHepworth
Contributor Author

> I've just seen this exact problem as well. We're using the Postgres backend. I had to convert all the numeric columns to double precision to solve this. In other words, I think you need to cast row_count, p, n, tp, tn, fp, fn and p_rate in addition to the ones you've already done. [...]

Thanks for this table, it's really useful. The additional casts have been added here.

Feel free to tag me into your follow-up PR alongside Robin.

@ThomasHepworth ThomasHepworth merged commit a5e0fd6 into master Feb 5, 2024
10 checks passed
@ThomasHepworth ThomasHepworth deleted the cast_accuracy_measures_to_float branch February 5, 2024 14:19