-
Notifications
You must be signed in to change notification settings - Fork 5.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Implement new NaN behavior #22386
Implement new NaN behavior #22386
Conversation
c0948db
to
62da9bf
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A lot (all?) of these type comparisons look like they should be .equals for everything, not just the real and double types. Type is not an enum. I guess they're supposed to be singletons? But then why doesn't this work for real and double?
It doesn't work for real and double in this PR because i added a flag to it for the nan migration, so depending on the configuration property, the flag could be true or false. (Also, while you're certainly welcome to review while it's in this state, I'm planning to clean up the commits to be logically independent once I've finished fixing the tests/adding tests and covering all the cases). |
d2b923d
to
1679716
Compare
158134f
to
3a9bd84
Compare
2a25b6b
to
bd3141d
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add property for new nan behavior
return useNewNanDefinition; | ||
} | ||
|
||
@Config("use-new-nan-definition") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we call this the new NaN definition, and by default this is false, or should we refer to the old behavior as the deprecated NaN definition, and by default that is true?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, I see the later commit that enables it. I would have recommended this if it were disabled by default, but I think it's fine as it is.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add operator double support for new NaN...
long aBits = doubleToLongBits(a); | ||
long bBits = doubleToLongBits(b); | ||
if (aBits < bBits) { | ||
return -1; | ||
} | ||
if (aBits > bBits) { | ||
return 1; | ||
} | ||
return 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In theory, isn't it possible for multiple NaNs to have separate values when represented as bits? Supposing this came from a datasource and not from Presto itself. If so, would this work?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
doubleToLongBits coerces all the nans to the same representation. There's a different function Double.doubleToRawLongBits() that doesn't do that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added a comment in the code and some tests for this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add NaN definition to DoubleType ...
// a time. .equals() comparison is always used against the static DOUBLE | ||
// instance to check if something is double type. this hack is temporary | ||
// and will be removed when we full remove the old nan behavior | ||
return other == DOUBLE || other == OLD_NAN_DOUBLE; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we just do an instanceof check? It wouldn't require a lengthy explanation. Likewise for float.
6879f44
fbf6bbc
to
623eb7f
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks awesome, thanks for the great work @rschlussel.
Add a boolean field to DoubleType and RealType to determine whether to use the new nan definition. If useNewNanDefinition is set to true in the configuration property, then only doubles/reals with that property set to true will be created, and if it is false, then only doubles/reals with that property set to false will be created. This will be used in later commits to make decisions about how to handle nans. Because DoubleType and RealType are now parametrized types, it is no longer correct to use type == DOUBLE for type checking, as it is no longer a singleton instance. All code that was using type == DOUBLE or type == REAL has been updated to use .equals() comparison
Add support for new nan defintion for =, <>, >, <, >=, <=, between, in, not in.
This adds support for the new nan definition to =, <>, <, >, <=,>=, BETWEEN, IN, NOT IN for real types.
This adds support for new nan definition for tuple domains, which are use for hive filter pushdown.
also fixes when array has nans and nulls
Description
This PR contains all the changes to overhaul the NaN operators to conform with the new definition proposed in https://github.com/prestodb/rfcs/blob/main/RFC-0001-nan-definition.md.
According to the new nan definition, Nan is larger than all other numbers and is equal to itself. This PR also changes +0 and -0 to always be considered equal/not distinct, whereas previously there was inconsistency here as well.
I recommend reviewing commit by commit as it is divided into logically distinct pieces that should be easier to review. If it would be helpful I can also split it up into smaller PRs.
Fixes the following issues:
#22040
#21936
#21877
#22679
#13807
#21065
#22716
facebookincubator/velox#9511
It should also fix should also fix #16851, though I'm not sure how to create the file to test it.
Motivation and Context
The motivation for this change is to provide consistent NaN semantics across all of our functions and operators. It is also to ensure that these semantics are consistent with velox as we move to native workers. For more details see the RFC: https://github.com/prestodb/rfcs/blob/main/RFC-0001-nan-definition.md.
Impact
nan will now be treated as greater than all other numbers for all functions and as equal to itself for all functions. This changes the behavior of many existing functions and operators. These differences include (but are not limited to) =, <, >, joins, various distincting functions like set_agg and array_distinct, array_min.
It also fixes #22040, a wrong results bug with map_top_n in the presence of nans.
This PR also changes joins and aggregations to treat +0 and -0 as equal/not distinct.
Test Plan
Added tests for all affected functions.
Contributor checklist
Release Notes
Please follow release notes guidelines and fill in the release notes below.