Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: incorrect rounding in groupby.cummin near int64 implementation bounds #40767

Merged
merged 10 commits into from Apr 14, 2021

Conversation

jbrockmendel
Copy link
Member

  • closes #xxxx
  • tests added / passed
  • Ensure all linting tests pass, see here for how to run them
  • whatsnew entry

xref #40719

if (values == iNaT).any():
values = ensure_float64(values)
else:
values = ensure_int_or_float(values)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think these iNaT checks are necessary because in algos like group_min_max, it assumes that it is impossible for an integer array to have iNaT (because it assumes datetimelike=True and safety in setting missing values as NPY_NAT).

eg on this branch:

ser = pd.Series([1, iNaT])
print(ser.groupby([1, 1]).max(min_count=2))

gives

1   -9223372036854775808
dtype: int64

because the iNaT is interpreted as NaN, so min_count isn't reached. (Then the NaN is set with iNaT, but not interpreted as NaN anymore)

Forcing mask usage (#40651 (comment)) would be another way to handle this more robustly (since to specify missing values, the mask itself can just be modified inplace, precluding the need for the existing iNaT workarounds to signify missing values in the algo result)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

because it assumes datetimelike=True

isn't this the underlying problem in this example?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the reason datetimelike isn't used in group_min_max(and some other groupby algos) is because of the additional problem that there is no way to encode a missing value in an integer (not datetimelike) array.

So the current logic casts integer to float if iNaT is present, so that iNaT can be used to signify missing values for integer data (regardless of datetimelike status).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that there is no way to encode a missing value in an integer (not datetimelike) array.

i think we dont need to encode it because we have counts==0 (or counts<min_count) to convey that information

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if we use masks we could avoid this awkwardness but that will need some refactoring

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

its not too hard to edit group_min_max to handle this example (will push in a bit). We can improve matters a bit by using counts < min_count as a mask more consistently, but that won't do anything about int64->float64 conversions being lossy. a full solution to that would be one of a) explicitly be OK with the lossiness, b) detect when the conversion would be lossy and cast to either object or float128, c) return an IntegerArray

@jreback
Copy link
Contributor

jreback commented Apr 6, 2021

why would we need to cast? can we not simply use a mask and leave the values as int?

@jbrockmendel
Copy link
Member Author

why would we need to cast? can we not simply use a mask and leave the values as int?

that was option c in my comment, return an IntegerArray. For that we'd need IntegerArray to be 1) not opt-in and 2a) support 2D (#38992) or 2b) go full ArrayManager

@jreback
Copy link
Contributor

jreback commented Apr 6, 2021

maybe u misunderstand

in the cython code we use the incoming dtype and a mask

no casting required and you can always return the original type (except of course you may have to cast to floats if make nans but that would be rare)

@jbrockmendel
Copy link
Member Author

maybe u misunderstand

suggested phrasing: "maybe we're talking past each other" or "maybe there's a miscommunication"

no casting required and you can always return the original type (except of course you may have to cast to floats if make nans but that would be rare)

the casting under discussion is on the back end after the cython call.

if is_integer_dtype(result.dtype) and not is_datetimelike:
cutoff = max(1, min_count)
empty_groups = counts < cutoff
if empty_groups.any():
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why is this check needed?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if empty_groups.any() then we need to mask in order to cast to float64

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

though this behavior is really surprising, though likely not hit practically. we should move aggressively to return Int dtypes here. Yes this is a breaking change but fixes these types of value dependent behavior.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we should move aggressively to return Int dtypes here

we'd need 2D support for Int dtypes for me to consider getting on board with this, xref #38992

@jreback jreback added this to the 1.3 milestone Apr 12, 2021
assert result.ndim != 2
result = result[counts > 0]
if kind == "aggregate":
# i.e. counts is defined
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you reference either this PR & put in an expl here, this is is really unexpected and non-obvious what is happening. ping on green.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ping

@jreback
Copy link
Contributor

jreback commented Apr 13, 2021

this is a very slight user facing change right? can you add a whatsnew note (ref this PR)

@jreback jreback merged commit 380904d into pandas-dev:master Apr 14, 2021
@jreback
Copy link
Contributor

jreback commented Apr 14, 2021

thanks @jbrockmendel

@jbrockmendel jbrockmendel deleted the ref-gbop branch April 14, 2021 14:14
yeshsurya pushed a commit to yeshsurya/pandas that referenced this pull request Apr 21, 2021
yeshsurya pushed a commit to yeshsurya/pandas that referenced this pull request May 6, 2021
JulianWgs pushed a commit to JulianWgs/pandas that referenced this pull request Jul 3, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants