BUG: incorrect rounding in groupby.cummin near int64 implementation bounds #40767

jbrockmendel · 2021-04-03T04:33:24Z

closes #xxxx
tests added / passed
Ensure all linting tests pass, see here for how to run them
whatsnew entry

mzeitlin11 · 2021-04-03T14:42:29Z

pandas/core/groupby/ops.py

-            if (values == iNaT).any():
-                values = ensure_float64(values)
-            else:
-                values = ensure_int_or_float(values)


I think these iNaT checks are necessary because in algos like group_min_max, it assumes that it is impossible for an integer array to have iNaT (because it assumes datetimelike=True and safety in setting missing values as NPY_NAT).

eg on this branch:

ser = pd.Series([1, iNaT])
print(ser.groupby([1, 1]).max(min_count=2))

gives

1   -9223372036854775808
dtype: int64

because the iNaT is interpreted as NaN, so min_count isn't reached. (The missing result is then written back as iNaT, but since the output is a plain int64 array it is no longer interpreted as NaN.)

Forcing mask usage (#40651 (comment)) would be another way to handle this more robustly (since to specify missing values, the mask itself can just be modified inplace, precluding the need for the existing iNaT workarounds to signify missing values in the algo result)
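To illustrate the point about sentinels versus masks, here is a minimal NumPy-only sketch. It is not the pandas implementation: `group_max_sentinel` and `group_max_masked` are hypothetical helpers for a single group, and the only fact borrowed from pandas is that iNaT has the same value as the smallest int64.

```python
import numpy as np

# iNaT shares its value with the smallest representable int64.
INT64_MIN = np.iinfo(np.int64).min


def group_max_sentinel(values, min_count):
    """Hypothetical single-group max that treats INT64_MIN as 'missing'."""
    valid = values[values != INT64_MIN]
    if len(valid) < min_count:
        # Too few observations: the sentinel is reused to signal "missing".
        return INT64_MIN
    return valid.max()


def group_max_masked(values, mask, min_count):
    """Hypothetical single-group max where missingness lives in a boolean mask."""
    valid = values[~mask]
    if len(valid) < min_count:
        return None  # the caller decides how to represent a missing result
    return valid.max()


values = np.array([1, INT64_MIN], dtype=np.int64)
nothing_missing = np.zeros(len(values), dtype=bool)

sentinel_result = group_max_sentinel(values, min_count=2)
masked_result = group_max_masked(values, nothing_missing, min_count=2)
print(sentinel_result, masked_result)
```

In the sentinel version a genuine integer that happens to equal the sentinel is dropped, min_count is not reached, and the "missing" marker comes back as an ordinary int64. In the masked version both values count as observations.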

because it assumes datetimelike=True

isn't this the underlying problem in this example?

I think the reason datetimelike isn't used in group_min_max (and some other groupby algos) is the additional problem that there is no way to encode a missing value in an integer (not datetimelike) array.

So the current logic casts integer to float if iNaT is present, so that iNaT can be used to signify missing values for integer data (regardless of datetimelike status).

that there is no way to encode a missing value in an integer (not datetimelike) array.

i think we don't need to encode it, because counts == 0 (or counts < min_count) already conveys that information

if we use masks we could avoid this awkwardness but that will need some refactoring

it's not too hard to edit group_min_max to handle this example (will push in a bit). We can improve matters a bit by using counts < min_count as a mask more consistently, but that won't do anything about int64->float64 conversions being lossy. A full solution to that would be one of: a) explicitly be OK with the lossiness, b) detect when the conversion would be lossy and cast to either object or float128, c) return an IntegerArray
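For readers unfamiliar with why the int64->float64 conversion is lossy: float64 has a 53-bit significand, so integers above 2**53 cannot all be represented exactly. The sketch below shows the rounding and one possible way to detect it, in the spirit of option (b). `cast_is_lossy` is a hypothetical helper, not something pandas provides.

```python
import numpy as np

INT64_MAX = np.iinfo(np.int64).max

big = np.array([2**53 + 1, INT64_MAX], dtype=np.int64)
as_float = big.astype(np.float64)

# Compare through Python ints so that nothing overflows on the way back.
rounded = [int(x) for x in as_float]
print(rounded[0] == 2**53 + 1)  # False: the value was rounded to 2**53
print(rounded[1] == INT64_MAX)  # False: the value was rounded up to 2**63


def cast_is_lossy(arr):
    """Hypothetical check: does casting this int64 array to float64 change any value?"""
    floats = arr.astype(np.float64)
    return any(int(f) != int(i) for f, i in zip(floats, arr))


print(cast_is_lossy(big))
print(cast_is_lossy(np.array([1, 2, 3], dtype=np.int64)))
```

The element-wise Python loop is slow and only meant to make the comparison explicit. A production check would need a vectorized formulation.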

jreback · 2021-04-06T01:06:07Z

why would we need to cast? can we not simply use a mask and leave the values as int?

jbrockmendel · 2021-04-06T01:13:17Z

why would we need to cast? can we not simply use a mask and leave the values as int?

that was option c in my comment, return an IntegerArray. For that we'd need IntegerArray to be 1) not opt-in and 2a) support 2D (#38992) or 2b) go full ArrayManager

jreback · 2021-04-06T01:16:50Z

maybe u misunderstand

in the cython code we use the incoming dtype and a mask

no casting required and you can always return the original type (except of course you may have to cast to floats if make nans but that would be rare)
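A rough pure-Python picture of the "incoming dtype plus a mask" idea, applied to a grouped cumulative minimum. This is only a sketch under assumptions: `group_cummin_masked` is a hypothetical name, and the real cython routines differ in signature and detail.

```python
import numpy as np


def group_cummin_masked(values, labels, mask):
    """Hypothetical grouped cummin; `mask` is True where the input is missing.

    Returns (out, out_mask). `out` keeps the dtype of `values`.
    """
    out = values.copy()
    out_mask = mask.copy()
    running = {}  # group label -> smallest value seen so far
    for i, (val, lab) in enumerate(zip(values, labels)):
        if mask[i]:
            continue  # a missing input stays missing in the output
        cur = running.get(lab)
        if cur is None or val < cur:
            running[lab] = val
            cur = val
        out[i] = cur
    return out, out_mask


INT64_MAX = np.iinfo(np.int64).max
values = np.array([INT64_MAX, INT64_MAX - 1, 5, 7], dtype=np.int64)
labels = np.array([0, 0, 1, 1])
mask = np.array([False, False, False, True])

out, out_mask = group_cummin_masked(values, labels, mask)
print(out.dtype, out.tolist(), out_mask.tolist())
```

Because the values never pass through float64, the two entries near the int64 upper bound come back unchanged, and the missing position is carried in `out_mask` rather than in the data.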

jbrockmendel · 2021-04-06T01:19:09Z

maybe u misunderstand

suggested phrasing: "maybe we're talking past each other" or "maybe there's a miscommunication"

no casting required and you can always return the original type (except of course you may have to cast to floats if make nans but that would be rare)

the casting under discussion is on the back end after the cython call.

jreback · 2021-04-08T13:21:59Z

pandas/core/groupby/ops.py

+            if is_integer_dtype(result.dtype) and not is_datetimelike:
+                cutoff = max(1, min_count)
+                empty_groups = counts < cutoff
+                if empty_groups.any():


why is this check needed?

if empty_groups.any() then we need to mask in order to cast to float64
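A sketch of what that check guards, assuming the branch body casts to float64 and fills the empty groups with NaN. The diff above only shows the condition, so the body, the function name `mask_empty_groups`, and the sample arrays are assumptions. Only `cutoff`, `empty_groups`, `counts` and `min_count` come from the diff.

```python
import numpy as np


def mask_empty_groups(result, counts, min_count):
    """Hypothetical post-processing step for an integer aggregation result."""
    cutoff = max(1, min_count)
    empty_groups = counts < cutoff
    if empty_groups.any():
        # Assumed body: only now is a float cast needed, to hold NaN.
        result = result.astype(np.float64)
        result[empty_groups] = np.nan
    return result


result = np.array([3, 0, 7], dtype=np.int64)

with_empty = mask_empty_groups(result, np.array([2, 0, 1]), min_count=0)
without_empty = mask_empty_groups(result, np.array([2, 1, 1]), min_count=0)
print(with_empty, with_empty.dtype)
print(without_empty, without_empty.dtype)
```

When no group falls below the cutoff, the result stays int64 and no precision is lost. The float cast, and with it the value-dependent dtype, only appears when at least one group is empty.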

this behavior is really surprising, though likely not hit in practice. we should move aggressively to return Int dtypes here. Yes, this is a breaking change, but it fixes these types of value-dependent behavior.

we should move aggressively to return Int dtypes here

we'd need 2D support for Int dtypes for me to consider getting on board with this, xref #38992

jreback · 2021-04-12T12:44:04Z

pandas/core/groupby/ops.py

-            assert result.ndim != 2
-            result = result[counts > 0]
+        if kind == "aggregate":
+            # i.e. counts is defined


can you reference this PR & put in an explanation here? this is really unexpected and it's non-obvious what is happening. ping on green.

jreback · 2021-04-13T11:35:35Z

this is a very slight user facing change right? can you add a whatsnew note (ref this PR)

jreback · 2021-04-14T12:56:29Z

thanks @jbrockmendel

jbrockmendel added 2 commits April 2, 2021 21:31

BUG: groupby.cummin losing precision on large integers

7936a9c

simplify check

234419e

mzeitlin11 reviewed Apr 3, 2021

View reviewed changes

jbrockmendel added 2 commits April 3, 2021 19:24

Merge branch 'master' into ref-gbop

2e87770

Merge branch 'master' into ref-gbop

850f985

Handle min/max

b8f5a28

jbrockmendel added Bug Groupby labels Apr 6, 2021

jreback requested changes Apr 8, 2021

View reviewed changes

jreback added this to the 1.3 milestone Apr 12, 2021

jreback approved these changes Apr 12, 2021

View reviewed changes

jreback reviewed Apr 12, 2021

View reviewed changes

jreback mentioned this pull request Apr 12, 2021

PERF/BUG: use masked algo in groupby cummin and cummax #40651

Merged

jbrockmendel added 2 commits April 12, 2021 14:32

Merge branch 'master' into ref-gbop

5db0964

comment+GH ref

e5cf34d

jbrockmendel added 3 commits April 13, 2021 10:07

Merge branch 'master' into ref-gbop

3f06416

whatsnew

eeba093

Merge branch 'master' into ref-gbop

c2e75be

jbrockmendel force-pushed the ref-gbop branch from 4a2491c to c2e75be Compare April 13, 2021 21:34

jreback merged commit 380904d into pandas-dev:master Apr 14, 2021

jbrockmendel deleted the ref-gbop branch April 14, 2021 14:14

yeshsurya pushed a commit to yeshsurya/pandas that referenced this pull request Apr 21, 2021

BUG: incorrect rounding in groupby.cummin near int64 implementation b…

51a5d9c

…ounds (pandas-dev#40767)

yeshsurya pushed a commit to yeshsurya/pandas that referenced this pull request May 6, 2021

BUG: incorrect rounding in groupby.cummin near int64 implementation b…

b41f1a0

…ounds (pandas-dev#40767)

JulianWgs pushed a commit to JulianWgs/pandas that referenced this pull request Jul 3, 2021

BUG: incorrect rounding in groupby.cummin near int64 implementation b…

b51f067

…ounds (pandas-dev#40767)
