BUG: Groupby transformation (cumsum) output dtype depends on whether NA is among group labels #58811

Closed
3 tasks done
avm19 opened this issue May 22, 2024 · 9 comments · Fixed by #58984
Labels
Bug Dtype Conversions Unexpected or buggy dtype conversions Groupby Transformations e.g. cumsum, diff, rank

Comments

@avm19

avm19 commented May 22, 2024

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import pandas as pd
print(pd.__version__)  # 3.0.0.dev0+1020.g2aa155ae1c

s1 = pd.Series([0, 0, 1, 1, 2, None], dtype='Int16', name='A')
s2 = pd.Series([10, 20, 30, None, None, 60], dtype='Int16', name='B')
df = pd.concat([s1, s2], axis=1)
print(df.iloc[:5].groupby('A')['B'].cumsum().dtype)  # Int16 as expected
print(df.iloc[:6].groupby('A')['B'].cumsum().dtype)  # Float64, BUT EXPECTED Int16Dtype !!!
print(df.iloc[5:].groupby('A')['B'].cumsum().dtype)  # Float64, BUT EXPECTED Int16Dtype !!!
print(df.iloc[:5].groupby('A', dropna=False)['B'].cumsum().dtype)  # Int16 as expected
print(df.iloc[:6].groupby('A', dropna=False)['B'].cumsum().dtype)  # Int16 as expected 
print(df.iloc[5:].groupby('A', dropna=False)['B'].cumsum().dtype)  # Int16 as expected 
print(df.iloc[:5].groupby('A')['B'].sum().dtype)  # Int16 as expected
print(df.iloc[:6].groupby('A')['B'].sum().dtype)  # Int16 as expected
print(df.iloc[5:].groupby('A')['B'].sum().dtype)  # Int16 as expected

Issue Description

If a None (pd.NA) is present among the group labels (i.e. in the by= column), then the result of a transformation (here cumsum) changes dtype from nullable integer (Int16) to nullable float (Float64).

This is unexpected and therefore a bug. The type change can be undone by .astype(...), but even though it is unlikely in my use case, a loss of precision is possible in theory. I also wonder whether the floating-point operations and type conversions could noticeably impact performance on large in-memory datasets.

Other remarks:

  • No bug when groupby(..., dropna=False), as shown above.
  • No bug when using aggregation, e.g. .sum() instead of cumsum(), as shown above.
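The .astype(...) workaround mentioned above can be sketched as follows (a minimal, hypothetical example on a small frame, not the exact data from the report; under the buggy behavior the cumsum result comes back as Float64 and is cast back afterwards):

```python
import pandas as pd

df = pd.DataFrame(
    {
        "A": pd.array([0, 0, None], dtype="Int16"),  # NA among the group labels
        "B": pd.array([1, 2, 3], dtype="Int16"),
    }
)

result = df.groupby("A")["B"].cumsum()  # Float64 under the bug described here
restored = result.astype("Int16")       # back to Int16; values must fit the dtype
```

This is only viable when the intermediate float values round-trip exactly, which is the precision concern raised above.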

Expected Behavior

See the example.

Installed Versions

INSTALLED VERSIONS

commit : 2aa155a
python : 3.10.13.final.0
python-bits : 64
OS : Linux
OS-release : 5.15.133+
Version : #1 SMP Tue Dec 19 13:14:11 UTC 2023
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : POSIX
LANG : C.UTF-8
LOCALE : None.None

pandas : 3.0.0.dev0+1020.g2aa155ae1c
numpy : 1.26.4
pytz : 2023.3.post1
dateutil : 2.9.0.post0
setuptools : 69.0.3
pip : 23.3.2
Cython : 3.0.8
pytest : 8.1.1
hypothesis : None
sphinx : None
blosc : None
feather : 0.4.1
xlsxwriter : None
lxml.etree : 5.2.1
html5lib : 1.1
pymysql : None
psycopg2 : None
jinja2 : 3.1.2
IPython : 8.20.0
pandas_datareader : 0.10.0
adbc-driver-postgresql: None
adbc-driver-sqlite : None
bs4 : 4.12.2
bottleneck : None
fastparquet : None
fsspec : 2024.2.0
gcsfs : 2024.2.0
matplotlib : 3.7.5
numba : 0.58.1
numexpr : 2.10.0
odfpy : None
openpyxl : 3.1.2
pyarrow : 15.0.2
pyreadstat : None
python-calamine : None
pyxlsb : None
s3fs : 2024.2.0
scipy : 1.11.4
sqlalchemy : 2.0.25
tables : 3.9.2
tabulate : 0.9.0
xarray : 2024.3.0
xlrd : None
zstandard : 0.22.0
tzdata : 2023.4
qtpy : None
pyqt5 : None

@avm19 avm19 added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels May 22, 2024
@rhshadrach rhshadrach added Groupby Dtype Conversions Unexpected or buggy dtype conversions Transformations e.g. cumsum, diff, rank and removed Needs Triage Issue that has not been reviewed by a pandas team member labels May 22, 2024
@rhshadrach
Member

Thanks for the report! Confirmed on main, further investigations and PRs to fix are welcome.

@luke396
Contributor

luke396 commented May 24, 2024

take

@luke396
Contributor

luke396 commented May 24, 2024

I think the issue is here:

return obj._constructor(result, index=self.obj.index, name=obj.name)

When adding dtype like:

return obj._constructor(
    result, index=self.obj.index, dtype=obj.dtype, name=obj.name
)

the result changes, but a new issue arises:

df = pd.DataFrame({'A': [1, None], 'B': [2,3]}, dtype='Int16')
obj = df.groupby('A')["B"]
obj.cumsum()
# before
# 0    2.0
# 1    NaN
# Name: B, dtype: Float64

# after
# 0    2
# 1    0
# Name: B, dtype: Int16

The reason NaN is converted to 0 is that during astype, self._data is cast from array([2., nan]) with dtype('float64') to array([2, 0], dtype=int16), as noted in the TODO about the FloatingArray NaN case:

if isinstance(dtype, BaseMaskedDtype):
    # TODO deal with NaNs for FloatingArray case
    with warnings.catch_warnings():
        warnings.filterwarnings("ignore", category=RuntimeWarning)
        # TODO: Is rounding what we want long term?
        data = self._data.astype(dtype.numpy_dtype, copy=copy)
    # mask is copied depending on whether the data was copied, and
    # not directly depending on the `copy` keyword
    mask = self._mask if data is self._data else self._mask.copy()
    cls = dtype.construct_array_type()
    return cls(data, mask, copy=False)

I think addressing the case for NaN in FloatingArray will solve this issue (and possibly others, as the TODO has been there for a long time). However, I'm not sure how to fix it. Could you offer some help or some thoughts, @rhshadrach?
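The failure mode can be reproduced in plain NumPy, independent of pandas internals (a minimal sketch of the masked-array layout: values in a `_data` array, missing-ness in a parallel boolean `_mask`; casting only `_data`, as the snippet above does, sends any unmasked NaN through an undefined float-to-int cast):

```python
import warnings

import numpy as np

data = np.array([2.0, np.nan])       # a NaN that is NOT flagged in the mask
mask = np.array([False, False])

with warnings.catch_warnings():
    warnings.filterwarnings("ignore", category=RuntimeWarning)
    # float->int cast of NaN is undefined behavior; it commonly comes out as 0
    cast = data.astype(np.int16)
```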

@rhshadrach
Member

@luke396 - I think this issue lies deeper, perhaps in pandas.groupby.ops.WrappedCythonOp._get_result_dtype. I was under the impression that with smaller dtypes (e.g. int16), the result would still be 64-bit when doing a sum and the like, to avoid overflow issues. But it looks like this is not the case, even with NumPy dtypes. @jbrockmendel, do you know if my memory is incorrect here?

s1 = pd.Series([0, 0, 1, 1], dtype='int16', name='A')
s2 = pd.Series([10, 20, 30, 60], dtype='int16', name='B')
df = pd.concat([s1, s2], axis=1)

print(df.groupby('A')['B'].sum().dtype)
# int16

print(df.groupby('A')['B'].cumsum().dtype)
# int16

@jbrockmendel
Member

IIRC, when writing WrappedCythonOp I just kept the behavior that existed at the time, so we'd have to go back to @jreback for answers on this one. I'd be OK either way: casting to 64-bit or retaining the input dtype.

@rhshadrach
Copy link
Member

rhshadrach commented Jun 3, 2024

Both NumPy and PyArrow give 64-bit integers when summing; long term we may want to agree with that, but since staying 16-bit is consistent within pandas, I think we should do that here.
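For reference, the NumPy promotion mentioned here (a minimal check; the default accumulator dtype for small signed integers is the platform's default integer, typically int64 on 64-bit Linux/macOS and int32 on Windows):

```python
import numpy as np

arr = np.array([10, 20, 30], dtype=np.int16)

# NumPy promotes int16 to the platform default integer when summing,
# unlike pandas' groupby sum, which keeps int16:
total = arr.sum()
```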

@luke396
Contributor

luke396 commented Jun 5, 2024

@rhshadrach, if I understand correctly, we currently want to keep the behavior of pandas.groupby.ops.WrappedCythonOp._get_result_dtype, which maintains the 16-bit dtype.

Regarding the issue within _grouper._cython_operation, I found some potentially useful details. In maybe_downcast_to_dtype and maybe_downcast_numeric, the result includes NaN values. Notably, _cython_operation does not perform any astype conversion in this scenario, unlike the sum or dropna=False cases, which use the following code for type conversion.

if (
    issubclass(result.dtype.type, (np.object_, np.number))
    and notna(result).all()
):
    new_result = trans(result).astype(dtype)
    if new_result.dtype.kind == "O" or result.dtype.kind == "O":
        # np.allclose may raise TypeError on object-dtype
        if (new_result == result).all():
            return new_result
    else:
        if np.allclose(new_result, result, rtol=0):
            return new_result

In the original design, did we consider this scenario elsewhere, or should we modify/add code to address this situation?

@rhshadrach
Member

I think the offending code may be here:

if lab < 0:
    continue

When using a mask (uses_mask=True), should we be setting result_mask to True for these entries (and out to 0 for deterministic results)? Doing this, I get Int16 as a result for cumsum, but I haven't looked to see if this might cause any other issues.
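A pure-Python sketch of that proposal (illustrative only, not the actual Cython; the function name and signature here are made up to mirror the group_cumsum inner loop): rows with label -1 (the dropped NA group) are masked out and zeroed instead of skipped, so the masked path never needs a float cast.

```python
import numpy as np

def masked_group_cumsum(values, labels, ngroups):
    """Sketch of a masked cumulative sum over group labels.

    Rows whose label is negative (dropped NA group) get result_mask=True
    and out=0 for deterministic results, per the proposal above.
    """
    out = np.zeros_like(values)                      # keeps the input dtype
    result_mask = np.zeros(len(values), dtype=bool)
    accum = np.zeros(ngroups, dtype=values.dtype)
    for i, lab in enumerate(labels):
        if lab < 0:
            result_mask[i] = True                    # mark as missing...
            out[i] = 0                               # ...and zero for determinism
            continue
        accum[lab] += values[i]
        out[i] = accum[lab]
    return out, result_mask

out, mask = masked_group_cumsum(
    np.array([10, 20, 30], dtype=np.int16), [0, 0, -1], ngroups=1
)
# out stays int16; the dropped row is masked rather than turned into NaN
```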

Another place that is suspect is:

if how in ["var", "mean"] or (
    self.kind == "transform" and self.has_dropped_na
):
    # has_dropped_na check need for test_null_group_str_transformer
    # result may still include NaN, so we have to cast
    values = ensure_float64(values)

The conversion to float is necessary when dealing with NumPy dtypes, but not with NumPy-nullable dtypes (once you implement the aforementioned change). This has implications for the precision of the result.
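The precision implication can be seen directly (a minimal example: float64 has a 53-bit significand, so int64 values above 2**53 do not survive a round trip through float):

```python
import numpy as np

big = 2**53 + 1                    # exactly representable as int64...
roundtrip = int(np.float64(big))   # ...but not as float64: it rounds to 2**53
```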

@luke396
Contributor

luke396 commented Jun 15, 2024

Thanks @rhshadrach! I couldn’t have finished the PR without your help!
