BUG: Groupby transformation (cumsum) output dtype depends on whether NA is among group labels #58811

Closed
3 tasks done
avm19 opened this issue May 22, 2024 · 9 comments · Fixed by #58984
Labels
Bug Dtype Conversions Unexpected or buggy dtype conversions Groupby Transformations e.g. cumsum, diff, rank

Comments

@avm19

avm19 commented May 22, 2024

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import pandas as pd
print(pd.__version__)  # 3.0.0.dev0+1020.g2aa155ae1c

s1 = pd.Series([0, 0, 1, 1, 2, None], dtype='Int16', name='A')
s2 = pd.Series([10, 20, 30, None, None, 60], dtype='Int16', name='B')
df = pd.concat([s1, s2], axis=1)
print(df.iloc[:5].groupby('A')['B'].cumsum().dtype)  # Int16 as expected
print(df.iloc[:6].groupby('A')['B'].cumsum().dtype)  # Float64, BUT EXPECTED Int16Dtype !!!
print(df.iloc[5:].groupby('A')['B'].cumsum().dtype)  # Float64, BUT EXPECTED Int16Dtype !!!
print(df.iloc[:5].groupby('A', dropna=False)['B'].cumsum().dtype)  # Int16 as expected
print(df.iloc[:6].groupby('A', dropna=False)['B'].cumsum().dtype)  # Int16 as expected 
print(df.iloc[5:].groupby('A', dropna=False)['B'].cumsum().dtype)  # Int16 as expected 
print(df.iloc[:5].groupby('A')['B'].sum().dtype)  # Int16 as expected
print(df.iloc[:6].groupby('A')['B'].sum().dtype)  # Int16 as expected
print(df.iloc[5:].groupby('A')['B'].sum().dtype)  # Int16 as expected

Issue Description

If a None (pd.NA) is present among the group labels (i.e. in the by= column), then the result of a transformation (here cumsum) changes dtype from nullable integer (Int16) to nullable float (Float64).

This is unexpected and therefore a bug. The type change can be undone by .astype(...), but even though it is unlikely in my use case, a loss of precision is possible in theory. I also wonder whether the floating-point operations and type conversions could noticeably impact performance on large in-memory datasets.

Other remarks:

  • No bug when groupby(..., dropna=False), as shown above.
  • No bug when using aggregation, e.g. .sum() instead of cumsum(), as shown above.
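The .astype(...) workaround mentioned above can be sketched as follows (a minimal, hypothetical example on a small frame, not the exact data from the report; under the buggy behavior the cumsum result comes back as Float64 and is cast back afterwards):

```python
import pandas as pd

df = pd.DataFrame(
    {
        "A": pd.array([0, 0, None], dtype="Int16"),  # NA among the group labels
        "B": pd.array([1, 2, 3], dtype="Int16"),
    }
)

result = df.groupby("A")["B"].cumsum()  # Float64 under the bug described here
restored = result.astype("Int16")       # back to Int16; values must fit the dtype
```

This is only viable when the intermediate float values round-trip exactly, which is the precision concern raised above.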

Expected Behavior

See the example.

Installed Versions

INSTALLED VERSIONS

commit : 2aa155a
python : 3.10.13.final.0
python-bits : 64
OS : Linux
OS-release : 5.15.133+
Version : #1 SMP Tue Dec 19 13:14:11 UTC 2023
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : POSIX
LANG : C.UTF-8
LOCALE : None.None

pandas : 3.0.0.dev0+1020.g2aa155ae1c
numpy : 1.26.4
pytz : 2023.3.post1
dateutil : 2.9.0.post0
setuptools : 69.0.3
pip : 23.3.2
Cython : 3.0.8
pytest : 8.1.1
hypothesis : None
sphinx : None
blosc : None
feather : 0.4.1
xlsxwriter : None
lxml.etree : 5.2.1
html5lib : 1.1
pymysql : None
psycopg2 : None
jinja2 : 3.1.2
IPython : 8.20.0
pandas_datareader : 0.10.0
adbc-driver-postgresql: None
adbc-driver-sqlite : None
bs4 : 4.12.2
bottleneck : None
fastparquet : None
fsspec : 2024.2.0
gcsfs : 2024.2.0
matplotlib : 3.7.5
numba : 0.58.1
numexpr : 2.10.0
odfpy : None
openpyxl : 3.1.2
pyarrow : 15.0.2
pyreadstat : None
python-calamine : None
pyxlsb : None
s3fs : 2024.2.0
scipy : 1.11.4
sqlalchemy : 2.0.25
tables : 3.9.2
tabulate : 0.9.0
xarray : 2024.3.0
xlrd : None
zstandard : 0.22.0
tzdata : 2023.4
qtpy : None
pyqt5 : None

@avm19 avm19 added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels May 22, 2024
@rhshadrach rhshadrach added Groupby Dtype Conversions Unexpected or buggy dtype conversions Transformations e.g. cumsum, diff, rank and removed Needs Triage Issue that has not been reviewed by a pandas team member labels May 22, 2024
@rhshadrach
Member

Thanks for the report! Confirmed on main, further investigations and PRs to fix are welcome.

@luke396
Contributor

luke396 commented May 24, 2024

take

@luke396
Contributor

luke396 commented May 24, 2024

I think the issue is here:

return obj._constructor(result, index=self.obj.index, name=obj.name)

When adding dtype like:

return obj._constructor(
    result, index=self.obj.index, dtype=obj.dtype, name=obj.name
)

the result changes, but a new issue arises:

df = pd.DataFrame({'A': [1, None], 'B': [2,3]}, dtype='Int16')
obj = df.groupby('A')["B"]
obj.cumsum()
# before
# 0    2.0
# 1    NaN
# Name: B, dtype: Float64

# after
# 0    2
# 1    0
# Name: B, dtype: Int16

The reason NaN is converted to 0 is that during astype, self._data is cast from array([2., nan]) with dtype('float64') to array([2, 0], dtype=int16), as noted in the TODO about the FloatingArray NaN case:

if isinstance(dtype, BaseMaskedDtype):
    # TODO deal with NaNs for FloatingArray case
    with warnings.catch_warnings():
        warnings.filterwarnings("ignore", category=RuntimeWarning)
        # TODO: Is rounding what we want long term?
        data = self._data.astype(dtype.numpy_dtype, copy=copy)
    # mask is copied depending on whether the data was copied, and
    # not directly depending on the `copy` keyword
    mask = self._mask if data is self._data else self._mask.copy()
    cls = dtype.construct_array_type()
    return cls(data, mask, copy=False)

I think addressing the case for NaN in FloatingArray will solve this issue (and possibly others, as the TODO has been there for a long time). However, I'm not sure how to fix it. Could you offer some help or some thoughts, @rhshadrach?
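The failure mode can be reproduced in plain NumPy, independent of pandas internals (a minimal sketch of the masked-array layout: values in a `_data` array, missing-ness in a parallel boolean `_mask`; casting only `_data`, as the snippet above does, sends any unmasked NaN through an undefined float-to-int cast):

```python
import warnings

import numpy as np

data = np.array([2.0, np.nan])       # a NaN that is NOT flagged in the mask
mask = np.array([False, False])

with warnings.catch_warnings():
    warnings.filterwarnings("ignore", category=RuntimeWarning)
    # float->int cast of NaN is undefined behavior; it commonly comes out as 0
    cast = data.astype(np.int16)
```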

@rhshadrach
Member

@luke396 - I think this issue lies deeper, perhaps in pandas.groupby.ops.WrappedCythonOp._get_result_dtype. I was under the impression that with smaller dtypes (e.g. int16), the result would still be 64-bit when doing a sum and the like, to avoid overflow issues. But it looks like this is not the case, even with NumPy dtypes. @jbrockmendel, do you know if my memory is incorrect here?

s1 = pd.Series([0, 0, 1, 1], dtype='int16', name='A')
s2 = pd.Series([10, 20, 30, 60], dtype='int16', name='B')
df = pd.concat([s1, s2], axis=1)

print(df.groupby('A')['B'].sum().dtype)
# int16

print(df.groupby('A')['B'].cumsum().dtype)
# int16

@jbrockmendel
Member

IIRC, when writing WrappedCythonOp I just kept the behavior that existed at the time, so we'd have to go back to @jreback for answers on this one. I'd be OK either way: casting to 64-bit or retaining the input dtype.

@rhshadrach
Copy link
Member

rhshadrach commented Jun 3, 2024

Both NumPy and PyArrow give 64-bit integers when summing; long term we may want to agree with that, but since staying 16-bit is consistent within pandas, I think we should do that here.
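For reference, the NumPy promotion mentioned here (a minimal check; the default accumulator dtype for small signed integers is the platform's default integer, typically int64 on 64-bit Linux/macOS and int32 on Windows):

```python
import numpy as np

arr = np.array([10, 20, 30], dtype=np.int16)

# NumPy promotes int16 to the platform default integer when summing,
# unlike pandas' groupby sum, which keeps int16:
total = arr.sum()
```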

@luke396
Contributor

luke396 commented Jun 5, 2024

@rhshadrach, if I understand correctly, we currently want to keep the behavior of pandas.groupby.ops.WrappedCythonOp._get_result_dtype, which maintains the 16-bit dtype.

Regarding the issue within _grouper._cython_operation, I found some potentially useful details. In maybe_downcast_to_dtype and maybe_downcast_numeric, the result includes NaN values. Notably, _cython_operation does not perform any astype conversion in this scenario, unlike the sum or dropna=False cases, which use the following code for type conversion.

if (
    issubclass(result.dtype.type, (np.object_, np.number))
    and notna(result).all()
):
    new_result = trans(result).astype(dtype)
    if new_result.dtype.kind == "O" or result.dtype.kind == "O":
        # np.allclose may raise TypeError on object-dtype
        if (new_result == result).all():
            return new_result
    else:
        if np.allclose(new_result, result, rtol=0):
            return new_result

In the original design, did we consider this scenario elsewhere, or should we modify/add code to address this situation?

@rhshadrach
Member

I think the offending code may be here:

if lab < 0:
    continue

When using a mask (uses_mask=True), should we be setting result_mask to True for these entries (and out to 0 for deterministic results)? Doing this, I get Int16 as a result for cumsum, but I haven't looked to see if this might cause any other issues.
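A pure-Python sketch of that proposal (illustrative only, not the actual Cython; the function name and signature here are made up to mirror the group_cumsum inner loop): rows with label -1 (the dropped NA group) are masked out and zeroed instead of skipped, so the masked path never needs a float cast.

```python
import numpy as np

def masked_group_cumsum(values, labels, ngroups):
    """Sketch of a masked cumulative sum over group labels.

    Rows whose label is negative (dropped NA group) get result_mask=True
    and out=0 for deterministic results, per the proposal above.
    """
    out = np.zeros_like(values)                      # keeps the input dtype
    result_mask = np.zeros(len(values), dtype=bool)
    accum = np.zeros(ngroups, dtype=values.dtype)
    for i, lab in enumerate(labels):
        if lab < 0:
            result_mask[i] = True                    # mark as missing...
            out[i] = 0                               # ...and zero for determinism
            continue
        accum[lab] += values[i]
        out[i] = accum[lab]
    return out, result_mask

out, mask = masked_group_cumsum(
    np.array([10, 20, 30], dtype=np.int16), [0, 0, -1], ngroups=1
)
# out stays int16; the dropped row is masked rather than turned into NaN
```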

Another place that is suspect is:

if how in ["var", "mean"] or (
    self.kind == "transform" and self.has_dropped_na
):
    # has_dropped_na check need for test_null_group_str_transformer
    # result may still include NaN, so we have to cast
    values = ensure_float64(values)

The conversion to float is necessary when dealing with NumPy dtypes, but not with NumPy-nullable dtypes (once you implement the aforementioned change). This has implications for the precision of the result.
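The precision implication can be seen directly (a minimal example: float64 has a 53-bit significand, so int64 values above 2**53 do not survive a round trip through float):

```python
import numpy as np

big = 2**53 + 1                    # exactly representable as int64...
roundtrip = int(np.float64(big))   # ...but not as float64: it rounds to 2**53
```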

@luke396
Contributor

luke396 commented Jun 15, 2024

Thanks @rhshadrach! I couldn’t have finished the PR without your help!
