BUG: Groupby transformation (cumsum) output dtype depends on whether NA is among group labels #58811
Comments
Thanks for the report! Confirmed on main, further investigations and PRs to fix are welcome.
take
I think the issue is here: pandas/pandas/core/groupby/generic.py Line 549 in 3b48b17
When adding dtype like:
    return obj._constructor(
        result, index=self.obj.index, dtype=obj.dtype, name=obj.name
    )
the result changes, but a new issue arises:
    df = pd.DataFrame({'A': [1, None], 'B': [2, 3]}, dtype='Int16')
    obj = df.groupby('A')["B"]
    obj.cumsum()
    # before
    # 0    2.0
    # 1    NaN
    # Name: B, dtype: Float64
    # after
    # 0    2
    # 1    0
    # Name: B, dtype: Int16
The reason why pandas/pandas/core/arrays/masked.py Lines 538 to 548 in 3b48b17
I think addressing the case for
@luke396 - I think this issue lies deeper, perhaps in
    s1 = pd.Series([0, 0, 1, 1], dtype='int16', name='A')
    s2 = pd.Series([10, 20, 30, 60], dtype='int16', name='B')
    df = pd.concat([s1, s2], axis=1)
    print(df.groupby('A')['B'].sum().dtype)
    # int16
    print(df.groupby('A')['B'].cumsum().dtype)
    # int16
IIRC when writing WrappedCythonOp I just kept the behavior that existed at the time, so we'd have to go back to @jreback for answers on this one. I'd be OK either way: casting to 64-bit or retaining the input dtype.
Both NumPy and PyArrow give 64-bit integers when summing; long term we may want to agree with them, but because staying 16-bit is consistent within pandas, I think we should do that for this issue.
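To make the NumPy side of that comparison concrete, here is a minimal sketch using the int16 values from the earlier comment. NumPy documents that `sum` and `cumsum` promote small integer inputs to the default platform integer, which is 64-bit on typical 64-bit Linux builds but can differ by platform and NumPy version, so the printed dtypes are not shown here. The variable names are illustrative only.

```python
import numpy as np

# Same values as column B in the comment above, stored as 16-bit integers.
values = np.array([10, 20, 30, 60], dtype=np.int16)

total = values.sum()       # reduction: promoted to the platform integer
running = values.cumsum()  # cumulative sum: promoted the same way

# The input stays int16; the results use a wider integer dtype.
print(values.dtype, total.dtype, running.dtype)
```

Passing `dtype=np.int16` to either call keeps the result at 16 bits, at the cost of possible overflow.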
@rhshadrach, if I understand correctly, we currently desire the functionality of
Regarding the issue within pandas/pandas/core/dtypes/cast.py Lines 374 to 385 in 9e7abc8
In the original design, did we consider this scenario elsewhere, or should we modify/add code to address this situation?
I think the offending code may be here: pandas/pandas/_libs/groupby.pyx Lines 401 to 402 in 629ffeb
When using a mask (
Another place that is suspect is: pandas/pandas/core/groupby/ops.py Lines 235 to 240 in 629ffeb
The conversion to float is necessary when dealing with NumPy dtypes, but not NumPy-nullable dtypes (when you implement the aforementioned change). This has implications for the precision of the result.
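The precision concern can be illustrated without pandas. The sketch below uses plain Python to show the general limit of a 64-bit float: integers above 2**53 are not all exactly representable, so a round trip through float can change the value. It shows only that general limit, not the specific pandas code path referenced above.

```python
# 2**53 + 1 is the smallest positive integer a 64-bit float cannot hold exactly.
big = 2**53 + 1

as_float = float(big)       # rounds to the nearest representable double
round_trip = int(as_float)  # converting back does not restore the original

print(big)                # 9007199254740993
print(round_trip)         # 9007199254740992
print(big == round_trip)  # False
```

For 16-bit inputs the values are far below this threshold, so the risk applies mainly to 64-bit integer data.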
Thanks @rhshadrach! I couldn’t have finished the PR without your help!
Pandas version checks
I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
Issue Description
If a None (pd.NA) is present in group labels (i.e. in the by= column), then the result of a transformation (here cumsum) changes dtype from nullable integer (Int16) to nullable float (Float64).
This is unexpected and therefore a bug. The type change can be undone by .astype(...), but although unlikely in my use case, in theory a loss of precision is possible. I also wonder if floating point operations and type conversion can have a noticeable impact on performance for large in-memory datasets.
Other remarks:
groupby(..., dropna=False), as shown above.
.sum() instead of cumsum(), as shown above.
Expected Behavior
See the example.
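A minimal sketch of how the reported dtype change could be checked, reusing the Int16 frame from the discussion thread (columns A and B, with one missing label in A). The comparison frame without a missing label and the helper name `cumsum_dtype` are assumptions added for illustration. The printed dtypes depend on the installed pandas version, so no output is shown.

```python
import pandas as pd


def cumsum_dtype(df):
    """Return the dtype of the grouped cumulative sum of column B (hypothetical helper)."""
    return df.groupby("A")["B"].cumsum().dtype


# Frame with a missing group label, as in the discussion thread.
df_with_na = pd.DataFrame({"A": [1, None], "B": [2, 3]}, dtype="Int16")
# Comparison frame without a missing label (an assumption for illustration).
df_without_na = pd.DataFrame({"A": [1, 1], "B": [2, 3]}, dtype="Int16")

print("input dtype:        ", df_with_na["B"].dtype)
print("NA among labels:    ", cumsum_dtype(df_with_na))
print("no NA among labels: ", cumsum_dtype(df_without_na))
```

If the last two lines print different dtypes, the installed version shows the behaviour described in this issue.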
Installed Versions
INSTALLED VERSIONS
commit : 2aa155a
python : 3.10.13.final.0
python-bits : 64
OS : Linux
OS-release : 5.15.133+
Version : #1 SMP Tue Dec 19 13:14:11 UTC 2023
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : POSIX
LANG : C.UTF-8
LOCALE : None.None
pandas : 3.0.0.dev0+1020.g2aa155ae1c
numpy : 1.26.4
pytz : 2023.3.post1
dateutil : 2.9.0.post0
setuptools : 69.0.3
pip : 23.3.2
Cython : 3.0.8
pytest : 8.1.1
hypothesis : None
sphinx : None
blosc : None
feather : 0.4.1
xlsxwriter : None
lxml.etree : 5.2.1
html5lib : 1.1
pymysql : None
psycopg2 : None
jinja2 : 3.1.2
IPython : 8.20.0
pandas_datareader : 0.10.0
adbc-driver-postgresql: None
adbc-driver-sqlite : None
bs4 : 4.12.2
bottleneck : None
fastparquet : None
fsspec : 2024.2.0
gcsfs : 2024.2.0
matplotlib : 3.7.5
numba : 0.58.1
numexpr : 2.10.0
odfpy : None
openpyxl : 3.1.2
pyarrow : 15.0.2
pyreadstat : None
python-calamine : None
pyxlsb : None
s3fs : 2024.2.0
scipy : 1.11.4
sqlalchemy : 2.0.25
tables : 3.9.2
tabulate : 0.9.0
xarray : 2024.3.0
xlrd : None
zstandard : 0.22.0
tzdata : 2023.4
qtpy : None
pyqt5 : None