BUG: groupby-nunique modifies null values #31950

Closed
thomas-reineking-by opened this issue Feb 13, 2020 · 5 comments · Fixed by #32175
Labels
Bug Regression Functionality that used to work in a prior pandas version
Comments

@thomas-reineking-by

thomas-reineking-by commented Feb 13, 2020

Code Sample, a copy-pastable example if possible

import pandas as pd
import numpy as np
df = pd.DataFrame({"GROUP": 0, "VALUE": [1.0, np.nan]})
df.groupby("GROUP")["VALUE"].nunique()
print(df)

Problem description

Original dataframe is modified:

   GROUP         VALUE
0      0  1.000000e+00
1      0 -9.223372e+18

The issue seems to have been introduced in version 1.0.0; version 0.25.3 works as expected.

Expected Output

Original dataframe should not be modified.
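
The expectation above can be written as a small regression check. This is only a sketch: the `snapshot` and `result` names are illustrative and not part of the original report. It assumes that comparing the frame against a deep copy taken before the call is enough to detect the mutation.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"GROUP": 0, "VALUE": [1.0, np.nan]})
snapshot = df.copy(deep=True)  # independent copy taken before the call

result = df.groupby("GROUP")["VALUE"].nunique()

# Raises AssertionError if nunique() changed the frame; NaN values in the
# same position are treated as equal by this comparison.
pd.testing.assert_frame_equal(df, snapshot)
```

On the versions reported as affected (1.0.0 and 1.0.1) the final assertion would be expected to fail, since the NaN in `VALUE` is overwritten.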

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit : None
python : 3.6.6.final.0
python-bits : 64
OS : Linux
OS-release : 4.9.87-linuxkit-aufs
machine : x86_64
processor :
byteorder : little
LC_ALL : en_US.UTF-8
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.0.1
numpy : 1.17.2
pytz : 2019.3
dateutil : 2.8.1
pip : 19.2.3
setuptools : 41.2.0
Cython : 0.29.13
pytest : 5.2.1
hypothesis : 4.23.0
sphinx : 1.7.9
blosc : None
feather : None
xlsxwriter : 1.2.2
lxml.etree : 4.4.1
html5lib : 1.0.1
pymysql : None
psycopg2 : 2.8.2 (dt dec pq3 ext lo64)
jinja2 : 2.10.3
IPython : 7.9.0
pandas_datareader: None
bs4 : None
bottleneck : 1.2.1
fastparquet : None
gcsfs : None
lxml.etree : 4.4.1
matplotlib : 3.1.1
numexpr : 2.7.0
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 0.13.0
pytables : None
pytest : 5.2.1
pyxlsb : None
s3fs : None
scipy : 1.2.1
sqlalchemy : 1.3.11
tables : None
tabulate : 0.8.5
xarray : None
xlrd : None
xlwt : None
xlsxwriter : 1.2.2
numba : 0.45.1

@MarcoGorelli MarcoGorelli added Regression Functionality that used to work in a prior pandas version Bug labels Feb 13, 2020
@MarcoGorelli
Member

Thanks @thomas-reineking-jdas

@jorisvandenbossche jorisvandenbossche added this to the 1.0.2 milestone Feb 13, 2020
@dsaxton
Member

dsaxton commented Feb 13, 2020

I think this was caused by #27951. Seems the values get mutated here: https://github.com/pandas-dev/pandas/blob/master/pandas/core/groupby/generic.py#L596

@MarcoGorelli
Member

@dsaxton yes, seems that you're right

@MarcoGorelli
Member

MarcoGorelli commented Feb 14, 2020

I think the problem might stem from this:

>>> dtarr = np.array([pd.NaT, 1, 2, 3]); dtarr.sort(); dtarr
array([NaT, 1, 2, 3], dtype=object)

>>> dtarr = np.array([np.nan, 1, 2, 3]); dtarr.sort(); dtarr
array([ 1.,  2.,  3., nan])

Can push a fix which removes

        # GH 27951
        # temporary fix while we wait for NumPy bug 12629 to be fixed
        val[isna(val)] = np.datetime64("NaT")

but it might require raising the minimum NumPy version; looking into it
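
For illustration, the in-place assignment quoted above is what writes into the caller's data. One way to count distinct non-null values without touching the input is to filter with a boolean mask instead of overwriting the nulls. The helper below is a hypothetical sketch (the name `nunique_without_mutation` is made up here) and is not the change that was made in pandas; it only assumes a one-dimensional numeric array.

```python
import numpy as np
import pandas as pd


def nunique_without_mutation(values: np.ndarray) -> int:
    """Count distinct non-null entries without writing to `values`."""
    mask = pd.isna(values)        # True where the entry is NaN/NaT/None
    non_null = values[~mask]      # boolean indexing returns a new array
    return int(np.unique(non_null).size)


arr = np.array([1.0, np.nan, 1.0, 2.0])
count = nunique_without_mutation(arr)
# arr still holds NaN at position 1, because nothing was assigned into it.
```

Boolean-mask indexing allocates a new array, so the original buffer is never written to, at the cost of one extra copy of the non-null values.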

@MarcoGorelli MarcoGorelli self-assigned this Feb 14, 2020
@jreback
Contributor

jreback commented Feb 14, 2020

cc @jbrockmendel @WillAyd
