Unexpected behaviour when grouping datetime column containing null-values, SeriesGroupby #10979

Closed
eoincondron opened this issue Sep 3, 2015 · 10 comments
Labels
Datetime Datetime data dtype Dtype Conversions Unexpected or buggy dtype conversions Groupby

@eoincondron

I found some unexpected behaviour when looking for the group minima of a datetime column containing null values. It appears that when the min method is called on a SeriesGroupBy of dtype datetime64 with null values, the values are cast to floats before the minima are computed. Consider the following:

import pandas as pd

df = pd.DataFrame({'datetime': pd.date_range('20150903', periods=4),
                   'groups': ['a', 'b']*2})
df.loc[0, 'datetime'] = pd.NaT

In [357]: df.groupby('groups').datetime.min()
Out[357]:
groups
a             NaN
b    1.441325e+18
Name: datetime, dtype: float64

The int64 representation of pd.NaT is -2^63 (the smallest int64 value), so once the values are cast it is determined to be the minimum of any group which contains it. The expected behaviour would be for null values to be ignored and the minima of the non-null values returned as datetime64 objects. Interestingly, the max method seems to work as expected:

In [367]: df.groupby('groups').datetime.max()
Out[367]:
groups
a   2015-09-05
b   2015-09-06
Name: datetime, dtype: datetime64[ns]
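
The sentinel described above can be checked directly with NumPy. This is only a minimal sketch of the integer view, not the pandas code path itself, and the variable names are illustrative. It also suggests why max is unaffected whenever a group holds at least one real timestamp: the sentinel is the smallest possible int64, so it can win a min but never a max.

```python
import numpy as np

# One missing value (NaT) and one real timestamp.
stamps = np.array(['NaT', '2015-09-05'], dtype='datetime64[ns]')

# Reinterpret the same 8 bytes per element as signed 64-bit integers.
as_int = stamps.view('int64')

nat_int = as_int[0]       # integer that encodes NaT
int_min = as_int.min()    # a plain integer min picks the NaT sentinel
int_max = as_int.max()    # a plain integer max picks the real timestamp

print(nat_int == np.iinfo(np.int64).min)
print(int_min == nat_int, int_max == as_int[1])
```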

The min method of the DataFrameGroupBy object is kind of half way between; it fails to ignore the null-values and gives pd.NaT as the min of any group which contains it but it does return the correct data type:

In [369]: df.groupby('groups').min()
Out[369]:
         datetime
groups
a             NaT
b      2015-09-04

I tried to trace the source of the error and I got as far as the call to

self.grouper.aggregate(obj.values, how='min')

where 'obj' is a (the only) set of values in self._iterate_slices. Within self.grouper.aggregate the lines

    if is_datetime_or_timedelta_dtype(values.dtype):
        values = values.view('int64')

and

    if com.is_integer_dtype(result):
        if len(result[result == tslib.iNaT]) > 0:
            result = result.astype('float64')
            result[result == tslib.iNaT] = np.nan

seem relevant. It might be worth noting that self.aggregate(lambda x: np.min(x, axis=self.axis)) has the desired output while self.aggregate(np.min) does not. Also, changing the definition of the min method to

min = _groupby_function('min', 'min', np.min, numeric_only=True)

fixes this particular problem.
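
To make the suspected mechanism concrete, here is a rough standalone sketch of the two quoted steps (view as int64, then cast to float64 when the sentinel appears in the result). It uses plain NumPy rather than pandas internals; INAT stands in for tslib.iNaT and group_min_via_int64 is a hypothetical helper, so treat it as an illustration of the idea rather than the actual implementation.

```python
import numpy as np

INAT = np.iinfo(np.int64).min  # stand-in for tslib.iNaT (assumption)

def group_min_via_int64(values):
    """Take the min of one group's datetimes through their int64 view."""
    return values.view('int64').min()

# The two groups from the example DataFrame above.
group_a = np.array(['NaT', '2015-09-05'], dtype='datetime64[ns]')
group_b = np.array(['2015-09-04', '2015-09-06'], dtype='datetime64[ns]')

result = np.array([group_min_via_int64(group_a),
                   group_min_via_int64(group_b)])

# Mirror of the second quoted snippet: an integer result containing the
# sentinel is cast to float64 and the sentinel is replaced by NaN.
if np.issubdtype(result.dtype, np.integer) and (result == INAT).any():
    result = result.astype('float64')
    result[result == INAT] = np.nan

print(result)
```

Under these assumptions group a ends up as NaN and group b as the nanosecond count for 2015-09-04 stored as a float, which is consistent with the float64 output shown at the top of the report.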

@jreback
Contributor

jreback commented Sep 3, 2015

pls do a pd.show_versions() when you report an issue.

this was fixed in 0.16.0, xref is #9311, #6620

@jreback jreback closed this as completed Sep 3, 2015
@jreback jreback added Datetime Datetime data dtype Groupby Dtype Conversions Unexpected or buggy dtype conversions labels Sep 3, 2015
@eoincondron
Author

Sorry, I forgot to show the version, but I did check that I was using the most up-to-date version, 0.16.2. Maybe one of my dependencies is old. Here is the output of pd.show_versions():

INSTALLED VERSIONS
------------------
commit: None
python: 2.7.10.final.0
python-bits: 64
OS: Linux
OS-release: 3.10.0-229.7.2.el7.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_GB.UTF-8

pandas: 0.16.2
nose: 1.3.0
Cython: 0.19.1
numpy: 1.9.2
scipy: 0.15.1
statsmodels: 0.6.0
IPython: 4.0.0
sphinx: 1.1.3
patsy: 0.3.0
dateutil: 2.4.2
pytz: 2015.4
bottleneck: None
tables: 3.2.0
numexpr: 2.4.3
matplotlib: 1.4.3
openpyxl: 2.0.2
xlrd: 0.9.2
xlwt: 0.7.5
xlsxwriter: None
lxml: 3.4.4
bs4: 4.3.2
html5lib: 0.999
httplib2: None
apiclient: None
sqlalchemy: 1.0.4
pymysql: 0.6.6.None
psycopg2: None

@jreback
Contributor

jreback commented Sep 3, 2015

ahh, this was actually a regression in 0.16.2 relative to 0.16.0. It is fixed in master in any event, so the fix will be in 0.17.0

@eoincondron
Author

Ok cool. There's a simple workaround for now. Cheers!
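
For reference, the lambda-based aggregation noted in the original report can be sketched roughly as below. This is a variation rather than the exact call from the report: it uses Series.min inside the lambda instead of np.min with an axis argument, and the name group_min is illustrative. It assumes a pandas version where Series.min skips NaT, which is the default behaviour.

```python
import pandas as pd

df = pd.DataFrame({'datetime': pd.date_range('20150903', periods=4),
                   'groups': ['a', 'b'] * 2})
df.loc[0, 'datetime'] = pd.NaT

# Aggregate each group with Series.min, which skips NaT by default,
# instead of relying on the built-in groupby min.
group_min = df.groupby('groups')['datetime'].agg(lambda s: s.min())

print(group_min)
```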

@jreback
Contributor

jreback commented Sep 3, 2015

@eoincondron np, thanks for the report. If you'd like to track down where this actually happened (xref #10980), that would be great.

@eoincondron
Author

Ok, I'm just learning how to use Git but I'll give it a try.

@jreback
Contributor

jreback commented Sep 3, 2015

contributing docs are here: http://pandas.pydata.org/pandas-docs/stable/contributing.html

nickeubank pushed a commit to nickeubank/pandas that referenced this issue Sep 29, 2015
@equialgo

equialgo commented Jul 29, 2016

This bug should be reopened because it still persists when one group has all NaT values:

df = pd.DataFrame({'datetime': pd.date_range('20150903', periods=4), 
                   'groups': ['a', 'b']*2})
df.loc[0, 'datetime'] = pd.NaT
df.loc[2, 'datetime'] = pd.NaT
df.groupby('groups').datetime.min()

which results in:

groups
a             NaN
b    1.441325e+18
Name: datetime, dtype: float64

Note that the DataFrameGroupBy handles this correctly:

df.groupby('groups')[['datetime']].min()

and gives as result:

         datetime
groups
a             NaT
b      2015-09-04
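
Building on that observation, one possible way to avoid the SeriesGroupBy path on an affected version is to aggregate through the DataFrameGroupBy and pull the column out afterwards. The sketch below assumes the DataFrameGroupBy result is reliable on the installed pandas; the names frame_min and series_min are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'datetime': pd.date_range('20150903', periods=4),
                   'groups': ['a', 'b'] * 2})
df.loc[0, 'datetime'] = pd.NaT
df.loc[2, 'datetime'] = pd.NaT   # group 'a' is now all NaT

# Selecting with a list keeps a DataFrameGroupBy; take the column afterwards.
frame_min = df.groupby('groups')[['datetime']].min()
series_min = frame_min['datetime']

print(series_min)
```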

My pandas version:

INSTALLED VERSIONS
------------------
commit: None
python: 2.7.12.final.0
python-bits: 64
OS: Darwin
OS-release: 15.5.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: en_US.UTF-8
LANG: en_US.UTF-8

pandas: 0.18.1
nose: 1.3.7
pip: 8.1.2
setuptools: 23.0.0
Cython: 0.24
numpy: 1.11.1
scipy: 0.17.1
statsmodels: 0.6.1
xarray: None
IPython: 4.2.0
sphinx: 1.4.1
patsy: 0.4.1
dateutil: 2.5.3
pytz: 2016.4
blosc: None
bottleneck: 1.1.0
tables: 3.2.2
numexpr: 2.6.0
matplotlib: 1.5.1
openpyxl: 2.3.2
xlrd: 1.0.0
xlwt: 1.1.2
xlsxwriter: 0.9.2
lxml: 3.6.0
bs4: 4.4.1
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: 1.0.13
pymysql: None
psycopg2: None
jinja2: 2.8
boto: 2.40.0
pandas_datareader: None

@jreback
Contributor

jreback commented Jul 29, 2016

@equialgo the last case you note is a dupe of #12821 and is being handled in #12992

@equialgo

@jreback Sorry about that! Thanks!
