groupby on 2 categorical columns, when one categorical is based on datetimes, incorrectly returns all NaN dataframe #21390

Closed
rogeriomgatto opened this Issue Jun 8, 2018 · 6 comments

rogeriomgatto commented Jun 8, 2018

Code Sample, a copy-pastable example if possible

import pandas as pd
import numpy as np

df = pd.DataFrame({
    'label1': list('abcbabcba'),
    'label2': list('xyxyxyxyx'),
    'minute': list(pd.date_range('2018-06-01 00', freq='1T', periods=3)) * 3,
    'n1': np.arange(9, dtype='float'),
    'n2': np.arange(9, dtype='float') ** 2
})

# this is correct
df.groupby(['label1', 'minute'])[['n1', 'n2']].mean()

# convert to categoricals
df['label1'] = df['label1'].astype('category')
df['label2'] = df['label2'].astype('category')
df['minute'] = df['minute'].astype('category')

# this is wrong, returns all NaNs
df.groupby(['label1', 'minute'])[['n1', 'n2']].mean()

Problem description

When grouping by [str, datetime] columns, results are as expected:

>>> df.groupby(['label1', 'minute'])[['n1', 'n2']].mean()
                             n1    n2
label1 minute                        
a      2018-06-01 00:00:00  0.0   0.0
       2018-06-01 00:01:00  4.0  16.0
       2018-06-01 00:02:00  8.0  64.0
b      2018-06-01 00:00:00  3.0   9.0
       2018-06-01 00:01:00  4.0  25.0
       2018-06-01 00:02:00  5.0  25.0
c      2018-06-01 00:00:00  6.0  36.0
       2018-06-01 00:02:00  2.0   4.0

After converting label1, label2, and minute to categoricals, that same groupby returns all NaNs:

>>> df.groupby(['label1', 'minute'])[['n1', 'n2']].mean()
                            n1  n2
label1 minute                     
a      2018-06-01 00:00:00 NaN NaN
       2018-06-01 00:01:00 NaN NaN
       2018-06-01 00:02:00 NaN NaN
b      2018-06-01 00:00:00 NaN NaN
       2018-06-01 00:01:00 NaN NaN
       2018-06-01 00:02:00 NaN NaN
c      2018-06-01 00:00:00 NaN NaN
       2018-06-01 00:01:00 NaN NaN
       2018-06-01 00:02:00 NaN NaN

I only hit this bug when grouping on two categoricals where one of them is datetime-based (the order is irrelevant). Grouping by ['label1', 'label2'], or by 'minute' alone, works as expected.
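
For reference, the expected means in the first table can be cross-checked without pandas. The following is only a minimal standard-library sketch; the names `groups` and `means` are illustrative and not part of the original report.

```python
from collections import defaultdict
from datetime import datetime, timedelta

# Rebuild the relevant columns of the example frame as plain Python lists.
label1 = list("abcbabcba")
start = datetime(2018, 6, 1)
minute = [start + timedelta(minutes=i) for i in range(3)] * 3
n1 = [float(i) for i in range(9)]
n2 = [v ** 2 for v in n1]

# Collect the (n1, n2) rows that belong to each (label1, minute) group.
groups = defaultdict(list)
for key, a, b in zip(zip(label1, minute), n1, n2):
    groups[key].append((a, b))

# Mean of n1 and n2 per group, in sorted key order.
means = {
    key: (
        sum(a for a, _ in rows) / len(rows),
        sum(b for _, b in rows) / len(rows),
    )
    for key, rows in sorted(groups.items())
}

for (lab, ts), (mean_n1, mean_n2) in means.items():
    print(lab, ts, mean_n1, mean_n2)
```

Only eight (label1, minute) combinations occur in the data. The pair ('c', 00:01) has no rows, so it is presumably the only row of the categorical result where NaN would be legitimate.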

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.6.5.final.0
python-bits: 64
OS: Linux
OS-release: 4.15.0-22-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.23.0
pytest: None
pip: 10.0.1
setuptools: 39.2.0
Cython: None
numpy: 1.14.4
scipy: None
pyarrow: None
xarray: None
IPython: None
sphinx: None
patsy: None
dateutil: 2.7.3
pytz: 2018.4
blosc: None
bottleneck: None
tables: 3.4.3
numexpr: 2.6.5
feather: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: 1.0.5
lxml: None
bs4: None
html5lib: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: None
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

gfyoung commented Jun 8, 2018

Looks very similar to #21334

cc @jreback

@jorisvandenbossche jorisvandenbossche added this to the 0.23.2 milestone Jun 8, 2018

@jreback jreback modified the milestones: 0.23.2, 0.23.3 Jun 26, 2018

jorisvandenbossche commented Jun 27, 2018

This seems to boil down to a problem with reindexing with such a categorical index:

idx = pd.MultiIndex.from_product([pd.Categorical(['a', 'b', 'c']), pd.Categorical(pd.date_range("2012-01-01", periods=3, freq='H'))])
df = pd.DataFrame({'a': range(len(idx))}, index=idx)
df2 = df.iloc[[0, 1, 2, 3, 4, 5, 6, 8]]
df2.reindex(idx)

This works correctly on 0.22.0, but on master it gives:

In [23]: df2.reindex(idx)
Out[23]: 
                        a
a 2012-01-01 00:00:00 NaN
  2012-01-01 01:00:00 NaN
  2012-01-01 02:00:00 NaN
b 2012-01-01 00:00:00 NaN
  2012-01-01 01:00:00 NaN
  2012-01-01 02:00:00 NaN
c 2012-01-01 00:00:00 NaN
  2012-01-01 01:00:00 NaN
  2012-01-01 02:00:00 NaN

jorisvandenbossche commented Jun 27, 2018

cc @toobaz this seems to be related to the new MultiIndexUIntEngine

Using the above example (idx is a MultiIndex):

In [6]: idx._engine.get_indexer(idx)
Out[6]: array([-1, -1, -1, -1, -1, -1, -1, -1, -1])
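
To make the `-1` values concrete: in an indexer, `-1` conventionally marks a label that was not found, and a reindex fills such positions with NaN. The sketch below is a toy model in plain Python, not pandas's implementation. `get_indexer` and `take_with_fill` are hypothetical helpers written for this illustration, and the integers 0..2 stand in for the three hourly timestamps.

```python
import math


def get_indexer(index, target):
    """Toy model: position of each target label in index, or -1 if absent."""
    positions = {label: pos for pos, label in enumerate(index)}
    return [positions.get(label, -1) for label in target]


def take_with_fill(values, indexer):
    """Toy model: pick values by position, using NaN wherever the indexer is -1."""
    return [math.nan if pos == -1 else values[pos] for pos in indexer]


# Stand-in for idx: the product of three letters and three hours.
full_index = [(letter, hour) for letter in "abc" for hour in range(3)]

# Stand-in for df2: the same rows that the iloc call above keeps (position 7 dropped).
kept = [0, 1, 2, 3, 4, 5, 6, 8]
sub_index = [full_index[i] for i in kept]
sub_values = [float(i) for i in kept]

# With a working lookup, only the dropped label maps to -1.
healthy = get_indexer(sub_index, full_index)
expected = take_with_fill(sub_values, healthy)

# If every lookup fails, as in Out[6], every row comes back as NaN.
broken = take_with_fill(sub_values, [-1] * len(full_index))

print(healthy)
print(expected)
print(broken)
```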

jorisvandenbossche commented Jun 27, 2018

Sorry Pietro, probably a bit prematurely pointed to that :-), as in the end it is code in the MultiIndexUIntEngine that surfaces another bug. Iterating a MultiIndex (tolist) that has a categorical datetime level is broken. This was already broken in 0.22.0 and only surfaces now because the MultiIndex engine relies on it:

In [21]: list(idx)
Out[21]: 
[('a', 1325376000000000000),
 ('a', 1325379600000000000),
 ('a', 1325383200000000000),
 ('b', 1325376000000000000),
 ('b', 1325379600000000000),
 ('b', 1325383200000000000),
 ('c', 1325376000000000000),
 ('c', 1325379600000000000),
 ('c', 1325383200000000000)]

In [22]: list(idx.get_level_values(1))
Out[22]: 
[Timestamp('2012-01-01 00:00:00'),
 Timestamp('2012-01-01 01:00:00'),
 Timestamp('2012-01-01 02:00:00'),
 Timestamp('2012-01-01 00:00:00'),
 Timestamp('2012-01-01 01:00:00'),
 Timestamp('2012-01-01 02:00:00'),
 Timestamp('2012-01-01 00:00:00'),
 Timestamp('2012-01-01 01:00:00'),
 Timestamp('2012-01-01 02:00:00')]
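
The integers in Out[21] appear to be the same instants as the Timestamps in Out[22], expressed as nanoseconds since the Unix epoch. A small standard-library sketch, assuming the naive timestamps are read as UTC, shows the correspondence and why a lookup keyed on datetime values would miss the raw integers. The dictionary `level_codes` is a hypothetical stand-in for a per-level lookup, not pandas code.

```python
from datetime import datetime, timedelta, timezone

# Level values as they should appear when iterating (cf. Out[22]).
hours = [datetime(2012, 1, 1) + timedelta(hours=h) for h in range(3)]

# Raw integers that iterating the MultiIndex produced instead (cf. Out[21]).
raw_ns = [1325376000000000000, 1325379600000000000, 1325383200000000000]

# Decode nanoseconds since the epoch into naive datetimes (assumption: UTC).
decoded = [
    datetime.fromtimestamp(ns // 10**9, tz=timezone.utc).replace(tzinfo=None)
    for ns in raw_ns
]

# Hypothetical per-level lookup: level value -> integer code.
level_codes = {ts: code for code, ts in enumerate(hours)}

# The raw integers are not valid keys, the decoded datetimes are.
found_raw = [level_codes.get(ns, -1) for ns in raw_ns]
found_decoded = [level_codes.get(ts, -1) for ts in decoded]

print(decoded == hours)
print(found_raw)
print(found_decoded)
```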

toobaz commented Jun 27, 2018

“Sorry Pietro, probably a bit prematurely pointed to that :-)”

Good :-) In general, it is unlikely that bugs in the MI engine code are dtype-specific, as it entirely delegates actual lookup to single levels, and only looks for integers (codes).
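
As a rough illustration of what "only looks for integers (codes)" can mean, the sketch below maps each label to per-level codes and packs them into a single integer key. It is a simplified model with hypothetical names (`pack_codes`, `key_for`) and string stand-ins for the hourly level; the actual engine differs in its details.

```python
def pack_codes(codes, level_sizes):
    """Combine one integer code per level into a single integer key."""
    key = 0
    for code, size in zip(codes, level_sizes):
        # Number of bits needed to store codes 0..size-1 (at least one bit).
        bits = max(size - 1, 1).bit_length()
        key = (key << bits) | code
    return key


# Two levels, as in the example index (strings stand in for the timestamps).
levels = [["a", "b", "c"], ["00:00", "01:00", "02:00"]]
level_lookup = [{value: code for code, value in enumerate(lv)} for lv in levels]
level_sizes = [len(lv) for lv in levels]


def key_for(label):
    """Look up each element in its own level, then pack the resulting codes."""
    codes = [lookup[value] for lookup, value in zip(level_lookup, label)]
    return pack_codes(codes, level_sizes)


all_labels = [(x, y) for x in levels[0] for y in levels[1]]
keys = [key_for(label) for label in all_labels]
print(keys)
```

In this model the dtype of a level only matters in the per-level lookup step. Once the codes are obtained, the rest of the work is done on plain integers.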

jorisvandenbossche commented Jun 27, 2018

PR: #21657

@jreback jreback modified the milestones: 0.23.2, 0.23.3 Jun 28, 2018
