
Grouping a 2 rows DataFrame by time and columns doesn't work as expected #11185

Closed
JCalderan opened this issue Sep 24, 2015 · 3 comments
Labels
Datetime, Groupby

@JCalderan

Hi guys,

Working with pandas is great; however, I might have noticed a bug while grouping a 2-row DataFrame by time and columns:

>>> import pandas as pd
>>> import numpy as np
>>> from datetime import datetime

>>> freq = 's'
>>> t1 = np.datetime64(datetime.utcnow(), freq)
>>> index = pd.date_range(start=t1, periods=2, freq=freq)
# DatetimeIndex(['2015-09-24 08:55:27', '2015-09-24 08:55:28'], dtype='datetime64[ns]', freq='S', tz=None)
>>> df = pd.DataFrame([['A', 10], ['B', 15]], columns=['metric', 'values'], index=index)
#                     metric  values
#2015-09-24 08:55:27      A      10
#2015-09-24 08:55:28      B      15
>>> grouped = df.groupby([pd.Grouper(level=0, freq=freq), 'metric'])
# here the grouping should output something similar to the input DataFrame,
# since each row already forms its own group with respect to the groupby parameters.
>>> grouped.mean()
#                                                     values
# <pandas.tseries.resample.TimeGrouper object at ...      10
# metric                                                  15
#
# notice how the index is broken: the first index value is a TimeGrouper object,
# while the second is the name of the column used to create the second group...
# now let's add another row: a new second, a new metric
>>> df_2 = pd.DataFrame([['C', 0]], columns=df.columns, index=[df.index.shift(-1, freq)[0]])
>>> df_2 = df_2.append(df)
#                     metric  values
#2015-09-24 08:55:26      C       0
#2015-09-24 08:55:27      A      10
#2015-09-24 08:55:28      B      15
>>> grouped = df_2.groupby([pd.Grouper(level=0, freq=freq), 'metric'])
>>> grouped.mean()
#                             values
#                     metric        
#2015-09-24 08:55:26 C            0
#2015-09-24 08:55:27 A           10
#2015-09-24 08:55:28 B           15
# works as expected with 3 rows!
# let's try with 1 row:
>>> df_2.iloc[0:1].groupby([pd.Grouper(level=0, freq=freq), 'metric']).mean()
#                             values
#                     metric        
#2015-09-24 08:55:26 C            0
# works as expected too!

I have tried grouping by key instead of level, and using a different frequency for aggregating (building the DataFrame with freq='s', then aggregating with freq='T'), but the result is the same.
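
For example, grouping by key looks roughly like this (a sketch of what I mean; the 'ts' column name is only for illustration):

>>> df_key = df.reset_index().rename(columns={'index': 'ts'})  # move the DatetimeIndex into a regular column
>>> df_key.groupby([pd.Grouper(key='ts', freq=freq), 'metric']).mean()
# same broken index as above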

Did I miss something?

Please note that using the resampling API provides the expected result, but I think the grouping API should provide consistent results:

>>> df.groupby(['metric']).resample(how='mean', freq=freq)
#                             values
# metric                            
# A      2015-09-24 08:55:27      10
# B      2015-09-24 08:55:28      15

Here are the dependencies I have installed with pandas (working on Ubuntu 12.04.5 LTS):

>>> from pandas.util.print_versions import show_versions
>>> show_versions()
INSTALLED VERSIONS
------------------
commit: None
python: 2.7.10.final.0
python-bits: 64
OS: Linux
OS-release: 3.5.0-54-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: fr_FR.UTF-8

pandas: 0.16.2
nose: 1.3.7
Cython: 0.22.1
numpy: 1.9.2
scipy: 0.15.1
statsmodels: None
IPython: 3.2.0
sphinx: 1.3.1
patsy: 0.3.0
dateutil: 2.4.2
pytz: 2015.4
bottleneck: 1.0.0
tables: 3.2.0
numexpr: 2.4.3
matplotlib: 1.4.3
openpyxl: 1.8.5
xlrd: 0.9.3
xlwt: 1.0.0
xlsxwriter: 0.7.3
lxml: 3.4.4
bs4: 4.3.2
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: 1.0.5
pymysql: None
psycopg2: None
@jreback
Contributor

jreback commented Sep 24, 2015

hmm, looks like the result index is not correctly named. Will mark it as a bug. Pull requests are welcome.

@jreback jreback added this to the Next Major Release milestone Sep 24, 2015
@rinoc
Contributor

rinoc commented Sep 30, 2015

I took a shot at solving this.

It seems that the following block of code is executed in the two-row case, since match_axis_length evaluates to True:

    if (not any_callable and not all_in_columns
        and not any_arraylike and match_axis_length
            and level is None):
        keys = [com._asarray_tuplesafe(keys)]

My solution was to add another clause in the conditional that checks if any Grouper objects have been passed in.
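
Roughly, the idea looks like this (a sketch only, not necessarily the exact code of the final patch; the any_groupers name is just for illustration):

    # sketch: skip the tuple-safe fallback when any of the keys is a Grouper
    any_groupers = any(isinstance(g, Grouper) for g in keys)

    if (not any_callable and not any_groupers and not all_in_columns
        and not any_arraylike and match_axis_length
            and level is None):
        keys = [com._asarray_tuplesafe(keys)]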

@jreback jreback modified the milestones: 0.17.0, Next Major Release Oct 1, 2015
@jreback
Contributor

jreback commented Oct 2, 2015

closed by #11202
