
Grouping a 2 rows DataFrame by time and columns doesn't work as expected #11185

Closed
JCalderan opened this issue Sep 24, 2015 · 3 comments
Labels
Datetime, Groupby

@JCalderan

Hi guys,

Working with pandas is great; however, I might have noticed a bug while grouping a 2-row DataFrame by time and columns:

>>> import pandas as pd
>>> import numpy as np
>>> from datetime import datetime

>>> freq = 's'
>>> t1 = np.datetime64(datetime.utcnow(), freq)
>>> index = pd.date_range(start=t1, periods=2, freq=freq)
# DatetimeIndex(['2015-09-24 08:55:27', '2015-09-24 08:55:28'], dtype='datetime64[ns]', freq='S', tz=None)
>>> df = pd.DataFrame([['A', 10], ['B', 15]], columns=['metric', 'values'], index=index)
#                     metric  values
#2015-09-24 08:55:27      A      10
#2015-09-24 08:55:28      B      15
>>> grouped = df.groupby([pd.Grouper(level=0, freq=freq), 'metric'])
# here the grouping should output something similar to the input DataFrame,
# since each row already forms its own group with respect to the groupby parameters.
>>> grouped.mean()
#                                                     values
# <pandas.tseries.resample.TimeGrouper object at ...      10
# metric                                                  15
#
# notice how the index is broken: the first index value is a TimeGrouper object,
# while the second is the name of the column used to create the second group...
# now let's add another row: a new second, a new metric
>>> df_2 = pd.DataFrame([['C', 0]], columns=df.columns, index=[df.index.shift(-1, freq)[0]])
>>> df_2 = df_2.append(df)
#                     metric  values
#2015-09-24 08:55:26      C       0
#2015-09-24 08:55:27      A      10
#2015-09-24 08:55:28      B      15
>>> grouped = df_2.groupby([pd.Grouper(level=0, freq=freq), 'metric'])
>>> grouped.mean()
#                             values
#                     metric        
#2015-09-24 08:55:26 C            0
#2015-09-24 08:55:27 A           10
#2015-09-24 08:55:28 B           15
# works as expected with 3 rows!
# let's try with 1 row:
>>> df_2.iloc[0:1].groupby([pd.Grouper(level=0, freq=freq), 'metric']).mean()
#                             values
#                     metric        
#2015-09-24 08:55:26 C            0
# works as expected too!

I have tried grouping by key instead of level, and using a different frequency for aggregating (building the DataFrame with freq='s', then aggregating with freq='T'), but the result is the same.
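
For example, grouping by key looks roughly like this (a sketch of what I mean; the 'ts' column name is only for illustration):

>>> df_key = df.reset_index().rename(columns={'index': 'ts'})  # move the DatetimeIndex into a regular column
>>> df_key.groupby([pd.Grouper(key='ts', freq=freq), 'metric']).mean()
# same broken index as above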

Did I miss something?

Please note that using the resampling API provides the expected result, but I think the grouping API should provide consistent results:

>>> df.groupby(['metric']).resample(how='mean', freq=freq)
#                             values
# metric                            
# A      2015-09-24 08:55:27      10
# B      2015-09-24 08:55:28      15

Here are the dependencies I have installed with pandas (working on Ubuntu 12.04.5 LTS):

>>> from pandas.util.print_versions import show_versions
>>> show_versions()
INSTALLED VERSIONS
------------------
commit: None
python: 2.7.10.final.0
python-bits: 64
OS: Linux
OS-release: 3.5.0-54-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: fr_FR.UTF-8

pandas: 0.16.2
nose: 1.3.7
Cython: 0.22.1
numpy: 1.9.2
scipy: 0.15.1
statsmodels: None
IPython: 3.2.0
sphinx: 1.3.1
patsy: 0.3.0
dateutil: 2.4.2
pytz: 2015.4
bottleneck: 1.0.0
tables: 3.2.0
numexpr: 2.4.3
matplotlib: 1.4.3
openpyxl: 1.8.5
xlrd: 0.9.3
xlwt: 1.0.0
xlsxwriter: 0.7.3
lxml: 3.4.4
bs4: 4.3.2
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: 1.0.5
pymysql: None
psycopg2: None
@jreback
Contributor

jreback commented Sep 24, 2015

hmm, looks like the result index is not correctly named. Will mark it as a bug. Pull requests are welcome.

@jreback jreback added this to the Next Major Release milestone Sep 24, 2015
@rinoc
Contributor

rinoc commented Sep 30, 2015

I took a shot at solving this.

It seems that the following block of code is executed in the two-row case, since match_axis_length evaluates to True:

    if (not any_callable and not all_in_columns
        and not any_arraylike and match_axis_length
            and level is None):
        keys = [com._asarray_tuplesafe(keys)]

My solution was to add another clause in the conditional that checks if any Grouper objects have been passed in.
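
Roughly, the idea looks like this (a sketch only, not necessarily the exact code of the final patch; the any_groupers name is just for illustration):

    # sketch: skip the tuple-safe fallback when any of the keys is a Grouper
    any_groupers = any(isinstance(g, Grouper) for g in keys)

    if (not any_callable and not any_groupers and not all_in_columns
        and not any_arraylike and match_axis_length
            and level is None):
        keys = [com._asarray_tuplesafe(keys)]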

@jreback jreback modified the milestones: 0.17.0, Next Major Release Oct 1, 2015
@jreback
Contributor

jreback commented Oct 2, 2015

closed by #11202
