groupby AssertionError with datetime column name #35876

Closed
2 of 3 tasks
hliatrussellinvestments opened this issue Aug 24, 2020 · 2 comments · Fixed by #35877
Labels: Index, Regression, Timeseries

@hliatrussellinvestments
  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • (optional) I have confirmed this bug exists on the master branch of pandas.


Code Sample, a copy-pastable example

import pandas as pd
import datetime

column_names = ['dimension_1', 'dt', 'value']
data = [
    ('D1', datetime.date(2020, 1, 1), 2.1),
    ('D1', datetime.date(2020, 1, 2), 4.1),
    ('D1', datetime.date(2020, 1, 3), 1.7),
]
df_stack = pd.DataFrame(data=data, columns=column_names)
df_stack['dt'] = pd.to_datetime(df_stack['dt'])
df_stack.set_index(['dimension_1', 'dt'], inplace=True)

df_pivot = df_stack.unstack(level=['dt']).transpose().droplevel(None).transpose()

agg_map = {c: 'sum' for c in df_pivot.columns}
df_pivot.groupby(level=['dimension_1']).agg(agg_map)

Problem description

When grouping on the index of a DataFrame whose column labels are datetimes (dtype 'datetime64[ns]') and aggregating with a dict, the call fails with an AssertionError raised at `assert result.name == res_name`.

AssertionError                            Traceback (most recent call last)
in
      1 # do groupby on index
      2 agg_map = {c: 'sum' for c in df_pivot.columns}
----> 3 df_pivot.groupby(level=['dimension_1']).agg(agg_map)

c:\users\hli\appdata\local\programs\python\python38-32\lib\site-packages\pandas\core\groupby\generic.py in aggregate(self, func, engine, engine_kwargs, *args, **kwargs)
947 )
948
--> 949 result, how = self._aggregate(func, *args, **kwargs)
950 if how is None:
951 return result

c:\users\hli\appdata\local\programs\python\python38-32\lib\site-packages\pandas\core\base.py in _aggregate(self, arg, *args, **kwargs)
349 keys = list(arg.keys())
350 if isinstance(obj, ABCDataFrame) and len(
--> 351 obj.columns.intersection(keys)
352 ) != len(keys):
353 cols = sorted(set(keys) - set(obj.columns.intersection(keys)))

c:\users\hli\appdata\local\programs\python\python38-32\lib\site-packages\pandas\core\indexes\datetimelike.py in intersection(self, other, sort)
701 # TODO: no tests rely on this; needed?
702 result = result._with_freq("infer")
--> 703 assert result.name == res_name
704 return result
705

Expected Output

The expected result in this example is the DataFrame itself, since it has only one row and therefore a single group. Casting the column labels to strings makes the same call succeed:

df_pivot.columns = df_pivot.columns.astype(str)
agg_map = {c: 'sum' for c in df_pivot.columns}
df_pivot.groupby(level=['dimension_1']).agg(agg_map)
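
As a possible stopgap built on the string cast above, the aggregation can be run on string labels and the datetime labels restored afterwards. This is only a sketch: the frame is rebuilt with a shorter construction than the one in the report (unstacking the 'value' column directly), and the names `as_str` and `original_columns` are illustrative.

```python
import pandas as pd

# Rebuild a small frame with datetime column labels. This is a shorter
# construction than the one in the report, used here only for illustration.
df_stack = pd.DataFrame(
    {
        "dimension_1": ["D1", "D1", "D1"],
        "dt": pd.to_datetime(["2020-01-01", "2020-01-02", "2020-01-03"]),
        "value": [2.1, 4.1, 1.7],
    }
).set_index(["dimension_1", "dt"])
df_pivot = df_stack["value"].unstack("dt")

# Aggregate on string labels, then put the original datetime labels back.
original_columns = df_pivot.columns
as_str = df_pivot.copy()
as_str.columns = original_columns.astype(str)
agg_map = {c: "sum" for c in as_str.columns}
result = as_str.groupby(level="dimension_1").agg(agg_map)
result.columns = original_columns

print(result)
```

The dict keys are iterated in column order, so reassigning `original_columns` lines up with the aggregated columns.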

Output of pd.show_versions()

INSTALLED VERSIONS

commit : f2ca0a2
python : 3.8.5.final.0
python-bits : 32
OS : Windows
OS-release : 10
Version : 10.0.18362
machine : AMD64
processor : Intel64 Family 6 Model 94 Stepping 3, GenuineIntel
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : English_United States.1252

pandas : 1.1.1
numpy : 1.19.0
pytz : 2020.1
dateutil : 2.8.1
pip : 20.2.1
setuptools : 47.1.0
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : 4.5.1
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 2.11.2
IPython : 7.17.0
pandas_datareader: None
bs4 : None
bottleneck : None
fsspec : None
fastparquet : None
gcsfs : None
matplotlib : 3.2.2
numexpr : None
odfpy : None
openpyxl : 3.0.3
pandas_gbq : None
pyarrow : None
pytables : None
pyxlsb : None
s3fs : None
scipy : 1.5.0
sqlalchemy : 1.3.18
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
numba : None

@hliatrussellinvestments added the Bug and Needs Triage labels Aug 24, 2020
@dsaxton added the Apply, Groupby, and Regression labels and removed the Bug and Needs Triage labels Aug 24, 2020
@dsaxton
Member

dsaxton commented Aug 24, 2020

Thanks @hliatrussellinvestments, can you try to make this example more minimal (there's a lot of code here that looks unrelated to the bug)?

This worked in 1.0.5 and the regression seems to be due to 29c820f.

cc @jbrockmendel

In [1]: import pandas as pd
   ...: import datetime
   ...:
   ...: print(pd.__version__)
   ...:
   ...: column_names = ['dimension_1', 'dt', 'value']
   ...: data = [
   ...:     ('D1', datetime.date(2020, 1, 1), 2.1),
   ...:     ('D1', datetime.date(2020, 1, 2), 4.1),
   ...:     ('D1', datetime.date(2020, 1, 3), 1.7),
   ...: ]
   ...: df_stack = pd.DataFrame(data=data, columns=column_names)
   ...: df_stack['dt'] = pd.to_datetime(df_stack['dt'])
   ...: df_stack = df_stack.set_index(['dimension_1', 'dt'])
   ...:
   ...: df_pivot = df_stack.unstack(level=['dt']).transpose().droplevel(None).transpose()
   ...:
   ...: agg_map = {c: 'sum' for c in df_pivot.columns}
   ...: df_pivot.groupby(level=['dimension_1']).agg(agg_map)
   ...:
1.0.5
Out[1]:
             2020-01-01  2020-01-02  2020-01-03
dimension_1
D1                  2.1         4.1         1.7
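
A more minimal groupby reproduction might look like the sketch below. It assumes that the only ingredients needed are a named DatetimeIndex as the column labels and a dict passed to `agg`; the frame and the `outcome` variable are hypothetical and were not part of the original report. The `try`/`except` is there so the snippet runs to completion on both affected and unaffected versions.

```python
import pandas as pd

# Hypothetical reduced frame: one row, two datetime-labelled columns.
columns = pd.DatetimeIndex(["2020-01-01", "2020-01-02"], name="dt")
index = pd.Index(["D1"], name="dimension_1")
df = pd.DataFrame([[2.1, 4.1]], index=index, columns=columns)

# Same dict-style aggregation as in the report.
agg_map = {c: "sum" for c in df.columns}
try:
    result = df.groupby(level="dimension_1").agg(agg_map)
    outcome = "ok"
except AssertionError:
    result = None
    outcome = "AssertionError"

print(pd.__version__, outcome)
```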

@dsaxton
Member

dsaxton commented Aug 24, 2020

I think this is essentially the problem (it's an index issue and not one with groupby per se):

import pandas as pd

print(pd.__version__)

values = [pd.Timestamp("2020-01-01"), pd.Timestamp("2020-02-01")]
idx = pd.DatetimeIndex(values, name="a")
idx.intersection(values)
1.2.0.dev0+147.g07983803b
---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
<ipython-input-1-d5a68db44b8f> in <module>
      5 values = [pd.Timestamp("2020-01-01"), pd.Timestamp("2020-02-01")]
      6 idx = pd.DatetimeIndex(values, name="a")
----> 7 idx.intersection(values)

~/opt/miniconda3/envs/pandas-dev/lib/python3.8/site-packages/pandas/core/indexes/datetimelike.py in intersection(self, other, sort)
    701                     # TODO: no tests rely on this; needed?
    702                     result = result._with_freq("infer")
--> 703             assert result.name == res_name
    704             return result
    705

AssertionError:

Related issue #35847
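
The index-level reproduction above can be turned into a small check. The sketch assumes the intended behavior is that intersecting a named DatetimeIndex with a plain list keeps the index's own name, since a list carries no name of its own; the `name_kept` flag is illustrative.

```python
import pandas as pd

values = [pd.Timestamp("2020-01-01"), pd.Timestamp("2020-02-01")]
idx = pd.DatetimeIndex(values, name="a")

# A plain list has no name, so the result is expected to keep idx's name.
try:
    result = idx.intersection(values)
    name_kept = result.name == idx.name
except AssertionError:
    # Affected versions raise inside intersection() instead of returning.
    result = None
    name_kept = False

print(pd.__version__, name_kept)
```

On a version without the regression this should report that the name was kept; on an affected version the assertion inside `intersection` is caught and the flag stays false.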

@dsaxton added the Index and Timeseries labels and removed the Apply and Groupby labels Aug 24, 2020
@jreback added this to the 1.1.2 milestone Aug 24, 2020