BUG: regression on master in groupby agg with ExtensionArray #29141

Closed
jorisvandenbossche opened this issue Oct 21, 2019 · 6 comments · Fixed by #29144
Labels
Apply Apply, Aggregate, Transform ExtensionArray Extending pandas with custom dtypes or arrays. Groupby Regression Functionality that used to work in a prior pandas version

Comments

@jorisvandenbossche
Member

jorisvandenbossche commented Oct 21, 2019

Example that I could make with DecimalArray:

In [1]: from pandas.tests.extension.decimal import DecimalArray, make_data 

In [2]: df = pd.DataFrame({'id': [0,0,0,1,1], 'decimals': DecimalArray(make_data()[:5])}) 

In [3]: df.groupby('id')['decimals'].agg(lambda x: x.iloc[0]) 
Out[3]: 
id
0      0.831922765262135044395108707249164581298828125
1    0.40839445887803604851029604105860926210880279...
dtype: object

On master of a few days ago, the above returned 'decimal' dtype instead of object dtype.

I found this in the geopandas test suite, where it produces invalid output that then causes an error in a follow-up operation (https://travis-ci.org/geopandas/geopandas/jobs/600859374).

This seems to be caused by #29088, and specifically the change in agg_series: https://github.com/pandas-dev/pandas/pull/29088/files#diff-8c0985a9fca770c2028bed688dfc043fR653-R666
self._aggregate_series_fast raises an "AttributeError: 'DecimalArray' object has no attribute 'flags'" error if the series is backed by an EA, and that AttributeError is no longer caught.
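To make the failure mode concrete, here is a minimal pure-Python sketch of a fast path that assumes an ndarray-like `flags` attribute, with a fallback that catches the AttributeError. All names in it (`FakeExtensionArray`, `aggregate_fast`, `aggregate_pure_python`, and this standalone `agg_series`) are hypothetical stand-ins for illustration and are not the pandas internals.

```python
# Illustrative sketch only: every class and function here is a hypothetical
# stand-in, not the actual pandas implementation.


class FakeExtensionArray:
    """List-backed container that, like DecimalArray, has no ``flags`` attribute."""

    def __init__(self, values):
        self._values = list(values)

    def __len__(self):
        return len(self._values)

    def __getitem__(self, item):
        return self._values[item]


def aggregate_fast(values, func):
    # Mimics a fast path written for NumPy arrays: reading ``flags`` raises
    # AttributeError for any container that does not define it.
    if not values.flags.f_contiguous:
        raise ValueError("expected a contiguous array")
    return func(values)


def aggregate_pure_python(values, func):
    # General path: makes no assumption about the container beyond ``func``.
    return func(values)


def agg_series(values, func):
    # Catch the AttributeError from the fast path and fall back, so that
    # containers without ``flags`` are still aggregated.
    try:
        return aggregate_fast(values, func)
    except AttributeError:
        return aggregate_pure_python(values, func)


arr = FakeExtensionArray([3, 1, 2])
first = agg_series(arr, lambda x: x[0])
```

If the `except AttributeError` clause is dropped, the same call propagates the error instead of falling back, which matches the symptom described above.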

cc @jbrockmendel

@jorisvandenbossche jorisvandenbossche added Regression Functionality that used to work in a prior pandas version ExtensionArray Extending pandas with custom dtypes or arrays. labels Oct 21, 2019
@WillAyd
Member

WillAyd commented Oct 21, 2019

Pretty strange. I don't see it in the traceback of the logs but I assume that comes from here:

if not arr.flags.f_contiguous:

@jbrockmendel
Member

I expect this is solved by #29100

@jorisvandenbossche
Member Author

It seems to solve the Decimal example, but not fully the GeoPandas one. The result is still object dtype, whereas before it was geometry dtype.

@jbrockmendel
Member

but not fully the GeoPandas one

Two ideas, not mutually exclusive:

  • send me a link and I'll add it to my groupby-follow-up todo list
  • make a (xfailed) test in e.g. test_downstream

@jorisvandenbossche
Member Author

So for the geopandas case, I didn't post the reproducer yet (I was trying to find one with Decimal), but it is:

In [1]: df = geopandas.read_file(geopandas.datasets.get_path('nybb'))

In [2]: df['id'] = [1, 1, 1, 2, 2]

In [3]: df.groupby('id')['geometry'].agg(lambda x: x.unary_union)
Out[3]: 
id
1    MULTIPOLYGON (((970217.022 145643.332, 970227....
2    MULTIPOLYGON (((981219.056 188655.316, 980940....
Name: geometry, dtype: geometry

(the above is master from a few days ago, where the result has 'geometry' dtype)

@jorisvandenbossche
Member Author

And a (somewhat artificial) reproducer with Decimal:

In [1]: from pandas.tests.extension.decimal import DecimalArray, make_data

In [2]: DecimalArray.my_sum = lambda self: np.sum(np.array(self))

In [3]: s = pd.Series(DecimalArray(make_data()[:5])) 

In [4]: s.groupby(np.array([0, 0, 1, 1, 1])).agg(lambda x: x.values.my_sum())
Out[4]: 
0    1.843483375611340902011647813
1    1.970626216878476610894210808
dtype: object
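The expectation behind these reproducers is that an aggregation returning scalars of the original type keeps the extension type rather than degrading to a generic object container. The sketch below illustrates that idea in pure Python under stated assumptions: `ToyDecimalArray` and `groupby_agg` are hypothetical names invented here, and the cast-back step is only one possible approach, not a description of how pandas does it.

```python
from decimal import Decimal

# Illustrative sketch only: ToyDecimalArray and groupby_agg are hypothetical,
# not the pandas DecimalArray or the pandas groupby machinery.


class ToyDecimalArray:
    """Minimal container of Decimal values with a cast-back constructor."""

    def __init__(self, values):
        self._values = [Decimal(v) for v in values]

    @classmethod
    def _from_sequence(cls, scalars):
        # Refuse anything that is not already a Decimal, so a failed
        # cast is visible to the caller.
        if not all(isinstance(s, Decimal) for s in scalars):
            raise TypeError("expected Decimal scalars")
        return cls(scalars)

    def __len__(self):
        return len(self._values)

    def __getitem__(self, item):
        return self._values[item]


def groupby_agg(values, keys, func):
    # Collect the values of each group, keeping first-seen key order.
    groups = {}
    for position, key in enumerate(keys):
        groups.setdefault(key, []).append(values[position])
    results = [func(group) for group in groups.values()]
    # Try to restore the original container type; fall back to a plain
    # list (the analogue of object dtype) when the results do not fit.
    try:
        return list(groups), type(values)._from_sequence(results)
    except TypeError:
        return list(groups), results


arr = ToyDecimalArray(["0.5", "1.5", "2.5", "3.5", "4.5"])
keys = [0, 0, 1, 1, 1]
labels, sums = groupby_agg(arr, keys, lambda g: sum(g, Decimal(0)))
_, counts = groupby_agg(arr, keys, len)
```

Here `sums` comes back as a `ToyDecimalArray` because every result is a `Decimal`, while `counts` stays a plain list because integers cannot be cast back.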


3 participants