
REGR: cumsum regression with groupby call to agg #31802

Closed
mattharrison opened this issue Feb 8, 2020 · 26 comments · Fixed by #32561
Labels
Groupby · Numeric Operations · Regression · Reshaping
Milestone
1.0.2

Comments

@mattharrison

mattharrison commented Feb 8, 2020

Code Sample, a copy-pastable example if possible

I want to define a custom function that I can pass to the agg method. It uses the cumsum method, which has recently become problematic inside agg.

import pandas as pd

def max_test(s):
    return s.cumsum().max()
    #return s.max()

dummy_data = pd.DataFrame(
    {'AIRLINE': {0: 'WN', 1: 'UA', 2: 'MQ', 3: 'AA', 4: 'WN'},
     'ORG_AIR': {0: 'LAX', 1: 'DEN', 2: 'DFW', 3: 'DFW', 4: 'LAX'},
     'DIST': {0: 590, 1: 1452, 2: 641, 3: 1192, 4: 1363}})

gb = dummy_data.groupby(['AIRLINE', 'ORG_AIR'])

result = gb.agg(
    #'max' 
    max_test
)

print(result)

Problem description

Prior to pandas 1.0rc this worked. It now raises an exception:

$ python /tmp/regpandas.py
Traceback (most recent call last):
  File "/tmp/regpandas.py", line 16, in <module>
    max_test
  File "/Users/matt/.env/pandas1/lib/python3.7/site-packages/pandas/core/groupby/generic.py", line 948, in aggregate
    return self._python_agg_general(func, *args, **kwargs)
  File "/Users/matt/.env/pandas1/lib/python3.7/site-packages/pandas/core/groupby/groupby.py", line 936, in _python_agg_general
    result, counts = self.grouper.agg_series(obj, f)
  File "/Users/matt/.env/pandas1/lib/python3.7/site-packages/pandas/core/groupby/ops.py", line 641, in agg_series
    return self._aggregate_series_fast(obj, func)
  File "/Users/matt/.env/pandas1/lib/python3.7/site-packages/pandas/core/groupby/ops.py", line 666, in _aggregate_series_fast
    result, counts = grouper.get_result()
  File "pandas/_libs/reduction.pyx", line 376, in pandas._libs.reduction.SeriesGrouper.get_result
  File "pandas/_libs/reduction.pyx", line 193, in pandas._libs.reduction._BaseGrouper._apply_to_group
  File "/Users/matt/.env/pandas1/lib/python3.7/site-packages/pandas/core/groupby/groupby.py", line 913, in <lambda>
    f = lambda x: func(x, *args, **kwargs)
  File "/tmp/regpandas.py", line 4, in max_test
    return s.cumsum().max()
  File "/Users/matt/.env/pandas1/lib/python3.7/site-packages/pandas/core/generic.py", line 11331, in cum_func
    result = self._data.apply(na_accum_func)
  File "/Users/matt/.env/pandas1/lib/python3.7/site-packages/pandas/core/internals/managers.py", line 440, in apply
    applied = b.apply(f, **kwargs)
  File "/Users/matt/.env/pandas1/lib/python3.7/site-packages/pandas/core/internals/blocks.py", line 403, in apply
    result = self.make_block(values=_block_shape(result, ndim=self.ndim))
  File "/Users/matt/.env/pandas1/lib/python3.7/site-packages/pandas/core/internals/blocks.py", line 273, in make_block
    return make_block(values, placement=placement, ndim=self.ndim)
  File "/Users/matt/.env/pandas1/lib/python3.7/site-packages/pandas/core/internals/blocks.py", line 3041, in make_block
    return klass(values, ndim=ndim, placement=placement)
  File "/Users/matt/.env/pandas1/lib/python3.7/site-packages/pandas/core/internals/blocks.py", line 125, in __init__
    f"Wrong number of items passed {len(self.values)}, "
ValueError: Wrong number of items passed 2, placement implies 1

Expected Output

$ python /tmp/regpandas.py
                 DIST
AIRLINE ORG_AIR
AA      DFW      1192
MQ      DFW       641
UA      DEN      1452
WN      LAX      1953

Output of pd.show_versions()

>>> pd.show_versions()

INSTALLED VERSIONS
------------------
commit           : None
python           : 3.7.3.final.0
python-bits      : 64
OS               : Darwin
OS-release       : 18.6.0
machine          : x86_64
processor        : i386
byteorder        : little
LC_ALL           : None
LANG             : en_US.UTF-8
LOCALE           : en_US.UTF-8

pandas           : 1.0.1
numpy            : 1.18.1
pytz             : 2019.3
dateutil         : 2.8.1
pip              : 19.0.3
setuptools       : 40.8.0
Cython           : None
pytest           : None
hypothesis       : None
sphinx           : None
blosc            : None
feather          : None
xlsxwriter       : None
lxml.etree       : None
html5lib         : None
pymysql          : None
psycopg2         : None
jinja2           : 2.10.3
IPython          : 7.11.1
pandas_datareader: None
bs4              : None
bottleneck       : None
fastparquet      : None
gcsfs            : None
lxml.etree       : None
matplotlib       : 3.1.2
numexpr          : 2.7.1
odfpy            : None
openpyxl         : None
pandas_gbq       : None
pyarrow          : None
pytables         : None
pytest           : None
pyxlsb           : None
s3fs             : None
scipy            : 1.4.1
sqlalchemy       : 1.3.13
tables           : 3.6.1
tabulate         : None
xarray           : None
xlrd             : None
xlwt             : None
xlsxwriter       : None
numba            : None
@jorisvandenbossche jorisvandenbossche added this to the 1.0.2 milestone Feb 8, 2020
@jorisvandenbossche jorisvandenbossche added the Regression Functionality that used to work in a prior pandas version label Feb 8, 2020
@jorisvandenbossche
Member

@mattharrison Thanks for the clear report! I can confirm this is a regression.

cc @jbrockmendel this is another example of something failing in libreduction with a different error than the specific one that is caught:

try:
    return self._aggregate_series_fast(obj, func)
except ValueError as err:
    if "Function does not reduce" in str(err):
        # raised in libreduction
        pass
    else:
        raise
return self._aggregate_series_pure_python(obj, func)

@jorisvandenbossche jorisvandenbossche changed the title cumsum regression with groupby call to agg REGR: cumsum regression with groupby call to agg Feb 8, 2020
@jorisvandenbossche
Member

@jbrockmendel can you take a look?

@jbrockmendel
Member

We expect cumsum().max() to be a reduction, but when we go through libreduction, on the last group it doesn't reduce. This suggests that we could catch this within libreduction and re-raise as ValueError("Function does not reduce").
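
A rough sketch of that re-raise (hypothetical pure-Python shape only; the real code lives in Cython in pandas/_libs/reduction.pyx):

import numpy as np

# Hypothetical sketch, not the actual libreduction code: after applying the
# UDF to a group, treat any result with more than one element as a failure
# to reduce, and re-raise with the message that agg_series already catches.
def apply_and_check(func, group):
    res = func(group)
    arr = np.asarray(res)
    if arr.ndim > 0 and arr.size > 1:
        raise ValueError("Function does not reduce")
    return res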

@jorisvandenbossche
Member

We expect cumsum().max() to be a reduction, but when we go through libreduction, on the last group it doesn't reduce.

Can you explain this a bit more? The function is doing a max(), so I would expect this to always be a reduction?

@jbrockmendel
Member

Can you explain this a bit more? The function is doing a max(), so I would expect this to always be a reduction?

I, too, expect it to be a reduction.

I added a print(result, getattr(result, "shape", None)) in Block.apply just before the call to _split_op_result, and then ran the example here:

import pandas as pd

def max_test(s):
    return s.cumsum().max()

df = pd.DataFrame(
    {'AIRLINE': {0: 'WN', 1: 'UA', 2: 'MQ', 3: 'AA', 4: 'WN'},
     'ORG_AIR': {0: 'LAX', 1: 'DEN', 2: 'DFW', 3: 'DFW', 4: 'LAX'},
     'DIST': {0: 590, 1: 1452, 2: 641, 3: 1192, 4: 1363}})

gb = df.groupby(['AIRLINE', 'ORG_AIR'])

>>> result = gb.agg(max_test)
[] (0,)
[1192] (1,)
[641] (1,)
[1452] (1,)
[ 590 1953] (2,)
Traceback (most recent call last):
[...]

so maybe it is reducing, just along the wrong axis?

@TomAugspurger
Contributor

A bit simpler reproducer:

In [43]: df = pd.DataFrame({"A": ['a', 'a', 'a'], "B": ['a', 'b', 'b'], "C": [1, 1, 1]})

In [44]: def f(x):
    ...:     assert len(x) == len(x._data.blocks[0].mgr_locs)
    ...:     return 0
    ...:

In [45]: df.groupby(["A", "B"]).agg(f)
---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
<ipython-input-45-6faf28de6e4f> in <module>
----> 1 df.groupby(["A", "B"]).agg(f)

~/sandbox/pandas/pandas/core/groupby/generic.py in aggregate(self, func, *args, **kwargs)
    937             # grouper specific aggregations
    938             if self.grouper.nkeys > 1:
--> 939                 return self._python_agg_general(func, *args, **kwargs)
    940             elif args or kwargs:
    941                 result = self._aggregate_frame(func, *args, **kwargs)

~/sandbox/pandas/pandas/core/groupby/groupby.py in _python_agg_general(self, func, *args, **kwargs)
    924             try:
    925                 # if this function is invalid for this dtype, we will ignore it.
--> 926                 result, counts = self.grouper.agg_series(obj, f)
    927             except TypeError:
    928                 continue

~/sandbox/pandas/pandas/core/groupby/ops.py in agg_series(self, obj, func)
    638
    639         try:
--> 640             return self._aggregate_series_fast(obj, func)
    641         except ValueError as err:
    642             if "Function does not reduce" in str(err):

~/sandbox/pandas/pandas/core/groupby/ops.py in _aggregate_series_fast(self, obj, func)
    663         group_index = algorithms.take_nd(group_index, indexer, allow_fill=False)
    664         grouper = libreduction.SeriesGrouper(obj, func, group_index, ngroups, dummy)
--> 665         result, counts = grouper.get_result()
    666         return result, counts
    667

~/sandbox/pandas/pandas/_libs/reduction.pyx in pandas._libs.reduction.SeriesGrouper.get_result()
    373                         cached_typ, cached_ityp, islider, vslider)
    374
--> 375                     res, initialized = self._apply_to_group(cached_typ, cached_ityp,
    376                                                             islider, vslider,
    377                                                             group_size, initialized)

~/sandbox/pandas/pandas/_libs/reduction.pyx in pandas._libs.reduction._BaseGrouper._apply_to_group()
    191         """
    192         cached_ityp._engine.clear_mapping()
--> 193         res = self.f(cached_typ)
    194         res = _extract_result(res)
    195         if not initialized:

~/sandbox/pandas/pandas/core/groupby/groupby.py in <lambda>(x)
    911     def _python_agg_general(self, func, *args, **kwargs):
    912         func = self._is_builtin_func(func)
--> 913         f = lambda x: func(x, *args, **kwargs)
    914
    915         # iterate through "columns" ex exclusions to populate output dict

<ipython-input-44-5151b7117ae2> in f(x)
      1 def f(x):
----> 2     assert len(x) == len(x._data.blocks[0].mgr_locs)
      3     return 0
      4

AssertionError:

I think the issue is roughly somewhere around

object.__setattr__(cached_typ._data._block, 'values', vslider.buf)

We update the Block's values, but don't adjust mgr_locs / placement. Anything that creates a new block from a function applied to those values will fail. In this case, we're making a Series from the result of the cumsum.
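
A minimal illustration of that invariant, poking at private pandas 1.0-era internals (a sketch, not public API):

import numpy as np
from pandas.core.internals import make_block  # private API

# A Block's placement must agree with the number of "items" (rows) in its
# 2-D values; breaking that raises the exact error from the traceback above.
make_block(np.array([[590, 1953]]), placement=[0])  # OK: 1 item, placement implies 1
make_block(np.ones((2, 2)), placement=[0])
# ValueError: Wrong number of items passed 2, placement implies 1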

I'm not 100% sure what the solution is. It seems weird to mutate the mgr_locs in libreduction. Do you have any thoughts on that @jbrockmendel?

@jbrockmendel
Member

It seems weird to mutate the mgr_locs in libreduction

That can only end in tears. I'd be inclined to fail over to the non-fastpath pretty aggressively.

@TomAugspurger
Contributor

Yeah, that's my preference as well. By "weird" I probably should have said "crazy" :)

I'll see if I can get a somewhat targeted check for when we need to bail out.

@TomAugspurger
Contributor

Right now I'm pessimistic about a "targeted check" being possible. It seems like libreduction.SeriesGrouper is fundamentally at odds with user-defined callables. We're creating invalid Series / DataFrame objects, so anything relying on the violated invariants may hit arbitrary exceptions.

I think our only two choices are to

  1. Change the except ValueError in agg_series back to except Exception
  2. Just always call aggregate_series_pure_python for UDFs, skipping aggregate_series_fast / libreduction (a rough sketch of this option follows below).
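
A rough sketch of option 2 (assumed shape only; _is_udf here is a hypothetical helper, not an existing pandas function):

def agg_series(self, obj, func):
    # _is_udf: hypothetical check, e.g. "a callable that isn't one of the
    # known builtin/cython aggregations"
    if _is_udf(func):
        # skip libreduction entirely for user-defined functions
        return self._aggregate_series_pure_python(obj, func)
    try:
        return self._aggregate_series_fast(obj, func)
    except ValueError as err:
        if "Function does not reduce" in str(err):
            pass  # raised in libreduction
        else:
            raise
    return self._aggregate_series_pure_python(obj, func)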

@TomAugspurger
Contributor

I suppose the 3rd choice is to continue to play whack-a-mole on exceptions / messages in

try:
    return self._aggregate_series_fast(obj, func)
except ValueError as err:
    if "Function does not reduce" in str(err):
        # raised in libreduction
        pass
    else:
        raise
return self._aggregate_series_pure_python(obj, func)

That's probably the most conservative change...

TomAugspurger added a commit to TomAugspurger/pandas that referenced this issue Mar 9, 2020
Closes pandas-dev#31802

This "fixes" pandas-dev#31802 by expanding the number of cases where we swallow an
exception in libreduction. Currently, we're creating an invalid Series
in SeriesBinGrouper where the `.mgr_locs` doesn't match the values. See
pandas-dev#31802 (comment)
for more.

For now, we simply catch more cases that fall back to Python. I've gone
with a minimal change which addresses only issues hitting this exact
exception. We might want to go broader, but that's not clear.
@jbrockmendel
Member

  1. Just always call aggregate_series_pure_python for UDFs, skipping aggregate_series_fast / libreduction.

Maintenance-wise, this is by far the most appealing option, xref #32086. As of that PR, the performance penalty (as measured by our most-affected asv) of disabling fast_apply was 48x. Some perf PRs since then have gotten that down to 7x. I'm pretty sure I can shave that down to 5x without too much trouble.

Leaving aside the issue of one asv not being a representative sample, the question is how big a perf penalty are we willing to accept to get rid of this hassle?

@jbrockmendel
Member

object.__setattr__(cached_typ._data._block, 'values', vslider.buf)

I'm a little confused here. How are we getting to a point where vslider.buf has the wrong shape?

@jorisvandenbossche
Member

Maintenance-wise, what's the difference between a try/except around the fast version (with a generic except for UDFs) that falls back to the pure-Python version, versus an if/else that only runs the pure-Python version for UDFs?

@jbrockmendel
Member

What's the difference maintenance wise between try/except the fast version [...] (with a generic except in case of UDFs)

  1. Non-surfaced bugs
  2. The cython code path is very difficult to grok, not newcomer-friendly. Among other things it messes with ndarray internals, which is kinda sketchy.
  3. We need to worry about fast/slow versions having matching behavior, similar to the failure in POC/REF: remove SeriesBinGrouper, SeriesGrouper #32083. (I'm pretty sure the libreduction._extract_array behavior/usage is subtly different from what we have in the non-cython paths.)

@jbrockmendel
Member

we could use something like the raw keyword to let users opt in to a fastpath that operates only on the underlying arrays (I think we do this for DataFrame.apply). If we wanted to get fancy about it, we could look at annotations on a UDF and see if it is fastpath-friendly.
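
For reference, DataFrame.apply already exposes this pattern via raw=True, which hands the callable a plain ndarray per column instead of a Series:

import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})

# raw=True: the callable receives each column as a plain ndarray, so no
# per-column Series construction happens on the fast path.
print(df.apply(lambda arr: arr.max(), raw=True))

# raw=False (the default): the callable receives a full Series, index and all.
print(df.apply(lambda col: col.idxmax()))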

@jorisvandenbossche
Member

But don't we need the cython code anyway? Or would we be able to remove quite a bit of it if it didn't need to support UDFs?

@jbrockmendel
Member

But don't we need the cython code anyway? Or would we be able to remove quite a bit of it if it didn't need to support UDFs?

If we got rid of fast_apply, we could rip out 136 lines of cython and about 60 more of non-cython, see #32086.

Also it looks like fast/slow paths give different behavior for tshift in tests.groupby.test_whitelist

@jbrockmendel
Member

@TomAugspurger after having looked at it more closely, updating mgr_locs is viable. Adding cached_typ._data._block.mgr_locs = slice(len(vslider.buf)) to _update_cached_objs in libreduction gives us back a valid Series.

@TomAugspurger
Contributor

Hmmm OK. Let's dig this hole a little deeper then...

@jbrockmendel
Member

the question is how big a perf penalty are we willing to accept to get rid of this hassle?

Any thoughts here? I've got it down to 4.5x locally

@TomAugspurger
Contributor

4.5x seems a bit rough (is that across the board, or just specific cases?)

I think I'd prefer to see something like what you proposed earlier:

we could use something like the raw keyword to let users opt in to a fastpath to operate only on the underlying arrays

It seems safeish to be doing these hacks on plain ndarrays. I think most of the fragility comes from trying to operate on more complex objects like Series and DataFrame. If we could ensure that the callable only operates on ndarrays (via an opt-in keyword like raw), then the perf / maintenance tradeoff seems more palatable.

@jbrockmendel
Member

If we could ensure that the callable is only operating on ndarrays

This seems like a good tradeoff.

4.5x seems a bit rough (is that across the board, or just specific cases?)

A few weeks ago I ran the asvs with fast_apply disabled and chose the one that was most affected; I've been using that for comparison in optimization PRs since then. Most of the slowdown comes from slicing, because we create new Series + SingleBlockManager + Index + Block objects for each of ~2k groups.
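
A quick, assumed illustration of that cost, comparing ndarray views against per-slice Series construction:

import timeit
import numpy as np
import pandas as pd

arr = np.arange(2000.0)
s = pd.Series(arr)

# Each Series slice allocates a new Series/SingleBlockManager/Index on top of
# the data, while the ndarray slice is just a view; the gap between the two
# timings is roughly the overhead the libreduction fast path tries to avoid.
t_arr = timeit.timeit(lambda: [arr[i:i + 2].max() for i in range(0, 2000, 2)], number=10)
t_ser = timeit.timeit(lambda: [s.iloc[i:i + 2].max() for i in range(0, 2000, 2)], number=10)
print(f"ndarray slices: {t_arr:.3f}s, Series slices: {t_ser:.3f}s")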

@TomAugspurger
Contributor

Do you have a sense (without looking) of whether we're doing "pandas things" (accessing the index, blocks, etc.) in those methods? I guess that determines whether the raw keyword is even an option.

@jbrockmendel
Member

whether we're doing "pandas things" (accessing the index, blocks, etc.) in those methods?

which methods are you referring to?

@TomAugspurger
Contributor

The callable being applied. If I'm thinking about this the right way, that callable would start receiving an ndarray rather than a Series.

@jbrockmendel
Member

Gotcha. Yah, passing an ndarray to the UDF would allow for a lot of simplification

@jreback jreback added Groupby Numeric Operations Arithmetic, Comparison, and Logical operations Reshaping Concat, Merge/Join, Stack/Unstack, Explode labels Mar 11, 2020