BUG: DataFrame.agg({'col': 'size'}) not working #16405

Closed
chmp opened this Issue May 21, 2017 · 4 comments

Comments

Projects
None yet
4 participants
@chmp

chmp commented May 21, 2017

size aggregation does not seem to work properly when used with ungrouped dataframes. When replacing size with count the examples run through. However, this change null semantics, counting nulls (size) in contrast to skipping nulls (count). The examples below are for Pandas version 0.20.1.

When used with .groupby(...), size aggregation works as expected:

>>> pd.DataFrame({'g': [0, 0, 1], 'v': [1, 2, None]}).groupby('g').agg({'v': 'size'})
   v
g   
0  2
1  1

Without .groupby, it raises an recursion error:

>>> pd.DataFrame({'g': [0, 0, 1], 'v': [1, 2, None]}).agg({'v': 'size'})
---------------------------------------------------------------------------
RecursionError                            Traceback (most recent call last)
<ipython-input-15-5fc5c38cdb5f> in <module>()
----> 1 pd.DataFrame({'g': [0, 0, 1], 'v': [1, 2, 3]}).agg({'v': 'size'})

/Volumes/Home/venvs/py3-data-science/lib/python3.5/site-packages/pandas/core/frame.py in aggregate(self, func, axis, *args, **kwargs)
   4250                 pass
   4251         if result is None:
-> 4252             return self.apply(func, axis=axis, args=args, **kwargs)
   4253         return result
   4254 

/Volumes/Home/venvs/py3-data-science/lib/python3.5/site-packages/pandas/core/frame.py in apply(self, func, axis, broadcast, raw, reduce, args, **kwds)
   4322         # dispatch to agg
   4323         if axis == 0 and isinstance(func, (list, dict)):
-> 4324             return self.aggregate(func, axis=axis, *args, **kwds)
   4325 
   4326         if len(self.columns) == 0 and len(self.index) == 0:

... last 2 frames repeated, from the frame below ...

/Volumes/Home/venvs/py3-data-science/lib/python3.5/site-packages/pandas/core/frame.py in aggregate(self, func, axis, *args, **kwargs)
   4250                 pass
   4251         if result is None:
-> 4252             return self.apply(func, axis=axis, args=args, **kwargs)
   4253         return result
   4254 

RecursionError: maximum recursion depth exceeded
@jreback

This comment has been minimized.

Show comment
Hide comment
@jreback

jreback May 21, 2017

Contributor

hmm, that doesn't seem friendly. I think the issue is that size needs to be special cases in .agg because its a property on a Series/DataFarme and a function in .groupby.

Contributor

jreback commented May 21, 2017

hmm, that doesn't seem friendly. I think the issue is that size needs to be special cases in .agg because its a property on a Series/DataFarme and a function in .groupby.

@jreback jreback added this to the Next Major Release milestone May 21, 2017

@jreback jreback changed the title from DataFrame.agg({'col': 'size'}) not working to BUG: DataFrame.agg({'col': 'size'}) not working May 21, 2017

@pvomelveny

This comment has been minimized.

Show comment
Hide comment
@pvomelveny

pvomelveny May 22, 2017

Contributor

Hey there, I started looking into this as part of the PyCon sprint.

Using the example:

>>> pd.DataFrame({'g': [0, 0, 1], 'v': [1, 2, None]}).agg({'v': 'size'})

I tracked down where the error that gets passed by is. In pandas/core/base.py:

def _try_aggregate_string_function(self, arg, *args, **kwargs):
        """
        if arg is a string, then try to operate on it:
        - try to find a function on ourselves
        - try to find a numpy function
        - raise
        """
        assert isinstance(arg, compat.string_types)

        f = getattr(self, arg, None)
        if f is not None:
            return f(*args, **kwargs)

        f = getattr(np, arg, None)
        if f is not None:
            return f(self, *args, **kwargs)

        raise ValueError("{} is an unknown string function".format(arg))

The arg in this case is 'size', which is an actual attribute of the Series, just not, as expected, a callable one. A TypeError then bubbles up back to the original pandas/core/frame.py's aggregate, which is caught and passed over. Frame.aggregate then calls Frame.apply, which immediately deploys back to Frame.aggregate (and thus, our eventual recursion error):

 def apply(self, func, axis=0, broadcast=False, raw=False, reduce=None,
              args=(), **kwds):
        """
        Applies function along input axis of DataFrame.

        ...

        """
        axis = self._get_axis_number(axis)
        ignore_failures = kwds.pop('ignore_failures', False)

        # dispatch to agg
        if axis == 0 and isinstance(func, (list, dict)):
            return self.aggregate(func, axis=axis, *args, **kwds)

        ...

It seems like a good place to cut this off may be either _try_aggregate_string_function in base.py or in aggregate in series.py. Any executive decisions on this @TomAugspurger ? Left entierly to my own devices I'd probably do something along the lines:

def _try_aggregate_string_function(self, arg, *args, **kwargs):
        """
        if arg is a string, then try to operate on it:
        - try to find a function on ourselves
        - try to find a numpy function
        - raise
        """
        assert isinstance(arg, compat.string_types)

        f = getattr(self, arg, None)
        if f is not None:
            return f(*args, **kwargs) if callable(f) else f  # Do people still hate the ternary operator?

        ...

But since this is actually in the base class, there could be some unforeseen consequences of this I can't see? (And barring those consequences, should this should just be restricted to calls of 'size'? Otherwise people could do potentially-unexpected-but-also-kinda-cool things like call df.agg({v:'ndim', w:'name', x:'data'}))

Appreciate any thoughts, thanks!

Contributor

pvomelveny commented May 22, 2017

Hey there, I started looking into this as part of the PyCon sprint.

Using the example:

>>> pd.DataFrame({'g': [0, 0, 1], 'v': [1, 2, None]}).agg({'v': 'size'})

I tracked down where the error that gets passed by is. In pandas/core/base.py:

def _try_aggregate_string_function(self, arg, *args, **kwargs):
        """
        if arg is a string, then try to operate on it:
        - try to find a function on ourselves
        - try to find a numpy function
        - raise
        """
        assert isinstance(arg, compat.string_types)

        f = getattr(self, arg, None)
        if f is not None:
            return f(*args, **kwargs)

        f = getattr(np, arg, None)
        if f is not None:
            return f(self, *args, **kwargs)

        raise ValueError("{} is an unknown string function".format(arg))

The arg in this case is 'size', which is an actual attribute of the Series, just not, as expected, a callable one. A TypeError then bubbles up back to the original pandas/core/frame.py's aggregate, which is caught and passed over. Frame.aggregate then calls Frame.apply, which immediately deploys back to Frame.aggregate (and thus, our eventual recursion error):

 def apply(self, func, axis=0, broadcast=False, raw=False, reduce=None,
              args=(), **kwds):
        """
        Applies function along input axis of DataFrame.

        ...

        """
        axis = self._get_axis_number(axis)
        ignore_failures = kwds.pop('ignore_failures', False)

        # dispatch to agg
        if axis == 0 and isinstance(func, (list, dict)):
            return self.aggregate(func, axis=axis, *args, **kwds)

        ...

It seems like a good place to cut this off may be either _try_aggregate_string_function in base.py or in aggregate in series.py. Any executive decisions on this @TomAugspurger ? Left entierly to my own devices I'd probably do something along the lines:

def _try_aggregate_string_function(self, arg, *args, **kwargs):
        """
        if arg is a string, then try to operate on it:
        - try to find a function on ourselves
        - try to find a numpy function
        - raise
        """
        assert isinstance(arg, compat.string_types)

        f = getattr(self, arg, None)
        if f is not None:
            return f(*args, **kwargs) if callable(f) else f  # Do people still hate the ternary operator?

        ...

But since this is actually in the base class, there could be some unforeseen consequences of this I can't see? (And barring those consequences, should this should just be restricted to calls of 'size'? Otherwise people could do potentially-unexpected-but-also-kinda-cool things like call df.agg({v:'ndim', w:'name', x:'data'}))

Appreciate any thoughts, thanks!

@jreback

This comment has been minimized.

Show comment
Hide comment
@jreback

jreback May 22, 2017

Contributor

@pvomelveny what you did is reasonable, but let's write it out (yes hate the ternary operator :>)

maybe

if f is not None:
    if callable(f):
        return f(*args, **kwargs)
   assert len(args) == 0 and len(kwargs) == 0
   return f

and see if things pass (and add a comment about why doing this).

Contributor

jreback commented May 22, 2017

@pvomelveny what you did is reasonable, but let's write it out (yes hate the ternary operator :>)

maybe

if f is not None:
    if callable(f):
        return f(*args, **kwargs)
   assert len(args) == 0 and len(kwargs) == 0
   return f

and see if things pass (and add a comment about why doing this).

@TomAugspurger

This comment has been minimized.

Show comment
Hide comment
@TomAugspurger

TomAugspurger May 23, 2017

Contributor

@pvomelveny can you submit a pull request with those changes? I'm curious to see if the tests pass.

Contributor

TomAugspurger commented May 23, 2017

@pvomelveny can you submit a pull request with those changes? I'm curious to see if the tests pass.

@jreback jreback closed this in #16458 Jun 1, 2017

@jreback jreback modified the milestones: 0.20.2, Next Major Release Jun 1, 2017

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment