BUG: resample unexpectedly branching on type of `how` callable #7929

Closed
dsm054 opened this Issue Aug 4, 2014 · 11 comments

Contributor

dsm054 commented Aug 4, 2014

I was about to answer an SO question with the standard "use a lambda or functools.partial" advice for creating a new function with bound arguments, when to my surprise my example didn't work with the OP's code. After some experimentation, it turns out to be because the callable doesn't always get passed the same type of object. For example:

import numpy as np
import pandas as pd
from functools import partial

df = pd.DataFrame(np.arange(5, dtype=np.int64),
                  index=pd.DatetimeIndex(start='2014/01/01', periods=5, freq='d'))

def f(x, a=1):
    print(type(x))
    return int(isinstance(x, pd.DataFrame)) + 1000*a

After which:

>>> df.resample("M", how=f)
<class 'pandas.core.series.Series'>
               0
2014-01-31  1000
>>> df.resample("M", how=lambda x: f(x,5))
<class 'pandas.core.series.Series'>
               0
2014-01-31  5000
>>> df.resample("M", how=partial(f, a=9))
<class 'pandas.core.frame.DataFrame'>
               0
2014-01-31  9001

`how` shouldn't care about how the function was constructed, only whether it's callable, and the same type of object (whether Series or DataFrame) should be fed into `how` in every case.

Contributor

jreback commented Aug 4, 2014

groupby first calls the function on the DataFrame (to evaluate it as a single block),

then will call it series by series if that doesn't work.

I don't think this is a bug, but rather state being captured in the closure,

but you would have to step through and see.

Contributor

dsm054 commented Aug 4, 2014

What does "it doesn't work" mean here? In each case, the how function is called exactly once, and it's called with a different type of object. So it's definitely not branching on anything that function is doing.

Contributor

jreback commented Aug 4, 2014

ok, a very odd bug

so, every function has a __name__ attribute, but:

partial(f, ...) does not; this ends up raising internally and takes another path (which is wrong).

The reason for this is a convenience feature, I think, something like this:

df = DataFrame({'foo': [1, 2], 'bar': [3, 4]})
df.groupby('foo').aggregate([np.sum, lambda x: x+1])

e.g. it figures out that the name of the np.sum function is sum (duh!). The lambda's name is '<lambda>',
but a partial function doesn't have a name (not sure why)...
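To illustrate the diagnosis above: ordinary functions and lambdas carry a `__name__`, while `functools.partial` instances do not (they are class instances, not function objects), which is presumably what sends them down the other code path. A quick check, including a possible workaround (CPython partial objects do allow attribute assignment, though that is an implementation detail rather than anything pandas documents):

```python
from functools import partial

def f(x, a=1):
    return x + a

print(f.__name__)                    # 'f'
print((lambda x: f(x, 5)).__name__)  # '<lambda>'

p = partial(f, a=9)
print(hasattr(p, '__name__'))        # False: partial objects lack __name__

# Possible workaround until this is fixed: give the partial a name
# explicitly so it behaves like a named function.
p.__name__ = 'f_a9'
print(p.__name__)                    # 'f_a9'
```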

jreback added this to the 0.15.0 milestone Aug 4, 2014

Contributor

jreback commented Aug 4, 2014

@dsm054 want to take a stab at fixing?

search for __name__ and deal with the case where it doesn't exist. Could check whether it's a partial and then look inside the partial to get the name, I guess.

Contributor

dsm054 commented Aug 4, 2014

Class instances don't have a default __name__, so functools.partial instances aren't different from any other class that supports __call__. We shouldn't care what's providing us an answer, only that it does.

Contributor

dsm054 commented Aug 4, 2014

Okay, I've got a working branch with tests, but am I alone in thinking the API for this is actually a little strange? If I pass a lambda or a partial I don't really have a good way of changing the name, because the dict syntax is already being used to mean columns. Anyway, setting that aside:

(1) What do we want to do for names of things which don't really have them? At the moment we can't pass multiple lambdas.

(2) Is there a good place to toss a small utility function like get_callable_name?
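For concreteness, a `get_callable_name` along the lines being discussed might look something like the sketch below. This is purely illustrative of the idea (unwrap partials, fall back to the class name for other callables); the helper name comes from the comment above, and the actual fix in pandas may differ.

```python
import functools

def get_callable_name(obj):
    """Best-effort name for an arbitrary callable (illustrative sketch)."""
    try:
        # plain functions and lambdas carry __name__
        return obj.__name__
    except AttributeError:
        pass
    if isinstance(obj, functools.partial):
        # unwrap the partial and name it after the bound function
        return get_callable_name(obj.func)
    # any other callable instance: fall back to its class name
    return type(obj).__name__
```

So `get_callable_name(partial(f, a=9))` would report `'f'`, and an instance of a `__call__`-supporting class would report its class name.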

Contributor

jreback commented Aug 4, 2014

The problem is that they NEED to be put in a dict ATM, somewhat related here: pydata#6515

I think it might be better to simply number the columns if they don't have a name (and de-anonymize the lambdas too).

e.g. my example above with [np.sum, lambda x: x+1] instead of yielding ['sum','<lambda>']
really should yield ['sum',1] I think (or maybe ['sum',0])

or maybe ['sum','aggregate_1']

I think this is why @wesm punted on this, have to name them something.

  1. dunno (but if you create a naming scheme then you probably can have multiple lambdas)
  2. maybe core/common.py?
Contributor

dsm054 commented Aug 4, 2014

We could add a layer of nesting, e.g. {"neat_name": {"foo": sum}, "neat_name2": {"bar": lambda x: x.sum()}} but even though that seems natural enough to me it does flip the meaning of the outermost keys. Probably better to leave it alone, and add a different method if necessary.

If we're fixing duplicate names of the empty variety, should we also allow multiple sums (e.g. sum, np.sum) and use the same convention applied to duplicate columns when reading (e.g. sum, sum.1)? If so, then the result of passing sum, np.sum, lambda x: x.sum() will be sum, sum.1, aggregate_0 (or 1?), which I could live with. If those columns have duplicates (say, because you already have a column with one of those names), then simply raise and leave it to the user to fix.
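The duplicate-name convention mentioned above (sum, sum.1, mirroring how pandas disambiguates duplicate columns when reading files) is simple to sketch; this is an illustrative standalone version, not the actual pandas implementation:

```python
def dedup_names(names):
    """Append .1, .2, ... to repeated names (illustrative sketch)."""
    seen = {}
    out = []
    for name in names:
        if name in seen:
            seen[name] += 1
            out.append('{}.{}'.format(name, seen[name]))
        else:
            seen[name] = 0
            out.append(name)
    return out

print(dedup_names(['sum', 'sum', 'aggregate_0']))
# ['sum', 'sum.1', 'aggregate_0']
```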

Contributor

jreback commented Aug 4, 2014

yep, I would be ok with aggregate_0, maybe could allow a Series which solves the problem nicely

df.groupby(....).agg(Series([np.sum,sum,lambda x:x.sum()],['numpy_num','reg_sum','lambda_sum']))

or a dict too (I don't think you need a nested dict?)

Contributor

dsm054 commented Aug 9, 2014

I like the Series idea. Any objection to my submitting a PR to fix the original bug first, and then opening a followup ENH?

Contributor

jreback commented Aug 10, 2014

sure on both counts!

jreback closed this in #7974 Aug 10, 2014

cpcloud referenced this issue in blaze/blaze Aug 12, 2014: Consistent column naming scheme #458 (Closed)
