Propagate Series.name attribute when merging series into data frame #6124

bburan-galenea · 2014-01-27T14:33:39Z

Use case

Facilitate DataFrame group/apply transformations when using a function that returns a Series. Right now, if we perform the following:

import pandas
df = pandas.DataFrame(
        {'a':  [0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2],
         'b':  [0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1],
         'c':  [1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0],
         'd':  [0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1],
         })

def count_values(df):
    return pandas.Series({'count': df['b'].sum(), 'mean': df['c'].mean()}, name='metrics')

result = df.groupby('a').apply(count_values)
print(result.stack().reset_index())

We get the following output:

   a level_1    0
0  0   count  2.0
1  0    mean  0.5
2  1   count  2.0
3  1    mean  0.5
4  2   count  2.0
5  2    mean  0.5

[6 rows x 3 columns]

Ideally, the series name should be preserved and propagated through these operations such that we get the following output:

   a metrics    0
0  0   count  2.0
1  0    mean  0.5
2  1   count  2.0
3  1    mean  0.5
4  2   count  2.0
5  2    mean  0.5

[6 rows x 3 columns]

One way to achieve this currently is:

result = df.groupby('a').apply(count_values)
result.columns.name = 'metrics'
print(result.stack().reset_index())

However, this has two problems: 1) it adds an extra line of code, and 2) the name of the series created in the applied function may not be known in the calling code (so we can't set the result.columns.name attribute correctly).

The other work-around is to name the index of the series:

def count_values(df):
    series = pandas.Series({'count': df['b'].sum(), 'mean': df['c'].mean()})
    series.index.name = 'metrics'
    return series

One approach is to have the group/apply operation check whether series.index has its name attribute set. If it is not set, the operation sets index.name to the name of the series (thus ensuring the name propagates).
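To make that check concrete, here is a minimal sketch. The helper name propagate_name is hypothetical and is not part of pandas; it only illustrates the "fill in index.name from Series.name when unset" idea and is not the change made in the PR.

```python
import pandas as pd

def propagate_name(series):
    # Hypothetical helper (not a pandas API): if the index is unnamed
    # and the Series has a name, return a copy whose index carries that name.
    if series.index.name is None and series.name is not None:
        series = series.rename_axis(series.name)
    return series

s = pd.Series({'count': 2, 'mean': 0.5}, name='metrics')
named = propagate_name(s)
print(named.index.name)
```

A series whose index is already named passes through unchanged, so an explicit index.name set inside the applied function would still take precedence.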

jreback · 2014-02-05T12:16:04Z

@bburan-galenea pls confirm if #6265 is indeed a dupe (looks like it 2 me). pls add that example as a test if it's substantially different (I didn't look).

thanks

bburan-galenea · 2014-03-05T12:26:00Z

When GroupBy.apply is provided a callable that returns a series, the proposed solution is to check the first series to see if it has a name attribute set. If the name attribute is set, use that as the name for the resulting series. However, this will break some of the unit tests in pandas. Specifically, if we have the following:

import pandas
df = pandas.DataFrame(
        {'a':  [0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2],
         'b':  [0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1],
         'c':  [1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0],
         'd':  [0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1],
         })

def count_values(df):
    return df.iloc[1]

result = df.groupby('a').apply(count_values)

There will be three groups, and the Series returned for each group will be named after the index label of the row it was taken from (1, 5 and 9). Since the series names are not consistent, the proposed solution in the PR (#6068) will fail, because it checks that the series names are consistent before merging the series into a data frame.
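The differing names can be seen without going through apply at all. This is only an illustrative sketch: it iterates over the groups directly and uses a cut-down frame with just the 'a' and 'c' columns.

```python
import pandas as pd

df = pd.DataFrame({
    'a': [0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2],
    'c': [1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0],
})

# Take the second row of each group, as count_values does above,
# and record the name of the resulting Series (the row's index label).
names = [int(group.iloc[1].name) for _, group in df.groupby('a')]
print(names)  # [1, 5, 9]
```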

jreback · 2014-03-05T12:34:06Z

@bburan-galenea ok...I think you can disambiguate that. Unfortunately groupby handles a lot of cases!

bburan-galenea · 2014-03-05T12:38:03Z

I'm not sure what you mean by disambiguating that. There are several approaches:

1. If the series names are not consistent, raise an Exception (the proposed solution in a comment on the PR). This seems like it may break existing code.
2. Don't name the series if the series names are inconsistent.
3. Scrap this approach and add a new keyword argument to GroupBy.apply called name that will be used to indicate the name of the resulting series. For example, result = df.groupby('a').apply(count_values, name='metrics')

Once there's agreement on which approach is best, I can implement it.
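For reference, the second approach (leave the result unnamed when the names disagree) might look roughly like the sketch below. The helper common_name is hypothetical and is not the code in the PR; it only shows the consistency check.

```python
import pandas as pd

def common_name(series_list):
    # Hypothetical helper (not the PR's implementation): return the
    # name shared by every Series, or None if the names differ.
    names = set(s.name for s in series_list)
    return names.pop() if len(names) == 1 else None

consistent = [pd.Series([1, 2], name='metrics'),
              pd.Series([3, 4], name='metrics')]
inconsistent = [pd.Series([1, 2], name=1),
                pd.Series([3, 4], name=5)]

print(common_name(consistent))    # metrics
print(common_name(inconsistent))  # None
```

With this check, the iloc example above would produce an unnamed result instead of raising, so existing code that returns differently named series keeps working.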

jreback · 2014-03-05T12:40:46Z

go with 2 (don't name) and see what effects this has.

don't want 1 as will break compat (or does it?)

3 - too many keywords already.. :)

bburan-galenea · 2014-03-05T12:47:04Z

Thanks! 1 will probably break compatibility in someone's code (I can think of a few cases in some old analyses I've done where I might have done something similar with group/apply/iloc). So, I will go with 2.

bburan-galenea mentioned this issue Jan 27, 2014

ENH: Keep series name when merging GroupBy result #6068

Merged

jreback mentioned this issue Feb 5, 2014

Missing Series name when using count() on groupby object #6265

Closed

jreback closed this as completed in #6068 Mar 5, 2014
