Propagate Series.name attribute when merging series into data frame #6124

Closed
bburan-galenea opened this issue Jan 27, 2014 · 6 comments · Fixed by #6068
Labels
Bug Groupby Reshaping Concat, Merge/Join, Stack/Unstack, Explode

Comments

@bburan-galenea
Contributor

See #6068

Use case

Facilitate DataFrame group/apply transformations when using a function that returns a Series. Right now, if we perform the following:

import pandas
df = pandas.DataFrame(
        {'a':  [0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2],
         'b':  [0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1],
         'c':  [1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0],
         'd':  [0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1],
         })

def count_values(df):
    return pandas.Series({'count': df['b'].sum(), 'mean': df['c'].mean()}, name='metrics')

result = df.groupby('a').apply(count_values)
print(result.stack().reset_index())

We get the following output:

   a level_1    0
0  0   count  2.0
1  0    mean  0.5
2  1   count  2.0
3  1    mean  0.5
4  2   count  2.0
5  2    mean  0.5

[6 rows x 3 columns]

Ideally, the series name should be preserved and propagated through these operations such that we get the following output:

   a metrics    0
0  0   count  2.0
1  0    mean  0.5
2  1   count  2.0
3  1    mean  0.5
4  2   count  2.0
5  2    mean  0.5

[6 rows x 3 columns]

The only way to achieve this (currently) is:

result = df.groupby('a').apply(count_values)
result.columns.name = 'metrics'
print(result.stack().reset_index())

However, this has two drawbacks: 1) it adds an extra line of code, and 2) the name of the series created in the applied function may not be known in the calling code (so we can't reliably set the result.columns.name attribute).

The other work-around is to name the index of the series:

def count_values(df):
    series = pandas.Series({'count': df['b'].sum(), 'mean': df['c'].mean()})
    series.index.name = 'metrics'
    return series

One approach is to check, during the group/apply operation, whether series.index has its name attribute set. If it is not set, index.name would be set to the name of the series (thus ensuring the name propagates).
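As a rough illustration of that check, here is a minimal sketch. The helper name `propagate_name_to_index` is hypothetical and not part of pandas; it only shows the rule described above applied to a single Series, not how GroupBy.apply is or would be implemented.

```python
import pandas as pd


def propagate_name_to_index(series):
    """Hypothetical helper: use Series.name as the index name if the index is unnamed."""
    if series.index.name is None and series.name is not None:
        # rename_axis returns a new Series; the input is left untouched
        return series.rename_axis(series.name)
    return series


# Index has no name, so the series name is copied onto it
unnamed_index = pd.Series({'count': 2, 'mean': 0.5}, name='metrics')
propagated = propagate_name_to_index(unnamed_index)

# Index already has a name, so it is kept as-is
already_named = unnamed_index.rename_axis('stats')
kept = propagate_name_to_index(already_named)

print(propagated.index.name)  # metrics
print(kept.index.name)        # stats
```

An explicitly set index name takes precedence in this sketch, so the second work-around above would keep working unchanged.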

@jreback
Contributor

jreback commented Feb 5, 2014

@bburan-galenea please confirm whether #6265 is indeed a dupe (looks like it to me). Please add that example as a test if it's substantially different (I didn't look).

thanks

@bburan-galenea
Contributor Author

When GroupBy.apply is provided a callable that returns a series, the proposed solution is to check the first series to see if it has a name attribute set. If the name attribute is set, use that as the name for the resulting series. However, this will break some of the unit tests in pandas. Specifically, if we have the following:

import pandas
df = pandas.DataFrame(
        {'a':  [0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2],
         'b':  [0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1],
         'c':  [1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0],
         'd':  [0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1],
         })

def count_values(df):
    return df.iloc[1]

result = df.groupby('a').apply(count_values)

There will be three groups, and the Series returned for each group will be named with the index label of the row it was derived from (1, 5 and 9). Because the series names are not consistent, the proposed solution in the PR (#6068) will fail: it checks that the series names are consistent before merging the series into a data frame.
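To make the inconsistency concrete, the following sketch iterates over the groups by hand (rather than going through GroupBy.apply) and collects the name of each per-group row. The variable `row_names` is just for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'a': [0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2],
    'b': [0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1],
    'c': [1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0],
    'd': [0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1],
})

# The second row of each group is a Series whose name is that row's index label
row_names = [int(group.iloc[1].name) for _, group in df.groupby('a')]
print(row_names)  # [1, 5, 9]
```

Each group yields a differently named Series, so there is no single name to propagate.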

@jreback
Contributor

jreback commented Mar 5, 2014

@bburan-galenea ok...I think you can disambiguate that. Unfortunately groupby handles a lot of cases!

@bburan-galenea
Contributor Author

I'm not sure what you mean by disambiguating that. There are several approaches:

  • If the series names are not consistent, raise an Exception (the proposed solution in a comment on the PR). This seems like it may break existing code.
  • Don't name the series if the series names are inconsistent.
  • Scrap this approach and add a new keyword argument to GroupBy.apply called name that will be used to indicate the name of the resulting series. For example, result = df.groupby('a').apply(count_values, name='metrics')

Once there's agreement on which approach is best, I can implement it.

@jreback
Contributor

jreback commented Mar 5, 2014

go with 2 (don't name) and see what effects this has.

don't want 1 as will break compat (or does it?)

3 - too many keywords already.. :)

@bburan-galenea
Contributor Author

Thanks! 1 will probably break compatibility in someone's code (I can think of a few cases in some old analyses I've done where I might have done something similar with group/apply/iloc). So, I will go with 2.
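
For reference, approach 2 (name the result only when the series names agree) might look roughly like the sketch below. The functions `shared_name` and `combine` are hypothetical stand-ins written for this illustration; they are not the code from #6068 and not part of the pandas API.

```python
import pandas as pd


def shared_name(results):
    """Return the common Series name, or None if the names differ."""
    names = {s.name for s in results}
    return names.pop() if len(names) == 1 else None


def combine(keys, results):
    """Stack per-group Series into a frame, one row per group key."""
    frame = pd.DataFrame(dict(zip(keys, results))).T
    # Only name the columns axis when every Series carried the same name
    frame.columns.name = shared_name(results)
    return frame


keys = [0, 1, 2]
consistent = [pd.Series({'count': 2, 'mean': 0.5}, name='metrics') for _ in keys]
inconsistent = [pd.Series({'b': 0, 'c': 0}, name=n) for n in (1, 5, 9)]

named_frame = combine(keys, consistent)
unnamed_frame = combine(keys, inconsistent)

print(named_frame.columns.name)    # metrics
print(unnamed_frame.columns.name)  # None
```

With inconsistent names the result is simply left unnamed, which matches the current behaviour and so should not break code that relies on group/apply/iloc patterns.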
