
Feature request: aggregation without MultiIndex columns #14581

Closed
PanWu opened this issue Nov 3, 2016 · 5 comments
Labels
API Design Groupby Reshaping Concat, Merge/Join, Stack/Unstack, Explode

Comments


PanWu commented Nov 3, 2016

Currently, group-by aggregation in pandas creates MultiIndex columns when there are multiple operations on the same column. This introduces friction: the column names have to be reset before fast filtering and joining. (If all operations could be chained together, analytics would be smoother.)

df = pd.DataFrame([
        ['A', 1],
        ['A', 2],
        ['A', 3],
        ['B', 4],
        ['B', 5],
        ['B', 6]
    ], columns=['Key', 'Value'])

df1 = df.groupby('Key') \
    .agg({
            'Value': {'V1': 'sum', 'V2': 'count'}
        })

df1.columns = df1.columns.droplevel()  # This line introduces friction
df1.query('V1 > V2')

Expected Output

It would be great if there were a simple alias function for columns (like PySpark's implementation), such as:

# Just one approach; there may be other, more appropriate ones
df.groupby('Key') \
    .agg(
            pd.Series.sum('Value').alias('V1'),
            pd.Series.count('Value').alias('V2')
        ) \
    .query('V1 > V2')

jreback commented Nov 3, 2016

In [14]: df.groupby('Key').Value.agg({'V1': 'sum', 'V2': 'count'}).query('V1 > V2')
Out[14]: 
     V2  V1
Key        
A     3   6
B     3  15

You simply need to specify an aggregation directly on the Series.

To be honest, PySpark in general is NOT an inspiration here; things there are much clunkier.
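For readers on recent pandas: the dict-based renaming used above was deprecated in 0.20 and removed in 1.0. Named aggregation (available since pandas 0.25) produces the same flat, renamed columns directly. A minimal sketch using the data from this thread:

```python
import pandas as pd

# Data from the original example
df = pd.DataFrame(
    [['A', 1], ['A', 2], ['A', 3], ['B', 4], ['B', 5], ['B', 6]],
    columns=['Key', 'Value'],
)

# Named aggregation: each keyword becomes a flat (non-MultiIndex)
# output column, so .query() works without touching df.columns
out = df.groupby('Key')['Value'].agg(V1='sum', V2='count').query('V1 > V2')
```

Because the keyword names are the output column labels, no `droplevel` step is needed afterwards.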

@jreback jreback closed this as completed Nov 3, 2016
@jreback jreback added Groupby Reshaping Concat, Merge/Join, Stack/Unstack, Explode API Design labels Nov 3, 2016
@jreback jreback added this to the No action milestone Nov 3, 2016

PanWu commented Nov 4, 2016

Thanks @jreback for the quick response. My previous example was oversimplified; what about the following one?

df = pd.DataFrame([
        ['A', 1, 85],
        ['A', 2, 45],
        ['A', 3, 23],
        ['B', 4, 76],
        ['B', 5, 43],
        ['B', 6, 56]
    ], columns=['Key', 'Value', 'Age'])

df1 = df.groupby('Key') \
    .agg({
            'Value': {'V1': 'sum', 'V2': 'count'},
            'Age': {'AvgAge': 'mean', 'StdAge': 'std'}
        })

Is there a convenient way to handle aggregation over multiple columns, each with multiple aggregation methods?


jreback commented Nov 4, 2016


wesm commented Nov 4, 2016

FWIW I would also like to see a new groupby API entry point (not requiring keyword args) that does not yield a row index.


PanWu commented Nov 4, 2016

@jreback thanks, I eventually figured out the syntax:

df1 = df.groupby('Key', as_index=False) \
    [['Value', 'Age']] \
    .agg({
            'Value': {'V1': 'sum', 'V2': 'count'},
            'Age': {'AvgAge': 'mean', 'StdAge': 'std'}
        })

@wesm that would be great! +1 on that
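On recent pandas (0.25+), where the nested-dict spec above no longer works, the same multi-column, multi-method result is available via named aggregation, with flat columns and no `droplevel` step. A sketch using the follow-up data:

```python
import pandas as pd

# Data from the follow-up example
df = pd.DataFrame(
    [['A', 1, 85], ['A', 2, 45], ['A', 3, 23],
     ['B', 4, 76], ['B', 5, 43], ['B', 6, 56]],
    columns=['Key', 'Value', 'Age'],
)

# Each keyword maps a new flat column name to a
# (source column, aggregation) pair
df1 = df.groupby('Key', as_index=False).agg(
    V1=('Value', 'sum'),
    V2=('Value', 'count'),
    AvgAge=('Age', 'mean'),
    StdAge=('Age', 'std'),
)
```

With `as_index=False`, `Key` comes back as an ordinary column, which also addresses the flat-output point raised above.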


3 participants