
Feature request: aggregation without MultiIndex columns #14581

Closed
PanWu opened this issue Nov 3, 2016 · 5 comments
Labels
API Design Groupby Reshaping Concat, Merge/Join, Stack/Unstack, Explode

Comments


PanWu commented Nov 3, 2016

Currently, group-by aggregation in pandas creates MultiIndex columns when there are multiple operations on the same column. This introduces friction: the column names have to be reset before fast filtering and joining. (If all operations could be chained together, analytics would be smoother.)

df = pd.DataFrame([
        ['A', 1],
        ['A', 2],
        ['A', 3],
        ['B', 4],
        ['B', 5],
        ['B', 6]
    ], columns=['Key', 'Value'])

df1 = df.groupby('Key') \
    .agg({
            'Value': {'V1': 'sum', 'V2': 'count'}
        })

df1.columns = df1.columns.droplevel()  # This line introduces friction
df1.query('V1 > V2')

Expected Output

It would be great if there were a simple alias function for columns (like PySpark's implementation), such as:

# Just one approach; there may be other, more appropriate ones
df.groupby('Key') \
    .agg(
            pd.Series.sum('Value').alias('V1'),
            pd.Series.count('Value').alias('V2')
        ) \
    .query('V1 > V2')

jreback commented Nov 3, 2016

In [14]: df.groupby('Key').Value.agg({'V1': 'sum', 'V2': 'count'}).query('V1 > V2')
Out[14]: 
     V2  V1
Key        
A     3   6
B     3  15

You simply need to specify an aggregation directly on the Series.

To be honest, PySpark in general is NOT an inspiration here; things there are much clunkier.
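For readers on recent pandas: the dict-based renaming used above was deprecated in 0.20 and removed in 1.0. Named aggregation (available since pandas 0.25) produces the same flat, renamed columns directly. A minimal sketch using the data from this thread:

```python
import pandas as pd

# Data from the original example
df = pd.DataFrame(
    [['A', 1], ['A', 2], ['A', 3], ['B', 4], ['B', 5], ['B', 6]],
    columns=['Key', 'Value'],
)

# Named aggregation: each keyword becomes a flat (non-MultiIndex)
# output column, so .query() works without touching df.columns
out = df.groupby('Key')['Value'].agg(V1='sum', V2='count').query('V1 > V2')
```

Because the keyword names are the output column labels, no `droplevel` step is needed afterwards.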

@jreback jreback closed this as completed Nov 3, 2016
@jreback jreback added Groupby Reshaping Concat, Merge/Join, Stack/Unstack, Explode API Design labels Nov 3, 2016
@jreback jreback added this to the No action milestone Nov 3, 2016

PanWu commented Nov 4, 2016

Thanks @jreback for the quick response. My previous example was oversimplified; what about the following one?

df = pd.DataFrame([
        ['A', 1, 85],
        ['A', 2, 45],
        ['A', 3, 23],
        ['B', 4, 76],
        ['B', 5, 43],
        ['B', 6, 56]
    ], columns=['Key', 'Value', 'Age'])

df1 = df.groupby('Key') \
    .agg({
            'Value': {'V1': 'sum', 'V2': 'count'},
            'Age': {'AvgAge': 'mean', 'StdAge': 'std'}
        })

Is there a convenient way to handle aggregation over multiple columns, each with multiple aggregation methods?


jreback commented Nov 4, 2016


wesm commented Nov 4, 2016

FWIW I would also like to see a new groupby API entry point (not requiring keyword args) that does not yield a row index.


PanWu commented Nov 4, 2016

@jreback thanks, I eventually figured out the syntax:

df1 = df.groupby('Key', as_index=False) \
    [['Value', 'Age']] \
    .agg({
            'Value': {'V1': 'sum', 'V2': 'count'},
            'Age': {'AvgAge': 'mean', 'StdAge': 'std'}
        })

@wesm that would be great! +1 on that
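On recent pandas (0.25+), where the nested-dict spec above no longer works, the same multi-column, multi-method result is available via named aggregation, with flat columns and no `droplevel` step. A sketch using the follow-up data:

```python
import pandas as pd

# Data from the follow-up example
df = pd.DataFrame(
    [['A', 1, 85], ['A', 2, 45], ['A', 3, 23],
     ['B', 4, 76], ['B', 5, 43], ['B', 6, 56]],
    columns=['Key', 'Value', 'Age'],
)

# Each keyword maps a new flat column name to a
# (source column, aggregation) pair
df1 = df.groupby('Key', as_index=False).agg(
    V1=('Value', 'sum'),
    V2=('Value', 'count'),
    AvgAge=('Age', 'mean'),
    StdAge=('Age', 'std'),
)
```

With `as_index=False`, `Key` comes back as an ordinary column, which also addresses the flat-output point raised above.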


3 participants