- Title: Aggregation in pandas DataFrame
- Slug: python-pandas-aggregation
- Date: 2019-12-12 21:25:14
- Category: Computer Science
- Tags: programming, Python, pandas, DataFrame, group by, groupby, aggregation
- Author: Ben Du
- Modified: 2019-12-12 21:25:14


https://www.shanelynn.ie/summarising-aggregation-and-grouping-data-in-python-pandas/

## Comment

1. `groupby` works exactly the same on the index if the index is named.

2. The order of the columns passed to `groupby` matters if you want to unstack the results later.

3. `groupby` works on columns too, and it can group by a level of a MultiIndex.
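
The points above can be sketched as follows. This is a minimal illustration rather than a complete treatment; the variable names (`by_column`, `by_index_name`, `by_level`, `by_multi_level`, `wide`) are made up for this example.

```python
import pandas as pd

df = pd.DataFrame({
    'x': [3, 3, 1, 10, 1, 10],
    'y': [1, 2, 3, 4, 5, 6],
    'z': [6, 5, 4, 3, 2, 1],
})

# Group by a regular column.
by_column = df.groupby('x').sum()

# Group by a named index: the index name is passed like a column label.
indexed = df.set_index('x')
by_index_name = indexed.groupby('x').sum()

# Group by an index level explicitly.
by_level = indexed.groupby(level='x').sum()

# Group by one level of a MultiIndex.
multi = df.set_index(['x', 'y'])
by_multi_level = multi.groupby(level='x').sum()

# unstack() pivots the innermost (last) grouping key into columns,
# so the order of the keys decides which one ends up as columns.
wide = df.groupby(['x', 'y'])['z'].sum().unstack()
```

Here `wide` has `x` as its index and the values of `y` as its columns; grouping by `['y', 'x']` instead would swap the two roles.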

## `groupby` on a Column of An Empty Data Frame

In [1]:
import pandas as pd

df = pd.DataFrame({'x': [], 'y': [], 'z': []})

df

Unnamed: 0,x,y,z


In [2]:
df.groupby('x')[['y', 'z']].sum()

Unnamed: 0_level_0,y,z
x,Unnamed: 1_level_1,Unnamed: 2_level_1


## `groupby` on the Index of An Empty Data Frame

In [3]:
import pandas as pd

df = pd.DataFrame({'x': [], 'y': [], 'z': []})
df = df.set_index('x')
df

Unnamed: 0_level_0,y,z
x,Unnamed: 1_level_1,Unnamed: 2_level_1


In [4]:
df.groupby('x')[['y', 'z']].sum()

Unnamed: 0_level_0,y,z
x,Unnamed: 1_level_1,Unnamed: 2_level_1


## `groupby` on Non-empty Data Frames

In [5]:
import pandas as pd

df = pd.DataFrame(
    {
        'x': [3, 3, 1, 10, 1, 10],
        'y': [1, 2, 3, 4, 5, 6],
        'z': [6, 5, 4, 3, 2, 1]
    }
)

df

Unnamed: 0,x,y,z
0,3,1,6
1,3,2,5
2,1,3,4
3,10,4,3
4,1,5,2
5,10,6,1


In [6]:
df.groupby('x')[['y', 'z']].sum()

Unnamed: 0_level_0,y,z
x,Unnamed: 1_level_1,Unnamed: 2_level_1
1,8,6
3,3,11
10,10,4


In [7]:
df.groupby('x').sum()

Unnamed: 0_level_0,y,z
x,Unnamed: 1_level_1,Unnamed: 2_level_1
1,8,6
3,3,11
10,10,4


In [9]:
df.groupby(['x'], sort=False).sum()

Unnamed: 0_level_0,y,z
x,Unnamed: 1_level_1,Unnamed: 2_level_1
3,3,11
1,8,6
10,10,4


## Aggregation Function Taking Extra Parameters

In [2]:
import pandas as pd
import numpy as np

df = pd.DataFrame(
    {
        'x': [3, 3, 1, 10, 1, 10],
        'y': [1, 2, 3, 4, 5, 6],
        'z': [6, 5, 4, 3, 2, 1]
    }
)

df

Unnamed: 0,x,y,z
0,3,1,6
1,3,2,5
2,1,3,4
3,10,4,3
4,1,5,2
5,10,6,1


In [3]:
def my_min(x, offset=0):
    # Extra keyword arguments given to `agg` are forwarded to this function.
    return offset + min(x)

In [4]:
df.groupby('x')[['y', 'z']].agg(my_min, offset=1000)

Unnamed: 0_level_0,y,z
x,Unnamed: 1_level_1,Unnamed: 2_level_1
1,1003,1002
3,1001,1005
10,1004,1001


In [5]:
df.groupby('x')[['y', 'z']].agg(min)

Unnamed: 0_level_0,y,z
x,Unnamed: 1_level_1,Unnamed: 2_level_1
1,3,2
3,1,5
10,4,1


In [4]:
df.groupby(['x'], sort=False).apply(lambda x: x)

Unnamed: 0,x,y,z
0,3,1,6
1,3,2,5
2,1,3,4
3,10,4,3
4,1,5,2
5,10,6,1


## `agg`

Note that most built-in aggregation functions (e.g., `sum`, `mean`, `min`, `max`) skip NaN values by default.
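
A small sketch of how missing values are treated, using a made-up frame `df_nan` (not part of the original notebook). It compares `count`, which counts non-missing values, with `size`, which counts rows.

```python
import numpy as np
import pandas as pd

df_nan = pd.DataFrame({
    'x': [1, 1, 2, 2],
    'y': [1.0, np.nan, 3.0, np.nan],
})

g = df_nan.groupby('x')['y']
summary = pd.DataFrame({
    'sum': g.sum(),      # NaN values are skipped
    'mean': g.mean(),    # mean of the non-missing values only
    'count': g.count(),  # number of non-missing values
    'size': g.size(),    # number of rows, including missing values
})
```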

`min` on each column inside each group.

In [5]:
df.groupby('x').agg('min')

Unnamed: 0_level_0,y,z
x,Unnamed: 1_level_1,Unnamed: 2_level_1
1,3,2
3,1,5
10,4,1


Multiple aggregations for each column.

In [6]:
df.groupby('x').agg(['min', 'max'])

Unnamed: 0_level_0,y,y,z,z
Unnamed: 0_level_1,min,max,min,max
x,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2
1,3,5,2,4
3,1,2,5,6
10,4,6,1,3


Aggregate on the column `y` only.

In [7]:
df.groupby('x').y.agg(['min', 'max'])

Unnamed: 0_level_0,min,max
x,Unnamed: 1_level_1,Unnamed: 2_level_1
1,3,5
3,1,2
10,4,6


## Group by Multiple Criteria

When grouping by multiple criteria,
you can mix column labels and Series together.

In [6]:
df.groupby(['x', 'y']).sum()

Unnamed: 0_level_0,Unnamed: 1_level_0,z
x,y,Unnamed: 2_level_1
1,3,4
1,5,2
3,1,6
3,2,5
10,4,3
10,6,1


In [7]:
df.groupby(['x', df.y]).sum()

Unnamed: 0_level_0,Unnamed: 1_level_0,z
x,y,Unnamed: 2_level_1
1,3,4
1,5,2
3,1,6
3,2,5
10,4,3
10,6,1


In [8]:
df.groupby([df.x, df.y]).sum()

Unnamed: 0_level_0,Unnamed: 1_level_0,z
x,y,Unnamed: 2_level_1
1,3,4
1,5,2
3,1,6
3,2,5
10,4,3
10,6,1


## Naming Aggregated Columns

In [2]:
import pandas as pd
import numpy as np

df = pd.DataFrame(
    {
        'x': [3, 3, 1, 10, 1, 10],
        'y': [1, 2, 3, 4, 5, 6],
        'z': [6, 5, 4, 3, 2, 1]
    }
)
df

Unnamed: 0,x,y,z
0,3,1,6
1,3,2,5
2,1,3,4
3,10,4,3
4,1,5,2
5,10,6,1


In [3]:
df.groupby('x').agg(
    y_avg=('y', np.average),
    y_sum=('y', sum),
    x_sum=('x', sum),
)

Unnamed: 0_level_0,y_avg,y_sum,x_sum
x,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
1,4.0,8,2
3,1.5,3,6
10,5.0,10,20


As of pandas 0.25.3, you cannot use multiple lambda functions in named aggregation with the `agg` (`aggregate`) method.
A patch has been made but not released yet.
Until the fix is released,
define the lambda functions as regular named functions to avoid the issue.

In [4]:
df.groupby('x').agg(
    y_avg=('y', lambda x: np.average(x)),
    y_sum=('y', lambda x: sum(x)),
    x_sum=('x', sum),
)

KeyError: "[('y', '<lambda>')] not in index"
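
A minimal sketch of that workaround, with each lambda replaced by a regular named function. The helper names `y_average` and `y_total` are hypothetical and chosen only for this example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'x': [3, 3, 1, 10, 1, 10],
    'y': [1, 2, 3, 4, 5, 6],
    'z': [6, 5, 4, 3, 2, 1],
})


def y_average(values):
    # Named replacement for `lambda x: np.average(x)`.
    return np.average(values)


def y_total(values):
    # Named replacement for `lambda x: sum(x)`.
    return values.sum()


result = df.groupby('x').agg(
    y_avg=('y', y_average),
    y_sum=('y', y_total),
)
```

Because the two functions have distinct names, the intermediate column labels no longer collide on `<lambda>`.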

By default,
the groupby column is used as the index.

In [8]:
r = df.groupby('x').agg({'y': 'max', 'z': ['max', 'min', 'mean', 'count']})
r

Unnamed: 0_level_0,y,z,z,z,z
Unnamed: 0_level_1,max,max,min,mean,count
x,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2
1,5,4,2,3.0,2
3,2,6,5,5.5,2
10,6,3,1,2.0,2


You can keep the groupby column as a regular column in the final results
by using the option `as_index=False`.

In [9]:
r = df.groupby('x',
               as_index=False).agg({
                   'y': 'max',
                   'z': ['max', 'min', 'mean', 'count']
               })
r

Unnamed: 0_level_0,x,y,z,z,z,z
Unnamed: 0_level_1,Unnamed: 1_level_1,max,max,min,mean,count
0,1,5,4,2,3.0,2
1,3,2,6,5,5.5,2
2,10,6,3,1,2.0,2


In [10]:
r.columns

MultiIndex(levels=[['y', 'z', 'x'], ['count', 'max', 'mean', 'min', '']],
           labels=[[2, 0, 1, 1, 1, 1], [4, 1, 1, 3, 2, 0]])

In [11]:
r.columns = ['x', 'ymax', 'zmax', 'zmin', 'zmean', 'zcnt']
r

Unnamed: 0,x,ymax,zmax,zmin,zmean,zcnt
0,1,5,4,2,3.0,2
1,3,2,6,5,5.5,2
2,10,6,3,1,2.0,2
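
The manual renaming above can possibly be avoided with named aggregation, which produces flat column names directly. This is a sketch assuming a pandas version that supports named aggregation (0.25 or later); `flat` is a name introduced here for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'x': [3, 3, 1, 10, 1, 10],
    'y': [1, 2, 3, 4, 5, 6],
    'z': [6, 5, 4, 3, 2, 1],
})

# Each keyword is an output column name mapped to (input column, function).
flat = df.groupby('x', as_index=False).agg(
    ymax=('y', 'max'),
    zmax=('z', 'max'),
    zmin=('z', 'min'),
    zmean=('z', 'mean'),
    zcnt=('z', 'count'),
)
```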


## Equivalent of Having

Use `df.groupby('col').filter` to keep only the groups that satisfy a condition, similar to the `HAVING` clause in SQL.
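
A minimal sketch of the idea, assuming the same example data as in the earlier sections. The threshold `5` and the names `kept`, `totals` and `having` are arbitrary choices for this illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'x': [3, 3, 1, 10, 1, 10],
    'y': [1, 2, 3, 4, 5, 6],
    'z': [6, 5, 4, 3, 2, 1],
})

# Keep the original rows of every group whose sum of y exceeds 5.
kept = df.groupby('x').filter(lambda g: g['y'].sum() > 5)

# Closer to SQL's GROUP BY ... HAVING: aggregate first,
# then filter the aggregated values.
totals = df.groupby('x')['y'].sum()
having = totals[totals > 5]
```

`filter` returns rows of the original frame, while the second form returns one value per surviving group.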

In [38]:
2**0.5

1.4142135623730951

In [39]:
pow(2, 0.5)

1.4142135623730951

In [45]:
s = pd.Series([1, 2, 3])
s

0    1
1    2
2    3
dtype: int64

In [46]:
s['abc'] = 1000

In [47]:
s

0         1
1         2
2         3
abc    1000
dtype: int64

## Aggregation Using `apply`

In [30]:
df.apply(np.average, args=(None, df.z))

x    3.761905
y    2.666667
z    4.333333
dtype: float64

In [36]:
df.drop('z', axis=1)

Unnamed: 0,x,y
0,3,1
1,3,2
2,1,3
3,10,4
4,1,5
5,10,6


In [33]:
df.apply(lambda col: np.average(col, weights=df.z))

x    3.761905
y    2.666667
z    4.333333
dtype: float64

In [31]:
np.average(df.x, weights=df.z)

3.761904761904762

In [32]:
np.average(df.y, weights=df.z)

2.6666666666666665

In [27]:
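def my_weighted_avg(df):
    # Weights proportional to z within the group.
    w = df.z / df.z.sum()
    # Calls np.average(col, None, w) on each column, i.e. axis=None, weights=w.
    return df.apply(np.average, args=(None, w))

df.groupby('x')[['y', 'z']].apply(my_weighted_avg)

Unnamed: 0_level_0,y,z
x,Unnamed: 1_level_1,Unnamed: 2_level_1
1,3.666667,3.333333
3,1.454545,5.545455
10,4.5,2.5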


In [20]:
import numpy as np
df.apply(np.average, args=(df.z, ))

TypeError: ('tuple indices must be integers or slices, not Series', 'occurred at index x')
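
The error above arises because the extra positional argument lands in the `axis` parameter of `np.average`, whose signature is `np.average(a, axis=None, weights=None, ...)`. One way to sidestep the positional ordering, sketched here, is to pass the weights by keyword, since `DataFrame.apply` forwards extra keyword arguments to the function. The name `weighted` is introduced only for this example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'x': [3, 3, 1, 10, 1, 10],
    'y': [1, 2, 3, 4, 5, 6],
    'z': [6, 5, 4, 3, 2, 1],
})

# `weights=` is forwarded to np.average for every column.
weighted = df.apply(np.average, weights=df['z'])
```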

In [18]:
?df.apply

Signature: df.apply(func, axis=0, broadcast=False, raw=False, reduce=None, args=(), **kwds)
Docstring:
Applies function along input axis of DataFrame.

Objects passed to functions are Series objects having index
either the DataFrame's index (axis=0) or the columns (axis=1).
Return type depends on whether passed function aggregates, or the
reduce argument if the DataFrame is empty.

Parameters
----------
func : function
    Function to apply to each column/row
axis : {0 or 'index', 1 or 'columns'}, default 0
    * 0 or 'index': apply function to each column
    * 1 or 'columns': apply function to each row
broadcast : boolean, default False
    For aggre

## Comment

By default, the group keys are sorted during the groupby operation.
However, you can pass `sort=False` to keep the keys in the order in which they first appear.
This can also potentially speed up the code.
