
https://predictivehacks.com/pandas-groupby-tips/

In [1]:
import pandas as pd

In [2]:
df = pd.DataFrame({'Gender':['m','m','m','f','f','f','f', 'm','f','f'],
                   'Type':['a','b','c','a','b','c','c', 'c','c','b'],
                   'ColA':[10,20,30,40,50,60,70,80,90,100],
                   'ColB':[0,5,10,15,25,30,50,10,20,30]})
df

  Gender Type  ColA  ColB
0      m    a    10     0
1      m    b    20     5
2      m    c    30    10
3      f    a    40    15
4      f    b    50    25
5      f    c    60    30
6      f    c    70    50
7      m    c    80    10
8      f    c    90    20
9      f    b   100    30


Tip: How to return results without Index


In many cases, we do not want the grouping column(s) to appear as the index of the result. For that reason, we usually add reset_index() at the end. For example, let’s say that we want the average of ColA grouped by Gender:



In [3]:
df.groupby('Gender')['ColA'].mean()

Gender
f    68.333333
m    35.000000
Name: ColA, dtype: float64

Now, if we want Gender to be a regular column rather than the index, we add reset_index() at the end:



In [4]:
df.groupby('Gender')['ColA'].mean().reset_index()

  Gender       ColA
0      f  68.333333
1      m  35.000000


Tip: Instead of typing reset_index(), you can pass as_index=False to groupby and you will get the same output. For example:



In [5]:
df.groupby('Gender', as_index=False)['ColA'].mean()

  Gender       ColA
0      f  68.333333
1      m  35.000000
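
As a quick sanity check, the two spellings can be compared directly. This is a minimal sketch that rebuilds only the two columns it needs; the names with_reset and with_flag are illustrative and not part of the original notebook.

```python
import pandas as pd

# Only the two columns needed for this comparison
df = pd.DataFrame({'Gender': ['m', 'm', 'm', 'f', 'f', 'f', 'f', 'm', 'f', 'f'],
                   'ColA': [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]})

# Variant 1: aggregate first, then move the index back into a column
with_reset = df.groupby('Gender')['ColA'].mean().reset_index()

# Variant 2: ask groupby not to build the index in the first place
with_flag = df.groupby('Gender', as_index=False)['ColA'].mean()

# Both should hold the same columns and values
print(with_reset.equals(with_flag))
```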


This can be extended to more columns. For example, let’s say that we group by Gender and Type.

With index:

In [None]:
df.groupby(['Gender', 'Type'])['ColA'].mean()

Without index:


In [6]:
df.groupby(['Gender', 'Type'], as_index=False)['ColA'].mean()

  Gender Type       ColA
0      f    a  40.000000
1      f    b  75.000000
2      f    c  73.333333
3      m    a  10.000000
4      m    b  20.000000
5      m    c  55.000000


Tip: How to get the groups

Once we group our DataFrame, we can iterate over the groups or retrieve a single one. For example, let’s assume that we group our DataFrame by Type:



In [7]:
grouped = df.groupby('Type')

How to iterate over groups?


In [8]:
for g in grouped:
    print(g)

('a',   Gender Type  ColA  ColB
0      m    a    10     0
3      f    a    40    15)
('b',   Gender Type  ColA  ColB
1      m    b    20     5
4      f    b    50    25
9      f    b   100    30)
('c',   Gender Type  ColA  ColB
2      m    c    30    10
5      f    c    60    30
6      f    c    70    50
7      m    c    80    10
8      f    c    90    20)
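
Each item yielded by the loop is a (key, sub-DataFrame) pair, so it can be unpacked in the for statement. The sketch below assumes a reduced copy of the example data; group_sizes is a hypothetical helper dictionary used only for illustration.

```python
import pandas as pd

df = pd.DataFrame({'Type': ['a', 'b', 'c', 'a', 'b', 'c', 'c', 'c', 'c', 'b'],
                   'ColA': [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]})

group_sizes = {}
for name, group in df.groupby('Type'):
    # name is the group key ('a', 'b' or 'c'); group is the matching sub-DataFrame
    group_sizes[name] = len(group)
    print(name, len(group), group['ColA'].sum())
```

Unpacking keeps the key and the data separate, which is usually easier to work with than indexing into the tuple.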


How to get a group?

We can get a specific group using the get_group method. For example, let’s say that we want the group where Type is 'b':



In [9]:
grouped.get_group('b')

  Gender Type  ColA  ColB
1      m    b    20     5
4      f    b    50    25
9      f    b   100    30
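
Before calling get_group, it can help to see which keys exist. A small sketch, assuming the same Type column as above: the groups attribute maps each key to the row labels in that group, and ngroups and size() summarize them.

```python
import pandas as pd

df = pd.DataFrame({'Type': ['a', 'b', 'c', 'a', 'b', 'c', 'c', 'c', 'c', 'b'],
                   'ColA': [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]})
grouped = df.groupby('Type')

# Mapping from group key to the index labels of its rows
print(list(grouped.groups))
print(list(grouped.groups['b']))

# Number of groups and number of rows per group
print(grouped.ngroups)
print(grouped.size())
```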


Tip: How to apply multiple functions?

Let’s say that, grouping by Gender, we want to calculate the mean and variance of ColA and the min and max of ColB.



In [10]:
df.groupby('Gender').agg({'ColA':['mean', 'var'], 
                          'ColB':['min', 'max'] })

             ColA             ColB
             mean         var  min max
Gender
f       68.333333  536.666667   15  50
m       35.000000  966.666667    0  10


Tip: How to change the names of the aggregated columns

Instead of a bare function name, pass a (new_name, function) tuple for each aggregation.


In [11]:
df.groupby('Gender').agg({'ColA':[('ColA_Mean','mean'), ('ColA_Var', 'var')], 
                          'ColB':[('ColB_Min','min'), ('ColB_Max', 'max')] })

             ColA                 ColB
        ColA_Mean    ColA_Var ColB_Min ColB_Max
Gender
f       68.333333  536.666667       15       50
m       35.000000  966.666667        0       10
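
An alternative to the tuple-in-dict form is pandas' named aggregation, where each keyword argument is the output column name and its value is a (column, function) pair. This is a sketch of that alternative rather than the approach used in the cell above; it yields flat column names instead of a two-level header.

```python
import pandas as pd

df = pd.DataFrame({'Gender': ['m', 'm', 'm', 'f', 'f', 'f', 'f', 'm', 'f', 'f'],
                   'ColA': [10, 20, 30, 40, 50, 60, 70, 80, 90, 100],
                   'ColB': [0, 5, 10, 15, 25, 30, 50, 10, 20, 30]})

# keyword = output column name, value = (input column, aggregation)
summary = df.groupby('Gender').agg(
    ColA_Mean=('ColA', 'mean'),
    ColA_Var=('ColA', 'var'),
    ColB_Min=('ColB', 'min'),
    ColB_Max=('ColB', 'max'),
)
print(summary)
```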


Tip: How to add a custom function

Let’s say that we want to add a custom calculation, the range of ColA, i.e. max - min.



In [12]:
df.groupby('Gender').agg({'ColA':[('ColA_Mean','mean'), ('ColA_Var', 'var'), ('CustomFunction', lambda x: x.max() - x.min())], 
                          'ColB':[('ColB_Min','min'), ('ColB_Max', 'max')] }) 

             ColA                                ColB
        ColA_Mean    ColA_Var CustomFunction ColB_Min ColB_Max
Gender
f       68.333333  536.666667             60       15       50
m       35.000000  966.666667             70        0       10
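
The custom calculation does not have to be a lambda. As a sketch, a regular function can be passed to agg, and pandas uses the function's name as the column label; value_range is a hypothetical name chosen for this example.

```python
import pandas as pd

df = pd.DataFrame({'Gender': ['m', 'm', 'm', 'f', 'f', 'f', 'f', 'm', 'f', 'f'],
                   'ColA': [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]})

def value_range(s):
    """Return max - min of a Series (one group's values)."""
    return s.max() - s.min()

# Mix a built-in aggregation name with the custom function
result = df.groupby('Gender')['ColA'].agg(['mean', value_range])
print(result)
```

A named function is easier to reuse and test than a lambda, and it gives the output column a readable label without an explicit rename.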


Tip: Dealing with Multiple Indexes

Let’s create a grouped DataFrame whose rows are indexed by more than one level (a MultiIndex). You can find more details in the pandas documentation.



In [13]:
ex = df.groupby(['Gender', 'Type']).agg({'ColA':['mean'], 
                          'ColB':['min', 'max'] })
 
ex

                  ColA ColB
                  mean  min max
Gender Type
f      a     40.000000   15  15
       b     75.000000   25  30
       c     73.333333   20  50
m      a     10.000000    0   0
       b     20.000000    5   5
       c     55.000000   10  10


Let’s say now that I want to get the row where Gender == 'f' and Type == 'c'. We can use .loc and pass the values as a tuple:



In [14]:
ex.loc[('f','c')]

ColA  mean    73.333333
ColB  min     20.000000
      max     50.000000
Name: (f, c), dtype: float64

Let’s say that I want to run the same query but this time to get the data only for ColA:


In [15]:
ex.loc[('f','c'), 'ColA']

mean    73.333333
Name: (f, c), dtype: float64

Tip: Slicers with Multiple Indexes

Let’s say that I want to get all the levels from Gender and the levels 'a' and 'b' from Type. I can use slicers as follows:



In [16]:
ex.loc[(slice(None), slice('a','b')), :]

             ColA ColB
             mean  min max
Gender Type
f      a     40.0   15  15
       b     75.0   25  30
m      a     10.0    0   0
       b     20.0    5   5


Notice: You can use slice(None) to select all the contents of a level. You do not need to specify all the deeper levels; they are implied as slice(None). Label slices such as slice('a','b') include both endpoints.
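
The same selection can be written with pd.IndexSlice, which lets you use the familiar colon syntax instead of explicit slice objects. This is a sketch of an equivalent spelling; idx, subset and picked are illustrative names.

```python
import pandas as pd

df = pd.DataFrame({'Gender': ['m', 'm', 'm', 'f', 'f', 'f', 'f', 'm', 'f', 'f'],
                   'Type': ['a', 'b', 'c', 'a', 'b', 'c', 'c', 'c', 'c', 'b'],
                   'ColA': [10, 20, 30, 40, 50, 60, 70, 80, 90, 100],
                   'ColB': [0, 5, 10, 15, 25, 30, 50, 10, 20, 30]})
ex = df.groupby(['Gender', 'Type']).agg({'ColA': ['mean'],
                                         'ColB': ['min', 'max']})

idx = pd.IndexSlice

# All Gender values, Type from 'a' through 'b' (both ends included)
subset = ex.loc[idx[:, 'a':'b'], :]
print(subset)

# A list selects non-contiguous labels on that level
picked = ex.loc[idx[:, ['a', 'c']], :]
print(picked)
```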



Tip: Reset the columns’ MultiIndex levels

As we see in our example DataFrame called ex, we have a MultiIndex even on the columns. Let’s see how we can reset it.



In [17]:
ex.columns = ex.columns.droplevel(0)
ex = ex.rename_axis(None, axis=1)
ex

                  mean  min  max
Gender Type
f      a     40.000000   15   15
       b     75.000000   25   30
       c     73.333333   20   50
m      a     10.000000    0    0
       b     20.000000    5    5
       c     55.000000   10   10
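
Dropping the top level discards the source column name, which can produce duplicate labels when two columns share an aggregation (for example the mean of both ColA and ColB). One possible alternative, sketched here, is to join the two levels into a single flat name; the underscore separator is an arbitrary choice.

```python
import pandas as pd

df = pd.DataFrame({'Gender': ['m', 'm', 'm', 'f', 'f', 'f', 'f', 'm', 'f', 'f'],
                   'Type': ['a', 'b', 'c', 'a', 'b', 'c', 'c', 'c', 'c', 'b'],
                   'ColA': [10, 20, 30, 40, 50, 60, 70, 80, 90, 100],
                   'ColB': [0, 5, 10, 15, 25, 30, 50, 10, 20, 30]})
ex = df.groupby(['Gender', 'Type']).agg({'ColA': ['mean'],
                                         'ColB': ['min', 'max']})

# Each column label is a tuple such as ('ColA', 'mean'); join it into 'ColA_mean'
ex.columns = ['_'.join(col) for col in ex.columns]

# Move Gender and Type from the row index back into ordinary columns
flat = ex.reset_index()
print(flat)
```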


In [18]:
ex.reset_index()

  Gender Type       mean  min  max
0      f    a  40.000000   15   15
1      f    b  75.000000   25   30
2      f    c  73.333333   20   50
3      m    a  10.000000    0    0
4      m    b  20.000000    5    5
5      m    c  55.000000   10   10


In [19]:
ex.reset_index(level=0)

     Gender       mean  min  max
Type
a         f  40.000000   15   15
b         f  75.000000   25   30
c         f  73.333333   20   50
a         m  10.000000    0    0
b         m  20.000000    5    5
c         m  55.000000   10   10
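
The level argument of reset_index moves only the chosen index level into a column and leaves the others as the index. The level can be given by position, as above, or by name, which tends to be more readable. A minimal sketch, assuming a one-column version of the grouped frame:

```python
import pandas as pd

df = pd.DataFrame({'Gender': ['m', 'm', 'm', 'f', 'f', 'f', 'f', 'm', 'f', 'f'],
                   'Type': ['a', 'b', 'c', 'a', 'b', 'c', 'c', 'c', 'c', 'b'],
                   'ColA': [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]})

# Mean of ColA per (Gender, Type), kept as a one-column DataFrame
ex = df.groupby(['Gender', 'Type'])['ColA'].mean().to_frame('mean')

# Move only Gender into a column; Type stays as the index
by_name = ex.reset_index(level='Gender')
print(by_name)
```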
