<img src="https://pandas.pydata.org/static/img/pandas.svg" width="250">

## <center> Transform your data with `groupby` </center>
    
 + simple group by with `groupby()`
 + reset_index()
 + df.groupby().agg()
 + groupby object
     + get_group()
     + max()
     + lambda
     + filter()

In [1]:
import pandas as pd

In [2]:
df = pd.read_csv('../data/iris.csv')
df.head(5)

,sepal_length,sepal_width,petal_length,petal_width,species
0,5.1,3.5,1.4,0.2,setosa
1,4.9,3.0,1.4,0.2,setosa
2,4.7,3.2,1.3,0.2,setosa
3,4.6,3.1,1.5,0.2,setosa
4,5.0,3.6,1.4,0.2,setosa


---------

# A simple groupby on one dimension with one aggregation for all variables

### Get the maximum of each feature per species

In [3]:
df_grouped = df.groupby(['species']).max()
df_grouped

species,sepal_length,sepal_width,petal_length,petal_width
setosa,5.8,4.4,1.9,0.6
versicolor,7.0,3.4,5.1,1.8
virginica,7.9,3.8,6.9,2.5


### Move the `species` index back into a column with `reset_index`

In [4]:
df_grouped = df_grouped.reset_index()
df_grouped

# we can reset the index of the grouped DataFrame
# the species index has now become a regular column

,species,sepal_length,sepal_width,petal_length,petal_width
0,setosa,5.8,4.4,1.9,0.6
1,versicolor,7.0,3.4,5.1,1.8
2,virginica,7.9,3.8,6.9,2.5
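
The two steps above (aggregate, then `reset_index()`) can also be written in one step with the `as_index=False` option of `groupby`. This is a minimal sketch on a small hand-made frame; the name `toy` and its values are illustrative assumptions, not rows from `iris.csv`.

```python
import pandas as pd

# Toy data with iris-like column names (values are illustrative only).
toy = pd.DataFrame({
    "species": ["setosa", "setosa", "versicolor", "versicolor"],
    "sepal_length": [5.1, 4.9, 7.0, 6.4],
    "petal_length": [1.4, 1.5, 4.7, 4.5],
})

# Two steps: the group key lands in the index, then reset_index() moves it back.
two_step = toy.groupby("species").max().reset_index()

# One step: as_index=False keeps the group key as a regular column.
one_step = toy.groupby("species", as_index=False).max()

print(two_step)
print(one_step)
```

Both results should have `species` as an ordinary column and a default integer index, so either form can be used depending on taste.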


---------

# Different aggregation methods for different variables
+ **`df.groupby().agg()`**

In [5]:
aggregated_df = df.groupby(['species']).agg({
    'sepal_length': ['min', 'max', 'mean'],
    'sepal_width': ['count']
})

In [6]:
aggregated_df

,sepal_length,sepal_length,sepal_length,sepal_width
species,min,max,mean,count
setosa,4.3,5.8,5.006,50
versicolor,4.9,7.0,5.936,50
virginica,4.9,7.9,6.588,50


In [7]:
aggregated_df['sepal_length']

species,min,max,mean
setosa,4.3,5.8,5.006
versicolor,4.9,7.0,5.936
virginica,4.9,7.9,6.588


In [8]:
aggregated_df['sepal_width']

species,count
setosa,50
versicolor,50
virginica,50
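
Selecting with a single top-level name, as above, returns every aggregation under that name. To reach one specific aggregation, the two-level column can be addressed with a tuple. The sketch below assumes a small hand-made frame (`toy` and `stats` are illustrative names, and the values are not from `iris.csv`).

```python
import pandas as pd

# Toy data with iris-like column names (values are illustrative only).
toy = pd.DataFrame({
    "species": ["setosa", "setosa", "versicolor", "versicolor"],
    "sepal_length": [5.1, 4.9, 7.0, 6.4],
    "sepal_width": [3.5, 3.0, 3.2, 3.2],
})

stats = toy.groupby("species").agg({
    "sepal_length": ["min", "max", "mean"],
    "sepal_width": ["count"],
})

# A (top level, second level) tuple selects exactly one column as a Series.
mean_col = stats[("sepal_length", "mean")]

# .loc accepts a row label plus the same tuple to read a single cell.
setosa_max = stats.loc["setosa", ("sepal_length", "max")]

print(mean_col)
print(setosa_max)
```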


-------

# Flattening hierarchical column indexes

In [10]:
aggregated_df.head()

,sepal_length,sepal_length,sepal_length,sepal_width
species,min,max,mean,count
setosa,4.3,5.8,5.006,50
versicolor,4.9,7.0,5.936,50
virginica,4.9,7.9,6.588,50


We can flatten the two-level columns of `aggregated_df` by joining each (column, aggregation) pair with an underscore, like below.

In [16]:
['_'.join(col).strip()  for col in aggregated_df.columns.values]

['sepal_length_min',
 'sepal_length_max',
 'sepal_length_mean',
 'sepal_width_count']

In [17]:
aggregated_df.columns = ['_'.join(col).strip()  for col in aggregated_df.columns.values]

In [18]:
aggregated_df

species,sepal_length_min,sepal_length_max,sepal_length_mean,sepal_width_count
setosa,4.3,5.8,5.006,50
versicolor,4.9,7.0,5.936,50
virginica,4.9,7.9,6.588,50


In [19]:
aggregated_df.reset_index()

,species,sepal_length_min,sepal_length_max,sepal_length_mean,sepal_width_count
0,setosa,4.3,5.8,5.006,50
1,versicolor,4.9,7.0,5.936,50
2,virginica,4.9,7.9,6.588,50
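
An alternative that avoids the two-level columns in the first place is pandas' named aggregation, where each output column is given as `new_name=(column, function)`. The sketch below is one possible way to get the same flat layout, shown on a small hand-made frame (`toy` and `flat` are illustrative names, and the values are not from `iris.csv`).

```python
import pandas as pd

# Toy data with iris-like column names (values are illustrative only).
toy = pd.DataFrame({
    "species": ["setosa", "setosa", "versicolor", "versicolor"],
    "sepal_length": [5.1, 4.9, 7.0, 6.4],
    "sepal_width": [3.5, 3.0, 3.2, 3.2],
})

# Each keyword names an output column; the tuple is (input column, aggregation).
flat = toy.groupby("species").agg(
    sepal_length_min=("sepal_length", "min"),
    sepal_length_max=("sepal_length", "max"),
    sepal_length_mean=("sepal_length", "mean"),
    sepal_width_count=("sepal_width", "count"),
).reset_index()

print(flat.columns.tolist())
print(flat)
```

The trade-off is verbosity: every output column has to be spelled out, whereas the dict-of-lists form plus a join is shorter when many aggregations are applied.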


------

# Specify groupings prior to any aggregation

### Creating a groupby object
+ we can create the groupby object first and choose the aggregation later

In [22]:
# group by a plain string key so that get_group() accepts a scalar name
groupings = df.groupby('species')

groupings

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x000000B7736EF160>

### `get_group` to retrieve the rows of a specific group

In [26]:
groupings.get_group('setosa').head()

,sepal_length,sepal_width,petal_length,petal_width,species
0,5.1,3.5,1.4,0.2,setosa
1,4.9,3.0,1.4,0.2,setosa
2,4.7,3.2,1.3,0.2,setosa
3,4.6,3.1,1.5,0.2,setosa
4,5.0,3.6,1.4,0.2,setosa
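
Besides `get_group`, a groupby object can be iterated directly, yielding one `(name, sub-DataFrame)` pair per group. The following is a minimal sketch on a small hand-made frame (`toy`, `grouped`, and `sizes` are illustrative names; the values are not from `iris.csv`).

```python
import pandas as pd

# Toy data with iris-like column names (values are illustrative only).
toy = pd.DataFrame({
    "species": ["setosa", "setosa", "versicolor", "versicolor"],
    "petal_length": [1.4, 1.5, 4.7, 4.5],
})

grouped = toy.groupby("species")

# Iterating yields the group key and the rows that belong to it.
sizes = {}
for name, group in grouped:
    sizes[name] = len(group)

# get_group returns the same rows for a single key.
setosa = grouped.get_group("setosa")

print(grouped.ngroups)
print(sizes)
print(setosa)
```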


### Get the `max` of each column per group

In [27]:
groupings.max()

species,sepal_length,sepal_width,petal_length,petal_width
setosa,5.8,4.4,1.9,0.6
versicolor,7.0,3.4,5.1,1.8
virginica,7.9,3.8,6.9,2.5


### Using `lambda` to apply a function to each group

In [29]:
groupings.apply(lambda x: x.max())

species,sepal_length,sepal_width,petal_length,petal_width
setosa,5.8,4.4,1.9,0.6
versicolor,7.0,3.4,5.1,1.8
virginica,7.9,3.8,6.9,2.5
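
The lambda above reproduces the built-in `max()`. A lambda becomes more useful when no built-in aggregation exists, for example the range (max minus min) of a column within each group. This is a sketch on a small hand-made frame (`toy` and `ranges` are illustrative names; the values are not from `iris.csv`).

```python
import pandas as pd

# Toy data with iris-like column names (values are illustrative only).
toy = pd.DataFrame({
    "species": ["setosa", "setosa", "versicolor", "versicolor"],
    "petal_length": [1.4, 1.9, 4.7, 5.1],
})

# Select one column, then aggregate each group's Series with a custom function.
ranges = toy.groupby("species")["petal_length"].agg(lambda s: s.max() - s.min())

print(ranges)
```

Selecting the column before aggregating keeps the lambda working on a single numeric Series, so it never sees the `species` strings.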


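### `filter` on a group object
+ similar to SQL `HAVING`: the condition is evaluated once per group, and a group's rows are kept or dropped as a whole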

In [31]:
groupings.filter(lambda x: x['petal_length'].max() < 5)

# only setosa rows are returned, because setosa is the only species whose maximum petal_length is less than 5

,sepal_length,sepal_width,petal_length,petal_width
0,5.1,3.5,1.4,0.2
1,4.9,3.0,1.4,0.2
2,4.7,3.2,1.3,0.2
3,4.6,3.1,1.5,0.2
4,5.0,3.6,1.4,0.2
5,5.4,3.9,1.7,0.4
6,4.6,3.4,1.4,0.3
7,5.0,3.4,1.5,0.2
8,4.4,2.9,1.4,0.2
9,4.9,3.1,1.5,0.1
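
To make the difference between filtering rows and filtering groups concrete, the sketch below applies the same threshold both ways on a small hand-made frame (`toy`, `row_level`, and `group_level` are illustrative names; the values are not from `iris.csv`).

```python
import pandas as pd

# Toy data with iris-like column names (values are illustrative only).
toy = pd.DataFrame({
    "species": ["setosa", "setosa", "versicolor", "versicolor"],
    "petal_length": [1.4, 1.9, 4.7, 5.1],
})

# Row-level filter (like SQL WHERE): each row is tested on its own.
row_level = toy[toy["petal_length"] < 5]

# Group-level filter (like SQL HAVING): a group survives only if the
# condition on the whole group is True; here versicolor fails (max is 5.1).
group_level = toy.groupby("species").filter(
    lambda g: g["petal_length"].max() < 5
)

print(row_level)
print(group_level)
```

The row-level result still contains the versicolor row with `petal_length` 4.7, while the group-level result drops every versicolor row.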
