# 06. Groupby and Aggregate functions
- Grouping and aggregation are operations you will apply to your data frequently.
- They are especially important in EDA (Exploratory Data Analysis).
- **`groupby` is not a single operation**. A grouping analysis includes:
    - Splitting the data into groups
    - Applying an analysis to each group
    - Combining the results
![ring](data/d-ring.jpeg)
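
The three steps listed above can be sketched by hand and compared with the one-line `groupby` form. This is only a minimal illustration: the `toy` DataFrame and its values are made up for the example and are not taken from `diamonds.csv`.

```python
import pandas as pd

# Toy data (hypothetical values, not taken from diamonds.csv).
toy = pd.DataFrame({
    "cut": ["Fair", "Good", "Fair", "Good", "Ideal"],
    "price": [300, 400, 500, 600, 700],
})

# 1. Split: one sub-DataFrame per distinct value of 'cut'.
pieces = {cut: part for cut, part in toy.groupby("cut")}

# 2. Apply: compute the mean price of each piece.
means = {cut: part["price"].mean() for cut, part in pieces.items()}

# 3. Combine: collect the per-group results into one Series.
manual = pd.Series(means, name="price").rename_axis("cut")

# The same three steps written as a single groupby expression.
direct = toy.groupby("cut")["price"].mean()
```

Both `manual` and `direct` hold one mean price per cut; the `groupby` form simply hides the loop.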

In [21]:
#import the libraries
import pandas as pd
import numpy as np

In [30]:
df = pd.read_csv('./data/diamonds.csv')
df.head(10)

,carat,cut,clarity,price
0,0.23,Ideal,SI2,326
1,0.21,Premium,SI1,326
2,0.23,Good,VS1,327
3,0.29,Premium,VS2,334
4,0.31,Good,SI2,335
5,0.24,Very Good,VVS2,336
6,0.24,Very Good,VVS1,336
7,0.26,Very Good,SI1,337
8,0.22,Fair,VS2,337
9,0.23,Very Good,VS1,338


In [26]:
df.describe()

,carat,price
count,53940.0,53940.0
mean,0.79794,3932.799722
std,0.474011,3989.439738
min,0.2,326.0
25%,0.4,950.0
50%,0.7,2401.0
75%,1.04,5324.25
max,5.01,18823.0


## 1. Grouping in a single dimension
Group the data by one column and summarize each group.

In [39]:
# No single aggregate function is applied here; describe() reports summary statistics per group.
df.groupby(['cut']).describe()

,carat,carat,carat,carat,carat,carat,carat,carat,price,price,price,price,price,price,price,price
cut,count,mean,std,min,25%,50%,75%,max,count,mean,std,min,25%,50%,75%,max
Fair,1610.0,1.046137,0.516404,0.22,0.7,1.0,1.2,5.01,1610.0,4358.757764,3560.386612,337.0,2050.25,3282.0,5205.5,18574.0
Good,4906.0,0.849185,0.454054,0.23,0.5,0.82,1.01,3.01,4906.0,3928.864452,3681.589584,327.0,1145.0,3050.5,5028.0,18788.0
Ideal,21551.0,0.702837,0.432876,0.2,0.35,0.54,1.01,3.5,21551.0,3457.54197,3808.401172,326.0,878.0,1810.0,4678.5,18806.0
Premium,13791.0,0.891955,0.515262,0.2,0.41,0.86,1.2,4.01,13791.0,4584.257704,4349.204961,326.0,1046.0,3185.0,6296.0,18823.0
Very Good,12082.0,0.806381,0.459435,0.2,0.41,0.71,1.02,4.0,12082.0,3981.759891,3935.862161,336.0,912.0,2648.0,5372.75,18818.0


## 2. Grouping in multiple dimensions
To split the data on a combination of columns, pass a list of column names.

In [40]:
# numeric_only=True skips text columns such as 'clarity', which have no mean.
df.groupby(['cut', 'carat']).mean(numeric_only=True)

cut,carat,price
Fair,0.22,337.000000
Fair,0.23,369.000000
Fair,0.25,645.666667
Fair,0.27,371.000000
Fair,0.29,1184.000000
...,...,...
Very Good,2.70,14341.000000
Very Good,2.74,17174.000000
Very Good,3.00,6512.000000
Very Good,3.04,15354.000000
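
Grouping by several columns produces a result whose index has one level per grouping column. The sketch below shows a few common ways to work with such a result. It uses a made-up `toy` DataFrame (hypothetical values, not the diamonds data) and groups by `cut` and `clarity` to keep the output small.

```python
import pandas as pd

# Toy data (hypothetical values, not taken from diamonds.csv).
toy = pd.DataFrame({
    "cut": ["Fair", "Fair", "Good", "Good", "Fair"],
    "clarity": ["SI1", "SI1", "SI1", "VS1", "VS1"],
    "price": [300, 500, 400, 600, 700],
})

# Mean price per (cut, clarity) pair: a Series with a two-level MultiIndex.
by_two = toy.groupby(["cut", "clarity"])["price"].mean()

# Look up one group with a tuple of keys.
fair_si1 = by_two.loc[("Fair", "SI1")]

# Turn the index levels back into ordinary columns.
flat = by_two.reset_index()

# Pivot the inner level into columns: one row per cut, one column per clarity.
wide = by_two.unstack("clarity")
```

`reset_index()` is convenient when the result feeds into further filtering or plotting, while `unstack()` gives a compact cross-table view.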


### NOTE:
- The `groupby()` function returns a pandas GroupBy object (here a `DataFrameGroupBy`), on which you then call the desired aggregate functions.
- `groupby()` can be applied to a Series as well as to a DataFrame.

In [41]:
type(df.groupby(['cut', 'carat']))

pandas.core.groupby.generic.DataFrameGroupBy
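
As a small illustration of the note above, the GroupBy object can be inspected before any aggregation is run, and a Series can be grouped in the same way. The `toy` DataFrame is a made-up stand-in for the diamonds data, and the variable names are only illustrative.

```python
import pandas as pd

# Toy data (hypothetical values, not taken from diamonds.csv).
toy = pd.DataFrame({
    "cut": ["Fair", "Good", "Fair", "Good", "Ideal"],
    "carat": [1.0, 0.5, 1.2, 0.7, 0.3],
    "price": [300, 400, 500, 600, 700],
})

grouped = toy.groupby("cut")            # DataFrameGroupBy; no aggregation has run yet
n_groups = grouped.ngroups              # number of distinct 'cut' values
fair_rows = grouped.get_group("Fair")   # the rows that belong to one group
sizes = grouped.size()                  # number of rows per group, as a Series

# Grouping a Series: pass a key Series of the same length.
series_grouped = toy["price"].groupby(toy["cut"])   # SeriesGroupBy
max_price = series_grouped.max()
```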

_______
## 3. Aggregate functions - `agg()`
- So far you have seen how to use `groupby` to split the data into groups and summarize each group.
- Groups can be aggregated with various built-in pandas functions or with custom functions.
- pandas also provides the **`agg()`** interface, which accepts a function (such as a NumPy function), a function name as a string, or a list of these as a parameter to aggregate the groups.
- The most frequently used aggregations are:
    - **sum:** Return the sum of the values in each group
    - **min:** Return the minimum of the values in each group
    - **max:** Return the maximum of the values in each group
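
The forms of `agg()` described above can be sketched as follows. The data is a made-up `toy` DataFrame (hypothetical values), and the output column names `total_price` and `max_carat` are chosen only for illustration.

```python
import pandas as pd

# Toy data (hypothetical values, not taken from diamonds.csv).
toy = pd.DataFrame({
    "cut": ["Fair", "Good", "Fair", "Good", "Ideal"],
    "carat": [1.0, 0.5, 1.2, 0.7, 0.3],
    "price": [300, 400, 500, 600, 700],
})
by_cut = toy.groupby("cut")

# One function on one column -> a Series indexed by 'cut'.
total_price = by_cut["price"].agg("sum")

# Several functions on one column -> a DataFrame with one column per function.
price_stats = by_cut["price"].agg(["sum", "min", "max"])

# Different functions per column, given as a dict of column -> function(s).
mixed = by_cut.agg({"carat": "mean", "price": ["min", "max"]})

# Named aggregation: pick the output column names yourself.
named = by_cut.agg(total_price=("price", "sum"), max_carat=("carat", "max"))
```

Passing a list or a dict returns a DataFrame even when only one function is given, which is why `agg('mean')` and `agg(['mean'])` display differently.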

In [1]:
my_grp = df.groupby('cut')


### 3.1. Aggregate with a single function

In [56]:
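# The name 'mean' gives the same result as np.mean; recent pandas versions warn when a NumPy callable is passed.
my_grp['price'].agg('mean')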

cut
Fair         4358.757764
Good         3928.864452
Ideal        3457.541970
Premium      4584.257704
Very Good    3981.759891
Name: price, dtype: float64

In [57]:
# A one-element list returns a DataFrame instead of a Series.
my_grp['price'].agg(['mean'])

cut,mean
Fair,4358.757764
Good,3928.864452
Ideal,3457.54197
Premium,4584.257704
Very Good,3981.759891


### 3.2. Aggregate with multiple functions

In [54]:
# One output column per function name.
my_grp['price'].agg(['mean', 'median', 'std'])

cut,mean,median,std
Fair,4358.757764,3282.0,3560.386612
Good,3928.864452,3050.5,3681.589584
Ideal,3457.54197,1810.0,3808.401172
Premium,4584.257704,3185.0,4349.204961
Very Good,3981.759891,2648.0,3935.862161


__________
## 4. nth element
- To select the nth row of each group of a grouped DataFrame or Series, use **`nth()`**.
- Passing an integer `n` returns a single row per group. Positions are zero-based, so `nth(2)` is the third row of each group.

In [66]:
# Third price (zero-based position 2) within each cut
my_grp['price'].nth(2)

cut
Fair         2759
Good          339
Ideal         344
Premium       342
Very Good     337
Name: price, dtype: int64
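
A few related per-group selectors are sketched below on a made-up `toy` DataFrame (hypothetical values). One caveat, stated as an assumption based on the pandas release notes: from pandas 2.0 onward, `nth()` is understood to keep the original row labels rather than the group keys as its index, so the example only looks at the selected values.

```python
import pandas as pd

# Toy data (hypothetical values, not taken from diamonds.csv).
toy = pd.DataFrame({
    "cut": ["Fair", "Good", "Fair", "Good", "Ideal"],
    "price": [300, 400, 500, 600, 700],
})
prices = toy.groupby("cut")["price"]

first_price = prices.first()   # first non-null price per group, indexed by 'cut'
last_price = prices.last()     # last non-null price per group, indexed by 'cut'

# nth(1) is the second row of each group (positions are zero-based).
# 'Ideal' has only one row, so it does not appear in the result.
# Assumption: the index of this result differs between pandas versions.
second_price = prices.nth(1)

# head(2) keeps up to the first two rows of every group, in original row order.
first_two = prices.head(2)
```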