# Data Analysis in Python - V: Grouping and Aggregating Data

## Introduction

In this lesson, we will learn how to group and aggregate data.



## Loading the data set into a data frame

In [2]:
# read poverty data, set index to country and print first 20 rows

import pandas as pd

povData = pd.read_csv('../scratch/PovertyData.csv', sep=',',na_values="*")
# povData = povData.set_index('Country', drop=False)
# OR
povData.set_index('Country', drop=False, inplace = True)
povData.head(20)

Country,LiveBirthRate,DeathRate,InfantDeaths,MaleLifeExpectancy,FemaleLifeExpectancy,GNI,Region,Country
Albania,24.7,5.7,30.8,69.6,75.5,600.0,1,Albania
Bulgaria,12.5,11.9,14.4,68.3,74.7,2250.0,1,Bulgaria
Czechoslovakia,13.4,11.7,11.3,71.8,77.7,2980.0,1,Czechoslovakia
Former_E._Germany,12.0,12.4,7.6,69.8,75.9,,1,Former_E._Germany
Hungary,11.6,13.4,14.8,65.4,73.8,2780.0,1,Hungary
Poland,14.3,10.2,16.0,67.2,75.7,1690.0,1,Poland
Romania,13.6,10.7,26.9,66.5,72.4,1640.0,1,Romania
Yugoslavia,14.0,9.0,20.2,68.6,74.5,,1,Yugoslavia
USSR,17.7,10.0,23.0,64.6,74.0,2242.0,1,USSR
Byelorussian_SSR,15.2,9.5,13.1,66.4,75.9,1880.0,1,Byelorussian_SSR


## Grouping and Aggregating Data

We can use the `groupby` method of a data frame to group data on one or more columns (fields). It returns a `DataFrameGroupBy` object. This is a different data type from `DataFrame` and does not support all the operations that work on a data frame.

In [2]:
# group data by region
povData.groupby('Region')

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7fbef9ffa850>

In [3]:
type(povData.groupby('Region'))

pandas.core.groupby.generic.DataFrameGroupBy
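
Because a `DataFrameGroupBy` object is not a data frame, it can help to inspect it directly. The sketch below is a minimal illustration on a small made-up frame (`toy` and its values are hypothetical, not the poverty data). It shows the `ngroups` attribute and that iterating over the grouped object yields pairs of group key and sub-frame.

```python
import pandas as pd

# Hypothetical toy data, used only for illustration
toy = pd.DataFrame({
    "Region": [1, 1, 2, 2, 2],
    "Country": ["A", "B", "C", "D", "E"],
    "GNI": [1000.0, 3000.0, 500.0, 700.0, 900.0],
})

grouped = toy.groupby("Region")

# number of distinct groups
n_groups = grouped.ngroups

# iterating yields (group key, sub-DataFrame) pairs
group_sizes = {}
for region, frame in grouped:
    group_sizes[int(region)] = len(frame)

print(n_groups)
print(group_sizes)
```

Each `frame` in the loop is an ordinary `DataFrame`, so the usual data frame operations apply to it.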

You can retrieve a particular group from the result.

In [4]:
# retrieve data for region 1
grp_data=povData.groupby('Region')
grp_data.get_group(1)

Country,LiveBirthRate,DeathRate,InfantDeaths,MaleLifeExpectancy,FemaleLifeExpectancy,GNI,Region,Country
Albania,24.7,5.7,30.8,69.6,75.5,600.0,1,Albania
Bulgaria,12.5,11.9,14.4,68.3,74.7,2250.0,1,Bulgaria
Czechoslovakia,13.4,11.7,11.3,71.8,77.7,2980.0,1,Czechoslovakia
Former_E._Germany,12.0,12.4,7.6,69.8,75.9,,1,Former_E._Germany
Hungary,11.6,13.4,14.8,65.4,73.8,2780.0,1,Hungary
Poland,14.3,10.2,16.0,67.2,75.7,1690.0,1,Poland
Romania,13.6,10.7,26.9,66.5,72.4,1640.0,1,Romania
Yugoslavia,14.0,9.0,20.2,68.6,74.5,,1,Yugoslavia
USSR,17.7,10.0,23.0,64.6,74.0,2242.0,1,USSR
Byelorussian_SSR,15.2,9.5,13.1,66.4,75.9,1880.0,1,Byelorussian_SSR


## Aggregating Data by Groups
The typical use of grouping is to apply an aggregation function to the grouped data, for example to calculate the average GNI of the countries in each region or the minimum live birth rate for each region. Aggregation returns a Series when a single column is selected and a Data Frame when a list of columns is selected.

In [5]:
# avg GNI by region 
povData.groupby('Region')['GNI'].mean()

Region
1     1931.333333
2     1672.500000
3    18706.000000
4     7392.000000
5     2332.000000
6      852.592593
Name: GNI, dtype: float64

In [10]:
type(povData.groupby('Region')['GNI'].mean())

pandas.core.series.Series

In [16]:
# min birth and death rates by region
povData.groupby('Region')[['LiveBirthRate','DeathRate']].min()

Region,LiveBirthRate,DeathRate
1,11.6,5.7
2,18.0,4.4
3,9.7,6.7
4,22.3,2.2
5,11.7,4.9
6,31.1,7.3


In [20]:
type(povData.groupby('Region')[['LiveBirthRate','DeathRate']].min())

pandas.core.frame.DataFrame
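
Whether aggregation returns a Series or a Data Frame depends on how the column is selected. The following sketch, on a hypothetical toy frame (the values are made up), contrasts single brackets with double brackets for the same column.

```python
import pandas as pd

# Hypothetical toy data, used only for illustration
toy = pd.DataFrame({
    "Region": [1, 1, 2, 2, 2],
    "GNI": [1000.0, 3000.0, 500.0, 700.0, 900.0],
})

# a single column name selects a Series per group, so the result is a Series
mean_series = toy.groupby("Region")["GNI"].mean()

# a list of column names (even of length one) gives a DataFrame
mean_frame = toy.groupby("Region")[["GNI"]].mean()

print(type(mean_series))
print(type(mean_frame))
```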

When the result of grouping and aggregation is a Series or Data Frame, you can apply the relevant operations to the results. 

In [19]:
# min birth and death rates for region 3
povData.groupby('Region')[['LiveBirthRate','DeathRate']].min().loc[3]

LiveBirthRate    9.7
DeathRate        6.7
Name: 3, dtype: float64

In [21]:
# calculate min birth and death rates by region and display birth rates only
povData.groupby('Region')[['LiveBirthRate','DeathRate']].min()['LiveBirthRate']

Region
1    11.6
2    18.0
3     9.7
4    22.3
5    11.7
6    31.1
Name: LiveBirthRate, dtype: float64

In [23]:
# calculate avg birth and death rates by region and retrieve the lowest regional average for each column
povData.groupby('Region')[['LiveBirthRate','DeathRate']].mean().min()

LiveBirthRate    12.852632
DeathRate         6.754545
dtype: float64
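
The minimum alone does not say which region it belongs to. As a possible follow-up, `idxmin()` returns the index label (here the region) of the lowest value in each column. The frame below is a hypothetical toy example, not the poverty data.

```python
import pandas as pd

# Hypothetical toy data, used only for illustration
toy = pd.DataFrame({
    "Region": [1, 1, 2, 2],
    "LiveBirthRate": [10.0, 14.0, 30.0, 20.0],
    "DeathRate": [12.0, 8.0, 6.0, 6.0],
})

# average rates per region
avg_rates = toy.groupby("Region")[["LiveBirthRate", "DeathRate"]].mean()

# region label with the lowest average, one entry per column
lowest_region = avg_rates.idxmin()
print(lowest_region)
```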

It is possible to group on multiple columns.

In [25]:
# group by region and male life expectancy and calculate avg birth and death rates
povData.groupby(['Region','MaleLifeExpectancy'])[['LiveBirthRate','DeathRate']].mean()

Region,MaleLifeExpectancy,LiveBirthRate,DeathRate
1,64.6,17.70,10.00
1,65.4,11.60,13.40
1,66.4,14.30,10.55
1,66.5,13.60,10.70
1,67.2,14.30,10.20
...,...,...,...
6,57.5,32.10,9.90
6,57.8,38.80,9.50
6,59.1,39.75,9.60
6,61.6,35.50,8.30


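This returns a data frame with a hierarchical (multi-level) index built from the grouping columns. You can flatten it by moving the index levels back into ordinary columns with `reset_index()`.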

In [26]:
# flatten the data
grp_data=povData.groupby(['Region','MaleLifeExpectancy'])[['LiveBirthRate','DeathRate']].mean()
flat_data=grp_data.reset_index()
flat_data

,Region,MaleLifeExpectancy,LiveBirthRate,DeathRate
0,1,64.6,17.70,10.00
1,1,65.4,11.60,13.40
2,1,66.4,14.30,10.55
3,1,66.5,13.60,10.70
4,1,67.2,14.30,10.20
...,...,...,...,...
84,6,57.5,32.10,9.90
85,6,57.8,38.80,9.50
86,6,59.1,39.75,9.60
87,6,61.6,35.50,8.30
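
An alternative to flattening afterwards is to ask `groupby` not to build the hierarchical index in the first place, using its `as_index=False` parameter. The sketch below uses a hypothetical toy frame (the `Band` column and all values are invented for illustration) and computes the same result both ways.

```python
import pandas as pd

# Hypothetical toy data, used only for illustration
toy = pd.DataFrame({
    "Region": [1, 1, 1, 2, 2],
    "Band": ["low", "low", "high", "low", "high"],
    "LiveBirthRate": [10.0, 14.0, 18.0, 30.0, 20.0],
})

# group, aggregate, then flatten the hierarchical index
via_reset = toy.groupby(["Region", "Band"])[["LiveBirthRate"]].mean().reset_index()

# keep the grouping keys as ordinary columns from the start
via_flag = toy.groupby(["Region", "Band"], as_index=False)[["LiveBirthRate"]].mean()

print(via_flag)
```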


You may be interested in counting the number of observations in each group or counting the number of unique values in each group. Note that `count()` counts only non-missing values.

In [27]:
# count the number of countries in each region
povData.groupby('Region')['Country'].count()

Region
1    11
2    12
3    19
4    11
5    17
6    27
Name: Country, dtype: int64

In [28]:
# count the number of unique male life expectancy values in each region

povData.groupby('Region')['MaleLifeExpectancy'].nunique()

Region
1    10
2    11
3    17
4    11
5    17
6    23
Name: MaleLifeExpectancy, dtype: int64

In [3]:
# counts for all groups and columns
povData.groupby('Region').count()

Region,LiveBirthRate,DeathRate,InfantDeaths,MaleLifeExpectancy,FemaleLifeExpectancy,GNI,Country
1,11,11,11,11,11,9,11
2,12,12,12,12,12,12,12
3,19,19,19,19,19,19,19
4,11,11,11,11,11,10,11
5,17,17,17,17,17,14,17
6,27,27,27,27,27,27,27
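
The GNI counts above are lower than the other columns for some regions because `count()` skips missing values. If the number of rows per group is what you need, `size()` is one option. A minimal sketch on hypothetical toy data with one missing GNI value:

```python
import pandas as pd

# Hypothetical toy data, used only for illustration; one GNI value is missing
toy = pd.DataFrame({
    "Region": [1, 1, 2, 2, 2],
    "GNI": [1000.0, None, 500.0, 700.0, 900.0],
})

# size() counts rows in each group, missing values included
rows_per_region = toy.groupby("Region").size()

# count() counts only the non-missing values of the selected column
gni_per_region = toy.groupby("Region")["GNI"].count()

print(rows_per_region)
print(gni_per_region)
```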


Look through the [groupby documentation](https://pandas.pydata.org/pandas-docs/stable/reference/groupby.html) to learn about other built-in aggregation functions. 

In [32]:
# list the attributes and methods available on the grouped object
grp=povData.groupby('Region')
dir(grp)

['Country',
 'DeathRate',
 'FemaleLifeExpectancy',
 'GNI',
 'InfantDeaths',
 'LiveBirthRate',
 'MaleLifeExpectancy',
 'Region',
 '__annotations__',
 '__class__',
 '__class_getitem__',
 '__delattr__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattr__',
 '__getattribute__',
 '__getitem__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__iter__',
 '__le__',
 '__len__',
 '__lt__',
 '__module__',
 '__ne__',
 '__new__',
 '__orig_bases__',
 '__parameters__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__sizeof__',
 '__slots__',
 '__str__',
 '__subclasshook__',
 '__weakref__',
 '_accessors',
 '_agg_examples_doc',
 '_agg_general',
 '_agg_py_fallback',
 '_aggregate_frame',
 '_aggregate_item_by_item',
 '_aggregate_with_numba',
 '_apply_allowlist',
 '_apply_filter',
 '_apply_to_column_groupbys',
 '_bool_agg',
 '_can_use_transform_fast',
 '_choose_path',
 '_concat_objects',
 '_constructor',
 '_cumcount_array',
 ...

You may want to compute multiple statistics at the same time. This can be done by passing a list of function names to the `aggregate()` method (also available under the shorter alias `agg()`).

In [34]:
# mean and std dev of birth and death rates for each region
povData.groupby('Region')[['LiveBirthRate','DeathRate']].aggregate(['mean','std'])

,LiveBirthRate,LiveBirthRate,DeathRate,DeathRate
Region,mean,std,mean,std
1,14.763636,3.690331,10.554545,2.080079
2,29.175,7.38649,9.416667,5.510293
3,12.852632,1.941754,9.431579,1.492594
4,33.9,8.626355,6.754545,2.654191
5,29.617647,8.911316,10.217647,4.659431
6,44.525926,5.68592,14.622222,4.799947
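
The list passed to `aggregate()` is not limited to built-in names; it can also contain your own functions. The sketch below mixes the string `'mean'` with a user-defined function. The helper `value_range` and the toy data are hypothetical, introduced only for illustration.

```python
import pandas as pd

# Hypothetical toy data, used only for illustration
toy = pd.DataFrame({
    "Region": [1, 1, 2, 2, 2],
    "LiveBirthRate": [10.0, 14.0, 30.0, 20.0, 25.0],
})


def value_range(values):
    """Spread between the largest and smallest value in a group."""
    return values.max() - values.min()


# the result has one column per function, named after the function
stats = toy.groupby("Region")["LiveBirthRate"].aggregate(["mean", value_range])
print(stats)
```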


Suppose you want to compute a different statistic for each column. In that case, pass a dictionary that maps each column name to the function to apply.

In [35]:
# avg birth rate and std dev of death rate by region
povData.groupby('Region')[['LiveBirthRate','DeathRate']].aggregate({'LiveBirthRate':'mean','DeathRate':'std'})

Region,LiveBirthRate,DeathRate
1,14.763636,2.080079
2,29.175,5.510293
3,12.852632,1.492594
4,33.9,2.654191
5,29.617647,4.659431
6,44.525926,4.799947


Note: Keep in mind that dictionaries cannot have duplicate keys. If you want to apply multiple functions to the same column, map that column to a list of functions instead of repeating the key.

In [39]:
# mean and std dev of birth rate and std dev of death rate by region, shown with two decimals
pd.options.display.float_format="{:,.2f}".format
povData.groupby('Region')[['LiveBirthRate','DeathRate']].aggregate({'LiveBirthRate':['mean','std'],'DeathRate':'std'})

,LiveBirthRate,LiveBirthRate,DeathRate
Region,mean,std,std
1,14.76,3.69,2.08
2,29.18,7.39,5.51
3,12.85,1.94,1.49
4,33.9,8.63,2.65
5,29.62,8.91,4.66
6,44.53,5.69,4.8
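
Mapping a column to a list of functions produces two-level column headers. If flat, self-describing column names are preferred, pandas also supports named aggregation, where each keyword argument is an output column name and its value is a `(column, function)` pair. A minimal sketch on hypothetical toy data (the output names `birth_mean`, `birth_std` and `death_std` are arbitrary choices):

```python
import pandas as pd

# Hypothetical toy data, used only for illustration
toy = pd.DataFrame({
    "Region": [1, 1, 2, 2, 2],
    "LiveBirthRate": [10.0, 14.0, 30.0, 20.0, 25.0],
    "DeathRate": [12.0, 8.0, 6.0, 9.0, 3.0],
})

# keyword = output column name, value = (input column, function)
summary = toy.groupby("Region").agg(
    birth_mean=("LiveBirthRate", "mean"),
    birth_std=("LiveBirthRate", "std"),
    death_std=("DeathRate", "std"),
)
print(summary)
```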


In [37]:
# Formatting float values
pd.options.display.float_format="{:,.2f}".format
povData.groupby('Region')[['LiveBirthRate','DeathRate']].aggregate({'LiveBirthRate':'mean','DeathRate':'std'})

Region,LiveBirthRate,DeathRate
1,14.76,2.08
2,29.18,5.51
3,12.85,1.49
4,33.9,2.65
5,29.62,4.66
6,44.53,4.8
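
Setting `pd.options.display.float_format` changes the display of every data frame for the rest of the session. If the formatting is wanted for one result only, `pd.option_context` can apply it temporarily. The sketch below uses hypothetical toy data and renders the result with `to_string()` to make the difference visible.

```python
import pandas as pd

# Hypothetical toy data, used only for illustration
toy = pd.DataFrame({
    "Region": [1, 1, 1, 2, 2],
    "LiveBirthRate": [10.0, 14.0, 13.0, 30.0, 20.0],
})

means = toy.groupby("Region")[["LiveBirthRate"]].mean()

# remember the current setting so we can confirm it is restored
setting_before = pd.get_option("display.float_format")

# two-decimal formatting applies only inside the with block
with pd.option_context("display.float_format", "{:,.2f}".format):
    formatted_text = means.to_string()

setting_after = pd.get_option("display.float_format")
default_text = means.to_string()

print(formatted_text)
print(default_text)
```

The underlying values are unchanged in both cases; only their printed representation differs.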
