## Data Aggregation

Aggregation operations generally involve the following steps:

1. Split the DataFrame into groups
2. Apply a function to each group
3. Combine the results into one data structure


In [None]:
# aggregation using a loop
import pandas as pd

happiness2015 = pd.read_csv("world-happiness-2015.csv")

mean_happiness = {}
regions = happiness2015['Region'].unique()

for r in regions:
    #1. Split the dataframe into groups.
    region_group = happiness2015[happiness2015['Region'] == r]
    #2. Apply a function to each group.
    region_mean = region_group['Happiness Score'].mean()
    #3. Combine the results into one data structure.
    mean_happiness[r] = region_mean

mean_happiness    

{'Western Europe': 6.689619047619048,
 'North America': 7.273,
 'Australia and New Zealand': 7.285,
 'Middle East and Northern Africa': 5.406899999999999,
 'Latin America and Caribbean': 6.1446818181818195,
 'Southeastern Asia': 5.317444444444444,
 'Central and Eastern Europe': 5.332931034482757,
 'Eastern Asia': 5.626166666666666,
 'Sub-Saharan Africa': 4.2028,
 'Southern Asia': 4.580857142857143}

pandas provides a `groupby` operation that condenses these three steps into two:

1. Create a `GroupBy` object
2. Call a function of the `GroupBy` object

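The `GroupBy` object maps each group name to the row labels it covers in the original DataFrame, so the data is split once and the same object can be reused for several aggregations. This is typically faster and more concise than filtering the DataFrame inside a loop.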
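
As a minimal sketch of the two steps, the snippet below uses a small made-up DataFrame (the name `toy` and its values are assumptions for illustration, not rows from the happiness dataset) and checks that `groupby` yields the same means as an explicit loop.

```python
import pandas as pd

# Hypothetical toy data; the values are made up for illustration.
toy = pd.DataFrame({
    "Region": ["A", "A", "B", "B", "B"],
    "Score": [1.0, 3.0, 2.0, 4.0, 6.0],
})

# Loop version: split, apply, and combine by hand.
loop_means = {}
for region in toy["Region"].unique():
    loop_means[region] = toy.loc[toy["Region"] == region, "Score"].mean()

# groupby version: 1. create the GroupBy object, 2. call a method on it.
grouped_toy = toy.groupby("Region")
groupby_means = grouped_toy["Score"].mean()

print(loop_means)
print(groupby_means.to_dict())
```

Both approaches should report the same mean per region; the `groupby` version simply leaves the splitting and combining to pandas.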

In [4]:
# Using groupby

grouped = happiness2015.groupby("Region")
aus_nz = grouped.get_group("Australia and New Zealand")
aus_nz

| | Country | Region | Happiness Rank | Happiness Score | Standard Error | Economy (GDP per Capita) | Family | Health (Life Expectancy) | Freedom | Trust (Government Corruption) | Generosity | Dystopia Residual |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 8 | New Zealand | Australia and New Zealand | 9 | 7.286 | 0.03371 | 1.25018 | 1.31967 | 0.90837 | 0.63938 | 0.42922 | 0.47501 | 2.26425 |
| 9 | Australia | Australia and New Zealand | 10 | 7.284 | 0.04083 | 1.33358 | 1.30923 | 0.93156 | 0.65124 | 0.35637 | 0.43562 | 2.26646 |


In [5]:
# map of group names to the original DataFrame's row labels, stored in the GroupBy object:
grouped.groups

{'Australia and New Zealand': Int64Index([8, 9], dtype='int64'),
 'Central and Eastern Europe': Int64Index([ 30,  43,  44,  51,  53,  54,  55,  58,  59,  61,  63,  68,  69,
              72,  76,  79,  82,  85,  86,  88,  92,  94,  95, 103, 105, 110,
             126, 129, 133],
            dtype='int64'),
 'Eastern Asia': Int64Index([37, 45, 46, 71, 83, 99], dtype='int64'),
 'Latin America and Caribbean': Int64Index([ 11,  13,  15,  22,  24,  26,  29,  31,  32,  39,  40,  41,  42,
              47,  50,  52,  56,  57,  64,  97, 104, 118],
            dtype='int64'),
 'Middle East and Northern Africa': Int64Index([ 10,  19,  21,  27,  34,  38,  48,  62,  67,  75,  81,  91, 102,
             106, 107, 109, 111, 134, 135, 155],
            dtype='int64'),
 'North America': Int64Index([4, 14], dtype='int64'),
 'Southeastern Asia': Int64Index([23, 33, 60, 73, 74, 89, 98, 128, 144], dtype='int64'),
 'Southern Asia': Int64Index([78, 80, 108, 116, 120, 131, 152], dtype='int64'),
 'Sub-Saharan

In [6]:
# verify the row indexes listed for "Australia and New Zealand" (rows 8 and 9)
happiness2015.iloc[8:10]

| | Country | Region | Happiness Rank | Happiness Score | Standard Error | Economy (GDP per Capita) | Family | Health (Life Expectancy) | Freedom | Trust (Government Corruption) | Generosity | Dystopia Residual |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 8 | New Zealand | Australia and New Zealand | 9 | 7.286 | 0.03371 | 1.25018 | 1.31967 | 0.90837 | 0.63938 | 0.42922 | 0.47501 | 2.26425 |
| 9 | Australia | Australia and New Zealand | 10 | 7.284 | 0.04083 | 1.33358 | 1.30923 | 0.93156 | 0.65124 | 0.35637 | 0.43562 | 2.26646 |
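
`grouped.groups` stores index labels, whereas `iloc` selects by position; the two agree here because the DataFrame was read with the default integer index. The sketch below, using made-up data with a deliberately non-default index (both are assumptions for illustration), shows a label-based check with `loc` that does not rely on that coincidence.

```python
import pandas as pd

# Hypothetical toy data with non-default index labels.
toy = pd.DataFrame(
    {"Region": ["A", "B", "A"], "Score": [1.0, 2.0, 3.0]},
    index=[10, 20, 30],
)
grouped_toy = toy.groupby("Region")

labels_a = list(grouped_toy.groups["A"])  # index labels of group "A"
by_label = toy.loc[labels_a]              # label-based lookup
by_group = grouped_toy.get_group("A")     # same rows via the GroupBy object

print(labels_a)
print(by_label.equals(by_group))
```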


Common pandas aggregation methods:

- `mean()` Calculates the mean of the values in each group.
- `sum()` Calculates the sum of the values in each group.
- `size()` Calculates the number of rows in each group, including missing values.
- `count()` Calculates the number of non-null values in each group.
- `min()` Calculates the minimum of the values in each group.
- `max()` Calculates the maximum of the values in each group.
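
The difference between `size()` and `count()` only shows up when a group contains missing values. A minimal sketch, assuming a made-up DataFrame with one `NaN` (the data is hypothetical, not taken from the happiness dataset):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; group "A" contains one missing score.
toy = pd.DataFrame({
    "Region": ["A", "A", "B"],
    "Score": [1.0, np.nan, 3.0],
})
grouped_scores = toy.groupby("Region")["Score"]

sizes = grouped_scores.size()    # rows per group, NaN included
counts = grouped_scores.count()  # non-null values per group

print(sizes.to_dict())
print(counts.to_dict())
```

Group `A` has two rows but only one non-null score, so its size and count differ.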


In [8]:
# calculate mean happiness score by region
grouped = happiness2015.groupby("Region")
grouped["Happiness Score"].mean()

Region
Australia and New Zealand          7.285000
Central and Eastern Europe         5.332931
Eastern Asia                       5.626167
Latin America and Caribbean        6.144682
Middle East and Northern Africa    5.406900
North America                      7.273000
Southeastern Asia                  5.317444
Southern Asia                      4.580857
Sub-Saharan Africa                 4.202800
Western Europe                     6.689619
Name: Happiness Score, dtype: float64

In [14]:
# Perform multiple aggregations at once with `agg([func1, func2])`
# (newer pandas versions may emit a FutureWarning when NumPy functions are passed here)
import numpy as np
happiness2015.groupby("Region")["Happiness Score"].agg([np.mean, np.min, np.max])

| Region | mean | amin | amax |
|---|---|---|---|
| Australia and New Zealand | 7.285 | 7.284 | 7.286 |
| Central and Eastern Europe | 5.332931 | 4.218 | 6.505 |
| Eastern Asia | 5.626167 | 4.874 | 6.298 |
| Latin America and Caribbean | 6.144682 | 4.518 | 7.226 |
| Middle East and Northern Africa | 5.4069 | 3.006 | 7.278 |
| North America | 7.273 | 7.119 | 7.427 |
| Southeastern Asia | 5.317444 | 3.819 | 6.798 |
| Southern Asia | 4.580857 | 3.575 | 5.253 |
| Sub-Saharan Africa | 4.2028 | 2.839 | 5.477 |
| Western Europe | 6.689619 | 4.857 | 7.587 |


In [15]:
# passing the function names as strings also works
happiness2015.groupby("Region")["Happiness Score"].agg(["mean", "min", "max"])

| Region | mean | min | max |
|---|---|---|---|
| Australia and New Zealand | 7.285 | 7.284 | 7.286 |
| Central and Eastern Europe | 5.332931 | 4.218 | 6.505 |
| Eastern Asia | 5.626167 | 4.874 | 6.298 |
| Latin America and Caribbean | 6.144682 | 4.518 | 7.226 |
| Middle East and Northern Africa | 5.4069 | 3.006 | 7.278 |
| North America | 7.273 | 7.119 | 7.427 |
| Southeastern Asia | 5.317444 | 3.819 | 6.798 |
| Southern Asia | 4.580857 | 3.575 | 5.253 |
| Sub-Saharan Africa | 4.2028 | 2.839 | 5.477 |
| Western Europe | 6.689619 | 4.857 | 7.587 |


In [16]:
# using a custom aggregation function; each group's values are passed to it automatically

def percentage_of_max(group):
    return round((group.mean() / group.max()) * 100)

happiness2015.groupby("Region")["Happiness Score"].agg([percentage_of_max])

| Region | percentage_of_max |
|---|---|
| Australia and New Zealand | 100.0 |
| Central and Eastern Europe | 82.0 |
| Eastern Asia | 89.0 |
| Latin America and Caribbean | 85.0 |
| Middle East and Northern Africa | 74.0 |
| North America | 98.0 |
| Southeastern Asia | 78.0 |
| Southern Asia | 87.0 |
| Sub-Saharan Africa | 77.0 |
| Western Europe | 88.0 |
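
To see what a custom aggregation function actually receives, the sketch below records the type of each argument. The DataFrame, the function name `mean_as_pct_of_max`, and the `received_types` list are all hypothetical names introduced for illustration.

```python
import pandas as pd

# Hypothetical toy data; the values are made up for illustration.
toy = pd.DataFrame({
    "Region": ["A", "A", "B", "B"],
    "Score": [2.0, 4.0, 5.0, 5.0],
})

received_types = []

def mean_as_pct_of_max(group):
    # Record what pandas passes in; each call receives one group's values.
    received_types.append(type(group).__name__)
    return round(group.mean() / group.max() * 100)

result = toy.groupby("Region")["Score"].agg(mean_as_pct_of_max)

print(result.to_dict())
print(set(received_types))
```

Because the function is given each group's `Score` values as a pandas object, methods such as `mean()` and `max()` can be called on it directly.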
