# Data Aggregation & Summarization

## Introduction



Aggregating and summarizing are essential tools in data analysis. They allow us to perform computations on our data or look at descriptive statistics for subsets of the data. These calculations can help us draw meaningful inferences from our data.

## Grouping



We looked at the GROUP BY clause in SQL in previous lessons. Pandas offers a similar tool for aggregations: the groupby method.



Calling groupby on a DataFrame, with the column or columns we intend to group on, returns a DataFrameGroupBy object.



Recall the vehicles dataset from previous lessons:

In [1]:
import numpy as np
import pandas as pd

In [2]:
vehicles = pd.read_csv('vehicles/vehicles.csv')

In [3]:
vehicles.head()

Unnamed: 0,Make,Model,Year,Engine Displacement,Cylinders,Transmission,Drivetrain,Vehicle Class,Fuel Type,Fuel Barrels/Year,City MPG,Highway MPG,Combined MPG,CO2 Emission Grams/Mile,Fuel Cost/Year
0,AM General,DJ Po Vehicle 2WD,1984,2.5,4.0,Automatic 3-spd,2-Wheel Drive,Special Purpose Vehicle 2WD,Regular,19.388824,18,17,17,522.764706,1950
1,AM General,FJ8c Post Office,1984,4.2,6.0,Automatic 3-spd,2-Wheel Drive,Special Purpose Vehicle 2WD,Regular,25.354615,13,13,13,683.615385,2550
2,AM General,Post Office DJ5 2WD,1985,2.5,4.0,Automatic 3-spd,Rear-Wheel Drive,Special Purpose Vehicle 2WD,Regular,20.600625,16,17,16,555.4375,2100
3,AM General,Post Office DJ8 2WD,1985,4.2,6.0,Automatic 3-spd,Rear-Wheel Drive,Special Purpose Vehicle 2WD,Regular,25.354615,13,13,13,683.615385,2550
4,ASC Incorporated,GNX,1987,3.8,6.0,Automatic 4-spd,Rear-Wheel Drive,Midsize Cars,Premium,20.600625,14,21,16,555.4375,2550


In [4]:
vehicles.groupby(['Transmission'])

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x11381b890>

This object only stores how the rows are split into groups. No result is computed until an aggregation is applied to it.
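As a minimal sketch of what a grouped object holds before any aggregation, the example below uses a small hypothetical stand-in for the vehicles data (the `toy` DataFrame and its values are invented for illustration, not taken from the dataset).

```python
import pandas as pd

# Hypothetical miniature stand-in for the vehicles data (values invented)
toy = pd.DataFrame({
    "Transmission": ["Automatic 3-spd", "Automatic 3-spd", "Manual 5-spd", "Automatic 4-spd"],
    "Combined MPG": [17, 13, 25, 16],
})

grouped = toy.groupby("Transmission")

# Nothing has been aggregated yet; the object only records how rows map to groups
n_groups = grouped.ngroups                     # number of distinct transmissions
group_sizes = grouped.size()                   # rows per group, as a Series
auto3 = grouped.get_group("Automatic 3-spd")   # the rows belonging to one group

print(n_groups)
print(group_sizes)
print(auto3)
```

Inspecting `ngroups`, `size()`, and `get_group()` is a quick way to check that the grouping key behaves as expected before running a heavier aggregation.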

## Aggregations


We can apply different aggregation functions to our grouped data. We can use the standard functions, or define our own and apply them to the grouped data with the agg function.


Some standard aggregation functions are: mean, sum, count, median, min, max, std.


We can also use the agg function to apply multiple aggregations at once to all columns specified.

Before aggregating, we can select columns from the grouped object so that the aggregation is applied only to the columns we choose.
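The two ideas above, selecting columns on the grouped object and applying several aggregations at once, can be sketched as follows. The `toy` DataFrame is a hypothetical stand-in with invented values, and the dict form of agg is shown as one possible approach rather than the only one.

```python
import pandas as pd

# Hypothetical miniature stand-in for the vehicles data (values invented)
toy = pd.DataFrame({
    "Make": ["Acura", "Acura", "Volvo", "Volvo"],
    "City MPG": [18, 20, 17, 19],
    "Highway MPG": [25, 27, 24, 26],
    "Fuel Cost/Year": [2000, 1800, 2100, 1900],
})

# Select the columns on the grouped object first (note the double brackets),
# then aggregate only those columns
mpg_means = toy.groupby("Make")[["City MPG", "Highway MPG"]].mean()

# A dict passed to .agg applies different functions to different columns
mixed = toy.groupby("Make").agg({"City MPG": "mean", "Fuel Cost/Year": ["min", "max"]})

print(mpg_means)
print(mixed)
```

Selecting columns before aggregating also avoids computing statistics for columns that are never used.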



Here are some examples of standard aggregation functions:

In [12]:
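# group by 'Make' and obtain the means of 'Highway MPG', 'City MPG', 'Combined MPG'
# (a list of columns needs double brackets; recent pandas versions reject the bare tuple form)

vehicles.groupby(['Make'])[['Highway MPG', 'City MPG', 'Combined MPG']].mean()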

Make,Highway MPG,City MPG,Combined MPG
AM General,15.000000,15.000000,14.750000
ASC Incorporated,21.000000,14.000000,16.000000
Acura,25.940397,18.890728,21.506623
Alfa Romeo,23.902439,17.097561,19.512195
American Motors Corporation,20.181818,16.045455,17.681818
...,...,...,...
Volkswagen,28.985673,21.226361,24.093601
Volvo,25.064156,17.981869,20.605300
Wallace Environmental,16.000000,12.437500,13.875000
Yugo,28.250000,23.000000,25.000000


In [13]:
# group by 'Fuel Type' and 'Cylinders' and obtain the mean 'CO2 Emission Grams/Mile'

vehicles.groupby(['Fuel Type', 'Cylinders'])['CO2 Emission Grams/Mile'].mean()

Fuel Type                    Cylinders
CNG                          4.0          257.539637
                             6.0          417.030882
                             8.0          573.309895
Diesel                       4.0          330.909525
                             5.0          425.288362
                             6.0          435.797518
                             8.0          589.949213
                             10.0         590.506536
Gasoline or E85              4.0          359.073842
                             6.0          454.282739
                             8.0          594.649022
Gasoline or natural gas      4.0          383.464097
                             6.0          437.000000
                             8.0          768.310391
Gasoline or propane          8.0          666.525000
Midgrade                     6.0          406.800000
                             8.0          524.891046
Premium                      2.0          486.635300
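Grouping on two columns produces a result indexed by both keys (a MultiIndex). The sketch below, on a hypothetical `toy` DataFrame with invented values, shows a few common ways to work with such a result.

```python
import pandas as pd

# Hypothetical miniature stand-in for the vehicles data (values invented)
toy = pd.DataFrame({
    "Fuel Type": ["Diesel", "Diesel", "Diesel", "Regular", "Regular"],
    "Cylinders": [4.0, 4.0, 6.0, 4.0, 6.0],
    "CO2 Emission Grams/Mile": [300.0, 340.0, 430.0, 360.0, 450.0],
})

co2 = toy.groupby(["Fuel Type", "Cylinders"])["CO2 Emission Grams/Mile"].mean()

one_value = co2.loc[("Diesel", 4.0)]   # index into the MultiIndex with a tuple
flat = co2.reset_index()               # turn the group keys back into ordinary columns
wide = co2.unstack("Cylinders")        # one row per fuel type, one column per cylinder count

print(one_value)
print(flat)
print(wide)
```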
       

In [19]:
# group by 'Fuel Type', compute the 'mean', 'median', 'std' and 'sum' of 'Combined MPG' using .agg, then display only the 'mean' column

x = vehicles.groupby(['Fuel Type'])['Combined MPG'].agg(['mean', 'median', 'std', 'sum'])
x['mean']

Fuel Type
CNG                            18.133333
Diesel                         23.488474
Gasoline or E85                17.572385
Gasoline or natural gas        15.350000
Gasoline or propane            13.500000
Midgrade                       17.378378
Premium                        19.343816
Premium Gas or Electricity     31.647059
Premium and Electricity        26.300000
Premium or E85                 20.090909
Regular                        20.144698
Regular Gas and Electricity    41.937500
Regular Gas or Electricity     42.000000
Name: mean, dtype: float64
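When several aggregations are computed at once, it can help to give the result columns descriptive names. The sketch below uses keyword arguments to agg for this; the `toy` DataFrame and the column names `avg_mpg`, `median_mpg`, and `n_vehicles` are hypothetical choices made for illustration.

```python
import pandas as pd

# Hypothetical miniature stand-in for the vehicles data (values invented)
toy = pd.DataFrame({
    "Fuel Type": ["Diesel", "Diesel", "Regular", "Regular", "Regular"],
    "Combined MPG": [22, 26, 18, 20, 22],
})

# Each keyword becomes a column name; its value is the aggregation to apply
summary = toy.groupby("Fuel Type")["Combined MPG"].agg(
    avg_mpg="mean",
    median_mpg="median",
    n_vehicles="count",
)

# The result is an ordinary DataFrame, so it can be sorted or filtered as usual
best_first = summary.sort_values("avg_mpg", ascending=False)

print(best_first)
```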

## Custom Aggregation Function


We do not have to be limited by the range of standard aggregation functions. If the need arises, we can write our own aggregation function.


For instance, let us write a function that receives the years of all vehicles in a group and labels the group 'Modern' if their mean year is 1992 or later, and 'Old' otherwise. We then group by model and apply it.

In [21]:
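def old_or_not(x):
    # x is the Series of 'Year' values for one group
    mean_year = x.mean()
    return 'Modern' if mean_year >= 1992 else 'Old'

vehicles.groupby("Model")["Year"].agg([old_or_not])    # label each model 'Old' or 'Modern'
vehicles.groupby("Model")["Year"].mean()               # mean year per model (the output displayed below)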


Model
1-Ton Truck 2WD      1988.500000
100                  1991.583333
100 Wagon            1990.666667
100 quattro          1992.000000
100 quattro Wagon    1993.000000
                        ...     
iQ                   2013.000000
tC                   2010.500000
xA                   2005.000000
xB                   2009.727273
xD                   2011.000000
Name: Year, Length: 3608, dtype: float64
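To see what a custom aggregation returns, here is a small sketch on a hypothetical `toy` DataFrame (values invented). It reuses the `old_or_not` idea from above and adds a second, purely illustrative helper, `year_span`, which is not part of the lesson's code.

```python
import pandas as pd

# Hypothetical miniature stand-in for the vehicles data (values invented)
toy = pd.DataFrame({
    "Model": ["100", "100", "GNX", "iQ", "iQ"],
    "Year": [1990, 1993, 1987, 2012, 2014],
})

def old_or_not(years):
    """Label a group 'Modern' when its mean year is 1992 or later."""
    return "Modern" if years.mean() >= 1992 else "Old"

def year_span(years):
    """Illustrative helper: number of years between oldest and newest entry."""
    return years.max() - years.min()

# Each function receives the 'Year' values of one group and returns a single value;
# the result columns are named after the functions
labels = toy.groupby("Model")["Year"].agg([old_or_not, year_span])

print(labels)
```

Any function that reduces a group's values to a single value can be used this way.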

In [None]:
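from scipy import stats

# .agg also accepts functions from other libraries, such as scipy's stats.mode (most frequent value)
vehicles.groupby("Make")["Combined MPG"].agg([stats.mode])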
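As a possible alternative that stays within pandas, the most frequent value per group can also be obtained with Series.mode. This is a sketch on a hypothetical `toy` DataFrame with invented values; the helper name `most_common` and the choice to keep the smallest value on ties are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical miniature stand-in for the vehicles data (values invented)
toy = pd.DataFrame({
    "Make": ["Acura", "Acura", "Acura", "Volvo", "Volvo"],
    "Combined MPG": [21, 21, 23, 20, 22],
})

def most_common(values):
    # Series.mode returns every tied value in sorted order; keep the first one
    return values.mode().iloc[0]

modes = toy.groupby("Make")["Combined MPG"].agg(most_common)

print(modes)
```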

## Summary



In this lesson we learned how to perform summarization and aggregation with DataFrames. We learned to use the standard aggregation functions and to write custom aggregation functions.