# Data Aggregations and Summarization


# Lesson Goals

In this lesson we will learn about:

    Grouping data in Pandas
    Using the aggregation functions to summarize grouped data

# Introduction

Aggregating and summarizing are essential tools in data analysis. They allow us to perform computations on our data or look at descriptive statistics for subsets of the data. These calculations can help us draw meaningful inferences from our data.

We will use the vehicles.csv data set you used in Module 1. In case you don't have the data set handy, download it again from here. Extract the content of the downloaded file to your machine. vehicles.csv is contained in the extracted folder.

# Grouping

We have looked at the GROUP BY clause in SQL in previous lessons. Pandas has a similar function that enables us to perform aggregations: the groupby function.

Calling the groupby function on a DataFrame, with the column or columns that we intend to group on, returns a DataFrameGroupBy object.

Recall the vehicles dataset from previous lessons:

In [1]:
import numpy as np
import pandas as pd

vehicles = pd.read_csv('data/vehicles.csv')
vehicles.groupby(['Transmission'])

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7fdbd45e84e0>

This object does not display any results on its own. It stores how the rows are split into groups, and the results are "unleashed" only when an aggregation is applied to it.
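To see what a grouped object holds before any aggregation, here is a minimal sketch. It uses a small made-up DataFrame (`toy` is hypothetical and only stands in for vehicles.csv) so that it runs without the data file.

```python
import pandas as pd

# Hypothetical toy data standing in for vehicles.csv
toy = pd.DataFrame({
    "Transmission": ["Manual", "Automatic", "Manual", "Automatic", "Manual"],
    "City MPG": [20, 18, 24, 16, 22],
})

# Grouping alone only records how the rows are split; nothing is aggregated yet
grouped = toy.groupby("Transmission")

n_groups = grouped.ngroups               # number of distinct group keys
group_sizes = grouped.size()             # number of rows in each group
city_mean = grouped["City MPG"].mean()   # applying an aggregation produces a result

print(n_groups)
print(group_sizes)
print(city_mean)
```

The attributes `ngroups` and `size()` are handy for a quick check that the grouping column splits the data the way we expect before computing any statistics.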


# Aggregations

We can apply different aggregation functions to our grouped data. We can use some standard functions, or define our own functions and then apply them to the grouped data using the agg function.

Some standard aggregation functions are: mean, sum, count, median, min, max, std.

We can also use the agg function to apply multiple aggregations at once to all columns specified.

After grouping, we can select a subset of columns from the grouped object so that the aggregation is applied only to the columns that we choose.

Here are some examples of standard aggregation functions:

In [2]:
# Here we aggregate 3 different columns and compute their mean based on the different transmission values.
# Selecting several columns requires a list, hence the double brackets.
vehicles.groupby(['Transmission'])[['Highway MPG', 'City MPG', 'Combined MPG']].mean().head()

Transmission  Highway MPG   City MPG  Combined MPG
Auto (AV)            40.0       35.0          37.0
Auto (AV-S6)         25.0       22.0          23.0
Auto (AV-S8)         22.0       20.0          21.0
Auto(A1)             37.0       41.0          39.0
Auto(AM-S6)     32.978261  24.315217     27.554348


In [3]:
# In this example we group on two columns and compute the median CO2 emission for all combinations of fuel type and cylinders
vehicles.groupby(['Fuel Type', 'Cylinders'])['CO2 Emission Grams/Mile'].median().head()

Fuel Type  Cylinders
CNG        4.0          253.197321
           6.0          417.030882
           8.0          568.070913
Diesel     4.0          308.484848
           5.0          391.538462
Name: CO2 Emission Grams/Mile, dtype: float64

In [4]:
# Here we produce the mean, median and standard deviation for combined MPG grouped by fuel type
vehicles.groupby(['Fuel Type'])['Combined MPG'].agg(['mean', 'median', 'std']).head()

Fuel Type                     mean  median       std
CNG                      18.133333    14.5  7.436663
Diesel                   23.488474    21.0  7.054702
Gasoline or E85          17.572385    17.0  3.822538
Gasoline or natural gas      15.35    12.0  5.343712
Gasoline or propane           13.5    13.5  1.603567
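The examples above can also be tried without the data file. The sketch below uses a small made-up DataFrame (`toy` and its values are hypothetical) to show two ways of calling agg: a list of function names applied to every selected column, and a dict that assigns a different aggregation to each column.

```python
import pandas as pd

# Hypothetical toy data standing in for vehicles.csv
toy = pd.DataFrame({
    "Fuel Type": ["Diesel", "Diesel", "Regular", "Regular", "Regular"],
    "Combined MPG": [30, 26, 20, 22, 24],
    "City MPG": [27, 23, 18, 19, 23],
})

# Select the columns first, then aggregate:
# the result has one column per (column, function) pair
summary = toy.groupby("Fuel Type")[["Combined MPG", "City MPG"]].agg(["mean", "max"])

# A dict maps each column to its own aggregation
per_column = toy.groupby("Fuel Type").agg({"Combined MPG": "mean", "City MPG": "max"})

print(summary)
print(per_column)
```

With a list of functions, the columns of the result form a two-level index, so a single value is read with a tuple such as `summary.loc["Diesel", ("Combined MPG", "mean")]`.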


# Custom Aggregation Functions

We do not have to be limited by the range of standard aggregation functions. If the need arises, we can write our own aggregation function.

For example, in our vehicles dataset, we might want to find the most common vehicle class for each transmission type. In other words, we would like to find the mode.

We can write our own implementation of the mode function, but it would be more efficient to use the SciPy implementation of this function. SciPy is a Python package for scientific computing.

Let us first define our custom function using the SciPy mode function. We create a custom function because the mode function returns a pair of values: the mode and the frequency of the mode. We are only interested in the first of the two.

In [5]:
from scipy import stats

def agg_mode(x):
    # stats.mode returns the mode together with its count; keep only the mode
    return stats.mode(x)[0]

Now we can use our custom aggregation function using the agg function:

In [6]:
vehicles.groupby("Transmission")["Vehicle Class"].agg(agg_mode).head()



Transmission
Auto (AV)          Compact Cars
Auto (AV-S6)       Compact Cars
Auto (AV-S8)       Midsize Cars
Auto(A1)        Subcompact Cars
Auto(AM-S6)        Compact Cars
Name: Vehicle Class, dtype: object
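Depending on the installed SciPy version, stats.mode may not accept non-numeric data such as the strings in Vehicle Class. As an alternative, here is a minimal sketch of the same idea using only Pandas. The name `agg_mode_pandas` and the `toy` DataFrame are hypothetical and are not part of the original lesson.

```python
import pandas as pd

# Hypothetical toy data standing in for vehicles.csv
toy = pd.DataFrame({
    "Transmission": ["Manual", "Manual", "Manual", "Automatic", "Automatic"],
    "Vehicle Class": ["Compact Cars", "Compact Cars", "Midsize Cars",
                      "Midsize Cars", "Midsize Cars"],
})

def agg_mode_pandas(x):
    # Series.mode returns all tied modes in sorted order; keep the first one
    return x.mode().iloc[0]

most_common = toy.groupby("Transmission")["Vehicle Class"].agg(agg_mode_pandas)
print(most_common)
```

This sketch assumes every group has at least one non-missing value. For a group made up entirely of missing values, `x.mode()` is empty and `.iloc[0]` raises an error.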