# Data Aggregation & Summarization

## Introduction

Aggregation and summarization are essential tools in data analysis. They let us compute descriptive statistics over the whole data set or over subsets of it. These calculations help us draw meaningful inferences from our data.

We will use the vehicles.csv data set you used in Module 1. If you don't have the data set handy, download it again from here and extract the contents of the downloaded file to your machine. vehicles.csv is in the extracted folder.

## Grouping

We have looked at the GROUP BY clause in SQL in previous lessons. Pandas has a similar function that enables us to perform aggregations: the groupby function.

Calling the groupby function on a DataFrame, with the column or columns that we intend to group on, returns a DataFrameGroupBy object.

Recall the vehicles dataset from previous lessons:

In [None]:
import numpy as np
import pandas as pd

In [None]:
vehicles = pd.read_csv('vehicles.csv')

In [None]:
vehicles.head()

In [None]:
vehicles.groupby(['Transmission'])

The returned object does not display any results by itself. It stores how the rows are split into groups, and that information is "unleashed" when an aggregation is applied to the object.
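To make this concrete, here is a minimal sketch that inspects a grouped object before and after an aggregation. The `toy` DataFrame and its values are hypothetical stand-ins, not taken from vehicles.csv.

```python
import pandas as pd

# Hypothetical toy data standing in for vehicles.csv
toy = pd.DataFrame({
    "Transmission": ["Manual", "Automatic", "Manual", "Automatic", "Manual"],
    "City MPG": [20, 18, 24, 16, 22],
})

# The grouped object records how rows map to groups
grouped = toy.groupby("Transmission")

n_groups = grouped.ngroups               # number of distinct groups
group_sizes = grouped.size()             # rows per group, as a Series
city_means = grouped["City MPG"].mean()  # an aggregation produces the summary

print(n_groups)
print(group_sizes)
print(city_means)
```

The same pattern should carry over to the full vehicles DataFrame, with one row in the result per distinct transmission value.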

## Aggregations

We can apply different aggregation functions to our grouped data. We can use standard functions, or define our own functions and apply them to the grouped data using the agg function.

Some standard aggregation functions are: mean, sum, count, median, min, max, std.

We can also use the agg function to apply multiple aggregations at once to all columns specified.

After grouping, we can select a subset of columns from the grouped object so that the aggregation is applied only to the columns that we choose.

Here are some examples of standard aggregation functions:

In [None]:
# group by 'Make' and obtain the means of 'Highway MPG', 'City MPG', 'Combined MPG'
# (a list inside the brackets selects several columns at once)

vehicles.groupby(['Make'])[['Highway MPG', 'City MPG', 'Combined MPG']].mean()

In [None]:
# group by 'Fuel Type' and 'Cylinders' and obtain the mean 'CO2 Emission Grams/Mile'

vehicles.groupby(['Fuel Type', 'Cylinders'])['CO2 Emission Grams/Mile'].mean()

In [None]:
# groupby 'Fuel Type' and display for the 'Combined MPG' the 'mean', 'median' and 'std' using .agg

x = vehicles.groupby(['Fuel Type'])['Combined MPG'].agg(['mean', 'median', 'std'])
x['mean']
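When different columns need different statistics, agg also accepts a dictionary that maps column names to lists of functions. The sketch below uses a hypothetical `toy` DataFrame whose column names mirror vehicles.csv but whose values are made up.

```python
import pandas as pd

# Hypothetical toy data; values are made up for illustration
toy = pd.DataFrame({
    "Fuel Type": ["Regular", "Regular", "Premium", "Premium", "Diesel"],
    "Combined MPG": [25, 27, 21, 23, 30],
    "Cylinders": [4, 4, 6, 8, 4],
})

# Each key is a column; each value lists the aggregations applied to that column
summary = toy.groupby("Fuel Type").agg({
    "Combined MPG": ["mean", "max"],
    "Cylinders": ["median"],
})

print(summary)
```

The result has two-level column labels, so a single statistic is selected with a tuple such as `summary[("Combined MPG", "mean")]`.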

## Custom Aggregation Function

We do not have to be limited by the range of standard aggregation functions. If the need arises, we can write our own aggregation function.

For example, in our vehicles dataset, we might want to find the most common vehicle class for each type of transmission. In other words, we would like to find the mode of each group.

We can write our own implementation of the mode function, but it would be more efficient to use the scipy implementation of this function. Scipy is a Python package for scientific computing.

Let us first define our custom function using the scipy mode function. We create a custom function since the mode function returns a tuple with the mode and the frequency of the mode. We are only interested in the first part of the tuple.

In [None]:
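from scipy import stats

def agg_mode(x):
    # stats.mode returns the mode together with its count; keep only the mode
    return stats.mode(x)[0]
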

Now we can apply our custom aggregation function with the agg function:

In [None]:
vehicles.groupby("Transmission")["Vehicle Class"].agg(agg_mode)
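Because 'Vehicle Class' holds text, and some recent SciPy releases restrict stats.mode to numeric input, a pandas-only variant may be useful as a fallback. This is a sketch under that assumption: the name `agg_mode_pandas` and the `toy` data are hypothetical and not part of the lesson's original code.

```python
import pandas as pd

def agg_mode_pandas(x):
    # Series.mode returns every most-frequent value in sorted order;
    # take the first one, or None if the group has no non-missing values
    modes = x.mode()
    return modes.iloc[0] if not modes.empty else None

# Hypothetical toy data standing in for vehicles.csv
toy = pd.DataFrame({
    "Transmission": ["Manual", "Manual", "Manual", "Automatic", "Automatic"],
    "Vehicle Class": ["Compact", "Compact", "SUV", "SUV", "SUV"],
})

most_common = toy.groupby("Transmission")["Vehicle Class"].agg(agg_mode_pandas)
print(most_common)
```

When a group has several equally frequent values, this variant returns the first in sorted order, which is one possible tie-breaking rule among others.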

In [None]:
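def old_or_not(x):
    # Label a group by the mean of its values: 'Modern' if the mean is 1992 or later, else 'Old'
    avg = x.mean()
    if pd.isna(avg):
        return np.nan
    return 'Modern' if avg >= 1992 else 'Old'

# The label is only meaningful for 'Year'; for 'Cylinders' the 'mean' column is the useful one
vehicles.groupby("Model")[["Year", "Cylinders"]].agg([old_or_not, 'mean'])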

In [None]:
vehicles.head()

## Summary

In this lesson we learned how to summarize and aggregate data in DataFrames. We learned to use the standard aggregation functions and how to write custom aggregation functions.