# Grouping and Aggregation

In [None]:
%matplotlib inline
from matplotlib import pyplot as plt
import pandas as pd
from pandas import DataFrame, Series
import numpy as np

**Learning Objectives:** Learn to apply the split-apply-combine approach to group and aggregate data.

This notebook is based on Chapter 9 of Wes McKinney's *Python for Data Analysis*.

## Split-apply-combine

The idea of *split-apply-combine* is this:

1. Split the data frame into groups of rows.
2. Apply some transformation, method or function to each column of each group of rows.
3. Combine the output of those transformations into a final `Series` or `DataFrame`.

This simple sequence of steps can be used to accomplish a wide range of data transformations.
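
To make the three steps concrete, here is a minimal sketch that performs them by hand and then compares the result with `groupby`. The `toy` frame and its column names are hypothetical and are not used elsewhere in this notebook.

```python
import pandas as pd

# Hypothetical toy data, not the DataFrame used in the rest of this notebook.
toy = pd.DataFrame({'key': ['a', 'b', 'a', 'b'],
                    'value': [1.0, 2.0, 3.0, 4.0]})

# 1. Split: collect the 'value' entries belonging to each distinct key.
pieces = {k: toy.loc[toy['key'] == k, 'value'] for k in toy['key'].unique()}

# 2. Apply: compute one number (here the mean) for each piece.
means = {k: piece.mean() for k, piece in pieces.items()}

# 3. Combine: assemble the per-group results into a single Series.
manual = pd.Series(means).sort_index()

# The same computation written with groupby.
with_groupby = toy.groupby('key')['value'].mean()

print(manual)
print(with_groupby)
```

Both results hold one mean per key; `groupby` simply does the bookkeeping of the split and combine steps for you.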

## Split using `groupby`

Splitting a data frame by its rows is done using the `groupby` method of a `Series` or `DataFrame`.

To illustrate this, here is a `DataFrame` with two numerical and two categorical columns:

In [None]:
df = DataFrame({'key1': ['a','a','b','b','a'],
                'key2': ['one','two','one','two','one'],
                'data1': np.random.randn(5),
                'data2': np.random.randn(5)})

In [None]:
df

There are two things you have to specify when calling `groupby` to perform a split:

1. What columns you want to **look at** or analyze.
2. What columns you want to **group by.**

As you think about these choices, here are two *guidelines* for picking these columns:

* Look at numerical columns.
* Group by categorical columns.

While these guidelines can be broken, they are a good idea to keep in mind.

Look at the `data1` column and group by the `key1` column's values.

In [None]:
g1 = df['data1'].groupby(df['key1'])

In [None]:
g1

You can iterate through the groups as follows:

In [None]:
for name, group in g1:
    print(name)
    print(group)
    print('')

The `groups` attribute is a dictionary that maps each group name to the row labels belonging to that group:

In [None]:
g1.groups

It can also be useful to see the size of the groups:

In [None]:
g1.size()

### Interacting with `groupby`

Let's use the `interact` function from `ipywidgets` to better understand how `groupby` works.

In [None]:
def show_groups(column, by):
    groups = df[column].groupby(df[by])
    for name, group in groups:
        print(name)
        print(group)
        print('')

In [None]:
from ipywidgets import interact, fixed

In [None]:
interact(show_groups, column=['data1','data2'], by=['key1','key2']);

You can pick columns to look at either before you call `groupby` or after:

In [None]:
df['data1'].groupby(df['key1']).mean()

In [None]:
df.groupby(df['key1'])['data1'].mean()

If you are using `groupby` on the entire `DataFrame`, you can pass `groupby` the name of the column, rather than the actual column of values:

In [None]:
df.groupby('key1')['data1'].mean()

You can group by multiple columns. The resulting `Series` or `DataFrame` will have a hierarchical index:

In [None]:
df.groupby(['key1','key2'])['data1'].mean()
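
As a small sketch of what the hierarchical index looks like in practice, the example below uses a hypothetical `toy` frame with fixed values (so the results are reproducible) and selects entries from the grouped result with `.loc`.

```python
import pandas as pd

# Hypothetical data with fixed values; the notebook's df uses random numbers.
toy = pd.DataFrame({'key1': ['a', 'a', 'b', 'b', 'a'],
                    'key2': ['one', 'two', 'one', 'two', 'one'],
                    'data1': [1.0, 2.0, 3.0, 4.0, 5.0]})

means = toy.groupby(['key1', 'key2'])['data1'].mean()

# The index has one level per grouping column.
level_names = list(means.index.names)

# Select a single group by passing a tuple of key values.
a_one = means.loc[('a', 'one')]

# Select every group whose first-level key is 'a'.
all_a = means.loc['a']

print(level_names)
print(a_one)
print(all_a)
```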

If you are looking at a single column of numerical data, the final result will be a `Series`. You can also look at multiple columns, which will result in a `DataFrame`:

In [None]:
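# numeric_only=True skips the string column key2; recent pandas versions raise a TypeError without it
df.groupby('key1').mean(numeric_only=True)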

Here is a more complicated example where we are looking at and grouping by multiple columns:

In [None]:
df.groupby(['key1','key2'])[['data1','data2']].mean()

It is possible to use any sequence for the `groupby` values. If you pass a sequence that isn't in the `DataFrame`, that sequence will be treated like another column in the `DataFrame` for the splitting step:

In [None]:
states = ['OH','CA','CA','OH','OH']
years = [2005,2005,2006,2005,2006]
# numeric_only=True skips the string columns key1 and key2
df.groupby([states, years]).mean(numeric_only=True)

That is like doing a groupby on the following `DataFrame`:

In [None]:
df2 = df.copy()
df2['states'] = states
df2['years'] = years
df2

In [None]:
df2.groupby(['states','years']).mean(numeric_only=True)

There are other, more sophisticated ways of doing grouping:

* `Series`
* `dict`
* Functions

See Chapter 9 of *Python for Data Analysis* for more details, or the Pandas [Group By Documentation](http://pandas.pydata.org/pandas-docs/dev/groupby.html).
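
As a rough sketch of those three options, the example below groups the rows of a hypothetical `toy` frame by a `dict`, by a `Series`, and by a function. In each case the grouping is driven by the index labels: a `dict` or `Series` maps each label to a group name, and a function is called once per label. The data and the `mapping` are made up for illustration.

```python
import pandas as pd

# Hypothetical data indexed by string labels.
toy = pd.DataFrame({'value': [1.0, 2.0, 3.0, 4.0]},
                   index=['apple', 'avocado', 'banana', 'blueberry'])

# dict: maps each index label to a group name.
mapping = {'apple': 'x', 'avocado': 'y', 'banana': 'x', 'blueberry': 'y'}
by_dict = toy.groupby(mapping)['value'].sum()

# Series: works like the dict, aligned on the index.
by_series = toy.groupby(pd.Series(mapping))['value'].sum()

# Function: called on each index label; here we group by first letter.
by_function = toy.groupby(lambda label: label[0])['value'].sum()

print(by_dict)
print(by_series)
print(by_function)
```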

## Aggregation

### Single function on all columns

To apply a single aggregation function to all columns in the grouped data, simply call the method on the `groupby` result or pass an aggregation function to the `agg` method.

In [None]:
df

In [None]:
g2 = df.groupby('key1')

In [None]:
g2.mean(numeric_only=True)

In [None]:
g2.count()

Here are some of the aggregation methods that are built in:

* `count`
* `sum`
* `mean`
* `median`
* `std` / `var`
* `min` / `max`
* `prod`
* `first` / `last`
* `describe`
* `size`

When you call these methods, **the same operation is applied to all columns of all groups.**

It is possible to write your own aggregation function and have it called on the columns of each group using `agg`.

In [None]:
def peak_to_peak(arr):
    # Range of the values: largest minus smallest
    return arr.max() - arr.min()

In [None]:
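# Select the numeric columns first; subtraction is not defined for the strings in key2
g2[['data1','data2']].agg(peak_to_peak)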

When you pass a single function to `agg`, that same function is applied to all columns of each group.
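
To see what a custom aggregation function receives and returns, here is a small sketch with fixed values. The `toy` frame and the `value_range` helper are hypothetical names chosen for illustration. The function is called once per column per group with a `Series` and must return a single value.

```python
import pandas as pd

# Hypothetical data with two numeric columns and one grouping column.
toy = pd.DataFrame({'key': ['a', 'a', 'b', 'b'],
                    'x': [1.0, 4.0, 10.0, 12.0],
                    'y': [0.0, 0.5, 2.0, 5.0]})

def value_range(values):
    # values is a Series holding one column of one group.
    return values.max() - values.min()

# One row per group, one column per aggregated column.
ranges = toy.groupby('key').agg(value_range)

print(ranges)
```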

You can also pass the names of built-in functions to `agg` as strings:

In [None]:
g2[['data1','data2']].agg('mean')

### Multiple functions on each column

Sometimes you want to call multiple aggregation functions on each column of data. In this case, the same set of functions is still called on every column. To call different functions on different columns, see the next section.

We will use the tips data set to illustrate these features:

In [None]:
import seaborn

In [None]:
tips = seaborn.load_dataset('tips')
tips['tip_pct'] = tips.tip/tips.total_bill

In [None]:
tips.head()

In [None]:
grouped = tips.groupby(['sex','smoker'])

The first way of calling multiple aggregation functions is to simply pass a list of functions or function names to `agg`:

In [None]:
grouped['tip_pct'].agg(['mean','std',peak_to_peak])

Note how the new column names match the names of the functions. If you want to customize the column names, you can pass a list of `(name, function)` tuples:

In [None]:
grouped['tip_pct'].agg([('the_mean','mean'),('the_std','std'),('p2p',peak_to_peak)])

### Different functions on different columns

The last case is when you want to call different sets of functions on different columns. In this case, you can pass a dict, where the keys are the column names and the values are the functions (or lists of functions) you want to apply.

Here is a simple example of applying different functions to the `tip` and `tip_pct` columns:

In [None]:
grouped.agg({'tip':'max','tip_pct':['mean','std']})
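
The sketch below repeats the same pattern on a tiny hypothetical `bills` frame with fixed values, so the shape of the result is easy to inspect. When at least one dict value is a list, the resulting columns form a two-level index of `(column, function)` pairs.

```python
import pandas as pd

# Hypothetical data; the column names only loosely mirror the tips data set.
bills = pd.DataFrame({'day': ['Sat', 'Sat', 'Sun', 'Sun'],
                      'total_bill': [10.0, 20.0, 30.0, 50.0],
                      'tip': [1.0, 4.0, 3.0, 10.0]})

# One function for 'tip', two functions for 'total_bill'.
summary = bills.groupby('day').agg({'tip': 'max',
                                    'total_bill': ['mean', 'sum']})

# Individual entries are addressed with a (column, function) tuple.
sat_mean_bill = summary.loc['Sat', ('total_bill', 'mean')]
sun_max_tip = summary.loc['Sun', ('tip', 'max')]

print(summary)
```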