# Grouping

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

pd.options.display.max_rows = 6
pd.options.display.max_columns = 6
pd.options.display.width = 80

We'll use the same dataset of beer reviews.

In [None]:
df = pd.read_hdf('data/beer.hdf')

# Groupby

Groupby is a fundamental operation in pandas and in data analysis generally.

The components of a groupby operation are to

1. Split a table into groups
2. Apply a function to each group
3. Combine the results

http://pandas.pydata.org/pandas-docs/stable/groupby.html
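
To make the three steps concrete, here is a minimal sketch that performs them by hand on a small made-up table and compares the result with the groupby one-liner. The `toy` frame and its values are assumptions for illustration, not the beer dataset.

```python
import pandas as pd

# Toy data (hypothetical values); column names mimic the beer dataset.
toy = pd.DataFrame({
    'beer_style': ['Tripel', 'IPA', 'Tripel', 'IPA', 'Stout'],
    'abv': [9.0, 6.5, 8.0, 7.5, 10.0],
})

# 1. Split: one sub-frame per distinct key.
pieces = {key: toy[toy['beer_style'] == key]
          for key in toy['beer_style'].unique()}

# 2. Apply: reduce each piece to a single number.
means = {key: piece['abv'].mean() for key, piece in pieces.items()}

# 3. Combine: assemble the per-group results into one Series.
manual = pd.Series(means, name='abv').sort_index()

# groupby performs all three steps in one expression.
auto = toy.groupby('beer_style')['abv'].mean()
```

Both `manual` and `auto` hold one mean `abv` per style; the groupby version is shorter and avoids scanning the table once per key.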

In pandas the first step looks like

``df.groupby( grouper )``

`grouper` can be many things

- ``Series`` (or string indicating a column in a ``DataFrame``)
- function (to be applied on the index)
- dict : groups by *values*
- `level=[]`, names of levels in a MultiIndex
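
A short sketch of each kind of grouper on a made-up frame. The `toy` frame, its index labels, and the `strong` name are hypothetical and chosen only for illustration.

```python
import pandas as pd

# Toy data (hypothetical values) with string index labels.
toy = pd.DataFrame(
    {'style': ['Tripel', 'IPA', 'Tripel', 'IPA'],
     'abv': [9.0, 6.5, 8.0, 7.5]},
    index=['a1', 'b1', 'a2', 'b2'],
)

# A string naming a column.
by_name = toy.groupby('style')['abv'].mean()

# A Series aligned with the frame's index.
strong = (toy['abv'] > 7).rename('strong')
by_series = toy.groupby(strong)['abv'].count()

# A function, called on each index label.
by_func = toy.groupby(lambda label: label[0])['abv'].mean()

# A dict mapping index labels to group keys.
by_dict = toy.groupby({'a1': 'x', 'a2': 'x', 'b1': 'y', 'b2': 'y'})['abv'].mean()

# level=..., naming a level of a MultiIndex.
indexed = toy.set_index('style', append=True)
indexed.index.names = ['label', 'style']
by_level = indexed.groupby(level='style')['abv'].mean()
```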

In [None]:
gr = df.groupby('beer_style')
gr

We haven't really done anything yet. pandas has just done some book-keeping to figure out which **keys** go with which rows. Keys are the things we've grouped by (each `beer_style` in this case).

In [None]:
gr.ngroups

In [None]:
list(gr.groups)[0:5]
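
The book-keeping is easier to see on a tiny frame. In this sketch (made-up data, hypothetical names `toy` and `toy_gr`), `.groups` maps each key to the row labels that belong to it.

```python
import pandas as pd

# Toy data (hypothetical values).
toy = pd.DataFrame({
    'beer_style': ['Tripel', 'IPA', 'Tripel', 'Stout'],
    'abv': [9.0, 6.5, 8.0, 10.0],
})
toy_gr = toy.groupby('beer_style')

# Number of distinct keys.
n = toy_gr.ngroups

# Mapping of key -> row labels, converted to plain lists for display.
rows = {key: list(labels) for key, labels in toy_gr.groups.items()}
```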

In [None]:
cols = ['beer_style'] + df.columns.difference(['beer_style']).tolist()
cols

In [None]:
gr.get_group('Tripel')[cols]

In [None]:
df.loc[df.beer_style=='Tripel',cols]

The last two steps, apply and combine:

In [None]:
gr.agg('mean')

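This says apply the `mean` function to each column. In the pandas version this notebook was written for, non-numeric columns (nuisance columns) are silently excluded; pandas 2.0 and later raise an error for them instead. We can also select a subset of columns to perform the aggregation on.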

In [None]:
review_columns = ['abv','review_overall','review_appearance',
                  'review_palate','review_taste']
gr[review_columns].agg('mean')
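
A minimal sketch, on made-up data, of two explicit ways to keep an aggregation to numeric columns: selecting the columns, as above, or passing `numeric_only=True`. The `toy` frame and its values are assumptions for illustration.

```python
import pandas as pd

# Toy data (hypothetical values) with one non-numeric column besides the key.
toy = pd.DataFrame({
    'beer_style': ['Tripel', 'IPA', 'Tripel', 'IPA'],
    'beer_name': ['A', 'B', 'C', 'D'],
    'abv': [9.0, 6.5, 8.0, 7.5],
    'review_overall': [4.5, 4.0, 4.0, 3.5],
})

# Option 1: select the numeric columns before aggregating.
selected = toy.groupby('beer_style')[['abv', 'review_overall']].mean()

# Option 2: ask pandas to skip non-numeric columns.
numeric_means = toy.groupby('beer_style').mean(numeric_only=True)
```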

`.` attribute lookup works as well.

In [None]:
gr.abv.agg('mean')

Find the `beer_style` with the greatest variance in `abv`.

In [None]:
(df
   .groupby('beer_style')
   .abv
   .std()
   .sort_values(ascending=False)
 )

Multiple aggregations on one column

In [None]:
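# String names select pandas' own implementations and avoid the
# deprecation warning newer pandas emits for np.mean / np.std here.
gr['review_aroma'].agg(['mean', 'std', 'count'])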

Single aggregation on multiple columns

In [None]:
gr[review_columns].mean()

Multiple aggregations on multiple columns

In [None]:
result = gr[review_columns].agg(['mean', 'count', 'std'])
result.columns.names=['characteristic','measure']
result

Hierarchical Indexes in the columns can be awkward to work with, so I'll usually
move a level to the Index with `.stack`.

http://pandas.pydata.org/pandas-docs/stable/reshaping.html#reshaping-by-stacking-and-unstacking
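
As a sketch of what `.stack` does to the shape, the example below builds a small two-level result from made-up data and moves the `characteristic` level from the columns to the index. The `toy`, `wide`, and `long` names are hypothetical.

```python
import pandas as pd

# Toy data (hypothetical values).
toy = pd.DataFrame({
    'beer_style': ['Tripel', 'Tripel', 'IPA', 'IPA'],
    'abv': [9.0, 8.0, 6.5, 7.5],
    'review_taste': [4.5, 4.0, 3.5, 4.0],
})

# Two styles x (two characteristics * two measures) -> 2 rows, 4 columns.
wide = toy.groupby('beer_style')[['abv', 'review_taste']].agg(['mean', 'count'])
wide.columns.names = ['characteristic', 'measure']

# Move 'characteristic' into the index -> 4 rows, 2 columns.
long = wide.stack(level='characteristic')
```

After stacking, each row is one (`beer_style`, `characteristic`) pair and the remaining columns are the measures.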

In [None]:
result

In [None]:
multi = result.stack(level='characteristic')
multi

In [None]:
result.stack(level='measure')

In [None]:
# stack-unstack are inverses
(result
      .stack(level='measure')
      .unstack(level='measure')
 )

You can group by **levels** of a MultiIndex.

In [None]:
(result.stack(level='characteristic')
       .groupby(level='beer_style')
       ['mean']
       .agg(['min', 'max' ])
 )

Group by **multiple** columns

In [None]:
df.groupby(['brewer_id', 'beer_style'])[review_columns].mean()

### Exercise: Plot the relationship between review length (the `text` column) and average `review_overall`.

- Find the **len**gth of each review (remember the `df.text.str` namespace?)
- Group by that Series of review lengths
- Plot the group means using the `'.k'` plotting style

In [None]:
(df.groupby(df.text.str.len())
   .review_overall
   .mean()
   .plot(style='.k', figsize=(12,8))
 )


## What are we doing?

In [None]:
df.text.str.len()

In [None]:
df.groupby(df.text.str.len()).ngroups

We've seen a lot of permutations among number of groupers, number of columns to aggregate, and number of aggregators.
In fact, `.agg`, which returns one row per group, is just one way to combine the results. The three ways are

- `agg`: one row per group
- `transform`: identically shaped output as input
- `apply`: anything goes
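
The difference shows up in the length of the output. A minimal sketch on made-up data (the `toy` frame and the variable names are hypothetical):

```python
import pandas as pd

# Toy data (hypothetical values): two reviewers, five reviews.
toy = pd.DataFrame({
    'profile_name': ['ann', 'ann', 'bob', 'bob', 'bob'],
    'review_overall': [4.0, 5.0, 3.0, 3.5, 4.0],
})
grouped = toy.groupby('profile_name')['review_overall']

# agg: one row per group.
per_group = grouped.agg('mean')

# transform: one row per input row, aligned with the original index.
same_shape = grouped.transform('mean')

# apply: the function decides; here each group returns its two largest values.
top_two = grouped.apply(lambda s: s.nlargest(2))
```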


# Transform

The combined Series / DataFrame has the same shape as the input. For example, say you want to center each reviewer's reviews by subtracting that reviewer's mean.

In [None]:
def de_mean(reviews):
    s = reviews - reviews.mean()
    return s

In [None]:
de_mean(df.review_overall)

In [None]:
df.groupby('profile_name').transform(de_mean)

Oftentimes it is better to work with the groupby object directly, passing the name of a built-in aggregation such as `'mean'` rather than a Python function:

In [None]:
(df-df.groupby('profile_name').transform('mean')
).select_dtypes(exclude=['object'])
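
To check that the two approaches agree, here is a small sketch on made-up data. It restricts the work to numeric columns up front, which is an assumption of this example and not something the cells above do; the `toy` frame and the `slow`/`fast` names are hypothetical.

```python
import pandas as pd

# Toy data (hypothetical values).
toy = pd.DataFrame({
    'profile_name': ['ann', 'ann', 'bob', 'bob'],
    'review_overall': [4.0, 5.0, 3.0, 4.0],
    'review_taste': [3.5, 4.5, 2.0, 4.0],
})
numeric = ['review_overall', 'review_taste']

def de_mean(reviews):
    # Subtract the group's own mean from each value.
    return reviews - reviews.mean()

grouped = toy.groupby('profile_name')[numeric]

# Approach 1: call the Python function once per group and column.
slow = grouped.transform(de_mean)

# Approach 2: broadcast the built-in group means, then subtract once.
fast = toy[numeric] - grouped.transform('mean')
```

Both give the same centered values; the second form does the subtraction as a single whole-frame operation.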

In [None]:
%timeit df.groupby('profile_name').transform(de_mean)

In [None]:
%timeit (df-df.groupby('profile_name').transform('mean')).select_dtypes(exclude=['object'])

In [None]:
df.groupby('profile_name').ngroups