# Data Aggregation and Group Operations

After loading, merging, and preparing a data set, we often need to compute group statistics or pivot tables for reporting or visualization.

**Overview**:
* Splitting a pandas object into pieces
* Computing group summary statistics: count, mean, std, or user-defined functions
* Applying a function to each column of a DataFrame
* Applying within-group transformations or other manipulations, such as normalization, linear regression, rank, or subset selection
* Computing pivot tables
* Performing quantile analysis and other data-derived group analyses

# GroupBy Mechanics
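Hadley Wickham coined the term **split-apply-combine** to describe group operations: the data is split into groups by one or more keys, a function is applied to each group, and the results are combined into a single result object.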
![split-apply-combine](https://image.slidesharecdn.com/slides-151008060416-lva1-app6892/95/pandas-powerful-data-analysis-tools-for-python-19-638.jpg?cb=1444284343)

Each grouping key can take many forms, and the keys do not all have to be of the same type:
* A list or array of values that is the same length as the axis being grouped
* A value indicating a column name in a DataFrame
* A dict or Series giving a correspondence between the values on the axis being grouped and the group names
* A function to be invoked on the axis index or the individual labels in the index
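
The four kinds of keys listed above can be sketched on a tiny example. The `people` frame, its labels, and the `mapping` dict are hypothetical and exist only for illustration; they are not part of the data set used in the rest of this section.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, used only to illustrate the four kinds of keys.
people = pd.DataFrame(
    {"team": ["x", "y", "x", "y"], "score": [1.0, 2.0, 3.0, 4.0]},
    index=["Ann", "Bob", "Cy", "Dee"],
)

# 1. An array of values with the same length as the axis being grouped.
by_array = people["score"].groupby(np.array(["u", "u", "v", "v"])).sum()

# 2. A column name in the DataFrame.
by_column = people.groupby("team")["score"].sum()

# 3. A dict mapping index labels to group names.
mapping = {"Ann": "first", "Bob": "first", "Cy": "second", "Dee": "second"}
by_dict = people["score"].groupby(mapping).sum()

# 4. A function called on each index label (here: the length of the name).
by_func = people["score"].groupby(len).sum()
```

In each case the result is a Series indexed by the group names that the key produced.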

In [2]:
import pandas as pd
from pandas import DataFrame
from pandas import Series
import numpy as np

In [5]:
df = DataFrame({
        'key1': ['a', 'a', 'b', 'b', 'a'],
        'key2': ['one', 'two', 'one', 'two', 'one'],
        'data1': np.random.rand(5),
        'data2': np.random.rand(5)
    })
df

      data1     data2 key1 key2
0  0.251316  0.423363    a  one
1  0.065576  0.529084    a  two
2  0.084591  0.687881    b  one
3  0.483881  0.437802    b  two
4  0.344404  0.040863    a  one


Suppose we want to compute the **mean** of the data1 column using the group labels from **key1**:

In [6]:
grouped = df['data1'].groupby(df['key1'])

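The `grouped` variable is a GroupBy object; nothing has been computed yet apart from some intermediate information about the group key. The important thing is that when we call an aggregation such as `mean`, the data (a Series) is aggregated according to the group key, producing a new Series indexed by the unique values in the key1 column.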

In [8]:
grouped.mean()

key1
a    0.220432
b    0.284236
Name: data1, dtype: float64
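
To make the split-apply-combine steps behind this result explicit, here is a minimal sketch that performs them by hand and compares the outcome with `groupby`. The values are copied (rounded to six decimals) from the DataFrame output shown earlier, since the original data is random; the names `group_means`, `manual`, and `grouped_mean` are introduced only for this illustration.

```python
import pandas as pd

# Values copied (rounded) from the DataFrame output shown earlier.
data1 = pd.Series([0.251316, 0.065576, 0.084591, 0.483881, 0.344404], name="data1")
key1 = pd.Series(["a", "a", "b", "b", "a"], name="key1")

# Split: select the rows belonging to each unique key.
# Apply: take the mean of each piece.
group_means = {label: data1[key1 == label].mean() for label in key1.unique()}

# Combine: assemble the per-group results into a Series indexed by key.
manual = pd.Series(group_means, name="data1").sort_index()
manual.index.name = "key1"

# The groupby call performs the same three steps in one expression.
grouped_mean = data1.groupby(key1).mean()
```

Both `manual` and `grouped_mean` hold one mean per unique value of key1.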

If we instead pass multiple arrays as a list:

means = df['data1'].groupby([df['key1'], df['key2']]).mean()
means

Here the data is grouped using two keys, and the resulting Series has a hierarchical index consisting of the unique pairs of keys observed. We can therefore **unstack the hierarchical Series to get a DataFrame**:

In [14]:
means.unstack()

key2       one       two
key1
a     0.297860  0.065576
b     0.084591  0.483881
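
As a rough sketch of what `unstack` does with the two index levels: by default it moves the innermost level (key2) into the columns, and the `level` argument selects a different level to move. The values below are copied (rounded) from the DataFrame output shown earlier; `wide` and `wide_by_key1` are names introduced only for this example.

```python
import pandas as pd

# Values copied (rounded) from the DataFrame output shown earlier.
df = pd.DataFrame({
    "key1": ["a", "a", "b", "b", "a"],
    "key2": ["one", "two", "one", "two", "one"],
    "data1": [0.251316, 0.065576, 0.084591, 0.483881, 0.344404],
})

# Grouping by two keys gives a Series with a two-level index (key1, key2).
means = df["data1"].groupby([df["key1"], df["key2"]]).mean()

# Default: the innermost level (key2) becomes the columns.
wide = means.unstack()

# Alternatively, name the level that should become the columns.
wide_by_key1 = means.unstack(level="key1")
```

`wide` has key1 as its row index and key2 as its columns, while `wide_by_key1` is laid out the other way around.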


The group keys do not have to be Series; any arrays of the right length will work:

In [18]:
states = np.array(['Ohio', 'California', 'California', 'Ohio', 'Ohio'])
years = np.array([2005, 2005, 2005, 2005, 2006])


In [19]:
df['data1'].groupby([states, years]).mean()

California  2005    0.075083
Ohio        2005    0.367598
            2006    0.344404
Name: data1, dtype: float64

Frequently the grouping information is found in the same DataFrame as the data you want to work on. In that case, you can pass column names as the group keys:

In [21]:
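# key2 is not numeric, so it is excluded; recent pandas versions raise an error without numeric_only=True
df.groupby('key1').mean(numeric_only=True)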

         data1     data2
key1
a     0.220432  0.331104
b     0.284236  0.562842


In [22]:
df.groupby(['key1', 'key2']).mean()

              data1     data2
key1 key2
a    one   0.297860  0.232113
     two   0.065576  0.529084
b    one   0.084591  0.687881
     two   0.483881  0.437802
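
Passing a column name as the key is shorthand for passing that column itself. The sketch below checks this on values copied (rounded) from the DataFrame output shown earlier; the names `by_name` and `by_series` are introduced only for this comparison.

```python
import pandas as pd

# Values copied (rounded) from the DataFrame output shown earlier.
df = pd.DataFrame({
    "key1": ["a", "a", "b", "b", "a"],
    "key2": ["one", "two", "one", "two", "one"],
    "data1": [0.251316, 0.065576, 0.084591, 0.483881, 0.344404],
    "data2": [0.423363, 0.529084, 0.687881, 0.437802, 0.040863],
})

# Group by column name, then select the numeric columns to aggregate.
by_name = df.groupby("key1")[["data1", "data2"]].mean()

# Group the same numeric columns by the key1 column passed as a Series.
by_series = df[["data1", "data2"]].groupby(df["key1"]).mean()

# Both approaches produce the same group means.
same = by_name.equals(by_series)
```

Selecting the numeric columns explicitly also avoids trying to average the string column key2.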


## Iterating Over Groups