# Data Aggregation and Group Operations

After loading, merging, and preparing a data set, we often need to compute group statistics or pivot tables for reporting or visualization.

**Overview**:
* Splitting a pandas object into pieces
* Computing group summary statistics: count, mean, std, or user-defined functions
* Applying a function to each column of a DataFrame
* Applying within-group transformations or other manipulations, like normalization, linear regression, rank, or subset selection
* Computing pivot tables
* Performing quantile analysis and other data-derived group analyses

# GroupBy Mechanics
As [Hadley Wickham](https://en.wikipedia.org/wiki/Hadley_Wickham) put it, group operations all follow the same workflow: **split-apply-combine**
![split-apply-combine](https://image.slidesharecdn.com/slides-151008060416-lva1-app6892/95/pandas-powerful-data-analysis-tools-for-python-19-638.jpg?cb=1444284343)

Each grouping key can take many forms, and the keys do not all have to be of the same type:
* A list or array of values that is the same length as the axis being grouped
* A value indicating a column name in a DataFrame
* A dict or Series giving a correspondence between the values on the axis being grouped and the group names
* A function to be invoked on the axis index or the individual labels in the index
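
The four kinds of grouping key listed above can be sketched on a tiny frame. The data, the index labels and the variable names here are made up for illustration and are not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, small enough to check the sums by hand
df = pd.DataFrame(
    {"key": ["a", "b", "a", "b"], "value": [1, 2, 3, 4]},
    index=["x1", "y1", "x2", "y2"],
)

# 1. An array of values with the same length as the grouped axis
by_array = df["value"].groupby(np.array(["u", "u", "v", "v"])).sum()

# 2. A column name of the DataFrame
by_column = df.groupby("key")["value"].sum()

# 3. A dict mapping index labels to group names
label_to_group = {"x1": "first", "y1": "first", "x2": "second", "y2": "second"}
by_dict = df["value"].groupby(label_to_group).sum()

# 4. A function called on each index label (here: its first character)
by_function = df["value"].groupby(lambda label: label[0]).sum()

print(by_array, by_column, by_dict, by_function, sep="\n\n")
```

Each variant only changes how rows are assigned to groups; the apply and combine steps stay the same.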

In [2]:
import pandas as pd
from pandas import DataFrame
from pandas import Series
import numpy as np

In [43]:
df = DataFrame({
        'key1': ['a', 'a', 'b', 'b', 'a'],
        'key2': ['one', 'two', 'one', 'two', 'one'],
        'data1': np.random.rand(5),
        'data2': np.random.rand(5)
    })
df

Unnamed: 0,data1,data2,key1,key2
0,0.414884,0.047964,a,one
1,0.422564,0.048797,a,two
2,0.330925,0.970728,b,one
3,0.165235,0.112956,b,two
4,0.773809,0.698775,a,one


Suppose we want to compute the **mean** of the data1 column using the group labels from **key1**:

In [23]:
grouped = df['data1'].groupby(df['key1'])

The variable grouped is now a GroupBy object; nothing has been computed yet. The important thing is that calling **mean** on it aggregates the data (a Series) according to the group key, producing a new Series that is indexed by the unique values in the key1 column.

In [22]:
grouped.mean()

key1
a    0.366713
b    0.137267
Name: data1, dtype: float64

If instead we had passed multiple arrays as a list

In [14]:
means = df['data1'].groupby([df['key1'], df['key2']]).mean()
means

key1  key2
a     one     0.329705
      two     0.747123
b     one     0.816156
      two     0.588410
Name: data1, dtype: float64

we grouped the data using two keys, and the resulting Series now has a hierarchical index consisting of the unique pairs of keys observed. So we can **unstack the hierarchical Series to get a DataFrame**

In [15]:
means.unstack()

key2,one,two
key1,Unnamed: 1_level_1,Unnamed: 2_level_1
a,0.329705,0.747123
b,0.816156,0.58841


The group keys do not have to be Series; they can be any arrays of the right length:

In [16]:
states = np.array(['Ohio', 'California', 'California', 'Ohio', 'Ohio'])
years = np.array([2005, 2005, 2005, 2005, 2006])

In [19]:
df['data1'].groupby([states, years]).mean()

California  2005    0.781640
Ohio        2005    0.469020
            2006    0.309779
Name: data1, dtype: float64

Frequently the grouping information is found in the same DataFrame as the data you want to work on. In that case, we can pass column names as the group keys:

In [20]:
df.groupby('key1').mean()

Unnamed: 0_level_0,data1,data2
key1,Unnamed: 1_level_1,Unnamed: 2_level_1
a,0.468844,0.624352
b,0.702283,0.265704


In [22]:
df.groupby(['key1', 'key2']).mean()

Unnamed: 0_level_0,Unnamed: 1_level_0,data1,data2
key1,key2,Unnamed: 2_level_1,Unnamed: 3_level_1
a,one,0.29786,0.232113
a,two,0.065576,0.529084
b,one,0.084591,0.687881
b,two,0.483881,0.437802
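
The outputs above come from an older pandas that silently dropped non-numeric columns such as key2 before averaging. In recent pandas releases (2.0 and later, to the best of my knowledge) the same call raises an error instead, so here is a minimal sketch of two ways around that. The numbers are made up so the means are easy to verify.

```python
import pandas as pd

# Hypothetical data with the same layout as df above
df = pd.DataFrame({
    "key1": ["a", "a", "b", "b", "a"],
    "key2": ["one", "two", "one", "two", "one"],
    "data1": [1.0, 2.0, 3.0, 4.0, 6.0],
    "data2": [10.0, 20.0, 30.0, 40.0, 60.0],
})

# Option 1: ask pandas to skip non-numeric columns such as key2
means = df.groupby("key1").mean(numeric_only=True)

# Option 2: select the numeric columns explicitly before aggregating
same_means = df.groupby("key1")[["data1", "data2"]].mean()

print(means)
```

Selecting the columns explicitly is a little more verbose, but it makes clear which columns are being aggregated.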


## Iterating Over Groups

> The GroupBy object supports iteration, generating a sequence of 2-tuples containing
the group name along with the chunk of data


In [21]:
df

Unnamed: 0,data1,data2,key1,key2
0,0.349631,0.840004,a,one
1,0.747123,0.908781,a,two
2,0.816156,0.248073,b,one
3,0.58841,0.283334,b,two
4,0.309779,0.124273,a,one


In [26]:
df.groupby('key1')

<pandas.core.groupby.DataFrameGroupBy object at 0x0000000009228E48>

groupby divides the dataset into chunks of data, and each chunk can be accessed by **iterating** over the GroupBy object:

In [28]:
for name, group in df.groupby('key1'):
    print(name)   # the group key
    print(group)  # the rows belonging to that key

a
      data1     data2 key1 key2
0  0.349631  0.840004    a  one
1  0.747123  0.908781    a  two
4  0.309779  0.124273    a  one
b
      data1     data2 key1 key2
2  0.816156  0.248073    b  one
3  0.588410  0.283334    b  two


In [30]:
df.groupby(['key1', 'key2']).mean()

Unnamed: 0_level_0,Unnamed: 1_level_0,data1,data2
key1,key2,Unnamed: 2_level_1,Unnamed: 3_level_1
a,one,0.329705,0.482138
a,two,0.747123,0.908781
b,one,0.816156,0.248073
b,two,0.58841,0.283334


In the case of multiple keys, the first element of each pair is itself a tuple of key values:

In [42]:
for (k1, k2), group in df.groupby(['key1', 'key2']):
    print(k1, k2)
    print("===========")
    print(group)
    print("===========")

a one
      data1     data2 key1 key2
0  0.349631  0.840004    a  one
4  0.309779  0.124273    a  one
a two
      data1     data2 key1 key2
1  0.747123  0.908781    a  two
b one
      data1     data2 key1 key2
2  0.816156  0.248073    b  one
b two
     data1     data2 key1 key2
3  0.58841  0.283334    b  two


Of course, we can do whatever we want with the pieces of data, such as collecting them into a dict:

In [43]:
pieces = dict(list(df.groupby('key1')))
pieces

{'a':       data1     data2 key1 key2
 0  0.349631  0.840004    a  one
 1  0.747123  0.908781    a  two
 4  0.309779  0.124273    a  one, 'b':       data1     data2 key1 key2
 2  0.816156  0.248073    b  one
 3  0.588410  0.283334    b  two}

By default groupby groups on **axis=0** (the rows). We can also group on the other axis, for example grouping the columns by dtype:

In [51]:
df

Unnamed: 0,data1,data2,key1,key2
0,0.349631,0.840004,a,one
1,0.747123,0.908781,a,two
2,0.816156,0.248073,b,one
3,0.58841,0.283334,b,two
4,0.309779,0.124273,a,one


In [50]:
df.dtypes

data1    float64
data2    float64
key1      object
key2      object
dtype: object

In [52]:
grouped = df.groupby(df.dtypes, axis=1)
dict(list(grouped))

{dtype('float64'):       data1     data2
 0  0.349631  0.840004
 1  0.747123  0.908781
 2  0.816156  0.248073
 3  0.588410  0.283334
 4  0.309779  0.124273, dtype('O'):   key1 key2
 0    a  one
 1    a  two
 2    b  one
 3    b  two
 4    a  one}
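
In recent pandas releases the axis argument of groupby is deprecated, so the call above may emit a warning or stop working. A rough equivalent, sketched here under that assumption, is to transpose the frame, group its rows, and transpose each piece back. The toy data and the name dtype_names are mine.

```python
import pandas as pd

# Hypothetical frame with one string column and two float columns
df = pd.DataFrame({
    "key1": ["a", "b"],
    "data1": [1.0, 2.0],
    "data2": [3.0, 4.0],
})

# One label per column: the name of its dtype, e.g. "float64"
dtype_names = df.dtypes.astype(str)

# After transposing, the columns are rows, so an ordinary row-wise groupby
# can split them. Note that transposing a mixed-type frame stores the values
# as generic Python objects, so the pieces lose their original dtypes.
pieces = {name: chunk.T for name, chunk in df.T.groupby(dtype_names)}

print(pieces["float64"])
```

If only the split by type is needed, select_dtypes is usually simpler and keeps the original dtypes.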

## Selecting a Column or Subset of Columns

Indexing a GroupBy object created from a DataFrame with a column name or array of
column names has the effect of selecting those columns for aggregation

In [4]:
df.groupby('key1')['data1']
df.groupby('key2')['data2']

<pandas.core.groupby.SeriesGroupBy object at 0x7f242f7b0ed0>

In [7]:
df['data1'].groupby(df['key1'])
df['data2'].groupby(df['key2'])

<pandas.core.groupby.SeriesGroupBy object at 0x7f242f37c310>

With a large dataset, we may want to aggregate only a few columns. For example, to compute the **mean** of only the **data2** column:

In [11]:
df.groupby(['key1', 'key2'])[['data2']].mean()

Unnamed: 0_level_0,Unnamed: 1_level_0,data2
key1,key2,Unnamed: 2_level_1
a,one,0.499721
a,two,0.472313
b,one,0.382567
b,two,0.255023
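
The shape of the result depends on how the column is selected: a single name gives a grouped Series, a list of names gives a grouped DataFrame. A small sketch with made-up values:

```python
import pandas as pd

# Hypothetical data
df = pd.DataFrame({
    "key1": ["a", "a", "b"],
    "data2": [1.0, 3.0, 5.0],
})

as_series = df.groupby("key1")["data2"].mean()    # Series indexed by key1
as_frame = df.groupby("key1")[["data2"]].mean()   # DataFrame with one column

print(type(as_series).__name__, type(as_frame).__name__)
```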


## Grouping with Dicts and Series

In [53]:
people = DataFrame(
    np.random.randn(5,5), 
    columns=['a','b','c','d','e'],
    index=['Joe', 'Steve', 'Wes', 'Jim','Travis']
)
people

Unnamed: 0,a,b,c,d,e
Joe,-0.099241,-1.107979,1.38566,-1.558141,0.851939
Steve,0.495138,2.29809,0.744016,-1.607983,2.394997
Wes,-0.527139,0.484615,-0.948942,-0.025088,-0.709397
Jim,0.348103,1.154688,1.144065,-0.841691,0.456913
Travis,-0.173291,-0.589425,-1.314298,-0.896198,-0.736594


Add a few NaN values

In [3]:
people.iloc[2:3, [1, 2]] = np.nan  # row 'Wes', columns 'b' and 'c' (.ix was removed from pandas)
people

Unnamed: 0,a,b,c,d,e
Joe,-1.021701,1.967837,-0.383938,0.552804,0.037949
Steve,0.852651,0.060948,-1.21108,-2.401445,-0.117531
Wes,-0.894143,,,-0.186364,-1.105413
Jim,0.309725,-0.100202,-0.563988,-0.308601,-0.478978
Travis,-0.624195,0.558411,0.152154,0.346468,1.071798


Suppose we have a group correspondence for the columns and want to sum the columns together by group:

In [4]:
mapping = {
    'a': 'red', 
    'b': 'red',
    'c': 'blue',
    'd': 'blue',
    'e': 'red', 
    'f': 'orange'
}
mapping

{'a': 'red', 'b': 'red', 'c': 'blue', 'd': 'blue', 'e': 'red', 'f': 'orange'}

Now, you could easily construct an array from this dict to pass to groupby, but instead
we can just pass the dict (the unused key 'f' is simply ignored):

In [5]:
by_column = people.groupby(mapping, axis=1)
by_column.sum()

Unnamed: 0,blue,red
Joe,0.168866,0.984085
Steve,-3.612525,0.796069
Wes,-0.186364,-1.999556
Jim,-0.872589,-0.269455
Travis,0.498622,1.006013
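
The same mapping also works as a Series, which pandas treats as a fixed-size dict. The sketch below uses integer data so the sums can be checked by hand, and it transposes instead of passing axis=1 because that argument is deprecated in recent pandas. The name map_series is an assumption of mine.

```python
import numpy as np
import pandas as pd

# Hypothetical integer data: row Joe is 0..4, row Steve is 5..9, and so on
people = pd.DataFrame(
    np.arange(25).reshape(5, 5),
    columns=["a", "b", "c", "d", "e"],
    index=["Joe", "Steve", "Wes", "Jim", "Travis"],
)
map_series = pd.Series(
    {"a": "red", "b": "red", "c": "blue", "d": "blue", "e": "red", "f": "orange"}
)

# Group the columns by color: transpose, group the rows, transpose back.
# The key 'f' matches no column, so no 'orange' group appears.
by_color = people.T.groupby(map_series).sum().T

print(by_color)
```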


## Grouping with Functions

In [14]:
people.groupby(len).sum()

Unnamed: 0,a,b,c,d,e
3,-1.606119,1.867635,-0.947926,0.057839,-1.546442
5,0.852651,0.060948,-1.21108,-2.401445,-0.117531
6,-0.624195,0.558411,0.152154,0.346468,1.071798


In [7]:
key_list = ['one', 'one', 'one', 'two', 'two']
people.groupby([len, key_list]).sum()

Unnamed: 0,Unnamed: 1,a,b,c,d,e
3,one,-1.915844,1.967837,-0.383938,0.36644,-1.067464
3,two,0.309725,-0.100202,-0.563988,-0.308601,-0.478978
5,one,0.852651,0.060948,-1.21108,-2.401445,-0.117531
6,two,-0.624195,0.558411,0.152154,0.346468,1.071798
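
What happens in the two cells above: a function passed as a group key is called once per index label, and its return values become the group names. With len, the people are grouped by the length of their names. Here is a sketch with a frame of ones (my own toy data), so that each sum is simply a group size:

```python
import pandas as pd

# Hypothetical frame filled with ones
people = pd.DataFrame(
    1,
    index=["Joe", "Steve", "Wes", "Jim", "Travis"],
    columns=["a", "b"],
)

# len is applied to each index label: Joe, Wes, Jim -> 3; Steve -> 5; Travis -> 6
by_length = people.groupby(len).sum()

# Functions can be mixed with arrays, lists, dicts or Series in one key list
key_list = ["one", "one", "one", "two", "two"]
mixed = people.groupby([len, key_list]).sum()

print(by_length)
print(mixed)
```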


## Grouping by Index Levels

A final convenience for hierarchically-indexed data sets is the ability to aggregate using
one of the levels of an axis index. To do this, pass the level number or name using the
**level** keyword:

In [17]:
columns = pd.MultiIndex.from_arrays([
        ['US', 'US', 'US', 'JP', 'JP'],
        [1, 3, 5, 1, 3]],
        names=['cty', 'tenor']      
    )
columns

MultiIndex(levels=[[u'JP', u'US'], [1, 3, 5]],
           labels=[[1, 1, 1, 0, 0], [0, 1, 2, 0, 1]],
           names=[u'cty', u'tenor'])

In [18]:
hier_df = DataFrame(np.random.randn(4,5), columns=columns)
hier_df

cty,US,US,US,JP,JP
tenor,1,3,5,1,3
0,2.290035,0.287998,-0.411387,-1.599136,-0.86027
1,0.588556,-0.67543,-0.545894,1.995311,0.986947
2,0.160333,-1.355068,0.665241,0.092038,0.198539
3,-0.958391,-0.314045,0.519504,1.975757,-1.13565
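
The cells above build the hierarchical columns but stop before the grouping itself. A minimal sketch of the **level** keyword follows, with deterministic data in place of the random values. Because the levels live on the columns, the frame is transposed first; this avoids the axis=1 argument, which is deprecated in recent pandas.

```python
import numpy as np
import pandas as pd

columns = pd.MultiIndex.from_arrays(
    [["US", "US", "US", "JP", "JP"], [1, 3, 5, 1, 3]],
    names=["cty", "tenor"],
)
# Hypothetical deterministic data instead of np.random.randn
hier_df = pd.DataFrame(np.arange(20).reshape(4, 5), columns=columns)

# Count the non-null values per country: group the transposed rows by the
# 'cty' level, aggregate, and transpose back.
counts = hier_df.T.groupby(level="cty").count().T

print(counts)
```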


# Data Aggregation

Data aggregation means producing a scalar value from an array. We have already used several such methods: **sum**, **mean**, **count**, **min**, **max**. We can also use other methods, such as **quantile**, or aggregations of our own devising.

In [23]:
df 

Unnamed: 0,data1,data2,key1,key2
0,0.00533,0.850351,a,one
1,0.383974,0.892172,a,two
2,0.602897,0.215213,b,one
3,0.924972,0.820692,b,two
4,0.85254,0.712074,a,one


In [24]:
grouped = df.groupby('key1')

In [30]:
grouped['data1'].quantile(0.9)

key1
a    0.758827
b    0.892765
Name: data1, dtype: float64

We can use our own aggregation function by passing it to the **agg** method:

In [25]:
def peak_to_peak(arr):
    return arr.max() - arr.min()

In [32]:
grouped.agg(peak_to_peak)

Unnamed: 0_level_0,data1,data2
key1,Unnamed: 1_level_1,Unnamed: 2_level_1
a,0.84721,0.180098
b,0.322076,0.605479
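
A custom aggregation is applied to every remaining column, and subtracting strings would fail, so it is safer to select the numeric columns first. A small sketch with made-up numbers that can be verified by hand:

```python
import pandas as pd

# Hypothetical data: group a holds 1, 2, 10 and group b holds 3, 4
df = pd.DataFrame({
    "key1": ["a", "a", "b", "b", "a"],
    "key2": ["one", "two", "one", "two", "one"],
    "data1": [1.0, 2.0, 3.0, 4.0, 10.0],
})

def peak_to_peak(arr):
    """Return the range (max minus min) of one group's values."""
    return arr.max() - arr.min()

# Only data1 is aggregated; the string column key2 is left out on purpose
spread = df.groupby("key1")[["data1"]].agg(peak_to_peak)

print(spread)
```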


**Note**: **describe** is not strictly an aggregation, but it shows the results of several aggregations at once

In [33]:
grouped.describe()

Unnamed: 0_level_0,Unnamed: 1_level_0,data1,data2
key1,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
a,count,3.0,3.0
a,mean,0.413948,0.818199
a,std,0.424399,0.094256
a,min,0.00533,0.712074
a,25%,0.194652,0.781212
a,50%,0.383974,0.850351
a,75%,0.618257,0.871262
a,max,0.85254,0.892172
b,count,2.0,2.0
b,mean,0.763934,0.517952


## Column-wise and Multiple Function Application

Apply several functions to the groups using the **agg** method

In [4]:
url = 'https://raw.github.com/pandas-dev/pandas/master/pandas/tests/data/tips.csv'

In [6]:
tips = pd.read_csv(url)
tips

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size
0,16.99,1.01,Female,No,Sun,Dinner,2
1,10.34,1.66,Male,No,Sun,Dinner,3
2,21.01,3.50,Male,No,Sun,Dinner,3
3,23.68,3.31,Male,No,Sun,Dinner,2
4,24.59,3.61,Female,No,Sun,Dinner,4
5,25.29,4.71,Male,No,Sun,Dinner,4
6,8.77,2.00,Male,No,Sun,Dinner,2
7,26.88,3.12,Male,No,Sun,Dinner,4
8,15.04,1.96,Male,No,Sun,Dinner,2
9,14.78,3.23,Male,No,Sun,Dinner,2


We add a tipping percentage column to this DataFrame

In [7]:
tips['tip_pct'] = tips['tip'] / tips['total_bill']

In [8]:
tips

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size,tip_pct
0,16.99,1.01,Female,No,Sun,Dinner,2,0.059447
1,10.34,1.66,Male,No,Sun,Dinner,3,0.160542
2,21.01,3.50,Male,No,Sun,Dinner,3,0.166587
3,23.68,3.31,Male,No,Sun,Dinner,2,0.139780
4,24.59,3.61,Female,No,Sun,Dinner,4,0.146808
5,25.29,4.71,Male,No,Sun,Dinner,4,0.186240
6,8.77,2.00,Male,No,Sun,Dinner,2,0.228050
7,26.88,3.12,Male,No,Sun,Dinner,4,0.116071
8,15.04,1.96,Male,No,Sun,Dinner,2,0.130319
9,14.78,3.23,Male,No,Sun,Dinner,2,0.218539


We'd like to compute the **mean** of **tip_pct**, grouped by **sex** and **smoker**

In [18]:
grouped = tips.groupby(['sex', 'smoker'])
grouped

<pandas.core.groupby.DataFrameGroupBy object at 0x000000000A00C860>

In [19]:
grouped.mean()

Unnamed: 0_level_0,Unnamed: 1_level_0,total_bill,tip,size,tip_pct
sex,smoker,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
Female,No,18.105185,2.773519,2.592593,0.156921
Female,Yes,17.977879,2.931515,2.242424,0.18215
Male,No,19.791237,3.113402,2.71134,0.160669
Male,Yes,22.2845,3.051167,2.5,0.152771


In [29]:
grouped_pct = grouped['tip_pct']
grouped_pct.mean()

sex     smoker
Female  No        0.156921
        Yes       0.182150
Male    No        0.160669
        Yes       0.152771
Name: tip_pct, dtype: float64

But now we'd like to apply multiple aggregation functions at once. To do that, we pass a list of functions to **agg**:

In [30]:
grouped['tip_pct'].agg(['mean', 'std'])

Unnamed: 0_level_0,Unnamed: 1_level_0,mean,std
sex,smoker,Unnamed: 2_level_1,Unnamed: 3_level_1
Female,No,0.156921,0.036421
Female,Yes,0.18215,0.071595
Male,No,0.160669,0.041849
Male,Yes,0.152771,0.090588


We can control the names of the resulting columns by passing (name, function) tuples to **agg**:

In [31]:
grouped_pct.agg([('mean_result', 'mean'), ('std_result', 'std')])

Unnamed: 0_level_0,Unnamed: 1_level_0,mean_result,std_result
sex,smoker,Unnamed: 2_level_1,Unnamed: 3_level_1
Female,No,0.156921,0.036421
Female,Yes,0.18215,0.071595
Male,No,0.160669,0.041849
Male,Yes,0.152771,0.090588
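
Newer pandas versions also accept keyword arguments for this, often called named aggregation: each keyword becomes an output column name. The sketch below assumes a pandas version that supports it (0.25 or later, as far as I know) and uses a tiny made-up table instead of the tips data.

```python
import pandas as pd

# Hypothetical miniature of the tips data
tips = pd.DataFrame({
    "smoker": ["No", "No", "Yes", "Yes"],
    "tip_pct": [0.10, 0.20, 0.15, 0.25],
})

# keyword = output column name, value = aggregation to apply
summary = tips.groupby("smoker")["tip_pct"].agg(
    mean_result="mean",
    std_result="std",
)

print(summary)
```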


Or we can apply all functions to many columns

In [33]:
result = grouped[['tip_pct', 'total_bill']].agg(['max', 'min', 'count'])

In [34]:
result

Unnamed: 0_level_0,Unnamed: 1_level_0,tip_pct,tip_pct,tip_pct,total_bill,total_bill,total_bill
Unnamed: 0_level_1,Unnamed: 1_level_1,max,min,count,max,min,count
sex,smoker,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2
Female,No,0.252672,0.056797,54,35.83,7.25,54
Female,Yes,0.416667,0.056433,33,44.3,3.07,33
Male,No,0.29199,0.071804,97,48.33,7.51,97
Male,Yes,0.710345,0.035638,60,50.81,7.25,60


In [31] we passed tuples to control the column names. To apply different functions to different columns, we can pass a dictionary that maps column names to functions:

In [35]:
grouped.agg({'tip': np.max, 'size': 'sum'})

Unnamed: 0_level_0,Unnamed: 1_level_0,tip,size
sex,smoker,Unnamed: 2_level_1,Unnamed: 3_level_1
Female,No,5.2,140
Female,Yes,6.5,74
Male,No,9.0,263
Male,Yes,10.0,150


A dict is more flexible than the other ways of specifying aggregations. Applying a different set of functions to each column is hard to express with tuples, but easy with a dict:

In [36]:
grouped.agg({
        'tip_pct': ['min', 'max', 'mean', 'std'],
        'size': 'sum'
    })

Unnamed: 0_level_0,Unnamed: 1_level_0,tip_pct,tip_pct,tip_pct,tip_pct,size
Unnamed: 0_level_1,Unnamed: 1_level_1,min,max,mean,std,sum
sex,smoker,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2
Female,No,0.056797,0.252672,0.156921,0.036421,140
Female,Yes,0.056433,0.416667,0.18215,0.071595,74
Male,No,0.071804,0.29199,0.160669,0.041849,263
Male,Yes,0.035638,0.710345,0.152771,0.090588,150


## Returning Aggregated Data in “unindexed” Form

In [41]:
tips.groupby(['sex', 'smoker']).mean()

Unnamed: 0_level_0,Unnamed: 1_level_0,total_bill,tip,size,tip_pct
sex,smoker,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
Female,No,18.105185,2.773519,2.592593,0.156921
Female,Yes,17.977879,2.931515,2.242424,0.18215
Male,No,19.791237,3.113402,2.71134,0.160669
Male,Yes,22.2845,3.051167,2.5,0.152771


In [39]:
tips.groupby(['sex', 'smoker'], as_index=False).mean()

Unnamed: 0,sex,smoker,total_bill,tip,size,tip_pct
0,Female,No,18.105185,2.773519,2.592593,0.156921
1,Female,Yes,17.977879,2.931515,2.242424,0.18215
2,Male,No,19.791237,3.113402,2.71134,0.160669
3,Male,Yes,22.2845,3.051167,2.5,0.152771
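
The same flat layout can be obtained by calling reset_index on the indexed result; as_index=False just skips building the hierarchical index in the first place. A small sketch with made-up data:

```python
import pandas as pd

# Hypothetical miniature of the tips data
tips = pd.DataFrame({
    "sex": ["Female", "Female", "Male", "Male"],
    "smoker": ["No", "Yes", "No", "No"],
    "tip": [1.0, 2.0, 3.0, 5.0],
})

flat = tips.groupby(["sex", "smoker"], as_index=False).mean()
also_flat = tips.groupby(["sex", "smoker"]).mean().reset_index()

print(flat)
```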


# Group-wise Operations and Transformations

Aggregation is only one kind of group operation. It is a special case in the more general class of data transformations; that is, it accepts functions that reduce a one-dimensional array to a scalar value. In this section, I will introduce you to the transform and apply methods, which will enable you to do many other kinds of group operations.

Suppose, instead, we wanted to add a column to a DataFrame containing group means for each index. One way to do this is to aggregate, then merge:

In [44]:
df

Unnamed: 0,data1,data2,key1,key2
0,0.414884,0.047964,a,one
1,0.422564,0.048797,a,two
2,0.330925,0.970728,b,one
3,0.165235,0.112956,b,two
4,0.773809,0.698775,a,one


In [51]:
k1_mean = df.groupby('key1').mean().add_prefix('mean_')
k1_mean

Unnamed: 0_level_0,mean_data1,mean_data2
key1,Unnamed: 1_level_1,Unnamed: 2_level_1
a,0.537086,0.265179
b,0.24808,0.541842


In [50]:
pd.merge(df, k1_mean, left_on='key1', right_index=True)

Unnamed: 0,data1,data2,key1,key2,mean_data1,mean_data2
0,0.414884,0.047964,a,one,0.537086,0.265179
1,0.422564,0.048797,a,two,0.537086,0.265179
4,0.773809,0.698775,a,one,0.537086,0.265179
2,0.330925,0.970728,b,one,0.24808,0.541842
3,0.165235,0.112956,b,two,0.24808,0.541842


Alternatively, we can use **transform** to broadcast the group means back to the shape of the original data:

In [64]:
df.groupby('key1').transform(np.mean)

Unnamed: 0,data1,data2
0,0.537086,0.265179
1,0.537086,0.265179
2,0.24808,0.541842
3,0.24808,0.541842
4,0.537086,0.265179


Here is the same idea on the people DataFrame, comparing **mean** with **transform**:

In [57]:
people

Unnamed: 0,a,b,c,d,e
Joe,-0.099241,-1.107979,1.38566,-1.558141,0.851939
Steve,0.495138,2.29809,0.744016,-1.607983,2.394997
Wes,-0.527139,0.484615,-0.948942,-0.025088,-0.709397
Jim,0.348103,1.154688,1.144065,-0.841691,0.456913
Travis,-0.173291,-0.589425,-1.314298,-0.896198,-0.736594


In [58]:
key = ['one', 'two', 'one', 'two', 'one']
people.groupby(key).mean()

Unnamed: 0,a,b,c,d,e
one,-0.266557,-0.404263,-0.292527,-0.826476,-0.198017
two,0.42162,1.726389,0.944041,-1.224837,1.425955


In [59]:
people.groupby(key).transform(np.mean)

Unnamed: 0,a,b,c,d,e
Joe,-0.266557,-0.404263,-0.292527,-0.826476,-0.198017
Steve,0.42162,1.726389,0.944041,-1.224837,1.425955
Wes,-0.266557,-0.404263,-0.292527,-0.826476,-0.198017
Jim,0.42162,1.726389,0.944041,-1.224837,1.425955
Travis,-0.266557,-0.404263,-0.292527,-0.826476,-0.198017
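
Since the output of transform is aligned with the original rows, it is convenient for within-group normalization such as subtracting each group's mean. A minimal sketch with made-up values:

```python
import pandas as pd

# Hypothetical data: group a has mean 2, group b has mean 15
df = pd.DataFrame({
    "key": ["a", "a", "b", "b"],
    "value": [1.0, 3.0, 10.0, 20.0],
})

# One group mean per original row, in the original row order
group_mean = df.groupby("key")["value"].transform("mean")

# Subtracting it centers every group around zero
demeaned = df["value"] - group_mean

print(demeaned.tolist())
```

Passing the string "mean" rather than np.mean lets pandas use its built-in implementation.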


## Apply: General split-apply-combine

Like aggregate, transform is a more specialized function having rigid requirements: the passed function must either produce a scalar value to be broadcast (like np.mean) or a transformed array of the same size.

**apply** splits the object being manipulated into pieces, invokes the passed function on each piece, then attempts to concatenate the pieces together.

Suppose we wanted to select the top five **tip_pct** values by group. First, it’s straightforward to write a function that selects the rows with the largest values in a particular column:

In [69]:
def top(df, n=5, column = 'tip_pct'):
    return df.sort_values(by=column)[-n:]

In [70]:
top(tips, n=6)

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size,tip_pct
109,14.31,4.0,Female,Yes,Sat,Dinner,2,0.279525
183,23.17,6.5,Male,Yes,Sun,Dinner,4,0.280535
232,11.61,3.39,Male,No,Sat,Dinner,2,0.29199
67,3.07,1.0,Female,Yes,Sat,Dinner,1,0.325733
178,9.6,4.0,Female,Yes,Sun,Dinner,2,0.416667
172,7.25,5.15,Male,Yes,Sun,Dinner,2,0.710345


Now, if we **group by smoker**, say, and **call apply** with this function, we get the following:

In [72]:
tips.groupby('smoker').apply(top)

Unnamed: 0_level_0,Unnamed: 1_level_0,total_bill,tip,sex,smoker,day,time,size,tip_pct
smoker,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1
No,88,24.71,5.85,Male,No,Thur,Lunch,2,0.236746
No,185,20.69,5.0,Male,No,Sun,Dinner,5,0.241663
No,51,10.29,2.6,Female,No,Sun,Dinner,2,0.252672
No,149,7.51,2.0,Male,No,Thur,Lunch,2,0.266312
No,232,11.61,3.39,Male,No,Sat,Dinner,2,0.29199
Yes,109,14.31,4.0,Female,Yes,Sat,Dinner,2,0.279525
Yes,183,23.17,6.5,Male,Yes,Sun,Dinner,4,0.280535
Yes,67,3.07,1.0,Female,Yes,Sat,Dinner,1,0.325733
Yes,178,9.6,4.0,Female,Yes,Sun,Dinner,2,0.416667
Yes,172,7.25,5.15,Male,Yes,Sun,Dinner,2,0.710345


The top function is called on each piece of the DataFrame, then the results are glued together using pandas.concat.

![image](data/image/apply_architecture.png)
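
To make the split, apply and combine steps concrete, they can be written out by hand. This is only a sketch of the idea, not how pandas implements apply internally, and the small table stands in for the tips data.

```python
import pandas as pd

# Hypothetical miniature of the tips data
tips = pd.DataFrame({
    "smoker": ["No", "No", "No", "Yes", "Yes", "Yes"],
    "tip_pct": [0.1, 0.3, 0.2, 0.5, 0.4, 0.6],
})

def top(df, n=2, column="tip_pct"):
    """Return the n rows with the largest values in `column`."""
    return df.sort_values(by=column)[-n:]

# split: iterate over the groups; apply: call top on each piece
pieces = {name: top(group) for name, group in tips.groupby("smoker")}

# combine: concatenate the pieces, using the group names as the outer index level
manual = pd.concat(pieces, names=["smoker"])

print(manual)
```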

If the function takes other arguments or keywords, we can pass them to apply after the function itself:

In [79]:
tips.groupby(['smoker', 'day']).apply(top, n=1, column='total_bill')

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,total_bill,tip,sex,smoker,day,time,size,tip_pct
smoker,day,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1
No,Fri,94,22.75,3.25,Female,No,Fri,Dinner,2,0.142857
No,Sat,212,48.33,9.0,Male,No,Sat,Dinner,4,0.18622
No,Sun,156,48.17,5.0,Male,No,Sun,Dinner,6,0.103799
No,Thur,142,41.19,5.0,Male,No,Thur,Lunch,5,0.121389
Yes,Fri,95,40.17,4.73,Male,Yes,Fri,Dinner,4,0.11775
Yes,Sat,170,50.81,10.0,Male,Yes,Sat,Dinner,3,0.196812
Yes,Sun,182,45.35,3.5,Male,Yes,Sun,Dinner,3,0.077178
Yes,Thur,197,43.11,5.0,Female,Yes,Thur,Lunch,4,0.115982


# Pivot Tables and Cross-Tabulation
> A pivot table is a data summarization tool. 

> It aggregates a table of data by one or more keys, arranging the data in a rectangle with some of the group keys along the rows and some along the columns

> Pivot tables in Python with pandas are made possible using the groupby facility described in this chapter combined with reshape operations utilizing hierarchical indexing

Returning to the tipping data set, suppose I wanted to compute a table of group means (the default pivot_table aggregation type) arranged by sex and smoker on the rows:

In [81]:
tips.pivot_table(index=['sex', 'smoker'])

Unnamed: 0_level_0,Unnamed: 1_level_0,size,tip,tip_pct,total_bill
sex,smoker,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
Female,No,2.592593,2.773519,0.156921,18.105185
Female,Yes,2.242424,2.931515,0.18215,17.977879
Male,No,2.71134,3.113402,0.160669,19.791237
Male,Yes,2.5,3.051167,0.152771,22.2845


The same result is easily produced using **groupby**:

In [83]:
tips.groupby(['sex', 'smoker']).mean()

Unnamed: 0_level_0,Unnamed: 1_level_0,total_bill,tip,size,tip_pct
sex,smoker,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
Female,No,18.105185,2.773519,2.592593,0.156921
Female,Yes,17.977879,2.931515,2.242424,0.18215
Male,No,19.791237,3.113402,2.71134,0.160669
Male,Yes,22.2845,3.051167,2.5,0.152771


Now, suppose we want to aggregate only tip_pct and size, and additionally group by day. I’ll put smoker in the table columns and day in the rows:

In [84]:
tips.pivot_table(['tip_pct', 'size'], index=['sex', 'day'], columns='smoker')

Unnamed: 0_level_0,Unnamed: 1_level_0,tip_pct,tip_pct,size,size
Unnamed: 0_level_1,smoker,No,Yes,No,Yes
sex,day,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2
Female,Fri,0.165296,0.209129,2.5,2.0
Female,Sat,0.147993,0.163817,2.307692,2.2
Female,Sun,0.16571,0.237075,3.071429,2.5
Female,Thur,0.155971,0.163073,2.48,2.428571
Male,Fri,0.138005,0.14473,2.0,2.125
Male,Sat,0.162132,0.139067,2.65625,2.62963
Male,Sun,0.158291,0.173964,2.883721,2.6
Male,Thur,0.165706,0.164417,2.5,2.3


How about now? It seems quite hard to do with **groupby** huh?
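
For comparison, the groupby route needs an explicit reshape step: aggregate on all the keys, then unstack the key that should become the columns. A sketch on a small made-up table, checking that both routes give the same numbers:

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the tips data
tips = pd.DataFrame({
    "sex": ["Female", "Female", "Male", "Male", "Female", "Male"],
    "smoker": ["No", "Yes", "No", "Yes", "No", "Yes"],
    "tip": [1.0, 2.0, 3.0, 4.0, 3.0, 6.0],
})

# pivot_table: group means with sex on the rows and smoker on the columns
pivot = tips.pivot_table("tip", index="sex", columns="smoker")

# groupby equivalent: aggregate on both keys, then move smoker into the columns
via_groupby = tips.groupby(["sex", "smoker"])["tip"].mean().unstack("smoker")

print(np.allclose(pivot.values, via_groupby.values))
```

pivot_table wraps these two steps in one call and adds conveniences on top.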

**pivot_table** has a **margins** option, which is False by default.

Passing margins=True has the effect of adding All row and column labels, with corresponding values being the group statistics for all the data within a single tier

In [89]:
tips.pivot_table(['tip_pct', 'size'], index=['sex', 'day'], columns='smoker', margins=True)

Unnamed: 0_level_0,Unnamed: 1_level_0,tip_pct,tip_pct,tip_pct,size,size,size
Unnamed: 0_level_1,smoker,No,Yes,All,No,Yes,All
sex,day,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2
Female,Fri,0.165296,0.209129,0.199388,2.5,2.0,2.111111
Female,Sat,0.147993,0.163817,0.15647,2.307692,2.2,2.25
Female,Sun,0.16571,0.237075,0.181569,3.071429,2.5,2.944444
Female,Thur,0.155971,0.163073,0.157525,2.48,2.428571,2.46875
Male,Fri,0.138005,0.14473,0.143385,2.0,2.125,2.1
Male,Sat,0.162132,0.139067,0.151577,2.65625,2.62963,2.644068
Male,Sun,0.158291,0.173964,0.162344,2.883721,2.6,2.810345
Male,Thur,0.165706,0.164417,0.165276,2.5,2.3,2.433333
All,,0.159328,0.163196,0.160803,2.668874,2.408602,2.569672


To use a different aggregation function, pass it to **aggfunc**:

In [90]:
tips.pivot_table(['tip_pct', 'size'], index=['sex', 'day'], columns='smoker', aggfunc=len, margins=True)

Unnamed: 0_level_0,Unnamed: 1_level_0,tip_pct,tip_pct,tip_pct,size,size,size
Unnamed: 0_level_1,smoker,No,Yes,All,No,Yes,All
sex,day,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2
Female,Fri,2.0,7.0,9.0,2.0,7.0,9.0
Female,Sat,13.0,15.0,28.0,13.0,15.0,28.0
Female,Sun,14.0,4.0,18.0,14.0,4.0,18.0
Female,Thur,25.0,7.0,32.0,25.0,7.0,32.0
Male,Fri,2.0,8.0,10.0,2.0,8.0,10.0
Male,Sat,32.0,27.0,59.0,32.0,27.0,59.0
Male,Sun,43.0,15.0,58.0,43.0,15.0,58.0
Male,Thur,20.0,10.0,30.0,20.0,10.0,30.0
All,,151.0,93.0,244.0,151.0,93.0,244.0
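
When some combinations of keys never occur, the corresponding cells come out as NaN. The fill_value option of pivot_table replaces them. A small sketch with made-up data in which nobody smoked on Saturday:

```python
import pandas as pd

# Hypothetical data: there is no ("Sat", "Yes") row
tips = pd.DataFrame({
    "day": ["Fri", "Fri", "Sat"],
    "smoker": ["No", "Yes", "No"],
    "size": [2, 3, 4],
})

table = tips.pivot_table(
    values="size",
    index="day",
    columns="smoker",
    aggfunc="sum",
    fill_value=0,   # fill the empty Sat/Yes cell with 0 instead of NaN
)

print(table)
```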


## Cross-Tabulations: Crosstab

A cross-tabulation (or crosstab for short) is a special case of a pivot table that computes group frequencies. Here is a canonical example taken from the Wikipedia page on cross-tabulation:

In [93]:
from io import StringIO  # Python 3 location of StringIO
data = """\
Sample    Gender    Handedness
1    Female    Right-handed
2    Male    Left-handed
3    Female    Right-handed
4    Male    Right-handed
5    Male    Left-handed
6    Male    Right-handed
7    Female    Right-handed
8    Female    Left-handed
9    Male    Right-handed
10    Female    Right-handed"""
data = pd.read_table(StringIO(data), sep=r'\s+')

In [92]:
data

Unnamed: 0,Sample,Gender,Handedness
0,1,Female,Right-handed
1,2,Male,Left-handed
2,3,Female,Right-handed
3,4,Male,Right-handed
4,5,Male,Left-handed
5,6,Male,Right-handed
6,7,Female,Right-handed
7,8,Female,Left-handed
8,9,Male,Right-handed
9,10,Female,Right-handed


In [94]:
pd.crosstab(data.Gender, data.Handedness, margins=True)

Handedness,Left-handed,Right-handed,All
Gender,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Female,1,4,5
Male,2,3,5
All,3,7,10


With **pivot_table** we can get the same result like this:

In [102]:
pd.pivot_table(data, index='Gender', columns='Handedness', values='Sample', aggfunc=len, margins=True)

Handedness,Left-handed,Right-handed,All
Gender,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Female,1.0,4.0,5.0
Male,2.0,3.0,5.0
All,3.0,7.0,10.0


# Example: 2012 Federal Election Commission Database