### GroupBy
* GroupBy follows the split-apply-combine pattern.
* First, the data contained in a pandas object is split into groups based on one or more supplied keys.
* Splitting is performed along a particular axis of the object.
* Then a function is applied to each group, producing a new value.
* Finally, the results of all those function applications are combined into a result object.
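
The three steps can be spelled out by hand. The following is a minimal sketch on a small made-up DataFrame (the names `df`, `pieces`, `applied`, `manual` and `auto` are illustrative, not part of the notebook); it only mirrors the idea and is not how pandas implements `groupby` internally.

```python
import pandas as pd

# Hypothetical toy data, chosen so the arithmetic is easy to check by hand.
df = pd.DataFrame({"key": ["a", "b", "a", "b"], "val": [1.0, 2.0, 3.0, 4.0]})

# Split: one sub-Series of values per distinct key.
pieces = {k: df.loc[df["key"] == k, "val"] for k in df["key"].unique()}

# Apply: reduce every piece to a single number.
applied = {k: piece.mean() for k, piece in pieces.items()}

# Combine: assemble the per-group results into one Series.
manual = pd.Series(applied).sort_index()

# groupby performs the same three steps in one call.
auto = df.groupby("key")["val"].mean()

print(manual.tolist())  # [2.0, 3.0]
print(auto.tolist())    # [2.0, 3.0]
```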

![](images/group_aggregation.JPG)

In [1]:
import pandas as pd
import numpy as np

In [2]:
d1 = pd.DataFrame({'key1': ['a', 'a', 'b', 'b', 'a'], 'key2': ['one', 'two', 'one', 'two', 'one'],
                   'data1': np.random.randn(5), 'data2': np.random.randn(5)})

In [3]:
d1

Unnamed: 0,key1,key2,data1,data2
0,a,one,-0.618873,1.511084
1,a,two,-0.527544,0.471104
2,b,one,0.906042,-1.578376
3,b,two,-0.606813,-1.412274
4,a,one,1.493525,-1.213575


* Compute the mean of the `data1` column, grouped by the labels in `key1`.

In [5]:
grouped = d1['data1'].groupby(d1['key1'])

In [6]:
grouped

<pandas.core.groupby.groupby.SeriesGroupBy object at 0x000002D5604010F0>

In [7]:
grouped.mean()

key1
a    0.115703
b    0.149615
Name: data1, dtype: float64

* The data is aggregated according to the group key, producing a new Series that is indexed by the unique values in the `key1` column.

In [9]:
means = d1['data1'].groupby([d1['key1'], d1['key2']]).mean() # Group data with 2 keys.

In [10]:
means

key1  key2
a     one     0.437326
      two    -0.527544
b     one     0.906042
      two    -0.606813
Name: data1, dtype: float64

In [11]:
means.unstack()

key2,one,two
key1,Unnamed: 1_level_1,Unnamed: 2_level_1
a,0.437326,-0.527544
b,0.906042,-0.606813


* A group key can be any array of the right length.

In [12]:
states = np.array(['Ohio', 'California', 'California', 'Ohio', 'Ohio'])
years = np.array([2005,2005,2006,2005,2006])

In [13]:
d1['data1'].groupby([states, years]).mean()

California  2005   -0.527544
            2006    0.906042
Ohio        2005   -0.612843
            2006    1.493525
Name: data1, dtype: float64

In [14]:
d1.groupby('key1').mean(numeric_only=True)  # key2 is non-numeric, so restrict the mean to numeric columns

Unnamed: 0_level_0,data1,data2
key1,Unnamed: 1_level_1,Unnamed: 2_level_1
a,0.115703,0.256205
b,0.149615,-1.495325


In [15]:
d1.groupby(['key1', 'key2']).size()

key1  key2
a     one     2
      two     1
b     one     1
      two     1
dtype: int64

* By default, missing values in a group key are excluded from the result.
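
A small sketch of this behaviour, using made-up data. The `dropna=False` argument is assumed to be available (it was added to `groupby` in pandas 1.1); with it, rows whose key is missing are kept as their own group.

```python
import numpy as np
import pandas as pd

# Hypothetical values and keys; the second key is missing.
s = pd.Series([1.0, 2.0, 3.0, 4.0])
keys = pd.Series(["x", np.nan, "x", "y"])

# Default: the row with the NaN key is silently left out.
default_sum = s.groupby(keys).sum()

# dropna=False (pandas >= 1.1) keeps NaN as a separate group.
with_nan = s.groupby(keys, dropna=False).sum()

print(len(default_sum))  # 2
print(len(with_nan))     # 3
```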

#### Iterating over groups
* Iterating over a GroupBy object generates a sequence of 2-tuples, each containing the group name and the corresponding chunk of data.

In [16]:
for name, group in d1.groupby('key1'):
    print(name)
    print(group)

a
  key1 key2     data1     data2
0    a  one -0.618873  1.511084
1    a  two -0.527544  0.471104
4    a  one  1.493525 -1.213575
b
  key1 key2     data1     data2
2    b  one  0.906042 -1.578376
3    b  two -0.606813 -1.412274


In [17]:
for name, group in d1.groupby(['key1', 'key2']):
    print(name)
    print(group)

('a', 'one')
  key1 key2     data1     data2
0    a  one -0.618873  1.511084
4    a  one  1.493525 -1.213575
('a', 'two')
  key1 key2     data1     data2
1    a  two -0.527544  0.471104
('b', 'one')
  key1 key2     data1     data2
2    b  one  0.906042 -1.578376
('b', 'two')
  key1 key2     data1     data2
3    b  two -0.606813 -1.412274


In [20]:
temp = dict(list(d1.groupby('key1')))

In [21]:
temp

{'a':   key1 key2     data1     data2
 0    a  one -0.618873  1.511084
 1    a  two -0.527544  0.471104
 4    a  one  1.493525 -1.213575, 'b':   key1 key2     data1     data2
 2    b  one  0.906042 -1.578376
 3    b  two -0.606813 -1.412274}

In [22]:
temp['a']

Unnamed: 0,key1,key2,data1,data2
0,a,one,-0.618873,1.511084
1,a,two,-0.527544,0.471104
4,a,one,1.493525,-1.213575


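* By default, `groupby` groups on axis 0 (the rows). Passing `axis=1` groups the columns instead; note that this argument is deprecated in recent pandas releases (2.1 and later).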

In [24]:
d1.dtypes

key1      object
key2      object
data1    float64
data2    float64
dtype: object

In [27]:
d1

Unnamed: 0,key1,key2,data1,data2
0,a,one,-0.618873,1.511084
1,a,two,-0.527544,0.471104
2,b,one,0.906042,-1.578376
3,b,two,-0.606813,-1.412274
4,a,one,1.493525,-1.213575


In [26]:
for name, group in d1.groupby(d1.dtypes, axis=1):
    print(name)
    print(group)

float64
      data1     data2
0 -0.618873  1.511084
1 -0.527544  0.471104
2  0.906042 -1.578376
3 -0.606813 -1.412274
4  1.493525 -1.213575
object
  key1 key2
0    a  one
1    a  two
2    b  one
3    b  two
4    a  one


In [28]:
d1['data1'].groupby(d1['key1']).mean()

key1
a    0.115703
b    0.149615
Name: data1, dtype: float64

In [30]:
d1.groupby(d1['key1'])['data1'].mean() # series output

key1
a    0.115703
b    0.149615
Name: data1, dtype: float64

In [31]:
d1.groupby(d1['key1'])[['data1']].mean() # dataframe output

Unnamed: 0_level_0,data1
key1,Unnamed: 1_level_1
a,0.115703
b,0.149615


In [40]:
d2 = pd.DataFrame(np.random.randn(5,5), columns=['a','b','c','d','e'], index = ['Joe','Steve','Wes','Jim','Travis'])

In [41]:
d2.iloc[2:3, [1,2]] = np.nan

In [42]:
d2

Unnamed: 0,a,b,c,d,e
Joe,-0.008334,-0.987271,-1.627561,0.45969,1.917741
Steve,-1.585743,-1.422776,-0.336122,-1.614666,1.084406
Wes,1.048305,,,-0.302358,0.751676
Jim,0.328887,-0.198034,-0.555475,-0.089334,1.649375
Travis,-1.53113,0.876439,-1.083185,-0.35341,0.582724


In [43]:
mapping = {'a':'red','b':'red','c':'blue','d':'blue','e':'red','f':'orange'}

In [44]:
mapping

{'a': 'red', 'b': 'red', 'c': 'blue', 'd': 'blue', 'e': 'red', 'f': 'orange'}

In [46]:
by_column = d2.groupby(mapping, axis = 1)

In [47]:
by_column.sum()

Unnamed: 0,blue,red
Joe,-1.167871,0.922136
Steve,-1.950787,-1.924113
Wes,-0.302358,1.799981
Jim,-0.644809,1.780229
Travis,-1.436595,-0.071967


In [48]:
map_series = pd.Series(mapping)

In [49]:
d2.groupby(map_series, axis = 1).count()

Unnamed: 0,blue,red
Joe,2,3
Steve,2,3
Wes,1,2
Jim,2,3
Travis,2,3


* Grouping with a function
    - Any function passed as a group key is called once per index value, and its return values are used as the group names.
* The index of `d2` holds first names.
* Group by the length of each name:

In [51]:
d2.groupby(len).sum()

Unnamed: 0,a,b,c,d,e
3,1.368858,-1.185304,-2.183036,0.067998,4.318792
5,-1.585743,-1.422776,-0.336122,-1.614666,1.084406
6,-1.53113,0.876439,-1.083185,-0.35341,0.582724


* We could compute an array of name lengths and group by that, but passing `len` is much simpler.
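
To make the comparison concrete, here is a minimal sketch with a made-up one-column frame (`people`, `by_func`, `name_lengths` and `by_array` are illustrative names). Both routes should give the same groups.

```python
import pandas as pd

# Hypothetical frame indexed by first name.
people = pd.DataFrame(
    {"score": [1, 2, 3, 4, 5]},
    index=["Joe", "Steve", "Wes", "Jim", "Travis"],
)

# Route 1: pass the function; it is called on every index label.
by_func = people.groupby(len).sum()

# Route 2: precompute the lengths as an array and group by that.
name_lengths = people.index.str.len().to_numpy()
by_array = people.groupby(name_lengths).sum()

print(by_func["score"].tolist())   # [8, 2, 5]
print(by_array["score"].tolist())  # [8, 2, 5]
```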

In [52]:
key_list = ['one','one','one','two','two']

In [53]:
d2.groupby([len, key_list]).min()

Unnamed: 0,Unnamed: 1,a,b,c,d,e
3,one,-0.008334,-0.987271,-1.627561,-0.302358,0.751676
3,two,0.328887,-0.198034,-0.555475,-0.089334,1.649375
5,one,-1.585743,-1.422776,-0.336122,-1.614666,1.084406
6,two,-1.53113,0.876439,-1.083185,-0.35341,0.582724


In [54]:
columns = pd.MultiIndex.from_arrays([['US','US','US','JP','JP'], [1,3,5,1,3]], names = ['city', 'tendor'])

In [55]:
d3 = pd.DataFrame(np.random.rand(4,5), columns=columns)

In [56]:
d3

city,US,US,US,JP,JP
tendor,1,3,5,1,3
0,0.270602,0.726883,0.285037,0.978483,0.940017
1,0.947552,0.002189,0.948212,0.529383,0.071125
2,0.976992,0.38365,0.330687,0.685868,0.544808
3,0.352115,0.485531,0.587385,0.052047,0.414181


In [57]:
d3.groupby(level='city',axis=1).count()

city,JP,US
0,2,3
1,2,3
2,2,3
3,2,3


-------------

### Data aggregation
* Aggregation is a data transformation that produces a scalar value from an array, e.g. mean, count, sum, min.
![](images/aggregation_methods.JPG)

* We can also use other methods that are defined for Series or DataFrame columns, such as `quantile`.

In [58]:
d1

Unnamed: 0,key1,key2,data1,data2
0,a,one,-0.618873,1.511084
1,a,two,-0.527544,0.471104
2,b,one,0.906042,-1.578376
3,b,two,-0.606813,-1.412274
4,a,one,1.493525,-1.213575


In [59]:
d1.groupby('key1')[['data1', 'data2']].quantile(0.9)  # select the numeric columns explicitly

0.9,data1,data2
key1,Unnamed: 1_level_1,Unnamed: 2_level_1
a,1.089311,1.303088
b,0.754757,-1.428884


In [60]:
def my_range(arr):
    return arr.max() - arr.min()

In [61]:
d1.groupby('key1')[['data1', 'data2']].agg(my_range)  # agg accepts custom functions; they are generally slower than the built-in aggregations

Unnamed: 0_level_0,data1,data2
key1,Unnamed: 1_level_1,Unnamed: 2_level_1
a,2.112397,2.724659
b,1.512856,0.166102
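
When a custom aggregation can be written as a combination of built-in ones, that form is usually preferable. A minimal sketch on made-up data (the names `df`, `g`, `custom` and `builtin` are illustrative):

```python
import pandas as pd

# Hypothetical data with two groups.
df = pd.DataFrame({"key": ["a", "a", "b", "b"], "x": [1.0, 4.0, 10.0, 12.0]})
g = df.groupby("key")["x"]

# Custom Python function, called once per group.
custom = g.agg(lambda arr: arr.max() - arr.min())

# Same range, built from two built-in aggregations.
builtin = g.max() - g.min()

print(custom.tolist())   # [3.0, 2.0]
print(builtin.tolist())  # [3.0, 2.0]
```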


In [63]:
d1.groupby('key1').describe()

Unnamed: 0_level_0,data1,data1,data1,data1,data1,data1,data1,data1,data2,data2,data2,data2,data2,data2,data2,data2
Unnamed: 0_level_1,count,mean,std,min,25%,50%,75%,max,count,mean,std,min,25%,50%,75%,max
key1,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2,Unnamed: 9_level_2,Unnamed: 10_level_2,Unnamed: 11_level_2,Unnamed: 12_level_2,Unnamed: 13_level_2,Unnamed: 14_level_2,Unnamed: 15_level_2,Unnamed: 16_level_2
a,3.0,0.115703,1.194102,-0.618873,-0.573208,-0.527544,0.48299,1.493525,3.0,0.256205,1.374983,-1.213575,-0.371235,0.471104,0.991094,1.511084
b,2.0,0.149615,1.069751,-0.606813,-0.228599,0.149615,0.527829,0.906042,2.0,-1.495325,0.117452,-1.578376,-1.536851,-1.495325,-1.4538,-1.412274


In [64]:
d1

Unnamed: 0,key1,key2,data1,data2
0,a,one,-0.618873,1.511084
1,a,two,-0.527544,0.471104
2,b,one,0.906042,-1.578376
3,b,two,-0.606813,-1.412274
4,a,one,1.493525,-1.213575


In [68]:
grouped = d1.groupby(['key1', 'key2'])

In [69]:
grouped['data1'].mean()

key1  key2
a     one     0.437326
      two    -0.527544
b     one     0.906042
      two    -0.606813
Name: data1, dtype: float64

In [70]:
grouped['data1'].agg('mean')

key1  key2
a     one     0.437326
      two    -0.527544
b     one     0.906042
      two    -0.606813
Name: data1, dtype: float64

In [71]:
grouped['data1'].agg(['mean', 'std', 'sum', my_range]) # calculating several aggregation at once

Unnamed: 0_level_0,Unnamed: 1_level_0,mean,std,sum,my_range
key1,key2,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
a,one,0.437326,1.49369,0.874652,2.112397
a,two,-0.527544,,-0.527544,0.0
b,one,0.906042,,0.906042,0.0
b,two,-0.606813,,-0.606813,0.0


* We can name the resulting columns.

In [73]:
grouped['data1'].agg([('foo', 'mean'), ('bar', 'sum')]) # pass a list of (column name, function) tuples

Unnamed: 0_level_0,Unnamed: 1_level_0,foo,bar
key1,key2,Unnamed: 2_level_1,Unnamed: 3_level_1
a,one,0.437326,0.874652
a,two,-0.527544,-0.527544
b,one,0.906042,0.906042
b,two,-0.606813,-0.606813
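
An alternative spelling is keyword-based "named aggregation", which is assumed here to be available (pandas 0.25 or later). A minimal sketch on made-up data:

```python
import pandas as pd

# Hypothetical data; only the column names echo the notebook.
df = pd.DataFrame({"key": ["a", "a", "b"], "data1": [1.0, 3.0, 5.0]})

# Each keyword is an output column; the tuple is (input column, function).
named = df.groupby("key").agg(
    foo=("data1", "mean"),
    bar=("data1", "sum"),
)

print(list(named.columns))    # ['foo', 'bar']
print(named["foo"].tolist())  # [2.0, 5.0]
print(named["bar"].tolist())  # [4.0, 5.0]
```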


* Apply a list of aggregations to every column, or pass a dict to specify different aggregations for each column.

In [74]:
functions = ['count', 'mean', 'max']

In [77]:
d1.groupby(['key1', 'key2']).agg(functions)

Unnamed: 0_level_0,Unnamed: 1_level_0,data1,data1,data1,data2,data2,data2
Unnamed: 0_level_1,Unnamed: 1_level_1,count,mean,max,count,mean,max
key1,key2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2
a,one,2,0.437326,1.493525,2,0.148755,1.511084
a,two,1,-0.527544,-0.527544,1,0.471104,0.471104
b,one,1,0.906042,0.906042,1,-1.578376,-1.578376
b,two,1,-0.606813,-0.606813,1,-1.412274,-1.412274


In [79]:
d1.groupby(['key1','key2']).agg({'data1':['max', 'mean'], 'data2':'sum'})

Unnamed: 0_level_0,Unnamed: 1_level_0,data1,data1,data2
Unnamed: 0_level_1,Unnamed: 1_level_1,max,mean,sum
key1,key2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2
a,one,1.493525,0.437326,0.297509
a,two,-0.527544,-0.527544,0.471104
b,one,0.906042,0.906042,-1.578376
b,two,-0.606813,-0.606813,-1.412274


In [80]:
d1.groupby(['key1', 'key2'], as_index=False).mean()

Unnamed: 0,key1,key2,data1,data2
0,a,one,0.437326,0.148755
1,a,two,-0.527544,0.471104
2,b,one,0.906042,-1.578376
3,b,two,-0.606813,-1.412274


* Of course, calling `reset_index` on the aggregated result produces the same output.
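
A minimal sketch of the two equivalent routes, on made-up data (`flat` and `reset` are illustrative names):

```python
import pandas as pd

# Hypothetical data with two key columns and one numeric column.
df = pd.DataFrame({
    "key1": ["a", "a", "b"],
    "key2": ["one", "two", "one"],
    "val": [1.0, 2.0, 3.0],
})

# Route 1: ask groupby not to move the keys into the index.
flat = df.groupby(["key1", "key2"], as_index=False).mean()

# Route 2: aggregate normally, then move the index levels back to columns.
reset = df.groupby(["key1", "key2"]).mean().reset_index()

print(list(flat.columns))   # ['key1', 'key2', 'val']
print(list(reset.columns))  # ['key1', 'key2', 'val']
```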

### Apply

In [81]:
d1

Unnamed: 0,key1,key2,data1,data2
0,a,one,-0.618873,1.511084
1,a,two,-0.527544,0.471104
2,b,one,0.906042,-1.578376
3,b,two,-0.606813,-1.412274
4,a,one,1.493525,-1.213575


* Define a function that selects the rows with the highest values in a particular column.

In [85]:
def high_val(df, column = 'data2', n = 5):
    return df.sort_values(by=column)[-n:]

In [86]:
high_val(d1, 'data2', 5)

Unnamed: 0,key1,key2,data1,data2
2,b,one,0.906042,-1.578376
3,b,two,-0.606813,-1.412274
4,a,one,1.493525,-1.213575
1,a,two,-0.527544,0.471104
0,a,one,-0.618873,1.511084


In [87]:
d1.groupby('key1').apply(high_val)

Unnamed: 0_level_0,Unnamed: 1_level_0,key1,key2,data1,data2
key1,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
a,4,a,one,1.493525,-1.213575
a,1,a,two,-0.527544,0.471104
a,0,a,one,-0.618873,1.511084
b,2,b,one,0.906042,-1.578376
b,3,b,two,-0.606813,-1.412274


* `high_val` is called on each group of the DataFrame, and the results are then glued together with `pd.concat`.
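
The "call per group, then concatenate" idea can be sketched by hand. This is an illustration on made-up data with a hypothetical helper `top`; it is not the actual pandas code path.

```python
import pandas as pd

# Hypothetical data.
df = pd.DataFrame({"key": ["a", "b", "a", "b", "a"], "val": [5, 1, 3, 2, 4]})

def top(group, n=2):
    # Keep the n rows with the largest 'val' (ascending order, as in high_val).
    return group.sort_values("val").tail(n)

# Call the function once per group...
pieces = {name: top(group) for name, group in df.groupby("key")}

# ...then glue the pieces together; the dict keys become the outer index level.
manual = pd.concat(pieces)

print(manual["val"].tolist())  # [4, 5, 1, 2]
print(manual.index.tolist())   # [('a', 4), ('a', 0), ('b', 1), ('b', 3)]
```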

In [89]:
d1.groupby('key1').apply(high_val, n = 1, column = 'data2')

Unnamed: 0_level_0,Unnamed: 1_level_0,key1,key2,data1,data2
key1,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
a,0,a,one,-0.618873,1.511084
b,3,b,two,-0.606813,-1.412274


In [94]:
d1.groupby('key1').apply(high_val, n = 3,  column = 'data2')

Unnamed: 0_level_0,Unnamed: 1_level_0,key1,key2,data1,data2
key1,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
a,4,a,one,1.493525,-1.213575
a,1,a,two,-0.527544,0.471104
a,0,a,one,-0.618873,1.511084
b,2,b,one,0.906042,-1.578376
b,3,b,two,-0.606813,-1.412274


In [93]:
d1.groupby('key1', group_keys=False).apply(high_val, n = 3,  column = 'data2')

Unnamed: 0,key1,key2,data1,data2
4,a,one,1.493525,-1.213575
1,a,two,-0.527544,0.471104
0,a,one,-0.618873,1.511084
2,b,one,0.906042,-1.578376
3,b,two,-0.606813,-1.412274


In [91]:
d1.groupby(['key1', 'key2']).apply(high_val, n = 1, column = 'data2')

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,key1,key2,data1,data2
key1,key2,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
a,one,0,a,one,-0.618873,1.511084
a,two,1,a,two,-0.527544,0.471104
b,one,2,b,one,0.906042,-1.578376
b,two,3,b,two,-0.606813,-1.412274


In [96]:
d4 = pd.DataFrame({'data1': np.random.randn(1000), 'data2':np.random.randn(1000)})

In [98]:
quartiles = pd.cut(d4.data1, 4)

In [100]:
quartiles[:10]

0     (-0.0681, 1.583]
1     (-0.0681, 1.583]
2     (-0.0681, 1.583]
3     (-0.0681, 1.583]
4    (-1.719, -0.0681]
5     (-0.0681, 1.583]
6       (1.583, 3.235]
7    (-1.719, -0.0681]
8    (-1.719, -0.0681]
9    (-1.719, -0.0681]
Name: data1, dtype: category
Categories (4, interval[float64]): [(-3.377, -1.719] < (-1.719, -0.0681] < (-0.0681, 1.583] < (1.583, 3.235]]

In [101]:
def get_stats(group):
    return {'min': group.min(), 'max':group.max(), 'count': group.count(), 'mean':group.mean()}

In [103]:
d4.data2.groupby(quartiles).apply(get_stats)

data1                   
(-3.377, -1.719]   count     47.000000
                   max        2.172956
                   mean      -0.041010
                   min       -1.717327
(-1.719, -0.0681]  count    453.000000
                   max        2.889836
                   mean       0.030703
                   min       -3.123397
(-0.0681, 1.583]   count    443.000000
                   max        3.159362
                   mean       0.037314
                   min       -2.989615
(1.583, 3.235]     count     57.000000
                   max        1.876527
                   mean      -0.086981
                   min       -2.499795
Name: data2, dtype: float64

In [105]:
d4.data2.groupby(pd.qcut(d4.data1, 10, labels = False)).apply(get_stats)

data1       
0      count    100.000000
       max        2.889836
       mean      -0.052962
       min       -2.301903
1      count    100.000000
       max        2.711814
       mean       0.066532
       min       -2.232335
2      count    100.000000
       max        2.409237
       mean       0.082589
       min       -3.123397
3      count    100.000000
       max        2.279508
       mean      -0.001378
       min       -2.427033
4      count    100.000000
       max        2.496134
       mean       0.025028
       min       -2.585958
5      count    100.000000
       max        2.423703
       mean       0.090496
       min       -2.165560
6      count    100.000000
       max        2.835519
       mean       0.026451
       min       -2.419422
7      count    100.000000
       max        1.685521
       mean       0.035669
       min       -2.989615
8      count    100.000000
       max        3.159362
       mean       0.026481
       min       -2.176838
9      count   

#### Filling missing values

In [107]:
s3 = pd.Series(np.random.randn(6))

In [108]:
s3[::2] = np.nan

In [109]:
s3

0         NaN
1    0.206436
2         NaN
3   -0.953021
4         NaN
5    0.467627
dtype: float64

In [111]:
s3.fillna(s3.mean())

0   -0.092986
1    0.206436
2   -0.092986
3   -0.953021
4   -0.092986
5    0.467627
dtype: float64

* What if we want to fill missing values with a different value for each group?

In [112]:
states = ['Ohio','New york','vermont','Florida','Oregon','Nevada','California','Idaho']

In [114]:
group_keys = ['East']*4 + ['West'] * 4

In [116]:
s5 = pd.Series(np.random.randn(8), index=states)

In [119]:
s5[['vermont', 'Nevada', 'Idaho']] = np.nan

In [120]:
s5

Ohio         -0.688325
New york      0.259817
vermont            NaN
Florida       1.566686
Oregon       -0.979961
Nevada             NaN
California   -0.153419
Idaho              NaN
dtype: float64

In [122]:
s5.groupby(group_keys).mean()

East    0.379393
West   -0.566690
dtype: float64

In [123]:
fill_nan = lambda g: g.fillna(g.mean())

In [124]:
s5.groupby(group_keys).apply(fill_nan)

Ohio         -0.688325
New york      0.259817
vermont       0.379393
Florida       1.566686
Oregon       -0.979961
Nevada       -0.566690
California   -0.153419
Idaho        -0.566690
dtype: float64

In [125]:
fill_values = {'East':0.54, 'West':5.0}

In [129]:
fill_func = lambda g: g.fillna(fill_values[g.name])

In [130]:
s5.groupby(group_keys).apply(fill_func)

Ohio         -0.688325
New york      0.259817
vermont       0.540000
Florida       1.566686
Oregon       -0.979961
Nevada        5.000000
California   -0.153419
Idaho         5.000000
dtype: float64

#### Random sampling and permutation

In [131]:
suits = ['H', 'S', 'C', 'D'] # hearts, spades, clubs, diamonds

In [132]:
card_val = (list(range(1, 11)) + [10] * 3) * 4

In [133]:
base_names = ['A'] + list(range(2,11)) + ['J', 'K', 'Q']

In [134]:
cards = []
for s in suits:
    cards.extend(str(num) + s for num in base_names)

In [150]:
deck = pd.Series(card_val, index = cards)

In [151]:
deck[:13]

AH      1
2H      2
3H      3
4H      4
5H      5
6H      6
7H      7
8H      8
9H      9
10H    10
JH     10
KH     10
QH     10
dtype: int64

In [152]:
def draw(deck, n = 5):
    # Series.sample draws n items at random, without replacement by default
    return deck.sample(n)

In [154]:
draw(deck)

7H    7
7S    7
9D    9
AC    1
3H    3
dtype: int64

* Draw 2 random cards from each suit.

In [155]:
get_suit = lambda card: card[-1]

In [156]:
deck.groupby(get_suit).apply(draw, n =2)

C  4C      4
   KC     10
D  6D      6
   10D    10
H  KH     10
   8H      8
S  10S    10
   9S      9
dtype: int64

In [157]:
deck.groupby(get_suit, group_keys=False).apply(draw, n = 2)

9C      9
QC     10
2D      2
10D    10
5H      5
9H      9
5S      5
9S      9
dtype: int64

In [161]:
d5 = pd.DataFrame({'category': ['a','a','a','a','b','b','b', 'b'],
                   'data': np.random.randn(8), 'weights': np.random.rand(8)})

In [162]:
d5

Unnamed: 0,category,data,weights
0,a,-2.015779,0.988813
1,a,-0.000844,0.974148
2,a,1.986629,0.091222
3,a,1.322706,0.941849
4,b,0.698232,0.33958
5,b,1.313581,0.972968
6,b,0.127343,0.233369
7,b,-0.808366,0.651929


* Compute the weighted average of `data` for each category.

In [164]:
grouped = d5.groupby('category')

In [165]:
get_wavg = lambda g: np.average(g['data'], weights=g['weights'])

In [166]:
grouped.apply(get_wavg)

category
a   -0.189262
b    0.463135
dtype: float64
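
For reference, the weighted average of a group is sum(weight * value) / sum(weight). The sketch below checks that formula against `np.average` on made-up numbers (all names are illustrative):

```python
import numpy as np
import pandas as pd

# Hypothetical data with easy-to-check weights.
df = pd.DataFrame({
    "category": ["a", "a", "b", "b"],
    "data": [1.0, 3.0, 10.0, 20.0],
    "weights": [1.0, 3.0, 1.0, 1.0],
})

# Weighted average by hand, one group at a time.
manual = {}
for name, g in df.groupby("category"):
    manual[name] = (g["data"] * g["weights"]).sum() / g["weights"].sum()

# The same quantity via np.average.
via_numpy = {
    name: np.average(g["data"], weights=g["weights"])
    for name, g in df.groupby("category")
}

print(manual["a"], manual["b"])  # 2.5 15.0
```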

In [175]:
import seaborn as sns

In [176]:
planets = sns.load_dataset('planets')

In [178]:
planets.shape

(1035, 6)

In [179]:
planets.head()

Unnamed: 0,method,number,orbital_period,mass,distance,year
0,Radial Velocity,1,269.3,7.1,77.4,2006
1,Radial Velocity,1,874.774,2.21,56.95,2008
2,Radial Velocity,1,763.0,2.6,19.84,2011
3,Radial Velocity,1,326.03,19.4,110.62,2007
4,Radial Velocity,1,516.22,10.5,119.47,2009


In [180]:
planets.dropna().describe()

Unnamed: 0,number,orbital_period,mass,distance,year
count,498.0,498.0,498.0,498.0,498.0
mean,1.73494,835.778671,2.50932,52.068213,2007.37751
std,1.17572,1469.128259,3.636274,46.596041,4.167284
min,1.0,1.3283,0.0036,1.35,1989.0
25%,1.0,38.27225,0.2125,24.4975,2005.0
50%,1.0,357.0,1.245,39.94,2009.0
75%,2.0,999.6,2.8675,59.3325,2011.0
max,6.0,17337.5,25.0,354.0,2014.0


In [181]:
planets.groupby('method')

<pandas.core.groupby.groupby.DataFrameGroupBy object at 0x000002D56BD394A8>

* No computation is done at this point; evaluation is lazy. All computation happens during the aggregation/apply step.
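
A minimal sketch of the two stages on made-up data (`df`, `gb` and `medians` are illustrative names): creating the GroupBy object only records how to group, and values are computed once an aggregation is requested.

```python
import pandas as pd

# Hypothetical data.
df = pd.DataFrame({"method": ["x", "y", "x"], "period": [1.0, 5.0, 3.0]})

# Stage 1: only a GroupBy object is returned, no aggregate values yet.
gb = df.groupby("method")
print(type(gb).__name__)  # DataFrameGroupBy

# Stage 2: the aggregation triggers the actual computation.
medians = gb["period"].median()
print(medians.tolist())   # [2.0, 5.0]
```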

In [184]:
planets.groupby('method')['orbital_period'].median()

method
Astrometry                         631.180000
Eclipse Timing Variations         4343.500000
Imaging                          27500.000000
Microlensing                      3300.000000
Orbital Brightness Modulation        0.342887
Pulsar Timing                       66.541900
Pulsation Timing Variations       1170.000000
Radial Velocity                    360.200000
Transit                              5.714932
Transit Timing Variations           57.011000
Name: orbital_period, dtype: float64

In [185]:
planets.groupby('method')['year'].describe()

Unnamed: 0_level_0,count,mean,std,min,25%,50%,75%,max
method,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
Astrometry,2.0,2011.5,2.12132,2010.0,2010.75,2011.5,2012.25,2013.0
Eclipse Timing Variations,9.0,2010.0,1.414214,2008.0,2009.0,2010.0,2011.0,2012.0
Imaging,38.0,2009.131579,2.781901,2004.0,2008.0,2009.0,2011.0,2013.0
Microlensing,23.0,2009.782609,2.859697,2004.0,2008.0,2010.0,2012.0,2013.0
Orbital Brightness Modulation,3.0,2011.666667,1.154701,2011.0,2011.0,2011.0,2012.0,2013.0
Pulsar Timing,5.0,1998.4,8.38451,1992.0,1992.0,1994.0,2003.0,2011.0
Pulsation Timing Variations,1.0,2007.0,,2007.0,2007.0,2007.0,2007.0,2007.0
Radial Velocity,553.0,2007.518987,4.249052,1989.0,2005.0,2009.0,2011.0,2014.0
Transit,397.0,2011.236776,2.077867,2002.0,2010.0,2012.0,2013.0,2014.0
Transit Timing Variations,4.0,2012.5,1.290994,2011.0,2011.75,2012.5,2013.25,2014.0


In [208]:
decade = 10 * (planets['year'] //10)

In [209]:
decade = decade.astype(str) + 's'
decade.name = 'decade'

In [210]:
planets.groupby(['method', decade])['number'].sum()

method                         decade
Astrometry                     2010s       2
Eclipse Timing Variations      2000s       5
                               2010s      10
Imaging                        2000s      29
                               2010s      21
Microlensing                   2000s      12
                               2010s      15
Orbital Brightness Modulation  2010s       5
Pulsar Timing                  1990s       9
                               2000s       1
                               2010s       1
Pulsation Timing Variations    2000s       1
Radial Velocity                1980s       1
                               1990s      52
                               2000s     475
                               2010s     424
Transit                        2000s      64
                               2010s     712
Transit Timing Variations      2010s       9
Name: number, dtype: int64

In [211]:
planets.groupby(['method', decade])['number'].sum().unstack()

decade,1980s,1990s,2000s,2010s
method,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Astrometry,,,,2.0
Eclipse Timing Variations,,,5.0,10.0
Imaging,,,29.0,21.0
Microlensing,,,12.0,15.0
Orbital Brightness Modulation,,,,5.0
Pulsar Timing,,9.0,1.0,1.0
Pulsation Timing Variations,,,1.0,
Radial Velocity,1.0,52.0,475.0,424.0
Transit,,,64.0,712.0
Transit Timing Variations,,,,9.0


In [212]:
planets.groupby(['method', decade])['number'].sum().unstack().fillna(0)

decade,1980s,1990s,2000s,2010s
method,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Astrometry,0.0,0.0,0.0,2.0
Eclipse Timing Variations,0.0,0.0,5.0,10.0
Imaging,0.0,0.0,29.0,21.0
Microlensing,0.0,0.0,12.0,15.0
Orbital Brightness Modulation,0.0,0.0,0.0,5.0
Pulsar Timing,0.0,9.0,1.0,1.0
Pulsation Timing Variations,0.0,0.0,1.0,0.0
Radial Velocity,1.0,52.0,475.0,424.0
Transit,0.0,0.0,64.0,712.0
Transit Timing Variations,0.0,0.0,0.0,9.0


In [186]:
my_df = pd.DataFrame({'key':['A','B','C','A','B','C'], 'data1':range(6), 'data2':np.random.randint(0,10,6)})

In [187]:
my_df

Unnamed: 0,key,data1,data2
0,A,0,5
1,B,1,3
2,C,2,6
3,A,3,7
4,B,4,4
5,C,5,2


#### Aggregation

In [188]:
my_df.groupby('key').aggregate(['min', 'median', 'max'])  # string names select the built-in aggregations

Unnamed: 0_level_0,data1,data1,data1,data2,data2,data2
Unnamed: 0_level_1,min,median,max,min,median,max
key,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2
A,0,1.5,3,5,6.0,7
B,1,2.5,4,3,3.5,4
C,2,3.5,5,2,4.0,6


In [189]:
my_df.groupby('key').aggregate({'data1':'min', 'data2':'max'})

Unnamed: 0_level_0,data1,data2
key,Unnamed: 1_level_1,Unnamed: 2_level_1
A,0,7
B,1,4
C,2,6


#### Filtering
* Filtering drops data based on group properties.

In [193]:
def filter_func(g):
    return g['data2'].std() > 2

In [194]:
my_df.groupby('key').std()

Unnamed: 0_level_0,data1,data2
key,Unnamed: 1_level_1,Unnamed: 2_level_1
A,2.12132,1.414214
B,2.12132,0.707107
C,2.12132,2.828427


In [195]:
my_df.groupby('key').filter(filter_func)

Unnamed: 0,key,data1,data2
2,C,2,6
5,C,5,2


* The function passed to `filter()` returns a boolean specifying whether the group passes the filter; the rows of groups that fail are dropped.
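
An equivalent way to think about filtering is as a row mask built from a group-level statistic. The sketch below rebuilds the `data2` column shown above in a fresh frame and compares `filter` with a `transform`-based mask (`kept`, `mask` and `same` are illustrative names):

```python
import pandas as pd

# Same key/data2 values as my_df above, rebuilt so the block is self-contained.
df = pd.DataFrame({"key": list("ABCABC"), "data2": [5, 3, 6, 7, 4, 2]})

# filter: the function sees one group and answers True/False for the whole group.
kept = df.groupby("key").filter(lambda g: g["data2"].std() > 2)

# Same selection as a boolean mask: broadcast each group's std to its rows.
mask = df.groupby("key")["data2"].transform("std") > 2
same = df[mask]

print(kept.index.tolist())  # [2, 5]
print(same.index.tolist())  # [2, 5]
```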

#### Transformation
* A transformation returns a transformed version of the full data, with the same shape as the input.
* For example, center the data by subtracting the group-wise mean.

In [196]:
my_df.groupby('key').transform(lambda x: x - x.mean())

Unnamed: 0,data1,data2
0,-1.5,-1.0
1,-1.5,-0.5
2,-1.5,2.0
3,1.5,1.0
4,1.5,0.5
5,1.5,-2.0
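
The difference between aggregating and transforming is mainly one of shape. A minimal sketch on made-up data (`reduced`, `broadcast` and `centered` are illustrative names):

```python
import pandas as pd

# Hypothetical data.
df = pd.DataFrame({"key": list("ABAB"), "x": [1.0, 10.0, 3.0, 20.0]})

# Aggregation: one value per group.
reduced = df.groupby("key")["x"].agg("mean")

# Transformation: one value per input row, aligned with the original index.
broadcast = df.groupby("key")["x"].transform("mean")

# Centering is then a plain element-wise subtraction.
centered = df["x"] - broadcast

print(reduced.tolist())    # [2.0, 15.0]
print(broadcast.tolist())  # [2.0, 15.0, 2.0, 15.0]
print(centered.tolist())   # [-1.0, -5.0, 1.0, 5.0]
```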


#### `apply()` method
* `apply()` lets us apply an arbitrary function to each group. The function should take a DataFrame and return a pandas object or a scalar.

In [197]:
def norm_by_data2(x):
    x['data1'] /= x['data2'].sum()
    return x

In [198]:
my_df.groupby('key').apply(norm_by_data2)

Unnamed: 0,key,data1,data2
0,A,0.0,5
1,B,0.142857,3
2,C,0.25,6
3,A,0.25,7
4,B,0.571429,4
5,C,0.625,2


In [199]:
my_df

Unnamed: 0,key,data1,data2
0,A,0,5
1,B,1,3
2,C,2,6
3,A,3,7
4,B,4,4
5,C,5,2


In [200]:
L = [0,1,0,1,2,0]

In [201]:
my_df.groupby(L).sum()

Unnamed: 0,data1,data2
0,7,13
1,4,10
2,4,4


In [202]:
my_df_temp = my_df.set_index('key')

In [203]:
my_df_temp

Unnamed: 0_level_0,data1,data2
key,Unnamed: 1_level_1,Unnamed: 2_level_1
A,0,5
B,1,3
C,2,6
A,3,7
B,4,4
C,5,2


In [204]:
mapping = {'A': 'vowel', 'B': 'consonant', 'C':'consonant'}

In [205]:
my_df_temp.groupby(mapping).sum()

Unnamed: 0,data1,data2
consonant,12,15
vowel,3,12


In [206]:
my_df_temp.groupby(str.lower).mean()

Unnamed: 0,data1,data2
a,1.5,6.0
b,2.5,3.5
c,3.5,4.0


In [207]:
my_df_temp.groupby([str.lower, mapping]).mean()

Unnamed: 0,Unnamed: 1,data1,data2
a,vowel,1.5,6.0
b,consonant,2.5,3.5
c,consonant,3.5,4.0


### Pivot table
* A pivot table is a data summarization tool. It aggregates a table of data by one or more keys, arranging the data in a rectangle with some of the group keys along the rows and some along the columns.

In [213]:
titanic = sns.load_dataset('titanic')

In [214]:
titanic.head()

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,male,22.0,1,0,7.25,S,Third,man,True,,Southampton,no,False
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
2,1,3,female,26.0,0,0,7.925,S,Third,woman,False,,Southampton,yes,True
3,1,1,female,35.0,1,0,53.1,S,First,woman,False,C,Southampton,yes,False
4,0,3,male,35.0,0,0,8.05,S,Third,man,True,,Southampton,no,True


In [215]:
titanic.groupby('sex')['survived'].mean()

sex
female    0.742038
male      0.188908
Name: survived, dtype: float64
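
The groupby above uses a single key. Adding a second key and moving it to the columns gives the rectangular layout of a pivot table. The sketch below uses a small made-up table (not the real Titanic data) and shows that `pivot_table` matches a two-key `groupby` followed by `unstack`:

```python
import pandas as pd

# Hypothetical passenger records; values are invented for illustration.
passengers = pd.DataFrame({
    "sex": ["female", "female", "male", "male", "male", "female"],
    "class": ["First", "Third", "First", "Third", "Third", "First"],
    "survived": [1, 0, 1, 0, 0, 1],
})

# Pivot table: 'sex' along the rows, 'class' along the columns, mean in the cells.
table = passengers.pivot_table(
    values="survived", index="sex", columns="class", aggfunc="mean"
)

# The same rectangle built from groupby + unstack.
same = passengers.groupby(["sex", "class"])["survived"].mean().unstack()

print(table.loc["female", "First"])  # 1.0
print(table.loc["male", "Third"])    # 0.0
```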