# Group By: split-apply-combine
[Pandas tutorial split-apply-combine](http://pandas.pydata.org/pandas-docs/stable/groupby.html)

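'Group by' refers to a process involving one or more of these steps: splitting the data into groups based on some criteria, applying a function to each group independently, and combining the results into a data structure.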
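To make the three steps concrete, here is a minimal sketch that performs split, apply and combine by hand and then lets `groupby` do the same in one call. The frame `demo` and its columns `key` and `val` are made up for illustration and are not part of the notebook's data.

```python
import pandas as pd

# hypothetical toy data, used only for this illustration
demo = pd.DataFrame({'key': ['a', 'b', 'a', 'b'],
                     'val': [1, 2, 3, 4]})

# split: one sub-frame per distinct key
pieces = {k: demo[demo['key'] == k] for k in demo['key'].unique()}

# apply: compute a sum on each piece independently
sums = {k: piece['val'].sum() for k, piece in pieces.items()}

# combine: collect the per-group results into one Series
manual = pd.Series(sums).sort_index()

# groupby performs the same three steps in one call
auto = demo.groupby('key')['val'].sum()
```

Both `manual` and `auto` hold one summed value per key.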

### Splitting an object into groups

In [1]:
import pandas as pd
import numpy as np

In [24]:
df = pd.DataFrame({'A':['foo','bar','foo','bar',
                      'foo','bar','foo','foo','foo'],
                 'B':['one','one','two','three',
                     'two','two','one','three','three'],
                 'C':[1,2,3,4,5,6,7,8,9],
                 'D':[9,10,11,12,13,14,15,16,17]})
df

Unnamed: 0,A,B,C,D
0,foo,one,1,9
1,bar,one,2,10
2,foo,two,3,11
3,bar,three,4,12
4,foo,two,5,13
5,bar,two,6,14
6,foo,one,7,15
7,foo,three,8,16
8,foo,three,9,17


##### Split a Series

In [32]:
# create a list
lst = [1,2,3,1,2,3]

# create a Series using the list
s = pd.Series([1,2,3,10,20,30], lst)
s
# note, the index values are not unique: 1, 2 and 3 each appear twice

1     1
2     2
3     3
1    10
2    20
3    30
dtype: int64

In [33]:
# the index values are used as the group keys in the
# groupby operation below, so all values that share an
# index value end up in one group
grouped = s.groupby(level=0)  # level=0: group by the first (here, the only) index level

# .first() returns the first value in each group
grouped.first()

1    1
2    2
3    3
dtype: int64

In [34]:
# .last() returns the last value in each group
grouped.last()

1    10
2    20
3    30
dtype: int64

In [35]:
# computes the sum of the group values
grouped.sum()

1    11
2    22
3    33
dtype: int64
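As a quick check of what `level=0` does, the sketch below compares it with passing the index values explicitly as the group key. It rebuilds the same Series so that it runs on its own; the names `by_level` and `by_index` are introduced here for illustration.

```python
import pandas as pd

s = pd.Series([1, 2, 3, 10, 20, 30], index=[1, 2, 3, 1, 2, 3])

# level=0 means "group by the first (here, only) level of the index"
by_level = s.groupby(level=0).sum()

# passing the index values explicitly as the key gives the same grouping
by_index = s.groupby(s.index).sum()
```

The two results should hold the same sums under the same labels, which is why `level=0` is a convenient shorthand when the group key already lives in the index.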

##### Grouping a dataframe and sorting the group keys

In [71]:
# group by one column
grouped = df.groupby('A')
grouped.sum()

# can also group by two columns
grouped = df.groupby(['A','B'])
# note, only the 'foo' groups hold two rows each, so only their
# values change when summed; each 'bar' group has a single row
grouped.sum()

# Note, the result is indexed by a two-level MultiIndex (hierarchically-indexed data)

A,B,C,D
bar,one,2,10
bar,three,4,12
bar,two,6,14
foo,one,8,24
foo,three,17,33
foo,two,8,24
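A small sketch of working with the two-level result: it inspects the index names, selects one group by its key tuple, and flattens the keys back into columns. The names `summed` and `flat` are introduced here for illustration.

```python
import pandas as pd

df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar',
                         'foo', 'bar', 'foo', 'foo', 'foo'],
                   'B': ['one', 'one', 'two', 'three',
                         'two', 'two', 'one', 'three', 'three'],
                   'C': [1, 2, 3, 4, 5, 6, 7, 8, 9],
                   'D': [9, 10, 11, 12, 13, 14, 15, 16, 17]})

summed = df.groupby(['A', 'B']).sum()

# the index levels are named after the grouping columns
level_names = list(summed.index.names)

# select a single group's value with a tuple of key values
foo_three_c = summed.loc[('foo', 'three'), 'C']

# reset_index() turns the group keys back into ordinary columns
flat = summed.reset_index()
```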


In [49]:
# group keys are sorted by default
df2 = pd.DataFrame({'X':['B','B','A','A'],
                   'Y':[1,2,3,4]})
df2

Unnamed: 0,X,Y
0,B,1
1,B,2
2,A,3
3,A,4


In [52]:
# group by column X; the group keys come out sorted (A before B)
df2.groupby(['X']).sum()

X,Y
A,7
B,3


In [53]:
# a .groupby() creates a groupby object. It doesn't become
# a dataframe until something is applied to it like sum()

# This is just a groupby object
x = df2.groupby(['X'])
type(x)

pandas.core.groupby.DataFrameGroupBy

In [54]:
# calling .sum() on the groupby object returns a dataframe
y = x.sum()
type(y)

pandas.core.frame.DataFrame

In [55]:
y

X,Y
A,7
B,3


In [56]:
# override the default sorting
df2.groupby(['X'], sort=False).sum()

X,Y
B,3
A,7


In [57]:
# create another dataframe
df3 = pd.DataFrame({'X':['A','B','A','B'], 'Y':[3,4,1,2]})
df3

Unnamed: 0,X,Y
0,A,3
1,B,4
2,A,1
3,B,2


##### Sort and isolate a group

In [59]:
# group by column X, then get group A
# note that the original row order is preserved within each group
# (a plain string key, rather than a one-item list, lets get_group take a plain string)
a = df3.groupby('X').get_group('A')
a

Unnamed: 0,X,Y
0,A,3
2,A,1


In [60]:
# same for group B
b = df3.groupby('X').get_group('B')
b

Unnamed: 0,X,Y
1,B,4
3,B,2
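Besides pulling out one group with `get_group`, a groupby object can be iterated over. The sketch below, which assumes a single string key so that each group name is a plain string, collects the `Y` values of every group into a dict; the name `pieces` is introduced here for illustration.

```python
import pandas as pd

df3 = pd.DataFrame({'X': ['A', 'B', 'A', 'B'], 'Y': [3, 4, 1, 2]})

# iterating over a groupby object yields (group key, sub-frame) pairs
pieces = {}
for name, group in df3.groupby('X'):
    # each sub-frame keeps its rows in their original order
    pieces[name] = group['Y'].tolist()
```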


##### GroupBy object attributes

In [62]:
# show the df dataframe again
df

Unnamed: 0,A,B,C,D
0,foo,one,1,9
1,bar,one,2,10
2,foo,two,3,11
3,bar,three,4,12
4,foo,two,5,13
5,bar,two,6,14
6,foo,one,7,15
7,foo,three,8,16
8,foo,three,9,17


In [63]:
# .groups is an attribute of the groupby object.
# It is a dict whose keys are the unique group keys and
# whose values are the axis labels (here, row labels) of
# each group's members.
# Basically it's saying "all 'bar' rows are 1, 3, 5
# and all 'foo' rows are 0, 2, 4, 6, 7, 8"

df.groupby('A').groups

{'bar': [1, 3, 5], 'foo': [0, 2, 4, 6, 7, 8]}

In [64]:
# groups by column A first, then B
grouped = df.groupby(['A','B'])
grouped.groups

{('bar', 'one'): [1],
 ('bar', 'three'): [3],
 ('bar', 'two'): [5],
 ('foo', 'one'): [0, 6],
 ('foo', 'three'): [7, 8],
 ('foo', 'two'): [2, 4]}

In [65]:
# use Python's len() to get the number of groups
len(grouped)

6
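Two related helpers may be useful alongside `.groups` and `len()`: `.ngroups` for the number of groups and `.size()` for the number of rows in each group. The sketch rebuilds `df` so that it runs on its own; the names `n_groups` and `sizes` are introduced here for illustration.

```python
import pandas as pd

df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar',
                         'foo', 'bar', 'foo', 'foo', 'foo'],
                   'B': ['one', 'one', 'two', 'three',
                         'two', 'two', 'one', 'three', 'three'],
                   'C': [1, 2, 3, 4, 5, 6, 7, 8, 9],
                   'D': [9, 10, 11, 12, 13, 14, 15, 16, 17]})

grouped = df.groupby(['A', 'B'])

# number of groups, the same value len(grouped) reports
n_groups = grouped.ngroups

# rows per group, returned as a Series indexed by the group keys
sizes = grouped.size()
```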

In [70]:
# uncomment the line below and press TAB after the period to list all available attributes and methods
# grouped.

##### GroupBy with a function as the key

In [None]:
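# a function passed as the group key is called on each axis label;
# with axis=1 the labels are the column names 'A', 'B', 'C', 'D'
# (note: recent pandas releases deprecate axis=1 in groupby)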

# create a function that will allow me to group across columns
def get_letter_type(letter):
    if letter.lower() in 'aeiou':
        return 'vowel'
    else:
        return 'consonant'

df.groupby(get_letter_type, axis=1).groups

In [None]:
gb = df.groupby(['A'])
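# gb.<TAB>
# <TAB> stands for pressing the Tab key: in IPython/Jupyter, typing
# "gb." and then pressing Tab lists the available attributes and methods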

In [37]:
# split the dataframe along its columns (axis=1), not its index:
# get_letter_type is called on each column label ('A', 'B', 'C', 'D')
def get_letter_type(letter):
    if letter.lower() in 'aeiou':
        return 'vowel'
    else:
        return 'consonant'

grouped = df.groupby(get_letter_type, axis=1)
grouped

<pandas.core.groupby.DataFrameGroupBy object at 0x11253a390>
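Since newer pandas releases deprecate `axis=1` in `groupby`, one way to get the same column grouping is to transpose first, so the column labels become the index that the function key is applied to. This is a sketch on a cut-down, two-row version of `df`; the names `col_groups` and `labels` are introduced here for illustration.

```python
import pandas as pd

def get_letter_type(letter):
    # classify a single-letter label as vowel or consonant
    if letter.lower() in 'aeiou':
        return 'vowel'
    else:
        return 'consonant'

# a two-row version of df, enough to show the column grouping
df = pd.DataFrame({'A': ['foo', 'bar'],
                   'B': ['one', 'two'],
                   'C': [1, 2],
                   'D': [9, 10]})

# after transposing, the column labels 'A'..'D' form the index,
# so the function key is called on each of them
col_groups = df.T.groupby(get_letter_type).groups

# convert the index objects to plain lists for easier reading
labels = {kind: list(cols) for kind, cols in col_groups.items()}
```

Here `labels` maps each letter type to the column names that fall under it: only `'A'` is a vowel, so the remaining columns land in the consonant group.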