# Group By: split-apply-combine
[Pandas tutorial split-apply-combine](http://pandas.pydata.org/pandas-docs/stable/groupby.html)

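'Group by' refers to a process involving one or more of the following steps: splitting the data into groups based on some criteria, applying a function to each group independently, and combining the results into a data structure.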
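To make the three steps concrete, here is a minimal sketch that performs them by hand and then compares the result with the one-line groupby call. The `sales` DataFrame and its column names are hypothetical, made up for illustration only.

```python
import pandas as pd

# Hypothetical data for illustration (not part of the tutorial's dataset)
sales = pd.DataFrame({'region': ['east', 'west', 'east', 'west'],
                      'amount': [10, 20, 30, 40]})

# Split: one sub-frame per distinct key
pieces = {key: sub for key, sub in sales.groupby('region')}

# Apply: compute a result for each piece independently
totals = {key: sub['amount'].sum() for key, sub in pieces.items()}

# Combine: assemble the per-group results into a single object
manual = pd.Series(totals, name='amount')

# The usual one-liner performs all three steps at once
auto = sales.groupby('region')['amount'].sum()

print(manual)
print(auto)
```

Both objects should hold the same per-region totals; the manual version only exists to show where each step happens.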

### Splitting an object into groups

In [1]:
# import declarations
import pandas as pd
import numpy as np

In [2]:
# make a DataFrame
df = pd.DataFrame({'A':['foo','bar','foo','bar',
                      'foo','bar','foo','foo','foo'],
                 'B':['one','one','two','three',
                     'two','two','one','three','three'],
                 'C':[1,2,3,4,5,6,7,8,9],
                 'D':[9,10,11,12,13,14,15,16,17]})
df

,A,B,C,D
0,foo,one,1,9
1,bar,one,2,10
2,foo,two,3,11
3,bar,three,4,12
4,foo,two,5,13
5,bar,two,6,14
6,foo,one,7,15
7,foo,three,8,16
8,foo,three,9,17


##### Split a Series

In [6]:
# create a list
lst = [1,2,3,1,2,3]

# create a Series using the list
# note: the index values are not all unique
s = pd.Series([1,2,3,10,20,30], index=lst)
s

1     1
2     2
3     3
1    10
2    20
3    30
dtype: int64

In [11]:
# create groupby object
# index values are used as the group key
grouped = s.groupby(level=0)
grouped

<pandas.core.groupby.SeriesGroupBy object at 0x111ee4a20>

In [12]:
# .first() returns the first value in each group
grouped.first()

1    1
2    2
3    3
dtype: int64

In [13]:
# .last() returns the last value in each group
grouped.last()

1    10
2    20
3    30
dtype: int64

In [14]:
# .sum() computes the sum of the values in each group
grouped.sum()

1    11
2    22
3    33
dtype: int64
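As a compact alternative to calling `.first()`, `.last()` and `.sum()` one at a time, the sketch below passes all three names to `.agg()`. It rebuilds the same Series as above so that it runs on its own; the variable name `summary` is just a placeholder.

```python
import pandas as pd

# Same Series as above: the index labels 1, 2, 3 each appear twice
s = pd.Series([1, 2, 3, 10, 20, 30], index=[1, 2, 3, 1, 2, 3])

# Group on the index (level 0) and compute several aggregates in one call;
# the result is a DataFrame with one column per aggregate
summary = s.groupby(level=0).agg(['first', 'last', 'sum'])
print(summary)
```

Each row of `summary` corresponds to one index label, and each column to one of the aggregates shown in the cells above.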

##### Grouping and sorting a DataFrame using .groupby

In [18]:
# show DataFrame
df

,A,B,C,D
0,foo,one,1,9
1,bar,one,2,10
2,foo,two,3,11
3,bar,three,4,12
4,foo,two,5,13
5,bar,two,6,14
6,foo,one,7,15
7,foo,three,8,16
8,foo,three,9,17


In [17]:
# group by specified column and sum
grouped = df.groupby('A')
grouped.sum()

A,C,D
bar,12,36
foo,33,81


In [19]:
# group by two or more columns
# note: the result is indexed by a two-level MultiIndex (hierarchically indexed data)
grouped = df.groupby(['A','B'])
# note: each 'foo' group contains two rows, which are summed; each 'bar' group contains a single row
grouped.sum()

A,B,C,D
bar,one,2,10
bar,three,4,12
bar,two,6,14
foo,one,8,24
foo,three,17,33
foo,two,8,24
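Because the result of a two-column groupby is indexed by a MultiIndex, individual groups can be selected with a tuple, and the index can be flattened back into ordinary columns. The following is a small sketch of both, rebuilding `df` so that it is self-contained; `summed`, `foo_one` and `flat` are illustrative names.

```python
import pandas as pd

df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar',
                         'foo', 'bar', 'foo', 'foo', 'foo'],
                   'B': ['one', 'one', 'two', 'three',
                         'two', 'two', 'one', 'three', 'three'],
                   'C': [1, 2, 3, 4, 5, 6, 7, 8, 9],
                   'D': [9, 10, 11, 12, 13, 14, 15, 16, 17]})

summed = df.groupby(['A', 'B']).sum()

# The index has two levels, one per grouping column
print(summed.index.nlevels)

# Select a single group with a tuple of keys
foo_one = summed.loc[('foo', 'one')]
print(foo_one)

# Turn the index levels back into regular columns
flat = summed.reset_index()
print(flat)
```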


In [21]:
# create a new DataFrame
df2 = pd.DataFrame({'X':['B','B','A','A'],
                   'Y':[1,2,3,4]})
df2

,X,Y
0,B,1
1,B,2
2,A,3
3,A,4


In [22]:
# group and sum by specified column
# note, it's sorted by the groupby key
df2.groupby(['X']).sum()

X,Y
A,7
B,3


In [23]:
# a .groupby() creates a groupby object. It doesn't become
# a dataframe until something is applied to it like sum()

# This is just a groupby object
x = df2.groupby(['X'])
type(x)

pandas.core.groupby.DataFrameGroupBy

In [24]:
# calling .sum() on it returns a DataFrame
y = x.sum()
type(y)

pandas.core.frame.DataFrame

In [25]:
y

X,Y
A,7
B,3
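One practical consequence of the groupby object being separate from its results is that it can be created once and reused for several aggregations. A minimal sketch, assuming the same `df2` as above (the names `g`, `total` and `mean` are illustrative):

```python
import pandas as pd

df2 = pd.DataFrame({'X': ['B', 'B', 'A', 'A'],
                    'Y': [1, 2, 3, 4]})

# Create the groupby object once; it is not a DataFrame
g = df2.groupby('X')
print(isinstance(g, pd.DataFrame))

# Each aggregation called on it returns a new DataFrame
total = g.sum()
mean = g.mean()
print(total)
print(mean)
```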


In [26]:
# override the default sorting
df2.groupby(['X'], sort=False).sum()

X,Y
B,3
A,7


In [27]:
# create another dataframe
df3 = pd.DataFrame({'X':['A','B','A','B'], 'Y':[3,4,1,2]})
df3

,X,Y
0,A,3
1,B,4
2,A,1
3,B,2


In [28]:
# group by column X and retrieve group 'A' with .get_group()
# note that the original row order is preserved within each group
a = df3.groupby(['X']).get_group('A')
a

,X,Y
0,A,3
2,A,1


In [29]:
# same with getting group B
b = df3.groupby(['X']).get_group('B')
b

,X,Y
1,B,4
3,B,2
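Instead of calling `.get_group()` once per key, a groupby object can also be iterated, yielding each key together with its sub-frame. The sketch below collects the `Y` values per group from the same `df3`; the dict name `collected` is just a placeholder.

```python
import pandas as pd

df3 = pd.DataFrame({'X': ['A', 'B', 'A', 'B'], 'Y': [3, 4, 1, 2]})

# Iterating yields (key, sub-frame) pairs, one per group
collected = {}
for key, sub in df3.groupby('X'):
    # Rows keep their original relative order within each group
    collected[key] = sub['Y'].tolist()

print(collected)
```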


In [30]:
# show the df dataframe again
df

,A,B,C,D
0,foo,one,1,9
1,bar,one,2,10
2,foo,two,3,11
3,bar,three,4,12
4,foo,two,5,13
5,bar,two,6,14
6,foo,one,7,15
7,foo,three,8,16
8,foo,three,9,17


Below, .groups is an attribute of the groupby object. It is a dict whose keys are the unique group keys and whose values are the axis labels of the members of each group. In this example, it says that all 'bar' rows have labels 1, 3, 5 and all 'foo' rows have labels 0, 2, 4, 6, 7, 8.

In [31]:
# show the mapping from group keys to row labels with .groups
df.groupby('A').groups

{'bar': [1, 3, 5], 'foo': [0, 2, 4, 6, 7, 8]}

In [32]:
# groups by column A first, then B
grouped = df.groupby(['A','B'])
grouped.groups

{('bar', 'one'): [1],
 ('bar', 'three'): [3],
 ('bar', 'two'): [5],
 ('foo', 'one'): [0, 6],
 ('foo', 'three'): [7, 8],
 ('foo', 'two'): [2, 4]}
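The labels stored in `.groups` can be fed straight back into `.loc` to pull out the rows of one group. Here is a small sketch, rebuilding `df` so that it runs on its own; note that recent pandas versions may display the values as Index objects rather than plain lists, which is why the example converts them with `list()`.

```python
import pandas as pd

df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar',
                         'foo', 'bar', 'foo', 'foo', 'foo'],
                   'B': ['one', 'one', 'two', 'three',
                         'two', 'two', 'one', 'three', 'three'],
                   'C': [1, 2, 3, 4, 5, 6, 7, 8, 9],
                   'D': [9, 10, 11, 12, 13, 14, 15, 16, 17]})

groups = df.groupby(['A', 'B']).groups

# Row labels belonging to the ('foo', 'three') group
labels = groups[('foo', 'three')]
print(list(labels))

# Use those labels to select the matching rows from the original frame
rows = df.loc[labels]
print(rows)
```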

In [33]:
# use Python's len() to get the number of groups
len(grouped)

6
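`len()` counts groups, not rows. To see how many rows fall into each group, `.size()` can be used alongside it, as in this short sketch (again rebuilding `df` so that it is self-contained; `sizes` is an illustrative name).

```python
import pandas as pd

df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar',
                         'foo', 'bar', 'foo', 'foo', 'foo'],
                   'B': ['one', 'one', 'two', 'three',
                         'two', 'two', 'one', 'three', 'three'],
                   'C': [1, 2, 3, 4, 5, 6, 7, 8, 9],
                   'D': [9, 10, 11, 12, 13, 14, 15, 16, 17]})

grouped = df.groupby(['A', 'B'])

# Number of groups, two equivalent ways
print(len(grouped), grouped.ngroups)

# Number of rows in each group
sizes = grouped.size()
print(sizes)
```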

In [34]:
# uncomment the line below, then press TAB after the period to list all available methods
# grouped.
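Outside an interactive environment with tab completion, roughly the same information can be obtained with Python's built-in `dir()`. A minimal sketch, using a small hypothetical frame rather than the tutorial's `df`:

```python
import pandas as pd

# Small hypothetical frame, just to have a groupby object to inspect
frame = pd.DataFrame({'A': ['foo', 'bar'], 'C': [1, 2]})
grouped = frame.groupby('A')

# Public attribute and method names (those not starting with an underscore)
public = [name for name in dir(grouped) if not name.startswith('_')]
print(public)
```

The list also includes the frame's column names, since columns are reachable as attributes of the groupby object.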