Reading notes on [*Python for Data Analysis*](https://book.douban.com/subject/25779298/).
 
[Chapter 9](/2017/07/19/python_data_analysis9.html), Section 1: groupby

All the data used here can be downloaded from [the author's GitHub](https://github.com/wesm/pydata-book).


In [3]:
%pylab inline
import numpy as np  # explicit import, so np does not depend on %pylab
import pandas as pd
from pandas import Series, DataFrame

Populating the interactive namespace from numpy and matplotlib


A typical group operation follows the split-apply-combine pattern, as shown in the figure below:

![Group aggregation example](group_sample.png)
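
To make the three steps concrete, here is a minimal sketch that performs split, apply, and combine by hand and compares the result with `groupby`. The small frame and the names `pieces`, `sums`, and `manual` are made up for illustration; this is not how pandas implements grouping internally.

```python
import pandas as pd

df = pd.DataFrame({
    "key": ["a", "b", "a", "b", "a"],
    "value": [1.0, 2.0, 3.0, 4.0, 5.0],
})

# split: one sub-frame per distinct key
pieces = {k: df[df["key"] == k] for k in df["key"].unique()}

# apply: reduce each piece to a single number
sums = {k: piece["value"].sum() for k, piece in pieces.items()}

# combine: assemble the per-group results into one Series
manual = pd.Series(sums, name="value").sort_index()
manual.index.name = "key"

# the same computation in a single groupby call
builtin = df.groupby("key")["value"].sum()
```

Both `manual` and `builtin` hold one sum per key; `groupby` simply does the bookkeeping for you.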

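Grouping in pandas is flexible:

- You can group along either axis. For example, a DataFrame can be grouped on its rows (axis=0) or its columns (axis=1).

- Group keys can take many forms: a column name; an array, list, dict, or Series of values that correspond to the axis being grouped; or even a function.

A simple example: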
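
As a quick sanity check of the second point, the sketch below groups the same column with five different kinds of key and gets the same sums each time. The toy frame and the variable names are assumptions made for this example only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "key1": ["a", "a", "b", "b", "a"],
    "data1": [1.0, 2.0, 3.0, 4.0, 5.0],
})
labels = np.array(["a", "a", "b", "b", "a"])

by_name = df.groupby("key1")["data1"].sum()            # column name
by_series = df["data1"].groupby(df["key1"]).sum()      # Series
by_array = df["data1"].groupby(labels).sum()           # array of the same length
by_dict = df["data1"].groupby(
    {0: "a", 1: "a", 2: "b", 3: "b", 4: "a"}           # dict: index label -> group
).sum()
by_func = df["data1"].groupby(
    lambda i: "b" if i in (2, 3) else "a"              # function of the index label
).sum()

results = [by_name, by_series, by_array, by_dict, by_func]
```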

In [4]:
df = DataFrame({
    'key1': ['a','a','b','b','a'],
    'key2': ['one','two','one','two','one'],
    'data1': np.random.randn(5),
    'data2': np.random.randn(5)
})

df

      data1     data2 key1 key2
0 -1.688851 -1.738961    a  one
1  1.677840  0.697004    a  two
2 -0.111286 -0.630340    b  one
3 -0.344769 -1.289249    b  two
4 -0.282896 -1.406068    a  one


In [6]:
# Group by key1 and compute the mean of the data1 column
# The result is a Series
grouped = df['data1'].groupby(df['key1'])
grouped.mean()

key1
a   -0.097969
b   -0.228027
Name: data1, dtype: float64

In [7]:
# Group on two keys (key1, key2)
# The result is a Series with a hierarchical index
means = df['data1'].groupby([df['key1'], df['key2']]).mean()
means

key1  key2
a     one    -0.985874
      two     1.677840
b     one    -0.111286
      two    -0.344769
Name: data1, dtype: float64

In [13]:
# Reshape into a DataFrame
means.unstack()

key2       one       two
key1
a    -0.985874  1.677840
b    -0.111286 -0.344769


In [15]:
# Group keys do not have to be Series
# For example, they can be arrays (of the right length)
states = np.array(['Ohio', 'California', 'California', 'Ohio', 'Ohio'])
years = np.array([2005, 2005, 2006, 2005, 2006])
df['data1'].groupby([states, years]).mean()


California  2005    1.677840
            2006   -0.111286
Ohio        2005   -1.016810
            2006   -0.282896
Name: data1, dtype: float64

In [16]:
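# A column name (a string, a number, or another Python object) can be used as the group key.
# key2 is not numeric, so numeric_only=True excludes it from the mean
# (pandas 2.0 and later raise a TypeError without it).
df.groupby('key1').mean(numeric_only=True)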

         data1     data2
key1
a    -0.097969 -0.816009
b    -0.228027 -0.959794


In [17]:
df.groupby(['key1', 'key2']).mean()

              data1     data2
key1 key2
a    one  -0.985874 -1.572515
     two   1.677840  0.697004
b    one  -0.111286 -0.630340
     two  -0.344769 -1.289249


In [18]:
df.groupby(['key1', 'key2']).size()

key1  key2
a     one     2
      two     1
b     one     1
      two     1
dtype: int64

## Iterating over groups

In [21]:
# groupby returns a GroupBy object,
# which can be iterated over:
for name, group in df.groupby('key1'):
    print('=================================')
    print(name)
    print('----')
    print(group)

=================================
a
----
      data1     data2 key1 key2
0 -1.688851 -1.738961    a  one
1  1.677840  0.697004    a  two
4 -0.282896 -1.406068    a  one
=================================
b
----
      data1     data2 key1 key2
2 -0.111286 -0.630340    b  one
3 -0.344769 -1.289249    b  two


In [23]:
# With multiple keys, the first element of each tuple is itself a tuple of key values
for (k1, k2), group in df.groupby(['key1', 'key2']):
    print('=================================')
    print((k1, k2))
    print('----')
    print(group)

=================================
('a', 'one')
----
      data1     data2 key1 key2
0 -1.688851 -1.738961    a  one
4 -0.282896 -1.406068    a  one
=================================
('a', 'two')
----
     data1     data2 key1 key2
1  1.67784  0.697004    a  two
=================================
('b', 'one')
----
      data1    data2 key1 key2
2 -0.111286 -0.63034    b  one
=================================
('b', 'two')
----
      data1     data2 key1 key2
3 -0.344769 -1.289249    b  two


In [24]:
# These pieces can be put to use, for example by collecting them into a dict
pieces = dict(list(df.groupby('key1')))
pieces

{'a':       data1     data2 key1 key2
 0 -1.688851 -1.738961    a  one
 1  1.677840  0.697004    a  two
 4 -0.282896 -1.406068    a  one, 'b':       data1     data2 key1 key2
 2 -0.111286 -0.630340    b  one
 3 -0.344769 -1.289249    b  two}

In [25]:
pieces['b']

      data1     data2 key1 key2
2 -0.111286 -0.630340    b  one
3 -0.344769 -1.289249    b  two
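
If only one group is needed, building the whole dict is optional. A small sketch, assuming a made-up frame, that pulls a single group with `GroupBy.get_group` and checks it against the dict approach:

```python
import pandas as pd

df = pd.DataFrame({
    "key1": ["a", "a", "b", "b", "a"],
    "data1": [1.0, 2.0, 3.0, 4.0, 5.0],
})

grouped = df.groupby("key1")

# every group, materialized as a dict of DataFrames
pieces = dict(list(grouped))

# just the rows whose key1 is 'b'
b_only = grouped.get_group("b")
```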


In [27]:
# groupby groups on axis=0 by default, but it can group on any axis
# For example, group the columns by dtype

df.dtypes

data1    float64
data2    float64
key1      object
key2      object
dtype: object

In [30]:
grouped = df.groupby(df.dtypes, axis=1)
dict(list(grouped))

{dtype('float64'):       data1     data2
 0 -1.688851 -1.738961
 1  1.677840  0.697004
 2 -0.111286 -0.630340
 3 -0.344769 -1.289249
 4 -0.282896 -1.406068, dtype('O'):   key1 key2
 0    a  one
 1    a  two
 2    b  one
 3    b  two
 4    a  one}
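
Recent pandas releases (2.1 and later) deprecate `axis=1` in `groupby`. One possible way to get the same split without it is to group the column names by dtype and then select columns. This is a sketch, not the book's approach; the frame and the names `dtype_names`, `col_groups`, and `pieces` are illustrative, and the dtypes are converted to strings so the keys are plain text.

```python
import pandas as pd

df = pd.DataFrame({
    "key1": ["a", "a", "b"],
    "data1": [1.0, 2.0, 3.0],
    "data2": [4.0, 5.0, 6.0],
})

# dtype of each column as a string, indexed by column name
dtype_names = df.dtypes.astype(str)

# group the column names themselves by dtype
col_groups = df.columns.to_series().groupby(dtype_names)

# one sub-frame per dtype
pieces = {name: df[cols.tolist()] for name, cols in col_groups}
```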

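## Selecting a column or subset of columns

Indexing the GroupBy object produced by groupby with a column name or a list of column names selects just those columns for aggregation.

On large datasets you may only need to aggregate a few columns, which is where this is useful. For example, to compute the mean of only the data2 column: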

In [37]:
# Equivalent to df['data2'].groupby(df['key1'])
df.groupby('key1')['data2'].mean()

key1
a   -0.816009
b   -0.959794
Name: data2, dtype: float64

In [38]:
# Returns a DataFrame directly
# Equivalent to df[['data2']].groupby(df['key1'])
df.groupby('key1')[['data2']].mean()

         data2
key1
a    -0.816009
b    -0.959794


In [40]:
# Multiple keys
df.groupby(['key1', 'key2'])[['data2']].mean()

              data2
key1 key2
a    one  -1.572515
     two   0.697004
b    one  -0.630340
     two  -1.289249
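
The equivalences mentioned in the comments above can be checked directly. A minimal sketch on a made-up frame with round numbers, so the means are easy to verify by hand:

```python
import pandas as pd

df = pd.DataFrame({
    "key1": ["a", "a", "b", "b", "a"],
    "data1": [0.0, 0.0, 0.0, 0.0, 0.0],
    "data2": [1.0, 2.0, 3.0, 4.0, 6.0],
})

# string index on the GroupBy -> Series
short = df.groupby("key1")["data2"].mean()

# the longer spelling it stands for
long = df["data2"].groupby(df["key1"]).mean()

# list index on the GroupBy -> DataFrame with one column
as_frame = df.groupby("key1")[["data2"]].mean()
```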


## Grouping with a dict or Series

In [41]:
people = DataFrame(np.random.randn(5, 5),
                   columns=['a', 'b', 'c', 'd', 'e'],
                   index=['Joe', 'Steve', 'Wes', 'Jim', 'Travis'])
# .ix has been removed from pandas; select row 2 ('Wes') and columns b, c by position
people.iloc[2:3, [1, 2]] = np.nan  # Add a few NA values
people

               a         b         c         d         e
Joe    -1.001767 -0.944141 -0.406020  0.914168 -0.653524
Steve  -2.624573 -1.142006  1.275258  0.638460 -0.049654
Wes     1.053102       NaN       NaN  1.136185 -0.705047
Jim     0.258541  1.262957 -1.130326 -1.853057  0.916846
Travis -0.582800  1.582861  0.187689 -0.495851 -1.422027


In [43]:
# Use a dict that maps columns to groups, then sum within each group
# (the unused key 'f' is fine)
mapping = {'a': 'red', 'b': 'red', 'c': 'blue',
           'd': 'orange', 'e': 'red', 'f' : 'blue'}
by_column = people.groupby(mapping, axis=1)
by_column.sum()

            blue    orange       red
Joe    -0.406020  0.914168 -2.599433
Steve   1.275258  0.638460 -3.816232
Wes          NaN  1.136185  0.348056
Jim    -1.130326 -1.853057  2.438345
Travis  0.187689 -0.495851 -0.421966


In [44]:
# With a Series as the group key, pandas inspects it to make sure its index is aligned with the axis being grouped
map_series = Series(mapping)
map_series

a       red
b       red
c      blue
d    orange
e       red
f      blue
dtype: object

In [45]:
people.groupby(map_series, axis=1).count()

        blue  orange  red
Joe        1       1    3
Steve      1       1    3
Wes        0       1    2
Jim        1       1    3
Travis     1       1    3
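
On pandas versions where `axis=1` is deprecated in `groupby`, the same column grouping can be written by transposing, grouping the rows, and transposing back. A sketch under that assumption, with a small deterministic frame in place of the random one:

```python
import numpy as np
import pandas as pd

people = pd.DataFrame(
    np.arange(10, dtype=float).reshape(2, 5),
    columns=["a", "b", "c", "d", "e"],
    index=["Joe", "Steve"],
)
mapping = {"a": "red", "b": "red", "c": "blue",
           "d": "orange", "e": "red", "f": "blue"}  # 'f' is unused

# columns become rows, get grouped by the mapping, then turn back into columns
by_dict = people.T.groupby(mapping).sum().T

# a Series key works the same way; pandas aligns it with the row labels
by_series = people.T.groupby(pd.Series(mapping)).sum().T
```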


## Grouping with functions

A function used as a group key is called once per index value, and its return values are used as the group names.


In [46]:
people.groupby(len).sum()

          a         b         c         d         e
3  0.309876  0.318816 -1.536346  0.197296 -0.441725
5 -2.624573 -1.142006  1.275258  0.638460 -0.049654
6 -0.582800  1.582861  0.187689 -0.495851 -1.422027
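
In other words, passing `len` should behave like grouping by the array of name lengths. A small sketch that checks this on a made-up one-column frame:

```python
import pandas as pd

people = pd.DataFrame(
    {"x": [1.0, 2.0, 3.0, 4.0, 5.0]},
    index=["Joe", "Steve", "Wes", "Jim", "Travis"],
)

# the function is applied to each index label
by_func = people.groupby(len).sum()

# the same keys computed up front: 3, 5, 3, 3, 6
by_array = people.groupby(people.index.map(len)).sum()
```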


In [48]:
# Functions can be mixed with arrays, lists, dicts, and Series as group keys
key_list = ['one', 'one', 'one', 'two', 'two']
people.groupby([len, key_list]).min()

              a         b         c         d         e
3 one -1.001767 -0.944141 -0.406020  0.914168 -0.705047
  two  0.258541  1.262957 -1.130326 -1.853057  0.916846
5 one -2.624573 -1.142006  1.275258  0.638460 -0.049654
6 two -0.582800  1.582861  0.187689 -0.495851 -1.422027


## Grouping by index level

With a hierarchical index, you can aggregate by index level: pass the level number or name through the level keyword.

In [49]:
columns = pd.MultiIndex.from_arrays([['US', 'US', 'US', 'JP', 'JP'],
                                    [1, 3, 5, 1, 3]], names=['cty', 'tenor'])
hier_df = DataFrame(np.random.randn(4, 5), columns=columns)
hier_df

cty          US                            JP
tenor         1         3         5         1         3
0     -0.761653 -1.095973  0.123118  1.363196 -0.087321
1     -1.586871  0.242787  0.311317  0.096900  0.384882
2      1.032321  0.016799 -1.619511  0.017748  0.334336
3     -0.371162  0.919163 -0.476240  1.155305  2.318014


In [50]:
hier_df.groupby(level='cty', axis=1).count()

cty  JP  US
0     2   3
1     2   3
2     2   3
3     2   3


In [51]:
hier_df.groupby(level=1, axis=1).count()

tenor  1  3  5
0      2  2  1
1      2  2  1
2      2  2  1
3      2  2  1
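
The `level` keyword also works on a row MultiIndex, so one way to avoid `axis=1` on newer pandas is to transpose first. A sketch assuming a frame of ones, so the counts and sums are easy to verify:

```python
import numpy as np
import pandas as pd

columns = pd.MultiIndex.from_arrays(
    [["US", "US", "US", "JP", "JP"], [1, 3, 5, 1, 3]],
    names=["cty", "tenor"],
)
hier_df = pd.DataFrame(np.ones((2, 5)), columns=columns)

# group by level name: how many columns belong to each country
by_cty = hier_df.T.groupby(level="cty").count().T

# group by level number: sum the columns that share a tenor
by_tenor = hier_df.T.groupby(level=1).sum().T
```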
