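## GroupBy Techniques

Group operations follow the split-apply-combine pattern: split the data into groups, apply a function to each group, and combine the results.

Grouping keys can take several forms, and they do not all have to be of the same type:
* A list or array whose length matches the axis being grouped
* A value naming a column of the DataFrame
* A dict or Series giving the correspondence between values on the axis being grouped and group names
* A function to be applied to the axis index or to the individual labels in the index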
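The four kinds of keys listed above can be sketched side by side. This is a minimal illustration on a made-up frame; the column names, index labels, and values are assumptions chosen only for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical data, used only to illustrate the four key types.
df = pd.DataFrame(
    {"key": ["a", "a", "b", "b"], "value": [1.0, 2.0, 3.0, 4.0]},
    index=["w", "x", "y", "z"],
)

# 1. An array with the same length as the axis being grouped.
by_array = df["value"].groupby(np.array(["u", "v", "u", "v"])).sum()

# 2. A column name of the DataFrame.
by_column = df.groupby("key")["value"].sum()

# 3. A dict mapping index labels to group names.
label_to_group = {"w": "first", "x": "first", "y": "second", "z": "second"}
by_dict = df["value"].groupby(label_to_group).sum()

# 4. A function that is called once per index label.
by_func = df["value"].groupby(lambda label: label in ("w", "z")).sum()

print(by_array, by_column, by_dict, by_func, sep="\n\n")
```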

In [1]:
from pandas import Series, DataFrame, Index, MultiIndex
from numpy.random import randn
import pandas as pd
import numpy as np
import datetime
import random
import re

In [2]:
%matplotlib inline
import matplotlib.pyplot as plt

In [5]:
df = DataFrame(
    {
        "key1": ["a", "a", "b", "b", "a"],
        "key2": ["one", "two", "one", "two", "one"],
        "data1": np.random.randn(5),
        "data2": np.random.randn(5)
    }
)

In [6]:
df

      data1     data2 key1 key2
0 -0.091409  1.001294    a  one
1  0.155556 -0.957352    a  two
2 -0.310636  1.255587    b  one
3 -0.157187  2.331527    b  two
4  0.136713 -2.856341    a  one


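Suppose we want to group by key1 and compute the mean of the data1 column.

There are several ways to do this. Here we select data1 and call groupby with the key1 column.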

In [7]:
grouped = df["data1"].groupby(df["key1"]);grouped

<pandas.core.groupby.SeriesGroupBy object at 0x000000000896E6A0>

grouped is a GroupBy object. Nothing has actually been computed yet; the object only holds some intermediate data about the group key df["key1"].
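To see what that intermediate data looks like, we can inspect the GroupBy object before aggregating. The sketch below uses made-up data1 values (an assumption, chosen so the arithmetic is easy to follow) and the `ngroups` and `groups` attributes.

```python
import pandas as pd

# Hypothetical values; only the key1 layout mirrors the example above.
df = pd.DataFrame(
    {"key1": ["a", "a", "b", "b", "a"], "data1": [1.0, 2.0, 3.0, 4.0, 5.0]}
)

grouped = df["data1"].groupby(df["key1"])

# Number of distinct groups.
n_groups = grouped.ngroups

# Mapping from group name to the row labels belonging to that group.
row_labels = {name: list(labels) for name, labels in grouped.groups.items()}

# The aggregation itself only happens when a method such as mean() is called.
means = grouped.mean()

print(n_groups)
print(row_labels)
print(means)
```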

In [8]:
grouped.mean()

key1
a    0.066953
b   -0.233911
Name: data1, dtype: float64

We can also pass several arrays as keys, which gives a different result.

In [9]:
means = df["data1"].groupby([df["key1"], df["key2"]]).mean();means

key1  key2
a     one     0.022652
      two     0.155556
b     one    -0.310636
      two    -0.157187
Name: data1, dtype: float64

The data was grouped by two keys, so the resulting Series has a hierarchical index.

In [11]:
means.unstack()

key2       one       two
key1                    
a     0.022652  0.155556
b    -0.310636 -0.157187


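In the examples above the group keys were all Series. In fact, a group key can be any array of the right length, that is, the same length as the axis being grouped.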

In [12]:
states = np.array(["Ohio", "California", "California", "Ohio", "Ohio"])

In [13]:
years = np.array([2005, 2005, 2006, 2005, 2006])

In [14]:
df["data1"].groupby([states, years]).mean()

California  2005    0.155556
            2006   -0.310636
Ohio        2005   -0.124298
            2006    0.136713
Name: data1, dtype: float64

Column names (strings, numbers, or other Python objects) can also be used as group keys.

In [15]:
df.groupby("key1").mean()

         data1     data2
key1                    
a     0.066953 -0.937466
b    -0.233911  1.793557


In [16]:
df.groupby(["key1", "key2"]).mean()

              data1     data2
key1 key2                    
a    one   0.022652 -0.927523
     two   0.155556 -0.957352
b    one  -0.310636  1.255587
     two  -0.157187  2.331527


When df.groupby("key1").mean() ran, the result contained no key2 column: key2 is not numeric data, so it was dropped from the result.

By default, all numeric columns are aggregated.
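The output above comes from an older pandas release that silently dropped non-numeric columns. To the best of our knowledge, pandas 2.0 and later raise a TypeError in that situation instead, so a version-independent sketch either passes `numeric_only=True` or selects the numeric columns explicitly. The data values below are made up for the illustration.

```python
import pandas as pd

# Hypothetical values; the key columns mirror the example above.
df = pd.DataFrame(
    {
        "key1": ["a", "a", "b", "b", "a"],
        "key2": ["one", "two", "one", "two", "one"],
        "data1": [1.0, 2.0, 3.0, 4.0, 5.0],
        "data2": [10.0, 20.0, 30.0, 40.0, 50.0],
    }
)

# Option 1: ask for numeric columns only, so key2 is left out.
means = df.groupby("key1").mean(numeric_only=True)

# Option 2: select the columns to aggregate before calling mean().
means_explicit = df.groupby("key1")[["data1", "data2"]].mean()

print(means)
print(means_explicit)
```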

The size method of a GroupBy object returns a Series containing the size of each group.

In [17]:
df.groupby(["key1", "key2"]).size()

key1  key2
a     one     2
      two     1
b     one     1
      two     1
dtype: int64
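One point worth checking when using size is how it differs from count: size reports the number of rows per group, while count reports the number of non-missing values. A small sketch with made-up data (the frame below is an assumption for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing value in group "a".
df = pd.DataFrame({"key": ["a", "a", "b"], "value": [1.0, np.nan, 3.0]})

# size() counts rows, including rows whose value is NaN.
sizes = df.groupby("key").size()

# count() only counts non-missing values.
counts = df.groupby("key")["value"].count()

print(sizes)
print(counts)
```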

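### Iterating Over Groups

GroupBy objects support iteration, producing a sequence of 2-tuples, each holding the group name and the corresponding chunk of data.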

In [18]:
for name, group in df.groupby("key1"):
    print(name)
    print(group)

a
      data1     data2 key1 key2
0 -0.091409  1.001294    a  one
1  0.155556 -0.957352    a  two
4  0.136713 -2.856341    a  one
b
      data1     data2 key1 key2
2 -0.310636  1.255587    b  one
3 -0.157187  2.331527    b  two


With multiple keys, the first element of each tuple is itself a tuple of key values.

In [19]:
for (k1, k2), group in df.groupby(["key1", "key2"]):
    print(k1, k2)
    print(group)

a one
      data1     data2 key1 key2
0 -0.091409  1.001294    a  one
4  0.136713 -2.856341    a  one
a two
      data1     data2 key1 key2
1  0.155556 -0.957352    a  two
b one
      data1     data2 key1 key2
2 -0.310636  1.255587    b  one
b two
      data1     data2 key1 key2
3 -0.157187  2.331527    b  two


We can do whatever we like with these pieces of data, for example collect them into a dict.

In [20]:
pieces = dict(list(df.groupby("key1")))

In [21]:
pieces["b"]

      data1     data2 key1 key2
2 -0.310636  1.255587    b  one
3 -0.157187  2.331527    b  two
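When only one piece is needed, building the whole dict is not required. As a sketch (with made-up data1 values), `get_group` fetches a single group by name and returns the same rows as the dict lookup:

```python
import pandas as pd

# Hypothetical values; the key1 layout mirrors the example above.
df = pd.DataFrame(
    {"key1": ["a", "a", "b", "b", "a"], "data1": [1.0, 2.0, 3.0, 4.0, 5.0]}
)

grouped = df.groupby("key1")

# All pieces at once, keyed by group name.
pieces = {name: group for name, group in grouped}

# A single piece, fetched directly by its group name.
b_rows = grouped.get_group("b")

print(pieces["b"])
print(b_rows)
```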


By default groupby groups along axis=0, but it can be told to group along the other axis instead.

For example, we can group the columns by dtype.

In [22]:
df.dtypes

data1    float64
data2    float64
key1      object
key2      object
dtype: object

In [23]:
grouped = df.groupby(df.dtypes, axis=1)

In [24]:
dict(list(grouped))

{dtype('float64'):       data1     data2
 0 -0.091409  1.001294
 1  0.155556 -0.957352
 2 -0.310636  1.255587
 3 -0.157187  2.331527
 4  0.136713 -2.856341, dtype('O'):   key1 key2
 0    a  one
 1    a  two
 2    b  one
 3    b  two
 4    a  one}

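### Selecting a Column or a Subset of Columns

Indexing a GroupBy object created from a DataFrame with a column name (a single string) or with a list of column names selects just those columns for aggregation.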

df.groupby("key1")["data1"]

df.groupby("key1")[["data2"]]

is syntactic sugar for

df["data1"].groupby(df["key1"])

df[["data2"]].groupby(df["key1"])
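The equivalence of the two spellings can be checked directly. This is a small sketch with made-up values; it compares the aggregated results of both forms.

```python
import pandas as pd

# Hypothetical values; column names mirror the example above.
df = pd.DataFrame(
    {
        "key1": ["a", "a", "b", "b", "a"],
        "data1": [1.0, 2.0, 3.0, 4.0, 5.0],
        "data2": [10.0, 20.0, 30.0, 40.0, 50.0],
    }
)

# Single column name: both forms give a Series.
sugar = df.groupby("key1")["data1"].mean()
explicit = df["data1"].groupby(df["key1"]).mean()

# List of column names: both forms give a DataFrame.
sugar_frame = df.groupby("key1")[["data2"]].mean()
explicit_frame = df[["data2"]].groupby(df["key1"]).mean()

print(sugar.equals(explicit))
print(sugar_frame.equals(explicit_frame))
```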

To compute the mean of just the data2 column and get the result as a DataFrame:

In [26]:
df.groupby(["key1", "key2"])[["data2"]].mean()

              data2
key1 key2          
a    one  -0.927523
     two  -0.957352
b    one   1.255587
     two   2.331527


In [27]:
df.groupby(["key1", "key2"])["data2"].mean()  # returns a Series

key1  key2
a     one    -0.927523
      two    -0.957352
b     one     1.255587
      two     2.331527
Name: data2, dtype: float64

### Grouping with Dicts and Series

In [28]:
people = DataFrame(
    randn(5,5),
    columns=["a", "b", "c", "d", "e"],
    index=["Joe", "Steve", "Wes", "Jim", "Travis"]
)

In [29]:
people.iloc[2:3, [1, 2]] = np.nan  # set columns "b" and "c" of row "Wes" to NaN (.ix was removed from pandas)

In [30]:
people

               a         b         c         d         e
Joe    -0.957417 -0.427352 -0.451890 -0.810388 -1.403369
Steve   1.854160  0.494709  1.199981  1.490893  0.193533
Wes    -0.815569       NaN       NaN -0.896350 -0.893803
Jim    -0.731308  1.394683  0.461888  1.694353  0.632532
Travis  0.275919 -1.547021  1.540873 -1.783214 -0.403149


Suppose we know how the columns map to groups and want to sum the columns within each group.

In [31]:
mapping = {
    "a": "red",
    "b": "red",
    "c": "blue",
    "d": "blue",
    "e": "red",
    "f": "orange"
}

In [32]:
by_column = people.groupby(mapping, axis=1)

In [33]:
by_column.sum()

            blue       red
Joe    -1.262278 -2.788139
Steve   2.690875  2.542402
Wes    -0.896350 -1.709373
Jim     2.156240  1.295907
Travis -0.242340 -1.674251
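Newer pandas releases deprecate the `axis=1` argument of groupby (as far as we know, starting with 2.1). One way to get the same column-wise grouping without it is to transpose, group the rows, and transpose back. The sketch below uses made-up values so the sums are easy to verify; note that the unused "f" key in the mapping is simply ignored.

```python
import numpy as np
import pandas as pd

# Hypothetical values: Joe holds 0..4 and Steve holds 5..9.
people = pd.DataFrame(
    np.arange(10.0).reshape(2, 5),
    columns=["a", "b", "c", "d", "e"],
    index=["Joe", "Steve"],
)

mapping = {
    "a": "red", "b": "red", "c": "blue",
    "d": "blue", "e": "red", "f": "orange",
}

# Transpose so the columns become rows, group those rows by color,
# sum within each color, then transpose back to the original orientation.
by_color = people.T.groupby(mapping).sum().T

print(by_color)
```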


The same works with a Series, which can be viewed as a fixed-size mapping.

So we can group by a Series as well.

In [34]:
map_series = Series(mapping); map_series

a       red
b       red
c      blue
d      blue
e       red
f    orange
dtype: object

In [35]:
people.groupby(map_series, axis=1).count()

        blue  red
Joe        2    3
Steve      2    3
Wes        1    2
Jim        2    3
Travis     2    3


### Grouping with Functions

Any function passed as a group key is called once per index value, and its return values are used as the group names.

In [36]:
people.groupby(len).sum()  # group by the length of each name

          a         b         c         d         e
3 -2.504294  0.967331  0.009997 -0.012385 -1.664641
5  1.854160  0.494709  1.199981  1.490893  0.193533
6  0.275919 -1.547021  1.540873 -1.783214 -0.403149


Functions can also be mixed with arrays, lists, dicts, or Series.

In [37]:
key_list = ["one", "one", "one", "two", "two"]

In [38]:
people.groupby([len, key_list]).min()

              a         b         c         d         e
3 one -0.957417 -0.427352 -0.451890 -0.896350 -1.403369
  two -0.731308  1.394683  0.461888  1.694353  0.632532
5 one  1.854160  0.494709  1.199981  1.490893  0.193533
6 two  0.275919 -1.547021  1.540873 -1.783214 -0.403149


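### Grouping by Index Levels

The most convenient thing about hierarchically indexed data sets is that they can be aggregated by index level.

To do so, pass the level number or name through the level keyword.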

In [40]:
columns = pd.MultiIndex.from_arrays(
    [
        ["US", "US", "US", "JP", "JP"],
        [1, 3, 5, 1, 3]
    ],
    names=["cty", "tenor"]
)

In [41]:
hier_df = DataFrame(np.random.randn(4, 5), columns=columns);hier_df

cty          US                            JP          
tenor         1         3         5         1         3
0      0.044254 -1.209975 -2.087178 -1.263182 -0.644730
1     -1.379736  0.757575  0.194490  0.513908 -2.161278
2      0.232363 -2.140448 -2.164628  0.480960 -0.243653
3     -1.216786  1.911995  1.460839  0.664883  1.711944


In [42]:
hier_df.groupby(level="cty", axis=1).count()

cty  JP  US
0     2   3
1     2   3
2     2   3
3     2   3
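The level keyword works the same way on a row index and accepts either a level name or a level number. A minimal sketch on a Series with made-up values (an assumption for illustration), reusing the cty and tenor levels from above:

```python
import pandas as pd

index = pd.MultiIndex.from_arrays(
    [["US", "US", "US", "JP", "JP"], [1, 3, 5, 1, 3]],
    names=["cty", "tenor"],
)

# Hypothetical values, chosen so the sums are easy to check by hand.
s = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0], index=index)

# Group by level name.
by_country = s.groupby(level="cty").sum()

# Group by level number (1 is the second level, tenor).
by_tenor = s.groupby(level=1).sum()

print(by_country)
print(by_tenor)
```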


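## Data Aggregation

Besides the commonly used aggregation functions such as mean, count, min, and sum, you can use aggregation functions of your own devising and call any method already defined on the grouped object.

For example, quantile computes sample quantiles of a Series or of the columns of a DataFrame.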

In [43]:
df

      data1     data2 key1 key2
0 -0.091409  1.001294    a  one
1  0.155556 -0.957352    a  two
2 -0.310636  1.255587    b  one
3 -0.157187  2.331527    b  two
4  0.136713 -2.856341    a  one


In [44]:
grouped = df.groupby("key1")

In [45]:
grouped["data1"].quantile(0.9)

key1
a    0.151787
b   -0.172532
Name: data1, dtype: float64
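These numbers can be reproduced by hand. quantile interpolates linearly between neighbouring sorted values by default, so the 0.9 quantile of group a sits 80% of the way from 0.136713 to 0.155556. The helper below, `linear_quantile`, is a hypothetical name written for this sketch; it takes the data1 values printed above and checks them against NumPy.

```python
import numpy as np

# data1 values of each group, copied from the DataFrame shown above.
group_a = np.array([-0.091409, 0.155556, 0.136713])
group_b = np.array([-0.310636, -0.157187])


def linear_quantile(values, q):
    """Quantile with linear interpolation between sorted neighbours."""
    ordered = np.sort(values)
    position = q * (len(ordered) - 1)          # fractional index into the sorted data
    lower = int(np.floor(position))
    upper = min(lower + 1, len(ordered) - 1)
    fraction = position - lower
    return ordered[lower] + fraction * (ordered[upper] - ordered[lower])


q_a = linear_quantile(group_a, 0.9)
q_b = linear_quantile(group_b, 0.9)

print(round(q_a, 6), round(q_b, 6))
print(np.isclose(q_a, np.quantile(group_a, 0.9)))
```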

To use your own aggregation function, pass it to the aggregate or agg method.

In [46]:
def peak_to_peak(arr):
    return arr.max() - arr.min()

In [47]:
grouped.agg(peak_to_peak)

         data1     data2
key1                    
a     0.246965  3.857635
b     0.153450  1.075940
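As a cross-check, the same result can be obtained from the built-in max and min aggregations. The sketch below rebuilds the numeric columns from the values printed earlier and selects them explicitly, so it does not depend on how a given pandas version treats the non-numeric key2 column.

```python
import pandas as pd

# Values copied from the DataFrame shown above.
df = pd.DataFrame(
    {
        "key1": ["a", "a", "b", "b", "a"],
        "data1": [-0.091409, 0.155556, -0.310636, -0.157187, 0.136713],
        "data2": [1.001294, -0.957352, 1.255587, 2.331527, -2.856341],
    }
)


def peak_to_peak(arr):
    # Range of the values in one group: largest minus smallest.
    return arr.max() - arr.min()


grouped = df.groupby("key1")[["data1", "data2"]]

custom = grouped.agg(peak_to_peak)
builtin = grouped.max() - grouped.min()

print(custom)
print((custom - builtin).abs().to_numpy().max())
```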


Some methods, such as describe, also work here.

In [48]:
grouped.describe()

               data1     data2
key1                          
a    count  3.000000  3.000000
     mean   0.066953 -0.937466
     std    0.137469  1.928894
     min   -0.091409 -2.856341
     25%    0.022652 -1.906846
     50%    0.136713 -0.957352
     75%    0.146134  0.021971
     max    0.155556  1.001294
b    count  2.000000  2.000000
     mean  -0.233911  1.793557


![groupby.png](./files/groupby.png)

To illustrate some more advanced features, we use a data set on restaurant tipping.

In [49]:
tips = pd.read_csv("ch08/tips.csv")

In [51]:
tips["tip_pct"] = tips["tip"] / tips["total_bill"]  # add a column holding the tip as a fraction of the total bill

In [52]:
tips[:6]

   total_bill   tip     sex smoker  day    time  size   tip_pct
0       16.99  1.01  Female     No  Sun  Dinner     2  0.059447
1       10.34  1.66    Male     No  Sun  Dinner     3  0.160542
2       21.01  3.50    Male     No  Sun  Dinner     3  0.166587
3       23.68  3.31    Male     No  Sun  Dinner     2  0.139780
4       24.59  3.61  Female     No  Sun  Dinner     4  0.146808
5       25.29  4.71    Male     No  Sun  Dinner     4  0.186240
