Reading notes on [Python for Data Analysis](https://book.douban.com/subject/25779298/).
 
[Chapter 9](/2017/07/19/python_data_analysis9.html), Section 3: Group-wise Operations and Transformations

All the data used here can be downloaded from [the author's GitHub](https://github.com/wesm/pydata-book).


In [1]:
%pylab inline
import numpy as np
import pandas as pd
from pandas import Series, DataFrame

Populating the interactive namespace from numpy and matplotlib


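Aggregation is only one kind of group operation. It is a special case of data transformation: it reduces a one-dimensional array to a scalar value.

More general group operations can be specified with the transform and apply methods.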

In [2]:
df = DataFrame({
    'key1': ['a','a','b','b','a'],
    'key2': ['one','two','one','two','one'],
    'data1': np.random.randn(5),
    'data2': np.random.randn(5)
})

df

,data1,data2,key1,key2
0,-0.545043,-1.770258,a,one
1,-2.72559,1.37862,a,two
2,-0.558872,0.443795,b,one
3,-0.393619,0.014487,b,two
4,-1.216987,-0.386167,a,one


In [3]:
k1_means = df.groupby('key1').mean().add_prefix('mean_')
k1_means

key1,mean_data1,mean_data2
a,-1.495873,-0.259268
b,-0.476246,0.229141


In [4]:
pd.merge(df, k1_means, left_on='key1', right_index=True)

,data1,data2,key1,key2,mean_data1,mean_data2
0,-0.545043,-1.770258,a,one,-1.495873,-0.259268
1,-2.72559,1.37862,a,two,-1.495873,-0.259268
4,-1.216987,-0.386167,a,one,-1.495873,-0.259268
2,-0.558872,0.443795,b,one,-0.476246,0.229141
3,-0.393619,0.014487,b,two,-0.476246,0.229141


In [5]:
# The same result using transform
df.groupby('key1').transform(np.mean)

,data1,data2
0,-1.495873,-0.259268
1,-1.495873,-0.259268
2,-0.476246,0.229141
3,-0.476246,0.229141
4,-1.495873,-0.259268


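transform applies a function to each group and then places the results in the appropriate positions.

If a group produces a scalar value, that value is broadcast across the group's rows.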
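To make the broadcasting concrete, here is a minimal sketch on a small hand-made frame (the `toy` frame and its values are illustrative and not taken from the book). It compares the aggregated result, which has one row per group, with the transformed result, which keeps the original index.

```python
import pandas as pd

# Illustrative data: two groups with easy-to-check means
toy = pd.DataFrame({'key': ['a', 'a', 'b', 'b', 'a'],
                    'val': [1.0, 2.0, 10.0, 20.0, 3.0]})

# Aggregation: one scalar per group
group_means = toy.groupby('key')['val'].mean()

# transform: each group's scalar is broadcast back to that group's rows,
# so the result is aligned with the original index
broadcast = toy.groupby('key')['val'].transform('mean')

print(group_means)
print(broadcast)
```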

Next, we subtract the group mean from each value.

In [9]:
# Create a demeaning function
def demean(arr):
    return arr - arr.mean()

demeaned = df.groupby('key1').transform(demean)
demeaned

,data1,data2
0,0.950831,-1.51099
1,-1.229717,1.637888
2,-0.082627,0.214654
3,0.082627,-0.214654
4,0.278886,-0.126898


In [10]:
# Sanity check: the means should now be (numerically) zero:

demeaned.mean()

data1    6.661338e-17
data2   -1.110223e-17
dtype: float64
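
The check above looks at the overall column means. A slightly stricter check, sketched here on illustrative data (`toy` is a made-up frame, not the notebook's `df`), is that the mean within every group is zero after demeaning.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({'key': ['a', 'a', 'b', 'b', 'a'],
                    'val': [1.0, 2.0, 10.0, 20.0, 3.0]})

def demean(arr):
    return arr - arr.mean()

demeaned = toy.groupby('key')['val'].transform(demean)

# Group the demeaned values by the original keys; every group mean should be ~0
per_group = demeaned.groupby(toy['key']).mean()
print(per_group)
```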

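## apply: General split-apply-combine

Like aggregate, transform places strict requirements on the function it is given: the function must return either a scalar that can be broadcast (as np.mean does) or an array of the same size as the group.

apply is the most general GroupBy method. It splits the object into pieces, calls the passed function on each piece, and then attempts to concatenate the pieces together.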
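
The three steps can be spelled out by hand. The following is only a sketch of the idea on made-up data (`toy` and `largest` are illustrative names), not a description of how pandas implements apply internally.

```python
import pandas as pd

toy = pd.DataFrame({'key': ['a', 'b', 'a', 'b', 'a'],
                    'val': [5, 1, 3, 4, 2]})

def largest(piece, n=2):
    # Return the n rows with the largest 'val' within one piece
    return piece.sort_values(by='val')[-n:]

# 1. split: iterating over a GroupBy yields (key, sub-frame) pairs
# 2. apply: call the function on each piece
pieces = {key: largest(piece) for key, piece in toy.groupby('key')}

# 3. combine: concatenate the results; the dict keys become the outer index level
combined = pd.concat(pieces)
print(combined)
```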

In [11]:
# Load the data
tips = pd.read_csv('data/ch08/tips.csv')

# Add the tip percentage (tip_pct)
tips['tip_pct'] = tips['tip']/tips['total_bill']
tips.head()

,total_bill,tip,sex,smoker,day,time,size,tip_pct
0,16.99,1.01,Female,No,Sun,Dinner,2,0.059447
1,10.34,1.66,Male,No,Sun,Dinner,3,0.160542
2,21.01,3.5,Male,No,Sun,Dinner,3,0.166587
3,23.68,3.31,Male,No,Sun,Dinner,2,0.13978
4,24.59,3.61,Female,No,Sun,Dinner,4,0.146808


In [13]:
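# Suppose we want to select the five largest tip_pct values within each group

# First write a function that selects the rows with the largest values in a given column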

def top(df, n=5, column='tip_pct'):
    return df.sort_values(by=column)[-n:]

top(tips,n=6)


,total_bill,tip,sex,smoker,day,time,size,tip_pct
109,14.31,4.0,Female,Yes,Sat,Dinner,2,0.279525
183,23.17,6.5,Male,Yes,Sun,Dinner,4,0.280535
232,11.61,3.39,Male,No,Sat,Dinner,2,0.29199
67,3.07,1.0,Female,Yes,Sat,Dinner,1,0.325733
178,9.6,4.0,Female,Yes,Sun,Dinner,2,0.416667
172,7.25,5.15,Male,Yes,Sun,Dinner,2,0.710345


In [14]:
# Now group by smoker and apply this function

tips.groupby('smoker').apply(top)

smoker,,total_bill,tip,sex,smoker,day,time,size,tip_pct
No,88,24.71,5.85,Male,No,Thur,Lunch,2,0.236746
No,185,20.69,5.0,Male,No,Sun,Dinner,5,0.241663
No,51,10.29,2.6,Female,No,Sun,Dinner,2,0.252672
No,149,7.51,2.0,Male,No,Thur,Lunch,2,0.266312
No,232,11.61,3.39,Male,No,Sat,Dinner,2,0.29199
Yes,109,14.31,4.0,Female,Yes,Sat,Dinner,2,0.279525
Yes,183,23.17,6.5,Male,Yes,Sun,Dinner,4,0.280535
Yes,67,3.07,1.0,Female,Yes,Sat,Dinner,1,0.325733
Yes,178,9.6,4.0,Female,Yes,Sun,Dinner,2,0.416667
Yes,172,7.25,5.15,Male,Yes,Sun,Dinner,2,0.710345


In [15]:
# Extra arguments given to apply are passed on to the function
tips.groupby(['smoker','day']).apply(top, n=1, column='total_bill')

smoker,day,,total_bill,tip,sex,smoker,day,time,size,tip_pct
No,Fri,94,22.75,3.25,Female,No,Fri,Dinner,2,0.142857
No,Sat,212,48.33,9.0,Male,No,Sat,Dinner,4,0.18622
No,Sun,156,48.17,5.0,Male,No,Sun,Dinner,6,0.103799
No,Thur,142,41.19,5.0,Male,No,Thur,Lunch,5,0.121389
Yes,Fri,95,40.17,4.73,Male,Yes,Fri,Dinner,4,0.11775
Yes,Sat,170,50.81,10.0,Male,Yes,Sat,Dinner,3,0.196812
Yes,Sun,182,45.35,3.5,Male,Yes,Sun,Dinner,3,0.077178
Yes,Thur,197,43.11,5.0,Female,Yes,Thur,Lunch,4,0.115982


In [17]:
# Suppress the group keys in the result index
tips.groupby(['smoker','day'], group_keys=False).apply(top, n=1, column='total_bill')

,total_bill,tip,sex,smoker,day,time,size,tip_pct
94,22.75,3.25,Female,No,Fri,Dinner,2,0.142857
212,48.33,9.0,Male,No,Sat,Dinner,4,0.18622
156,48.17,5.0,Male,No,Sun,Dinner,6,0.103799
142,41.19,5.0,Male,No,Thur,Lunch,5,0.121389
95,40.17,4.73,Male,Yes,Fri,Dinner,4,0.11775
170,50.81,10.0,Male,Yes,Sat,Dinner,3,0.196812
182,45.35,3.5,Male,Yes,Sun,Dinner,3,0.077178
197,43.11,5.0,Female,Yes,Thur,Lunch,4,0.115982
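
For this particular task, a per-group "largest row" can also be obtained without apply, by sorting once and keeping the tail of each group. This is an alternative sketch on made-up data (`toy` is illustrative), not the approach used in the notes above; the row order of the result follows the sorted frame rather than the group keys.

```python
import pandas as pd

toy = pd.DataFrame({'smoker': ['No', 'Yes', 'No', 'Yes', 'No'],
                    'total_bill': [10.0, 30.0, 25.0, 12.0, 18.0]})

# Sort once, then keep the last row of each group (the largest total_bill)
top1 = toy.sort_values(by='total_bill').groupby('smoker').tail(1)
print(top1)
```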


## Quantile and Bucket Analysis

Combining binning tools such as cut and qcut with groupby makes quantile and bucket analysis straightforward.
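
The difference between the two binning tools is easiest to see from the bucket counts. The sketch below uses made-up random data with an arbitrarily chosen seed: cut produces buckets of roughly equal width with unequal counts, while qcut produces buckets with equal counts.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)          # fixed seed, chosen arbitrarily
values = pd.Series(rng.standard_normal(1000))

# cut: 4 buckets of (approximately) equal width, so the counts differ
by_width = pd.cut(values, 4)
# qcut: 4 buckets split at the sample quartiles, so the counts are equal
by_quantile = pd.qcut(values, 4)

print(by_width.value_counts(sort=False))
print(by_quantile.value_counts(sort=False))
```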

In [19]:
frame = DataFrame({'data1': np.random.randn(1000),
                   'data2': np.random.randn(1000)})

# Use cut to put the data into equal-width buckets
factor = pd.cut(frame.data1, 4)
factor[:10]

0     (-1.465, 0.285]
1     (0.285, 2.0343]
2     (-1.465, 0.285]
3     (0.285, 2.0343]
4     (0.285, 2.0343]
5     (0.285, 2.0343]
6     (-1.465, 0.285]
7     (0.285, 2.0343]
8     (-1.465, 0.285]
9    (-3.221, -1.465]
Name: data1, dtype: category
Categories (4, object): [(-3.221, -1.465] < (-1.465, 0.285] < (0.285, 2.0343] < (2.0343, 3.784]]

In [20]:
# The Categorical object returned by cut can be passed directly to groupby

def get_stats(group):
    return {'min': group.min(), 'max': group.max(), 'count': group.count(), 'mean': group.mean()}

grouped = frame.data2.groupby(factor)

In [21]:
grouped.apply(get_stats).unstack()

data1,count,max,mean,min
"(-3.221, -1.465]",70.0,2.983339,0.002262,-3.194609
"(-1.465, 0.285]",551.0,3.048865,-0.036949,-3.262832
"(0.285, 2.0343]",354.0,3.179622,0.011766,-2.261756
"(2.0343, 3.784]",25.0,1.548275,-0.035986,-1.678245


In [22]:
# Use qcut to get equal-size buckets based on sample quantiles.
# Passing labels=False returns only the quantile numbers

grouping = pd.qcut(frame.data1, 10, labels=False)

grouped = frame.data2.groupby(grouping)
grouped.apply(get_stats).unstack()

data1,count,max,mean,min
0,100.0,2.983339,-0.054477,-3.194609
1,100.0,2.271839,0.108037,-2.296877
2,100.0,2.077979,-0.109394,-2.092836
3,100.0,2.526364,0.065489,-3.077176
4,100.0,2.173251,-0.024841,-3.262832
5,100.0,1.603202,-0.17012,-2.779698
6,100.0,3.048865,0.060291,-2.318005
7,100.0,3.179622,0.007835,-2.261756
8,100.0,2.759021,-0.102486,-2.100495
9,100.0,1.659892,0.050313,-1.712727
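
The same kind of summary table can also be produced by passing a list of function names to agg, which avoids the helper function and the unstack step. This is an alternative sketch on freshly generated data (the seed is arbitrary), so the numbers will not match the output above.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)          # fixed seed, chosen arbitrarily
frame = pd.DataFrame({'data1': rng.standard_normal(1000),
                      'data2': rng.standard_normal(1000)})

# Integer decile labels 0..9 for data1
deciles = pd.qcut(frame['data1'], 10, labels=False)

# A list of function names yields one column per statistic
stats = frame['data2'].groupby(deciles).agg(['count', 'max', 'mean', 'min'])
print(stats)
```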


## Example: Filling Missing Values with Group-Specific Values

In [24]:
s = Series(np.random.randn(6))
s[::2] = np.nan
s

0         NaN
1   -0.917052
2         NaN
3   -0.773708
4         NaN
5    0.526083
dtype: float64

In [25]:
s.fillna(s.mean())

0   -0.388225
1   -0.917052
2   -0.388225
3   -0.773708
4   -0.388225
5    0.526083
dtype: float64

In [26]:
states = ['Ohio', 'New York', 'Vermont', 'Florida',
          'Oregon', 'Nevada', 'California', 'Idaho']
group_key = ['East'] * 4 + ['West'] * 4
data = Series(np.random.randn(8), index=states)
data[['Vermont', 'Nevada', 'Idaho']] = np.nan
data

Ohio          0.173629
New York     -0.054600
Vermont            NaN
Florida      -0.086387
Oregon        0.005616
Nevada             NaN
California    0.460536
Idaho              NaN
dtype: float64

In [27]:
data.groupby(group_key).mean()

East    0.010881
West    0.233076
dtype: float64

In [29]:
# Fill NA values with the group mean
fill_mean = lambda g: g.fillna(g.mean())
data.groupby(group_key).apply(fill_mean)

Ohio          0.173629
New York     -0.054600
Vermont       0.010881
Florida      -0.086387
Oregon        0.005616
Nevada        0.233076
California    0.460536
Idaho         0.233076
dtype: float64

In [30]:
# Fill with a predefined value for each group
fill_values = {'East': 0.5, 'West': -1}
fill_func = lambda g: g.fillna(fill_values[g.name])

data.groupby(group_key).apply(fill_func)

Ohio          0.173629
New York     -0.054600
Vermont       0.500000
Florida      -0.086387
Oregon        0.005616
Nevada       -1.000000
California    0.460536
Idaho        -1.000000
dtype: float64
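
Filling with the group mean can also be written with transform, since transform returns the group means already aligned with the original index. The sketch below uses made-up values (not the random data above) so that the result is easy to verify by eye.

```python
import numpy as np
import pandas as pd

data = pd.Series([1.0, np.nan, 3.0, 10.0, np.nan, 30.0],
                 index=['Ohio', 'Vermont', 'Florida', 'Oregon', 'Nevada', 'Idaho'])
group_key = ['East'] * 3 + ['West'] * 3

# Group means broadcast to the shape of data (mean skips NaN)
group_means = data.groupby(group_key).transform('mean')

# fillna with a Series fills each missing entry from the matching label
filled = data.fillna(group_means)
print(filled)
```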

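## Example: Random Sampling and Permutation

One way to draw a random sample is to take the first K elements of np.random.permutation(N), where N is the size of the population and K is the desired sample size.
For example, consider a deck of playing cards.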

In [46]:
# Suits: Hearts, Spades, Clubs, Diamonds
suits = ['H', 'S', 'C', 'D']
# Card values as counted in blackjack: 1, 2, 3, ..., 9, 10, 10, 10, 10
card_val = (list(range(1, 11)) + [10] * 3) * 4
# Face names
base_names = ['A'] + list(range(2, 11)) + ['J', 'Q', 'K']

cards = []
for suit in suits:
    cards.extend(str(num) + suit for num in base_names)

# A full deck of 52 cards
deck = Series(card_val, index=cards)
deck.head()

AH    1
2H    2
3H    3
4H    4
5H    5
dtype: int64

In [47]:
# Draw five cards at random

def draw(deck,n=5):
    return deck.take(np.random.permutation(len(deck))[:n])

draw(deck)


8S     8
JC    10
9D     9
JH    10
6C     6
dtype: int64

In [48]:
# Draw two cards from each suit

get_suit = lambda card: card[-1]
deck.groupby(get_suit).apply(draw,n=2)

C  KC     10
   JC     10
D  3D      3
   2D      2
H  QH     10
   6H      6
S  10S    10
   4S      4
dtype: int64

In [50]:
# Drop the group keys
deck.groupby(get_suit, group_keys=False).apply(draw,n=2)

3C     3
6C     6
KD    10
3D     3
7H     7
AH     1
4S     4
7S     7
dtype: int64
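
Because the draws above use the global random state, every run gives a different hand. A reproducible variant can be sketched with a seeded generator; the seed value and the use of `np.random.default_rng` are choices made for this illustration, not part of the original notes.

```python
import numpy as np
import pandas as pd

suits = ['H', 'S', 'C', 'D']
names = ['A'] + list(range(2, 11)) + ['J', 'Q', 'K']
cards = [str(name) + suit for suit in suits for name in names]
values = (list(range(1, 11)) + [10] * 3) * 4
deck = pd.Series(values, index=cards)

rng = np.random.default_rng(42)         # fixed seed, chosen arbitrarily

def draw(deck, n=5):
    # The first n positions of a random permutation form a sample without replacement
    return deck.take(rng.permutation(len(deck))[:n])

hand = draw(deck)
two_per_suit = deck.groupby(lambda card: card[-1], group_keys=False).apply(draw, n=2)
print(hand)
print(two_per_suit)
```

Since the sample comes from a permutation, no card can appear twice in a hand.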

## Example: Group Weighted Average and Correlation

In [51]:
df = DataFrame({'category': ['a', 'a', 'a', 'a', 'b', 'b', 'b', 'b'],
                'data': np.random.randn(8),
                'weights': np.random.rand(8)})
df

,category,data,weights
0,a,0.924912,0.168805
1,a,1.495683,0.731728
2,a,0.073354,0.455043
3,a,0.773848,0.463418
4,b,0.751353,0.702116
5,b,-1.144943,0.688624
6,b,-0.574356,0.300167
7,b,-0.117242,0.256547


In [52]:
# Compute the weighted average of data within each category
grouped = df.groupby('category')
get_wavg = lambda g: np.average(g['data'], weights=g['weights'])
grouped.apply(get_wavg)

category
a    0.903003
b   -0.237941
dtype: float64
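
The weighted average of a group is sum(w * x) / sum(w). As a sanity check, the sketch below computes that formula directly on made-up numbers (`toy` is illustrative) and compares it with np.average applied to each group.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({'category': ['a', 'a', 'b', 'b'],
                    'data': [1.0, 3.0, 10.0, 20.0],
                    'weights': [1.0, 3.0, 1.0, 1.0]})

# Weighted mean per group: sum(w * x) / sum(w)
weighted_sum = (toy['data'] * toy['weights']).groupby(toy['category']).sum()
weight_total = toy['weights'].groupby(toy['category']).sum()
manual = weighted_sum / weight_total

# The same quantity via np.average on each group
via_numpy = pd.Series({key: np.average(g['data'], weights=g['weights'])
                       for key, g in toy.groupby('category')})
print(manual)
```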

In [53]:
# Closing prices for the S&P 500 index and a few stocks

close_px = pd.read_csv('data/ch09/stock_px.csv', parse_dates=True, index_col=0)
close_px.info()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 2214 entries, 2003-01-02 to 2011-10-14
Data columns (total 4 columns):
AAPL    2214 non-null float64
MSFT    2214 non-null float64
XOM     2214 non-null float64
SPX     2214 non-null float64
dtypes: float64(4)
memory usage: 86.5 KB


In [54]:
# Task: compute the yearly correlation of daily returns with SPX
rets = close_px.pct_change().dropna()
spx_corr = lambda x: x.corrwith(x['SPX'])
by_year = rets.groupby(lambda x: x.year)
by_year.apply(spx_corr)

,AAPL,MSFT,XOM,SPX
2003,0.541124,0.745174,0.661265,1.0
2004,0.374283,0.588531,0.557742,1.0
2005,0.46754,0.562374,0.63101,1.0
2006,0.428267,0.406126,0.518514,1.0
2007,0.508118,0.65877,0.786264,1.0
2008,0.681434,0.804626,0.828303,1.0
2009,0.707103,0.654902,0.797921,1.0
2010,0.710105,0.730118,0.839057,1.0
2011,0.691931,0.800996,0.859975,1.0


In [55]:
# Yearly correlation between Apple and Microsoft
by_year.apply(lambda g: g['AAPL'].corr(g['MSFT']))

2003    0.480868
2004    0.259024
2005    0.300093
2006    0.161735
2007    0.417738
2008    0.611901
2009    0.432738
2010    0.571946
2011    0.581987
dtype: float64
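
For readers without the stock data file, the yearly grouping can be tried on a tiny synthetic frame. The dates and returns below are made up so that the answer is known in advance: the two columns move together in 2020 and in opposite directions in 2021.

```python
import numpy as np
import pandas as pd

dates = pd.to_datetime(['2020-01-02', '2020-01-03', '2020-01-06',
                        '2021-01-04', '2021-01-05', '2021-01-06'])
x = np.array([0.01, -0.02, 0.03, 0.02, -0.01, 0.04])
rets = pd.DataFrame({'X': x,
                     # Y moves with X in 2020 and against it in 2021
                     'Y': np.concatenate([2 * x[:3], -x[3:]])},
                    index=dates)

# The function passed to groupby is called on each index label (a Timestamp)
by_year = rets.groupby(lambda ts: ts.year)
yearly_corr = by_year.apply(lambda g: g['X'].corr(g['Y']))
print(yearly_corr)
```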

## Example: Group-wise Linear Regression

In [56]:
import statsmodels.api as sm

def regress(data, yvar, xvars):
    Y = data[yvar]
    # Copy so that adding the intercept column does not touch the original frame
    X = data[xvars].copy()
    X['intercept'] = 1.
    result = sm.OLS(Y, X).fit()
    return result.params

by_year.apply(regress, 'AAPL', ['SPX'])

,SPX,intercept
2003,1.195406,0.00071
2004,1.363463,0.004201
2005,1.766415,0.003246
2006,1.645496,8e-05
2007,1.198761,0.003438
2008,0.968016,-0.00111
2009,0.879103,0.002954
2010,1.052608,0.001261
2011,0.806605,0.001514
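
If statsmodels is not installed, the same least-squares fit with an intercept can be sketched with np.linalg.lstsq. This variant is a substitute for illustration, not the method used above, and the `toy` data are made up with no noise so that the recovered slope and intercept are known exactly.

```python
import numpy as np
import pandas as pd

def regress(data, yvar, xvars):
    # Ordinary least squares with an intercept, solved via np.linalg.lstsq
    Y = data[yvar].to_numpy()
    X = data[xvars].copy()
    X['intercept'] = 1.0
    coef, *_ = np.linalg.lstsq(X.to_numpy(), Y, rcond=None)
    return pd.Series(coef, index=X.columns)

# Illustrative data: y = 2x + 1 in 2020 and y = -x + 3 in 2021, no noise
x = np.array([0.0, 1.0, 2.0, 3.0])
toy = pd.DataFrame({'year': [2020] * 4 + [2021] * 4,
                    'x': np.tile(x, 2),
                    'y': np.concatenate([2 * x + 1, -x + 3])})

# Select the columns explicitly so that only x and y reach the function
params = toy.groupby('year')[['x', 'y']].apply(regress, 'y', ['x'])
print(params)
```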
