In [1]:
import pandas as pd
import numpy as np
pd.set_option("display.show_dimensions", False)
pd.set_option("display.float_format", "{:4.2g}".format)

## Group operations

In [2]:
dose_df = pd.read_csv("dose.csv")
print(dose_df.head(3))

   Dose  Response1  Response2 Tmt  Age Gender
0    50        9.9         10   C  60s      F
1    15      0.002      0.004   D  60s      F
2    25       0.63        0.8   C  50s      M


### The `groupby()` method

> **TIP**

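> `groupby()` does not split the data immediately. It only returns a `GroupBy` object that holds the source data and the grouping keys. The `GroupBy` object performs the actual split only when the data of each group is needed.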
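
As a minimal sketch of this behavior, the snippet below groups a small hand-made table and only asks for per-group results afterwards. The `toy_df` frame is a hypothetical stand-in for `dose.csv`; only the column names are borrowed from it.

```python
import pandas as pd

# Hypothetical toy data standing in for dose.csv
toy_df = pd.DataFrame({
    "Tmt": ["A", "B", "A", "B", "C"],
    "Dose": [1, 5, 15, 25, 50],
})

# groupby() only returns a GroupBy object; no per-group result exists yet
grouped = toy_df.groupby("Tmt")
print(type(grouped).__name__)

# Per-group results are produced once something asks for them
group_sizes = grouped.size()
print(group_sizes)
```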

The grouping keys come from the source data:

In [12]:
# Group by the distinct values in the Tmt column
tmt_group = dose_df.groupby("Tmt")
len(tmt_group)

4

In [13]:
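# Group by the Tmt and Age columns together
tmt_age_group = dose_df.groupby(["Tmt", "Age"])
len(tmt_age_group)
# Tmt has 4 values and Age has 3, so there are at most 12 groups;
# some (Tmt, Age) combinations do not occur in the data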

10

The grouping keys are not part of the source data:

In [15]:
# Split the source data randomly into 5 groups
random_values = np.random.randint(0, 5, dose_df.shape[0])
random_group = dose_df.groupby(random_values)
len(random_group)

5

The grouping keys are computed from the row index of the source data:

In [16]:
# Group by the remainder of the row index divided by 3
alternating_group = dose_df.groupby(lambda n: n % 3)
len(alternating_group)

3

The three kinds of grouping keys can be combined freely:

In [19]:
crazy_group = dose_df.groupby(["Gender", lambda n: n % 2, random_values])
len(crazy_group) # 2*2*5

20
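
The three kinds of keys can be tried on a small table. This is a sketch under assumptions: `toy_df` and the fixed `labels` array are hypothetical, and the labels are hard-coded instead of random so that the group counts are reproducible.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; only the Gender column name mirrors dose.csv
toy_df = pd.DataFrame({
    "Gender": ["F", "M", "F", "M", "F", "M"],
    "Dose": [1, 5, 15, 25, 50, 100],
})

# External labels, one per row (fixed here instead of np.random.randint)
labels = np.array([0, 1, 0, 1, 1, 0])

by_column = toy_df.groupby("Gender")            # key is a column of the data
by_array = toy_df.groupby(labels)               # key is an external array
by_function = toy_df.groupby(lambda n: n % 3)   # key computed from the row index
combined = toy_df.groupby(["Gender", lambda n: n % 3, labels])

# len() counts only the key combinations that actually occur
n_groups = (len(by_column), len(by_array), len(by_function), len(combined))
print(n_groups)
```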

### The `GroupBy` object

`len()` returns the number of groups:

In [20]:
print(len(tmt_age_group), len(crazy_group))

10 20


A `GroupBy` object supports the iterator protocol. Iterating over it yields each group's key together with its data:

In [22]:
for key, df in tmt_age_group:
    print("key =", key, ", shape =", df.shape)

key = ('A', '50s') , shape = (39, 6)
key = ('A', '60s') , shape = (26, 6)
key = ('B', '40s') , shape = (13, 6)
key = ('B', '50s') , shape = (13, 6)
key = ('B', '60s') , shape = (39, 6)
key = ('C', '40s') , shape = (13, 6)
key = ('C', '50s') , shape = (13, 6)
key = ('C', '60s') , shape = (39, 6)
key = ('D', '50s') , shape = (52, 6)
key = ('D', '60s') , shape = (13, 6)


A quick way to bind each group to its own variable:

In [25]:
(_, df_A), (_, df_B), (_, df_C), (_, df_D) = tmt_group
df_A.head()

     Dose  Response1  Response2 Tmt  Age Gender
6     1.0        0.0        0.0   A  50s      F
10   15.0        5.2        5.2   A  60s      F
12    5.0        0.0      0.001   A  60s      F
17    5.0        0.0      0.003   A  50s      M
32  100.0        9.3       10.0   A  60s      F


> **TIP**

> Because a `GroupBy` object has a `keys` attribute, `dict(tmt_group)` cannot convert it to a dictionary directly. Convert it to an iterator first and then to a dictionary: `dict(iter(tmt_group))`.
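
A short sketch of the iterator route, using a hypothetical `toy_df` in place of `dose.csv`. Each dictionary value is the DataFrame of one group.

```python
import pandas as pd

# Hypothetical toy data standing in for dose.csv
toy_df = pd.DataFrame({
    "Tmt": ["A", "B", "A", "B", "C"],
    "Dose": [1, 5, 15, 25, 50],
})

grouped = toy_df.groupby("Tmt")

# iter() yields (key, DataFrame) pairs, which dict() accepts
frames = dict(iter(grouped))
print(sorted(frames))
print(frames["A"])
```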

`get_group()` returns the data for a given group key:

In [30]:
%C tmt_group.get_group("A").head(3);; tmt_age_group.get_group(("A", "50s")).head(3)

       tmt_group.get_group("A").head(3)       
----------------------------------------------
    Dose  Response1  Response2 Tmt  Age Gender
6      1          0          0   A  50s      F
10    15        5.2        5.2   A  60s      F
12     5          0      0.001   A  60s      F

tmt_age_group.get_group(("A", "50s")).head(3) 
----------------------------------------------
    Dose  Response1  Response2 Tmt  Age Gender
6      1          0          0   A  50s      F
17     5          0      0.003   A  50s      M
34    40         11         10   A  50s      M


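Indexing a `GroupBy` object returns a new `GroupBy` object that contains only the selected columns of the source data. This makes it possible to group by some columns of the source data and then run later computations on other columns.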

In [33]:
print(tmt_group["Dose"])
print(tmt_group[["Response1", "Response2"]])

<pandas.core.groupby.SeriesGroupBy object at 0x0A45FA90>
<pandas.core.groupby.DataFrameGroupBy object at 0x0A45F6D0>


In [35]:
for key, df in tmt_group.Dose:
    print("key =", key, ", shape =", df.shape)

key = A , shape = (65,)
key = B , shape = (65,)
key = C , shape = (65,)
key = D , shape = (65,)
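
A sketch of the group-then-select pattern on hypothetical data (`toy_df` is not part of the original notebook): the rows are grouped by `Tmt`, and only the `Dose` column is then aggregated.

```python
import pandas as pd

# Hypothetical toy data; only the column names mirror dose.csv
toy_df = pd.DataFrame({
    "Tmt": ["A", "B", "A", "B", "C"],
    "Dose": [1, 5, 15, 25, 50],
    "Response1": [0.0, 9.9, 5.2, 0.5, 1.4],
})

grouped = toy_df.groupby("Tmt")

# A single column name selects a SeriesGroupBy,
# a list of names selects a DataFrameGroupBy
one_column = grouped["Dose"]
two_columns = grouped[["Dose", "Response1"]]
print(type(one_column).__name__, type(two_columns).__name__)

# Group by Tmt, then aggregate only the Dose column
dose_mean = one_column.mean()
print(dose_mean)
```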


### Split-apply-combine

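The callback can take either each column of a group or the whole group as its argument.

* `agg()`: aggregates the data in each group
* `transform()`: transforms the data in each group
* `filter()`: tests each group against a condition
* `apply()`: passes each group's DataFrame to the callback, collects the return values, and combines them according to certain rules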
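
The four methods can be compared side by side on a tiny table. This is only a sketch: `toy_df` is hypothetical, and the callbacks are chosen to make the shape of each result easy to see.

```python
import pandas as pd

# Hypothetical toy data; only the column names mirror dose.csv
toy_df = pd.DataFrame({
    "Tmt": ["A", "A", "B", "B"],
    "Response1": [1.0, 3.0, 10.0, 20.0],
})
grouped = toy_df.groupby("Tmt")["Response1"]

# agg(): one value per group
agg_out = grouped.agg("mean")
# transform(): one value per input row, aligned with the source index
transform_out = grouped.transform(lambda s: s - s.mean())
# filter(): keeps or drops whole groups
filter_out = toy_df.groupby("Tmt").filter(lambda df: df["Response1"].max() > 5)
# apply(): arbitrary callback, results combined per group
apply_out = grouped.apply(lambda s: s.max() - s.min())

print(agg_out, transform_out, filter_out, apply_out, sep="\n\n")
```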

#### `agg()`: aggregation

In [47]:
dose_df.head()

   Dose  Response1  Response2 Tmt  Age Gender
0    50        9.9       10.0   C  60s      F
1    15      0.002      0.004   D  60s      F
2    25       0.63        0.8   C  50s      M
3    25        1.4        1.6   C  60s      F
4    15       0.01       0.02   C  60s      F


In [41]:
tmt_group

<pandas.core.groupby.DataFrameGroupBy object at 0x0A321FD0>

In [42]:
for key, df in tmt_group:
    print("key=", key, ", shape =", df.shape)

key= A , shape = (65, 6)
key= B , shape = (65, 6)
key= C , shape = (65, 6)
key= D , shape = (65, 6)


In [44]:
tmt_group.get_group("A").head()

     Dose  Response1  Response2 Tmt  Age Gender
6     1.0        0.0        0.0   A  50s      F
10   15.0        5.2        5.2   A  60s      F
12    5.0        0.0      0.001   A  60s      F
17    5.0        0.0      0.003   A  50s      M
32  100.0        9.3       10.0   A  60s      F


Compute the mean of every column in each group:

In [45]:
agg_res1 = tmt_group.agg(np.mean)
agg_res1

     Dose  Response1  Response2
Tmt
A      34        6.7        6.9
B      34        5.6        5.5
C      34        4.0        4.1
D      34        3.3        3.2


Find the row with the largest Response1 in each group:

In [49]:
agg_res2 = tmt_group.agg(lambda df: df.loc[df.Response1.idxmax()])
agg_res2

      Dose  Response1  Response2  Age Gender
Tmt
A     80.0         11       10.0  60s      F
B    100.0         11       10.0  50s      M
C     60.0         10       11.0  50s      M
D     80.0         11        9.9  60s      F
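
The same "row with the largest Response1 per group" result can also be obtained without relying on how `agg()` passes data to its callback. The sketch below is a swapped-in alternative, not the notebook's method: it takes the per-group `idxmax()` labels and looks the rows up with `loc`. The `toy_df` frame is hypothetical.

```python
import pandas as pd

# Hypothetical toy data; only the column names mirror dose.csv
toy_df = pd.DataFrame({
    "Tmt": ["A", "A", "B", "B"],
    "Dose": [1, 15, 5, 25],
    "Response1": [0.0, 5.2, 9.9, 0.5],
})

# Row label of the largest Response1 within each Tmt group
best_idx = toy_df.groupby("Tmt")["Response1"].idxmax()

# Look up the complete rows and index them by group key
best_rows = toy_df.loc[best_idx].set_index("Tmt")
print(best_rows)
```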


#### `transform()`: transformation

In [51]:
# Subtract each column's group mean from every column
transform_res1 = tmt_group.transform(lambda s: s - s.mean())
# Subtract the group mean from one column only
transform_res2 = tmt_group.transform(
    lambda df: df.assign(Response1=df.Response1 - df.Response1.mean()))
%C transform_res1.head(5);; transform_res2.head(5)

    transform_res1.head(5)   
-----------------------------
   Dose  Response1  Response2
0    16        5.8        5.9
1   -19       -3.3       -3.2
2  -8.5       -3.4       -3.3
3  -8.5       -2.7       -2.6
4   -19         -4       -4.1

          transform_res2.head(5)         
-----------------------------------------
   Dose  Response1  Response2  Age Gender
0    50        5.8         10  60s      F
1    15       -3.3      0.004  60s      F
2    25       -3.4        0.8  50s      M
3    25       -2.7        1.6  60s      F
4    15         -4       0.02  60s      F
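
Centering a single column can also be written by selecting that column before calling `transform()`. The following is a sketch on hypothetical data (`toy_df`), using the string alias `"mean"` to broadcast each group's mean back to its rows.

```python
import pandas as pd

# Hypothetical toy data; only the column names mirror dose.csv
toy_df = pd.DataFrame({
    "Tmt": ["A", "A", "B", "B"],
    "Dose": [1, 15, 5, 25],
    "Response1": [1.0, 3.0, 10.0, 20.0],
})

# Each row receives the mean Response1 of its own group
group_mean = toy_df.groupby("Tmt")["Response1"].transform("mean")

# Replace Response1 with its centered version; other columns stay untouched
centered = toy_df.assign(Response1=toy_df["Response1"] - group_mean)
print(centered)
```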


#### `filter()`: filtering

In [52]:
# Keep only the groups whose maximum Response1 is less than 11
print(tmt_group.filter(lambda df: df.Response1.max() < 11).head())

   Dose  Response1  Response2 Tmt  Age Gender
0    50        9.9         10   C  60s      F
1    15      0.002      0.004   D  60s      F
2    25       0.63        0.8   C  50s      M
3    25        1.4        1.6   C  60s      F
4    15       0.01       0.02   C  60s      F
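
`filter()` decides per group, not per row: every row of a group is kept or dropped together. A small sketch with hypothetical values (`toy_df`), where group D contains a Response1 of 11 and is therefore removed entirely:

```python
import pandas as pd

# Hypothetical toy data; only the column names mirror dose.csv
toy_df = pd.DataFrame({
    "Tmt": ["C", "D", "C", "C", "D"],
    "Response1": [9.9, 11.0, 0.63, 1.4, 0.01],
})

# The callback receives one group's DataFrame and returns a single boolean
kept = toy_df.groupby("Tmt").filter(lambda df: df["Response1"].max() < 11)
print(kept)
```

The surviving rows keep their original labels, so the result can be aligned with the source data afterwards.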


#### `apply()`: general application

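The callback's return value can take several forms:

* A Series whose index is identical for every group
* A Series whose index differs from group to group; the row index of the result becomes two-level

* A DataFrame that keeps the original structure
* A DataFrame whose index has changed; the row index of the result becomes two-level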

The maximum and the mean of each column:

In [54]:
%C 4 tmt_group.apply(pd.DataFrame.max);; tmt_group.apply(pd.DataFrame.mean)

       tmt_group.apply(pd.DataFrame.max)       
-----------------------------------------------
     Dose  Response1  Response2 Tmt  Age Gender
Tmt                                            
A   1e+02         11         11   A  60s      M
B   1e+02         11         10   B  60s      M
C   1e+02         10         11   C  60s      M
D   1e+02         11        9.9   D  60s      M

tmt_group.apply(pd.DataFrame.mean)
----------------------------------
     Dose  Response1  Response2   
Tmt                               
A      34        6.7        6.9   
B      34        5.6        5.5   
C      34          4        4.1   
D      34        3.3        3.2   


Draw two random values from the Response1 column of each group:

In [73]:
tmt_group.get_group("A").head()

     Dose  Response1  Response2 Tmt  Age Gender
6     1.0        0.0        0.0   A  50s      F
10   15.0        5.2        5.2   A  60s      F
12    5.0        0.0      0.001   A  60s      F
17    5.0        0.0      0.003   A  50s      M
32  100.0        9.3       10.0   A  60s      F


In [62]:
np.random.seed(42)
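# A Series with a multi-level index that keeps the row labels of the source data
sample_res1 = tmt_group.apply(lambda df: df.Response1.sample(2))
# Drop the source row labels so that every group returns the same index
sample_res2 = tmt_group.apply(
    lambda df: df.Response1.sample(2).reset_index(drop=True))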
%C 4 sample_res1;; sample_res2

          sample_res1          
-------------------------------
Tmt                            
A    60    0.001               
     194      10               
B    57       10               
     89      4.2               
C    243       0               
     72      9.6               
D    5     0.007               
     219       0               
Name: Response1, dtype: float64

    sample_res2     
--------------------
Response1    0     1
Tmt                 
A           10   9.8
B           11  0.35
C            1   2.9
D          9.5 0.004


In [74]:
group = tmt_group[["Response1", "Response2"]]
apply_res1 = group.apply(lambda df: df - df.mean())
apply_res2 = group.apply(lambda df: (df - df.mean())[:])

%C 4 apply_res1.head();; apply_res2.head()

   apply_res1.head()   
-----------------------
   Response1  Response2
0        5.8        5.9
1       -3.3       -3.2
2       -3.4       -3.3
3       -2.7       -2.6
4         -4       -4.1

     apply_res2.head()      
----------------------------
        Response1  Response2
Tmt                         
A   6        -6.7       -6.9
    10       -1.5       -1.7
    12       -6.7       -6.9
    17       -6.7       -6.9
    32        2.6        3.2


In [69]:
# Draw two random rows from each group whose mean Response1 is at least 5
print(tmt_group.apply(lambda df: None if df.Response1.mean() < 5 else df.sample(2)))

         Dose  Response1  Response2 Tmt  Age Gender
Tmt                                                
A   17      5          0      0.003   A  50s      M
    234    10       0.47       0.69   A  50s      F
B   88  1e+02        9.7         10   B  60s      F
    110    30        9.9        8.2   B  40s      F


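`GroupBy` automatically wraps some commonly used DataFrame methods with `apply()`. Calling such a method on a `GroupBy` object is equivalent to passing it to `apply()` as the callback.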

In [72]:
%C 4 tmt_group.mean();; tmt_group.quantile(q=0.75)

        tmt_group.mean()       
-------------------------------
     Dose  Response1  Response2
Tmt                            
A      34        6.7        6.9
B      34        5.6        5.5
C      34          4        4.1
D      34        3.3        3.2

   tmt_group.quantile(q=0.75)  
-------------------------------
     Dose  Response1  Response2
Tmt                            
A      50         10         10
B      50        9.8         10
C      50        9.6        9.6
D      50        8.9        8.4
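
As a rough check of this equivalence, the sketch below computes group means three ways on hypothetical data (`toy_df`): through the built-in method, through `agg()`, and through an explicit `apply()` callback. It only compares the resulting values and makes no claim about how pandas implements the built-in methods internally.

```python
import pandas as pd

# Hypothetical toy data; only the column names mirror dose.csv
toy_df = pd.DataFrame({
    "Tmt": ["A", "A", "B", "B"],
    "Dose": [1, 15, 5, 25],
    "Response1": [1.0, 3.0, 10.0, 20.0],
})
grouped = toy_df.groupby("Tmt")[["Dose", "Response1"]]

via_method = grouped.mean()                      # built-in GroupBy method
via_agg = grouped.agg("mean")                    # same reduction through agg()
via_apply = grouped.apply(lambda df: df.mean())  # explicit callback

print(via_method)
```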
