# GroupBy
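Any groupby operation involves one or more of the following steps on the original object:
- **Splitting** the object into groups
- **Applying** a function to each group
- **Combining** the results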

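In many situations, we split the data into sets and apply some function to each subset. In the apply step, we can perform the following operations:
- **Aggregation** - computing a summary statistic
- **Transformation** - performing a group-specific operation
- **Filtration** - discarding data under some condition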
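
The three steps can also be written out by hand, which may help show what `groupby` does in a single call. The sketch below uses a small made-up subset of the team data; the names `pieces`, `means`, `manual`, and `builtin` are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({
    'Team': ['Riders', 'Riders', 'Devils', 'Devils'],
    'Points': [876, 789, 863, 673],
})

# Split: one sub-frame per distinct Team value.
pieces = {team: sub for team, sub in df.groupby('Team')}

# Apply: reduce each piece to a single number.
means = {team: sub['Points'].mean() for team, sub in pieces.items()}

# Combine: assemble the per-group results into one Series.
manual = pd.Series(means, name='Points').rename_axis('Team')

# The usual one-liner performs all three steps at once.
builtin = df.groupby('Team')['Points'].mean()

print(manual)
print(builtin)
```

Both printed Series should hold the same per-team means; in practice the one-liner is the form to use.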

In [1]:
import pandas as pd

ipl_data = {'Team':['Riders', 'Riders', 'Devils', 'Devils', 'Kings',
         'kings', 'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'],
         'Rank': [1, 2, 2, 3, 3,4 ,1 ,1,2 , 4,1,2],
         'Year': [2014,2015,2014,2015,2014,2015,2016,2017,2016,2014,2015,2017],
         'Points':[876,789,863,673,741,812,756,788,694,701,804,690]}
df = pd.DataFrame(ipl_data)
print(df)

    Points  Rank    Team  Year
0      876     1  Riders  2014
1      789     2  Riders  2015
2      863     2  Devils  2014
3      673     3  Devils  2015
4      741     3   Kings  2014
5      812     4   kings  2015
6      756     1   Kings  2016
7      788     1   Kings  2017
8      694     2  Riders  2016
9      701     4  Royals  2014
10     804     1  Royals  2015
11     690     2  Riders  2017


### 1. Splitting data into groups
A pandas object can be split into groups on any of its keys. There are several ways to split an object:
- obj.groupby('key')
- obj.groupby(['key1','key2'])
- obj.groupby(key,axis=1)

In [6]:
import pandas as pd

ipl_data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings',
         'kings', 'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'],
         'Rank': [1, 2, 2, 3, 3,4 ,1 ,1,2 , 4,1,2],
         'Year': [2014,2015,2014,2015,2014,2015,2016,2017,2016,2014,2015,2017],
         'Points':[876,789,863,673,741,812,756,788,694,701,804,690]}
df = pd.DataFrame(ipl_data)
print(df, '\n')
print(df.groupby('Rank'), '\n')
print(df.groupby('Rank').groups, '\n')
print(df.groupby('Team').groups)

    Points  Rank    Team  Year
0      876     1  Riders  2014
1      789     2  Riders  2015
2      863     2  Devils  2014
3      673     3  Devils  2015
4      741     3   Kings  2014
5      812     4   kings  2015
6      756     1   Kings  2016
7      788     1   Kings  2017
8      694     2  Riders  2016
9      701     4  Royals  2014
10     804     1  Royals  2015
11     690     2  Riders  2017 

<pandas.core.groupby.DataFrameGroupBy object at 0x06EE2A10> 

{1: [0L, 6L, 7L, 10L], 2: [1L, 2L, 8L, 11L], 3: [3L, 4L], 4: [5L, 9L]} 

{'Kings': [4L, 6L, 7L], 'kings': [5L], 'Riders': [0L, 1L, 8L, 11L], 'Royals': [9L, 10L], 'Devils': [2L, 3L]}
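
Group keys are compared exactly, which is why `'Kings'` and `'kings'` appear as separate groups above. If the lowercase spelling were unintended (an assumption, since the data may be deliberate), one possible approach is to pass a normalized Series as the key instead of a column name. A minimal sketch on a made-up subset:

```python
import pandas as pd

df = pd.DataFrame({
    'Team': ['Kings', 'kings', 'Kings', 'Riders'],
    'Points': [741, 812, 756, 876],
})

# Grouping by the raw column keeps 'Kings' and 'kings' apart.
raw_sizes = df.groupby('Team').size()
print(raw_sizes)

# A Series aligned on the same index can serve as the key;
# here the team names are title-cased before grouping.
normalized = df['Team'].str.title()
sizes = df.groupby(normalized).size()
print(sizes)
```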


Grouping by multiple columns

In [7]:
import pandas as pd
ipl_data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings',
         'kings', 'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'],
         'Rank': [1, 2, 2, 3, 3,4 ,1 ,1,2 , 4,1,2],
         'Year': [2014,2015,2014,2015,2014,2015,2016,2017,2016,2014,2015,2017],
         'Points':[876,789,863,673,741,812,756,788,694,701,804,690]}
df = pd.DataFrame(ipl_data)
print(df)
print(df.groupby(['Team', 'Year']).groups)

    Points  Rank    Team  Year
0      876     1  Riders  2014
1      789     2  Riders  2015
2      863     2  Devils  2014
3      673     3  Devils  2015
4      741     3   Kings  2014
5      812     4   kings  2015
6      756     1   Kings  2016
7      788     1   Kings  2017
8      694     2  Riders  2016
9      701     4  Royals  2014
10     804     1  Royals  2015
11     690     2  Riders  2017
{('Royals', 2014L): [9L], ('Kings', 2014L): [4L], ('kings', 2015L): [5L], ('Riders', 2014L): [0L], ('Riders', 2015L): [1L], ('Kings', 2016L): [6L], ('Riders', 2016L): [8L], ('Riders', 2017L): [11L], ('Devils', 2014L): [2L], ('Devils', 2015L): [3L], ('Royals', 2015L): [10L], ('Kings', 2017L): [7L]}


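### 2. Iterating through groups
Once we have the groupby object, we can iterate over it in a way similar to `itertools.groupby`: each step yields a group name and the rows belonging to that group.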

In [2]:
import pandas as pd

ipl_data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings',
         'kings', 'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'],
         'Rank': [1, 2, 2, 3, 3,4 ,1 ,1,2 , 4,1,2],
         'Year': [2014,2015,2014,2015,2014,2015,2016,2017,2016,2014,2015,2017],
         'Points':[876,789,863,673,741,812,756,788,694,701,804,690]}
df = pd.DataFrame(ipl_data)

grouped = df.groupby('Year')
for name,group in grouped:
    print(name)
    print(group)

2014
   Points  Rank    Team  Year
0     876     1  Riders  2014
2     863     2  Devils  2014
4     741     3   Kings  2014
9     701     4  Royals  2014
2015
    Points  Rank    Team  Year
1      789     2  Riders  2015
3      673     3  Devils  2015
5      812     4   kings  2015
10     804     1  Royals  2015
2016
   Points  Rank    Team  Year
6     756     1   Kings  2016
8     694     2  Riders  2016
2017
    Points  Rank    Team  Year
7      788     1   Kings  2017
11     690     2  Riders  2017
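
Because each iteration yields a `(key, sub-DataFrame)` pair, the loop can also collect a small per-group summary instead of printing. The sketch below assumes a made-up four-row frame, and the `summary` dictionary is only an illustration:

```python
import pandas as pd

df = pd.DataFrame({
    'Team': ['Riders', 'Devils', 'Riders', 'Kings'],
    'Year': [2015, 2014, 2014, 2015],
    'Points': [789, 863, 876, 812],
})

# Keys come out in sorted order by default, regardless of row order.
summary = {}
for year, group in df.groupby('Year'):
    # Store (number of rows, highest score) for each year.
    summary[year] = (len(group), group['Points'].max())

print(summary)
```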


### 3. Selecting a group
The `get_group()` method selects a single group by its key.

In [3]:
import pandas as pd

ipl_data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings',
         'kings', 'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'],
         'Rank': [1, 2, 2, 3, 3,4 ,1 ,1,2 , 4,1,2],
         'Year': [2014,2015,2014,2015,2014,2015,2016,2017,2016,2014,2015,2017],
         'Points':[876,789,863,673,741,812,756,788,694,701,804,690]}
df = pd.DataFrame(ipl_data)

grouped = df.groupby('Year')
print(grouped.groups)
print(grouped.get_group(2014))

{2016: [6L, 8L], 2017: [7L, 11L], 2014: [0L, 2L, 4L, 9L], 2015: [1L, 3L, 5L, 10L]}
   Points  Rank    Team  Year
0     876     1  Riders  2014
2     863     2  Devils  2014
4     741     3   Kings  2014
9     701     4  Royals  2014
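
For a single key, `get_group()` should return the same rows as a plain boolean filter on the grouping column. A small sketch on made-up data, for comparison:

```python
import pandas as pd

df = pd.DataFrame({
    'Team': ['Riders', 'Riders', 'Devils', 'Devils'],
    'Year': [2014, 2015, 2014, 2015],
    'Points': [876, 789, 863, 673],
})

grouped = df.groupby('Year')

# Rows whose Year equals 2014, looked up through the groupby object.
by_group = grouped.get_group(2014)

# The same rows selected with a boolean mask.
by_mask = df[df['Year'] == 2014]

print(by_group)
print(by_mask)
```

`get_group()` is convenient when the groupby object already exists; for a one-off selection the mask is usually simpler.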


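### 4. Aggregation
An aggregation function returns a single aggregated value for each group. Once the groupby object has been created, several aggregation operations can be performed on the grouped data.
An obvious one is aggregation via the `aggregate` method or its equivalent, `agg`: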

In [7]:
import pandas as pd
import numpy as np

ipl_data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings',
         'kings', 'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'],
         'Rank': [1, 2, 2, 3, 3,4 ,1 ,1,2 , 4,1,2],
         'Year': [2014,2015,2014,2015,2014,2015,2016,2017,2016,2014,2015,2017],
         'Points':[876,789,863,673,741,812,756,788,694,701,804,690]}
df = pd.DataFrame(ipl_data)

grouped = df.groupby('Year')
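# Mean of each group
print(grouped['Points'].agg(np.mean))

grouped1 = df.groupby('Team')
# Number of elements in each group
print(grouped1.agg(np.size))

# On a grouped Series you can also pass a list of functions to aggregate,
# which produces a DataFrame as output (older pandas versions also accepted a dict)
print(grouped1['Points'].agg([np.sum, np.mean, np.std]))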

Year
2014    795.25
2015    769.50
2016    725.00
2017    739.00
Name: Points, dtype: float64
        Points  Rank  Year
Team                      
Devils       2     2     2
Kings        3     3     3
Riders       4     4     4
Royals       2     2     2
kings        1     1     1
         sum        mean         std
Team                                
Devils  1536  768.000000  134.350288
Kings   2285  761.666667   24.006943
Riders  3049  762.250000   88.567771
Royals  1505  752.500000   72.831998
kings    812  812.000000         NaN
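
When different columns need different functions and readable output names, named aggregation is one option. The sketch below assumes pandas 0.25 or later; the output column names `total_points`, `mean_points`, and `best_rank` are made up for illustration. String function names such as `'sum'` are used because recent pandas versions may warn when NumPy reducers are passed.

```python
import pandas as pd

df = pd.DataFrame({
    'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings'],
    'Rank': [1, 2, 2, 3, 3],
    'Points': [876, 789, 863, 673, 741],
})

# Named aggregation: output_column=(input_column, function).
summary = df.groupby('Team').agg(
    total_points=('Points', 'sum'),
    mean_points=('Points', 'mean'),
    best_rank=('Rank', 'min'),
)
print(summary)
```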


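### 5. Transformation
A transformation on a group or a column returns an object that is indexed the same as the object being grouped. The transform function must therefore return a result that is the same size as the group chunk.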

In [8]:
import pandas as pd
import numpy as np

ipl_data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings',
         'kings', 'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'],
         'Rank': [1, 2, 2, 3, 3,4 ,1 ,1,2 , 4,1,2],
         'Year': [2014,2015,2014,2015,2014,2015,2016,2017,2016,2014,2015,2017],
         'Points':[876,789,863,673,741,812,756,788,694,701,804,690]}
df = pd.DataFrame(ipl_data)

grouped = df.groupby('Team')
score = lambda x: (x - x.mean()) / x.std()*10
print(grouped.transform(score))

       Points       Rank       Year
0   12.843272 -15.000000 -11.618950
1    3.020286   5.000000  -3.872983
2    7.071068  -7.071068  -7.071068
3   -7.071068   7.071068   7.071068
4   -8.608621  11.547005 -10.910895
5         NaN        NaN        NaN
6   -2.360428  -5.773503   2.182179
7   10.969049  -5.773503   8.728716
8   -7.705963   5.000000   3.872983
9   -7.071068   7.071068  -7.071068
10   7.071068  -7.071068   7.071068
11  -8.157595   5.000000  11.618950
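
Two points in the output above may be worth spelling out: the result keeps the original row index, and the single-row `kings` group comes out as `NaN`. The sketch below illustrates both on a made-up four-row frame; `group_mean` and `zscore` are illustrative names.

```python
import pandas as pd

df = pd.DataFrame({
    'Team': ['Riders', 'Riders', 'Kings', 'kings'],
    'Points': [876, 789, 741, 812],
})

grouped = df.groupby('Team')['Points']

# transform broadcasts each group's result back onto the original rows,
# so the output has the same index as df.
group_mean = grouped.transform('mean')
print(group_mean)

# A one-row group has no sample standard deviation (std() uses ddof=1),
# so dividing by it gives NaN for that row.
zscore = grouped.transform(lambda x: (x - x.mean()) / x.std())
print(zscore)
```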


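### 6. Filtration
Filtration filters the data on a defined criterion and returns a subset of the data. The `filter()` function is used to filter the data.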

In [9]:
import pandas as pd
import numpy as np
ipl_data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings',
         'kings', 'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'],
         'Rank': [1, 2, 2, 3, 3,4 ,1 ,1,2 , 4,1,2],
         'Year': [2014,2015,2014,2015,2014,2015,2016,2017,2016,2014,2015,2017],
         'Points':[876,789,863,673,741,812,756,788,694,701,804,690]}
df = pd.DataFrame(ipl_data)
print(df.groupby('Team').filter(lambda x: len(x) >= 3))

    Points  Rank    Team  Year
0      876     1  Riders  2014
1      789     2  Riders  2015
4      741     3   Kings  2014
6      756     1   Kings  2016
7      788     1   Kings  2017
8      694     2  Riders  2016
11     690     2  Riders  2017


In the filter condition above, we ask for the teams that have participated in the IPL three or more times.
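
The same selection can be expressed as a boolean mask built with `transform('size')`, which may make it clearer that `filter()` keeps or drops whole groups rather than individual rows. A sketch on made-up data, with `kept`, `mask`, and `same` as illustrative names:

```python
import pandas as pd

df = pd.DataFrame({
    'Team': ['Riders', 'Riders', 'Devils', 'Riders', 'Kings', 'Devils'],
    'Points': [876, 789, 863, 694, 741, 673],
})

# filter() calls the function once per group and keeps every row of the
# groups for which it returns True.
kept = df.groupby('Team').filter(lambda g: len(g) >= 3)

# Equivalent mask: attach each row's group size, then compare.
mask = df.groupby('Team')['Points'].transform('size') >= 3
same = df[mask]

print(kept)
print(same)
```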

# Aggregation in pandas
### Applying aggregation to a whole DataFrame

In [1]:
import pandas as pd
import numpy as np

df = pd.DataFrame(np.arange(1,41).reshape(10,4),
index=pd.date_range('1/1/2000',periods=10),
columns=list('ABCD'))
print(df)

r = df.rolling(window=3,min_periods=1)
# Apply multiple functions to multiple columns of the DataFrame
print(r[['A', 'B']].aggregate([np.sum, np.mean]), '\n')
# Apply different functions to different columns of the DataFrame
print(r.aggregate({'A': np.sum, 'B': np.mean}))

             A   B   C   D
2000-01-01   1   2   3   4
2000-01-02   5   6   7   8
2000-01-03   9  10  11  12
2000-01-04  13  14  15  16
2000-01-05  17  18  19  20
2000-01-06  21  22  23  24
2000-01-07  25  26  27  28
2000-01-08  29  30  31  32
2000-01-09  33  34  35  36
2000-01-10  37  38  39  40
               A            B      
             sum  mean    sum  mean
2000-01-01   1.0   1.0    2.0   2.0
2000-01-02   6.0   3.0    8.0   4.0
2000-01-03  15.0   5.0   18.0   6.0
2000-01-04  27.0   9.0   30.0  10.0
2000-01-05  39.0  13.0   42.0  14.0
2000-01-06  51.0  17.0   54.0  18.0
2000-01-07  63.0  21.0   66.0  22.0
2000-01-08  75.0  25.0   78.0  26.0
2000-01-09  87.0  29.0   90.0  30.0
2000-01-10  99.0  33.0  102.0  34.0 

               A     B
2000-01-01   1.0   2.0
2000-01-02   6.0   4.0
2000-01-03  15.0   6.0
2000-01-04  27.0  10.0
2000-01-05  39.0  14.0
2000-01-06  51.0  18.0
2000-01-07  63.0  22.0
2000-01-08  75.0  26.0
2000-01-09  87.0  30.0
2000-01-10  99.0  34.0
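
To see where the first rows of the rolling output come from, the window sums can be recomputed by hand. The sketch below uses the first five values of column `A`; with `window=3` and `min_periods=1`, each result covers the current row and up to two rows before it, so the first two windows are shorter than three rows. The name `manual` is illustrative.

```python
import pandas as pd

s = pd.Series([1, 5, 9, 13, 17],
              index=pd.date_range('2000-01-01', periods=5))

# Rolling sum as computed by pandas.
rolled = s.rolling(window=3, min_periods=1).sum()

# Manual recomputation: slice the current row plus up to two earlier rows.
manual = [float(s.iloc[max(0, i - 2): i + 1].sum()) for i in range(len(s))]

print(rolled.tolist())
print(manual)
```

Without `min_periods=1`, the first two rows would be `NaN`, because a full three-row window is not yet available there.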
