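agg is usually used together with groupby. Like apply, the function passed to agg receives each group's whole column (a Series). It should return a single value, because aggregation maps each group to exactly one result.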

In [21]:
import pandas as pd
import numpy as np

data = pd.DataFrame({'USRID':[1, 1, 1, 1, 1, 2, 2, 2, 3, 3, 4, 5, 6], 'day':[10, 10, 5, 6, 3, 2, 1, 2, 1, 18, 23, 12, 12]})

print(data)

    USRID  day
0       1   10
1       1   10
2       1    5
3       1    6
4       1    3
5       2    2
6       2    1
7       2    2
8       3    1
9       3   18
10      4   23
11      5   12
12      6   12


Group by USRID before computing the usual statistics (mean, max, min). First, check the size of each group:

In [2]:
data_user = data.groupby('USRID')
print(data_user.size())

USRID
1    5
2    3
3    2
4    1
5    1
6    1
dtype: int64


For each group, compute the mean, maximum, and minimum of day, and name the new columns

In [19]:
# Inspect what gets passed to the agg function
print('\n----agg----\n')
data.groupby('USRID')['day'].agg(lambda x: print(x))
# The function receives the whole 'day' column of each group
# The same holds for apply
print('\n----apply----\n')
data.groupby('USRID')['day'].apply(lambda x: print(x))


----agg----

0    10
1    10
2     5
3     6
4     3
Name: day, dtype: int64
5    2
6    1
7    2
Name: day, dtype: int64
8     1
9    18
Name: day, dtype: int64
10    23
Name: day, dtype: int64
11    12
Name: day, dtype: int64
12    12
Name: day, dtype: int64

----apply----

0    10
1    10
2     5
3     6
4     3
Name: 1, dtype: int64
5    2
6    1
7    2
Name: 2, dtype: int64
8     1
9    18
Name: 3, dtype: int64
10    23
Name: 4, dtype: int64
11    12
Name: 5, dtype: int64
12    12
Name: 6, dtype: int64


USRID
1    None
2    None
3    None
4    None
5    None
6    None
Name: day, dtype: object
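
The printout above shows that agg and apply receive the same per-group Series. Where they differ is in what the function may return. The following is a minimal sketch on a smaller, made-up frame (the values are illustrative, not the data above): agg reduces each group to one value, while apply may also return a Series per group.

```python
import pandas as pd

# Small illustrative frame (assumed values, same column names as above)
data = pd.DataFrame({'USRID': [1, 1, 1, 2, 2, 3],
                     'day': [10, 5, 3, 2, 1, 18]})
grouped = data.groupby('USRID')['day']

# agg: the function returns one scalar per group,
# so the result has one row per USRID
agg_result = grouped.agg(lambda x: x.max() - x.min())

# apply: the function may return a whole Series per group,
# so the result here has one row per original row
apply_result = grouped.apply(lambda x: x - x.mean())

print(agg_result)
print(apply_result)
```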

In [5]:
# Collect each user's day values into a list
data.groupby('USRID')['day'].agg(lambda x: list(x))
# Useful as a starting point for per-user time-interval features

USRID
1    [10, 10, 5, 6, 3]
2            [2, 1, 2]
3              [1, 18]
4                 [23]
5                 [12]
6                 [12]
dtype: object
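
As one possible reading of "time-interval feature", the sketch below computes the mean gap between a user's consecutive day values after sorting them. The helper name mean_gap and the choice to sort first are assumptions made for illustration; the notes above do not specify how the intervals should be defined.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({'USRID': [1, 1, 1, 1, 1, 2, 2, 2, 3, 3, 4, 5, 6],
                     'day': [10, 10, 5, 6, 3, 2, 1, 2, 1, 18, 23, 12, 12]})

def mean_gap(days):
    """Hypothetical helper: mean gap between consecutive sorted day values."""
    gaps = np.diff(np.sort(days.to_numpy()))
    # A user with a single record has no gaps, so return NaN
    return gaps.mean() if gaps.size else np.nan

interval_feature = data.groupby('USRID')['day'].agg(mean_gap)
print(interval_feature)
```

Because mean_gap returns a single number per group, it fits agg directly and no intermediate list column is needed.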

In [8]:
# For example, compute the mean, standard deviation, and max of day
data.groupby('USRID')['day'].agg([np.mean, np.std, np.max])

            mean        std  amax
USRID
1       6.800000   3.114482    10
2       1.666667   0.577350     2
3       9.500000  12.020815    18
4      23.000000        NaN    23
5      12.000000        NaN    12
6      12.000000        NaN    12


To rename a column

In [10]:
data.groupby('USRID')['day'].agg([np.mean, np.std, np.max]).reset_index().rename(columns={'mean': 'user_mean'})

   USRID  user_mean        std  amax
0      1   6.800000   3.114482    10
1      2   1.666667   0.577350     2
2      3   9.500000  12.020815    18
3      4  23.000000        NaN    23
4      5  12.000000        NaN    12
5      6  12.000000        NaN    12
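
An alternative to renaming afterwards is pandas' named aggregation (available since pandas 0.25), where each keyword becomes the output column name. The sketch below also uses string function names instead of NumPy functions. The column names user_std and user_max are chosen here for illustration.

```python
import pandas as pd

data = pd.DataFrame({'USRID': [1, 1, 1, 1, 1, 2, 2, 2, 3, 3, 4, 5, 6],
                     'day': [10, 10, 5, 6, 3, 2, 1, 2, 1, 18, 23, 12, 12]})

# Named aggregation: keyword = output column name, value = aggregation
stats = (data.groupby('USRID')['day']
             .agg(user_mean='mean', user_std='std', user_max='max')
             .reset_index())
print(stats)
```

This names every column in one step, so no separate rename call is needed.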


In [16]:
# Custom mean function: receives one group's 'day' Series, returns a scalar
def custom_mean(x):
    return np.mean(x)

data.groupby('USRID')['day'].agg(custom_mean)


USRID
1     6.800000
2     1.666667
3     9.500000
4    23.000000
5    12.000000
6    12.000000
Name: day, dtype: float64
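
A custom aggregation function can also take extra parameters. agg forwards additional positional and keyword arguments to the function, so the parameter can be supplied at the call site. The sketch below is an illustration only: the helper custom_quantile and its parameter q are hypothetical names, not part of the notes above.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({'USRID': [1, 1, 1, 1, 1, 2, 2, 2, 3, 3, 4, 5, 6],
                     'day': [10, 10, 5, 6, 3, 2, 1, 2, 1, 18, 23, 12, 12]})

def custom_quantile(x, q):
    """Hypothetical helper: q-th quantile of one group's values."""
    return np.quantile(x, q)

# Extra keyword arguments given to agg are passed on to the function
median_day = data.groupby('USRID')['day'].agg(custom_quantile, q=0.5)
max_day = data.groupby('USRID')['day'].agg(custom_quantile, q=1.0)
print(median_day)
```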