# Advanced pandas
## 12.1 Categorical Data
### 12.1.1 Background and Motivation

unique and value_counts extract the distinct values from an array and compute how often each one occurs.

In [3]:
import numpy as np
import pandas as pd
values=pd.Series(['apple','orange','apple','apple']*2)
values

0     apple
1    orange
2     apple
3     apple
4     apple
5    orange
6     apple
7     apple
dtype: object

In [4]:
pd.unique(values) # extract the unique values

array(['apple', 'orange'], dtype=object)

In [6]:
pd.value_counts(values) # count occurrences of each distinct value; values.value_counts() is the method form

apple     6
orange    2
dtype: int64

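In data warehousing, a best practice is to use so-called dimension tables, which hold the distinct values, and to store the primary observations as integer keys that reference the dimension table.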

In [8]:
values=pd.Series([0,1,0,0]*2) # integer keys into the dimension table, which saves space
dim=pd.Series(['apple','orange']) # the dimension table: apple is 0, orange is 1
values

0    0
1    1
2    0
3    0
4    0
5    1
6    0
7    0
dtype: int64

Use the take method to restore the original Series of strings.

In [9]:
dim.take(values)

0     apple
1    orange
0     apple
0     apple
0     apple
1    orange
0     apple
0     apple
dtype: object
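
The integer keys and the dimension table above were typed by hand. As a minimal sketch, assuming the raw strings are the starting point, pd.factorize can derive both pieces, and take then restores the original values. The variable names codes, uniques and restored are illustrative only.

```python
import pandas as pd

values = pd.Series(['apple', 'orange', 'apple', 'apple'] * 2)

# factorize returns the integer codes plus the distinct values,
# listed in order of first appearance (the "dimension table")
codes, uniques = pd.factorize(values)

# take looks up each code in the dimension table, restoring the strings
restored = pd.Series(uniques.take(codes))

print(codes)
print(list(uniques))
print(restored.tolist() == values.tolist())
```

This is the same encode and decode round trip that the Categorical type performs internally with its codes and categories.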

### 12.1.2 Categorical Type in pandas

In [10]:
fruits = ['apple', 'orange', 'apple', 'apple'] * 2
N = len(fruits)
df = pd.DataFrame({'fruit': fruits,
                   'basket_id': np.arange(N),
                   'count': np.random.randint(3, 15, size=N), # random integers in [3, 15)
                   'weight': np.random.uniform(0, 4, size=N)}, # uniform random floats in [0, 4)
                  columns=['basket_id', 'fruit', 'count', 'weight']) # column order
df

   basket_id   fruit  count    weight
0          0   apple     14  3.593407
1          1  orange      7  0.777306
2          2   apple      3  2.837937
3          3   apple     12  2.756411
4          4   apple      6  1.323695
5          5  orange      5  2.569420
6          6   apple     11  0.223134
7          7   apple     10  0.009667


df['fruit'] is an array of Python string objects. It can be converted to categorical by calling astype('category').

In [12]:
fruit_cat = df['fruit'].astype('category')
fruit_cat # its values are not a NumPy array but an instance of pandas.Categorical

0     apple
1    orange
2     apple
3     apple
4     apple
5    orange
6     apple
7     apple
Name: fruit, dtype: category
Categories (2, object): [apple, orange]

In [13]:
c = fruit_cat.values
type(c)

pandas.core.arrays.categorical.Categorical

A Categorical object has categories and codes attributes.

In [14]:
c.categories

Index(['apple', 'orange'], dtype='object')

In [15]:
c.codes

array([0, 1, 0, 0, 0, 1, 0, 0], dtype=int8)

In [16]:
df['fruit'] = df['fruit'].astype('category')
df.fruit

0     apple
1    orange
2     apple
3     apple
4     apple
5    orange
6     apple
7     apple
Name: fruit, dtype: category
Categories (2, object): [apple, orange]

In [17]:
my_categories = pd.Categorical(['foo', 'bar', 'baz', 'foo', 'bar'])
my_categories

[foo, bar, baz, foo, bar]
Categories (3, object): [bar, baz, foo]

If you have obtained categorically encoded data from another source, you can use the from_codes constructor.

In [18]:
categories = ['foo', 'bar', 'baz']
codes = [0, 1, 2, 0, 0, 1]
my_cats_2 = pd.Categorical.from_codes(codes, categories)
my_cats_2

[foo, bar, baz, foo, foo, bar]
Categories (3, object): [foo, bar, baz]

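Unless explicitly specified, a categorical conversion assumes no particular ordering of the categories, so the order of the categories array may differ from the order of the input data. When using from_codes or any of the other constructors, you can indicate that the categories have a meaningful order.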

In [19]:
ordered_cat = pd.Categorical.from_codes(codes, categories,
                                        ordered=True) # mark the categories as ordered
ordered_cat

[foo, bar, baz, foo, foo, bar]
Categories (3, object): [foo < bar < baz]

An unordered categorical instance can be made ordered with as_ordered.

In [20]:
my_cats_2.as_ordered()

[foo, bar, baz, foo, foo, bar]
Categories (3, object): [foo < bar < baz]
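
What the ordering buys you is easiest to see with categories whose order differs from alphabetical order. The following is a small sketch with made-up size labels (not data from this chapter): once a Categorical is ordered, min, max, sorting and comparisons follow the category order rather than string order.

```python
import pandas as pd

# Hypothetical labels; the categories argument fixes the order small < medium < large
sizes = pd.Categorical(['medium', 'small', 'large', 'small'],
                       categories=['small', 'medium', 'large'],
                       ordered=True)

smallest = sizes.min()   # uses category order, not alphabetical order
largest = sizes.max()

# Sorting a Series backed by an ordered Categorical follows the category order
sorted_sizes = pd.Series(sizes).sort_values().tolist()

# Comparison against a scalar that is one of the categories
below_large = (sizes < 'large').tolist()

print(smallest, largest)
print(sorted_sizes)
print(below_large)
```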

### 12.1.3 Computations with Categoricals

In [21]:
np.random.seed(12345)
draws = np.random.randn(1000)
draws[:5] # first 5 values

array([-0.20470766,  0.47894334, -0.51943872, -0.5557303 ,  1.96578057])

In [23]:
bins = pd.qcut(draws, 4) # quartile binning: each of the 4 bins holds 250 values
bins

[(-0.684, -0.0101], (-0.0101, 0.63], (-0.684, -0.0101], (-0.684, -0.0101], (0.63, 3.928], ..., (-0.0101, 0.63], (-0.684, -0.0101], (-2.9499999999999997, -0.684], (-0.0101, 0.63], (0.63, 3.928]]
Length: 1000
Categories (4, interval[float64]): [(-2.9499999999999997, -0.684] < (-0.684, -0.0101] < (-0.0101, 0.63] < (0.63, 3.928]]

In [26]:
pd.value_counts(bins)

(0.63, 3.928]                    250
(-0.0101, 0.63]                  250
(-0.684, -0.0101]                250
(-2.9499999999999997, -0.684]    250
dtype: int64

In [30]:
bins = pd.qcut(draws, 4, labels=['Q1', 'Q2', 'Q3', 'Q4']) # name the bins
bins

[Q2, Q3, Q2, Q2, Q4, ..., Q3, Q2, Q1, Q3, Q4]
Length: 1000
Categories (4, object): [Q1 < Q2 < Q3 < Q4]

In [31]:
bins.codes[:10] # integer codes giving the bin position of each value

array([1, 2, 1, 1, 3, 3, 2, 2, 3, 3], dtype=int8)

In [32]:
bins = pd.Series(bins, name='quartile') # wrap the bins in a Series named 'quartile'
results = (pd.Series(draws)
           .groupby(bins) # group by quartile label (Q1, Q2, Q3, Q4)
           .agg(['count', 'min', 'max'])
           .reset_index())
results

  quartile  count       min       max
0       Q1    250 -2.949343 -0.685484
1       Q2    250 -0.683066 -0.010115
2       Q3    250 -0.010032  0.628894
3       Q4    250  0.634238  3.927528


In [33]:
results['quartile'] # the quartile column keeps the original categorical information from bins

0    Q1
1    Q2
2    Q3
3    Q4
Name: quartile, dtype: category
Categories (4, object): [Q1 < Q2 < Q3 < Q4]

#### 12.1.3.1 Better performance with categoricals

In [35]:
N = 10000000
draws = pd.Series(np.random.randn(N))
labels = pd.Series(['foo', 'bar', 'baz', 'qux'] * (N // 4)) # repeat the four labels N // 4 times
categories = labels.astype('category') # convert labels to categorical

In [38]:
labels.memory_usage() # memory used, in bytes

80000080

In [39]:
categories.memory_usage() # memory used, in bytes

10000272

In [43]:
%time _ = labels.astype('category')

Wall time: 870 ms
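
For an object-dtype Series, memory_usage() without arguments counts only the stored references, not the Python strings they point to, so the gap shown above understates the saving. A minimal sketch of a fuller comparison, assuming a much smaller N than the one used above so that it runs quickly:

```python
import pandas as pd

N = 100_000  # assumption: far smaller than the 10,000,000 used above
labels = pd.Series(['foo', 'bar', 'baz', 'qux'] * (N // 4))
categories = labels.astype('category')

# deep=True also counts the memory held by the string values themselves
labels_bytes = labels.memory_usage(deep=True)
categories_bytes = categories.memory_usage(deep=True)

print(labels_bytes, categories_bytes)
print(categories.cat.codes.dtype)  # four categories fit in the smallest integer type
```

The exact byte counts depend on the pandas version and platform, so none are quoted here.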


### 12.1.4 Categorical Methods

In [44]:
s = pd.Series(['a', 'b', 'c', 'd'] * 2)
cat_s = s.astype('category') # convert s to categorical
cat_s

0    a
1    b
2    c
3    d
4    a
5    b
6    c
7    d
dtype: category
Categories (4, object): [a, b, c, d]

The special accessor attribute cat provides access to the categorical methods.

In [47]:
cat_s.cat.codes # the integer codes

0    0
1    1
2    2
3    3
4    0
5    1
6    2
7    3
dtype: int8

In [48]:
cat_s.cat.categories # the categories, stored as an Index

Index(['a', 'b', 'c', 'd'], dtype='object')

Suppose the actual set of categories for this data extends beyond the four values observed. The set_categories method changes the set of categories.

In [50]:
actual_categories = ['a', 'b', 'c', 'd', 'e']
cat_s2 = cat_s.cat.set_categories(actual_categories)
cat_s2 # 'e' is now a category but appears in no row, because the data does not contain it

0    a
1    b
2    c
3    d
4    a
5    b
6    c
7    d
dtype: category
Categories (5, object): [a, b, c, d, e]

The new categories are reflected in operations that use them; value_counts, for example, respects the new categories.

In [51]:
cat_s.value_counts()

d    2
c    2
b    2
a    2
dtype: int64

In [52]:
cat_s2.value_counts() # 'e' is listed, with a count of 0

d    2
c    2
b    2
a    2
e    0
dtype: int64

Use the remove_unused_categories method to trim categories that are not observed.

In [54]:
cat_s3=cat_s[cat_s.isin(['a','b'])]
# isin builds a boolean mask, so only the rows whose value is 'a' or 'b' are kept
cat_s3

0    a
1    b
4    a
5    b
dtype: category
Categories (4, object): [a, b, c, d]

In [55]:
cat_s3.cat.remove_unused_categories() # drop the categories that no longer appear

0    a
1    b
4    a
5    b
dtype: category
Categories (2, object): [a, b]
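
Besides set_categories and remove_unused_categories, the cat accessor offers a few related methods. The sketch below uses a small made-up Series to show add_categories, remove_categories and rename_categories; like the methods above, each returns a new object.

```python
import pandas as pd

s = pd.Series(['a', 'b', 'c', 'a'], dtype='category')

# Append a new category at the end; the data itself is unchanged
added = s.cat.add_categories(['d'])

# Remove a category; values that belonged to it become missing (NaN)
removed = s.cat.remove_categories(['c'])

# Rename categories with a dict; categories not listed keep their names
renamed = s.cat.rename_categories({'a': 'A'})

print(added.cat.categories.tolist())
print(removed.isna().sum())
print(renamed.tolist())
```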

#### 12.1.4.1 Creating dummy variables for modeling

In [57]:
cat_s = pd.Series(['a', 'b', 'c', 'd'] * 2, dtype='category')
cat_s

0    a
1    b
2    c
3    d
4    a
5    b
6    c
7    d
dtype: category
Categories (4, object): [a, b, c, d]

The pandas.get_dummies function converts one-dimensional categorical data into a DataFrame containing the dummy variables.

In [59]:
pd.get_dummies(cat_s)

   a  b  c  d
0  1  0  0  0
1  0  1  0  0
2  0  0  1  0
3  0  0  0  1
4  1  0  0  0
5  0  1  0  0
6  0  0  1  0
7  0  0  0  1
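
Recent pandas releases return boolean dummy columns by default rather than the 0/1 integers shown above. As a sketch, assuming integer output is wanted for modeling, the dtype argument requests it explicitly. The prefix 'cat' is an arbitrary illustrative choice, and drop_first drops the first level to avoid perfectly collinear columns.

```python
import pandas as pd

cat_s = pd.Series(['a', 'b', 'c', 'd'] * 2, dtype='category')

# One indicator column per category, as integers, with a column-name prefix
dummies = pd.get_dummies(cat_s, prefix='cat', dtype=int)

# Drop the first category so the remaining columns are not collinear
reduced = pd.get_dummies(cat_s, prefix='cat', drop_first=True, dtype=int)

print(dummies.columns.tolist())
print(reduced.columns.tolist())
```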


## 12.2 Advanced GroupBy Use
### 12.2.1 Group Transforms and "Unwrapped" GroupBys

In [60]:
df = pd.DataFrame({'key': ['a', 'b', 'c'] * 4,
                   'value': np.arange(12.)})
df

  key  value
0   a    0.0
1   b    1.0
2   c    2.0
3   a    3.0
4   b    4.0
5   c    5.0
6   a    6.0
7   b    7.0
8   c    8.0
9   a    9.0


In [61]:
g = df.groupby('key').value # group the value column by key
g.mean() # mean of each group

key
a    4.5
b    5.5
c    6.5
Name: value, dtype: float64

In [63]:
g.transform(lambda x: x.mean()) # replace each value with the mean of its key group

0     4.5
1     5.5
2     6.5
3     4.5
4     5.5
5     6.5
6     4.5
7     5.5
8     6.5
9     4.5
10    5.5
11    6.5
Name: value, dtype: float64

In [64]:
g.transform('mean') # passing the string alias of a built-in aggregation is shorter and gives the same result

0     4.5
1     5.5
2     6.5
3     4.5
4     5.5
5     6.5
6     4.5
7     5.5
8     6.5
9     4.5
10    5.5
11    6.5
Name: value, dtype: float64

transform can also be used with functions that return a Series, but the result must be the same size as the input.

In [65]:
g.transform(lambda x: x * 2) # multiply every value in each group by 2

0      0.0
1      2.0
2      4.0
3      6.0
4      8.0
5     10.0
6     12.0
7     14.0
8     16.0
9     18.0
10    20.0
11    22.0
Name: value, dtype: float64

In [67]:
g.transform(lambda x: x.rank(ascending=False))
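# rank within each group, in descending order
# each group holds 4 values, so the ranks run from 1 to 4
# the first row of each group has the smallest value, hence rank 4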

0     4.0
1     4.0
2     4.0
3     3.0
4     3.0
5     3.0
6     2.0
7     2.0
8     2.0
9     1.0
10    1.0
11    1.0
Name: value, dtype: float64

In [68]:
def normalize(x):
    return (x - x.mean()) / x.std() # standardize: subtract the mean, divide by the standard deviation

In [71]:
g.transform(normalize)
# g.apply(normalize) gives the same values (newer pandas versions may prepend the group key to the index)

0    -1.161895
1    -1.161895
2    -1.161895
3    -0.387298
4    -0.387298
5    -0.387298
6     0.387298
7     0.387298
8     0.387298
9     1.161895
10    1.161895
11    1.161895
Name: value, dtype: float64

In [72]:
g.apply(lambda x:x.std())

key
a    3.872983
b    3.872983
c    3.872983
Name: value, dtype: float64

Built-in aggregation functions such as 'mean' and 'sum' are usually faster than a general apply function.

In [73]:
g.transform('mean')

0     4.5
1     5.5
2     6.5
3     4.5
4     5.5
5     6.5
6     4.5
7     5.5
8     6.5
9     4.5
10    5.5
11    6.5
Name: value, dtype: float64

In [74]:
normalized = (df['value'] - g.transform('mean')) / g.transform('std')
normalized

0    -1.161895
1    -1.161895
2    -1.161895
3    -0.387298
4    -0.387298
5    -0.387298
6     0.387298
7     0.387298
8     0.387298
9     1.161895
10    1.161895
11    1.161895
Name: value, dtype: float64
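
The "unwrapped" version above and the transform(normalize) version compute the same thing. A minimal self-contained check, rebuilding the same frame and using np.allclose to compare them (the names via_function and unwrapped are illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'key': ['a', 'b', 'c'] * 4,
                   'value': np.arange(12.)})
g = df.groupby('key')['value']

def normalize(x):
    # standardize one group: subtract its mean, divide by its standard deviation
    return (x - x.mean()) / x.std()

# One Python function call per group
via_function = g.transform(normalize)

# "Unwrapped": two built-in group aggregations, then plain vectorized arithmetic
unwrapped = (df['value'] - g.transform('mean')) / g.transform('std')

same = np.allclose(via_function, unwrapped)
print(same)
```

The unwrapped form runs several group aggregations, but each one uses a built-in implementation, which is why it tends to pay off on large data.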

### 12.2.2 Grouped Time Resampling

For time series data, the resample method is semantically a group operation based on time intervals.

In [75]:
N = 15
times = pd.date_range('2017-05-20 00:00', freq='1min', periods=N) # one timestamp per minute
df = pd.DataFrame({'time': times,
                   'value': np.arange(N)})
df

                 time  value
0 2017-05-20 00:00:00      0
1 2017-05-20 00:01:00      1
2 2017-05-20 00:02:00      2
3 2017-05-20 00:03:00      3
4 2017-05-20 00:04:00      4
5 2017-05-20 00:05:00      5
6 2017-05-20 00:06:00      6
7 2017-05-20 00:07:00      7
8 2017-05-20 00:08:00      8
9 2017-05-20 00:09:00      9


In [77]:
df.set_index('time').resample('5min').count() # resample into 5-minute bins
# [00:00:00,00:05:00)
# [00:05:00,00:10:00)
# [00:10:00,00:15:00)

                     value
time
2017-05-20 00:00:00      5
2017-05-20 00:05:00      5
2017-05-20 00:10:00      5
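
To make the "resample is a group operation" point concrete, here is a small sketch that computes the same 5-minute sums twice: once with resample and once with groupby and a pd.Grouper carrying the frequency. It rebuilds the same 15-row frame so that it runs on its own.

```python
import numpy as np
import pandas as pd

N = 15
times = pd.date_range('2017-05-20 00:00', freq='1min', periods=N)
df = pd.DataFrame({'time': times, 'value': np.arange(N)})

indexed = df.set_index('time')

# Time-based resampling ...
by_resample = indexed.resample('5min').sum()

# ... and the equivalent groupby on 5-minute bins of the index
by_groupby = indexed.groupby(pd.Grouper(freq='5min')).sum()

print(by_resample['value'].tolist())
print(by_groupby['value'].tolist())
```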


In [79]:
df2 = pd.DataFrame({'time': times.repeat(3), # repeat each timestamp 3 times
                    'key': np.tile(['a', 'b', 'c'], N), # tile the sequence a, b, c N times (N=15)
                    'value': np.arange(N * 3.)})
df2[:7] # first 7 rows

                 time key  value
0 2017-05-20 00:00:00   a    0.0
1 2017-05-20 00:00:00   b    1.0
2 2017-05-20 00:00:00   c    2.0
3 2017-05-20 00:01:00   a    3.0
4 2017-05-20 00:01:00   b    4.0
5 2017-05-20 00:01:00   c    5.0
6 2017-05-20 00:02:00   a    6.0


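To do the same resampling for each value of 'key', use a pandas.TimeGrouper object. TimeGrouper was deprecated in pandas 0.21 and removed from later releases, where pd.Grouper(freq='5min') is the replacement.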

In [85]:
time_key = pd.TimeGrouper('5min') # group by 5-minute frequency; in newer pandas use pd.Grouper(freq='5min')
time_key

TimeGrouper(freq=<5 * Minutes>, axis=0, sort=True, closed='left', label='left', how='mean', convention='e', base=0)

In [86]:
resampled = (df2.set_index('time')
             .groupby(['key', time_key]) # group by key first, then by the 5-minute time bins
             .sum())
resampled

                         value
key time
a   2017-05-20 00:00:00   30.0
    2017-05-20 00:05:00  105.0
    2017-05-20 00:10:00  180.0
b   2017-05-20 00:00:00   35.0
    2017-05-20 00:05:00  110.0
    2017-05-20 00:10:00  185.0
c   2017-05-20 00:00:00   40.0
    2017-05-20 00:05:00  115.0
    2017-05-20 00:10:00  190.0


In [87]:
resampled.reset_index()

  key                time  value
0   a 2017-05-20 00:00:00   30.0
1   a 2017-05-20 00:05:00  105.0
2   a 2017-05-20 00:10:00  180.0
3   b 2017-05-20 00:00:00   35.0
4   b 2017-05-20 00:05:00  110.0
5   b 2017-05-20 00:10:00  185.0
6   c 2017-05-20 00:00:00   40.0
7   c 2017-05-20 00:05:00  115.0
8   c 2017-05-20 00:10:00  190.0


One limitation of TimeGrouper is that the time must be the index of the Series or DataFrame.
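
In pandas versions that provide pd.Grouper, its key argument can name the column to bin, so the set_index step is no longer needed. The following is a sketch under that assumption, not the notebook's original code; it rebuilds df2 so that it is self-contained.

```python
import numpy as np
import pandas as pd

N = 15
times = pd.date_range('2017-05-20 00:00', freq='1min', periods=N)
df2 = pd.DataFrame({'time': times.repeat(3),
                    'key': np.tile(['a', 'b', 'c'], N),
                    'value': np.arange(N * 3.)})

# key='time' tells Grouper to bin the 'time' column rather than the index
resampled = df2.groupby(['key', pd.Grouper(key='time', freq='5min')]).sum()

# Look up one cell of the resulting (key, time) MultiIndex
first_a = resampled.loc[('a', pd.Timestamp('2017-05-20 00:00')), 'value']

print(resampled)
print(first_a)
```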

## 12.3 Techniques for Method Chaining
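Method chaining is a programming style in which several methods are called in sequence, each one operating on the object returned by the previous call and returning its own result. It removes the mental burden of naming a variable for every intermediate step. Fluent object-oriented APIs rely on method chaining, which resembles pipes on Unix systems.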

In [89]:
import numpy as np
import pandas as pd
df=pd.DataFrame({'col1':np.random.randn(200),
                'col2':np.random.randn(200),
                'key':np.arange(200)})
df2=df[df['col2']<0]
df2

        col1      col2  key
0   1.121829 -0.854300    0
1  -2.003564 -0.076057    1
3   0.515353 -1.995266    3
4   0.094970 -0.168464    4
5  -0.454599 -0.595363    5
8  -0.952184 -0.455784    8
11 -1.172774 -0.515910   11
12  1.228042 -0.199576   12
17  0.189197 -0.881804   17
19 -1.605899 -2.365012   19


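The DataFrame.assign method is a functional alternative to column assignments of the form df[k] = v. Rather than modifying the object in place, it returns a new DataFrame with the indicated modifications.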

In [91]:
import pandas as pd
import numpy as np
col1=np.random.randn(100)
col2=np.random.randn(100)
df=pd.DataFrame({'col1':col1,
                'col2':col2,
                'key':['one','two','three','four']*25})
df2=df[df['col2']<0]
def load_data():
    return pd.DataFrame({'col1':col1,
                'col2':col2,
                'key':['one','two','three','four']*25})
result=(load_data()
       [lambda x:x.col2<0]
       .assign(col1_demeaned=lambda x:x.col1-x.col1.mean())
       .groupby('key')
       .col1_demeaned.std())
result

key
four     0.742092
one      0.803014
three    1.255753
two      0.903226
Name: col1_demeaned, dtype: float64
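
The difference between assign and in-place assignment can be checked directly. A minimal sketch with made-up random data (the seed and the names df_new and df_inplace are illustrative): assign leaves the original frame untouched, while producing the same column as df[k] = v would.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # arbitrary seed, for reproducibility only
df = pd.DataFrame({'col1': rng.standard_normal(100),
                   'col2': rng.standard_normal(100)})

# assign returns a new DataFrame; df itself is not modified
df_new = df.assign(col1_demeaned=lambda x: x.col1 - x.col1.mean())

# In-place style, applied to a copy for comparison
df_inplace = df.copy()
df_inplace['col1_demeaned'] = df_inplace['col1'] - df_inplace['col1'].mean()

print('col1_demeaned' in df.columns)
print(df_new.equals(df_inplace))
```

Because assign accepts a callable, it can refer to the intermediate frame inside a chain, where no variable name for that frame exists.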

### 12.3.1 The pipe Method

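Sometimes you need to use your own functions or functions from third-party libraries inside a chain; this is where the pipe method comes in.

f(df) and df.pipe(f) are equivalent.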

In [95]:
col1=np.random.randn(100)
col2=np.random.randn(100)
df=pd.DataFrame({'col1':col1,
             'col2':col2})
def f(df,arg1):
    return df-arg1
def g(df,arg2,arg3):
    return df-arg2+arg3
def h(df,arg4):
    return df+arg4

In [97]:
a=f(df,arg1=1)
a

       col1      col2
0 -0.247819 -1.022023
1 -0.052478 -2.150339
2 -1.904292 -1.878257
3 -1.231856 -0.296718
4 -1.036362 -0.718034
5 -3.118417 -1.185883
6 -0.274535 -0.842081
7 -2.443994 -1.691806
8 -1.947247 -1.819682
9  0.971729 -1.547155


In [98]:
b=g(a,2,4)
b

       col1      col2
0  1.752181  0.977977
1  1.947522 -0.150339
2  0.095708  0.121743
3  0.768144  1.703282
4  0.963638  1.281966
5 -1.118417  0.814117
6  1.725465  1.157919
7 -0.443994  0.308194
8  0.052753  0.180318
9  2.971729  0.452845


In [99]:
c=h(b,6)
c

       col1      col2
0  7.752181  6.977977
1  7.947522  5.849661
2  6.095708  6.121743
3  6.768144  7.703282
4  6.963638  7.281966
5  4.881583  6.814117
6  7.725465  7.157919
7  5.556006  6.308194
8  6.052753  6.180318
9  8.971729  6.452845
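
The three separate calls to f, g and h above can be written as a single chain with pipe, which avoids the intermediate names a, b and c. The sketch below redefines the same three functions and uses an arbitrary seed so that it runs on its own. Extra positional and keyword arguments given to pipe are passed through to the function.

```python
import numpy as np
import pandas as pd

np.random.seed(12345)  # arbitrary seed, for reproducibility only
df = pd.DataFrame({'col1': np.random.randn(100),
                   'col2': np.random.randn(100)})

def f(df, arg1):
    return df - arg1

def g(df, arg2, arg3):
    return df - arg2 + arg3

def h(df, arg4):
    return df + arg4

# Step-by-step version, nested instead of using intermediate variables
stepwise = h(g(f(df, arg1=1), 2, 4), 6)

# Equivalent chained version
chained = (df.pipe(f, arg1=1)
             .pipe(g, 2, 4)
             .pipe(h, arg4=6))

print(chained.equals(stepwise))
```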


In [108]:
df2=pd.DataFrame({'col1':col1,
                  'col2':col2,
                 'key1':['one','two','three','four']*25,
                 'key2':['a','b','c','d']*25})
g=df2.groupby(['key1','key2'])
df2['col1']=df2['col1']-g.transform('mean')['col1']
df2

       col1      col2   key1 key2
0  0.829362 -0.022023    one    a
1  1.122838 -1.150339    two    b
2 -0.821226 -0.878257  three    c
3 -0.368464  0.703282   four    d
4  0.040819  0.281966    one    a
5 -1.943101 -0.185883    two    b
6  0.808531  0.157919  three    c
7 -1.580602 -0.691806   four    d
8 -0.870066 -0.819682    one    a
9  2.147044 -0.547155    two    b


In [107]:
df3=pd.DataFrame({'col1':col1,
                  'col2':col2,
                 'key1':['one','two','three','four']*25,
                 'key2':['a','b','c','d']*25})
def group_demean(df,by,cols):
    result=df.copy()
    g=df.groupby(by)
    for c in cols:
        result[c]=df[c]-g[c].transform('mean')
    return result
result=(df3[df3.col1<0].pipe(group_demean,['key1','key2'],['col1','col2']))
result

        col1      col2   key1 key2
2  -0.196548 -0.826230  three    c
3   0.504481  0.742496   four    d
4   0.857528  0.278293    one    a
5  -1.130459  0.073455    two    b
7  -0.707657 -0.652592   four    d
8  -0.053357 -0.823354    one    a
10 -0.788606 -0.318130  three    c
11  0.380515 -0.687790   four    d
14  0.026492 -1.138483  three    c
17  0.754782 -0.064473    two    b
