# Chapter 9: Time Series Data

In [3]:
import pandas as pd
import numpy as np

## III. Resampling
##### Resampling here refers to the `resample` method, which can be viewed as a time-series version of `groupby`
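
To make the groupby analogy concrete, the sketch below groups a small illustrative frame (`df_demo`, a name and data chosen here, not taken from the notebook) once with `resample` and once with `groupby` plus `pd.Grouper`. It assumes a regular one-minute index and a five-minute bin width.

```python
import numpy as np
import pandas as pd

# Hypothetical demo data: ten one-minute observations with values 0..9
idx = pd.date_range('2020-01-01', periods=10, freq='min')
df_demo = pd.DataFrame({'A': np.arange(10.0)}, index=idx)

# Time-based grouping through resample
by_resample = df_demo.resample('5min').sum()

# The same grouping expressed through groupby with a Grouper
by_groupby = df_demo.groupby(pd.Grouper(freq='5min')).sum()

print(by_resample)
print(by_groupby)
```

Both calls bin the rows into the same five-minute intervals, so the two printed frames should show the same sums.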

### 1. Basic operations on the resample object
##### The sampling frequency is usually given as one of the offset aliases mentioned above

In [4]:
df_r = pd.DataFrame(np.random.randn(1000, 3),
                    index=pd.date_range('1/1/2020', freq='S', periods=1000),
                    columns=['A', 'B', 'C'])
df_r.head()

| | A | B | C |
|---|---|---|---|
| 2020-01-01 00:00:00 | 0.177775 | -0.459368 | -0.362346 |
| 2020-01-01 00:00:01 | 0.626194 | 2.207254 | -0.559469 |
| 2020-01-01 00:00:02 | -0.173953 | 1.65151 | -0.358861 |
| 2020-01-01 00:00:03 | 0.044189 | -0.158931 | -0.82521 |
| 2020-01-01 00:00:04 | -0.845582 | -0.745165 | -1.016058 |


In [6]:
r = df_r.resample('3min')
r

<pandas.core.resample.DatetimeIndexResampler object at 0x7fbc7d63f750>

In [9]:
r.sum()

| | A | B | C |
|---|---|---|---|
| 2020-01-01 00:00:00 | -19.66531 | 8.328837 | -9.858038 |
| 2020-01-01 00:03:00 | -26.870187 | -19.482617 | -4.898907 |
| 2020-01-01 00:06:00 | 8.10418 | 16.78761 | 11.111367 |
| 2020-01-01 00:09:00 | 24.665701 | -14.070119 | -17.599416 |
| 2020-01-01 00:12:00 | -1.253184 | 6.813827 | 1.436147 |
| 2020-01-01 00:15:00 | -6.803968 | 6.296797 | 13.040606 |


In [10]:
df_r2 = pd.DataFrame(np.random.randn(200, 3),
                     index=pd.date_range('1/1/2020', freq='D', periods=200),
                     columns=['A', 'B', 'C'])
r = df_r2.resample('CBMS')
r.sum()

| | A | B | C |
|---|---|---|---|
| 2020-01-01 | -3.695056 | -2.71032 | 2.230938 |
| 2020-02-03 | -0.064266 | -0.454343 | -0.61914 |
| 2020-03-02 | 0.981248 | 8.318306 | -0.854792 |
| 2020-04-01 | -2.257505 | 0.936925 | 15.525964 |
| 2020-05-01 | -1.229655 | -5.689741 | -2.232628 |
| 2020-06-01 | -1.620876 | 1.942736 | -8.26836 |
| 2020-07-01 | 4.962911 | -0.785271 | 0.947631 |


### 2. Resampling aggregation

In [11]:
r = df_r.resample('3min')
r['A'].mean()

2020-01-01 00:00:00   -0.109252
2020-01-01 00:03:00   -0.149279
2020-01-01 00:06:00    0.045023
2020-01-01 00:09:00    0.137032
2020-01-01 00:12:00   -0.006962
2020-01-01 00:15:00   -0.068040
Freq: 3T, Name: A, dtype: float64

In [12]:
r['A'].agg(['sum', 'mean', 'std'])

| | sum | mean | std |
|---|---|---|---|
| 2020-01-01 00:00:00 | -19.66531 | -0.109252 | 1.003105 |
| 2020-01-01 00:03:00 | -26.870187 | -0.149279 | 0.993577 |
| 2020-01-01 00:06:00 | 8.10418 | 0.045023 | 1.059718 |
| 2020-01-01 00:09:00 | 24.665701 | 0.137032 | 1.010835 |
| 2020-01-01 00:12:00 | -1.253184 | -0.006962 | 0.876128 |
| 2020-01-01 00:15:00 | -6.803968 | -0.06804 | 0.935122 |


##### Similarly, functions and lambda expressions can be passed to agg

In [13]:
r.agg({'A': 'sum', 'B': lambda x: x.max() - x.min()})

| | A | B |
|---|---|---|
| 2020-01-01 00:00:00 | -19.66531 | 5.022055 |
| 2020-01-01 00:03:00 | -26.870187 | 5.573791 |
| 2020-01-01 00:06:00 | 8.10418 | 4.585215 |
| 2020-01-01 00:09:00 | 24.665701 | 5.068494 |
| 2020-01-01 00:12:00 | -1.253184 | 6.102251 |
| 2020-01-01 00:15:00 | -6.803968 | 4.905433 |


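### 3. Iterating over resample groups
##### Iterating over resample groups works just like iterating over a groupby object; any operation can be applied to each group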

In [14]:
small = pd.Series(range(6),
                  index=pd.to_datetime(['2020-01-01 00:00:00', '2020-01-01 00:30:00',
                                        '2020-01-01 00:31:00', '2020-01-01 01:00:00',
                                        '2020-01-01 03:00:00', '2020-01-01 03:05:00']))
resampled = small.resample('H')
for name, group in resampled:
    print("Group: ", name)
    print("-" * 27)
    print(group, end="\n\n")

Group:  2020-01-01 00:00:00
---------------------------
2020-01-01 00:00:00    0
2020-01-01 00:30:00    1
2020-01-01 00:31:00    2
dtype: int64

Group:  2020-01-01 01:00:00
---------------------------
2020-01-01 01:00:00    3
dtype: int64

Group:  2020-01-01 02:00:00
---------------------------
Series([], dtype: int64)

Group:  2020-01-01 03:00:00
---------------------------
2020-01-01 03:00:00    4
2020-01-01 03:05:00    5
dtype: int64
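
The iteration above yields an empty group for the 02:00 bin. As a rough sketch of what that means for aggregation, the block below rebuilds the same six points under a new name (`small_demo`, chosen here) and compares `size`, `sum`, and `mean` per hourly bin. The frequency is written as `'60min'`, an assumption made to avoid alias differences between pandas versions.

```python
import pandas as pd

small_demo = pd.Series(
    range(6),
    index=pd.to_datetime([
        '2020-01-01 00:00:00', '2020-01-01 00:30:00',
        '2020-01-01 00:31:00', '2020-01-01 01:00:00',
        '2020-01-01 03:00:00', '2020-01-01 03:05:00',
    ]),
)

hourly = small_demo.resample('60min')  # one-hour bins

# Collect three aggregations side by side, one row per bin
summary = pd.DataFrame({
    'size': hourly.size(),
    'sum': hourly.sum(),
    'mean': hourly.mean(),
})
print(summary)
```

The empty bin still appears in the result: its size and sum are 0, while its mean is NaN because there is nothing to average.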



## IV. Window Functions
##### This section introduces the two main kinds of window functions in pandas: rolling and expanding

In [15]:
s = pd.Series(np.random.randn(1000),index=pd.date_range('1/1/2020', periods=1000))
s.head()

2020-01-01    1.093105
2020-01-02    0.711759
2020-01-03    0.734754
2020-01-04    1.442447
2020-01-05   -1.116043
Freq: D, dtype: float64

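### 1. Rolling
#### (a) Common aggregations
##### The rolling method defines a window. Like a groupby object, it performs no computation by itself and must be combined with an aggregation function to produce a result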

In [16]:
s.rolling(window=50)

Rolling [window=50,center=False,axis=0]

In [17]:
s.rolling(window=50).mean()

2020-01-01         NaN
2020-01-02         NaN
2020-01-03         NaN
2020-01-04         NaN
2020-01-05         NaN
                ...   
2022-09-22    0.004697
2022-09-23    0.020048
2022-09-24    0.050533
2022-09-25    0.007057
2022-09-26   -0.013178
Freq: D, Length: 1000, dtype: float64

##### The min_periods parameter is the minimum number of non-missing observations a window must contain; windows with fewer yield NaN

In [18]:
s.rolling(window=50,min_periods=3).mean().head()

2020-01-01         NaN
2020-01-02         NaN
2020-01-03    0.846539
2020-01-04    0.995516
2020-01-05    0.573204
Freq: D, dtype: float64
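
Because `min_periods` counts non-missing values, its effect is easiest to see on data that contains NaN. The following is a minimal sketch on a made-up series (`s_demo` and its values are illustrative only), comparing the default behavior of a fixed window with a relaxed threshold.

```python
import numpy as np
import pandas as pd

s_demo = pd.Series([1.0, np.nan, 3.0, 4.0, np.nan, 6.0])

# For a fixed integer window, min_periods defaults to the window size,
# so every window that contains a NaN produces NaN
strict = s_demo.rolling(window=3).mean()

# Require only two non-missing values per window
relaxed = s_demo.rolling(window=3, min_periods=2).mean()

print(pd.DataFrame({'value': s_demo, 'strict': strict, 'relaxed': relaxed}))
```

In this particular series every three-element window contains a NaN, so the strict column stays empty, while the relaxed column is filled from the third row onward.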

##### count/sum/mean/median/min/max/std/var/skew/kurt/quantile/cov/corr are all commonly used aggregation functions

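#### (b) Aggregation with apply on rolling

##### When aggregating with apply, just remember that the function receives a Series holding the values of the current window and must return a scalar; for example, the coefficient of variation is computed below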

In [19]:
s.rolling(window=50,min_periods=3).apply(lambda x:x.std()/x.mean()).head()

2020-01-01         NaN
2020-01-02         NaN
2020-01-03    0.252606
2020-01-04    0.346898
2020-01-05    1.728087
Freq: D, dtype: float64
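
The same calculation can also be written as a named function. The sketch below is one possible variant, with the helper `coef_var` and the sample data invented for illustration. It passes `raw=True`, so the function receives a NumPy array instead of a Series, and therefore sets `ddof=1` explicitly to match the sample standard deviation that `Series.std()` uses.

```python
import numpy as np
import pandas as pd

def coef_var(window_values):
    """Coefficient of variation of one window (sample std divided by mean)."""
    # With raw=True the argument is a NumPy array; np.std defaults to ddof=0,
    # so ddof=1 is passed to agree with Series.std()
    return np.std(window_values, ddof=1) / np.mean(window_values)

s_demo = pd.Series([2.0, 4.0, 4.0, 6.0, 8.0])

# Series input (the default, raw=False) versus array input (raw=True)
via_series = s_demo.rolling(window=3).apply(lambda x: x.std() / x.mean())
via_array = s_demo.rolling(window=3).apply(coef_var, raw=True)

print(pd.DataFrame({'series_input': via_series, 'array_input': via_array}))
```

Both columns should agree; the array form avoids building a Series for every window, which tends to matter on long inputs.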

#### (c) Time-based rolling

In [20]:
s.rolling('15D').mean().head()

2020-01-01    1.093105
2020-01-02    0.902432
2020-01-03    0.846539
2020-01-04    0.995516
2020-01-05    0.573204
Freq: D, dtype: float64
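
On a regular daily index, a time-based window and a fixed-size window often select the same rows, so the difference is easier to see with gaps in the index. The sketch below uses a small irregular series (`irregular`, made up for this illustration) and compares a two-row window with a two-day window.

```python
import pandas as pd

# Hypothetical data with a gap between 2020-01-02 and 2020-01-05
irregular = pd.Series(
    [1.0, 2.0, 3.0, 4.0],
    index=pd.to_datetime(['2020-01-01', '2020-01-02', '2020-01-05', '2020-01-06']),
)

fixed = irregular.rolling(window=2).sum()  # always the last two rows
timed = irregular.rolling('2D').sum()      # rows falling within the last two days

print(pd.DataFrame({'value': irregular, 'fixed_2_rows': fixed, 'timed_2D': timed}))
```

The two columns differ at 2020-01-05: the fixed window still reaches back to 2020-01-02, whereas the two-day window contains only the current row.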

##### The optional closed parameter accepts 'right' (the default), 'left', 'both', or 'neither', and determines which endpoints of the window are included

In [21]:
s.rolling('15D', closed='right').sum().head()

2020-01-01    1.093105
2020-01-02    1.804864
2020-01-03    2.539618
2020-01-04    3.982065
2020-01-05    2.866022
Freq: D, dtype: float64
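
To compare the four `closed` options directly, the sketch below applies each of them to a short daily series with a two-day window. The series `s_demo` and the window length are illustrative choices, not taken from the notebook.

```python
import pandas as pd

s_demo = pd.Series(
    [1, 2, 3, 4, 5],
    index=pd.date_range('2020-01-01', periods=5, freq='D'),
)

# A 2-day window ending at time t spans the interval from t - 2D to t;
# closed decides whether each endpoint belongs to the window
result = pd.DataFrame({
    option: s_demo.rolling('2D', closed=option).sum()
    for option in ['right', 'left', 'both', 'neither']
})
print(result)
```

With `'left'` and `'neither'` the current row is excluded, so the first row has an empty window and comes out as NaN.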

### 2. Expanding
#### (a) The expanding function
##### A plain expanding call is equivalent to rolling(window=len(s), min_periods=1); it performs a cumulative computation over the series

In [22]:
s.rolling(window=len(s),min_periods=1).sum().head()

2020-01-01    1.093105
2020-01-02    1.804864
2020-01-03    2.539618
2020-01-04    3.982065
2020-01-05    2.866022
Freq: D, dtype: float64

In [23]:
s.expanding().sum().head()

2020-01-01    1.093105
2020-01-02    1.804864
2020-01-03    2.539618
2020-01-04    3.982065
2020-01-05    2.866022
Freq: D, dtype: float64

##### The apply method is available here as well

In [24]:
s.expanding().apply(lambda x:sum(x)).head()

2020-01-01    1.093105
2020-01-02    1.804864
2020-01-03    2.539618
2020-01-04    3.982065
2020-01-05    2.866022
Freq: D, dtype: float64

#### (b) Some special expanding-type functions
##### cumsum/cumprod/cummax/cummin are all special expanding-style cumulative methods

In [25]:
s.cumsum().head()

2020-01-01    1.093105
2020-01-02    1.804864
2020-01-03    2.539618
2020-01-04    3.982065
2020-01-05    2.866022
Freq: D, dtype: float64

In [26]:
s.cummax().head()

2020-01-01    1.093105
2020-01-02    1.093105
2020-01-03    1.093105
2020-01-04    1.442447
2020-01-05    1.442447
Freq: D, dtype: float64
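
On data without missing values, `cumsum` and `expanding().sum()` give the same numbers, as the outputs above show. They can differ when NaN is present, which the following sketch illustrates on a made-up series (`s_demo` is an illustrative name).

```python
import numpy as np
import pandas as pd

s_demo = pd.Series([1.0, np.nan, 2.0, 3.0])

comparison = pd.DataFrame({
    'cumsum': s_demo.cumsum(),                   # keeps NaN at the missing position
    'expanding_sum': s_demo.expanding().sum(),   # sums the non-missing values seen so far
})
print(comparison)
```

At the missing position `cumsum` returns NaN, while `expanding().sum()` reports the running total of the valid values, since the default `min_periods=1` is already satisfied.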

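##### shift/diff/pct_change all operate on the relationship between elements at different positions
##### ① shift keeps the index unchanged and moves the values toward later positions
##### ② diff is the difference between an element and an earlier one; the periods parameter sets the interval, defaults to 1, and may be negative
##### ③ pct_change is the relative change between an element and an earlier one, returned as a fraction rather than multiplied by 100; its periods parameter works like that of diff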

In [27]:
s.shift(2).head()

2020-01-01         NaN
2020-01-02         NaN
2020-01-03    1.093105
2020-01-04    0.711759
2020-01-05    0.734754
Freq: D, dtype: float64

In [28]:
s.diff(3).head()

2020-01-01         NaN
2020-01-02         NaN
2020-01-03         NaN
2020-01-04    0.349342
2020-01-05   -1.827802
Freq: D, dtype: float64

In [29]:
s.pct_change(3).head()

2020-01-01         NaN
2020-01-02         NaN
2020-01-03         NaN
2020-01-04    0.319587
2020-01-05   -2.568007
Freq: D, dtype: float64
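
The three methods are closely related: `diff` and `pct_change` can both be written in terms of `shift`. The sketch below checks this on a short made-up series (`s_demo` and its values are illustrative), using a lag of one period.

```python
import pandas as pd

s_demo = pd.Series(
    [10.0, 12.0, 9.0, 18.0],
    index=pd.date_range('2020-01-01', periods=4, freq='D'),
)

lagged = s_demo.shift(1)           # previous value, aligned to the current index

diff_manual = s_demo - lagged      # same idea as s_demo.diff(1)
pct_manual = s_demo / lagged - 1   # same idea as s_demo.pct_change(1)

print(pd.DataFrame({
    'value': s_demo,
    'shift(1)': lagged,
    'diff(1)': s_demo.diff(1),
    'diff_manual': diff_manual,
    'pct_change(1)': s_demo.pct_change(1),
    'pct_manual': pct_manual,
}))
```

The first row is NaN in every derived column because there is no earlier element to compare against.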