Reading notes on *Python for Data Analysis* ([利用Python进行数据分析](https://book.douban.com/subject/25779298/)).
 
[Chapter 10](/2017/07/20/python_data_analysis10.html), Section 6: Resampling and Frequency Conversion

All data used here can be downloaded from [the author's GitHub](https://github.com/wesm/pydata-book).


In [1]:
%pylab inline
import pandas as pd
from datetime import datetime
from pandas import Series, DataFrame

Populating the interactive namespace from numpy and matplotlib


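pandas objects provide a resample method for resampling.

For time series, resampling is the process of converting a series from one frequency to another.

Two special cases are aggregating high-frequency data to a lower frequency, called downsampling, and converting low-frequency data to a higher frequency, called upsampling.

Not every resampling falls into one of these two categories. For example, converting W-WED to W-FRI is neither downsampling nor upsampling.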

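To make the W-WED to W-FRI case concrete, here is a minimal sketch. The series `weekly_wed` and its values are made up for illustration and do not come from the book. The number of observations stays the same and only the weekly anchor moves, so nothing is aggregated and nothing needs filling.

```python
import numpy as np
import pandas as pd

# Four weekly observations stamped on Wednesdays (illustrative data)
rng = pd.date_range('2000-01-05', periods=4, freq='W-WED')
weekly_wed = pd.Series(np.arange(4), index=rng)

# Re-anchor the weekly bins so that each one ends on a Friday
weekly_fri = weekly_wed.resample('W-FRI').last()

# Same number of rows, but every label is now a Friday
print(weekly_fri)
```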
In [3]:
rng = pd.date_range('1/1/2000', periods=100, freq='D')
ts = Series(randn(len(rng)), index=rng)
ts.resample('M').mean()

2000-01-31   -0.102857
2000-02-29    0.042360
2000-03-31   -0.065909
2000-04-30   -0.058290
Freq: M, dtype: float64

In [4]:
ts.resample('M', kind='period').mean()

2000-01   -0.102857
2000-02    0.042360
2000-03   -0.065909
2000-04   -0.058290
Freq: M, dtype: float64

The main parameters of the resample method include:

![Resample parameters](resample_params.png)
![Resample parameters (continued)](resample_params1.png)

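## Downsampling

Lowering the frequency of the data is called downsampling, which means aggregating the data.
Each data point belongs to exactly one aggregation interval, and the union of all intervals makes up the whole time frame.
When downsampling, consider the following:

- Which side of each interval is closed
- How to label each aggregated bin: with the start of the interval or with its end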

In [6]:
# One-minute data
rng = pd.date_range('1/1/2000', periods=12, freq='T')
ts = Series(np.arange(12), index=rng)
ts

2000-01-01 00:00:00     0
2000-01-01 00:01:00     1
2000-01-01 00:02:00     2
2000-01-01 00:03:00     3
2000-01-01 00:04:00     4
2000-01-01 00:05:00     5
2000-01-01 00:06:00     6
2000-01-01 00:07:00     7
2000-01-01 00:08:00     8
2000-01-01 00:09:00     9
2000-01-01 00:10:00    10
2000-01-01 00:11:00    11
Freq: T, dtype: int32

In [14]:
# Aggregate into 5-minute bins
# Note: by default the bins are left-closed and right-open
ts.resample('5min').last()

2000-01-01 00:00:00     4
2000-01-01 00:05:00     9
2000-01-01 00:10:00    11
Freq: 5T, dtype: int32

In [21]:
# closed='right' makes the bins left-open and right-closed
ts.resample('5min', closed='right').last()

1999-12-31 23:55:00     0
2000-01-01 00:00:00     5
2000-01-01 00:05:00    10
2000-01-01 00:10:00    11
Freq: 5T, dtype: int32

In [22]:
# Label each bin with its right edge
ts.resample('5min', closed='right', label='right').last()

2000-01-01 00:00:00     0
2000-01-01 00:05:00     5
2000-01-01 00:10:00    10
2000-01-01 00:15:00    11
Freq: 5T, dtype: int32

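The cells above use `last()`, which only reveals the final value in each bin. As a small sketch on the same kind of 12-point one-minute series, switching to `sum()` shows exactly which points land in each bin under the two `closed` settings. The alias `'min'` is used here instead of `'T'` only because it is accepted across pandas versions.

```python
import numpy as np
import pandas as pd

# Twelve one-minute observations with values 0..11
rng = pd.date_range('2000-01-01', periods=12, freq='min')
ts = pd.Series(np.arange(12), index=rng)

# Left-closed bins [00:00, 00:05), [00:05, 00:10), [00:10, 00:15)
left_closed = ts.resample('5min').sum()             # sums: 10, 35, 21

# Right-closed bins (23:55, 00:00], (00:00, 00:05], (00:05, 00:10], (00:10, 00:15]
right_closed = ts.resample('5min', closed='right').sum()  # sums: 0, 15, 40, 11
```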
In [24]:
# Shift the result index by a small offset
ts.resample('5min', loffset='-1s').last()

# The same effect can be achieved by calling shift on the result object.

1999-12-31 23:59:59     4
2000-01-01 00:04:59     9
2000-01-01 00:09:59    11
Freq: 5T, dtype: int32

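As far as I know, the `loffset` argument was deprecated in pandas 1.1 and is no longer accepted in pandas 2.x, so the cell above may fail on a recent installation. The following is a sketch of two ways to get the same labels without it: adding a timedelta to the result index, or calling `shift` with a `freq` so that the index moves rather than the data. The variable names are my own.

```python
import numpy as np
import pandas as pd

rng = pd.date_range('2000-01-01', periods=12, freq='min')
ts = pd.Series(np.arange(12), index=rng)

result = ts.resample('5min').last()

# Option 1: move every label back by one second
by_index = result.copy()
by_index.index = by_index.index + pd.Timedelta(seconds=-1)

# Option 2: shift with freq, which moves the index and leaves the values alone
by_shift = result.shift(-1, freq=pd.Timedelta(seconds=1))
```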
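### OHLC resampling

pandas has dedicated handling for OHLC data: the ohlc aggregation computes the open (first), high (maximum), low (minimum), and close (last) value of each bin.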

In [26]:
ts.resample('5min').ohlc()

| | open | high | low | close |
|---|---|---|---|---|
| 2000-01-01 00:00:00 | 0 | 4 | 0 | 4 |
| 2000-01-01 00:05:00 | 5 | 9 | 5 | 9 |
| 2000-01-01 00:10:00 | 10 | 11 | 10 | 11 |

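To see what `ohlc()` computes, the sketch below rebuilds the same four columns from the generic aggregations `first`, `max`, `min`, and `last`. This is only an illustration of the equivalence on this toy series, not a description of how pandas implements it internally.

```python
import numpy as np
import pandas as pd

rng = pd.date_range('2000-01-01', periods=12, freq='min')
ts = pd.Series(np.arange(12), index=rng)

# Built-in OHLC aggregation
ohlc = ts.resample('5min').ohlc()

# The same numbers assembled from four ordinary aggregations
manual = ts.resample('5min').agg(['first', 'max', 'min', 'last'])
manual.columns = ['open', 'high', 'low', 'close']

print((ohlc.values == manual.values).all())
```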

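### Resampling with groupby

Another approach is to use pandas's groupby functionality. For example, to group by month or by day of the week, pass a function that accesses those fields on the time series index: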

In [27]:
rng = pd.date_range('1/1/2000', periods=100, freq='D')
ts = Series(np.arange(100), index=rng)
ts.groupby(lambda x: x.month).mean()

1    15
2    45
3    75
4    95
dtype: int32

In [28]:
ts.groupby(lambda x: x.weekday).mean()

0    47.5
1    48.5
2    49.5
3    50.5
4    51.5
5    49.0
6    50.0
dtype: float64

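As an alternative sketch, the grouping keys can be taken directly from the index attributes instead of a lambda, which avoids relying on how the function is applied to the index. The attribute `dayofweek` counts Monday as 0 and Sunday as 6.

```python
import numpy as np
import pandas as pd

rng = pd.date_range('2000-01-01', periods=100, freq='D')
ts = pd.Series(np.arange(100), index=rng)

# Group by calendar month (1..4 for this date range)
by_month = ts.groupby(ts.index.month).mean()

# Group by day of the week (Monday=0 ... Sunday=6)
by_weekday = ts.groupby(ts.index.dayofweek).mean()
```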
## Upsampling and interpolation

Converting data from a low frequency to a higher one requires no aggregation.

In [29]:
frame = DataFrame(np.random.randn(2, 4),
                  index=pd.date_range('1/1/2000', periods=2, freq='W-WED'),
                  columns=['Colorado', 'Texas', 'New York', 'Ohio'])
frame

| | Colorado | Texas | New York | Ohio |
|---|---|---|---|---|
| 2000-01-05 | -0.847635 | 0.66079 | 2.916199 | -0.503541 |
| 2000-01-12 | 0.052009 | -0.76434 | -1.662339 | 0.12528 |


In [32]:
# Resampling to daily frequency introduces missing values by default
df_daily = frame.resample('D')
df_daily.last()

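| | Colorado | Texas | New York | Ohio |
|---|---|---|---|---|
| 2000-01-05 | -0.847635 | 0.66079 | 2.916199 | -0.503541 |
| 2000-01-06 | NaN | NaN | NaN | NaN |
| 2000-01-07 | NaN | NaN | NaN | NaN |
| 2000-01-08 | NaN | NaN | NaN | NaN |
| 2000-01-09 | NaN | NaN | NaN | NaN |
| 2000-01-10 | NaN | NaN | NaN | NaN |
| 2000-01-11 | NaN | NaN | NaN | NaN |
| 2000-01-12 | 0.052009 | -0.76434 | -1.662339 | 0.12528 |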


In [34]:
# The gaps can be filled in the same way as with fillna and reindex
frame.resample('D').ffill()

| | Colorado | Texas | New York | Ohio |
|---|---|---|---|---|
| 2000-01-05 | -0.847635 | 0.66079 | 2.916199 | -0.503541 |
| 2000-01-06 | -0.847635 | 0.66079 | 2.916199 | -0.503541 |
| 2000-01-07 | -0.847635 | 0.66079 | 2.916199 | -0.503541 |
| 2000-01-08 | -0.847635 | 0.66079 | 2.916199 | -0.503541 |
| 2000-01-09 | -0.847635 | 0.66079 | 2.916199 | -0.503541 |
| 2000-01-10 | -0.847635 | 0.66079 | 2.916199 | -0.503541 |
| 2000-01-11 | -0.847635 | 0.66079 | 2.916199 | -0.503541 |
| 2000-01-12 | 0.052009 | -0.76434 | -1.662339 | 0.12528 |


In [37]:
# Fill only a limited number of periods, to cap how far an observed value is carried forward
frame.resample('D').ffill(limit=2)

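| | Colorado | Texas | New York | Ohio |
|---|---|---|---|---|
| 2000-01-05 | -0.847635 | 0.66079 | 2.916199 | -0.503541 |
| 2000-01-06 | -0.847635 | 0.66079 | 2.916199 | -0.503541 |
| 2000-01-07 | -0.847635 | 0.66079 | 2.916199 | -0.503541 |
| 2000-01-08 | NaN | NaN | NaN | NaN |
| 2000-01-09 | NaN | NaN | NaN | NaN |
| 2000-01-10 | NaN | NaN | NaN | NaN |
| 2000-01-11 | NaN | NaN | NaN | NaN |
| 2000-01-12 | 0.052009 | -0.76434 | -1.662339 | 0.12528 |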


In [38]:
# Note: the new date index need not overlap the old one at all; this example also shows that the dates can extend past the original data
frame.resample('W-THU').ffill()

| | Colorado | Texas | New York | Ohio |
|---|---|---|---|---|
| 2000-01-06 | -0.847635 | 0.66079 | 2.916199 | -0.503541 |
| 2000-01-13 | 0.052009 | -0.76434 | -1.662339 | 0.12528 |

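The examples in this section fill gaps by carrying values forward. For the interpolation half of the heading, here is a minimal sketch using the resampler's `interpolate` method, which fills the new daily slots linearly by default. The two-point series with values 0 and 7 is made up so that the result is easy to check by eye.

```python
import pandas as pd

# Two weekly observations, one week apart (illustrative values)
idx = pd.date_range('2000-01-05', periods=2, freq='W-WED')
weekly = pd.Series([0.0, 7.0], index=idx)

# Forward fill repeats the first value until the next observation
daily_ffill = weekly.resample('D').ffill()

# Linear interpolation spreads the change evenly over the gap
daily_linear = weekly.resample('D').interpolate()

print(daily_linear)
```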

## Resampling with periods

Resampling data indexed by periods is straightforward.

In [39]:
frame = DataFrame(np.random.randn(24, 4),
                  index=pd.period_range('1-2000', '12-2001', freq='M'),
                  columns=['Colorado', 'Texas', 'New York', 'Ohio'])
frame[:5]

| | Colorado | Texas | New York | Ohio |
|---|---|---|---|---|
| 2000-01 | 2.001633 | 0.637625 | 0.422806 | -1.233967 |
| 2000-02 | 0.214921 | -0.561227 | -0.15532 | -2.21166 |
| 2000-03 | -0.584018 | -0.205559 | 1.27646 | -2.255439 |
| 2000-04 | 0.346297 | 0.18851 | -1.72063 | -0.742461 |
| 2000-05 | -0.908527 | 0.315601 | -0.507128 | -0.449549 |


In [42]:
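# This cell downsamples the monthly periods to quarterly means
# (despite its name, annual_frame holds quarterly data here)
# Upsampling is more involved: you must decide which end of each new interval receives the original value
# As with asfreq, this is controlled by convention ('start' or 'end'); the book's pandas version defaulted to 'end'
# Q-DEC: quarterly, with the year ending in December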
annual_frame = frame.resample('Q-DEC').mean()
annual_frame

| | Colorado | Texas | New York | Ohio |
|---|---|---|---|---|
| 2000Q1 | 0.544178 | -0.043054 | 0.514649 | -1.900355 |
| 2000Q2 | -0.000791 | 0.468495 | -1.247954 | -0.708676 |
| 2000Q3 | -0.247497 | 0.067101 | 0.512545 | 0.103532 |
| 2000Q4 | -1.126654 | -0.190198 | 0.416364 | -0.046905 |
| 2001Q1 | -1.081078 | -0.259754 | -0.861197 | -0.235051 |
| 2001Q2 | 0.010989 | 0.004505 | -1.142694 | 0.922503 |
| 2001Q3 | 0.843994 | -0.283601 | -0.942034 | 0.787534 |
| 2001Q4 | -0.211269 | -0.200959 | 0.103619 | -0.239605 |


In [43]:
# annual_frame is already quarterly (Q-DEC), so this call leaves the data unchanged
annual_frame.resample('Q-DEC').ffill()

| | Colorado | Texas | New York | Ohio |
|---|---|---|---|---|
| 2000Q1 | 0.544178 | -0.043054 | 0.514649 | -1.900355 |
| 2000Q2 | -0.000791 | 0.468495 | -1.247954 | -0.708676 |
| 2000Q3 | -0.247497 | 0.067101 | 0.512545 | 0.103532 |
| 2000Q4 | -1.126654 | -0.190198 | 0.416364 | -0.046905 |
| 2001Q1 | -1.081078 | -0.259754 | -0.861197 | -0.235051 |
| 2001Q2 | 0.010989 | 0.004505 | -1.142694 | 0.922503 |
| 2001Q3 | 0.843994 | -0.283601 | -0.942034 | 0.787534 |
| 2001Q4 | -0.211269 | -0.200959 | 0.103619 | -0.239605 |


In [45]:
# Q-DEC: quarterly, with the year ending in December

# Note: this output differs from the book. The default changed from
# convention='end' to convention='start', and 'start' now behaves span-like.
# The same applies to the cells that follow.
annual_frame.resample('Q-DEC', convention='start').ffill()

| | Colorado | Texas | New York | Ohio |
|---|---|---|---|---|
| 2000Q1 | 0.544178 | -0.043054 | 0.514649 | -1.900355 |
| 2000Q2 | -0.000791 | 0.468495 | -1.247954 | -0.708676 |
| 2000Q3 | -0.247497 | 0.067101 | 0.512545 | 0.103532 |
| 2000Q4 | -1.126654 | -0.190198 | 0.416364 | -0.046905 |
| 2001Q1 | -1.081078 | -0.259754 | -0.861197 | -0.235051 |
| 2001Q2 | 0.010989 | 0.004505 | -1.142694 | 0.922503 |
| 2001Q3 | 0.843994 | -0.283601 | -0.942034 | 0.787534 |
| 2001Q4 | -0.211269 | -0.200959 | 0.103619 | -0.239605 |


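Because periods refer to time spans, the rules for upsampling and downsampling are stricter:

- In downsampling, the target frequency must be a subperiod of the source frequency
- In upsampling, the target frequency must be a superperiod of the source frequency

If these conditions are not met, an exception is raised. This mainly affects quarterly, annual, and weekly frequencies.

For example, the time spans defined by Q-MAR only line up with A-MAR, A-JUN, A-SEP, and A-DEC: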

In [47]:
annual_frame.resample('Q-MAR').ffill()

| | Colorado | Texas | New York | Ohio |
|---|---|---|---|---|
| 2000Q4 | 0.544178 | -0.043054 | 0.514649 | -1.900355 |
| 2001Q1 | -0.000791 | 0.468495 | -1.247954 | -0.708676 |
| 2001Q2 | -0.247497 | 0.067101 | 0.512545 | 0.103532 |
| 2001Q3 | -1.126654 | -0.190198 | 0.416364 | -0.046905 |
| 2001Q4 | -1.081078 | -0.259754 | -0.861197 | -0.235051 |
| 2002Q1 | 0.010989 | 0.004505 | -1.142694 | 0.922503 |
| 2002Q2 | 0.843994 | -0.283601 | -0.942034 | 0.787534 |
| 2002Q3 | -0.211269 | -0.200959 | 0.103619 | -0.239605 |

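The labels in the Q-MAR output can look surprising: the first row moves from 2000Q1 to 2000Q4 even though the values are unchanged. A small sketch with a single `Period` suggests why. Under Q-MAR the fiscal year ends in March, so the calendar span January to March 2000 is the fourth quarter of fiscal year 2000. The variable names below are my own.

```python
import pandas as pd

# Calendar quarter January-March 2000 under a December year end
q_dec = pd.Period('2000Q1', freq='Q-DEC')

# The same span relabeled under a March year end
q_mar = q_dec.asfreq('Q-MAR')

print(q_dec, q_dec.start_time.date(), q_dec.end_time.date())
print(q_mar, q_mar.start_time.date(), q_mar.end_time.date())
```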