## Pandas: Resampling and Frequency Conversion

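Overview: resampling is the process of converting a time series from one frequency to another.

Categories:

* Downsampling: aggregating higher-frequency data to a lower frequency (for example, converting monthly data to quarterly data)
* Upsampling: converting lower-frequency data to a higher frequency (for example, converting quarterly data to monthly data)
* Other resampling that is neither of the two (for example, converting W-WED to W-THU).

Resampling is mainly done with the resample function: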
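
The first two categories can be sketched with a small made-up series. The data and the variable names (`monthly`, `quarterly`, `monthly_again`) are assumptions for illustration only, and the `'MS'` and `'QS'` aliases (month start, quarter start) are used here instead of the month-end alias.

```python
import pandas as pd

# Hypothetical monthly data: six month-start timestamps with values 1..6
monthly = pd.Series(
    [1, 2, 3, 4, 5, 6],
    index=pd.date_range('2020-01-01', periods=6, freq='MS'),
)

# Downsampling: aggregate months into quarters by summing
quarterly = monthly.resample('QS').sum()

# Upsampling: go back to monthly frequency and forward-fill the new rows
monthly_again = quarterly.resample('MS').ffill()

print(quarterly)
print(monthly_again)
```

Downsampling needs an aggregation (here `sum`), while upsampling only creates new rows that must then be filled.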

resample(self, rule, axis=0, closed: 'Optional[str]' = None, label: 'Optional[str]' = None, convention: 'str' = 'start', kind: 'Optional[str]' = None, loffset=None, base: 'Optional[int]' = None, on=None, level=None, origin: 'Union[str, TimestampConvertibleTypes]' = 'start_day', offset: 'Optional[TimedeltaConvertibleTypes]' = None) -> 'Resampler'

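* rule: the resampling rule, i.e. the target frequency (an offset string such as 'M' or '5min', or a DateOffset/Timedelta object)
* axis: the axis to resample along; defaults to axis 0
* fill_method: how to fill values when upsampling, such as ffill or bfill. It does not appear in the signature above; filling is done by calling ffill() or bfill() on the result of resample(), and nothing is filled by default.
* ohlc: a method of the resampled object rather than a parameter; it computes the four values open, high, low, and close for each bin
* Other parameters: see help(pd.DataFrame.resample)

* how: the aggregation applied after resampling. It is also absent from the signature above; aggregation is written as a method call such as resample().mean().
**No aggregation is applied by default, so a method such as mean, first, last, max, or min must be called explicitly.**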
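
As a minimal sketch of the method-call style described above, the following uses a small made-up daily series (`daily` is a hypothetical name) and groups it into 5-day bins. The aggregation is chosen by the method called on the object returned by `resample`.

```python
import numpy as np
import pandas as pd

# Hypothetical data: ten daily observations with values 0..9
daily = pd.Series(
    np.arange(10),
    index=pd.date_range('2018-01-01', periods=10, freq='D'),
)

bins = daily.resample('5D')        # a Resampler object; nothing is computed yet

sums = bins.sum()                  # one aggregation per call
means = bins.mean()
extremes = bins.agg(['min', 'max'])  # several aggregations at once -> DataFrame

print(sums)
print(means)
print(extremes)
```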

In [5]:
import numpy as np
import pandas as pd

In [6]:
# Sample data
rng = pd.date_range('2018-1-1',periods=100,freq='D')
ts = pd.Series(np.random.randn(len(rng)),index=rng)
ts

2018-01-01    0.553903
2018-01-02   -0.034592
2018-01-03   -0.381743
2018-01-04   -0.531937
2018-01-05   -1.739332
                ...   
2018-04-06    1.862778
2018-04-07   -1.581180
2018-04-08    0.055110
2018-04-09    1.401881
2018-04-10   -0.986895
Freq: D, Length: 100, dtype: float64

### 1. Basic usage of resample

In [7]:
# Convert daily frequency to monthly frequency
ts.resample('M').mean()

2018-01-31   -0.014873
2018-02-28    0.396786
2018-03-31   -0.164874
2018-04-30    0.575431
Freq: M, dtype: float64

In [10]:
# Convert to period frequency
ts.resample('M',kind='period').mean()

2018-01   -0.014873
2018-02    0.396786
2018-03   -0.164874
2018-04    0.575431
Freq: M, dtype: float64
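
If the `kind` argument is unavailable or deprecated in your pandas version, a similar period-indexed result can be obtained by resampling first and converting the index afterwards. This is a sketch under that assumption; it uses deterministic made-up values and the `'MS'` (month start) alias rather than the month-end alias used above.

```python
import numpy as np
import pandas as pd

# Same date range as the sample data, but with values 0..99 for reproducibility
rng = pd.date_range('2018-01-01', periods=100, freq='D')
ts = pd.Series(np.arange(100.0), index=rng)

# Monthly means labelled by month-start timestamps
monthly = ts.resample('MS').mean()

# Convert the timestamp labels to monthly periods
monthly_periods = monthly.to_period('M')

print(monthly_periods)
```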

### 2. Downsampling

In [19]:
# Date range with a frequency of 1 minute
rng2 = pd.date_range('2019-1-1',periods=12,freq='T')
ts2=pd.Series(np.arange(12),index=rng2)
ts2

2019-01-01 00:00:00     0
2019-01-01 00:01:00     1
2019-01-01 00:02:00     2
2019-01-01 00:03:00     3
2019-01-01 00:04:00     4
2019-01-01 00:05:00     5
2019-01-01 00:06:00     6
2019-01-01 00:07:00     7
2019-01-01 00:08:00     8
2019-01-01 00:09:00     9
2019-01-01 00:10:00    10
2019-01-01 00:11:00    11
Freq: T, dtype: int64

In [12]:
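# Aggregate the data into 5-minute bins by counting the observations in each bin
# closed='right' makes each bin include its right edge; the result is labelled by the bin edges (the left edge by default)
# This produces one extra bin, (2018-12-31 23:55:00, 2019-01-01 00:00:00],
# because 2019-01-01 00:00:00 does not fall into the bin that starts at 00:00:00, so a bin is added before it.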
ts2.resample('5min',closed='right').count()

2018-12-31 23:55:00    1
2019-01-01 00:00:00    5
2019-01-01 00:05:00    5
2019-01-01 00:10:00    1
Freq: 5T, dtype: int64

In [14]:
# With closed='left', each bin includes its left edge
ts2.resample('5min',closed='left').sum()

2019-01-01 00:00:00    10
2019-01-01 00:05:00    35
2019-01-01 00:10:00    21
Freq: 5T, dtype: int64

In [15]:
# With closed='right' and label='right', each bin includes its right edge and is labelled by it
ts2.resample('5min',closed='right',label='right').sum()

2019-01-01 00:00:00     0
2019-01-01 00:05:00    15
2019-01-01 00:10:00    40
2019-01-01 00:15:00    11
Freq: 5T, dtype: int64
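
The effect of `closed` and `label` can be compared side by side. The sketch below rebuilds the 12-minute series and loops over three combinations; the `results` dictionary and the loop are illustrative additions, not part of the original notebook.

```python
import numpy as np
import pandas as pd

# One value per minute, 0..11, starting at 2019-01-01 00:00
rng2 = pd.date_range('2019-01-01', periods=12, freq='min')
ts2 = pd.Series(np.arange(12), index=rng2)

results = {}
for closed, label in [('left', 'left'), ('right', 'left'), ('right', 'right')]:
    # closed: which bin edge is inclusive; label: which edge names the bin
    results[(closed, label)] = ts2.resample('5min', closed=closed, label=label).sum()
    print(f"closed={closed!r}, label={label!r}")
    print(results[(closed, label)])
```

`closed` changes which observations land in each bin (and therefore the sums), while `label` only changes the timestamps shown in the index.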

In [16]:
# loffset='-1s' shifts the result labels back by one second
ts2.resample('5min',closed='right',label='right',loffset='-1s').sum()


pandas prints a FutureWarning for this call because `loffset` is deprecated. The warning gives the following migration example:

>>> df.resample(freq="3s", loffset="8H")

becomes:

>>> from pandas.tseries.frequencies import to_offset
>>> df = df.resample(freq="3s").mean()
>>> df.index = df.index.to_timestamp() + to_offset("8H")


2018-12-31 23:59:59     0
2019-01-01 00:04:59    15
2019-01-01 00:09:59    40
2019-01-01 00:14:59    11
Freq: 5T, dtype: int64
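
The same labels can be produced without `loffset` by shifting the index after resampling. This is a minimal sketch; it subtracts a `pd.Timedelta` instead of using `to_offset`, which is a substitution made here for simplicity, and `shifted` is a hypothetical name.

```python
import numpy as np
import pandas as pd

rng2 = pd.date_range('2019-01-01', periods=12, freq='min')
ts2 = pd.Series(np.arange(12), index=rng2)

# Resample first, without any label offset
shifted = ts2.resample('5min', closed='right', label='right').sum()

# Then move every label back by one second
shifted.index = shifted.index - pd.Timedelta(seconds=1)

print(shifted)
```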

### 3. OHLC resampling

In [18]:
ts2.resample('5min').ohlc()

|  | open | high | low | close |
| --- | --- | --- | --- | --- |
| 2019-01-01 00:00:00 | 0 | 4 | 0 | 4 |
| 2019-01-01 00:05:00 | 5 | 9 | 5 | 9 |
| 2019-01-01 00:10:00 | 10 | 11 | 10 | 11 |
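
To see what `ohlc()` computes, it can be compared with four explicit aggregations per bin: first, max, min, and last. The comparison below is an illustrative sketch; `bars` and `manual` are hypothetical names.

```python
import numpy as np
import pandas as pd

rng2 = pd.date_range('2019-01-01', periods=12, freq='min')
ts2 = pd.Series(np.arange(12), index=rng2)

# Built-in open/high/low/close per 5-minute bin
bars = ts2.resample('5min').ohlc()

# The same four numbers computed explicitly
manual = ts2.resample('5min').agg(['first', 'max', 'min', 'last'])
manual.columns = ['open', 'high', 'low', 'close']

print(bars)
print((bars.values == manual.values).all())
```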


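### 4. Upsampling and interpolation

* Converting data from a lower frequency to a higher frequency requires no aggregation
* The asfreq method converts to the higher frequency without any aggregation
* Filling and interpolation are done with methods of the resampled object, such as ffill

* Resampling data with a period index is similar to resampling with timestamps
* Because periods refer to time spans, the rules for upsampling and downsampling are stricter

**Basic rules:**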

In [22]:
df4 = pd.DataFrame(np.random.randn(2,4),
                  index=pd.date_range('1/1/2021',periods=2,freq='W-WED'),
                  columns=['北京','广州','上海','深圳']
                  )
df4

|  | 北京 | 广州 | 上海 | 深圳 |
| --- | --- | --- | --- | --- |
| 2021-01-06 | 0.85981 | 0.990167 | 2.056032 | -1.942216 |
| 2021-01-13 | 0.802784 | 0.959297 | -0.440185 | -1.049184 |


In [24]:
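# Going from a lower to a higher frequency needs no aggregation, but it introduces missing values
# asfreq() converts to the higher frequency without aggregating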
df_daily = df4.resample('D').asfreq()
df_daily

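|  | 北京 | 广州 | 上海 | 深圳 |
| --- | --- | --- | --- | --- |
| 2021-01-06 | 0.85981 | 0.990167 | 2.056032 | -1.942216 |
| 2021-01-07 | NaN | NaN | NaN | NaN |
| 2021-01-08 | NaN | NaN | NaN | NaN |
| 2021-01-09 | NaN | NaN | NaN | NaN |
| 2021-01-10 | NaN | NaN | NaN | NaN |
| 2021-01-11 | NaN | NaN | NaN | NaN |
| 2021-01-12 | NaN | NaN | NaN | NaN |
| 2021-01-13 | 0.802784 | 0.959297 | -0.440185 | -1.049184 |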


In [25]:
# Use methods of the resampled object to fill and interpolate; here, forward fill
df4.resample('D').ffill()

|  | 北京 | 广州 | 上海 | 深圳 |
| --- | --- | --- | --- | --- |
| 2021-01-06 | 0.85981 | 0.990167 | 2.056032 | -1.942216 |
| 2021-01-07 | 0.85981 | 0.990167 | 2.056032 | -1.942216 |
| 2021-01-08 | 0.85981 | 0.990167 | 2.056032 | -1.942216 |
| 2021-01-09 | 0.85981 | 0.990167 | 2.056032 | -1.942216 |
| 2021-01-10 | 0.85981 | 0.990167 | 2.056032 | -1.942216 |
| 2021-01-11 | 0.85981 | 0.990167 | 2.056032 | -1.942216 |
| 2021-01-12 | 0.85981 | 0.990167 | 2.056032 | -1.942216 |
| 2021-01-13 | 0.802784 | 0.959297 | -0.440185 | -1.049184 |


In [26]:
# Fill only part of the gap: at most 2 rows forward
df4.resample('D').ffill(limit=2)

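|  | 北京 | 广州 | 上海 | 深圳 |
| --- | --- | --- | --- | --- |
| 2021-01-06 | 0.85981 | 0.990167 | 2.056032 | -1.942216 |
| 2021-01-07 | 0.85981 | 0.990167 | 2.056032 | -1.942216 |
| 2021-01-08 | 0.85981 | 0.990167 | 2.056032 | -1.942216 |
| 2021-01-09 | NaN | NaN | NaN | NaN |
| 2021-01-10 | NaN | NaN | NaN | NaN |
| 2021-01-11 | NaN | NaN | NaN | NaN |
| 2021-01-12 | NaN | NaN | NaN | NaN |
| 2021-01-13 | 0.802784 | 0.959297 | -0.440185 | -1.049184 |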
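
The cells above show filling; interpolation works the same way through a method on the resampled object. The sketch below uses made-up values (0.0 and 7.0, in a hypothetical `value` column) so that the linear result is easy to read.

```python
import pandas as pd

# Two weekly observations on Wednesdays, one week apart
idx = pd.date_range('2021-01-06', periods=2, freq='W-WED')
weekly = pd.DataFrame({'value': [0.0, 7.0]}, index=idx)

# asfreq: new daily rows are left as NaN
daily_nan = weekly.resample('D').asfreq()

# interpolate: new daily rows are filled linearly between the known points
daily_linear = weekly.resample('D').interpolate()

print(daily_nan)
print(daily_linear)
```

Unlike `ffill`, which repeats the last known value, `interpolate` here draws a straight line between the two weekly observations.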


In [27]:
# Change the frequency anchor from Wednesday to Thursday
df4.resample('W-THU').ffill()

|  | 北京 | 广州 | 上海 | 深圳 |
| --- | --- | --- | --- | --- |
| 2021-01-07 | 0.85981 | 0.990167 | 2.056032 | -1.942216 |
| 2021-01-14 | 0.802784 | 0.959297 | -0.440185 | -1.049184 |


### 5. Resampling with periods

In [30]:
df5 = pd.DataFrame(np.random.randn(24,4),
                  index=pd.period_range('1-2020','12-2021',freq='M'),
                  columns=['北京','广州','上海','深圳']
                  )
df5.head()

|  | 北京 | 广州 | 上海 | 深圳 |
| --- | --- | --- | --- | --- |
| 2020-01 | -0.701865 | 0.541006 | 0.423853 | 0.300178 |
| 2020-02 | 0.759198 | -1.276694 | 0.869328 | 2.28317 |
| 2020-03 | -0.335607 | 0.437015 | -0.269833 | 1.322394 |
| 2020-04 | 0.635993 | 0.138992 | -2.301968 | -1.284577 |
| 2020-05 | -0.072213 | 0.140781 | -1.941532 | -0.347236 |


In [32]:
# Resampling data with a period index works much like resampling with timestamps
annual_df5 = df5.resample('A-DEC').mean()
annual_df5

|  | 北京 | 广州 | 上海 | 深圳 |
| --- | --- | --- | --- | --- |
| 2020 | 0.041654 | -0.609221 | -0.334578 | 0.306583 |
| 2021 | -0.123737 | -0.223495 | 0.629238 | 0.060333 |


In [33]:
annual_df5.resample('Q-DEC').ffill()

|  | 北京 | 广州 | 上海 | 深圳 |
| --- | --- | --- | --- | --- |
| 2020Q1 | 0.041654 | -0.609221 | -0.334578 | 0.306583 |
| 2020Q2 | 0.041654 | -0.609221 | -0.334578 | 0.306583 |
| 2020Q3 | 0.041654 | -0.609221 | -0.334578 | 0.306583 |
| 2020Q4 | 0.041654 | -0.609221 | -0.334578 | 0.306583 |
| 2021Q1 | -0.123737 | -0.223495 | 0.629238 | 0.060333 |
| 2021Q2 | -0.123737 | -0.223495 | 0.629238 | 0.060333 |
| 2021Q3 | -0.123737 | -0.223495 | 0.629238 | 0.060333 |
| 2021Q4 | -0.123737 | -0.223495 | 0.629238 | 0.060333 |


In [35]:
# convention='end' places each annual value in the last quarter of its year before forward filling
annual_df5.resample('Q-DEC',convention='end').ffill()

|  | 北京 | 广州 | 上海 | 深圳 |
| --- | --- | --- | --- | --- |
| 2020Q4 | 0.041654 | -0.609221 | -0.334578 | 0.306583 |
| 2021Q1 | 0.041654 | -0.609221 | -0.334578 | 0.306583 |
| 2021Q2 | 0.041654 | -0.609221 | -0.334578 | 0.306583 |
| 2021Q3 | 0.041654 | -0.609221 | -0.334578 | 0.306583 |
| 2021Q4 | -0.123737 | -0.223495 | 0.629238 | 0.060333 |


In [36]:
annual_df5.resample('Q-MAR').ffill()

|  | 北京 | 广州 | 上海 | 深圳 |
| --- | --- | --- | --- | --- |
| 2020Q4 | 0.041654 | -0.609221 | -0.334578 | 0.306583 |
| 2021Q1 | 0.041654 | -0.609221 | -0.334578 | 0.306583 |
| 2021Q2 | 0.041654 | -0.609221 | -0.334578 | 0.306583 |
| 2021Q3 | 0.041654 | -0.609221 | -0.334578 | 0.306583 |
| 2021Q4 | -0.123737 | -0.223495 | 0.629238 | 0.060333 |
| 2022Q1 | -0.123737 | -0.223495 | 0.629238 | 0.060333 |
| 2022Q2 | -0.123737 | -0.223495 | 0.629238 | 0.060333 |
| 2022Q3 | -0.123737 | -0.223495 | 0.629238 | 0.060333 |
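
The quarter labels in these outputs follow from how months map to quarters under each quarterly frequency. The sketch below checks the first and last month of the data with `Period.asfreq`; it is an illustration of the labelling only, not of what `resample` does internally, and the variable names are hypothetical.

```python
import pandas as pd

first_month = pd.Period('2020-01', freq='M')   # first month covered by year 2020
last_month_2020 = pd.Period('2020-12', freq='M')
last_month = pd.Period('2021-12', freq='M')    # last month covered by year 2021

# Q-DEC: quarters of a year ending in December
q_dec_start = first_month.asfreq('Q-DEC')      # quarter containing January 2020
q_dec_end = last_month_2020.asfreq('Q-DEC')    # quarter containing December 2020

# Q-MAR: quarters of a fiscal year ending in March
q_mar_first = first_month.asfreq('Q-MAR')      # January 2020 in fiscal quarters
q_mar_last = last_month.asfreq('Q-MAR')        # December 2021 in fiscal quarters

print(q_dec_start, q_dec_end, q_mar_first, q_mar_last)
```

With Q-DEC, the year 2020 starts in 2020Q1 and ends in 2020Q4, which is where `convention='start'` and `convention='end'` place the annual value. With Q-MAR, January 2020 falls in 2020Q4 and December 2021 falls in 2022Q3, matching the first and last rows of the Q-MAR output.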


In [2]:
help(pd.DataFrame.resample)

Help on function resample in module pandas.core.generic:

resample(self, rule, axis=0, closed: 'Optional[str]' = None, label: 'Optional[str]' = None, convention: 'str' = 'start', kind: 'Optional[str]' = None, loffset=None, base: 'Optional[int]' = None, on=None, level=None, origin: 'Union[str, TimestampConvertibleTypes]' = 'start_day', offset: 'Optional[TimedeltaConvertibleTypes]' = None) -> 'Resampler'
    Resample time-series data.
    
    Convenience method for frequency conversion and resampling of time
    series. Object must have a datetime-like index (`DatetimeIndex`,
    `PeriodIndex`, or `TimedeltaIndex`), or pass datetime-like values
    to the `on` or `level` keyword.
    
    Parameters
    ----------
    rule : DateOffset, Timedelta or str
        The offset string or object representing target conversion.
    axis : {0 or 'index', 1 or 'columns'}, default 0
        Which axis to use for up- or down-sampling. For `Series` this
        will default to 0, i.e. along the rows