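Time series data is an important form of structured data in almost every field.

The main kinds of time series data are:
* Timestamps: specific instants in time
* Fixed periods: for example, the month of January 2017 or the full year 2016
* Intervals: indicated by a start timestamp and an end timestamp. Periods can be thought of as special cases of intervals.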

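## Date and Time Data Types and Tools

We mainly use Python's datetime, time, and calendar modules.

Data types in the datetime module:
* date   stores a calendar date (year, month, day) using the Gregorian calendar
* time   stores a time of day as hours, minutes, seconds, and microseconds
* datetime  stores both a date and a time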
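As a minimal sketch of the three types listed above, the snippet below builds one of each with the standard library only. The concrete dates and the variable names are illustrative assumptions.

```python
from datetime import date, time, datetime, timedelta

d = date(2017, 1, 15)          # calendar date only
t = time(22, 45, 30, 500)      # hour, minute, second, microsecond
dt = datetime.combine(d, t)    # date and time together

# Subtracting two datetime objects yields a timedelta
delta = datetime(2017, 2, 1) - datetime(2017, 1, 15, 12, 0)

print(dt)                          # 2017-01-15 22:45:30.000500
print(delta.days, delta.seconds)   # 16 43200
```

The difference between two datetime values is a timedelta, which stores days, seconds, and microseconds.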

In [1]:
from pandas import Series, DataFrame, Index, MultiIndex
from numpy.random import randn
import pandas as pd
import numpy as np
import random
import re

### Converting Between Strings and datetime

To parse strings into datetime objects, the third-party dateutil package is convenient.

In [2]:
from dateutil.parser import parse

parse("2011-02-12")

datetime.datetime(2011, 2, 12, 0, 0)

In [3]:
parse("Jan 31, 1998 10:45 PM")

datetime.datetime(1998, 1, 31, 22, 45)

In [4]:
parse("6/22/2015", dayfirst=True)  # dayfirst=True asks for day-before-month parsing; 22 cannot be a month, so dateutil falls back to month/day here

datetime.datetime(2015, 6, 22, 0, 0)
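When the format is known in advance, the standard library can convert in both directions without dateutil. The sketch below uses `strftime` and `strptime`; the format string and the sample timestamp are assumptions chosen for illustration.

```python
from datetime import datetime

stamp = datetime(2011, 1, 3, 14, 30)

text = stamp.strftime("%Y-%m-%d %H:%M")               # datetime -> str
parsed = datetime.strptime(text, "%Y-%m-%d %H:%M")    # str -> datetime

print(text)              # 2011-01-03 14:30
print(parsed == stamp)   # True
```

An explicit format is stricter than `parse`: a string that does not match raises ValueError instead of being guessed at.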

pandas parses many different date formats with to_datetime.

In [5]:
datestrs = ["7/3/2011", "2017/09/10"]

In [6]:
pd.to_datetime(datestrs)

DatetimeIndex(['2011-07-03', '2017-09-10'], dtype='datetime64[ns]', freq=None)

In [7]:
# handling missing values
idx = pd.to_datetime(datestrs + [None]); idx

DatetimeIndex(['2011-07-03', '2017-09-10', 'NaT'], dtype='datetime64[ns]', freq=None)

In [8]:
idx[2]

NaT

In [9]:
pd.isnull(idx)

array([False, False,  True], dtype=bool)

NaT (Not a Time) is pandas's NA value for timestamp data.
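NaT also appears when parsing fails, provided `to_datetime` is asked to coerce errors rather than raise. A small sketch, with made-up input strings:

```python
import pandas as pd

raw = ["2011-07-06", "not a date", None]
idx = pd.to_datetime(raw, errors="coerce")   # unparseable entries become NaT

mask = pd.isna(idx)      # boolean array marking the NaT positions
valid = idx[~mask]       # keep only the real timestamps

print(list(mask))        # [False, True, True]
print(len(valid))        # 1
```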

## Time Series Basics

The most basic kind of time series in pandas is a Series indexed by timestamps, which are usually given as Python strings or datetime objects.

In [10]:
from datetime import datetime

In [11]:
dates = [
    datetime(2011, 1, 2),
    datetime(2011, 1, 5),
    datetime(2011, 1, 7),
    datetime(2011, 1, 8),
    datetime(2011, 1, 10),
    datetime(2011, 1, 12),
]

In [12]:
ts = Series(np.random.randn(6), index=dates)

In [13]:
ts

2011-01-02   -1.334551
2011-01-05    1.549368
2011-01-07    1.169218
2011-01-08   -0.136806
2011-01-10   -2.556776
2011-01-12   -0.589145
dtype: float64

In [14]:
type(ts)  # the datetime objects are actually stored in a DatetimeIndex

pandas.core.series.Series

In [15]:
ts.index

DatetimeIndex(['2011-01-02', '2011-01-05', '2011-01-07', '2011-01-08',
               '2011-01-10', '2011-01-12'],
              dtype='datetime64[ns]', freq=None)

In [16]:
stamp = ts.index[0]  # scalar values from a DatetimeIndex are pandas Timestamp objects

In [17]:
stamp

Timestamp('2011-01-02 00:00:00')

### Indexing, Selection, and Subsetting

In [18]:
# indexing
stamp = ts.index[2]
ts[stamp]

1.1692183523622377

In [19]:
ts["01/05/2011"]  # a more direct and convenient form

1.5493681019981875

In [20]:
ts["2011/01/05"]  # the string is parsed automatically

1.5493681019981875

For longer time series, passing just a year, or a year and month, easily selects a slice of the data.

In [21]:
longer_ts = Series(np.random.randn(1000), index=pd.date_range("2001/02/12", periods=1000))

In [22]:
longer_ts

2001-02-12    0.609556
2001-02-13   -0.400261
2001-02-14   -0.226404
2001-02-15    1.011243
2001-02-16    1.854372
2001-02-17   -0.576325
2001-02-18   -0.835005
2001-02-19   -0.539963
2001-02-20   -0.961637
2001-02-21    0.198119
2001-02-22    1.448315
2001-02-23   -0.181188
2001-02-24    0.282050
2001-02-25    1.897249
2001-02-26   -1.284542
2001-02-27   -0.053282
2001-02-28   -0.251129
2001-03-01   -0.228706
2001-03-02    1.346402
2001-03-03   -1.757072
2001-03-04    0.595497
2001-03-05    0.232608
2001-03-06    1.096379
2001-03-07    0.033021
2001-03-08   -1.963730
2001-03-09   -0.305389
2001-03-10   -0.700585
2001-03-11   -0.133695
2001-03-12   -0.760396
2001-03-13   -0.351255
                ...   
2003-10-10   -0.356059
2003-10-11    3.620784
2003-10-12   -0.200641
2003-10-13    0.992358
2003-10-14    1.749932
2003-10-15    0.554820
2003-10-16   -0.018050
2003-10-17    0.093952
2003-10-18    0.681002
2003-10-19    0.634408
2003-10-20   -0.998447
2003-10-21   -0.874929
2003-10-22 

In [23]:
longer_ts["2003"]

2003-01-01    0.704333
2003-01-02   -0.426601
2003-01-03   -0.601953
2003-01-04    0.278257
2003-01-05    0.233329
2003-01-06    0.671001
2003-01-07   -0.115926
2003-01-08   -0.934724
2003-01-09    1.843406
2003-01-10    0.289212
2003-01-11    0.934848
2003-01-12    0.973205
2003-01-13    0.289062
2003-01-14    0.793975
2003-01-15    1.451616
2003-01-16    2.852414
2003-01-17   -0.046855
2003-01-18   -0.365119
2003-01-19    0.833437
2003-01-20    1.047999
2003-01-21   -0.132501
2003-01-22   -0.401536
2003-01-23   -0.234607
2003-01-24    1.033119
2003-01-25   -0.525152
2003-01-26   -0.345850
2003-01-27    0.406763
2003-01-28    0.786329
2003-01-29   -1.124362
2003-01-30   -0.144908
                ...   
2003-10-10   -0.356059
2003-10-11    3.620784
2003-10-12   -0.200641
2003-10-13    0.992358
2003-10-14    1.749932
2003-10-15    0.554820
2003-10-16   -0.018050
2003-10-17    0.093952
2003-10-18    0.681002
2003-10-19    0.634408
2003-10-20   -0.998447
2003-10-21   -0.874929
2003-10-22 

In [24]:
longer_ts["2002-02"]

2002-02-01   -0.662727
2002-02-02    0.961987
2002-02-03   -0.270974
2002-02-04   -0.627381
2002-02-05    1.089119
2002-02-06    0.166114
2002-02-07    0.335072
2002-02-08    0.991560
2002-02-09   -0.055388
2002-02-10   -0.417313
2002-02-11   -0.393176
2002-02-12   -0.001779
2002-02-13   -0.692764
2002-02-14    0.081315
2002-02-15   -0.256343
2002-02-16   -0.428281
2002-02-17   -1.480918
2002-02-18    0.864796
2002-02-19    0.027399
2002-02-20   -0.931539
2002-02-21   -1.415464
2002-02-22   -1.849608
2002-02-23    0.407225
2002-02-24    0.928208
2002-02-25    0.832352
2002-02-26   -0.209724
2002-02-27    0.422738
2002-02-28    1.427698
Freq: D, dtype: float64
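The partial-string selections above can be checked on a deterministic series. This sketch rebuilds a 1000-day index with the same start date and uses `.loc`, which is an assumption made here for compatibility with newer pandas versions; the values are simply 0 to 999.

```python
import numpy as np
import pandas as pd

s = pd.Series(np.arange(1000), index=pd.date_range("2001-02-12", periods=1000))

year_2002 = s.loc["2002"]      # every row whose timestamp falls in 2002
feb_2002 = s.loc["2002-02"]    # every row in February 2002

print(len(year_2002), len(feb_2002))   # 365 28
```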

Slicing with dates works just as it does for a regular Series.

In [25]:
ts[datetime(2011, 1, 7):]

2011-01-07    1.169218
2011-01-08   -0.136806
2011-01-10   -2.556776
2011-01-12   -0.589145
dtype: float64

Because most time series are sorted chronologically, you can also slice with timestamps that are not contained in the series.

In [26]:
ts

2011-01-02   -1.334551
2011-01-05    1.549368
2011-01-07    1.169218
2011-01-08   -0.136806
2011-01-10   -2.556776
2011-01-12   -0.589145
dtype: float64

In [27]:
ts["2011/01/05":"2011/01/11"]

2011-01-05    1.549368
2011-01-07    1.169218
2011-01-08   -0.136806
2011-01-10   -2.556776
dtype: float64

An equivalent instance method, truncate, also returns the part of a time series between two dates.

In [28]:
ts.truncate(after="2011/01/09")

2011-01-02   -1.334551
2011-01-05    1.549368
2011-01-07    1.169218
2011-01-08   -0.136806
dtype: float64
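To make the equivalence concrete, the sketch below slices the same six dates with bounds that do not appear in the index and compares the result with `truncate`. The integer values are placeholders.

```python
import pandas as pd

dates = pd.to_datetime(["2011-01-02", "2011-01-05", "2011-01-07",
                        "2011-01-08", "2011-01-10", "2011-01-12"])
s = pd.Series(range(6), index=dates)

# Neither bound exists in the index; both are still valid slice endpoints
by_slice = s.loc["2011-01-03":"2011-01-09"]
by_truncate = s.truncate(before="2011-01-03", after="2011-01-09")

print(by_slice.tolist())              # [1, 2, 3]
print(by_slice.equals(by_truncate))   # True
```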

All of the above works on a DataFrame as well.

In [29]:
dates = pd.date_range("2000/02/01", periods=100, freq="W-WED")

In [30]:
long_df = DataFrame(np.random.randn(100, 4), index=dates, columns=["Colorado", "Texas", "New York", "Ohio"])

In [31]:
long_df

| | Colorado | Texas | New York | Ohio |
|---|---|---|---|---|
| 2000-02-02 | -0.934443 | 0.762153 | 1.508269 | 0.411377 |
| 2000-02-09 | -2.062479 | 1.496819 | 0.201033 | -0.967268 |
| 2000-02-16 | 0.592308 | 0.236118 | 1.045408 | -0.923909 |
| 2000-02-23 | -0.976214 | 0.120006 | -1.695422 | 0.351222 |
| 2000-03-01 | -0.797758 | -0.495477 | 0.274699 | 0.513870 |
| 2000-03-08 | -0.053780 | 0.461306 | 0.432507 | -0.297747 |
| 2000-03-15 | -0.563562 | 0.121814 | 0.500647 | -0.691014 |
| 2000-03-22 | -0.429164 | 1.778290 | -1.140049 | 1.102834 |
| 2000-03-29 | 0.784836 | -0.647859 | -2.489837 | -0.611840 |
| 2000-04-05 | 1.554708 | 0.194259 | 0.322817 | -0.034653 |


In [32]:
long_df.loc["5-2001"]

| | Colorado | Texas | New York | Ohio |
|---|---|---|---|---|
| 2001-05-02 | 0.76681 | 0.291711 | -1.271703 | 0.509743 |
| 2001-05-09 | 0.327573 | -0.023593 | -0.18423 | -0.639019 |
| 2001-05-16 | -0.117698 | -0.049667 | -0.775719 | 0.034095 |
| 2001-05-23 | -0.739998 | 1.855835 | -1.079322 | -0.940804 |
| 2001-05-30 | -0.571286 | 0.162124 | 1.905543 | 1.339196 |


### Time Series with Duplicate Indices

In [33]:
dates = pd.DatetimeIndex([
    "2000/1/1", "2000/2/1",
    "2000/2/1", "2000/2/1",
    "2000/3/1",
])

In [34]:
dup_ts = Series(np.arange(5), index=dates)

In [35]:
dup_ts

2000-01-01    0
2000-02-01    1
2000-02-01    2
2000-02-01    3
2000-03-01    4
dtype: int32

In [36]:
dup_ts.index.is_unique

False

In [37]:
dup_ts["2000/1/1"]  # not duplicated

0

In [38]:
dup_ts["2000/02/01"]  # duplicated

2000-02-01    1
2000-02-01    2
2000-02-01    3
dtype: int32

Data with non-unique timestamps can be aggregated with groupby by passing level=0.

In [39]:
dup_ts.groupby(level=0).mean()

2000-01-01    0
2000-02-01    2
2000-03-01    4
dtype: int32
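Aggregation is one way to deal with duplicated timestamps; counting or dropping them are others. The sketch below shows both on the same five dates, using `Index.duplicated` to keep the first row per timestamp. Which row to keep is a choice made here for illustration only.

```python
import pandas as pd

dates = pd.DatetimeIndex(["2000-01-01", "2000-02-01", "2000-02-01",
                          "2000-02-01", "2000-03-01"])
dup = pd.Series(range(5), index=dates)

counts = dup.groupby(level=0).count()              # observations per timestamp
first = dup[~dup.index.duplicated(keep="first")]   # keep the first row of each

print(counts.tolist())   # [1, 3, 1]
print(first.tolist())    # [0, 1, 4]
```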

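## Date Ranges, Frequencies, and Shifting

Time series in pandas are generally assumed to be irregular, that is, to have no fixed frequency.

Still, we often want to analyze data at a fixed frequency, such as daily, monthly, or every 15 minutes.

A time series can be converted to one with a fixed frequency.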

In [40]:
ts

2011-01-02   -1.334551
2011-01-05    1.549368
2011-01-07    1.169218
2011-01-08   -0.136806
2011-01-10   -2.556776
2011-01-12   -0.589145
dtype: float64

In [41]:
ts.resample("D")

DatetimeIndexResampler [freq=<Day>, axis=0, closed=left, label=left, convention=start, base=0]
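Calling `resample` alone only returns a lazy resampler object, as the output above shows. A conversion actually happens once a method such as `asfreq` or `ffill` is applied. The sketch below uses the same six irregular dates with placeholder values.

```python
import pandas as pd

dates = pd.to_datetime(["2011-01-02", "2011-01-05", "2011-01-07",
                        "2011-01-08", "2011-01-10", "2011-01-12"])
s = pd.Series([0.0, 1.0, 2.0, 3.0, 4.0, 5.0], index=dates)

daily = s.resample("D").asfreq()   # one row per calendar day, NaN where nothing was observed
filled = s.resample("D").ffill()   # carry the last observation forward instead

print(len(daily), int(daily.isna().sum()))    # 11 5
print(filled.loc[pd.Timestamp("2011-01-03")])  # 0.0
```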

### Generating Date Ranges

In [42]:
index = pd.date_range("2012/1/4","2012/1/6")

In [43]:
index

DatetimeIndex(['2012-01-04', '2012-01-05', '2012-01-06'], dtype='datetime64[ns]', freq='D')

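By default, date_range generates daily timestamps. If you pass only a start date or only an end date, you must also pass the number of periods to generate.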

In [44]:
pd.date_range(start="2012/1/4", periods=20)

DatetimeIndex(['2012-01-04', '2012-01-05', '2012-01-06', '2012-01-07',
               '2012-01-08', '2012-01-09', '2012-01-10', '2012-01-11',
               '2012-01-12', '2012-01-13', '2012-01-14', '2012-01-15',
               '2012-01-16', '2012-01-17', '2012-01-18', '2012-01-19',
               '2012-01-20', '2012-01-21', '2012-01-22', '2012-01-23'],
              dtype='datetime64[ns]', freq='D')

In [45]:
pd.date_range(end="2012/1/6", periods=20)

DatetimeIndex(['2011-12-18', '2011-12-19', '2011-12-20', '2011-12-21',
               '2011-12-22', '2011-12-23', '2011-12-24', '2011-12-25',
               '2011-12-26', '2011-12-27', '2011-12-28', '2011-12-29',
               '2011-12-30', '2011-12-31', '2012-01-01', '2012-01-02',
               '2012-01-03', '2012-01-04', '2012-01-05', '2012-01-06'],
              dtype='datetime64[ns]', freq='D')

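The start and end dates define strict boundaries for the generated index.

To generate a date index made up of the last business day of each month, pass the "BM" (business end of month) frequency.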

In [46]:
pd.date_range("2000/1/1", "2000/12/1", freq="BM")

DatetimeIndex(['2000-01-31', '2000-02-29', '2000-03-31', '2000-04-28',
               '2000-05-31', '2000-06-30', '2000-07-31', '2000-08-31',
               '2000-09-29', '2000-10-31', '2000-11-30'],
              dtype='datetime64[ns]', freq='BM')

By default, date_range preserves the time of day of the start or end timestamp.

In [47]:
pd.date_range("2012/5/12 12:56:31", periods=5)

DatetimeIndex(['2012-05-12 12:56:31', '2012-05-13 12:56:31',
               '2012-05-14 12:56:31', '2012-05-15 12:56:31',
               '2012-05-16 12:56:31'],
              dtype='datetime64[ns]', freq='D')

Sometimes you want a set of timestamps normalized to midnight. The normalize option does this.

In [48]:
pd.date_range("2012/5/12 12:56:21", periods=5, normalize=True)

DatetimeIndex(['2012-05-12', '2012-05-13', '2012-05-14', '2012-05-15',
               '2012-05-16'],
              dtype='datetime64[ns]', freq='D')

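### Frequencies and Date Offsets

Frequencies in pandas are composed of a base frequency and a multiplier.

A base frequency is usually referred to by a string alias, such as "M" for monthly or "H" for hourly.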

In [49]:
from pandas.tseries.offsets import Hour, Minute

In [50]:
hour = Hour()

In [51]:
hour

<Hour>

In [52]:
four_hours = Hour(4)

In [53]:
four_hours

<4 * Hours>

In most cases you do not need to create these objects yourself; a string alias such as "M" or "4H" is enough.

In [54]:
pd.date_range("2000/1/1", "2000/1/3 23:59", freq="4h")

DatetimeIndex(['2000-01-01 00:00:00', '2000-01-01 04:00:00',
               '2000-01-01 08:00:00', '2000-01-01 12:00:00',
               '2000-01-01 16:00:00', '2000-01-01 20:00:00',
               '2000-01-02 00:00:00', '2000-01-02 04:00:00',
               '2000-01-02 08:00:00', '2000-01-02 12:00:00',
               '2000-01-02 16:00:00', '2000-01-02 20:00:00',
               '2000-01-03 00:00:00', '2000-01-03 04:00:00',
               '2000-01-03 08:00:00', '2000-01-03 12:00:00',
               '2000-01-03 16:00:00', '2000-01-03 20:00:00'],
              dtype='datetime64[ns]', freq='4H')

In [55]:
Hour(2) + Minute(20)  # most offset objects can be combined with addition

<140 * Minutes>

Similarly, you can pass a frequency string such as "2h30min".

In [56]:
pd.date_range("2000/1/1", periods=10, freq="1h30min")

DatetimeIndex(['2000-01-01 00:00:00', '2000-01-01 01:30:00',
               '2000-01-01 03:00:00', '2000-01-01 04:30:00',
               '2000-01-01 06:00:00', '2000-01-01 07:30:00',
               '2000-01-01 09:00:00', '2000-01-01 10:30:00',
               '2000-01-01 12:00:00', '2000-01-01 13:30:00'],
              dtype='datetime64[ns]', freq='90T')
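A frequency string and a sum of offset objects describe the same thing. The sketch below compares the two forms; `to_offset` from `pandas.tseries.frequencies` is used only to show how the string is interpreted.

```python
import pandas as pd
from pandas.tseries.frequencies import to_offset
from pandas.tseries.offsets import Hour, Minute

from_string = pd.date_range("2000-01-01", periods=4, freq="1h30min")
from_offset = pd.date_range("2000-01-01", periods=4, freq=Hour(1) + Minute(30))

parsed = to_offset("1h30min")            # the offset object behind the string

print(from_string.equals(from_offset))   # True
print(from_string[3])                    # 2000-01-01 04:30:00
```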

![date_range.png](./files/date_range.png)
![date_range2.png](./files/date_range2.png)

### WOM Dates

WOM (week of month) is a useful frequency class whose aliases start with WOM. It yields dates such as the third Friday of each month.

In [57]:
rng = pd.date_range("2012/1/1", "2012/9/1", freq="WOM-3FRI")

In [58]:
list(rng)

[Timestamp('2012-01-20 00:00:00', freq='WOM-3FRI'),
 Timestamp('2012-02-17 00:00:00', freq='WOM-3FRI'),
 Timestamp('2012-03-16 00:00:00', freq='WOM-3FRI'),
 Timestamp('2012-04-20 00:00:00', freq='WOM-3FRI'),
 Timestamp('2012-05-18 00:00:00', freq='WOM-3FRI'),
 Timestamp('2012-06-15 00:00:00', freq='WOM-3FRI'),
 Timestamp('2012-07-20 00:00:00', freq='WOM-3FRI'),
 Timestamp('2012-08-17 00:00:00', freq='WOM-3FRI')]
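One way to convince yourself that "WOM-3FRI" really means the third Friday is to check the weekday and the day of month of each generated date. This is only a sanity-check sketch:

```python
import pandas as pd

rng = pd.date_range("2012-01-01", "2012-09-01", freq="WOM-3FRI")

# A third Friday has weekday 4 (Monday=0) and falls on day 15 through 21
all_fridays = all(ts.weekday() == 4 for ts in rng)
all_third_week = all(15 <= ts.day <= 21 for ts in rng)

print(len(rng), all_fridays, all_third_week)   # 8 True True
```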

### Shifting (Leading and Lagging) Data

Shifting means moving data backward or forward along the time axis.

In [59]:
ts = Series(np.random.randn(4), index=pd.date_range("2000/1/1", periods=4, freq="M"))

In [60]:
ts

2000-01-31    1.112513
2000-02-29    0.404857
2000-03-31   -0.397940
2000-04-30   -1.846929
Freq: M, dtype: float64

In [61]:
ts.shift(2)

2000-01-31         NaN
2000-02-29         NaN
2000-03-31    1.112513
2000-04-30    0.404857
Freq: M, dtype: float64

In [62]:
ts.shift(-2)

2000-01-31   -0.397940
2000-02-29   -1.846929
2000-03-31         NaN
2000-04-30         NaN
Freq: M, dtype: float64

shift is commonly used to compute percent changes in one or more time series.

In [63]:
ts / ts.shift(1) -1 

2000-01-31         NaN
2000-02-29   -0.636087
2000-03-31   -1.982914
2000-04-30    3.641225
Freq: M, dtype: float64
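The shift-based formula gives the same numbers as the built-in `pct_change` method. The sketch below checks this on made-up prices; the index is built with a `MonthEnd` offset object rather than a string alias, an assumption made to stay independent of alias renames across pandas versions.

```python
import numpy as np
import pandas as pd

idx = pd.date_range("2000-01-31", periods=4, freq=pd.offsets.MonthEnd())
prices = pd.Series([100.0, 110.0, 99.0, 99.0], index=idx)

manual = prices / prices.shift(1) - 1   # formula from the text
builtin = prices.pct_change()           # pandas shortcut

print(np.allclose(manual, builtin, equal_nan=True))   # True
print(round(manual.iloc[1], 4))                       # 0.1
```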

A plain shift leaves the index unchanged, so some data is discarded.

You can also shift the timestamps themselves by passing a frequency.

In [64]:
ts.shift(2, freq="M")

2000-03-31    1.112513
2000-04-30    0.404857
2000-05-31   -0.397940
2000-06-30   -1.846929
Freq: M, dtype: float64

Other frequencies can be passed as well, which gives flexibility in how data is led or lagged.

In [65]:
ts.shift(3, freq="D")

2000-02-03    1.112513
2000-03-03    0.404857
2000-04-03   -0.397940
2000-05-03   -1.846929
dtype: float64
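The difference between the two kinds of shift is easiest to see side by side. In this sketch (values and names are placeholders) the plain shift introduces NaN, while the frequency-based shift moves the index and keeps every value.

```python
import pandas as pd
from pandas.tseries.offsets import MonthEnd

idx = pd.date_range("2000-01-31", periods=4, freq=MonthEnd())
s = pd.Series([1.0, 2.0, 3.0, 4.0], index=idx)

naive = s.shift(2)                      # values move, index stays: two NaNs appear
by_freq = s.shift(2, freq=MonthEnd())   # index moves, values stay: nothing is lost

print(int(naive.isna().sum()), int(by_freq.isna().sum()))   # 2 0
print(by_freq.index[0])                                     # 2000-03-31 00:00:00
```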

### Shifting Dates with Offsets

In [66]:
from pandas.tseries.offsets import Day, MonthEnd

now = datetime(2001, 11, 17)

In [67]:
now + 3 * Day(3)

Timestamp('2001-11-26 00:00:00')

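If the offset is anchored (MonthEnd, for example), the first increment rolls the date forward, or backward for a negative offset, to the nearest date that satisfies the frequency rule.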

In [68]:
now + MonthEnd()

Timestamp('2001-11-30 00:00:00')

In [69]:
now + MonthEnd(2)

Timestamp('2001-12-31 00:00:00')

The rollforward and rollback methods of an anchored offset explicitly roll a date forward or backward.

In [70]:
offset = MonthEnd();offset

<MonthEnd>

In [71]:
offset.rollforward(now)

Timestamp('2001-11-30 00:00:00')

In [72]:
offset.rollback(now)

Timestamp('2001-10-31 00:00:00')
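Rolling and adding differ when the date already sits on the anchor. In the sketch below the starting date is a month end, so both roll methods leave it alone, whereas adding the offset moves to the next month end.

```python
import pandas as pd
from pandas.tseries.offsets import MonthEnd

on_anchor = pd.Timestamp("2001-11-30")   # already a month end
offset = MonthEnd()

rolled_fwd = offset.rollforward(on_anchor)   # unchanged
rolled_back = offset.rollback(on_anchor)     # unchanged
added = on_anchor + offset                   # jumps to the next month end

print(rolled_fwd, rolled_back, added, sep="\n")
```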

A clever use of date offsets is to combine these two rolling methods with groupby.

In [73]:
ts = Series(np.random.randn(20), index=pd.date_range("2000/1/15", periods=20, freq="4d"))

In [74]:
ts.groupby(offset.rollforward).mean()

2000-01-31   -0.461608
2000-02-29    0.374462
2000-03-31   -0.151989
dtype: float64

Of course, the fastest way to do this is resample.

In [75]:
ts.resample("M").mean()

2000-01-31   -0.461608
2000-02-29    0.374462
2000-03-31   -0.151989
Freq: M, dtype: float64

## Time Zone Handling

In Python, time zone information comes from the third-party pytz library.

pandas integrates with that library.
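The notes stop at naming the library, so here is a minimal sketch of the two pandas methods typically involved: `tz_localize` attaches a zone to naive timestamps and `tz_convert` re-expresses them in another zone. The dates and the choice of zones are assumptions for illustration, and the backing time zone library may differ between pandas versions.

```python
import pandas as pd

idx = pd.date_range("2012-03-09 09:30", periods=3, freq="D")
s = pd.Series(range(3), index=idx)            # naive: no time zone attached

utc = s.tz_localize("UTC")                    # interpret the timestamps as UTC
shanghai = utc.tz_convert("Asia/Shanghai")    # same instants, different wall clock

print(shanghai.index[0])   # 2012-03-09 17:30:00+08:00
```

Converting changes only how the instants are displayed; `utc.index[0]` and `shanghai.index[0]` still compare equal.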

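## Periods and Period Arithmetic

A period represents a span of time, such as a number of days, months, or years. Its constructor takes a string or an integer, together with a frequency.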

In [77]:
p = pd.Period(2007, freq="A-DEC"); p  # the whole span from January 1, 2007 through December 31, 2007

Period('2007', 'A-DEC')

In [78]:
p + 5

Period('2012', 'A-DEC')

In [79]:
p - 5

Period('2002', 'A-DEC')

If two Period objects have the same frequency, their difference is the number of units between them.

In [80]:
pd.Period("2014", freq="A-DEC") - p

7L
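The `7L` above is a Python 2 long integer. Newer pandas versions return an offset object for the same subtraction, so code that needs the plain count has to cope with both. The sketch below does so with `getattr`; it uses monthly periods, an assumption made to avoid annual aliases that have been renamed over time.

```python
import pandas as pd

start = pd.Period("2013-12", freq="M")
end = pd.Period("2014-03", freq="M")

diff = end - start
# Older pandas returns a plain integer; newer versions return an offset
# object whose .n attribute holds the number of units.
n_months = getattr(diff, "n", diff)

print(n_months)   # 3
```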

The period_range function creates regular ranges of periods.

In [81]:
rng = pd.period_range("2000/1/1", "2000/6/30", freq="M")

In [82]:
rng

PeriodIndex(['2000-01', '2000-02', '2000-03', '2000-04', '2000-05', '2000-06'], dtype='period[M]', freq='M')

The PeriodIndex class stores a sequence of Periods and can serve as an axis index in any pandas data structure.

In [83]:
Series(np.random.randn(6), index=rng)

2000-01   -0.109596
2000-02    0.338172
2000-03    0.557845
2000-04    0.184800
2000-05   -0.258738
2000-06    1.615694
Freq: M, dtype: float64

The PeriodIndex constructor also accepts an array of strings directly.

In [84]:
values = ["2001Q3", "2002Q2", "2003Q1"]

In [85]:
index = pd.PeriodIndex(values, freq="Q-DEC")

In [86]:
index

PeriodIndex(['2001Q3', '2002Q2', '2003Q1'], dtype='period[Q-DEC]', freq='Q-DEC')

### Period Frequency Conversion

Period and PeriodIndex objects can be converted to another frequency with the asfreq method.

In [87]:
p = pd.Period("2007", freq="A-DEC")

In [88]:
p.asfreq("M", how="start")

Period('2007-01', 'M')

In [89]:
p.asfreq("M", how="end")

Period('2007-12', 'M')

For a fiscal year that does not end in December, the monthly subperiods belong to different years.

In [90]:
p = pd.Period("2007", freq="A-JUN")

In [91]:
p.asfreq("M", how="start")

Period('2006-07', 'M')

In [92]:
p.asfreq("M", how="end")

Period('2007-06', 'M')

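When converting from a high frequency to a low one, the superperiod is determined by where the subperiod falls.

For example, with the A-JUN frequency, the month 2007-08 actually belongs to fiscal year 2008.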

In [93]:
p = pd.Period("2007-08", "M")

In [94]:
p.asfreq("A-JUN")

Period('2008', 'A-JUN')

Frequency conversion works the same way for a PeriodIndex or a time series.

In [95]:
rng = pd.period_range("2006", "2009", freq="A-DEC")

In [96]:
ts = Series(np.random.randn(len(rng)), index=rng)

In [97]:
ts

2006   -0.090631
2007   -0.650356
2008   -0.343683
2009    0.036415
Freq: A-DEC, dtype: float64

In [98]:
ts.asfreq("M", how="start")

2006-01   -0.090631
2007-01   -0.650356
2008-01   -0.343683
2009-01    0.036415
Freq: M, dtype: float64

In [99]:
ts.asfreq("B", how="end")

2006-12-29   -0.090631
2007-12-31   -0.650356
2008-12-31   -0.343683
2009-12-31    0.036415
Freq: B, dtype: float64

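### Quarterly Period Frequencies

Quarterly periods depend on the fiscal year end, so a label such as "2012Q4" means different things under different conventions.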

In [100]:
p = pd.Period("2012Q4", freq="Q-JAN")

In [101]:
p

Period('2012Q4', 'Q-JAN')

In [102]:
p.asfreq("D", "start")

Period('2011-11-01', 'D')

In [103]:
p.asfreq("D", "end")

Period('2012-01-31', 'D')
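The fiscal-quarter bookkeeping can also be read off the Period's attributes. The sketch below repeats the Q-JAN example and prints the fiscal year and quarter alongside the calendar boundaries.

```python
import pandas as pd

p = pd.Period("2012Q4", freq="Q-JAN")   # fiscal year 2012 ends in January 2012

first_day = p.asfreq("D", how="start")
last_day = p.asfreq("D", how="end")

print(p.qyear, p.quarter)    # 2012 4
print(first_day, last_day)   # 2011-11-01 2012-01-31
```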

Arithmetic on Periods is straightforward. For example, to get the timestamp at 4 PM on the second-to-last business day of the quarter:

In [104]:
p4pm = (p.asfreq("B", "e") - 1).asfreq("T", "s") + 16*60

In [105]:
p4pm

Period('2012-01-30 16:00', 'T')

In [106]:
p4pm.to_timestamp()

Timestamp('2012-01-30 16:00:00')

period_range can also generate quarterly ranges.

In [107]:
rng = pd.period_range("2011Q3", "2012Q4", freq="Q-JAN")

In [108]:
ts = Series(np.arange(len(rng)), index=rng)

In [109]:
ts

2011Q3    0
2011Q4    1
2012Q1    2
2012Q2    3
2012Q3    4
2012Q4    5
Freq: Q-JAN, dtype: int32

In [110]:
new_rng = (rng.asfreq("B", "e") - 1).asfreq("T", "s") + 16*60

In [112]:
ts.index = new_rng.to_timestamp()

In [113]:
ts

2010-10-28 16:00:00    0
2011-01-28 16:00:00    1
2011-04-28 16:00:00    2
2011-07-28 16:00:00    3
2011-10-28 16:00:00    4
2012-01-30 16:00:00    5
dtype: int32

### Converting Timestamps to Periods (and Back)

In [114]:
rng = pd.date_range("2000/1/1", periods=3, freq="M")

In [115]:
ts = Series(randn(3), index=rng)

In [116]:
ts

2000-01-31   -1.102828
2000-02-29   -0.280922
2000-03-31    0.740465
Freq: M, dtype: float64

In [117]:
pts = ts.to_period()

In [118]:
pts

2000-01   -1.102828
2000-02   -0.280922
2000-03    0.740465
Freq: M, dtype: float64

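By default, the frequency of the new PeriodIndex is inferred from the timestamps. You can also specify any frequency you want, and duplicate periods are allowed in the result.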

In [119]:
rng = pd.date_range("2000/1/29", periods=6, freq="D")

In [120]:
ts2 = Series(randn(6), index=rng)

In [121]:
ts2.to_period("M")

2000-01    0.119246
2000-01   -1.093911
2000-01    0.010663
2000-02    1.298913
2000-02    1.780850
2000-02    0.430898
Freq: M, dtype: float64

To convert back to timestamps, use to_timestamp.

In [122]:
pts = ts.to_period();pts

2000-01   -1.102828
2000-02   -0.280922
2000-03    0.740465
Freq: M, dtype: float64

In [123]:
pts.to_timestamp(how="end")

2000-01-31   -1.102828
2000-02-29   -0.280922
2000-03-31    0.740465
Freq: M, dtype: float64
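Because `to_period` can produce duplicate periods, a common follow-up is to aggregate them before converting back. The sketch below does that on six placeholder values and then calls `to_timestamp(how="start")`; the choice of sum as the aggregation is an assumption for illustration.

```python
import pandas as pd

idx = pd.date_range("2000-01-29", periods=6, freq="D")
s = pd.Series(range(6), index=idx)

monthly = s.to_period("M")                  # three rows in 2000-01, three in 2000-02
totals = monthly.groupby(level=0).sum()     # collapse the duplicate periods
stamps = totals.to_timestamp(how="start")   # back to timestamps at each month start

print(totals.tolist())    # [3, 12]
print(stamps.index[0])    # 2000-01-01 00:00:00
```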

### Creating a PeriodIndex from Arrays

In [124]:
data = pd.read_csv("ch08/macrodata.csv")

In [125]:
data.year

0      1959.0
1      1959.0
2      1959.0
3      1959.0
4      1960.0
5      1960.0
6      1960.0
7      1960.0
8      1961.0
9      1961.0
10     1961.0
11     1961.0
12     1962.0
13     1962.0
14     1962.0
15     1962.0
16     1963.0
17     1963.0
18     1963.0
19     1963.0
20     1964.0
21     1964.0
22     1964.0
23     1964.0
24     1965.0
25     1965.0
26     1965.0
27     1965.0
28     1966.0
29     1966.0
        ...  
173    2002.0
174    2002.0
175    2002.0
176    2003.0
177    2003.0
178    2003.0
179    2003.0
180    2004.0
181    2004.0
182    2004.0
183    2004.0
184    2005.0
185    2005.0
186    2005.0
187    2005.0
188    2006.0
189    2006.0
190    2006.0
191    2006.0
192    2007.0
193    2007.0
194    2007.0
195    2007.0
196    2008.0
197    2008.0
198    2008.0
199    2008.0
200    2009.0
201    2009.0
202    2009.0
Name: year, Length: 203, dtype: float64

In [126]:
data.quarter

0      1.0
1      2.0
2      3.0
3      4.0
4      1.0
5      2.0
6      3.0
7      4.0
8      1.0
9      2.0
10     3.0
11     4.0
12     1.0
13     2.0
14     3.0
15     4.0
16     1.0
17     2.0
18     3.0
19     4.0
20     1.0
21     2.0
22     3.0
23     4.0
24     1.0
25     2.0
26     3.0
27     4.0
28     1.0
29     2.0
      ... 
173    2.0
174    3.0
175    4.0
176    1.0
177    2.0
178    3.0
179    4.0
180    1.0
181    2.0
182    3.0
183    4.0
184    1.0
185    2.0
186    3.0
187    4.0
188    1.0
189    2.0
190    3.0
191    4.0
192    1.0
193    2.0
194    3.0
195    4.0
196    1.0
197    2.0
198    3.0
199    4.0
200    1.0
201    2.0
202    3.0
Name: quarter, Length: 203, dtype: float64

Passing the two arrays and a frequency to PeriodIndex combines them into an index for the DataFrame.

In [127]:
index = pd.PeriodIndex(year=data.year, quarter=data.quarter, freq="Q-DEC")

In [128]:
index

PeriodIndex(['1959Q1', '1959Q2', '1959Q3', '1959Q4', '1960Q1', '1960Q2',
             '1960Q3', '1960Q4', '1961Q1', '1961Q2',
             ...
             '2007Q2', '2007Q3', '2007Q4', '2008Q1', '2008Q2', '2008Q3',
             '2008Q4', '2009Q1', '2009Q2', '2009Q3'],
            dtype='period[Q-DEC]', length=203, freq='Q-DEC')
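An alternative that relies only on the string form of the constructor shown earlier is to build labels such as "1959Q1" from the two columns first. This is a sketch, not the method used above; the two small Series are hypothetical stand-ins for the year and quarter columns of macrodata.csv.

```python
import pandas as pd

# Hypothetical stand-ins for data.year and data.quarter
year = pd.Series([1959.0, 1959.0, 1959.0, 1959.0, 1960.0])
quarter = pd.Series([1.0, 2.0, 3.0, 4.0, 1.0])

# Build labels like "1959Q1", then let PeriodIndex parse them
labels = [f"{int(y)}Q{int(q)}" for y, q in zip(year, quarter)]
index = pd.PeriodIndex(labels, freq="Q-DEC")

print(index[0], index[-1])   # 1959Q1 1960Q1
```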

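## Resampling and Frequency Conversion

Resampling is the process of converting a time series from one frequency to another.

pandas objects have a resample method, which is the workhorse for all frequency conversion.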

In [129]:
rng = pd.date_range("2000/1/1", periods=100, freq="D")

In [130]:
ts = Series(randn(len(rng)), index=rng)

In [131]:
ts.resample("M").mean()

2000-01-31    0.020931
2000-02-29   -0.296366
2000-03-31   -0.097544
2000-04-30   -0.107259
Freq: M, dtype: float64

In [132]:
ts.resample("M", kind="period").mean()

2000-01    0.020931
2000-02   -0.296366
2000-03   -0.097544
2000-04   -0.107259
Freq: M, dtype: float64

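### Downsampling

Downsampling aggregates data to a regular, lower frequency. When using resample to downsample, consider two things:
* which side of each interval is closed
* how to label each aggregated bin, with the start of the interval or the end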

In [134]:
rng = pd.date_range("2000/1/1", periods=12, freq="T")  # one-minute data

In [135]:
ts = Series(np.arange(12), index=rng)

In [136]:
ts

2000-01-01 00:00:00     0
2000-01-01 00:01:00     1
2000-01-01 00:02:00     2
2000-01-01 00:03:00     3
2000-01-01 00:04:00     4
2000-01-01 00:05:00     5
2000-01-01 00:06:00     6
2000-01-01 00:07:00     7
2000-01-01 00:08:00     8
2000-01-01 00:09:00     9
2000-01-01 00:10:00    10
2000-01-01 00:11:00    11
Freq: T, dtype: int32

In [137]:
ts.resample("5min").sum()  # aggregate into 5-minute bins

2000-01-01 00:00:00    10
2000-01-01 00:05:00    35
2000-01-01 00:10:00    21
Freq: 5T, dtype: int32

In [140]:
ts.resample("5min", closed="right").sum()  # bins closed on the right

1999-12-31 23:55:00     0
2000-01-01 00:00:00    15
2000-01-01 00:05:00    40
2000-01-01 00:10:00    11
Freq: 5T, dtype: int32

In [143]:
ts.resample("5min", closed="right", label="right").sum()  # label each bin with its right edge

2000-01-01 00:00:00     0
2000-01-01 00:05:00    15
2000-01-01 00:10:00    40
2000-01-01 00:15:00    11
Freq: 5T, dtype: int32

In [145]:
ts.resample("5min", loffset="-1s").sum()

1999-12-31 23:59:59    10
2000-01-01 00:04:59    35
2000-01-01 00:09:59    21
Freq: 5T, dtype: int32
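The `loffset` argument is no longer available in recent pandas releases. The same label shift can be obtained by adjusting the index of the result afterwards, as in this sketch (the "min" alias is used for the one-minute frequency):

```python
import numpy as np
import pandas as pd

rng = pd.date_range("2000-01-01", periods=12, freq="min")
s = pd.Series(np.arange(12), index=rng)

result = s.resample("5min").sum()
result.index = result.index - pd.Timedelta(seconds=1)   # shift every label by -1s

print(result.index[0])    # 1999-12-31 23:59:59
print(result.tolist())    # [10, 35, 21]
```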

### OHLC Resampling

In finance, a common aggregation computes four values for each bin: open (first), high (maximum), low (minimum), and close (last).

In [146]:
ts.resample("5min").ohlc()

| | open | high | low | close |
|---|---|---|---|---|
| 2000-01-01 00:00:00 | 0 | 4 | 0 | 4 |
| 2000-01-01 00:05:00 | 5 | 9 | 5 | 9 |
| 2000-01-01 00:10:00 | 10 | 11 | 10 | 11 |
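The four OHLC columns correspond to ordinary aggregations. As a sketch, the block below rebuilds the same table with `agg` and compares it with `ohlc()`; the one-minute series is recreated with the "min" alias.

```python
import numpy as np
import pandas as pd

rng = pd.date_range("2000-01-01", periods=12, freq="min")
s = pd.Series(np.arange(12), index=rng)

ohlc = s.resample("5min").ohlc()
# open=first, high=max, low=min, close=last, in the same column order
manual = s.resample("5min").agg(["first", "max", "min", "last"])

print(ohlc.iloc[0].tolist())                  # [0, 4, 0, 4]
print((ohlc.values == manual.values).all())   # True
```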


### Resampling with groupby

In [147]:
rng = pd.date_range("2000/1/1", periods=100, freq="D")

In [148]:
ts = Series(np.arange(100), index=rng)

In [149]:
ts.groupby(lambda x: x.month).mean()

1    15
2    45
3    75
4    95
dtype: int32

In [150]:
ts.groupby(lambda x: x.weekday).mean()

0    47.5
1    48.5
2    49.5
3    50.5
4    51.5
5    49.0
6    50.0
dtype: float64

### Upsampling and Interpolation

In [152]:
frame = DataFrame(
    randn(2, 4),
    index=pd.date_range("2000/1/1", periods=2, freq="W-WED"),
    columns=["Colorado", "Texas", "New York", "Ohio"]
)

In [153]:
frame[:5]

| | Colorado | Texas | New York | Ohio |
|---|---|---|---|---|
| 2000-01-05 | -0.825688 | 0.96507 | -0.044539 | 0.754923 |
| 2000-01-12 | 1.203043 | 0.91987 | -1.646271 | 2.368358 |


Resampling to daily frequency introduces missing values.

In [154]:
df_daily = frame.resample("D")

In [155]:
df_daily

DatetimeIndexResampler [freq=<Day>, axis=0, closed=left, label=left, convention=start, base=0]

In [159]:
frame.resample("D").ffill()

| | Colorado | Texas | New York | Ohio |
|---|---|---|---|---|
| 2000-01-05 | -0.825688 | 0.96507 | -0.044539 | 0.754923 |
| 2000-01-06 | -0.825688 | 0.96507 | -0.044539 | 0.754923 |
| 2000-01-07 | -0.825688 | 0.96507 | -0.044539 | 0.754923 |
| 2000-01-08 | -0.825688 | 0.96507 | -0.044539 | 0.754923 |
| 2000-01-09 | -0.825688 | 0.96507 | -0.044539 | 0.754923 |
| 2000-01-10 | -0.825688 | 0.96507 | -0.044539 | 0.754923 |
| 2000-01-11 | -0.825688 | 0.96507 | -0.044539 | 0.754923 |
| 2000-01-12 | 1.203043 | 0.91987 | -1.646271 | 2.368358 |


In [161]:
frame.resample("W-THU").ffill()  # forward-fill at a different frequency

| | Colorado | Texas | New York | Ohio |
|---|---|---|---|---|
| 2000-01-06 | -0.825688 | 0.96507 | -0.044539 | 0.754923 |
| 2000-01-13 | 1.203043 | 0.91987 | -1.646271 | 2.368358 |
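Forward filling is not the only way to fill the gaps created by upsampling. The sketch below shows two variations on a tiny one-column frame with made-up values: limiting how far a value is carried forward, and linear interpolation between the observations.

```python
import pandas as pd

idx = pd.date_range("2000-01-05", periods=2, freq="W-WED")
frame = pd.DataFrame({"Colorado": [0.0, 7.0]}, index=idx)

limited = frame.resample("D").ffill(limit=2)   # carry each value at most 2 days forward
linear = frame.resample("D").interpolate()     # straight line between observations

print(int(limited["Colorado"].notna().sum()))                         # 4
print(round(linear.loc[pd.Timestamp("2000-01-06"), "Colorado"], 6))   # 1.0
```

With `limit=2`, only January 5, 6, 7, and 12 hold values; the remaining days stay NaN.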


### Resampling with Periods

In [162]:
frame = DataFrame(
    randn(24, 4),
    index=pd.period_range("2000/1", "2001/12", freq="M"),
    columns=["Colorado", "Texas", "New York", "Ohio"]
)

In [163]:
frame[:5]

| | Colorado | Texas | New York | Ohio |
|---|---|---|---|---|
| 2000-01 | 0.117996 | -1.131396 | 0.30322 | -0.261724 |
| 2000-02 | 0.647982 | -0.04001 | 1.282247 | 0.257515 |
| 2000-03 | 0.633626 | 1.496012 | 0.539518 | -0.512288 |
| 2000-04 | -1.35108 | -1.374513 | 0.091002 | -0.20129 |
| 2000-05 | -1.196733 | 0.360692 | -0.055058 | 1.100272 |


In [164]:
annual_frame = frame.resample("A-DEC").mean()

In [165]:
annual_frame

| | Colorado | Texas | New York | Ohio |
|---|---|---|---|---|
| 2000 | 0.230475 | 0.042549 | -0.471946 | -0.147169 |
| 2001 | -0.488391 | -0.373092 | -0.313056 | 0.137096 |


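When upsampling, you must decide which end of the new frequency's span receives the original value. The convention parameter controls this.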

In [166]:
annual_frame.resample("Q-DEC").ffill()

| | Colorado | Texas | New York | Ohio |
|---|---|---|---|---|
| 2000Q1 | 0.230475 | 0.042549 | -0.471946 | -0.147169 |
| 2000Q2 | 0.230475 | 0.042549 | -0.471946 | -0.147169 |
| 2000Q3 | 0.230475 | 0.042549 | -0.471946 | -0.147169 |
| 2000Q4 | 0.230475 | 0.042549 | -0.471946 | -0.147169 |
| 2001Q1 | -0.488391 | -0.373092 | -0.313056 | 0.137096 |
| 2001Q2 | -0.488391 | -0.373092 | -0.313056 | 0.137096 |
| 2001Q3 | -0.488391 | -0.373092 | -0.313056 | 0.137096 |
| 2001Q4 | -0.488391 | -0.373092 | -0.313056 | 0.137096 |


In [168]:
annual_frame.resample("Q-DEC", convention="end").ffill()

| | Colorado | Texas | New York | Ohio |
|---|---|---|---|---|
| 2000Q4 | 0.230475 | 0.042549 | -0.471946 | -0.147169 |
| 2001Q1 | 0.230475 | 0.042549 | -0.471946 | -0.147169 |
| 2001Q2 | 0.230475 | 0.042549 | -0.471946 | -0.147169 |
| 2001Q3 | 0.230475 | 0.042549 | -0.471946 | -0.147169 |
| 2001Q4 | -0.488391 | -0.373092 | -0.313056 | 0.137096 |


## Time Series Plotting

Omitted.