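## The meaning of time series data depends on the application
1. Timestamp: a specific instant in time
2. Fixed period: such as January 2008 or the full year 2016
3. Time interval: defined by a start and an end timestamp; a period can be viewed as a special case of an interval
4. Experiment or elapsed time: each point is a measurement relative to a particular start time, such as the diameter of a cookie each second after it is placed in the oven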
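
The four kinds of time reference listed above map fairly directly onto pandas objects. The sketch below is only an illustration; the particular dates and the variable names are made up for the example.

```python
import pandas as pd

# 1. Timestamp: a single instant
stamp = pd.Timestamp('2008-01-15 10:30')

# 2. Fixed period: the whole of January 2008
period = pd.Period('2008-01', freq='M')

# 3. Interval: an explicit start and end timestamp (closed on the left here)
interval = pd.Interval(pd.Timestamp('2008-01-01'),
                       pd.Timestamp('2008-02-01'), closed='left')

# 4. Elapsed time: seconds measured from an arbitrary start
elapsed = pd.to_timedelta([0, 1, 2, 3], unit='s')

print(period.start_time, period.end_time)
print(stamp in interval)
print(elapsed)
```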

# Date and time data types and tools
* Python standard library: the `datetime`, `time`, and `calendar` modules

In [40]:
%pylab inline
import numpy as np  # `np` is used below; imported explicitly rather than relying on %pylab
from datetime import datetime
from pandas import Series, DataFrame
import pandas as pd

Populating the interactive namespace from numpy and matplotlib


`%matplotlib` prevents importing * from pylab and numpy


In [36]:
now = datetime.now()
now

datetime.datetime(2016, 9, 5, 15, 10, 49, 274000)

In [37]:
now.year, now.month, now.day

(2016, 9, 5)

In [38]:
# datetime stores the date and time down to the microsecond; datetime.timedelta represents the difference between two datetime objects
delta = datetime(2011,9,2) - datetime(2011,8,3,10,23)
delta

datetime.timedelta(29, 49020)

In [39]:
# days, seconds
delta.days, delta.seconds

(29, 49020)
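
A `timedelta` keeps days and seconds as separate fields, so `seconds` only covers the part of a day that is left over. As a small sketch using the same two dates, the leftover can be broken into hours and minutes, and `total_seconds()` folds everything into one number:

```python
from datetime import datetime

delta = datetime(2011, 9, 2) - datetime(2011, 8, 3, 10, 23)

# Split the leftover seconds into hours and minutes
hours, remainder = divmod(delta.seconds, 3600)
minutes = remainder // 60

# total_seconds() combines days and seconds into a single float
total = delta.total_seconds()

print(delta.days, hours, minutes)  # 29 13 37
print(total)                       # 2554620.0
```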

In [12]:
# Adding or subtracting one or more timedeltas to a datetime object yields a new object
from datetime import timedelta

In [13]:
start = datetime(2016,9,2)
start + timedelta(12)

datetime.datetime(2016, 9, 14, 0, 0)

## Converting between strings and datetime

In [15]:
# str and strftime format datetime objects and pandas Timestamp objects as strings
stamp = datetime(2016,1,3)
str(stamp)

'2016-01-03 00:00:00'

In [17]:
stamp.strftime('%Y-%m-%d')

'2016-01-03'

In [20]:
# Convert a string to a date using format codes
value = '2011-03-29'
datetime.strptime(value,'%Y-%m-%d')

datetime.datetime(2011, 3, 29, 0, 0)
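
`strftime` and `strptime` are inverses as long as both use the same format string. A minimal round-trip sketch follows; the format and the sample value are chosen only for illustration.

```python
from datetime import datetime

stamp = datetime(2016, 1, 3, 14, 5)
fmt = '%Y-%m-%d %H:%M'  # 4-digit year, month, day, 24-hour clock, minute

text = stamp.strftime(fmt)               # datetime -> str
restored = datetime.strptime(text, fmt)  # str -> datetime

print(text)               # 2016-01-03 14:05
print(restored == stamp)  # True
```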

In [21]:
# dateutil's parser.parse handles most common date formats without an explicit format string
from dateutil.parser import parse

In [22]:
parse(value)

datetime.datetime(2011, 3, 29, 0, 0)

In [25]:
parse('Jan 31, 1997 10:45 PM')

datetime.datetime(1997, 1, 31, 22, 45)

In [27]:
# When the day comes before the month, pass dayfirst=True
parse('6/12/2016',dayfirst=True)

datetime.datetime(2016, 12, 6, 0, 0)
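
The same string can mean two different dates depending on the convention. This short sketch parses one ambiguous string both ways to make the effect of `dayfirst` visible:

```python
from dateutil.parser import parse

ambiguous = '6/12/2016'

month_first = parse(ambiguous)               # default: month/day/year
day_first = parse(ambiguous, dayfirst=True)  # read as day/month/year

print(month_first.date(), day_first.date())
```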

In [41]:
# pandas' to_datetime parses a whole group of dates at once
datestrs = ['7/6/2016','9/9/2015']
pd.to_datetime(datestrs)

DatetimeIndex(['2016-07-06', '2015-09-09'], dtype='datetime64[ns]', freq=None)

In [44]:
# Missing values (None, empty strings) are handled and become NaT
idx = pd.to_datetime(datestrs + [None])
idx

DatetimeIndex(['2016-07-06', '2015-09-09', 'NaT'], dtype='datetime64[ns]', freq=None)

In [45]:
pd.isnull(idx)

array([False, False,  True], dtype=bool)
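
Unparseable strings are a related case. As a sketch, and assuming the goal is to keep going rather than fail, `errors='coerce'` turns them into `NaT` as well, so they can be filtered out with the same null check. The sample values are invented.

```python
import pandas as pd

raw = ['2016-07-06', 'not a date', None]

# errors='coerce' replaces anything that cannot be parsed with NaT
idx = pd.to_datetime(raw, errors='coerce')

mask = pd.isnull(idx)  # True where the value is NaT
clean = idx[~mask]     # keep only the valid timestamps

print(idx)
print(clean)
```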

## Time series basics
* The most basic pandas time series is a Series indexed by timestamps

In [53]:
dates = [datetime(2016,1,8), datetime(2016,6,8), datetime(2016,5,8), datetime(2016,4,8), datetime(2016,3,8), datetime(2016,2,8)]
ts = Series(np.random.randn(len(dates)),index=dates)
print(type(ts))
print(type(ts.index))
print(ts.index.dtype)
ts

<class 'pandas.core.series.Series'>
<class 'pandas.tseries.index.DatetimeIndex'>
datetime64[ns]


2016-01-08    1.726573
2016-06-08    0.683461
2016-05-08    0.292300
2016-04-08    0.068015
2016-03-08    0.046052
2016-02-08    0.262360
dtype: float64

In [51]:
# Arithmetic between differently indexed time series automatically aligns on the dates
ts + ts[::2]

2016-01-08    2.380956
2016-02-08         NaN
2016-03-08   -0.591529
2016-04-08         NaN
2016-05-08   -2.330775
2016-06-08         NaN
dtype: float64
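
With fixed values the alignment is easier to follow than with random numbers. In this sketch (values chosen for illustration), dates missing from one operand produce `NaN`, and `Series.add` with `fill_value` is one way to treat the missing side as zero:

```python
import pandas as pd

dates = pd.to_datetime(['2016-01-08', '2016-02-08', '2016-03-08', '2016-04-08'])
ts = pd.Series([1.0, 2.0, 3.0, 4.0], index=dates)

# ts[::2] keeps only the 1st and 3rd dates; the other dates have no partner
total = ts + ts[::2]

# Treat a missing partner as 0 instead of propagating NaN
filled = ts.add(ts[::2], fill_value=0)

print(total)
print(filled)
```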

## Indexing, selection, and subsetting

In [56]:
# Indexing
stamp = ts.index[2]
stamp

Timestamp('2016-05-08 00:00:00')

In [57]:
ts[stamp]

0.29230009763720743

In [58]:
ts['2016-01-08']

2016-01-08    1.726573
dtype: float64

In [61]:
# For longer time series, passing a year or a year-month string selects a slice of the data
longer_ts = Series(np.random.randn(1000),index=pd.date_range('1/1/2010',periods=1000))
longer_ts.head()

2010-01-01   -0.661064
2010-01-02   -0.542034
2010-01-03   -0.148403
2010-01-04   -1.556598
2010-01-05    0.461211
Freq: D, dtype: float64

In [63]:
# Slice by year
longer_ts['2011'].head()

2011-01-01   -1.192233
2011-01-02    0.598542
2011-01-03   -0.130657
2011-01-04    0.354745
2011-01-05    0.615886
Freq: D, dtype: float64

In [64]:
# Slice by year and month
longer_ts['2011-03']

2011-03-01   -0.656364
2011-03-02    0.071906
2011-03-03   -0.562434
2011-03-04    0.621281
2011-03-05    1.707967
2011-03-06   -1.033982
2011-03-07   -0.374775
2011-03-08   -0.903705
2011-03-09   -0.174119
2011-03-10    1.308651
2011-03-11   -0.711709
2011-03-12   -0.708516
2011-03-13   -0.626936
2011-03-14    0.392382
2011-03-15    0.927369
2011-03-16   -1.071181
2011-03-17    0.810623
2011-03-18   -1.616151
2011-03-19    0.625646
2011-03-20   -0.794965
2011-03-21    0.768581
2011-03-22    0.034570
2011-03-23   -0.617377
2011-03-24    2.074930
2011-03-25    0.582247
2011-03-26   -0.032518
2011-03-27    3.139842
2011-03-28   -1.109369
2011-03-29   -1.530500
2011-03-30   -0.350568
2011-03-31   -1.320868
Freq: D, dtype: float64

In [66]:
longer_ts[datetime(2011,3,1):datetime(2011,3,10)]

2011-03-01   -0.656364
2011-03-02    0.071906
2011-03-03   -0.562434
2011-03-04    0.621281
2011-03-05    1.707967
2011-03-06   -1.033982
2011-03-07   -0.374775
2011-03-08   -0.903705
2011-03-09   -0.174119
2011-03-10    1.308651
Freq: D, dtype: float64

In [69]:
# A time series can be sliced with timestamps that are not present in its index
ts['2016-01-11':'2016-04-11']

2016-04-08    0.068015
2016-03-08    0.046052
2016-02-08    0.262360
dtype: float64
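
The series sliced above has an unsorted index. As far as I know, recent pandas releases are stricter about label slicing on an unsorted `DatetimeIndex` when the bounds are absent, so sorting first is the safer pattern. A sketch with integer values standing in for the random data:

```python
import pandas as pd

dates = pd.to_datetime(['2016-01-08', '2016-06-08', '2016-05-08',
                        '2016-04-08', '2016-03-08', '2016-02-08'])
ts = pd.Series(range(6), index=dates)

# Sort by date, then slice with bounds that are not in the index
ordered = ts.sort_index()
window = ordered.loc['2016-01-11':'2016-04-11']

print(window)
```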

In [70]:
# Indexing a DataFrame by its row labels (.loc replaces the removed .ix indexer)
dates = pd.date_range('2016/1/1',periods=100,freq='W-WED')
long_df = DataFrame(np.random.randn(100,4),index=dates,columns=list('ABCD'))
long_df.loc['2016/3']

                   A         B         C         D
2016-03-02 -1.160695  1.305283 -1.232372  0.679614
2016-03-09 -1.176652   0.94806  0.516697 -1.875135
2016-03-16 -0.968665  0.536208 -0.810429 -0.901856
2016-03-23  0.095245    1.6228 -0.109334  -1.09638
2016-03-30  0.601759 -0.841323 -0.269787 -0.768711


## Time series with duplicate indices
* In some applications, several observations fall on the same timestamp

In [77]:
dates = pd.DatetimeIndex(['1/1/2016','1/2/2016','1/3/2016','1/2/2016','1/1/2016','10/1/2016'])
dup_ts = Series(np.arange(6),index=dates)
dup_ts

2016-01-01    0
2016-01-02    1
2016-01-03    2
2016-01-02    3
2016-01-01    4
2016-10-01    5
dtype: int32

In [78]:
# Check whether the index is unique
dup_ts.index.is_unique

False

In [81]:
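# Indexing a time series yields a scalar (unique timestamp) or a slice (duplicated timestamp);
# note that in the output below the unique timestamp also comes back as a one-row Series
print(dup_ts['1/1/2016'])
print('')
print(dup_ts['10/1/2016'])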

2016-01-01    0
2016-01-01    4
dtype: int32 

2016-10-01    5
dtype: int32


In [84]:
# Use groupby to aggregate data with non-unique timestamps
grouped = dup_ts.groupby(level=0)
grouped.mean()

2016-01-01    2
2016-01-02    2
2016-01-03    2
2016-10-01    5
dtype: int32
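
Besides averaging, two other common treatments of duplicate timestamps are counting the observations per timestamp and keeping only the first one. A sketch on the same dates, written out in ISO format:

```python
import pandas as pd

dates = pd.DatetimeIndex(['2016-01-01', '2016-01-02', '2016-01-03',
                          '2016-01-02', '2016-01-01', '2016-10-01'])
dup_ts = pd.Series(range(6), index=dates)

# Number of observations per timestamp
counts = dup_ts.groupby(level=0).count()

# Keep only the first observation for each timestamp
first_seen = dup_ts[~dup_ts.index.duplicated(keep='first')]

print(counts)
print(first_seen)
```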

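## Date ranges, frequencies, and shifting
* Time series in pandas are generally treated as irregular, with no fixed frequency
* Analysis, however, often requires a fixed frequency
* pandas provides tools for this, such as resample and date_range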

In [88]:
# resample converts a time series into one with a fixed frequency

# Fixed frequency (daily)
ts.resample('D').mean().head()

2016-01-08    1.726573
2016-01-09         NaN
2016-01-10         NaN
2016-01-11         NaN
2016-01-12         NaN
Freq: D, dtype: float64
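
Resampling to a finer frequency introduces dates with no observation, which is why most rows above are `NaN`. As a sketch, and assuming that carrying the last known value forward is acceptable for the data at hand, the gaps can be forward-filled:

```python
import pandas as pd

ts = pd.Series([1.0, 2.0],
               index=pd.to_datetime(['2016-01-08', '2016-01-11']))

daily = ts.resample('D').mean()        # gaps show up as NaN
daily_ffill = ts.resample('D').ffill() # gaps take the previous value

print(daily)
print(daily_ffill)
```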

### Generating date ranges

In [92]:
# pd.date_range generates a DatetimeIndex of the specified length
# By default both the start and end timestamps are included
index = pd.date_range('4/1/2016','4/10/2016')
index

DatetimeIndex(['2016-04-01', '2016-04-02', '2016-04-03', '2016-04-04',
               '2016-04-05', '2016-04-06', '2016-04-07', '2016-04-08',
               '2016-04-09', '2016-04-10'],
              dtype='datetime64[ns]', freq='D')

In [94]:
# pd.date_range produces daily timestamps by default
# If only a start or an end is given, periods must be passed as well
pd.date_range(end='2016/10/1',periods=10)

DatetimeIndex(['2016-09-22', '2016-09-23', '2016-09-24', '2016-09-25',
               '2016-09-26', '2016-09-27', '2016-09-28', '2016-09-29',
               '2016-09-30', '2016-10-01'],
              dtype='datetime64[ns]', freq='D')
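
The three ways of specifying a daily range (both bounds, start plus `periods`, end plus `periods`) describe the same index when the numbers agree. A quick sketch to check this:

```python
import pandas as pd

by_bounds = pd.date_range('2016-04-01', '2016-04-10')
by_start = pd.date_range(start='2016-04-01', periods=10)
by_end = pd.date_range(end='2016-04-10', periods=10)

print(by_bounds.equals(by_start), by_bounds.equals(by_end))
```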

In [95]:
# Pass 'BM' (business end of month) to build an index of the last business day of each month
pd.date_range('2016/1/1','2016/10/1',freq='BM')

DatetimeIndex(['2016-01-29', '2016-02-29', '2016-03-31', '2016-04-29',
               '2016-05-31', '2016-06-30', '2016-07-29', '2016-08-31',
               '2016-09-30'],
              dtype='datetime64[ns]', freq='BM')

In [98]:
# Generate a set of timestamps normalized to midnight
pd.date_range('5/2/2016 12:12:31',periods=5,normalize=True)

DatetimeIndex(['2016-05-02', '2016-05-03', '2016-05-04', '2016-05-05',
               '2016-05-06'],
              dtype='datetime64[ns]', freq='D')

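### Frequencies and date offsets
* A frequency is composed of a base frequency and a multiplier
* Base frequencies are referred to by string aliases such as 'M' or 'H'
* Each base frequency has a corresponding object known as a date offset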

In [99]:
from pandas.tseries.offsets import Hour, Minute

In [101]:
hour = Hour()
hour

<Hour>

In [102]:
# Pass an integer to define a multiple of the offset
four_hours = Hour(4)
four_hours

<4 * Hours>

In [103]:
# Generate a date range using a string alias
pd.date_range('1/1/2016',periods=10,freq='2H')

DatetimeIndex(['2016-01-01 00:00:00', '2016-01-01 02:00:00',
               '2016-01-01 04:00:00', '2016-01-01 06:00:00',
               '2016-01-01 08:00:00', '2016-01-01 10:00:00',
               '2016-01-01 12:00:00', '2016-01-01 14:00:00',
               '2016-01-01 16:00:00', '2016-01-01 18:00:00'],
              dtype='datetime64[ns]', freq='2H')

In [104]:
# Most offset objects can be combined by addition
Hour(2) + Minute(10)

<130 * Minutes>

In [105]:
pd.date_range('1/1/2016',periods=10,freq='2h30min')

DatetimeIndex(['2016-01-01 00:00:00', '2016-01-01 02:30:00',
               '2016-01-01 05:00:00', '2016-01-01 07:30:00',
               '2016-01-01 10:00:00', '2016-01-01 12:30:00',
               '2016-01-01 15:00:00', '2016-01-01 17:30:00',
               '2016-01-01 20:00:00', '2016-01-01 22:30:00'],
              dtype='datetime64[ns]', freq='150T')
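
Offset objects can be used wherever a frequency string is accepted, and they can also be added to a timestamp directly. The following sketch builds a 2.5-hour step from `Hour` and `Minute` and uses it both ways:

```python
import pandas as pd
from pandas.tseries.offsets import Hour, Minute

step = Hour(2) + Minute(30)           # combined offset of 150 minutes
start = pd.Timestamp('2016-01-01')

# Pass the offset object as freq instead of a string alias
rng = pd.date_range(start, periods=4, freq=step)

print(start + step)
print(rng)
```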

In [108]:
# Third Friday of each month, using WOM (week of month)
rng = pd.date_range('2016',freq='WOM-3FRI',periods=10)
list(rng)

[Timestamp('2016-01-15 00:00:00', offset='WOM-3FRI'),
 Timestamp('2016-02-19 00:00:00', offset='WOM-3FRI'),
 Timestamp('2016-03-18 00:00:00', offset='WOM-3FRI'),
 Timestamp('2016-04-15 00:00:00', offset='WOM-3FRI'),
 Timestamp('2016-05-20 00:00:00', offset='WOM-3FRI'),
 Timestamp('2016-06-17 00:00:00', offset='WOM-3FRI'),
 Timestamp('2016-07-15 00:00:00', offset='WOM-3FRI'),
 Timestamp('2016-08-19 00:00:00', offset='WOM-3FRI'),
 Timestamp('2016-09-16 00:00:00', offset='WOM-3FRI'),
 Timestamp('2016-10-21 00:00:00', offset='WOM-3FRI')]
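
One way to sanity-check a `WOM-3FRI` range is to confirm that every date is a Friday and that its day of month lies between 15 and 21, which is where a third Friday must fall. A small sketch:

```python
import pandas as pd

rng = pd.date_range('2016-01-01', periods=10, freq='WOM-3FRI')

all_fridays = bool((rng.dayofweek == 4).all())  # Monday=0, so Friday=4
in_third_week = bool(((rng.day >= 15) & (rng.day <= 21)).all())

print(all_fridays, in_third_week)
```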

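### Shifting (leading and lagging) data
* Shifting moves the data forward or backward along the time axis while leaving the index unchanged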

In [109]:
ts = Series(np.random.randn(4),index=pd.date_range('1/1/2016',periods=4,freq='M'))
ts

2016-01-31    0.883933
2016-02-29    0.265755
2016-03-31   -1.861968
2016-04-30   -0.336218
Freq: M, dtype: float64

In [110]:
# Shift forward (positive) and backward (negative)
ts.shift(2)

2016-01-31         NaN
2016-02-29         NaN
2016-03-31    0.883933
2016-04-30    0.265755
Freq: M, dtype: float64

In [111]:
ts.shift(-2)

2016-01-31   -1.861968
2016-02-29   -0.336218
2016-03-31         NaN
2016-04-30         NaN
Freq: M, dtype: float64
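
A plain `shift` moves the values and discards whatever falls off the end. If the frequency is known, passing `freq` moves the timestamps instead, so no data is lost. A sketch with a small daily series (values chosen for illustration):

```python
import pandas as pd

ts = pd.Series([1.0, 2.0, 3.0],
               index=pd.date_range('2016-01-01', periods=3, freq='D'))

shifted_values = ts.shift(1)           # index unchanged, first value becomes NaN
shifted_index = ts.shift(1, freq='D')  # values unchanged, index moved by one day

print(shifted_values)
print(shifted_index)
```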

In [112]:
# shift is commonly used to compute percent changes in one or more time series
ts / ts.shift(1) - 1

2016-01-31         NaN
2016-02-29   -0.699349
2016-03-31   -8.006329
2016-04-30   -0.819429
Freq: M, dtype: float64
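
pandas also offers `pct_change`, which should give the same result as the manual `shift` expression when the series has no missing values. A sketch comparing the two on made-up numbers:

```python
import pandas as pd

ts = pd.Series([100.0, 110.0, 99.0],
               index=pd.date_range('2016-01-01', periods=3, freq='D'))

manual = ts / ts.shift(1) - 1  # change relative to the previous observation
builtin = ts.pct_change()      # built-in equivalent

print(manual)
print(builtin)
```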