# 11.2 Time Series Basics

A basic kind of time series object in pandas is a Series indexed by timestamps. Outside of pandas, those timestamps are usually represented as Python strings or datetime objects:

In [1]:
import pandas as pd
import numpy as np
from datetime import datetime

In [2]:
dates = [datetime(2011, 1, 2), datetime(2011, 1, 5),
         datetime(2011, 1, 7), datetime(2011, 1, 8), 
         datetime(2011, 1, 10), datetime(2011, 1, 12)]

In [3]:
ts = pd.Series(np.random.randn(6), index=dates)
ts

2011-01-02   -1.516355
2011-01-05   -1.312124
2011-01-07    0.634266
2011-01-08   -2.084913
2011-01-10    0.190022
2011-01-12   -0.422192
dtype: float64

Under the hood, the datetime objects have been put into a DatetimeIndex:

In [4]:
ts.index

DatetimeIndex(['2011-01-02', '2011-01-05', '2011-01-07', '2011-01-08',
               '2011-01-10', '2011-01-12'],
              dtype='datetime64[ns]', freq=None)
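Since dates often arrive as plain strings, the following minimal sketch shows one way to get the same kind of index from string input. The names `raw_dates` and `ts_from_strings` are made up for illustration; `pd.to_datetime` is the pandas parsing function.

```python
import pandas as pd

# Hypothetical raw input: dates arriving as plain strings.
raw_dates = ['2011-01-02', '2011-01-05', '2011-01-07']

# pd.to_datetime parses a list of strings into a DatetimeIndex.
index = pd.to_datetime(raw_dates)
ts_from_strings = pd.Series([1.0, 2.0, 3.0], index=index)

print(type(ts_from_strings.index).__name__)
```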

As with any other Series, arithmetic between differently indexed time series automatically aligns the values on their dates:

In [5]:
ts[::2]

2011-01-02   -1.516355
2011-01-07    0.634266
2011-01-10    0.190022
dtype: float64

In [6]:
ts + ts[::2]

2011-01-02   -3.032709
2011-01-05         NaN
2011-01-07    1.268533
2011-01-08         NaN
2011-01-10    0.380044
2011-01-12         NaN
dtype: float64

ts[::2] selects every second element of ts, so the dates it skips have no match and come out as NaN in the sum.
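The alignment behavior is easier to verify with fixed numbers than with random ones. This is a small sketch using made-up values; `Series.add` with `fill_value` is shown as one option when the unmatched dates should count as zero instead of NaN.

```python
import pandas as pd

idx = pd.date_range('2011-01-01', periods=4, freq='D')
full = pd.Series([1.0, 2.0, 3.0, 4.0], index=idx)
every_other = full[::2]          # keeps positions 0 and 2

# Addition aligns on the dates; dates missing from one side become NaN.
total = full + every_other

# Optional: treat the missing dates as 0 instead of propagating NaN.
filled = full.add(every_other, fill_value=0)

print(total)
print(filled)
```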

pandas stores timestamps using NumPy's datetime64 data type, which can represent times at nanosecond resolution.

Scalar values from a DatetimeIndex are pandas Timestamp objects:

In [7]:
stamp = ts.index[0]
stamp

Timestamp('2011-01-02 00:00:00')

A Timestamp can be substituted anywhere you would use a datetime object.
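A short sketch of why the substitution works: `pd.Timestamp` is a subclass of `datetime.datetime`, and it additionally carries a nanosecond field. The variable names are illustrative only.

```python
from datetime import datetime

import pandas as pd

stamp = pd.Timestamp('2011-01-02')

# Timestamp subclasses datetime, so it passes isinstance checks
# and compares equal to the matching datetime object.
is_datetime = isinstance(stamp, datetime)
same_moment = stamp == datetime(2011, 1, 2)

# Unlike datetime (microsecond resolution), Timestamp keeps nanoseconds.
fine = pd.Timestamp('2011-01-02 00:00:00.000000001')

print(is_datetime, same_moment, fine.nanosecond)
```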

# 1 Indexing, Selection, Subsetting

A time series behaves like any other pandas.Series when you index and select data by label:

In [8]:
ts

2011-01-02   -1.516355
2011-01-05   -1.312124
2011-01-07    0.634266
2011-01-08   -2.084913
2011-01-10    0.190022
2011-01-12   -0.422192
dtype: float64

In [9]:
stamp = ts.index[2]

In [10]:
ts[stamp]

0.63426639357601478

As a convenience, you can also pass a string that can be interpreted as a date:

In [11]:
ts['1/10/2011']

0.190022244081339

In [12]:
ts['20110110']

0.190022244081339
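The following sketch, with made-up values, checks that several spellings of the same date resolve to the same element. Note that a string such as '1/10/2011' is read month-first here (January 10), so an unambiguous ISO-style string is the safer choice in real code.

```python
import pandas as pd

idx = pd.to_datetime(['2011-01-08', '2011-01-10', '2011-01-12'])
s = pd.Series([10.0, 20.0, 30.0], index=idx)

a = s['1/10/2011']                    # month/day/year string
b = s['20110110']                     # compact string
c = s.loc['2011-01-10']               # ISO string with .loc
d = s[pd.Timestamp('2011-01-10')]     # explicit Timestamp

print(a, b, c, d)
```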

For longer time series, you can pass just a year, or a year and month, to select a slice of the data:

In [13]:
longer_ts = pd.Series(np.random.randn(1000),
                      index=pd.date_range('1/1/2000', periods=1000))
longer_ts

2000-01-01   -0.136372
2000-01-02   -1.224098
2000-01-03   -0.483691
2000-01-04   -0.413888
2000-01-05   -0.879009
2000-01-06    0.095670
2000-01-07   -0.641924
2000-01-08    0.854849
2000-01-09    1.176310
2000-01-10    0.822993
2000-01-11    1.885742
2000-01-12   -0.511405
2000-01-13    0.031318
2000-01-14    0.565075
2000-01-15    0.501018
2000-01-16   -1.096367
2000-01-17    0.290809
2000-01-18    0.035377
2000-01-19   -1.125671
2000-01-20    1.523958
2000-01-21    1.066099
2000-01-22    2.035655
2000-01-23    1.151862
2000-01-24   -1.204157
2000-01-25    0.770115
2000-01-26   -1.956111
2000-01-27   -0.047667
2000-01-28   -0.778248
2000-01-29    0.388646
2000-01-30   -1.226905
                ...   
2002-08-28   -0.156880
2002-08-29    0.092827
2002-08-30   -0.360281
2002-08-31   -0.524016
2002-09-01   -0.323111
2002-09-02    0.324747
2002-09-03   -0.090382
2002-09-04    0.573197
2002-09-05   -0.506839
2002-09-06   -0.516092
2002-09-07    0.382998
2002-09-08   -0.315170
2002-09-09 

In [14]:
longer_ts['2001']

2001-01-01   -0.689413
2001-01-02   -0.129989
2001-01-03    0.200948
2001-01-04   -1.336157
2001-01-05    0.668905
2001-01-06   -0.045583
2001-01-07    1.522334
2001-01-08    1.152726
2001-01-09   -0.285554
2001-01-10    0.891631
2001-01-11    1.474867
2001-01-12    0.785286
2001-01-13   -1.059099
2001-01-14    0.684889
2001-01-15    0.564896
2001-01-16   -0.278859
2001-01-17    0.206892
2001-01-18    0.335905
2001-01-19    2.227370
2001-01-20    1.485317
2001-01-21   -0.696520
2001-01-22   -1.065130
2001-01-23   -1.212273
2001-01-24   -0.834923
2001-01-25    1.073209
2001-01-26    0.477623
2001-01-27   -0.390275
2001-01-28   -1.414097
2001-01-29   -0.684405
2001-01-30   -0.709195
                ...   
2001-12-02   -0.430722
2001-12-03   -1.446557
2001-12-04   -0.433168
2001-12-05   -1.473646
2001-12-06   -0.240266
2001-12-07   -1.587177
2001-12-08    2.035679
2001-12-09    0.657873
2001-12-10    1.849066
2001-12-11    1.168744
2001-12-12    0.386542
2001-12-13   -0.780351
2001-12-14 

Here, the string '2001' is interpreted as a year, and the data for that period is selected. You can also specify a month:

In [15]:
longer_ts['2001-05']

2001-05-01   -1.047694
2001-05-02   -2.074168
2001-05-03   -1.374739
2001-05-04   -1.875479
2001-05-05   -0.629041
2001-05-06    0.239575
2001-05-07   -0.074583
2001-05-08   -0.332051
2001-05-09    0.352574
2001-05-10   -0.729554
2001-05-11   -1.260451
2001-05-12    2.386021
2001-05-13    0.507414
2001-05-14    2.174966
2001-05-15   -0.160949
2001-05-16   -0.535947
2001-05-17    0.272017
2001-05-18    0.933476
2001-05-19   -0.023010
2001-05-20   -0.827175
2001-05-21   -0.591335
2001-05-22    0.861078
2001-05-23    0.250064
2001-05-24   -0.344444
2001-05-25    0.274510
2001-05-26   -0.258019
2001-05-27   -1.322779
2001-05-28    0.376861
2001-05-29    0.037362
2001-05-30   -0.417679
2001-05-31   -1.521938
Freq: D, dtype: float64
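To make the year and month selections easy to check, this sketch rebuilds a 1000-day series with deterministic values and counts the selected rows. The name `daily` is illustrative; `.loc` is used for the string lookups.

```python
import numpy as np
import pandas as pd

idx = pd.date_range('2000-01-01', periods=1000, freq='D')
daily = pd.Series(np.arange(1000), index=idx)

year_2001 = daily.loc['2001']       # every day in 2001
may_2001 = daily.loc['2001-05']     # every day in May 2001

print(len(year_2001), len(may_2001))
print(may_2001.index[0], may_2001.index[-1])
```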

Indexing with a datetime object works as well:

In [16]:
ts[datetime(2011, 1, 7)]

0.63426639357601478

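Because most time series data is ordered chronologically, you can slice with timestamps to select a range of dates, even when the endpoints themselves are not in the index: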

In [17]:
ts

2011-01-02   -1.516355
2011-01-05   -1.312124
2011-01-07    0.634266
2011-01-08   -2.084913
2011-01-10    0.190022
2011-01-12   -0.422192
dtype: float64

In [18]:
ts['1/6/2011':'1/11/2011']

2011-01-07    0.634266
2011-01-08   -2.084913
2011-01-10    0.190022
dtype: float64
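A small sketch of the range query with fixed values. Both endpoints are inclusive, and neither has to exist in the index. `Series.truncate` is shown as an equivalent way to cut a sorted series between two dates; the variable names are made up.

```python
import pandas as pd

idx = pd.to_datetime(['2011-01-02', '2011-01-05', '2011-01-07',
                      '2011-01-08', '2011-01-10', '2011-01-12'])
s = pd.Series([0, 1, 2, 3, 4, 5], index=idx)

# Neither endpoint is in the index; rows between them are returned.
window = s.loc['2011-01-06':'2011-01-11']

# truncate drops everything before/after the given dates.
same_window = s.truncate(before='2011-01-06', after='2011-01-11')

print(window)
```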

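Remember that slicing in this manner produces a view on the original data: if you modify the result of the slice, the original data changes too (this is the behavior when Copy-on-Write is not enabled).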
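If the slice is going to be modified, one defensive pattern is to take an explicit copy first. This sketch uses made-up values; it only demonstrates that changes to the copy leave the source untouched, and does not depend on whether a plain slice is a view in your pandas version.

```python
import pandas as pd

idx = pd.to_datetime(['2011-01-02', '2011-01-05', '2011-01-07',
                      '2011-01-08', '2011-01-10', '2011-01-12'])
s = pd.Series([0.0, 1.0, 2.0, 3.0, 4.0, 5.0], index=idx)

# .copy() detaches the slice from the source series.
window = s.loc['2011-01-06':'2011-01-11'].copy()
window.iloc[:] = 0.0

print(window)
print(s)   # unchanged
```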

All of this applies to DataFrame as well, where the indexing selects rows:

In [19]:
dates = pd.date_range('1/1/2000', periods=100, freq='W-WED')

In [20]:
long_df = pd.DataFrame(np.random.randn(100, 4),
                       index=dates,
                       columns=['Colorado', 'Texas',
                                'New York', 'Ohio'])

In [21]:
long_df.loc['5-2001']

            Colorado     Texas  New York      Ohio
2001-05-02 -0.546985 -0.542177  0.850227  0.416314
2001-05-09 -0.555638 -0.225737 -1.446422 -0.813269
2001-05-16  0.630600 -0.811702 -0.264234  0.480644
2001-05-23 -0.200210  0.575624  1.420262 -0.367813
2001-05-30  1.342999  0.076870 -0.302700 -0.695746
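The same month string can be combined with a column selection in a single `.loc` call. This is a sketch with deterministic values instead of random ones; `df` and `subset` are illustrative names.

```python
import numpy as np
import pandas as pd

dates = pd.date_range('2000-01-01', periods=100, freq='W-WED')
df = pd.DataFrame(np.arange(400).reshape(100, 4),
                  index=dates,
                  columns=['Colorado', 'Texas', 'New York', 'Ohio'])

may = df.loc['2001-05']                        # all Wednesdays in May 2001
subset = df.loc['2001-05', ['Texas', 'Ohio']]  # same rows, two columns

print(may.shape, subset.shape)
```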


# 2 Time Series with Duplicate Indices

In some data sets, several observations may fall on the same timestamp:

In [22]:
dates = pd.DatetimeIndex(['1/1/2000', '1/2/2000', '1/2/2000', 
                          '1/2/2000', '1/3/2000'])

In [23]:
dup_ts = pd.Series(np.arange(5), index=dates)
dup_ts

2000-01-01    0
2000-01-02    1
2000-01-02    2
2000-01-02    3
2000-01-03    4
dtype: int32

We can check whether the index values are unique with the is_unique property:

In [24]:
dup_ts.index.is_unique

False

Indexing into this time series now produces either a scalar value or a slice, depending on whether the timestamp is duplicated:

In [25]:
dup_ts['1/3/2000'] # not duplicated

4

In [26]:
dup_ts['1/2/2000'] # duplicated

2000-01-02    1
2000-01-02    2
2000-01-02    3
dtype: int32

Suppose you want to aggregate the data that has non-unique timestamps. One way to do this is to use groupby and pass level=0:

In [27]:
grouped = dup_ts.groupby(level=0)
grouped.mean()

2000-01-01    0
2000-01-02    2
2000-01-03    4
dtype: int32

In [28]:
grouped.count()

2000-01-01    1
2000-01-02    3
2000-01-03    1
dtype: int64
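Aggregation is one way to handle duplicates; another is to keep a single observation per timestamp. The sketch below shows both on the same five-element series. Using `Index.duplicated` to keep the first occurrence is an alternative added here for illustration, not something the text above prescribes.

```python
import numpy as np
import pandas as pd

dates = pd.DatetimeIndex(['2000-01-01', '2000-01-02', '2000-01-02',
                          '2000-01-02', '2000-01-03'])
dup = pd.Series(np.arange(5), index=dates)

# Option 1: collapse duplicates by averaging within each timestamp.
means = dup.groupby(level=0).mean()

# Option 2: keep only the first observation for each timestamp.
first_only = dup[~dup.index.duplicated(keep='first')]

print(means)
print(first_only)
```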

### Additional material (condensed)

Date offsets: DateOffset

In [29]:
from pandas.tseries.offsets import DateOffset, Day
dates

DatetimeIndex(['2000-01-01', '2000-01-02', '2000-01-02', '2000-01-02',
               '2000-01-03'],
              dtype='datetime64[ns]', freq=None)

In [30]:
dates + DateOffset(months=1)

DatetimeIndex(['2000-02-01', '2000-02-02', '2000-02-02', '2000-02-02',
               '2000-02-03'],
              dtype='datetime64[ns]', freq=None)

In [31]:
dates + Day(3)

DatetimeIndex(['2000-01-04', '2000-01-05', '2000-01-05', '2000-01-05',
               '2000-01-06'],
              dtype='datetime64[ns]', freq=None)
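Calendar-based offsets and fixed-length offsets differ at month boundaries, which a single Timestamp makes easy to see. In this sketch `MonthEnd` is an extra offset class from `pandas.tseries.offsets`, brought in only for comparison.

```python
import pandas as pd
from pandas.tseries.offsets import DateOffset, Day, MonthEnd

jan31 = pd.Timestamp('2000-01-31')

# A calendar month is added; the day is clipped to the end of February.
plus_month = jan31 + DateOffset(months=1)

# A fixed number of days is added.
plus_days = jan31 + Day(3)

# Roll a mid-month date forward to the last day of its month.
month_end = pd.Timestamp('2000-01-15') + MonthEnd()

print(plus_month, plus_days, month_end)
```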

Time periods: Period

In [33]:
rng = pd.period_range('2011Q3', '2012Q4', freq='Q-JAN')
ts = pd.Series(np.arange(len(rng)), index=rng)
ts

2011Q3    0
2011Q4    1
2012Q1    2
2012Q2    3
2012Q3    4
2012Q4    5
Freq: Q-JAN, dtype: int32
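The 'Q-JAN' frequency means quarters of a fiscal year that ends in January, so the quarter labels do not line up with calendar quarters. This sketch converts one period to daily frequency to show which calendar days it covers; the variable names are illustrative.

```python
import pandas as pd

# Fourth quarter of the fiscal year ending January 2012.
p = pd.Period('2012Q4', freq='Q-JAN')

first_day = p.asfreq('D', how='start')   # first calendar day of the quarter
last_day = p.asfreq('D', how='end')      # last calendar day of the quarter
next_q = p + 1                           # period arithmetic moves by quarters

print(first_day, last_day, next_q)
```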

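Resampling and frequency conversion

Data can be converted to a different frequency, such as weekly, monthly, or every 5 minutes. There are many details to consider, such as which side of each interval is closed, the period convention (for example, whether Q1 starts in January or in December), and how the resulting bins are labeled.
- Converting from a higher frequency to a lower frequency is called downsampling, which requires deciding how to aggregate
- Converting from a lower frequency to a higher frequency is called upsampling, which requires deciding how to interpolate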
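Both directions can be sketched with tiny deterministic series. The data and variable names below are made up; `resample('W')` bins daily data into weeks for aggregation, and `resample('D')` on sparse data creates empty daily slots that are then filled or interpolated.

```python
import numpy as np
import pandas as pd

# Downsampling: 14 daily values aggregated into weekly sums.
daily = pd.Series(np.arange(14),
                  index=pd.date_range('2000-01-01', periods=14, freq='D'))
weekly = daily.resample('W').sum()

# Upsampling: two observations four days apart, expanded to daily.
sparse = pd.Series([0.0, 4.0],
                   index=pd.to_datetime(['2000-01-01', '2000-01-05']))
upsampled = sparse.resample('D').asfreq()           # new days are NaN
forward_filled = sparse.resample('D').ffill()       # carry last value forward
interpolated = sparse.resample('D').interpolate()   # linear interpolation

print(weekly)
print(interpolated)
```

Which fill method is appropriate depends on the data: forward filling suits values that hold until the next observation, while interpolation suits quantities that change smoothly.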

In [37]:
longer_ts.head()

2000-01-01   -0.136372
2000-01-02   -1.224098
2000-01-03   -0.483691
2000-01-04   -0.413888
2000-01-05   -0.879009
Freq: D, dtype: float64

In [35]:
longer_ts.resample('M', kind='period').sum()

2000-01     0.368187
2000-02    -8.608103
2000-03     9.213198
2000-04     8.463272
2000-05     9.582865
2000-06     6.481925
2000-07     2.052845
2000-08    -7.446094
2000-09    -8.158474
2000-10    -0.350103
2000-11     4.492100
2000-12    -3.275870
2001-01     3.848163
2001-02    -7.147563
2001-03    -8.117322
2001-04     2.969164
2001-05    -6.735116
2001-06    -2.504536
2001-07     6.987608
2001-08     8.608651
2001-09   -11.496576
2001-10    -6.502134
2001-11     9.201543
2001-12     4.593383
2002-01    13.781257
2002-02     8.868120
2002-03     6.306617
2002-04    -2.578467
2002-05     7.214143
2002-06    -2.298400
2002-07    10.638594
2002-08    -1.214689
2002-09     1.803768
Freq: M, dtype: float64

In [36]:
longer_ts.resample('Q-DEC', kind='period').sum()

2000Q1     0.973283
2000Q2    24.528062
2000Q3   -13.551722
2000Q4     0.866128
2001Q1   -11.416722
2001Q2    -6.270488
2001Q3     4.099683
2001Q4     7.292792
2002Q1    28.955994
2002Q2     2.337277
2002Q3    11.227673
Freq: Q-DEC, dtype: float64
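Frequency aliases and `resample` arguments have shifted between pandas releases, so it can help to know an equivalent that relies only on period conversion. The sketch below, offered as an alternative and not as the method used above, groups a daily series by the period each date falls in. Using a series of ones makes the sums equal to the number of days per period.

```python
import numpy as np
import pandas as pd

daily = pd.Series(np.ones(1000),
                  index=pd.date_range('2000-01-01', periods=1000, freq='D'))

# Map each timestamp to its month / quarter, then aggregate per period.
monthly = daily.groupby(daily.index.to_period('M')).sum()
quarterly = daily.groupby(daily.index.to_period('Q-DEC')).sum()

print(monthly.head())
print(quarterly.head())
```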

Further reading:

1. *Python for Data Analysis*, Chapter 11, Time Series

2. The official pandas documentation

3. The following three notebooks

[Date Ranges, Frequencies, and Shifting](http://nbviewer.jupyter.org/github/LearnXu/pydata-notebook/blob/master/Chapter-11/11.3%20Date%20Ranges%2C%20Frequencies%2C%20and%20Shifting%EF%BC%88%E6%97%A5%E6%9C%9F%E8%8C%83%E5%9B%B4%EF%BC%8C%E9%A2%91%E5%BA%A6%EF%BC%8C%E5%92%8C%E4%BD%8D%E7%A7%BB%EF%BC%89.ipynb)

[Periods and Period Arithmetic](http://nbviewer.jupyter.org/github/LearnXu/pydata-notebook/blob/master/Chapter-11/11.5%20Periods%20and%20Period%20Arithmetic%EF%BC%88%E5%91%A8%E6%9C%9F%E5%92%8C%E5%91%A8%E6%9C%9F%E8%BF%90%E7%AE%97%EF%BC%89.ipynb)

[Resampling and Frequency Conversion](http://nbviewer.jupyter.org/github/LearnXu/pydata-notebook/blob/master/Chapter-11/11.6%20Resampling%20and%20Frequency%20Conversion%EF%BC%88%E9%87%8D%E9%87%87%E6%A0%B7%E5%92%8C%E9%A2%91%E5%BA%A6%E8%BD%AC%E6%8D%A2%EF%BC%89.ipynb)