Reading notes on [*Python for Data Analysis*](https://book.douban.com/subject/25779298/).
 
[Chapter 10](/2017/07/20/python_data_analysis10.html), Section 2: Time Series Basics

All the data used here can be downloaded from [the author's GitHub](https://github.com/wesm/pydata-book).


In [1]:
%pylab inline
import pandas as pd
from pandas import Series, DataFrame

Populating the interactive namespace from numpy and matplotlib


The most basic kind of time series in pandas is a Series indexed by timestamps (often represented outside pandas as strings or datetime objects).

In [2]:
from datetime import datetime
dates = [datetime(2011, 1, 2), datetime(2011, 1, 5), datetime(2011, 1, 7),
         datetime(2011, 1, 8), datetime(2011, 1, 10), datetime(2011, 1, 12)]
ts = Series(np.random.randn(6), index=dates)
ts

2011-01-02    2.196938
2011-01-05    0.904351
2011-01-07   -0.471502
2011-01-08   -0.006652
2011-01-10    0.566689
2011-01-12    2.491312
dtype: float64
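The index can also be built from date strings rather than datetime objects. The sketch below assumes `pd.to_datetime` for the conversion; the values and the name `ts_from_strings` are made up for illustration.

```python
import pandas as pd

# Hypothetical values, chosen only so the result is reproducible.
date_strings = ['2011-01-02', '2011-01-05', '2011-01-07']

# to_datetime parses the strings into a DatetimeIndex.
index = pd.to_datetime(date_strings)
ts_from_strings = pd.Series([1.0, 2.0, 3.0], index=index)

print(type(ts_from_strings.index).__name__)  # DatetimeIndex
print(ts_from_strings.index[0])              # 2011-01-02 00:00:00
```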

In [4]:
# ts is a time series; as the output shows, its type is a plain Series and its index is a DatetimeIndex
print(type(ts))
print(ts.index)

<class 'pandas.core.series.Series'>
DatetimeIndex(['2011-01-02', '2011-01-05', '2011-01-07', '2011-01-08',
               '2011-01-10', '2011-01-12'],
              dtype='datetime64[ns]', freq=None)


In [5]:
# Arithmetic between time series with different indexes aligns automatically on the dates
ts + ts[::2]

2011-01-02    4.393876
2011-01-05         NaN
2011-01-07   -0.943004
2011-01-08         NaN
2011-01-10    1.133379
2011-01-12         NaN
dtype: float64
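Dates present in only one operand come out as NaN. If that is not what you want, one option is `Series.add` with `fill_value`. A minimal sketch with made-up values (the names `a`, `b`, `aligned_sum`, and `filled_sum` are illustrative):

```python
import pandas as pd

idx = pd.to_datetime(['2011-01-02', '2011-01-05', '2011-01-07', '2011-01-08'])
a = pd.Series([1.0, 2.0, 3.0, 4.0], index=idx)
b = a[::2]  # keeps only 2011-01-02 and 2011-01-07

# Plain addition aligns on the dates; unmatched dates become NaN.
aligned_sum = a + b

# add() with fill_value treats the missing side as 0 instead.
filled_sum = a.add(b, fill_value=0)

print(aligned_sum)
print(filled_sum)
```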

In [7]:
# DatetimeIndex uses the datetime64 dtype, storing timestamps at nanosecond resolution
# Its scalar values are pandas Timestamp objects
print(ts.index.dtype)
ts.index[0]

datetime64[ns]


Timestamp('2011-01-02 00:00:00')
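To make the nanosecond storage concrete, the sketch below reads the underlying integer through `Timestamp.value`, which is the number of nanoseconds since 1970-01-01 UTC. The variable names are illustrative.

```python
from datetime import datetime

import pandas as pd

idx = pd.DatetimeIndex([datetime(2011, 1, 2), datetime(2011, 1, 5)])
first = idx[0]

# Elements of a DatetimeIndex are pandas Timestamps,
# and Timestamp is itself a subclass of datetime.
print(isinstance(first, pd.Timestamp))  # True
print(isinstance(first, datetime))      # True

# Nanoseconds elapsed since the Unix epoch.
print(first.value)                      # 1293926400000000000
```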

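## Indexing, Selection, and Subsetting

In the book's pandas version TimeSeries is a subclass of Series; in the version used here a time series is simply a Series (see the type printed above). Either way, indexing and selection work exactly as for any other Series.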

In [8]:
stamp = ts.index[2]
ts[stamp]

-0.47150210579550061

In [9]:
# A more convenient option is to pass a string that can be interpreted as a date
print(ts['1/10/2011'])
print(ts['20110110'])

0.566689483331
0.566689483331
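Several spellings of the same date resolve to the same entry. A small sketch with made-up values (the name `ts_demo` is illustrative); note that an ambiguous string such as `'1/10/2011'` is read month-first by default:

```python
import pandas as pd

idx = pd.to_datetime(['2011-01-08', '2011-01-10', '2011-01-12'])
ts_demo = pd.Series([10.0, 20.0, 30.0], index=idx)

by_us_style = ts_demo['1/10/2011']    # month/day/year
by_compact = ts_demo['20110110']      # YYYYMMDD
by_iso = ts_demo['2011-01-10']        # ISO 8601
by_timestamp = ts_demo[pd.Timestamp('2011-01-10')]

print(by_us_style, by_compact, by_iso, by_timestamp)
```

ISO-style strings are the least ambiguous choice when the day is 12 or lower.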


In [11]:
# For longer time series, passing just a year, or a year and month, selects a slice of the data
longer_ts = Series(np.random.randn(1000),
                   index=pd.date_range('1/1/2000', periods=1000))
longer_ts.head()

2000-01-01    0.752180
2000-01-02   -0.667890
2000-01-03    0.438020
2000-01-04    0.085829
2000-01-05    0.355862
Freq: D, dtype: float64

In [13]:
longer_ts['2001'].tail()

2001-12-27    1.398178
2001-12-28    0.420257
2001-12-29    0.217273
2001-12-30   -1.462059
2001-12-31   -1.135778
Freq: D, dtype: float64

In [14]:
longer_ts['2001-05'].tail()

2001-05-27   -0.747961
2001-05-28    0.435790
2001-05-29    1.298764
2001-05-30    0.378459
2001-05-31    0.577249
Freq: D, dtype: float64
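The size of each partial-string selection can be checked directly. This sketch rebuilds a daily series of the same shape with deterministic values and uses `.loc`, which accepts the same partial date strings; `daily` and the other names are illustrative.

```python
import numpy as np
import pandas as pd

# 1000 consecutive days starting 2000-01-01 (ends in September 2002).
daily = pd.Series(np.arange(1000),
                  index=pd.date_range('2000-01-01', periods=1000))

year_2001 = daily.loc['2001']      # every day in 2001
may_2001 = daily.loc['2001-05']    # every day in May 2001

print(len(year_2001))  # 365
print(len(may_2001))   # 31
```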

In [19]:
# You can slice with timestamps that are not in the time series (a range query)
# Date strings, datetime objects, or Timestamps all work here
longer_ts['1/6/1999':'1/11/2000']

2000-01-01    0.752180
2000-01-02   -0.667890
2000-01-03    0.438020
2000-01-04    0.085829
2000-01-05    0.355862
2000-01-06   -1.193612
2000-01-07    1.730057
2000-01-08   -0.436587
2000-01-09    1.766524
2000-01-10    0.717529
2000-01-11   -0.558795
Freq: D, dtype: float64
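The same range query works on an irregular series, where neither endpoint has to be an existing index entry and both bounds are inclusive. The sketch below also shows `truncate`, an equivalent way to cut a sorted series at a date. Values and names are made up for illustration.

```python
import pandas as pd

idx = pd.to_datetime(['2011-01-02', '2011-01-05', '2011-01-07',
                      '2011-01-08', '2011-01-10', '2011-01-12'])
ts_demo = pd.Series(range(6), index=idx)

# Neither 2011-01-06 nor 2011-01-11 is in the index.
window = ts_demo['2011-01-06':'2011-01-11']

# Drop everything after 2011-01-09 (the index must be sorted).
trimmed = ts_demo.truncate(after='2011-01-09')

print(window)
print(trimmed)
```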

In [20]:
dates = pd.date_range('1/1/2000', periods=100, freq='W-WED')
long_df = DataFrame(np.random.randn(100, 4),
                    index=dates,
                    columns=['Colorado', 'Texas', 'New York', 'Ohio'])
long_df.loc['5-2001']

| | Colorado | Texas | New York | Ohio |
|---|---|---|---|---|
| 2001-05-02 | 0.489843 | 0.667164 | -0.021513 | 0.638708 |
| 2001-05-09 | -0.672459 | -0.951448 | 0.200572 | 0.33149 |
| 2001-05-16 | 0.52699 | 1.011224 | -1.010511 | 1.244311 |
| 2001-05-23 | 0.199731 | -0.477121 | 1.051838 | 0.859933 |
| 2001-05-30 | -0.463182 | -1.097522 | -0.91246 | 0.34036 |
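For a DataFrame, label-based selection with `.loc` takes the same partial date strings on the row axis and can be combined with a column label. A minimal sketch with deterministic values instead of random ones (`df_demo` and the other names are illustrative):

```python
import numpy as np
import pandas as pd

# 100 Wednesdays, starting with the first one on or after 2000-01-01.
dates = pd.date_range('2000-01-01', periods=100, freq='W-WED')
df_demo = pd.DataFrame(np.arange(400).reshape(100, 4),
                       index=dates,
                       columns=['Colorado', 'Texas', 'New York', 'Ohio'])

may_2001 = df_demo.loc['2001-05']            # all rows in May 2001
may_texas = df_demo.loc['2001-05', 'Texas']  # same rows, one column

print(may_2001.shape)  # (5, 4)
print(may_texas)
```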


## Time Series with Duplicate Indices

In [21]:
dates = pd.DatetimeIndex(['1/1/2000', '1/2/2000', '1/2/2000', '1/2/2000',
                          '1/3/2000'])
dup_ts = Series(np.arange(5), index=dates)
dup_ts

2000-01-01    0
2000-01-02    1
2000-01-02    2
2000-01-02    3
2000-01-03    4
dtype: int32

In [22]:
dup_ts.index.is_unique

False

In [25]:
# Indexing yields either a scalar or a slice, depending on whether the timestamp is duplicated
print(dup_ts['1/2/2000'])
print('----------------------------')
print(dup_ts['1/3/2000'])


2000-01-02    1
2000-01-02    2
2000-01-02    3
dtype: int32
----------------------------
4
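Because the return type depends on the data, code that must handle both cases can slice on the date instead, which always returns a Series. The sketch below shows that along with `Index.duplicated` for flagging the repeated timestamps; this is one possible approach, and the names are illustrative.

```python
import numpy as np
import pandas as pd

dates = pd.DatetimeIndex(['2000-01-01', '2000-01-02', '2000-01-02',
                          '2000-01-02', '2000-01-03'])
dup_demo = pd.Series(np.arange(5), index=dates)

# keep=False marks every member of a duplicated group.
is_dup = dup_demo.index.duplicated(keep=False)

# A date-to-same-date slice returns a Series in both cases.
one_row = dup_demo['2000-01-03':'2000-01-03']
three_rows = dup_demo['2000-01-02':'2000-01-02']

print(is_dup)
print(len(one_row), len(three_rows))  # 1 3
```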


In [28]:
# One way to aggregate data with non-unique timestamps is to use groupby with level=0
grouped = dup_ts.groupby(level=0)
grouped.mean()

2000-01-01    0
2000-01-02    2
2000-01-03    4
dtype: int32
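Other aggregations work on the same grouping, and if only one observation per timestamp is wanted, filtering on `Index.duplicated` is an alternative to aggregating. A sketch under the same made-up data (names are illustrative; the dtype of the mean may differ between pandas versions):

```python
import numpy as np
import pandas as pd

dates = pd.DatetimeIndex(['2000-01-01', '2000-01-02', '2000-01-02',
                          '2000-01-02', '2000-01-03'])
dup_demo = pd.Series(np.arange(5), index=dates)

grouped = dup_demo.groupby(level=0)  # group rows by index value
means = grouped.mean()               # one value per distinct timestamp
counts = grouped.count()             # observations per timestamp

# Alternative: keep only the first row for each timestamp.
first_only = dup_demo[~dup_demo.index.duplicated(keep='first')]

print(means)
print(counts)
print(first_only)
```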