# Financial and Economic Data Applications

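- A cross-section is the data at a single point in time; for example, the closing prices of a set of stocks on a given date form a cross-section
- Cross-sectional data for multiple data items (such as price and volume) at multiple points in time form a panel
- Panel data can be represented as a DataFrame with a hierarchical index, or as a three-dimensional pandas Panel object (Panel was deprecated in pandas 0.20 and removed in 0.25)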
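
The hierarchical-index representation can be sketched on a tiny example. The tickers and numbers below are made up for illustration; the sketch stacks a price table and a volume table into one DataFrame indexed by (date, ticker) and then pulls out a single cross-section.

```python
import pandas as pd

# Hypothetical prices and volumes for two tickers on three dates
dates = pd.to_datetime(['2012-06-01', '2012-06-04', '2012-06-05'])
price = pd.DataFrame({'AAPL': [560.0, 564.0, 562.0],
                      'MSFT': [28.0, 28.5, 28.4]}, index=dates)
volume = pd.DataFrame({'AAPL': [100, 120, 90],
                       'MSFT': [300, 280, 310]}, index=dates)

# Stack each item into long form and combine: rows are (date, ticker)
# pairs, columns are the data items
panel = pd.concat({'price': price.stack(), 'volume': volume.stack()}, axis=1)
panel.index.names = ['date', 'ticker']

# One cross-section: every ticker and item on a single date
cross_section = panel.xs(pd.Timestamp('2012-06-04'), level='date')
print(cross_section)
```

Selecting with `xs` on the `date` level returns one row per ticker, which is the cross-section described in the first bullet.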

In [1]:
%pylab inline

import numpy as np
from numpy.random import randn

import pandas as pd
from pandas import Series, DataFrame

Populating the interactive namespace from numpy and matplotlib


## Data Munging Topics
### Time Series and Cross-Section Alignment

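One of the most time-consuming issues in working with financial data is the so-called data alignment problem:
- The indexes of two related time series may not line up well
- Two DataFrame objects may have columns or rows that do not match

pandas aligns data automatically in arithmetic operations. In practice this gives you a great deal of freedom and makes you more productive.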
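
As a minimal illustration of automatic alignment, the sketch below adds two hypothetical Series whose indexes only partly overlap. The labels and values are invented for the example.

```python
import pandas as pd

# Two hypothetical series whose indexes only partly overlap
a = pd.Series([1.0, 2.0, 3.0], index=['x', 'y', 'z'])
b = pd.Series([10.0, 20.0], index=['y', 'w'])

# Arithmetic aligns on the union of the labels; a label present on
# only one side produces NaN
total = a + b
print(total)

# add() with fill_value treats a label missing on one side as 0 instead
filled = a.add(b, fill_value=0)
print(filled)
```

Only `'y'` exists in both inputs, so it is the only label with a non-missing value in `total`.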

In [2]:
prices = pd.read_csv('ch11/stock_px.csv', index_col=0)
prices = prices[['AAPL', 'JNJ', 'SPX', 'XOM']]
prices.head()

| | AAPL | JNJ | SPX | XOM |
|---|---|---|---|---|
| 1990-02-01 00:00:00 | 7.86 | 4.27 | 328.79 | 6.12 |
| 1990-02-02 00:00:00 | 8.0 | 4.37 | 330.92 | 6.24 |
| 1990-02-05 00:00:00 | 8.18 | 4.34 | 331.85 | 6.25 |
| 1990-02-06 00:00:00 | 8.12 | 4.32 | 329.66 | 6.23 |
| 1990-02-07 00:00:00 | 7.77 | 4.38 | 333.75 | 6.33 |


In [3]:
volume = pd.read_csv('ch11/volume.csv', index_col=0)
volume = volume[['AAPL', 'JNJ', 'XOM']]
volume.head()

| | AAPL | JNJ | XOM |
|---|---|---|---|
| 1990-02-01 00:00:00 | 4193200.0 | 5942400.0 | 2916400.0 |
| 1990-02-02 00:00:00 | 4248800.0 | 4732800.0 | 4250000.0 |
| 1990-02-05 00:00:00 | 3653200.0 | 3950400.0 | 5880800.0 |
| 1990-02-06 00:00:00 | 2640000.0 | 3761600.0 | 4750800.0 |
| 1990-02-07 00:00:00 | 11180800.0 | 5458400.0 | 4124800.0 |


Suppose we want to compute a volume-weighted average price (VWAP) from all the valid data:

In [4]:
vwap = (prices*volume).sum() / volume.sum()
vwap

AAPL    81.246271
JNJ     40.576111
SPX           NaN
XOM     50.520303
dtype: float64

In [5]:
# SPX does not appear in volume, so drop it explicitly
vwap = vwap.dropna()
vwap

AAPL    81.246271
JNJ     40.576111
XOM     50.520303
dtype: float64

In [6]:
# To align by hand, use the DataFrame align method
prices_aligned = prices.align(volume, join='inner')[0]
prices_aligned.head()

| | AAPL | JNJ | XOM |
|---|---|---|---|
| 1990-02-01 00:00:00 | 7.86 | 4.27 | 6.12 |
| 1990-02-02 00:00:00 | 8.0 | 4.37 | 6.24 |
| 1990-02-05 00:00:00 | 8.18 | 4.34 | 6.25 |
| 1990-02-06 00:00:00 | 8.12 | 4.32 | 6.23 |
| 1990-02-07 00:00:00 | 7.77 | 4.38 | 6.33 |


In [7]:
vwap = (prices_aligned*volume).sum() / volume.sum()
vwap

AAPL    81.246271
JNJ     40.576111
XOM     50.520303
dtype: float64
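
The alignment and VWAP steps can also be checked end to end on a tiny hand-made data set where the answer is easy to verify by hand. The ticker `AAA` and all numbers here are hypothetical.

```python
import pandas as pd

# Hypothetical prices; 'SPX' is an index level with no volume data
prices = pd.DataFrame({'AAA': [10.0, 12.0], 'SPX': [1300.0, 1310.0]},
                      index=pd.to_datetime(['2012-06-01', '2012-06-04']))
volume = pd.DataFrame({'AAA': [100.0, 300.0]}, index=prices.index)

# join='inner' keeps only the row and column labels common to both frames
prices_aligned, volume_aligned = prices.align(volume, join='inner')

# VWAP per column: sum(price * volume) / sum(volume)
vwap = (prices_aligned * volume_aligned).sum() / volume_aligned.sum()
print(vwap)
```

Here the VWAP for `AAA` is (10 * 100 + 12 * 300) / 400 = 11.5, and `SPX` never enters the result because the inner join dropped it.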

In [8]:
# Build a DataFrame from a set of Series whose indexes may differ
s1 = Series(range(3), index=list('abc'))
s2 = Series(range(4), index=list('dbce'))
s3 = Series(range(3), index=list('fac'))

DataFrame({'one': s1, 'two': s2, 'three': s3})

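| | one | three | two |
|---|---|---|---|
| a | 0.0 | 1.0 | NaN |
| b | 1.0 | NaN | 1.0 |
| c | 2.0 | 2.0 | 2.0 |
| d | NaN | NaN | 0.0 |
| e | NaN | NaN | 3.0 |
| f | NaN | 0.0 | NaN |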


In [9]:
# Specify the result index explicitly (the rest of the data is discarded)
DataFrame({'one': s1, 'two': s2, 'three': s3}, index=list('face'))

| | one | three | two |
|---|---|---|---|
| f | NaN | 0.0 | NaN |
| a | 0.0 | 1.0 | NaN |
| c | 2.0 | 2.0 | 2.0 |
| e | NaN | NaN | 3.0 |


### Operations with Time Series of Different Frequencies

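Economic time series are often reported at annual, quarterly, monthly, daily, or some other more specialized frequency. Some are completely irregular; for example, earnings estimate revisions can arrive at any time.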

Scenario | Tool | Description
---|---|---
Frequency conversion | `resample` | Convert data to a fixed frequency
Realignment | `reindex` | Conform data to a new index

In [10]:
ts1 = Series(np.arange(3),
             index=pd.date_range('2012-6-13', periods=3, freq='W-WED'))
ts1

2012-06-13    0
2012-06-20    1
2012-06-27    2
Freq: W-WED, dtype: int64

In [11]:
# Resample to business-day frequency
ts1.resample('B').mean()

2012-06-13    0.0
2012-06-14    NaN
2012-06-15    NaN
2012-06-18    NaN
2012-06-19    NaN
2012-06-20    1.0
2012-06-21    NaN
2012-06-22    NaN
2012-06-25    NaN
2012-06-26    NaN
2012-06-27    2.0
Freq: B, dtype: float64

In [12]:
# Fill the gaps with ffill
# This is common with lower-frequency data: every timestamp ends up with the most recent valid value
ts1.resample('B').ffill()

2012-06-13    0
2012-06-14    0
2012-06-15    0
2012-06-18    0
2012-06-19    0
2012-06-20    1
2012-06-21    1
2012-06-22    1
2012-06-25    1
2012-06-26    1
2012-06-27    2
Freq: B, dtype: int64

In [13]:
dates = pd.DatetimeIndex(['2012-6-12', '2012-6-17', '2012-6-18',
                          '2012-6-21', '2012-6-22', '2012-6-29'])
ts2 = Series(np.arange(6), index=dates)
ts2

2012-06-12    0
2012-06-17    1
2012-06-18    2
2012-06-21    3
2012-06-22    4
2012-06-29    5
dtype: int64

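Suppose we want to add the most recent value of ts1 to ts2:
- One option is to resample both to a regular frequency and then add them
- To keep the date index of ts2, reindex is a better solution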

In [14]:
ts2 + ts1.reindex(ts2.index, method='ffill')

2012-06-12    NaN
2012-06-17    1.0
2012-06-18    2.0
2012-06-21    4.0
2012-06-22    5.0
2012-06-29    7.0
dtype: float64
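
For comparison, the first option (resampling both series to a regular frequency before adding) might look like the sketch below. It assumes a daily grid (`'D'`) rather than business days, because ts2 contains a Sunday (2012-06-17); that choice is an assumption of this example.

```python
import numpy as np
import pandas as pd

ts1 = pd.Series(np.arange(3),
                index=pd.date_range('2012-6-13', periods=3, freq='W-WED'))
dates = pd.DatetimeIndex(['2012-6-12', '2012-6-17', '2012-6-18',
                          '2012-6-21', '2012-6-22', '2012-6-29'])
ts2 = pd.Series(np.arange(6), index=dates)

# Put both series on a daily grid, carrying the last known value forward
ts1_daily = ts1.resample('D').ffill()
ts2_daily = ts2.resample('D').ffill()

# The sum is defined only where both daily grids overlap
total = ts1_daily + ts2_daily
print(total.dropna())
```

The result lives on a daily index, not on the original dates of ts2, which is why the `reindex` approach is preferable when the ts2 dates must be preserved.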

#### Using periods instead of timestamps

In [15]:
gdp = Series([1.78, 1.94, 2.08, 2.01, 2.15, 2.31, 2.46], 
             index=pd.period_range('1984Q2', periods=7, freq='Q-SEP'))
gdp

1984Q2    1.78
1984Q3    1.94
1984Q4    2.08
1985Q1    2.01
1985Q2    2.15
1985Q3    2.31
1985Q4    2.46
Freq: Q-SEP, dtype: float64

In [16]:
infl = Series([0.025, 0.045, 0.037, 0.04],
              index=pd.period_range('1982', periods=4, freq='A-DEC'))
infl

1982    0.025
1983    0.045
1984    0.037
1985    0.040
Freq: A-DEC, dtype: float64

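Unlike Timestamp-indexed time series, operations between two period-indexed time series of different frequencies require an explicit conversion first, followed by a reindex.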

In [17]:
infl_q = infl.asfreq('Q-SEP', how='end')
infl_q

1983Q1    0.025
1984Q1    0.045
1985Q1    0.037
1986Q1    0.040
Freq: Q-SEP, dtype: float64

In [18]:
infl_q.reindex(gdp.index, method='ffill')

1984Q2    0.045
1984Q3    0.045
1984Q4    0.045
1985Q1    0.037
1985Q2    0.037
1985Q3    0.037
1985Q4    0.037
Freq: Q-SEP, dtype: float64
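
The reason calendar year 1983 maps to 1984Q1 in `infl_q` is the fiscal-year convention of `Q-SEP`. The sketch below checks that mapping with individual `Period` objects; it goes through a monthly period rather than an annual one purely to keep the example small.

```python
import pandas as pd

# With freq='Q-SEP' the fiscal year ends in September, so fiscal 1984
# runs from October 1983 through September 1984
q = pd.Period('1984Q1', freq='Q-SEP')
print(q.asfreq('M', how='start'), q.asfreq('M', how='end'))

# December 1983, the last month of calendar year 1983, therefore falls
# in fiscal quarter 1984Q1
dec_1983 = pd.Period('1983-12', freq='M')
print(dec_1983.asfreq('Q-SEP'))
```

Because `how='end'` anchors each annual period at its last month, every year in `infl` lands in the first fiscal quarter of the following fiscal year.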

### Time of Day and “as of” Data Selection

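Suppose you have a long intraday market time series and want to extract the price at a particular time of day on each day. What if the data are irregular and observations do not fall exactly on the desired time?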

In [19]:
rng = pd.date_range('2012-06-01 09:30', '2012-06-01 15:59', freq='T')
rng = rng.append([rng + pd.offsets.BDay(i) for i in range(1,4)])

ts = Series(np.arange(len(rng), dtype=float), index=rng)
ts

2012-06-01 09:30:00       0.0
2012-06-01 09:31:00       1.0
2012-06-01 09:32:00       2.0
2012-06-01 09:33:00       3.0
2012-06-01 09:34:00       4.0
2012-06-01 09:35:00       5.0
2012-06-01 09:36:00       6.0
2012-06-01 09:37:00       7.0
2012-06-01 09:38:00       8.0
2012-06-01 09:39:00       9.0
2012-06-01 09:40:00      10.0
2012-06-01 09:41:00      11.0
2012-06-01 09:42:00      12.0
2012-06-01 09:43:00      13.0
2012-06-01 09:44:00      14.0
2012-06-01 09:45:00      15.0
2012-06-01 09:46:00      16.0
2012-06-01 09:47:00      17.0
2012-06-01 09:48:00      18.0
2012-06-01 09:49:00      19.0
2012-06-01 09:50:00      20.0
2012-06-01 09:51:00      21.0
2012-06-01 09:52:00      22.0
2012-06-01 09:53:00      23.0
2012-06-01 09:54:00      24.0
2012-06-01 09:55:00      25.0
2012-06-01 09:56:00      26.0
2012-06-01 09:57:00      27.0
2012-06-01 09:58:00      28.0
2012-06-01 09:59:00      29.0
                        ...  
2012-06-06 15:30:00    1530.0
2012-06-06 15:31:00    1531.0
2012-06-06

In [20]:
from datetime import time

# Indexing with a datetime.time object extracts the values at that time of day
ts[time(10, 0)]

2012-06-01 10:00:00      30.0
2012-06-04 10:00:00     420.0
2012-06-05 10:00:00     810.0
2012-06-06 10:00:00    1200.0
dtype: float64

In [21]:
# Under the hood this uses the at_time method
ts.at_time(time(10, 0))

2012-06-01 10:00:00      30.0
2012-06-04 10:00:00     420.0
2012-06-05 10:00:00     810.0
2012-06-06 10:00:00    1200.0
dtype: float64

In [22]:
# Select the values between two time objects
ts.between_time(time(10, 0), time(10, 1))

2012-06-01 10:00:00      30.0
2012-06-01 10:01:00      31.0
2012-06-04 10:00:00     420.0
2012-06-04 10:01:00     421.0
2012-06-05 10:00:00     810.0
2012-06-05 10:01:00     811.0
2012-06-06 10:00:00    1200.0
2012-06-06 10:01:00    1201.0
dtype: float64

In [23]:
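# Randomly set all but 700 of the observations to NA
#indexer = np.sort(np.random.permutation(len(ts))[700:])
indexer = np.random.permutation(len(ts))[700:]

irr_ts = ts.copy()
# indexer holds integer positions, so assign by position with iloc
irr_ts.iloc[indexer] = np.nan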
irr_ts['2012-06-01 09:50':'2012-06-01 10:00']

2012-06-01 09:50:00    20.0
2012-06-01 09:51:00     NaN
2012-06-01 09:52:00    22.0
2012-06-01 09:53:00     NaN
2012-06-01 09:54:00    24.0
2012-06-01 09:55:00     NaN
2012-06-01 09:56:00    26.0
2012-06-01 09:57:00     NaN
2012-06-01 09:58:00     NaN
2012-06-01 09:59:00    29.0
2012-06-01 10:00:00     NaN
dtype: float64

In [24]:
selection = pd.date_range('2012-06-01 10:00', periods=4, freq='B')

# Passing an array of timestamps to asof returns the last valid (non-NA) value at or before each timestamp
irr_ts.asof(selection)

2012-06-01 10:00:00      29.0
2012-06-04 10:00:00     419.0
2012-06-05 10:00:00     808.0
2012-06-06 10:00:00    1198.0
Freq: B, dtype: float64
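
The behavior of `asof` is easier to see on a handful of points. The sketch below uses hypothetical minute bars with two missing observations.

```python
import numpy as np
import pandas as pd

# Hypothetical minute bars with gaps (NaN) at 09:59 and 10:00
idx = pd.date_range('2012-06-01 09:57', periods=5, freq='min')
obs = pd.Series([27.0, 28.0, np.nan, np.nan, 31.0], index=idx)

# asof skips NaN and returns the last valid value at or before the timestamp
at_ten = obs.asof(pd.Timestamp('2012-06-01 10:00'))
print(at_ten)

# A timestamp earlier than the first observation has nothing to fall back on
too_early = obs.asof(pd.Timestamp('2012-06-01 09:00'))
print(too_early)
```

The 10:00 lookup falls back to the 09:58 value, while the 09:00 lookup yields NaN because no earlier observation exists.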

### Splicing Together Data Sources

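Several situations come up frequently in finance and economics:
- Switching from one data source to another at a specific point in time
- Using another time series to patch missing values in the current one
- Replacing symbols in the data (countries, asset tickers) with actual data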

In [25]:
# Switch from one data source to another at a specific point in time

data1 = DataFrame(np.ones((6,3), dtype=float),
                  columns=['a','b','c'],
                  index=pd.date_range('6/12/2012', periods=6))

data2 = DataFrame(np.ones((6,3), dtype=float) * 2,
                  columns=['a','b','c'],
                  index=pd.date_range('6/13/2012', periods=6))

# .ix was removed in pandas 1.0; .loc performs the same label-based slicing
spliced = pd.concat([data1.loc[:'2012-6-14'], data2.loc['2012-6-15':]])
spliced

| | a | b | c |
|---|---|---|---|
| 2012-06-12 | 1.0 | 1.0 | 1.0 |
| 2012-06-13 | 1.0 | 1.0 | 1.0 |
| 2012-06-14 | 1.0 | 1.0 | 1.0 |
| 2012-06-15 | 2.0 | 2.0 | 2.0 |
| 2012-06-16 | 2.0 | 2.0 | 2.0 |
| 2012-06-17 | 2.0 | 2.0 | 2.0 |
| 2012-06-18 | 2.0 | 2.0 | 2.0 |


In [26]:
# Suppose data1 is missing a time series that is present in data2
data2 = DataFrame(np.ones((6,4), dtype=float) * 2,
                  columns=['a','b','c','d'],
                  index=pd.date_range('6/13/2012', periods=6))

# .ix was removed in pandas 1.0; .loc performs the same label-based slicing
spliced = pd.concat([data1.loc[:'2012-6-14'], data2.loc['2012-6-15':]])
spliced

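| | a | b | c | d |
|---|---|---|---|---|
| 2012-06-12 | 1.0 | 1.0 | 1.0 | NaN |
| 2012-06-13 | 1.0 | 1.0 | 1.0 | NaN |
| 2012-06-14 | 1.0 | 1.0 | 1.0 | NaN |
| 2012-06-15 | 2.0 | 2.0 | 2.0 | 2.0 |
| 2012-06-16 | 2.0 | 2.0 | 2.0 | 2.0 |
| 2012-06-17 | 2.0 | 2.0 | 2.0 | 2.0 |
| 2012-06-18 | 2.0 | 2.0 | 2.0 | 2.0 |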


In [27]:
# combine_first brings in data from before the splice point, extending the history of column 'd'
spliced_filled = spliced.combine_first(data2)
spliced_filled

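| | a | b | c | d |
|---|---|---|---|---|
| 2012-06-12 | 1.0 | 1.0 | 1.0 | NaN |
| 2012-06-13 | 1.0 | 1.0 | 1.0 | 2.0 |
| 2012-06-14 | 1.0 | 1.0 | 1.0 | 2.0 |
| 2012-06-15 | 2.0 | 2.0 | 2.0 | 2.0 |
| 2012-06-16 | 2.0 | 2.0 | 2.0 | 2.0 |
| 2012-06-17 | 2.0 | 2.0 | 2.0 | 2.0 |
| 2012-06-18 | 2.0 | 2.0 | 2.0 | 2.0 |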


In [28]:
# DataFrame.update modifies the frame in place; overwrite=False fills only the holes
spliced.update(data2, overwrite=False)
spliced

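| | a | b | c | d |
|---|---|---|---|---|
| 2012-06-12 | 1.0 | 1.0 | 1.0 | NaN |
| 2012-06-13 | 1.0 | 1.0 | 1.0 | 2.0 |
| 2012-06-14 | 1.0 | 1.0 | 1.0 | 2.0 |
| 2012-06-15 | 2.0 | 2.0 | 2.0 | 2.0 |
| 2012-06-16 | 2.0 | 2.0 | 2.0 | 2.0 |
| 2012-06-17 | 2.0 | 2.0 | 2.0 | 2.0 |
| 2012-06-18 | 2.0 | 2.0 | 2.0 | 2.0 |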
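
`combine_first` and `update(..., overwrite=False)` give the same result in the cells above because every date in `data2` already exists in `spliced`. They differ when the patching data has extra rows, as this small sketch with hypothetical frames `left` and `right` suggests.

```python
import numpy as np
import pandas as pd

left = pd.DataFrame({'x': [1.0, np.nan]}, index=[0, 1])
right = pd.DataFrame({'x': [10.0, 20.0, 30.0]}, index=[0, 1, 2])

# combine_first returns a new frame on the union of both indexes,
# taking values from `right` only where `left` is missing
combined = left.combine_first(right)
print(combined)

# update works in place and keeps left's index; with overwrite=False
# only the NaN entries are replaced
patched = left.copy()
patched.update(right, overwrite=False)
print(patched)
```

`combined` gains the row labelled 2 from `right`, whereas `patched` keeps only the two original rows.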


In [29]:
cp_spliced = spliced.copy()

# Use DataFrame indexing to set the columns directly
cp_spliced[['a','c']] = data1[['a','c']]
cp_spliced

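| | a | b | c | d |
|---|---|---|---|---|
| 2012-06-12 | 1.0 | 1.0 | 1.0 | NaN |
| 2012-06-13 | 1.0 | 1.0 | 1.0 | 2.0 |
| 2012-06-14 | 1.0 | 1.0 | 1.0 | 2.0 |
| 2012-06-15 | 1.0 | 2.0 | 1.0 | 2.0 |
| 2012-06-16 | 1.0 | 2.0 | 1.0 | 2.0 |
| 2012-06-17 | 1.0 | 2.0 | 1.0 | 2.0 |
| 2012-06-18 | NaN | 2.0 | NaN | 2.0 |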


### Return Indexes and Cumulative Returns

In finance, a return usually refers to the percentage change in the price of an asset.

In [30]:
import pandas_datareader.data as web

price = web.get_data_yahoo('AAPL', '2011-01-01')['Adj Close']
price[-5:]

Date
2017-04-03    143.699997
2017-04-04    144.770004
2017-04-05    144.020004
2017-04-06    143.660004
2017-04-07    143.339996
Name: Adj Close, dtype: float64

In [31]:
# Cumulative percentage return between two points in time
price['2011-10-03'] / price['2011-3-1'] - 1 

0.072399896359515159

In [32]:
# Use cumprod to compute a simple return index
returns = price.pct_change()
ret_index = (1 + returns).cumprod()

# Set the first value to 1 (pct_change leaves it as NaN)
ret_index.iloc[0] = 1

ret_index.tail()

Date
2017-04-03    3.365423
2017-04-04    3.390482
2017-04-05    3.372917
2017-04-06    3.364486
2017-04-07    3.356992
Name: Adj Close, dtype: float64
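
Since the Yahoo download above depends on network access, the return-index construction can be checked offline on a few hypothetical prices. The sketch also shows that the index is simply the price series rescaled to start at 1.

```python
import pandas as pd

# Hypothetical adjusted closing prices
price = pd.Series([100.0, 102.0, 99.96, 104.958],
                  index=pd.date_range('2012-01-02', periods=4, freq='B'))

returns = price.pct_change()          # simple return: p_t / p_{t-1} - 1
ret_index = (1 + returns).cumprod()   # growth of 1 unit invested on day one
ret_index.iloc[0] = 1                 # pct_change leaves the first value NaN

# The index is the price rescaled to start at 1
rescaled = price / price.iloc[0]
print(ret_index)
print(rescaled)
```

The cumulative return between any two dates is then the ratio of the two index values minus 1, matching the direct price ratio used earlier.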

In [33]:
# Cumulative returns over a given period: monthly returns from the month-end values of the index
m_returns = ret_index.resample('BM').last().pct_change()
m_returns['2012']

Date
2012-01-31    0.127111
2012-02-29    0.188311
2012-03-30    0.105284
2012-04-30   -0.025970
2012-05-31   -0.010702
2012-06-29    0.010853
2012-07-31    0.045822
2012-08-31    0.093877
2012-09-28    0.002796
2012-10-31   -0.107600
2012-11-30   -0.012375
2012-12-31   -0.090743
Freq: BM, Name: Adj Close, dtype: float64

In [34]:
# Alternatively, aggregate by period, compounding the daily percentage changes
m_rets = (1 + returns).resample('M', kind='period').prod() - 1
m_rets['2012']

Date
2012-01    0.127111
2012-02    0.188311
2012-03    0.105284
2012-04   -0.025970
2012-05   -0.010702
2012-06    0.010853
2012-07    0.045822
2012-08    0.093877
2012-09    0.002796
2012-10   -0.107600
2012-11   -0.012375
2012-12   -0.090743
Freq: M, Name: Adj Close, dtype: float64
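
The two monthly calculations agree because compounding daily returns within a month telescopes to the ratio of consecutive month-end prices. The sketch below checks this on synthetic prices; it groups by calendar month with `to_period('M')` instead of `resample`, an assumption made here to keep the example independent of frequency-alias differences between pandas versions.

```python
import numpy as np
import pandas as pd

# Hypothetical daily prices over two months of business days
rng = np.random.RandomState(0)
days = pd.date_range('2012-01-02', '2012-02-29', freq='B')
price = pd.Series(100 * np.exp(np.cumsum(rng.normal(0, 0.01, len(days)))),
                  index=days)
returns = price.pct_change()

# One calendar-month label per observation
month = returns.index.to_period('M')

# Compound the daily returns within each month
compounded = (1 + returns).groupby(month).prod() - 1

# Same figure from month-end prices: last price of the month over the
# last price of the previous month
month_end = price.groupby(month).last()
from_prices = month_end.pct_change()

print(compounded)
print(from_prices)
```

The February values match. The first month differs: `from_prices` has no prior month-end to compare against and is NaN, while `compounded` measures growth from the first available price.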

## Group Transforms and Analysis

In [35]:
import random; random.seed(0)
import string

# Generate a random ticker made of n uppercase letters
def rands(n):
    choices = string.ascii_uppercase #'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
    return ''.join([random.choice(choices) for _ in range(n)])

N = 1000
tickers = np.array([rands(5) for _ in range(N)])

In [36]:
# Build a portfolio from a subset of the tickers
M = 500
df = DataFrame({'Momentum': np.random.randn(M) / 200 + 0.03,
                'Value': np.random.randn(M) / 200 + 0.08,
                'ShortInterest': np.random.randn(M) / 200 - 0.02},
               index=tickers[:M])

In [37]:
# Randomly assign each ticker to one of two industries
ind_names = np.array(['FINANCIAL', 'TECH'])
sampler = np.random.randint(0, len(ind_names), N)
industries = Series(ind_names[sampler], index=tickers, name='industry')

In [38]:
# Group by industry, then apply group aggregations and transformations
by_industry = df.groupby(industries)
by_industry.describe()

| industry | | Momentum | ShortInterest | Value |
|---|---|---|---|---|
| FINANCIAL | count | 255.0 | 255.0 | 255.0 |
| FINANCIAL | mean | 0.029274 | -0.020109 | 0.080104 |
| FINANCIAL | std | 0.005167 | 0.005215 | 0.00502 |
| FINANCIAL | min | 0.012372 | -0.033391 | 0.064412 |
| FINANCIAL | 25% | 0.025711 | -0.023445 | 0.076888 |
| FINANCIAL | 50% | 0.028982 | -0.01996 | 0.080431 |
| FINANCIAL | 75% | 0.032976 | -0.016732 | 0.083516 |
| FINANCIAL | max | 0.04462 | -0.005727 | 0.092515 |
| TECH | count | 245.0 | 245.0 | 245.0 |
| TECH | mean | 0.02962 | -0.020893 | 0.080245 |


In [39]:
def zscore(group):
    return (group - group.mean()) / group.std()

df_stand = by_industry.apply(zscore)

In [40]:
# After standardizing, each industry has mean 0 and standard deviation 1
df_stand.groupby(industries).agg(['mean', 'std'])

| industry | Momentum mean | Momentum std | ShortInterest mean | ShortInterest std | Value mean | Value std |
|---|---|---|---|---|---|---|
| FINANCIAL | -5.356064e-15 | 1.0 | 2.410708e-15 | 1.0 | 1.238552e-14 | 1.0 |
| TECH | 1.284233e-15 | 1.0 | 3.554526e-15 | 1.0 | -6.688527e-15 | 1.0 |


In [41]:
# Rank within each industry, in descending order
ind_rank = by_industry.rank(ascending=False)
ind_rank.groupby(industries).agg(['min', 'max'])

| industry | Momentum min | Momentum max | ShortInterest min | ShortInterest max | Value min | Value max |
|---|---|---|---|---|---|---|
| FINANCIAL | 1.0 | 255.0 | 1.0 | 255.0 | 1.0 | 255.0 |
| TECH | 1.0 | 245.0 | 1.0 | 245.0 | 1.0 | 245.0 |


In [42]:
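# In quantitative equity portfolio analysis, "rank and standardize" is a common sequence of transforms
# Chaining rank and zscore performs the whole transformation in one step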
by_industry.apply(lambda x: zscore(x.rank()))

| | Momentum | ShortInterest | Value |
|---|---|---|---|
| VTKGN | 1.342257 | 0.813489 | 0.460977 |
| KUHMP | -1.626978 | -0.474535 | -0.067791 |
| XNHTQ | -0.225767 | 1.213496 | 0.169325 |
| GXZVX | 1.171165 | 1.650919 | -0.719632 |
| ISXRM | -0.189814 | 0.352512 | 0.840606 |
| CLPXZ | -0.437423 | 0.818405 | 0.423313 |
| MWGUO | 0.244047 | -1.260908 | 1.111769 |
| ASKVR | 0.366070 | -1.003303 | -1.518513 |
| AMWGI | -0.935513 | -1.410048 | -1.342257 |
| WEOGZ | 1.450722 | 1.111769 | -1.016862 |
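
The rank-then-standardize step can be verified by hand on a small example. The sketch below uses six hypothetical stocks and `groupby(...).transform` in place of `apply`; `transform` is swapped in here because it returns a result indexed like the input, whereas the index layout produced by `apply` has varied across pandas versions.

```python
import pandas as pd

def zscore(group):
    return (group - group.mean()) / group.std()

# Hypothetical factor values for six stocks in two industries
df = pd.DataFrame({'Momentum': [0.03, 0.01, 0.02, 0.05, 0.04, 0.06]},
                  index=list('ABCDEF'))
industries = pd.Series(['FIN', 'FIN', 'FIN', 'TECH', 'TECH', 'TECH'],
                       index=df.index, name='industry')

# Rank within each industry, then standardize the ranks
ranked_z = df.groupby(industries).transform(lambda x: zscore(x.rank()))
print(ranked_z)
```

Within each three-stock group the ranks 1, 2, 3 have mean 2 and sample standard deviation 1, so the scores are -1, 0 and 1. The result depends only on the ordering within the group, which makes it insensitive to outliers in the raw factor values.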
