# Financial and Economic Data Applications

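- A cross-section is the data at a fixed point in time; for example, the closing prices of a group of stocks at a particular time form a cross-section
- Cross-sectional data at multiple points in time over multiple data items (such as prices and volume) form a panel
- Panel data can be represented as a hierarchically indexed DataFrame or as a three-dimensional pandas Panel object (Panel was deprecated in pandas 0.20 and removed in 0.25)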
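
As a minimal sketch of the hierarchical-index representation mentioned above, the following builds a small panel with made-up prices and volumes. The names `panel`, `cross_section`, and `price_wide` are illustrative assumptions, not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical dates and tickers, with made-up values
dates = pd.date_range('2012-06-01', periods=3)
tickers = ['AAPL', 'JNJ']
index = pd.MultiIndex.from_product([dates, tickers], names=['date', 'ticker'])

# Two data items (price, volume) observed for every (date, ticker) pair
panel = pd.DataFrame({'price': np.arange(6, dtype=float) + 10,
                      'volume': np.arange(6) * 100},
                     index=index)

# One cross-section: all tickers at a single date
cross_section = panel.xs(dates[0], level='date')

# One data item as a date-by-ticker table
price_wide = panel['price'].unstack('ticker')
```

Here `xs` selects a single date across all tickers, while `unstack` pivots one data item into the wide layout used by `prices` and `volume` in the cells that follow.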

In [1]:
%pylab inline

import numpy as np
from numpy.random import randn

import pandas as pd
from pandas import Series, DataFrame

Populating the interactive namespace from numpy and matplotlib


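## Data Munging Topics

### Time Series and Cross-Section Alignment

One of the most time-consuming issues in working with financial data is the so-called "data alignment" problem:
- Two related time series may have indexes that do not line up well
- Two DataFrame objects may have columns or rows that do not match

pandas aligns data automatically in arithmetic operations. In practice, this gives you a great deal of freedom and makes you more productive.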
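
A minimal sketch of automatic alignment, using two small made-up Series (`a` and `b` are illustrative names): labels present in only one operand produce NaN, unless a fill value is supplied through the method form of the operator.

```python
import pandas as pd

a = pd.Series([1.0, 2.0, 3.0], index=['x', 'y', 'z'])
b = pd.Series([10.0, 20.0], index=['y', 'z'])

# Operands are aligned on index labels; 'x' exists only in a, so the result is NaN there
total = a + b

# The method form accepts fill_value to treat a missing label as 0
filled = a.add(b, fill_value=0)
```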

In [2]:
prices = pd.read_csv('ch11/stock_px.csv', index_col=0)
prices = prices[['AAPL', 'JNJ', 'SPX', 'XOM']]
prices.head()

Unnamed: 0,AAPL,JNJ,SPX,XOM
1990-02-01 00:00:00,7.86,4.27,328.79,6.12
1990-02-02 00:00:00,8.0,4.37,330.92,6.24
1990-02-05 00:00:00,8.18,4.34,331.85,6.25
1990-02-06 00:00:00,8.12,4.32,329.66,6.23
1990-02-07 00:00:00,7.77,4.38,333.75,6.33


In [3]:
volume = pd.read_csv('ch11/volume.csv', index_col=0)
volume = volume[['AAPL', 'JNJ', 'XOM']]
volume.head()

Unnamed: 0,AAPL,JNJ,XOM
1990-02-01 00:00:00,4193200.0,5942400.0,2916400.0
1990-02-02 00:00:00,4248800.0,4732800.0,4250000.0
1990-02-05 00:00:00,3653200.0,3950400.0,5880800.0
1990-02-06 00:00:00,2640000.0,3761600.0,4750800.0
1990-02-07 00:00:00,11180800.0,5458400.0,4124800.0


Suppose we want to compute a volume-weighted average price over all valid data.

In [4]:
vwap = (prices*volume).sum() / volume.sum()
vwap

AAPL    81.246271
JNJ     40.576111
SPX           NaN
XOM     50.520303
dtype: float64

In [5]:
# SPX is not found in volume, so we can drop it explicitly
vwap = vwap.dropna()
vwap

AAPL    81.246271
JNJ     40.576111
XOM     50.520303
dtype: float64

In [6]:
# To align the columns by hand, use DataFrame's align method
prices_aligned = prices.align(volume, join='inner')[0]
prices_aligned.head()

Unnamed: 0,AAPL,JNJ,XOM
1990-02-01 00:00:00,7.86,4.27,6.12
1990-02-02 00:00:00,8.0,4.37,6.24
1990-02-05 00:00:00,8.18,4.34,6.25
1990-02-06 00:00:00,8.12,4.32,6.23
1990-02-07 00:00:00,7.77,4.38,6.33


In [7]:
vwap = (prices_aligned*volume).sum() / volume.sum()
vwap

AAPL    81.246271
JNJ     40.576111
XOM     50.520303
dtype: float64

#### Building a DataFrame from Series with Possibly Different Indexes

In [8]:
s1 = Series(range(3), index=list('abc'))
s2 = Series(range(4), index=list('dbce'))
s3 = Series(range(3), index=list('fac'))

DataFrame({'one': s1, 'two': s2, 'three': s3})

Unnamed: 0,one,three,two
a,0.0,1.0,
b,1.0,,1.0
c,2.0,2.0,2.0
d,,,0.0
e,,,3.0
f,,0.0,


In [9]:
# Explicitly define the index of the result (discarding the rest of the data)
DataFrame({'one': s1, 'two': s2, 'three': s3}, index=list('face'))

Unnamed: 0,one,three,two
f,,0.0,
a,0.0,1.0,
c,2.0,2.0,2.0
e,,,3.0


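### Operations with Time Series of Different Frequencies

Economic time series are often of annual, quarterly, monthly, daily, or some other more specialized frequency. Some are completely irregular; for example, earnings revisions may arrive at any time.

Scenario | Tool | Description
---|---|---
Frequency conversion | `resample` | Converts data to a fixed frequency
Realignment | `reindex` | Conforms data to a new index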

In [10]:
ts1 = Series(np.arange(3),
             index=pd.date_range('2012-6-13', periods=3, freq='W-WED'))
ts1

2012-06-13    0
2012-06-20    1
2012-06-27    2
Freq: W-WED, dtype: int64

In [11]:
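# Resample to business-day frequency, using ffill to fill the gaps
# This is common practice with lower-frequency data: every time point in the result carries the latest valid value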
ts1.resample('B').ffill()

2012-06-13    0
2012-06-14    0
2012-06-15    0
2012-06-18    0
2012-06-19    0
2012-06-20    1
2012-06-21    1
2012-06-22    1
2012-06-25    1
2012-06-26    1
2012-06-27    2
Freq: B, dtype: int64

In [12]:
# Irregularly sampled timestamps
dates = pd.DatetimeIndex(['2012-6-12', '2012-6-17', '2012-6-18',
                          '2012-6-21', '2012-6-22', '2012-6-29'])
ts2 = Series(np.arange(6), index=dates)
ts2

2012-06-12    0
2012-06-17    1
2012-06-18    2
2012-06-21    3
2012-06-22    4
2012-06-29    5
dtype: int64

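To add the "as of" (most recent) values of ts1 to ts2, there are two options:
- Resample both to a regular frequency and then add them
- If you want to keep the date index of ts2, reindex is the better solution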

In [13]:
ts2 + ts1.reindex(ts2.index, method='ffill')

2012-06-12    NaN
2012-06-17    1.0
2012-06-18    2.0
2012-06-21    4.0
2012-06-22    5.0
2012-06-29    7.0
dtype: float64
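
The first option listed above, resampling both series to a regular frequency before adding, could be sketched as follows. Daily frequency (`'D'`) is an assumption made here so that the Sunday observation in `ts2` stays on the grid; the notebook itself only demonstrates the `reindex` route.

```python
import numpy as np
import pandas as pd

ts1 = pd.Series(np.arange(3),
                index=pd.date_range('2012-06-13', periods=3, freq='W-WED'))
dates = pd.DatetimeIndex(['2012-06-12', '2012-06-17', '2012-06-18',
                          '2012-06-21', '2012-06-22', '2012-06-29'])
ts2 = pd.Series(np.arange(6), index=dates)

# Bring both series to a daily grid, carrying the last valid value forward
ts1_d = ts1.resample('D').ffill()
ts2_d = ts2.resample('D').ffill()

# Addition aligns on the union of the two daily indexes
total = ts1_d + ts2_d
```

Unlike the `reindex` version, the result is indexed by every calendar day, and it is NaN wherever either resampled series has no value: `ts1_d` starts on 2012-06-13 and stops at 2012-06-27, so 2012-06-12 and 2012-06-29 are NaN here.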

#### Using Period

In [14]:
gdp = Series([1.78, 1.94, 2.08, 2.01, 2.15, 2.31, 2.46], 
             index=pd.period_range('1984Q2', periods=7, freq='Q-SEP'))
gdp

1984Q2    1.78
1984Q3    1.94
1984Q4    2.08
1985Q1    2.01
1985Q2    2.15
1985Q3    2.31
1985Q4    2.46
Freq: Q-SEP, dtype: float64

In [15]:
infl = Series([0.025, 0.045, 0.037, 0.04],
              index=pd.period_range('1982', periods=4, freq='A-DEC'))
infl

1982    0.025
1983    0.045
1984    0.037
1985    0.040
Freq: A-DEC, dtype: float64

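Unlike Timestamp-indexed time series, operations between two Period-indexed time series of different frequencies require an explicit frequency conversion first, followed by reindex to realign the data.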

In [16]:
infl_q = infl.asfreq('Q-SEP', how='end')
infl_q

1983Q1    0.025
1984Q1    0.045
1985Q1    0.037
1986Q1    0.040
Freq: Q-SEP, dtype: float64

In [17]:
infl_q.reindex(gdp.index, method='ffill')

1984Q2    0.045
1984Q3    0.045
1984Q4    0.045
1985Q1    0.037
1985Q2    0.037
1985Q3    0.037
1985Q4    0.037
Freq: Q-SEP, dtype: float64

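### Time of Day and "As Of" Data Selection

Suppose you have a long intraday market time series and want to extract the prices at a particular time of day on each day. What if the data are irregular, so that observations do not fall exactly on the desired time?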

In [18]:
from datetime import time

In [19]:
rng = pd.date_range('2012-06-01 09:30', '2012-06-01 15:59', freq='T')
rng = rng.append([rng + pd.offsets.BDay(i) for i in range(1,4)])

ts = Series(np.arange(len(rng), dtype=float), index=rng)
ts.groupby(lambda t: t.day).head(3)

2012-06-01 09:30:00       0.0
2012-06-01 09:31:00       1.0
2012-06-01 09:32:00       2.0
2012-06-04 09:30:00     390.0
2012-06-04 09:31:00     391.0
2012-06-04 09:32:00     392.0
2012-06-05 09:30:00     780.0
2012-06-05 09:31:00     781.0
2012-06-05 09:32:00     782.0
2012-06-06 09:30:00    1170.0
2012-06-06 09:31:00    1171.0
2012-06-06 09:32:00    1172.0
dtype: float64

In [20]:
# Index with a datetime.time object to extract the values at that time of day
ts[time(10, 0)]

2012-06-01 10:00:00      30.0
2012-06-04 10:00:00     420.0
2012-06-05 10:00:00     810.0
2012-06-06 10:00:00    1200.0
dtype: float64

In [21]:
# Select the values at a time of day; under the hood this uses at_time
ts.at_time(time(10, 0))

2012-06-01 10:00:00      30.0
2012-06-04 10:00:00     420.0
2012-06-05 10:00:00     810.0
2012-06-06 10:00:00    1200.0
dtype: float64

In [22]:
# Select the values between two time objects
ts.between_time(time(10, 0), time(10, 1))

2012-06-01 10:00:00      30.0
2012-06-01 10:01:00      31.0
2012-06-04 10:00:00     420.0
2012-06-04 10:01:00     421.0
2012-06-05 10:00:00     810.0
2012-06-05 10:01:00     811.0
2012-06-06 10:00:00    1200.0
2012-06-06 10:01:00    1201.0
dtype: float64

In [23]:
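# Set 4/5 of the data to NaN
# Integer division keeps the slice bound an int; iloc makes the positional assignment explicit
indexer = np.random.permutation(len(ts))[len(ts) // 5:]

irr_ts = ts.copy()
irr_ts.iloc[indexer] = np.nan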

In [24]:
selection = pd.date_range('2012-06-01 10:00', periods=4, freq='B')

# Passing an array of timestamps to asof returns the last valid value at or before each timestamp
irr_ts.asof(selection)

2012-06-01 10:00:00      30.0
2012-06-04 10:00:00     420.0
2012-06-05 10:00:00     810.0
2012-06-06 10:00:00    1195.0
Freq: B, dtype: float64
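
To make the behavior of `asof` easier to see, here is a small sketch on a hand-built series with gaps. The values and the name `s` are made up for illustration.

```python
import numpy as np
import pandas as pd

idx = pd.date_range('2012-06-01 09:30', periods=6, freq='min')
s = pd.Series([1.0, np.nan, 3.0, np.nan, np.nan, 6.0], index=idx)

# Last valid value at or before 09:34 is the 3.0 observed at 09:32
at_0934 = s.asof(pd.Timestamp('2012-06-01 09:34'))

# Before the first observation there is nothing to carry forward, so the result is NaN
before_start = s.asof(pd.Timestamp('2012-06-01 09:29'))

# Several timestamps at once give a Series indexed by those timestamps
several = s.asof(pd.DatetimeIndex(['2012-06-01 09:31', '2012-06-01 09:35']))
```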

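### Splicing Together Data Sources

In finance and economics, a few situations come up often:
- Switching from one data source to another at a specific point in time
- "Patching" missing values in one time series using another time series
- Replacing symbols in the data (countries, asset tickers) with actual data

In [25]:
# Switch from one data source to another at a specific point in time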

data1 = DataFrame(np.ones((6,3), dtype=float),
                  columns=['a','b','c'],
                  index=pd.date_range('6/12/2012', periods=6))

data2 = DataFrame(np.ones((6,3), dtype=float) * 2,
                  columns=['a','b','c'],
                  index=pd.date_range('6/13/2012', periods=6))

spliced = pd.concat([data1.loc[:'2012-6-14'], data2.loc['2012-6-15':]])
spliced

Unnamed: 0,a,b,c
2012-06-12,1.0,1.0,1.0
2012-06-13,1.0,1.0,1.0
2012-06-14,1.0,1.0,1.0
2012-06-15,2.0,2.0,2.0
2012-06-16,2.0,2.0,2.0
2012-06-17,2.0,2.0,2.0
2012-06-18,2.0,2.0,2.0


In [26]:
# Suppose data1 is missing a time series that is present in data2
data2 = DataFrame(np.ones((6,4), dtype=float) * 2,
                  columns=['a','b','c','d'],
                  index=pd.date_range('6/13/2012', periods=6))

spliced = pd.concat([data1.loc[:'2012-6-14'], data2.loc['2012-6-15':]])
spliced

Unnamed: 0,a,b,c,d
2012-06-12,1.0,1.0,1.0,
2012-06-13,1.0,1.0,1.0,
2012-06-14,1.0,1.0,1.0,
2012-06-15,2.0,2.0,2.0,2.0
2012-06-16,2.0,2.0,2.0,2.0
2012-06-17,2.0,2.0,2.0,2.0
2012-06-18,2.0,2.0,2.0,2.0


In [27]:
# combine_first brings in data from before the splice point, extending the history of column 'd'
spliced.combine_first(data2)

Unnamed: 0,a,b,c,d
2012-06-12,1.0,1.0,1.0,
2012-06-13,1.0,1.0,1.0,2.0
2012-06-14,1.0,1.0,1.0,2.0
2012-06-15,2.0,2.0,2.0,2.0
2012-06-16,2.0,2.0,2.0,2.0
2012-06-17,2.0,2.0,2.0,2.0
2012-06-18,2.0,2.0,2.0,2.0


In [28]:
# DataFrame.update with overwrite=False modifies in place and fills only the holes
spliced.update(data2, overwrite=False)
spliced

Unnamed: 0,a,b,c,d
2012-06-12,1.0,1.0,1.0,
2012-06-13,1.0,1.0,1.0,2.0
2012-06-14,1.0,1.0,1.0,2.0
2012-06-15,2.0,2.0,2.0,2.0
2012-06-16,2.0,2.0,2.0,2.0
2012-06-17,2.0,2.0,2.0,2.0
2012-06-18,2.0,2.0,2.0,2.0


In [29]:
cp_spliced = spliced.copy()

# Set columns directly using DataFrame indexing
cp_spliced[['a','c']] = data1[['a','c']]
cp_spliced

Unnamed: 0,a,b,c,d
2012-06-12,1.0,1.0,1.0,
2012-06-13,1.0,1.0,1.0,2.0
2012-06-14,1.0,1.0,1.0,2.0
2012-06-15,1.0,2.0,1.0,2.0
2012-06-16,1.0,2.0,1.0,2.0
2012-06-17,1.0,2.0,1.0,2.0
2012-06-18,,2.0,,2.0


### Return Indexes and Cumulative Returns

In finance, the return of an asset usually refers to the percent change in its price.
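
A minimal sketch of this definition on three made-up prices: `pct_change` is the same as dividing each price by the previous one and subtracting 1, and compounding the returns with `cumprod` gives a return index that starts at 1.

```python
import pandas as pd

# Made-up prices on three consecutive business days
price = pd.Series([100.0, 102.0, 99.96],
                  index=pd.date_range('2012-01-02', periods=3, freq='B'))

# Simple return: p_t / p_{t-1} - 1
returns = price.pct_change()
manual = price / price.shift(1) - 1

# Return index: cumulative product of (1 + return), with the first value set to 1
ret_index = (1 + returns).cumprod()
ret_index.iloc[0] = 1
```

With these numbers the returns are roughly +2% and then -2%, and the return index ends near 0.9996, which equals the last price divided by the first.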

In [30]:
import pandas_datareader.data as web

price = web.get_data_yahoo('AAPL', '2011-01-01')['Adj Close']
price[-5:]

Date
2017-04-12    141.800003
2017-04-13    141.050003
2017-04-17    141.830002
2017-04-18    141.199997
2017-04-19    140.679993
Name: Adj Close, dtype: float64

In [31]:
# Percent change in price between consecutive trading days
returns = price.pct_change()

In [32]:
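# Use cumprod to compute a simple return index
ret_index = (1 + returns).cumprod()

# Set the first value to 1 (iloc makes the positional assignment explicit)
ret_index.iloc[0] = 1

# Compute cumulative returns over a given period
m_returns = ret_index.resample('BM').last().pct_change()

# Show the monthly cumulative returns for 2012
m_returns['2012']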

Date
2012-01-31    0.127111
2012-02-29    0.188311
2012-03-30    0.105284
2012-04-30   -0.025970
2012-05-31   -0.010702
2012-06-29    0.010853
2012-07-31    0.045822
2012-08-31    0.093877
2012-09-28    0.002796
2012-10-31   -0.107600
2012-11-30   -0.012375
2012-12-31   -0.090743
Freq: BM, Name: Adj Close, dtype: float64

In [33]:
# Aggregate by resampling (to periods), computed from the daily percent changes
m_rets = (1 + returns).resample('M', kind='period').prod() - 1

# Show the monthly cumulative returns for 2012
m_rets['2012']

Date
2012-01    0.127111
2012-02    0.188311
2012-03    0.105284
2012-04   -0.025970
2012-05   -0.010702
2012-06    0.010853
2012-07    0.045822
2012-08    0.093877
2012-09    0.002796
2012-10   -0.107600
2012-11   -0.012375
2012-12   -0.090743
Freq: M, Name: Adj Close, dtype: float64
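
The two monthly calculations above agree because compounding the daily returns within a month telescopes to the ratio of consecutive month-end prices. The sketch below checks this on synthetic prices. Grouping by `to_period('M')` stands in for `resample` here, an assumption made so the example does not depend on frequency aliases that differ across pandas versions.

```python
import numpy as np
import pandas as pd

# Synthetic daily prices on business days (illustrative values only)
rng = np.random.RandomState(0)
days = pd.date_range('2012-01-02', '2012-06-29', freq='B')
price = pd.Series(100 * np.exp(np.cumsum(rng.randn(len(days)) * 0.01)),
                  index=days)

returns = price.pct_change()
ret_index = (1 + returns).cumprod()
ret_index.iloc[0] = 1

month = price.index.to_period('M')

# Approach 1: last return-index value of each month, then percent change
m_returns = ret_index.groupby(month).last().pct_change()

# Approach 2: compound the daily returns within each month
m_rets = (1 + returns).groupby(month).prod() - 1

# Approach 1 has no prior month-end for the first month, so compare the rest
same = np.allclose(m_returns.iloc[1:], m_rets.iloc[1:])
```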

## Group Transforms and Analysis

In [34]:
import random; random.seed(0)
import string

# Generate a random ticker name (n uppercase letters)
def rands(n):
    choices = string.ascii_uppercase #'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
    return ''.join([random.choice(choices) for _ in range(n)])

N = 1000
tickers = np.array([rands(5) for _ in range(N)])

In [35]:
# Select a subset of the stocks (the first 500) to form a portfolio
M = 500
df = DataFrame({'Momentum': np.random.randn(M) / 200 + 0.03,
                'Value': np.random.randn(M) / 200 + 0.08,
                'ShortInterest': np.random.randn(M) / 200 - 0.02},
               index=tickers[:M])
df.head()

Unnamed: 0,Momentum,ShortInterest,Value
VTKGN,0.038443,-0.016194,0.085771
KUHMP,0.030926,-0.020958,0.078515
XNHTQ,0.028365,-0.015938,0.069433
GXZVX,0.028257,-0.020981,0.076723
ISXRM,0.025703,-0.025565,0.071946


In [36]:
# Randomly create an industry classification (two industries)
ind_names = np.array(['FINANCIAL', 'TECH'])
sampler = np.random.randint(0, len(ind_names), N)
industries = Series(ind_names[sampler], index=tickers, name='industry')
industries.head()

VTKGN         TECH
KUHMP    FINANCIAL
XNHTQ    FINANCIAL
GXZVX    FINANCIAL
ISXRM    FINANCIAL
Name: industry, dtype: object

In [37]:
# Group by industry
by_industry = df.groupby(industries)
by_industry

<pandas.core.groupby.DataFrameGroupBy object at 0x110bc0510>

In [38]:
# Aggregate by group
by_industry.describe()

Unnamed: 0_level_0,Unnamed: 1_level_0,Momentum,ShortInterest,Value
industry,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
FINANCIAL,count,241.0,241.0,241.0
FINANCIAL,mean,0.030149,-0.019553,0.079885
FINANCIAL,std,0.005081,0.004652,0.004808
FINANCIAL,min,0.01674,-0.031383,0.066571
FINANCIAL,25%,0.026926,-0.022889,0.076723
FINANCIAL,50%,0.029755,-0.019351,0.079922
FINANCIAL,75%,0.033747,-0.016296,0.083045
FINANCIAL,max,0.043324,-0.007183,0.093516
TECH,count,259.0,259.0,259.0
TECH,mean,0.030064,-0.019968,0.080329


In [39]:
# Custom transformation: standardize within each industry
def zscore(group):
    return (group - group.mean()) / group.std()

# Apply the transformation by group
df_stand = by_industry.apply(zscore)
df_stand.head()

Unnamed: 0,Momentum,ShortInterest,Value
VTKGN,1.70372,0.828881,1.055649
KUHMP,0.152854,-0.301978,-0.284915
XNHTQ,-0.351016,0.776911,-2.174061
GXZVX,-0.37231,-0.306979,-0.657719
ISXRM,-0.875025,-1.292258,-1.651327


In [40]:
# Check: after the transformation, each industry has mean 0 and standard deviation 1
df_stand.groupby(industries).agg(['mean', 'std'])

Unnamed: 0_level_0,Momentum,Momentum,ShortInterest,ShortInterest,Value,Value
Unnamed: 0_level_1,mean,std,mean,std,mean,std
industry,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2
FINANCIAL,-2.697589e-15,1.0,-3.550986e-15,1.0,-4.143067e-15,1.0
TECH,1.020205e-15,1.0,-1.660834e-15,1.0,-7.210877e-15,1.0
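
As an alternative sketch, the same within-industry standardization can be written with `groupby(...).transform`, which returns a result indexed like the input. The data below is a small made-up stand-in for `df` and `industries`, not the notebook's actual values.

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(0)
df = pd.DataFrame({'Momentum': rng.randn(8), 'Value': rng.randn(8)},
                  index=list('abcdefgh'))
industries = pd.Series(['FINANCIAL', 'TECH'] * 4, index=df.index,
                       name='industry')

def zscore(group):
    return (group - group.mean()) / group.std()

# transform applies zscore to each column within each group
# and keeps the original row index
df_stand = df.groupby(industries).transform(zscore)

# Check: per-industry mean is ~0 and standard deviation is 1
check = df_stand.groupby(industries).agg(['mean', 'std'])
```

Because `transform` preserves the original index, the result can be grouped by `industries` again without any reshaping.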


In [41]:
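# Rank within each industry, in descending order
ind_rank = by_industry.rank(ascending=False)

# Check: after ranking, the lowest rank in each industry is 1 and the highest equals the group size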
ind_rank.groupby(industries).agg(['min', 'max'])

Unnamed: 0_level_0,Momentum,Momentum,ShortInterest,ShortInterest,Value,Value
Unnamed: 0_level_1,min,max,min,max,min,max
industry,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2
FINANCIAL,1.0,241.0,1.0,241.0,1.0,241.0
TECH,1.0,259.0,1.0,259.0,1.0,259.0


In [42]:
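# In quantitative equity portfolio analysis, "rank and standardize" is a common sequence of transforms
# Chaining rank and zscore performs the whole transformation
zscore_rank = by_industry.apply(lambda x: zscore(x.rank()))

# Check: the standardized ranks have mean 0 and standard deviation 1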
zscore_rank.groupby(industries).agg(['mean', 'std'])

Unnamed: 0_level_0,Momentum,Momentum,ShortInterest,ShortInterest,Value,Value
Unnamed: 0_level_1,mean,std,mean,std,mean,std
industry,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2
FINANCIAL,2.073031e-18,1.0,-6.449428e-18,1.0,-8.522459e-18,1.0
TECH,8.573151e-18,1.0,-6.215534e-18,1.0,-8.358822e-18,1.0


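### Group Factor Exposures

Factor analysis is a technique in quantitative portfolio management. Portfolio holdings and performance (profit and loss) can be decomposed using one or more factors (risk factors are one example), each represented as a portfolio of weights.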
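
As a rough sketch of what such a decomposition can look like, the example below builds three random factors and a made-up portfolio with known weights, then recovers the exposures by least squares. The factor names, the weights, and the use of `np.linalg.lstsq` are all assumptions for illustration; the notebook does not specify a method.

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(0)
n = 1000

# Three hypothetical factors (random, for illustration only)
factors = pd.DataFrame(rng.randn(n, 3), columns=['f1', 'f2', 'f3'])

# A made-up portfolio whose returns load on the factors with known weights
true_weights = np.array([0.7, -1.2, 0.3])
port = pd.Series(factors.values.dot(true_weights) + 0.01 * rng.randn(n))

# Pairwise correlation of each factor with the portfolio
corr = factors.corrwith(port)

# Least-squares regression with an intercept column to estimate the exposures
X = np.column_stack([factors.values, np.ones(n)])
coef = np.linalg.lstsq(X, port.values, rcond=None)[0]
exposures = pd.Series(coef, index=['f1', 'f2', 'f3', 'intercept'])
```

Since little noise was added, the estimated exposures should land close to the weights used to build the portfolio, with an intercept near zero.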

This is where my head started spinning...