Reading notes on [Python for Data Analysis](https://book.douban.com/subject/25779298/).
 
[Chapter 11](/2017/07/24/python_data_analysis11.html), Section 2: Group Transforms and Analysis

All of the data used here can be downloaded from [the author's GitHub](https://github.com/wesm/pydata-book).


In [1]:
%pylab inline
import pandas as pd
from pandas import Series, DataFrame

Populating the interactive namespace from numpy and matplotlib


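Chapter 9 covered the basics of group statistics, along with how to apply custom transformation functions to the groups of a dataset.

The example below uses a hypothetical portfolio.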

In [6]:
pd.options.display.max_rows = 100
pd.options.display.max_columns = 10
np.random.seed(12345)

import pytz
import random; random.seed(0)
import string

# First generate 1000 stock tickers

N = 1000
def rands(n):
    choices = string.ascii_uppercase
    return ''.join([random.choice(choices) for _ in range(n)])
tickers = np.array([rands(5) for _ in range(N)])

In [7]:
# Create a 3-column DataFrame to hold the hypothetical data, using only a subset of the tickers for the portfolio
M = 500
df = DataFrame({'Momentum' : np.random.randn(M) / 200 + 0.03,
                'Value' : np.random.randn(M) / 200 + 0.08,
                'ShortInterest' : np.random.randn(M) / 200 - 0.02},
                index=tickers[:M])

In [8]:
# Randomly assign an industry classification to each ticker
ind_names = np.array(['FINANCIAL', 'TECH'])
sampler = np.random.randint(0, len(ind_names), N)
industries = Series(ind_names[sampler], index=tickers,
                    name='industry')

In [9]:
# Group by industry, then run group aggregations and transforms
by_industry = df.groupby(industries)
by_industry.mean()

| industry | Momentum | ShortInterest | Value |
|---|---|---|---|
| FINANCIAL | 0.029485 | -0.020739 | 0.079929 |
| TECH | 0.030407 | -0.019609 | 0.080113 |


In [10]:
by_industry.describe()

| industry | statistic | Momentum | ShortInterest | Value |
|---|---|---|---|---|
| FINANCIAL | count | 246.0 | 246.0 | 246.0 |
| FINANCIAL | mean | 0.029485 | -0.020739 | 0.079929 |
| FINANCIAL | std | 0.004802 | 0.004986 | 0.004548 |
| FINANCIAL | min | 0.01721 | -0.036997 | 0.067025 |
| FINANCIAL | 25% | 0.026263 | -0.024138 | 0.076638 |
| FINANCIAL | 50% | 0.029261 | -0.020833 | 0.079804 |
| FINANCIAL | 75% | 0.032806 | -0.017345 | 0.082718 |
| FINANCIAL | max | 0.045884 | -0.006322 | 0.093334 |
| TECH | count | 254.0 | 254.0 | 254.0 |
| TECH | mean | 0.030407 | -0.019609 | 0.080113 |


In [13]:
# Custom transform: standardize within each industry (mean 0, standard deviation 1)
def zscore(group):
    return (group - group.mean()) / group.std()

df_stand = by_industry.apply(zscore)

In [12]:
df_stand.groupby(industries).agg(['mean', 'std'])

| industry | Momentum mean | Momentum std | ShortInterest mean | ShortInterest std | Value mean | Value std |
|---|---|---|---|---|---|---|
| FINANCIAL | 1.114736e-15 | 1.0 | 3.081772e-15 | 1.0 | 8.001278e-15 | 1.0 |
| TECH | -2.779929e-16 | 1.0 | -1.910982e-15 | 1.0 | -7.139521e-15 | 1.0 |
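
The same within-group standardization can also be written with `groupby(...).transform`, which returns a result aligned to the original index. The following is a minimal sketch on a small made-up frame; the names `toy`, `sector`, and `standardized` are illustrative and do not come from the book.

```python
import numpy as np
import pandas as pd

def zscore(group):
    # Center on the group mean and scale by the group standard deviation
    return (group - group.mean()) / group.std()

# Hypothetical toy data: six tickers in two sectors
toy = pd.DataFrame({'Momentum': [0.01, 0.02, 0.03, 0.10, 0.20, 0.30],
                    'Value': [1.0, 2.0, 4.0, 3.0, 5.0, 9.0]},
                   index=list('ABCDEF'))
sector = pd.Series(['FIN', 'FIN', 'FIN', 'TECH', 'TECH', 'TECH'],
                   index=toy.index, name='sector')

# transform applies zscore within each group and keeps the original row order
standardized = toy.groupby(sector).transform(zscore)

# Within each sector, every column should now have mean ~0 and std 1
print(standardized.groupby(sector).agg(['mean', 'std']))
```

Because the result keeps the original index, it can be assigned straight back as new columns of the source frame.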


In [14]:
# Built-in transforms such as rank are more concise to use
ind_rank = by_industry.rank(ascending=False)
ind_rank.groupby(industries).agg(['min', 'max'])

| industry | Momentum min | Momentum max | ShortInterest min | ShortInterest max | Value min | Value max |
|---|---|---|---|---|---|---|
| FINANCIAL | 1.0 | 246.0 | 1.0 | 246.0 | 1.0 | 246.0 |
| TECH | 1.0 | 254.0 | 1.0 | 254.0 | 1.0 | 254.0 |


In [16]:
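# In quantitative equity portfolio analysis, ranking followed by standardizing is a common pair of transforms.
# Chaining rank and zscore does the whole thing in one step:
# rank within each industry, then standardize the ranks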
by_industry.apply(lambda x: zscore(x.rank())).head()

| | Momentum | ShortInterest | Value |
|---|---|---|---|
| MYNBI | -0.091346 | -0.976696 | -1.004802 |
| QPMZJ | 0.794005 | 1.299919 | -0.358356 |
| PLSGQ | -0.541047 | -0.836164 | -1.679355 |
| EJEYD | -0.583207 | -1.623142 | 0.990749 |
| TZIRW | 1.57212 | -0.265423 | 0.374314 |


## Group Factor Exposures

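Factor analysis is a technique used in quantitative portfolio management.

A portfolio's holdings and performance (profit and loss) can be decomposed into one or more factors, expressed as portfolio weights (risk factors are one example).

For example, a stock's co-movement with a benchmark (such as the S&P 500 index) is known as its beta, a common risk factor.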
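
As a concrete illustration, beta is commonly estimated as the covariance of a stock's returns with the benchmark's returns divided by the variance of the benchmark's returns. Below is a minimal sketch on simulated data; the series `bench` and `stock` and the true beta of 1.5 are assumptions made up for this example.

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(0)

# Simulated daily returns: the stock moves 1.5x the benchmark plus idiosyncratic noise
bench = pd.Series(rng.randn(5000) * 0.01)
stock = 1.5 * bench + pd.Series(rng.randn(5000) * 0.005)

# beta = Cov(stock, benchmark) / Var(benchmark)
beta = stock.cov(bench) / bench.var()
print(round(beta, 2))
```

The estimate should land close to 1.5; this ratio is also the slope of a one-variable least-squares regression of the stock's returns on the benchmark's returns.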

The following walks through a contrived portfolio built from three randomly generated factors (usually called factor loadings) and some weights.

In [17]:
from numpy.random import rand
fac1, fac2, fac3 = np.random.rand(3, 1000)

ticker_subset = tickers.take(np.random.permutation(N)[:1000])

# Weighted sum of the factors, plus noise
port = Series(0.7 * fac1 - 1.2 * fac2 + 0.3 * fac3 + rand(1000),
              index=ticker_subset)
factors = DataFrame({'f1': fac1, 'f2': fac2, 'f3': fac3},
                    index=ticker_subset)

In [18]:
# Vector correlations between each factor and the portfolio may not tell us much
factors.corrwith(port)

f1    0.402377
f2   -0.680980
f3    0.168083
dtype: float64

In [19]:
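# The standard way to compute factor exposures is least-squares regression, done here with pandas.ols
# (pd.ols exists only in older pandas releases; it was later deprecated and removed)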
pd.ols(y=port, x=factors).beta

f1           0.761789
f2          -1.208760
f3           0.289865
intercept    0.484477
dtype: float64
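
Since `pd.ols` is not available in current pandas, one way to obtain the same kind of estimate is an ordinary least-squares fit with `numpy.linalg.lstsq`. The sketch below builds a portfolio like the one above from its own random draws, so its numbers will differ from the output shown; the helper name `ols_beta` and the variables `fac` and `port_demo` are hypothetical.

```python
import numpy as np
import pandas as pd

def ols_beta(y, x):
    """Least-squares coefficients of y on the columns of x plus an intercept."""
    x = x.loc[y.index]                                   # align regressors to the response
    design = np.column_stack([x.values, np.ones(len(x))])
    coefs = np.linalg.lstsq(design, y.values, rcond=None)[0]
    return pd.Series(coefs, index=list(x.columns) + ['intercept'])

rng = np.random.RandomState(12345)
fac = pd.DataFrame(rng.rand(1000, 3), columns=['f1', 'f2', 'f3'])
# Same weights as in the text, plus uniform noise on [0, 1)
port_demo = 0.7 * fac['f1'] - 1.2 * fac['f2'] + 0.3 * fac['f3'] + rng.rand(1000)

print(ols_beta(port_demo, fac))
```

The fitted slopes should sit near 0.7, -1.2, and 0.3, and the intercept near 0.5, which is the mean of the uniform noise term.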

In [23]:
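# Because not much random noise was added to the portfolio, the original factor weights are largely recovered.
# groupby can also be used to compute the exposures industry by industry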
def beta_exposure(chunk, factors=None):
    return pd.ols(y=chunk, x=factors).beta

In [22]:
# Group by industry and apply the function
by_ind = port.groupby(industries)
exposures = by_ind.apply(beta_exposure, factors=factors)
exposures.unstack()

| industry | f1 | f2 | f3 | intercept |
|---|---|---|---|---|
| FINANCIAL | 0.790329 | -1.18297 | 0.275624 | 0.455569 |
| TECH | 0.740857 | -1.232882 | 0.303811 | 0.508188 |
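
On a pandas version without `pd.ols`, the per-industry exposures can be approximated by running one least-squares fit per group. This is only a sketch under assumptions: `beta_exposure_lstsq` is a hypothetical stand-in for `pd.ols(...).beta`, and the tickers, factors, and industries are freshly simulated, so the values will not match the table above.

```python
import numpy as np
import pandas as pd

def beta_exposure_lstsq(chunk, factors):
    # Regress one group's portfolio values on its own factor rows plus a constant
    x = factors.loc[chunk.index]
    design = np.column_stack([x.values, np.ones(len(x))])
    coefs = np.linalg.lstsq(design, chunk.values, rcond=None)[0]
    return pd.Series(coefs, index=list(x.columns) + ['intercept'])

rng = np.random.RandomState(0)
names = ['T%03d' % i for i in range(1000)]               # made-up, unique tickers
fac = pd.DataFrame(rng.rand(1000, 3), columns=['f1', 'f2', 'f3'], index=names)
port_demo = 0.7 * fac['f1'] - 1.2 * fac['f2'] + 0.3 * fac['f3'] + rng.rand(1000)
ind = pd.Series(rng.choice(['FINANCIAL', 'TECH'], size=1000),
                index=names, name='industry')

# One regression per industry; each group's coefficients become one row
exposures_demo = pd.DataFrame({name: beta_exposure_lstsq(chunk, fac)
                               for name, chunk in port_demo.groupby(ind)}).T
print(exposures_demo)
```

Iterating over the groups and collecting the results in a dict avoids relying on how `apply` combines Series-valued results, which has varied between pandas versions.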


## Decile and Quartile Analysis

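Analysis based on sample quantiles is another important tool for financial analysts. For example, the performance of a stock portfolio can be broken down into quartiles based on each stock's price-to-earnings ratio.

With pandas.qcut and groupby, quantile analysis is straightforward.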
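
To make the price-to-earnings example concrete, here is a minimal sketch on made-up data: `pe` and `ret` are simulated values, not real market data, and the labels `Q1` to `Q4` are chosen only for readability.

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(0)

# Made-up data: 1000 stocks, each with a P/E ratio and a return
pe = pd.Series(rng.uniform(5, 40, size=1000), name='pe')
ret = pd.Series(rng.randn(1000) * 0.02, name='ret')

# qcut places bin edges at the sample quantiles, so the buckets are equally populated
quartile = pd.qcut(pe, 4, labels=['Q1', 'Q2', 'Q3', 'Q4'])

# Summarize returns within each P/E quartile
summary = ret.groupby(quartile, observed=True).agg(['count', 'mean', 'std'])
print(summary)
```

Unlike `pd.cut`, which splits the value range into equal-width intervals, `pd.qcut` splits by rank, so each quartile here holds 250 stocks.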

In [24]:
# pandas.io.data has moved to the separate pandas-datareader package
import pandas_datareader.data as web
data = web.get_data_yahoo('SPY', '2006-01-01')
data.info()


In [25]:
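# Compute daily returns and write a function that turns returns into a trend signal

px = data['Adj Close']
returns = px.pct_change()

def to_index(rets):
    index = (1 + rets).cumprod()
    # Set the entry just before the first non-null value to 1 so the index starts at 1
    first_loc = max(index.notnull().values.argmax() - 1, 0)
    index.iloc[first_loc] = 1
    return index

def trend_signal(rets, lookback, lag):
    # Rolling sum of returns over the lookback window, delayed by lag periods
    signal = rets.rolling(lookback, min_periods=lookback - 5).sum()
    return signal.shift(lag)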


In [26]:
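# With this function we can (naively) create and test a strategy that trades on the momentum signal every Friday
signal = trend_signal(returns, 100, 3)
# Weekly (Friday-anchored) mean of the signal, forward-filled back to business days
trade_friday = signal.resample('W-FRI').mean().resample('B').ffill()
trade_rets = trade_friday.shift(1) * returns
trade_rets = trade_rets[:len(returns)]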


In [27]:
# Convert the strategy's returns into a return index and plot it
to_index(trade_rets).plot()


In [28]:
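# Suppose we want to break the strategy's performance down by periods of higher and lower volatility.
# A trailing one-year annualized standard deviation is a simple measure of volatility, and the Sharpe ratio
# shows the risk-adjusted return under different volatility regimes:

vol = returns.rolling(250, min_periods=200).std() * np.sqrt(250)

def sharpe(rets, ann=250):
    return rets.mean() / rets.std() * np.sqrt(ann)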

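
As a small worked example of the annualization in `sharpe`, the sketch below applies the same formula to a made-up series of daily returns. It assumes a risk-free rate of zero, as the function above implicitly does, and the `daily` series is invented for illustration.

```python
import numpy as np
import pandas as pd

def sharpe(rets, ann=250):
    # Mean over standard deviation of per-period returns, scaled to an annual figure
    return rets.mean() / rets.std() * np.sqrt(ann)

# Made-up daily returns alternating between +1.5% and -0.5% over 250 trading days
daily = pd.Series([0.015, -0.005] * 125)

print(sharpe(daily))           # annualized with 250 periods per year
print(sharpe(daily, ann=1))    # unscaled per-period ratio
```

The mean return scales with the number of periods and the standard deviation with its square root, which is why the per-period ratio is multiplied by `np.sqrt(ann)`.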

In [29]:
# Now use qcut to split vol into four equal-sized buckets and aggregate with sharpe
cats = pd.qcut(vol, 4)
print('cats: %d, trade_rets: %d, vol: %d' % (len(cats), len(trade_rets), len(vol)))


In [30]:
trade_rets.groupby(cats).agg(sharpe)
