# 每一个股票的收益与因子值有什么关系？这个因子的“有效性”我们如何定义？

# 还记得我们的第一集的股价预测吗？
<!-- 我们用了很多的指标，但是我们并没有去验证这些指标的有效性，那么这一集我们就来验证一下这些指标的有效性。 -->





---
### 因子实战 第五集 
# 打造你的对冲基金：用Python实现高级对冲策略（3）
## 探秘因子统计学的解释性

### 🎬 @大导演哈罗德
### 🏛 香港中文大学 金融工程 本科
### 📈 即将前往美国金融工程硕士之路（已经获得录取）
### 🌐 [关注我的Bilibili，看所有人都能听得懂有收获的量化学习内容！](https://space.bilibili.com/629573485)

🌟🌟🌟 一起揭开量化的神秘面纱！#哈罗德的亮化频道🌟

---

读取数据

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
from datetime import datetime

In [2]:
df_mkt = pd.read_excel('Monthly_Market_Value_and_Return.xlsx', index_col=0) 
df_mkt.reset_index(inplace=True)
df_mkt['stock'] = df_mkt['stock'].map(lambda x: str(x).zfill(6))
unique_dates = df_mkt['date'].unique()
unique_dates.sort()
unique_dates_nochange = unique_dates.copy()
unique_dates = [datetime.strptime(date, "%Y-%m") for date in unique_dates]
portfolio_returns = {f'portfolio_{i}': [] for i in range(1, 6)}

df = df_mkt

for date in unique_dates_nochange:
    df_date = df[df['date'] == date]
    if df_date.empty:
        continue
    
    thresholds = np.percentile(df_date['value_in_thousand'], [20, 40, 60, 80, 100])
    for i in range(5):
        if i == 0:
            portfolio = df_date[df_date['value_in_thousand'] <= thresholds[i]]
        else:
            portfolio = df_date[(df_date['value_in_thousand'] > thresholds[i-1]) & (df_date['value_in_thousand'] <= thresholds[i])]
        if portfolio.empty:
            portfolio_return = 0 
        else:
            if not np.isnan(portfolio['next_month_return'].mean()):
                portfolio_return = portfolio['next_month_return'].mean()

        portfolio_returns[f'portfolio_{i+1}'].append(portfolio_return)

hedge_returns = []
for i in range(len(portfolio_returns['portfolio_1'])):
    hedge_return = portfolio_returns['portfolio_1'][i] - portfolio_returns['portfolio_5'][i]
    hedge_returns.append(hedge_return)

average_returns = {portfolio: np.mean(returns) for portfolio, returns in portfolio_returns.items()}
average_hedge_return = np.mean(hedge_returns)

# 打印结果
print("Average Returns by Portfolio:")
for portfolio, average_return in average_returns.items():
    print(f"{portfolio}: {average_return}")

print(f"Average Hedge Return: {average_hedge_return}")

Average Returns by Portfolio:
portfolio_1: 0.03632338665770842
portfolio_2: 0.0209551217564462
portfolio_3: 0.01726534987882018
portfolio_4: 0.014766130366628655
portfolio_5: 0.012561872521888598
Average Hedge Return: 0.023761514135819815


SMB 的因子值

In [3]:
len(hedge_returns)

196

In [4]:
hedge_returns

[-0.08470671268656715,
 0.03887293584905659,
 -0.06181304135338346,
 -0.0631031515151515,
 0.44565036015325665,
 0.055511378906250006,
 0.076745,
 -0.004596515267175572,
 0.046115068702290075,
 -0.03239070483133219,
 -0.14477963295880153,
 -0.0898985888888889,
 0.06386219413919417,
 0.17187434731115692,
 0.1655050845588235,
 0.022197353790613716,
 0.06864116725978646,
 -0.1442391094890511,
 0.10581406859205778,
 0.02646128776978418,
 -0.03362237234042553,
 -0.08557337234042553,
 0.15943493664975644,
 0.03087467368421054,
 0.04835371180555555,
 0.08164946917808219,
 0.049691647260273974,
 -0.1156161843003413,
 0.06389523469387756,
 -0.007008798657718113,
 0.09571214473684211,
 -0.004692326303989192,
 -0.07564928246753248,
 0.033560177419354825,
 0.07521826683369873,
 0.1149865145631068,
 0.020068660256410242,
 0.0686622134347432,
 0.06707708626198083,
 0.06279507667731629,
 0.04946440392950897,
 0.003032951768488723,
 -0.05047491693290737,
 0.0796321214057508,
 0.0016474190476190517,
 0

---

# 因子...?

你哈罗德不是说，可以根据一个因子确定一个股票的涨跌吗？

---

哈罗德双因子

In [5]:
SMB_monthly = pd.DataFrame(hedge_returns, unique_dates_nochange,columns=['SMB_monthly'])

pd.DataFrame(hedge_returns, unique_dates_nochange,columns=['SMB_monthly'])

Unnamed: 0,SMB_monthly
2005-12,-0.084707
2006-01,0.038873
2006-02,-0.061813
2006-03,-0.063103
2006-04,0.445650
...,...
2021-11,0.064914
2021-12,0.050056
2022-01,0.023067
2022-02,0.080904


In [6]:
df_mkt

Unnamed: 0,stock,date,value_in_thousand,monthly_return,next_month_return
0,000001,2005-12,11947347.99,0.051370,0.034202
1,000001,2006-01,12355970.65,0.034202,0.077165
2,000001,2006-02,13309423.50,0.077165,-0.068713
3,000001,2006-03,12394887.09,-0.068713,0.237049
4,000001,2006-04,15333078.53,0.237049,0.114213
...,...,...,...,...,...
419256,605599,2021-11,9636666.94,0.061697,0.086360
419257,605599,2021-12,10468889.19,0.086360,-0.116642
419258,605599,2022-01,9247778.04,-0.116642,0.105971
419259,605599,2022-02,10227778.07,0.105971,-0.170342


In [7]:
MKT = df_mkt.groupby('date')['next_month_return'].mean()
MKT_monthly = pd.DataFrame(MKT)
MKT_monthly

Unnamed: 0_level_0,next_month_return
date,Unnamed: 1_level_1
2005-12,0.088967
2006-01,0.032050
2006-02,0.014953
2006-03,0.115215
2006-04,0.328658
...,...
2021-11,0.048813
2021-12,-0.082688
2022-01,0.049878
2022-02,-0.037078


现在我们有了每个月的 MKT factor 和 SMB factor 的值
### 对于任何一个股票，这个股票每个月的收益，我可以分解为 R = A1 * MKT factor + A2 * SMB factor + alpha

In [8]:
test_stock_code = "600000" #浦发银行
test_stock = df_mkt[df_mkt['stock'] == test_stock_code]
test_stock

Unnamed: 0,stock,date,value_in_thousand,monthly_return,next_month_return
208370,600000,2005-12,3.817125e+07,0.106697,0.155897
208371,600000,2006-01,4.412205e+07,0.155897,0.081633
208372,600000,2006-02,4.772385e+07,0.081633,-0.109106
208373,600000,2006-03,4.251690e+07,-0.109106,0.209055
208374,600000,2006-05,3.903255e+07,0.209055,-0.006018
...,...,...,...,...,...
208560,600000,2021-11,2.497869e+08,-0.048098,0.002350
208561,600000,2021-12,2.503740e+08,0.002350,-0.014068
208562,600000,2022-01,2.468517e+08,-0.014068,-0.002378
208563,600000,2022-02,2.462647e+08,-0.002378,-0.046484


In [9]:
test_stock_list = df_mkt['stock'].unique()
len(test_stock_list)

3255

In [12]:
y = test_stock[['date','next_month_return']].set_index('date')
y.dropna(inplace=True)
y

Unnamed: 0_level_0,next_month_return
date,Unnamed: 1_level_1
2005-12,0.155897
2006-01,0.081633
2006-02,-0.109106
2006-03,0.209055
2006-05,-0.006018
...,...
2021-10,-0.048098
2021-11,0.002350
2021-12,-0.014068
2022-01,-0.002378


In [10]:
X = pd.concat([MKT_monthly['next_month_return'], SMB_monthly['SMB_monthly']], axis=1)
X.dropna(inplace=True)
X

Unnamed: 0,next_month_return,SMB_monthly
2005-12,0.088967,-0.084707
2006-01,0.032050,0.038873
2006-02,0.014953,-0.061813
2006-03,0.115215,-0.063103
2006-04,0.328658,0.445650
...,...,...
2021-10,0.081611,0.091917
2021-11,0.048813,0.064914
2021-12,-0.082688,0.050056
2022-01,0.049878,0.023067


In [13]:
import statsmodels.api as sm
from statsmodels import regression

data_for_stock = y.index.unique()
X = X.loc[data_for_stock]
X = sm.add_constant(X)
model = regression.linear_model.OLS(y, X).fit()
alpha = model.params[0]
beta1 = model.params[1]
beta2 = model.params[2]
print('alpha: ' + str(alpha))
print('beta SMB: ' + str(beta1))
print('beta MKT: ' + str(beta2))
print(model.summary())

alpha: 0.01873376976333879
beta SMB: 0.6679959001624033
beta MKT: -0.73612189163164
                            OLS Regression Results                            
Dep. Variable:      next_month_return   R-squared:                       0.369
Model:                            OLS   Adj. R-squared:                  0.363
Method:                 Least Squares   F-statistic:                     55.96
Date:                Mon, 29 Jan 2024   Prob (F-statistic):           7.46e-20
Time:                        21:36:24   Log-Likelihood:                 208.41
No. Observations:                 194   AIC:                            -410.8
Df Residuals:                     191   BIC:                            -401.0
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
                        coef    std err          t      P>|t|      [0.025      0.975]
----------------------------------------

  alpha = model.params[0]
  beta1 = model.params[1]
  beta2 = model.params[2]


In [33]:
R2 = []
X = pd.concat([MKT_monthly['next_month_return'], SMB_monthly['SMB_monthly']], axis=1)
X.dropna(inplace=True)

for stock in test_stock_list:
    y = df_mkt[df_mkt['stock'] == stock][['date','next_month_return']].set_index('date')
    y.dropna(inplace=True)
    
    if y.empty:
        continue

    data_for_stock = y.index.unique()
    X_stock = X.loc[data_for_stock]  # Use X_stock instead of X
    X_stock = sm.add_constant(X_stock)

    model = regression.linear_model.OLS(y, X_stock).fit()  # Use X_stock instead of X
    R2.append(model.rsquared)

  return 1 - self.ssr/self.centered_tss
  return 1 - self.ssr/self.centered_tss
  return 1 - self.ssr/self.centered_tss
  return 1 - self.ssr/self.centered_tss
  return 1 - self.ssr/self.centered_tss
  return 1 - self.ssr/self.centered_tss
  return 1 - self.ssr/self.centered_tss
  return 1 - self.ssr/self.centered_tss


In [42]:
R2_df = pd.DataFrame(R2).dropna()
R2_df.dropna(inplace=True)
R2_df.describe()

  sqr = _ensure_numeric((avg - values) ** 2)


Unnamed: 0,0
count,3249.0
mean,-inf
std,
min,-inf
25%,0.27752
50%,0.39031
75%,0.481441
max,1.0


---

# 三因子，还有....

---

In [13]:
df_pb = pd.read_excel('Monthly_PB_and_Return.xlsx', index_col=0)
df_pb.reset_index(inplace=True)
df_pb['stock'] = df_pb['stock'].map(lambda x: str(x).zfill(6))
df_pb

Unnamed: 0,stock,name,date,return,PB,next_month_return
0,600000.SH,浦发银行,2005-12-30,10.6697,2.5770,15.5897
1,600000.SH,浦发银行,2006-01-25,15.5897,2.9788,8.1633
2,600000.SH,浦发银行,2006-02-28,8.1633,3.2219,-10.9106
3,600000.SH,浦发银行,2006-03-31,-10.9106,2.8704,0.0000
4,600000.SH,浦发银行,2006-04-28,0.0000,2.8704,20.9181
...,...,...,...,...,...,...
419985,003816.SZ,中国广核,2021-11-30,-0.6849,1.4650,7.9310
419986,003816.SZ,中国广核,2021-12-31,7.9310,1.5812,-7.9872
419987,003816.SZ,中国广核,2022-01-28,-7.9872,1.4549,2.0833
419988,003816.SZ,中国广核,2022-02-28,2.0833,1.4852,-7.1429


In [14]:
df = df_pb.copy()
df['book_to_market'] = 1 / df['PB']
df

Unnamed: 0,stock,name,date,return,PB,next_month_return,book_to_market
0,600000.SH,浦发银行,2005-12-30,10.6697,2.5770,15.5897,0.388048
1,600000.SH,浦发银行,2006-01-25,15.5897,2.9788,8.1633,0.335706
2,600000.SH,浦发银行,2006-02-28,8.1633,3.2219,-10.9106,0.310376
3,600000.SH,浦发银行,2006-03-31,-10.9106,2.8704,0.0000,0.348384
4,600000.SH,浦发银行,2006-04-28,0.0000,2.8704,20.9181,0.348384
...,...,...,...,...,...,...,...
419985,003816.SZ,中国广核,2021-11-30,-0.6849,1.4650,7.9310,0.682594
419986,003816.SZ,中国广核,2021-12-31,7.9310,1.5812,-7.9872,0.632431
419987,003816.SZ,中国广核,2022-01-28,-7.9872,1.4549,2.0833,0.687332
419988,003816.SZ,中国广核,2022-02-28,2.0833,1.4852,-7.1429,0.673310


# TODO: 利用Book to Market Ratio构建投资组合

# 如何在中国市场复现Fama-French三因子模型？

### 参考：liu+stambagh+yu+2019