# Week 10. Data Analysis Examples

### Outline:

- Introduction to statsmodels  <br>
<br>
- Testing Capital Asset Pricing Model (CAPM)  <br>
<br>
- **Predicting Returns in HK Stocks Markets**

In [1]:
import numpy as np
import pandas as pd
np.random.seed(12345)
import matplotlib.pyplot as plt
plt.rc('figure', figsize=(6, 4))
PREVIOUS_MAX_ROWS = pd.options.display.max_rows
pd.options.display.max_rows = 15
np.set_printoptions(precision=4, suppress=True)

In [2]:
import statsmodels.api as sm
import statsmodels.formula.api as smf

---

## 10.3 Predicting Stock Returns 

* One of the key tasks in empirical asset pricing is to predict individual stock returns. <br>
<br>
* In this section, we focus on cross-sectional predictability, e.g., whether small firms earn higher average returns than large firms.   <br>
<br>
* We answer this question with simple OLS regressions. The baseline model we want to estimate is: <br>
<br>
$$
R_{i,t+1} = c + X^\top_{it} \beta + \epsilon_{i,t+1}, 
$$
* $R_{i,t+1}$: return of stock $i$ at time $t+1$
* $X_{it}$: the $K \times 1$ vector of signals of stock $i$ at time $t$ (i.e., standing at time $t$, we observe $X_{it}$ and aim to predict the next-period return $R_{i,t+1}$)
* The key parameter is $\beta$ : if $\beta_k$ ($1 \leq k \leq K$) is significantly different from zero, we can say that the $k$-th signal can predict stock returns. 
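
To make the timing in the baseline model concrete, here is a minimal sketch on a made-up two-stock panel. The column names `ret`, `signal`, and `ret_lead1m` are hypothetical; the HK data set used below already contains a column named `ret_exc_lead1m`, whose name suggests that this alignment has been done upstream.

```python
import pandas as pd

# Toy panel: two stocks, three month-ends (all values are made up).
toy = pd.DataFrame({
    "id": [1, 1, 1, 2, 2, 2],
    "eom": pd.to_datetime(["2000-01-31", "2000-02-29", "2000-03-31"] * 2),
    "ret": [0.01, 0.02, -0.03, 0.05, -0.01, 0.04],
    "signal": [0.3, -0.2, 0.1, -0.4, 0.5, 0.0],
}).set_index(["id", "eom"]).sort_index()

# R_{i,t+1}: shift each stock's return back by one period so that the
# row dated t holds the return realized in period t+1.
toy["ret_lead1m"] = toy.groupby(level="id")["ret"].shift(-1)

# The last month of each stock has no next-period return and is dropped.
reg_sample = toy.dropna(subset=["ret_lead1m"])
print(reg_sample)
```

Each remaining row pairs the signal observed at time $t$ with the return of the following month, which is the pairing the regression needs.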

### 10.3.1 Load the panel data of HK stocks and Preprocess the data

In [4]:
D = pd.read_parquet('../data/HK_stocks_151signals.parquet', engine='pyarrow')
print(D.shape)

(413279, 154)


In [5]:
#D[['id', 'eom', 'market_equity']]

In [6]:
print(D.columns)
D.head()

Index(['id', 'eom', 'ret_exc_lead1m', 'cowc_gr1a', 'oaccruals_at',
       'oaccruals_ni', 'seas_16_20na', 'taccruals_at', 'taccruals_ni',
       'capex_abn',
       ...
       'eqnetis_at', 'eqnpo_12m', 'eqnpo_me', 'eqpo_me', 'fcf_me', 'ival_me',
       'netis_at', 'ni_me', 'ocf_me', 'sale_me'],
      dtype='object', length=154)


Unnamed: 0,id,eom,ret_exc_lead1m,cowc_gr1a,oaccruals_at,oaccruals_ni,seas_16_20na,taccruals_at,taccruals_ni,capex_abn,...,eqnetis_at,eqnpo_12m,eqnpo_me,eqpo_me,fcf_me,ival_me,netis_at,ni_me,ocf_me,sale_me
13581256,310108801.0,1990-07-31,-0.094007,,,,,,,,...,,0.033708,,,,,,0.093894,,0.139606
13581257,310108801.0,1990-08-31,-0.1457,,,,,,,,...,,0.033708,,,,,,0.102937,,0.153052
13581258,310108801.0,1990-09-30,0.151076,,,,,,,,...,,0.033708,,,,,,0.119655,,0.177908
13581259,310108801.0,1990-10-31,0.017782,,,,,,,,...,,0.034204,,,,,,0.104479,,0.155344
13581260,310108801.0,1990-11-30,0.020163,,,,,,,,...,,0.034204,,,,,,0.102087,,0.151787


#### First, let's focus on the subsample between Jan 2000 and Dec 2020

In [7]:
D = D[(D.eom>='2000-01') & (D.eom<'2021-01')]

In [8]:
D = D.set_index(['id', 'eom'])  # create a hierarchical index using two columns 'id' and 'eom' as the index. 
D

id,eom,ret_exc_lead1m,cowc_gr1a,oaccruals_at,oaccruals_ni,seas_16_20na,taccruals_at,taccruals_ni,capex_abn,debt_gr3,fnl_gr1a,...,eqnetis_at,eqnpo_12m,eqnpo_me,eqpo_me,fcf_me,ival_me,netis_at,ni_me,ocf_me,sale_me
310108801.0,2000-01-31,0.048838,0.015783,0.057819,0.297865,,0.334973,1.725669,,0.098660,0.014801,...,,0.018260,,,0.004774,0.226013,,0.149551,0.105005,0.013690
310108801.0,2000-02-29,0.120655,0.015783,0.057819,0.297865,,0.334973,1.725669,,0.098660,0.014801,...,,0.018260,,,0.004533,0.214610,,0.142006,0.099707,0.012999
310108801.0,2000-03-31,-0.206566,0.015783,0.057819,0.297865,,0.334973,1.725669,,0.098660,0.014801,...,,0.018260,,,0.004029,0.190759,,0.126224,0.088626,0.011554
310108801.0,2000-04-30,-0.229319,0.015752,0.358486,1.048355,,0.637979,1.865701,1.139721,0.055444,0.014718,...,,0.018260,,,-0.030523,3.880564,,0.278421,-0.013463,0.014734
310108801.0,2000-05-31,0.206311,0.015752,0.358486,1.048355,,0.637979,1.865701,1.139721,0.055444,0.014718,...,,0.017660,,,-0.039856,5.067118,,0.363554,-0.017580,0.019239
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
333190801.0,2020-08-31,-0.103452,0.033244,0.051367,0.215893,,-0.038681,-0.162575,0.424967,-0.184604,0.023118,...,0.271811,0.039263,-0.120776,0.070716,0.135990,1.967647,0.272167,0.178598,0.140040,0.693516
333190801.0,2020-09-30,0.114601,0.033244,0.051367,0.215893,,-0.038681,-0.162575,0.424967,-0.184604,0.023118,...,0.271811,0.039263,-0.134701,0.078869,0.151669,2.194506,0.272167,0.199189,0.156186,0.773475
333190801.0,2020-10-31,0.017469,0.034571,0.055404,0.202141,,-0.033019,-0.120470,0.424967,-0.184604,0.023285,...,0.271811,0.039263,-0.120843,0.070755,0.159228,1.968742,0.272167,0.206378,0.164660,0.764348
333190801.0,2020-11-30,-0.034063,0.034571,0.055404,0.202141,,-0.033019,-0.120470,0.424967,-0.184604,0.023285,...,0.271811,0.039263,-0.118761,0.069536,0.156484,1.934816,0.272167,0.202821,0.161823,0.751177


#### Second, we further remove the firms that do not have enough historical observations available

* Here, a firm is included in the sample only if it has at least 24 monthly observations. 

In [38]:
history_lengths = D.index.to_frame()['id'].groupby(level=0).count()

In [10]:
history_lengths

id
301393202.0    115
301510501.0      8
301549801.0    252
301553001.0    252
301565201.0    252
              ... 
334597601.0      1
334600201.0      1
334600301.0      1
334600401.0      1
334612501.0      1
Name: id, Length: 2773, dtype: int64

In [11]:
cs_idx_with_history = history_lengths[history_lengths >= 24].index 

```python
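ts_idx = D.index.get_level_values(1).unique()
# full (id, eom) grid, in the same level order as the index of D
df_idx = pd.MultiIndex.from_product((cs_idx_with_history, ts_idx), names=['id', 'eom']).to_frame()[[]]

# left merge onto the full grid to produce a balanced panel (missing firm-months become NaN)
D_full = pd.merge(df_idx, D, how='left', left_index=True, right_index=True)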
```

In [12]:
D_full = D.loc[D.index.get_level_values(0).isin(cs_idx_with_history),:]
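
The history filter can be checked on a toy panel. The sketch below uses made-up data and a hypothetical threshold `min_obs = 2` (the notebook uses 24); it counts rows per stock and keeps only the stocks with enough of them.

```python
import pandas as pd

# Toy panel: stock 1 has three monthly rows, stock 2 has one (made-up values).
idx = pd.MultiIndex.from_tuples(
    [(1, "2000-01-31"), (1, "2000-02-29"), (1, "2000-03-31"), (2, "2000-01-31")],
    names=["id", "eom"],
)
toy = pd.DataFrame({"ret": [0.01, 0.02, -0.03, 0.05]}, index=idx)

min_obs = 2  # hypothetical threshold for the toy data

# Number of rows per stock
n_obs = toy.groupby(level="id").size()
keep_ids = n_obs[n_obs >= min_obs].index

# Keep only the rows of stocks with enough history
toy_filtered = toy[toy.index.get_level_values("id").isin(keep_ids)]
print(toy_filtered)
```

Note that this keeps an unbalanced panel: stocks with enough history still appear only in the months in which they have data.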

In [13]:
D_full

id,eom,ret_exc_lead1m,cowc_gr1a,oaccruals_at,oaccruals_ni,seas_16_20na,taccruals_at,taccruals_ni,capex_abn,debt_gr3,fnl_gr1a,...,eqnetis_at,eqnpo_12m,eqnpo_me,eqpo_me,fcf_me,ival_me,netis_at,ni_me,ocf_me,sale_me
310108801.0,2000-01-31,0.048838,0.015783,0.057819,0.297865,,0.334973,1.725669,,0.098660,0.014801,...,,1.826049e-02,,,0.004774,0.226013,,0.149551,0.105005,0.013690
310108801.0,2000-02-29,0.120655,0.015783,0.057819,0.297865,,0.334973,1.725669,,0.098660,0.014801,...,,1.826049e-02,,,0.004533,0.214610,,0.142006,0.099707,0.012999
310108801.0,2000-03-31,-0.206566,0.015783,0.057819,0.297865,,0.334973,1.725669,,0.098660,0.014801,...,,1.826049e-02,,,0.004029,0.190759,,0.126224,0.088626,0.011554
310108801.0,2000-04-30,-0.229319,0.015752,0.358486,1.048355,,0.637979,1.865701,1.139721,0.055444,0.014718,...,,1.826049e-02,,,-0.030523,3.880564,,0.278421,-0.013463,0.014734
310108801.0,2000-05-31,0.206311,0.015752,0.358486,1.048355,,0.637979,1.865701,1.139721,0.055444,0.014718,...,,1.765956e-02,,,-0.039856,5.067118,,0.363554,-0.017580,0.019239
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
333183701.0,2020-08-31,-0.087502,-0.613156,-0.489703,-0.858629,,-0.455361,-0.798415,0.077146,-0.606049,-0.036384,...,,5.551115e-16,,,-0.029965,,,-0.211139,-0.024430,0.441818
333183701.0,2020-09-30,-0.069160,-0.613156,-0.489703,-0.858629,,-0.455361,-0.798415,0.077146,-0.606049,-0.036384,...,,2.775558e-16,,,-0.032835,,,-0.231367,-0.026770,0.484145
333183701.0,2020-10-31,0.000222,-0.167308,-0.186136,-0.202729,,0.184912,0.201396,0.077146,-0.896877,-0.371048,...,,-4.401689e-02,,,-0.033753,,,-0.229120,-0.182671,0.423562
333183701.0,2020-11-30,-0.341308,-0.167308,-0.186136,-0.202729,,0.184912,0.201396,0.077146,-0.896877,-0.371048,...,,-4.401689e-02,,,-0.033743,,,-0.229054,-0.182618,0.423440


#### Third, we need to remove some extremely small firms

* For instance, if a firm's market capitalization is less than 0.001% of the total market capitalization of the whole HK stock market, we will exclude this firm from our analysis.  <br>
<br>
* Reason: Micro-cap firms often have extreme behaviors and are illiquid, which makes trading these stocks costly. 

In [40]:
total_market_cap = D_full['market_equity'].groupby(level=1).sum()

In [15]:
market_cap_filter = (D_full['market_equity'] / total_market_cap) > 0.00001

In [16]:
D_full = D_full[market_cap_filter]
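
As a sanity check on the market-cap filter, the sketch below reproduces the idea on made-up data. It uses `transform("sum")` to broadcast the monthly total back to every firm-month row, which makes the index alignment explicit; the threshold of 0.5% is hypothetical and chosen only so that the filter removes something in a three-stock toy market.

```python
import pandas as pd

# Toy panel: three stocks, two month-ends (made-up market caps).
idx = pd.MultiIndex.from_product(
    [[1, 2, 3], pd.to_datetime(["2000-01-31", "2000-02-29"])],
    names=["id", "eom"],
)
toy = pd.DataFrame(
    {"market_equity": [900.0, 800.0, 99.0, 199.0, 1.0, 1.0]}, index=idx
)

# Total market cap per month, broadcast back to every firm-month row
total_cap = toy.groupby(level="eom")["market_equity"].transform("sum")
share = toy["market_equity"] / total_cap

threshold = 0.005  # hypothetical 0.5% cut-off for the toy data
toy_kept = toy[share > threshold]
print(toy_kept)
```

Because the filter is applied month by month, a firm can drop out of the sample in some months and reappear in others, which is consistent with the gaps visible in the filtered output.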

In [17]:
D_full

id,eom,ret_exc_lead1m,cowc_gr1a,oaccruals_at,oaccruals_ni,seas_16_20na,taccruals_at,taccruals_ni,capex_abn,debt_gr3,fnl_gr1a,...,eqnetis_at,eqnpo_12m,eqnpo_me,eqpo_me,fcf_me,ival_me,netis_at,ni_me,ocf_me,sale_me
310108801.0,2000-01-31,0.048838,0.015783,0.057819,0.297865,,0.334973,1.725669,,0.098660,0.014801,...,,1.826049e-02,,,0.004774,0.226013,,0.149551,0.105005,0.013690
310108801.0,2000-02-29,0.120655,0.015783,0.057819,0.297865,,0.334973,1.725669,,0.098660,0.014801,...,,1.826049e-02,,,0.004533,0.214610,,0.142006,0.099707,0.012999
310108801.0,2000-03-31,-0.206566,0.015783,0.057819,0.297865,,0.334973,1.725669,,0.098660,0.014801,...,,1.826049e-02,,,0.004029,0.190759,,0.126224,0.088626,0.011554
310108801.0,2000-04-30,-0.229319,0.015752,0.358486,1.048355,,0.637979,1.865701,1.139721,0.055444,0.014718,...,,1.826049e-02,,,-0.030523,3.880564,,0.278421,-0.013463,0.014734
310108801.0,2000-05-31,0.206311,0.015752,0.358486,1.048355,,0.637979,1.865701,1.139721,0.055444,0.014718,...,,1.765956e-02,,,-0.039856,5.067118,,0.363554,-0.017580,0.019239
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
333183701.0,2018-11-30,-0.105329,-0.209154,0.528659,6.997669,,0.563898,7.464126,1.927060,-0.170135,-0.175539,...,,0.000000e+00,,,-0.203585,,,-0.047785,-0.192614,0.586677
333183701.0,2018-12-31,-0.146483,-0.209154,0.528659,6.997669,,0.563898,7.464126,1.927060,-0.170135,-0.175539,...,,2.220446e-16,,,-0.227063,,,-0.053296,-0.214827,0.654334
333183701.0,2019-03-31,-0.137475,-0.108815,0.528659,6.997669,,0.563898,7.464126,1.927060,-0.170135,-0.061488,...,,2.220446e-16,,,-0.196687,,,-0.040261,-0.186087,0.567377
333183701.0,2019-04-30,-0.129850,0.127135,0.439295,12.512152,,0.428924,12.216765,-0.206262,-0.392954,-0.122736,...,,2.220446e-16,,,-0.165046,,,-0.011900,-0.160789,0.654023


#### Fourth, we perform the rank-transformation of the firm characteristics. 

* Predictors are in different units and full of outliers, so we want to standardize the data.  <br>
<br>
* A standard practice is to rank and center each characteristic cross-sectionally to lie within the $[-0.5, 0.5]$ range.
  * In each month, we rank the stocks by firm characteristic $k$ and divide the ranks by the number of stocks in that month. Finally, we subtract 0.5. Hence, the rank-transformed firm characteristic $k$ is
  $$
  X_{ikt} = \frac{\text{rank}_{ikt}}{N_t} - 0.5
  $$
    * $\text{rank}_{ikt} \in [1, N_t]$: the rank of stock $i$ by the values of firm characteristic $k$
    * $N_t$: the number of stocks in period $t$
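
A small worked example may help. The sketch below applies the formula to four made-up stocks in a single month, one of which has a missing characteristic. It assumes that $N_t$ is taken to be the largest rank in the month, so that missing values do not count; with ties at the top, the largest (average) rank can be slightly below the number of ranked stocks.

```python
import numpy as np
import pandas as pd

# One month, four stocks; be_me contains an outlier and a missing value.
idx = pd.MultiIndex.from_product(
    [[1, 2, 3, 4], pd.to_datetime(["2000-01-31"])], names=["id", "eom"]
)
toy = pd.DataFrame({"be_me": [0.2, 5.0, np.nan, 1000.0]}, index=idx)

# Cross-sectional ranks within each month (NaN stays NaN)
ranks = toy.groupby(level="eom").rank(ascending=True)

# Largest rank per month, broadcast back to every row
n_t = ranks.groupby(level="eom").transform("max")

toy_ranked = ranks / n_t - 0.5
print(toy_ranked)
```

The outlier 1000.0 is mapped to 0.5, the same value any largest observation would receive, which is why the transformation is robust to extreme values.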

In [18]:
D_full.std()

ret_exc_lead1m     0.234235
cowc_gr1a         33.953999
oaccruals_at      30.774210
oaccruals_ni      86.045923
seas_16_20na       0.033897
                    ...    
ival_me            4.450667
netis_at           0.533379
ni_me              0.637378
ocf_me             0.610625
sale_me            4.639632
Length: 152, dtype: float64

In [19]:
D_full.abs().max()

ret_exc_lead1m       63.189411
cowc_gr1a          6067.371054
oaccruals_at       5506.954637
oaccruals_ni      25039.179853
seas_16_20na          0.597768
                      ...     
ival_me             726.091826
netis_at             37.009293
ni_me                28.208714
ocf_me               41.938109
sale_me             321.386311
Length: 152, dtype: float64

In [20]:
D_ranked = D_full.groupby(level=1).rank(ascending=True)
D_ranked = D_ranked / D_ranked.groupby(level=1).max() - 0.5

In [21]:
D_ranked.min()

ret_exc_lead1m   -0.499435
cowc_gr1a        -0.499288
oaccruals_at     -0.499291
oaccruals_ni     -0.499291
seas_16_20na     -0.496942
                    ...   
ival_me          -0.498926
netis_at         -0.498670
ni_me            -0.499435
ocf_me           -0.499435
sale_me          -0.499143
Length: 152, dtype: float64

#### Fifth, we need to handle missing data

* Remove the firm-month observations whenever the return is missing. 
  * In general, we do not fill in missing returns! <br>
<br>
* Remove the firm characteristics whose missing rates are higher than $30\%$ in any month. <br>
<br>
* For the remaining sample, we fill in missing data with zeros. 
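
The three steps can be traced on a toy panel. In the sketch below all values and the signal names `sig_a` and `sig_b` are made up; `sig_b` is missing for half of the stocks in January and is therefore dropped, while the single remaining gap in `sig_a` is filled with zero, i.e., the cross-sectional midpoint of a rank-transformed signal.

```python
import numpy as np
import pandas as pd

idx = pd.MultiIndex.from_product(
    [[1, 2, 3, 4], pd.to_datetime(["2000-01-31", "2000-02-29"])],
    names=["id", "eom"],
)
toy = pd.DataFrame({
    "ret_exc_lead1m": [0.1, -0.2, 0.3, np.nan, -0.1, 0.2, 0.0, 0.4],
    "sig_a": [0.5, -0.5, np.nan, 0.1, 0.2, 0.3, -0.1, 0.0],
    "sig_b": [np.nan, 0.1, np.nan, 0.2, 0.3, 0.4, 0.5, -0.5],
}, index=idx)

# Step 1: drop firm-months with a missing return
step1 = toy[toy["ret_exc_lead1m"].notna()]

# Step 2: drop columns whose missing rate exceeds 30% in any month
mis_rate = step1.isna().groupby(level="eom").mean()
step2 = step1.loc[:, ~(mis_rate > 0.3).any()]

# Step 3: fill the remaining gaps with zeros
clean = step2.fillna(0)
print(clean)
```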

In [22]:
D_ranked = D_ranked[D_ranked['ret_exc_lead1m'].notna()]

In [23]:
mis_rate = D_ranked.isna().groupby(level=1).mean()

In [24]:
D_ranked = D_ranked.loc[:, ~(mis_rate > 0.3).any()]

In [25]:
D_ranked = D_ranked.fillna(0)

In [26]:
D_ranked

id,eom,ret_exc_lead1m,fnl_gr1a,nfna_gr1a,at_gr1,be_gr1a,inv_gr1,inv_gr1a,mispricing_mgmt,ppeinv_gr1a,sale_gr1,...,prc,ret_1_0,at_me,be_me,bev_mev,debt_me,ebitda_mev,ni_me,ocf_me,sale_me
310108801.0,2000-01-31,0.109848,0.196017,0.441300,0.472803,0.480220,0.405882,0.369198,-0.259916,0.359375,-0.360759,...,0.494361,-0.104563,-0.334661,-0.249478,-0.235967,-0.252495,-0.168737,0.354582,0.044177,-0.485944
310108801.0,2000-02-29,0.320561,0.191667,0.443750,0.472973,0.480435,0.406323,0.367647,-0.264706,0.360000,-0.359539,...,0.494393,0.108696,-0.313861,-0.237060,-0.224638,-0.240079,-0.174897,0.341584,0.048902,-0.484032
310108801.0,2000-03-31,-0.062152,0.192149,0.442149,0.473196,0.480519,0.404872,0.366667,-0.261317,0.354626,-0.362786,...,0.494495,0.320561,-0.332031,-0.258197,-0.251540,-0.251468,-0.170732,0.310547,0.023622,-0.486220
310108801.0,2000-04-30,-0.358998,0.237374,0.435354,0.459759,0.478903,0.371795,0.319106,-0.281563,0.346154,-0.365580,...,0.494495,-0.065056,-0.340691,-0.274194,-0.244980,-0.252896,-0.169339,0.419386,-0.286538,-0.480695
310108801.0,2000-05-31,0.354244,0.241414,0.433333,0.459759,0.478814,0.369767,0.317073,-0.298403,0.344017,-0.361507,...,0.494516,-0.358998,-0.312620,-0.211694,-0.171371,-0.234615,-0.162651,0.444551,-0.285441,-0.480769
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
333183701.0,2018-11-30,-0.303582,-0.474847,0.460123,-0.470482,-0.474831,0.000000,-0.142333,0.153024,-0.112214,-0.258282,...,-0.143709,-0.003902,-0.461171,-0.470874,-0.467026,-0.258788,-0.399308,-0.336918,-0.409200,-0.068862
333183701.0,2018-12-31,-0.414311,-0.473301,0.459951,-0.470273,-0.475152,0.000000,-0.141914,0.190534,-0.107116,-0.259540,...,-0.161376,-0.315476,-0.451708,-0.468900,-0.462940,-0.252392,-0.410848,-0.348057,-0.418139,-0.051683
333183701.0,2019-03-31,-0.426185,-0.373144,0.264233,-0.082878,-0.379257,0.000000,-0.099057,-0.041667,0.037279,-0.229981,...,-0.158434,0.491556,-0.430556,-0.457317,-0.465473,-0.199450,-0.343909,-0.320652,-0.423309,-0.061818
333183701.0,2019-04-30,-0.234807,-0.432915,-0.058621,-0.229064,0.042191,0.000000,-0.065663,0.105362,-0.154957,-0.255306,...,-0.194172,-0.435236,-0.420762,-0.453358,-0.468140,-0.237725,-0.231362,-0.261671,-0.411548,-0.021014


### 10.3.2 Simple OLS Regression Analysis

* After we preprocess the data, we end up with 47 variables in ```D_ranked```, 46 of which are signals (columns 2--47).  <br>
<br>
* We further divide the full sample into two subsamples: 
  * In-sample data: Jan 2000 - Dec 2015
  * Out-of-sample data: Jan 2016 - Dec 2020  <br>
<br>
* To assess predictive performance for individual excess stock return forecasts, we calculate the out-of-sample $R^2$ as <br>
<br>
$$
R^2_{oos} = 1 - \frac{\sum_{i,t \in \text{oos}} (R_{i,t+1} - \hat{R}_{i,t+1})^2}{\sum_{i,t \in \text{oos}} R^2_{i,t+1}}. 
$$
<br>
* $\sum_{i,t \in \text{oos}}$ indicates that fits are only assessed on the OOS subsample, whose data never enter into model estimation or tuning.   <br>
<br>
* $R^2_{oos}$ pools prediction errors across firms and over time into a grand panel-level assessment of each model.
   * Intuitively, it compares the model with a benchmark forecast of zero, i.e., the hypothesis that stock returns are unpredictable and have zero mean.
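
To get a feel for the scale of $R^2_{oos}$, the sketch below evaluates the formula on a handful of made-up returns. The helper name `r2_oos` is hypothetical and not part of any library.

```python
import numpy as np

def r2_oos(y_true, y_pred):
    """Out-of-sample R^2 against a zero-forecast benchmark."""
    y_true = np.asarray(y_true, dtype=float).ravel()
    y_pred = np.asarray(y_pred, dtype=float).ravel()
    return 1.0 - np.sum((y_true - y_pred) ** 2) / np.sum(y_true ** 2)

returns = np.array([0.02, -0.01, 0.03, -0.04])  # made-up realized returns

r2_zero = r2_oos(returns, np.zeros_like(returns))  # always forecast zero
r2_perfect = r2_oos(returns, returns)              # perfect foresight
r2_wrong_sign = r2_oos(returns, -returns)          # right size, wrong sign

print(r2_zero, r2_perfect, r2_wrong_sign)
```

A zero forecast scores exactly 0, a perfect forecast scores 1, and a forecast that is worse than zero scores below 0. A negative $R^2_{oos}$ therefore means the model loses to the zero forecast.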

In [27]:
D_ranked_in = D_ranked[(D_ranked.index.get_level_values(1) < '2016-01')]
D_ranked_oos = D_ranked[(D_ranked.index.get_level_values(1) >= '2016-01')]

In [28]:
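# Both frames contain 'ret_exc_lead1m', so the merge adds suffixes:
# 'ret_exc_lead1m_x' is the raw return (from D_full), 'ret_exc_lead1m_y' its rank-transformed version
data_in = pd.merge(D_full[['ret_exc_lead1m']], D_ranked_in, how='inner', left_index=True, right_index=True)
data_oos = pd.merge(D_full[['ret_exc_lead1m']], D_ranked_oos, how='inner', left_index=True, right_index=True)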
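
The merge above joins two frames that share a column name, so pandas appends its default suffixes `_x` (left frame) and `_y` (right frame). The toy sketch below, with made-up values, shows the resulting column layout.

```python
import pandas as pd

idx = pd.MultiIndex.from_tuples(
    [(1, "2000-01-31"), (2, "2000-01-31")], names=["id", "eom"]
)
# Raw returns (left frame) and rank-transformed data (right frame)
raw = pd.DataFrame({"ret_exc_lead1m": [0.10, -0.05]}, index=idx)
ranked = pd.DataFrame(
    {"ret_exc_lead1m": [0.5, 0.0], "be_me": [0.0, 0.5]}, index=idx
)

merged = pd.merge(raw, ranked, how="inner", left_index=True, right_index=True)
print(merged.columns.tolist())
```

The first two columns of the merged frame are the raw and the ranked return, so the signals start at column position 2.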

#### Consider a simple case: Use ```['market_equity', 'be_me', 'ret_12_1']``` to predict stock returns

In [29]:
model = sm.OLS(data_in[['ret_exc_lead1m_x']].values, 
               sm.add_constant(data_in[['market_equity', 'be_me', 'ret_12_1']]))
results = model.fit()
print(results.summary())

                            OLS Regression Results                            
Dep. Variable:                      y   R-squared:                       0.001
Model:                            OLS   Adj. R-squared:                  0.001
Method:                 Least Squares   F-statistic:                     72.94
Date:                Fri, 13 Dec 2024   Prob (F-statistic):           3.86e-47
Time:                        05:13:54   Log-Likelihood:                -13999.
No. Observations:              198947   AIC:                         2.801e+04
Df Residuals:                  198943   BIC:                         2.805e+04
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
                    coef    std err          t      P>|t|      [0.025      0.975]
---------------------------------------------------------------------------------
const             0.0122      0.001     20.912

In [30]:
def get_R2(beta_hat, y, X):
    """
    beta_hat: a pandas.Series of coefficient estimates (constant first)
    y: an ndarray of dependent variables
    X: a two-dimensional ndarray of independent variables (without the constant)
    This function returns the R-squared relative to a zero forecast:
    the denominator is the uncentered sum of squared returns
    """
    y = np.asarray(y).reshape(-1)  # flatten (n, 1) to (n,)
    pred_err = y - sm.add_constant(X) @ np.asarray(beta_hat)
    return 1 - (pred_err**2).sum() / (y**2).sum()


#### In-Sample Performance

In [31]:
get_R2(results.params, data_in[['ret_exc_lead1m_x']].values, 
       data_in[['market_equity', 'be_me', 'ret_12_1']].values)

np.float64(0.003281287943694644)

#### OOS Performance

$$
R^2_{oos} = 1 - \frac{\sum_{i,t \in \text{oos}} (R_{i,t+1} - \hat{R}_{i,t+1})^2}{\sum_{i,t \in \text{oos}} R^2_{i,t+1}}. 
$$

In [32]:
get_R2(results.params, data_oos[['ret_exc_lead1m_x']].values, 
       data_oos[['market_equity', 'be_me', 'ret_12_1']].values)

np.float64(-0.005787789160073498)

#### Use all 46 predictors

In [33]:
model = sm.OLS(data_in[['ret_exc_lead1m_x']].values, 
               sm.add_constant(data_in.iloc[:,2:]))
results = model.fit()
print(results.summary())

                            OLS Regression Results                            
Dep. Variable:                      y   R-squared:                       0.002
Model:                            OLS   Adj. R-squared:                  0.002
Method:                 Least Squares   F-statistic:                     10.05
Date:                Fri, 13 Dec 2024   Prob (F-statistic):           4.72e-70
Time:                        05:13:55   Log-Likelihood:                -13877.
No. Observations:              198947   AIC:                         2.785e+04
Df Residuals:                  198900   BIC:                         2.833e+04
Df Model:                          46                                         
Covariance Type:            nonrobust                                         
                      coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------------------------------------------------
const               0.0083      0.001     

In [34]:
get_R2(results.params, data_oos[['ret_exc_lead1m_x']].values, 
       data_oos.iloc[:,2:].values)

np.float64(-0.005633118253951297)

### Conclusion

* The OOS $R^2$ is negative in both regressions above: predicting stock returns with linear functions of these signals is even less accurate than the zero-forecast benchmark, in which we simply use zeros to predict returns. <br>
<br>
* All these signals are public information: arbitrageurs may have already exploited these opportunities over the years.  <br>
<br>
* Our model is too naive: Simple linear models fail to
    * Balance the bias-variance tradeoff in estimation
    * Capture the complex nonlinearity as in other methods (e.g., neural networks, tree-based regressions ...)

---

# END