# Gu et al., Empirical Asset Pricing via Machine Learning

## 1. Introduction

__1.1 Contribution__: 
1. Provide benchmarks for the predictive accuracy of expected-return forecasts (OOS $R^2$ and Sharpe ratio, SR)
2. Synthesize empirical asset pricing with ML. ML allows for more expressive functional forms and higher-dimensional feature sets.

__1.2 Motivation__: To better forecast conditional expected stock returns (risk premia).
1. Features: high dimensionality + high correlation
2. Functional Forms: nonlinearities, interaction

## 2. Methodology
- Statistical Model
- Objective Function (MSE)
- Regularization

__2.1 Formulation__:

$$r_{i,t+1}=\mathbb{E}\left[r_{i,t+1}|\mathcal{F}_{t}\right]+\epsilon_{i,t+1}$$
where
$$\mathbb{E}\left[r_{i,t+1}|\mathcal{F}_{t}\right] = g^{*}\left(z_{i,t}\right)$$
Here $z_{i,t}$ is the predictor vector for stock $i$ in month $t$. Stocks are indexed by $i=1,...,N_t$ (the number of stocks varies by month), and months are indexed by $t=1,...,T$.

__2.2 Statistical Models__

1. _OLS_
  - Extension: OLS with Huber Loss
2. _Penalized Linear_ (Elastic Net)
3. _Dimensionality Reduction_: PCR and PLS
  - Predictor averaging vs. predictor selection (as in penalized linear)
  - Predictor selection: penalized linear models produce suboptimal forecasts when predictors are highly correlated.
  - Predictor averaging: dimensionality reduction
      - PCR: PCA followed by OLS on the leading components (the PCA step ignores the forecasting target)
      - PLS: reduces dimension by exploiting the covariation of the predictors with the target
4. _Generalized Linear_
5. _RF and GBM_
6. _Neural Nets_
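
The Huber loss (item 1) and the elastic net penalty (item 2) can be written out in a few lines. This is a minimal sketch under one common parameterization, assumed here to match the paper's: the Huber loss is quadratic up to a threshold `xi` and linear beyond it, and the penalty mixes L1 and squared L2 terms through `rho`. The function and argument names are illustrative, not taken from the paper.

```python
def huber_loss(residual: float, xi: float) -> float:
    """Quadratic for |residual| <= xi, linear (so less outlier-sensitive) beyond."""
    if abs(residual) <= xi:
        return residual ** 2
    return 2.0 * xi * abs(residual) - xi ** 2


def elastic_net_penalty(theta, lam: float, rho: float) -> float:
    """lam*(1-rho)*sum|theta_j| + 0.5*lam*rho*sum(theta_j^2).

    Under this (assumed) parameterization rho=0 gives the lasso penalty
    and rho=1 gives the ridge penalty.
    """
    l1 = sum(abs(t) for t in theta)
    l2_squared = sum(t * t for t in theta)
    return lam * (1.0 - rho) * l1 + 0.5 * lam * rho * l2_squared
```

The penalty is added to the MSE (or Huber) objective; the L1 part drives some coefficients to exactly zero (selection), while the L2 part shrinks the rest.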
  
__2.3 Performance Evaluation: OOS $R^2$__

$$R_{OOS}^2 = 1 - \frac{\sum_{\left(i,t\right)\in \mathcal{T}_3} \left(r_{i,t+1}-\hat{r}_{i,t+1}\right)^2}{\sum_{\left(i,t\right)\in \mathcal{T}_3} r_{i,t+1}^2}$$

where $\mathcal{T}_3$ indicates the testing set.
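
A minimal sketch of this statistic, assuming the realized and predicted excess returns in the testing set have already been pooled into two flat lists of equal length (the function name `r2_oos` is illustrative). Note that the denominator is not demeaned, so the implicit benchmark is a forecast of zero rather than the historical mean return.

```python
def r2_oos(realized, predicted):
    """Pooled out-of-sample R^2 against a zero-forecast benchmark."""
    if len(realized) != len(predicted):
        raise ValueError("realized and predicted must have the same length")
    # Numerator: sum of squared forecast errors over all (i, t) in the test set.
    sse = sum((r - p) ** 2 for r, p in zip(realized, predicted))
    # Denominator: sum of squared returns, deliberately without demeaning.
    sst = sum(r * r for r in realized)
    if sst == 0:
        raise ValueError("all realized returns are zero; R^2 is undefined")
    return 1.0 - sse / sst
```

A model that always predicts zero scores exactly 0 under this definition, and a model can score below 0 if its forecasts are worse than predicting zero.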

__Model Comparison__: Diebold and Mariano Test for differences in OOS predictive accuracy between two models.

$$\text{Test Statistic: } DM_{12} = \frac{\overline{d}_{12}}{\hat{\sigma}_{\overline{d}_{12}}}$$
where
$$d_{12,t+1}=\frac{1}{n_{3, t+1}}\sum_{i=1}^{n_{3, t+1}} \Big(\big(\hat{e}_{i,t+1}^{\left(1\right)}\big)^2 - \big(\hat{e}_{i,t+1}^{\left(2\right)}\big)^2\Big)$$

$\hat{e}_{i,t+1}^{\left(1\right)}$ and $\hat{e}_{i,t+1}^{\left(2\right)}$ denote the prediction errors for the return of stock $i$ at time $t$ from model 1 and model 2; <br>
$n_{3, t+1}$ is the number of stocks in the testing sample (year $t+1$); <br>
$\overline{d}_{12}$ and $\hat{\sigma}_{\overline{d}_{12}}$ denote the mean and the Newey-West standard error of $d_{12,t+1}$ over the testing sample.
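
The test statistic can be sketched as follows. The lag count `lags` and the Bartlett weights $1 - l/(L+1)$ are assumptions of this sketch (a standard Newey-West choice); the notes above do not say which lag length the paper uses, and the function names are illustrative.

```python
import math


def cross_sectional_loss_diff(errors_1, errors_2):
    """d_{12,t+1}: average over stocks of squared-error differences in one period."""
    n = len(errors_1)
    return sum(e1 ** 2 - e2 ** 2 for e1, e2 in zip(errors_1, errors_2)) / n


def newey_west_se_of_mean(d, lags):
    """Newey-West standard error of the sample mean of the series d."""
    T = len(d)
    mean = sum(d) / T
    dev = [x - mean for x in d]

    def gamma(l):
        # Sample autocovariance at lag l (divided by T, not T - l).
        return sum(dev[t] * dev[t - l] for t in range(l, T)) / T

    long_run_var = gamma(0)
    for l in range(1, lags + 1):
        # Bartlett weights keep the variance estimate non-negative.
        long_run_var += 2.0 * (1.0 - l / (lags + 1)) * gamma(l)
    return math.sqrt(long_run_var / T)


def dm_statistic(d, lags=0):
    """DM_12 = mean(d) / NW standard error; undefined if d is constant."""
    return (sum(d) / len(d)) / newey_west_se_of_mean(d, lags)
```

With $d_{12,t+1}$ defined as model 1's squared error minus model 2's, a positive statistic indicates that model 2 has the smaller squared errors. Averaging across stocks first and then testing the time series of $d_{12,t+1}$ avoids treating the strongly cross-correlated stock-level errors as independent observations.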

## 3. An Empirical Study of US Equities

__3.1 Data and Over-arching Model__
1. Data
   - __CRSP__: March 1957 - December 2016 (60 years, nearly 30,000 stocks, more than 6,200 stocks per month on average)
   - __Treasury-bill rate__ (risk-free rate) to calculate individual excess returns
   - __Compustat__ and __I/B/E/S__: stock-level predictive characteristics drawn from the literature on the cross section of stock returns (_94_ characteristics)
       - 61 updated annually
       - 13 updated quarterly
       - 20 updated monthly<br>
   SAS Code to synthesize data (from Jeremiah Green's website): https://drive.google.com/file/d/0BwwEXkCgXEdRQWZreUpKOHBXOUU/view?usp=sharing
       
   - __Industry__ dummies from first two digits of SIC code (_74_ dummies)
   - __Macroeconomic predictors__ (_8_ features)<br>
   Monthly data available from Amit Goyal's website
   - __Interactions__: pairwise products of the stock-level characteristics with the macroeconomic variables<br>
   __Total__: $94 \times \left(8+1\right) + 74=920.$ The $+1$ is a constant, so the 94 characteristics also enter on their own.<br>

2. Universe<br>
   We include stocks with price below $5$ dollars, share codes beyond 10 and 11, and financial firms. <br>
   __Three Reasons__: <br>
     - Commonly used filters remove some stocks that are S&P 500 components, which is problematic for an asset pricing analysis;
     - Less prone to sample selection and data snooping biases;
     - Results are qualitatively identical and quantitatively unchanged if we filter out those firms

3. Preprocess
     - __Cross-sectional Ranking__: rank all stock characteristics period-by-period and map these ranks into the $\left[-1,1\right]$ interval;
     - __Missing Values__: replace each missing characteristic with its cross-sectional median in that month
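
The preprocessing steps and the covariate count from 3.1 can be sketched for a single month as below. Several choices here are assumptions of the sketch rather than details from the paper: missing values are represented by `None`, ranking is applied before imputation, ties are broken by position, and all function names are illustrative.

```python
from statistics import median


def rank_to_unit_interval(values):
    """Map one characteristic's observed values to [-1, 1] by cross-sectional rank."""
    observed = sorted((v, i) for i, v in enumerate(values) if v is not None)
    n = len(observed)
    out = [None] * len(values)  # missing entries stay None
    for rank, (_, i) in enumerate(observed):
        # Smallest value maps to -1, largest to +1; ties are broken by position.
        out[i] = 0.0 if n == 1 else 2.0 * rank / (n - 1) - 1.0
    return out


def fill_missing_with_median(values):
    """Replace None with the cross-sectional median of the observed values."""
    observed = [v for v in values if v is not None]
    med = median(observed)  # raises StatisticsError if nothing is observed
    return [med if v is None else v for v in values]


def build_covariates(characteristics, macro, industry_dummies):
    """Interact characteristics with (1, macro) and append industry dummies."""
    x = [1.0] + list(macro)  # the constant keeps the raw characteristics
    z = [xk * c for xk in x for c in characteristics]
    return z + list(industry_dummies)


# One characteristic across four stocks in a single month:
ranked = rank_to_unit_interval([3.0, None, 1.0, 2.0])
filled = fill_missing_with_median(ranked)
```

In this toy month `filled` is `[1.0, 0.0, -1.0, 0.0]`: the stock with the missing value lands at the middle of the ranked cross section. With 94 characteristics, 8 macro variables, and 74 industry dummies, `build_covariates` returns a vector of length $94 \times 9 + 74 = 920$.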
