<h1>
<center>Empirical Asset Pricing via Machine Learning</center>
</h1>



<center>Alexander Margetis, Lanya Ma, Sheng Yang, Yiming Tan</center>



## 1. Introduction



In this project, we conduct an empirical analysis of machine learning methods in asset pricing. We aim to measure and predict risk premia from bundles of underlying factors. In particular, we study the structure of cross-sectional returns using a range of factors, which can be stock-level firm characteristics, macroeconomic descriptors, and other derived indicators. The empirical literature proposes classical models that estimate and explain risk premia with a small number of factors, such as the CAPM, the Fama-French three-factor model, and the later Fama-French five-factor model. These models are essentially linear projections of stocks' expected returns onto a few variables. Machine learning methods are built for high-dimensional inputs, so they offer more flexibility in representing an asset's risk profile than traditional econometric prediction techniques. The function that maps high-dimensional predictors to risk premia can also be complicated, which is what makes machine learning attractive in this field.
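
For reference, the classical models named above can be written in their textbook form, next to the more general predictive form that machine learning methods try to estimate. The notation here ($\beta_i$, $z_{i,t}$, $g$) is introduced only for illustration and does not come from the project code; $r_{i,t+1}$ denotes the excess return of stock $i$ over the next month and $z_{i,t}$ its vector of predictors.

```latex
\begin{aligned}
\text{CAPM:}\quad
  & \mathbb{E}[r_i] - r_f = \beta_i \left(\mathbb{E}[r_m] - r_f\right) \\
\text{Fama-French three-factor:}\quad
  & \mathbb{E}[r_i] - r_f = \beta_i^{m}\left(\mathbb{E}[r_m] - r_f\right)
    + \beta_i^{s}\,\mathbb{E}[\mathrm{SMB}] + \beta_i^{h}\,\mathbb{E}[\mathrm{HML}] \\
\text{General predictive form:}\quad
  & r_{i,t+1} = g(z_{i,t}) + \varepsilon_{i,t+1}
\end{aligned}
```

In the first two lines the right-hand side is linear in a handful of factors, while in the last line $g$ is left unrestricted and has to be learned from the data.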

Our contributions in this project are threefold. First, we investigate machine learning techniques for predicting cross-sectional returns, which tells us whether machine learning algorithms can improve out-of-sample estimates of expected returns. Second, we examine the feature importance of the factors, which gives us insight into how to select informative ones. Third, we study the stability of machine learning in portfolio construction by analyzing how long-short portfolios built from each algorithm's predictions perform over the sample horizon.
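
As a rough illustration of how the first and third points could be measured, the sketch below computes an out-of-sample $R^2$ against a zero forecast (the definition used by Gu, Kelly, and Xiu, 2018) and the return of an equal-weighted long-short portfolio formed on predicted returns. The function names `r2_oos` and `long_short_return`, the 10% quantile, and the toy numbers are assumptions made for this example, not the project's actual evaluation code.

```python
import numpy as np

def r2_oos(realized, predicted):
    """Out-of-sample R^2 benchmarked against a zero forecast (no demeaning)."""
    realized = np.asarray(realized, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return 1.0 - np.sum((realized - predicted) ** 2) / np.sum(realized ** 2)

def long_short_return(realized, predicted, quantile=0.1):
    """Equal-weighted return of the top-quantile stocks minus the bottom-quantile stocks,
    where stocks are ranked by their predicted return."""
    realized = np.asarray(realized, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    n_leg = max(1, int(len(predicted) * quantile))   # stocks per leg
    order = np.argsort(predicted)                    # ascending by prediction
    short_leg = realized[order[:n_leg]]
    long_leg = realized[order[-n_leg:]]
    return long_leg.mean() - short_leg.mean()

# Toy cross-section of 10 hypothetical stocks for a single month
realized = np.array([0.02, -0.01, 0.03, 0.00, -0.02, 0.01, 0.04, -0.03, 0.02, 0.01])
predicted = np.array([0.01, 0.00, 0.02, 0.00, -0.01, 0.01, 0.03, -0.02, 0.01, 0.00])

print(r2_oos(realized, predicted))
print(long_short_return(realized, predicted))
```

With ten stocks and a 10% quantile each leg holds a single stock, so the spread is simply the realized return of the highest-ranked stock minus that of the lowest-ranked one.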

## 2. Methodology

### 2.1 Data Preprocessing and Exploratory Data Analysis

Data Cleaning (Lanya Ma)

In [2]:
import pandas as pd
import os
import glob

os.chdir('C:/Users/Jack Tan/Desktop/CFRM/CFRM521Project-master/data')
result = glob.glob('*.csv')
print(result)
data_name_list = list(map(lambda x: x[:-4] , result)) 

['3m_tbill_rate.csv', 'bps.csv', 'ca_to_assets.csv', 'current.csv', 'debt_over_asset.csv', 'ebitda_to_sales.csv', 'eps_ttm.csv', 'equity_to_totalcap.csv', 'GICS_code.csv', 'market_cap.csv', 'netprofit_margin.csv', 'ocf_ps.csv', 'ocf_to_debt.csv', 'ocf_to_sales.csv', 'op_to_debt.csv', 'pb.csv', 'pe_ttm.csv', 'price.csv', 'ps_ttm.csv', 'roa.csv', 'roe.csv', 'roic.csv', 'sales_growth1yr.csv', 'tax_to_ebt.csv', 'turnover.csv']


In [3]:
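def transform_row(filename):
    """Read a CSV and use its first column (the stock identifiers) as the row index."""
    raw_df = pd.read_csv(filename)
    row_names = raw_df.iloc[:,0]
    raw_df = raw_df.rename(index = row_names)
    new_df = raw_df.drop(raw_df.columns[0],axis = 1)
    return new_df

price = transform_row('price.csv')

import copy
import numpy as np

# Availability mask: 1 where a price exists, NaN where the price is recorded as 0
is_available = copy.deepcopy(price)
is_available[:] = np.nan
is_available[price != 0] = 1
# Prices with unavailable months set to NaN
price_df = price*is_available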

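def tranfrom_missing(filename):
    """Forward-fill zeros (treated as missing) along each row, then mask unavailable months."""
    raw =  transform_row(filename)
    # Work on a float copy so the in-place fill below does not touch `raw`
    raw_mat = raw.astype(float).to_numpy(copy=True)
    nrow, ncol = raw_mat.shape
    for i in range(0,nrow):
        fill_data = 0  # last non-zero value seen so far in this row
        for j in range(0,ncol):
            if raw_mat[i,j] != 0:
                fill_data = raw_mat[i,j]
            else:
                raw_mat[i,j] = fill_data
    filled_df = pd.DataFrame(raw_mat)
    filled_df.index = price.index.values
    filled_df.columns = price.columns.values
    # Drop factor values for months in which the stock has no price
    filled_df = np.multiply(filled_df,is_available)
    # Zeros that had no earlier value to carry forward remain missing
    filled_df[filled_df == 0] = np.nan
    return filled_df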

features_dict = {}
for idx, element in enumerate(result):
    feature_name = data_name_list[idx]
    if feature_name not in ['3m_tbill_rate','GICS_code','price']:
        feature_df = tranfrom_missing(element)
        features_dict[data_name_list[idx]] = feature_df

sector_code = transform_row('GICS_code.csv')
sector_code[sector_code ==0] = np.nan
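
To make the filling rule in `tranfrom_missing` easier to check, here is a small vectorized sketch of the same idea on toy data: zeros are treated as missing, the last non-zero value is carried forward along each row, and cells are blanked where no price is available. The helper `forward_fill_zeros`, the tickers, and the month labels are hypothetical, and the sketch assumes missing values are coded only as 0 (it is not a drop-in replacement if the raw files also contain NaN).

```python
import numpy as np
import pandas as pd

def forward_fill_zeros(raw, available):
    """Treat 0 as missing, carry the last non-zero value forward along each row,
    and set cells to NaN wherever `available` is NaN."""
    filled = raw.astype(float).replace(0.0, np.nan).ffill(axis=1)
    return filled.where(available.notna())

months = ['2009-01', '2009-02', '2009-03', '2009-04']
toy_factor = pd.DataFrame([[0.0, 1.5, 0.0, 2.0],
                           [3.0, 0.0, 0.0, 0.0]],
                          index=['AAA', 'BBB'], columns=months)
toy_price = pd.DataFrame([[10.0, 11.0, 12.0, 13.0],
                          [20.0, 21.0, 0.0, 0.0]],
                         index=['AAA', 'BBB'], columns=months)

# Same convention as `is_available`: 1 where a price exists, NaN where it is 0
toy_available = pd.DataFrame(np.where(toy_price != 0, 1.0, np.nan),
                             index=toy_price.index, columns=toy_price.columns)

toy_filled = forward_fill_zeros(toy_factor, toy_available)
print(toy_filled)
```

In the toy output, stock `AAA` has no value in the first month because there is nothing to carry forward yet, and stock `BBB` loses its filled values in the last two months because its price is unavailable there.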

In [4]:
#Stock return
# Return from month t to t+1, stored in column t so that it lines up with the
# factor values observed at t. The last column has no next price and stays NaN,
# and any NaN price propagates to the returns that depend on it.
p_mat = price_df.to_numpy(dtype=float)
r_mat = np.full(p_mat.shape, np.nan)
r_mat[:, :-1] = (p_mat[:, 1:] - p_mat[:, :-1]) / p_mat[:, :-1]
return_mon = pd.DataFrame(r_mat)
return_mon.index = price.index.values
return_mon.columns = price.columns.values

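#Risk free rate
# The three-month T-bill rate is divided by 3 to approximate a one-month rate
# (this assumes the file quotes a three-month rate in the same units as the returns)
rf= transform_row('3m_tbill_rate.csv')
rfree = rf.loc[return_mon.columns]/3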
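
The alignment of returns and the risk-free rate is easy to get wrong, so a tiny worked example may help. It uses one hypothetical stock and made-up rates, and it assumes the three-month rate is a decimal return over three months; the real units of `3m_tbill_rate.csv` are not stated in this notebook.

```python
import pandas as pd

months = ['2009-01', '2009-02', '2009-03']
toy_price = pd.DataFrame([[100.0, 110.0, 99.0]], index=['AAA'], columns=months)
# Hypothetical three-month rate, expressed as a decimal return over three months
toy_rf_3m = pd.Series([0.003, 0.003, 0.006], index=months)

# Return from month t to t+1, stored in column t (the last month has no next price)
toy_return = toy_price.shift(-1, axis=1) / toy_price - 1
# Monthly risk-free rate approximated as one third of the three-month rate
toy_excess = toy_return.sub(toy_rf_3m / 3, axis=1)
print(toy_excess)
```

The value in column `2009-01` is the return earned between January and February minus the January risk-free rate, which is the quantity a model fitted on January characteristics would be asked to predict.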

In [5]:
#Data Dictionary: one DataFrame per month (rows = stocks, columns = factors, sector, target)
data_dict = {}
time_keys = price.columns.values
# Sector codes come from the first column of GICS_code.csv and are reused for every month
gics_code = sector_code.iloc[:,0].values
for time_key in time_keys:
    # Cross-section of every factor for this month, one column per factor
    data_df = pd.DataFrame({key: features_dict[key][time_key].values for key in features_dict},
                           index = price_df.index.values)
    data_df['GICS_code'] = gics_code
    # Target: next-month return minus the risk-free rate for the same month
    data_df['Excess_return'] = return_mon[time_key].values - rfree.loc[time_key].values
    data_dict[time_key] = data_df

In [6]:
import pickle
with open('factordata.p', 'wb') as fp:
    pickle.dump(data_dict, fp, protocol=pickle.HIGHEST_PROTOCOL)

In [7]:
with open('factordata.p', 'rb') as fp:
    factordata = pickle.load(fp)
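
Most estimators expect a single table rather than a dictionary of monthly frames, so one possible next step is to stack the dictionary into a panel indexed by month and stock. The sketch below does this on toy data; `stack_panel`, the tickers, the month labels, and the single `pb` factor are placeholders chosen for the example, and dropping rows without a target is an assumption about how the data would be used.

```python
import numpy as np
import pandas as pd

def stack_panel(monthly_dict, target='Excess_return'):
    """Concatenate {month: DataFrame} into one frame indexed by (month, stock)
    and drop rows whose target is missing."""
    panel = pd.concat(monthly_dict, names=['month', 'stock'])
    return panel.dropna(subset=[target])

# Toy stand-in for the structure stored in factordata.p
toy_dict = {
    '2009-01': pd.DataFrame({'pb': [1.2, 0.8], 'Excess_return': [0.02, np.nan]},
                            index=['AAA', 'BBB']),
    '2009-02': pd.DataFrame({'pb': [1.3, 0.9], 'Excess_return': [-0.01, 0.03]},
                            index=['AAA', 'BBB']),
}

toy_panel = stack_panel(toy_dict)
print(toy_panel)
```

Keeping the month in the index makes it straightforward to split the panel chronologically later on, for example by selecting all rows up to a cutoff month for training.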

### 2.2 Generalized linear models


### 2.3 Support vector machines 


### 2.4 Ensemble Learning

### 2.5 Neural Networks

### 2.6 Dimension Reduction 

## 3. Experimental Results

### 3.1 Data Description and Exploratory Data Analysis




Our dataset includes all firms listed on the NYSE, AMEX, and NASDAQ. The sample runs from January 2009 to April 2019 at monthly frequency. We use the three-month Treasury bill rate as the risk-free rate when calculating excess returns. The firm characteristics, or factors, cover value, growth, solvency, cash flow, profitability, operating capacity, capital structure, and momentum. In addition, we include categorical industry classes corresponding to GICS sectors.
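
A simple first look at a dataset of this shape is to count, for each month, how many stocks have a usable target and how complete the factor columns are. The sketch below shows one way to do that for a dictionary of monthly DataFrames like the one built in Section 2.1; the function `coverage_by_month` and the toy values are illustrative assumptions, not results from the actual sample.

```python
import numpy as np
import pandas as pd

def coverage_by_month(monthly_dict, target='Excess_return'):
    """Per month: number of stocks with a non-missing target and the share of
    non-missing cells across all other columns."""
    rows = {}
    for month, df in monthly_dict.items():
        factors = df.drop(columns=[target])
        rows[month] = {
            'n_stocks': int(df[target].notna().sum()),
            'factor_coverage': float(factors.notna().to_numpy().mean()),
        }
    return pd.DataFrame.from_dict(rows, orient='index')

# Toy data with two hypothetical stocks and two factors
toy_dict = {
    '2009-01': pd.DataFrame({'pb': [1.2, np.nan], 'roe': [0.10, 0.20],
                             'Excess_return': [0.02, np.nan]}, index=['AAA', 'BBB']),
    '2009-02': pd.DataFrame({'pb': [1.3, 0.9], 'roe': [0.10, 0.15],
                             'Excess_return': [-0.01, 0.03]}, index=['AAA', 'BBB']),
}

summary = coverage_by_month(toy_dict)
print(summary)
```

A table like this makes it easy to spot months with few usable stocks or sparsely populated factors before any model is fitted.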

### 3.2 Out-of-sample Stock-level Prediction Performance



### 3.3 Variable Importance for factors

### 3.4 Machine Learning Portfolios

## 4. Conclusions

## 5. References

Gu, Shihao, Bryan Kelly, and Dacheng Xiu. *Empirical asset pricing via machine learning.* No. w25398. National Bureau of Economic Research, 2018.

