## Proposing an effective Backtesting strategy using Bayesian Modeling 

### By: Serena Liu, Lin Ye and Akshat Mittal 

## 1 Introduction

Backtesting is a key component of effective trading system development. It is accomplished by reconstructing, with historical data, trades that would have occurred in the past using rules defined by a given strategy. The result offers statistics to gauge the effectiveness of the strategy. 




In this project, our team proposes a novel way to build a backtesting strategy: we use Bayesian modeling to predict quarterly returns and industry groups, and construct our strategy from those predictions. Bayesian modeling is particularly helpful in finance, where data sets tend to be small and noisy. As an extension of the industry-wide practice of using frequentist models, our approach adds a probabilistic interpretation to return predictions. We also extend this probabilistic interpretation to the clustering of industries. We constructed industry-neutral backtests with both GICS industry codes and our own industry groupings derived from a Gaussian mixture model (GMM).


The Global Industry Classification Standard (GICS) is a framework that delineates companies into the following 11 sectors:

1. Communication services
2. Consumer discretionary
3. Consumer staples
4. Energy
5. Financials
6. Health care
7. Industrials
8. Information technology
9. Materials
10. Real estate
11. Utilities
    
(The framework further classifies companies into 24 industry groups, 68 industries, and 157 sub-industries, but we limit our focus to the sector-level classification.)

Two things stand out for this classification:
- This classification of companies is based not on financial factors, but on the product or service they provide.
- This is a hard classification, i.e., at any given time a company usually has a single GICS code associated with it.

In this work, we try to come up with a soft classification of companies into sectors based on the values of their fundamental financial indicators, using a Bayesian Gaussian mixture model in which each component of the mixture would play the role of a sector.

Using the posterior component probabilities of each data point, we replace the GICS indicator in our two predictive models, Bayesian linear regression and a Bayesian neural network, and compare the backtests for both types of models: the predictive models with GICS indicators versus the predictive models with soft classifications.
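The difference between a hard, GICS-style indicator and a soft classification can be sketched with a few lines of NumPy. This is only an illustration under stated assumptions: the mixture parameters are fixed by hand rather than inferred, the covariances are taken to be diagonal, and the function name `soft_assignments` and the toy data are hypothetical. In a Bayesian GMM the weights, means, and variances would themselves come from the posterior.

```python
import numpy as np

def soft_assignments(X, weights, means, variances):
    """Posterior component probabilities for a diagonal-covariance Gaussian mixture.

    X: (n, d) features; weights: (k,); means, variances: (k, d).
    Returns an (n, k) array whose rows sum to 1.
    """
    X = np.asarray(X, dtype=float)
    # log N(x | mean_k, diag(var_k)) for every point/component pair -> (n, k)
    diff = X[:, None, :] - means[None, :, :]
    log_pdf = -0.5 * np.sum(diff**2 / variances + np.log(2 * np.pi * variances), axis=2)
    log_joint = np.log(weights) + log_pdf
    # Normalise in log space for numerical stability
    log_joint -= log_joint.max(axis=1, keepdims=True)
    resp = np.exp(log_joint)
    return resp / resp.sum(axis=1, keepdims=True)

# Hypothetical toy data: 3 companies, 2 standardized fundamentals, 2 components
X = np.array([[-2.0, -1.9], [2.1, 2.0], [0.0, 0.1]])
weights = np.array([0.5, 0.5])
means = np.array([[-2.0, -2.0], [2.0, 2.0]])
variances = np.ones((2, 2))

soft = soft_assignments(X, weights, means, variances)  # soft classification
hard = np.eye(2)[soft.argmax(axis=1)]                  # one-hot indicator for comparison
```

The first two companies sit close to one component each and receive almost all of their probability mass there, while the third lies between the components and keeps a split membership that a one-hot indicator would discard.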

## 2 Data

We use company fundamentals data for x companies over y years at a z frequency, along with the close price (which is eventually converted to return values). This data was downloaded from [SimFin](https://simfin.com/), which makes fundamental financial data available for free.

To avoid problems related to multicollinearity and to ensure that we use only meaningful features, we manually removed one feature from each pair with a correlation of 0.7 or more. This left us with 19 predictors. Given the frequency of the data and what we plan to predict, we performed the following preprocessing steps:

1. 
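The correlation filter described above was applied manually. As a rough sketch of how the same rule could be automated, the snippet below greedily keeps the first column of each highly correlated pair and drops the second. The greedy ordering, the use of the absolute correlation, the helper name `drop_correlated`, and the toy column names are all assumptions made for illustration, not the procedure actually used here.

```python
import numpy as np
import pandas as pd

def drop_correlated(df, threshold=0.7):
    """Greedily drop the later column of every pair with |corr| >= threshold."""
    corr = df.corr().abs()
    cols = list(corr.columns)
    dropped = set()
    for i, a in enumerate(cols):
        if a in dropped:
            continue  # a was already removed, so it cannot veto other columns
        for b in cols[i + 1:]:
            if b not in dropped and corr.loc[a, b] >= threshold:
                dropped.add(b)
    return df.drop(columns=sorted(dropped))

# Hypothetical toy features: the second is nearly a copy of the first
rng = np.random.default_rng(0)
revenue = rng.normal(size=200)
toy = pd.DataFrame({
    "revenue": revenue,
    "gross_profit": revenue + 0.1 * rng.normal(size=200),
    "debt_ratio": rng.normal(size=200),
})
pruned = drop_correlated(toy, threshold=0.7)
```

Which member of a pair to keep is a judgment call. A manual pass can prefer the more interpretable feature, whereas this sketch simply keeps whichever column comes first.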

Our final list of fundamentals, which will eventually serve as predictors of the return, is:



Take a look at the correlation among our features:

In [1]:
import warnings
warnings.filterwarnings('ignore')
from preprocess import preprocess
data = preprocess(return_full=True)

In [3]:
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
plt.figure(figsize=(15,15), dpi=100)
plt.matshow(data.corr(), fignum=1)
plt.xticks(range(len(data.columns)), data.columns, rotation=60)
plt.yticks(range(len(data.columns)), data.columns)
plt.colorbar()
plt.show()


## 3 Sector Identification

### 3.1 Background

### 3.2 Model

### 3.3 Inference

## 4  Return Prediction

### 4.1 Bayesian Linear Regression

#### 4.1.1 Model

We constructed a Bayesian linear regression model as a baseline for evaluating performance. The model takes the 19 features described above as inputs. The predicted variable $y$ is assumed to be drawn from a normal distribution:

<h2><center>$ y \sim N(\beta^{T}X,\sigma^2 I)$</center></h2>

The model parameters $\beta_i$ are assumed to be normally distributed. Each $\beta_i$ has the prior $\beta_i \sim N(0,1)$, with all $\beta_i$ assumed independent of each other.

$\sigma$ is drawn from a lognormal distribution: $\sigma \sim \text{LogNormal}(0,5)$.
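To get a feel for what these priors imply before any data is seen, one can sample from the prior predictive distribution. The sketch below uses only NumPy and is not the code in `bayes_lin_regr`. It assumes that the second argument of LogNormal(0, 5) is the standard deviation of $\log \sigma$ (the convention used by NumPy and PyTorch), and the function name and the random design matrix are hypothetical.

```python
import numpy as np

def sample_prior_predictive(X, n_draws=1000, seed=0):
    """Draw (beta, sigma, y) from the priors and likelihood stated above.

    X has shape (n_obs, n_features), one row per observation.
    """
    rng = np.random.default_rng(seed)
    n_obs, n_features = X.shape
    # beta_i ~ N(0, 1), independent across features
    beta = rng.normal(0.0, 1.0, size=(n_draws, n_features))
    # sigma ~ LogNormal(0, 5): log(sigma) has mean 0 and std 5 (assumed convention)
    sigma = rng.lognormal(mean=0.0, sigma=5.0, size=n_draws)
    # y ~ N(X beta, sigma^2 I), one simulated data set per prior draw
    mean = beta @ X.T
    y = mean + sigma[:, None] * rng.standard_normal((n_draws, n_obs))
    return beta, sigma, y

# Hypothetical design matrix: 50 observations of the 19 predictors
X_demo = np.random.default_rng(1).normal(size=(50, 19))
beta, sigma, y = sample_prior_predictive(X_demo)
```

Under this reading of the prior, the draws of $\sigma$ span many orders of magnitude, so the prior on the noise scale is very diffuse.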

#### 4.1.2 Inference

We perform inference via stochastic variational inference (SVI).
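For reference, SVI approximates the posterior over $(\beta, \sigma)$ with a parametric distribution $q_\phi(\beta, \sigma)$ and tunes $\phi$ by stochastic gradient ascent on the evidence lower bound (ELBO). The variational family is not specified here, so $q_\phi$ is left generic in the expression below:

$$\mathrm{ELBO}(\phi) = \mathbb{E}_{q_\phi(\beta,\sigma)}\left[\log p(y \mid X, \beta, \sigma) + \log p(\beta) + \log p(\sigma) - \log q_\phi(\beta, \sigma)\right] \le \log p(y \mid X)$$

Maximizing the ELBO is equivalent to minimizing the KL divergence from $q_\phi$ to the true posterior.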

In [None]:
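import warnings
warnings.filterwarnings('ignore')
# Import only the entry point that is used, instead of a wildcard import
from bayes_lin_regr import main
# Fit the Bayesian linear regression and collect its predictions
preds = main(itr=10000)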


### 4.2 Bayesian Neural Network

#### 4.2.1 Model

#### 4.2.2 Inference

## 5 Backtesting Strategy

Backtesting allows us to simulate our trading strategy using historical returns data. By buying stocks with high predicted returns and selling stocks with low predicted returns at every quarter end, we generated a wealth curve showing how one dollar would perform if invested in each of our strategies.
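The mechanics of such a backtest can be sketched as follows. This is a minimal illustration rather than our actual backtest code: it assumes equal weights, a fixed fraction of names on each side, neutrality enforced by forming the long-short spread separately within each industry group, and no transaction costs. The function name, the column names, and the toy panel are hypothetical.

```python
import numpy as np
import pandas as pd

def long_short_wealth(df, top_frac=0.2):
    """Wealth curve of a quarterly rebalanced, group-neutral long-short portfolio.

    df columns (hypothetical names):
      'quarter' rebalancing date, 'group' industry label,
      'pred' predicted return, 'ret' return realized over the following quarter.
    """
    def group_spread(g):
        k = max(1, int(round(len(g) * top_frac)))
        ranked = g.sort_values("pred")
        # Long the k highest predictions, short the k lowest, equal-weighted
        return ranked["ret"].iloc[-k:].mean() - ranked["ret"].iloc[:k].mean()

    rows = []
    for quarter, q in df.groupby("quarter", sort=True):
        # Average the spreads so that every group carries the same weight
        spreads = [group_spread(g) for _, g in q.groupby("group")]
        rows.append((quarter, float(np.mean(spreads))))
    returns = pd.Series(dict(rows)).sort_index()
    # Compound the quarterly returns starting from one dollar
    return (1.0 + returns).cumprod()

# Hypothetical toy panel: 2 quarters, 2 groups, 4 stocks per group
panel = pd.DataFrame({
    "quarter": ["2019Q1"] * 8 + ["2019Q2"] * 8,
    "group": (["A"] * 4 + ["B"] * 4) * 2,
    "pred": [0.1, 0.2, 0.3, 0.4] * 4,
    "ret": [-0.02, 0.00, 0.01, 0.04, -0.01, 0.00, 0.00, 0.03,
            0.01, 0.00, 0.00, -0.01, 0.00, 0.00, 0.00, 0.00],
})
wealth = long_short_wealth(panel, top_frac=0.25)
```

Swapping the `group` column between GICS sectors and GMM-based groupings changes only how neutrality is enforced, which makes the variants directly comparable.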


We formulated four backtesting strategies. 
<br>
Baseline case: Bayesian linear regression with GICS industry grouping<br>
Version 1: Bayesian linear regression with industry grouping from the GMM<br>
Version 2: Bayesian neural network with GICS industry grouping

Here are our backtest results.