In [14]:
import pandas as pd
import numpy as np
import time
import UserDefinedFunctions

## 4. Variable Selection

In this notebook we'll be looking at different ways of pruning our model to make it as predictive as possible but also as simple as possible. 

Arguably, the biggest problem in developing a trading strategy is to find one that works in real life and not just in a backtest. We talked previously about how all sorts of factors conspire to make our backtested strategy seem more impressive than it turns out to be on real data. Data leakage in terms of using future data is one danger. Can you think of another? 

How about choosing which variables to test in the first place? For instance, we know that in 2008 the housing market crashed and that the health of the housing market was a predictor of the stock market. Therefore we include several variables about the housing market (delinquencies and housing starts) amongst our set of variables to test. But including those housing market variables in the first place relies on us knowing that the housing market would crash and take the stock market along with it. 

Now that I've hopefully implanted some doubt in your mind, let's move on. In the next section, Hull et al. address the problem of variable selection. They've defined a set of 15 variables to potentially use in the model. Now they'll decide which variables to actually include.

<div class="alert alert-info" role="alert">
<span class="label label-primary"> The Paper </span>
<br><br>

<a id='snap_back_1'></a>
We consider 15 forecasting variables, which means each time we fit the model we need to estimate 15 forecasting coefficients plus the intercept term for a total of 16 parameters.  We limit the number of estimated parameters through variable selection, which leads to more parsimonious models and generally results in better out-of-sample forecasting properties.  <a href='#supplemental_content_1'>Parsimonious</a> models are also easier to interpret and attribute performance.  It is easier to understand which variables contribute to forecasting results when there are fewer variables to consider.  
<br><br>

<a id='snap_back_2'></a>
<a id='snap_back_3'></a>
We estimate the WLS specification which incorporates a <a href='#supplemental_content_2'>bidirectional stepwise procedure</a>.  Variables are chosen based on the <a href='#supplemental_content_3'>Akaike Information Criterion (AIC)</a>.  The bidirectional stepwise selection combines forward selection, which starts with no variables in the model and adds variables that capture the largest improvement, and backward elimination, which starts with all candidate variables and removing the least significant variables.  One feature of the stepwise WLS estimation is that the number of selected variables will change, as predictors come in and out of the selected set.  In comparison, in WLS without variable selection, all of the variables always get non-zero weights, even if they only contribute marginally in a given sample.  Hull and Qiao (2017) use correlation screening as their variable selection technique because using overlapping six-month market returns leads to inflated t-statistics and results in misleading traditional likelihood function calculations and AIC values.   
<br><br>

We estimate the stepwise WLS at the end of each month.  Starting on 03/31/2003, we use 154 months from 06/01/1990 to 03/31/2003 to estimate our model.  We obtain the model parameters on 03/31/2003, which we hold constant from 04/01/2003 to 04/30/2003.  For every day in this month, we use the updated return predictors, along with the fixed model parameters, to produce one-month equity risk premium forecasts on a daily frequency.  On 04/30/2003, we re-estimate our model using an expanding window, from 06/01/1990 to 04/30/2003, to obtain new parameter values which we use for next month’s equity premium forecasts.  We continue to re-estimate our model monthly and make one-month equity premium forecasts every day until the end of the sample.  We publish a summary of our model output in our Daily Report.  A sample of the Daily Report appears in Appendix I. 
<br><br>


2.3 Variable Selection 

In Figure 1 we look at the identity of the selected variables.  On the vertical axis is the contribution of each explanatory variable towards the <a href='#supplemental_content_4'>total explained variance</a> (Lindeman, Merenda, and Gold, 1980; Chevan and Sutherland, 1991).  The stepwise WLS puts zero weight on marginal variables that do not add substantially to the model.  Of the 15 variables we consider, typically only about five to seven are selected at any given time.  There are significant changes in the number and identity of the selected variables.  In contrast, without variable selection, WLS reduces the weight put on marginal variables that do not contribute to the explanatory power of the model, but does not remove them from the model.   
<br><br>

Consider variable X which does not add any forecasting power to the model.  If we use WLS with variable selection, the marginal contribution of X would be too small and it will be eliminated from our model.  If we use WLS, we always put a positive weight on X.  For out-of-sample prediction, including X only adds noise to our forecast.  The forecasts coming from WLS with variable selection will likely be more stable compared to WLS without variable selection. 
<br><br>


<div class="alert alert-info" role="alert">
![](variable_selection_1.png "Title")

<div class="alert alert-info" role="alert">


Some variables were selected to be in the model throughout the sample, whereas others only contributed to the explanatory power of the model in a fraction of the sample.  CP was selected in the earlier part of the sample until 2009, and then it was eliminated from the model.  BD was important until 2013, and then it was driven out by other variables.  On the one hand, CRP and DL were useful in the first half of the sample but not in the second half.  On the other hand, NAPM and LOAN were only selected in the second half but not in the first half.  In addition, some variables such as EVUSD were almost never selected, but we did not remove them from the pool of candidate variables. 
<br><br>

Predictor variables entering and exiting our model may be due to variables containing overlapping information.  When one variable is dropped from the model, another variable (or several variables) that may share part of the same information set could come into the model.  For example, in 2004 BD (Baltic Dry Index) was temporarily removed from the model and UR (Change in Unemployment Rate) was added.  It is likely that both variables contained useful information about the macroeconomic environment, so when BD was dropped from the model, another variable that contained similar information was added. 
<br><br>

<span class="label label-warning"> Assessment: Remember </span>

Which of the following are true?

 * The AIC metric should only be used with OLS, not WLS
 * More complex models are always better because they fit the data better
 * Linear models are by definition parsimonious
 * The variable with the highest R^2 in a univariate regression (when you regress the target variable on a single explanatory variable only) will indicate the variable that is the most important when you combine all variables in a multivariate regression (when you regress the target variable on all explanatory variables)
 * R^2 is the best metric to use in stepwise variable selection
 * AIC is the best metric to use in stepwise variable selection

<span class="label label-warning"> Assessment: Understand </span>
    
In variable selection we are picking which variables to use out of a set of all possible variables. This is akin to selecting which model to use out of a set of possible models. 

Hull et al. use 15 explanatory variables in their variable selection process. How many possible models (explanatory variable combinations) are they choosing from? What is the complexity of the "model space" as the number of explanatory variables grows? 

[your answer here]









Answer: Each variable can either be included in the model or excluded from the model. So that's 2 options for var1, 2 options for var2, etc. 

This will amount to 2x2x2x2x.... or 2^N where there are N possible variables. 

Given that Hull et al. use 15 variables, that's 2^15 = 32,768 different models. 

The model complexity is O(2^N). 
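
If you'd like to convince yourself of the count, here is a small sketch that enumerates every subset of 15 variables by brute force. The variable indices are just placeholders, and the count includes the empty model (intercept only).

```python
from itertools import combinations

n_vars = 15  # number of candidate explanatory variables in Hull et al.

# Count subsets size by size: choose 0 variables, 1 variable, ..., all 15.
n_models = sum(
    1
    for k in range(n_vars + 1)
    for _ in combinations(range(n_vars), k)
)

print(n_models)     # 32768
print(2 ** n_vars)  # 32768
```

Adding one more candidate variable doubles the count, which is why exhaustively testing every model stops being practical very quickly.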

<span class="label label-warning"> Assessment: Understand </span>
    
In the forward variable selection procedure described, we test adding each variable and then actually add the variable with the best performance gain. We then continue doing this until we hit some stopping criterion. 

Hull et al. use 15 explanatory variables in their variable selection process. What is the worst-case scenario for the number of tests we must implement? 

[your answer here]







Answer: In the first iteration you have to test 15 variables. In the next stage you have to test 14 variables to potentially add in addition to the 1 previously chosen. In the next stage you have to test 13 variables to potentially add to the 2 previously chosen, etc. 

So you end up running 15 + 14 + 13 +... + 1 = 120 tests. 
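
The arithmetic above can be checked in a couple of lines. This is only a sketch of the counting argument, assuming the worst case where every round adds a variable and nothing stops the procedure early.

```python
n_vars = 15

# Round 1 tests 15 candidates, round 2 tests 14, ..., the last round tests 1.
tests_per_round = [n_vars - already_selected for already_selected in range(n_vars)]
total_tests = sum(tests_per_round)

print(tests_per_round[:3])         # [15, 14, 13]
print(total_tests)                 # 120
print(n_vars * (n_vars + 1) // 2)  # 120, the closed form N(N+1)/2
```

So forward selection needs at most N(N+1)/2 model fits, compared with 2^N for an exhaustive search over all models.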

<span class="label label-warning"> Assessment: Apply </span>

As you might imagine, now we're going to implement Stepwise Variable Selection. The process is available built-in when using other coding languages like R, Stata, and SAS which are all more statistics-centric in comparison to Python. 

So what about Python? Unfortunately, while scikit-learn does have some functionality around feature selection (variable selection), it does not appear to do stepwise variable selection, based on the Stack Overflow chatter. Furthermore, as of summer 2018, it doesn't seem like statsmodels does stepwise variable selection either. 

http://scikit-learn.org/stable/modules/feature_selection.html


Furthermore, even if they did do stepwise variable selection for regular regression (OLS), we'd want to make sure it's compatible with WLS. Luckily, stepwise variable selection isn't that complicated and the internet provides some resources:
    
https://planspace.org/20150423-forward_selection_with_statsmodels/

![](variable_selection_3.png "Title")

Now it's your turn. Adapt the code above (or just write your own code from scratch) to perform stepwise variable selection. For simplicity, you can use either forward, backward or bi-directional selection. 

In [15]:
# TODO

In [16]:
# get the input data
training_data = pd.read_hdf('training_data.hdf')

all_possible_vars = ['industrial_production', 'change_inflation', 'credit_risk_premium', 'slope_interest_rate',
         'housing_starts', 'delinquencies', 'change_unemployment']
selected_vars = []
current_aic = np.inf  # assumed starting value: lower AIC is better, so any fitted model will beat infinity



training_data.columns


Index(['portfolio_date', 'industrial_production', 'cpi_less_food_energy_date',
       'cpi_less_food_energy_realtime_start', 'cpi_less_food_energy',
       'monthly_cpi_change', 'ust_3m_date', 'ust_3m_realtime_start',
       'ust_3m_rate', 'change_inflation', 'credit_risk_premium',
       'slope_interest_rate', 'housing_starts', 'delinquencies',
       'change_unemployment', 'forward_spy_return'],
      dtype='object')

<span class="label label-warning"> Assessment: Analyze </span>

<span class="label label-warning"> Assessment: Evaluate </span>

<span class="label label-warning"> Assessment: Create </span>

### End Lesson

That's it for the variable selection section. You can proceed to the next workbook.

[intentionally left blank]
<br><br>
<br><br>
<br><br>
<br><br>
<br><br>
<br><br>
<br><br>
<br><br>
<br><br>
<br><br>
<br><br>
<br><br>
<br><br>
<br><br>
<br><br>


<span class="label label-success"> Commentary and Supplemental Content </span>
<a id='supplemental_content_1'></a>

### Definition: Parsimony

Merriam-Webster defines "parsimony" as

1. the quality of being careful with money or resources; thrift: "the necessity of wartime parsimony"
2. the quality or state of being stingy: "The charity was surprised by the parsimony of some larger corporations."
3. economy in the use of means to an end; especially: economy of explanation in conformity with Occam's razor: "the scientific law of parsimony dictates that any example of animal behavior should be interpreted at its simplest, most immediate level"

The third definition, an "economy of explanation," seems to fit our context. So this goes back to what we originally said about wanting a model that is as simple as possible. Simplicity (economy) can take many forms including the type of model (simple linear regression versus a recurrent neural network), estimation frequency (coefficient re-estimation frequently versus only once), and number of variables (a few versus a billion).

OK, so by Occam's razor we should prefer simplicity, which can be interpreted as preferring fewer explanatory variables to more. Occam aside, why should we prefer that? Ingo Ruczinski, some guy on the internet, says there are lots of reasons to prefer simplicity:

 * We want to explain the data in the simplest way — redundant predictors should be removed. The principle of Occam’s Razor states that among several plausible explanations for a phenomenon, the simplest is best. Applied to regression analysis, this implies that the smallest model that fits the data is best.
 * Unnecessary predictors will add noise to the estimation of other quantities that we are interested in. Degrees of freedom will be wasted.
 * Collinearity is caused by having too many variables trying to do the same job.
 * Cost: if the model is to be used for prediction, we can save time and/or money by not measuring redundant predictors.

http://www.biostat.jhsph.edu/~iruczins/teaching/jf/ch10.pdf

Note: Collinearity is a statistical term for having many variables that are all versions of closely related things (ex: industrial production, new car production, ship building, tractor manufacturing, etc). From a statistical perspective, collinearity can make doing statistics difficult (such as by muddying the measured statistical significance of your explanatory variables).

To the list above I'd also add that having too many variables increases the risk of overfitting. Given enough explanatory variables we could almost perfectly fit any dataset. For instance, just by random chance we might find that a lot of variables have been correlated with historical stock market returns. So if we include enough random variables, we'll likely include one of those variables that "accidentally" predicted the market in the past even though there is no rationale to believe it'll predict the market in the future. To be safe (and parsimonious) we should include as few variables in our model as possible to avoid the chance of introducing spurious correlations (but if you like spurious correlations you can find a few here: http://www.tylervigen.com/spurious-correlations).
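
To see how easily this happens, here is a small simulation. Everything in it is made up: the "target" is random noise standing in for market returns, and the 1,000 candidate "predictors" are random noise too, so any correlation we find is spurious by construction. The sample size of 50 and the seed are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(42)
n_obs, n_candidates = 50, 1000

target = rng.normal(size=n_obs)                      # stand-in for returns (pure noise)
noise_vars = rng.normal(size=(n_obs, n_candidates))  # candidate predictors (pure noise)

# Pearson correlation of each noise column with the target:
# standardize both sides, then average the products.
t = (target - target.mean()) / target.std()
z = (noise_vars - noise_vars.mean(axis=0)) / noise_vars.std(axis=0)
corrs = (z * t[:, None]).mean(axis=0)

max_abs_corr = np.abs(corrs).max()
print(f"largest |correlation| among {n_candidates} noise variables: {max_abs_corr:.2f}")
```

None of these variables has any real relationship with the target, yet the best of them will look like a respectable predictor in-sample. The more candidates we screen, the better the "best" one looks.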

<a href='#snap_back_1'>go back to main body</a>


<span class="label label-success"> Commentary and Supplemental Content </span>
<a id='supplemental_content_2'></a>

### Stepwise Variable Selection

Back in Notebook 1 we spoke a little about Stepwise Variable Selection. Let's expand on that here.

The essence of Stepwise Variable Selection is that we only want to add variables to a model if their contribution is beneficial.  Let's say you have 10 variables you can possibly put into a linear regression model. You start by only including var1 and record how good a model it is. Then you try var2, then var3, etc. After trying them all, you pick the variable that had the best performance and you add that to your model. 

Then you iterate through the remaining nine variables and find which one, in combination with the variable you added in the previous round, produces the best model. Keep going until you hit some stopping criterion that you define (e.g. the performance only increases by an arbitrarily small amount).

You can do forward variable selection whereby you start with zero variables and add variables one at a time. Or backwards variable selection whereby you start with all variables and subtract variables one at a time. 
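
Here is a minimal sketch of the forward procedure, scored by AIC. It is not the Hull et al. implementation: it uses plain NumPy least squares on synthetic data, and the helper names (`gaussian_aic`, `forward_select`) are made up for this example. The AIC is computed from the (weighted) residual sum of squares under a Gaussian error assumption and drops an additive constant that is the same for every model, so it is only meaningful for comparing models fitted to the same data with the same weights.

```python
import numpy as np

def gaussian_aic(y, X, weights=None):
    """AIC, up to a model-independent constant, of a (weighted) least-squares fit."""
    n, k = X.shape
    w = np.ones(n) if weights is None else np.asarray(weights, dtype=float)
    sw = np.sqrt(w)
    # Weighted least squares is ordinary least squares on rows scaled by sqrt(w).
    beta, *_ = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)
    resid = y - X @ beta
    wrss = float(np.sum(w * resid ** 2))
    return n * np.log(wrss / n) + 2 * k

def forward_select(y, X, names, weights=None):
    """Add one variable at a time while doing so lowers the AIC."""
    n = len(y)
    intercept = np.ones((n, 1))
    selected = []
    remaining = list(range(X.shape[1]))
    best_aic = gaussian_aic(y, intercept, weights)  # intercept-only baseline
    while remaining:
        # Score every remaining candidate alongside the variables already chosen.
        scores = []
        for j in remaining:
            design = np.hstack([intercept, X[:, selected + [j]]])
            scores.append((gaussian_aic(y, design, weights), j))
        cand_aic, cand = min(scores)
        if cand_aic >= best_aic:  # stopping rule: no candidate improves the AIC
            break
        selected.append(cand)
        remaining.remove(cand)
        best_aic = cand_aic
    return [names[j] for j in selected], best_aic

# Synthetic data: only x0 and x2 actually drive y.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 2.0 * X[:, 0] - 3.0 * X[:, 2] + 0.5 * rng.normal(size=200)
names = ["x0", "x1", "x2", "x3", "x4"]

chosen, final_aic = forward_select(y, X, names)
print(chosen)
```

The two real drivers should be picked up; AIC is fairly permissive, so a noise variable can occasionally sneak in as well. Passing a `weights` array turns each fit into WLS, and the same loop would work with the AIC reported by statsmodels in place of `gaussian_aic`.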

In bi-directional stepwise variable selection we combine the forward and backward procedures. Hull et al. aren't clear on how they implement bi-directional variable selection and the internet provides many variations on the concept. Here's one where you start with all variables initially included in the model:

"The process is one of alternation between choosing the least significant variable to drop and then re-considering all dropped variables (except the most recently dropped) for re-introduction into the model. This means that two separate significance levels must be chosen for deletion from the model and for adding to the model. The second significance must be more stringent than the first."

https://www.stat.ubc.ca/~rollin/teach/643w04/lec/node43.html

So in this process, at each iteration we try to drop a variable and then try to add back in a variable (where the performance gain for adding or dropping is satisfied). Why would we ever add back a variable that was previously dropped? Well maybe it's because we don't want to include var1 (say industrial production) when we also are including var6 (unemployment rate) in our model. But as we progress through the stepwise variable selection process, we might find that we end up dropping the unemployment rate from our model which then "makes room" for including industrial production. 

We've been a bit vague in the above discussion about what we mean by model performance. How you choose to measure that is really up to you. Maybe it's statistical significance, or gain in R^2, or drop in RMSE. Hull et al. choose to use the Akaike Information Criterion (AIC). 

For more information on stepwise variable selection, see the collective intelligence of the internet at:

https://en.wikipedia.org/wiki/Stepwise_regression


<a href='#snap_back_2'>go back to main body</a>


<span class="label label-success"> Commentary and Supplemental Content </span>
<a id='supplemental_content_3'></a>

### Comparing Models with the Akaike Information Criterion (AIC)

The AIC is a measure used to compare the performance of different models (like R^2 or RMSE) that specifically incorporates information on the number of explanatory variables you use. The goal is to minimize the calculated AIC value, where a better model fit lowers the metric but more explanatory variables increase the metric. For us parsimony-preferring people, the AIC is a good metric to use and helps us pick between competing models (should we use a model that fits very well but includes 50 variables, or a model that fits OK and only includes 5 variables?).

The AIC metric itself is defined as:


![](variable_selection_2.png "Title")

where L is the maximized value of the model's likelihood function and K is the number of estimated parameters, which grows with the number of explanatory variables in your model. 

L measures how well the model fits the data, so you want L to be as big as possible. At the same time, because you want to minimize AIC, you want K to be as small as possible. 
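
To make the trade-off concrete, here is a toy calculation using the standard textbook form AIC = 2K - 2 ln(L), which we're assuming is what the image above shows. The log-likelihood values are invented purely for illustration and don't come from any real model.

```python
def aic(log_likelihood, k):
    """Standard AIC: 2K - 2 ln(L), where log_likelihood is ln(L)."""
    return 2 * k - 2 * log_likelihood

# Hypothetical numbers: the big model fits better (higher ln L) but uses 50 parameters.
aic_big = aic(log_likelihood=-100.0, k=50)
aic_small = aic(log_likelihood=-120.0, k=5)

print(aic_big)    # 300.0
print(aic_small)  # 250.0
```

With these made-up numbers the smaller model wins: its worse fit costs 40 AIC points, but dropping 45 parameters saves 90.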

Most statistical packages will calculate AIC as part of any linear regression, including the statsmodels library that we've been using for past coding examples. 

<a href='#snap_back_3'>go back to main body</a>

<span class="label label-success"> Commentary and Supplemental Content </span>
<a id='supplemental_content_4'></a>

### Which Variables are the Most Important

Let's imagine that you have a model that does a good job of predicting the S&P 500 based on 15 explanatory variables (say the ones selected by Hull et al.). You go to an investor and offer to manage their money using your strategy. You say "Look at all the money I can make you based on the historical performance." The investor is undoubtedly impressed but will inevitably want to know more about what's underneath the hood of your strategy before giving you money and therefore will ask "which variables are most important when you make your prediction?"

This is a bit of a hard question and there are several "reasonable" ways of approaching a solution. 

Hull et al. reference the method by Lindeman, Merenda, and Gold. Here's a description of an implementation of their method written for R: 

"R^2 represents the proportion of variance explained by a set of predictors. If one can estimate the proportion of the R^2 contributed by each individual predictor, the one with larger R^2 would be more important to explain the outcome variable. However, the difficulty lies in how to get the R^2 for each predictor.

The most intuitive way to decompose the total R^2 is to add the predictors to the regression model sequentially. Then, the increased R^2 can be considered as the contribution by the predictor just added. However, this method depends on the sequence the predictors are added if the predictors are correlated. 

The lmg approach is based on sequential R^2 but takes care of the dependence on orderings by averaging over orderings. For example, for a model with 4 predictors, there are a total of 24 orderings. For each ordering, the contributed R^2 can be calculated. lmg is the average of the R^2 across the 24 orderings."

https://advstats.psychstat.org/book/mregression/importance.php

So in other words: to measure a variable's importance we're going to measure the additional R^2 gained whenever we add that variable to the model. And we're going to add the variable to all possible models (all possible combinations of the variables) and then take the average performance gain.  
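
Here is a brute-force sketch of that averaging idea on synthetic data. It is not the R implementation referenced above: the function names are made up, the fits are plain OLS via NumPy, and looping over every ordering only works for a handful of predictors (4 predictors means 24 orderings, as in the quote).

```python
import numpy as np
from itertools import permutations

def r_squared(y, X, cols):
    """R^2 of an OLS fit of y on an intercept plus the given columns of X."""
    design = np.hstack([np.ones((len(y), 1)), X[:, list(cols)]])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ beta
    return 1.0 - np.sum(resid ** 2) / np.sum((y - y.mean()) ** 2)

def lmg_importance(y, X):
    """Average each predictor's sequential R^2 gain over all orderings."""
    p = X.shape[1]
    contrib = np.zeros(p)
    orderings = list(permutations(range(p)))
    for order in orderings:
        included, prev_r2 = [], 0.0  # an intercept-only model has R^2 = 0
        for j in order:
            included.append(j)
            cur_r2 = r_squared(y, X, included)
            contrib[j] += cur_r2 - prev_r2  # gain credited to the variable just added
            prev_r2 = cur_r2
    return contrib / len(orderings)

# Synthetic data: x0 and x1 are correlated and both matter; x2 and x3 are noise.
rng = np.random.default_rng(1)
x0 = rng.normal(size=300)
x1 = 0.8 * x0 + 0.6 * rng.normal(size=300)
X = np.column_stack([x0, x1, rng.normal(size=300), rng.normal(size=300)])
y = x0 + x1 + 0.5 * rng.normal(size=300)

importance = lmg_importance(y, X)
print(importance.round(3))
print(round(float(importance.sum()), 3), round(float(r_squared(y, X, range(4))), 3))
```

A handy property is that the individual contributions add up to the R^2 of the full model, so each one can be read as a share of the total explained variance. Also notice how the number of orderings grows as N factorial.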

That sounds reasonable but there are some downsides to this approach. Can you think of any? 

For another take on variable importance using Lindeman, Merenda, and Gold see here:

https://www.r-bloggers.com/the-relative-importance-of-predictors-let-the-games-begin/