This notebook introduces one of the most common mistakes in machine learning.

In [1]:
import logging
import importlib
from dateutil.relativedelta import relativedelta

In [2]:
import pandas as pd
from sklearn.linear_model import LinearRegression

In [3]:
from helpers.dataset import read_quote_dataset, preprocess_quotes
from helpers.backtest import train_model_and_backtest_regressor, get_backtest_performance_metrics

In [4]:
# Configure the logging module for the Jupyter notebook
importlib.reload(logging)
logging_format = '%(asctime)s - %(levelname)s - %(process)s - %(message)s'
logging.basicConfig(level=logging.DEBUG, format=logging_format)

In [5]:
PARAM_DATASET = '../data/SPY_postprocess_adj.csv.gz'

Read the dataset

In [6]:
df = read_quote_dataset(PARAM_DATASET)

In [7]:
df.head()

Unnamed: 0,date,open,high,low,close,close_adj,volume,open_adj,low_adj,high_adj,...,ratio_close_adj_000_close_adj_005_norm,ratio_close_adj_000_close_adj_020_norm,ratio_close_adj_000_ema_005_norm,ratio_close_adj_000_ema_010_norm,ratio_close_adj_000_ema_020_norm,ratio_close_adj_000_ema_050_norm,ratio_close_adj_000_sma_005_norm,ratio_close_adj_000_sma_010_norm,ratio_close_adj_000_sma_020_norm,ratio_close_adj_000_sma_050_norm
0,2000-01-03,148.25,148.25,143.875,145.4375,101.425385,8164300,103.38677,100.335727,103.38677,...,,,,,,,,,,
1,2000-01-04,143.531204,144.0625,139.640594,139.75,97.459068,8089800,100.09601,97.38277,100.466526,...,,,,,,,,,,
2,2000-01-05,139.9375,141.531204,137.25,140.0,97.633377,12177900,97.589791,95.715579,98.70121,...,,,,,,,,,,
3,2000-01-06,139.625,141.5,137.75,137.75,96.064301,6227200,97.371891,96.064301,98.679482,...,,,0.48663,,,,,,,
4,2000-01-07,140.3125,145.75,140.0625,145.75,101.643333,8066500,97.851322,97.676977,101.643333,...,,,0.815422,,,,0.740588,,,


Compute the future values of the stock to be used as the target (dependent variable)

In [8]:
vars_to_shift = ['close_adj', 'close_adj_norm', 'close_adj_std']
shift_periods = [1, 5, 10, 20]
vars_for_return = ['close_adj']
return_periods = [1, 5, 10, 20]

In [9]:
df = preprocess_quotes(df, vars_to_shift=vars_to_shift, shift_periods=shift_periods,
                       vars_for_return=vars_for_return, return_periods=return_periods)

In [10]:
df[['date', 'close_adj', 'close_adj_shift_1', 'close_adj_ret_1', 'close_adj_shift_5', 'close_adj_ret_5']].head(10)

Unnamed: 0,date,close_adj,close_adj_shift_1,close_adj_ret_1,close_adj_shift_5,close_adj_ret_5
0,2000-01-03,101.425385,97.459068,-0.039106,101.992004,0.005587
1,2000-01-04,97.459068,97.633377,0.001789,100.771645,0.033989
2,2000-01-05,97.633377,96.064301,-0.016071,99.76915,0.021875
3,2000-01-06,96.064301,101.643333,0.058076,101.120308,0.052631
4,2000-01-07,101.643333,101.992004,0.00343,102.493233,0.008362
5,2000-01-10,101.992004,100.771645,-0.011965,101.686958,-0.002991
6,2000-01-11,100.771645,99.76915,-0.009948,102.51506,0.017301
7,2000-01-12,99.76915,101.120308,0.013543,100.945953,0.011795
8,2000-01-13,101.120308,102.493233,0.013577,100.727989,-0.00388
9,2000-01-14,102.493233,101.686958,-0.007867,97.873047,-0.045078
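
The source of `preprocess_quotes` is not shown here, but the table above is consistent with a simple rule: `close_adj_shift_n` is the adjusted close `n` rows later, and `close_adj_ret_n` is the relative change from the current close to that future close. The sketch below reproduces the first rows under that assumption. The helper `add_future_columns` is hypothetical and written in plain Python; with pandas the same idea would normally be expressed with `Series.shift(-n)`.

```python
def add_future_columns(closes, periods):
    """Hypothetical helper: future value and forward return for each horizon."""
    rows = []
    for i, price in enumerate(closes):
        row = {"close_adj": price}
        for n in periods:
            # Value n rows ahead, or None when it falls outside the data
            future = closes[i + n] if i + n < len(closes) else None
            row[f"close_adj_shift_{n}"] = future
            row[f"close_adj_ret_{n}"] = None if future is None else future / price - 1
        rows.append(row)
    return rows


# First six adjusted closes from the table above
closes = [101.425385, 97.459068, 97.633377, 96.064301, 101.643333, 101.992004]
rows = add_future_columns(closes, periods=[1, 5])
print(rows[0])
```

Note that these columns contain information from the future by construction, so they can only be used as targets or as exit prices, never as model inputs.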


# Core analysis

Let's run a linear regression model, as was done in the previous notebook. 
Its only input is the close price, it tries to predict the next day's price, and it uses 1 month 
of historical data for training.

To clarify, the backtest goes long or short at the current day's close price 
(`buy_price_col = 'close_adj'`),
and the position is closed at the next day's price (`sell_price_col = 'close_adj_shift_1'`). On the
next day, the regression is evaluated again, and it is decided again whether to go long or short.
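
The exact decision rule lives inside `train_model_and_backtest_regressor`, which is not shown in this notebook. A minimal sketch of one plausible rule is: go long when the predicted price is above the entry price, otherwise go short, and book the one-day return accordingly. The function name and the predicted prices below are hypothetical; the entry and exit prices are the closes of 2000-01-03 and 2000-01-04 from the table above.

```python
def daily_strategy_return(buy_price, sell_price, predicted_price):
    """Return of a one-day position opened at buy_price and closed at sell_price.

    Assumption: long if the prediction is above the entry price, short otherwise.
    The notebook's helper may use a different rule.
    """
    direction = 1 if predicted_price > buy_price else -1
    return direction * (sell_price / buy_price - 1)


# Same day, two hypothetical predictions: one bullish, one bearish
long_ret = daily_strategy_return(101.425385, 97.459068, predicted_price=102.0)
short_ret = daily_strategy_return(101.425385, 97.459068, predicted_price=100.0)
print(round(long_ret, 6), round(short_ret, 6))
```

Under this rule only the sign of the prediction relative to the current close matters, not its magnitude.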

In [11]:
x_vars = ['close_adj']
y_var = 'close_adj_shift_1'
buy_price_col = 'close_adj'
sell_price_col = 'close_adj_shift_1'
model_class = LinearRegression
model_params = {'fit_intercept': True}

In [12]:
df_backtest = train_model_and_backtest_regressor(df, x_vars=x_vars, y_var=y_var, 
    buy_price_col=buy_price_col, sell_price_col=sell_price_col,
    model_class=model_class, model_params=model_params, 
    backtest_start='2000-02-01', backtest_end='2018-12-31', 
    model_update_frequency='M', train_history_period=relativedelta(months=1),
    ignore_last_x_training_items=0)

2019-05-06 10:40:47,525 - DEBUG - 30325 - 228 periods to backtest: ['2000-02-01', '2000-02-29', '2000-03-31', '2000-04-30', '2000-05-31', '2000-06-30', '2000-07-31', '2000-08-31', '2000-09-30', '2000-10-31', '2000-11-30', '2000-12-31', '2001-01-31', '2001-02-28', '2001-03-31', '2001-04-30', '2001-05-31', '2001-06-30', '2001-07-31', '2001-08-31', '2001-09-30', '2001-10-31', '2001-11-30', '2001-12-31', '2002-01-31', '2002-02-28', '2002-03-31', '2002-04-30', '2002-05-31', '2002-06-30', '2002-07-31', '2002-08-31', '2002-09-30', '2002-10-31', '2002-11-30', '2002-12-31', '2003-01-31', '2003-02-28', '2003-03-31', '2003-04-30', '2003-05-31', '2003-06-30', '2003-07-31', '2003-08-31', '2003-09-30', '2003-10-31', '2003-11-30', '2003-12-31', '2004-01-31', '2004-02-29', '2004-03-31', '2004-04-30', '2004-05-31', '2004-06-30', '2004-07-31', '2004-08-31', '2004-09-30', '2004-10-31', '2004-11-30', '2004-12-31', '2005-01-31', '2005-02-28', '2005-03-31', '2005-04-30', '2005-05-31', '2005-06-30', '2005-07

2019-05-06 10:40:47,850 - INFO - 30325 - Training dataset is between 2001-12-31 and 2002-01-30.
2019-05-06 10:40:47,857 - INFO - 30325 - Training a model to be tested between 2002-02-28 and 2002-03-31.
2019-05-06 10:40:47,860 - INFO - 30325 - Training dataset is between 2002-01-28 and 2002-02-27.
2019-05-06 10:40:47,869 - INFO - 30325 - Training a model to be tested between 2002-03-31 and 2002-04-30.
2019-05-06 10:40:47,872 - INFO - 30325 - Training dataset is between 2002-02-28 and 2002-03-28.
2019-05-06 10:40:47,884 - INFO - 30325 - Training a model to be tested between 2002-04-30 and 2002-05-31.
2019-05-06 10:40:47,887 - INFO - 30325 - Training dataset is between 2002-04-01 and 2002-04-29.
2019-05-06 10:40:47,900 - INFO - 30325 - Training a model to be tested between 2002-05-31 and 2002-06-30.
2019-05-06 10:40:47,902 - INFO - 30325 - Training dataset is between 2002-04-30 and 2002-05-30.
2019-05-06 10:40:47,912 - INFO - 30325 - Training a model to be tested between 2002-06-30 and 20

2019-05-06 10:40:48,290 - INFO - 30325 - Training dataset is between 2005-05-31 and 2005-06-29.
2019-05-06 10:40:48,303 - INFO - 30325 - Training a model to be tested between 2005-07-31 and 2005-08-31.
2019-05-06 10:40:48,306 - INFO - 30325 - Training dataset is between 2005-06-30 and 2005-07-29.
2019-05-06 10:40:48,317 - INFO - 30325 - Training a model to be tested between 2005-08-31 and 2005-09-30.
2019-05-06 10:40:48,320 - INFO - 30325 - Training dataset is between 2005-08-01 and 2005-08-30.
2019-05-06 10:40:48,329 - INFO - 30325 - Training a model to be tested between 2005-09-30 and 2005-10-31.
2019-05-06 10:40:48,331 - INFO - 30325 - Training dataset is between 2005-08-30 and 2005-09-29.
2019-05-06 10:40:48,340 - INFO - 30325 - Training a model to be tested between 2005-10-31 and 2005-11-30.
2019-05-06 10:40:48,342 - INFO - 30325 - Training dataset is between 2005-09-30 and 2005-10-28.
2019-05-06 10:40:48,350 - INFO - 30325 - Training a model to be tested between 2005-11-30 and 20

2019-05-06 10:40:48,761 - INFO - 30325 - Training dataset is between 2008-10-30 and 2008-11-28.
2019-05-06 10:40:48,770 - INFO - 30325 - Training a model to be tested between 2008-12-31 and 2009-01-31.
2019-05-06 10:40:48,772 - INFO - 30325 - Training dataset is between 2008-12-01 and 2008-12-30.
2019-05-06 10:40:48,779 - INFO - 30325 - Training a model to be tested between 2009-01-31 and 2009-02-28.
2019-05-06 10:40:48,781 - INFO - 30325 - Training dataset is between 2008-12-31 and 2009-01-30.
2019-05-06 10:40:48,791 - INFO - 30325 - Training a model to be tested between 2009-02-28 and 2009-03-31.
2019-05-06 10:40:48,792 - INFO - 30325 - Training dataset is between 2009-01-28 and 2009-02-27.
2019-05-06 10:40:48,800 - INFO - 30325 - Training a model to be tested between 2009-03-31 and 2009-04-30.
2019-05-06 10:40:48,804 - INFO - 30325 - Training dataset is between 2009-03-02 and 2009-03-30.
2019-05-06 10:40:48,812 - INFO - 30325 - Training a model to be tested between 2009-04-30 and 20

2019-05-06 10:40:49,208 - INFO - 30325 - Training dataset is between 2012-03-30 and 2012-04-27.
2019-05-06 10:40:49,217 - INFO - 30325 - Training a model to be tested between 2012-05-31 and 2012-06-30.
2019-05-06 10:40:49,220 - INFO - 30325 - Training dataset is between 2012-04-30 and 2012-05-30.
2019-05-06 10:40:49,229 - INFO - 30325 - Training a model to be tested between 2012-06-30 and 2012-07-31.
2019-05-06 10:40:49,231 - INFO - 30325 - Training dataset is between 2012-05-30 and 2012-06-29.
2019-05-06 10:40:49,242 - INFO - 30325 - Training a model to be tested between 2012-07-31 and 2012-08-31.
2019-05-06 10:40:49,244 - INFO - 30325 - Training dataset is between 2012-07-02 and 2012-07-30.
2019-05-06 10:40:49,253 - INFO - 30325 - Training a model to be tested between 2012-08-31 and 2012-09-30.
2019-05-06 10:40:49,255 - INFO - 30325 - Training dataset is between 2012-07-31 and 2012-08-30.
2019-05-06 10:40:49,266 - INFO - 30325 - Training a model to be tested between 2012-09-30 and 20

2019-05-06 10:40:49,655 - INFO - 30325 - Training dataset is between 2015-08-31 and 2015-09-29.
2019-05-06 10:40:49,664 - INFO - 30325 - Training a model to be tested between 2015-10-31 and 2015-11-30.
2019-05-06 10:40:49,666 - INFO - 30325 - Training dataset is between 2015-09-30 and 2015-10-30.
2019-05-06 10:40:49,675 - INFO - 30325 - Training a model to be tested between 2015-11-30 and 2015-12-31.
2019-05-06 10:40:49,677 - INFO - 30325 - Training dataset is between 2015-10-30 and 2015-11-27.
2019-05-06 10:40:49,686 - INFO - 30325 - Training a model to be tested between 2015-12-31 and 2016-01-31.
2019-05-06 10:40:49,689 - INFO - 30325 - Training dataset is between 2015-11-30 and 2015-12-30.
2019-05-06 10:40:49,697 - INFO - 30325 - Training a model to be tested between 2016-01-31 and 2016-02-29.
2019-05-06 10:40:49,700 - INFO - 30325 - Training dataset is between 2015-12-31 and 2016-01-29.
2019-05-06 10:40:49,708 - INFO - 30325 - Training a model to be tested between 2016-02-29 and 20
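
Reading the log lines in pairs, each model appears to be trained on the month that ends just before its test period starts (for example, tested from 2002-02-28 and trained on 2002-01-28 to 2002-02-27). The sketch below reproduces those window boundaries with the standard library only. The helpers `minus_one_month` and `training_window` are hypothetical, and the clamping to the last day of the month is an assumption meant to mimic subtracting `relativedelta(months=1)`.

```python
import calendar
from datetime import date


def minus_one_month(d):
    """Same day in the previous month, clamped to that month's last day."""
    year, month = (d.year, d.month - 1) if d.month > 1 else (d.year - 1, 12)
    last_day = calendar.monthrange(year, month)[1]
    return date(year, month, min(d.day, last_day))


def training_window(test_start):
    """Assumed window: the month before test_start, excluding test_start itself."""
    return minus_one_month(test_start), test_start


# Period boundaries taken from the DEBUG log line above
for test_start in [date(2002, 2, 28), date(2002, 3, 31), date(2002, 4, 30)]:
    train_from, train_until = training_window(test_start)
    print(f"test from {test_start}: train on [{train_from}, {train_until})")
```

The logged training ranges are slightly narrower than these calendar windows because only trading days are present in the dataset.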

In [13]:
get_backtest_performance_metrics(df_backtest.ret, df_backtest.benchmark_ret, with_benchmark=True, with_delta=True)

Unnamed: 0,main,benchmark,delta
alpha,0.017836,-3.322314e-16,
beta,0.29097,1.0,
cagr,0.01933,0.05047234,-0.031143
max_drawdown,-0.494455,-0.5518942,0.057439
return,0.435562,1.534235,-1.098673
sharpe,0.195584,0.3527034,-0.157119
var,-0.017761,-0.01925716,0.001496
volatility,0.191751,0.1916436,0.000107


Using only the close price as input, the CAGR is 1.93%, lower than the benchmark's 5.05%.
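
As a sanity check, the CAGR can be related to the total return in the same table: it is the constant yearly growth rate that compounds to that total return over the backtest. The sketch below assumes a backtest length of roughly 18.9 years (2000-02-01 to 2018-12-31). The exact year-count convention used by `get_backtest_performance_metrics` is not shown, so the results only match the table approximately.

```python
def cagr(total_return, years):
    """Compound annual growth rate implied by a total return over `years`."""
    return (1 + total_return) ** (1 / years) - 1


years = 18.9  # assumption: approximate length of the backtest in years

strategy_cagr = cagr(0.435562, years)   # 'return' of the main strategy
benchmark_cagr = cagr(1.534235, years)  # 'return' of the benchmark
print(round(strategy_cagr, 4), round(benchmark_cagr, 4))
```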

Let's see what happens if we run a similar regression with multiple input variables. That is,
instead of using only the adjusted close price, the model uses the OLHC quotes (Open, Low,
High and Close).

In [14]:
x_vars = ['open_adj', 'low_adj', 'high_adj', 'close_adj']
y_var = 'close_adj_shift_1'
buy_price_col = 'close_adj'
sell_price_col = 'close_adj_shift_1'
model_class = LinearRegression
model_params = {'fit_intercept': True}

df_backtest = train_model_and_backtest_regressor(df, x_vars=x_vars, y_var=y_var, 
    buy_price_col=buy_price_col, sell_price_col=sell_price_col,
    model_class=model_class, model_params=model_params, 
    backtest_start='2000-02-01', backtest_end='2018-12-31', 
    model_update_frequency='M', train_history_period=relativedelta(months=1))

2019-05-06 10:41:17,540 - DEBUG - 30325 - 228 periods to backtest: ['2000-02-01', '2000-02-29', '2000-03-31', '2000-04-30', '2000-05-31', '2000-06-30', '2000-07-31', '2000-08-31', '2000-09-30', '2000-10-31', '2000-11-30', '2000-12-31', '2001-01-31', '2001-02-28', '2001-03-31', '2001-04-30', '2001-05-31', '2001-06-30', '2001-07-31', '2001-08-31', '2001-09-30', '2001-10-31', '2001-11-30', '2001-12-31', '2002-01-31', '2002-02-28', '2002-03-31', '2002-04-30', '2002-05-31', '2002-06-30', '2002-07-31', '2002-08-31', '2002-09-30', '2002-10-31', '2002-11-30', '2002-12-31', '2003-01-31', '2003-02-28', '2003-03-31', '2003-04-30', '2003-05-31', '2003-06-30', '2003-07-31', '2003-08-31', '2003-09-30', '2003-10-31', '2003-11-30', '2003-12-31', '2004-01-31', '2004-02-29', '2004-03-31', '2004-04-30', '2004-05-31', '2004-06-30', '2004-07-31', '2004-08-31', '2004-09-30', '2004-10-31', '2004-11-30', '2004-12-31', '2005-01-31', '2005-02-28', '2005-03-31', '2005-04-30', '2005-05-31', '2005-06-30', '2005-07

2019-05-06 10:41:17,811 - INFO - 30325 - Training dataset is between 2001-12-31 and 2002-01-30.
2019-05-06 10:41:17,821 - INFO - 30325 - Training a model to be tested between 2002-02-28 and 2002-03-31.
2019-05-06 10:41:17,823 - INFO - 30325 - Training dataset is between 2002-01-28 and 2002-02-27.
2019-05-06 10:41:17,833 - INFO - 30325 - Training a model to be tested between 2002-03-31 and 2002-04-30.
2019-05-06 10:41:17,835 - INFO - 30325 - Training dataset is between 2002-02-28 and 2002-03-28.
2019-05-06 10:41:17,845 - INFO - 30325 - Training a model to be tested between 2002-04-30 and 2002-05-31.
2019-05-06 10:41:17,847 - INFO - 30325 - Training dataset is between 2002-04-01 and 2002-04-29.
2019-05-06 10:41:17,857 - INFO - 30325 - Training a model to be tested between 2002-05-31 and 2002-06-30.
2019-05-06 10:41:17,859 - INFO - 30325 - Training dataset is between 2002-04-30 and 2002-05-30.
2019-05-06 10:41:17,867 - INFO - 30325 - Training a model to be tested between 2002-06-30 and 20

2019-05-06 10:41:18,274 - INFO - 30325 - Training dataset is between 2005-05-31 and 2005-06-29.
2019-05-06 10:41:18,283 - INFO - 30325 - Training a model to be tested between 2005-07-31 and 2005-08-31.
2019-05-06 10:41:18,285 - INFO - 30325 - Training dataset is between 2005-06-30 and 2005-07-29.
2019-05-06 10:41:18,292 - INFO - 30325 - Training a model to be tested between 2005-08-31 and 2005-09-30.
2019-05-06 10:41:18,295 - INFO - 30325 - Training dataset is between 2005-08-01 and 2005-08-30.
2019-05-06 10:41:18,303 - INFO - 30325 - Training a model to be tested between 2005-09-30 and 2005-10-31.
2019-05-06 10:41:18,305 - INFO - 30325 - Training dataset is between 2005-08-30 and 2005-09-29.
2019-05-06 10:41:18,313 - INFO - 30325 - Training a model to be tested between 2005-10-31 and 2005-11-30.
2019-05-06 10:41:18,315 - INFO - 30325 - Training dataset is between 2005-09-30 and 2005-10-28.
2019-05-06 10:41:18,324 - INFO - 30325 - Training a model to be tested between 2005-11-30 and 20

2019-05-06 10:41:18,709 - INFO - 30325 - Training dataset is between 2008-10-30 and 2008-11-28.
2019-05-06 10:41:18,718 - INFO - 30325 - Training a model to be tested between 2008-12-31 and 2009-01-31.
2019-05-06 10:41:18,720 - INFO - 30325 - Training dataset is between 2008-12-01 and 2008-12-30.
2019-05-06 10:41:18,728 - INFO - 30325 - Training a model to be tested between 2009-01-31 and 2009-02-28.
2019-05-06 10:41:18,730 - INFO - 30325 - Training dataset is between 2008-12-31 and 2009-01-30.
2019-05-06 10:41:18,740 - INFO - 30325 - Training a model to be tested between 2009-02-28 and 2009-03-31.
2019-05-06 10:41:18,742 - INFO - 30325 - Training dataset is between 2009-01-28 and 2009-02-27.
2019-05-06 10:41:18,751 - INFO - 30325 - Training a model to be tested between 2009-03-31 and 2009-04-30.
2019-05-06 10:41:18,753 - INFO - 30325 - Training dataset is between 2009-03-02 and 2009-03-30.
2019-05-06 10:41:18,762 - INFO - 30325 - Training a model to be tested between 2009-04-30 and 20

2019-05-06 10:41:19,162 - INFO - 30325 - Training dataset is between 2012-03-30 and 2012-04-27.
2019-05-06 10:41:19,171 - INFO - 30325 - Training a model to be tested between 2012-05-31 and 2012-06-30.
2019-05-06 10:41:19,173 - INFO - 30325 - Training dataset is between 2012-04-30 and 2012-05-30.
2019-05-06 10:41:19,182 - INFO - 30325 - Training a model to be tested between 2012-06-30 and 2012-07-31.
2019-05-06 10:41:19,184 - INFO - 30325 - Training dataset is between 2012-05-30 and 2012-06-29.
2019-05-06 10:41:19,192 - INFO - 30325 - Training a model to be tested between 2012-07-31 and 2012-08-31.
2019-05-06 10:41:19,195 - INFO - 30325 - Training dataset is between 2012-07-02 and 2012-07-30.
2019-05-06 10:41:19,203 - INFO - 30325 - Training a model to be tested between 2012-08-31 and 2012-09-30.
2019-05-06 10:41:19,205 - INFO - 30325 - Training dataset is between 2012-07-31 and 2012-08-30.
2019-05-06 10:41:19,213 - INFO - 30325 - Training a model to be tested between 2012-09-30 and 20

2019-05-06 10:41:19,605 - INFO - 30325 - Training dataset is between 2015-08-31 and 2015-09-29.
2019-05-06 10:41:19,613 - INFO - 30325 - Training a model to be tested between 2015-10-31 and 2015-11-30.
2019-05-06 10:41:19,615 - INFO - 30325 - Training dataset is between 2015-09-30 and 2015-10-30.
2019-05-06 10:41:19,623 - INFO - 30325 - Training a model to be tested between 2015-11-30 and 2015-12-31.
2019-05-06 10:41:19,625 - INFO - 30325 - Training dataset is between 2015-10-30 and 2015-11-27.
2019-05-06 10:41:19,634 - INFO - 30325 - Training a model to be tested between 2015-12-31 and 2016-01-31.
2019-05-06 10:41:19,636 - INFO - 30325 - Training dataset is between 2015-11-30 and 2015-12-30.
2019-05-06 10:41:19,645 - INFO - 30325 - Training a model to be tested between 2016-01-31 and 2016-02-29.
2019-05-06 10:41:19,646 - INFO - 30325 - Training dataset is between 2015-12-31 and 2016-01-29.
2019-05-06 10:41:19,654 - INFO - 30325 - Training a model to be tested between 2016-02-29 and 20

In [15]:
get_backtest_performance_metrics(df_backtest.ret, df_backtest.benchmark_ret, with_benchmark=True, with_delta=True)

Unnamed: 0,main,benchmark,delta
alpha,0.062292,-3.322314e-16,
beta,0.183535,1.0,
cagr,0.05794,0.05047234,0.007468
max_drawdown,-0.313108,-0.5518942,0.238787
return,1.896996,1.534235,0.362761
sharpe,0.389161,0.3527034,0.036458
var,-0.017778,-0.01925716,0.001479
volatility,0.191945,0.1916436,0.000301


Using the OLHC quotes, the CAGR (5.79%) is similar to the benchmark's (5.05%). However, the max drawdown is significantly 
lower: 31% for our regression model versus 55% for SPY. 
The important point here is that adding the Open, Low and High prices is apparently important to
the performance of the model. A model with OLHC quotes performs much better than one with only the close price.
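
For reference, max drawdown is usually defined as the largest peak-to-trough decline of the compounded equity curve. The sketch below uses that common definition; the notebook's helper may compute it differently, and the sample returns are hypothetical values chosen only to illustrate the calculation.

```python
def max_drawdown(returns):
    """Largest peak-to-trough decline of the compounded equity curve."""
    equity = 1.0
    peak = 1.0
    worst = 0.0
    for r in returns:
        equity *= 1 + r                        # compound the period return
        peak = max(peak, equity)               # highest equity seen so far
        worst = min(worst, equity / peak - 1)  # decline from that peak
    return worst


# Hypothetical period returns, not taken from the backtest
sample_returns = [0.10, -0.20, 0.05, -0.10, 0.30]
print(round(max_drawdown(sample_returns), 4))
```

The result is reported as a negative fraction, which matches the sign convention of the `max_drawdown` row in the tables above.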

Just for research purposes, let's see what happens if we use another set of input variables. The idea is
to understand the nature of the model. Let's see what happens using only the amplitude of the daily quotes 
(low and high prices).

In [16]:
x_vars = ['low_adj', 'high_adj']
y_var = 'close_adj_shift_1'
buy_price_col = 'close_adj'
sell_price_col = 'close_adj_shift_1'
model_class = LinearRegression
model_params = {'fit_intercept': True}

df_backtest = train_model_and_backtest_regressor(df, x_vars=x_vars, y_var=y_var, 
    buy_price_col=buy_price_col, sell_price_col=sell_price_col,
    model_class=model_class, model_params=model_params, 
    backtest_start='2000-02-01', backtest_end='2018-12-31', 
    model_update_frequency='M', train_history_period=relativedelta(months=1))

2019-05-06 10:41:20,431 - DEBUG - 30325 - 228 periods to backtest: ['2000-02-01', '2000-02-29', '2000-03-31', '2000-04-30', '2000-05-31', '2000-06-30', '2000-07-31', '2000-08-31', '2000-09-30', '2000-10-31', '2000-11-30', '2000-12-31', '2001-01-31', '2001-02-28', '2001-03-31', '2001-04-30', '2001-05-31', '2001-06-30', '2001-07-31', '2001-08-31', '2001-09-30', '2001-10-31', '2001-11-30', '2001-12-31', '2002-01-31', '2002-02-28', '2002-03-31', '2002-04-30', '2002-05-31', '2002-06-30', '2002-07-31', '2002-08-31', '2002-09-30', '2002-10-31', '2002-11-30', '2002-12-31', '2003-01-31', '2003-02-28', '2003-03-31', '2003-04-30', '2003-05-31', '2003-06-30', '2003-07-31', '2003-08-31', '2003-09-30', '2003-10-31', '2003-11-30', '2003-12-31', '2004-01-31', '2004-02-29', '2004-03-31', '2004-04-30', '2004-05-31', '2004-06-30', '2004-07-31', '2004-08-31', '2004-09-30', '2004-10-31', '2004-11-30', '2004-12-31', '2005-01-31', '2005-02-28', '2005-03-31', '2005-04-30', '2005-05-31', '2005-06-30', '2005-07

2019-05-06 10:41:20,721 - INFO - 30325 - Training dataset is between 2001-12-31 and 2002-01-30.
2019-05-06 10:41:20,730 - INFO - 30325 - Training a model to be tested between 2002-02-28 and 2002-03-31.
2019-05-06 10:41:20,732 - INFO - 30325 - Training dataset is between 2002-01-28 and 2002-02-27.
2019-05-06 10:41:20,740 - INFO - 30325 - Training a model to be tested between 2002-03-31 and 2002-04-30.
2019-05-06 10:41:20,742 - INFO - 30325 - Training dataset is between 2002-02-28 and 2002-03-28.
2019-05-06 10:41:20,752 - INFO - 30325 - Training a model to be tested between 2002-04-30 and 2002-05-31.
2019-05-06 10:41:20,754 - INFO - 30325 - Training dataset is between 2002-04-01 and 2002-04-29.
2019-05-06 10:41:20,761 - INFO - 30325 - Training a model to be tested between 2002-05-31 and 2002-06-30.
2019-05-06 10:41:20,764 - INFO - 30325 - Training dataset is between 2002-04-30 and 2002-05-30.
2019-05-06 10:41:20,773 - INFO - 30325 - Training a model to be tested between 2002-06-30 and 20

2019-05-06 10:41:21,186 - INFO - 30325 - Training dataset is between 2005-05-31 and 2005-06-29.
2019-05-06 10:41:21,196 - INFO - 30325 - Training a model to be tested between 2005-07-31 and 2005-08-31.
2019-05-06 10:41:21,199 - INFO - 30325 - Training dataset is between 2005-06-30 and 2005-07-29.
2019-05-06 10:41:21,209 - INFO - 30325 - Training a model to be tested between 2005-08-31 and 2005-09-30.
2019-05-06 10:41:21,211 - INFO - 30325 - Training dataset is between 2005-08-01 and 2005-08-30.
2019-05-06 10:41:21,221 - INFO - 30325 - Training a model to be tested between 2005-09-30 and 2005-10-31.
2019-05-06 10:41:21,225 - INFO - 30325 - Training dataset is between 2005-08-30 and 2005-09-29.
2019-05-06 10:41:21,232 - INFO - 30325 - Training a model to be tested between 2005-10-31 and 2005-11-30.
2019-05-06 10:41:21,234 - INFO - 30325 - Training dataset is between 2005-09-30 and 2005-10-28.
2019-05-06 10:41:21,244 - INFO - 30325 - Training a model to be tested between 2005-11-30 and 20

2019-05-06 10:41:21,636 - INFO - 30325 - Training dataset is between 2008-10-30 and 2008-11-28.
2019-05-06 10:41:21,644 - INFO - 30325 - Training a model to be tested between 2008-12-31 and 2009-01-31.
2019-05-06 10:41:21,647 - INFO - 30325 - Training dataset is between 2008-12-01 and 2008-12-30.
2019-05-06 10:41:21,655 - INFO - 30325 - Training a model to be tested between 2009-01-31 and 2009-02-28.
2019-05-06 10:41:21,657 - INFO - 30325 - Training dataset is between 2008-12-31 and 2009-01-30.
2019-05-06 10:41:21,667 - INFO - 30325 - Training a model to be tested between 2009-02-28 and 2009-03-31.
2019-05-06 10:41:21,670 - INFO - 30325 - Training dataset is between 2009-01-28 and 2009-02-27.
2019-05-06 10:41:21,678 - INFO - 30325 - Training a model to be tested between 2009-03-31 and 2009-04-30.
2019-05-06 10:41:21,680 - INFO - 30325 - Training dataset is between 2009-03-02 and 2009-03-30.
2019-05-06 10:41:21,689 - INFO - 30325 - Training a model to be tested between 2009-04-30 and 20

2019-05-06 10:41:22,066 - INFO - 30325 - Training dataset is between 2012-03-30 and 2012-04-27.
2019-05-06 10:41:22,074 - INFO - 30325 - Training a model to be tested between 2012-05-31 and 2012-06-30.
2019-05-06 10:41:22,076 - INFO - 30325 - Training dataset is between 2012-04-30 and 2012-05-30.
2019-05-06 10:41:22,084 - INFO - 30325 - Training a model to be tested between 2012-06-30 and 2012-07-31.
2019-05-06 10:41:22,086 - INFO - 30325 - Training dataset is between 2012-05-30 and 2012-06-29.
2019-05-06 10:41:22,095 - INFO - 30325 - Training a model to be tested between 2012-07-31 and 2012-08-31.
2019-05-06 10:41:22,097 - INFO - 30325 - Training dataset is between 2012-07-02 and 2012-07-30.
2019-05-06 10:41:22,106 - INFO - 30325 - Training a model to be tested between 2012-08-31 and 2012-09-30.
2019-05-06 10:41:22,108 - INFO - 30325 - Training dataset is between 2012-07-31 and 2012-08-30.
2019-05-06 10:41:22,117 - INFO - 30325 - Training a model to be tested between 2012-09-30 and 20

2019-05-06 10:41:22,511 - INFO - 30325 - Training dataset is between 2015-08-31 and 2015-09-29.
2019-05-06 10:41:22,523 - INFO - 30325 - Training a model to be tested between 2015-10-31 and 2015-11-30.
2019-05-06 10:41:22,525 - INFO - 30325 - Training dataset is between 2015-09-30 and 2015-10-30.
2019-05-06 10:41:22,533 - INFO - 30325 - Training a model to be tested between 2015-11-30 and 2015-12-31.
2019-05-06 10:41:22,536 - INFO - 30325 - Training dataset is between 2015-10-30 and 2015-11-27.
2019-05-06 10:41:22,548 - INFO - 30325 - Training a model to be tested between 2015-12-31 and 2016-01-31.
2019-05-06 10:41:22,550 - INFO - 30325 - Training dataset is between 2015-11-30 and 2015-12-30.
2019-05-06 10:41:22,559 - INFO - 30325 - Training a model to be tested between 2016-01-31 and 2016-02-29.
2019-05-06 10:41:22,562 - INFO - 30325 - Training dataset is between 2015-12-31 and 2016-01-29.
2019-05-06 10:41:22,573 - INFO - 30325 - Training a model to be tested between 2016-02-29 and 20

In [17]:
get_backtest_performance_metrics(df_backtest.ret, df_backtest.benchmark_ret, with_benchmark=True, with_delta=True)

Unnamed: 0,main,benchmark,delta
alpha,0.096132,-3.322314e-16,
beta,0.209123,1.0,
cagr,0.096241,0.05047234,0.045769
max_drawdown,-0.317199,-0.5518942,0.234695
return,4.670535,1.534235,3.1363
sharpe,0.574377,0.3527034,0.221674
var,-0.017437,-0.01925716,0.00182
volatility,0.191977,0.1916436,0.000333


Using only the low and high prices, the CAGR is not only better than in the previous models, it is also much higher than the benchmark.
Using the quote amplitude the CAGR is 9.6%, while the benchmark is 5.05%. 

Now let's try using the open, low and high prices.

In [18]:
x_vars = ['open_adj', 'low_adj', 'high_adj']
y_var = 'close_adj_shift_1'
buy_price_col = 'close_adj'
sell_price_col = 'close_adj_shift_1'
model_class = LinearRegression
model_params = {'fit_intercept': True}

df_backtest = train_model_and_backtest_regressor(df, x_vars=x_vars, y_var=y_var, 
    buy_price_col=buy_price_col, sell_price_col=sell_price_col,
    model_class=model_class, model_params=model_params, 
    backtest_start='2000-02-01', backtest_end='2018-12-31', 
    model_update_frequency='M', train_history_period=relativedelta(months=1))

2019-05-06 10:41:23,214 - DEBUG - 30325 - 228 periods to backtest: ['2000-02-01', '2000-02-29', '2000-03-31', '2000-04-30', '2000-05-31', '2000-06-30', '2000-07-31', '2000-08-31', '2000-09-30', '2000-10-31', '2000-11-30', '2000-12-31', '2001-01-31', '2001-02-28', '2001-03-31', '2001-04-30', '2001-05-31', '2001-06-30', '2001-07-31', '2001-08-31', '2001-09-30', '2001-10-31', '2001-11-30', '2001-12-31', '2002-01-31', '2002-02-28', '2002-03-31', '2002-04-30', '2002-05-31', '2002-06-30', '2002-07-31', '2002-08-31', '2002-09-30', '2002-10-31', '2002-11-30', '2002-12-31', '2003-01-31', '2003-02-28', '2003-03-31', '2003-04-30', '2003-05-31', '2003-06-30', '2003-07-31', '2003-08-31', '2003-09-30', '2003-10-31', '2003-11-30', '2003-12-31', '2004-01-31', '2004-02-29', '2004-03-31', '2004-04-30', '2004-05-31', '2004-06-30', '2004-07-31', '2004-08-31', '2004-09-30', '2004-10-31', '2004-11-30', '2004-12-31', '2005-01-31', '2005-02-28', '2005-03-31', '2005-04-30', '2005-05-31', '2005-06-30', '2005-07

2019-05-06 10:41:23,493 - INFO - 30325 - Training dataset is between 2001-12-31 and 2002-01-30.
2019-05-06 10:41:23,503 - INFO - 30325 - Training a model to be tested between 2002-02-28 and 2002-03-31.
2019-05-06 10:41:23,505 - INFO - 30325 - Training dataset is between 2002-01-28 and 2002-02-27.
2019-05-06 10:41:23,513 - INFO - 30325 - Training a model to be tested between 2002-03-31 and 2002-04-30.
2019-05-06 10:41:23,515 - INFO - 30325 - Training dataset is between 2002-02-28 and 2002-03-28.
2019-05-06 10:41:23,526 - INFO - 30325 - Training a model to be tested between 2002-04-30 and 2002-05-31.
2019-05-06 10:41:23,529 - INFO - 30325 - Training dataset is between 2002-04-01 and 2002-04-29.
2019-05-06 10:41:23,537 - INFO - 30325 - Training a model to be tested between 2002-05-31 and 2002-06-30.
2019-05-06 10:41:23,539 - INFO - 30325 - Training dataset is between 2002-04-30 and 2002-05-30.
2019-05-06 10:41:23,549 - INFO - 30325 - Training a model to be tested between 2002-06-30 and 20

2019-05-06 10:41:23,937 - INFO - 30325 - Training dataset is between 2005-05-31 and 2005-06-29.
2019-05-06 10:41:23,945 - INFO - 30325 - Training a model to be tested between 2005-07-31 and 2005-08-31.
2019-05-06 10:41:23,947 - INFO - 30325 - Training dataset is between 2005-06-30 and 2005-07-29.
2019-05-06 10:41:23,955 - INFO - 30325 - Training a model to be tested between 2005-08-31 and 2005-09-30.
2019-05-06 10:41:23,957 - INFO - 30325 - Training dataset is between 2005-08-01 and 2005-08-30.
2019-05-06 10:41:23,966 - INFO - 30325 - Training a model to be tested between 2005-09-30 and 2005-10-31.
2019-05-06 10:41:23,968 - INFO - 30325 - Training dataset is between 2005-08-30 and 2005-09-29.
2019-05-06 10:41:23,976 - INFO - 30325 - Training a model to be tested between 2005-10-31 and 2005-11-30.
2019-05-06 10:41:23,978 - INFO - 30325 - Training dataset is between 2005-09-30 and 2005-10-28.
2019-05-06 10:41:23,987 - INFO - 30325 - Training a model to be tested between 2005-11-30 and 20

2019-05-06 10:41:24,369 - INFO - 30325 - Training dataset is between 2008-10-30 and 2008-11-28.
2019-05-06 10:41:24,378 - INFO - 30325 - Training a model to be tested between 2008-12-31 and 2009-01-31.
2019-05-06 10:41:24,381 - INFO - 30325 - Training dataset is between 2008-12-01 and 2008-12-30.
2019-05-06 10:41:24,390 - INFO - 30325 - Training a model to be tested between 2009-01-31 and 2009-02-28.
2019-05-06 10:41:24,392 - INFO - 30325 - Training dataset is between 2008-12-31 and 2009-01-30.
2019-05-06 10:41:24,400 - INFO - 30325 - Training a model to be tested between 2009-02-28 and 2009-03-31.
2019-05-06 10:41:24,402 - INFO - 30325 - Training dataset is between 2009-01-28 and 2009-02-27.
2019-05-06 10:41:24,410 - INFO - 30325 - Training a model to be tested between 2009-03-31 and 2009-04-30.
2019-05-06 10:41:24,412 - INFO - 30325 - Training dataset is between 2009-03-02 and 2009-03-30.
2019-05-06 10:41:24,421 - INFO - 30325 - Training a model to be tested between 2009-04-30 and 20

2019-05-06 10:41:24,796 - INFO - 30325 - Training dataset is between 2012-03-30 and 2012-04-27.
2019-05-06 10:41:24,804 - INFO - 30325 - Training a model to be tested between 2012-05-31 and 2012-06-30.
2019-05-06 10:41:24,807 - INFO - 30325 - Training dataset is between 2012-04-30 and 2012-05-30.
2019-05-06 10:41:24,815 - INFO - 30325 - Training a model to be tested between 2012-06-30 and 2012-07-31.
2019-05-06 10:41:24,817 - INFO - 30325 - Training dataset is between 2012-05-30 and 2012-06-29.
2019-05-06 10:41:24,825 - INFO - 30325 - Training a model to be tested between 2012-07-31 and 2012-08-31.
2019-05-06 10:41:24,827 - INFO - 30325 - Training dataset is between 2012-07-02 and 2012-07-30.
2019-05-06 10:41:24,835 - INFO - 30325 - Training a model to be tested between 2012-08-31 and 2012-09-30.
2019-05-06 10:41:24,837 - INFO - 30325 - Training dataset is between 2012-07-31 and 2012-08-30.
2019-05-06 10:41:24,846 - INFO - 30325 - Training a model to be tested between 2012-09-30 and 20

2019-05-06 10:41:25,219 - INFO - 30325 - Training dataset is between 2015-08-31 and 2015-09-29.
2019-05-06 10:41:25,227 - INFO - 30325 - Training a model to be tested between 2015-10-31 and 2015-11-30.
2019-05-06 10:41:25,229 - INFO - 30325 - Training dataset is between 2015-09-30 and 2015-10-30.
2019-05-06 10:41:25,237 - INFO - 30325 - Training a model to be tested between 2015-11-30 and 2015-12-31.
2019-05-06 10:41:25,240 - INFO - 30325 - Training dataset is between 2015-10-30 and 2015-11-27.
2019-05-06 10:41:25,248 - INFO - 30325 - Training a model to be tested between 2015-12-31 and 2016-01-31.
2019-05-06 10:41:25,250 - INFO - 30325 - Training dataset is between 2015-11-30 and 2015-12-30.
2019-05-06 10:41:25,258 - INFO - 30325 - Training a model to be tested between 2016-01-31 and 2016-02-29.
2019-05-06 10:41:25,260 - INFO - 30325 - Training dataset is between 2015-12-31 and 2016-01-29.
2019-05-06 10:41:25,268 - INFO - 30325 - Training a model to be tested between 2016-02-29 and 20

In [19]:
get_backtest_performance_metrics(df_backtest.ret, df_backtest.benchmark_ret, with_benchmark=True, with_delta=True)

Unnamed: 0,main,benchmark,delta
alpha,0.125131,-3.322314e-16,
beta,0.153543,1.0,
cagr,0.124259,0.05047234,0.073787
max_drawdown,-0.313869,-0.5518942,0.238025
return,8.132907,1.534235,6.598672
sharpe,0.70579,0.3527034,0.353087
var,-0.017454,-0.01925716,0.001804
volatility,0.191997,0.1916436,0.000353


Using the open, low and high prices, the CAGR also increases to 12.4%.

Wait: with open, low and high prices the CAGR is 12.4%, but with open, low, high and close prices
the CAGR was near 5%. Was that earlier result correct? Does adding the close price really degrade
the regression's performance? Let's check the model with OLHC prices (open, low, high and close prices) again.

In [20]:
x_vars = ['open_adj', 'low_adj', 'high_adj', 'close_adj']
y_var = 'close_adj_shift_1'
buy_price_col = 'close_adj'
sell_price_col = 'close_adj_shift_1'
model_class = LinearRegression
model_params = {'fit_intercept': True}

df_backtest = train_model_and_backtest_regressor(df, x_vars=x_vars, y_var=y_var, 
    buy_price_col=buy_price_col, sell_price_col=sell_price_col,
    model_class=model_class, model_params=model_params, 
    backtest_start='2000-02-01', backtest_end='2018-12-31', 
    model_update_frequency='M', train_history_period=relativedelta(months=1))

2019-05-06 10:41:32,842 - DEBUG - 30325 - 228 periods to backtest: ['2000-02-01', '2000-02-29', '2000-03-31', '2000-04-30', '2000-05-31', '2000-06-30', '2000-07-31', '2000-08-31', '2000-09-30', '2000-10-31', '2000-11-30', '2000-12-31', '2001-01-31', '2001-02-28', '2001-03-31', '2001-04-30', '2001-05-31', '2001-06-30', '2001-07-31', '2001-08-31', '2001-09-30', '2001-10-31', '2001-11-30', '2001-12-31', '2002-01-31', '2002-02-28', '2002-03-31', '2002-04-30', '2002-05-31', '2002-06-30', '2002-07-31', '2002-08-31', '2002-09-30', '2002-10-31', '2002-11-30', '2002-12-31', '2003-01-31', '2003-02-28', '2003-03-31', '2003-04-30', '2003-05-31', '2003-06-30', '2003-07-31', '2003-08-31', '2003-09-30', '2003-10-31', '2003-11-30', '2003-12-31', '2004-01-31', '2004-02-29', '2004-03-31', '2004-04-30', '2004-05-31', '2004-06-30', '2004-07-31', '2004-08-31', '2004-09-30', '2004-10-31', '2004-11-30', '2004-12-31', '2005-01-31', '2005-02-28', '2005-03-31', '2005-04-30', '2005-05-31', '2005-06-30', '2005-07

2019-05-06 10:41:33,141 - INFO - 30325 - Training dataset is between 2001-12-31 and 2002-01-30.
2019-05-06 10:41:33,151 - INFO - 30325 - Training a model to be tested between 2002-02-28 and 2002-03-31.
2019-05-06 10:41:33,153 - INFO - 30325 - Training dataset is between 2002-01-28 and 2002-02-27.
2019-05-06 10:41:33,165 - INFO - 30325 - Training a model to be tested between 2002-03-31 and 2002-04-30.
2019-05-06 10:41:33,167 - INFO - 30325 - Training dataset is between 2002-02-28 and 2002-03-28.
2019-05-06 10:41:33,175 - INFO - 30325 - Training a model to be tested between 2002-04-30 and 2002-05-31.
2019-05-06 10:41:33,179 - INFO - 30325 - Training dataset is between 2002-04-01 and 2002-04-29.
2019-05-06 10:41:33,188 - INFO - 30325 - Training a model to be tested between 2002-05-31 and 2002-06-30.
2019-05-06 10:41:33,190 - INFO - 30325 - Training dataset is between 2002-04-30 and 2002-05-30.
2019-05-06 10:41:33,200 - INFO - 30325 - Training a model to be tested between 2002-06-30 and 20

2019-05-06 10:41:33,591 - INFO - 30325 - Training dataset is between 2005-05-31 and 2005-06-29.
2019-05-06 10:41:33,599 - INFO - 30325 - Training a model to be tested between 2005-07-31 and 2005-08-31.
2019-05-06 10:41:33,601 - INFO - 30325 - Training dataset is between 2005-06-30 and 2005-07-29.
2019-05-06 10:41:33,609 - INFO - 30325 - Training a model to be tested between 2005-08-31 and 2005-09-30.
2019-05-06 10:41:33,612 - INFO - 30325 - Training dataset is between 2005-08-01 and 2005-08-30.
2019-05-06 10:41:33,619 - INFO - 30325 - Training a model to be tested between 2005-09-30 and 2005-10-31.
2019-05-06 10:41:33,621 - INFO - 30325 - Training dataset is between 2005-08-30 and 2005-09-29.
2019-05-06 10:41:33,629 - INFO - 30325 - Training a model to be tested between 2005-10-31 and 2005-11-30.
2019-05-06 10:41:33,631 - INFO - 30325 - Training dataset is between 2005-09-30 and 2005-10-28.
2019-05-06 10:41:33,638 - INFO - 30325 - Training a model to be tested between 2005-11-30 and 20

2019-05-06 10:41:34,031 - INFO - 30325 - Training dataset is between 2008-10-30 and 2008-11-28.
2019-05-06 10:41:34,039 - INFO - 30325 - Training a model to be tested between 2008-12-31 and 2009-01-31.
2019-05-06 10:41:34,041 - INFO - 30325 - Training dataset is between 2008-12-01 and 2008-12-30.
2019-05-06 10:41:34,049 - INFO - 30325 - Training a model to be tested between 2009-01-31 and 2009-02-28.
2019-05-06 10:41:34,051 - INFO - 30325 - Training dataset is between 2008-12-31 and 2009-01-30.
2019-05-06 10:41:34,061 - INFO - 30325 - Training a model to be tested between 2009-02-28 and 2009-03-31.
2019-05-06 10:41:34,064 - INFO - 30325 - Training dataset is between 2009-01-28 and 2009-02-27.
2019-05-06 10:41:34,072 - INFO - 30325 - Training a model to be tested between 2009-03-31 and 2009-04-30.
2019-05-06 10:41:34,074 - INFO - 30325 - Training dataset is between 2009-03-02 and 2009-03-30.
2019-05-06 10:41:34,083 - INFO - 30325 - Training a model to be tested between 2009-04-30 and 20

2019-05-06 10:41:34,495 - INFO - 30325 - Training dataset is between 2012-03-30 and 2012-04-27.
2019-05-06 10:41:34,504 - INFO - 30325 - Training a model to be tested between 2012-05-31 and 2012-06-30.
2019-05-06 10:41:34,506 - INFO - 30325 - Training dataset is between 2012-04-30 and 2012-05-30.
2019-05-06 10:41:34,515 - INFO - 30325 - Training a model to be tested between 2012-06-30 and 2012-07-31.
2019-05-06 10:41:34,517 - INFO - 30325 - Training dataset is between 2012-05-30 and 2012-06-29.
2019-05-06 10:41:34,525 - INFO - 30325 - Training a model to be tested between 2012-07-31 and 2012-08-31.
2019-05-06 10:41:34,528 - INFO - 30325 - Training dataset is between 2012-07-02 and 2012-07-30.
2019-05-06 10:41:34,537 - INFO - 30325 - Training a model to be tested between 2012-08-31 and 2012-09-30.
2019-05-06 10:41:34,539 - INFO - 30325 - Training dataset is between 2012-07-31 and 2012-08-30.
2019-05-06 10:41:34,549 - INFO - 30325 - Training a model to be tested between 2012-09-30 and 20

2019-05-06 10:41:34,932 - INFO - 30325 - Training dataset is between 2015-08-31 and 2015-09-29.
2019-05-06 10:41:34,940 - INFO - 30325 - Training a model to be tested between 2015-10-31 and 2015-11-30.
2019-05-06 10:41:34,943 - INFO - 30325 - Training dataset is between 2015-09-30 and 2015-10-30.
2019-05-06 10:41:34,950 - INFO - 30325 - Training a model to be tested between 2015-11-30 and 2015-12-31.
2019-05-06 10:41:34,952 - INFO - 30325 - Training dataset is between 2015-10-30 and 2015-11-27.
2019-05-06 10:41:34,962 - INFO - 30325 - Training a model to be tested between 2015-12-31 and 2016-01-31.
2019-05-06 10:41:34,964 - INFO - 30325 - Training dataset is between 2015-11-30 and 2015-12-30.
2019-05-06 10:41:34,972 - INFO - 30325 - Training a model to be tested between 2016-01-31 and 2016-02-29.
2019-05-06 10:41:34,974 - INFO - 30325 - Training dataset is between 2015-12-31 and 2016-01-29.
2019-05-06 10:41:34,983 - INFO - 30325 - Training a model to be tested between 2016-02-29 and 20

In [21]:
get_backtest_performance_metrics(df_backtest.ret, df_backtest.benchmark_ret, with_benchmark=True, with_delta=True)

Unnamed: 0,main,benchmark,delta
alpha,0.062292,-3.322314e-16,
beta,0.183535,1.0,
cagr,0.05794,0.05047234,0.007468
max_drawdown,-0.313108,-0.5518942,0.238787
return,1.896996,1.534235,0.362761
sharpe,0.389161,0.3527034,0.036458
var,-0.017778,-0.01925716,0.001479
volatility,0.191945,0.1916436,0.000301


Yes, there was no mistake. A model with open, low and high prices has a 12.43% CAGR, but a model with
open, low, high and close prices has a 5.79% CAGR. This suggests that the close price adds noise and, apparently,
is not a good predictor of the next day's close price. It is strange, but let's continue our analysis.

Now let's change the objective of the regression. Instead of trying to predict the next day's price, let's
see what happens when we estimate the price one month ahead (in fact, 20 business days).

To change the target, we use `y_var = 'close_adj_shift_20'`.

The strategy is still daily: every business day a long or short trade is placed, and it is closed the next day. The main
change is that a 20-day forecast is used instead of a 1-day forecast. That is, the strategy
goes long if the current price is lower than the price predicted for 20 days ahead, and goes short if the current
price is higher than that prediction.
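
The trading rule described above can be sketched as follows. This is only an illustration under stated assumptions, not the code inside `train_model_and_backtest_regressor`: the function names `daily_signal` and `daily_strategy_return`, the `predicted` column and the toy prices are all hypothetical, and treating a tie as a short position is an arbitrary choice.

```python
import numpy as np
import pandas as pd


def daily_signal(current_price: pd.Series, predicted_price: pd.Series) -> pd.Series:
    """Return +1 (long) where the forecast is above the current price, else -1 (short).

    Assumption: ties (forecast == current price) are treated as short.
    """
    signal = np.where(predicted_price > current_price, 1, -1)
    return pd.Series(signal, index=current_price.index, name="signal")


def daily_strategy_return(buy_price: pd.Series, sell_price: pd.Series,
                          signal: pd.Series) -> pd.Series:
    """One-day return of a position opened at buy_price and closed at sell_price."""
    return signal * (sell_price / buy_price - 1.0)


# Toy data (hypothetical values): today's close, next day's close, and a forecast.
toy = pd.DataFrame({
    "close_adj": [100.0, 102.0, 101.0],
    "close_adj_shift_1": [102.0, 101.0, 103.0],
    "predicted": [105.0, 99.0, 104.0],
})
toy["signal"] = daily_signal(toy["close_adj"], toy["predicted"])
toy["ret"] = daily_strategy_return(toy["close_adj"], toy["close_adj_shift_1"], toy["signal"])
print(toy)
```

Note that the forecast only decides the direction of the trade; the realized return always comes from the one-day price change between `close_adj` and `close_adj_shift_1`.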

In [22]:
x_vars = ['open_adj', 'low_adj', 'high_adj']
y_var = 'close_adj_shift_20'
buy_price_col = 'close_adj'
sell_price_col = 'close_adj_shift_1'
model_class = LinearRegression
model_params = {'fit_intercept': True}

df_backtest = train_model_and_backtest_regressor(df, x_vars=x_vars, y_var=y_var, 
    buy_price_col=buy_price_col, sell_price_col=sell_price_col,
    model_class=model_class, model_params=model_params, 
    backtest_start='2000-02-01', backtest_end='2018-12-31', 
    model_update_frequency='M', train_history_period=relativedelta(months=1))

2019-05-06 10:41:35,552 - DEBUG - 30325 - 228 periods to backtest: ['2000-02-01', '2000-02-29', '2000-03-31', '2000-04-30', '2000-05-31', '2000-06-30', '2000-07-31', '2000-08-31', '2000-09-30', '2000-10-31', '2000-11-30', '2000-12-31', '2001-01-31', '2001-02-28', '2001-03-31', '2001-04-30', '2001-05-31', '2001-06-30', '2001-07-31', '2001-08-31', '2001-09-30', '2001-10-31', '2001-11-30', '2001-12-31', '2002-01-31', '2002-02-28', '2002-03-31', '2002-04-30', '2002-05-31', '2002-06-30', '2002-07-31', '2002-08-31', '2002-09-30', '2002-10-31', '2002-11-30', '2002-12-31', '2003-01-31', '2003-02-28', '2003-03-31', '2003-04-30', '2003-05-31', '2003-06-30', '2003-07-31', '2003-08-31', '2003-09-30', '2003-10-31', '2003-11-30', '2003-12-31', '2004-01-31', '2004-02-29', '2004-03-31', '2004-04-30', '2004-05-31', '2004-06-30', '2004-07-31', '2004-08-31', '2004-09-30', '2004-10-31', '2004-11-30', '2004-12-31', '2005-01-31', '2005-02-28', '2005-03-31', '2005-04-30', '2005-05-31', '2005-06-30', '2005-07

2019-05-06 10:41:35,838 - INFO - 30325 - Training dataset is between 2001-12-31 and 2002-01-30.
2019-05-06 10:41:35,848 - INFO - 30325 - Training a model to be tested between 2002-02-28 and 2002-03-31.
2019-05-06 10:41:35,850 - INFO - 30325 - Training dataset is between 2002-01-28 and 2002-02-27.
2019-05-06 10:41:35,860 - INFO - 30325 - Training a model to be tested between 2002-03-31 and 2002-04-30.
2019-05-06 10:41:35,862 - INFO - 30325 - Training dataset is between 2002-02-28 and 2002-03-28.
2019-05-06 10:41:35,871 - INFO - 30325 - Training a model to be tested between 2002-04-30 and 2002-05-31.
2019-05-06 10:41:35,873 - INFO - 30325 - Training dataset is between 2002-04-01 and 2002-04-29.
2019-05-06 10:41:35,882 - INFO - 30325 - Training a model to be tested between 2002-05-31 and 2002-06-30.
2019-05-06 10:41:35,885 - INFO - 30325 - Training dataset is between 2002-04-30 and 2002-05-30.
2019-05-06 10:41:35,896 - INFO - 30325 - Training a model to be tested between 2002-06-30 and 20

2019-05-06 10:41:36,311 - INFO - 30325 - Training dataset is between 2005-05-31 and 2005-06-29.
2019-05-06 10:41:36,319 - INFO - 30325 - Training a model to be tested between 2005-07-31 and 2005-08-31.
2019-05-06 10:41:36,321 - INFO - 30325 - Training dataset is between 2005-06-30 and 2005-07-29.
2019-05-06 10:41:36,330 - INFO - 30325 - Training a model to be tested between 2005-08-31 and 2005-09-30.
2019-05-06 10:41:36,332 - INFO - 30325 - Training dataset is between 2005-08-01 and 2005-08-30.
2019-05-06 10:41:36,341 - INFO - 30325 - Training a model to be tested between 2005-09-30 and 2005-10-31.
2019-05-06 10:41:36,343 - INFO - 30325 - Training dataset is between 2005-08-30 and 2005-09-29.
2019-05-06 10:41:36,351 - INFO - 30325 - Training a model to be tested between 2005-10-31 and 2005-11-30.
2019-05-06 10:41:36,353 - INFO - 30325 - Training dataset is between 2005-09-30 and 2005-10-28.
2019-05-06 10:41:36,361 - INFO - 30325 - Training a model to be tested between 2005-11-30 and 20

2019-05-06 10:41:36,769 - INFO - 30325 - Training dataset is between 2008-10-30 and 2008-11-28.
2019-05-06 10:41:36,776 - INFO - 30325 - Training a model to be tested between 2008-12-31 and 2009-01-31.
2019-05-06 10:41:36,779 - INFO - 30325 - Training dataset is between 2008-12-01 and 2008-12-30.
2019-05-06 10:41:36,787 - INFO - 30325 - Training a model to be tested between 2009-01-31 and 2009-02-28.
2019-05-06 10:41:36,789 - INFO - 30325 - Training dataset is between 2008-12-31 and 2009-01-30.
2019-05-06 10:41:36,798 - INFO - 30325 - Training a model to be tested between 2009-02-28 and 2009-03-31.
2019-05-06 10:41:36,800 - INFO - 30325 - Training dataset is between 2009-01-28 and 2009-02-27.
2019-05-06 10:41:36,808 - INFO - 30325 - Training a model to be tested between 2009-03-31 and 2009-04-30.
2019-05-06 10:41:36,810 - INFO - 30325 - Training dataset is between 2009-03-02 and 2009-03-30.
2019-05-06 10:41:36,818 - INFO - 30325 - Training a model to be tested between 2009-04-30 and 20

2019-05-06 10:41:37,220 - INFO - 30325 - Training dataset is between 2012-03-30 and 2012-04-27.
2019-05-06 10:41:37,229 - INFO - 30325 - Training a model to be tested between 2012-05-31 and 2012-06-30.
2019-05-06 10:41:37,231 - INFO - 30325 - Training dataset is between 2012-04-30 and 2012-05-30.
2019-05-06 10:41:37,239 - INFO - 30325 - Training a model to be tested between 2012-06-30 and 2012-07-31.
2019-05-06 10:41:37,241 - INFO - 30325 - Training dataset is between 2012-05-30 and 2012-06-29.
2019-05-06 10:41:37,250 - INFO - 30325 - Training a model to be tested between 2012-07-31 and 2012-08-31.
2019-05-06 10:41:37,252 - INFO - 30325 - Training dataset is between 2012-07-02 and 2012-07-30.
2019-05-06 10:41:37,260 - INFO - 30325 - Training a model to be tested between 2012-08-31 and 2012-09-30.
2019-05-06 10:41:37,262 - INFO - 30325 - Training dataset is between 2012-07-31 and 2012-08-30.
2019-05-06 10:41:37,271 - INFO - 30325 - Training a model to be tested between 2012-09-30 and 20

2019-05-06 10:41:37,648 - INFO - 30325 - Training dataset is between 2015-08-31 and 2015-09-29.
2019-05-06 10:41:37,657 - INFO - 30325 - Training a model to be tested between 2015-10-31 and 2015-11-30.
2019-05-06 10:41:37,659 - INFO - 30325 - Training dataset is between 2015-09-30 and 2015-10-30.
2019-05-06 10:41:37,667 - INFO - 30325 - Training a model to be tested between 2015-11-30 and 2015-12-31.
2019-05-06 10:41:37,669 - INFO - 30325 - Training dataset is between 2015-10-30 and 2015-11-27.
2019-05-06 10:41:37,678 - INFO - 30325 - Training a model to be tested between 2015-12-31 and 2016-01-31.
2019-05-06 10:41:37,680 - INFO - 30325 - Training dataset is between 2015-11-30 and 2015-12-30.
2019-05-06 10:41:37,688 - INFO - 30325 - Training a model to be tested between 2016-01-31 and 2016-02-29.
2019-05-06 10:41:37,690 - INFO - 30325 - Training dataset is between 2015-12-31 and 2016-01-29.
2019-05-06 10:41:37,698 - INFO - 30325 - Training a model to be tested between 2016-02-29 and 20

In [23]:
get_backtest_performance_metrics(df_backtest.ret, df_backtest.benchmark_ret, with_benchmark=True, with_delta=True)

Unnamed: 0,main,benchmark,delta
alpha,0.549535,-3.322314e-16,
beta,-0.055703,1.0,
cagr,0.694489,0.05047234,0.644017
max_drawdown,-0.194069,-0.5518942,0.357825
return,21152.394548,1.534235,21150.860313
sharpe,2.885469,0.3527034,2.532765
var,-0.014543,-0.01925716,0.004714
volatility,0.189144,0.1916436,-0.002499


Hell yeah. The model has a 69% annual return (CAGR). It implies that every dollar invested in
February 2000 would have returned $21,152.39 by December 2018. We are millionaires with a linear regression model.
Thanks, Gauss, for the least squares!
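
As a rough sanity check on these figures, the CAGR can be compounded over the backtest window. This is a minimal sketch, assuming a 365.25-day year and the `backtest_start` and `backtest_end` dates passed to the backtest; the helper may annualize differently, so only the order of magnitude is expected to match the reported total return.

```python
from datetime import date

cagr = 0.694489                      # CAGR reported in the metrics table
start = date(2000, 2, 1)             # backtest_start
end = date(2018, 12, 31)             # backtest_end

# Assumption: calendar days divided by 365.25 as the year-count convention.
years = (end - start).days / 365.25
growth_factor = (1.0 + cagr) ** years

print(f"{years:.2f} years, growth factor of about {growth_factor:,.0f}x")
```

A growth factor in the tens of thousands from a plain linear regression on daily prices is a strong hint that something in the experiment design is wrong.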

What is happening here is one of the most common mistakes in machine learning: look-ahead bias. This is a flaw
in the design, which produces unrealistic models. The regression itself works correctly, but the model is not
realistic, because it uses data that we would not have at evaluation time.

To understand what happens, let's take a look at the backtesting logs.
```
2019-05-05 16:10:31,432 - INFO - 22848 - Training a model to be tested between 2000-02-01 and 2000-02-29.
2019-05-05 16:10:31,434 - INFO - 22848 - Training dataset is between 2000-01-03 and 2000-01-31.
```
Those are the first lines of the backtesting logs. The model is backtested on February 2000, using
data from January 2000, with a forecast horizon of 20 business days. Let's see what the training dataset
looks like at that moment:

In [24]:
df_tmp = pd.concat(
    [
        df, 
        df[['date']].shift(-20).rename({'date': 'date_shift_20'}, axis=1)
    ], 
    axis=1,
)
df_tmp[(df_tmp.date>='2000-01-20') & (df_tmp.date<='2000-01-31')][['date', 'open_adj', 'close_adj', 'close_adj_shift_20', 'date_shift_20']]

Unnamed: 0,date,open_adj,close_adj,close_adj_shift_20,date_shift_20
12,2000-01-20,102.493236,100.945953,96.434715,2000-02-17
13,2000-01-21,101.468956,100.727989,94.364395,2000-02-18
14,2000-01-24,101.577884,97.873047,94.124672,2000-02-22
15,2000-01-25,97.992985,98.984596,95.23616,2000-02-23
16,2000-01-26,98.330748,98.199989,93.318352,2000-02-24
17,2000-01-27,98.919159,97.807739,92.980522,2000-02-25
18,2000-01-28,97.241088,94.756668,94.931038,2000-02-28
19,2000-01-31,94.713105,97.328285,95.846313,2000-02-29


The problem is now much clearer. We are in February 2000, training a model with January 2000 quotes.
For the 2000-01-20 row, we train the model by telling it the price on 2000-02-17. However, we are evaluating the model on February 2000. At the start of that period, we **do not** know the price on 2000-02-17! The look-ahead bias
arises because, during backtesting, we train with information that we do not yet have at that moment. That's the
root of the issue.
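
The leak can also be measured. The sketch below is a diagnostic only, under stated assumptions: `count_leaky_rows` is a hypothetical helper, the calendar is a toy business-day range from `pd.bdate_range` (it ignores market holidays, so it does not match the real quote dates exactly), and a training row is called leaky when its target date falls on or after the first day of the test period.

```python
import pandas as pd


def count_leaky_rows(dates: pd.Series, train_start: str, train_end: str,
                     test_start: str, horizon: int) -> int:
    """Count training rows whose target date is not yet known at test_start."""
    # Date of the target for each row: the row `horizon` positions ahead.
    target_dates = dates.shift(-horizon)
    in_train = (dates >= pd.Timestamp(train_start)) & (dates <= pd.Timestamp(train_end))
    # Rows with no target (NaT) compare as False and are not counted.
    leaky = in_train & (target_dates >= pd.Timestamp(test_start))
    return int(leaky.sum())


# Toy calendar (assumption: plain weekdays, no holidays).
dates = pd.Series(pd.bdate_range("2000-01-03", "2000-03-31"))

n_train = int(((dates >= "2000-01-03") & (dates <= "2000-01-31")).sum())
n_leaky = count_leaky_rows(dates, "2000-01-03", "2000-01-31", "2000-02-01", horizon=20)
print(f"{n_leaky} of {n_train} training rows use a target from the test period")
```

With a 20-day horizon, almost every January row has its target inside February, which is exactly the period being tested.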

In the next notebook, we are going to solve this issue.