# MA707 Report - Investigation - LSTM and Sequential (spring 2019, Blackjack)

## Contents
1. Disclaimer
2. LSTM Model
3. Sequential Model

## 1. Disclaimer

NOTE: This notebook stands apart from our other notebooks in that it is an exploratory attempt to build a model of our own. In the interest of full disclosure, some of the code here is "foreign" to us: we do not fully understand how it works internally, only how to coerce it into running and producing results. We set out to use LSTM modelling and explored the Keras Sequential model and how it might be wrapped in a class.

In [5]:
%run "./1. Class demonstrations"

In [6]:
from math import sqrt
from numpy import concatenate
from matplotlib import pyplot
from pandas import read_csv
from pandas import DataFrame
from pandas import concat
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import mean_squared_error
from keras.models import Sequential
from keras.layers import Dropout
from keras.layers import Dense
from keras.layers import LSTM

The cell above imports the libraries we will need. Note that the keras library needs a special cluster connection to run here, so be sure to attach this notebook to the class.ma707.test cluster.

The code below is the same as written earlier: it reads the grid search results into a data frame and keeps only the important columns (the parameter values and the mean test score).

In [9]:
def display_pdf(a_pdf):
  display(spark.createDataFrame(a_pdf,verifySchema=False))

In [10]:
def est_grid_results_pdf(my_est_grid_obj,est_tag=None,fea_tag=None): 
  import pandas as pd
  import numpy  as np
  res_pdf = pd.DataFrame(data=my_est_grid_obj.cv_results_) \
           .loc[:,lambda df: np.logical_or(df.columns.str.startswith('param_'),
                                           df.columns.str.endswith('test_score'))
               ] \
           .loc[:,lambda df: np.logical_not(df.columns.str.startswith('split'))
               ] \
           .drop(['rank_test_score', 'std_test_score'], 
                 axis=1)
  res_pdf.columns = [column.replace('param_','') for column in list(res_pdf.columns)]
  if est_tag is not None: res_pdf = res_pdf.assign(est_tag=est_tag)
  if fea_tag is not None: res_pdf = res_pdf.assign(fea_tag=fea_tag)
  return res_pdf.sort_values('mean_test_score', ascending = False)
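
To make the column filtering concrete, here is a minimal sketch that applies the same steps to a hand-made dictionary shaped like the grid search's cv_results_ attribute. The dictionary and its values are invented for illustration and are not results from this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for GridSearchCV.cv_results_ (all values are made up).
fake_cv_results = {
    'mean_fit_time':        [1.2, 0.9],
    'param_lstm__nb_units': [10, 100],
    'split0_test_score':    [0.70, 0.40],
    'split1_test_score':    [0.80, 0.60],
    'mean_test_score':      [0.75, 0.50],
    'std_test_score':       [0.05, 0.10],
    'rank_test_score':      [1, 2],
}

pdf = pd.DataFrame(fake_cv_results)
# Keep only the parameter columns and the test-score columns.
keep = np.logical_or(pdf.columns.str.startswith('param_'),
                     pdf.columns.str.endswith('test_score'))
pdf = pdf.loc[:, keep]
# Drop the per-split scores, then the rank and std summaries.
pdf = pdf.loc[:, ~pdf.columns.str.startswith('split')]
pdf = pdf.drop(['rank_test_score', 'std_test_score'], axis=1)
# Strip the 'param_' prefix so the column names match the grid keys.
pdf.columns = [c.replace('param_', '') for c in pdf.columns]
result_columns = sorted(pdf.columns)
print(result_columns)
```

Only the parameter column and the mean test score survive, which is the shape of the result tables shown later in this notebook.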

Code to create the train and test data; see Notebook #3 for a more in-depth explanation.

In [12]:
def create_train_test_ts(fea_pdf, tgt_ser, trn_prop=0.8):
  trn_len = int(trn_prop * len(fea_pdf))
  return (fea_pdf.iloc[:trn_len],
          fea_pdf.iloc[ trn_len:],
          tgt_ser.iloc[:trn_len],
          tgt_ser.iloc[ trn_len:]
         )
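
As a quick illustration of what this split does, the sketch below repeats the same logic on ten rows of toy data. The function name split_in_time_order and the data are ours, made up for this example; the point is that rows are never shuffled, so every training row comes before every test row.

```python
import pandas as pd

def split_in_time_order(fea_pdf, tgt_ser, trn_prop=0.8):
    # Same logic as create_train_test_ts: the first trn_prop of the rows
    # go to training, the remaining rows go to testing.
    trn_len = int(trn_prop * len(fea_pdf))
    return (fea_pdf.iloc[:trn_len], fea_pdf.iloc[trn_len:],
            tgt_ser.iloc[:trn_len], tgt_ser.iloc[trn_len:])

# Toy data: ten consecutive "days" (invented values).
toy_fea = pd.DataFrame({'x': range(10)})
toy_tgt = pd.Series(range(100, 110))

trn_X, tst_X, trn_y, tst_y = split_in_time_order(toy_fea, toy_tgt, trn_prop=0.8)
print(len(trn_X), len(tst_X))
# Every training row precedes every test row.
print(trn_X.index.max() < tst_X.index.min())
```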

The three code blocks below create plots for comparing the predicted values against the actual results.

In [14]:
def plot_comparison(test_p, test_y):
  pyplot.clf()
  pyplot.plot(test_y, label='actual')
  pyplot.plot(test_p, label='predicted')
  pyplot.legend()
  display(pyplot.show())

In [15]:
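def plot_history(history):
  # history: the object returned by the Keras model's fit method.
  # 'val_loss' is only present if validation data was passed to fit.
  pyplot.clf()
  pyplot.plot(history.history['loss']    , label='train')
  pyplot.plot(history.history['val_loss'], label='test')
  pyplot.legend()
  display(pyplot.show())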

In [16]:
def plot_actual(target_ser):
  import numpy as np
  pyplot.clf()
  pyplot.plot(target_ser,label='actual')
  pyplot.legend()
  display(pyplot.show())

## 2. LSTM Model

The code below is, as we understand it, based on code provided in class, with some modifications. We were interested in testing different optimization, activation, and loss functions for the LSTM model, so we let the user pass these in as arguments. The hope was to mix and match these over a small subset of options and find the combination that works best for the given data frame.

The model works as follows. Besides the arguments already mentioned, the init method takes the number of epochs, the batch size, and the number of units, and stores each as an attribute on self. It also creates the model as a Keras Sequential model, which is simply a linear stack of layers. The fit method first fits a min-max scaler to each of X and y and keeps the scaled values. The scaled X is then reshaped into a three-dimensional array of shape (samples, time steps, features), in this case with one time step per sample. We then add an LSTM layer to the Sequential model, giving it the number of units and the input shape. We add a dense layer with a single output, which is where we set the activation function for the output. We then compile the model with the loss and optimizer specified in the init method, and call the Sequential model's fit method with the necessary inputs. The predict method scales and reshapes new inputs in the same way and converts the model's predictions back to the original units; it will be used with the different grid search parameters we run.

NOTE: After trying to add the hyperparameters for activation, optimization, and loss, we got an error, because they are not arguments of the Sequential model's fit method. See the linked article: [Link Machine Learning](https://machinelearningmastery.com/grid-search-hyperparameters-deep-learning-models-python-keras/)

In [19]:
from sklearn.base import BaseEstimator, RegressorMixin
from keras.layers import LSTM
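class LSTMWrapper_failure(BaseEstimator,RegressorMixin,LSTM):
  # Kept as a record of the failed attempt described above.
  # The defaults for the three string arguments are placeholders added so that
  # the signature is valid Python (arguments without defaults cannot follow
  # arguments with defaults).
  def __init__(self,verbose=0,epochs=1,batch_size=1,nb_units=50,
               loss_str='mae',optimizer_str='adam',activation_str='linear'):
    self.model       = Sequential()
    self.verbose     = verbose
    self.epochs      = epochs
    self.batch_size  = batch_size
    self.nb_units    = nb_units
    self.loss        = loss_str
    self.optimizer   = optimizer_str
    self.activation  = activation_str
    return 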
  
  def fit(self,X,y=None):
    from sklearn.preprocessing import MinMaxScaler
    self.scl_X = MinMaxScaler(feature_range=(0, 1))
    self.scl_y = MinMaxScaler(feature_range=(0, 1))
    X_scl = self.scl_X.fit_transform(X)
    y_scl = self.scl_y.fit_transform(y.reshape(-1,1))
    X_scl_re = X_scl.reshape((X_scl.shape[0], 1, X_scl.shape[1]))
    self.model.add(LSTM(self.nb_units, 
                        input_shape=(X_scl_re.shape[1], X_scl_re.shape[2])))
    self.model.add(Dense(1,activation = self.activation))
    self.model.compile(loss= self.loss, optimizer= self.optimizer)
    self.model.fit(X_scl_re, y_scl, 
                   epochs    =self.epochs, 
                   batch_size=self.batch_size, 
                   verbose   =self.verbose, 
                   shuffle   =False)
    return self
  
  def predict(self,X,y=None):
    X_scl    = self.scl_X.transform(X)
    X_scl_re = X_scl.reshape((X_scl.shape[0], 1, X_scl.shape[1]))
    return self.scl_y.inverse_transform(self.model.predict(X_scl_re))
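
The scaling and reshaping steps in fit and predict can be tried on their own, without Keras. The sketch below uses a small invented array; only the shapes and the round trip through the scaler matter here.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy feature matrix: 4 rows (samples) and 3 columns (features); values invented.
X = np.array([[1.0, 10.0, 100.0],
              [2.0, 20.0, 200.0],
              [3.0, 30.0, 300.0],
              [4.0, 40.0, 400.0]])
y = np.array([5.0, 6.0, 7.0, 8.0])

scl_X = MinMaxScaler(feature_range=(0, 1))
scl_y = MinMaxScaler(feature_range=(0, 1))
X_scl = scl_X.fit_transform(X)
y_scl = scl_y.fit_transform(y.reshape(-1, 1))   # the scaler expects a 2-D array

# Keras LSTM layers expect input of shape (samples, time steps, features);
# here each sample is treated as a sequence with a single time step.
X_scl_re = X_scl.reshape((X_scl.shape[0], 1, X_scl.shape[1]))
print(X_scl_re.shape)

# inverse_transform maps scaled values back to the original units,
# which is what predict does with the model's output.
y_back = scl_y.inverse_transform(y_scl).ravel()
print(y_back)
```

Because each sample carries only one time step, the LSTM never sees a sequence longer than one row; the lagged columns are what carry the history.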

Per the article above and a second one (linked below), we determined that we cannot in fact tune the model in the way we wanted, but we can tune the number of neurons for our dataset. We restate the original code for the wrapper class with one addition: a dropout layer. Dropout helps reduce overfitting, which should make the model more useful on test data. We also note that, despite several attempts (above and elsewhere), grid searching over loss, optimizer, and so on would require a separate wrapper class for our Sequential model, as those hyperparameters are not inputs of the LSTM layer. [Hyperparameter LSTM](https://machinelearningmastery.com/tune-lstm-hyperparameters-keras-time-series-forecasting/)

In [21]:
from sklearn.base import BaseEstimator, RegressorMixin
from keras.layers import LSTM
class LSTMWrapper_w_dropout(BaseEstimator,RegressorMixin,LSTM):
  def __init__(self,verbose=0,epochs=1,batch_size=1,nb_units=50):
    self.model       = Sequential()
    self.verbose     = verbose
    self.epochs      = epochs
    self.batch_size  = batch_size
    self.nb_units    = nb_units
    return 
  
  def fit(self,X,y=None):
    from sklearn.preprocessing import MinMaxScaler
    self.scl_X = MinMaxScaler(feature_range=(0, 1))
    self.scl_y = MinMaxScaler(feature_range=(0, 1))
    X_scl = self.scl_X.fit_transform(X)
    y_scl = self.scl_y.fit_transform(y.reshape(-1,1))
    X_scl_re = X_scl.reshape((X_scl.shape[0], 1, X_scl.shape[1]))
    self.model.add(LSTM(self.nb_units, 
                        input_shape=(X_scl_re.shape[1], X_scl_re.shape[2])))
    self.model.add (Dropout(0.2))
    self.model.add(Dense(1))
    self.model.compile(loss='mae', optimizer='adam')
    self.model.fit(X_scl_re, y_scl, 
                   epochs    =self.epochs, 
                   batch_size=self.batch_size, 
                   verbose   =self.verbose, 
                   shuffle   =False)
    return self
  
  def predict(self,X,y=None):
    X_scl    = self.scl_X.transform(X)
    X_scl_re = X_scl.reshape((X_scl.shape[0], 1, X_scl.shape[1]))
    return self.scl_y.inverse_transform(self.model.predict(X_scl_re))
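
As far as we understand scikit-learn's conventions, a grid search can only tune values that appear as arguments of the wrapper's init method and are stored on self under the same names, because it reads and sets them through get_params and set_params. The sketch below shows that mechanism on a hypothetical ToyWrapper class with no model inside; it is an illustration of the convention, not the wrapper used in this notebook.

```python
from sklearn.base import BaseEstimator, clone

class ToyWrapper(BaseEstimator):
    # Hypothetical minimal estimator: each __init__ argument is stored
    # on self under exactly the same name.
    def __init__(self, loss='mae', optimizer='adam', nb_units=50):
        self.loss = loss
        self.optimizer = optimizer
        self.nb_units = nb_units

est = ToyWrapper()
params = est.get_params()                       # read from the __init__ signature
est.set_params(optimizer='sgd', nb_units=10)    # what a grid search does per candidate
est_copy = clone(est)                           # fresh, unfitted copy with the same params
print(params)
print(est_copy.get_params())
```

Inside a Pipeline the same parameters are addressed with the step name as a prefix, which is why the grids in this notebook use keys such as lstm__nb_units.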

The code below is a simple preprocessing pipeline that prepares the input for our LSTM model. It uses only the bci data frame, as we are not going to work on NLP with this model. Note that we create two pipelines, one for lag 3 and one for lag 7; we will compare how the two models differ through the hyperparameter selection process.

In [23]:
from sklearn.pipeline import Pipeline
bci_pipe_lag3 = \
Pipeline(steps=[
    ('fea_one', FeatureUnionDF(transformer_list=[
      ('tgt_var'     ,CreateTargetVarDF(var='bci_5tc')),
      ('lag_bci_vars',CreateLagVarsDF(var_list=['c5', 'c7', 'p1a_03', 'p2a_03', 'p4_03', 'p3a_iv', 'shfe_al3',
                                                'rici', 'ice_kc3', 'cme_sm3', 'cme_lc2', 'opec_orb', 'shfe_cu3',
                                                'cme_ln1', 'cme_fc3', 'p3a_03', 'shfe_rb3', 'cme_s2', 'ice_sb3',
                                                'cme_ln3', 'cme_ln2', 'ice_tib3', 'ice_tib4', 'bci'],
                                      lag_list=[3])),
    ])),
    ('drop_na_rows'  ,DropNaRowsDF(how='any'))
  ])

In [24]:
from sklearn.pipeline import Pipeline
bci_pipe_lag7 = \
Pipeline(steps=[
    ('fea_one', FeatureUnionDF(transformer_list=[
      ('tgt_var'     ,CreateTargetVarDF(var='bci_5tc')),
      ('lag_bci_vars',CreateLagVarsDF(var_list=['c5', 'c7', 'p1a_03', 'p2a_03', 'p4_03', 'p3a_iv', 'shfe_al3',
                                                'rici', 'ice_kc3', 'cme_sm3', 'cme_lc2', 'opec_orb', 'shfe_cu3',
                                                'cme_ln1', 'cme_fc3', 'p3a_03', 'shfe_rb3', 'cme_s2', 'ice_sb3',
                                                'cme_ln3', 'cme_ln2', 'ice_tib3', 'ice_tib4', 'bci'],
                                      lag_list=[7])),
    ])),
    ('drop_na_rows'  ,DropNaRowsDF(how='any'))
  ])
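
CreateLagVarsDF and DropNaRowsDF come from the class demonstrations notebook and are not shown here. As a rough sketch of what we assume they do, the example below builds a 3-row lag with pandas shift on invented data and then drops the rows that have no lagged value. The column name bci_lag3 is made up for the illustration.

```python
import pandas as pd

# Toy daily series (invented values), standing in for one bci column.
toy = pd.DataFrame({'bci': [10, 11, 12, 13, 14, 15]})

# Assumed behaviour of a lag-3 feature: each row sees the value from 3 rows earlier.
toy['bci_lag3'] = toy['bci'].shift(3)

# The first 3 rows have no lagged value, so they are dropped (how='any').
lagged = toy.dropna(how='any')
print(lagged)
```

Under this assumption a lag of 7 discards four more leading rows than a lag of 3 and asks the model to predict further ahead from older information.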

We create the necessary train/test datasets and prepare to begin a grid search on the hyperparameters

In [26]:
fea_tgt_pdf_lag3 = bci_pipe_lag3.fit_transform(bci_pdf)
(trn_fea_pdf_3, tst_fea_pdf_3, 
 trn_tgt_ser_3, tst_tgt_ser_3
) = \
create_train_test_ts(fea_pdf = fea_tgt_pdf_lag3.drop( 'target',axis=1),
                     tgt_ser = fea_tgt_pdf_lag3.loc[:,'target'],
                     trn_prop= 0.9
                    )
(trn_fea_pdf_3.shape, tst_fea_pdf_3.shape, 
 trn_tgt_ser_3.shape, tst_tgt_ser_3.shape
)

In [27]:
fea_tgt_pdf_lag7 = bci_pipe_lag7.fit_transform(bci_pdf)
(trn_fea_pdf_7, tst_fea_pdf_7, 
 trn_tgt_ser_7, tst_tgt_ser_7
) = \
create_train_test_ts(fea_pdf = fea_tgt_pdf_lag7.drop( 'target',axis=1),
                     tgt_ser = fea_tgt_pdf_lag7.loc[:,'target'],
                     trn_prop= 0.9
                    )
(trn_fea_pdf_7.shape, tst_fea_pdf_7.shape, 
 trn_tgt_ser_7.shape, tst_tgt_ser_7.shape
)

Grid search time! We begin with a simple search on the lag 3 data, and then work to tune the parameters for both lag 3 and lag 7. Note that, because of the realizations above, we cannot tune the optimization, activation, and loss functions through this grid search, but we can observe the effect of changing the number of units (neurons), epochs, and batch size.

In [29]:
from spark_sklearn           import GridSearchCV
from sklearn.pipeline        import Pipeline
from sklearn.model_selection import TimeSeriesSplit
from sklearn.metrics         import make_scorer, mean_absolute_error, r2_score
simple_gs = \
GridSearchCV(sc,
             estimator=Pipeline(steps=[('lstm',LSTMWrapper_w_dropout(verbose=2))
                                      ]),
             param_grid={'lstm__nb_units':[10,100],
                         'lstm__epochs':[20,50],
                         'lstm__batch_size':[10,50],
                        },
             cv=TimeSeriesSplit(n_splits=2),
             scoring=make_scorer(r2_score),
             return_train_score=False,
             n_jobs=-1 
            ) 
simple_gs \
  .fit(trn_fea_pdf_3.values,
       trn_tgt_ser_3.values)
display_pdf(est_grid_results_pdf(simple_gs,
                                 est_tag='lstm'))

mean_test_score,lstm__batch_size,lstm__epochs,lstm__nb_units,est_tag
0.8032430591530905,10,50,10,lstm
0.726682879856092,50,50,100,lstm
0.6785707281692293,10,20,10,lstm
0.6697516025576068,10,50,100,lstm
0.6277807480075019,50,50,10,lstm
0.4510614531340376,50,20,10,lstm
0.2809202260884221,50,20,100,lstm
-1.4021268117554886,10,20,100,lstm
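
Each mean_test_score above is an average over the two folds produced by TimeSeriesSplit(n_splits=2). The sketch below shows how that splitter carves up six time-ordered toy rows: the training window always precedes the test window and grows from one fold to the next.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X_toy = np.arange(6).reshape(-1, 1)   # six time-ordered rows (toy data)
splits = [(trn.tolist(), tst.tolist())
          for trn, tst in TimeSeriesSplit(n_splits=2).split(X_toy)]
for trn, tst in splits:
    print('train', trn, 'test', tst)
```

With only two folds, the first fold trains on roughly a third of the training data, so a single poor fold can pull the mean score down sharply.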


In [30]:
from spark_sklearn           import GridSearchCV
from sklearn.pipeline        import Pipeline
from sklearn.model_selection import TimeSeriesSplit
from sklearn.metrics         import make_scorer, mean_absolute_error, r2_score
simple_gs = \
GridSearchCV(sc,
             estimator=Pipeline(steps=[('lstm',LSTMWrapper_w_dropout(verbose=2))
                                      ]),
             param_grid={'lstm__nb_units':[10,100],
                         'lstm__epochs':[20,50],
                         'lstm__batch_size':[10,50],
                        },
             cv=TimeSeriesSplit(n_splits=2),
             scoring=make_scorer(r2_score),
             return_train_score=False,
             n_jobs=-1 
            ) 
simple_gs \
  .fit(trn_fea_pdf_7.values,
       trn_tgt_ser_7.values)
display_pdf(est_grid_results_pdf(simple_gs,
                                 est_tag='lstm'))

mean_test_score,lstm__batch_size,lstm__epochs,lstm__nb_units,est_tag
0.6424707216206376,50,50,100,lstm
0.6132351910186489,10,20,100,lstm
0.5579599995560757,50,50,10,lstm
0.556669210321278,50,20,100,lstm
0.4904139526678436,10,20,10,lstm
0.4891644396191723,10,50,10,lstm
0.4310984451448941,10,50,100,lstm
0.3519033012324962,50,20,10,lstm


We observe from the above that the 7-day lagged variables produce a worse best mean test score (0.64 versus 0.80 for the 3-day lag), and that the top lag 7 configuration uses the larger value of every parameter in our grid (batch size 50, 50 epochs, 100 units). The top lag 3 configuration has only the number of epochs (50) in common with it. We continue by adjusting some hyperparameters to try to improve accuracy.

In [32]:
from spark_sklearn           import GridSearchCV
from sklearn.pipeline        import Pipeline
from sklearn.model_selection import TimeSeriesSplit
from sklearn.metrics         import make_scorer, mean_absolute_error, r2_score
simple_gs = \
GridSearchCV(sc,
             estimator=Pipeline(steps=[('lstm',LSTMWrapper_w_dropout(verbose=2))
                                      ]),
             param_grid={'lstm__nb_units':[10,100],
                         'lstm__epochs':[50,100],
                         'lstm__batch_size':[10,50],
                        },
             cv=TimeSeriesSplit(n_splits=2),
             scoring=make_scorer(r2_score),
             return_train_score=False,
             n_jobs=-1 
            ) 
simple_gs \
  .fit(trn_fea_pdf_3.values,
       trn_tgt_ser_3.values)
display_pdf(est_grid_results_pdf(simple_gs,
                                 est_tag='lstm'))

mean_test_score,lstm__batch_size,lstm__epochs,lstm__nb_units,est_tag
0.7328298996954958,10,100,10,lstm
0.716160934953555,50,50,100,lstm
0.6409753557453437,50,100,100,lstm
0.5849150397498094,10,50,100,lstm
0.5102485408877465,10,50,10,lstm
0.3666080669672868,50,50,10,lstm
0.2933143416422055,50,100,10,lstm
-2.834435625877102,10,100,100,lstm


Interestingly enough, our top result is now worse than before (0.73 versus 0.80), and the configuration that previously scored 0.80 (batch size 10, 50 epochs, 10 units) now scores only 0.51. This is most likely related to the dropout layer, whose random masking makes results vary from run to run. The larger epoch counts also cost a great deal of run time. We therefore go back to 50 epochs and try to find better batch sizes and numbers of units.
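
Some rows in the tables above have negative scores. This is possible because r2_score compares the model against always predicting the mean of the actual values: a score of 0 matches that baseline and anything below 0 is worse than it. A small sketch with invented numbers:

```python
from sklearn.metrics import r2_score

y_true = [1.0, 2.0, 3.0]
r2_perfect = r2_score(y_true, [1.0, 2.0, 3.0])   # predictions equal the actuals
r2_mean    = r2_score(y_true, [2.0, 2.0, 2.0])   # always predict the mean
r2_bad     = r2_score(y_true, [3.0, 2.0, 1.0])   # predictions run the wrong way
print(r2_perfect, r2_mean, r2_bad)
```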

In [34]:
from spark_sklearn           import GridSearchCV
from sklearn.pipeline        import Pipeline
from sklearn.model_selection import TimeSeriesSplit
from sklearn.metrics         import make_scorer, mean_absolute_error, r2_score
simple_gs = \
GridSearchCV(sc,
             estimator=Pipeline(steps=[('lstm',LSTMWrapper_w_dropout(verbose=2))
                                      ]),
             param_grid={'lstm__nb_units':[5,10],
                         'lstm__epochs':[50],
                         'lstm__batch_size':[5,10],
                        },
             cv=TimeSeriesSplit(n_splits=2),
             scoring=make_scorer(r2_score),
             return_train_score=False,
             n_jobs=-1 
            ) 
simple_gs \
  .fit(trn_fea_pdf_3.values,
       trn_tgt_ser_3.values)
display_pdf(est_grid_results_pdf(simple_gs,
                                 est_tag='lstm'))

mean_test_score,lstm__batch_size,lstm__epochs,lstm__nb_units,est_tag
0.8112314577009487,10,50,10,lstm
0.5404755479812213,5,50,10,lstm
-1.4707169035885317,10,50,5,lstm
-2.672992942228093,5,50,5,lstm


We see above that the most accurate model is unchanged. We will therefore try one more time to get better results by marginally increasing the batch size and number of units; otherwise we will conclude that the original model (batch size 10, 50 epochs, 10 units) is the best model for lag 3.

In [36]:
from spark_sklearn           import GridSearchCV
from sklearn.pipeline        import Pipeline
from sklearn.model_selection import TimeSeriesSplit
from sklearn.metrics         import make_scorer, mean_absolute_error, r2_score
simple_gs = \
GridSearchCV(sc,
             estimator=Pipeline(steps=[('lstm',LSTMWrapper_w_dropout(verbose=2))
                                      ]),
             param_grid={'lstm__nb_units':[10,20],
                         'lstm__epochs':[50],
                         'lstm__batch_size':[20,10],
                        },
             cv=TimeSeriesSplit(n_splits=2),
             scoring=make_scorer(r2_score),
             return_train_score=False,
             n_jobs=-1 
            ) 
simple_gs \
  .fit(trn_fea_pdf_3.values,
       trn_tgt_ser_3.values)
display_pdf(est_grid_results_pdf(simple_gs,
                                 est_tag='lstm'))

mean_test_score,lstm__batch_size,lstm__epochs,lstm__nb_units,est_tag
0.5505666315589425,10,50,20,lstm
0.5074810192806085,10,50,10,lstm
0.4773660954592912,20,50,20,lstm
0.3244580617638214,20,50,10,lstm


Based on the above, the one thing we can conclude is that we can't conclude anything with confidence right now. It is possible that the models are not performing as well because of the added dropout. Given that we have yet to beat our first model, we conclude that it is the one to use for lag 3, and move on to improving our model for lag 7.
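
One way to see why it is hard to conclude anything is to collect the scores that the same configuration (batch size 10, 50 epochs, 10 units) received in the four lag 3 grid searches above. The numbers below are copied from those tables; the summary statistics are just a quick sketch of the run-to-run spread.

```python
import numpy as np

# mean_test_score of (batch_size=10, epochs=50, nb_units=10) on lag 3,
# copied from the four grid search tables above.
scores = np.array([0.8032430591530905,
                   0.5102485408877465,
                   0.8112314577009487,
                   0.5074810192806085])
mean_score = scores.mean()
std_score  = scores.std()
spread     = scores.max() - scores.min()
print(round(mean_score, 3), round(std_score, 3), round(spread, 3))
```

The gap between the best and worst run of this single configuration is larger than most of the gaps between different configurations within a table, so ranking configurations from one run each is unreliable.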

In [38]:
from spark_sklearn           import GridSearchCV
from sklearn.pipeline        import Pipeline
from sklearn.model_selection import TimeSeriesSplit
from sklearn.metrics         import make_scorer, mean_absolute_error, r2_score
simple_gs = \
GridSearchCV(sc,
             estimator=Pipeline(steps=[('lstm',LSTMWrapper_w_dropout(verbose=2))
                                      ]),
             param_grid={'lstm__nb_units':[100,200],
                         'lstm__epochs':[20,50],
                         'lstm__batch_size':[50,100],
                        },
             cv=TimeSeriesSplit(n_splits=2),
             scoring=make_scorer(r2_score),
             return_train_score=False,
             n_jobs=-1 
            ) 
simple_gs \
  .fit(trn_fea_pdf_7.values,
       trn_tgt_ser_7.values)
display_pdf(est_grid_results_pdf(simple_gs,
                                 est_tag='lstm'))

mean_test_score,lstm__batch_size,lstm__epochs,lstm__nb_units,est_tag
0.6482755039627647,50,50,100,lstm
0.6455213225541521,100,50,200,lstm
0.6425760584910437,50,50,200,lstm
0.6286908515253123,100,20,200,lstm
0.6251739476010353,100,50,100,lstm
0.5939287504492248,50,20,100,lstm
0.5888666152735456,100,20,100,lstm
0.5844254888722884,50,20,200,lstm


We see essentially no change from the earlier lag 7 result (the best score is 0.648 versus 0.642, with the same top configuration). We therefore try unit counts between 50 and 100 and batch sizes between 20 and 50 (here 75 and 100 units, and batch sizes of 35 and 50), fixing epochs at 50 to reduce run time.

In [40]:
from spark_sklearn           import GridSearchCV
from sklearn.pipeline        import Pipeline
from sklearn.model_selection import TimeSeriesSplit
from sklearn.metrics         import make_scorer, mean_absolute_error, r2_score
simple_gs = \
GridSearchCV(sc,
             estimator=Pipeline(steps=[('lstm',LSTMWrapper_w_dropout(verbose=2))
                                      ]),
             param_grid={'lstm__nb_units':[75,100],
                         'lstm__epochs':[50],
                         'lstm__batch_size':[35,50],
                        },
             cv=TimeSeriesSplit(n_splits=2),
             scoring=make_scorer(r2_score),
             return_train_score=False,
             n_jobs=-1 
            ) 
simple_gs \
  .fit(trn_fea_pdf_7.values,
       trn_tgt_ser_7.values)
display_pdf(est_grid_results_pdf(simple_gs,
                                 est_tag='lstm'))

mean_test_score,lstm__batch_size,lstm__epochs,lstm__nb_units,est_tag
0.6540663727083011,50,50,100,lstm
0.6514147063336818,50,50,75,lstm
0.6251044326217662,35,50,75,lstm
0.5166306088943846,35,50,100,lstm


We achieve only a marginal improvement over the previous run (0.654 versus 0.648); 75 units performs about as well as 100 at batch size 50, and the smaller batch size of 35 does not help. For the sake of trying, we run epochs of 50 and 100 with the batch size fixed at 50 to see the impact.

In [42]:
from spark_sklearn           import GridSearchCV
from sklearn.pipeline        import Pipeline
from sklearn.model_selection import TimeSeriesSplit
from sklearn.metrics         import make_scorer, mean_absolute_error, r2_score
simple_gs = \
GridSearchCV(sc,
             estimator=Pipeline(steps=[('lstm',LSTMWrapper_w_dropout(verbose=2))
                                      ]),
             param_grid={'lstm__nb_units':[75,100],
                         'lstm__epochs':[50, 100],
                         'lstm__batch_size':[50],
                        },
             cv=TimeSeriesSplit(n_splits=2),
             scoring=make_scorer(r2_score),
             return_train_score=False,
             n_jobs=-1 
            ) 
simple_gs \
  .fit(trn_fea_pdf_7.values,
       trn_tgt_ser_7.values)
display_pdf(est_grid_results_pdf(simple_gs,
                                 est_tag='lstm'))

mean_test_score,lstm__batch_size,lstm__epochs,lstm__nb_units,est_tag
0.6559488632443122,50,50,75,lstm
0.6503623391069803,50,50,100,lstm
0.6373476554139575,50,100,75,lstm
0.5656251061504227,50,100,100,lstm


Interestingly enough, we find that adding more epochs both slows down processing and decreases the scores. We will move forward with a final lag 7 model of batch size 50, 50 epochs, and 75 units.

Below, we fit our selected models and score them on the test data to see what results we get.

In [45]:
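# NOTE: the grid search above selected batch_size=50 and nb_units=75;
# the values below are the ones actually used for the run reported here.
lstm_model_7 = \
  LSTMWrapper_w_dropout(nb_units=50,epochs=50,verbose=2,batch_size=75) \
    .fit(trn_fea_pdf_7.values,
         trn_tgt_ser_7.values)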

In [46]:
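# NOTE: the grid search above selected batch_size=10;
# the value below is the one actually used for the run reported here.
lstm_model_3 = \
  LSTMWrapper_w_dropout(nb_units=10,epochs=50,verbose=2,batch_size=20) \
    .fit(trn_fea_pdf_3.values,
         trn_tgt_ser_3.values)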

In [47]:
lstm_model_3.score(tst_fea_pdf_3.values,
                 tst_tgt_ser_3.values)

In [48]:
lstm_model_7.score(tst_fea_pdf_7.values,
                 tst_tgt_ser_7.values)

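As we can see above, the lag 3 model worked rather well in predicting the test data, but the lag 7 model was horrendous. We conclude that the lag 7 model was likely overfit, and that we should have considered a higher dropout rate to counteract this. Note also that the lag 7 model was fit with 50 units and a batch size of 75, rather than the 75 units and batch size of 50 selected by the grid search, which may have contributed.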

## 3. Sequential Model

The code below is our larger attempt at determining the impact that different activation, optimization, and loss functions could have on our machine learning process. In the neural net we originally created, we were unable to run a grid search on these parameters because the class we wrapped, and therefore its parameters, covered only the LSTM class. Below, we demonstrate that the Sequential model itself can be wrapped, which allows us to specify different optimizers for testing. The next step after the code below would be to find a way to wrap both classes and grid search all the parameters together.

In [52]:
from sklearn.base import BaseEstimator, RegressorMixin
from keras.layers import LSTM
class Sequential_Wrapper(BaseEstimator,RegressorMixin,):
  def __init__(self,loss = 'mae', activation= 'relu', optimizer = 'adam'):
    self.model = Sequential()
    self.loss = loss
    self.activation = activation
    self.optimizer = optimizer
    return 
  
  def fit(self,X,y=None):
    from sklearn.preprocessing import MinMaxScaler
    self.scl_X = MinMaxScaler(feature_range=(0, 1))
    self.scl_y = MinMaxScaler(feature_range=(0, 1))
    X_scl = self.scl_X.fit_transform(X)
    y_scl = self.scl_y.fit_transform(y.reshape(-1,1))
    X_scl_re = X_scl.reshape((X_scl.shape[0], 1, X_scl.shape[1]))
    self.model.add(LSTM(75, 
                        input_shape=(X_scl_re.shape[1], X_scl_re.shape[2])))
    self.model.add (Dropout(0.2))
    self.model.add(Dense(1, activation = self.activation))
    self.model.compile(loss=self.loss, optimizer= self.optimizer)
    self.model.fit(X_scl_re, y_scl, 
                   epochs    =50, 
                   batch_size=50, 
                   verbose   =0, 
                   shuffle   =False)
    return self
  
  def predict(self,X,y=None):
    X_scl    = self.scl_X.transform(X)
    X_scl_re = X_scl.reshape((X_scl.shape[0], 1, X_scl.shape[1]))
    return self.scl_y.inverse_transform(self.model.predict(X_scl_re))

In [53]:
from spark_sklearn           import GridSearchCV
from sklearn.pipeline        import Pipeline
from sklearn.model_selection import TimeSeriesSplit
from sklearn.metrics         import make_scorer, mean_absolute_error, r2_score
simple_gs_seq = \
GridSearchCV(sc,
             estimator=Pipeline(steps=[('seq',Sequential_Wrapper())
                                      ]),
             param_grid={'seq__loss':['mae', 'hinge'],
                         'seq__activation':['relu', 'softmax'],
                         'seq__optimizer': ['adam', 'SGD']
                        },
             cv=TimeSeriesSplit(n_splits=2),
             scoring=make_scorer(r2_score),
             return_train_score=False,
             n_jobs=-1 
            ) 
simple_gs_seq \
  .fit(trn_fea_pdf_3.values,
       trn_tgt_ser_3.values)
display_pdf(est_grid_results_pdf(simple_gs_seq,
                                 est_tag='seq'))

mean_test_score,seq__activation,seq__loss,seq__optimizer,est_tag
0.762642927317379,relu,mae,SGD,seq
0.6837643845472708,relu,mae,adam,seq
-23.39040411407789,softmax,mae,adam,seq
-23.39040411407789,softmax,mae,SGD,seq
-23.39040411407789,softmax,hinge,adam,seq
-23.39040411407789,softmax,hinge,SGD,seq
-251.7345853568152,relu,hinge,SGD,seq
-14298.673537970695,relu,hinge,adam,seq


We see above that the best activation function is ReLU (rectified linear unit), the best loss function is mean absolute error, and the best optimizer is stochastic gradient descent. We could run through more parameters, but there is not much else we can do with the class from here.
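
The four softmax rows in the table share exactly the same score. A plausible explanation is that softmax normalizes across the units of a layer, and our output layer has only one unit, so its output is always 1 regardless of the input. The sketch below checks this with a plain NumPy softmax on invented pre-activation values; it does not run the Keras model itself.

```python
import numpy as np

def softmax(z):
    # Softmax across the last axis (the units of the output layer).
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Pre-activation values of a single-unit dense layer for five samples (invented).
z = np.array([[-3.0], [-0.5], [0.0], [2.0], [10.0]])
out = softmax(z)
print(out.ravel())
```

If that is what happened, those four models predict the same constant for every row, whatever the loss or optimizer. Hinge loss is intended for classification rather than regression, which may be why its scores with ReLU are so poor as well.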

## Summary

Above, we took the LSTM model discussed briefly in class and explored how we could make it work for different lags, different batch sizes, and finally different optimizers, losses, and activation functions. Our study revealed that predicting on data lagged 7 days can prove quite difficult; as a result, we should consider different learning rates and a higher dropout rate for the longer lag.