# MA707 Models

## References
- [Scikit-learn ML map](https://scikit-learn.org/stable/tutorial/machine_learning_map/index.html)
- [sklearn.linear_model.Ridge](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Ridge.html)
- [sklearn.linear_model.Lasso](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Lasso.html)
- [sklearn.linear_model.ElasticNet](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.ElasticNet.html)
- [sklearn.tree.DecisionTreeRegressor](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeRegressor.html),
[Decision Tree Regression](https://scikit-learn.org/stable/auto_examples/tree/plot_tree_regression.html), 
[Google images](https://www.google.com/search?rls=en&q=Decision_tree_r1.png&tbm=isch&source=univ&client=safari&sa=X&ved=2ahUKEwisk5rU0ebhAhXKmeAKHdeCAL4QsAR6BAgFEAE&biw=1280&bih=714#imgrc=ts7eAUUYVsZ6EM:)
- [sklearn.ensemble.RandomForestRegressor](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html)
- [sklearn.neighbors.KNeighborsRegressor](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsRegressor.html),
[Nearest Neighbors](https://scikit-learn.org/stable/modules/neighbors.html), 
[Nearest Neighbors Regression](https://scikit-learn.org/stable/modules/neighbors.html#nearest-neighbors-regression),
[Google images](https://www.google.com/search?client=safari&rls=en&biw=1280&bih=714&tbm=isch&sa=1&ei=STy_XML9LKO-ggeb-q7AAQ&q=knn+regression&oq=knn+regression&gs_l=img.3..0j0i24l9.57515.60963..61372...3.0..1.194.1080.12j2......1....1..gws-wiz-img.......0i67j0i8i30.Uo0cVTouR5M#imgrc=gwFDKjDzWxOqyM:)
- [sklearn.decomposition.PCA](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html)
- [sklearn.svm.SVR](https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVR.html),
[Support Vector Regression](https://scikit-learn.org/stable/modules/svm.html#svm-regression)

### Regularization

The goal of regularization is to reduce overfitting.
Hyper-parameters control the degree of regularization of a model.
See [Regularization](https://en.wikipedia.org/wiki/Regularization) on Wikipedia (choose the mathematics entry).
Regularization takes different forms for different models:
- Regularization for the linear regression model is accomplished by adding a penalty on the size of the coefficients to the cost function (an L2 penalty for Ridge, an L1 penalty for Lasso, and a mix of both for ElasticNet).
This produces models with smaller coefficients and, with an L1 penalty, fewer non-zero coefficients.
- Regularization for the decision tree model is accomplished by
    - producing smaller trees (`max_depth`, `min_samples_split`, `min_samples_leaf`)
    - making splitting decisions with less information (`max_features`)
    - aggregating results from multiple trees (this is the Random Forest model)
- Regularization for k-nearest neighbors is accomplished by using a larger value for the `n_neighbors` hyper-parameter.
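
The effects listed above can be seen on a small synthetic problem. The sketch below is illustrative only: the data, the random seed, and the hyper-parameter values (`alpha=100.0`, `alpha=0.5`, `max_depth=3`, `n_neighbors=50`) are assumptions chosen for the demonstration and are not taken from the BCI dataset or from the grids used later in this notebook.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.tree import DecisionTreeRegressor
from sklearn.neighbors import KNeighborsRegressor

# Synthetic data (assumption): only the first two of ten features matter.
rng = np.random.RandomState(0)
X = rng.normal(size=(200, 10))
true_coef = np.array([3.0, -2.0] + [0.0] * 8)
y = X @ true_coef + rng.normal(scale=0.5, size=200)

# Linear models: the penalty shrinks coefficients (Ridge) and can zero them out (Lasso).
ols = LinearRegression().fit(X, y)
rdg = Ridge(alpha=100.0).fit(X, y)
lso = Lasso(alpha=0.5).fit(X, y)
ols_norm = np.linalg.norm(ols.coef_)
rdg_norm = np.linalg.norm(rdg.coef_)
lso_nonzero = int(np.sum(lso.coef_ != 0))
print('coefficient norm, unpenalized vs ridge:', ols_norm, rdg_norm)
print('non-zero lasso coefficients:', lso_nonzero, 'of', X.shape[1])

# Decision tree: limiting max_depth produces a smaller tree.
deep_tree = DecisionTreeRegressor(random_state=0).fit(X, y)
shallow_tree = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X, y)
print('leaves, unrestricted vs max_depth=3:',
      deep_tree.get_n_leaves(), shallow_tree.get_n_leaves())

# k-nearest neighbors: more neighbors means a smoother fit to the training data.
knn_1 = KNeighborsRegressor(n_neighbors=1).fit(X, y)
knn_50 = KNeighborsRegressor(n_neighbors=50).fit(X, y)
print('training R^2, 1 vs 50 neighbors:', knn_1.score(X, y), knn_50.score(X, y))
```

In each case the more regularized model fits the training data less closely. Whether that improves out-of-sample performance depends on the data, which is why the degree of regularization is tuned by cross-validated grid search in the sections that follow.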

## Setup

In [5]:
%run "/Courses/MA707/Groups/Blackjack/Report - Final/0.1 Raw dataset (inc)"

In [6]:
%run "/Courses/MA707/Groups/Blackjack/Report - Final/0.2 Feature creation (inc)"

In [7]:
%run "/Courses/MA707/Groups/Blackjack/Report - Final/0.3 Feature selection (inc)"

In [8]:
%run "/Courses/MA707/Groups/Blackjack/Report - Final/0.4 Estimators (inc)"

In [9]:
def display_pdf(a_pdf):
  display(spark.createDataFrame(a_pdf,verifySchema=False))

In [10]:
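def est_grid_results_pdf(my_est_grid_obj,est_tag=None,fea_tag=None): 
  """Summarize a fitted grid search as a dataframe: one row per parameter
  combination with its mean test score, sorted so the best score is last."""
  import pandas as pd
  import numpy  as np
  # keep the 'param_*' and '*test_score' columns, then drop the per-split scores
  res_pdf = pd.DataFrame(data=my_est_grid_obj.cv_results_) \
           .loc[:,lambda df: np.logical_or(df.columns.str.startswith('param_'),
                                           df.columns.str.endswith('test_score'))
               ] \
           .loc[:,lambda df: np.logical_not(df.columns.str.startswith('split'))
               ] \
           .drop(['rank_test_score', 'std_test_score'], 
                 axis=1)
  # strip the 'param_' prefix so columns read e.g. 'rdg__alpha'
  res_pdf.columns = [column.replace('param_','') for column in list(res_pdf.columns)]
  # optional tags identify the estimator and the feature set when results are combined
  if est_tag is not None: res_pdf = res_pdf.assign(est_tag=est_tag)
  if fea_tag is not None: res_pdf = res_pdf.assign(fea_tag=fea_tag)
  return res_pdf.sort_values('mean_test_score')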

## Create raw dataset(s)

In [12]:
%python
import pandas as pd
bci_pdf = pd.read_csv('/dbfs/mnt/group-ma707/data/5tc_plus_ind_vars.csv') \
            .rename(columns={'P3A~IV':'P3A_IV'}) \
            .assign(date=lambda pdf: pd.to_datetime(pdf.Date)) \
            .drop('Date', axis=1) \
            .sort_index(ascending=True)
bci_pdf.columns = bci_pdf.columns.str.lower()
bci_pdf.info() # expect 1602 non-null values in every column, matching the index length

In [13]:
%python
import numpy as np
import pandas as pd
coal_pdf = \
pd.read_csv('/dbfs/mnt/group-ma707/data/mining_com_coal.csv', 
            encoding='ISO-8859-1'
           ) \
  .loc[:,['date','tags','title','content']] \
  .fillna({'tags'   :'',
           'content':'',
           'title'  :''
          }) \
  .assign(date   =lambda pdf: pd.to_datetime(pd.to_datetime(pdf.date).dt.date)) \
  .groupby(by='date') \
  .agg({'tags'   : lambda ser: ' '.join(ser),
        'content': lambda ser: ' '.join(ser),
        'title'  : lambda ser: ' '.join(ser)}) \
  .sort_index(ascending=True) \
  .resample('D') \
  .ffill() \
  .reset_index()
coal_pdf.info()

In [14]:
%python
import numpy as np
import pandas as pd
ore_pdf = \
pd.read_csv('/dbfs/mnt/group-ma707/data/mining_com_iron_ore.csv', 
            encoding='ISO-8859-1'
           ) \
  .loc[:,['date','tags','title','content']] \
  .fillna({'tags'   :'',
           'content':'',
           'title'  :''
          }) \
  .assign(date = lambda pdf: pd.to_datetime(pd.to_datetime(pdf.date,utc=True).dt.normalize().dt.date)) \
  .groupby(by='date') \
  .agg({'tags'   : lambda ser: ' '.join(ser),
        'content': lambda ser: ' '.join(ser),
        'title'  : lambda ser: ' '.join(ser)}) \
  .sort_index(ascending=True) \
  .resample('D') \
  .ffill() \
  .reset_index()
ore_pdf.info()

In [15]:
%python
import pandas as pd
bci_coal_pdf = \
pd.concat(objs=[ bci_pdf.set_index('date'), 
                coal_pdf.set_index('date')], 
          join='inner',
          axis=1
         ) \
  .reset_index()
bci_coal_pdf.info()

In [16]:
%python
import pandas as pd
bci_ironore_pdf = \
pd.concat(objs=[bci_pdf.set_index('date'), 
                ore_pdf.set_index('date')], 
          join='inner',
          axis=1
         ) \
  .reset_index()
bci_ironore_pdf.info()

In [17]:
%python
import pandas as pd
bci_dual_pdf = \
pd.concat(objs=[bci_coal_pdf.set_index('date'), 
                ore_pdf.set_index('date')], 
          join='inner',
          axis=1
         ) \
  .reset_index()
bci_dual_pdf.info()

In [18]:
%python
import pandas as pd
bci_dual_pdf = \
pd.concat(objs=[ bci_pdf.set_index('date'), 
                 ore_pdf.set_index('date'),
                coal_pdf.set_index('date')], 
          join='inner',
          axis=1
         ) \
  .reset_index()
bci_dual_pdf.info()

## Create feature-target dataframe(s)

In [20]:
%python 
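import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, ENGLISH_STOP_WORDS
class CountVectColDF(CountVectorizer):
  """CountVectorizer applied to one text column of a dataframe; returns a dataframe of prefixed word counts."""
  def __init__(self,col_name,prefix='cnt_',
               stop_words=list(ENGLISH_STOP_WORDS),
               add_stop_words=None
              ):
    # None instead of [] avoids a mutable default argument
    stop_words_list = list(stop_words)+list(add_stop_words or [])
    self.col_name = col_name
    self.prefix   = prefix
    super().__init__(stop_words=stop_words_list)
  
  def fit(self,X,y=None):
    super().fit(X[self.col_name])
    return self
  
  def transform(self,X,y=None):
    # keep the input index so the counts stay aligned row-by-row with the other feature blocks
    # (scikit-learn 1.0+ renames get_feature_names() to get_feature_names_out())
    return pd.DataFrame(data=super().transform(X[self.col_name]).toarray(),
                        index=X.index,
                        columns=[self.prefix+feature_name for feature_name in super().get_feature_names()]
                       )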

In [21]:
%python 
def get_count_vect_all_three_plus_all_ts_pipe():
  from sklearn.pipeline import FeatureUnion, Pipeline
  return Pipeline(steps=[
    ('fea_one', FeatureUnionDF(transformer_list=[
      ('tgt_var'     ,CreateTargetVarDF(var='bci_5tc')),
      ('dt_vars'     ,CreateDatetimeVarsDF(var='date')),
      ('lag_txt_vars',CreateLagVarsDF(var_list=['bci_5tc','tags','title','content',
                                                'bci', 'c5', 'c7', 'p1a_03', 'p2a_03', 'p4_03', 'p3a_iv', 'shfe_al3',
                                                'rici', 'ice_kc3', 'cme_sm3', 'cme_lc2', 'opec_orb', 'shfe_cu3',
                                                'cme_ln1', 'cme_fc3', 'p3a_03', 'shfe_rb3', 'cme_s2', 'ice_sb3',
                                                'cme_ln3', 'cme_ln2', 'ice_tib3', 'ice_tib4'],
                                      lag_list=[3])),
    ])),
    ('drop_na_rows'  ,DropNaRowsDF(how='any')),
    ('fea_two', FeatureUnionDF(transformer_list=[
      ('named_vars' , CreateNamedVarsDF(except_list=['tags_lag3','title_lag3','content_lag3'])),
      ('cnt_tags'   , CountVectColDF(col_name=  'tags_lag3'   ,prefix='cnt_tags_'     ,add_stop_words=[])),
      ('cnt_title'  , CountVectColDF(col_name=  'title_lag3'  ,prefix='cnt_title_'    ,add_stop_words=[])),  
      ('cnt_content', CountVectColDF(col_name=  'content_lag3',prefix='cnt_content_'  ,add_stop_words=[])),  
    ])),
    ('drop_na_rows_again', DropNaRowsDF(how='any')),
  ])

In [22]:
fea_tgt_count_vect_pdf = \
  get_count_vect_all_three_plus_all_ts_pipe() \
    .fit(bci_coal_pdf) \
    .transform(bci_coal_pdf)

In [23]:
fea_tgt_count_vect_pdf \
  .info()

In [24]:
[var for var in list(fea_tgt_count_vect_pdf.columns) if not var.startswith('cnt')]

## Create train and test datasets

In [26]:
def create_train_test_ts(fea_pdf, tgt_ser, trn_prop=0.8):
  trn_len = int(trn_prop * len(fea_pdf))
  return (fea_pdf.iloc[:trn_len],
          fea_pdf.iloc[ trn_len:],
          tgt_ser.iloc[:trn_len],
          tgt_ser.iloc[ trn_len:]
         )

In [27]:
(trn_fea_pdf, tst_fea_pdf, 
 trn_tgt_ser, tst_tgt_ser
) = \
create_train_test_ts(fea_pdf = fea_tgt_count_vect_pdf.drop( 'target',axis=1),
                     tgt_ser = fea_tgt_count_vect_pdf.loc[:,'target'],
                    )

## Models

In [29]:
from sklearn.pipeline        import FeatureUnion, Pipeline
from sklearn.linear_model    import LinearRegression, Ridge, Lasso, ElasticNet
from sklearn.svm             import SVR
from sklearn.tree            import DecisionTreeRegressor
from sklearn.ensemble        import RandomForestRegressor
from sklearn.neighbors       import KNeighborsRegressor
from sklearn.decomposition   import PCA
from spark_sklearn           import GridSearchCV
#from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import TimeSeriesSplit
from sklearn.metrics         import make_scorer, mean_absolute_error, r2_score

### Model: Ridge

In [31]:
simple_gs = \
GridSearchCV(sc,
             estimator=Pipeline(steps=[('pca',PCA()),
                                       ('rdg',Ridge())
                                      ]),
             param_grid={'rdg__normalize'   :[True, False],
                         'rdg__alpha'       :[10.0**n for n in [-3,0,3]],
                         'rdg__solver'      :['saga'],
                         'pca__n_components':[10**n for n in [1, 2, 3]]
                        },
             cv=TimeSeriesSplit(n_splits=5),
             scoring=make_scorer(r2_score),
             return_train_score=False,
             n_jobs=-1 
            ) 
simple_gs \
  .fit(trn_fea_pdf, 
       trn_tgt_ser)
display_pdf(est_grid_results_pdf(simple_gs,
                                 est_tag='pca-rdg'))

### Model: Lasso

In [33]:
simple_gs = \
GridSearchCV(sc,
             estimator=Pipeline(steps=[('pca',PCA()),
                                       ('lso',Lasso())
                                      ]),
             param_grid={'pca__n_components': [10**n for n in [1,2,3]],
                         'lso__normalize'   :[True, False],
                         'lso__alpha'       : [10.0**n for n in [-3,0,3]],
                        },
             cv=TimeSeriesSplit(n_splits=5),
             scoring=make_scorer(r2_score),
             return_train_score=False,
             n_jobs=-1 
            ) 
simple_gs \
  .fit(trn_fea_pdf, 
       trn_tgt_ser)
display_pdf(est_grid_results_pdf(simple_gs,
                                 est_tag='pca-lso'))

mean_test_score,lso__alpha,lso__normalize,pca__n_components,est_tag
-8.568698012758558e+21,0.001,True,1000,pca-lso
-1.0745150608871948e+21,1.0,True,1000,pca-lso
-15.588785248319631,0.001,False,1000,pca-lso
-1.0642620033355898,1000.0,True,1000,pca-lso
-1.0642620033355898,1000.0,True,100,pca-lso
-1.0642620033355898,1000.0,True,10,pca-lso
0.3097251553534711,1.0,False,1000,pca-lso
0.7246839219286189,0.001,True,100,pca-lso
0.7249553966208978,1.0,False,100,pca-lso
0.7314796736500578,0.001,False,100,pca-lso


### Model: ElasticNet

In [35]:
simple_gs = \
GridSearchCV(sc,
             estimator=Pipeline(steps=[('pca',PCA()),
                                       ('ela',ElasticNet())
                                      ]),
             param_grid={'pca__n_components': [10**n for n in [1,2,3]],
                         'ela__normalize'   :[True, False],
                         'ela__alpha'       : [10.0**n for n in [-3,0,3]],
                        },
             cv=TimeSeriesSplit(n_splits=5),
             scoring=make_scorer(r2_score),
             return_train_score=False,
             n_jobs=-1 
            ) 
simple_gs \
  .fit(trn_fea_pdf, 
       trn_tgt_ser)
display_pdf(est_grid_results_pdf(simple_gs,
                                 est_tag='pca-ela'))

mean_test_score,ela__alpha,ela__normalize,pca__n_components,est_tag
-4.290765376945266e+21,0.001,True,1000,pca-ela
-2.5094411482654984e+16,1.0,True,1000,pca-ela
-1.0642620033355898,1000.0,True,1000,pca-ela
-1.0642620033355898,1000.0,True,100,pca-ela
-1.0642620033355898,1000.0,True,10,pca-ela
-1.0498332160215555,1.0,True,100,pca-ela
-1.04906964295422,1.0,True,10,pca-ela
0.2709323741577383,0.001,False,1000,pca-ela
0.6420122099511897,0.001,True,100,pca-ela
0.6987543652626844,1.0,False,1000,pca-ela


### Model: SVR

In [37]:
simple_gs = \
GridSearchCV(sc,
             estimator=Pipeline(steps=[('svr',SVR())
                                      ]),
             param_grid={'svr__kernel'      : ['rbf'], 
                         'svr__gamma'       : ['auto'],
                         'svr__C'           : [2.0**n for n in [-3, -2, -1, 0, 1, 2, 3]] # should be positive
                        },
             cv=TimeSeriesSplit(n_splits=5),
             scoring=make_scorer(r2_score),
             return_train_score=False,
             n_jobs=-1 
            ) 
simple_gs \
  .fit(trn_fea_pdf, 
       trn_tgt_ser)
display_pdf(est_grid_results_pdf(simple_gs,
                                 est_tag='svr'))

mean_test_score,svr__C,svr__gamma,svr__kernel,est_tag
-0.7830286360531313,8.0,auto,rbf,svr
-0.7828411699270976,4.0,auto,rbf,svr
-0.7827881033832752,2.0,auto,rbf,svr
-0.7827761071646782,1.0,auto,rbf,svr
-0.7827701098509408,0.5,auto,rbf,svr
-0.7827671113929622,0.25,auto,rbf,svr
-0.7827656122136958,0.125,auto,rbf,svr


In [38]:
simple_gs = \
GridSearchCV(sc,
             estimator=Pipeline(steps=[('svr',SVR())
                                      ]),
             param_grid={#'pca__n_components': [10**n for n in [1,2,3]],
                         'svr__C'      : [1],
                        },
             cv=TimeSeriesSplit(n_splits=5),
             scoring=make_scorer(r2_score),
             return_train_score=False,
             n_jobs=-1 
            ) 
simple_gs \
  .fit(trn_fea_pdf, 
       trn_tgt_ser)
display_pdf(est_grid_results_pdf(simple_gs,
                                 est_tag='svr'))

mean_test_score,svr__C,est_tag
-0.7827761071646782,1,svr


### Model: Decision Tree

In [40]:
simple_gs = \
GridSearchCV(sc,
             estimator=Pipeline(steps=[('dtr',DecisionTreeRegressor())
                                      ]),
             param_grid={'dtr__max_depth'        : [1,2,3,4,5,6,7,8,9,10]
                        },
             cv=TimeSeriesSplit(n_splits=5),
             scoring=make_scorer(r2_score),
             return_train_score=False,
             n_jobs=-1 
            ) 
simple_gs \
  .fit(trn_fea_pdf, 
       trn_tgt_ser)
display_pdf(est_grid_results_pdf(simple_gs,
                                 est_tag='dtr'))

mean_test_score,dtr__max_depth,est_tag
-0.2694397267643605,1,dtr
0.297886837703339,9,dtr
0.3342327195431135,10,dtr
0.3690942033415036,8,dtr
0.3855081782053876,2,dtr
0.4354303765333914,7,dtr
0.5233521770195328,5,dtr
0.5349071328026097,4,dtr
0.5716799784199559,6,dtr
0.6383056868419599,3,dtr


In [41]:
simple_gs = \
GridSearchCV(sc,
             estimator=Pipeline(steps=[('dtr', DecisionTreeRegressor())
                                      ]),
             param_grid={'dtr__min_samples_leaf': [100*n for n in [1,2,3,4,5]]
                        },
             cv=TimeSeriesSplit(n_splits=5),
             scoring=make_scorer(r2_score),
             return_train_score=False,
             n_jobs=-1 
            ) 
simple_gs \
  .fit(trn_fea_pdf, 
       trn_tgt_ser)
display_pdf(est_grid_results_pdf(simple_gs,
                                 est_tag='dtr'))

mean_test_score,dtr__min_samples_leaf,est_tag
-1.0634041150125826,500,dtr
-0.4875716200058761,400,dtr
-0.1923641226993827,300,dtr
0.1025557843273998,200,dtr
0.3297776342125731,100,dtr


In [42]:
simple_gs = \
GridSearchCV(sc,
             estimator=Pipeline(steps=[('dtr',DecisionTreeRegressor())
                                      ]),
             param_grid={'dtr__min_samples_split': [200*n for n in [1,2,3,4,5]]
                        },
             cv=TimeSeriesSplit(n_splits=5),
             scoring=make_scorer(r2_score),
             return_train_score=False,
             n_jobs=-1 
            ) 
simple_gs \
  .fit(trn_fea_pdf, 
       trn_tgt_ser)
display_pdf(est_grid_results_pdf(simple_gs,
                                 est_tag='dtr'))

mean_test_score,dtr__min_samples_split,est_tag
-1.0954042693477968,1000,dtr
-0.6218166257185701,800,dtr
-0.1732465428541448,600,dtr
0.161677215631013,400,dtr
0.164895762001744,200,dtr


In [43]:
simple_gs = \
GridSearchCV(sc,
             estimator=Pipeline(steps=[('dtr',DecisionTreeRegressor())
                                      ]),
             param_grid={'dtr__max_leaf_nodes': [5, 10, 15, 20, 25]
                        },
             cv=TimeSeriesSplit(n_splits=5),
             scoring=make_scorer(r2_score),
             return_train_score=False,
             n_jobs=-1 
            ) 
simple_gs \
  .fit(trn_fea_pdf, 
       trn_tgt_ser)
display_pdf(est_grid_results_pdf(simple_gs,
                                 est_tag='dtr'))

mean_test_score,dtr__max_leaf_nodes,est_tag
0.4504557409215142,5,dtr
0.4827626876522201,25,dtr
0.566018013720798,10,dtr
0.6053302312995659,20,dtr
0.6553427743928324,15,dtr


### Model: Random Forest

In [45]:
simple_gs = \
GridSearchCV(sc,
             estimator=Pipeline(steps=[('rf',RandomForestRegressor())
                                      ]),
             param_grid={'rf__n_estimators'  : [5, 10, 20],
                         'rf__max_leaf_nodes': [50, 100, 200]
                        },
             cv=TimeSeriesSplit(n_splits=5),
             scoring=make_scorer(r2_score),
             return_train_score=False,
             n_jobs=-1 
            ) 
simple_gs \
  .fit(trn_fea_pdf, 
       trn_tgt_ser)
display_pdf(est_grid_results_pdf(simple_gs,
                                 est_tag='rf'))

mean_test_score,rf__max_leaf_nodes,rf__n_estimators,est_tag
0.601111941494847,50,5,rf
0.6109226877453153,200,5,rf
0.6399796077939861,100,5,rf
0.6425659567474364,200,10,rf
0.6506565624664536,100,10,rf
0.6666601899307388,100,20,rf
0.6753871655029832,50,10,rf
0.6848372456631265,200,20,rf
0.7258262715862084,50,20,rf


### Model: K-nearest neighbors

In [47]:
simple_gs = \
GridSearchCV(sc,
             estimator=Pipeline(steps=[('knn',KNeighborsRegressor())
                                      ]),
             param_grid={'knn__weights'    : ['uniform', 'distance'],
                         'knn__p'          : [1,2,3], # Minkowski power parameter: 1 = Manhattan, 2 = Euclidean
                         'knn__n_neighbors': [3,10,50,100,200] 
                        },
             cv=TimeSeriesSplit(n_splits=5),
             scoring=make_scorer(r2_score),
             return_train_score=False,
             n_jobs=-1 
            ) 
simple_gs \
  .fit(trn_fea_pdf, 
       trn_tgt_ser)
display_pdf(est_grid_results_pdf(simple_gs,
                                 est_tag='knn'))

mean_test_score,knn__n_neighbors,knn__p,knn__weights,est_tag
0.1371838755032608,200,1,uniform,knn
0.2108241751156202,200,2,uniform,knn
0.2197863420811695,200,3,uniform,knn
0.2367425066930523,200,1,distance,knn
0.2909487104870324,100,1,uniform,knn
0.3284489187274094,200,2,distance,knn
0.3463732785615501,200,3,distance,knn
0.3846125017277573,100,1,distance,knn
0.4000650263503891,50,1,uniform,knn
0.406891344387942,100,2,uniform,knn


__The End__