# MA707 Metrics

## References
- [sklearn.metrics](https://scikit-learn.org/stable/modules/classes.html#sklearn-metrics-metrics)
- [sklearn.model_selection.TimeSeriesSplit](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.TimeSeriesSplit.html)
- [sklearn.linear_model.Ridge](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Ridge.html)
- [spark-sklearn](https://github.com/databricks/spark-sklearn)

## Setup

In [4]:
%run "/Courses/MA707/Groups/Blackjack/Report - Final/0.1 Raw dataset (inc)"

In [5]:
%run "/Courses/MA707/Groups/Blackjack/Report - Final/0.2 Feature creation (inc)"

In [6]:
%run "/Courses/MA707/Groups/Blackjack/Report - Final/0.3 Feature selection (inc)"

In [7]:
%run "/Courses/MA707/Groups/Blackjack/Report - Final/0.4 Estimators (inc)"

In [8]:
def display_pdf(a_pdf):
  display(spark.createDataFrame(a_pdf))

In [9]:
def est_grid_results_pdf(my_est_grid_obj,col_dict=None):
  # Summarize a fitted grid search: keep the `param_*` columns and the
  # aggregate `*test_score` columns, dropping the per-split scores.
  import pandas as pd
  import numpy  as np
  return pd.DataFrame(data={**my_est_grid_obj.cv_results_,
                            **(col_dict or {})}) \
           .loc[:,lambda df: np.logical_or(df.columns.str.startswith('param_'),
                                           df.columns.str.endswith('test_score'))
               ] \
           .loc[:,lambda df: np.logical_not(df.columns.str.startswith('split'))
               ]

## Create raw dataset(s)

In [11]:
%python
import pandas as pd
bci_pdf = pd.read_csv('/dbfs/mnt/group-ma707/data/5tc_plus_ind_vars.csv') \
            .rename(columns={'P3A~IV':'P3A_IV'}) \
            .assign(date=lambda pdf: pd.to_datetime(pdf.Date)) \
            .drop('Date', axis=1) \
            .sort_index(ascending=True)
bci_pdf.columns = bci_pdf.columns.str.lower()
bci_pdf.info() # 1602 non-null for all vars same count for index

In [12]:
%python
import numpy as np
import pandas as pd
coal_pdf = \
pd.read_csv('/dbfs/mnt/group-ma707/data/mining_com_coal.csv', 
            encoding='ISO-8859-1'
           ) \
  .loc[:,['date','tags','title','content']] \
  .fillna({'tags'   :'',
           'content':'',
           'title'  :''
          }) \
  .assign(date   =lambda pdf: pd.to_datetime(pd.to_datetime(pdf.date).dt.date)) \
  .groupby(by='date') \
  .agg({'tags'   : lambda ser: ' '.join(ser),
        'content': lambda ser: ' '.join(ser),
        'title'  : lambda ser: ' '.join(ser)}) \
  .sort_index(ascending=True) \
  .resample('D') \
  .ffill() \
  .reset_index()
coal_pdf.info()

In [13]:
%python
import numpy as np
import pandas as pd
ore_pdf = \
pd.read_csv('/dbfs/mnt/group-ma707/data/mining_com_iron_ore.csv', 
            encoding='ISO-8859-1'
           ) \
  .loc[:,['date','tags','title','content']] \
  .fillna({'tags'   :'',
           'content':'',
           'title'  :''
          }) \
  .assign(date = lambda pdf: pd.to_datetime(pd.to_datetime(pdf.date,utc=True).dt.normalize().dt.date)) \
  .groupby(by='date') \
  .agg({'tags'   : lambda ser: ' '.join(ser),
        'content': lambda ser: ' '.join(ser),
        'title'  : lambda ser: ' '.join(ser)}) \
  .sort_index(ascending=True) \
  .resample('D') \
  .ffill() \
  .reset_index()
ore_pdf.info()

In [14]:
%python
import pandas as pd
bci_coal_pdf = \
pd.concat(objs=[ bci_pdf.set_index('date'), 
                coal_pdf.set_index('date')], 
          join='inner',
          axis=1
         ) \
  .reset_index()
bci_coal_pdf.info()

In [15]:
%python
import pandas as pd
bci_ironore_pdf = \
pd.concat(objs=[bci_pdf.set_index('date'), 
                ore_pdf.set_index('date')], 
          join='inner',
          axis=1
         ) \
  .reset_index()
bci_ironore_pdf.info()

In [16]:
%python
import pandas as pd
bci_dual_pdf = \
pd.concat(objs=[bci_coal_pdf.set_index('date'), 
                ore_pdf.set_index('date')], 
          join='inner',
          axis=1
         ) \
  .reset_index()
bci_dual_pdf.info()

In [17]:
%python
import pandas as pd
bci_dual_pdf = \
pd.concat(objs=[ bci_pdf.set_index('date'), 
                 ore_pdf.set_index('date'),
                coal_pdf.set_index('date')], 
          join='inner',
          axis=1
         ) \
  .reset_index()
bci_dual_pdf.info()

## Create feature-target dataframe(s)

In [19]:
%python 
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, ENGLISH_STOP_WORDS
class CountVectColDF(CountVectorizer):
  # CountVectorizer applied to one text column of a dataframe; returns the
  # token counts as a dataframe whose columns are prefixed with `prefix`.
  def __init__(self,col_name,prefix='cnt_',
               stop_words=list(ENGLISH_STOP_WORDS),
               add_stop_words=[]
              ):
    stop_words_list = stop_words+add_stop_words
    self.col_name = col_name
    self.prefix   = prefix
    super().__init__(stop_words=stop_words_list)
    return
  
  def fit(self,X,y=None):
    super().fit(X[self.col_name])
    return self
  
  def transform(self,X,y=None):
    # NOTE: get_feature_names() was removed in scikit-learn 1.2;
    # newer versions provide get_feature_names_out() instead.
    return pd.DataFrame(data=super().transform(X[self.col_name]).toarray(),
                        columns=[self.prefix+feature_name for feature_name in super().get_feature_names()]
                       )

In [20]:
%python 
def get_count_vect_all_three_plus_all_ts_pipe():
  from sklearn.pipeline import FeatureUnion, Pipeline
  return Pipeline(steps=[
    ('fea_one', FeatureUnionDF(transformer_list=[
      ('tgt_var'     ,CreateTargetVarDF(var='bci_5tc')),
      ('dt_vars'     ,CreateDatetimeVarsDF(var='date')),
      ('lag_txt_vars',CreateLagVarsDF(var_list=['tags','bci_5tc'],
                                      lag_list=[3])),
    ])),
    ('drop_na_rows'  ,DropNaRowsDF(how='any')),
    ('fea_two', FeatureUnionDF(transformer_list=[
      ('named_vars' , CreateNamedVarsDF(except_list=['tags_lag3'])),
      ('cnt_tags'   , CountVectColDF(col_name=   'tags_lag3',prefix='cnt_tags_'   ,add_stop_words=[])),
     #('cnt_title'  , CountVectColDF(col_name=  'title_lag3',prefix='cnt_title_'  ,add_stop_words=[])),  
    ])),
    ('drop_na_rows_again', DropNaRowsDF(how='any')),
  ])

In [21]:
fea_tgt_count_vect_pdf = \
  get_count_vect_all_three_plus_all_ts_pipe() \
    .fit(bci_coal_pdf) \
    .transform(bci_coal_pdf)

In [22]:
fea_tgt_count_vect_pdf \
  .info()

In [23]:
[var for var in list(fea_tgt_count_vect_pdf.columns) if not var.startswith('cnt')]

## Create train and test datasets

In [25]:
def create_train_test_ts(fea_pdf, tgt_ser, trn_prop=0.8):
  trn_len = int(trn_prop * len(fea_pdf))
  return (fea_pdf.iloc[:trn_len],
          fea_pdf.iloc[ trn_len:],
          tgt_ser.iloc[:trn_len],
          tgt_ser.iloc[ trn_len:]
         )
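
As a quick check of the chronological split, the sketch below repeats `create_train_test_ts` on a made-up ten-row frame. The column name `x`, the dates, and the values are purely illustrative assumptions. The first 80% of rows become the training set and the rest the test set, with no shuffling, so every test row comes after every training row.

```python
import pandas as pd

def create_train_test_ts(fea_pdf, tgt_ser, trn_prop=0.8):
    # Same logic as the cell above: split by position, never shuffle.
    trn_len = int(trn_prop * len(fea_pdf))
    return (fea_pdf.iloc[:trn_len], fea_pdf.iloc[trn_len:],
            tgt_ser.iloc[:trn_len], tgt_ser.iloc[trn_len:])

# Hypothetical toy data: ten consecutive days.
toy_idx = pd.date_range('2020-01-01', periods=10, freq='D')
toy_pdf = pd.DataFrame({'x': range(10)}, index=toy_idx)
toy_tgt = pd.Series(range(100, 110), index=toy_idx, name='target')

trn_fea, tst_fea, trn_tgt, tst_tgt = create_train_test_ts(toy_pdf, toy_tgt)

print(len(trn_fea), len(tst_fea))                  # sizes of the two pieces
print(trn_fea.index.max() < tst_fea.index.min())   # training data precedes test data
```

Splitting by position rather than at random matters for time series: a random split would let the model train on observations that come after the ones it is tested on.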

In [26]:
(trn_fea_pdf, tst_fea_pdf, 
 trn_tgt_ser, tst_tgt_ser
) = \
create_train_test_ts(fea_pdf = fea_tgt_count_vect_pdf.drop( 'target',axis=1),
                     tgt_ser = fea_tgt_count_vect_pdf.loc[:,'target'],
                    )

In [28]:
trn_fea_pdf.columns

## Create estimator pipeline(s)

In [30]:
%python 
def get_simple_estimator_pipeline():
  from sklearn.pipeline import FeatureUnion, Pipeline
  from sklearn.linear_model import LinearRegression, Ridge
  return Pipeline(steps=[
    ('rdg', Ridge())
  ])

## Metrics

Metrics are used to evaluate a model by comparing the actual values with the predicted values produced by the model.

The descriptions and links below come from the scikit-learn reference linked above:
- [metrics.explained_variance_score](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.explained_variance_score.html#sklearn.metrics.explained_variance_score): 
Explained variance regression score function
- [metrics.mean_absolute_error](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.mean_absolute_error.html#sklearn.metrics.mean_absolute_error): 
Mean absolute error regression loss
- [metrics.mean_squared_error](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.mean_squared_error.html#sklearn.metrics.mean_squared_error): 
Mean squared error regression loss
- [metrics.mean_squared_log_error](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.mean_squared_log_error.html#sklearn.metrics.mean_squared_log_error): 
Mean squared logarithmic error regression loss
- [metrics.median_absolute_error](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.median_absolute_error.html#sklearn.metrics.median_absolute_error): 
Median absolute error regression loss
- [metrics.r2_score](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.r2_score.html#sklearn.metrics.r2_score): 
R^2 (coefficient of determination) regression score function.

Wikipedia
- [Mean absolute error](https://en.wikipedia.org/wiki/Mean_absolute_error): scale dependent
- [Median absolute deviation](https://en.wikipedia.org/wiki/Median_absolute_deviation)
- [Mean square error](https://en.wikipedia.org/wiki/Mean_squared_error)
- [Root mean square deviation](https://en.wikipedia.org/wiki/Root-mean-square_deviation): 
  scale dependent, disproportionately penalizes larger errors
- [Coefficient of determination](https://en.wikipedia.org/wiki/Coefficient_of_determination)
    - "... the proportion of the variance in the dependent variable that is predictable from the independent variable(s)."
    - "It provides a measure of how well observed outcomes are replicated by the model, based on the proportion of total variation of outcomes explained by the model."
    - "In cases where negative values arise, the mean of the data provides a better fit to the outcomes than do the fitted function values, ..."


Other
- [Mean squared logarithmic error](https://peltarion.com/knowledge-center/documentation/modeling-view/build-an-ai-model/loss-functions/mean-squared-logarithmic-error)

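__Summary__
- Squared error (MSE, RMSE): larger differences are disproportionately more costly
- Absolute error (MAE, median absolute error): easy to interpret because it is in the units of the target, but scale dependent
- Log error (MSLE): compares values on a log scale, so it reflects relative rather than absolute differences and is largely insensitive to the scale of the target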
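
To make the summary concrete, here is a minimal sketch that computes each metric listed above with `sklearn.metrics` on four made-up values (the numbers are illustrative assumptions, not taken from the BCI data). It also rescales the data by 1000 to show that MAE grows with the scale of the target while R^2 does not.

```python
import numpy as np
from sklearn.metrics import (mean_absolute_error, mean_squared_error,
                             mean_squared_log_error, median_absolute_error,
                             r2_score)

# Hypothetical actual and predicted values; absolute errors are 10, 10, 30, 40.
y_true = np.array([100.0, 200.0, 300.0, 400.0])
y_pred = np.array([110.0, 190.0, 330.0, 360.0])

mae   = mean_absolute_error(y_true, y_pred)     # mean of |error|
mse   = mean_squared_error(y_true, y_pred)      # mean of error**2
rmse  = np.sqrt(mse)                            # back in the units of the target
medae = median_absolute_error(y_true, y_pred)   # median of |error|
msle  = mean_squared_log_error(y_true, y_pred)  # mean of (log(1+y_true) - log(1+y_pred))**2
r2    = r2_score(y_true, y_pred)                # 1 - SS_residual / SS_total

print(mae, mse, rmse, medae, msle, r2)

# Rescale target and predictions by the same factor.
scale = 1000.0
mae_scaled = mean_absolute_error(scale * y_true, scale * y_pred)
r2_scaled  = r2_score(scale * y_true, scale * y_pred)
print(mae_scaled / mae)   # MAE scales with the data
print(r2_scaled - r2)     # R^2 is unchanged (up to rounding)
```

The squared errors here are 100, 100, 900 and 1600, so the two largest misses account for most of the MSE even though they are only half of the observations.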

The `score` method of an estimator pipeline calls the `score` method of its final estimator. For a regressor such as `Ridge`, that method returns the coefficient of determination R^2.

In [36]:
get_simple_estimator_pipeline() \
  .fit  (trn_fea_pdf, trn_tgt_ser) \
  .score(tst_fea_pdf, tst_tgt_ser)
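
The following sketch checks that claim on synthetic data (the arrays and coefficients are made up for illustration): the pipeline's `score` should match `r2_score` applied to the pipeline's own predictions, because `Ridge.score` returns R^2.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import r2_score
from sklearn.pipeline import Pipeline

# Hypothetical regression data: 50 rows, 2 features, small noise.
rng = np.random.RandomState(0)
X = rng.normal(size=(50, 2))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=50)

# Fit on the first 40 rows, evaluate on the last 10.
pipe = Pipeline(steps=[('rdg', Ridge())]).fit(X[:40], y[:40])

pipe_score = pipe.score(X[40:], y[40:])              # delegates to Ridge.score
manual_r2  = r2_score(y[40:], pipe.predict(X[40:]))  # R^2 computed explicitly

print(np.isclose(pipe_score, manual_r2))
```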

Scoring with `GridSearchCV` entails using the `scoring` parameter to specify the scoring metric.

The scoring results can be found in the value for the `mean_test_score` key in the `cv_results_` attribute of the `GridSearchCV` object.

In [38]:
from spark_sklearn import GridSearchCV
#from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import TimeSeriesSplit
from sklearn.metrics         import make_scorer, mean_absolute_error, r2_score
simple_gs = \
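GridSearchCV(sc,  # spark_sklearn's GridSearchCV takes the SparkContext first
  estimator=get_simple_estimator_pipeline(),
  # NOTE: `normalize` was removed from Ridge in scikit-learn 1.2
  param_grid={'rdg__normalize':[True,False],
              'rdg__alpha'    :[10.0**n for n in range(-3,4)],
              'rdg__solver'   :['auto', 'svd', 'cholesky', 'lsqr', 'sparse_cg', 'sag', 'saga']},
  cv=TimeSeriesSplit(n_splits=5),
  # NOTE: make_scorer defaults to greater_is_better=True, so rank_test_score and
  # best_params_ favor the LARGEST MAE; sort mean_test_score ascending instead.
  scoring=make_scorer(mean_absolute_error),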
  return_train_score=False,
  n_jobs=-1 
) 
simple_gs \
  .fit(trn_fea_pdf, 
       trn_tgt_ser)
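
The `cv=TimeSeriesSplit(n_splits=5)` argument above controls how the training data is divided during the search. As a rough illustration on twelve made-up, time-ordered observations, each fold trains on an expanding window of earlier rows and validates on the rows that immediately follow:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Hypothetical data: twelve observations in time order.
X_toy = np.arange(12).reshape(-1, 1)

folds = list(TimeSeriesSplit(n_splits=5).split(X_toy))
for fold_num, (trn_idx, val_idx) in enumerate(folds):
    # Training indices always precede validation indices.
    print(fold_num, 'train:', trn_idx, 'validate:', val_idx)
```

Unlike ordinary k-fold cross-validation, no fold ever validates on data older than its training data, which fits the forecasting setting of this notebook.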

In [39]:
display_pdf(est_grid_results_pdf(simple_gs) \
              .drop(['rank_test_score','std_test_score'],
                    axis=1) \
            .sort_values('mean_test_score'))

| `mean_test_score` | `param_rdg__alpha` | `param_rdg__normalize` | `param_rdg__solver` |
|---|---|---|---|
| 1411.6813081166167 | 1000.0 | False | saga |
| 1411.6828129925009 | 0.1 | False | saga |
| 1411.6831283969675 | 0.01 | False | saga |
| 1411.6858588645807 | 10.0 | False | saga |
| 1411.7033350941367 | 1.0 | False | saga |
| 1411.7077573160127 | 100.0 | False | saga |
| 1411.711776498261 | 0.001 | False | saga |
| 1426.654070004524 | 1000.0 | False | sag |
| 1426.672577699666 | 0.01 | False | sag |
| 1426.6773555186348 | 0.1 | False | sag |
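
One caveat about the scorer used above: `make_scorer` assumes `greater_is_better=True` unless told otherwise, so with `make_scorer(mean_absolute_error)` the search ranks the largest MAE first. Sorting `mean_test_score` in ascending order, as done here, sidesteps this, but `best_params_` and `rank_test_score` would point at the worst combination. The sketch below, which uses scikit-learn's own `GridSearchCV` and synthetic data rather than the notebook's Spark setup, compares the default scorer with `greater_is_better=False` and with the built-in `'neg_mean_absolute_error'` string.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import make_scorer, mean_absolute_error
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit
from sklearn.pipeline import Pipeline

# Hypothetical data with a clear linear signal.
rng = np.random.RandomState(0)
X = rng.normal(size=(120, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=120)

def fit_grid(scoring):
    # Tiny grid: a very weak and a very strong ridge penalty.
    return GridSearchCV(
        estimator=Pipeline(steps=[('rdg', Ridge())]),
        param_grid={'rdg__alpha': [0.001, 1000.0]},
        cv=TimeSeriesSplit(n_splits=5),
        scoring=scoring,
    ).fit(X, y)

gs_default = fit_grid(make_scorer(mean_absolute_error))
gs_negated = fit_grid(make_scorer(mean_absolute_error, greater_is_better=False))
gs_string  = fit_grid('neg_mean_absolute_error')

print(gs_default.best_params_)   # picks the alpha with the LARGEST MAE
print(gs_negated.best_params_)   # picks the alpha with the smallest MAE
print(gs_negated.cv_results_['mean_test_score'])  # negated MAE values
```

With the negated scorers, `mean_test_score` holds negative MAE values, so results would be sorted in descending order (or multiplied by -1) to list the best models first.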


## Investigation

What questions can be answered by analyzing the above dataframe?
- Does normalization make better predictions? Does it make consistently better predictions? 
- Are there values for `alpha` that make better predictions? Is this consistent across groups defined by other parameters? 
- Are there solvers that make better predictions? Are they consistent? 
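
One possible way to start on these questions is to group the results by a parameter and compare the scores. The sketch below uses only the ten rows displayed above, so it covers just the `saga` and `sag` solvers with `normalize=False`; the full results dataframe would be needed for a complete answer. The variable names are illustrative.

```python
import pandas as pd

# The ten rows shown in the results table above (mean MAE per combination).
top10_pdf = pd.DataFrame({
    'mean_test_score': [1411.6813081166167, 1411.6828129925009,
                        1411.6831283969675, 1411.6858588645807,
                        1411.7033350941367, 1411.7077573160127,
                        1411.711776498261,  1426.654070004524,
                        1426.672577699666,  1426.6773555186348],
    'param_rdg__alpha': [1000.0, 0.1, 0.01, 10.0, 1.0, 100.0, 0.001,
                         1000.0, 0.01, 0.1],
    'param_rdg__normalize': [False] * 10,
    'param_rdg__solver': ['saga'] * 7 + ['sag'] * 3,
})

# Compare solvers: how much does the score vary within and between solvers?
by_solver = (top10_pdf
             .groupby('param_rdg__solver')['mean_test_score']
             .agg(['count', 'mean', 'min', 'max']))
by_solver['spread'] = by_solver['max'] - by_solver['min']
print(by_solver)

# Cross-tabulate alpha against solver to check consistency across groups.
by_alpha_solver = top10_pdf.pivot_table(index='param_rdg__alpha',
                                        columns='param_rdg__solver',
                                        values='mean_test_score')
print(by_alpha_solver)
```

In these ten rows the choice of `alpha` moves the MAE by only a few hundredths within a solver, while the gap between `saga` and `sag` is about 15, which suggests the solver matters far more than `alpha` among the top results.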

__The End__