# Exercises - Week 5 - Cross Validation - [Blackjack]

## References
- [3.1. Cross-validation: evaluating estimator performance](https://scikit-learn.org/stable/modules/cross_validation.html) (Scikit-learn documentation)

Next week:
- https://scikit-learn.org/stable/modules/grid_search.html
- https://scikit-learn.org/stable/modules/classes.html#module-sklearn.model_selection
- https://scikit-learn.org/stable/modules/classes.html#module-sklearn.metrics
- https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html
- https://scikit-learn.org/stable/modules/model_evaluation.html
- https://scikit-learn.org/stable/tutorial/statistical_inference/model_selection.html
- https://scikit-learn.org/stable/tutorial/statistical_inference/index.html

Last week:
- https://scikit-learn.org/stable/data_transforms.html
- https://scikit-learn.org/stable/modules/preprocessing.html
- https://scikit-learn.org/stable/modules/compose.html#combining-estimators
- https://scikit-learn.org/stable/modules/compose.html#featureunion-composite-feature-spaces

## Contents
1. Setup
2. Data Lab notebooks
3. Cross validation
4. Pipelines with `FeatureUnion`
5. Dataset

## 1. Setup

Load libraries and display version numbers.

In [6]:
import pandas  as pd
import numpy   as np
import sklearn as sk
print('sklearn',sk.__version__)
print('pandas ',pd.__version__)
print('numpy  ',np.__version__)

These version numbers may not be the most recent, so they may not match the documentation you find through a web search.

The `display_pdf` function displays a pandas dataframe using the Databricks `display` function.

In [9]:
def display_pdf(a_pdf):
  display(spark.createDataFrame(a_pdf))

## 2. Data Lab notebooks

Last week:
- [sklearn/Introduction](https://bentley.cloud.databricks.com/#notebook/210807) 
- [sklearn/Preprocessing](https://bentley.cloud.databricks.com/#notebook/404771)

## 3. Cross validation

Cross validation extends the train-test approach to creating models that generalize well to new data.

In the following sections, I'll review the train-test approach, provide background for cross validation, and describe the method.

### 3.1 Cross validation - Train-test review 

- [Train & test datasets](https://bentley.cloud.databricks.com/#notebook/958305) (data lab notebook)

Estimators are used in three steps:
1. Create a model (by fitting the estimator to a dataset)
1. Make predictions with that model
1. Evaluate the model by comparing these predictions with actual values (scoring)

It's important to create models that make good predictions on unseen data. [more TBD]

As a first step, do not make predictions on the same dataset that was used to fit the estimator.
Instead, (randomly) split the initial dataset into two datasets (_train_ and _test_) and:
- create the model by fitting the estimator to the _train dataset_ 
- create predictions on the _test dataset_ (unseen data)

These predictions can then be scored to evaluate the model. A few common scoring functions are:
- root mean square error and R squared for regression models
- accuracy and area under the curve for classification models

Train-test datasets (and their proper use) ensure that the data used to fit a model is kept separate from the (unseen) data used to evaluate that model.
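The train-test steps above can be sketched as follows. This is a minimal illustration, not the course's own notebook: the synthetic data, the choice of `LinearRegression`, and the names `X_all` and `y_all` are assumptions made for the example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic data (an assumption for illustration): y is linear in x plus noise.
rng = np.random.RandomState(0)
X_all = rng.uniform(0, 10, size=(200, 1))
y_all = 3.0 * X_all[:, 0] + 2.0 + rng.normal(scale=1.0, size=200)

# Randomly hold out 25% of the rows as the test dataset.
X_train, X_test, y_train, y_test = train_test_split(
    X_all, y_all, test_size=0.25, random_state=0)

model = LinearRegression().fit(X_train, y_train)  # fit on the train dataset only
y_pred = model.predict(X_test)                    # predict on the unseen rows

# Score the predictions against the actual test values.
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
r2 = r2_score(y_test, y_pred)
print("RMSE=%.3f, R squared=%.3f" % (rmse, r2))
```

The test rows play no part in `fit`, so the two scores describe how the model behaves on data it has not seen.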

Even so, information about the test dataset can still leak into the process of building the model. __Cross validation__ is a method that attempts to remedy this problem, and is described in the next section.

### 3.2 Cross validation - Introduction

__Reference__ 
- [Train & test datasets](https://bentley.cloud.databricks.com/#notebook/958305) (data lab notebook)
- [3.1. Cross-validation: evaluating estimator performance](https://scikit-learn.org/stable/modules/cross_validation.html) (Scikit-learn documentation)

This section provides background to the cross validation method in light of the problem described above.

Every estimator has a set of _hyper-parameters_ that determine how the model is created when the estimator is fit to a dataset.

It is tempting to:
1. Choose a set of hyper-parameters
1. Create a model by fitting the estimator (configured with these hyper-parameters) to the train dataset
1. Score predictions made by this model on the test dataset
1. Evaluate the model (as determined by these hyper-parameters) based on its prediction score
1. Return to step 1 and repeat this process until you have found the "best" model in terms of its evaluation on the test dataset

There are problems, often referred to as _overfitting_, with this process:
- The final model has been created (customized) so that it scores well on the test dataset
- The test dataset has been used multiple times to find the best hyper-parameters of the final model

Either way you look at it, the test dataset is no longer new or unseen. Information about the test dataset has _leaked into_ the process of choosing the model. The Scikit-learn documentation describes the related practice of testing a model on the data used to fit it as a "methodological mistake".

We have two objectives: 
1. Use the test dataset only once to evaluate the final model
1. Find the best hyper-parameters to use in creating the model (without using the test dataset)

To address the first objective, separate the initial dataset into a train dataset and a test dataset. 

To address the second objective, cross validation separates the train dataset into multiple pairs of _train_ and _validation_ datasets. These train-validation dataset pairs are used to find the best hyper-parameters. The final model is then evaluated on the test dataset.

The following sections provide more detail on:
- the process of creating the train-validation dataset pairs
- the use of cross validation to find the best hyper-parameters.
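The two objectives can be sketched end to end as below. This is only an outline under stated assumptions: the data is synthetic, `LogisticRegression` and its `C` hyper-parameter are used as a convenient example, and the candidate values in `candidate_C` are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic classification data (an assumption for illustration).
X_all, y_all = make_classification(n_samples=300, n_features=10, random_state=0)

# Objective 1: set the test dataset aside; it is used once, at the very end.
X_train, X_test, y_train, y_test = train_test_split(
    X_all, y_all, test_size=0.2, random_state=0)

# Objective 2: compare hyper-parameter values using only the train dataset.
candidate_C = [0.01, 0.1, 1.0, 10.0]  # hypothetical candidate values
mean_scores = {}
for C in candidate_C:
    clf = LogisticRegression(C=C, max_iter=1000)
    # cv=5 builds 5 train-validation pairs from the train dataset
    # and returns one validation score per pair.
    scores = cross_val_score(clf, X_train, y_train, cv=5)
    mean_scores[C] = scores.mean()

best_C = max(mean_scores, key=mean_scores.get)

# Refit on the whole train dataset, then evaluate once on the test dataset.
final_model = LogisticRegression(C=best_C, max_iter=1000).fit(X_train, y_train)
test_accuracy = final_model.score(X_test, y_test)
print("best C=%s, test accuracy=%.3f" % (best_C, test_accuracy))
```

The test dataset appears only in the last two lines, after the hyper-parameter has been chosen.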

### 3.3 Cross validation - Method

This section describes the cross validation method. In basic terms: 
- The input to the method is a single dataset, in particular it is the train dataset from the initial split of train and test datasets. 
- The output from the method consists of several train-validation dataset pairs. 

Train-validation dataset pairs are created differently for cross-sectional datasets and for time series datasets.

For cross-sectional datasets, the _KFold_ procedure (described below) is the most straightforward. See this [link](https://scikit-learn.org/stable/modules/cross_validation.html#k-fold) from Scikit-learn for a graphic that may help in understanding the procedure.

Train-validation pairs are created as follows, given an input parameter `k`:
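1. The input dataset is partitioned into `k` subsets of (nearly) equal size. Note that the `KFold` class uses blocks of consecutive rows by default and assigns rows at random only when `shuffle=True`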
2. One of the `k` subsets is designated as the validation dataset
3. The train dataset paired with the validation dataset, created in step 2, consists of the remaining `k-1` subsets 
4. Repeat steps 2 and 3 for each of the `k` subsets. This produces `k` train-validation pairs. 
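To make the four steps concrete, they can be written out by hand with NumPy. The helper name `kfold_pairs` is hypothetical (it is not part of scikit-learn), and this sketch always shuffles the rows first, which is an assumption of the example.

```python
import numpy as np

def kfold_pairs(n_rows, k, seed=0):
    """Yield k (train_idx, valid_idx) pairs of row positions."""
    rng = np.random.RandomState(seed)
    shuffled = rng.permutation(n_rows)       # step 1: put the rows in random order ...
    folds = np.array_split(shuffled, k)      # ... and cut them into k near-equal subsets
    for i in range(k):
        valid_idx = folds[i]                 # step 2: one subset is the validation dataset
        train_idx = np.concatenate(          # step 3: the other k-1 subsets are the train dataset
            [folds[j] for j in range(k) if j != i])
        yield train_idx, valid_idx           # step 4: repeat for each of the k subsets

pairs = list(kfold_pairs(n_rows=10, k=5))
for train_idx, valid_idx in pairs:
    print("train=%s validation=%s" % (np.sort(train_idx), np.sort(valid_idx)))
```

Every row lands in exactly one validation dataset, and no row is in both halves of the same pair.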

For more information see [3.1. Cross-validation: evaluating estimator performance](https://scikit-learn.org/stable/modules/cross_validation.html) from the [Scikit-learn documentation](https://scikit-learn.org/stable/index.html).

On the other hand, for a time series dataset the train dataset must precede the validation/test dataset in time. This means that the standard cross-validation method described above is not acceptable for time series. See this [link](https://scikit-learn.org/stable/modules/cross_validation.html#time-series-split) from Scikit-learn, which may help in understanding the procedure.

For time series datasets, the train-validation pairs are created as follows, given an input parameter `k`:
1. The input dataset is partitioned into `k+1` subsets of equal size (if possible), where each is a contiguous sequence of rows 
1. The first train dataset (of the first train-validation pair) is the __first__ of the `k+1` subsets. The corresponding validation dataset is the second of the subsets. 
1. The second train dataset is the __first two__ of the `k+1` subsets. The corresponding validation dataset is the third of the subsets. 
1. The last train dataset is the __first `k`__ of the `k+1` subsets. The corresponding validation dataset is the last of the subsets. 
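The expanding-window steps above can be sketched by hand and compared with `TimeSeriesSplit`. The helper name `expanding_pairs` is hypothetical, and the sketch assumes the number of rows is an exact multiple of `k+1`.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

def expanding_pairs(n_rows, k):
    """Yield k (train_idx, valid_idx) pairs; train rows always precede validation rows."""
    size = n_rows // (k + 1)                             # rows per subset
    for i in range(1, k + 1):
        train_idx = np.arange(0, i * size)               # the first i subsets
        valid_idx = np.arange(i * size, (i + 1) * size)  # the subset that follows them
        yield train_idx, valid_idx

manual = list(expanding_pairs(n_rows=60, k=5))
for train_idx, valid_idx in manual:
    print("train rows 0..%d, validation rows %d..%d"
          % (train_idx[-1], valid_idx[0], valid_idx[-1]))

# Compare against scikit-learn's splitter on a dummy dataset with 60 rows.
reference = TimeSeriesSplit(n_splits=5).split(np.zeros((60, 1)))
same = all(np.array_equal(t, rt) and np.array_equal(v, rv)
           for (t, v), (rt, rv) in zip(manual, reference))
print("matches TimeSeriesSplit:", same)
```

The train dataset grows with each pair while the validation dataset keeps the same size and always lies later in time.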

An example is presented in the next two code cells below to demonstrate the procedure. The first cell creates a dataset with 60 rows.

In [23]:
import pandas as pd
X = pd.DataFrame(data={'a': range(100, 160),
                       'b': range(200, 260)})

Run the code cell below with different values of `n_splits` in the first line of the cell. Numbers between `2` and `5` produce splits that are easy to understand.

In [25]:
n_splits = 5

from sklearn.model_selection import TimeSeriesSplit
tscv = TimeSeriesSplit(n_splits=n_splits)

for train, test in tscv.split(X):
  print("min(train)=%s, max(train)=%2d, min(test)=%s, max(test)=%s, len(train)=%s, len(test)=%s" 
        % (min(train), max(train), min(test),  max(test), len(train), len(test)))
  

For more information see
[TimeSeriesSplit](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.TimeSeriesSplit.html),
[3.1.2.5. Cross validation of time series data](https://scikit-learn.org/stable/modules/cross_validation.html#cross-validation-of-time-series-data)
from the 
[Scikit-learn documentation](https://scikit-learn.org/stable/index.html).

__Exercise:__ create a demonstration, as simple as the demonstration of `TimeSeriesSplit` above, of basic cross validation with a cross-sectional dataset using the `KFold` class and using a very simple dataset.

The code below sets the number of splits to 5 and imports the `KFold` class from scikit-learn. The `n_splits` argument is passed explicitly to `KFold` (stored in `KF`) rather than relying on the default, which differs between scikit-learn versions. The `for` loop iterates over the train-validation pairs produced by `KF.split(X)` and prints the range and size of each part. Because `shuffle` is left at its default of `False`, each validation fold is a block of consecutive rows rather than a random sample. These splits matter going forward: methods and hyper-parameters will be compared on the validation folds, because a model judged on a single train-test split may be overfit to that split.

In [29]:
n_splits = 5
from sklearn.model_selection import KFold
KF = KFold(n_splits=n_splits)  # shuffle=False by default: folds are consecutive blocks of rows

# "test" here is the validation part of each train-validation pair
for train, test in KF.split(X):
  print("min(train)=%s, max(train)=%2d, min(test)=%s, max(test)=%s, len(train)=%s, len(test)=%s" 
        % (min(train), max(train), min(test),  max(test), len(train), len(test)))
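For comparison, the sketch below runs `KFold` with and without shuffling on a toy dataset of 10 rows (the dataset and the name `rows` are assumptions for illustration). `random_state` is only passed when `shuffle=True`, since it has no effect otherwise.

```python
import numpy as np
from sklearn.model_selection import KFold

rows = np.arange(10).reshape(-1, 1)  # toy dataset with 10 rows

folds = {}
for shuffle in (False, True):
    kf = KFold(n_splits=5, shuffle=shuffle,
               random_state=0 if shuffle else None)
    # Keep only the validation row positions of each train-validation pair.
    folds[shuffle] = [valid_idx.tolist() for _, valid_idx in kf.split(rows)]
    print("shuffle=%s: validation rows per fold = %s" % (shuffle, folds[shuffle]))
```

Without shuffling, the validation folds are consecutive blocks of rows; with shuffling, each fold is a random subset, but every row still appears in exactly one validation fold.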
  

## 4. Pipelines with `FeatureUnion`

"`FeatureUnion` combines several transformer objects into a new transformer that combines their output." (Scikit-learn documentation)

The `fit` and `transform` methods of a `FeatureUnion` object call the same methods on each component transformer object.
The result of the `transform` method (of the `FeatureUnion` object) is the column-wise concatenation of the results of the `transform` methods applied to the component transformer objects. 

For example:

In [32]:
from sklearn.pipeline                import FeatureUnion
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
corpus = [
  'dogs and cats.',
  'dogs, more dogs and horses.',
  'cats or birds.'
]
fea = FeatureUnion([('cnt_vec', CountVectorizer()),
                    ('idf_vec', TfidfVectorizer())
                   ])
fea_pdf = \
fea.fit_transform(corpus) \
   .toarray() \
   .round(3)
display_pdf(pd.DataFrame(data=fea_pdf))

0,1,2,3,4,5,6,7,8,9,10,11,12,13
1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.577,0.0,0.577,0.577,0.0,0.0,0.0
1.0,0.0,0.0,2.0,1.0,1.0,0.0,0.344,0.0,0.0,0.688,0.452,0.452,0.0
0.0,1.0,1.0,0.0,0.0,0.0,1.0,0.0,0.623,0.474,0.0,0.0,0.0,0.623
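The column-wise concatenation described at the start of this section can be checked directly: fitting the two vectorizers separately and stacking their outputs side by side should give the same matrix as the `FeatureUnion`. This is a small sketch on the same three sentences; the names `docs`, `separate` and `combined` are introduced only for the example.

```python
import numpy as np
from scipy.sparse import hstack
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.pipeline import FeatureUnion

docs = ['dogs and cats.', 'dogs, more dogs and horses.', 'cats or birds.']

# Fit each transformer on its own and concatenate the results column-wise.
cnt_vec, idf_vec = CountVectorizer(), TfidfVectorizer()
separate = hstack([cnt_vec.fit_transform(docs),
                   idf_vec.fit_transform(docs)]).toarray()

# Let FeatureUnion do the same in one step.
union = FeatureUnion([('cnt_vec', CountVectorizer()),
                      ('idf_vec', TfidfVectorizer())])
combined = union.fit_transform(docs).toarray()

print(np.allclose(separate, combined))

# Words in column order: the first 7 columns are counts, the last 7 are TF-IDF weights.
vocab = sorted(cnt_vec.vocabulary_, key=cnt_vec.vocabulary_.get)
print(vocab)
```

Both vectorizers learn the same seven-word vocabulary, which is why the combined matrix has 14 columns.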


In [33]:
import pandas as pd
title_tags_pdf = (
    pd.read_csv('/dbfs/mnt/group-ma707/data/mining_com_coal.csv',
                encoding="ISO-8859-1")
      .loc[:, ['title', 'tags']]
      # na=False: a missing tags value counts as "no match"; without it NaN would be cast to True
      .assign(target=lambda df: 1*df.tags.str.contains('china', na=False).astype('bool'))
      .loc[lambda df: df.title.notnull()]
)
type(title_tags_pdf)

__Exercise:__ describe the function of each of the method calls above and describe how they work

The code above works as follows:
- `pd.read_csv` reads the specified CSV file, decoding it with the ISO-8859-1 (Latin-1) character encoding.
- The first `.loc` keeps all rows but only the `title` and `tags` columns.
- `.assign` adds a `target` column. The lambda receives the dataframe and tests whether each row's `tags` string contains the substring `china`; `.astype('bool')` makes the result Boolean, and multiplying by 1 converts `True`/`False` to 1/0. This 0/1 column is the classification target used in the upcoming exercises.
- The second `.loc` takes a lambda that returns a Boolean mask, so only the rows where `title` is not null are kept.
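The same chain of method calls can be tried on a tiny dataframe to see what each step does. The rows below are hypothetical stand-ins for the CSV file; only the column names `title` and `tags` come from the code above.

```python
import pandas as pd

# Hypothetical rows; the second row has no title.
toy_pdf = pd.DataFrame({
    'title':   ['Coal output rises', None, 'Prices fall'],
    'tags':    ['china, coal', 'china', 'australia, coal'],
    'content': ['text a', 'text b', 'text c'],
})

result_pdf = (
    toy_pdf
    .loc[:, ['title', 'tags']]                                    # keep two columns
    .assign(target=lambda df: 1 * df.tags.str.contains('china'))  # 0/1 flag per row
    .loc[lambda df: df.title.notnull()]                           # drop rows without a title
)
print(result_pdf)
```

The row with the missing title is dropped, and `target` is 1 only where the `tags` string mentions `china`.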

__Exercise:__ From the `title_tags_pdf` dataframe create:
- a `features_pdf` dataframe containing the `title` column
- a `target_ser` series containing the `target` column

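The code below creates the two objects. Note that selecting a single column with `title_tags_pdf['title']` returns a pandas Series, so `features_pdf` is a Series despite its name. This is the form `TfidfVectorizer` expects: an iterable of strings.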

In [38]:
features_pdf = title_tags_pdf['title']
print(type(features_pdf))
target_ser = title_tags_pdf.target
print(type(target_ser))

In [39]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
tfidf = TfidfVectorizer()
log_reg = LogisticRegression()
# fit the vectorizer once and reuse it, so fit and predict share the same vocabulary
log_reg.fit(tfidf.fit_transform(title_tags_pdf['title']),
            title_tags_pdf.target
           )
log_reg.predict(tfidf.transform(title_tags_pdf['title']))

__Exercise:__ Create a pipeline `est` containing `TfidfVectorizer` and `LogisticRegression` so that you can call
- `est.fit(features_pdf,target_ser)`
- `est.predict(features_pdf)`

The pipeline below has two steps: `TfidfVectorizer` first, which transforms the titles into a numeric feature matrix, and `LogisticRegression` second, which is fit to that matrix and makes the predictions. Calling `est.predict` returns an array of predicted labels. The `LogisticRegression` step must come last: every step before the final one has to be a transformer, and only the final estimator provides `predict`.

In [42]:
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

est = Pipeline([
  ('tfidf', TfidfVectorizer()),   # transformer: titles -> TF-IDF feature matrix
  ('lgr', LogisticRegression())   # final estimator: provides predict
])


In [43]:
est.fit(features_pdf,target_ser)
est.predict(features_pdf)
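A pipeline like this combines naturally with the cross validation of section 3, because the whole pipeline (vectorizer and classifier) is refit on the train part of every train-validation pair. The sketch below uses `cross_val_score` on a hypothetical toy corpus; the titles, targets, and the names `toy_est`, `titles` and `targets` are made up for illustration and are not the course data.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import Pipeline

# Hypothetical titles and 0/1 targets standing in for features_pdf / target_ser.
titles = pd.Series([
    'china coal imports rise', 'china steel demand', 'australia coal strike',
    'china power plants', 'us gas prices', 'india coal tender',
    'china mine safety', 'europe carbon price', 'china port stocks',
    'brazil iron ore'])
targets = pd.Series([1, 1, 0, 1, 0, 0, 1, 0, 1, 0])

toy_est = Pipeline([('tfidf', TfidfVectorizer()),
                    ('lgr', LogisticRegression())])

# The vocabulary is learned from the train titles of each pair only,
# so the validation titles stay unseen until they are scored.
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(toy_est, titles, targets, cv=cv)
print(scores, scores.mean())
```

With only ten rows the individual scores are very noisy; the point of the sketch is the mechanics, not the numbers.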

## 5. Dataset

Display the paths of the three files in our dataset.

In [46]:
%sh ls -hot /dbfs/mnt/group-ma707/data/*

__Note:__ we will create, from the above files, at least three dataframes (with features and target). Each is described below.

From only the 5TC dataset: 
- target will be `BCI`
- features will include lagged versions of the other columns 
- features will include date and time components (hour, day of week, etc.)
- features may include external time series
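As a rough sketch of the lagged and date-time features listed above, pandas `shift` and the datetime index attributes can be used as follows. The data is made up: `BCI` is the target named above, while the column `P1`, the timestamps, and the names `ts_pdf` and `lagged_pdf` are hypothetical.

```python
import pandas as pd

# Hypothetical hourly series; 2020-01-06 is a Monday.
ts_pdf = pd.DataFrame(
    {'BCI': [10.0, 11.0, 12.0, 13.0, 14.0],
     'P1':  [1.0, 2.0, 3.0, 4.0, 5.0]},
    index=pd.to_datetime(['2020-01-06 09:00', '2020-01-06 10:00',
                          '2020-01-06 11:00', '2020-01-06 12:00',
                          '2020-01-06 13:00']))

lagged_pdf = ts_pdf.assign(
    P1_lag1=ts_pdf['P1'].shift(1),       # value of P1 one row earlier
    P1_lag2=ts_pdf['P1'].shift(2),       # value of P1 two rows earlier
    hour=ts_pdf.index.hour,              # time component
    day_of_week=ts_pdf.index.dayofweek,  # date component, Monday=0
).dropna()                               # the first two rows have no lagged values

print(lagged_pdf)
```

Lagging guarantees that each row's features only use values from earlier rows, which is what a time series model is allowed to see.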

From only the _mining_ dataset(s):
- the target may be one or more tags (from the `tags` variable)
- features would be words present in the `content` or `title` variables

From the 5TC and _mining_ dataset(s): 
- target will be `BCI` (from 5TC dataframe)
- include all features from either of the above dataframes
- the dataframes would need to be joined by either:
    1. aggregating the features from the _mining_ dataframe (by date)
    1. spreading the 5TC dataframe onto the _mining_ dataframe (duplicating 5TC rows)
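The two joining options might look like the sketch below. All data and names here are hypothetical (`tc_pdf`, `mining_pdf`, a shared `date` column, and a 0/1 `china` column); the real files may be organised differently.

```python
import pandas as pd

# Hypothetical stand-ins: one 5TC row per date, several mining articles per date.
tc_pdf = pd.DataFrame({
    'date': pd.to_datetime(['2020-01-06', '2020-01-07']),
    'BCI':  [10.0, 11.0]})
mining_pdf = pd.DataFrame({
    'date':  pd.to_datetime(['2020-01-06', '2020-01-06', '2020-01-07']),
    'china': [1, 0, 1]})

# Option 1: aggregate the mining rows by date, then join (one row per date).
agg_pdf = mining_pdf.groupby('date', as_index=False).agg({'china': 'sum'})
by_date_pdf = tc_pdf.merge(agg_pdf, on='date', how='left')

# Option 2: spread the 5TC rows onto the mining rows (one row per article,
# so 5TC values are duplicated for dates with several articles).
by_article_pdf = mining_pdf.merge(tc_pdf, on='date', how='left')

print(by_date_pdf)
print(by_article_pdf)
```

Option 1 keeps the row count of the 5TC dataframe, while option 2 keeps the row count of the mining dataframe.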

__The End__