# Machine Learning Workflows

## Contents
1. Introduction
2. Preprocessing pipelines
3. Creating train-test dataset pairs
4. Estimator pipelines

## 1. Introduction

Machine learning workflows take raw data as input and produce, as output, models that make optimized predictions.
By "optimized predictions" I mean predictions that minimize the difference (error) between the predicted and actual values. 

These workflows are substantial and complicated processes. For this reason, they should be:
- __modular__, which means that your code is both simple and capable
- __standard__, which means that your code can be easily understood, debugged and extended by others

Pipelines satisfy both of these requirements and are available in Python (for example, in Scikit-learn) and in Spark, among other platforms.

A machine learning workflow has three essential steps (the first and last are pipelines):
1. A preprocessing pipeline, which creates a feature-target dataset suitable for fitting to the estimator pipeline(s)
1. The separation of the feature-target dataset into train and test datasets 
1. An estimator pipeline, which can be used to create models and predictions from those models

Descriptions of these steps are provided below.

## 2. Preprocessing pipelines

Preprocessing pipelines are used to transform a raw dataset into a feature-target dataset suitable for:
1. separating into a train-test dataset pair
2. fitting the above train dataset to a `GridSearchCV` object initialized with an estimator pipeline and a parameter grid

The second step is often performed multiple times for different parameter grids. 

Objects in the pipeline may:
- clean raw dataset features
- create features from raw dataset features
- impute missing values in features from the raw dataset and in created features
- encode non-numeric features from the raw dataset as numeric features (in the feature-target dataset)

A preprocessing pipeline creates the feature-target dataset from which the train and test datasets will be created.
This is a potential source of what is called "data leakage".

### 2.1 Data leakage

Jason Brownlee, in 
[Data Leakage in Machine Learning](https://machinelearningmastery.com/data-leakage-machine-learning/), 
describes data leakage as (paraphrased)
> _information from outside the train dataset being used to create the model_

He references [Leakage, by Data Skeptic](http://dataskeptic.com/blog/episodes/2016/leakage), which states (paraphrased)
> _any feature created with information not available when the model makes predictions is introducing data leakage_

There are several ways this may happen. Three are:
1. information from the final evaluation of the test dataset is used to create the feature-target dataset or the train dataset
1. information from the test dataset makes its way into the train dataset
1. information from the future is stored with data time stamped in the past

The first item can be avoided by evaluating only the final model on the test dataset.
In other words, once a model has been evaluated on the test dataset, it is the final model.
It is acceptable, and common practice, to _fit the train dataset to a `GridSearchCV` object initialized with an estimator pipeline and a parameter grid_ multiple times, and then to compare the results across those runs to find the best model.

The second item can be avoided by not including scaling, normalization, standardization, feature selection or dimensionality reduction in the preprocessing pipeline.
(They should be included in the estimator pipeline.)
Imputing missing values with the mean or median of the entire dataset is not OK, because that value is computed partly from rows that will end up in the test dataset, 
but imputing with static values chosen based on domain knowledge is OK.
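
The following is a minimal sketch of keeping scaling inside the estimator pipeline, assuming Scikit-learn and a synthetic dataset (the data, the step names and the choice of `LogisticRegression` are illustrative assumptions). Because the scaler is a pipeline step, cross-validation re-fits it on the training folds only.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic feature-target dataset (illustrative only).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)

# The scaler is a step of the estimator pipeline, so during cross-validation
# it is fit on the training folds and only applied to the held-out fold.
estimator_pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("model", LogisticRegression()),
])

scores = cross_val_score(estimator_pipeline, X, y, cv=5)
print(scores.mean())
```

Scaling the whole dataset before splitting would instead let the held-out rows influence the mean and standard deviation used for training.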

Any transformation (in the preprocessing pipeline) should be performed on a per value basis without information from or about the entire dataset.
For instance, changing a text field to lower case seems OK, 
but selecting a variable based on how well it correlates to the target across the entire dataset (train and test) doesn't seem OK.
From this perspective, objects with `fit` methods are likely not acceptable in the preprocessing pipeline, as `fit` methods are designed to store information about the entire dataset.
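
As a rough illustration of per-value transformations, the sketch below uses pandas on a made-up raw dataset. The column names and the static fill values are hypothetical, and nothing is learned from the data as a whole.

```python
import pandas as pd

# Hypothetical raw dataset; the column names are assumptions.
raw = pd.DataFrame({
    "comment": ["Great PRODUCT", "Poor Service", None],
    "visits": [3.0, None, 7.0],
})

def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    """Apply only per-value transformations (no statistics of the dataset are used)."""
    out = df.copy()
    # Static fill values, assumed here to come from domain knowledge.
    out["comment"] = out["comment"].fillna("").str.lower()
    out["visits"] = out["visits"].fillna(0.0)
    return out

clean = preprocess(raw)
print(clean)
```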

Finally, we need to address lag variables, which relate to the third item.
Stephen Nawara describes lag variables in 
[Avoiding Data Leakage in Machine Learning](https://conlanscientific.com/posts/category/blog/post/avoiding-data-leakage-machine-learning/)
as a remedy for time series features that include same-day information about other time series features, one of which may be the target (that is, features that use information from the future).
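
A lag variable can be sketched with pandas `shift`, as below. The series and its column names are invented for illustration; the point is only that each row ends up holding the previous day's value rather than the same-day value.

```python
import pandas as pd

# Hypothetical daily series; the column names are assumptions.
daily = pd.DataFrame(
    {"price": [10.0, 11.0, 12.0, 13.0], "volume": [100, 120, 90, 110]},
    index=pd.date_range("2020-01-01", periods=4, freq="D"),
)

# Lag the feature by one day so that each row only holds information
# that was available before the day being predicted.
daily["volume_lag1"] = daily["volume"].shift(1)

# The first row has no previous day, so its lag value is missing.
lagged = daily.dropna(subset=["volume_lag1"])
print(lagged)
```

Creating the lag introduces a missing value in the first row, which is dropped here.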

## 3. Creating train-test dataset pairs

For each feature-target dataset create:
- one train-test dataset pair 
- one estimator pipeline (see below)

The cardinal rule with regard to the train-test pairs is that predictions are created only once and only for one of the test datasets.
This allows multiple train datasets to be fit to `GridSearchCV` objects initialized with different estimator pipelines and parameter grids.
The resulting output from `GridSearchCV` can be compared.
Then the best model can be evaluated on the corresponding test dataset.
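
A minimal sketch of this sequence, assuming Scikit-learn's `train_test_split` and a synthetic regression dataset (the data, the `Ridge` estimator and the grid values are illustrative assumptions):

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic feature-target dataset (illustrative only).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = X @ np.array([1.5, -2.0, 0.0, 0.5]) + rng.normal(scale=0.1, size=200)

# One train-test dataset pair for this feature-target dataset.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

pipeline = Pipeline([("scale", StandardScaler()), ("model", Ridge())])
param_grid = {"model__alpha": [0.01, 0.1, 1.0, 10.0]}

# Model selection uses the train dataset only (via cross-validation).
search = GridSearchCV(pipeline, param_grid, cv=5)
search.fit(X_train, y_train)

# The test dataset is used once, for the final evaluation of the best model.
test_score = search.score(X_test, y_test)
print(search.best_params_, test_score)
```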

## 4. Estimator pipelines

The end result of the estimator pipeline is a model, which, of course, can make predictions.

Estimator pipelines:
- begin with a sequence of zero or more transformer objects, which either create features or select features 
- end with an estimator object
- provide `fit`, `predict` and `score` methods
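
The structure above might look as follows in Scikit-learn. This is only a sketch: the dataset is synthetic, and the particular transformer (`SelectKBest`) and estimator (`LogisticRegression`) are arbitrary choices.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Synthetic feature-target dataset (illustrative only).
X, y = make_classification(
    n_samples=200, n_features=10, n_informative=3, random_state=0
)

estimator_pipeline = Pipeline([
    ("select", SelectKBest(f_classif, k=5)),       # transformer step
    ("model", LogisticRegression(max_iter=1000)),  # final estimator step
])

estimator_pipeline.fit(X, y)                 # fits every step in order
predictions = estimator_pipeline.predict(X)  # transforms, then predicts
accuracy = estimator_pipeline.score(X, y)    # mean accuracy for a classifier
print(accuracy)
```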

### 4.1 Components of the estimator pipeline

When creating an estimator pipeline there are several choices to make:
1. which transformer objects to use
1. which estimator object to use
1. which scoring metric to use

In addition, and once the above choices are made, parameter grids can be chosen in order to investigate sets of hyperparameters.
This is usually performed with the goal of finding a model which produces the best predictions. 
There are several techniques to do so.
Grid search is the most straightforward, but there are several "smarter" more advanced optimization methods.
See the section titled "Smart Hyperparameter Tuning" in _Evaluating Machine Learning Models_.
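
To make the three choices and the parameter grid concrete, here is a small grid search sketch. The transformer, estimator, scoring metric and grid values are all assumptions picked for illustration. Hyperparameters of pipeline steps are addressed as `<step name>__<parameter>`.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Synthetic feature-target dataset (illustrative only).
X, y = make_regression(
    n_samples=150, n_features=8, n_informative=3, noise=5.0, random_state=0
)

pipeline = Pipeline([("select", SelectKBest(f_regression)), ("model", Ridge())])

# 3 x 3 = 9 hyperparameter combinations.
param_grid = {"select__k": [2, 4, 8], "model__alpha": [0.1, 1.0, 10.0]}

search = GridSearchCV(
    pipeline, param_grid, scoring="neg_mean_absolute_error", cv=5
)
search.fit(X, y)

# One row per hyperparameter combination, best first.
results = pd.DataFrame(search.cv_results_)[
    ["param_select__k", "param_model__alpha", "mean_test_score"]
]
print(results.sort_values("mean_test_score", ascending=False).head(3))
print(search.best_params_)
```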

### 4.1 Feature creation

Features can be created by both the preprocessing pipeline and by the estimator pipeline.

Preprocessing pipelines are used to create features whose creation would produce missing values (such as lag variables).
They are also used to create features whose creation doesn't involve any hyperparameters that would be tuned.
It is more computationally efficient to create features in the preprocessing pipeline as it is run only once. 

Estimator pipelines are used to create features where this creation is determined by hyperparameters.
When this is the case these hyperparameters can be tuned to produce models that produce optimal predictions.

The objects of the `CountVectorizer` class are examples of transformers that create features and that are determined by hyperparameters.
Two of these hyperparameters are `stop_words` and `ngram_range` (of the init method).

The current project code places the `CountVectorizer` object in the preprocessing pipeline. It can be moved to the estimator pipeline so that its hyperparameters can be used to tune the model.
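
A sketch of what that move could look like is given below. It is not the project's code: the toy texts, labels, step names and grid values are invented, and `LogisticRegression` stands in for whatever estimator the project uses.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Toy corpus (illustrative only): 1 = positive, 0 = negative.
texts = [
    "great product and fast delivery", "really great service",
    "I love this product", "fast delivery and great price",
    "love the friendly service", "great price, love it",
    "terrible product and slow delivery", "really poor service",
    "I hate this product", "slow delivery and poor quality",
    "hate the rude service", "poor quality, hate it",
]
labels = [1] * 6 + [0] * 6

pipeline = Pipeline([
    ("vectorize", CountVectorizer()),
    ("model", LogisticRegression()),
])

# The vectorizer's hyperparameters are now part of the search.
param_grid = {
    "vectorize__stop_words": [None, "english"],
    "vectorize__ngram_range": [(1, 1), (1, 2)],
}

search = GridSearchCV(pipeline, param_grid, cv=3)
search.fit(texts, labels)
print(search.best_params_)
```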

### 4.2 Feature selection and feature reduction

Often feature-target datasets have many features.
Some features may have been present in the raw dataset and others may have been created in either of the pipelines.
KDnuggets News (see __References__ below) lists several reasons why it may be more desirable to use fewer features in making good predictions:
- Some features might be irrelevant 
- Some features might be redundant 
- Overfitting is more likely with many features
- Models with fewer features are easier to understand

Feature selection and feature reduction should be part of the training process, which means for our work that they should be part of the estimator pipeline. 
See [External Validation](https://topepo.github.io/caret/feature-selection-overview.html#external-validation)
from [The `caret` package](https://topepo.github.io/caret/index.html) 
and [An Introduction to Feature Selection, by Jason Brownlee](https://machinelearningmastery.com/an-introduction-to-feature-selection/).

There are two approaches to using fewer features:
1. Feature selection
2. Dimensionality reduction

Each is explained in a separate section below.

#### 4.2.1 Feature selection 
As mentioned above, it is often desirable to reduce the number of features in a feature-target dataset. 
It is not clear though which features to remove and which to keep.
For this reason several algorithms have been written to automate the process of selecting features. 
They fall into three groups: 
1. __Based on the dataset__. Features are ranked based on per feature scores provided by a scoring function, which is often a statistical measurement. 
1. __Based on models built from the dataset__. Features are ranked, or chosen, based on their importance to the model. 
1. __Based on predictions of models (built from the dataset)__. Features are chosen based on their ability to reduce prediction error.

##### 4.2.1.2 Dataset based 

This collection of techniques involves:
- Applying a scoring function to each feature (possibly in relation to the target feature)
- Choosing features based on their rank with regard to these functions

Variance is a simple and commonly used scoring function for feature selection.
Only features whose variance exceeds a given threshold are kept for further analysis. 
Scikit-learn implements this technique with the 
[VarianceThreshold](http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.VarianceThreshold.html) 
transformer class. 
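
A minimal `VarianceThreshold` sketch on a small hand-made array (the data and the threshold value are arbitrary):

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

# Column 0 is constant, so its variance is zero.
X = np.array([
    [0.0, 2.0, 0.0],
    [0.0, 1.0, 4.0],
    [0.0, 3.0, 1.0],
    [0.0, 2.0, 3.0],
])

selector = VarianceThreshold(threshold=0.1)
X_reduced = selector.fit_transform(X)

print(selector.variances_)     # per-feature variances: 0.0, 0.5 and 2.5
print(selector.get_support())  # boolean mask: only columns 1 and 2 are kept
```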

Two transformer classes
[`SelectPercentile`](http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectPercentile.html#sklearn.feature_selection.SelectPercentile)
and 
[`SelectKBest`](http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectKBest.html)
(implemented by Scikit-learn) take as input a scoring function and, when fit to a dataset, keep the top percentile or the top number of features (respectively) based on the ranking provided by the scoring function.

Scikit-learn makes available these four scoring functions
([`f_classif`](http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.f_classif.html),
 [`f_regression`](http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.f_regression.html),
 [`mutual_info_classif`](http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.mutual_info_classif.html) and
 [`mutual_info_regression`](http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.mutual_info_regression.html))
which score features in relation to a categorical target (the `_classif` functions) or a numeric target (the `_regression` functions).
Note that a value of `0` indicates no linear dependence for [`f_regression`](http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.f_regression.html), but that a value of `0` indicates no dependence for [`mutual_info_regression`](http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.mutual_info_regression.html).
Additional scoring functions are supplied by Scikit-learn and can be written by the user.
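
The difference between the two regression scoring functions can be sketched on synthetic data in which the target depends linearly on one feature and quadratically on another. The data-generating formula below is an assumption made purely for illustration.

```python
import numpy as np
from sklearn.feature_selection import (
    SelectKBest,
    f_regression,
    mutual_info_regression,
)

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(500, 3))
# Column 0: weak linear effect. Column 1: strong non-linear effect.
# Column 2: no effect.
y = 0.5 * X[:, 0] + 3.0 * X[:, 1] ** 2 + rng.normal(scale=0.05, size=500)

f_scores, _ = f_regression(X, y)  # measures linear dependence only
mi_scores = mutual_info_regression(X, y, random_state=0)

# Keep the single best feature according to the linear score.
selector = SelectKBest(score_func=f_regression, k=1).fit(X, y)

print(f_scores)
print(mi_scores)
print(selector.get_support())
```

In this setup the F-score should favor the linear feature, while mutual information should favor the quadratic one.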

##### 4.2.1.3 Model based 

Some algorithms assess the importance of features during the process of building the model. 
There are two basic types of these models. 
The first produces weights for each feature that indicate their importance in relation to the model.
The second produces models that may not include all features and so effectively eliminate some features and select others. 
LASSO is the most common example of this (second) type of model.

Feature selection of the first type, in Python, requires an estimator with a `fit` method that creates either a `coef_` or a `feature_importances_` attribute, which provides the weight for each feature. 
Features with higher weights are selected. 

The two transformer classes (described below) implement different methods using these weights to choose features. 
- `SelectFromModel` chooses features whose weights surpass a given threshold
- `RFE` iteratively prunes features with the least weight

Feature selection of the second type is called _regularization_.
A list of regularized models in R can be found in the
[`caret` package documentation](https://topepo.github.io/caret/feature-selection-overview.html#models-with-built-in-feature-selection). 
The regularized models in Scikit-learn include linear models, tree-based models and support vector machine models.
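
As a sketch of the second type, the LASSO example below is fit to synthetic data in which only the first two features affect the target (the data and the `alpha` value are illustrative assumptions). The features whose coefficients are shrunk to exactly zero are effectively eliminated.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
# Only columns 0 and 1 influence the target.
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

model = Lasso(alpha=0.1).fit(X, y)

# Indices of the features with non-zero coefficients.
kept = np.flatnonzero(model.coef_)
print(model.coef_)
print(kept)
```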

###### 4.2.1.3.1 Ranking (`SelectFromModel`)

The `SelectFromModel` transformer class takes as input an estimator object and a weight threshold,
which can be specified as a number or as a multiple of either the mean or median of the weights (for example, `"1.25*mean"`). 


The `fit` method 
1. Takes as input a predictor array (and, optionally, a target array)
1. Fits the estimator to this/these array(s) 

The `transform` method
1. Takes as input a predictor array
1. Returns those features (columns of the predictor array) whose weight, according to the estimator, is greater than or equal to the threshold

Scikit-learn documentation for `SelectFromModel` can be found at
- http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectFromModel.html
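
A short usage sketch, assuming a random forest as the estimator (any estimator exposing `coef_` or `feature_importances_` could be substituted) and a synthetic dataset:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

# Synthetic feature-target dataset (illustrative only).
X, y = make_classification(
    n_samples=300, n_features=10, n_informative=3, n_redundant=0,
    random_state=0,
)

# Keep the features whose importance is at least the mean importance.
selector = SelectFromModel(
    RandomForestClassifier(n_estimators=50, random_state=0),
    threshold="mean",
)
selector.fit(X, y)
X_selected = selector.transform(X)

print(selector.get_support())
print(X_selected.shape)
```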

###### 4.2.1.3.2 Recursive feature elimination (`RFE`)

The `RFE` transformer class takes as input an estimator object and the number of features to select. 

The `fit` method
1. Takes as input a predictor array (and, optionally, a target array)
1. Proceeds through the algorithm below
1. The end result of the algorithm is a chosen subset of features

The `RFE` algorithm in brief:
1. Fit the estimator to the entire predictor array
1. Remove the feature(s) with the least weight 
1. If the number of features is equal to the number of features to select, then stop (otherwise continue)
1. Fit the remaining features to the estimator, creating new weights (for each of the remaining features)
1. Go to the second step
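
The loop above can be sketched by hand as follows. This is a simplified illustration rather than Scikit-learn's implementation: it assumes a linear model, uses the absolute value of `coef_` as the weight, and removes one feature per iteration. The function name is hypothetical.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression

def recursive_feature_elimination(X, y, n_features_to_select):
    """Return the column indices that survive the elimination loop."""
    remaining = list(range(X.shape[1]))
    while len(remaining) > n_features_to_select:
        # Fit the estimator to the features that are still in play.
        model = LinearRegression().fit(X[:, remaining], y)
        # Remove the feature with the least weight.
        weakest = int(np.argmin(np.abs(model.coef_)))
        del remaining[weakest]
    return remaining

# Synthetic data; `true_coef` is non-zero only for the informative features.
X, y, true_coef = make_regression(
    n_samples=200, n_features=8, n_informative=3, noise=1.0,
    coef=True, random_state=0,
)

selected = recursive_feature_elimination(X, y, n_features_to_select=3)
print(selected)
```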

The `transform` method
1. Takes as input a predictor array
1. Returns those features of the predictor array determined by the `fit` method

Scikit-learn documentation for the `RFE` class can be found at
- http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.RFE.html
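
A brief usage sketch of the `RFE` class itself, with an arbitrary linear estimator and synthetic data:

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

# Synthetic feature-target dataset (illustrative only).
X, y = make_regression(
    n_samples=200, n_features=8, n_informative=3, noise=1.0, random_state=0
)

# step=1 removes one feature per iteration.
selector = RFE(LinearRegression(), n_features_to_select=3, step=1)
selector.fit(X, y)
X_selected = selector.transform(X)

print(selector.support_)  # boolean mask of the selected features
print(selector.ranking_)  # selected features have rank 1
```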

##### 4.2.1.4 Prediction based

These techniques assess feature importance based on the ability (of those features) to decrease the error of a given model's predictions.

One approach (of this type) simply fits a model to every combination of features (an exhaustive search) and keeps the set of features for which the model produces the best predictions. This approach finds the optimal solution (set of features), but is rarely practical when the size of the original feature set is large. 
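
An exhaustive search can be sketched in a few lines when the feature set is tiny. The example below assumes a linear model, 3-fold cross-validated R² as the measure of prediction quality, and a synthetic dataset with only five features (31 subsets).

```python
from itertools import combinations

import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic feature-target dataset (illustrative only).
X, y = make_regression(
    n_samples=120, n_features=5, n_informative=2, noise=1.0, random_state=0
)

best_score, best_subset = -np.inf, None
n_features = X.shape[1]

# Try every non-empty combination of features.
for size in range(1, n_features + 1):
    for subset in combinations(range(n_features), size):
        score = cross_val_score(
            LinearRegression(), X[:, list(subset)], y, cv=3
        ).mean()
        if score > best_score:
            best_score, best_subset = score, subset

print(best_subset, best_score)
```

The number of subsets doubles with every added feature, which is why this approach does not scale.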

Three classes of techniques, _genetic algorithms_, _forward selection_ and _backward elimination_, find sub-optimal solutions (without an exhaustive search through all feature combinations). 
- Genetic algorithms try small and large changes to the feature set to search the space of feature combinations
- Forward selection incrementally adds features (to an initially empty feature set) that provide the greatest decrease in prediction error
- Backward elimination incrementally removes features (from the original feature set) whose removal provides the greatest decrease, or the smallest increase, in prediction error
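
Forward selection and backward elimination can be sketched with Scikit-learn's own `SequentialFeatureSelector` (available in Scikit-learn 0.24 and later), used here instead of the mlxtend class listed below. The estimator, dataset and number of features to select are illustrative assumptions; the selector scores each candidate feature set by cross-validation.

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

# Synthetic feature-target dataset (illustrative only).
X, y = make_regression(
    n_samples=200, n_features=8, n_informative=3, noise=1.0, random_state=0
)

# Forward selection: start empty, add the feature that helps most each round.
forward = SequentialFeatureSelector(
    LinearRegression(), n_features_to_select=3, direction="forward", cv=5
).fit(X, y)

# Backward elimination: start full, drop the feature that hurts least each round.
backward = SequentialFeatureSelector(
    LinearRegression(), n_features_to_select=3, direction="backward", cv=5
).fit(X, y)

print(forward.get_support())
print(backward.get_support())
```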

There are many libraries that implement these techniques. The following packages stand out based on completeness and clarity of their documentation.
- [Sequential Feature Selector](https://rasbt.github.io/mlxtend/user_guide/feature_selection/SequentialFeatureSelector/), by Sebastian Raschka, from [mlxtend documentation](https://rasbt.github.io/mlxtend/); see also this [mlxtend demo](https://www.kdnuggets.com/2018/06/step-forward-feature-selection-python.html)
- [Feature selection overview](https://topepo.github.io/caret/feature-selection-overview.html) of the [`caret` package in R](https://topepo.github.io/caret/index.html)

#### 4.2.2 Dimensionality reduction

Dimensionality reduction techniques _replace_ an original set of features with a smaller set of new features and so do not perform feature _selection_.
A few examples are:
- PCA (principal component analysis)
- NMF (non-negative matrix factorization)
- Autoencoders (a deep learning technique)
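
A small PCA sketch follows, assuming synthetic data in which six observed features are noisy mixes of two latent factors (the data and the choice of two components are illustrative). The output columns are new features, not a subset of the original ones.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 2))
# Six observed features, each a noisy linear mix of the two latent factors.
X = latent @ rng.normal(size=(2, 6)) + rng.normal(scale=0.05, size=(200, 6))

reducer = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA(n_components=2)),
])
X_reduced = reducer.fit_transform(X)

# Share of the (standardized) variance retained by the two components.
explained = reducer.named_steps["pca"].explained_variance_ratio_.sum()
print(X_reduced.shape, explained)
```

In an estimator pipeline these two steps would simply precede the final estimator, and `pca__n_components` could then be tuned like any other hyperparameter.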

## References

Workflows and pipelines
- [Architecting a Machine Learning Pipeline, by Semi Koen](https://towardsdatascience.com/architecting-a-machine-learning-pipeline-a847f094d1c7)
- [Evaluating Machine Learning Models, by Alice Zheng](https://learning.oreilly.com/library/view/evaluating-machine-learning/9781492048756/)
- [Scikit-learn Model selection: choosing estimators and their parameters](https://scikit-learn.org/stable/tutorial/statistical_inference/model_selection.html)
- [Scikit-learn 4.3 Preprocessing data](https://scikit-learn.org/stable/modules/preprocessing.html)

Train, validation and test datasets
- [What is the Difference Between Test and Validation Datasets? by Jason Brownlee](https://machinelearningmastery.com/difference-test-validation-datasets/)
- [Data Leakage in Machine Learning, by Jason Brownlee](https://machinelearningmastery.com/data-leakage-machine-learning/)
- [Leakage, by Data Skeptic](http://dataskeptic.com/blog/episodes/2016/leakage)
- [Leakage in Data Mining: Formulation, Detection, and Avoidance, by Kaufman, Rosset, and Perlich](https://www.cs.umb.edu/~ding/history/470_670_fall_2011/papers/cs670_Tran_PreferredPaper_LeakingInDataMining.pdf)
- [Avoiding Data Leakage in Machine Learning by Stephen Nawara, PhD.](https://conlanscientific.com/posts/category/blog/post/avoiding-data-leakage-machine-learning/)
- [Preparing and Cleaning Data for Machine Learning, by Dataquest Labs, Inc.](https://www.dataquest.io/blog/machine-learning-preparing-data/)

Feature selection and dimensionality reduction
- [Scikit-learn 1.13 Feature selection](https://scikit-learn.org/stable/modules/feature_selection.html)
- [Must-Know: Why it may be better to have fewer predictors in Machine Learning models?, by KDnuggets](https://www.kdnuggets.com/2017/04/must-know-fewer-predictors-machine-learning-models.html)
- [4 ways to implement feature selection in Python for machine learning](https://hub.packtpub.com/4-ways-implement-feature-selection-python-machine-learning/), by Sugandha Lahoti
- [A survey of dimensionality reduction techniques](https://arxiv.org/pdf/1403.2877.pdf), by  C.O.S. Sorzano, J. Vargas and A. Pascual‐Montano
- [An Introduction to Feature Selection](https://machinelearningmastery.com/an-introduction-to-feature-selection/), by Jason Brownlee
- [An Introduction to Variable and Feature Selection](http://jmlr.csail.mit.edu/papers/volume3/guyon03a/guyon03a.pdf), by Isabelle Guyon and Andre Elisseeff
- [Dimensionality Reduction: A Comparative Review](https://lvdmaaten.github.io/publications/papers/TR_Dimensionality_Reduction_Review_2009.pdf), by Laurens van der Maaten, Eric Postma and Jaap van den Herik
- [Discover Feature Engineering, How to Engineer Features and How to Get Good at It](https://machinelearningmastery.com/discover-feature-engineering-how-to-engineer-features-and-how-to-get-good-at-it/), by Jason Brownlee
- [Feature Selection For Machine Learning in Python](https://machinelearningmastery.com/feature-selection-machine-learning-python/), by Jason Brownlee
- [Feature Selection to Improve Accuracy and Decrease Training Time](https://machinelearningmastery.com/feature-selection-to-improve-accuracy-and-decrease-training-time/), by Jason Brownlee
- `FeatureSelection` class: [Blog](https://towardsdatascience.com/a-feature-selection-tool-for-machine-learning-in-python-b64dd23710f0), 
  [GitHub](https://github.com/WillKoehrsen/feature-selector)
- [Genetic algorithms for feature selection in Data Analytics](https://www.neuraldesigner.com/blog/genetic_algorithms_for_feature_selection), by Fernando Gómez and Alberto Quesada
- [How to use Python to select the right variables for data science](https://www.dummies.com/programming/big-data/data-science/how-to-use-python-to-select-the-right-variables-for-data-science/), by John Paul Mueller, Luca Massaron
- [Introduction to Feature Selection methods with an example (or how to select the right variables?)](https://www.analyticsvidhya.com/blog/2016/12/introduction-to-feature-selection-methods-with-an-example-or-how-to-select-the-right-variables/), by Saurav Kaushik
- Quora: How do I perform feature selection? [Olivier Grisel, contributor to the scikit-learn project](https://www.quora.com/How-do-I-perform-feature-selection)
- [Review on wrapper feature selection approaches](https://ieeexplore.ieee.org/document/7745366/), by Naoual El Aboudi and Laila Benhlima
- [The `caret` Package](https://topepo.github.io/caret/index.html), by Max Kuhn
