# Notes on Chapter 6: Best Practices for Model Evaluation and Hyperparameter Tuning

So far, Raschka has introduced us to a world of different models for classifying data and compressing features. But how do we know how well any given model is performing, and how can we improve its performance when it falls short?

In Chapter 6, we explore many different techniques for improving model performance. Broadly, our focus includes:

- Estimating model performance
- Diagnosing common problems
- Fine-tuning models by adjusting hyperparameters
- Getting familiar with different performance metrics

The specific techniques we'll cover are:

1. [Data-processing pipelines](#Data-processing-pipelines): chaining algorithms together
2. Cross-validation: robust measures of performance
3. Learning and validation curves: measuring bias and variance
4. Grid search: tuning hyperparameters
5. Nested cross-validation: selecting good algorithms
6. Performance metrics: different ways of judging "good" and "bad" models

## Data-processing pipelines

Many preprocessing techniques (like PCA) learn parameters from the training data that must then be reused, unchanged, on every validation and test set in order to produce sensible results. To help standardize this procedure, we can build **pipelines** that record our transformation steps and allow us to reuse them across training, testing, and validation sets (or even on new samples from the same population). Pipelines also encourage an object-oriented approach to model-building.

In scikit-learn, we can use the [`Pipeline`](http://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html) class to construct a pipeline:

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

pipeline = Pipeline([('scaler', StandardScaler()),
                     ('clf', LogisticRegression())])

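Note that a `Pipeline` exposes the familiar estimator interface (`fit()`, `predict()`, `score()`), but every step before the final one must be a transformer, that is, it must implement both `fit()` and `transform()`. Calling `fit()` on the pipeline fits each intermediate step and transforms the data with it in turn, then fits the final estimator on the result. Calling `predict()` applies the already-fitted transformations before handing the data to the final estimator.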
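As a minimal sketch of how such a pipeline is used end to end, the example below fits it on a training split and scores it on a test split. The dataset (`load_breast_cancer`, bundled with scikit-learn), the 80/20 split, and the `random_state` value are assumptions chosen for illustration and do not come from the notes above.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: a small binary-classification dataset bundled with scikit-learn.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=1)

pipeline = Pipeline([('scaler', StandardScaler()),
                     ('clf', LogisticRegression())])

# fit() learns the scaling parameters from the training data only,
# then trains the classifier on the scaled training data.
pipeline.fit(X_train, y_train)

# predict() and score() reuse the scaling parameters learned above.
y_pred = pipeline.predict(X_test)
test_accuracy = pipeline.score(X_test, y_test)
print(f'Test accuracy: {test_accuracy:.3f}')
```

Because the scaler lives inside the pipeline, there is no way to accidentally refit it on the test data.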

## Cross-validation

Raschka covers two types of **cross-validation**, a family of techniques for measuring how well a model generalizes: **holdout** and **k-fold** cross-validation.

Note that both techniques are ways of partitioning a dataset into training, validation, and testing sets; we still need a specific metric to measure the model's performance on those sets. Cross-validation doesn't describe the metric used to quantify a model's performance. It simply describes how to partition a dataset so that we get the most mileage out of our metric.

### Holdout cross-validation

**Holdout** is the most basic kind of cross-validation. The intuition is that we split a dataset into three partitions: **training, testing, and validation**. We use the training dataset for training the model; the validation dataset for getting feedback on how well our model is performing on data it hasn't seen before, and iterating on it; and the testing set for verifying our performance once we're confident our model is performing well.

Why use two different evaluation partitions (validation and testing)? Well, we need to have a validation partition to get feedback on how our model is doing, but after iterating many times we can't rely on this partition to accurately measure the model's performance. By testing against the validation partition over and over, we actually incorporate it into our model's training: not by literally learning parameters from the validation dataset, but rather by *tuning* the model to fit it through the human act of model selection. In order to get a final, unbiased estimate of our model's performance, we need to reserve ("hold out") one partition outside of the training process completely.
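A three-way holdout split can be sketched with two calls to scikit-learn's `train_test_split`. The dataset, the 60/20/20 proportions, and the `random_state` value are assumptions made for illustration only.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

# Assumption: a small dataset bundled with scikit-learn.
X, y = load_breast_cancer(return_X_y=True)

# First hold out the test set (20% of all samples).
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=1)

# Then split the remainder into training and validation sets
# (25% of the remaining 80% is roughly 20% of all samples).
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, stratify=y_rest, random_state=1)

print(len(X_train), len(X_val), len(X_test))
```

The model is fit on `X_train`, tuned against `X_val`, and scored on `X_test` only once at the very end.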

### $k$-fold cross-validation

Holdout cross-validation has a major drawback: the randomness of the partitioning can have unintended consequences for measuring a model's generalizability. If we accidentally partition the dataset in a strange way, such that (for example) most samples of class A end up in the training partition and most samples of class B end up in the testing partition, we're going to get mysteriously poor results.

**$k$-fold cross-validation** attempts to overcome this drawback by repeating the holdout method $k$ times on different partitions of the training data, retraining the model and measuring performance each time in order to get a good sense of the "average" performance of the model. We first split the training data, without replacement, into $k$ disjoint subsets (or **"folds"**). In each of $k$ iterations, we train the model on $k-1$ of the folds and validate it on the one remaining fold, so that every fold serves as the validation set exactly once. We save the performance metric from each iteration and average all $k$ values at the end.
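The loop behind $k$-fold can be written out by hand with scikit-learn's `KFold` splitter. This is a sketch under assumptions: the dataset, the choice of $k = 10$, and the scaler-plus-logistic-regression model are illustrative, and for brevity the whole dataset stands in for the training partition.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: this data plays the role of the training partition.
X, y = load_breast_cancer(return_X_y=True)

kfold = KFold(n_splits=10, shuffle=True, random_state=1)
scores = []
for train_idx, val_idx in kfold.split(X):
    # A fresh model per iteration, trained on k-1 folds...
    model = make_pipeline(StandardScaler(), LogisticRegression())
    model.fit(X[train_idx], y[train_idx])
    # ...and validated on the single remaining fold.
    scores.append(model.score(X[val_idx], y[val_idx]))

mean_score = np.mean(scores)
print(f'Mean accuracy over {len(scores)} folds: {mean_score:.3f}')
```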

Optionally, we can use a slightly modified form known as **stratified $k$-fold cross-validation** to further improve the robustness of our final average. In stratified $k$-fold, the folds are built so that each one preserves the **class proportions** of the full training set. Every fold is therefore representative of the overall class distribution, which yields more reliable estimates when the classes are imbalanced.
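A small sketch of what stratification does in scikit-learn's `StratifiedKFold`: the splitter builds folds that keep the class proportions of the full label array. The labels below are hypothetical, constructed only to make the class imbalance easy to see.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Hypothetical imbalanced labels: 90 samples of class 0, 10 of class 1.
y = np.array([0] * 90 + [1] * 10)
X = np.zeros((len(y), 1))  # feature values do not affect how folds are built

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
val_fractions = []
for train_idx, val_idx in skf.split(X, y):
    # Share of class-1 samples in the training and validation parts of this split.
    train_share = y[train_idx].mean()
    val_share = y[val_idx].mean()
    val_fractions.append(val_share)
    print(f'class-1 share: train={train_share:.2f}, val={val_share:.2f}')
```

Every split reports the same class-1 share as the full label array, which a plain random split would not guarantee.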

Note that $k$-fold cross-validation only splits the data into training and validation sets on each iteration, so we need to first partition our dataset into training and test sets and then perform the cross-validation only on the training dataset. The test set stays untouched until the final evaluation.
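One way to sketch this workflow is with scikit-learn's `cross_val_score` helper, which runs the $k$-fold loop for us. The dataset, the 80/20 split, `cv=10`, and the model are assumptions for illustration. For classifiers, passing an integer `cv` makes scikit-learn use stratified folds.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: a small dataset bundled with scikit-learn.
X, y = load_breast_cancer(return_X_y=True)

# Set the test partition aside before any cross-validation happens.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=1)

model = make_pipeline(StandardScaler(), LogisticRegression())

# Cross-validate on the training partition only.
cv_scores = cross_val_score(model, X_train, y_train, cv=10)
print(f'CV accuracy: {cv_scores.mean():.3f} +/- {cv_scores.std():.3f}')

# Once satisfied, refit on the whole training partition
# and evaluate a single time on the held-out test set.
model.fit(X_train, y_train)
test_accuracy = model.score(X_test, y_test)
print(f'Test accuracy: {test_accuracy:.3f}')
```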