## CHAPTER 11
---
# MODEL EVALUATION

---
- In this chapter we will examine strategies for evaluating the quality of models created through our learning algorithms. 
- It might appear strange to discuss model evaluation before discussing how to create models, but there is a method to our madness. 
- Models are only as useful as the quality of their predictions, and thus fundamentally our goal is not to create models (which is easy) but to create high-quality models (which is hard). 
- Therefore, before we explore the myriad learning algorithms, we first set up how we can evaluate the models they produce.

## 11.1 Cross-Validating Models
**Problem:** we want to evaluate how well our model will work in the real world.

**Solution:** we will create a pipeline that
- preprocesses the data, 
- trains the model, and then 
- evaluates it using cross-validation

In [1]:
# Load libraries
from sklearn import datasets
from sklearn import metrics
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# load digits dataset
digits = datasets.load_digits()

# create features matrix
features = digits.data

# create target vector
target = digits.target

# create standardizer
standardizer = StandardScaler()

# create logistic regression object
logit = LogisticRegression()

# create a pipeline that standardizes, then runs logistic regression
pipeline = make_pipeline(standardizer, logit)

# create k-fold cross-validation
kf = KFold(n_splits=10, shuffle=True, random_state=1)

# conduct k-fold cross-validation
cv_results = cross_val_score(pipeline, # Pipeline
                             features, # feature matrix
                             target, # target vector
                             cv=kf, # cross-validation technique
                             scoring="accuracy", # performance metric
                             n_jobs=-1) # use all CPU cores

# calculate mean
cv_results.mean()

0.9693916821849783

### Discussion:
- Our goal is to evaluate how well our model does on data it has never seen before (e.g., a new customer, a new crime, a new image). 
- **The validation approach:**
    - split data into training set and test set
    - set the test set aside and pretend it's never been seen before
    - train the model on the training set and teach it how to make the best predictions
    - evaluate the model on the testing set and see how it does
- The validation approach has two major weaknesses:
    - First, the performance of the model can be highly dependent on which few observations were selected for the test set.
    - Second, the model is not being trained using all the available data, and it is not being evaluated on all the available data.
- **The k-fold cross-validation (KFCV) strategy:**
    - data is split into k parts, called "*folds*"
    - the model is trained using k-1 folds, combined as a training set
    - the last fold is used as a test set
    - this is repeated k times, each time using a different fold as the test set
    - the performance of the model across the k iterations is then averaged to produce an overall measurement
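
The KFCV steps above can be written out as an explicit loop. This is a minimal sketch of the idea, not a reproduction of what `cross_val_score` does internally. It assumes the same digits data and a standardize-then-logistic-regression pipeline; the names `X`, `y`, `model`, and `fold_scores`, the choice of 5 folds, and `max_iter=1000` (set only to avoid convergence warnings) are assumptions made for illustration.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Same digits data as in the recipe
X, y = load_digits(return_X_y=True)

# max_iter is an assumption here, raised only to avoid convergence warnings
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Split the data into k = 5 folds
kf = KFold(n_splits=5, shuffle=True, random_state=1)

fold_scores = []
for train_idx, test_idx in kf.split(X):
    # Fresh, unfitted copy so no state carries over between folds
    fold_model = clone(model)
    # Train on the k-1 folds that form the training set
    fold_model.fit(X[train_idx], y[train_idx])
    # Evaluate accuracy on the single held-out fold
    fold_scores.append(fold_model.score(X[test_idx], y[test_idx]))

# Average the k per-fold scores into one overall measurement
mean_score = np.mean(fold_scores)
print(fold_scores, mean_score)
```

Every observation ends up in the test fold exactly once, so the averaged score uses all of the data for both training and evaluation.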

In [2]:
cv_results

array([0.97777778, 0.98888889, 0.96111111, 0.94444444, 0.97777778,
       0.98333333, 0.95555556, 0.98882682, 0.97765363, 0.93854749])

- KFCV assumes that each observation was created independently of the others (i.e., the data is independent and identically distributed). If so, it is a good idea to shuffle observations when assigning them to folds, which can be done in scikit-learn by setting shuffle=True.
- When using KFCV to evaluate a classifier, it is often beneficial to perform *stratified k-fold*, in which each fold contains roughly the same proportion of observations from each target class. In scikit-learn this is done by replacing the KFold class with StratifiedKFold.
- When using validation sets or cross-validation, it is important to fit preprocessing steps on the training set only and then apply those transformations to both the training and test sets. There are two reasons:
    - we are pretending that the test set is unknown data
    - it prevents the leaking of information from test set into the training set
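
To illustrate the stratified k-fold point above, the following sketch counts the classes that land in each test fold. It uses the iris data rather than digits, an assumption made because iris has exactly 50 observations per class, which makes the balancing easy to see. The variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold

# Iris has three classes with 50 observations each
X, y = load_iris(return_X_y=True)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)

# Unlike KFold.split, StratifiedKFold.split needs the target vector
# so that it can balance the classes across folds
fold_class_counts = [np.bincount(y[test_idx]) for _, test_idx in skf.split(X, y)]

for counts in fold_class_counts:
    print(counts)  # class counts within one test fold
```

Each test fold holds 10 observations of every class, mirroring the class proportions of the full data set.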

In [4]:
# Import library
from sklearn.model_selection import train_test_split

# Create training and test sets
features_train, features_test, target_train, target_test = train_test_split(
    features, target, test_size=0.1, random_state=1)

# Fit standardizer to training set
standardizer.fit(features_train)

# Apply to both training and test sets
features_train_std = standardizer.transform(features_train)
features_test_std = standardizer.transform(features_test)

# Create a pipeline (during cross-validation the standardizer is refit on each fold's training portion)
pipeline = make_pipeline(standardizer, logit)

# Do k-fold cross-validation
cv_results = cross_val_score(pipeline, # Pipeline
                             features, # Feature matrix
                             target, # Target vector
                             cv=kf, # Cross-validation technique
                             scoring="accuracy", # Performance metric
                             n_jobs=-1) # Use all CPU cores
cv_results

array([0.97777778, 0.98888889, 0.96111111, 0.94444444, 0.97777778,
       0.98333333, 0.95555556, 0.98882682, 0.97765363, 0.93854749])
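
The same leak-free preprocessing can be applied to a single held-out test set. The sketch below is one way to do it, assuming the digits data and the same 90/10 split as above. The names `holdout_pipeline`, `X_train`, and `X_test`, and `max_iter=1000`, are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, random_state=1)

# max_iter is an assumption here, raised only to avoid convergence warnings
holdout_pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# fit() learns the scaling parameters and the model from the training set only
holdout_pipeline.fit(X_train, y_train)

# score() applies the already-fitted scaler to the test set, then predicts
test_accuracy = holdout_pipeline.score(X_test, y_test)

# make_pipeline names each step after its lowercased class name
scaler = holdout_pipeline.named_steps["standardscaler"]

# The scaler's means match the training rows, not the full data set
print(np.allclose(scaler.mean_, X_train.mean(axis=0)), test_accuracy)
```

Because the scaler lives inside the pipeline, the test set is never used when fitting it.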

cross_val_score parameters:
- cv determines our cross-validation technique. K-fold is the most common by far.
- the scoring parameter defines our metric for success (accuracy in the example above)
- n_jobs=-1 tells scikit-learn to use every core of the computer available to speed up the operation
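
As a small illustration of these parameters, the sketch below passes an integer to `cv` and switches the `scoring` string. It uses the iris data and a freshly built pipeline, both assumptions chosen to keep the example quick. `"f1_macro"` is one of scikit-learn's built-in scorer names.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# An integer cv asks for that many folds; for a classifier,
# scikit-learn uses stratified folds in this case
acc = cross_val_score(model, X, y, cv=5, scoring="accuracy")

# Same folds and model, but scored with the macro-averaged F1 metric
f1 = cross_val_score(model, X, y, cv=5, scoring="f1_macro")

print(acc.mean(), f1.mean())
```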

## 11.2 Creating a Baseline Regression Model

**Problem:** we want a simple baseline regression model to compare against our model.

**Solution:** use scikit-learn’s DummyRegressor to create a simple model to use as a baseline

In [6]:
# Load libraries
# Note: load_boston was removed in scikit-learn 1.2, so this cell needs an older release
from sklearn.datasets import load_boston
from sklearn.dummy import DummyRegressor
from sklearn.model_selection import train_test_split

# Load data
boston = load_boston()

# Create features
features, target = boston.data, boston.target

# Make test and training split
features_train, features_test, target_train, target_test = train_test_split(
    features, target, random_state=0)

# Create a dummy regressor
dummy = DummyRegressor(strategy='mean')

# "Train" dummy regressor
dummy.fit(features_train, target_train)

# Get R-squared score
dummy.score(features_test, target_test)

-0.001119359203955339

To compare, we train our model and evaluate its performance score:

In [7]:
# Load library
from sklearn.linear_model import LinearRegression

# Train simple linear regression model
ols = LinearRegression()
ols.fit(features_train, target_train)

# Get R-squared score
ols.score(features_test, target_test)

0.6354638433202129
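
The dummy regressor's R-squared of roughly zero follows from how the score is defined: R² = 1 - SS_res / SS_tot, where SS_tot measures the error of always predicting the mean of the evaluated targets. The sketch below works this through in plain Python on hypothetical toy targets (not the Boston data). The helper `r_squared` is written here for illustration and is not a scikit-learn function.

```python
def r_squared(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    # Residual sum of squares: error of the predictions
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Total sum of squares: error of predicting the mean of y_true
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

# Hypothetical toy targets, chosen only for illustration
y_train = [3.0, 5.0, 7.0]
y_test = [4.0, 6.0, 11.0]

train_mean = sum(y_train) / len(y_train)  # what a mean-strategy baseline predicts
test_mean = sum(y_test) / len(y_test)

# Baseline evaluated on the test set: predicts the training mean everywhere
r2_train_mean = r_squared(y_test, [train_mean] * len(y_test))

# Predicting the test set's own mean gives exactly zero
r2_test_mean = r_squared(y_test, [test_mean] * len(y_test))

print(r2_train_mean, r2_test_mean)
```

A constant prediction can never score above zero, and it scores exactly zero only when it equals the mean of the evaluated targets. The slightly negative baseline score above is consistent with a training mean that differs a little from the test mean.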

## 11.3 Creating a Baseline Classification Model

**Problem:** we want a simple baseline classifier to compare against our model.

**Solution:** use scikit-learn’s DummyClassifier

In [11]:
# Load libraries
from sklearn.datasets import load_iris
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import train_test_split

# Load data
iris = load_iris()

# Create target vector and feature matrix
features, target = iris.data, iris.target

# Split into training and test set
features_train, features_test, target_train, target_test = train_test_split(
    features, target, random_state=0)

# Create dummy classifier
dummy = DummyClassifier(strategy='uniform', random_state=1)

# "Train" model
dummy.fit(features_train, target_train)

# Get accuracy score
dummy.score(features_test, target_test)

0.42105263157894735

By comparing the baseline classifier to our trained classifier, we can see the improvement:

In [10]:
# Load library
from sklearn.ensemble import RandomForestClassifier

# Create classifier
classifier = RandomForestClassifier()

# Train model
classifier.fit(features_train, target_train)

# Get accuracy score
classifier.score(features_test, target_test)

0.9736842105263158

- A common measure of a classifier’s performance is how much better it is than random guessing. scikit-learn’s DummyClassifier makes this comparison easy. 
- The strategy parameter gives us a number of options for generating values. Two particularly useful strategies are:
    - *stratified* makes predictions that are proportional to the training set’s target vector’s class proportions 
    - *uniform* will generate predictions uniformly at random between the different classes. 
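
The strategies above can be compared side by side. The sketch below reuses the iris split from this recipe and also tries `most_frequent`, an additional strategy that scikit-learn offers and that is not discussed above. The dictionary name `baseline_scores` is an assumption for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

baseline_scores = {}
for strategy in ["stratified", "uniform", "most_frequent"]:
    # random_state only matters for the strategies that draw random predictions
    dummy = DummyClassifier(strategy=strategy, random_state=1)
    # "Training" only records information about the target, not the features
    dummy.fit(X_train, y_train)
    # Accuracy of the baseline on the held-out test set
    baseline_scores[strategy] = dummy.score(X_test, y_test)

print(baseline_scores)
```

A trained classifier that cannot clearly beat all of these baselines is probably not learning anything useful from the features.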