Run the cell below to import the required packages:

In [24]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

from sklearn.utils import shuffle
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import KFold
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Ridge
from sklearn.linear_model import Lasso
from sklearn.metrics import mean_squared_error

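# Note: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2,
# so this import requires an older scikit-learn release
from sklearn.datasets import load_boston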

## Training and Testing
---
<a class="anchor" id="train"></a>

The lessons above were meant to get you comfortable with different types of machine learning algorithms. In practice, however, we would never use our entire dataset to train a model. Instead, we use a portion of the data, the training set, to train the model, and then evaluate the model's performance on the remaining portion, the testing set. Since the testing set was not used to train the model, it gives us a more honest indication of how well the model predicts new data it hasn't encountered before. There are a few techniques for training and testing. The first one is the less computationally intensive.

### Technique 1: Train/Validate/Test

First, a couple of vocab words:

**Training Dataset**: The sample of data used to fit the model. The model sees and learns from this data.

**Validation Dataset**: The sample of data used to evaluate a model fit on the training dataset while tuning model hyperparameters. We evaluate on this set frequently and use the results to fine-tune the hyperparameters. The model therefore occasionally sees this data but never “learns” from it directly. Because the validation results steer our hyperparameter choices, the validation set does affect the model, but only indirectly.

**Test Dataset**: The sample of data used to provide an unbiased evaluation of the final model. The test dataset provides the gold standard for evaluating the model and is only used once the model is completely trained (using the train and validation sets). The test set is generally what is used to compare competing models. For example, in many Kaggle competitions the validation set is released initially along with the training set, the actual test set is only released when the competition is about to close, and the result of the model on the test set decides the winner.

<img src="images/train.png" width="400">

Side note: there is no hard and fast rule about how to proportion your data. Just know that your model is limited in what it can learn if you limit the data you feed it. However, if your test set is too small, it won’t provide an accurate estimate of how your model will perform.
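The three datasets defined above can be sketched as three non-overlapping sets of row indices. This is only an illustration: the helper name `three_way_split`, the 60/20/20 proportions, and the seed are assumptions made for the example, not a rule.

```python
import numpy as np

def three_way_split(n_samples, val_frac=0.2, test_frac=0.2, seed=0):
    """Return shuffled, non-overlapping index arrays for train/validation/test."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(n_samples)

    n_test = int(round(test_frac * n_samples))
    n_val = int(round(val_frac * n_samples))

    test_idx = indices[:n_test]
    val_idx = indices[n_test:n_test + n_val]
    train_idx = indices[n_test + n_val:]   # everything that is left over
    return train_idx, val_idx, test_idx

# 506 is the number of rows in the Boston dataset used in this lesson
train_idx, val_idx, test_idx = three_way_split(506)
print(len(train_idx), len(val_idx), len(test_idx))
```

Every row lands in exactly one of the three sets, which is what keeps the test estimate honest.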

### Boston Example

Scikit-learn has many built-in datasets that you can use. Let's load the Boston dataset of housing information and housing prices. When you first load the data, it's a bit hard to understand what's going on:

In [5]:
boston = load_boston()
boston.data

array([[6.3200e-03, 1.8000e+01, 2.3100e+00, ..., 1.5300e+01, 3.9690e+02,
        4.9800e+00],
       [2.7310e-02, 0.0000e+00, 7.0700e+00, ..., 1.7800e+01, 3.9690e+02,
        9.1400e+00],
       [2.7290e-02, 0.0000e+00, 7.0700e+00, ..., 1.7800e+01, 3.9283e+02,
        4.0300e+00],
       ...,
       [6.0760e-02, 0.0000e+00, 1.1930e+01, ..., 2.1000e+01, 3.9690e+02,
        5.6400e+00],
       [1.0959e-01, 0.0000e+00, 1.1930e+01, ..., 2.1000e+01, 3.9345e+02,
        6.4800e+00],
       [4.7410e-02, 0.0000e+00, 1.1930e+01, ..., 2.1000e+01, 3.9690e+02,
        7.8800e+00]])

However, we can read the docs to understand better:

In [6]:
print(boston.DESCR)
print(load_boston.__doc__)

.. _boston_dataset:

Boston house prices dataset
---------------------------

**Data Set Characteristics:**  

    :Number of Instances: 506 

    :Number of Attributes: 13 numeric/categorical predictive. Median Value (attribute 14) is usually the target.

    :Attribute Information (in order):
        - CRIM     per capita crime rate by town
        - ZN       proportion of residential land zoned for lots over 25,000 sq.ft.
        - INDUS    proportion of non-retail business acres per town
        - CHAS     Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
        - NOX      nitric oxides concentration (parts per 10 million)
        - RM       average number of rooms per dwelling
        - AGE      proportion of owner-occupied units built prior to 1940
        - DIS      weighted distances to five Boston employment centres
        - RAD      index of accessibility to radial highways
        - TAX      full-value property-tax rate per $10,000
        - PTRATIO  pu

We find that all of the predictor variables are contained in `boston.data` and the target variable (housing price) is contained in `boston.target`:

In [19]:
X = boston.data
y = boston.target
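A raw array is easier to inspect once it is wrapped in a labeled table. The sketch below assumes `load_diabetes` as a stand-in dataset, because it ships with current scikit-learn releases whereas `load_boston` was removed in version 1.2. The object it returns exposes the same `data`, `target`, and `feature_names` attributes used in this lesson.

```python
import pandas as pd
from sklearn.datasets import load_diabetes

# Stand-in dataset (an assumption for this sketch, not the Boston data)
dataset = load_diabetes()

# Label each column with its feature name and append the target
df = pd.DataFrame(dataset.data, columns=dataset.feature_names)
df['target'] = dataset.target

print(df.shape)
print(df.head())
```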

### Simple Test/Train sets
Let's omit the validation set for now and focus on splitting into training and testing sets. If we save 30% for the testing set and run a simple linear regression, we get the following results:

In [29]:
lr2 = LinearRegression()

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

# Fit the model against the training data
lr2.fit(X_train, y_train)

# Evaluate the model against the testing data
print('Train:', lr2.score(X_train, y_train))
print('Test:', lr2.score(X_test, y_test))

Train: 0.7609419100141347
Test: 0.6855557785812373


Run the above cell multiple times and notice the variation in the scores, since each run draws a different random split. Pay attention to the R^2 value on the test (hold-out) set: model performance is usually a little lower on the test set than on the training set. This is expected. In fact, the lower value is a much more accurate number to report as "real world" performance.
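Instead of re-running the cell by hand, the split-to-split variation can be collected in a loop. This is a minimal sketch on synthetic data from `make_regression` (an assumption, so that it runs without the Boston data). The variable names and the choice of ten seeds are illustrative.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption, not the Boston data)
X_demo, y_demo = make_regression(n_samples=200, n_features=5,
                                 noise=25.0, random_state=0)

test_scores = []
for seed in range(10):
    # a different random_state gives a different random split
    X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo,
                                              test_size=0.3,
                                              random_state=seed)
    model = LinearRegression().fit(X_tr, y_tr)
    test_scores.append(model.score(X_te, y_te))

print('min test R^2: ', round(min(test_scores), 3))
print('max test R^2: ', round(max(test_scores), 3))
print('mean test R^2:', round(float(np.mean(test_scores)), 3))
```

The spread between the minimum and maximum shows how much a single train/test split depends on which rows happen to land in the test set.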

### Technique 2: Cross Validation

Cross validation holds out a certain percentage of the dataset as test data, and then repeats this multiple times with a different portion held out each time.

One round of cross-validation involves partitioning a sample of data into complementary subsets, performing the analysis on one subset (the training set) and validating the analysis on the other subset. To reduce variability, in most methods multiple rounds of cross-validation are performed using different partitions, and the validation results are combined (e.g. averaged) over the rounds to give an estimate of the model’s predictive performance.

The upside of this method is that it avoids unluckily sampling one unrepresentative set of test data. The downside is that it is computationally intensive, since the model is refit once per round. This method works great on small to medium-sized datasets. This is absolutely not the kind of thing you’d want to try on a massive dataset.
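The partition-and-rotate procedure described above can be sketched by hand with NumPy. This is an illustration under stated assumptions: the helper `k_fold_mse`, the synthetic data, and the use of ordinary least squares via `np.linalg.lstsq` are choices made for the example, not scikit-learn's implementation.

```python
import numpy as np

def k_fold_mse(X, y, k=5, seed=0):
    """Held-out MSE of ordinary least squares for each of k folds."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(len(y))
    folds = np.array_split(indices, k)   # k roughly equal, disjoint index sets

    fold_mse = []
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])

        # prepend a column of ones so the fit includes an intercept
        A_train = np.column_stack([np.ones(len(train_idx)), X[train_idx]])
        A_test = np.column_stack([np.ones(len(test_idx)), X[test_idx]])

        coef, *_ = np.linalg.lstsq(A_train, y[train_idx], rcond=None)
        residuals = y[test_idx] - A_test @ coef
        fold_mse.append(float(np.mean(residuals ** 2)))
    return fold_mse

# Synthetic data: a linear signal plus noise
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(100, 3))
y_demo = X_demo @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.5, size=100)

fold_mse = k_fold_mse(X_demo, y_demo, k=5)
print(fold_mse)
print('Average MSE:', np.mean(fold_mse))
```

Each observation is used for testing exactly once and for training in the other k-1 rounds, and the model is fit k times, which is where the extra computational cost comes from.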

<img src="images/train3.png" width="400">

Not surprisingly, scikit-learn can help us do this using a function called `cross_val_score`:

In [31]:
lr = LinearRegression()

scores = cross_val_score(lr, X, y, cv=5, scoring='neg_mean_squared_error')

# scores output is negative, which is because
# Scikit-learn uses negative mean squared error so that 
# scores always improve with higher values 
print(-scores)
print('Average MSE: ', np.mean(-scores))

[12.46030057 26.04862111 33.07413798 80.76237112 33.31360656]
Average MSE:  37.13180746769922


Advantages of a **train/test split:**

- Runs K times faster than K-fold cross-validation
- Simpler to examine the detailed results of the testing process

Advantages of **cross validation:**

- More accurate estimate of out-of-sample accuracy
- More "efficient" use of data (every observation is used for both training and testing)

Recommendations for **cross validation:**

- K can be any number, but K=4 or 5 is common

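- For classification problems, each response class should be represented with equal proportions in each of the K folds

- Scikit-learn's `cross_val_score` function does this by default when the estimator is a classifier; for regressors, such as the models in this lesson, it uses plain, unshuffled K folds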
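The recommendation about class proportions applies to classification. When `cv` is an integer, `cross_val_score` uses `StratifiedKFold` for classifiers with binary or multiclass targets and `KFold` otherwise. The sketch below compares the two splitters on toy labels (an assumption made for illustration) by counting how many positive samples land in each test fold.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Toy labels: 80 samples of class 0 followed by 20 samples of class 1
y_demo = np.array([0] * 80 + [1] * 20)
X_demo = np.zeros((100, 1))   # feature values are irrelevant to the split itself

def positives_per_fold(splitter):
    """Count class-1 samples in each test fold produced by the splitter."""
    return [int(y_demo[test_idx].sum())
            for _, test_idx in splitter.split(X_demo, y_demo)]

plain_counts = positives_per_fold(KFold(n_splits=5))
stratified_counts = positives_per_fold(StratifiedKFold(n_splits=5))

print('KFold:          ', plain_counts)        # [0, 0, 0, 0, 20]
print('StratifiedKFold:', stratified_counts)   # [4, 4, 4, 4, 4]
```

With unshuffled `KFold`, four of the five test folds contain no positive samples at all, which is exactly the situation stratification is meant to avoid.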

### Using Validation Sets to Tune Hyperparameters: Train/Test Split
In the above linear regression model, we were not tuning any hyperparameters. In the Ridge and Lasso examples below, we would like to tune alpha, so we will create a validation set.

To create a validation set, we'll need to use `train_test_split` twice, once to separate the test set and then again to separate the validation set from the remaining data. We'll separate the data into 60% training, 20% validation, and 20% testing. Note that the second split takes 25% of the remaining 80%, which is 20% of the whole dataset:

In [9]:
# intermediate/test split (gives us test set)
X_intermediate, X_test, y_intermediate, y_test = train_test_split(X, 
                                                                  y, 
                                                                  shuffle=True,
                                                                  test_size=0.2, 
                                                                  random_state=15)

# train/validation split (gives us train and validation sets)
X_train, X_validation, y_train, y_validation = train_test_split(X_intermediate,
                                                                y_intermediate,
                                                                shuffle=False,
                                                                test_size=0.25,
                                                                random_state=2018)
# delete intermediate variables
del X_intermediate, y_intermediate

# print proportions (formatted as percentages of the full dataset)
print('train: {:.0%} | validation: {:.0%} | test {:.0%}'.format(len(y_train)/len(y),
                                                                len(y_validation)/len(y),
                                                                len(y_test)/len(y)))

train: 60% | validation: 20% | test 20%
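Since the next step tunes a regularization strength, here is a minimal sketch of what that strength does to a ridge model. It assumes the usual ridge objective, which minimizes the squared error plus `alpha` times the squared norm of the coefficients, and solves it in closed form on centered synthetic data. The helper `ridge_coefficients` and the data are illustrative assumptions, not scikit-learn's solver.

```python
import numpy as np

def ridge_coefficients(X, y, alpha):
    """Closed-form ridge solution on centered data (centering handles the intercept)."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    n_features = X.shape[1]
    # solve (Xc^T Xc + alpha * I) w = Xc^T yc
    return np.linalg.solve(Xc.T @ Xc + alpha * np.eye(n_features), Xc.T @ yc)

# Synthetic data: a linear signal plus noise
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 4))
y_demo = X_demo @ np.array([3.0, -2.0, 0.0, 1.0]) + rng.normal(size=50)

for alpha in [0.001, 0.01, 0.1, 1, 10, 100]:
    coef = ridge_coefficients(X_demo, y_demo, alpha)
    print('alpha: {:7} | coefficient norm: {:.4f}'.format(alpha, np.linalg.norm(coef)))
```

The coefficient norm shrinks as alpha grows: a larger penalty pulls the coefficients toward zero, trading a little extra bias for lower variance.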


Recall that regularization is a form of constrained optimization that imposes limits on the model parameters. It effectively allows us to add bias to a model that’s overfitting (overfitting occurs when your model is so specific to your training dataset that it does worse at predicting your test data). We can control the amount of bias with a hyperparameter called lambda (named `alpha` in scikit-learn, since `lambda` is a reserved word in Python) that defines the regularization strength. Let's see below how we can test different values of alpha against our validation set in order to get the best hyperparameter:

In [10]:
alphas = [0.001, 0.01, 0.1, 1, 10]
print('Mean Squared Error')
print('-'*76)
for alpha in alphas:
    # instantiate and fit model
    ridge = Ridge(alpha=alpha, fit_intercept=True, random_state=99)
    ridge.fit(X_train, y_train)
    # calculate errors
    new_train_error = mean_squared_error(y_train, ridge.predict(X_train))
    new_validation_error = mean_squared_error(y_validation, ridge.predict(X_validation))
    new_test_error = mean_squared_error(y_test, ridge.predict(X_test))
    # print errors as report
    print('alpha: {:7} | train error: {:5} | val error: {:6} | test error: {}'.
          format(alpha,
                 round(new_train_error,3),
                 round(new_validation_error,3),
                 round(new_test_error,3)))

Mean Squared Error
----------------------------------------------------------------------------
alpha:   0.001 | train error: 22.924 | val error: 19.804 | test error: 23.958
alpha:    0.01 | train error: 22.924 | val error: 19.801 | test error: 23.943
alpha:     0.1 | train error: 22.938 | val error: 19.791 | test error: 23.82
alpha:       1 | train error: 23.315 | val error: 20.158 | test error: 23.533
alpha:      10 | train error: 24.199 | val error: 20.981 | test error: 23.369


There are a few key takeaways here. First, notice the U-shaped behavior exhibited by the validation error above. It starts at 19.804, goes down for two steps and then back up. Also notice that validation error and test error tend to move together, but by no means is the relationship perfect. Both errors decrease as alpha increases initially, but then test error keeps going down while validation error rises again.

The U shape is typical. We prefer to be in the sweet spot between overfitting (fitting the training data so closely that the model generalizes poorly) and underfitting (having too much error even on the training data).
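Picking the bottom of the U can be done programmatically rather than by eye. The sketch below copies the validation errors from the report above into a dictionary (the variable names are illustrative) and selects the alpha with the smallest error.

```python
# Validation errors copied from the Ridge report above
validation_errors = {
    0.001: 19.804,
    0.01: 19.801,
    0.1: 19.791,
    1: 20.158,
    10: 20.981,
}

# min over the keys, compared by their validation error
best_alpha = min(validation_errors, key=validation_errors.get)
print('best alpha:', best_alpha)   # best alpha: 0.1
```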


<img src="images/train2.png" width="400">

In our case, we'll use alpha=0.1, since it gave us the smallest validation error, and retrain on the full training data (the training and validation sets combined) before evaluating on the test set:

In [11]:
# train/test split
X_train, X_test, y_train, y_test = train_test_split(X, 
                                                    y, 
                                                    shuffle=True,
                                                    test_size=0.2, 
                                                    random_state=15)

# instantiate model
ridge = Ridge(alpha=0.1, fit_intercept=True, random_state=99)
#fit model
ridge.fit(X_train, y_train)
#evaluate model
new_train_error = mean_squared_error(y_train, ridge.predict(X_train))
new_test_error = mean_squared_error(y_test, ridge.predict(X_test))
print('MSE train error:', new_train_error, 'R2 train: ', ridge.score(X_train, y_train))
print('MSE test error:', new_test_error, 'R2 test: ', ridge.score(X_test, y_test))

MSE train error: 21.87892746431845 R2 train:  0.7454491108712407
MSE test error: 23.679851355216798 R2 test:  0.6937869418612654


Thus, we can conclude that 69.4% of the variation in home price in our test set is explained by our explanatory variables.
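The R^2 value reported by `score` is one minus the ratio of the residual sum of squares to the total sum of squares around the mean of the true values. A minimal sketch of that calculation follows. The helper `r_squared` and the four toy values are assumptions for illustration.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total variation
    return 1.0 - ss_res / ss_tot

# Toy values: SS_res = 0.75 and SS_tot = 20, so R^2 is about 0.9625
y_true_demo = [3.0, 5.0, 7.0, 9.0]
y_pred_demo = [2.5, 5.5, 7.0, 9.5]
print(r_squared(y_true_demo, y_pred_demo))
```

Because MSE is the residual sum of squares divided by the number of samples, a lower MSE on a given test set always corresponds to a higher R^2 on that same set.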

### Using Validation Sets to Tune Hyperparameters: Cross Validation & KFold

Recall that cross validation repeatedly holds out a different portion of the dataset as test data. Using `cross_val_score` is convenient, but on its own it doesn't let us manipulate features within each cross validation step. An alternative is to use `KFold`, which hands us the row indices for each fold so that we can manipulate the data during cross validation. Just for fun, we'll also use a different model called LASSO instead of Ridge:

In [17]:
K = 10
kf = KFold(n_splits=K, shuffle=True, random_state=42)

alphas = [0.001, 0.01, 0.1, 1, 10]

for alpha in alphas:
    train_errors = []
    validation_errors = []
    for train_index, val_index in kf.split(X, y):
        
        # split data
        X_train, X_val = X[train_index], X[val_index]
        y_train, y_val = y[train_index], y[val_index]

        # instantiate model
        lasso = Lasso(alpha=alpha, fit_intercept=True, random_state=77)
        # fit model
        lasso.fit(X_train, y_train)
        
        #calculate errors
        train_error = mean_squared_error(y_train, lasso.predict(X_train))
        val_error = mean_squared_error(y_val, lasso.predict(X_val))

        # append to appropriate list
        train_errors.append(train_error)
        validation_errors.append(val_error)
    
    # generate report
    print('alpha: {:6} | mean(train_error): {:7} | mean(val_error): {}'.
          format(alpha,
                 round(np.mean(train_errors),4),
                 round(np.mean(validation_errors),4)))

alpha:  0.001 | mean(train_error):  21.819 | mean(val_error): 23.3657
alpha:   0.01 | mean(train_error): 21.8551 | mean(val_error): 23.4134
alpha:    0.1 | mean(train_error): 22.9668 | mean(val_error): 24.5981
alpha:      1 | mean(train_error):  26.734 | mean(val_error): 28.2354
alpha:     10 | mean(train_error):  40.183 | mean(val_error): 40.9859


In our case, we'll use alpha=0.001, since it gave us the smallest mean validation error, and train on the full training set before evaluating on the test set:

In [18]:
# train/test split
X_train, X_test, y_train, y_test = train_test_split(X, 
                                                    y, 
                                                    shuffle=True,
                                                    test_size=0.2, 
                                                    random_state=15)

# instantiate model
lasso = Lasso(alpha=0.001, fit_intercept=True, random_state=77)
# fit model
lasso.fit(X_train, y_train)

#evaluate model
new_train_error = mean_squared_error(y_train, lasso.predict(X_train))
new_test_error = mean_squared_error(y_test, lasso.predict(X_test))
print('MSE train error:', new_train_error, 'R2 train: ', lasso.score(X_train, y_train))
print('MSE test error:', new_test_error, 'R2 test: ', lasso.score(X_test, y_test))

MSE train error: 21.87195567342094 R2 train:  0.7455302243341687
MSE test error: 23.774056435209108 R2 test:  0.6925687405641405


Thus, we can conclude that 69.3% of the variation in home price in our test set is explained by our explanatory variables, which is slightly worse than the 69.4% we got before from Ridge regression.
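As an example of the per-fold feature manipulation that `KFold` makes possible, the sketch below standardizes the features inside each fold, fitting the scaler on the training rows only so that no information from the validation rows leaks into preprocessing. The synthetic data, the use of `StandardScaler`, and the single alpha value are assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption, not the Boston data)
X_demo, y_demo = make_regression(n_samples=200, n_features=8,
                                 noise=10.0, random_state=0)

kf_demo = KFold(n_splits=5, shuffle=True, random_state=42)
val_errors = []
for train_index, val_index in kf_demo.split(X_demo):
    # fit the scaler on the training fold only, then apply it to both folds
    scaler = StandardScaler().fit(X_demo[train_index])
    X_tr = scaler.transform(X_demo[train_index])
    X_va = scaler.transform(X_demo[val_index])

    model = Lasso(alpha=0.1).fit(X_tr, y_demo[train_index])
    val_errors.append(mean_squared_error(y_demo[val_index], model.predict(X_va)))

print('mean(val_error):', round(float(np.mean(val_errors)), 4))
```

Standardizing matters for Lasso and Ridge in particular, because the penalty treats all coefficients alike and is therefore sensitive to the scale of each feature.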