Run the cell below to import the required packages:

In [57]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

from sklearn.utils import shuffle
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import KFold
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Ridge
from sklearn.linear_model import Lasso
from sklearn.metrics import mean_squared_error

from sklearn.datasets import load_boston
from sklearn.preprocessing import StandardScaler

import warnings
warnings.filterwarnings('ignore')

## Training and Testing
---
<a class="anchor" id="train"></a>

The previous lessons were meant to get you comfortable with different types of machine learning algorithms. However, in practice, we would never use our entire dataset to train our model. Instead, we would use a portion of our data, the training set, to train the data, and then we would evaluate the accuracy of our model on the testing portion of our dataset. Since the test portion was not used to train the model, it gives us a more honest indication of how well our model does at predicting new data it hasn't encountered before. There are a few techniques for training and testing. This first one is less computationally intensive.

### Technique 1: Train/Validate/Test

First, a couple of vocab words:

**Training Dataset**: The sample of data used to fit the model. The actual dataset that we use to train the model. The model sees and learns from this data.

**Validation Dataset**: The sample of data used to provide an unbiased evaluation of a model fit on the training dataset while tuning model hyperparameters.  The validation set is used to evaluate a given model, but this is for frequent evaluation. We as machine learning engineers use this data to fine-tune the model hyperparameters. Hence the model occasionally sees this data, but never does it “Learn” from this. We use the validation set results and update higher level hyperparameters. So the validation set in a way affects a model, but indirectly.

**Test Dataset**: The sample of data used to provide an unbiased evaluation of a final model fit on the training dataset. The Test dataset provides the gold standard used to evaluate the model. It is only used once a model is completely trained (using the train and validation sets). The test set is generally what is used to evaluate competing models (For example on many Kaggle competitions, the validation set is released initially along with the training set and the actual test set is only released when the competition is about to close, and it is the result of the the model on the Test set that decides the winner).

<img src="images/train.png" width="400">

Side note: there is no hard and fast rule about how to proportion your data. Just know that your model is limited in what it can learn if you limit the data you feed it. However, if your test set is too small, it won’t provide an accurate estimate as to how your model will perform. 

### Boston Example

Scikit-learn has many data sets built in that you can use. Let's load in a Boston dataset of housing info and housing prices. When you first load the data, it's a bit hard to understand what's going on:

In [73]:
boston = load_boston()
boston.data

array([[6.3200e-03, 1.8000e+01, 2.3100e+00, ..., 1.5300e+01, 3.9690e+02,
        4.9800e+00],
       [2.7310e-02, 0.0000e+00, 7.0700e+00, ..., 1.7800e+01, 3.9690e+02,
        9.1400e+00],
       [2.7290e-02, 0.0000e+00, 7.0700e+00, ..., 1.7800e+01, 3.9283e+02,
        4.0300e+00],
       ...,
       [6.0760e-02, 0.0000e+00, 1.1930e+01, ..., 2.1000e+01, 3.9690e+02,
        5.6400e+00],
       [1.0959e-01, 0.0000e+00, 1.1930e+01, ..., 2.1000e+01, 3.9345e+02,
        6.4800e+00],
       [4.7410e-02, 0.0000e+00, 1.1930e+01, ..., 2.1000e+01, 3.9690e+02,
        7.8800e+00]])

However, we can read the docs to understand better:

In [74]:
print(boston.DESCR)
print(load_boston.__doc__)

.. _boston_dataset:

Boston house prices dataset
---------------------------

**Data Set Characteristics:**  

    :Number of Instances: 506 

    :Number of Attributes: 13 numeric/categorical predictive. Median Value (attribute 14) is usually the target.

    :Attribute Information (in order):
        - CRIM     per capita crime rate by town
        - ZN       proportion of residential land zoned for lots over 25,000 sq.ft.
        - INDUS    proportion of non-retail business acres per town
        - CHAS     Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
        - NOX      nitric oxides concentration (parts per 10 million)
        - RM       average number of rooms per dwelling
        - AGE      proportion of owner-occupied units built prior to 1940
        - DIS      weighted distances to five Boston employment centres
        - RAD      index of accessibility to radial highways
        - TAX      full-value property-tax rate per $10,000
        - PTRATIO  pu

We find that all of the predictor variables are contained in data and the target variable (housing price) is contained in the target:

In [60]:
X = boston.data #all the other predictor variables
y = boston.target #price of house

### Simple Test/Train sets
Let's omit the validation set for now and focus on splitting into training and testing sets. You'll definitely want to **shuffle** the data first (What if your data happened to be sorted? That would really mess with your results, as you would be training on a dataset very different than your testing set.) Luckily, train_test_split has a sorting parameter built in.

If we save 30% for the testing set and run a simple linear regression, we get the following results:

In [61]:
model = LinearRegression()

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, shuffle=True, test_size=0.3)

# Fit the model against the training data
model.fit(X_train, y_train)

# Evaluate the model against the testing data
print('Train:', model.score(X_train, y_train))
print('Test:', model.score(X_test, y_test))

Train: 0.7527011037469231
Test: 0.7042897668418293


Run the above cell multiple times to notice the variation. Notice the R^2 value of the test (hold-out) set. Notice that model performance is usually a little lower on the test set. This is expected. In fact, this lower value is a much more accurate number to report as "real world" performance.

### Technique 2: Cross Validation

Cross validation assigns a certain percentage of the dataset to test data, and then does this multiple times. 

One round of cross-validation involves partitioning a sample of data into complementary subsets, performing the analysis on the training set and testing the analysis on the other subset. To reduce variability, in most methods multiple rounds of cross-validation are performed using different partitions, and the validation results are combined (e.g. averaged) over the rounds to give an estimate of the model’s predictive performance.

The upside of this method is it avoids unluckily sampling one unrepresentative set of test data. However, the downside is this method is computationally intensive. This method works great on small to medium-sized datasets. This is absolutely not the kind of thing you’d want to try on a massive dataset. 

<img src="images/train3.png" width="400">

Not surprisingly, scikit-learn can help us do this using a method called cross_val_score. Unfortunately, though, cross_val_score does not come with a built in shuffle method. Therefore, will will need to shuffle the data first. Pandas has a built-in command to do this, but we'll need to make the data into a DataFrame first:

In [62]:
#create dataframe
df = pd.DataFrame(X)
df['price'] = y

#now sort
df = df.sample(frac=1).reset_index(drop=True)

#now we can separate the predictor and target variables again
y = df['price']
X = df.drop(columns = 'price')

We can now apply the cross_val_score method:

In [63]:
model = LinearRegression()

scores = cross_val_score(model, X, y, cv=5, scoring='r2')

print(scores)
print('Average Test R2: ', np.mean(scores))

[0.71643446 0.69275083 0.70954922 0.73310468 0.7657153 ]
Average Test R2:  0.723510895085884


This is pretty similar to what we got above using train_test_split. Notice, though, that one or two of the five partitions can give wackier results than the others. If you happened to do only one train/test split on that wackier data, your R2 would be much worse than the cross_val_score results.

Also note that there are many built in scoring techniques to cross_val_score. We can look at the average Mean Squared Error (MSE) instead of the R2:

In [64]:
model = LinearRegression()

scores = cross_val_score(model, X, y, cv=5, scoring='neg_mean_squared_error')

# scores output is negative, which is because
# Scikit-learn uses negative mean squared error so that 
# scores always improve with higher values 
print(-scores)
print('Average MSE: ', np.mean(-scores))

[27.08319926 29.97947213 25.14048285 19.8399267  15.67101821]
Average MSE:  23.5428198268474


Source: https://www.ritchieng.com/machine-learning-cross-validation/


Advantages of a **train/test split:**

- Runs K times faster than K-fold cross-validation
- Simpler to examine the detailed results of the testing process

Advantages of **cross validation:**

- More accurate estimate of out-of-sample accuracy
- More "efficient" use of data (every observation is used for both training and testing)

Recommendations for **cross validation:**

- K can be any number, but K=4 or 5 is common

- Each response class should be represented with equal proportions in each of the K folds

- Scikit-learn's cross_val_score function does this by default

### Feature Scaling
Source: https://medium.com/@contactsunny/why-do-we-need-feature-scaling-in-machine-learning-and-how-to-do-it-using-scikit-learn-d8314206fe73

When you’re working with a learning model, it is important to scale the features to a range which is centered around zero. This is done so that the variance of the features are in the same range. If a feature’s variance is orders of magnitude more than the variance of other features, that particular feature might dominate other features in the dataset, which is not something we want happening in our model.
The aim here is to to achieve Gaussian with zero mean and unit variance.

The SciKit Learn library provides a class to easily scale our data. We can use the StandardScaler class from the library for this. Now that we know why we need to scale our features, let’s see how to do it. Consider the following dataset of consumer info:

In [65]:
df = pd.DataFrame([[44,72000, 'No'], 
                        [27,48000, 'Yes'],
                        [30,54000,'No'],
                       [38,61000,'No'],
                       [40,30000,'Yes'],
                       [35,58000, 'Yes'],
                       [48,79000,'Yes'],
                       [50,83000,'No'],
                      [18,0,'No'],
                      [50, 100000, 'Yes']], columns = ['age', 'income', 'purchased'])
df.head()

Unnamed: 0,age,income,purchased
0,44,72000,No
1,27,48000,Yes
2,30,54000,No
3,38,61000,No
4,40,30000,Yes


Let's transform the categorical column purchased to a one-hot matrix:

In [66]:
one_hot = pd.get_dummies(df['purchased'])
df = df.drop('purchased', axis = 1)
df = df.join(one_hot)
df.head()

Unnamed: 0,age,income,No,Yes
0,44,72000,1,0
1,27,48000,0,1
2,30,54000,1,0
3,38,61000,1,0
4,40,30000,0,1


As we can see now, the features are not at all on the same scale. We definitely need to scale them. The StandardScalar method standardizes features (each column) by removing the mean and scaling to unit variance.

The standard score of a sample x is calculated as:

$z_{\text{score}} = \frac{x - \mu}{\sigma}$

Let’s look at the code for doing that:

In [67]:
standardScaler = StandardScaler()
X = standardScaler.fit_transform(df)
pd.DataFrame(X)

Unnamed: 0,0,1,2,3
0,0.593507,0.499777,1.0,-1.0
1,-1.088096,-0.388716,-1.0,1.0
2,-0.791343,-0.166592,1.0,-1.0
3,0.0,0.092551,1.0,-1.0
4,0.197836,-1.055085,-1.0,1.0
5,-0.296753,-0.01851,-1.0,1.0
6,0.989178,0.758921,-1.0,1.0
7,1.187014,0.907003,1.0,-1.0
8,-1.978356,-2.165701,1.0,-1.0
9,1.187014,1.536352,-1.0,1.0


Now, after running this code with the dataset given above, we end up with a nicely scaled set of features as shown below. We can pass this input to a model and get much better results.

Note that StandardScaler is the normalizer that we will mostly use in this class, but there are others. For example, the sk-learn MinMaxScaler standardizes according to this formula:

$z = (x - x_{\text{min}}) / (x_{\text{max}}- x_{\text{min}})$

You can investigate the various scalers and their pros and cons if you would like.

### Featuring Scaling and Train/Test Splits

Source: https://datascience.stackexchange.com/questions/38395/standardscaler-before-and-after-splitting-data

We should be careful to not apply scaling to the entire dataset at the beginning, though.

In the interest of preventing information about the distribution of the test set leaking into your model, you should fit the scaler on your training data only, then standardise both training and test sets with that scaler. If instead you fit the scaler on the full dataset prior to splitting, information about the test set is used to transform the training set, which in turn is passed downstream.

As an example, knowing the distribution of the whole dataset might influence how you detect and process outliers, as well as how you parameterise your model. Although the data itself is not exposed, information about the distribution of the data is. As a result, your test set performance is not a true estimate of performance on unseen data. 

Consider our very small consumer example above, and now suppose we wanted to use our input data to predict the price of the item:

In [68]:
df = pd.DataFrame([[44,72000, 'No', 1000], 
                        [27,48000, 'Yes', 100],
                        [30,54000,'No', 50],
                       [38,61000,'No', 100],
                       [40,30000,'Yes', 20],
                       [35,58000, 'Yes', 1000],
                       [48,79000,'Yes', 500],
                       [50,83000,'No', 2],
                      [18,0,'No', 50],
                      [50, 100000, 'Yes', 1000]], columns = ['age', 'income', 'purchased', 'price_of_item'])

one_hot = pd.get_dummies(df['purchased'])
df = df.drop('purchased', axis = 1)
df = df.join(one_hot)
df.head()

Unnamed: 0,age,income,price_of_item,No,Yes
0,44,72000,1000,1,0
1,27,48000,100,0,1
2,30,54000,50,1,0
3,38,61000,100,1,0
4,40,30000,20,0,1


If we wanted to do an 80/20 train/test split, we would first split the data and then apply feature scaling to our training set only:

In [69]:
y = df['price_of_item']
X = df.drop(columns=['price_of_item'])

X_train, X_test, y_train, y_test = train_test_split(X, y, shuffle=True, test_size=0.2)

standardScaler = StandardScaler()
X_train = standardScaler.fit_transform(X_train)
pd.DataFrame(X_train)

Unnamed: 0,0,1,2,3
0,1.123494,0.839543,1.290994,-1.290994
1,1.123494,1.6551,-0.774597,0.774597
2,-0.059131,-1.703074,-0.774597,0.774597
3,0.886969,0.647648,-0.774597,0.774597
4,-1.596543,-0.839543,-0.774597,0.774597
5,0.413919,0.31183,1.290994,-1.290994
6,-0.650444,-0.359804,-0.774597,0.774597
7,-1.241756,-0.5517,1.290994,-1.290994


Then, we would apply our linear regression model. If we wanted to score our model on the testing data, we would then transform that separately:

In [70]:
model = LinearRegression()

#fit the model and give the training error
model.fit(X_train, y_train)
print('Train R^2:', model.score(X_train, y_train))

# Evaluate the model against the TRANFORMED testing data
X_test = standardScaler.fit_transform(X_test)
print('Test R^2:', model.score(X_test, y_test))

Train R^2: 0.3356620444741757
Test R^2: -299.20339561650616


Don't be freaked out if the test $R^2$ that you happen to get above is negative. This example was totally crappy and made up. The best possible $R^2$ score is 1.0. A constant model that always predicts the expected value of y, disregarding the input features, would get an $R^2$ score of 0.0. An $R^2$ can be negative if the model is arbitrarily worse than just guessing the expected value of y.

### Homework

1.) Do a 70/30 train/test split on your previous car dataset. Be sure to sort and apply the StandardScalar to the training set. How does the error compare to what you previously reported using the whole dataset?

2.)Also do a cv=5 cross validation. It is difficult to scale only the training data for cross_val_score, so you don't need to include StandardScalar for this question. (Since cross_val_score doesn't let us manipulate things easily, we won't use this one much in the future.)

In [71]:
df = pd.read_csv('data/cars.csv', index_col = 0)
df.head()

Unnamed: 0,wheel-base,length,width,height,curb-weight,engine-size,bore,stroke,compression-ratio,horsepower,peak-rpm,city-mpg,highway-mpg,price
0,88.6,168.8,64.1,48.8,2548,130,3.47,2.68,9.0,111.0,5000.0,21,27,13495.0
1,88.6,168.8,64.1,48.8,2548,130,3.47,2.68,9.0,111.0,5000.0,21,27,16500.0
2,94.5,171.2,65.5,52.4,2823,152,2.68,3.47,9.0,154.0,5000.0,19,26,16500.0
3,99.8,176.6,66.2,54.3,2337,109,3.19,3.4,10.0,102.0,5500.0,24,30,13950.0
4,99.4,176.6,66.4,54.3,2824,136,3.19,3.4,8.0,115.0,5500.0,18,22,17450.0


In [72]:
#insert work here