# Step 1: Import libraries and modules.

In [1]:
import numpy as np
import pandas as pd

# Import the train_test_split() function from the model_selection module; we'll use it to split the data into training and test sets.
from sklearn.model_selection import train_test_split

# This contains utilities for scaling, transforming, and wrangling data
from sklearn import preprocessing

# A "family" of models is a broad type of model, such as random forests, SVMs, linear regression models, etc.
# Within each family of models, you'll get an actual model after you fit and tune its parameters to the data.
from sklearn.ensemble import RandomForestRegressor

# Import cross-validation pipeline
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import GridSearchCV

# Import evaluation metrics
from sklearn.metrics import mean_squared_error, r2_score

# Import module for saving scikit-learn models
# Joblib is an alternative to Python's pickle package, and we'll use it because it's more efficient for storing large NumPy arrays.
# Note: sklearn.externals.joblib was removed in scikit-learn 0.23, so we import joblib directly.
import joblib

# Step 2: Load red wine data

In [4]:
# Load wine data from remote URL
dataset_url = 'http://mlr.cs.umass.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv'
data = pd.read_csv(dataset_url, sep=';')

In [6]:
data.shape

(1599, 12)

# Step 3: Split data into training and test sets

In [9]:
# Separate target from training features
y = data.quality
X = data.drop('quality', axis=1)

In [14]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123, stratify=y)

In [15]:
X_train

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol
691,9.2,0.920,0.24,2.60,0.087,12.0,93.0,0.99980,3.48,0.54,9.800000
1475,5.3,0.470,0.11,2.20,0.048,16.0,89.0,0.99182,3.54,0.88,13.566667
1065,7.7,0.610,0.18,2.40,0.083,6.0,20.0,0.99630,3.29,0.60,10.200000
1159,10.2,0.410,0.43,2.20,0.110,11.0,37.0,0.99728,3.16,0.67,10.800000
227,9.0,0.820,0.14,2.60,0.089,9.0,23.0,0.99840,3.39,0.63,9.800000
1598,6.0,0.310,0.47,3.60,0.067,18.0,42.0,0.99549,3.39,0.66,11.000000
1243,8.3,0.560,0.22,2.40,0.082,10.0,86.0,0.99830,3.37,0.62,9.500000
221,7.4,0.530,0.26,2.00,0.101,16.0,72.0,0.99570,3.15,0.57,9.400000
108,8.0,0.330,0.53,2.50,0.091,18.0,80.0,0.99760,3.37,0.80,9.600000
170,7.9,0.885,0.03,1.80,0.058,4.0,8.0,0.99720,3.36,0.33,9.100000


# Step 4: Declare data preprocessing steps
Standardization is the process of subtracting each feature's mean and then dividing by that feature's standard deviation.

Standardization is a common requirement for machine learning tasks. Many algorithms assume that all features are centered around zero and have approximately the same variance.
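To make the definition concrete, here's a minimal sketch that standardizes a tiny made-up array by hand and checks it against `preprocessing.scale`. The values in `X_toy` are invented for illustration and have nothing to do with the wine data.

```python
import numpy as np
from sklearn import preprocessing

# Toy feature matrix (made-up values): 4 samples, 2 features on very different scales
X_toy = np.array([[1.0, 100.0],
                  [2.0, 300.0],
                  [3.0, 500.0],
                  [4.0, 700.0]])

# Standardize by hand: subtract each column's mean, then divide by its standard deviation
col_means = X_toy.mean(axis=0)
col_stds = X_toy.std(axis=0)  # population standard deviation (ddof=0)
X_manual = (X_toy - col_means) / col_stds

# Compare with scikit-learn's helper
X_sklearn = preprocessing.scale(X_toy)
print(np.allclose(X_manual, X_sklearn))
```

After scaling, both columns have mean 0 and standard deviation 1, even though the raw columns differed by two orders of magnitude.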

In [16]:
# Lazy way of scaling data
X_train_scaled = preprocessing.scale(X_train)
X_train_scaled

array([[ 0.51358886,  2.19680282, -0.164433  , ...,  1.08415147,
        -0.69866131, -0.58608178],
       [-1.73698885, -0.31792985, -0.82867679, ...,  1.46964764,
         1.2491516 ,  2.97009781],
       [-0.35201795,  0.46443143, -0.47100705, ..., -0.13658641,
        -0.35492962, -0.20843439],
       ...,
       [-0.98679628,  1.10708533, -0.93086814, ...,  0.24890976,
        -0.98510439,  0.35803669],
       [-0.69826067,  0.46443143, -1.28853787, ...,  1.08415147,
        -0.35492962, -0.68049363],
       [ 3.1104093 , -0.62528606,  2.08377675, ..., -1.61432173,
         0.79084268, -0.39725809]])

In [17]:
# Confirm the scaled dataset
print(X_train_scaled.mean(axis=0))
print(X_train_scaled.std(axis=0))

[ 1.16664562e-16 -3.05550043e-17 -8.47206937e-17 -2.22218213e-17
  2.22218213e-17 -6.38877362e-17 -4.16659149e-18 -2.54439854e-15
 -8.70817622e-16 -4.08325966e-16 -1.17220107e-15]
[1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]


Now, here's the preprocessing code we will use...

So instead of directly invoking the scale function, we'll be using a feature in Scikit-Learn called the **Transformer API**. The Transformer API allows you to "fit" a preprocessing step using the training data the same way you'd fit a model...

...and then use the same transformation on future data sets!

Here's what that process looks like:

* Fit the transformer on the training set (saving the means and standard deviations)
* Apply the transformer to the training set (scaling the training data)
* Apply the transformer to the test set (using the same means and standard deviations)

This makes your final estimate of model performance more realistic, and it allows you to insert your preprocessing steps into a ***cross-validation*** pipeline (more on this in Step 6).

In [19]:
# Fitting the Transformer API
scaler = preprocessing.StandardScaler().fit(X_train)

# Now, the scaler object has the saved means and standard deviations for each feature in the training set

In [21]:
# Applying transformer to training data
X_train_scaled = scaler.transform(X_train)

print(X_train_scaled.mean(axis=0))
print(X_train_scaled.std(axis=0))

[ 1.16664562e-16 -3.05550043e-17 -8.47206937e-17 -2.22218213e-17
  2.22218213e-17 -6.38877362e-17 -4.16659149e-18 -2.54439854e-15
 -8.70817622e-16 -4.08325966e-16 -1.17220107e-15]
[1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]


In [22]:
# Apply transformer to test data
X_test_scaled = scaler.transform(X_test)

print(X_test_scaled.mean(axis=0))
print(X_test_scaled.std(axis=0))

[ 0.02776704  0.02592492 -0.03078587 -0.03137977 -0.00471876 -0.04413827
 -0.02414174 -0.00293273 -0.00467444 -0.10894663  0.01043391]
[1.02160495 1.00135689 0.97456598 0.91099054 0.86716698 0.94193125
 1.03673213 1.03145119 0.95734849 0.83829505 1.0286218 ]


Notice how the scaled features in the test set are not perfectly centered at zero with unit variance! This is exactly what we'd expect, as we're transforming the test set using the means and standard deviations from the training set, not from the test set itself.
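If you want to convince yourself that the transformer really reuses the training statistics, here's a small sketch on synthetic data (the arrays `X_tr` and `X_te` are randomly generated stand-ins, not the wine data). It reads the fitted `mean_` and `scale_` attributes of `StandardScaler` and applies them to the "test" array by hand.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic stand-ins for a training set and a test set (assumption: random data)
rng = np.random.default_rng(0)
X_tr = rng.normal(loc=5.0, scale=2.0, size=(80, 3))
X_te = rng.normal(loc=5.0, scale=2.0, size=(20, 3))

# Fit on the training data only
toy_scaler = StandardScaler().fit(X_tr)

# The fitted transformer stores the per-feature training statistics
print(toy_scaler.mean_)   # training means
print(toy_scaler.scale_)  # training standard deviations

# Transforming the test data is just arithmetic with those saved statistics
X_te_manual = (X_te - toy_scaler.mean_) / toy_scaler.scale_
print(np.allclose(X_te_manual, toy_scaler.transform(X_te)))
```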

In practice, when we set up the cross-validation pipeline, we won't even need to manually fit the Transformer API. Instead, we'll simply declare the class object, like so:

In [25]:
# Pipeline with preprocessing and model
pipeline = make_pipeline(preprocessing.StandardScaler(), RandomForestRegressor(n_estimators=100))

This is exactly what it looks like: a modeling pipeline that first transforms the data using StandardScaler() and then fits a model using a random forest regressor.
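As a rough sanity check, the sketch below fits a pipeline on synthetic data and compares it with doing the two steps by hand. The data comes from `make_regression`, and the names `pipe`, `manual_scaler`, and `manual_rf` are made up for this example. A fixed `random_state` is assumed so both forests are built identically.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression data (assumption: stands in for the wine features)
X_demo, y_demo = make_regression(n_samples=100, n_features=4, noise=5.0, random_state=0)

# The pipeline: scale, then fit a random forest
pipe = make_pipeline(StandardScaler(), RandomForestRegressor(n_estimators=10, random_state=0))
pipe.fit(X_demo, y_demo)

# make_pipeline names each step after its lowercased class name
print(list(pipe.named_steps))

# The same two steps, done manually
manual_scaler = StandardScaler().fit(X_demo)
manual_rf = RandomForestRegressor(n_estimators=10, random_state=0)
manual_rf.fit(manual_scaler.transform(X_demo), y_demo)

same = np.allclose(pipe.predict(X_demo),
                   manual_rf.predict(manual_scaler.transform(X_demo)))
print(same)
```

The step names printed here are the prefixes you'll use when declaring hyperparameters for the pipeline.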

# Step 5: Declare hyperparameters to tune

#### WTF are hyperparameters?
There are two types of parameters we need to worry about: model parameters and hyperparameters. Model parameters can be learned directly from the data (e.g. regression coefficients), while hyperparameters cannot.
Hyperparameters express "higher-level" structural information about the model, and they are typically set before training the model.

#### Example: random forest hyperparameters.
As an example, let's take our random forest for regression:
Within each decision tree, the computer can empirically decide where to create branches based on either mean-squared-error (MSE) or mean-absolute-error (MAE). Therefore, the actual branch locations are **model parameters**.

However, the algorithm does not know which of the two criteria, MSE or MAE, it should use. The algorithm also cannot decide how many trees to include in the forest. These are examples of **hyperparameters** that the user must set.
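Here's a small sketch of that distinction on synthetic data. We pick `n_estimators` and `max_depth` ourselves (hyperparameters), and then peek at what the fitted forest learned (model parameters). The data and the variable names are assumptions made for the example.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic regression data (assumption: stands in for the wine features)
X_demo, y_demo = make_regression(n_samples=100, n_features=4, noise=5.0, random_state=0)

# Hyperparameters: set by us before training
forest = RandomForestRegressor(n_estimators=5, max_depth=2, random_state=0)
forest.fit(X_demo, y_demo)

# The forest contains exactly as many trees as we asked for
print(len(forest.estimators_))

# Model parameters: learned from the data during fit()
first_tree = forest.estimators_[0].tree_
print(first_tree.feature[0])    # index of the feature used at the root split
print(first_tree.threshold[0])  # learned split value at the root
print(first_tree.max_depth)     # actual depth, capped by our max_depth setting
```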

In [28]:
# List tunable hyperparameters
print(pipeline.get_params())

{'memory': None, 'steps': [('standardscaler', StandardScaler(copy=True, with_mean=True, with_std=True)), ('randomforestregressor', RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features='auto', max_leaf_nodes=None,
           min_impurity_decrease=0.0, min_impurity_split=None,
           min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=1,
           oob_score=False, random_state=None, verbose=0, warm_start=False))], 'standardscaler': StandardScaler(copy=True, with_mean=True, with_std=True), 'randomforestregressor': RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features='auto', max_leaf_nodes=None,
           min_impurity_decrease=0.0, min_impurity_split=None,
           min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=1,
           oob_score=False, random_state=None, verbose=0, warm_start=Fals

In [29]:
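# Declare hyperparameters to tune
# Keys follow the pattern <step name>__<parameter name> (note the double underscore).
# Note: max_features='auto' was removed in scikit-learn 1.3; on newer versions, use 1.0 (all features) instead.
hyperparameters = {'randomforestregressor__max_features': ['auto', 'sqrt', 'log2'],
                   'randomforestregressor__max_depth': [None, 5, 3, 1]}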
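To see how many candidates a grid like this describes, here's a sketch using `ParameterGrid` from `sklearn.model_selection`. The grid `demo_grid` mirrors the one above, except that `1.0` stands in for `'auto'` so the values are also valid on recent scikit-learn versions (that substitution is an assumption of this example).

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import ParameterGrid
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Same shape as the grid above: 3 values x 4 values
demo_grid = {'randomforestregressor__max_features': [1.0, 'sqrt', 'log2'],
             'randomforestregressor__max_depth': [None, 5, 3, 1]}

# Expand the grid into every combination of values
combos = list(ParameterGrid(demo_grid))
print(len(combos))
print(combos[0])

# The step__parameter naming lets you reach into a pipeline step
pipe = make_pipeline(StandardScaler(), RandomForestRegressor(n_estimators=10))
pipe.set_params(randomforestregressor__max_depth=3)
print(pipe.named_steps['randomforestregressor'].max_depth)
```

With 12 combinations and 10-fold cross-validation, a grid search would fit 120 models before the final refit.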

# Step 6: Tune model using a cross-validation pipeline
### WTF is cross-validation (CV)?
Cross-validation is a process for reliably estimating the performance of a method for building a model. It works by training and evaluating a model multiple times with that same method, each time on a different split of the data.

Practically, that "method" is simply a set of hyperparameters in this context.

These are the steps for CV:
1. Split your data into k equal parts, or "folds" (typically k=10).
2. Train your model on k-1 folds (e.g. the first 9 folds).
3. Evaluate it on the remaining "hold-out" fold (e.g. the 10th fold).
4. Perform steps (2) and (3) k times, each time holding out a different fold.
5. Aggregate the performance across all k folds. This is your performance metric.
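The steps above can be sketched by hand with `KFold`. This is only an illustration on synthetic data, with k=5 to keep it quick and R² as the performance metric (both choices are assumptions for the example).

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import KFold

# Synthetic regression data (assumption: stands in for the training set)
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

k = 5
fold_scores = []

# Split into k folds; each iteration holds out a different fold
for train_idx, holdout_idx in KFold(n_splits=k, shuffle=True, random_state=0).split(X_demo):
    # Train on the k-1 training folds
    model = RandomForestRegressor(n_estimators=20, random_state=0)
    model.fit(X_demo[train_idx], y_demo[train_idx])

    # Evaluate on the hold-out fold
    holdout_pred = model.predict(X_demo[holdout_idx])
    fold_scores.append(r2_score(y_demo[holdout_idx], holdout_pred))

# Aggregate across folds
cv_score = np.mean(fold_scores)
print(fold_scores)
print(cv_score)
```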

<div>
<img src='K-fold_cross_validation_EN.jpg'></img>
K-Fold Cross-validation diagram, courtesy of Wikipedia</div>

### Why is cross-validation important in machine learning?
Let's say you want to train a random forest regressor. One of the hyperparameters you must tune is the maximum depth allowed for each decision tree in your forest.

### How can you decide?
That's where cross-validation comes in. Using only your training set, you can use CV to evaluate different hyperparameters and estimate their effectiveness.

This allows you to keep your test set "untainted" and save it for a true hold-out evaluation when you're finally ready to select a model.

For example, you can use CV to tune a random forest model, a linear regression model, and a k-nearest neighbors model, using only the training set. Then, you still have the untainted test set to make your final selection between the model families!

### So WTF is a cross-validation "pipeline?"
The best practice when performing CV is to include your data preprocessing steps inside the cross-validation loop. This prevents accidentally tainting your training folds with influential data from your hold-out fold.

Here's how the CV pipeline looks after including preprocessing steps:

1. Split your data into k equal parts, or "folds" (typically k=10).
2. Preprocess k-1 training folds.
3. Train your model on the same k-1 folds.
4. Preprocess the hold-out fold using the same transformations from step (2).
5. Evaluate your model on the same hold-out fold.
6. Perform steps (2) - (5) k times, each time holding out a different fold.
7. Aggregate the performance across all k folds. This is your performance metric.
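Here's a sketch of that loop written out by hand, next to the one-liner that does the same thing with a pipeline and `cross_val_score`. It uses synthetic data, k=5, and a fixed `random_state` so that both versions build identical forests (all assumptions for the example).

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression data (assumption: stands in for the training set)
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
folds = KFold(n_splits=5, shuffle=True, random_state=0)

manual_scores = []
for train_idx, holdout_idx in folds.split(X_demo):
    # Preprocess the training folds (fit the scaler on them only)
    fold_scaler = StandardScaler().fit(X_demo[train_idx])
    X_train_scaled = fold_scaler.transform(X_demo[train_idx])

    # Train on the same training folds
    model = RandomForestRegressor(n_estimators=20, random_state=0)
    model.fit(X_train_scaled, y_demo[train_idx])

    # Preprocess the hold-out fold with the SAME fitted scaler, then evaluate (R^2)
    X_holdout_scaled = fold_scaler.transform(X_demo[holdout_idx])
    manual_scores.append(model.score(X_holdout_scaled, y_demo[holdout_idx]))

# The pipeline version: preprocessing is refit inside every fold automatically
pipe = make_pipeline(StandardScaler(), RandomForestRegressor(n_estimators=20, random_state=0))
pipeline_scores = cross_val_score(pipe, X_demo, y_demo, cv=folds)

print(np.allclose(manual_scores, pipeline_scores))
```

The pipeline saves you from writing the loop, and it removes the chance of accidentally fitting the scaler on the hold-out fold.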

In [31]:
# Sklearn cross-validation with pipeline
# GridSearchCV essentially performs cross-validation across the entire "grid" (all possible combinations) of hyperparameters.
clf = GridSearchCV(pipeline, hyperparameters, cv=10)

# Fit and tune model
clf.fit(X_train, y_train)

GridSearchCV(cv=10, error_score='raise',
       estimator=Pipeline(memory=None,
     steps=[('standardscaler', StandardScaler(copy=True, with_mean=True, with_std=True)), ('randomforestregressor', RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features='auto', max_leaf_nodes=None,
           min_impurity_decr...mators=100, n_jobs=1,
           oob_score=False, random_state=None, verbose=0, warm_start=False))]),
       fit_params=None, iid=True, n_jobs=1,
       param_grid={'randomforestregressor__max_features': ['auto', 'sqrt', 'log2'], 'randomforestregressor__max_depth': [None, 5, 3, 1]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=0)

In [32]:
# Now, you can see the best set of parameters found using CV:
print(clf.best_params_)

{'randomforestregressor__max_depth': None, 'randomforestregressor__max_features': 'log2'}


# Step 7: Refit on the entire training set
After you've tuned your hyperparameters appropriately using cross-validation, you can generally get a small performance improvement by refitting the model on the entire training set.

Conveniently, GridSearchCV from sklearn will automatically refit the model with the best set of hyperparameters using the entire training set.
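As a quick illustration of what the refit gives you, the sketch below runs a tiny grid search on synthetic data and checks that calling `predict` on the search object is the same as calling it on the refitted `best_estimator_`. The data, the two-value grid, and `cv=3` are assumptions chosen to keep the example fast.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression data (assumption: stands in for the training set)
X_demo, y_demo = make_regression(n_samples=150, n_features=5, noise=10.0, random_state=0)

pipe = make_pipeline(StandardScaler(), RandomForestRegressor(n_estimators=10, random_state=0))
small_grid = {'randomforestregressor__max_depth': [None, 3]}

# refit=True is the default, so the best pipeline is retrained on all of X_demo
search = GridSearchCV(pipe, small_grid, cv=3)
search.fit(X_demo, y_demo)

print(search.best_params_)  # winning hyperparameters
print(search.best_score_)   # mean cross-validated score of the winner

# The search object delegates predict() to the refitted best pipeline
same = np.allclose(search.predict(X_demo), search.best_estimator_.predict(X_demo))
print(same)
```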

In [33]:
print(clf.refit)
# True

True


Now, you can simply use the clf object as your model when applying it to other sets of data. That's what we'll be doing in the next step.

# Step 8: Evaluate model pipeline on test data

In [34]:
# Predict a new set of data
y_pred = clf.predict(X_test)

# Now we can use the metrics we imported earlier to evaluate our model performance
print(r2_score(y_test, y_pred))
print(mean_squared_error(y_test, y_pred))

0.4691494642532841
0.34254375000000004
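If you're wondering what those two numbers actually measure, here's a sketch that computes both metrics by hand on a handful of made-up values and checks them against scikit-learn. The arrays `y_true_toy` and `y_pred_toy` are invented for illustration.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Made-up true values and predictions (not from the wine model)
y_true_toy = np.array([5.0, 6.0, 5.0, 7.0, 4.0])
y_pred_toy = np.array([5.2, 5.6, 5.1, 6.4, 4.9])

residuals = y_true_toy - y_pred_toy

# Mean squared error: the average squared residual
mse_manual = np.mean(residuals ** 2)

# R^2: 1 minus (residual sum of squares / total sum of squares around the mean)
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y_true_toy - y_true_toy.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print(np.isclose(mse_manual, mean_squared_error(y_true_toy, y_pred_toy)))
print(np.isclose(r2_manual, r2_score(y_true_toy, y_pred_toy)))

# The square root of the MSE is in the same units as the target
rmse_manual = np.sqrt(mse_manual)
print(rmse_manual)
```

Taking the square root of the MSE reported above (about 0.34) gives a typical error of roughly 0.59 quality points.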


Great, so now the question is... is this ***performance good enough***?

Well, the rule of thumb is that your very first model probably won't be the best possible model. However, we recommend a combination of three strategies to decide if you're satisfied with your model performance.
* Start with the goal of the model. If the model is tied to a business problem, have you successfully solved the problem?
* Look in academic literature to get a sense of the current performance benchmarks for specific types of data.
* Try to find low-hanging fruit in terms of ways to improve your model.

There are various ways to improve a model. We'll have more guides that go into detail about how to improve model performance, but here are a few quick things to try:
* Try other regression model families (e.g. regularized regression, boosted trees, etc.).
* Collect more data if it's cheap to do so.
* Engineer smarter features after spending more time on exploratory analysis.
* Speak to a domain expert to get more context (...this is a good excuse to go wine tasting!).

As a final note, when you try other families of models, we recommend using the same training and test set as you used to fit the random forest model. That's the best way to get a true apples-to-apples comparison between your models.
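Here's one way such an apples-to-apples comparison might look. This is only a sketch on synthetic data: `Ridge` is used as a stand-in for "regularized regression", the hyperparameters are left untuned, and the dictionary names are made up for the example. The key point is that both candidates see exactly the same split.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression data (assumption: stands in for the wine data)
X_demo, y_demo = make_regression(n_samples=300, n_features=6, noise=15.0, random_state=0)

# One split, reused for every candidate model
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=123)

candidates = {
    'random forest': make_pipeline(StandardScaler(),
                                   RandomForestRegressor(n_estimators=50, random_state=0)),
    'ridge regression': make_pipeline(StandardScaler(), Ridge(alpha=1.0)),
}

# Fit each candidate on the same training set and score it on the same test set
test_mse = {}
for name, candidate in candidates.items():
    candidate.fit(X_tr, y_tr)
    test_mse[name] = mean_squared_error(y_te, candidate.predict(X_te))

print(test_mse)
```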

# Step 9: Save model for future use

In [35]:
# Save model to a .pkl file
joblib.dump(clf, 'rf_regressor.pkl')

['rf_regressor.pkl']

In [None]:
# When you want to load the model again, simply use this function:
# Load model from .pkl file
# clf2 = joblib.load('rf_regressor.pkl')

# Predict dataset using loaded model
# clf2.predict(X_test)
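Here's a self-contained sketch of the full save-and-load round trip. It trains a small forest on synthetic data, writes it to a temporary directory, and confirms the reloaded model makes the same predictions. The file name `demo_model.pkl` and the variable names are made up for this example.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic regression data and a small model (assumption: stand-ins for the tuned pipeline)
X_demo, y_demo = make_regression(n_samples=100, n_features=4, noise=5.0, random_state=0)
model = RandomForestRegressor(n_estimators=10, random_state=0).fit(X_demo, y_demo)

# Save to a temporary directory, then load it back
with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, 'demo_model.pkl')
    joblib.dump(model, model_path)
    restored = joblib.load(model_path)

# The restored model should reproduce the original predictions exactly
round_trip_ok = np.array_equal(model.predict(X_demo), restored.predict(X_demo))
print(round_trip_ok)
```

Keep in mind that a saved model should be loaded with the same scikit-learn version that created it, and that you should only load pickle files from sources you trust.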