## Learning Objectives

Today we will cover the full ML pipeline in all of its glory, from good clean data (that is a big if) to the final predictions.

## The full picture

With cross validation we can show you the full picture of model building (after you have done the hard work of data munging). The magic that cross validation unlocks is twofold:

1. It allows you to train on more of your data, which tends to improve performance, and it gives a more reliable estimate of that performance, because every training row is used for validation in some fold.
2. It actually simplifies the process. You no longer need to keep three sets of data (train, validation, and test) in your mental model. You can get by with just two (train and test), because the validation splits are carved out of the training set during cross validation.
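
To make the two points above concrete, here is a minimal sketch of 5-fold cross validation. The synthetic data from `make_regression`, the tree depth, and the fold count are assumptions chosen for illustration, not part of this lesson's dataset.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (an assumption; not the dataset used in this lesson).
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

# 5-fold CV: each fold trains on 80% of the rows and scores on the other 20%,
# so every row is used for validation exactly once across the folds.
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(
    DecisionTreeRegressor(max_depth=4, random_state=0),
    X,
    y,
    cv=cv,
    scoring="neg_mean_absolute_error",
)

print(len(scores))     # one score per fold
print(-scores.mean())  # average mean absolute error across the folds
```

No separate validation set is created by hand here: the folds play that role.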

Let's get started:

In [1]:
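# Note: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2,
# so this cell needs an older scikit-learn release to run as written.
from sklearn.datasets import load_boston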
from sklearn.model_selection import train_test_split

boston_data = load_boston()

# we make our test set
X_train, X_test, y_train, y_test = train_test_split(boston_data['data'], boston_data['target'], test_size=0.2, random_state=1)

# and we make our validation set (X_val and y_val are not used below:
# GridSearchCV cross-validates inside X_train on its own)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.2, random_state=1)
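
As a quick sanity check on the two splits above, this sketch reproduces them on placeholder arrays with the Boston dataset's shape (506 rows, 13 features) and prints how many rows land in each set. The zero-filled arrays are an assumption used only to count rows.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays with the Boston dataset's shape (506 rows, 13 features).
X = np.zeros((506, 13))
y = np.zeros(506)

# First split: hold out 20% of all rows as the test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1
)
# Second split: hold out 20% of the remaining rows as the validation set.
X_train, X_val, y_train, y_val = train_test_split(
    X_train, y_train, test_size=0.2, random_state=1
)

print(len(X_train), len(X_val), len(X_test))  # 323 81 102
```

Two successive 20% splits leave roughly 64% of the original rows for training.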

Our next step will be to define the model that we are looking at:

In [2]:
from sklearn.tree import DecisionTreeRegressor

reg = DecisionTreeRegressor()

Then we determine which parameters we would like to search over:

In [3]:
params = {
    'max_depth': range(2, 20, 2),
    'min_samples_leaf': range(5, 25, 5)
}
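
It helps to know how big this grid is before searching it. The sketch below counts the candidates with `ParameterGrid`. The fold count of 5 is an assumption (it depends on the `cv` argument and on your scikit-learn version's default).

```python
from sklearn.model_selection import ParameterGrid

params = {
    "max_depth": list(range(2, 20, 2)),         # 2, 4, ..., 18
    "min_samples_leaf": list(range(5, 25, 5)),  # 5, 10, 15, 20
}

n_candidates = len(ParameterGrid(params))  # 9 depths x 4 leaf sizes
n_folds = 5  # assumption: 5-fold CV; use whatever you pass as `cv`

print(n_candidates)            # 36
print(n_candidates * n_folds)  # 180 model fits, plus one refit if refit=True
```

The cost of grid search grows multiplicatively with every parameter you add.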

And finally we use GridSearchCV, which tries every combination of the parameters and uses cross validation to score each one:

In [4]:
from sklearn.model_selection import GridSearchCV

gs = GridSearchCV(reg, params, scoring='neg_mean_absolute_error')

In [5]:
gs.fit(X_train, y_train)

GridSearchCV(cv=None, error_score='raise',
       estimator=DecisionTreeRegressor(criterion='mse', max_depth=None, max_features=None,
           max_leaf_nodes=None, min_impurity_split=1e-07,
           min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, presort=False, random_state=None,
           splitter='best'),
       fit_params={}, iid=True, n_jobs=1,
       param_grid={'max_depth': [2, 4, 6, 8, 10, 12, 14, 16, 18], 'min_samples_leaf': [5, 10, 15, 20]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
       scoring='neg_mean_absolute_error', verbose=0)

We get a lot of goodies. We can see the best score and estimator:

In [6]:
gs.best_score_

-3.20857327399123
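
The score above is negative because `neg_mean_absolute_error` reports the negated mean absolute error, so that higher is always better for the search. Here is a minimal sketch of reading the score back as a plain MAE and looking at the per-candidate results. It uses synthetic data and a smaller grid, both assumptions for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (an assumption; not the dataset used in this lesson).
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

search = GridSearchCV(
    DecisionTreeRegressor(random_state=0),
    {"max_depth": [2, 4, 6], "min_samples_leaf": [5, 10]},
    scoring="neg_mean_absolute_error",
    cv=5,
)
search.fit(X, y)

# Flip the sign to get the mean absolute error in the target's own units.
best_mae = -search.best_score_

# cv_results_ holds one mean cross-validated score per candidate.
mean_scores = search.cv_results_["mean_test_score"]

print(best_mae)
print(search.best_params_)
print(len(mean_scores))  # 6 candidates: 3 depths x 2 leaf sizes
```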

In [7]:
gs.best_estimator_

DecisionTreeRegressor(criterion='mse', max_depth=8, max_features=None,
           max_leaf_nodes=None, min_impurity_split=1e-07,
           min_samples_leaf=10, min_samples_split=2,
           min_weight_fraction_leaf=0.0, presort=False, random_state=None,
           splitter='best')

And because refit=True, the grid search object refits the best estimator on all of X_train, so we can use it directly as that estimator:

In [8]:
gs.predict(X_train[:5])

array([ 16.5       ,  16.97      ,  23.42941176,  19.02727273,  35.19166667])
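
The predictions above are on training rows. The usual last step is to score the tuned model once on the held-out test set. Here is a minimal sketch of that step, assuming synthetic data and a small grid in place of the lesson's own.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (an assumption; not the dataset used in this lesson).
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1
)

search = GridSearchCV(
    DecisionTreeRegressor(random_state=0),
    {"max_depth": [2, 4, 6], "min_samples_leaf": [5, 10]},
    scoring="neg_mean_absolute_error",
    cv=5,
)
search.fit(X_train, y_train)  # the test set is never seen during the search

# With refit=True (the default), predict() delegates to best_estimator_.
test_pred = search.predict(X_test)
same = np.allclose(test_pred, search.best_estimator_.predict(X_test))

test_mae = mean_absolute_error(y_test, test_pred)
print(same)      # True
print(test_mae)  # error on data the search never touched
```

Because the test set played no part in choosing the parameters, this number is a fairer estimate of real-world error than the cross-validated best score.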

## A note on hyperparam tuning

Grid search may look a bit old school within the next few years. With advances such as random search, Hyperband, and Bayesian hyperparameter search, we may end up using smarter ways to explore the available parameters. That said, grid search is good to know and is still widely used in ML.
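
As one example of the alternatives mentioned above, here is a minimal sketch of random search with scikit-learn's `RandomizedSearchCV`. The synthetic data, the sampling ranges, and `n_iter=10` are assumptions for illustration.

```python
from scipy.stats import randint
from sklearn.datasets import make_regression
from sklearn.model_selection import RandomizedSearchCV
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (an assumption; not the dataset used in this lesson).
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

# Distributions to sample from, instead of a fixed list of values.
param_distributions = {
    "max_depth": randint(2, 20),         # integers 2..19 (upper bound excluded)
    "min_samples_leaf": randint(5, 25),  # integers 5..24
}

search = RandomizedSearchCV(
    DecisionTreeRegressor(random_state=0),
    param_distributions,
    n_iter=10,  # only 10 sampled candidates instead of the full grid
    scoring="neg_mean_absolute_error",
    cv=5,
    random_state=0,
)
search.fit(X, y)

print(search.best_params_)
print(len(search.cv_results_["params"]))  # 10
```

The budget is set by `n_iter` rather than by the size of the grid, which is what makes random search attractive when there are many parameters.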

## Comprehension Questions

1. Is what we outlined the full ML picture? Can you think of any more parts?
2. Could you automate this ML pipeline?
3. What percent of the job do you think training and grid searching are?
