# Tuning the whole pipeline with Cross Validation


In this notebook we will see how Grid Search Cross Validation can tune not only the parameters of the model but also the parameters of every transformer in a pipeline, which helps us find the best preprocessing strategy for our data.

## 1. Pipeline creation

As shown in the previous notebooks, here we clean the data, split it and create a pipeline:

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
from sklearn.pipeline import make_pipeline


# read the data from Google Drive
url = "https://drive.google.com/file/d/12kg6J-pEbPkJ2k9p0zdxhdbDrvFMUptZ/view?usp=share_link"
path = 'https://drive.google.com/uc?export=download&id='+url.split('/')[-2]
data = pd.read_csv(path)
data.head()


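| | LotArea | LotFrontage | TotalBsmtSF | BedroomAbvGr | Fireplaces | PoolArea | GarageCars | WoodDeckSF | ScreenPorch | Expensive | MSZoning | Condition1 | Heating | Street | CentralAir | Foundation |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 8450 | 65.0 | 856 | 3 | 0 | 0 | 2 | 0 | 0 | 0 | RL | Norm | GasA | Pave | Y | PConc |
| 1 | 9600 | 80.0 | 1262 | 3 | 1 | 0 | 2 | 298 | 0 | 0 | RL | Feedr | GasA | Pave | Y | CBlock |
| 2 | 11250 | 68.0 | 920 | 3 | 1 | 0 | 2 | 0 | 0 | 0 | RL | Norm | GasA | Pave | Y | PConc |
| 3 | 9550 | 60.0 | 756 | 3 | 1 | 0 | 3 | 0 | 0 | 0 | RL | Norm | GasA | Pave | Y | BrkTil |
| 4 | 14260 | 84.0 | 1145 | 4 | 1 | 0 | 3 | 192 | 0 | 0 | RL | Norm | GasA | Pave | Y | PConc |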


In [2]:
data.columns

Index(['LotArea', 'LotFrontage', 'TotalBsmtSF', 'BedroomAbvGr', 'Fireplaces',
       'PoolArea', 'GarageCars', 'WoodDeckSF', 'ScreenPorch', 'Expensive',
       'MSZoning', 'Condition1', 'Heating', 'Street', 'CentralAir',
       'Foundation'],
      dtype='object')

In [3]:
# X and y creation
X = data.drop(columns=['MSZoning', 'Condition1', 'Heating', 'Street', 'CentralAir','Foundation'])
y = X.pop("Expensive")

In [4]:
# feature selection: only numericals
X_num = X.select_dtypes(include="number").copy()

# data splitting
X_num_train, X_num_test, y_train, y_test = train_test_split(X_num, y, test_size=0.2, random_state=123)

# initialize transformers & model
imputer = SimpleImputer()
dtree = DecisionTreeClassifier()
 
# Create a pipeline
pipe = make_pipeline(imputer,
                     dtree)

## 2. Cross Validation with the whole pipeline

We can see the steps in the pipeline below. Note that make_pipeline has given each step a name automatically: simpleimputer and decisiontreeclassifier. We will use these names when defining the parameter grid for the cross validation.

In [5]:
pipe
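
If the rendered pipeline is not visible, the step names and the tunable parameter keys can also be listed in code. The following is a minimal sketch that rebuilds the same two-step pipeline under the hypothetical name demo_pipe, so that it runs on its own:

```python
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

# same two steps as the notebook's pipeline, rebuilt so the snippet is self-contained
demo_pipe = make_pipeline(SimpleImputer(), DecisionTreeClassifier())

# make_pipeline names each step after its class, in lowercase
step_names = list(demo_pipe.named_steps)
print(step_names)

# every key of the form "<step>__<parameter>" can be used in a parameter grid
tunable = sorted(key for key in demo_pipe.get_params() if "__" in key)
print(tunable)
```

The printed keys (for example simpleimputer__strategy) are exactly the strings that a parameter grid expects.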

To define the parameter grid for cross validation, you need to create a dictionary, where:

- Each key is the name of a pipeline step, followed by two underscores and the name of the parameter you want to tune.
- Each value is a list (or a range) with all the values you want to try for that parameter.

In [6]:
param_grid = {
    "simpleimputer__strategy":["mean", "median"],
    "decisiontreeclassifier__max_depth": range(2, 14),
    "decisiontreeclassifier__min_samples_leaf": range(3, 10),
    "decisiontreeclassifier__criterion":["gini", "entropy"]
}
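
Before launching a search it can help to know how many combinations a grid contains, since every combination is fitted once per fold. As a small sketch, ParameterGrid from scikit-learn can count them; the names grid, n_candidates and n_folds are only illustrative, and the 10 folds match the value used in this notebook:

```python
from sklearn.model_selection import ParameterGrid

# same grid as above, repeated so the snippet runs on its own
grid = {
    "simpleimputer__strategy": ["mean", "median"],
    "decisiontreeclassifier__max_depth": range(2, 14),
    "decisiontreeclassifier__min_samples_leaf": range(3, 10),
    "decisiontreeclassifier__criterion": ["gini", "entropy"],
}

# number of parameter combinations: 2 * 12 * 7 * 2
n_candidates = len(ParameterGrid(grid))

# each combination is fitted once per fold
n_folds = 10
print(n_candidates, n_candidates * n_folds)  # 336 3360
```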

When defining the cross validation, we pass our pipeline (pipe), our parameter grid (param_grid) and the number of folds (cv, a free choice that is usually 5 or 10). You can also set the parameter verbose if you want to receive a bit more information about the CV task.

In [7]:
from sklearn.model_selection import GridSearchCV

search = GridSearchCV(pipe,
                      param_grid,
                      cv=10,
                      verbose=1)

Fit your "search" to the training data (X and y), just as we did before with the model alone or with the pipeline:

In [8]:
search.fit(X_num_train, y_train)

Fitting 10 folds for each of 336 candidates, totalling 3360 fits
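
To make the "candidates times folds" message less opaque, here is a rough sketch of what the search does internally: it scores every parameter combination with cross validation and keeps the one with the best average. It runs on synthetic stand-in data rather than the housing data, and the names X_demo, y_demo, demo_pipe, small_grid and demo_search are assumptions made for this illustration:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.model_selection import GridSearchCV, ParameterGrid, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

# synthetic stand-in data (assumption): 200 rows, 5 features, about 5% missing values
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)
rng = np.random.default_rng(0)
X_demo[rng.random(X_demo.shape) < 0.05] = np.nan

demo_pipe = make_pipeline(SimpleImputer(), DecisionTreeClassifier(random_state=0))
small_grid = {
    "simpleimputer__strategy": ["mean", "median"],
    "decisiontreeclassifier__max_depth": [2, 4],
}

# manual search: score every combination with 5-fold CV and keep the best mean
manual_scores = []
for params in ParameterGrid(small_grid):
    demo_pipe.set_params(**params)
    manual_scores.append(cross_val_score(demo_pipe, X_demo, y_demo, cv=5).mean())
best_manual = max(manual_scores)

# GridSearchCV runs the same loop, then refits the best candidate on all the data
demo_search = GridSearchCV(demo_pipe, small_grid, cv=5).fit(X_demo, y_demo)
print(best_manual, demo_search.best_score_)
```

Because the tree uses a fixed random_state and both approaches use the same default fold splitting, the two printed scores should agree.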


Explore the best parameters and the best score achieved with your cross validation:

In [9]:
search.best_params_

{'decisiontreeclassifier__criterion': 'gini',
 'decisiontreeclassifier__max_depth': 5,
 'decisiontreeclassifier__min_samples_leaf': 6,
 'simpleimputer__strategy': 'mean'}

In [10]:
# cross validation average accuracy
search.best_score_

0.9263778367226644
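
The best score is only one row of the full results. As a hedged sketch, the cv_results_ attribute of a fitted search can be loaded into a DataFrame to compare all candidates. The example uses synthetic data and a plain decision tree to stay short, and the names demo_search, results and ranked are illustrative:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# synthetic stand-in data (assumption), not the housing data
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)

demo_search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    {"max_depth": [2, 4, 6]},
    cv=5,
).fit(X_demo, y_demo)

# one row per candidate, with the mean and spread of its accuracy across folds
results = pd.DataFrame(demo_search.cv_results_)
ranked = results.sort_values("rank_test_score")[
    ["params", "mean_test_score", "std_test_score", "rank_test_score"]
]
print(ranked)
```

Looking at std_test_score next to mean_test_score shows whether the top candidates are clearly better or differ only within fold-to-fold noise.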

In [11]:
# training accuracy
y_train_pred = search.predict(X_num_train)

accuracy_score(y_train, y_train_pred)

0.9383561643835616

In [12]:
# testing accuracy
y_test_pred = search.predict(X_num_test)

accuracy_score(y_test, y_test_pred)

0.9315068493150684
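
Calling predict on the search object works because, with the default refit=True, GridSearchCV refits the best candidate on the whole training set and forwards predict to it. A minimal sketch on synthetic data (the names demo_search, X_tr, X_te, y_tr and y_te are assumptions for this illustration):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# synthetic stand-in data (assumption), split the same way as in the notebook
X_demo, y_demo = make_classification(n_samples=300, n_features=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=123)

demo_search = GridSearchCV(DecisionTreeClassifier(random_state=0), {"max_depth": [2, 4, 6]}, cv=5)
demo_search.fit(X_tr, y_tr)

# search.predict forwards to the refitted best estimator
same = np.array_equal(demo_search.predict(X_te), demo_search.best_estimator_.predict(X_te))

# the test set was never seen during the search, so this is an unbiased check
test_acc = accuracy_score(y_te, demo_search.predict(X_te))
print(same, test_acc)
```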

**Exercise 1:**
Add a scaler to the pipeline, and use GridSearchCV to tune the parameters of the scaler, as well as the parameters of the imputer and the decision tree.



In [13]:
from sklearn.preprocessing import StandardScaler

# initialize transformers & model
imputer = SimpleImputer()
scaler = StandardScaler()
dtree = DecisionTreeClassifier()

# create the pipeline
pipe = make_pipeline(imputer,
                     scaler,
                     dtree)

# create parameter grid
param_grid = {
    "simpleimputer__strategy":["mean", "median"],
    "standardscaler__with_mean":[True, False],
    "standardscaler__with_std":[True, False],
    "decisiontreeclassifier__max_depth": range(2, 14),
    "decisiontreeclassifier__min_samples_leaf": range(3, 10),
    "decisiontreeclassifier__criterion":["gini", "entropy"]
}

# define cross validation
search = GridSearchCV(pipe,
                      param_grid,
                      cv=10,
                      verbose=1)

# fit
search.fit(X_num_train, y_train)

# cross validation average accuracy
search.best_score_

Fitting 10 folds for each of 1344 candidates, totalling 13440 fits


0.9263778367226644

In [14]:
# best parameters
search.best_params_

{'decisiontreeclassifier__criterion': 'gini',
 'decisiontreeclassifier__max_depth': 5,
 'decisiontreeclassifier__min_samples_leaf': 6,
 'simpleimputer__strategy': 'mean',
 'standardscaler__with_mean': True,
 'standardscaler__with_std': True}