# Creating a sklearn pipeline and applying Cross Validation



- Create a Pipeline with your transformers and your model.
- Use GridSearchCV to find the best parameters for your model.


The following articles on the platform will help you complete this notebook:

- Scikit-Learn Pipelines
- Grid Search & Cross Validation


## 1. Data reading and preprocessing


We will first review everything we did in the previous notebook.

In [1]:
import pandas as pd

In [2]:
url = "https://drive.google.com/file/d/1X0ysrPjRZrdI_Tpz919KcjriQ_NOFRTm/view?usp=share_link"
path = 'https://drive.google.com/uc?export=download&id='+url.split('/')[-2]


data = pd.read_csv(path)
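The URL manipulation above can be read as a small helper. This is only a sketch: the function name `drive_download_url` is hypothetical, and it assumes the share link has the form `.../file/d/<FILE_ID>/view?...`, as the one used in this notebook does.

```python
def drive_download_url(share_url: str) -> str:
    """Turn a Google Drive share link into a direct-download link."""
    # In a link like .../file/d/<FILE_ID>/view?usp=share_link,
    # the file id is the second-to-last segment when splitting on "/".
    file_id = share_url.split("/")[-2]
    return "https://drive.google.com/uc?export=download&id=" + file_id


url = "https://drive.google.com/file/d/1X0ysrPjRZrdI_Tpz919KcjriQ_NOFRTm/view?usp=share_link"
path = drive_download_url(url)
print(path)
```

The resulting `path` is the same string that the cell above builds by hand and can be passed to `pd.read_csv`.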

In [3]:
data.head()

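| | LotArea | LotFrontage | TotalBsmtSF | BedroomAbvGr | Fireplaces | PoolArea | GarageCars | WoodDeckSF | ScreenPorch | Expensive |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 8450 | 65.0 | 856 | 3 | 0 | 0 | 2 | 0 | 0 | 0 |
| 1 | 9600 | 80.0 | 1262 | 3 | 1 | 0 | 2 | 298 | 0 | 0 |
| 2 | 11250 | 68.0 | 920 | 3 | 1 | 0 | 2 | 0 | 0 | 0 |
| 3 | 9550 | 60.0 | 756 | 3 | 1 | 0 | 3 | 0 | 0 | 0 |
| 4 | 14260 | 84.0 | 1145 | 4 | 1 | 0 | 3 | 192 | 0 | 0 |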


In [4]:
data.columns

Index(['LotArea', 'LotFrontage', 'TotalBsmtSF', 'BedroomAbvGr', 'Fireplaces',
       'PoolArea', 'GarageCars', 'WoodDeckSF', 'ScreenPorch', 'Expensive'],
      dtype='object')

### 1.1. Setting X and y


- X: the columns that help us make a prediction.
- y: the column that we want to predict.

In [5]:
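X = data.copy()  # work on a copy so that `data` keeps its 'Expensive' column
# If the data had non-numerical columns, we could keep only the numerical ones:
# X_num = X.select_dtypes(include="number")

y = X.pop('Expensive')  # remove the target column from X and store it as y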
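To see what `pop` and `select_dtypes` do, here is a minimal sketch on a toy DataFrame. The `Street` column and all the values are made up for illustration; the real dataset in this notebook only has numerical columns.

```python
import pandas as pd

toy = pd.DataFrame({
    "LotArea": [8450, 9600, 11250],
    "Street": ["Pave", "Pave", "Grvl"],  # hypothetical non-numerical column
    "Expensive": [0, 0, 1],
})

X_toy = toy.copy()              # copy, so `toy` is left untouched
y_toy = X_toy.pop("Expensive")  # removes the column from X_toy and returns it

# Keep only the numerical feature columns.
X_toy_num = X_toy.select_dtypes(include="number")
print(list(X_toy_num.columns))
```

After this, `X_toy_num` holds only `LotArea`, and `y_toy` holds the target values.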

### 1.3. Data splitting

Create the train and test sets.

In [6]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=123
)
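As a quick illustration of what the split produces, the sketch below runs `train_test_split` on synthetic arrays (not the housing data). The `stratify` argument is an optional addition that the cell above does not use; it keeps the class proportions equal in both sets, which can help when one class is rare.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 100 rows, 2 features, 20% positive labels.
X_demo = np.arange(200).reshape(100, 2)
y_demo = np.array([0] * 80 + [1] * 20)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo,
    test_size=0.2,      # 20% of the rows go to the test set
    random_state=123,   # makes the shuffle reproducible
    stratify=y_demo,    # optional: preserve the class balance in both sets
)

print(X_tr.shape, X_te.shape)
print(y_tr.mean(), y_te.mean())
```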

### 1.4. Imputing missing values

(Fit on train, transform train & test)

In [7]:
from sklearn.impute import SimpleImputer

my_imputer = SimpleImputer() # initialize
my_imputer.fit(X_train) # fit on the train set
X_imputed_train = my_imputer.transform(X_train) # transform the train set
X_imputed_test = my_imputer.transform(X_test) # transform the test set
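The "fit on train, transform train & test" rule means the test set is filled with values learned from the train set only. A minimal sketch with made-up numbers:

```python
import numpy as np
from sklearn.impute import SimpleImputer

train = np.array([[1.0], [3.0], [np.nan]])
test = np.array([[np.nan], [10.0]])

imp = SimpleImputer()  # the default strategy is "mean"
imp.fit(train)         # learns the column mean from the train set only

print(imp.statistics_)      # the learned fill value per column
print(imp.transform(test))  # the missing test value is filled with the train mean
```

The train mean is (1 + 3) / 2 = 2, so the missing test value becomes 2 rather than anything derived from the test data.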

### 1.5. Modelling: Decision Tree

In [8]:
# 1. import the model
from sklearn.tree import DecisionTreeClassifier 
# 2. initialize the model
my_tree = DecisionTreeClassifier(max_depth=4,
                                 min_samples_leaf=10
                                )
# 3. fit the model to the train data
my_tree.fit(X = X_imputed_train, 
            y = y_train)

### 1.6. Check accuracy on the train set

In [9]:
from sklearn.metrics import accuracy_score

y_pred_tree_train = my_tree.predict(X_imputed_train)

accuracy_score(y_true = y_train,
               y_pred = y_pred_tree_train)

0.9238013698630136

### 1.7. Check accuracy on the test set

In [10]:
y_pred_tree_test = my_tree.predict(X_imputed_test)

accuracy_score(y_true = y_test,
               y_pred = y_pred_tree_test)

0.9212328767123288
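Accuracy is simply the fraction of predictions that match the true labels. The sketch below, using made-up labels, computes it by hand and compares the result with `accuracy_score`:

```python
import numpy as np
from sklearn.metrics import accuracy_score

y_true_demo = np.array([0, 0, 1, 1, 0])
y_pred_demo = np.array([0, 1, 1, 1, 0])  # one mistake out of five

manual_accuracy = (y_true_demo == y_pred_demo).mean()
sklearn_accuracy = accuracy_score(y_true=y_true_demo, y_pred=y_pred_demo)

print(manual_accuracy, sklearn_accuracy)
```

Train and test accuracy that are close to each other, as in the two cells above, suggest that the tree is not badly overfitting the train set.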

## 2. Creating a Pipeline

Before moving forward in our quest to improve the model, take a moment to learn how to use Scikit-Learn Pipelines. They will not improve your model's performance, but they let you bundle all the data preparation and modelling steps into a single object. This becomes increasingly useful as we move forward and keep adding new steps.

Read the lesson "Scikit-Learn Pipelines" on the platform.

Check the docs: https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.make_pipeline.html

In [11]:
from sklearn.impute import SimpleImputer
from sklearn.tree import DecisionTreeClassifier
from sklearn.pipeline import make_pipeline

# 1. Initialize transformers & model
imputer = SimpleImputer(strategy="median")
dtree = DecisionTreeClassifier(max_depth=4,
                               min_samples_leaf=10)

# 2. Create a pipeline
pipe = make_pipeline(imputer, dtree)

# 3. Fit the pipeline to the training data
pipe.fit(X_train, y_train)

# 4. Use the pipeline to make predictions
pipe.predict(X_test)

array([0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0,
       0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0,
       0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0,
       0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1,
       1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 1,
       0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0,
       1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0,
       0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0])

Now, the object `pipe` can take (almost) raw data as input and output predictions. We no longer need to impute missing values and use the model to make predictions in separate steps.
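`make_pipeline` names each step automatically after its class, in lowercase. Those names matter because step parameters are addressed as `<step name>__<parameter>`. A small sketch (the variable `demo_pipe` is just for illustration):

```python
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

demo_pipe = make_pipeline(
    SimpleImputer(strategy="median"),
    DecisionTreeClassifier(max_depth=4, min_samples_leaf=10),
)

# Step names are generated from the class names, lowercased.
step_names = list(demo_pipe.named_steps)
print(step_names)

# A step's parameter can be changed through the pipeline itself.
demo_pipe.set_params(decisiontreeclassifier__max_depth=3)
print(demo_pipe.named_steps["decisiontreeclassifier"].max_depth)
```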

## 3. Using GridSearchCV to find the best parameters


So far, we tuned the hyperparameters of the decision tree manually. This is not ideal, for two reasons:

- It's not an efficient way to find the best combination of parameters.
- If we keep checking the performance on the test set over and over again, we might end up creating a model that fits that particular test set but does not generalize as well to new data. Test sets are meant to remain unseen until the very last moment of ML development, so we have been cheating a bit!

Grid Search Cross Validation solves both issues:

Read the lesson "Housing Prices: Iteration 2, Grid Search & Cross Validation" on the platform.

Check the docs: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html
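To illustrate the cross-validation half of the idea on its own, the sketch below scores a single pipeline with 5-fold cross validation. It uses synthetic data from `make_classification` rather than the housing data, so the numbers it prints say nothing about this notebook's model.

```python
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for a training set.
X_syn, y_syn = make_classification(n_samples=300, n_features=6, random_state=0)

cv_pipe = make_pipeline(SimpleImputer(),
                        DecisionTreeClassifier(max_depth=4, random_state=0))

# The data is split into 5 folds; each fold is used once for validation
# while the pipeline is fitted on the other 4.
scores = cross_val_score(cv_pipe, X_syn, y_syn, cv=5, scoring="accuracy")
print(scores, scores.mean())
```

Because every score comes from data the model was not fitted on, the test set can stay untouched while we compare settings. Grid search repeats this procedure for every combination of parameters in a grid.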

In [12]:
# 1. Initialize transformers & model
imputer = SimpleImputer()
dtree = DecisionTreeClassifier()

# 2. Create a pipeline
pipe = make_pipeline(imputer, dtree)

# 3. Define the parameter grid; keys follow the pattern <step name>__<parameter>
param_grid = {
    'decisiontreeclassifier__max_depth': range(2, 12),
    'decisiontreeclassifier__min_samples_leaf': range(5, 20, 2),
    'decisiontreeclassifier__min_samples_split': range(5, 40, 10),
    'decisiontreeclassifier__criterion': ['gini', 'entropy']
    }

from sklearn.model_selection import GridSearchCV

# 4. Define the grid search with cross validation
search = GridSearchCV(pipe, # you have defined this beforehand
                      param_grid, # your parameter grid
                      cv=5, # the value for K in K-fold Cross Validation
                      scoring='accuracy', # the performance metric to use
                      verbose=1) # we want informative outputs during the training process

In [13]:
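# The pipeline already contains the imputer, so we pass the raw train set:
# imputation is then fitted inside each cross-validation fold.
search.fit(X_train, y_train)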

Fitting 5 folds for each of 640 candidates, totalling 3200 fits
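The 640 candidates reported above are all the combinations of values in the parameter grid: 10 × 8 × 4 × 2. As a sketch, `ParameterGrid` can be used to count them before launching a search (the variable names here are for illustration only):

```python
from sklearn.model_selection import ParameterGrid

grid = {
    'decisiontreeclassifier__max_depth': range(2, 12),              # 10 values
    'decisiontreeclassifier__min_samples_leaf': range(5, 20, 2),    # 8 values
    'decisiontreeclassifier__min_samples_split': range(5, 40, 10),  # 4 values
    'decisiontreeclassifier__criterion': ['gini', 'entropy'],       # 2 values
}

n_candidates = len(ParameterGrid(grid))
n_fits = n_candidates * 5  # one fit per candidate per fold, with cv=5

print(n_candidates, n_fits)
```

Counting first is a cheap way to check how long a grid search is likely to take, since the number of fits grows multiplicatively with every parameter added.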


In [14]:
search.best_params_

{'decisiontreeclassifier__criterion': 'entropy',
 'decisiontreeclassifier__max_depth': 6,
 'decisiontreeclassifier__min_samples_leaf': 5,
 'decisiontreeclassifier__min_samples_split': 25}

In [15]:
search.best_score_

0.9229558710245405
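`best_score_` is the mean cross-validated accuracy of the best candidate, computed on the train set only. A common last step is to check the refitted best model once on the test set. The sketch below shows that flow on synthetic data with a deliberately tiny grid, so its numbers are unrelated to the housing results above.

```python
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the housing data.
X_syn, y_syn = make_classification(n_samples=300, n_features=6, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn,
                                          test_size=0.2, random_state=123)

small_search = GridSearchCV(
    make_pipeline(SimpleImputer(), DecisionTreeClassifier(random_state=0)),
    {'decisiontreeclassifier__max_depth': [2, 4, 6]},  # tiny grid for speed
    cv=5,
    scoring='accuracy',
)
small_search.fit(X_tr, y_tr)

# By default (refit=True) the best candidate is refitted on the whole
# train set, so the search object can predict directly.
test_accuracy = accuracy_score(y_true=y_te, y_pred=small_search.predict(X_te))

print(small_search.best_params_)
print(small_search.best_score_, test_accuracy)
```

In this notebook, the equivalent call would be `search.predict(X_test)` followed by `accuracy_score`, done once after the search is finished.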