## Using Grid Search with Cross-Validation to Find the Best Parameters for a Model
The goal of this exercise is to use grid search to find the best parameters for a DecisionTreeClassifier. You will use the Cars dataset that you worked with previously.

In [1]:
# import libraries
import pandas as pd

In [2]:
# create headers for data
_headers = ['buying', 'maint', 'doors', 'persons', 'lug_boot', 'safety', 'car']

In [3]:
# read in cars dataset
df = pd.read_csv('https://raw.githubusercontent.com/'\
                 'PacktWorkshops/The-Data-Science-Workshop/'\
                 'master/Chapter07/Dataset/car.data', names=_headers, index_col=None)
print(df.shape)
df.head()

(1728, 7)


,buying,maint,doors,persons,lug_boot,safety,car
0,vhigh,vhigh,2,2,small,low,unacc
1,vhigh,vhigh,2,2,small,med,unacc
2,vhigh,vhigh,2,2,small,high,unacc
3,vhigh,vhigh,2,2,med,low,unacc
4,vhigh,vhigh,2,2,med,med,unacc


In [4]:
# encode categorical variables
_df = pd.get_dummies(df, columns=['buying', 'maint', 'doors', 'persons', 'lug_boot', 'safety'])
_df.head()

,car,buying_high,buying_low,buying_med,buying_vhigh,maint_high,maint_low,maint_med,maint_vhigh,doors_2,...,doors_5more,persons_2,persons_4,persons_more,lug_boot_big,lug_boot_med,lug_boot_small,safety_high,safety_low,safety_med
0,unacc,0,0,0,1,0,0,0,1,1,...,0,1,0,0,0,0,1,0,1,0
1,unacc,0,0,0,1,0,0,0,1,1,...,0,1,0,0,0,0,1,0,0,1
2,unacc,0,0,0,1,0,0,0,1,1,...,0,1,0,0,0,0,1,1,0,0
3,unacc,0,0,0,1,0,0,0,1,1,...,0,1,0,0,0,1,0,0,1,0
4,unacc,0,0,0,1,0,0,0,1,1,...,0,1,0,0,0,1,0,0,0,1


In [5]:
# split the data into features and labels
features = _df.drop(['car'], axis=1).values
labels = _df['car'].values  # 1-D array of class labels, the shape scikit-learn expects for y

In [6]:
# import libraries
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV

In this step, you import NumPy, a numerical computation library, and alias it as np. You also import DecisionTreeClassifier, which you use to create decision trees. Finally, you import GridSearchCV, which uses cross-validation to train and evaluate one model for each parameter combination.

In [7]:
clf = DecisionTreeClassifier()

In [8]:
params = {'max_depth': np.arange(1, 8)}

In this step, you create a dictionary of parameters. There are two parts to this dictionary:

The key of the dictionary is a parameter that is passed into the model. In this case, max_depth is a parameter that DecisionTreeClassifier takes.

The value is a sequence of candidate values that grid search iterates over and passes to the model. In this case, np.arange(1, 8) creates a NumPy array that starts at 1 and ends at 7, inclusive.
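To make the structure of the parameter dictionary concrete, the following sketch expands a grid into the individual settings that grid search would try. It uses `ParameterGrid` from scikit-learn; the second grid, which adds a `criterion` entry, is a hypothetical extension and not part of the exercise.

```python
import numpy as np
from sklearn.model_selection import ParameterGrid

# Same grid as in the exercise: max_depth values 1 through 7
params = {'max_depth': np.arange(1, 8)}
depths = params['max_depth'].tolist()

# ParameterGrid enumerates every combination described by the dictionary
single_grid = list(ParameterGrid(params))

# Hypothetical wider grid (an assumption for illustration only):
# 7 depths x 2 criteria = 14 combinations
wider_params = {'max_depth': np.arange(1, 8), 'criterion': ['gini', 'entropy']}
wider_grid = list(ParameterGrid(wider_params))

print(depths)
print(len(single_grid), len(wider_grid))
```

Each extra key multiplies the number of combinations, so the number of models trained grows quickly as the grid widens.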

In [9]:
# instantiate GridSearchCV
clf_cv = GridSearchCV(clf, param_grid=params, cv=5)

In this step, you create an instance of GridSearchCV. The first argument is the model to train. The second, param_grid, is the dictionary of parameters to search over. The third, cv, is the number of cross-validation folds to create.
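As a rough sketch of what cv=5 means, the code below scores each max_depth value by hand with five stratified folds and compares the result with what GridSearchCV records. It runs on synthetic data from `make_classification` rather than the Cars dataset, and it fixes `random_state` so that both approaches build identical trees; both choices are assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption; not the Cars dataset)
X, y = make_classification(n_samples=300, n_features=8, n_informative=4,
                           random_state=0)

depths = np.arange(1, 8)

# Grid search with 5 folds, as in the exercise
search = GridSearchCV(DecisionTreeClassifier(random_state=0),
                      param_grid={'max_depth': depths}, cv=5)
search.fit(X, y)

# Manual equivalent: for a classifier, an integer cv uses stratified folds
folds = StratifiedKFold(n_splits=5)
manual_means = [
    cross_val_score(DecisionTreeClassifier(max_depth=int(d), random_state=0),
                    X, y, cv=folds).mean()
    for d in depths
]

print(np.round(manual_means, 3))
print(np.round(search.cv_results_['mean_test_score'], 3))
```

With 7 candidate depths and 5 folds, the search fits 35 models, plus one final refit of the best candidate on all of the data.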

In [10]:
# train the models
clf_cv.fit(features, labels)

GridSearchCV(cv=5, estimator=DecisionTreeClassifier(),
             param_grid={'max_depth': array([1, 2, 3, 4, 5, 6, 7])})

In [11]:
# print the best parameter
print("Tuned Decision Tree Parameters: {}".format(clf_cv.best_params_))

Tuned Decision Tree Parameters: {'max_depth': 2}


In the preceding output, you see that the best-performing model is the one with a max_depth of 2.

Accessing best_params_ lets you train another model with the best-known parameters using a larger training dataset.
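One way to reuse best_params_ is sketched below: search on a smaller subset, then unpack the winning parameters into a fresh classifier and train it on more data. The synthetic data and the names `X_small`, `X_full`, and `final_model` are assumptions for illustration and do not come from the exercise.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for a larger dataset (an assumption)
X_full, y_full = make_classification(n_samples=400, n_features=8,
                                     n_informative=4, random_state=0)

# Run the search on half of the data only
X_small, _, y_small, _ = train_test_split(X_full, y_full, train_size=0.5,
                                          stratify=y_full, random_state=0)

search = GridSearchCV(DecisionTreeClassifier(random_state=0),
                      param_grid={'max_depth': np.arange(1, 8)}, cv=5)
search.fit(X_small, y_small)

# Unpack the best parameters into a new model and train on all the data
final_model = DecisionTreeClassifier(random_state=0, **search.best_params_)
final_model.fit(X_full, y_full)

print(search.best_params_)
```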

In [12]:
# print the best mean cross-validated score (accuracy, the default for classifiers)
print("Best score is {}".format(clf_cv.best_score_))

Best score is 0.7778822149618833
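The best score is the mean cross-validated score of the winning candidate. To see how it relates to the other candidates, the sketch below loads `cv_results_` into a DataFrame. It uses synthetic data as a stand-in for the Cars features, so the numbers it prints will differ from the output above.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption; not the Cars dataset)
X, y = make_classification(n_samples=300, n_features=8, n_informative=4,
                           random_state=0)

search = GridSearchCV(DecisionTreeClassifier(random_state=0),
                      param_grid={'max_depth': np.arange(1, 8)}, cv=5)
search.fit(X, y)

# One row per candidate, with its mean and spread across the 5 folds
results = pd.DataFrame(search.cv_results_)[
    ['param_max_depth', 'mean_test_score', 'std_test_score', 'rank_test_score']
]

# best_score_ is the mean_test_score of the row at best_index_
best_mean = results['mean_test_score'].iloc[search.best_index_]

print(results)
print(best_mean, search.best_score_)
```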


In [13]:
# access the best model
model = clf_cv.best_estimator_
model

DecisionTreeClassifier(max_depth=2)
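Because GridSearchCV refits the best candidate on all of the data passed to fit by default (refit=True), best_estimator_ is ready to make predictions. The sketch below shows this on held-out data; the synthetic dataset and the train/test split are assumptions added for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption; not the Cars dataset)
X, y = make_classification(n_samples=400, n_features=8, n_informative=4,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

search = GridSearchCV(DecisionTreeClassifier(random_state=0),
                      param_grid={'max_depth': np.arange(1, 8)}, cv=5)
search.fit(X_train, y_train)

# The best model, already refit on the full training set
model = search.best_estimator_
pred_model = model.predict(X_test)

# Calling predict on the search object delegates to best_estimator_
pred_search = search.predict(X_test)

# Accuracy on data that the search never saw
test_accuracy = model.score(X_test, y_test)
print(test_accuracy)
```

Scoring on a held-out set gives an estimate that is independent of the folds used to pick max_depth.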