In the code below, we load the digits dataset, which contains 64 feature variables. Each feature is the darkness of one pixel, on a scale from 0 to 16, in an 8 by 8 image of a handwritten digit.

In [3]:
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn import datasets, svm
import matplotlib.pyplot as plt

In [4]:
# load the digit data
digits = datasets.load_digits()

In [6]:
# view features of the first observation
digits.data[0:1]

array([[ 0.,  0.,  5., 13.,  9.,  1.,  0.,  0.,  0.,  0., 13., 15., 10.,
        15.,  5.,  0.,  0.,  3., 15.,  2.,  0., 11.,  8.,  0.,  0.,  4.,
        12.,  0.,  0.,  8.,  8.,  0.,  0.,  5.,  8.,  0.,  0.,  9.,  8.,
         0.,  0.,  4., 11.,  0.,  1., 12.,  7.,  0.,  0.,  2., 14.,  5.,
        10., 12.,  0.,  0.,  0.,  0.,  6., 13., 10.,  0.,  0.,  0.]])
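To make the link between the 64 features and the 8 by 8 image concrete, here is a minimal sketch that reshapes the first row back into a grid. The variable name `first_image` is only illustrative; `digits.images` is the attribute of the loaded dataset that stores the same pixels as 8 by 8 arrays.

```python
import numpy as np
from sklearn import datasets

digits = datasets.load_digits()

# reshape the 64 feature values of the first observation into an 8x8 grid
first_image = digits.data[0].reshape(8, 8)

# digits.images stores the same pixels already arranged as 8x8 arrays
print(np.array_equal(first_image, digits.images[0]))

# each row of the printed grid is one row of pixels in the image
print(first_image)
```

Reading the grid row by row, the larger values trace the outline of the handwritten digit.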

The target is a vector containing each image’s true digit. For example, the first observation is a handwritten ‘0’.

In [7]:
# view the target of the first observation
digits.target[0:1]

array([0])

To demonstrate cross validation and parameter tuning, we first divide the digits data into two datasets, `data_1` and `data_2`, each with a feature array (`_X`) and a target vector (`_Y`). `data_1` contains the first 1000 rows of the digits data, while `data_2` contains the remaining 797 rows.

In [8]:
# create dataset 1
data_1_X = digits.data[:1000]
data_1_Y = digits.target[:1000]

# create dataset 2
data_2_X = digits.data[1000:]
data_2_Y = digits.target[1000:]
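As a quick check on the split, the sketch below repeats the slicing and prints the shape of each piece. It assumes nothing beyond the slices used above.

```python
from sklearn import datasets

digits = datasets.load_digits()

# same slices as above: first 1000 rows, then everything after them
data_1_X, data_1_Y = digits.data[:1000], digits.target[:1000]
data_2_X, data_2_Y = digits.data[1000:], digits.target[1000:]

# features are (rows, 64); targets are one label per row
print(data_1_X.shape, data_1_Y.shape)
print(data_2_X.shape, data_2_Y.shape)
```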

## Create Parameter Candidates

Before looking for the combination of parameter values that produces the most accurate model, we must specify the candidate values we want to try. In the code below we define four values for `C` (1, 10, 100, 1000), two values for `gamma` (0.001, 0.0001), and two values for `kernel` (linear, rbf). Because `gamma` is only listed for the rbf kernel, the grid holds 4 linear candidates plus 4 × 2 = 8 rbf candidates, 12 in total. The grid search will try every candidate and select the set of parameters that gives the most accurate model.

In [9]:
parameter_candidates = [{'C': [1,10,100,1000], 'kernel': ['linear']},
                        {'C': [1,10,100,1000], 'gamma': [0.001,0.0001], 'kernel': ['rbf']}]
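To see exactly which combinations the grid expands to, one option is scikit-learn's `ParameterGrid`, which accepts the same list of dictionaries. This is only a sketch for inspection; the variable name `grid` is illustrative.

```python
from sklearn.model_selection import ParameterGrid

parameter_candidates = [{'C': [1, 10, 100, 1000], 'kernel': ['linear']},
                        {'C': [1, 10, 100, 1000], 'gamma': [0.001, 0.0001], 'kernel': ['rbf']}]

# expand the list of dictionaries into individual parameter settings
grid = ParameterGrid(parameter_candidates)

# number of candidate settings the grid search will evaluate
print(len(grid))

# each item is one dictionary of parameter values
for params in grid:
    print(params)
```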

## Conduct Grid Search To Find Parameters Producing Highest Score

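Now we are ready to conduct the grid search using scikit-learn’s `GridSearchCV`, which stands for grid search cross validation. In the scikit-learn version used here, `GridSearchCV` defaults to 3-fold cross validation (the default changed to 5-fold in version 0.22). It uses `StratifiedKFold` when the estimator is a classifier and the target is binary or multiclass, and `KFold` otherwise.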

In [10]:
# create a grid search object from the estimator and the parameter candidates
clf = GridSearchCV(estimator=svm.SVC(), param_grid=parameter_candidates, n_jobs=-1)

# run the grid search on the data_1 features and target
clf.fit(data_1_X, data_1_Y)



GridSearchCV(cv='warn', error_score='raise-deprecating',
       estimator=SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto_deprecated',
  kernel='rbf', max_iter=-1, probability=False, random_state=None,
  shrinking=True, tol=0.001, verbose=False),
       fit_params=None, iid='warn', n_jobs=-1,
       param_grid=[{'C': [1, 10, 100, 1000], 'kernel': ['linear']}, {'C': [1, 10, 100, 1000], 'gamma': [0.001, 0.0001], 'kernel': ['rbf']}],
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=0)
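Since the default number of folds depends on the installed scikit-learn version, one way to keep the search reproducible is to pass the splitter explicitly. The sketch below assumes three stratified folds without shuffling; the `cv` variable name and the choice of `n_splits=3` are assumptions for illustration, and the scores it prints may differ from those shown in this notebook.

```python
from sklearn import datasets, svm
from sklearn.model_selection import GridSearchCV, StratifiedKFold

digits = datasets.load_digits()
data_1_X, data_1_Y = digits.data[:1000], digits.target[:1000]

parameter_candidates = [{'C': [1, 10, 100, 1000], 'kernel': ['linear']},
                        {'C': [1, 10, 100, 1000], 'gamma': [0.001, 0.0001], 'kernel': ['rbf']}]

# assumption: three stratified folds, so each fold keeps the class proportions
cv = StratifiedKFold(n_splits=3)

# pass the splitter explicitly instead of relying on the version default
clf = GridSearchCV(estimator=svm.SVC(), param_grid=parameter_candidates, cv=cv)
clf.fit(data_1_X, data_1_Y)

print(clf.best_params_)
print(clf.best_score_)
```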

In [11]:
# view the best mean cross-validated accuracy
print('Best score for data1: ', clf.best_score_)

Best score for data1:  0.942


In [12]:
# view best parameters
print('Best C:', clf.best_estimator_.C)
print('Best Kernel:', clf.best_estimator_.kernel)
print('Best Gamma:', clf.best_estimator_.gamma)

Best C: 10
Best Kernel: rbf
Best Gamma: 0.001


This tells us that the most accurate model uses **C=10**, the **rbf kernel**, and **gamma=0.001**.
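The best parameters are only one row of what the search records. As a sketch, the `cv_results_` attribute of a fitted `GridSearchCV` can be used to list every candidate with its mean cross-validated score. The setting `cv=3` is an assumption made here so the number of folds is explicit; exact scores can vary with the scikit-learn version.

```python
import numpy as np
from sklearn import datasets, svm
from sklearn.model_selection import GridSearchCV

digits = datasets.load_digits()
data_1_X, data_1_Y = digits.data[:1000], digits.target[:1000]

parameter_candidates = [{'C': [1, 10, 100, 1000], 'kernel': ['linear']},
                        {'C': [1, 10, 100, 1000], 'gamma': [0.001, 0.0001], 'kernel': ['rbf']}]

# assumption: 3 folds, stated explicitly rather than left to the default
clf = GridSearchCV(estimator=svm.SVC(), param_grid=parameter_candidates, cv=3)
clf.fit(data_1_X, data_1_Y)

results = clf.cv_results_

# sort candidates from best rank (1) to worst
order = np.argsort(results['rank_test_score'])

for i in order:
    print(results['rank_test_score'][i],
          round(results['mean_test_score'][i], 3),
          results['params'][i])
```

The first printed row corresponds to `clf.best_params_`, and the remaining rows show how close the other candidates came.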

## Sanity Check Using Second Dataset

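First, we apply the classifier we just trained to the second dataset. Then we train a new support vector classifier from scratch using the parameters found by the grid search. Because `GridSearchCV` was run with `refit=True`, it has already retrained the best parameter setting on all of `data_1`, so both models should give the same score on `data_2`.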

In [13]:
# apply the classifier trained on data1 to data2 and view accuracy score
clf.score(data_2_X, data_2_Y)

0.9698870765370138

In [15]:
# train a new classifier on data_1 using the best parameters found by grid search, then score it on data_2
svm.SVC(C=10, kernel='rbf', gamma=0.001).fit(data_1_X, data_1_Y).score(data_2_X, data_2_Y)

0.9698870765370138
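The matching scores follow from the refit step: after the search, `GridSearchCV` retrains the best setting on all of the data passed to `fit`. The sketch below checks this by comparing predictions directly instead of hard-coding the parameters. The names `manual` and `same` are illustrative, and `cv=3` is an assumption to keep the number of folds explicit.

```python
import numpy as np
from sklearn import datasets, svm
from sklearn.model_selection import GridSearchCV

digits = datasets.load_digits()
data_1_X, data_1_Y = digits.data[:1000], digits.target[:1000]
data_2_X, data_2_Y = digits.data[1000:], digits.target[1000:]

parameter_candidates = [{'C': [1, 10, 100, 1000], 'kernel': ['linear']},
                        {'C': [1, 10, 100, 1000], 'gamma': [0.001, 0.0001], 'kernel': ['rbf']}]

# refit=True is the default, so the best setting is retrained on all of data_1
clf = GridSearchCV(estimator=svm.SVC(), param_grid=parameter_candidates, cv=3)
clf.fit(data_1_X, data_1_Y)

# train a separate classifier from scratch with the parameters the search selected
manual = svm.SVC(**clf.best_params_).fit(data_1_X, data_1_Y)

# compare the two models prediction by prediction on data_2
same = np.array_equal(clf.predict(data_2_X), manual.predict(data_2_X))
print(same)
print(clf.score(data_2_X, data_2_Y), manual.score(data_2_X, data_2_Y))
```

Unpacking `clf.best_params_` avoids copying the values by hand, which keeps the check valid if the candidate grid changes.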