# KNN Parameter Tuning

In [`Segmentation: KNN`](segment-knn.ipynb), we classify pixels as crop or non-crop with a KNN classifier. One parameter of that classifier is the number of neighbors (the K in KNN). To choose its value, we run cross-validation and pick the k with the highest accuracy score. This notebook demonstrates that cross-validation using the training data X (values) and y (classifications) generated in `Segmentation: KNN`. The chosen k is then fed back into `Segmentation: KNN` to create the KNN classifier that predicts each pixel's crop/non-crop designation.

In this notebook, we find that increasing the number of neighbors from 3 to 9 raises mean cross-validated accuracy only marginally (from 0.759 to 0.777) while also increasing run time. We therefore use the smallest number of neighbors: 3.
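As a rough illustration of the procedure described above, the sketch below runs 3-fold cross-validation for each candidate k and keeps the mean accuracy. It does not use the notebook's data: `X_demo` and `y_demo` are synthetic stand-ins produced by `make_classification`, and the sample and feature counts are arbitrary assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the pixel data (assumption: 4 features, 2 classes).
X_demo, y_demo = make_classification(n_samples=300, n_features=4,
                                     n_informative=3, n_redundant=0,
                                     random_state=0)

# Mean 3-fold accuracy for each candidate number of neighbors.
mean_accuracy = {}
for k in range(3, 11, 2):
    fold_scores = cross_val_score(KNeighborsClassifier(n_neighbors=k),
                                  X_demo, y_demo, cv=3)
    mean_accuracy[k] = fold_scores.mean()

# The k with the highest mean accuracy across folds.
best_k = max(mean_accuracy, key=mean_accuracy.get)
print(mean_accuracy)
print("best k:", best_k)
```

`GridSearchCV` automates the same loop and additionally records per-candidate standard deviations and timings.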

In [1]:
from __future__ import print_function

import os

import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import classification_report
from sklearn.neighbors import KNeighborsClassifier as KNN

First, we load the data that was saved in `Segmentation: KNN`.

In [2]:
# Load data
def load_cross_val_data(datafile):
    """Return the training values X and classifications y stored in an .npz file."""
    npzfile = np.load(datafile)
    X = npzfile['X']
    y = npzfile['y']
    return X, y

datafile = os.path.join('data', 'knn_cross_val', 'xy_file.npz')
X, y = load_cross_val_data(datafile)
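
For readers unfamiliar with the `.npz` format, here is a minimal round-trip sketch. It assumes the file was written with `np.savez` using the keys `X` and `y`, which is what the loader above expects. The toy arrays and the temporary file name are hypothetical and unrelated to the real training data.

```python
import os
import tempfile

import numpy as np

# Hypothetical toy arrays standing in for the real training data.
X_toy = np.array([[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]])
y_toy = np.array([0, 1, 0])

# Write both arrays into a single .npz archive under the keys 'X' and 'y'.
tmp_dir = tempfile.mkdtemp()
toy_file = os.path.join(tmp_dir, 'xy_toy.npz')
np.savez(toy_file, X=X_toy, y=y_toy)

# Read them back by key; the context manager closes the archive afterwards.
with np.load(toy_file) as npzfile:
    X_loaded = npzfile['X']
    y_loaded = npzfile['y']

print(X_loaded.shape, y_loaded.shape)
```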

Next, we perform a grid search over the number of neighbors, looking for the value that corresponds to the highest accuracy.

In [3]:
# Candidate numbers of neighbors: 3, 5, 7, 9
tuned_parameters = {'n_neighbors': range(3, 11, 2)}

clf = GridSearchCV(KNN(n_neighbors=3),
                   tuned_parameters,
                   cv=3,
                   verbose=10)
clf.fit(X, y)

print("Best parameters set found on development set:\n")
print(clf.best_params_)

print("Grid scores on development set:\n")

means = clf.cv_results_['mean_test_score']
stds = clf.cv_results_['std_test_score']
res_params = clf.cv_results_['params']
# Report mean accuracy and a spread of two standard deviations per candidate
for mean, std, params in zip(means, stds, res_params):
    print("%0.3f (+/-%0.3f) for %r"
          % (mean, std * 2, params))

Fitting 3 folds for each of 4 candidates, totalling 12 fits
[CV] n_neighbors=3 ...................................................
[CV] .................... n_neighbors=3, score=0.776729, total=  41.6s
[CV] n_neighbors=3 ...................................................


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:  1.8min remaining:    0.0s


[CV] .................... n_neighbors=3, score=0.792561, total=  40.0s
[CV] n_neighbors=3 ...................................................


[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:  3.6min remaining:    0.0s


[CV] .................... n_neighbors=3, score=0.706300, total=  49.5s
[CV] n_neighbors=5 ...................................................


[Parallel(n_jobs=1)]: Done   3 out of   3 | elapsed:  5.7min remaining:    0.0s


[CV] .................... n_neighbors=5, score=0.789614, total=  50.9s
[CV] n_neighbors=5 ...................................................


[Parallel(n_jobs=1)]: Done   4 out of   4 | elapsed:  7.9min remaining:    0.0s


[CV] .................... n_neighbors=5, score=0.800847, total=  47.4s
[CV] n_neighbors=5 ...................................................


[Parallel(n_jobs=1)]: Done   5 out of   5 | elapsed: 10.2min remaining:    0.0s


[CV] .................... n_neighbors=5, score=0.713654, total=  44.9s
[CV] n_neighbors=7 ...................................................


[Parallel(n_jobs=1)]: Done   6 out of   6 | elapsed: 12.4min remaining:    0.0s


[CV] .................... n_neighbors=7, score=0.796888, total=  57.1s
[CV] n_neighbors=7 ...................................................


[Parallel(n_jobs=1)]: Done   7 out of   7 | elapsed: 15.5min remaining:    0.0s


[CV] .................... n_neighbors=7, score=0.806654, total=  51.8s
[CV] n_neighbors=7 ...................................................


[Parallel(n_jobs=1)]: Done   8 out of   8 | elapsed: 18.2min remaining:    0.0s


[CV] .................... n_neighbors=7, score=0.717171, total= 1.0min
[CV] n_neighbors=9 ...................................................


[Parallel(n_jobs=1)]: Done   9 out of   9 | elapsed: 20.8min remaining:    0.0s


[CV] .................... n_neighbors=9, score=0.801349, total= 1.0min
[CV] n_neighbors=9 ...................................................
[CV] .................... n_neighbors=9, score=0.810791, total=  55.9s
[CV] n_neighbors=9 ...................................................
[CV] .................... n_neighbors=9, score=0.719112, total=  52.0s


[Parallel(n_jobs=1)]: Done  12 out of  12 | elapsed: 28.3min finished


Best parameters set found on development set:

{'n_neighbors': 9}
Grid scores on development set:

0.759 (+/-0.075) for {'n_neighbors': 3}
0.768 (+/-0.077) for {'n_neighbors': 5}
0.774 (+/-0.080) for {'n_neighbors': 7}
0.777 (+/-0.082) for {'n_neighbors': 9}


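Although the grid search reports k = 9 as the best parameter, increasing the number of neighbors from 3 to 9 raises mean accuracy only from 0.759 to 0.777, a gain much smaller than the fold-to-fold spread (about +/-0.08). Run time also grows: each fit takes roughly 40 to 50 seconds at k = 3 and roughly 52 to 60 seconds at k = 9. Therefore, we will use the smallest number of neighbors: 3.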
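
One way to make this kind of choice systematic is to take the smallest k whose mean accuracy falls within one standard deviation of the best mean. This is a common heuristic rather than the method used in this notebook; the sketch below applies it to the grid scores printed above (the dictionary and variable names are illustrative).

```python
# Mean accuracy and printed spread (2 * std) per k, copied from the
# grid-search output in this notebook.
grid_scores = {
    3: (0.759, 0.075),
    5: (0.768, 0.077),
    7: (0.774, 0.080),
    9: (0.777, 0.082),
}

# Candidate with the highest mean accuracy.
best_k = max(grid_scores, key=lambda k: grid_scores[k][0])
best_mean, best_two_std = grid_scores[best_k]

# Accept any k whose mean is within one standard deviation of the best mean.
threshold = best_mean - best_two_std / 2.0

# Among the acceptable candidates, prefer the smallest (cheapest) k.
chosen_k = min(k for k, (mean, _) in grid_scores.items() if mean >= threshold)
print(chosen_k)  # prints 3
```

With these scores the rule selects k = 3, which agrees with the choice made above.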