# KNN and Model Persistence

In this module we will learn about the k-nearest neighbors (KNN) model. KNN is a supervised, instance-based model, not a clustering model: it predicts from the labels of the closest training points, and it can be used for both classification and regression. We will also learn about model persistence, which lets us save a fitted model for later use.
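
To make the prediction rule concrete, here is a minimal sketch of KNN classification in plain Python. It is not how scikit-learn implements the model; the function name `knn_predict` and the toy data are made up for illustration, and Euclidean distance with a simple majority vote is assumed.

```python
from collections import Counter
import math


def knn_predict(train_points, train_labels, query, k=3):
    """Predict the label of `query` by majority vote among its k nearest training points."""
    # Euclidean distance from the query to every training point
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    # Keep the k closest points
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    # Majority vote over the labels of those points
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Toy data (hypothetical): two well-separated groups
train_points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.5), (8.0, 8.0), (8.5, 9.0), (9.0, 8.5)]
train_labels = ["a", "a", "a", "b", "b", "b"]

prediction_low = knn_predict(train_points, train_labels, (1.2, 1.4), k=3)
prediction_high = knn_predict(train_points, train_labels, (8.7, 8.2), k=3)
print(prediction_low, prediction_high)
```

Note that nothing is learned ahead of time: the training data is simply stored, and all the work happens when a prediction is requested.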

<b>Functions and attributes in this lecture: </b>
- `sklearn.neighbors` - Contains KNN algorithms.
  - `KNeighborsClassifier` - KNN classifier model.
- `joblib` - Library for serializing Python objects, such as fitted models; it is efficient for objects that hold large NumPy arrays.
  - `dump` - Saving the model for later use.
  - `load` - Loading a saved object.
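
This lecture only uses the classifier, but KNN can also be used for regression. As a small sketch on made-up data, `KNeighborsRegressor` from `sklearn.neighbors` predicts the average target of the nearest neighbors (with the default uniform weights). The arrays below are hypothetical and chosen so the result is easy to check by hand.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Toy one-feature data (hypothetical): the target equals the feature value
X_toy = np.array([[0.0], [1.0], [2.0], [3.0]])
y_toy = np.array([0.0, 1.0, 2.0, 3.0])

# With n_neighbors=2 the prediction is the mean target of the two closest points
regressor = KNeighborsRegressor(n_neighbors=2)
regressor.fit(X_toy, y_toy)

# The two points closest to 1.4 are 1.0 and 2.0, so the prediction is their mean
toy_prediction = regressor.predict(np.array([[1.4]]))
print(toy_prediction)
```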

In [1]:
# Non-sklearn packages
import numpy as np
import pandas as pd

# Sklearn packages
from sklearn.datasets import fetch_covtype
from sklearn.model_selection import GridSearchCV

# Importing the dataset
X_orig, y_orig = fetch_covtype(return_X_y=True, as_frame=True)

# Restricting the dataset to first 10000 rows
X, y = X_orig.iloc[:10000], y_orig.iloc[:10000]

# Printing the description for the dataset
print(fetch_covtype()["DESCR"])

.. _covtype_dataset:

Forest covertypes
-----------------

The samples in this dataset correspond to 30×30m patches of forest in the US,
collected for the task of predicting each patch's cover type,
i.e. the dominant species of tree.
There are seven covertypes, making this a multiclass classification problem.
Each sample has 54 features, described on the
`dataset's homepage <https://archive.ics.uci.edu/ml/datasets/Covertype>`__.
Some of the features are boolean indicators,
while others are discrete or continuous measurements.

**Data Set Characteristics:**

    Classes                        7
    Samples total             581012
    Dimensionality                54
    Features                     int

:func:`sklearn.datasets.fetch_covtype` will load the covertype dataset;
it returns a dictionary-like 'Bunch' object
with the feature matrix in the ``data`` member
and the target values in ``target``. If optional argument 'as_frame' is
set to 'True', it will return ``data`` and ``target`

## Implementing KNN

In [2]:
# Importing the KNN classifier
from sklearn.neighbors import KNeighborsClassifier

In [3]:
# Creating a KNN classifier
KNN = KNeighborsClassifier(n_neighbors=5)

In [4]:
# Training the KNN model
KNN.fit(X, y)

KNeighborsClassifier()

In [5]:
# Setting the hyperparameter values to search over
neighbors = {'n_neighbors': [1, 3, 7, 10]}

In [6]:
# Doing a grid search
grid_neighbor = GridSearchCV(estimator=KNeighborsClassifier(), param_grid=neighbors, cv=5, verbose=2)

In [7]:
# Fitting the model
grid_neighbor.fit(X, y)

Fitting 5 folds for each of 4 candidates, totalling 20 fits
[CV] END ......................................n_neighbors=1; total time=   0.4s
[CV] END ......................................n_neighbors=1; total time=   0.1s
[CV] END ......................................n_neighbors=1; total time=   0.1s
[CV] END ......................................n_neighbors=1; total time=   0.1s
[CV] END ......................................n_neighbors=1; total time=   0.1s
[CV] END ......................................n_neighbors=3; total time=   0.1s
[CV] END ......................................n_neighbors=3; total time=   0.2s
[CV] END ......................................n_neighbors=3; total time=   0.2s
[CV] END ......................................n_neighbors=3; total time=   0.2s
[CV] END ......................................n_neighbors=3; total time=   0.2s
[CV] END ......................................n_neighbors=7; total time=   0.2s
[CV] END ......................................n_

GridSearchCV(cv=5, estimator=KNeighborsClassifier(),
             param_grid={'n_neighbors': [1, 3, 7, 10]}, verbose=2)

In [8]:
# Showing the results
pd.DataFrame(grid_neighbor.cv_results_)

| | mean_fit_time | std_fit_time | mean_score_time | std_score_time | param_n_neighbors | params | split0_test_score | split1_test_score | split2_test_score | split3_test_score | split4_test_score | mean_test_score | std_test_score | rank_test_score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.006947 | 0.002834 | 0.267639 | 0.111709 | 1 | {'n_neighbors': 1} | 0.4515 | 0.535 | 0.4665 | 0.464 | 0.323 | 0.448 | 0.068999 | 4 |
| 1 | 0.004506 | 0.000326 | 0.248493 | 0.00384 | 3 | {'n_neighbors': 3} | 0.4435 | 0.5555 | 0.4775 | 0.47 | 0.3375 | 0.4568 | 0.070362 | 3 |
| 2 | 0.004228 | 0.000497 | 0.279172 | 0.004867 | 7 | {'n_neighbors': 7} | 0.44 | 0.579 | 0.4905 | 0.5025 | 0.3575 | 0.4739 | 0.073263 | 2 |
| 3 | 0.004485 | 0.000503 | 0.276026 | 0.007129 | 10 | {'n_neighbors': 10} | 0.4525 | 0.5945 | 0.505 | 0.5 | 0.357 | 0.4818 | 0.077465 | 1 |
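
Instead of reading the best setting off the results table, a fitted `GridSearchCV` exposes it directly through `best_params_`, `best_score_`, and `best_estimator_`. The sketch below uses synthetic data from `make_classification` as a stand-in (an assumption, so the example runs without downloading the covertype data); the variable names are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (an assumption; not the covertype dataset)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=10, n_informative=5, random_state=0
)

# Same candidate values for n_neighbors as in the lecture
param_grid = {"n_neighbors": [1, 3, 7, 10]}
search = GridSearchCV(KNeighborsClassifier(), param_grid=param_grid, cv=5)
search.fit(X_demo, y_demo)

# Best hyperparameter setting and its mean cross-validated accuracy
print(search.best_params_)
print(search.best_score_)

# With the default refit=True, the best model is refit on all of the data
best_knn = search.best_estimator_
print(best_knn.n_neighbors)
```

Calling `predict` on the fitted search object delegates to this refit best estimator.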


## Model Persistence

In [9]:
# Import joblib
import joblib

In [10]:
# Saving the model
joblib.dump(grid_neighbor, 'knn.joblib')

['knn.joblib']

In [11]:
# Retrieving the model
knn_model = joblib.load('knn.joblib')

In [12]:
# Get a "new" observation
# Double brackets keep a one-row DataFrame, so the feature names match those seen during fit
new_observation = X_orig.iloc[[10000]]

In [13]:
# Predict on the new observation
knn_model.predict(new_observation)



array([7])
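
A quick way to check that persistence worked is to confirm that the reloaded model gives the same predictions as the original. The following is a minimal sketch on synthetic data, writing to a temporary directory; the file name `knn_demo.joblib` and the variable names are hypothetical.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (an assumption; not the covertype dataset)
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)
model = KNeighborsClassifier(n_neighbors=5).fit(X_demo, y_demo)

# Save the fitted model to a temporary file and load it back
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "knn_demo.joblib")
    joblib.dump(model, path)
    restored = joblib.load(path)

# The restored model should reproduce the original predictions exactly
same_predictions = np.array_equal(model.predict(X_demo), restored.predict(X_demo))
print(same_predictions)
```

Because `joblib` builds on pickle, only load files from sources you trust, and ideally load them with the same scikit-learn version that was used to save them.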