# <center>IEE 520: Fall 2019</center>

# <center> Nearest Neighbors (10/03/19)</center>

## <center>Klim Drobnyh (klim.drobnyh@asu.edu)</center>

**NOTE: TO SUPPORT INTERACTIVE PLOTS IN JUPYTER LAB, RUN**

conda install -c conda-forge nodejs

jupyter labextension install @jupyter-widgets/jupyterlab-manager

In [None]:
# For compatibility with Python 2
from __future__ import print_function

# To load datasets
from sklearn import datasets

# To import the models (K-Nearest Neighbors classifier and regressor)
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor

# To measure accuracy
from sklearn import metrics

# To split the data into cross-validation folds
from sklearn.model_selection import KFold

# To support plots
from ipywidgets import interact
import ipywidgets as widgets
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap

import numpy as np

# To display all the plots inline
%matplotlib inline 

In [None]:
# To increase the default figure size (width, height in inches)
plt.rcParams["figure.figsize"] = (20, 10)

## <center>Classification</center>

### <center>Load the data</center>

The Iris flower data set, also known as Fisher's Iris data set, is a multivariate data set introduced by the British statistician and biologist Ronald Fisher in his 1936 paper "The use of multiple measurements in taxonomic problems" as an example of linear discriminant analysis.

The data set consists of 50 samples from each of three species of Iris (Iris setosa, Iris virginica and Iris versicolor). Four features were measured from each sample: the length and the width of the sepals and petals, in centimeters. Based on the combination of these four features, Fisher developed a linear discriminant model to distinguish the species from each other.

In [None]:
# return_X_y must be passed by keyword in recent scikit-learn releases
X, y = datasets.load_iris(return_X_y=True)
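As a quick sanity check on the description above, the following minimal sketch inspects the loaded data. It uses the `Bunch` object returned by `load_iris()` without arguments; the variable names `iris` and `class_counts` are chosen here for illustration only.

```python
import numpy as np
from sklearn import datasets

# Load the full Bunch object to get feature and class names as well
iris = datasets.load_iris()

# 150 samples, 4 features (sepal length/width, petal length/width in cm)
print(iris.data.shape)
print(iris.feature_names)

# Number of samples per class (three species)
class_counts = np.bincount(iris.target)
print(dict(zip(iris.target_names, class_counts)))
```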

## <center>Nearest Neighbors Classifier</center>

You can find a full list of parameters here:

https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html

In [None]:
kfold = KFold(n_splits=5, shuffle=True, random_state=520)

yhat = np.zeros((X.shape[0], ))
# Cross-validation
for train, test in kfold.split(X, y):
    model = KNeighborsClassifier(7, weights='distance')
    model.fit(X[train], y[train])
    yhat[test] = model.predict(X[test])

In [None]:
# ConfusionMatrix comes from pandas_ml, which must be installed first:
# conda install -c conda-forge pandas_ml
# Note: pandas_ml may fail to import with recent pandas/scikit-learn releases.

# On Google Colab, uncomment the next line to install the missing package
# !pip install pandas_ml
from pandas_ml import ConfusionMatrix

In [None]:
cm = ConfusionMatrix(y, yhat)
ax = cm.plot(backend='seaborn', annot=True, fmt='g')
ax.set_title('CV Confusion Matrix')
plt.show()
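The `metrics` module imported at the top can produce the same summary numerically. The sketch below is one possible way to do it, assuming the same folds and model settings as the cross-validation cell above; it repeats the loop so that it runs on its own, and the names `accuracy` and `cm` are illustrative.

```python
import numpy as np
from sklearn import datasets, metrics
from sklearn.model_selection import KFold
from sklearn.neighbors import KNeighborsClassifier

X, y = datasets.load_iris(return_X_y=True)
kfold = KFold(n_splits=5, shuffle=True, random_state=520)

# Collect out-of-fold predictions for every sample
yhat = np.zeros(X.shape[0], dtype=int)
for train, test in kfold.split(X, y):
    model = KNeighborsClassifier(7, weights='distance')
    model.fit(X[train], y[train])
    yhat[test] = model.predict(X[test])

# Overall cross-validated accuracy and the 3x3 confusion matrix
accuracy = metrics.accuracy_score(y, yhat)
cm = metrics.confusion_matrix(y, yhat)
print("CV accuracy: %.3f" % accuracy)
print(cm)
```

Rows of `cm` correspond to true classes and columns to predicted classes, so the diagonal counts the correctly classified samples.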

## <center>Visualization</center>

Here we use only the first two features (sepal length and sepal width) so that the decision regions can be plotted in two dimensions.

In [None]:
X = X[:, :2]

In [None]:
# Here we use a closure to store the related variables
def create_plot_knn_classification(_X, _y):
    X, y = _X, _y
    def plot_knn(k=3, weighted=True):
        h = .02  # step size in the mesh
        cmap_light = ListedColormap(['#FFAAAA', '#AAFFAA', '#AAAAFF'])
        cmap_bold = ListedColormap(['#FF0000', '#00FF00', '#0000FF'])
        if weighted:
            clf = KNeighborsClassifier(k, weights='distance')
        else:
            clf = KNeighborsClassifier(k, weights='uniform')
        clf.fit(X, y)
        x1_min = X[:, 0].min() - 1
        x1_max = X[:, 0].max() + 1
        x2_min = X[:, 1].min() - 1
        x2_max = X[:, 1].max() + 1
        xx1, xx2 = np.meshgrid(np.arange(x1_min, x1_max, h),
                               np.arange(x2_min, x2_max, h))
        Z = clf.predict(np.c_[xx1.ravel(), xx2.ravel()])

        # Put the result into a color plot
        Z = Z.reshape(xx1.shape)
        plt.figure()
        plt.pcolormesh(xx1, xx2, Z, cmap=cmap_light)

        # Also plot the training points
        plt.scatter(X[:, 0], X[:, 1], c=y, cmap=cmap_bold,
                    edgecolor='k', s=20)
        plt.xlim(xx1.min(), xx1.max())
        plt.ylim(xx2.min(), xx2.max())
        plt.title("Nearest neighbor classification (k = %i, %s)"
                  % (k, 'weighted' if weighted else 'unweighted'))
        plt.show()
    return plot_knn

In [None]:
interact(create_plot_knn_classification(X, y), k=(1, 150, 1))
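The slider shows how the decision regions change with `k`. As a numeric complement, the sketch below estimates cross-validated accuracy for a few values of `k` on the same two features. The list `ks` and the dictionary `scores` are illustrative choices, not part of the original notebook, and the fold settings are assumed to match the earlier cross-validation cell.

```python
from sklearn import datasets
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = datasets.load_iris(return_X_y=True)
X2 = X[:, :2]  # sepal length and sepal width only

kfold = KFold(n_splits=5, shuffle=True, random_state=520)

# A few hand-picked values of k (each must not exceed the training fold size)
ks = [1, 3, 7, 15, 31, 61]
scores = {}
for k in ks:
    clf = KNeighborsClassifier(k, weights='uniform')
    # Mean accuracy over the 5 folds
    scores[k] = cross_val_score(clf, X2, y, cv=kfold).mean()
    print("k = %2d: mean CV accuracy = %.3f" % (k, scores[k]))
```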

## <center>Regression (visualization)</center>

You can find a full list of parameters here:

https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsRegressor.html

Here we simulate data from $y = x^2 + \epsilon$, where $\epsilon$ is standard normal noise and $x$ is drawn uniformly from $1 \leq x \leq 10$.

In [None]:
n = 40
np.random.seed(520)
X = np.random.uniform(1, 10, n)
e = np.random.randn(n)
y = X**2 + e
X = X.reshape((n, 1))
y = y.reshape((n,))
plt.plot(X, y, 'bo')
plt.plot(np.arange(1, 10, 0.1), np.arange(1, 10, 0.1)**2, 'r')
plt.show()

In [None]:
# Here we use a closure to store the related variables
def create_plot_knn_regression(_X, _y):
    X, y = _X, _y
    def plot_knn(k=3, weighted=True):
        h = .02  # step size of the prediction grid
        if weighted:
            clf = KNeighborsRegressor(k, weights='distance')
        else:
            clf = KNeighborsRegressor(k, weights='uniform')
        clf.fit(X, y)
        x_min = X[:, 0].min() - 1
        x_max = X[:, 0].max() + 1
        xx = np.arange(x_min, x_max, h)
        xx = xx.reshape((xx.shape[0], 1))
        Z = clf.predict(xx)

        # Plot the KNN prediction (green) and the true function (red)
        plt.figure()
        plt.plot(xx, Z, 'g')
        plt.plot(xx, xx**2, 'r')

        # Also plot the training points
        plt.plot(X, y, 'bo')
        plt.title("Nearest neighbor regression (k = %i, %s)"
                  % (k, 'weighted' if weighted else 'unweighted'))
        plt.show()
    return plot_knn

In [None]:
interact(create_plot_knn_regression(X, y), k=(1, n, 1))
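To make the `weighted` toggle concrete, the sketch below reproduces a single prediction by hand. It assumes data generated in the same way as above; the query point `x0` and the `manual_*` names are hypothetical. With `weights='uniform'` the prediction is the plain mean of the `k` nearest targets, while with `weights='distance'` each neighbor is weighted by the inverse of its distance to the query point.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Simulated data: y = x^2 + standard normal noise, 1 <= x <= 10
rng = np.random.RandomState(520)
n = 40
X = rng.uniform(1, 10, n).reshape((n, 1))
y = X[:, 0] ** 2 + rng.randn(n)

k = 3
x0 = np.array([[5.0]])  # hypothetical query point

model = KNeighborsRegressor(k, weights='distance').fit(X, y)
# Distances to, and indices of, the k nearest training points
dist, idx = model.kneighbors(x0)

# Uniform weighting: simple average of the neighbors' targets
manual_uniform = y[idx[0]].mean()

# Distance weighting: average with weights 1 / distance
w = 1.0 / dist[0]
manual_weighted = np.sum(w * y[idx[0]]) / np.sum(w)

sk_weighted = model.predict(x0)[0]
sk_uniform = KNeighborsRegressor(k, weights='uniform').fit(X, y).predict(x0)[0]

print("uniform : manual = %.4f, sklearn = %.4f" % (manual_uniform, sk_uniform))
print("weighted: manual = %.4f, sklearn = %.4f" % (manual_weighted, sk_weighted))
```

The inverse-distance formula breaks down when the query coincides exactly with a training point (distance zero); scikit-learn treats that case separately, which is why the weighted curve passes through every training point in the plot.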