## K-Nearest Neighbors

KNN is one of the simplest supervised machine learning classifiers. Instead of fitting an explicit model, KNN predicts the class of an observation from the K training observations closest to it, where closeness is measured with a distance metric such as Euclidean, Manhattan, or Minkowski. Those K neighbors then vote with their class labels, and the class with the majority of votes wins.
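The voting procedure can be sketched in a few lines of plain Python. This is only an illustrative sketch, not how scikit-learn implements it: the helper names `minkowski` and `knn_predict` and the toy data are assumptions made up for this example. Minkowski distance with `p=1` is Manhattan distance and with `p=2` is Euclidean distance.

```python
from collections import Counter


def minkowski(a, b, p=2):
    """Minkowski distance; p=1 is Manhattan, p=2 is Euclidean."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)


def knn_predict(X_train, y_train, query, k=3, p=2):
    """Predict the class of `query` by majority vote of its k nearest neighbors."""
    # Pair each training point's distance to the query with its label,
    # then sort so the closest points come first.
    scored = sorted(
        (minkowski(x, query, p), label) for x, label in zip(X_train, y_train)
    )
    # Keep the labels of the k closest points and count the votes.
    votes = Counter(label for _, label in scored[:k])
    return votes.most_common(1)[0][0]


# Toy data (made up for illustration): two well-separated clusters.
X_train = [[1.0, 1.0], [1.2, 0.8], [0.9, 1.1], [4.0, 4.2], [4.1, 3.9], [3.8, 4.0]]
y_train = ["a", "a", "a", "b", "b", "b"]

print(knn_predict(X_train, y_train, [1.1, 1.0], k=3))  # a
print(knn_predict(X_train, y_train, [3.9, 4.1], k=3))  # b
```

Note that there is no training step: all the work happens at prediction time, when distances to every stored observation are computed.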

Because predictions depend on distances, the features should be standardized to a common scale to get the best result; otherwise features with large numeric ranges dominate the distance.
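A small sketch of why scale matters, using made-up data in which one column is in the thousands and the other lies between 0 and 1. The helpers `standardize` and `nearest_index` are hypothetical names for this illustration; `standardize` uses the population standard deviation, which is assumed here to mirror the usual z-score definition.

```python
import math
from statistics import mean, pstdev


def standardize(X):
    """Rescale each column to zero mean and unit (population) standard deviation."""
    columns = list(zip(*X))
    mus = [mean(c) for c in columns]
    sigmas = [pstdev(c) for c in columns]
    return [[(v - m) / s for v, m, s in zip(row, mus, sigmas)] for row in X]


def nearest_index(X, i):
    """Index of the row closest to row i (Euclidean distance), excluding i itself."""
    others = [j for j in range(len(X)) if j != i]
    return min(others, key=lambda j: math.dist(X[i], X[j]))


# Made-up data: column 0 is in the thousands, column 1 is between 0 and 1.
X = [[1000.0, 0.10], [1050.0, 0.90], [1300.0, 0.12], [3000.0, 0.50]]

print(nearest_index(X, 0))               # 1: the large-scale column decides
print(nearest_index(standardize(X), 0))  # 2: both columns now contribute
```

On the raw data, row 1 looks closest to row 0 only because the first column swamps the second. After standardization, row 2, which nearly matches row 0 in the second column, becomes the nearest neighbor.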

We use the Iris flower data set, also known as Fisher's Iris data set, for practice.

In [2]:
# Nearest Neighbors
from sklearn import datasets
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler

iris = datasets.load_iris()
X = iris.data

standardizer = StandardScaler()

X_standard = standardizer.fit_transform(X)

# `metric` sets the distance used between observations,
# e.g. 'euclidean', 'manhattan', or 'minkowski'.
nearest_neighbors = NearestNeighbors(n_neighbors=2, metric='euclidean').fit(X_standard)

# The query point is expressed in standardized units.
observation = [1, 1, 1, 1]
distances, indices = nearest_neighbors.kneighbors([observation])

print("The closest neighbors are: ", X_standard[indices])
print("Distances are: ", distances)

The closest neighbors are:  [[[1.03800476 0.55861082 1.10378283 1.18556721]
  [0.79566902 0.32841405 0.76275827 1.05393502]]]
Distances are:  [[0.49140089 0.74294782]]


In [4]:
# K Nearest Neighbors
from sklearn import datasets
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

iris = datasets.load_iris()
X = iris.data
y = iris.target

standardizer = StandardScaler()

X_std = standardizer.fit_transform(X)

knn = KNeighborsClassifier(n_neighbors=5, n_jobs=-1).fit(X_std, y)

observation = [[0.75, 0.75, 0.75, 0.75],[1, 1, 1, 1]]

knn.predict(observation)


array([1, 2])

# Best K

Choosing the optimal value of the hyperparameter K is an important part of applying the KNN algorithm. If K = 1, the algorithm has high variance. If K = n (the number of observations), the algorithm has high bias. We want a value of K that balances the two, so that the algorithm has both low variance and low bias. One way to choose K is a cross-validated grid search with GridSearchCV.
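The two extremes can be seen on a tiny made-up example. The sketch below uses a hypothetical helper `knn_predict` and invented one-dimensional data with a single mislabeled point; it is meant only to illustrate the trade-off, not to replace cross-validation.

```python
import math
from collections import Counter


def knn_predict(X_train, y_train, query, k):
    """Majority vote among the k training points closest to `query`."""
    order = sorted(range(len(X_train)), key=lambda i: math.dist(X_train[i], query))
    votes = Counter(y_train[i] for i in order[:k])
    return votes.most_common(1)[0][0]


# Made-up 1-D data: class "a" on the left, class "b" on the right,
# plus one mislabeled "b" (at 1.5) sitting inside the "a" region.
X_train = [[0.0], [1.0], [2.0], [3.0], [1.5], [8.0], [9.0]]
y_train = ["a", "a", "a", "a", "b", "b", "b"]

n = len(X_train)
for query in ([1.6], [8.5]):
    preds = {k: knn_predict(X_train, y_train, query, k) for k in (1, 3, n)}
    print(query, preds)
# [1.6] {1: 'b', 3: 'a', 7: 'a'}
# [8.5] {1: 'b', 3: 'b', 7: 'a'}
```

With K = 1 the prediction near 1.6 follows the single mislabeled point (high variance). With K = n every query receives the overall majority class "a", even deep inside the "b" region (high bias). The intermediate value K = 3 gets both queries right on this toy data.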

In [9]:
# Best K
from sklearn import datasets
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV

iris = datasets.load_iris()
X = iris.data
y = iris.target

standardizer = StandardScaler()
knn = KNeighborsClassifier(n_neighbors=5, n_jobs=-1)

# The pipeline standardizes inside each cross-validation fold,
# so it is fit on the raw features rather than pre-standardized ones.
pipe = Pipeline([("standardizer", standardizer), ("knn", knn)])

# Candidate values of K to evaluate.
search_space = [{"knn__n_neighbors": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]}]

classifier = GridSearchCV(pipe, search_space, cv=5, verbose=0).fit(X, y)

print("The best value is", "k =", classifier.best_params_["knn__n_neighbors"])

The best value is k = 6


# Radius-Based Neighbor Classifier

In a radius-based neighbor classifier, instead of using the K nearest neighbors, we predict the class of an observation from the observations that lie within a given radius of it.
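A minimal plain-Python sketch of the idea follows. The function `radius_predict`, its `outlier_label` fallback, and the toy data are assumptions made for this illustration. Unlike KNN, the number of voters varies per query, and a query may have no neighbors at all inside the radius, so the sketch has to decide what to return in that case.

```python
import math
from collections import Counter


def radius_predict(X_train, y_train, query, radius, outlier_label=None):
    """Majority vote among all training points within `radius` of `query`."""
    votes = Counter(
        label
        for x, label in zip(X_train, y_train)
        if math.dist(x, query) <= radius
    )
    if not votes:
        # No training point lies inside the ball around the query.
        return outlier_label
    return votes.most_common(1)[0][0]


# Toy data (made up for illustration): two well-separated clusters.
X_train = [[1.0, 1.0], [1.2, 0.8], [0.9, 1.1], [4.0, 4.2], [4.1, 3.9]]
y_train = ["a", "a", "a", "b", "b"]

print(radius_predict(X_train, y_train, [1.0, 0.9], radius=0.5))  # a
print(radius_predict(X_train, y_train, [2.5, 2.5], radius=0.5))  # None
```

The second query sits between the clusters with no training point within 0.5 of it, so the sketch returns the fallback value instead of a class.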

In [16]:
# Radius Nearest Neighbors
from sklearn import datasets
from sklearn.neighbors import RadiusNeighborsClassifier
from sklearn.preprocessing import StandardScaler

iris = datasets.load_iris()
X = iris.data
y = iris.target

standardizer = StandardScaler()

X_std = standardizer.fit_transform(X)

rnn = RadiusNeighborsClassifier(radius=.5, n_jobs=-1).fit(X_std, y)

observation = [[1, 1, 1, 1]]

rnn.predict(observation)


array([2])