#### K-Nearest Neighbours

To understand the k-nearest neighbor algorithm, we first need to understand nearest neighbor. 

The nearest neighbor algorithm can be used for both regression and classification, but it is most often used for classification because it is simple and intuitive. A nearest neighbor classifier trains very quickly because it only stores the training samples. Prediction is slower, because the classifier must search through all stored examples for the closest match, so the time needed to classify a new point grows with the size of the dataset.

The k-nearest neighbor algorithm is a modification of the nearest neighbor algorithm in which the class label for an input is decided by a vote among the k examples closest to it. That is, the predicted label is the one that receives the majority of votes from those k neighbors.

##### What is the K-Nearest Neighbor (KNN) Algorithm?

The K-Nearest Neighbor algorithm is a classification algorithm that takes a set of labeled points and uses them to label other points.
- It classifies cases based on their similarity to other cases.
- Data points that are near each other are said to be neighbors.
- It relies on the assumption that similar cases with the same class label lie near each other.

#### How the K-Nearest Neighbor Algorithm Works
1. Pick a value for K.
2. Calculate the distance from the unknown case to every case in the training data.
3. Select the K-observations in the training data that are "nearest" to the unknown data point.
4. Predict the response of the unknown data point using the most popular response value from the K-nearest neighbors. 
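The four steps above can be sketched in a few lines of plain Python. This is a minimal illustration rather than a production implementation: the function name `knn_predict` and the toy points are made up for the example, and distances are Euclidean (via `math.dist`, available from Python 3.8).

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k):
    """Classify `query` by majority vote among its k nearest training points."""
    # Step 2: distance from the unknown case to every stored case.
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    # Step 3: keep the k closest observations.
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    # Step 4: the most common label among them is the prediction.
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Toy data (made up for illustration): two clusters in the plane.
points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.5), (6.0, 6.0), (6.5, 7.0), (7.0, 6.5)]
labels = ["A", "A", "A", "B", "B", "B"]

# Step 1: pick k, then classify two new points.
print(knn_predict(points, labels, query=(2.0, 2.0), k=3))  # prints A
print(knn_predict(points, labels, query=(6.0, 6.5), k=3))  # prints B
```

Note that all the work happens at prediction time: nothing is learned beforehand, which is why training is fast and prediction slows down as the stored dataset grows.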

#### What is the best value of K for KNN?

The K in KNN is the number of nearest neighbors to examine. It must be specified by the user.

KNN can also be used for regression.
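For regression, a common approach is to replace the majority vote with the average of the neighbors' target values. The sketch below assumes a single numeric feature and uses made-up data; `knn_regress` is a hypothetical helper name, not a library function.

```python
def knn_regress(train_x, train_y, query, k):
    """Predict a number as the mean target of the k nearest 1-D training points."""
    # Sort training indices by absolute distance to the query.
    order = sorted(range(len(train_x)), key=lambda i: abs(train_x[i] - query))
    nearest = order[:k]
    # Average the targets of the k closest points.
    return sum(train_y[i] for i in nearest) / k

# Made-up data for illustration.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [10.0, 20.0, 30.0, 40.0, 50.0]

# The three nearest x values to 2.9 are 3.0, 2.0 and 4.0.
print(knn_regress(xs, ys, query=2.9, k=3))  # prints 30.0
```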

#### KNN Algorithm In Machine Learning and KNN Algorithm Using Python

KNN is based on feature similarity, and a KNN classifier can be used for classification.

KNN is one of the simplest supervised ML algorithms and is mostly used for classification. It classifies a data point based on how its neighbors are classified. It stores all available cases and classifies new cases based on a similarity measure.

k in KNN is a parameter that refers to the number of nearest neighbors to include in the majority voting process.
For example, with k = 5, a data point is classified by majority vote among its 5 nearest neighbors.

##### How do we choose K?
Choosing the right value of k is a process called parameter tuning, and it is important for accuracy.

To choose a value of k:
- A common rule of thumb is sqrt(n), where n is the total number of data points.
- An odd value of k is preferred so that a vote between two classes cannot end in a tie.
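The two rules of thumb above can be combined into a small helper. This is only a sketch of the heuristic: `suggest_k` is a hypothetical name, and stepping down (rather than up) to the nearest odd number is an arbitrary choice made for the example. The result is a starting point; comparing several values of k on held-out data is the more reliable way to tune it.

```python
import math

def suggest_k(n_samples):
    """Return an odd k near sqrt(n_samples), as a starting point only."""
    k = max(1, round(math.sqrt(n_samples)))
    # Step down to the nearest odd number so a two-class vote cannot tie.
    if k % 2 == 0:
        k -= 1
    return max(1, k)

print(suggest_k(154))  # sqrt(154) is about 12.4, so this prints 11
print(suggest_k(768))  # sqrt(768) is about 27.7, so this prints 27
```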

##### When do we use KNN Algorithm?
- When data is labeled
- When data is noise free
- When dataset is small

#### How does KNN Algorithm work?
- Consider a dataset with two variables, height (cm) and weight (kg), in which each point is classified as Normal or Underweight.
- Given this data, we have to classify a new, unlabeled point as Normal or Underweight using KNN.
- To find the nearest neighbors, we calculate the Euclidean distance, which is the straight-line distance between two points: for points (x1, y1) and (x2, y2) it is sqrt((x1 - x2)^2 + (y1 - y2)^2).
- Take the points with the smallest distances as the nearest neighbors.
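As a worked example of the distance step, the sketch below ranks a handful of height and weight records by their Euclidean distance to a new person. The records and the new point are hypothetical values invented for illustration, and `euclidean` is a helper written for this example.

```python
import math

def euclidean(a, b):
    """Straight-line distance between two points with the same number of features."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

# Hypothetical (height cm, weight kg) records; not taken from any real dataset.
records = [
    ((167, 51), "Underweight"),
    ((182, 62), "Normal"),
    ((176, 69), "Normal"),
    ((173, 64), "Normal"),
    ((172, 65), "Normal"),
]
new_person = (170, 57)

# Distance from the new person to every record, closest first.
ranked = sorted(records, key=lambda rec: euclidean(rec[0], new_person))
for (height, weight), label in ranked[:3]:
    print(height, weight, label, round(euclidean((height, weight), new_person), 2))
```

With k = 3, two of the three closest records are Normal, so the new person would be classified as Normal even though the single closest record is Underweight.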

#### Recap of KNN
- A positive integer k is specified along with a new sample.
- We select the k entries in our database which are closest to the new sample.
- We find the most common classification of these entries.
- This is the classification we give to the new sample.

#### Use Case: Predict Diabetes

Objective: Predict whether a person will be diagnosed with diabetes or not.

In [2]:
# Import pandas, NumPy, and the required scikit-learn modules
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix
from sklearn.metrics import f1_score
from sklearn.metrics import accuracy_score

In [5]:
dataset = pd.read_csv('C:\\Users\\user\\Documents\\Data Science Files\\Jupyter notebooks for Data Science\\diabetes.csv')
print( len(dataset) )
print( dataset.head() )

768
   Pregnancies  Glucose  BloodPressure  SkinThickness  Insulin   BMI  \
0            6      148             72             35        0  33.6   
1            1       85             66             29        0  26.6   
2            8      183             64              0        0  23.3   
3            1       89             66             23       94  28.1   
4            0      137             40             35      168  43.1   

   DiabetesPedigreeFunction  Age  Outcome  
0                     0.627   50        1  
1                     0.351   31        0  
2                     0.672   32        1  
3                     0.167   21        0  
4                     2.288   33        1  


Columns such as 'Glucose' and 'BloodPressure' cannot validly contain zeroes; a zero there most likely stands for a missing measurement and would distort the outcome. We can replace such values with the mean of the respective column:

In [7]:
# Replace zeroes
zero_not_accepted = ['Glucose', 'BloodPressure', 'SkinThickness', 'BMI', 'Insulin']

for column in zero_not_accepted:
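    # Mark zeroes as missing (np.nan; the np.NaN alias was removed in NumPy 2.0)
    dataset[column] = dataset[column].replace(0, np.nan)
    # Mean of the remaining values, truncated to an integer
    mean = int(dataset[column].mean(skipna=True))
    # Fill the missing entries with that mean
    dataset[column] = dataset[column].replace(np.nan, mean)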

In [8]:
print(dataset['Glucose'])

0      148.0
1       85.0
2      183.0
3       89.0
4      137.0
       ...  
763    101.0
764    122.0
765    121.0
766    126.0
767     93.0
Name: Glucose, Length: 768, dtype: float64


Before proceeding further, let's split the dataset into training and test sets:

In [10]:
# Split dataset: the first 8 columns are features, the last column (Outcome) is the label
X = dataset.iloc[:, 0:8]
y = dataset.iloc[:, 8]
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 0, test_size = 0.2)

Rule of thumb: scale your features for any algorithm that computes distances or assumes normality (feature scaling).

In [11]:
# Feature scaling
sc_X = StandardScaler()
X_train = sc_X.fit_transform(X_train)
X_test = sc_X.transform(X_test)
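To see what the scaling step does, here is a minimal standard-library sketch of standardization on one feature. It mirrors the fit-on-train, transform-on-test pattern used above: the mean and standard deviation are learned from the training values only and then reused for the test values. The glucose numbers are made up, and `fit_scaler` and `apply_scaler` are hypothetical helper names; the population standard deviation is used because that is what StandardScaler computes.

```python
import statistics

def fit_scaler(column):
    """Learn the mean and population standard deviation of one feature."""
    return statistics.fmean(column), statistics.pstdev(column)

def apply_scaler(column, mean, std):
    """Standardize values with statistics learned elsewhere (the training set)."""
    return [(value - mean) / std for value in column]

train_glucose = [80.0, 100.0, 120.0, 140.0]   # made-up training values
test_glucose = [110.0, 150.0]                 # made-up test values

mean, std = fit_scaler(train_glucose)          # statistics come from train only
train_scaled = apply_scaler(train_glucose, mean, std)
test_scaled = apply_scaler(test_glucose, mean, std)

print(train_scaled)
print(test_scaled)
```

After scaling, the training values have mean 0 and standard deviation 1, so no single feature dominates the distance calculation just because it is measured in larger units.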

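n_neighbors here is k. p is the power parameter of the Minkowski metric; p = 2 corresponds to the Euclidean distance, which is also the metric named explicitly in our case. To pick k, we take the square root of the number of test samples and use a nearby odd value.
Then define the model using KNeighborsClassifier and fit it to the training data.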

In [13]:
import math
math.sqrt(len(y_test))

12.409673645990857

In [14]:
# Define the model: Init K-NN
classifier = KNeighborsClassifier(n_neighbors = 11, p = 2, metric = 'euclidean')

In [15]:
# Fit Model
classifier.fit(X_train, y_train)

KNeighborsClassifier(metric='euclidean', n_neighbors=11)

In [16]:
# Predict the test set results
y_pred = classifier.predict(X_test)
y_pred


array([1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0,
       0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1,
       1, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 1,
       1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 0, 1, 0, 1,
       0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0],
      dtype=int64)

It is important to evaluate the model. Let's use a confusion matrix to do that:

In [17]:
# Evaluate model
cm = confusion_matrix(y_test, y_pred)
print(cm)

[[94 13]
 [15 32]]


In [18]:
print(f1_score(y_test, y_pred))

0.6956521739130436


In [19]:
print(accuracy_score(y_test, y_pred))

0.8181818181818182
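The F1 and accuracy scores can be recomputed by hand from the confusion matrix, which is a useful sanity check. The sketch below hard-codes the four counts from the matrix printed earlier, reading it in scikit-learn's layout (true labels on rows, predicted labels on columns).

```python
# Counts from the confusion matrix [[94, 13], [15, 32]];
# in scikit-learn's layout this is [[TN, FP], [FN, TP]].
tn, fp = 94, 13
fn, tp = 15, 32

accuracy = (tp + tn) / (tp + tn + fp + fn)   # share of all predictions that are right
precision = tp / (tp + fp)                   # share of predicted positives that are right
recall = tp / (tp + fn)                      # share of actual positives that are found
f1 = 2 * precision * recall / (precision + recall)

print(round(accuracy, 4))
print(round(precision, 4))
print(round(recall, 4))
print(round(f1, 4))
```

The accuracy and F1 values match the scores reported by accuracy_score and f1_score, and the recall shows that 15 of the 47 diabetic cases in the test set were missed.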


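So, we have created a KNN model that predicts whether a person will be diagnosed with diabetes or not.
An accuracy of about 82% on the test set suggests a reasonably good fit, although the lower F1 score of about 0.70 shows that the model is weaker on the positive (diabetic) class.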