# KNN Algorithm in Python

In [1]:

#This tutorial on the K-Nearest Neighbor (KNN) algorithm focuses on classification problems,
#though many of the principles apply to regression as well.

#We will present:

#How the algorithm works to predict classes of data
#How the algorithm can be tweaked to use different types of distances
#How the algorithm works with multiple dimensions
#How to work with categorical or non-numeric data in KNN classification
#How to validate your algorithm and test its effectiveness
#How to improve your algorithm using hyper-parameter tuning in Python


#Let’s get started!

In [2]:
#Using the K-Nearest Neighbor Algorithm in Python’s Scikit-Learn
#In this section, you’ll learn how to use the popular Scikit-Learn (sklearn) library to apply the KNN algorithm.
#To start, we import the libraries we need: pandas, sklearn, and seaborn (which provides the dataset):

In [3]:
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier
import seaborn as sns #Import Python Seaborn
from seaborn import load_dataset

In [4]:
#We will focus on the Penguins dataset, which is available through Seaborn’s load_dataset() function.
#The dataset covers information on different species of penguins, including the island each sample was taken from,
#as well as their bill length and depth.

In [5]:
#The goal is to predict the species of a penguin based on its physical characteristics.
#The dataset covers three species of penguins: Adelie, Chinstrap, and Gentoo.
#The first rows are shown below:

In [6]:
# Load and display the first rows of the penguins dataset

df = load_dataset('penguins')
print(df.head())

  species     island  bill_length_mm  bill_depth_mm  flipper_length_mm  \
0  Adelie  Torgersen            39.1           18.7              181.0   
1  Adelie  Torgersen            39.5           17.4              186.0   
2  Adelie  Torgersen            40.3           18.0              195.0   
3  Adelie  Torgersen             NaN            NaN                NaN   
4  Adelie  Torgersen            36.7           19.3              193.0   

   body_mass_g     sex  
0       3750.0    Male  
1       3800.0  Female  
2       3250.0  Female  
3          NaN     NaN  
4       3450.0  Female  


In [7]:
#Display df

df

     species     island  bill_length_mm  bill_depth_mm  flipper_length_mm  body_mass_g     sex
0     Adelie  Torgersen            39.1           18.7              181.0       3750.0    Male
1     Adelie  Torgersen            39.5           17.4              186.0       3800.0  Female
2     Adelie  Torgersen            40.3           18.0              195.0       3250.0  Female
3     Adelie  Torgersen             NaN            NaN                NaN          NaN     NaN
4     Adelie  Torgersen            36.7           19.3              193.0       3450.0  Female
...      ...        ...             ...            ...                ...          ...     ...
339   Gentoo     Biscoe             NaN            NaN                NaN          NaN     NaN
340   Gentoo     Biscoe            46.8           14.3              215.0       4850.0  Female
341   Gentoo     Biscoe            50.4           15.7              222.0       5750.0    Male
342   Gentoo     Biscoe            45.2           14.8              212.0       5200.0  Female


In [8]:
#We can see that our dataset has six features and one target. Let’s break this down a little bit:

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 344 entries, 0 to 343
Data columns (total 7 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   species            344 non-null    object 
 1   island             344 non-null    object 
 2   bill_length_mm     342 non-null    float64
 3   bill_depth_mm      342 non-null    float64
 4   flipper_length_mm  342 non-null    float64
 5   body_mass_g        342 non-null    float64
 6   sex                333 non-null    object 
dtypes: float64(4), object(3)
memory usage: 18.9+ KB


In [9]:
#Splitting our Data into Training and Testing Data
#We’ll need to split our data into features and target arrays.

#The features array, commonly referred to as X, is expected to be a two-dimensional array (samples by features).
#Meanwhile, the target array, commonly noted as y, is expected to be one-dimensional.
#Let’s focus on only a single feature for now: bill length.
#We’ll extract that column as a DataFrame (rather than as a Series), so that X is two-dimensional, as sklearn expects.
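
#As a small illustration of the DataFrame-versus-Series point above, the sketch below re-types
#the five rows printed by df.head() into a stand-alone DataFrame (the names sample, X_frame and
#y_series are made up for this example) and compares the shapes produced by double and single brackets.

```python
import pandas as pd

# The five rows printed by df.head() above, re-typed by hand for illustration
sample = pd.DataFrame({
    "species": ["Adelie"] * 5,
    "bill_length_mm": [39.1, 39.5, 40.3, float("nan"), 36.7],
})

X_frame = sample[["bill_length_mm"]]   # double brackets -> DataFrame (two-dimensional)
y_series = sample["species"]           # single brackets -> Series (one-dimensional)

print(X_frame.shape)   # (5, 1)
print(y_series.shape)  # (5,)
```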

In [10]:
#Drop NaN values in our dataset
df = df.dropna()

#Splitting our DataFrame into features and target
X = df[['bill_length_mm']]
y = df['species']

In [11]:
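#One important piece to note above is that we’ve dropped any records with missing values.
#Note that dropna() removes rows with a missing value in any column, not just bill_length_mm.
#Technically it may be a good idea to try to impute these values.
#However, this is a bit out of the scope of this tutorial.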
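
#For readers curious about the difference, here is a minimal sketch that contrasts dropping with
#one simple form of imputation (filling with the column mean). It uses the five rows from df.head()
#re-typed by hand; mean imputation is only one possible choice and is not what this tutorial uses.

```python
import pandas as pd

# The five rows printed by df.head() above, re-typed by hand for illustration
sample = pd.DataFrame({
    "species": ["Adelie"] * 5,
    "bill_length_mm": [39.1, 39.5, 40.3, float("nan"), 36.7],
})

# Option 1 (used in this tutorial): drop rows that contain missing values
dropped = sample.dropna()

# Option 2 (illustrative alternative): fill missing bill lengths with the column mean
col_mean = sample["bill_length_mm"].mean()   # NaN values are skipped by default
imputed = sample.fillna({"bill_length_mm": col_mean})

print(len(dropped))                                 # 4 rows remain after dropping
print(round(imputed.loc[3, "bill_length_mm"], 2))   # 38.9, the mean of the other four values
```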

In [12]:
#We can also split our data into training and testing data, so that we can evaluate the model on data
#it has not seen. This helps us detect overfitting.

#This can be done using the train_test_split() function in sklearn, which we need to import first.

#We set a random_state= value so that the split, and therefore our results, are reproducible.
#This is optional, but it’s good practice.
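
#To see what random_state= does, the sketch below splits a small made-up dataset twice with the
#same seed and checks that both splits are identical. The arrays X_toy and y_toy are invented
#for this illustration and have nothing to do with the penguins data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 20 made-up samples with one feature each
X_toy = np.arange(20).reshape(-1, 1)
y_toy = np.array(["a", "b"] * 10)

split_1 = train_test_split(X_toy, y_toy, random_state=100)
split_2 = train_test_split(X_toy, y_toy, random_state=100)

# Same random_state -> identical train/test arrays
same = all(np.array_equal(a, b) for a, b in zip(split_1, split_2))
print(same)  # True

# With no test_size given, 25% of the rows are held out for testing
X_tr, X_te, y_tr, y_te = split_1
print(len(X_tr), len(X_te))  # 15 5
```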

In [13]:
# Splitting data into training and testing data
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 100)

In [14]:
#Now that we have our dataset lined up, let’s take a look at how the KNeighborsClassifier class works in sklearn.

In [15]:
#Understanding KNeighborsClassifier in Sklearn
#Before diving further into using sklearn to run the KNN algorithm,
#let’s take a look at the KNeighborsClassifier class:

In [16]:
KNeighborsClassifier(
    n_neighbors=5,          # The number of neighbours to consider
    weights='uniform',      # How to weight the neighbours' votes (e.g. 'uniform' or 'distance')
    algorithm='auto',       # Algorithm used to find the nearest neighbours
    leaf_size=30,           # Leaf size for the tree-based search algorithms (BallTree, KDTree)
    p=2,                    # Power parameter for the Minkowski metric: p=2 is Euclidean, p=1 is Manhattan
    metric='minkowski',     # The distance metric to use; Minkowski generalizes Euclidean and Manhattan
    metric_params=None,     # Additional keyword arguments for the metric function
    n_jobs=None             # Number of parallel jobs to run for the neighbour search
)

KNeighborsClassifier()

In [17]:
#Let’s focus on the n_neighbors=, weights=, p=, and n_jobs= hyperparameters.

#To kick things off, let’s focus on what we have learned so far:
#measuring distances using the Euclidean distance, and finding the five nearest neighbors.

#To use the Euclidean distance, we can either set the metric= parameter to 'euclidean',
#or keep metric='minkowski' and set the p= parameter to 2 (which is the default).

#The p hyperparameter in KNeighborsClassifier() controls which Minkowski distance is used:
#p = 1 gives the Manhattan distance.
#p = 2 gives the Euclidean distance.
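
#The relationship between p and the two distances can be checked by hand. The helper below,
#minkowski_distance, is a hypothetical function written for this illustration (it is not part of
#sklearn); it computes the Minkowski distance between two made-up points for p = 1 and p = 2.

```python
def minkowski_distance(a, b, p):
    """Minkowski distance between two equal-length sequences of numbers."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)

# Two made-up points in two dimensions
point_a = (0.0, 0.0)
point_b = (3.0, 4.0)

manhattan = minkowski_distance(point_a, point_b, p=1)  # |3| + |4|
euclidean = minkowski_distance(point_a, point_b, p=2)  # sqrt(3^2 + 4^2)

print(manhattan)  # 7.0
print(euclidean)  # 5.0
```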


#n_jobs = -1 means you want to use all the available cores for the neighbor search; if you specify a particular number,
#only that many cores will be used. According to the sklearn documentation, n_jobs does not affect the fit method.

#weights determines whether to weigh all neighbors equally ('uniform') or to give closer neighbors more influence ('distance').
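
#The effect of weights= is easiest to see on a tiny example. In the sketch below, the five
#training points and their labels "A" and "B" are made up so that the two settings disagree:
#the query has one very close "A" neighbor and two farther "B" neighbors.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Made-up one-dimensional training data
X_toy = np.array([[-5.0], [1.0], [2.0], [2.5], [10.0]])
y_toy = np.array(["A", "A", "B", "B", "B"])
query = np.array([[1.1]])

# The 3 nearest neighbors of 1.1 are 1.0 ("A"), 2.0 ("B") and 2.5 ("B")
uniform_clf = KNeighborsClassifier(n_neighbors=3, weights="uniform").fit(X_toy, y_toy)
distance_clf = KNeighborsClassifier(n_neighbors=3, weights="distance").fit(X_toy, y_toy)

print(uniform_clf.predict(query))   # ['B']  (two votes against one)
print(distance_clf.predict(query))  # ['A']  (the very close "A" neighbor outweighs both)
```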


#Conventionally, the classifier object is assigned to a variable clf. 
#Let’s load the class with the parameters discussed above:

In [18]:
# Creating a classifier object in sklearn
clf = KNeighborsClassifier(p=2)

In [19]:
#In the object above, we’ve instantiated a classifier object that uses the Euclidean distance (p=2) 
#and looks for five neighbours (default n_neighbors=5).
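
#If you want to confirm which settings a classifier object ended up with, get_params() returns
#them as a dictionary. The variable names clf_check and params are made up for this short check.

```python
from sklearn.neighbors import KNeighborsClassifier

clf_check = KNeighborsClassifier(p=2)
params = clf_check.get_params()

# Values we did not pass explicitly fall back to their defaults
print(params["n_neighbors"], params["p"], params["metric"])  # 5 2 minkowski
```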

In [20]:
#Now that we have our classifier set up, we can pass in our training data to fit the algorithm.
#For KNN, fitting mainly stores the training data (and may build a search structure). The steps we visually
#undertook earlier in the slide deck, finding the nearest neighbours’ classes for each penguin, happen at prediction time:

In [21]:
# Fitting our model
clf.fit(X_train, y_train)

KNeighborsClassifier()
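
#To peek at what a fitted KNN model does with a new point, the sketch below fits a separate toy
#classifier on seven bill lengths copied from the tables printed earlier and uses kneighbors()
#to list the closest training rows. The choice of n_neighbors=3 and the names bill, species and
#toy_clf are assumptions made for this illustration only.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Bill lengths (mm) and species of seven penguins copied from the tables shown earlier
bill = np.array([[39.1], [39.5], [40.3], [36.7], [46.8], [50.4], [45.2]])
species = np.array(["Adelie"] * 4 + ["Gentoo"] * 3)

toy_clf = KNeighborsClassifier(n_neighbors=3)
toy_clf.fit(bill, species)

query = np.array([[44.2]])
distances, indices = toy_clf.kneighbors(query)

print(indices[0])              # [6 4 2]: positions of the 3 closest training rows
print(species[indices[0]])     # ['Gentoo' 'Gentoo' 'Adelie']: their labels
print(toy_clf.predict(query))  # ['Gentoo']: the majority vote among those labels
```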

In [22]:
#At this point, we’ve trained our model! Sklearn has abstracted away a lot of the complexity
#of the calculation behind the scenes.

#We can now use our model to make predictions on the data.
#To do this, we can use the .predict() method and pass in our testing features:

In [23]:
# Making predictions

predictions = clf.predict(X_test)
print(predictions)

['Adelie' 'Gentoo' 'Chinstrap' 'Adelie' 'Gentoo' 'Gentoo' 'Gentoo'
 'Chinstrap' 'Gentoo' 'Gentoo' 'Gentoo' 'Adelie' 'Adelie' 'Gentoo'
 'Gentoo' 'Chinstrap' 'Chinstrap' 'Adelie' 'Gentoo' 'Gentoo' 'Adelie'
 'Gentoo' 'Adelie' 'Adelie' 'Adelie' 'Adelie' 'Gentoo' 'Chinstrap'
 'Adelie' 'Adelie' 'Adelie' 'Adelie' 'Gentoo' 'Adelie' 'Chinstrap'
 'Gentoo' 'Adelie' 'Gentoo' 'Gentoo' 'Gentoo' 'Adelie' 'Gentoo' 'Adelie'
 'Adelie' 'Chinstrap' 'Chinstrap' 'Chinstrap' 'Adelie' 'Gentoo' 'Gentoo'
 'Gentoo' 'Gentoo' 'Adelie' 'Adelie' 'Gentoo' 'Gentoo' 'Adelie' 'Gentoo'
 'Gentoo' 'Adelie' 'Gentoo' 'Gentoo' 'Gentoo' 'Adelie' 'Adelie' 'Adelie'
 'Chinstrap' 'Adelie' 'Gentoo' 'Gentoo' 'Chinstrap' 'Chinstrap' 'Adelie'
 'Chinstrap' 'Gentoo' 'Gentoo' 'Gentoo' 'Chinstrap' 'Adelie' 'Gentoo'
 'Adelie' 'Adelie' 'Adelie' 'Chinstrap']


In [24]:
#Similarly, if we wanted to pass in the data for a single mock penguin,
#we could pass in a nested list containing that one value.
#Say we measured our own pet penguin’s bill length and found that it was 44.2 mm. We could simply write:

In [25]:
# Making your own prediction for a bill length of 44.2 mm

predictions = clf.predict([[44.2]])
print(predictions)

['Gentoo']
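
#When a model was fitted on a DataFrame, recent versions of sklearn may warn that a plain nested
#list has no feature names. One way around this, sketched below on a separate toy model, is to
#wrap the single value in a DataFrame with the same column name. The seven training rows are
#copied from the tables printed earlier; n_neighbors=3 and the names train, toy_clf and
#my_penguin are assumptions made for this illustration.

```python
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier

# Seven rows copied from the tables shown earlier
train = pd.DataFrame({
    "bill_length_mm": [39.1, 39.5, 40.3, 36.7, 46.8, 50.4, 45.2],
    "species": ["Adelie"] * 4 + ["Gentoo"] * 3,
})

toy_clf = KNeighborsClassifier(n_neighbors=3)
toy_clf.fit(train[["bill_length_mm"]], train["species"])

# A single new sample, using the same column name as the training features
my_penguin = pd.DataFrame({"bill_length_mm": [44.2]})
toy_prediction = toy_clf.predict(my_penguin)
print(toy_prediction)  # ['Gentoo']
```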
