In [1]:
import pandas as pd

In [2]:
from google.colab import files
uploaded = files.upload()

Saving penguins.csv to penguins.csv


In [3]:
import io
penguins = pd.read_csv(io.BytesIO(uploaded['penguins.csv']))

In [4]:
penguins.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 344 entries, 0 to 343
Data columns (total 7 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   species            344 non-null    object 
 1   island             344 non-null    object 
 2   bill_length_mm     342 non-null    float64
 3   bill_depth_mm      342 non-null    float64
 4   flipper_length_mm  342 non-null    float64
 5   body_mass_g        342 non-null    float64
 6   sex                333 non-null    object 
dtypes: float64(4), object(3)
memory usage: 18.9+ KB


I used Python in Google Colab to build this KNN model. The data is an open-source classification data set with measurement, location, and sex data on three species of penguins. It is available on Allison Horst's GitHub site here: https://allisonhorst.github.io/palmerpenguins/articles/articles/art.html

I initially looked at an overview of the data and saw that some rows have missing values: each of the four measurement columns has 2 missing entries, and sex has 11. K-nearest neighbors does not work with missing values, so the first thing I did was drop these rows.

In [5]:
penguins.dropna(how='any', inplace=True)

In [6]:
penguins.isna()

Unnamed: 0,species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex
0,False,False,False,False,False,False,False
1,False,False,False,False,False,False,False
2,False,False,False,False,False,False,False
4,False,False,False,False,False,False,False
5,False,False,False,False,False,False,False
...,...,...,...,...,...,...,...
339,False,False,False,False,False,False,False
340,False,False,False,False,False,False,False
341,False,False,False,False,False,False,False
342,False,False,False,False,False,False,False
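
A more compact way to confirm that no missing values remain is to count them per column instead of printing the full Boolean table. The sketch below shows the idea on a small made-up frame; `toy` is a hypothetical stand-in for `penguins`, not the real data.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the penguins frame, with a few gaps.
toy = pd.DataFrame({
    "species": ["Adelie", "Adelie", "Gentoo", "Chinstrap"],
    "bill_length_mm": [39.1, np.nan, 46.1, 49.0],
    "sex": ["male", None, np.nan, "female"],
})

# Count missing values per column before cleaning.
missing_before = toy.isna().sum()

# Drop every row that has at least one missing value.
clean = toy.dropna(how="any")

# After dropping, every per-column count should be zero.
missing_after = clean.isna().sum()

print(missing_before)
print(missing_after)
print(len(toy), "rows ->", len(clean), "rows")
```

On the real data, `penguins.isna().sum()` would give the same kind of one-line-per-column summary.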


After that, I addressed the two categorical explanatory features in the data set, sex and island. I used pandas' get_dummies function to convert these into dummy variables so that they would work with the KNN algorithm.

In [7]:
penguins_dummy = pd.get_dummies(penguins, columns=['island','sex'])

In [8]:
penguins_dummy.head()

Unnamed: 0,species,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,island_Biscoe,island_Dream,island_Torgersen,sex_female,sex_male
0,Adelie,39.1,18.7,181.0,3750.0,0,0,1,0,1
1,Adelie,39.5,17.4,186.0,3800.0,0,0,1,1,0
2,Adelie,40.3,18.0,195.0,3250.0,0,0,1,1,0
4,Adelie,36.7,19.3,193.0,3450.0,0,0,1,1,0
5,Adelie,39.3,20.6,190.0,3650.0,0,0,1,0,1


The following establishes which features are explanatory for the penguin species, which is what I am trying to classify. I then use StandardScaler to standardize the numerical features in the data set.

In [26]:
# Explanatory features (measurements plus dummies) and the species target.
feature_cols = ['bill_length_mm', 'bill_depth_mm', 'flipper_length_mm', 'body_mass_g',
                'island_Biscoe', 'island_Dream', 'island_Torgersen', 'sex_female', 'sex_male']
X = penguins_dummy[feature_cols]
y = penguins_dummy.species

In [29]:
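from sklearn.preprocessing import StandardScaler

# Standardize the four numeric measurements (zero mean, unit variance).
# Note: the result is only displayed, not assigned, so X above stays unscaled.
scaler = StandardScaler()
scaler.fit_transform(penguins[['bill_length_mm', 'bill_depth_mm', 'flipper_length_mm', 'body_mass_g']])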

array([[-0.89604189,  0.7807321 , -1.42675157, -0.56847478],
       [-0.82278787,  0.11958397, -1.06947358, -0.50628618],
       [-0.67627982,  0.42472926, -0.42637319, -1.1903608 ],
       ...,
       [ 1.02687621,  0.52644436, -0.56928439, -0.53738048],
       [ 1.24663828,  0.93330475,  0.64546078, -0.13315457],
       [ 1.13675725,  0.7807321 , -0.2120064 , -0.53738048]])
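
The cell above displays the standardized array without storing it. For the scaling to affect the classifier, it has to be applied to the features the model is trained on. One common way to do this is a scikit-learn pipeline, which also learns the scaling from the training split only. The sketch below is a minimal illustration under assumptions: it uses synthetic data (`toy_X` and `toy_y` are hypothetical), not the penguin data, and it is not the code that produced the results in this notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Hypothetical two-class data: the first feature is on a millimetre-like scale,
# the second on a gram-like scale, so the second dominates raw distances.
n = 200
toy_y = rng.integers(0, 2, size=n)
toy_X = np.column_stack([
    rng.normal(40 + 5 * toy_y, 2.0, size=n),  # informative, small scale
    rng.normal(4000, 800, size=n),            # uninformative, large scale
])

X_tr, X_te, y_tr, y_te = train_test_split(
    toy_X, toy_y, test_size=0.3, random_state=1
)

# Scaler and classifier in one estimator: fit() learns the scaling from the
# training data only, and predict()/score() reuse it on new data.
scaled_knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=7))
scaled_knn.fit(X_tr, y_tr)

# Same classifier without scaling, for comparison.
raw_knn = KNeighborsClassifier(n_neighbors=7).fit(X_tr, y_tr)

print("scaled:", scaled_knn.score(X_te, y_te))
print("raw:   ", raw_knn.score(X_te, y_te))
```

Because KNN relies on distances, a feature measured in thousands (such as body mass in grams) can outweigh features measured in tens unless the features are put on a comparable scale first.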

In [30]:
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV

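Next, I split the data into training and testing sets, holding out 30% of the data to test my model. Then, I perform a grid search with 5-fold cross-validation. The search shown here is a re-run over 2 to 49 neighbors, where 7 neighbors gives the best score; my initial search, which also allowed a single neighbor, recommended 1 neighbor.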

In [31]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

In [40]:
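import numpy as np

# Candidate values of k: 2 through 49 (np.arange excludes the stop value).
param_grid = {'n_neighbors': np.arange(2, 50)}
knn = KNeighborsClassifier()

# 5-fold cross-validated grid search over the candidates, fit on the training split.
knn_cv = GridSearchCV(knn, param_grid, cv=5)
knn_cv.fit(X_train, y_train)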

GridSearchCV(cv=5, error_score=nan,
             estimator=KNeighborsClassifier(algorithm='auto', leaf_size=30,
                                            metric='minkowski',
                                            metric_params=None, n_jobs=None,
                                            n_neighbors=5, p=2,
                                            weights='uniform'),
             iid='deprecated', n_jobs=None,
             param_grid={'n_neighbors': array([ 2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17, 18,
       19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35,
       36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49])},
             pre_dispatch='2*n_jobs', refit=True, return_train_score=False,
             scoring=None, verbose=0)

In [41]:
print('The best parameter is ' + str(knn_cv.best_params_))

print('The best score for parameters is ' + str(knn_cv.best_score_))

The best parameter is {'n_neighbors': 7}
The best score for parameters is 0.8029602220166513
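
Beyond the single best value, the fitted search object keeps the mean cross-validated score for every candidate in `cv_results_`, which shows how flat or peaked the scores are around the winner. The sketch below is an illustration only: it runs on scikit-learn's bundled iris data as a stand-in (an assumption made so the block runs without penguins.csv), so its numbers say nothing about the penguin model.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Stand-in data (assumption): iris instead of the penguin features.
X_demo, y_demo = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=1
)

# Same search setup as above: k from 2 to 49, 5-fold cross-validation.
search = GridSearchCV(
    KNeighborsClassifier(), {"n_neighbors": np.arange(2, 50)}, cv=5
)
search.fit(X_tr, y_tr)

# One row per candidate k, sorted so the best mean CV score comes first.
results = pd.DataFrame(search.cv_results_)[
    ["param_n_neighbors", "mean_test_score", "std_test_score", "rank_test_score"]
].sort_values("rank_test_score")

print(results.head(10))
```

In this notebook the same table could be built from `knn_cv.cv_results_`.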


I first fit the model using one neighbor, as recommended by the initial grid search, and evaluate it using a confusion matrix and a classification report.

In [34]:
knn = KNeighborsClassifier(n_neighbors=1)
knn.fit(X_train, y_train)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
                     metric_params=None, n_jobs=None, n_neighbors=1, p=2,
                     weights='uniform')

In [35]:
y_pred = knn.predict(X_test)

In [36]:
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix

In [37]:
knn.score(X_test, y_test)

0.82

In [38]:
print(confusion_matrix(y_test, y_pred))

[[38  3  2]
 [ 9 10  1]
 [ 1  2 34]]


In [39]:
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

      Adelie       0.79      0.88      0.84        43
   Chinstrap       0.67      0.50      0.57        20
      Gentoo       0.92      0.92      0.92        37

    accuracy                           0.82       100
   macro avg       0.79      0.77      0.78       100
weighted avg       0.81      0.82      0.81       100
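
The precision and recall figures in the report can be recomputed by hand from the confusion matrix, which makes it easier to see where the errors come from. The sketch below copies the matrix printed above and assumes the usual scikit-learn layout: rows are true species, columns are predicted species, both in alphabetical order.

```python
import numpy as np

# Confusion matrix printed above: rows are true species, columns are predictions,
# both in the order Adelie, Chinstrap, Gentoo.
labels = ["Adelie", "Chinstrap", "Gentoo"]
cm = np.array([
    [38, 3, 2],
    [9, 10, 1],
    [1, 2, 34],
])

correct = np.diag(cm)                 # correctly classified counts per species
recall = correct / cm.sum(axis=1)     # share of each true species that was found
precision = correct / cm.sum(axis=0)  # share of each predicted species that was right
accuracy = correct.sum() / cm.sum()

for name, p, r in zip(labels, precision, recall):
    print(f"{name:<10} precision={p:.2f} recall={r:.2f}")
print(f"accuracy={accuracy:.2f}")
```

Reading the Chinstrap row, 9 of the 20 Chinstrap penguins in the test set were predicted as Adelie, which is where the recall of 0.50 comes from.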



The classification for Gentoo and Adelie performed fairly well, but Chinstrap did poorly, with a recall of only 0.50. The initial grid search recommended using just one neighbor for KNN. A single neighbor makes for a more complex model and risks overfitting. I'm going to check whether using a higher number of neighbors results in better classification overall, and especially for the Chinstrap penguins. Specifically, I will use 7 neighbors because when I ran the parameter search again over 2 to 49 neighbors, 7 was the top result.

In [42]:
knn7 = KNeighborsClassifier(n_neighbors=7)
knn7.fit(X_train, y_train)

y_pred = knn7.predict(X_test)

In [43]:
knn7.score(X_test, y_test)

0.76

In [44]:
print(confusion_matrix(y_test, y_pred))

[[36  1  6]
 [12  7  1]
 [ 4  0 33]]


In [45]:
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

      Adelie       0.69      0.84      0.76        43
   Chinstrap       0.88      0.35      0.50        20
      Gentoo       0.82      0.89      0.86        37

    accuracy                           0.76       100
   macro avg       0.80      0.69      0.71       100
weighted avg       0.78      0.76      0.74       100



Switching to 7 neighbors increased the precision for the Chinstrap penguins (0.67 to 0.88) but reduced nearly every other metric for all three species, and overall accuracy fell from 0.82 to 0.76. So it appears that using 1 neighbor gives the best model here and did not end up overfitting this data set, at least judging by this one test split.
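
One quick, informal check for overfitting is to compare accuracy on the training split with accuracy on the held-out split; a large gap suggests the model is memorizing. With one neighbor, each training point is typically its own nearest neighbor, so training accuracy is close to perfect by construction and the test score is the number that matters. The sketch below illustrates the comparison on scikit-learn's bundled iris data as a stand-in (an assumption so the block runs on its own); `train_test_gap` is a hypothetical helper, not part of the notebook.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier


def train_test_gap(k, X_tr, X_te, y_tr, y_te):
    """Fit KNN with k neighbors and return (train accuracy, test accuracy)."""
    model = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr)
    return model.score(X_tr, y_tr), model.score(X_te, y_te)


# Stand-in data (assumption): iris instead of the penguin features.
X_demo, y_demo = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=1
)

# Compare the two settings discussed above.
scores = {k: train_test_gap(k, X_tr, X_te, y_tr, y_te) for k in (1, 7)}
for k, (train_acc, test_acc) in scores.items():
    print(f"k={k}: train={train_acc:.2f} test={test_acc:.2f} gap={train_acc - test_acc:.2f}")
```

On the penguin data, the same helper could be called with the notebook's X_train, X_test, y_train, and y_test.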