### KNN for classification

We want to predict the number of carburetors of a car (the *carb* column) from its *mpg*, *disp*, *hp* and *drat* values.

#### Read dataset

In [3]:
import pandas as pd

In [6]:
dataset_path = '../data/mtcars.csv'

In [7]:
dataset = pd.read_csv(dataset_path)

#### Explore dataset

In [8]:
dataset.head()

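| Unnamed: 0.1 | Unnamed: 0 | mpg | cyl | disp | hp | drat | wt | qsec | vs | am | gear | carb |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Mazda RX4 | 21.0 | 6 | 160.0 | 110 | 3.9 | 2.62 | 16.46 | 0 | 1 | 4 | 4 |
| 1 | Mazda RX4 Wag | 21.0 | 6 | 160.0 | 110 | 3.9 | 2.875 | 17.02 | 0 | 1 | 4 | 4 |
| 2 | Datsun 710 | 22.8 | 4 | 108.0 | 93 | 3.85 | 2.32 | 18.61 | 1 | 1 | 4 | 1 |
| 3 | Hornet 4 Drive | 21.4 | 6 | 258.0 | 110 | 3.08 | 3.215 | 19.44 | 1 | 0 | 3 | 1 |
| 4 | Hornet Sportabout | 18.7 | 8 | 360.0 | 175 | 3.15 | 3.44 | 17.02 | 0 | 0 | 3 | 2 |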


In [10]:
dataset['carb'].unique()

array([4, 1, 2, 3, 6, 8])

#### Select features

In [15]:
X = dataset[['mpg', 'disp', 'hp', 'drat']].values

In [36]:
X.shape

(32, 4)

In [20]:
X[:10] # show first 10 rows of array

array([[ 21.  , 160.  , 110.  ,   3.9 ],
       [ 21.  , 160.  , 110.  ,   3.9 ],
       [ 22.8 , 108.  ,  93.  ,   3.85],
       [ 21.4 , 258.  , 110.  ,   3.08],
       [ 18.7 , 360.  , 175.  ,   3.15],
       [ 18.1 , 225.  , 105.  ,   2.76],
       [ 14.3 , 360.  , 245.  ,   3.21],
       [ 24.4 , 146.7 ,  62.  ,   3.69],
       [ 22.8 , 140.8 ,  95.  ,   3.92],
       [ 19.2 , 167.6 , 123.  ,   3.92]])

In [19]:
y = dataset['carb'].values

In [37]:
y.shape

(32,)

In [21]:
y[:10]

array([4, 4, 1, 1, 2, 1, 4, 2, 2, 4])

#### Preprocess

*scale* from sklearn.preprocessing standardizes a dataset along any axis (by default axis=0, meaning each column is standardized independently). Standardizing means subtracting the mean and dividing by the standard deviation, so that each column ends up with zero mean and unit variance.

In [22]:
from sklearn.preprocessing import scale

In [23]:
X_stand = scale(X)

In [26]:
X_stand[:10]

array([[ 0.15329914, -0.57975032, -0.54365487,  0.57659448],
       [ 0.15329914, -0.57975032, -0.54365487,  0.57659448],
       [ 0.4567366 , -1.00602601, -0.7955699 ,  0.48158406],
       [ 0.22072968,  0.22361542, -0.54365487, -0.98157639],
       [-0.23442651,  1.05977159,  0.41954967, -0.84856181],
       [-0.33557233, -0.0469057 , -0.61774753, -1.58964307],
       [-0.97616253,  1.05977159,  1.45684686, -0.7345493 ],
       [ 0.72645879, -0.68877852, -1.25494437,  0.17755072],
       [ 0.4567366 , -0.73714442, -0.76593284,  0.61459865],
       [-0.15013833, -0.51744848, -0.35101396,  0.61459865]])
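To make the definition concrete, here is a minimal sketch that compares *scale* with the same computation written out in NumPy. The array `X_toy` holds made-up values (it is not taken from mtcars.csv), and the check relies on `np.std` using the population standard deviation (`ddof=0`) by default.

```python
import numpy as np
from sklearn.preprocessing import scale

# Toy feature matrix (hypothetical values, not taken from mtcars.csv).
X_toy = np.array([[1.0, 10.0],
                  [2.0, 20.0],
                  [3.0, 60.0]])

# Column-wise z-score: subtract each column's mean, divide by its standard deviation.
# np.std defaults to ddof=0 (population standard deviation).
X_manual = (X_toy - X_toy.mean(axis=0)) / X_toy.std(axis=0)

X_scaled = scale(X_toy)

print(np.allclose(X_manual, X_scaled))  # do both computations agree?
print(X_scaled.mean(axis=0))            # approximately 0 in every column
print(X_scaled.std(axis=0))             # approximately 1 in every column
```

Because every column ends up on the same scale, no single feature (for example *disp*, whose raw values are in the hundreds) dominates the distances that KNN computes.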

#### Train/test splitting

*train_test_split* from sklearn.model_selection splits a dataset into random train and test subsets. *test_size* sets the fraction of samples held out for testing, and *random_state* fixes the shuffle so that the split can be reproduced.

In [29]:
from sklearn.model_selection import train_test_split

In [45]:
X_train, X_test, y_train, y_test = train_test_split(X_stand, y, test_size=0.33, random_state=17)

In [46]:
X_train.shape

(21, 4)

In [47]:
y_train.shape

(21,)

In [48]:
X_test.shape

(11, 4)

In [49]:
y_test.shape

(11,)
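The shapes above follow from how a fractional *test_size* is turned into a sample count: the test set gets the ceiling of 0.33 × 32 = 10.56, i.e. 11 rows, and the remaining 21 go to training. The sketch below reproduces this on placeholder arrays (`X_demo` and `y_demo` are illustrative names, not the notebook's data) and also checks that reusing the same *random_state* gives an identical split.

```python
import math
import numpy as np
from sklearn.model_selection import train_test_split

n_samples = 32
X_demo = np.arange(n_samples * 4).reshape(n_samples, 4)  # placeholder features
y_demo = np.arange(n_samples)                            # placeholder labels

# A float test_size is rounded up to a whole number of test samples.
n_test_expected = math.ceil(0.33 * n_samples)

split_a = train_test_split(X_demo, y_demo, test_size=0.33, random_state=17)
split_b = train_test_split(X_demo, y_demo, test_size=0.33, random_state=17)

X_tr, X_te, y_tr, y_te = split_a
print(X_tr.shape, X_te.shape)  # (21, 4) (11, 4)

# Same random_state, same inputs: the two splits match element by element.
same = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))
print(same)
```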

#### Build model

*KNeighborsClassifier* from sklearn.neighbors implements KNN classification: each sample receives the most common label among its nearest training points. By default it uses *n_neighbors=5*.

In [50]:
from sklearn.neighbors import KNeighborsClassifier

In [52]:
KNNclf = KNeighborsClassifier()

#### Train the model

In [53]:
KNNclf.fit(X_train, y_train)

KNeighborsClassifier()

In [54]:
KNNclf

KNeighborsClassifier()
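One thing to keep in mind: this notebook standardizes the full dataset before splitting, so the test rows contribute to the mean and standard deviation used for scaling. A common alternative, sketched here under the assumption that you want the test set kept fully unseen, is to fit a *StandardScaler* on the training rows only by chaining it with the classifier in a pipeline. The data below is synthetic stand-in data, and the names `X_raw`, `y_raw` and `pipe` are illustrative.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_raw = rng.normal(loc=50.0, scale=10.0, size=(32, 4))  # synthetic stand-in features
y_raw = rng.integers(0, 3, size=32)                     # synthetic stand-in labels

# Split the raw (unscaled) data first.
X_tr, X_te, y_tr, y_te = train_test_split(X_raw, y_raw, test_size=0.33, random_state=17)

# The pipeline fits the scaler and the classifier on the training rows only.
pipe = make_pipeline(StandardScaler(), KNeighborsClassifier())
pipe.fit(X_tr, y_tr)

# At prediction time, X_te is scaled with the training mean and standard deviation.
y_hat = pipe.predict(X_te)

scaler = pipe.named_steps["standardscaler"]
print(np.allclose(scaler.mean_, X_tr.mean(axis=0)))  # scaler saw training rows only
print(y_hat.shape)
```

On a dataset as small as this one the difference in scores may be minor, but the pattern avoids leaking test-set statistics into preprocessing.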

#### Predict

In [57]:
y_pred = KNNclf.predict(X_test)

In [58]:
y_pred

array([2, 2, 4, 2, 2, 2, 4, 1, 1, 4, 2])
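To see what *predict* does conceptually, the following sketch implements brute-force KNN by hand (Euclidean distance, then a majority vote among the k nearest training points) and compares it with *KNeighborsClassifier* on synthetic two-class data. The helper `knn_predict` is a hypothetical name written for this illustration, and its tie handling may differ from scikit-learn's, which is why the toy data uses only two classes with an odd k.

```python
import numpy as np
from collections import Counter
from sklearn.neighbors import KNeighborsClassifier

def knn_predict(X_train, y_train, X_query, k=5):
    """Brute-force KNN: Euclidean distance plus majority vote (illustrative helper)."""
    predictions = []
    for x in X_query:
        # Distance from the query point to every training point.
        distances = np.linalg.norm(X_train - x, axis=1)
        # Indices of the k closest training points.
        nearest = np.argsort(distances)[:k]
        # Most frequent label among those neighbors.
        votes = Counter(y_train[nearest].tolist())
        predictions.append(votes.most_common(1)[0][0])
    return np.array(predictions)

rng = np.random.default_rng(0)
X_tr = rng.normal(size=(40, 4))        # synthetic training features
y_tr = (X_tr[:, 0] > 0).astype(int)    # two classes, so 5 votes can never tie
X_q = rng.normal(size=(10, 4))         # synthetic query points

manual = knn_predict(X_tr, y_tr, X_q, k=5)
reference = KNeighborsClassifier(n_neighbors=5).fit(X_tr, y_tr).predict(X_q)
print(np.array_equal(manual, reference))
```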

#### Evaluate

*classification_report* from sklearn.metrics builds a text report showing the main classification metrics.

By label:
- **precision**: TP / (TP + FP) (ability to not label as positive a sample that is negative)
- **recall**: TP / (TP + FN) (ability to find all positive samples)
- **f1-score**: harmonic mean of precision and recall, 2 · precision · recall / (precision + recall)
- **support**: number of occurrences of the label in y_test

In [59]:
from sklearn.metrics import classification_report

In [61]:
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           1       0.50      0.33      0.40         3
           2       0.33      0.67      0.44         3
           3       0.00      0.00      0.00         1
           4       0.67      0.67      0.67         3
           6       0.00      0.00      0.00         1

    accuracy                           0.45        11
   macro avg       0.30      0.33      0.30        11
weighted avg       0.41      0.45      0.41        11
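As a worked check of the report, the sketch below recomputes the row for label 2 and the two averaged precision values. The counts are read off the outputs above: y_pred contains six 2s, the support of label 2 is 3, and a recall of 0.67 means 2 of those 3 were found. The per-label precisions are the rounded values printed in the report, so the averages are approximate.

```python
# Counts for label 2, derived from the notebook outputs above.
tp = 2          # true 2s predicted as 2
fp = 6 - tp     # predicted as 2 but actually another label
fn = 3 - tp     # true 2s predicted as another label

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
print(round(precision, 2), round(recall, 2), round(f1, 2))  # 0.33 0.67 0.44

# Rounded per-label precision and support from the report (labels 1, 2, 3, 4, 6).
precisions = [0.50, 0.33, 0.00, 0.67, 0.00]
supports = [3, 3, 1, 3, 1]

# macro avg: plain mean over labels; weighted avg: mean weighted by support.
macro_avg = sum(precisions) / len(precisions)
weighted_avg = sum(p * s for p, s in zip(precisions, supports)) / sum(supports)
print(round(macro_avg, 2), round(weighted_avg, 2))  # 0.3 0.41
```

Labels 3 and 6 never appear in y_pred, which is why their precision, recall and f1-score are all 0.00 and why the macro average sits below the weighted average.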

