# A Supervised Learning Example

We'll use the classic iris dataset to build a k-nearest neighbors (KNN) classifier.

## Imports

In [3]:
import numpy as np
import pandas as pd
# import matplotlib.pyplot as plt

from sklearn import datasets
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split

In [4]:
iris = datasets.load_iris()

In [5]:
type(iris)

sklearn.utils.Bunch

## The Iris Dataset

In [6]:
print(iris.keys())

dict_keys(['data', 'target_names', 'DESCR', 'feature_names', 'target'])


In [7]:
print(type(iris.data), type(iris.target))

<class 'numpy.ndarray'> <class 'numpy.ndarray'>


In [8]:
iris.data.shape

(150, 4)

In [9]:
iris.target_names

array(['setosa', 'versicolor', 'virginica'], dtype='<U10')

In [10]:
X, y = iris.data, iris.target
print("\n", iris['data'].shape, "\n",iris['target'].shape)


 (150, 4) 
 (150,)


In [11]:
df = pd.DataFrame(X, columns=iris.feature_names)
df.head()

| | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) |
|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 |


## The Model

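What is [KNN](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html)? A k-nearest neighbors classifier predicts the class of a new sample by majority vote among the `k` training samples closest to it in feature space.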
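As a rough illustration of the idea behind KNN, here is a minimal from-scratch sketch using NumPy. The function name `knn_predict` and the toy data are hypothetical and made up for this example; this is not how scikit-learn implements `KNeighborsClassifier`, which supports faster neighbor-search structures.

```python
import numpy as np
from collections import Counter


def knn_predict(X_train, y_train, x_query, k=3):
    """Predict a label for x_query by majority vote among its k nearest training points."""
    # Euclidean distance from the query point to every training point
    distances = np.linalg.norm(X_train - x_query, axis=1)
    # Indices of the k closest training points
    nearest = np.argsort(distances)[:k]
    # The most common label among those neighbors wins the vote
    votes = Counter(y_train[nearest].tolist())
    return votes.most_common(1)[0][0]


# Toy data (invented for illustration): two well-separated clusters in 2D
X_toy = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                  [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

label_near_origin = knn_predict(X_toy, y_toy, np.array([0.3, 0.3]), k=3)
label_near_five = knn_predict(X_toy, y_toy, np.array([4.8, 5.0]), k=3)
print(label_near_origin, label_near_five)
```

A query near the origin is assigned to class 0 and a query near (5, 5) to class 1, because all three of its nearest neighbors belong to that cluster.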

### Split the data

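Split the data into training and test sets before you explore or model it. If you explore the same data that you later use for evaluation, you are likely to bias your results. In the call below, `test_size=.3` holds out 30% of the samples, `random_state=21` makes the split reproducible, and `stratify=y` keeps the class proportions the same in both sets.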

In [12]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.3, random_state=21, stratify=y)
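One way to see what `stratify=y` does in the split above is to count the samples per class on each side. The sketch below repeats the same split and uses `np.bincount`; the variable names `train_counts` and `test_counts` are introduced here for illustration.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
X, y = iris.data, iris.target

# Same split parameters as in the notebook cell above
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=21, stratify=y
)

# Number of samples per class (0, 1, 2) in each subset
train_counts = np.bincount(y_train)
test_counts = np.bincount(y_test)
print("train:", train_counts)
print("test: ", test_counts)
```

Because the full dataset has 50 samples of each species, a stratified 70/30 split should leave the three classes equally represented in both the training and the test set.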

### Build the model

Initialize the model class with its hyperparameters. Hyperparameters are parameters that are set before learning begins. In this case, `n_neighbors=8` tells the KNN classifier to label each sample by a vote among its 8 nearest training samples.

In [13]:
knn = KNeighborsClassifier(n_neighbors=8)
knn.fit(X_train, y_train)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=8, p=2,
           weights='uniform')
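The value of `n_neighbors` is a choice, not something the model learns. A possible way to compare candidates without touching the test set is cross-validation on the training data alone. The sketch below is one such approach; the candidate values and the 5-fold setting are assumptions chosen for illustration.

```python
from sklearn import datasets
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris = datasets.load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=21, stratify=iris.target
)

# Mean 5-fold cross-validated accuracy on the training set for each candidate k
cv_means = {}
for k in (1, 3, 5, 8, 15):
    model = KNeighborsClassifier(n_neighbors=k)
    scores = cross_val_score(model, X_train, y_train, cv=5)
    cv_means[k] = scores.mean()
    print(f"n_neighbors={k:2d}  mean CV accuracy={cv_means[k]:.3f}")
```

Small values of `n_neighbors` tend to follow the training data closely, while large values smooth the decision boundary, so comparing a few settings gives a sense of how sensitive the model is to this choice.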

### Evaluate the model

Evaluate the model by using it to predict on the test data. Note that the held-out test set becomes less reliable the more times you tinker with the model to improve the test score, because each adjustment leaks information about the test set into your choices. At that point it becomes more important to bring in completely independent data to check the validity of the model.

In [14]:
y_pred = knn.predict(X_test)
print("Test set predictions:\n {}".format(y_pred))

Test set predictions:
 [2 1 2 2 1 0 1 0 0 1 0 2 0 2 2 0 0 0 1 0 2 2 2 0 1 1 1 0 0 1 2 2 0 0 1 2 2
 1 1 2 1 1 0 2 1]


In [18]:
round(knn.score(X_test, y_test), 2)

0.96
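A single accuracy number hides which species are being confused with each other. As a sketch, a confusion matrix and a per-class report from `sklearn.metrics` break the result down by class. The block refits the same model so that it runs on its own.

```python
from sklearn import datasets
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris = datasets.load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=21, stratify=iris.target
)

knn = KNeighborsClassifier(n_neighbors=8).fit(X_train, y_train)
y_pred = knn.predict(X_test)

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_test, y_pred)
print(cm)

# Precision, recall, and F1 per species
print(classification_report(y_test, y_pred, target_names=iris.target_names))
```

Off-diagonal entries in the matrix show which pairs of species the classifier mixes up.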

Additional performance testing is required. A single accuracy score does not tell you whether the model is overfitting, so test it against completely independent data.
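One quick, partial check for overfitting is to compare accuracy on the training set with accuracy on the test set. The sketch below does this for the same model. It is only a first signal under the assumptions of this notebook and does not replace testing on independent data.

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris = datasets.load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=21, stratify=iris.target
)

knn = KNeighborsClassifier(n_neighbors=8).fit(X_train, y_train)

# Accuracy on data the model has seen versus data it has not
train_acc = knn.score(X_train, y_train)
test_acc = knn.score(X_test, y_test)
print(f"train accuracy: {train_acc:.2f}")
print(f"test accuracy:  {test_acc:.2f}")
print(f"gap:            {train_acc - test_acc:+.2f}")
```

A training accuracy far above the test accuracy would suggest overfitting, while similar values are reassuring but not conclusive on a test set of only 45 samples.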