# Import Libraries

Here I import the libraries and functions used below. I prefer putting all imports in one place to get a clear view of what is needed and what is used.

In [33]:
import pandas as pd
import numpy as np

from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.cluster import KMeans

from scipy.stats import chisquare

## Load the data set

Below, I download an example data set from the [UCI Data Repository](https://archive.ics.uci.edu/ml/datasets.php). This is a very good source of relatively small data sets for lectures like this. This particular data set is called [The Iris Data Set](https://archive.ics.uci.edu/ml/datasets/iris).

In [16]:
dataset = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data",sep=',',header=None)
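Because the file has no header row, `header=None` leaves the columns labelled 0 to 4. As a minimal sketch, the same layout can be read with descriptive column names. The three sample rows stand in for the downloaded file, and the column names are hypothetical, chosen here only for readability.

```python
import io
import pandas as pd

# A few sample rows in the same headerless, comma-separated layout as iris.data.
sample = io.StringIO(
    "5.1,3.5,1.4,0.2,Iris-setosa\n"
    "7.0,3.2,4.7,1.4,Iris-versicolor\n"
    "6.3,3.3,6.0,2.5,Iris-virginica\n"
)

# Hypothetical column names (an assumption, not part of the file itself).
columns = ["sepal_length", "sepal_width", "petal_length", "petal_width", "species"]

# header=None tells pandas the first line is data; names= supplies the labels.
named = pd.read_csv(sample, header=None, names=columns)
print(named.dtypes)
```

With named columns, the features and label could be selected by name instead of by position.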

## Train-test split

To evaluate the models on data they were not trained on, I need to split the data into train and test data sets. I am using [SciKit-Learn](https://scikit-learn.org/stable/)'s [`train_test_split`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) function.

In [53]:
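# Feature columns 0-2; note that 0:3 excludes column 3, the fourth measurement.
X = np.array(dataset.iloc[:,0:3])
# Column 4 holds the species label.
y = np.array(dataset.iloc[:,4])
# Hold out 15% of the rows for testing; no random_state is set, so the split differs between runs.
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.15)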
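The split above is random, so the tables further down change from run to run. A possible variation, sketched here on synthetic stand-in data rather than the Iris file, fixes `random_state` for reproducibility and passes `stratify` so each class keeps roughly the same share in both parts. The names `X_demo` and `y_demo` and the seed value are assumptions made for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in with the same size as the Iris data: 150 rows, 3 balanced classes.
X_demo = np.arange(150 * 4, dtype=float).reshape(150, 4)
y_demo = np.repeat(["a", "b", "c"], 50)

# random_state makes the shuffle repeatable; stratify keeps class proportions similar.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.15, random_state=0, stratify=y_demo
)

print(X_tr.shape, X_te.shape)
labels, counts = np.unique(y_te, return_counts=True)
print(dict(zip(labels, counts)))
```

With 150 rows and `test_size=0.15`, the test part holds 23 rows, which matches the totals of the tables below.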

## K-NN

In [54]:
classifier = KNeighborsClassifier(n_neighbors=3, metric='euclidean')
classifier.fit(X_train,y_train)
Model1 = pd.crosstab(classifier.predict(X_test),y_test)
Model1

| Predicted \ Actual | Iris-setosa | Iris-versicolor | Iris-virginica |
|---|---|---|---|
| Iris-setosa | 9 | 0 | 0 |
| Iris-versicolor | 0 | 6 | 1 |
| Iris-virginica | 0 | 0 | 7 |
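In this table the rows are the predicted classes and the columns are the true classes, so correct predictions sit on the diagonal. As a small sketch, the accuracy can be read off by dividing the diagonal sum by the total. The counts are copied by hand from the K-NN output above, and `knn_counts` is a name introduced only for this example.

```python
import numpy as np

# Counts copied from the K-NN table (rows: predicted, columns: actual).
knn_counts = np.array([
    [9, 0, 0],
    [0, 6, 1],
    [0, 0, 7],
])

# Diagonal cells are the correct predictions; the grand total is the test set size.
correct = np.trace(knn_counts)
total = knn_counts.sum()
accuracy = correct / total

print(f"{correct} of {total} correct, accuracy = {accuracy:.3f}")
```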


## K-Means

In [55]:
clusterer = KMeans(n_clusters=3)
clusterer.fit(X_train)
Model2 = pd.crosstab(clusterer.predict(X_test),y_test)
Model2

| Cluster \ Actual | Iris-setosa | Iris-versicolor | Iris-virginica |
|---|---|---|---|
| 0 | 9 | 0 | 0 |
| 1 | 0 | 0 | 5 |
| 2 | 0 | 6 | 3 |
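K-Means never sees the labels, so the cluster numbers 0, 1 and 2 are arbitrary and do not line up with the species order. One common way to read such a table, sketched here as an assumption rather than as the method used in this notebook, is to assign each cluster to its majority class and report the share of points that fall in their cluster's majority class (often called purity). The counts are copied by hand from the K-Means output above.

```python
import numpy as np

classes = ["Iris-setosa", "Iris-versicolor", "Iris-virginica"]

# Counts copied from the K-Means table (rows: cluster id, columns: actual class).
kmeans_counts = np.array([
    [9, 0, 0],
    [0, 0, 5],
    [0, 6, 3],
])

# For each cluster (row), find the class (column) with the largest count.
majority = kmeans_counts.argmax(axis=1)
for cluster_id, class_idx in enumerate(majority):
    print(f"cluster {cluster_id} -> {classes[class_idx]}")

# Purity: points in their cluster's majority class divided by all points.
purity = kmeans_counts.max(axis=1).sum() / kmeans_counts.sum()
print(f"purity = {purity:.3f}")
```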


## Analyze and Evaluate

In [56]:
print(chisquare(Model1,axis=None))
print(chisquare(Model2,axis=None))

Power_divergenceResult(statistic=42.34782608695652, pvalue=1.1651535139082795e-06)
Power_divergenceResult(statistic=36.08695652173913, pvalue=1.692805673177089e-05)
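To make clear what these numbers measure: with `axis=None` and no expected frequencies given, `chisquare` treats all nine cells as a single sample and compares them with a uniform distribution, that is, every cell expected to hold total/9 observations. The sketch below recomputes the first statistic by hand from the K-NN counts. The variable names are introduced only for this example.

```python
import numpy as np
from scipy.stats import chi2, chisquare

# K-NN counts copied from the table above (rows: predicted, columns: actual).
observed = np.array([
    [9, 0, 0],
    [0, 6, 1],
    [0, 0, 7],
], dtype=float)

# Under the uniform null hypothesis every cell has the same expected count.
expected = observed.sum() / observed.size

# Pearson chi-square statistic summed over all nine cells.
statistic = ((observed - expected) ** 2 / expected).sum()

# Degrees of freedom: number of cells minus one.
dof = observed.size - 1
p_value = chi2.sf(statistic, dof)

# Cross-check against SciPy's own result.
reference = chisquare(observed, axis=None)
print(statistic, p_value)
print(reference.statistic, reference.pvalue)
```

A large statistic here means the counts are far from evenly spread over the nine cells. That is expected whenever predictions concentrate in a few cells, so the value is probably best read next to a direct measure such as the share of correct predictions.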
