# Import Libraries

Here I import the libraries and functions used below. I prefer to keep all imports in one place, which makes it easy to see what the notebook depends on.

In [33]:
import pandas as pd
import numpy as np

from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.cluster import KMeans

from scipy.stats import chisquare

## Load the data set

Below, I download an example data set from the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets.php), a good source of relatively small data sets for lectures like this. This particular one is the [Iris Data Set](https://archive.ics.uci.edu/ml/datasets/iris). The file has no header row, so I pass `header=None` and pandas numbers the columns 0 to 4.

In [58]:
dataset = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data",sep=',',header=None)
dataset

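| | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | Iris-setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | Iris-setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | Iris-setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | Iris-setosa |
| ... | ... | ... | ... | ... | ... |
| 145 | 6.7 | 3.0 | 5.2 | 2.3 | Iris-virginica |
| 146 | 6.3 | 2.5 | 5.0 | 1.9 | Iris-virginica |
| 147 | 6.5 | 3.0 | 5.2 | 2.0 | Iris-virginica |
| 148 | 6.2 | 3.4 | 5.4 | 2.3 | Iris-virginica |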
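
Because the file has no header, the columns show up as 0 to 4. A minimal sketch of how they could be named at load time follows. The column names are my assumption, based on the attribute description on the UCI page, and a few rows copied from the output above stand in for the downloaded file so the snippet runs offline.

```python
import io
import pandas as pd

# Assumed column names (from the UCI attribute description);
# the raw file itself carries no header row.
columns = ["sepal_length", "sepal_width", "petal_length", "petal_width", "species"]

# A few rows copied from the output above stand in for the real file.
sample = io.StringIO(
    "5.1,3.5,1.4,0.2,Iris-setosa\n"
    "4.9,3.0,1.4,0.2,Iris-setosa\n"
    "6.7,3.0,5.2,2.3,Iris-virginica\n"
)

# names= assigns the labels; header=None tells pandas the first row is data.
named = pd.read_csv(sample, header=None, names=columns)
print(named.dtypes)
```

With the real URL in place of `sample`, the same `names=` argument should apply unchanged.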


## Train-test split

To evaluate the models on data they have not seen during fitting, I split the data into training and test sets. I am using [SciKit-Learn](https://scikit-learn.org/stable/)'s [`train_test_split`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) function, with `test_size=0.5` holding out half of the rows for testing.

In [84]:
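# Features: columns 0-2 (note that 0:3 leaves out column 3, the fourth measurement)
X = np.array(dataset.iloc[:,0:3])
# Labels: column 4 holds the species name
y = np.array(dataset.iloc[:,4])
# Hold out half of the rows for testing; no random_state, so the split changes on every run
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.5)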
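
The split above is random, so the tables below will differ from run to run. As a sketch of one way to make it repeatable, the snippet below fixes `random_state` and adds `stratify=y` so each species is equally represented in both halves. The seed value `0` is arbitrary. The snippet uses `load_iris`, the copy of the data bundled with scikit-learn, so it runs without a download. Unlike the cell above, it keeps all four feature columns.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Bundled copy of the Iris data: 150 rows, 4 features, labels coded 0/1/2.
X, y = load_iris(return_X_y=True)

# random_state fixes the shuffle; stratify keeps class proportions in both halves.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0, stratify=y
)

# Number of test rows per class.
print(np.bincount(y_test))
```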

## K-NN

In [89]:
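# Classify each test point by majority vote among its 3 nearest training points
model1 = KNeighborsClassifier(n_neighbors=3, metric='euclidean')
model1.fit(X_train,y_train)
# Rows: true species in the test set; columns: predicted species
Result1 = pd.crosstab(y_test,model1.predict(X_test))
Result1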

| row_0 (actual) / col_0 (predicted) | Iris-setosa | Iris-versicolor | Iris-virginica |
|---|---|---|---|
| Iris-setosa | 26 | 0 | 0 |
| Iris-versicolor | 0 | 23 | 2 |
| Iris-virginica | 0 | 1 | 23 |
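
The cross-tabulation above is a confusion matrix, so summary scores can be read off it directly. The sketch below copies the counts from that table into a plain array (the variable names are mine) and computes overall accuracy and per-class recall.

```python
import numpy as np

# Counts copied from the K-NN table above: rows are actual, columns are predicted.
confusion = np.array([
    [26, 0, 0],
    [0, 23, 2],
    [0, 1, 23],
])

# Accuracy: correctly classified rows (the diagonal) over all test rows.
accuracy = np.trace(confusion) / confusion.sum()

# Recall per class: diagonal entry over the row total.
recall = np.diag(confusion) / confusion.sum(axis=1)

print(round(accuracy, 2))  # 0.96
print(recall)
```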


## K-Means

In [91]:
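# K-Means is unsupervised: it is fitted on the features only and never sees y_train
model2 = KMeans(n_clusters=3, init='k-means++')
model2.fit(X_train)
# Rows: true species; columns: cluster ids, whose numbering is arbitrary
Result2 = pd.crosstab(y_test,model2.predict(X_test))
Result2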

| row_0 (actual) / col_0 (cluster) | 0 | 1 | 2 |
|---|---|---|---|
| Iris-setosa | 26 | 0 | 0 |
| Iris-versicolor | 0 | 22 | 3 |
| Iris-virginica | 0 | 7 | 17 |
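
The cluster ids 0, 1, 2 carry no meaning of their own, and another run may number the clusters differently. One way to compare clusters with species is to look for the one-to-one pairing that maximizes agreement. The sketch below does that with `scipy.optimize.linear_sum_assignment` on the counts from the table above. This is an illustration I am adding, not part of the original analysis.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# Counts copied from the K-Means table above: rows are species, columns are cluster ids.
table = np.array([
    [26, 0, 0],
    [0, 22, 3],
    [0, 7, 17],
])

# Find the species-to-cluster pairing with the largest total count.
species_idx, cluster_idx = linear_sum_assignment(table, maximize=True)

# Share of test rows that fall in the cluster paired with their species.
agreement = table[species_idx, cluster_idx].sum() / table.sum()

print(cluster_idx)
print(round(agreement, 3))
```

In this particular run the best pairing happens to be the identity, so the diagonal of the table can be read as "agreeing" rows.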


## Analyze and Evaluate

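Read about the chi-square test [here](https://www.itl.nist.gov/div898/handbook/eda/section3/eda3674.htm) and [here](https://stats.idre.ucla.edu/stata/whatstat/what-statistical-analysis-should-i-usestatistical-analyses-using-stata/). Note that `chisquare` with `axis=None` treats all nine cells of a table as a single sample and, since no expected frequencies are given, tests them against equal expected counts in every cell.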

In [92]:
print(chisquare(Result1,axis=None))
print(chisquare(Result2,axis=None))

Power_divergenceResult(statistic=133.67999999999998, pvalue=4.879202928325163e-25)
Power_divergenceResult(statistic=105.83999999999999, pvalue=2.720745087045414e-19)
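
Since `chisquare` here compares the nine cells against a uniform table, a small p-value only says the counts are not spread evenly. If the question is whether predictions and true species are associated, a chi-square test of independence may be closer to the intent. The sketch below uses `scipy.stats.chi2_contingency` on the K-NN counts copied from the table above. Treat it as an alternative I am suggesting, not as the notebook's method.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Counts copied from the K-NN table above: rows are actual, columns are predicted.
knn_table = np.array([
    [26, 0, 0],
    [0, 23, 2],
    [0, 1, 23],
])

# Expected counts come from the row and column totals, not from a uniform table.
chi2, p_value, dof, expected = chi2_contingency(knn_table)

print(dof)  # 4, i.e. (3 - 1) * (3 - 1)
print(p_value < 0.05)
```

The same call can be applied to the K-Means table. Keep in mind that with only 75 test rows, some expected counts are small, so the p-value is approximate.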
