## KNN with large data set

In [None]:
# import everything first
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import sklearn

In [None]:
# We are going to use the Iris dataset from sklearn as our example
from sklearn import datasets
iris = datasets.load_iris()

In [None]:
iris # see what the dataset is like

In [None]:
# build a dataframe with the data
# first four columns are the features, the last column is the target that we want to predict
df = pd.DataFrame(iris.data, columns = ['sepal length (cm)',
  'sepal width (cm)',
  'petal length (cm)',
  'petal width (cm)'])
df['class'] = iris.target
df.head()

In [None]:
print(iris.target_names) # these are the class names corresponding to their numeric label [0, 1, 2 ...]

### Train our Model

In [None]:
# we'll use the sepal dimensions as the features
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split

# we let k = 5 first, which means choosing 5 nearest neighbors.
knn = KNeighborsClassifier(n_neighbors = 5) 

In [None]:
# split into training and test sets, keeping only the two sepal columns as features
X_train, X_test, y_train, y_test = train_test_split(df.iloc[:,0:2], df['class'], random_state = 42)
X_train.head()

In [None]:
knn.fit(X_train, y_train)
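
To make the role of `n_neighbors` concrete, here is a minimal from-scratch sketch of the majority vote that KNN performs. The helper `knn_predict_one` and the toy points are hypothetical and only for illustration; scikit-learn's `KNeighborsClassifier` uses more efficient neighbor search and its own tie handling.

```python
from collections import Counter

import numpy as np


def knn_predict_one(X_train, y_train, x, k=5):
    """Hypothetical helper: classify one point by majority vote of its k nearest neighbors."""
    X_train = np.asarray(X_train, dtype=float)
    y_train = np.asarray(y_train)
    # Euclidean distance from x to every training point
    distances = np.linalg.norm(X_train - np.asarray(x, dtype=float), axis=1)
    # indices of the k closest training points
    nearest = np.argsort(distances)[:k]
    # most common label among those neighbors
    return Counter(y_train[nearest].tolist()).most_common(1)[0][0]


# toy data (made up): two well-separated clusters
X_toy = [[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]]
y_toy = [0, 0, 0, 1, 1, 1]

print(knn_predict_one(X_toy, y_toy, [1.1, 1.0], k=3))
print(knn_predict_one(X_toy, y_toy, [5.0, 5.1], k=3))
```

Each query point takes the label of the cluster it sits in, because all three of its nearest neighbors belong to that cluster.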

### Test our Model

In [None]:
y_pred = knn.predict(X_test)
print(y_pred) # our prediction
print(y_test) # actual values

In [None]:
# we should test how accurate our model is 

from sklearn.metrics import accuracy_score
print(accuracy_score(y_test, y_pred))

An accuracy score closer to 1 means better predictions. Our model has an accuracy score of approximately 0.815.
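
The accuracy score is simply the fraction of test samples whose predicted label matches the actual label. The sketch below computes it by hand on made-up labels; `manual_accuracy` is a hypothetical helper, not part of scikit-learn.

```python
import numpy as np


def manual_accuracy(y_true, y_pred):
    """Hypothetical helper: fraction of positions where prediction equals truth."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    return float(np.mean(y_true == y_pred))


# made-up labels: 4 of the 5 predictions are correct
y_true_toy = [0, 1, 2, 2, 1]
y_pred_toy = [0, 2, 2, 2, 1]

print(manual_accuracy(y_true_toy, y_pred_toy))
```

On single-label problems like this one, this should agree with `accuracy_score(y_true, y_pred)`.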

### Explore more about model

We can try different k values for our model and check the accuracy for each.

We will try the odd values of k from 1 to 19: a smaller k lets noise have a larger influence, while a larger k makes computation more expensive.

In [None]:
k_array = np.arange(1, 21, 2)

k_array

In [None]:
# we try each odd k value from 1 to 19 and check the accuracy score
# then we can choose the best k value

for k in k_array:
    knn_ex = KNeighborsClassifier(n_neighbors = k)
    knn_ex.fit(X_train, y_train)
    ac = accuracy_score(y_test, knn_ex.predict(X_test))
    print(k)
    print(ac)

In [None]:
knn_1 = KNeighborsClassifier(n_neighbors = 1)
knn_1.fit(X_train, y_train)
y_pred1 = knn_1.predict(X_test)
print(accuracy_score(y_test, y_pred1))

The accuracy of the model varies with the number of neighbors k. Choosing a good value of k for our model is important.
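
Picking k by its accuracy on the test set risks tuning the model to that particular split. A common alternative, sketched here as an assumption rather than as the method used in this notebook, is to compare k values with cross-validation on the training data only. The block reloads the data so it can run on its own; `cv=5` and the names `cv_scores` and `best_k` are illustrative choices.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris = datasets.load_iris()
X = iris.data[:, 0:2]  # sepal length and sepal width, as in the cells above
y = iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

k_values = np.arange(1, 21, 2)  # odd k from 1 to 19
cv_scores = {}
for k in k_values:
    model = KNeighborsClassifier(n_neighbors=int(k))
    # mean accuracy over 5 folds of the training set (cv=5 is an assumption)
    scores = cross_val_score(model, X_train, y_train, cv=5)
    cv_scores[int(k)] = float(scores.mean())

# k with the highest mean cross-validated accuracy
best_k = max(cv_scores, key=cv_scores.get)
print(cv_scores)
print("best k by cross-validation:", best_k)
```

The test set is then used only once, to report the accuracy of the model fitted with the chosen k.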

### Validation with Confusion Matrix
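We can use a confusion matrix to see how the predictions are distributed across the classes.
scikit-learn's `confusion_matrix` puts the actual classes in rows and the predicted classes in columns:

|                   | predicted class A | predicted class B |
|-------------------|-------------------|-------------------|
| actual class A    |                   |                   |
| actual class B    |                   |                   |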
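
To check how a matrix should be read, it helps to run `confusion_matrix` on a tiny example where the answer is known. In scikit-learn, entry `[i, j]` counts samples whose actual class is `i` and whose predicted class is `j`. The labels below are made up for illustration.

```python
from sklearn.metrics import confusion_matrix

# made-up labels: three actual class-0 samples, two actual class-1 samples
y_true_toy = [0, 0, 0, 1, 1]
y_pred_toy = [0, 1, 1, 1, 1]

cm = confusion_matrix(y_true_toy, y_pred_toy)
print(cm)

# row 0 describes the actual class-0 samples:
# cm[0, 0] of them were predicted as class 0, cm[0, 1] as class 1
print("actual 0 predicted as 1:", cm[0, 1])
print("actual 1 predicted as 0:", cm[1, 0])
```

Two of the actual class-0 samples were predicted as class 1, and that count appears in row 0, column 1.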



In [None]:
from sklearn.metrics import confusion_matrix
confusion_matrix(y_test, y_pred)

Reading the rows as actual classes and the columns as predicted classes, this confusion matrix shows that all 15 class 0 samples are predicted correctly.
For class 1, 7 samples are predicted correctly and 4 are misclassified as class 2.
For class 2, 9 samples are predicted correctly and 3 are misclassified as class 1.


In [None]:
# The confusion matrix when k = 1
confusion_matrix(y_test, y_pred1)

In [None]:
# The F1 score is the harmonic mean of precision and recall;
# it reaches its best value at 1 and its worst at 0.
from sklearn.metrics import f1_score
f1_score(y_test, y_pred1, average = 'micro')
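
One thing to keep in mind about `average = 'micro'`: in a single-label multiclass problem like this one, the micro-averaged F1 score equals the accuracy, so it adds no new information. The sketch below, on made-up labels, compares it with `average='macro'`, which averages the per-class F1 scores with equal weight.

```python
from sklearn.metrics import accuracy_score, f1_score

# made-up labels for three classes
y_true_toy = [0, 0, 0, 1, 1, 2, 2, 2, 2]
y_pred_toy = [0, 0, 1, 1, 2, 2, 2, 2, 1]

acc = accuracy_score(y_true_toy, y_pred_toy)
micro = f1_score(y_true_toy, y_pred_toy, average='micro')
macro = f1_score(y_true_toy, y_pred_toy, average='macro')

print("accuracy:", acc)
print("micro F1:", micro)  # same value as accuracy in this setting
print("macro F1:", macro)  # unweighted mean of per-class F1 scores
```

Macro averaging can be the more informative choice when some classes are predicted much worse than others.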

### Conclusion

Using only the two sepal features, the accuracy scores we obtained while searching for a suitable k are roughly between 0.7 and 0.8, with k = 5 giving about 0.815.