<center>
    <img src="https://jsantarc.github.io/ADMN5016_2022/images/logo-stc.jpeg" width="300" alt="cognitiveclass.ai logo"  />
</center>

# **K Nearest Neighbor**

In this lab, you will learn and practice the K Nearest Neighbor (KNN) model, first for classification and then for regression. If the feature space is not very large, KNN can be a highly interpretable model, because you can explain how a prediction is made by looking at its nearest neighbors.

The Iris flower data set, or Fisher's Iris data set, is a multivariate data set introduced by the British statistician and biologist Ronald Fisher in his 1936 paper *The use of multiple measurements in taxonomic problems* as an example of linear discriminant analysis.[1] It is sometimes called Anderson's Iris data set because Edgar Anderson collected the data to quantify the morphologic variation of Iris flowers of three related species.[2] Two of the three species were collected in the Gaspé Peninsula "all from the same pasture, and picked on the same day and measured at the same time by the same person with the same apparatus".

For regression, we will also use the Boston Housing Dataset, which is derived from information collected by the U.S. Census Service concerning housing in the area of Boston, Mass.

After completing this lab you will be able to:

* Train KNN models with different numbers of neighbors (the main hyperparameter)
* Evaluate KNN models on classification
* Tune the number of neighbors to find the best-performing classifier
* Apply KNN to regression

----

First, let's install `seaborn` for visualization tasks and import the required libraries for this lab.

In [None]:
!pip install seaborn==0.11.1

For <a href="https://jupyterlite.readthedocs.io/en/latest/#">JupyterLite</a>, install the packages with `piplite` and define a download helper instead:

import piplite
await piplite.install(['pandas'])
await piplite.install(['matplotlib'])
await piplite.install(['scipy'])
await piplite.install(['seaborn'])
await piplite.install(['ipywidgets'])
await piplite.install(['tqdm'])
await piplite.install(['scikit-learn'])

from pyodide.http import pyfetch

async def download(url, filename):
    response = await pyfetch(url)
    if response.status == 200:
        with open(filename, "wb") as f:
            f.write(await response.bytes())

In [None]:
import pandas as pd
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn import metrics

import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

## Classification 

## Load and explore the Iris dataset

We first load the dataset `iris.csv` as a Pandas dataframe:

In [None]:
# Read the dataset in CSV format
dataset_url = "https://raw.githubusercontent.com/jsantarc/ADMN5016_2022/master/week-1/iris.csv"
df = pd.read_csv(dataset_url)

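For <a href="https://jupyterlite.readthedocs.io/en/latest/#">JupyterLite</a>, download the file first and read the local copy:

path="https://raw.githubusercontent.com/jsantarc/ADMN5016_2022/master/week-1/iris.csv"
await download(path, "iris.csv")
path="iris.csv"
# Read the downloaded local file into a dataframe
df = pd.read_csv(path)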

Then, let's quickly take a look at the head of the dataframe.

In [None]:
df.head()

Next, we display its columns:

In [None]:
df.columns

In [None]:
sns.pairplot(df, hue="species")


Each observation in this dataset contains four measurements of an iris flower: sepal length, sepal width, petal length, and petal width. Based on these features, we want to build a classification model that predicts the species of the flower. The target variable `y` is given in the `species` column.

Then, let's split the dataset into input `X` and output `y`:

In [None]:
X = df[['sepal_length', 'sepal_width', 'petal_length', 'petal_width']]
y = df['species']

We first check the summary statistics of the features in `X`:

In [None]:
X.describe()

In [None]:
X.shape

As we can see from the above cell output, all features are numeric and lie in a similar range (roughly 0 to 8 centimetres). This is convenient because KNN relies on distances between samples: with features on comparable scales, we do not need to rescale them.
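
When features are not on comparable scales, the feature with the largest numeric range tends to dominate the distance. The sketch below illustrates this with made-up numbers (the `mass_mg` column is hypothetical and not part of `iris.csv`) and a hand-written `standardize` helper. scikit-learn's `StandardScaler` applies the same kind of transformation.

```python
import math
import statistics

def standardize(column):
    """Rescale a list of numbers to zero mean and unit (population) standard deviation."""
    mean = statistics.fmean(column)
    std = statistics.pstdev(column)
    return [(value - mean) / std for value in column]

# Made-up data: one feature in centimetres, one hypothetical feature in much larger units
length_cm = [5.1, 4.9, 6.3, 5.8]
mass_mg = [30000.0, 52000.0, 31000.0, 50000.0]

raw = list(zip(length_cm, mass_mg))
scaled = list(zip(standardize(length_cm), standardize(mass_mg)))

# The raw distance between samples 0 and 2 is dominated by the large-valued column
raw_dist = math.dist(raw[0], raw[2])
# After standardizing, both columns contribute on a comparable scale
scaled_dist = math.dist(scaled[0], scaled[2])
print(round(raw_dist, 2), round(scaled_dist, 2))
```

Comparing the two printed distances shows how much of the raw distance comes from the large-valued column alone.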

Next, let's check the class distribution of output `y`:

In [None]:
y.value_counts(normalize=True)

In [None]:
y.value_counts().plot.bar(color=['green', 'red','blue'])

## Process and split training and testing datasets

In [None]:
# Split 80% as the training dataset
# and 20% as the testing dataset (we can also use this as validation data)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

## Train and evaluate a KNN classifier with the number of neighbors set to 2

Training a KNN classifier is very similar to training other classifiers in `sklearn`. We first define a `KNeighborsClassifier` object. Here we use the `n_neighbors=2` argument to specify how many neighbors are used for each prediction, and we keep the other arguments at their default values.

In [None]:
# Define a KNN classifier with `n_neighbors=2`
knn_model = KNeighborsClassifier(n_neighbors=2)
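
To make the idea behind `n_neighbors` concrete, here is a minimal from-scratch sketch of a KNN prediction: compute the distance from the query to every training point, keep the `k` closest, and take a majority vote. The function `knn_predict` and the toy measurements are illustrative assumptions, not rows from `iris.csv`, and the tie-breaking rule is specific to this sketch, not a statement about `sklearn`.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k):
    """Predict the label of `query` by majority vote among its k nearest training points."""
    # Pair each training label with its Euclidean distance to the query
    distances = [(math.dist(p, query), label)
                 for p, label in zip(train_points, train_labels)]
    # Keep the k closest neighbors, nearest first
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    # Majority vote; on a tie, the label encountered first (the nearer neighbor) wins
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Made-up toy measurements (petal_length, petal_width)
toy_points = [(1.4, 0.2), (1.5, 0.3), (4.5, 1.5), (4.7, 1.4), (6.0, 2.5)]
toy_labels = ["setosa", "setosa", "versicolor", "versicolor", "virginica"]

prediction = knn_predict(toy_points, toy_labels, query=(1.6, 0.25), k=3)
print(prediction)
```

Because the prediction depends only on the listed neighbors, you can explain any single prediction by inspecting those `k` training points.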

Then we can train the model with `X_train` and `y_train`. Because `y_train` is already a one-dimensional pandas Series, it can be passed to `fit` directly without calling `ravel()`.

In [None]:
knn_model.fit(X_train, y_train)

Now we can make predictions on the `X_test` dataframe:

In [None]:
yhat = knn_model.predict(X_test)
yhat

Finally, we compute the accuracy, which is the fraction of test samples that were predicted correctly:

In [None]:
np.mean(yhat==y_test)
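
Accuracy is a single number, so it does not show which classes get confused with each other. As a rough sketch, the hypothetical helper below counts every (true label, predicted label) pair, which is the information a confusion matrix holds. The labels are made up for illustration and are not the lab's actual predictions.

```python
from collections import Counter

def accuracy_and_confusion(y_true, y_pred):
    """Return the accuracy and a Counter of (true, predicted) label pairs."""
    pairs = Counter(zip(y_true, y_pred))
    # Pairs where the true and predicted labels agree are the correct predictions
    correct = sum(count for (t, p), count in pairs.items() if t == p)
    return correct / len(y_true), pairs

# Made-up labels for illustration
y_true_demo = ["setosa", "setosa", "versicolor", "versicolor", "virginica", "virginica"]
y_pred_demo = ["setosa", "setosa", "versicolor", "virginica", "virginica", "virginica"]

acc, confusion = accuracy_and_confusion(y_true_demo, y_pred_demo)
print(acc)
for (true_label, pred_label), count in sorted(confusion.items()):
    print(true_label, "->", pred_label, ":", count)
```

In the notebook itself, `metrics.accuracy_score(y_test, yhat)` and `metrics.confusion_matrix(y_test, yhat)` from `sklearn` provide the same information.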

<details><summary>Click here for a sample solution</summary>

```python
model = KNeighborsClassifier(n_neighbors=5)
model.fit(X_train, y_train.values.ravel())
preds = model.predict(X_test)
# Accuracy on the test set (`metrics` is imported from sklearn above)
metrics.accuracy_score(y_test, preds)
```

</details>

## Find the number of neighbors

You may wonder which `n_neighbors` value gives the best classification performance. We can try different values of `n_neighbors` (the K value) and check which `K` performs best.

Here we try K from 1 to 25 and store the test accuracy for each K in a list.

In [None]:
# Try K from 1 to 25
max_k = 25
# Create an empty list to store the accuracy for each K
accuracy = []

Then we train 25 KNN classifiers, with K ranging from 1 to 25.

In [None]:
for k in range(1, max_k + 1):
    # Create a KNN classifier
    knn = KNeighborsClassifier(n_neighbors=k)
    # Train the classifier
    knn = knn.fit(X_train, y_train)
    # Predict on the test set
    yhat = knn.predict(X_test)
    # Evaluate the classifier with accuracy
    accuracy.append(np.mean(yhat == y_test))



Visualize the results by plotting the accuracy against K:

In [None]:
plt.plot(range(1, len(accuracy) + 1), accuracy)
plt.xlabel('k')
plt.ylabel('Accuracy')
plt.show()
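
Choosing K by its accuracy on the test set means the test score is no longer an unbiased estimate for the chosen model. A common alternative is to pick K with cross-validation on the training data only. The sketch below is one way to do that; it assumes scikit-learn's bundled `load_iris` data as a stand-in for `iris.csv`, and the `_demo`-style variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Assumption: scikit-learn's bundled iris data stands in for iris.csv
X_demo, y_demo = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

k_values = list(range(1, 26))
cv_scores = []
for k in k_values:
    # Mean accuracy over 5 folds, computed on the training split only
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=k), X_tr, y_tr, cv=5)
    cv_scores.append(scores.mean())

# K with the highest mean cross-validated accuracy (first one on ties)
best_k = k_values[int(np.argmax(cv_scores))]
print("best k by cross-validation:", best_k)

# The held-out test split is used once, for the final estimate
final_model = KNeighborsClassifier(n_neighbors=best_k).fit(X_tr, y_tr)
test_accuracy = final_model.score(X_te, y_te)
print("test accuracy:", test_accuracy)
```

With this approach the test split plays no part in choosing K, so the final test accuracy is a fairer estimate of performance on new data.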

In [None]:
accuracy

Choose the best hyperparameter (here we use `n_neighbors=5`) and retrain the model on all of your data:

In [None]:
knn = KNeighborsClassifier(n_neighbors=5)

In [None]:
knn = knn.fit(X, y)

For more, check out my course on <a href="https://www.coursera.org/learn/machine-learning-with-python">Machine Learning with Python</a>.

## Regression

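We load the dataset. Note that `load_boston` was deprecated in scikit-learn 1.0 and removed in 1.2, so this cell requires an older scikit-learn version.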

In [None]:
from sklearn.datasets import load_boston

In [None]:
boston_dataset = load_boston()


The dataset object contains the feature names:

In [None]:
feature_names = boston_dataset.feature_names
feature_names

It also contains the feature values as an array:

In [None]:
boston_dataset.data

We convert the features to a dataframe:

In [None]:
X = pd.DataFrame(data=boston_dataset.data, columns=feature_names)
X.head()

The targets are stored separately:

In [None]:
y = boston_dataset.target
y[0:10]

We split the data into training and validation data:

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

In [None]:
from sklearn.neighbors import KNeighborsRegressor

We train the model with two neighbors:

In [None]:
knn = KNeighborsRegressor(n_neighbors=2)
knn.fit(X_train, y_train)
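
For regression, KNN replaces the majority vote with an average: with the default uniform weights, the prediction is the mean target value of the `k` nearest training points. The following from-scratch sketch shows this on made-up one-feature data. The function `knn_regress` and the toy values are illustrative assumptions, not the Boston data.

```python
import math

def knn_regress(train_points, train_targets, query, k):
    """Predict a numeric target as the mean target of the k nearest training points."""
    # Sort training indices by Euclidean distance to the query, nearest first
    order = sorted(range(len(train_points)),
                   key=lambda i: math.dist(train_points[i], query))
    nearest = order[:k]
    # Uniform weighting: a plain average of the neighbors' targets
    return sum(train_targets[i] for i in nearest) / k

# Made-up toy data: one feature (e.g. number of rooms) and a numeric target
toy_features = [(4.0,), (5.0,), (6.0,), (7.0,), (8.0,)]
toy_targets = [15.0, 18.0, 22.0, 30.0, 45.0]

estimate = knn_regress(toy_features, toy_targets, query=(6.2,), k=2)
print(estimate)
```

Here the two nearest points to the query have targets 22.0 and 30.0, so the estimate is their average.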

We make predictions on the test data:

In [None]:
yhat = knn.predict(X_test)
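
Unlike classification, regression predictions are scored with error measures, not with accuracy. As a sketch, the hypothetical helpers below compute the mean squared error and the coefficient of determination (R²) by hand on made-up values, not on the lab's actual predictions.

```python
def mse_demo(y_true, y_pred):
    """Mean squared error: the average of the squared differences."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def r2_demo(y_true, y_pred):
    """R^2: 1 minus the ratio of residual to total sum of squares."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

# Made-up targets and predictions for illustration
y_true_demo = [20.0, 25.0, 30.0, 35.0]
y_pred_demo = [22.0, 24.0, 29.0, 37.0]

mse_value = mse_demo(y_true_demo, y_pred_demo)
r2_value = r2_demo(y_true_demo, y_pred_demo)
print(mse_value, r2_value)
```

In the notebook, `metrics.mean_squared_error(y_test, yhat)` and `metrics.r2_score(y_test, yhat)` compute the same quantities for the fitted regressor.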