# K-Nearest Neighbor Lab
Read over the sklearn info on [nearest neighbor learners](https://scikit-learn.org/stable/modules/neighbors.html#nearest-neighbors-classification)




In [1]:
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor
import numpy as np
import pandas as pd
from scipy.io import arff
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
try:
    from CS270Boi.discussion270 import Discussion
except ImportError:
    !pip install -U -q CS270Boi
    from CS270Boi.discussion270 import Discussion



## 1.0 (0%) Set `net_id` to Your NetID

In [2]:
# This should match your BYU email.
# For example, if my BYU email were jake270@byu.edu, I would set net_id to "jake270"

net_id = "joshhend"

# --------------------_Make sure to run all of the cells before continuing_--------------------
### The discussions and text box are loaded in by running the cell associated with the discussion.
### If you experience any problems/errors with the discussions, please send Jake Cahoon (TA) a message on Discord :)

## 1 K-Nearest Neighbor (KNN) algorithm

### 1.1 (15%) Basic KNN Classification

Learn the [Glass data set](https://archive.ics.uci.edu/dataset/42/glass+identification) using [KNeighborsClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html#sklearn.neighbors.KNeighborsClassifier) with default parameters.
- Randomly split your data into train/test. Any time we don't give specifics (such as what percentage is train vs. test), choose your own reasonable values
- Give typical train and test set accuracies after running with different random splits
- Print the output probabilities for a test set (predict_proba)
- Try it with different p values (Minkowskian exponent) and discuss any differences

In [None]:
headers = ['ID', 'RI', 'Na', 'Mg', 'Al', 'Si', 'K', 'Ca', 'Ba', 'Fe', 'Class']
df = pd.read_csv("glass.csv", header=None, names=headers)
df = df.drop('ID', axis=1)

In [None]:
# Preprocessing
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
df_scaled = pd.DataFrame(scaler.fit_transform(df.iloc[:, :-1]), columns=df.columns[:-1])
df_scaled[df.columns[-1]] = df[df.columns[-1]]
df_scaled

In [28]:
# Learn the glass data
target_col = 'Class'
X_train, X_test, y_train, y_test = train_test_split(df_scaled.drop(target_col, axis=1), df_scaled[target_col], test_size=0.2)

# KNN Classifier
knn = KNeighborsClassifier()
knn.fit(X_train, y_train)

print("Train Accuracy: ", knn.score(X_train, y_train))
print("Test Accuracy: ", knn.score(X_test, y_test))
print("Probabilities: ", knn.predict_proba(X_test))


Train Accuracy:  0.7192982456140351
Test Accuracy:  0.7674418604651163
Probabilities:  [[0.  1.  0.  0.  0.  0. ]
 [0.6 0.4 0.  0.  0.  0. ]
 [0.4 0.2 0.4 0.  0.  0. ]
 [0.  0.  0.  0.6 0.2 0.2]
 [0.2 0.2 0.6 0.  0.  0. ]
 [0.  1.  0.  0.  0.  0. ]
 [0.4 0.4 0.2 0.  0.  0. ]
 [1.  0.  0.  0.  0.  0. ]
 [0.8 0.  0.2 0.  0.  0. ]
 [0.  0.6 0.  0.4 0.  0. ]
 [0.  0.  0.  0.  0.  1. ]
 [0.2 0.8 0.  0.  0.  0. ]
 [0.  1.  0.  0.  0.  0. ]
 [0.  0.6 0.  0.2 0.2 0. ]
 [0.  0.  0.  0.  0.  1. ]
 [0.6 0.4 0.  0.  0.  0. ]
 [0.2 0.6 0.2 0.  0.  0. ]
 [0.2 0.8 0.  0.  0.  0. ]
 [1.  0.  0.  0.  0.  0. ]
 [0.  0.4 0.  0.6 0.  0. ]
 [0.4 0.4 0.2 0.  0.  0. ]
 [0.  0.  0.  0.  0.  1. ]
 [1.  0.  0.  0.  0.  0. ]
 [0.  0.  0.  0.  0.  1. ]
 [0.  1.  0.  0.  0.  0. ]
 [0.6 0.2 0.2 0.  0.  0. ]
 [0.6 0.2 0.2 0.  0.  0. ]
 [0.  0.8 0.2 0.  0.  0. ]
 [1.  0.  0.  0.  0.  0. ]
 [0.  0.  0.  0.  0.  1. ]
 [0.6 0.2 0.2 0.  0.  0. ]
 [0.  1.  0.  0.  0.  0. ]
 [0.  0.6 0.  0.4 0.  0. ]
 [0.  0.6 0.  0.  0.4 0

Split | Train accuracy | Test accuracy
--- | --- | ---
1 | 0.766 | 0.581
2 | 0.760 | 0.581
3 | 0.754 | 0.698
**Average** | 0.760 | 0.620
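
The averages in the table come from repeating the fit on several random splits. The cell that produced them is not shown, so the following is only a sketch of one way to do it. The helper name `average_accuracy` and the synthetic data from `make_classification` are assumptions standing in for the scaled glass features.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier


def average_accuracy(X, y, n_splits=3, test_size=0.2):
    """Fit a default KNeighborsClassifier on several random splits and average the scores."""
    train_scores, test_scores = [], []
    for seed in range(n_splits):
        # A different random_state gives a different random split each time.
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=test_size, random_state=seed
        )
        knn = KNeighborsClassifier().fit(X_tr, y_tr)
        train_scores.append(knn.score(X_tr, y_tr))
        test_scores.append(knn.score(X_te, y_te))
    return float(np.mean(train_scores)), float(np.mean(test_scores))


# Synthetic stand-in for the scaled glass data (an assumption for this sketch).
X_demo, y_demo = make_classification(
    n_samples=214, n_features=9, n_informative=5, n_classes=3, random_state=0
)
train_avg, test_avg = average_accuracy(X_demo, y_demo)
print(f"avg train accuracy: {train_avg:.3f}, avg test accuracy: {test_avg:.3f}")
```

With the real data, `X` and `y` would be the feature columns and the `Class` column of `df_scaled`.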

In [33]:
p_values = [1, 1.5, 2, 2.5, 3, 3.5, 4]
print("| p | Train Accuracy | Test Accuracy |")
print("|---|----------------|---------------|")
for p in p_values:
    knn = KNeighborsClassifier(p=p)
    knn.fit(X_train, y_train)
    print(f"| {p} | {knn.score(X_train, y_train):.3f} | {knn.score(X_test, y_test):.3f} |")

| p | Train Accuracy | Test Accuracy |
|---|----------------|---------------|
| 1 | 0.725 | 0.837 |
| 1.5 | 0.708 | 0.791 |
| 2 | 0.719 | 0.767 |
| 2.5 | 0.731 | 0.767 |
| 3 | 0.737 | 0.767 |
| 3.5 | 0.725 | 0.744 |
| 4 | 0.719 | 0.721 |


In [None]:
# @title 1.1 Discussion { display-mode: "form" }
# PLEASE DO NOT ALTER THIS CODE
if net_id == "":
  raise Exception("You need to set your net_id, silly goose.")
discussion_id = "1.1KNN"
questions = ["Include a general discussion about what you did/learned above."]
Discussion(discussion_id, questions, net_id); pass

**Include a general discussion about what you did/learned above.**

I loaded the glass dataset, dropped the ID column (it is unique for each row, so it carries no information about the class), and min-max normalized the remaining features. I then split the data into train and test sets and ran the model on three random splits to get average accuracies. With the default parameters, accuracy averaged about 0.760 on the training set and 0.620 on the test set. With six classes that is well above uniform guessing (about 0.17), but it still leaves many test instances misclassified. I also printed the output probabilities for one test set, which made the voting method clear: every probability is a multiple of 0.2 because the default k is 5, so each of the five nearest neighbors contributes a vote of 1/5 and the class with the most votes wins. Finally, I experimented with different p values. The p value didn't seem to affect training accuracy much, which bounced between 0.708 and 0.737, but test accuracy never increased as p grew, falling from 0.837 at p=1 to 0.721 at p=4. On this split, at least, a lower p value led to better generalization.
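
The two ideas in this discussion, the Minkowski exponent p and the 1/k voting behind `predict_proba`, can be illustrated with a small NumPy sketch. The function names `minkowski` and `vote_fractions` are made up for illustration and are not part of sklearn.

```python
import numpy as np


def minkowski(a, b, p=2):
    """Minkowski distance; p=1 is Manhattan distance, p=2 is Euclidean distance."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(np.sum(np.abs(a - b) ** p) ** (1.0 / p))


def vote_fractions(neighbor_labels, classes):
    """Unweighted vote: each of the k neighbors contributes 1/k to its class."""
    neighbor_labels = np.asarray(neighbor_labels)
    return [float(np.mean(neighbor_labels == c)) for c in classes]


a, b = [0.0, 0.0], [3.0, 4.0]
print(minkowski(a, b, p=1))  # 7.0
print(minkowski(a, b, p=2))  # 5.0

# Five neighbors (the default k) with labels 1, 1, 1, 2, 3
print(vote_fractions([1, 1, 1, 2, 3], classes=[1, 2, 3]))  # [0.6, 0.2, 0.2]
```

Larger p values put more emphasis on the single feature with the largest difference, which is one possible reason the choice of p changes which neighbors count as nearest.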

## 2 KNN Classification with normalization and distance weighting

Use the [magic telescope](https://axon.cs.byu.edu/data/uci_class/MagicTelescope.arff) dataset

### 2.1 (5%) - Without Normalization or Distance Weighting
- Do random 80/20 train/test splits each time
- Run with k=3 and *without* distance weighting and *without* normalization
- Show train and test set accuracy
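
Before any of this can run, the ARFF file has to be loaded into a DataFrame. The sketch below shows the general pattern with `scipy.io.arff.loadarff` on a tiny in-memory file. The attribute names and values in `toy_arff` are made up for illustration and should not be read as the contents of the real dataset.

```python
import io

import pandas as pd
from scipy.io import arff

# Tiny made-up ARFF text standing in for the downloaded file (hypothetical values).
toy_arff = """@relation toy
@attribute fLength numeric
@attribute fWidth numeric
@attribute class {g,h}
@data
28.8,16.0,g
31.6,11.7,g
162.1,136.0,h
"""

data, meta = arff.loadarff(io.StringIO(toy_arff))
toy_df = pd.DataFrame(data)

# Nominal attributes are loaded as bytes (b'g'); decode them to plain strings.
toy_df["class"] = toy_df["class"].str.decode("utf-8")

X = toy_df.drop(columns="class")
y = toy_df["class"]
print(y.tolist())  # ['g', 'g', 'h']
```

For the real file, `loadarff` can be given the path of the downloaded `.arff` file instead of the `StringIO` object. The resulting `X` and `y` can then go to `train_test_split(..., test_size=0.2)` and `KNeighborsClassifier(n_neighbors=3)`.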

In [None]:
# Learn magic telescope data

In [None]:
# @title 2.1 Discussion { display-mode: "form" }
# PLEASE DO NOT ALTER THIS CODE
if net_id == "":
  raise Exception("You need to set your net_id, silly goose.")
discussion_id = "2.1KNN"
questions = ["Include a general discussion about what you did/learned above."]
Discussion(discussion_id, questions, net_id); pass

### 2.2 (10%) With Normalization
- Try it with k=3 without distance weighting but *with* normalization of input features.  You may use any reasonable normalization approach (e.g. standard min-max normalization between 0-1, z-transform, etc.)
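
To see why normalization matters for KNN, here is a sketch on synthetic data where one feature separates the classes on a small scale and a second feature is pure noise on a large scale. The data are invented for illustration. Putting `MinMaxScaler` in a pipeline is one reasonable choice (it fits the scaler on the training split only), not the only one.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)
n = 400
y = rng.integers(0, 2, size=n)
# Feature 0 separates the classes but lives on a small scale.
informative = np.where(y == 0, rng.uniform(0.0, 0.4, n), rng.uniform(0.6, 1.0, n))
# Feature 1 is pure noise on a much larger scale.
noise = rng.uniform(0.0, 1000.0, n)
X = np.column_stack([informative, noise])

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

raw = KNeighborsClassifier(n_neighbors=3).fit(X_tr, y_tr)
# The pipeline fits the scaler on the training split, then applies it to test data.
scaled = make_pipeline(MinMaxScaler(), KNeighborsClassifier(n_neighbors=3)).fit(X_tr, y_tr)

acc_raw = raw.score(X_te, y_te)
acc_scaled = scaled.score(X_te, y_te)
print(f"raw: {acc_raw:.3f}  scaled: {acc_scaled:.3f}")
```

Without scaling, the distance is dominated by the large-scale noise feature, so the nearest neighbors are close to random with respect to the class.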

In [None]:
# Train/Predict with normalization

In [None]:
# @title 2.2 Discussion { display-mode: "form" }
# PLEASE DO NOT ALTER THIS CODE
if net_id == "":
  raise Exception("You need to set your net_id, silly goose.")
discussion_id = "2.2KNN"
questions = ["Discuss the results of using normalized data vs. unnormalized data.", "Why is it a good idea to normalize data before using KNN?"]
Discussion(discussion_id, questions, net_id); pass

### 2.3 (10%) With Distance Weighting
- Try it with k=3 and with distance weighting *and* normalization
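
In sklearn, distance weighting is switched on with `weights="distance"`, which weights each neighbor by the inverse of its distance. The NumPy sketch below shows how that can change the outcome of a vote. The function `knn_vote` and its `eps` guard against division by zero are illustrative assumptions, not sklearn's implementation.

```python
import numpy as np


def knn_vote(distances, labels, weighted=False, eps=1e-12):
    """Return the winning label among the k nearest neighbors."""
    distances = np.asarray(distances, dtype=float)
    labels = np.asarray(labels)
    # Uniform voting gives every neighbor weight 1; inverse-distance voting uses 1/d.
    weights = 1.0 / (distances + eps) if weighted else np.ones_like(distances)
    classes = np.unique(labels)
    totals = np.array([weights[labels == c].sum() for c in classes])
    return classes[np.argmax(totals)].item()


dists = [0.1, 1.0, 1.0]
labs = ["g", "h", "h"]
print(knn_vote(dists, labs))                 # h
print(knn_vote(dists, labs, weighted=True))  # g
```

With uniform voting the two farther `h` neighbors outvote the single `g` neighbor. With inverse-distance weights the very close `g` neighbor wins.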

In [None]:
# Train/Predict with normalization and distance weighting

In [None]:
# @title 2.3 Discussion { display-mode: "form" }
# PLEASE DO NOT ALTER THIS CODE
if net_id == "":
  raise Exception("You need to set your net_id, silly goose.")
discussion_id = "2.3KNN"
questions = ["How did the results change when you used distance weighting?"]
Discussion(discussion_id, questions, net_id); pass

### 2.4 (10%) Different k Values
- Using your normalized data with distance weighting, create one graph with classification accuracy on the test set on the y-axis and k values on the x-axis.
- Use values of k from 1 to 15.  Use the same train/test split for each.
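
One possible shape for this experiment is sketched below: a helper that reuses a single split for every k, plus a small plotting function. The names `accuracy_by_k` and `plot_accuracy_by_k` and the synthetic data are assumptions for illustration. The real run would use the magic telescope split instead.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler


def accuracy_by_k(X_tr, X_te, y_tr, y_te, k_values):
    """Test accuracy for each k, reusing one fixed train/test split."""
    scores = []
    for k in k_values:
        model = make_pipeline(
            MinMaxScaler(),
            KNeighborsClassifier(n_neighbors=k, weights="distance"),
        )
        scores.append(model.fit(X_tr, y_tr).score(X_te, y_te))
    return scores


def plot_accuracy_by_k(k_values, scores):
    """Line plot with k on the x-axis and test accuracy on the y-axis."""
    fig, ax = plt.subplots()
    ax.plot(list(k_values), scores, marker="o")
    ax.set_xlabel("k (number of neighbors)")
    ax.set_ylabel("Test accuracy")
    ax.set_title("KNN test accuracy vs. k")
    return fig


# Synthetic stand-in data (an assumption for this sketch).
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
split = train_test_split(X, y, test_size=0.2, random_state=0)
k_values = range(1, 16)
scores = accuracy_by_k(*split, k_values)
best_k = k_values[scores.index(max(scores))]
print(f"best k on this split: {best_k}")
```

Calling `plot_accuracy_by_k(k_values, scores)` followed by `plt.show()` draws the graph. The best k found this way is specific to the one split used.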

In [None]:
# Calculate and Graph classification accuracy vs k values

In [None]:
# @title 2.4 Discussion { display-mode: "form" }
# PLEASE DO NOT ALTER THIS CODE
if net_id == "":
  raise Exception("You need to set your net_id, silly goose.")
discussion_id = "2.4KNN"
questions = ["Which is the best k value for the `magic_telescope` dataset?", "Interpret/describe your graph."]
Discussion(discussion_id, questions, net_id); pass

## 3 KNN Regression with normalization and distance weighting

Use the [sklearn KNeighborsRegressor](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsRegressor.html#sklearn.neighbors.KNeighborsRegressor) on the [housing price prediction](https://axon.cs.byu.edu/data/uci_regression/housing.arff) problem.

### 3.1 (5%) Ethical Data
Note this data set has an example of an inappropriate input feature which we discussed.  State which feature is inappropriate and discuss why.

In [None]:
# @title 3.1 Discussion { display-mode: "form" }
# PLEASE DO NOT ALTER THIS CODE
if net_id == "":
  raise Exception("You need to set your net_id, silly goose.")
discussion_id = "3.1KNN"
questions = ["Which feature is innapropriate and why?"]
Discussion(discussion_id, questions, net_id); pass

### 3.2 (15%) - KNN Regression
- Do random 80/20 train/test splits each time
- Run with k=3
- Print the score (coefficient of determination) and Mean Absolute Error (MAE) for the train and test set for the cases of
  - No input normalization and no distance weighting
  - Normalization and no distance weighting
  - Normalization and distance weighting
- Normalize input features where needed, but do not normalize the output
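
The three cases above can be compared with one small helper, sketched here on synthetic regression data. The helper name `evaluate` and the `make_regression` data are assumptions standing in for the housing set. Only the inputs pass through `MinMaxScaler`, and the target is left untouched.

```python
from sklearn.datasets import make_regression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler


def evaluate(model, X_tr, X_te, y_tr, y_te):
    """Fit, then return (R^2, MAE) for the train and test splits."""
    model.fit(X_tr, y_tr)
    results = {}
    for name, X_part, y_part in [("train", X_tr, y_tr), ("test", X_te, y_te)]:
        r2 = model.score(X_part, y_part)  # coefficient of determination
        mae = mean_absolute_error(y_part, model.predict(X_part))  # in target units
        results[name] = (r2, mae)
    return results


# Synthetic stand-in for the housing data (an assumption for this sketch).
X, y = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=0)
split = train_test_split(X, y, test_size=0.2, random_state=0)

configs = {
    "raw, uniform": KNeighborsRegressor(n_neighbors=3),
    "scaled, uniform": make_pipeline(MinMaxScaler(), KNeighborsRegressor(n_neighbors=3)),
    "scaled, distance": make_pipeline(
        MinMaxScaler(), KNeighborsRegressor(n_neighbors=3, weights="distance")
    ),
}
all_results = {name: evaluate(model, *split) for name, model in configs.items()}
for name, res in all_results.items():
    (r2_tr, mae_tr), (r2_te, mae_te) = res["train"], res["test"]
    print(f"{name:17s} train R2={r2_tr:.3f} MAE={mae_tr:.2f} | test R2={r2_te:.3f} MAE={mae_te:.2f}")
```

With distance weighting, each training point is its own zero-distance neighbor, so the training error is expected to be essentially zero. The test-set numbers are the meaningful comparison.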

In [None]:
# Learn and experiment with housing price prediction data

In [None]:
# @title 3.2 Discussion { display-mode: "form" }
# PLEASE DO NOT ALTER THIS CODE
if net_id == "":
  raise Exception("You need to set your net_id, silly goose.")
discussion_id = "3.2KNN"
questions = ["Discuss your results.", "Which method was the best?"]
Discussion(discussion_id, questions, net_id); pass

### 3.3 (10%)  Different k Values
- Using housing with normalized data and distance weighting, create one graph with MAE on the test set on the y-axis and k values on the x-axis
- Use values of k from 1 to 15.  Use the same train/test split for each.

In [None]:
# Learn and graph for different k values

In [None]:
# @title 3.3 Discussion { display-mode: "form" }
# PLEASE DO NOT ALTER THIS CODE
if net_id == "":
  raise Exception("You need to set your net_id, silly goose.")
discussion_id = "3.3KNN"
questions = ["Which is the best k value for the `housing` dataset?", "Interpret/describe your graph."]
Discussion(discussion_id, questions, net_id); pass

## 4. (20%) KNN with nominal and real data

- Use the [lymph dataset](https://axon.cs.byu.edu/data/uci_class/lymph.arff)
- Use an 80/20 split of the data for the training/test set
- This dataset has both continuous and nominal attributes
- Implement a distance metric which uses Euclidean distance for continuous features and 0/1 distance for nominal. Hints:
    - Write your own distance function (e.g. mydist) and use clf = KNeighborsClassifier(metric=mydist)
    - Change the nominal features in the data set to integer values since KNeighborsClassifier expects numeric features. I used sklearn's LabelEncoder on the nominal features.
    - Keep a list of which features are nominal which mydist can use to decide which distance measure to use
    - There was an occasional bug in sklearn version 1.3.0 ("Flags object has no attribute 'c_contiguous'") that went away when I upgraded to the latest sklearn version, 1.3.1
- Use your own choice for k and other parameters
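
One way to follow these hints is sketched below: a factory that remembers which columns are nominal and returns a `mydist` function. Summing squared continuous differences and 0/1 mismatches under a single square root is an assumption made for this sketch, as are the name `make_mixed_distance` and the column layout. Other ways of combining the two parts are equally defensible.

```python
import numpy as np


def make_mixed_distance(nominal_idx, n_features):
    """Build a metric: squared difference on continuous columns, 0/1 mismatch on nominal ones."""
    nominal_mask = np.zeros(n_features, dtype=bool)
    nominal_mask[list(nominal_idx)] = True

    def mydist(a, b):
        a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
        cont = np.sum((a[~nominal_mask] - b[~nominal_mask]) ** 2)
        nom = np.sum(a[nominal_mask] != b[nominal_mask])  # each mismatch adds 1
        return float(np.sqrt(cont + nom))

    return mydist


# Hypothetical layout: columns 0 and 2 are nominal (integer-coded), column 1 is continuous.
mydist = make_mixed_distance(nominal_idx=[0, 2], n_features=3)
print(mydist([1, 0.5, 3], [2, 0.5, 3]))  # 1.0  (one nominal mismatch)
print(mydist([1, 0.0, 3], [1, 3.0, 3]))  # 3.0  (continuous gap of 3)
print(mydist([4, 0.2, 7], [4, 0.2, 7]))  # 0.0  (identical rows)
```

The returned function can be passed as `KNeighborsClassifier(metric=mydist)`. Scaling the continuous columns to the 0-1 range first keeps them comparable to the 0/1 nominal mismatches.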

In [None]:
# Train/Predict lymph with your own distance metric

In [None]:
# @title 4 Discussion { display-mode: "form" }
# PLEASE DO NOT ALTER THIS CODE
if net_id == "":
  raise Exception("You need to set your net_id, silly goose.")
discussion_id = "4KNN"
questions = ["Explain your distance metric.", "Discuss the results of using your own distance metric."]
Discussion(discussion_id, questions, net_id); pass

## 5. (Optional 15% extra credit) Code up your own KNN Learner
Below is a scaffold you could use if you want. Requirements for this task:
- Your model should support the methods shown in the example scaffold below
- Use Euclidean distance to decide closest neighbors
- Implement both the classification and regression versions
- Include optional distance weighting for both algorithms
- Run your algorithm on the magic telescope and housing data sets above and discuss and compare your results
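
As a starting point for the regression version, the core prediction step can be sketched in a few lines of NumPy. This is a minimal illustration for a single query point, not a full learner. The function name `knn_regress` and the handling of zero distances (averaging the exact matches) are assumptions made here.

```python
import numpy as np


def knn_regress(X_train, y_train, x, k=3, weighted=False):
    """Predict one target as the (optionally inverse-distance weighted) mean of k neighbors."""
    X_train = np.asarray(X_train, dtype=float)
    y_train = np.asarray(y_train, dtype=float)
    # Euclidean distance from the query point to every training row.
    dists = np.sqrt(np.sum((X_train - np.asarray(x, dtype=float)) ** 2, axis=1))
    nearest = np.argsort(dists)[:k]
    if not weighted:
        return float(np.mean(y_train[nearest]))
    d = dists[nearest]
    if np.any(d == 0):  # exact match: average the targets of the zero-distance neighbors
        return float(np.mean(y_train[nearest][d == 0]))
    w = 1.0 / d
    return float(np.sum(w * y_train[nearest]) / np.sum(w))


X_train = [[0.0], [2.5], [10.0]]
y_train = [0.0, 10.0, 100.0]
print(knn_regress(X_train, y_train, [0.5], k=2))                 # 5.0
print(knn_regress(X_train, y_train, [0.5], k=2, weighted=True))  # 2.0
```

In the weighted case the neighbor at distance 0.5 gets weight 2 and the one at distance 2.0 gets weight 0.5, which pulls the prediction toward the closer target. The classification version replaces the weighted mean with a weighted vote.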

In [None]:
# @title 5 Discussion { display-mode: "form" }
# PLEASE DO NOT ALTER THIS CODE
if net_id == "":
  raise Exception("You need to set your net_id, silly goose.")
discussion_id = "5KNN"
questions = ["Discuss what you learned from implementing a KNN from scratch."]
Discussion(discussion_id, questions, net_id); pass

In [None]:
from sklearn.base import BaseEstimator, ClassifierMixin

class KNNClassifier(BaseEstimator,ClassifierMixin):
    def __init__(self, columntype=[], weight_type='inverse_distance'): ## add parameters here
        """
        Args:
            columntype: for each column, tells you whether it is continuous [real] or nominal [categorical].
            weight_type: inverse-distance voting or no distance weighting. Options = ["no_weight", "inverse_distance"]
        """
        self.columntype = columntype  # Note: this won't be needed until part 5
        self.weight_type = weight_type

    def fit(self, X, y):
        """ Fit the data; run the algorithm (for this lab really just saves the data :D)
        Args:
            X (array-like): A 2D numpy array with the training data, excluding targets
            y (array-like): A 2D numpy array with the training targets
        Returns:
            self: this allows this to be chained, e.g. model.fit(X,y).predict(X_test)
        """
        return self

    def predict(self, X):
        """ Predict all classes for a dataset X
        Args:
            X (array-like): A 2D numpy array with the data to predict, excluding targets
        Returns:
            array, shape (n_samples,)
                Predicted target values per element in X.
        """
        pass

    # Returns the mean score given input data and labels
    def score(self, X, y):
        """ Return accuracy of model on a given dataset. Must implement own score function.
        Args:
            X (array-like): A 2D numpy array with data, excluding targets
            y (array-like): A 2D numpy array with targets
        Returns:
            score : float
                Mean accuracy of self.predict(X) wrt. y.
        """
        return 0