# CSE 572: Lab 4

In this lab, you will practice implementing k nearest neighbors for classification.

To execute and make changes to this notebook, click File > Save a copy to save your own version in your Google Drive or GitHub. Read the step-by-step instructions below carefully. To execute the code, click on each cell below and either press SHIFT and ENTER simultaneously or click the Play button.

When you finish executing all code/exercises, save your notebook, then download a copy (.ipynb file). Submit the following **three** things on Canvas:
1. a link to your Colab notebook,
2. the .ipynb file, and
3. a PDF of the executed notebook.

To generate a pdf of the notebook, click File > Print > Save as PDF.

Acknowledgment: Much of the content in this notebook was adapted from Introduction to Data Mining, 2nd Edition by Tan, Steinbach, Karpatne, Kumar.

## Dataset preparation

We will use the Wisconsin Breast Cancer Dataset for this exercise. We used the original version of the dataset for Lab 3 (found [here](https://archive.ics.uci.edu/ml/datasets/breast+cancer+wisconsin+(original))), but for Lab 4 we will use a newer version of the dataset with different features. This dataset does not have any missing attribute values, so we will skip cleaning it.

Read about this dataset here: https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Diagnostic%29

Description of dataset from UCI documentation: 

Number of instances: 569 

Number of attributes: 32 (ID, diagnosis, 30 real-valued input features)

Attribute information

Features are computed from a digitized image of a fine needle aspirate (FNA) of a breast mass.  They describe characteristics of the cell nuclei present in the image. 

1. ID number
2. Diagnosis (M = malignant, B = benign)
3. Attributes 3-32:

Ten real-valued features are computed for each cell nucleus:

	a) radius (mean of distances from center to points on the perimeter)
	b) texture (standard deviation of gray-scale values)
	c) perimeter
	d) area
	e) smoothness (local variation in radius lengths)
	f) compactness (perimeter^2 / area - 1.0)
	g) concavity (severity of concave portions of the contour)
	h) concave points (number of concave portions of the contour)
	i) symmetry 
	j) fractal dimension ("coastline approximation" - 1)

The mean, standard error, and "worst" or largest (mean of the three
largest values) of these features were computed for each image,
resulting in 30 features.  For instance, attribute 3 is Mean Radius, attribute 13
is Radius standard error, and attribute 23 is Worst Radius.
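
The attribute layout described above can be turned into readable column names. The following is a minimal sketch, assuming the 30 features are stored as three consecutive blocks of ten (mean, standard error, worst), which is consistent with attributes 3, 13, and 23 all being radius values. The names `base_features`, `statistics`, and `feature_names` are illustrative and do not appear in the UCI files.

```python
# Illustrative names only; the block ordering (mean, se, worst) is an assumption
# inferred from the UCI description of attributes 3, 13, and 23.
base_features = [
    "radius", "texture", "perimeter", "area", "smoothness",
    "compactness", "concavity", "concave_points", "symmetry", "fractal_dimension",
]
statistics = ["mean", "se", "worst"]

# One name per feature column: all ten means, then all ten standard errors, then all ten worst values
feature_names = [f"{feat}_{stat}" for stat in statistics for feat in base_features]

print(len(feature_names))                                       # 30
print(feature_names[0], feature_names[10], feature_names[20])   # radius_mean radius_se radius_worst
```

Position 0 in this list corresponds to attribute 3 in the documentation, because attributes 1 and 2 are the ID and the diagnosis.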

In [None]:
# Load the raw dataset (the file has no header row, so we pass header=None)
import pandas as pd
import numpy as np

data = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/wdbc.data', header=None)

data

In [None]:
# Drop the first column which represents the ID number
data = data.drop(columns=0)

# Rename column 1 to "class" since this column represents the class (M=malignant, B=benign)
data = data.rename(columns={1: "class"})

data

### Dataset splits

It is common practice in data mining and machine learning to split a dataset into training, validation, and test sets. 

- Training set: subset of dataset used for training/fitting model parameters
- Validation set: subset of dataset used to evaluate model generalization performance and tune hyperparameters (model choices)
- Test set: subset of dataset used to test performance after initial vetting using validation set

The training set is usually allocated the largest percentage of the total dataset. For example, a common split might be 60/20/20\% or 80/10/10\% of the data assigned to training/validation/test subsets respectively. 
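
To make the percentages concrete, here is a small sketch of a 60/20/20 split on a stand-in table with the same number of rows as this dataset (569). The DataFrame `toy` and its column are placeholders, not the real data, and the seed value is arbitrary.

```python
import pandas as pd

# Stand-in table with 569 rows; only the row count matters for this illustration
toy = pd.DataFrame({"value": range(569)})

# 60% for training, sampled without replacement
toy_train = toy.sample(frac=0.6, random_state=0)

# Half of the remaining 40% gives about 20% of the original rows for validation
toy_rest = toy.drop(toy_train.index)
toy_val = toy_rest.sample(frac=0.5, random_state=0)

# Whatever is left (about 20%) becomes the test set
toy_test = toy_rest.drop(toy_val.index)

print(len(toy_train), len(toy_val), len(toy_test))
```

Because each subset is carved out of the rows left over from the previous step, the three subsets never share a row and together cover the whole table.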

The following cells split the dataset into training, validation, and test subsets using a simple random sampling strategy (without replacement).

In [None]:
# Set the random seed
SEED = 42

In [None]:
# Sample 60% of the instances for the training set
train = data.sample(frac=0.6, random_state=SEED)
train

In [None]:
# Sample 20% for the validation set. 
# First we need to drop the training instances from our dataframe to sample from the remaining instances.
data_remaining = data.drop(train.index)
# Note that since we are sampling from the rows remaining after removing the training subset, which 
# leaves 40% of the total data, we need to sample 50% of the remaining dataset to result in 20% of 
# the original dataset.
val = data_remaining.sample(frac=0.5, random_state=SEED)
val

In [None]:
# Drop the validation instances from data_remaining
# This leaves us with the remaining 20% of the original dataset, 
# which makes up our test set.
test = data_remaining.drop(val.index)
test

Use `value_counts()` to get the number of benign vs. malignant examples in each subset.

In [None]:
# YOUR CODE HERE

### K-Nearest neighbors classifier

In this approach, the class label of a test instance is predicted based on the majority class of its *k* closest training instances. The number of nearest neighbors, *k*, is a hyperparameter that must be provided by the user, along with the distance metric. By default, we can use Euclidean distance, which is equivalent to Minkowski distance with exponent p=2:

\begin{equation*}
\textrm{Minkowski distance}(x,y) = \bigg[\sum_{i=1}^N |x_i-y_i|^p \bigg]^{\frac{1}{p}}
\end{equation*}
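
The formula can be checked by hand on a small example. Below is a minimal NumPy sketch of the Minkowski distance; the function name `minkowski_distance` and the two points are chosen purely for illustration.

```python
import numpy as np

def minkowski_distance(x, y, p=2):
    """Minkowski distance between two equal-length vectors for exponent p."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # Sum the p-th powers of the absolute coordinate differences, then take the p-th root
    return float(np.sum(np.abs(x - y) ** p) ** (1.0 / p))

a = [0.0, 0.0]
b = [3.0, 4.0]

print(minkowski_distance(a, b, p=2))  # 5.0 (Euclidean: sqrt(3^2 + 4^2))
print(minkowski_distance(a, b, p=1))  # 7.0 (sum of absolute differences: 3 + 4)
```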

We will use the Scikit-learn library to implement the KNN classifier. In `sklearn`, classifier objects have a `fit()` function used to train the classifier model. This function expects two arguments: `X` (the input features, formatted as a matrix in which the rows are individual training samples and the columns are the features) and `y` (the target class to predict, formatted as a vector in which the rows are individual training samples and the column is the class label). You can read more about the `sklearn` API [here](https://scikit-learn.org/stable/developers/develop.html#apis-of-scikit-learn-objects).
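
As a rough illustration of the expected shapes, the sketch below fits a classifier on a tiny made-up dataset. The arrays `X_toy` and `y_toy` and their values are invented for this example and have nothing to do with the breast cancer data.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# X: one row per sample, one column per feature -> shape (4, 2)
X_toy = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]])
# y: one class label per sample -> shape (4,)
y_toy = np.array(["B", "B", "M", "M"])

toy_clf = KNeighborsClassifier(n_neighbors=1)
toy_clf.fit(X_toy, y_toy)

# Each new point is labeled with the class of its single nearest training point
toy_pred = toy_clf.predict(np.array([[0.05, 0.1], [1.0, 0.9]]))

print(X_toy.shape, y_toy.shape)
print(toy_pred)
```

The number of rows in `X` must match the length of `y`, and any data passed to `predict()` must have the same number of columns as the data passed to `fit()`.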

In the code block below, write code to create new variables for X and y (for the train, val, and test sets) that we can pass to our classifier functions.

In [None]:
X_train = # YOUR CODE HERE
X_val = # YOUR CODE HERE
X_test = # YOUR CODE HERE

y_train = # YOUR CODE HERE
y_val = # YOUR CODE HERE
y_test = # YOUR CODE HERE

In [None]:
# Standardize the data using the StandardScaler.

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler().fit(X_train)
X_train = scaler.transform(X_train)
X_val = scaler.transform(X_val)
X_test = scaler.transform(X_test)

**Question 1: How does StandardScaler scale the data?**

**Answer:**

YOUR ANSWER HERE

**Question 2: Why would we use the training set to calculate the scaling parameters? Why not use the entire dataset (before splitting) instead?**

**Answer:**

YOUR ANSWER HERE

In the code below, we'll implement the KNN classifier using different settings of the hyperparameter $k$ or `n_neighbors` (number of neighbors) and observe how the training and validation accuracies change.

In [None]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
import matplotlib.pyplot as plt
%matplotlib inline

numNeighbors = range(1, 31)
trainAcc = []
valAcc = []

for k in numNeighbors:
    clf = KNeighborsClassifier(n_neighbors=k, metric='minkowski', p=2)
    clf.fit(X=X_train, y=y_train)
    Y_predTrain = clf.predict(X_train)
    Y_predVal = clf.predict(X_val)
    trainAcc.append(accuracy_score(y_train, Y_predTrain))
    valAcc.append(accuracy_score(y_val, Y_predVal))

plt.plot(numNeighbors, trainAcc, 'ro-', numNeighbors, valAcc,'bv--')
plt.legend(['Training Accuracy','Validation Accuracy'])
plt.xlabel('Number of neighbors')
plt.ylabel('Accuracy')
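
Besides reading the best setting off the plot, the value of k with the highest validation accuracy can be looked up directly. The sketch below uses made-up accuracy values (`example_val_acc`) rather than results from this dataset; the same pattern would apply to lists such as `numNeighbors` and `valAcc`.

```python
import numpy as np

# Made-up values for illustration only; these are not results from the lab
example_k = list(range(1, 6))
example_val_acc = [0.93, 0.95, 0.96, 0.96, 0.94]

# np.argmax returns the position of the first maximum, so ties go to the smaller k
best_index = int(np.argmax(example_val_acc))
best_k = example_k[best_index]

print(best_k, example_val_acc[best_index])  # 3 0.96
```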

In class, we discussed other distance metrics that could be used besides Euclidean distance, such as absolute distance (also called Manhattan or L1 distance, i.e., Minkowski distance with p=1) and cosine distance. Implement kNN and create the same plot as above, but using 1) absolute distance and 2) cosine distance. You will need to consult the [scikit-learn documentation](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html) to find out how to change the distance metric.

In [None]:
### Absolute distance ###
# YOUR CODE HERE

In [None]:
### Cosine distance ###
# YOUR CODE HERE

**Question 3: Which distance metric(s) gave the best overall accuracy on the validation set? You can refer to the highest validation accuracy achieved by each distance metric for this question.**

**Answer:**

YOUR ANSWER HERE

### Weighted kNN
By default, the kNN classifier in sklearn uses uniform weights, i.e., each neighbor is weighted equally when determining the class label based on the k nearest neighbors. Alternatively, we could weight the decision based on the distance of each neighbor from the test instance. Consult the documentation to figure out how to weight neighbors by their distance during prediction, then implement weighted kNN and generate the same plot as in the previous cells. Use absolute distance as the distance metric.
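
To see why weighting can change a prediction, the sketch below compares a uniform vote with an inverse-distance vote for one hypothetical test instance. It is a plain-Python illustration of the idea, assuming inverse-distance weights (one common scheme) and nonzero distances; it is not the library's implementation, and the labels and distances are invented.

```python
from collections import defaultdict

def vote(labels, distances, weighted=False):
    """Return the winning class among the neighbors.

    With weighted=False every neighbor counts once; with weighted=True each
    neighbor contributes 1/distance (assumes all distances are nonzero).
    """
    scores = defaultdict(float)
    for label, dist in zip(labels, distances):
        scores[label] += 1.0 / dist if weighted else 1.0
    return max(scores, key=scores.get)

# Three hypothetical nearest neighbors: one close malignant, two distant benign
neighbor_labels = ["M", "B", "B"]
neighbor_distances = [0.5, 2.0, 2.5]

print(vote(neighbor_labels, neighbor_distances))                 # B (2 votes vs. 1)
print(vote(neighbor_labels, neighbor_distances, weighted=True))  # M (2.0 vs. 0.5 + 0.4)
```

With uniform weights the two distant benign neighbors outvote the single malignant one, whereas with inverse-distance weights the much closer malignant neighbor dominates.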

In [None]:
### Weighted kNN ###
# YOUR CODE HERE

To compute our final test accuracy, we'll choose a distance metric and number of neighbors that gave good performance on the validation set. Below, train a kNN model using Absolute Distance (L1 distance) and 3 neighbors (with uniform weights), then compute the test accuracy.

In [None]:
# YOUR CODE HERE