<a href="https://colab.research.google.com/github/nickwharff/CS167_Notes/blob/main/Wharff_Day04_P1_weighted_knn.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# CS167: Day04 
## Part 1: weighted kNN

#### CS167: Machine Learning, J-Term 2023

Friday, January 6th, 2023 -- Session I (9:00-10:15)

[⏮ Day03 Part 2](https://github.com/merriekay/j23_cs167_notes/blob/main/Day03_P2_Missing_Data_Normalization.ipynb) | [Day04 Part 2⏩]()

## Helpful Links:
📆 [Course Schedule](https://docs.google.com/spreadsheets/d/e/2PACX-1vStj3FCEJqloUMLn2VtHa4yy1ILY6WvABhu4jd4cVUpPGkrx1mEjfTFmd77DMESR9HJ-8UBxgMDJL06/pubhtml?gid=0&single=true) | 🙋[PollEverywhere](https://pollev.com/meredithmoore011) | 📜 [Syllabus](https://analytics.drake.edu/~moore/j23_cs167/Syllabus.html)


# Overview of Today:

Part 1: Notebook #2 Questions, weighted kNN

Part 2: Graphs, Metrics, and Testing

# Admin Stuff

You should be working on:
- [Notebook #2: kNN and Normalization](https://classroom.github.com/a/ZihGOnY-) is released today, but will be due on Friday 1/6 by 11:59pm.
- Quiz #1 will be released today after class and will be due Monday 1/9/23 by 11:59pm. 
    - Blackboard
    - To be completed individually
    - Cite any external resources you use, please.


## Can't forget to load our data:

And some of our favorite modules, `pandas` and `numpy`.

In [None]:
#run this cell if you're using Colab:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
#import the data:
#make sure the path on the line below corresponds to the path where you put your dataset.
import pandas as pd
import numpy as np
path = 'datasets/irisData.csv'
iris= pd.read_csv(path)
iris.head()

Unnamed: 0,sepal length,sepal width,petal length,petal width,species
0,5.1,3.5,1.4,0.2,Iris-setosa
1,4.9,3.0,1.4,0.2,Iris-setosa
2,4.7,3.2,1.3,0.2,Iris-setosa
3,4.6,3.1,1.5,0.2,Iris-setosa
4,5.0,3.6,1.4,0.2,Iris-setosa


## Demo of `copy()` and a DIY `z_score()` function

In [None]:
def z_score(columns, data):
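    """
    Normalizes the given columns in place using the z-score method:
    (value - column mean) / column standard deviation.
    Params:
        columns, a list of column names to normalize
        data, the dataframe to modify, preferably a copy of the original
    Returns nothing; data itself is changed.
    """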
    for col in columns:
        #get the mean and std
        col_m = data[col].mean()
        col_s = data[col].std()
        
        data[col] = (data[col] - col_m)/col_s

In [None]:
iris_norm = iris.copy()
z_score(['sepal length', 'sepal width', 'petal width', 'petal length'], iris_norm)
iris_norm.head()

Unnamed: 0,sepal length,sepal width,petal length,petal width,species,distance_to_new
76,1.155302,-0.585801,0.590184,0.263815,Iris-versicolor,0.591608
52,1.276066,0.10609,0.64686,0.394849,Iris-versicolor,0.7
77,1.034539,-0.12454,0.703536,0.656917,Iris-versicolor,0.74162
50,1.396829,0.33672,0.533509,0.263815,Iris-versicolor,0.83666
129,1.638355,-0.12454,1.156943,0.525883,Iris-virginica,0.866025
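
To see why `z_score()` is called on a copy, here is a small sketch using a made-up one-column dataframe (`toy`, `alias`, and `independent` are illustrative names, not part of the iris data). Plain assignment only creates a second name for the same dataframe, while `copy()` creates an independent one.

```python
import pandas as pd

# toy data, made up for illustration only
toy = pd.DataFrame({'sepal length': [5.0, 7.0]})

alias = toy                # a second name for the SAME dataframe
independent = toy.copy()   # a separate dataframe with its own data

# change a value through the alias
alias.loc[0, 'sepal length'] = 100.0

print(toy.loc[0, 'sepal length'])          # the original changed too
print(independent.loc[0, 'sepal length'])  # the copy kept the old value
```

Because `z_score()` overwrites columns in place, passing it `iris.copy()` keeps the raw measurements in `iris` available for later comparisons.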


# Quick Review:

## Machine Learning Variations

We are going to learn about a lot of different types of machine learning in CS167. Here are a few categories to look out for: 
- __classification__: identify which category it goes in. Examples: Spam or ham? Eric or Tim? Fish, amphibian, reptile, bird, or mammal?
- __regression__: real-valued labels. Examples: price of Bitcoin, tomorrow's temperature, etc.
- __supervised learning__: data has labels, goal is to predict the labels of new instances. 
- __unsupervised learning__: data does not have a label, the goal is to analyze/cluster the examples. 
- __other issues__: missing data, sequential data, outlier/anomaly detection, and many more. 

## 🚨 Terminology Alert 🚨
Each row in the table represents a __training example__, a previously-seen, known instance of the thing we are trying to model. 

Each column in the table represents a __feature__, some attribute or variable that each training example has a value for. 

__Target variable__: the 'feature' we will try to predict (e.g. `species`). Its value is unknown for any new cases not in the training data.

__Predictor variables__ (or just predictors): the features that will be used to make predictions of the target variable (e.g. `sepal length`, `petal length`, `sepal width`, `petal width`).

__k-Nearest-Neighbor Algorithm__: Predict the _most commonly appearing_ class among the __k__ closest training examples.

## 3-Nearest-Neighbor Algorithm

> Wait... why did we skip 2-NN?

### What will a 3NN algorithm predict?

<div>
<img src="https://github.com/merriekay/j23_cs167_notes/blob/main/images/day03_3NN_iris.png?raw=1" width=450/>
</div>

# Remember our kNN function?

In [None]:
def kNN(specimen, data, k):
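    # predicts the species of specimen by majority vote among its k nearest neighbors
    # note: this adds a 'distance_to_new' column to data and sorts data in place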
    # 1. calculate distances
    data['distance_to_new'] = np.sqrt(
    (specimen['petal length'] - data['petal length'])**2 
    +(specimen['sepal length'] - data['sepal length'])**2 
    +(specimen['petal width'] - data['petal width'])**2
    +(specimen['sepal width'] - data['sepal width'])**2)

    # 2. sort
    data.sort_values(['distance_to_new'], inplace=True)
    
    # 3. predict
    prediction = data.iloc[0:k]['species'].mode()

    return prediction

In [None]:
new_iris = {}
new_iris['petal length'] = 5.1
new_iris['sepal length'] = 7.2
new_iris['petal width'] = 1.5
new_iris['sepal width'] = 2.5

kNN(new_iris, iris, 15)

0    Iris-versicolor
dtype: object

## kNN for Regression?

The only thing we need to change if our target variable is a real-valued number (continuous) is that rather than taking the `mode()` of the __k__ closest neighbors, we will take the `mean()` of the k closest neighbors.
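
A minimal sketch of that change, assuming a dataframe with a numeric target column. The function name `kNN_regression`, its `predictors` and `target` parameters, and the house data are all made up for illustration; they are not from the course materials.

```python
import numpy as np
import pandas as pd

def kNN_regression(specimen, data, k, predictors, target):
    """Predict a numeric target as the mean of the k nearest neighbors' targets."""
    data = data.copy()  # work on a copy so the caller's dataframe is untouched
    # 1. calculate Euclidean distances over the predictor columns
    sq_diffs = sum((specimen[col] - data[col])**2 for col in predictors)
    data['distance_to_new'] = np.sqrt(sq_diffs)
    # 2. sort
    data = data.sort_values('distance_to_new')
    # 3. predict: mean() of the target instead of mode()
    return data.iloc[0:k][target].mean()

# made-up data, for illustration only
houses = pd.DataFrame({
    'sqft': [1000, 1100, 1500, 3000],
    'bedrooms': [2, 2, 3, 5],
    'price': [100.0, 120.0, 200.0, 500.0],
})
new_house = {'sqft': 1050, 'bedrooms': 2}

prediction = kNN_regression(new_house, houses, 2, ['sqft', 'bedrooms'], 'price')
print(prediction)  # mean price of the two closest houses
```

The distance and sorting steps are the same as in classification; only the final aggregation differs.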

# ✨ New Material

## Are all neighbors created equal?

The way we've learned kNN so far, each neighbor gets an equal vote in the decision of what to predict.

Do we see any problems with this? If so, what?

<div>
<img src="https://github.com/merriekay/j23_cs167_notes/blob/main/images/day04_wknn_motivation.png?raw=1" width = 500/>
</div>

Should neighbors that are closer to the new instance get a larger share of the vote?

# Weighted k-NN Intuition:

In weighted kNN, the nearest k points are given a weight, and the weights are grouped by the target variable. The class with the largest sum of weights will be the class that is predicted. 

The intuition is to give more weight to the points that are nearby and less weight to the points that are farther away.
- distance-weighted voting

In w-kNN, we predict the class with the largest total weight, where each neighbor's weight is the inverse of its squared distance to the new instance:

## $w_{q,i} = \frac{1}{d(x_q, x_i)^2}$

> In English: the __weight__ of a training example is 1 divided by the square of the distance between the new instance and that training example.
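
To get a feel for how quickly the weight falls off, here is a small sketch that applies the formula to a few made-up distances (the values in `distances` are illustrative only).

```python
import numpy as np

# made-up distances from a new instance to four training examples
distances = np.array([0.5, 1.0, 2.0, 4.0])

# w = 1 / d^2
weights = 1 / distances**2

for d, w in zip(distances, weights):
    print(f"distance {d}: weight {w}")
```

Doubling the distance cuts the weight to a quarter. Note that a training example at distance exactly 0 would mean dividing by zero, so a real implementation has to decide how to treat exact matches.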

## A w-kNN Example: Step 1

Start by calculating the distance between the new example ('X'), and each of the other training examples:

<div>
<img src="https://github.com/merriekay/j23_cs167_notes/blob/main/images/day04_wknn_ex.png?raw=1"/>
</div>

## A w-kNN Example: Step 2

Then, __calculate the weight__ of each training example using the inverse distance squared.

<div>
<img src="https://github.com/merriekay/j23_cs167_notes/blob/main/images/day04_wknn_ex1.png?raw=1"/>
</div>

## A w-kNN Example: Step 3

Find the k closest neighbors--let's assume `k=3` for this example: 
<div>
<img src="https://github.com/merriekay/j23_cs167_notes/blob/main/images/day04_wknn_ex2.png?raw=1"/>
</div>

Then, sum the weights for each possible class: 
- __orange__: $1$
- __blue__: $1/16 + 1/9 \approx 0.174$

### What would a __normal 3NN__ predict? Weighted 3NN?
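
One way to check your answers is the sketch below. The distances 1, 4, and 3 are an assumption inferred from the weights listed above ($1$, $1/16$, $1/9$), and the dataframe `neighbors` with its `color` column is a hypothetical stand-in for the three closest points in the figure.

```python
import pandas as pd

# the three nearest neighbors; distances inferred from the listed weights
neighbors = pd.DataFrame({
    'color': ['orange', 'blue', 'blue'],
    'distance': [1.0, 4.0, 3.0],
})

# weight of each neighbor: 1 / d^2
neighbors['weight'] = 1 / neighbors['distance']**2

# weighted vote: sum the weights within each class, pick the largest sum
weight_sums = neighbors.groupby('color')['weight'].sum()
weighted_prediction = weight_sums.idxmax()

# unweighted vote: the most common class among the neighbors
unweighted_prediction = neighbors['color'].mode()[0]

print(weight_sums)
print('weighted 3NN:', weighted_prediction)
print('normal 3NN:', unweighted_prediction)
```

The two methods can disagree when one very close neighbor is outnumbered by farther ones, which is exactly the situation in this example.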

## Let's write some code: 

Write a new function `w_knn()`

Pass the iris measurements (specimen), data frame, and k as parameters and return the predicted class.

In [None]:
def w_knn(specimen, data, k):
    # calculate the distance
    
    # calculate the weights (remember, weights are 1/d^2)
    
    # find the k closest neighbors
    
    # use groupby to sum the weights of each species in the closest k
    
    # return the class that has the largest sum of weights
    pass  # placeholder so the cell runs; replace with your return statement
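
One possible way to fill in those steps, offered as a sketch rather than the official solution. The name `w_knn_sketch`, the `PREDICTORS` list, and the `toy` rows are assumptions made for illustration; the rows are made up and only share column names with the iris data. This sketch works on a copy of the dataframe and does not special-case a distance of exactly 0.

```python
import numpy as np
import pandas as pd

PREDICTORS = ['sepal length', 'sepal width', 'petal length', 'petal width']

def w_knn_sketch(specimen, data, k):
    data = data.copy()  # avoid adding columns to the caller's dataframe
    # calculate the distance
    sq_diffs = sum((specimen[col] - data[col])**2 for col in PREDICTORS)
    data['distance_to_new'] = np.sqrt(sq_diffs)
    # calculate the weights (1/d^2)
    data['weight'] = 1 / data['distance_to_new']**2
    # find the k closest neighbors
    nearest = data.sort_values('distance_to_new').iloc[0:k]
    # use groupby to sum the weights of each species in the closest k
    weight_sums = nearest.groupby('species')['weight'].sum()
    # return the class that has the largest sum of weights
    return weight_sums.idxmax()

# made-up rows with the same columns as the iris data
toy = pd.DataFrame({
    'sepal length': [5.0, 5.1, 6.9, 7.0, 7.1],
    'sepal width':  [3.4, 3.5, 3.1, 3.2, 3.0],
    'petal length': [1.4, 1.5, 4.9, 4.7, 5.9],
    'petal width':  [0.2, 0.2, 1.5, 1.4, 2.1],
    'species': ['Iris-setosa', 'Iris-setosa', 'Iris-versicolor',
                'Iris-versicolor', 'Iris-virginica'],
})
new_iris = {'petal length': 5.1, 'sepal length': 7.2,
            'petal width': 1.5, 'sepal width': 2.5}

prediction = w_knn_sketch(new_iris, toy, 3)
print(prediction)
```

Unlike `kNN()`, this version returns a single label (via `idxmax()`) instead of a Series, which makes the result easier to compare across runs.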

In [None]:
new_iris = {}
new_iris['petal length'] = 5.1
new_iris['sepal length'] = 7.2
new_iris['petal width'] = 1.5
new_iris['sepal width'] = 2.5

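# unweighted prediction, for comparison:
kNN(new_iris, iris, 15)
# once w_knn() is implemented, try the weighted version:
# w_knn(new_iris, iris, 15)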

## Exercises:

Normalize each of the predictor columns in the iris dataset

>__Note__: you need a way to transform the new reading (the specimen) that you will make the prediction on so that the new one and the training data will all be on the same scale. How can you do that?

Repeat your k-NN prediction code for the normalized data.
- Does the value of k change the predictions? 
    - compare `k=3` and `k=5` for each combination of normalized vs. non-normalized data and weighted vs. unweighted kNN
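
For the note above about transforming the specimen, one option (sketched here under assumptions) is to z-score the specimen with the mean and standard deviation of the original, un-normalized training columns. The helper `normalize_specimen` and the data in `raw` are hypothetical and made up for illustration.

```python
import pandas as pd

def normalize_specimen(specimen, columns, raw_data):
    """Return a new dict with each column z-scored using raw_data's mean and std."""
    normalized = {}
    for col in columns:
        col_m = raw_data[col].mean()
        col_s = raw_data[col].std()
        normalized[col] = (specimen[col] - col_m) / col_s
    return normalized

# made-up, un-normalized training data
raw = pd.DataFrame({
    'petal length': [1.0, 2.0, 3.0],
    'petal width': [0.2, 0.4, 0.6],
})
specimen = {'petal length': 3.0, 'petal width': 0.4}

specimen_norm = normalize_specimen(specimen, ['petal length', 'petal width'], raw)
print(specimen_norm)
```

The statistics have to come from the raw data: once a column has been z-scored, its mean is 0 and its standard deviation is 1, so it can no longer tell you how to rescale a new reading.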

## Use these tables to keep track of your predictions:
### `k=3`
|                    | **not normalized** | **normalized** |
|--------------------|--------------------|----------------|
| **unweighted kNN** |                    |                |
| **weighted kNN**   |                    |                |

### `k=5`

|                    | **not normalized** | **normalized** |
|--------------------|--------------------|----------------|
| **unweighted kNN** |                    |                |
| **weighted kNN**   |                    |                |

# 💬 Discussion Question

Should we __always__ normalize our data? Why or why not?

When does it make sense to normalize? When might it make more sense not to?