In [None]:
# Initialize Otter
import otter
grader = otter.Notebook("hw12_sp24.ipynb")

# Homework 12: Classification

**Helpful Resource:**

- [Python Reference](http://data8.org/sp22/python-reference.html): Cheat sheet of helpful array & table methods used in Data 8!

**Recommended Reading**: 

* [Classification](https://www.inferentialthinking.com/chapters/17/Classification.html)

Please complete this notebook by filling in the cells provided. Before you begin, execute the following cell to set up the notebook by importing some helpful libraries. Each time you start your server, you will need to execute this cell again.

For all problems that require written explanations, you **must** provide your answer in the designated space. **Moreover, throughout this homework and all future ones, please be sure not to re-assign variables throughout the notebook!** For example, if you use `max_temperature` in your answer to one question, do not reassign it later on. Otherwise, you will fail tests that you thought you were passing previously!


**Note: This homework has hidden tests on it. That means even though the tests may say 100% passed, it doesn't mean your final grade will be 100%. We will be running more tests for correctness once everyone turns in the homework.**


Directly sharing answers is not okay, but discussing problems with the course staff or with other students is encouraged. 

You should start early so that you have time to get help if you're stuck.

In [None]:
# Don't change this cell; just run it. 

import numpy as np
from datascience import *
import d8error

# These lines do some fancy plotting magic
import matplotlib
%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')
import warnings
warnings.simplefilter('ignore')
from datetime import datetime

## 1. NBA Team Coordinates with Classification

Welcome to Homework 12! This homework is about k-Nearest Neighbors classification (k-NN). This topic is covered in depth in Project 3. The purpose of this homework is to reinforce the basics of this method. You can and should reuse a lot of code that you wrote for Project 3 for this homework, or use code from this homework on Project 3!

### Our Dearest Neighbors

Tyla is trying to classify NBA staff as either affiliated with the Golden State Warriors or the Sacramento Kings. To classify these individuals, Tyla has access to the coordinates of their primary residence. First, load in the `staff_coordinates.csv` table.



In [None]:
# Just run this cell!
location = Table.read_table('staff_coordinates.csv')
location.show(5)

As usual, let's investigate our data visually before performing any kind of numerical analysis.

In [None]:
# Just run this cell!
location.scatter("longitude", "latitude", group="team")

The locations of the points on this scatter plot might be familiar - run the following cell to see what they correspond to.

In [None]:
# Just run this cell!
colors = {"Golden State Warriors":"gold", "Sacramento Kings":"purple"}
t = Table().with_columns(
    "lat", location.column(0),
    "lon", location.column(1),
    "color", location.apply(colors.get, 2)
)
Circle.map_table(t, radius=5, fill_opacity=1)

**Question 1.1.** Let's begin implementing the k-Nearest Neighbors algorithm. Define the `distance` function, which takes in two arguments: an array of numerical features (`array1`), and a different array of numerical features (`array2`). The function should return the [Euclidean distance](https://en.wikipedia.org/wiki/Euclidean_distance) between the two arrays. Euclidean distance is often referred to as the straight-line distance formula that you may have learned previously. **(10 points)**
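
For reference, here is the straight-line distance written out, using notation ($a$, $b$, $n$) that is an assumption for illustration and not part of the assignment. For two points $a = (a_1, \dots, a_n)$ and $b = (b_1, \dots, b_n)$ with the same number of features:

$$d(a, b) = \sqrt{\sum_{i=1}^{n} (a_i - b_i)^2}$$

For instance, with two features, the distance between $(0, 0)$ and $(3, 4)$ is $\sqrt{3^2 + 4^2} = 5$.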


In [None]:
def distance(array1, array2):
    ...

# Don't change/delete the code below in this cell
distance_example = distance(make_array(1, 2, 3), make_array(4, 5, 6))
distance_example

In [None]:
grader.check("q1_1")

### Splitting the Dataset 

We'll do two different kinds of things with the `location` dataset:

1. We'll build a classifier using coordinates for which we know the associated label; this will teach it to recognize labels of similar coordinate values. This process is known as *training.*
2. We'll evaluate or *test* the accuracy of the classifier we build on data we haven't seen before.
 
As discussed in [Section 17.2](https://inferentialthinking.com/chapters/17/2/Training_and_Testing.html#training-and-testing), we want to use separate datasets for *training* and *testing.* As such, we split up our one dataset into two.

**Question 1.2.** Next, let's split our dataset into a training set and a test set. Since `location` has 200 rows, let's create a training set with the first 150 rows and a test set with the remaining 50 rows. Remember that assignment to each group should be random, so we should shuffle the table first. 
**(10 points)**

*Hint:* As a first step we can shuffle all the rows, then use the `tbl.take` method to split up the rows for each table.
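
As a rough illustration of the shuffle-then-split idea, here is a minimal sketch in plain Python on a made-up list of rows. It deliberately avoids the `datascience` table methods, and the names `toy_rows` and `shuffle_and_split`, along with the 6/2 split, are assumptions made only for this example.

```python
import random

def shuffle_and_split(rows, num_training, seed=None):
    """Return (training, testing) lists after shuffling a copy of rows."""
    rng = random.Random(seed)
    shuffled = list(rows)   # copy, so the original order is left untouched
    rng.shuffle(shuffled)   # random order makes group assignment random
    return shuffled[:num_training], shuffled[num_training:]

# Hypothetical toy data: 8 rows of (latitude, longitude, team)
toy_rows = [(37.0 + i / 10, -122.0 + i / 10, "A" if i % 2 == 0 else "B")
            for i in range(8)]

toy_training, toy_testing = shuffle_and_split(toy_rows, num_training=6, seed=0)
print(len(toy_training), "training rows,", len(toy_testing), "test rows")
```

Every row ends up in exactly one of the two groups, which is the property the training/test split relies on.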


In [None]:
shuffled_table = ...
training_tbl = ...
testing_tbl = ...

print("Training set:\t",   training_tbl.num_rows, "examples")
print("Test set:\t",       testing_tbl.num_rows, "examples")
training_tbl.show(5), testing_tbl.show(5);

In [None]:
grader.check("q1_2")

**Question 1.3.** Assign `attributes` to an array of column names (strings) of the features from the `location` table. **(10 points)**

*Hint:* Which of the column names in the `location` table are the features, and which of the column names correspond to the class we're trying to predict?

*Hint:* No need to modify any tables, just manually create an array of the feature names!

In [None]:
attributes = ...
attributes

In [None]:
grader.check("q1_3")

**Question 1.4.** Now define the `classify` function. This function should take in a `testing_tbl_row` from a table like `testing_tbl`, a number of neighbors `k`, and a `training_tbl`, and classify the row using k-Nearest Neighbors based on the correct `attributes` and the data in `training_tbl`. A refresher on k-Nearest Neighbors can be found [here](https://www.inferentialthinking.com/chapters/17/4/Implementing_the_Classifier.html). **(10 points)**


*Hint 1:* The `distance` function we defined earlier takes in arrays as input, so use the `row_to_array` function we defined for you to convert rows to arrays of features.

*Hint 2:* The skeleton code we provided iterates through each row in the training set.
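
To make the overall flow concrete (compute distances, keep the `k` closest, take a majority vote), here is a minimal sketch in plain Python on invented points. It is not the skeleton above and does not use tables; `toy_distance`, `toy_classify`, and `toy_training` are hypothetical names chosen for illustration.

```python
import math
from collections import Counter

def toy_distance(p, q):
    """Straight-line distance between two equal-length points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def toy_classify(point, k, labeled_points):
    """Majority label among the k labeled points closest to `point`."""
    # Pair each training label with its distance to the new point,
    # then sort so the closest points come first.
    by_distance = sorted(
        (toy_distance(point, features), label)
        for features, label in labeled_points
    )
    nearest_labels = [label for _, label in by_distance[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

# Hypothetical training data: two well-separated clusters
toy_training = [
    ((0.0, 0.0), "blue"), ((0.0, 1.0), "blue"), ((1.0, 0.0), "blue"),
    ((5.0, 5.0), "red"), ((5.0, 6.0), "red"), ((6.0, 5.0), "red"),
]

print(toy_classify((0.5, 0.5), 3, toy_training))  # blue
print(toy_classify((5.5, 5.5), 3, toy_training))  # red
```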


In [None]:
def row_to_array(row, attributes):
    """Converts a row to an array of its features."""
    arr = make_array()
    for attribute in attributes:
        arr = np.append(arr, row.item(attribute))
    return arr

def classify(testing_tbl_row, k, training_tbl):
    testing_row_features_array = row_to_array(testing_tbl_row, attributes)
    distances = make_array()
    for training_row in training_tbl.rows:
        training_row_attributes_array = ...
        row_distance = ...
        distances = ...
    train_with_distances = ...
    nearest_neighbors = ...
    most_common_label = ...
    ...

# Don't modify/delete the code below
first_test = classify(testing_tbl.row(0), 5, training_tbl)
first_test

In [None]:
grader.check("q1_4")

<div class="hide">\pagebreak</div>

**Question 1.5.** 
Define the function `five_classify`, which takes a `row` from the `testing_tbl` set as an argument and classifies the row using a 5-Nearest Neighbors algorithm. After defining this function, use it to determine the `accuracy` of the 5-NN classifier on the entire `testing_tbl` set. Report the `accuracy` as a proportion (not a percentage) of correctly predicted team affiliations.

**(10 points)**

*Hint:* Make sure to use the distance function you defined earlier to compute the distances between neighbors.

*Note:* Typically, before applying a classifier to the test set, we would first validate its performance on a separate validation set. This allows for adjustments to the training approach based on the validation performance before final testing. You won't need to perform this step for this question, but it's an important concept that you'll explore further in more advanced courses like Data 100 at UC Berkeley.
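
As a small illustration of what "accuracy as a proportion" means, the sketch below compares two made-up lists of labels in plain Python. The helper name `proportion_correct` and the toy labels are assumptions for this example only.

```python
def proportion_correct(predicted, actual):
    """Fraction of positions where the predicted label equals the actual one."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    matches = sum(p == a for p, a in zip(predicted, actual))
    return matches / len(actual)

# Hypothetical labels: 3 of the 4 predictions match
toy_predicted = ["A", "B", "A", "A"]
toy_actual = ["A", "B", "B", "A"]
print(proportion_correct(toy_predicted, toy_actual))  # 0.75
```

The result is a number between 0 and 1, not a percentage.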

In [None]:
def five_classify(row):
    ...

test_with_prediction = ...
labels_correct = ...
accuracy = ...
accuracy

In [None]:
grader.check("q1_5")

**Question 1.6.** What is the total number of Golden State Warriors staff in the `location` table? Assign the value to `k`. **(10 points)**


In [None]:
k = ...
k

In [None]:
grader.check("q1_6")

<!-- BEGIN QUESTION -->

**Question 1.7.** Why do we divide our data into a training and test set? What is the point of a test set, and why do we only want to use the test set once? Explain your answer in 3 sentences or fewer. **(10 points)**

*Hint:* Check out this [section](https://inferentialthinking.com/chapters/17/2/Training_and_Testing.html) in the textbook.


_Type your answer here, replacing this text._

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

**Question 1.8.** Explain why we would not use `k = 6` in k-NN. **(10 points)**


_Type your answer here, replacing this text._

<!-- END QUESTION -->

**Question 1.9.0. Setup**

Thomas has devised a scheme for splitting up the test and training set. For each row from `location`:
- Rows for Kings' staff have a 75% chance of being placed in the training set and 25% chance of being placed in the test set.  
- Rows for Warriors' staff have a 60% chance of being placed in the training set and 40% chance of being placed in the test set.  


*Hint 1:* Remember that there are 100 Warriors' staff and 100 Kings' staff in the `location` table.  

*Hint 2:* Thomas' last name is Bayes. (So [18.1](https://inferentialthinking.com/chapters/18/1/More_Likely_than_Not_Binary_Classifier.html#bayes-rule) from the textbook may be helpful here!)

*Hint 3:* The following tree diagram may be helpful in Questions 1.9.1 and 1.9.2!

<img src="Kings Warriors training set.png" width="450">
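
As a reminder of how Bayes' rule combines these kinds of numbers, here is a minimal sketch using hypothetical classes and probabilities that are deliberately different from the setup above. The names `posterior`, `p_x`, and `p_y`, and all of the values, are assumptions for illustration.

```python
from fractions import Fraction

def posterior(prior_a, likelihood_a, prior_b, likelihood_b):
    """P(A | evidence) by Bayes' rule for two exhaustive classes A and B."""
    numerator = prior_a * likelihood_a
    return numerator / (numerator + prior_b * likelihood_b)

# Hypothetical numbers, not the ones from this question
p_x = Fraction(30, 100)          # P(class X)
p_y = Fraction(70, 100)          # P(class Y)
p_test_given_x = Fraction(1, 2)  # P(test set | X)
p_test_given_y = Fraction(1, 5)  # P(test set | Y)

# P(X | test set) = (0.3 * 0.5) / (0.3 * 0.5 + 0.7 * 0.2)
print(posterior(p_x, p_test_given_x, p_y, p_test_given_y))  # 15/29
```

Using `Fraction` keeps the arithmetic exact, which makes it easy to check the result against a tree diagram by hand.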

**Question 1.9.1.** Given that a row is in the test set, what is the probability that it corresponds to a Warriors staff member? Assign that probability to `probability_furd`. **(10 points)**


In [None]:
probability_furd = ...
probability_furd

In [None]:
grader.check("q1_9_1")

**Question 1.9.2.** Given that a row corresponds to a Warriors staff member, what is the probability that the row is in the test set? Assign that probability to `probability_test`. **(10 points)**


In [None]:
probability_test = ...
probability_test

In [None]:
grader.check("q1_9_2")

## (OPTIONAL, NOT IN SCOPE): k-NN for Non-Binary Classification

**THIS IS NOT IN SCOPE**. There are no autograder tests for this or code for you to write. It just relies on the function `classify` in Question 1.4. Go ahead and read through this section and run the following cells!

In this class, we have taught you how to use the k-NN algorithm to classify data as one of two classes. However, much of the data you will encounter in the real world will not fall nicely into one of two categories. 

**How can we classify data with non-binary classes?** It turns out we can still use k-NN! That is, we find the distance between a point and all its neighbors, find the nearest neighbors, and take a majority vote among the neighbors to determine this point's class. 

The only difference is that now the neighboring points have more than two possible classes. This does introduce difficulty because now we have no way of guaranteeing that we will not encounter ties between classes. In the case that we do encounter a tie, we can just arbitrarily choose one of the classes.

In fact, you don't even have to modify the code you wrote before at all to enable multi-class classification!
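
To see how a tie can arise with more than two classes, consider the following minimal sketch of a majority vote in plain Python. The helper `majority_label` and the toy labels are assumptions for illustration. With three classes and five neighbors, a 2-2-1 split is possible, and this sketch settles it by returning whichever tied label appears first among the neighbors.

```python
from collections import Counter

def majority_label(labels):
    """Most common label; a tie goes to the tied label that appears first."""
    return Counter(labels).most_common(1)[0][0]

# A and B tie with 2 votes each; A appears first, so A is returned
print(majority_label(["A", "B", "C", "B", "A"]))

# B and C tie with 2 votes each; C appears first, so C is returned
print(majority_label(["C", "B", "B", "A", "C"]))
```

From the classifier's point of view this choice is arbitrary, since it depends only on the order in which the neighbors happen to be listed.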

Let's add some more data to our `location` table, this time for staff of another team, the Stockton Kings.

In [None]:
location_multi = location.with_rows([
    [37.960108, -121.290780, "Stockton Kings"],  
    [37.959500, -121.291200, "Stockton Kings"],  
    [37.957400, -121.287600, "Stockton Kings"],  
    [37.958800, -121.289500, "Stockton Kings"],  
    [37.960200, -121.292100, "Stockton Kings"],  
    [37.960600, -121.291800, "Stockton Kings"],  
    [37.959800, -121.290300, "Stockton Kings"],  
    [37.959000, -121.291500, "Stockton Kings"],  
    [37.957900, -121.287800, "Stockton Kings"],  
    [37.958300, -121.289200, "Stockton Kings"]                              
])

In [None]:
classify(location_multi.row(0), 5, location_multi)

In [None]:
classify(location_multi.row(140), 5, location_multi)

In [None]:
classify(location_multi.row(205), 5, location_multi)

Our classifier can classify rows as belonging to one of three classes!

Classification is one of the most important fields in statistics, data science, and machine learning. There are thousands of different classification algorithms and modifications of algorithms! There are many that you'll learn if you continue down the path of becoming a data scientist!

You're all done with Homework 12! :,)

**Important submission steps:** 
1. Run the tests and verify that they all pass.
2. Choose **Save Notebook** from the **File** menu, then **run the final cell**. 
3. Click the link to download the zip file.
4. Then submit the zip file to the corresponding assignment according to your instructor's directions. 

**It is your responsibility to make sure your work is saved before running the last cell.**

## Submission

Make sure you have run all cells in your notebook in order before running the cell below, so that all images/graphs appear in the output. The cell below will generate a zip file for you to submit. **Please save before exporting!**

In [None]:
# Save your notebook first, then run this cell to export your submission.
grader.export(pdf=False, run_tests=True)