In [None]:
# Initialize Otter
import otter
grader = otter.Notebook("hw09.ipynb")

# Homework 09: Classification

**Reading**: 

* [Classification](https://inferentialthinking.com/chapters/17/Classification.html)

For every problem that asks for a written explanation, you **must** provide your answer in the designated space. Moreover, throughout this homework and all future ones, do not reassign variables anywhere in the notebook! For example, if you use `max_temperature` in your answer to one question, do not reassign it later on. Otherwise, you will fail tests that you previously passed!

**Note: This homework has hidden tests. Additional tests will be run once your homework is submitted for grading. While you may pass all the tests you have access to before submission, you may not earn full credit if you do not pass the hidden tests as well.**

Many of the tests you have access to before submitting only check that your answer is formatted correctly and/or that it *could* make sense in context. For example, a test you have access to while completing the assignment may check that you selected a valid choice for a multiple choice problem (1, 2, or 3) or that your answer is an integer between 0 and 50 if asked to count a subset of states in the United States. The tests that are run after submission will evaluate your work for accuracy. **Do not assume that your answers are correct just because all your tests pass before submission!**

Consult with your teacher and course syllabus for information and policies regarding appropriate collaboration with other students, appropriate use of AI tools, and submission of late work.

In [None]:
# Don't change this cell; just run it. 

import numpy as np
from datascience import *

# These lines do some fancy plotting magic.
import matplotlib
%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')

## 1. Tobacco Road Coordinates with Classification


Welcome to Homework 09! This homework is about k-Nearest Neighbors classification (kNN).

## Our Dearest Neighbors

Carol is trying to classify students as attendees of either Duke University or UNC at Chapel Hill. To classify the students, Carol has access to the coordinates of where they live during the school year (wow, kind of creepy Carol). First, load in the `coordinates` table.

In [None]:
# Just run this cell!
coordinates = Table.read_table('coordinates_nc.csv')
coordinates.show(5)

As usual, let's investigate our data visually before performing any kind of numerical analysis.

In [None]:
# Just run this cell!
coordinates.scatter("longitude", "latitude", group="school")

In this case, most people probably don't recognize how the latitude and longitude relate to real life, so we can use a mapping function to put these points onto a map.

In [None]:
# Just run this cell!
colors = {"Duke":"darkblue", "UNC-CH":"cornflowerblue"}
t = Table().with_columns(
    "lat", coordinates.column(0),
    "lon", coordinates.column(1),
    "color", coordinates.apply(colors.get, 2)
)
Circle.map_table(t, area=10, fill_opacity=1)

### Question 1

Let's begin implementing the k-Nearest Neighbors algorithm. Define the `distance` function, which takes two arguments, `x_1` and `x_2`, each an array of numerical features. The function should return the [Euclidean distance](https://en.wikipedia.org/wiki/Euclidean_distance) between the two arrays. Euclidean distance is the straight-line distance between two points, which you may have seen before as the distance formula.
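
As a quick sanity check on what Euclidean distance means, here is a minimal sketch that uses the standard library's `math.dist` on two made-up points. It is not the array-based `distance` function this question asks you to write; the points `p` and `q` are assumptions chosen only because they form a 3-4-5 right triangle.

```python
import math

# Two hypothetical points, written as plain tuples of coordinates.
p = (0, 0)
q = (3, 4)

# math.dist returns the Euclidean (straight-line) distance between two points:
# the square root of the sum of the squared coordinate differences.
straight_line = math.dist(p, q)
print(straight_line)  # 5.0
```

Your own function should give the same kind of result when it is handed arrays instead of tuples.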

In [None]:
def distance(x_1, x_2):
    ...

# Don't change/delete the code below in this cell
distance_example = distance(make_array(1, 2, 3), make_array(4, 5, 6))
distance_example

In [None]:
grader.check("q1_1")

### Splitting the dataset
We'll do two different things with the `coordinates` dataset:
1. We'll build a classifier using coordinates for which we know the associated label; this will teach it to recognize labels of similar coordinate values. This process is known as *training*.
2. We'll evaluate or *test* the accuracy of the classifier we build on data we haven't seen before.

For reasons discussed in class and the textbook, we want to use separate datasets for these two purposes.  So we split up our one dataset into two.

### Question 2

Next, let's split our dataset into a training set and a test set. Since `coordinates` has 84 rows, let's create a training set with the first 63 rows (75% of the data) and a test set with the remaining 21 rows (25% of the data). Remember that assignment to each group should be random, so you should shuffle the table **first** (save it to `shuffled_table`), then select 63 rows for the training set and 21 rows for the testing set using the `.take()` table method with `np.arange` to specify which rows go into each set.
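
The shuffle-then-slice idea can be sketched on row indices alone. This is a toy illustration with plain NumPy, not the `datascience` Table code the question asks for: the sizes `n_rows` and `n_train` and the seeded generator are assumptions made only for this example.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # seeded only so the sketch is repeatable
n_rows = 8                           # hypothetical table size
n_train = 6                          # 75% of the 8 rows

# Shuffle the row indices first, then slice the shuffled order.
shuffled_indices = rng.permutation(n_rows)
train_indices = shuffled_indices[np.arange(n_train)]
test_indices = shuffled_indices[np.arange(n_train, n_rows)]

print(len(train_indices), len(test_indices))  # 6 2
```

Because both slices come from one shuffled order, every row lands in exactly one of the two sets.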


In [None]:
shuffled_table = ...
train = ...
test = ...

# DON'T CHANGE THE CODE BELOW
print("Training set:\t",   train.num_rows, "examples")
train.show(5)

print("Test set:\t",       test.num_rows, "examples")
test.show(5);

In [None]:
grader.check("q1_2")

### Question 3

Assign `features` to an array of the labels of the *features* from the `coordinates` table.

**Hint:** Which of the column labels in the `coordinates` table are the features, and which of the column labels correspond to the class we're trying to predict?

In [None]:
features = ...
features

In [None]:
grader.check("q1_3")

### `row_to_array`

The function `row_to_array` will convert a `row` from a Table to an array that contains the features for that row. This will allow you to use array operations on the feature array in a way that you can't do easily with a `row` object. Run the cell below to load the function, which you'll use in the next question.

In [None]:
def row_to_array(row, features):
    arr = make_array()
    for feature in features:
        arr = np.append(arr, row.item(feature))
    return arr

### Question 4

Now define the `classify` function. This function should take in a `row` from a table like `test` and classify it using its `k` nearest neighbors in `train`, with distances measured on the columns named in `features`. Much of the code is provided for you, so you only need to complete the unfinished lines.

**Hint:** the skeleton code we provided iterates through each row in the training set.
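
For intuition, the overall k-Nearest Neighbors procedure (compute distances, keep the `k` closest, take a majority vote) can be sketched on toy data with plain NumPy. This is not the skeleton below and does not use Tables; the helper name `knn_predict`, the points, and the labels `"A"` and `"B"` are all hypothetical.

```python
from collections import Counter

import numpy as np


def knn_predict(point, train_points, train_labels, k):
    """Predict a label for `point` by majority vote among its k nearest training points."""
    # Euclidean distance from `point` to every training point (one per row).
    diffs = train_points - point
    dists = np.sqrt(np.sum(diffs ** 2, axis=1))
    # Indices of the k smallest distances.
    nearest = np.argsort(dists)[:k]
    # Count the labels among those neighbors and return the most common one.
    votes = Counter(train_labels[i] for i in nearest)
    return votes.most_common(1)[0][0]


# Two well-separated, made-up clusters.
train_points = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                         [5.0, 5.0], [5.0, 6.0], [6.0, 5.0]])
train_labels = ["A", "A", "A", "B", "B", "B"]

prediction = knn_predict(np.array([0.5, 0.5]), train_points, train_labels, k=3)
print(prediction)  # A
```

The skeleton in this question follows the same three steps, but builds the distances one training row at a time and stores them in a Table column.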

In [None]:
def classify(row, k, train):
    test_row_features_array = row_to_array(row, features)
    distances = make_array()
    for train_row in train.rows:
        train_row_features_array = ...
        row_distance = ...
        distances = ...
    train_with_distances = ...
    nearest_neighbors = train_with_distances.sort('Distances').take(np.arange(k))
    most_common_label = ...
    ...

# The code below will attempt to classify the first row in your 
# test dataset using a 5 nearest neighbors classifier
first_test = classify(test.row(0), 5, train)
first_test

In [None]:
grader.check("q1_4")

### Question 5

The function `three_classify` takes in a `row` from `test` as an argument and classifies the row using a 3-Nearest Neighbors classifier. We define this function so we can use the `apply` method to quickly classify all the rows we have in the testing data set. We can then compare the prediction from the classifier to the known label for each row to get an idea of how accurate our classifier is on the test data. 

You should:
* Use the `apply` method on the `test` Table to create a table `test_with_prediction` with a new column labeled `"prediction"` that holds the predicted school for each location.
* Create an array `labels_correct` that contains `True` for each row in `test_with_prediction` where the prediction was correct and `False` where it was incorrect.
* Compute the accuracy of your model as a proportion (not a percentage) of the schools that were correctly predicted, assigned to `accuracy`.
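
The comparison-then-proportion step can be sketched with two small made-up arrays. The names `predicted`, `actual`, `is_correct`, and `toy_accuracy` are assumptions for illustration and are deliberately different from the variables you must define below.

```python
import numpy as np

# Hypothetical predictions and true labels for four rows.
predicted = np.array(["Duke", "UNC-CH", "Duke", "Duke"])
actual = np.array(["Duke", "Duke", "Duke", "UNC-CH"])

# Element-wise comparison gives one True/False per row.
is_correct = predicted == actual

# Accuracy as a proportion: number of correct rows divided by total rows.
toy_accuracy = np.count_nonzero(is_correct) / len(is_correct)
print(toy_accuracy)  # 0.5
```

Here two of the four predictions match, so the proportion is 0.5 rather than 50.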

In [None]:
def three_classify(row):
    return classify(row, 3, train)

test_with_prediction = ...
labels_correct = ...
accuracy = ...
accuracy

In [None]:
grader.check("q1_5")

<!-- BEGIN QUESTION -->

### Question 6

Suppose you work at the leasing office for an apartment building located at the GPS coordinates 35.95979476700251, -78.9870877612551 and a student is moving in. The cell below will create a table `new_student` with these coordinates as the only row.

Use your 3-Nearest Neighbor classifier to predict the school they attend.

In [None]:
new_student = Table().with_columns('latitude', make_array(35.95979476700251), 'longitude', make_array(-78.9870877612551))
new_student

In [None]:
# Write your code below
...

<!-- END QUESTION -->

### Question 7

There are 45 rows of Duke students and 39 rows of UNC-CH students in the `coordinates` table. If we used the entire `coordinates` table as the training set, what is the smallest value of k that would ensure that a k-Nearest Neighbors classifier would always predict Duke as the school? Assign the value to `k`. The test on this question will only verify that your answer is formatted correctly; it will not check it for accuracy.


In [None]:
k = ...
k

In [None]:
grader.check("q1_7")

<!-- BEGIN QUESTION -->

### Question 8

Why do we divide our data into a training and test set instead of using all the data to train the model?



_Type your answer here, replacing this text._

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

### Question 9

Why do we use an odd-numbered `k` in k-NN? Explain.



_Type your answer here, replacing this text._

<!-- END QUESTION -->

# Submitting your work
You're done with this assignment! Assignments should be turned in using the following best practices:
1. Save your notebook.
2. Restart the kernel and run all cells up to this one.
3. Run the cell below with the code `grader.export(...)`. This will re-run all the tests. Make sure they are passing as you expect them to.
4. Download the file named `hw09_<date-time-stamp>.zip`, found in the explorer pane on the left side of the screen. **Note**: Clicking on the link in this notebook may result in an error, so it's best to download from the file explorer panel.
5. Upload `hw09_<date-time-stamp>.zip` to the corresponding assignment on Canvas.

## Submission

Make sure you have run all cells in your notebook in order before running the cell below, so that all images/graphs appear in the output. The cell below will generate a zip file for you to submit.

In [None]:
grader.export(pdf=False, force_save=True)