Before you turn this problem in, make sure everything runs as expected. First, **restart the kernel** (in the menubar, select Kernel$\rightarrow$Restart) and then **run all cells** (in the menubar, select Cell$\rightarrow$Run All).

Make sure you fill in any place that says `YOUR CODE HERE` or "YOUR ANSWER HERE", as well as your name and collaborators below:

In [None]:
NAME = "" # put your full name here
COLLABORATORS = [] # list anyone you collaborated with on this workbook

## Lab 10: Classification

**This lab is optional! It was distributed the week of 11/02/2020, and you can complete it whenever you like.**

Welcome to Lab 10, the **last** lab notebook of the semester!


Back in Lab 7, we began exploring methods to answer classification questions. Specifically, we learned how to use logistic regression to predict the probability that a qualitative outcome occurs. In this notebook, we'll take a look at two other classification methods: k-nearest neighbors and decision trees. Homework 10 will give you more practice using decision trees and ensemble methods.

### Setup

In [None]:
# Run this cell
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
%matplotlib inline
from matplotlib.colors import ListedColormap

In [None]:
!pip install xlrd
!pip install graphviz

As in Lab 7, we'll work with a modified version of the [ozone level detection dataset](https://archive.ics.uci.edu/ml/datasets/ozone+level+detection) from the UCI Machine Learning Repository, which uses temperature, wind speed, pressure, and other features to decide whether a specific day was a normal day or a high ground-level ozone day.

Run the cell below to load ozone.csv into dataframe `df`.

In [None]:
# run this cell
df = pd.read_csv('data/ozone.csv')
df.head()

Looking at the columns, we can infer that `WSR0`, `WSR1`, etc. are the hourly wind speed measurements, and the second-to-last column, `Class`, is the variable we want to predict: 0 is a normal day and 1 is an ozone day. If you want more information on the features, you can read the description of the data [here](https://archive.ics.uci.edu/ml/datasets/ozone+level+detection).

----

### Section 1: k-Nearest Neighbors for Classification

In Homework 5, we used the KNN algorithm for regression -- we predicted PM2.5 levels based on the average of the surrounding k measurements. But this time around, we'll use the ozone dataset and KNN to classify whether a day is "normal" or an ozone day. Unlike our logistic regression approach, we'll be working with *two* features, namely, the peak wind speed (`WSR_PK`) and the peak temperature (`T_PK`), instead of just one feature.

Run the following cell to see a scatter plot of the data.

In [None]:
plt.figure(figsize=(10, 7))
plt.ylabel('Peak Temperature')
plt.xlabel('Peak Wind Speed')
# Plot each class in a single call so that the legend labels match the colors.
normal = df[df.Class == 0]  # normal days
ozone = df[df.Class == 1]   # ozone days
plt.scatter(normal.WSR_PK, normal.T_PK, c='b', label='Normal Day')
plt.scatter(ozone.WSR_PK, ozone.T_PK, c='r', label='Ozone Day')
plt.legend();

Using KNN doesn't seem like a bad idea -- there are only a few crossovers between the two classes, and the possible decision boundary doesn't look too messy.

Instead of coding the KNN algorithm from scratch like we did in homework 5, we'll make use of scikit-learn's `KNeighborsClassifier`. Check out the [documentation](http://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html) to see if there are any arguments you could tweak.
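As a rough illustration of the interface (not the solution to the question below), here is a minimal sketch on made-up data. The arrays `X_toy` and `y_toy` and the choice `n_neighbors=3` are assumptions for this example only and have nothing to do with the ozone data.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Made-up data for illustration: two well-separated clusters of three points.
X_toy = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                  [5.0, 5.0], [5.0, 6.0], [6.0, 5.0]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

# Instantiate the classifier, then fit it on the features and labels.
toy_knn = KNeighborsClassifier(n_neighbors=3)
toy_knn.fit(X_toy, y_toy)

# .predict() expects a 2D array: one row per point, one column per feature.
toy_pred = toy_knn.predict(np.array([[0.5, 0.5], [5.5, 5.5]]))
print(toy_pred)
```

Note that `.predict()` takes a two-dimensional input even for a single point, so a single query needs to be wrapped as one row.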

**Question 1.1** Split the data into training and test sets using `train_test_split`, with `test_size = 0.2` and `random_state = 2020`. Then, instantiate a scikit-learn KNN model with `n_neighbors` set to 4 and fit it on the training data using `WSR_PK` and `T_PK` as features. Next, choose a value for peak wind speed and peak temperature and use `.predict()` to determine the ozone class at those values. Is the class what you expect it to be?

In [None]:
# YOUR CODE HERE
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = ...

knn = ...
...

In [None]:
# YOUR CODE HERE

**Question 1.2** In a couple of sentences, explain in your own words how KNN works for classification problems. You can use formulas if it helps you explain or understand the method. How does KNN decide if a given wind speed and temperature corresponds to an ozone day?

*YOUR ANSWER HERE*

---

Now that we have our classifier fitted, let's test out some values of K. Before we do so, run the cell below, which defines a function that plots the decision boundary for a classifier when given a number of neighbors.

In [None]:
def plot_boundary(model, X, y, n_neighbors):
    cmap_light = ListedColormap(['#AAAAFF', '#FFAAAA'])
    cmap_bold = ListedColormap(['#0000FF', '#FF0000'])
    h = .02
    
    x_min, x_max = X.iloc[:, 0].min() - 1, X.iloc[:, 0].max() + 1
    y_min, y_max = X.iloc[:, 1].min() - 1, X.iloc[:, 1].max() + 1

    xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
    Z = model.predict(np.c_[xx.ravel(), yy.ravel()])

    Z = Z.reshape(xx.shape)
    
    plt.figure(figsize=(8, 7))
    plt.pcolormesh(xx, yy, Z, cmap=cmap_light)

    plt.scatter(X.iloc[:, 0], X.iloc[:, 1], c=y, cmap=cmap_bold,
                edgecolor='k', s=20)
    
    plt.xlim(xx.min(), xx.max())
    plt.ylim(yy.min(), yy.max())
    plt.xlabel('Peak Wind Speed')
    plt.ylabel('Peak Temperature')
    plt.title("Ozone/Normal Day Classification (k = %i)"
              % (n_neighbors))
    
    plt.show()
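
To see what the `np.meshgrid` and `np.c_` lines in `plot_boundary` are doing, the following small sketch builds a tiny grid by hand. The ranges here are arbitrary and were chosen only to keep the arrays small enough to read.

```python
import numpy as np

# A 3-by-2 grid: x takes the values 0, 1, 2 and y takes the values 0, 1.
xx, yy = np.meshgrid(np.arange(0, 3, 1), np.arange(0, 2, 1))

# Flatten both coordinate arrays and pair them up column-wise,
# giving one (x, y) row per grid point.
grid_points = np.c_[xx.ravel(), yy.ravel()]
print(grid_points)

# Any per-point result can be reshaped back onto the grid layout.
restored = grid_points[:, 0].reshape(xx.shape)
```

In `plot_boundary`, the model predicts a class for every row of the flattened grid, and the predictions are reshaped to `xx.shape` so that `plt.pcolormesh` can color each grid cell by its predicted class.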

**Question 1.3** Plot three decision boundaries, using a small value for K, a large value for K, and one somewhere in between. Use `.fit()` to train the model on the training data, use `plot_boundary()` to produce a plot, and use `.score()` to get the score of the model on the test data - i.e. the mean accuracy, or the proportion of test data points that were accurately classified. Make sure to show the plot and the score for each value of K.
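As a hedged sketch of how `.fit()` and `.score()` fit together across several values of K, the example below uses synthetic data from `make_blobs` rather than the ozone data. The names `X_demo`, `y_demo`, and `demo_scores`, the values of K, and the random seeds are all assumptions made for illustration. It leaves out `plot_boundary()`, which expects a DataFrame for `X`.

```python
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic two-class data (not the ozone dataset).
X_demo, y_demo = make_blobs(n_samples=200, centers=2,
                            cluster_std=2.0, random_state=0)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0)

# Fit one model per value of K and record its mean accuracy on the test set.
demo_scores = {}
for k in (1, 15, 101):
    model = KNeighborsClassifier(n_neighbors=k).fit(Xd_train, yd_train)
    demo_scores[k] = model.score(Xd_test, yd_test)

print(demo_scores)
```

K can be at most the number of training points, which is why the largest value here stays below the 160 rows in the synthetic training set.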

In [None]:
# ex1
...

In [None]:
# ex2
...

In [None]:
# ex3
...

**Question 1.4** Now that we have a few plots of various decision boundaries, what are some problems with using small or large values for K? Reference your plots in your answer.

*YOUR ANSWER HERE*

----

### Section 2: Intro to Decision Trees

To prepare you for the homework, let's set up a decision tree to predict ozone days using the same two features we used to train our KNN model in Section 1. We will make use of scikit-learn's `DecisionTreeClassifier`. 

In [None]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn import tree

**Question 2.1** Instantiate a `DecisionTreeClassifier` model and call it `O3tree`. Fit the model using the training data, and score it using the training and test data. Assign the scores to the variables `train_score` and `val_score` respectively.

In [None]:
# YOUR CODE HERE
...

What do these outputs represent? We can copy the code and visualize the tree on [Webgraphviz](http://webgraphviz.com). By running the following cell, you'll see a pretty long output -- follow the link and copy and paste the output to get a visualization of the decision tree we fit!

In [None]:
import graphviz
print(tree.export_graphviz(O3tree, feature_names=X_train.columns))
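
If you would rather read the tree structure without leaving the notebook, scikit-learn also provides `export_text`, which prints the splits as indented text. The sketch below uses a made-up four-row dataset. The names `X_toy`, `y_toy`, and `toy_names` are assumptions for illustration and are not part of the lab data.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_graphviz, export_text

# Made-up data: two features, perfectly separable into two classes.
X_toy = np.array([[1.0, 60.0], [2.0, 65.0], [3.0, 90.0], [4.0, 95.0]])
y_toy = np.array([0, 0, 1, 1])
toy_names = ['wind', 'temp']  # hypothetical feature names

toy_tree = DecisionTreeClassifier(random_state=0).fit(X_toy, y_toy)

# DOT source, the same kind of output that gets pasted into Webgraphviz.
dot_source = export_graphviz(toy_tree, out_file=None, feature_names=toy_names)

# Plain-text view of the same tree, one line per split or leaf.
text_view = export_text(toy_tree, feature_names=toy_names)
print(text_view)
```

Each indented line in the text view is either a threshold test on one feature or a leaf showing the predicted class, which mirrors the boxes in the Graphviz drawing.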