# Module 7 Activity: Classification

In this module, we will be focusing on a simple classification problem. We will be looking at [Kickstarter Data](https://www.kaggle.com/kemical/kickstarter-projects) and attempting to classify projects as successful or failed based on the different attributes of each project, such as the monetary goal, the number of backers, and how long the project was on Kickstarter.

This is a problem of binary classification, where there are two possible outcomes - success or failure.

In [None]:
# dependencies
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

Let's explore our data a little bit. We'll load it in the cell below.

In [None]:
ks = pd.read_csv('ks_2020.csv')
print(ks.shape)
ks.head()

For a classification model, we want to test our model on data that it hasn't seen before. If we test it on points it was trained on, it already knows the answers, so the accuracy will look better than it really is (a classifier that simply memorizes its training points could be correct close to 100% of the time!). For this reason, we split our data into a training set and a test set. The cell below accomplishes this by setting aside 10% of the data for testing.

In [None]:
train, test = train_test_split(ks, test_size=0.1)

In [None]:
train.head()

In [None]:
print(f"Training Shape: {train.shape}")
print(f"Test Shape: {test.shape}")
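
To see what the split does on something small enough to inspect by eye, here is a minimal sketch on a made-up table. The `toy` DataFrame and the `random_state=0` argument are assumptions for illustration only; they are not part of the activity's data or required settings.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy table standing in for the Kickstarter data
toy = pd.DataFrame({"goal": range(20), "state": ["successful", "failed"] * 10})

# random_state is optional; fixing it makes the random shuffle reproducible
toy_train, toy_test = train_test_split(toy, test_size=0.1, random_state=0)

print(f"Training Shape: {toy_train.shape}")
print(f"Test Shape: {toy_test.shape}")

# Every row lands in exactly one of the two sets
overlap = set(toy_train.index) & set(toy_test.index)
print(f"Rows in both sets: {len(overlap)}")
```

With 20 rows and `test_size=0.1`, two rows are held out for testing and the remaining 18 are used for training.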

In this module, we'll be implementing a k-nearest neighbors (KNN) classifier. The idea behind this type of classifier is to look at the k points closest to our point of interest and assign it the category that the majority of those neighbors belong to. For example, look at the plot below. The green point has a single nearest neighbor, which is a Sycamore, so KNN with k = 1 would classify it as Sycamore. Of its 5 closest neighbors, however, 3 are Birch. This would lead us to classify our green point as Birch if we were using KNN with k = 5.

In our case, rather than classifying trees based on tree diameter and height, we're looking to see if we can classify Kickstarter projects as successful or failed.

<p><a href="https://otd.gitbook.io/book/module-7/nearest-neighbors"><img src="knn.PNG"></a></p>
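
The majority-vote rule described above can be sketched in plain Python. This is only an illustration: the function name `knn_classify` and the tree measurements are hypothetical (they are not read off the figure), and the activity itself uses `scikit-learn` rather than a hand-written classifier.

```python
import math
from collections import Counter

def knn_classify(point, examples, k):
    """Classify `point` by majority vote among its k nearest labeled examples.

    `examples` is a list of (features, label) pairs; distance is Euclidean.
    """
    # Sort the labeled examples from nearest to farthest
    by_distance = sorted(examples, key=lambda ex: math.dist(point, ex[0]))
    # Keep only the labels of the k nearest examples
    nearest_labels = [label for _, label in by_distance[:k]]
    # The most common label among those neighbors wins
    return Counter(nearest_labels).most_common(1)[0][0]

# Hypothetical (diameter, height) measurements, not taken from the figure
trees = [
    ((1.0, 1.0), "Sycamore"),
    ((2.0, 2.5), "Birch"),
    ((2.5, 2.0), "Birch"),
    ((3.0, 3.0), "Birch"),
    ((5.0, 5.0), "Sycamore"),
]
new_tree = (1.2, 1.2)

print(knn_classify(new_tree, trees, k=1))  # nearest single neighbor decides
print(knn_classify(new_tree, trees, k=5))  # majority of all five decides
```

In this made-up data, the single closest tree is a Sycamore, but three of the five closest are Birch, so the answer changes with k.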

For KNN, we need to select numerical features in order to be able to calculate distances. Our data contain various numerical features, and we'll use all of them - these are `goal`, `backers`, `duration`, and `name_length`. We need to select the columns that contain these features.

In [None]:
train_features = train[['goal', 'backers', 'duration', 'name_length']]
train_labels = train['state']
test_features = test[['goal', 'backers', 'duration', 'name_length']]
test_labels = test['state']

### Question 1: Standard Units

We want to standardize these units, though! The distance between goals might be a lot larger than the distance between name lengths, so we can turn our data into "standard units" -- that is, we will subtract the mean from our data and divide by the standard deviation to get our data into units that make more sense for our analysis.
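
To see why raw units are a problem for distances, consider a small sketch with two made-up projects. The numbers below are hypothetical and chosen only to show the effect; they do not come from the Kickstarter data.

```python
import math

# Two hypothetical projects: (goal in dollars, name_length in characters)
project_a = (5000.0, 10.0)
project_b = (6000.0, 60.0)

goal_gap = abs(project_a[0] - project_b[0])   # difference in goal
name_gap = abs(project_a[1] - project_b[1])   # difference in name length

# Euclidean distance using both features in their raw units
distance = math.dist(project_a, project_b)

print(f"goal gap: {goal_gap}, name gap: {name_gap}, distance: {distance:.1f}")
```

The distance comes out almost equal to the goal gap alone, so the name length barely influences which points count as "near". Putting every feature into standard units gives each one a comparable say.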

In [None]:
def standard_units(array):
    m = ...
    sd = ...
    return ...

We can now apply our standard units function to all of our data.

In [None]:
train_features = train_features.apply(...)
test_features = test_features.apply(...)

### Question 2: Fitting the Model

In order to build our model, we will be using `scikit-learn`, a very popular Python library that is used to build statistical and machine learning models. In this class, we will only be using `scikit-learn`'s `KNeighborsClassifier`. The documentation is linked [here](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html). 

Let's try classifying our data using 3-NN. We must first create the classifier, fit the model to our training data, and then use this model to score how accurate its predictions are.

In [None]:
knn = KNeighborsClassifier(n_neighbors=...)

In [None]:
knn.fit(..., ...)

Now let's look at our training accuracy using the `.score` method. It takes in a features DataFrame (`X`) and the labels (`y`).
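
As a rough sketch of what `.score` reports, the toy example below compares it with the fraction of predictions that match the true labels. The arrays `X` and `y` and the name `toy_knn` are made up for illustration and are separate from the activity's variables.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical toy data: one feature, two well-separated classes
X = np.array([[0.0], [1.0], [2.0], [10.0], [11.0], [12.0]])
y = np.array(["failed", "failed", "failed",
              "successful", "successful", "successful"])

toy_knn = KNeighborsClassifier(n_neighbors=3)
toy_knn.fit(X, y)

# .score(X, y) returns the fraction of rows whose predicted label equals y
score = toy_knn.score(X, y)
manual = (toy_knn.predict(X) == y).mean()

print(score, manual)
```

Both numbers agree because, for a classifier, `.score` is the mean accuracy on the data it is given.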

In [None]:
train_accuracy = knn.score(train_features, ...)

In [None]:
test_accuracy = knn.score(..., test_labels)

In [None]:
print(f"Training Accuracy: {train_accuracy}")
print(f"Test Accuracy: {test_accuracy}")

### Question 3: Changing *k*

What happens to the accuracy when you change the value of k? Try it out with k = 5 and k = 7 using the following cells! You may want to copy some or most of your code from Question 2.

In [None]:
knn_5 = KNeighborsClassifier(n_neighbors=...)
knn_5.fit(..., ...)  # fit takes in a features DataFrame and a labels array

In [None]:
train_accuracy = knn_5.score(..., ...)
test_accuracy = knn_5.score(..., ...)

In [None]:
print(f"Training Accuracy: {train_accuracy}")
print(f"Test Accuracy: {test_accuracy}")

In [None]:
knn_7 = KNeighborsClassifier(n_neighbors=...)
knn_7.fit(..., ...)

In [None]:
train_accuracy = knn_7.score(..., ...)
test_accuracy = knn_7.score(..., ...)

In [None]:
print(f"Training Accuracy: {train_accuracy}")
print(f"Test Accuracy: {test_accuracy}")
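
Once the cells above work, one possible way to compare several values of k side by side is a loop. The sketch below runs on synthetic stand-in data from `make_classification`, not the Kickstarter data, and the names `results`, `X_train`, and so on are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data with four numerical features (not the Kickstarter data)
X, y = make_classification(n_samples=200, n_features=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, random_state=0
)

# Map each k to its (training accuracy, test accuracy)
results = {}
for k in (3, 5, 7):
    model = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    results[k] = (model.score(X_train, y_train), model.score(X_test, y_test))

for k, (train_acc, test_acc) in results.items():
    print(f"k={k}: Training Accuracy: {train_acc:.3f}, Test Accuracy: {test_acc:.3f}")
```

Collecting the scores in one dictionary makes it easier to spot how training and test accuracy move as k grows.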