# Module 7 Activity: Classification

In this module, we will be focusing on a simple classification problem. We will be looking at [Kickstarter Data](https://www.kaggle.com/kemical/kickstarter-projects) and attempting to classify projects as successful or failed based on the different attributes of each project, such as the monetary goal, the number of backers, and how long the project was on Kickstarter.

This is a problem of binary classification, where there are two possible outcomes - success or failure.

In [2]:
# dependencies
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

Let's explore our data a little bit. We'll load it in the cell below.

In [3]:
ks = pd.read_csv('ks_2020.csv')
print(ks.shape)
ks.head()

(261358, 12)


Unnamed: 0,name,main_category,currency,deadline,goal,launched,state,backers,country,usd_goal_real,duration,name_length
0,Greeting From Earth: ZGAC Arts Capsule For ET,Film & Video,USD,2017-11-01,30000.0,2017-09-02,failed,15,US,30000.0,60,45
1,Where is Hank?,Film & Video,USD,2013-02-26,45000.0,2013-01-12,failed,3,US,45000.0,45,14
2,ToshiCapital Rekordz Needs Help to Complete Album,Music,USD,2012-04-16,5000.0,2012-03-17,failed,1,US,5000.0,30,49
3,Monarch Espresso Bar,Food,USD,2016-04-01,50000.0,2016-02-26,successful,224,US,50000.0,35,20
4,Support Solar Roasted Coffee & Green Energy! ...,Food,USD,2014-12-21,1000.0,2014-12-01,successful,16,US,1000.0,20,60


In [4]:
train, test = train_test_split(ks, test_size = 0.1)

In [5]:
train.head()

Unnamed: 0,name,main_category,currency,deadline,goal,launched,state,backers,country,usd_goal_real,duration,name_length
124616,Marching With Candice,Music,USD,2014-09-07,1000.0,2014-08-06,failed,11,US,1000.0,32,21
25443,“A Friend Indeed – The Bill Sackter Story” TV ...,Film & Video,USD,2014-02-24,15000.0,2014-01-16,successful,141,US,15000.0,39,55
108066,Zoohoo Carry-all hand box,Design,USD,2016-04-16,30000.0,2016-02-16,failed,4,US,30000.0,60,25
174572,Double Negative! A comedy show to lift your sp...,Film & Video,USD,2017-01-29,3000.0,2016-12-15,failed,1,US,3000.0,45,52
238666,Harmonized Hysteria Gallery Show,Art,USD,2012-03-03,2000.0,2012-02-02,successful,48,US,2000.0,30,32


In [8]:
print(f"Training Shape: {train.shape}")
print(f"Test Shape: {test.shape}")

Training Shape: (235222, 12)
Test Shape: (26136, 12)


In this module, we'll be implementing a k-nearest neighbors (KNN) classifier. The idea behind this type of classifier is to look at the k points closest to our point of interest and assign that point to whichever category is most common among them. For example, look at the image of the plot below. The green point has a single nearest neighbor, which is a Sycamore, so KNN with k = 1 would classify it as Sycamore. Of its 5 closest neighbors, however, 3 are Birch, so KNN with k = 5 would classify our green point as Birch.

In our case, rather than classifying trees based on tree diameter and height, we're looking to see if we can classify Kickstarter projects as successful or failed.

<p><a href="https://otd.gitbook.io/book/module-7/nearest-neighbors"><img src="knn.PNG"></a></p>
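
To make the voting idea concrete, here is a minimal from-scratch sketch of KNN. The function name `knn_predict` and the six toy trees are made up for illustration and are not the data behind the plot; the activity itself relies on `scikit-learn` rather than code like this.

```python
from collections import Counter

import numpy as np


def knn_predict(train_points, train_labels, query, k):
    """Classify `query` by majority vote among its k nearest training points."""
    # Euclidean distance from the query to every training point
    distances = np.sqrt(((train_points - query) ** 2).sum(axis=1))
    # Indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    # Most common label among those neighbors
    votes = Counter(train_labels[i] for i in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical toy data: (diameter, height) for six trees
points = np.array([
    [1.0, 1.0],   # Sycamore
    [3.0, 3.0],   # Birch
    [3.5, 2.5],   # Birch
    [2.5, 3.5],   # Birch
    [0.0, 3.0],   # Sycamore
    [9.0, 9.0],   # Sycamore
])
labels = np.array(["Sycamore", "Birch", "Birch", "Birch", "Sycamore", "Sycamore"])
query = np.array([1.5, 1.5])

print(knn_predict(points, labels, query, k=1))  # Sycamore
print(knn_predict(points, labels, query, k=5))  # Birch
```

With k = 1 the single closest tree decides the label; with k = 5 the three Birch neighbors outvote the two Sycamores, mirroring the situation described above.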

For KNN, we need to select numerical features in order to be able to calculate distances. Our data contain several numerical columns, and we'll use four of them as features: `goal`, `backers`, `duration`, and `name_length`. We need to select the columns that contain these features.

In [13]:
train_features = train[['goal', 'backers', 'duration', 'name_length']]
train_labels = train['state']
test_features = test[['goal', 'backers', 'duration', 'name_length']]
test_labels = test['state']

### Question 1: Standard Units

We want to standardize these units, though! The distance between goals might be a lot larger than the distance between name lengths, so goals would dominate any distance calculation. To avoid this, we can turn our data into "standard units": for each feature, we subtract the mean from our data and divide by the standard deviation, so that every feature is measured on a comparable scale.

In [14]:
def standard_units(array):
    m = np.mean(array)
    sd = np.std(array)
    return (array - m)/sd
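
As a quick check of what this function does, the sketch below applies it to three made-up goal amounts (the values are illustrative, not taken from the dataset). After conversion, the values have mean 0 and standard deviation 1.

```python
import numpy as np


def standard_units(array):
    m = np.mean(array)
    sd = np.std(array)
    return (array - m) / sd


# Made-up goal amounts, for illustration only
goals = np.array([1000.0, 3000.0, 5000.0])
z = standard_units(goals)

print(z)                     # the middle value sits at the mean, so it maps to 0
print(np.mean(z), np.std(z)) # approximately 0 and 1
```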

We can now apply our standard units function to all of our data.

In [15]:
train_features = train_features.apply(standard_units)
test_features = test_features.apply(standard_units)
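
The cell above standardizes the test set with its own mean and standard deviation. A common variation, which is not what this notebook does, is to reuse the training set's statistics for both sets, so that test rows are placed on exactly the same scale the model was fit on. A minimal sketch with hypothetical miniature DataFrames (`train_demo` and `test_demo` are made-up stand-ins):

```python
import pandas as pd

# Hypothetical miniature stand-ins for train_features and test_features
train_demo = pd.DataFrame({"goal": [1000.0, 3000.0, 5000.0],
                           "backers": [10.0, 20.0, 30.0]})
test_demo = pd.DataFrame({"goal": [3000.0, 7000.0],
                          "backers": [20.0, 40.0]})

# Column means and standard deviations from the training rows only.
# ddof=0 gives the population standard deviation, matching np.std's default.
train_mean = train_demo.mean()
train_sd = train_demo.std(ddof=0)

# Apply the same shift and scale to both sets
train_scaled = (train_demo - train_mean) / train_sd
test_scaled = (test_demo - train_mean) / train_sd

print(test_scaled)
```

The accuracies reported in this notebook come from the cell above, not from this variation, so switching approaches would likely change the numbers somewhat.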

### Question 2: Fitting the Model

In order to build our model, we will be using `scikit-learn`, a very popular Python library that is used to build statistical and machine learning models. In this class, we will only be using `scikit-learn`'s `KNeighborsClassifier`. The documentation is linked [here](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html). 

Let's try a 3-nearest-neighbors classifier (k = 3). We must first create the classifier, then fit it to our training data, and finally use the fitted model to score how accurate its predictions are.

In [36]:
knn = KNeighborsClassifier(n_neighbors=3)

In [37]:
knn.fit(train_features, train_labels)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
                     metric_params=None, n_jobs=None, n_neighbors=3, p=2,
                     weights='uniform')

Now let's look at our training accuracy using the `.score` method. It takes in a DataFrame of features (X) and the corresponding labels (y), and returns the fraction of rows that the model classifies correctly.

In [27]:
train_accuracy = knn.score(train_features, train_labels)

In [28]:
test_accuracy = knn.score(test_features, test_labels)

In [29]:
print(f"Training Accuracy: {train_accuracy}")
print(f"Test Accuracy: {test_accuracy}")

Training Accuracy: 0.9348530324544473
Test Accuracy: 0.8045224977043158
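
For a classifier, `.score` reports mean accuracy: the fraction of predictions that match the true labels. The sketch below checks this by hand on a small made-up dataset (the feature values and labels are illustrative assumptions, not Kickstarter data).

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical toy data: one feature, two well-separated classes
X_train = np.array([[0.0], [0.1], [0.2], [1.0], [1.1], [1.2]])
y_train = np.array(["failed", "failed", "failed",
                    "successful", "successful", "successful"])
X_test = np.array([[0.05], [0.15], [1.05], [0.3]])
y_test = np.array(["failed", "failed", "successful", "successful"])

model = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)

# Accuracy computed by hand: fraction of predictions that match the labels
manual_accuracy = np.mean(model.predict(X_test) == y_test)

# Both values agree: 3 of the 4 test points are classified correctly
print(manual_accuracy, model.score(X_test, y_test))
```

The last test point (0.3) is labeled "successful" but sits next to the "failed" cluster, so the model gets it wrong and the accuracy is 0.75.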


### Question 3: Changing *k*
What happens to the accuracy as you change the value of k? Try it out with k = 5 and k = 7 using the following cells! You may want to copy some or most of your code from Question 2.

In [30]:
knn_5 = KNeighborsClassifier(n_neighbors=5)
knn_5.fit(train_features, train_labels)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
                     metric_params=None, n_jobs=None, n_neighbors=5, p=2,
                     weights='uniform')

In [31]:
train_accuracy = knn_5.score(train_features, train_labels)
test_accuracy = knn_5.score(test_features, test_labels)

In [32]:
print(f"Training Accuracy: {train_accuracy}")
print(f"Test Accuracy: {test_accuracy}")

Training Accuracy: 0.9187490965981073
Test Accuracy: 0.8162304866850322


In [33]:
knn_7 = KNeighborsClassifier(n_neighbors=7)
knn_7.fit(train_features, train_labels)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
                     metric_params=None, n_jobs=None, n_neighbors=7, p=2,
                     weights='uniform')

In [34]:
train_accuracy = knn_7.score(train_features, train_labels)
test_accuracy = knn_7.score(test_features, test_labels)

In [35]:
print(f"Training Accuracy: {train_accuracy}")
print(f"Test Accuracy: {test_accuracy}")

Training Accuracy: 0.9097405854894525
Test Accuracy: 0.8222375267829813
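
The repeated fit-and-score cells above can also be written as a loop over the values of k. The sketch below is one way to do that; to stay self-contained it uses synthetic stand-in data (an assumption, not the Kickstarter set), so its accuracies will not match the ones printed above.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data: two overlapping clusters in two dimensions
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, size=(200, 2)),
               rng.normal(1.5, 1.0, size=(200, 2))])
y = np.array(["failed"] * 200 + ["successful"] * 200)

# Shuffle, then hold out the last 20% of rows as a test set
order = rng.permutation(len(X))
X, y = X[order], y[order]
split = int(0.8 * len(X))
X_train, X_test = X[:split], X[split:]
y_train, y_test = y[:split], y[split:]

# Fit one classifier per k and record (training accuracy, test accuracy)
results = {}
for k in (3, 5, 7):
    model = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    results[k] = (model.score(X_train, y_train), model.score(X_test, y_test))

for k, (train_acc, test_acc) in results.items():
    print(f"k={k}: training accuracy {train_acc:.3f}, test accuracy {test_acc:.3f}")
```

In the notebook itself, the same loop could be run with `train_features`, `train_labels`, `test_features`, and `test_labels` in place of the synthetic arrays.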
