# Classification Assignment

The main goal of this assignment is to check in on your ability to access, load, explore, and make predictions using classification models.  For next Wednesday, you are to do the following:

1. Locate a dataset on Kaggle, NYC Open Data, UCI Machine Learning Repository, or other resource that contains data that can be addressed through a classification task
2. Load and explore the data for missing values and perform a brief EDA
3. Frame and state the classification problem
4. Split your data into train and test sets
5. Implement a `DummyClassifier`, `KNeighborsClassifier`, and `LogisticRegression` model.
6. Improve the models by performing a `GridSearchCV` for `n_neighbors` and `C` parameters respectively.  Include a scale transformation in your pipeline for KNN and a `PolynomialFeatures` step in the Logistic model.
7. Discuss the outcome of your classifiers using the `classification_report`.  Which did the best?  Do you prefer a recall or a precision oriented model?  Why?

**EXTRA**:

- Include `SGDClassifier`
- Incorporate AUC and ROC curves


Note: If you can't find a good dataset, use the titanic dataset.

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import os

from sklearn.dummy import DummyClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression

from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.metrics import accuracy_score, classification_report

# set chart style
sns.set(style='ticks')
%matplotlib inline

# choose laptop location
home = '/Users/karenhao'
office = '/users/khao'

location = home+'/Google Drive/02 Working/Quartz/Education/GA Data Science/DAT-NYC-6.13'

### step 0. import data

**About the data**

https://www.kaggle.com/kemical/kickstarter-projects


* `ID`: internal kickstarter id
* `name`: name of kickstarter project
* `category`: specific (sub-)category of campaign
* `main_category`: broad category of campaign
* `currency`: currency used to support
* `deadline`: deadline for crowdfunding
* `goal`: fundraising goal in local currency
* `launched`: date launched
* `pledged`: amount of money pledged in local currency
* `state`: state of campaign (successful, failed, other)
* `backers`: number of backers
* `country`: country pledged from
* `usd pledged`: amount of money pledged in usd (conversion done by Kickstarter)
* `usd_pledged_real`: amount of money pledged in usd (conversion via the Fixer.io API)
* `usd_goal_real`: fundraising goal in usd

In [None]:
os.chdir(location+'/data')
kickstarter = pd.read_csv('ks-projects-201801.csv')

In [None]:
kickstarter.head()

### step 1. check for null values

In [None]:
kickstarter.info()

It looks like `name` and `usd pledged` are the only columns with null values. We don't care about `name`, but we might care about `usd pledged`.

In [None]:
kickstarter[kickstarter['usd pledged'].isnull()].head(10)

From examining rows where `usd pledged` is null, it seems OK to use `usd_pledged_real` instead, which accurately converts `pledged` to usd.
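As a compact complement to reading `info()`, a per-column null count can be computed directly. The sketch below uses a small made-up frame (the values are not from the Kickstarter file) that only mimics the column names.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the Kickstarter data (all values are made up).
toy = pd.DataFrame({
    'name': ['Album', None, 'Board game'],
    'usd pledged': [120.0, np.nan, np.nan],
    'usd_pledged_real': [120.0, 45.0, 300.0],
})

# Count missing values per column, then keep only columns that have any.
null_counts = toy.isnull().sum()
null_counts = null_counts[null_counts > 0]
print(null_counts)
```

On the real data the same two lines would be applied to `kickstarter` instead of `toy`.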

### step 2. understanding existing variables

Now we continue to explore each of the columns. We start by checking the value counts for some of the key columns.

In [None]:
kickstarter.category.value_counts().head(10)

In [None]:
kickstarter.main_category.value_counts()

In [None]:
kickstarter.currency.value_counts()

In [None]:
kickstarter.state.value_counts()

Now let's take a look at the distributions of different variables based on the `state`.

In [None]:
sns.boxplot(x='state',y='usd_goal_real',data=kickstarter)

Because of the outliers in `usd_goal_real`, it's too hard to tell what is actually going on in the above boxplot. Let's get rid of outliers.

In [None]:
# check distribution
kickstarter.usd_goal_real.describe()

In [None]:
# remove outliers
ks = kickstarter.copy()
ks = ks[ks.usd_goal_real<ks.usd_goal_real.quantile(.95)]

ks.usd_goal_real.describe()
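To make the effect of the quantile filter concrete, here is a minimal sketch on made-up values (the integers 1 to 100, not real goals). It shows the cutoff that `quantile(.95)` produces and how many rows the strict `<` comparison drops.

```python
import pandas as pd

# Made-up goals: the integers 1..100.
goal = pd.Series(range(1, 101), name='usd_goal_real')

# The 95th percentile is linearly interpolated between neighbouring values.
cutoff = goal.quantile(.95)

# Keep only rows strictly below the cutoff, as in the cell above.
trimmed = goal[goal < cutoff]

print('cutoff:', cutoff)
print('rows dropped:', len(goal) - len(trimmed))
```

Each time this filter is applied to another column it removes roughly a further 5% of the remaining rows, so repeated trimming shrinks the data more than a single pass suggests.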

Now we can try visualizing `usd_goal_real` again.

In [None]:
sns.boxplot(x='state',y='usd_goal_real',data=ks)

In [None]:
sns.boxplot(x='state',y='backers',data=ks)

Once again, there are too many outliers in `backers`. Let's do the same thing as we did with `usd_goal_real`.

In [None]:
# remove outliers again
ks = ks[ks.backers<ks.backers.quantile(.95)]
sns.boxplot(x='state',y='backers',data=ks)

Based on the above boxplot, `backers` seems like a great way to classify the success of a campaign.

In [None]:
fig, ax = plt.subplots(figsize=(14,5))
sns.countplot(x='main_category',hue='state',data=ks,ax=ax)

The `main_category`, on the other hand, doesn't seem like a great way to classify success.

### step 3. create new variables

Let's calculate a new variable from `launched` and `deadline` that represents the duration of the campaign.

In [None]:
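# convert columns to datetimes, dropping the time of day
# (keeping datetime64 dtype so the subtraction supports .dt.days)
ks['launched'] = pd.to_datetime(ks.launched).dt.normalize()
ks['deadline'] = pd.to_datetime(ks.deadline).dt.normalize()

# calculate new variable: campaign duration in whole days
ks['duration'] = (ks.deadline - ks.launched).dt.days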

In [None]:
sns.boxplot(x='state',y='duration',data=ks)

Remove outliers again.

In [None]:
ks = ks[ks.duration<ks.duration.quantile(.95)]
sns.boxplot(x='state',y='duration',data=ks)

Let's simplify `state` to have only binary values that indicate successful and unsuccessful campaigns.

In [None]:
ks['state_simple'] = ks.state.apply(lambda state: 1 if state=='successful' else 0)

In [None]:
ks.state_simple.value_counts()
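The same binary target can be built without a Python-level `lambda`. The sketch below uses made-up state labels for illustration; note that the rule sends every state other than `successful` to 0, including campaigns that were canceled or are still live.

```python
import pandas as pd

# Illustrative state labels (not read from the real file).
state = pd.Series(['successful', 'failed', 'canceled', 'live', 'successful'])

# Vectorized comparison: True where successful, then cast to 0/1.
state_simple = state.eq('successful').astype(int)
print(state_simple.tolist())
```

Whether unfinished or canceled campaigns should count as unsuccessful, or be dropped before modeling, is a framing decision worth stating alongside the classification problem.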

Let's also simplify `currency` to binary values for USD or not USD.

In [None]:
ks['USD?'] = ks.currency.apply(lambda curr: 1 if curr=='USD' else 0)

In [None]:
ks.groupby('state_simple')['USD?'].mean()

Let's also create dummy variables for `category`, `main_category`, and `currency`.

In [None]:
category = pd.get_dummies(ks.category, drop_first=True)
main_category = pd.get_dummies(ks.main_category, drop_first=True)
currency = pd.get_dummies(ks.currency, drop_first=True)

# put all categorical variables in one df
cat_var = pd.concat([category,main_category,currency],axis=1)

In [None]:
cat_var.head()
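One thing to watch when concatenating dummies from several columns is that the resulting column names can collide if two source columns share a label (assumed here for illustration, for example an "Art" value appearing in both `category` and `main_category`). A minimal sketch on made-up rows, letting `pd.get_dummies` prefix each dummy with its source column name:

```python
import pandas as pd

# Made-up rows in which 'Art' appears in both columns.
toy = pd.DataFrame({
    'category': ['Poetry', 'Art', 'Rock', 'Art'],
    'main_category': ['Publishing', 'Art', 'Music', 'Art'],
})

# Passing a DataFrame makes get_dummies prefix every dummy column
# with the name of the column it came from.
dummies = pd.get_dummies(toy[['category', 'main_category']], drop_first=True)
print(list(dummies.columns))
```

Unique column names make it easier to trace a feature back to its source column later on.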

### step 4. implement classification model

We are now ready to implement different classification models. Let's first use a KNeighbors model and try feeding the model everything. Note: Because the dataframe is really big, I first shrank it down to its first 10,000 rows.

In [None]:
ks_sample = ks[:10000]
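Slicing takes the first rows in file order, which may not be representative if the file is sorted in some way. As a hedged alternative, `DataFrame.sample` draws rows at random; the sketch below uses a made-up frame and an arbitrary `random_state`.

```python
import pandas as pd

# Made-up frame with 1,000 rows in a fixed order.
toy = pd.DataFrame({'backers': range(1000)})

# First 100 rows in file order, as in the cell above.
first_rows = toy[:100]

# 100 rows drawn at random; random_state makes the draw repeatable.
random_rows = toy.sample(n=100, random_state=42)

print(first_rows['backers'].max(), random_rows['backers'].max())
```

The random draw keeps the original index labels, so later index-based joins still line up.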

In [None]:
X = ks_sample[['duration','backers','usd_goal_real']].join(cat_var)
y = ks_sample.state_simple

In [None]:
X_train,X_test,y_train,y_test = train_test_split(X,y)
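With the default arguments, `train_test_split` draws a different random partition on every run, so accuracy scores will vary between runs. A minimal sketch on made-up arrays shows two optional arguments that can help: `random_state` for repeatability and `stratify` to keep the class balance the same in both splits. The specific values are arbitrary.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 400 rows, 25% positive class.
X = np.arange(800).reshape(400, 2)
y = np.array([1] * 100 + [0] * 300)

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.25,    # hold out a quarter of the rows
    stratify=y,        # preserve the class proportions in both splits
    random_state=42,   # fix the shuffle so results are repeatable
)

print(X_test.shape, y_train.mean(), y_test.mean())
```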

In [None]:
knn = KNeighborsClassifier(n_neighbors = 5)
knn.fit(X_train,y_train)
y_pred = knn.predict(X_test)

accuracy_score(y_test,y_pred)

Let's try it again with only `duration`, `backers`, and `usd_goal_real`.

In [None]:
X = ks_sample[['duration','backers','usd_goal_real']]
y = ks_sample.state_simple

X_train,X_test,y_train,y_test = train_test_split(X,y)

knn.fit(X_train,y_train)
y_pred = knn.predict(X_test)

accuracy_score(y_test,y_pred)

Let's try with `duration`, `backers`, `main_category`.

In [None]:
X = ks_sample[['duration','backers']].join(main_category)
y = ks_sample.state_simple

X_train,X_test,y_train,y_test = train_test_split(X,y)

knn.fit(X_train,y_train)
y_pred = knn.predict(X_test)

accuracy_score(y_test,y_pred)

Let's try it with all the categories but using `USD?` instead of `currency`.

In [None]:
X = ks_sample[['duration','backers','usd_goal_real','USD?']].join(pd.concat([main_category,category],axis=1))
y = ks_sample.state_simple

X_train,X_test,y_train,y_test = train_test_split(X,y)

knn.fit(X_train,y_train)
y_pred = knn.predict(X_test)

accuracy_score(y_test,y_pred)

Let's try one last time with `backers` only.

In [None]:
X = ks_sample[['backers']]
y = ks_sample.state_simple

X_train,X_test,y_train,y_test = train_test_split(X,y)

knn.fit(X_train,y_train)
y_pred = knn.predict(X_test)

accuracy_score(y_test,y_pred)

It seems like it's between the first and second models. Let's continue optimizing the first model.
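Because each comparison above rests on a single random split, small differences in accuracy may just be noise. One hedged way to compare feature sets more stably is cross-validation. The sketch below uses made-up data and a made-up labeling rule, so only the pattern carries over, not the numbers.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Made-up features; the target rule is invented so the example has signal.
rng = np.random.RandomState(0)
n = 300
toy = pd.DataFrame({
    'backers': rng.poisson(20, n),
    'duration': rng.randint(10, 60, n),
})
y = (toy['backers'] > 20).astype(int)

feature_sets = {
    'backers only': ['backers'],
    'backers + duration': ['backers', 'duration'],
}

knn = KNeighborsClassifier(n_neighbors=5)

# Mean 5-fold accuracy for each candidate feature set.
scores = {
    label: cross_val_score(knn, toy[cols], y, cv=5, scoring='accuracy').mean()
    for label, cols in feature_sets.items()
}
print(scores)
```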

### step 5. optimize model

Let's redefine the variables.

In [None]:
X = ks_sample[['duration','backers','usd_goal_real']].join(cat_var)
y = ks_sample.state_simple

X_train,X_test,y_train,y_test = train_test_split(X,y)

Now let's implement a grid search to optimize the number of neighbors, using a pipeline that applies a `StandardScaler()` before the classifier.

In [None]:
pipe = make_pipeline(StandardScaler(),KNeighborsClassifier())

In [None]:
params = {'kneighborsclassifier__n_neighbors': list(range(3, 10))}

In [None]:
grid = GridSearchCV(pipe, param_grid=params, scoring = 'accuracy')

In [None]:
grid.fit(X_train, y_train)

In [None]:
best = grid.best_estimator_

best.fit(X_train, y_train)
y_pred = best.predict(X_test)

print('accuracy score:',accuracy_score(y_test,y_pred))
print(classification_report(y_test,y_pred))

Ironically, the grid search did not actually improve the model. In this instance, the grid found that `n_neighbors=5` is optimal. Given that this is the same value I used above, the `StandardScaler()` seems to be the main reason for the slightly lower accuracy score.
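The chosen value and its cross-validated score can be read straight from the fitted search object. The sketch below rebuilds the same kind of pipeline on synthetic data from `make_classification` (not the Kickstarter data). It also checks that, with the default `refit=True`, `best_estimator_` has already been refit on the full training set, so calling `fit` on it again is optional.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data.
X, y = make_classification(n_samples=300, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

pipe = make_pipeline(StandardScaler(), KNeighborsClassifier())
params = {'kneighborsclassifier__n_neighbors': list(range(3, 10))}

grid = GridSearchCV(pipe, param_grid=params, scoring='accuracy', cv=5)
grid.fit(X_train, y_train)

# Best hyperparameter and its mean cross-validated accuracy.
print(grid.best_params_, round(grid.best_score_, 3))

# The search object and its best estimator give identical predictions.
same = np.array_equal(grid.predict(X_test),
                      grid.best_estimator_.predict(X_test))
print('identical predictions:', same)
```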

### step 6. compare to a dummy classifier

In [None]:
dummy = DummyClassifier()
dummy.fit(X_train, y_train)
dummy_pred = dummy.predict(X_test)

print('accuracy score:',accuracy_score(y_test,dummy_pred))
print(classification_report(y_test,dummy_pred))

At least our model does much better than a dummy classifier.
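The default `strategy` of `DummyClassifier` has differed between scikit-learn releases, so it may be safer to state it explicitly. A minimal sketch on made-up labels, using `most_frequent`, shows that this baseline's accuracy is simply the share of the majority class:

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Made-up labels: 70% class 0, 30% class 1. Features are ignored by the dummy.
X = np.zeros((10, 1))
y = np.array([0] * 7 + [1] * 3)

# Always predict the most frequent class seen during fit.
dummy = DummyClassifier(strategy='most_frequent')
dummy.fit(X, y)

acc = dummy.score(X, y)
print('baseline accuracy:', acc)
```

Comparing a model against this number makes it clear how much of its accuracy comes from class imbalance alone.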

### step 7. try a logistic regressor instead

In [None]:
# create pipe
pipe = make_pipeline(StandardScaler(), PolynomialFeatures(), LogisticRegression())

# set parameters
params = {'polynomialfeatures__degree': list(range(2, 10))}

# perform grid search
grid = GridSearchCV(pipe, param_grid=params)
grid.fit(X_train, y_train)

best = grid.best_estimator_
best.fit(X_train, y_train)
best_pred = best.predict(X_test)

print('accuracy score:',accuracy_score(y_test,best_pred))
print(classification_report(y_test,best_pred))
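The assignment also asks for a search over `C`. As a hedged sketch on synthetic data (not the Kickstarter data), the grid below searches `C` together with a small range of polynomial degrees. The candidate values and `max_iter=1000` are arbitrary choices for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic stand-in data with four numeric features.
X, y = make_classification(n_samples=300, n_features=4, random_state=0)

pipe = make_pipeline(
    StandardScaler(),
    PolynomialFeatures(),
    LogisticRegression(max_iter=1000),  # extra iterations to help convergence
)

# Search regularization strength and polynomial degree together.
params = {
    'polynomialfeatures__degree': [1, 2],
    'logisticregression__C': [0.01, 0.1, 1, 10],
}

grid = GridSearchCV(pipe, param_grid=params, cv=5, scoring='accuracy')
grid.fit(X, y)

print(grid.best_params_, round(grid.best_score_, 3))
```

Keeping the degree range small matters when there are many dummy columns: the number of polynomial terms grows combinatorially, so 100 input columns already yield 5,151 terms at degree 2 and 176,851 at degree 3.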