# Competition: starter notebook

This notebook gives a starting point for the competition: it loads the data, picks a few features, fits a decision tree classifier (DTC) on top, and reaches a baseline accuracy of 61%.

There is a **lot** of room for improvement! This is where you come in. Work through the notebook and improve it step by step to get a better accuracy on the test set.

**Tasks**:

1. Run through this notebook and make sure you understand the general structure.
2. Read through the file `predictive_model.py` and make sure you understand how this notebook translates to it.
3. Modify the notebook to increase performance as much as you can.
4. When you're ready to submit, run the cell that saves your trained model (at the bottom of this notebook) and adapt the `predictive_model.py` file to match what you did in the notebook.

## Submitting your model

When you would like to submit your model, copy and paste the relevant steps from the notebook into the auxiliary file `predictive_model.py`. **Test** that the whole procedure worked by running

```bash
python model_tester.py
```

If it returns something like

```
Accuracy score (dummy test set): 0.6159
MODEL RAN SUCCESSFULLY
```

it's a good sign. If it fails, you need to fix it before submitting!

When submitting, make sure you send **both** the `predictive_model.py` file and the auxiliary `predictive_model.pickle` file, which this notebook should generate (see end of notebook) and which contains your trained classifier.


## Important note

If your feature engineering creates categories from specific values, for example occurrences of particular words, think about how to handle a word that appears in the test data but **not** in the training data.

This is very important: your model will likely fail on the test data otherwise.
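One way to guard against this, sketched on toy data: remember the dummy columns produced on the training data and align the test data to them with `reindex`. The frames `train` and `test` and their values are made up for illustration; this is one possible approach, not the required one.

```python
import pandas as pd

# Toy data: "MX" appears only in the test set, "CA" only in the training set.
train = pd.DataFrame({"country": ["US", "CA", "US", "GB"]})
test = pd.DataFrame({"country": ["US", "MX", "GB"]})

# Encode the training data and remember which dummy columns it produced.
train_dummies = pd.get_dummies(train["country"], dtype="int64")
train_columns = train_dummies.columns

# Encode the test data, then align it to the training columns:
# categories unseen in training ("MX") are dropped, and training
# categories missing from the test data ("CA") become all-zero columns.
test_dummies = pd.get_dummies(test["country"], dtype="int64")
test_dummies = test_dummies.reindex(columns=train_columns, fill_value=0)

print(test_dummies)
```

With this alignment the classifier always sees the same columns in the same order, whatever values show up at prediction time. The list of training columns would need to be stored alongside the model so that it is available when predicting.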

## Importing the libraries

Below we import some useful libraries to build an elementary model. Feel free to add whatever libraries you need, but make sure to include them in `predictive_model.py` as well.

In [1]:
import pandas as pd
import numpy as np

import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, accuracy_score, confusion_matrix
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

### Getting the data

The data describes Kickstarter projects: information about each project and whether it was successfully funded.

In [2]:
kick = pd.read_csv("kick.csv", low_memory=False)
kick.head(3)

Unnamed: 0,photo,name,blurb,goal,state,slug,disable_communication,country,currency,currency_symbol,...,created_at,launched_at,static_usd_rate,creator,location,category,profile,urls,source_url,friends
0,"{""small"":""https://ksr-ugc.imgix.net/assets/012...",Angular - Where Modern Art meets Cards,Angular is a minimalist card design for simpli...,17380.0,failed,angular-where-modern-art-meets-cards,False,US,USD,$,...,1455845363,1456694829,1.0,"{""urls"":{""web"":{""user"":""https://www.kickstarte...","{""country"":""US"",""urls"":{""web"":{""discover"":""htt...","{""urls"":{""web"":{""discover"":""http://www.kicksta...","{""background_image_opacity"":0.8,""should_show_f...","{""web"":{""project"":""https://www.kickstarter.com...",https://www.kickstarter.com/discover/categorie...,
1,"{""small"":""https://ksr-ugc.imgix.net/assets/014...",Ladybeard is KAWAII-CORE,Original songs and music videos to jump start ...,24000.0,failed,ladybeard-is-kawaii-core,False,US,USD,$,...,1475568868,1480946454,1.0,"{""urls"":{""web"":{""user"":""https://www.kickstarte...","{""country"":""JP"",""urls"":{""web"":{""discover"":""htt...","{""urls"":{""web"":{""discover"":""http://www.kicksta...","{""background_image_opacity"":0.8,""should_show_f...","{""web"":{""project"":""https://www.kickstarter.com...",https://www.kickstarter.com/discover/categorie...,
2,"{""small"":""https://ksr-ugc.imgix.net/assets/011...",Vegan Cafe Delivery Service in Vancouver BC,Our project is to launch a vegan lunch deliver...,40000.0,failed,vegancafeca,False,CA,CAD,$,...,1405218883,1405957628,0.926746,"{""urls"":{""web"":{""user"":""https://www.kickstarte...","{""country"":""CA"",""urls"":{""web"":{""discover"":""htt...","{""urls"":{""web"":{""discover"":""http://www.kicksta...","{""background_image_opacity"":0.8,""should_show_f...","{""web"":{""project"":""https://www.kickstarter.com...",https://www.kickstarter.com/discover/categorie...,


### Having a first look at the data

In [3]:
# All the columns that are available
print("COLUMNS:\n")
print(kick.columns)
# The dimension of the dataset
print("\nDIMENSIONS:\n")
print(kick.shape)
# The types of all the columns
print("\nTYPES FOR THE COLUMNS:\n")
print(kick.dtypes)

COLUMNS:

Index(['photo', 'name', 'blurb', 'goal', 'state', 'slug',
       'disable_communication', 'country', 'currency', 'currency_symbol',
       'currency_trailing_code', 'deadline', 'state_changed_at', 'created_at',
       'launched_at', 'static_usd_rate', 'creator', 'location', 'category',
       'profile', 'urls', 'source_url', 'friends'],
      dtype='object')

DIMENSIONS:

(112879, 23)

TYPES FOR THE COLUMNS:

photo                      object
name                       object
blurb                      object
goal                      float64
state                      object
slug                       object
disable_communication        bool
country                    object
currency                   object
currency_symbol            object
currency_trailing_code       bool
deadline                    int64
state_changed_at            int64
created_at                  int64
launched_at                 int64
static_usd_rate           float64
creator                    object

In [4]:
# One-hot encoding for countries
countries = kick['country']
countries = pd.get_dummies(countries)
countries.head()

# Note: the original 'country' column is not dropped from the dataframe here


Unnamed: 0,AT,AU,BE,CA,CH,DE,DK,ES,FR,GB,...,IE,IT,LU,MX,NL,NO,NZ,SE,SG,US
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
2,0,0,0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
4,0,0,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0


In [5]:
kick=kick.drop(['location','friends','photo','name','blurb','slug','currency_trailing_code','currency_symbol','creator','category','profile','urls','source_url'], axis=1)

In [6]:
kick['goal_in_usd']=kick['goal']*kick['static_usd_rate']


In [7]:
# Everything is now in USD, so drop goal, currency and static_usd_rate; goal_in_usd replaces them

kick=kick.drop(['goal','currency','static_usd_rate'], axis=1)

In [8]:
# One-hot encode the disable_communication flag, then drop the original column
communication = kick['disable_communication']
communication = pd.get_dummies(communication)

kick = kick.drop('disable_communication', axis=1)

# One-hot encoding for countries
countries = kick['country']
countries = pd.get_dummies(countries)

# Attach the dummy columns to the main dataframe
for c in countries.columns:
    kick[c] = countries[c]

for c in communication.columns:
    kick[c] = communication[c]

# Drop the raw timestamp columns
kick = kick.drop(['state_changed_at', 'created_at', 'launched_at', 'deadline'], axis=1)
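The timestamp columns are dropped here without being used. As a possible improvement, they could first be turned into features. The sketch below assumes the timestamps are Unix times in seconds, which the values shown by `kick.head(3)` suggest. The feature names `prep_days`, `campaign_days` and `launch_month` are hypothetical, and the `deadline` values are made up for illustration.

```python
import pandas as pd

# Toy frame with Unix timestamps in seconds. created_at and launched_at are
# taken from the first two rows of kick.head(3); the deadline values are
# hypothetical (launch + 30 days).
toy = pd.DataFrame({
    "created_at": [1455845363, 1475568868],
    "launched_at": [1456694829, 1480946454],
    "deadline": [1459286829, 1483538454],
})

SECONDS_PER_DAY = 86400

# Time between creating the project page and launching the campaign.
toy["prep_days"] = (toy["launched_at"] - toy["created_at"]) / SECONDS_PER_DAY

# Length of the funding campaign.
toy["campaign_days"] = (toy["deadline"] - toy["launched_at"]) / SECONDS_PER_DAY

# Calendar month of the launch (UTC).
toy["launch_month"] = pd.to_datetime(toy["launched_at"], unit="s").dt.month

print(toy[["prep_days", "campaign_days", "launch_month"]])
```

In the notebook, features like these would have to be computed on `kick` before the timestamp columns are dropped, and reproduced in `predictive_model.py`.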

### Extracting the response

The response is the `state` column. We extract it and encode it as follows:

* `failed`, `canceled`, `suspended` --> 0
* `successful` --> 1

**Note**: do not modify this. 

In [9]:
# Do not modify this cell
response = kick["state"]
del kick["state"]
response = response.apply(lambda x: 0 if x in ['failed', 'canceled', 'suspended'] else 1)

Now the projects are labelled as either 0 or 1, which you can check by running the next cell:

In [10]:
response.unique()

array([0, 1], dtype=int64)
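To see what the encoding does, here is a small sketch on a toy series. Note that the `else` branch maps every state outside the listed three to 1; the value `"live"` is hypothetical and only there to show this. It may not occur in `kick.csv`.

```python
import pandas as pd

# Toy "state" column; "live" is a hypothetical value used to show the else-branch.
states = pd.Series(["failed", "successful", "canceled", "suspended", "live"])

# Same rule as in the cell above.
encoded = states.apply(lambda x: 0 if x in ["failed", "canceled", "suspended"] else 1)
print(encoded.tolist())

# Share of each class: a useful reference point when reading accuracy scores.
print(encoded.value_counts(normalize=True))
```

Running `response.value_counts(normalize=True)` on the real data gives the class balance, which tells you what accuracy a classifier that always predicts the majority class would reach.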

## A first, simple model

In this model, we select the numerical columns and fit a classifier on them (a random forest in the cells below).

In [11]:
numerical_columns = kick.select_dtypes(include=['float64', 'int64']).columns
print(numerical_columns)

Index(['goal_in_usd'], dtype='object')
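Only `goal_in_usd` is selected, even though the one-hot columns were added to `kick` earlier. A likely reason is that `pd.get_dummies` produces `uint8` or `bool` columns by default (depending on the pandas version), and neither matches the `float64`/`int64` filter. A minimal sketch on toy data, assuming you want the dummy columns included, is to request `int64` explicitly:

```python
import pandas as pd

# Toy frame; the values are for illustration only.
toy = pd.DataFrame({
    "goal_in_usd": [17380.0, 24000.0, 37069.84],
    "country": ["US", "US", "CA"],
})

# Request int64 explicitly so the dummy columns match the dtype filter below.
dummies = pd.get_dummies(toy["country"], dtype="int64")
toy = pd.concat([toy.drop(columns="country"), dummies], axis=1)

numerical_columns = toy.select_dtypes(include=["float64", "int64"]).columns
print(list(numerical_columns))
```

An alternative is to widen the `include` list in `select_dtypes` instead of changing the dummy dtype.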


In [13]:
# plt.figure(figsize=(8,6))
# sns.distplot(numerical_columns['deadline'])

Train/test split, using only the columns selected above:

In [14]:
# train_test_split?
X_train, X_test, y_train, y_test = train_test_split(kick[numerical_columns], response, test_size = 0.3, random_state=465)

Fitting a random forest with 250 trees and a maximum depth of 50:

In [15]:
rf = RandomForestClassifier(n_estimators = 250, bootstrap = True, oob_score = True, max_depth = 50)
# RandomForestClassifier?
rf.fit(X_train, y_train)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=50, max_features='auto', max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            n_estimators=250, n_jobs=1, oob_score=True, random_state=None,
            verbose=0, warm_start=False)

Getting the predictions on the test set with the random forest and comparing them to the truth:

In [16]:
y_pred = rf.predict(X_test)

print("ACCURACY ON TEST SET: {0:.3f}".format(accuracy_score(y_test, y_pred)))
print("CLASSIFICATION REPORT:\n", 
      str(classification_report(y_test, y_pred, digits=3)))
print("\n CONFUSION MATRIX: \n")
print(confusion_matrix(y_test, y_pred))

ACCURACY ON TEST SET: 0.574
CLASSIFICATION REPORT:
              precision    recall  f1-score   support

          0      0.618     0.532     0.572     18111
          1      0.536     0.623     0.576     15753

avg / total      0.580     0.574     0.574     33864


 CONFUSION MATRIX: 

[[9626 8485]
 [5938 9815]]
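The accuracy, precision and recall reported above can be recomputed by hand from the matrix. The sketch below uses the scikit-learn convention that rows are true classes and columns are predicted classes.

```python
# Matrix copied from the output above: rows are true classes,
# columns are predicted classes.
cm = [[9626, 8485],
      [5938, 9815]]

tn, fp = cm[0]  # true class 0: predicted 0, predicted 1
fn, tp = cm[1]  # true class 1: predicted 0, predicted 1
total = tn + fp + fn + tp

accuracy = (tn + tp) / total
precision_0 = tn / (tn + fn)  # of all predicted 0, how many were really 0
recall_0 = tn / (tn + fp)     # of all true 0, how many were predicted 0
precision_1 = tp / (tp + fp)
recall_1 = tp / (tp + fn)

print(f"accuracy      {accuracy:.3f}")
print(f"class 0  precision {precision_0:.3f}  recall {recall_0:.3f}")
print(f"class 1  precision {precision_1:.3f}  recall {recall_1:.3f}")
```

The off-diagonal entries show where the model goes wrong: 8485 projects of class 0 were predicted as 1, and 5938 projects of class 1 were predicted as 0.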


## OUTPUT YOUR MODEL

Adapt and run this cell when you have trained your classifier and wish to submit it.

Pickle lets you save Python objects while preserving their structure.

In [17]:
import pickle

Only change the name of the classifier (here `rf`) to the name of your trained model; leave the rest as is.

In [19]:
with open('predictive_model.pickle', 'wb') as output_file:
    # Change what is between the square brackets, nothing else
    pickle.dump([rf], output_file)
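Before submitting, it can be worth checking that the file loads back correctly. The sketch below shows the round trip with a stand-in object instead of a trained classifier and a temporary path; both are assumptions for illustration. How `predictive_model.py` actually reads the file is not shown in this notebook.

```python
import os
import pickle
import tempfile

# Stand-in for a trained classifier; any picklable object behaves the same way.
stand_in_model = {"name": "rf", "n_estimators": 250}

# Temporary location so the sketch does not overwrite a real submission file.
path = os.path.join(tempfile.mkdtemp(), "predictive_model.pickle")

# Same pattern as the cell above: the model is wrapped in a list.
with open(path, "wb") as output_file:
    pickle.dump([stand_in_model], output_file)

# Loading returns that list, so the model is its first element.
with open(path, "rb") as input_file:
    loaded = pickle.load(input_file)

model = loaded[0]
print(model)
```

With a real classifier, calling `model.predict` on a few rows after loading is a quick check that the saved object is the trained model you intended to submit.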