#### Kickstarter - Crowdfunding and Success

Kickstarter is an American public-benefit corporation based in Brooklyn, New York, that maintains a global crowdfunding platform focused on creativity. The company's stated mission is to "help bring creative projects to life". Kickstarter has reportedly received more than $1.9 billion in pledges from 9.4 million backers to fund 257,000 creative projects, such as films, music, stage shows, comics, journalism, video games, technology and food-related projects.
People who back Kickstarter projects are offered tangible rewards or experiences in exchange for their pledges. This model traces its roots to the subscription model of arts patronage, in which artists went directly to their audiences to fund their work.

Questions:

1. Which categories and campaign strategies are usually successful?
2. How does project duration affect success?
3. How are funding and the number of backers associated with success?


### Why not a model to predict whether a project will be successful before it is released?

#### Independent variables

#### ID:
- ID                  378661 non-null int64 -- Unique project Id

#### Text:
- name                378657 non-null object -- Project name
- category            378661 non-null object -- Type of industry, e.g. Restaurant, Food, Poetry
- main_category       378661 non-null object -- Main campaign category or idea, e.g. Food, Music, Video

#### Date:
- deadline            378661 non-null object -- Crowdfunding deadline
- launched            378661 non-null object -- date launched

#### Categorical: Nominal
- currency            378661 non-null object -- Type of Currency
- country             378661 non-null object -- Country

#### Numerical:
- goal                378661 non-null float64 -- Goal - The amount of money creator needs to complete the project
- pledged             378661 non-null float64 -- amount pledged by crowd
- backers             378661 non-null int64 -- number of supporters
- usd pledged         374864 non-null float64 -- Pledged amount in USD (conversion made by KS) 
- usd_pledged_real    378661 non-null float64 -- Pledged amount in USD (conversion made by fixer.io api)
- usd_goal_real       378661 non-null float64 -- Goal amount in USD 

##### Dependent variable - Nominal
- state               378661 non-null object -- Project status - successful, failed, canceled, undefined, etc.

In [45]:
#Load the libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

In [46]:
#Loading the data
df_ks = pd.read_csv("input/ks-projects-201801.csv")

In [47]:
df_ks.head()

| | ID | name | category | main_category | currency | deadline | goal | launched | pledged | state | backers | country | usd pledged | usd_pledged_real | usd_goal_real |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1000002330 | The Songs of Adelaide & Abullah | Poetry | Publishing | GBP | 2015-10-09 | 1000.0 | 2015-08-11 12:12:28 | 0.0 | failed | 0 | GB | 0.0 | 0.0 | 1533.95 |
| 1 | 1000003930 | Greeting From Earth: ZGAC Arts Capsule For ET | Narrative Film | Film & Video | USD | 2017-11-01 | 30000.0 | 2017-09-02 04:43:57 | 2421.0 | failed | 15 | US | 100.0 | 2421.0 | 30000.0 |
| 2 | 1000004038 | Where is Hank? | Narrative Film | Film & Video | USD | 2013-02-26 | 45000.0 | 2013-01-12 00:20:50 | 220.0 | failed | 3 | US | 220.0 | 220.0 | 45000.0 |
| 3 | 1000007540 | ToshiCapital Rekordz Needs Help to Complete Album | Music | Music | USD | 2012-04-16 | 5000.0 | 2012-03-17 03:24:11 | 1.0 | failed | 1 | US | 1.0 | 1.0 | 5000.0 |
| 4 | 1000011046 | Community Film Project: The Art of Neighborhoo... | Film & Video | Film & Video | USD | 2015-08-29 | 19500.0 | 2015-07-04 08:35:03 | 1283.0 | canceled | 14 | US | 1283.0 | 1283.0 | 19500.0 |


In [48]:
# Number of unique values in each column
print(df_ks.nunique())

ID                  378661
name                375764
category               159
main_category           15
currency                14
deadline              3164
goal                  8353
launched            378089
pledged              62130
state                    6
backers               3963
country                 23
usd pledged          95455
usd_pledged_real    106065
usd_goal_real        50339
dtype: int64


#### The data looks skewed towards the 'failed' class; this needs rechecking

In [49]:
Success_dist = round(df_ks["state"].value_counts() / len(df_ks["state"]) * 100,2)

print("Success_dist in %: ")
print(Success_dist)

Success_dist in %: 
failed        52.22
successful    35.38
canceled      10.24
undefined      0.94
live           0.74
suspended      0.49
Name: state, dtype: float64
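
The overall distribution above can be broken down by `main_category` as a first look at question 1. The following is only a minimal sketch on an invented toy frame standing in for `df_ks`; the rows and the `success_rate` name are assumptions for illustration.

```python
import pandas as pd

# Toy stand-in for df_ks; the rows are invented for illustration only
toy = pd.DataFrame({
    "main_category": ["Music", "Music", "Food", "Food", "Food", "Games"],
    "state": ["successful", "failed", "failed", "failed", "successful", "successful"],
})

# Share of successful projects per main category, in percent
success_rate = (
    toy["state"].eq("successful")
    .groupby(toy["main_category"])
    .mean()
    .mul(100)
    .round(2)
    .sort_values(ascending=False)
)
print(success_rate)
```

On the real data the same expression could be applied to `df_ks` directly, ideally after the incomplete states have been removed.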


#### Remove live, undefined, and suspended rows: together they account for only about 2% of the rows (0.74% + 0.94% + 0.49%)

In [8]:
#Remove live, undefined, suspended

df_ks.drop(df_ks.index[df_ks['state'] == 'live'], inplace = True)
df_ks.drop(df_ks.index[df_ks['state'] == 'undefined'], inplace = True)
df_ks.drop(df_ks.index[df_ks['state'] == 'suspended'], inplace = True)
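
The three drops above could also be written as a single boolean filter. A minimal sketch on an invented toy frame; `excluded_states` and `filtered` are hypothetical names, not part of the notebook.

```python
import pandas as pd

# Toy stand-in for df_ks; the rows are invented for illustration only
toy = pd.DataFrame({
    "ID": [1, 2, 3, 4, 5],
    "state": ["failed", "live", "successful", "undefined", "suspended"],
})

# Keep only rows whose state is not in the excluded list
excluded_states = ["live", "undefined", "suspended"]
filtered = toy[~toy["state"].isin(excluded_states)].copy()
print(filtered["state"].tolist())
```

Filtering into a new frame avoids `inplace=True` and keeps the list of excluded states in one place.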

In [50]:
#Counting nulls in each column
df_ks.isnull().sum()

ID                     0
name                   4
category               0
main_category          0
currency               0
deadline               0
goal                   0
launched               0
pledged                0
state                  0
backers                0
country                0
usd pledged         3797
usd_pledged_real       0
usd_goal_real          0
dtype: int64

In [None]:
#Searching nans
nans = lambda df: df[df.isnull().any(axis=1)]
nans(df_ks)

#### Additional cleanup

    - usd pledged - USD conversion made by KS. Dropping it, as it holds nearly all of the NaN values (3797); usd_pledged_real can be used instead.
    - Remove rows with no project name
    - Check for duplicates

In [12]:
df_ks = df_ks.drop(['usd pledged'], axis=1)

In [16]:
df_ks = df_ks.dropna(axis=0)

In [13]:
# Select duplicate rows except first occurrence based on all columns
duplicateRowsDF = df_ks[df_ks.duplicated()]
 
print("Duplicate Rows except first occurrence based on all columns are :")
print(duplicateRowsDF)

Duplicate Rows except first occurrence based on all columns are :
Empty DataFrame
Columns: [ID, name, category, main_category, currency, deadline, goal, launched, pledged, state, backers, country, usd_pledged_real, usd_goal_real]
Index: []


#### Feature Extraction: 

    - launched and deadline are strings. Convert them to timestamps to derive the project duration and additional date features.
    - One-hot encode categorical values

In [18]:
df_ks['launched'] = pd.to_datetime(df_ks['launched'])
df_ks['laun_month_year'] = df_ks['launched'].dt.to_period("M")
df_ks['laun_day_month_year'] = df_ks['launched'].dt.to_period("D")
df_ks['laun_year'] = df_ks['launched'].dt.to_period("A")
df_ks['laun_hour'] = df_ks['launched'].dt.hour

df_ks['deadline'] = pd.to_datetime(df_ks['deadline'])
df_ks['dead_month_year'] = df_ks['deadline'].dt.to_period("M")
df_ks['dead_day_month_year'] = df_ks['deadline'].dt.to_period("D")
df_ks['dead_year'] = df_ks['deadline'].dt.to_period("A")


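#Creating a new column with the campaign length in whole months,
#using the average month length of 30.436875 days (365.2425 / 12)
campaign_days = (df_ks['deadline'] - df_ks['launched'].dt.normalize()).dt.days
df_ks['time_campaign'] = (campaign_days / 30.436875).astype(int)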
#df_ks['time_campaign'] = df_ks['dead_month_year'] - df_ks['laun_month_year']
#df_ks['time_campaign'] = df_ks['time_campaign']
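
Because most campaigns run for a few weeks, a month-level duration takes very few distinct values. A day-level duration might be a useful alternative feature. The sketch below uses the `launched` and `deadline` values of the first two rows shown by `df_ks.head()`; the `campaign_days` column name is an assumption.

```python
import pandas as pd

# launched/deadline values of rows 0 and 1 from df_ks.head()
toy = pd.DataFrame({
    "launched": ["2015-08-11 12:12:28", "2017-09-02 04:43:57"],
    "deadline": ["2015-10-09", "2017-11-01"],
})
toy["launched"] = pd.to_datetime(toy["launched"])
toy["deadline"] = pd.to_datetime(toy["deadline"])

# Whole-day difference between the deadline date and the launch date
# (normalize() sets the launch time of day to midnight)
toy["campaign_days"] = (toy["deadline"] - toy["launched"].dt.normalize()).dt.days
print(toy["campaign_days"].tolist())
```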

In [20]:
onehot = pd.get_dummies(df_ks['state'])
df_ks = onehot.join(df_ks)
df_ks.shape

(370451, 25)

In [21]:
onehot = pd.get_dummies(df_ks['currency'])
df_ks = onehot.join(df_ks)
df_ks.shape

(370451, 39)
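
The two cells above join unprefixed dummy columns in front of the original frame. As a hedged alternative, `pd.get_dummies` can encode a column in place and prefix the new names, which helps avoid clashes between, say, a currency code and another column. The toy frame and the `cur` prefix are assumptions for illustration.

```python
import pandas as pd

# Toy stand-in for df_ks; the rows are invented for illustration only
toy = pd.DataFrame({
    "ID": [1, 2, 3],
    "currency": ["GBP", "USD", "USD"],
})

# Replace 'currency' with one prefixed 0/1 column per currency code
encoded = pd.get_dummies(toy, columns=["currency"], prefix="cur", dtype=int)
print(encoded)
```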

In [22]:
df_ks.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 370451 entries, 0 to 378660
Data columns (total 39 columns):
AUD                    370451 non-null uint8
CAD                    370451 non-null uint8
CHF                    370451 non-null uint8
DKK                    370451 non-null uint8
EUR                    370451 non-null uint8
GBP                    370451 non-null uint8
HKD                    370451 non-null uint8
JPY                    370451 non-null uint8
MXN                    370451 non-null uint8
NOK                    370451 non-null uint8
NZD                    370451 non-null uint8
SEK                    370451 non-null uint8
SGD                    370451 non-null uint8
USD                    370451 non-null uint8
canceled               370451 non-null uint8
failed                 370451 non-null uint8
successful             370451 non-null uint8
ID                     370451 non-null int64
name                   370451 non-null object
category               370451 non

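#### Creating model and Testing Accuracy

1. RandomForestClassifier - 65%
2. XGBClassifier - 67%

The XGBClassifier score is computed in the final cell as exp(-log loss) on the test split, which is not the same quantity as the share of correct predictions.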

In [39]:
import xgboost as xgb
from sklearn import ensemble, metrics, model_selection, multiclass
from sklearn.preprocessing import StandardScaler

# Currency dummies plus numeric features
col = ['AUD','CAD','CHF','DKK','EUR','GBP','HKD','JPY','MXN','NOK','NZD','SEK','SGD','USD','backers','usd_pledged_real','usd_goal_real','time_campaign']
# One-hot encoded state columns used as the target
target = ['canceled','failed','successful']
X_train, X_test, y_train, y_test = model_selection.train_test_split(
    df_ks[col].fillna(-1), df_ks[target],
    random_state=1, stratify=df_ks[target], test_size=0.25)

# Standardize the features; the scaler is fitted on the training split only
sc_X = StandardScaler()
X_train = sc_X.fit_transform(X_train)
X_test = sc_X.transform(X_test)
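
The feature list above includes `backers` and `usd_pledged_real`, which are only known once a campaign has run. If the aim is to predict the outcome before a project is released, as the earlier heading suggests, one option might be to restrict the features to columns available at launch. A minimal sketch under that assumption; the toy rows and the `post_launch_cols` / `launch_time_cols` names are invented for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for df_ks; all values are invented for illustration
toy = pd.DataFrame({
    "usd_goal_real": [1000.0, 5000.0, 250.0, 30000.0, 1500.0, 800.0, 12000.0, 400.0],
    "time_campaign": [1, 0, 1, 1, 0, 1, 1, 0],
    "backers": [0, 40, 12, 15, 3, 25, 1, 9],
    "usd_pledged_real": [0.0, 5200.0, 300.0, 2421.0, 220.0, 900.0, 1.0, 450.0],
    "successful": [0, 1, 1, 0, 0, 1, 0, 1],
})

# Columns only known after the campaign has run (assumed to leak the outcome)
post_launch_cols = ["backers", "usd_pledged_real"]
launch_time_cols = [c for c in toy.columns
                    if c not in post_launch_cols + ["successful"]]

X_train, X_test, y_train, y_test = train_test_split(
    toy[launch_time_cols], toy["successful"],
    test_size=0.25, random_state=1, stratify=toy["successful"],
)
print(launch_time_cols)
```

A model trained this way would likely score lower than one that sees the pledge totals, but its score would reflect what can be known before launch.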

In [36]:
#model = multiclass.OneVsRestClassifier(ensemble.RandomForestClassifier(max_depth = 7, n_estimators=1000, random_state=33))
#model = multiclass.OneVsRestClassifier(ensemble.ExtraTreesClassifier(n_jobs=-1, n_estimators=100, random_state=33))

param_dist = {'objective': 'binary:logistic', 'max_depth': 1, 'n_estimators':1000, 'num_round':1000, 'eval_metric': 'logloss'}
model = multiclass.OneVsRestClassifier(xgb.XGBClassifier(**param_dist))

model.fit(X_train, y_train)

OneVsRestClassifier(estimator=XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
       colsample_bytree=1, eval_metric='logloss', gamma=0,
       learning_rate=0.1, max_delta_step=0, max_depth=1,
       min_child_weight=1, missing=None, n_estimators=1000, n_jobs=1,
       nthread=None, num_round=1000, objective='binary:logistic',
       random_state=0, reg_alpha=0, reg_lambda=1, scale_pos_weight=1,
       seed=None, silent=True, subsample=1),
          n_jobs=1)

In [37]:
import math
import pickle

# Save the model to disk
filename = 'XGBClassifier.sav'
with open(filename, 'wb') as f:
    pickle.dump(model, f)

In [38]:
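# some time later...

# Load the model from disk
with open(filename, 'rb') as f:
    loaded_model = pickle.load(f)

# Note: exp(-log loss) is roughly the geometric mean of the probabilities
# assigned to the true class, not the fraction of correct predictions
print('Accuracy:',(math.exp(-metrics.log_loss(y_test, loaded_model.predict_proba(X_test)))))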

Accuracy: 0.6700048466191929
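
The value printed above is exp(-log loss), which differs from classification accuracy. The sketch below contrasts the two on invented labels and probabilities, using `metrics.log_loss` and `metrics.accuracy_score` from scikit-learn; none of the numbers come from the Kickstarter data.

```python
import math

import numpy as np
from sklearn import metrics

# Invented labels and predicted probabilities for a 3-class problem
y_true = np.array([0, 1, 2, 1])
proba = np.array([
    [0.7, 0.2, 0.1],
    [0.2, 0.6, 0.2],
    [0.1, 0.3, 0.6],
    [0.5, 0.4, 0.1],  # misclassified: highest probability is on class 0
])

# Geometric mean of the probabilities assigned to the true classes
geo_mean_prob = math.exp(-metrics.log_loss(y_true, proba, labels=[0, 1, 2]))

# Fraction of rows whose most probable class matches the true class
accuracy = metrics.accuracy_score(y_true, proba.argmax(axis=1))

print(f"exp(-log loss): {geo_mean_prob:.3f}")
print(f"accuracy:       {accuracy:.3f}")
```

Reporting both numbers side by side would make it clearer how well the saved model classifies projects, as opposed to how confident its probabilities are.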
