#### Kickstarter - Crowdfunding and Success

Kickstarter is an American public-benefit corporation based in Brooklyn, New York, that maintains a global crowdfunding platform focused on creativity. The company's stated mission is to "help bring creative projects to life". Kickstarter has reportedly received more than $1.9 billion in pledges from 9.4 million backers to fund 257,000 creative projects, such as films, music, stage shows, comics, journalism, video games, technology and food-related projects.
People who back Kickstarter projects are offered tangible rewards or experiences in exchange for their pledges. This model traces its roots to the subscription model of arts patronage, where artists would go directly to their audiences to fund their work.

**Project Owner's Perspective:**

1. What is the optimal range for my project's funding goal?
2. On which day of the week should I post the project on Kickstarter?
3. How many keywords should I use in my project title?
4. How long should my project description be?

**Kickstarter's Perspective:** A large amount of manual effort is required to screen a project before it is approved to be hosted on the platform. Kickstarter therefore benefits from knowing the key ingredients for a project to be successful.

### Why not build a model to predict whether a project will be successful before it is released?


**List of possible predicting factors:**

- **Total amount to be raised** - A larger goal may decrease the chances that the project will be successful.
- **Total duration of the project** - Projects that are active for very short or very long time periods may be less likely to succeed.
- **Theme of the project** - People may consider donating to a project that has a good cause or a good theme.
- **Writing style of the project description** - If the message is not clear, the project may not get fully funded.
- **Length of the project description** - Very long pieces of text may not perform as well as shorter, crisper texts.
- **Project launch time** - A project launched on a weekday, as compared to a weekend or holiday, may not reach its full funding amount.
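
Several of these factors can be derived directly from the dataset columns. The following is a minimal sketch, assuming pandas and using two rows copied from the dataset preview; the derived column names (`title_words`, `launch_weekday`, `duration_days`) are illustrative and not part of the dataset.

```python
import pandas as pd

# Two rows taken from the dataset preview; only the columns needed here.
projects = pd.DataFrame({
    "name": ["Where is Hank?",
             "ToshiCapital Rekordz Needs Help to Complete Album"],
    "launched": pd.to_datetime(["2013-01-12 00:20:50", "2012-03-17 03:24:11"]),
    "deadline": pd.to_datetime(["2013-02-26", "2012-04-16"]),
    "usd_goal_real": [45000.0, 5000.0],
})

# Number of whitespace-separated words in the project title.
projects["title_words"] = projects["name"].str.split().str.len()

# Day of the week on which the campaign was launched (0 = Monday ... 6 = Sunday).
projects["launch_weekday"] = projects["launched"].dt.dayofweek

# Campaign duration in whole days, ignoring the launch time of day.
projects["duration_days"] = (
    projects["deadline"] - projects["launched"].dt.normalize()
).dt.days

print(projects[["title_words", "launch_weekday", "duration_days"]])
```

Grouping the full dataset by any of these derived columns and averaging the outcome would give a first look at the hypotheses above before a model is trained.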


### Given Dataset

#### Independent-

#### ID:
- ID                  378661 non-null int64 -- Unique project Id

#### Text:
- name                378657 non-null object -- Project name
- category            378661 non-null object -- Type of industry, e.g. Restaurant, Food, Poetry
- main_category       378661 non-null object -- main campaign category or idea - Food/Music/Video

#### Date:
- deadline            378661 non-null object -- Crowdfunding deadline
- launched            378661 non-null object -- date launched

#### Categorical: Nominal
- currency            378661 non-null object -- Type of Currency
- country             378661 non-null object -- Country

#### Numerical:
- goal                378661 non-null float64 -- Goal - The amount of money creator needs to complete the project
- pledged             378661 non-null float64 -- amount pledged by crowd
- backers             378661 non-null int64 -- number of supporters
- usd pledged         374864 non-null float64 -- Pledged amount in USD (conversion made by KS) 
- usd_pledged_real    378661 non-null float64 -- Pledged amount in USD (conversion made by fixer.io api)
- usd_goal_real       378661 non-null float64 -- Goal amount in USD 

##### Dependent- Nominal
- state               378661 non-null object -- Project status - successful, failed, canceled, undefined, etc.

In [1]:
# Load the libraries
import sys
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

np.set_printoptions(threshold=sys.maxsize)

In [2]:
# Load the data, parsing the launch and deadline columns as dates
df_ks = pd.read_csv("input/ks-projects-201801.csv" , parse_dates = ["launched", "deadline"])

In [3]:
print ("Total Projects: ", df_ks.shape[0], "\nTotal Features: ", df_ks.shape[1])
df_ks.head()

Total Projects:  378661 
Total Features:  15


| | ID | name | category | main_category | currency | deadline | goal | launched | pledged | state | backers | country | usd pledged | usd_pledged_real | usd_goal_real |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1000002330 | The Songs of Adelaide & Abullah | Poetry | Publishing | GBP | 2015-10-09 | 1000.0 | 2015-08-11 12:12:28 | 0.0 | failed | 0 | GB | 0.0 | 0.0 | 1533.95 |
| 1 | 1000003930 | Greeting From Earth: ZGAC Arts Capsule For ET | Narrative Film | Film & Video | USD | 2017-11-01 | 30000.0 | 2017-09-02 04:43:57 | 2421.0 | failed | 15 | US | 100.0 | 2421.0 | 30000.0 |
| 2 | 1000004038 | Where is Hank? | Narrative Film | Film & Video | USD | 2013-02-26 | 45000.0 | 2013-01-12 00:20:50 | 220.0 | failed | 3 | US | 220.0 | 220.0 | 45000.0 |
| 3 | 1000007540 | ToshiCapital Rekordz Needs Help to Complete Album | Music | Music | USD | 2012-04-16 | 5000.0 | 2012-03-17 03:24:11 | 1.0 | failed | 1 | US | 1.0 | 1.0 | 5000.0 |
| 4 | 1000011046 | Community Film Project: The Art of Neighborhoo... | Film & Video | Film & Video | USD | 2015-08-29 | 19500.0 | 2015-07-04 08:35:03 | 1283.0 | canceled | 14 | US | 1283.0 | 1283.0 | 19500.0 |


#### Data Clean Up
1. Verify Individual distinct column values
2. Get rid of unwanted columns (active stage columns)  
3. Remove Duplicates 
4. Handle Missing Values 
 

- Feature Engineering (driven from our hypothesis generation) 
- Encode the Categorical Features
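
The clean-up steps above can be sketched on a small frame. This is a minimal illustration, assuming pandas; the rows are hypothetical and only the column names follow the dataset.

```python
import pandas as pd

# Hypothetical toy rows; column names follow the dataset.
raw = pd.DataFrame({
    "name": ["A", "A", None, "B", "C"],
    "state": ["failed", "failed", "successful", "live", "successful"],
    "usd_goal_real": [1000.0, 1000.0, 500.0, 2000.0, 300.0],
})

# Step 1: inspect the number of distinct values per column.
print(raw.nunique())

# Steps 3 and 4: remove exact duplicate rows, then rows with missing values.
clean = raw.drop_duplicates().dropna()

# Step 2 (active-stage data): keep only projects with a final outcome.
clean = clean[clean["state"].isin(["failed", "successful"])].reset_index(drop=True)
print(clean)
```

The order of the steps matters little here, but filtering on `state` last keeps the earlier counts comparable with the raw data.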

In [4]:
# Print the number of unique values per column
#print(df_ks.nunique())

df_ks = df_ks.dropna()  # Drop the rows where at least one value is missing
#df_ks = df_ks[df_ks["currency"] == "USD"]  # Keep USD currency rows only

# Treat canceled projects as failed, then keep only the two final states
df_ks["state"] = df_ks["state"].replace("canceled", "failed")
df_ks = df_ks[df_ks["state"].isin(["failed", "successful"])]

# Drop the other columns that are not needed.
# NOTE: later cells still reference some of these (currency, backers, usd_pledged_real).
df_ks = df_ks.drop(["backers", "ID", "currency", "country", "pledged", "usd pledged", "usd_pledged_real", "goal"], axis = 1)

# Drop the rows launched in 1970 (all of them failed).
# `launched` was parsed as a datetime on load, so compare the year instead of using .str
df_ks = df_ks[df_ks["launched"].dt.year != 1970]

ID                  378661
name                375764
category               159
main_category           15
currency                14
deadline              3164
goal                  8353
launched            378089
pledged              62130
state                    6
backers               3963
country                 23
usd pledged          95455
usd_pledged_real    106065
usd_goal_real        50339
dtype: int64


In [7]:
#Counting null in each column
#df_ks.isnull().sum()

#Printing nan rows
#nans = lambda df: df[df.isnull().any(axis=1)]
#nans(df_ks)

In [8]:
#df_ks['usd pledged'].describe()

In [None]:
#df_ks = df_ks.drop(['usd pledged'], axis=1)

In [None]:
#df_ks = df_ks.dropna(axis=0) ## Removing Rows - NaNs with no project Name

In [None]:
# Remove live, undefined and suspended rows, as they have little to no impact on the dependent variable

#df_ks.drop(df_ks.index[df_ks['state'] == 'live'], inplace = True)
#df_ks.drop(df_ks.index[df_ks['state'] == 'undefined'], inplace = True)
#df_ks.drop(df_ks.index[df_ks['state'] == 'suspended'], inplace = True)
#df_ks.drop(df_ks.index[df_ks['state'] == 'canceled'], inplace = True)

In [None]:
#Print rows with canceled state
#df_ks[df_ks['state'].str.contains("canceled")]

#Replacing canceled with failed state



In [None]:
#df_ks['currency'].unique()
#df_ks['country'].value_counts()
#replace function to fix "N,0" Countries
'''def replace(s, r1, r2):
    if("N,0\"" in str(s)):
        s = str(s).replace(r2,r1)
    return s

df_ks['country'] = df_ks.apply(lambda r: replace(r['country'], r['currency'], "N,0\""), axis=1)
#replace function
df_ks.replace({'country': "GBP"}, "GB", inplace = True)
df_ks.replace({'country': "CAD"}, "CA", inplace = True)
df_ks.replace({'country': "NOK"}, "NO", inplace = True)
df_ks.replace({'country': "DKK"}, "DK", inplace = True)
df_ks.replace({'country': "SEK"}, "SE", inplace = True)
df_ks.replace({'country': "AUD"}, "AU", inplace = True)
df_ks.replace({'country': "USD"}, "US", inplace = True)
df_ks.replace({'country': "EUR"}, "DE", inplace = True) ##there are some other countries with the same currency'''

In [None]:
#df_ks[df_ks['launched'].str.contains("1970")]

In [None]:
#df_ks['launched'] = pd.DatetimeIndex(df_ks.launched).normalize() #Removing time from Launched

In [None]:
#df_ks['launched'] = pd.to_datetime(df_ks['launched'])
#df_ks['deadline'] = pd.to_datetime(df_ks['deadline'])
#df_ks.groupby(df_ks["deadline"].dt.year).count()

In [None]:
# Calculate duration between features
#df_ks['totaltime'] = [delta.days for delta in (df_ks['deadline'] - df_ks['launched'])]

#### Distributions - Outliers and Skew

In [None]:
#df_ks.info()

In [None]:
#test = df_ks.drop(['ID','pledged','goal'], axis=1)

In [None]:
#test.head()

In [None]:
#df_ks['backers'].describe()
#test.plot.kde()
#df_ks.groupby('state').hist()
#sns.scatterplot(x="usd_pledged_real", y="usd_goal_real", s=150,hue='state' , data=test)

In [10]:
def ret_percentage(column):
    return round(column.value_counts(normalize=True) * 100,2)

print(ret_percentage(df_ks['currency']))

USD    78.00
GBP     9.01
EUR     4.60
CAD     3.95
AUD     2.10
SEK     0.47
MXN     0.46
NZD     0.39
DKK     0.30
CHF     0.20
NOK     0.19
HKD     0.16
SGD     0.15
JPY     0.01
Name: currency, dtype: float64


In [None]:
# Keep only the projects with more than one backer
test = df_ks[df_ks['backers'] > 1]
#print(ret_percentage(test['backers']))
#df_ks.sort_values(by=['backers'], ascending=True)
#df_ks.index[df_ks['backers'].str.contains("1970")]

#### Data looks biased towards the 'failed' class; needs rechecking

In [None]:
df_ks['usd_goal_real'].describe()

In [None]:
percentage_dist = round(df_ks['state'].value_counts(normalize=True) * 100,2)
print(percentage_dist)

In [None]:
sns.set(style='darkgrid')
sns.countplot(y = 'state',
              data = df_ks,
              order = percentage_dist.index)
plt.show()

#### Remove live, undefined and suspended rows, as they have little to no impact on the dependent variable

#### Additional cleanup

- `usd pledged` - Currency column generated by a Kaggle user. Dropping it because it holds most of the NaN values; `usd_pledged_real` can be used instead.
- Remove rows with no project name
- Check for duplicates

In [None]:
# Select duplicate rows except first occurrence based on all columns
duplicateRowsDF = df_ks[df_ks.duplicated()]
 
print("Duplicate Rows except first occurrence based on all columns are :")
print(duplicateRowsDF)

#### Feature Extraction: 

- `launched` and `deadline` are strings. Convert them to timestamps to create the project duration feature plus additional date features.
- One-hot encode categorical values
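
The two encodings used in the cells that follow behave differently, which a small example makes concrete. This is a minimal sketch, assuming pandas; the currency values are hypothetical, and `cat.codes` is used here only as a pandas stand-in for a label encoder.

```python
import pandas as pd

# Hypothetical currency values for four projects.
currency = pd.Series(["USD", "GBP", "USD", "EUR"], name="currency")

# One-hot encoding: one 0/1 indicator column per category,
# with the columns sorted alphabetically (EUR, GBP, USD).
onehot = pd.get_dummies(currency, dtype=int)

# Label encoding: a single integer code per category,
# assigned in sorted order (EUR=0, GBP=1, USD=2).
codes = currency.astype("category").cat.codes

print(onehot)
print(codes.tolist())
```

One-hot columns avoid implying an order between categories, at the cost of one column per category. Integer codes are compact but impose an arbitrary ordering, which matters more for linear models than for tree-based ones.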

In [None]:
df_ks['launched'] = pd.to_datetime(df_ks['launched'])
df_ks['laun_month_year'] = df_ks['launched'].dt.to_period("M")
df_ks['laun_day_month_year'] = df_ks['launched'].dt.to_period("D")
df_ks['laun_year'] = df_ks['launched'].dt.to_period("Y")  # "Y" is the non-deprecated alias for yearly periods
df_ks['laun_hour'] = df_ks['launched'].dt.hour

df_ks['deadline'] = pd.to_datetime(df_ks['deadline'])
df_ks['dead_month_year'] = df_ks['deadline'].dt.to_period("M")
df_ks['dead_day_month_year'] = df_ks['deadline'].dt.to_period("D")
df_ks['dead_year'] = df_ks['deadline'].dt.to_period("Y")  # "Y" is the non-deprecated alias for yearly periods


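# Create a new column with the campaign length in whole months (a month approximated as 30.44 days).
# The timestamps are subtracted rather than the Period columns: in recent pandas a difference of
# Periods is an offset object and cannot be divided by np.timedelta64(1, 'M').
campaign_days = (df_ks['deadline'].dt.normalize() - df_ks['launched'].dt.normalize()).dt.days
df_ks['time_campaign'] = (campaign_days / 30.44).astype(int)
#df_ks['time_campaign'] = df_ks['dead_month_year'] - df_ks['laun_month_year']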

In [None]:
onehot = pd.get_dummies(df_ks['state'])
df_ks = onehot.join(df_ks)
df_ks.shape

In [None]:
from sklearn.preprocessing import LabelEncoder 
labelencoder_X = LabelEncoder() 
df_ks['state'] = labelencoder_X.fit_transform(df_ks['state'])

In [None]:
onehot = pd.get_dummies(df_ks['currency'])
df_ks = onehot.join(df_ks)
df_ks.shape

In [None]:
df_ks.head()

#### Creating model and Testing Accuracy

1. RandomForestClassifier - 65%
2. XGBClassifier - 67%
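
The final cell scores the model with `math.exp(-metrics.log_loss(...))`, which is related to, but not the same as, classification accuracy. The following sketch shows the difference in plain Python on hypothetical binary predictions; the labels and probabilities are made up for illustration.

```python
import math

# Hypothetical true labels (1 = successful) and predicted success probabilities.
y_true = [1, 0, 1, 0]
p_success = [0.9, 0.2, 0.4, 0.3]

# Accuracy: fraction of predictions that match the label at a 0.5 threshold.
predictions = [1 if p >= 0.5 else 0 for p in p_success]
accuracy = sum(int(p == t) for p, t in zip(predictions, y_true)) / len(y_true)

# Log loss: mean negative log of the probability assigned to the true label.
p_true = [p if t == 1 else 1 - p for p, t in zip(p_success, y_true)]
log_loss = -sum(math.log(p) for p in p_true) / len(p_true)

# exp(-log loss) is the geometric mean of the probabilities given to the true labels.
score = math.exp(-log_loss)

print("accuracy:", accuracy)
print("exp(-log loss):", round(score, 3))
```

Because the second score rewards confident, correct probabilities rather than counting correct labels, the two numbers generally differ and should not be compared directly.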

In [None]:
from sklearn import ensemble, metrics, model_selection, multiclass
from sklearn.preprocessing import StandardScaler
import xgboost as xgb

# Feature columns: one-hot currency indicators plus the numeric features
col = ['AUD','CAD','CHF','DKK','EUR','GBP','HKD','JPY','MXN','NOK','NZD','SEK','SGD','USD','backers','usd_pledged_real','usd_goal_real','time_campaign']
# One-hot encoded state columns, used as the target and for stratification
target = df_ks[['canceled','failed','successful']]
X_train, X_test, y_train, y_test = model_selection.train_test_split(
    df_ks[col], target, random_state=1, stratify=target, test_size=0.25)

# Standardize the features: fit the scaler on the training split only,
# then apply the same transformation to the test split
sc_X = StandardScaler()
X_train = sc_X.fit_transform(X_train)
X_test = sc_X.transform(X_test)
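
What the scaler does to a single feature can be sketched without scikit-learn. This is a minimal illustration on hypothetical values, assuming the usual standardization `(x - mean) / std` with the population standard deviation, which is what `StandardScaler` computes.

```python
import statistics

# Hypothetical values of one feature in the training and test splits.
train = [10.0, 20.0, 30.0]
test = [40.0]

# Statistics are estimated on the training split only.
mean = statistics.mean(train)
std = statistics.pstdev(train)  # population standard deviation (ddof = 0)

# The same mean and std are applied to both splits.
train_scaled = [(x - mean) / std for x in train]
test_scaled = [(x - mean) / std for x in test]

print(train_scaled)
print(test_scaled)
```

Reusing the training statistics on the test split keeps information about the test data out of the fitted transformation.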

In [None]:
#model = multiclass.OneVsRestClassifier(ensemble.RandomForestClassifier(max_depth = 7, n_estimators=1000, random_state=33))
#model = multiclass.OneVsRestClassifier(ensemble.ExtraTreesClassifier(n_jobs=-1, n_estimators=100, random_state=33))

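# n_estimators sets the number of boosting rounds in the scikit-learn wrapper,
# so the separate 'num_round' entry is dropped
param_dist = {'objective': 'binary:logistic', 'max_depth': 1, 'n_estimators': 1000, 'eval_metric': 'logloss'}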
model = multiclass.OneVsRestClassifier(xgb.XGBClassifier(**param_dist))

model.fit(X_train, y_train)

In [None]:
import math
import pickle
# Save the model to disk; the with-block closes the file handle afterwards
filename = 'XGBClassifierLLabel.sav'
with open(filename, 'wb') as f:
    pickle.dump(model, f)

In [None]:
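# some time later...

# Load the model from disk
with open(filename, 'rb') as f:
    loaded_model = pickle.load(f)

# Note: exp(-log_loss) is the geometric mean of the probabilities assigned to the
# true classes, not the fraction of correctly classified projects
print('exp(-log_loss) score:', math.exp(-metrics.log_loss(y_test, loaded_model.predict_proba(X_test))))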