#### Kickstarter - Crowdfunding and Success

Kickstarter is an American public-benefit corporation based in Brooklyn, New York, that maintains a global crowdfunding platform focused on creativity. The company's stated mission is to "help bring creative projects to life". Kickstarter has reportedly received more than $1.9 billion in pledges from 9.4 million backers to fund 257,000 creative projects, such as films, music, stage shows, comics, journalism, video games, technology and food-related projects.
People who back Kickstarter projects are offered tangible rewards or experiences in exchange for their pledges. This model traces its roots to the subscription model of arts patronage, in which artists went directly to their audiences to fund their work.

**Project Owner's Perspective:**

1. What is the optimal range for my project's funding goal?
2. On which day of the week should I post the project on Kickstarter?
3. How many keywords should I use in my project title?
4. How long should my project description be?

**Kickstarter's Perspective:** A large amount of manual effort is required to screen each project before it is approved for hosting on the platform. Identifying the key ingredients of a successful project could reduce that effort.

### Why not build a model to predict whether a project will be successful before it is launched?


**List of possible predicting factors:**

- **Total amount to be raised** - A larger goal may decrease the chances that the project will be successful.
- **Total duration of the project** - Projects that are active for very short or very long periods may be less likely to succeed.
- **Theme of the project** - People may consider donating to a project which has a good cause or a good theme.
- **Writing style of the project description** - If the message is not clear, the project may not be fully funded.
- **Length of the project description** - Very long pieces of text may not perform as well as shorter, crisper texts.
- **Project launch time** - A project launched on a weekday, as opposed to a weekend or holiday, may not reach its full funding amount.
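
Hypotheses like these can be sanity-checked by comparing success rates across buckets of a candidate factor. The sketch below is only an illustration: the rows in `toy` and the bucket edges are made-up assumptions, not values from the Kickstarter data.

```python
import pandas as pd

# Hypothetical toy rows; not taken from the Kickstarter dataset.
toy = pd.DataFrame({
    "usd_goal_real": [500, 800, 3000, 4500, 20000, 60000],
    "launched": pd.to_datetime([
        "2015-08-10", "2015-08-15", "2016-01-05",
        "2016-01-09", "2017-03-01", "2017-03-05",
    ]),
    "state": ["successful", "successful", "failed",
              "successful", "failed", "failed"],
})

toy["success"] = (toy["state"] == "successful").astype(int)

# Bucket the goal into illustrative ranges (the edges are assumptions).
toy["goal_bucket"] = pd.cut(
    toy["usd_goal_real"],
    bins=[0, 1000, 10000, float("inf")],
    labels=["<=1k", "1k-10k", ">10k"],
)

# Monday=0 ... Sunday=6, so values above 4 fall on a weekend.
toy["is_weekend"] = (toy["launched"].dt.weekday > 4).astype(int)

# Share of successful projects per bucket.
rate_by_goal = toy.groupby("goal_bucket", observed=False)["success"].mean()
rate_by_weekend = toy.groupby("is_weekend")["success"].mean()
print(rate_by_goal)
print(rate_by_weekend)
```

On real data the same two `groupby` calls would show whether success rates actually differ between goal ranges or between weekday and weekend launches.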


### Given Dataset

#### Independent-

#### ID:
- ID                  378661 non-null int64 -- Unique project Id

#### Text:
- name                378657 non-null object -- Project name
- category            378661 non-null object -- Type of industry, e.g. Restaurant, Food, Poetry
- main_category       378661 non-null object -- main campaign category or idea - Food/Music/Video

#### Date:
- deadline            378661 non-null object -- Crowdfunding deadline
- launched            378661 non-null object -- date launched

#### Categorical: Nominal
- currency            378661 non-null object -- Type of Currency
- country             378661 non-null object -- Country

#### Numerical:
- goal                378661 non-null float64 -- Goal - The amount of money creator needs to complete the project
- pledged             378661 non-null float64 -- amount pledged by crowd
- backers             378661 non-null int64 -- number of supporters
- usd pledged         374864 non-null float64 -- Pledged amount in USD (conversion made by KS) 
- usd_pledged_real    378661 non-null float64 -- Pledged amount in USD (conversion made by fixer.io api)
- usd_goal_real       378661 non-null float64 -- Goal amount in USD 

##### Dependent- Nominal
- state               378661 non-null object -- Project status - successful, failed, canceled, undefined, etc.

In [3]:
# Load the libraries
import sys
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')
np.set_printoptions(threshold=sys.maxsize)

In [4]:
# Load the data, parsing the two date columns as datetimes
df_ks = pd.read_csv("input/ks-projects-201801.csv" , parse_dates = ["launched", "deadline"])

In [5]:
print ("Total Projects: ", df_ks.shape[0], "\nTotal Features: ", df_ks.shape[1])
df_ks.head()

Total Projects:  378661 
Total Features:  15


Unnamed: 0,ID,name,category,main_category,currency,deadline,goal,launched,pledged,state,backers,country,usd pledged,usd_pledged_real,usd_goal_real
0,1000002330,The Songs of Adelaide & Abullah,Poetry,Publishing,GBP,2015-10-09,1000.0,2015-08-11 12:12:28,0.0,failed,0,GB,0.0,0.0,1533.95
1,1000003930,Greeting From Earth: ZGAC Arts Capsule For ET,Narrative Film,Film & Video,USD,2017-11-01,30000.0,2017-09-02 04:43:57,2421.0,failed,15,US,100.0,2421.0,30000.0
2,1000004038,Where is Hank?,Narrative Film,Film & Video,USD,2013-02-26,45000.0,2013-01-12 00:20:50,220.0,failed,3,US,220.0,220.0,45000.0
3,1000007540,ToshiCapital Rekordz Needs Help to Complete Album,Music,Music,USD,2012-04-16,5000.0,2012-03-17 03:24:11,1.0,failed,1,US,1.0,1.0,5000.0
4,1000011046,Community Film Project: The Art of Neighborhoo...,Film & Video,Film & Video,USD,2015-08-29,19500.0,2015-07-04 08:35:03,1283.0,canceled,14,US,1283.0,1283.0,19500.0


#### Data Clean Up
1. Verify Individual distinct column values
2. Get rid of unwanted columns (active stage columns)  
3. Remove Duplicates 
4. Handle Missing Values 
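
As a rough illustration of these steps, the sketch below runs the corresponding pandas calls on a small made-up frame. `toy` is a hypothetical stand-in for `df_ks`; its rows are assumptions chosen so that each step has something to find.

```python
import pandas as pd

# Hypothetical toy frame standing in for df_ks.
toy = pd.DataFrame({
    "name": ["A", "B", "B", None],
    "state": ["failed", "successful", "successful", "canceled"],
    "usd_goal_real": [1000.0, 500.0, 500.0, 250.0],
})

distinct_counts = toy.nunique()              # step 1: distinct values per column
n_duplicates = int(toy.duplicated().sum())   # step 3: exact duplicate rows
missing_counts = toy.isnull().sum()          # step 4: missing values per column

# Drop rows with missing values, then drop exact duplicates.
cleaned = toy.dropna().drop_duplicates()
print(distinct_counts, missing_counts, n_duplicates, cleaned.shape, sep="\n")
```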
 

In [6]:
# Printing unique values in our dataset
#print(df_ks.nunique())

df_ks = df_ks.dropna() ## Drop the rows where at least one element is missing.
#df_ks = df_ks[df_ks["currency"] == "USD"] # Keep USD currency rows only
df_ks.replace({'state': "canceled"}, "failed", inplace = True)
df_ks = df_ks[df_ks["state"].isin(["failed", "successful"])] ## State - Successful and Failed
##Drop other not needed columns
df_ks = df_ks.drop(["backers", "ID", "currency", "country", "pledged", "usd pledged", "usd_pledged_real", "goal"], axis = 1)
## Optionally drop rows with a 1970 launch date (all of them failed).
## "launched" is parsed as a datetime, so filter on .dt.year rather than .str
#df_ks = df_ks[df_ks["launched"].dt.year != 1970]

#### Distributions - Outliers and Skew

In [7]:
def ret_percentage(column):
    return round(column.value_counts(normalize=True) * 100,2)

print(ret_percentage(df_ks['state']))

failed        63.85
successful    36.15
Name: state, dtype: float64


In [8]:
# Select duplicate rows except first occurrence based on all columns
duplicateRowsDF = df_ks[df_ks.duplicated()]
 
print("Duplicate Rows except first occurrence based on all columns are :")
print(duplicateRowsDF)

Duplicate Rows except first occurrence based on all columns are :
Empty DataFrame
Columns: [name, category, main_category, deadline, launched, state, usd_goal_real]
Index: []


In [11]:
# null check
#df_ks.isnull().sum()
df_ks.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 370219 entries, 0 to 378660
Data columns (total 7 columns):
name             370219 non-null object
category         370219 non-null object
main_category    370219 non-null object
deadline         370219 non-null datetime64[ns]
launched         370219 non-null datetime64[ns]
state            370219 non-null object
usd_goal_real    370219 non-null float64
dtypes: datetime64[ns](2), float64(1), object(4)
memory usage: 22.6+ MB


#### Feature Extraction: 
- One-hot encode the main category
- Feature engineering (derived from our hypothesis generation)
- Label-encode the categorical features
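
The two encodings differ in what they imply about the categories. The sketch below shows both on a made-up column, using pandas only; `cat.codes` is used here as a stand-in for scikit-learn's `LabelEncoder`, which also assigns integers in sorted order.

```python
import pandas as pd

# Hypothetical toy column; the values are illustrative.
toy = pd.DataFrame({"main_category": ["Music", "Food", "Music", "Art"]})

# One-hot: one 0/1 column per category, with no implied order.
one_hot = pd.get_dummies(toy["main_category"]).astype(int)

# Label encoding: one integer per category, assigned alphabetically.
# The integers carry an arbitrary order, which tree-based models
# tend to tolerate better than linear models do.
codes = toy["main_category"].astype("category").cat.codes

print(one_hot)
print(codes.tolist())
```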

In [12]:
from sklearn.preprocessing import LabelEncoder 
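def syllable_count(word):
    """Rough syllable estimate: count groups of consecutive vowels in the string."""
    word = word.lower()
    vowels = "aeiouy"
    if not word:  # guard against empty strings, which would break word[0]
        return 0
    count = 0
    if word[0] in vowels:
        count += 1
    for index in range(1, len(word)):
        # a vowel that follows a non-vowel starts a new vowel group
        if word[index] in vowels and word[index - 1] not in vowels:
            count += 1
    if word.endswith("e"):  # treat a trailing "e" as silent
        count -= 1
    if count == 0:
        count += 1
    return count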


## feature engineering

def features1(projects):
    projects = projects.copy()  # work on a copy so the caller's frame is not modified
    projects["syllable_count"]   = projects["name"].apply(syllable_count)
    projects["launched_month"]   = projects["launched"].dt.month
    # .dt.week was removed in pandas 2.0; isocalendar().week is its replacement
    projects["launched_week"]    = projects["launched"].dt.isocalendar().week.astype(int)
    projects["launched_day"]     = projects["launched"].dt.weekday  # Monday=0 ... Sunday=6
    projects["is_weekend"]       = projects["launched_day"].apply(lambda x: 1 if x > 4 else 0)
    projects["num_words"]        = projects["name"].apply(lambda x: len(x.split()))
    projects["num_chars"]        = projects["name"].apply(lambda x: len(x.replace(" ","")))
    projects["state"]            = projects["state"].apply(lambda x: 1 if x=="successful" else 0)
    # campaign length in whole days
    projects["duration"]         = (projects["deadline"] - projects["launched"]).dt.days
    ## one-hot encode the main category
    projects = pd.concat([projects, pd.get_dummies(projects["main_category"])], axis = 1)
    ## label-encode the categorical features
    le = LabelEncoder()
    for c in ["category", "main_category"]:
        projects[c] = le.fit_transform(projects[c])

    ## Generate aggregate features related to Category and Main Category.
    ## NOTE: "sum" adds up the label-encoded ids within each group, so the
    ## *_count columns are not row counts; use "count" if row counts are intended.
    t2 = projects.groupby("main_category").agg({"usd_goal_real" : "mean", "category" : "sum"})
    t1 = projects.groupby("category").agg({"usd_goal_real" : "mean", "main_category" : "sum"})
    t2 = t2.reset_index().rename(columns={"usd_goal_real" : "mean_main_category_goal", "category" : "main_category_count"})
    t1 = t1.reset_index().rename(columns={"usd_goal_real" : "mean_category_goal", "main_category" : "category_count"})
    projects = projects.merge(t1, on = "category")
    projects = projects.merge(t2, on = "main_category")
    projects["diff_mean_category_goal"] = projects["mean_category_goal"] - projects["usd_goal_real"]
    ## NOTE: this reuses the column name above and overwrites it; a separate name
    ## such as "diff_mean_main_category_goal" would keep both features.
    projects["diff_mean_category_goal"] = projects["mean_main_category_goal"] - projects["usd_goal_real"]
    projects = projects.drop(["launched", "deadline"], axis = 1)
    return projects
    
df_feat = features1(df_ks)
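
If per-group row counts and goal differences are what the aggregate features are meant to capture, one option is `groupby(...).transform(...)`, which returns values aligned with the original rows and so needs no merge. This is a sketch on made-up data, not the notebook's own method; the toy rows are assumptions.

```python
import pandas as pd

# Hypothetical toy rows; the column names mirror the notebook's.
toy = pd.DataFrame({
    "category": ["Poetry", "Poetry", "Rock", "Rock", "Rock"],
    "usd_goal_real": [1000.0, 3000.0, 500.0, 1500.0, 4000.0],
})

grouped = toy.groupby("category")["usd_goal_real"]

# transform() broadcasts each group's statistic back onto its rows.
toy["category_count"] = grouped.transform("count")
toy["mean_category_goal"] = grouped.transform("mean")
toy["diff_mean_category_goal"] = toy["mean_category_goal"] - toy["usd_goal_real"]
print(toy)
```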

In [13]:
df_feat[[c for c in df_feat.columns if c != "name"]].head()

Unnamed: 0,category,main_category,state,usd_goal_real,syllable_count,launched_month,launched_week,launched_day,is_weekend,num_words,...,Music,Photography,Publishing,Technology,Theater,mean_category_goal,category_count,mean_main_category_goal,main_category_count,diff_mean_category_goal
0,108,12,0,1533.95,10,8,33,1,0,6,...,0,0,1,0,0,5213.996468,16308,22605.780995,2685321,21071.830995
1,108,12,0,6060.97,13,6,26,4,0,8,...,0,0,1,0,0,5213.996468,16308,22605.780995,2685321,16544.810995
2,108,12,0,2000.0,4,3,10,4,0,3,...,0,0,1,0,0,5213.996468,16308,22605.780995,2685321,20605.780995
3,108,12,0,10000.0,15,5,18,3,0,9,...,0,0,1,0,0,5213.996468,16308,22605.780995,2685321,12605.780995
4,108,12,0,757.52,4,10,40,3,0,4,...,0,0,1,0,0,5213.996468,16308,22605.780995,2685321,21848.260995


In [21]:
# state is already 0/1 after features1, so this encoding leaves the values unchanged
labelencoder_X = LabelEncoder() 
df_feat['state'] = labelencoder_X.fit_transform(df_feat['state'])

In [22]:
## define predictors and label 
label = df_feat.state
features = [c for c in df_feat.columns if c not in ["state", "name"]]

#Splitting the data into Training Set and Test Set
from sklearn.model_selection import train_test_split
## prepare training and testing dataset
X_train, X_test, y_train, y_test = train_test_split(df_feat[features], label, test_size = 0.025, random_state = 2)

# Standardize the features (the scaler is fit on the training set only)
import math
from sklearn.preprocessing import StandardScaler 
sc_X = StandardScaler() 
X_train = sc_X.fit_transform(X_train)
X_test = sc_X.transform(X_test)

In [23]:
# Fit a random forest classifier to the training set
from sklearn.ensemble import RandomForestClassifier 
classifierObj = RandomForestClassifier(criterion='entropy') 
classifierObj.fit(X_train,y_train)

## train a random forest classifier 
#model1 = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_train1, y_train1)
y_pred = classifierObj.predict(X_test)

#Evaluating the predictions using a Confusion Matrix
from sklearn.metrics import confusion_matrix
print(confusion_matrix(y_test, y_pred))
print(classifierObj.score(X_test, y_test))

[[4993  967]
 [2051 1245]]
0.6739412273120138
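
The confusion matrix above supports more than the single accuracy figure. The sketch below recomputes accuracy from it and adds a majority-class baseline, precision and recall. It assumes scikit-learn's convention (rows are actual classes, columns are predicted classes) and the label order 0 = failed, 1 = successful used in this notebook.

```python
import numpy as np

# Confusion matrix as printed above: rows = actual, columns = predicted,
# label order [0 = failed, 1 = successful].
cm = np.array([[4993, 967],
               [2051, 1245]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()
baseline = (tn + fp) / cm.sum()   # accuracy of always predicting "failed"
precision = tp / (tp + fp)        # predicted successes that were successes
recall = tp / (tp + fn)           # actual successes that were found

print(f"accuracy={accuracy:.4f} baseline={baseline:.4f}")
print(f"precision={precision:.4f} recall={recall:.4f}")
```

For this matrix the accuracy is about 0.674 against a majority-class baseline of about 0.644, with precision near 0.563 and recall near 0.378 for the successful class, so the model misses most of the projects that actually succeeded.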


In [27]:
test1 = [c for c in df_feat.columns if c not in ["name"]]
data = df_feat[test1]
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 370219 entries, 0 to 370218
Data columns (total 32 columns):
category                   370219 non-null int64
main_category              370219 non-null int64
usd_goal_real              370219 non-null float64
syllable_count             370219 non-null int64
launched_month             370219 non-null int64
launched_week              370219 non-null int64
launched_day               370219 non-null int64
is_weekend                 370219 non-null int64
num_words                  370219 non-null int64
num_chars                  370219 non-null int64
duration                   370219 non-null int64
Art                        370219 non-null uint8
Comics                     370219 non-null uint8
Crafts                     370219 non-null uint8
Dance                      370219 non-null uint8
Design                     370219 non-null uint8
Fashion                    370219 non-null uint8
Film & Video               370219 non-null uint8
Food 

In [18]:
df_feat = df_feat[[c for c in df_feat if c not in ['state']] + ['state']]

In [29]:
X = data.iloc[:,:-1].values 
X_sig = X[:,[0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31]] 
y = data.iloc[:,31].values

SyntaxError: invalid syntax (<ipython-input-29-e4c8df198a48>, line 2)

In [28]:
X = data.iloc[:,:-1].values 
#X_sig = X[:,[0,1,2,3,4,5,6]] 
y = data.iloc[:,31].values

import statsmodels.api as sm
# OLS on a 0/1 target is a linear probability model
X=sm.add_constant(X)
model=sm.OLS(y,X).fit()  # use the y defined above (Y was undefined)
model.summary()

0,1,2,3
Dep. Variable:,state,R-squared:,0.058
Model:,OLS,Adj. R-squared:,0.058
Method:,Least Squares,F-statistic:,878.4
Date:,"Sun, 21 Apr 2019",Prob (F-statistic):,0.0
Time:,20:04:33,Log-Likelihood:,-242850.0
No. Observations:,370219,AIC:,485800.0
Df Residuals:,370192,BIC:,486100.0
Df Model:,26,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
const,0.3003,0.003,101.247,0.000,0.294,0.306
x1,0.0001,1.98e-05,5.382,0.000,6.76e-05,0.000
x2,0.0020,0.000,7.848,0.000,0.001,0.002
x3,-5.029e-07,1.46e-08,-34.449,0.000,-5.32e-07,-4.74e-07
x4,-0.0079,0.001,-13.821,0.000,-0.009,-0.007
x5,0.0023,0.001,2.012,0.044,6.03e-05,0.005
x6,-0.0008,0.000,-2.975,0.003,-0.001,-0.000
x7,-0.0060,0.001,-10.029,0.000,-0.007,-0.005
x8,-0.0071,0.003,-2.371,0.018,-0.013,-0.001

0,1,2,3
Omnibus:,15484.212,Durbin-Watson:,1.918
Prob(Omnibus):,0.0,Jarque-Bera (JB):,50006.349
Skew:,0.533,Prob(JB):,0.0
Kurtosis:,1.549,Cond. No.,2.79e+20
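
A condition number as large as the 2.79e+20 reported above is what one would expect when the design matrix holds an intercept together with a full set of one-hot columns, because the dummies sum to the constant column. The sketch below illustrates that effect with numpy on made-up labels; whether it is the only source of collinearity in this model is an assumption.

```python
import numpy as np

# Hypothetical category labels for eight rows.
labels = np.array([0, 1, 2, 0, 1, 2, 0, 1])
dummies = np.eye(3)[labels]            # full one-hot block, one column per level
const = np.ones((len(labels), 1))      # intercept column

X_full = np.hstack([const, dummies])          # intercept + all dummies
X_drop = np.hstack([const, dummies[:, 1:]])   # drop one reference level

# The dummies sum to the intercept, so X_full has 4 columns but rank 3.
rank_full = np.linalg.matrix_rank(X_full)
rank_drop = np.linalg.matrix_rank(X_drop)
print(X_full.shape[1], rank_full, X_drop.shape[1], rank_drop)
```

Dropping one level per categorical variable, for example with `pd.get_dummies(..., drop_first=True)`, removes this particular redundancy.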


In [None]:
from sklearn.svm import SVC
# Fit polynomial-kernel SVMs of increasing degree and record the test accuracy
degree = []
accuracy = []
for n in range(2, 22, 3):  # degrees 2, 5, 8, ..., 20
    degree.append(n)
    classifierObj = SVC(kernel='poly', degree=n)
    classifierObj.fit(X_train, y_train)
    score = classifierObj.score(X_test, y_test)  # score once and reuse it
    print("Accuracy for degree ", n, ": ", score)
    accuracy.append(score)

# Plot test accuracy against polynomial degree
plt.scatter(degree, accuracy,color='red')
plt.title('Relationship between Test Accuracy and Poly Degree')
plt.xlabel('Polynomial Degree')
plt.ylabel('Test Accuracy')
plt.show()
# Degree 2 gives me the best accuracy of 98.22%

#### Creating model and Testing Accuracy

1. RandomForestClassifier - 67%
2. XGBClassifier - 67%

In [None]:
import math
from sklearn import metrics, multiclass
from sklearn.ensemble import RandomForestClassifier
import xgboost as xgb
from xgboost import XGBClassifier

# n_estimators already sets the number of boosting rounds, so 'num_round' is dropped
param_dist = {'objective': 'binary:logistic', 'max_depth': 1, 'n_estimators':1000, 'eval_metric': 'logloss'}
model2 = multiclass.OneVsRestClassifier(xgb.XGBClassifier(**param_dist))

model2.fit(X_train, y_train)

# exp(-log loss) is the geometric mean of the probability given to the true class, not accuracy
print('exp(-log loss):',(math.exp(-metrics.log_loss(y_test, model2.predict_proba(X_test)))))
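
The quantity `exp(-log_loss)` printed by the cell above is the geometric mean of the probability the model assigns to the true class, which is generally not the same number as classification accuracy. The sketch below computes both by hand on four made-up predictions (the labels and probabilities are assumptions).

```python
import math

y_true = [1, 0, 1, 0]
p_success = [0.9, 0.2, 0.4, 0.6]   # hypothetical predicted P(successful)

# Probability the model assigned to the class that actually occurred.
p_true = [p if y == 1 else 1 - p for y, p in zip(y_true, p_success)]

# Log loss is the mean negative log of those probabilities.
log_loss = -sum(math.log(p) for p in p_true) / len(p_true)
geo_mean_prob = math.exp(-log_loss)

# Accuracy thresholds each probability at 0.5 and counts correct labels.
accuracy = sum((p >= 0.5) == (y == 1) for y, p in zip(y_true, p_success)) / len(y_true)

print(f"exp(-log loss)={geo_mean_prob:.4f} accuracy={accuracy:.2f}")
```

Here the two measures disagree (roughly 0.58 against 0.50), so an accuracy figure for the model is better taken from `model2.score(X_test, y_test)`.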

In [None]:
import math
import pickle
# save the fitted XGBoost model (model2) to disk; `model` is the OLS result
filename = 'XGBClassifierLLabel.sav'
with open(filename, 'wb') as f:
    pickle.dump(model2, f)

In [None]:
# some time later...
 
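# load the model from disk
with open(filename, 'rb') as f:
    loaded_model = pickle.load(f)
# exp(-log loss) is the geometric mean of the true-class probability, not accuracy
print('exp(-log loss):',(math.exp(-metrics.log_loss(y_test, loaded_model.predict_proba(X_test)))))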