Kickstarter is an American public-benefit corporation based in Brooklyn, New York, that maintains a global crowdfunding platform focused on creativity. The company's stated mission is to "help bring creative projects to life". Kickstarter has reportedly received more than $1.9 billion in pledges from 9.4 million backers to fund 257,000 creative projects, such as films, music, stage shows, comics, journalism, video games, technology and food-related projects.

People who back Kickstarter projects are offered tangible rewards or experiences in exchange for their pledges. This model traces its roots to the subscription model of arts patronage, where artists would go directly to their audiences to fund their work.


First, we import the libraries required for our analysis.

In [24]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

Next, we load our data on the platform's projects. Here is a sneak preview of the dataset.

In [8]:
dataset=pd.read_csv("Desktop/ks-projects-201801.csv")
dataset.head()

Unnamed: 0,ID,name,category,main_category,currency,deadline,goal,launched,pledged,state,backers,country,usd pledged,usd_pledged_real,usd_goal_real
0,1000002330,The Songs of Adelaide & Abullah,Poetry,Publishing,GBP,2015-10-09,1000.0,2015-08-11 12:12:28,0.0,failed,0,GB,0.0,0.0,1533.95
1,1000003930,Greeting From Earth: ZGAC Arts Capsule For ET,Narrative Film,Film & Video,USD,2017-11-01,30000.0,2017-09-02 04:43:57,2421.0,failed,15,US,100.0,2421.0,30000.0
2,1000004038,Where is Hank?,Narrative Film,Film & Video,USD,2013-02-26,45000.0,2013-01-12 00:20:50,220.0,failed,3,US,220.0,220.0,45000.0
3,1000007540,ToshiCapital Rekordz Needs Help to Complete Album,Music,Music,USD,2012-04-16,5000.0,2012-03-17 03:24:11,1.0,failed,1,US,1.0,1.0,5000.0
4,1000011046,Community Film Project: The Art of Neighborhoo...,Film & Video,Film & Video,USD,2015-08-29,19500.0,2015-07-04 08:35:03,1283.0,canceled,14,US,1283.0,1283.0,19500.0


We will also need to see the number of unique values within each variable.

In [10]:
dataset.nunique()

ID                  378661
name                375764
category               159
main_category           15
currency                14
deadline              3164
goal                  8353
launched            378089
pledged              62130
state                    6
backers               3963
country                 23
usd pledged          95455
usd_pledged_real    106065
usd_goal_real        50339
dtype: int64

As the objective of the notebook is to create a predictive model that gauges whether a new Kickstarter project could succeed in securing its target funding goal, we will need to look into each variable and identify the ones we require.

Here are the variables that shall be dropped from our model:
1. ID
    An identifier, not a feature that affects prediction.

2. name
    While frequent keywords could be extracted from the project name using NLP techniques and included in our model, we shall skip this to avoid complication.

3. category
    This is an important variable without question, but as a categorical variable with 159 unique values, labelling and encoding it would add 158 extra variables to the model, something that I am not very inclined to do.

4. currency
    Currency and country seem to be similar variables, e.g. USD and US, GBP and GB. One of them should therefore be dropped to avoid duplicate variables. In this case, country shall be retained within our model.

5. goal
    For the comparison between projects to be fair, the goals should be expressed in the same currency. We will use usd_goal_real instead of goal.

6. pledged, backers, usd pledged, usd_pledged_real
    These values are not known before a project launches, so they cannot be used to predict the outcome of a new project.

7. launched, deadline
    Instead of using these two variables directly, a new variable, duration, can be derived by taking the difference between the two dates.

We end up with 4 independent variables (main_category, country, duration and usd_goal_real) to predict our dependent variable, state.
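The claim that currency largely duplicates country can be checked before dropping it. The sketch below is not part of the original notebook: it uses a small hypothetical frame (`toy`) in place of the real dataset and tests whether every country maps to exactly one currency.

```python
import pandas as pd

# Hypothetical toy rows standing in for the dataset's country/currency columns.
toy = pd.DataFrame({
    "country":  ["US", "US", "GB", "GB", "DE", "FR"],
    "currency": ["USD", "USD", "GBP", "GBP", "EUR", "EUR"],
})

# Number of distinct currencies seen for each country.
currencies_per_country = toy.groupby("country")["currency"].nunique()

# True when every country maps to exactly one currency, i.e. currency adds
# no information beyond what country already carries.
currency_is_redundant = bool((currencies_per_country == 1).all())
print(currency_is_redundant)
```

In the toy data the reverse does not hold, since EUR is shared by DE and FR. That asymmetry is one reason to keep country rather than currency, assuming the real data behaves the same way.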


In [13]:
# launched includes a time of day, so a date-only format does not match it; let pandas infer the format
dataset['launched']=pd.to_datetime(dataset['launched'])
dataset['deadline']=pd.to_datetime(dataset['deadline'],format="%Y-%m-%d")
dataset['duration']=(dataset['deadline']-dataset['launched']).dt.days

columns=['main_category','country','duration','usd_goal_real','state']
projects=dataset[columns]
projects.head()

Unnamed: 0,main_category,country,duration,usd_goal_real,state
0,Publishing,GB,58,1533.95,failed
1,Film & Video,US,59,30000.0,failed
2,Film & Video,US,44,45000.0,failed
3,Music,US,29,5000.0,failed
4,Film & Video,US,55,19500.0,canceled


Now, let us take a deeper look into the possible values of the dependent variable, state.

In [19]:
projects['state'].value_counts()

failed        197719
successful    133956
canceled       38779
undefined       3562
live            2799
suspended       1846
Name: state, dtype: int64

The only outcomes that we are concerned with are whether a project succeeds or fails. Therefore, the rest of the outcomes must be filtered out of our modelling dataset.

In [21]:
projects=projects[projects['state'].isin(("failed","successful"))]
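With only failed and successful projects left, the class balance gives a useful reference point for any accuracy figure: a model that always predicts the more common class already scores that class's share. The sketch below is not part of the original notebook and simply reuses the counts printed by value_counts() above.

```python
# Class counts copied from the value_counts() output above.
counts = {"failed": 197719, "successful": 133956}

total = sum(counts.values())
majority_class = max(counts, key=counts.get)

# Accuracy of a trivial model that always predicts the majority class.
baseline_accuracy = counts[majority_class] / total

print(majority_class, round(baseline_accuracy, 4))
```

Any classifier trained on this data should be judged against this baseline rather than against 50%.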

Before we start analysing, check for missing values.

In [23]:
projects.isnull().sum(axis=0)

main_category    0
country          0
duration         0
usd_goal_real    0
state            0
dtype: int64

Wonderful! Since there are no missing values, let us move on to our next step: labelling and encoding our categorical variables, main_category and country.

In [35]:
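from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import LabelEncoder,OneHotEncoder

X=projects.loc[:,['main_category','country','duration','usd_goal_real']].copy()
y=projects.loc[:,'state']

# country is kept as a single integer-coded column
label_x2=LabelEncoder()
X['country']=label_x2.fit_transform(X['country'])

# OneHotEncoder no longer accepts categorical_features (removed in scikit-learn 0.22),
# so ColumnTransformer is used to one-hot encode main_category and pass the rest through.
# The one-hot columns come first, followed by country, duration and usd_goal_real.
onehot=ColumnTransformer([('main_category',OneHotEncoder(),['main_category'])],
                         remainder='passthrough',sparse_threshold=0)
X=onehot.fit_transform(X)

# encode the target: failed -> 0, successful -> 1
label_y=LabelEncoder()
y=label_y.fit_transform(projects['state'])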


To begin the modelling process, we first have to split the dataset into a training set and a test set.

In [37]:
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.25,random_state=0)

We will also need to standardise the variables so that no variable holds an unfair influence in the model merely because of the magnitude of its values. The scaler is fitted on the training set only and then applied to the test set.

In [38]:
from sklearn.preprocessing import StandardScaler
sc=StandardScaler()
X_train=sc.fit_transform(X_train)
X_test=sc.transform(X_test)
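To make the fit/transform distinction concrete, here is a minimal pure-Python sketch of the same idea on one column. The goal amounts are hypothetical, and the code only mirrors what standardisation does (subtract the training mean, divide by the training population standard deviation); it is not a replacement for StandardScaler.

```python
from statistics import fmean, pstdev

# Hypothetical goal amounts (in USD) for a training and a test split.
train_goal = [1000.0, 5000.0, 30000.0, 45000.0]
test_goal = [19500.0]

# "fit": learn the mean and population standard deviation from the training data only.
mu = fmean(train_goal)
sigma = pstdev(train_goal)

# "transform": apply the same shift and scale to both splits.
train_scaled = [(v - mu) / sigma for v in train_goal]
test_scaled = [(v - mu) / sigma for v in test_goal]

print(mu, train_scaled, test_scaled)
```

Reusing the training statistics on the test split keeps information about the test data from leaking into the model.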

Finally, to the exciting part where we start modelling. For this kernel, we shall be using a 3-layer artificial neural network: two hidden layers with 9 nodes each, followed by a single sigmoid output node. We will train with a batch size of 32 for 100 epochs.
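To get a feel for how small this network is, its trainable parameters can be counted by hand. The sketch below is not part of the original notebook. It assumes 18 input features, matching the input_dim=18 used in the model cell (15 one-hot main_category columns plus country, duration and usd_goal_real).

```python
# Layer widths: 18 inputs (assumed, see above), two hidden layers of 9 units,
# and a single sigmoid output unit.
layer_sizes = [18, 9, 9, 1]

def dense_params(n_in, n_out):
    """Weights plus one bias per output unit of a fully connected layer."""
    return n_in * n_out + n_out

# Pair each layer with the next one to count parameters layer by layer.
per_layer = [dense_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:])]
total_params = sum(per_layer)

print(per_layer, total_params)
```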

In [None]:
import keras
from keras.models import Sequential
from keras.layers import Dense

classifier=Sequential()
classifier.add(Dense(units=9,kernel_initializer='uniform',activation='relu',input_dim=18))
classifier.add(Dense(units=9,kernel_initializer='uniform',activation='relu'))
classifier.add(Dense(units=1,kernel_initializer='uniform',activation='sigmoid'))
classifier.compile(optimizer='adam',loss='binary_crossentropy',metrics=['accuracy'])
classifier.fit(X_train,y_train,batch_size=32,epochs=100)

Epoch 1/100
...
Epoch 44/100

The accuracy reported during training hovers around 65%. There is room for improvement by tuning hyperparameters such as the number of hidden layers, the number of nodes, the batch size or the number of epochs. However, my guess is that the largest improvement to the model would come from extracting recurring keywords from project names that are commonly associated with either success or failure.
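As a rough illustration of that last idea, the sketch below counts how often each word appears in the names of failed and successful projects. It is only a starting point under stated assumptions: the (name, state) pairs are hypothetical, and the `tokens` helper is a made-up simplification rather than a proper NLP pipeline.

```python
import re
from collections import Counter

# Hypothetical (name, state) pairs; the real notebook would read these from the dataset.
rows = [
    ("Indie Film Project", "failed"),
    ("Debut Album Recording", "successful"),
    ("Short Film About Home", "failed"),
    ("New Album and Tour", "successful"),
]

def tokens(name):
    """Lower-case alphabetic words of at least three letters."""
    return [w for w in re.findall(r"[a-z]+", name.lower()) if len(w) >= 3]

word_counts = {"failed": Counter(), "successful": Counter()}
for name, state in rows:
    # A set is used so that each word counts once per project.
    word_counts[state].update(set(tokens(name)))

print(word_counts["successful"].most_common(1))
print(word_counts["failed"].most_common(1))
```

Words whose counts differ sharply between the two outcomes could then be turned into binary features and added to the model's inputs.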