# KICKSTARTER CAMPAIGN SUCCESS PREDICTION  

* **Date Published**: 2019/08/16
* **Collaborators**: Nateé Johnson & Misha Berrien
* **Data Source**: https://webrobots.io/kickstarter-datasets/

## INTRODUCTION

Kickstarter is a US-based global crowdfunding platform focused on bringing funding to creative projects. Since the platform’s launch in 2009, the site has hosted over 159,000 successfully funded projects with over 15 million unique backers. Kickstarter uses an “all-or-nothing” funding system: funds are only disbursed for projects that meet the original funding goal set by the creator.

### Project Objective 

Kickstarter earns a 5% commission on projects that are successfully funded. Currently, fewer than 40% of projects on the platform succeed. The objective is to predict which projects are likely to succeed so that these projects can be highlighted on the site through 'staff picks' or 'featured product' lists.

### Proposed Solutions

1. Predict successful campaigns and promote those with the lowest predicted probability of success.
1. Contact the creators of campaigns that fall just below the “success” margin and give them insights that will help them succeed.
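
Both proposed solutions work from a predicted probability of success per campaign. The sketch below shows one way that selection could look. The probabilities, the 0.5 decision threshold, the 0.1 margin, and the column names `campaign_id` and `p_success` are illustrative assumptions, not values or names from this project.

```python
import pandas as pd

# Hypothetical predicted success probabilities for five campaigns
# (illustrative values, not model output from this project).
scores = pd.DataFrame({
    "campaign_id": [101, 102, 103, 104, 105],
    "p_success": [0.92, 0.47, 0.12, 0.41, 0.55],
})

THRESHOLD = 0.5  # assumed decision threshold for "predicted success"
MARGIN = 0.1     # assumed width of the "just below the margin" band

# Solution 1: rank campaigns by predicted probability, lowest first.
ranked = scores.sort_values("p_success").reset_index(drop=True)

# Solution 2: campaigns whose probability falls just below the threshold.
near_miss = scores[
    (scores["p_success"] < THRESHOLD)
    & (scores["p_success"] >= THRESHOLD - MARGIN)
]

print(ranked["campaign_id"].tolist())
print(near_miss["campaign_id"].tolist())
```

In practice the threshold and margin would need to be chosen from the tuned model's calibration rather than fixed up front.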

## EXPLORATORY DATA ANALYSIS

In [23]:
# import libraries and project modules
import functools
import glob
import io
import matplotlib.pyplot as plt
import numpy as np
import os
import pandas as pd 
import seaborn as sns
import sys

src_dir = os.path.join(os.getcwd(), '..', '..', 'src')
sys.path.append(src_dir)

from d02_intermediate.intermediate_cleaning import kickstarter_deduped_to_intermediate
from d02_intermediate.feature_engineering import kickstarter_feature_engineering

# Load the "autoreload" extension
%load_ext autoreload

# reload all modules before each cell runs, so changes to code in src are picked up
%autoreload 2

Load and clean our datasets. 

In [26]:
kick_deduped = pd.read_csv('../../data/02_intermediate/kick_deduped.csv.zip')
cluster_features_df = pd.read_csv('../../data/03_processed/KNN_cluster_features_.csv')

In [27]:
kick_inter = kickstarter_deduped_to_intermediate(kick_deduped)

### Summary Stats

In [29]:
kick_inter.head()

| | backers_count | blurb | converted_pledged_amount | country | created_at | currency | currency_symbol | currency_trailing_code | current_currency | deadline | ... | urls | usd_pledged | usd_type | sub_category | overall_category | city | country_loc | state_loc | creator_name | creator_slug |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | I'm just going to say it, I'm not special. I'm... | 0.0 | US | 2019-07-15 02:59:36 | USD | `$` | True | USD | 2019-08-17 05:04:48 | ... | `{"web":{"project":"https://www.kickstarter.com...` | 0.0 | domestic | Apparel | fashion/apparel | Wasilla | US | AK | Dima01 | dima01 |
| 1 | 568 | for Tabletop Role Playing Games like Dungeons ... | 18969.0 | US | 2019-06-02 21:06:55 | USD | `$` | True | USD | 2019-07-18 03:55:00 | ... | `{"web":{"project":"https://www.kickstarter.com...` | 18969.0 | domestic | Tabletop Games | games/tabletop games | Holland | US | MI | quEmpire Gaming | quempire |
| 2 | 0 | Giuliano Clothing is on a mission to reinvent ... | 0.0 | CA | 2019-07-17 23:13:13 | CAD | `$` | True | USD | 2019-08-17 03:50:07 | ... | `{"web":{"project":"https://www.kickstarter.com...` | 0.0 | domestic | Fashion | fashion | Toronto | CA | ON | Giuliano Clothing | giulianoclothing |
| 3 | 80 | We have a new album that we are ready to relea... | 3691.0 | US | 2019-06-27 18:36:40 | USD | `$` | True | USD | 2019-07-18 03:30:00 | ... | `{"web":{"project":"https://www.kickstarter.com...` | 3691.0 | domestic | Music | music | Saratoga Springs | US | NY | Drank The Gold | drankthegold |
| 4 | 0 | The film follows 4 frustrated campaigns as the... | 0.0 | US | 2019-07-17 19:45:17 | USD | `$` | True | USD | 2019-08-17 03:23:01 | ... | `{"web":{"project":"https://www.kickstarter.com...` | 0.0 | domestic | Drama | film & video/drama | Nashville | US | TN | Anthony Stephen Hamilton | aperiodpiece |


In [31]:
kick_inter.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 332899 entries, 0 to 332898
Data columns (total 42 columns):
backers_count               332899 non-null int64
blurb                       332889 non-null object
converted_pledged_amount    195183 non-null float64
country                     332899 non-null object
created_at                  332899 non-null datetime64[ns]
currency                    332899 non-null object
currency_symbol             332899 non-null object
currency_trailing_code      332899 non-null bool
current_currency            195183 non-null object
deadline                    332899 non-null datetime64[ns]
disable_communication       332899 non-null bool
friends                     1629 non-null object
fx_rate                     185035 non-null float64
goal                        332899 non-null float64
id                          332899 non-null int64
is_backing                  1629 non-null object
is_starrable                206127 non-null object
is_starred   

Our current dataset has 42 columns. We first narrow these down to the columns we would like to work with. Since we are trying to predict whether a Kickstarter campaign will succeed or fail, we need to ensure that we are not using any features that contain "future information" (e.g. number of backers or amount pledged), because these features are only known once a campaign is underway and could act as proxies for our target variable.
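
To illustrate why "future information" is a problem, the sketch below uses made-up rows to show that, under all-or-nothing funding, the pledged amount alone reproduces the outcome. The `state` column name and its `successful`/`failed` labels are assumptions for illustration, as are all values. The toy data also ignores currency conversion between `goal` and `usd_pledged`.

```python
import pandas as pd

# Toy data (made up for illustration): four finished campaigns.
toy = pd.DataFrame({
    "goal": [1000.0, 5000.0, 250.0, 20000.0],
    "usd_pledged": [1500.0, 1200.0, 250.0, 0.0],
    # Assumed target column and labels; not confirmed by the data shown here.
    "state": ["successful", "failed", "successful", "failed"],
})

# Under all-or-nothing funding, a campaign succeeds when pledges reach the goal.
reconstructed = (toy["usd_pledged"] >= toy["goal"]).map(
    {True: "successful", False: "failed"}
)

# Fraction of rows where the pledged amount alone recovers the label.
leak_accuracy = (reconstructed == toy["state"]).mean()
print(leak_accuracy)
```

A model given such a column would look near-perfect on finished campaigns while being useless for campaigns that have just launched, which is when the prediction is needed.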

After looking through our data dictionary (located in the reference folder in the repository), we have identified the following columns to drop before building our model.

**COLUMNS TO DROP:**
1. backers_count: The number of people who backed the project. This column contains "future information" and could act as a proxy for our target variable.
1. blurb: A short description of the project. We created a new column (blurb count) from it and will drop the original feature.
1. currency_symbol: Redundant with the currency feature.
1. currency_trailing_code: Redundant with the currency feature.
1. converted_pledged_amount: The amount of money that has been pledged to the campaign. This column contains "future information" and could act as a proxy for the target variable.
1. current_currency: Redundant with the currency column.
1. friends: This column is ~99% empty.
1. id: Unique identifier for the campaign. Must be dropped before training the model.
1. name: Unique identifier for the campaign. Must be dropped before training the model.
1. is_backing: This column is ~99% empty.
1. is_starrable: This column contains "future information" regarding how successful Kickstarter believes the campaign will be.
1. permissions: This column is ~99% empty.
1. slug: Redundant with name.
1. source_url: Not needed for model building.
1. spotlight: This column contains "future information" regarding how successful Kickstarter believes the campaign will be.
1. staff_pick: This column contains "future information" regarding how successful Kickstarter believes the campaign will be.
1. unread_message_count: This column is empty.
1. unseen_activity_count: This column is empty.
1. urls: Not needed for model building.
1. usd_pledged: The amount pledged in USD. Redundant with converted_pledged_amount and, like it, contains "future information".
1. country: Redundant with the currency column (does not actually reflect where the campaign is located).
1. creator_name: Unnecessary information.
1. creator_slug: Unnecessary information.
1. disable_communication: False for all campaigns that have ended.
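
The "mostly empty" judgments in the list above can be checked by computing the share of missing values per column before dropping. The following is a minimal sketch on a toy frame standing in for `kick_inter`. The values and the 70% cutoff are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for kick_inter (values are made up).
toy = pd.DataFrame({
    "goal": [1000.0, 5000.0, 250.0, 20000.0],
    "friends": [np.nan, np.nan, np.nan, "[]"],
    "unread_message_count": [np.nan] * 4,
    "backers_count": [12, 0, 3, 0],
})

# Share of missing values per column, highest first.
missing_share = toy.isna().mean().sort_values(ascending=False)

# Columns above an assumed 70% missing-value cutoff.
mostly_empty = missing_share[missing_share > 0.7].index.tolist()

# Combine with columns dropped for other reasons (here: leakage).
to_drop = mostly_empty + ["backers_count"]

# errors="ignore" skips any listed name that is absent from the frame.
reduced = toy.drop(columns=to_drop, errors="ignore")
print(reduced.columns.tolist())
```

Keeping the drop list in one place, with the reason for each group, makes it easier to revisit a decision later, for example if a sparse column turns out to be informative.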

**Let's analyze our remaining 12 columns:**

1. created_at (datetime) 
1. currency (categorical) 
1. deadline (datetime)
1. fx_rate (quantitative) 
1. goal (quantitative)
1. last_update_published_at
1. launched_at (datetime) 
1. sub_category (categorical) 
1. overall_category (categorical) 
1. city (categorical) 
1. country_loc (categorical) 
1. state_loc (categorical) 
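
Before analyzing these columns, it can help to cast each one to the type noted in the list so that summaries behave as expected. This is a sketch on made-up rows that reuses a few of the column names above. It is not the project's cleaning code, and the values are illustrative only.

```python
import pandas as pd

# Toy rows using a few of the column names listed above (values are made up).
toy = pd.DataFrame({
    "currency": ["USD", "CAD", "USD"],
    "overall_category": ["fashion/apparel", "fashion", "music"],
    "goal": [1000.0, 5000.0, 250.0],
    "launched_at": ["2019-07-15 03:00:00", "2019-07-18 00:00:00", "2019-06-27 19:00:00"],
    "deadline": ["2019-08-17 05:04:48", "2019-08-17 03:50:07", "2019-07-18 03:30:00"],
})

categorical_cols = ["currency", "overall_category"]
datetime_cols = ["launched_at", "deadline"]

# Cast categorical columns so value counts and groupbys use category dtype.
for col in categorical_cols:
    toy[col] = toy[col].astype("category")

# Parse timestamp strings into datetime64 values.
for col in datetime_cols:
    toy[col] = pd.to_datetime(toy[col])

print(toy.dtypes)
print(toy["currency"].value_counts())
print(toy["goal"].describe())
```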

## FEATURE CONSTRUCTION

## MODEL EXPLORATION

## MODEL TUNING

## CONCLUSION & NEXT STEPS