# Kickstartin' Success

By: Priscilla Mannuel, Mark Roberts

In this study, we will explore [data](https://webrobots.io/kickstarter-datasets/) on craft beers.

* [Introduction](#intro)
* [Initialization](#init)
* [Analysis](#analysis)
* [Conclusion](#conclud)

<h4><center>...</center></h4>

<a id='intro'><h2>Introduction</h2></a>

...

<h2><a id='init'>Initialization</a></h2>

In [2]:
import pandas as pd
import numpy as np
import datetime as dt
import matplotlib.pyplot as plt
import matplotlib.mlab as mlab
import seaborn as sns
import os, json
from scipy.stats import ttest_ind, f_oneway, lognorm, levy, skew, chisquare
from sklearn.preprocessing import normalize, scale

%matplotlib inline

import warnings
warnings.filterwarnings('ignore')

The dataset utilized is provided by [Kaggle](https://www.kaggle.com/nickhould/craft-cans). The dataset contains a list of 2K+ craft canned beers from the US.

In [11]:
%%capture
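# Load the dataset from local disk; fail early if the file is missing
dataset_path = 'dataset/kaggle_beer.csv'
if not os.path.exists(dataset_path):
    raise FileNotFoundError(dataset_path)
# skip malformed lines instead of raising a parsing error
data = pd.read_csv(dataset_path, on_bad_lines='skip')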

In [14]:
# Print the size of the dataset
print ("Number of rows:", data.shape[0])
print ("Number of columns:", data.shape[1])

Number of rows: 5973
Number of columns: 22


In [28]:
data.head()

Unnamed: 0,Name,id,brewery_id,cat_id,style_id,Alcohol By Volume,International Bitterness Units,Standard Reference Method,Universal Product Code,filepath,...,last_mod,Style,Category,Brewer,Address,City,State,Country,Coordinates,Website
0,Scottish Ale,1495,347,1,15,5.8,0.0,0.0,0.0,,...,2010-07-23T01:30:00+05:30,Scotch Ale,British Ale,Carlyle Brewing,215 East State Street,Rockford,Illinois,United States,"42.2689, -89.0907",
1,Het Kapittel Pater,1509,301,-1,-1,0.0,0.0,0.0,0.0,,...,2010-07-23T01:30:00+05:30,,,Brouwerij Van Eecke,Douvieweg 2,Watou,West-Vlaanderen,Belgium,"50.8612, 2.6615",
2,Export Premium,1516,785,-1,-1,5.4,0.0,0.0,0.0,,...,2010-07-23T01:30:00+05:30,,,Licher Privatbrauerei,In den Hardtberggrten,Lich,Hessen,Germany,"50.5208, 8.8166",
3,Bee Sting Honey Ale,1527,604,3,26,5.9,0.0,0.0,0.0,,...,2010-07-23T01:30:00+05:30,American-Style Pale Ale,North American Ale,Great Divide Brewing,2201 Arapahoe Street,Denver,Colorado,United States,"39.7539, -104.989",http://www.greatdivide.com/
4,PranQster Belgian Ale,1540,919,-1,-1,7.6,0.0,0.0,0.0,,...,2010-07-23T01:30:00+05:30,,,North Coast Brewing Company,455 North Main Street,Fort Bragg,California,United States,"39.4466, -123.806",http://www.northcoastbrewing.com/home.htm


<h2>Quick Data Overview</h2>

In [29]:
il_beers_only = data[data.State == 'Illinois']

In [30]:
print ("Number of rows:", il_beers_only.shape[0])
print ("Number of columns:", il_beers_only.shape[1])

Number of rows: 190
Number of columns: 22


In [15]:
data.columns

Index(['Name', 'id', 'brewery_id', 'cat_id', 'style_id', 'Alcohol By Volume',
       'International Bitterness Units', 'Standard Reference Method',
       'Universal Product Code', 'filepath', 'Description', 'add_user',
       'last_mod', 'Style', 'Category', 'Brewer', 'Address', 'City', 'State',
       'Country', 'Coordinates', 'Website'],
      dtype='object')

In [24]:
# Percentage of missing values per column
data.isnull().mean() * 100

Name                               0.167420
id                                 0.000000
brewery_id                         0.167420
cat_id                             0.385066
style_id                           0.401808
Alcohol By Volume                  0.418550
International Bitterness Units     0.418550
Standard Reference Method          0.418550
Universal Product Code             0.485518
filepath                          99.581450
Description                       65.745856
add_user                           0.719906
last_mod                           1.222166
Style                             25.230203
Category                          25.230203
Brewer                             0.418550
Address                           13.092248
City                               0.870584
State                              5.842960
Country                            0.418550
Coordinates                        3.800435
Website                           51.799766
dtype: float64

<h2><a id='analysis'>Analysis</a></h2>

**Summary**

The initial dataset contained 5,973 variations of canned craft beers. In order to build the business case, the analysis process is broken down into (1) [Data cleaning](#s1), (2) [Feature engineering](#s2), (3) [Exploratory data analysis](#s3), and (4) [Visualization](#s4).

<h3><a id='s1'>Data Cleaning</a></h3>

In [None]:
# define a function to clean a loaded dataset

def clean(mydata):
    
    """
    Clean the input dataframe mydata.
    
    input:
        mydata: pandas.DataFrame
    output:
        pandas.DataFrame (cleaned copy; the input is left unchanged)

    """
    
    data = mydata.copy()
    
    # get rid of unnecessary columns (names that are absent are ignored)
    drop_cols = ['filepath',
                 'blurb']
    
    data = data.drop(columns=drop_cols, errors='ignore')

    # drop rows with missing entries (e.g. an empty name)
    # given more time, the missing entries could be web-scraped instead
    data = data.dropna()

    #select only data with known status
    successful = data['state'] == "successful"
    failed = data['state'] == "failed"
    cancelled = data['state'] == "cancelled"
    suspended = data['state'] == "suspended"
    data = data.loc[failed | successful | cancelled | suspended]

    #label categorical columns                   ##Commented by NN
    #categorical_cols = ['category.id',
    #                    'category.parent_id',
    #                    'country',
    #                    'spotlight',
    #                    'staff_pick',
    #                    'state',
    #                    'usd_type']
    #data[categorical_cols] = pd.Categorical

    #label numerical columns
    num_cols = ['usd_pledged',
                'deadline',
                'created_at',
                'launched_at']
    data[num_cols] = data[num_cols].apply(pd.to_numeric, errors='coerce')

    #because there are some "bad lines" a.k.a lines that are shifted to the right due to bad parsing
    #subset rows that are correctly parsed
    data = data.dropna()

    #label datetime columns
    data['created_at'] = pd.to_datetime(data['created_at'],unit='s')
    data['launched_at'] = pd.to_datetime(data['launched_at'],unit='s')
    data['deadline'] = pd.to_datetime(data['deadline'],unit='s')    ## Corrected the variable
    
    #CORRECT RANGE
    
    return data

**Let's clean the data... *scrub* *scrub* *scrub***

In [None]:
print("Data dim before cleaning:", data.shape)
data = clean(data)
print("Data dim after cleaning:", data.shape)

In [None]:
data.columns[data.isnull().any()]

<h3><a id='s2'>Feature Engineering</a></h3>

To gain a deeper understanding of the Kickstarter environment and the drivers of successful campaigns, new features are engineered from the existing variables. The resulting dataset is reduced to the following features:

* **success**: boolean feature indicating campaign (1) success (0) failure
* **name_len**: length of name of project
* **desc_len**: length of the short description or blurb
* **state**: successful, failed, cancelled, or suspended
* **duration**: days between creation and launch of campaign
* **time variables**: month, wday (day of week), hour (hour of day)
* **category**:
* **subcategory**:
* **country**:
* **spotlight**:
* **staff_picked**:
* **goal**:
* pics_count:

Additionally, these features can be included to predict campaigns that have already started:

* comments: the number of comments
* **traction**: rate of gaining backers, total number of backers divided by total number of weeks
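
The definitions above can be sketched on a couple of made-up rows. This is only an illustration: the toy values are invented, the time variables are assumed to be taken from the launch timestamp, and "total number of weeks" is assumed to mean the weeks between launch and deadline.

```python
import pandas as pd

# Toy rows standing in for cleaned campaigns; every value is made up.
toy = pd.DataFrame({
    "name": ["Solar Lamp", "Board Game Deluxe"],
    "blurb": ["A lamp powered by the sun.", "A strategy game for four players."],
    "state": ["successful", "failed"],
    "created_at": pd.to_datetime(["2016-01-01 09:00", "2016-03-10 18:30"]),
    "launched_at": pd.to_datetime(["2016-01-15 09:00", "2016-03-12 18:30"]),
    "deadline": pd.to_datetime(["2016-02-14 09:00", "2016-04-11 18:30"]),
    "backers_count": [120, 8],
})

# success: 1 for successful campaigns, 0 otherwise
toy["success"] = (toy["state"] == "successful").astype(int)

# name_len / desc_len: number of characters in the name and the blurb
toy["name_len"] = toy["name"].str.len()
toy["desc_len"] = toy["blurb"].str.len()

# duration: whole days between creation and launch
toy["duration"] = (toy["launched_at"] - toy["created_at"]).dt.days

# time variables (assumed here to describe the launch timestamp)
toy["month"] = toy["launched_at"].dt.month
toy["wday"] = toy["launched_at"].dt.dayofweek + 1  # Monday=1, Sunday=7
toy["hour"] = toy["launched_at"].dt.hour

# traction: backers per week (weeks assumed to run from launch to deadline)
weeks_live = (toy["deadline"] - toy["launched_at"]).dt.days / 7
toy["traction"] = toy["backers_count"] / weeks_live

print(toy[["success", "name_len", "desc_len", "duration", "wday", "traction"]])
```

Only `traction` (and `comments`) needs information that becomes available after launch, which is why those two are listed separately.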


In [None]:
def engineer_features(mydata):
    
    """
    This function generates new features for the input dataframe mydata:
    
    input:
        mydata: pandas.dataframe
    output: 
        pandas.dataframe

    """
    
    data = mydata.copy()

    # create success variable (PREDICT)
    # 'suspended' campaigns are counted as failed
    data['state'] = data['state'].replace('suspended', 'failed')
    
    # category.slug has the form "<type>/<subtype>"; split it into two columns
    data[['catg.type', 'catg.subtype']] = data['category.slug'].str.split('/', n=1, expand=True)
    
    data['state_num'] = (data['state'] == 'successful').astype(int)
    
    # hour of day (1-24), day of week (Monday=1, Sunday=7), month and year of each timestamp
    for col in ['launched_at', 'deadline', 'created_at']:
        data[col + '_hr'] = data[col].dt.hour + 1
        data[col + '_dow'] = data[col].dt.dayofweek + 1
        data[col + '_mo'] = data[col].dt.month
        data[col + '_yr'] = data[col].dt.year
    
    # helper column to plot the number of campaigns per month for all years
    data['count'] = 1
    
    data['success'] = (data['state'] == 'successful')
    
    # state_changed_at records when a campaign became successful, failed, cancelled or suspended.
    # It is not known ahead of time for a given campaign, so it is left out of the
    # time differences computed below:
    #   creation and launch
    #   launch and deadline
    data['launched-created'] = (data.launched_at - data.created_at).dt.days
    data['deadline-launched'] = (data.deadline - data.launched_at).dt.days
    
    
    return data

**Let's engineer new features!**

In [None]:
print("Data dim before feature engineering:", data.shape)
data = engineer_features(data)
print("Data dim after feature engineering:", data.shape)

In [None]:
# make spotlight into boolean
# spotlightmap = {True: True, False: False, 'True': True, 'False': False}

# data['spotlight'] = data['spotlight'].map(spotlightmap)

data['spotlight'].unique()

<h3><a id=''>Sanity Check</a></h3>
<br>



In [None]:
pd.qcut(data['goal'], 10).value_counts().sort_index().plot(kind='bar')

In [None]:
data['catg.type'].value_counts().plot.pie(autopct='%.2f',figsize=(6,6))
plt.title('Distribution of projects in different categories')
plt.tight_layout()

Simple quality check: let's make sure no deadline falls before its launch date (the cell below should return `False`).

In [None]:
np.any(data.deadline < data.launched_at)

In [None]:
data.launched_at_yr.value_counts()

The following plot shows the number of campaigns launched per month, with one line per year from 2009 to 2018.

In [None]:
data['count'] = 1

In [None]:
def plot_monthly_campaign_count(n):
    
    mth = data[data['launched_at_yr'] == n]
    mth_cnt = mth.groupby('launched_at_mo').count()['count']
    mth_cnt.plot(marker='o', markersize=5, alpha=.5, rot=90)

fig = plt.figure(figsize=(12, 4))
for i in range(2009, 2019):
    plot_monthly_campaign_count(i)
plt.ylabel('Number of Campaigns', fontsize=14)
plt.xlabel('Month', fontsize=14)
plt.legend(range(2009, 2019))
plt.show()

Peak campaign count occurred in July 2014. I wonder what caused that? No seasonality is readily apparent.

In [None]:
pd.crosstab(data.launched_at_mo, data.launched_at_yr)

In [None]:
data.staff_pick.value_counts()

In [None]:
data.staff_pick.unique()

We observe a discrepancy in the values (booleans mixed with the strings 'True'/'False') and correct it so that the column only contains True/False.

In [None]:
data = data.replace({'staff_pick': {'True': True, 'False': False}})

In [None]:
data.staff_pick.value_counts()

In [None]:
staff_picked = data.staff_pick.value_counts()
# look up the True count by label rather than by position
pct_staff_picked = staff_picked.loc[True] * 100 / staff_picked.sum()
print("Not so nice, ~ %g%% are staff picked" % round(pct_staff_picked))

In [None]:
data['success'] = (data['state'] == 'successful')

In [None]:
data.success.value_counts()

## Plots

`state_changed_at` is the column describing when a campaign changed state to successful, failed, cancelled, or suspended. Since this is not known ahead of time for a given campaign, it is not a useful datetime here. Instead, we look at the time differences between the following dates:

* creation and launch
* launch and deadline

In [None]:
data['launched-created'].describe()

In [None]:
data['deadline-launched'].describe()

<h3><a id='s3'>Exploratory Data Analysis</a></h3>
<br>
<center>**Unsupervised learning**</center>

To narrow our exploration, we studied correlations and used clustering to investigate latent drivers that contribute to the success of a campaign.

<h4><center>...</center></h4>

**Target Variable: Success** <br>


In [None]:
data['state'].value_counts().plot.pie(autopct='%.2f')
plt.title('Distribution of successful (1)/failed (0) projects')

In [None]:
# group once and reuse the grouping for every summary statistic
by_state = data.groupby('state')
stats = by_state.size().to_frame('Projects')
stats['Project proportion(%)'] = (stats['Projects'] / stats['Projects'].sum() * 100).round(2)
stats['Project median goal($)'] = by_state['goal'].median().round(2).astype(str)
stats['Project average goal($)'] = by_state['goal'].mean().round(2)
stats['Median Pledged($)'] = by_state['usd_pledged'].median().round(2).astype(str)
stats['Average pledged($)'] = by_state['usd_pledged'].mean().round(2)
stats['Max. Backers Count'] = by_state['backers_count'].max()

stats.transpose()

**Correlation matrix** <br>


In [None]:
import numpy as np
import seaborn as sns; sns.set()

def corrmatrix(mydata, annot=True):
        
    """
    Plot the lower triangle of the correlation matrix of the numeric columns of mydata.
    
    input:
        mydata: pandas.DataFrame
        annot: bool, write the correlation value in each cell
    output:
        matplotlib Axes containing the heatmap

    """
    
    # keep only the continuous (numeric) columns
    data_continuous = mydata.select_dtypes(include=['int64', 'float64'])
    
    corr = data_continuous.corr()
    
    # mask the upper triangle (including the diagonal), which mirrors the lower one
    mask = np.zeros_like(corr, dtype=bool)
    mask[np.triu_indices_from(mask)] = True
    
    with sns.axes_style("white"):
        ax = sns.heatmap(corr, mask=mask, annot=annot, linewidths=.5)
    
    return ax

In [None]:
dropped_cols = ['category.id',
                'category.parent_id',
                'state_num',
                'launched_at_hr',
                'launched_at_dow',
                'launched_at_mo',
                'launched_at_yr',
                'deadline_hr',
                'deadline_dow',
                'deadline_mo',
                'deadline_yr',
                'created_at_hr',
                'created_at_dow',
                'created_at_mo',
                'created_at_yr',
                'count']

corrmatrix(data.drop(dropped_cols, axis=1))

In [None]:
sns.pairplot(data.drop(dropped_cols, axis=1).select_dtypes(include=['int64','float64']), palette="husl")

**K-Means clustering** <br>
<br>
K-means clustering is one of the most widely used unsupervised machine learning algorithms that forms clusters of data based on the similarity between data instances. For this particular algorithm to work, the number of clusters has to be defined beforehand. The K in the K-means refers to the number of clusters.

1. Import kmeans and PCA through the sklearn library
2. Devise an elbow curve to select the optimal number of clusters (k)
3. Generate and visualise the k-means clusters

We need to determine the optimal k. The technique used to determine k, the number of clusters, is called the elbow method.

We plot the values of k on the horizontal axis and the distortion (the value of the cost function) on the vertical axis.
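
For reference, the distortion used in the elbow plot can be written as the mean distance from each point to its nearest centroid. The symbols are introduced here for illustration only: $n$ is the number of campaigns, $x_i$ the scaled feature vector of campaign $i$, and $\mu_j$ the centroid of cluster $j$.

$$
D(k) = \frac{1}{n} \sum_{i=1}^{n} \min_{1 \le j \le k} \lVert x_i - \mu_j \rVert_2
$$

Note that this uses the plain Euclidean distance, whereas the k-means objective itself (scikit-learn's `inertia_`) sums squared distances. Either quantity decreases as k grows, and the "elbow" is the value of k after which the decrease flattens out.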

In [None]:
import matplotlib.pyplot as plt
from scipy.spatial.distance import cdist
from sklearn.cluster import KMeans
from sklearn.preprocessing import scale

selected_cols = ['backers_count', 'goal', 'usd_pledged']

# scale the data so that no single feature dominates the distances
datac = pd.DataFrame(scale(data[selected_cols]), columns=selected_cols)

# first step is to determine k
distortions = []
score = []
K = range(1, 10)
for k in K:
    # fit once per k and reuse the fitted model
    kmeanModel = KMeans(n_clusters=k).fit(datac)
    # distortion: mean Euclidean distance of each point to its closest centroid
    distances = cdist(datac, kmeanModel.cluster_centers_, 'euclidean')
    distortions.append(np.min(distances, axis=1).sum() / datac.shape[0])
    score.append(kmeanModel.score(datac))

# plot the elbow
plt.plot(K, distortions, 'bx-')
plt.xlabel('k')
plt.ylabel('Distortion')
plt.title('The Elbow Method showing the optimal k')
plt.show()


In [None]:
# plot centroids and clusters
from sklearn import cluster
centroids,labels,inertia = cluster.k_means(datac,n_clusters=5)
#Let's check the parameter cluster centers of the estimator

centroids = pd.DataFrame(centroids,
                         columns=['backers_count','goal','usd_pledged'])

In [None]:
sns.heatmap(centroids, cmap='BuPu', annot=True)

<center>**Understanding the marketplace**</center>

Through exploratory analysis, a key understanding of the Kickstarter marketplace is built.

<h4><center>...</center></h4>

**Kickstarter is predominantly domestic** <br>


Although the platform boasts of its global reach, 79.7% of Kickstarter's campaign creators are United States-based.

In [None]:
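def my_autopct(pct):
    # only annotate slices larger than 2%
    return ('%.2f' % pct) if pct > 2 else ''

country_counts = data['country'].value_counts()
# label only the three largest countries and leave the remaining slices unlabelled
labels = [c if i < 3 else '' for i, c in enumerate(country_counts.index)]

country_counts.plot.pie(autopct=my_autopct, labels=labels, figsize=(6,6))
plt.title('Distribution of projects by countries')
plt.tight_layout()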

**Kickstarter is home to the artsy-fartsy and the tech-enthusiast**<br>

In [None]:
data['catg.type'].value_counts().plot(kind = 'bar', title = 'Category Distribution')

In [None]:
sns.catplot(x='catg.type', y='state_num', kind='bar', data=data, height=6)
locs, labels = plt.xticks();
plt.title('Percentage of successful projects per category')
plt.setp(labels, rotation=90);

**Creation of campaigns varies by time** <br>
<br>
The timeline of a Kickstarter campaign includes creation, where creators set up the funding page and marketing material, and launch, where creators publish the campaign so that others can start backing it. Creation and launch are more popular on weekdays (Monday - Friday) between 4 PM and 4 AM. The timing of campaign creation and launch coincides with average work hours, suggesting that people create Kickstarter campaigns during the workday.

In [None]:
# Project creation density by dow and hr
c_dow_vs_hr = data.groupby(['created_at_dow', 'created_at_hr']).size().reset_index(name='count')
c_dow_vs_hr = c_dow_vs_hr.pivot(index='created_at_dow', columns='created_at_hr', values='count')

# Project launched density by dow and hr
l_dow_vs_hr = data.groupby(['launched_at_dow', 'launched_at_hr']).size().reset_index(name='count')
l_dow_vs_hr = l_dow_vs_hr.pivot(index='launched_at_dow', columns='launched_at_hr', values='count')

In [None]:
#plot heatmap
fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(20,5))
fig.subplots_adjust(hspace=0, wspace = 0.2)

sns.heatmap(c_dow_vs_hr, ax=axes[0], cmap='BuPu')
sns.heatmap(l_dow_vs_hr, ax=axes[1], cmap='BuPu')

In [None]:
fig, axes = plt.subplots(nrows=3, ncols=2, figsize=(15,15))
fig.subplots_adjust(hspace=0.5, wspace = 0.3)

sns.countplot(x="launched_at_hr", hue="success", data=data, ax=axes[0, 0]).set_title('Launched at Hour Distribution')
sns.countplot(x="created_at_hr", hue="success", data=data, ax=axes[1, 0]).set_title('Created at Hour Distribution')
sns.countplot(x="deadline_hr", hue="success", data=data, ax=axes[2, 0]).set_title('Deadline Hour Distribution')
sns.countplot(x="launched_at_dow", hue="success", data=data, ax=axes[0, 1]).set_title('Launched at weekday Distribution')
sns.countplot(x="created_at_dow", hue="success", data=data, ax=axes[1, 1]).set_title('Created at weekday Distribution')
sns.countplot(x="deadline_dow", hue="success", data=data, ax=axes[2, 1]).set_title('Deadline weekday Distribution')

<center>**Finding insights that matter**</center>

After establishing an understanding of Kickstarter campaign creators, the drivers that promote success are determined. Potential features are studied, and a t-test is performed to determine whether each feature significantly distinguishes successful campaigns from failed ones.
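
A minimal sketch of such a test, using `scipy.stats.ttest_ind` on synthetic data. The two samples below are randomly generated stand-ins for one feature, not values from the dataset; Welch's variant (`equal_var=False`) and the 0.05 significance level are assumptions made for the illustration.

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)

# Synthetic values of a single feature for the two groups (made up)
feature_successful = rng.normal(loc=19, scale=5, size=200)
feature_failed = rng.normal(loc=12, scale=5, size=200)

# Welch's t-test: does not assume equal variances in the two groups
t_stat, p_value = ttest_ind(feature_successful, feature_failed, equal_var=False)

alpha = 0.05  # assumed significance level
is_significant = p_value < alpha

print("t = %.2f, p = %.3g, significant: %s" % (t_stat, p_value, is_significant))
```

On the real data, the two arrays would be the values of one feature split by `success`, for example `data.loc[data.success, 'goal']` and `data.loc[~data.success, 'goal']`.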

<h4><center>...</center></h4>

**Setting the right goal is as important as you'd think** <br>
The median goal for a successful campaign in the dataset is USD 5,000, while the median goal for failed campaigns is nearly USD 17,000. We also found that 38% of failed campaigns had a goal of over USD 50,000. The per-category comparison below shows a clear separation in the range of goals ($) between successful and failed campaigns: successful campaigns have much lower, more achievable goals than failed ones.
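
Figures of this kind can be derived with a group-wise median and a boolean mean. The sketch below runs on a made-up frame, so its numbers are illustrative only; the column names `state_num` and `goal` follow the notebook.

```python
import pandas as pd

# Made-up goals: state_num is 1 for successful and 0 for failed campaigns
toy = pd.DataFrame({
    "state_num": [1, 1, 1, 0, 0, 0, 0],
    "goal": [2000, 5000, 8000, 10000, 17000, 60000, 250000],
})

# Median goal per outcome
median_goal = toy.groupby("state_num")["goal"].median()

# Share of failed campaigns whose goal exceeds USD 50,000
failed_goals = toy.loc[toy["state_num"] == 0, "goal"]
share_over_50k = (failed_goals > 50_000).mean()

print(median_goal)
print("Failed campaigns with goal > 50k: %.0f%%" % (share_over_50k * 100))
```

The median is used rather than the mean because a few very large goals would otherwise dominate the comparison.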

In [None]:
sns.catplot(x='catg.type', y='goal', hue='state_num', kind='bar', data=data, height=7)
locs, labels = plt.xticks();
plt.setp(labels, rotation=90);
plt.title('Range of goal ($) among successful and failed projects')
plt.gca().set_yscale("log", nonpositive='clip');

In [None]:
fig, ax = plt.subplots()

df_success = data[(data['state_num'] == 1)]

# cumulative sum of the sorted goals of successful campaigns
df_success = np.sort(df_success.goal).cumsum()

# Percentile values
p = np.array([0.0, 25.0, 50.0, 75.0, 90.0, 100.0])

# percentiles of the cumulative goal
perc_success = np.percentile(df_success, p)

plt.plot(df_success, color='green')
plt.title("Proportion of Successful Campaigns by Goal")
plt.ylabel('Cumulative goal (USD)')
plt.xlabel('Percentage of Campaigns (%)')

# Set tick locations and labels
plt.xticks((len(df_success)-1) * p/100., [str(x) for x in p])
plt.show()

fig, ax = plt.subplots()

df_fail = data[(data['state_num'] == 0)]

# cumulative sum of the sorted goals of failed campaigns
df_fail = np.sort(df_fail.goal).cumsum()

# Percentile values
p = np.array([0.0, 25.0, 50.0, 75.0, 90.0, 100.0])

# percentiles of the cumulative goal
perc_fail = np.percentile(df_fail, p)

plt.plot(df_fail, color='darkred')
plt.title("Proportion of Failed Campaigns by Goal")
plt.ylabel('Cumulative goal (USD)')
plt.xlabel('Percentage of Campaigns (%)')
plt.gca().set_ylim(0,8e8)

# Set tick locations and labels
plt.xticks((len(df_fail)-1) * p/100., [str(x) for x in p])
plt.show()

**Staff picks have significant impact on success**<br>
<br>
Kickstarter’s “staff picks” are given high-value front page real estate as “Projects We Love”. Since these projects are given such high visibility, it’s no surprise that staff pick projects are 9.6 times more likely to be successful than those that aren’t. While it’s understandable that not all projects can be staff picks, we will touch on how Kickstarter can leverage the power of staff picks to improve its platform’s success rate in the implications section.
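
One way to read a "times more likely" figure is as a ratio of success rates between the two groups. The sketch below assumes that reading and uses made-up rows, so the resulting ratio is illustrative and unrelated to the 9.6 quoted above.

```python
import pandas as pd

# Made-up campaigns: 4 staff picks (3 succeed), 8 others (2 succeed)
toy = pd.DataFrame({
    "staff_pick": [True] * 4 + [False] * 8,
    "success": [True, True, True, False] + [True, True] + [False] * 6,
})

# Success rate within each group
picked_rate = toy.loc[toy["staff_pick"], "success"].mean()
other_rate = toy.loc[~toy["staff_pick"], "success"].mean()

# Ratio of success rates (relative likelihood of success)
rate_ratio = picked_rate / other_rate

print("staff picks: %.2f, others: %.2f, ratio: %.1f" % (picked_rate, other_rate, rate_ratio))
```

A ratio of odds rather than rates would give a different number, so it is worth stating which one is meant when reporting the figure.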

Staff picks are another variable of interest. The same sort of analysis made in this notebook could be applied to staff picks: we would be interested in finding which features are associated with a higher probability of Kickstarter staff tagging a campaign as a staff pick. A staff pick seems to boost backer confidence and is associated with a higher probability of success.

In [None]:
fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(20,5))
fig.subplots_adjust(hspace=0, wspace = 0.5)

sns.countplot(x="staff_pick", hue="success", ax=axes[0], data=data).set_title('Staff Picked Distribution')
sns.countplot(x="spotlight", hue="success", ax=axes[1], data=data).set_title('Spotlight Distribution')

In [None]:
staff_picked = data.staff_pick.value_counts()

# look up the True count by label rather than by position
pct_staff_picked = staff_picked.loc[True] * 100 / staff_picked.sum()
print("Not so nice, ~ %g%% are staff picked" % round(pct_staff_picked))

pd.crosstab(data.staff_pick, data.success)

**Successful campaigns invest more time in creating the campaign.** <br>
<br>
The median number of days spent between creation and launch for successful campaigns is 19, compared with a median of 12 days for failed campaigns. Furthermore, taking longer than one week to create a campaign makes it 1.83 times more likely to succeed.
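
The comparison can be sketched by splitting the preparation time at a threshold. The rows below are made up, and the one-week cut-off is applied as "strictly more than 7 days", which is an assumption about how the threshold was defined.

```python
import pandas as pd

# Made-up preparation times (days between creation and launch)
toy = pd.DataFrame({
    "launched-created": [25, 19, 21, 10, 4, 3, 12, 30, 5, 14],
    "success": [True] * 5 + [False] * 5,
})

# Median preparation time per outcome
median_success = toy.loc[toy["success"], "launched-created"].median()
median_failed = toy.loc[~toy["success"], "launched-created"].median()

# Success rate above and below the one-week threshold
long_prep = toy["launched-created"] > 7
rate_long = toy.loc[long_prep, "success"].mean()
rate_short = toy.loc[~long_prep, "success"].mean()
prep_ratio = rate_long / rate_short

print("median days: %.0f (successful) vs %.0f (failed)" % (median_success, median_failed))
print("success-rate ratio for > 1 week of preparation: %.2f" % prep_ratio)
```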

In [None]:
data.boxplot('launched-created', by='success')
plt.ylim(-1,90)


In [None]:
fig, ax = plt.subplots()

df_success = data[(data['state_num'] == 1)]

# cumulative sum of the sorted creation-to-launch durations (in days)
d = np.sort(df_success['launched-created']).cumsum()

# Percentile values
p = np.array([0.0, 25.0, 50.0, 75.0, 90.0, 100.0])

# percentiles of the cumulative duration
perc = np.percentile(d, p)

plt.plot(d, color='green')
plt.title("Proportion of Successful Campaigns by launched-creation duration")
plt.ylabel('Cumulative duration (days)')
plt.xlabel('Percentage of Campaigns (%)')

# Set tick locations and labels
plt.xticks((len(d)-1) * p/100., [str(x) for x in p])
plt.show()

fig, ax = plt.subplots()

df_fail = data[(data['state_num'] == 0)]

# cumulative sum of the sorted creation-to-launch durations (in days)
d = np.sort(df_fail['launched-created']).cumsum()

# Percentile values
p = np.array([0.0, 25.0, 50.0, 75.0, 90.0, 100.0])

# percentiles of the cumulative duration
perc = np.percentile(d, p)

plt.plot(d, color='darkred')
plt.title("Proportion of Failed Campaigns by launched-creation duration")
plt.ylabel('Cumulative duration (days)')
plt.xlabel('Percentage of Campaigns (%)')
plt.gca().set_ylim(0,3500000)

# Set tick locations and labels
plt.xticks((len(d)-1) * p/100., [str(x) for x in p])
plt.show()