In [2]:
import pandas as pd
ks = pd.read_csv('./input/ks-projects-201801.csv',
                 parse_dates=['deadline', 'launched'])
ks.head(10)

Unnamed: 0.1,Unnamed: 0,ID,name,category,main_category,currency,deadline,goal,launched,pledged,...,n_polysyllable_words,flesch_kincaid_grade_level,flesch_reading_ease,smog_index,gunning_fog_index,coleman_liau_index,automated_readability_index,lix,gulpease_index,wiener_sachtextformel
0,0,1000002330,The Songs of Adelaide & Abullah,Poetry,Publishing,GBP,2015-10-09,1000.0,2015-08-11 12:12:28,0.0,...,1,5.24,66.4,8.841846,10.0,7.680995,4.62,45.0,99.0,7.057
1,1,1000003930,Greeting From Earth: ZGAC Arts Capsule For ET,Narrative Film,Film & Video,USD,2017-11-01,30000.0,2017-09-02 04:43:57,2421.0,...,0,0.72,97.025,3.1291,1.6,3.996687,2.35375,29.0,117.75,0.5838
2,2,1000004038,Where is Hank?,Narrative Film,Film & Video,USD,2013-02-26,45000.0,2013-01-12 00:20:50,220.0,...,0,-2.62,119.19,3.1291,1.2,-4.103777,-2.66,3.0,152.333333,-3.6434
3,3,1000007540,ToshiCapital Rekordz Needs Help to Complete Album,Music,Music,USD,2012-04-16,5000.0,2012-03-17 03:24:11,1.0,...,1,10.74,30.53,8.841846,8.514286,16.091526,11.002857,49.857143,70.428571,7.216829
4,4,1000011046,Community Film Project: The Art of Neighborhoo...,Film & Video,Film & Video,USD,2015-08-29,19500.0,2015-07-04 08:35:03,1283.0,...,3,9.655,40.09,13.023867,18.2,17.249855,12.0075,58.0,64.0,12.1601
5,5,1000014025,Monarch Espresso Bar,Restaurants,Food,USD,2016-04-01,50000.0,2016-02-26 13:38:27,52375.0,...,0,1.313333,90.99,3.1291,1.2,9.615875,8.33,69.666667,129.0,6.093267
6,6,1000023410,Support Solar Roasted Coffee & Green Energy! ...,Food,Food,USD,2014-12-21,1000.0,2014-12-01 18:30:44,1205.0,...,2,14.432143,-2.174643,8.841846,12.828571,17.744623,13.962857,46.357143,103.285714,10.3302
7,7,1000030581,Chaser Strips. Our Strips make Shots their B*tch!,Drinks,Food,USD,2016-03-17,25000.0,2016-02-01 20:05:12,453.0,...,0,-0.755,107.6,3.1291,1.6,6.201631,4.12,4.0,114.0,-3.06745
8,8,1000034518,SPIN - Premium Retractable In-Ear Headphones w...,Product Design,Design,USD,2014-05-29,125000.0,2014-04-24 18:14:43,8233.0,...,1,3.67,75.875,7.168622,6.6,9.141557,6.475,41.5,109.0,5.03255
9,9,100004195,STUDIO IN THE SKY - A Documentary Feature Film...,Documentary,Film & Video,USD,2014-08-10,65000.0,2014-07-11 21:55:48,6240.57,...,1,7.586667,56.7,8.841846,8.044444,10.310975,6.62,42.333333,72.333333,5.286467


In [7]:
ks.columns

Index(['Unnamed: 0', 'ID', 'name', 'category', 'main_category', 'currency',
       'deadline', 'goal', 'launched', 'pledged', 'state', 'backers',
       'country', 'usd pledged', 'usd_pledged_real', 'usd_goal_real',
       'n_words', 'n_sents', 'n_chars', 'n_syllables', 'n_unique_words',
       'n_long_words', 'n_monosyllable_words', 'n_polysyllable_words',
       'flesch_kincaid_grade_level', 'flesch_reading_ease', 'smog_index',
       'gunning_fog_index', 'coleman_liau_index',
       'automated_readability_index', 'lix', 'gulpease_index',
       'wiener_sachtextformel'],
      dtype='object')

<b>Preparing target column</b>

In [6]:
pd.unique(ks.state)

array(['failed', 'canceled', 'successful', 'live', 'undefined',
       'suspended'], dtype=object)

We have six states. How many records are in each?

In [8]:
ks.groupby('state')['ID'].count()

state
canceled       38779
failed        197719
live            2799
successful    133956
suspended       1846
undefined       3562
Name: ID, dtype: int64
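
As a side note, the same per-state counts can also be obtained with `Series.value_counts`, which reports proportions when `normalize=True`. The sketch below runs on a small made-up frame; the `toy` name and its values are illustrative and not taken from the dataset.

```python
import pandas as pd

# Toy data standing in for the real `ks` frame (values are made up)
toy = pd.DataFrame({'state': ['failed', 'successful', 'failed', 'live', 'canceled']})

counts = toy['state'].value_counts()                # records per state, most frequent first
shares = toy['state'].value_counts(normalize=True)  # same, as fractions of the total

print(counts)
print(shares)
```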

Data cleaning isn't the current focus, so we'll simplify this example by:

Dropping projects that are "live"
Counting "successful" states as outcome = 1
Combining every other state as outcome = 0

In [9]:
# Drop live projects
ks = ks.query('state != "live"')

# Add outcome column, "successful" == 1, others are 0
ks = ks.assign(outcome=(ks['state'] == 'successful').astype(int)) #True = 1, False = 0

<b>Converting timestamps</b><br>
I convert the launched feature into categorical features we can use in a model. Since I loaded in the columns as timestamp data, I access date and time values through the .dt attribute on the timestamp column.

In [23]:
# SEE: https://www.geeksforgeeks.org/python-working-with-date-and-time-using-pandas/
ks = ks.assign(hour=ks.launched.dt.hour,
               day=ks.launched.dt.day,
               month=ks.launched.dt.month,
               year=ks.launched.dt.year)
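
To make the `.dt` accessor concrete, here is a minimal sketch that applies it to a single timestamp copied from the first row of the table above. The `launched` and `parts` names are only for illustration.

```python
import pandas as pd

# One launch timestamp, taken from the first row shown earlier
launched = pd.Series(pd.to_datetime(['2015-08-11 12:12:28']))

# Each .dt attribute returns an integer Series aligned with the original
parts = pd.DataFrame({
    'hour': launched.dt.hour,
    'day': launched.dt.day,
    'month': launched.dt.month,
    'year': launched.dt.year,
})
print(parts)
```

The values should match the `hour`, `day`, `month`, and `year` columns that `ks.head()` shows for that project.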

In [24]:
ks.head()

Unnamed: 0.1,Unnamed: 0,ID,name,category,main_category,currency,deadline,goal,launched,pledged,...,coleman_liau_index,automated_readability_index,lix,gulpease_index,wiener_sachtextformel,outcome,hour,day,month,year
0,0,1000002330,The Songs of Adelaide & Abullah,Poetry,Publishing,GBP,2015-10-09,1000.0,2015-08-11 12:12:28,0.0,...,7.680995,4.62,45.0,99.0,7.057,0,12,11,8,2015
1,1,1000003930,Greeting From Earth: ZGAC Arts Capsule For ET,Narrative Film,Film & Video,USD,2017-11-01,30000.0,2017-09-02 04:43:57,2421.0,...,3.996687,2.35375,29.0,117.75,0.5838,0,4,2,9,2017
2,2,1000004038,Where is Hank?,Narrative Film,Film & Video,USD,2013-02-26,45000.0,2013-01-12 00:20:50,220.0,...,-4.103777,-2.66,3.0,152.333333,-3.6434,0,0,12,1,2013
3,3,1000007540,ToshiCapital Rekordz Needs Help to Complete Album,Music,Music,USD,2012-04-16,5000.0,2012-03-17 03:24:11,1.0,...,16.091526,11.002857,49.857143,70.428571,7.216829,0,3,17,3,2012
4,4,1000011046,Community Film Project: The Art of Neighborhoo...,Film & Video,Film & Video,USD,2015-08-29,19500.0,2015-07-04 08:35:03,1283.0,...,17.249855,12.0075,58.0,64.0,12.1601,0,8,4,7,2015


<b>Prepping categorical variables</b><br>
Now for the categorical variables -- category, currency, and country -- I'll need to convert them into integers so our model can use the data. For this I'll use scikit-learn's LabelEncoder. This assigns an integer to each value of the categorical feature and replaces those values with the integers.

In [25]:
from sklearn.preprocessing import LabelEncoder

cat_features = ['category', 'currency', 'country']
encoder = LabelEncoder()

# Apply the label encoder to each column
encoded = ks[cat_features].apply(encoder.fit_transform)
encoded.head(10)

Unnamed: 0,category,currency,country
0,108,5,9
1,93,13,22
2,93,13,22
3,90,13,22
4,55,13,22
5,123,13,22
6,58,13,22
7,41,13,22
8,113,13,22
9,39,13,22
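
One caveat with reusing a single `LabelEncoder` across columns: each `fit_transform` call overwrites the encoder's `classes_`, so afterwards it only remembers the last column. If you need to map integers back to labels, one option is to keep a separate encoder per column. This is a minimal sketch on made-up data; the `toy` frame, `encoded_toy`, and the `encoders` dict are hypothetical names, not part of the notebook.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Made-up categorical data
toy = pd.DataFrame({
    'currency': ['USD', 'GBP', 'USD', 'EUR'],
    'country': ['US', 'GB', 'US', 'DE'],
})

encoders = {}  # hypothetical helper: column name -> fitted encoder
encoded_toy = pd.DataFrame(index=toy.index)
for col in toy.columns:
    encoders[col] = LabelEncoder()
    encoded_toy[col] = encoders[col].fit_transform(toy[col])

print(encoded_toy)
# Integers follow the sorted order of the labels in each column
print(list(encoders['currency'].classes_))
# Recover the original labels from the integers
print(list(encoders['currency'].inverse_transform(encoded_toy['currency'])))
```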


I'll collect all the features we'll use in a new dataframe and use that to train a model.

In [34]:
# ks and encoded share the same index, so they can be joined directly
data = ks[['goal', 'hour', 'day', 'month', 'year', 'outcome']].join(encoded)
data.head(10)

Unnamed: 0,goal,hour,day,month,year,outcome,category,currency,country
0,1000.0,12,11,8,2015,0,108,5,9
1,30000.0,4,2,9,2017,0,93,13,22
2,45000.0,0,12,1,2013,0,93,13,22
3,5000.0,3,17,3,2012,0,90,13,22
4,19500.0,8,4,7,2015,0,55,13,22
5,50000.0,13,26,2,2016,1,123,13,22
6,1000.0,18,1,12,2014,1,58,13,22
7,25000.0,20,1,2,2016,0,41,13,22
8,125000.0,18,24,4,2014,0,113,13,22
9,65000.0,21,11,7,2014,0,39,13,22
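
`DataFrame.join` matches rows by index label rather than by position, which is why dropping the live projects earlier does not cause a misalignment: `encoded` was built from the filtered `ks`, so both frames carry the same labels. A small sketch with made-up values (`left`, `right`, and `joined` are illustrative names):

```python
import pandas as pd

# Index has a gap (label 2 is missing), as happens after filtering rows
left = pd.DataFrame({'goal': [1000.0, 30000.0, 5000.0]}, index=[0, 1, 3])
# Same labels, deliberately in a different order
right = pd.DataFrame({'category': [108, 93, 90]}, index=[3, 0, 1])

joined = left.join(right)  # rows are paired by index label
print(joined)
```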


<b>Creating training, validation, and test splits</b><br>
We need to create data sets for training, validation, and testing. We'll use a fairly simple approach and split the data using slices. We'll use 10% of the data as a validation set, 10% for testing, and the other 80% for training.

In [36]:
#https://stackoverflow.com/questions/34329617/how-colon-works-in-python-pandas
#https://stackoverflow.com/questions/509211/understanding-slice-notation
valid_fraction = 0.1
valid_size = int(len(data) * valid_fraction)

train = data[:-2 * valid_size]             # everything except the last 20% of rows
valid = data[-2 * valid_size:-valid_size]  # from the last-20% mark up to the last-10% mark
test = data[-valid_size:]                  # the final 10% of rows
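
To see which rows each slice selects, here is the same pattern on a made-up 20-row frame. The `toy`, `train_toy`, `valid_toy`, and `test_toy` names are illustrative only.

```python
import pandas as pd

toy = pd.DataFrame({'x': range(20)})   # 20 made-up rows
valid_size = int(len(toy) * 0.1)       # 10% of 20 rows

train_toy = toy[:-2 * valid_size]             # all rows before the last 2 * valid_size
valid_toy = toy[-2 * valid_size:-valid_size]  # the next valid_size rows
test_toy = toy[-valid_size:]                  # the final valid_size rows

print(len(train_toy), len(valid_toy), len(test_toy))
print(valid_toy['x'].tolist(), test_toy['x'].tolist())
```

Note that this pattern assumes `valid_size` is at least 1. With `valid_size == 0`, `toy[:-0]` is the same as `toy[:0]`, which is an empty slice.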

In general you want to be careful that each data set has the same proportion of target classes. I'll print out the fraction of successful outcomes for each of our datasets.

In [37]:
for each in [train, valid, test]:
    print(f"Outcome fraction = {each.outcome.mean():.4f}")

Outcome fraction = 0.3570
Outcome fraction = 0.3539
Outcome fraction = 0.3542


This looks good: each set has around 35% successful outcomes, likely because the data was well randomized beforehand. A good way to guarantee matching proportions is sklearn.model_selection.StratifiedShuffleSplit, but I don't need to use it here.
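
For reference, a stratified split with `StratifiedShuffleSplit` might look like the sketch below. It uses made-up data (100 rows, 35% positives, deliberately sorted so that a plain slice would be unbalanced); the `X`, `y`, and `splitter` names and the `test_size` and `random_state` values are my own assumptions, not settings from this notebook.

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

# Made-up target: 35 positives followed by 65 negatives
y = np.array([1] * 35 + [0] * 65)
X = np.arange(100).reshape(-1, 1)  # placeholder feature column

splitter = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_idx, test_idx = next(splitter.split(X, y))

# Both parts should keep roughly the overall positive rate
print(f"train fraction = {y[train_idx].mean():.4f}")
print(f"test fraction  = {y[test_idx].mean():.4f}")
```

To get separate validation and test sets this way, one approach is to split twice: first hold out 20%, then divide that held-out part in half.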

<b>Training a LightGBM model</b><br>
For this course we'll be using a LightGBM model. This is a tree-based gradient boosting model that often performs very well on tabular data, frequently on par with or better than XGBoost. It's also relatively fast to train. We won't do hyperparameter optimization because that isn't the goal of this course, so our models won't reach the best performance you can get. But you'll still see model performance improve as we do feature engineering.

In [39]:
import lightgbm as lgb

feature_cols = train.columns.drop('outcome')

dtrain = lgb.Dataset(train[feature_cols], label=train['outcome'])
dvalid = lgb.Dataset(valid[feature_cols], label=valid['outcome'])

param = {'num_leaves': 64, 'objective': 'binary'}
param['metric'] = 'auc'
num_round = 1000
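# Stop when validation AUC hasn't improved for 10 rounds; suppress per-round logging.
# (LightGBM 4.0 removed the early_stopping_rounds and verbose_eval arguments in favor of callbacks.)
bst = lgb.train(param, dtrain, num_round, valid_sets=[dvalid],
                callbacks=[lgb.early_stopping(10, verbose=False), lgb.log_evaluation(0)])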

<b>Making predictions & evaluating the model</b><br>
Finally, let's make predictions on the test set with the model and see how well it performs. An important thing to remember is that you can overfit to the validation data. This is why we need a test set that the model never sees until the final evaluation.

In [40]:
from sklearn import metrics
ypred = bst.predict(test[feature_cols])
score = metrics.roc_auc_score(test['outcome'], ypred)

print(f"Test AUC score: {score}")

Test AUC score: 0.747615303004287
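
As a reminder of what the AUC score measures: it is the fraction of (positive, negative) pairs in which the positive example gets the higher predicted score, with ties counted as half. The sketch below checks this on four made-up predictions; `y_true`, `y_score`, and `manual_auc` are illustrative names and unrelated to the Kickstarter data.

```python
from sklearn.metrics import roc_auc_score

# Made-up labels and predicted scores
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

pos = [s for s, t in zip(y_score, y_true) if t == 1]
neg = [s for s, t in zip(y_score, y_true) if t == 0]

# For every positive/negative pair: 1 if ranked correctly, 0.5 if tied, 0 otherwise
pairs = [(p > n) + 0.5 * (p == n) for p in pos for n in neg]
manual_auc = sum(pairs) / len(pairs)

print(manual_auc, roc_auc_score(y_true, y_score))
```

Read this way, a score of about 0.75 means the model ranks a randomly chosen successful project above a randomly chosen unsuccessful one roughly three times out of four.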
