# Feature Engineering
## Baseline model

In [1]:
import pandas as pd
ks = pd.read_csv('../data/ks-projects-201801.csv', parse_dates=['deadline', 'launched'])

In [2]:
ks.shape

(378661, 15)

In [3]:
ks.head(10)

Unnamed: 0,ID,name,category,main_category,currency,deadline,goal,launched,pledged,state,backers,country,usd pledged,usd_pledged_real,usd_goal_real
0,1000002330,The Songs of Adelaide & Abullah,Poetry,Publishing,GBP,2015-10-09,1000.0,2015-08-11 12:12:28,0.0,failed,0,GB,0.0,0.0,1533.95
1,1000003930,Greeting From Earth: ZGAC Arts Capsule For ET,Narrative Film,Film & Video,USD,2017-11-01,30000.0,2017-09-02 04:43:57,2421.0,failed,15,US,100.0,2421.0,30000.0
2,1000004038,Where is Hank?,Narrative Film,Film & Video,USD,2013-02-26,45000.0,2013-01-12 00:20:50,220.0,failed,3,US,220.0,220.0,45000.0
3,1000007540,ToshiCapital Rekordz Needs Help to Complete Album,Music,Music,USD,2012-04-16,5000.0,2012-03-17 03:24:11,1.0,failed,1,US,1.0,1.0,5000.0
4,1000011046,Community Film Project: The Art of Neighborhoo...,Film & Video,Film & Video,USD,2015-08-29,19500.0,2015-07-04 08:35:03,1283.0,canceled,14,US,1283.0,1283.0,19500.0
5,1000014025,Monarch Espresso Bar,Restaurants,Food,USD,2016-04-01,50000.0,2016-02-26 13:38:27,52375.0,successful,224,US,52375.0,52375.0,50000.0
6,1000023410,Support Solar Roasted Coffee & Green Energy! ...,Food,Food,USD,2014-12-21,1000.0,2014-12-01 18:30:44,1205.0,successful,16,US,1205.0,1205.0,1000.0
7,1000030581,Chaser Strips. Our Strips make Shots their B*tch!,Drinks,Food,USD,2016-03-17,25000.0,2016-02-01 20:05:12,453.0,failed,40,US,453.0,453.0,25000.0
8,1000034518,SPIN - Premium Retractable In-Ear Headphones w...,Product Design,Design,USD,2014-05-29,125000.0,2014-04-24 18:14:43,8233.0,canceled,58,US,8233.0,8233.0,125000.0
9,100004195,STUDIO IN THE SKY - A Documentary Feature Film...,Documentary,Film & Video,USD,2014-08-10,65000.0,2014-07-11 21:55:48,6240.57,canceled,43,US,6240.57,6240.57,65000.0


We want to predict whether a Kickstarter project will succeed.

In [4]:
ks.state.unique()

array(['failed', 'canceled', 'successful', 'live', 'undefined',
       'suspended'], dtype=object)

In [5]:
pd.unique(ks.state)

array(['failed', 'canceled', 'successful', 'live', 'undefined',
       'suspended'], dtype=object)

In [6]:
ks.state.value_counts()

failed        197719
successful    133956
canceled       38779
undefined       3562
live            2799
suspended       1846
Name: state, dtype: int64

In [7]:
ks.groupby('state')['ID'].count()

state
canceled       38779
failed        197719
live            2799
successful    133956
suspended       1846
undefined       3562
Name: ID, dtype: int64

We can drop 'live' projects, since their outcome is not yet known, and count 'successful' states as `outcome = 1`. Every other state becomes `outcome = 0`.

In [8]:
# ks = ks.loc[ks.state != 'live']

ks = ks.query('state != "live"')

In [9]:
ks.state.unique()

array(['failed', 'canceled', 'successful', 'undefined', 'suspended'],
      dtype=object)

In [10]:
# ks['outcome'] = ks.state.map(lambda x: 1 if x == 'successful' else 0)

ks = ks.assign(outcome=(ks['state'] == 'successful').astype(int))

The `launched` timestamp can be broken up into separate hour, day, month, and year features using the `.dt` accessor.

In [11]:
ks.launched

0        2015-08-11 12:12:28
1        2017-09-02 04:43:57
2        2013-01-12 00:20:50
3        2012-03-17 03:24:11
4        2015-07-04 08:35:03
                 ...        
378656   2014-09-17 02:35:30
378657   2011-06-22 03:35:14
378658   2010-07-01 19:40:30
378659   2016-01-13 18:13:53
378660   2011-07-19 09:07:47
Name: launched, Length: 375862, dtype: datetime64[ns]

In [12]:
ks = ks.assign(hour=ks.launched.dt.hour, day=ks.launched.dt.day, 
          month=ks.launched.dt.month, year=ks.launched.dt.year)

In [13]:
ks.dtypes

ID                           int64
name                        object
category                    object
main_category               object
currency                    object
deadline            datetime64[ns]
goal                       float64
launched            datetime64[ns]
pledged                    float64
state                       object
backers                      int64
country                     object
usd pledged                float64
usd_pledged_real           float64
usd_goal_real              float64
outcome                      int64
hour                         int64
day                          int64
month                        int64
year                         int64
dtype: object

### Preparing categorical variables

In [14]:
cat_features = ['category', 'currency', 'country']
for col in cat_features:
    print(col.upper(), ks[col].nunique(), sep=', Cardinality: ')
    print(ks[col].value_counts())
    print()
    print()

CATEGORY, Cardinality: 159
Product Design     22077
Documentary        16082
Music              15647
Tabletop Games     14072
Shorts             12311
                   ...  
Residencies           69
Letterpress           48
Chiptune              35
Literary Spaces       23
Taxidermy             13
Name: category, Length: 159, dtype: int64


CURRENCY, Cardinality: 14
USD    293624
GBP     33853
EUR     17076
CAD     14830
AUD      7880
SEK      1768
MXN      1645
NZD      1464
DKK      1113
CHF       754
NOK       714
HKD       583
SGD       527
JPY        31
Name: currency, dtype: int64


COUNTRY, Cardinality: 23
US      290887
GB       33393
CA       14624
AU        7769
DE        4096
N,0"      3796
FR        2887
NL        2833
IT        2802
ES        2224
SE        1737
MX        1645
NZ        1436
DK        1097
IE         800
CH         747
NO         700
BE         605
HK         583
AT         582
SG         527
LU          61
JP          31
Name: country, dtype: int64




The cardinality of these columns is fairly high (159 categories, 14 currencies, 23 countries), so we will label encode them rather than one-hot encode them. Scikit-learn provides a `LabelEncoder` for this.

In [15]:
from sklearn.preprocessing import LabelEncoder

cat_features = ['category', 'currency', 'country']
encoder = LabelEncoder()

# Apply the label encoder to each column
encoded = ks[cat_features].apply(encoder.fit_transform)

In [16]:
encoded.head()

Unnamed: 0,category,currency,country
0,108,5,9
1,93,13,22
2,93,13,22
3,90,13,22
4,55,13,22
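
For comparison, here is a minimal sketch of what one-hot encoding would do instead. The frame `toy` and its values are made up for illustration and stand in for `ks[cat_features]`. One-hot encoding creates one indicator column per distinct value, so on the real data it would produce 159 + 14 + 23 = 196 columns rather than 3.

```python
import pandas as pd

# Toy frame standing in for ks[cat_features]; the values are illustrative only.
toy = pd.DataFrame({
    'category': ['Poetry', 'Music', 'Drinks', 'Music'],
    'currency': ['GBP', 'USD', 'USD', 'EUR'],
    'country': ['GB', 'US', 'US', 'DE'],
})

# One-hot encoding creates one indicator column per distinct value.
one_hot = pd.get_dummies(toy)

# Total number of distinct values across the three columns.
n_levels = int(toy.nunique().sum())

print(one_hot.shape[1], n_levels)
```

Both numbers match: the width of the one-hot frame grows with the total number of distinct values, whereas label encoding keeps one integer column per feature.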


If need be, we can get the inverse transformation.

In [17]:
encoder.inverse_transform(encoded.country)

array(['GB', 'US', 'US', ..., 'US', 'US', 'US'], dtype=object)

However, because `apply` refit the same `encoder` on each column in turn, it only remembers the last one (`country`) and has 'forgotten' the earlier columns' mappings. We could avoid this in the future by looping through `cat_features` and creating a separate encoder for each column.

In [18]:
encoder.classes_

array(['AT', 'AU', 'BE', 'CA', 'CH', 'DE', 'DK', 'ES', 'FR', 'GB', 'HK',
       'IE', 'IT', 'JP', 'LU', 'MX', 'N,0"', 'NL', 'NO', 'NZ', 'SE', 'SG',
       'US'], dtype=object)
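
A minimal sketch of the per-column approach, assuming a small made-up frame `toy` in place of `ks`. The dictionary `encoders` and the frame `encoded_toy` are hypothetical names, not part of the notebook.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy frame standing in for ks; the values are illustrative only.
toy = pd.DataFrame({
    'category': ['Poetry', 'Music', 'Drinks', 'Music'],
    'currency': ['GBP', 'USD', 'USD', 'EUR'],
    'country': ['GB', 'US', 'US', 'DE'],
})
cat_features = ['category', 'currency', 'country']

# Keep one fitted encoder per column so every mapping stays available.
encoders = {}
encoded_toy = pd.DataFrame(index=toy.index)
for col in cat_features:
    encoders[col] = LabelEncoder()
    encoded_toy[col] = encoders[col].fit_transform(toy[col])

# Any column can now be decoded with its own encoder, not just the last one.
decoded_currency = encoders['currency'].inverse_transform(encoded_toy['currency'])
print(list(decoded_currency))
print(list(encoders['category'].classes_))
```

Keeping the fitted encoders around also makes it possible to apply the same mappings to new data later with `transform`.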

We can join the label-encoded categorical features with the other features we want in the model. Both frames share the same index, so `join` aligns them row by row.

In [19]:
data = ks[['goal', 'hour', 'day', 'month', 'year', 'outcome']].join(encoded)

In [20]:
data.head()

Unnamed: 0,goal,hour,day,month,year,outcome,category,currency,country
0,1000.0,12,11,8,2015,0,108,5,9
1,30000.0,4,2,9,2017,0,93,13,22
2,45000.0,0,12,1,2013,0,93,13,22
3,5000.0,3,17,3,2012,0,90,13,22
4,19500.0,8,4,7,2015,0,55,13,22


In [21]:
ks.shape

(375862, 20)

We can use 10% of the data for validation, 10% for testing, and 80% for training. In general, we'll want to shuffle our dataset first, in case the rows are stored in some meaningful order.

In [22]:
# Shuffle
# data.sample(frac=1).reset_index(drop=True)

In [23]:
valid_fraction = 0.1
valid_size = int(len(data) * valid_fraction)
train = data[:-2 * valid_size]
valid = data[-2 * valid_size:-valid_size]
test = data[-valid_size:]
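
The split above is taken in file order, because the shuffle cell is commented out. A minimal sketch of the shuffled variant follows, on a made-up frame `toy`; the fixed `random_state` is an assumption added here so the split is reproducible.

```python
import pandas as pd

# Toy frame standing in for `data`; the values are illustrative only.
toy = pd.DataFrame({'goal': range(100), 'outcome': [0, 1] * 50})

# Shuffle all rows reproducibly, then discard the old index.
shuffled = toy.sample(frac=1, random_state=0).reset_index(drop=True)

# Same positional 80/10/10 split as in the cell above.
valid_fraction = 0.1
valid_size = int(len(shuffled) * valid_fraction)
toy_train = shuffled[:-2 * valid_size]
toy_valid = shuffled[-2 * valid_size:-valid_size]
toy_test = shuffled[-valid_size:]

print(len(toy_train), len(toy_valid), len(toy_test))
```

Every row lands in exactly one of the three parts, so the sizes come out as 80, 10, and 10 rows.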

We also want to make sure that each data set has the same proportion of target classes. 

In [24]:
for each in [train, valid, test]:
    print('Outcome fraction = {:.4f}'.format(each.outcome.mean()))

Outcome fraction = 0.3570
Outcome fraction = 0.3539
Outcome fraction = 0.3542


We could also have used scikit-learn's built-in `StratifiedShuffleSplit`, which shuffles the rows and preserves the class proportions in each split.

In [25]:
from sklearn.model_selection import StratifiedShuffleSplit
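
A minimal sketch of how `StratifiedShuffleSplit` could produce an 80/10/10 split, assuming a synthetic frame `toy` with 35% positive outcomes. The two-stage approach and all variable names are illustrative choices, not the notebook's method.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit

# Synthetic stand-in for `data`: 1000 rows, 35% with outcome = 1.
rng = np.random.RandomState(0)
toy = pd.DataFrame({
    'goal': rng.uniform(100, 10000, size=1000),
    'outcome': np.repeat([0, 1], [650, 350]),
})

# First split: hold out 20% of the rows, stratified on the target.
outer = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_idx, rest_idx = next(outer.split(toy, toy['outcome']))
strat_train, rest = toy.iloc[train_idx], toy.iloc[rest_idx]

# Second split: divide the held-out rows evenly into validation and test.
inner = StratifiedShuffleSplit(n_splits=1, test_size=0.5, random_state=0)
valid_idx, test_idx = next(inner.split(rest, rest['outcome']))
strat_valid, strat_test = rest.iloc[valid_idx], rest.iloc[test_idx]

for part in (strat_train, strat_valid, strat_test):
    print(len(part), round(part['outcome'].mean(), 3))
```

Because both splits are stratified on `outcome`, each part keeps close to the same share of successful projects as the full frame.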

We can use a 'fast, distributed, high performance gradient boosting' model: LightGBM.
https://github.com/microsoft/LightGBM

In [50]:
%%time
import lightgbm as lgb

feature_cols = train.columns.drop('outcome')

dtrain = lgb.Dataset(train[feature_cols], label=train['outcome'])

dvalid = lgb.Dataset(valid[feature_cols], label=valid['outcome'])

param = {'num_leaves': 64, 'objective': 'binary'}
param['metric'] = 'auc'
num_round = 1000
bst = lgb.train(param, dtrain, num_round, valid_sets=[dvalid], 
            early_stopping_rounds=10, verbose_eval=True)

[1]	valid_0's auc: 0.694192
Training until validation scores don't improve for 10 rounds
[2]	valid_0's auc: 0.697026
[3]	valid_0's auc: 0.70002
[4]	valid_0's auc: 0.701645
[5]	valid_0's auc: 0.70601
[6]	valid_0's auc: 0.707926
[7]	valid_0's auc: 0.70945
[8]	valid_0's auc: 0.710437
[9]	valid_0's auc: 0.712047
[10]	valid_0's auc: 0.713417
[11]	valid_0's auc: 0.714648
[12]	valid_0's auc: 0.715791
[13]	valid_0's auc: 0.717431
[14]	valid_0's auc: 0.718216
[15]	valid_0's auc: 0.719381
[16]	valid_0's auc: 0.720884
[17]	valid_0's auc: 0.721617
[18]	valid_0's auc: 0.722789
[19]	valid_0's auc: 0.723307
[20]	valid_0's auc: 0.72501
[21]	valid_0's auc: 0.725721
[22]	valid_0's auc: 0.727384
[23]	valid_0's auc: 0.728268
[24]	valid_0's auc: 0.72865
[25]	valid_0's auc: 0.729141
[26]	valid_0's auc: 0.729552
[27]	valid_0's auc: 0.730459
[28]	valid_0's auc: 0.731047
[29]	valid_0's auc: 0.732472
[30]	valid_0's auc: 0.732801
[31]	valid_0's auc: 0.733166
[32]	valid_0's auc: 0.734182
[33]	valid_0's auc: 0.734

A perfect classifier would have `auc = 1`, and random guessing would have `auc = 0.5`. 
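
To make those two reference points concrete, AUC can be read as the fraction of (positive, negative) pairs in which the positive example gets the higher score, with ties counting half. The helper `pairwise_auc` below is a hypothetical name written for this illustration. It is a plain-Python sketch for small inputs, not how LightGBM or scikit-learn compute the metric internally.

```python
def pairwise_auc(labels, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

labels = [0, 0, 1, 1]

perfect = pairwise_auc(labels, [0.1, 0.2, 0.8, 0.9])        # every positive outranks every negative
uninformative = pairwise_auc(labels, [0.5, 0.5, 0.5, 0.5])  # constant score, all pairs tied
reversed_auc = pairwise_auc(labels, [0.9, 0.8, 0.2, 0.1])   # ranking fully reversed

print(perfect, uninformative, reversed_auc)
```

The three cases give 1.0, 0.5, and 0.0. The validation AUC of about 0.73 above therefore means the model ranks a successful project above an unsuccessful one in roughly 73% of such pairs.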

We've trained the model and used the validation set to find the best iteration. Now we can evaluate it on the test set.

In [35]:
from sklearn import metrics

ypred = bst.predict(test[feature_cols])
score = metrics.roc_auc_score(test['outcome'], ypred)

In [46]:
print('LightGBM Test AUC score: {:.4f}'.format(score))

LightGBM Test AUC score: 0.7476


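We can compare this against XGBoost. The cell below uses `XGBRegressor`, so its predictions are continuous scores rather than class probabilities. Since AUC depends only on how the scores rank the examples, the two models can still be compared.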

In [53]:
%%time
from xgboost import XGBRegressor
boost_model = XGBRegressor(n_estimators=500, eval_metric='auc')

boost_model.fit(train[feature_cols], train['outcome'], 
                early_stopping_rounds=10, 
                eval_set=[(valid[feature_cols], valid['outcome'])], verbose=True)

[0]	validation_0-auc:0.645159
Will train until validation_0-auc hasn't improved in 10 rounds.
[1]	validation_0-auc:0.653088
[2]	validation_0-auc:0.654156
[3]	validation_0-auc:0.654615
[4]	validation_0-auc:0.65478
[5]	validation_0-auc:0.658086
[6]	validation_0-auc:0.658156
[7]	validation_0-auc:0.661804
[8]	validation_0-auc:0.662761
[9]	validation_0-auc:0.665331
[10]	validation_0-auc:0.665792
[11]	validation_0-auc:0.667063
[12]	validation_0-auc:0.668361
[13]	validation_0-auc:0.67011
[14]	validation_0-auc:0.670959
[15]	validation_0-auc:0.673019
[16]	validation_0-auc:0.673959
[17]	validation_0-auc:0.675058
[18]	validation_0-auc:0.675775
[19]	validation_0-auc:0.677246
[20]	validation_0-auc:0.679631
[21]	validation_0-auc:0.680162
[22]	validation_0-auc:0.680595
[23]	validation_0-auc:0.681198
[24]	validation_0-auc:0.681963
[25]	validation_0-auc:0.683131
[26]	validation_0-auc:0.684135
[27]	validation_0-auc:0.684933
[28]	validation_0-auc:0.68585
[29]	validation_0-auc:0.687061
[30]	validation_0-a

[255]	validation_0-auc:0.730723
[256]	validation_0-auc:0.73073
[257]	validation_0-auc:0.730791
[258]	validation_0-auc:0.730814
[259]	validation_0-auc:0.730832
[260]	validation_0-auc:0.73101
[261]	validation_0-auc:0.731168
[262]	validation_0-auc:0.731265
[263]	validation_0-auc:0.73136
[264]	validation_0-auc:0.731375
[265]	validation_0-auc:0.73141
[266]	validation_0-auc:0.731559
[267]	validation_0-auc:0.731643
[268]	validation_0-auc:0.731745
[269]	validation_0-auc:0.731784
[270]	validation_0-auc:0.731897
[271]	validation_0-auc:0.731906
[272]	validation_0-auc:0.7319
[273]	validation_0-auc:0.731966
[274]	validation_0-auc:0.732063
[275]	validation_0-auc:0.732166
[276]	validation_0-auc:0.732196
[277]	validation_0-auc:0.732198
[278]	validation_0-auc:0.732235
[279]	validation_0-auc:0.732239
[280]	validation_0-auc:0.732251
[281]	validation_0-auc:0.732313
[282]	validation_0-auc:0.732461
[283]	validation_0-auc:0.732467
[284]	validation_0-auc:0.732608
[285]	validation_0-auc:0.732626
[286]	validati

XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,
             colsample_bynode=1, colsample_bytree=1, eval_metric='auc', gamma=0,
             importance_type='gain', learning_rate=0.1, max_delta_step=0,
             max_depth=3, min_child_weight=1, missing=None, n_estimators=500,
             n_jobs=1, nthread=None, objective='reg:linear', random_state=0,
             reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
             silent=None, subsample=1, verbosity=1)

In [None]:
boost_pred = boost_model.predict(test[feature_cols])

In [41]:
boost_score = metrics.roc_auc_score(test['outcome'], boost_pred)

In [54]:
print('XGBoost Test AUC score: {:.4f}'.format(boost_score))

XGBoost Test AUC score: 0.7397


The `XGBoost` model scored slightly lower than `LightGBM` on the test set (AUC 0.7397 versus 0.7476) and took 1 min 42 s to train, compared with 6 s for `LightGBM`.