Well, looks like the Fraud-Detection competition got extended by a few days and people are salty. I'm not too invested in the competition, so I'm not bothered by it; I might as well play around with a few more things before it ends.

In this notebook I'm going to work on feature creation and selection. This is an area I have barely touched since going down the data science route, so let's see where it takes us.

In [2]:
import pandas as pd
import numpy as np
import time

In [3]:
t0 = time.time()

train_transaction = pd.read_csv('Data/train_transaction.csv')

t1 = time.time()

print(t1-t0)

41.482211112976074


# Encoding Categorical Variables

The first thing I'm going to play around with is how to deal with categorical variables. I was initially using one-hot (dummy) encoding; now I'm going to try some more advanced methods, namely feature hashing and binning, and compare them to dummy encoding. These methods should ideally save on memory, which will be beneficial to me and my MacBook Air with 4GB of RAM.

In [4]:
#Take an explicit copy so filling missing values doesn't trigger SettingWithCopyWarning
strings = train_transaction.select_dtypes(include='object').copy()
#Missing values become the literal string 'NaN' so they get their own category
strings = strings.fillna('NaN')
dummies = pd.get_dummies(strings)


In [5]:
from sklearn.feature_extraction import FeatureHasher
cols = []
for column in strings:
    classes = len(strings[column].unique())
    h = FeatureHasher(n_features=classes, input_type='string')
    col = h.transform(strings[column])
    col = col.toarray()
    cols.append(col)

In [6]:
print(cols)

[array([[ 0.,  0.,  0.,  0., -1.],
       [ 0.,  0.,  0.,  0., -1.],
       [ 0.,  0.,  0.,  0., -1.],
       ...,
       [ 0.,  0.,  0.,  0., -1.],
       [ 0.,  0.,  0.,  0., -1.],
       [ 0.,  0.,  0.,  0., -1.]]), array([[ 1., -1., -1.,  0.,  1.],
       [ 4.,  1., -1.,  0.,  0.],
       [ 1., -1.,  0., -1., -1.],
       ...,
       [ 4.,  1., -1.,  0.,  0.],
       [ 4.,  1., -1.,  0.,  0.],
       [ 4.,  1., -1.,  0.,  0.]]), array([[ 1., -1., -1.,  0.,  1.],
       [ 1., -1., -1.,  0.,  1.],
       [ 0., -2.,  0.,  0.,  1.],
       ...,
       [ 0., -2.,  0.,  0.,  1.],
       [ 0., -2.,  0.,  0.,  1.],
       [ 1., -1., -1.,  0.,  1.]]), array([[ 0.,  0.,  0., ...,  0.,  0.,  0.],
       [ 0.,  0., -1., ...,  0.,  0.,  0.],
       [ 0.,  0.,  0., ...,  0.,  0.,  0.],
       ...,
       [ 0.,  0., -1., ...,  0.,  0.,  0.],
       [ 0.,  0.,  0., ...,  0.,  0.,  0.],
       [ 0.,  0., -1., ...,  0.,  0.,  0.]]), array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 

Ok, looks like hashing doesn't do exactly what I thought it would. Let's try factorizing (integer label encoding) instead. This isn't ideal in theory because it implies that the categories are somehow ordered, but we'll see how it fares in practice.
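
A likely reason for the odd output: with `input_type='string'`, `FeatureHasher` expects each sample to be an iterable of strings. In the cell above each sample is a bare string, so it gets iterated character by character, and the hashed characters pile up into values like 4 and -2. Here is a minimal sketch of the usage I probably wanted, on a made-up column (`toy` and `n_features=8` are illustrative choices, not taken from the dataset):

```python
import pandas as pd
from sklearn.feature_extraction import FeatureHasher

# Toy stand-in for one categorical column (values are illustrative)
toy = pd.Series(['visa', 'mastercard', 'visa', 'NaN', 'discover'])

# n_features=8 is an arbitrary width chosen for this sketch
hasher = FeatureHasher(n_features=8, input_type='string')

# Wrap each value in a list so a sample is a single token,
# not a sequence of characters
hashed = hasher.transform([[value] for value in toy]).toarray()

print(hashed.shape)
print(hashed)
```

Each row now has a single non-zero entry (+1 or -1), and identical categories land in the same slot. Different categories can still collide when `n_features` is small, which is the price of hashing.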

In [7]:
from sklearn.preprocessing import LabelEncoder
encoded = strings.apply(LabelEncoder().fit_transform)
encoded.head(10)

Unnamed: 0,ProductCD,card4,card6,P_emaildomain,R_emaildomain,M1,M2,M3,M4,M5,M6,M7,M8,M9
0,4,2,2,0,0,2,2,2,2,0,2,1,1,1
1,4,3,2,17,0,1,1,1,0,2,2,1,1,1
2,4,4,3,36,0,2,2,2,0,0,0,0,0,0
3,4,3,3,54,0,1,1,1,0,2,0,1,1,1
4,1,3,2,17,0,1,1,1,3,1,1,1,1,1
5,4,4,3,17,0,2,2,2,1,0,2,1,1,1
6,4,4,3,54,0,2,2,2,0,0,0,2,2,2
7,4,4,3,30,0,1,1,1,0,0,0,1,1,1
8,1,4,3,2,0,1,1,1,3,1,1,1,1,1
9,4,3,3,54,0,2,2,2,0,2,2,1,1,1


As I mentioned before, this might save on memory but implies a certain structure (an ordering of the categories) which might not be present. I'll train two models using xgboost and compare their performance on a validation set using the two encoding styles. Note that I'm only including the converted categorical variables as features. I'm omitting the numeric features to save on computation time.

In [8]:
#Splitting the first 100000 rows of data into a training and validation set

fraud = train_transaction['isFraud']
fraud_train = fraud.iloc[:50000]
fraud_val = fraud.iloc[50000:100000]

dummies_train = dummies.iloc[:50000,:]
dummies_val = dummies.iloc[50000:100000,:]

encoded_train = encoded.iloc[:50000,:]
encoded_val = encoded.iloc[50000:100000,:]


In [9]:
import xgboost as xgb

model = xgb.XGBClassifier(
    learning_rate = 0.2,
    n_estimators = 100,
    max_depth = 10
)

In [10]:
#Lets the kernel carry on when duplicate OpenMP runtimes get loaded, which otherwise kills it
import os
os.environ['KMP_DUPLICATE_LIB_OK']='True'

In [27]:
dummies_model = xgb.XGBClassifier(
    learning_rate = 0.2,
    n_estimators = 100,
    max_depth = 10
)

dummies_model.fit(dummies_train, fraud_train, 
          eval_metric = "auc", 
          eval_set= [(dummies_val, fraud_val)],
          early_stopping_rounds = 10
         )

[0]	validation_0-auc:0.764742
Will train until validation_0-auc hasn't improved in 10 rounds.
[1]	validation_0-auc:0.77272
[2]	validation_0-auc:0.773051
[3]	validation_0-auc:0.773278
[4]	validation_0-auc:0.772528
[5]	validation_0-auc:0.772187
[6]	validation_0-auc:0.770965
[7]	validation_0-auc:0.769918
[8]	validation_0-auc:0.769016
[9]	validation_0-auc:0.770176
[10]	validation_0-auc:0.770011
[11]	validation_0-auc:0.769723
[12]	validation_0-auc:0.770169
[13]	validation_0-auc:0.770086
Stopping. Best iteration:
[3]	validation_0-auc:0.773278



XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
       colsample_bynode=1, colsample_bytree=1, gamma=0, learning_rate=0.2,
       max_delta_step=0, max_depth=10, min_child_weight=1, missing=None,
       n_estimators=100, n_jobs=1, nthread=None,
       objective='binary:logistic', random_state=0, reg_alpha=0,
       reg_lambda=1, scale_pos_weight=1, seed=None, silent=None,
       subsample=1, verbosity=1)

In [28]:
factorisation_model = xgb.XGBClassifier(
    learning_rate = 0.2,
    n_estimators = 100,
    max_depth = 10
) 

factorisation_model.fit(encoded_train, fraud_train, 
          eval_metric = "auc", 
          eval_set= [(encoded_val, fraud_val)],
          early_stopping_rounds = 10
         )

[0]	validation_0-auc:0.75978
Will train until validation_0-auc hasn't improved in 10 rounds.
[1]	validation_0-auc:0.762506
[2]	validation_0-auc:0.763862
[3]	validation_0-auc:0.763135
[4]	validation_0-auc:0.764082
[5]	validation_0-auc:0.764363
[6]	validation_0-auc:0.76566
[7]	validation_0-auc:0.764398
[8]	validation_0-auc:0.765039
[9]	validation_0-auc:0.765164
[10]	validation_0-auc:0.764646
[11]	validation_0-auc:0.763739
[12]	validation_0-auc:0.764609
[13]	validation_0-auc:0.764567
[14]	validation_0-auc:0.765997
[15]	validation_0-auc:0.765146
[16]	validation_0-auc:0.76572
[17]	validation_0-auc:0.764974
[18]	validation_0-auc:0.7641
[19]	validation_0-auc:0.764882
[20]	validation_0-auc:0.763617
[21]	validation_0-auc:0.763945
[22]	validation_0-auc:0.763197
[23]	validation_0-auc:0.763793
[24]	validation_0-auc:0.763797
Stopping. Best iteration:
[14]	validation_0-auc:0.765997



XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
       colsample_bynode=1, colsample_bytree=1, gamma=0, learning_rate=0.2,
       max_delta_step=0, max_depth=10, min_child_weight=1, missing=None,
       n_estimators=100, n_jobs=1, nthread=None,
       objective='binary:logistic', random_state=0, reg_alpha=0,
       reg_lambda=1, scale_pos_weight=1, seed=None, silent=None,
       subsample=1, verbosity=1)

Not surprisingly, the model with the categorical variables encoded as dummies does better than the model with categories encoded by factorisation. What is surprising is how small the difference is: the best AUC is 0.773 for dummies versus 0.766 for factorisation, a gap of less than 0.01. The advantage of factorising (done here with LabelEncoder) is that it allows us to cut out 150 columns(!).

In [16]:
print(encoded.shape, dummies.shape)

(590540, 14) (590540, 164)
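
Fewer columns does not automatically mean less memory, since that also depends on the dtype. As far as I know, `LabelEncoder` typically produces 64-bit integers, while dummy columns take one byte per value. A small sketch on made-up data (the `toy` frame and the `int8` downcast are my own illustration, and `int8` only works while a column has fewer than 128 categories):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy frame standing in for a couple of the categorical columns
toy = pd.DataFrame({
    'card4': ['visa', 'mastercard', 'visa', 'discover', 'NaN', 'visa'],
    'card6': ['debit', 'credit', 'debit', 'credit', 'debit', 'NaN'],
})

toy_dummies = pd.get_dummies(toy)
toy_encoded = toy.apply(LabelEncoder().fit_transform)

# Downcast the integer codes; safe here because there are only a few categories
toy_compact = toy_encoded.astype('int8')

print(toy_dummies.shape, toy_encoded.shape)
for name, frame in [('dummies', toy_dummies),
                    ('encoded', toy_encoded),
                    ('compact', toy_compact)]:
    # Total bytes used by the frame, index included
    print(name, frame.memory_usage(deep=True).sum())
```

On the real data the same check would be `dummies.memory_usage(deep=True).sum()` against `encoded.memory_usage(deep=True).sum()`.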


# Feature Selection

Before we get into any feature creation, I think it will be important to set up a feature selection process. I'll experiment with permutation importance first, using the models trained on the categorical variables. After that we can see how it can be applied to our feature creation process.
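
For reference, the idea behind permutation importance is to shuffle one column of the validation data at a time and see how much the score drops. Below is a hand-rolled sketch on synthetic data. The data, the logistic regression model and the use of AUC as the score are all assumptions made for illustration; as far as I know, eli5 repeats the shuffle several times and uses the model's own `score` method unless a `scoring` argument is passed.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
n = 2000

# Column 0 drives the target, column 1 is pure noise
signal = rng.normal(size=n)
noise = rng.normal(size=n)
X = np.column_stack([signal, noise])
y = (signal + 0.5 * rng.normal(size=n) > 0).astype(int)

X_train, X_val = X[:1000], X[1000:]
y_train, y_val = y[:1000], y[1000:]

clf = LogisticRegression().fit(X_train, y_train)
baseline = roc_auc_score(y_val, clf.predict_proba(X_val)[:, 1])

importances = []
for j in range(X_val.shape[1]):
    X_perm = X_val.copy()
    # Shuffle a single column, breaking its link with the target
    X_perm[:, j] = rng.permutation(X_perm[:, j])
    permuted = roc_auc_score(y_val, clf.predict_proba(X_perm)[:, 1])
    # Importance = how much the score drops when the column is scrambled
    importances.append(baseline - permuted)

print(importances)
```

The informative column should show a large drop and the noise column a drop close to zero, which can even come out slightly negative, just like the negative weights eli5 sometimes reports.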

In [33]:
import eli5
from eli5.sklearn import PermutationImportance

perm_dummies = PermutationImportance(dummies_model, random_state=1).fit(dummies_val, fraud_val)


In [45]:
perm_dummies_df = eli5.explain_weights_df(perm_dummies, feature_names = dummies_val.columns.tolist())
perm_dummies_df.head(30)

Unnamed: 0,feature,weight,std
0,M6_NaN,0.00122,7e-05
1,M5_T,0.001196,8.5e-05
2,M4_M2,0.000636,6.9e-05
3,card6_debit,0.0005,9.6e-05
4,P_emaildomain_NaN,0.000284,8.3e-05
5,card4_visa,0.000212,2.7e-05
6,P_emaildomain_hotmail.com,0.000204,9.9e-05
7,P_emaildomain_live.com,0.000196,1.5e-05
8,M1_NaN,0.000188,3.2e-05
9,ProductCD_H,0.00012,7.6e-05


In [44]:
perm_factorize = PermutationImportance(factorisation_model, random_state=1).fit(encoded_val, fraud_val)
perm_factorize_df = eli5.explain_weights_df(perm_factorize, feature_names = encoded_val.columns.tolist())
perm_factorize_df

Unnamed: 0,feature,weight,std
0,M5,0.005652,0.000447
1,P_emaildomain,0.001656,0.000204
2,M4,0.001464,8.9e-05
3,ProductCD,0.000448,0.000105
4,card6,0.000436,8.4e-05
5,M6,0.0004,9.2e-05
6,R_emaildomain,8.4e-05,5e-05
7,M1,6.4e-05,2.9e-05
8,M7,-2.8e-05,2e-05
9,card4,-8.4e-05,8.5e-05


These two methods provide different results which are a bit hard to reconcile. Some individual categories carry a lot of weight, e.g. M6_NaN tops the dummy ranking, but the variable as a whole does not: M6's weight in the factorised ranking is quite small.

It seems that encoding by factorisation can mask information, because important categories are mixed in with non-informative ones. While it does save on computation, perhaps the best method is to use dummy encoding and then select down to the variables with positive weights? Let's see how it goes.
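
One rough way to put the two rankings side by side is to add up the dummy weights per original variable. The sketch below does this for the ten rows of `perm_dummies_df` printed above. The helper `parent_of` is hypothetical, and summing is only a heuristic: permuting one dummy at a time is not the same as permuting the whole variable.

```python
import pandas as pd

# The ten rows of perm_dummies_df shown in the output above
weights = pd.DataFrame({
    'feature': ['M6_NaN', 'M5_T', 'M4_M2', 'card6_debit', 'P_emaildomain_NaN',
                'card4_visa', 'P_emaildomain_hotmail.com',
                'P_emaildomain_live.com', 'M1_NaN', 'ProductCD_H'],
    'weight': [0.00122, 0.001196, 0.000636, 0.0005, 0.000284,
               0.000212, 0.000204, 0.000196, 0.000188, 0.00012],
})

# Original categorical column names, as in the encoded frame
parents = ['ProductCD', 'card4', 'card6', 'P_emaildomain', 'R_emaildomain',
           'M1', 'M2', 'M3', 'M4', 'M5', 'M6', 'M7', 'M8', 'M9']

def parent_of(dummy_name):
    # get_dummies names columns '<column>_<category>'; match on the known
    # prefix because names like P_emaildomain contain an underscore themselves
    for parent in parents:
        if dummy_name.startswith(parent + '_'):
            return parent
    return dummy_name

weights['parent'] = weights['feature'].map(parent_of)
by_parent = (weights.groupby('parent')['weight']
             .sum()
             .sort_values(ascending=False))
print(by_parent)
```

On these rows P_emaildomain moves up to third place once its three dummies are combined, which is closer to its position in the factorised ranking.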

In [53]:
important_dummies = perm_dummies_df[perm_dummies_df['weight']>0]['feature']
dummies_subset = dummies[important_dummies]

dummies_subset_train = dummies_subset.iloc[:50000,:]
dummies_subset_val = dummies_subset.iloc[50000:100000,:]

In [55]:
dummies_subset_model = xgb.XGBClassifier(
    learning_rate = 0.2,
    n_estimators = 100,
    max_depth = 10
)

dummies_subset_model.fit(dummies_subset_train, fraud_train, 
          eval_metric = "auc", 
          eval_set= [(dummies_subset_val, fraud_val)],
          early_stopping_rounds = 10
         )

[0]	validation_0-auc:0.679431
Will train until validation_0-auc hasn't improved in 10 rounds.
[1]	validation_0-auc:0.683594
[2]	validation_0-auc:0.685619
[3]	validation_0-auc:0.699676
[4]	validation_0-auc:0.698786
[5]	validation_0-auc:0.700591
[6]	validation_0-auc:0.704102
[7]	validation_0-auc:0.705296
[8]	validation_0-auc:0.704642
[9]	validation_0-auc:0.704185
[10]	validation_0-auc:0.703001
[11]	validation_0-auc:0.702291
[12]	validation_0-auc:0.703659
[13]	validation_0-auc:0.70248
[14]	validation_0-auc:0.701189
[15]	validation_0-auc:0.70057
[16]	validation_0-auc:0.699068
[17]	validation_0-auc:0.698954
Stopping. Best iteration:
[7]	validation_0-auc:0.705296



XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
       colsample_bynode=1, colsample_bytree=1, gamma=0, learning_rate=0.2,
       max_delta_step=0, max_depth=10, min_child_weight=1, missing=None,
       n_estimators=100, n_jobs=1, nthread=None,
       objective='binary:logistic', random_state=0, reg_alpha=0,
       reg_lambda=1, scale_pos_weight=1, seed=None, silent=None,
       subsample=1, verbosity=1)

Performance suffers when we omit the variables without a positive weight: the best AUC drops from 0.773 to 0.705. Perhaps we should change our threshold?

In [60]:
important_dummies = perm_dummies_df[perm_dummies_df['weight']>=0]['feature']
dummies_subset = dummies[important_dummies]

dummies_subset_train = dummies_subset.iloc[:50000,:]
dummies_subset_val = dummies_subset.iloc[50000:100000,:]

dummies_subset_model = xgb.XGBClassifier(
    learning_rate = 0.2,
    n_estimators = 100,
    max_depth = 10
)

dummies_subset_model.fit(dummies_subset_train, fraud_train, 
          eval_metric = "auc", 
          eval_set= [(dummies_subset_val, fraud_val)],
          early_stopping_rounds = 10
         )

[0]	validation_0-auc:0.754834
Will train until validation_0-auc hasn't improved in 10 rounds.
[1]	validation_0-auc:0.754935
[2]	validation_0-auc:0.762978
[3]	validation_0-auc:0.762437
[4]	validation_0-auc:0.759588
[5]	validation_0-auc:0.76182
[6]	validation_0-auc:0.763075
[7]	validation_0-auc:0.75833
[8]	validation_0-auc:0.752506
[9]	validation_0-auc:0.752666
[10]	validation_0-auc:0.752094
[11]	validation_0-auc:0.752204
[12]	validation_0-auc:0.749589
[13]	validation_0-auc:0.750293
[14]	validation_0-auc:0.752488
[15]	validation_0-auc:0.755306
[16]	validation_0-auc:0.755774
Stopping. Best iteration:
[6]	validation_0-auc:0.763075



XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
       colsample_bynode=1, colsample_bytree=1, gamma=0, learning_rate=0.2,
       max_delta_step=0, max_depth=10, min_child_weight=1, missing=None,
       n_estimators=100, n_jobs=1, nthread=None,
       objective='binary:logistic', random_state=0, reg_alpha=0,
       reg_lambda=1, scale_pos_weight=1, seed=None, silent=None,
       subsample=1, verbosity=1)

Performance improves (best AUC 0.763), but we end up including almost all the variables in the full dummies dataframe. There are a lot of borderline variables that appear to make no difference individually, but their inclusion as a whole makes a noticeable impact.

As a side note, perhaps it might be worth joining the two methods together to see how they fare?

In [63]:
all_cat = pd.concat([dummies,encoded],axis=1)

all_cat_train = all_cat.iloc[:50000,:]
all_cat_val = all_cat.iloc[50000:100000,:]

all_cat_model = xgb.XGBClassifier(
    learning_rate = 0.2,
    n_estimators = 100,
    max_depth = 10
)

all_cat_model.fit(all_cat_train, fraud_train, 
          eval_metric = "auc", 
          eval_set= [(all_cat_val, fraud_val)],
          early_stopping_rounds = 30
         )

[0]	validation_0-auc:0.767448
Will train until validation_0-auc hasn't improved in 30 rounds.
[1]	validation_0-auc:0.769066
[2]	validation_0-auc:0.769952
[3]	validation_0-auc:0.768474
[4]	validation_0-auc:0.767363
[5]	validation_0-auc:0.769396
[6]	validation_0-auc:0.766722
[7]	validation_0-auc:0.766311
[8]	validation_0-auc:0.765077
[9]	validation_0-auc:0.766475
[10]	validation_0-auc:0.766291
[11]	validation_0-auc:0.766952
[12]	validation_0-auc:0.766478
[13]	validation_0-auc:0.763708
[14]	validation_0-auc:0.765503
[15]	validation_0-auc:0.765872
[16]	validation_0-auc:0.765911
[17]	validation_0-auc:0.767149
[18]	validation_0-auc:0.765496
[19]	validation_0-auc:0.76499
[20]	validation_0-auc:0.763375
[21]	validation_0-auc:0.763432
[22]	validation_0-auc:0.762955
[23]	validation_0-auc:0.762984
[24]	validation_0-auc:0.763538
[25]	validation_0-auc:0.763855
[26]	validation_0-auc:0.763139
[27]	validation_0-auc:0.763307
[28]	validation_0-auc:0.763266
[29]	validation_0-auc:0.763184
[30]	validation_0

XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
       colsample_bynode=1, colsample_bytree=1, gamma=0, learning_rate=0.2,
       max_delta_step=0, max_depth=10, min_child_weight=1, missing=None,
       n_estimators=100, n_jobs=1, nthread=None,
       objective='binary:logistic', random_state=0, reg_alpha=0,
       reg_lambda=1, scale_pos_weight=1, seed=None, silent=None,
       subsample=1, verbosity=1)

In [64]:
all_cat_perm = PermutationImportance(all_cat_model).fit(all_cat_val, fraud_val)

In [65]:
all_cat_perm_df = eli5.explain_weights_df(all_cat_perm, feature_names = all_cat.columns.tolist())
all_cat_perm_df

Unnamed: 0,feature,weight,std
0,M6_NaN,0.001300,0.000129
1,M5_T,0.001216,0.000167
2,M4_M2,0.000592,0.000079
3,P_emaildomain_NaN,0.000568,0.000137
4,card6_debit,0.000480,0.000057
5,card4_visa,0.000220,0.000057
6,P_emaildomain_outlook.com,0.000216,0.000079
7,P_emaildomain_live.com,0.000216,0.000020
8,P_emaildomain_hotmail.com,0.000212,0.000061
9,M1_NaN,0.000172,0.000052


Interesting results that I'm not entirely sure what to make of. P_emaildomain was very important when we factorized the variables, but when we combine it with the dummies, it sinks to the bottom in importance. My hypothesis is that once the individual email domains are separated out as dummies, the aggregated P_emaildomain column no longer adds anything useful.

But the question still remains: how do we encode our categorical variables? Dummy encoding gives marginally better performance but requires many variables; factorising gives slightly worse performance but allows us to drop 150 variables. With my limited computational power, I will go with the latter and use factorisation.

Another option is to use some sort of hybrid method, but I'm not sure how that would be best implemented. Something to be explored next time?
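
One possible shape for such a hybrid, purely as a sketch and not something I have tested on the competition data: keep dummy columns for the most frequent categories and bin everything else into a single "other" column. The toy values and the `top_k` cutoff below are arbitrary.

```python
import pandas as pd

# Toy stand-in for a high-cardinality column such as P_emaildomain
toy = pd.Series(['gmail.com', 'gmail.com', 'yahoo.com', 'gmail.com',
                 'hotmail.com', 'yahoo.com', 'aim.com', 'NaN'])

top_k = 2  # arbitrary cutoff for this sketch

# Categories that occur often enough to keep their own dummy column
frequent = toy.value_counts().nlargest(top_k).index

# Everything else is binned into a single 'other' category
lumped = toy.where(toy.isin(frequent), 'other')

hybrid = pd.get_dummies(lumped, prefix='P_emaildomain')
print(hybrid.columns.tolist())
```

This keeps the column count at `top_k + 1` per variable. The cutoff would need tuning, since categories like M6_NaN matter on their own and should not be binned away.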

Let's move on to feature creation with the numeric data. I'll do this in a separate notebook to save on memory.