# About this Kernel
This is my first attempt at publishing a kernel on Kaggle. I will try TPOT AutoML on the IEEE fraud transaction dataset. My primary goal is to work with a dataset that combines several feature types (numerical, categorical, datetime) and whose features are not self-explanatory, so I can experiment with preprocessing and feature engineering concepts.


## Credits
Thanks to the kernels submitted by xhlulu and makalesta2. I have used some of their concepts in this kernel:
https://www.kaggle.com/xhlulu/ieee-fraud-efficient-grid-search-with-xgboost
https://www.kaggle.com/makalesta2/fraud-detection-randomforest

In [1]:
import os
import gc
import itertools

import numpy as np
import pandas as pd
from sklearn import preprocessing
from sklearn.metrics import roc_auc_score


# Basic Preprocessing

In [2]:
%%time
train_transaction = pd.read_csv('../input/train_transaction.csv', index_col='TransactionID')
test_transaction = pd.read_csv('../input/test_transaction.csv', index_col='TransactionID')

train_identity = pd.read_csv('../input/train_identity.csv', index_col='TransactionID')
test_identity = pd.read_csv('../input/test_identity.csv', index_col='TransactionID')

sample_submission = pd.read_csv('../input/sample_submission.csv', index_col='TransactionID')

train = train_transaction.merge(train_identity, how='left', left_index=True, right_index=True)
test = test_transaction.merge(test_identity, how='left', left_index=True, right_index=True)

print(train.shape)
print(test.shape)

(590540, 433)
(506691, 432)
CPU times: user 54.1 s, sys: 5.9 s, total: 1min
Wall time: 1min


# Let's treat the TransactionDT feature, per the dataset description given
References: https://www.kaggle.com/wajihullahbaig/up-sampling-on-every-hour

In [3]:
def derive_hour_feature(df,tname):
    """
    Creates an hour of the day feature, encoded as 0-23. 
    Parameters: 
        df : pd.DataFrame
            df to manipulate.
        tname : str
            Name of the time column in df.
    """
    hours = df[tname] / (3600)        
    encoded_hours = np.floor(hours) % 24
    return encoded_hours

train['hours'] = derive_hour_feature(train,'TransactionDT')
test['hours'] = derive_hour_feature(test,'TransactionDT')
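
To check the arithmetic, here is a minimal sketch on a toy frame. It assumes, as the division by 3600 implies, that `TransactionDT` is an offset in seconds; the toy values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy offsets in seconds: 0 s, 1 h, 23 h 59 min 59 s, 24 h, 25.5 h
toy = pd.DataFrame({'TransactionDT': [0, 3600, 86399, 86400, 91800]})

# Same arithmetic as derive_hour_feature: seconds -> hours, floor, wrap at 24
toy['hours'] = np.floor(toy['TransactionDT'] / 3600) % 24

print(toy['hours'].tolist())  # [0.0, 1.0, 23.0, 0.0, 1.0]
```

The last two rows show the wrap-around: an offset of exactly 24 hours maps back to hour 0.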


In [4]:
del train_transaction, train_identity, test_transaction, test_identity

# Drop TransactionDT

In [5]:
X_train = train.drop(['TransactionDT'], axis=1)
X_test = test.drop(['TransactionDT'], axis=1)

In [6]:
del train, test

# Let's do data exploration

In [7]:
total = X_train.isnull().sum().sort_values(ascending=False)
percent = (X_train.isnull().sum()/X_train.isnull().count()).sort_values(ascending=False)
missing_data = pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])

In [8]:
notuseful_features = missing_data[missing_data['Percent']>0.80]

In [9]:
n = np.array(notuseful_features.index)
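
The same missing-value filter can be sketched on a toy frame. The column names below are hypothetical, and `isnull().mean()` is used as a shorthand for the sum/count ratio computed above.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'mostly_missing': [np.nan, np.nan, np.nan, np.nan, 1.0],  # 80% missing
    'all_missing':    [np.nan] * 5,                            # 100% missing
    'complete':       [1, 2, 3, 4, 5],                         # 0% missing
})

# Fraction of missing values per column
missing_fraction = toy.isnull().mean()

# Strict "> 0.80" keeps a column that is exactly 80% missing
to_drop = missing_fraction[missing_fraction > 0.80].index.tolist()
print(to_drop)  # ['all_missing']
```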

In [10]:
# Columns with more than 80% missing values in the training set (see the cell above).
# The list is defined once and dropped from both train and test so the features stay aligned.
cols_to_drop = ['id_25', 'id_07', 'id_08', 'id_21', 'id_26', 'id_22',
       'id_23', 'id_27', 'dist2', 'D7', 'id_18', 'D13', 'D14', 'D12',
       'id_03', 'id_04', 'D6', 'id_33', 'D8', 'D9', 'id_09', 'id_10',
       'id_30', 'id_32', 'id_34', 'id_14', 'V153', 'V155', 'V156', 'V157',
       'V149', 'V148', 'V147', 'V146', 'V142', 'V141', 'V154', 'V163',
       'V158', 'V140', 'V139', 'V138', 'V161', 'V162', 'V159', 'V160',
       'V143', 'V152', 'V151', 'V150', 'V165', 'V166', 'V145', 'V144',
       'V164', 'V324', 'V332', 'V323', 'V339', 'V338', 'V337', 'V335',
       'V334', 'V333', 'V336', 'V331', 'V329', 'V328', 'V327', 'V326',
       'V325', 'V330', 'V322']
X_train.drop(cols_to_drop, axis=1, inplace=True)
X_test.drop(cols_to_drop, axis=1, inplace=True)

In [11]:
num_fs = X_train.dtypes[X_train.dtypes != "object"].index
print("Number of Numerical features: ", len(num_fs))

cat_fs = X_train.dtypes[X_train.dtypes == "object"].index
print("Number of Categorical features: ", len(cat_fs))

Number of Numerical features:  334
Number of Categorical features:  26


In [12]:
n = X_train.select_dtypes(include=object)
for col in n.columns:
    print(col, ':  ', X_train[col].unique())

ProductCD :   ['W' 'H' 'C' 'S' 'R']
card4 :   ['discover' 'mastercard' 'visa' 'american express' nan]
card6 :   ['credit' 'debit' nan 'debit or credit' 'charge card']
P_emaildomain :   [nan 'gmail.com' 'outlook.com' 'yahoo.com' 'mail.com' 'anonymous.com'
 'hotmail.com' 'verizon.net' 'aol.com' 'me.com' 'comcast.net'
 'optonline.net' 'cox.net' 'charter.net' 'rocketmail.com' 'prodigy.net.mx'
 'embarqmail.com' 'icloud.com' 'live.com.mx' 'gmail' 'live.com' 'att.net'
 'juno.com' 'ymail.com' 'sbcglobal.net' 'bellsouth.net' 'msn.com' 'q.com'
 'yahoo.com.mx' 'centurylink.net' 'servicios-ta.com' 'earthlink.net'
 'hotmail.es' 'cfl.rr.com' 'roadrunner.com' 'netzero.net' 'gmx.de'
 'suddenlink.net' 'frontiernet.net' 'windstream.net' 'frontier.com'
 'outlook.es' 'mac.com' 'netzero.com' 'aim.com' 'web.de' 'twc.com'
 'cableone.net' 'yahoo.fr' 'yahoo.de' 'yahoo.es' 'sc.rr.com' 'ptd.net'
 'live.fr' 'yahoo.co.uk' 'hotmail.fr' 'hotmail.de' 'hotmail.co.uk'
 'protonmail.com' 'yahoo.co.jp']
R_emaildomain :   

In [13]:
## Let's see the distribution of the categories:
for cat in list(cat_fs):
    print('Distribution of feature:', cat)
    print(X_train[cat].value_counts(normalize=True))
    print('#'*50)

Distribution of feature: ProductCD
W    0.744522
C    0.116028
R    0.063838
H    0.055922
S    0.019690
Name: ProductCD, dtype: float64
##################################################
Distribution of feature: card4
visa                0.653296
mastercard          0.321271
american express    0.014140
discover            0.011293
Name: card4, dtype: float64
##################################################
Distribution of feature: card6
debit              0.746963
credit             0.252961
debit or credit    0.000051
charge card        0.000025
Name: card6, dtype: float64
##################################################
Distribution of feature: P_emaildomain
gmail.com           0.460315
yahoo.com           0.203462
hotmail.com         0.091214
anonymous.com       0.074580
aol.com             0.057025
comcast.net         0.015901
icloud.com          0.012633
outlook.com         0.010272
msn.com             0.008249
att.net             0.008130
live.com            0.006130
sb

I will continue the EDA in the next version of this notebook. For now, I am proceeding with the remaining steps.

In [14]:
# Seaborn visualization library
# import seaborn as sns
# Create the default pairplot
# sns.pairplot(X_train, hue = 'isFraud')

# Let's separate the target variable from the training dataset

In [15]:
y_train = X_train['isFraud']
X_train.drop(['isFraud'], axis=1, inplace = True)
y_pred = sample_submission

# Treat NaNs
There are better ways to handle missing values, but for now I fill them with a sentinel value as below; I intend to revisit this later.

In [16]:
X_train = X_train.fillna(-999)
X_test = X_test.fillna(-999)
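
One possible alternative to a sentinel value, sketched on a toy column: keep a flag recording where the value was missing, then fill with the median. This is only an illustration of one option, not what this kernel does; the values are made up.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({'dist1': [10.0, np.nan, 30.0, np.nan]})

# Record missingness before filling, so the information is not lost
toy['dist1_missing'] = toy['dist1'].isnull().astype(np.int8)

# Fill the gaps with the median of the observed values (here 20.0)
toy['dist1'] = toy['dist1'].fillna(toy['dist1'].median())

print(toy)
```

In a real pipeline the median would be computed on the training data only and reused for the test data.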

In [17]:
X_train.columns, X_test.columns, y_train.shape

(Index(['TransactionAmt', 'ProductCD', 'card1', 'card2', 'card3', 'card4',
        'card5', 'card6', 'addr1', 'addr2',
        ...
        'id_28', 'id_29', 'id_31', 'id_35', 'id_36', 'id_37', 'id_38',
        'DeviceType', 'DeviceInfo', 'hours'],
       dtype='object', length=359),
 Index(['TransactionAmt', 'ProductCD', 'card1', 'card2', 'card3', 'card4',
        'card5', 'card6', 'addr1', 'addr2',
        ...
        'id_28', 'id_29', 'id_31', 'id_35', 'id_36', 'id_37', 'id_38',
        'DeviceType', 'DeviceInfo', 'hours'],
       dtype='object', length=359),
 (590540,))

# Convert categorical variables 
Ideally, we would apply label encoding followed by one-hot encoding, or use `pd.get_dummies`. My worry is that this dataset already has many unexplained features, and one-hot encoding would add one column per category level, which may not help. I have also read in this competition's discussion forum that one-hot encoding was not that useful, so I am going with plain label encoding. As I mentioned, this is my first Kaggle kernel submission, and I am excited to get a couple of good commits in :P

In [18]:
# Label Encoding
for f in X_train.columns:
    if X_train[f].dtype=='object' or X_test[f].dtype=='object':
        lbl = preprocessing.LabelEncoder()
        # Fit on train and test values together so every category seen in either set gets a code
        lbl.fit(list(X_train[f].values) + list(X_test[f].values))
        X_train[f] = lbl.transform(list(X_train[f].values))
        X_test[f] = lbl.transform(list(X_test[f].values))
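
A minimal sketch of the same idea on a toy column. The category values are made up, and the explicit cast to `str` is my assumption for handling the -999 placeholder, which otherwise sits next to string categories in the same column.

```python
import pandas as pd
from sklearn import preprocessing

toy_train = pd.DataFrame({'card4': ['visa', 'mastercard', -999]})
toy_test = pd.DataFrame({'card4': ['discover', 'visa', -999]})

lbl = preprocessing.LabelEncoder()
# Fit on the union of train and test so 'discover' (test only) also gets a code
lbl.fit(pd.concat([toy_train['card4'], toy_test['card4']]).astype(str))

toy_train['card4'] = lbl.transform(toy_train['card4'].astype(str))
toy_test['card4'] = lbl.transform(toy_test['card4'].astype(str))

print(list(lbl.classes_))          # ['-999', 'discover', 'mastercard', 'visa']
print(toy_train['card4'].tolist()) # [3, 2, 0]
print(toy_test['card4'].tolist())  # [1, 3, 0]
```

The codes follow the sorted order of the category strings, so they carry no meaning beyond identity.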

# RAM Optimization
This is adapted from the Kaggle kernel https://www.kaggle.com/xhlulu/ieee-fraud-efficient-grid-search-with-xgboost.
I use an ordinary PC with 8 GB of RAM, so any memory-saving step is a boost for people like me. Also, believe me, I truly understand the code snippet below :P

In [19]:
def reduce_mem_usage(df):
    """ iterate through all the columns of a dataframe and modify the data type
        to reduce memory usage.        
    """
    start_mem = df.memory_usage().sum() / 1024**2
    print('Memory usage of dataframe is {:.2f} MB'.format(start_mem))
    
    for col in df.columns:
        col_type = df[col].dtype
        
        if col_type != object:
            c_min = df[col].min()
            c_max = df[col].max()
            if str(col_type)[:3] == 'int':
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)  
            else:
                if c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max:
                    df[col] = df[col].astype(np.float16)
                elif c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)
        else:
            df[col] = df[col].astype('category')

    end_mem = df.memory_usage().sum() / 1024**2
    print('Memory usage after optimization is: {:.2f} MB'.format(end_mem))
    print('Decreased by {:.1f}%'.format(100 * (start_mem - end_mem) / start_mem))
    
    return df
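
One caveat worth keeping in mind: the range check only tests whether a value fits, not whether it survives the cast exactly. A small sketch with a hypothetical amount shows what `float16` does to it.

```python
import numpy as np

amount = 1234.56  # hypothetical transaction amount

# The same kind of range test the function applies to float columns
fits_in_float16 = np.finfo(np.float16).min < amount < np.finfo(np.float16).max

# Round-trip through each narrower type and measure the error
as_f16 = float(np.float16(amount))
as_f32 = float(np.float32(amount))
err_f16 = abs(as_f16 - amount)
err_f32 = abs(as_f32 - amount)

print(fits_in_float16, as_f16, err_f16, err_f32)
```

The value passes the range test, yet `float16` stores it as 1235.0, while `float32` keeps it to within a fraction of a cent. Tree-based models tolerate this kind of rounding reasonably well, but it is a lossy step.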

In [20]:
%%time
X_train = reduce_mem_usage(X_train)

Memory usage of dataframe is 1641.97 MB
Memory usage after optimization is: 454.78 MB
Decreased by 72.3%
CPU times: user 44.9 s, sys: 1min 55s, total: 2min 40s
Wall time: 2min 40s


In [21]:
X_test = reduce_mem_usage(X_test)

Memory usage of dataframe is 1411.67 MB
Memory usage after optimization is: 399.81 MB
Decreased by 71.7%


# Let's try the AutoML frameworks

In [22]:
X_train.shape, X_test.shape, y_train.shape, y_pred.shape

((590540, 359), (506691, 359), (590540,), (506691, 1))

In [23]:
%%time
from sklearn.model_selection import train_test_split
X_tr, X_te, y_tr, y_te = train_test_split(X_train, y_train,train_size=0.80, test_size=0.20)

CPU times: user 1.31 s, sys: 672 ms, total: 1.98 s
Wall time: 2 s
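
Since fraud labels are imbalanced, a stratified split is one option for keeping the class ratio the same in both parts. A minimal sketch on toy data; the 5% positive rate and the `random_state` are assumptions for illustration, not values from this dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X_toy = np.arange(200).reshape(100, 2)
y_toy = np.array([1] * 5 + [0] * 95)  # hypothetical 5% positive rate

# stratify=y_toy keeps the positive rate equal in both parts;
# random_state makes the split reproducible
X_tr_toy, X_te_toy, y_tr_toy, y_te_toy = train_test_split(
    X_toy, y_toy, test_size=0.20, stratify=y_toy, random_state=42)

print(y_tr_toy.mean(), y_te_toy.mean())
```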


In [24]:
%%time
from tpot import TPOTClassifier
tpot = TPOTClassifier(generations=5, population_size=5, verbosity=2,cv=5, scoring='roc_auc', warm_start=True, early_stop=5 )
tpot.fit(X_tr, y_tr)

Generation 1 - Current best internal CV score: 0.7290538728152991
Generation 2 - Current best internal CV score: 0.7290538728152991
Generation 3 - Current best internal CV score: 0.8218767635746419
Generation 4 - Current best internal CV score: 0.8218767635746419
Generation 5 - Current best internal CV score: 0.8219322540629437

Best pipeline: DecisionTreeClassifier(DecisionTreeClassifier(input_matrix, criterion=gini, max_depth=7, min_samples_leaf=6, min_samples_split=14), criterion=gini, max_depth=4, min_samples_leaf=15, min_samples_split=9)
CPU times: user 2h 12min 54s, sys: 3min 18s, total: 2h 16min 13s
Wall time: 2h 16min 19s


TPOTClassifier(config_dict=None, crossover_rate=0.1, cv=5,
               disable_update_check=False, early_stop=5, generations=5,
               max_eval_time_mins=5, max_time_mins=None, memory=None,
               mutation_rate=0.9, n_jobs=1, offspring_size=None,
               periodic_checkpoint_folder=None, population_size=5,
               random_state=None, scoring='roc_auc', subsample=1.0,
               template=None, use_dask=False, verbosity=2, warm_start=True)

In [25]:
%%time
print("ROC_AUC is {}%".format(tpot.score(X_te, y_te)*100))

ROC_AUC is 83.93510747636397%
CPU times: user 1.17 s, sys: 1.23 s, total: 2.4 s
Wall time: 2.39 s
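
ROC AUC is a ranking metric, so it is normally computed from scores or probabilities rather than hard 0/1 labels. A minimal sketch with made-up numbers, using `roc_auc_score` from scikit-learn:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

y_true = np.array([0, 0, 1, 1])
fraud_proba = np.array([0.1, 0.2, 0.3, 0.8])    # made-up probabilities
hard_labels = (fraud_proba >= 0.5).astype(int)  # [0, 0, 0, 1]

auc_from_proba = roc_auc_score(y_true, fraud_proba)
auc_from_labels = roc_auc_score(y_true, hard_labels)

print(auc_from_proba, auc_from_labels)  # 1.0 0.75
```

Thresholding throws away the ordering between the two middle examples, which is why the label-based score is lower here.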


# Now let's predict on the test set for the submission file

In [26]:
%%time
preds = tpot.predict(X_test)

CPU times: user 3.88 s, sys: 3.2 s, total: 7.08 s
Wall time: 7.09 s


In [27]:
%%time
preds_probab = tpot.predict_proba(X_test)

CPU times: user 3.74 s, sys: 2.85 s, total: 6.59 s
Wall time: 6.58 s


In [28]:
sample_submission['isFraud'] = '0'
sample_submission['isFraud'] = preds
sample_submission.to_csv('TPOT_automl_submission_pred_3.csv', index=True)

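`predict_proba` returns one column per class. Column 0 holds the non-fraud probability, so the fraud probability is 1 minus that column (equivalently, column 1).

In [29]:
# preds_probab has shape (n_samples, 2); select a single column before assigning it
sample_submission['isFraud'] = 1.0 - preds_probab[:, 0]
sample_submission.to_csv('TPOT_automl_submission_probab_3.csv', index=True)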
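
The column layout of `predict_proba` can be checked on a toy model. This sketch uses a plain scikit-learn `DecisionTreeClassifier` as a stand-in for the fitted TPOT pipeline; the data is made up.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

X_toy = np.array([[0.0], [1.0], [2.0], [3.0]])
y_toy = np.array([0, 0, 1, 1])

clf = DecisionTreeClassifier(random_state=0).fit(X_toy, y_toy)
proba = clf.predict_proba(X_toy)

# Columns follow clf.classes_, which is sorted: [0, 1]
print(clf.classes_, proba.shape)

# Probability of the positive class (fraud): column 1, i.e. 1 - column 0
fraud_proba = proba[:, 1]
print(fraud_proba.tolist())  # [0.0, 0.0, 1.0, 1.0]
```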