<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Data-Description" data-toc-modified-id="Data-Description-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Data Description</a></span></li><li><span><a href="#Imports" data-toc-modified-id="Imports-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Imports</a></span></li><li><span><a href="#Useful-Scripts" data-toc-modified-id="Useful-Scripts-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Useful Scripts</a></span></li><li><span><a href="#Load-the-data" data-toc-modified-id="Load-the-data-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Load the data</a></span></li><li><span><a href="#Balance-the-dataset-for-undersampling" data-toc-modified-id="Balance-the-dataset-for-undersampling-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>Balance the dataset for undersampling</a></span></li><li><span><a href="#Train-test-split-with-stratify" data-toc-modified-id="Train-test-split-with-stratify-6"><span class="toc-item-num">6&nbsp;&nbsp;</span>Train test split with stratify</a></span></li><li><span><a href="#Modelling-lightgbm" data-toc-modified-id="Modelling-lightgbm-7"><span class="toc-item-num">7&nbsp;&nbsp;</span>Modelling lightgbm</a></span></li><li><span><a href="#Grid-search" data-toc-modified-id="Grid-search-8"><span class="toc-item-num">8&nbsp;&nbsp;</span>Grid search</a></span></li><li><span><a href="#Light-gbm-cross-validation" data-toc-modified-id="Light-gbm-cross-validation-9"><span class="toc-item-num">9&nbsp;&nbsp;</span>Light gbm cross validation</a></span></li></ul></div>

# Data Description

The dataset contains transactions made by credit cards in September
2013 by European cardholders.


This dataset presents transactions that occurred in two days,
where we have 492 frauds out of 284,807 transactions. 

The dataset is highly unbalanced: the positive class (frauds)
accounts for 0.172% of all transactions.
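
The quoted fraud rate follows directly from the two counts above. A quick check in plain Python, using only the figures stated in this section:

```python
n_frauds = 492
n_total = 284_807

# Share of fraudulent transactions, in percent.
fraud_pct = 100 * n_frauds / n_total
# Roughly one fraud for every n_total / n_frauds transactions.
one_in = n_total / n_frauds

print(f"fraud share: {fraud_pct:.4f}%")
print(f"about 1 fraud per {one_in:.0f} transactions")
```

The same percentage appears again later in the `value_counts(normalize=True)` output.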

It contains only numerical input variables which are
the result of a PCA transformation.


Unfortunately, due to confidentiality issues,
we cannot provide the original features and 
more background information about the data.


Features V1, V2, ... V28 are the principal
components obtained with PCA;
the only features which have not been transformed with PCA are 'Time' and 'Amount'.

Feature 'Time' contains the seconds elapsed between each transaction
and the first transaction in the dataset. The feature 'Amount'
is the transaction amount; this feature can be used for
example-dependent cost-sensitive learning.
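
One way to read example-dependent cost-sensitive learning is that each error is priced by the transaction it affects: a missed fraud costs its `Amount`, while every alert costs a fixed handling fee. The sketch below scores a set of predictions under that cost model. The function name `total_cost`, the fee `admin_cost`, and the toy arrays are all assumptions made for illustration; none of them come from the dataset or this notebook.

```python
import numpy as np

def total_cost(y_true, y_pred, amount, admin_cost=5.0):
    """Hypothetical example-dependent cost of a set of predictions."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    amount = np.asarray(amount, dtype=float)

    missed = (y_true == 1) & (y_pred == 0)   # false negatives: lose the amount
    flagged = (y_pred == 1)                  # every alert incurs the handling fee
    return amount[missed].sum() + admin_cost * flagged.sum()

# Made-up labels, predictions, and amounts.
y_true = [0, 1, 1, 0]
y_pred = [0, 0, 1, 1]
amount = [20.0, 300.0, 15.0, 50.0]

cost = total_cost(y_true, y_pred, amount)
print(cost)
```

During training, per-row costs of this kind are often supplied through the `sample_weight` argument of an estimator's `fit` method, though that is only one possible design.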

Feature 'Class' is the response variable and it takes value
1 in case of fraud and 0 otherwise.

# Imports

In [1]:
import bhishan

In [2]:
%load_ext autoreload
%autoreload 2

In [3]:
import numpy as np
import pandas as pd
import seaborn as sns
sns.set(color_codes=True)

import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline

import os
import time

# random state
random_state = 100
np.random.seed(random_state)  # reseed in each cell that needs reproducible draws

# Jupyter notebook settings for pandas
pd.set_option('display.max_columns', 200)
pd.set_option('display.max_rows', 100) # None for all the rows
pd.set_option('display.max_colwidth', 50)

print([(x.__name__,x.__version__) for x in [np, pd,sns,matplotlib]])

[('numpy', '1.16.4'), ('pandas', '0.25.0'), ('seaborn', '0.9.0'), ('matplotlib', '3.1.1')]


In [4]:
import scipy
from scipy import stats

In [5]:
# scale and split
from sklearn.preprocessing import MinMaxScaler, StandardScaler, RobustScaler
from sklearn.model_selection import train_test_split
from sklearn.model_selection import StratifiedKFold

In [6]:
# dimension reduction for visualization
from sklearn.decomposition import PCA
from sklearn.decomposition import TruncatedSVD
from sklearn.manifold import TSNE

In [7]:
# classifiers
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import SGDClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

In [8]:
# hyperparameters search
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RandomizedSearchCV
from sklearn.metrics import make_scorer, matthews_corrcoef

In [9]:
# pipelines
from sklearn.pipeline import Pipeline
from sklearn.pipeline import make_pipeline

In [10]:
# prediction
from sklearn.model_selection import cross_val_predict

In [11]:
# model evaluation metrics
from sklearn.model_selection import cross_val_score

In [12]:
# roc auc etc scores
from sklearn.metrics import auc
from sklearn.metrics import roc_auc_score
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import f1_score
from sklearn.metrics import accuracy_score
from sklearn.metrics import average_precision_score

In [13]:
# roc auc curves
from sklearn.metrics import roc_curve
from sklearn.metrics import precision_recall_curve

In [14]:
# confusion matrix
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix

In [15]:
import plotly.offline as py
py.init_notebook_mode(connected=True)
import plotly.graph_objs as go
import plotly.tools as tls
import plotly.figure_factory as ff

# Useful Scripts

In [16]:
def show_method_attributes(method, ncols=7, exclude=None):
    """Show the public attributes of a given object in a table.

    Names containing the substring `exclude` are skipped.

    Example:
    ========
    show_method_attributes(list)
    """
    skip = 'os np pd sys time psycopg2'.split()
    x = [i for i in dir(method) if i[0] != '_']
    x = [i for i in x
         if i not in skip
         if exclude is None or exclude not in i]

    return pd.DataFrame(np.array_split(x, ncols)).T.fillna('')

# Load the data

In [17]:
df = pd.read_csv('../data/raw/creditcard.csv.zip',compression='zip')
print(df.shape)
df.head()

(284807, 31)


Unnamed: 0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,V10,V11,V12,V13,V14,V15,V16,V17,V18,V19,V20,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
0,0.0,-1.359807,-0.072781,2.536347,1.378155,-0.338321,0.462388,0.239599,0.098698,0.363787,0.090794,-0.5516,-0.617801,-0.99139,-0.311169,1.468177,-0.470401,0.207971,0.025791,0.403993,0.251412,-0.018307,0.277838,-0.110474,0.066928,0.128539,-0.189115,0.133558,-0.021053,149.62,0
1,0.0,1.191857,0.266151,0.16648,0.448154,0.060018,-0.082361,-0.078803,0.085102,-0.255425,-0.166974,1.612727,1.065235,0.489095,-0.143772,0.635558,0.463917,-0.114805,-0.183361,-0.145783,-0.069083,-0.225775,-0.638672,0.101288,-0.339846,0.16717,0.125895,-0.008983,0.014724,2.69,0
2,1.0,-1.358354,-1.340163,1.773209,0.37978,-0.503198,1.800499,0.791461,0.247676,-1.514654,0.207643,0.624501,0.066084,0.717293,-0.165946,2.345865,-2.890083,1.109969,-0.121359,-2.261857,0.52498,0.247998,0.771679,0.909412,-0.689281,-0.327642,-0.139097,-0.055353,-0.059752,378.66,0
3,1.0,-0.966272,-0.185226,1.792993,-0.863291,-0.010309,1.247203,0.237609,0.377436,-1.387024,-0.054952,-0.226487,0.178228,0.507757,-0.287924,-0.631418,-1.059647,-0.684093,1.965775,-1.232622,-0.208038,-0.1083,0.005274,-0.190321,-1.175575,0.647376,-0.221929,0.062723,0.061458,123.5,0
4,2.0,-1.158233,0.877737,1.548718,0.403034,-0.407193,0.095921,0.592941,-0.270533,0.817739,0.753074,-0.822843,0.538196,1.345852,-1.11967,0.175121,-0.451449,-0.237033,-0.038195,0.803487,0.408542,-0.009431,0.798278,-0.137458,0.141267,-0.20601,0.502292,0.219422,0.215153,69.99,0


# Balance the dataset for undersampling

In [18]:
target = 'Class'
df[target].value_counts(normalize=True)*100

0    99.827251
1     0.172749
Name: Class, dtype: float64

In [19]:
# shuffle data
df = df.sample(frac=1)

df_low = df.loc[df[target] == 1]
df_high = df.loc[df[target] == 0][:df_low.shape[0]]

df_balanced = pd.concat([df_low, df_high])
df_balanced = df_balanced.sample(frac=1, random_state=100)

df_balanced[target].value_counts()

1    492
0    492
Name: Class, dtype: int64
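
The undersampling step above can be written as a small reusable helper. This is a minimal sketch under stated assumptions: the function name `undersample` and the toy frame are hypothetical, and unlike the cell above it samples each class with an explicit `random_state` instead of slicing the shuffled frame.

```python
import numpy as np
import pandas as pd

def undersample(df, target, random_state=100):
    """Downsample every class to the size of the rarest class."""
    n_min = df[target].value_counts().min()
    parts = [
        group.sample(n=n_min, random_state=random_state)
        for _, group in df.groupby(target)
    ]
    # Shuffle so the classes are interleaved rather than stacked.
    return pd.concat(parts).sample(frac=1, random_state=random_state)

# Toy frame: 95 negatives and 5 positives (made-up data).
toy = pd.DataFrame({
    "Amount": np.arange(100, dtype=float),
    "Class": [0] * 95 + [1] * 5,
})

balanced = undersample(toy, "Class")
print(balanced["Class"].value_counts())
```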

# Train test split with stratify

In [20]:
from sklearn.model_selection import train_test_split

# Note: the split is done on the full, imbalanced df; df_balanced is not used here.
Xtrain, Xtest, ytrain, ytest = train_test_split(
    df.drop(target,axis=1), df[target],
    test_size=0.2, random_state=random_state, stratify=df[target])

df.shape, Xtrain.shape, Xtest.shape

((284807, 31), (227845, 30), (56962, 30))
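
Passing `stratify` keeps the class proportions nearly identical in both partitions, which matters when only about 0.17% of the rows are positive. A small sketch on synthetic labels (the arrays and the 2% positive rate are made up for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 1000 rows, 2% positives.
X = np.arange(1000).reshape(-1, 1)
y = np.array([1] * 20 + [0] * 980)

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=100, stratify=y)

# Both partitions keep the 2% positive rate.
print(y_tr.mean(), y_te.mean())
print(y_tr.sum(), y_te.sum())
```

Without `stratify`, a random 20% test set drawn from so few positives could end up with noticeably more or fewer frauds than expected.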

# Modelling lightgbm

In [21]:
%%time

import lightgbm as lgbm

lgbm_clf = lgbm.LGBMClassifier(n_estimators=100, random_state = random_state,)

lgbm_clf.fit(Xtrain, ytrain)
y_pred = lgbm_clf.predict(Xtest)
y_score = lgbm_clf.predict_proba(Xtest)[:,1]





CPU times: user 14 s, sys: 302 ms, total: 14.3 s
Wall time: 4.25 s


In [29]:
cm = confusion_matrix(ytest,y_pred)
vals = cm.ravel()
cm

array([[56807,    57],
       [   62,    36]])

In [30]:
print('lightGBM Results')
print('-'*25)
print('Total Frauds: ', vals[2] + vals[3])
print('Incorrect Frauds: ', vals[2])
print('Incorrect Percent: ', round(vals[2]*100/(vals[2]+vals[3]),2),'%')

lightGBM Results
-------------------------
Total Frauds:  98
Incorrect Frauds:  62
Incorrect Percent:  63.27 %
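
The indices used above follow from how `cm.ravel()` flattens a binary confusion matrix: the order is true negatives, false positives, false negatives, true positives. The sketch below unpacks the matrix reported above by name and derives recall, precision, and F1 from it; the variable names are illustrative.

```python
import numpy as np

# Confusion matrix reported above: rows are true classes, columns are predictions.
cm = np.array([[56807, 57],
               [62, 36]])
tn, fp, fn, tp = cm.ravel()

recall = tp / (tp + fn)        # share of frauds that were caught
precision = tp / (tp + fp)     # share of alerts that were real frauds
f1 = 2 * precision * recall / (precision + recall)
miss_rate = fn / (tp + fn)     # the "Incorrect Percent" printed above

print(f"recall={recall:.4f} precision={precision:.4f} f1={f1:.4f}")
print(f"missed frauds: {100 * miss_rate:.2f}%")
```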


In [22]:
from bhishan.util_plot_model_eval import plotly_binary_clf_evaluation

plotly_binary_clf_evaluation('lgbm with n_estimators = 100',lgbm_clf,ytest,y_pred,y_score,df)

# Grid search

In [31]:
import scipy

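# NOTE: early stopping is monitored on the test split (Xtest, ytest), so
# scores computed on Xtest afterwards are optimistic; a separate validation
# split would avoid this.
fit_params = {"early_stopping_rounds" : 50, 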
             "eval_metric" : 'binary', 
             "eval_set" : [(Xtest,ytest)],
             'eval_names': ['valid'],
             'verbose': 0,
             'categorical_feature': 'auto'}

param_test = {'learning_rate' : [0.01, 0.02, 0.03, 0.04, 0.05, 0.08, 0.1, 0.2, 0.3, 0.4],
              'n_estimators' : [100, 200, 300, 400, 500, 600, 800, 1000, 1500, 2000, 3000, 5000],
              'num_leaves': scipy.stats.randint(6, 50), 
              'min_child_samples': scipy.stats.randint(100, 500), 
              'min_child_weight': [1e-5, 1e-3, 1e-2, 1e-1, 1, 1e1, 1e2, 1e3, 1e4],
              'subsample': scipy.stats.uniform(loc=0.2, scale=0.8), 
              'max_depth': [-1, 1, 2, 3, 4, 5, 6, 7],
              'colsample_bytree': scipy.stats.uniform(loc=0.4, scale=0.6),
              'reg_alpha': [0, 1e-1, 1, 2, 5, 7, 10, 50, 100],
              'reg_lambda': [0, 1e-1, 1, 5, 10, 20, 50, 100]}

# number of sampled parameter combinations
n_iter = 2  # (replace 2 by 200; takes about 90 minutes)

# initialize lgbm and launch the search
lgbm_clf = lgbm.LGBMClassifier(random_state=random_state, silent=True,
                               metric='None', n_jobs=-1)

grid_search = RandomizedSearchCV(
    estimator=lgbm_clf,
    param_distributions=param_test, 
    n_iter=n_iter,
    scoring='accuracy',
    cv=5,
    refit=True,
    random_state=random_state,
    verbose=True)

grid_search.fit(Xtrain, ytrain, **fit_params)
print('Best score reached: {} with params: {} '.format(
    grid_search.best_score_, grid_search.best_params_))

Fitting 5 folds for each of 2 candidates, totalling 10 fits


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  10 out of  10 | elapsed:   26.3s finished


Best score reached: 0.9995259935482456 with params: {'colsample_bytree': 0.7260429650745792, 'learning_rate': 0.04, 'max_depth': 7, 'min_child_samples': 443, 'min_child_weight': 1e-05, 'n_estimators': 3000, 'num_leaves': 36, 'reg_alpha': 5, 'reg_lambda': 1, 'subsample': 0.3491737465307062} 
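
The continuous ranges in `param_test` use SciPy's `loc`/`scale` convention: `uniform(loc, scale)` covers `[loc, loc + scale]`, and `randint(low, high)` draws integers from `low` up to but not including `high`. `RandomizedSearchCV` samples from such objects through their `rvs` method. A short check of the resulting ranges, with an arbitrary seed and sample size:

```python
from scipy import stats

subsample = stats.uniform(loc=0.2, scale=0.8)   # values in [0.2, 1.0]
num_leaves = stats.randint(6, 50)               # integers 6..49

s = subsample.rvs(size=1000, random_state=100)
n = num_leaves.rvs(size=1000, random_state=100)

print(s.min(), s.max())
print(n.min(), n.max())
```

Note that LightGBM applies `subsample` (bagging) only when `subsample_freq` is greater than zero, so with `subsample_freq=0` the sampled `subsample` values most likely have no effect on the fitted models.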


In [33]:
opt_parameters =  grid_search.best_params_

clf_sw = lgbm.LGBMClassifier(**lgbm_clf.get_params())
clf_sw.set_params(**opt_parameters)

LGBMClassifier(boosting_type='gbdt', class_weight=None,
               colsample_bytree=0.7260429650745792, importance_type='split',
               learning_rate=0.04, max_depth=7, metric='None',
               min_child_samples=443, min_child_weight=1e-05,
               min_split_gain=0.0, n_estimators=3000, n_jobs=-1, num_leaves=36,
               objective=None, random_state=100, reg_alpha=5, reg_lambda=1,
               silent=True, subsample=0.3491737465307062,
               subsample_for_bin=200000, subsample_freq=0)

In [34]:
%%time
# Hard-coded parameters; note that they differ from the best_params_ printed above.
lgbm_clf = lgbm.LGBMClassifier(boosting_type='gbdt', class_weight=None,
        colsample_bytree=0.5112837457460335, importance_type='split',
        learning_rate=0.02, max_depth=7, metric='None',
        min_child_samples=195, min_child_weight=0.01, min_split_gain=0.0,
        n_estimators=3000, n_jobs=4, num_leaves=44, objective=None,
        random_state=42, reg_alpha=2, reg_lambda=10, silent=True,
        subsample=0.8137506311449016, subsample_for_bin=200000,
        subsample_freq=0)

lgbm_clf.fit(Xtrain, ytrain)
y_pred = lgbm_clf.predict(Xtest)
y_score = lgbm_clf.predict_proba(Xtest)[:,1]

CPU times: user 3min 6s, sys: 2.26 s, total: 3min 8s
Wall time: 56.8 s


In [36]:
cm = confusion_matrix(ytest,y_pred)
vals = cm.ravel()

cm

array([[56863,     1],
       [   19,    79]])

In [37]:
print('LightGBM Grid Search Results')
print('-'*25)
print('Total Frauds: ', vals[2] + vals[3])
print('Incorrect Frauds: ', vals[2])
print('Incorrect Percent: ', round(vals[2]*100/(vals[2]+vals[3]),2),'%')

LightGBM Grid Search Results
-------------------------
Total Frauds:  98
Incorrect Frauds:  19
Incorrect Percent:  19.39 %
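
With so few positives, accuracy (the `scoring` used in the search) is dominated by the majority class. The sketch below compares the tuned model's confusion matrix, as reported above, with a hypothetical classifier that never predicts fraud, using accuracy and the Matthews correlation coefficient computed by hand from the four counts. The helper names are illustrative.

```python
import math

def accuracy(tn, fp, fn, tp):
    """Fraction of all predictions that are correct."""
    return (tn + tp) / (tn + fp + fn + tp)

def mcc(tn, fp, fn, tp):
    """Matthews correlation coefficient; taken as 0 when a margin is empty."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

tuned = (56863, 1, 19, 79)        # tn, fp, fn, tp from the matrix above
never_fraud = (56864, 0, 98, 0)   # hypothetical "always predict 0" baseline

for name, counts in [("tuned", tuned), ("never_fraud", never_fraud)]:
    print(name, round(accuracy(*counts), 6), round(mcc(*counts), 4))
```

The baseline already exceeds 99.8% accuracy while catching no fraud at all, which is why the two models are hard to tell apart by accuracy alone. The notebook already imports `make_scorer` and `matthews_corrcoef`, so passing `scoring=make_scorer(matthews_corrcoef)` to the search would be one way to rank candidates by this metric instead.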


In [35]:
plotly_binary_clf_evaluation('lgbm_clf',lgbm_clf,ytest,y_pred,y_score,df)

# Light gbm cross validation

In [28]:
%%time

X = df.drop('Class',axis=1).values
y = df['Class'].values

scores = cross_val_score(lgbm_clf,X,y,scoring ='f1',
                         cv=5,n_jobs=-1,verbose=2)

[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.


CPU times: user 118 ms, sys: 125 ms, total: 243 ms
Wall time: 3min 51s


[Parallel(n_jobs=-1)]: Done   5 out of   5 | elapsed:  3.9min finished


In [29]:
trace = go.Table(
    header=dict(values=['<b>F1 score mean</b>', '<b>F1 score std</b>'],
                line = dict(color='#7D7F80'),
                fill = dict(color='#a1c3d1'),
                align = ['center'],
                font = dict(size = 15)),
    cells=dict(values=[np.round(scores.mean(),6),
                       np.round(scores.std(),6)],
               line = dict(color='#7D7F80'),
               fill = dict(color='#EDFAFF'),
               align = ['center'], font = dict(size = 15)))

layout = dict(width=800, height=500,
              title = 'Cross validation - 5 folds [F1 score]',
              font = dict(size = 15))
fig = dict(data=[trace], layout=layout)
py.iplot(fig, filename = '../reports/figures/lightgbm_cross_validation.html')
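
The cross-validation step can be reproduced in miniature. The sketch below is not the notebook's run: it uses synthetic imbalanced data and `LogisticRegression` as a stand-in estimator, and it builds the stratified folds explicitly (with shuffling) rather than passing `cv=5`. For classifiers, an integer `cv` also yields stratified folds, only without shuffling.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic imbalanced data (about 5% positives); not the credit card dataset.
X, y = make_classification(n_samples=2000, n_features=10, weights=[0.95],
                           random_state=100)

# Stand-in estimator; the notebook uses an LGBMClassifier here.
clf = LogisticRegression(max_iter=1000)

# Explicit stratified folds so that every fold contains positives.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=100)
scores = cross_val_score(clf, X, y, scoring="f1", cv=cv)

print(scores.round(3), scores.mean().round(3), scores.std().round(3))
```

Reporting the mean together with the standard deviation, as the table above does, gives a sense of how much the F1 score varies from fold to fold.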