This notebook deals with the detection of fraudulent credit card transactions. The main challenges in this project are resolving the dataset imbalance, extracting important features by analysing a correlation matrix, and removing outliers, if any.

Steps involved:
1. Pre-processing of the data (scaling).
2. Resolving the dataset imbalance using SMOTE.
3. Training ML models on the resampled data:
   a. LightGBM
   b. Logistic Regression
   c. XGBoost
   d. Random Forest


In [1]:
import numpy as np
import pandas as pd
import lightgbm

Reading the information from the CSV and storing it into a dataframe named data.

In [2]:
data=pd.read_csv('../input/creditcardfraud/creditcard.csv')

In [3]:
data.columns

Index(['Time', 'V1', 'V2', 'V3', 'V4', 'V5', 'V6', 'V7', 'V8', 'V9', 'V10',
       'V11', 'V12', 'V13', 'V14', 'V15', 'V16', 'V17', 'V18', 'V19', 'V20',
       'V21', 'V22', 'V23', 'V24', 'V25', 'V26', 'V27', 'V28', 'Amount',
       'Class'],
      dtype='object')

In [4]:
data.head()

   Time        V1        V2        V3        V4        V5  ...
0   0.0 -1.359807 -0.072781  2.536347  1.378155 -0.338321  ...
1   0.0  1.191857  0.266151  0.166480  0.448154  0.060018  ...
2   1.0 -1.358354 -1.340163  1.773209  0.379780 -0.503198  ...
3   1.0 -0.966272 -0.185226  1.792993 -0.863291 -0.010309  ...
4   2.0 -1.158233  0.877737  1.548718  0.403034 -0.407193  ...

The class labels are stored in a variable called y, and data is overwritten with the remaining columns. The column names at this point are saved in a.

In [5]:
y=data.Class
data=data.drop('Class',axis=1)
a=data.columns

Scaling is needed: describing the data (for example with data.describe()) shows that all columns except Amount and Time are in a similar range, so these two need to be scaled.

Two common types of scaling are:
1. Standard Scaler
2. Robust Scaler

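Standard Scaler: StandardScaler subtracts each feature's mean and divides by its standard deviation, so the scaled feature is centred on 0 with a standard deviation of 1. It works best when the feature is roughly normally distributed, because outliers distort both the mean and the standard deviation.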

Robust Scaler: RobustScaler works like StandardScaler, but it uses the median and the interquartile range (IQR) rather than the mean and the standard deviation, so that it is robust to outliers. For each feature it follows the formula:

$$x_i' = \frac{x_i - \mathrm{median}(x)}{Q_3(x) - Q_1(x)}$$

Because the median and the quartiles are not affected by the most extreme values, this scaler is more suitable when there are outliers in the data.
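To make the difference concrete, here is a minimal NumPy sketch that applies both kinds of scaling to a small column containing one extreme value. The sample amounts and the helper names standard_scale and robust_scale are made up for illustration, and the robust version assumes the median-and-IQR form of the formula.

```python
import numpy as np

def standard_scale(x):
    """Subtract the mean and divide by the standard deviation."""
    return (x - x.mean()) / x.std()

def robust_scale(x):
    """Subtract the median and divide by the interquartile range (Q3 - Q1)."""
    q1, median, q3 = np.percentile(x, [25, 50, 75])
    return (x - median) / (q3 - q1)

# Hypothetical transaction amounts with one extreme outlier.
amounts = np.array([10.0, 12.0, 14.0, 16.0, 18.0, 5000.0])

standard_scaled = standard_scale(amounts)
robust_scaled = robust_scale(amounts)

print(np.round(standard_scaled, 3))
print(np.round(robust_scaled, 3))
```

In this toy column the outlier inflates the mean and standard deviation, so the five ordinary amounts become nearly indistinguishable after standard scaling, while robust scaling keeps them spread out around 0.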

In [6]:
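from sklearn.preprocessing import StandardScaler, RobustScaler

std_scaler = StandardScaler()
rob_scaler = RobustScaler()

# Amount and Time are the only unscaled columns. RobustScaler is used for both
# because it is less sensitive to outliers; fit_transform refits it on each call.
data['scaled_amount'] = rob_scaler.fit_transform(data['Amount'].values.reshape(-1,1))
data['scaled_time'] = rob_scaler.fit_transform(data['Time'].values.reshape(-1,1))

# Drop the original, unscaled columns.
data.drop(['Time','Amount'], axis=1, inplace=True)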

In [7]:
scaled_amount = data['scaled_amount']
scaled_time = data['scaled_time']

data.drop(['scaled_amount', 'scaled_time'], axis=1, inplace=True)
data.insert(0, 'scaled_amount', scaled_amount)
data.insert(1, 'scaled_time', scaled_time)

data.head()

Unnamed: 0,scaled_amount,scaled_time,V1,V2,V3,V4,V5,V6,V7,V8,...,V19,V20,V21,V22,V23,V24,V25,V26,V27,V28
0,1.783274,-0.994983,-1.359807,-0.072781,2.536347,1.378155,-0.338321,0.462388,0.239599,0.098698,...,0.403993,0.251412,-0.018307,0.277838,-0.110474,0.066928,0.128539,-0.189115,0.133558,-0.021053
1,-0.269825,-0.994983,1.191857,0.266151,0.16648,0.448154,0.060018,-0.082361,-0.078803,0.085102,...,-0.145783,-0.069083,-0.225775,-0.638672,0.101288,-0.339846,0.16717,0.125895,-0.008983,0.014724
2,4.983721,-0.994972,-1.358354,-1.340163,1.773209,0.37978,-0.503198,1.800499,0.791461,0.247676,...,-2.261857,0.52498,0.247998,0.771679,0.909412,-0.689281,-0.327642,-0.139097,-0.055353,-0.059752
3,1.418291,-0.994972,-0.966272,-0.185226,1.792993,-0.863291,-0.010309,1.247203,0.237609,0.377436,...,-1.232622,-0.208038,-0.1083,0.005274,-0.190321,-1.175575,0.647376,-0.221929,0.062723,0.061458
4,0.670579,-0.99496,-1.158233,0.877737,1.548718,0.403034,-0.407193,0.095921,0.592941,-0.270533,...,0.803487,0.408542,-0.009431,0.798278,-0.137458,0.141267,-0.20601,0.502292,0.219422,0.215153


SMOTE stands for Synthetic Minority Oversampling Technique. It is a statistical technique for increasing the number of minority-class cases in a dataset so that the classes become balanced. It works by generating new instances from the existing minority cases and does not change the number of majority cases.

The new instances are not just copies of existing minority cases. Instead, for a minority sample the algorithm finds its nearest neighbours within the minority class and generates a new example at a random point on the line segment between the sample and one of those neighbours. Each synthetic sample therefore combines features of the target case with features of its neighbours, which makes the minority class more general than plain duplication would.

SMOTE takes the entire dataset as input, but it adds rows only to the minority class. For example, suppose you have an imbalanced dataset where just 1% of the cases have the target value A (the minority class) and 99% of the cases have the value B. In imbalanced-learn, the sampling_strategy parameter of SMOTE controls how many synthetic cases are generated; with the default setting used here, class A is oversampled until it has as many cases as class B.
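The interpolation idea can be sketched in a few lines of NumPy. This is a simplified illustration, not imbalanced-learn's implementation: the function name smote_sketch, its parameters and the four sample points are all hypothetical.

```python
import numpy as np

def smote_sketch(minority, n_new, k=3, seed=0):
    """Create n_new synthetic rows by interpolating between minority neighbours."""
    rng = np.random.default_rng(seed)
    minority = np.asarray(minority, dtype=float)

    # Pairwise Euclidean distances between all minority samples.
    dists = np.linalg.norm(minority[:, None, :] - minority[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)                 # a point is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]   # indices of the k nearest samples

    synthetic = np.empty((n_new, minority.shape[1]))
    for i in range(n_new):
        base = rng.integers(len(minority))          # pick a minority sample
        neighbour = rng.choice(neighbours[base])    # pick one of its k neighbours
        gap = rng.random()                          # position on the segment, in [0, 1)
        synthetic[i] = minority[base] + gap * (minority[neighbour] - minority[base])
    return synthetic

# Hypothetical 2-D minority class with four samples.
fraud = np.array([[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3]])
new_rows = smote_sketch(fraud, n_new=6)
print(new_rows.shape)
```

Because every synthetic row lies between two existing minority samples, the new points stay inside the region already occupied by the minority class.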

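In [8]:
from imblearn.over_sampling import SMOTE
# Oversample the minority (fraud) class first, so that x1 and y1 exist before the split.
smt=SMOTE(random_state=42)
x1,y1=smt.fit_resample(data,y)

In [9]:
from sklearn.model_selection import train_test_split
# Note: splitting after resampling lets synthetic points built from test rows
# influence training; a stricter setup splits first and resamples only the training part.
x_train,x_test,y_train,y_test = train_test_split(x1,y1,test_size=0.2, random_state=42)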


In [10]:
df=pd.DataFrame(x1)

In [11]:
df

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,20,21,22,23,24,25,26,27,28,29
0,1.783274,-0.994983,-1.359807,-0.072781,2.536347,1.378155,-0.338321,0.462388,0.239599,0.098698,...,0.403993,0.251412,-0.018307,0.277838,-0.110474,0.066928,0.128539,-0.189115,0.133558,-0.021053
1,-0.269825,-0.994983,1.191857,0.266151,0.166480,0.448154,0.060018,-0.082361,-0.078803,0.085102,...,-0.145783,-0.069083,-0.225775,-0.638672,0.101288,-0.339846,0.167170,0.125895,-0.008983,0.014724
2,4.983721,-0.994972,-1.358354,-1.340163,1.773209,0.379780,-0.503198,1.800499,0.791461,0.247676,...,-2.261857,0.524980,0.247998,0.771679,0.909412,-0.689281,-0.327642,-0.139097,-0.055353,-0.059752
3,1.418291,-0.994972,-0.966272,-0.185226,1.792993,-0.863291,-0.010309,1.247203,0.237609,0.377436,...,-1.232622,-0.208038,-0.108300,0.005274,-0.190321,-1.175575,0.647376,-0.221929,0.062723,0.061458
4,0.670579,-0.994960,-1.158233,0.877737,1.548718,0.403034,-0.407193,0.095921,0.592941,-0.270533,...,0.803487,0.408542,-0.009431,0.798278,-0.137458,0.141267,-0.206010,0.502292,0.219422,0.215153
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
568625,9.854633,0.638100,-2.356885,1.759518,-6.388271,1.791556,-3.461029,1.440556,-0.373200,1.446709,...,-0.617735,0.188948,0.855687,0.618928,0.954478,-0.615495,-1.709122,0.107490,1.124115,0.405976
568626,1.110878,-0.236643,-3.352917,0.753401,-1.698278,0.863169,-1.186314,-0.406322,-1.652498,0.020940,...,0.376346,-0.371572,0.342796,0.454379,-0.130009,-0.499223,-0.042935,0.987288,-1.389017,0.750979
568627,0.313366,0.510075,1.015969,1.435963,-3.709889,2.784929,-0.214184,-1.435874,-1.600511,0.319123,...,-0.383629,0.288674,0.523091,0.374095,-0.176358,-0.434731,0.450034,-0.275049,0.488280,0.241260
568628,2.558090,0.519992,-2.297717,2.980660,-5.140960,2.602508,-1.445263,-0.677400,-1.801274,1.636259,...,-0.072661,-0.070069,0.600670,0.688650,-0.002084,0.160731,0.023995,-0.401783,0.549772,0.097347


In [12]:
# Use the current column order of data; a still holds the pre-scaling names
# (Time ... Amount), which would mislabel the reordered columns.
df.columns=data.columns

In [13]:
df

Unnamed: 0,scaled_amount,scaled_time,V1,V2,V3,V4,V5,V6,V7,V8,...,V19,V20,V21,V22,V23,V24,V25,V26,V27,V28
0,1.783274,-0.994983,-1.359807,-0.072781,2.536347,1.378155,-0.338321,0.462388,0.239599,0.098698,...,0.403993,0.251412,-0.018307,0.277838,-0.110474,0.066928,0.128539,-0.189115,0.133558,-0.021053
1,-0.269825,-0.994983,1.191857,0.266151,0.166480,0.448154,0.060018,-0.082361,-0.078803,0.085102,...,-0.145783,-0.069083,-0.225775,-0.638672,0.101288,-0.339846,0.167170,0.125895,-0.008983,0.014724
2,4.983721,-0.994972,-1.358354,-1.340163,1.773209,0.379780,-0.503198,1.800499,0.791461,0.247676,...,-2.261857,0.524980,0.247998,0.771679,0.909412,-0.689281,-0.327642,-0.139097,-0.055353,-0.059752
3,1.418291,-0.994972,-0.966272,-0.185226,1.792993,-0.863291,-0.010309,1.247203,0.237609,0.377436,...,-1.232622,-0.208038,-0.108300,0.005274,-0.190321,-1.175575,0.647376,-0.221929,0.062723,0.061458
4,0.670579,-0.994960,-1.158233,0.877737,1.548718,0.403034,-0.407193,0.095921,0.592941,-0.270533,...,0.803487,0.408542,-0.009431,0.798278,-0.137458,0.141267,-0.206010,0.502292,0.219422,0.215153
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
568625,9.854633,0.638100,-2.356885,1.759518,-6.388271,1.791556,-3.461029,1.440556,-0.373200,1.446709,...,-0.617735,0.188948,0.855687,0.618928,0.954478,-0.615495,-1.709122,0.107490,1.124115,0.405976
568626,1.110878,-0.236643,-3.352917,0.753401,-1.698278,0.863169,-1.186314,-0.406322,-1.652498,0.020940,...,0.376346,-0.371572,0.342796,0.454379,-0.130009,-0.499223,-0.042935,0.987288,-1.389017,0.750979
568627,0.313366,0.510075,1.015969,1.435963,-3.709889,2.784929,-0.214184,-1.435874,-1.600511,0.319123,...,-0.383629,0.288674,0.523091,0.374095,-0.176358,-0.434731,0.450034,-0.275049,0.488280,0.241260
568628,2.558090,0.519992,-2.297717,2.980660,-5.140960,2.602508,-1.445263,-0.677400,-1.801274,1.636259,...,-0.072661,-0.070069,0.600670,0.688650,-0.002084,0.160731,0.023995,-0.401783,0.549772,0.097347


In [14]:
df['Class'] = y1

In [15]:
df

Unnamed: 0,scaled_amount,scaled_time,V1,V2,V3,V4,V5,V6,V7,V8,...,V20,V21,V22,V23,V24,V25,V26,V27,V28,Class
0,1.783274,-0.994983,-1.359807,-0.072781,2.536347,1.378155,-0.338321,0.462388,0.239599,0.098698,...,0.251412,-0.018307,0.277838,-0.110474,0.066928,0.128539,-0.189115,0.133558,-0.021053,0
1,-0.269825,-0.994983,1.191857,0.266151,0.166480,0.448154,0.060018,-0.082361,-0.078803,0.085102,...,-0.069083,-0.225775,-0.638672,0.101288,-0.339846,0.167170,0.125895,-0.008983,0.014724,0
2,4.983721,-0.994972,-1.358354,-1.340163,1.773209,0.379780,-0.503198,1.800499,0.791461,0.247676,...,0.524980,0.247998,0.771679,0.909412,-0.689281,-0.327642,-0.139097,-0.055353,-0.059752,0
3,1.418291,-0.994972,-0.966272,-0.185226,1.792993,-0.863291,-0.010309,1.247203,0.237609,0.377436,...,-0.208038,-0.108300,0.005274,-0.190321,-1.175575,0.647376,-0.221929,0.062723,0.061458,0
4,0.670579,-0.994960,-1.158233,0.877737,1.548718,0.403034,-0.407193,0.095921,0.592941,-0.270533,...,0.408542,-0.009431,0.798278,-0.137458,0.141267,-0.206010,0.502292,0.219422,0.215153,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
568625,9.854633,0.638100,-2.356885,1.759518,-6.388271,1.791556,-3.461029,1.440556,-0.373200,1.446709,...,0.188948,0.855687,0.618928,0.954478,-0.615495,-1.709122,0.107490,1.124115,0.405976,1
568626,1.110878,-0.236643,-3.352917,0.753401,-1.698278,0.863169,-1.186314,-0.406322,-1.652498,0.020940,...,-0.371572,0.342796,0.454379,-0.130009,-0.499223,-0.042935,0.987288,-1.389017,0.750979,1
568627,0.313366,0.510075,1.015969,1.435963,-3.709889,2.784929,-0.214184,-1.435874,-1.600511,0.319123,...,0.288674,0.523091,0.374095,-0.176358,-0.434731,0.450034,-0.275049,0.488280,0.241260,1
568628,2.558090,0.519992,-2.297717,2.980660,-5.140960,2.602508,-1.445263,-0.677400,-1.801274,1.636259,...,-0.070069,0.600670,0.688650,-0.002084,0.160731,0.023995,-0.401783,0.549772,0.097347,1


LightGBM is a gradient boosting framework that uses tree-based learning algorithms.
LightGBM grows trees leaf-wise (vertically), while many other boosting implementations grow trees level-wise (horizontally). It chooses the leaf with the maximum delta loss to grow. For the same number of leaves, a leaf-wise algorithm tends to reduce more loss than a level-wise algorithm.
LightGBM can handle large datasets and needs less memory to run. It is also popular because it focuses on the accuracy of results and supports GPU learning, which is why data scientists widely use it for data science application development.
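The difference between the two growth strategies can be illustrated with a toy tree. This is only a sketch: the split gains below are invented numbers, not LightGBM output, and the helper names grow_leaf_wise and grow_level_wise are hypothetical. Real level-wise growth also expands a whole level at once, whereas here it is cut off at a leaf budget for comparison.

```python
import heapq

# Hypothetical loss reductions ("delta loss") for splitting each node of a small
# binary tree. Node i has children 2*i + 1 and 2*i + 2; the values are made up.
SPLIT_GAIN = {0: 10.0, 1: 6.0, 2: 1.0, 3: 5.0, 4: 4.0, 5: 0.5, 6: 0.2}

def grow_leaf_wise(num_leaves):
    """Always split the current leaf with the largest gain (best-first)."""
    frontier = [(-SPLIT_GAIN[0], 0)]  # max-heap emulated with negated gains
    order = []
    # Each split turns one leaf into two, so s splits give s + 1 leaves.
    while frontier and len(order) + 1 < num_leaves:
        _, node = heapq.heappop(frontier)
        order.append(node)
        for child in (2 * node + 1, 2 * node + 2):
            if child in SPLIT_GAIN:
                heapq.heappush(frontier, (-SPLIT_GAIN[child], child))
    return order

def grow_level_wise(num_leaves):
    """Split nodes level by level, left to right (breadth-first)."""
    order = []
    for node in sorted(SPLIT_GAIN):  # node ids are already in level order
        if len(order) + 1 >= num_leaves:
            break
        order.append(node)
    return order

leaf_order = grow_leaf_wise(num_leaves=4)
level_order = grow_level_wise(num_leaves=4)
leaf_gain = sum(SPLIT_GAIN[n] for n in leaf_order)
level_gain = sum(SPLIT_GAIN[n] for n in level_order)
print(leaf_order, leaf_gain)
print(level_order, level_gain)
```

With these made-up gains and a budget of four leaves, the leaf-wise strategy follows the high-gain branch deeper and accumulates more total loss reduction than the level-wise strategy.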

In [16]:
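# No column name contains 'cat', so this list is empty and every feature is treated as numeric.
categorical_features = [c for c, col in enumerate(data.columns) if 'cat' in col]
train_data = lightgbm.Dataset(x_train,label=y_train,categorical_feature=categorical_features)
# reference=train_data makes the validation set reuse the training set's feature binning.
test_data = lightgbm.Dataset(x_test,label=y_test,reference=train_data)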

In [17]:
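parameters = {
    # 'application' is an alias of 'objective', so only one of them is set.
    'objective': 'binary',
    'metric': 'auc',
    # The training data is already balanced by SMOTE, so this flag has little effect here.
    'is_unbalance': True,
    'boosting': 'gbdt',
    'num_leaves': 31,
    'feature_fraction': 0.5,
    'bagging_fraction': 0.5,
    'bagging_freq': 20,
    'learning_rate': 0.05,
    'verbose': 0
}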

In [18]:
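model = lightgbm.train(parameters,
                       train_data,
                       valid_sets=[test_data],
                       num_boost_round=5000,
                       # Recent LightGBM releases no longer accept early_stopping_rounds
                       # in train(); the callback is the equivalent.
                       callbacks=[lightgbm.early_stopping(stopping_rounds=100)])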


Logistic regression is named for the function used at the core of the method, the logistic function.

The logistic function, also called the sigmoid function, was developed by statisticians to describe properties of population growth in ecology, rising quickly and maxing out at the carrying capacity of the environment. It is an S-shaped curve that can take any real-valued number and map it to a value between 0 and 1, but never exactly at those limits.

1 / (1 + e^-value)

Here e is the base of the natural logarithm (Euler's number, or the EXP() function in a spreadsheet) and value is the number that you want to transform. For example, the logistic function maps the numbers between -5 and 5 into the range from about 0.007 to about 0.993.
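As a quick check of the formula, the following sketch evaluates the logistic function on the integers from -5 to 5 with NumPy. The function name logistic is chosen here for illustration.

```python
import numpy as np

def logistic(value):
    """Map any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-value))

values = np.arange(-5, 6)        # the integers -5, -4, ..., 5
transformed = logistic(values)

for v, t in zip(values, transformed):
    print(f"{int(v):>3d} -> {t:.4f}")
```

The output rises monotonically, passes through 0.5 at an input of 0, and flattens out towards 0 and 1 at the two ends, which is the S-shape described above.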



In [19]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

# max_iter is raised from the default of 100 to give the solver more room to converge.
model=LogisticRegression(max_iter=1000)
model.fit(x_train,y_train)
pred=model.predict(x_test)
target_names=['class 0','class 1']
print(classification_report(y_test,pred,target_names=target_names))

XGBoost is an optimized distributed gradient boosting library designed to be highly efficient, flexible and portable. It implements machine learning algorithms under the gradient boosting framework. XGBoost provides parallel tree boosting (also known as GBDT or GBM) that solves many data science problems quickly and accurately. The same code runs on major distributed environments (Hadoop, SGE, MPI) and can solve problems beyond billions of examples.

In [20]:
import xgboost as xgb
from sklearn.metrics import classification_report

D_train = xgb.DMatrix(x_train, label=y_train)
D_test = xgb.DMatrix(x_test, label=y_test)
# Fraud detection is a two-class problem, so a binary objective is used
# instead of 'multi:softprob' with num_class=3.
param = {
    'eta': 0.3,
    'max_depth': 3,
    'objective': 'binary:logistic'}

steps = 20  # number of boosting rounds

model = xgb.train(param, D_train, steps)

# binary:logistic returns the predicted probability of class 1; threshold it at 0.5.
preds2 = model.predict(D_test)
best_preds = (preds2 > 0.5).astype(int)

target_names=['class 0','class 1']
print(classification_report(y_test,best_preds,target_names=target_names))

In [21]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

# Random forest trained on the same resampled split, for comparison with the models above.
model=RandomForestClassifier()
model.fit(x_train,y_train)
pred=model.predict(x_test)
target_names=['class 0','class 1']
print(classification_report(y_test,pred,target_names=target_names))