CREDIT CARD FRAUD DETECTION

Context:

It is important that credit card companies are able to recognize fraudulent credit card transactions so that customers are not charged for items that they did not purchase.

Content:

The dataset contains transactions made by credit cards in September 2013 by European cardholders. This dataset presents transactions that occurred in two days, where we have 492 frauds out of 284,807 transactions. The dataset is highly unbalanced: the positive class (frauds) accounts for 0.172% of all transactions.

It contains only numerical input variables which are the result of a PCA transformation. Unfortunately, due to confidentiality issues, we cannot provide the original features and more background information about the data. Features V1, V2, ... V28 are the principal components obtained with PCA; the only features which have not been transformed with PCA are 'Time' and 'Amount'. Feature 'Time' contains the seconds elapsed between each transaction and the first transaction in the dataset. The feature 'Amount' is the transaction amount; this feature can be used for example-dependent cost-sensitive learning. Feature 'Class' is the response variable and it takes value 1 in case of fraud and 0 otherwise.

Inspiration:

Identify fraudulent credit card transactions.

Given the class imbalance ratio, we recommend measuring the accuracy using the Area Under the Precision-Recall Curve (AUPRC). Confusion matrix accuracy is not meaningful for unbalanced classification.
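
As a minimal sketch of the recommended metric, the snippet below estimates the area under the precision-recall curve with scikit-learn's `average_precision_score`. The labels and scores are synthetic stand-ins (assumptions, not values from this dataset); with a fitted classifier the scores would typically come from something like `model.predict_proba(X_test)[:, 1]`.

```python
import numpy as np
from sklearn.metrics import average_precision_score

# Synthetic, illustrative data: 1,000 transactions with roughly 1% fraud (assumed values).
rng = np.random.default_rng(0)
y_true = (rng.random(1000) < 0.01).astype(int)
y_true[0] = 1  # guarantee at least one positive example

# Hypothetical model scores: frauds tend to score higher than legit transactions.
y_score = rng.random(1000) * 0.5 + y_true * 0.4

auprc = average_precision_score(y_true, y_score)
# A score that ranks every fraud above every legit transaction gives the maximum value.
perfect = average_precision_score(y_true, y_true.astype(float))
print(f"AUPRC: {auprc:.3f}, perfect ranking: {perfect:.3f}")
```

Average precision is one common step-wise estimate of the AUPRC; it depends on how well the scores rank frauds above legit transactions, not on a single decision threshold.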

Acknowledgements:

The dataset has been collected and analysed during a research collaboration of Worldline and the Machine Learning Group (http://mlg.ulb.ac.be) of ULB (Université Libre de Bruxelles) on big data mining and fraud detection. More details on current and past projects on related topics are available on https://www.researchgate.net/project/Fraud-detection-5 and the page of the DefeatFraud project.

Importing the Dependencies

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

Loading the dataset to a Pandas DataFrame

In [2]:
data = pd.read_csv('creditcard.csv')

This CSV file contains the transactions made by different users with different credit cards.

Printing first 5 rows of the dataset

In [3]:
data.head()

Unnamed: 0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
0,0.0,-1.359807,-0.072781,2.536347,1.378155,-0.338321,0.462388,0.239599,0.098698,0.363787,...,-0.018307,0.277838,-0.110474,0.066928,0.128539,-0.189115,0.133558,-0.021053,149.62,0
1,0.0,1.191857,0.266151,0.16648,0.448154,0.060018,-0.082361,-0.078803,0.085102,-0.255425,...,-0.225775,-0.638672,0.101288,-0.339846,0.16717,0.125895,-0.008983,0.014724,2.69,0
2,1.0,-1.358354,-1.340163,1.773209,0.37978,-0.503198,1.800499,0.791461,0.247676,-1.514654,...,0.247998,0.771679,0.909412,-0.689281,-0.327642,-0.139097,-0.055353,-0.059752,378.66,0
3,1.0,-0.966272,-0.185226,1.792993,-0.863291,-0.010309,1.247203,0.237609,0.377436,-1.387024,...,-0.1083,0.005274,-0.190321,-1.175575,0.647376,-0.221929,0.062723,0.061458,123.5,0
4,2.0,-1.158233,0.877737,1.548718,0.403034,-0.407193,0.095921,0.592941,-0.270533,0.817739,...,-0.009431,0.798278,-0.137458,0.141267,-0.20601,0.502292,0.219422,0.215153,69.99,0


In the dataset, 'Class' indicates whether a transaction is fraudulent or not.
0 --> Legit Transaction (i.e., the transaction is not fraudulent)
1 --> Fraud Transaction

Printing last 5 rows of the dataset

In [4]:
data.tail()

Unnamed: 0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
284802,172786.0,-11.881118,10.071785,-9.834783,-2.066656,-5.364473,-2.606837,-4.918215,7.305334,1.914428,...,0.213454,0.111864,1.01448,-0.509348,1.436807,0.250034,0.943651,0.823731,0.77,0
284803,172787.0,-0.732789,-0.05508,2.03503,-0.738589,0.868229,1.058415,0.02433,0.294869,0.5848,...,0.214205,0.924384,0.012463,-1.016226,-0.606624,-0.395255,0.068472,-0.053527,24.79,0
284804,172788.0,1.919565,-0.301254,-3.24964,-0.557828,2.630515,3.03126,-0.296827,0.708417,0.432454,...,0.232045,0.578229,-0.037501,0.640134,0.265745,-0.087371,0.004455,-0.026561,67.88,0
284805,172788.0,-0.24044,0.530483,0.70251,0.689799,-0.377961,0.623708,-0.68618,0.679145,0.392087,...,0.265245,0.800049,-0.163298,0.123205,-0.569159,0.546668,0.108821,0.104533,10.0,0
284806,172792.0,-0.533413,-0.189733,0.703337,-0.506271,-0.012546,-0.649617,1.577006,-0.41465,0.48618,...,0.261057,0.643078,0.376777,0.008797,-0.473649,-0.818267,-0.002415,0.013649,217.0,0


There are 284,807 transactions. This is a very large dataset and also a very unbalanced one.

Getting the information of dataset

In [5]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 284807 entries, 0 to 284806
Data columns (total 31 columns):
 #   Column  Non-Null Count   Dtype  
---  ------  --------------   -----  
 0   Time    284807 non-null  float64
 1   V1      284807 non-null  float64
 2   V2      284807 non-null  float64
 3   V3      284807 non-null  float64
 4   V4      284807 non-null  float64
 5   V5      284807 non-null  float64
 6   V6      284807 non-null  float64
 7   V7      284807 non-null  float64
 8   V8      284807 non-null  float64
 9   V9      284807 non-null  float64
 10  V10     284807 non-null  float64
 11  V11     284807 non-null  float64
 12  V12     284807 non-null  float64
 13  V13     284807 non-null  float64
 14  V14     284807 non-null  float64
 15  V15     284807 non-null  float64
 16  V16     284807 non-null  float64
 17  V17     284807 non-null  float64
 18  V18     284807 non-null  float64
 19  V19     284807 non-null  float64
 20  V20     284807 non-null  float64
 21  V21     28

Checking the number of missing values in each column

In [6]:
data.isnull().sum()

Time      0
V1        0
V2        0
V3        0
V4        0
V5        0
V6        0
V7        0
V8        0
V9        0
V10       0
V11       0
V12       0
V13       0
V14       0
V15       0
V16       0
V17       0
V18       0
V19       0
V20       0
V21       0
V22       0
V23       0
V24       0
V25       0
V26       0
V27       0
V28       0
Amount    0
Class     0
dtype: int64

We can see that we don't have any missing values. If we had missing values, we would need to handle them first, for example by replacing them with meaningful numbers such as the column mean or median.
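
For illustration only, here is a small sketch of how missing values could be replaced if a dataset did contain them. The toy DataFrame and the choice of the column median are assumptions; this dataset needs no such step.

```python
import numpy as np
import pandas as pd

# Toy frame with missing values (illustrative; the real dataset has none).
toy = pd.DataFrame({"Amount": [10.0, np.nan, 30.0], "V1": [0.5, -1.2, np.nan]})

# Replace each missing value with the median of its own column.
filled = toy.fillna(toy.median())
print(filled.isnull().sum().sum())
```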

Next we are going to check the distribution of legit transactions and fraudulent transactions.

In [7]:
#distribution of legit transactions and fraudulent transactions
data['Class'].value_counts()

Class
0    284315
1       492
Name: count, dtype: int64

Here we can see the number of normal transactions and the number of fraudulent transactions.
0 represents a normal transaction and 1 represents a fraudulent transaction.
284315 transactions are normal and 492 are fraudulent.

We can also see that more than 99% of the transactions belong to one class. If we train a machine learning model directly on this data, it can reach a high accuracy by predicting every transaction as legit and still fail to recognize the fraudulent ones, because there are so few of them. So we need to handle this unbalanced dataset.
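
The arithmetic behind this concern can be sketched in a few lines. The counts come from the `value_counts()` output above; the "model" that labels everything as legit is a hypothetical baseline used only for illustration.

```python
# Counts taken from the value_counts() output above.
n_legit, n_fraud = 284315, 492
total = n_legit + n_fraud

# A hypothetical "model" that labels every transaction as legit (class 0).
baseline_accuracy = n_legit / total  # fraction of correct predictions
fraud_recall = 0 / n_fraud           # share of frauds that baseline would catch
fraud_share = n_fraud / total        # positive-class share of the data

print(f"baseline accuracy: {baseline_accuracy:.4%}")
print(f"fraud recall: {fraud_recall:.0%}, fraud share: {fraud_share:.3%}")
```

Such a baseline is correct on more than 99.8% of transactions while detecting no fraud at all, which is why accuracy on the raw data says little about fraud detection.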

In [8]:
#Separating the normal and fraudulent transactions for analysis
legit = data[data.Class == 0]
fraud = data[data.Class == 1]

In [9]:
print(legit.shape)
print(fraud.shape)

(284315, 31)
(492, 31)


(284315, 31) --> Normal Transaction ;
(492, 31) --> Fraudulent Transactions

In [10]:
#statistical measures of the data which are legit
legit.Amount.describe()
#describe() summarizes the amount of money transacted in the legit transactions

count    284315.000000
mean         88.291022
std         250.105092
min           0.000000
25%           5.650000
50%          22.000000
75%          77.050000
max       25691.160000
Name: Amount, dtype: float64

Here, (25% --> 5.650000) means that 25% of the legit transactions had an amount of 5.65 or less.
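
As a small sketch of how to read such a percentile, the snippet below computes the 25th percentile of a toy series (illustrative values, not taken from the dataset) and checks what share of the values lie at or below it.

```python
import pandas as pd

# Toy amounts (illustrative values, not taken from the dataset).
amounts = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])

# The 25th percentile: a quarter of the values lie at or below it.
q25 = amounts.quantile(0.25)
share_at_or_below = (amounts <= q25).mean()

# describe() reports the same value under the "25%" label.
print(q25, share_at_or_below, amounts.describe()["25%"])
```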

In [12]:
#statistical measures of the data which are fraud
fraud.Amount.describe()

count     492.000000
mean      122.211321
std       256.683288
min         0.000000
25%         1.000000
50%         9.250000
75%       105.890000
max      2125.870000
Name: Amount, dtype: float64

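Comparing the two sets of statistics, the mean amount of the fraudulent transactions (122.21) is higher than the mean amount of the legit transactions (88.29), although the median of the fraudulent transactions (9.25) is lower than that of the legit transactions (22.00).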

In [13]:
#compare the values for both transactions
data.groupby('Class').mean()

Class,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V20,V21,V22,V23,V24,V25,V26,V27,V28,Amount
0,94838.202258,0.008258,-0.006271,0.012171,-0.00786,0.005453,0.002419,0.009637,-0.000987,0.004467,...,-0.000644,-0.001235,-2.4e-05,7e-05,0.000182,-7.2e-05,-8.9e-05,-0.000295,-0.000131,88.291022
1,80746.806911,-4.771948,3.623778,-7.033281,4.542029,-3.151225,-1.397737,-5.568731,0.570636,-2.581123,...,0.372319,0.713588,0.014049,-0.040308,-0.10513,0.041449,0.051648,0.170575,0.075667,122.211321


In [15]:
#Here, we got the mean value of each column separately for class 0 and class 1.
#We can also see a wide difference between the means of the normal transactions and the fraudulent transactions.

UNDER-SAMPLING

Build a sample dataset containing an equal number of normal transactions and fraudulent transactions.

Number of Fraudulent Transactions --> 492

In [16]:
#To build a sample dataset, we will take 492 random transactions from the legit transactions and add them to the fraudulent transactions. This gives a properly balanced dataset.

In [17]:
#taking a random sample of the legit transactions
legit_sample = legit.sample(n=492)
#returns 492 randomly chosen rows, not the first 492

Concatenating two DataFrames

In [18]:
new_sample = pd.concat([legit_sample, fraud],axis = 0)
#axis = 0 --> rows & axis = 1 --> columns
#by mentioning axis = 0, 2nd DataFrame will be added below 1st DataFrame
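
Because `sample()` draws different rows on every run, the results of the following cells can change between runs. Below is a minimal sketch of a repeatable version of the under-sampling and concatenation steps on a toy DataFrame. The toy data, the names ending in `_toy`, and the seed value 42 are assumptions made for illustration.

```python
import pandas as pd

# Toy imbalanced frame (illustrative): 8 legit rows and 2 fraud rows.
toy = pd.DataFrame({"Amount": range(10), "Class": [0] * 8 + [1] * 2})
legit_toy = toy[toy.Class == 0]
fraud_toy = toy[toy.Class == 1]

# Fixing random_state (an assumed seed) makes the random draw repeatable.
legit_toy_sample = legit_toy.sample(n=len(fraud_toy), random_state=42)

# Stack the two frames, then shuffle the rows so the classes are interleaved.
balanced = pd.concat([legit_toy_sample, fraud_toy], axis=0)
balanced = balanced.sample(frac=1, random_state=42).reset_index(drop=True)
print(balanced["Class"].value_counts())
```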

In [19]:
new_sample.head()

Unnamed: 0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
54554,46502.0,-1.254418,-0.175598,3.180335,0.053616,-0.170921,0.70624,-0.546384,0.643905,0.735221,...,-0.034662,0.155874,-0.202156,0.280195,0.28699,0.303297,0.058834,0.063528,4.7,0
144019,85800.0,0.326079,-2.312094,0.167918,-0.074141,-1.793212,-0.517294,-0.027149,-0.288946,-0.539306,...,0.505344,0.302354,-0.562311,0.484435,0.333267,-0.202473,-0.070955,0.115874,557.03,0
46416,42777.0,1.168008,-0.209677,0.595152,1.316714,-0.450264,0.560179,-0.548571,0.338883,0.957436,...,-0.134686,-0.181452,-0.196179,-0.525376,0.701393,-0.247522,0.038704,0.004027,12.2,0
207435,136674.0,1.946018,-1.760173,-1.181017,-1.820891,0.63713,4.151453,-2.19542,1.166514,0.463694,...,0.472758,1.174689,0.163403,0.74309,-0.409189,-0.079295,0.061008,-0.02365,90.0,0
26642,34180.0,-0.398232,0.807036,1.61805,1.16324,0.369599,0.354416,0.619143,0.082985,-0.285784,...,-0.039942,0.343063,-0.187733,0.008731,-0.171328,-0.266157,0.255818,-0.029619,7.99,0


In [20]:
new_sample.tail()

Unnamed: 0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
279863,169142.0,-1.927883,1.125653,-4.518331,1.749293,-1.566487,-2.010494,-0.88285,0.697211,-2.064945,...,0.778584,-0.319189,0.639419,-0.294885,0.537503,0.788395,0.29268,0.147968,390.0,1
280143,169347.0,1.378559,1.289381,-5.004247,1.41185,0.442581,-1.326536,-1.41317,0.248525,-1.127396,...,0.370612,0.028234,-0.14564,-0.081049,0.521875,0.739467,0.389152,0.186637,0.76,1
280149,169351.0,-0.676143,1.126366,-2.2137,0.468308,-1.120541,-0.003346,-2.234739,1.210158,-0.65225,...,0.751826,0.834108,0.190944,0.03207,-0.739695,0.471111,0.385107,0.194361,77.89,1
281144,169966.0,-3.113832,0.585864,-5.39973,1.817092,-0.840618,-2.943548,-2.208002,1.058733,-1.632333,...,0.583276,-0.269209,-0.456108,-0.183659,-0.328168,0.606116,0.884876,-0.2537,245.0,1
281674,170348.0,1.991976,0.158476,-2.583441,0.40867,1.151147,-0.096695,0.22305,-0.068384,0.577829,...,-0.16435,-0.295135,-0.072173,-0.450261,0.313267,-0.289617,0.002988,-0.015309,42.53,1


In [22]:
#distribution of legit transactions and fraudulent transactions in the sample dataset
new_sample['Class'].value_counts()

Class
0    492
1    492
Name: count, dtype: int64

In [23]:
#compare the values for both transactions in the sample dataset
new_sample.groupby('Class').mean()

Class,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V20,V21,V22,V23,V24,V25,V26,V27,V28,Amount
0,93448.278455,-0.130713,-0.16335,0.080125,0.146341,-0.022052,0.025125,-0.096937,-0.016904,0.074123,...,0.016321,0.055175,-0.008451,0.043271,0.01221,-0.007676,0.033893,-0.012235,0.012973,96.827541
1,80746.806911,-4.771948,3.623778,-7.033281,4.542029,-3.151225,-1.397737,-5.568731,0.570636,-2.581123,...,0.372319,0.713588,0.014049,-0.040308,-0.10513,0.041449,0.051648,0.170575,0.075667,122.211321


The difference in means between the two classes is still there, so the sample preserves the pattern seen in the full dataset. Because this dataset is balanced, it will help the machine learning model learn to tell whether a transaction is normal or fraudulent.

Splitting the data into Features & Targets:

In [24]:
X = new_sample.drop(columns = 'Class')
Y = new_sample['Class']

In [25]:
print(X)

            Time        V1        V2        V3        V4        V5        V6  \
54554    46502.0 -1.254418 -0.175598  3.180335  0.053616 -0.170921  0.706240   
144019   85800.0  0.326079 -2.312094  0.167918 -0.074141 -1.793212 -0.517294   
46416    42777.0  1.168008 -0.209677  0.595152  1.316714 -0.450264  0.560179   
207435  136674.0  1.946018 -1.760173 -1.181017 -1.820891  0.637130  4.151453   
26642    34180.0 -0.398232  0.807036  1.618050  1.163240  0.369599  0.354416   
...          ...       ...       ...       ...       ...       ...       ...   
279863  169142.0 -1.927883  1.125653 -4.518331  1.749293 -1.566487 -2.010494   
280143  169347.0  1.378559  1.289381 -5.004247  1.411850  0.442581 -1.326536   
280149  169351.0 -0.676143  1.126366 -2.213700  0.468308 -1.120541 -0.003346   
281144  169966.0 -3.113832  0.585864 -5.399730  1.817092 -0.840618 -2.943548   
281674  170348.0  1.991976  0.158476 -2.583441  0.408670  1.151147 -0.096695   

              V7        V8        V9  .

Here, we can see that the 'Class' column has been dropped from the X DataFrame.

In [26]:
print(Y)

54554     0
144019    0
46416     0
207435    0
26642     0
         ..
279863    1
280143    1
280149    1
281144    1
281674    1
Name: Class, Length: 984, dtype: int64


Split the data into Training Data & Testing Data:

In [27]:
#for this we are going to use train_test_split function which we have imported from sklearn.model_selection

In [32]:
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.2, stratify = Y, random_state = 2)
#features are present in X, labels are present in Y
#here we split X and Y randomly into training data and testing data
#X_train & X_test contain the features of the training data and the testing data
#Y_train & Y_test contain the corresponding labels of the training data and the testing data
#stratify = Y --> keeps the same proportion of both classes in the training data and the testing data
#test_size = 0.2 --> gives 20% of the data to X_test & Y_test and the remaining 80% is stored in X_train & Y_train
#random_state = 2 --> makes the split reproducible
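
To see what `stratify` does, here is a small sketch on toy arrays (illustrative data; the names `X_toy` and `y_toy` are assumptions). Both splits keep the original share of class 1.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (illustrative): 100 samples, 20% of them in class 1.
X_toy = np.arange(200).reshape(100, 2)
y_toy = np.array([0] * 80 + [1] * 20)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=2
)

# With stratify, both splits keep the original 20% share of class 1.
print(y_tr.mean(), y_te.mean())
```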

In [33]:
print(X.shape, X_train.shape, X_test.shape)

(984, 30) (787, 30) (197, 30)


In [34]:
#Here, X_train has 80% of data and X_test has 20% of data, because we mentioned that test_size = 0.2

MODEL TRAINING

Logistic Regression

In [35]:
model = LogisticRegression()
#this creates an instance of the LogisticRegression model and stores it in the variable called 'model'

In [36]:
#training the Logistic Regression model with Training Data
model.fit(X_train, Y_train)
#the 'fit()' function is used to train the model on the given data
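
'Time' and 'Amount' are on a much larger scale than the PCA components, and with unscaled features the default solver may warn that it did not converge. One possible variation, sketched here on synthetic data, is to standardize the features first and raise `max_iter`. The data, the names `X_demo`, `y_demo` and `scaled_model`, and the value `max_iter=1000` are assumptions; this is not the model trained in this notebook.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the features (assumed data, not the real dataset):
# one small-scale informative column and one large-scale uninformative column.
rng = np.random.default_rng(0)
y_demo = np.array([0, 1] * 100)
X_demo = np.column_stack([
    rng.normal(loc=y_demo * 2.0, scale=1.0),           # small scale, like V1
    rng.normal(loc=90000.0, scale=40000.0, size=200),  # large scale, like Time
])

# Standardize the columns, then fit logistic regression with a higher max_iter.
scaled_model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scaled_model.fit(X_demo, y_demo)
print(scaled_model.score(X_demo, y_demo))
```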

MODEL EVALUATION

Accuracy Score:

1) First, let's find the accuracy score on the training data.
2) As our Logistic Regression model has learned from the data, we give only the X_train values to the model and it predicts the class of each one.
3) The predicted values are then compared with the original values in Y_train. The fraction that match is the Accuracy Score.
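
The comparison in these steps can be sketched by hand on a few toy labels (illustrative values, not model output): accuracy is simply the fraction of predictions that match the true labels.

```python
import numpy as np

# Toy labels (illustrative): 8 true labels and 8 predicted labels.
y_true_toy = np.array([0, 0, 1, 1, 0, 1, 0, 1])
y_pred_toy = np.array([0, 1, 1, 1, 0, 0, 0, 1])

# Accuracy is the fraction of predictions that match the true labels.
manual_accuracy = (y_true_toy == y_pred_toy).mean()
print(manual_accuracy)
```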

In [39]:
#Accuracy on training data
X_train_prediction = model.predict(X_train)
#predicting the labels for X_train, so 'predict()' function is used

training_data_accuracy = accuracy_score(Y_train, X_train_prediction)
#comparing the original labels in Y_train with the predicted values, which gives us the Accuracy Score

In [38]:
print('Accuracy on Training data : ', training_data_accuracy)

Accuracy on Training data :  0.940279542566709


Here we got an Accuracy Score of 0.9403 (i.e., 94%). It means that out of 100 predictions, our model gets approximately 94 correct.
As a rough rule of thumb, an accuracy score above 75-80% on a balanced dataset such as this one indicates good predictions.

In [40]:
#Accuracy on test data
X_test_prediction = model.predict(X_test)
#predicting the labels for X_test, so 'predict()' function is used

test_data_accuracy = accuracy_score(Y_test, X_test_prediction)
#comparing the original labels in Y_test with the predicted values, which gives us the Accuracy Score

In [41]:
print('Accuracy on Test data : ', test_data_accuracy)

Accuracy on Test data :  0.934010152284264


Here we got an Accuracy Score of 0.9340 (i.e., 93%), which is very close to the Accuracy Score on the training data.

If the accuracy score on the training data is much higher than on the test data, the model has overfitted the training data. If the accuracy is low on both, the model has underfitted.
For example, if we got an Accuracy Score of 89% on the training data and 40% on the testing data, it would mean our model has overfitted the training data.
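
Accuracy alone does not show which kind of mistake the model makes. As a hedged sketch, the snippet below computes a confusion matrix, precision and recall with scikit-learn on toy labels (illustrative values, not this model's predictions). The same calls could be applied to `Y_test` and `X_test_prediction`.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Toy labels (illustrative values): 1 = fraud, 0 = legit.
y_true_eval = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred_eval = [0, 0, 0, 0, 1, 1, 1, 1, 1, 0]

# Rows are true classes, columns are predicted classes: [[TN, FP], [FN, TP]].
cm = confusion_matrix(y_true_eval, y_pred_eval)
precision = precision_score(y_true_eval, y_pred_eval)  # TP / (TP + FP)
recall = recall_score(y_true_eval, y_pred_eval)        # TP / (TP + FN)
print(cm, precision, recall)
```

For fraud detection, recall shows how many frauds are caught and precision shows how many flagged transactions are really fraudulent.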