# **Credit Card Fraud Detection**

### **Data Source :**
We use the dataset available at https://www.kaggle.com/datasets/mlg-ulb/creditcardfraud?select=creditcard.csv

The dataset contains credit card transactions. It is highly imbalanced: only 492 of the 284,807 transactions are fraudulent.

In this dataset, **"Class"** is the target variable: it is 1 for a fraudulent transaction and 0 otherwise.
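To make the imbalance concrete, the short sketch below works out the fraud share from the two counts quoted above. It also computes the accuracy of a hypothetical baseline that labels every transaction as legit, which is why accuracy alone is a weak metric here. The variable names are illustrative only.

```python
# Counts quoted in the dataset description above.
total_transactions = 284_807
fraud_transactions = 492

fraud_share = fraud_transactions / total_transactions
# Accuracy of a hypothetical baseline that predicts Class 0 (legit) for every row.
baseline_accuracy = (total_transactions - fraud_transactions) / total_transactions

print(f"Fraud share: {fraud_share:.2%}")
print(f"All-legit baseline accuracy: {baseline_accuracy:.2%}")
```

Fraud makes up roughly 0.17% of the rows, so a model that never flags fraud would still be right about 99.8% of the time.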

Connect Google Drive to Google Colab.

In [1]:
from google.colab import drive
drive.mount("/content/drive")    # mount Drive so the CSV path used below resolves

## **Import Libraries**


In [2]:
import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE
from sklearn import preprocessing 

from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier 
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
import xgboost as xgb

from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix

Now, load the dataset into a pandas DataFrame.

In [3]:
data = pd.read_csv("/content/drive/MyDrive/creditcard.csv")

In [4]:
data.head()

Unnamed: 0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
0,0.0,-1.359807,-0.072781,2.536347,1.378155,-0.338321,0.462388,0.239599,0.098698,0.363787,...,-0.018307,0.277838,-0.110474,0.066928,0.128539,-0.189115,0.133558,-0.021053,149.62,0
1,0.0,1.191857,0.266151,0.16648,0.448154,0.060018,-0.082361,-0.078803,0.085102,-0.255425,...,-0.225775,-0.638672,0.101288,-0.339846,0.16717,0.125895,-0.008983,0.014724,2.69,0
2,1.0,-1.358354,-1.340163,1.773209,0.37978,-0.503198,1.800499,0.791461,0.247676,-1.514654,...,0.247998,0.771679,0.909412,-0.689281,-0.327642,-0.139097,-0.055353,-0.059752,378.66,0
3,1.0,-0.966272,-0.185226,1.792993,-0.863291,-0.010309,1.247203,0.237609,0.377436,-1.387024,...,-0.1083,0.005274,-0.190321,-1.175575,0.647376,-0.221929,0.062723,0.061458,123.5,0
4,2.0,-1.158233,0.877737,1.548718,0.403034,-0.407193,0.095921,0.592941,-0.270533,0.817739,...,-0.009431,0.798278,-0.137458,0.141267,-0.20601,0.502292,0.219422,0.215153,69.99,0


**The dataset consists of 284807 transactions.**

In [5]:
data.shape

(284807, 31)

### **Information of dataset**

Let's see the information of the dataset.

In [6]:
data.info()            # information of dataset

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 284807 entries, 0 to 284806
Data columns (total 31 columns):
 #   Column  Non-Null Count   Dtype  
---  ------  --------------   -----  
 0   Time    284807 non-null  float64
 1   V1      284807 non-null  float64
 2   V2      284807 non-null  float64
 3   V3      284807 non-null  float64
 4   V4      284807 non-null  float64
 5   V5      284807 non-null  float64
 6   V6      284807 non-null  float64
 7   V7      284807 non-null  float64
 8   V8      284807 non-null  float64
 9   V9      284807 non-null  float64
 10  V10     284807 non-null  float64
 11  V11     284807 non-null  float64
 12  V12     284807 non-null  float64
 13  V13     284807 non-null  float64
 14  V14     284807 non-null  float64
 15  V15     284807 non-null  float64
 16  V16     284807 non-null  float64
 17  V17     284807 non-null  float64
 18  V18     284807 non-null  float64
 19  V19     284807 non-null  float64
 20  V20     284807 non-null  float64
 21  V21     28

Check whether the dataset has any missing values.

In [7]:
data.isnull().sum()              # check missing values

Time      0
V1        0
V2        0
V3        0
V4        0
V5        0
V6        0
V7        0
V8        0
V9        0
V10       0
V11       0
V12       0
V13       0
V14       0
V15       0
V16       0
V17       0
V18       0
V19       0
V20       0
V21       0
V22       0
V23       0
V24       0
V25       0
V26       0
V27       0
V28       0
Amount    0
Class     0
dtype: int64

Check whether the dataset has any duplicate rows.

In [8]:
duplicate = data[data.duplicated()]        # check duplicate values

In [9]:
duplicate.shape

(1081, 31)

In [10]:
data.drop_duplicates(inplace=True)         # remove the duplicate values

In [11]:
data.shape

(283726, 31)

Next, look at the class distribution, i.e. the number of legit and fraudulent transactions in the dataset.

The **value_counts()** function returns the count of each unique value in a column.

In [12]:
data["Class"].value_counts()

0    283253
1       473
Name: Class, dtype: int64
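The same kind of tally can be reproduced in plain Python with `collections.Counter`. The labels below are a hypothetical toy sample, not rows from the dataset.

```python
from collections import Counter

# Hypothetical toy labels: 0 = legit, 1 = fraud.
labels = [0, 0, 1, 0, 0, 0, 1, 0]

counts = Counter(labels)
# most_common() orders by descending count, as value_counts() does by default.
for value, count in counts.most_common():
    print(value, count)
```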

For better analysis of data, split the dataset on the basis of legit and fraudulent transactions into **legit** and **fraud**.

In the dataset,

**0 --> legit transaction**

**1 --> fraudulent transaction**

As a result, **legit** contains all the legit transactions and **fraud** contains all the fraudulent transactions.

In [13]:
legit = data[data.Class == 0]
fraud = data[data.Class == 1]

Check the shape of legit and fraud data.

In [14]:
print(legit.shape)
print(fraud.shape)

(283253, 31)
(473, 31)


### **Statistical Measures of Data**

Summary statistics condense a set of observations into a few numbers.

Measures such as the mean, median, mode, percentiles, range, variance and standard deviation are commonly used to summarize data. For a numeric column, `describe()` reports the count, mean, standard deviation, minimum, quartiles and maximum.
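As a rough illustration of what these measures are, the sketch below computes them with Python's `statistics` module on the five `Amount` values shown in the `data.head()` output. It assumes the sample standard deviation and linearly interpolated quartiles, which is how pandas is understood to compute them by default. The figures are for this toy sample only, not for the full column.

```python
import statistics

# The five Amount values from the data.head() output, used as a toy sample.
amounts = [149.62, 2.69, 378.66, 123.5, 69.99]

mean = statistics.mean(amounts)
std = statistics.stdev(amounts)  # sample standard deviation (n - 1 denominator)
# "inclusive" interpolates linearly between the sorted sample values.
q1, median, q3 = statistics.quantiles(amounts, n=4, method="inclusive")

print(f"count={len(amounts)} mean={mean:.2f} std={std:.2f}")
print(f"min={min(amounts)} 25%={q1} 50%={median} 75%={q3} max={max(amounts)}")
```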

In [15]:
legit.Amount.describe()

count    283253.000000
mean         88.413575
std         250.379023
min           0.000000
25%           5.670000
50%          22.000000
75%          77.460000
max       25691.160000
Name: Amount, dtype: float64

In [16]:
fraud.Amount.describe()

count     473.000000
mean      123.871860
std       260.211041
min         0.000000
25%         1.000000
50%         9.820000
75%       105.890000
max      2125.870000
Name: Amount, dtype: float64

### **Splitting data into features and targets**

**Features** are the independent variables in the dataset. We store them in **X**.

**Targets** are the dependent variables in the dataset. Here the only target is **Class**, which we store in **Y**.

In [17]:
X = data.drop(columns="Class")
Y = data["Class"]

In [18]:
X.head()

Unnamed: 0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V20,V21,V22,V23,V24,V25,V26,V27,V28,Amount
0,0.0,-1.359807,-0.072781,2.536347,1.378155,-0.338321,0.462388,0.239599,0.098698,0.363787,...,0.251412,-0.018307,0.277838,-0.110474,0.066928,0.128539,-0.189115,0.133558,-0.021053,149.62
1,0.0,1.191857,0.266151,0.16648,0.448154,0.060018,-0.082361,-0.078803,0.085102,-0.255425,...,-0.069083,-0.225775,-0.638672,0.101288,-0.339846,0.16717,0.125895,-0.008983,0.014724,2.69
2,1.0,-1.358354,-1.340163,1.773209,0.37978,-0.503198,1.800499,0.791461,0.247676,-1.514654,...,0.52498,0.247998,0.771679,0.909412,-0.689281,-0.327642,-0.139097,-0.055353,-0.059752,378.66
3,1.0,-0.966272,-0.185226,1.792993,-0.863291,-0.010309,1.247203,0.237609,0.377436,-1.387024,...,-0.208038,-0.1083,0.005274,-0.190321,-1.175575,0.647376,-0.221929,0.062723,0.061458,123.5
4,2.0,-1.158233,0.877737,1.548718,0.403034,-0.407193,0.095921,0.592941,-0.270533,0.817739,...,0.408542,-0.009431,0.798278,-0.137458,0.141267,-0.20601,0.502292,0.219422,0.215153,69.99


In [19]:
Y.head()

0    0
1    0
2    0
3    0
4    0
Name: Class, dtype: int64

### **Oversampling - SMOTE**

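**Oversampling is used to balance a classification dataset that has a skewed class distribution.**

Here we use SMOTE (Synthetic Minority Over-sampling Technique), which balances the dataset by generating synthetic minority-class points. Each new point is interpolated between an existing fraud sample and one of its nearest fraud neighbours, so it is slightly different from the original data points.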
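The core interpolation step can be sketched in a few lines of plain Python. This is only an illustration of the idea, not the imbalanced-learn implementation: the nearest-neighbour search is skipped, and `smote_like_sample` and the two sample points are hypothetical.

```python
import random


def smote_like_sample(point, neighbor, rng):
    """Return a synthetic point on the line segment between point and neighbor."""
    gap = rng.random()  # interpolation factor in [0, 1)
    return [p + gap * (n - p) for p, n in zip(point, neighbor)]


rng = random.Random(42)
# Two hypothetical minority-class (fraud) samples with two features each.
fraud_a = [1.0, 4.0]
fraud_b = [3.0, 2.0]

synthetic = smote_like_sample(fraud_a, fraud_b, rng)
print(synthetic)
```

Because the same factor is applied to every feature, the synthetic point always lies on the straight line between the two real samples.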

In [20]:
from collections import Counter
s = SMOTE(random_state=42)
X_new, Y_new = s.fit_resample(X, Y)
print('Original dataset shape %s' % Counter(Y))
print('Resampled dataset shape %s' % Counter(Y_new))

Original dataset shape Counter({0: 283253, 1: 473})
Resampled dataset shape Counter({0: 283253, 1: 283253})


In [21]:
print(X_new.shape)
print(Y_new.shape)

(566506, 30)
(566506,)


After applying SMOTE, the resampled dataset has 566,506 transactions, split evenly between legit and fraud (283,253 each).

### **Normalization**

Normalization rescales each attribute to the range 0 to 1. Min-max normalization maps a value x to (x - min) / (max - min), where min and max are the smallest and largest values of that attribute, so every value falls between zero and one.
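A minimal plain-Python sketch of min-max rescaling, applied to the five `Amount` values from the `data.head()` output, is shown here. The helper `min_max_scale` is hypothetical. Its results differ from the notebook's scaled values, because there the minimum and maximum are taken over the whole resampled column.

```python
def min_max_scale(values):
    """Rescale values linearly so the smallest maps to 0 and the largest to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Constant column: return zeros to avoid dividing by zero (a choice made for this sketch).
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


amounts = [149.62, 2.69, 378.66, 123.5, 69.99]
scaled = min_max_scale(amounts)
print([round(s, 4) for s in scaled])
```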

In [22]:
scaler = preprocessing.MinMaxScaler()               # min-max normalization to [0, 1]
names = X.columns
d = scaler.fit_transform(X_new)
scaled_data = pd.DataFrame(d, columns=names)
scaled_data.head()

Unnamed: 0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V20,V21,V22,V23,V24,V25,V26,V27,V28,Amount
0,0.0,0.935192,0.76649,0.881365,0.313023,0.763439,0.267669,0.266815,0.786444,0.475312,...,0.582942,0.561184,0.522992,0.663793,0.391253,0.585122,0.394557,0.418976,0.312697,0.005824
1,0.0,0.978542,0.770067,0.840298,0.271796,0.76612,0.262192,0.264875,0.786298,0.453981,...,0.57953,0.55784,0.480237,0.666938,0.33644,0.58729,0.446013,0.416345,0.313423,0.000105
2,6e-06,0.935217,0.753118,0.868141,0.268766,0.762329,0.281122,0.270177,0.788042,0.410603,...,0.585855,0.565477,0.54603,0.678939,0.289354,0.559515,0.402727,0.415489,0.311911,0.014739
3,6e-06,0.941878,0.765304,0.868484,0.213661,0.765647,0.275559,0.266803,0.789434,0.414999,...,0.57805,0.559734,0.510277,0.662607,0.223826,0.614245,0.389197,0.417669,0.314371,0.004807
4,1.2e-05,0.938617,0.77652,0.864251,0.269796,0.762975,0.263984,0.268968,0.782484,0.49095,...,0.584615,0.561327,0.547271,0.663392,0.40127,0.566343,0.507497,0.420561,0.31749,0.002724


In [23]:
scaled_data.describe()

Unnamed: 0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V20,V21,V22,V23,V24,V25,V26,V27,V28,Amount
count,566506.0,566506.0,566506.0,566506.0,566506.0,566506.0,566506.0,566506.0,566506.0,566506.0,...,566506.0,566506.0,566506.0,566506.0,566506.0,566506.0,566506.0,566506.0,566506.0,566506.0
mean,0.507372,0.919287,0.785972,0.77757,0.353169,0.755553,0.255564,0.249034,0.791087,0.418157,...,0.582458,0.565186,0.511991,0.664833,0.37491,0.579291,0.429992,0.418499,0.314001,0.003764
std,0.28092,0.08874,0.036284,0.10235,0.138016,0.026967,0.016207,0.032396,0.041429,0.074992,...,0.010475,0.028602,0.042028,0.013856,0.072701,0.034622,0.072063,0.015476,0.008206,0.008619
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,0.261922,0.909445,0.766114,0.75039,0.249251,0.754252,0.247568,0.246758,0.783371,0.384901,...,0.578327,0.558905,0.487373,0.662117,0.331345,0.560883,0.381549,0.415319,0.312012,0.000143
50%,0.459037,0.944849,0.7777,0.811364,0.316515,0.762876,0.256099,0.260947,0.787238,0.434777,...,0.58074,0.563986,0.511244,0.665004,0.381105,0.58018,0.42519,0.417622,0.313897,0.001006
75%,0.780906,0.972496,0.796469,0.842182,0.441314,0.768689,0.263135,0.266555,0.794442,0.465708,...,0.584948,0.571444,0.534581,0.66812,0.42741,0.598595,0.475599,0.424395,0.31741,0.003892
max,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0


### **Training data and Testing data**

We build the training and testing sets with the `train_test_split` function, splitting the dataset as follows:

**Training dataset --> 80%**

**Testing dataset --> 20%**

**Building Training dataset -**

80% of the feature rows and their corresponding target values are stored in **X_train** and **Y_train**.

**Building Testing Dataset -**

The remaining 20% of the feature rows and their corresponding target values are stored in **X_test** and **Y_test**.

**test_size --> 0.2** means that 20% of the data is held out for testing. Note that the split is applied to the resampled data **X_new** and **Y_new** rather than to **scaled_data**, so the models are trained on unscaled features and the test set also contains SMOTE-generated samples.
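The expected row counts for an 80/20 split of the resampled dataset can be checked by hand. The sketch assumes the test count is rounded up and the training set takes the remainder, which is consistent with the shapes the notebook prints.

```python
import math

n_samples = 566_506  # rows in the resampled dataset
test_size = 0.2

# Assumption: the test count is rounded up and the training set takes the rest.
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print(n_train, n_test)  # 453204 113302
```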


In [24]:
X_train, X_test, Y_train, Y_test = train_test_split(X_new, Y_new, test_size = 0.2, random_state = 2)

In [25]:
print(X_new.shape, X_train.shape, X_test.shape)

(566506, 30) (453204, 30) (113302, 30)


### **Model Training**

Now we train several machine learning algorithms and compare their precision, recall, f1-score and accuracy on the test set.

### **LOGISTIC REGRESSION**

In [26]:
lr = LogisticRegression(max_iter=1000)

In [27]:
lr.fit(X_train, Y_train)

In [28]:
y_pred = lr.predict(X_test)

In [29]:
print(classification_report(Y_test, y_pred))

              precision    recall  f1-score   support

           0       0.96      0.99      0.97     56455
           1       0.99      0.96      0.97     56847

    accuracy                           0.97    113302
   macro avg       0.97      0.97      0.97    113302
weighted avg       0.97      0.97      0.97    113302



In [30]:
cm = confusion_matrix(Y_test, y_pred)

In [31]:
cm

array([[55642,   813],
       [ 2166, 54681]])
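The class 1 (fraud) figures in the classification report can be recomputed from this confusion matrix. The sketch assumes scikit-learn's layout, with true classes in rows and predicted classes in columns.

```python
# Confusion matrix from the logistic regression cell:
# rows are true classes, columns are predicted classes.
tn, fp = 55642, 813
fn, tp = 2166, 54681

precision = tp / (tp + fp)  # share of predicted frauds that are real frauds
recall = tp / (tp + fn)     # share of real frauds that were caught
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f} accuracy={accuracy:.2f}")
# precision=0.99 recall=0.96 f1=0.97 accuracy=0.97
```

The 2,166 false negatives are frauds the model missed, which is why recall (0.96) is lower than precision (0.99) for class 1.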

### **NAIVE BAYES MODEL**

In [32]:
nb = GaussianNB() 

In [33]:
nb.fit(X_train, Y_train) 

In [34]:
y_pred = nb.predict(X_test)

In [35]:
print(classification_report(Y_test, y_pred))

              precision    recall  f1-score   support

           0       0.80      0.99      0.88     56455
           1       0.99      0.75      0.85     56847

    accuracy                           0.87    113302
   macro avg       0.89      0.87      0.87    113302
weighted avg       0.89      0.87      0.87    113302



### **DECISION TREE MODEL**

In [36]:
dt = DecisionTreeClassifier(criterion='entropy', random_state=0)

In [37]:
dt.fit(X_train, Y_train)

In [38]:
y_pred = dt.predict(X_test)

In [39]:
print(classification_report(Y_test, y_pred))

              precision    recall  f1-score   support

           0       1.00      1.00      1.00     56455
           1       1.00      1.00      1.00     56847

    accuracy                           1.00    113302
   macro avg       1.00      1.00      1.00    113302
weighted avg       1.00      1.00      1.00    113302



### **RANDOM FOREST MODEL**

In [40]:
rf = RandomForestClassifier(n_estimators=10, criterion="entropy")

In [41]:
rf.fit(X_train,Y_train)

In [42]:
y_pred = rf.predict(X_test)

In [43]:
print(classification_report(Y_test, y_pred))

              precision    recall  f1-score   support

           0       1.00      1.00      1.00     56455
           1       1.00      1.00      1.00     56847

    accuracy                           1.00    113302
   macro avg       1.00      1.00      1.00    113302
weighted avg       1.00      1.00      1.00    113302



### **K-NEAREST NEIGHBORS (KNN) MODEL**

In [44]:
knn = KNeighborsClassifier(n_neighbors=5)

In [45]:
knn.fit(X_train, Y_train)

In [46]:
y_pred = knn.predict(X_test)

In [47]:
print(classification_report(Y_test, y_pred))

              precision    recall  f1-score   support

           0       0.98      0.94      0.96     56455
           1       0.95      0.98      0.96     56847

    accuracy                           0.96    113302
   macro avg       0.96      0.96      0.96    113302
weighted avg       0.96      0.96      0.96    113302



### **LINEAR DISCRIMINANT ANALYSIS (LDA) MODEL**

In [48]:
lda = LinearDiscriminantAnalysis()

In [49]:
lda.fit(X_train, Y_train)

In [50]:
y_pred = lda.predict(X_test)

In [51]:
print(classification_report(Y_test, y_pred))

              precision    recall  f1-score   support

           0       0.89      0.99      0.93     56455
           1       0.99      0.87      0.93     56847

    accuracy                           0.93    113302
   macro avg       0.94      0.93      0.93    113302
weighted avg       0.94      0.93      0.93    113302



### **EXTREME GRADIENT BOOSTING (XGBoost)**

In [52]:
xg_boost = xgb.XGBClassifier(n_estimators=100)

In [53]:
xg_boost.fit(X_train, Y_train)

In [54]:
y_pred = xg_boost.predict(X_test)

In [55]:
print(classification_report(Y_test, y_pred))

              precision    recall  f1-score   support

           0       1.00      1.00      1.00     56455
           1       1.00      1.00      1.00     56847

    accuracy                           1.00    113302
   macro avg       1.00      1.00      1.00    113302
weighted avg       1.00      1.00      1.00    113302



### After implementing various machine learning algorithms, we find that **Decision Tree**, **Random Forest** and **XGBoost** give the best results, each reaching 1.00 (rounded) precision, recall and f1-score on the test set.