# Credit Card Fraud Detection Using Machine Learning Models
### This project aims to identify fraudulent transactions using a dataset of credit card transactions.
### The dataset is highly imbalanced, with a small number of fraudulent transactions compared to legitimate ones.
### We experiment with different machine learning algorithms, including Logistic Regression, Random Forest, and Balanced Random Forest, to improve model performance in detecting fraud.
### The models are evaluated based on accuracy, precision, recall, and F1-score, focusing on improving the detection of fraudulent transactions.

In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

In [4]:
credit_df=pd.read_csv('creditcard.csv')
credit_df.head()

,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
0,0.0,-1.359807,-0.072781,2.536347,1.378155,-0.338321,0.462388,0.239599,0.098698,0.363787,...,-0.018307,0.277838,-0.110474,0.066928,0.128539,-0.189115,0.133558,-0.021053,149.62,0
1,0.0,1.191857,0.266151,0.16648,0.448154,0.060018,-0.082361,-0.078803,0.085102,-0.255425,...,-0.225775,-0.638672,0.101288,-0.339846,0.16717,0.125895,-0.008983,0.014724,2.69,0
2,1.0,-1.358354,-1.340163,1.773209,0.37978,-0.503198,1.800499,0.791461,0.247676,-1.514654,...,0.247998,0.771679,0.909412,-0.689281,-0.327642,-0.139097,-0.055353,-0.059752,378.66,0
3,1.0,-0.966272,-0.185226,1.792993,-0.863291,-0.010309,1.247203,0.237609,0.377436,-1.387024,...,-0.1083,0.005274,-0.190321,-1.175575,0.647376,-0.221929,0.062723,0.061458,123.5,0
4,2.0,-1.158233,0.877737,1.548718,0.403034,-0.407193,0.095921,0.592941,-0.270533,0.817739,...,-0.009431,0.798278,-0.137458,0.141267,-0.20601,0.502292,0.219422,0.215153,69.99,0


## Data Exploration
### We start by exploring the dataset to understand the distribution of legitimate and fraudulent transactions.
### Since the data is highly imbalanced, it poses challenges in effectively detecting fraud.
### We also analyze the statistical features of both classes and investigate feature relationships.


In [6]:
credit_df.groupby('Class').describe()

,Time,Time,Time,Time,Time,Time,Time,Time,V1,V1,...,V28,V28,Amount,Amount,Amount,Amount,Amount,Amount,Amount,Amount
Class,count,mean,std,min,25%,50%,75%,max,count,mean,...,75%,max,count,mean,std,min,25%,50%,75%,max
0,284315.0,94838.202258,47484.015786,0.0,54230.0,84711.0,139333.0,172792.0,284315.0,0.008258,...,0.077962,33.847808,284315.0,88.291022,250.105092,0.0,5.65,22.0,77.05,25691.16
1,492.0,80746.806911,47835.365138,406.0,41241.5,75568.5,128483.0,170348.0,492.0,-4.771948,...,0.381152,1.779364,492.0,122.211321,256.683288,0.0,1.0,9.25,105.89,2125.87


In [9]:
credit_df.isnull().sum()

Time      0
V1        0
V2        0
V3        0
V4        0
V5        0
V6        0
V7        0
V8        0
V9        0
V10       0
V11       0
V12       0
V13       0
V14       0
V15       0
V16       0
V17       0
V18       0
V19       0
V20       0
V21       0
V22       0
V23       0
V24       0
V25       0
V26       0
V27       0
V28       0
Amount    0
Class     0
dtype: int64

In [16]:
legit=credit_df[credit_df.Class==0]
fraud=credit_df[credit_df.Class==1]
print(legit.shape)
print(fraud.shape)

(284315, 31)
(492, 31)
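
The class counts printed above show how misleading plain accuracy can be on this dataset. The short calculation below is a sketch that reuses only those two counts; the variable names are illustrative and not part of the notebook.

```python
# Class counts taken from the shapes printed above.
n_legit = 284315
n_fraud = 492
n_total = n_legit + n_fraud

# Share of all transactions that are fraudulent.
fraud_rate = n_fraud / n_total

# Accuracy of a trivial classifier that labels every transaction as legitimate.
baseline_accuracy = n_legit / n_total

print(f"fraud rate: {fraud_rate:.4%}")
print(f"always-legit accuracy: {baseline_accuracy:.4%}")
```

A model that never flags fraud would already exceed 99.8% accuracy here, which is why the later sections look at precision, recall, and F1-score for the fraud class.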


In [17]:
legit.Amount.describe()

count    284315.000000
mean         88.291022
std         250.105092
min           0.000000
25%           5.650000
50%          22.000000
75%          77.050000
max       25691.160000
Name: Amount, dtype: float64

In [18]:
fraud.Amount.describe()

count     492.000000
mean      122.211321
std       256.683288
min         0.000000
25%         1.000000
50%         9.250000
75%       105.890000
max      2125.870000
Name: Amount, dtype: float64

In [23]:
credit_df.groupby('Class').mean()

Class,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V20,V21,V22,V23,V24,V25,V26,V27,V28,Amount
0,94838.202258,0.008258,-0.006271,0.012171,-0.00786,0.005453,0.002419,0.009637,-0.000987,0.004467,...,-0.000644,-0.001235,-2.4e-05,7e-05,0.000182,-7.2e-05,-8.9e-05,-0.000295,-0.000131,88.291022
1,80746.806911,-4.771948,3.623778,-7.033281,4.542029,-3.151225,-1.397737,-5.568731,0.570636,-2.581123,...,0.372319,0.713588,0.014049,-0.040308,-0.10513,0.041449,0.051648,0.170575,0.075667,122.211321


In [19]:
legit_sample = legit.sample(n=492)

In [20]:
credit_new = pd.concat([legit_sample, fraud], axis=0)

In [21]:
credit_new['Class'].value_counts()

Class
0    492
1    492
Name: count, dtype: int64
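
The under-sampling above draws a different random subset of legitimate rows on every run, so downstream scores will vary. A minimal sketch of a seeded version follows; `undersample_majority` is a hypothetical helper, and the toy frame only stands in for `credit_df`.

```python
import pandas as pd

def undersample_majority(df, label_col="Class", random_state=42):
    """Hypothetical helper: downsample every class to the minority-class size."""
    n_min = df[label_col].value_counts().min()
    parts = [
        group.sample(n=n_min, random_state=random_state)
        for _, group in df.groupby(label_col)
    ]
    # Shuffle so the classes are not stored in contiguous blocks.
    return pd.concat(parts).sample(frac=1, random_state=random_state)

# Toy frame standing in for credit_df: 95 legitimate rows, 5 fraudulent ones.
toy = pd.DataFrame({"Amount": range(100), "Class": [0] * 95 + [1] * 5})
balanced = undersample_majority(toy)
print(balanced["Class"].value_counts().to_dict())
```

Fixing `random_state` makes the sampled rows repeatable, which helps when comparing models trained on the same balanced subset.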

In [24]:
# Compare the per-class means of the under-sampled dataset with those of the original dataset
credit_new.groupby('Class').mean()

Class,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V20,V21,V22,V23,V24,V25,V26,V27,V28,Amount
0,94994.394309,-0.213301,-0.055344,-0.034912,-0.058707,0.006227,-0.072741,0.063368,-0.032929,-0.052004,...,-0.017596,0.033042,-0.04652,-0.051144,0.014545,-0.063369,0.03394,-0.008113,-0.006016,107.815752
1,80746.806911,-4.771948,3.623778,-7.033281,4.542029,-3.151225,-1.397737,-5.568731,0.570636,-2.581123,...,0.372319,0.713588,0.014049,-0.040308,-0.10513,0.041449,0.051648,0.170575,0.075667,122.211321


In [26]:
X = credit_new.drop(columns='Class')
Y = credit_new['Class']

In [28]:
from sklearn.model_selection import train_test_split

In [50]:
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.2)
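
The split above is taken from data that is already balanced, so the test set is also roughly 50% fraud and does not reflect the real class ratio. One possible alternative, sketched here on synthetic data with assumed seeds and sizes, is to split first with `stratify` and under-sample only the training part.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
# Synthetic stand-in for credit_df: 1000 rows, 2% positives.
toy = pd.DataFrame({"V1": rng.normal(size=1000), "Class": [1] * 20 + [0] * 980})

# 1) Split first, stratifying so both parts keep the 2% fraud rate.
train_df, test_df = train_test_split(
    toy, test_size=0.2, stratify=toy["Class"], random_state=0
)

# 2) Under-sample the majority class in the training part only.
train_fraud = train_df[train_df["Class"] == 1]
train_legit = train_df[train_df["Class"] == 0].sample(
    n=len(train_fraud), random_state=0
)
train_balanced = pd.concat([train_legit, train_fraud])

print(train_balanced["Class"].value_counts().to_dict())
print(test_df["Class"].mean())
```

With this ordering, metrics measured on `test_df` describe performance at the original fraud rate, which makes them comparable to models evaluated on the full dataset.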

## Logistic Regression Model
### Logistic Regression is one of the simplest models, and we use it as a baseline.
### To handle the imbalance, we apply under-sampling, reducing the majority class (legitimate transactions) to match the minority class (fraudulent transactions).
### The model reaches about 94% accuracy on the held-out test split; we also evaluate it on precision, recall, and F1-score.


In [51]:
from sklearn.linear_model import LogisticRegression
logistic_model=LogisticRegression(max_iter=799)

In [52]:
logistic_model.fit(X_train,y_train)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
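
The convergence warning above suggests scaling the data. A minimal sketch of that suggestion, assuming a `StandardScaler` placed in front of the same classifier, is shown below on synthetic data; `X_demo`, `y_demo`, and `scaled_logistic` are illustrative names, and one column is inflated to mimic an unscaled feature such as `Time`.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the under-sampled training data.
X_demo, y_demo = make_classification(n_samples=400, n_features=10, random_state=0)
# Blow up one column to mimic an unscaled feature such as Time or Amount.
X_demo[:, 0] *= 50_000

# The scaler is fitted on the training data only, then applied before the model.
scaled_logistic = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scaled_logistic.fit(X_demo, y_demo)

# Number of solver iterations actually used by the logistic regression step.
n_iter = scaled_logistic.named_steps["logisticregression"].n_iter_[0]
print(n_iter)
```

Wrapping both steps in a pipeline means `scaled_logistic.predict(X_test)` would apply the same scaling to the test features automatically.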


In [53]:
y_pred=logistic_model.predict(X_test)

In [54]:
logistic_model.score(X_train,y_train)

0.9504447268106735

In [55]:
logistic_model.score(X_test,y_test)

0.9390862944162437

In [56]:
from sklearn.metrics import classification_report

In [57]:
print(classification_report(y_test,y_pred))

              precision    recall  f1-score   support

           0       0.90      0.99      0.94       101
           1       0.99      0.89      0.93        96

    accuracy                           0.94       197
   macro avg       0.94      0.94      0.94       197
weighted avg       0.94      0.94      0.94       197
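
To make the report concrete, the fraud-class metrics can be recomputed from confusion-matrix counts. The counts below are not printed in the notebook; they are inferred from the rounded report and the test accuracy above, so treat them as an assumption.

```python
# Counts inferred from the rounded report above (assumption, not notebook output).
tp, fn = 85, 11   # fraud rows caught / missed (support = 96)
tn, fp = 100, 1   # legitimate rows passed / wrongly flagged (support = 101)

precision = tp / (tp + fp)            # share of flagged rows that are fraud
recall = tp / (tp + fn)               # share of fraud rows that were flagged
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f} accuracy={accuracy:.4f}")
# prints: precision=0.99 recall=0.89 f1=0.93 accuracy=0.9391
```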



## Balanced Random Forest Model
### The Balanced Random Forest is specifically designed for imbalanced datasets.
### It under-samples the majority class within each tree, creating a balanced dataset for each decision tree in the forest.
### In this model, we do not apply additional sampling to the majority class data, allowing the algorithm to handle it internally.
#### Despite the model achieving high recall (detecting a high number of fraud cases), the precision for fraud detection is very low.
#### This is because the model tends to classify more transactions as fraud, leading to a higher number of false positives (legitimate transactions wrongly classified as fraud).
#### Improving this requires further tuning, such as adjusting class weights, experimenting with thresholds, or exploring advanced techniques like SMOTE or cost-sensitive learning.


In [58]:
from imblearn.ensemble import BalancedRandomForestClassifier
brforest=BalancedRandomForestClassifier()

In [61]:
X_new=credit_df.drop(['Class'],axis='columns')
y_new=credit_df['Class']

In [62]:
X_train1, X_test1, y_train1, y_test1 =train_test_split(X_new,y_new,test_size=0.2)

In [63]:
brforest.fit(X_train1,y_train1)



In [65]:
brforest.score(X_train1,y_train1)

0.9785907085957559

In [66]:
brforest.score(X_test1,y_test1)

0.9769671008742671

In [70]:
y_pred1=brforest.predict(X_test1)

In [71]:
print(classification_report(y_test1,y_pred1))

              precision    recall  f1-score   support

           0       1.00      0.98      0.99     56855
           1       0.07      0.92      0.13       107

    accuracy                           0.98     56962
   macro avg       0.53      0.95      0.56     56962
weighted avg       1.00      0.98      0.99     56962



#### Above, we can see that the precision for fraud cases is very low. The test set contains far more legitimate transactions than fraudulent ones (56,855 versus 107), so even a small false-positive rate among legitimate transactions produces many more false alarms than true fraud detections. Looking only at the accuracy score (0.98) would give the misleading impression that the model is performing well.
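
One of the tuning options mentioned for this model is experimenting with the decision threshold. The sketch below illustrates the idea on synthetic data, swapping in a plain `RandomForestClassifier` with `class_weight="balanced"` rather than the Balanced Random Forest used above; the data, thresholds, and names are assumptions for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data (about 5% positives) standing in for the card data.
X_demo, y_demo = make_classification(
    n_samples=4000, n_features=10, weights=[0.95], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=0
)

clf = RandomForestClassifier(n_estimators=50, class_weight="balanced", random_state=0)
clf.fit(X_tr, y_tr)
fraud_proba = clf.predict_proba(X_te)[:, 1]  # estimated probability of class 1

results = []
for threshold in (0.3, 0.5, 0.7):
    # Flag a row as fraud only when its probability reaches the threshold.
    y_hat = (fraud_proba >= threshold).astype(int)
    p = precision_score(y_te, y_hat, zero_division=0)
    r = recall_score(y_te, y_hat)
    results.append((threshold, p, r))
    print(f"threshold={threshold:.1f} precision={p:.2f} recall={r:.2f}")
```

Raising the threshold can only keep or lower recall, and it typically raises precision. In practice the threshold would be chosen on a separate validation split rather than on the test set.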

## Random Forest Model
### The Random Forest model is trained on the same under-sampled data used for Logistic Regression.
### Because it combines many decision trees, Random Forest can capture non-linear relationships between features, although it does not correct class imbalance on its own.
### We also evaluate the model using precision, recall, and F1-score.


In [72]:
from sklearn.ensemble import RandomForestClassifier
forest_model=RandomForestClassifier()

In [73]:
forest_model.fit(X_train,y_train)

In [74]:
forest_model.score(X_train,y_train)

1.0

In [75]:
forest_model.score(X_test,y_test)

0.9390862944162437

In [76]:
y_predicted=forest_model.predict(X_test)

In [77]:
print(classification_report(y_test,y_predicted))

              precision    recall  f1-score   support

           0       0.90      0.99      0.94       101
           1       0.99      0.89      0.93        96

    accuracy                           0.94       197
   macro avg       0.94      0.94      0.94       197
weighted avg       0.94      0.94      0.94       197
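
The training score of 1.0 against a test score of about 0.94 comes from a single split with only 197 test rows, so the estimate is noisy. A sketch of a stratified cross-validation check is given below on synthetic data; the fold count, seeds, and names are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_validate

# Synthetic balanced data standing in for the under-sampled dataset.
X_demo, y_demo = make_classification(n_samples=600, n_features=10, random_state=0)

# Five stratified folds keep the class ratio the same in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_validate(
    RandomForestClassifier(n_estimators=50, random_state=0),
    X_demo,
    y_demo,
    cv=cv,
    scoring=["precision", "recall", "f1"],
)

for metric in ("precision", "recall", "f1"):
    fold_scores = scores[f"test_{metric}"]
    print(f"{metric}: {fold_scores.mean():.3f} +/- {fold_scores.std():.3f}")
```

Reporting the mean and spread across folds gives a more stable picture than one train/test split, at the cost of fitting the model several times.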

