# Introduction
---
Credit card fraud detection is critical for financial institutions to prevent losses and maintain customer trust. This project focuses on developing machine learning models to identify fraudulent transactions effectively.
The project follows these key steps:
- **Importing Libraries**: Essential libraries for data manipulation, visualization, and model building.
- **Data Loading**: Loading the dataset for analysis.
- **Data Visualization**: Exploring and understanding data patterns.
- **Data Preprocessing**: Handling **missing values** and the **unbalanced dataset**.
- **Feature-Label Split**: Separating features from labels.
- **Normalization**: Normalizing the feature set using **standardization**.
- **Train-Test Split**: Dividing the data into training and testing sets.
- **Training Models**: **Logistic Regression** and **Random Forest**.
- **Model Evaluation**: Using **accuracy**, **precision**, **recall**, and **F1-score**.

# Importing Python Libraries


---




In [45]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, classification_report

# Data Loading:


---


Loading the given dataset into a pandas DataFrame:



In [46]:
CreditCard_data = pd.read_csv('creditcard.csv')

# Data Visualisation
---

1. Visualising the first 5 rows of the dataset:


In [47]:
CreditCard_data.head()

Unnamed: 0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
0,0.0,-1.359807,-0.072781,2.536347,1.378155,-0.338321,0.462388,0.239599,0.098698,0.363787,...,-0.018307,0.277838,-0.110474,0.066928,0.128539,-0.189115,0.133558,-0.021053,149.62,0
1,0.0,1.191857,0.266151,0.16648,0.448154,0.060018,-0.082361,-0.078803,0.085102,-0.255425,...,-0.225775,-0.638672,0.101288,-0.339846,0.16717,0.125895,-0.008983,0.014724,2.69,0
2,1.0,-1.358354,-1.340163,1.773209,0.37978,-0.503198,1.800499,0.791461,0.247676,-1.514654,...,0.247998,0.771679,0.909412,-0.689281,-0.327642,-0.139097,-0.055353,-0.059752,378.66,0
3,1.0,-0.966272,-0.185226,1.792993,-0.863291,-0.010309,1.247203,0.237609,0.377436,-1.387024,...,-0.1083,0.005274,-0.190321,-1.175575,0.647376,-0.221929,0.062723,0.061458,123.5,0
4,2.0,-1.158233,0.877737,1.548718,0.403034,-0.407193,0.095921,0.592941,-0.270533,0.817739,...,-0.009431,0.798278,-0.137458,0.141267,-0.20601,0.502292,0.219422,0.215153,69.99,0


2. Visualising the last 5 rows of the dataset:

In [48]:
CreditCard_data.tail()

Unnamed: 0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
284802,172786.0,-11.881118,10.071785,-9.834783,-2.066656,-5.364473,-2.606837,-4.918215,7.305334,1.914428,...,0.213454,0.111864,1.01448,-0.509348,1.436807,0.250034,0.943651,0.823731,0.77,0
284803,172787.0,-0.732789,-0.05508,2.03503,-0.738589,0.868229,1.058415,0.02433,0.294869,0.5848,...,0.214205,0.924384,0.012463,-1.016226,-0.606624,-0.395255,0.068472,-0.053527,24.79,0
284804,172788.0,1.919565,-0.301254,-3.24964,-0.557828,2.630515,3.03126,-0.296827,0.708417,0.432454,...,0.232045,0.578229,-0.037501,0.640134,0.265745,-0.087371,0.004455,-0.026561,67.88,0
284805,172788.0,-0.24044,0.530483,0.70251,0.689799,-0.377961,0.623708,-0.68618,0.679145,0.392087,...,0.265245,0.800049,-0.163298,0.123205,-0.569159,0.546668,0.108821,0.104533,10.0,0
284806,172792.0,-0.533413,-0.189733,0.703337,-0.506271,-0.012546,-0.649617,1.577006,-0.41465,0.48618,...,0.261057,0.643078,0.376777,0.008797,-0.473649,-0.818267,-0.002415,0.013649,217.0,0


3. Obtaining some general Information about the given Dataset:

In [49]:
CreditCard_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 284807 entries, 0 to 284806
Data columns (total 31 columns):
 #   Column  Non-Null Count   Dtype  
---  ------  --------------   -----  
 0   Time    284807 non-null  float64
 1   V1      284807 non-null  float64
 2   V2      284807 non-null  float64
 3   V3      284807 non-null  float64
 4   V4      284807 non-null  float64
 5   V5      284807 non-null  float64
 6   V6      284807 non-null  float64
 7   V7      284807 non-null  float64
 8   V8      284807 non-null  float64
 9   V9      284807 non-null  float64
 10  V10     284807 non-null  float64
 11  V11     284807 non-null  float64
 12  V12     284807 non-null  float64
 13  V13     284807 non-null  float64
 14  V14     284807 non-null  float64
 15  V15     284807 non-null  float64
 16  V16     284807 non-null  float64
 17  V17     284807 non-null  float64
 18  V18     284807 non-null  float64
 19  V19     284807 non-null  float64
 20  V20     284807 non-null  float64
 21  V21     28

# Data Preprocessing
---

**1. Missing values Handling:** Checking for missing values in each column of the Dataset (if any):





In [50]:
CreditCard_data.isnull().sum()

Time      0
V1        0
V2        0
V3        0
V4        0
V5        0
V6        0
V7        0
V8        0
V9        0
V10       0
V11       0
V12       0
V13       0
V14       0
V15       0
V16       0
V17       0
V18       0
V19       0
V20       0
V21       0
V22       0
V23       0
V24       0
V25       0
V26       0
V27       0
V28       0
Amount    0
Class     0
dtype: int64

There are no missing values in our dataset.

**2. Handling Unbalanced Dataset:**


---


Checking the distribution of legitimate and fraudulent transactions:

In [51]:
CreditCard_data['Class'].value_counts()

Class
0    284315
1       492
Name: count, dtype: int64

The data is highly unbalanced.

We have two labels:
*    label - '0' --> indicates a legitimate transaction.
*    label - '1' --> indicates a fraudulent transaction.

There are 284315 legitimate transactions and only 492 fraudulent transactions. Training a model on such data would produce a **biased** model that favours the majority class, 'legitimate transaction'.
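
To make the bias concrete, here is a small worked calculation using only the class counts printed above. It is a sketch of the reasoning, not part of the original notebook: a hypothetical "model" that labels every transaction as legitimate would look excellent on accuracy while catching no fraud at all.

```python
# Class counts taken from the value_counts() output above.
n_legit = 284315
n_fraud = 492
n_total = n_legit + n_fraud

# A hypothetical "model" that labels every transaction as legitimate (class 0)
# is right on every legit row and wrong on every fraud row.
baseline_accuracy = n_legit / n_total
fraud_recall = 0 / n_fraud  # it never flags a single fraud

print(f"Total transactions : {n_total}")
print(f"Fraud share        : {n_fraud / n_total:.4%}")
print(f"Baseline accuracy  : {baseline_accuracy:.4%}")
print(f"Baseline recall (1): {fraud_recall:.0%}")
```

This is why accuracy alone is a poor guide on the raw data, and why the dataset is rebalanced before training.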


**Under-Sampling Method:**

1. Separating the data into different classes for analysis

In [53]:
#1. Legit Class:
legit_class = CreditCard_data[CreditCard_data.Class == 0]
#2. Fraud Class:
fraud_class = CreditCard_data[CreditCard_data.Class == 1]

2. Comparing the mean values of both classes

In [54]:
#Compare the values of transactions of both the classes:
CreditCard_data.groupby('Class').mean()

Class,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V20,V21,V22,V23,V24,V25,V26,V27,V28,Amount
0,94838.202258,0.008258,-0.006271,0.012171,-0.00786,0.005453,0.002419,0.009637,-0.000987,0.004467,...,-0.000644,-0.001235,-2.4e-05,7e-05,0.000182,-7.2e-05,-8.9e-05,-0.000295,-0.000131,88.291022
1,80746.806911,-4.771948,3.623778,-7.033281,4.542029,-3.151225,-1.397737,-5.568731,0.570636,-2.581123,...,0.372319,0.713588,0.014049,-0.040308,-0.10513,0.041449,0.051648,0.170575,0.075667,122.211321


**Dealing with unbalanced Dataset:** Using Undersampling Method


---

Under this method, we build a new_dataset containing an equal number of legitimate and fraudulent transactions.

1. Obtaining n=492 random examples from the majority class:

In [55]:
# Creating a sample DataFrame of legit transactions:
legit_sample_Dataset = legit_class.sample(n=492)

2. Creating a new dataset by concatenating an equal number of positive and negative examples:

In [56]:
new_dataset = pd.concat([legit_sample_Dataset, fraud_class], axis=0)  # Stack both DataFrames row-wise (axis=0).
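
The under-sampling steps above can be sketched as a small reusable function. This is an illustrative variant, not the notebook's own code: the `undersample` helper and the toy DataFrame are hypothetical, and a `random_state` is passed so that the same rows are drawn on every run (the `sample(n=492)` call above draws a different sample each time).

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the real data: 1,000 rows, 20 of them fraud.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "Amount": rng.uniform(1, 500, size=1000),
    "Class": [1] * 20 + [0] * 980,
})

def undersample(df, label_col="Class", random_state=42):
    """Return a frame with every class down-sampled to the minority size."""
    minority_size = df[label_col].value_counts().min()
    parts = [
        group.sample(n=minority_size, random_state=random_state)
        for _, group in df.groupby(label_col)
    ]
    # Shuffle so the classes are not stored in two contiguous blocks.
    return pd.concat(parts, axis=0).sample(frac=1, random_state=random_state)

balanced = undersample(toy)
print(balanced["Class"].value_counts())
```

Fixing the seed makes downstream metrics comparable between runs, at the cost of tying the results to one particular sample of the majority class.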

**Visualising the New_Dataset:**


---

1. Observing the first 5 rows of the new DataFrame:

In [57]:
new_dataset.head()

Unnamed: 0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
241210,150928.0,2.226736,-1.686982,0.247256,-1.53367,-2.240039,-0.433328,-2.026069,0.071303,-0.53598,...,0.091573,0.715095,0.296967,-0.121447,-0.60318,-0.106827,0.070931,-0.027946,15.0,0
158517,111375.0,1.685813,-1.709515,-0.862273,-0.755017,-0.470225,1.350639,-1.171191,0.436154,1.471597,...,-0.184069,-0.622279,0.295681,-1.768386,-0.790888,-0.512238,-0.018208,-0.049466,189.7,0
235408,148374.0,-4.101316,-0.308307,-1.531816,-2.421422,1.08763,0.512667,-0.850986,1.487185,-1.467903,...,0.160633,0.742404,-0.721178,-0.845485,1.142757,0.114193,0.012988,-0.174248,25.0,0
29583,35534.0,1.138546,-0.905831,-0.052347,-0.730905,-0.851105,-0.612491,-0.293122,-0.155208,-1.147044,...,0.326058,0.527141,-0.288504,0.05743,0.593457,-0.128761,-0.036571,0.021207,149.9,0
770,580.0,1.26703,-0.071114,0.03768,0.512683,0.242392,0.705212,-0.226582,0.109483,0.657565,...,-0.164468,-0.177225,-0.222918,-1.245505,0.67836,0.525059,0.00292,-0.003333,12.36,0


2. Checking for the distribution of classes in the new_dataset:

In [58]:
new_dataset['Class'].value_counts()

Class
0    492
1    492
Name: count, dtype: int64

3. Comparing the Statistical measures of both the classes in the new dataset:

In [59]:
new_dataset.groupby('Class').mean()

Class,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V20,V21,V22,V23,V24,V25,V26,V27,V28,Amount
0,92882.660569,0.002545,-0.077199,0.116897,-0.016108,-0.088073,-0.011701,-0.069459,0.12692,0.112147,...,0.015769,0.001184,-0.01296,-0.01354,-0.007615,-0.030294,0.013346,0.00774,0.014145,91.82685
1,80746.806911,-4.771948,3.623778,-7.033281,4.542029,-3.151225,-1.397737,-5.568731,0.570636,-2.581123,...,0.372319,0.713588,0.014049,-0.040308,-0.10513,0.041449,0.051648,0.170575,0.075667,122.211321


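The class means in new_dataset are close to those in the original dataset (for example, the mean Amount of legitimate transactions is 91.83 versus 88.29), so the under-sampled legitimate rows appear to be a **good sample** of the majority class.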

# Feature-Label Split over dataset:


---



1. Obtaining all the features in the 'X' variable
2. Obtaining all the labels in the 'Y' variable

In [60]:
X = new_dataset.drop(columns='Class')  # Features (axis is redundant when columns= is given).
Y = new_dataset['Class']               # Output labels.

3. Printing the features:

In [61]:
print(X)

            Time        V1        V2        V3        V4        V5        V6  \
241210  150928.0  2.226736 -1.686982  0.247256 -1.533670 -2.240039 -0.433328   
158517  111375.0  1.685813 -1.709515 -0.862273 -0.755017 -0.470225  1.350639   
235408  148374.0 -4.101316 -0.308307 -1.531816 -2.421422  1.087630  0.512667   
29583    35534.0  1.138546 -0.905831 -0.052347 -0.730905 -0.851105 -0.612491   
770        580.0  1.267030 -0.071114  0.037680  0.512683  0.242392  0.705212   
...          ...       ...       ...       ...       ...       ...       ...   
279863  169142.0 -1.927883  1.125653 -4.518331  1.749293 -1.566487 -2.010494   
280143  169347.0  1.378559  1.289381 -5.004247  1.411850  0.442581 -1.326536   
280149  169351.0 -0.676143  1.126366 -2.213700  0.468308 -1.120541 -0.003346   
281144  169966.0 -3.113832  0.585864 -5.399730  1.817092 -0.840618 -2.943548   
281674  170348.0  1.991976  0.158476 -2.583441  0.408670  1.151147 -0.096695   

              V7        V8        V9  .

4. Printing the output class labels:

In [62]:
print(Y)

241210    0
158517    0
235408    0
29583     0
770       0
         ..
279863    1
280143    1
280149    1
281144    1
281674    1
Name: Class, Length: 984, dtype: int64


# Normalising the feature set:
---
Standardization: scaling each feature so that its mean is 0 and its standard deviation is 1.
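
As a quick illustration of what standardization computes, the sketch below applies the z-score formula `(x - mean) / std` by hand and compares it with `StandardScaler`. The toy matrix `X_toy` is made up for the example; the only library fact relied on is that `StandardScaler` uses the population standard deviation (`ddof=0`).

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy feature matrix: 5 samples, 2 features on very different scales.
X_toy = np.array([[1.0, 200.0],
                  [2.0, 300.0],
                  [3.0, 400.0],
                  [4.0, 500.0],
                  [5.0, 600.0]])

# Manual z-score: subtract the column mean, divide by the column std.
# StandardScaler uses the population standard deviation (ddof=0).
mean = X_toy.mean(axis=0)
std = X_toy.std(axis=0, ddof=0)
X_manual = (X_toy - mean) / std

X_sklearn = StandardScaler().fit_transform(X_toy)

print(np.allclose(X_manual, X_sklearn))
print(X_sklearn.mean(axis=0).round(6), X_sklearn.std(axis=0).round(6))
```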

1. Obtaining the Standard_scaler:


In [63]:
scaler_standard = StandardScaler()

2. Scaling the feature set

In [64]:
X_standardized = scaler_standard.fit_transform(X)

3. Converting the standardized features back to a DataFrame

In [65]:
X = pd.DataFrame(X_standardized, columns=X.columns)
print("Standardized Data:")
print(X.head())

Standardized Data:
       Time        V1        V2        V3        V4        V5        V6  \
0  1.319904  0.836614 -0.934924  0.593491 -1.180996 -0.148448  0.158486   
1  0.505624  0.738479 -0.941012  0.415781 -0.938785  0.275037  1.200276   
2  1.267325 -0.311430 -0.562422  0.308542 -1.457144  0.647803  0.710922   
3 -1.055720  0.639192 -0.723866  0.545505 -0.931285  0.183899  0.053859   
4 -1.775320  0.662502 -0.498336  0.559924 -0.544449  0.445553  0.823364   

         V7        V8        V9  ...       V20       V21       V22       V23  \
0  0.136128 -0.057379  0.296058  ... -0.602423 -0.095729  0.615536  0.282810   
1  0.282874  0.018068  1.146959  ... -0.098411 -0.194997 -0.536520  0.281687   
2  0.337839  0.235409 -0.098932  ... -0.827969 -0.070858  0.639061 -0.606196   
3  0.433600 -0.104218  0.037063  ...  0.165576 -0.011282  0.453627 -0.228402   
4  0.445022 -0.049483  0.801936  ... -0.244047 -0.187938 -0.153137 -0.171135   

        V24       V25       V26       V27       V

# Train/Test Split Over Dataset:
---

In [66]:
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.25, stratify=Y, random_state=4)
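
In this notebook the scaler is fitted on the full feature set before the split, so the test rows contribute to the scaling statistics. A common alternative, sketched below on hypothetical synthetic data from `make_classification`, is to split first and let a `Pipeline` fit the scaler on the training rows only. This is a suggested variation under those assumptions, not the approach used to produce the results reported here.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical balanced toy data standing in for the under-sampled dataset.
X_toy, y_toy = make_classification(n_samples=984, n_features=30, random_state=0)

# Split first, so the test rows stay unseen until evaluation.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, stratify=y_toy, random_state=4
)

# The pipeline fits the scaler on the training rows only, then reuses
# those training statistics when transforming the test rows.
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
pipe.fit(X_tr, y_tr)

print("Train shape:", X_tr.shape, "Test shape:", X_te.shape)
print("Test accuracy:", pipe.score(X_te, y_te))
```

With `stratify`, both splits keep roughly the same class ratio as the full dataset, which matters when the evaluation set is only a few hundred rows.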

# Training of Model: 
---



## 1. Logistic Regression

1. Obtaining our model:

In [67]:
model =  LogisticRegression()

2. Training our Logistic_Regression model on the Training set:

In [68]:
model.fit(X_train,Y_train)

### Logistic Regression Evaluation:


---



1. Checking Accuracy Score on training data:


In [69]:
X_train_prediction = model.predict(X_train)
training_data_accuracy = accuracy_score(Y_train, X_train_prediction)  # (y_true, y_pred) order
print('Accuracy on Training Set: ' , training_data_accuracy)

Accuracy on Training Set:  0.9579945799457995


2. Checking Accuracy Score on testing data:

In [70]:
X_test_prediction = model.predict(X_test)
testing_data_accuracy = accuracy_score(Y_test, X_test_prediction)  # (y_true, y_pred) order
print('Accuracy on Testing Set : ' , testing_data_accuracy)

Accuracy on Testing Set :  0.9512195121951219


3. Checking Precision on training data and testing data:

In [71]:
precision_train = precision_score(Y_train, X_train_prediction, average='weighted')
precision_test = precision_score(Y_test, X_test_prediction, average='weighted')
print('Precision on Training Set: ', precision_train)
print('Precision on Testing Set: ', precision_test)

Precision on Training Set:  0.9601065399598631
Precision on Testing Set:  0.9522957662492546


4. Checking Recall on the training data and testing data:

In [72]:
recall_train = recall_score(Y_train, X_train_prediction, average='weighted')
recall_test = recall_score(Y_test, X_test_prediction, average='weighted')
print('Recall on Training Set: ', recall_train)
print('Recall on Testing Set: ', recall_test)

Recall on Training Set:  0.9579945799457995
Recall on Testing Set:  0.9512195121951219


5. Checking F1-Score on training and testing data:

In [73]:
f1_train = f1_score(Y_train, X_train_prediction, average='weighted')
f1_test = f1_score(Y_test, X_test_prediction, average='weighted')
print('F1-Score on Training Set: ', f1_train)
print('F1-Score on Testing Set: ', f1_test)

F1-Score on Training Set:  0.9579463217277338
F1-Score on Testing Set:  0.9511904761904761


**No Overfitting:** The model performs almost equally well on the training and testing data (about 0.96 versus 0.95 on every metric), so there is no sign of overfitting.
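
Note that the weighted recall reported above is identical to the accuracy. That is expected: averaging per-class recall weighted by class size reduces to the overall fraction of correct predictions. For fraud detection, the metrics for the fraud class alone are often more informative. The sketch below uses made-up labels and predictions (not the notebook's results) to show the difference.

```python
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_score, recall_score)

# Hypothetical labels and predictions (1 = fraud), chosen only for illustration.
y_true = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 1, 1, 1, 1, 0, 0]

# Rows are true classes, columns are predicted classes: [[TN, FP], [FN, TP]].
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy = accuracy_score(y_true, y_pred)
weighted_recall = recall_score(y_true, y_pred, average='weighted')
fraud_recall = recall_score(y_true, y_pred, pos_label=1)        # TP / (TP + FN)
fraud_precision = precision_score(y_true, y_pred, pos_label=1)  # TP / (TP + FP)

print("TN, FP, FN, TP   :", tn, fp, fn, tp)
print("Accuracy         :", accuracy)
print("Weighted recall  :", weighted_recall)
print("Fraud recall     :", fraud_recall)
print("Fraud precision  :", fraud_precision)
```

Here accuracy and weighted recall coincide, while the fraud-class recall is lower because two of the five fraud cases are missed.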

## 2. Random Forest

1. Obtaining our model

In [74]:
rf_model = RandomForestClassifier(n_estimators=100, random_state=42)

2. Training our model on the training set

In [75]:
rf_model.fit(X_train, Y_train)

### Random Forest Evaluation
---

1. Evaluating prediction:

In [78]:
Y_train_pred = rf_model.predict(X_train)
Y_test_pred = rf_model.predict(X_test)

2. Checking accuracy on training and testing data:

In [79]:
accuracy_train = accuracy_score(Y_train, Y_train_pred)
accuracy_test = accuracy_score(Y_test, Y_test_pred)
print('Accuracy on Training Set: ', accuracy_train)
print('Accuracy on Testing Set: ', accuracy_test)

Accuracy on Training Set:  1.0
Accuracy on Testing Set:  0.9512195121951219


3. Checking precision on training and testing data:

In [80]:
precision_train = precision_score(Y_train, Y_train_pred, average='weighted')
precision_test = precision_score(Y_test, Y_test_pred, average='weighted')
print('Precision on Training Set: ', precision_train)
print('Precision on Testing Set: ', precision_test)

Precision on Training Set:  1.0
Precision on Testing Set:  0.9531364088947892


4. Checking recall on training and testing data:


In [81]:
recall_train = recall_score(Y_train, Y_train_pred, average='weighted')
recall_test = recall_score(Y_test, Y_test_pred, average='weighted')
print('Recall on Training Set: ', recall_train)
print('Recall on Testing Set: ', recall_test)

Recall on Training Set:  1.0
Recall on Testing Set:  0.9512195121951219


5. Checking F1-score on training and testing data:

In [83]:
f1_train = f1_score(Y_train, Y_train_pred, average='weighted')
f1_test = f1_score(Y_test, Y_test_pred, average='weighted')
print('F1-Score on Training Set: ', f1_train)
print('F1-Score on Testing Set: ', f1_test)

F1-Score on Training Set:  1.0
F1-Score on Testing Set:  0.951167868722292
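
With a test set of only a few hundred rows, differences in the third decimal place between two models are within the noise of a single split. One way to get a steadier comparison is stratified k-fold cross-validation. The sketch below is a minimal example on hypothetical synthetic data; the model settings and fold count are illustrative choices, not values from this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data; in the notebook this would be the features and labels.
X_toy, y_toy = make_classification(n_samples=400, n_features=10, random_state=0)

models = {
    "Logistic Regression": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "Random Forest": RandomForestClassifier(n_estimators=50, random_state=42),
}

# Five stratified folds keep the class ratio the same in every split.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

cv_scores = {}
for name, clf in models.items():
    scores = cross_val_score(clf, X_toy, y_toy, cv=cv, scoring="f1")
    cv_scores[name] = scores
    print(f"{name}: mean F1 = {scores.mean():.3f} (+/- {scores.std():.3f})")
```

Comparing the mean scores together with their spread across folds gives a better sense of whether one model is really ahead.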


# Conclusion
---
- In this project, we successfully implemented and evaluated two machine learning models for credit card fraud detection: Logistic Regression and Random Forest.

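- The Logistic Regression model performed consistently on both sets (about 0.96 on training and 0.95 on testing for every metric). The Random Forest model scored 1.0 on every training metric but about 0.95 on the testing set, a gap that points to some overfitting. **On the testing set the two models performed almost identically**: accuracy and recall were equal (0.9512), Random Forest had marginally higher precision (0.9531 vs 0.9523), and Logistic Regression had marginally higher F1-score (0.95119 vs 0.95117).

- Both models reached about 95% accuracy on the balanced, under-sampled test set, with neither showing a clear edge. This suggests that machine learning can be effective for flagging fraudulent transactions.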