Credit-Card-Fraud-Detection

Overview

A binary classification model that classifies whether a credit card transaction is fraudulent or legitimate.

Data Collection

The data is already available here. The dataset contains transactions made with credit cards by European cardholders in September 2013. It covers transactions that occurred over two days, including 492 fraudulent transactions out of 284,807 in total, so the dataset is highly imbalanced.

Data Preprocessing

Initially the dataset contains 1,081 duplicate rows. After removing those, it reduces to 283,726 observations, where the positive (fraudulent) class makes up only about 0.17% of the data and the negative class the remaining 99.83%. To reduce the imbalance, both under-sampling and over-sampling were tried; over-sampling performs well, while under-sampling performs poorly.
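A minimal sketch of this preprocessing with pandas and scikit-learn's `resample` for random over-sampling. The synthetic DataFrame below is only a stand-in for the real dataset (which is not reproduced here), and the column names are placeholders:

```python
import numpy as np
import pandas as pd
from sklearn.utils import resample

# Tiny synthetic stand-in for the real dataset (feature names are placeholders).
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "V14": rng.normal(size=1000).round(3),
    "Amount": rng.exponential(50.0, size=1000).round(2),
    "Class": (rng.random(1000) < 0.02).astype(int),  # ~2% fraud, for illustration
})
df = pd.concat([df, df.head(10)], ignore_index=True)  # inject duplicate rows

# Step 1: remove duplicate observations.
df = df.drop_duplicates().reset_index(drop=True)

# Step 2: random over-sampling -- replicate minority (fraud) rows with
# replacement until both classes are the same size.
majority = df[df["Class"] == 0]
minority = df[df["Class"] == 1]
minority_up = resample(minority, replace=True,
                       n_samples=len(majority), random_state=42)
balanced = pd.concat([majority, minority_up], ignore_index=True)
print(balanced["Class"].value_counts().to_dict())
```

After this step both classes have equal counts, which is what lets the tree ensembles below see enough fraudulent examples during training.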

Model Training

The dataset is trained using Decision Tree, Random Forest, and Gradient Boosting classification models, comparing each model's performance, training time, and shortcomings.
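A sketch of that training loop with scikit-learn, timing each model. Synthetic imbalanced data stands in for the preprocessed transactions, and the hyperparameters here are illustrative defaults, not the tuned values reported below:

```python
import time

from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic imbalanced data standing in for the preprocessed transactions.
X, y = make_classification(n_samples=2000, n_features=10,
                           weights=[0.95, 0.05], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

models = {
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(n_estimators=50, random_state=0),
    "Gradient Boosting": GradientBoostingClassifier(random_state=0),
}
scores = {}
for name, model in models.items():
    start = time.perf_counter()
    model.fit(X_tr, y_tr)
    elapsed = time.perf_counter() - start
    scores[name] = model.score(X_te, y_te)
    print(f"{name}: test accuracy={scores[name]:.4f}, fit time={elapsed:.2f}s")
```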

Result Analysis

The table below shows the Precision, Recall, F1-score, and accuracy for the three models.

| Model | Precision (Fraudulent) | Precision (Non-fraudulent) | Recall (Fraudulent) | Recall (Non-fraudulent) | F1-score (Fraudulent) | F1-score (Non-fraudulent) | Training Accuracy | Testing Accuracy |
|---|---|---|---|---|---|---|---|---|
| Decision Tree | 0.53 | 1.00 | 0.53 | 1.00 | 0.53 | 1.00 | 1.00 | 0.9993 |
| Random Forest | 0.94 | 1.00 | 0.53 | 1.00 | 0.68 | 1.00 | 1.00 | 0.9997 |
| Gradient Boosting | 0.03 | 1.00 | 0.80 | 0.99 | 0.07 | 0.99 | 0.9849 | 0.985 |
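Per-class precision, recall, and F1 as laid out in the table can be computed with scikit-learn's `precision_recall_fscore_support`. The toy labels below are illustrative only, not the repository's actual predictions:

```python
from sklearn.metrics import precision_recall_fscore_support

# Toy ground truth / predictions: class 1 = fraudulent, class 0 = non-fraudulent.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0]

# labels=[1, 0] orders each output array as (fraudulent, non-fraudulent).
prec, rec, f1, support = precision_recall_fscore_support(
    y_true, y_pred, labels=[1, 0])
print("precision:", prec.round(3))  # -> [0.75  0.875]
print("recall:   ", rec.round(3))   # -> [0.75  0.875]
print("f1-score: ", f1.round(3))
```

Accuracy alone is misleading on data this imbalanced (always predicting "non-fraudulent" already scores ~99.8%), which is why the table reports per-class metrics.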

The most important feature for training the Decision Tree, Random Forest, and Gradient Boosting models is V14.
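All three tree-based models expose a `feature_importances_` attribute from which such a ranking can be read. A minimal sketch on synthetic data (the feature names are placeholders for the dataset's anonymized V-columns):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=5,
                           n_informative=2, random_state=0)
feature_names = [f"V{i}" for i in range(1, 6)]  # placeholder names

clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)
# Impurity-based importances, normalized to sum to 1 across features.
ranked = sorted(zip(feature_names, clf.feature_importances_),
                key=lambda t: t[1], reverse=True)
print("Most important feature:", ranked[0][0])
```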

During hyperparameter search, Random Forest fits 2 folds for each of 12 candidates (24 fits in total), while Gradient Boosting fits 2 folds for each of 9 candidates (18 fits in total).

Random Forest performed best with max_depth = 30 and n_estimators = 75, searching over n_estimators in {25, 50, 75} and max_depth in {10, 20, 30, 40}. Gradient Boosting performed best with max_depth = 4 and n_estimators = 30, searching over n_estimators in {20, 25, 30} and max_depth in {2, 3, 4}.

The mean test score of Gradient Boosting is lower than that of Random Forest, while its mean fit time is higher.
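Those fold and candidate counts match a 2-fold `GridSearchCV` over the stated grids. A sketch of the Random Forest search on synthetic data; the mean test scores and mean fit times compared above come from the search's `cv_results_`:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=400, n_features=10, random_state=0)

# 3 x 4 = 12 candidates; with cv=2 that is 24 fits, matching the counts above.
rf_grid = {"n_estimators": [25, 50, 75], "max_depth": [10, 20, 30, 40]}
search = GridSearchCV(RandomForestClassifier(random_state=0), rf_grid, cv=2)
search.fit(X, y)

print("best params:", search.best_params_)
print("mean test scores:", search.cv_results_["mean_test_score"].round(3))
print("mean fit times:  ", search.cv_results_["mean_fit_time"].round(3))
```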

Relation between Estimators and Training Time

From the image we can see a positive correlation between training time and the number of estimators for the Random Forest classifier.
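That trend can be reproduced by timing fits across the estimator grid. Synthetic data again stands in for the real transactions, and absolute times will vary by machine:

```python
import time

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)

times = []
for n in [25, 50, 75]:
    start = time.perf_counter()
    RandomForestClassifier(n_estimators=n, random_state=0).fit(X, y)
    times.append(time.perf_counter() - start)

# Each additional tree adds roughly constant work, so training time
# grows with n_estimators.
print([round(t, 3) for t in times])
```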

