Credit-Card-Fraud-Detection

Overview

A binary classification model that classifies whether a credit card transaction is fraudulent or legitimate.

Data Collection

The data is already available here. The dataset contains transactions made with credit cards by European cardholders in September 2013. It covers transactions that occurred over two days, including 492 fraudulent transactions out of 284,807 in total, so the dataset is highly imbalanced.

Data Preprocessing

Initially the dataset contains 1,081 duplicate rows. After removing those, it reduces to 283,726 observations, where the positive (fraudulent) class makes up only about 0.17% of the data and the negative class the remaining 99.83%. To reduce the imbalance, both under-sampling and over-sampling were tried; over-sampling performs well, while under-sampling performs poorly.
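A minimal sketch of this preprocessing with pandas and scikit-learn's `resample` for random over-sampling. The synthetic DataFrame below is only a stand-in for the real dataset (which is not reproduced here), and the column names are placeholders:

```python
import numpy as np
import pandas as pd
from sklearn.utils import resample

# Tiny synthetic stand-in for the real dataset (feature names are placeholders).
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "V14": rng.normal(size=1000).round(3),
    "Amount": rng.exponential(50.0, size=1000).round(2),
    "Class": (rng.random(1000) < 0.02).astype(int),  # ~2% fraud, for illustration
})
df = pd.concat([df, df.head(10)], ignore_index=True)  # inject duplicate rows

# Step 1: remove duplicate observations.
df = df.drop_duplicates().reset_index(drop=True)

# Step 2: random over-sampling -- replicate minority (fraud) rows with
# replacement until both classes are the same size.
majority = df[df["Class"] == 0]
minority = df[df["Class"] == 1]
minority_up = resample(minority, replace=True,
                       n_samples=len(majority), random_state=42)
balanced = pd.concat([majority, minority_up], ignore_index=True)
print(balanced["Class"].value_counts().to_dict())
```

After this step both classes have equal counts, which is what lets the tree ensembles below see enough fraudulent examples during training.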

Model Training

The dataset is trained using Decision Tree, Random Forest, and Gradient Boosting classification models, comparing each model's performance, training time, and shortcomings.
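A sketch of that training loop with scikit-learn, timing each model. Synthetic imbalanced data stands in for the preprocessed transactions, and the hyperparameters here are illustrative defaults, not the tuned values reported below:

```python
import time

from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic imbalanced data standing in for the preprocessed transactions.
X, y = make_classification(n_samples=2000, n_features=10,
                           weights=[0.95, 0.05], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

models = {
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(n_estimators=50, random_state=0),
    "Gradient Boosting": GradientBoostingClassifier(random_state=0),
}
scores = {}
for name, model in models.items():
    start = time.perf_counter()
    model.fit(X_tr, y_tr)
    elapsed = time.perf_counter() - start
    scores[name] = model.score(X_te, y_te)
    print(f"{name}: test accuracy={scores[name]:.4f}, fit time={elapsed:.2f}s")
```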

Result Analysis

The table below shows the Precision, Recall, F1-score, and accuracy for the three models.

| Model | Precision (Fraudulent) | Precision (Non-fraudulent) | Recall (Fraudulent) | Recall (Non-fraudulent) | F1-score (Fraudulent) | F1-score (Non-fraudulent) | Training Accuracy | Testing Accuracy |
|---|---|---|---|---|---|---|---|---|
| Decision Tree | 0.53 | 1.00 | 0.53 | 1.00 | 0.53 | 1.00 | 1.00 | 0.9993 |
| Random Forest | 0.94 | 1.00 | 0.53 | 1.00 | 0.68 | 1.00 | 1.00 | 0.9997 |
| Gradient Boosting | 0.03 | 1.00 | 0.80 | 0.99 | 0.07 | 0.99 | 0.9849 | 0.985 |
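Per-class precision, recall, and F1 as laid out in the table can be computed with scikit-learn's `precision_recall_fscore_support`. The toy labels below are illustrative only, not the repository's actual predictions:

```python
from sklearn.metrics import precision_recall_fscore_support

# Toy ground truth / predictions: class 1 = fraudulent, class 0 = non-fraudulent.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0]

# labels=[1, 0] orders each output array as (fraudulent, non-fraudulent).
prec, rec, f1, support = precision_recall_fscore_support(
    y_true, y_pred, labels=[1, 0])
print("precision:", prec.round(3))  # -> [0.75  0.875]
print("recall:   ", rec.round(3))   # -> [0.75  0.875]
print("f1-score: ", f1.round(3))
```

Accuracy alone is misleading on data this imbalanced (always predicting "non-fraudulent" already scores ~99.8%), which is why the table reports per-class metrics.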

The most important feature for training the Decision Tree, Random Forest, and Gradient Boosting models is V14.
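All three tree-based models expose a `feature_importances_` attribute from which such a ranking can be read. A minimal sketch on synthetic data (the feature names are placeholders for the dataset's anonymized V-columns):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=5,
                           n_informative=2, random_state=0)
feature_names = [f"V{i}" for i in range(1, 6)]  # placeholder names

clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)
# Impurity-based importances, normalized to sum to 1 across features.
ranked = sorted(zip(feature_names, clf.feature_importances_),
                key=lambda t: t[1], reverse=True)
print("Most important feature:", ranked[0][0])
```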

During hyperparameter search, Random Forest fits 2 folds for each of 12 candidates (24 fits in total), while Gradient Boosting fits 2 folds for each of 9 candidates (18 fits in total).

Random Forest performed best with max_depth = 30 and n_estimators = 75, searching over n_estimators in {25, 50, 75} and max_depth in {10, 20, 30, 40}. Gradient Boosting performed best with max_depth = 4 and n_estimators = 30, searching over n_estimators in {20, 25, 30} and max_depth in {2, 3, 4}.

The mean test score of Gradient Boosting is lower than that of Random Forest, while its mean fit time is higher.
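Those fold and candidate counts match a 2-fold `GridSearchCV` over the stated grids. A sketch of the Random Forest search on synthetic data; the mean test scores and mean fit times compared above come from the search's `cv_results_`:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=400, n_features=10, random_state=0)

# 3 x 4 = 12 candidates; with cv=2 that is 24 fits, matching the counts above.
rf_grid = {"n_estimators": [25, 50, 75], "max_depth": [10, 20, 30, 40]}
search = GridSearchCV(RandomForestClassifier(random_state=0), rf_grid, cv=2)
search.fit(X, y)

print("best params:", search.best_params_)
print("mean test scores:", search.cv_results_["mean_test_score"].round(3))
print("mean fit times:  ", search.cv_results_["mean_fit_time"].round(3))
```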

Relation between Estimators and Training Time

From the image we can see a positive correlation between training time and the number of estimators for the Random Forest classifier.
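That trend can be reproduced by timing fits across the estimator grid. Synthetic data again stands in for the real transactions, and absolute times will vary by machine:

```python
import time

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)

times = []
for n in [25, 50, 75]:
    start = time.perf_counter()
    RandomForestClassifier(n_estimators=n, random_state=0).fit(X, y)
    times.append(time.perf_counter() - start)

# Each additional tree adds roughly constant work, so training time
# grows with n_estimators.
print([round(t, 3) for t in times])
```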

