Credit Card Fraud Analytics & ML Modeling

This machine learning project explores how to handle a dataset that is highly variable, wide-ranging in value, and imbalanced, for the purpose of detecting credit card fraud. Below are the steps followed, from preparing the dataset to building models with accurate predictions.

Data Source - https://www.kaggle.com/datasets/mlg-ulb/creditcardfraud

1. Data Preparation and Sampling

The dataset contains credit card transactions described by V1-V28 (28 variables to which PCA has been applied to protect customer identity), along with the Amount and Time of each transaction and a target column labelled Class = 0/1 (No Fraud/Fraud, respectively).
Checking the total number of data points for Fraud/NotFraud, we found that the data is highly imbalanced. We also observed that the data is highly variable.
So we performed robust scaling, which is very effective against outliers because it centres on the median and scales by the interquartile range.
Next we use stratified sampling, so that both classes are represented in the same proportion in every sample.
To address the main issue of imbalanced data, we resample the stratified output to get an equal number of Fraud and NotFraud cases.
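
The preparation steps above can be sketched roughly as follows. This is a minimal illustration on synthetic data, not the project's actual notebook code: the toy DataFrame, the 2% fraud ratio, the 80/20 split and the use of random undersampling are all assumptions made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import RobustScaler

# Synthetic stand-in for the Kaggle data (assumption): 1000 rows, 20 of them fraud.
rng = np.random.default_rng(0)
n = 1000
df = pd.DataFrame({
    "V1": rng.normal(size=n),
    "Amount": rng.lognormal(mean=3.0, sigma=1.5, size=n),
    "Time": rng.uniform(0, 172800, size=n),
    "Class": np.r_[np.ones(20, dtype=int), np.zeros(n - 20, dtype=int)],
})

# Robust scaling: subtract the median and divide by the IQR, per column.
scaler = RobustScaler()
df[["Amount", "Time"]] = scaler.fit_transform(df[["Amount", "Time"]])

# Stratified split: both parts keep the original Fraud/NotFraud ratio.
train_df, test_df = train_test_split(
    df, test_size=0.2, stratify=df["Class"], random_state=42
)

# Random undersampling (assumed technique): keep every fraud row and
# draw the same number of non-fraud rows, then shuffle.
fraud = train_df[train_df["Class"] == 1]
non_fraud = train_df[train_df["Class"] == 0].sample(n=len(fraud), random_state=42)
balanced = (
    pd.concat([fraud, non_fraud])
    .sample(frac=1.0, random_state=42)
    .reset_index(drop=True)
)

print(balanced["Class"].value_counts())
```

Undersampling discards most of the NotFraud rows, so the held-out `test_df` keeps the original imbalance and can be used to check how the model behaves on realistic class proportions.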

2. Data Analysis

Because the data has been PCA-transformed, we plot a correlation matrix to check the influence of each variable on the Class label.
This gives us the variables with high negative and high positive correlation. We then draw box plots for both the positively and negatively correlated variables to identify their distributions and outlier patterns.
These outliers are to be removed, as they are anomalies that would affect our model.
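
As a rough sketch of the correlation check, the snippet below ranks features by their correlation with Class. The column names `V_pos`, `V_neg` and `V_noise` and the synthetic data are hypothetical, chosen only so the example runs on its own.

```python
import numpy as np
import pandas as pd

# Balanced toy data (assumption): 200 fraud rows followed by 200 non-fraud rows.
rng = np.random.default_rng(1)
n = 400
cls = np.r_[np.ones(n // 2, dtype=int), np.zeros(n // 2, dtype=int)]
df = pd.DataFrame({
    "V_pos": 2.0 * cls + rng.normal(size=n),    # hypothetical: rises with fraud
    "V_neg": -2.0 * cls + rng.normal(size=n),   # hypothetical: falls with fraud
    "V_noise": rng.normal(size=n),              # hypothetical: unrelated to fraud
    "Class": cls,
})

# Full correlation matrix; the "Class" column shows each variable's
# linear relationship with the label.
corr = df.corr()
class_corr = corr["Class"].drop("Class").sort_values()
print(class_corr)

strongest_negative = class_corr.index[0]
strongest_positive = class_corr.index[-1]

# With seaborn installed, seaborn.heatmap(corr) draws the matrix and
# seaborn.boxplot(x="Class", y=strongest_negative, data=df) draws a box plot.
```

Computing the correlations on the balanced sample rather than the full dataset matters here, because on heavily imbalanced data the correlations with Class are compressed towards zero.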

3. Removing Outliers and Dimensionality Reduction

To deal with outliers, we compute the 25th and 75th percentiles (Q25 and Q75) and from them the interquartile range, IQR = Q75 - Q25.
We calculate the cutoff for each variable as 1.5 * IQR. Using this we get the lower bound (Q25 - cutoff) and upper bound (Q75 + cutoff) of the variable.
All values outside the lower and upper bounds are treated as outliers and dropped from the data.
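
The IQR rule described above can be illustrated with a small worked example. The helper name `iqr_bounds` and the sample values are made up for this sketch.

```python
import numpy as np

def iqr_bounds(values, k=1.5):
    """Return (lower, upper) bounds using the k * IQR rule."""
    q25, q75 = np.percentile(values, [25, 75])
    cutoff = k * (q75 - q25)
    return q25 - cutoff, q75 + cutoff

# Toy values (assumption): eight ordinary points and one extreme one.
values = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 100.0])

lower, upper = iqr_bounds(values)   # Q25 = 3, Q75 = 7, IQR = 4, cutoff = 6
kept = values[(values >= lower) & (values <= upper)]

print(lower, upper)   # -3.0 13.0
print(kept)           # the value 100.0 is dropped
```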
Given we have 28 data variables, we perform dimensionality reduction. I have used the following techniques:
- t-SNE - t-distributed Stochastic Neighbor Embedding
- PCA - Principal Component Analysis
- TSVD - Truncated Singular Value Decomposition
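
A minimal sketch of applying the three techniques with scikit-learn is shown below. The synthetic 28-feature data, the choice of two output components and the perplexity value are assumptions for illustration only.

```python
import numpy as np
from sklearn.decomposition import PCA, TruncatedSVD
from sklearn.manifold import TSNE

# Two synthetic clusters with 28 features each (assumption), 100 rows in total.
rng = np.random.default_rng(2)
X = np.vstack([
    rng.normal(0.0, 1.0, size=(50, 28)),
    rng.normal(3.0, 1.0, size=(50, 28)),
])

# Each reducer projects the 28 features down to 2 for plotting.
reducers = {
    "t-SNE": TSNE(n_components=2, perplexity=30, random_state=42),
    "PCA": PCA(n_components=2, random_state=42),
    "TSVD": TruncatedSVD(n_components=2, random_state=42),
}
embeddings = {name: reducer.fit_transform(X) for name, reducer in reducers.items()}

for name, emb in embeddings.items():
    print(name, emb.shape)
```

Plotting each 2-D embedding coloured by Class gives a quick visual check of whether the fraud and non-fraud cases form separable clusters.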

4. Classifier and Evaluation Metrics

X holds the predictor variables and Y the target variable.
We use a simple train/test split, since the data is already scaled and sampled to give both classes equal representation.
We used Logistic Regression, K-Nearest Neighbour, Support Vector and Decision Tree classifiers.
We used GridSearchCV for hyperparameter tuning, to get the best results for each model.
We plot learning curves to observe the training and testing behaviour of each model.
We plot the ROC (receiver operating characteristic) curve to show the performance of each classification model at all classification thresholds.
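
The tuning and evaluation loop might look roughly like the sketch below. The parameter grids, the synthetic dataset from `make_classification` and the helper `scores_for` are all hypothetical; the project's actual grids and data are not shown in this README.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Balanced synthetic data standing in for the undersampled set (assumption).
X, y = make_classification(
    n_samples=400, n_features=10, n_informative=5, class_sep=1.5, random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Hypothetical search spaces, one per classifier.
candidates = {
    "Logistic Regression": (LogisticRegression(max_iter=1000), {"C": [0.01, 0.1, 1, 10]}),
    "KNN": (KNeighborsClassifier(), {"n_neighbors": [3, 5, 7]}),
    "SVC": (SVC(), {"C": [0.5, 1.0], "kernel": ["rbf", "linear"]}),
    "Decision Tree": (
        DecisionTreeClassifier(random_state=42),
        {"max_depth": [2, 3, 4], "criterion": ["gini", "entropy"]},
    ),
}

def scores_for(model, X):
    """Continuous scores for ROC: decision_function if present, else P(class 1)."""
    if hasattr(model, "decision_function"):
        return model.decision_function(X)
    return model.predict_proba(X)[:, 1]

results = {}
for name, (estimator, grid) in candidates.items():
    search = GridSearchCV(estimator, grid, cv=5, scoring="roc_auc")
    search.fit(X_train, y_train)
    best = search.best_estimator_
    results[name] = roc_auc_score(y_test, scores_for(best, X_test))
    print(name, search.best_params_, round(results[name], 3))
```

Passing continuous scores rather than hard 0/1 predictions to `roc_auc_score` lets the metric reflect performance across all thresholds, which is what the ROC curve plots.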

We observed the following ROC scores:

Logistic Regression 0.9510652185989668
KNN 0.930923976445753
SVC 0.9457946780734403
Decision Tree 0.9217126826287428

One key metric for assessing model performance is the False Negative rate. The cost of a False Negative is much higher than the cost of a False Positive. Put simply, wrongly predicting a fraudulent transaction to be genuine would cause a massive problem for the credit card company. Sensitivity equals 1 minus the False Negative rate, so a higher sensitivity means fewer missed frauds.
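
For reference, sensitivity, specificity and the False Negative rate can be derived from a confusion matrix as in the sketch below. The labels are a made-up toy example, not results from this project.

```python
from sklearn.metrics import confusion_matrix

# Toy labels (assumption): 4 fraud cases (1) and 6 genuine cases (0).
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]

# For binary labels, ravel() yields the counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

sensitivity = tp / (tp + fn)           # share of frauds that were caught
specificity = tn / (tn + fp)           # share of genuine transactions passed
false_negative_rate = fn / (tp + fn)   # share of frauds that were missed

print(sensitivity, specificity, false_negative_rate)
```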

Logistic Regression:

Specificity Score: 0.98
Sensitivity Score: 0.94

K-Nearest Neighbour:

Specificity Score: 1.00
Sensitivity Score: 0.90

Support Vector Classifier:

Specificity Score: 0.99
Sensitivity Score: 0.94

Decision Tree Classifier:

Specificity Score: 0.99
Sensitivity Score: 0.87

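Conclusion: for this project, Logistic Regression surprisingly performed the best, with the highest ROC score (0.951) and a sensitivity of 0.94 (tied with the Support Vector Classifier), which corresponds to a low False Negative rate.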

imnischaygowda/Credit-Card-Fraud
