Our task is to solve the Credit Card Fraud Detection problem posted on Kaggle.
The dataset contains credit card transactions made by European cardholders over a two-day collection period in September 2013. It has 30 input features: the time and amount of each transaction, plus 28 features, V1, V2, ..., V28, that are the result of a PCA transformation (applied for confidentiality reasons). One further column, 'Class', indicates whether a transaction is fraudulent.
Feature Description:
- Time: the seconds elapsed between each transaction and the first transaction in the dataset
- Amount: the Transaction Amount
- V1, V2,..., V28: numerical input variables which are the output of PCA transformation
- Class: the response variable; it takes the value 1 for fraud and 0 otherwise
Objective: determine whether a given transaction is fraudulent.
This is a binary classification problem. The dataset has 492 frauds out of 284,807 transactions, so the class distribution is very uneven: the fraud class accounts for only 0.172% of all transactions. The dataset is therefore highly imbalanced.
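The size of the imbalance can be checked with a few lines of arithmetic. This is only a sketch built from the two counts quoted above; the variable names are illustrative.

```python
# Counts quoted in this README.
n_transactions = 284807
n_fraud = 492
n_legit = n_transactions - n_fraud

# Share of fraudulent transactions, in percent.
fraud_percent = 100.0 * n_fraud / n_transactions

# How many legitimate transactions there are for each fraudulent one.
legit_per_fraud = n_legit / n_fraud

print("fraud share: {:.3f}%".format(fraud_percent))
print("legitimate transactions per fraud: {:.0f}".format(legit_per_fraud))
```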
File: DataSet_creditcardfraud.zip
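Because of the class imbalance, performance is measured using the Area Under the Precision-Recall Curve (AUPRC).
Accuracy computed from the confusion matrix is not meaningful for imbalanced classification: a model that labels every transaction as non-fraudulent would already be about 99.8% accurate.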
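The difference between the two metrics can be seen on toy data. The sketch below assumes scikit-learn's `average_precision_score` as the AUPRC estimate (it summarizes the precision-recall curve step-wise); the labels and scores are made up for illustration and are not taken from the real dataset or from the scripts in this repository.

```python
import numpy as np
from sklearn.metrics import accuracy_score, average_precision_score

# Toy labels (an assumption, not the real data): 20 frauds in 10000 rows.
y_true = np.zeros(10000, dtype=int)
y_true[:20] = 1

# A useless "model": it predicts class 0 everywhere and gives every row
# the same score.
y_pred = np.zeros_like(y_true)
y_score = np.zeros(len(y_true))

trivial_accuracy = accuracy_score(y_true, y_pred)
trivial_auprc = average_precision_score(y_true, y_score)

# Scores that actually carry signal: frauds tend to receive higher values.
rng = np.random.RandomState(0)
informative_score = y_true + rng.normal(scale=0.3, size=len(y_true))
informative_auprc = average_precision_score(y_true, informative_score)

print("accuracy of the useless model: {:.4f}".format(trivial_accuracy))
print("AUPRC of the useless model:    {:.4f}".format(trivial_auprc))
print("AUPRC of informative scores:   {:.4f}".format(informative_auprc))
```

The useless model looks excellent by accuracy, while its AUPRC collapses to the fraction of positives in the data, which is why AUPRC is the more honest metric here.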
To address the imbalance, we use an oversampling technique.
From Wikipedia: The usual reason for oversampling is to correct for a bias in the original dataset. One scenario where it is useful is when training a classifier using labeled training data from a biased source since labeled training data is valuable but often comes from un-representative sources.
There are four common ways of addressing class imbalance problems like this one:
- Synthesis of new minority class instances
- Over-sampling of a minority class
- Under-sampling of the majority class
- Tweaking the cost function to make misclassification of minority instances more important than misclassification of majority instances
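As an illustration of the last option in the list, scikit-learn lets many classifiers weight classes in the loss through a `class_weight` parameter. The sketch below is an assumption about how this could be done; it is not necessarily what the scripts in this repository do, and the toy labels are made up.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.utils.class_weight import compute_class_weight

# Toy labels (illustrative only): 990 legitimate rows and 10 frauds.
y = np.array([0] * 990 + [1] * 10)

# "balanced" weights are n_samples / (n_classes * class_count), so the
# rare class receives a much larger weight in the loss.
weights = compute_class_weight(
    class_weight="balanced", classes=np.array([0, 1]), y=y
)
print("class weights:", dict(zip([0, 1], weights)))

# The same weighting can be requested directly on the classifier.
clf = LogisticRegression(class_weight="balanced", max_iter=1000)
```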
Of the four methods listed above, we use the first: SMOTE (Synthetic Minority Over-sampling Technique). Here is the [link] where it is explained thoroughly.
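The core idea of SMOTE is to create each synthetic minority point by interpolating between a real minority point and one of its nearest minority neighbours. The following is a simplified NumPy sketch of that idea only; the function name `smote_like_samples` and its parameters are hypothetical, and it is not the reference algorithm's exact procedure or the code used in this repository.

```python
import numpy as np


def smote_like_samples(X_minority, n_new, k=5, seed=0):
    """Create n_new synthetic points by interpolating between minority
    points and their k nearest minority neighbours (requires k < n rows)."""
    rng = np.random.RandomState(seed)
    n = X_minority.shape[0]

    # Pairwise squared Euclidean distances between minority points.
    diffs = X_minority[:, None, :] - X_minority[None, :, :]
    dist = (diffs ** 2).sum(axis=2)
    np.fill_diagonal(dist, np.inf)  # a point is not its own neighbour

    # Indices of the k nearest neighbours of every minority point.
    neighbours = np.argsort(dist, axis=1)[:, :k]

    # Pick a random base point and one of its neighbours for each new sample.
    base = rng.randint(0, n, size=n_new)
    chosen = neighbours[base, rng.randint(0, k, size=n_new)]

    # Place the new point at a random position on the connecting segment.
    gap = rng.uniform(0.0, 1.0, size=(n_new, 1))
    return X_minority[base] + gap * (X_minority[chosen] - X_minority[base])


# Toy minority class (made-up data): 20 points in 2 dimensions.
minority = np.random.RandomState(1).normal(size=(20, 2))
synthetic = smote_like_samples(minority, n_new=50)
print(synthetic.shape)
```

Because every synthetic point lies on a segment between two real minority points, the new samples stay inside the region already occupied by the minority class.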
To implement SMOTE we use the imblearn package. (Here is the [link] for further details on how to use the package.)
code: Logistic_Regression.py
code: SVM.py
code: NeuralNet.py
code: Random_Forest.py
python3.5
numpy
sklearn
pandas
imblearn