In [15]:
# import the necessary packages
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from matplotlib import gridspec

In [16]:
# load the dataset from the csv file using pandas
data = pd.read_csv("creditcard.csv")

In [17]:
# take a peek at the data
data.head()

Unnamed: 0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
0,0.0,-1.359807,-0.072781,2.536347,1.378155,-0.338321,0.462388,0.239599,0.098698,0.363787,...,-0.018307,0.277838,-0.110474,0.066928,0.128539,-0.189115,0.133558,-0.021053,149.62,0
1,0.0,1.191857,0.266151,0.16648,0.448154,0.060018,-0.082361,-0.078803,0.085102,-0.255425,...,-0.225775,-0.638672,0.101288,-0.339846,0.16717,0.125895,-0.008983,0.014724,2.69,0
2,1.0,-1.358354,-1.340163,1.773209,0.37978,-0.503198,1.800499,0.791461,0.247676,-1.514654,...,0.247998,0.771679,0.909412,-0.689281,-0.327642,-0.139097,-0.055353,-0.059752,378.66,0
3,1.0,-0.966272,-0.185226,1.792993,-0.863291,-0.010309,1.247203,0.237609,0.377436,-1.387024,...,-0.1083,0.005274,-0.190321,-1.175575,0.647376,-0.221929,0.062723,0.061458,123.5,0
4,2.0,-1.158233,0.877737,1.548718,0.403034,-0.407193,0.095921,0.592941,-0.270533,0.817739,...,-0.009431,0.798278,-0.137458,0.141267,-0.20601,0.502292,0.219422,0.215153,69.99,0


### Kaggle Credit Card Fraud Dataset

This dataset consists of transactions made by credit cards in September 2013 by European cardholders over a period of two days. The dataset is highly unbalanced with only 492 frauds out of 284,807 transactions, accounting for just 0.172% of all transactions.

#### Features

- **V1, V2, ..., V28**: These are the principal components obtained with Principal Component Analysis (PCA). Due to confidentiality issues, the original features from which these components were derived are not available. PCA transforms the original variables into a new set of variables that are linear combinations of the originals along mutually orthogonal directions, so the resulting components are uncorrelated with each other (a weaker property than statistical independence).

- **Time**: This feature represents the seconds elapsed between each transaction and the first transaction in the dataset. It allows tracking of the time context of each transaction.

- **Amount**: This is the transaction amount for each record. This feature can be used for example-dependent cost-sensitive learning, meaning that the cost function can consider the transaction amount in fraud detection scenarios.

- **Class**: This is the target variable that we aim to predict. It takes the value 1 in case of fraud and 0 otherwise.
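
To make the PCA property concrete, here is a minimal sketch on made-up data (the original card features are confidential, so `raw` below is a purely hypothetical stand-in). It computes PCA through an SVD of the centered data and checks that the transformed columns are uncorrelated even though the inputs are strongly correlated.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical correlated 2-D data standing in for the confidential features
x = rng.normal(size=1000)
raw = np.column_stack([x, 0.8 * x + 0.3 * rng.normal(size=1000)])

# PCA via SVD of the centered data: the rows of vt are the principal directions
centered = raw - raw.mean(axis=0)
_, _, vt = np.linalg.svd(centered, full_matrices=False)
components = centered @ vt.T  # data expressed in the principal-component basis

raw_corr = np.corrcoef(raw, rowvar=False)[0, 1]
pc_corr = np.corrcoef(components, rowvar=False)[0, 1]
print("correlation between raw features:", raw_corr)
print("correlation between components:  ", pc_corr)  # zero up to rounding error
```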

In [18]:
# Determine number of fraud cases in dataset
fraud = data[data['Class'] == 1]
valid = data[data['Class'] == 0]
# ratio of fraud cases to valid cases (not the share of all transactions)
outlierFraction = len(fraud)/float(len(valid))
print(outlierFraction)
print('Fraud Cases: {}'.format(len(data[data['Class'] == 1])))
print('Valid Transactions: {}'.format(len(data[data['Class'] == 0])))

0.0017304750013189597
Fraud Cases: 492
Valid Transactions: 284315


### Dealing with Imbalanced Data in Fraud Detection

Fraudulent transactions account for only 0.17% of total transactions, leading to a highly imbalanced dataset. In machine learning, an *imbalanced dataset* refers to a situation where the classes are not represented equally. 

For instance, consider a binary classification problem where 95% of the instances belong to Class A and only 5% of the instances belong to Class B. This dataset would be considered imbalanced.

This imbalance can introduce a significant bias in our model training. Most machine learning algorithms are designed to maximize overall accuracy, which can be misleading when the classes are imbalanced. 

For example, in our case, a model could achieve a 99.83% accuracy rate by predicting every transaction to be non-fraudulent. However, such a model would be absolutely useless at detecting fraudulent transactions, which is our main objective. 
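
The 99.83% figure follows directly from the class counts printed above. A short worked calculation, using only those two counts, shows the accuracy and the fraud recall of a "model" that always predicts the non-fraud class:

```python
# Class counts printed earlier in the notebook
n_fraud, n_valid = 492, 284315
n_total = n_fraud + n_valid

# A "model" that labels every transaction as valid (class 0)
true_negatives = n_valid   # every valid transaction is labelled correctly
false_negatives = n_fraud  # every fraudulent transaction is missed
true_positives = 0

baseline_accuracy = true_negatives / n_total
baseline_recall = true_positives / (true_positives + false_negatives)

print(f"accuracy = {baseline_accuracy:.2%}, fraud recall = {baseline_recall:.0%}")
```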

Therefore, we need to implement strategies to correct for the imbalance in our dataset.

For now, let's see how the model performs without balancing, and then compare the results with a balanced set later.

In [19]:
# dividing the X and the Y from the dataset
X = data.drop(['Class'], axis = 1)
Y = data["Class"]

print(X.shape)
print(Y.shape)

# getting just the values for the sake of processing
# (it's a NumPy array without column labels)
xData = X.values
yData = Y.values

(284807, 30)
(284807,)


In [21]:
# splitting the data into training and testing set
from sklearn.model_selection import train_test_split
# Split the data into training and testing sets
xTrain, xTest, yTrain, yTest = train_test_split(
        xData, yData, test_size = 0.2, random_state = 42)
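
With so few fraud cases, a purely random split can leave the training and test sets with slightly different fraud rates. One optional refinement, not used in this notebook, is the `stratify` argument of `train_test_split`, which keeps the class proportions the same in both parts. The sketch below uses hypothetical stand-in arrays (`X_demo`, `y_demo`) rather than the real data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)

# Hypothetical stand-in for xData / yData: 10,000 rows with 0.5% positives
X_demo = rng.normal(size=(10_000, 5))
y_demo = np.zeros(10_000, dtype=int)
y_demo[:50] = 1

# stratify=y_demo preserves the class ratio in both the train and test parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo)

print("positives in train:", y_tr.sum(), "| positives in test:", y_te.sum())
print("positive rate train/test:", y_tr.mean(), y_te.mean())
```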

### Choosing an ML Model for our use case: Random Forest Classification

The Random Forest algorithm is a powerful tool for detecting credit card fraud due to its inherent characteristics. It is an ensemble learning method that operates by constructing multiple decision trees and outputting the majority vote of individual trees. 

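A random forest is less prone to overfitting than a single decision tree, because each tree is trained on a different bootstrap sample with random feature subsets and the errors of individual trees partly average out. The algorithm also copes with large datasets that have many variables, which makes it suitable for complex data such as credit card transactions. Whether missing values are handled natively depends on the implementation and version, so that should be checked rather than assumed.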

Its ability to estimate the importance of features can be beneficial in identifying significant indicators of fraud. Random forests are not immune to class imbalance, however: without class weighting or resampling they tend to favor the majority class, which in fraud detection means the legitimate transactions. We therefore start with a plain random forest as a baseline and look at rebalancing afterwards.
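
As an aside, scikit-learn's `RandomForestClassifier` accepts a `class_weight` argument that reweights classes inversely to their frequency, which is one way to counter imbalance without resampling. The notebook itself uses the default settings; the sketch below only illustrates the option on hypothetical data (`X_demo`, `y_demo`), and the parameter values are arbitrary.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Hypothetical imbalanced data: 2,000 rows, 2% positives shifted in feature 0
X_demo = rng.normal(size=(2_000, 4))
y_demo = np.zeros(2_000, dtype=int)
y_demo[:40] = 1
X_demo[:40, 0] += 2.0

# class_weight="balanced" weights each class inversely to its frequency
weighted_rfc = RandomForestClassifier(
    n_estimators=50, class_weight="balanced", random_state=42)
weighted_rfc.fit(X_demo, y_demo)

sample_pred = weighted_rfc.predict(X_demo[:5])
print(weighted_rfc.classes_, sample_pred)
```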


In [24]:
# Building the Random Forest Classifier (RANDOM FOREST)
from sklearn.ensemble import RandomForestClassifier
# random forest model creation
rfc = RandomForestClassifier()
rfc.fit(xTrain, yTrain)
# predictions
yPred = rfc.predict(xTest)

In [27]:
# Evaluating the classifier
# printing every score of the classifier
from sklearn.metrics import classification_report, accuracy_score 
from sklearn.metrics import precision_score, recall_score
from sklearn.metrics import f1_score, matthews_corrcoef
from sklearn.metrics import confusion_matrix
from sklearn.metrics import roc_auc_score
  
n_outliers = len(fraud)
n_errors = (yPred != yTest).sum()
print("The model used is Random Forest classifier")
  
acc = accuracy_score(yTest, yPred)
print("The accuracy is {}".format(acc))
  
prec = precision_score(yTest, yPred)
print("The precision is {}".format(prec))
  
rec = recall_score(yTest, yPred)
print("The recall is {}".format(rec))
  
f1 = f1_score(yTest, yPred)
print("The F1-Score is {}".format(f1))
  
MCC = matthews_corrcoef(yTest, yPred)
print("The Matthews correlation coefficient is {}".format(MCC))

# note: yPred holds hard 0/1 labels, so this is the AUC at a single threshold
auc = roc_auc_score(yTest, yPred)
print("The AUC is {}".format(auc))

The model used is Random Forest classifier
The accuracy is 0.9995786664794073
The precision is 0.9743589743589743
The recall is 0.7755102040816326
The F1-Score is 0.8636363636363635
The Matthews correlation coefficient is 0.8690748763736589
The AUC is 0.8877375162220206


## Results

1. **Accuracy (0.9995786664794073)**: This metric represents the proportion of total predictions that were correct. In this case, the model has a very high accuracy, indicating that it correctly identified fraudulent and non-fraudulent transactions in about 99.96% of cases. However, in imbalanced datasets, accuracy can be misleading because it does not differentiate between the classes.

2. **Precision (0.9743589743589743)**: Precision is the proportion of positive identifications (in this case, predicted fraudulent transactions) that were actually correct. High precision means few false positives among the transactions flagged as fraud. With a precision of about 97.44%, the model rarely labels a legitimate transaction as fraudulent.

3. **Recall (0.7755102040816326)**: Recall (or sensitivity) is the proportion of actual positives that were identified correctly. It is particularly important in situations like fraud detection, where it is crucial to capture as many positives as possible. The model's recall is about 77.55%, which means it misses about 22.45% of fraudulent transactions, so there is room for improvement.

4. **F1-Score (0.8636363636363635)**: The F1-score is the harmonic mean of precision and recall, so it is high only when both are high. In this case, the F1-score is 0.86, which indicates a decent balance between precision and recall.

5. **Matthews correlation coefficient (0.8690748763736589)**: This is a measure of the quality of binary classifications. It takes into account true and false positives and negatives and is generally regarded as a balanced measure that can be used even if the classes are of very different sizes. A coefficient of +1 represents a perfect prediction, 0 a prediction no better than random, and -1 total disagreement between prediction and observation. In this case, an MCC of 0.87 is a very good value.

6. **AUC (0.8877375162220206)**: The area under the ROC curve (AUC) summarizes performance across classification thresholds. The ROC curve plots the true positive rate against the false positive rate, and the AUC measures how well the model separates the two classes. Note that the score here was computed from the hard 0/1 predictions rather than from predicted probabilities, so the value of about 0.89 is effectively the average of recall and specificity at a single threshold, not a threshold-independent AUC.
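
The scores listed above can be reproduced by hand from confusion-matrix counts. The counts below are not printed by the notebook; they are inferred from the reported scores and the 20% test split (56,962 rows), so treat them as an assumption. The sketch also shows why an AUC computed from hard labels equals the mean of recall and specificity.

```python
import math

# Counts inferred from the printed scores (an assumption, not notebook output)
tp, fp, fn, tn = 76, 2, 22, 56862

accuracy = (tp + tn) / (tp + fp + fn + tn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
mcc = (tp * tn - fp * fn) / math.sqrt(
    (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))

# With hard 0/1 predictions the ROC curve has a single interior point,
# so its area reduces to the mean of recall and specificity.
specificity = tn / (tn + fp)
auc_from_labels = (recall + specificity) / 2

for name, value in [("accuracy", accuracy), ("precision", precision),
                    ("recall", recall), ("f1", f1), ("mcc", mcc),
                    ("auc (hard labels)", auc_from_labels)]:
    print(f"{name}: {value:.6f}")
```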

While the model has high accuracy and precision, its relatively lower recall suggests that it's not capturing all the fraudulent transactions effectively. Since it's critical in a fraud detection task to correctly identify as many fraudulent transactions as possible (even at the risk of some false positives), we may need to adjust the model or rebalance the dataset to improve the recall rate.
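
One way to "adjust the model" without retraining is to lower the decision threshold applied to the predicted fraud probability, which trades precision for recall. The sketch below is only an illustration on synthetic data from `make_classification`; the dataset, the 0.3 threshold, and all variable names are assumptions. It also computes the ROC AUC from probabilities, which is the threshold-independent form of the metric.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_score, recall_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced data standing in for the credit card set
X_demo, y_demo = make_classification(
    n_samples=5_000, n_features=10, weights=[0.97], flip_y=0.01, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo)

model = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_tr, y_tr)
fraud_prob = model.predict_proba(X_te)[:, 1]  # estimated P(class 1) per row

# Threshold-independent AUC uses the scores, not the hard labels
auc_from_scores = roc_auc_score(y_te, fraud_prob)
print("ROC AUC from probabilities:", auc_from_scores)

recall_by_threshold = {}
for threshold in (0.5, 0.3):  # 0.3 is an arbitrary illustrative value
    pred = (fraud_prob >= threshold).astype(int)
    recall_by_threshold[threshold] = recall_score(y_te, pred)
    print(threshold,
          "precision:", precision_score(y_te, pred, zero_division=0),
          "recall:", recall_by_threshold[threshold])
```

Lowering the threshold can only add predicted positives, so recall never decreases; how much precision is lost depends on the data.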

## Rebalancing the data

Here are several techniques we can use to balance our data:

1. Undersampling: In this method, you randomly remove some of the samples from the majority class to balance the ratio between the majority and minority class. The disadvantage is that you lose potentially useful data.

2. Oversampling: Here, you increase the number of samples in the minority class by randomly replicating them. The disadvantage is that by duplicating the data, the model may overfit the data.

3. SMOTE (Synthetic Minority Over-Sampling Technique): This is a popular algorithm to create synthetic samples of the minority class. It works by selecting examples that are close in the feature space, drawing a line between the examples in the feature space and drawing a new sample at a point along that line.
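
The interpolation step behind SMOTE can be sketched in a few lines of NumPy. This is a simplified illustration of the idea described above, not the implementation in `imblearn`; the function name `synthetic_sample` and the toy data are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical minority-class points in a 2-D feature space
minority = rng.normal(size=(20, 2))

def synthetic_sample(points, k=5):
    """Simplified SMOTE-style step: interpolate between a randomly chosen
    point and one of its k nearest neighbours."""
    i = rng.integers(len(points))
    dists = np.linalg.norm(points - points[i], axis=1)
    neighbours = np.argsort(dists)[1:k + 1]  # position 0 is the point itself
    j = rng.choice(neighbours)
    gap = rng.random()                       # position along the segment, in [0, 1)
    return points[i] + gap * (points[j] - points[i]), i, j

new_point, i, j = synthetic_sample(minority)
print("base point:     ", minority[i])
print("neighbour:      ", minority[j])
print("synthetic point:", new_point)
```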

In [30]:
from imblearn.under_sampling import RandomUnderSampler
from imblearn.over_sampling import RandomOverSampler, SMOTE
from collections import Counter

print('Original dataset shape:', Counter(Y))

# Random Under-sampling
rus = RandomUnderSampler(random_state=42)
X_rus, Y_rus = rus.fit_resample(X, Y)
print('Resampled (undersampling) dataset shape:', Counter(Y_rus))

Original dataset shape: Counter({0: 284315, 1: 492})
Resampled (undersampling) dataset shape: Counter({0: 492, 1: 492})


In [32]:
# let's see how the undersampled data performs with Random Forest classifier

# getting just the values for the sake of processing
# (it's a NumPy array without column labels)
xData = X_rus.values
yData = Y_rus.values

# splitting the data into training and testing set
from sklearn.model_selection import train_test_split
# Split the data into training and testing sets
xTrain, xTest, yTrain, yTest = train_test_split(
        xData, yData, test_size = 0.2, random_state = 42)

# Building the Random Forest Classifier (RANDOM FOREST)
from sklearn.ensemble import RandomForestClassifier
# random forest model creation
rfc = RandomForestClassifier()
rfc.fit(xTrain, yTrain)
# predictions
yPred = rfc.predict(xTest)

In [33]:
# Evaluating the classifier
# printing every score of the classifier
from sklearn.metrics import classification_report, accuracy_score 
from sklearn.metrics import precision_score, recall_score
from sklearn.metrics import f1_score, matthews_corrcoef
from sklearn.metrics import confusion_matrix
from sklearn.metrics import roc_auc_score
  
n_outliers = len(fraud)
n_errors = (yPred != yTest).sum()
print("The model used is Random Forest classifier")
  
acc = accuracy_score(yTest, yPred)
print("The accuracy is {}".format(acc))
  
prec = precision_score(yTest, yPred)
print("The precision is {}".format(prec))
  
rec = recall_score(yTest, yPred)
print("The recall is {}".format(rec))
  
f1 = f1_score(yTest, yPred)
print("The F1-Score is {}".format(f1))
  
MCC = matthews_corrcoef(yTest, yPred)
print("The Matthews correlation coefficient is {}".format(MCC))

# note: yPred holds hard 0/1 labels, so this is the AUC at a single threshold
auc = roc_auc_score(yTest, yPred)
print("The AUC is {}".format(auc))

The model used is Random Forest classifier
The accuracy is 0.9238578680203046
The precision is 0.9560439560439561
The recall is 0.8877551020408163
The F1-Score is 0.9206349206349207
The Matthews correlation coefficient is 0.849807156821831
The AUC is 0.9236755308183879


## Results

1. **Accuracy (0.9238578680203046):** The accuracy has decreased from 0.9996 to 0.9239. The two figures are measured on different test sets, though: this one is balanced, so accuracy can no longer be inflated by always predicting the majority class. A lower value is therefore not necessarily a bad sign if the model has become better at detecting fraud.

2. **Precision (0.9560439560439561):** Precision has decreased slightly from 0.9744 to 0.9560, suggesting a slight increase in the number of false positives (transactions that were non-fraudulent but were identified as fraudulent). In the context of fraud detection, this might be acceptable if the model can detect more true fraudulent cases.

3. **Recall (0.8877551020408163):** The recall has improved from 0.7755 to 0.8878. This means the model is now better at detecting fraudulent transactions, which is critical in this context.

4. **F1-Score (0.9206349206349207):** The F1-Score has increased from 0.8636 to 0.9206, suggesting that the balance between precision and recall has improved. This is a positive development, as both precision and recall are important in a fraud detection context.

5. **Matthews correlation coefficient (0.849807156821831):** This metric has decreased slightly from 0.8691 to 0.8498, which indicates a slight decrease in the overall quality of binary classifications. However, it remains high, indicating a generally good quality of predictions.

6. **AUC (0.9236755308183879):** The AUC has increased from 0.8877 to 0.9237, suggesting that the model's ability to distinguish between fraudulent and non-fraudulent transactions has improved.

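Overall, accuracy and precision have dropped after applying undersampling, while recall, F1-score, and AUC have improved. This suggests that the model is better at identifying fraudulent transactions, which is usually the priority in fraud detection. The numbers are not directly comparable with the first experiment, however: undersampling was applied before the train/test split, so the test set here is itself balanced (about half fraud) rather than reflecting the real fraud rate of 0.17%. Precision in particular depends on class prevalence and would be expected to be much lower on the original distribution.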
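
To see how strongly precision depends on the class mix, the balanced-test-set results can be projected onto a test set with the original fraud rate. The confusion-matrix counts below are inferred from the printed scores, and the projection assumes that recall and the false positive rate would carry over unchanged; both are assumptions made for illustration, not results from the notebook.

```python
# Counts inferred from the printed undersampling scores (197-row balanced test set)
tp, fp, fn, tn = 87, 4, 11, 95
recall = tp / (tp + fn)
false_positive_rate = fp / (fp + tn)
precision_balanced = tp / (tp + fp)

# Hypothetical test set with the original class mix
# (98 fraud and 56,864 valid rows, as inferred for the first experiment)
n_fraud, n_valid = 98, 56864
expected_tp = recall * n_fraud
expected_fp = false_positive_rate * n_valid
precision_at_real_prevalence = expected_tp / (expected_tp + expected_fp)

print(f"precision on the balanced test set:      {precision_balanced:.3f}")
print(f"projected precision at the real mix:     {precision_at_real_prevalence:.3f}")
print(f"projected false alarms per fraud caught: {expected_fp / expected_tp:.1f}")
```

Under these assumptions, a false positive rate of about 4% translates into thousands of false alarms once it is applied to tens of thousands of legitimate transactions.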

In [36]:
from imblearn.under_sampling import RandomUnderSampler
from imblearn.over_sampling import RandomOverSampler, SMOTE
from collections import Counter

print('Original dataset shape:', Counter(Y))

# Random Over-sampling
ros = RandomOverSampler(random_state=42)
X_ros, Y_ros = ros.fit_resample(X, Y)
print('Resampled (oversampling) dataset shape:', Counter(Y_ros))

Original dataset shape: Counter({0: 284315, 1: 492})
Resampled (oversampling) dataset shape: Counter({0: 284315, 1: 284315})


In [37]:
# let's see how the oversampled data performs with Random Forest classifier

# getting just the values for the sake of processing
# (it's a NumPy array without column labels)
xData = X_ros.values
yData = Y_ros.values

# splitting the data into training and testing set
from sklearn.model_selection import train_test_split
# Split the data into training and testing sets
xTrain, xTest, yTrain, yTest = train_test_split(
        xData, yData, test_size = 0.2, random_state = 42)

# Building the Random Forest Classifier (RANDOM FOREST)
from sklearn.ensemble import RandomForestClassifier
# random forest model creation
rfc = RandomForestClassifier()
rfc.fit(xTrain, yTrain)
# predictions
yPred = rfc.predict(xTest)

In [38]:
# Evaluating the classifier
# printing every score of the classifier
from sklearn.metrics import classification_report, accuracy_score 
from sklearn.metrics import precision_score, recall_score
from sklearn.metrics import f1_score, matthews_corrcoef
from sklearn.metrics import confusion_matrix
from sklearn.metrics import roc_auc_score
  
n_outliers = len(fraud)
n_errors = (yPred != yTest).sum()
print("The model used is Random Forest classifier")
  
acc = accuracy_score(yTest, yPred)
print("The accuracy is {}".format(acc))
  
prec = precision_score(yTest, yPred)
print("The precision is {}".format(prec))
  
rec = recall_score(yTest, yPred)
print("The recall is {}".format(rec))
  
f1 = f1_score(yTest, yPred)
print("The F1-Score is {}".format(f1))
  
MCC = matthews_corrcoef(yTest, yPred)
print("The Matthews correlation coefficient is {}".format(MCC))

# note: yPred holds hard 0/1 labels, so this is the AUC at a single threshold
auc = roc_auc_score(yTest, yPred)
print("The AUC is {}".format(auc))

The model used is Random Forest classifier
The accuracy is 0.9999560346798445
The precision is 0.9999122514522385
The recall is 1.0
The F1-Score is 0.9999561238010829
The Matthews correlation coefficient is 0.9999120728626671
The AUC is 0.9999559471365639


## Results

1. **Accuracy (0.9999560346798445):** The accuracy has slightly increased from the original model (0.9995786664794073 to 0.9999560346798445). This means the model is correctly identifying fraudulent and non-fraudulent transactions almost all the time.

2. **Precision (0.9999122514522385):** Precision has slightly increased compared to the original model (from 0.9743589743589743 to 0.9999122514522385), indicating that almost all transactions identified as fraudulent are indeed fraudulent. There are very few false positives.

3. **Recall (1.0):** The recall has significantly improved from 0.7755102040816326 to 1.0, suggesting that the model is now able to identify all fraudulent transactions correctly. This is an excellent result because in the context of fraud detection, it is crucial to identify as many fraudulent cases as possible.

4. **F1-Score (0.9999561238010829):** The F1-Score, which balances precision and recall, has significantly improved (from 0.8636363636363635 to 0.9999561238010829). This suggests that the model has improved in terms of both precision and recall.

5. **Matthews correlation coefficient (0.9999120728626671):** The Matthews correlation coefficient has improved from 0.8690748763736589 to 0.9999120728626671, indicating a significant improvement in the quality of binary classifications.

6. **AUC (0.9999559471365639):** The AUC score has increased from 0.8877375162220206 to 0.9999559471365639, indicating that the model's ability to distinguish between fraudulent and non-fraudulent transactions has improved drastically.

Overall, oversampling appears to improve the model's performance across all metrics, especially recall, which is crucial in fraud detection. However, these results are too good to be taken at face value. Because the oversampling was applied before the train/test split, exact copies of the same fraudulent transactions end up in both the training and the test set, so the model is largely being tested on rows it has already seen. To get a trustworthy estimate, the data should be split first, only the training portion resampled, and the model evaluated on an untouched test set with the original class distribution. It's also important to remember that random oversampling doesn't add any new information, which might limit the model's ability to detect different or evolving types of fraud.
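
The effect of resampling before splitting can be demonstrated without training a model. The sketch below uses hypothetical row ids instead of the real data: it oversamples first and then splits, counts how many minority rows in the test set also occur in the training set, and then shows the leak-free order (split first, oversample the training part only). All names and sizes are illustrative assumptions.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)

# Hypothetical data: 1,000 rows, 20 of them minority; row_id tracks the original row
row_id = np.arange(1_000)
y_demo = np.zeros(1_000, dtype=int)
y_demo[:20] = 1

# Oversample first, then split: duplicates can land on both sides of the split
minority_ids = row_id[y_demo == 1]
extra = rng.choice(minority_ids, size=(y_demo == 0).sum() - len(minority_ids))
ids_over = np.concatenate([row_id, extra])
train_ids, test_ids = train_test_split(ids_over, test_size=0.2, random_state=42)
test_minority = test_ids[y_demo[test_ids] == 1]
leaked = np.isin(test_minority, train_ids).mean()
print(f"minority test rows that also appear in training: {leaked:.0%}")

# Leak-free order: split first, then oversample the training part only
tr_ids, te_ids = train_test_split(
    row_id, test_size=0.2, random_state=42, stratify=y_demo)
tr_minority = tr_ids[y_demo[tr_ids] == 1]
extra_tr = rng.choice(
    tr_minority, size=(y_demo[tr_ids] == 0).sum() - len(tr_minority))
tr_ids_balanced = np.concatenate([tr_ids, extra_tr])
overlap = np.isin(te_ids, tr_ids_balanced).any()
print("any test row present in the balanced training set:", overlap)
```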