# Credit Card Fraud Detection with Imbalanced Data: Undersampling vs. Oversampling (Oversampling Notebook)

### Goal
Credit card fraud detection using labeled data is a common introductory project for learning data science, machine learning, and binary classification. In this notebook, I am less focused on building simple classification models to detect fraud; instead, I want to test the difference between oversampling and undersampling techniques and their outcomes when applied to highly imbalanced data, like the dataset used here. Using the Imbalanced-Learn library, my goal is to find which of the two sampling techniques is more useful for highly imbalanced data, and how each affects accuracy.

***This project has been broken into 3 notebooks, as a single notebook exceeds the file size limit for GitHub. Also, to keep the file size down, the visualizations have not been rendered. Refer to the README for the visualizations, including any interpretations. These notebooks are for showcasing code only.***

### Data Source
The original dataset can be found [here](https://www.kaggle.com/datasets/mlg-ulb/creditcardfraud) <br> 
**Notes from data source:** "It contains only numerical input variables which are the result of a PCA transformation. Unfortunately, due to confidentiality issues, we cannot provide the original features and more background information about the data. Features V1, V2, … V28 are the principal components obtained with PCA, the only features which have not been transformed with PCA are 'Time' and 'Amount'. Feature 'Time' contains the seconds elapsed between each transaction and the first transaction in the dataset. The feature 'Amount' is the transaction Amount, this feature can be used for example-dependant cost-sensitive learning. Feature 'Class' is the response variable and it takes value 1 in case of fraud and 0 otherwise." <br> 

In addition, the publisher recommends not using confusion matrices because of the high imbalance in the target variable. However, I will use them in this notebook because I am comparing methods that create balanced data.

In [1]:
#import dependencies
import numpy as np 
import pandas as pd 
import plotly.express as px
import plotly.graph_objects as go
import matplotlib.pyplot as plt
import seaborn as sns

pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', None)

In [3]:
#import data 
credit_data = pd.read_csv('creditcard.csv')

#credit_data.head()

## Predict Fraud
### Oversampling vs. Undersampling

As seen in the EDA notebook, there is a clear imbalance of classes. There are a few ways imbalance can be handled, and in this notebook I will focus on oversampling (with another notebook focusing on undersampling), and determine which would be more beneficial for the data given.

#### Oversampling
For oversampling, I will use the **Synthetic Minority Oversampling Technique** (SMOTE). Essentially, for each selected minority-class sample, SMOTE finds its k nearest minority-class neighbors and generates a synthetic point by interpolating between the sample and one of those neighbors. SMOTE can be applied using the [imblearn library](https://imbalanced-learn.org/stable/references/generated/imblearn.over_sampling.SMOTE.html).<br> 
Steps to implement SMOTE: <br> 
1. import SMOTE from imblearn.over_sampling 
2. create a SMOTE object, using sampling_strategy = 'minority' 
3. fit the object to the data to get oversampled X values and oversampled Y values
4. concatenate the oversampled X's and Y's into one dataframe
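
To make the interpolation idea concrete, here is a minimal pure-Python sketch of how a single SMOTE-style synthetic point could be generated. It is not imblearn's implementation; the helper names (`k_nearest`, `smote_like_sample`) and the toy minority points are hypothetical and exist only for illustration.

```python
import random


def k_nearest(point, candidates, k):
    """Return the k candidates closest to `point` (squared Euclidean distance)."""
    by_distance = sorted(
        candidates,
        key=lambda c: sum((a - b) ** 2 for a, b in zip(point, c)),
    )
    return by_distance[:k]


def smote_like_sample(point, neighbors, rng):
    """Create one synthetic point on the segment between `point` and a random neighbor."""
    neighbor = rng.choice(neighbors)
    gap = rng.random()  # interpolation factor in [0, 1)
    return [p + gap * (n - p) for p, n in zip(point, neighbor)]


# Hypothetical minority-class points (two features each)
minority = [[1.0, 2.0], [1.5, 1.8], [2.0, 2.2], [1.2, 2.5]]

rng = random.Random(42)
base = minority[0]
neighbors = k_nearest(base, [m for m in minority if m is not base], k=2)
synthetic = smote_like_sample(base, neighbors, rng)
print(synthetic)
```

The synthetic point always lies between the base sample and one of its nearest minority neighbors, which is why SMOTE adds new rows rather than duplicating existing ones.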

I will use multiple predictive modeling algorithms to determine the best model for prediction.

In [4]:
#import imblearn 
from imblearn.over_sampling import SMOTE

# Resampling the minority class
sm = SMOTE(sampling_strategy='minority', random_state=42)
# Fit the model to generate the data.
oversampled_X, oversampled_Y = sm.fit_resample(credit_data.drop('Class', axis=1), credit_data['Class'])
new_df = pd.concat([pd.DataFrame(oversampled_Y), pd.DataFrame(oversampled_X)], axis=1)
new_df.shape

(568630, 31)

In [5]:
new_fraud = new_df[new_df['Class'] == 1]
new_normal = new_df[new_df['Class'] == 0]
print(f'{str(len(new_normal))} non-fraud transactions')
print(f'{str(len(new_fraud))} fraud transactions')

284315 non-fraud transactions
284315 fraud transactions
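
As a rough check on what the balanced counts imply, the arithmetic below estimates how many fraud rows are synthetic. The original fraud count of 492 is an assumption taken from the dataset's Kaggle description, not from this notebook's output.

```python
majority_count = 284315     # non-fraud rows, from the output above
balanced_minority = 284315  # fraud rows after SMOTE, from the output above
original_minority = 492     # assumption: fraud count stated on the dataset's Kaggle page

synthetic_rows = balanced_minority - original_minority
total_rows = majority_count + balanced_minority
synthetic_share = synthetic_rows / balanced_minority

print(f'{synthetic_rows} synthetic fraud rows out of {total_rows} total rows')
print(f'{synthetic_share:.2%} of fraud rows are synthetic')
```

Under that assumption, nearly all fraud rows in the balanced dataframe are SMOTE-generated, which is worth keeping in mind when reading the metrics later in this notebook.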


Now I can try different classification models.

In [19]:
#import classification dependencies
import sklearn 
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics  import (f1_score,accuracy_score, recall_score, precision_score, confusion_matrix)


In [7]:
#going to get X and y using a copy of new_df (assigned to variable ('credit'))
credit = new_df.copy()
labels = credit['Class']
xtrain = credit.drop(['Class'], axis = 1)

#To get a validation set, I will apply train_test_split twice
#The calls below give a test set with 20% of the total data, a training set with 60%, and a cv set with 20%
x_1, X_test, y_1, y_test = train_test_split(xtrain,labels,test_size=0.2,train_size=0.8)
X_train, X_cv, y_train, y_cv = train_test_split(x_1,y_1,test_size = 0.25,train_size =0.75)
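
The 60/20/20 proportions follow from chaining the two splits: the second split takes 25% of the remaining 80%. A small sketch of that arithmetic (the variable names here are illustrative and not used elsewhere in the notebook):

```python
first_test_fraction = 0.2   # test_size in the first train_test_split call
second_cv_fraction = 0.25   # test_size in the second call, applied to the remainder

remaining = 1 - first_test_fraction
cv_fraction = remaining * second_cv_fraction           # 0.8 * 0.25
train_fraction = remaining * (1 - second_cv_fraction)  # 0.8 * 0.75

print(f'train: {train_fraction:.0%}, cv: {cv_fraction:.0%}, test: {first_test_fraction:.0%}')
```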


Because I am working with limited computing power, I will leave the below cell's function commented and not utilize it. The below function would conduct all of the desired classification modeling and output a dictionary with the classifier names as the keys, and their corresponding accuracies as the values. If I did utilize it, the run time would be very long, especially for the oversampling portion of this notebook.

In [8]:
# def classifications(X, y, testsize, max_it, rand_state = 42): 
#     X_train, X_test, y_train, y_test = train_test_split(X, y, test_size= testsize, random_state=rand_state)
#     l = [ ('Random Forest', RandomForestClassifier(random_state = rand_state)), 
#          ('Decision Tree', DecisionTreeClassifier(random_state = rand_state)), 
#          ('Logistic Regression', LogisticRegression(solver='lbfgs', max_iter= max_it, random_state = rand_state)),
#         ('K-Nearest Neighbors', KNeighborsClassifier())] 
#     acc_dict = {}
#     for classifier in l: 
#         c = classifier[1]
#         c.fit(X_train, y_train)
#         c_pred = c.predict(X_test)
#         c_acc = accuracy_score(c_pred, y_test)
#         acc_dict[classifier[0]] = c_acc
#     return acc_dict

First, I will look at **Random Forest**:

In [9]:
#Random Forest: 
RAND_STATE = 42

rfc = RandomForestClassifier(random_state = RAND_STATE)
rfc.fit(X_train, y_train)
rfc_pred = rfc.predict(X_test)
rfc_acc = accuracy_score(y_test, rfc_pred)  # (y_true, y_pred) order, matching the other metric calls

rfc_acc

0.9997977595272849

In [10]:
#This is a high accuracy- check out the validation set's accuracy results: 
y_pred_val_rfc = rfc.predict(X_cv)
rfc_cv_acc = accuracy_score(y_cv, y_pred_val_rfc)

rfc_cv_acc

0.9998768971035648

In [11]:
rfc_test_precision = precision_score(y_test, rfc_pred)
rfc_test_recall = recall_score(y_test, rfc_pred)

rfc_cv_precision = precision_score(y_cv, y_pred_val_rfc)
rfc_cv_recall = recall_score(y_cv, y_pred_val_rfc)

print('Random Forest Classifier')
print(f'Test precision: {str(round(rfc_test_precision, 4))}')
print(f'Test recall: {str(round(rfc_test_recall, 4))}')
print(f'Validation precision: {str(round(rfc_cv_precision, 4))}')
print(f'Validation recall: {str(round(rfc_cv_recall, 4))}')


Random Forest Classifier
Test precision: 0.9996
Test recall: 0.9999
Validation precision: 0.9998
Validation recall: 0.9999


***For each classifier, I will also create confusion matrices. However, for the sake of file size, I will not output the plots in this notebook. I will only include the code. Please refer to the README for the Confusion Matrices.***

In [None]:
# Confusion Matrix:
LABELS = ['Normal', 'Fraud']
conf_matrix = confusion_matrix(y_test, rfc_pred)
plt.figure(figsize=(12,12))
sns.heatmap(conf_matrix, xticklabels = LABELS, yticklabels = LABELS, annot=True, fmt='d')
plt.title('Confusion Matrix')
plt.ylabel('True class')
plt.xlabel('Predicted class')
plt.show()
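
For readers without the rendered plots, the sketch below shows how the heatmap's four cells relate to precision, recall, and accuracy. The labels are hypothetical toy data, and `confusion_counts` is an illustrative helper that mirrors the row/column layout of sklearn's `confusion_matrix` (rows are true classes, columns are predicted classes); it is not part of this notebook's pipeline.

```python
def confusion_counts(y_true, y_pred):
    """2x2 counts laid out like the heatmap: rows are true classes, columns are predicted."""
    matrix = [[0, 0], [0, 0]]
    for actual, predicted in zip(y_true, y_pred):
        matrix[actual][predicted] += 1
    return matrix


# Hypothetical labels for illustration only (0 = Normal, 1 = Fraud)
y_true_demo = [0, 0, 0, 1, 1, 0, 1, 0]
y_pred_demo = [0, 0, 1, 1, 0, 0, 1, 0]

(tn, fp), (fn, tp) = confusion_counts(y_true_demo, y_pred_demo)

precision = tp / (tp + fp)  # of the predicted frauds, how many were fraud
recall = tp / (tp + fn)     # of the actual frauds, how many were caught
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(tn, fp, fn, tp)
print(round(precision, 4), round(recall, 4), round(accuracy, 4))
```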

Next will be **Decision Tree Classifier**:

In [12]:
#Decision Tree:
dtc = DecisionTreeClassifier(random_state = RAND_STATE)
dtc.fit(X_train, y_train)
dtc_pred = dtc.predict(X_test)
dtc_acc = accuracy_score(y_test, dtc_pred)  # (y_true, y_pred) order, matching the other metric calls

dtc_acc

0.9980567328491287

In [13]:
y_pred_val_dtc = dtc.predict(X_cv)
dtc_cv_acc = accuracy_score(y_cv, y_pred_val_dtc)

dtc_cv_acc

0.9983029386419991

In [14]:
dtc_test_precision = precision_score(y_test, dtc_pred)
dtc_test_recall = recall_score(y_test, dtc_pred)

dtc_cv_precision = precision_score(y_cv, y_pred_val_dtc)
dtc_cv_recall = recall_score(y_cv, y_pred_val_dtc)

print('Decision Tree Classifier')
print(f'Test precision: {str(round(dtc_test_precision, 4))}')
print(f'Test recall: {str(round(dtc_test_recall, 4))}')
print(f'Validation precision: {str(round(dtc_cv_precision, 4))}')
print(f'Validation recall: {str(round(dtc_cv_recall, 4))}')


Decision Tree Classifier
Test precision: 0.9974
Test recall: 0.9987
Validation precision: 0.9977
Validation recall: 0.9989


In [None]:
# Confusion Matrix
conf_matrix = confusion_matrix(y_test, dtc_pred)
plt.figure(figsize=(12,12))
sns.heatmap(conf_matrix, xticklabels = LABELS, yticklabels = LABELS, annot=True, fmt='d')
plt.title('Confusion Matrix')
plt.ylabel('True class')
plt.xlabel('Predicted class')
plt.show()

**Logistic Regression**:

In [15]:
#Logistic Regression:
lr = LogisticRegression(solver='lbfgs', max_iter= 1000, random_state = RAND_STATE)
lr.fit(X_train, y_train)
lr_pred = lr.predict(X_test)
lr_acc = accuracy_score(y_test, lr_pred)  # (y_true, y_pred) order, matching the other metric calls

lr_acc

0.971800643652287

In [16]:
y_pred_val_lr = lr.predict(X_cv)
lr_cv_acc = accuracy_score(y_cv, y_pred_val_lr)

lr_cv_acc

0.9720292633170955

In [17]:
lr_test_precision = precision_score(y_test, lr_pred)
lr_test_recall = recall_score(y_test, lr_pred)

lr_cv_precision = precision_score(y_cv, y_pred_val_lr)
lr_cv_recall = recall_score(y_cv, y_pred_val_lr)

print('Logistic Regression')
print(f'Test precision: {str(round(lr_test_precision, 4))}')
print(f'Test recall: {str(round(lr_test_recall, 4))}')
print(f'Validation precision: {str(round(lr_cv_precision, 4))}')
print(f'Validation recall: {str(round(lr_cv_recall, 4))}')

Logistic Regression
Test precision: 0.979
Test recall: 0.9642
Validation precision: 0.9799
Validation recall: 0.9637


In [None]:
conf_matrix = confusion_matrix(y_test, lr_pred)
plt.figure(figsize=(12,12))
sns.heatmap(conf_matrix, xticklabels = LABELS, yticklabels = LABELS, annot=True, fmt='d')
plt.title('Confusion Matrix')
plt.ylabel('True class')
plt.xlabel('Predicted class')
plt.show()

**K-Nearest Neighbors**:

In [20]:
#K-Nearest Neighbors
kn_class = KNeighborsClassifier()
kn_class.fit(X_train, y_train)
knc_pred = kn_class.predict(X_test)
kn_acc = accuracy_score(y_test, knc_pred)  # (y_true, y_pred) order, matching the other metric calls
kn_acc

0.9538891722209522

In [21]:
y_pred_val_kn = kn_class.predict(X_cv)
kn_cv_acc = accuracy_score(y_cv, y_pred_val_kn)

kn_cv_acc

0.954539858959253

In [22]:
kn_test_precision = precision_score(y_test, knc_pred)
kn_test_recall = recall_score(y_test, knc_pred)

kn_cv_precision = precision_score(y_cv, y_pred_val_kn)
kn_cv_recall = recall_score(y_cv, y_pred_val_kn)

print('K-Nearest Neighbor Classifier')
print(f'Test precision: {str(round(kn_test_precision, 4))}')
print(f'Test recall: {str(round(kn_test_recall, 4))}')
print(f'Validation precision: {str(round(kn_cv_precision, 4))}')
print(f'Validation recall: {str(round(kn_cv_recall, 4))}')

K-Nearest Neighbor Classifier
Test precision: 0.9389
Test recall: 0.9708
Validation precision: 0.9398
Validation recall: 0.9711


In [None]:
conf_matrix = confusion_matrix(y_test, knc_pred)
plt.figure(figsize=(12,12))
sns.heatmap(conf_matrix, xticklabels = LABELS, yticklabels = LABELS, annot=True, fmt='d')
plt.title('Confusion Matrix')
plt.ylabel('True class')
plt.xlabel('Predicted class')
plt.show()

A detailed overview of the results can be found in the Undersampling notebook (Part 3 of this Project). 