In [15]:
# import the necessary packages
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from matplotlib import gridspec

In [16]:
# load the dataset from the csv file using pandas
data = pd.read_csv("creditcard.csv")

In [17]:
# take a peek at the data
data.head()

Unnamed: 0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
0,0.0,-1.359807,-0.072781,2.536347,1.378155,-0.338321,0.462388,0.239599,0.098698,0.363787,...,-0.018307,0.277838,-0.110474,0.066928,0.128539,-0.189115,0.133558,-0.021053,149.62,0
1,0.0,1.191857,0.266151,0.16648,0.448154,0.060018,-0.082361,-0.078803,0.085102,-0.255425,...,-0.225775,-0.638672,0.101288,-0.339846,0.16717,0.125895,-0.008983,0.014724,2.69,0
2,1.0,-1.358354,-1.340163,1.773209,0.37978,-0.503198,1.800499,0.791461,0.247676,-1.514654,...,0.247998,0.771679,0.909412,-0.689281,-0.327642,-0.139097,-0.055353,-0.059752,378.66,0
3,1.0,-0.966272,-0.185226,1.792993,-0.863291,-0.010309,1.247203,0.237609,0.377436,-1.387024,...,-0.1083,0.005274,-0.190321,-1.175575,0.647376,-0.221929,0.062723,0.061458,123.5,0
4,2.0,-1.158233,0.877737,1.548718,0.403034,-0.407193,0.095921,0.592941,-0.270533,0.817739,...,-0.009431,0.798278,-0.137458,0.141267,-0.20601,0.502292,0.219422,0.215153,69.99,0


### Kaggle Credit Card Fraud Dataset

This dataset consists of transactions made by credit cards in September 2013 by European cardholders over a period of two days. The dataset is highly unbalanced with only 492 frauds out of 284,807 transactions, accounting for just 0.172% of all transactions.

#### Features

- **V1, V2, ..., V28**: These are the principal components obtained with Principal Component Analysis (PCA). Due to confidentiality issues, the original features from which these components were derived are not available. PCA transforms the original variables into a new set of variables that are linear combinations of the originals. The component directions are orthogonal, so the resulting components are uncorrelated with each other (which is weaker than being statistically independent).

- **Time**: This feature represents the seconds elapsed between each transaction and the first transaction in the dataset. It allows tracking of the time context of each transaction.

- **Amount**: This is the transaction amount for each record. This feature can be used for example-dependent cost-sensitive learning, meaning that the cost function can consider the transaction amount in fraud detection scenarios.

- **Class**: This is the target variable that we aim to predict. It takes the value 1 in case of fraud and 0 otherwise.
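
To make the PCA description above concrete, here is a minimal sketch using only NumPy. The raw features are synthetic stand-ins (the real ones are confidential), and the variable names are made up for illustration. It shows that the projected components end up uncorrelated even when the inputs are strongly correlated.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for the confidential raw features:
# the first two columns are strongly correlated, the third is pure noise.
base = rng.normal(size=(1000, 1))
raw = np.hstack([
    base + 0.1 * rng.normal(size=(1000, 1)),
    2 * base + 0.1 * rng.normal(size=(1000, 1)),
    rng.normal(size=(1000, 1)),
])

# PCA via SVD of the centered data: the rows of vt are orthonormal directions.
centered = raw - raw.mean(axis=0)
_, _, vt = np.linalg.svd(centered, full_matrices=False)
components = centered @ vt.T  # each column is a linear combination of the raw columns

# The covariance matrix of the components should be (numerically) diagonal.
cov = np.cov(components, rowvar=False)
off_diag = cov - np.diag(np.diag(cov))
print("largest off-diagonal covariance:", np.abs(off_diag).max())
```

The off-diagonal covariances are zero up to floating-point error, which is what "uncorrelated" means here.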

In [18]:
# Determine the number of fraud cases in the dataset
fraud = data[data['Class'] == 1]
valid = data[data['Class'] == 0]
# Ratio of fraud to valid transactions (not the fraud share of all transactions)
outlierFraction = len(fraud) / float(len(valid))
print(outlierFraction)
print('Fraud Cases: {}'.format(len(fraud)))
print('Valid Transactions: {}'.format(len(valid)))

0.0017304750013189597
Fraud Cases: 492
Valid Transactions: 284315


### Dealing with Imbalanced Data in Fraud Detection

Fraudulent transactions account for only 0.17% of total transactions, leading to a highly imbalanced dataset. In machine learning, an *imbalanced dataset* refers to a situation where the classes are not represented equally. 

For instance, consider a binary classification problem where 95% of the instances belong to Class A and only 5% of the instances belong to Class B. This dataset would be considered imbalanced.

This imbalance can introduce a significant bias in our model training. Most machine learning algorithms are designed to maximize overall accuracy, which can be misleading when the classes are imbalanced. 

For example, in our case, a model could achieve a 99.83% accuracy rate by predicting every transaction to be non-fraudulent. However, such a model would be absolutely useless at detecting fraudulent transactions, which is our main objective. 
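
The 99.83% figure follows directly from the class counts printed earlier. A quick check in plain Python (the variable names are just for illustration):

```python
# Class counts taken from the output above.
n_fraud = 492
n_valid = 284315
n_total = n_fraud + n_valid

# A "model" that always predicts non-fraud gets every valid transaction right
# and every fraudulent one wrong.
baseline_accuracy = n_valid / n_total
baseline_recall = 0 / n_fraud  # it never flags a single fraud case

print(f"Baseline accuracy: {baseline_accuracy:.2%}")
print(f"Baseline recall:   {baseline_recall:.0%}")
```

So high accuracy and zero recall can coexist, which is why accuracy alone is a poor guide on this dataset.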

Therefore, we need to implement strategies to correct for the imbalance in our dataset.
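
One common strategy is random undersampling of the majority class. The sketch below is only an illustration under simple assumptions (binary labels, a hypothetical helper named `undersample_indices`) and is not necessarily the balancing method used later in this notebook. It reuses the 95% / 5% example from above.

```python
import random
from collections import Counter


def undersample_indices(labels, seed=0):
    """Keep every minority-class row plus an equal-size random sample of the other rows.

    Assumes a binary problem; `labels` is any sequence of class labels.
    """
    counts = Counter(labels)
    minority = min(counts, key=counts.get)
    minority_idx = [i for i, y in enumerate(labels) if y == minority]
    majority_idx = [i for i, y in enumerate(labels) if y != minority]

    rng = random.Random(seed)  # seeded so the sample is reproducible
    sampled_majority = rng.sample(majority_idx, k=len(minority_idx))
    return sorted(minority_idx + sampled_majority)


# Toy labels matching the 95% / 5% example: 95 of Class A (0), 5 of Class B (1).
demo_labels = [0] * 95 + [1] * 5
balanced_idx = undersample_indices(demo_labels)
balanced_counts = Counter(demo_labels[i] for i in balanced_idx)
print(balanced_counts)
```

The trade-off is that undersampling throws away most of the majority-class rows, so alternatives such as oversampling or class weights are worth considering too.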

For now, let's see how the model performs without balancing, and then compare the results with a balanced set later.

In [19]:
# dividing the X and the Y from the dataset
X = data.drop(['Class'], axis = 1)
Y = data["Class"]

print(X.shape)
print(Y.shape)

# get just the values for processing
# (a NumPy array without column labels)
xData = X.values
yData = Y.values

(284807, 30)
(284807,)


In [21]:
# Split the data into training and testing sets with scikit-learn
from sklearn.model_selection import train_test_split
xTrain, xTest, yTrain, yTest = train_test_split(
        xData, yData, test_size = 0.2, random_state = 42)
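
With so few fraud cases, a purely random split can leave the training and test sets with somewhat different fraud rates. The split above is not stratified. As a sketch of the alternative, `train_test_split` accepts a `stratify` argument that preserves the class ratio in both sets. The data below is synthetic and the `*_demo` names are made up for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic, highly imbalanced demo data: 50 positives out of 10,000 rows.
X_demo = rng.normal(size=(10_000, 3))
y_demo = np.zeros(10_000, dtype=int)
y_demo[:50] = 1

# stratify=y_demo keeps the positive rate (almost) identical in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

print("positive rate in train:", y_tr.mean())
print("positive rate in test: ", y_te.mean())
```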

### Choosing an ML Model for our use case: Random Forest Classification
Random Forest is a common choice for credit card fraud detection. It is an ensemble learning method that builds many decision trees and, for classification, outputs the majority vote of the individual trees.

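Compared with a single decision tree, the model is less prone to overfitting because each tree sees a different bootstrap sample and random subsets of features, and combining many diverse trees reduces variance. The algorithm also handles large datasets with many variables, which suits complex data such as credit card transactions. Some implementations can manage missing values as well, though this depends on the library and version.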

Its ability to estimate the importance of features can be beneficial in identifying significant indicators of fraud. Random forests are also often a reasonable baseline for imbalanced problems such as fraud detection, where legitimate transactions far outnumber fraudulent ones, although without class weighting or resampling they can still favor the majority class.
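
As a small illustration of the two points above (feature importances and class imbalance), here is a sketch on synthetic data. The `class_weight="balanced"` option and the `feature_importances_` attribute are part of scikit-learn's `RandomForestClassifier`. The data, the `*_demo` names, and the chosen settings are assumptions for the example, not the configuration used in this notebook.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Synthetic demo data: only feature 0 determines the (rare) positive class.
X_demo = rng.normal(size=(2000, 5))
y_demo = (X_demo[:, 0] > 1.5).astype(int)

# class_weight="balanced" reweights classes inversely to their frequency.
forest = RandomForestClassifier(
    n_estimators=50, class_weight="balanced", random_state=0
)
forest.fit(X_demo, y_demo)

# Impurity-based importances; they sum to 1 across features.
for i, importance in enumerate(forest.feature_importances_):
    print(f"feature {i}: {importance:.3f}")
```

Feature 0 should dominate the ranking, since it is the only column related to the label.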


In [24]:
# Build the Random Forest classifier
from sklearn.ensemble import RandomForestClassifier
# create and fit the model (no random_state is set, so scores can vary slightly between runs)
rfc = RandomForestClassifier()
rfc.fit(xTrain, yTrain)
# predict on the held-out test set
yPred = rfc.predict(xTest)

In [26]:
# Evaluate the classifier by printing each score
from sklearn.metrics import classification_report, accuracy_score
from sklearn.metrics import precision_score, recall_score
from sklearn.metrics import f1_score, matthews_corrcoef
from sklearn.metrics import confusion_matrix

n_outliers = len(fraud)            # fraud cases in the full dataset
n_errors = (yPred != yTest).sum()  # misclassified samples in the test set
print("The model used is Random Forest classifier")
  
acc = accuracy_score(yTest, yPred)
print("The accuracy is {}".format(acc))
  
prec = precision_score(yTest, yPred)
print("The precision is {}".format(prec))
  
rec = recall_score(yTest, yPred)
print("The recall is {}".format(rec))
  
f1 = f1_score(yTest, yPred)
print("The F1-Score is {}".format(f1))
  
MCC = matthews_corrcoef(yTest, yPred)
print("The Matthews correlation coefficient is {}".format(MCC))

The model used is Random Forest classifier
The accuracy is 0.9995786664794073
The precision is 0.9743589743589743
The recall is 0.7755102040816326
The F1-Score is 0.8636363636363635
The Matthews correlation coefficient is 0.8690748763736589


## Results

Because the dataset is so heavily imbalanced, accuracy tells us very little, so we focus on precision and recall.

The model caught 77.55% of the fraudulent transactions in the test set (recall), so there is room for improvement. Of the transactions it labeled fraudulent, however, 97.44% really were fraud (precision).
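
To see where these numbers come from, the scores can be recomputed from confusion-matrix counts. The counts below are not printed anywhere in the notebook. They are inferred from the reported precision, recall, and accuracy, together with a test set of 56,962 rows (20% of 284,807), so treat them as an assumption.

```python
from math import sqrt

# Inferred confusion-matrix counts (assumption, see above).
tp, fp, fn, tn = 76, 2, 22, 56862

precision = tp / (tp + fp)            # of flagged transactions, how many were fraud
recall = tp / (tp + fn)               # of fraud cases, how many were flagged
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + fp + fn + tn)
mcc = (tp * tn - fp * fn) / sqrt(
    (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
)

print(f"precision={precision:.4f} recall={recall:.4f} f1={f1:.4f}")
print(f"accuracy={accuracy:.6f} mcc={mcc:.4f}")
```

Read this way, the model missed 22 of the 98 fraud cases in the test set while raising only 2 false alarms.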

I think we can improve this by cleaning up our dataset.