## SVM Classification on the Credit Card Fraud Detection dataset (Kaggle)

The dataset contains transactions made by credit cards in September 2013 by European cardholders.
This dataset presents transactions that occurred in two days, where we have 492 frauds out of 284,807 transactions. The dataset is highly unbalanced, the positive class (frauds) account for 0.172% of all transactions.

It contains only numerical input variables which are the result of a PCA transformation. Unfortunately, due to confidentiality issues, we cannot provide the original features and more background information about the data. Features V1, V2, … V28 are the principal components obtained with PCA, the only features which have not been transformed with PCA are 'Time' and 'Amount'. Feature 'Time' contains the seconds elapsed between each transaction and the first transaction in the dataset. The feature 'Amount' is the transaction Amount, this feature can be used for example-dependant cost-sensitive learning. Feature 'Class' is the response variable and it takes value 1 in case of fraud and 0 otherwise.

Given the class imbalance ratio, we recommend measuring the accuracy using the Area Under the Precision-Recall Curve (AUPRC). Confusion matrix accuracy is not meaningful for unbalanced classification.

Kaggle Link: https://www.kaggle.com/datasets/mlg-ulb/creditcardfraud

In [1]:
# Imports
import numpy as np 
import pandas as pd
import matplotlib.pyplot as plt 
from sklearn import svm
from sklearn.model_selection import train_test_split

In [2]:
# Load data
path = "data/credit-card-fraud/"
num_columns = 30

features = []
for i in range(num_columns):
    features.append(np.array(pd.read_csv(path + f"col_{i}.csv")))

features = np.concatenate(features)
features = features.reshape([-1, num_columns])
    
labels = np.array(pd.read_csv(path + f"col_{num_columns}.csv")).astype(float)
labels = labels.reshape([-1])

print("\nFeatures shape, type:", features.shape, type(features), type(features[0,0]))
print("Labels shape, type:", labels.shape, type(labels), type(labels[0]))

print("\nFeatures:\n", features, "\n")
print("Labels:\n", labels, "\n")

print("Fraction of frauds:", np.mean(labels))


Features shape, type: (284807, 30) <class 'numpy.ndarray'> <class 'numpy.float64'>
Labels shape, type: (284807,) <class 'numpy.ndarray'> <class 'numpy.float64'>

Features:
 [[  0.     0.     1.   ...  23.    23.    23.  ]
 [ 24.    25.    26.   ...  39.    40.    41.  ]
 [ 41.    41.    41.   ...  56.    56.    59.  ]
 ...
 [ 37.14  11.99 157.04 ...   8.99  52.34  10.  ]
 [  9.42 220.28   7.88 ...   7.    12.99   7.22]
 [  1.    80.    25.   ...  67.88  10.   217.  ]] 

Labels:
 [0. 0. 0. ... 0. 0. 0.] 

Fraction of frauds: 0.001727485630620034


In [3]:
# Separate normal and fraudulent features and labels
normal_features = features[labels==0, :]
normal_labels = labels[labels==0]
fraud_features = features[labels==1, :]
fraud_labels = labels[labels==1]

del features
del labels

print("\nNormal features shape:", normal_features.shape)
print("Normal labels shape:", normal_labels.shape)
print("Fraud features shape:", fraud_features.shape)
print("Fruad labels shape:", fraud_labels.shape)


Normal features shape: (284315, 30)
Normal labels shape: (284315,)
Fraud features shape: (492, 30)
Fruad labels shape: (492,)


## SVM approach

In [4]:
Xtrain_normal, Xtest_normal, Ytrain_normal, Ytest_normal = train_test_split(normal_features, normal_labels, test_size=0.2)
Xtrain_fraud, Xtest_fraud, Ytrain_fraud, Ytest_fraud = train_test_split(fraud_features, fraud_labels, test_size=0.2)

print("Split complete")

Xtrain = np.vstack((Xtrain_normal[:1000], Xtrain_fraud))
Xtest = np.vstack((Xtest_normal[:200], Xtest_fraud))
Ytrain = np.hstack((Ytrain_normal[:1000], Ytrain_fraud))
Ytest = np.hstack((Ytest_normal[:200], Ytest_fraud))

print("Training and test sets prepared")

Split complete
Training and test sets prepared


In [5]:
# clf = svm.SVC(kernel="linear")
# clf.fit(Xtrain, Ytrain)

# print("Training complete")

# Ptrain = clf.predict(Xtrain)
# Ptest = clf.predict(Xtest)

# train_accuracy = np.mean(np.equal(Ptrain, Ytrain).astype(float))
# test_accuracy = np.mean(np.equal(Ptest, Ytest).astype(float))

# print("Train accuracy:", train_accuracy)
# print("Test accuracy:", test_accuracy)

**For this dataset, accuracies of more than 90% are not good because of the extremely skewed nature.**

## Ensemble approach

In [8]:
# Store all SVMs in a list
classifiers = []
num_ensembles = 20
num_normal_points = 300

# Xtrain = np.vstack((Xtrain_normal, Xtrain_fraud))
# Xtest = np.vstack((Xtest_normal, Xtest_fraud))
# Ytrain = np.hstack((Ytrain_normal, Ytrain_fraud))
# Ytest = np.hstack((Ytest_normal, Ytest_fraud))

# Train ensembles
for i in range(num_ensembles):
    normal_indices = np.random.choice(a=Xtrain_normal.shape[0], size=num_normal_points, replace=False)
    
    # Prepare training data
    Xtrain = np.vstack((Xtrain_normal[normal_indices, :], Xtrain_fraud))
    Ytrain = np.hstack((Ytrain_normal[normal_indices], Ytrain_fraud))
    
    # Train SVM
    classifiers.append(svm.SVC())
    clf = classifiers[-1]
    clf.fit(Xtrain, Ytrain)
    
    # Test accuracy
    Ptest_normal = clf.predict(Xtest_normal)
    Ptest_fraud = clf.predict(Xtest_fraud)
    
    test_accuracy_normal = np.mean(np.equal(Ptest_normal, Ytest_normal).astype(float))
    test_accuracy_fraud = np.mean(np.equal(Ptest_fraud, Ytest_fraud).astype(float))
    
    print(f"SVM {i+1} trained, normal accuracy: {test_accuracy_normal}, fraud accuracy: {test_accuracy_fraud}")

SVM 1 trained, normal accuracy: 0.010797882630181313, fraud accuracy: 0.98989898989899
SVM 2 trained, normal accuracy: 0.013998557937498901, fraud accuracy: 0.98989898989899
SVM 3 trained, normal accuracy: 0.011149605191425004, fraud accuracy: 0.98989898989899
SVM 4 trained, normal accuracy: 0.010885813270492236, fraud accuracy: 0.98989898989899
SVM 5 trained, normal accuracy: 0.01109684680723845, fraud accuracy: 0.98989898989899
SVM 6 trained, normal accuracy: 0.010868227142430052, fraud accuracy: 1.0
SVM 7 trained, normal accuracy: 0.01202891159453423, fraud accuracy: 0.98989898989899
SVM 8 trained, normal accuracy: 0.01028788491637796, fraud accuracy: 1.0
SVM 9 trained, normal accuracy: 0.011817878057788016, fraud accuracy: 0.98989898989899
SVM 10 trained, normal accuracy: 0.011395810984295587, fraud accuracy: 0.98989898989899
SVM 11 trained, normal accuracy: 0.011237535831735927, fraud accuracy: 0.98989898989899
SVM 12 trained, normal accuracy: 0.011483741624606511, fraud accuracy:

In [9]:
# Test ensemble approach

Ptest_normal = np.zeros(Ytest_normal.shape)
Ptest_fraud = np.zeros(Ytest_fraud.shape)

for i in range(num_ensembles):
    Ptest_normal += classifiers[i].predict(Xtest_normal)
    Ptest_fraud += classifiers[i].predict(Xtest_fraud)

# Mean predictions over all ensembles
Ptest_normal = Ptest_normal / num_ensembles
Ptest_fraud = Ptest_fraud / num_ensembles

Ptest_normal = Ptest_normal.round()
Ptest_fraud = Ptest_fraud.round()

# Compute accuracy
test_accuracy_normal = np.mean(np.equal(Ptest_normal, Ytest_normal).astype(float))
test_accuracy_fraud = np.mean(np.equal(Ptest_fraud, Ytest_fraud).astype(float))

print("Test accuracy over all ensembles:", test_accuracy_normal, test_accuracy_fraud)

Test accuracy over all ensembles: 0.011835464185850202 0.98989898989899
