# Credit Card Fraud Detection using SVM


The dataset is obtained from https://www.kaggle.com/mlg-ulb/creditcardfraud

The dataset comprises Time, Amount, Class and 28 features (V1 to V28) obtained from PCA. Support vector machines (SVM) are used for classification. The features V1 to V28 are the inputs and Class is the target. Four kernel functions are tried out: linear, RBF, sigmoid and polynomial.
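As a rough illustration of what the four kernels compute for a pair of feature vectors, here is a minimal NumPy sketch. The formulas follow the forms documented for scikit-learn's `SVC`. The helper names, the example vectors and the parameter values (`gamma`, `coef0`, `degree`) are assumptions chosen only for this demonstration and are not taken from the notebook.

```python
import numpy as np

def linear_kernel(x, z):
    # Plain inner product <x, z>
    return float(np.dot(x, z))

def poly_kernel(x, z, gamma=1.0, coef0=0.0, degree=3):
    # (gamma * <x, z> + coef0) ** degree
    return float((gamma * np.dot(x, z) + coef0) ** degree)

def rbf_kernel(x, z, gamma=1.0):
    # exp(-gamma * ||x - z||^2)
    diff = x - z
    return float(np.exp(-gamma * np.dot(diff, diff)))

def sigmoid_kernel(x, z, gamma=1.0, coef0=0.0):
    # tanh(gamma * <x, z> + coef0)
    return float(np.tanh(gamma * np.dot(x, z) + coef0))

# Two arbitrary 2-D vectors, used only to exercise the functions
x = np.array([1.0, 2.0])
z = np.array([0.5, -1.0])

for name, fn in [("linear", linear_kernel), ("poly", poly_kernel),
                 ("rbf", rbf_kernel), ("sigmoid", sigmoid_kernel)]:
    print(name, fn(x, z))
```

The linear kernel is the only one without a shape parameter, which is why `gamma` has no effect on it.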


Only 0.172% of the transactions are fraud cases, so the dataset is highly unbalanced. The dataset is split into 70% training data and 30% testing data. Because of the imbalance, we train the model on a 50-50 mix of fraud and non-fraud cases instead of on data heavily biased towards non-fraud cases. This helps the model learn more about the fraud cases. We test the models in the same way, on 50% fraud cases and 50% non-fraud cases drawn from the test data. This balanced test set is used in preference to the original test data, where non-fraud cases vastly outnumber fraud cases.

## Steps:
1. Reading and visualising the data
2. Preparing the data for modeling
3. Build and train the model on new_Xtrain and new_ytrain and evaluate the model on accuracy and f1 score for different kernels using new_Xtest and new_ytest

### Reading and visualising the data

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [2]:
from sklearn import svm
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score

In [3]:
#Load dataset
dt=pd.read_csv(r'C:\Users\muska\Documents\svm\credit_card.csv')
print('Shape of dataset:',dt.shape)
dt.head(5)

Shape of dataset: (284807, 31)


Unnamed: 0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
0,0.0,-1.359807,-0.072781,2.536347,1.378155,-0.338321,0.462388,0.239599,0.098698,0.363787,...,-0.018307,0.277838,-0.110474,0.066928,0.128539,-0.189115,0.133558,-0.021053,149.62,0
1,0.0,1.191857,0.266151,0.16648,0.448154,0.060018,-0.082361,-0.078803,0.085102,-0.255425,...,-0.225775,-0.638672,0.101288,-0.339846,0.16717,0.125895,-0.008983,0.014724,2.69,0
2,1.0,-1.358354,-1.340163,1.773209,0.37978,-0.503198,1.800499,0.791461,0.247676,-1.514654,...,0.247998,0.771679,0.909412,-0.689281,-0.327642,-0.139097,-0.055353,-0.059752,378.66,0
3,1.0,-0.966272,-0.185226,1.792993,-0.863291,-0.010309,1.247203,0.237609,0.377436,-1.387024,...,-0.1083,0.005274,-0.190321,-1.175575,0.647376,-0.221929,0.062723,0.061458,123.5,0
4,2.0,-1.158233,0.877737,1.548718,0.403034,-0.407193,0.095921,0.592941,-0.270533,0.817739,...,-0.009431,0.798278,-0.137458,0.141267,-0.20601,0.502292,0.219422,0.215153,69.99,0


In [4]:
print('Datatypes of dataset\n')
print(dt.dtypes)

Datatypes of dataset

Time      float64
V1        float64
V2        float64
V3        float64
V4        float64
V5        float64
V6        float64
V7        float64
V8        float64
V9        float64
V10       float64
V11       float64
V12       float64
V13       float64
V14       float64
V15       float64
V16       float64
V17       float64
V18       float64
V19       float64
V20       float64
V21       float64
V22       float64
V23       float64
V24       float64
V25       float64
V26       float64
V27       float64
V28       float64
Amount    float64
Class       int64
dtype: object


All the features from V1 to V28 have the same datatype, float64.

In [5]:
dt['Class'].value_counts()

0    284315
1       492
Name: Class, dtype: int64

The dataset has 284807 entries in total, of which only 492 are fraud cases, which makes it very unbalanced.
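To put the imbalance in numbers, the short calculation below uses the two class counts printed above. It also works out the accuracy a trivial rule would reach by labelling every transaction as non-fraud. That rule is a hypothetical baseline introduced here for comparison; it is not a model from this notebook.

```python
# Class counts taken from the value_counts() output above
n_legit = 284315
n_fraud = 492
n_total = n_legit + n_fraud

# Share of fraud cases in the whole dataset, in percent
fraud_pct = 100 * n_fraud / n_total

# Accuracy of a hypothetical rule that always predicts class 0 (non-fraud)
baseline_accuracy = 100 * n_legit / n_total

print(f"Total entries: {n_total}")
print(f"Fraud share: {fraud_pct:.4f}%")
print(f"Accuracy of always predicting non-fraud: {baseline_accuracy:.2f}%")
```

Because this baseline catches no fraud at all yet still scores very high, accuracy on the raw class ratio says little about fraud detection.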

### Preparing the data for modeling

In [6]:
features=dt.loc[:,"V1":"V28"]#extracting features
X=np.asarray(features)
y=np.asarray(dt['Class'])

In [7]:
#split the dataset into 30% test and 70% train
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test=train_test_split(X,y, test_size=0.3)

In [8]:
print('Shape of X_train:',X_train.shape)
print('Shape of y_train:',y_train.shape)
print('Shape of X_test:',X_test.shape)
print('Shape of y_test:',y_test.shape)

Shape of X_train: (199364, 28)
Shape of y_train: (199364,)
Shape of X_test: (85443, 28)
Shape of y_test: (85443,)
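The call in In [7] fixes neither a random seed nor a class ratio, so the number of fraud cases that end up in each split changes from run to run. One possible variation, shown here on synthetic stand-in data rather than the real dataset, is to pass `stratify` and `random_state` to `train_test_split`. The array names, the sizes and the seed below are assumptions made only for this sketch.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 10,000 rows, 28 features, 20 positives (0.2%)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(10_000, 28))
y_demo = np.zeros(10_000, dtype=int)
y_demo[:20] = 1

# stratify keeps the class ratio (almost) identical in both splits;
# random_state makes the split reproducible
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, stratify=y_demo, random_state=42
)

print("Positives in train:", int(y_tr.sum()))
print("Positives in test:", int(y_te.sum()))
```

With stratification the positives are divided in roughly the same 70/30 proportion as the rows, so neither split is left with too few fraud cases by chance.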


As the dataset is highly unbalanced, we undersample the training data: we keep every fraud case found in y_train and pick the same number of non-fraud cases at random. This gives a 50-50 mix of fraud and non-fraud cases for training instead of data heavily biased towards non-fraud cases.

In [9]:
fraud_idx = np.where(y_train == 1)[0]  # indices of y_train where there are fraud cases
y_train1 = y_train[fraud_idx]
len_y_train1 = y_train1.size  # number of fraud cases in y_train, so that we can pick the same number of non fraud cases later
X_train1 = X_train[fraud_idx]  # entries of X_train for fraud cases
# number of rows for y_train1 and X_train1 should be same
print('Shape of y_train1:', y_train1.shape)
print('Shape of X_train1:', X_train1.shape)

Shape of y_train1: (364,)
Shape of X_train1: (364, 28)


In [10]:
p = np.where(y_train == 0)[0]  # indices of y_train where there are non fraud cases
# sample without replacement so that no non fraud row is picked twice
t2 = np.random.choice(p, size=len_y_train1, replace=False)
y_train2 = y_train[t2]
X_train2 = X_train[t2]
# number of rows for y_train2 and X_train2 should be same
print('Shape of y_train2:', y_train2.shape)
print('Shape of X_train2:', X_train2.shape)

Shape of y_train2: (364,)
Shape of X_train2: (364, 28)


In [11]:
new_ytrain=np.append(y_train1,y_train2)#concatenate y_train1 and y_train2
new_Xtrain=np.concatenate((X_train1,X_train2),axis=0)#concatenate X_train1 and X_train2
new_Xtrain.shape
print('Shape of new_ytrain:',new_ytrain.shape)
print('Shape of new_Xtrain:',new_Xtrain.shape)

Shape of new_ytrain: (728,)
Shape of new_Xtrain: (728, 28)


new_ytrain and new_Xtrain are the new training sets we will use for training our models.
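The undersampling steps in In [9] to In [11] can also be wrapped in one reusable function. The sketch below is one possible way to do that with NumPy only. The function name `undersample_balanced`, the synthetic data and the seed are assumptions made for illustration, and it samples without replacement, which is a choice made here.

```python
import numpy as np

def undersample_balanced(X, y, rng):
    """Keep every row with label 1 and an equal number of random rows with label 0."""
    pos_idx = np.flatnonzero(y == 1)
    neg_idx = np.flatnonzero(y == 0)
    # Draw as many majority rows as there are minority rows, without repeats
    neg_sample = rng.choice(neg_idx, size=pos_idx.size, replace=False)
    keep = np.concatenate([pos_idx, neg_sample])
    return X[keep], y[keep]

# Synthetic stand-in data: 1,000 rows, 28 features, 15 positives
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(1000, 28))
y_demo = np.zeros(1000, dtype=int)
y_demo[rng.choice(1000, size=15, replace=False)] = 1

X_bal, y_bal = undersample_balanced(X_demo, y_demo, rng)
print("Balanced shape:", X_bal.shape)
print("Class counts:", np.bincount(y_bal))
```

The same function could be applied to the test split, which is what the following cells do by hand.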

Similarly, we change the testing sets, as they are also unbalanced. Thus, we have a 50-50 mix of fraud and non-fraud cases for testing the models instead of data heavily biased towards non-fraud cases.

In [12]:
fraud_idx_test = np.where(y_test == 1)[0]  # indices of y_test where there are fraud cases
y_test1 = y_test[fraud_idx_test]
len_y_test1 = y_test1.size  # number of fraud cases in y_test, so that we can pick the same number of non fraud cases later
X_test1 = X_test[fraud_idx_test]  # entries of X_test for fraud cases
# number of rows for y_test1 and X_test1 should be same
print('Shape of y_test1:', y_test1.shape)
print('Shape of X_test1:', X_test1.shape)

Shape of y_test1: (128,)
Shape of X_test1: (128, 28)


In [13]:
k = np.where(y_test == 0)[0]  # indices of y_test where there are non fraud cases
# sample without replacement so that no non fraud row is picked twice
t4 = np.random.choice(k, size=len_y_test1, replace=False)
y_test2 = y_test[t4]
X_test2 = X_test[t4]
# number of rows for y_test2 and X_test2 should be same
print('Shape of y_test2:', y_test2.shape)
print('Shape of X_test2:', X_test2.shape)

Shape of y_test2: (128,)
Shape of X_test2: (128, 28)


In [14]:
new_ytest=np.append(y_test1,y_test2)#concatenate y_test1 and y_test2
new_Xtest=np.concatenate((X_test1,X_test2),axis=0)#concatenate X_test1 and X_test2
print('Shape of new_ytest:',new_ytest.shape)
print('Shape of new_Xtest:',new_Xtest.shape)

Shape of new_ytest: (256,)
Shape of new_Xtest: (256, 28)


### Build and train the model on new_Xtrain and new_ytrain and evaluate the model on accuracy and f1 score for different kernels using new_Xtest and new_ytest

In [15]:
#Linear Kernel
linear=svm.SVC(kernel='linear',gamma='auto',C=2)
linear.fit(new_Xtrain,new_ytrain)
linear_predict = linear.predict(new_Xtest)
linear_accuracy = accuracy_score(new_ytest, linear_predict)
linear_f1 = f1_score(new_ytest, linear_predict, average='weighted')
print('Accuracy of Linear Kernel: ', "%.2f" % (linear_accuracy*100))
print('F1 Score of Linear Kernel: ', "%.2f" % (linear_f1*100))
cm1 = confusion_matrix(new_ytest, linear_predict)
print('Confusion matrix:')
print(cm1)

Accuracy of Linear Kernel:  94.92
F1 Score of Linear Kernel:  94.91
Confusion matrix:
[[127   1]
 [ 12 116]]
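The accuracy and weighted F1 score reported for the linear kernel can be re-derived by hand from the confusion matrix printed above. The sketch assumes scikit-learn's layout, where rows are the true classes and columns the predicted classes, so the matrix reads `[[TN, FP], [FN, TP]]`. The variable names are chosen here for illustration.

```python
import numpy as np

# Confusion matrix printed for the linear kernel: [[TN, FP], [FN, TP]]
cm = np.array([[127, 1],
               [12, 116]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()

# Per-class F1 scores, written as 2*TP / (2*TP + FP + FN) for each class
f1_fraud = 2 * tp / (2 * tp + fp + fn)
f1_legit = 2 * tn / (2 * tn + fn + fp)

# Fraud-class precision and recall, which the weighted F1 averages away
precision_fraud = tp / (tp + fp)
recall_fraud = tp / (tp + fn)

# Weighted F1: each class weighted by its number of true samples (128 each here)
support = cm.sum(axis=1)
f1_weighted = (support[0] * f1_legit + support[1] * f1_fraud) / support.sum()

print(f"Accuracy: {accuracy * 100:.2f}")
print(f"Weighted F1: {f1_weighted * 100:.2f}")
print(f"Fraud precision: {precision_fraud * 100:.2f}")
print(f"Fraud recall: {recall_fraud * 100:.2f}")
```

Looking at the fraud-class recall separately shows how many of the 128 fraud cases in the balanced test set were missed, which the single weighted score does not reveal.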


In [16]:
#RBF Kernel
rbf = svm.SVC(kernel = 'rbf', random_state = 0)
rbf.fit(new_Xtrain,new_ytrain)
rbf_predict = rbf.predict(new_Xtest)
rbf_accuracy = accuracy_score(new_ytest, rbf_predict)
rbf_f1 = f1_score(new_ytest, rbf_predict, average='weighted')
print('Accuracy of RBF Kernel: ', "%.2f" % (rbf_accuracy*100))
print('F1 Score of RBF Kernel: ', "%.2f" % (rbf_f1*100))
cm2 = confusion_matrix(new_ytest, rbf_predict)
print('Confusion matrix:')
print(cm2)

Accuracy of RBF Kernel:  92.97
F1 Score of RBF Kernel:  92.93
Confusion matrix:
[[128   0]
 [ 18 110]]


In [17]:
#Sigmoid Kernel
sigmoid = svm.SVC(kernel='sigmoid', C=1)
sigmoid.fit(new_Xtrain,new_ytrain)
sigmoid_predict = sigmoid.predict(new_Xtest)
sigmoid_accuracy = accuracy_score(new_ytest, sigmoid_predict)
sigmoid_f1 = f1_score(new_ytest, sigmoid_predict, average='weighted')
print('Accuracy of Sigmoid Kernel: ', "%.2f" % (sigmoid_accuracy*100))
print('F1 Score of Sigmoid Kernel: ', "%.2f" % (sigmoid_f1*100))
cm3 = confusion_matrix(new_ytest, sigmoid_predict)
print('Confusion matrix:')
print(cm3)

Accuracy of Sigmoid Kernel:  90.23
F1 Score of Sigmoid Kernel:  90.22
Confusion matrix:
[[120   8]
 [ 17 111]]


In [18]:
#Polynomial Kernel
poly = svm.SVC(kernel='poly', degree=3, C=1)
poly.fit(new_Xtrain,new_ytrain)
poly_predict = poly.predict(new_Xtest)
poly_accuracy = accuracy_score(new_ytest, poly_predict)
poly_f1 = f1_score(new_ytest, poly_predict, average='weighted')
print('Accuracy of Polynomial Kernel: ', "%.2f" % (poly_accuracy*100))
print('F1 Score of Polynomial Kernel: ', "%.2f" % (poly_f1*100))
cm4 = confusion_matrix(new_ytest, poly_predict)
print('Confusion matrix:')
print(cm4)

Accuracy of Polynomial Kernel:  87.50
F1 Score of Polynomial Kernel:  87.30
Confusion matrix:
[[128   0]
 [ 32  96]]
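The four cells above repeat the same fit, predict and score pattern, so they could be condensed into a single loop. The sketch below shows that pattern on synthetic data from `make_classification`, so its scores will not match the ones reported above. The data, the seeds and the `results` dictionary are assumptions for the demonstration, while the kernel settings mirror the ones used in the notebook.

```python
from sklearn import svm
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split

# Synthetic, roughly balanced stand-in for new_Xtrain / new_Xtest
X_demo, y_demo = make_classification(
    n_samples=600, n_features=28, n_informative=10, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0
)

# Same kernel settings as in the cells above
models = {
    "linear": svm.SVC(kernel="linear", C=2),
    "rbf": svm.SVC(kernel="rbf", random_state=0),
    "sigmoid": svm.SVC(kernel="sigmoid", C=1),
    "poly": svm.SVC(kernel="poly", degree=3, C=1),
}

results = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    pred = model.predict(X_te)
    results[name] = (
        accuracy_score(y_te, pred),
        f1_score(y_te, pred, average="weighted"),
    )
    print(f"{name}: accuracy={results[name][0] * 100:.2f}, "
          f"f1={results[name][1] * 100:.2f}")
```

Collecting the scores in one dictionary makes it straightforward to sort the kernels or tabulate them side by side.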


## Conclusion

With a 50-50 mix of fraud and non-fraud cases in both the training set (new_Xtrain, new_ytrain) and the test set (new_Xtest, new_ytest), the linear kernel performs best, with an F1 score of 94.91% and an accuracy of 94.92%. The second best kernel is the RBF kernel, with an accuracy of 92.97%.
If we instead train and test on the original split (X_train, y_train and X_test, y_test), the model is biased towards the non-fraud cases, which dominate the data, and the accuracy comes out at around 99%. Thus, using a 50-50 mix of fraud and non-fraud cases for both training and testing allows the model to learn more about the fraudulent cases.