# Credit Card Fraud Detection using Machine Learning

# About the Data:
This dataset contains credit card transactions in 31 columns: 30 features and a class label. The features describe the transaction, and the class label indicates whether it was fraudulent (class 1) or not (class 0).

The first feature is "Time", the number of seconds elapsed between each transaction and the first transaction in the dataset. The next 28 features, V1 to V28, are anonymized variables produced by a principal component analysis (PCA) transformation of the original features. Because the original features are not disclosed, the individual components have no direct interpretation.

The second-to-last column is "Amount", the transaction amount in USD. The last column is the "Class" label, which indicates whether the transaction is fraudulent (class 1) or not (class 0).

This dataset is used to train machine learning models that detect fraudulent transactions. A model learns patterns from the features of labeled transactions and then applies those patterns to flag fraud in new transactions.

In [37]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

In [39]:
credit_card_data = pd.read_csv('dataset/test-2.csv')
credit_card_data.head(5)

Unnamed: 0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
0,0,-1.359807,-0.072781,2.536347,1.378155,-0.338321,0.462388,0.239599,0.098698,0.363787,...,-0.018307,0.277838,-0.110474,0.066928,0.128539,-0.189115,0.133558,-0.021053,149.62,0
1,0,1.191857,0.266151,0.16648,0.448154,0.060018,-0.082361,-0.078803,0.085102,-0.255425,...,-0.225775,-0.638672,0.101288,-0.339846,0.16717,0.125895,-0.008983,0.014724,2.69,0
2,1,-1.358354,-1.340163,1.773209,0.37978,-0.503198,1.800499,0.791461,0.247676,-1.514654,...,0.247998,0.771679,0.909412,-0.689281,-0.327642,-0.139097,-0.055353,-0.059752,378.66,0
3,1,-0.966272,-0.185226,1.792993,-0.863291,-0.010309,1.247203,0.237609,0.377436,-1.387024,...,-0.1083,0.005274,-0.190321,-1.175575,0.647376,-0.221929,0.062723,0.061458,123.5,0
4,2,-1.158233,0.877737,1.548718,0.403034,-0.407193,0.095921,0.592941,-0.270533,0.817739,...,-0.009431,0.798278,-0.137458,0.141267,-0.20601,0.502292,0.219422,0.215153,69.99,0


In [40]:
credit_card_data.sample()

Unnamed: 0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
6607,8088,-2.098047,-0.099137,3.300938,2.138612,1.516937,2.207892,-0.508034,1.026307,-0.205847,...,0.42964,0.951894,0.184371,-1.136725,0.079279,-0.068108,0.004177,0.071401,102.0,0


In [41]:
# dataset information
credit_card_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7500 entries, 0 to 7499
Data columns (total 31 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   Time    7500 non-null   int64  
 1   V1      7500 non-null   float64
 2   V2      7500 non-null   float64
 3   V3      7500 non-null   float64
 4   V4      7500 non-null   float64
 5   V5      7500 non-null   float64
 6   V6      7500 non-null   float64
 7   V7      7500 non-null   float64
 8   V8      7500 non-null   float64
 9   V9      7500 non-null   float64
 10  V10     7500 non-null   float64
 11  V11     7500 non-null   float64
 12  V12     7500 non-null   float64
 13  V13     7500 non-null   float64
 14  V14     7500 non-null   float64
 15  V15     7500 non-null   float64
 16  V16     7500 non-null   float64
 17  V17     7500 non-null   float64
 18  V18     7500 non-null   float64
 19  V19     7500 non-null   float64
 20  V20     7500 non-null   float64
 21  V21     7500 non-null   float64
 22  

In [42]:
# checking the number of missing values in each column
credit_card_data.isnull().sum()

Time      0
V1        0
V2        0
V3        0
V4        0
V5        0
V6        0
V7        0
V8        0
V9        0
V10       0
V11       0
V12       0
V13       0
V14       0
V15       0
V16       0
V17       0
V18       0
V19       0
V20       0
V21       0
V22       0
V23       0
V24       0
V25       0
V26       0
V27       0
V28       0
Amount    0
Class     0
dtype: int64

In [43]:
# distribution of legit transactions & fraudulent transactions
credit_card_data['Class'].value_counts()

Class
0    7475
1      25
Name: count, dtype: int64

This dataset is highly unbalanced: only 25 of the 7,500 transactions (about 0.33%) are fraudulent.

0 --> Normal Transaction

1 --> fraudulent transaction
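
To put a number on the imbalance, the class counts can be turned into shares and a ratio. The snippet below is a minimal sketch: `labels` is a hypothetical stand-in for `credit_card_data['Class']`, rebuilt from the counts printed above so that it runs on its own.

```python
import pandas as pd

# Hypothetical stand-in for credit_card_data['Class'], built from the counts shown above.
labels = pd.Series([0] * 7475 + [1] * 25, name="Class")

counts = labels.value_counts()                # absolute number of rows per class
shares = labels.value_counts(normalize=True)  # fraction of rows per class

fraud_share = shares[1]                       # share of rows labeled 1
imbalance_ratio = counts[0] / counts[1]       # legit rows per fraudulent row

print(f"fraud share: {fraud_share:.4%}")
print(f"legit-to-fraud ratio: {imbalance_ratio:.0f}:1")
```

On the real dataframe, the same two `value_counts` calls can be applied directly to the `Class` column.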

In the cell below, the first line creates a dataframe called "legit" containing only the rows of "credit_card_data" where "Class" equals 0, that is, the legitimate transactions.

The second line creates a dataframe called "fraud" containing only the rows where "Class" equals 1, that is, the fraudulent transactions.

Separating the data this way makes it easier to analyze and compare legitimate and fraudulent transactions. Features whose values differ clearly between the two groups are candidates for distinguishing fraud in a model.

In [44]:
# boolean masks on the Class column split the data by label
legit = credit_card_data[credit_card_data['Class'] == 0]
fraud = credit_card_data[credit_card_data['Class'] == 1]

In [45]:
fraud['Class']

541     1
623     1
4920    1
6108    1
6329    1
6331    1
6334    1
6336    1
6338    1
6427    1
6446    1
6472    1
6529    1
6609    1
6641    1
6717    1
6719    1
6734    1
6774    1
6820    1
6870    1
6882    1
6899    1
6903    1
6971    1
Name: Class, dtype: int64

In [46]:
# statistical measures of the data
legit.Amount.describe()

count    7475.000000
mean       65.233196
std       189.863375
min         0.000000
25%         4.490000
50%        15.980000
75%        55.810000
max      7712.430000
Name: Amount, dtype: float64

In [47]:
fraud.Amount.describe()

count      25.000000
mean      106.308400
std       372.676883
min         0.000000
25%         1.000000
50%         1.000000
75%         1.000000
max      1809.680000
Name: Amount, dtype: float64

In [48]:
# compare the mean of every feature for legit (0) and fraudulent (1) transactions
credit_card_data.groupby('Class').mean()

Class,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V20,V21,V22,V23,V24,V25,V26,V27,V28,Amount
0,3838.068361,-0.307603,0.291124,0.906121,0.193777,-0.007708,0.177136,-0.006946,-0.06109,0.617026,...,0.045744,-0.053597,-0.167602,-0.034043,0.026271,0.091555,-0.011393,0.01793,0.00011,65.233196
1,7359.24,-1.154048,2.93088,-4.757618,4.59024,-0.636103,-1.952536,-2.202403,0.647916,-1.581984,...,0.263011,0.393614,-0.265715,-0.116502,-0.183413,0.067479,0.256994,0.421586,0.2376,106.3084


Build a sample dataset containing a similar number of normal and fraudulent transactions.

Number of Fraudulent Transactions --> 492 (the outputs below reflect this count; the 7,500-row file loaded above contains 25)

The next cell takes a random sample from the legit dataframe with as many rows as the fraud dataframe, which gives n = 492 when there are 492 fraudulent transactions. This balances the two classes before training. Because the original data has a large number of legitimate transactions and very few fraudulent ones, a model trained on it directly may be biased towards predicting that every transaction is legitimate. With an equal number of legitimate and fraudulent transactions, the model can better learn the patterns that separate the two.

In [49]:
# sample as many legit rows as there are fraud rows (492 in the outputs below)
legit_sample = legit.sample(n=len(fraud))

In [50]:
new_df = pd.concat([legit_sample,fraud],axis=0)
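
The balancing step above can also be written as a small reusable helper. This is a sketch rather than the notebook's own code: the function name `undersample_majority`, its parameters, and the toy dataframe are all assumptions made for illustration.

```python
import pandas as pd


def undersample_majority(df, label_col="Class", random_state=0):
    """Return a frame in which every class has as many rows as the rarest class."""
    n_minority = df[label_col].value_counts().min()
    parts = [
        group.sample(n=n_minority, random_state=random_state)
        for _, group in df.groupby(label_col)
    ]
    # Shuffle so the classes are not stored in contiguous blocks.
    return pd.concat(parts, axis=0).sample(frac=1, random_state=random_state)


# Toy data with made-up values: 95 legit rows and 5 fraud rows.
toy = pd.DataFrame({"Amount": range(100), "Class": [0] * 95 + [1] * 5})
balanced = undersample_majority(toy)
print(balanced["Class"].value_counts())
```

Deriving the sample size from the data, instead of hard-coding it, keeps the two classes equal whichever file is loaded.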

In [51]:
new_df

Unnamed: 0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
1716,1328,0.988553,-0.774050,-0.602488,-1.519531,-0.350313,-0.690654,0.236715,-0.091379,0.905596,...,0.102347,0.026370,-0.259711,-0.271585,0.553470,0.033558,-0.043757,0.012867,160.00,0
985,744,0.846466,-0.648165,0.909723,1.278921,-0.848055,0.723489,-0.706530,0.453813,0.762523,...,0.238578,0.464747,-0.204267,-0.327269,0.330811,-0.217592,0.041951,0.037855,138.00,0
4276,3756,1.350757,-0.767438,-0.944465,-1.595062,1.432443,3.284171,-1.135629,0.703558,0.357024,...,-0.262242,-0.912080,0.058162,0.921365,0.347414,-0.511174,-0.024878,0.020000,68.15,0
6754,8492,-0.984687,1.024215,1.429716,0.431752,0.604193,0.079797,0.399412,0.072281,0.881834,...,-0.143881,-0.376227,-0.191026,-0.448101,-0.226216,-0.610318,-0.286850,0.102007,10.75,0
394,285,-0.931805,1.527737,0.818889,0.056990,-0.319930,-1.054736,0.358790,0.354073,-0.392590,...,-0.249969,-0.713791,0.044634,0.334423,-0.081413,0.078215,0.231701,0.090920,8.99,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6870,8757,-1.863756,3.442644,-4.468260,2.805336,-2.118412,-2.332285,-4.261237,1.701682,-1.439396,...,0.667927,-0.516242,-0.012218,0.070614,0.058504,0.304883,0.418012,0.208858,1.00,1
6882,8808,-4.617217,1.695694,-3.114372,4.328199,-1.873257,-0.989908,-4.577265,0.472216,0.472017,...,0.481830,0.146023,0.117039,-0.217565,-0.138776,-0.424453,-1.002041,0.890780,1.10,1
6899,8878,-2.661802,5.856393,-7.653616,6.379742,-0.060712,-3.131550,-3.103570,1.778492,-3.831154,...,0.734775,-0.435901,-0.384766,-0.286016,1.007934,0.413196,0.280284,0.303937,1.00,1
6903,8886,-2.535852,5.793644,-7.618463,6.395830,-0.065210,-3.136372,-3.104557,1.823233,-3.878658,...,0.716720,-0.448060,-0.402407,-0.288835,1.011752,0.425965,0.413140,0.308205,1.00,1


In [17]:
new_df['Class'].value_counts()

Class
0    492
1    492
Name: count, dtype: int64

In [18]:
new_df.groupby('Class').mean()

Class,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V20,V21,V22,V23,V24,V25,V26,V27,V28,Amount
0,94894.493902,-0.005318,-0.053724,-0.069328,-0.049915,0.022442,0.001627,0.033571,-0.043619,-0.131781,...,-0.01947,0.020041,0.021029,-0.005789,-0.016514,0.013873,-0.014395,-0.027772,0.018552,97.838659
1,80746.806911,-4.771948,3.623778,-7.033281,4.542029,-3.151225,-1.397737,-5.568731,0.570636,-2.581123,...,0.372319,0.713588,0.014049,-0.040308,-0.10513,0.041449,0.051648,0.170575,0.075667,122.211321


In [19]:
# features: every column except the label
X = new_df.drop(columns='Class')
# target: the Class label
Y = new_df['Class']

In [20]:
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, stratify=Y, random_state=2)
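
The `stratify=Y` argument keeps the class proportions the same in the training and test sets. The following sketch illustrates this on toy arrays. `X_toy`, `y_toy`, and the split variable names are assumptions, not notebook variables.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 100 rows, 20 of them positive.
X_toy = np.arange(200).reshape(100, 2)
y_toy = np.array([0] * 80 + [1] * 20)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=2
)

# With stratification, both splits keep the 20% positive share.
print("positive share, train:", y_tr.mean())
print("positive share, test: ", y_te.mean())
```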

# Model Training

Logistic Regression

In [21]:
from sklearn.linear_model import SGDClassifier

# Logistic regression trained with SGD; max_iter caps the number of epochs
sgd_logreg = SGDClassifier(loss='log_loss', max_iter=100, tol=1e-3, random_state=42)

# Fit the model (SGDClassifier updates the weights one sample at a time and has no batch_size parameter)
sgd_logreg.fit(X_train, Y_train)

# Accuracy on training data
train_acc = sgd_logreg.score(X_train, Y_train)
print('SGD Logistic Regression Accuracy on Training data:', train_acc)

# Accuracy on test data
test_acc = sgd_logreg.score(X_test, Y_test)
print('SGD Logistic Regression Accuracy on Test data:', test_acc)

SGD Logistic Regression Accuracy on Training data: 0.5006353240152478
SGD Logistic Regression Accuracy on Test data: 0.49746192893401014


In [26]:
model=LogisticRegression()
# training the Logistic Regression Model with Training Data
model.fit(X_train, Y_train)
# accuracy on training data
X_train_prediction = model.predict(X_train)
training_data_accuracy = accuracy_score(Y_train, X_train_prediction)
print('Accuracy on Training data : ', training_data_accuracy)
# accuracy on test data
X_test_prediction = model.predict(X_test)
test_data_accuracy = accuracy_score(Y_test, X_test_prediction)
print('Accuracy score on Test Data : ', test_data_accuracy)

Accuracy on Training data :  0.9504447268106735
Accuracy score on Test Data :  0.9187817258883249


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
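
The solver message above suggests raising `max_iter` or scaling the data. One way to do both is to put a `StandardScaler` in front of the classifier in a pipeline, sketched here on synthetic data. The arrays and the name `scaled_logreg` are assumptions, and the large-range column only imitates a feature like Time. No claim is made about the accuracy this would give on the notebook's data.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 300
y_toy = rng.integers(0, 2, n)

# Column 0 is informative and small; column 1 is uninformative with a large range.
small = y_toy + rng.normal(0, 0.5, n)
large = rng.uniform(0, 170000, n)
X_toy = np.column_stack([small, large])

# The scaler is fitted only on the data passed to fit(), then applied before the model.
scaled_logreg = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scaled_logreg.fit(X_toy, y_toy)

print("iterations used:", scaled_logreg[-1].n_iter_[0])
print("training accuracy:", scaled_logreg.score(X_toy, y_toy))
```

Because the scaler lives inside the pipeline, calling `fit` on the training split and `score` on the test split avoids leaking test statistics into the scaling step.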


SVM

In [27]:
from sklearn.linear_model import SGDClassifier

# Linear SVM trained with SGD (hinge loss); max_iter=100 caps the number of epochs
sgd_svm = SGDClassifier(loss='hinge', max_iter=100, tol=1e-3, random_state=42)
sgd_svm.fit(X_train, Y_train)

# Accuracy on training data
train_acc_sgd_svm = sgd_svm.score(X_train, Y_train)
print('SGD SVM Accuracy on Training data:', train_acc_sgd_svm)

# Accuracy on test data
test_acc_sgd_svm = sgd_svm.score(X_test, Y_test)
print('SGD SVM Accuracy on Test data:', test_acc_sgd_svm)

SGD SVM Accuracy on Training data: 0.5006353240152478
SGD SVM Accuracy on Test data: 0.49746192893401014


In [23]:

from sklearn import svm

# Support Vector Machine
svm_model = svm.SVC(kernel='linear')

# training the SVM Model with Training Data
svm_model.fit(X_train, Y_train)

# accuracy on training data
X_train_prediction_svm = svm_model.predict(X_train)
training_data_accuracy_svm = accuracy_score(Y_train, X_train_prediction_svm)
print('Accuracy on Training data (SVM): ', training_data_accuracy_svm)

# accuracy on test data
X_test_prediction_svm = svm_model.predict(X_test)
test_data_accuracy_svm = accuracy_score(Y_test, X_test_prediction_svm)
print('Accuracy score on Test Data (SVM): ', test_data_accuracy_svm)

Accuracy on Training data (SVM):  0.9072426937738246
Accuracy score on Test Data (SVM):  0.9137055837563451


In [28]:
from sklearn.neighbors import KNeighborsClassifier

knn_model = KNeighborsClassifier(n_neighbors=5)
knn_model.fit(X_train, Y_train)

# Accuracy on training data
X_train_prediction_knn = knn_model.predict(X_train)
training_data_accuracy_knn = accuracy_score(Y_train, X_train_prediction_knn)
print('Accuracy on Training data (KNN):', training_data_accuracy_knn)

# Accuracy on test data
X_test_prediction_knn = knn_model.predict(X_test)
test_data_accuracy_knn = accuracy_score(Y_test, X_test_prediction_knn)
print('Accuracy score on Test Data (KNN):', test_data_accuracy_knn)

Accuracy on Training data (KNN): 0.7458703939008895
Accuracy score on Test Data (KNN): 0.6751269035532995


In [31]:
from sklearn.ensemble import RandomForestClassifier

# Create and train the Random Forest model
rf_model = RandomForestClassifier(n_estimators=100, random_state=42)
rf_model.fit(X_train, Y_train)

# Accuracy on training data
X_train_prediction_rf = rf_model.predict(X_train)
training_data_accuracy_rf = accuracy_score(Y_train, X_train_prediction_rf)
print('Accuracy on Training data (Random Forest):', training_data_accuracy_rf)

# Accuracy on test data
X_test_prediction_rf = rf_model.predict(X_test)
test_data_accuracy_rf = accuracy_score(Y_test, X_test_prediction_rf)
print('Accuracy score on Test Data (Random Forest):', test_data_accuracy_rf)

Accuracy on Training data (Random Forest): 1.0
Accuracy score on Test Data (Random Forest): 0.8984771573604061


In [32]:
from sklearn.tree import DecisionTreeClassifier

# Create and train the Decision Tree model
dtree = DecisionTreeClassifier(random_state=42)
dtree.fit(X_train, Y_train)

# Accuracy on training data
X_train_prediction_dtree = dtree.predict(X_train)
training_data_accuracy_dtree = accuracy_score(Y_train, X_train_prediction_dtree)
print('Accuracy on Training data (Decision Tree):', training_data_accuracy_dtree)

# Accuracy on test data
X_test_prediction_dtree = dtree.predict(X_test)
test_data_accuracy_dtree = accuracy_score(Y_test, X_test_prediction_dtree)
print('Accuracy score on Test Data (Decision Tree):', test_data_accuracy_dtree)

Accuracy on Training data (Decision Tree): 1.0
Accuracy score on Test Data (Decision Tree): 0.8883248730964467


In [35]:
from xgboost import XGBClassifier

# Create and train the XGBoost model
xgb_model = XGBClassifier(eval_metric='logloss', random_state=42)
xgb_model.fit(X_train, Y_train)

# Accuracy on training data
X_train_prediction_xgb = xgb_model.predict(X_train)
training_data_accuracy_xgb = accuracy_score(Y_train, X_train_prediction_xgb)
print('Accuracy on Training data (XGBoost):', training_data_accuracy_xgb)

# Accuracy on test data
X_test_prediction_xgb = xgb_model.predict(X_test)
test_data_accuracy_xgb = accuracy_score(Y_test, X_test_prediction_xgb)
print('Accuracy score on Test Data (XGBoost):', test_data_accuracy_xgb)

Accuracy on Training data (XGBoost): 1.0
Accuracy score on Test Data (XGBoost): 0.9238578680203046
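
The cells above repeat the same fit, predict, and score pattern for each scikit-learn style model. A small helper can capture that pattern once. This is a sketch under assumptions: `train_and_score`, the synthetic data from `make_classification`, and the two example models are illustrative and not part of the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier


def train_and_score(model, X_train, X_test, y_train, y_test):
    """Fit the model and return (training accuracy, test accuracy)."""
    model.fit(X_train, y_train)
    train_acc = accuracy_score(y_train, model.predict(X_train))
    test_acc = accuracy_score(y_test, model.predict(X_test))
    return train_acc, test_acc


# Synthetic binary classification data (made up for illustration).
X_toy, y_toy = make_classification(n_samples=400, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=2
)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
}
results = {
    name: train_and_score(model, X_tr, X_te, y_tr, y_te)
    for name, model in models.items()
}

for name, (train_acc, test_acc) in results.items():
    print(f"{name}: train={train_acc:.3f}, test={test_acc:.3f}")
```

Collecting the scores in one dictionary also makes it easier to compare training and test accuracy side by side, for example to spot a model that fits the training data perfectly but scores lower on the test split.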


In [36]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Input
from tensorflow.keras.optimizers import Adam

# Build a simple feedforward neural network
model_keras = Sequential([
    Input(shape=(X_train.shape[1],)),  # explicit input layer instead of input_shape on Dense
    Dense(32, activation='relu'),
    Dense(16, activation='relu'),
    Dense(1, activation='sigmoid')
])

model_keras.compile(optimizer=Adam(), loss='binary_crossentropy', metrics=['accuracy'])

# Train the model
history = model_keras.fit(X_train, Y_train, epochs=100, batch_size=32, validation_data=(X_test, Y_test), verbose=0)

# Evaluate on training data
train_loss, train_acc_keras = model_keras.evaluate(X_train, Y_train, verbose=0)
print('Keras Model Accuracy on Training data:', train_acc_keras)

# Evaluate on test data
test_loss, test_acc_keras = model_keras.evaluate(X_test, Y_test, verbose=0)
print('Keras Model Accuracy on Test data:', test_acc_keras)


Keras Model Accuracy on Training data: 0.7191867828369141
Keras Model Accuracy on Test data: 0.720812201499939
