<a href="https://colab.research.google.com/github/moridin04/CCADMACL-Research/blob/main/Fraud_Detection_Program_FINAL.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Exploring Anomaly Detection Techniques for Fraudulent Credit Card Transactions**

# **1. Environment Setup**

**1.1 Tools and Libraries Installation**

In [295]:
!pip install lime
!pip install scikit-learn

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import tensorflow as tf
import lime
import lime.lime_tabular



# 2. **Importing Libraries**

**2.1 Essential Libraries for Data Analysis**

In [296]:
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.decomposition import PCA
from sklearn.metrics import classification_report, roc_auc_score
from sklearn.metrics import confusion_matrix
from sklearn.metrics import cohen_kappa_score
from sklearn.metrics import average_precision_score, precision_recall_curve

**2.2 Libraries for Machine Learning and Visualization**

In [297]:
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split
from sklearn.ensemble import IsolationForest
from sklearn.ensemble import VotingClassifier
from tensorflow.keras.layers import Input, Dense
from tensorflow.keras.models import Model
from tensorflow.keras.layers import BatchNormalization, Dropout
from tensorflow.keras.callbacks import EarlyStopping
from sklearn.neighbors import LocalOutlierFactor

#3. **Loading Data**

**3.1 Loading the Kaggle Credit Card Fraud Dataset**

In [298]:
cfd = pd.read_csv('creditcard.csv')

#4. **Exploration of Data**

### **Glimpse of the Dataset**

**4.1 Displaying the First Few Rows**

**4.1.2 Dataset Information (Shape, Columns, Null Values, Data Types)**

**4.2 Summary Statistics for Numerical Features**

**4.2.1 Class Distribution (Fraud vs. Non-Fraud)**

### **Distribution of Independent Variable**

**4.3 Distribution of Amount**

**4.4 Distribution of Time**

**4.4 Histograms for Key Features (V1-V28, Amount, Time)**

# **5. Pre-processing of Data**

**5.1 Checking of Null Values**

**5.2 Checking of Outliers**

**5.3 Checking of Duplicate Transactions**

In [299]:
duplicate_counts = cfd.duplicated().value_counts()
print(duplicate_counts)

False    283726
True       1081
Name: count, dtype: int64


In [300]:
duplicate_counts = cfd.duplicated().value_counts()
print("Duplicate Counts before removal:\n", duplicate_counts)

cfd = cfd.drop_duplicates(keep='first')

duplicate_counts = cfd.duplicated().value_counts()
print("\nDuplicate Counts after removal:\n", duplicate_counts)

Duplicate Counts before removal:
 False    283726
True       1081
Name: count, dtype: int64

Duplicate Counts after removal:
 False    283726
Name: count, dtype: int64


**5.4 Feature Selection/Reduction**

**5.4.1 Correlation Matrix for Numerical Features**

**5.4.2 Heatmap Visualization**

**5.4.3 Dropping Irrelevant Features**

### **Application of Standard Scaler**

**5.5 Feature Scaling**

**5.5.1 Standardization (Z-Score Scaling)**

In [466]:
scaler = StandardScaler()
scaler.fit(cfd[['Amount']])
input_data[['Amount']] = scaler.transform(input_data[['Amount']])

**5.5.2 Normalization (Min-Max Scaling)**

In [302]:
time = cfd['Time']
cfd['Time'] = (time - time.min()) / (time.max() - time.min())

**5.6 Train, Test, and Validation**

In [303]:
x = cfd.drop(columns=['Class'])
y = cfd['Class']

**5.7 Splitting the Dataset into Training and Testing Sets**

In [304]:
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, stratify=y, random_state=42)

# **6. Machine Learning**

### **Isolation Forest**

In [305]:
fraud_ratio = y_train.mean()
if_model = IsolationForest(contamination=0.02, random_state=101)
if_model.fit(x_train)

In [306]:
fraud_test = x_test[y_test == 1]
non_fraud_test = x_test[y_test == 0].sample(len(fraud_test), random_state=42)
x_test_balanced = pd.concat([fraud_test, non_fraud_test])
y_test_balanced = np.concatenate([np.ones(len(fraud_test)), np.zeros(len(non_fraud_test))])

In [307]:
if_y_pred = (if_model.predict(x_test_balanced) == -1).astype(int)
print(classification_report(y_test_balanced, if_y_pred))
print("ROC AUC Score:", roc_auc_score(y_test_balanced, if_y_pred))
print(f"AUPRC for Isolation Forest: {average_precision_score(y_test_balanced, if_y_pred)}")

              precision    recall  f1-score   support

         0.0       0.77      0.94      0.85        95
         1.0       0.92      0.73      0.81        95

    accuracy                           0.83       190
   macro avg       0.85      0.83      0.83       190
weighted avg       0.85      0.83      0.83       190

ROC AUC Score: 0.8315789473684211
AUPRC for Isolation Forest: 0.8050526315789475


In [308]:
def predict_fraud(input_data):
    decision_score = if_model.decision_function(input_data)
    fraud_prediction = (decision_score < 0).astype(int)[0]  # Adjusted threshold
    return "Fraudulent" if fraud_prediction == 1 else "Non-Fraudulent"

In [309]:
def fraudulent_data():
    time = 100000
    amount = 5000.00
    v_values = [-5.64, -7.27, -4.83, -5.68, -1.14, -2.62, -4.36, -7.32, -1.34, -0.02, 0.28, -0.23, -0.64, 0.10, 0.17, 0.13, -0.01, 0.01, -0.11, 0.07, 0.13, -0.19, 0.13, -0.02, 0.13, -0.19, 0.13, -0.02]
    return pd.DataFrame([[time, amount] + v_values], columns=['Time', 'Amount'] + [f'V{i}' for i in range(1, 29)])

In [310]:
input_data = fraudulent_data()
input_data[['Amount']] = scaler.transform(input_data[['Amount']])  # Apply same scaling
input_data['Time'] = (input_data['Time'] - cfd['Time'].min()) / (cfd['Time'].max() - cfd['Time'].min())
input_data = input_data[x_train.columns]  # Ensure correct feature order

In [311]:
print("Transaction Prediction:", predict_fraud(input_data))

Transaction Prediction: Fraudulent


In [312]:
def non_fraudulent_data():
    time = 50000
    amount = 50.00
    v_values = [-1.36, -0.07, 2.54, 1.38, -0.34, 0.46, 0.24, 0.10, 0.36, -0.02, 0.28, -0.23, -0.64, 0.10, 0.17, 0.13, -0.01, 0.01, -0.11, 0.07, 0.13, -0.19, 0.13, -0.02, 0.13, -0.19, 0.13, -0.02]
    return pd.DataFrame([[time, amount] + v_values], columns=['Time', 'Amount'] + [f'V{i}' for i in range(1, 29)])


In [374]:
input_data2 = non_fraudulent_data()
input_data2[['Amount']] = scaler.transform(input_data2[['Amount']])  # Apply same scaling
input_data2['Time'] = (input_data2['Time'] - time.min()) / (time.max() - time.min())
input_data2 = input_data2[x_train.columns]  # Ensure correct feature order

In [375]:
print("Transaction Prediction:", predict_fraud(input_data2))

Transaction Prediction: Non-Fraudulent


### **Autoencoders**

In [393]:
y_train_fraud = y_train[y_train == 1].sample(frac=0.1, random_state=42)  # Increase to 10%
x_train_fraud = x_train.loc[y_train_fraud.index]
x_train_auto = pd.concat([x_train[y_train == 0], x_train_fraud])

In [394]:
input_dim = x_train_auto.shape[1]
input_layer = Input(shape=(input_dim,))
encoded = Dense(64, activation='relu')(input_layer)
encoded = Dense(32, activation='relu')(encoded)
encoded = Dense(16, activation='relu')(encoded)
encoded = Dense(8, activation='relu')(encoded)

decoded = Dense(16, activation='relu')(encoded)
decoded = Dense(32, activation='relu')(decoded)
decoded = Dense(input_dim, activation='sigmoid')(decoded)

In [395]:
autoencoder = Model(input_layer, decoded)
autoencoder.compile(optimizer='adam', loss='mse')

In [396]:
early_stopping = EarlyStopping(monitor='val_loss', patience=5, restore_best_weights=True)
autoencoder.fit(x_train_auto, x_train_auto, epochs=50, batch_size=256, shuffle=True, validation_split=0.2, callbacks=[early_stopping])

Epoch 1/50
[1m709/709[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m5s[0m 3ms/step - loss: 0.9566 - val_loss: 0.8688
Epoch 2/50
[1m709/709[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m2s[0m 3ms/step - loss: 0.7965 - val_loss: 0.8365
Epoch 3/50
[1m709/709[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m3s[0m 4ms/step - loss: 0.7564 - val_loss: 0.8252
Epoch 4/50
[1m709/709[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m2s[0m 3ms/step - loss: 0.7654 - val_loss: 0.8185
Epoch 5/50
[1m709/709[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m2s[0m 3ms/step - loss: 0.7425 - val_loss: 0.8155
Epoch 6/50
[1m709/709[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m3s[0m 3ms/step - loss: 0.7485 - val_loss: 0.8131
Epoch 7/50
[1m709/709[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m2s[0m 3ms/step - loss: 0.7549 - val_loss: 0.8096
Epoch 8/50
[1m709/709[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m2s[0m 3ms/step - loss: 0.7403 - val_loss: 0.8077
Epoch 9/50
[1m709/709[0m [32m━━━━━━━━

<keras.src.callbacks.history.History at 0x799597ae1250>

In [406]:
reconstructed = autoencoder.predict(x_test)
mse = np.mean(np.power(x_test - reconstructed, 2), axis=1)
threshold = np.percentile(mse, 80)

[1m1774/1774[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m2s[0m 1ms/step


In [407]:
y_test_pred = (mse > threshold).astype(int)
print(classification_report(y_test, y_test_pred))
print("ROC AUC Score:", roc_auc_score(y_test, y_test_pred))
print(f"AUPRC for Autoencoder: {average_precision_score(y_test, y_test_pred)}")

              precision    recall  f1-score   support

           0       1.00      0.80      0.89     56651
           1       0.01      0.86      0.01        95

    accuracy                           0.80     56746
   macro avg       0.50      0.83      0.45     56746
weighted avg       1.00      0.80      0.89     56746

ROC AUC Score: 0.832136748642891
AUPRC for Autoencoder: 0.006465671120822416


In [408]:
def ae_predict_fraud(input_data):
    reconstructed = autoencoder.predict(input_data)
    mse = np.mean(np.power(input_data - reconstructed, 2), axis=1)
    fraud_prediction = (mse > threshold).astype(int)[0]
    return "Fraudulent" if fraud_prediction == 1 else "Non-Fraudulent"

In [409]:
def fraudulent_data():
    time = 100000
    amount = 5000.00
    v_values = [-5.64, -7.27, -4.83, -5.68, -1.14, -2.62, -4.36, -7.32, -1.34, -0.02, 0.28, -0.23, -0.64, 0.10, 0.17, 0.13, -0.01, 0.01, -0.11, 0.07, 0.13, -0.19, 0.13, -0.02, 0.13, -0.19, 0.13, -0.02]
    return pd.DataFrame([[time, amount] + v_values], columns=['Time', 'Amount'] + [f'V{i}' for i in range(1, 29)])

In [410]:
input_data3 = fraudulent_data()
input_data3[['Amount']] = scaler.transform(input_data3[['Amount']])  # Apply same scaling
input_data3['Time'] = (input_data3['Time'] - cfd['Time'].min()) / (cfd['Time'].max() - cfd['Time'].min())
input_data3 = input_data3[x_train.columns]  # Ensure correct feature order

In [411]:
print("Transaction Prediction:", ae_predict_fraud(input_data3))

[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 33ms/step
Transaction Prediction: Fraudulent


In [422]:
def non_fraudulent_data():
    time = 50000
    amount = 50.00
    v_values = [-1.36, -0.07, 2.54, 1.38, -0.34, 0.46, 0.24, 0.10, 0.36, -0.02, 0.28, -0.23, -0.64, 0.10, 0.17, 0.13, -0.01, 0.01, -0.11, 0.07, 0.13, -0.19, 0.13, -0.02, 0.13, -0.19, 0.13, -0.02]
    return pd.DataFrame([[time, amount] + v_values], columns=['Time', 'Amount'] + [f'V{i}' for i in range(1, 29)])

In [420]:
input_data4 = non_fraudulent_data()
input_data4[['Amount']] = scaler.transform(input_data4[['Amount']])  # Apply same scaling
input_data4['Time'] = (input_data4['Time'] - cfd['Time'].min()) / (cfd['Time'].max() - cfd['Time'].min())
input_data4 = input_data4[x_train.columns]  # Ensure correct feature order

In [421]:
print("Transaction Prediction:", ae_predict_fraud(input_data4))

[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 45ms/step
Transaction Prediction: Fraudulent


### **Local Outlier Factor**

In [423]:
x_train_normal = x_train[y_train == 0]

In [446]:
lof_model = LocalOutlierFactor(n_neighbors=50, contamination=0.01, novelty=True)
lof_model.fit(x_train_normal)

In [447]:
lof_scores = lof_model.decision_function(x_test)
lof_threshold = np.percentile(lof_scores, 2)
y_test_pred_lof = (lof_scores < lof_threshold).astype(int)



In [448]:
print(classification_report(y_test, y_test_pred_lof))
print("ROC AUC Score:", roc_auc_score(y_test, y_test_pred_lof))
print(f"AUPRC for LOF: {average_precision_score(y_test, y_test_pred_lof)}")

              precision    recall  f1-score   support

           0       1.00      0.98      0.99     56651
           1       0.04      0.45      0.07        95

    accuracy                           0.98     56746
   macro avg       0.52      0.72      0.53     56746
weighted avg       1.00      0.98      0.99     56746

ROC AUC Score: 0.7166778307439178
AUPRC for LOF: 0.01806452088446587


In [449]:
def lof_predict_fraud(input_data):
    lof_score = lof_model.decision_function(input_data)
    fraud_prediction = (lof_score < lof_threshold).astype(int)[0]
    return "Fraudulent" if fraud_prediction == 1 else "Non-Fraudulent"

In [457]:
def fraudulent_data():
    time = 100000
    amount = 5000.00
    v_values = [-5.64, -7.27, -4.83, -5.68, -1.14, -2.62, -4.36, -7.32, -1.34, -0.02, 0.28, -0.23, -0.64, 0.10, 0.17, 0.13, -0.01, 0.01, -0.11, 0.07, 0.13, -0.19, 0.13, -0.02, 0.13, -0.19, 0.13, -0.02]
    if len(v_values) == 28:
        return pd.DataFrame([[time, amount] + v_values], columns=['Time', 'Amount'] + [f'V{i}' for i in range(1, 29)])
    else:
        print("Error: v_values does not contain 28 elements")

In [451]:
input_data5 = fraudulent_data()
input_data5[['Amount']] = scaler.transform(input_data5[['Amount']])  # Apply same scaling
input_data5['Time'] = (input_data5['Time'] - cfd['Time'].min()) / (cfd['Time'].max() - cfd['Time'].min())
input_data5 = input_data5[x_train.columns]  # Ensure correct feature order

In [458]:
print("Transaction Prediction:", lof_predict_fraud(input_data5))

Transaction Prediction: Fraudulent




In [463]:
def non_fraudulent_data():
    time = 50000
    amount = 50.00
    v_values = [-1.36, -0.07, 2.54, 1.38, -0.34, 0.46, 0.24, 0.10, 0.36, -0.02, 0.28, -0.23, -0.64, 0.10, 0.17, 0.13, -0.01, 0.01, -0.11, 0.07, 0.13, -0.19, 0.13, -0.02, 0.13, -0.19, 0.13, -0.02]
    if len(v_values) == 28:
        return pd.DataFrame([[time, amount] + v_values], columns=['Time', 'Amount'] + [f'V{i}' for i in range(1, 29)])
    else:
        print("Error: v_values does not contain 28 elements")

In [464]:
input_data6 = non_fraudulent_data()
input_data6[['Amount']] = scaler.transform(input_data6[['Amount']])  # Apply same scaling
input_data6['Time'] = (input_data6['Time'] - cfd['Time'].min()) / (cfd['Time'].max() - cfd['Time'].min())
input_data6 = input_data6[x_train.columns]  # Ensure correct feature order

In [465]:
print("Transaction Prediction:", lof_predict_fraud(input_data6))

Transaction Prediction: Fraudulent




# **7. Evaluation of Model Performance**

**7.1 Creation of Metrics-Data**

**7.2 Selection of Best Performing Model**

**7.3 LIME Analysis**

**7.3.1 LIME Analysis for Isolation Forest**

**7.3.2 LIME Analysis for Autoencoders**

**7.3.3 LIME Analysis for Local Outlier Factor**

**7.4 Confusion Matrix for Each Model**

**7.5 Cohen's Kappa**

# **8. Detection of Fraud**

**8.1 Defining Input Parameters (Time, Amount, V1-V28)**

Enter transaction time: 100000

Enter transaction amount: 5000.00

Enter value for V1: -5.64

Enter value for V2: -7.27

Enter value for V3: -4.83

Enter value for V4: -5.68

Enter value for V5: -1.14

Enter value for V6: -2.62

Enter value for V7: -4.36

Enter value for V8: -7.32

Enter value for V9: -1.34

Enter value for V10: -0.02

Enter value for V11: 0.28

Enter value for V12: -0.23

Enter value for V13: -0.64

Enter value for V14: 0.10

Enter value for V15: 0.17

Enter value for V16: 0.13

Enter value for V17: -0.01

Enter value for V18: 0.01

Enter value for V19: -0.11

Enter value for V20: 0.07

Enter value for V21: 0.13

Enter value for V22: -0.19

Enter value for V23: 0.13

Enter value for V24: -0.02

Enter value for V25: 0.13

Enter value for V26: -0.19

Enter value for V27: 0.13

Enter value for V28: -0.02

Expected Output: Fraudulent

**8.2 Preprocessing Input Data**

**8.3 Function for Fraud Prediction**

**8.4 Prediction using Isolation Forest**

**8.5 Prediction using Autoencoders**

**8.6 Prediction using Local Outlier Factor**