Import Libraries & Dataset

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
from matplotlib.colors import LinearSegmentedColormap
import seaborn as sns
import sklearn

df = pd.read_csv('creditcard.csv')
df.head()

The dataset is from Kaggle, and the original feature columns in creditcard.csv were named V1 through V28. I renamed these columns after 28 common fraud patterns to make the data more understandable for users and for me. The new names are labels only and do not change the underlying values. The original names V1 through V28 are alphanumeric, and SQL interpreted these columns as variable-length character data (VARCHAR).

In [None]:
LOAD DATA INFILE 'the filepath hidden for confidentality/creditcard.csv'
INTO TABLE transactions
FIELDS TERMINATED BY ','
ENCLOSED BY '"'
LINES TERMINATED BY '\n'
IGNORE 1 LINES;

ALTER TABLE transactions
    CHANGE COLUMN v1  HighValueTransactions VARCHAR(255),
    CHANGE COLUMN v2  LowValueTransactions VARCHAR(255),
    CHANGE COLUMN v3  FrequentSmallAmounts VARCHAR(255),
    CHANGE COLUMN v4  LargeSingleTransactions VARCHAR(255),
    CHANGE COLUMN v5  HighFrequencyTransactions VARCHAR(255),
    CHANGE COLUMN v6  BurstTransactions VARCHAR(255),
    CHANGE COLUMN v7  RegularIntervalTransactions VARCHAR(255),
    CHANGE COLUMN v8  IrregularFrequencyPatterns VARCHAR(255),
    CHANGE COLUMN v9  CrossBorderTransactions VARCHAR(255),
    CHANGE COLUMN v10 UnusualLocationTransactions VARCHAR(255),
    CHANGE COLUMN v11 SameLocationTransactions VARCHAR(255),
    CHANGE COLUMN v12 NewLocationTransactions VARCHAR(255),
    CHANGE COLUMN v13 OddHourTransactions VARCHAR(255),
    CHANGE COLUMN v14 WeekendTransactions VARCHAR(255),
    CHANGE COLUMN v15 MonthEndTransactions VARCHAR(255),
    CHANGE COLUMN v16 HolidayTransactions VARCHAR(255),
    CHANGE COLUMN v17 UnusualSpendingPatterns VARCHAR(255),
    CHANGE COLUMN v18 RepeatedTransactions VARCHAR(255),
    CHANGE COLUMN v19 TransactionVolumeAnomalies VARCHAR(255),
    CHANGE COLUMN v20 RapidAccountActivity VARCHAR(255),
    CHANGE COLUMN v21 MerchantTypeAnomalies VARCHAR(255),
    CHANGE COLUMN v22 NewAccountTransactions VARCHAR(255),
    CHANGE COLUMN v23 AccountBalancePatterns VARCHAR(255),
    CHANGE COLUMN v24 FrequentRefunds VARCHAR(255),
    CHANGE COLUMN v25 HighAmountHighFrequency VARCHAR(255),
    CHANGE COLUMN v26 LowAmountHighFrequency VARCHAR(255),
    CHANGE COLUMN v27 HighAmountLowFrequency VARCHAR(255),
    CHANGE COLUMN v28 LowAmountLowFrequency VARCHAR(255);
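
The plotting cells further down refer to the descriptive column names, so the same rename has to exist on the pandas side as well. The sketch below shows one way to mirror the SQL rename in pandas. It assumes the CSV still carries the original V1 through V28 headers; `pattern_names`, `rename_map`, and `rename_pattern_columns` are hypothetical names, and the tiny `demo` frame holds made-up values.

```python
import pandas as pd

# Descriptive names in the same order as the SQL rename (V1 ... V28).
pattern_names = [
    'HighValueTransactions', 'LowValueTransactions', 'FrequentSmallAmounts',
    'LargeSingleTransactions', 'HighFrequencyTransactions', 'BurstTransactions',
    'RegularIntervalTransactions', 'IrregularFrequencyPatterns',
    'CrossBorderTransactions', 'UnusualLocationTransactions',
    'SameLocationTransactions', 'NewLocationTransactions',
    'OddHourTransactions', 'WeekendTransactions', 'MonthEndTransactions',
    'HolidayTransactions', 'UnusualSpendingPatterns', 'RepeatedTransactions',
    'TransactionVolumeAnomalies', 'RapidAccountActivity', 'MerchantTypeAnomalies',
    'NewAccountTransactions', 'AccountBalancePatterns', 'FrequentRefunds',
    'HighAmountHighFrequency', 'LowAmountHighFrequency', 'HighAmountLowFrequency',
    'LowAmountLowFrequency',
]

# Map the original headers (V1, V2, ...) to the descriptive names.
rename_map = {f'V{i}': name for i, name in enumerate(pattern_names, start=1)}


def rename_pattern_columns(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with V1..V28 renamed; other columns are left untouched."""
    return frame.rename(columns=rename_map)


# Tiny stand-in frame with made-up values to show the effect.
demo = pd.DataFrame({'Time': [0.0], 'V1': [-1.3], 'V14': [-0.3], 'Amount': [100.0], 'Class': [0]})
renamed = rename_pattern_columns(demo)
print(list(renamed.columns))
```

Under these assumptions, something like `creditcard_data = rename_pattern_columns(pd.read_csv('creditcard.csv'))` would give the later cells the column names they expect.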

The bank transactions dataset did not need any alterations. It contains some empty rows, but those can be removed with pandas.
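
As a minimal sketch of that cleanup, the example below drops rows in which every column is missing. The column names match the ones used later in this notebook, but the three rows in `bank_demo` are made-up values, and treating "empty" as "all columns missing" is an assumption.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the bank data; the second row is completely empty.
bank_demo = pd.DataFrame({
    'DATE': ['29-Jun-17', np.nan, '05-Jul-17'],
    'WITHDRAWAL AMT': ['1,000.00', np.nan, np.nan],
    'DEPOSIT AMT': [np.nan, np.nan, '250.00'],
})

# how='all' drops only rows where every column is missing,
# so rows with a single empty amount column are kept.
bank_clean = bank_demo.dropna(how='all').reset_index(drop=True)
print(len(bank_demo), '->', len(bank_clean))
```

Using `how='all'` rather than the default `how='any'` matters here, because a normal transaction row has either a withdrawal or a deposit and would otherwise be dropped.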

In [None]:
LOAD DATA INFILE 'the filepath hidden for confidentality/bank_transactions.csv'
INTO TABLE transactions
FIELDS TERMINATED BY ','
ENCLOSED BY '"'
LINES TERMINATED BY '\n'
IGNORE 1 LINES;

Visualizing Class Weight and Distribution of Features Across Dataset

In [None]:
df.hist(bins=30, figsize=(30, 30))

Exploratory Data Analysis (see the analysis outputs page for detailed visualizations):


Histogram of Transaction Amounts - Reveals spending patterns and customer behavior by showing the distribution of transaction sizes, which helps in profiling customer habits, optimizing marketing strategies, and identifying anomalies. Additionally, it aids in risk management by highlighting high-value transactions, informs revenue insights through peak analysis, and enhances operational efficiency by optimizing systems to handle transaction volumes effectively.

Correlation Heatmap - Illustrates how strongly different transaction features and patterns relate to each other, with colors indicating the strength of these relationships. This allows us to quickly identify which transaction types or patterns tend to occur together, helping to understand which factors might be influencing each other or contributing to fraudulent behavior.

Facet Grid [Feature Distribution by Fraud Class] - Comparing the distribution of each feature by fraud status reveals how different types of transactions vary between fraudulent and non-fraudulent cases. This analysis helps identify specific patterns or anomalies associated with fraud, allowing businesses to target their prevention efforts and risk management strategies more effectively.

Feature Importance Box Plot - Each feature's distribution is shown across different fraud patterns, highlighting how each feature varies within and between fraud types. This insight helps identify which features have significant differences in fraud patterns, aiding in distinguishing fraudulent transactions from non-fraudulent ones.

Transaction Volume Over Time - Shows the trends in daily withdrawals and deposits, highlighting peak periods of financial activity and any significant fluctuations. This helps in understanding cash flow patterns, assessing liquidity needs, and identifying potential anomalies or trends in transaction behavior.

Pie Chart - Illustrates the imbalance in the dataset by highlighting the dominant proportion of non-fraudulent transactions compared to fraudulent ones. This disparity suggests that the dataset is skewed, which could lead to a model that is less effective at detecting the less frequent fraudulent cases due to the overwhelming number of non-fraudulent examples.

Model Architecture Diagram - The architecture of the neural network models provides insights into their complexity and capability for learning from data.

K-Means Clustering - Shows how data points are grouped into clusters based on their features, with each cluster represented by a different color. It also displays the centroids of these clusters, indicating the central points around which the data points are grouped.

In [None]:
# Histogram of Transaction Amounts
import matplotlib.pyplot as plt

# List of features to plot histograms for
features = [
    'HighValueTransactions', 'LowValueTransactions', 'FrequentSmallAmounts',
    'LargeSingleTransactions', 'HighFrequencyTransactions', 'BurstTransactions',
    'RegularIntervalTransactions', 'IrregularFrequencyPatterns',
    'CrossBorderTransactions', 'UnusualLocationTransactions',
    'SameLocationTransactions', 'NewLocationTransactions',
    'OddHourTransactions', 'WeekendTransactions', 'MonthEndTransactions',
    'HolidayTransactions', 'UnusualSpendingPatterns', 'RepeatedTransactions',
    'TransactionVolumeAnomalies', 'RapidAccountActivity', 'MerchantTypeAnomalies',
    'NewAccountTransactions', 'AccountBalancePatterns', 'FrequentRefunds',
    'HighAmountHighFrequency', 'LowAmountHighFrequency', 'HighAmountLowFrequency',
    'LowAmountLowFrequency'
]

# Create a grid of subplots
n_features = len(features)
n_cols = 4
n_rows = (n_features + n_cols - 1) // n_cols 

plt.figure(figsize=(20, 3 * n_rows))

colormap = plt.get_cmap('gist_rainbow_r')

for i, feature in enumerate(features):
    plt.subplot(n_rows, n_cols, i + 1)
    data = creditcard_data[feature]
    n_bins = 30
    counts, bins = np.histogram(data, bins=n_bins)
    bin_centers = (bins[:-1] + bins[1:]) / 2
    colors = colormap(np.linspace(0, 1, n_bins))

    # Draw each bin once, giving every bar its own color from the colormap
    # (the earlier loop redrew the full histogram n_bins times on top of itself)
    plt.bar(bin_centers, counts, width=np.diff(bins), color=colors, edgecolor='black', alpha=0.7)

    plt.title(f'Distribution of {feature}')
    plt.xlabel(feature)
    plt.ylabel('Frequency')
plt.tight_layout()
plt.show()

In [None]:
# Correlation Heatmap
import seaborn as sns

plt.figure(figsize=(14, 12))
correlation_matrix = creditcard_data.drop(columns=['Class']).corr()
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt='.2f', linewidths=0.5)
plt.title('Feature Correlation Heatmap', fontsize=16)
plt.xlabel('Features', fontsize=11)
plt.ylabel('Features', fontsize=11)
plt.show()

In [None]:
# Facet Grid
features = ['HighValueTransactions', 'LowValueTransactions', 'FrequentSmallAmounts',
            'LargeSingleTransactions', 'HighFrequencyTransactions']
plt.figure(figsize=(15, 10))

for i, feature in enumerate(features, 1):
    plt.subplot(2, 3, i)
    sns.histplot(data=creditcard_data, x=feature, hue='Class', multiple='stack', palette='pastel')
    plt.title(f'Distribution of {feature} by Fraud Class')

plt.tight_layout()
plt.show()

In [None]:
# Feature Importance Box Plot
import seaborn as sns
import matplotlib.pyplot as plt

# List of features to plot boxplots for
features = [
    'HighValueTransactions', 'LowValueTransactions', 'FrequentSmallAmounts',
    'LargeSingleTransactions', 'HighFrequencyTransactions', 'BurstTransactions',
    'RegularIntervalTransactions', 'IrregularFrequencyPatterns',
    'CrossBorderTransactions', 'UnusualLocationTransactions',
    'SameLocationTransactions', 'NewLocationTransactions',
    'OddHourTransactions', 'WeekendTransactions', 'MonthEndTransactions',
    'HolidayTransactions', 'UnusualSpendingPatterns', 'RepeatedTransactions',
    'TransactionVolumeAnomalies', 'RapidAccountActivity', 'MerchantTypeAnomalies',
    'NewAccountTransactions', 'AccountBalancePatterns', 'FrequentRefunds',
    'HighAmountHighFrequency', 'LowAmountHighFrequency', 'HighAmountLowFrequency',
    'LowAmountLowFrequency'
]

plt.figure(figsize=(16, 20))
for i, feature in enumerate(features):
    plt.subplot(6, 5, i + 1)
    sns.boxplot(x=creditcard_data[feature], color='skyblue')
    plt.title(f'Boxplot of {feature}')
    plt.xlabel(feature)

plt.tight_layout()
plt.show()


In [None]:
# Transaction Volume Over Time
import pandas as pd
import matplotlib.pyplot as plt

# Load and preprocess data
bank_data = pd.read_csv('/content/bank_transactions.csv')

# Strip any leading or trailing spaces from column names
bank_data.columns = bank_data.columns.str.strip()

# Convert 'DATE' column to datetime
bank_data['DATE'] = pd.to_datetime(bank_data['DATE'], format='%d-%b-%y')

# Remove commas and convert to numeric
bank_data['WITHDRAWAL AMT'] = bank_data['WITHDRAWAL AMT'].replace({',': ''}, regex=True).astype(float)
bank_data['DEPOSIT AMT'] = bank_data['DEPOSIT AMT'].replace({',': ''}, regex=True).astype(float)

# Set 'DATE' column as index
bank_data.set_index('DATE', inplace=True)

# Aggregate by date
daily_transactions = bank_data.resample('D').agg({
    'WITHDRAWAL AMT': 'sum',
    'DEPOSIT AMT': 'sum'
})

# Plot transaction volume over time
plt.figure(figsize=(14, 7))
plt.plot(daily_transactions.index, daily_transactions['WITHDRAWAL AMT'], label='Withdrawals', color='red')
plt.plot(daily_transactions.index, daily_transactions['DEPOSIT AMT'], label='Deposits', color='green')
plt.title('Daily Withdrawals and Deposits Over Time')
plt.xlabel('Date')
plt.ylabel('Amount')
plt.legend()
plt.grid(True)
plt.show()

In [None]:
# Pie Chart 
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Load and preprocess data
creditcard_data = pd.read_csv('/content/creditcard.csv')

# Strip any leading or trailing spaces from column names
creditcard_data.columns = creditcard_data.columns.str.strip()

# Class column indicates whether a transaction is fraud (1 for fraud, 0 for non-fraud)
fraud_type_counts = creditcard_data['Class'].value_counts()

# Map values to more descriptive labels
fraud_labels = {0: 'Non-Fraud', 1: 'Fraud'}
fraud_type_counts.index = fraud_type_counts.index.map(fraud_labels)

# Print the counts to verify
print("Fraud Type Counts:\n", fraud_type_counts)

# Plot the pie chart
plt.figure(figsize=(8, 8))
plt.pie(fraud_type_counts, labels=fraud_type_counts.index, autopct='%1.1f%%', colors=sns.color_palette('pastel'), startangle=140)
plt.title('Fraud Type Distribution')
plt.show()

In [None]:
# Model Architecture Diagram
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, BatchNormalization
from tensorflow.keras.utils import plot_model

input_shape = 28  # Number of renamed pattern columns (originally V1 through V28) in the creditcard csv file

# Model 1
model_1 = Sequential([
    Dense(2, activation='softmax', input_shape=(input_shape,))
])

# Compile and build the model
model_1.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
model_1.summary()

# Plot the model architecture
plot_model(model_1, to_file='model_1_architecture.png', show_shapes=True, show_layer_names=True)

# Model 2
model_2 = Sequential([
    Dense(64, activation='relu', input_shape=(input_shape,)),
    BatchNormalization(),
    Dense(1, activation='sigmoid')
])

# Compile and build the model
model_2.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
model_2.summary()

# Plot the model architecture
plot_model(model_2, to_file='model_2_architecture.png', show_shapes=True, show_layer_names=True)

In [None]:
# K Means Clustering
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

# Load dataset
data = pd.read_csv('/content/creditcard.csv')
features = data.drop(columns=['Class'])  # I want to visualize the fraud and non-fraud transactions
class_labels = data['Class']  # Extract the class column

# Use the actual transaction features (the Class label was dropped above)
# rather than random values, so the plot reflects the real data
features_extracted = features.to_numpy()

# Standardize the features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(features_extracted)

# Reduce dimensionality to 3D
pca = PCA(n_components=3)
X_pca = pca.fit_transform(X_scaled)

# Apply k-means clustering
n_clusters = 3  
kmeans = KMeans(n_clusters=n_clusters, random_state=42)
labels = kmeans.fit_predict(X_pca)
centers = kmeans.cluster_centers_

# Plot the results
fig = plt.figure(figsize=(10, 7))
ax = fig.add_subplot(111, projection='3d')

# Scatter plot of the data points with class labels
scatter = ax.scatter(X_pca[:, 0], X_pca[:, 1], X_pca[:, 2], c=class_labels, cmap='bwr', marker='o', s=10, alpha=0.6)

# Plot cluster centers
ax.scatter(centers[:, 0], centers[:, 1], centers[:, 2], c='red', s=300, marker='X', label='Centroids')

# Labeling
ax.set_xlabel('PC1')
ax.set_ylabel('PC2')
ax.set_zlabel('PC3')
ax.legend()

# Add a colorbar for class labels
cbar = plt.colorbar(scatter, ax=ax, pad=0.1)
cbar.set_label('Class')
cbar.set_ticks([0, 1])
cbar.set_ticklabels(['Non-Fraud', 'Fraud'])

# Show plot
plt.title('3D PCA Projection of Fraud and Non-Fraudulent Transactions with K-Means Centroids')
plt.show()

Handling Outliers

To make my features less sensitive to outliers, particularly the very large transaction amounts, I use the RobustScaler to normalize Amount (Time is rescaled to the 0 to 1 range with min-max scaling instead). The RobustScaler centers and scales the data in two key steps:

Centering with the Median: 
I adjust each data point by subtracting the median value of the feature, which centers the data around zero.

Scaling with the Interquartile Range (IQR):
I first compute the IQR by arranging the data from smallest to largest and determining the 25th percentile (Q1) and the 75th percentile (Q3). The IQR is calculated as 
IQR = Q3 - Q1

I then divide each centered value by the IQR. The scaled feature has a median of 0 and an IQR of 1; unlike standard scaling, this does not produce a mean of 0 and a standard deviation of 1. Values far outside the Q1 to Q3 range (a common rule of thumb is more than 1.5 × IQR below Q1 or above Q3) are considered outliers. They are not removed, but because the median and the IQR barely change when a few extreme values are present, those values no longer dominate the scaling.
This approach makes my features more robust to outliers and ensures that the data is scaled appropriately for further analysis or modeling.
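
The two steps can be worked through by hand. The sketch below uses NumPy only and made-up amounts; it mirrors what RobustScaler computes with its default 25th to 75th percentile range, and compares the result with ordinary mean and standard deviation scaling.

```python
import numpy as np

# Hypothetical transaction amounts; the last value is an extreme outlier.
amounts = np.array([5.0, 12.0, 20.0, 35.0, 60.0, 25000.0])

median = np.median(amounts)
q1, q3 = np.percentile(amounts, [25, 75])
iqr = q3 - q1

# Robust scaling: center on the median, then divide by the IQR.
robust_scaled = (amounts - median) / iqr

# Standard scaling for comparison: the outlier inflates the mean and std.
standard_scaled = (amounts - amounts.mean()) / amounts.std()

print('median:', median, 'IQR:', iqr)
print('robust  :', np.round(robust_scaled, 2))
print('standard:', np.round(standard_scaled, 2))
```

With robust scaling the ordinary amounts stay spread out around zero and only the outlier is far away, whereas standard scaling squeezes the ordinary amounts into nearly the same value.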

In [None]:
from sklearn.preprocessing import RobustScaler
new_df = df.copy()
new_df['Amount'] = RobustScaler().fit_transform(new_df['Amount'].values.reshape(-1, 1))
time = new_df['Time']
new_df['Time'] = (time - time.min()) / (time.max() - time.min())
new_df.head()

Splitting the data into training and test data set

Shuffling the Dataset: The data was originally collected and stored based on the time of each credit card transaction. Shuffling the dataset ensures that it is randomly distributed across the training and testing sets, which helps in obtaining a more representative sample of the data.

Preventing Data Leakage: Shuffling prevents data leakage by ensuring that each subset (training, validation, test) is randomly sampled from the entire dataset. This maintains the overall distribution of features and target variables across all subsets, leading to more reliable evaluations and avoiding biased performance metrics.

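For this project, I will split the data into three distinct subsets:

Training Set: Used to train the model.
Testing Set: Used to evaluate the model’s performance.
Validation Set: Held-out data used to check the model on examples it was not trained on.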
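
Because fraud is rare, a purely random split can leave the smaller subsets with very few fraud cases. One option is a stratified split, which keeps the fraud ratio roughly equal in every subset. The sketch below is an alternative to slicing the shuffled DataFrame, not what this notebook does; the synthetic data, the 80/10/10 proportions, and the `_demo` variable names are assumptions.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)

# Synthetic stand-in: 10,000 rows, 30 features, about 1% positive labels.
X_demo = rng.normal(size=(10_000, 30))
y_demo = (rng.random(10_000) < 0.01).astype(int)

# First carve out the training set, then split the remainder into test and validation.
X_train_demo, X_rest, y_train_demo, y_rest = train_test_split(
    X_demo, y_demo, train_size=0.8, stratify=y_demo, random_state=1
)
X_test_demo, X_val_demo, y_test_demo, y_val_demo = train_test_split(
    X_rest, y_rest, test_size=0.5, stratify=y_rest, random_state=1
)

for name, labels in [('train', y_train_demo), ('test', y_test_demo), ('val', y_val_demo)]:
    print(f'{name}: {len(labels)} rows, fraud rate {labels.mean():.4f}')
```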


In [None]:
new_df = new_df.sample(frac=1, random_state=1)
new_df.head()

In [None]:
train, test, val = new_df[:240000], new_df[240000:265000], new_df[265000:]
train['Class'].value_counts(), test['Class'].value_counts(), val['Class'].value_counts()

Separating the Features and Labels in the Dataset
The features are my independent variables, which I use to predict the outcome. The labels are my dependent variable, which I aim to predict.

In my dataset:
Features: Every column except Class, as named in the header row (Time, the 28 fraud-pattern columns, and Amount).
Labels: The Class column indicates the outcome, where 0 represents a non-fraudulent transaction and 1 represents a fraudulent transaction.

In [None]:
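# train, test, and val are DataFrames, so select by position with .iloc and convert to NumPy arrays
x_train, y_train = train.iloc[:, :-1].to_numpy(), train.iloc[:, -1].to_numpy()
x_test, y_test = test.iloc[:, :-1].to_numpy(), test.iloc[:, -1].to_numpy()
x_val, y_val = val.iloc[:, :-1].to_numpy(), val.iloc[:, -1].to_numpy()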

In [None]:
print(type(train))
print(type(test))
print(type(val))

In [None]:
x_train, y_train = train.iloc[:, :-1].to_numpy(), train.iloc[:, -1].to_numpy()
x_test, y_test = test.iloc[:, :-1].to_numpy(), test.iloc[:, -1].to_numpy()
x_val, y_val = val.iloc[:, :-1].to_numpy(), val.iloc[:, -1].to_numpy()

# Print data shapes. The shape gives the dimensions of each array; for example, x_train has 240,000 rows (data points) and 30 columns (features).
print("x_train shape:", x_train.shape)
print("y_train shape:", y_train.shape)
print("x_test shape:", x_test.shape)
print("y_test shape:", y_test.shape)
print("x_val shape:", x_val.shape)
print("y_val shape:", y_val.shape)

In [None]:
from sklearn.linear_model import LogisticRegression
logistic_model = LogisticRegression()
logistic_model.fit(x_train, y_train)
logistic_model.score(x_train, y_train)

In [None]:
from sklearn.metrics import classification_report
print(classification_report(y_val, logistic_model.predict(x_val), target_names=['Not Fraud', 'Fraud']))

In [None]:
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix
import pandas as pd

# Predict on the validation set instead of hard-coding the labels,
# so the matrix shows the model's actual errors
y_pred = logistic_model.predict(x_val)

# Compute the confusion matrix
cm = confusion_matrix(y_val, y_pred, labels=[0, 1])

# Create a DataFrame
cm_df = pd.DataFrame(cm, index=['Not Fraud', 'Fraud'], columns=['Not Fraud', 'Fraud'])

# Plot the confusion matrix using Seaborn
plt.figure(figsize=(8, 6))
sns.heatmap(cm_df, annot=True, fmt='d', cmap='Blues', cbar=False,
            annot_kws={"size": 16}, linewidths=.5, linecolor='black')

plt.title('Confusion Matrix', fontsize=16)
plt.xlabel('Predicted Label', fontsize=14)
plt.ylabel('True Label', fontsize=14)
plt.show()

I’ve now reached a crossroads. Here the positive class is fraud and the negative class is not fraud. Accuracy is only a good indicator on balanced datasets, and this dataset has an obvious class imbalance: a model can score very high accuracy while still missing some of the actual fraudulent transactions. This makes sense, as machine learning models generally perform better with more data: a larger set of fraud examples would give a richer representation of the underlying patterns and reduce the risk of overfitting.
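
To see why accuracy is misleading here, consider a "model" that never predicts fraud at all. The sketch below uses made-up counts (19,800 legitimate transactions and 30 frauds), not results from this notebook.

```python
import numpy as np

# Hypothetical validation set: 19,800 legitimate transactions and 30 frauds.
y_true_demo = np.array([0] * 19_800 + [1] * 30)

# A useless "model" that labels every transaction as not fraud.
y_pred_demo = np.zeros_like(y_true_demo)

accuracy = (y_pred_demo == y_true_demo).mean()

# Recall for the fraud class: frauds caught / all actual frauds.
true_positives = np.sum((y_pred_demo == 1) & (y_true_demo == 1))
recall = true_positives / np.sum(y_true_demo == 1)

print(f'accuracy: {accuracy:.4f}')
print(f'fraud recall: {recall:.4f}')
```

The accuracy is above 99% even though not a single fraud is caught, which is why recall, precision, and AUC-ROC for the fraud class are more informative on this kind of data.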

There are two possible solutions:

Oversampling - Enhancing the training set with SMOTE (Synthetic Minority Over-sampling Technique) to generate synthetic samples for the minority class, thus helping the model better recognize fraudulent transactions.

Undersampling - Removing samples from the majority class to balance the class distribution, which can improve the model's ability to identify fraud without being overwhelmed by the abundance of non-fraudulent cases.

Let's explore each! 
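
Before using the library versions, here is a simplified NumPy sketch of both ideas. The oversampling part illustrates the interpolation step behind SMOTE (a new point is placed on the line between a minority sample and one of its nearest minority neighbors); it is not imbalanced-learn's implementation. The data is made up, and `smote_like_sample` is a hypothetical helper.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical 2-feature data: 1,000 majority rows and 10 minority (fraud) rows.
X_major = rng.normal(loc=0.0, scale=1.0, size=(1000, 2))
X_minor = rng.normal(loc=3.0, scale=0.5, size=(10, 2))


def smote_like_sample(minority, k=3):
    """Create one synthetic point between a minority row and one of its k nearest minority neighbors."""
    i = rng.integers(len(minority))
    distances = np.linalg.norm(minority - minority[i], axis=1)
    neighbors = np.argsort(distances)[1:k + 1]  # skip position 0, the point itself
    j = rng.choice(neighbors)
    lam = rng.random()  # position along the segment, in [0, 1)
    return minority[i] + lam * (minority[j] - minority[i])


# Oversampling: add synthetic minority rows until the classes are balanced.
n_new = len(X_major) - len(X_minor)
synthetic = np.array([smote_like_sample(X_minor) for _ in range(n_new)])
X_minor_over = np.vstack([X_minor, synthetic])

# Undersampling: keep only as many majority rows as there are minority rows.
keep = rng.choice(len(X_major), size=len(X_minor), replace=False)
X_major_under = X_major[keep]

print('oversampled minority:', X_minor_over.shape)
print('undersampled majority:', X_major_under.shape)
```

Oversampling keeps all the original data but adds artificial fraud rows, while undersampling keeps only real rows but throws most of the non-fraud data away.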

Addressing the data imbalance with SMOTE & Random Undersampling

In [None]:
!pip install imbalanced-learn
import numpy as np
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import train_test_split

# Use the real training split (240,000 rows, 30 features) rather than random placeholder data
x_train = train.iloc[:, :-1].to_numpy()
y_train = train.iloc[:, -1].to_numpy()

# Create an instance
smote = SMOTE()

# Fit and resample the training data
x_train_resampled, y_train_resampled = smote.fit_resample(x_train, y_train)

# Print the shape of the resampled data
print("Original dataset shape:", x_train.shape, y_train.shape)
print("Resampled dataset shape:", x_train_resampled.shape, y_train_resampled.shape)

In [None]:
# Evaluate Performance of Resampled Dataset
# Installing libraries needed
!pip install imbalanced-learn scikit-learn

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score, roc_curve
import matplotlib.pyplot as plt
from imblearn.over_sampling import SMOTE

# Use the real training and test splits rather than random example data
x_train, y_train = train.iloc[:, :-1].to_numpy(), train.iloc[:, -1].to_numpy()
x_test, y_test = test.iloc[:, :-1].to_numpy(), test.iloc[:, -1].to_numpy()

# Apply SMOTE to the training data
smote = SMOTE()
x_train_resampled, y_train_resampled = smote.fit_resample(x_train, y_train)

# Print the shape of the resampled data
print("Resampled dataset shape:", x_train_resampled.shape, y_train_resampled.shape)

# Train a classifier
classifier = RandomForestClassifier(n_estimators=100, random_state=42)
classifier.fit(x_train_resampled, y_train_resampled)

# Predict fraud probabilities on the test set
y_prob = classifier.predict_proba(x_test)[:, 1]

# Compute AUC-ROC
auc_roc = roc_auc_score(y_test, y_prob)
print("AUC-ROC score:", auc_roc)

# Plot ROC Curve
fpr, tpr, _ = roc_curve(y_test, y_prob)
plt.figure()
plt.plot(fpr, tpr, marker='.')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.show()

In [None]:
# Import necessary libraries
from imblearn.over_sampling import RandomOverSampler
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score, roc_curve, classification_report
import numpy as np
import matplotlib.pyplot as plt

# Use the real training rows (240,000 x 30) rather than random example data
x, y = train.iloc[:, :-1].to_numpy(), train.iloc[:, -1].to_numpy()

# Split data into training and test sets
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.1, random_state=42, stratify=y)

# Print original class distribution
print("Original class distribution:", dict(zip(*np.unique(y_train, return_counts=True))))

# Apply Random Oversampling
ros = RandomOverSampler(random_state=42)
x_train_resampled, y_train_resampled = ros.fit_resample(x_train, y_train)

# Print resampled class distribution
print("Resampled class distribution:", dict(zip(*np.unique(y_train_resampled, return_counts=True))))

# Train a classifier
classifier = RandomForestClassifier(n_estimators=100, random_state=42)
classifier.fit(x_train_resampled, y_train_resampled)

# Predict probabilities and labels on the test set
y_prob = classifier.predict_proba(x_test)[:, 1]
y_pred = classifier.predict(x_test)

# Compute AUC-ROC score
auc_roc = roc_auc_score(y_test, y_prob)
print("AUC-ROC score:", auc_roc)

# Print classification report
print("Classification Report:\n", classification_report(y_test, y_pred, target_names=['Not Fraud', 'Fraud']))

In [None]:
# Removing samples from the majority class ="Not Fraud"
!pip install imbalanced-learn scikit-learn
from imblearn.under_sampling import RandomUnderSampler
from sklearn.model_selection import train_test_split
import numpy as np
import pandas as pd

# Use the real training rows (240,000 x 30) rather than random data of the same shape
x, y = train.iloc[:, :-1].to_numpy(), train.iloc[:, -1].to_numpy()

# Split data into training and test sets
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.1, random_state=42, stratify=y)

# Print original class distribution
print("Original class distribution:", dict(zip(*np.unique(y_train, return_counts=True))))

# Apply Random Undersampling
rus = RandomUnderSampler(random_state=42)
x_train_resampled, y_train_resampled = rus.fit_resample(x_train, y_train)

# Print resampled class distribution
print("Resampled class distribution:", dict(zip(*np.unique(y_train_resampled, return_counts=True))))

# Import libraries needed for ML performance evaluation
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score, roc_curve
import matplotlib.pyplot as plt

# Train a classifier
classifier = RandomForestClassifier(n_estimators=100, random_state=42)
classifier.fit(x_train_resampled, y_train_resampled)

# Predict fraud probabilities on the test set
y_prob = classifier.predict_proba(x_test)[:, 1]

# Compute AUC-ROC score
auc_roc = roc_auc_score(y_test, y_prob)
print("AUC-ROC score:", auc_roc)

# Plot ROC Curve
fpr, tpr, _ = roc_curve(y_test, y_prob)
plt.figure()
plt.plot(fpr, tpr, marker='.')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.show()

In [None]:
# Print classification report
# Import necessary libraries
from imblearn.under_sampling import RandomUnderSampler
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score, roc_curve, classification_report
import numpy as np
import matplotlib.pyplot as plt

# Use the real training rows (240,000 x 30) rather than random example data
x, y = train.iloc[:, :-1].to_numpy(), train.iloc[:, -1].to_numpy()

# Split data into training and test sets
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.1, random_state=42, stratify=y)

# Print original class distribution
print("Original class distribution:", dict(zip(*np.unique(y_train, return_counts=True))))

# Apply Random Undersampling
rus = RandomUnderSampler(random_state=42)
x_train_resampled, y_train_resampled = rus.fit_resample(x_train, y_train)

# Print resampled class distribution
print("Resampled class distribution:", dict(zip(*np.unique(y_train_resampled, return_counts=True))))

# Train a classifier
classifier = RandomForestClassifier(n_estimators=100, random_state=42)
classifier.fit(x_train_resampled, y_train_resampled)

# Predict probabilities and labels on the test set
y_prob = classifier.predict_proba(x_test)[:, 1]
y_pred = classifier.predict(x_test)

# Compute AUC-ROC score
auc_roc = roc_auc_score(y_test, y_prob)
print("AUC-ROC score:", auc_roc)

# Print classification report
print("Classification Report:\n", classification_report(y_test, y_pred, target_names=['Not Fraud', 'Fraud']))

Hypertuning The Model

Hyperparameter tuning is performed to optimize the model's performance by finding the best combination of parameters that maximizes its accuracy and generalization on unseen data.

GridSearchCV with KFold: GridSearchCV tests all possible combinations of hyperparameters using cross-validation with KFold, which splits the data into k parts to ensure accurate performance evaluation.

RandomizedSearchCV with KFold: RandomizedSearchCV randomly tests a fixed number of hyperparameter combinations and uses KFold for cross-validation. This method is faster and more practical than GridSearchCV when there are many possible hyperparameter settings.
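
The cost difference comes down to counting model fits. The sketch below takes the two search spaces used in this notebook and multiplies out the number of combinations by the number of folds; it uses the standard library only and does not count the final refit on the best parameters. `n_combinations` is a hypothetical helper.

```python
from math import prod

# The two search spaces used in this notebook.
grid_space = {
    'n_estimators': [100, 150],
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [5],
    'min_samples_leaf': [2],
}
random_space = {
    'n_estimators': [50, 100, 150],
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
}
n_splits = 5  # KFold folds
n_iter = 8    # combinations sampled by the randomized search


def n_combinations(space):
    """Number of distinct settings in a grid: the product of the option counts."""
    return prod(len(values) for values in space.values())


grid_fits = n_combinations(grid_space) * n_splits
exhaustive_fits = n_combinations(random_space) * n_splits
random_fits = n_iter * n_splits

print('grid search fits:', grid_fits)
print('exhaustive search over the larger space:', exhaustive_fits)
print('randomized search fits:', random_fits)
```

The randomized search covers a much larger space for the same number of fits as the small grid, at the price of not trying every combination.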


In [None]:
#Hypertuning the model with KFold and GridSearchCV
import numpy as np
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score, make_scorer
from sklearn.datasets import make_classification

# Synthetic stand-in data with the same shape as the training set (240,000 rows, 30 features)
x, y = make_classification(n_samples=240000, n_features=30, n_classes=2, random_state=42)

# Defining our model, I've chosen Random Forest
model = RandomForestClassifier(random_state=42)

# Define the parameter grid
param_grid = {
    'n_estimators': [100, 150],  # Number of trees
    'max_depth': [None, 10, 20, 30],  # Maximum depth of the trees
    'min_samples_split': [5],  # Minimum number of samples required to split an internal node
    'min_samples_leaf': [2],  # Minimum number of samples required to be at a leaf node
}

# Define the scoring metric (AUC-ROC computed from predicted probabilities, not hard class labels)
scoring = 'roc_auc'

# Define KFold cross-validation
kf = KFold(n_splits=5, shuffle=True, random_state=42)

# Initialize GridSearchCV
grid_search = GridSearchCV(
    estimator=model,
    param_grid=param_grid,
    scoring=scoring,
    cv=kf,
    n_jobs=-1,  # Use all available cores
    verbose=2   # Verbosity level of 2 means we'll get messages showing us which combination of parameters are being used
)

# Fit GridSearchCV
grid_search.fit(x, y)

# Print best parameters and best score
print("Best Parameters:", grid_search.best_params_)
print("Best AUC-ROC Score:", grid_search.best_score_)

# To store/see results of all parameter combinations tested:
results = grid_search.cv_results_
for mean_score, params in zip(results['mean_test_score'], results['params']):
    print(f"Mean Test Score: {mean_score:.4f} for parameters: {params}")

# Retrieve the best model found by the search; it can be used to make predictions
best_model = grid_search.best_estimator_

In [None]:
# Hyperparameter Tuning with KFold and RandomizedSearchCV
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score, make_scorer
from sklearn.model_selection import KFold, RandomizedSearchCV
from joblib import parallel_backend

# Create a synthetic dataset with the same shape as the training set
x, y = make_classification(n_samples=240000, n_features=30, n_classes=2, random_state=42)

# Split dataset into training and test sets
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)

# Create a smaller subset of the training set for faster tuning (GridSearchCV was taking too long and draining CPU resources)
X_train_small, _, y_train_small, _ = train_test_split(X_train, y_train, train_size=0.1, random_state=42)

# Define the model
model = RandomForestClassifier(random_state=42)

# Define the parameter grid
param_grid = {
    'n_estimators': [50, 100, 150],
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

# Define the scoring metric (AUC-ROC computed from predicted probabilities, not hard class labels)
scoring = 'roc_auc'

# Define KFold cross-validation
kf = KFold(n_splits=5, shuffle=True, random_state=42)

# Define the number of iterations for RandomizedSearchCV
n_iter = 8  # We're testing 8 random samples

# Initialize RandomizedSearchCV
random_search = RandomizedSearchCV(
    estimator=model,
    param_distributions=param_grid,
    n_iter=n_iter,
    scoring=scoring,
    cv=kf,
    n_jobs=-1,
    verbose=1,
    random_state=42
)

# Fit RandomizedSearchCV on the smaller subset
with parallel_backend('threading', n_jobs=-1):
    random_search.fit(X_train_small, y_train_small)

# Print best parameters and best score
print("Best Parameters:", random_search.best_params_)
print("Best AUC-ROC Score:", random_search.best_score_)

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, roc_auc_score, roc_curve
import matplotlib.pyplot as plt

# Training the model with the best parameters on the full training set
best_model = RandomForestClassifier(
    n_estimators=150,
    min_samples_split=2,
    min_samples_leaf=1,
    max_depth=30,
    random_state=42
)
best_model.fit(X_train, y_train)

# Predict class probabilities and labels on the test set
y_prob = best_model.predict_proba(X_test)[:, 1]
y_pred = best_model.predict(X_test)

# Compute AUC-ROC score
auc_roc = roc_auc_score(y_test, y_prob)
print("AUC-ROC score on test set:", auc_roc)

# Print classification report
print("Classification Report:\n", classification_report(y_test, y_pred, target_names=['Not Fraud', 'Fraud']))

# Plot ROC Curve
fpr, tpr, _ = roc_curve(y_test, y_prob)
plt.figure()
plt.plot(fpr, tpr, marker='.')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.show()

Creating A Neural Network

A neural network can learn non-linear interactions between transaction features from large amounts of data, which lets it pick up subtle, non-obvious signs of fraud that a simpler model may miss. Whether it actually beats a simpler model depends on the data, so it is evaluated below rather than assumed to be better.
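
To make concrete what a network with one hidden layer computes, here is a minimal pure-Python sketch of a forward pass: a dense layer with ReLU, followed by a single sigmoid output that can be read as a fraud probability. The weights, the two-feature input, and the helper names (`dense`, `shallow_forward`) are made up for illustration; they are not the weights Keras would learn, and batch normalization is left out to keep the sketch short.

```python
import math

def relu(values):
    # ReLU keeps positive activations and zeroes out negative ones
    return [max(0.0, v) for v in values]

def sigmoid(z):
    # Squashes any real number into the (0, 1) range
    return 1.0 / (1.0 + math.exp(-z))

def dense(inputs, weights, biases):
    # weights[j] holds the incoming weights of unit j
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]

def shallow_forward(x, w_hidden, b_hidden, w_out, b_out):
    hidden = relu(dense(x, w_hidden, b_hidden))
    logit = dense(hidden, [w_out], [b_out])[0]
    return sigmoid(logit)

# Hypothetical weights: 2 input features, 2 hidden units, 1 output unit
W_HIDDEN = [[1.0, -1.0], [0.5, 0.5]]
B_HIDDEN = [0.0, -1.0]
W_OUT = [2.0, -1.0]
B_OUT = 0.0

p_fraud = shallow_forward([3.0, 1.0], W_HIDDEN, B_HIDDEN, W_OUT, B_OUT)
print(round(p_fraud, 4))
```

For the input `[3.0, 1.0]` the hidden activations are `[2.0, 1.0]`, the output logit is 3, and the sigmoid maps that to a probability of about 0.95.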

In [None]:
# Creating A Neural Network
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import InputLayer, Dense, BatchNormalization
from tensorflow.keras.callbacks import ModelCheckpoint

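# Define the input shape (number of feature columns); x_train is only created
# in a later cell, so use the full feature matrix x here
input_shape = x.shape[1]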

# Create a Sequential model
shallow_nn = Sequential()

# Add InputLayer with the correct shape parameter
shallow_nn.add(InputLayer(shape=(input_shape,)))

# Add Dense layers with appropriate activation functions
# Hidden layer: 64 ReLU units that process the input features
shallow_nn.add(Dense(64, activation='relu'))
# BatchNormalization: rescales the hidden activations to roughly zero mean and
# unit variance per batch, which stabilizes the activations and gradients
shallow_nn.add(BatchNormalization())
# Output layer: sigmoid returns a value between 0 and 1, interpreted as the
# probability of the positive class
shallow_nn.add(Dense(1, activation='sigmoid'))

# Define ModelCheckpoint
checkpoint = ModelCheckpoint('shallow_nn.keras', save_best_only=True)

# Compile the model
shallow_nn.compile(
    loss='binary_crossentropy',
    optimizer='adam',
    metrics=['accuracy']
)

# Print model summary to verify
shallow_nn.summary()

In [None]:
from sklearn.model_selection import train_test_split

# Split the data into training and validation sets
x_train, x_val, y_train, y_val = train_test_split(x, y, test_size=0.2, random_state=42)

# Train the network; the checkpoint callback keeps the best model seen on the validation set
shallow_nn.fit(x_train, y_train, validation_data=(x_val, y_val), epochs=5, callbacks=[checkpoint])

In [None]:
# The model outputs its confidence that each sample belongs to the positive class
# (i.e. the transaction is fraudulent); threshold at 0.5 to get 0/1 labels
def neural_net_predictions(model, x):
    return (model.predict(x).flatten() > 0.5).astype(int)

neural_net_predictions(shallow_nn, x_val)

In [None]:
from sklearn.metrics import classification_report
print(classification_report(y_val, neural_net_predictions(shallow_nn, x_val), target_names=['Not Fraud', 'Fraud']))

Using Ensemble Learning

In ensemble learning, several models are trained independently and their predictions are combined into a final decision. Because different models tend to make different mistakes, the combination is often more accurate and more stable than any single model, which is why the approach is widely used in real-world applications.
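
One simple way to combine predictions is soft voting, which averages the fraud probabilities reported by each model. The sketch below is a pure-Python illustration under that assumption; the function name `soft_vote` and the probability values are hypothetical and do not come from the models in this notebook.

```python
def soft_vote(prob_lists, weights=None):
    """Return the (optionally weighted) average probability per transaction."""
    if weights is None:
        weights = [1.0] * len(prob_lists)
    total = sum(weights)
    n_samples = len(prob_lists[0])
    return [
        sum(w * probs[i] for w, probs in zip(weights, prob_lists)) / total
        for i in range(n_samples)
    ]

# Hypothetical fraud probabilities from three models for four transactions
rf_probs = [0.10, 0.80, 0.40, 0.95]
nn_probs = [0.20, 0.60, 0.70, 0.90]
gb_probs = [0.00, 0.70, 0.30, 0.85]

avg_probs = soft_vote([rf_probs, nn_probs, gb_probs])
labels = [int(p > 0.5) for p in avg_probs]  # 0.5 threshold, as elsewhere in this notebook
print([round(p, 2) for p in avg_probs], labels)
```

The third transaction shows the effect: one model flags it (0.70), but the average stays below the threshold, so the ensemble does not.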

In [None]:
import pandas as pd
import numpy as np

# Load the bank transactions CSV
df_bank = pd.read_csv('/content/bank_transactions.csv')

# Strip leading/trailing spaces from column names first so the names below match
df_bank.columns = df_bank.columns.str.strip()

# Drop unnecessary columns
df_bank = df_bank.drop(columns=['TRANSACTION DETAILS', 'CHQ.NO.', 'VALUE DATE'], errors='ignore')

# Convert DATE column to datetime format if it exists
if 'DATE' in df_bank.columns:
    df_bank['DATE'] = pd.to_datetime(df_bank['DATE'], format='%d-%b-%y', errors='coerce')
    # Extract date features
    df_bank['YEAR'] = df_bank['DATE'].dt.year
    df_bank['MONTH'] = df_bank['DATE'].dt.month
    df_bank['DAY'] = df_bank['DATE'].dt.day
    df_bank['DAYOFWEEK'] = df_bank['DATE'].dt.dayofweek

# Convert monetary columns to numeric
def convert_monetary_column(df, column_name):
    if column_name in df.columns:
        # Raw string avoids an invalid-escape warning for the regex '\$'
        df[column_name] = df[column_name].replace({r'\$': '', ',': ''}, regex=True)
        df[column_name] = pd.to_numeric(df[column_name], errors='coerce')

convert_monetary_column(df_bank, 'WITHDRAWAL AMT')
convert_monetary_column(df_bank, 'DEPOSIT AMT')
convert_monetary_column(df_bank, 'BALANCE AMT')

# Apply log transformation to the monetary columns
def apply_log_transformation(df, column_name):
    if column_name in df.columns:
        # log1p computes log(1 + x), so zero amounts map to 0 instead of -inf
        df[f'{column_name}_log'] = np.log1p(df[column_name].fillna(0))
    else:
        print(f"Column {column_name} does not exist in DataFrame")

apply_log_transformation(df_bank, 'WITHDRAWAL AMT')
apply_log_transformation(df_bank, 'DEPOSIT AMT')
apply_log_transformation(df_bank, 'BALANCE AMT')

# Print columns to verify
print(df_bank.columns)

def classify_fraud(row):
    # Ensure all conditions are based on columns in my DataFrame
    if row['WITHDRAWAL AMT'] > 1000 and row['DEPOSIT AMT'] == 0:
        return 'HighValueTransactions'
    elif row['WITHDRAWAL AMT'] < 10 and row['DEPOSIT AMT'] < 10:
        return 'LowValueTransactions'
    else:
        return 'Unknown'

df_bank['FRAUD_TYPE'] = df_bank.apply(classify_fraud, axis=1)

# Print the first few rows to verify the data
print(df_bank[['WITHDRAWAL AMT', 'WITHDRAWAL AMT_log']].head())
print(df_bank[['DEPOSIT AMT', 'DEPOSIT AMT_log']].head())
print(df_bank[['BALANCE AMT', 'BALANCE AMT_log']].head())
print(df_bank[['FRAUD_TYPE']].head())

In [None]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.model_selection import train_test_split
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.utils import to_categorical
from sklearn.metrics import accuracy_score, classification_report
import matplotlib.pyplot as plt

# Load the bank transactions CSV
df_bank = pd.read_csv('/content/bank_transactions.csv')

# Check for leading/trailing spaces in column names
df_bank.columns = df_bank.columns.str.strip()

# Print columns to debug
print("Columns in DataFrame:", df_bank.columns)

# Define a function to convert monetary columns to numeric
def convert_monetary_column(df, column_name):
    if column_name in df.columns:
        # Raw string avoids an invalid-escape warning for the regex '\$'
        df[column_name] = df[column_name].replace({r'\$': '', ',': ''}, regex=True)
        df[column_name] = pd.to_numeric(df[column_name], errors='coerce')

# Convert monetary columns
convert_monetary_column(df_bank, 'WITHDRAWAL AMT')
convert_monetary_column(df_bank, 'DEPOSIT AMT')
convert_monetary_column(df_bank, 'BALANCE AMT')

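# Handle missing values (assign back instead of using inplace=True on a column,
# which is deprecated in recent pandas versions)
df_bank['WITHDRAWAL AMT'] = df_bank['WITHDRAWAL AMT'].fillna(0)
df_bank['DEPOSIT AMT'] = df_bank['DEPOSIT AMT'].fillna(0)
df_bank['BALANCE AMT'] = df_bank['BALANCE AMT'].fillna(df_bank['BALANCE AMT'].mean())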

# Define the function to classify fraud types
def classify_fraud(row):
    if row['WITHDRAWAL AMT'] > 1000 and row['DEPOSIT AMT'] == 0:
        return 'HighValueTransactions'
    elif row['WITHDRAWAL AMT'] < 10 and row['DEPOSIT AMT'] < 10:
        return 'LowValueTransactions'
    else:
        return 'Unknown'

# Apply the classify_fraud function
df_bank['FRAUD_TYPE'] = df_bank.apply(classify_fraud, axis=1)

# Verify if 'FRAUD_TYPE' has been added
print(df_bank[['FRAUD_TYPE']].head())

# Check if 'FRAUD_TYPE' column exists
if 'FRAUD_TYPE' not in df_bank.columns:
    raise ValueError("The 'FRAUD_TYPE' column is missing from the DataFrame.")

# Filter out rows with 'Unknown' fraud type
df_fraud = df_bank[df_bank['FRAUD_TYPE'] != 'Unknown']
X_fraud = df_fraud.drop(columns=['FRAUD_TYPE'])
y_fraud = df_fraud['FRAUD_TYPE']

# Ensure all feature columns are numeric
X_fraud = X_fraud.apply(pd.to_numeric, errors='coerce')

# Fill missing values if any
X_fraud.fillna(0, inplace=True)

# Scale features
scaler = StandardScaler()
X_fraud_scaled = scaler.fit_transform(X_fraud)

# Encode target variable
label_encoder = LabelEncoder()
y_fraud_encoded = label_encoder.fit_transform(y_fraud)
y_fraud_encoded = to_categorical(y_fraud_encoded)

# Split data into training and testing sets
X_train_fraud, X_test_fraud, y_train_fraud, y_test_fraud = train_test_split(
    X_fraud_scaled, y_fraud_encoded, test_size=0.3, random_state=42
)

# Define the neural network model
model = Sequential([
    Dense(64, input_dim=X_train_fraud.shape[1], activation='relu'),
    Dense(32, activation='relu'),
    Dense(y_train_fraud.shape[1], activation='softmax')
])

# Compile the model
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

# Train the model
history = model.fit(
    X_train_fraud, y_train_fraud,
    validation_split=0.2,
    epochs=20,
    batch_size=32,
    verbose=1
)

# Evaluate the model
y_pred_prob = model.predict(X_test_fraud)
y_pred = np.argmax(y_pred_prob, axis=1)
y_test = np.argmax(y_test_fraud, axis=1)

print("Model evaluation:")
print(f"Accuracy: {accuracy_score(y_test, y_pred)}")
print(classification_report(y_test, y_pred, target_names=label_encoder.classes_))

# Plot training history
plt.plot(history.history['accuracy'], label='accuracy')
plt.plot(history.history['val_accuracy'], label='val_accuracy')
plt.xlabel('Epoch')
plt.ylabel('Accuracy')
plt.ylim([0, 1])
plt.legend(loc='lower right')
plt.show()

Using Gradient Boost to classify the Fraud Patterns

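Gradient Boosting builds an ensemble of weak models (shallow trees) one at a time, fitting each new tree to the errors the current ensemble still makes, so the harder-to-classify cases, such as rare fraud patterns, receive more attention in later rounds. Shallow trees and a small learning rate help limit the risk of overfitting while the combined model becomes robust and effective.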
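
The stage-wise idea can be sketched in a few lines of pure Python. This is a simplified illustration, not what `GradientBoostingClassifier` does internally: it uses squared loss on a one-dimensional toy problem and decision stumps as weak learners, whereas the classifier optimizes a classification loss with deeper trees. All names (`fit_stump`, `gradient_boost`, `predict`) and the data are hypothetical.

```python
def fit_stump(xs, residuals):
    """Find the single threshold split that best fits the residuals (squared error)."""
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean

def gradient_boost(xs, ys, n_rounds=20, learning_rate=0.1):
    base = sum(ys) / len(ys)          # start from the mean prediction
    preds = [base] * len(ys)
    stumps, history = [], []
    for _ in range(n_rounds):
        # For squared loss, the negative gradient is simply the residual
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * stump(x) for p, x in zip(preds, xs)]
        history.append(sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys))
    return base, stumps, history

def predict(x, base, stumps, learning_rate=0.1):
    return base + learning_rate * sum(stump(x) for stump in stumps)

xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [0.0, 0.0, 0.0, 1.0, 1.0, 1.0]
base, stumps, history = gradient_boost(xs, ys)
print(round(history[0], 4), round(history[-1], 4))
```

The training error shrinks a little every round because each stump only corrects a fraction (`learning_rate`) of the remaining error. That slow, additive correction is the mechanism that `n_estimators` and `learning_rate` control in the cell below.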

In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
import seaborn as sns
import matplotlib.pyplot as plt
from imblearn.over_sampling import SMOTE

# Load data
bank_data = pd.read_csv('/content/bank_transactions.csv')
creditcard_data = pd.read_csv('/content/creditcard.csv')

# Remove extra spaces from column names
bank_data.columns = bank_data.columns.str.strip()
creditcard_data.columns = creditcard_data.columns.str.strip()

# Ensure the 'Fraud_Type' column is present in the credit card data
if 'Fraud_Type' not in creditcard_data.columns:
    raise KeyError("The 'Fraud_Type' column is missing from creditcard_data.")

# Convert relevant columns to numeric, coercing errors to NaN
numeric_columns = ['WITHDRAWAL AMT', 'DEPOSIT AMT', 'BALANCE AMT']
for col in numeric_columns:
    bank_data[col] = pd.to_numeric(bank_data[col], errors='coerce')
    creditcard_data[col] = pd.to_numeric(creditcard_data[col], errors='coerce')

# Fill missing values with 0
bank_data[numeric_columns] = bank_data[numeric_columns].fillna(0)
creditcard_data[numeric_columns] = creditcard_data[numeric_columns].fillna(0)

# Prepare features and target variables
features = ['WITHDRAWAL AMT', 'DEPOSIT AMT', 'BALANCE AMT']
X_bank = bank_data[features]
y_bank = bank_data['FRAUD_TYPE']
X_creditcard = creditcard_data[features]
y_creditcard = creditcard_data['Fraud_Type']

# Combine features and target variables from both datasets
X_combined = pd.concat([X_bank, X_creditcard], axis=0)
y_combined = pd.concat([y_bank, y_creditcard], axis=0)

# Encode target variable
label_encoder = LabelEncoder()
y_encoded = label_encoder.fit_transform(y_combined)

# Split the data while maintaining class proportions
X_train, X_test, y_train, y_test = train_test_split(X_combined, y_encoded, test_size=0.2, stratify=y_encoded, random_state=42)

# Resample the training data using SMOTE
smote = SMOTE(random_state=42)
X_train_resampled, y_train_resampled = smote.fit_resample(X_train, y_train)

# Initialize and fit the scaler
scaler = StandardScaler()
X_train_resampled = scaler.fit_transform(X_train_resampled)
X_test = scaler.transform(X_test)

# Build Gradient Boosting model
model = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, max_depth=3, random_state=42)
model.fit(X_train_resampled, y_train_resampled)

# Predict and evaluate the model
y_pred = model.predict(X_test)

print("Accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred, target_names=label_encoder.classes_))

# Confusion matrix
conf_matrix = confusion_matrix(y_test, y_pred)
plt.figure(figsize=(12, 10))
sns.heatmap(conf_matrix, annot=True, fmt='d', cmap='Blues', xticklabels=label_encoder.classes_, yticklabels=label_encoder.classes_)
plt.xlabel('Predicted')
plt.ylabel('True')
plt.title('Confusion Matrix')
plt.show()

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingClassifier
from imblearn.over_sampling import SMOTE

# Load and prepare your data
# Assuming X and y are already defined

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)

# Check class distribution
print("Train set class distribution:\n", y_train.value_counts())
print("Test set class distribution:\n", y_test.value_counts())

# Apply SMOTE
smote = SMOTE(random_state=42)
X_train_resampled, y_train_resampled = smote.fit_resample(X_train, y_train)

print("Resampled train set class distribution:\n", pd.Series(y_train_resampled).value_counts())

# Initialize and fit the model
model = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, max_depth=3, random_state=42)
model.fit(X_train_resampled, y_train_resampled)

# Evaluate the model
y_pred = model.predict(X_test)


Stacking the Models Together

By stacking models together, I combine their strengths and leverage their diverse perspectives to enhance overall performance, resulting in more accurate and robust predictions.
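
In scikit-learn, `StackingClassifier` trains its final estimator on cross-validated (out-of-fold) predictions of the base models, so the final estimator never sees a prediction made on a row that the base model was trained on. The pure-Python sketch below illustrates only that out-of-fold mechanism. The helper names and the deliberately trivial base learner (it predicts the fraud rate of its training folds) are assumptions for illustration.

```python
def k_fold_indices(n_samples, n_splits):
    """Split range(n_samples) into contiguous folds of near-equal size."""
    sizes = [n_samples // n_splits + (1 if i < n_samples % n_splits else 0)
             for i in range(n_splits)]
    folds, start = [], 0
    for size in sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds

def out_of_fold_predictions(labels, fit, n_splits=3):
    """Predict each row with a model fitted on the other folds only."""
    oof = [None] * len(labels)
    for held_out in k_fold_indices(len(labels), n_splits):
        held = set(held_out)
        train_labels = [labels[i] for i in range(len(labels)) if i not in held]
        model = fit(train_labels)
        for i in held_out:
            oof[i] = model(i)
    return oof

# Hypothetical base learner: ignores features and predicts the training fraud rate
def fit_prior(train_labels):
    rate = sum(train_labels) / len(train_labels)
    return lambda _row: rate

toy_labels = [0, 0, 1, 1, 1, 0]
meta_feature = out_of_fold_predictions(toy_labels, fit_prior)
print(meta_feature)  # [0.75, 0.75, 0.25, 0.25, 0.5, 0.5]
```

Each value in `meta_feature` would become one input column for the final estimator, with one such column per base model.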

In [None]:
# Import required libraries
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.ensemble import StackingClassifier
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report, confusion_matrix
import matplotlib.pyplot as plt
import seaborn as sns

# Preparing the data
# Create an example data set (random features and labels, seeded for reproducibility)
num_samples = 1000
num_features = 20
rng = np.random.default_rng(42)
X = rng.random((num_samples, num_features))
y = rng.integers(0, 2, num_samples)

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Defining the base models
# I've selected the random forest model and the neural network
random_forest = RandomForestClassifier(n_estimators=100, random_state=42)
neural_network = MLPClassifier(hidden_layer_sizes=(50,), max_iter=100, random_state=42)

# Defining and creating the stacking classifier
stacking_model = StackingClassifier(
    estimators=[
        ('rf', random_forest),
        ('nn', neural_network)
    ],
    final_estimator=RandomForestClassifier(n_estimators=50, random_state=42)  
)

# Define a pipeline that includes scaling and stacking
pipeline = Pipeline(steps=[
    ('scaler', StandardScaler()),
    ('stacking', stacking_model)
])

# Training and evaluating the model
# Fit the pipeline on the training data
pipeline.fit(X_train, y_train)

# Predict on the test data
y_pred = pipeline.predict(X_test)

# Evaluate the model
print("Classification Report:\n", classification_report(y_test, y_pred))

# Plot confusion matrix
conf_matrix = confusion_matrix(y_test, y_pred)
plt.figure(figsize=(8, 6))
sns.heatmap(conf_matrix, annot=True, fmt='d', cmap='Blues', cbar=False)
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Confusion Matrix')
plt.show()


Hypertuning The Stacked Model

I tuned the stacked model's hyperparameters with a randomized search, looking for the combination of base-model and final-estimator settings that gives the best cross-validated accuracy.
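
As a rough picture of what a randomized search does, the pure-Python sketch below draws a fixed number of candidates from a search space, scores each one, and keeps the best. The function `sample_and_score`, the search space, and the toy scoring function are hypothetical stand-ins; in the real search each score is a cross-validated metric of the fitted pipeline. Note that `scipy.stats.randint(50, 201)` draws integers from 50 to 200 inclusive, and `scipy.stats.uniform(0.1, 10)` takes `loc` and `scale`, so it draws from the interval [0.1, 10.1]; the samplers below mirror those ranges.

```python
import random

def sample_and_score(objective, space, n_iter, seed=42):
    """Draw n_iter random candidates from `space` and return the best one."""
    rng = random.Random(seed)
    best_params, best_score = None, float("-inf")
    for _ in range(n_iter):
        params = {name: sampler(rng) for name, sampler in space.items()}
        score = objective(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

# Hypothetical search space mirroring the distributions used in this notebook
space = {
    "n_estimators": lambda rng: rng.randint(50, 200),          # inclusive on both ends
    "max_depth": lambda rng: rng.choice([None, 10, 20, 30, 40]),
    "C": lambda rng: rng.uniform(0.1, 10.1),
}

# Toy objective standing in for a cross-validated score (higher is better)
def toy_score(params):
    return -abs(params["n_estimators"] - 120) / 100 - abs(params["C"] - 1.0) / 10

best_params, best_score = sample_and_score(toy_score, space, n_iter=30)
print(best_params, round(best_score, 3))
```

Fixing the seed makes the sampled candidates repeatable, which is the role `random_state=42` plays in `RandomizedSearchCV`.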

In [None]:
# Randomly sample a smaller subset of the data to keep the search fast (sampling also shuffles the rows)
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split, RandomizedSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.linear_model import LogisticRegression
from imblearn.pipeline import Pipeline as ImbPipeline
from scipy.stats import randint, uniform
from sklearn.metrics import classification_report, confusion_matrix
import matplotlib.pyplot as plt
import seaborn as sns

# Load the dataset
data = pd.read_csv('/content/creditcard.csv')

# Create a smaller dataset: instead of 240K rows, we'll look at 25K
data_small = data.sample(n=25000, random_state=42)

# Prepare features and target
X = data_small.drop(columns=['Class'])
y = data_small['Class']

# Split the dataset (stratify so the rare fraud class appears in both train and test sets)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, stratify=y, random_state=42)

# Define base models
base_estimators = [
    ('rf', RandomForestClassifier(random_state=42)),
    ('nn', MLPClassifier(random_state=42))
]

# Define the stacking classifier
stacking_model = StackingClassifier(
    estimators=base_estimators,
    final_estimator=LogisticRegression()
)

# Define a pipeline with preprocessing and stacking
pipeline = ImbPipeline(steps=[
    ('scaler', StandardScaler()),
    ('stacking', stacking_model)
])

# Defining a parameter grid for RandomizedSearchCV
param_distributions = {
    'stacking__rf__n_estimators': randint(50, 201),
    'stacking__rf__max_depth': [None, 10, 20, 30, 40],
    'stacking__nn__hidden_layer_sizes': [(50,), (100,), (50, 50), (100, 100)],
    'stacking__nn__activation': ['tanh', 'relu'],
    'stacking__final_estimator__C': uniform(0.1, 10),
    'stacking__final_estimator__penalty': ['l2']
}

# Setting up RandomizedSearchCV
random_search = RandomizedSearchCV(pipeline, param_distributions, n_iter=30, cv=3, n_jobs=-1, scoring='accuracy', random_state=42)

# Fit RandomizedSearchCV
random_search.fit(X_train, y_train)

# Get the best parameters and score
print("Best Parameters:\n", random_search.best_params_)
print("Best Score:\n", random_search.best_score_)

# Evaluate on the test set
y_pred = random_search.predict(X_test)
print("Classification Report:\n", classification_report(y_test, y_pred))

# Plot confusion matrix
conf_matrix = confusion_matrix(y_test, y_pred)
plt.figure(figsize=(8, 6))
sns.heatmap(conf_matrix, annot=True, fmt='d', cmap='Blues', cbar=False)
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Confusion Matrix')
plt.show()

ROC Curve and AUC Score of the Stacked Ensemble Model

I used ROC AUC for performance testing because it measures how well the model separates the two classes across all classification thresholds, by tracing its true positive rate against its false positive rate, rather than at a single cut-off such as 0.5.
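
An equivalent way to read the AUC is as the probability that a randomly chosen fraudulent transaction receives a higher score than a randomly chosen legitimate one. The pure-Python sketch below computes it directly from that definition. The function name `pairwise_auc` and the toy scores are illustrative assumptions, and the double loop is only practical for small inputs.

```python
def pairwise_auc(y_true, y_score):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count half."""
    positives = [s for y, s in zip(y_true, y_score) if y == 1]
    negatives = [s for y, s in zip(y_true, y_score) if y == 0]
    wins = 0.0
    for pos in positives:
        for neg in negatives:
            if pos > neg:
                wins += 1.0
            elif pos == neg:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
print(pairwise_auc(toy_labels, toy_scores))  # 0.75
```

Three of the four positive/negative pairs are ordered correctly here, giving 0.75. A score of 0.5 corresponds to random ranking and 1.0 to perfect separation.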

In [None]:
from sklearn.metrics import roc_auc_score, roc_curve

# Get predicted probabilities for the positive class (1) from the test set
y_prob = random_search.predict_proba(X_test)[:, 1]

# Compute the ROC AUC score
roc_auc = roc_auc_score(y_test, y_prob)
print("ROC AUC Score:", roc_auc)

# Optional: Plot the ROC Curve
fpr, tpr, _ = roc_curve(y_test, y_prob)

plt.figure(figsize=(8, 6))
plt.plot(fpr, tpr, marker='o', label=f'ROC Curve (AUC = {roc_auc:.2f})')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic (ROC) Curve')
plt.legend()
plt.grid()
plt.show()