# Credit Card Fraud Detection Project 

### ADASYN and SMOTE oversampling techniques, Decision Trees and LightGBM, SHAP values

## Introduction

This project addresses fraud detection in credit card transactions. I explore the effectiveness of two oversampling techniques (ADASYN and SMOTE) combined with Decision Tree and LightGBM models. Additionally, I examine the interpretability of these models using decision tree plots, feature importances, and SHAP (SHapley Additive exPlanations) values.

## Dataset Information

The dataset contains transactions made by European cardholders in September 2013. It consists of transactions that occurred over two days, with a total of 284,807 transactions, out of which only 492 are fraudulent. Consequently, the dataset is highly imbalanced, with frauds accounting for only 0.172% of all transactions, which must be accounted for during modeling.
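
To make the imbalance concrete, the short sketch below redoes the arithmetic from the counts quoted above. It also shows why plain accuracy is a poor yardstick here: a rule that always predicts "non-fraud" is almost always right while catching no fraud at all. The variable names are illustrative only.

```python
# Counts quoted in the dataset description above
n_total = 284_807
n_fraud = 492

# Share of transactions that are fraudulent
fraud_rate = n_fraud / n_total

# A trivial rule that always predicts 0 (non-fraud) is correct on every
# legitimate transaction and wrong on every fraud
baseline_accuracy = (n_total - n_fraud) / n_total

print(f"Fraud rate: {fraud_rate:.3%}")
print(f"Accuracy of always predicting non-fraud: {baseline_accuracy:.3%}")
```

This is why the rest of the notebook focuses on ROC AUC, recall, and precision for the fraud class rather than accuracy.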

The dataset consists of numerical input variables resulting from a Principal Component Analysis (PCA) transformation. The PCA-transformed features are labeled as V1, V2, (...), V28, while the 'Time' and 'Amount' features are not transformed. The 'Time' feature represents the seconds elapsed between each transaction and the first transaction in the dataset, while the 'Amount' feature denotes the transaction amount.

The response variable, 'Class', takes a value of 1 in case of fraud and 0 otherwise.

The dataset can be accessed [here](https://www.kaggle.com/datasets/mlg-ulb/creditcardfraud/data).


In [None]:
# Import necessary libraries

import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from lightgbm import LGBMClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import KFold, cross_val_score
from sklearn.preprocessing import QuantileTransformer
from sklearn.model_selection import RandomizedSearchCV
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.metrics import confusion_matrix, roc_curve, auc, classification_report, accuracy_score, roc_auc_score
from imblearn.over_sampling import SMOTE, ADASYN
from sklearn.tree import plot_tree
import shap

import warnings
warnings.filterwarnings("ignore")

# Set seed
np.random.seed(42)
#Read in the dataset
df = pd.read_csv("/kaggle/input/creditcardfraud/creditcard.csv")
df.drop(columns='Time', inplace=True)

#Set color palette
sns.set_palette('seismic')
%matplotlib inline

In [None]:
#Look at the dataset summary and drop any duplicated rows
print(df.shape)
print(df.head())
print(df.info())
print(df.describe())
print(df.median())
df.drop_duplicates(inplace=True)

## Exploratory Data Analysis

As mentioned previously, the dataset contains very few fraudulent transactions. This poses a challenge, because the dataset is highly imbalanced.

In [None]:
print(df['Class'].value_counts())
plt.figure(figsize=(8, 6))
df['Class'].value_counts().plot(kind='bar')
plt.title('Distribution of Class')
plt.xlabel('Class')
plt.ylabel('Frequency')
plt.show()

### Distributions of features

In [None]:
# Function for plotting histograms of numerical columns in a DataFrame
def plot_hist(df, color='navy'):
    '''
    Plot Histogram for Numerical Features in a DataFrame
    
    Filters numerical columns from the provided DataFrame and plots their histograms for exploratory data analysis.
    
    '''
    # Select numerical columns from the DataFrame
    num_columns = df.select_dtypes(include='number').columns
    
    # Check if there are any numerical columns
    if len(num_columns) == 0:
        print("No numerical columns found in the DataFrame.")
        return
    
    # Calculate the number of rows needed for subplots
    number_columns = len(num_columns)
    number_rows = (number_columns + 1) // 2
    
    # Create subplots
    fig, axes = plt.subplots(number_rows, 2, figsize=(12, 6*number_rows))
    axes = axes.flatten()
    plt.subplots_adjust(hspace=0.5)
    
    # Loop through numerical columns and plot histogram
    for i, col in enumerate(num_columns):
        ax = axes[i]
        sns.histplot(data=df, x=col, color=color, ax=ax)
        ax.set_title(f'Histogram of {col}')
        ax.set_xlabel(None)
        ax.set_ylabel('Count')
    
    # Remove excess subplot if the number of columns is odd
    if number_columns % 2 != 0:
        fig.delaxes(axes[-1])

    plt.show()


In [None]:
plot_hist(df)

In [None]:
# Calculate the IQR of columns
for column in df.columns:
    print(f"Checking {column}...")
    print()
    q1 = df[column].quantile(0.25)
    q3 = df[column].quantile(0.75)
    iqr = q3 - q1
    print(f"IQR: {iqr}")

    # Identify lower and upper outliers
    lower_outliers = df[df[column] < q1 - 1.5 * iqr]
    upper_outliers = df[df[column] > q3 + 1.5 * iqr]

    # Calculate the number of lower outliers and upper outliers
    num_lower_outliers = len(lower_outliers)
    num_upper_outliers = len(upper_outliers)
    num_total_outliers = num_lower_outliers + num_upper_outliers
    print(f"Number of lower outliers: {num_lower_outliers}")
    print(f"Number of upper outliers: {num_upper_outliers}")
    print(f"Total number of outliers: {num_total_outliers}")

    # Calculate proportion of outliers in the dataset
    proportion_outliers = num_total_outliers / len(df) * 100
    print(f"Proportion of outliers in the dataset: {proportion_outliers:.2f}%")
    print()


After analyzing the outliers identified through the IQR method, it's evident that certain columns exhibit a notable number of outliers. It's essential to consider these outliers when deciding on transformations for the data.

In [None]:
# Check skew of the features
df.skew().sort_values(ascending=False)

In [None]:
# Check correlations with target variable
correlations = df.corr()
class_correlations = correlations['Class'].sort_values(ascending=False)
print(class_correlations)

## SMOTE and ADASYN oversampling

In order to mitigate the imbalance in the dataset, two oversampling techniques were employed:

- **SMOTE (Synthetic Minority Over-sampling Technique)** generates synthetic samples for the minority class by interpolating between existing minority class samples, balancing class distribution.
- **ADASYN (Adaptive Synthetic Sampling)** is an extension of SMOTE that applies higher sampling intensity to minority class instances that are more difficult to learn, adapting to the dataset's complexity.

We will assess the performance of each technique to determine its effectiveness for this specific task.
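
The interpolation idea behind SMOTE can be sketched in a few lines of NumPy. This is a simplified illustration, not the `imblearn` implementation: `smote_like_sample` is a hypothetical helper that picks a random minority point, chooses one of its `k` nearest minority neighbours, and returns a point on the line segment between them. ADASYN builds on the same step but, roughly speaking, generates more synthetic points around minority samples that have many majority-class neighbours; that weighting is not shown here.

```python
import numpy as np

def smote_like_sample(minority, k=2, rng=None):
    """Create one synthetic point by interpolating between a random minority
    sample and one of its k nearest minority neighbours (simplified sketch)."""
    rng = np.random.default_rng() if rng is None else rng
    minority = np.asarray(minority, dtype=float)
    i = rng.integers(len(minority))
    # Euclidean distances from the chosen sample to every minority sample
    dists = np.linalg.norm(minority - minority[i], axis=1)
    # Position 0 of the sorted order is the sample itself (distance 0)
    # when all points are distinct, so it is skipped
    neighbours = np.argsort(dists)[1:k + 1]
    j = rng.choice(neighbours)
    gap = rng.random()  # interpolation weight in [0, 1)
    return minority[i] + gap * (minority[j] - minority[i])

# Made-up minority points: the four corners of the unit square
rng = np.random.default_rng(42)
fraud_points = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
synthetic = np.array([smote_like_sample(fraud_points, rng=rng) for _ in range(5)])
print(synthetic)
```

Every synthetic point lies between two existing minority points, so oversampling fills in the region the minority class already occupies rather than inventing samples elsewhere.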

In [None]:
# Split into train and test sets
X = df.drop('Class', axis=1)  
y = df['Class'] 
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42, stratify= y, shuffle=True)

In [None]:
# Use SMOTE and ADASYN oversampling on training set
smote = SMOTE(random_state=42)
adasyn = ADASYN(random_state=42)
X_smote, y_smote = smote.fit_resample(X_train, y_train)
X_adasyn, y_adasyn = adasyn.fit_resample(X_train, y_train)

print(y_smote.value_counts())
print(y_adasyn.value_counts())

## Model selection

For both training sets, model selection with 5-fold cross-validation was employed. Candidates included Logistic Regression, Decision Trees, and LightGBM. Decision Trees and LightGBM exhibited similar performance, with Logistic Regression slightly lagging behind.

Decision Trees and LightGBM were selected for further tuning.
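
The `roc_auc` metric used for model selection can be read as the probability that a randomly chosen fraud receives a higher score than a randomly chosen legitimate transaction. The sketch below illustrates that reading on four made-up points. `roc_auc_from_scores` is a hypothetical helper written for illustration, not a scikit-learn function, and its pairwise loop is only practical for small samples.

```python
def roc_auc_from_scores(y_true, scores):
    """ROC AUC as the fraction of (positive, negative) pairs in which the
    positive example gets the higher score. Ties count as half a win."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("Both classes must be present.")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Made-up labels and model scores
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(roc_auc_from_scores(y_true, scores))  # 3 of 4 pairs are ranked correctly -> 0.75
```

Because the metric depends only on how examples are ranked, it is best computed from predicted probabilities rather than hard 0/1 predictions.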

In [None]:
# Function for model selection

def model_selection(models, scoring_metric, X_train=X_train, y_train=y_train, n_splits=5):
    '''
    This function performs model selection by evaluating the performance of different models using cross-validation.
    
    Parameters:
    - models (dict): A dictionary containing model names as keys and corresponding model instances as values.
    - scoring_metric (str): The evaluation metric to use for comparing models.
    - X_train (array-like): The feature matrix of the training data.
    - y_train (array-like): The target vector of the training data.
    
    Returns:
    - results (dict): A dictionary containing model names as keys and the corresponding cross-validation results as values.
    '''
    results = {}
    
    for model_name, model in models.items():
        kf = KFold(n_splits=n_splits, random_state=12, shuffle=True)
        cv_results = cross_val_score(model, X_train, y_train, cv=kf, scoring=scoring_metric)
        results[model_name] = cv_results

    plt.figure(figsize=(10, 6))
    plt.boxplot(results.values(), labels=results.keys())
    plt.title('Model Performance Comparison')
    plt.xlabel('Model')
    plt.ylabel(scoring_metric)
    plt.grid(True)
    plt.show()
    
    return results

In [None]:
# Define the models and scoring metric
models = {
    'Logistic Regression': LogisticRegression(random_state=42),
    'LightGBM': LGBMClassifier(random_state=42),
    'Decision Tree': DecisionTreeClassifier(random_state=42),
}

scoring_metric = 'roc_auc'


In [None]:
model_selection(models=models, scoring_metric=scoring_metric, X_train=X_smote, y_train=y_smote)

In [None]:
model_selection(models=models, scoring_metric=scoring_metric, X_train=X_adasyn, y_train=y_adasyn)

## Hyperparameter Tuning and Model Training

Both Decision Tree Classifier and LightGBM Classifier were further tuned on both training sets. Quantile Transformer was used as a preprocessor to mitigate the skewed feature distributions.
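
To illustrate what a quantile transformation to a normal distribution does to a skewed feature, here is a minimal standard-library sketch. `rank_to_normal` is a hypothetical, simplified stand-in for `QuantileTransformer(output_distribution='normal')`: it maps each value's rank to a standard-normal quantile and, unlike the real transformer, does no interpolation and does not handle ties or unseen values.

```python
from statistics import NormalDist

def rank_to_normal(values):
    """Map values to approximate standard-normal scores via their ranks."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    out = [0.0] * n
    for rank, i in enumerate(order):
        # (rank + 0.5) / n keeps the probability strictly inside (0, 1)
        out[i] = NormalDist().inv_cdf((rank + 0.5) / n)
    return out

# Made-up, heavily right-skewed transaction amounts
amounts = [1.0, 2.5, 3.0, 9.99, 25.0, 120.0, 2500.0]
z_scores = rank_to_normal(amounts)
print([round(z, 2) for z in z_scores])
```

The extreme value 2500.0 no longer dominates the scale: after the transformation the values are spread symmetrically around zero while their order is preserved.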

In [None]:
# Define preprocessor and pipeline for LightGBM
preprocessor = ColumnTransformer(
    transformers=[
        ('quantile', QuantileTransformer(output_distribution='normal'), slice(0,None)),
    ],
    remainder='passthrough'
)

pipeline_lgb = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', LGBMClassifier(verbosity=-1))  
])

lgb_param_grid = {
    'classifier__n_estimators': [400],  
    'classifier__max_depth': [10], 
    'classifier__min_child_samples': [20, 30],
    'classifier__learning_rate': [0.05],
    'classifier__colsample_bytree': [0.8, 0.9]
}

# Perform hyperparameter tuning for both sets
random_search_lgb = RandomizedSearchCV(pipeline_lgb, param_distributions=lgb_param_grid, n_iter=20, cv=5, scoring='roc_auc', random_state=42, n_jobs=-1)

best_models = {}

for X, y, sampler_name in zip([X_smote, X_adasyn], [y_smote, y_adasyn], ['SMOTE', 'ADASYN']):
    random_search_lgb.fit(X, y)
    best_models[sampler_name] = {
        'best_params': random_search_lgb.best_params_,
        'best_estimator': random_search_lgb.best_estimator_
    }
    print(f"Best Hyperparameters for {sampler_name}: {random_search_lgb.best_params_}")
    print()

best_model_smote_lgb = best_models['SMOTE']['best_estimator']
best_model_adasyn_lgb = best_models['ADASYN']['best_estimator']


In [None]:
# Define preprocessor and pipeline for DT

pipeline_dt = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', DecisionTreeClassifier())  
])

dt_param_grid = {
    'classifier__max_depth': [10], 
    'classifier__min_samples_split': [20, 30],
    'classifier__min_samples_leaf': [5, 10]
}

# Perform hyperparameter tuning for both sets
random_search_dt = RandomizedSearchCV(pipeline_dt, param_distributions=dt_param_grid, n_iter=30, cv=5, scoring='roc_auc', random_state=42, n_jobs=-1)

best_models_dt = {}

for X, y, sampler_name in zip([X_smote, X_adasyn], [y_smote, y_adasyn], ['SMOTE', 'ADASYN']):
    random_search_dt.fit(X, y)
    best_models_dt[sampler_name] = {
        'best_params': random_search_dt.best_params_,
        'best_estimator': random_search_dt.best_estimator_
    }
    print(f"Best Hyperparameters for {sampler_name}: {random_search_dt.best_params_}")
    print()
    
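# Take the estimators from best_models_dt (not best_models, which holds the LightGBM results)
best_model_smote_dt = best_models_dt['SMOTE']['best_estimator']
best_model_adasyn_dt = best_models_dt['ADASYN']['best_estimator']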

## Model Evaluation

After training and tuning, all four models were evaluated. The SMOTE and ADASYN training sets led to similar performance. In the end, however, SMOTE performed slightly better on the critical metrics: TPR and FNR, as well as precision and recall for the positive class (our 'fraud').

It is important to note, however, that with different hyperparameters during training there were instances when the ADASYN set performed slightly better, so the difference is not that notable.

Interestingly, both LightGBM and Decision Trees performed very similarly in this task.
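
The metrics discussed above all derive from the four cells of the confusion matrix. As a worked example, the sketch below computes TPR, FNR, and precision for the fraud class. The counts are hypothetical and chosen only to illustrate the arithmetic; they are not results from this project.

```python
def fraud_metrics(tn, fp, fn, tp):
    """Return TPR (recall), FNR and precision for the positive (fraud) class."""
    tpr = tp / (tp + fn)        # share of frauds that were caught
    fnr = fn / (tp + fn)        # share of frauds that were missed (1 - TPR)
    precision = tp / (tp + fp)  # share of fraud alerts that were real frauds
    return tpr, fnr, precision

# Hypothetical confusion-matrix counts
tpr, fnr, precision = fraud_metrics(tn=9_900, fp=60, fn=8, tp=32)
print(f"TPR={tpr:.2f}  FNR={fnr:.2f}  precision={precision:.2f}")
```

In fraud detection a missed fraud (FN) is usually costlier than a false alert (FP), which is why TPR and FNR are treated as the critical metrics here.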

In [None]:
# Evaluate all four models
models = {
    'LGBM (SMOTE)': best_model_smote_lgb,
    'LGBM (ADASYN)': best_model_adasyn_lgb,
    'Decision Tree (SMOTE)': best_model_smote_dt,
    'Decision Tree (ADASYN)': best_model_adasyn_dt
}

for model_name, model in models.items():
    
    y_pred = model.predict(X_test)
    
    # Classification Report
    print(f"Classification report for {model_name}:")
    print(classification_report(y_test, y_pred))
    
    # ROC AUC score (computed from predicted fraud probabilities rather than hard labels)
    print(f"ROC AUC for {model_name}:")
    print(roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]))
    
    # TPR and FNR
    
    conf_matrix = confusion_matrix(y_test, y_pred)
    TN, FP, FN, TP = conf_matrix.ravel()
    TPR = TP / (TP + FN)
    FNR = FN / (TP + FN)
    print(f"True Positive Rate for {model_name}:", TPR)
    print(f"False Negative Rate for {model_name}:", FNR)
    print()
    print()


## Model Interpretation - Decision Tree plots, Feature Importances, SHAP values

Let's look at tree plots, feature importances and a summary plot of SHAP values to further understand which features proved to be crucial for the models.

In [None]:
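# Plot Decision Trees
# Names used to label the plots (the class labels are descriptive names for Class 0 and 1)
feature_names = list(X_train.columns)
class_names = ['Non-fraud', 'Fraud']
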
for sampler_name, best_model_info in best_models_dt.items():
    best_estimator = best_model_info['best_estimator']
    classifier = best_estimator.named_steps['classifier']  

    if isinstance(classifier, DecisionTreeClassifier):
        plt.figure(figsize=(50, 20))
        plot_tree(classifier, feature_names=feature_names, class_names=class_names, filled=True, max_depth=3)
        plt.title(f'Decision Tree for {sampler_name} set')
        plt.show()
    else:
        print(f"The classifier for {sampler_name} is not a Decision Tree Classifier.")

Looking at the first three layers of the tree plots, we see that V14 is the root node, making it the most influential feature in the decision-making process for both models. Furthermore, features such as V4, Amount, and V12 appear close to the root node across these layers, suggesting their importance in the decision paths of the tree.

In [None]:
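# Plot feature importances for SMOTE and ADASYN for LGBM
# Extract importances from the tuned LightGBM classifiers inside each pipeline
feature_names = list(X_train.columns)
feature_importance_smote = best_model_smote_lgb.named_steps['classifier'].feature_importances_
feature_importance_adasyn = best_model_adasyn_lgb.named_steps['classifier'].feature_importances_
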
sorted_indices_smote = np.argsort(feature_importance_smote)
sorted_indices_adasyn = np.argsort(feature_importance_adasyn)

sorted_feature_names_smote = np.array(feature_names)[sorted_indices_smote]
sorted_feature_importance_smote = feature_importance_smote[sorted_indices_smote]

sorted_feature_names_adasyn = np.array(feature_names)[sorted_indices_adasyn]
sorted_feature_importance_adasyn = feature_importance_adasyn[sorted_indices_adasyn]

plt.figure(figsize=(10, 6))
plt.barh(sorted_feature_names_smote, sorted_feature_importance_smote)
plt.xlabel('Importance')
plt.ylabel('Features')
plt.title('Feature Importances for SMOTE set')
plt.show()

plt.figure(figsize=(10, 6))
plt.barh(sorted_feature_names_adasyn, sorted_feature_importance_adasyn)
plt.xlabel('Importance')
plt.ylabel('Features')
plt.title('Feature Importances for ADASYN set')
plt.show()

The feature importance plots for the LightGBM models suggest that Amount had by far the greatest importance for both training sets, consistently followed by V14 and V4. Beyond these, there are slight differences between the two LGBM models. Let us also look at the SHAP summary plots for the SMOTE and ADASYN datasets.

In [None]:
# Compute SHAP values on the test set for a model trained on the SMOTE set and plot a summary plot
shap.initjs()
quant = QuantileTransformer(output_distribution='normal')
X_smote_transformed = quant.fit_transform(X_smote)
X_test_transformed = quant.transform(X_test)

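# min_samples_split is a scikit-learn tree parameter that LightGBM does not use, so it is dropped;
# min_child_samples is LightGBM's own name for the minimum number of samples per leaf
lgb = LGBMClassifier(min_child_samples=5, max_depth=10, random_state=42, verbosity=-1)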
lgb.fit(X_smote_transformed, y_smote)

explainer = shap.TreeExplainer(lgb)
shap_values = explainer.shap_values(X_test_transformed)
shap.summary_plot(shap_values, X_test_transformed, feature_names=X_test.columns)

As expected, the SHAP ranking differs from the feature importances, because the two metrics measure different things: LightGBM's default importance counts how often a feature is used for splits, whereas SHAP values measure each feature's contribution to individual predictions. Here, the mean SHAP value for Amount is notably lower than its prominent position in the feature importances plot would suggest. Instead, features like V10 and V14 have the highest impact on the model output, followed by V1 and V4.
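
To give some intuition for what a SHAP value represents, the sketch below computes exact Shapley values for a tiny, hypothetical three-feature model by averaging each feature's marginal contribution over all feature orderings. This is only a conceptual illustration: `shapley_values` and `toy_model` are made-up names, features that are "absent" simply take a baseline value, and `shap.TreeExplainer` uses a much more efficient tree-specific algorithm rather than this brute-force enumeration.

```python
from itertools import permutations

def shapley_values(model, x, baseline):
    """Exact Shapley values for one instance, averaged over all feature
    orderings. Only feasible for a handful of features."""
    n = len(x)
    phi = [0.0] * n
    orderings = list(permutations(range(n)))
    for order in orderings:
        current = list(baseline)   # start with every feature "absent"
        previous = model(current)
        for i in order:
            current[i] = x[i]      # switch feature i to its actual value
            value = model(current)
            phi[i] += value - previous
            previous = value
    return [p / len(orderings) for p in phi]

# Hypothetical toy model with an interaction between features 0 and 1
def toy_model(features):
    a, b, c = features
    return 2.0 * a + a * b + 0.5 * c

x = [1.0, 2.0, 4.0]
baseline = [0.0, 0.0, 0.0]
phi = shapley_values(toy_model, x, baseline)
print(phi)  # the interaction term a * b is shared equally between features 0 and 1
```

The contributions add up to the difference between the prediction for `x` and the prediction for the baseline, which is the additivity property that SHAP summary plots rely on.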

In [None]:
# SHAP summary plot for ADASYN set
X_adasyn_transformed = quant.fit_transform(X_adasyn)
X_test_transformed = quant.transform(X_test)

# Same LightGBM settings as for the SMOTE model above (min_samples_split is not a LightGBM parameter)
lgb2 = LGBMClassifier(min_child_samples=5, max_depth=10, random_state=42, verbosity=-1)
lgb2.fit(X_adasyn_transformed, y_adasyn)

explainer = shap.TreeExplainer(lgb2)
shap_values = explainer.shap_values(X_test_transformed)
shap.summary_plot(shap_values, X_test_transformed, feature_names=X_test.columns)

The second model, trained on the ADASYN set, performs very similarly but has noticeably different SHAP values. For instance, while V14 remains the most important feature, V10 drops to sixth place. This shows that although feature importance is broadly similar across models, there are distinct differences too.
