### **Handling Imbalanced Datasets in Deep Learning: Implementation of Various Techniques**

#### **1. Setup and Data Preparation**

* We begin by importing the necessary libraries and loading the Credit Card Fraud Detection dataset from Kaggle. 

* This dataset is a perfect example of a real-world imbalanced problem, where the 'Class' variable (0 for legitimate, 1 for fraudulent) is highly skewed.

In [52]:
# Import necessary libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import classification_report, confusion_matrix

import tensorflow as tf
from tensorflow import keras

In [53]:

# Load the dataset 
df = pd.read_csv('creditcard.csv')

# Display the first few rows of the dataframe
df.head()

Unnamed: 0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
0,0.0,-1.359807,-0.072781,2.536347,1.378155,-0.338321,0.462388,0.239599,0.098698,0.363787,...,-0.018307,0.277838,-0.110474,0.066928,0.128539,-0.189115,0.133558,-0.021053,149.62,0
1,0.0,1.191857,0.266151,0.16648,0.448154,0.060018,-0.082361,-0.078803,0.085102,-0.255425,...,-0.225775,-0.638672,0.101288,-0.339846,0.16717,0.125895,-0.008983,0.014724,2.69,0
2,1.0,-1.358354,-1.340163,1.773209,0.37978,-0.503198,1.800499,0.791461,0.247676,-1.514654,...,0.247998,0.771679,0.909412,-0.689281,-0.327642,-0.139097,-0.055353,-0.059752,378.66,0
3,1.0,-0.966272,-0.185226,1.792993,-0.863291,-0.010309,1.247203,0.237609,0.377436,-1.387024,...,-0.1083,0.005274,-0.190321,-1.175575,0.647376,-0.221929,0.062723,0.061458,123.5,0
4,2.0,-1.158233,0.877737,1.548718,0.403034,-0.407193,0.095921,0.592941,-0.270533,0.817739,...,-0.009431,0.798278,-0.137458,0.141267,-0.20601,0.502292,0.219422,0.215153,69.99,0


In [54]:
# Get a technical summary of the dataframe
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 284807 entries, 0 to 284806
Data columns (total 31 columns):
 #   Column  Non-Null Count   Dtype  
---  ------  --------------   -----  
 0   Time    284807 non-null  float64
 1   V1      284807 non-null  float64
 2   V2      284807 non-null  float64
 3   V3      284807 non-null  float64
 4   V4      284807 non-null  float64
 5   V5      284807 non-null  float64
 6   V6      284807 non-null  float64
 7   V7      284807 non-null  float64
 8   V8      284807 non-null  float64
 9   V9      284807 non-null  float64
 10  V10     284807 non-null  float64
 11  V11     284807 non-null  float64
 12  V12     284807 non-null  float64
 13  V13     284807 non-null  float64
 14  V14     284807 non-null  float64
 15  V15     284807 non-null  float64
 16  V16     284807 non-null  float64
 17  V17     284807 non-null  float64
 18  V18     284807 non-null  float64
 19  V19     284807 non-null  float64
 20  V20     284807 non-null  float64
 21  V21     28

In [55]:
# Drop the 'Time' column as it is not a predictive feature
df = df.drop('Time', axis=1)

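# Scale the 'Amount' feature to a similar range as the PCA features
# Note: the scaler is fit on the full dataset here, so test-set statistics leak into
# the scaling; fitting it on the training split only would avoid that
df['Amount'] = StandardScaler().fit_transform(df['Amount'].values.reshape(-1, 1))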

# Define features (X) and target (y)
X = df.drop('Class', axis=1)
y = df['Class']

# Check the class distribution
print("Class distribution:\n", y.value_counts())
print(f"\nPercentage of fraudulent transactions: {100 * y.value_counts()[1] / len(y):.2f}%")

Class distribution:
 Class
0    284315
1       492
Name: count, dtype: int64

Percentage of fraudulent transactions: 0.17%


In [56]:
# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Verify the split preserves the imbalance
print("\nTraining set class distribution:\n", y_train.value_counts())
print("Test set class distribution:\n", y_test.value_counts())


Training set class distribution:
 Class
0    227451
1       394
Name: count, dtype: int64
Test set class distribution:
 Class
0    56864
1       98
Name: count, dtype: int64
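
The split above is not stratified, so the fraud ratio in each partition depends on the random seed. A minimal sketch of a stratified split, using a synthetic stand-in for the fraud data (the array sizes and the names `X_demo` and `y_demo` are assumptions for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the fraud data: 10,000 rows, 0.5% positives (assumed values)
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(10_000, 5))
y_demo = np.zeros(10_000, dtype=int)
y_demo[:50] = 1

# stratify=y_demo keeps the class ratio the same in both partitions
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

print("Train positives:", y_tr.sum(), "of", len(y_tr))
print("Test positives: ", y_te.sum(), "of", len(y_te))
```

With so few positive samples, stratifying guards against a test set that happens to contain very few fraud cases, which would make the recall estimate unstable.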


#### **2. Baseline Model Performance**

* Before applying any balancing techniques, we will train a simple deep neural network on the raw, imbalanced data. 

* This will serve as our baseline to compare the effectiveness of the later methods. 

* Throughout, pay particular attention to the recall of the minority class (Class 1), that is, the share of fraudulent transactions the model catches.

In [57]:
# Function to build a simple feedforward neural network

def build_model(input_dim):
    model = keras.Sequential([
        keras.layers.Input(shape=(input_dim,)),
        keras.layers.Dense(128, activation='relu'),
        keras.layers.Dropout(0.3),
        keras.layers.Dense(64, activation='relu'),
        keras.layers.Dropout(0.3),
        keras.layers.Dense(32, activation='relu'),
        keras.layers.Dropout(0.3),
        keras.layers.Dense(1, activation='sigmoid')
    ])

    model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
    return model

In [58]:
# Create and train the baseline model
baseline_model = build_model(X_train.shape[1])

print("\nTraining Baseline Model...")
baseline_model.fit(X_train, y_train,
                     epochs=10,
                     batch_size=256,
                     validation_split=0.2,
                     verbose=0)


Training Baseline Model...


<keras.src.callbacks.history.History at 0x23883a69f90>

In [59]:
# Evaluate the baseline model
y_pred_baseline = (baseline_model.predict(X_test) > 0.5).astype("int32")

print("\n--- Baseline Model Performance on Test Data ---")
print(classification_report(y_test, y_pred_baseline))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred_baseline))

[1m1781/1781[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m6s[0m 3ms/step

--- Baseline Model Performance on Test Data ---
              precision    recall  f1-score   support

           0       1.00      1.00      1.00     56864
           1       0.81      0.83      0.82        98

    accuracy                           1.00     56962
   macro avg       0.90      0.91      0.91     56962
weighted avg       1.00      1.00      1.00     56962

Confusion Matrix:
 [[56845    19]
 [   17    81]]
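
The per-class metrics in the report follow directly from the confusion matrix. As a worked check, the sketch below recomputes them from the four counts printed above. The variable names are illustrative, and the "always legitimate" comparison is an added reference point rather than a model trained in this notebook:

```python
import numpy as np

# Baseline confusion matrix copied from the output above
# rows = true class (0, 1), columns = predicted class (0, 1)
cm = np.array([[56845, 19],
               [17, 81]])
tn, fp, fn, tp = cm.ravel()

precision = tp / (tp + fp)   # share of fraud alerts that are real fraud
recall = tp / (tp + fn)      # share of real fraud that is caught
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / cm.sum()

# Accuracy of a trivial model that labels every transaction as legitimate
always_legit_accuracy = (tn + fp) / cm.sum()

print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
print(f"accuracy={accuracy:.4f} always-legitimate accuracy={always_legit_accuracy:.4f}")
```

Because the trivial model already scores above 99% accuracy, accuracy alone cannot separate a useful fraud detector from a useless one on this data.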


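* The baseline model reaches an overall accuracy of nearly 1.00, but that number says little here: a model that labels every transaction as legitimate would score about 99.8% on this test set. 

* The minority-class metrics are more informative: recall for Class 1 is 0.83, so the model catches 81 of the 98 fraudulent transactions and misses 17, with a precision of 0.81. The techniques below try to reduce those misses.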

#### **3. Handling Techniques**

##### **3.1 Undersampling Technique**

* Undersampling reduces the number of samples from the majority class to match the number of samples in the minority class. 

* While this can lead to a balanced dataset and faster training, it risks losing valuable information from the majority class.
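
To make the mechanics concrete, here is a minimal NumPy sketch of random undersampling. It is an illustration under simple assumptions (NumPy arrays as input, a hypothetical helper name `random_undersample`), not the `imblearn` implementation used in the next cell:

```python
import numpy as np

def random_undersample(X, y, seed=0):
    """Keep n_min randomly chosen rows per class, where n_min is the smallest class size."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    n_min = counts.min()
    keep = np.concatenate([
        rng.choice(np.flatnonzero(y == c), size=n_min, replace=False)
        for c in classes
    ])
    rng.shuffle(keep)  # interleave the classes instead of leaving them in blocks
    return X[keep], y[keep]

# Toy data: 1,000 rows, 10 of them in the minority class
X_toy = np.arange(2000, dtype=float).reshape(1000, 2)
y_toy = np.zeros(1000, dtype=int)
y_toy[:10] = 1

X_us_toy, y_us_toy = random_undersample(X_toy, y_toy)
print(X_us_toy.shape, np.bincount(y_us_toy))
```

All minority rows survive, while 980 of the 990 majority rows are discarded, which is the information loss mentioned above.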

In [60]:
from imblearn.under_sampling import RandomUnderSampler

# Apply Random Undersampling to the training data
rus = RandomUnderSampler(random_state=42)
X_train_us, y_train_us = rus.fit_resample(X_train, y_train)

print("Original training set shape:", X_train.shape)
print("Undersampled training set shape:", X_train_us.shape)
print("Undersampled class distribution:\n", np.bincount(y_train_us))

Original training set shape: (227845, 29)
Undersampled training set shape: (788, 29)
Undersampled class distribution:
 [394 394]


In [61]:
# Train a new model on the undersampled data
undersampled_model = build_model(X_train_us.shape[1])
undersampled_model.fit(X_train_us, y_train_us,
                          epochs=10,
                          batch_size=256,
                          validation_split=0.2,
                          verbose=0)

# Evaluate the undersampled model
y_pred_us = (undersampled_model.predict(X_test) > 0.5).astype("int32")

[1m1781/1781[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m7s[0m 4ms/step


In [62]:
# Evaluate the undersampled model on the test set
# Note: evaluate() returns [loss, accuracy], so the printed list holds both values
print("\n--- Undersampled Model Accuracy Report ---")
print("Accuracy:", undersampled_model.evaluate(X_test, y_test))


--- Undersampled Model Accuracy Report ---
1781/1781 ━━━━━━━━━━━━━━━━━━━━ 7s 4ms/step - accuracy: 0.9914 - loss: 0.2053
Accuracy: [0.20526446402072906, 0.9913626909255981]


In [63]:
print("\n--- Undersampled Model Performance ---")
print(classification_report(y_test, y_pred_us))


--- Undersampled Model Performance ---
              precision    recall  f1-score   support

           0       1.00      0.99      1.00     56864
           1       0.15      0.88      0.26        98

    accuracy                           0.99     56962
   macro avg       0.58      0.93      0.63     56962
weighted avg       1.00      0.99      0.99     56962



In [64]:
# Evaluate the undersampled model with confusion matrix
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred_us))

Confusion Matrix:
 [[56384   480]
 [   12    86]]


* Recall for the minority class rises from 0.83 to 0.88 (86 of 98 frauds detected, up from 81). 

* The gain is costly: false positives jump from 19 to 480, so precision falls from 0.81 to 0.15 and overall accuracy drops to 0.99.

##### **3.2 Oversampling Technique**

* Random oversampling duplicates randomly selected samples from the minority class until the classes are balanced. 

* This approach is simple but can lead to a model that overfits to the replicated data points, as it doesn't add any new information.
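
The point about duplicated information can be checked directly. The sketch below is a hand-rolled NumPy version of random oversampling (the helper name `random_oversample` and the toy data are assumptions, not the `imblearn` code used in the next cell). It counts how many distinct minority rows exist after balancing:

```python
import numpy as np

def random_oversample(X, y, minority_label=1, seed=0):
    """Append randomly repeated minority rows until both classes have the same size."""
    rng = np.random.default_rng(seed)
    min_idx = np.flatnonzero(y == minority_label)
    maj_idx = np.flatnonzero(y != minority_label)
    n_extra = len(maj_idx) - len(min_idx)
    extra = rng.choice(min_idx, size=n_extra, replace=True)  # sampling with replacement
    idx = np.concatenate([maj_idx, min_idx, extra])
    return X[idx], y[idx]

# Toy data: 1,000 rows, 10 of them in the minority class
rng = np.random.default_rng(1)
X_toy = rng.normal(size=(1000, 3))
y_toy = np.zeros(1000, dtype=int)
y_toy[:10] = 1

X_os_toy, y_os_toy = random_oversample(X_toy, y_toy)
n_unique_minority = len(np.unique(X_os_toy[y_os_toy == 1], axis=0))

print("Class counts after oversampling:", np.bincount(y_os_toy))
print("Distinct minority rows:", n_unique_minority)
```

The minority class grows to 990 rows, yet only the original 10 distinct points are present, so the model sees each of them about 99 times per epoch.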

In [65]:
from imblearn.over_sampling import RandomOverSampler

# Apply Random Oversampling to the training data
ros = RandomOverSampler(random_state=42)
X_train_ros, y_train_ros = ros.fit_resample(X_train, y_train)

print("Original training set shape:", X_train.shape)
print("Oversampled training set shape:", X_train_ros.shape)
print("Oversampled class distribution:\n", np.bincount(y_train_ros))

Original training set shape: (227845, 29)
Oversampled training set shape: (454902, 29)
Oversampled class distribution:
 [227451 227451]


In [66]:
# Train a new model on the oversampled data
oversampled_model = build_model(X_train_ros.shape[1])
oversampled_model.fit(X_train_ros, y_train_ros,
                         epochs=10,
                         batch_size=256,
                         validation_split=0.2,
                         verbose=0)

# Evaluate the oversampled model
y_pred_ros = (oversampled_model.predict(X_test) > 0.5).astype("int32")

[1m1781/1781[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m4s[0m 2ms/step


In [67]:
# Evaluate the oversampled model on the test set
# Note: evaluate() returns [loss, accuracy], so the printed list holds both values
print("\n--- Oversampled Model Accuracy Report ---")
print("Accuracy:", oversampled_model.evaluate(X_test, y_test))


--- Oversampled Model Accuracy Report ---
1781/1781 ━━━━━━━━━━━━━━━━━━━━ 10s 6ms/step - accuracy: 0.9989 - loss: 0.0151
Accuracy: [0.01508812140673399, 0.9989466667175293]


In [68]:
print("\n--- Oversampled Model Performance ---")
print(classification_report(y_test, y_pred_ros))


--- Oversampled Model Performance ---
              precision    recall  f1-score   support

           0       1.00      1.00      1.00     56864
           1       0.65      0.86      0.74        98

    accuracy                           1.00     56962
   macro avg       0.82      0.93      0.87     56962
weighted avg       1.00      1.00      1.00     56962



In [69]:
# Evaluate the oversampled model with confusion matrix
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred_ros))

Confusion Matrix:
 [[56818    46]
 [   14    84]]


##### **3.3 SMOTE (Synthetic Minority Oversampling Technique)**

* SMOTE is an advanced oversampling method. 

* Instead of simply duplicating existing data, it generates *synthetic* samples for the minority class. 

* It does this by taking a minority sample, finding its k nearest neighbors among the other minority samples, and creating new samples at random points along the line segments connecting the sample to those neighbors. 

* This approach mitigates the risk of overfitting seen in basic oversampling.
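
The interpolation step described above can be sketched in a few lines of NumPy. This is a simplified illustration, not the `imblearn` implementation: the helper name `smote_like_samples` is hypothetical, distances are computed by brute force, and the function assumes there are more than `k` minority points:

```python
import numpy as np

def smote_like_samples(X_min, n_new, k=5, seed=0):
    """Create n_new synthetic points between minority samples and their k nearest neighbours."""
    rng = np.random.default_rng(seed)

    # Pairwise Euclidean distances between minority points
    diffs = X_min[:, None, :] - X_min[None, :, :]
    dist = np.sqrt((diffs ** 2).sum(axis=2))
    np.fill_diagonal(dist, np.inf)  # a point is not its own neighbour
    neighbours = np.argsort(dist, axis=1)[:, :k]

    base = rng.integers(0, len(X_min), size=n_new)           # starting samples
    nb = neighbours[base, rng.integers(0, k, size=n_new)]    # one random neighbour each
    lam = rng.random((n_new, 1))                             # position along the segment
    return X_min[base] + lam * (X_min[nb] - X_min[base])

# Toy minority class: 20 points in 2 dimensions
rng = np.random.default_rng(3)
X_min_toy = rng.normal(size=(20, 2))
X_synth = smote_like_samples(X_min_toy, n_new=100)
print(X_synth.shape)
```

Every synthetic point lies on a segment between two existing minority points, so the new data stays inside the region the minority class already occupies rather than repeating exact copies.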

In [70]:
from imblearn.over_sampling import SMOTE

# Apply SMOTE to the training data
smote = SMOTE(random_state=42)
X_train_smote, y_train_smote = smote.fit_resample(X_train, y_train)

print("Original training set shape:", X_train.shape)
print("SMOTE training set shape:", X_train_smote.shape)
print("SMOTE class distribution:\n", np.bincount(y_train_smote))

Original training set shape: (227845, 29)
SMOTE training set shape: (454902, 29)
SMOTE class distribution:
 [227451 227451]


In [71]:
# Train a new model on the SMOTE-generated data
smote_model = build_model(X_train_smote.shape[1])
smote_model.fit(X_train_smote, y_train_smote,
                   epochs=10,
                   batch_size=256,
                   validation_split=0.2,
                   verbose=0)

# Evaluate the SMOTE model
y_pred_smote = (smote_model.predict(X_test) > 0.5).astype("int32")

[1m1781/1781[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m6s[0m 3ms/step


In [72]:
# Evaluate the SMOTE model on the test set
# Note: evaluate() returns [loss, accuracy], so the printed list holds both values
print("\n--- SMOTE Model Accuracy Report ---")
print("Accuracy:", smote_model.evaluate(X_test, y_test))


--- SMOTE Model Accuracy Report ---
1781/1781 ━━━━━━━━━━━━━━━━━━━━ 11s 6ms/step - accuracy: 0.9989 - loss: 0.0135
Accuracy: [0.013451533392071724, 0.9989290833473206]


In [73]:
print("\n--- SMOTE Model Performance ---")
print(classification_report(y_test, y_pred_smote))


--- SMOTE Model Performance ---
              precision    recall  f1-score   support

           0       1.00      1.00      1.00     56864
           1       0.64      0.88      0.74        98

    accuracy                           1.00     56962
   macro avg       0.82      0.94      0.87     56962
weighted avg       1.00      1.00      1.00     56962



In [74]:
# Evaluate the SMOTE model with confusion matrix
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred_smote))

Confusion Matrix:
 [[56815    49]
 [   12    86]]


* In this run SMOTE reaches a recall of 0.88 at a precision of 0.64, almost the same trade-off as random oversampling (0.86 and 0.65). Both detect more fraud than the baseline, at the cost of more than twice as many false positives (49 and 46 versus 19).

##### **3.4 Ensemble Technique (BalancedBaggingClassifier)**

* Ensemble methods combine multiple models to produce a more robust and accurate prediction. 

* The `BalancedBaggingClassifier` from `imblearn` is specifically designed for imbalanced data. 

* It trains multiple base estimators (in this case, our deep neural network) on different subsets of the data, where each subset is resampled to be balanced. 

* This approach leverages the power of ensemble learning while directly addressing the imbalance.
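
The idea behind balanced bagging can be sketched by hand. The example below is a simplified illustration and not the `imblearn` algorithm itself: it swaps the neural network for scikit-learn's `LogisticRegression` to keep the run short, assumes the minority class is labeled 1, and uses hypothetical helper names (`fit_balanced_bag`, `predict_proba_bag`):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_balanced_bag(X, y, n_estimators=10, seed=0):
    """Train each member on a balanced subset: resampled positives plus as many random negatives."""
    rng = np.random.default_rng(seed)
    pos = np.flatnonzero(y == 1)
    neg = np.flatnonzero(y == 0)
    models = []
    for _ in range(n_estimators):
        pos_boot = rng.choice(pos, size=len(pos), replace=True)   # bootstrap the minority
        neg_sub = rng.choice(neg, size=len(pos), replace=False)   # undersample the majority
        idx = np.concatenate([pos_boot, neg_sub])
        models.append(LogisticRegression(max_iter=1000).fit(X[idx], y[idx]))
    return models

def predict_proba_bag(models, X):
    """Average the positive-class probability over all ensemble members."""
    return np.mean([m.predict_proba(X)[:, 1] for m in models], axis=0)

# Toy data: 2,000 rows, 40 positives shifted away from the negatives
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(2000, 2))
y_toy = np.zeros(2000, dtype=int)
y_toy[:40] = 1
X_toy[:40] += 3.0

bag = fit_balanced_bag(X_toy, y_toy, n_estimators=10)
proba = predict_proba_bag(bag, X_toy)
y_hat = (proba > 0.5).astype(int)
print("Recall on toy data:", (y_hat[y_toy == 1] == 1).mean())
```

Each member discards most of the majority class, but different members see different majority subsets, so the ensemble as a whole uses more of the data than a single undersampled model.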

In [75]:
from imblearn.ensemble import BalancedBaggingClassifier
from scikeras.wrappers import KerasClassifier

# Scikeras is the recommended way to use Keras models with scikit-learn
# The function to create the model must accept the input_dim as an argument
def create_keras_model(input_dim):
    model = build_model(input_dim)
    return model

# Wrap the Keras model in KerasClassifier and correctly pass the input_dim
keras_clf = KerasClassifier(
    model=create_keras_model,
    model__input_dim=X_train.shape[1],
    epochs=10,
    batch_size=256,
    verbose=0
)

# Create the BalancedBaggingClassifier
bbc = BalancedBaggingClassifier(estimator=keras_clf, sampling_strategy='auto', random_state=42)

print("\nTraining BalancedBaggingClassifier...")
# Train the ensemble model
bbc.fit(X_train, y_train)


Training BalancedBaggingClassifier...


In [76]:
# Evaluate the Ensemble model with accuracy report
print("\n--- Ensemble Model Accuracy Report ---")
print("Accuracy:", bbc.score(X_test, y_test))


--- Ensemble Model Accuracy Report ---
Accuracy: 0.988729328324146


In [77]:
# Evaluate the ensemble model
y_pred_bbc = bbc.predict(X_test)

print("\n--- BalancedBaggingClassifier Performance ---")
print(classification_report(y_test, y_pred_bbc))


--- BalancedBaggingClassifier Performance ---
              precision    recall  f1-score   support

           0       1.00      0.99      0.99     56864
           1       0.12      0.89      0.21        98

    accuracy                           0.99     56962
   macro avg       0.56      0.94      0.60     56962
weighted avg       1.00      0.99      0.99     56962



In [78]:
# Evaluate the BalancedBaggingClassifier with confusion matrix
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred_bbc))

Confusion Matrix:
 [[56233   631]
 [   11    87]]


##### **3.5 Focal Loss Technique**

* Focal Loss is a special loss function designed to handle highly imbalanced classification problems, particularly where a large number of easy-to-classify negative examples dominate the training. 

* It works by down-weighting the loss contribution from well-classified examples, so training is driven by the 'hard' examples the model still gets wrong, which in an imbalanced problem are often minority-class samples. 

* This is a powerful technique because it directly modifies the model's learning objective without altering the dataset itself.
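
A small numeric example may help show the down-weighting. The sketch below computes the per-sample focal loss in NumPy next to plain binary cross-entropy, using the same `gamma=2` and `alpha=0.25` as the Keras code in this section. The function names and the four example probabilities are assumptions for illustration, and unlike the Keras version it returns one value per sample instead of a batch sum:

```python
import numpy as np

def binary_focal_loss(y_true, p, gamma=2.0, alpha=0.25):
    """Per-sample focal loss: -alpha_t * (1 - p_t)**gamma * log(p_t)."""
    p_t = np.where(y_true == 1, p, 1.0 - p)              # probability given to the true class
    alpha_t = np.where(y_true == 1, alpha, 1.0 - alpha)  # class weight
    return -alpha_t * (1.0 - p_t) ** gamma * np.log(p_t)

def binary_cross_entropy(y_true, p):
    """Per-sample cross-entropy: -log(p_t)."""
    p_t = np.where(y_true == 1, p, 1.0 - p)
    return -np.log(p_t)

# Easy negative, hard negative, easy positive, hard positive
y_ex = np.array([0, 0, 1, 1])
p_ex = np.array([0.01, 0.60, 0.90, 0.10])

fl = binary_focal_loss(y_ex, p_ex)
ce = binary_cross_entropy(y_ex, p_ex)
for yi, pi, f, c in zip(y_ex, p_ex, fl, ce):
    print(f"y={yi} p={pi:.2f} cross-entropy={c:.4f} focal={f:.6f} ratio={f / c:.6f}")
```

The ratio column is the modulating factor `alpha_t * (1 - p_t)**gamma`: close to zero for confidently correct predictions and much larger for confident mistakes. Note that with `alpha=0.25` the positive class receives the smaller class weight, which is worth keeping in mind when reading the recall figures.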

In [85]:
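# Define the focal loss; predictions are clipped away from 0 and 1 so the logarithms stay finite
from tensorflow.keras import backend as K

def focal_loss(gamma=2., alpha=.25):
    # gamma controls how strongly well-classified examples are down-weighted;
    # alpha weights the positive (fraud) class and (1 - alpha) the negative class
    def focal_loss_fixed(y_true, y_pred):
        y_true = tf.cast(y_true, tf.float32)
        epsilon = K.epsilon()
        y_pred = K.clip(y_pred, epsilon, 1. - epsilon)

        # pt_1 keeps the prediction for positives (1 elsewhere, so that term is 0);
        # pt_0 keeps the prediction for negatives (0 elsewhere, so that term is 0)
        pt_1 = tf.where(tf.equal(y_true, 1), y_pred, tf.ones_like(y_pred))
        pt_0 = tf.where(tf.equal(y_true, 0), y_pred, tf.zeros_like(y_pred))

        # Summed (not averaged) over the batch, so the loss scale depends on batch size
        loss = -K.sum(alpha * K.pow(1. - pt_1, gamma) * K.log(pt_1)) \
               -K.sum((1-alpha) * K.pow(pt_0, gamma) * K.log(1. - pt_0))

        return loss
    return focal_loss_fixed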

# Create a new model and compile with Focal Loss
focal_model = build_model(X_train.shape[1])

In [86]:
print("\nTraining Model with Focal Loss...")
focal_model.compile(optimizer='adam',
                    loss=focal_loss(gamma=2.0, alpha=0.25),
                    metrics=['accuracy'])

focal_model.fit(X_train, y_train,
                   epochs=10,
                   batch_size=256,
                   validation_split=0.2,
                   verbose=0)


Training Model with Focal Loss...


<keras.src.callbacks.history.History at 0x238806a7e00>

In [87]:
# Predict on the test set, then report [loss, accuracy] from evaluate()
y_pred_focal = (focal_model.predict(X_test) > 0.5).astype("int32")

print("\n--- Focal Loss Model Performance ---")
print("Accuracy:", focal_model.evaluate(X_test, y_test))

[1m1781/1781[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m5s[0m 3ms/step

--- Focal Loss Model Performance ---
1781/1781 ━━━━━━━━━━━━━━━━━━━━ 11s 6ms/step - accuracy: 0.9994 - loss: 0.0102
Accuracy: [0.010217303410172462, 0.9994031190872192]


In [88]:
# Evaluate the Focal Loss model
y_pred_focal = (focal_model.predict(X_test) > 0.5).astype("int32")

print("\n--- Focal Loss Model Performance ---")
print(classification_report(y_test, y_pred_focal))

[1m1781/1781[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m7s[0m 4ms/step

--- Focal Loss Model Performance ---
              precision    recall  f1-score   support

           0       1.00      1.00      1.00     56864
           1       0.87      0.77      0.82        98

    accuracy                           1.00     56962
   macro avg       0.94      0.88      0.91     56962
weighted avg       1.00      1.00      1.00     56962



In [89]:
# Evaluate the focal loss model with confusion matrix
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred_focal))

Confusion Matrix:
 [[56853    11]
 [   23    75]]


#### **4. Summary**

* This notebook applied five techniques for handling imbalanced datasets in deep learning and compared each with a baseline model (precision 0.81, recall 0.83 on the fraud class):

  1) **Undersampling**: Simple and fast to train (788 rows instead of 227,845), but it discards most of the majority class. Here it reached recall 0.88 at precision 0.15.
  
  2) **Oversampling (Random)**: Simple to implement, but may cause overfitting due to data duplication. Here it reached recall 0.86 at precision 0.65.
  
  3) **SMOTE**: An oversampling method that generates synthetic minority samples instead of copies. Here it reached recall 0.88 at precision 0.64, close to random oversampling.
  
  4) **Ensemble Methods**: Combine multiple models trained on balanced subsets. Here the ensemble gave the highest recall (0.89) and the lowest precision (0.12).
  
  5) **Focal Loss**: A loss function that re-weights examples instead of altering the dataset. With `gamma=2.0` and `alpha=0.25` it gave the highest precision (0.87) and the lowest recall (0.77).

* The choice of technique depends on the specific problem. 

* On this dataset and in this single run, every resampling method raised recall (0.86 to 0.89) at the cost of precision, focal loss did the opposite, and none exceeded the baseline F1 score of 0.82 for the fraud class. 

* In practice, it's often best to experiment with several of these methods and choose the one that provides the most desirable trade-off between missed fraud and false alarms for your specific business objective.

---

*Deep Learning - Python Notebook* by [*Prakash Ukhalkar*](https://github.com/prakash-ukhalkar)