# **SMOTE (Synthetic Minority Over-sampling Technique)**

## **Introduction**
SMOTE is a resampling technique used to handle class imbalance by generating synthetic samples for the minority class instead of simply duplicating existing ones. It works by interpolating between real minority class instances.

## **Algorithm Steps**
1. **Select a minority class sample** $x_i$ from the dataset.
2. **Find its k-nearest neighbors** in the minority class using Euclidean distance.
3. **Randomly select one of these neighbors** $x_{nn}$.
4. **Generate a synthetic sample** along the line segment joining $x_i$ and $x_{nn}$ using interpolation:

   $$
   x_{\text{new}} = x_i + \lambda \cdot (x_{nn} - x_i)
   $$

   where:

   $$
   \lambda \sim U(0,1)
   $$

   is a random number between 0 and 1.

## **Mathematical Formulation**
For a given minority class instance $x_i$, let $x_{nn}$ be one of its k-nearest neighbors. The synthetic sample is created as:

$$
x_{\text{new}} = x_i + \lambda (x_{nn} - x_i)
$$

where:
- $x_i$ is a real minority class instance.
- $x_{nn}$ is one of its k-nearest neighbors.
- $\lambda$ is a random number sampled from a uniform distribution:

  $$
  \lambda \sim U(0,1)
  $$

This process is repeated until the desired number of synthetic samples is generated.
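
The interpolation rule above can be sketched in a few lines of NumPy. This is a minimal illustration, not the `imblearn` implementation: the function name `smote_sketch`, its parameters, and the toy data are all hypothetical, and the brute-force pairwise distance matrix is only practical for small minority sets.

```python
import numpy as np

def smote_sketch(X_min, n_new, k=5, seed=0):
    """Generate n_new synthetic rows by linear interpolation (illustrative only)."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    n = len(X_min)
    k = min(k, n - 1)  # a sample cannot be its own neighbor

    # Pairwise Euclidean distances within the minority class
    diff = X_min[:, None, :] - X_min[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))
    np.fill_diagonal(dist, np.inf)              # exclude each point from its own neighbor list
    neighbors = np.argsort(dist, axis=1)[:, :k]

    base = rng.integers(0, n, size=n_new)                    # step 1: pick x_i
    nn = neighbors[base, rng.integers(0, k, size=n_new)]     # steps 2-3: pick x_nn
    lam = rng.random((n_new, 1))                             # lambda ~ U(0, 1)
    return X_min[base] + lam * (X_min[nn] - X_min[base])     # step 4: interpolate


# Toy minority class: the four corners of the unit square
X_min = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
X_new = smote_sketch(X_min, n_new=10, k=2)
```

With `k=2`, each corner's neighbors are the two adjacent corners, so every synthetic point falls on an edge of the square.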

## **Advantages of SMOTE**
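- Reduces class imbalance by adding synthetic minority samples.
- Lowers the overfitting risk that comes with simply duplicating minority class instances.
- Keeps each synthetic sample on a segment between two neighboring minority instances, so new points stay within the local region the minority class already occupies.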

## **Limitations of SMOTE**
- Can generate noisy samples if the minority class has a complex distribution.
- Does not consider the majority class, which may lead to overlapping regions and potential misclassification.



In [6]:
import pandas as pd
import numpy as np
from imblearn.combine import SMOTEENN


from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import (
    roc_auc_score, average_precision_score, f1_score, recall_score, accuracy_score,
    confusion_matrix
)
from tqdm import tqdm
from imblearn.over_sampling import SMOTE
import xgboost as xgb
import warnings

warnings.simplefilter("ignore")  # Ignore all warnings

def calculate_metrics(y_test, y_pred_proba):
    # Convert probabilities to class labels with a fixed 0.5 cut-off.
    # Threshold-dependent metrics below (accuracy, recall, F1, ...) all use this cut-off.
    y_pred_binary = [1 if p > 0.5 else 0 for p in y_pred_proba]

    # Compute metrics
    accuracy = accuracy_score(y_test, y_pred_binary)
    roc_auc = roc_auc_score(y_test, y_pred_proba)
    pr_auc = average_precision_score(y_test, y_pred_proba)
    recall = recall_score(y_test, y_pred_binary)
    f1 = f1_score(y_test, y_pred_binary)
    
    tn, fp, fn, tp = confusion_matrix(y_test, y_pred_binary).ravel()
    
    specificity = tn / (tn + fp) if (tn + fp) > 0 else 0
    fp_rate = fp / (fp + tn) if (fp + tn) > 0 else 0
    g_mean = np.sqrt(recall * specificity)

    return accuracy, roc_auc, pr_auc, recall, f1, specificity, fp_rate, g_mean

import optuna
from sklearn.metrics import precision_recall_curve

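def objective(trial, X_train, y_train, X_test, y_test):
    """Objective function for Optuna to optimize hyperparameters.

    Each trial is scored by ROC-AUC on (X_test, y_test). Callers in this notebook
    pass the held-out test split, so the tuned metrics are optimistically biased;
    passing a separate validation split would avoid that.
    """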
    params = {
        "objective": "binary:logistic",
        "eval_metric": "auc",
        "tree_method": "gpu_hist",  # Use GPU
        "predictor": "gpu_predictor",
        "learning_rate": trial.suggest_float("learning_rate", 0.01, 0.3),
        "max_depth": trial.suggest_int("max_depth", 3, 10),
        "subsample": trial.suggest_float("subsample", 0.6, 1.0),
        "colsample_bytree": trial.suggest_float("colsample_bytree", 0.6, 1.0),
        "gamma": trial.suggest_float("gamma", 0, 1),
        "reg_alpha": trial.suggest_float("reg_alpha", 0, 1),
        "reg_lambda": trial.suggest_float("reg_lambda", 0, 1),
        "random_state": 42,
    }

    dtrain = xgb.DMatrix(X_train, label=y_train)
    dtest = xgb.DMatrix(X_test, label=y_test)

    model = xgb.train(params, dtrain, num_boost_round=200)
    y_pred_proba = model.predict(dtest)
    return roc_auc_score(y_test, y_pred_proba)

def model(X, y):
    # Train-test split (not stratified; pass stratify=y to keep the class ratio in both splits)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

    # Standardize numerical features
    scaler = StandardScaler()
    X_train = scaler.fit_transform(X_train)
    X_test = scaler.transform(X_test)

    ### Hyperparameter Optimization for Original Data (No SMOTE)
    study_no_smote = optuna.create_study(direction="maximize")
    study_no_smote.optimize(lambda trial: objective(trial, X_train, y_train, X_test, y_test), n_trials=50)

    # Best hyperparameters for original data
    best_params_no_smote = study_no_smote.best_params
    print("Best Hyperparameters (No SMOTE):", best_params_no_smote)

    ### Train XGBoost WITHOUT SMOTE using best hyperparameters
    dtrain = xgb.DMatrix(X_train, label=y_train)
    dtest = xgb.DMatrix(X_test, label=y_test)

    params_no_smote = {
        "objective": "binary:logistic",
        "eval_metric": "auc",
        "tree_method": "gpu_hist",  # Use GPU
        "predictor": "gpu_predictor",
        "random_state": 42,
    }
    params_no_smote.update(best_params_no_smote)  # Add best hyperparameters

    model_no_smote = xgb.train(params_no_smote, dtrain, num_boost_round=200)

    # Predict on test set
    y_pred_proba_no_smote = model_no_smote.predict(dtest)
    precision, recall, thresholds = precision_recall_curve(y_test, y_pred_proba_no_smote)
    f1_scores = (2 * precision * recall) / (precision + recall + 1e-6)  # Avoid division by zero
    optimal_idx = f1_scores.argmax()
    optimal_threshold = thresholds[optimal_idx]
    y_pred_binary_no_smote = [1 if p > optimal_threshold else 0 for p in y_pred_proba_no_smote]

    metrics_no_smote = calculate_metrics(y_test, y_pred_proba_no_smote)

    ### Apply SMOTE
    smote = SMOTE(sampling_strategy=1, random_state=42)  # Fully balance classes
    X_train_resampled, y_train_resampled = smote.fit_resample(X_train, y_train)

    ### Hyperparameter Optimization for Resampled Data (With SMOTE)
    study_with_smote = optuna.create_study(direction="maximize")
    study_with_smote.optimize(lambda trial: objective(trial, X_train_resampled, y_train_resampled, X_test, y_test), n_trials=50)

    # Best hyperparameters for resampled data
    best_params_with_smote = study_with_smote.best_params
    print("Best Hyperparameters (With SMOTE):", best_params_with_smote)

    ### Train XGBoost WITH SMOTE using best hyperparameters
    dtrain_resampled = xgb.DMatrix(X_train_resampled, label=y_train_resampled)
    dtest = xgb.DMatrix(X_test, label=y_test)

    params_with_smote = {
        "objective": "binary:logistic",
        "eval_metric": "auc",
        "tree_method": "gpu_hist",  # Use GPU
        "predictor": "gpu_predictor",
        "random_state": 42,
    }
    params_with_smote.update(best_params_with_smote)  # Add best hyperparameters

    model_with_smote = xgb.train(params_with_smote, dtrain_resampled, num_boost_round=200)

    # Predict on test set
    y_pred_proba_with_smote = model_with_smote.predict(dtest)
    precision, recall, thresholds = precision_recall_curve(y_test, y_pred_proba_with_smote)
    f1_scores = (2 * precision * recall) / (precision + recall + 1e-6)  # Avoid division by zero
    optimal_idx = f1_scores.argmax()
    optimal_threshold = thresholds[optimal_idx]
    y_pred_binary_with_smote = [1 if p > optimal_threshold else 0 for p in y_pred_proba_with_smote]

    metrics_with_smote = calculate_metrics(y_test, y_pred_proba_with_smote)

    # Print comparison
    metric_names = ["Accuracy", "ROC-AUC", "PR-AUC", "Recall (Sensitivity)", "F1", "Specificity", "FP-Rate", "G-Mean"]
    print("\n--- Model Performance ---")
    print("{:<20} {:<10} {:<10}".format("Metric", "No SMOTE", "With Regular SMOTE"))
    for name, no_smote, with_smote in zip(metric_names, metrics_no_smote, metrics_with_smote):
        print(f"{name:<20} {no_smote:.4f}   {with_smote:.4f}")

    return [metrics_no_smote, metrics_with_smote]


In [14]:
### Apply SMOTE
smote = SMOTE(sampling_strategy=1, random_state=42)  # Fully balance classes
X_train_resampled, y_train_resampled = smote.fit_resample(X, y)

In [18]:
X_res,_y_res=custom_smote_with_cubic_interpolation(X,y['Y'])

19908


100%|███████████████████████████████████████████████████████████████████████████| 19908/19908 [01:42<00:00, 193.53it/s]


In [21]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)


In [23]:
X_test

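| | X1 | X2 | X3 | X4 | X5 | X6 | X7 | X8 | X9 | X10 | ... | X14 | X15 | X16 | X17 | X18 | X19 | X20 | X21 | X22 | X23 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2308 | 30000 | 1 | 2 | 2 | 25 | 0 | 0 | 0 | 0 | 0 | ... | 11581 | 12580 | 13716 | 14828 | 1500 | 2000 | 1500 | 1500 | 1500 | 2000 |
| 22404 | 150000 | 2 | 1 | 2 | 26 | 0 | 0 | 0 | 0 | 0 | ... | 116684 | 101581 | 77741 | 77264 | 4486 | 4235 | 3161 | 2647 | 2669 | 2669 |
| 23397 | 70000 | 2 | 3 | 1 | 32 | 0 | 0 | 0 | 0 | 0 | ... | 68530 | 69753 | 70111 | 70212 | 2431 | 3112 | 3000 | 2438 | 2500 | 2554 |
| 25058 | 130000 | 1 | 3 | 2 | 49 | 0 | 0 | 0 | 0 | 0 | ... | 16172 | 16898 | 11236 | 6944 | 1610 | 1808 | 7014 | 27 | 7011 | 4408 |
| 2664 | 50000 | 2 | 2 | 2 | 36 | 0 | 0 | 0 | 0 | 0 | ... | 42361 | 19574 | 20295 | 19439 | 2000 | 1500 | 1000 | 1800 | 0 | 1000 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 3941 | 410000 | 2 | 1 | 2 | 34 | 1 | -1 | -1 | -2 | -2 | ... | 0 | 0 | 0 | 666 | 13621 | 0 | 0 | 0 | 666 | 0 |
| 17854 | 210000 | 1 | 1 | 2 | 27 | 0 | 0 | 0 | 0 | 0 | ... | 45622 | 47232 | 47583 | 53032 | 8000 | 5000 | 4000 | 3000 | 8000 | 3000 |
| 95 | 90000 | 1 | 2 | 2 | 35 | 0 | 0 | 0 | 0 | 0 | ... | 87653 | 35565 | 30942 | 30835 | 3621 | 3597 | 1179 | 1112 | 1104 | 1143 |
| 6279 | 220000 | 2 | 2 | 1 | 36 | 0 | 0 | 0 | 0 | 0 | ... | 142295 | 145127 | 148159 | 151462 | 5100 | 5163 | 5196 | 5372 | 5761 | 5396 |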


# **Custom SMOTE with Cubic Interpolation (My Development)**

## **Introduction**
This is a modified version of the **Synthetic Minority Over-sampling Technique (SMOTE)** in which synthetic samples are generated by **third-degree polynomial interpolation** rather than linear interpolation. The goal is to preserve complex feature relationships and avoid overly simplistic synthetic samples.

## **Algorithm Steps**
1. **Identify the minority class** in the dataset and select one of its samples $x_i$.
2. **Find its k-nearest neighbors** within the minority class using Euclidean distance.
3. **Randomly select one of these neighbors** $x_{nn}$ for interpolation.
4. **Use cubic interpolation** between the selected sample $x_i$ and its neighbor $x_{nn}$:
   - Define reference points between $x_i$ and $x_{nn}$.
   - Fit a third-degree polynomial for each feature.
   - Sample a new synthetic point using the polynomial.

## **Mathematical Formulation**
For a given minority class instance $x_i$, let $x_{nn}$ be one of its k-nearest neighbors. We define four reference points:

$$
x_0 = x_i, \quad x_1 = \frac{2x_i + x_{nn}}{3}, \quad x_2 = \frac{x_i + 2x_{nn}}{3}, \quad x_3 = x_{nn}
$$

These points are assigned the following $t$-values (0.33 and 0.66 are rounded approximations of $1/3$ and $2/3$):

$$
t_0 = 0, \quad t_1 = 0.33, \quad t_2 = 0.66, \quad t_3 = 1
$$

A third-degree polynomial is fitted for each feature using these values:

$$
P(t) = a_0 + a_1 t + a_2 t^2 + a_3 t^3
$$

where the coefficients $(a_0, a_1, a_2, a_3)$ are determined by the reference points. A synthetic sample is generated by evaluating the polynomial at a randomly chosen $t_{\text{rand}} \sim U(0,1)$:

$$
x_{\text{new}} = P(t_{\text{rand}})
$$

This process is repeated until the desired number of synthetic samples is generated.
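
It may help to see what the fitted cubic looks like for these particular reference points. Polynomial interpolation is linear in the data, and all four reference points lie on the segment from $x_i$ to $x_{nn}$, so every feature shares one scalar curve: $P(t) = x_i + g(t)\,(x_{nn} - x_i)$, where $g$ is the cubic through $(0, 0)$, $(0.33, 1/3)$, $(0.66, 2/3)$, $(1, 1)$. The sketch below checks this numerically; the vectors `x_i`, `x_nn` and the value `t_rand` are hypothetical.

```python
import numpy as np
from numpy.polynomial.polynomial import Polynomial

x_i = np.array([1.0, 10.0, -2.0])      # hypothetical minority sample
x_nn = np.array([3.0, 4.0, 0.5])       # hypothetical neighbor
t_ref = np.array([0.0, 0.33, 0.66, 1.0])
x_ref = np.vstack([x_i, (2 * x_i + x_nn) / 3, (x_i + 2 * x_nn) / 3, x_nn])

# Scalar cubic g(t) through (t_j, s_j) with s = 0, 1/3, 2/3, 1
g = Polynomial.fit(t_ref, np.array([0.0, 1 / 3, 2 / 3, 1.0]), 3)

# How far g(t) departs from the plain SMOTE weight lambda = t
t = np.linspace(0.0, 1.0, 101)
max_warp = np.max(np.abs(g(t) - t))

# Per-feature cubic fit versus the closed form x_i + g(t) * (x_nn - x_i)
t_rand = 0.42
per_feature = np.array([Polynomial.fit(t_ref, x_ref[:, j], 3)(t_rand) for j in range(3)])
closed_form = x_i + g(t_rand) * (x_nn - x_i)

print("max |g(t) - t| on [0, 1]:", max_warp)
print("per-feature fit matches closed form:", np.allclose(per_feature, closed_form))
```

With exact $t$-values of $1/3$ and $2/3$, $g(t) = t$ and the scheme coincides with linear SMOTE; with the rounded values, the synthetic point still lies on the same segment and only its position along it shifts slightly.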

## **Advantages of Custom SMOTE with Cubic Interpolation**
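- **Smooth interpolation path**: the cubic $P(t)$ passes through all four reference points, so synthetic samples vary **smoothly** between $x_i$ and $x_{nn}$.
- **Low risk of generating outliers**: the intermediate reference points lie on the segment joining $x_i$ and $x_{nn}$, which **constrains synthetic samples** to that segment.
- **Same neighborhood structure as linear SMOTE**: because the reference points are collinear, the fitted curve stays on the straight segment used by linear SMOTE and differs only in how positions along it are distributed. Capturing genuinely **non-linear patterns** would require reference points that leave that segment.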

## **Limitations**
- **Computationally expensive**: Fitting a polynomial for each feature requires more computation than linear interpolation.
- **Risk of overfitting**: If the minority class has a complex distribution, the interpolation might introduce synthetic samples that do not generalize well.
- **Sensitive to noisy data**: If the minority class contains outliers, the interpolation may exaggerate these variations.
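
On the computational cost: since the $t$-values are the same for every feature, one possible shortcut is to compute the Lagrange basis weights for $t_{\text{rand}}$ once and apply them to all features with a single matrix product. The sketch below is an assumption about how that could be done, not part of the notebook's implementation; `lagrange_weights` and the random reference rows are hypothetical.

```python
import numpy as np
from numpy.polynomial.polynomial import Polynomial

def lagrange_weights(t_ref, t):
    """Weights w_j(t) such that P(t) = sum_j w_j(t) * x_j for the interpolating polynomial."""
    t_ref = np.asarray(t_ref, dtype=float)
    w = np.ones(len(t_ref))
    for j in range(len(t_ref)):
        for m in range(len(t_ref)):
            if m != j:
                w[j] *= (t - t_ref[m]) / (t_ref[j] - t_ref[m])
    return w

t_ref = np.array([0.0, 0.33, 0.66, 1.0])
rng = np.random.default_rng(0)
x_ref = rng.normal(size=(4, 23))      # 4 hypothetical reference rows, 23 features
t_rand = 0.7

# One matrix product evaluates the cubic for all features at once
fast = lagrange_weights(t_ref, t_rand) @ x_ref

# Reference: one polynomial fit per feature
slow = np.array([Polynomial.fit(t_ref, x_ref[:, j], 3)(t_rand)
                 for j in range(x_ref.shape[1])])

print("vectorized result matches per-feature fits:", np.allclose(fast, slow))
```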


In [7]:
import numpy as np
import pandas as pd
from sklearn.neighbors import NearestNeighbors
from numpy.polynomial.polynomial import Polynomial
from tqdm import tqdm  # progress bar used in the sampling loop

def custom_smote_with_cubic_interpolation(X: pd.DataFrame, y: pd.Series, target_class=1, k_neighbors=5, random_state=42):
    """
    Custom SMOTE using 3rd-degree polynomial interpolation.

    Parameters:
        X (pd.DataFrame): Feature matrix.
        y (pd.Series): Target labels.
        target_class (int): The minority class to oversample.
        k_neighbors (int): Number of nearest neighbors to consider.
        random_state (int): Random seed for reproducibility.

    The number of synthetic samples is derived from the data as
    floor(n_other / n_minority) * n_minority, where n_other counts all
    rows that do not belong to target_class.

    Returns:
        X_resampled (pd.DataFrame): New feature matrix with synthetic samples.
        y_resampled (pd.Series): Updated target labels.
    """
    np.random.seed(random_state)
    
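    # Reset both indices so the boolean mask below aligns row by row
    X = X.reset_index(drop=True)
    y = y.reset_index(drop=True)

    # Separate minority class
    X_minority = X[y == target_class]
    # Synthetic samples generated per minority row: floor(n_other / n_minority)
    sampling_ratio = np.floor((X.shape[0] - X_minority.shape[0]) / X_minority.shape[0])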
    
    # Fit KNN on minority class
    knn = NearestNeighbors(n_neighbors=min(k_neighbors, len(X_minority)))
    knn.fit(X_minority)
    
    # Determine number of synthetic samples to generate
    n_samples = int(len(X_minority) * sampling_ratio)

    print(n_samples)
    
    synthetic_samples = []
    
    for _ in tqdm(range(n_samples)):
        # Randomly select a minority sample
        idx = np.random.randint(0, len(X_minority))
        x_selected = X_minority.iloc[idx].values  # Convert to NumPy array
        
        # Find k-nearest neighbors
        neighbors = knn.kneighbors([x_selected], return_distance=False)[0]
        
        # Select a random neighbor
        neighbor_idx = np.random.choice(neighbors[1:])  # Exclude itself
        x_neighbor = X_minority.iloc[neighbor_idx].values  # Convert to NumPy array
        
        # Fit a 3rd-degree polynomial between x_selected and x_neighbor
        t_values = np.array([0, 0.33, 0.66, 1])  # 4 reference points in [0,1]
        x_values = np.vstack([x_selected, 
                              (2*x_selected + x_neighbor)/3, 
                              (x_selected + 2*x_neighbor)/3, 
                              x_neighbor])  # Intermediate points
        
        # Generate polynomial coefficients for each feature
        x_synthetic = np.zeros_like(x_selected)
        t_random = np.random.rand()  # Random t in [0,1]
        
        for feature_idx in range(X.shape[1]):  # Iterate over all features
            poly = Polynomial.fit(t_values, x_values[:, feature_idx], 3)  # Fit cubic polynomial
            x_synthetic[feature_idx] = poly(t_random)  # Sample new point
        
        synthetic_samples.append(x_synthetic)
    
    # Convert synthetic samples to DataFrame
    synthetic_samples_df = pd.DataFrame(synthetic_samples, columns=X.columns)
    
    # Create new dataset (append synthetic data)
    X_resampled = pd.concat([X, synthetic_samples_df], axis=0, ignore_index=True)
    y_resampled = pd.concat([y, pd.Series(target_class, index=synthetic_samples_df.index)], axis=0, ignore_index=True)
    
    return X_resampled, y_resampled


In [26]:
def model_smote_poly(X, y):
    # Train-test split (not stratified; pass stratify=y to keep the class ratio in both splits)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

    # Standardize numerical features
    scaler = StandardScaler()
    X_train = scaler.fit_transform(X_train)
    X_test = scaler.transform(X_test)

    ### Hyperparameter Optimization for Original Data (No SMOTE)
    study_no_smote = optuna.create_study(direction="maximize")
    study_no_smote.optimize(lambda trial: objective(trial, X_train, y_train, X_test, y_test), n_trials=50)

    # Best hyperparameters for original data
    best_params_no_smote = study_no_smote.best_params
    print("Best Hyperparameters (No SMOTE):", best_params_no_smote)

    ### Train XGBoost WITHOUT SMOTE using best hyperparameters
    dtrain = xgb.DMatrix(X_train, label=y_train)
    dtest = xgb.DMatrix(X_test, label=y_test)

    params_no_smote = {
        "objective": "binary:logistic",
        "eval_metric": "auc",
        "tree_method": "gpu_hist",  # Use GPU
        "predictor": "gpu_predictor",
        "random_state": 42,
    }
    params_no_smote.update(best_params_no_smote)  # Add best hyperparameters

    model_no_smote = xgb.train(params_no_smote, dtrain, num_boost_round=200)

    # Predict on test set
    y_pred_proba_no_smote = model_no_smote.predict(dtest)
    precision, recall, thresholds = precision_recall_curve(y_test, y_pred_proba_no_smote)
    f1_scores = (2 * precision * recall) / (precision + recall + 1e-6)  # Avoid division by zero
    optimal_idx = f1_scores.argmax()
    optimal_threshold = thresholds[optimal_idx]
    y_pred_binary_no_smote = [1 if p > optimal_threshold else 0 for p in y_pred_proba_no_smote]

    metrics_no_smote = calculate_metrics(y_test, y_pred_proba_no_smote)

    ### Apply Custom SMOTE with Cubic Polynomial Interpolation
    X_train_resampled, y_train_resampled = custom_smote_with_cubic_interpolation(pd.DataFrame(X_train, columns=X.columns), y_train)

    ### Hyperparameter Optimization for Resampled Data (With Cubic Polynomial SMOTE)
    study_with_smote = optuna.create_study(direction="maximize")
    study_with_smote.optimize(lambda trial: objective(trial, np.array(X_train_resampled), y_train_resampled, X_test, y_test), n_trials=50)

    # Best hyperparameters for resampled data
    best_params_with_smote = study_with_smote.best_params

    params_with_smote = {
        "objective": "binary:logistic",
        "eval_metric": "auc",
        "tree_method": "gpu_hist",  # Use GPU
        "predictor": "gpu_predictor",
        "random_state": 42,
    }
    params_with_smote.update(best_params_with_smote)  # Add best hyperparameters
    print("Best Hyperparameters (With Cubic Polynomial SMOTE):", best_params_with_smote)

    ### Train XGBoost WITH SMOTE using best hyperparameters
    dtrain_resampled = xgb.DMatrix(np.array(X_train_resampled), label=y_train_resampled)
    model_with_smote = xgb.train(params_with_smote, dtrain_resampled, num_boost_round=200)

    # Predict on test set
    y_pred_proba_with_smote = model_with_smote.predict(dtest)
    precision, recall, thresholds = precision_recall_curve(y_test, y_pred_proba_with_smote)
    f1_scores = (2 * precision * recall) / (precision + recall + 1e-6)  # Avoid division by zero
    optimal_idx = f1_scores.argmax()
    optimal_threshold = thresholds[optimal_idx]
    y_pred_binary_with_smote = [1 if p > optimal_threshold else 0 for p in y_pred_proba_with_smote]

    metrics_with_smote = calculate_metrics(y_test, y_pred_proba_with_smote)

    # Print comparison
    metric_names = ["Accuracy", "ROC-AUC", "PR-AUC", "Recall (Sensitivity)", "F1", "Specificity", "FP-Rate", "G-Mean"]
    print("\n--- Model Performance ---")
    print("{:<20} {:<10} {:<10}".format("Metric", "No SMOTE", "With Cubic Polynomial SMOTE"))
    for name, no_smote, with_smote in zip(metric_names, metrics_no_smote, metrics_with_smote):
        print(f"{name:<20} {no_smote:.4f}   {with_smote:.4f}")

    return [metrics_no_smote, metrics_with_smote]


---

# **Model Performance Metrics for Credit Risk Default Prediction**

In credit risk modeling, correctly classifying **defaulting customers** is crucial, as misclassifications can lead to **financial losses** (false negatives) or **lost opportunities** (false positives). The following metrics help assess model performance:

## **1. Accuracy**
Accuracy measures the proportion of correctly classified instances over the total dataset:

$$
\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}
$$

where:
- $TP$ = True Positives (correctly predicted defaults)
- $TN$ = True Negatives (correctly predicted non-defaults)
- $FP$ = False Positives (incorrectly predicted defaults)
- $FN$ = False Negatives (incorrectly predicted non-defaults)

### **Importance in Credit Risk:**
- Accuracy gives an overall measure of correctness but can be **misleading in imbalanced datasets** (e.g., if defaults are rare, a model predicting all customers as non-defaults can still have high accuracy).
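
A small illustration of this point, using hypothetical labels in which 22% of customers default and a trivial model that never predicts a default:

```python
import numpy as np

y_true = np.array([1] * 22 + [0] * 78)   # hypothetical labels: 22% defaults
y_pred = np.zeros_like(y_true)           # trivial model: never predicts default

accuracy = np.mean(y_pred == y_true)
recall = np.sum((y_pred == 1) & (y_true == 1)) / np.sum(y_true == 1)

print(f"accuracy = {accuracy:.2f}, recall = {recall:.2f}")
```

The model reaches 78% accuracy while detecting none of the defaults, which is why accuracy is reported alongside recall-oriented metrics.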

---

## **2. ROC-AUC (Receiver Operating Characteristic - Area Under Curve)**
The **ROC-AUC** measures a model’s ability to distinguish between the positive (default) and negative (non-default) classes. The **ROC curve** plots the **True Positive Rate (Recall)** against the **False Positive Rate (FPR)** at different classification thresholds.  

### **Mathematical Formulation:**
The **AUC (Area Under Curve)** is computed as:

$$
\text{AUC} = \int_0^1 \text{TPR} \, d(\text{FPR})
$$

where:
- **True Positive Rate (TPR) / Recall:**
  $$
  \text{TPR} = \frac{TP}{TP + FN}
  $$
- **False Positive Rate (FPR):**
  $$
  \text{FPR} = \frac{FP}{FP + TN}
  $$

### **Importance in Credit Risk:**
- **Higher AUC** means the model **better separates defaults from non-defaults**.
- **AUC close to 0.5** suggests the model is **random** (not useful).
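
One way to build intuition for the integral: ROC-AUC equals the probability that a randomly chosen default receives a higher score than a randomly chosen non-default, with ties counted as one half. The sketch below computes it that way on toy data; `roc_auc_pairwise` is a hypothetical helper, and its quadratic memory use makes it suitable only for small examples.

```python
import numpy as np

def roc_auc_pairwise(y_true, scores):
    """ROC-AUC as the fraction of (positive, negative) pairs ranked correctly."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    diff = pos[:, None] - neg[None, :]          # every positive vs. every negative
    return (np.sum(diff > 0) + 0.5 * np.sum(diff == 0)) / (len(pos) * len(neg))

# Toy example: 3 of the 4 (positive, negative) pairs are ordered correctly
auc = roc_auc_pairwise([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(auc)  # 0.75
```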

---

## **3. PR-AUC (Precision-Recall Area Under Curve)**
PR-AUC measures the area under the **Precision-Recall (PR) curve**, focusing on **positive (default) predictions**.

### **Mathematical Formulation:**
The **AUC for Precision-Recall** is:

$$
\text{PR-AUC} = \int_0^1 \text{Precision} \, d(\text{Recall})
$$

where:
- **Precision (Positive Predictive Value, PPV):**
  $$
  \text{Precision} = \frac{TP}{TP + FP}
  $$

- **Recall (Sensitivity / TPR) (as defined above)**

### **Importance in Credit Risk:**
- **Often more informative than ROC-AUC** for **imbalanced data**, since it focuses on the **default (positive) class** and is not inflated by the large number of true negatives.
- **Higher PR-AUC** indicates a better balance between **precision and recall**.
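
The notebook computes PR-AUC with scikit-learn's `average_precision_score`, which summarizes the curve as a step-wise sum rather than a trapezoidal integral. A minimal sketch of that sum, assuming no tied scores (`average_precision` is a hypothetical helper):

```python
import numpy as np

def average_precision(y_true, scores):
    """Mean of the precision values measured at the rank of each true positive."""
    y_true = np.asarray(y_true)
    order = np.argsort(-np.asarray(scores, dtype=float))   # highest score first
    hits = y_true[order]
    precision_at_k = np.cumsum(hits) / np.arange(1, len(hits) + 1)
    return np.sum(precision_at_k * hits) / np.sum(hits)

# Positives sit at ranks 1 and 3, so AP = (1/1 + 2/3) / 2
ap = average_precision([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(round(ap, 4))  # 0.8333
```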

---

## **4. Recall (Sensitivity)**
Recall, also called **Sensitivity or True Positive Rate (TPR)**, measures the ability to detect **actual defaults**:

$$
\text{Recall} = \frac{TP}{TP + FN}
$$

### **Importance in Credit Risk:**
- **High recall** ensures **most actual defaults are detected**, minimizing **false negatives**.
- **Low recall** means many **defaulting customers** are **missed**, leading to **financial losses**.

---

## **5. F1-Score**
F1-Score is the harmonic mean of **Precision** and **Recall**, balancing both:

$$
F1 = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}
$$

### **Importance in Credit Risk:**
- **Best when both False Positives & False Negatives are costly**.
- **Useful in imbalanced datasets**, where a high precision or recall alone isn't enough.
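
Because F1 depends on the classification threshold, it is sometimes used to choose that threshold instead of fixing it at 0.5. The following sketch sweeps the distinct predicted probabilities and keeps the cut-off with the highest F1; `best_f1_threshold` and the toy data are hypothetical, and it uses the equivalent form $F1 = 2TP / (2TP + FP + FN)$.

```python
import numpy as np

def best_f1_threshold(y_true, proba):
    """Return (threshold, F1) for the cut-off that maximizes F1 (predict 1 if proba >= t)."""
    y_true = np.asarray(y_true)
    proba = np.asarray(proba, dtype=float)
    best_t, best_f1 = 0.5, -1.0
    for t in np.unique(proba):
        pred = proba >= t
        tp = np.sum(pred & (y_true == 1))
        fp = np.sum(pred & (y_true == 0))
        fn = np.sum(~pred & (y_true == 1))
        f1 = 2 * tp / (2 * tp + fp + fn) if tp > 0 else 0.0
        if f1 > best_f1:
            best_t, best_f1 = t, f1
    return best_t, best_f1

y_true = [0, 0, 1, 0, 1, 1]
proba = [0.05, 0.20, 0.30, 0.45, 0.60, 0.90]
threshold, f1 = best_f1_threshold(y_true, proba)
print(threshold, round(f1, 4))  # 0.3 0.8571
```

In this toy example the F1-optimal cut-off (0.30) is well below 0.5, which is typical when the positive class is rare and predicted probabilities are low.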

---

## **6. Specificity (True Negative Rate)**
Specificity measures how well the model identifies **non-defaulting customers**:

$$
\text{Specificity} = \frac{TN}{TN + FP}
$$

### **Importance in Credit Risk:**
- **Higher specificity** reduces **false alarms (FP)**.
- **Too high specificity may mean recall is low**, missing many defaults.

---

## **7. False Positive Rate (FPR)**
FPR is the proportion of **non-defaulting customers incorrectly classified as defaults**:

$$
\text{FPR} = \frac{FP}{FP + TN} = 1 - \text{Specificity}
$$

### **Importance in Credit Risk:**
- **Low FPR** ensures fewer **non-defaulters are wrongly flagged**, reducing unnecessary **loan rejections**.
- **High FPR** can **hurt customer experience**, since creditworthy applicants are turned away.

---

## **8. G-Mean (Geometric Mean)**
The **G-Mean** is a performance metric balancing **recall** and **specificity**:

$$
G\text{-Mean} = \sqrt{\text{Recall} \times \text{Specificity}}
$$

### **Importance in Credit Risk:**
- **Higher G-Mean** ensures the model performs well on **both default and non-default classes**.
- Useful for **handling class imbalance**, where one metric alone (like accuracy) can be misleading.
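
As a quick consistency check, the formula can be applied to the recall and specificity values printed in the Model Training And Results section. The numbers below are copied from that output; since they are rounded to four decimals, agreement is expected only to within rounding.

```python
import math

# (recall, specificity, reported G-Mean) copied from the printed results
reported = {
    "No SMOTE": (0.3556, 0.9516, 0.5817),
    "Regular SMOTE": (0.5291, 0.8518, 0.6713),
    "Cubic polynomial SMOTE": (0.5439, 0.8351, 0.6739),
}

recomputed = {name: math.sqrt(rec * spec) for name, (rec, spec, _) in reported.items()}

for name, (rec, spec, g_rep) in reported.items():
    print(f"{name:<24} recomputed={recomputed[name]:.4f} reported={g_rep:.4f}")
```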

---

# **Summary Table of Metrics**
| **Metric**         | **Interpretation** |
|--------------------|------------------|
| **Accuracy**      | Overall correctness, but misleading in imbalanced data |
| **ROC-AUC**       | Ability to distinguish defaults vs. non-defaults |
| **PR-AUC**        | Performance on the default class, useful for imbalance |
| **Recall**        | Ability to detect defaults (avoid false negatives) |
| **F1-Score**      | Balance between Precision & Recall |
| **Specificity**   | Correctly identifying non-defaulters |
| **FPR**           | Incorrectly flagging non-defaulters as defaults |
| **G-Mean**        | Balance between Recall & Specificity (useful for imbalance) |

---

# **Final Thoughts**
For **credit risk prediction**, metrics should be **carefully chosen** based on **business priorities**:

- **If missing defaults is costly** → **High Recall (Sensitivity)**.
- **If wrongly flagging non-defaulters is a concern** → **Low False Positive Rate (FPR)**.
- **For overall balance** → **High G-Mean & F1-Score**.

# Data

## Data Fetch in YKB Computer - Taiwan Credit Data

In [6]:
df=pd.read_excel('default of credit card clients.xls',index_col=0).iloc[1:,:]

X=df.iloc[:,:-1]
y=pd.DataFrame(df.iloc[:,-1],columns=['Y'])
X=X.astype(float)
y=y.astype(int)
X.head()

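| | X1 | X2 | X3 | X4 | X5 | X6 | X7 | X8 | X9 | X10 | ... | X14 | X15 | X16 | X17 | X18 | X19 | X20 | X21 | X22 | X23 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 20000.0 | 2.0 | 2.0 | 1.0 | 24.0 | 2.0 | 2.0 | -1.0 | -1.0 | -2.0 | ... | 689.0 | 0.0 | 0.0 | 0.0 | 0.0 | 689.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 120000.0 | 2.0 | 2.0 | 2.0 | 26.0 | -1.0 | 2.0 | 0.0 | 0.0 | 0.0 | ... | 2682.0 | 3272.0 | 3455.0 | 3261.0 | 0.0 | 1000.0 | 1000.0 | 1000.0 | 0.0 | 2000.0 |
| 3 | 90000.0 | 2.0 | 2.0 | 2.0 | 34.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 13559.0 | 14331.0 | 14948.0 | 15549.0 | 1518.0 | 1500.0 | 1000.0 | 1000.0 | 1000.0 | 5000.0 |
| 4 | 50000.0 | 2.0 | 2.0 | 1.0 | 37.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 49291.0 | 28314.0 | 28959.0 | 29547.0 | 2000.0 | 2019.0 | 1200.0 | 1100.0 | 1069.0 | 1000.0 |
| 5 | 50000.0 | 1.0 | 2.0 | 1.0 | 57.0 | -1.0 | 0.0 | -1.0 | 0.0 | 0.0 | ... | 35835.0 | 20940.0 | 19146.0 | 19131.0 | 2000.0 | 36681.0 | 10000.0 | 9000.0 | 689.0 | 679.0 |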


## Data Fetch in Personal Computer -- Taiwan Credit Data

In [9]:
from ucimlrepo import fetch_ucirepo 
  
# fetch dataset 
default_of_credit_card_clients = fetch_ucirepo(id=350) 
  
# data (as pandas dataframes) 
X = default_of_credit_card_clients.data.features 
y = default_of_credit_card_clients.data.targets 

X.head()

ConnectionError: Error connecting to server

In [None]:
print(f"Percentage of Positive targets : {((y.sum()/y.count())*100).values[0]}%")

## Model Training And Results

In [27]:


print("Taiwan Credit dataset: \n")
output=model(X,y)
output_2=model_smote_poly(X,y['Y'])
output.append(output_2[1])

[I 2025-03-13 00:44:52,470] A new study created in memory with name: no-name-a595808f-8f73-4fda-bfb5-d0e1620124bc


Taiwan Credit dataset: 



[I 2025-03-13 00:44:53,877] Trial 0 finished with value: 0.766811333198052 and parameters: {'learning_rate': 0.08728447645777467, 'max_depth': 9, 'subsample': 0.791290072361565, 'colsample_bytree': 0.6557158967004103, 'gamma': 0.826981418165238, 'reg_alpha': 0.9882217675139366, 'reg_lambda': 0.422528159447663}. Best is trial 0 with value: 0.766811333198052.
[I 2025-03-13 00:44:54,293] Trial 1 finished with value: 0.7689011769480519 and parameters: {'learning_rate': 0.15783644765001797, 'max_depth': 4, 'subsample': 0.8085995812503914, 'colsample_bytree': 0.8038851810180707, 'gamma': 0.7505087497670536, 'reg_alpha': 0.07409263055057169, 'reg_lambda': 0.16118872198583567}. Best is trial 1 with value: 0.7689011769480519.
[I 2025-03-13 00:44:54,709] Trial 2 finished with value: 0.7571003884508349 and parameters: {'learning_rate': 0.28303142172776985, 'max_depth': 4, 'subsample': 0.6041032480194249, 'colsample_bytree': 0.7960379801770414, 'gamma': 0.5667587176457386, 'reg_alpha': 0.412750509

Best Hyperparameters (No SMOTE): {'learning_rate': 0.031239832980427443, 'max_depth': 4, 'subsample': 0.9249309198259971, 'colsample_bytree': 0.6860194343374963, 'gamma': 0.7578822768156139, 'reg_alpha': 0.29017177367967617, 'reg_lambda': 0.4638467463260988}


[I 2025-03-13 00:45:35,017] A new study created in memory with name: no-name-dd6d9995-41a5-42d1-8537-603f79da2fc6
[I 2025-03-13 00:45:35,364] Trial 0 finished with value: 0.7626787526089981 and parameters: {'learning_rate': 0.09990632415720277, 'max_depth': 3, 'subsample': 0.7328023989930054, 'colsample_bytree': 0.9423241643731001, 'gamma': 0.22641075985623604, 'reg_alpha': 0.6608855030281018, 'reg_lambda': 0.8599217643626863}. Best is trial 0 with value: 0.7626787526089981.
[I 2025-03-13 00:45:35,798] Trial 1 finished with value: 0.7596382913961038 and parameters: {'learning_rate': 0.21263352997017904, 'max_depth': 4, 'subsample': 0.7310097762922552, 'colsample_bytree': 0.6454912955828934, 'gamma': 0.7329541956573411, 'reg_alpha': 0.6175842694108722, 'reg_lambda': 0.531095746302933}. Best is trial 0 with value: 0.7626787526089981.
[I 2025-03-13 00:45:37,458] Trial 2 finished with value: 0.7489874188311689 and parameters: {'learning_rate': 0.16308269889961977, 'max_depth': 10, 'subsamp

Best Hyperparameters (With SMOTE): {'learning_rate': 0.010448007849115029, 'max_depth': 5, 'subsample': 0.7671801494534726, 'colsample_bytree': 0.8014523771518339, 'gamma': 0.784383417261917, 'reg_alpha': 0.5961087097696677, 'reg_lambda': 0.2172689724060965}


[I 2025-03-13 00:46:13,226] A new study created in memory with name: no-name-c1f39914-8d78-4b37-ab59-b76f47e37ad9



--- Model Performance ---
Metric               No SMOTE   With Regular SMOTE
Accuracy             0.8218   0.7816
ROC-AUC              0.7805   0.7697
PR-AUC               0.5427   0.5272
Recall (Sensitivity) 0.3556   0.5291
F1                   0.4650   0.5134
Specificity          0.9516   0.8518
FP-Rate              0.0484   0.1482
G-Mean               0.5817   0.6713


[I 2025-03-13 00:46:13,649] Trial 0 finished with value: 0.7794380145524119 and parameters: {'learning_rate': 0.05877623479915064, 'max_depth': 4, 'subsample': 0.9182037552014629, 'colsample_bytree': 0.7424724220209391, 'gamma': 0.6668936595584265, 'reg_alpha': 0.45863905304892605, 'reg_lambda': 0.36546334928933266}. Best is trial 0 with value: 0.7794380145524119.
[I 2025-03-13 00:46:14,078] Trial 1 finished with value: 0.7788166381609462 and parameters: {'learning_rate': 0.021603889078995624, 'max_depth': 4, 'subsample': 0.8329710637190378, 'colsample_bytree': 0.8228019556312106, 'gamma': 0.9701677970425782, 'reg_alpha': 0.6944120579842806, 'reg_lambda': 0.1041478084684766}. Best is trial 0 with value: 0.7794380145524119.
[I 2025-03-13 00:46:14,506] Trial 2 finished with value: 0.7708765146683673 and parameters: {'learning_rate': 0.1633129741915907, 'max_depth': 4, 'subsample': 0.7614380227720041, 'colsample_bytree': 0.6189159570090189, 'gamma': 0.24508241297376332, 'reg_alpha': 0.275

Best Hyperparameters (No SMOTE): {'learning_rate': 0.019637241170928747, 'max_depth': 6, 'subsample': 0.7492742171276892, 'colsample_bytree': 0.713204192215958, 'gamma': 0.27849864691005444, 'reg_alpha': 0.6489156504242536, 'reg_lambda': 0.2106809204925585}
14028


100%|███████████████████████████████████████████████████████████████████████████| 14028/14028 [00:56<00:00, 247.43it/s]
[I 2025-03-13 00:47:42,217] A new study created in memory with name: no-name-1525bb92-fdd6-45f4-a2b1-f2ae8de3bb12
[I 2025-03-13 00:47:43,444] Trial 0 finished with value: 0.7448773046150278 and parameters: {'learning_rate': 0.20075005694944115, 'max_depth': 8, 'subsample': 0.8660751646270958, 'colsample_bytree': 0.8906098113009548, 'gamma': 0.21266880437132518, 'reg_alpha': 0.45895279370173747, 'reg_lambda': 0.7918204184793575}. Best is trial 0 with value: 0.7448773046150278.
[I 2025-03-13 00:47:44,075] Trial 1 finished with value: 0.7515789874188311 and parameters: {'learning_rate': 0.2066000288033237, 'max_depth': 6, 'subsample': 0.9796395478986142, 'colsample_bytree': 0.9527873506418316, 'gamma': 0.062010815490333715, 'reg_alpha': 0.5853472831391942, 'reg_lambda': 0.31463824831779263}. Best is trial 1 with value: 0.7515789874188311.
[I 2025-03-13 00:47:44,438] Tria

Best Hyperparameters (With Cubic Polynomial SMOTE): {'learning_rate': 0.010182322099186234, 'max_depth': 7, 'subsample': 0.6584985306347424, 'colsample_bytree': 0.8398310548010428, 'gamma': 0.45862192702927573, 'reg_alpha': 0.9913301295621808, 'reg_lambda': 0.0010727495144577937}

--- Model Performance ---
Metric               No SMOTE   With Cubic Polynomial SMOTE
Accuracy             0.8199   0.7717
ROC-AUC              0.7809   0.7674
PR-AUC               0.5444   0.5247
Recall (Sensitivity) 0.3474   0.5439
F1                   0.4566   0.5092
Specificity          0.9514   0.8351
FP-Rate              0.0486   0.1649
G-Mean               0.5750   0.6739


In [29]:
metrics=output

In [30]:
metric_names = ["Accuracy", "ROC-AUC", "PR-AUC", "Recall (Sensitivity)", 
                "F1", "Specificity", "FP-Rate", "G-Mean"]

print("\n--- Model Performance ---")
print("{:<30} {:<10} {:<20} {:<20}".format("Metric", "No SMOTE",
                                            "With Regular SMOTE",
                                            "With Cubic Polynomial Interpolation SMOTE"))

for i in range(len(metric_names)):
    print("{:<30} {:.3f}              {:.3f}              {:.3f}".format(
        metric_names[i], metrics[0][i], metrics[1][i], metrics[2][i]))


--- Model Performance ---
Metric                         No SMOTE   With Regular SMOTE   With Cubic Polynomial Interpolation SMOTE
Accuracy                       0.822              0.782              0.772
ROC-AUC                        0.781              0.770              0.767
PR-AUC                         0.543              0.527              0.525
Recall (Sensitivity)           0.356              0.529              0.544
F1                             0.465              0.513              0.509
Specificity                    0.952              0.852              0.835
FP-Rate                        0.048              0.148              0.165
G-Mean                         0.582              0.671              0.674


# **ADASYN: Adaptive Synthetic Sampling Approach with Cubic Polynomial Interpolation**

## **1. Introduction**
In classification tasks with imbalanced datasets, where the number of instances in the minority class is significantly lower than in the majority class, machine learning models tend to be biased towards the majority class. To address this, various oversampling techniques have been developed, including the **Adaptive Synthetic Sampling (ADASYN) algorithm**.  

ADASYN improves upon traditional oversampling techniques, such as **SMOTE**, by adaptively generating synthetic samples according to the **local distribution** of the minority class. Specifically, it focuses on generating more synthetic samples in **harder-to-learn regions**, where the local class imbalance is more pronounced.  

In this work, we **replace the standard linear interpolation method** used in ADASYN with **cubic polynomial interpolation**, which is intended to provide a smoother and more diverse distribution of synthetic samples in high-dimensional feature spaces.

---

## **2. Algorithm Description**

Let $X \in \mathbb{R}^{n \times m}$ represent the dataset, where $n$ is the number of samples, and $m$ is the number of features. Let the class labels be $y \in \{C_1, C_2\}$, where $C_1$ is the minority class and $C_2$ is the majority class.

### **Step 1: Define the Minority and Majority Classes**
The number of instances in each class is computed as:

$$
n_{\text{min}} = |X_{\text{min}}|, \quad n_{\text{maj}} = |X_{\text{maj}}|
$$

where $X_{\text{min}}$ and $X_{\text{maj}}$ represent the subsets of $X$ belonging to the minority and majority classes, respectively.

The class imbalance ratio is then given by:

$$
d = \frac{n_{\text{min}}}{n_{\text{maj}}}
$$

ADASYN aims to **balance the dataset** by generating synthetic samples until $d \approx 1$.
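
As a minimal sketch of Step 1, the class counts and the ratio $d$ can be computed directly from the label vector. The helper name `imbalance_ratio` and the toy labels are illustrative assumptions, not part of the notebook.

```python
import numpy as np

def imbalance_ratio(y):
    """Return (n_min, n_maj, d) for a binary label vector y."""
    _, counts = np.unique(y, return_counts=True)
    n_min, n_maj = int(counts.min()), int(counts.max())
    return n_min, n_maj, n_min / n_maj

# Hypothetical labels: 8 majority samples (0) and 2 minority samples (1)
y_toy = np.array([0] * 8 + [1] * 2)
n_min, n_maj, d = imbalance_ratio(y_toy)
print(n_min, n_maj, d)  # 2 8 0.25
```

A value of $d$ close to 1 means the classes are already roughly balanced; here $d = 0.25$.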

---

### **Step 2: Compute the Number of Synthetic Samples**
The total number of synthetic samples to be generated is:

$$
G = (n_{\text{maj}} - n_{\text{min}}) \cdot \beta
$$

where $\beta \in [0, 1]$ is a parameter that specifies the desired balance level after the synthetic data are generated; $\beta = 1$ means that a fully balanced dataset is created by the generation process.

This total is then distributed over the minority samples: each $x_i$ is assigned a weight based on how difficult it is to classify.
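
As a small worked example of the total $G$ alone (before it is split across samples), the sketch below follows the `G = d*beta` line of the implementation later in this notebook, with `d = n_majority - n_minority`. The function name `n_synthetic` and the class counts are hypothetical.

```python
def n_synthetic(n_min, n_maj, beta=1.0):
    """Total number of synthetic minority samples: G = (n_maj - n_min) * beta."""
    if not 0.0 <= beta <= 1.0:
        raise ValueError("beta must lie in [0, 1]")
    return (n_maj - n_min) * beta

# Hypothetical counts: 200 minority and 1000 majority samples
print(n_synthetic(200, 1000, beta=1.0))  # 800.0
print(n_synthetic(200, 1000, beta=0.5))  # 400.0
```

With $\beta = 1$ the minority class is topped up to the majority count; with $\beta = 0.5$ only half of the gap is closed.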

For each $x_i \in X_{\text{min}}$, we compute the number of its $k$-nearest neighbors belonging to the majority class $X_{\text{maj}}$. Let $k_i^{\text{maj}}$ denote this count. The local distribution ratio $r_i$ is computed as:

$$
r_i = \frac{k_i^{\text{maj}}}{k}
$$

where $k$ is the total number of nearest neighbors considered.

The normalized weight for each $x_i$ is then:

$$
\tilde{r}_i = \frac{r_i}{\sum_{j=1}^{n_{\text{min}}} r_j}
$$

The number of synthetic samples required for each $x_i$ is:

$$
G_i = G \cdot \tilde{r}_i
$$

where $G_i$ is rounded down to an integer number of new samples to generate for $x_i$. Because of this rounding, the number of samples actually generated can be slightly smaller than $G$.
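
The allocation in this step can be sketched with plain NumPy. This is an illustration under stated assumptions, not the notebook's implementation: the helper name `adasyn_allocation`, the brute-force distance computation, and the toy one-dimensional data are all hypothetical, and the per-sample counts are rounded down as `int()` does in the code further below.

```python
import numpy as np

def adasyn_allocation(X, y, minority_label, k, G):
    """Return (r, r_tilde, G_i) for every minority sample, using Euclidean k-NN over all of X."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    min_idx = np.flatnonzero(y == minority_label)
    # Distances from each minority sample to every sample in X
    dists = np.linalg.norm(X[min_idx, None, :] - X[None, :, :], axis=2)
    dists[np.arange(len(min_idx)), min_idx] = np.inf  # exclude the sample itself
    nn = np.argsort(dists, axis=1)[:, :k]             # indices of the k nearest neighbors
    r = (y[nn] != minority_label).sum(axis=1) / k     # share of majority neighbors
    r_tilde = r / r.sum()                             # normalized weights (assumes r.sum() > 0)
    G_i = np.floor(G * r_tilde).astype(int)           # rounded down
    return r, r_tilde, G_i

# Toy data: 13 majority points at -9..3, minority points at 2.4, 10 and 11
X_toy = np.concatenate([np.arange(-9, 4), [2.4, 10.0, 11.0]]).reshape(-1, 1)
y_toy = np.array([0] * 13 + [1] * 3)
G = 13 - 3  # beta = 1
r, r_tilde, G_i = adasyn_allocation(X_toy, y_toy, minority_label=1, k=2, G=G)
print(r, r_tilde, G_i)  # [1.  0.5 0.5] [0.5  0.25 0.25] [5 2 2]
```

The minority point at 2.4 sits inside the majority region, so it receives the largest share. The counts sum to 9 rather than 10, which illustrates the shortfall caused by rounding down.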

---

### **Step 3: Generate Synthetic Samples Using Cubic Polynomial Interpolation**
For each minority sample $x_i$ requiring $G_i$ synthetic samples, a random neighbor $x_j \in X_{\text{min}}$ from its $k$-nearest neighbors is selected.

#### **Cubic Polynomial Interpolation**
Instead of using linear interpolation, we apply cubic polynomial interpolation for smoother synthetic data generation.

For each feature $f \in \{1, 2, \dots, m\}$, we define **four control points** evenly spaced along the segment from $x_i$ to $x_j$:
- $P_0 = x_{i,f}$ (original sample)
- $P_1 = \frac{1}{3}(2x_{i,f} + x_{j,f})$ (one third of the way to the neighbor)
- $P_2 = \frac{1}{3}(x_{i,f} + 2x_{j,f})$ (two thirds of the way to the neighbor)
- $P_3 = x_{j,f}$ (selected neighbor)

The corresponding interpolation domain values are:

$$
X_{\text{points}} = [0, 0.33, 0.66, 1]
$$

The values at these points are:

$$
Y_{\text{points}} = [P_0, P_1, P_2, P_3]
$$

A cubic polynomial $p_f$ is fitted through these four points (with `numpy.polynomial.Polynomial.fit` at degree 3). We then draw a random interpolation coefficient $g \sim U(0,1)$, shared by all features of the sample, and compute:

$$
\tilde{x}_{f} = p_f(g)
$$

This process is repeated for all $m$ features, resulting in a synthetic sample $\tilde{x}$.
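
The sketch below isolates the per-feature fit for a single pair of points. It follows the `custom_adasyn_with_cubic_interpolation` function later in this notebook, which places the two intermediate control points at one third and two thirds of the segment and fits with `Polynomial.fit`. The helper name `cubic_synthetic` and the example vectors are assumptions made for illustration.

```python
import numpy as np
from numpy.polynomial import Polynomial

def cubic_synthetic(x_i, x_j, g):
    """Evaluate, per feature, the cubic fitted through four control points at parameter g."""
    x_i = np.asarray(x_i, dtype=float)
    x_j = np.asarray(x_j, dtype=float)
    t = np.array([0.0, 0.33, 0.66, 1.0])  # domain values of the control points
    controls = np.vstack([x_i, (2 * x_i + x_j) / 3, (x_i + 2 * x_j) / 3, x_j])
    # One degree-3 fit per feature; four points determine the cubic exactly
    return np.array([Polynomial.fit(t, controls[:, f], 3)(g) for f in range(x_i.size)])

x_i = np.array([0.0, 1.0])    # hypothetical minority sample
x_j = np.array([3.0, -2.0])   # hypothetical neighbor
g = 0.5
x_new = cubic_synthetic(x_i, x_j, g)
x_lin = x_i + g * (x_j - x_i)  # linear interpolation, for comparison
print(x_new, np.abs(x_new - x_lin).max())
```

At $g = 0$ and $g = 1$ the fit reproduces $x_i$ and $x_j$. The printed gap to the linear interpolant lets the effect of the chosen control points be inspected directly.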

---

### **Step 4: Update the Dataset**
The newly generated synthetic samples $\tilde{X}$ are added to the original dataset:

$$
X' = X \cup \tilde{X}, \quad y' = y \cup \tilde{y}
$$

where $\tilde{y}$ contains the label of the minority class.
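
In array terms, the union above is a row-wise concatenation. The following minimal sketch uses hypothetical toy arrays; the variable names are illustrative only.

```python
import numpy as np

X = np.array([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]])  # hypothetical original features
y = np.array([0, 0, 1])                              # label 1 is the minority class
X_syn = np.array([[4.5, 5.2], [5.3, 4.8]])           # hypothetical synthetic samples

y_syn = np.full(len(X_syn), 1)   # every synthetic sample gets the minority label
X_aug = np.vstack((X, X_syn))    # X' = X ∪ X~
y_aug = np.hstack((y, y_syn))    # y' = y ∪ y~
print(X_aug.shape, y_aug)  # (5, 2) [0 0 1 1 1]
```

Only the training split should be augmented this way, so that the test set keeps the original class distribution.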

---

## **3. Conclusion**
The proposed ADASYN implementation with cubic polynomial interpolation is designed to offer the following advantages over standard linear interpolation methods:
- **Enhanced diversity of synthetic samples**: The cubic interpolation technique is intended to generate smoother and more naturally distributed synthetic points in the feature space.
- **Better generalization**: By allocating more synthetic samples to minority instances that are harder to classify, ADASYN aims to reduce the risk of overfitting caused by naive oversampling.
- **Improved robustness in high-dimensional spaces**: Cubic interpolation is meant to mitigate abrupt transitions in feature values, with the goal of making the synthetic data more realistic.

This approach is expected to be most useful for imbalanced datasets where minority class samples exhibit complex distributions. The experiments below compare it against standard ADASYN on the Taiwan Credit dataset.

In [31]:
from imblearn.over_sampling import ADASYN
def model_adasyn(X,y):
    # Train-test split (70/30; not stratified as written, pass stratify=y to stratify)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
    
    # Standardize numerical features
    scaler = StandardScaler()
    X_train = scaler.fit_transform(X_train)
    X_test = scaler.transform(X_test)
    
    ### Train XGBoost WITHOUT ADASYN
    dtrain = xgb.DMatrix(X_train, label=y_train)
    dtest = xgb.DMatrix(X_test, label=y_test)
    
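    params = {
        "objective": "binary:logistic",
        "eval_metric": "auc",
        # GPU training; XGBoost >= 2.0 deprecates "gpu_hist" in favor of
        # tree_method="hist" with device="cuda", and no longer uses "predictor"
        "tree_method": "gpu_hist",
        "predictor": "gpu_predictor",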
        "learning_rate": 0.05,
        "max_depth": 6,
        "subsample": 0.8,
        "colsample_bytree": 0.8,
        "random_state": 42
    }
    
    # Train XGBoost without ADASYN
    model_no_smote = xgb.train(params, dtrain, num_boost_round=200)
    
    # Predict on test set
    y_pred_proba_no_smote = model_no_smote.predict(dtest)
    from sklearn.metrics import precision_recall_curve, f1_score
    
    precision, recall, thresholds = precision_recall_curve(y_test, y_pred_proba_no_smote)
    f1_scores = (2 * precision * recall) / (precision + recall + 1e-6)  # Avoid division by zero
    optimal_idx = f1_scores.argmax()
    optimal_threshold = thresholds[optimal_idx]

    y_pred_binary_no_smote = [1 if p > optimal_threshold else 0 for p in y_pred_proba_no_smote]  # Convert probabilities to binary predictions

    metrics_no_smote = calculate_metrics(y_test, y_pred_proba_no_smote)

    
    ### Apply ADASYN
    adasyn = ADASYN(sampling_strategy=1, random_state=42)  # Fully balance classes
    X_train_resampled, y_train_resampled = adasyn.fit_resample(X_train, y_train)
    
    # Train XGBoost WITH ADASYN
    dtrain_resampled = xgb.DMatrix(X_train_resampled, label=y_train_resampled)
    model_with_smote = xgb.train(params, dtrain_resampled, num_boost_round=200)
    
    # Predict on test set
    y_pred_proba_with_smote = model_with_smote.predict(dtest)
    precision, recall, thresholds = precision_recall_curve(y_test, y_pred_proba_with_smote)
    f1_scores = (2 * precision * recall) / (precision + recall + 1e-6)  # Avoid division by zero
    optimal_idx = f1_scores.argmax()
    optimal_threshold = thresholds[optimal_idx]
    y_pred_binary_with_smote = [1 if p > optimal_threshold else 0 for p in y_pred_proba_with_smote]  # Convert probabilities to binary predictions

    metrics_with_smote = calculate_metrics(y_test, y_pred_proba_with_smote)

    # Print comparison
    metric_names = ["Accuracy", "ROC-AUC", "PR-AUC", "Recall (Sensitivity)", "F1", "Specificity", "FP-Rate", "G-Mean"]
    print("\n--- Model Performance ---")
    print("{:<20} {:<10} {:<10}".format("Metric", "No ADASYN", "With Regular ADASYN"))
    for name, no_smote, with_smote in zip(metric_names, metrics_no_smote, metrics_with_smote):
        print(f"{name:<20} {no_smote:.4f}   {with_smote:.4f}")

    return [metrics_no_smote, metrics_with_smote]


In [38]:
import numpy as np
import pandas as pd
from sklearn.neighbors import NearestNeighbors
from numpy.polynomial.polynomial import Polynomial
from tqdm import tqdm
from scipy.interpolate import CubicSpline

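def custom_adasyn_with_cubic_interpolation(X, y, k_neighbors, beta, random_state=42):
    """
    Implements the ADASYN algorithm with cubic polynomial interpolation for high-dimensional data.

    Parameters:
    - X: pandas DataFrame, shape (n_samples, n_features)
        Feature matrix with a default 0..n-1 index (rows are accessed with .iloc).
    - y: pandas Series, shape (n_samples,)
        Target labels (binary).
    - k_neighbors: int
        Number of nearest neighbors used for the difficulty ratio and for interpolation.
    - beta: float in [0, 1]
        Desired balance level; 1 requests a fully balanced dataset.
    - random_state: int
        Currently unused; sampling draws from NumPy's global random state.

    Returns:
    - X_resampled: ndarray
        Resampled feature matrix with synthetic samples.
    - y_resampled: ndarray
        Resampled target labels.
    """

    # Reset the index so positional neighbor indices can be used to look up labels in `y`
    y = y.reset_index(drop=True)

    # Identify minority and majority class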
    classes, class_counts = np.unique(y, return_counts=True)
    minority_class = classes[np.argmin(class_counts)]
    majority_class = classes[np.argmax(class_counts)]
    
    X_minority = X[y == minority_class]
    n_minority = len(X_minority)
    n_majority = len(X[y == majority_class])
    n_features = X.shape[1]

    # Step 1: Compute the number of synthetic samples to generate
    d = n_majority - n_minority  # Imbalance factor
    G = d*beta  # Total synthetic samples needed

    # Step 2: Find k-nearest neighbors for each minority sample
    k = k_neighbors
    knn = NearestNeighbors(n_neighbors=k+1).fit(X)
    neighbors = knn.kneighbors(X_minority, return_distance=False)[:, 1:]

    # Step 3: Compute the imbalance degree ri for each minority sample
    ri = np.array([sum(y[neighbors[i]] != minority_class) / k for i in range(n_minority)])
    if ri.sum() == 0:
        return X, y  # No synthetic samples needed
    ri = ri / ri.sum()  # Normalize ri to sum to 1

    # Step 4: Generate synthetic samples using cubic polynomial interpolation
    X_synthetic = []
    for i in tqdm(range(n_minority)):
        Gi = int(G * ri[i])  # Number of samples to generate for instance i
        for _ in range(Gi):

            # Select a random neighbor. `neighbors` indexes all of X, so the
            # chosen neighbor is not guaranteed to belong to the minority class.
            neighbor_idx = np.random.choice(neighbors[i])
            x_selected = X_minority.iloc[i]  # Minority instance
            x_neighbor = X.iloc[neighbor_idx]  # Chosen neighbor

            # Create synthetic sample feature-wise
            t_values = np.array([0, 0.33, 0.66, 1])  # 4 reference points in [0,1]
            x_values = np.vstack([x_selected, 
                                  (2*x_selected + x_neighbor)/3, 
                                  (x_selected + 2*x_neighbor)/3, 
                                  x_neighbor])  # Intermediate points
            # Generate polynomial coefficients for each feature
            x_synthetic = np.zeros_like(x_selected)
            t_random = np.random.rand()  # Random t in [0,1]
            
            for feature_idx in range(X.shape[1]):  # Iterate over all features
                poly = Polynomial.fit(t_values, x_values[:, feature_idx], 3)  # Fit cubic polynomial
                x_synthetic[feature_idx] = poly(t_random)  # Sample new point
            

            X_synthetic.append(x_synthetic)

    X_synthetic = np.array(X_synthetic)
    y_synthetic = np.full(len(X_synthetic), minority_class)

    # Step 5: Return the augmented dataset
    X_resampled = np.vstack((X, X_synthetic))
    y_resampled = np.hstack((y, y_synthetic))

    print(X_resampled.shape)

    return X_resampled, y_resampled


In [33]:
from sklearn.model_selection import train_test_split

In [64]:
def model_adasyn_poly(X,y):
    # Train-test split (70/30; not stratified as written, pass stratify=y to stratify)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
    
    # Standardize numerical features 
    scaler = StandardScaler()
    X_train = scaler.fit_transform(X_train)
    X_test = scaler.transform(X_test)
    
    ### Train XGBoost WITHOUT ADASYN
    dtrain = xgb.DMatrix(X_train, label=y_train)
    dtest = xgb.DMatrix(X_test, label=y_test)
    
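    params = {
        "objective": "binary:logistic",
        "eval_metric": "auc",
        # GPU training; XGBoost >= 2.0 deprecates "gpu_hist" in favor of
        # tree_method="hist" with device="cuda", and no longer uses "predictor"
        "tree_method": "gpu_hist",
        "predictor": "gpu_predictor",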
        "learning_rate": 0.05,
        "max_depth": 6,
        "subsample": 0.8,
        "colsample_bytree": 0.8,
        "random_state": 42
    }
    
    # Train XGBoost without ADASYN
    model_no_smote = xgb.train(params, dtrain, num_boost_round=200)
    
    # Predict on test set
    y_pred_proba_no_smote = model_no_smote.predict(dtest)
    from sklearn.metrics import precision_recall_curve, f1_score
    
    precision, recall, thresholds = precision_recall_curve(y_test, y_pred_proba_no_smote)
    f1_scores = (2 * precision * recall) / (precision + recall + 1e-6)  # Avoid division by zero
    optimal_idx = f1_scores.argmax()
    optimal_threshold = thresholds[optimal_idx]

    y_pred_binary_no_smote = [1 if p > optimal_threshold else 0 for p in y_pred_proba_no_smote]  # Convert probabilities to binary predictions

    metrics_no_smote = calculate_metrics(y_test, y_pred_proba_no_smote)

    # Re-wrap the scaled array as a DataFrame: the custom sampler uses .iloc on X
    # and expects y to be a pandas Series
    X_train_resampled, y_train_resampled = custom_adasyn_with_cubic_interpolation(
        pd.DataFrame(X_train, columns=X.columns), y_train, k_neighbors=5, beta=1)
    
    # Train XGBoost WITH ADASYN
    dtrain_resampled = xgb.DMatrix(np.array(X_train_resampled), label=y_train_resampled)
    model_with_smote = xgb.train(params, dtrain_resampled, num_boost_round=200)
    
    # Predict on test set
    y_pred_proba_with_smote = model_with_smote.predict(dtest)
    precision, recall, thresholds = precision_recall_curve(y_test, y_pred_proba_with_smote)
    f1_scores = (2 * precision * recall) / (precision + recall + 1e-6)  # Avoid division by zero
    optimal_idx = f1_scores.argmax()
    optimal_threshold = thresholds[optimal_idx]
    y_pred_binary_with_smote = [1 if p > optimal_threshold else 0 for p in y_pred_proba_with_smote]  # Convert probabilities to binary predictions

    metrics_with_smote = calculate_metrics(y_test, y_pred_proba_with_smote)

    # Print comparison
    metric_names = ["Accuracy", "ROC-AUC", "PR-AUC", "Recall (Sensitivity)", "F1", "Specificity", "FP-Rate", "G-Mean"]
    print("\n--- Model Performance ---")
    print("{:<20} {:<10} {:<10}".format("Metric", "No ADASYN", "With Cubic Polynomial ADASYN"))
    for name, no_smote, with_smote in zip(metric_names, metrics_no_smote, metrics_with_smote):
        print(f"{name:<20} {no_smote:.4f}   {with_smote:.4f}")

    return [metrics_no_smote, metrics_with_smote]


In [65]:
print("Taiwan Credit dataset: \n")
output=model_adasyn(X,y)
output_2=model_adasyn_poly(X,y['Y'])
output.append(output_2[1])

Taiwan Credit dataset: 


--- Model Performance ---
Metric               No ADASYN  With Regular ADASYN
Accuracy             0.8190   0.8018
ROC-AUC              0.7771   0.7582
PR-AUC               0.5392   0.5191
Recall (Sensitivity) 0.3566   0.4306
F1                   0.4618   0.4862
Specificity          0.9477   0.9051
FP-Rate              0.0523   0.0949
G-Mean               0.5814   0.6243


100%|██████████| 4676/4676 [00:16<00:00, 278.28it/s]


(30908, 23)

--- Model Performance ---
Metric               No ADASYN  With Cubic Polynomial ADASYN
Accuracy             0.8190   0.8036
ROC-AUC              0.7771   0.7596
PR-AUC               0.5392   0.5172
Recall (Sensitivity) 0.3566   0.4388
F1                   0.4618   0.4931
Specificity          0.9477   0.9051
FP-Rate              0.0523   0.0949
G-Mean               0.5814   0.6302


In [66]:
metrics=output

In [67]:
metric_names = ["Accuracy", "ROC-AUC", "PR-AUC", "Recall (Sensitivity)", 
                "F1", "Specificity", "FP-Rate", "G-Mean"]

print("\n--- Model Performance ---")
print("{:<30} {:<10} {:<20} {:<20}".format("Metric", "No ADASYN",
                                            "With Regular ADASYN",
                                            "With Cubic Polynomial Interpolation ADASYN"))

for i in range(len(metric_names)):
    print("{:<30} {:.3f}              {:.3f}              {:.3f}".format(
        metric_names[i], metrics[0][i], metrics[1][i], metrics[2][i]))


--- Model Performance ---
Metric                         No ADASYN  With Regular ADASYN  With Cubic Polynomial Interpolation ADASYN
Accuracy                       0.819              0.802              0.804
ROC-AUC                        0.777              0.758              0.760
PR-AUC                         0.539              0.519              0.517
Recall (Sensitivity)           0.357              0.431              0.439
F1                             0.462              0.486              0.493
Specificity                    0.948              0.905              0.905
FP-Rate                        0.052              0.095              0.095
G-Mean                         0.581              0.624              0.630
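
The `calculate_metrics` helper is not shown in this part of the notebook. Assuming it defines G-Mean as the geometric mean of recall and specificity, and FP-Rate as one minus specificity, the reported values can be cross-checked from the four-decimal tables above. The function name `g_mean` and the dictionary layout are illustrative.

```python
import math

def g_mean(recall, specificity):
    """Geometric mean of recall (sensitivity) and specificity."""
    return math.sqrt(recall * specificity)

# (recall, specificity, reported G-Mean) copied from the four-decimal tables above
rows = {
    "No ADASYN": (0.3566, 0.9477, 0.5814),
    "Regular ADASYN": (0.4306, 0.9051, 0.6243),
    "Cubic ADASYN": (0.4388, 0.9051, 0.6302),
}
for name, (rec, spec, reported) in rows.items():
    print(f"{name:<16} G-Mean={g_mean(rec, spec):.4f} (reported {reported:.4f})  "
          f"FP-Rate={1 - spec:.4f}")
```

Under these assumed definitions the recomputed values agree with the reported ones up to rounding of the inputs.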
