# **SMOTE (Synthetic Minority Over-sampling Technique)**

## **Introduction**
SMOTE is a resampling technique used to handle class imbalance by generating synthetic samples for the minority class instead of simply duplicating existing ones. It works by interpolating between real minority class instances.

## **Algorithm Steps**
1. **Select a minority class sample** $x_i$ from the dataset.
2. **Find its k-nearest neighbors** in the minority class using Euclidean distance.
3. **Randomly select one of these neighbors** $x_{nn}$.
4. **Generate a synthetic sample** along the line segment joining $x_i$ and $x_{nn}$ using interpolation:

   $$
   x_{\text{new}} = x_i + \lambda \cdot (x_{nn} - x_i)
   $$

   where:

   $$
   \lambda \sim U(0,1)
   $$

   is a random number between 0 and 1.

## **Mathematical Formulation**
For a given minority class instance $x_i$, let $x_{nn}$ be one of its k-nearest neighbors. The synthetic sample is created as:

$$
x_{\text{new}} = x_i + \lambda (x_{nn} - x_i)
$$

where:
- $x_i$ is a real minority class instance.
- $x_{nn}$ is one of its k-nearest neighbors.
- $\lambda$ is a random number sampled from a uniform distribution:

  $$
  \lambda \sim U(0,1)
  $$

This process is repeated until the desired number of synthetic samples is generated.
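The interpolation above can be sketched in a few lines of NumPy. This is a minimal illustration, not the `imblearn` implementation: the function name `smote_sketch`, its parameters, and the toy data are assumptions made for the example.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors


def smote_sketch(X_minority, n_synthetic, k=5, seed=0):
    """Generate n_synthetic points by linear interpolation between
    minority samples and their minority-class nearest neighbors."""
    rng = np.random.default_rng(seed)
    X_minority = np.asarray(X_minority, dtype=float)

    # k + 1 neighbors because each point is returned as its own nearest neighbor
    knn = NearestNeighbors(n_neighbors=k + 1).fit(X_minority)
    neighbor_idx = knn.kneighbors(X_minority, return_distance=False)[:, 1:]

    # Pick a base sample x_i and one of its k neighbors x_nn for every new point
    base = rng.integers(0, len(X_minority), size=n_synthetic)
    chosen = neighbor_idx[base, rng.integers(0, k, size=n_synthetic)]

    # x_new = x_i + lambda * (x_nn - x_i), with lambda ~ U(0, 1)
    lam = rng.random((n_synthetic, 1))
    return X_minority[base] + lam * (X_minority[chosen] - X_minority[base])


# Toy minority class: 20 points in 2-D (hypothetical data)
X_min = np.random.default_rng(1).normal(size=(20, 2))
X_new = smote_sketch(X_min, n_synthetic=50, k=5)
print(X_new.shape)  # (50, 2)
```

Because every synthetic point is a convex combination of two real minority points, it always stays inside the bounding box of the minority class.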

## **Advantages of SMOTE**
- Reduces class imbalance by adding synthetic samples.
- Reduces the overfitting risk that comes with simply duplicating minority class instances.
- Keeps synthetic samples in the local neighborhood of real minority samples, since each one lies on a segment between two of them.

## **Limitations of SMOTE**
- Can generate noisy samples if the minority class has a complex distribution.
- Does not consider the majority class, which may lead to overlapping regions and potential misclassification.

In [19]:
import pandas as pd
import numpy as np
from imblearn.combine import SMOTEENN

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import (
    roc_auc_score, average_precision_score, f1_score, recall_score, accuracy_score,
    confusion_matrix
)
from tqdm import tqdm
from imblearn.over_sampling import SMOTE
import xgboost as xgb
import warnings

warnings.simplefilter("ignore")  # Ignore all warnings

def calculate_metrics(y_test, y_pred_proba, threshold=0.5):
    # Convert probabilities to hard labels at a fixed cut-off (0.5 by default)
    y_pred_binary = [1 if p > threshold else 0 for p in y_pred_proba]

    # Compute metrics
    accuracy = accuracy_score(y_test, y_pred_binary)
    roc_auc = roc_auc_score(y_test, y_pred_proba)
    pr_auc = average_precision_score(y_test, y_pred_proba)
    recall = recall_score(y_test, y_pred_binary)
    f1 = f1_score(y_test, y_pred_binary)
    
    tn, fp, fn, tp = confusion_matrix(y_test, y_pred_binary).ravel()
    
    specificity = tn / (tn + fp) if (tn + fp) > 0 else 0
    fp_rate = fp / (fp + tn) if (fp + tn) > 0 else 0
    g_mean = np.sqrt(recall * specificity)

    return accuracy, roc_auc, pr_auc, recall, f1, specificity, fp_rate, g_mean

def model(X,y):
    # Train-test split (70/30, not stratified; pass stratify=y to preserve the class ratio)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
    
    # Standardize numerical features
    scaler = StandardScaler()
    X_train = scaler.fit_transform(X_train)
    X_test = scaler.transform(X_test)
    
    ### Train XGBoost WITHOUT SMOTE
    dtrain = xgb.DMatrix(X_train, label=y_train)
    dtest = xgb.DMatrix(X_test, label=y_test)
    
    params = {
        "objective": "binary:logistic",
        "eval_metric": "auc",
        "tree_method": "gpu_hist",  # Use GPU
        "predictor": "gpu_predictor",
        "learning_rate": 0.05,
        "max_depth": 6,
        "subsample": 0.8,
        "colsample_bytree": 0.8,
        "random_state": 42
    }
    
    # Train XGBoost without SMOTE
    model_no_smote = xgb.train(params, dtrain, num_boost_round=200)
    
    # Predict on test set
    y_pred_proba_no_smote = model_no_smote.predict(dtest)
    from sklearn.metrics import precision_recall_curve, f1_score
    
    precision, recall, thresholds = precision_recall_curve(y_test, y_pred_proba_no_smote)
    f1_scores = (2 * precision * recall) / (precision + recall + 1e-6)  # Avoid division by zero
    optimal_idx = f1_scores.argmax()
    optimal_threshold = thresholds[optimal_idx]

    y_pred_binary_no_smote = [1 if p > optimal_threshold else 0 for p in y_pred_proba_no_smote]  # Convert probabilities to binary predictions

    metrics_no_smote = calculate_metrics(y_test, y_pred_proba_no_smote)

    
    ### Apply SMOTE
    smote = SMOTE(sampling_strategy=1, random_state=42)  # Fully balance classes
    X_train_resampled, y_train_resampled = smote.fit_resample(X_train, y_train)
    
    # Train XGBoost WITH SMOTE
    dtrain_resampled = xgb.DMatrix(X_train_resampled, label=y_train_resampled)
    model_with_smote = xgb.train(params, dtrain_resampled, num_boost_round=200)
    
    # Predict on test set
    y_pred_proba_with_smote = model_with_smote.predict(dtest)
    precision, recall, thresholds = precision_recall_curve(y_test, y_pred_proba_with_smote)
    f1_scores = (2 * precision * recall) / (precision + recall + 1e-6)  # Avoid division by zero
    optimal_idx = f1_scores.argmax()
    optimal_threshold = thresholds[optimal_idx]
    y_pred_binary_with_smote = [1 if p > optimal_threshold else 0 for p in y_pred_proba_with_smote]  # Convert probabilities to binary predictions

    metrics_with_smote = calculate_metrics(y_test, y_pred_proba_with_smote)

    # Print comparison
    metric_names = ["Accuracy", "ROC-AUC", "PR-AUC", "Recall (Sensitivity)", "F1", "Specificity", "FP-Rate", "G-Mean"]
    print("\n--- Model Performance ---")
    print("{:<20} {:<10} {:<10}".format("Metric", "No SMOTE", "With Regular SMOTE"))
    for name, no_smote, with_smote in zip(metric_names, metrics_no_smote, metrics_with_smote):
        print(f"{name:<20} {no_smote:.4f}   {with_smote:.4f}")

    return [metrics_no_smote, metrics_with_smote]


# **Custom SMOTE with Cubic Interpolation (My Development)**

## **Introduction**
This is a modified version of the **Synthetic Minority Over-sampling Technique (SMOTE)**, where instead of linear interpolation, a **third-degree polynomial interpolation** is used to generate synthetic samples. The goal is to preserve complex feature relationships and avoid overly simplistic synthetic samples.

## **Algorithm Steps**
1. **Select a minority class sample** $x_i$ from the dataset.
2. **Find its k-nearest neighbors** within the minority class using Euclidean distance.
3. **Randomly select one of these neighbors** $x_{nn}$ for interpolation.
4. **Use cubic interpolation** between the selected sample $x_i$ and its neighbor $x_{nn}$:
   - Define reference points between $x_i$ and $x_{nn}$.
   - Fit a third-degree polynomial for each feature.
   - Sample a new synthetic point using the polynomial.

## **Mathematical Formulation**
For a given minority class instance $x_i$, let $x_{nn}$ be one of its k-nearest neighbors. We define four reference points:

$$
x_0 = x_i, \quad x_1 = \frac{2x_i + x_{nn}}{3}, \quad x_2 = \frac{x_i + 2x_{nn}}{3}, \quad x_3 = x_{nn}
$$

These points correspond to $t$-values:

$$
t_0 = 0, \quad t_1 = 0.33, \quad t_2 = 0.66, \quad t_3 = 1
$$

A third-degree polynomial is fitted for each feature using these values:

$$
P(t) = a_0 + a_1 t + a_2 t^2 + a_3 t^3
$$

where the coefficients $(a_0, a_1, a_2, a_3)$ are determined by the reference points. A synthetic sample is generated by evaluating the polynomial at a randomly chosen $t_{\text{rand}} \sim U(0,1)$:

$$
x_{\text{new}} = P(t_{\text{rand}})
$$

This process is repeated until the desired number of synthetic samples is generated.
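It is worth checking how far $P(t)$ actually departs from linear SMOTE. The four reference points defined above all lie on the segment between $x_i$ and $x_{nn}$, so the fitted cubic can only differ from the straight line through the small mismatch between $t_1 = 0.33$, $t_2 = 0.66$ and the exact thirds. The sketch below measures that gap for one pair of points; the vectors `x_i` and `x_nn` are hypothetical values chosen for illustration.

```python
import numpy as np
from numpy.polynomial.polynomial import Polynomial

# Hypothetical minority sample and neighbor with 4 features
x_i = np.array([0.0, 1.0, -2.0, 5.0])
x_nn = np.array([1.0, 3.0, 0.5, 4.0])

# Reference points and t-values as defined above
t_values = np.array([0.0, 0.33, 0.66, 1.0])
x_values = np.vstack([x_i, (2 * x_i + x_nn) / 3, (x_i + 2 * x_nn) / 3, x_nn])

t_grid = np.linspace(0.0, 1.0, 101)

# One cubic per feature, evaluated on the grid -> shape (101, 4)
cubic = np.column_stack([
    Polynomial.fit(t_values, x_values[:, j], 3)(t_grid)
    for j in range(x_values.shape[1])
])

# Linear SMOTE evaluated at the same t values
linear = x_i + t_grid[:, None] * (x_nn - x_i)

# Largest gap, expressed as a fraction of |x_nn - x_i| for each feature
max_rel_gap = np.max(np.abs(cubic - linear) / np.abs(x_nn - x_i))
print(f"max relative gap: {max_rel_gap:.4f}")
```

For these reference points the gap stays below 1% of the distance between the two samples, so the generated points are close to what linear interpolation would produce. Reference points taken off the segment (for example, from additional neighbors) would be needed for the curve to bend noticeably.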

## **Advantages of Custom SMOTE with Cubic Interpolation**
- **Smooth transition**: each synthetic sample is read off a smooth curve $P(t)$ between two real data points.
- **Room for non-linear variants**: with the reference points defined above, all four lie on the segment between $x_i$ and $x_{nn}$, so the fitted cubic stays very close to linear SMOTE. Capturing **non-linear patterns** would require reference points that leave that segment.
- **Less risk of generating outliers**: the intermediate points **constrain synthetic samples** to the range spanned by $x_i$ and $x_{nn}$.

## **Limitations**
- **Computationally expensive**: Fitting a polynomial for each feature requires more computation than linear interpolation.
- **Risk of overfitting**: If the minority class has a complex distribution, the interpolation might introduce synthetic samples that do not generalize well.
- **Sensitive to noisy data**: If the minority class contains outliers, the interpolation may exaggerate these variations.


In [55]:
import numpy as np
import pandas as pd
from sklearn.neighbors import NearestNeighbors
from numpy.polynomial.polynomial import Polynomial

def custom_smote_with_cubic_interpolation(X: pd.DataFrame, y: pd.Series, target_class=1, k_neighbors=5, random_state=42):
    """
    Custom SMOTE using 3rd-degree polynomial interpolation.

    The number of synthetic samples is not a parameter; it is derived
    internally as n_minority * floor(n_majority / n_minority).

    Parameters:
        X (pd.DataFrame): Feature matrix.
        y (pd.Series): Target labels.
        target_class (int): The minority class to oversample.
        k_neighbors (int): Number of nearest neighbors to consider.
        random_state (int): Random seed for reproducibility.

    Returns:
        X_resampled (pd.DataFrame): New feature matrix with synthetic samples.
        y_resampled (pd.Series): Updated target labels.
    """
    np.random.seed(random_state)
    
    # Ensure `y` is a 1D array
    y = y.reset_index(drop=True)  # Ensure proper indexing
    
    # Separate minority class
    X_minority = X[y == target_class]
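    # Whole-number oversampling multiple: floor(n_majority / n_minority).
    # Note: this can push the minority class past an exact 1:1 balance.
    sampling_ratio = np.floor((X.shape[0] - X_minority.shape[0]) / X_minority.shape[0])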
    
    # Fit KNN on minority class
    knn = NearestNeighbors(n_neighbors=min(k_neighbors, len(X_minority)))
    knn.fit(X_minority)
    
    # Determine number of synthetic samples to generate
    n_samples = int(len(X_minority) * sampling_ratio)

    print(n_samples)
    
    synthetic_samples = []
    
    for _ in tqdm(range(n_samples)):
        # Randomly select a minority sample
        idx = np.random.randint(0, len(X_minority))
        x_selected = X_minority.iloc[idx].values  # Convert to NumPy array
        
        # Find k-nearest neighbors
        neighbors = knn.kneighbors([x_selected], return_distance=False)[0]
        
        # Select a random neighbor
        neighbor_idx = np.random.choice(neighbors[1:])  # Exclude itself
        x_neighbor = X_minority.iloc[neighbor_idx].values  # Convert to NumPy array
        
        # Fit a 3rd-degree polynomial between x_selected and x_neighbor
        t_values = np.array([0, 0.33, 0.66, 1])  # 4 reference points in [0,1]
        x_values = np.vstack([x_selected, 
                              (2*x_selected + x_neighbor)/3, 
                              (x_selected + 2*x_neighbor)/3, 
                              x_neighbor])  # Intermediate points
        
        # Generate polynomial coefficients for each feature
        x_synthetic = np.zeros_like(x_selected, dtype=float)  # Float buffer so integer inputs are not truncated
        t_random = np.random.rand()  # Random t in [0,1]
        
        for feature_idx in range(X.shape[1]):  # Iterate over all features
            poly = Polynomial.fit(t_values, x_values[:, feature_idx], 3)  # Fit cubic polynomial
            x_synthetic[feature_idx] = poly(t_random)  # Sample new point
        
        synthetic_samples.append(x_synthetic)
    
    # Convert synthetic samples to DataFrame
    synthetic_samples_df = pd.DataFrame(synthetic_samples, columns=X.columns)
    
    # Create new dataset (append synthetic data)
    X_resampled = pd.concat([X, synthetic_samples_df], axis=0, ignore_index=True)
    y_resampled = pd.concat([y, pd.Series(target_class, index=synthetic_samples_df.index)], axis=0, ignore_index=True)
    
    return X_resampled, y_resampled


In [21]:
def model_smote_poly(X,y):
    # Train-test split (70/30, not stratified; pass stratify=y to preserve the class ratio)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
    
    # Standardize numerical features 
    scaler = StandardScaler()
    X_train = scaler.fit_transform(X_train)
    X_test = scaler.transform(X_test)
    
    ### Train XGBoost WITHOUT SMOTE
    dtrain = xgb.DMatrix(X_train, label=y_train)
    dtest = xgb.DMatrix(X_test, label=y_test)
    
    params = {
        "objective": "binary:logistic",
        "eval_metric": "auc",
        "tree_method": "gpu_hist",  # Use GPU
        "predictor": "gpu_predictor",
        "learning_rate": 0.05,
        "max_depth": 6,
        "subsample": 0.8,
        "colsample_bytree": 0.8,
        "random_state": 42
    }
    
    # Train XGBoost without SMOTE
    model_no_smote = xgb.train(params, dtrain, num_boost_round=200)
    
    # Predict on test set
    y_pred_proba_no_smote = model_no_smote.predict(dtest)
    from sklearn.metrics import precision_recall_curve, f1_score
    
    precision, recall, thresholds = precision_recall_curve(y_test, y_pred_proba_no_smote)
    f1_scores = (2 * precision * recall) / (precision + recall + 1e-6)  # Avoid division by zero
    optimal_idx = f1_scores.argmax()
    optimal_threshold = thresholds[optimal_idx]

    y_pred_binary_no_smote = [1 if p > optimal_threshold else 0 for p in y_pred_proba_no_smote]  # Convert probabilities to binary predictions

    metrics_no_smote = calculate_metrics(y_test, y_pred_proba_no_smote)

    
    ### Apply SMOTE
    X_train_resampled, y_train_resampled = custom_smote_with_cubic_interpolation(pd.DataFrame(X_train,columns=X.columns), y_train)
    
    # Train XGBoost WITH SMOTE
    dtrain_resampled = xgb.DMatrix(np.array(X_train_resampled), label=y_train_resampled)
    model_with_smote = xgb.train(params, dtrain_resampled, num_boost_round=200)
    
    # Predict on test set
    y_pred_proba_with_smote = model_with_smote.predict(dtest)
    precision, recall, thresholds = precision_recall_curve(y_test, y_pred_proba_with_smote)
    f1_scores = (2 * precision * recall) / (precision + recall + 1e-6)  # Avoid division by zero
    optimal_idx = f1_scores.argmax()
    optimal_threshold = thresholds[optimal_idx]
    y_pred_binary_with_smote = [1 if p > optimal_threshold else 0 for p in y_pred_proba_with_smote]  # Convert probabilities to binary predictions

    metrics_with_smote = calculate_metrics(y_test, y_pred_proba_with_smote)

    # Print comparison
    metric_names = ["Accuracy", "ROC-AUC", "PR-AUC", "Recall (Sensitivity)", "F1", "Specificity", "FP-Rate", "G-Mean"]
    print("\n--- Model Performance ---")
    print("{:<20} {:<10} {:<10}".format("Metric", "No SMOTE", "With Cubic Polynomial SMOTE"))
    for name, no_smote, with_smote in zip(metric_names, metrics_no_smote, metrics_with_smote):
        print(f"{name:<20} {no_smote:.4f}   {with_smote:.4f}")

    return [metrics_no_smote, metrics_with_smote]



---

# **Model Performance Metrics for Credit Risk Default Prediction**

In credit risk modeling, correctly classifying **defaulting customers** is crucial, as misclassifications can lead to **financial losses** (false negatives) or **lost opportunities** (false positives). The following metrics help assess model performance:

## **1. Accuracy**
Accuracy measures the proportion of correctly classified instances over the total dataset:

$$
\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}
$$

where:
- $TP$ = True Positives (correctly predicted defaults)
- $TN$ = True Negatives (correctly predicted non-defaults)
- $FP$ = False Positives (incorrectly predicted defaults)
- $FN$ = False Negatives (incorrectly predicted non-defaults)

### **Importance in Credit Risk:**
- Accuracy gives an overall measure of correctness but can be **misleading in imbalanced datasets** (e.g., if defaults are rare, a model predicting all customers as non-defaults can still have high accuracy).

---

## **2. ROC-AUC (Receiver Operating Characteristic - Area Under Curve)**
The **ROC-AUC** measures a model’s ability to distinguish between the positive (default) and negative (non-default) classes. The **ROC curve** plots the **True Positive Rate (Recall)** against the **False Positive Rate (FPR)** at different classification thresholds.  

### **Mathematical Formulation:**
The **AUC (Area Under Curve)** is computed as:

$$
\text{AUC} = \int_0^1 \text{TPR} \, d(\text{FPR})
$$

where:
- **True Positive Rate (TPR) / Recall:**
  $$
  \text{TPR} = \frac{TP}{TP + FN}
  $$
- **False Positive Rate (FPR):**
  $$
  \text{FPR} = \frac{FP}{FP + TN}
  $$

### **Importance in Credit Risk:**
- **Higher AUC** means the model **better separates defaults from non-defaults**.
- **AUC close to 0.5** suggests the model is **random** (not useful).

---

## **3. PR-AUC (Precision-Recall Area Under Curve)**
PR-AUC measures the area under the **Precision-Recall (PR) curve**, focusing on **positive (default) predictions**.

### **Mathematical Formulation:**
The **AUC for Precision-Recall** is:

$$
\text{PR-AUC} = \int_0^1 \text{Precision} \, d(\text{Recall})
$$

where:
- **Precision (Positive Predictive Value, PPV):**
  $$
  \text{Precision} = \frac{TP}{TP + FP}
  $$

- **Recall (Sensitivity / TPR) (as defined above)**

### **Importance in Credit Risk:**
- **Often more informative than ROC-AUC** for **imbalanced data**, since it ignores true negatives and focuses on **true defaults**.
- **Higher PR-AUC** indicates that **precision stays high as recall increases**.

---

## **4. Recall (Sensitivity)**
Recall, also called **Sensitivity or True Positive Rate (TPR)**, measures the ability to detect **actual defaults**:

$$
\text{Recall} = \frac{TP}{TP + FN}
$$

### **Importance in Credit Risk:**
- **High recall** ensures **most actual defaults are detected**, minimizing **false negatives**.
- **Low recall** means many **defaulting customers** are **missed**, leading to **financial losses**.

---

## **5. F1-Score**
F1-Score is the harmonic mean of **Precision** and **Recall**, balancing both:

$$
F1 = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}
$$

### **Importance in Credit Risk:**
- **Best when both False Positives & False Negatives are costly**.
- **Useful in imbalanced datasets**, where a high precision or recall alone isn't enough.

---

## **6. Specificity (True Negative Rate)**
Specificity measures how well the model identifies **non-defaulting customers**:

$$
\text{Specificity} = \frac{TN}{TN + FP}
$$

### **Importance in Credit Risk:**
- **Higher specificity** reduces **false alarms (FP)**.
- **Too high specificity may mean recall is low**, missing many defaults.

---

## **7. False Positive Rate (FPR)**
FPR is the proportion of **non-defaulting customers incorrectly classified as defaults**:

$$
\text{FPR} = \frac{FP}{FP + TN} = 1 - \text{Specificity}
$$

### **Importance in Credit Risk:**
- **Low FPR** means fewer **non-defaulters are wrongly flagged**, reducing unnecessary **loan rejections**.
- **High FPR** can **hurt customer experience** and leads to **lost lending opportunities**.

---

## **8. G-Mean (Geometric Mean)**
The **G-Mean** is a performance metric balancing **recall** and **specificity**:

$$
G\text{-Mean} = \sqrt{\text{Recall} \times \text{Specificity}}
$$

### **Importance in Credit Risk:**
- **Higher G-Mean** ensures the model performs well on **both default and non-default classes**.
- Useful for **handling class imbalance**, where one metric alone (like accuracy) can be misleading.

---

# **Summary Table of Metrics**
| **Metric**         | **Interpretation** |
|--------------------|------------------|
| **Accuracy**      | Overall correctness, but misleading in imbalanced data |
| **ROC-AUC**       | Ability to distinguish defaults vs. non-defaults |
| **PR-AUC**        | Performance on the default class, useful for imbalance |
| **Recall**        | Ability to detect defaults (avoid false negatives) |
| **F1-Score**      | Balance between Precision & Recall |
| **Specificity**   | Correctly identifying non-defaulters |
| **FPR**           | Incorrectly flagging non-defaulters as defaults |
| **G-Mean**        | Balance between Recall & Specificity (useful for imbalance) |
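
The threshold-based metrics in the table can all be derived from the four confusion-matrix counts. The following is a minimal sketch; the helper name `metrics_from_counts` and the example counts are hypothetical and are not taken from the experiments in this notebook.

```python
import numpy as np


def metrics_from_counts(tp, tn, fp, fn):
    """Compute the threshold-based metrics above from confusion-matrix counts."""
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "recall": recall,
        "precision": precision,
        "f1": f1,
        "specificity": specificity,
        "fpr": fpr,
        "g_mean": float(np.sqrt(recall * specificity)),
    }


# Hypothetical counts: 100 actual defaults, 900 actual non-defaults
m = metrics_from_counts(tp=40, tn=855, fp=45, fn=60)
for name, value in m.items():
    print(f"{name:<12} {value:.4f}")
```

With these counts the accuracy is 0.895, while a model that predicts non-default for every customer would reach 0.900 with a recall of 0. This is the sense in which accuracy alone can mislead on imbalanced data.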

---

# **Final Thoughts**
For **credit risk prediction**, metrics should be **carefully chosen** based on **business priorities**:

- **If missing defaults is costly** → **High Recall (Sensitivity)**.
- **If wrongly flagging non-defaulters is a concern** → **Low False Positive Rate (FPR)**.
- **For overall balance** → **High G-Mean & F1-Score**.

# Data

In [22]:
from ucimlrepo import fetch_ucirepo 
  
# fetch dataset 
default_of_credit_card_clients = fetch_ucirepo(id=350) 
  
# data (as pandas dataframes) 
X = default_of_credit_card_clients.data.features 
y = default_of_credit_card_clients.data.targets 

X.head()

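|   | X1 | X2 | X3 | X4 | X5 | X6 | X7 | X8 | X9 | X10 | ... | X14 | X15 | X16 | X17 | X18 | X19 | X20 | X21 | X22 | X23 |
|---|----|----|----|----|----|----|----|----|----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|
| 0 | 20000 | 2 | 2 | 1 | 24 | 2 | 2 | -1 | -1 | -2 | ... | 689 | 0 | 0 | 0 | 0 | 689 | 0 | 0 | 0 | 0 |
| 1 | 120000 | 2 | 2 | 2 | 26 | -1 | 2 | 0 | 0 | 0 | ... | 2682 | 3272 | 3455 | 3261 | 0 | 1000 | 1000 | 1000 | 0 | 2000 |
| 2 | 90000 | 2 | 2 | 2 | 34 | 0 | 0 | 0 | 0 | 0 | ... | 13559 | 14331 | 14948 | 15549 | 1518 | 1500 | 1000 | 1000 | 1000 | 5000 |
| 3 | 50000 | 2 | 2 | 1 | 37 | 0 | 0 | 0 | 0 | 0 | ... | 49291 | 28314 | 28959 | 29547 | 2000 | 2019 | 1200 | 1100 | 1069 | 1000 |
| 4 | 50000 | 1 | 2 | 1 | 57 | -1 | 0 | -1 | 0 | 0 | ... | 35835 | 20940 | 19146 | 19131 | 2000 | 36681 | 10000 | 9000 | 689 | 679 |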


In [23]:
print(f"Percentage of Positive targets : {((y.sum()/y.count())*100).values[0]}%")

Percentage of Positive targets : 22.12%


## Model Training And Results

In [56]:


print("Taiwan Credit dataset: \n")
output=model(X,y)
output=model_smote_poly(X,y['Y'])


Taiwan Credit dataset: 


--- Model Performance ---
Metric               No SMOTE   With Regular SMOTE
Accuracy             0.8190   0.8034
ROC-AUC              0.7771   0.7625
PR-AUC               0.5392   0.5249
Recall (Sensitivity) 0.3566   0.4388
F1                   0.4618   0.4930
Specificity          0.9477   0.9050
FP-Rate              0.0523   0.0950
G-Mean               0.5814   0.6301
14028


  7%|█████▌                                                                      | 1027/14028 [00:04<00:53, 241.49it/s]
Exception ignored in: <function DMatrix.__del__ at 0x00000183B36DC9A0>
Traceback (most recent call last):
  File "C:\Users\Onur Yaman\anaconda3\envs\onur_1\Lib\site-packages\xgboost\core.py", line 932, in __del__
    _check_call(_LIB.XGDMatrixFree(self.handle))
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
KeyboardInterrupt: 

KeyboardInterrupt



In [25]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import roc_auc_score, average_precision_score
from imblearn.over_sampling import SMOTE
import xgboost as xgb

# Load dataset from UCI
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/statlog/german/german.data"
columns = [
    "Status", "Duration", "CreditHistory", "Purpose", "CreditAmount", "Savings",
    "Employment", "InstallmentRate", "PersonalStatus", "Debtors", "ResidenceTime",
    "Property", "Age", "OtherInstallment", "Housing", "ExistingCredits", "Job",
    "NumDependents", "Telephone", "ForeignWorker", "Risk"
]
df = pd.read_csv(url, sep=" ", names=columns)

# Convert target variable (raw coding: 1 = good credit, 2 = bad credit)
# to a binary label where 1 = bad credit (positive class) and 0 = good credit
df["Risk"] = df["Risk"].map({1: 0, 2: 1})

# One-hot encode categorical variables
df = pd.get_dummies(df, drop_first=True)

# Split features and target
X_ger = df.drop("Risk", axis=1)
y_ger = df["Risk"]


In [35]:
print(f"Percentage of Positive targets : {((y_ger.sum()/y_ger.count())*100)}%")

Percentage of Positive targets : 30.0%


In [36]:
print("German Credit dataset: \n")
output=model(X_ger,y_ger)
output=model_smote_poly(X_ger,y_ger)

German Credit dataset: 


--- Model Performance ---
Metric               No SMOTE   With Regular SMOTE
Accuracy             0.7733   0.7667
ROC-AUC              0.7933   0.8016
PR-AUC               0.6370   0.6602
Recall (Sensitivity) 0.4505   0.4835
F1                   0.5467   0.5570
Specificity          0.9139   0.8900
FP-Rate              0.0861   0.1100
G-Mean               0.6417   0.6560


100%|████████████████████████████████████████████████████████████████████████████████| 418/418 [00:05<00:00, 82.67it/s]



--- Model Performance ---
Metric               No SMOTE   With Cubic Polynomial SMOTE
Accuracy             0.7733   0.7733
ROC-AUC              0.7933   0.7933
PR-AUC               0.6370   0.6451
Recall (Sensitivity) 0.4505   0.4945
F1                   0.5467   0.5696
Specificity          0.9139   0.8947
FP-Rate              0.0861   0.1053
G-Mean               0.6417   0.6652


# **ADASYN (Adaptive Synthetic Sampling) - Working Logic**

**ADASYN (Adaptive Synthetic Sampling)** is an **oversampling technique** designed to address **class imbalance** in classification problems, especially when the minority class is heavily under-represented and difficult to learn.

Unlike traditional oversampling methods like **SMOTE**, which generate synthetic samples in a **uniform manner**, ADASYN focuses on generating more synthetic samples near the **boundary** of the minority class, where it is most difficult for the model to classify correctly. This adaptive sampling method attempts to balance the class distribution by placing more synthetic samples in regions where the decision boundary is complex.

---

## **Key Concepts in ADASYN:**

- **Minority Class:** The class with fewer samples (typically the default class in credit risk models).
- **Majority Class:** The class with more samples (typically the non-default class in credit risk models).
- **Synthetic Samples:** New, artificially created instances of the minority class.
- **K-Nearest Neighbors (K-NN):** ADASYN uses K-NN to identify challenging instances of the minority class.

---

## **Steps in ADASYN Algorithm:**

1. **Step 1: Calculate the K-Nearest Neighbors (K-NN) for each minority instance**  
   For each instance $x_i$ in the minority class, calculate its **k-nearest neighbors** using Euclidean distance or another distance metric.

2. **Step 2: Identify the challenging instances**  
   A **challenging instance** is one where the model finds it hard to classify the sample correctly. These instances are typically near the **decision boundary** (close to the majority class). 

   ADASYN evaluates the **class distribution** in the neighborhood of each minority instance:
   - If a minority instance has more majority class neighbors (i.e., the **decision boundary** is close), it is considered **challenging**.
   - The **difficulty score** for each instance is computed as:
   
     $$
     d_i = \frac{N_{\text{majority}}}{k}
     $$

     where $N_{\text{majority}}$ is the number of majority class instances among the k-nearest neighbors, and $k$ is the number of neighbors.

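3. **Step 3: Generate synthetic samples**  
   ADASYN shares the total number of synthetic samples $G$ across the minority instances in proportion to their difficulty:
   - The difficulty scores are first normalized so that they sum to 1:

     $$
     \hat{d}_i = \frac{d_i}{\sum_j d_j}
     $$

   - The **number of synthetic samples** for an instance is then determined by its normalized **difficulty score**:  
     
     $$
     \text{No. of synthetic samples for } x_i = \text{round}(\hat{d}_i \cdot G)
     $$

     where $G$ is the total number of synthetic samples needed to reach the desired class balance. Instances with $d_i = 0$ (no majority class neighbors) receive no synthetic samples.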
     
4. **Step 4: Interpolate between the instance and its neighbors**  
   ADASYN generates new synthetic samples by **interpolating** between the challenging instance $x_i$ and one of its minority-class nearest neighbors $x_{nn}$:

   $$
   x_{\text{new}} = x_i + \lambda \cdot (x_{nn} - x_i)
   $$

   where:
   - $x_{\text{new}}$ is the synthetic sample.
   - $\lambda \sim U(0, 1)$ is a random number drawn from a uniform distribution.

5. **Step 5: Repeat until desired number of synthetic samples is reached**  
   This process is repeated for each instance in the minority class until the desired number of synthetic samples is generated.

---

## **Mathematical Formulation**

For a given minority class instance $x_i$, the **difficulty score** $d_i$ is computed as:

$$
d_i = \frac{N_{\text{majority}}}{k}
$$

where $N_{\text{majority}}$ is the number of majority class neighbors among the k-nearest neighbors, and $k$ is the total number of neighbors.

The **synthetic sample generation** for a challenging instance $x_i$ and its randomly chosen neighbor $x_{nn}$ is given by:

$$
x_{\text{new}} = x_i + \lambda \cdot (x_{nn} - x_i)
$$

where:
- $x_{\text{new}}$ is the synthetic sample.
- $x_i$ is the challenging minority instance.
- $x_{nn}$ is one of its k-nearest neighbors.
- $\lambda$ is a random number between 0 and 1.
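
The difficulty score $d_i$ can be computed directly with a nearest-neighbor search over the full dataset. The sketch below also shows one common way to turn the scores into per-instance sample counts: normalize them and share out a total budget `G`. The function name `adasyn_allocation`, the variable `G`, and the toy data are assumptions made for illustration; this is not the `imblearn` implementation.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors


def adasyn_allocation(X, y, minority_label=1, k=5):
    """Return difficulty scores d_i and per-instance synthetic sample counts."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    X_min = X[y == minority_label]

    # Neighbors are searched in the full dataset; k + 1 because each
    # minority point is returned as its own nearest neighbor.
    knn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    idx = knn.kneighbors(X_min, return_distance=False)[:, 1:]

    # d_i = (number of majority class neighbors) / k
    d = (y[idx] != minority_label).sum(axis=1) / k

    # Total synthetic samples needed for a 1:1 class balance
    G = int((y != minority_label).sum() - (y == minority_label).sum())

    # Share G across instances in proportion to their difficulty
    if d.sum() == 0:
        return d, np.zeros(len(d), dtype=int)
    counts = np.rint(d / d.sum() * G).astype(int)
    return d, counts


# Toy data: 90 majority points around 0, 10 minority points around 2
rng = np.random.default_rng(0)
X_maj = rng.normal(loc=0.0, size=(90, 2))
X_min = rng.normal(loc=2.0, size=(10, 2))
X = np.vstack([X_maj, X_min])
y = np.array([0] * 90 + [1] * 10)

d, counts = adasyn_allocation(X, y)
print("difficulty scores:", d)
print("samples per instance:", counts, "total:", counts.sum())
```

Minority points surrounded by majority neighbors get a score close to 1 and receive most of the synthetic samples, while points deep inside the minority region receive few or none.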

---

## **Advantages of ADASYN**

1. **Focuses on Hard-to-Classify Instances:** ADASYN generates more samples for the **difficult-to-classify minority instances** that lie near the decision boundary. This allows the classifier to learn better decision boundaries.
2. **Adaptive Sampling:** The sampling process is adaptive and varies based on the difficulty of each instance. This prevents oversampling in simpler regions of the minority class.
3. **Improved Performance in Imbalanced Datasets:** ADASYN is particularly useful for handling **severe class imbalances** where traditional oversampling methods (e.g., SMOTE) might generate redundant or uninformative samples.

---

## **Disadvantages of ADASYN**

1. **Computational Cost:** ADASYN can be more computationally expensive than basic oversampling techniques like SMOTE.
2. **Noise Sensitivity:** If the minority class instances are noisy or not well separated, ADASYN may generate synthetic samples that are still difficult to classify, potentially harming the model's performance.
3. **Overfitting Risk:** If the synthetic samples are not properly distributed or not sufficiently diverse, the model may overfit to these samples, leading to poor generalization.

---

## **Summary**

**ADASYN** is an advanced oversampling method that adaptively generates synthetic samples for **hard-to-classify instances** in the minority class. It focuses on improving classifier performance in regions where the model struggles to distinguish between the minority and majority classes. By carefully sampling near the decision boundary, ADASYN improves model robustness in highly imbalanced classification tasks, such as **credit risk prediction**, where detecting defaults is critical.

In [41]:
from imblearn.over_sampling import ADASYN
def model_adasyn(X,y):
    # Train-test split (70/30, not stratified; pass stratify=y to preserve the class ratio)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
    
    # Standardize numerical features
    scaler = StandardScaler()
    X_train = scaler.fit_transform(X_train)
    X_test = scaler.transform(X_test)
    
    ### Train XGBoost WITHOUT SMOTE
    dtrain = xgb.DMatrix(X_train, label=y_train)
    dtest = xgb.DMatrix(X_test, label=y_test)
    
    params = {
        "objective": "binary:logistic",
        "eval_metric": "auc",
        "tree_method": "gpu_hist",  # Use GPU
        "predictor": "gpu_predictor",
        "learning_rate": 0.05,
        "max_depth": 6,
        "subsample": 0.8,
        "colsample_bytree": 0.8,
        "random_state": 42
    }
    
    # Train XGBoost without SMOTE
    model_no_smote = xgb.train(params, dtrain, num_boost_round=200)
    
    # Predict on test set
    y_pred_proba_no_smote = model_no_smote.predict(dtest)
    from sklearn.metrics import precision_recall_curve, f1_score
    
    precision, recall, thresholds = precision_recall_curve(y_test, y_pred_proba_no_smote)
    f1_scores = (2 * precision * recall) / (precision + recall + 1e-6)  # Avoid division by zero
    optimal_idx = f1_scores.argmax()
    optimal_threshold = thresholds[optimal_idx]

    y_pred_binary_no_smote = [1 if p > optimal_threshold else 0 for p in y_pred_proba_no_smote]  # Convert probabilities to binary predictions

    metrics_no_smote = calculate_metrics(y_test, y_pred_proba_no_smote)

    
    ### Apply ADASYN
    adasyn = ADASYN(sampling_strategy=1, random_state=42)  # Fully balance classes
    X_train_resampled, y_train_resampled = adasyn.fit_resample(X_train, y_train)
    
    # Train XGBoost WITH ADASYN
    dtrain_resampled = xgb.DMatrix(X_train_resampled, label=y_train_resampled)
    model_with_smote = xgb.train(params, dtrain_resampled, num_boost_round=200)
    
    # Predict on test set
    y_pred_proba_with_smote = model_with_smote.predict(dtest)
    precision, recall, thresholds = precision_recall_curve(y_test, y_pred_proba_with_smote)
    f1_scores = (2 * precision * recall) / (precision + recall + 1e-6)  # Avoid division by zero
    optimal_idx = f1_scores.argmax()
    optimal_threshold = thresholds[optimal_idx]
    y_pred_binary_with_smote = [1 if p >= optimal_threshold else 0 for p in y_pred_proba_with_smote]  # precision_recall_curve treats scores >= threshold as positive

    metrics_with_smote = calculate_metrics(y_test, y_pred_proba_with_smote)

    # Print comparison
    metric_names = ["Accuracy", "ROC-AUC", "PR-AUC", "Recall (Sensitivity)", "F1", "Specificity", "FP-Rate", "G-Mean"]
    print("\n--- Model Performance ---")
    print("{:<20} {:<10} {:<10}".format("Metric", "No ADASYN", "With Regular ADASYN"))
    for name, no_smote, with_smote in zip(metric_names, metrics_no_smote, metrics_with_smote):
        print(f"{name:<20} {no_smote:.4f}   {with_smote:.4f}")

    return [metrics_no_smote, metrics_with_smote]
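
The F1-based threshold selection used in the function above relies on two details of `precision_recall_curve`: the `thresholds` array is one element shorter than `precision` and `recall`, and a score counts as positive when it is greater than or equal to the threshold. The following sketch, on a four-sample toy input that is not part of the experiments, shows one way to handle both; the variable names are illustrative.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

y_true = np.array([0, 0, 1, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8])

precision, recall, thresholds = precision_recall_curve(y_true, scores)

# The last (precision=1, recall=0) point has no threshold, so drop it
# before computing F1 to keep indices aligned with `thresholds`.
f1 = 2 * precision[:-1] * recall[:-1] / (precision[:-1] + recall[:-1] + 1e-6)
best_threshold = thresholds[np.argmax(f1)]

# Use >= so the sample whose score equals the threshold is labeled positive,
# matching how the curve itself was computed.
preds = (scores >= best_threshold).astype(int)
print(best_threshold, preds)
```

With a strict `>` comparison, the positive sample scored exactly at the chosen threshold would be labeled negative, and the realized F1 would be lower than the value the curve reported.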


In [65]:
import numpy as np
import pandas as pd
from sklearn.neighbors import NearestNeighbors
from numpy.polynomial.polynomial import Polynomial
from tqdm import tqdm

def custom_adasyn_with_cubic_interpolation(X: pd.DataFrame, y: pd.Series, target_class=1, k_neighbors=5, beta=5, random_state=42):
    """
    Custom ADASYN using 3rd-degree polynomial interpolation.

    Parameters:
        X (pd.DataFrame): Feature matrix.
        y (pd.Series): Target labels.
        target_class (int): The minority class to oversample.
        k_neighbors (int): Number of nearest neighbors to consider.
        beta (float): Scaling factor for sample generation.
        random_state (int): Random seed for reproducibility.

    Returns:
        X_resampled (pd.DataFrame): New feature matrix with synthetic samples.
        y_resampled (pd.Series): Updated target labels.
    """
    np.random.seed(random_state)
    # Reset both indices so the boolean masks below align row by row
    X = X.reset_index(drop=True)
    y = y.reset_index(drop=True)

    # Separate minority and majority classes
    X_minority = X[y == target_class]
    X_majority = X[y != target_class]

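    # Fit KNN on the full dataset (both classes) so majority-class neighbors can be counted.
    # Each query point is itself in X, so it comes back as its own first neighbor.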
    knn = NearestNeighbors(n_neighbors=min(k_neighbors, len(X_minority)))
    knn.fit(X)

    num_synthetic_samples = []
    
    for idx in range(len(X_minority)):
        x_min = X_minority.iloc[idx].values  # Convert to NumPy array
        
        # Find k-nearest neighbors
        neighbors = knn.kneighbors([x_min], return_distance=False)[0]
        
        # Count how many of them belong to the majority class
        majority_neighbors = sum(y.iloc[neighbors] != target_class)
        
        # Compute imbalance factor
        imbalance_factor = majority_neighbors / k_neighbors  

        # Decide how many synthetic samples to generate
        num_synthetic = int(beta * imbalance_factor)
        num_synthetic_samples.append(num_synthetic)

    total_synthetic = sum(num_synthetic_samples)
    if total_synthetic == 0:
        print("⚠ No synthetic samples were generated. Consider adjusting k_neighbors or beta.")
        return X, y  # Return original data if no synthetic samples are created

    synthetic_samples = []

    for idx, n_samples in tqdm(enumerate(num_synthetic_samples), total=len(num_synthetic_samples)):
        if n_samples == 0:
            continue

        x_selected = X_minority.iloc[idx].values
        neighbors = knn.kneighbors([x_selected], return_distance=False)[0]

        for _ in range(n_samples):
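            # Choose a random neighbor. Unlike standard ADASYN, which interpolates only
            # toward minority-class neighbors, this neighbor may belong to either class.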
            neighbor_idx = np.random.choice(neighbors[1:])  # Exclude itself
            x_neighbor = X.iloc[neighbor_idx].values

            # Fit a 3rd-degree polynomial between x_selected and x_neighbor
            t_values = np.array([0, 0.33, 0.66, 1])  
            x_values = np.vstack([
                x_selected, 
                (2*x_selected + x_neighbor)/3, 
                (x_selected + 2*x_neighbor)/3, 
                x_neighbor  
            ])  

            # Generate polynomial coefficients for each feature
            x_synthetic = np.zeros_like(x_selected, dtype=float)  # float dtype avoids truncation for integer features
            t_random = np.random.rand()  

            for feature_idx in range(X.shape[1]):  
                poly = Polynomial.fit(t_values, x_values[:, feature_idx], 3)
                x_synthetic[feature_idx] = poly(t_random)

            synthetic_samples.append(x_synthetic)

    synthetic_samples_df = pd.DataFrame(synthetic_samples, columns=X.columns)
    X_resampled = pd.concat([X, synthetic_samples_df], axis=0, ignore_index=True)
    y_resampled = pd.concat([y, pd.Series(target_class, index=synthetic_samples_df.index)], axis=0, ignore_index=True)

    print(f"✅ Total synthetic samples generated: {len(synthetic_samples)}")
    return X_resampled, y_resampled
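
One property of the interpolation step above is worth checking numerically. The four control points lie on the straight segment between `x_selected` and `x_neighbor`, so a cubic fitted through them is essentially that segment. The sketch below, using a hypothetical helper `cubic_path` and a single scalar feature, compares the fitted cubic with plain linear interpolation for exact nodes (1/3, 2/3) and for the rounded nodes (0.33, 0.66) used in the function.

```python
import numpy as np
from numpy.polynomial.polynomial import Polynomial

def cubic_path(a, b, nodes):
    """Fit a cubic through four values spaced evenly between a and b."""
    values = np.array([a, (2 * a + b) / 3, (a + 2 * b) / 3, b])
    return Polynomial.fit(nodes, values, 3)

a, b = 2.0, 8.0
t = np.linspace(0, 1, 101)
linear = a + t * (b - a)  # standard SMOTE/ADASYN interpolation

exact = cubic_path(a, b, np.array([0, 1 / 3, 2 / 3, 1]))
rounded = cubic_path(a, b, np.array([0, 0.33, 0.66, 1]))

max_dev_exact = np.max(np.abs(exact(t) - linear))
max_dev_rounded = np.max(np.abs(rounded(t) - linear))
print(max_dev_exact, max_dev_rounded)
```

With exact nodes the cubic coincides with the line up to floating-point error; with the rounded nodes it deviates by well under 1% of the segment length. Under these assumptions, the cubic variant behaves almost identically to linear interpolation, so differences in the reported metrics are more likely due to other design choices (neighbor selection, number of samples per point) than to the polynomial itself.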


In [58]:
def model_adasyn_poly(X,y):
    # Train-test split (not stratified; pass stratify=y to preserve the class ratio)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
    
    # Standardize numerical features 
    scaler = StandardScaler()
    X_train = scaler.fit_transform(X_train)
    X_test = scaler.transform(X_test)
    
    ### Train XGBoost WITHOUT oversampling (baseline)
    dtrain = xgb.DMatrix(X_train, label=y_train)
    dtest = xgb.DMatrix(X_test, label=y_test)
    
    params = {
        "objective": "binary:logistic",
        "eval_metric": "auc",
        "tree_method": "hist",  # Histogram-based tree construction
        "device": "cuda",       # Run on GPU (XGBoost >= 2.0; replaces "gpu_hist" and "predictor")
        "learning_rate": 0.05,
        "max_depth": 6,
        "subsample": 0.8,
        "colsample_bytree": 0.8,
        "random_state": 42
    }
    
    # Train XGBoost without oversampling
    model_no_smote = xgb.train(params, dtrain, num_boost_round=200)
    
    # Predict on test set
    y_pred_proba_no_smote = model_no_smote.predict(dtest)
    from sklearn.metrics import precision_recall_curve, f1_score
    
    precision, recall, thresholds = precision_recall_curve(y_test, y_pred_proba_no_smote)
    f1_scores = (2 * precision * recall) / (precision + recall + 1e-6)  # Avoid division by zero
    optimal_idx = f1_scores.argmax()
    optimal_threshold = thresholds[optimal_idx]

    y_pred_binary_no_smote = [1 if p >= optimal_threshold else 0 for p in y_pred_proba_no_smote]  # precision_recall_curve treats scores >= threshold as positive

    metrics_no_smote = calculate_metrics(y_test, y_pred_proba_no_smote)

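    ### Apply the custom ADASYN variant (it expects a DataFrame, so restore column names after scaling)
    X_train_df = pd.DataFrame(X_train, columns=X.columns)
    X_train_resampled, y_train_resampled = custom_adasyn_with_cubic_interpolation(X_train_df, y_train)
    
    # Train XGBoost WITH the custom ADASYN variant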
    dtrain_resampled = xgb.DMatrix(np.array(X_train_resampled), label=y_train_resampled)
    model_with_smote = xgb.train(params, dtrain_resampled, num_boost_round=200)
    
    # Predict on test set
    y_pred_proba_with_smote = model_with_smote.predict(dtest)
    precision, recall, thresholds = precision_recall_curve(y_test, y_pred_proba_with_smote)
    f1_scores = (2 * precision * recall) / (precision + recall + 1e-6)  # Avoid division by zero
    optimal_idx = f1_scores.argmax()
    optimal_threshold = thresholds[optimal_idx]
    y_pred_binary_with_smote = [1 if p >= optimal_threshold else 0 for p in y_pred_proba_with_smote]  # precision_recall_curve treats scores >= threshold as positive

    metrics_with_smote = calculate_metrics(y_test, y_pred_proba_with_smote)

    # Print comparison
    metric_names = ["Accuracy", "ROC-AUC", "PR-AUC", "Recall (Sensitivity)", "F1", "Specificity", "FP-Rate", "G-Mean"]
    print("\n--- Model Performance ---")
    print("{:<20} {:<10} {:<10}".format("Metric", "No SMOTE", "With Cubic Polynomial SMOTE"))
    for name, no_smote, with_smote in zip(metric_names, metrics_no_smote, metrics_with_smote):
        print(f"{name:<20} {no_smote:.4f}   {with_smote:.4f}")

    return [metrics_no_smote, metrics_with_smote]
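
Both model functions above split the data without the `stratify` argument, so the class ratio in the test set can drift from the ratio in the full data. A minimal sketch of the difference, on made-up data with a 20% minority class (the `_demo` names are illustrative and unrelated to the credit datasets):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X_demo = np.arange(200).reshape(100, 2)
y_demo = np.array([0] * 80 + [1] * 20)  # 20% minority class

# Plain random split: the minority share in the test set depends on the seed
_, _, y_tr_plain, y_te_plain = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42
)

# Stratified split: each class is split 70/30 separately
_, _, y_tr_strat, y_te_strat = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)

print("plain test positives:     ", y_te_plain.sum(), "of", len(y_te_plain))
print("stratified test positives:", y_te_strat.sum(), "of", len(y_te_strat))
```

With stratification the 30-row test set always contains exactly 6 positives (20% of 30), which makes metrics such as recall and PR-AUC more comparable across runs.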


In [39]:
from ucimlrepo import fetch_ucirepo 
  
# fetch dataset 
default_of_credit_card_clients = fetch_ucirepo(id=350) 
  
# data (as pandas dataframes) 
X = default_of_credit_card_clients.data.features 
y = default_of_credit_card_clients.data.targets 

X.head()

Unnamed: 0,X1,X2,X3,X4,X5,X6,X7,X8,X9,X10,...,X14,X15,X16,X17,X18,X19,X20,X21,X22,X23
0,20000,2,2,1,24,2,2,-1,-1,-2,...,689,0,0,0,0,689,0,0,0,0
1,120000,2,2,2,26,-1,2,0,0,0,...,2682,3272,3455,3261,0,1000,1000,1000,0,2000
2,90000,2,2,2,34,0,0,0,0,0,...,13559,14331,14948,15549,1518,1500,1000,1000,1000,5000
3,50000,2,2,1,37,0,0,0,0,0,...,49291,28314,28959,29547,2000,2019,1200,1100,1069,1000
4,50000,1,2,1,57,-1,0,-1,0,0,...,35835,20940,19146,19131,2000,36681,10000,9000,689,679


In [66]:
print("Taiwan Credit dataset: \n")
output=model_adasyn(X,y)
output=model_adasyn_poly(X,y['Y'])

Taiwan Credit dataset: 


--- Model Performance ---
Metric               No ADASYN  With Regular ADASYN
Accuracy             0.8190   0.8010
ROC-AUC              0.7771   0.7592
PR-AUC               0.5392   0.5202
Recall (Sensitivity) 0.3566   0.4383
F1                   0.4618   0.4896
Specificity          0.9477   0.9020
FP-Rate              0.0523   0.0980
G-Mean               0.5814   0.6287


100%|█████████████████████████████████████████████████████████████████████████████| 4676/4676 [00:45<00:00, 103.17it/s]


✅ Total synthetic samples generated: 11393

--- Model Performance ---
Metric               No SMOTE   With Cubic Polynomial SMOTE
Accuracy             0.8190   0.8000
ROC-AUC              0.7771   0.7599
PR-AUC               0.5392   0.5226
Recall (Sensitivity) 0.3566   0.4536
F1                   0.4618   0.4969
Specificity          0.9477   0.8964
FP-Rate              0.0523   0.1036
G-Mean               0.5814   0.6377
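
The derived rows in these tables can be cross-checked from the reported recall and specificity. The `calculate_metrics` helper is not shown in this section, so the definitions below are assumptions: G-Mean taken as the geometric mean of sensitivity and specificity, and FP-Rate as one minus specificity. Both reproduce the printed Taiwan Credit values up to rounding.

```python
import math

def g_mean(sensitivity, specificity):
    """Geometric mean of the true-positive and true-negative rates."""
    return math.sqrt(sensitivity * specificity)

def fp_rate(specificity):
    """False-positive rate is the complement of specificity."""
    return 1.0 - specificity

# Values copied from the Taiwan Credit tables above
baseline = g_mean(0.3566, 0.9477)    # reported G-Mean: 0.5814
cubic = g_mean(0.4536, 0.8964)       # reported G-Mean: 0.6377
baseline_fpr = fp_rate(0.9477)       # reported FP-Rate: 0.0523

print(round(baseline, 3), round(cubic, 3), round(baseline_fpr, 4))
```

Read this way, the G-Mean gain after oversampling comes from trading about five points of specificity for roughly ten points of recall.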


German Credit dataset: 


--- Model Performance ---
Metric               No SMOTE   With SMOTE
Accuracy             0.7667   0.7533
ROC-AUC              0.7886   0.7945
PR-AUC               0.6403   0.6400
Recall (Sensitivity) 0.4396   0.4396
F1                   0.5333   0.5195
Specificity          0.9091   0.8900
FP-Rate              0.0909   0.1100
G-Mean               0.6321   0.6255


100%|████████████████████████████████████████████████████████████████████████████████| 627/627 [00:10<00:00, 59.92it/s]



--- Model Performance ---
Metric               No SMOTE   With SMOTE
Accuracy             0.7667   0.7667
ROC-AUC              0.7886   0.7831
PR-AUC               0.6403   0.6377
Recall (Sensitivity) 0.4396   0.4725
F1                   0.5333   0.5513
Specificity          0.9091   0.8947
FP-Rate              0.0909   0.1053
G-Mean               0.6321   0.6502


0      1
2      1
3      1
5      1
6      1
      ..
994    1
995    1
996    1
997    1
999    1
Name: Risk, Length: 700, dtype: int64

In [86]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import roc_auc_score, average_precision_score
from imblearn.over_sampling import SMOTE
import xgboost as xgb

# Load dataset from UCI
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/statlog/german/german.data"
columns = [
    "Status", "Duration", "CreditHistory", "Purpose", "CreditAmount", "Savings",
    "Employment", "InstallmentRate", "PersonalStatus", "Debtors", "ResidenceTime",
    "Property", "Age", "OtherInstallment", "Housing", "ExistingCredits", "Job",
    "NumDependents", "Telephone", "ForeignWorker", "Risk"
]
df = pd.read_csv(url, sep=" ", names=columns)

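# Convert target variable. In the raw data 1 = good credit and 2 = bad credit,
# so after this mapping Risk = 1 is good credit (majority) and Risk = 0 is bad credit (minority).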
df["Risk"] = df["Risk"].map({2: 0, 1: 1})

# One-hot encode categorical variables
df = pd.get_dummies(df, drop_first=True)

# Split features and target
X = df.drop("Risk", axis=1)
y = df["Risk"]
y = y.to_frame(name="Y")  # single-column DataFrame named "Y", so that y['Y'] works

In [27]:
import warnings

warnings.filterwarnings("ignore", category=UserWarning)


In [52]:
X_res,y_res=custom_smote_with_cubic_interpolation(X,y['Y'])

  9%|███████                                                                       | 601/6636 [00:03<00:35, 168.43it/s]

KeyboardInterrupt



In [39]:
X_res

Unnamed: 0,X1,X2,X3,X4,X5,X6,X7,X8,X9,X10,...,X14,X15,X16,X17,X18,X19,X20,X21,X22,X23
0,20000,2,2,1,24,2,2,-1,-1,-2,...,689,0,0,0,0,689,0,0,0,0
1,120000,2,2,2,26,-1,2,0,0,0,...,2682,3272,3455,3261,0,1000,1000,1000,0,2000
2,90000,2,2,2,34,0,0,0,0,0,...,13559,14331,14948,15549,1518,1500,1000,1000,1000,5000
3,50000,2,2,1,37,0,0,0,0,0,...,49291,28314,28959,29547,2000,2019,1200,1100,1069,1000
4,50000,1,2,1,57,-1,0,-1,0,0,...,35835,20940,19146,19131,2000,36681,10000,9000,689,679
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
36631,394048,1,1,1,32,0,0,0,0,0,...,245216,246988,251620,253051,9909,8660,9595,8795,8797,12368
36632,9999,1,2,1,42,0,1,1,3,2,...,9432,9145,8871,8631,1040,2783,0,139,47,213
36633,149999,2,1,1,32,1,0,0,0,0,...,117808,111320,112070,105282,4231,5043,2600,2514,2620,1323
36634,9999,1,2,1,25,2,2,6,6,6,...,2400,2400,2400,2391,0,0,0,0,0,0


In [101]:
X[y==1]

Unnamed: 0,X1,X2,X3,X4,X5,X6,X7,X8,X9,X10,...,X14,X15,X16,X17,X18,X19,X20,X21,X22,X23
0,20000,2,2,1,24,2,2,-1,-1,-2,...,689,0,0,0,0,689,0,0,0,0
1,120000,2,2,2,26,-1,2,0,0,0,...,2682,3272,3455,3261,0,1000,1000,1000,0,2000
13,70000,1,2,2,30,1,2,2,0,0,...,65701,66782,36137,36894,3200,0,3000,3000,1500,0
16,20000,1,1,2,24,0,0,2,2,2,...,17428,18338,17905,19104,3200,0,1500,0,1650,0
21,120000,2,2,1,39,-1,-1,-1,-1,-1,...,316,0,632,316,316,316,0,632,316,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
29991,210000,1,2,1,34,3,2,2,2,2,...,2500,2500,2500,2500,0,0,0,0,0,0
29994,80000,1,2,2,34,2,2,2,2,2,...,79384,77519,82607,81158,7000,3500,0,7000,0,4000
29997,30000,1,2,2,37,4,3,2,-1,0,...,2758,20878,20582,19357,0,0,22000,4200,2000,3100
29998,80000,1,3,1,41,1,-1,0,0,0,...,76304,52774,11855,48944,85900,3409,1178,1926,52964,1804


In [108]:
X_resampled, y_resampled=custom_smote_with_polynomial_interpolation(X,y)



In [111]:
X

Unnamed: 0,X1,X2,X3,X4,X5,X6,X7,X8,X9,X10,...,X14,X15,X16,X17,X18,X19,X20,X21,X22,X23
0,20000,2,2,1,24,2,2,-1,-1,-2,...,689,0,0,0,0,689,0,0,0,0
1,120000,2,2,2,26,-1,2,0,0,0,...,2682,3272,3455,3261,0,1000,1000,1000,0,2000
2,90000,2,2,2,34,0,0,0,0,0,...,13559,14331,14948,15549,1518,1500,1000,1000,1000,5000
3,50000,2,2,1,37,0,0,0,0,0,...,49291,28314,28959,29547,2000,2019,1200,1100,1069,1000
4,50000,1,2,1,57,-1,0,-1,0,0,...,35835,20940,19146,19131,2000,36681,10000,9000,689,679
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
29995,220000,1,3,1,39,0,0,0,0,0,...,208365,88004,31237,15980,8500,20000,5003,3047,5000,1000
29996,150000,1,3,2,43,-1,-1,-1,-1,0,...,3502,8979,5190,0,1837,3526,8998,129,0,0
29997,30000,1,2,2,37,4,3,2,-1,0,...,2758,20878,20582,19357,0,0,22000,4200,2000,3100
29998,80000,1,3,1,41,1,-1,0,0,0,...,76304,52774,11855,48944,85900,3409,1178,1926,52964,1804


## TAIWAN

In [150]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import roc_auc_score, average_precision_score
from imblearn.over_sampling import KMeansSMOTE
import xgboost as xgb

# Load dataset from UCI
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/statlog/german/german.data"
columns = [
    "Status", "Duration", "CreditHistory", "Purpose", "CreditAmount", "Savings",
    "Employment", "InstallmentRate", "PersonalStatus", "Debtors", "ResidenceTime",
    "Property", "Age", "OtherInstallment", "Housing", "ExistingCredits", "Job",
    "NumDependents", "Telephone", "ForeignWorker", "Risk"
]
df = pd.read_csv(url, sep=" ", names=columns)

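# Convert target variable. In the raw data 1 = good credit and 2 = bad credit,
# so after this mapping Risk = 1 is good credit (majority) and Risk = 0 is bad credit (minority).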
df["Risk"] = df["Risk"].map({2: 0, 1: 1})

# One-hot encode categorical variables
df = pd.get_dummies(df, drop_first=True)

# Split features and target
X = df.drop("Risk", axis=1)
y = df["Risk"]

# Train-test split (not stratified); test_size=0.7 holds out 70% of the rows for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.7,random_state=42)

# Standardize numerical features
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

### Train XGBoost WITHOUT SMOTE (Baseline)
dtrain = xgb.DMatrix(X_train, label=y_train)
dtest = xgb.DMatrix(X_test, label=y_test)

params = {
    "objective": "binary:logistic",
    "eval_metric": "auc",
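    "tree_method": "hist",  # Histogram-based tree construction
    "device": "cuda",       # Run on GPU (XGBoost >= 2.0; replaces "gpu_hist" and "predictor")
    "learning_rate": 0.05,
    "max_depth": 6,
    "subsample": 0.8,
    "colsample_bytree": 0.8,
    "random_state": 42      # "n_estimators" is ignored by xgb.train; rounds are set via num_boost_round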
}

# Train XGBoost without SMOTE
model_no_smote = xgb.train(params, dtrain, num_boost_round=200)

# Predict on test set
y_pred_proba_no_smote = model_no_smote.predict(dtest)

# Evaluate performance (No SMOTE)
roc_auc_no_smote = roc_auc_score(y_test, y_pred_proba_no_smote)
pr_auc_no_smote = average_precision_score(y_test, y_pred_proba_no_smote)

### Apply KMeansSMOTE
kmeans_smote = KMeansSMOTE(
    sampling_strategy=1.0,               # Fully balance classes
    random_state=42,                     # Set random state for reproducibility
    cluster_balance_threshold=0.1,       # Lower the cluster balance threshold
    k_neighbors=5                        # Nearest neighbors used when interpolating synthetic samples
)
X_train_resampled, y_train_resampled = kmeans_smote.fit_resample(X_train, y_train)

# Train XGBoost WITH KMeansSMOTE
dtrain_resampled = xgb.DMatrix(X_train_resampled, label=y_train_resampled)
model_with_kmeans_smote = xgb.train(params, dtrain_resampled, num_boost_round=200)

# Predict on test set
y_pred_proba_with_kmeans_smote = model_with_kmeans_smote.predict(dtest)

# Evaluate performance (With KMeansSMOTE)
roc_auc_with_kmeans_smote = roc_auc_score(y_test, y_pred_proba_with_kmeans_smote)
pr_auc_with_kmeans_smote = average_precision_score(y_test, y_pred_proba_with_kmeans_smote)

# Print comparison
print(f"ROC-AUC (No SMOTE): {roc_auc_no_smote:.4f}")
print(f"PR-AUC (No SMOTE): {pr_auc_no_smote:.4f}")
print(f"ROC-AUC (With KMeansSMOTE): {roc_auc_with_kmeans_smote:.4f}")
print(f"PR-AUC (With KMeansSMOTE): {pr_auc_with_kmeans_smote:.4f}")



    E.g. tree_method = "hist", device = "cuda"

Parameters: { "n_estimators", "predictor" } are not used.





ROC-AUC (No SMOTE): 0.7538
PR-AUC (No SMOTE): 0.8734
ROC-AUC (With KMeansSMOTE): 0.7518
PR-AUC (With KMeansSMOTE): 0.8722






In [145]:
y_train

675    1
703    1
12     1
845    1
795    1
      ..
284    1
169    0
856    1
655    1
695    1
Name: Risk, Length: 800, dtype: int64