## 1.5.2 Class Imbalance Strategy — Data Preparation

This section prepares the processed provider-level dataset for class imbalance
handling. We:

- Load the `provider_features.csv` file generated in Notebook 01.
- Encode the target label (`PotentialFraud`) as 0/1.
- Handle any remaining missing values.
- Split the data into stratified train and test sets, preserving the fraud ratio.

These steps produce a clean, stratified training set to which a class imbalance strategy (SMOTE) can be applied, and a test set that keeps the original fraud ratio.


In [14]:
# 1. Load data and basic preprocessing (for class imbalance handling)

import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split

# Load provider-level features from Notebook 01
df = pd.read_csv("provider_features.csv")

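# Encode target label: No -> 0, Yes -> 1
df["PotentialFraud"] = df["PotentialFraud"].map({"No": 0, "Yes": 1})

# Fail early on unexpected labels: .map() turns them into NaN, and the
# fillna(0) below would otherwise silently relabel them as non-fraud
assert df["PotentialFraud"].notna().all(), "Unexpected PotentialFraud label"

# Fill any remaining missing feature values with 0
df = df.fillna(0)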

# Separate features and target
X = df.drop(["Provider", "PotentialFraud"], axis=1)
y = df["PotentialFraud"]

# Stratified train/test split to preserve fraud ratio
X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.25,
    random_state=42,
    stratify=y
)

# Optional: check class distribution in training set
print("Training class distribution (y_train):")
print(y_train.value_counts(normalize=True))


Training class distribution (y_train):
PotentialFraud
0    0.906581
1    0.093419
Name: proportion, dtype: float64


## 1.5.2 Class Imbalance Strategy — SMOTE Oversampling

The dataset is highly imbalanced: only about 9.3% of providers in the training
set are labeled as fraudulent. To address this, we apply
**SMOTE (Synthetic Minority Oversampling Technique)** to the training data only.

This step directly implements the class imbalance strategy required in section 1.5.2.
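
To make the idea behind SMOTE concrete, the following is a minimal pure-Python sketch of its core step: each synthetic sample is placed at a random position on the line segment between a minority sample and one of its nearest minority neighbours. The function name `smote_sketch` and the toy points are hypothetical and for illustration only; the `imblearn` implementation used in this notebook differs in its details.

```python
import math
import random


def smote_sketch(minority, n_new, k=5, seed=42):
    """Create n_new synthetic points by interpolating between minority samples."""
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)  # cannot have more neighbours than other points
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the base point (excluding itself)
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        # Random position on the segment between base and neighbour
        gap = rng.random()
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic


# Toy minority class with two features (hypothetical values, not the real data)
minority = [(1.0, 2.0), (1.5, 1.8), (2.0, 2.2), (1.2, 2.5)]
new_points = smote_sketch(minority, n_new=3, k=2)
print(new_points)
```

Because every synthetic point is an interpolation of two real minority samples, it always lies inside the region already covered by the minority class, which is why SMOTE adds no information that is not present in the training data.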


In [15]:
# 2. Apply SMOTE to balance the training data

from imblearn.over_sampling import SMOTE

sm = SMOTE(random_state=42)
X_train_res, y_train_res = sm.fit_resample(X_train, y_train)

print("Class distribution BEFORE SMOTE:")
print(y_train.value_counts())

print("\nClass distribution AFTER SMOTE:")
print(y_train_res.value_counts())


Class distribution BEFORE SMOTE:
PotentialFraud
0    3678
1     379
Name: count, dtype: int64

Class distribution AFTER SMOTE:
PotentialFraud
0    3678
1    3678
Name: count, dtype: int64


## 1.5.3 Algorithm Selection — Training Selected Models

We now train two different algorithms on the balanced training data:

- **Logistic Regression**: an interpretable linear baseline.
- **Random Forest**: a more flexible tree-based ensemble that captures non-linear relationships and is insensitive to feature scale.

This satisfies section 1.5.3 by evaluating relevant algorithms and preparing to
select a primary model.


In [16]:
# 3. Train Logistic Regression and Random Forest on the balanced data

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

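# Logistic Regression (interpretable baseline)
# Note: after SMOTE the classes are already 1:1, so class_weight="balanced"
# yields equal weights here and has no additional effect.
log_reg = LogisticRegression(
    max_iter=2000,
    class_weight="balanced"
)
log_reg.fit(X_train_res, y_train_res)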

# Random Forest (primary model candidate)
rf = RandomForestClassifier(
    n_estimators=300,
    random_state=42,
    class_weight="balanced_subsample"
)
rf.fit(X_train_res, y_train_res)

print("Models trained successfully.")


STOP: TOTAL NO. OF ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


Models trained successfully.
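
The warning above indicates that the solver hit `max_iter=2000` before converging, so the Logistic Regression results in this notebook come from a model that has not fully converged. As the warning suggests, one common remedy is to standardize the features first. The following is a minimal sketch on synthetic data (generated with `make_classification`, not the provider dataset); the names `X_demo`, `y_demo`, and `scaled_log_reg` are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the provider features (NOT the real dataset)
X_demo, y_demo = make_classification(
    n_samples=500, n_features=10, weights=[0.9, 0.1], random_state=42
)
# Put the columns on very different scales, as raw amounts and counts might be
X_demo = X_demo * np.logspace(0, 5, num=10)

# Standardize each feature to zero mean and unit variance, then fit the model
scaled_log_reg = make_pipeline(
    StandardScaler(),
    LogisticRegression(max_iter=2000),
)
scaled_log_reg.fit(X_demo, y_demo)

# Number of solver iterations actually used
n_iter = scaled_log_reg.named_steps["logisticregression"].n_iter_[0]
print("Solver iterations used:", n_iter)
```

In this notebook, the same pipeline would be fit on `X_train_res` and `y_train_res`, and because the scaler lives inside the pipeline, `predict` and `predict_proba` on `X_test` would reuse the training-set scaling automatically. Whether this removes the warning on the real data is an assumption that would need to be checked by rerunning the cell.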


## 1.5.4 Comparison Models — Evaluation & Metrics

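Here we compare Logistic Regression and Random Forest on the untouched test set.
Accuracy alone is misleading here, since labeling every test provider as
non-fraud would already score about 91% (1226 of 1353). We therefore report: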

- Precision
- Recall
- F1-score
- ROC-AUC

This fulfills section 1.5.4 by comparing the models on a standard set of metrics.
Curves and deeper error analysis follow in Notebook 03.
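
As a reminder of what the first three metrics measure for the fraud class, here is a small sketch that computes them from raw labels. The helper `binary_metrics` and the toy labels are hypothetical; the notebook itself relies on `classification_report` from scikit-learn.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Return (precision, recall, f1) for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)

    # Precision: of the providers flagged as fraud, how many really are fraud
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Recall: of the truly fraudulent providers, how many were flagged
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1: harmonic mean of precision and recall
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return precision, recall, f1


# Toy labels (hypothetical): 3 fraud cases among 10 providers
y_true = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1, 1, 1, 1, 0]

precision, recall, f1 = binary_metrics(y_true, y_pred)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

In this toy case the classifier flags four providers, two of them correctly, and misses one fraud case, so precision is 2/4 and recall is 2/3.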


In [17]:
# 4. Evaluate and compare models on the (original) test set

from sklearn.metrics import classification_report, roc_auc_score

# Random Forest predictions
rf_preds = rf.predict(X_test)
rf_probs = rf.predict_proba(X_test)[:, 1]

print("=== Random Forest ===")
print(classification_report(y_test, rf_preds))
print("ROC-AUC:", roc_auc_score(y_test, rf_probs))

# Logistic Regression predictions
log_preds = log_reg.predict(X_test)
log_probs = log_reg.predict_proba(X_test)[:, 1]

print("\n=== Logistic Regression ===")
print(classification_report(y_test, log_preds))
print("ROC-AUC:", roc_auc_score(y_test, log_probs))


=== Random Forest ===
              precision    recall  f1-score   support

           0       0.97      0.93      0.95      1226
           1       0.51      0.71      0.60       127

    accuracy                           0.91      1353
   macro avg       0.74      0.82      0.77      1353
weighted avg       0.93      0.91      0.92      1353

ROC-AUC: 0.9204377593094503

=== Logistic Regression ===
              precision    recall  f1-score   support

           0       0.98      0.87      0.92      1226
           1       0.40      0.85      0.54       127

    accuracy                           0.86      1353
   macro avg       0.69      0.86      0.73      1353
weighted avg       0.93      0.86      0.88      1353

ROC-AUC: 0.9208295333393276


In [18]:
# 5. Save predictions and probabilities to CSV

# Get the Provider IDs for the test set
providers_test = df.loc[X_test.index, "Provider"]

results = pd.DataFrame({
    "Provider": providers_test,
    "ActualFraud": y_test.values,
    "PredictedFraud_RF": rf_preds,
    "FraudProbability_RF": rf_probs,
    "PredictedFraud_LR": log_preds,
    "FraudProbability_LR": log_probs,
})

results.to_csv("model_predictions.csv", index=False)
print("Saved predictions to model_predictions.csv")


Saved predictions to model_predictions.csv
