# Step 1: Load & Inspect Dataset

import pandas as pd

# Load your dataset (adjust filename if needed)
df = pd.read_csv("provider_features.csv")

# Basic overview
print("Number of samples:", df.shape[0])
print("Number of features:", df.shape[1])
print("\nDataset head:\n", df.head())

# Check class imbalance
print("\nClass distribution:")
print(df['PotentialFraud'].value_counts(normalize=True))

# Separate data types ('number' covers every integer and float width)
categorical_cols = df.select_dtypes(include=['object']).columns
numeric_cols = df.select_dtypes(include=['number']).columns

print("\nCategorical Columns:", list(categorical_cols))
print("Numeric Columns:", list(numeric_cols))

Number of samples: 5410
Number of features: 35

Dataset head:
    Provider  ClaimCount  total_reimbursed  mean_claim_amount  \
0  PRV51001        24.0          104340.0        4347.500000   
1  PRV51003       132.0          605670.0        4588.409091   
2  PRV51004       143.0           51830.0         362.447552   
3  PRV51005      1149.0          278960.0         242.785030   
4  PRV51007        72.0           33710.0         468.194444   

   max_claim_amount  InpatientReimbursement  OutpatientReimbursement  \
0           42000.0                 97000.0                   7340.0   
1           57000.0                573000.0                  32670.0   
2            3300.0                     0.0                  51830.0   
3            4080.0                     0.0                 278960.0   
4           10000.0                 19000.0                  14710.0   

   Inp_Outp_Reimbursement_Ratio     AvgAge  PercDeceased  ...  \
0                     13.215259  78.750000           0

Samples: 5,410

Features: 35 (mostly numeric, 2 categorical)

Target imbalance:

No Fraud: 90.6%

Fraud: 9.3% → highly imbalanced

Data types:

Categorical: Provider, PotentialFraud

Numeric: 33 numeric features

Dataset size: Medium, suitable for tree-based models, LR, SVM

Patterns: Fraud is expected to be nonlinear and complex

The dataset is:

Medium-sized → all ML models can run efficiently

Highly imbalanced → need models that handle imbalance or can use class weights

Mostly numeric → very little preprocessing is required

Contains only one categorical column besides the target: Provider, which is an identifier and should be dropped rather than encoded as a feature
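
The points above can be sketched as a small preparation step. This is a minimal sketch on a toy frame, not the real file: the column names follow the printed head, but the "Yes"/"No" label values and the 25% test share are assumptions.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for provider_features.csv (20 providers, 4 flagged).
# Assumption: PotentialFraud is stored as "Yes"/"No".
df = pd.DataFrame({
    "Provider": [f"PRV{51000 + i}" for i in range(20)],
    "ClaimCount": [float(10 + i) for i in range(20)],
    "total_reimbursed": [1000.0 * (i + 1) for i in range(20)],
    "PotentialFraud": ["Yes" if i % 5 == 0 else "No" for i in range(20)],
})

# Provider is an identifier, so it is dropped instead of being encoded.
X = df.drop(columns=["Provider", "PotentialFraud"])
y = (df["PotentialFraud"] == "Yes").astype(int)

# stratify=y keeps the fraud share the same in the train and test splits,
# which matters when the positive class is rare.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

print("Train fraud share:", y_train.mean())
print("Test fraud share:", y_test.mean())
```

On the real data, the same three steps (drop the identifier, binarize the target, stratified split) would apply to `df` as loaded in Step 1.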

Decision Tree

Decision Trees create hierarchical rules by splitting features at thresholds.
They naturally handle nonlinear fraud patterns and require almost no preprocessing.
However, they can overfit, especially on imbalanced data like ours.

Fit for the dataset?

✔ Minimal preprocessing

✔ Handles mixed data

✘ Weak on imbalanced classes

✘ High variance
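
To make the overfitting point concrete, here is a hedged sketch using scikit-learn on synthetic data with roughly 9% positives. The data, the depth limit, and the leaf size are illustrative assumptions, not values tuned for this dataset.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in: about 9% positives, mirroring the fraud share above.
X, y = make_classification(
    n_samples=2000, n_features=10, n_informative=5,
    weights=[0.91, 0.09], random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0
)

# Unconstrained tree: keeps splitting until the training set is memorized.
deep = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Constrained tree: depth and leaf-size limits reduce variance, and
# class_weight="balanced" up-weights the rare class.
shallow = DecisionTreeClassifier(
    max_depth=4, min_samples_leaf=20,
    class_weight="balanced", random_state=0,
).fit(X_train, y_train)

print("deep    train acc:", deep.score(X_train, y_train))
print("deep    test acc: ", deep.score(X_test, y_test))
print("shallow depth:    ", shallow.get_depth())
```

A gap between the deep tree's training and test accuracy is the high variance noted above; the constrained tree trades some training fit for stability.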

Random Forest

Random Forest trains many trees on bootstrap samples and averages their predictions to reduce overfitting.
It handles nonlinear relationships and mixed numeric/categorical data very well.
It also handles class imbalance better than single trees, especially with class weights.

Fit for the dataset?

✔ Excellent with numeric-heavy data

✔ Robust to imbalance

✔ Captures complex fraud patterns

✘ Less interpretable

✘ Slower than Logistic Regression (but still manageable)
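
A minimal sketch of a class-weighted forest, again on synthetic data (an assumption, as are the hyperparameters). It also checks that the forest's probability is the average of the individual trees' probabilities, which is how scikit-learn's RandomForestClassifier combines them.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in with about 9% positives.
X, y = make_classification(
    n_samples=2000, n_features=10, n_informative=5,
    weights=[0.91, 0.09], random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0
)

# class_weight="balanced" reweights samples inversely to class frequency.
forest = RandomForestClassifier(
    n_estimators=200, min_samples_leaf=5,
    class_weight="balanced", random_state=0,
)
forest.fit(X_train, y_train)

# Forest probability for the positive class.
proba = forest.predict_proba(X_test)[:, 1]

# The same value, rebuilt by averaging every tree's own probability.
tree_mean = np.mean(
    [tree.predict_proba(X_test)[:, 1] for tree in forest.estimators_], axis=0
)

print("averaging matches:", np.allclose(proba, tree_mean))
print("fraud recall:", recall_score(y_test, forest.predict(X_test)))
```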

Gradient Boosting (XGBoost / LightGBM / CatBoost)

Gradient Boosting builds trees sequentially to correct previous errors.
It performs exceptionally well on tabular, imbalanced, nonlinear datasets like ours.
Modern implementations handle imbalance directly (e.g., scale_pos_weight).

Fit for the dataset?

✔ Best for nonlinear and sparse fraud patterns

✔ Handles imbalance very well

✔ Works great with numeric features

✘ Slower training

✘ Less interpretable
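
A common starting value for scale_pos_weight is the ratio of negative to positive samples. The helper below is a hypothetical illustration (the function name is ours, not part of any library), followed by an estimate from the class shares reported earlier.

```python
from collections import Counter

def neg_pos_ratio(labels, positive=1):
    """Ratio of negative to positive samples.

    Hypothetical helper; the result is a usual starting point for
    scale_pos_weight, to be tuned further by validation.
    """
    counts = Counter(labels)
    n_pos = counts[positive]
    n_neg = sum(counts.values()) - n_pos
    if n_pos == 0:
        raise ValueError("no positive samples in labels")
    return n_neg / n_pos

# Toy labels: 9 non-fraud providers for every fraud provider.
toy_labels = [0] * 90 + [1] * 10
print(neg_pos_ratio(toy_labels))  # 9.0

# Estimate from the class shares printed above (90.6% vs 9.3%).
approx = 0.906 / 0.093
print(round(approx, 1))  # 9.7

# Not executed here: the value would be passed to a booster, e.g.
# xgboost.XGBClassifier(scale_pos_weight=approx)
```

So for this dataset a value near 9.7 is a reasonable first setting; the exact ratio should be computed from the training labels only.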

Logistic Regression

Logistic Regression is a linear classifier with high interpretability and fast training.
However, it assumes linear relationships and depends heavily on preprocessing.
Fraud detection is not linear, and the dataset has complex numeric interactions.

Fit for the dataset?

✔ Very fast

✔ Highly interpretable

✘ Performs poorly on nonlinear fraud patterns

✘ Needs preprocessing (scaling, encoding)
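
The preprocessing requirement can be sketched as a scikit-learn pipeline, so that scaling is fitted together with the model. The synthetic data and the inflated first column are assumptions meant to mimic features on very different scales, such as claim counts versus reimbursement totals.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in with about 9% positives.
X, y = make_classification(
    n_samples=1000, n_features=8, n_informative=4,
    weights=[0.91, 0.09], random_state=0,
)
# Put one feature on a much larger scale than the rest.
X[:, 0] *= 10000.0

# StandardScaler removes the scale difference before the linear model;
# class_weight="balanced" compensates for the rare positive class.
model = make_pipeline(
    StandardScaler(),
    LogisticRegression(class_weight="balanced", max_iter=1000),
)
model.fit(X, y)

# After scaling, coefficient magnitudes are comparable across features,
# which is what makes the model easy to interpret.
coefs = model.named_steps["logisticregression"].coef_[0]
for i, c in enumerate(coefs):
    print(f"feature_{i}: {c:+.3f}")
```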

Support Vector Machine (SVM)

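SVM separates fraud vs non-fraud using an optimal boundary (hyperplane).
Kernel SVMs can model nonlinear patterns, but their training time grows at least quadratically with the number of rows.
Our dataset (5,410 rows × 35 features) is still trainable with a kernel SVM, though more slowly than with the linear and tree-based options.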

Fit for the dataset?

✔ Works well with nonlinear boundaries

✘ Requires scaling of all features

✘ Training cost grows quickly with dataset size

✘ Sensitive to imbalance (needs class weights)
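
A hedged sketch of how the scaling and class-weight requirements are usually combined for an RBF SVM. The data is synthetic and the hyperparameters are scikit-learn defaults, not tuned values.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in with about 9% positives.
X, y = make_classification(
    n_samples=1500, n_features=10, n_informative=5,
    weights=[0.91, 0.09], random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0
)

# Scaling is part of the pipeline because the RBF kernel is distance-based;
# class_weight="balanced" raises the penalty for misclassified fraud cases.
svm = make_pipeline(
    StandardScaler(),
    SVC(kernel="rbf", C=1.0, gamma="scale", class_weight="balanced"),
)
svm.fit(X_train, y_train)

print(classification_report(y_test, svm.predict(X_test), digits=3))
print("support vectors per class:", svm.named_steps["svc"].n_support_)
```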

Based on our dataset, the best primary model is:

🎯 Gradient Boosting (XGBoost or LightGBM)

Gradient Boosting is chosen as the main model because:

Our dataset is highly imbalanced (9% fraud), and Gradient Boosting algorithms handle this extremely well using scale_pos_weight or built-in imbalance handling.

Fraud detection typically requires capturing nonlinear interactions between numeric features, which boosting models excel at.

With 35 features (mostly numeric), the dataset is ideal for tree-based boosting.

Gradient Boosting typically outperforms simple models like Logistic Regression on tabular fraud datasets, which should be confirmed with cross-validation on this data.

Training time is manageable with 5,410 samples.
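
The choice can be checked rather than assumed. The sketch below compares a logistic regression baseline with a boosted model using stratified cross-validation and average precision, a metric suited to rare positives. It uses scikit-learn's GradientBoostingClassifier as a stand-in for XGBoost or LightGBM, and synthetic data in place of the real features; both are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in with about 9% positives.
X, y = make_classification(
    n_samples=1500, n_features=10, n_informative=5,
    weights=[0.91, 0.09], random_state=0,
)

# Stratified folds keep the fraud share stable in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

candidates = {
    "logistic_regression": make_pipeline(
        StandardScaler(),
        LogisticRegression(class_weight="balanced", max_iter=1000),
    ),
    # Stand-in for XGBoost / LightGBM.
    "gradient_boosting": GradientBoostingClassifier(random_state=0),
}

results = {}
for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=cv, scoring="average_precision")
    results[name] = scores
    print(f"{name}: mean AP = {scores.mean():.3f} (+/- {scores.std():.3f})")
```

Running the same loop on the real feature matrix would show whether boosting actually leads on this dataset and by how much.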