Write Python code that loads a cleaned dataset, separates the features and target column ‘fraud’, performs a stratified train-test split (20% test size), and prints the shapes and fraud distribution in both training and testing sets. Keep it simple and readable.

In [1]:

# REMOVE ID FEATURES AFTER LOADING THE DATASET
# Columns containing "_id" are typically identifiers (user_id, merchant_id, etc.).
# They rarely generalize to unseen data and can let a model memorize specific entities, so they are dropped here.

import pandas as pd

# Load the cleaned dataset
df_labeled = pd.read_csv("cleaned_dataset.csv")

print(f"Original dataset shape: {df_labeled.shape}")
print(f"Original features: {list(df_labeled.columns)}")

# Identify and remove features containing "_id"
id_features = [col for col in df_labeled.columns if '_id' in col.lower()]
print(f"\nFeatures containing '_id' to be removed: {id_features}")

# Remove _id features
df_labeled = df_labeled.drop(columns=id_features)

print(f"\nUpdated dataset shape: {df_labeled.shape}")
print(f"Updated features: {list(df_labeled.columns)}")
print(f"Total features removed: {len(id_features)}")


Original dataset shape: (777336, 28)
Original features: ['transaction_id', 'client_id', 'card_id', 'use_chip', 'merchant_id', 'mcc', 'fraud', 'has_chip', 'num_cards_issued', 'year_pin_last_changed', 'card_on_dark_web', 'current_age', 'gender', 'per_capita_income', 'num_credit_cards', 'account_age_days', 'account_age_years', 'card_brand_Discover', 'card_brand_Mastercard', 'card_brand_Visa', 'card_type_Debit', 'card_type_Debit (Prepaid)', 'outlier_iqr', 'amount_norm', 'credit_limit_norm', 'total_debt_norm', 'yearly_income_norm', 'credit_score_norm']

Features containing '_id' to be removed: ['transaction_id', 'client_id', 'card_id', 'merchant_id']

Updated dataset shape: (777336, 24)
Updated features: ['use_chip', 'mcc', 'fraud', 'has_chip', 'num_cards_issued', 'year_pin_last_changed', 'card_on_dark_web', 'current_age', 'gender', 'per_capita_income', 'num_credit_cards', 'account_age_days', 'account_age_years', 'card_brand_Discover', 'card_brand_Mastercard', 'card_brand_Visa', 'card_type_
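The substring test `'_id' in col.lower()` works for the columns printed above, but it would also catch any name that merely contains `_id` somewhere in the middle. A slightly stricter sketch, assuming identifier columns always end in `_id` (true for the four removed here); `valid_identity_flag` is a hypothetical column name used only to show the difference:

```python
# Column names copied from the printed output above, plus one hypothetical name.
columns = [
    "transaction_id", "client_id", "card_id", "use_chip",
    "merchant_id", "mcc", "fraud",
    "valid_identity_flag",  # hypothetical, not part of the dataset
]

# Original rule: any column whose name contains "_id".
substring_match = [c for c in columns if "_id" in c.lower()]

# Stricter rule (assumption): identifier columns end with "_id".
suffix_match = [c for c in columns if c.lower().endswith("_id")]

print("substring rule:", substring_match)
print("suffix rule:   ", suffix_match)
```

On this dataset both rules select the same four columns; the suffix rule only matters if a future column happens to embed `_id` inside a longer name.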

In [2]:
import pandas as pd
from sklearn.model_selection import train_test_split


# 1. LOAD CLEANED DATASET
# Load from the cleaned_dataset.csv (already processed and cleaned)
df_labeled = pd.read_csv("cleaned_dataset.csv")

# Remove ALL ID features (transaction_id, client_id, card_id, merchant_id)
id_features = [col for col in df_labeled.columns if '_id' in col.lower()]
df_labeled = df_labeled.drop(columns=id_features)
print(f"Removed ID features: {id_features}")
print(f"Dataset shape after removing IDs: {df_labeled.shape}")

# 2. SEPARATE FEATURES (X) AND TARGET (y)
# Drop the target label "fraud" to get only the input features.
# `y` holds the binary fraud indicator (0 = legitimate, 1 = fraud).
X = df_labeled.drop(columns=["fraud"])
y = df_labeled["fraud"]

print(f"Features used for training ({X.shape[1]} total):")
print(f"  {list(X.columns)}")

# 3. TRAIN–TEST SPLIT WITH STRATIFICATION

# Splitting dataset into training (80%) and testing (20%).
# `stratify=y` is CRITICAL for fraud datasets — ensures the fraud % stays
# consistent across train and test sets despite imbalance.
# `random_state=42` ensures reproducibility of the split.
X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.2,          # 20% test size for evaluation
    random_state=42,        # reproducible results
    stratify=y              # preserves fraud ratio!
)



# 4. INSPECT SPLIT SHAPES + DISTRIBUTION

# Confirms correct partition sizes and checks that stratification worked.
print("Training shape:", X_train.shape, y_train.shape)
print("Testing shape:", X_test.shape, y_test.shape)

print("\nTraining fraud distribution:")
print(y_train.value_counts(normalize=True) * 100)   # % of fraud vs non-fraud

print("\nTesting fraud distribution:")
print(y_test.value_counts(normalize=True) * 100)    # should match training %



Removed ID features: ['transaction_id', 'client_id', 'card_id', 'merchant_id']
Dataset shape after removing IDs: (777336, 24)
Features used for training (23 total):
  ['use_chip', 'mcc', 'has_chip', 'num_cards_issued', 'year_pin_last_changed', 'card_on_dark_web', 'current_age', 'gender', 'per_capita_income', 'num_credit_cards', 'account_age_days', 'account_age_years', 'card_brand_Discover', 'card_brand_Mastercard', 'card_brand_Visa', 'card_type_Debit', 'card_type_Debit (Prepaid)', 'outlier_iqr', 'amount_norm', 'credit_limit_norm', 'total_debt_norm', 'yearly_income_norm', 'credit_score_norm']
Training shape: (621868, 23) (621868,)
Testing shape: (155468, 23) (155468,)

Training fraud distribution:
fraud
0    99.825043
1     0.174957
Name: proportion, dtype: float64

Testing fraud distribution:
fraud
0    99.825044
1     0.174956
Name: proportion, dtype: float64
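The percentages above translate into very small absolute fraud counts, which is worth keeping in mind when reading accuracy figures later. A short arithmetic sketch using only the numbers printed above (the variable names are illustrative):

```python
# Figures copied from the printed output above.
n_train, n_test = 621_868, 155_468
train_fraud_pct, test_fraud_pct = 0.174957, 0.174956

# Convert the printed percentages back into row counts.
train_fraud = round(n_train * train_fraud_pct / 100)
test_fraud = round(n_test * test_fraud_pct / 100)
print("fraud rows (train, test):", train_fraud, test_fraud)

# Accuracy of a trivial model that labels every test transaction as legitimate.
baseline_accuracy = (n_test - test_fraud) / n_test
print(f"all-legitimate baseline accuracy: {baseline_accuracy:.6f}")
```

Any model whose test accuracy falls below this baseline is making more mistakes than a rule that never flags fraud, so accuracy alone says little here.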


In [3]:
!pip install imbalanced-learn




Write Python code that applies SMOTE to handle class imbalance in the fraud dataset. Use a sampling strategy of 0.1, apply it only on the training set to avoid data leakage, and print the class distribution before and after SMOTE.

In [4]:

# APPLY SMOTE TO FIX EXTREME IMBALANCE
from imblearn.over_sampling import SMOTE

# SMOTE generates synthetic minority-class samples (fraud cases) by interpolating between existing ones.
# This gives the model more fraud examples to learn from instead of being overwhelmed by the majority class.
# sampling_strategy=0.1 → after resampling, the minority count equals 10% of the majority count
# (about 9.1% of all training rows), not 10% of the training data.
# This is intentionally NOT 50/50 to limit unrealistic inflation and overfitting.
sm = SMOTE(sampling_strategy=0.1, random_state=42)

# IMPORTANT:
# Fit SMOTE **ONLY on the training set**.
# Resampling the test set would change its class balance and add synthetic rows → invalid evaluation!
X_train_res, y_train_res = sm.fit_resample(X_train, y_train)

# Compare original vs resampled class distribution
print("\nBefore SMOTE:", y_train.value_counts())
print("\nAfter SMOTE:", y_train_res.value_counts())




Before SMOTE: fraud
0    620780
1      1088
Name: count, dtype: int64

After SMOTE: fraud
0    620780
1     62078
Name: count, dtype: int64
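The counts above show how `sampling_strategy=0.1` is interpreted: as a minority-to-majority ratio, not as a share of the whole training set. A minimal arithmetic sketch using the printed counts (variable names are illustrative):

```python
# Counts copied from the "Before SMOTE" output above.
n_majority = 620_780
n_minority_before = 1_088
sampling_strategy = 0.1

# Target minority count is a fraction of the majority count.
n_minority_after = round(sampling_strategy * n_majority)
synthetic_rows = n_minority_after - n_minority_before

# Share of fraud rows in the resampled training set.
minority_share = n_minority_after / (n_majority + n_minority_after)

print("minority rows after SMOTE:", n_minority_after)
print("synthetic rows added:     ", synthetic_rows)
print(f"fraud share of training:   {minority_share:.4f}")
```

Roughly 98% of the fraud rows the models see during training are therefore synthetic, which is one reason to evaluate only on untouched data.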


Write Python code that trains a Logistic Regression and a Decision Tree for fraud detection.
Both models should use class_weight='balanced' to handle imbalance, and they should be trained on the SMOTE-resampled data.
Add clear comments explaining why class weighting matters and why each model is used.

In [5]:
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

# 1) LOGISTIC REGRESSION (WITH CLASS WEIGHTING)
# class_weight="balanced" up-weights fraud cases in the loss.
# Without it, LR tends to favor the majority class under heavy imbalance.
log_reg = LogisticRegression(
    class_weight="balanced",   # weights inversely proportional to class frequency
    max_iter=2000,             # allow more solver iterations before stopping (default is 100)
    n_jobs=-1                  # NOTE: does not speed up a binary fit with the default lbfgs solver
)

# 2) DECISION TREE CLASSIFIER
# Decision Trees capture nonlinear fraud patterns (e.g., amount thresholds, MCC behavior).
# class_weight="balanced" again prevents bias toward the majority class.
tree_clf = DecisionTreeClassifier(
    class_weight="balanced",
    random_state=42            
)


# TRAIN BOTH MODELS ON SMOTE-RESAMPLED DATA
# X_train_res / y_train_res still has roughly a 10:1 legitimate-to-fraud ratio;
# class_weight="balanced" compensates for the imbalance that remains after SMOTE.
# SMOTE gives both the linear (LR) and nonlinear (Tree) models more fraud examples to learn from.
log_reg.fit(X_train_res, y_train_res)
tree_clf.fit(X_train_res, y_train_res)

print("Model Training Complete: Logistic Regression + Decision Tree")


Model Training Complete: Logistic Regression + Decision Tree
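Because both models combine SMOTE with `class_weight="balanced"`, it helps to see what weights that setting produces on the resampled data. The sketch below applies the formula documented for the "balanced" mode, `n_samples / (n_classes * np.bincount(y))`, to the class counts printed after SMOTE; it is plain arithmetic, not a call into scikit-learn:

```python
# Class counts after SMOTE, copied from the printed output above.
counts = {0: 620_780, 1: 62_078}

n_samples = sum(counts.values())
n_classes = len(counts)

# "balanced" weight per class: n_samples / (n_classes * class_count)
weights = {label: n_samples / (n_classes * n) for label, n in counts.items()}
print("class weights:", weights)

# Total weight each class contributes to the training loss.
effective = {label: counts[label] * weights[label] for label in counts}
print("effective class totals:", effective)
```

Under these assumptions the two classes end up with equal total weight, so the models are trained as if fraud and legitimate transactions were equally common. That pushes toward recall at the cost of precision.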


In [6]:

# PREDICT ON ORIGINAL TEST SET (NO SMOTE HERE)

# Logistic Regression predictions
y_prob_log = log_reg.predict_proba(X_test)[:, 1]
y_pred_log = (y_prob_log >= 0.5).astype(int)

# Decision Tree predictions
y_prob_tree = tree_clf.predict_proba(X_test)[:, 1]
y_pred_tree = (y_prob_tree >= 0.5).astype(int)

print("Generated predictions on the true test set.")


Generated predictions on the true test set.


In [7]:

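# CROSS-VALIDATION FOR BOTH MODELS
from sklearn.model_selection import StratifiedKFold, cross_validate

# StratifiedKFold ensures each fold keeps the same fraud ratio.
# With rare fraud, plain KFold can leave a fold with very few positive cases, making per-fold scores unstable.
# CAUTION: the folds below are cut from X_train_res, which already contains SMOTE samples.
# Synthetic points in a validation fold were interpolated from rows that may sit in the training folds,
# so these scores are optimistic compared with the untouched test set.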
cv = StratifiedKFold(
    n_splits=5,         # 5-fold CV = standard, stable
    shuffle=True,       # shuffle ensures better randomness across folds
    random_state=42     # reproducibility
)

# Metrics to evaluate across all folds.
# We focus on precision, recall, F1, and ROC-AUC — the most important in fraud detection.
scoring = {
    "precision": "precision",
    "recall": "recall",
    "f1": "f1",
    "roc_auc": "roc_auc"
}

# CROSS-VALIDATION — LOGISTIC REGRESSION

log_cv = cross_validate(
    log_reg,
    X_train_res,        # SMOTE-resampled training data
    y_train_res,
    cv=cv,
    scoring=scoring,
    n_jobs=-1           # use all CPU cores
)


# CROSS-VALIDATION — DECISION TREE

tree_cv = cross_validate(
    tree_clf,
    X_train_res,
    y_train_res,
    cv=cv,
    scoring=scoring,
    n_jobs=-1
)


# DISPLAY CROSS-VALIDATION RESULTS 

def show_cv_results(name, cv_results):
    print(f"\n----- {name} Cross-Validation Results -----")
    # Loop through each metric and display mean ± standard deviation across folds.
    for metric in scoring.keys():
        scores = cv_results[f"test_{metric}"]
        print(f"{metric.capitalize()} (mean ± std): {scores.mean():.4f} ± {scores.std():.4f}")

show_cv_results("Logistic Regression", log_cv)
show_cv_results("Decision Tree", tree_cv)



----- Logistic Regression Cross-Validation Results -----
Precision (mean ± std): 0.1986 ± 0.0017
Recall (mean ± std): 0.6129 ± 0.0039
F1 (mean ± std): 0.2999 ± 0.0015
Roc_auc (mean ± std): 0.7572 ± 0.0014

----- Decision Tree Cross-Validation Results -----
Precision (mean ± std): 0.9738 ± 0.0012
Recall (mean ± std): 0.9735 ± 0.0011
F1 (mean ± std): 0.9736 ± 0.0008
Roc_auc (mean ± std): 0.9854 ± 0.0005


In [8]:
import pandas as pd
from sklearn.metrics import (
    confusion_matrix, classification_report,
    precision_recall_curve, roc_auc_score, auc,
    precision_score, recall_score, f1_score, accuracy_score
)

def evaluate_model(name, y_test, y_pred, y_prob):
    print(f"\n================ {name} =================")

    # Confusion Matrix
    cm = confusion_matrix(y_test, y_pred)
    tn, fp, fn, tp = cm.ravel()
    print("\nConfusion Matrix:")
    print(cm)
    print(f"TN={tn}, FP={fp}, FN={fn}, TP={tp}")

    # Metrics
    accuracy = accuracy_score(y_test, y_pred)
    precision = precision_score(y_test, y_pred, zero_division=0)
    recall = recall_score(y_test, y_pred, zero_division=0)
    f1 = f1_score(y_test, y_pred, zero_division=0)
    roc = roc_auc_score(y_test, y_prob)

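    # PR-AUC via trapezoidal integration of the precision-recall curve.
    # Linear interpolation can be optimistic when the curve has few points;
    # sklearn.metrics.average_precision_score is a step-wise alternative.
    pr_precision, pr_recall, _ = precision_recall_curve(y_test, y_prob)
    pr_auc = auc(pr_recall, pr_precision)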

    # Table
    results_table = pd.DataFrame({
        "Metric": ["Accuracy", "Precision", "Recall", "F1 Score", "ROC-AUC", "PR-AUC"],
        "Value": [accuracy, precision, recall, f1, roc, pr_auc]
    })

    print("\nPerformance Summary:")
    print(results_table)

    print("\nClassification Report:")
    print(classification_report(y_test, y_pred, digits=4))

# ---- CALL THE FUNCTION ----
evaluate_model("Logistic Regression", y_test, y_pred_log, y_prob_log)
evaluate_model("Decision Tree", y_test, y_pred_tree, y_prob_tree)



Confusion Matrix:
[[116959  38237]
 [   174     98]]
TN=116959, FP=38237, FN=174, TP=98

Performance Summary:
      Metric     Value
0   Accuracy  0.752933
1  Precision  0.002556
2     Recall  0.360294
3   F1 Score  0.005077
4    ROC-AUC  0.588170
5     PR-AUC  0.002365

Classification Report:
              precision    recall  f1-score   support

           0     0.9985    0.7536    0.8590    155196
           1     0.0026    0.3603    0.0051       272

    accuracy                         0.7529    155468
   macro avg     0.5005    0.5570    0.4320    155468
weighted avg     0.9968    0.7529    0.8575    155468



Confusion Matrix:
[[154836    360]
 [   201     71]]
TN=154836, FP=360, FN=201, TP=71

Performance Summary:
      Metric     Value
0   Accuracy  0.996392
1  Precision  0.164733
2     Recall  0.261029
3   F1 Score  0.201991
4    ROC-AUC  0.629355
5     PR-AUC  0.213528

Classification Report:
              precision    recall  f1-score   support

           0     0.9987   
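The Decision Tree's PR-AUC (0.213528) is higher than both its precision and its F1, which looks odd. A likely explanation, assuming the unpruned tree outputs only probabilities of exactly 0 or 1 (the notebook does not print them), is that its precision-recall curve has just three points and the trapezoidal rule draws straight lines between them. The sketch below rebuilds the reported figures from the confusion matrix alone and compares them with a step-wise average precision; it is plain arithmetic, not a call into scikit-learn:

```python
# Decision Tree confusion-matrix counts copied from the output above.
tn, fp, fn, tp = 154_836, 360, 201, 71

precision = tp / (tp + fp)
recall = tp / (tp + fn)
specificity = tn / (tn + fp)
base_rate = (tp + fn) / (tn + fp + fn + tp)

# With only hard 0/1 scores the PR curve has three points:
# (recall=0, precision=1), (recall, precision), (recall=1, precision=base_rate).
# Trapezoidal area under straight lines joining those points:
trapezoid_pr_auc = (
    recall * (1 + precision) / 2
    + (1 - recall) * (precision + base_rate) / 2
)

# Step-wise average precision: each precision weighted by the recall gained at it.
average_precision = recall * precision + (1 - recall) * base_rate

# With a single operating point, ROC-AUC reduces to the mean of recall and specificity.
roc_auc_single_point = (recall + specificity) / 2

print(f"trapezoid PR-AUC:  {trapezoid_pr_auc:.6f}")
print(f"average precision: {average_precision:.6f}")
print(f"ROC-AUC:           {roc_auc_single_point:.6f}")
```

The trapezoid and ROC-AUC values reproduce the numbers reported above, which supports the hard-probability assumption. The step-wise figure is several times smaller, so the tree's PR-AUC should not be read as evidence that it ranks fraud well.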

In [9]:

# DETAILED DECISION TREE EVALUATION (FORMATTED OUTPUT)

import pandas as pd
from sklearn.metrics import (
    confusion_matrix, classification_report,
    precision_recall_curve, roc_auc_score, auc,
    precision_score, recall_score, f1_score, accuracy_score
)

print("\n" + "="*80)
print("DECISION TREE DETAILED PERFORMANCE ANALYSIS")
print("="*80)

# Confusion Matrix breakdown
cm_tree = confusion_matrix(y_test, y_pred_tree)
tn_tree, fp_tree, fn_tree, tp_tree = cm_tree.ravel()

print("\nConfusion Matrix:")
print(cm_tree)
print(f"\nBreakdown:")
print(f"  True Negatives (TN):  {tn_tree:,}  — Correctly identified legitimate transactions")
print(f"  False Positives (FP): {fp_tree:,}  — Legitimate flagged as fraud (false alarms)")
print(f"  False Negatives (FN): {fn_tree:,}  — Fraud missed as legitimate (misses)")
print(f"  True Positives (TP):  {tp_tree:,}  — Correctly identified fraud cases")

# Calculate metrics
acc_tree = accuracy_score(y_test, y_pred_tree)
prec_tree = precision_score(y_test, y_pred_tree, zero_division=0)
rec_tree = recall_score(y_test, y_pred_tree, zero_division=0)
f1_tree = f1_score(y_test, y_pred_tree, zero_division=0)
roc_tree = roc_auc_score(y_test, y_prob_tree)

pr_prec_tree, pr_rec_tree, _ = precision_recall_curve(y_test, y_prob_tree)
pr_auc_tree = auc(pr_rec_tree, pr_prec_tree)

# Performance Summary Table
results_table_tree = pd.DataFrame({
    "Metric": ["Accuracy", "Precision", "Recall", "F1 Score", "ROC-AUC", "PR-AUC"],
    "Value": [acc_tree, prec_tree, rec_tree, f1_tree, roc_tree, pr_auc_tree]
})

print("\nPerformance Summary:")
print(results_table_tree.to_string(index=False))

print("\nDetailed Classification Report:")
print(classification_report(y_test, y_pred_tree, digits=4))

# Key insights summary
print("\n" + "-"*80)
print("KEY INSIGHTS FOR DECISION TREE:")
print("-"*80)
print(f"Precision: {prec_tree:.4f} ({prec_tree*100:.2f}%)")
print(f"  → Of {tp_tree + fp_tree:,} transactions flagged as fraud, {tp_tree} were actually fraud")
print(f"  → {(1-prec_tree)*100:.2f}% are false alarms")
print(f"\nRecall: {rec_tree:.4f} ({rec_tree*100:.2f}%)")
print(f"  → Of {tp_tree + fn_tree} actual fraud cases, {tp_tree} were caught")
print(f"  → {(1-rec_tree)*100:.2f}% of fraud slipped through undetected")
print(f"\nF1 Score: {f1_tree:.4f}")
print(f"  → Balances precision ({prec_tree:.4f}) and recall ({rec_tree:.4f})")
print(f"\nROC-AUC: {roc_tree:.4f}")
print(f"  → Model discrimination ability (0.5 = random, 1.0 = perfect)")
print(f"\nPR-AUC: {pr_auc_tree:.4f}")
print("  → Precision-Recall area (often more informative than ROC-AUC on imbalanced data)")



DECISION TREE DETAILED PERFORMANCE ANALYSIS

Confusion Matrix:
[[154836    360]
 [   201     71]]

Breakdown:
  True Negatives (TN):  154,836  — Correctly identified legitimate transactions
  False Positives (FP): 360  — Legitimate flagged as fraud (false alarms)
  False Negatives (FN): 201  — Fraud missed as legitimate (misses)
  True Positives (TP):  71  — Correctly identified fraud cases

Performance Summary:
   Metric    Value
 Accuracy 0.996392
Precision 0.164733
   Recall 0.261029
 F1 Score 0.201991
  ROC-AUC 0.629355
   PR-AUC 0.213528

Detailed Classification Report:
              precision    recall  f1-score   support

           0     0.9987    0.9977    0.9982    155196
           1     0.1647    0.2610    0.2020       272

    accuracy                         0.9964    155468
   macro avg     0.5817    0.6294    0.6001    155468
weighted avg     0.9972    0.9964    0.9968    155468


--------------------------------------------------------------------------------
KEY INSI