Write Python code that loads a cleaned dataset, separates the features from the target column 'fraud', performs a stratified train-test split (20% test size), and prints the shapes and the fraud distribution of both the training and testing sets. Keep it simple and readable.

In [8]:
import pandas as pd
from sklearn.model_selection import train_test_split


# 1. LOAD CLEANED DATASET
df_labeled = pd.read_csv("cleaned_dataset.csv")


# 2. SEPARATE FEATURES (X) AND TARGET (y)
# Drop the target label "fraud" to get only the input features.
# `y` holds the binary fraud indicator (0 = legitimate, 1 = fraud).
X = df_labeled.drop(columns=["fraud"])
y = df_labeled["fraud"]

# 3. TRAIN–TEST SPLIT WITH STRATIFICATION

# Splitting dataset into training (80%) and testing (20%).
# `stratify=y` is CRITICAL for fraud datasets — ensures the fraud % stays
# consistent across train and test sets despite imbalance.
# `random_state=42` ensures reproducibility of the split.
X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.2,          # 20% test size for evaluation
    random_state=42,        # reproducible results
    stratify=y              # preserves fraud ratio!
)



# 4. INSPECT SPLIT SHAPES + DISTRIBUTION

# Confirms correct partition sizes and checks that stratification worked.
print("Training shape:", X_train.shape, y_train.shape)
print("Testing shape:", X_test.shape, y_test.shape)

print("\nTraining fraud distribution:")
print(y_train.value_counts(normalize=True) * 100)   # % of fraud vs non-fraud

print("\nTesting fraud distribution:")
print(y_test.value_counts(normalize=True) * 100)    # should match training %



Training shape: (621868, 27) (621868,)
Testing shape: (155468, 27) (155468,)

Training fraud distribution:
fraud
0    99.825043
1     0.174957
Name: proportion, dtype: float64

Testing fraud distribution:
fraud
0    99.825044
1     0.174956
Name: proportion, dtype: float64
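
With fraud at roughly 0.17% of rows, accuracy alone says very little. As a quick sanity check, the sketch below computes the accuracy of a trivial model that labels every test row as legitimate. The fraud count is inferred from the printed test percentage (0.174956% of 155,468 rows) rather than printed by the notebook, and the function name is illustrative.

```python
def majority_baseline_accuracy(n_rows: int, n_fraud: int) -> float:
    """Accuracy of a model that predicts 'legitimate' (0) for every row."""
    return (n_rows - n_fraud) / n_rows


n_test_rows = 155_468
# Assumption: fraud count recovered from the printed percentage, not read from the data.
n_test_fraud = round(n_test_rows * 0.174956 / 100)

baseline = majority_baseline_accuracy(n_test_rows, n_test_fraud)
print(f"Inferred test fraud cases: {n_test_fraud}")
print(f"All-legitimate baseline accuracy: {baseline:.4%}")
```

Any model evaluated on this test set has to be judged against that baseline, which is why the later cells report precision, recall, F1, and PR-AUC rather than accuracy alone.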


In [9]:
!pip install imbalanced-learn




Write Python code that applies SMOTE to handle class imbalance in the fraud dataset. Use a sampling strategy of 0.1, apply it only to the training set to avoid data leakage, and print the class distribution before and after SMOTE.

In [10]:

# APPLY SMOTE TO FIX EXTREME IMBALANCE
from imblearn.over_sampling import SMOTE

# SMOTE generates synthetic minority-class samples (fraud cases) by interpolating
# between existing fraud cases and their nearest fraud neighbors.
# This gives the model more fraud examples to learn from.
# sampling_strategy=0.1 → after resampling, the fraud count is 10% of the legitimate
# count (a 1:10 ratio, i.e. about 9.1% of all training rows).
# This is intentionally NOT 50/50 to limit unrealistic inflation and overfitting.
sm = SMOTE(sampling_strategy=0.1, random_state=42)

# IMPORTANT:
# Fit SMOTE **ONLY on the training set**.
# Resampling the test set would put synthetic rows and an artificial fraud rate
# into the evaluation data → invalid evaluation!
X_train_res, y_train_res = sm.fit_resample(X_train, y_train)

# Compare original vs resampled class distribution
print("\nBefore SMOTE:", y_train.value_counts())
print("\nAfter SMOTE:", y_train_res.value_counts())




Before SMOTE: fraud
0    620780
1      1088
Name: count, dtype: int64

After SMOTE: fraud
0    620780
1     62078
Name: count, dtype: int64
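
The counts above show what `sampling_strategy=0.1` means for a binary target: it is the desired minority-to-majority ratio, not the minority share of all rows. A minimal sketch of that arithmetic, using the counts printed above (the helper name is hypothetical and this is not imbalanced-learn's internal code):

```python
def minority_count_for_ratio(n_majority: int, sampling_strategy: float) -> int:
    """Minority count needed so that minority / majority equals the requested ratio."""
    return round(n_majority * sampling_strategy)


n_legit, n_fraud_before = 620_780, 1_088      # "Before SMOTE" counts
n_fraud_after = minority_count_for_ratio(n_legit, 0.1)

ratio = n_fraud_after / n_legit                       # minority / majority
share = n_fraud_after / (n_fraud_after + n_legit)     # minority / all rows
n_synthetic = n_fraud_after - n_fraud_before          # rows created by SMOTE

print(f"Fraud after SMOTE: {n_fraud_after}")
print(f"Ratio to legitimate rows: {ratio:.3f}")
print(f"Share of all training rows: {share:.4f}")
print(f"Synthetic fraud rows: {n_synthetic}")
```

Most fraud rows in the resampled training set are therefore synthetic, which is worth keeping in mind when reading any score computed on `X_train_res`.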


Write Python code that trains a Logistic Regression and a Decision Tree for fraud detection.
Both models should use class_weight='balanced' to handle imbalance, and they should be trained on the SMOTE-resampled data.
Add clear comments explaining why class weighting matters and why each model is used

In [11]:
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

# 1) LOGISTIC REGRESSION (WITH CLASS WEIGHTING)
# class_weight="balanced" makes LR weight each fraud case more heavily in the loss.
# Without this, LR tends to favor the majority class and miss most fraud cases.
log_reg = LogisticRegression(
    class_weight="balanced",   # adjust weights inversely proportional to class frequency
    max_iter=500,              # raise the iteration cap (default is 100) to help the solver converge
    n_jobs=-1                  # request all CPU cores (may have no effect for a binary target)
)

# 2) DECISION TREE CLASSIFIER
# Decision Trees capture nonlinear fraud patterns (e.g., amount thresholds, MCC behavior).
# class_weight="balanced" again prevents bias toward the majority class.
tree_clf = DecisionTreeClassifier(
    class_weight="balanced",
    random_state=42            # reproducible results
)


# TRAIN BOTH MODELS ON SMOTE-RESAMPLED DATA
# We use X_train_res / y_train_res so both models see far more fraud examples
# (a 1:10 fraud-to-legitimate ratio instead of roughly 1:570).
# class_weight="balanced" then reweights the remaining 1:10 imbalance.
log_reg.fit(X_train_res, y_train_res)
tree_clf.fit(X_train_res, y_train_res)

print("Model Training Complete: Logistic Regression + Decision Tree")


Model Training Complete: Logistic Regression + Decision Tree
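
To make the effect of `class_weight="balanced"` concrete, the sketch below applies the documented heuristic `n_samples / (n_classes * count_per_class)` to the post-SMOTE counts printed earlier. The helper is a plain-Python illustration, not scikit-learn's own function.

```python
def balanced_class_weights(class_counts: dict) -> dict:
    """Per-class weight: n_samples / (n_classes * count of that class)."""
    n_samples = sum(class_counts.values())
    n_classes = len(class_counts)
    return {
        label: n_samples / (n_classes * count)
        for label, count in class_counts.items()
    }


# Class counts after SMOTE: 0 = legitimate, 1 = fraud.
weights = balanced_class_weights({0: 620_780, 1: 62_078})
for label, weight in weights.items():
    print(f"class {label}: weight {weight:.2f}")
```

Each fraud row ends up weighted ten times as heavily as a legitimate row, so the two classes contribute equally to the loss in total even though SMOTE stopped at a 1:10 ratio.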


In [12]:

# PREDICT ON ORIGINAL TEST SET (NO SMOTE HERE)

# Logistic Regression predictions
y_prob_log = log_reg.predict_proba(X_test)[:, 1]
y_pred_log = (y_prob_log >= 0.5).astype(int)

# Decision Tree predictions
y_prob_tree = tree_clf.predict_proba(X_test)[:, 1]
y_pred_tree = (y_prob_tree >= 0.5).astype(int)

print("Generated predictions on the true test set.")


Generated predictions on the true test set.


In [13]:

# CROSS-VALIDATION FOR BOTH MODELS
from sklearn.model_selection import StratifiedKFold, cross_validate

# StratifiedKFold ensures each fold keeps the same fraud ratio.
# This matters for fraud modeling: plain KFold can produce folds whose fraud ratios differ.
cv = StratifiedKFold(
    n_splits=5,         # 5-fold CV = standard, stable
    shuffle=True,       # shuffle ensures better randomness across folds
    random_state=42     # reproducibility
)

# Metrics to evaluate across all folds.
# We focus on precision, recall, F1, and ROC-AUC — the most important in fraud detection.
scoring = {
    "precision": "precision",
    "recall": "recall",
    "f1": "f1",
    "roc_auc": "roc_auc"
}

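# CROSS-VALIDATION: LOGISTIC REGRESSION
# CAVEAT: the folds are cut from data that SMOTE has already resampled, so each
# validation fold contains synthetic fraud rows interpolated from rows that may sit
# in the training folds. Treat these scores as optimistic, and rely on the untouched
# test set for the real estimate.
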
log_cv = cross_validate(
    log_reg,
    X_train_res,        # SMOTE-resampled training data
    y_train_res,
    cv=cv,
    scoring=scoring,
    n_jobs=-1           # use all CPU cores
)


# CROSS-VALIDATION — DECISION TREE

tree_cv = cross_validate(
    tree_clf,
    X_train_res,
    y_train_res,
    cv=cv,
    scoring=scoring,
    n_jobs=-1
)


# DISPLAY CROSS-VALIDATION RESULTS 

def show_cv_results(name, cv_results):
    print(f"\n----- {name} Cross-Validation Results -----")
    # Loop through each metric and display mean ± standard deviation across folds.
    for metric in scoring.keys():
        scores = cv_results[f"test_{metric}"]
        print(f"{metric.capitalize()} (mean ± std): {scores.mean():.4f} ± {scores.std():.4f}")

show_cv_results("Logistic Regression", log_cv)
show_cv_results("Decision Tree", tree_cv)



----- Logistic Regression Cross-Validation Results -----
Precision (mean ± std): 0.1342 ± 0.0025
Recall (mean ± std): 0.6363 ± 0.0052
F1 (mean ± std): 0.2217 ± 0.0036
Roc_auc (mean ± std): 0.6650 ± 0.0042

----- Decision Tree Cross-Validation Results -----
Precision (mean ± std): 0.9722 ± 0.0020
Recall (mean ± std): 0.9698 ± 0.0019
F1 (mean ± std): 0.9710 ± 0.0013
Roc_auc (mean ± std): 0.9835 ± 0.0010


Write Python code that performs 5-fold stratified cross-validation on both my Logistic Regression and Decision Tree models (trained on SMOTE-resampled data).
Evaluate precision, recall, F1, and ROC-AUC.

In [None]:
import pandas as pd
from sklearn.metrics import (
    confusion_matrix, classification_report,
    precision_recall_curve, roc_auc_score, auc,
    precision_score, recall_score, f1_score, accuracy_score
)

def evaluate_model(name, y_test, y_pred, y_prob):
    print(f"\n================ {name} =================")

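    # Confusion Matrix
    # labels=[0, 1] keeps the matrix 2x2 even if one class is absent, so ravel() always unpacks.
    cm = confusion_matrix(y_test, y_pred, labels=[0, 1])
    tn, fp, fn, tp = cm.ravel()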
    print("\nConfusion Matrix:")
    print(cm)
    print(f"TN={tn}, FP={fp}, FN={fn}, TP={tp}")

    # Core numeric metrics
    accuracy = accuracy_score(y_test, y_pred)
    precision = precision_score(y_test, y_pred, zero_division=0)
    recall = recall_score(y_test, y_pred, zero_division=0)
    f1 = f1_score(y_test, y_pred, zero_division=0)
    roc = roc_auc_score(y_test, y_prob)

    # PR-AUC
    pr_precision, pr_recall, _ = precision_recall_curve(y_test, y_prob)
    pr_auc = auc(pr_recall, pr_precision)

    # Create a DataFrame for a clean table display
    results_table = pd.DataFrame({
        "Metric": ["Accuracy", "Precision", "Recall", "F1 Score", "ROC-AUC", "PR-AUC"],
        "Value": [accuracy, precision, recall, f1, roc, pr_auc]
    })

    print("\nPerformance Summary:")
    print(results_table.to_string(index=False))

    # Optional full classification report
    print("\nClassification Report:")
    print(classification_report(y_test, y_pred, digits=4))
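
For reference, the threshold-based metrics reported by `evaluate_model` all follow from the four confusion-matrix counts. The sketch below spells out those formulas in plain Python; the counts passed in are hypothetical and chosen only for illustration, they are not results from this notebook.

```python
def metrics_from_counts(tn: int, fp: int, fn: int, tp: int) -> dict:
    """Accuracy, precision, recall and F1 from confusion-matrix counts."""
    total = tn + fp + fn + tp
    precision = tp / (tp + fp) if (tp + fp) else 0.0   # flagged cases that are fraud
    recall = tp / (tp + fn) if (tp + fn) else 0.0      # fraud cases that were flagged
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Hypothetical counts: 20 fraud cases among 10,000 rows.
example = metrics_from_counts(tn=9_900, fp=80, fn=5, tp=15)
for name, value in example.items():
    print(f"{name}: {value:.4f}")
```

In this made-up case accuracy stays above 99% while precision is under 16%, which is the pattern to watch for when comparing the two models on the real test set.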
