# LAB | Imbalanced

**Load the data**

In this challenge, we will be working with a credit card fraud dataset.

https://raw.githubusercontent.com/data-bootcamp-v4/data/main/card_transdata.csv

Metadata

- **distance_from_home:** Distance from the cardholder's home to where the transaction happened.
- **distance_from_last_transaction:** Distance from where the previous transaction happened.
- **ratio_to_median_purchase_price:** Ratio of the transaction's purchase price to the median purchase price.
- **repeat_retailer:** Whether the transaction happened at the same retailer as before.
- **used_chip:** Whether the transaction was made with the chip (credit card).
- **used_pin_number:** Whether the transaction was made using a PIN number.
- **online_order:** Whether the transaction is an online order.
- **fraud:** Whether the transaction is fraudulent. **0=legit** - **1=fraud**


In [1]:
#Libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score
from imblearn.over_sampling import RandomOverSampler, SMOTE
from imblearn.under_sampling import RandomUnderSampler

In [2]:
fraud = pd.read_csv("https://raw.githubusercontent.com/data-bootcamp-v4/data/main/card_transdata.csv")
fraud.head()

| | distance_from_home | distance_from_last_transaction | ratio_to_median_purchase_price | repeat_retailer | used_chip | used_pin_number | online_order | fraud |
|---|---|---|---|---|---|---|---|---|
| 0 | 57.877857 | 0.31114 | 1.94594 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 |
| 1 | 10.829943 | 0.175592 | 1.294219 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 5.091079 | 0.805153 | 0.427715 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 3 | 2.247564 | 5.600044 | 0.362663 | 1.0 | 1.0 | 0.0 | 1.0 | 0.0 |
| 4 | 44.190936 | 0.566486 | 2.222767 | 1.0 | 1.0 | 0.0 | 1.0 | 0.0 |


**Steps:**

- **1.** What is the distribution of our target variable? Can we say we're dealing with an imbalanced dataset?
- **2.** Train a LogisticRegression.
- **3.** Evaluate your model. Take class importance into consideration and evaluate it by selecting the correct metric.
- **4.** Run **Oversample** in order to balance our target variable and repeat the steps above (1-3), now with balanced data. Does it improve the performance of our model?
- **5.** Now, run **Undersample** in order to balance our target variable and repeat the steps above (1-3), now with balanced data. Does it improve the performance of our model?
- **6.** Finally, run **SMOTE** in order to balance our target variable and repeat the steps above (1-3), now with balanced data. Does it improve the performance of our model?

In [3]:
fraud["fraud"].value_counts(normalize=True)

fraud
0.0    0.912597
1.0    0.087403
Name: proportion, dtype: float64

Yes, we are definitely dealing with imbalanced data. ~91% of the data is in the negative class, while only ~9% is in the positive class.
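
To see why this imbalance matters for metric choice, here is a minimal sketch that uses only the proportions printed above. It computes the accuracy of a hypothetical "always predict legit" classifier; the variable names are illustrative and not part of the lab.

```python
# Class proportions copied from the value_counts output above.
proportions = {0.0: 0.912597, 1.0: 0.087403}

# A hypothetical dummy model that always predicts the majority class.
majority_class = max(proportions, key=proportions.get)
dummy_accuracy = proportions[majority_class]

# Such a model never predicts fraud, so its recall on the fraud class is zero.
dummy_fraud_recall = 0.0 if majority_class == 0.0 else 1.0

# Number of legit transactions per fraudulent one.
imbalance_ratio = proportions[0.0] / proportions[1.0]

print(f"Dummy accuracy:     {dummy_accuracy:.3f}")
print(f"Dummy fraud recall: {dummy_fraud_recall:.1f}")
print(f"Legit per fraud:    {imbalance_ratio:.1f}")
```

A model can therefore reach about 91% accuracy without detecting a single fraud, which is why the evaluation should focus on the fraud class rather than on overall accuracy.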

In [4]:
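# Separate the features from the target
X = fraud.drop(columns='fraud')
y = fraud['fraud']

# Stratify so the train and test splits keep the same fraud rate
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, stratify=y, random_state=42)

# Fit the scaler on the training split only, then apply it to the test split
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Baseline model trained on the original, imbalanced data
clf = LogisticRegression(max_iter=1000)
clf.fit(X_train_scaled, y_train)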

In [5]:
y_pred = clf.predict(X_test_scaled)
y_proba = clf.predict_proba(X_test_scaled)[:, 1]

print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))
print("\nClassification Report:\n", classification_report(y_test, y_pred))
print("\nROC AUC:", roc_auc_score(y_test, y_proba))

Confusion Matrix:
 [[271940   1839]
 [ 10402  15819]]

Classification Report:
               precision    recall  f1-score   support

         0.0       0.96      0.99      0.98    273779
         1.0       0.90      0.60      0.72     26221

    accuracy                           0.96    300000
   macro avg       0.93      0.80      0.85    300000
weighted avg       0.96      0.96      0.96    300000


ROC AUC: 0.9671129229618998
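
The fraud-class numbers in the report can be recomputed by hand from the confusion matrix. The sketch below copies the four counts printed above (rows are true classes, columns are predicted classes) and is only meant to show where each metric comes from.

```python
# Counts copied from the confusion matrix above:
# [[TN, FP],
#  [FN, TP]]
tn, fp = 271940, 1839
fn, tp = 10402, 15819

# Of all transactions flagged as fraud, how many really were fraud?
precision = tp / (tp + fp)

# Of all real frauds, how many did the model catch?
recall = tp / (tp + fn)

# Harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)

# Share of all predictions that were correct.
accuracy = (tp + tn) / (tn + fp + fn + tp)

print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f} accuracy={accuracy:.2f}")
print(f"Missed frauds (false negatives): {fn}")
```

Accuracy is 0.96, yet 10,402 of the 26,221 fraudulent test transactions are missed, so recall on class 1 is the number to watch here.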


In [6]:
ros = RandomOverSampler(random_state=42)
X_ros, y_ros = ros.fit_resample(X_train_scaled, y_train)

clf_ros = LogisticRegression(max_iter=1000)
clf_ros.fit(X_ros, y_ros)

y_pred_ros = clf_ros.predict(X_test_scaled)
y_proba_ros = clf_ros.predict_proba(X_test_scaled)[:, 1]
print("\nOversampled Model Report")
print(classification_report(y_test, y_pred_ros))
print("ROC AUC:", roc_auc_score(y_test, y_proba_ros))


Oversampled Model Report
              precision    recall  f1-score   support

         0.0       0.99      0.93      0.96    273779
         1.0       0.58      0.95      0.72     26221

    accuracy                           0.93    300000
   macro avg       0.79      0.94      0.84    300000
weighted avg       0.96      0.93      0.94    300000

ROC AUC: 0.9795629410667628
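
As a rough illustration of what random oversampling does to the training set, here is a plain-Python sketch on toy data. It is not imblearn's implementation; `random_oversample`, `toy_X` and `toy_y` are hypothetical names used only for this example.

```python
import random
from collections import Counter

def random_oversample(rows, labels, seed=42):
    """Duplicate randomly chosen rows of the smaller classes until every
    class has as many rows as the largest one."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for cls, n in counts.items():
        pool = [r for r, l in zip(rows, labels) if l == cls]
        # Sampling with replacement: the same minority row can be copied many times.
        extra = rng.choices(pool, k=target - n)
        out_rows.extend(extra)
        out_labels.extend([cls] * len(extra))
    return out_rows, out_labels

# Toy data: 8 legit rows and 2 fraud rows.
toy_X = [[float(i)] for i in range(10)]
toy_y = [0] * 8 + [1] * 2

X_over, y_over = random_oversample(toy_X, toy_y)
print(Counter(toy_y), "->", Counter(y_over))
```

No new information is created: the added rows are exact copies of existing fraud rows, which effectively gives the minority class more weight during training.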


In [7]:
rus = RandomUnderSampler(random_state=42)
X_rus, y_rus = rus.fit_resample(X_train_scaled, y_train)

clf_rus = LogisticRegression(max_iter=1000)
clf_rus.fit(X_rus, y_rus)

y_pred_rus = clf_rus.predict(X_test_scaled)
y_proba_rus = clf_rus.predict_proba(X_test_scaled)[:, 1]

print("\nUndersampled Model Report")
print(classification_report(y_test, y_pred_rus))
print("ROC AUC:", roc_auc_score(y_test, y_proba_rus))


Undersampled Model Report
              precision    recall  f1-score   support

         0.0       0.99      0.93      0.96    273779
         1.0       0.58      0.95      0.72     26221

    accuracy                           0.93    300000
   macro avg       0.79      0.94      0.84    300000
weighted avg       0.96      0.93      0.94    300000

ROC AUC: 0.9795681667609486
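
Random undersampling goes the other way: it discards majority rows instead of duplicating minority rows. The plain-Python sketch below shows the idea on toy data; it is not imblearn's implementation, and `random_undersample`, `toy_X` and `toy_y` are hypothetical names.

```python
import random
from collections import Counter

def random_undersample(rows, labels, seed=42):
    """Keep a random subset of each class so that every class has as many
    rows as the smallest one."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = min(counts.values())
    keep = []
    for cls in counts:
        idx = [i for i, l in enumerate(labels) if l == cls]
        # Sampling without replacement: each kept row appears exactly once.
        keep.extend(rng.sample(idx, target))
    keep.sort()
    return [rows[i] for i in keep], [labels[i] for i in keep]

# Toy data: 8 legit rows and 2 fraud rows.
toy_X = [[float(i)] for i in range(10)]
toy_y = [0] * 8 + [1] * 2

X_under, y_under = random_undersample(toy_X, toy_y)
print(Counter(toy_y), "->", Counter(y_under))
```

The balanced set is much smaller than the original, which makes training faster but throws away majority-class data.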


In [8]:
smote = SMOTE(random_state=42)
X_smote, y_smote = smote.fit_resample(X_train_scaled, y_train)

clf_smote = LogisticRegression(max_iter=1000)
clf_smote.fit(X_smote, y_smote)

y_pred_smote = clf_smote.predict(X_test_scaled)
y_proba_smote = clf_smote.predict_proba(X_test_scaled)[:, 1]

print("\nSMOTE Model Report")
print(classification_report(y_test, y_pred_smote))
print("ROC AUC:", roc_auc_score(y_test, y_proba_smote))


SMOTE Model Report
              precision    recall  f1-score   support

         0.0       0.99      0.93      0.96    273779
         1.0       0.58      0.95      0.72     26221

    accuracy                           0.93    300000
   macro avg       0.79      0.94      0.84    300000
weighted avg       0.96      0.93      0.94    300000

ROC AUC: 0.9795823294452995
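
Unlike random oversampling, SMOTE creates new minority points by interpolating between a minority sample and one of its minority-class neighbours. The sketch below shows that single interpolation step in plain Python. It is a simplification: it always uses the one nearest neighbour, whereas imblearn's `SMOTE` picks at random among the `k_neighbors` nearest (5 by default). The helper names are hypothetical.

```python
import random

def nearest_neighbor(x, candidates):
    """Return the candidate closest to x (squared Euclidean distance), excluding x itself."""
    others = [c for c in candidates if c is not x]
    return min(others, key=lambda c: sum((a - b) ** 2 for a, b in zip(x, c)))

def interpolate(x, neighbor, rng):
    """Return a random point on the line segment between x and neighbor."""
    u = rng.random()  # uniform in [0, 1)
    return [xi + u * (ni - xi) for xi, ni in zip(x, neighbor)]

rng = random.Random(42)

# Toy minority-class samples with two features each.
minority = [[1.0, 1.0], [1.5, 2.0], [5.0, 5.0]]

base = minority[0]
neighbor = nearest_neighbor(base, minority)
synthetic = interpolate(base, neighbor, rng)

print("base:", base, "neighbor:", neighbor, "synthetic:", synthetic)
```

The synthetic point lies between two real fraud samples, so the minority class gains new, plausible examples rather than exact duplicates.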


# Conclusions

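**Baseline:** ROC AUC is already high (~0.967), so the model ranks transactions well. At the default threshold, however, fraud recall is only 0.60: 10,402 of the 26,221 fraudulent test transactions are missed. Accuracy (0.96) hides this because ~91% of transactions are legitimate, so precision, recall and F1 for class 1 are the metrics to watch.

**Oversampling / SMOTE:** ROC AUC rises slightly (~0.967 to ~0.980), but the main effect is a shifted decision boundary. Fraud recall jumps from 0.60 to 0.95 while fraud precision falls from 0.90 to 0.58, and accuracy drops from 0.96 to 0.93. F1 for the fraud class stays at 0.72, so these models trade false negatives for false positives rather than improving on both.

**Undersampling:** Results match oversampling and SMOTE to two decimals (ROC AUC ~0.980). The dataset is large, so the training set likely still contains enough legitimate transactions after undersampling that discarding the rest didn't hurt performance.

**Overall:** All three resampling methods give nearly the same result. They barely change how well the model ranks transactions (ROC AUC) but move the precision/recall balance strongly toward recall. In fraud detection, where a missed fraud usually costs more than a false alarm, that trade is often worth making, but it means many more legitimate transactions get flagged.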
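
The precision/recall trade-off can be made concrete with a little arithmetic on the fraud-class rows of the reports above. The sketch below uses the rounded precision and recall values from those reports, so the counts it derives are approximations rather than exact confusion-matrix entries.

```python
def f1_from(precision, recall):
    """F1 is the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Number of fraudulent transactions in the test set (support of class 1.0).
fraud_support = 26221

# (precision, recall) for class 1.0, rounded as printed in the reports.
reported = {
    "baseline": (0.90, 0.60),
    "resampled": (0.58, 0.95),  # same for oversampling, undersampling and SMOTE
}

summary = {}
for name, (precision, recall) in reported.items():
    caught = recall * fraud_support   # approximate true positives
    flagged = caught / precision      # approximate total predicted frauds
    summary[name] = {
        "f1": f1_from(precision, recall),
        "frauds_caught": round(caught),
        "false_alarms": round(flagged - caught),
    }

for name, stats in summary.items():
    print(name, stats)
```

Both settings end up with the same F1, but the resampled models catch far more frauds while raising roughly ten times as many false alarms. Which side of that trade is preferable depends on the relative cost of a missed fraud versus a blocked legitimate transaction.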