# LAB | Imbalanced

**Load the data**

In this challenge, we will work with a credit card fraud dataset.

https://raw.githubusercontent.com/data-bootcamp-v4/data/main/card_transdata.csv

Metadata

- **distance_from_home:** Distance from home to where the transaction happened.
- **distance_from_last_transaction:** Distance from where the last transaction happened.
- **ratio_to_median_purchase_price:** Ratio of the transaction's purchase price to the median purchase price.
- **repeat_retailer:** Whether the transaction happened at the same retailer.
- **used_chip:** Whether the transaction was made with the chip (credit card).
- **used_pin_number:** Whether the transaction was made using a PIN number.
- **online_order:** Whether the transaction was an online order.
- **fraud:** Whether the transaction is fraudulent. **0=legit** - **1=fraud**


In [1]:
#Libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

In [2]:
fraud = pd.read_csv("https://raw.githubusercontent.com/data-bootcamp-v4/data/main/card_transdata.csv")
fraud.head()

Unnamed: 0,distance_from_home,distance_from_last_transaction,ratio_to_median_purchase_price,repeat_retailer,used_chip,used_pin_number,online_order,fraud
0,57.877857,0.31114,1.94594,1.0,1.0,0.0,0.0,0.0
1,10.829943,0.175592,1.294219,1.0,0.0,0.0,0.0,0.0
2,5.091079,0.805153,0.427715,1.0,0.0,0.0,1.0,0.0
3,2.247564,5.600044,0.362663,1.0,1.0,0.0,1.0,0.0
4,44.190936,0.566486,2.222767,1.0,1.0,0.0,1.0,0.0


**Steps:**

- **1.** What is the distribution of our target variable? Can we say we're dealing with an imbalanced dataset?
- **2.** Train a LogisticRegression.
- **3.** Evaluate your model. Take class importance into consideration and evaluate it by selecting the correct metric.
- **4.** Run **Oversample** in order to balance our target variable and repeat the steps above (1-3), now with balanced data. Does it improve the performance of our model?
- **5.** Now, run **Undersample** in order to balance our target variable and repeat the steps above (1-3), now with balanced data. Does it improve the performance of our model?
- **6.** Finally, run **SMOTE** in order to balance our target variable and repeat the steps above (1-3), now with balanced data. Does it improve the performance of our model? 

In [5]:
# STEP 1

print(fraud['fraud'].value_counts())

print(fraud['fraud'].value_counts(normalize=True) * 100)

fraud
0.0    912597
1.0     87403
Name: count, dtype: int64
fraud
0.0    91.2597
1.0     8.7403
Name: proportion, dtype: float64


In [None]:
# The target (fraud) is imbalanced: about 91% of transactions are legit and only about 9% are fraud.
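
The class counts above also show why plain accuracy is a poor yardstick here. The sketch below uses only the printed counts. The "always legit" predictor is a hypothetical baseline introduced for illustration, not something the lab trains.

```python
# Class counts copied from the value_counts() output above.
counts = {0.0: 912597, 1.0: 87403}

total = sum(counts.values())
imbalance_ratio = counts[0.0] / counts[1.0]

# Hypothetical baseline that labels every transaction as legit (class 0).
baseline_accuracy = counts[0.0] / total
baseline_fraud_recall = 0 / counts[1.0]  # it never flags a single fraud

print(f"Legit transactions per fraud: {imbalance_ratio:.1f}")
print(f"Always-legit accuracy:        {baseline_accuracy:.4f}")
print(f"Always-legit fraud recall:    {baseline_fraud_recall:.1f}")
```

A predictor that never detects fraud would still score about 91% accuracy, so the steps below rely on precision, recall and F1 for the fraud class instead.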

In [None]:
# STEP 2

from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

X = fraud.drop('fraud', axis=1)
y = fraud['fraud']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

log_reg = LogisticRegression(random_state=42)
log_reg.fit(X_train_scaled, y_train)

In [9]:
# STEP 3

from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score

y_pred = log_reg.predict(X_test_scaled)

print("Classification Report:")
print(classification_report(y_test, y_pred, target_names=["Not Fraud", "Fraud"]))

print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred))

y_probs = log_reg.predict_proba(X_test_scaled)[:, 1]
roc_auc = roc_auc_score(y_test, y_probs)
print("ROC-AUC Score:", roc_auc)

Classification Report:
              precision    recall  f1-score   support

   Not Fraud       0.96      0.99      0.98    182519
       Fraud       0.90      0.61      0.72     17481

    accuracy                           0.96    200000
   macro avg       0.93      0.80      0.85    200000
weighted avg       0.96      0.96      0.96    200000

Confusion Matrix:
[[181296   1223]
 [  6895  10586]]
ROC-AUC Score: 0.9669773583082968
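
To make the link between the confusion matrix and the classification report explicit, the fraud-class metrics can be recomputed by hand. This is a minimal sketch that assumes scikit-learn's layout for a binary problem (rows are true labels, columns are predictions, so the matrix reads `[[TN, FP], [FN, TP]]`) and uses the numbers printed above.

```python
# Confusion matrix copied from the output above: [[TN, FP], [FN, TP]].
tn, fp = 181296, 1223
fn, tp = 6895, 10586

precision = tp / (tp + fp)   # of the transactions flagged as fraud, how many really were
recall = tp / (tp + fn)      # of the real frauds, how many were flagged
f1 = 2 * precision * recall / (precision + recall)

print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
print(f"missed frauds (FN): {fn}, false alarms (FP): {fp}")
```

The rounded values agree with the "Fraud" row of the report (0.90, 0.61, 0.72). Recall is the weak spot: 6,895 of the 17,481 frauds in the test set go undetected.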


In [11]:
# STEP 4

from imblearn.over_sampling import RandomOverSampler

ros = RandomOverSampler(random_state=42)

X_resampled, y_resampled = ros.fit_resample(X_train_scaled, y_train)

from collections import Counter
print("Before oversampling:", Counter(y_train))
print("After oversampling:", Counter(y_resampled))

Before oversampling: Counter({0.0: 730078, 1.0: 69922})
After oversampling: Counter({0.0: 730078, 1.0: 730078})
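
The idea behind random oversampling can be sketched in a few lines of plain Python: minority rows are drawn again, with replacement, until every class matches the majority count. The function `random_oversample` and the toy data are hypothetical and only illustrate the idea. They are not imbalanced-learn's implementation.

```python
import random
from collections import Counter


def random_oversample(rows, labels, seed=42):
    """Toy version: duplicate minority rows until all classes match the majority count."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())

    out_rows, out_labels = list(rows), list(labels)
    for cls, n in counts.items():
        pool = [r for r, l in zip(rows, labels) if l == cls]
        extra = rng.choices(pool, k=target - n)  # sampling with replacement
        out_rows.extend(extra)
        out_labels.extend([cls] * len(extra))
    return out_rows, out_labels


# Toy data: five legit rows and a single fraud row.
rows = [[0.1], [0.2], [0.3], [0.4], [0.5], [9.0]]
labels = [0, 0, 0, 0, 0, 1]

X_over, y_over = random_oversample(rows, labels)
print(Counter(y_over))
```

Every added row is an exact copy of an existing fraud row, so no new information enters the training set. Only the class weighting seen by the model changes.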


In [12]:
log_reg_bal = LogisticRegression(random_state=42)
log_reg_bal.fit(X_resampled, y_resampled)

In [None]:
y_pred_bal = log_reg_bal.predict(X_test_scaled)

print("Classification Report (Balanced Data):")
print(classification_report(y_test, y_pred_bal, target_names=["Not Fraud", "Fraud"]))

print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred_bal))

y_probs_bal = log_reg_bal.predict_proba(X_test_scaled)[:, 1]
roc_auc_bal = roc_auc_score(y_test, y_probs_bal)
print("ROC-AUC Score:", roc_auc_bal)

Classification Report (Balanced Data):
              precision    recall  f1-score   support

   Not Fraud       0.99      0.93      0.96    182519
       Fraud       0.58      0.95      0.72     17481

    accuracy                           0.93    200000
   macro avg       0.79      0.94      0.84    200000
weighted avg       0.96      0.93      0.94    200000

Confusion Matrix:
[[170390  12129]
 [   911  16570]]
ROC-AUC Score: 0.9795250850098041


In [None]:
# Conclusion:

# Balancing the training data raised recall for fraud from 61% to 95%.
# Precision dropped from 90% to 58%, so the F1-score for fraud stayed at 0.72.
# Oversampling improved fraud detection (missed frauds: 6895 -> 911), but there were more false positives (1223 -> 12129).
# If a missed fraud costs more than a false alarm, oversampling made the model more effective in this context.

In [14]:
# STEP 5

from imblearn.under_sampling import RandomUnderSampler
from collections import Counter

rus = RandomUnderSampler(random_state=42)
X_rus, y_rus = rus.fit_resample(X_train_scaled, y_train)

print("Before undersampling:", Counter(y_train))
print("After undersampling:", Counter(y_rus))

Before undersampling: Counter({0.0: 730078, 1.0: 69922})
After undersampling: Counter({0.0: 69922, 1.0: 69922})
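
Random undersampling works the other way around: majority rows are discarded at random until every class is as small as the minority class. The sketch below is a toy illustration. The function `random_undersample` and the data are hypothetical, and this is not imbalanced-learn's implementation.

```python
import random
from collections import Counter


def random_undersample(rows, labels, seed=42):
    """Toy version: keep a random subset of each class, sized to the smallest class."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = min(counts.values())

    keep = []
    for cls in counts:
        idx = [i for i, l in enumerate(labels) if l == cls]
        keep.extend(rng.sample(idx, target))  # sampling without replacement
    keep.sort()  # preserve the original row order
    return [rows[i] for i in keep], [labels[i] for i in keep]


# Toy data: five legit rows and two fraud rows.
rows = [[0.1], [0.2], [0.3], [0.4], [0.5], [9.0], [8.0]]
labels = [0, 0, 0, 0, 0, 1, 1]

X_under, y_under = random_undersample(rows, labels)
print(Counter(y_under))
```

All minority rows survive, while most majority rows are thrown away. That is the price of the smaller training set.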


In [15]:
log_reg_rus = LogisticRegression(random_state=42)
log_reg_rus.fit(X_rus, y_rus)

In [16]:
y_pred_rus = log_reg_rus.predict(X_test_scaled)

print("Classification Report (Undersampled Data):")
print(classification_report(y_test, y_pred_rus, target_names=["Not Fraud", "Fraud"]))

print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred_rus))

y_probs_rus = log_reg_rus.predict_proba(X_test_scaled)[:, 1]
roc_auc_rus = roc_auc_score(y_test, y_probs_rus)
print("ROC-AUC Score:", roc_auc_rus)

Classification Report (Undersampled Data):
              precision    recall  f1-score   support

   Not Fraud       0.99      0.93      0.96    182519
       Fraud       0.58      0.95      0.72     17481

    accuracy                           0.93    200000
   macro avg       0.79      0.94      0.84    200000
weighted avg       0.96      0.93      0.94    200000

Confusion Matrix:
[[170394  12125]
 [   918  16563]]
ROC-AUC Score: 0.9795581555971167


In [None]:
# Conclusion:

# Imbalanced:
    # High precision (90%), but many missed frauds (recall 61%).
# Oversampled & Undersampled:
    # Improved recall in both cases (95%).
    # Lower precision (58%): more false positives, which may be an acceptable trade-off.

# Undersampling performed just as well as oversampling.
# Since it trains on far fewer rows (139,844 vs. 1,460,156), it should also train faster (not timed here).
# On this dataset, undersampling is an effective alternative to oversampling.
# Compared to the imbalanced model, both balancing techniques substantially improved the model's fraud recall.

In [None]:
# STEP 6

from imblearn.over_sampling import SMOTE

smote = SMOTE(random_state=42)

X_smote, y_smote = smote.fit_resample(X_train_scaled, y_train)

print("Before SMOTE:", Counter(y_train))
print("After SMOTE:", Counter(y_smote))

Before SMOTE: Counter({0.0: 730078, 1.0: 69922})
After SMOTE: Counter({0.0: 730078, 1.0: 730078})
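
Unlike random oversampling, SMOTE does not copy rows. It creates synthetic minority samples by interpolating between a minority point and one of its nearest minority neighbours. The sketch below shows only that interpolation step. The helper `smote_like_sample` and the two points are hypothetical, and the neighbour search that SMOTE performs (imbalanced-learn defaults to `k_neighbors=5`) is left out.

```python
import random


def smote_like_sample(point, neighbor, rng):
    """Return a synthetic point on the line segment between two minority samples."""
    gap = rng.random()  # random position in [0, 1) along the segment
    return [p + gap * (n - p) for p, n in zip(point, neighbor)]


rng = random.Random(42)

# Two toy fraud rows (already scaled); in SMOTE the second would be a nearest neighbour.
minority_a = [1.0, 2.0]
minority_b = [3.0, 6.0]

synthetic = smote_like_sample(minority_a, minority_b, rng)
print(synthetic)
```

The new row lies between the two originals in feature space, so the fraud class gets denser without duplicating any existing row.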


In [18]:
log_reg_smote = LogisticRegression(random_state=42)
log_reg_smote.fit(X_smote, y_smote)

In [19]:
y_pred_smote = log_reg_smote.predict(X_test_scaled)

print("Classification Report (SMOTE Data):")
print(classification_report(y_test, y_pred_smote, target_names=["Not Fraud", "Fraud"]))

print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred_smote))

y_probs_smote = log_reg_smote.predict_proba(X_test_scaled)[:, 1]
roc_auc_smote = roc_auc_score(y_test, y_probs_smote)
print("ROC-AUC Score:", roc_auc_smote)

Classification Report (SMOTE Data):
              precision    recall  f1-score   support

   Not Fraud       0.99      0.93      0.96    182519
       Fraud       0.58      0.95      0.72     17481

    accuracy                           0.93    200000
   macro avg       0.79      0.94      0.84    200000
weighted avg       0.96      0.93      0.94    200000

Confusion Matrix:
[[170386  12133]
 [   907  16574]]
ROC-AUC Score: 0.9795260633479468


In [None]:
# Conclusion:

# SMOTE matched the performance of random oversampling and undersampling:
    # Same recall for fraud (95%).
    # Same precision (58%), lower than the imbalanced model's 90%.
    # ROC-AUC is practically identical (about 0.98) across the three balanced models.

# The drop in precision is expected: training on balanced classes makes the model
# flag more transactions as fraud, which raises recall but also adds false positives.
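
The four test-set confusion matrices printed in this notebook can be put side by side to quantify the trade-off. This is a small sketch using only those printed numbers. The "extra" columns are measured against the imbalanced model, and the variable names are illustrative.

```python
# Test-set confusion matrices copied from the outputs above: [[TN, FP], [FN, TP]].
matrices = {
    "imbalanced":   [[181296, 1223], [6895, 10586]],
    "oversampled":  [[170390, 12129], [911, 16570]],
    "undersampled": [[170394, 12125], [918, 16563]],
    "smote":        [[170386, 12133], [907, 16574]],
}

# Errors of the imbalanced model serve as the reference point.
(_, base_fp), (base_fn, _) = matrices["imbalanced"]

summary = {}
for name, ((tn, fp), (fn, tp)) in matrices.items():
    summary[name] = {
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
        "extra_frauds_caught": base_fn - fn,
        "extra_false_alarms": fp - base_fp,
    }

for name, row in summary.items():
    print(
        f"{name:<13} precision={row['precision']:.2f} recall={row['recall']:.2f} "
        f"extra caught={row['extra_frauds_caught']:>5} "
        f"extra false alarms={row['extra_false_alarms']:>5}"
    )
```

On these numbers, each balanced model catches roughly 6,000 more frauds and raises roughly 10,900 more false alarms, about 1.8 extra false alarms per extra fraud caught. Whether that is worthwhile depends on the relative cost of a missed fraud and a false alarm, which the dataset does not provide.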