# LAB | Imbalanced

**Load the data**

In this challenge, we will be working with a credit card fraud dataset.

https://raw.githubusercontent.com/data-bootcamp-v4/data/main/card_transdata.csv

Metadata

- **distance_from_home:** the distance from home to where the transaction happened.
- **distance_from_last_transaction:** the distance from where the last transaction happened.
- **ratio_to_median_purchase_price:** the ratio of the transaction's purchase price to the median purchase price.
- **repeat_retailer:** whether the transaction happened at the same retailer as before.
- **used_chip:** whether the transaction was made with the chip (credit card).
- **used_pin_number:** whether the transaction was made using a PIN number.
- **online_order:** whether the transaction was an online order.
- **fraud:** whether the transaction is fraudulent. **0=legit** - **1=fraud**


In [90]:
#Libraries
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

from sklearn.metrics import precision_score, recall_score, classification_report, confusion_matrix, f1_score
from sklearn.preprocessing import StandardScaler

from sklearn.utils import resample
from imblearn.over_sampling import SMOTE

In [4]:
fraud = pd.read_csv("https://raw.githubusercontent.com/data-bootcamp-v4/data/main/card_transdata.csv")
fraud.head()

Unnamed: 0,distance_from_home,distance_from_last_transaction,ratio_to_median_purchase_price,repeat_retailer,used_chip,used_pin_number,online_order,fraud
0,57.877857,0.31114,1.94594,1.0,1.0,0.0,0.0,0.0
1,10.829943,0.175592,1.294219,1.0,0.0,0.0,0.0,0.0
2,5.091079,0.805153,0.427715,1.0,0.0,0.0,1.0,0.0
3,2.247564,5.600044,0.362663,1.0,1.0,0.0,1.0,0.0
4,44.190936,0.566486,2.222767,1.0,1.0,0.0,1.0,0.0


**Steps:**

- **1.** What is the distribution of our target variable? Can we say we're dealing with an imbalanced dataset?
- **2.** Train a LogisticRegression.
- **3.** Evaluate your model. Take class importance into consideration and evaluate the model by selecting the correct metric.
- **4.** Run **Oversample** in order to balance our target variable and repeat the steps above, now with balanced data. Does it improve the performance of our model? 
- **5.** Now, run **Undersample** in order to balance our target variable and repeat the steps above (1-3), now with balanced data. Does it improve the performance of our model?
- **6.** Finally, run **SMOTE** in order to balance our target variable and repeat the steps above (1-3), now with balanced data. Does it improve the performance of our model? 

In [7]:
fraud["fraud"].value_counts()

fraud
0.0    912597
1.0     87403
Name: count, dtype: int64

In [None]:
#1. The target variable fraud is imbalanced: 87403 of 1000000 transactions (about 8.7%) are fraud and 912597 are legit.
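
To put a number on the imbalance, the two counts printed by `value_counts()` can be turned into a fraud share and a legit-to-fraud ratio. This is a minimal sketch that hard-codes those counts instead of reloading the CSV; the variable names are illustrative.

```python
# Counts copied from the value_counts() output above
legit_count = 912597
fraud_count = 87403

total = legit_count + fraud_count
fraud_share = fraud_count / total            # fraction of transactions that are fraud
imbalance_ratio = legit_count / fraud_count  # legit transactions per fraud transaction

print(f"total rows:      {total}")
print(f"fraud share:     {fraud_share:.2%}")
print(f"legit per fraud: {imbalance_ratio:.1f}")
```

On the loaded DataFrame, `fraud["fraud"].value_counts(normalize=True)` should give the same proportions directly.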

In [19]:
#2. Train a LogisticRegression

#Separate the features from the target
features = fraud.drop(columns=["fraud"])
target = fraud["fraud"]

#Default split: 75% train, 25% test
X_train, X_test, y_train, y_test = train_test_split(features, target)

#Fit the scaler on the training data only, then apply it to both splits
scaler = StandardScaler()
scaler.fit(X_train)

X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

log_reg = LogisticRegression()
log_reg.fit(X_train_scaled, y_train)

In [21]:
log_reg.score(X_test_scaled, y_test)

0.958696
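
An accuracy of 0.958696 needs a baseline before it can be judged. As a rough check, assuming the full-dataset class counts shown earlier, a model that always predicts legit (0) would already score about 0.91 without catching a single fraud. The helper name below is illustrative.

```python
def majority_baseline_accuracy(class_counts):
    """Accuracy of a model that always predicts the most frequent class."""
    return max(class_counts.values()) / sum(class_counts.values())

# Class counts from the full dataset (value_counts() output)
counts = {0.0: 912597, 1.0: 87403}

baseline = majority_baseline_accuracy(counts)
model_accuracy = 0.958696  # log_reg.score(...) result reported above

print(f"always-legit baseline: {baseline:.4f}")
print(f"logistic regression:   {model_accuracy:.4f}")
print(f"gain over baseline:    {model_accuracy - baseline:.4f}")
```

The baseline on the actual test split will differ slightly because the split is random, but the gap stays small, which is why accuracy alone says little about fraud detection here.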

In [23]:
#3. Evaluate the model
pred = log_reg.predict(X_test_scaled)
print(classification_report(y_pred = pred, y_true = y_test))

              precision    recall  f1-score   support

         0.0       0.96      0.99      0.98    228177
         1.0       0.89      0.60      0.72     21823

    accuracy                           0.96    250000
   macro avg       0.93      0.80      0.85    250000
weighted avg       0.96      0.96      0.95    250000



In [None]:
#Precision for class 1.0 is 0.89: out of all transactions predicted as fraud, 89% were actually fraud.
#High precision means few false positives (legit transactions flagged as fraud).

#Recall for class 1.0 is 0.60: the model identified only 60% of all actual fraud cases.
#In fraud detection, recall is the critical metric because missed fraudulent transactions (false negatives) are costly.
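
The precision, recall and F1 figures in the report follow directly from confusion-matrix counts. The sketch below uses hypothetical counts (not taken from this notebook's model) chosen so that the rounded results resemble the fraud row of the report; `precision_recall_f1` is an illustrative helper, not a scikit-learn function.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1 for the positive class from raw counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Hypothetical counts for the fraud class
tp = 600  # fraud correctly flagged
fp = 75   # legit transactions wrongly flagged as fraud
fn = 400  # fraud the model missed

precision, recall, f1 = precision_recall_f1(tp, fp, fn)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

In the notebook itself, `confusion_matrix(y_test, pred)` returns the real counts, laid out as `[[TN, FP], [FN, TP]]` for the labels 0 and 1.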

In [27]:
#4. OVERSAMPLE

fraud = pd.DataFrame(X_train_scaled, columns = X_train.columns)
fraud["fraud"] = y_train.values
fraud["fraud"].value_counts()

fraud
0.0    684420
1.0     65580
Name: count, dtype: int64

In [45]:
#Split the training data by class
fraud_df = fraud[fraud["fraud"] == 1]
no_fraud_df = fraud[fraud["fraud"] == 0]
#Resample the fraud rows with replacement until they match the number of legit rows
fraud_df_oversampled = resample(fraud_df, n_samples=len(no_fraud_df), replace=True, random_state=0)
oversampled = pd.concat([fraud_df_oversampled, no_fraud_df])

In [47]:
features_over = oversampled.drop(columns = ["fraud"])
target_over = oversampled["fraud"]

X_train, X_test, y_train, y_test = train_test_split(features_over, target_over)

scaler = StandardScaler()
scaler.fit(X_train)

X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

In [49]:
log_reg_over = LogisticRegression()
log_reg_over.fit(X_train_scaled, y_train)

In [53]:
pred_over = log_reg_over.predict(X_test_scaled)
print(classification_report(y_pred = pred_over, y_true = y_test))

              precision    recall  f1-score   support

         0.0       0.95      0.93      0.94    170935
         1.0       0.93      0.95      0.94    171275

    accuracy                           0.94    342210
   macro avg       0.94      0.94      0.94    342210
weighted avg       0.94      0.94      0.94    342210



In [55]:
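#The oversampled model scores much higher on this test split:
#Precision for class 1.0 is 0.93 instead of 0.89: out of all transactions predicted as fraud, 93% were actually fraud.
#Recall for class 1.0 is 0.95 instead of 0.60: the model identified 95% of the fraud cases in this test split.
#Caveat: this test split was drawn from the oversampled data, so it is balanced (171275 fraud vs 170935 legit)
#and contains copies of fraud rows that also appear in the training split.
#The scores are therefore not directly comparable with step 3, which was evaluated on the original 250000-row test set.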
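
One way to compare resampling against the baseline on equal terms is to split first, oversample only the training split, and score on the untouched test split. The sketch below illustrates that order of operations on small synthetic data (an assumption made to keep it self-contained; it is not the credit card dataset), so its printed scores say nothing about this notebook's results.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Synthetic stand-in data (assumption): about 10% positives, one informative feature
rng = np.random.default_rng(0)
n = 5000
y = (rng.random(n) < 0.1).astype(int)
X = rng.normal(size=(n, 3))
X[:, 0] += 2.0 * y  # shift feature 0 for the positive class

# 1) Split first, so the test set keeps the original class balance
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0
)

# 2) Oversample the minority class in the training split only
X_min, X_maj = X_train[y_train == 1], X_train[y_train == 0]
X_min_over = resample(X_min, replace=True, n_samples=len(X_maj), random_state=0)
X_train_bal = np.vstack([X_maj, X_min_over])
y_train_bal = np.concatenate([
    np.zeros(len(X_maj), dtype=int),
    np.ones(len(X_min_over), dtype=int),
])

# 3) Fit on the balanced training data, evaluate on the untouched test split
model = LogisticRegression().fit(X_train_bal, y_train_bal)
test_recall = recall_score(y_test, model.predict(X_test))

print("train class counts:", np.bincount(y_train_bal))
print("test class counts: ", np.bincount(y_test))
print(f"positive-class recall on untouched test set: {test_recall:.2f}")
```

With this ordering the test set still has the original class imbalance, so precision and recall can be compared directly with a model trained on the unbalanced data.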

In [76]:
#5. UNDERSAMPLE

fraud = pd.DataFrame(X_train_scaled, columns = X_train.columns)
fraud["fraud"] = y_train.values

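fraud_df = fraud[fraud["fraud"] == 1]
no_fraud_df = fraud[fraud["fraud"] == 0]
#Draw legit rows without replacement (resample defaults to replace=True) until they match the number of fraud rows
no_fraud_df_undersampled = resample(no_fraud_df, n_samples=len(fraud_df), replace=False, random_state=0)
undersampled = pd.concat([no_fraud_df_undersampled, fraud_df])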

In [80]:
features_under = undersampled.drop(columns = ["fraud"])
target_under = undersampled["fraud"]

X_train, X_test, y_train, y_test = train_test_split(features_under, target_under)

scaler = StandardScaler()
scaler.fit(X_train)

X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

In [82]:
log_reg_under = LogisticRegression()
log_reg_under.fit(X_train_scaled, y_train)

In [88]:
pred_under = log_reg_under.predict(X_test_scaled)
print(classification_report(y_pred = pred_under, y_true = y_test))

              precision    recall  f1-score   support

         0.0       0.95      0.93      0.94    128074
         1.0       0.93      0.95      0.94    128499

    accuracy                           0.94    256573
   macro avg       0.94      0.94      0.94    256573
weighted avg       0.94      0.94      0.94    256573



In [None]:
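#The undersampled model scores the same as the oversampled one in terms of precision and recall.
#This is expected here: fraud was rebuilt from the oversampled training split of step 4 (X_train_scaled, y_train),
#so the classes were already balanced and undersampling removed almost nothing.
#The test split confirms it: 128499 fraud rows, more than the 87403 fraud rows in the whole original dataset.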
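
When undersampling, it matters whether rows are drawn with or without replacement: `sklearn.utils.resample` samples with replacement unless `replace=False` is passed, which can duplicate some majority rows while dropping others. A small sketch on stand-in row ids (made-up sizes, not the notebook's data) shows the difference.

```python
import numpy as np
from sklearn.utils import resample

majority_ids = np.arange(1000)  # stand-in row ids for the legit class
minority_size = 100             # stand-in number of fraud rows

# Default behaviour: replace=True, so the same row can be drawn more than once
with_repl = resample(majority_ids, n_samples=minority_size, random_state=0)

# Undersampling proper: every selected row is distinct
without_repl = resample(
    majority_ids, n_samples=minority_size, replace=False, random_state=0
)

print("unique rows, with replacement:   ", len(np.unique(with_repl)))
print("unique rows, without replacement:", len(np.unique(without_repl)))
```

Without replacement, the number of unique rows always equals `n_samples`; with replacement it can be lower.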

In [92]:
#6. SMOTE

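#sampling_strategy=1.0 asks for a 1:1 minority-to-majority ratio after resampling
sm = SMOTE(sampling_strategy=1.0, random_state=0)
X_train_sm, y_train_sm = sm.fit_resample(X_train_scaled, y_train)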

In [93]:
log_reg_sm = LogisticRegression(max_iter=1000)
log_reg_sm.fit(X_train_sm, y_train_sm)

In [98]:
pred_sm = log_reg_sm.predict(X_test_scaled)
print(classification_report(y_pred = pred_sm, y_true = y_test))

              precision    recall  f1-score   support

         0.0       0.95      0.93      0.94    128074
         1.0       0.93      0.95      0.94    128499

    accuracy                           0.94    256573
   macro avg       0.94      0.94      0.94    256573
weighted avg       0.94      0.94      0.94    256573



In [None]:
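#SMOTE scores the same as the oversampled and undersampled models in terms of precision and recall.
#The training data passed to SMOTE came from step 5 and was already close to balanced, so SMOTE had very few synthetic samples to add,
#and the test split is the same one used in step 5 (same support: 128074 legit, 128499 fraud).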
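
For intuition on how SMOTE differs from plain oversampling: rather than copying minority rows, it creates new points by interpolating between a minority sample and one of its nearest minority-class neighbours. The sketch below shows only that interpolation step with made-up values; it is a simplified illustration, not imblearn's implementation, which also performs the nearest-neighbour search. `smote_like_point` is a hypothetical helper.

```python
import random

def smote_like_point(x, neighbor, rng):
    """Return a synthetic point on the segment between x and a minority neighbour."""
    u = rng.random()  # interpolation factor in [0, 1)
    return [xi + u * (ni - xi) for xi, ni in zip(x, neighbor)]

rng = random.Random(0)
x = [1.0, 2.0]         # a minority-class sample (made-up values)
neighbor = [3.0, 6.0]  # one of its nearest minority-class neighbours (made-up)

synthetic = smote_like_point(x, neighbor, rng)
print(synthetic)
```

Every synthetic point lies on the straight line between the two originals, so the minority class gains new, plausible samples instead of exact duplicates.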