# LAB | Imbalanced

**Load the data**

In this challenge, we will be working with a credit card fraud dataset.

https://raw.githubusercontent.com/data-bootcamp-v4/data/main/card_transdata.csv

Metadata

- **distance_from_home:** the distance from home to where the transaction happened.
- **distance_from_last_transaction:** the distance from where the last transaction happened.
- **ratio_to_median_purchase_price:** ratio of the transaction's purchase price to the median purchase price.
- **repeat_retailer:** whether the transaction happened at the same retailer.
- **used_chip:** whether the transaction used the chip (credit card).
- **used_pin_number:** whether the transaction used a PIN number.
- **online_order:** whether the transaction is an online order.
- **fraud:** whether the transaction is fraudulent. **0=legit** - **1=fraud**


In [1]:
#Libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

import matplotlib.pyplot as plt

from sklearn.linear_model import LogisticRegression

from sklearn.metrics import precision_score, recall_score, classification_report, confusion_matrix, f1_score
from sklearn.preprocessing import StandardScaler

from sklearn.utils import resample

from imblearn.over_sampling import SMOTE

In [None]:
#fraud = pd.read_csv("https://raw.githubusercontent.com/data-bootcamp-v4/data/main/card_transdata.csv")
#fraud.to_csv("data/fraud.csv", index=False)
fraud = pd.read_csv("data/fraud.csv")
fraud.head()

In [None]:
fraud.info()  # info() prints its summary and returns None, so it needs no display()
display(fraud.describe())
display(fraud['fraud'].value_counts())

In [4]:
# 1. What is the distribution of our target variable? Can we say we're dealing with an imbalanced dataset?
# Yes. The value_counts() output above shows far fewer fraud rows (1) than legitimate rows (0), so the target is imbalanced.
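Relative shares make the imbalance easier to read than raw counts. The sketch below is a minimal illustration on a toy series, not on the real `fraud` DataFrame; the helper name `class_distribution` and the 20% cut-off are assumptions made for this example only.

```python
import pandas as pd


def class_distribution(target: pd.Series) -> pd.DataFrame:
    """Return absolute counts and relative shares for each class label."""
    counts = target.value_counts().sort_index()
    shares = target.value_counts(normalize=True).sort_index()
    return pd.DataFrame({"count": counts, "share": shares})


# Toy stand-in for fraud["fraud"]: 9 legitimate rows and 1 fraudulent row.
toy_target = pd.Series([0.0] * 9 + [1.0], name="fraud")
summary = class_distribution(toy_target)
print(summary)

# Hypothetical rule of thumb for this sketch: call the target imbalanced
# when the minority class holds less than 20% of the rows.
is_imbalanced = summary["share"].min() < 0.20
print("imbalanced:", is_imbalanced)
```

On the lab data, the same helper could be called as `class_distribution(fraud["fraud"])`.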

In [None]:
# 2. Train a LogisticRegression.

features = fraud.drop(columns = ["fraud"])
target = fraud["fraud"]

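# Stratify so both splits keep the same fraud share; fix the seed for reproducibility
X_train, X_test, y_train, y_test = train_test_split(features, target, stratify = target, random_state = 0)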

scaler = StandardScaler()
scaler.fit(X_train)

X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

log_reg = LogisticRegression()

log_reg.fit(X_train_scaled, y_train)

In [6]:
# 3. Evaluate your model. Take class importance into consideration, and evaluate it by selecting the correct metric.

In [None]:
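# score() returns accuracy, which is dominated by the majority (legit) class
print(log_reg.score(X_test_scaled, y_test))
pred = log_reg.predict(X_test_scaled)
# The per-class precision and recall are the metrics that matter here
print(classification_report(y_pred = pred, y_true = y_test))

# Low recall for the fraud class despite high precision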
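To see why accuracy alone can hide a low fraud recall, it may help to work the arithmetic on a confusion matrix by hand. The counts below are made up for illustration and are not results from this dataset; `fraud_metrics` is a hypothetical helper, not part of scikit-learn.

```python
def fraud_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Compute accuracy plus precision, recall and F1 for the positive (fraud) class."""
    total = tp + fp + fn + tn
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Made-up scenario: 1,000 transactions, 100 of them fraudulent.
# The model catches 60 frauds, misses 40, and raises 5 false alarms.
metrics = fraud_metrics(tp=60, fp=5, fn=40, tn=895)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

In this toy scenario accuracy is 0.955 while recall is only 0.600: 40% of the frauds slip through even though almost every prediction is "correct".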

In [8]:
# 4. Run oversampling in order to balance our target variable and repeat the steps above, now with balanced data. Does it improve the performance of our model?

In [None]:
train = pd.DataFrame(X_train_scaled, columns = X_train.columns)

train["fraud"] = y_train.values

frauds = train[train["fraud"] == 1]
clean = train[train["fraud"] == 0]

frauds_oversampled = resample(frauds,
                              replace = True,
                              n_samples = len(clean),
                              random_state = 0)

train_over = pd.concat([frauds_oversampled, clean])

X_train_over = train_over.drop(columns = ["fraud"])
y_train_over = train_over["fraud"]

log_reg = LogisticRegression()
log_reg.fit(X_train_over, y_train_over)


In [None]:
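# Report on the oversampled training data (optimistic: the model was fitted on these rows)
pred = log_reg.predict(X_train_over)
print(classification_report(y_pred = pred, y_true = y_train_over))

# Report on the untouched test set, which is the one comparable with step 3
X_test_df = pd.DataFrame(X_test_scaled, columns = X_train.columns)
pred_test = log_reg.predict(X_test_df)
print(classification_report(y_pred = pred_test, y_true = y_test))

In [11]:
# On the oversampled training data, precision and recall are now balanced.
# The test-set report is the one to compare with step 3.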

In [12]:
# 5. Now, run undersampling in order to balance our target variable and repeat the steps above (1-3), now with balanced data. Does it improve the performance of our model?

In [None]:
clean_undersampled = resample(clean,
                              replace = False,
                              n_samples = len(frauds),
                              random_state = 0)

train_under = pd.concat([clean_undersampled, frauds])

X_train_under = train_under.drop(columns = ["fraud"])
y_train_under = train_under["fraud"]

log_reg = LogisticRegression()
log_reg.fit(X_train_under, y_train_under)

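# Report on the undersampled training data (optimistic: the model was fitted on these rows)
pred = log_reg.predict(X_train_under)
print(classification_report(y_pred = pred, y_true = y_train_under))

# Report on the untouched test set, which is the one comparable with step 3
X_test_df = pd.DataFrame(X_test_scaled, columns = X_train.columns)
pred_test = log_reg.predict(X_test_df)
print(classification_report(y_pred = pred_test, y_true = y_test))

In [14]:
# On the resampled training data, precision and recall are balanced but lower than with oversampling, especially for the clean class.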
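Steps 4 and 5 follow the same pattern: split the training frame by class, resample one side, and concatenate. The sketch below folds both into one function so the resulting class counts can be checked directly. The helper name `rebalance` and its `strategy` argument are assumptions for this example; it runs on a toy frame rather than the real training data.

```python
import pandas as pd
from sklearn.utils import resample


def rebalance(train: pd.DataFrame, target_col: str, strategy: str, seed: int = 0) -> pd.DataFrame:
    """Balance a binary training frame by random over- or undersampling."""
    counts = train[target_col].value_counts()
    minority = train[train[target_col] == counts.idxmin()]
    majority = train[train[target_col] == counts.idxmax()]

    if strategy == "over":
        # Draw minority rows with replacement until they match the majority size.
        minority = resample(minority, replace=True, n_samples=len(majority), random_state=seed)
    elif strategy == "under":
        # Draw majority rows without replacement down to the minority size.
        majority = resample(majority, replace=False, n_samples=len(minority), random_state=seed)
    else:
        raise ValueError("strategy must be 'over' or 'under'")

    return pd.concat([majority, minority])


# Toy training frame: 8 legitimate rows and 2 fraudulent rows.
toy_train = pd.DataFrame({"distance_from_home": range(10), "fraud": [0] * 8 + [1] * 2})

over = rebalance(toy_train, "fraud", "over")
under = rebalance(toy_train, "fraud", "under")
print(over["fraud"].value_counts().to_dict())
print(under["fraud"].value_counts().to_dict())
```

Only the training rows should go through a helper like this; the test set keeps its original class distribution so that the evaluation reflects real conditions.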

In [15]:
# 6. Finally, run SMOTE in order to balance our target variable and repeat the steps above (1-3), now with balanced data. Does it improve the performance of our model?

In [None]:
sm = SMOTE(random_state = 1, sampling_strategy = 1.0)

X_train_sm,y_train_sm = sm.fit_resample(X_train_scaled,y_train)

log_reg = LogisticRegression(max_iter=1000)
log_reg.fit(X_train_sm, y_train_sm)

In [None]:
pred = log_reg.predict(X_test_scaled)
print(classification_report(y_pred = pred, y_true = y_test))

In [None]:
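# Precision for the fraud class is lower than with oversampling, but recall is similar.
# This report is on the test set; reports computed on resampled training data are optimistic, so compare test-set numbers only.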
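A like-for-like comparison needs every strategy scored on the same untouched test set. The sketch below shows one possible way to collect the fraud-class metrics in a single table. It uses synthetic data rather than the fraud dataset, and the function name `compare_on_test` and the dictionary layout are assumptions for this example.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, precision_score, recall_score


def compare_on_test(train_sets, X_test, y_test):
    """Fit one LogisticRegression per training set and score each on the same test set."""
    rows = []
    for name, (X_tr, y_tr) in train_sets.items():
        model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
        pred = model.predict(X_test)
        rows.append({
            "strategy": name,
            "precision": precision_score(y_test, pred, zero_division=0),
            "recall": recall_score(y_test, pred, zero_division=0),
            "f1": f1_score(y_test, pred, zero_division=0),
        })
    return pd.DataFrame(rows).set_index("strategy")


# Synthetic stand-in data (NOT the fraud dataset): a rare positive class
# driven mostly by the first feature.
rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 3))
y = (X[:, 0] + 0.5 * rng.normal(size=2000) > 1.5).astype(int)
X_tr, X_te, y_tr, y_te = X[:1500], X[1500:], y[:1500], y[1500:]

# Random oversampling of the minority class, applied to the training rows only.
minority_idx = np.flatnonzero(y_tr == 1)
majority_idx = np.flatnonzero(y_tr == 0)
extra_idx = rng.choice(minority_idx, size=len(majority_idx), replace=True)
over_idx = np.concatenate([majority_idx, extra_idx])

results = compare_on_test(
    {"original": (X_tr, y_tr), "oversampled": (X_tr[over_idx], y_tr[over_idx])},
    X_te,
    y_te,
)
print(results.round(3))
```

In the notebook, the dictionary could hold the pairs built in the cells above, for example `(X_train_scaled, y_train)` and `(X_train_sm, y_train_sm)`, with `X_test_scaled` and `y_test` as the shared test set. Training sets stored as DataFrames would need the test features wrapped in a DataFrame with the same columns.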

**Steps:**

- **1.** What is the distribution of our target variable? Can we say we're dealing with an imbalanced dataset?
- **2.** Train a LogisticRegression.
- **3.** Evaluate your model. Take class importance into consideration, and evaluate it by selecting the correct metric.
- **4.** Run **Oversample** in order to balance our target variable and repeat the steps above, now with balanced data. Does it improve the performance of our model? 
- **5.** Now, run **Undersample** in order to balance our target variable and repeat the steps above (1-3), now with balanced data. Does it improve the performance of our model?
- **6.** Finally, run **SMOTE** in order to balance our target variable and repeat the steps above (1-3), now with balanced data. Does it improve the performance of our model? 