# LAB | Imbalanced

**Load the data**

In this challenge, we will be working with the Credit Card Fraud dataset.

https://raw.githubusercontent.com/data-bootcamp-v4/data/main/card_transdata.csv

Metadata

- **distance_from_home:** the distance from home to where the transaction happened.
- **distance_from_last_transaction:** the distance from where the last transaction happened.
- **ratio_to_median_purchase_price:** the ratio of the transaction's purchase price to the median purchase price.
- **repeat_retailer:** whether the transaction happened at the same retailer.
- **used_chip:** whether the transaction was made with the chip (credit card).
- **used_pin_number:** whether the transaction was made using a PIN.
- **online_order:** whether the transaction is an online order.
- **fraud:** whether the transaction is fraudulent. **0=legit** - **1=fraud**


In [90]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

from sklearn.metrics import precision_score, recall_score, classification_report, confusion_matrix, f1_score
from sklearn.preprocessing import StandardScaler

from sklearn.utils import resample

In [91]:
df = pd.read_csv("https://raw.githubusercontent.com/data-bootcamp-v4/data/main/card_transdata.csv")
df.head()

| | distance_from_home | distance_from_last_transaction | ratio_to_median_purchase_price | repeat_retailer | used_chip | used_pin_number | online_order | fraud |
|---|---|---|---|---|---|---|---|---|
| 0 | 57.877857 | 0.31114 | 1.94594 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 |
| 1 | 10.829943 | 0.175592 | 1.294219 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 5.091079 | 0.805153 | 0.427715 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 3 | 2.247564 | 5.600044 | 0.362663 | 1.0 | 1.0 | 0.0 | 1.0 | 0.0 |
| 4 | 44.190936 | 0.566486 | 2.222767 | 1.0 | 1.0 | 0.0 | 1.0 | 0.0 |


**Steps:**

- **1.** What is the distribution of our target variable? Can we say we're dealing with an imbalanced dataset?
- **2.** Train a LogisticRegression.
- **3.** Evaluate your model. Take class importance into consideration and evaluate the model by selecting the correct metric.
- **4.** Run **Oversample** in order to balance our target variable and repeat the steps above, now with balanced data. Does it improve the performance of our model? 
- **5.** Now, run **Undersample** in order to balance our target variable and repeat the steps above (1-3), now with balanced data. Does it improve the performance of our model?
- **6.** Finally, run **SMOTE** in order to balance our target variable and repeat the steps above (1-3), now with balanced data. Does it improve the performance of our model? 

In [92]:
df.isnull().sum()

distance_from_home                0
distance_from_last_transaction    0
ratio_to_median_purchase_price    0
repeat_retailer                   0
used_chip                         0
used_pin_number                   0
online_order                      0
fraud                             0
dtype: int64

- **1.** What is the distribution of our target variable? Can we say we're dealing with an imbalanced dataset?

In [93]:
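# Yes, the dataset is imbalanced: fraud is a small minority of the rows.
df["fraud"].value_counts(normalize=True)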
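
A quick way to quantify the imbalance is to look at class proportions. The sketch below uses the legitimate count that appears in this notebook's outputs (912,597) and assumes a total of 1,000,000 rows. That total is inferred from the 750,000-row training frame and the 250,000-row test support, not stated anywhere, so the fraud count is an assumption derived from it.

```python
import pandas as pd

# Legit count comes from the notebook output; the fraud count is an assumption
# derived as 1,000,000 - 912,597 (the total row count is inferred, not stated).
counts = pd.Series({0.0: 912_597, 1.0: 87_403}, name="fraud")

proportions = counts / counts.sum()
imbalance_ratio = counts.max() / counts.min()

print(proportions.round(4))
print(f"majority/minority ratio: {imbalance_ratio:.1f}")
```

On the real data, the same proportions come from `df["fraud"].value_counts(normalize=True)`.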

- **2.** Train a LogisticRegression.


In [94]:
features = df.drop(columns = ["fraud"])
target = df["fraud"]

X_train, X_test, y_train, y_test = train_test_split(features, target)
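
The split above sets neither `random_state` nor `stratify`, so the class ratio in the test set can drift between runs. A minimal sketch of a reproducible, stratified split follows. The toy labels (910 legit, 90 fraud) and the seed are made up for illustration and are not the lab data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy labels (hypothetical): 9% positives, roughly like the fraud data.
y = np.array([0] * 910 + [1] * 90)
X = np.arange(len(y)).reshape(-1, 1)

# stratify=y keeps the class ratio (almost) identical in both splits;
# random_state makes the split reproducible.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

print(f"fraud share in train: {y_tr.mean():.3f}")
print(f"fraud share in test:  {y_te.mean():.3f}")
```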

In [95]:
scaler = StandardScaler()
scaler.fit(X_train)

X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

In [96]:
log_reg = LogisticRegression()

In [97]:
log_reg.fit(X_train_scaled, y_train)

In [98]:
log_reg.score(X_test_scaled, y_test)

0.95902
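
To put this accuracy in context, it helps to compare it with a trivial baseline. The sketch below builds a hypothetical model that labels every transaction as legit, using the class supports from the classification report for this split (228,324 legit, 21,676 fraud).

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Class supports taken from the classification report for this test split.
y_true = np.array([0] * 228_324 + [1] * 21_676)

# Hypothetical baseline: label every transaction as legit.
y_pred = np.zeros_like(y_true)

baseline_accuracy = accuracy_score(y_true, y_pred)
baseline_fraud_recall = recall_score(y_true, y_pred, pos_label=1)

print(f"accuracy: {baseline_accuracy:.3f}")
print(f"fraud recall: {baseline_fraud_recall:.3f}")
```

The baseline already reaches about 91% accuracy while catching no fraud at all, which is why accuracy alone says little on imbalanced data.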

In [99]:
pred = log_reg.predict(X_test_scaled)
print(classification_report(y_pred = pred, y_true = y_test))

              precision    recall  f1-score   support

         0.0       0.96      0.99      0.98    228324
         1.0       0.89      0.60      0.72     21676

    accuracy                           0.96    250000
   macro avg       0.93      0.80      0.85    250000
weighted avg       0.96      0.96      0.96    250000
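
The averages in the report can be recomputed by hand to see why the weighted average hides the weak fraud recall. The sketch below uses the rounded per-class values from the report above, so the results only approximate the printed figures.

```python
# Per-class values copied from the report above (rounded to two decimals).
precision = {0.0: 0.96, 1.0: 0.89}
recall = {0.0: 0.99, 1.0: 0.60}
support = {0.0: 228_324, 1.0: 21_676}
total = sum(support.values())

# F1 is the harmonic mean of precision and recall.
f1_fraud = 2 * precision[1.0] * recall[1.0] / (precision[1.0] + recall[1.0])

# Macro average: every class counts equally.
macro_recall = sum(recall.values()) / len(recall)

# Weighted average: every class counts in proportion to its support.
weighted_recall = sum(recall[c] * support[c] for c in recall) / total

print(f"fraud f1: {f1_fraud:.2f}")
print(f"macro recall: {macro_recall:.3f}")
print(f"weighted recall: {weighted_recall:.3f}")
```

Because the legit class makes up over 91% of the support, the weighted recall stays near 0.96 even though four in ten frauds are missed.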



- **3.** Evaluate your model. Take class importance into consideration and evaluate the model by selecting the correct metric.


In [100]:
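# The 96% weighted average mostly reflects the majority class (legit transactions).
# Fraud is the class that matters, so judge the model by recall on class 1.0, which is only 0.60.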

- **4.** Run **Oversample** in order to balance our target variable and repeat the steps above, now with balanced data. 
Does it improve the performance of our model? 


In [101]:
X_train, X_test, y_train, y_test = train_test_split(features, target)

In [102]:
scaler = StandardScaler()
scaler.fit(X_train)

X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

In [103]:
train = pd.DataFrame(X_train_scaled, columns = X_train.columns)


In [132]:
train["fraud"] = y_train.values

In [139]:
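# Note: this splits the full, unscaled `df`, not the scaled `train` frame built above,
# so the resampled data also contains rows that ended up in X_test.
no_fraud = df[df['fraud'] == 0]
fraud = df[df['fraud'] == 1]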

In [140]:
fraud = fraud.sample(len(no_fraud), replace=True)

In [141]:
fraud.shape

(912597, 8)

In [142]:
# Note: this overwrites the original `df` with the balanced data.
df = pd.concat([no_fraud, fraud], axis=0)

In [144]:
df = df.sample(frac=1)
df['fraud'].value_counts()

0.0    912597
1.0    912597
Name: fraud, dtype: int64

In [145]:
train_over = pd.concat([fraud, no_fraud])
train_over

| | distance_from_home | distance_from_last_transaction | ratio_to_median_purchase_price | repeat_retailer | used_chip | used_pin_number | online_order | fraud |
|---|---|---|---|---|---|---|---|---|
| 983518 | 8.414815 | 0.327770 | 11.225472 | 1.0 | 1.0 | 0.0 | 1.0 | 1.0 |
| 87540 | 3.141776 | 0.721107 | 12.722610 | 1.0 | 1.0 | 0.0 | 1.0 | 1.0 |
| 546706 | 180.452270 | 25.394873 | 1.736016 | 1.0 | 0.0 | 0.0 | 1.0 | 1.0 |
| 511518 | 140.981594 | 0.809311 | 0.708707 | 1.0 | 0.0 | 0.0 | 1.0 | 1.0 |
| 927462 | 121.244951 | 0.045552 | 0.166488 | 1.0 | 0.0 | 0.0 | 1.0 | 1.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 772735 | 1.077506 | 1.293786 | 2.638483 | 0.0 | 1.0 | 1.0 | 1.0 | 0.0 |
| 563691 | 0.392331 | 1.038829 | 0.498031 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 285504 | 6.482070 | 0.268610 | 0.357279 | 1.0 | 1.0 | 0.0 | 1.0 | 0.0 |
| 506584 | 16.239535 | 2.174562 | 0.349215 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 |


In [110]:
X_train_over = train_over.drop(columns = ["fraud"])
y_train_over = train_over["fraud"]

In [146]:
log_reg = LogisticRegression()
log_reg.fit(X_train_over, y_train_over)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


In [147]:
pred = log_reg.predict(X_test_scaled)
print(classification_report(y_pred = pred, y_true = y_test))



              precision    recall  f1-score   support

         0.0       0.93      1.00      0.96    228252
         1.0       0.85      0.17      0.28     21748

    accuracy                           0.93    250000
   macro avg       0.89      0.58      0.62    250000
weighted avg       0.92      0.93      0.90    250000
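
For comparison, here is a minimal sketch of random oversampling that resamples only the training rows, after scaling, and leaves the test set untouched. It runs on synthetic data from `make_classification` (an assumption standing in for the fraud data), and the column names `f0`..`f6`, sample size and seeds are made up for illustration. It uses `sklearn.utils.resample`, which the notebook imports.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.utils import resample

# Synthetic stand-in for the fraud data (assumption): about 9% positives.
X, y = make_classification(n_samples=5000, n_features=7, weights=[0.91], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

# Fit the scaler on the training split only, then reuse it for the test split.
scaler = StandardScaler().fit(X_train)
train = pd.DataFrame(scaler.transform(X_train), columns=[f"f{i}" for i in range(7)])
train["fraud"] = y_train
X_test_scaled = scaler.transform(X_test)

# Oversample the minority class within the training frame only.
no_fraud = train[train["fraud"] == 0]
fraud = train[train["fraud"] == 1]
fraud_over = resample(fraud, replace=True, n_samples=len(no_fraud), random_state=0)
train_over = pd.concat([no_fraud, fraud_over]).sample(frac=1, random_state=0)

X_train_over = train_over.drop(columns=["fraud"]).to_numpy()
y_train_over = train_over["fraud"].to_numpy()

log_reg = LogisticRegression(max_iter=1000).fit(X_train_over, y_train_over)
fraud_recall = recall_score(y_test, log_reg.predict(X_test_scaled))
print(f"fraud recall on the untouched test set: {fraud_recall:.2f}")
```

Keeping the resampling inside the training split means duplicated fraud rows cannot also appear in the test set, and training and test features share one scale.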



- **5.** Now, run **Undersample** in order to balance our target variable and repeat the steps above (1-3), now with balanced data. Does it improve the performance of our model?

In [113]:
train

| | distance_from_home | distance_from_last_transaction | ratio_to_median_purchase_price | repeat_retailer | used_chip | used_pin_number | online_order | fraud |
|---|---|---|---|---|---|---|---|---|
| 0 | -0.372036 | -0.161967 | -0.085381 | 0.366436 | -0.733821 | -0.334621 | -1.364489 | 0.0 |
| 1 | -0.211044 | -0.221675 | -0.491270 | 0.366436 | -0.733821 | -0.334621 | 0.732875 | 0.0 |
| 2 | -0.331463 | -0.218280 | 1.413856 | 0.366436 | -0.733821 | -0.334621 | 0.732875 | 1.0 |
| 3 | -0.316364 | 0.110787 | -0.396042 | 0.366436 | -0.733821 | -0.334621 | -1.364489 | 0.0 |
| 4 | -0.148284 | -0.058334 | -0.538089 | 0.366436 | -0.733821 | -0.334621 | 0.732875 | 0.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 749995 | -0.338694 | -0.128589 | 0.799190 | 0.366436 | 1.362730 | -0.334621 | -1.364489 | 0.0 |
| 749996 | 0.711706 | -0.206936 | -0.571325 | 0.366436 | 1.362730 | -0.334621 | 0.732875 | 0.0 |
| 749997 | 0.532119 | -0.187876 | -0.567573 | 0.366436 | -0.733821 | 2.988458 | 0.732875 | 0.0 |
| 749998 | -0.361923 | -0.103134 | 0.921506 | 0.366436 | -0.733821 | -0.334621 | 0.732875 | 1.0 |


In [114]:
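# Run top to bottom, `fraud` is the oversampled frame from step 4, so len(fraud) == len(no_fraud) and no rows are dropped.
no_fraud_undersampled = no_fraud.sample(len(fraud), random_state=0)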

In [115]:
df_undersampled = pd.concat([no_fraud_undersampled, fraud], axis=0)

In [121]:
df_undersampled = df_undersampled.sample(frac=1, random_state=0)

In [122]:
X_undersampled = df_undersampled.drop(columns=['fraud'])
y_undersampled = df_undersampled['fraud']

In [123]:
scaler = StandardScaler()
scaler.fit(X_train)  # Fit scaler on the original training set
X_train_under_scaled = scaler.transform(X_undersampled)
X_test_scaled = scaler.transform(X_test)  # Reusing X_test from the original split


In [124]:
log_reg = LogisticRegression()
log_reg.fit(X_train_under_scaled, y_undersampled)

In [127]:
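# Note: log_reg was already trained above on features scaled with the X_train scaler.
# This refits a new scaler on the already-scaled undersampled features and applies it
# to the raw X_test, so the test features no longer share the training features' scale.
scaler = StandardScaler()
scaler.fit(X_train_under_scaled)
X_train_under_scaled = scaler.transform(X_train_under_scaled)
X_test_scaled = scaler.transform(X_test)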



In [130]:
pred = log_reg.predict(X_test_scaled)


In [131]:
print(classification_report(y_pred=pred, y_true=y_test))


              precision    recall  f1-score   support

         0.0       1.00      0.01      0.02    228252
         1.0       0.09      1.00      0.16     21748

    accuracy                           0.10    250000
   macro avg       0.54      0.51      0.09    250000
weighted avg       0.92      0.10      0.03    250000
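
An accuracy of 0.10 with fraud recall of 1.00 means nearly every transaction is predicted as fraud, which is what a scale mismatch between training and test features tends to produce. A minimal sketch of undersampling that avoids this follows. It uses synthetic data from `make_classification` (an assumption standing in for the fraud data), undersamples only the training split by index, and keeps the scaler and model together in one pipeline so both splits are scaled the same way.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the fraud data (assumption): about 9% positives.
X, y = make_classification(n_samples=5000, n_features=7, weights=[0.91], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

# Keep every fraud row and an equally sized random subset of legit rows,
# drawn without replacement from the training split only.
rng = np.random.default_rng(0)
fraud_idx = np.flatnonzero(y_train == 1)
legit_idx = np.flatnonzero(y_train == 0)
legit_keep = rng.choice(legit_idx, size=len(fraud_idx), replace=False)
keep = rng.permutation(np.concatenate([fraud_idx, legit_keep]))

X_train_under, y_train_under = X_train[keep], y_train[keep]

# The pipeline fits the scaler on the training data and reuses it at predict time.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train_under, y_train_under)
pred = model.predict(X_test)
print(classification_report(y_true=y_test, y_pred=pred))
```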



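- **6.** Finally, run **SMOTE** in order to balance our target variable and repeat the steps above (1-3), now with balanced data. Does it improve the performance of our model?

In [149]: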
from imblearn.over_sampling import SMOTE

In [150]:
sm = SMOTE(random_state=1, sampling_strategy=1.0)

In [151]:
X_train_sm, y_train_sm = sm.fit_resample(X_train_scaled, y_train)

In [152]:
log_reg = LogisticRegression(max_iter=1000)
log_reg.fit(X_train_sm, y_train_sm)

In [153]:
pred = log_reg.predict(X_test_scaled)
print(classification_report(y_pred = pred, y_true = y_test))

              precision    recall  f1-score   support

         0.0       1.00      0.14      0.24    228252
         1.0       0.10      1.00      0.18     21748

    accuracy                           0.21    250000
   macro avg       0.55      0.57      0.21    250000
weighted avg       0.92      0.21      0.24    250000
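
To make clear what SMOTE adds beyond duplicating rows, here is a simplified NumPy sketch of its core idea: each synthetic point is placed on the line segment between a minority sample and one of its k nearest minority neighbours. This is an illustration only and not imblearn's implementation. The function name `smote_like_samples`, the toy Gaussian data and the seeds are all hypothetical.

```python
import numpy as np

def smote_like_samples(X_min, n_new, k=5, seed=0):
    """Create n_new synthetic points by interpolating between minority rows."""
    rng = np.random.default_rng(seed)
    # Pairwise Euclidean distances between minority rows.
    diffs = X_min[:, None, :] - X_min[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=2))
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]

    base = rng.integers(0, len(X_min), size=n_new)             # random minority rows
    picked = neighbours[base, rng.integers(0, k, size=n_new)]  # one of their k neighbours
    gap = rng.random((n_new, 1))                               # position along the segment
    return X_min[base] + gap * (X_min[picked] - X_min[base])

# Toy data (hypothetical): 200 majority rows and 20 minority rows.
rng = np.random.default_rng(1)
X_majority = rng.normal(0.0, 1.0, size=(200, 2))
X_minority = rng.normal(3.0, 0.5, size=(20, 2))

X_new = smote_like_samples(X_minority, n_new=len(X_majority) - len(X_minority))
X_minority_balanced = np.vstack([X_minority, X_new])
print(X_minority_balanced.shape)
```

As with the other resampling methods, synthetic rows belong in the training split only, and the test features need to be transformed with the same scaler that produced the training features.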

