# LAB | Imbalanced

**Load the data**

In this challenge, we will work with the Credit Card Fraud dataset.

https://raw.githubusercontent.com/data-bootcamp-v4/data/main/card_transdata.csv

Metadata

- **distance_from_home:** Distance from home to where the transaction happened.
- **distance_from_last_transaction:** Distance from where the last transaction happened.
- **ratio_to_median_purchase_price:** Ratio of the transaction's purchase price to the median purchase price.
- **repeat_retailer:** Whether the transaction happened at the same retailer.
- **used_chip:** Whether the transaction was made through the chip (credit card).
- **used_pin_number:** Whether the transaction was made using a PIN number.
- **online_order:** Whether the transaction is an online order.
- **fraud:** Whether the transaction is fraudulent. **0=legit** - **1=fraud**


In [33]:
#Libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

In [34]:
fraud = pd.read_csv("https://raw.githubusercontent.com/data-bootcamp-v4/data/main/card_transdata.csv")
fraud.head()

Unnamed: 0,distance_from_home,distance_from_last_transaction,ratio_to_median_purchase_price,repeat_retailer,used_chip,used_pin_number,online_order,fraud
0,57.877857,0.31114,1.94594,1.0,1.0,0.0,0.0,0.0
1,10.829943,0.175592,1.294219,1.0,0.0,0.0,0.0,0.0
2,5.091079,0.805153,0.427715,1.0,0.0,0.0,1.0,0.0
3,2.247564,5.600044,0.362663,1.0,1.0,0.0,1.0,0.0
4,44.190936,0.566486,2.222767,1.0,1.0,0.0,1.0,0.0


**Steps:**

- **1.** What is the distribution of our target variable? Can we say we're dealing with an imbalanced dataset?
- **2.** Train a LogisticRegression.
- **3.** Evaluate your model. Take class importance into consideration and evaluate the model by selecting the correct metric.
- **4.** Run **Oversample** in order to balance our target variable and repeat the steps above (1-3), now with balanced data. Does it improve the performance of our model?
- **5.** Now, run **Undersample** in order to balance our target variable and repeat the steps above (1-3), now with balanced data. Does it improve the performance of our model?
- **6.** Finally, run **SMOTE** in order to balance our target variable and repeat the steps above (1-3), now with balanced data. Does it improve the performance of our model?
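
Step 1 can be answered with a quick class count before any modelling. The sketch below is only an illustration: `toy_target` is a small hypothetical Series standing in for `fraud["fraud"]`, and `class_distribution` is a helper name introduced here, not something from the lab.

```python
import pandas as pd


def class_distribution(target: pd.Series) -> pd.DataFrame:
    """Return absolute counts and relative shares for each class label."""
    counts = target.value_counts()
    shares = target.value_counts(normalize=True)
    return pd.DataFrame({"count": counts, "share": shares})


# Hypothetical toy target: 9 legitimate transactions and 1 fraudulent one.
toy_target = pd.Series([0.0] * 9 + [1.0], name="fraud")

distribution = class_distribution(toy_target)
print(distribution)
```

On the real data the same helper could be called as `class_distribution(fraud["fraud"])`; a share for class 1 far below 0.5 indicates an imbalanced target.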

In [35]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

from sklearn.metrics import precision_score, recall_score, classification_report, confusion_matrix, f1_score
from sklearn.preprocessing import StandardScaler

from sklearn.utils import resample

In [36]:
features = fraud.drop(columns = ["fraud"])
target = fraud["fraud"]

X_train, X_test, y_train, y_test = train_test_split(features, target, test_size=0.3, random_state=42)
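
With an imbalanced target, it can help to keep the class proportions identical in the train and test sets. A minimal sketch of a stratified split follows, assuming synthetic data (`X_toy`, `y_toy` are hypothetical and not part of the lab); the only change relative to the split above is the `stratify` argument.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical synthetic data: 1000 rows, 10% positive class.
rng = np.random.default_rng(42)
X_toy = rng.normal(size=(1000, 3))
y_toy = np.array([1] * 100 + [0] * 900)

# stratify=y_toy asks the splitter to preserve the class ratio in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=42, stratify=y_toy
)

train_positive_share = y_tr.mean()
test_positive_share = y_te.mean()
print(train_positive_share, test_positive_share)
```

In the lab's split this would correspond to passing `stratify=target`; the reported results above were produced without it.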

In [37]:
scaler = StandardScaler()
scaler.fit(X_train)

X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

In [38]:
log_reg = LogisticRegression()

In [39]:
log_reg.fit(X_train_scaled, y_train)

In [40]:
log_reg.score(X_test_scaled, y_test)

0.9586

In [41]:
pred = log_reg.predict(X_test_scaled)
print(classification_report(y_pred = pred, y_true = y_test))

              precision    recall  f1-score   support

         0.0       0.96      0.99      0.98    273871
         1.0       0.89      0.60      0.72     26129

    accuracy                           0.96    300000
   macro avg       0.93      0.80      0.85    300000
weighted avg       0.96      0.96      0.95    300000
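
To see why accuracy alone is a weak metric here, the sketch below uses only the class supports printed in the report above and computes what a trivial classifier that always predicts "legit" would score. The variable names are introduced for illustration.

```python
# Class supports taken from the test-set classification report above.
n_legit = 273871
n_fraud = 26129
n_total = n_legit + n_fraud

# A trivial "model" that labels every transaction as legitimate (class 0)
# gets every legit row right and every fraud row wrong.
baseline_accuracy = n_legit / n_total
baseline_fraud_recall = 0 / n_fraud

print(f"fraud share in test set: {n_fraud / n_total:.4f}")
print(f"always-legit accuracy:   {baseline_accuracy:.4f}")
print(f"always-legit recall (1): {baseline_fraud_recall:.2f}")
```

The always-legit baseline already reaches about 0.91 accuracy, so the 0.96 above is less impressive than it looks. The recall of 0.60 on class 1 is the more informative number: roughly 40% of fraudulent transactions are missed.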



Oversampling

In [42]:
train = pd.DataFrame(X_train_scaled, columns = X_train.columns)

In [43]:
train["fraud"] = y_train.values

In [44]:
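# Split the scaled training set by class.
# Note: this rebinds `fraud`, which previously held the full dataset loaded above.
fraud = train[train["fraud"] == 1]
no_fraud = train[train["fraud"] == 0]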

In [45]:
# Oversample the minority (fraud) class with replacement until it matches the majority class size.
fraud_oversampled = resample(fraud, 
                             replace=True, 
                             n_samples = len(no_fraud),
                             random_state=42)

In [46]:
train_over = pd.concat([fraud_oversampled, no_fraud])
train_over

Unnamed: 0,distance_from_home,distance_from_last_transaction,ratio_to_median_purchase_price,repeat_retailer,used_chip,used_pin_number,online_order,fraud
644343,-0.381500,-0.158206,1.540054,-2.726469,1.361740,-0.33467,0.733028,1.0
180527,1.144372,-0.161092,-0.105160,0.366775,-0.734355,-0.33467,0.733028,1.0
9755,-0.395621,-0.183209,0.948450,-2.726469,-0.734355,-0.33467,0.733028,1.0
435438,1.416124,-0.174146,0.270254,0.366775,-0.734355,-0.33467,0.733028,1.0
621087,0.452849,-0.184210,4.652813,0.366775,-0.734355,-0.33467,0.733028,1.0
...,...,...,...,...,...,...,...,...
699995,-0.400716,-0.057613,-0.605882,-2.726469,-0.734355,-0.33467,-1.364205,0.0
699996,1.342313,-0.180689,-0.108000,0.366775,-0.734355,-0.33467,-1.364205,0.0
699997,-0.093950,-0.186655,-0.325298,0.366775,-0.734355,-0.33467,-1.364205,0.0
699998,-0.255761,-0.180579,-0.232822,0.366775,-0.734355,-0.33467,0.733028,0.0
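
The effect of random oversampling is easier to inspect on a tiny example. The sketch below uses a hypothetical toy frame (`toy`, with a made-up `amount` column) and the same `resample` call pattern as above; it shows that the classes end up balanced and that the added minority rows are duplicates of existing ones.

```python
import pandas as pd
from sklearn.utils import resample

# Hypothetical toy data: 8 legitimate rows and 2 fraudulent rows.
toy = pd.DataFrame({
    "amount": [12.5, 8.0, 3.2, 40.1, 7.7, 15.0, 22.3, 9.9, 310.0, 455.5],
    "fraud": [0] * 8 + [1] * 2,
})

minority = toy[toy["fraud"] == 1]
majority = toy[toy["fraud"] == 0]

# Draw minority rows with replacement until both classes have the same size.
minority_upsampled = resample(
    minority, replace=True, n_samples=len(majority), random_state=42
)
balanced = pd.concat([majority, minority_upsampled])

class_counts = balanced["fraud"].value_counts()
# The upsampled rows can only come from the 2 original minority rows.
n_unique_minority_rows = minority_upsampled.index.nunique()
print(class_counts)
print("distinct minority rows:", n_unique_minority_rows)
```

Because the new rows are copies, resampling is applied to the training data only, as in the lab; the test set keeps its original distribution.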


In [47]:
X_train_over = train_over.drop(columns = ["fraud"])
y_train_over = train_over["fraud"]

In [48]:
log_reg = LogisticRegression()
log_reg.fit(X_train_over, y_train_over)

In [49]:
pred = log_reg.predict(X_test_scaled)
print(classification_report(y_pred = pred, y_true = y_test))



              precision    recall  f1-score   support

         0.0       1.00      0.93      0.96    273871
         1.0       0.57      0.95      0.72     26129

    accuracy                           0.93    300000
   macro avg       0.78      0.94      0.84    300000
weighted avg       0.96      0.93      0.94    300000



Undersampling

In [50]:
# Undersample the majority (legit) class without replacement down to the minority class size.
no_fraud_undersampled = resample(no_fraud, 
                                 replace=False, 
                                 n_samples = len(fraud),
                                 random_state=42)
no_fraud_undersampled

Unnamed: 0,distance_from_home,distance_from_last_transaction,ratio_to_median_purchase_price,repeat_retailer,used_chip,used_pin_number,online_order,fraud
501870,-0.263187,1.279110,-0.192919,0.366775,-0.734355,-0.334670,0.733028,0.0
582719,-0.377711,-0.172357,0.394331,0.366775,-0.734355,-0.334670,0.733028,0.0
328461,-0.356303,-0.183527,-0.442987,0.366775,-0.734355,-0.334670,0.733028,0.0
285918,-0.305413,-0.099293,-0.624200,0.366775,1.361740,2.988015,0.733028,0.0
75247,-0.260221,0.061364,-0.580361,0.366775,-0.734355,-0.334670,-1.364205,0.0
...,...,...,...,...,...,...,...,...
463249,-0.392713,-0.182315,-0.080292,-2.726469,1.361740,-0.334670,0.733028,0.0
501855,-0.122599,1.086861,0.067886,0.366775,-0.734355,-0.334670,-1.364205,0.0
133485,-0.160754,-0.141545,-0.318958,0.366775,-0.734355,-0.334670,0.733028,0.0
32804,13.393563,-0.150392,-0.462654,0.366775,-0.734355,-0.334670,-1.364205,0.0


In [51]:
train_under = pd.concat([no_fraud_undersampled, fraud])
train_under

Unnamed: 0,distance_from_home,distance_from_last_transaction,ratio_to_median_purchase_price,repeat_retailer,used_chip,used_pin_number,online_order,fraud
501870,-0.263187,1.279110,-0.192919,0.366775,-0.734355,-0.334670,0.733028,0.0
582719,-0.377711,-0.172357,0.394331,0.366775,-0.734355,-0.334670,0.733028,0.0
328461,-0.356303,-0.183527,-0.442987,0.366775,-0.734355,-0.334670,0.733028,0.0
285918,-0.305413,-0.099293,-0.624200,0.366775,1.361740,2.988015,0.733028,0.0
75247,-0.260221,0.061364,-0.580361,0.366775,-0.734355,-0.334670,-1.364205,0.0
...,...,...,...,...,...,...,...,...
699964,1.905345,-0.100542,-0.069485,0.366775,-0.734355,-0.334670,0.733028,1.0
699972,-0.382848,-0.111598,1.795254,-2.726469,-0.734355,-0.334670,-1.364205,1.0
699984,0.195593,-0.182704,0.963595,0.366775,-0.734355,-0.334670,0.733028,1.0
699987,-0.191765,-0.182148,0.833886,0.366775,-0.734355,-0.334670,0.733028,1.0


In [52]:
X_train_under = train_under.drop(columns = ["fraud"])
y_train_under = train_under["fraud"]

In [53]:
log_reg = LogisticRegression()
log_reg.fit(X_train_under, y_train_under)

In [54]:
pred = log_reg.predict(X_test_scaled)
print(classification_report(y_pred = pred, y_true = y_test))



              precision    recall  f1-score   support

         0.0       1.00      0.93      0.96    273871
         1.0       0.57      0.95      0.71     26129

    accuracy                           0.93    300000
   macro avg       0.78      0.94      0.84    300000
weighted avg       0.96      0.93      0.94    300000



SMOTE

In [55]:
!pip install imbalanced-learn



In [56]:
from imblearn.over_sampling import SMOTE

In [57]:
sm = SMOTE(random_state = 42,sampling_strategy=1.0)

In [58]:
X_train_sm,y_train_sm = sm.fit_resample(X_train_scaled,y_train)
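
Unlike random oversampling, SMOTE creates new minority rows by interpolating between an existing minority point and one of its nearest minority neighbours. The sketch below is a simplified NumPy illustration of that single interpolation step, not imblearn's actual implementation; `smote_like_sample` and the toy `minority` array are hypothetical names introduced here.

```python
import numpy as np


def smote_like_sample(minority: np.ndarray, k: int, rng: np.random.Generator) -> np.ndarray:
    """Create one synthetic point between a random minority row and one of its k nearest minority neighbours."""
    i = rng.integers(len(minority))
    distances = np.linalg.norm(minority - minority[i], axis=1)
    # Position 0 of the sorted order is the point itself (distance 0), so skip it.
    neighbour_ids = np.argsort(distances)[1:k + 1]
    j = rng.choice(neighbour_ids)
    lam = rng.random()  # interpolation weight in [0, 1)
    return minority[i] + lam * (minority[j] - minority[i])


# Hypothetical minority-class points in a 2-D feature space.
minority = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [2.0, 2.0]])
rng = np.random.default_rng(0)

synthetic = smote_like_sample(minority, k=2, rng=rng)
print(synthetic)
```

Each synthetic point lies on the line segment between two real minority points, so it is new data rather than an exact duplicate.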

In [59]:
log_reg = LogisticRegression(max_iter=1000)
log_reg.fit(X_train_sm, y_train_sm)

In [60]:
pred = log_reg.predict(X_test_scaled)
print(classification_report(y_pred = pred, y_true = y_test))

              precision    recall  f1-score   support

         0.0       0.99      0.93      0.96    273871
         1.0       0.57      0.95      0.72     26129

    accuracy                           0.93    300000
   macro avg       0.78      0.94      0.84    300000
weighted avg       0.96      0.93      0.94    300000

