# LAB | Imbalanced

**Load the data**

In this challenge, we will work with a credit card fraud dataset.

https://raw.githubusercontent.com/data-bootcamp-v4/data/main/card_transdata.csv

Metadata

- **distance_from_home:** Distance from home to where the transaction took place.
- **distance_from_last_transaction:** Distance from where the previous transaction took place.
- **ratio_to_median_purchase_price:** Ratio of the transaction's purchase price to the median purchase price.
- **repeat_retailer:** Whether the transaction happened at the same retailer as before.
- **used_chip:** Whether the transaction was made with the card's chip.
- **used_pin_number:** Whether the transaction was made using a PIN.
- **online_order:** Whether the transaction was an online order.
- **fraud:** Whether the transaction is fraudulent. **0=legit** - **1=fraud**


In [1]:
#Libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

In [2]:
fraud = pd.read_csv("https://raw.githubusercontent.com/data-bootcamp-v4/data/main/card_transdata.csv")
fraud.head()

| | distance_from_home | distance_from_last_transaction | ratio_to_median_purchase_price | repeat_retailer | used_chip | used_pin_number | online_order | fraud |
|---|---|---|---|---|---|---|---|---|
| 0 | 57.877857 | 0.31114 | 1.94594 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 |
| 1 | 10.829943 | 0.175592 | 1.294219 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 5.091079 | 0.805153 | 0.427715 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 3 | 2.247564 | 5.600044 | 0.362663 | 1.0 | 1.0 | 0.0 | 1.0 | 0.0 |
| 4 | 44.190936 | 0.566486 | 2.222767 | 1.0 | 1.0 | 0.0 | 1.0 | 0.0 |


**Steps:**

- **1.** What is the distribution of our target variable? Can we say we're dealing with an imbalanced dataset?
- **2.** Train a LogisticRegression.
- **3.** Evaluate your model. Take class importance into consideration, and evaluate the model by selecting the correct metric.
- **4.** Run **Oversample** in order to balance our target variable and repeat the steps above, now with balanced data. Does it improve the performance of our model?
- **5.** Now, run **Undersample** in order to balance our target variable and repeat the steps above (1-3), now with balanced data. Does it improve the performance of our model?
- **6.** Finally, run **SMOTE** in order to balance our target variable and repeat the steps above (1-3), now with balanced data. Does it improve the performance of our model?
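To make steps 4 and 5 concrete, here is a minimal NumPy sketch of what random oversampling and random undersampling do to the class counts. The toy labels and the dummy feature are hypothetical, and this is not how `imblearn` implements its samplers. SMOTE (step 6) differs in that it creates synthetic minority rows by interpolating between neighbouring minority samples instead of duplicating existing ones.

```python
import numpy as np

rng = np.random.default_rng(13)

# Toy labels (hypothetical): 90 legit (0) and 10 fraud (1)
y = np.array([0] * 90 + [1] * 10)
X = np.arange(len(y)).reshape(-1, 1)  # one dummy feature per row

majority_idx = np.flatnonzero(y == 0)
minority_idx = np.flatnonzero(y == 1)

# Oversampling: draw minority rows with replacement until both classes match
extra = rng.choice(minority_idx, size=len(majority_idx) - len(minority_idx), replace=True)
over_idx = np.concatenate([majority_idx, minority_idx, extra])
X_over, y_over = X[over_idx], y[over_idx]

# Undersampling: keep a random subset of majority rows, same size as the minority
kept = rng.choice(majority_idx, size=len(minority_idx), replace=False)
under_idx = np.concatenate([kept, minority_idx])
X_under, y_under = X[under_idx], y[under_idx]

print(np.bincount(y_over))   # [90 90]
print(np.bincount(y_under))  # [10 10]
```

In both cases only the training data should be resampled; the test set keeps its original class distribution so that the evaluation reflects real conditions.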

In [3]:
fraud['fraud'].value_counts()

| fraud | count |
|---|---|
| 0.0 | 912597 |
| 1.0 | 87403 |


Legitimate transactions (912,597) far outnumber fraudulent ones (87,403): fraud makes up about 8.7% of the 1,000,000 rows, so we are dealing with an imbalanced dataset.
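The size of the imbalance can be quantified from the counts alone. The sketch below rebuilds the counts as a small Series, so it runs without downloading the data; on the full DataFrame, `fraud['fraud'].value_counts(normalize=True)` gives the same proportions directly.

```python
import pandas as pd

# Class counts taken from the value_counts() output above
counts = pd.Series({0.0: 912597, 1.0: 87403}, name="count")

# Share of each class and the majority-to-minority ratio
proportions = counts / counts.sum()
imbalance_ratio = counts.loc[0.0] / counts.loc[1.0]

print(proportions.round(4))
print(f"legit : fraud ratio = {imbalance_ratio:.1f} : 1")
```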

In [4]:
fraud.isna().any()

| column | isna().any() |
|---|---|
| distance_from_home | False |
| distance_from_last_transaction | False |
| ratio_to_median_purchase_price | False |
| repeat_retailer | False |
| used_chip | False |
| used_pin_number | False |
| online_order | False |
| fraud | False |


In [5]:
features = fraud.drop(columns='fraud')
target = fraud['fraud']

In [6]:
X_train, X_test, Y_train, Y_test = train_test_split(features, target, test_size=0.2, random_state=13)

In [7]:
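from sklearn.linear_model import LogisticRegression

# class_weight='balanced' already reweights classes inversely to their frequency
lr = LogisticRegression(solver='newton-cholesky', class_weight='balanced', random_state=13)
lr.fit(X_train, Y_train)
# score() returns plain accuracy on the test set
lr.score(X_test, Y_test)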

0.935175
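Accuracy on its own is hard to interpret here. As a reference point, the sketch below computes what a hypothetical classifier that labels every transaction as legit would score on this test set, using the class sizes shown in the "support" column of the classification reports in this lab.

```python
# Test-set class sizes, taken from the "support" column of the reports in this lab
n_legit, n_fraud = 182421, 17579

# A hypothetical classifier that labels every transaction as legit (0)
accuracy_all_legit = n_legit / (n_legit + n_fraud)
fraud_recall_all_legit = 0 / n_fraud  # it never flags a single fraud

print(f"accuracy:     {accuracy_all_legit:.6f}")
print(f"fraud recall: {fraud_recall_all_legit:.2f}")
```

Such a model reaches roughly 91% accuracy while catching no fraud at all, which is why recall and precision on class 1.0 are better suited to this problem than accuracy.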

In [8]:
from imblearn.over_sampling import RandomOverSampler
ros = RandomOverSampler(random_state=13)
X_train_res, Y_train_res = ros.fit_resample(X_train, Y_train)

In [9]:
lr.fit(X_train_res, Y_train_res)
lr.score(X_test, Y_test)

0.93531

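Accuracy barely changes (from 0.935175 to 0.93531). The baseline model already used `class_weight='balanced'`, which compensates for the imbalance in a similar way, so this likely explains why oversampling adds so little.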

In [10]:
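from sklearn.metrics import classification_report, confusion_matrix
y_pred = lr.predict(X_test)
# Per-class precision, recall and F1 say more than accuracy on imbalanced data
print(classification_report(Y_test, y_pred))
# Rows are true classes, columns are predicted classes
print(confusion_matrix(Y_test, y_pred))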

              precision    recall  f1-score   support

         0.0       0.99      0.93      0.96    182421
         1.0       0.58      0.95      0.72     17579

    accuracy                           0.94    200000
   macro avg       0.79      0.94      0.84    200000
weighted avg       0.96      0.94      0.94    200000

[[170340  12081]
 [   857  16722]]
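The fraud-class figures in the report can be derived by hand from the confusion matrix, which makes clear what each metric measures. A short worked sketch, using the matrix printed above (rows are true classes, columns are predicted classes):

```python
import numpy as np

# Confusion matrix from the oversampling run: [[TN, FP], [FN, TP]]
cm = np.array([[170340, 12081],
               [857, 16722]])
tn, fp, fn, tp = cm.ravel()

precision = tp / (tp + fp)   # of the transactions flagged as fraud, how many were fraud
recall = tp / (tp + fn)      # of the actual frauds, how many were caught
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / cm.sum()

print(f"precision={precision:.4f} recall={recall:.4f} f1={f1:.4f} accuracy={accuracy:.5f}")
```

The model catches about 95% of frauds (857 missed), at the cost of 12,081 legitimate transactions being flagged.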


In [11]:
from imblearn.under_sampling import RandomUnderSampler
# Initialize RandomUnderSampler
rus = RandomUnderSampler(random_state=13)

# Apply undersampling to the training data
X_train_res, y_train_res = rus.fit_resample(X_train, Y_train)


In [12]:
lr.fit(X_train_res, y_train_res)

In [13]:
lr.score(X_test, Y_test)

0.935135

Accuracy (0.935135) is marginally lower than with oversampling (0.93531).

In [14]:
y_pred = lr.predict(X_test)
print(classification_report(Y_test, y_pred))
print(confusion_matrix(Y_test, y_pred))

              precision    recall  f1-score   support

         0.0       0.99      0.93      0.96    182421
         1.0       0.58      0.95      0.72     17579

    accuracy                           0.94    200000
   macro avg       0.79      0.94      0.84    200000
weighted avg       0.96      0.94      0.94    200000

[[170308  12113]
 [   860  16719]]


In [15]:
from imblearn.over_sampling import SMOTE
# Initialize SMOTE
smote = SMOTE(random_state=13)

# Apply SMOTE to the training data
X_train_res, y_train_res = smote.fit_resample(X_train, Y_train)

In [16]:
lr.fit(X_train_res, y_train_res)

In [17]:
lr.score(X_test, Y_test)

0.935515

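Highest accuracy so far (0.935515), but the three resampling methods differ only in the fourth decimal place. For fraud detection, the precision and recall of class 1.0 in the report below are more informative than accuracy.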

In [18]:
y_pred = lr.predict(X_test)
print(classification_report(Y_test, y_pred))
print(confusion_matrix(Y_test, y_pred))

              precision    recall  f1-score   support

         0.0       0.99      0.93      0.96    182421
         1.0       0.58      0.95      0.72     17579

    accuracy                           0.94    200000
   macro avg       0.79      0.94      0.84    200000
weighted avg       0.96      0.94      0.94    200000

[[170415  12006]
 [   891  16688]]
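Because the three classification reports round to identical values, the confusion matrices are the only place where the methods can be told apart. The sketch below recomputes unrounded metrics from the matrices printed in this lab; the dictionary keys are labels chosen here for illustration.

```python
# Confusion matrices copied from the three runs in this lab: [[TN, FP], [FN, TP]]
results = {
    "oversample": [[170340, 12081], [857, 16722]],
    "undersample": [[170308, 12113], [860, 16719]],
    "smote": [[170415, 12006], [891, 16688]],
}

summary = {}
for name, ((tn, fp), (fn, tp)) in results.items():
    summary[name] = {
        "accuracy": (tp + tn) / (tn + fp + fn + tp),
        "fraud_recall": tp / (tp + fn),
        "fraud_precision": tp / (tp + fp),
        "missed_frauds": fn,
    }

for name, m in summary.items():
    print(f"{name:12s} acc={m['accuracy']:.6f} recall={m['fraud_recall']:.4f} "
          f"precision={m['fraud_precision']:.4f} missed={m['missed_frauds']}")
```

On these numbers SMOTE has the highest accuracy and fraud precision, yet it also misses the most frauds (891, against 857 for random oversampling). If catching fraud is the priority, random oversampling comes out slightly ahead, although all three differences are small.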
